
Wray Buntine · Morten Fjeld ·

Truyen Tran · Minh-Triet Tran ·


Binh Huynh Thi Thanh ·
Takumi Miyoshi (Eds.)

Communications in Computer and Information Science 2350

Information and
Communication Technology
13th International Symposium, SOICT 2024
Danang, Vietnam, December 13–15, 2024
Proceedings, Part I
Communications
in Computer and Information Science 2350

Series Editors
Gang Li, School of Information Technology, Deakin University, Burwood, VIC, Australia
Joaquim Filipe, Polytechnic Institute of Setúbal, Setúbal, Portugal
Zhiwei Xu, Chinese Academy of Sciences, Beijing, China
Rationale
The CCIS series is devoted to the publication of proceedings of computer science con-
ferences. Its aim is to efficiently disseminate original research results in informatics
in printed and electronic form. While the focus is on publication of peer-reviewed full
papers presenting mature work, inclusion of reviewed short papers reporting on work in
progress is welcome, too. Besides globally relevant meetings with internationally repre-
sentative program committees guaranteeing a strict peer-reviewing and paper selection
process, conferences run by societies or of high regional or national relevance are also
considered for publication.
Topics
The topical scope of CCIS spans the entire spectrum of informatics ranging from foun-
dational topics in the theory of computing to information and communications science
and technology and a broad variety of interdisciplinary application fields.
Information for Volume Editors and Authors
Publication in CCIS is free of charge. No royalties are paid, however, we offer registered
conference participants temporary free access to the online version of the conference
proceedings on SpringerLink (http://link.springer.com) by means of an http referrer from
the conference website and/or a number of complimentary printed copies, as specified
in the official acceptance email of the event.
CCIS proceedings can be published in time for distribution at conferences or as post-
proceedings, and delivered in the form of printed books and/or electronically as USBs
and/or e-content licenses for accessing proceedings at SpringerLink. Furthermore, CCIS
proceedings are included in the CCIS electronic book series hosted in the SpringerLink
digital library at http://link.springer.com/bookseries/7899. Conferences publishing in
CCIS are allowed to use Online Conference Service (OCS) for managing the whole
proceedings lifecycle (from submission and reviewing to preparing for publication) free
of charge.
Publication process
The language of publication is exclusively English. Authors publishing in CCIS have
to sign the Springer CCIS copyright transfer form, however, they are free to use their
material published in CCIS for substantially changed, more elaborate subsequent publi-
cations elsewhere. For the preparation of the camera-ready papers/files, authors have to
strictly adhere to the Springer CCIS Authors’ Instructions and are strongly encouraged
to use the CCIS LaTeX style files or templates.
Abstracting/Indexing
CCIS is abstracted/indexed in DBLP, Google Scholar, EI-Compendex, Mathematical
Reviews, SCImago, Scopus. CCIS volumes are also submitted for the inclusion in ISI
Proceedings.
How to start
To start the evaluation of your proposal for inclusion in the CCIS series, please send an
e-mail to [email protected].
Wray Buntine · Morten Fjeld · Truyen Tran ·
Minh-Triet Tran · Binh Huynh Thi Thanh ·
Takumi Miyoshi
Editors

Information and
Communication Technology
13th International Symposium, SOICT 2024
Danang, Vietnam, December 13–15, 2024
Proceedings, Part I
Editors
Wray Buntine, VinUniversity, Hanoi, Vietnam
Morten Fjeld, University of Bergen, Bergen, Norway
Truyen Tran, Deakin University, Burwood, VIC, Australia
Minh-Triet Tran, University of Science - VNUHCM, Ho Chi Minh City, Vietnam
Binh Huynh Thi Thanh, Hanoi University of Science and Technology, Hanoi, Vietnam
Takumi Miyoshi, Shibaura Institute of Technology, Saitama, Japan

ISSN 1865-0929 (print), ISSN 1865-0937 (electronic)
Communications in Computer and Information Science
ISBN 978-981-96-4281-6 (print), ISBN 978-981-96-4282-3 (eBook)
https://doi.org/10.1007/978-981-96-4282-3

© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Singapore Pte Ltd. 2025

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors
or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

If disposing of this product, please recycle the paper.


Preface

The 13th International Symposium on Information and Communication Technology


(SOICT 2024) was held on December 13–15, 2024, in Danang City, Vietnam. SOICT
2024 was an international academic forum for researchers and graduate students to share
their latest research findings and to identify future challenges in computer science.
SOICT 2024 received papers from 24 countries and regions in six major areas of
research including Networking and Communication Technologies, AI Foundations and
Big Data, AI Applications, Multimedia Processing, Software Engineering, and Recent
Advances in Cyber Security, in addition to special sessions on Applied Operations
Research and Optimization, Generative AI, Human-Computer Interaction and Intelli-
gent Interactive Systems, and Lifelog and Multimedia Retrieval. The Program Com-
mittee followed a formal, standardized reviewing process utilizing single-blind reviews.
This process involved bidding, reviewing, and deliberating and led to selection of 88
papers for regular presentation and 68 papers for poster presentation and publication in
the proceedings. Each paper was evaluated based on an average of three reviews, ensur-
ing a comprehensive assessment. Additionally, each reviewer contributed by reviewing
an average of three papers, maintaining a balanced workload and high-quality feedback
throughout the review process.
It was our great honor to welcome world-class invited speakers: Timothy Baldwin
(MBZUAI, United Arab Emirates & University of Melbourne, Australia), Nitesh V.
Chawla (University of Notre Dame, USA), Uichin Lee (Korea Advanced Institute of
Science and Technology, South Korea), Yasuyuki Matsushita (Microsoft & Osaka Uni-
versity, Japan), Gopal Ramchurn (University of Southampton, UK), Du Tran (Google,
USA), and Minghui Zhou (Peking University, China).
We would like to thank the Program Committee members for their dedicated work in
reviewing papers, and all the track chairs, who are listed below, for actively monitoring
the review and deliberation process and for proposing decisions on papers.
In particular, we would like to thank all the Organizing Committee members, who
worked hard to ensure the best quality of the symposium.
We are grateful to Vingroup Innovation Foundation (VinIF) for financial support.
Finally, we hope that the SOICT 2024 conference provided an engaging and up-to-date
scientific program. We would like to thank all authors and participants for making SOICT
2024 a memorable and enjoyable academic event in Danang City, Vietnam.

Together, we made this SOICT 2024 Conference a successful event!

Binh Huynh Thi Thanh


Wray Buntine
Morten Fjeld
Takumi Miyoshi
Minh-Triet Tran
Truyen Tran
Ho Tu Bao
Cathal Gurrin
Ichiro Ide
Ta Hai Tung
Organization

Honorary Chairs

Huynh Quyet Thang Hanoi University of Science and Technology,


Vietnam
Mourad Baiou CNRS, LIMOS, France

General Chairs

Cathal Gurrin Dublin City University, Ireland


Ichiro Ide Nagoya University, Japan
Ho Tu Bao Vietnam Institute for Advanced Study in
Mathematics, Vietnam
Ta Hai Tung Hanoi University of Science and Technology,
Vietnam

Program Chairs

Wray Buntine VinUniversity, Vietnam


Morten Fjeld University of Bergen, Norway & Chalmers
University of Technology, Sweden
Truyen Tran Deakin University, Australia
Minh-Triet Tran University of Science, VNUHCM, Vietnam
Huynh Thi Thanh Binh Hanoi University of Science and Technology,
Vietnam
Takumi Miyoshi Shibaura Institute of Technology, Japan

Track Chairs

Networking and Communication Technologies

Hans-Jürgen Zepernick Blekinge Institute of Technology, Sweden


Hoang Dinh University of Technology Sydney, Australia
Hien Quoc Ngo Queen’s University Belfast, UK

Giang T. Nguyen Technische Universität Dresden, Germany


Trinh Van Chien Hanoi University of Science and Technology,
Vietnam

AI Foundations and Big Data

Massimo Zancanaro University of Trento, Italy


Thanh H. Nguyen University of Oregon, USA
Tan Nguyen National University of Singapore, Singapore
Cam-Tu Nguyen Nanjing University, China
Nguyen Duc Dung Vietnam Academy of Science and Technology,
Vietnam

AI Applications

Ryosuke Yamanishi Kansai University, Japan


Min-Chun Hu National Tsing Hua University, Taiwan
Pham Huy Hieu VinUniversity, Vietnam
Nguyen Phi Le Hanoi University of Science and Technology,
Vietnam

Multimedia Processing

Tam V. Nguyen University of Dayton, USA


Duc-Tien Dang-Nguyen University of Bergen, Norway
Habib Ullah Norwegian University of Life Sciences, Norway
Mehdi Elahi University of Bergen, Norway
Hongjo Kim Yonsei University, South Korea

Recent Advances in Cyber Security

Gérard Chalhoub University of Clermont Auvergne, France


Vinh-Thong Ta Edge Hill University, UK
Max Hashem Eiza Liverpool John Moores University, UK
Tran Quang Duc Hanoi University of Science and Technology,
Vietnam
Van-Hau Pham University of Information Technology – VNU,
Vietnam

Software Engineering

Iris Reinhartz-Berger University of Haifa, Israel


Thomas Degueule CNRS, France
Fiorella Zampetti Università degli Studi del Sannio, Italy
Indika Weerasingha Dewage Tilburg University, Netherlands

Special Session Chairs

Applied Operations Research and Optimization

Alain Quilliot LIMOS, France


Ridha Mahjoub Université Paris Dauphine, France
Jean-Philippe Gayon LIMOS, France
Paul Weng Duke Kunshan University, China
Viet Hung Nguyen Clermont Auvergne University, France
Dang Thu Huong Lancaster University, UK
Ha Minh Hoang National Economics University, Vietnam

Generative AI

Kai-Kristian Kemell Tampere University, Finland


Quan Thanh Tho University of Technology, VNUHCM, Vietnam
Dron Khanna Free University of Bozen-Bolzano, Italy
Dinh Viet Sang Hanoi University of Science and Technology,
Vietnam

Human Computer Interaction and Intelligent Interactive Systems

Shengdong Zhao City University of Hong Kong, China


Khanh-Duy Le University of Science, VNUHCM, Vietnam
Liting Zhou Dublin City University, Ireland
Chi-Thanh Vi International University, VNUHCM, Vietnam
Vinh-Tiep Nguyen University of Information Technology,
VNUHCM, Vietnam

Lifelog Event Retrieval

Klaus Schöffmann Klagenfurt University, Austria


Trong-Le Do University of Science, VNUHCM, Vietnam
Hai-Dang Nguyen University of Science, VNUHCM, Vietnam
Tu V. Ninh Dublin City University, Ireland
Tu-Khiem Le Dublin City University, Ireland
Thuc Nguyen-Quang University of Science, VNUHCM, Vietnam
Mai-Khiem Tran University of Science, VNUHCM, Vietnam

Tutorial Chairs

Ngo Duc Thanh University of Information Technology, Vietnam


Vo Dinh Bay HUTECH University of Technology, Vietnam
Than Quang Khoat Hanoi University of Science and Technology,
Vietnam

Organizing Chairs

Le Xuan Thanh Hanoi University of Science and Technology,


Vietnam
Tran Van Man University of Science, VNUHCM, Vietnam
Ngo Dai Nghiep University of Science, VNUHCM, Vietnam
Ngo Lam Trung Hanoi University of Science and Technology,
Vietnam
Nguyen Tan Khoi University of Da Nang, Vietnam

Publication Chairs

Dinh Anh Dung University of Sydney, Australia


Dang Tuan Linh Hanoi University of Science and Technology,
Vietnam
Dinh Thi Ha Ly Hanoi University of Science and Technology,
Vietnam
Tong Van Van Hanoi University of Science and Technology,
Vietnam
Nguyen Thi Oanh Hanoi University of Science and Technology,
Vietnam
Nguyen Ngoc Thao University of Science, VNUHCM, Vietnam

Publicity Chairs

Tran Hai Anh Hanoi University of Science and Technology,


Vietnam
Pham Minh Phuong University of Science, VNUHCM, Vietnam
Huynh Viet Tham University of Science, VNUHCM, Vietnam
Trinh Thanh Trung Hanoi University of Science and Technology,
Vietnam

Industrial Session Chairs

Pham Ngoc Hung Hanoi University of Science and Technology,


Vietnam

Web Chairs

Nguyen Quoc Khanh Hanoi University of Science and Technology,


Vietnam
Hoang Viet Dung Hanoi University of Science and Technology,
Vietnam

Program Committee

Alain Quilliot LIMOS, France


Alessio Bucaioni Mälardalen University, Sweden
Amleto Di Salle Gran Sasso Science Institute, Italy
An Vuong MBZUAI, United Arab Emirates
Andrea Giachetti University of Verona, Italy
Andreas Fischer University of Fribourg, Switzerland
Anh Cuong Le Ton Duc Thang University, Vietnam
Anh Pham Hoang Post and Telecommunications Institute of
Technology, Vietnam
Anh-Son Ta Hanoi University of Science and Technology,
Vietnam
Annalisa Navarro University of Naples Federico II, Italy
Antoine Doucet University of La Rochelle, France
Antonio Mastropaolo William & Mary, USA
Bao Nguyen RMIT Vietnam, Vietnam
Binh Long Nguyen Queensland University of Technology, Australia
Binh P. Nguyen Victoria University of Wellington, New Zealand

Bui Thi-Mai-Anh Hanoi University of Science and Technology,


Vietnam
Calvin Ku National Tsing Hua University, Taiwan
Cam-Tu Nguyen Nanjing University, China
Cam-Van Thi Nguyen University of Engineering and Technology,
Vietnam National University, Vietnam
Cathal Gurrin Dublin City University, Ireland
Charles Olivier-Anclin LIMOS, France
Chi Thanh Vi International University, VNUHCM, Vietnam
Chien Trinh Van Hanoi University of Science and Technology,
Vietnam
Chuan Xiao Osaka University & Nagoya University, Japan
Claudio Di Sipio University of L’Aquila, Italy
Cuong Pham VinUni-Illinois Smart Healthcare Center,
VinUniversity, Vietnam
Dai Hai Nguyen University of Tsukuba, Japan
Daisuke Kitayama Kogakuin University, Japan
Dang Hung Tran Hanoi National University of Education, Vietnam
Dang Nguyen Hai University of Science, VNUHCM, Vietnam
David Istvan McMaster University, Canada
Dhekra Mahmoud LIMOS, France
Dieu Vu Phenikaa University, Vietnam
Dinh Thai Hoang University of Technology Sydney, Australia
Dinh Viet Sang Hanoi University of Science and Technology,
Vietnam
Dron Khanna Free University of Bozen-Bolzano, Italy
Duc Tran Quang Hanoi University of Science and Technology,
Vietnam
Duc-Anh Nguyen Hanoi University of Science and Technology,
Vietnam
Duc-Dung Nguyen Institute of Information Technology, Vietnam
Academy of Science and Technology, Vietnam
Duc-Thinh Pham Nanyang Technological University, Singapore
Duc-Tien Dang-Nguyen University of Bergen, Norway
Duc-Tri Tran Hanoi University of Science and Technology,
Vietnam
Duc-Vu Nguyen University of Information Technology,
VNUHCM, Vietnam
Dung Dinh University of Sydney, Australia
Elif Tasdemir TU Dresden, Germany
Fiorella Zampetti University of Sannio, Italy
Frédéric Hayek University of Clermont Auvergne, France
Gerard Chalhoub University of Clermont Auvergne, France

Giammaria Giordano University of Salerno, Italy


Giang Nguyen Technische Universität Dresden, Germany
Giang Nguyen Hanoi University of Science and Technology,
Vietnam
Giang Son Tran Hanoi University of Science and Technology,
Vietnam
Gianluca Schiavo University of Trento, Italy
Habib Ullah Norwegian University of Life Sciences, Norway
Hai Anh Tran Hanoi University of Science and Technology,
Vietnam
Hai Tran Hanoi University of Science and Technology,
Vietnam
Hai Vu Hanoi University of Science and Technology,
Vietnam
Hai Vu Tuan University of Information Technology,
VNUHCM, Vietnam
Hanh Nguyen Thi Phenikaa University, Vietnam
Hans-Jurgen Zepernick Blekinge Institute of Technology, Sweden
Hau Pham Van University of Information Technology,
VNUHCM, Vietnam
Hien Ngo Queen’s University Belfast, UK
Hiep Luong Ghent University, Belgium
Hieu Pham Huy VinUniversity, Vietnam
Hikaru Ikuta University of Tokyo, Japan
Hisashi Miyamori Kyoto Sangyo University, Japan
Hongjo Kim Yonsei University, South Korea
Huan Nguyen Van Vietnam National University, Hanoi, Vietnam
Hung Tran Singapore Management University, Singapore
Hung Tuan Nguyen Tokyo University of Agriculture and Technology,
Japan
Huo-Chong Ling RMIT University Vietnam, Vietnam
Huong Thanh Le Hanoi University of Science and Technology,
Vietnam
Huy Nguyen Sinh Academy of Military Science and Technology,
Vietnam
Huyen Le VinUniversity, Vietnam
Huynh Thi Thanh Binh Hanoi University of Science and Technology,
Vietnam
Ichiro Ide Nagoya University, Japan
Indika Weerasingha Dewage Tilburg University, Netherlands
Iris Reinhartz-Berger University of Haifa, Israel
Ja-Hwung Su NCKU, Taiwan
Jean-Philippe Gayon LIMOS, France

Jerry Chun-Wei Lin Western Norway University of Applied Sciences,


Norway
Johannes Hofer TU Dresden, Germany
Ju-Chin Chen National Kaohsiung University of Science and
Technology, Taiwan
Juri Di Rocco Università degli Studi dell’Aquila, Italy
Kai-Kristian Kemell Tampere University, Finland
Khanh-Duy Le University of Science, VNUHCM, Vietnam
Khoa Phan Da Nang University of Science and Technology,
Vietnam
Kiem-Hieu Nguyen Hanoi University of Science and Technology,
Vietnam
Kiet Nguyen University of Information Technology,
VNUHCM, Vietnam
Klaus Schoeffmann Klagenfurt University, Austria
Kunal Agrawal University of Dayton, USA
Kyoung-Sook Kim National Institute of Advanced Industrial Science
and Technology, Japan
Lilian Aveneau XLIM/SIC, France
Liting Zhou Dublin City University, Ireland
Li-Wu Tsao National Yang Ming Chiao Tung University,
Taiwan
Long Doan George Mason University, USA
Long Giang Nguyễn Institute of Information Technology, Vietnam
Long Nguyen University of Science, VNUHCM, Vietnam
Luca Berardinelli Johannes Kepler University Linz, Austria
Luigi Quaranta University of Bari Aldo Moro, Italy
Mahdi Attawna Technical University of Dresden, Germany
Manh Cuong Dao Hanoi University of Science and Technology,
Vietnam
Masaharu Hirota Okayama University of Science, Japan
Massimo Zancanaro University of Trento, Italy
Max Hashem Eiza Liverpool John Moores University, UK
Maxime Puys University Clermont Auvergne, LIMOS, CNRS
UMR (6158), France
Mehdi Elahi University of Bergen, Norway
Mika Saari Tampere University of Technology, Finland
Min-Chun Hu National Tsing Hua University, Taiwan
Minh Hieu Nguyen Griffith University, Australia
Minh Hieu Nguyen ENSTA Paris, France
Minh Nguyen Auckland University of Technology, New Zealand
Minh-Khoi Pham Dublin City University, Ireland

Minh-Quan Le State University of New York at Stony Brook,


USA
Minh-Son Dao National Institute of Information and
Communications Technology, Japan
Minh-Tien Nguyen Hung Yen University of Technology and
Education, Vietnam
Minh-Triet Tran University of Science, VNUHCM, Vietnam
Mitsuo Yoshida University of Tsukuba, Japan
Morten Fjeld University of Bergen, Norway
Namal Rathnayake University of Tokyo, Japan
Nang Hung Nguyen Hanoi University of Science and Technology,
Vietnam
Ngoc Tran School of Information and Communications
Technology, Vietnam
Ngoc-Thao Nguyen University of Science, VNUHCM, Vietnam
Nguyen Hoai Nam Le VNUHCM – University of Science, Vietnam
Nguyen Hong Thanh Oregon University, USA
Nhan Dang Tam International University, VNUHCM, Vietnam
Nhat Hoang-Xuan University of Florida, USA
Nhat-Quang Doan University of Science and Technology of Hanoi,
Vietnam Academy of Science and Technology,
Vietnam
Pascal Lafourcade LIMOS, University of Clermont Auvergne,
France
Paul Weng Duke Kunshan University, China
Phi Le Nguyen Hanoi University of Science and Technology,
Vietnam
Phu Nguyen KU Leuven, Belgium
Phuong Le-Hong Vietnam National University, Hanoi, Vietnam
Phuong Nguyen Japan Advanced Institute of Science and
Technology, Japan
Pietro Liguori University of Naples Federico II, Italy
Pu Ching National Tsing Hua University, Taiwan
Quan La University of Sydney, Australia
Quang Tran Minh Ho Chi Minh City University of Technology,
Vietnam
Quang Uy Nguyen Le Quy Don Technical University, Vietnam
Quang-Vinh Dang Industrial University of Ho Chi Minh City,
Vietnam
Quoc Nguyen National Institute of Information and
Communications, Japan
Rafael Colares LIMOS, Clermont Auvergne INP, Université
Clermont Auvergne, France

Riccardo Rubei University of L’Aquila, Italy


Ridha Mahjoub Université Paris Dauphine, France
Ryosuke Yamanishi Kansai University, Japan
Sanjeel Parekh Technicolor R&D France, France
Shengdong Zhao City University of Hong Kong, China
Shigeaki Sakurai Toshiba Solutions Corporation, Japan
Shih Yin Ooi Multimedia University, Malaysia
Shingo Otsuka Kanagawa Institute of Technology, Japan
Shoji Yamamoto Tokyo Metropolitan College of Industrial
Technology, Japan
Shoko Wakamiya Nara Institute of Science and Technology, Japan
Son Nguyen Van Phenikaa University, Vietnam
Son Tran Deakin University, Australia
Son-T. Mai Aarhus University, Denmark
Stuart Perry University of Technology, Sydney, Australia
Swapna Narla Tek Yantra Inc, USA
Takahiro Komamizu Nagoya University, Japan
Takumi Miyoshi Shibaura Institute of Technology, Japan
Tam V. Nguyen University of Dayton, USA
Tan Nguyen National University of Singapore, Singapore
Tatsushi Matsubayashi Accenture, Japan
Thang-Long Nguyen-Ho Dublin City University, Ireland
Thanh Duc Ngo University of Information Technology,
VNUHCM, Vietnam
Thanh Hai Phung National Yang Ming Chiao Tung University,
Taiwan
Thanh Hung Bui Industrial University of Ho Chi Minh City,
Vietnam
Thanh Le-Cong University of Melbourne, Australia
Thanh Nguyen Chi Institute of Information Technology, AMST,
Vietnam
Thanh Pham Shizuoka University, Japan
Thanh Phuong Nguyen University of Toulon, France
Thanh T. H. Duong Hanoi University of Mining and Geology,
Vietnam
Thanh Tuan Nguyen HCMC University of Technology and Education,
Vietnam
Thanh Van Le University of Technology, VNUHCM, Vietnam
Thao Duong Murdoch University, Australia
Thi Lan Le Hanoi University of Science and Technology,
Vietnam
Thi Thanh Hai Tran Hanoi University of Science and Technology,
Vietnam

Thi Thu Hong Phan FPT University, Da Nang, Vietnam


Thien Huynh-The Ho Chi Minh City University of Technology and
Education, Vietnam
Thien-Phuc Tran University of Science, VNUHCM, Vietnam
Thi-Oanh Nguyen Hanoi University of Science and Technology,
Vietnam
Thirusubramanian Ganesan Cognizant Technology Solutions, USA
Tho Quan Ho Chi Minh City University of Technology,
Vietnam
Thomas Degueule CNRS, France
Thu Hang Phung Hanoi University of Science and Technology,
Vietnam
Thu Huong Dang Lancaster University, UK
Thuc Nguyen-Quang University of Science, VNUHCM, Vietnam
Tran Tri Dang RMIT University Vietnam, Vietnam
Trang Vu Monash University, Australia
Trong-Le Do University of Science, VNUHCM, Vietnam
Trong-Thuan Nguyen University of Arkansas, USA
Trung Ngo Hanoi University of Science and Technology,
Vietnam
Trung-Kien Tran Information Technology Institute, Academy of
Military Science and Technology, Vietnam
Truong Thao Nguyen National Institute of Advanced Industrial Science
and Technology, Japan
Truyen Tran Deakin University, Australia
Tse-Yu Pan National Taiwan University of Science and
Technology, Taiwan
Tu V. Ninh Dublin City University, Ireland
Tuan Dung Nguyen Hanoi University of Science and Technology,
Vietnam
Tuan Linh Dang Hanoi University of Science and Technology,
Vietnam
Tuan Luu NTU, Singapore
Tuan Thai Jeju National University, South Korea
Tu-Khiem Le Dublin City University, Ireland
Tung Doan TU Dresden, Germany
Tung Le University of Science, VNUHCM, Vietnam
Tung Nguyen Hanoi University of Science and Technology,
Vietnam
Upaka Rathnayake Atlantic Technological University, Ireland
Valeria Pontillo Vrije Universiteit Brussel, Belgium
Van An Le National Institute of Advanced Industrial Science
and Technology, Japan

Van Tong Hanoi University of Science and Technology,


Vietnam
Vatsa Patel University of Dayton, USA
Viet Cuong Nguyen HPC SYSTEMS Inc., Japan
Viet Cuong Ta VNU University of Engineering and Technology,
Hanoi, Vietnam
Viet Hung Nguyen LIMOS CNRS, Clermont Auvergne University,
France
Viet-Trung Tran Hanoi University of Science and Technology,
Vietnam
Vincent Nguyen LIFO, University of Orléans, France
Vincenzo Riccio University of Udine, Italy
Vinh Duc Tran Hanoi University of Science and Technology,
Vietnam
Vinh Thong Ta Edge Hill University, UK
Vinh-Tiep Nguyen University of Information Technology, Vietnam
Vu Minh Hieu Phan University of Adelaide, Australia
Vu-Hoang Tran Ho Chi Minh City University of Technology and
Education, Vietnam
Wei-Chuen Yau Xiamen University Malaysia, Malaysia
Wei-Lun Tseng National Yang Ming Chiao Tung University,
Taiwan
Wei-Ta Chu National Cheng Kung University, Taiwan
Wen-Cheng Chen Nvidia Corporation, Taiwan
Wen-Huang Cheng National Taiwan University, Taiwan
Wray Buntine VinUniversity, Vietnam
Xuan-Son Vu Umeå University, Sweden
Yeong-Chyi Lee Cheng Shiu University, Taiwan
Ying Han Pang Multimedia University, Malaysia
Yu Suzuki Gifu University, Japan
Yudai Tsujino Meiji University, Japan
Yukinobu Hoshino Kochi University of Technology, Japan
Zoltan Miklos University of Rennes 1, France

Organizers

Technical Sponsors

Financial Sponsors
Contents – Part I

Multimedia Processing

FDE-Net: Lightweight Depth Estimation for Monocular Cameras . . . . . . . . . . . . 3


Van-Truong Nguyen, Nhu-Nghia Bui, Dinh-Manh-Cuong Tran,
Thai-Viet Dang, and Phan Xuan Tan

Language-Guided Video Object Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14


Minh-Duy Phan, Minh-Huan Le, Minh-Triet Tran, and Trung-Nghia Le

MythraGen: Two-Stage Retrieval Augmented Art Generation Framework . . . . . . 25


Quang-Khai Le, Cong-Long Nguyen, Minh-Triet Tran,
and Trung-Nghia Le

Towards Unsupervised Speaker Diarization System for Multilingual


Telephone Calls Using Pre-trained Whisper Model and Mixture of Sparse
Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Phat Lam, Lam Pham, Truong Nguyen, Dat Ngo, Thinh Pham,
Tin Nguyen, Loi Khanh Nguyen, and Alexander Schindler

Hybrid Compression: Integrating Pruning and Quantization for Optimized


Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Minh-Loi Nguyen, Long-Bao Nguyen, Van-Hieu Huynh,
and Trung-Nghia Le

AI-Generated Image Recognition via Fusion of CNNs and Vision


Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Xuan-Bach Mai, Hoang-Minh Nguyen-Huu, Quoc-Nghia Nguyen,
Hoang-Tung Vu, and Trung-Nghia Le

Decoding Deepfakes: Caption Guided Learning for Robust Deepfake


Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Y-Hop Nguyen and Trung-Nghia Le

Minimalist Preprocessing Approach for Image Synthesis Detection . . . . . . . . . . . 88


Hoai-Danh Vo and Trung-Nghia Le

KidRisk: Benchmark Dataset for Children Dangerous Action Recognition . . . . . 100


Minh-Kha Nguyen, Trung-Hieu Do, Kim Anh Phung,
Thao Thi Phuong Dao, Minh-Triet Tran, and Trung-Nghia Le

DOLG-CNet: Deep Orthogonal Fusion of Local and Global Features


Combined with Contrastive Learning and Deep Supervision for Polyp
Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Trong-Hieu Nguyen-Mau, Kim-Trang Phu-Thi, Minh-Triet Tran,
and Hai-Dang Nguyen

VisChronos: Revolutionizing Image Captioning Through Real-Life Events . . . . 127


Phuc-Tan Nguyen, Hieu Nguyen, and Trung-Nghia Le

TI-JEPA: An Innovative Energy-Based Joint Embedding Strategy


for Text-Image Multimodal Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Khang H. N. Vo, Duc P. T. Nguyen, Thong T. Nguyen, and Tho T. Quan

A Lightweight End-to-End Multi-task Learning System for Vietnamese


Speaker Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Mai Hoang Dao, Son Thai Nguyen, Duy Minh Le, Cong Tran,
and Cuong Pham

Domain Generalization in Vietnamese Dependency Parsing: A Novel


Benchmark and Domain Gap Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Vinh-Hien D. Huynh, Chau-Anh Le, Chau M. Truong, Y. Thien Huynh,
and Quy T. Nguyen

Distribution-Guided Object Counting with Optimal Transport


and DINO-Based Density Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
Ngo Xuan Cuong and Tien-Dung Mai

Motion Analysis in Static Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193


Kunal Agrawal, Vastsa S. Patel, Reema Tharra, Trung-Nghia Le,
Minh-Triet Tran, and Tam V. Nguyen

Motorcycle Helmet Detection Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203


Kunal Agrawal, Vatsa S. Patel, Ian Cannon, Minh-Triet Tran,
and Tam V. Nguyen

MEPC: Multi-level Product Category Recognition Image Dataset . . . . . . . . . . . . 216


Thanh Long Nguyen, Manh Quang Do, and Ba Nghien Nguyen

A Simple Approach Towards Frame Filtering for Efficient Gaussian


Splatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
Thien-Phuc Tran, Minh-Quang Nguyen, and Minh-Triet Tran

Enhancing Unsupervised Person Re-identification with Multi-view Image


Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
Anh D. Nguyen, Dang H. Pham, Duy B. Vu, and Hoa N. Nguyen

Boosting Image Super-Resolution: Incorporating Locally-Enhanced FFN


and Data Augmentation in the Swin Transformer Architecture . . . . . . . . . . . . . . . 251
Phong Hai Tran and Ngoc-Thao Nguyen

Dual-Domain Reconstruction Network for Enhancing Sparse-View


and Low-Dose CT Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
Pham Cong Thang and Phan Minh Nhat

DehazeCLNet: A Contrastive Learning Framework with Advanced


Feature Extraction for Image Dehazing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
Pham Cong Thang, Nguyen An Hung, Nguyen Quoc Cuong,
and Phan Minh Nhat

Distortion-Resilient DIBR for Novel View Synthesis from a Single Image . . . . . 287
Yuchen Liu, Eiji Kamioka, and Phan Xuan Tan

Towards Real-Time Open World Instance Segmentation . . . . . . . . . . . . . . . . . . . . . 298


Bao Ly Tran Hoang, Minh Le Thanh, and Khanh-Duy Nguyen

An Attempt to Develop a Neural Parser Based on Simplified Head-Driven


Phrase Structure Grammar on Vietnamese . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
Duc-Vu Nguyen, Thang Chau Phan, Quoc-Nam Nguyen,
Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen

Knowledge Distillation for Lumbar Spine X-ray Classification . . . . . . . . . . . . . . . 329


Minh-Khang Nguyen, Viet-Tham Huynh, Thuy-Giang Thi Vo,
and Minh-Triet Tran

Forecasting Traffic Flow Under Uncertainty: A Case Study in Da Nang . . . . . . . 343


Doan Phuoc Mien, Tran The Vu, and Ngo Van Sy

Constraint Programming-Based Cutting Plane Algorithm


for a Combination of Orienteering and Maximum Capture
Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
Hoang Giang Pham, Tien Mai, and Minh Hoàng Hà

Operations Research

Cost Optimization in Competitive Facility Location Under General


Demand Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
Ba Luat Le, Thuy Anh Ta, and Hoang Giang Pham

A Historical GPS Trajectory-Based Framework for Predicting Bus Travel


Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
Khang Nguyen Duy, Minh Nguyen Tuan, and Nam Thoai

Influence Maximization with Fairness Allocation Constraint . . . . . . . . . . . . . . . . . 401


Hue T. Nguyen, Bac D. Pham, Uyen T. Tran, Nguyen Long Giang,
and Canh V. Pham

Exemplar-Embed Complex Matrix Factorization with Elastic-Net Penalty:


An Advanced Approach for Data Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
Manh Quan Bui, Viet Hang Duong, and Jia-Ching Wang

A Method Combining the Reference Information of the Adaptive


Adjustment Method and the Decision Maker of Multi-objective
Evolutionary Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
Long Nguyen, Minh Tran Binh, and Thu To Thi

Modeling Information Diffusion in Bibliographic Networks Using


Pretopology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
Thi Kim Thoa Ho, Quang Vu Bui, and Marc Bui

Optimizing Credit Scoring Models for Decentralized Financial


Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452
Trong Hoan Dao, Tuan-Dat Trinh, and Viet-Bang Pham

Application of the SFE Feature Selection Method for Multi-omic


Biomarker Discovery in Brain Cancer Subtyping . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
Hien Nguyen Minh, Ha Tang Vinh, Hoang Le, and Diep Thi Hoang

A Reputation Scoring Framework for Lending Protocols Using


the PageRank Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
Mau-Tra Nguyen, Tuan-Dat Trinh, and Viet-Bang Pham

Unifying Convolution and Self-attention for Liver Lesion Diagnosis


on Multi-phase Magnetic Resonance Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
Huynh-Sang Nguyen, Nhat-Minh Truong, and Minh-Triet Tran

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511


Multimedia Processing
FDE-Net: Lightweight Depth Estimation
for Monocular Cameras

Van-Truong Nguyen1, Nhu-Nghia Bui2, Dinh-Manh-Cuong Tran2, Thai-Viet Dang2(B), and Phan Xuan Tan3
1 Hanoi University of Industry, Hanoi 10000, Vietnam
2 Hanoi University of Science and Technology, Hanoi 10000, Vietnam
[email protected]
3 Shibaura Institute of Technology, Toyosu, Koto-Ku, Tokyo 135-8548, Japan

Abstract. Depth estimation techniques typically involve extracting features of objects in the environment and their relationships. However, these methods require multiple images, making them less feasible for real-time scenarios. To alleviate these challenges, the rise of efficient convolutional neural networks (CNNs) able to infer depth from a single image opens a new avenue for investigation. This work introduces FDE-Net, an efficient network designed to generate cost-effective depth maps from a single image. The framework consists of PPLC-Net as the convolutional encoder and a fast decoder. This combination integrates the Squeeze-and-Excitation (SE) module with the MKLDNN optimizer to enhance convolutional efficiency and keep the model size moderate while training efficiently. Meanwhile, the proposed multi-scale pixel-wise fast decoder generates state-of-the-art depth maps while maintaining an efficient structure. Experimental results demonstrate that our model achieves state-of-the-art performance on four datasets: NYU-V2, KITTI, Cityscapes, and a simulated environment. Notably, FDE-Net uses only 0.04 times the parameter count of Resnet-Upconv. Its FLOPs and MACs show considerably higher computational efficiency than competing models, and it achieves 4.2 times lower latency and 3.9 times higher throughput than Resnet18-Upconv.

Keywords: Convolutional Neural Network (CNN) · Depth Estimation · MKLDNN optimizer · Monocular Images

1 Introduction
Automated navigation is primarily aimed at avoiding obstacles accurately and efficiently. Building a dedicated vision system for complex platforms makes them expensive. Compact and low-cost monocular cameras offer the advantages of capturing contextual information and being easy to integrate into deployments [1]. Accurately estimating the depth of objects in the scene allows the system to determine their locations within a certain distance. Depth values can be obtained through several approaches, typically a combination of cameras and Lidar. While Lidar-based depth estimation is highly accurate
and efficient, its high cost and complex computational resource requirements limit its application in many contexts [2]. Vision-based perception systems are often used because of their low cost and easy integration with other systems. However, monocular cameras do not allow the direct extraction of depth information [3], which limits their effectiveness for understanding and reconstructing poses and 3D maps during operation. Binocular vision costs more than monocular vision but provides more accurate depth information; however, it is not suitable for a wide range of distances and requires combining information from multiple complex viewpoints [4].
To accurately estimate the depth of a 2D scene, the features and relationships of the specific details in the image, as well as the overall context of the scene, must be extracted and processed. Leveraging the ability to learn from both local and global contexts, deep convolutional neural networks (DCNNs) have been used extensively in recent studies to estimate monocular depth. Gao et al. proposed an unsupervised learning method to simultaneously predict both monocular depth and the ego-motion trajectory [5]. Xiong et al. then proposed robust geometric losses to maintain consistency between depth and pose estimation [6]; their self-supervised monocular depth estimation is stabilized by incorporating scale-consistent geometric constraints into the loss functions. Subsequently, Nguyen et al. employed a deep learning model to ease training on a self-collected dataset [7]. Nevertheless, such extensive networks contain a large number of parameters, causing substantial computational expense and memory demands. Godard et al. generated disparity images using a reconstruction loss for depth prediction from poor-quality depth images [8]. As a result, the distortion of depth information near edges significantly reduces the accuracy of downstream tasks such as 3D reconstruction. Therefore, the objective of a 3D perception system is to retrieve 3D bounding boxes, described in the coordinate frame of the 3D environment, as well as the mobile robot's bird's-eye view.
This paper proposes the lightweight FDE-Net, built on PPLC-Net as the backbone and a fast convolution block as the decoder. First, the proposed backbone improves the network's performance on multiple tasks. Then, the fast convolution block decodes the backbone features and returns a depth map. By using only the current data, the new transform eliminates redundant computations without requiring the overall overlapped data. Furthermore, the number of parameters is significantly reduced for low-resource embedded systems. Arithmetic analysis shows that data processing speed is significantly improved, implying that the reduced transform size gives an additional advantage in data manipulation. In summary, the prediction layer produces the final segment map for identifying obstacles and constructing the real-time global path in the mobile robot's environment.
The main contributions are as follows:
• We present the FDE-Net model, combining a fast convolutional block decoder with the lightweight PPLC-Net backbone for efficient depth estimation on resource-limited systems.
• The combination of the L1 and SSIM loss functions brings efficiency and balance to the model training process.
• Based on experiments on the NYU-V2, Cityscapes, and KITTI datasets, the proposed
model shows superior performance compared to state-of-the-art monocular camera-
based depth estimation methods.
2 Related Work
Depth estimation from perspective images has garnered considerable interest over the
past decade through the utilization of deep-learning-based monocular perspective depth
estimation. To improve the precision of depth-map prediction, Eigen and Fergus intro-
duced a novel multi-dimensional monocular depth estimation approach that combines
the fundamental aspects of global and local perspectives [9]. Zhou et al. devised an inno-
vative method to reduce reliance on ground truth data by concurrently enhancing the
accuracy of depth estimation and pose estimation [10]. This was accomplished by lever-
aging an input image from monocular video sequences. Godard et al. have made signifi-
cant advancements with the introduction of MonoDepth2, an advanced model designed
to effectively handle occluded pixels [8]. By filtering out unsuitable training pixels
with camera motion, the reprojection loss was minimized at a per-pixel level through
an auto-masking loss framework. The key features include a redesigned arrangement
of skip connections and the incorporation of suitable attributes to achieve exceptional
high-resolution output. Wofk et al. introduced FastDepth, a proficient and lightweight
encoder-decoder network structure that reduces computational complexity and latency
[11]. However, challenges persist, such as the loss of intricate details and blurring of
predicted depth map edges. Rudolph et al. reconstructed high-resolution depth maps
using guided upsampling blocks in the decoder [12]. Zhou et al. iteratively improved
the depth map through a recurrent multi-scale feature modulation [13]. In an effort to
incorporate global contexts, Zhang et al. proposed the Lite-Mono architecture, which
combines a lightweight CNN with a transformer [14]. Consequently, this architecture not
only reduces model size but also maintains accuracy. Nevertheless, with the increased
data processing in 3D image reconstruction, these lightweight approaches must over-
come limitations in representation and computational resources. Following the lead of
CondConv, Zhang et al. chose to replace regular convolutions with CondConv to enhance
network scale and capabilities while preserving performance and inference costs [14].
By dynamically adapting the convolution kernel using CondConv and integrating sub-
pixel convolution, the authors introduce a spatially aware dynamic lightweight depth
estimation network. This strategy enables accurate depth estimation with minimal com-
putational overhead. In essence, the challenge lies in developing depth estimation models
that offer enhanced efficiency with minimal resource requirements and reliable real-time
operation.

3 Proposed Method
In this section, we present the FDE-Net architecture shown in Fig. 1. Features are extracted with the PPLC-Net backbone to speed up the depth estimation model. Through the use of DepthSepConv, the model can deliver precise results on a variety of CPU or GPU devices. The SE module is incorporated to enhance convolutional efficiency by reweighting the features of individual channels.

Fig. 1. The proposed FDE-Net architecture.

3.1 Encoder
CNNs have demonstrated remarkable progress in the realm of computer vision tasks in
recent years. These networks exhibit the capability to undergo training and application
in a wide range of scenarios, including depth estimation. Within this study, the authors
have developed a lightweight depth estimation network by leveraging the PPLC-Net
architecture [15], renowned for its superior alignment with the MKLDNN acceleration
strategy. We advocate for the utilization of the PPLC-Net model for feature extraction to
enhance the processing speed of the depth estimation model. By employing DepthSep-
Conv, the model can achieve high accuracy when functioning on CPU or GPU devices.
Each convolution block consists of multiple sequential convolution layers that utilize
filters to extract features from the input image, with the size and quantity of filters being
determined by the network’s architecture. Typically, smaller filters such as 3×3 or 5×
5 are favored to reduce the network’s parameters while improving computational effi-
ciency. Given the demand for precise per-pixel accuracy in tasks like single-camera
depth estimation (e.g., obstacle avoidance or object centering), aggregating features at
various scales becomes crucial for the decoder to accurately decode these features. To
enhance the understanding of image semantics at different scales, we propose integrating
convolution layers with distinct kernel sizes directly associated with the model outputs
at sizes 112×112, 56×56, 28×28, 7×7. Subsequently, the decoded segments are com-
bined to produce the final output of the model. Initializing the convolution layers for
each encoder output size assists in detailing the decoded segments.
Moreover, our approach involves breaking down the convolution operation into two
stages: depthwise convolution (DW) and pointwise convolution (PW). Global average
pooling (GAP) is utilized, and the H-swish activation function is selected for its efficiency and robustness to data imbalance and noise. The SE module is positioned near the end of the network to improve balance and accuracy, aiming to harness higher-level features effectively [18]. The activation functions employed include
ReLU and Sigmoid.
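As a rough PyTorch sketch of the building blocks described above (depthwise plus pointwise convolution, the SE module, and H-swish), the following is a minimal illustration; the channel counts, reduction ratio, and exact layer ordering are illustrative assumptions rather than the published PPLC-Net configuration.

```python
import torch
import torch.nn as nn


class SqueezeExcite(nn.Module):
    """Squeeze-and-Excitation: global average pooling followed by two pointwise layers."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # GAP over the spatial dimensions
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.fc(self.pool(x))  # reweight each channel


class DepthSepConv(nn.Module):
    """Depthwise (DW) + pointwise (PW) convolution with optional SE and H-swish."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3,
                 stride: int = 1, use_se: bool = False):
        super().__init__()
        self.dw = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                      padding=kernel_size // 2, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.Hardswish(inplace=True),
        )
        self.se = SqueezeExcite(in_ch) if use_se else nn.Identity()
        self.pw = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.Hardswish(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pw(self.se(self.dw(x)))
```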

3.2 Decoder

Fig. 2. Basic operation of Conv2D layer in the decoder block.

The decoder is responsible for decoding and merging the distinct features produced by the encoder. It generates a prediction map containing detailed information for each individual pixel. The proposed design uses 2D convolutional layers (Conv2D) with a variety of kernel sizes; these components are designed to reveal features across different scales and sizes. Each layer processes the input from its corresponding DepthSepConv block, and the outputs of the layers are combined and fed into the next layer. Using varied kernel sizes enriches the information carried into the final prediction. The characteristics and roles of the decoder layers are as follows (Fig. 2):
• 3×3 Conv2D: Extracts global-scale features from the Stem conv/h-swish layer.
• 3×2 Conv2D: Extracts horizontal direction features from the first DepthSepConv
layer.
• 2×3 Conv2D: Like the previous decoder layer, but a 2×3 kernel is used to capture
vertical information.
• 2×2 Conv2D: Employs a small kernel to extract detailed features from the last
DepthSepConv layer.
The decoder comprises four upconvolution modules, each reducing the number of channels while increasing the size of the feature map. Within each module, the blocks are arranged as follows: unpooling, convolution, batch normalization, and Rectified Linear Unit (ReLU). Information gathered at various scales is merged through a concatenation block that reintegrates the details into a unified prediction map. The decoded features are then interpolated to produce a full-resolution depth estimate. Finally, a filter is applied to denoise and normalize the predicted values.
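To make this structure concrete, a minimal sketch of one upconvolution module and a four-stage decoder is shown below; nearest-neighbour upsampling stands in for the unpooling step, and the channel widths and output resolution are illustrative assumptions rather than values taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UpConvBlock(nn.Module):
    """Unpool (approximated by upsampling) -> convolution -> batch norm -> ReLU."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.bn(self.conv(self.up(x))))


class FastDecoder(nn.Module):
    """Four up-convolution modules followed by a 1x1 depth prediction head."""

    def __init__(self, in_ch: int = 256):
        super().__init__()
        chs = [in_ch, in_ch // 2, in_ch // 4, in_ch // 8, in_ch // 16]
        self.blocks = nn.ModuleList(UpConvBlock(chs[i], chs[i + 1]) for i in range(4))
        self.head = nn.Conv2d(chs[-1], 1, kernel_size=1)

    def forward(self, x: torch.Tensor, out_size=(480, 640)) -> torch.Tensor:
        for blk in self.blocks:
            x = blk(x)
        depth = self.head(x)
        # Interpolate to the full input resolution for the final depth map.
        return F.interpolate(depth, size=out_size, mode="bilinear", align_corners=False)
```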

4 Results and Discussion


4.1 Model Training
The proposed model was trained on three datasets: NYU-V2 with 7560 images (resolution 640×480 pixels), Cityscapes with 5000 images (128×256 pixels, accompanied by segmentation labels for 19 different classes and depth labels), and KITTI with 400 images (originally 1216×352 pixels, down-sampled to 912×228 pixels). We used the Adam optimization algorithm with a learning rate of 0.001, weight decay of 0.9, and a batch size of 32. The framework was PyTorch, and the model was trained on a server with an Intel Core i9-11900K CPU, 64 GB of RAM, and an RTX 3070 GPU.
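These settings translate into a short training routine; the sketch below assumes that the model, data loader, and combined loss are defined elsewhere and simply mirrors the stated configuration (Adam, learning rate 0.001, weight decay 0.9, batch size 32).

```python
import torch


def train_one_epoch(model, loader, optimizer, loss_fn, device="cuda"):
    """One pass over the training set; loss_fn is the combined SSIM + L1 loss defined below."""
    model.train()
    for rgb, depth_gt in loader:
        rgb, depth_gt = rgb.to(device), depth_gt.to(device)
        depth_pred = model(rgb)
        loss = loss_fn(depth_gt, depth_pred)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


# Optimizer configured with the values reported in Sect. 4.1 (model defined elsewhere):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.9)
```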
The training and validation losses measure the difference between the model's predicted depths and the ground truth on the training and test sets; they are computed after every epoch to track the model's progress. The loss is built on the SSIM and L1 losses and is defined as follows:
Total_Loss(YGT , Y) = δ1 × (1 − SSIM) + δ2 × L1 . (1)
In particular, δ1 is 0.4 and δ2 is 0.6. The SSIM metric quantifies the similarity between two images by analyzing their local patterns of luminance, contrast, and structure; the comparison is expressed as the SSIM index:
  
SSIM = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}  (2)

where \mu_x and \mu_y are the means of the two images, \sigma_x^2 and \sigma_y^2 are their variances, \sigma_{xy} is their covariance, and c_1 and c_2 are constants stabilizing the SSIM calculation.
The standard loss function L1 is the sum of the absolute difference between the target
value and the estimated value. Hence, the L1 loss is illustrated as follows:

L_1 = \sum_{i=1}^{n} \left| Y_i - Y_{GT,i} \right|  (3)

The main advantages of the proposed total loss function are as follows:

• Two distinct loss functions are combined, giving a more versatile assessment of the model during training and testing.
• SSIM is sensitive to local, pixel-level discrepancies between the images when the gradients are derived.
• The L1 loss is highly sensitive to erroneous predictions, which speeds up training and leads to faster convergence of the predicted values towards the ground truth.
The weighting parameters δ1 and δ2 are adjustable, allowing the model to accommodate datasets with varying characteristics.
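A compact PyTorch sketch of this combined loss is given below; the windowed SSIM uses a simple uniform window and assumes depth maps normalized to [0, 1] with shape N×1×H×W, which are implementation assumptions rather than details given in the paper.

```python
import torch
import torch.nn.functional as F


def ssim(x: torch.Tensor, y: torch.Tensor, window: int = 11,
         c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """Mean SSIM over a uniform local window (Eq. (2), simplified)."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, stride=1, padding=pad)
    mu_y = F.avg_pool2d(y, window, stride=1, padding=pad)
    sigma_x = F.avg_pool2d(x * x, window, stride=1, padding=pad) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, window, stride=1, padding=pad) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, window, stride=1, padding=pad) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).mean()


def total_loss(y_gt: torch.Tensor, y_pred: torch.Tensor,
               delta1: float = 0.4, delta2: float = 0.6) -> torch.Tensor:
    """Weighted sum of (1 - SSIM) and L1 as in Eq. (1), with delta1 = 0.4 and delta2 = 0.6."""
    # F.l1_loss averages over pixels; Eq. (3) uses a sum, differing only by the factor n.
    return delta1 * (1.0 - ssim(y_pred, y_gt)) + delta2 * F.l1_loss(y_pred, y_gt)
```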

4.2 Evaluation Metrics

Mean Squared Error (MSE): measures the average squared error between the predicted depth and the true depth value. MSE is calculated using the following formula:

MSE = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2,  (4)

where n is the total number of predicted pixels, and Y_i and \hat{Y}_i are the i-th predicted depth and actual depth, respectively.
Mean Absolute Error (MAE): a common loss function for deep-learning-based methods. We use this metric to represent the pixel-wise difference between the ground truth and the predicted depth, averaged over all pixels of the image. MAE is calculated using the following formula:

MAE = \frac{1}{n} \sum_{i=1}^{n} \left| Y_i - \hat{Y}_i \right|.  (5)

Accuracy under threshold: determines whether a prediction is considered accurate based on a specific threshold. The threshold values used are 1.25, 1.25², and 1.25³. Accuracy under threshold is calculated as:

\frac{1}{n} \sum_{i=1}^{n} \max\left( \frac{\hat{Y}_i}{Y_i}, \frac{Y_i}{\hat{Y}_i} \right) = \delta < \text{Threshold}.  (6)

Absolute relative error (Abs Rel): measures the average absolute relative difference between the predicted depth values and the actual ground-truth depth values, normalized by the ground-truth depth. Abs Rel is calculated as:

Abs\_Rel = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| Y_i - \hat{Y}_i \right|}{\hat{Y}_i}.  (7)
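For reference, the four metrics above can be computed in a few lines of PyTorch; in line with the paper's notation, pred is the predicted depth and gt the ground truth, and the small eps guard against division by zero is an implementation assumption.

```python
import torch


def depth_metrics(gt: torch.Tensor, pred: torch.Tensor, eps: float = 1e-8) -> dict:
    """MSE, MAE, Abs-Rel, and threshold accuracies as defined in Eqs. (4)-(7)."""
    mse = torch.mean((pred - gt) ** 2)
    mae = torch.mean(torch.abs(pred - gt))
    abs_rel = torch.mean(torch.abs(pred - gt) / (gt + eps))
    ratio = torch.maximum(pred / (gt + eps), gt / (pred + eps))
    return {
        "MSE": mse.item(),
        "MAE": mae.item(),
        "Abs-Rel": abs_rel.item(),
        "delta<1.25": (ratio < 1.25).float().mean().item(),
        "delta<1.25^2": (ratio < 1.25 ** 2).float().mean().item(),
        "delta<1.25^3": (ratio < 1.25 ** 3).float().mean().item(),
    }
```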

4.3 Benchmark Result

Based on the results observed during the training phase, FDE-Net demonstrates a competitive advantage. As illustrated in Fig. 3, the monitored values converge quickly to the thresholds required for the model to work effectively.
No large parameter fluctuations occur, indicating the stability of the model. The proposed approach adapts well to the datasets used for training, as shown in Table 1 for the NYU-V2, KITTI, and Cityscapes datasets. Although trained for only a modest number of epochs, the proposed model still achieves highly competitive accuracy compared with other methods under the same training scenario, and the predicted values differ only slightly from the ground truth.

Fig. 3. FDE-Net training results.

Table 1. Experiments on the NYU-V2, KITTI, and Cityscapes datasets.

Dataset MAE MSE Abs-Rel σ < 1.25 σ < 1.25² σ < 1.25³
NYU-V2 0.0493 0.0064 0.162 0.8722 0.9641 0.9811
KITTI 0.030 0.0030 0.5484 0.7015 0.8007 0.9042
Cityscapes 0.040 0.0033 0.4599 0.733 0.8126 0.9287

FDE-Net is designed with a modest size and resource cost. By harnessing the benefits of DepthSepConv layers, the model makes effective use of its trained parameters. The performance of FDE-Net and Resnet18 + Upconv is compared in terms of the number of parameters, FLOPs, MACs, latency, and throughput, with the outcomes laid out in Table 2.

Table 2. Comparison with other methods using different backbones in terms of flexibility and model weight.

Model Parameters FLOPs (M) MACs (M) Latency (ms) Throughput (images/s)
SENet-154 [16] 157075323 - - 2.45 2041.3
MobileNetV2 [17] 5821099 20987 30198.0 6.66 3093.14
MobileNet [17] 20661003 18075 26467.9 7.15 2569.06
Resnet18 + Upconv [18] 12396720 22302 11131.1 4.69 3659.7
Our 576216 47 23.5 1.11 14438.81

Remarkably, FDE-Net uses only 0.04 times the number of parameters of ResNet18 + Upconv. Furthermore, its FLOPs and MACs indicate much more efficient computation than the alternative model. The latency of FDE-Net is 4.2 times lower, while its throughput is 3.9 times higher than that of ResNet18 + Upconv. Compared with the remaining methods, the proposed model is also faster and more agile because it significantly reduces the computational volume.

These findings confirm that FDE-Net is a lightweight model with a small number of parameters. The small parameter count significantly reduces computation and storage costs on embedded or edge devices, improving computational performance and data throughput and making the model well suited to resource-limited systems and mobile robots.
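As an illustration of how latency and throughput figures such as those in Table 2 can be obtained, the sketch below times repeated forward passes of a model. The input resolution, batch size, and number of runs are placeholders rather than the benchmarking protocol used in the paper.

```python
import time
import torch


@torch.no_grad()
def benchmark(model, input_shape=(1, 3, 224, 224), runs=100, device="cuda"):
    """Rough latency/throughput measurement for a depth-estimation model."""
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(10):                                   # warm-up iterations
        model(x)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    n_params = sum(p.numel() for p in model.parameters())
    latency_ms = elapsed / runs * 1000.0                  # milliseconds per forward pass
    throughput = runs * input_shape[0] / elapsed          # images per second
    return n_params, latency_ms, throughput
```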
Comparison with previously introduced methods shows that FDE-Net is highly competitive. The evaluation metrics in Table 3 demonstrate its effectiveness in predicting the depth map of the environment. In summary, FDE-Net achieves strong inference accuracy with a small model size. The results show that, compared with current best practices, the proposed strategy performs well in the experiments. When coupled with efficient machine learning techniques such as dropout and knowledge distillation, FDE-Net's small size and flexibility make it suitable for resource-limited systems.

Table 3. Comparison of predictive performance with other methods.

Model MSE MAE Abs-Rel σ < 1.25 σ < 1.25² σ < 1.25³
MRF [19] 0.0082 0.074 0.623 0.800 0.928 0.280
Alex-Net [9] 0.030 0.0030 0.5484 0.7015 0.8007 0.9042
FCN [20, 21] 0.040 0.0033 0.4599 0.733 0.8126 0.9287
SSDM [22] 0.0068 0.052 0.803 0.935 0.946 0.183
Ours 0.0063 0.050 0.709 0.949 0.9641 0.184

Next, Fig. 4 illustrates inference results on the KITTI dataset. The output of the proposed method closely matches the objects' real depth values: predictions are not misplaced, overlapping regions are not confused, and no artifacts are introduced into the final projection. The proposed method therefore creates reliable depth maps from a monocular camera. Furthermore, semantic segmentation data is successfully integrated with the depth data, gradually building a knowledge model that helps the system accurately grasp the characteristics of the environment.

Fig. 4. Inference results on the KITTI dataset.

5 Conclusion
This paper proposes FDE-Net, a novel monocular depth estimation solution that efficiently extracts depth information at minimal computational cost. The integration of DepthSepConv, combined with squeeze-and-excitation model optimization and the Adam optimizer, further improves the efficiency of the proposed model. Experiments on three datasets, Cityscapes, KITTI, and NYU-V2, yield remarkable evaluation results. Notably, FDE-Net uses only 0.04 times the number of parameters of ResNet18 + Upconv, and its FLOPs and MACs demonstrate a clear computational advantage over alternative models. FDE-Net also has 4.2 times lower latency and 3.9 times higher throughput than ResNet18 + Upconv. Consequently, the model has great potential for seamless integration into both real-time scenarios and simulated environments. Additionally, the distance from the viewpoint to the center of an object, as provided by FDE-Net, is confirmed to be valid. Future work will investigate combining depth maps with semantic information about the environment to enhance knowledge-based systems for mobile robots.

Acknowledgments. This work was supported by Vingroup Innovation Foundation (VINIF) under
Project code VINIF.2023.DA089.

References
1. Dang, T.V., Bui, N.T.: Multi-scale fully convolutional network-based semantic segmentation
for mobile robot navigation. Electronics 12(3), 533 (2023)
2. Dang, T.V., Bui, N.T.: Obstacle avoidance strategy for mobile robot based on monocular
camera. Electronics 12(8), 1932 (2023)
3. Huang, K.C., Wu, T.H., Su, H.T., Hsu, W.H.: MonoDTR: monocular 3D object detection with
depth-aware transformer. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 4002–4011 (2022)

4. Sun, H., et al.: Transformer-based Stereo-Aware 3D Object Detection from Binocular Images.
arXiv:2304.11906 (2023)
5. Gao, R., et al.: Unsupervised learning of monocular depth and ego-motion in outdoor/indoor
environments. IEEE Internet Things J. 9(17), 16247–16258 (2022)
6. Xiong, M., et al.: Self-supervised monocular depth and visual odometry learning with scale-
consistent geo-metric constraints. In: Proceedings of the Twenty-Ninth International Joint
Conference on Artificial Intelligence, pp. 963–969 (2020)
7. Nguyen, V.T., Nguyen, A.T., Nguyen, V.T., Bui, H.A.: A real-time human tracking system
using convolutional neural network and particle filter. In: ICISN 2021, Intelligent Systems
and Networks 50(243), 411–417 (2021)
8. Godard, C., Aodha, O.M., Firman, M., Brostow, G.J.: Digging into self-supervised monocular
depth estimation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV),
pp. 3827–3837 (2019)
9. Eigen, D., Fergus, R.: Predicting depth, surface normal and semantic labels with a common
multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference
on Computer Vision (ICCV), pp. 2650–2658 (2015)
10. Zhou, T., Brown, M.A., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-
motion from video. In: IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 6612–6619 (2017)
11. Wofk, D., Ma, F., Yang, T.J., Karaman, S., Sze, V.: FastDepth: fast monocular depth estimation
on embedded systems. In: International Conference on Robotics and Automation (ICRA),
pp. 6101–6108 (2019)
12. Rudolph, M.B., Dawoud, Y., Guldenring, R., Nalpantidis, L., Belagiannis, V.: Lightweight
monocular depth estimation through guided decoding. In: International Conference on
Robotics and Automation (ICRA), pp. 2344–2350 (2022)
13. Zhou, Z., Fan, X., Shi, P., Xin, Y.: R-MSFM: recurrent multi-scale feature modulation
for monocular depth estimating. In: IEEE/CVF International Conference on Computer Vision
(ICCV), pp. 12757–12766 (2021)
14. Zhang, N., Nex, F., Vosselman, G., Kerle, N.: Lite-mono: a lightweight CNN and transformer
architecture for self-supervised monocular depth estimation. In: IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR) (2023)
15. Cheng, C., et al.: PP-LCNet: A Lightweight CPU Convolutional Neural Network. arXiv:2109.15099v1 (2021)
16. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
17. Tu, X., et al.: Efficient monocular depth estimation for edge devices in internet of things.
IEEE Trans. Industr. Inf. 17(4), 2821–2832 (2021)
18. Kubilay, M.S., Esat, K., Fatma, B.C., Ahmet, A.: Comparative parotid gland segmentation
by using ResNet-18 and MobileNetV2 based DeepLab v3+ architectures from MR images.
Concurrency and Computation Practice and Experience 35(1), e7405 (2023)
19. Saxena, A., Sun, M., Ng, A.Y.: Make3D: learning 3D scene structure from a single still image.
IEEE Trans. Pattern Anal. Mach. Intell. 31(5), 824–840 (2009)
20. Dang, T.V., Tran, D.M.C., Phan, X.T.: IRDC-net: lightweight semantic segmentation network
based on monocular camera for mobile robot navigation. Sensors 23(15), 6907 (2023)
21. Dang, T.V., Phan, X.T.: Hybrid mobile robot path planning using safe JBS-A*B algorithm
and improved dwa based on monocular camera. J. Intell. Rob. Syst. 110(151), 1–21 (2024)
22. Israr, H., Shunquan, T., Jiwu, H.: A semi-supervised deep learning approach for cropped
image detection. Expert Syst. Appl. 243(5), 122832 (2023)
Language-Guided Video Object
Segmentation

Minh-Duy Phan1,2(B), Minh-Huan Le1,2, Minh-Triet Tran1,2, and Trung-Nghia Le1,2

1 University of Science, VNU-HCM, Ho Chi Minh, Vietnam
2 Vietnam National University, Ho Chi Minh, Vietnam
[email protected]

Abstract. Referring Video Object Segmentation (RVOS) is a challeng-


ing computer vision task that requires segmenting and tracking objects in
video based on natural language descriptions. Traditional RVOS methods
typically focus on static visual features such as color and shape, often
extending image segmentation techniques to video using mask propa-
gation and memory attention. While these approaches have seen vary-
ing levels of success, they often struggle with the dynamic nature of
video content. Current RVOS datasets and methodologies have not fully
addressed the complexity posed by motion and other temporal factors in
video. To bridge this gap, the MeViS dataset emphasizes motion expres-
sions in conjunction with language-based object segmentation. MeViS
presents unique challenges, including the use of motion-centric language
expressions, complex scenes with multiple objects of the same category,
interactions between text and objects, and long video sequences. These
complexities require a deeper understanding of both temporal and spatial
information in video. This paper enhances existing RVOS techniques to
meet the specific demands of the MeViS dataset. Our model is built upon
the Swin-Large architecture and is initially trained on the Ref-Youtube-
VOS-2021 dataset before being fine-tuned with the MeViS dataset. We
implement a multi-step approach that leverages masks generated during
training to accurately track object movement and eliminate misidenti-
fications. Our solution achieves a J &F score of 0.5319 on the valida-
tion set, demonstrating its effectiveness in handling the complexities of
motion and dynamic video content in RVOS.

Keywords: Referring Video Object Segmentation · Knowledge


Distillation

1 Introduction
Referring video object segmentation (RVOS) is an increasingly important task
in computer vision that combines visual and natural language processing to
segment objects in video frames based on textual descriptions [13, 14]. Unlike
traditional video object segmentation, which depends solely on visual cues,

RVOS incorporates natural language expressions to identify and track specific


objects, leading to more context-aware segmentation. This capability is vital
in a range of applications, including language-guided video editing, augmented
reality, human-machine interaction, and video surveillance.
RVOS poses several challenges, primarily due to the complexity of aligning
visual and linguistic data in dynamic video environments [1]. Many existing seg-
mentation methods are rooted in static image features, which are insufficient
when objects change appearance across frames or are defined by their motion.
As the field advances, transformer-based architectures, particularly MTTR [16]
and ReferFormer [17], have gained prominence for their ability to model both
visual features and language queries. These models utilize multi-modal atten-
tion mechanisms, allowing for stronger alignments between objects and their
linguistic references across time, thus improving segmentation accuracy.
One critical limitation in RVOS research is the lack of datasets that empha-
size motion as a key feature for object identification [1]. Traditional datasets
have predominantly focused on static visual features, making them less effec-
tive for testing models on dynamic content. The MeViS dataset [1] addresses
this gap by incorporating natural motion expressions, which require models to
track objects not only based on appearance but also on their movement patterns.
This dataset provides a more challenging benchmark, highlighting weaknesses in
current methodologies when faced with long sequences and motion-dependent
objects.
In response to these challenges, this paper focuses on enhancing an exist-
ing RVOS architecture to improve performance on dynamic datasets like MeViS
[1]. We integrate large language models such as Roberta to better process com-
plex natural language queries, ensuring that textual descriptions are accurately
linked to visual objects. We also employ the Swin Transformer [12] as the back-
bone of our model, capitalizing on its ability to capture both fine-grained and
global dependencies in video sequences. Additionally, we apply knowledge distil-
lation techniques [20] to reduce model size and complexity while retaining high
performance, making the model more suitable for real-world applications where
computational efficiency is critical.
The motivation for this work lies in the need to move beyond traditional,
appearance-based segmentation methods. By incorporating motion as a funda-
mental characteristic for object identification, RVOS allows for a more intuitive
and accurate selection of objects, particularly in dynamic environments where
appearance alone is insufficient. Our goal is to refine RVOS methodologies to
handle such complexities and improve segmentation precision.
The key contributions of this paper include: (1) integrating large language
models for improved query processing in RVOS, (2) utilizing the Swin Trans-
former for more effective handling of complex video sequences, and (3) optimiz-
ing the model through knowledge distillation to maintain high accuracy while
reducing computational costs. These contributions are rigorously evaluated on
the MeViS dataset [1], demonstrating the model’s enhanced ability to handle
motion-centric video content and providing insights into further advancements
in RVOS.

2 Related Work
Referring Video Object Segmentation (RVOS) is an evolving field [13, 14] in com-
puter vision that combines visual and linguistic information to segment objects
in videos based on natural language descriptions. Early approaches primarily
focused on leveraging static visual attributes such as color and shape to iden-
tify objects, often extending techniques from image segmentation to video by
using per-frame mask propagation and memory attention modules. Although
these methods achieved some success, they struggled with dynamic and motion-
centric content inherent in videos [1].
Traditional RVOS datasets, including DAVIS17-RVOS [15] and Refer-
YouTube-VOS [3], focused on salient objects with limited motion information.
These datasets often allowed the target object to be identified based on static
features, enabling strong performance using models initially designed for image
segmentation [1]. However, these benchmarks neglected the importance of tem-
poral dynamics, which are essential for real-world applications where motion,
rather than static appearance, defines the identity of the object [1].
To address these limitations, the MeViS dataset was developed to emphasize
motion expressions in RVOS tasks. MeViS is distinguished by its complexity: it
includes multiple objects of the same category within a video, longer sequences,
and focuses on dynamic attributes. This poses unique challenges for existing
RVOS methods that typically rely on static information [1]. The dataset is a
crucial contribution as it forces models to incorporate temporal context and
motion understanding into segmentation decisions, representing a significant leap
from previous datasets [1].
Recent advancements in RVOS architectures have shifted towards
transformer-based methods, which have proven effective in combining visual and
linguistic cues. Approaches such as MTTR [16] and ReferFormer [17] introduced
end-to-end frameworks that use transformers to model multi-modal interactions
between object queries and textual descriptions. These models rely on robust
multi-modal alignments and temporal-aware interactions to achieve strong per-
formance on existing benchmarks. However, when applied to motion-centric
datasets like MeViS, these models often struggle due to their reliance on static
visual cues and shorter video sequences used during training [1].
Innovative methods such as SOC [18], MUTR [19] have emerged to address
these challenges by focusing on temporal and motion-aware interactions. These
models attempt to unify object detection across frames by leveraging transform-
ers and multi-modal attention mechanisms to maintain coherent segmentation
throughout a video sequence. Despite this progress, achieving consistent per-
formance on long and complex videos remains a challenge, particularly when
motion plays a critical role in identifying objects.

Fig. 1. The overview architecture of the baseline approach Language-guided Motion


Perception and Matching (LMPM) [1].

3 Methods
3.1 Language-Guided Motion Perception and Matching

In this paper, we leverage the LMPM [1] framework as our main architec-
ture. Figure 1 illustrates the overall architecture of LMPM. This architecture
is inspired by VITA [9] and has several improvements:

– First, instead of using conventional object queries, LMPM employs a transformer-based language model to generate N1 language queries. These queries are better at identifying potential target objects across the T video frames by filtering out irrelevant objects.
– Second, it leverages a matching threshold σ: object trajectories are selected whenever their similarity with the language features exceeds σ, rather than choosing only the best match. This enables the model to handle both single-object and multi-object expressions, which is a unique feature of the MeViS dataset.

In Sects. 3.2 and 3.3, we discuss in more depth how the predecessor VITA [9] and the frame-level detector Mask2Former [6] work, in order to better understand this architecture.
The input goes through the frame-level detector Mask2Former [6] as
the first stage. The output from the multi-scale transformer decoder (from
Mask2Former), which also represents objects and provides instance-specific infor-
mation, reduces computational requirements [7, 8]. It puts the object embeddings
through layers of Transformer Encoder. Indeed, these layers perform motion per-
ception by inter-frame self-attention on the object embeddings to obtain a global
view across T frames. Motion perception enables object embeddings to capture
temporal contextual information that spans multiple frames or even the entire
video. We discuss these Encoders more in Sect. 3.3 [9]. Then, it uses N2 language
queries as the query and the object embeddings after Motion Perception as the

key and value for the Transformer Decoders. The Transformer Decoders decode
language-related information from all object embeddings and aggregate relevant
information to predict object trajectories. Finally, it matches the language fea-
tures with the predicted object trajectories to identify the target object(s) by
using a matching threshold σ, as we mentioned above.
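A minimal sketch of this final matching step is shown below. The use of cosine similarity and the tensor shapes are our assumptions for illustration, not the exact LMPM implementation.

```python
import torch
import torch.nn.functional as F


def select_trajectories(traj_embeds, lang_feat, sigma=0.8):
    """Select all object trajectories whose similarity to the language feature
    exceeds the threshold sigma, so multi-object expressions are supported.

    traj_embeds: (N2, C) trajectory embeddings from the Transformer decoder.
    lang_feat:   (C,) sentence-level language feature.
    """
    sims = F.cosine_similarity(traj_embeds, lang_feat.unsqueeze(0), dim=-1)  # (N2,)
    keep = sims > sigma
    return keep.nonzero(as_tuple=True)[0], sims
```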

3.2 Mask2Former as Frame-Level Detector


Mask2Former is utilized as a frame-level detector in the architecture, employing
masks to precisely identify instances without the need for bounding boxes, based
on the prediction mechanism of DETR [10]. The frame-level detector analyzes an
input image using object queries, known as frame queries, which are employed to
classify and segment objects. These predictions serve as auxiliary supervision for
VITA [9]. The frame-level detector outputs two key characteristics: (1) dynamic
convolution applied directly to the frame queries and (2) per-pixel embeddings
generated by the pixel decoder. These embeddings, categorized using a dot prod-
uct, are essential for accurate object categorization [5].
Mask2Former comprises three main components: (1) a backbone such as
ResNet [11] or Swin Transformer [12] for multi-scale feature extraction, (2)
a pixel decoder that upsamples the low-resolution features to create high-
resolution embeddings, and (3) a Transformer decoder that processes object
queries. The architecture leverages masked attention in the Transformer decoder,
which focuses on specific regions projected by mask predictions, improving both
performance and convergence speed. Multi-scale feature maps from the pixel
decoder are fed sequentially into the Transformer decoder to ensure the model
captures high-resolution details for accurate object detection and segmentation,
even for small or fine details [5].
Mask2Former incorporates optimization techniques to enhance perfor-
mance while reducing computational complexity. These include rearranging the
sequence of self-attention and masked attention to improve query features and
eliminating dropout layers from the Transformer decoder. The total loss func-
tion in Mask2Former is a weighted sum of classification loss and mask loss. The
classification loss is computed using cross-entropy, and the mask loss is a combi-
nation of binary cross-entropy and dice loss. The total loss can be formulated as:
L = λcls Lcls + λmask Lmask where Lmask is further broken down into binary cross-
entropy and dice losses. This design enables Mask2Former to generate accurate
mask predictions efficiently, optimizing segmentation performance by utilizing
a combination of multi-scale features, masked attention, and optimized Trans-
former decoder components [5].
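A simplified sketch of this weighted loss is given below. The lambda values are placeholders, and the helper follows the binary cross-entropy plus dice formulation described above rather than reproducing Mask2Former's exact code.

```python
import torch
import torch.nn.functional as F


def dice_loss(pred_logits, target, eps=1.0):
    """Soft dice loss computed on sigmoid mask logits; target is a float mask."""
    pred = pred_logits.sigmoid().flatten(1)
    target = target.flatten(1)
    inter = 2.0 * (pred * target).sum(-1)
    union = pred.sum(-1) + target.sum(-1)
    return (1.0 - (inter + eps) / (union + eps)).mean()


def mask2former_style_loss(cls_logits, cls_target, mask_logits, mask_target,
                           l_cls=2.0, l_bce=5.0, l_dice=5.0):
    """Weighted sum of classification and mask losses (placeholder weights)."""
    loss_cls = F.cross_entropy(cls_logits, cls_target)
    loss_mask = (l_bce * F.binary_cross_entropy_with_logits(mask_logits, mask_target)
                 + l_dice * dice_loss(mask_logits, mask_target))
    return l_cls * loss_cls + loss_mask
```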

3.3 VITA Module


The Video Instance Segmentation via Object Token Association (VITA) [9]
framework operates in a frame-independent manner, aiming to address the
unique challenges of video instance segmentation. VITA’s architecture is com-
posed of three primary stages: the Frame-level Detector, Object Encoder, and

Object Decoder. Each of these components plays a distinct role in ensuring the
system’s ability to capture object information from individual frames and aggre-
gate it over time to accurately track instances across an entire video sequence.
The Frame-level Detector processes each of the T input frames independently,
generating two key features that serve as the foundation for further processing within VITA. The first feature is the set of Frame Queries $\{f^t\}_{t=1}^{T} \in \mathbb{R}^{C \times T \times N_f}$, which encapsulate object-centric information distilled from each frame. The second feature is the set of Per-Pixel Embeddings $\{M^t\}_{t=1}^{T} \in \mathbb{R}^{C \times T \times \frac{H}{S} \times \frac{W}{S}}$, which are
produced by the pixel decoder of the Frame-level Detector. These embeddings
provide dense pixel-level representations, which will be used later in the model
for mask prediction.
The Object Encoder gathers the Frame Queries from all frames and trans-
forms them into object tokens through a linear layer. To handle long video
sequences efficiently, VITA employs a window-based self-attention mechanism
inspired by the Swin Transformer [12]. This mechanism partitions the object
tokens into local windows along the temporal axis, facilitating communication
between frames without the prohibitive computational cost of a naive self-
attention approach. By alternately shifting these windows across frames, VITA
ensures that object tokens from different frames can effectively exchange infor-
mation, allowing it to handle long sequences in a scalable manner.
To address the challenges posed by dynamic scenes and long videos, VITA
replaces traditional dense spatio-temporal features with object tokens in its
Object Decoder. The Object Decoder uses Nv trainable video queries to extract
object-wise information from all object tokens across the video. These video
queries are decoded into the final predictions, which include class probabilities
and mask logits. VITA’s class head acts as a linear classifier, predicting the
class probabilities of each instance, while the mask head dynamically generates
mask embeddings for each video query, corresponding to an object’s trajectory
across frames. This approach enables faster convergence and more accurate video
instance segmentation by effectively capturing video contexts and aggregating
relevant information into compact representations.
VITA introduces clip-wise losses to optimize the model’s predictions while
maintaining consistency across frames. Instance Matching in VITA is designed
to efficiently pair predictions with ground-truth annotations by extending the
Mask2Former cost function to the temporal axis, considering the video con-
text. Using the Hungarian algorithm, VITA determines optimal matching pairs
between the Nv video queries and the ground-truth annotations, eliminating the
need for post-processing techniques such as Non-Maximum Suppression (NMS).
This ensures that the model’s predictions are well-aligned with the ground-truth,
even across complex video sequences.
VITA further improves instance tracking through the use of Similarity Loss,
which helps preserve object identity across frames. Inspired by MaskTrack R-
CNN, the similarity loss encourages consistency between frame-level and video-
level queries by using binary cross-entropy to measure the similarity between
matching object queries. Queries representing the same object are assigned a

label of 1, while those representing different objects are labeled as 0. This loss
function helps cluster queries with the same identity closer together in the latent
space, improving the model’s ability to track objects throughout the video.
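A minimal sketch of such a similarity loss is shown below; the dot-product similarity and the tensor shapes are assumptions made for illustration rather than VITA's exact code.

```python
import torch
import torch.nn.functional as F


def similarity_loss(frame_queries, video_queries, same_identity):
    """BCE on pairwise query similarities: pairs representing the same object
    are labelled 1, all other pairs 0.

    frame_queries: (Nf, C), video_queries: (Nv, C), same_identity: (Nf, Nv) in {0, 1}.
    """
    sim = frame_queries @ video_queries.t()          # raw similarity scores
    return F.binary_cross_entropy_with_logits(sim, same_identity.float())
```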
The total loss used for training VITA is a weighted combination of several key
components. First, the frame-level loss is applied to per-frame outputs, following
the approach used in Mask2Former. Second, the video-level loss Lv is applied to
video-level outputs and is computed in a similar way to the frame-level loss but
extended across the temporal axis to account for video context. Finally, the sim-
ilarity loss Lsim is included to enhance identity consistency between frame and
video queries. The total loss is expressed as a weighted sum of these components:
Ltotal = λv Lv + λf Lf + λsim Lsim . This comprehensive loss formulation ensures
that VITA performs robustly across both frame and video levels, enabling accu-
rate and efficient video instance segmentation.

3.4 Distilling Our Trained Checkpoints


We modify the size of the Swin backbone in the LMPM model [1] and then train it on MeViS [1] and Refer-Youtube-VOS 2021 [3] to improve the results; as reported in a later section, this yields a significant gain on the metrics. After that, we employ this model as a teacher to generate soft labels and predicted masks for training a smaller version of it. The teacher model's output probabilities (soft targets) carry more information than hard ground-truth labels. We therefore compute the distillation loss using Kullback-Leibler (KL) divergence [4] on the logit outputs and Mean Squared Error (MSE) on the predicted mask outputs, since KL divergence helps the student model smoothly match the relative probabilities of the teacher's outputs. The loss is computed over outputs from several stages, including the outputs of the Mask2Former heads [5] and the auxiliary outputs of the transformer decoders at the model's tail.
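A minimal sketch of this distillation objective is given below. The temperature T and the loss weights are placeholders, since the exact values are not reported here.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits,
                      student_masks, teacher_masks,
                      T=2.0, w_kl=1.0, w_mask=1.0):
    """KL divergence on temperature-softened logits plus MSE on predicted masks."""
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)     # scale to keep gradient magnitude
    mask_mse = F.mse_loss(student_masks, teacher_masks)
    return w_kl * kl + w_mask * mask_mse
```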

4 Experiments
4.1 Datasets
In our study, we use two datasets: MeViS [1] and Refer-Youtube-VOS 2021 [3]. MeViS (Motion Expressions Video Segmentation) [1] is a new large-
scale dataset focusing on video object segmentation guided by motion expres-
sions. Unlike previous datasets that rely on static attributes, MeViS emphasizes
motion, with 2,006 videos and 28,570 motion-based expressions referring to 8,171
objects. The dataset is divided into training, validation, and testing subsets
and includes complex scenarios where multiple objects with similar appearances
require motion-based identification. Compared to Refer-Youtube-VOS, MeViS
features longer videos, more objects (4.28 per video), and exclusively motion-
centric expressions, making it more challenging and realistic. The dataset sup-
ports multi-object expressions and requires understanding temporal motion for
effective segmentation, offering a valuable resource for studying video under-
standing in dynamic environments.

4.2 Implementation Details

We adjusted the model parameters from the original paper [1] and experimented
with various backbones [2], including Tiny Swin Transformer, Base Swin Trans-
former, and Large Swin Transformer. Unlike the original study, which used
Roberta, we opted for Roberta-Large as the text encoder. We trained the modi-
fied models with either 100,000 or 150,000 iterations on the Refer-Youtube-VOS
2021 [3] and MeViS dataset [1], respectively, and then evaluated them on the
MeViS validation set. For each version of the Swin Transformer, we also modify
configuration parameters such as the dimensionality of the feature embeddings,
the number of transformer blocks at each stage, the number of attention heads
in the multi-head self-attention, the size of the local window for self-attention,
and the pretrain image size to suit the current backbone.
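For reference, the sketch below lists commonly published default configurations of the three Swin variants (the 224-pixel, window-7 versions; larger 384-pixel, window-12 variants also exist). The exact values used in our experiments are adapted to the current backbone as described above.

```python
# Commonly published Swin Transformer configurations (224-px pretrained, window 7).
# These are reference defaults, not necessarily the exact values used here.
SWIN_CONFIGS = {
    "swin-tiny":  dict(embed_dim=96,  depths=[2, 2, 6, 2],  num_heads=[3, 6, 12, 24],  window_size=7),
    "swin-base":  dict(embed_dim=128, depths=[2, 2, 18, 2], num_heads=[4, 8, 16, 32],  window_size=7),
    "swin-large": dict(embed_dim=192, depths=[2, 2, 18, 2], num_heads=[6, 12, 24, 48], window_size=7),
}
```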
The MeViS dataset poses significant challenges in detecting and understand-
ing object motions in both video and language contexts. Language expressions
describe motions that can span a random number of frames, requiring metic-
ulous frame-by-frame analysis to detect short-term actions and a comprehen-
sive understanding of extended motions. Current state-of-the-art methods, as
referenced in studies [1], typically rely on random frame sampling, risking omit-
ting crucial information. These methods also struggle with extracting tempo-
ral context, often defaulting to spatial-temporal feature extractors due to high
computational demands. Additionally, language expressions can describe vary-
ing numbers of objects, necessitating variable outputs from the model. Some
parameters remained unchanged, such as the number of layers in motion percep-
tion (six layers) and the Transformer decoder (three layers), as well as specific
hyperparameters (σ, N1 , and N2 ) set to 0.8, 20, and 10, respectively.

4.3 Experimental Results

Table 1. Benchmark Results for Various Backbone Architectures

Backbone Architecture Iterations trained in RVOS Iterations trained in MeViS J F J&F


Swin-Tiny x 100000 36.45 42.90 39.67
Swin-Base x 150000 39.49 48.07 43.78
Swin-Base 100000 200000 40.26 49.94 45.10
Swin-Base 100000 300000 39.73 50.66 45.20
Swin-Large x 150000 45.20 54.78 50.00
Swin-Large 100000 200000 45.79 52.11 48.95
Swin-Large 100000 300000 47.81 55.10 51.45
Swin-Base-distilled x x 48.11 58.27 53.19

In Table 1, we report our results on the validation-unit set in the MeViS dataset,
as we utilized the baseline method [1], adjusted the backbone architecture and

pre-trained language model (from RoBERTa to RoBERTa-Large). Compared with the results of the study that established the MeViS challenge [1], we obtain clear improvements. In the original paper [1], the authors achieved a J score of 34.2, an F score of 40.2, and a J&F score of 37.2. By changing these architectures, in particular switching to the Swin-Large backbone and training the baseline model on Refer-Youtube-VOS 2021 [3] for 100,000 iterations and on MeViS for 300,000 iterations, we obtain results that surpass the original challenge paper, with the J&F score improving significantly by 14.25%. We then apply knowledge distillation and obtain an even better result for the Swin-Base version, which achieves a J&F score of 53.19%, an improvement of 15.99% over the baseline result. We also found that all of our tested configurations outperformed the original work on the supplied validation set.
In our ablation study, we investigate how different backbone architectures affect performance on the RVOS task using the MeViS dataset. More precisely, we analyze the Swin Transformer model in three specific forms: Swin-Tiny, Swin-Base, and Swin-Large. The versions vary in size and capacity, with Swin-Tiny being the smallest and Swin-Large the largest. The study shows that larger models, such as Swin-Large, perform better than smaller ones. The enhanced performance can be attributed to the greater capacity of the larger models to capture intricate characteristics and spatial relationships within the video data, resulting in more precise object segmentation and tracking.
This setup also allows us to assess the impact of pre-training on the RVOS dataset and the subsequent fine-tuning on the MeViS dataset. The results show that models pre-trained on RVOS generally perform better when fine-tuned on MeViS than those that are not pre-trained. This suggests that initial training on a relevant dataset helps the model learn useful features, especially static attributes or attributes that do not occur in the training set but appear in the validation and test sets.
Modifying the large language model employed in our research also improved
the model’s performance. According to Table 1, the models were modified by
replacing the language model from Roberta to Roberta-large. In comparison to
the initial baseline established by the authors, using the identical Swin-Tiny
backbone, we attained a J score of 36.45%, an F score of 42.9%, and a combined
J and F score of 39.67%. Unlike the initial findings presented by the authors in
their work (34.2% for J, 40.2% for F, and 37.2% for J&F), our update resulted
in a 2.47% increase in the J&F score by solely modifying the language model.

5 Conclusion
In this paper, we leverage LMPM as our baseline method to address the Referring Video Object Segmentation problem posed by the MeViS challenge. We experiment with several Swin Transformer backbone architectures and obtain improved results in the ablation studies. We also apply knowledge distillation to obtain a lightweight model that nevertheless achieves better performance when evaluated on the validation set. Despite these successes, in future work we will focus on the failure cases to find better methods and further enhance our model.

Acknowledgements. This research is supported by research funding from Faculty of Information Technology, University of Science, Vietnam National University - Ho Chi Minh City.

References
1. Henghui, D., et al.: MeViS: a large-scale benchmark for video segmentation with
motion expressions. In: Proceedings of the IEEE/CVF International Conference
on Computer Vision (2023)
2. Liu, Z., et al.: Video Swin transformer. In: Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition (2022)
3. Seo, S., Lee, J.-Y., Han, B.: URVOS: unified referring video object segmentation
network with a large-scale benchmark. In: Computer Vision ECCV 2020: 16th
European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XV
16. Springer (2020)
4. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1),
79–86 (1951)
5. Cheng, B., et al.: Masked-attention mask transformer for universal image segmen-
tation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (2022)
6. Cheng, B., Choudhuri, A., Misra, I., Kirillov, A., Girdhar, R., Schwing, A.G.:
Mask2Former for video instance segmentation (2021)
7. Li, X., et al.: Transformer-based visual segmentation: a survey. IEEE Trans. Pat-
tern Anal. Mach. Intell. (2024)
8. Li, X., et al.: Tube-Link: a flexible cross tube framework for universal video segmen-
tation. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision (2023)
9. Heo, M., et al.: Vita: video instance segmentation via object token association.
Adv. Neural. Inf. Process. Syst. 35, 23109–23120 (2022)
10. Carion, N., et al.: End-to-end object detection with transformers. In: European
Conference on Computer Vision. Springer, Cham (2020)
11. He, K., et al.: Deep residual learning for image recognition. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (2016)
12. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted win-
dows. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision (2021)
13. Liu, S., et al.: Cross-modal progressive comprehension for referring segmentation.
IEEE Trans. Pattern Anal. Mach. Intell. 44(9), 4761–4775 (2021)
14. Hui, T., et al.: Collaborative spatial-temporal modeling for language-queried video
actor segmentation. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (2021)
15. Jordi, P.-T., et al.: The 2017 Davis challenge on video object segmentation. arXiv
preprint arXiv:1704.00675 (2017)

16. Botach, A., Zheltonozhskii, E., Baskin, C.: End-to-end referring video object seg-
mentation with multimodal transformers. In: Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition (2022)
17. Wu, J., et al.: Language as queries for referring video object segmentation. In: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (2022)
18. Luo, Z., et al.: Soc: semantic-assisted object cluster for referring video object seg-
mentation. Adv. Neural Inf. Process. Syst. 36 (2024)
19. Yan, S., et al.: Referred by multi-modality: a unified temporal transformer for
video object segmentation. In: Proceedings of the AAAI Conference on Artificial
Intelligence, vol. 38. no. 6 (2024)
20. Hinton, G.: Distilling the knowledge in a neural network. arXiv preprint
arXiv:1503.02531 (2015)
MythraGen: Two-Stage Retrieval
Augmented Art Generation Framework

Quang-Khai Le1,2, Cong-Long Nguyen1,2, Minh-Triet Tran1,2, and Trung-Nghia Le1,2(B)

1 University of Science, Ho Chi Minh City, Vietnam
{22120148,22120191}@student.hcmus.edu.vn, {tmtriet,ltnghia}@fit.hcmus.edu.vn
2 Vietnam National University, Ho Chi Minh City, Vietnam

Abstract. Text-to-image generation has seen rapid advancements, espe-


cially with the development of generative models. However, challenges
remain in achieving high-quality, contextually accurate image outputs
that faithfully match the provided textual descriptions, especially in
artistic generation. In this paper, we present a simple yet efficient
retrieval augmented generation framework, namely MythraGen, for text-
to-artistic image generation by integrating an art retrieval mechanism
with LoRA-based model fine-tuning. Our method extracts features from
a large-scale art dataset, optimizing the generation process by combin-
ing artist-specific styles and content. Particularly, retrieved images from
an external art database that have the highest similarity to the query
prompt are used to finetune Stable Diffusion using LoRA for desired art
generation. Experimental results and user studies on the WikiArt dataset
show that our proposed method can generate artworks that closely match
the user’s input, significantly outperforming existing solutions.

Keywords: Text-to-image generation · Art retrieval · Art generation

1 Introduction
In art analysis, content and style are two fundamental elements. Content
describes the concepts depicted in the image, such as objects, people, or loca-
tions. Style, on the other hand, describes the visual appearance of the artwork,
including its color, composition, and shape. Furthermore, each artist expresses
his/her own style, creating unique features in his/her works. Through the unique
combination of content and the artist’s individual style that makes each piece of
art special.
Recent advances in deep learning have facilitated powerful breakthroughs in
text-to-image generation (T2I) [1, 6, 14, 19, 22, 26]. T2I methods can incorporate
a specific style into generated images by taking a textual description of the style
as a prompt. However, conveying artistic style through text descriptions has

Fig. 1. Examples of images generated by our method against Stable Diffusion.

limitations. These descriptions are often less expressive and informative than
visual representations of style, so the style features of T2I outputs are often
rough and lack details.
Recent fine-tuning techniques such as Dream-booth [20], Textual Inversion
[8], and Low-Rank Adaptation (LoRA) [11] can enhance adaptability to specific
T2I generation tasks, and show convincing capability in creating images with
unique content and style. Among these methods, LoRA stands out and gains
extensive adoption among art designers and T2I enthusiasts, due to its advan-
tages of low cost and computational efficiency, making it user-friendly and suit-
able for consumer devices. However, an artist can paint in many different styles.
When extending this to hundreds of artists, the need to fine-tune or retrain the
model for each artist’s style becomes impractical. This process requires a vast
amount of computational resources and time, making these methods unrealistic
for large-scale application.
To address these issues, we propose MythraGen, a simple yet efficient retrieval
augmented art generation framework. First, we employ a retrieval technique to
search for paintings from an external database that have the highest similarity in
content, genre, and style to the artist described in the prompt. Our art retrieval
technique leverages BLIP-2 to encode both the visual features of each image and
its associated metadata, including captions, genre, style, and artist information.
These encoded features are combined into a comprehensive feature vector, which
is indexed using FAISS [7] to facilitate the retrieval of relevant images. Then,
we utilize LoRA algorithm [11] for finetuning Stable Diffusion on the identified
paintings. This two-stage framework allows for the flexible combination of dif-
ferent styles from various artists and content, optimizing image generation while
ensuring the quality of the created images.
In this paper, the WikiArt dataset [23], consisting of 80,000 unique images from more than 1,100 artists across 27 styles, was used as the external art database. We also leveraged a zero-shot classifier based on Visual Question Answering (VQA) to annotate genres for around 16,000 images with missing labels. Extensive experiments and a user study showcase the impressive performance of our MythraGen, which outperforms existing open-source and commercial image generation methods on all evaluation metrics.
Our contributions can be summarized as follows:

– We introduce a simple yet efficient retrieval augmented art generation framework called MythraGen to create artistic images with the desired content, genre, and style from the query prompt.
– We train LoRA models on various artistic styles and then mix them to create superior composite results.
– We employ Visual Question Answering (VQA) based zero-shot classification to automatically label the genre of 16,452 unclassified images in the WikiArt dataset.
– We conduct a comprehensive evaluation of our proposed framework in reproducing different artistic styles. Experimental results and a user study demonstrate the superiority of our proposed method over existing open-source and commercial image generation methods.

2 Related Work
2.1 Art Retrieval

Similarity search algorithms have played a crucial role in many artificial intelli-
gence applications, with K-Nearest Neighbors (KNN) and Approximate Nearest
Neighbors (ANN) being commonly used methods. KNN works by finding the
closest points in the dataset to a query point based on a specified distance metric,
making it useful for classification and regression tasks. ANN, on the other hand,
provides faster search results by approximating the nearest neighbors, making
it suitable for high-dimensional data where exact searches are computationally
expensive.
The Facebook AI Similarity Search (FAISS) library [7] is a power-
ful tool for efficient similarity search and clustering of high-dimensional data,
enabling developers to quickly find similar items in large datasets. It is particu-
larly useful for tasks such as image retrieval, recommendation systems, and nat-
ural language processing, where finding similar items in large datasets is crucial.
Therefore, this paper utilized FAISS due to its ability to quickly and accurately
search for similar embedding vectors in latent space. This is especially useful
when working with large datasets like WikiArt (around 80k images), allowing
us to rapidly gather relevant images to support image generation processes.
To ensure accurate retrieval, the alignment of image and text embeddings
is crucial for effective cooperation. Inspired by the vision-language pre-training
model BLIP-2 [12], which produces high-quality text-aligned visual representa-
tions, we adapted it to extract text-aligned subject representations, improving
the model’s ability to understand and generate content across modalities.
By integrating BLIP-2 [12] with FAISS, we harness the strengths of advanced
vision-language pre-training models and efficient similarity search algorithms.
This combination allows us to improve the accuracy of image-text retrieval,
providing a more precise and comprehensive dataset for further applications.

Fig. 2. Overview of the proposed MythraGen framework, with two main stages: (a)
the Art Retrieval retrieves images to enhance the image generation process and (b)
the Art Generation generates images based on the user’s input text combined with the
images provided by the Art Retrieval module.

2.2 Art Generation

Early text-to-image models [17, 24, 25, 27] made significant progress by utiliz-
ing Generative Adversarial Networks (GANs) [8] trained on large paired image-
caption datasets, which can lead to model collapse issues [2, 5, 10]. Recently, diffu-
sion models [9, 15, 18, 22] have become powerful in text-to-image (T2I) tasks due
to their ability to generate high-quality images and their flexibility in adapting
images to the context of the text. GLIDE [15] and Imagen [21] employ classifier-
free guidance by replacing the label in the class-conditioned diffusion model with
text descriptions of the images. Stable Diffusion [19] utilizes VQ-GAN for the
latent representation of images, enhancing photorealism through an adversar-
ial objective. DALL-E2 [16] employs a multimodal latent space where image
and text embeddings are aligned to generate images reflecting a deeper level of
language understanding. However, these models often struggle when handling
prompts containing less common genres or styles related to artists. Instead of
generating a suitable image, they tend to either create nonexistent genres or
styles or use similar but more popular ones, which does not align with the user’s
intent, leading to a mismatch between the original prompt and the final image.
Unifying both image retrieval and generation processes, Re-Imagen [4]
addressed augmenting rare entities to improve image quality. However, our goal
is to use retrieval to enhance the less common drawing styles of various artists,
while ensuring that when these styles are applied, the main content of the prompt
remains preserved. Additionally, Re-Imagen trains its image generation process
on the cascaded diffusion model [10], while our method uses LoRA [11] to fine-
tune the Stable Diffusion model [12] for reducing required resources and speed
up the training process.

Fig. 3. Pipeline of the Art Retrieval module.

3 Proposed Method
3.1 Overview

Figure 2 illustrates an overview of our proposed two-stage retrieval augmented


art generation framework. The input prompt is fed into the Art Retrieval mod-
ule to search for images that are similar to the input text content. In the Art
Generation module, we use the images obtained from the Art Retrieval module
to enrich the data during the image generation process, which is also controlled
by the text embedding vector. Finally, the Art Generation module produces a
target image that has the highest similarity to the user’s input text.

3.2 Art Retrieval

In this section, we present our approach for retrieval from an external database
to find images with the highest similarity in content, genre, and style to the
artist described in the input query. Figure 3 provides a visual representation
of the Art Retrieval module, built around the Bootstrapping Language Image
Pre-training (BLIP-2 [12]) architecture. First, each 256-dimensional vector from
the feature, genre, artist, and style databases, which are related to each other,
are concatenated into a single 1024-dimensional vector. This vector is then fed
to the FAISS system for indexing to support the retrieval process. When the
input query is processed by the Text Embedding Generator, generated by the
Q-Former component of BLIP-2 [12], the resulting vector is sent to the FAISS
system to return a set of images with the highest similarity in content, genre,
and style to the artist described in the input query.
The Image Embedding Generator consists of two main components: a frozen
pre-trained image encoder and a multimodal encoder called Q-Former (see
Fig. 4). The process begins with the input image, which is passed through the
image encoder to extract visual features. These features are then combined with
learnable query tokens and fed into the Q-Former. The output of the Q-Former
is then passed through a fully connected layer and normalized to produce the
final image feature vector. Similarly, the Text Embedding Generator tokenizes

Fig. 4. Multimodal representation, where image and text embeddings processes rely
primarily on the Q-former.

the input text and processes it through the Q-Former. The output from the Q-
Former is then passed through a fully connected layer and normalized to produce
the final text feature vector.
To improve the retrieval performance, we combine image embedding with
caption and genre embeddings. Let Ei with i ∈ {image, caption, genre} repre-
sents the image, caption, and genre embeddings. We compute the weighted sum
of the embeddings as follows:

V = Wi ∗ E i , (1)
i

where Wi denotes the corresponding weight of embedding Ei . The resulting


vector V becomes the final feature vector used for retrieval. By combining infor-
mation from images, captions, and genres, a stronger feature vector is created
for better retrieval performance for genre.
The feature vector may not perform well in retrieving style and artist infor-
mation because it mainly captures the content and genre of the artwork. More-
over, many art styles share significant similarities in technique, such as Action
Painting and Abstract Expressionism, or between movements like Cubism, Ana-
lytical Cubism and Synthetic Cubism or Impressionism and Post-Impressionism,
etc. This makes it difficult to distinguish between closely related styles or spe-
cific artists. To address this, we concatenate additional style embedding and
artist embedding to enrich the data for the vector before the indexing phase
(See Fig. 3). By incorporating these additional embeddings, the model can bet-
ter capture the nuances of style and artist characteristics, thereby improving
retrieval performance in these specific areas. Moreover, despite the increased
complexity from the added embeddings, our optimized retrieval system main-
tains an impressive speed, achieving a retrieval time of just 0.1 seconds per query.
This efficiency demonstrates the robustness of our approach, ensuring that even
with enriched data, the system remains highly responsive.
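A condensed sketch of the indexing and retrieval steps with FAISS is given below. The encoders that produce the per-artwork and query embeddings are omitted, the array shapes are illustrative, and the weights follow Eq. (1); the exact composition and dimensionality of the final vector follow the description above, so this is a sketch rather than the system's actual code.

```python
import numpy as np
import faiss


def build_index(img_emb, cap_emb, genre_emb, style_emb, artist_emb,
                w_img=1.0, w_cap=0.9, w_genre=0.75):
    """Fuse per-artwork embeddings as in Eq. (1), concatenate the style and
    artist embeddings, and index the result with FAISS (inner product on
    L2-normalised vectors approximates cosine similarity)."""
    fused = w_img * img_emb + w_cap * cap_emb + w_genre * genre_emb   # Eq. (1)
    vectors = np.concatenate([fused, style_emb, artist_emb], axis=1).astype("float32")
    faiss.normalize_L2(vectors)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index


def retrieve(index, query_vec, k=20):
    """Return the ids and scores of the k most similar artworks; the query
    must be embedded into the same space as the indexed vectors."""
    q = query_vec.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return ids[0], scores[0]
```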

3.3 Art Generation


We utilize the LoRA technique [11] to fine-tune two distinct LoRA models, each
of which is specifically trained on different types of input data sourced from the Art
Retrieval module (see Fig. 5). The fine-tuning process allows us to tailor each
LoRA model to accurately reflect particular artistic attributes, such as genre,
style, and artist characteristics. After fine-tuning, each LoRA model is assigned
its own unique weight, carefully calibrated to balance the influence of each model
on the final output. These weighted models are then seamlessly mixed together,
in conjunction with the given prompt, to generate the final target image. This
process allows to create highly nuanced and accurate representations of artistic
styles. For example, as shown in Fig. 5, by combining the Landscape genre with
the Impressionism style and integrating the influence of the artist Claude Monet,
the technique produces an image of a sunrise over snow-capped mountains with
a crystal-clear lake. The resulting image not only depicts the specified scene but
also strongly embodies Monet’s distinctive Impressionist style, showcasing the
effectiveness of our approach in capturing complex artistic nuances.

Fig. 5. LoRA combination in Art Generation module. The first LoRA model is fine-
tuned for the artist and style (Wstyle+artist ), while the second LoRA model is fine-
tuned for the genre and style (Wgenre+style ). Both are combined and applied to the
Stable Diffusion model to generate an image that faithfully reflects the input prompt,
incorporating the specific style, artist, and genre.
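Conceptually, blending two LoRA adapters amounts to adding both low-rank updates to the frozen base weights. The sketch below illustrates this for a single weight matrix; it is neither the authors' implementation nor a particular library API, and any alpha/rank scaling of the adapters is assumed to be folded into the mixing weights.

```python
import torch


def merge_two_loras(base_weight, lora_a, alpha_a, lora_b, alpha_b):
    """Blend two LoRA adapters into one frozen base weight matrix.

    Each adapter is a (down, up) pair of low-rank matrices; the blended
    update is  W' = W + alpha_a * up_a @ down_a + alpha_b * up_b @ down_b,
    where alpha_a / alpha_b play the roles of W_style+artist and
    W_genre+style in Fig. 5.
    """
    down_a, up_a = lora_a          # down: (r, in_features), up: (out_features, r)
    down_b, up_b = lora_b
    delta = alpha_a * (up_a @ down_a) + alpha_b * (up_b @ down_b)
    return base_weight + delta
```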

4 Experiments
4.1 Implementations
In our experiments, we leveraged the PyTorch deep learning framework on a
computer equipped with an NVIDIA RTX A4500 GPU with 24 GB. For the Art

Fig. 6. Visual Question Answering (VQA) model used to classify the genre of an image
when the genre is unknown.

Retrieval module, we used BLIP-2 [12] pre-trained with ViT-L/14 to embed the
image and its related metadata. The weights in Eq. 1 are set to Wimage = 1.0, Wcaption =
0.9, and Wgenre = 0.75.
Regarding the Art Generation module, we employed Stable Diffusion V1.5
[19] as the backbone. We utilized two LoRA models for our experiments: one
fine-tuned for genre + style and another fine-tuned for artist + style. For
both LoRA, we set max_train_steps to 4095, using AdamW8bit as the opti-
mizer to save memory, with xformers enabled for memory optimization and
mixed_precision = “bf16” for faster computation without compromising qual-
ity. The learning rates were set to 1 × 10−4 for the U-Net and 5 × 10−5 for
the text encoder. The LoRA network was configured with network_dim = 32
and network_alpha = 1, adjusting the rank and scaling of the LoRA layers for
efficient fine-tuning.

4.2 Dataset
We used WikiArt as the external database for all experiments in this paper.
WikiArt has complete labels for artists and styles, but over 16,452 images
are missing genre labels. Therefore, we employed a Visual Question Answering
(VQA) model (i.e., ShareGPT4V [3]) to label these 16,452 images according to 12
corresponding genres (genre painting, illustration, landscape, etc.) as described
in Fig. 6, where the model analyzes each image and assigns the appropriate genre
label based on visual content.
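A hypothetical sketch of this zero-shot labelling loop is shown below. The function vqa_answer stands in for a ShareGPT4V inference call (its real API differs), and only the genres explicitly named in the text are listed; the remaining ones would be appended in the same way.

```python
# Hypothetical sketch of VQA-based zero-shot genre labelling; `vqa_answer`
# is a placeholder for a ShareGPT4V call, not its actual API.
GENRES = ["genre painting", "illustration", "landscape"]  # ...12 genres in total


def label_genre(image, vqa_answer, genres=GENRES):
    question = ("Which of the following genres best describes this painting: "
                + ", ".join(genres) + "? Answer with one genre only.")
    answer = vqa_answer(image, question).lower()
    # Map the free-form answer back onto the closed genre list; fall back to
    # the first option if nothing matches.
    return next((g for g in genres if g in answer), genres[0])
```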
We performed retrieval on the WikiArt dataset using three separate cate-
gories: artist, styles, and genres, as well as a combined category that includes all
three.

4.3 Art Retrieval


Table 1 compares the performance of three BLIP-2 versions in retrieval tasks
across genres, artists, and styles. For these single-attribute tasks, all models achieve a perfect mAP@5, showing

Table 1. Results of different BLIP-2 versions on retrieval across characteristics.

Model Genres
mAP@5 mAP@15 mAP@25 mAP@40 mAP@50
BLIP2 using ViT-g/14 (EVA-CLIP) 100% 98.4% 98.4% 98.6% 98.6%
BLIP2 using ViT-L/14 (CLIP) 100% 99.8% 99.7% 99.6% 99.5%
BLIP2 finetuned on COCO [13] 100% 98.8% 98.3% 97.8% 97.7%
Model Artists, Styles
mAP@5 mAP@15 mAP@25 mAP@40 mAP@50
BLIP2 using ViT-g/14 (EVA-CLIP) 100% 100% 100% 100% 100%
BLIP2 using ViT-L/14 (CLIP) 100% 100% 100% 100% 100%
BLIP2 finetuned on COCO [13] 100% 100% 100% 100% 100%
Model Genres + Artists + Styles
mAP@5 mAP@15 mAP@25 mAP@40 mAP@50
BLIP2 using ViT-g/14 (EVA-CLIP) 90.5% 92.1% 92.4% 92.4% 92.6%
BLIP2 using ViT-L/14 (CLIP) 92.8% 94.8% 94.4% 93.8% 94.2%
BLIP2 finetuned on COCO [13] 93.4% 94% 93.7% 93.7% 93.7%

Table 2. Results of comparing the performance of a standalone image embedding with the combination of image, caption, and genre embeddings across different versions of BLIP-2.

Embedding vector BLIP2 using ViT-g/14 (EVA-CLIP)


mAP@5 mAP@15 mAP@25 mAP@40 mAP@50
Image embedding 82.7% 78.5% 77.0% 75.7% 75.2%
Image + Caption + Genre embedding 100% 98.4% 98.4% 98.6% 98.6%
Embedding vector BLIP2 using ViT-L/14 (CLIP)
mAP@5 mAP@15 mAP@25 mAP@40 mAP@50
Image embedding 89.0% 81.3% 78.6% 75.3% 74.1%
Image + Caption + Genre embedding 100% 99.8% 99.7% 99.6% 99.5%
Embedding vector BLIP2 finetuned on COCO [13]
mAP@5 mAP@15 mAP@25 mAP@40 mAP@50
Image embedding 84.6% 79.8% 77.0% 75.5% 74.9%
Image + Caption + Genre embedding 100% 98.8% 98.3% 97.8% 97.7%

high accuracy for the top 5 results. For genres, BLIP2 with ViT-L/14 (CLIP)
scores the highest mAP@50 at 99.5%, followed by ViT-g/14 (EVA-CLIP) at
98.6%, and the COCO-finetuned model at 97.7%. In artist-based and style-based
retrieval, all models reach a perfect mAP of 1.0 at all levels, showing equal skill in
retrieving artist and style information. However, when combining genre, artist,
and style, the COCO-finetuned model performs best at mAP@5 with 93.4%, but

ViT-L/14 (CLIP) performs better as the number of retrieved results increases, reaching the highest mAP@50 at 94.2%. Overall, ViT-L/14 (CLIP) performs best on the more complex combined task, while ViT-g/14 (EVA-CLIP) trails slightly.
Table 2 shows the comparison between using only image embedding and com-
bining it with caption and genre information across different versions of BLIP-
2. The results use metrics like mAP@5, mAP@15, mAP@25, mAP@40, and
mAP@50 to measure accuracy. Using the image embedding alone already gives reasonable results, such as 82.7% with BLIP-2 using ViT-g/14 (EVA-CLIP) and 89.0% with BLIP-2 using ViT-L/14 (CLIP) at mAP@5. However, when the caption and genre embeddings are added, the performance improves significantly, reaching over 98% in most cases.

Table 3. Comparison of methods in terms of CLIP-T, CLIP-I, and FID metrics.

Metrics Methods
MythraGen BingAI Midjourney SD
CLIP-T ↑ 30.68 27.77 30.13 29.61
CLIP-I ↑ 79.84 66.82 65.19 75.29
FID ↓ 322.9 373.85 329.22 325.79

4.4 Art Generation


Compared Methods. We compared our method with state-of-the-art (SOTA)
methods, including Stable Diffusion (SD) version 2.0 [19] and two well-known
commercial products: BingAI1 and Midjourney2 .

Dataset. The comparison was conducted on a customized dataset extracted from WikiArt. The dataset consists of 50 randomly selected (genre, artist, style)
combinations from WikiArt, along with 50 prompts based on scenes that we
generated. Each prompt was carefully crafted to evaluate the methods’ ability to
accurately generate images that adhere to both the style and content described.
This customized dataset allowed for a comprehensive comparison of the models’
performance in generating stylistically consistent and content-accurate images.

Experimental Results. Table 3 shows that our MythraGen outperforms the other methods across all three metrics. In terms of CLIP-T, which measures the textual similarity between the prompt and the generated image, MythraGen achieves the highest score at 30.68, surpassing Midjourney (30.13), SD (29.61), and BingAI (27.77). For CLIP-I, which measures the style similarity, MythraGen also leads with a score of 79.84, followed by SD (75.29), BingAI (66.82), and

1. https://www.bing.com/images/create/
2. https://www.midjourneyfree.ai/

Fig. 7. Qualitative comparison with SOTA methods based on style reference image.
Methods like BingAI, Midjourney, and SD 2.0 lack specific stylistic information drawn
from the original artist, leading to difficulties in balancing content and style. In contrast,
MythraGen performs better in generating both style and content as intended.

Midjourney (65.19). Finally, for FID, which measures the quality of the images,
MythraGen obtains the lowest score (indicating better performance) at 322.9,
compared to SD (325.79), Midjourney (329.22), and BingAI (373.85).
Figure 7 illustrates the visual results of these methods. While SD [19] strug-
gles to balance style and content due to poor representation of the text extracted
from the reference image, the content of the images generated by BingAI and Midjourney is relatively closer to the prompt, but their style differs from the
style reference. In contrast, our method produces images that are more faith-
ful to the style of the reference image, especially regarding brushstrokes, lines,
etc. This demonstrates that our MythraGen method achieves a better balance
between content similarity, style similarity, and generated quality according to
objective metrics.
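For reference, CLIP-T can be computed as the cosine similarity between CLIP text and image embeddings; the sketch below uses the Hugging Face CLIP implementation as a generic stand-in for the evaluation code (CLIP-I is computed analogously between the generated image and the style reference image), and the model checkpoint named here is an assumption.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_t(image_path: str, prompt: str) -> float:
    """Cosine similarity between CLIP text and image embeddings, scaled to 0-100."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return 100.0 * (img_emb * txt_emb).sum().item()
```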

4.5 Human Evaluation


Setup. We also conducted a human evaluation of MythraGen against the three SOTA methods from Sect. 4.4, using the same dataset. The user study focused on two key criteria:

– Faithfulness assesses the extent to which the generated image is true to both the entity's appearance and the text description. It considers factors such as object fidelity and text relevance, ensuring that the image accurately represents the provided caption.
– Naturalness evaluates the visual quality of the generated image. It encom-
passes aspects of style similarity to the original image, ensuring that the
image adheres faithfully to the style described in the prompt without notice-
able artifacts or discrepancies that detract from its overall aesthetic appeal.

Fig. 8. Humans evaluate the methods based on two criteria: Faithfulness and Natural-
ness.

The user study involved 32 participants (48.4% male), aged between 11 and 60 (most between 11 and 20). Each participant was asked to score the outputs from the different methods on a scale of 1 (worst) to 5 (best) based on the two criteria above, evaluating a total of 30 images per method.

Quantitative Results. Fig. 8 shows the quantitative results of the human eval-
uation, where MythraGen achieves the highest scores for both faithfulness and
naturalness. For faithfulness, MythraGen gets an average score of 3.09, clearly
improving over SD 2.0 (2.88), Midjourney (2.97), and BingAI (2.53). This result
shows that MythraGen is better at generating images that accurately match
the descriptions, both visually and textually. Similarly, for naturalness, Mythra-
Gen also outperforms the other methods with an average score of 3.28. Participants found MythraGen's images more visually appealing and better aligned with the styles described in the input prompt. These results demonstrate that MythraGen
not only excels in generating images that are faithful to their descriptions but
also produces outputs that accurately reflect the requested artistic style, proving
its effectiveness over SOTA methods in both accuracy and quality.
Our experimental results show that MythraGen is highly effective at generat-
ing images that faithfully represent both the text prompt and the desired artistic
style. By using a retrieval-augmented approach, MythraGen leverages existing
artwork to fine-tune the generation process, producing high-quality outputs that
closely match user expectations. Additionally, the use of LoRA for efficient fine-
tuning helps reduce computational costs while maintaining impressive perfor-
mance, making the model accessible even on less powerful hardware.

5 Conclusion

In this paper, we present MythraGen, which advances the field of text-to-artistic image generation by effectively combining retrieval-based
techniques with LoRA fine-tuning. Our work not only enhances the quality

and contextual accuracy of generated artworks but also enables the incorpo-
ration of diverse artistic styles that meet the user’s expectations. Experimental
results demonstrate that MythraGen outperforms existing methods in gener-
ating images that faithfully reflect text descriptions and are highly natural, as
evidenced by user studies. We further demonstrate that our model is particularly
effective at generating images from text that requires a greater diversity of artis-
tic genres and periods. We believe that our work can inspire further innovations
in the intersection of art and artificial intelligence, fostering deeper engagement
with both creators and audiences.

Acknowledgement. This research is supported by research funding from the Faculty of Information Technology, University of Science, Vietnam National University - Ho Chi Minh City.

References
1. Alaluf, Y., Richardson, E., Metzer, G., Cohen-Or, D.: A neural space-time repre-
sentation for text-to-image personalization. ACM Trans. Graph. 42(6), 243 (2023).
https://doi.org/10.1145/3618322
2. Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity
natural image synthesis. arXiv preprint (2018). https://api.semanticscholar.org/
CorpusID:52889459
3. Chen, L., et al.: ShareGPT4v: improving large multi-modal models with better
captions. arXiv preprint arXiv:2311.12793 (2023)
4. Chen, W., Hu, H., Saharia, C., Cohen, W.W.: Re-Imagen: retrieval-augmented
text-to-image generator. arXiv preprint (2022). https://api.semanticscholar.org/
CorpusID:252596087
5. Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. In:
NeurIPS, vol. 34, pp. 8780–8794 (2021). https://proceedings.neurips.cc/paper_
files/paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf
6. Ding, M., et al.: CogView: mastering text-to-image generation via transformers.
In: NeurIPS, pp. 19822–19835 (2021). https://proceedings.neurips.cc/paper_files/
paper/2021/file/a4d92e2cd541fca87e4620aba658316d-Paper.pdf
7. Douze, M., et al.: The Faiss library. arXiv preprint arXiv:2401.08281 (2024)
8. Gal, R., et al.: An image is worth one word: personalizing text-to-image generation
using textual inversion. arXiv preprint arXiv:2208.01618 (2022)
9. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. arXiv (2020).
https://api.semanticscholar.org/CorpusID:219955663
10. Ho, J., Saharia, C., Chan, W., Fleet, D.J., Norouzi, M., Salimans, T.: Cascaded
diffusion models for high fidelity image generation. JMLR 23(47), 1–33 (2022).
http://jmlr.org/papers/v23/21-0635.html
11. Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. In: ICLR
(2022). https://openreview.net/forum?id=nZeVKeeFYf9
12. Li, J., Li, D., Savarese, S., Hoi, S.C.H.: BLIP-2: bootstrapping language-image pre-
training with frozen image encoders and large language models. In: ICML (2023).
https://api.semanticscholar.org/CorpusID:256390509
13. Lin, T.Y., et al.: Microsoft COCO: common objects in context. arXiv preprint
arXiv:1405.0312 (2014)

14. Liu, D., Fan, H., Liu, J.: Expogenius: robust personalized human image generation
using diffusion model for exposure variation and pose transfer. In: ICMR, pp. 239–
247 (2024). https://doi.org/10.1145/3652583.3658071
15. Nichol, A., et al.: Glide: towards photorealistic image generation and editing with
text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2022)
16. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-
conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
(2022)
17. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative
adversarial text to image synthesis. In: ICML, Proceedings of Machine Learn-
ing Research, vol. 48, pp. 1060–1069 (2016). https://proceedings.mlr.press/v48/
reed16.html
18. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: CVPR, pp. 10674–10685 (2021).
https://api.semanticscholar.org/CorpusID:245335280
19. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: CVPR, pp. 10684–10695 (2022)
20. Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dream-
booth: fine tuning text-to-image diffusion models for subject-driven generation.
arXiv preprint arXiv:2208.12242 (2022)
21. Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language
understanding. NeurIPS 35, 36479–36494 (2022)
22. Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language
understanding. In: NeurIPS, p. 2643 (2024). https://doi.org/10.5555/3600270.
3602913
23. Tan, W.R., Chan, C.S., Aguirre, H., Tanaka, K.: Improved ArtGAN for conditional
synthesis of natural image and artwork. IEEE Trans. Image Process. 28(1), 394–
409 (2019). https://doi.org/10.1109/TIP.2018.2866698
24. Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional
generative adversarial networks. In: CVPR (2018)
25. Zhang, H., et al.: StackGAN++: realistic image synthesis with stacked generative
adversarial networks. TPAMI 41(8), 1947–1962 (2019). https://doi.org/10.1109/
TPAMI.2018.2856256
26. Zhou, Y., et al.: Towards language-free training for text-to-image generation. In:
CVPR, pp. 17886–17896 (2022). https://doi.org/10.1109/CVPR52688.2022.01738
27. Zhu, M., Pan, P., Chen, W., Yang, Y.: DM-GAN: dynamic memory generative
adversarial networks for text-to-image synthesis. In: CVPR, pp. 5795–5803 (2019).
https://api.semanticscholar.org/CorpusID:91183909
Towards Unsupervised Speaker Diarization
System for Multilingual Telephone Calls
Using Pre-trained Whisper Model
and Mixture of Sparse Autoencoders

Phat Lam1(B) , Lam Pham2(B) , Truong Nguyen1 , Dat Ngo3 , Thinh Pham4 ,
Tin Nguyen1 , Loi Khanh Nguyen1 , and Alexander Schindler2
1 Ho Chi Minh University of Technology, Ho Chi Minh City, Vietnam
{phat.lamhcmutddk21,truongnguyen,tin.nguyen112101bku,nkloi}@hcmut.edu.vn
2 Austrian Institute of Technology, Vienna, Austria
{lam.pham,alexander.schindler}@ait.ac.at
3 University of Essex, Colchester, UK
[email protected]
4 Ho Chi Minh City University of Science, Ho Chi Minh City, Vietnam
[email protected]

Abstract. Existing speaker diarization systems typically rely on large


amounts of manually annotated data, which is labor-intensive and diffi-
cult to obtain, especially in real-world scenarios. Additionally, language-
specific constraints in these systems significantly hinder their effective-
ness and scalability in multilingual settings. In this paper, we propose a
cluster-based speaker diarization system designed for multilingual tele-
phone call applications. Our proposed system supports multiple lan-
guages and eliminates the need for large-scale annotated data during
training by utilizing the multilingual Whisper model to extract speaker
embeddings. Furthermore, we introduce a network architecture called
Mixture of Sparse Autoencoders (Mix-SAE) for unsupervised speaker
clustering. Experimental results on the evaluation dataset derived from
two-speaker subsets of benchmark CALLHOME and CALLFRIEND
telephonic speech corpora demonstrate the superior performance of the
proposed Mix-SAE network over other autoencoder-based clustering meth-
ods. The overall performance of our proposed system also highlights the
promising potential for developing unsupervised, multilingual speaker
diarization systems within the context of limited annotated data. It also
indicates the system’s capability for integration into multi-task speech
analysis applications based on general-purpose models such as those that
combine speech-to-text, language detection, and speaker diarization.

Keywords: Unsupervised speaker diarization · Whisper · Mixture of


sparse autoencoders · Deep clustering · Telephone call


1 Introduction
Sound-based applications have drawn significant attention from the research
community and have become integral to driving innovation. These applications involve advanced audio processing techniques to analyze and interpret various types of sound data (e.g., acoustic scenes [27, 28], sound events [20], machinery sound [21], human speech [16, 26]), enabling core functionality in many intelligent systems. In human speech analysis, speaker diariza-
tion plays a crucial role by identifying and segmenting audio streams based on
speaker identity, making it essential for various applications such as communi-
cation (e.g., customer support calls), security (e.g., voice tracking), healthcare
(e.g., patient monitoring), smart home (e.g., personal assistants), etc. Typically,
a cluster-based speaker diarization system consists of five modules. The tradi-
tional approach to such a system is illustrated at the top of Fig. 1. The prepro-
cessing module first converts raw audio into a suitable format, followed by the
voice activity detection (VAD) module extracting speech segments. These seg-
ments are then divided into fixed-length speaker segments. The speaker embed-
ding extractor converts these speaker segments into vectors representing speaker
characteristics, and a clustering algorithm assigns speaker labels. Among these
modules, speaker embedding and clustering are crucial components to enhance
the performance of a cluster-based speaker diarization system [24].
Regarding the speaker embedding extractor, numerous approaches have been
proposed for speaker embedding extraction, including metric-based models (GLR
[8], BIC [37], etc.), probabilistic models (GMM-UBM [33], i-vectors [6], etc.),
and neural network-based models (d-vectors [38], x-vectors [35], etc.). All these
methods require a substantial amount of annotated data, especially for neu-
ral network-based approaches, to optimize speaker feature extractors. However,
training these extractors on one type of dataset could reduce the model’s ability
to generalize to diverse or unseen data, particularly from different domains. In
addition, datasets for speaker diarization mainly support one single language,
due to the labor-intensive and time-consuming nature of collecting data and
insufficient availability of data from diverse languages, limiting the effectiveness
of speaker diarization systems in multilingual speech analysis applications.
Concerning the clustering module, common methods such as Agglomera-
tive Hierarchical Clustering (AHC) [9], k-Means [40], Mean-shift [36] have been
proposed. However, these methods operate directly on the input vector space
and rely heavily on distance-based metrics, without leveraging representation
learning techniques to uncover deeper patterns. While some deep learning-based
frameworks, such as DNN [14], GAN [23], and Autoencoder [12], incorporate
representation learning for speaker embeddings, they often require pre-extracted
embeddings (e.g., x-vectors) that fit on certain datasets and are primarily eval-
uated on single-language datasets, typically English.
To address existing limitations, we aim to develop an unsupervised speaker
diarization system that does not rely on large-scale training datasets and sup-
ports multiple languages. For speaker embedding extraction, we use the mul-
tilingual Whisper model [30]. This foundation model was trained on diverse

Fig. 1. The high-level architectures of (A) the traditional cluster-based speaker diarization system and (B) our proposed unsupervised speaker diarization system

audio data for relevant tasks such as speech recognition, language identification,
and translation. The Whisper’s representations have been applied to several
downstream classification or detection tasks (e.g., speaker change detection [10],
dysarthric severity-level classification [32], vocal intensity categorization [11],
audio deepfake detection [26]), indicating that these representations can cap-
ture a wide range of speech features such as acoustic characteristics, speaker
attributes, vocal details, etc. [41]. However, its applicability to the speaker diarization task has not been widely explored. Thus, leveraging Whisper's scalability
and robustness, we explore its potential to produce high-quality speaker embed-
dings for diarization, assuming that as a general-purpose model, Whisper can
learn representations that incorporate different aspects of large training data
(e.g., phonetic content, speaker characteristics, acoustic features) that may be
useful for various downstream tasks, hypothetically including speaker diariza-
tion, despite being primarily designed for automatic speech recognition. For
speaker clustering, we propose an unsupervised deep clustering network called
Mixture of Sparse Autoencoders (Mix-SAE) to cluster the extracted embeddings.
Overall, our key contributions can be summarized as follows:
– We explored the Whisper model’s capability in the diarization task by using
it as an alternative to conventional speaker embedding extractors, eliminating
the need for annotated training data in developing diarization systems.
– Inspired by [5], we proposed the Mix-SAE network for speaker clustering,
which enhances both speaker representation learning and clustering by using
a mixture of sparse autoencoders with additional pseudo-label supervision.
– Through extensive experiments, we indicated that speaker diarization can
be effectively integrated into Whisper-based systems, enabling comprehen-
sive and multilingual speech analysis applications that combine speech-to-

Table 1. The Pre-trained Whisper Models

Version Parameters Embedding Dimension


Tiny 39M 384
Base 74M 512
Small 244M 768
Medium 769M 1024
Large 1550M 1280

text, language identification, and speaker diarization. A simple example of a Whisper-based speech analysis application can be found online1.
The remainder of this paper is organized as follows: The overall proposed
speaker diarization system is described in Sect. 2. Next, Sect. 3 comprehensively
describes our proposed deep clustering framework (Mix-SAE). Experimental set-
tings and results are discussed in Sect. 4. The conclusion is represented in Sect. 5.

2 The Overall Proposed System


Our proposed system pipeline is comprehensively described at the bottom of
Fig. 1. Generally, the system comprises three main blocks: Front-end preprocess-
ing (Preprocessing), Speaker embedding extraction (Speaker Embedding) and
Unsupervised clustering (Mix-SAE). The next subsections represent each block
of the overall pipeline in detail.

2.1 Front-End Preprocessing


Firstly, the input audio is divided into fixed-length segments of W seconds and re-sampled to 16 kHz using the Librosa toolbox [13]. To match the Whisper encoder's input requirements, zero-padding is applied to the segments. Next, voice activity detection (VAD) [29] is performed using an energy-based threshold to extract speech segments, which are then converted into spectrograms via the Short-Time Fourier Transform (STFT) with 400 filters, a 10-ms window size, and a 160-sample hop size. These spectrograms are used as inputs to the Whisper encoder for speaker embedding extraction.
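A minimal version of this front end, assuming the Librosa toolbox cited above and a simple mean-energy threshold in place of the referenced VAD [29], might look as follows; the threshold value and function names are illustrative.

```python
import numpy as np
import librosa

def segment_audio(path: str, seg_len_s: float = 0.2, sr: int = 16000,
                  energy_thresh: float = 1e-4):
    """Resample a call to 16 kHz, cut it into fixed-length segments, and keep
    only segments whose mean energy exceeds a simple threshold (a stand-in for
    the energy-based VAD used in the paper)."""
    audio, _ = librosa.load(path, sr=sr)                # load and resample to 16 kHz
    seg_len = int(seg_len_s * sr)
    n_segments = int(np.ceil(len(audio) / seg_len))
    audio = np.pad(audio, (0, n_segments * seg_len - len(audio)))  # zero-pad the tail
    segments, offsets = [], []
    for i in range(n_segments):
        seg = audio[i * seg_len:(i + 1) * seg_len]
        if np.mean(seg ** 2) > energy_thresh:           # crude energy-based VAD
            segments.append(seg)
            offsets.append(i * seg_len_s)               # segment start time in seconds
    return segments, offsets
```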

2.2 Speaker Embedding Extraction Using Whisper Model


In our work, we explore using the Whisper model as an alternative to conven-
tional speaker embedding extractors, leveraging its scalability and diverse train-
ing data. We aim to utilize Whisper’s robustness and generalization to capture
various speaker characteristics across languages and domains. This approach
1. https://huggingface.co/spaces/AT-VN-Research-Group/SpeakerDiarization

Fig. 2. Sparse Autoencoder Architecture (SAE)

allows us to obtain speaker embeddings directly from Whisper, bypassing the need for specific training datasets. For each speech segment, we generate the speaker embedding by feeding its spectrogram into the Whisper model. The final one-dimensional speaker embedding is derived by averaging the 2D tensor output from the last residual attention block of the Whisper encoder along the second axis, with its dimension varying by Whisper model version, as shown in Table 1.
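This extraction step can be prototyped with the open-source openai-whisper package, as sketched below: a segment is padded to Whisper's 30-s input, converted to a log-Mel spectrogram, passed through the encoder, and the encoder states are averaged over time. This is one plausible reading of the procedure described above, not the authors' exact code.

```python
import torch
import whisper  # openai-whisper package

model = whisper.load_model("tiny")          # 384-dimensional encoder states (Table 1)

def whisper_speaker_embedding(segment, sr: int = 16000) -> torch.Tensor:
    """Average the Whisper encoder output over time to get a 1-D segment embedding."""
    audio = whisper.pad_or_trim(torch.as_tensor(segment, dtype=torch.float32))
    mel = whisper.log_mel_spectrogram(audio).to(model.device)   # (n_mels, 3000)
    with torch.no_grad():
        enc = model.embed_audio(mel.unsqueeze(0))               # (1, n_frames, d_model)
    return enc.mean(dim=1).squeeze(0)                           # (d_model,)
```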

2.3 Unsupervised Clustering


Given the speaker embeddings extracted from the Whisper model, the unsuper-
vised clustering block groups together speech segments that are likely to be from
the same speaker. In this work, we propose a new unsupervised deep clustering
method for speaker embedding clustering called Mixture of Sparse Autoencoders
(Mix-SAE). The proposed network is inspired by Mixture of Experts (MoE)
architecture and is applied to Sparse Autoencoders (SAE), as detailed in Sect. 3.
After clustering, we assign speakers to each segment and generate the diarization
prediction by organizing the segments according to these assignments.

3 Mixture of Sparse Autoencoder Deep Clustering Network (Mix-SAE)

Our proposed Mix-SAE architecture, shown in Fig. 3, consists of two main parts: a set of k-sparse autoencoders, each representing a speaker cluster, and a gating projection that interprets the outputs produced by each autoencoder and assigns the input to a specific sparse autoencoder via its trainable weights.

3.1 Individual Sparse Autoencoder (SAE)


Consider one sparse autoencoder $A$, shown in Fig. 2. The sparse autoencoder $A$ has $2L + 1$ layers: an encoder $E$ with $L$ layers, a decoder $D$ with $L$ layers, and one latent layer. We denote $a_j^{(l)}$ as the activation of hidden unit $j$ at the $l$-th hidden layer and $z_j^{(i)}$ as the input of the $i$-th sample that feeds hidden unit $j$. The average activation of hidden unit $j$ at the $l$-th layer over one batch of $N$ samples is written as:

$$\hat{\rho}_j^{(l)} = g\!\left(\frac{1}{N}\sum_{i=1}^{N} a_j^{(l)}\big(z_j^{(i)}\big)\right) \qquad (1)$$

where the mapping $g(\cdot)$ is the sigmoid function, which scales the activation statistic to $[0, 1]$ and prevents $\hat{\rho}_j^{(l)}$ from becoming too large. The sparsity constraint keeps the average activation $\hat{\rho}_j^{(l)}$ close to the sparsity parameter $\rho$, which is quite small. This helps the model learn meaningful features while avoiding copying or memorizing the input, by enforcing a limited number of active neurons in each hidden layer. To achieve the approximation $\hat{\rho}_j \approx \rho$, we leverage the Kullback-Leibler divergence penalty term [19]. The KL penalty applied to the $l$-th hidden layer with $n^{(l)}$ hidden units can be written as:

$$L_{pen}^{(l)} = \sum_{j=1}^{n^{(l)}} \mathrm{KL}\big(\rho \,\|\, \hat{\rho}_j^{(l)}\big) = \sum_{j=1}^{n^{(l)}} \left[\rho \log \frac{\rho}{\hat{\rho}_j^{(l)}} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j^{(l)}}\right] \qquad (2)$$

The penalty term is then computed over all hidden layers of the autoencoder $A$ (except the latent layer) by summing the KL terms:

$$L_{pen} = \sum_{l=1}^{2L} \sum_{j=1}^{n^{(l)}} \left[\rho \log \frac{\rho}{\hat{\rho}_j^{(l)}} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j^{(l)}}\right] \qquad (3)$$

We also apply an MSE loss between the input data $x$ and its reconstruction over one batch of $N$ samples:

$$L_{MSE} = \frac{1}{2N} \sum_{i=1}^{N} \big\| x_i - D(E(x_i)) \big\|_2^2 \qquad (4)$$

Given the KL penalty and MSE losses, we define the final objective function for optimizing one individual sparse autoencoder $A$ as:

$$L_{SAE} = L_{MSE} + \beta L_{pen} \qquad (5)$$

where $\beta$ controls the effect of the sparsity constraint on the objective function.
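To make Eqs. (1)-(5) concrete, a possible PyTorch implementation of the sparsity-regularized objective for one batch is sketched below; how the hidden activations are collected from the network is an assumption of this sketch, not part of the paper.

```python
import torch

def kl_sparsity(rho_hat, rho=0.2, eps=1e-8):
    """KL(rho || rho_hat) summed over the hidden units of one layer (Eq. 2)."""
    rho_hat = rho_hat.clamp(eps, 1 - eps)
    return (rho * torch.log(rho / rho_hat)
            + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()

def sae_loss(x, x_rec, hidden_activations, rho=0.2, beta=0.01):
    """L_SAE = L_MSE + beta * L_pen (Eqs. 3-5).

    hidden_activations: list of (N, n_l) tensors, one per hidden layer of the
    encoder and decoder (the latent layer is excluded, as in the paper).
    """
    # Eq. 4: (1 / 2N) * sum_i ||x_i - D(E(x_i))||_2^2
    l_mse = 0.5 * (x_rec - x).pow(2).sum(dim=1).mean()
    l_pen = 0.0
    for h in hidden_activations:
        rho_hat = torch.sigmoid(h).mean(dim=0)          # Eq. 1: batch-average activation
        l_pen = l_pen + kl_sparsity(rho_hat, rho)       # Eq. 3: sum over layers
    return l_mse + beta * l_pen
```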

3.2 k-Sparse Autoencoders


Given the problem of clustering a set of $M$ points $\{x^{(i)}\}_{i=1}^{M} \subset \mathbb{R}^m$ into $k$ clusters, the classical k-Means algorithm uses a centroid to represent each cluster in the embedding space; the centroids are usually computed by averaging all points belonging to the cluster. Inspired by [5, 22], we use k-autoencoders

Fig. 3. The overall architecture of Mix-SAE clustering network

to represent $k$ clusters, with each autoencoder's latent space acting as a cluster centroid. In this paper, we use sparse autoencoders instead of standard ones, resulting in k-sparse autoencoders as shown in Fig. 3. This approach lets the data points of each cluster have their own autoencoder, making feature learning more efficient than using a single autoencoder for all data [5]. In our deep clustering network, all k-sparse autoencoders share the same settings and the loss function $L_{SAE}$ from Eq. 5.

Fig. 4. The Pre-training step of Mix-SAE clustering network

3.3 Gating Projection

The role of the Gating Projection ($G$) is to assign weights $\hat{p} = [\hat{p}_1, \hat{p}_2, ..., \hat{p}_k]$ to the outputs of the k-sparse autoencoders based on the input data. Given these weights, the Gating Projection is also utilized to assign

Table 2. Mix-SAE Deep Clustering Network

Algorithm 1: Mix-SAE mini-batch training strategy
Input: one batch of N points X = {x^(i)}_{i=1}^N ⊂ R^m.
Output: one of k cluster labels for each of the N input points.
Components:
- A set of k-sparse autoencoders {A_1, A_2, ..., A_k}: x → x̄_j = D_j(E_j(x)), j = 1, 2, ..., k.
- The gating projection G, which produces pseudo-labels and assigns inputs to suitable autoencoders: p = Softmax(Wx + b) ∈ R^k.

Pre-training:
1: Train a single autoencoder A_pre on the entire dataset with the objective function (5).
2: Use an off-the-shelf clustering algorithm to initialize pseudo-labels P^[0] for the entire dataset.
3: for j = 1 to k do
4:   Train the j-th sparse autoencoder on the data points with P^[0][c = j].
5: end for

Main-training:
6: for t = 1 to T do
7:   Train the k-sparse autoencoders and the gating projection G jointly with the objective function (7).
8:   if t mod τ = 0 then
9:     t_u ← t                                        /* save current epoch */
10:    P^[t_u] = argmax_{axis=1}[Softmax(WX + B)]      /* update pseudo-labels for the batch X */

Final cluster result: obtain the final cluster assignment for the batch X via the gating projection G:
P̂ = argmax_{axis=1}[Softmax(WX + B)]

labels for clusters during the inference phase. In this work, the Gating Projection leverages an MLP architecture with a single linear layer, followed by a Leaky ReLU activation and a final softmax layer. Given the input data $x$, the Gating Projection ($G$) produces weights $\hat{p} = [\hat{p}_1, \hat{p}_2, ..., \hat{p}_k]$ as:

$$\hat{p} = [\hat{p}_1, \hat{p}_2, ..., \hat{p}_k] = \mathrm{Softmax}(W x + b) \in \mathbb{R}^k \qquad (6)$$

where $W \in \mathbb{R}^{k \times m}$ and $b \in \mathbb{R}^k$ are the trainable weights and bias of the linear layer in the gating projection.
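A direct reading of Eq. (6), with the Leaky ReLU mentioned above applied before the softmax, gives a gating module of the following form; this is a sketch consistent with the description rather than the released implementation, and the embedding dimension used in the example is illustrative.

```python
import torch
import torch.nn as nn

class GatingProjection(nn.Module):
    """Single linear layer + Leaky ReLU + softmax over k experts (Eq. 6)."""

    def __init__(self, embed_dim: int, n_speakers: int):
        super().__init__()
        self.linear = nn.Linear(embed_dim, n_speakers)   # W in R^{k x m}, b in R^k
        self.act = nn.LeakyReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.act(self.linear(x)), dim=-1)   # p_hat, sums to 1

# Example: route 384-d Whisper-tiny embeddings between k = 2 speaker autoencoders.
gate = GatingProjection(embed_dim=384, n_speakers=2)
p_hat = gate(torch.randn(16, 384))            # (batch, k) soft assignments
clusters = p_hat.argmax(dim=-1)               # hard cluster labels, as in Eq. 10
```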

3.4 Training Strategy


The training strategy for our proposed Mix-SAE clustering network includes two steps: Pre-training and Main-training.
In the Pre-training step, shown in Fig. 4, we first train a single main sparse autoencoder $A_{pre}$ (upper part of Fig. 4) on the entire dataset using the loss function described in Eq. 5. After training $A_{pre}$, an off-the-shelf clustering algorithm, such as AHC or k-Means, is used to obtain initial pseudo-labels $P^{[0]}$ from the learned latent representation of $A_{pre}$. Next, we initialize the parameters of the k-sparse autoencoders by sequentially training the $j$-th sparse autoencoder $A_j$ on the subset of points with $P^{[0]}[c = j]$, as shown in the lower part of Fig. 4, where $c$ denotes the cluster index and $j = 1, 2, ..., k$. Notably, the training of the k-sparse autoencoders also uses Eq. 5 as the loss function.
The subsequent Main-training step is described in Fig. 3. It involves the joint optimization of the k-sparse autoencoders, initialized with the parameters obtained from the Pre-training step, and the predicted probabilities from the gating projection. Given the k-sparse autoencoders $\{A_1(\theta_1), A_2(\theta_2), ..., A_k(\theta_k)\}$, where $\theta_j$ denotes the parameters of the encoder $E_j$ and decoder $D_j$ of sparse autoencoder $A_j$, $j = 1, 2, ..., k$, and the parameters $(W, B)$ of the gating projection $G$, the main objective function of the proposed Mix-SAE network for one batch of $N$ samples $[x^{(1)}, x^{(2)}, ..., x^{(N)}]$ is defined as:

$$L_{main}(\theta_1, \theta_2, ..., \theta_k, W, B) = L_{rec} + \alpha L_{ent} \qquad (7)$$

where $\alpha$ is a parameter that balances the effect of the two terms on the main objective function.
objective function.
The term .Lrec is the weighted sum of reconstruction error over k-sparse
autoencoders. This term ensures that the sparse autoencoders could have infor-
mation on inter-cluster reconstruction error to further strengthen feature learn-
ing within their own clusters. We define this term as:
 2 
1  (i)
N k
1   (i)
Lrec = − p̂j exp − x − Dj (Ej (x(i) ))
N i=1 j=1 2
. (8)
k
 (i)
s.t. p̂j = 1, ∀i = 1, 2, . . . , N.
j=1

where .Dj (Ej (x(i) )) is the output of the .j-th sparse autoencoder given the input
(i)
sample .x(i) ; the probability .p̂j , which is computed from (.W , .B) in Eq. 6, is
the weight from the gating projection assigned to the .j-th reconstruction loss.
The term .Lent is referred to as the pseudo-label guided supervision loss.
We denote the pseudo-labels for one batch of .N samples at epoch .t as: .P[t] =
[t] [t] [t] [t]
[p1 , p2 , ..., pN ], where .pi ∈ Rk . The supervision loss is defined as the Cross-
Entropy loss between the pseudo-labels .P [tu ] previously updated at epoch .tu and
[t]
the prediction of the gating projection .P̂ at the current epoch .t:
N
1  [tu ] [t]
Lent = −
. p log p̂i (9)
N i=1 i
The entropy loss .Lent uses pseudo-labels to provide additional learning sig-
nals, simulating a semi-supervised setting [15]. This aims to guide the model
towards correct clustering and enhances feature learning. Notably, pseudo-labels
48 P. Lam et al.

are periodically updated after .τ epochs during optimization by the predictions


of the gating projection .G at the current epoch .t using Eq. 6. This process aims
to reinforce reliable pseudo-labels while correcting noisy ones over time.
After the Main-training step, final cluster label can be inferred via the gating
projection .G. Given each data sample .x, the probability vector .p̂ is calculated
using Eq. 6. Then, the cluster label is determined as:

ĉ = argmax p̂ = argmax p̂(c = j|x)


. (10)
j=1,2,..k

Overall, needed steps in the training strategy of our proposed Mix-SAE cluster-
ing network can be summarized in Table 2.
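Putting Eqs. (7)-(9) together, one joint loss computation could be sketched as follows; the expert and gate modules follow the definitions above, and the pseudo-label tensor is assumed to be the (soft or one-hot) matrix P^[t_u] updated every τ epochs. This is a sketch under those assumptions, not the released training code.

```python
import torch

def mix_sae_loss(x, experts, gate, pseudo_labels, alpha=1.0, eps=1e-8):
    """L_main = L_rec + alpha * L_ent for one batch (Eqs. 7-9).

    x             : (N, m) batch of speaker embeddings
    experts       : list of k autoencoders, each mapping x -> reconstruction
    gate          : module returning (N, k) soft assignments p_hat (Eq. 6)
    pseudo_labels : (N, k) pseudo-label distribution P^[t_u]
    """
    p_hat = gate(x)                                               # (N, k)
    # Eq. 8: weighted, exponentiated per-expert reconstruction errors.
    rec_terms = []
    for j, expert in enumerate(experts):
        err = 0.5 * (x - expert(x)).pow(2).sum(dim=1)             # ||x - D_j(E_j(x))||^2 / 2
        rec_terms.append(p_hat[:, j] * torch.exp(-err))
    l_rec = -torch.stack(rec_terms, dim=1).sum(dim=1).mean()
    # Eq. 9: cross-entropy between stored pseudo-labels and current gate predictions.
    l_ent = -(pseudo_labels * torch.log(p_hat + eps)).sum(dim=1).mean()
    return l_rec + alpha * l_ent
```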

4 Experimental Settings and Results


4.1 Evaluation Datasets

To evaluate the performance and generalization of our proposed system on diverse data sources, we gather data from the two benchmark corpora CALLHOME [1, 2, 4]
and CALLFRIEND [17, 18]. Each corpus includes various language subsets like
English, German, French, Spanish, and Japanese, with multiple telephone con-
versations from different sources. For evaluation, we use two-speaker subsets of
the above benchmark corpora (the most common case in telephone call appli-
cations), to form a combined dataset called SD-EVAL. The SD-EVAL dataset
comprises 127 recordings totaling around 6.35 h and is divided into four language-
specific subsets: English (EN), Spanish (SPA), German (GER), and French (FR).
Each subset has 25 to 35 recordings, each lasting 2 to 5 min.

4.2 Evaluation Metrics

We evaluated the proposed system using the diarization error rate (DER).
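For completeness, DER is conventionally defined as the fraction of total speech time that is falsely detected, missed, or attributed to the wrong speaker; the formula below states this standard definition and is not reproduced from the paper itself.

$$\mathrm{DER} = \frac{T_{\text{false alarm}} + T_{\text{missed speech}} + T_{\text{speaker confusion}}}{T_{\text{total speech}}}$$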

4.3 Experimental Settings

The proposed method was implemented with the PyTorch deep learning framework [25]. The network architecture consists of autoencoders with hidden layers [256, 128, 64, 32] for the encoder and a mirrored decoder, using a Leaky ReLU activation and Batch Normalization after each hidden layer. The latent vector size equals $k$ (the number of speakers), and the mini-batch size is $N = 16$. We use k-Means++ [3] to initialize pseudo-labels in the Pre-training step.
Regarding hyperparameters, we set the sparsity parameter $\rho = 0.2$, the sparsity constraint $\beta = 0.01$, and the pseudo-label supervision parameter $\alpha = 1$. The training process uses a learning rate of 0.001 and a weight decay of $5\times10^{-4}$. The Pre-training step involves 50 epochs for the main autoencoder $A_{pre}$ and 20 epochs for each of the k-sparse autoencoders. The Main-training step runs for 29 epochs and updates pseudo-labels every 10 epochs.
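Under these settings, each expert's encoder could be assembled as follows (the decoder mirrors it); this sketch only restates the listed layer sizes and activation choices and is not the exact training code, and the input dimension shown is illustrative.

```python
import torch.nn as nn

def make_encoder(input_dim: int, k: int, hidden=(256, 128, 64, 32)):
    """Encoder with hidden layers [256, 128, 64, 32], Batch Normalization and
    Leaky ReLU after each hidden layer, and a k-dimensional latent vector."""
    layers, prev = [], input_dim
    for h in hidden:
        layers += [nn.Linear(prev, h), nn.BatchNorm1d(h), nn.LeakyReLU()]
        prev = h
    layers.append(nn.Linear(prev, k))          # latent layer of size k (number of speakers)
    return nn.Sequential(*layers)

encoder = make_encoder(input_dim=384, k=2)     # e.g., Whisper-tiny embeddings, 2 speakers
```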

4.4 Results and Discussions


Speaker Clustering Methods: We evaluate several speaker clustering meth-
ods using embeddings from the tiny Whisper model, including k-Means, Agglom-
erative Hierarchical Clustering (AHC), SpectralNet, autoencoder-based meth-
ods such as DCN, DAMIC, k-DAE, and our proposed Mix-SAE. Experiments
were conducted with segment sizes (W) ranging from 0.2 s to 1.0 s. As shown in Table 3, Mix-SAE consistently outperforms the other methods, achieving its best performance in English with a DER of 26.51%. This can be attributed to the high-quality embeddings enabled by Whisper's extensive English training data. While other methods, especially autoencoder-based ones like DCN and k-DAE, show variability with segment size, Mix-SAE remains stable across different W values, demonstrating its efficiency in capturing speaker features from variable-length segments (e.g., the proposed system achieves DER scores of 26.51%, 26.88%, 27.08%, 27.24%, and 26.85% on English and 35.00%, 35.64%, 35.55%, 34.17%, and 34.55% on German with W = 0.2, 0.4, 0.6, 0.8, and 1.0 s, respectively).

Table 3. Diarization Error Rate (DER) (%) of different systems on SD-EVAL dataset
(Whisper version: Tiny, no tolerance)

Methods W = 0.2 s W = 0.4 s W = 0.6 s W = 0.8 s W = 1.0 s

EN FR GER SPA EN FR GER SPA EN FR GER SPA EN FR GER SPA EN FR GER SPA
k-Means 44.77 51.42 49.11 48.25 43.75 51.92 43.84 47.08 38.72 46.88 40.97 42.77 40.23 46.61 44.11 44.38 42.06 47.72 46.13 44.66
AHC 38.42 46.72 41.41 42.93 47.64 52.81 46.33 50.69 40.50 48.69 42.90 43.15 38.55 45.91 43.02 43.44 42.91 47.81 47.63 44.80
SpectralNet [34] 36.18 44.62 40.02 46.03 40.44 51.63 41.22 47.52 37.06 44.68 41.29 42.69 36.11 44.67 44.16 46.42 41.88 46.08 44.31 47.23
DCN [39] 32.15 35.77 36.51 36.98 37.42 38.92 42.17 43.01 32.08 37.57 38.84 40.77 33.02 43.72 44.23 40.55 40.17 45.96 40.21 38.51
DAMIC [5] 27.78 36.22 36.93 35.21 27.97 35.96 36.14 35.11 28.11 36.67 34.66 33.31 27.22 36.91 34.78 34.22 26.95 36.91 36.11 34.65
k-DAE [22] 29.12 37.91 41.23 37.00 30.53 39.81 37.10 37.29 32.72 38.84 34.96 35.23 33.33 38.55 34.24 35.51 30.36 37.32 36.22 35.02
Mix-SAE-V1 32.18 38.61 36.07 36.78 29.02 35.92 36.51 35.04 27.28 37.01 34.98 34.03 27.90 37.51 34.42 33.83 28.00 37.88 36.18 34.29
Mix-SAE-V2 28.72 43.22 40.66 36.32 29.62 40.07 36.71 35.72 27.81 36.83 34.90 33.54 27.98 39.68 34.62 33.21 27.93 38.05 36.73 33.82
Mix-SAE 26.51 36.12 35.00 34.91 26.88 37.30 35.64 34.33 27.08 36.70 34.55 32.82 27.24 38.39 34.17 32.03 26.85 37.57 35.33 33.82

For an ablation study, we evaluate two additional variants: Mix-SAE-V1 (Mix-SAE without the sparsity penalty loss in Eq. 5) and Mix-SAE-V2 (Mix-SAE without the pseudo-label loss in Eq. 7). The results in Table 3 demonstrate the role of both the sparsity loss and the pseudo-label loss in improving overall performance. For instance, on English with W = 0.2 s, Mix-SAE improves the DER by 5.67% and 2.21% over Mix-SAE-V1 and Mix-SAE-V2, respectively.
The Quality of Speaker Embeddings: We assess the impact of speaker embeddings on diarization performance, as shown in Fig. 5a, using different versions of the Whisper model (Tiny, Base, Small, Medium, Large) with W set to 0.2 s on English. Larger Whisper models provide superior embeddings and thus better performance, with the best DER score of 17.75% (0.25 s tolerance). This highlights the potential of general-purpose models like Whisper for multilingual, unsupervised speaker diarization, as well as of integrating speaker diarization as a component in Whisper-based speech analysis applications.
The Model Complexity: Fig. 5b shows the trade-off between model complex-
ity and diarization performance (DER) across deep clustering methods. Our

Fig. 5. Evaluation: (a) DER scores using speaker embeddings from different Whisper versions; (b) DER score versus model complexity across deep clustering methods

Fig. 6. t-SNE visualization of speaker embeddings after the pre-training step (Whisper
version: Tiny).

Mix-SAE achieves 26.51% DER with 334k parameters, striking a good balance
between accuracy and efficiency. Additionally, when combined with Whisper
Tiny (39M), the system is promising for integration into edge devices for sound
applications [7, 31].
Visualization and the Effect of the Pre-training Step: We visualize the two-speaker embeddings after the Pre-training step of our Mix-SAE by applying t-SNE. As Fig. 6 shows, the sparse autoencoders effectively learn the underlying patterns of the extracted speaker embeddings and map them into a latent space where the embeddings of the two speakers are relatively well separated. These clustering results serve as pseudo-labels for optimizing the deep clustering network in the subsequent Main-training step.

5 Conclusion
This paper has presented an unsupervised speaker diarization system for mul-
tilingual telephone call applications. In this proposed system, the traditional
feature extractor was replaced with the Whisper encoder, benefiting from its
robustness and generalization on diverse data. Additionally, the Mix-SAE net-
work architecture was also proposed for speaker clustering. Experimental results
demonstrate that our Mix-SAE network outperforms the other compared clustering methods. The overall performance of our system highlights the effectiveness of exploring Whisper embeddings for the diarization task to develop unsupervised speaker diarization systems in contexts with limited annotated training data. Furthermore, the results underline the system's potential for integration into Whisper-based multi-task speech analysis applications. Overall,
this work indicates a promising direction toward developing generalized speaker
diarization systems based on general-purpose models in future work.

Acknowledgments. The work described in this paper is performed in the H2020 project STARLIGHT ("Sustainable Autonomy and Resilience for LEAs using AI against High Priority Threats"). This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 101021797.

References
1. Alexandra, C., Graff, D., Zipperlen, G.: CABank Spanish CallHome Corpus (1996).
https://doi.org/10.21415/T51K54
2. Alexandra, C., Graff, D., Zipperlen, G.: CABank English CallHome Corpus (1997).
https://doi.org/10.21415/T5KP54
3. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In:
Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algo-
rithms, pp. 1027–1035. SODA ’07, Society for Industrial and Applied Mathematics,
USA (2007)
4. Canavan, A., Graff, D., Zipperlen, G.: CABank German CallHome Corpus (1997).
https://doi.org/10.21415/T56P4B
5. Chazan, S.E., Gannot, S., Goldberger, J.: Deep clustering based on a mixture of
autoencoders. In: 29th International Workshop on Machine Learning for Signal
Processing (MLSP), pp. 1–6 (2019). https://doi.org/10.1109/MLSP.2019.8918720
6. Dehak, N., et al.: Front-end factor analysis for speaker verification. IEEE Trans.
Audio Speech Lang. Process. 19(4), 788–798 (2011). https://doi.org/10.1109/
TASL.2010.2064307

7. Froiz-Míguez, I., et al.: Design, implementation, and practical evaluation of a voice


recognition based IoT home automation system for low-resource languages and
resource-constrained edge IoT devices: a system for Galician and mobile oppor-
tunistic scenarios. IEEE Access 11, 63623–63649 (2023)
8. Gangadharaiah, R., et al.: A novel method for two-speaker segmentation.
In: Proceedings of the INTERSPEECH (2004). https://api.semanticscholar.org/
CorpusID:12436529
9. Han, K.J., Narayanan, S.S.: A robust stopping criterion for agglomerative hierar-
chical clustering in a speaker diarization system. In: Proceedings of the INTER-
SPEECH (2007). https://api.semanticscholar.org/CorpusID:17876640
10. Kawa, P., Plata, M., Czuba, M., Szymański, P., Syga, P.: Improved deepfake detec-
tion using whisper features. arXiv preprint arXiv:2306.01428 (2023)
11. Kodali, M., Kadiri, S., Alku, P.: Classification of vocal intensity category from
speech using the wav2vec2 and whisper embeddings. In: Interspeech, pp. 4134–
4138. International Speech Communication Association (ISCA) (2023)
12. Li, Y., Wang, W., Liu, M., Jiang, Z., He, Q.: Speaker clustering by co-optimizing
deep representation learning and cluster estimation. IEEE Trans. Multimedia 23,
3377–3387 (2020)
13. McFee, B., et al.: librosa: audio and music signal analysis in Python. In: SciPy, pp.
18–24 (2015)
14. Milner, R., Hain, T.: DNN-based speaker clustering for speaker diarisation.
In: Proceedings of the INTERSPEECH (2016). https://api.semanticscholar.org/
CorpusID:26152646
15. Min, Z., Ge, Q., Tai, C.: Why the pseudo label based semi-supervised learning
algorithm is effective? (2023)
16. Mofrad, M.H., et al.: Speech recognition and voice separation for the internet
of things. In: Proceedings of the 8th International Conference on the Internet of
Things, pp. 1–8 (2018)
17. Mondada, L., Granadillo, T.: CABank Spanish CallFriend Corpus. https://doi.
org/10.21415/T5ZC76
18. Mondada, L., et al.: CABank French CallFriend Corpus. https://doi.org/10.21415/
T5T59N
19. Ng, A., et al.: Sparse autoencoder. CS294A Lecture Notes 72(2011), 1–19 (2011)
20. Nguyen, T.N.T., et al.: A general network architecture for sound event localiza-
tion and detection using transfer learning and recurrent neural network. In: IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 935–939 (2021). https://doi.org/10.1109/ICASSP39728.2021.9414602
21. Nguyen, T., Pham, L., Lam, P., Ngo, D., Tang, H., Schindler, A.: The impact of
frequency bands on acoustic anomaly detection of machines using deep learning
based model. arXiv preprint arXiv:2403.00379 (2024)
22. Opochinsky, Y., et al.: K-autoencoders deep clustering. In: IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4037–4041
(2020). https://doi.org/10.1109/ICASSP40776.2020.9053109
23. Pal, M., et al.: Speaker diarization using latent space clustering in generative adver-
sarial network. In: IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pp. 6504–6508 (2020)
24. Park, T.J., Kanda, N., Dimitriadis, D., Han, K.J., Watanabe, S., Narayanan, S.: A
review of speaker diarization: recent advances with deep learning. Comput. Speech
Lang. 72, 101317 (2022)

25. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learn-
ing library. In: Advances in Neural Information Processing Systems, vol. 32, pp.
8024–8035. Curran Associates, Inc. (2019). http://papers.neurips.cc/paper/9015-
pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
26. Pham, L., Lam, P., Nguyen, T., Nguyen, H., Schindler, A.: Deepfake audio detec-
tion using spectrogram-based feature and ensemble of deep learning models. In:
2024 IEEE 5th International Symposium on the Internet of Sounds (IS2), pp. 1–5
(2024). https://doi.org/10.1109/IS262782.2024.10704095
27. Pham, L., et al.: Lightweight deep neural networks for acoustic scene classification
and an effective visualization for presenting sound scene contexts. Appl. Acoust.
211, 109489 (2023)
28. Pham, L., Nguyen, T., Lam, P., Ngo, D., Jalali, A., Schindler, A.: Light-weight
deep learning models for acoustic scene classification using teacher-student scheme
and multiple spectrograms. In: 4th International Symposium on the Internet of
Sounds, pp. 1–8 (2023). https://doi.org/10.1109/IEEECONF59510.2023.10335258
29. Quatra, M.L., et al.: Vad - simple voice activity detection in python. https://
github.com/MorenoLaQuatra/vad
30. Radford, A., et al.: Robust speech recognition via large-scale weak supervision. In:
International Conference on Machine Learning, pp. 28492–28518 (2023)
31. Ramírez, A., Foster, M.E.: A whisper ROS wrapper to enable automatic speech
recognition in embedded systems (2023)
32. Rathod, S., Charola, M., Patil, H.A.: Noise robust whisper features for dysarthric
severity-level classification. In: International Conference on Pattern Recognition
and Machine Intelligence, pp. 708–715. Springer (2023)
33. Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker verification using adapted
Gaussian mixture models. Digital Sig. Process. 10(1), 19–41 (2000). https://doi.
org/10.1006/dspr.1999.0361
34. Shaham, U., Stanton, K., Li, H., Nadler, B., Basri, R., Kluger, Y.: SpectralNet:
spectral clustering using deep neural networks (2018)
35. Snyder, D., et al.: X-vectors: robust DNN embeddings for speaker recognition.
In: IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pp. 5329–5333 (2018). https://doi.org/10.1109/ICASSP.2018.8461375
36. Stafylakis, T., Katsouros, V., Carayannis, G.: Speaker Clustering via the mean shift
algorithm. In: Proceedings of the Speaker and Language Recognition Workshop
(Speaker Odyssey), pp. 186 – 193. ISCA, Brno, Czech Republic (2010)
37. Tritschler, A., Gopinath, R.A.: Improved speaker segmentation and segments
clustering using the Bayesian information criterion. In: EUROSPEECH (1999).
https://api.semanticscholar.org/CorpusID:15220583
38. Wan, L., Wang, Q., Papir, A., Moreno, I.L.: Generalized end-to-end loss for speaker
verification (2020)
39. Yang, B., Fu, X., Sidiropoulos, N.D., Hong, M.: Towards k-means-friendly spaces:
simultaneous deep learning and clustering (2017)
40. Zhang, A., Wang, Q., Zhu, Z., Paisley, J., Wang, C.: Fully supervised speaker
diarization (2019)
41. Zhang, L., Jiang, N., Wang, Q., Li, Y., Lu, Q., Xie, L.: Whisper-SV: adapting
whisper for low-data-resource speaker verification. Speech Commun. 163, 103103
(2024). https://doi.org/10.1016/j.specom.2024.103103
Hybrid Compression: Integrating Pruning
and Quantization for Optimized
Neural Networks

Minh-Loi Nguyen1,2, Long-Bao Nguyen1,2, Van-Hieu Huynh1,2, and Trung-Nghia Le1,2(B)
1 University of Science, Ho Chi Minh City, Vietnam
{22120189,22120025,22120105}@student.hcmus.edu.vn, [email protected]
2 Vietnam National University, Ho Chi Minh City, Vietnam

Abstract. Deep neural networks have witnessed remarkable advance-


ments in recent years and have become integral to various applications.
However, alongside these developments, training and deploying neural network models on embedded and edge devices face significant challenges due to limited memory and computational resources. These prob-
lems can be addressed with deep neural network compression, which
involves a trade-off between model size and performance. In this paper,
we propose a novel method for model compression through two phases.
First, we utilize model compression techniques, such as pruning and
quantization, to significantly reduce the model size. Then, we use Mixture
of Experts to route the previously compressed models to enhance perfor-
mance while maintaining a balance in inference efficiency. MoEs consist
of multiple expert models (i.e., compressed models) that are moderately
sized and deliver stable performance. Experimental results on several
benchmark datasets show that our method successfully compresses CNN models, achieving substantial reductions in FLOPs and parameters with a negligible accuracy drop.

Keywords: deep neural network · deep compression · pruning ·


quantization

1 Introduction
Over the past decade, the explosion of data and computational power has driven
significant advancements in deep neural networks (DNNs) [13]. As a result, many
new network architectures have emerged, each more complex and demanding
more resources than its predecessors [22]. For instance, the first Convolutional
Neural Network (CNN) model was proposed in 1998 with fewer than 1 mil-
lion parameters [19], while OpenAI’s GPT-3 model in 2020 comprised up to
175 billion parameters, requiring hundreds of gigabytes of memory for storage
M.-L. Nguyen, L.-B. Nguyen and V.-H. Huynh—Contributed equally to this research.

Fig. 1. Comparison of CNN model performance on CIFAR-10 dataset. Our proposed


method for model pruning and quantization achieves significant reduction in FLOPs,
total parameters, and inference time, while maintaining accuracy with a minimal drop compared to the original models.

and thousands of teraflops for training. The rapid development of these large-
scale models has introduced significant challenges and limitations. Deploying
DNN models in real-world scenarios, such as mobile applications and Internet
of Things (IoT) devices, often becomes impractical due to constrained memory
and computing resources [22].
To address these challenges, the field of model compression has gained con-
siderable attention [3]. Model compression techniques aim to reduce the size
and computational requirements of DNNs without significantly compromising
their performance. Among these techniques, deep compression has emerged as
a robust approach, with methods such as pruning, quantization, and the Mix-
ture of Experts (MoE) achieving substantial reductions in model size and com-
putational cost [15]. Pruning methods [10] are designed to remove less impor-
tant connections in neural network layers based on various evaluation criteria.
Instead of using high-precision floating-point numbers, quantization methods
[15, 24] reduce the precision of parameters by representing them with fewer bits.
MoE [7, 16] dynamically selects a subset of parameters (or experts) for each
input, optimizing resource use by activating only the network parts relevant to
each task, enabling efficient scaling.
In this paper, we introduce a novel multi-stage method to develop a cost-
efficient CNN-based model. Our approach focuses on optimizing both the com-
plexity and the computational efficiency of the model through a series of tar-
geted stages. In the first stage, we employ well-established hard compression
techniques such as pruning and quantization to significantly reduce the model’s
complexity, including the number of parameters and the overall inference cost,
making it more feasible for deployment in resource-constrained environments.
In addition, we leverage the Neural Network Intelligence (NNI) [23] framework

to implement and automate our pruning and quantization techniques. The second stage applies the MoE paradigm to enhance the model's adaptability and efficiency by routing each input to a previously compressed model that specializes in it. This specialization helps recover the performance and stability of the compressed models, which may drop due to pruning and quantization, while still leveraging their low resource consumption and computational cost. Experimental results on the CIFAR-10 [18] and BloodMNIST [5] datasets show that our method achieves a 10x-11x reduction in FLOPs and a 10.5x reduction in parameters, with a negligible accuracy drop on the image classification task (see Fig. 1).
In summary, our contributions are as follows:
– We introduce a novel method that combines pruning, quantization, and
the Mixture of Experts (MoE) paradigm, demonstrating how this fusion
brings superior effectiveness and provides detailed insights into the trade-offs
between model size, computational efficiency, and accuracy.
– We investigate our method on different CNN models, providing practical
insights for implementing compression model techniques.

2 Related Work
Pruning is a widely used technique for compressing neural networks by remov-
ing redundant weights and connections. These methods identify unimportant
elements in the model, such as weights and neural connections, and prune them
by setting their values to zero, ensuring they do not participate in the back-
propagation process. Hassibi et al. [12] introduced an early pruning method that
uses the inverse Hessian matrix to identify and remove redundant weights, while
updating the remaining ones with second-order information. More recently, var-
ious pruning techniques have emerged, including magnitude-based weight prun-
ing [10], which gradually eliminates small magnitude weights to achieve network
sparsity. In CNN models, pruning is typically categorized into two approaches:
weight pruning [6], which removes individual redundant weights, and filter prun-
ing [20], which eliminates entire convolutional filters with minimal impact on
performance.
Quantization is a popular technique for compressing neural networks by low-
ering parameter precision, reducing memory usage and computational costs.
Binarized neural networks Quantization method [4] trains networks with binary
weights and activations but still accumulates gradients in 32-bit precision, high-
lighting the need for high precision during training. DoReFa-Net Quantizer [28]
reduces gradient precision by quantizing them into low-bitwidth floating-point
numbers. Quantization methods generally fall into two categories: Quantization-
Aware Training (QAT) [15], where models are retrained with quantized weights
and activations, and Post-Training Quantization (PTQ) [24], which quantizes a
pretrained model without retraining. This paper focuses on the QAT approach
for CNN model quantization.
Fig. 2. Overview of our proposed three-phase deep compression method, including pruning, quantization, and the Mixture of Experts (MoE) paradigm.

As network architectures evolve, combining them for improved performance is gaining traction, with the Mixture of Experts (MoE) method [25] being a promi-
nent approach. Unlike traditional ensemble methods like bagging or boosting,
where all models contribute equally, MoE uses a gating network to select specific
experts for different problem areas, enhancing performance with fewer compu-
tational resources [17, 25]. Classic MoE models [16] consist of multiple expert
models and a gating network, while Eigen et al. [7] extended this by incorporat-
ing MoE subcomponents with their own gating.
In natural language processing, Shazeer et al. [25] introduced Sparse Mixture
of Experts (SMoE), replacing dense layers in Transformers with MoE layers. For
example, Mixtral 8. × 7B [17], an SMoE-based model, performs comparably to
Llama 2 70B and GPT-3.5. In computer vision, particularly CNNs, MoE has also
shown success, as seen in DeepMoE [27], where MoE layers replace traditional
convolutional layers, with a multi-headed gating network optimizing channel
selection.

3 Methodology
3.1 Overview
We present an in-depth exploration of the methodologies employed in our research, which aim to compress CNN models for deployment on resource-constrained hardware, such as mobile and edge devices. As seen in Fig. 2, our proposed method consists of three phases: pruning, quantization, and integrating the MoE

paradigm; each phase is designed to progressively compress the CNN, enhance efficiency, and preserve the robustness of the model.

– Pruning Phase: By applying iterative pruning algorithms, such as Automated Gradual Pruning (AGP) [29], we systematically remove less important connections. The outcome is a model with reduced computational requirements, mitigating the risk of layer collapse and preserving essential network structures.
– Quantization Phase: Following pruning, we apply the QAT technique [15] to further reduce both the model's size and its computational resource requirements. This involves representing the model's weights with lower precision, typically using fewer bits.
– Mixture of Experts (MoE) Integration: Finally, we introduce the MoE paradigm to enhance the performance and efficiency of the compressed model. The MoE approach dynamically selects a subset of experts for each input, optimizing resource utilization and maintaining high performance.

3.2 Pruning

Network pruning, crucial for reducing model size and bandwidth needs, removes
unnecessary neural connections [21]. We use Magnitude Pruning [11] to eliminate
redundancies effectively.

Automatic Gradual Pruning (AGP): Large neural networks often contain redundant parameters, as shown in prior work [10, 21]. Frankle et al. [8] proposed
that large networks have a subnetwork capable of matching the original’s per-
formance, while Arora et al. [2] demonstrated that over-parameterized models
retain generalization when compressed. Thus, we apply sparsity levels between
50% and 70%, but high sparsity can lead to layer collapse, where entire layers
are removed under one-shot pruning [26].
To prevent this, we adopt the Automatic Gradual Pruning (AGP) algorithm
[29], an iterative framework that removes redundant connections gradually with a
cubic function, adjusting sparsity at each step. At each training step, magnitude
pruning [11] globally prunes CNN connections, avoiding model collapse while
allowing recovery through concurrent training.

$s_t = s_f + (s_i - s_f)\left(1 - \frac{t - t_0}{n\Delta t}\right)^3, \quad (1)$

where $n$ is the number of pruning steps, $s_i$ is the initial sparsity level, $s_f$ is the final sparsity level, $s_t$ is the sparsity level at pruning step $t$, which is updated every $\Delta t$ steps, and $t$ is the current pruning step with $t \in \{t_0, t_0 + \Delta t, \ldots, t_0 + n\Delta t\}$.
The AGP’s formula indicates that the sparsity level increases gradually dur-
ing the initial stages, leading to fewer redundant connections being removed.

This gradual increase allows the model to adapt and learn from the pruned infor-
mation, maintaining a balance in the importance scores across layers [26]. This
balance prevents layer-collapse, as the importance scores among layers remain
equivalent. In the final stage, AGP pruning imposes a high sparsity to achieve
the configured target level.
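For illustration, the cubic schedule of Eq. (1) can be computed with a few lines of Python; the concrete values below (10 pruning steps spaced 100 iterations apart, target sparsity 0.7) are hypothetical and only show the shape of the ramp, not the paper's exact configuration.

```python
def agp_sparsity(t: int, t0: int, n: int, dt: int, s_i: float, s_f: float) -> float:
    """Target sparsity at pruning step t following the cubic AGP schedule of Eq. (1)."""
    if t <= t0:
        return s_i
    if t >= t0 + n * dt:
        return s_f
    progress = (t - t0) / (n * dt)
    return s_f + (s_i - s_f) * (1.0 - progress) ** 3

# Hypothetical schedule: sparsity ramps from 0 to 0.7 over 10 steps, updated every 100 iterations.
schedule = [agp_sparsity(t, t0=0, n=10, dt=100, s_i=0.0, s_f=0.7) for t in range(0, 1001, 100)]
```

The schedule rises quickly at first and flattens toward the end, which matches the behavior described above: most connections are removed early, while the final steps only nudge the model toward the configured target sparsity.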
We also consider that pruning can potentially disrupt the neural network’s
structure, resulting in a substantial decrease in accuracy. This challenge can
be addressed by retraining the model, which incurs additional costs [10, 21]. In
our pipeline, retraining is performed after the model undergoes the Speedup
technique, which optimizes the model for faster execution. Additionally, during
model construction, we observed the weight distribution of the classifier layer,
responsible for generating the logit vector for classification output. Therefore,
in our configuration, we set a target sparsity level for the entire CNN backbone
and a lower sparsity level for the classifier layer.

Model Speedup: In the pruning process, a binary mask layer is used to rep-
resent retained connections, assigning a value of 1 to kept connections and 0
to reduced ones. Consequently, during the forward and backward passes of the
model, this binary mask matrix is multiplied with the corresponding weights.
As a result, masking alone does not significantly improve model inference or training speed.
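A minimal PyTorch sketch of the mask-based magnitude pruning described above; the helper name and the 50% sparsity value are illustrative and do not reproduce the paper's NNI-based implementation.

```python
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask that zeroes out the fraction `sparsity` of smallest-magnitude weights."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest magnitude
    return (weight.abs() > threshold).to(weight.dtype)

# The mask is multiplied with the weights in every forward/backward pass, so pruned
# connections stay at zero while tensor shapes (and hence FLOPs) remain unchanged.
w = torch.randn(64, 3, 3, 3)            # a hypothetical convolution weight
mask = magnitude_mask(w, sparsity=0.5)
w_masked = w * mask
```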
To mitigate this limitation, we utilize the Model Speedup method [1], which
involves removing the feature maps that were previously pruned in the CNN
layer and retaining the weights to preserve the layer’s output. As a result, the
model achieves a smaller weight set than the original. This optimization can lead
to a latency reduction by a factor of 2 compared to the original model, albeit
with a slight trade-off in accuracy.

3.3 Quantization
Quantization reduces model size and speeds up inference by converting weights
or activations from high-precision floating points to lower bit-widths, like 8-bit
integers, with minimal accuracy loss [15, 24].
In this paper, we implement QAT [15] to maintain high accuracy post-
quantization. QAT [15] simulates the effects of quantization during training,
allowing the model to learn and adjust to the reduced precision, which mini-
mizes the accuracy degradation typically observed in post-training quantization
[10, 24].

Quantization Process: The quantization process involves mapping floating-point values to discrete integer values. This is achieved through two main steps: scaling and rounding. The transformation of a value $x$ in floating-point format to an 8-bit integer is given by:

$\tilde{x} = \left\lfloor \frac{x - \min(x)}{\Delta} \right\rceil, \quad (2)$

where $\min(x)$ is the minimum value in the range of $x$, and $\Delta$ is the quantization step size defined as:

$\Delta = \frac{\max(x) - \min(x)}{2^n - 1}, \quad (3)$
where $n$ is the bit-width (e.g., 8 for 8-bit quantization) [15]. The quantized value $\tilde{x}$ is then converted back to a floating-point format during inference using $x_q = \tilde{x} \cdot \Delta + \min(x)$, ensuring the model operates within the quantized value range during execution.
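A small PyTorch sketch of Eqs. (2)-(3) together with the dequantization step; the tensor shape and the 8-bit default are assumptions used only for illustration.

```python
import torch

def quantize(x: torch.Tensor, n_bits: int = 8):
    """Uniform affine quantization following Eqs. (2)-(3)."""
    delta = (x.max() - x.min()) / (2 ** n_bits - 1)   # quantization step size, Eq. (3)
    x_tilde = torch.round((x - x.min()) / delta)       # scale and round, Eq. (2)
    return x_tilde, delta, x.min()

def dequantize(x_tilde: torch.Tensor, delta: torch.Tensor, x_min: torch.Tensor) -> torch.Tensor:
    """Map quantized values back to floating point: x_q = x_tilde * delta + min(x)."""
    return x_tilde * delta + x_min

x = torch.randn(1024)
x_tilde, delta, x_min = quantize(x)
x_rec = dequantize(x_tilde, delta, x_min)   # per-element reconstruction error is at most delta / 2
```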

Fake Quantization in QAT. In QAT, we introduce fake quantization operations during training to mimic the effects of quantization on weights and activations. This operation is expressed as:

$\tilde{x}_{\mathrm{fake}} = \Delta \cdot \left\lfloor \frac{x}{\Delta} \right\rceil, \quad (4)$

where $\left\lfloor \cdot \right\rceil$ denotes the rounding operation to the nearest integer [9, 10]. This transformation ensures that the model learns to adapt to the quantized weights and activations during training.

Straight-Through Estimator (STE): To facilitate backpropagation through quantized nodes, we use the Straight-Through Estimator (STE) [10, 14]. The STE approximates the gradient of the quantization function with respect to the input as:

$\frac{\partial \tilde{x}_{\mathrm{fake}}}{\partial x} \approx \begin{cases} 1 & \text{if } |x| \le 1, \\ 0 & \text{otherwise}. \end{cases} \quad (5)$
This approximation allows gradients to flow through the quantization oper-
ation, enabling effective training of the quantized model.
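The fake-quantization step of Eq. (4) combined with the STE of Eq. (5) is commonly realized with the "detach trick" shown below; this is a generic sketch rather than the paper's NNI-based implementation, and the clipping condition of Eq. (5) is omitted for brevity.

```python
import torch

def fake_quantize(x: torch.Tensor, delta: float) -> torch.Tensor:
    """Fake quantization (Eq. 4) with a straight-through estimator (Eq. 5):
    the forward pass uses the rounded value, while the backward pass treats
    rounding as the identity, so gradients flow through unchanged."""
    x_q = delta * torch.round(x / delta)
    return x + (x_q - x).detach()   # value equals x_q; gradient w.r.t. x equals 1
```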

Scale Factor and Zero Point Initialization: To ensure the quantization parameters are suitable for the data distribution, we initialize the scale factor $\Delta$ and zero point $z$ using calibration data. The scale factor is computed as:

$\Delta = \frac{\max(x_{\mathrm{calib}}) - \min(x_{\mathrm{calib}})}{2^n - 1}. \quad (6)$

The zero point is calculated to align the quantized range with the original range: $z = -\left\lfloor \frac{\min(x_{\mathrm{calib}})}{\Delta} \right\rceil$.

Loss Function: Our implemented loss function is as follows:

$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{task}} + \lambda \cdot \mathcal{L}_{\mathrm{quant}}, \quad (7)$

where $\mathcal{L}_{\mathrm{task}}$ is the task-specific loss (e.g., cross-entropy for classification), $\mathcal{L}_{\mathrm{quant}}$ is the quantization loss, and $\lambda$ is a balancing hyperparameter [10, 14].

3.4 Retraining
Retraining is an essential step in the model compression pipeline to recover
the accuracy lost during pruning and quantization. After applying pruning and
quantization, the model may suffer from reduced performance due to significant
structural and precision changes. Retraining enables the model to regain per-
formance by re-optimizing its weights within the new compressed architecture.
This improves accuracy by allowing the model to adapt to altered parameters
and mitigate errors introduced during quantization [10, 21].

3.5 Mixture of Experts (MoE)


The MoE framework leverages a set of neural network blocks, known as experts $E_1, E_2, \ldots, E_n$, along with a gating network $G$ that determines which experts should be activated for a given input (see Fig. 2). The gating network outputs an n-dimensional distribution vector, which directs the input to specific experts based on their relevance to the task at hand. While traditional MoE implementations typically use feed-forward networks as experts, our approach integrates pre-compressed models into the MoE architecture. This design choice allows us to leverage the computational efficiency of compressed models while benefiting from the diverse expertise each model brings, thus enhancing overall performance.
In this paper, we selected a variety of models to undergo compression and
be integrated into the MoE framework. This strategy not only reduces the com-
putational overhead associated with using full-sized models but also provides
a rich diversity of learned representations that can be exploited by the gating
network. However, a significant challenge in this approach is the potential imbal-
ance among experts, where certain experts may dominate the inference process,
leading to what is known as expert collapse. This phenomenon can diminish the
diversity of the activated experts, ultimately affecting the model’s performance
and generalization capability.
To address this issue, we implemented the Top-k Noisy Gating algorithm [25]
as our routing strategy. This algorithm introduces tunable Gaussian noise to the
logits before applying the softmax function, ensuring that the gating decisions are
not deterministic but rather probabilistic, allowing for a more balanced selection
of experts. By selecting the top-k experts based on their noisy scores and setting
the rest to $-\infty$, we promote a more balanced activation of experts, reducing the
risk of expert collapse and ensuring that a diverse set of experts contribute to
the inference process:

$G(x) = \mathrm{Softmax}(\mathrm{TopK}(H(x), k)), \quad (8)$

$H(x)_i = (x \cdot W_g)_i + \epsilon \cdot \mathrm{Softplus}\left((x \cdot W_{\mathrm{noise}})_i\right), \quad (9)$

$\mathrm{TopK}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v, \\ -\infty & \text{otherwise}, \end{cases} \quad (10)$

where $\epsilon$ denotes standard normal random noise, while $W_g$ and $W_{\mathrm{noise}}$ represent the weights for the gating mechanism and noise, respectively.
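A compact PyTorch sketch of Eqs. (8)-(10); the input dimension, number of experts, k = 2, and zero initialization are placeholders, and the expert modules themselves are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Sketch of Top-k Noisy Gating (Eqs. 8-10); only the gating network is shown."""

    def __init__(self, in_dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.w_gate = nn.Parameter(torch.zeros(in_dim, num_experts))
        self.w_noise = nn.Parameter(torch.zeros(in_dim, num_experts))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        clean = x @ self.w_gate                            # (x · W_g)
        noise_std = F.softplus(x @ self.w_noise)           # Softplus((x · W_noise))
        h = clean + torch.randn_like(clean) * noise_std    # Eq. (9)
        topk_vals, topk_idx = h.topk(self.k, dim=-1)
        masked = torch.full_like(h, float("-inf"))
        masked.scatter_(-1, topk_idx, topk_vals)           # Eq. (10)
        return F.softmax(masked, dim=-1)                   # Eq. (8): sparse gate weights

gate = NoisyTopKGate(in_dim=512, num_experts=4, k=2)
weights = gate(torch.randn(8, 512))   # per-sample weights over 4 experts, 2 nonzero each
```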

Table 1. Experimental results on CNN models.

Method | Total Parameters | FLOPs ↓ (CIFAR10 / BloodMNIST) | Inference speed in s (CIFAR10 / BloodMNIST) | Accuracy in % (CIFAR10 / BloodMNIST)
VGG16 (Original) | 21.15M | – / – | 0.162 / 0.0055 | 89.8 / 94.5
VGG16 (Ours) | 3.3M | x10.71 / x10.76 | 0.041 / 0.0045 | 88.6 / 90.0
Resnet18 (Original) | 11.2M | – / – | 0.042 / 0.007 | 88.8 / 96.1
Resnet18 (Ours) | 1.01M | x10.6 / x11.0 | 0.018 / 0.0059 | 85.8 / 95.6
InceptionV3 (Original) | 35.37M | – / – | 0.068 / 0.045 | 86.6 / 93.9
InceptionV3 (Ours) | 3.23M | x10.86 / x10.81 | 0.037 / 0.032 | 90.3 / 90.7
Densenet121 (Original) | 6.87M | – / – | 0.052 / 0.042 | 87.6 / 92.5
Densenet121 (Ours) | 0.648M | x10.51 / x10.56 | 0.037 / 0.036 | 88.6 / 95.8
MoE (Ours) | – | – | – | 92.1 / 96.9

4 Experimental Results
4.1 Implementation Details

We conducted our experiments using PyTorch to redefine common CNN backbones. For each CNN model, we sequentially perform the steps of defining, pre-
training, pruning, quantization, and finetuning the compressed model. Addi-
tionally, we leverage the Neural Network Intelligence (NNI) [23] framework to
implement and automate our pruning and quantization techniques.

4.2 Datasets

We utilized a diverse set of datasets to validate the robustness and generalizability of our proposed model compression techniques across different domains. CIFAR-10 [18] consists of 60,000 color images with resolution 32 × 32 in 10 different classes, with 6,000 images per class. BloodMNIST [5] includes 17,092 images of normal cells captured using the CellaVision DM96 analyzer at the Hospital Clinic of Barcelona and grouped into 8 categories.

4.3 Results

Table 1 shows our extensive experiments on the CIFAR-10 and BloodMNIST datasets. The experimental results indicate that our proposed compression method yields significant FLOPs and inference time reductions across multiple models while maintaining high performance.
Our compressed and finetuned models generally perform well compared to the original models. The accuracy of VGG16 and Resnet18 drops only slightly, by 1.2% to 4.5% and by 0.5% to 3.0%, respectively, indicating that our compression method does not significantly harm the model's ability to make accurate predictions. In fact, in some cases such as InceptionV3 and Densenet121, the accuracy even improves, by 1.8% to 3.7% and by 1.0% to 1.1%. These results demonstrate that our compression method preserves, and in some cases enhances, the

performance even with substantial reductions in model size, making it highly effective for tasks on resource-limited platforms.
The reduction in FLOPs is a significant highlight of our compression approach. For all CNN models, the reduction in FLOPs consistently reaches a factor of 10.5 to 11.0, confirming the effectiveness of our compression techniques in enhancing computational efficiency. In addition, inference speed improvements are evident in the results, with compressed models exhibiting faster prediction times than their original counterparts. These findings underscore the practical advantages of compression for real-time applications, where quicker inference can greatly improve the user experience.
To address the accuracy drop, we integrated our compressed and finetuned models with the MoE algorithm. The results in Table 1 show that our MoE method achieved an accuracy of 92.1% on CIFAR-10 and 96.9% on the BloodMNIST dataset, successfully restoring accuracy to levels comparable to the original models before compression while leveraging the computational efficiency of the compressed models.

5 Conclusion
We have presented a novel deep learning model compression method that com-
bines pruning, quantization, and the Mixture of Experts (MoE) paradigm. Our
approach significantly reduces model size and computational requirements with-
out compromising accuracy. Experimental results demonstrated the potential
of our method for deploying sophisticated deep learning models on resource-
constrained devices. Our approach enables the use of complex neural networks
in mobile and edge applications where computational resources and energy effi-
ciency are critical constraints.

Acknowledgment. This research is supported by research funding from the Faculty of Information Technology, University of Science, Vietnam National University - Ho Chi Minh City.

References
1. Neural network intelligence. https://github.com/microsoft/nni (2020)
2. Arora, S., Du, S.S., Hu, W., Li, Z., Wang, R.: Fine-grained analysis of optimization
and generalization for overparameterized two-layer neural networks. arXiv preprint
arXiv:1901.08584 (2019)
3. Cheng, Y., Wang, D., Zhou, P., Zhang, T.: Model compression and acceleration
for deep neural networks: The principles, progress, and challenges. IEEE 35(1),
126–136 (2018)
4. Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural
networks: Training deep neural networks with weights and activations constrained
to +1 or −1. arXiv preprint arXiv:1602.02830 (2016)
5. Doe, J., Smith, J.: Bloodmnist dataset (2022), version 1.0

6. Dong, X., Chen, S., Pan, S.: Learning to prune deep neural networks via layer-wise
optimal brain surgeon. In: Proceedings of NIPS (2017)
7. Eigen, D., Ranzato, M., Sutskever, I.: Learning factored representations in a deep
mixture of experts. In: ICLR (2014)
8. Frankle, J., Carbin, M.: The lottery ticket hypothesis: Finding sparse, trainable
neural networks. In: ICLR (2019)
9. Girdhar, R., Ramanan, D.: Attentional pooling for action recognition. In: Proceed-
ings of NIPS (2017)
10. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural net-
works with pruning, trained quantization, and huffman coding. In: ICLR, pp. 199–
203 (2016)
11. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for
efficient neural network. In: NIPS (2015)
12. Hassibi, B., Stork, D.G., Wolff, G.J.: Optimal brain surgeon and general network
pruning. In: IEEE (1993)
13. Hatcher, W.G., Yu, W.: A survey of deep learning: platforms, applications and
emerging research trends. IEEE Access 6 (2018)
14. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (2018)
15. Jacob, B., et al.: Quantization and training of neural networks for efficient integer-
arithmetic-only inference. arXiv preprint arXiv:1712.05877 (2017)
16. Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E.: Adaptive mixtures of local
experts. Neural Computation 3(1) (1991)
17. Jiang, A.Q., et al.: Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)
18. Krizhevsky, A.: Learning multiple layers of features from tiny images (2009)
19. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. IEEE 86(11), 2278–2324 (1998)
20. Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient
convnets. ArXiv abs/1608.08710 (2016)
21. Liang, T., Glossner, J., Wang, L., Shi, S., Zhang, X.: Pruning and quantization for
deep neural network acceleration: A survey. Neurocomputing 461, 370–403 (2021)
22. Liao, H., et al.: A survey of deep learning technologies for intrusion detection in
internet of things. IEEE Access (2024)
23. Microsoft: Nni automl toolkit. https://nni.readthedocs.io/en/latest/ (2021)
24. Nagel, M., Amjad, R.A., van Baalen, M., Louizos, C., Blankevoort, T.: Up or down?
adaptive rounding for post-training quantization. arXiv preprint arXiv:2004.10568
(2020)
25. Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J.:
Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.
arXiv preprint arXiv:1701.06538 (2017)
26. Tanaka, H., Kunin, D., Yamins, D.L.K., Ganguli, S.: Pruning neural networks
without any data by iteratively conserving synaptic flow
27. Wang, X., et al.: Deep mixture of experts via shallow embedding. In: PMLR (2020)
28. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: Dorefa-net: Training low
bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint
arXiv:1606.06160 (2016)
29. Zhu, M.H., Gupta, S.: To prune, or not to prune: exploring the efficacy of pruning
for model compression (2017)
AI-Generated Image Recognition
via Fusion of CNNs and Vision
Transformers

Xuan-Bach Mai1,2 , Hoang-Minh Nguyen-Huu1,2 , Quoc-Nghia Nguyen1,2 ,


Hoang-Tung Vu1,2 , and Trung-Nghia Le1,2(B)
1
University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
{mxbach22,nhhminh22,nqnghia22,vhtung22}@apcs.fitus.edu.vn,
[email protected]
2
Vietnam National University, Ho Chi Minh City, Vietnam

Abstract. Recent advancements in synthetic data technology have


opened a new era where images of remarkable quality are generated,
blurring the lines between real-life images and those produced by Arti-
ficial Intelligence (AI). This evolution poses a significant challenge to
ensuring the reliability and authenticity of data, underscoring the need
for robust detection methods. In this paper, we present a robust app-
roach aimed at addressing these pressing concerns. Our methodology
revolves around leveraging fusion strategies, combining the strengths of
multiple detection methods for identifying AI-generated images. Through
extensive experimentation on the CIFAKE dataset, our model show-
cases remarkable performance, achieving an impressive accuracy rate of
97.32%. This accomplishment underscores the efficacy of our approach
in accurately distinguishing between AI-generated images and real-life
images, thus contributing to the advancement of data authentication
techniques amidst the proliferation of synthetic data.

1 Introduction
Artificial intelligence (AI)-generated images have become increasingly popular
in recent years, with various tools and platforms available for users to create
captivating visuals for social media and other purposes. One of the top recent
image generators is DALL·E 3, known for its ability to produce high-quality
images quickly [19]. AI has the capacity to create images from scratch using
trained artificial neural networks [1]. This technology allows for limitless cre-
ativity at the fingertips of users, with AI engines capable of producing art that
rivals human creativity [16]. Overall, AI-generated images have the potential to
be of excellent quality and offer users a range of creative possibilities. As tech-
nology continues to advance, AI image generators are likely to become even more
sophisticated and user-friendly, providing new opportunities for creativity and
visual expression.
Detecting AI-generated images is crucial due to the potential harm they can
cause in society. These images can be easily manipulated and altered to create

misleading content, disseminating false stories and undermining the credibility of


media sources. Moreover, the unauthorized use of personal images, such as those
obtained from social media, poses significant privacy risks and potential security
threats. Identity theft is another serious concern, as AI-generated images can be
leveraged to fabricate convincing fake identities, putting individuals’ reputations
and online safety at risk. Furthermore, the non-consensual use of likenesses in
AI-generated images raises ethical questions regarding ownership and control
over one’s identity. This technology enables the creation of images that closely
resemble real people without their consent, highlighting the moral implications
of using someone’s likeness without permission. Additionally, the amplification
of biases is a concerning issue, as AI algorithms trained on biased datasets can
perpetuate and exacerbate existing societal inequalities and discriminatory prac-
tices. Ensuring that AI algorithms are trained on diverse and unbiased datasets
is essential to mitigate these risks and promote fairness and inclusivity in AI-
generated content [2].
In the realm of AI, the detection of AI-generated images has become a cru-
cial area of focus. Various tools and methods have been developed to identify
whether an image has been created using AI technology. One popular option is
the “Hive AI Detector,” a Chrome extension that provides a score ranking the
likelihood of an image being real or AI-generated. Additionally, AI detection
tools allow users to upload an image and click a “Check” button to determine if
it is AI-generated. When examining AI-generated images, it is important to look
for inconsistencies that may indicate artificial intelligence was used in the cre-
ation process. Furthermore, industry-standard indicators have been developed
to label AI-generated images on social media platforms like Facebook and Insta-
gram. These indicators help users identify images that have been created using
AI technology. Another method to detect AI-generated images is by checking
the image metadata, as this can reveal telltale signs that may not be visible
to the naked eye [7]. Additionally, APIs like Illuminarty’s AI detection tool
can be integrated into services to automatically identify whether an image was
generated using AI. Despite the advancements in AI detection tools, there are
still challenges in fooling these systems. Tools like Art Detector are designed to
identify subtle markers embedded in AI-generated images, looking for unusual
patterns that may indicate artificial intelligence was used in the creation pro-
cess. Overall, the development of AI image detectors and detection methods
continues to evolve as the use of AI technology in image creation becomes more
prevalent [15].
Recognizing the crucial need for detecting AI-generated images, we propose a
robust detection method to address the potential harms associated with digitally
manipulated visuals. Our approach revolves around leveraging fusion strategies
to significantly boost the accuracy of our AI-generated image recognition model.
Specifically, we combine the strengths of Convolutional Neural Networks (CNNs)
and Vision Transformers (ViTs) through fusion algorithms. Extensive experi-
ments on the CIFAKE dataset [4] shows that our proposed method achieve an
impressive accuracy of 97.32%.

Our main contribution lies in running individual detection models and fusing
their outputs to create higher-accuracy models. This approach enhances detec-
tion precision and ensures our model’s adaptability against a diverse range of
AI-generated images. By integrating the capabilities of CNNs and ViTs, our goal
is to build a detection system that effectively safeguards against the spread of
misleading or harmful content in the digital realm.
In the remaining sections of this paper, we cover related methods, method-
ology, experimental results, conclusions, and future work. We discuss existing
approaches to detecting AI-generated images in Sect. 2, detail our methodology
for developing our AI-generated image recognition model in Sect. 3, present the experimental results obtained from testing our model in Sect. 4, and draw conclusions and outline future research directions in Sect. 5.

2 Related Work
2.1 Generative Models
Generative models constitute a pivotal domain within artificial intelligence,
aiming to capture and model the inherent distribution of data from observed
samples. These computational frameworks, which encompass a diverse array of
methodologies, have witnessed significant evolution over time. Early forays into
generative modeling leveraged deep neural networks, exemplified by restricted
Boltzmann machines (RBMs) and deep Boltzmann machines (DBMs). Recent
advancements have introduced a multitude of innovative approaches, includ-
ing variational autoencoders (VAEs), autoregressive models, normalizing flows,
generative adversarial networks (GANs), and diffusion models [14]. Notably, the
ProGAN/StyleGAN [12] family has demonstrated remarkable capabilities in pro-
ducing photorealistic images, predominantly focusing on single-class generation
tasks. The emergence of these sophisticated generative techniques has spurred
investigations into forensic methodologies geared towards discerning synthetic
imagery from authentic counterparts. Particularly noteworthy are recent strides
in diffusion models, which have showcased unprecedented proficiency in gener-
ating images from textual descriptions [3].
In this paper, we leverage the CIFAKE [4] dataset, which harnesses Stable
Diffusion [21] to generate synthetic images. CIFAKE serves as a valuable resource
for training and evaluating AI-generated image detection models, as it provides
a diverse collection of synthetic images across various domains. As the field
continues to push the boundaries of generative capabilities, the challenge of
effectively distinguishing between real and synthetic content grows increasingly
complex.

2.2 AI-Generated Image Recognition


The identification of manipulated images has a longstanding history in media
forensics [9], with established methodologies relying on signals such as resampling
artifacts [20], JPEG quantization, shadows, and the detection of operations like

image splicing or Photoshop warps [25]. With the proliferation of deep generative
methods, particularly in the context of GAN-based techniques [11], recent inves-
tigations have delved into the efficacy of discriminative methods for detecting
synthesized content. A central inquiry pertains to the generalizability of detec-
tors to unseen methods, with studies indicating that a classifier trained on one
GAN model can generalize to others, especially under aggressive augmentations [26].
Despite successes, challenges emerge when adapting detectors to new gener-
ators, where observed high average precision is juxtaposed with low accuracy,
indicating proficient separation between real and fake classes but suboptimal
calibration. Various techniques, including the utilization of frequency cues [10],
co-occurrence matrices [17], pretrained CLIP features [24], and augmentation
with representation mixing [5], have demonstrated effectiveness [18]. Notably,
Ojha et al. [18] demonstrate that a simple nearest neighbors classifier improves
accuracy, though at the cost of inference time. We expand upon the common
observation that even rudimentary classifiers possess a capacity for generalization
across various data generators. This exploration involves analyzing and defining
their performance within an online context.
Recent investigations into diffusion methods reveal that, contrary to GAN-
based detectors’ limitations in generalization, diffusion models are detectable
and exhibit some degree of mutual generalizability [18]. David C. Epstein et al.
[8] take these studies further by training a detector on 14 methods in an online
fashion, simulating their release dates, and releasing an accompanying dataset of
570k images. While these works detect whole images, local prediction also offers
important use-cases. For instance, in forensic analysis, there’s a growing need to
identify alterations made by conventional editing tools like Photoshop, such as
image warping and splicing [25]. Chai et al. [6] show that patch-based classifiers
can generate heatmaps for regions that contain more detectable cues. We aim to
determine whether we can localize inpainted regions. Remarkably, even in the
absence of direct access to inpainted examples, employing CutMix augmentation
[27] enables us to utilize entire images effectively for pixel-level predictions.

3 Proposed Method
3.1 Overview
Our approach focuses on using fusion strategies to significantly enhance the accuracy of our AI-generated image recognition model. By combining the strengths of CNNs and ViTs through fusion, we aim to create a model that is effective at capturing both fine local details and broader global context within images. This fusion not only handles different types of AI-generated images but also ensures that CNNs, which excel at extracting local details, and ViTs, which capture global context, work well together. The fusion strategies play a vital role in improving the model's precision and adaptability, making them a potent solution for accurately detecting AI-generated images. Our proposed method is outlined in Fig. 1.

Fig. 1. Illustration of our fusion method between the efficientnetv2-b0 model and the ViT-b16 model.

3.2 CNN Branch

In the CNN branch, we use EfficientNetV2 as the main model of the CNN family to fuse with the ViT branch. We chose this model because EfficientNet is a family of CNNs with better training speed and better parameter efficiency than previous CNN models [23].

3.3 ViT Branch

In the ViT branch, we process input images through a Vision Transformer (ViT) model, leveraging the strengths of the pre-trained "dima806/ai_vs_real_image_detection" checkpoint.
To adapt the images to the ViT model’s requirements, we apply a series of
transformations, including resizing, rotation, sharpness adjustment, and normal-
ization. These transformations ensure that the input images are aligned with the
RGB format expected by the ViT model.
The subsequent step involves the creation of a ViTForImageClassification
model from the pre-trained checkpoint. This model is configured to map class
labels to their corresponding indices, facilitating subsequent model evaluation
and interpretation. The number of trainable parameters is calculated, providing
insights into the model’s complexity and capacity to capture intricate features.
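Loading the named checkpoint and counting its trainable parameters can be done with the Hugging Face transformers API roughly as follows; printing the label mapping is our addition for inspection and is not prescribed by the paper.

```python
from transformers import ViTForImageClassification

# Checkpoint name taken from the text above.
model = ViTForImageClassification.from_pretrained("dima806/ai_vs_real_image_detection")
print(model.config.id2label)   # mapping from class indices to labels

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```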
Importantly, the ViT branch’s contribution extends beyond mere feature
extraction. It involves the orchestration of the entire pipeline, from meticulous
image processing to the instantiation of a ViT model capable of distilling global
features. These global-aware features, obtained through multi-head self-attention
and an MLP, play a pivotal role in our fusion strategy.
During the fusion process with the CNN branch, these features are seamlessly
integrated, fostering a comprehensive representation that harnesses both local
and global feature extraction mechanisms. This holistic approach amplifies the
model’s accuracy, resilience, and effectiveness in detecting AI-generated images.

3.4 Fusion Strategy


The fusion strategy in our proposed method plays a pivotal role in combining
the strengths of the CNN and Vision Transformer (ViT) branches while ensuring
that both spatial and global context information are effectively integrated. In this
section, we introduce various feature fusion techniques, including concatenation
and linear combination fusion to effectively integrate information from CNNs
and ViTs.

Concatenation. Our first approach is fusing CNNs and ViTs using concatena-
tion method. This fusion aims to capitalize on the localized feature extraction
capabilities of CNNs, which are known for capturing intricate spatial hierarchies,
and the global context understanding of ViTs, adept at discerning long-range
dependencies within images.
The method involves training dedicated CNN and ViT models for image
feature extraction, followed by the extraction of representative features from
their intermediate layers. These features are then fused using concatenation or
merging techniques, facilitated by a fusion layer. This innovative amalgamation
of features creates a unified representation that leverages the complementary
advantages of both architectures.
Let $X_{CNN}$ be the output features from the CNN model, and $X_{ViT}$ be the output features from the ViT model. The concatenation operation can be represented as follows:

$X_{\mathrm{concatenated}} = [X_{CNN}, X_{ViT}], \quad (1)$

$X_{\mathrm{final}} = \mathrm{FullyConnectedLayer}(X_{\mathrm{concatenated}}). \quad (2)$

By integrating a classification head onto the fused features, the model


becomes proficient in making accurate predictions, thereby offering a robust
solution for AI-generated image detection. This method underscores the impor-
tance of combining diverse neural network architectures to enhance the overall
performance and adaptability of computer vision models. Fine-tuning and cus-
tomization allow for the optimization of the method according to specific datasets
and detection tasks.

Linear Combination Fusion. Fusing Convolutional Neural Networks (CNNs)


and Vision Transformers (ViTs) linearly involves combining the features learned
by both architectures in a linear manner. This approach aims to leverage the
strengths of both CNNs, known for their spatial hierarchies through convolu-
tions, and ViTs, which capture long-range dependencies through self-attention
mechanisms.
Let $X_{CNN}$ be the feature representation from the CNN architecture, $X_{ViT}$ be the feature representation from the ViT architecture, $W_{CNN}$ and $W_{ViT}$ be the learnable weights for the linear transformations of the CNN and ViT features, respectively, $b$ be the bias term, and $Y$ be the final result. The linear fusion equation can be written as:

$Y = W_{CNN} \cdot X_{CNN} + W_{ViT} \cdot X_{ViT} + b. \quad (3)$


The equation performs a linear combination of the features from the CNN and ViT architectures. The feature representations from both architectures are multiplied by their respective learnable weights ($W_{CNN}$ and $W_{ViT}$), and the results are summed together. This process allows the model to blend information from both architectures linearly. The addition of the bias term ($b$) provides further flexibility in adjusting the combined features.
The resulting fused feature representation $Y$ is the output of the linear fusion process. This combined representation integrates information from both the CNN and ViT architectures, potentially capturing complementary aspects of the input data.
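A corresponding sketch of the linear combination fusion in Eq. (3); as above, the framework choice, projection dimensions, and classifier head are illustrative assumptions (the experiments also mention fixed scalar weights of 0.6 and 0.4, which would replace the learnable projections below).

```python
import torch
import torch.nn as nn

class LinearFusion(nn.Module):
    """Linear combination fusion (Eq. 3) with learnable W_CNN, W_ViT, and bias b."""

    def __init__(self, cnn_dim: int = 1280, vit_dim: int = 768,
                 fused_dim: int = 512, num_classes: int = 2):
        super().__init__()
        self.w_cnn = nn.Linear(cnn_dim, fused_dim, bias=False)   # W_CNN
        self.w_vit = nn.Linear(vit_dim, fused_dim, bias=False)   # W_ViT
        self.bias = nn.Parameter(torch.zeros(fused_dim))         # b
        self.head = nn.Linear(fused_dim, num_classes)

    def forward(self, x_cnn: torch.Tensor, x_vit: torch.Tensor) -> torch.Tensor:
        y = self.w_cnn(x_cnn) + self.w_vit(x_vit) + self.bias    # Eq. (3)
        return self.head(y)
```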

4 Experimental Results
Experiment Settings. We performed all experiments using the TensorFlow
framework on an Ubuntu system. Our experiments were run on two Nvidia T4
GPUs with 16GB of memory each. We selected Binary cross-entropy loss as our
loss function for two-label classification. We employed the AdamW optimizer for
training our CNN models. The optimizer’s weight decay was set to 100 for every
CNN, ensuring regularization during training. A momentum of 0.9 was utilized
to facilitate faster convergence. Our training process incorporates an early stop-
ping mechanism, stopping training if the validation loss does not improve by a
margin of 1000 within five epochs. Additionally, to safeguard model progress,
we implement a checkpoint system, preserving the model’s state each time the
validation loss experiences reduction of at least 10000.

4.1 Datasets

We used the CIFAKE dataset [4] as a benchmark to evaluate the proposed


network. The dataset contains two classes - “real” and “fake”. For “real”, the
authors collected the images from Krizhevsky and Hinton’s CIFAR-10 dataset
[13]. The “fake” class comprises synthetic images generated using Stable Diffusion
version 1.4.

4.2 Results

We performed qualitative and quantitative experiments for evaluation. The predictions from our proposed method are compared with state-of-the-art CNN-based and fused CNN-Transformer networks. We chose EfficientNet, ResNet50, VGG16, and MobileNet as the pure CNN architectures for evaluation and comparison with our fused model, which combines EfficientNet and ViT using our proposed fusion strategies.

Table 1. Result of pure CNNs base model and pure ViTs base model.

Model Loss Accuracy Precision Recall


efficientnetv2-b0 10.95% 97.17% 96.63% 97.74%
Resnet50 16.47% 95.73% 96.70% 94.69%
VGG16 12.66% 96.28% 96.12% 96.44%
MobilenetV3small 14.64% 94.98% 94.53% 95.49%
ViT-b16 41.60% 87.48% 88.67% 85.94%

Pure CNNs Models Performance. Based on our experimental findings


detailed in Table 1, the efficiency of various CNN models was evaluated, with
EfficientNet v2 emerging as the top performer, boasting an impressive accu-
racy rate of 97.17%. Furthermore, in our training process, EfficientNet consis-
tently exhibits stability and high accuracy across the initial epochs of training.
Given its consistently superior performance across multiple tests, it stands out
as the optimal candidate for combining with Vision Transformer (ViT) models
to potentially yield an even more accurate hybrid model.

Pure ViT Model Performance. In our experimental evaluation, the ViT


model exhibited the least accuracy among the tested models, achieving a score
of 87.48% in Table 1. Recognizing this limitation, we expect the fusion technique
to improve the performance of the ViT model and also to use its strengths to
enhance the performance of the EfficientNet model.

Fusion Result. Our proposed fusion strategies have successfully achieved the
two highest accuracy scores when compared with the accuracy of the two indi-
vidual models, CNN and ViT. As depicted in Table 2, the concatenation method
achieved the highest accuracy (97.44%), representing an increase of 0.37% com-
pared to EfficientNet and 9.96% compared to ViT. Additionally, the Linear Com-
bination method surpassed our initial expectations during training, achieving an
accuracy of 97.32%. This achievement was realized through adjustments of the
weight constants, assigning a weight of 0.6 to EfficientNet and 0.4 to ViTs.

Table 2. Results of fusion algorithms between efficientnetv2-b0 and ViT-B16.

Model | Loss | Accuracy | Precision | Recall
Concatenation (efficientnet + ViT) | 10.74% | 97.44% | 97.60% | 97.27%
Linear Combination (efficientnet + ViT) | 11.24% | 97.32% | 98.02% | 96.59%

Reduce Brightness by 50%. After applying the fusion method, the accuracy increases only slightly over that of the base models. Thus, we assess our model's performance under challenging conditions by reducing the brightness of the test images.
In this experiment, we convert the image color space from RGB to HSV and then decrease the "Value" channel by 50% in the validation dataset, effectively reducing image brightness by half [22]. We then use our pretrained CNN base models and our proposed model to evaluate this modified validation dataset. The performance of the CNN base models and our custom models is reported in Table 3 and Table 4, respectively.
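The brightness reduction can be reproduced with OpenCV roughly as follows; the function name and the assumption that images are supplied as 8-bit RGB arrays are ours.

```python
import cv2
import numpy as np

def reduce_brightness(img_rgb: np.ndarray, factor: float = 0.5) -> np.ndarray:
    """Scale the V channel in HSV space; factor=0.5 halves the image brightness."""
    hsv = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2HSV).astype(np.float32)
    hsv[..., 2] = np.clip(hsv[..., 2] * factor, 0, 255)   # the "Value" channel
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)
```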

Table 3. Results of the pure CNN base models and the pure ViT base model on the reduced-brightness dataset.

Model Loss Accuracy Accuracy drop


ResNet50 33.55% 86.59% 9.14%
VGG16 18.75% 94.78% 1.5%
efficientnetv2-b0 49.49% 81.67% 15.50%
MobilenetV3small 69.36% 50% 44.98%
ViT-b16 70.24% 74.01% 13.47%

Table 4. Results of fusion algorithms on the reduced-brightness dataset.

Model | Loss | Accuracy | Precision | Recall
Linear Combination (efficientnet + ViT) | 34.96% | 89.59% | 87.46% | 92.43%
Linear Combination (VGG + ViT) | 15.62% | 95.29% | 93.08% | 97.85%

The observation that most CNN base models experience a sharp decline in
accuracy when faced with images of reduced brightness underscores the com-
plexity of the task at hand. Notably, the VGG16 model stands out as being
relatively resilient to such challenges, with only a marginal 1.5% drop in accu-
racy. Leveraging this insight, we devised our custom fusion model, which not only
preserves the robustness of VGG16 but also harnesses the advanced capabilities

of ViT-b16. The result is a substantial enhancement in accuracy, culminating


in an impressive 95.29% accuracy rate on the processed validation dataset with
challenging brightness conditions.
Undoubtedly, our method has made significant strides in enhancing the accu-
racy of detecting AI-generated images. By combining the strengths of the VGG16
and ViT-b16 models through a fusion approach, we’ve achieved a noteworthy
improvement in accuracy metrics. This improvement is particularly pronounced
when confronted with images subjected to lower brightness conditions, a scenario
where conventional CNN base models often struggle.
This remarkable improvement underscores the effectiveness of our approach
in enhancing the model’s ability to discern AI-generated images with greater
precision and reliability. By leveraging the complementary strengths of different
architectures and intelligently fusing them, we’ve unlocked new levels of perfor-
mance, paving the way for more accurate and reliable detection of AI-generated
content in various real-world scenarios.

5 Conclusion
Our study introduces a novel method for recognizing AI-generated images, aim-
ing to enhance prediction efficiency and accuracy across several popular models.
At the heart of our proposed approach lies the fusion of Convolutional Neu-
ral Networks (CNNs) and Vision Transformer architectures. We explore two
fusion strategies-concatenation and linear combination-which yield slight accu-
racy improvements compared to using the models separately.
Our methodology begins by extracting feature vectors from both Efficient-
Net and Vision Transformer models. These vectors are then combined into a
unified output vector using mathematical formulas and algorithms. This fusion
process enables the models to leverage the strengths of both CNNs and Vision
Transformer, resulting in more robust predictions.
To validate the effectiveness of our approach, we subjected the testing dataset
to challenging conditions. Remarkably, our experiments reveal that the fusion of
VGG16 and ViT achieved the highest accuracy under these demanding circum-
stances. This finding underscores the resilience and effectiveness of our fusion
technique, particularly when faced with complex and varied image data.
Overall, our experiments demonstrate that our proposed fusion technique
significantly enhances feature extraction accuracy and image recognition capa-
bilities compared to individual branch models. By seamlessly integrating CNNs
and Vision Transformer architectures, we pave the way for more accurate and
efficient AI-generated image recognition systems.

Acknowledgement. This research is supported by research funding from the Faculty of Information Technology, University of Science, Vietnam National University - Ho Chi Minh City.

References
1. Ai image generation, explained. https://www.altexsoft.com/blog/ai-image-
generation/. Accessed 01 Apr 2024
2. How can ai-generated photos harm each of us. https://www.aiornot.com/blog/
how-can-ai-generation-photos-can-harm-each-of-us. Accessed 01 Apr 2024
3. Balaji, Y.e.a.: ediff-i: Text-to-image diffusion with expert denoisers. arXiv preprint
arXiv:2211.01324 (2022)
4. Bird, J., Lotfi, A.: Cifake: Image classification and explainable identification of
ai-generated synthetic images. https://www.kaggle.com/datasets/birdy654/cifake-
real-and-ai-generated-synthetic-images (Mar 2023)
5. Bui, T.e.a.: RepMix: Representation Mixing for Robust Attribution of Synthesized
Images (2022). https://doi.org/10.1007/978-3-031-19781-9_9
6. Chai, L.e.a.: What makes fake images detectable? (2020)
7. Collins, B.: Ai or not? how to detect if an image is ai-generated
— forbes.com. https://www.forbes.com/sites/barrycollins/2023/10/14/ai-or-not-
how-to-detect-if-an-image-is-ai-generated/?sh=6db008b83254. Accessed 01 Apr
2024
8. Epstein, D.C.e.a.: Online detection of ai-generated images (2023)
9. Farid, H.: Image forgery detection. IEEE Signal Process. Mag. 26(2), 16–25 (2009).
https://doi.org/10.1109/MSP.2008.931079
10. Frank, J.e.a.: Leveraging frequency analysis for deep fake image recognition. In:
ICML, pp. 3247–3258 (2020). https://doi.org/10.48550/arXiv.2302.10174
11. Karras, T.e.a.: A style-based generator architecture for gans. In: IEEE TPAMI
(2019). https://doi.org/10.1109/CVPR.2019.00453
12. Karras, T.e.a.: A style-based generator for gans. In: CVPR, pp. 4401–4410 (2019)
13. Krizhevsky, A.: Learning multiple layers from tiny images (2013-2017)
14. Luo, C.: Understanding diffusion models. arXiv preprint arXiv:2208.11970 (2022)
15. Maybe, M.: Ai image detector - hugging face space by umm-maybe — hugging-
face.co. https://huggingface.co/spaces/umm-maybe/AI-image-detector. Accessed
01 Apr 2024
16. Nast, C.: What ai-generated art really means for human creativity. https://
www.wired.com/story/picture-limitless-creativity-ai-image-generators/. Accessed
01 Apr 2024
17. Nataraj, L.e.a.: Detecting gan generated fake images using co-occurrence matri-
ces. Electronic Imaging (2019). https://doi.org/10.2352/ISSN.2470-1173.2019.5.
MWSF-532
18. Ojha, U.e.a.: Towards universal fake image detectors (2023). https://doi.org/10.
48550/arXiv.2302.10174
19. OpenAI: DALL·E 3. https://openai.com/dall-e-3. Accessed 25 Mar 2024
20. Popescu, A., Farid, H.: Exposing digital forgeries by detecting traces of re-sampling.
IEEE Trans. on Signal Process. 53, 758–767 (2005). https://doi.org/10.1109/TSP.
2004.839932
21. Rombach, R.e.a.: High-resolution image synthesis with latent diffusion models. In:
CVPR, pp. 10684–10695 (2022)
22. StackOverflow: How to fast change image brightness with python +
opencv? — stackoverflow.com. https://stackoverflow.com/questions/32609098/
how-to-fast-change-image-brightness-with-python-opencv, [Accessed 08-05-2024]
23. Tan, M., Le, Q.V.: Efficientnetv2: Smaller models and faster training (2021)

24. Tejankar, A.e.a.: A fistful of words: Learning transferable visual models from bag-
of-words supervision (2021)
25. Wang, S.Y.e.a.: Detecting photoshopped faces by scripting photoshop. In: ICCV
(2019). https://doi.org/10.1109/ICCV.2019.01017
26. Wang, S.Y.e.a.: Cnn-generated images are surprisingly easy to spot. In: CVPR
(2020). https://doi.org/10.1109/CVPR42600.2020.00872
27. Yun, S.e.a.: Cutmix: Regularization strategy for strong classifiers. In: ICCV (2019)
Decoding Deepfakes: Caption Guided
Learning for Robust Deepfake Detection

Y-Hop Nguyen1,2 and Trung-Nghia Le1,2(B)


1
University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
[email protected], [email protected]
2
Vietnam National University, Ho Chi Minh City, Vietnam

Abstract. The rapid development of generative image models has raised


concerns about misuse, especially in journalism and media. Therefore,
developing tools for detecting fake images is essential. However, many
current methods focus on short-term gains and lack long-term adapt-
ability. This paper focuses on detecting deepfakes across various types of
image data, such as faces, landscapes, objects, and scenes, using the
visual-language CLIP model. Although CLIP has shown potential in
deepfake detection, it has yet to clarify why it performs effectively in
this task. Our analysis shows that CLIP’s combination of image features
enhances the model’s generalization capability. By extracting image fea-
tures trained for the deepfake detection task and generating captions
through a text-decoding model, we demonstrate its effectiveness. Based
on these findings, we introduce a novel method that enables the learning
of forgery features and semantic features to improve generalization in
image forgery detection. Extensive experiments show that our method
achieves an accuracy of 98.3% on GAN-generated datasets and 95.9% on
previously unknown diffusion model datasets. Our code is available at:
https://github.com/genkerizer/CGL.

Keywords: Robust deepfake detection · Vision-Language model ·


Vision Transformer · CLIP · BLIP

1 Introduction

Recent advances in image generation via GANs [3] and Diffusion models [4]
complicate real vs. synthetic image identification. Hyper-realistic deepfakes can
mislead audiences by fabricating actions or statements from public figures,
underscoring the need for effective detection methods to ensure digital content
authenticity.
Early detection methods combining CNN classifiers and data augmentation
[12, 21] have struggled with diffusion model-generated images. Techniques like
frequency domain analysis [2] and noise reconstruction [22] aim to enhance deep-
fake detection robustness. However, identifying invariant features remains chal-
lenging, particularly for unseen fakes. Progress in face-related deepfake detection

Fig. 1. The CLIP model enhances deepfake detection by generalizing forgery recogni-
tion. Expanding on UniFD [16], we decoded forgery features and found that captions
generated from these features often misrepresent the images, with specific words in the
captions proving essential for detecting deepfakes.

Fig. 2. Word frequency across ProGAN, BigGAN, StarGAN, and deepfake datasets
was analyzed by applying a text-decoder after the CLIP image-encoder’s adapter layer,
trained on ProGAN for deepfake classification. High-frequency words in the resulting
captions were found to enhance CLIP’s generalization in deepfake detection.

includes combining cross-dataset features and data augmentation to boost model


generalization [6, 23].
CLIP shows strong generalization on unseen datasets [11, 16], but why it can
discern real from fake images remains unexplored. Studies typically use adapter
or prompt-tuning, which means the output features of the CLIP-image encoder
remain unchanged. As a result, the model combines relevant features to form
an optimal set for deepfake detection. Following this, we decoded post-adapter
image features to generate captions (Fig. 1). Analysis of these captions revealed
certain words (Fig. 2) as key to the model’s classification, even though the cap-
tions often misrepresented the images.
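The word-frequency analysis behind Fig. 2 amounts to counting tokens over the decoded captions; a minimal sketch with hypothetical captions is given below.

```python
import re
from collections import Counter

def word_frequencies(captions):
    """Count word occurrences over a set of generated captions (cf. Fig. 2)."""
    counter = Counter()
    for caption in captions:
        counter.update(re.findall(r"[a-z']+", caption.lower()))
    return counter

# Hypothetical captions decoded from post-adapter CLIP image features.
freqs = word_frequencies(["a close up of a person's face", "a painting of a person"])
print(freqs.most_common(5))
```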
From these observations, this paper aims to design a deepfake detection
model focusing on two key factors: 1) The CLIP-image encoder must extract
specific forgery features, and 2) Features through the adapter must accurately

represent image properties. We propose Caption Guided Learning (CGL), which


combines visual-language training with the LoRA algorithm [5] in the CLIP image
encoder. This method integrates low-to-high-level features for effective fake
image detection. Experiments on GAN and Diffusion model datasets show CGL
achieves state-of-the-art performance, with a 3.0% improvement over existing
methods, demonstrating strong generalization in real-world scenarios. Our main
contributions are summarized as follows:

– We propose a novel Caption Guided Learning (CGL) method to efficiently extract significant deepfake features, which are independent of semantic features of images.
– We demonstrate that CLIP predicts by combining feature clusters to enhance
generalization capabilities.
– Our experiments verify the effectiveness of the proposed method, showing
strong generalization across 15 different generative models used to synthesize
fake images.

2 Related Work

Most recent research on deepfake detection typically focuses on three main devel-
opment directions: Data processing to enhance generalization capabilities, data
augmentation, and model design aimed at improving the detection of deepfake
features.

Data Processing-based methods enhance image features or transform images to improve forgery detection. Frequency domain conversions [2, 8] and noise
extraction models [1] were explored to improve generalization. Inversion and
reconstruction methods [14, 22] detect differences between original and fake
images, while gradient-based feature representation [20] and up-sampling anal-
ysis [19] also aid in deepfake detection.

Data Augmentation-based methods improve generalization by generating new features through augmentation techniques [21], distillation learning [25], and domain-specific augmentations [23]. Intra- and cross-domain augmentations [23]
as well as face swapping and face blending techniques [6] were applied to enhance
deepfake detection.

Model Design-based methods propose end-to-end architectures for feature
extraction. Gram-Net [12] and FreqNet [18] improved deepfake detection by
learning image textures and frequency relationships. CLIP-based approaches
[9, 16] utilized frozen encoders and fast adaptation for better performance, while
forgery-aware adapters combined with contrastive learning [11] achieved strong
results on GAN and diffusion datasets.

3 Proposed Method
3.1 Overview
Figure 3 illustrates our proposed Caption Guided Learning (CGL) method, consisting of three main parts: Caption Generation generates corresponding captions for training images and appends forgery captions to obtain enhanced captions. LoRA Contrastive Learning trains LoRA modules using contrastive learning to efficiently extract semantic features and deepfake-specific features. Forgery Fusion Learning combines forgery features from low level to high level across the stages of the CLIP-image encoder to predict real and fake images.

3.2 Caption Generation

Previous studies [9, 16] show that features from the CLIP-image encoder excel
in forgery detection via linear classification. We found that these features were
semantically aligned during CLIP training. Decoding the features after they pass through the adapter layer trained for deepfake detection revealed key words crucial for identifying deepfakes, even though the generated captions often misrepresented the images. Analyzing the frequency of these words helped differentiate real from fake images
(Fig. 2). Our goal is to enhance CLIP’s ability to extract both semantic and
forgery features by creating captions that fulfill two criteria:

– Semantic caption accurately describes the image information.


– Forgery caption provides cues to guide the CLIP image encoder in detecting
forgery features.

Given a dataset \mathcal{X} consisting of real and fake images, we can define it as follows:

\mathcal{X} = \{(x_j, y_j)\}_{j=1}^{N}, \quad y \in \{0, 1\},    (1)
where y = 1 denotes a fake image and y = 0 denotes a real image. For each training image, a semantic caption is generated using either the ClipCap model [15] or the BLIP model [10]. The set of captions for the training images is defined as C = \{(c_j, y_j)\}_{j=1}^{N}, where y \in \{0, 1\} and c_j is the semantic caption corresponding to the j-th image in the dataset. The forgery caption is a set consisting of a real caption and a fake caption, defined as:

C_{forgery} = \{C_{real}, C_{fake}\}.    (2)

To generate enhanced captions, we concatenate the forgery captions with the semantic captions:

\tilde{C}_{enhance} = \{\tilde{c}_j\}_{j=1}^{N},    (3)

\tilde{c}_j = \begin{cases} (C_{real}, c_j), & \text{if } y = 0, \\ (C_{fake}, c_j), & \text{if } y = 1, \end{cases}    (4)

Fig. 3. Our CGL framework employs a CLIP-image encoder to guide learning of deepfake-specific features using contrastive loss with the CLIP-text encoder. LoRA
is applied to finetune ViT blocks in the CLIP-image encoder. CLS tokens extracted
from each ViT block stage are concatenated with forgery CLS tokens and fused via
multi-stage ViT features. The resulting Forgery CLS tokens serve as the final feature for
classification. During inference, we merge the LoRA weights and remove the CLIP-text
encoder.

where C_{forgery} = \{C_{real}, C_{fake}\} typically assigns a pair of words that differ from the image caption, for example \{C_{real} = \text{real}, C_{fake} = \text{synthetic}\} or \{C_{real} = \text{authentic}, C_{fake} = \text{deepfake}\}. These forgery captions are added to the image captions to enrich the textual context and guide the CLIP-image encoder in learning the cues for detecting deepfake images.
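
As a concrete illustration of Eqs. (3)–(4), the sketch below builds enhanced captions from semantic captions and binary labels; the specific forgery word pair and the comma used to join the two parts are illustrative assumptions, not the exact format used in the implementation.

from typing import List

# Hypothetical forgery word pair; the text above suggests e.g. {real, synthetic} or {authentic, deepfake}.
C_REAL, C_FAKE = "real", "synthetic"

def build_enhanced_captions(captions: List[str], labels: List[int]) -> List[str]:
    """Prepend the forgery cue C_real / C_fake to each semantic caption c_j."""
    enhanced = []
    for c_j, y_j in zip(captions, labels):
        prefix = C_FAKE if y_j == 1 else C_REAL  # y = 1 -> fake, y = 0 -> real
        enhanced.append(f"{prefix}, {c_j}")      # assumed separator
    return enhanced

# Example with two captions produced by ClipCap/BLIP.
print(build_enhanced_captions(["a cat sitting on a sofa", "a car parked on a street"], [0, 1]))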

3.3 CLIP-Based Feature Extractor


Benefiting from CLIP’s powerful pre-trained weights and efficient feature extrac-
tion capabilities, we adopt CLIP-ViT [17] as our pre-trained feature extractor.
CLIP-image encoder employs Vision Transformer (ViT) [24] as an image
encoder, ensuring high-quality feature extraction while mitigating overfitting
risks during feature guidance. Given an RGB image x \in \mathbb{R}^{H \times W \times 3}, the patch embedding layer splits and transforms it into a sequence of image embeddings E_{img} \in \mathbb{R}^{(N+1) \times D}. Subsequently, the image embeddings are further processed through multiple ViT blocks, and the output CLS token is taken as the image encoding vector f_i \in \mathbb{R}^{1 \times D}. In this context, H and W denote the height and width of the image, the 1 corresponds to the CLS token, D is the dimension of the projected image features, N = HW / P^2 denotes the number of patches (with patch size P), and f_i represents the i-th stage output of CLIP-ViT.
CLIP-text Encoder has the primary role of transforming input text into an embedding that can be compared with the image embeddings generated by the CLIP-image encoder. The CLIP-text encoder uses a Transformer architecture. The input text is tokenized into a sequence of tokens x = [CLS, token_1, token_2, \ldots, token_n], where CLS is a special classification token and token_i are the word tokens. The Transformer applies a series of layers, including multi-head self-attention and feedforward layers, z = \mathrm{Transformer}(x). After applying positional embeddings and passing through multiple Transformer layers, the final output at the CLS position is used as the text representation h = z_{CLS}. This vector is then projected into the joint image-text embedding space using a learned linear projection f_{text}(x) = W_{text} h, where W_{text} is the projection matrix.
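
For reference, the multi-stage CLS tokens used in the later fusion step can be collected from a pre-trained CLIP vision backbone; the sketch below uses the Hugging Face transformers API, and the checkpoint name is an assumption since the exact backbone is not restated here.

import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

name = "openai/clip-vit-large-patch14"          # assumed checkpoint
processor = CLIPImageProcessor.from_pretrained(name)
vision = CLIPVisionModel.from_pretrained(name).eval()

image = Image.new("RGB", (224, 224))            # placeholder image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    out = vision(pixel_values, output_hidden_states=True)

# hidden_states[k] has shape (batch, N + 1, D); position 0 along the token axis is the CLS token.
cls_per_stage = [h[:, 0, :] for h in out.hidden_states[1:]]   # skip the patch-embedding output
print(len(cls_per_stage), cls_per_stage[0].shape)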

3.4 LoRA Contrastive Learning

To enable the CLIP-image encoder to learn both semantic features and forgery features, we apply the LoRA algorithm [5] at each stage of the ViT in the image encoder. LoRA updates pre-trained weights using the product of two smaller matrices, A and B, whose rank is chosen according to the 'intrinsic rank' of the downstream task. Given an input x, a hidden state h, and a weight matrix W \in \mathbb{R}^{d_1 \times d_2}, the weight adjustment when applying the LoRA module is carried out as follows:

h = Wx + \gamma \Delta W x = Wx + \gamma B A x,    (5)

where A \in \mathbb{R}^{r \times d_2}, B \in \mathbb{R}^{d_1 \times r}, and \Delta W \in \mathbb{R}^{d_1 \times d_2} has rank r with r \ll \min(d_1, d_2); \gamma is the scaling factor. Matrix A is initialized with Kaiming initialization, while matrix B is initialized with zeros, so the weight update is zero before LoRA training starts and the pre-trained behaviour is unchanged at initialization.
LoRA is applied to the stack of ViT blocks of the CLIP image encoder, each block containing a multi-head attention (MHA) module:

\mathrm{head}_i = \mathrm{Softmax}\left(\frac{(x W_Q^i)(x W_K^i)^T}{\sqrt{d}}\right)(x W_V^i),    (6)

\mathrm{MHA}(x) = \mathrm{concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H) W_O,    (7)

where d is the scaling factor and W_K^i, W_Q^i, W_V^i, W_O are the key, query, value, and output projection matrices, respectively.
Using LoRA during the fine-tuning of the CLIP image encoder helps reduce
computation time and costs, achieving high performance as the updates are
applied to all ViT Blocks.
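
The sketch below illustrates Eq. (5) as a drop-in wrapper around a frozen linear layer; the class name is ours, and the rank, scaling, and dropout values mirror the settings reported in Sect. 4.1 rather than being fixed requirements.

import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen weight W plus a trainable low-rank update gamma * B A x (Eq. (5))."""

    def __init__(self, base: nn.Linear, r: int = 2, gamma: float = 1.0, dropout: float = 0.25):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # only A and B are trained
        d_out, d_in = base.out_features, base.in_features  # d1, d2 in the notation above
        self.lora_A = nn.Linear(d_in, r, bias=False)       # A in R^{r x d2}
        self.lora_B = nn.Linear(r, d_out, bias=False)      # B in R^{d1 x r}
        nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B.weight)                 # zero update before training
        self.gamma = gamma
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.gamma * self.lora_B(self.lora_A(self.drop(x)))

# Example: wrap the query projection of one attention block.
layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 768)).shape)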

3.5 Forgery Fusion Learning


Classical Transformers use the CLS token to capture a comprehensive representation of the input sequence by consolidating diverse feature representations into a unified embedding.
Additionally, we observed that the output CLS_i \in \mathbb{R}^{D} from the i-th ViT stage of the CLIP-image encoder inherently meets the input requirements of a Transformer, thereby eliminating the need for further embedding transformations. Consequently, we concatenate the CLS tokens from each stage and append a new Forgery CLS token CLS_{forgery} \in \mathbb{R}^{D}, which serves as a comprehensive global representation of all stage outputs. This composite representation is expressed as:

c_0 = [CLS_{forgery}; CLS_1; CLS_2; \ldots; CLS_N].    (8)

Based on this, we develop the Forgery Fusion Learning module, employing the standard ViT block architecture [24]; the process can be written as follows:

c'_l = \mathrm{MHSA}(\mathrm{LN}(c_{l-1})) + c_{l-1},
c_l = \mathrm{MLP}(\mathrm{LN}(c'_l)) + c'_l,    (9)

where MHSA and MLP stand for the multi-head self-attention and multi-layer perceptron in a standard vision transformer block, and c'_l, c_l represent the output features of the MHSA and MLP modules of standard ViT block l, respectively.
The Forgery Fusion Learning module utilizes the Transformer’s self-attention
mechanism to fuse multi-stage features of the CLIP-image encoder, combining
low-level details with high-level abstractions. This integration enhances deepfake
detection by maximizing the utility of extracted features.
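
A minimal sketch of this fusion step is given below, using PyTorch's built-in Transformer encoder as a stand-in for the ViT block of Eq. (9); the embedding width, depth, and head count are assumptions, not the tuned values.

import torch
import torch.nn as nn

class ForgeryFusion(nn.Module):
    """Fuse per-stage CLS tokens with a learnable Forgery CLS token (Eqs. (8)-(9))."""

    def __init__(self, dim: int = 1024, depth: int = 2, heads: int = 8):
        super().__init__()
        self.forgery_cls = nn.Parameter(torch.zeros(1, 1, dim))
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True, norm_first=True)  # pre-LN, ViT-style
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)
        self.head = nn.Linear(dim, 1)  # real/fake logit

    def forward(self, stage_cls: torch.Tensor) -> torch.Tensor:
        # stage_cls: (batch, num_stages, dim), the CLS tokens collected from the ViT stages
        b = stage_cls.size(0)
        tokens = torch.cat([self.forgery_cls.expand(b, -1, -1), stage_cls], dim=1)  # Eq. (8)
        fused = self.blocks(tokens)
        return self.head(fused[:, 0])  # classify from the Forgery CLS position

logits = ForgeryFusion()(torch.randn(2, 24, 1024))
print(logits.shape)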

3.6 Loss Functions


For guided LoRA learning, we use a contrastive loss. Specifically, we first compute text features u and image features v as follows:

u_j = \mathrm{Encoder}_{text}(\tilde{c}_j), \quad v_j = \mathrm{Encoder}^{LoRA}_{img}(x_j),    (10)

where \mathrm{Encoder}_{text}(\cdot) denotes the text encoder and \mathrm{Encoder}^{LoRA}_{img}(\cdot) refers to the image encoder equipped with the LoRA layers. The contrastive learning loss is then computed as follows:

\mathcal{L}_{contrastive} = \frac{\mathcal{L}_{v \to u} + \mathcal{L}_{u \to v}}{2},    (11)

\mathcal{L}_{v \to u} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(v_i^T u_i)}{\sum_{j=1}^{N} \exp(v_i^T u_j)},    (12)

\mathcal{L}_{u \to v} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(u_i^T v_i)}{\sum_{j=1}^{N} \exp(u_i^T v_j)}.    (13)

In addition, we apply a linear classifier on the Forgery CLS token to perform the classification. The cross-entropy loss is adopted as the classification loss \mathcal{L}_{classification}. The final loss function combines the two losses, with \mu_1 and \mu_2 as tunable hyperparameters:

\mathcal{L} = \mu_1 \mathcal{L}_{contrastive} + \mu_2 \mathcal{L}_{classification}.    (14)
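
A compact sketch of Eqs. (11)–(14) is shown below; the L2 normalisation of the features and the absence of a temperature term are assumptions on our part, since the equations above use raw dot products.

import torch
import torch.nn.functional as F

def cgl_loss(v: torch.Tensor, u: torch.Tensor, logits: torch.Tensor,
             labels: torch.Tensor, mu1: float = 1.0, mu2: float = 1.0) -> torch.Tensor:
    """Symmetric image-text contrastive loss plus classification cross-entropy."""
    v = F.normalize(v, dim=-1)                   # assumed normalisation of image features
    u = F.normalize(u, dim=-1)                   # assumed normalisation of text features
    sim = v @ u.t()                              # (N, N) matrix of v_i^T u_j
    targets = torch.arange(v.size(0), device=v.device)
    l_v2u = F.cross_entropy(sim, targets)        # Eq. (12)
    l_u2v = F.cross_entropy(sim.t(), targets)    # Eq. (13)
    contrastive = 0.5 * (l_v2u + l_u2v)          # Eq. (11)
    classification = F.cross_entropy(logits, labels)
    return mu1 * contrastive + mu2 * classification  # Eq. (14)

loss = cgl_loss(torch.randn(8, 512), torch.randn(8, 512),
                torch.randn(8, 2), torch.randint(0, 2, (8,)))
print(loss.item())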

4 Experiments
4.1 Implementation Details
Our method involves end-to-end training on an NVIDIA Tesla T4 16 GB GPU over 3 epochs with a batch size of 24. The CLIP-text encoder is frozen, with LoRA layers applied to all ViT blocks of the CLIP-image encoder. For the loss function, we set \mu_1 = \mu_2 = 1.0. For the LoRA parameters, we set r = 2, \alpha = 1, and the dropout rate to 0.25. Input images were resized to 256 \times 256 and then cropped to 224 \times 224; only random cropping and horizontal flipping were applied. We utilized the AdamW optimizer [13] with a weight decay of 5 \times 10^{-2}, betas of (0.9, 0.999), and an initial learning rate of 5 \times 10^{-4}. For testing, we removed the CLIP-text encoder and merged the LoRA weights into the ViT stages of the CLIP-image encoder.
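
For reference, the optimizer settings above map directly onto PyTorch as sketched below; the stand-in module merely replaces the full set of trainable parameters (LoRA matrices, fusion module, and classifier).

import torch

model = torch.nn.Linear(768, 2)  # stand-in for the trainable parameters

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=5e-4,                     # initial learning rate
    betas=(0.9, 0.999),
    weight_decay=5e-2,
)
print(optimizer)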

4.2 Datasets
We train our network on the ForenSynths dataset, incorporating real LSUN
images and ProGAN-generated fake images, with 1-class (horse), 2-class (chair,
horse), and 4-class (car, cat, chair, horse) configurations. For evaluation, we use
the same testing datasets as RINE [9], covering GANs (ProGAN, StyleGAN,
StyleGAN2, BigGAN, CycleGAN, StarGAN, GauGAN, deepfake) and Diffusion
models (PNDM, Guided, DALL-E, VQ-Diffusion, LDM, Glide).

4.3 Evaluation Metric


Following the procedures in UniFD [16] and RINE [9], we used accuracy (Acc) and average precision (AP) as the primary metrics to evaluate our method. We calculated the mean Acc (mAcc) and mean AP (mAP) for each dataset to comprehensively assess model performance across GAN and diffusion model datasets.

4.4 Comparison with State-of-the-Art Methods


We primarily compared our model with previous methods using the frozen pre-
trained CLIP, such as UniFD [16], RINE [9], and FatFormer [11]. Additionally,
we evaluated our model’s effectiveness against other state-of-the-art methods,
including BiHPF [7], FrePGAN [8], LGrad [20], FreqNet [18], and NPR [19].
Table 1 shows that our method significantly outperforms all other methods when training on 2 classes, with an mAcc of 95.4% and an mAP of 99.6%.

Table 1. Comparison of accuracy and average precision (Acc/AP) between our method
and state-of-the-art techniques on the GANs dataset under three distinct training sce-
narios: 1-class, 2-classes, and 4-classes. The top-performing results are emphasized in
bold.
Method class ProGAN StyleGAN StyleGAN2 BigGAN CycleGAN StarGAN GauGAN deepfake Mean
Wang 1 50.4/63.8 50.4/79.3 68.2/97.4 50.2/61.3 50.0/52.9 50.0/48.2 50.3/67.6 50.1/51.5 52.5/64.9
BiHPF 1 82.5/81.4 68.0/62.8 68.8/63.6 67.0/62.5 75.5/74.2 90.1/90.1 73.6/92.1 51.6/49.9 72.1/72.1
FrePGAN 1 95.5/99.4 80.6/90.6 77.4/93.0 63.5/60.5 59.4/59.9 99.6/100.0 53.0/49.1 70.4/81.5 74.9/79.3
LGrad 1 99.4/99.9 96.0/99.6 93.8/99.4 79.5/88.9 84.7/94.4 99.5/100.0 70.9/81.8 66.7/77.9 86.3/92.7
UniFD 1 99.1/100.0 77.2/95.9 69.8/95.8 94.5/99.0 97.1/99.9 98.0/100.0 95.7/100.0 82.4/91.7 89.2/97.8
FreqNet 1 98.0/99.9 92.0/98.7 89.5/97.9 85.5/93.1 96.1/99.1 94.2/98.4 91.8/99.6 69.8/94.4 89.6/97.6
RINE 1 99.8/100.0 88.7/99.1 86.9/99.7 99.1/99.9 99.4/100.0 98.8/100.0 99.7/100.0 82.7/97.4 94.4/99.5
Ours 1 99.5/100.0 90.3/99.9 85.6/98.9 95.4/99.8 96.7/99.8 99.8/100.0 98.2/100.0 85.0/97.4 93.8/99.5
Wang 2 64.6/92.7 52.8/82.8 75.7/96.6 51.6/70.5 58.6/81.5 51.2/74.3 53.6/86.6 50.6/51.5 57.3/79.6
BiHPF 2 87.4/87.4 71.6/74.1 77.0/81.1 82.6/80.6 86.0/86.6 93.8/80.8 75.3/88.2 53.7/54.0 78.4/79.1
FrePGAN 2 99.0/99.9 80.8/92.0 72.2/94.0 66.0/61.8 69.1/70.3 98.5/100.0 53.1/51.0 62.2/80.6 75.1/81.2
LGrad 2 99.8/100.0 94.8/99.7 92.4/99.6 82.5/92.4 95.9/94.7 99.7/99.9 73.7/83.2 60.6/67.8 86.2/92.2
UniFD 2 99.7/100.0 78.8/97.4 75.4/96.7 91.2/99.0 91.9/99.8 96.3/99.9 91.9/100.0 80.0/89.4 88.1/97.8
FreqNet 2 99.6/100.0 90.4/98.9 85.8/98.1 89.0/96.0 96.7/99.8 97.5/100.0 88.0/98.8 80.7/92.0 91.0/97.9
RINE 2 99.8/100.0 84.9/99.5 76.7/99.6 98.3/99.9 99.4/100.0 99.6/100.0 99.9/100.0 66.7/96.4 90.6/99.4
Ours 2 100.0/100.0 95.8/99.9 98.0/99.7 95.9/99.9 93.4/99.9 99.9/100.0 96.3/100.0 84.27/97.9 95.4/99.6
Wang 4 91.4/99.4 63.8/91.4 76.4/97.5 52.9/73.3 72.7/88.6 63.8/90.8 63.9/92.2 51.7/62.3 67.1/86.9
BiHPF 4 90.7/86.2 76.9/75.1 76.2/74.7 84.9/81.7 81.9/78.9 94.4/94.4 69.5/78.1 54.4/54.6 78.6/77.9
FrePGAN 4 99.0/99.9 80.7/89.6 84.1/98.6 69.2/71.1 71.1/74.4 99.9/100.0 60.3/71.7 70.9/91.9 79.4/87.2
LGrad 4 99.9/100.0 94.8/99.9 96.0/99.9 82.9/90.7 85.3/94.0 99.6/100.0 72.4/79.3 58.0/67.9 86.1/91.5
UniFD 4 99.7/100.0 89.0/98.7 83.9/98.4 90.5/99.1 87.9/99.8 91.4/100.0 89.9/100.0 80.2/90.2 89.1/98.3
FreqNet 4 99.6/100.0 90.2/99.7 88.0/99.5 90.5/96.0 95.8/99.6 85.7/99.8 93.4/98.6 88.9/94.4 91.5/98.5
NPR 4 99.8/100.0 96.3/99.8 97.3/100.0 87.5/94.5 95.0/99.5 99.7/100.0 86.6/88.8 77.4/86.2 92.5/96.1
RINE 4 100.0/100.0 88.9/99.4 94.5/100.0 99.6/99.9 99.3/100.0 99.5/100.0 99.8/100.0 80.6/97.9 95.3/99.7
Ours 4 100.0/100.0 97.1/99.9 99.0/99.90 98.8/98.4 98.5/99.3 100.0/100.0 98.8/99.8 94.00/98.1 98.3/99.5

Table 2. Comparison of accuracy and average precision (Acc/AP) between our method, trained on three different scenarios (1-class, 2-classes, and 4-classes), and
state-of-the-art techniques trained exclusively on the 4-classes scenario using the dif-
fusion model dataset. The top-performing results are emphasized in bold.

Method class PNDM Guided DALL-E VQ-Diff LDM 200 LDM w/CFG LDM 100 Glide 100-27 Glide 50-27 Glide 100-10 Mean
Wang 4 50.8/90.3 54.9/66.6 51.8/61.3 50.0/71.0 52.0/64.5 51.6/63.1 51.9/63.7 53.0/71.3 54.2/76.0 53.3/72.9 52.4/70.1
LGrad 4 69.8/98.5 86.6/100.0 88.5/97.3 96.3/100.0 94.2/99.1 95.9/99.2 94.8/99.2 87.4/93.2 90.7/95.1 89.4/94.9 89.4/97.7
UniFD 4 75.3/92.5 75.7/85.1 89.5/96.8 83.5/97.7 90.2/97.1 77.3/88.6 90.5/97.0 90.7/97.2 91.1/97.4 90.1/97.0 85.4/94.6
RINE 4 83.8/98.6 76.2/96.6 95.1/99.5 91.4/99.8 98.3/99.9 88.2/98.7 98.7/99.9 88.9/99.1 92.6/99.5 90.7/99.2 90.40/99.1
FatFormer 4 99.3/100.0 76.1/92.0 98.8/99.8 100.0/100.0 98.6/99.8 94.9/99.1 98.7/99.9 94.4/99.1 94.7/99.4 94.2/99.2 95.0/98.8
Ours 1 90.6/99.9 76.8/86.7 97.2/99.7 87.0/99.9 91.3/98.8 80.9/96.2 92.3/98.9 87.4/98.0 93.2/99.1 89.1/98.3 88.6/97.5
Ours 2 95.2/99.8 78.3/86.7 98.9/99.9 98.4/99.9 98.8/99.9 96.5/99.5 98.5/99.8 92.1/98.5 94.6/99.1 93.2/98.8 94.4/98.2
Ours 4 96.4/99.9 82.4/88.7 99.3/99.8 99.6/100.0 99.4/99.8 98.6/99.7 98.7/99.7 94.6/99.1 95.9/99.4 95.2/99.3 95.9/98.5

Given the fundamental differences between the generation mechanisms of GANs and Diffusion models, we compared our model with existing detection
methods on the Diffusion model dataset, shown in Table 2. All methods were
assessed on diffusion models to evaluate generalization, with our model achiev-
ing a mAcc of 95.9% and mAP of 98.5%. Notably, our model outperformed
UniFD [16] and FatFormer [11], surpassing FatFormer by 0.9% in mean ACC.
Additionally, when trained with only two ProGAN classes, our method achieved

Table 3. Impacts of captions on the model’s performance, evaluated by mean accuracy and mean average precision (mAcc/mAP).

Captions            GAN test set (mAcc/mAP)   Diffusion test set (mAcc/mAP)
ClipCAP captions    98.3 / 99.5               95.9 / 98.5
BLIP captions       97.3 / 97.6               95.0 / 97.1

94.4% mean accuracy on diffusion data, suggesting that less training data may
enhance generalization by reducing overfitting risks.

4.5 Caption Evaluation


To assess the impact of captions on model performance, we trained using two
types of captions generated from the ProGAN dataset, which includes four
classes: car, cat, chair, and horse. The captions were produced by BLIP [10]
and ClipCAP [15]. Table 3 compares the results, demonstrating that ClipCAP
captions significantly enhance performance on both GAN and Diffusion model
datasets.

5 Conclusion
We present a novel Caption Guided Learning (CGL) method for generalizable
fake image detection, incorporating three modules with CLIP to enhance feature
extraction for deepfake detection. Extensive experiments on GAN and Diffusion
model datasets show that CGL achieves state-of-the-art performance, highlight-
ing its strong generalization capability. Additionally, the simplicity and flexibility
of our approach may inspire further advancements in deepfake detection using
frozen pre-trained models.

Acknowledgment. This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under Grant Number 102.05-2023.31.

References
1. Bi, X., et al.: Detecting generated images by real images only. arXiv preprint
arXiv:2311.00962 (2023)
2. Frank, J., Eisenhofer, T., Schönherr, L., Fischer, A., Kolossa, D., Holz, T.: Lever-
aging frequency analysis for deep fake image recognition. In: ICML, pp. 3247–3258.
PMLR (2020)
3. Goodfellow, I., et al.: Generative adversarial nets. NIPS 27 (2014)
4. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NIPS 33,
6840–6851 (2020)

5. Hu, E.J., et al.: Lora: Low-rank adaptation of large language models. arXiv preprint
arXiv:2106.09685 (2021)
6. Huang, B., et al.: Implicit identity driven deepfake face swapping detection. In:
CVPR, pp. 4490–4499 (2023)
7. Jeong, Y., Kim, D., Min, S., Joe, S., Gwon, Y., Choi, J.: Bihpf: bilateral high-pass
filters for robust deepfake detection. In: WACV, pp. 48–57 (2022)
8. Jeong, Y., Kim, D., Ro, Y., Choi, J.: Frepgan: robust deepfake detection using
frequency-level perturbations. In: AAAI. vol. 36, pp. 1060–1068 (2022)
9. Koutlis, C., Papadopoulos, S.: Leveraging representations from intermediate
encoder-blocks for synthetic image detection. arXiv preprint arXiv:2402.19091
(2024)
10. Li, J., Li, D., Xiong, C., Hoi, S.: Blip: bootstrapping language-image pre-training
for unified vision-language understanding and generation. In: ICML, pp. 12888–
12900. PMLR (2022)
11. Liu, H., Tan, Z., Tan, C., Wei, Y., Wang, J., Zhao, Y.: Forgery-aware adaptive
transformer for generalizable synthetic image detection. In: CVPR, pp. 10770–
10780 (2024)
12. Liu, Z., Qi, X., Torr, P.H.: Global texture enhancement for fake face detection in
the wild. In: CVPR, pp. 8060–8069 (2020)
13. Loshchilov, I.: Decoupled weight decay regularization. arXiv preprint
arXiv:1711.05101 (2017)
14. Luo, Y., Du, J., Yan, K., Ding, S.: LaRE^2: latent reconstruction error based
method for diffusion-generated image detection. In: CVPR, pp. 17006–17015 (2024)
15. Mokady, R., Hertz, A., Bermano, A.H.: Clipcap: Clip prefix for image captioning.
arXiv preprint arXiv:2111.09734 (2021)
16. Ojha, U., Li, Y., Lee, Y.J.: Towards universal fake image detectors that generalize
across generative models. In: CVPR, pp. 24480–24489 (2023)
17. Radford, A., et al.: Learning transferable visual models from natural language
supervision. In: ICML, pp. 8748–8763. PMLR (2021)
18. Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y.: Frequency-aware deepfake
detection: Improving generalizability through frequency space domain learning.
In: AAAI, vol. 38, pp. 5052–5060 (2024)
19. Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y.: Rethinking the up-sampling
operations in cnn-based generative network for generalizable deepfake detection.
In: CVPR, pp. 28130–28139 (2024)
20. Tan, C., Zhao, Y., Wei, S., Gu, G., Wei, Y.: Learning on gradients: generalized
artifacts representation for gan-generated images detection. In: CVPR, pp. 12105–
12114 (2023)
21. Wang, S.Y., Wang, O., Zhang, R., Owens, A., Efros, A.A.: Cnn-generated images
are surprisingly easy to spot... for now. In: CVPR, pp. 8695–8704 (2020)
22. Wang, Z., et al.: Dire for diffusion-generated image detection. In: ICCV, pp. 22445–
22455 (2023)
23. Yan, Z., Luo, Y., Lyu, S., Liu, Q., Wu, B.: Transcending forgery specificity with
latent space augmentation for generalizable deepfake detection. In: CVPR, pp.
8984–8994 (2024)
24. Yuan, L., et al.: Tokens-to-token vit: training vision transformers from scratch on
imagenet. In: ICCV, pp. 558–567 (2021)
25. Zhu, M., et al.: Gendet: towards good generalizations for AI-generated image detec-
tion. arXiv preprint arXiv:2312.08880 (2023)
Minimalist Preprocessing Approach
for Image Synthesis Detection

Hoai-Danh Vo1,2 and Trung-Nghia Le1,2(B)
1 University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
[email protected], [email protected]
2 Vietnam National University, Ho Chi Minh City, Vietnam

Abstract. Generative models have significantly advanced image generation, resulting in synthesized images that are increasingly indistinguishable from authentic ones. However, the creation of fake images
with malicious intent is a growing concern. Low-configured smart devices
have become highly popular, making it easier for deceptive images to
reach users. Consequently, the demand for effective detection methods
is increasingly urgent. In this paper, we introduce a simple yet efficient
method that captures pixel fluctuations between neighboring pixels by
calculating the gradient, which highlights variations in grayscale inten-
sity. This approach functions as a high-pass filter, emphasizing key fea-
tures for accurate image distinction while minimizing color influence. Our
experiments on multiple datasets demonstrate that our method achieves
accuracy levels comparable to state-of-the-art techniques while requiring
minimal computational resources. Therefore, it is suitable for deploy-
ment on low-end devices such as smartphones. The code is available at
https://github.com/vohoaidanh/adof.

Keywords: Image synthesis detection · Lightweight model · Low-level computation

1 Introduction
In recent years, significant advancements in image generation have been achieved,
particularly with Generative Adversarial Networks (GANs) [11] and Diffusion
models [12, 14]. These approaches produce high-quality images that closely
resemble real-world visuals [31] and have garnered attention in academic and
societal circles. Generative models have found applications in various fields,
including virtual try-ons and personalized fashion recommendations in the fash-
ion industry [25], as well as in image editing [4, 39] and interior design [6].
Despite the valuable applications of image generation technology, signifi-
cant drawbacks exist. According to a survey conducted by Bauer and Bind-
schaedler [2], generative models can create fake information, particularly deep-
fakes, which depict fabricated scenarios involving famous individuals. In response
to these dangers, several US states [3, 20] have outlawed the malicious use of

Fig. 1. Comparison of different synthetic image detection methods on the Ojha dataset [28]. Our proposed method is simple yet efficient, significantly reducing FLOPs and total parameters while achieving accuracy comparable to state-of-the-art methods.

deepfake technology, especially for harmful content like revenge and celebrity
pornography. To address the threats posed by synthetic images on digital com-
munication platforms and social media, it is essential to develop effective coun-
termeasures for verifying image authenticity directly on mobile devices. Given
the ubiquity and portability of these devices, real-time detection of generated
images is crucial for preventing misinformation and preserving the integrity
of visual content. However, the constrained computational capacity of mobile
devices presents a significant challenge. This paper introduces a simple yet effi-
cient solution for synthesized image detection, specifically the Adjacency Differ-
ence Orientation Filter (ADOF) for data preprocessing; this filter allows us to
compute the gradient in both the x and y directions. The direction of the gradient
reflects the behavior of grayscale variation among neighboring pixels, assisting
in distinguishing between real and generated images. Focusing on extracting
useful low-level features, our approach ensures generalization while utilizing a
lightweight CNN architecture for detecting generated images, without demand-
ing extensive computational resources. This strategy effectively reduces irrele-
vant information, enabling the model to concentrate on fine-grained variations,
ultimately leading to improved performance and generalization. In contrast to
existing methods [21, 28, 35] that require large deep learning architectures, such
as CLIP [30], ViT [7], Resnet50 [13], and significant computational resources, our
approach demands fewer resources while still ensuring generalization and achiev-
ing comparable accuracy. Figure 1 presents a comparative overview of results,
highlighting the advantages of this strategy.
Experiments on well-known datasets [28, 36, 37] demonstrate the effectiveness
of our method, achieving impressive accuracy of 94.9% on the Ojha dataset [28]
and 98.3% on the DiffusionForensics dataset [37]. Additionally, there is a reduction in

computational load by 97.8% compared to RINE [21] and 57.8% compared to LGrad [35]. These results underscore the advantages of our approach, as illus-
trated in Fig. 1. The code for reproducing our results is publicly available at
https://github.com/vohoaidanh/adof. Our contributions are as follows:

– We introduce a simple yet efficient approach for detecting synthetic images, and our approach generalizes better than existing methods.
– We present a filter-based method that computes pixel intensity gradients to
capture pixel fluctuations and reduce color influence, leading to improved
model performance with faster inference speed and lower complexity.
– Our proposed method reduces the number of parameters and FLOPs while
maintaining accuracy compared to state-of-the-art methods.

2 Related Work
Various methods have been developed to address the challenge of distinguish-
ing synthetic images from real ones, utilizing both traditional machine learn-
ing techniques and modern deep learning approaches. Durall et al. [8] applied
a Fourier Transform [1] to grayscale images and used azimuthal averaging to
convert the 2D frequency data into a 1D feature vector, retaining essential infor-
mation for classification. They then employed either Support Vector Machines
or K-means clustering to detect GAN-generated images. Alternatively, methods
like RINE [21] and Ojha et al. [28], along with similar approaches, leverage pre-
trained deep learning networks such as CLIP to enhance performance. Integrating these networks into their frameworks contributes to consistently high success rates in detecting synthesized images. Notably,
the FatFormer [23] method focuses on the contrastive objectives between adapted
image features and text prompt embeddings, providing valuable information that
enables the deep learning models to learn more robust and discriminative rep-
resentations, ultimately improving their ability to accurately classify real and
generated images.

Frequency Domain-Based Methods involve transforming images from the spatial domain to the frequency domain using transformations such as Fast
Fourier Transform or Discrete Cosine Transform (DCT). By focusing on fre-
quency characteristics, these methods effectively capture artifacts that might not
be evident in the spatial domain. This allows classifiers to distinguish between
real and fake images by analyzing the unique patterns that emerge in the spectral
domain. Frank et al. [10] utilized the DCT to analyze images in the frequency
domain, revealing unique spectral differences between real images and those
produced by GAN models. Qian et al. introduced F3-Net [29] to decompose
the spectrum into various bands, enabling the analysis of these components to
identify unusual distributions. This method effectively detects subtle artifacts,
enhancing the ability to recognize synthetic image manipulations.

Tan et al. proposed FreqNet [33], which emphasizes high-frequency details and directs the detector to concentrate on these features across spatial and chan-
nel dimensions, rather than utilizing the full spectrum of frequency bands as is
common in many other approaches. BiHPF [16], a method by Jeong et al., ampli-
fies frequency-level artifacts commonly found in images generated by generative
models to tackle the challenge of identifying images from previously unobserved
models. Jeong et al. [15] generated perturbation maps added to training images
to prevent overfitting to frequency-specific features, reducing high-frequency
noise and enhancing classifier generalization.

Spatial-Based Methods analyze images directly on pixel values, as seen in models like CNNDetection [36] and Gram-Net [24]. A key issue, however, is that
raw images often contain excessive, irrelevant information, such as semantic con-
tent, which Zhong et al. [40] identified as detrimental to image classification effec-
tiveness. This extraneous information, unnecessary for distinguishing real from
fake images, can disrupt the model’s learning process and reduce its effectiveness.
Wang et al. [36] developed a comprehensive detector to distinguish real images
from CNN-generated ones [22]. Using a dataset of images from 11 CNN-based
generators, they showed that, with effective pre-/post-processing and data aug-
mentation, a classifier trained on ProGAN [18] generalizes well to other models,
including StyleGAN2 [19]. By dividing images into small patches categorized
as either rich texture or flat, PatchCraft [40] exploits the inter-pixel correlation
contrast between these regions. This approach breaks the semantic coherence
present in traditional methods, addressing a key limitation and enhancing the
model’s ability to generalize more effectively. Tan et al. introduced the con-
cept of Neighboring Pixel Relationships (NPR) [34] to capture and characterize
generalized structural artifacts that arise from up-sampling operations, which
are commonly used in image generation models to enhance image quality. This
method shows a significant improvement over other techniques within the same
approach.
Frequency-based approaches achieve faster convergence with smaller models
but may lose essential spatial information needed to distinguish real from gen-
erated images. In contrast, spatial-based methods often require larger models
and may struggle with new domain data. Our method leverages the strengths of
both approaches by focusing on pixel perturbations between neighboring pixels,
effectively discarding much of the pixel color information. Additionally, our filter
functions as a high-pass filter, removing low-frequency components to emphasize
the relevant features.

3 Proposed Method
3.1 Overview

Generative models, such as GAN [11] and Diffusion [14], currently use CNN lay-
ers for image synthesis, meaning that neighboring pixel regions are correlated

Fig. 2. The left image represents the original image, the middle shows the gradient
calculation applied, and the right image illustrates the resulting gradient map.

to a certain extent. We hypothesize that synthetic images exhibit a stronger correlation between adjacent pixels compared to real images. Furthermore, due
to the design of neural networks, noise in synthetic images tends to be aver-
aged out, whereas in real images, noise typically remains more prominent. To
investigate and evaluate the impact of noise on adjacent pixels, we designed a
simple yet effective filter to capture these variations. This filter aids the CNN in
learning the differences in noise distribution between real and synthetic images.
The overall architecture of ADOF is illustrated in Fig. 2. This approach cap-
tures noise information by calculating the differences between adjacent pixels
and incorporating these differences into the gradient to account for variations in
both the x and y directions.

3.2 Adjacency Difference Orientation Filter (ADOF)


Finite Difference. The finite difference technique [27, 38] is a mathematical
approach for estimating intensity variations across neighboring points in a grid
or matrix. In image processing, it calculates gradients by measuring differences
in pixel values, thereby detecting changes in intensity in both horizontal and
vertical directions. This technique aids in edge detection and texture analysis by
highlighting contrasts between adjacent pixels.
The general formula for calculating the gradient in a given direction u is \mathrm{Gradient}_u = I(x + \Delta x, y + \Delta y) - I(x, y), where \Delta x and \Delta y define the direction of the difference.

Filter Construction. We have applied the finite difference to compute the gradients of intensity in our images. This helps in identifying intensity variations at each pixel and supports the detection of important geometric features in our approach. The formulas that compute the differences between adjacent pixels along the x and y directions are given by:

D_x(x, y) = I(x + 1, y) - I(x, y),    (1)

D_y(x, y) = I(x, y + 1) - I(x, y),    (2)


where I represents the image in which the difference is being calculated, and D_x(x, y) represents the difference between the pixel value at (x + 1, y) and the pixel value

Fig. 3. Architecture of our lightweight model.

at (x, y). This filter captures variations in pixel intensity along the horizontal direction. Similarly, D_y(x, y) captures variations in pixel intensity along the vertical direction. The gradient magnitude and orientation are then computed from D_x and D_y:

G_m(x, y) = \sqrt{D_x(x, y)^2 + D_y(x, y)^2},    (3)

G_o(x, y) = \arctan\left(\frac{D_y(x, y)}{D_x(x, y)}\right),    (4)

where D_x(x, y) and D_y(x, y) are as previously defined. The gradient orientation G_o, which represents the overall angle of gray-level changes at a pixel and indicates the direction of these combined intensity variations, is referred to as the Adjacency Difference Orientation Filter (ADOF) in this paper. Meanwhile, the gradient magnitude G_m quantifies the strength of intensity changes at that pixel. The result of this computational process is illustrated in Fig. 2.
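
The filter can be sketched in a few lines of NumPy, as shown below; converting RGB inputs to grayscale by channel averaging and zero-padding the last row/column are our assumptions, since only the per-pixel differences are specified above.

import numpy as np

def adof(image: np.ndarray, eps: float = 1e-8):
    """Adjacency Difference Orientation Filter: orientation (and magnitude) of adjacent-pixel differences."""
    gray = image.mean(axis=-1) if image.ndim == 3 else image   # assumed grayscale conversion
    gray = gray.astype(np.float64)
    dx = np.zeros_like(gray)
    dy = np.zeros_like(gray)
    dx[:, :-1] = gray[:, 1:] - gray[:, :-1]    # D_x(x, y) = I(x + 1, y) - I(x, y)
    dy[:-1, :] = gray[1:, :] - gray[:-1, :]    # D_y(x, y) = I(x, y + 1) - I(x, y)
    g_m = np.sqrt(dx ** 2 + dy ** 2)           # gradient magnitude, Eq. (3)
    g_o = np.arctan(dy / (dx + eps))           # gradient orientation, Eq. (4), in (-pi/2, pi/2)
    return g_o, g_m

g_o, g_m = adof(np.random.rand(256, 256, 3))
print(g_o.shape, float(g_o.min()), float(g_o.max()))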

3.3 Lightweight Model Architecture


To evaluate the effectiveness of our filter ADOF on images, we use basic CNN
architectures. Specifically, this work employs a modified ResNet50 [13] model
with layer3 and layer4 removed. To capture information from 8-connected
neighboring pixels more effectively, the kernel size of the conv1 layer was adjusted
from 7 to 3. The architecture is depicted in Fig. 3.
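
A sketch of this truncated backbone using torchvision is given below; the stride and padding of the new stem convolution and the single-logit output head are assumptions, since only the kernel-size change and the removal of layer3/layer4 are stated above.

import torch
import torch.nn as nn
from torchvision.models import resnet50

def build_lightweight_resnet(num_classes: int = 1, in_channels: int = 3) -> nn.Module:
    model = resnet50(weights=None)
    # 3x3 stem instead of the default 7x7, to focus on 8-connected neighbouring pixels.
    model.conv1 = nn.Conv2d(in_channels, 64, kernel_size=3, stride=2, padding=1, bias=False)
    # Drop the two deepest stages to keep the model lightweight.
    model.layer3 = nn.Identity()
    model.layer4 = nn.Identity()
    # layer2 of ResNet50 outputs 512 channels, so the classifier head must match.
    model.fc = nn.Linear(512, num_classes)
    return model

net = build_lightweight_resnet()
print(sum(p.numel() for p in net.parameters()))  # on the order of 10^6 parameters
print(net(torch.randn(1, 3, 224, 224)).shape)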

4 Experiments
4.1 Implementation Details
In practice, we are more concerned with the flat regions of an image than with the edge areas, where there is a significant variation in gray levels between the x and y directions. This is because, in regions with large changes in gray level in one direction compared to the other, the gradient angles are close to \pm\pi/2. Although these angles both indicate edge regions in the image, the gradient angles at edges typically take values of \pm\pi/2, which are numerically distant from each other despite conveying similar edge information. To exclude these areas, we set gradient orientations approaching \pm\pi/2 to 0; our experiments have shown that this leads to higher accuracy for the model.
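
One possible way to implement this suppression is sketched below; the threshold value is a hypothetical choice, since the exact cut-off is not specified above.

import numpy as np

def suppress_edge_orientations(g_o: np.ndarray, tau: float = 0.1) -> np.ndarray:
    """Zero out ADOF orientations whose absolute value is within tau of pi/2."""
    out = g_o.copy()
    out[np.abs(np.abs(out) - np.pi / 2) < tau] = 0.0
    return out

filtered = suppress_edge_orientations(np.arctan(np.random.randn(8, 8)))
print(filtered.shape)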
All experiments are conducted on a computing system using a NVIDIA RTX
A4000 GPU with 16 GB of memory and an AMD Ryzen 5 5600X 6-Core CPU.
We trained our model using parameters that are closely aligned with those used
in common methods [34–36] to ensure a fair comparison and demonstrate the
effectiveness of our method independent of specific hyperparameters. Further-
more, we utilized the source code provided by NPR [34] to streamline the training
process and maintain consistency. The model was trained using the Adam opti-
mizer with a learning rate of 2 \times 10^{-4} and a batch size of 32. To accelerate the
training process, we adjusted the learning rate every 5 epochs instead of every
10 epochs and utilized 4 out of the 20 classes (car, cat, chair, horse) for training,
similar to the protocol used in existing works [15, 16, 34, 36].

4.2 Dataset
Training Set. To facilitate comparison between methods, we used the same ForenSynths dataset as existing methods [17, 28, 34–36]. This dataset consists of 20 object classes selected from the LSUN dataset. Each class contains 18,000 real-world images, with corresponding fake images generated using the ProGAN [18] model. To verify generalization, all compared methods were trained on a subset of the ForenSynths [36] dataset consisting of 4 classes: car, cat, chair, horse.

Evaluation Set. To investigate the generalization of methods, our evaluation was conducted using the Self-Synthesis 9 GANs [34], which contains 36,000
images sourced from LSUN, ImageNet, CelebA, CelebA-HQ, COCO, and Face-
Forensics++, generated using models like AttGAN, BEGAN, and CramerGAN.
The second dataset, DiffusionForensics [37], comprises 40,000 images from LSUN
and ImageNet, utilizing models such as ADM, DDPM, and IDDPM. Lastly, the
Ojha Test Set [28] includes 16,000 images from LAION and ImageNet, generated
with ADM, Glide, DALL-E-Mini, and LDM.

4.3 Comparison with State-of-the-Art Methods


We conduct a performance comparison of our method with 10 State-of-the-
Art methods, including CNNDetection [36], Frank [10], Durall [9], Patchfor [5],
F3Net [29] , SelfBland [32], GANDetection [26], LGrad [35] , Ojha [28], NPR [34].
The experimental results in Tables 1, 2, and 3 demonstrate that our method
exceeds existing approaches. On the 9-GAN dataset, ADOF delivers the highest
accuracy at 94.2%, surpassing Ojha [28] with a mere 77.6% 1, and NPR [34] at
93.2% (see Table 1). Notably, our approach achieves a remarkable 98.3% accu-
racy on the DiffusionForensics [37] dataset, outperforming the NPR method [34],

Table 1. Evaluation results on the Self-Synthesis 9 GANs [34].

Method AttGAN BEGAN CramerGAN InfoMaxGAN MMDGAN RelGAN S3GAN SNGAN STGAN Mean (each reported as Acc./A.P.)
CNNDetection [36] 51.1 83.7 50.2 44.9 81.5 97.5 71.1 94.7 72.9 94.4 53.3 82.1 55.2 66.1 62.7 90.4 63.0 92.7 62.3 82.9
Frank [10] 65.0 74.4 39.4 39.9 31.0 36.0 41.1 41.0 38.4 40.5 69.2 96.2 69.7 81.9 48.4 47.9 25.4 34.0 47.5 54.7
Durall [9] 39.9 38.2 48.2 30.9 60.9 67.2 50.1 51.7 59.5 65.5 80.0 88.2 87.3 97.0 54.8 58.9 62.1 72.5 60.3 63.3
Patchfor [5] 68.0 92.9 97.1 100.0 97.8 99.9 93.6 98.2 97.9 100.0 99.6 100.0 66.8 68.1 97.6 99.8 92.7 99.8 90.1 95.4
F3Net 85.2 94.8 87.1 97.5 89.5 99.8 67.1 83.1 73.7 99.6 98.8 100.0 65.4 70.0 51.6 93.6 60.3 99.9 75.4 93.1
SelfBland [32] 63.1 66.1 56.4 59.0 75.1 82.4 79.0 82.5 68.6 74.0 73.6 77.8 53.2 53.9 61.6 65.0 61.2 66.7 65.8 69.7
GANDetection [26] 57.4 75.1 67.9 100.0 67.8 99.7 67.6 92.4 67.7 99.3 60.9 86.2 69.6 83.5 66.7 90.6 69.6 97.2 66.1 91.6
LGrad [35] 68.6 93.8 69.9 89.2 50.3 54.0 71.1 82.0 57.5 67.3 89.1 99.1 78.5 86.0 78.0 87.4 54.8 68.0 68.6 80.8
Ojha [28] 78.5 98.3 72.0 98.9 77.6 99.8 77.6 98.9 77.6 99.7 78.2 98.7 85.2 98.1 77.6 98.7 74.2 97.8 77.6 98.8
NPR [34] 83.0 96.2 99.0 99.8 98.7 99.0 94.5 98.3 98.6 99.0 99.6 100.0 79.0 80.0 88.8 97.4 98.0 100.0 93.2 96.6
ADOF(ours) 99.5 100.0 92.2 100.0 96.0 99.6 94.1 99.1 96.0 99.7 100.0 100.0 77.5 86.7 94.8 99.3 97.8 99.7 94.2 98.2

Table 2. Evaluation results on the test set of DiffusionForensics dataset [37].

Method ADM DDPM IDDPM LDM PNDM VQ-Diffusion Stable Diffusion v1 Stable Diffusion v2 Mean (each reported as Acc./A.P.)
CNNDetection [36] 53.9 71.8 62.7 76.6 50.2 82.7 50.4 78.7 50.8 90.3 50.0 71.0 38.0 76.7 52.0 90.3 51.0 79.8
Frank [10] 58.9 65.9 37.0 27.6 51.4 65.0 51.7 48.5 44.0 38.2 51.7 66.7 32.8 52.3 40.8 37.5 46.0 50.2
Durall [9] 39.8 42.1 52.9 49.8 55.3 56.7 43.1 39.9 44.5 47.3 38.6 38.3 39.5 56.3 62.1 55.8 47.0 48.3
Patchfor [5] 77.5 93.9 62.3 97.1 50.0 91.6 99.5 100.0 50.2 99.9 100.0 100.0 90.7 99.8 94.8 100.0 78.1 97.8
F3Net [29] 80.9 96.9 84.7 99.4 74.7 98.9 100.0 100.0 72.8 99.5 100.0 100.0 73.4 97.2 99.8 100.0 85.8 99.0
SelfBland [32] 57.0 59.0 61.9 49.6 63.2 66.9 83.3 92.2 48.2 48.2 77.2 82.7 46.2 68.0 71.2 73.9 63.5 67.6
GANDetection [26] 51.1 53.1 62.3 46.4 50.2 63.0 51.6 48.1 50.6 79.0 51.1 51.2 39.8 65.6 50.1 36.9 50.8 55.4
LGrad [35] 86.4 97.5 99.9 100.0 66.1 92.8 99.7 100.0 69.5 98.5 96.2 100.0 90.4 99.4 97.1 100.0 88.2 98.5
Ojha [28] 78.4 92.1 72.9 78.8 75.0 92.8 82.2 97.1 75.3 92.5 83.5 97.7 56.4 90.4 71.5 92.4 74.4 91.7
NPR [34] 88.6 98.9 99.8 100.0 91.8 99.8 100.0 100.0 91.2 100.0 100.0 100.0 97.4 99.8 93.8 100.0 95.3 99.8
ADOF(ours) 93.5 99.0 99.6 100.0 99.2 100.0 99.9 100.0 97.4 99.9 97.1 99.8 99.8 100.0 99.9 100.0 98.3 99.8

Table 3. Evaluation results on the diffusion test set of Ojha [28].

Method DALLE Glide_100_10 Glide_100_27 Glide_50_27 ADM LDM_100 LDM_200 LDM_200_cfg Mean (each reported as Acc./A.P.)
CNNDetection [36] 51.8 61.3 53.3 72.9 53.0 71.3 54.2 76.0 54.9 66.6 51.9 63.7 52.0 64.5 51.6 63.1 52.8 67.4
Frank [10] 57.0 62.5 53.6 44.3 50.4 40.8 52.0 42.3 53.4 52.5 56.6 51.3 56.4 50.9 56.5 52.1 54.5 49.6
Durall [9] 55.9 58.0 54.9 52.3 48.9 46.9 51.7 49.9 40.6 42.3 62.0 62.6 61.7 61.7 58.4 58.5 54.3 54.0
Patchfor [5] 79.8 99.1 87.3 99.7 82.8 99.1 84.9 98.8 74.2 81.4 95.8 99.8 95.6 99.9 94.0 99.8 86.8 97.2
F3Net [29] 71.6 79.9 88.3 95.4 87.0 94.5 88.5 95.4 69.2 70.8 74.1 84.0 73.4 83.3 80.7 89.1 79.1 86.5
SelfBland [32] 52.4 51.6 58.8 63.2 59.4 64.1 64.2 68.3 58.3 63.4 53.0 54.0 52.6 51.9 51.9 52.6 56.3 58.7
GANDetection [26] 67.2 83.0 51.2 52.6 51.1 51.9 51.7 53.5 49.6 49.0 54.7 65.8 54.9 65.9 53.8 58.9 54.3 60.1
LGrad [35] 88.5 97.3 89.4 94.9 87.4 93.2 90.7 95.1 86.6 100.0 94.8 99.2 94.2 99.1 95.9 99.2 90.9 97.2
Ojha [28] 89.5 96.8 90.1 97.0 90.7 97.2 91.1 97.4 75.7 85.1 90.5 97.0 90.2 97.1 77.3 88.6 86.9 94.5
NPR [34] 94.5 99.5 98.2 99.8 97.8 99.7 98.2 99.8 75.8 81.0 99.3 99.9 99.1 99.9 99.0 99.8 95.2 97.4
RINE [21] 95.0 99.5 90.7 99.2 88.9 99.1 92.6 99.5 76.1 96.6 98.7 99.9 98.3 99.9 88.2 98.7 91.1 99.0
ADOF(ours) 92.1 98.3 98.6 100.0 98.7 100.0 98.4 99.9 75.9 87.6 98.8 100.0 98.6 99.9 98.5 99.9 94.9 98.2

which only reaches 95.3% (see Table 2). It also surpasses DIRE [37], which reports
97.9% accuracy on its own dataset, despite our model being trained on Foren-
synths [22], in contrast to DIRE’s training on DiffusionForensics. Additionally,
this approach exceeds both RINE [21] and Ojha [28] (see Table 3), with the latter
achieving 91.1% on its dataset [28]. It is worth mentioning that both methods
utilize a large CLIP model for their evaluations.

4.4 Computational Efficiency Evaluation


We conduct a comparative analysis of several representative methods based on
their computational efficiency and resource requirements. Specifically, we evalu-
ate and compare the following metrics:

Fig. 4. Standard pipeline for Image Synthesis Methods.

Table 4. Resource usage and performance of synthetic image detection methods on the DiffusionForensics dataset [37]. The method marked with † indicates it was trained on this dataset.

Method Parameters Processing (ms) Inference Time (ms) FLOPs Mean (Acc/AP)
LGrad [35] 25.56 × 106 11.6 4.81 4.12 × 109 88.2/98.5
DIRE† [37] 25.56 × 106 4,502.7 4.81 4.12 × 109 97.9/100
Ojha [28] 427.62 × 106 None 29.19 77.83 × 109 74.4/91.7
ADOF(ours) 1.44 × 106 0.40 2.43 1.74 × 109 98.3/99.8

– Number of Parameters: We quantify the total number of parameters for each model to assess its complexity.
– Input Processing Time: We measure the time required for processing
images before they are fed into the model, where these processing steps are
tailored to the specific method used (See Fig. 4).
– Inference Time: We record the time taken for the model to process an
image and produce a result.
– FLOPs (Floating Point Operations): We leveraged the fvcore library to estimate the FLOPs required by each model during inference, providing valuable insights into their computational demands (a measurement sketch is given below).
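
The FLOPs and parameter counts can be reproduced with fvcore as sketched below; the ResNet50 instance is only a stand-in for the detector under test, and the 224 x 224 input resolution is an assumption.

import torch
from fvcore.nn import FlopCountAnalysis, parameter_count
from torchvision.models import resnet50

model = resnet50(weights=None).eval()          # stand-in; substitute the detector being measured
dummy = torch.randn(1, 3, 224, 224)

flops = FlopCountAnalysis(model, dummy)
print(f"FLOPs: {flops.total():.3e}")
print(f"Parameters: {parameter_count(model)['']:,}")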
Our method requires substantially fewer parameters and FLOPs while
achieving faster inference and the highest mean accuracy (98.3%) compared to
existing methods, including DIRE [37] (see Table 4), which is trained on the
same dataset but does not achieve comparable performance. This demonstrates
its superior performance in synthetic image detection.

5 Conclusion
In this paper, we proposed a simple yet highly effective filter, namely ADOF,
for capturing pixel-level variations. By treating an image as a discrete digi-
tal signal, this method eliminates the average components of the signal. These
components typically carry semantic information, which is less helpful for distin-
guishing between real and synthetic images compared to the subtle traces that
the proposed filter is designed to detect. Experimental results indicate that our
proposed method significantly reduces model complexity while enhancing both
accuracy and generalization, even on previously unseen data.

Acknowledgment. This research is funded by Vietnam National University - Ho Chi Minh City (VNU-HCM) under Grant Number C2024-18-25.

References
1. Arunachalam, S., Khairnar, S., Desale, B.: The fast fourier transform algorithm
and its application in digital image processing. New J. Chem. 35(5) (2013)
2. Bauer, L.A., Bindschaedler, V.: Generative models for security: Attacks, defenses,
and opportunities. arXiv:2107.10139 (2021)
3. Curtis, C.: California makes deepfakes illegal to curb revenge porn and doctored
political videos (2019). https://bit.ly/4f40oaX. Accessed 24 Sept 2024
4. Casteleiro-Pitrez, J.: Generative artificial intelligence image tools among future
designers: a usability, user experience, and emotional analysis. Digital 4(2), 316–
332 (2024)
5. Chai, L., Bau, D., Lim, S.N., Isola, P.: What makes fake images detectable? under-
standing properties that generalize. In: European Conference on Computer Vision
(2020)
6. Chen, Z., Wang, X.: Application of AI technology in interior design 179 (2020)
7. Dosovitskiy, A., et al.: An image is worth 16x16 words: Transformers for image
recognition at scale. CoRR abs/2010.11929 (2020)
8. Durall, R., Keuper, M., Pfreundt, F.J., Keuper, J.: Unmasking deepfakes with
simple features. ArXiv abs/1911.00686 (2019)
9. Durall, R., Keuper, M., Keuper, J.: Watch your up-convolution: Cnn based gen-
erative deep neural networks are failing to reproduce spectral distributions, pp.
7890–7899 (2020)
10. Frank, J.C., Eisenhofer, T., Schönherr, L., Fischer, A., Kolossa, D., Holz, T.: Lever-
aging frequency analysis for deep fake image recognition. ArXiv (2020)
11. Goodfellow, I.J., et al.: Generative adversarial networks. Commun. ACM 63, 139–
144 (2014)
12. Gu, S., et al.: Vector quantized diffusion model for text-to-image synthesis. In:
CVPR, pp. 10696–10706 (2022)
13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition,
pp. 770–778 (2016)
14. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. ArXiv
abs/2006.11239 (2020)
15. Jeong, Y., Kim, D., Ro, Y., Choi, J.: Frepgan: robust deepfake detection using
frequency-level perturbations. In: AAAI Conference on Artificial Intelligence (2022)

16. Jeong, Y., Kim, D., Min, S., Joe, S., Gwon, Y., Choi, J.: Bihpf: bilateral high-pass
filters for robust deepfake detection. In: WACV, pp. 48–57 (2022)
17. Ju, Y., Jia, S., Ke, L., Xue, H., Nagano, K., Lyu, S.: Fusing global and local features
for generalized AI-synthesized image detection. In: ICIP, pp. 3465–3469 (2022)
18. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for
improved quality, stability, and variation. ArXiv abs/1710.10196 (2017)
19. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing
and improving the image quality of stylegan, pp. 8110–8119 (2020)
20. Korosec, K.: Deepfake revenge porn is now illegal in virginia (2019). https://
techcrunch.com/2019/07/01/deepfake-revenge-porn-is-now-illegal-in-virginia/.
Accessed 24 Sep 2019
21. Koutlis, C., Papadopoulos, S.: Leveraging representations from intermediate
encoder-blocks for synthetic image detection. arXiv:2402.19091 (2024)
22. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. Commun. ACM 60, 84–90 (2012)
23. Liu, H., Tan, Z., Tan, C., Wei, Y., Wang, J., Zhao, Y.: Forgery-aware adaptive
transformer for generalizable synthetic image detection. In: CVPR, pp. 10770–
10780 (2024)
24. Liu, Z., Qi, X., Torr, P.H.: Global texture enhancement for fake face detection in
the wild. In: CVPR, pp. 8060–8069 (2020)
25. Lomov, I., Makarov, I.: Generative models for fashion industry using deep neural
networks. In: ICCAIS, pp. 1–6. IEEE (2019)
26. Mandelli, S., Bonettini, N., Bestagini, P., Tubaro, S.: Detecting gan-generated
images by orthogonal training of multiple cnns, pp. 3091–3095 (2022)
27. Mickens, R.E.: Difference equations: theory, applications and advanced topics. CRC
Press (2015)
28. Ojha, U., Li, Y., Lee, Y.J.: Towards universal fake image detectors that generalize
across generative models, pp. 24480–24489 (2023)
29. Qian, Y., Yin, G., Sheng, L., Chen, Z., Shao, J.: Thinking in frequency: face forgery
detection by mining frequency-aware clues. ArXiv abs/2007.09355 (2020)
30. Radford, A., et al.: Learning transferable visual models from natural language
supervision, pp. 8748–8763 (2021)
31. Britannica for Schools: Spotting AI: knowing how to recognise real vs AI images. https://
elearn.eb.com/real-vs-ai-images/ (2024). Accessed 21 Aug 2024
32. Shiohara, K., Yamasaki, T.: Detecting deepfakes with self-blended images, pp.
18720–18729 (2022)
33. Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y.: Frequency-aware deepfake
detection: Improving generalizability through frequency space domain learning
38(5), 5052–5060 (2024)
34. Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y.: Rethinking the up-sampling
operations in cnn-based generative network for generalizable deepfake detection,
pp. 28130–28139 (2024)
35. Tan, C., Zhao, Y., Wei, S., Gu, G., Wei, Y.: Learning on gradients: generalized arti-
facts representation for gan-generated images detection, pp. 12105–12114 (2023)
36. Wang, S.Y., Wang, O., Zhang, R., Owens, A., Efros, A.A.: Cnn-generated images
are surprisingly easy to spot... for now, pp. 8695–8704 (2020)
37. Wang, Z., et al.: Dire for diffusion-generated image detection, pp. 22445–22455
(2023)
38. Wikipedia Contributors: finite difference (2024). https://en.wikipedia.org/wiki/
Finite_difference. Accessed 21 Aug 2024

39. Wootaek Shin, P., Ahn, J.J., Yin, W., Sampson, J., Narayanan, V.: Can prompt
modifiers control bias? a comparative analysis of text-to-image generative models.
arXiv e-prints (2024)
40. Zhong, N., Xu, Y., Li, S., Qian, Z., Zhang, X.: Patchcraft: exploring texture patch
for efficient AI-generated image detection (2024)
KidRisk: Benchmark Dataset for Children
Dangerous Action Recognition

Minh-Kha Nguyen1,2, Trung-Hieu Do1,2, Kim Anh Phung3, Thao Thi Phuong Dao1,2,4, Minh-Triet Tran1,2, and Trung-Nghia Le1,2(B)
1 University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
{20120502,20120007}@student.hcmus.edu.vn, [email protected], [email protected]
2 Vietnam National University, Ho Chi Minh City, Vietnam
[email protected]
3 PrimeLabs LLC, Draper, USA
4 Thong Nhat Hospital, Ho Chi Minh City, Vietnam

Abstract. Children are naturally energetic, and during their spontaneous activities, they often encounter potentially dangerous situations, especially when lacking parental supervision. Identifying actions that pose risks plays a crucial role in ensuring their safety. This paper builds a novel challenging dataset, namely KidRisk, including 2,500 short videos of children’s actions and 10,000 images of dangerous actions of children. We also introduce a benchmark on our newly constructed dataset and find that traditional deep learning models demonstrate limited effectiveness
on these tasks. Therefore, we develop vision-language based baselines
with exceptional context understanding of visual information. Our pro-
posed methods achieved an accuracy of 83.53% in classifying children’s
actions and 96.14% in recognizing children’s dangerous actions, signifi-
cantly outperforming traditional approaches. These results confirm that
vision-language models are not only feasible but also highly effective in
detecting hazardous actions, contributing positively to safeguarding chil-
dren’s safety.

Keywords: Action recognition · dangerous action recognition · vision-language model

1 Introduction

Action recognition is a crucial research area in computer vision that involves identifying and classifying actions from images or videos. In the context of child
monitoring, recognizing dangerous actions is of paramount importance to ensure
safety. Children often exhibit curiosity and lack of awareness about their sur-
roundings, leading to potentially hazardous situations such as falls, collisions,
M.-K. Nguyen, T.-H. Do—These authors contributed equally to this research.
c The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025
W. Buntine et al. (Eds.): SOICT 2024, CCIS 2350, pp. 100–111, 2025.
https://doi.org/10.1007/978-981-96-4282-3_9

or contact with sharp objects. With the advancement of technology, camera


systems are becoming increasingly common for monitoring purposes. However,
these systems typically require constant human supervision, which can be both
demanding and impractical. An intelligent monitoring system can timely detect
these actions and alert caregivers to potential risks.
Despite significant advances in action recognition, applying this technology
in child monitoring presents several challenges. One major issue is that existing
action recognition models are typically trained on adult datasets,
resulting in poor performance when applied to child-related data. Additionally,
there is a scarcity of data on dangerous actions in children, making it difficult
to train deep learning models effectively. Moreover, many current systems over-
look the context and environmental factors, leading to inaccurate assessments of
danger levels.
The rapid development of vision-language models has introduced powerful
tools for understanding and interpreting complex visual contexts, which is par-
ticularly promising for enhancing child safety. Models such as CLIP [16], ALIGN
[8], and BLIP [11] exemplify this advancement. These models excel in linking
visual data with textual descriptions and can interpret and generate nuanced
contextual information. We assume that applying such advanced vision-language
models to child monitoring may enhance broader contextual understanding of
dangerous situations, thereby enhancing child safety.
This paper aims to propose a promising approach capable of accurately rec-
ognizing and analyzing dangerous actions. To this end, we construct a new
dataset called KidRisk focused on dangerous actions in children, built upon the
InfAct dataset [7]. The KidRisk dataset consists of 2,500 short videos of chil-
dren’s actions and 10,000 images of dangerous actions involving children. The dataset
can capture a range of dangerous actions specific to children, enhancing the
model’s ability to generalize from limited examples. Our dataset also addresses
issues such as the limited number of samples for each action and the imbalance
between unsafe and safe labels. We also introduce a benchmark with state-of-
the-art action recognition methods. Furthermore, we propose using the BLIP-2
[10] model for transfer learning to maximize the effectiveness of children danger-
ous action recognition. Additionally, to further improve the model’s versatility,
we utilize zero-shot learning techniques for classifying actions it has not seen
during training.
Extensive experiments on the newly constructed dataset demonstrate the
efficacy of our proposed method. Specifically, the integration of BLIP-2 with
LSTM significantly enhances the performance of action recognition and danger
situation detection. With this approach, we achieved a remarkable accuracy of
96.1% in detecting dangerous situations, showcasing a substantial improvement
over traditional models. For action recognition, our method attained an accu-
racy rate of 83.5%. These results highlight not only the effectiveness of transfer
learning in leveraging pre-trained vision-language models but also its ability to
adapt effectively to specific datasets. Our approach shows that, even with limited
training data, it is possible to achieve high accuracy and reliability. This under-

scores the practicality of our solution for real-world applications, particularly


in monitoring and ensuring child safety, where accurate and timely detection of
dangerous actions is critical.
Our contributions are as follows:
– We present a new dataset designed for both action recognition and dan-
ger situation recognition in children’s activities. This dataset, coupled with
a benchmark, tackles real-world challenges such as the scarcity of samples
per action and the imbalance between safe and unsafe labels, ensuring more
realistic scenario representation.
– We propose simple yet efficient baselines leveraging the BLIP-2 model. Our pro-
posed methods excel in capturing contextual information surrounding chil-
dren, achieving strong performance in recognizing both general actions and
dangerous situations.

2 Related Work
2.1 Action Recognition
Chen et al. [4] introduced various models based on convolutional neural net-
works (CNN) and achieved high accuracy in action recognition. This approach
can be divided into two main types. 2D CNNs use 2D filters to process each video
frame independently, capturing mainly spatial information without explicitly
modeling temporal relationships; their advantage lies in smaller model size and
lower computational cost. 3D CNNs use 3D filters to process the video volume,
capturing both spatial and temporal information, but they are larger and more
computationally expensive than 2D CNNs. Meanwhile, Lin
et al. [13] proposed Temporal Shift Module (TSM), capturing spatio-temporal
information similar to 3D-CNN models but with computational costs equivalent
to 2D-CNNs. Specifically, uni-directional TSM was developed to handle online
video processing by only using information from past frames.
CNNs perform well in action recognition. Some studies indicate improved accuracy
with 3D CNNs compared to 2D CNNs; however, despite their ability to capture
both spatial and temporal information, 3D CNNs do not significantly outperform
2D CNNs in terms of accuracy. Research suggests that both 2D CNNs and
3D CNNs exhibit similar behavior regarding the learning of spatio-temporal
representations and the transfer of knowledge to new tasks.
On the other hand, recurrent neural networks (RNN) and their variant
Long Short-Term Memory (LSTM) have become powerful tools in video analysis,
particularly action recognition. RNNs are effective at capturing temporal rela-
tionships between frames, while LSTMs are designed to overcome the vanishing
gradient problem of RNNs. However, RNNs/LSTMs also have some limitations,
such as high computational costs and issues with vanishing/exploding gradients,
though LSTM mitigates this to some extent. Some improved methods have been
proposed, such as combining CNN and LSTM to reduce computational costs and
improve performance. Several works [5, 17] introduced advancements in terms of
computation and performance.

Fig. 1. Transition period between crawl posture and stand posture.

Vision Transformer (ViT) [6] leveraged the attention mechanism to analyze


relationships between parts of an image. However, ViT faces challenges in cap-
turing temporal information and comes with high computational costs. Recent
works, such as TimeSformer [3] and ViViT [2], have been proposed to address
these limitations by modeling spatio-temporal dependencies in videos.

Graph Convolutional Networks (GCN) are the primary tool for analyzing
skeletal graphs and recognizing human actions. GCN helps capture the spatial
relationships between body parts. Many works [12, 20] used GCN to extract fea-
tures from skeleton graphs and achieved positive results. However, GCN models
struggle to capture complex temporal information from actions. Some studies
have employed attention mechanisms or separate streams for spatial and tem-
poral information to enhance the ability to capture temporal features.

2.2 Hazard Recognition

Research on hazard recognition often focuses on analyzing actions to predict


safety risks. Wang et al. [18] proposed a method for analyzing joint connec-
tions between body parts to quickly and accurately identify hazardous actions.
However, this method requires a large amount of well-annotated data and does
not capture information about the surrounding environment. Meanwhile, Nie et
al. [15] proposed combining action recognition with object detection to assess
hazard levels, particularly in children. However, combining both factors requires
more complex system processing and additional research to develop more effi-
cient approaches for hazard recognition.

3 Methodology
3.1 Proposed KidRisk Dataset
To address the challenges in child action recognition and safety detection, we
propose a new dataset comprising two parts: children’s action videos and chil-
dren’s safety images.

Children’s Action Videos are developed based on the InfAct dataset [7],
which consists of short video clips capturing two actions performed by children
with a transition in between, such as sitting, standing, lying down, and crawling.
We extend this dataset by extracting and labeling additional video segments
from the source, focusing on basic child actions. After processing, the videos are
trimmed into shorter clips (up to 5 s), each containing only a single action, with
the transition periods removed (illustrated by Fig. 1).

Children’s Safety Images are compiled from various sources, including more
than 10,000 images depicting children in safe and dangerous situations. These
images are categorized into two groups: “Safe” and “Dangerous,” with dangerous
scenarios including children playing near stairs, handling sharp objects, or being
in situations near swimming pools. To increase the dataset’s diversity, we sup-
plemented it with additional dangerous situations, creating a rich dataset that
accurately reflects the real-world risks children may encounter (see Fig. 2).

3.2 Proposed Baselines


Vision-language models have made significant advancements in recent years.
However, these models require high computational costs during the training pro-
cess. In this paper, we utilize the BLIP-2 model, proposed by Li et al. [10], to
leverage pre-trained components such as a Vision Encoder and a Large Lan-
guage Model (LLM) with frozen parameters, aiming to reduce computational
costs. The Query Transformer (Q-Former) connects information between images
and text via a cross-attention mechanism, improving the model’s ability to pro-
duce contextually accurate outputs (see Fig. 3a). The model achieves high per-
formance with only 188M parameters, significantly fewer than SimVLM (1.4B)
and Flamingo (10.2B) (see Fig. 3b).

Zero-Shot Learning Classification. The zero-shot learning method reduces


dependence on labeled datasets, making it especially effective when a training
set for specific actions is unavailable. In this study, we apply the BLIP-2 model
to classify actions in children without prior training (see Fig. 4).
Zero-shot learning leverages the power of pre-trained models to identify
actions by comparing the similarity between input images or videos and target
action labels, without requiring direct training on the target datasets. The BLIP-
2 model uses a Vision Transformer (ViT) to extract features from images/videos.

Fig. 2. Examples of children’s safety images.

Fig. 3. BLIP-2 framework.

These features are passed through the Q-Former block, which employs a cross-
attention mechanism to link visual and textual information, creating feature vec-
tors that represent the visual information. Similarly, action labels or dangerous
situations are converted into corresponding feature vectors of the same length
using the Q-Former. Cosine similarity (Eq. 1) is then applied to compare
the visual and textual vectors, helping to determine the closest matching action

Fig. 4. Zero-shot learning for action classification.

label or dangerous situation based on the highest similarity score:


$$\text{similarity} = \frac{V_{\text{vision}} \cdot V_{\text{label}}}{\|V_{\text{vision}}\| \, \|V_{\text{label}}\|} \qquad (1)$$
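
To illustrate how Eq. 1 drives the zero-shot classifier, the following minimal Python sketch scores a visual embedding against a set of label embeddings and picks the best match. It is only a sketch: random vectors stand in for the BLIP-2/Q-Former outputs, and the embedding dimension (256) and label names are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def cosine_similarity(v_vision: np.ndarray, v_label: np.ndarray) -> float:
    """Eq. 1: cosine similarity between a visual and a textual embedding."""
    return float(np.dot(v_vision, v_label) /
                 (np.linalg.norm(v_vision) * np.linalg.norm(v_label) + 1e-8))

def zero_shot_classify(v_vision: np.ndarray, label_embeddings: dict) -> str:
    """Return the label whose embedding is closest to the visual embedding."""
    scores = {name: cosine_similarity(v_vision, v) for name, v in label_embeddings.items()}
    return max(scores, key=scores.get)

# Toy usage with random vectors standing in for BLIP-2 / Q-Former outputs.
rng = np.random.default_rng(0)
labels = {"sitting": rng.normal(size=256), "crawling": rng.normal(size=256)}
image_vec = rng.normal(size=256)
print(zero_shot_classify(image_vec, labels))
```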

Vision-Language Transfer Learning. In the task of recognizing children’s


actions in videos, we propose a transfer learning method using BLIP-2 as a
feature extractor to obtain feature vectors for each frame (see Fig. 5). These
vectors are then fed into an LSTM network, which helps the model understand
the temporal relationships between frames. Finally, the information is processed
through linear layers for action classification. This process not only enhances the
accuracy of recognizing children’s actions but also reduces the requirement for
extensive training data, as the model has been pre-trained on various tasks. This
offers significant benefits in improving action recognition capabilities without the
need to collect a large dataset.
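
A minimal sketch of this transfer-learning pipeline is given below, assuming per-frame feature vectors have already been extracted by a frozen BLIP-2 encoder; the 768-dimensional feature size, hidden size, and four action classes are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

class FrameSequenceClassifier(nn.Module):
    """LSTM head over per-frame feature vectors (e.g., from a frozen BLIP-2 encoder)."""

    def __init__(self, feat_dim: int = 768, hidden_dim: int = 256, num_classes: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 128), nn.ReLU(), nn.Linear(128, num_classes)
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feat_dim)
        _, (h_n, _) = self.lstm(frame_feats)
        return self.head(h_n[-1])  # logits over action classes

# Toy usage: a batch of 2 clips, 5 sampled frames each, 768-d features per frame.
model = FrameSequenceClassifier()
logits = model(torch.randn(2, 5, 768))
print(logits.shape)  # torch.Size([2, 4])
```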

Fig. 5. Transfer learning for action recognition.

For danger detection, the study suggests an approach based on process-


ing each frame of the video independently to determine whether it contains a

dangerous situation. By applying transfer learning with BLIP-2, the model can
extract important features from each frame, which are then input into classi-
fication layers with a sigmoid activation function to predict the probability of
danger. This approach not only improves the accuracy of detecting dangerous
situations but also enhances the safety of children in everyday activities. Mon-
itoring and analyzing each moment allows parents or caregivers to intervene
promptly, reducing the risk of accidents.
Our training process involves several key steps to ensure that the model’s
parameters are optimized for achieving the best performance in action recogni-
tion and danger detection tasks. First, the input data undergoes preprocessing.
Images from the video are normalized to fit the input format of the BLIP-2
model, typically including resizing to the standard size of (224, 224) and nor-
malizing pixel values. For video data, to reduce load and retain important infor-
mation, only representative frames are selected from each second of the video for
processing. During the training process, the loss function plays a crucial role in
guiding the model to optimize its parameters. For the danger detection task, the
Binary Cross-Entropy (BCE) loss function is used. This function is suitable for
binary classification problems, where it compares the model’s predicted proba-
bilities with the actual labels. The BCE loss formula helps the model adjust its
parameters so that predictions are as close as possible to the true labels. For the
action recognition task, the Cross-Entropy loss function is applied, allowing the
model to accurately classify actions across multiple classes. A significant factor in
the training process is the issue of overfitting. To address this, L2 regularization
is employed. L2 regularization helps mitigate the risk of the model’s parame-
ters becoming too large, thereby enhancing the model’s ability to generalize to
unseen data. The regularization coefficient is adjusted to control the impact of
regularization on the loss function.
One notable challenge in training is data imbalance, especially in the case of
danger detection. Typically, the number of samples labeled as dangerous is much
fewer than those labeled as safe, leading the model to become biased toward
the more prevalent class. To overcome this, samples labeled as dangerous are
augmented by duplicating them, thus balancing the quantity with safe samples.
The training process runs with a learning rate of α = 0.0001, and the compu-
tational resources used include a single T4 GPU, allowing the model to optimize
effectively for both tasks.
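
The danger-detection training configuration described above can be sketched as follows. This is a hedged illustration only: the feature dimension, duplication factor, and weight-decay value are assumptions; the paper specifies a sigmoid output with BCE loss, L2 regularization, oversampling of dangerous samples by duplication, and a learning rate of 0.0001.

```python
import torch
import torch.nn as nn

# Binary danger classifier over per-frame features (feature size is an assumption).
classifier = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1))
criterion = nn.BCEWithLogitsLoss()                        # BCE over sigmoid outputs
optimizer = torch.optim.Adam(classifier.parameters(),
                             lr=1e-4, weight_decay=1e-5)  # L2 regularization

def oversample_dangerous(features, labels, factor: int = 3):
    """Duplicate 'dangerous' samples to counter class imbalance (factor is illustrative)."""
    mask = labels == 1
    feats = torch.cat([features] + [features[mask]] * (factor - 1))
    labs = torch.cat([labels] + [labels[mask]] * (factor - 1))
    return feats, labs

# One illustrative optimization step on random stand-in features.
feats, labs = oversample_dangerous(torch.randn(32, 768),
                                   torch.randint(0, 2, (32,)).float())
loss = criterion(classifier(feats).squeeze(1), labs)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```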

4 Experimental Results
4.1 Children’s Action Classification
The challenge of not having access to a large-scale dataset for children’s action
recognition highlights the motivation for developing a zero-shot learning app-
roach for classifying children’s actions. In this study, we tested several advanced
models, including S3D, Alpro, and BLIP-2, to evaluate their effectiveness in clas-
sifying children’s actions using zero-shot learning. Models utilizing the ViT back-
bone demonstrated significantly higher performance compared to those based on

CNN backbones. Among these, BLIP-2 showed superior capabilities compared


to other methods (see Table 1).

Table 1. Experimental results of action classification models based on zero-shot learn-


ing on the children’s action videos.

Methods Backbone Accuracy


S3D [14] 3D-CNN 23.40%
Alpro [9] ViT 32.10%
BLIP-2 ViT 62.21%
BLIP-2 + LSTM ViT 83.5%

While action recognition using zero-shot learning with the BLIP-2 model
has achieved some impressive results, it cannot yet be considered truly effective
in classifying children’s actions. Specifically, the performance of this method
remains limited, suggesting that the lack of contextual information from train-
ing data can hinder the model’s ability to accurately recognize complex actions.
However, when applying transfer learning, the results obtained are highly promis-
ing. Fine-tuning the BLIP-2 model on a specific dataset has led to a significant
improvement in classification performance, with accuracy increasing by 21.3%
compared to the previous zero-shot learning method (see Table 1). This demon-
strates that using transfer learning not only allows the model to learn from
the features of the target data but also enhances its generalization ability and
accuracy in recognizing children’s actions.

Table 2. Experimental results of transfer learning and other experiments on the danger
situation images.

Methods Accuracy
Resnet 85.1%
ViT 75.4%
BLIP-2 (zero-shot) 56.1%
BLIP-2 + Transfer learning 96.1%

4.2 Children’s Danger Recognition


The pretrained BLIP-2 model has not yet demonstrated effective capabilities
in detecting dangerous situations. Specifically, the zero-shot learning method
shows relatively low accuracy, while traditional models like ViT and Resnet
achieve better results. However, after applying transfer learning, the BLIP-2

model has proven to possess superior detection capabilities, with significantly


higher accuracy compared to traditional models on the dataset labeled with
dangerous and safe situations (see Table 2).
We also utilized attention maps to interpret the predictions of the BLIP-
2 model (see Fig. 6), revealing how the model focuses on important regions in
the images to detect dangerous situations. The differences between the methods
become evident when applying traditional deep learning models, highlighting
that leveraging transfer learning has significantly improved the detection capa-
bilities for children’s safety in potentially dangerous scenarios.

Fig. 6. Visualized attention maps of BLIP-2 for action recognition.

5 Conclusion
In this paper, we introduced the comprehensive KidRisk dataset, encompassing
video clips of children’s actions and images of hazardous situations, designed to
push the boundaries of risk recognition in children’s activities. We also developed
a simple yet efficient approach for identifying dangerous actions in children
through the use of the vision-language BLIP-2 model. Our experimental findings
reveal that integrating BLIP-2 with transfer learning not only delivers strong
performance but also underscores the potential of vision-language models in
real-world applications. These results highlight the feasibility of vision-language models

in advancing child safety, paving the way for more intelligent, context-aware
monitoring systems capable of preemptively identifying and mitigating risks in
unsupervised environments.

Acknowledgement. This research is supported by research funding from Faculty of


Information Technology, University of Science, Vietnam National University - Ho Chi
Minh City.

References
1. Alayrac, J.B., et al.: Flamingo: a visual language model for few-shot learning. Adv.
Neural. Inf. Process. Syst. 35, 23716–23736 (2022)
2. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: Vivit: A
video vision transformer. In: Proceedings of the IEEE/CVF International Confer-
ence on Computer Vision, pp. 6836–6846 (2021)
3. Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for
video understanding? In: ICML, vol. 2, p. 4 (2021)
4. Chen, C.F.R., et al.: Deep analysis of cnn-based spatio-temporal representations
for action recognition. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 6165–6175 (2021)
5. Cho, K., et al.: Learning phrase representations using rnn encoder-decoder for
statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
6. Dosovitskiy, A., et al.: An image is worth 16x16 words: Transformers for image
recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
7. Huang, X., et al.: Posture-based infant action recognition in the wild with very
limited data. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 4912–4921 (2023)
8. Jia, C., et al.: Scaling up visual and vision-language representation learning with
noisy text supervision. In: International Conference on Machine Learning, pp.
4904–4916. PMLR (2021)
9. Li, D., Li, J., Li, H., Niebles, J.C., Hoi, S.C.: Align and prompt: Video-and-language
pre-training with entity prompts. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp. 4953–4963 (2022)
10. Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: bootstrapping language-image pre-
training with frozen image encoders and large language models. In: International
Conference on Machine Learning, pp. 19730–19742. PMLR (2023)
11. Li, J., Li, D., Xiong, C., Hoi, S.: Blip: bootstrapping language-image pre-training
for unified vision-language understanding and generation. In: International Con-
ference on Machine Learning, pp. 12888–12900. PMLR (2022)
12. Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., Tian, Q.: Actional-structural
graph convolutional networks for skeleton-based action recognition. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
3595–3603 (2019)
13. Lin, J., Gan, C., Han, S.: Tsm: temporal shift module for efficient video under-
standing. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision, pp. 7083–7093 (2019)

14. Miech, A., Alayrac, J.B., Smaira, L., Laptev, I., Sivic, J., Zisserman, A.: End-to-end
learning of visual representations from uncurated instructional videos. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 9879–9889 (2020)
15. Nie, Q., Wang, X., Wang, J., Wang, M., Liu, Y.: A child caring robot for the
dangerous behavior detection based on the object recognition and human action
recognition. In: 2018 IEEE International Conference on Robotics and Biomimetics
(ROBIO), pp. 1921–1926. IEEE (2018)
16. Radford, A., et al.: Learning transferable visual models from natural language
supervision. In: International Conference on Machine Learning, pp. 8748–8763.
PMLR (2021)
17. Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.c.: Convolu-
tional lstm network: A machine learning approach for precipitation nowcasting.
In: Advances in Neural Information Processing Systems, vol. 28 (2015)
18. Wang, C., Zhang, H., Zhai, Z., et al.: Real time dangerous action warning system
based on graph convolution neural network. Acad. J. Comput. Inform. Sci. 5(6),
89–94 (2022)
19. Wang, Z., Yu, J., Yu, A.W., Dai, Z., Tsvetkov, Y., Cao, Y.: Simvlm: sim-
ple visual language model pretraining with weak supervision. arXiv preprint
arXiv:2108.10904 (2021)
20. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for
skeleton-based action recognition. In: Proceedings of the AAAI Conference on Arti-
ficial Intelligence, vol. 32 (2018)
DOLG-CNet: Deep Orthogonal Fusion
of Local and Global Features Combined
with Contrastive Learning and Deep
Supervision for Polyp Segmentation

Trong-Hieu Nguyen-Mau1,2 , Kim-Trang Phu-Thi1,2 , Minh-Triet Tran1,2 ,


and Hai-Dang Nguyen1,2(B)
1 University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
{nmthieu,nhdang}@selab.hcmus.edu.vn, [email protected], [email protected]
2 Vietnam National University, Ho Chi Minh City, Vietnam

Abstract. Accurate polyp segmentation is vital for diagnosing colorec-


tal cancer but faces challenges due to varying sizes, colors, and clinical
conditions. Despite advancements, deep learning systems still have sig-
nificant limitations in effectively detecting and segmenting polyps. Con-
volutional Neural Network-based methods struggle to capture long-range
semantic relationships, whereas Transformer-based approaches often fail
to understand local pixel interactions effectively. Moreover, these meth-
ods sometimes inadequately extract detailed features and face limita-
tions in scenarios requiring optimized local and global feature modeling.
To tackle these challenges, we introduce DOLG-CNet, a novel one-stage,
end-to-end framework specifically crafted for polyp segmentation. Ini-
tially, we employ the cutting-edge ConvNeXt for its superior segmenta-
tion capabilities. Additionally, we integrate an orthogonal fusion module
that adeptly merges global and local features to generate a rich combined
feature set. We also introduce a unique training strategy that marries
contrastive learning with segmentation training, enhanced by an auxil-
iary deep supervision loss to boost performance. Specifically, we create
both high and low augmented versions for each input image and train the
system to align their vector embeddings closely, regardless of the aug-
mentation level. This method, combined with standard segmentation loss
and deep supervision, facilitates faster and more effective convergence.
Our experimental results demonstrate that DOLG-CNet achieves impres-
sive performance, with a dice coefficient score of 0.913 on Kvasir-SEG,
0.761 on CVC-ColonDB, and 0.722 on ETIS. Additionally, in qualita-
tive and quantitative benchmarks across various datasets, DOLG-CNet
consistently outperforms well-known methods, proving its efficacy and
potential in the field.

Keywords: Colorectal Cancer · Polyp Segmentation · Contrastive


Learning

c The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025
W. Buntine et al. (Eds.): SOICT 2024, CCIS 2350, pp. 112–126, 2025.
https://doi.org/10.1007/978-981-96-4282-3_10

1 Introduction
Colorectal cancer (CRC) is the second most lethal and third most common can-
cer globally, with millions of new cases and deaths reported annually. Research
indicates that most CRCs start as adenomatous polyps, which may progress from
benign growths in the mucosa to adenocarcinoma [3]. This preva-
lence underlines the necessity for early detection and removal of these polyps
through colonoscopy to prevent CRC [32]. Despite its critical role, the effec-
tiveness of colonoscopy is compromised by a significant missed detection rate,
which ranges from 6% to 27% [1]. With CRC accounting for 9.4% of all cancer
deaths globally in 2020, enhancing the precision of polyp detection through auto-
mated segmentation technologies is imperative to improve treatment outcomes
and reduce mortality rates [24].
Early methods for polyp segmentation relied on manually crafted features
such as color, texture, shape, and appearance, utilizing classifiers that often
failed due to the limitations of these features [26]. In recent years, Convolutional
Neural Networks (CNNs) based models like UNet [21] have proven highly effec-
tive for medical image segmentation, particularly for polyps. The UNet archi-
tecture [21] features a symmetric encoder-decoder structure with skip connec-
tions that preserve information across different levels, enabling the generation of
detailed feature maps. This has established UNet as a fundamental architecture
in biomedical image segmentation [7, 14, 35].
Despite their effectiveness, CNN-based models can sometimes miss contex-
tual and spatial details due to pooling and convolution striding, which affects
their ability to model long-range dependencies [6]. To overcome these limitations,
there has been a shift towards Transformer-based approaches like TransUNet [6],
TransFuse [33], and TransNetR [15], which employ self-attention mechanisms to
capture long-range dependencies better, thus enhancing accuracy. However, these
methods can be complex and resource-demanding. Their dependence on larger
patch sizes might also impact their performance in translational equivariance,
essential for biomedical imaging tasks [16]. Moreover, Transformers may result
in less accurate segmentations as they struggle to incorporate low-level details
[6]. CNN-based or Transformer-based methods may especially encounter chal-
lenges in extracting detailed, fine-grained features. They are also constrained
by specific scenarios, easily affected by different clinical settings or changes in
augmentation, and lack extensive exploration into how local and global feature
representations are integrated. This oversight could mean missing opportunities
to enhance feature attributes.
In this paper, we introduce DOLG-CNet, a novel framework designed for
polyp segmentation. This framework uses the state-of-the-art CNN backbone,
ConvNeXt [17], for its segmentation capabilities. It also incorporates an orthog-
onal fusion module, effectively capturing global and local feature relationships.
Additionally, we propose a novel training strategy that combines contrastive
learning with segmentation training, supplemented by auxiliary deep supervi-
sion loss to enhance performance. Our contributions are fourfold:

Fig. 1. Overall architecture of our proposed DOLG-CNet.

– Novel Architecture: We propose DOLG-CNet, an innovative one-stage,


end-to-end framework for 2D medical image segmentation. This framework
processes original images to generate high and low augmented versions, which
are then used for contrastive learning.
– Advanced Segmentation Backbone: We employ the state-of-the-art Con-
vNeXt backbone as an encoder to extract detailed features. Additionally,
we integrate local and global features using an orthogonal fusion module to
enhance feature richness and model effectiveness.
– Combined Loss Function: To train DOLG-CNet, we utilize a standard
segmentation loss function and deepen supervision across various resolutions,
further integrating it with a contrastive loss.
– Competitive Performance: DOLG-CNet showcases robust performance
relative to several established polyp segmentation methods and consistently
achieves strong results across multiple dataset evaluations.

2 Related Works
2.1 Polyp Segmentation

Over the past decade, deep learning has achieved significant advancements, par-
ticularly in early polyp diagnosis using endoscopic images. UNet [21], a well-
known architecture for medical image segmentation, features a CNN encoder-
decoder structure with a contracting path to gather context and an expanding
path to enhance detailed accuracy. Building upon the UNet architecture, sev-
eral variants [7, 14, 35] have emerged, each enhancing segmentation capabilities.
UNet++ [35] employs a complex network of nested and densely connected skip
pathways to minimize the semantic gap between encoder and decoder feature

maps. Subsequently, SFA [9] and PraNet [8] aim to delineate the precise bound-
ary of polyps from surrounding tissues. In particular, PraNet [8] employs reverse
attention modules to refine boundary details using a global feature map pro-
duced by a parallel partial decoder that utilizes high-level features. MSNet [34]
introduces a multi-scale subtraction network that effectively reduces redundancy
while harnessing complementary features across multiple scales.
In recent developments, Transformer-based models have also demonstrated
remarkable effectiveness in polyp image segmentation [6, 15]. For instance, Tran-
sUNet [6] employs a combined CNN-transformer encoder to grasp long-range
dependencies and utilizes a cascaded CNN upsampler in its decoder to discern
local contextual relationships between pixels. TransNetR [15] is a Transformer-
based residual network with an encoder-decoder structure, offering efficient seg-
mentation capabilities for both in-distribution and out-of-distribution datasets.

2.2 Fusing Local and Global Features

Recent advancements in local feature learning from images have been driven
by deep learning techniques [31]. DELF [18] is a prominent framework that
develops attentive local feature descriptors for large-scale image retrieval. Global
features are generally obtained through operations like GeM pooling [19]. Inte-
grating local and global features is beneficial as feature maps at the local scale in
image representation models act like visual words [23]. Yang et al.[30] introduce
the Deep Orthogonal Local and Global (DOLG) information fusion framework
for enhanced image retrieval. This includes an orthogonal fusion module that
merges local and global information, improving the final descriptor. Our study
applies these advancements to polyp segmentation, incorporating the Orthogonal
Fusion Module from the DOLG framework [30] to enhance detection accuracy
and robustness by leveraging both feature types.

2.3 Contrastive Learning in Segmentation

Self-supervised contrastive learning builds representations by contrasting posi-


tive pairs with negative pairs using contrastive loss [10]. Each instance is treated
as a unique class, and learning occurs through instance discrimination driven
by contrastive loss. Positive pairs are created from augmented versions of an
instance, and negative pairs are formed from randomly chosen different instances.
The effectiveness of self-supervised contrastive learning in image-level recogni-
tion tasks has inspired its adaptation to pixel-level prediction tasks, as mentioned
in recent studies [5]. Our approach to contrastive learning in polyp segmentation
introduces novel aspects. We focus on the global image vector, which remains
consistent across augmentations, and integrate it with local features to form a
comprehensive feature set, potentially improving polyp segmentation effective-
ness.

Fig. 2. The visualization of our Segmentation Backbone.

3 Proposed Method
This section provides a general framework for our method. Figure 1 presents
an overview of DOLG-CNet. By incorporating principles of contrastive learning
and commonly used segmentation training with deep supervision, DOLG-CNet
adopts a one-stage, end-to-end framework. Its objective is to analyze image repre-
sentations in different augmentation forms and segment them, thereby improving
polyp segmentation through various types of deep supervision. The framework
includes several key components: a newly developed segmentation backbone, a
contrastive learning task that aligns the representations of images under high and
low augmentations, and an auxiliary deep supervision loss.

3.1 Segmentation Backbone


ConvNeXt Encoder. CNN-based models, particularly the advanced Con-
vNeXt [17], excel in medical image segmentation by providing high-resolution
maps and integrating features through skip connections. ConvNeXt builds on
ResNet’s [12] principles and incorporates hierarchical Vision Transformer ele-
ments, like those in Swin Transformer [16]. It benefits from macro design
improvements such as the ResNeXt architecture [29], inverted bottlenecks, larger
kernels, and micro design tweaks, allowing it to rival hierarchical Vision Trans-
formers in performance while retaining traditional ConvNet simplicity and effi-
ciency. Therefore, we have chosen ConvNeXt [17] as our segmentation encoder.

Segmentation Backbone. The overall architecture of our segmentation back-


bone is shown in Fig. 2. In this study, we consider an RGB image represented
as $x \in \mathbb{R}^{H \times W \times 3}$ as input. Initially, we extract feature maps from each stage
of ConvNeXt [17] and feed them into residual blocks. The feature maps at
stage $i$ are $F_i \in \mathbb{R}^{\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}} \times 96}$, with $i$ ranging from 1 to 4. Subsequently, these
feature maps, together with features from the Orthogonal Fusion Module, are
upsampled to match the original image resolution of $H \times W$. After concatenating
these upsampled maps, they are processed through a residual block to derive
the final encoded features. The output stage involves a convolution layer with a
kernel size of 1 and a sigmoid activation function. This layer is responsible for
predicting the pixel-wise label map of the input image at the original resolution.
Additionally, for deep supervision purposes, a corresponding pixel label map is
generated at a resolution of $\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}$ from the corresponding feature map $F_i$
and an additional feature map from the Orthogonal Fusion Module.

Significantly, we utilize the feature map from stage 4, the deepest level within
the ConvNeXt architecture [17], to conduct advanced feature extraction. This
process involves a convolution operation with a kernel size of 3 and a stride of 2,
effectively halving the spatial dimensions of the feature map while retaining the
depth dimension. This operation is primarily aimed at capturing global features.
It is followed by global average pooling and a fully connected layer, which together
compress the feature dimensions, culminating in a global representation denoted
by $f_g \in \mathbb{R}^{C \times 1}$. Meanwhile, the original feature map from stage 4 of the ConvNeXt
model [17] functions as a local feature tensor, expressed as $f_l \in \mathbb{R}^{H' \times W' \times C}$,
where $H' = \frac{H}{64}$ and $W' = \frac{W}{64}$. This dual representation aids in detailed and nuanced feature
analysis, leveraging both localized and holistic information from the image.
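
As a concrete reference, the global branch just described (a 3×3 stride-2 convolution, global average pooling, and a fully connected layer producing $f_g$) might be sketched in PyTorch as below; the channel width and descriptor size are illustrative assumptions, not values confirmed by the paper.

```python
import torch
import torch.nn as nn

class GlobalBranch(nn.Module):
    """Global-feature branch: 3x3 stride-2 conv -> global average pooling -> FC."""

    def __init__(self, in_channels: int = 96, out_dim: int = 256):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=2, padding=1)
        self.fc = nn.Linear(in_channels, out_dim)

    def forward(self, stage4_feat: torch.Tensor) -> torch.Tensor:
        x = self.conv(stage4_feat)        # halve spatial dims, keep depth
        x = x.mean(dim=(2, 3))            # global average pooling
        return self.fc(x)                 # f_g: (batch, out_dim)

f_g = GlobalBranch()(torch.randn(1, 96, 12, 12))  # e.g. a 384/32 = 12 spatial grid
print(f_g.shape)  # torch.Size([1, 256])
```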
After that, we employ a novel Orthogonal Fusion Module (OFM) [30] specif-
ically designed to aggregate the global feature $f_g$ and the local feature
tensor $f_l$. Following the orthogonal fusion process, a final compact descriptor is
generated, effectively integrating both local and global information.

Orthogonal Fusion Module. Yang et al.[30] introduced the Deep Orthogonal


Local and Global (DOLG) framework for image retrieval, which extracts orthog-
onal local components from the global image representation and combines them
to enhance the final representation. Inspired by this, we implemented a simplified
version using the Orthogonal Fusion Module to meet our needs.
The operational mechanics of our orthogonal fusion module are depicted in
Fig. 3a. This module processes the local feature tensor $f_l$ (transposed to conform
to the required format) and the global feature vector $f_g$. It first computes the
projection $f_{l,\mathrm{proj}}^{(i,j)}$ of each local feature point $f_l^{(i,j)}$ onto $f_g$. The projection is
mathematically expressed as:

$$f_{l,\mathrm{proj}}^{(i,j)} = \frac{f_l^{(i,j)} \cdot f_g}{\|f_g\|^2} \, f_g \qquad (1)$$

where $f_l^{(i,j)} \cdot f_g$ denotes the dot product and $\|f_g\|$ the L2 norm of $f_g$.

Fig. 3. Architectural diagrams of (a) Orthogonal Fusion Module and (b) Residual
Block.

Subsequently, the orthogonal component is calculated as the discrepancy
between $f_l^{(i,j)}$ and its projection $f_{l,\mathrm{proj}}^{(i,j)}$. This component, orthogonal to $f_g$, is
given by:

$$f_{l,\mathrm{orth}}^{(i,j)} = f_l^{(i,j)} - f_{l,\mathrm{proj}}^{(i,j)} \qquad (2)$$

Through this procedure, a tensor of dimensions $C \times H' \times W'$ is derived,
where each point is orthogonal to $f_g$. Each point in this tensor is subsequently
concatenated with the $C \times 1$ vector $f_g$, resulting in a tensor of dimensions $2C \times H' \times W'$.
The tensor is then transposed and processed through a fully connected
layer, yielding a feature map of dimensions $H' \times W' \times C$.
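
A minimal PyTorch sketch of Eqs. (1)-(2) and the subsequent concatenation and fully connected projection is given below; tensor shapes and channel counts are assumptions chosen for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

def orthogonal_fusion(f_l: torch.Tensor, f_g: torch.Tensor, fc: nn.Linear) -> torch.Tensor:
    """f_l: (B, C, H', W') local tensor; f_g: (B, C) global vector."""
    B, C, H, W = f_l.shape
    g = f_g.view(B, C, 1, 1)
    # Eq. (1): projection of each local point onto f_g.
    dot = (f_l * g).sum(dim=1, keepdim=True)                       # (B, 1, H', W')
    proj = dot / (f_g.pow(2).sum(dim=1).view(B, 1, 1, 1) + 1e-8) * g
    # Eq. (2): orthogonal component.
    orth = f_l - proj
    # Concatenate f_g at every spatial position, then mix channels with an FC layer.
    fused = torch.cat([orth, g.expand(B, C, H, W)], dim=1)         # (B, 2C, H', W')
    fused = fused.permute(0, 2, 3, 1)                              # (B, H', W', 2C)
    return fc(fused)                                               # (B, H', W', C)

fc = nn.Linear(2 * 96, 96)
out = orthogonal_fusion(torch.randn(2, 96, 12, 12), torch.randn(2, 96), fc)
print(out.shape)  # torch.Size([2, 12, 12, 96])
```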

Residual Block. Residual blocks [12], shown in Fig. 3b, use skip connections
to learn residuals, tackling the vanishing gradient problem and boosting infor-
mation processing. These connections also enhance generalization and prevent
overfitting [11, 12]. Our architecture features sequences of convolution, batch nor-
malization, and Swish activation [20], chosen for its effectiveness across a range
of input values, outperforming traditional activations like ReLU.
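
One possible realization of such a residual block, matching the conv-batch norm-Swish description above, is sketched below; the number of convolution layers and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv-BN-Swish block with an identity skip connection."""

    def __init__(self, channels: int = 96):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),  # Swish activation
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.body(x) + x)  # residual addition

print(ResidualBlock()(torch.randn(1, 96, 48, 48)).shape)
```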

Fig. 4. Visualization of some augmentation techniques used in our DOLG-CNet.

3.2 DOLG-CNet

As shown in Fig. 1, for an input image $x \in \mathbb{R}^{H \times W \times 3}$, we employ two distinct
augmentation methods: one involves strong, substantial augmentation yielding
a highly augmented image $x_{\mathrm{high}}$, and the other utilizes a milder augmentation
setting resulting in a less augmented image $x_{\mathrm{low}}$. Both images are then input into
a Segmentation Backbone with shared weights, producing vector embeddings of
the segmentation masks along with their corresponding deep supervisions.
The augmentation functions include modifications to image color properties
such as saturation, contrast, and brightness, in addition to random cropping and
rotation. The degree of high and low augmentations depends on the probability
and intensity of these modifications. We have visualized several cases in Fig. 4.
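
For illustration, the two augmentation levels could be implemented with torchvision as below; the exact probabilities and intensities are assumptions, not the tuned settings used in the paper.

```python
import torchvision.transforms as T

# Illustrative "low" and "high" augmentation pipelines (parameters are assumptions).
low_aug = T.Compose([
    T.Resize((384, 384)),
    T.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
    T.RandomRotation(degrees=5),
    T.ToTensor(),
])

high_aug = T.Compose([
    T.RandomResizedCrop(384, scale=(0.6, 1.0)),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    T.RandomRotation(degrees=30),
    T.ToTensor(),
])
# Each training image is passed through both pipelines; the two views are fed
# to the shared-weight segmentation backbone for the contrastive objective.
```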
Our hypothesis posits that, regardless of the variation in augmentation, the
global vector embeddings derived from the differently augmented images should
remain equivalent, because they all describe the same original image. This suggests
that vector embeddings are invariant to augmentation, which may enhance model
robustness against augmentation, improve generalization, and reduce susceptibil-
ity to noise [4]. This invariant global vector embedding is subsequently integrated
into the Orthogonal Fusion Module alongside local features.
Furthermore, we utilize a segmentation loss complemented by auxiliary deep
supervision losses to refine model training and performance further.

3.3 Loss Function

The segmentation backbone was optimized using both the Dice loss ($L_{\mathrm{Dice}}$) and
the Binary Cross Entropy loss ($L_{\mathrm{BCE}}$), following the methodologies described in
[28]. Supervision, including deep supervision, was consistently applied at each
output layer of the model. Let $P$ and $G$ denote the predicted and ground truth
values, respectively, both assumed to be at the same resolution. The weighting
coefficients $\lambda_1$ and $\lambda_2$ were set to 1 for simplicity. Therefore, the segmentation
loss $L_{\mathrm{Segment}}$ is formulated as:

$$L_{\mathrm{Segment}} = \lambda_1 L_{\mathrm{Dice}}(P, G) + \lambda_2 L_{\mathrm{BCE}}(P, G) \qquad (3)$$

On the other hand, the contrastive loss ($L_{\mathrm{Contrastive}}$) is constructed by using
the Mean Square Error loss ($L_{\mathrm{MSE}}$). Given the vector embeddings $V_1$ and $V_2$,
the formula becomes:

$$L_{\mathrm{Contrastive}} = L_{\mathrm{MSE}}(V_1, V_2) \qquad (4)$$

From Eqs. 3 and 4, with the weighting coefficient $\lambda_3$ also set to 1, we have
the total loss as follows:

$$L = L_{\mathrm{Segment}} + \lambda_3 L_{\mathrm{Contrastive}} \qquad (5)$$
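
The combined objective of Eqs. (3)-(5) can be sketched as follows, with all weighting coefficients set to 1 as in the paper; the particular Dice formulation and the embedding shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def total_loss(pred, target, v1, v2, l1=1.0, l2=1.0, l3=1.0):
    """Eq. (5): segmentation loss (Dice + BCE) plus MSE contrastive term."""
    seg = l1 * dice_loss(pred, target) + l2 * F.binary_cross_entropy(pred, target)
    contrastive = F.mse_loss(v1, v2)
    return seg + l3 * contrastive

# Toy check with random predictions, masks and embeddings.
p = torch.rand(1, 1, 64, 64)
g = (torch.rand(1, 1, 64, 64) > 0.5).float()
print(total_loss(p, g, torch.randn(1, 256), torch.randn(1, 256)))
```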

4 Experiments
4.1 Datasets and Evaluation Metrics
We evaluate our DOLG-CNet on five public polyp segmentation datasets: Kvasir-
SEG [13], CVC-ClinicDB [2], CVC-ColonDB [25], CVC-T [27], and ETIS [22],
following protocols from [8, 34] with identical training and testing splits. The
training set includes 1450 images, with 550 from CVC-ClinicDB and 900 from
Kvasir-SEG, and the testing set comprises 798 images across all datasets.
For evaluation, we use six metrics: mean Dice score (mDice), mean Intersec-
tion over Union (mIoU), weighted $F_\beta$-measure ($F_\beta^w$), structure measure ($S_\alpha$),
enhanced-alignment measure ($E_\phi^{max}$), and mean absolute error (MAE). A lower
MAE indicates better performance, whereas higher values are preferable for the
other metrics.
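
For reference, the simpler metrics (Dice, IoU, and MAE) can be computed as in the sketch below; the weighted $F_\beta$-measure, structure measure, and enhanced-alignment measure require dedicated implementations and are omitted here.

```python
import numpy as np

def dice_iou_mae(pred: np.ndarray, gt: np.ndarray, thr: float = 0.5):
    """Binary Dice, IoU and MAE for one predicted mask against its ground truth."""
    p = (pred >= thr).astype(np.float64)
    g = (gt >= 0.5).astype(np.float64)
    inter = (p * g).sum()
    dice = 2 * inter / (p.sum() + g.sum() + 1e-8)
    iou = inter / (p.sum() + g.sum() - inter + 1e-8)
    mae = np.abs(pred - g).mean()
    return dice, iou, mae

print(dice_iou_mae(np.random.rand(64, 64), (np.random.rand(64, 64) > 0.5).astype(float)))
```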
DOLG-CNet is benchmarked against six leading methods: UNet [21],
UNet++ [35], SFA [9], PraNet [8], MSNet [34], and TransNetR [15], using pub-
lished results or replicated experiments to ensure comparable training conditions.

4.2 Implementation Details


We employ Keras TensorFlow for our programming needs. Images are resized to
a resolution of $384 \times 384$ pixels. Details on the augmentation methods used can
be found in Sect. 3.2. Following these transformations, the images are normalized to
the range [0, 1]. For training, the Adam optimizer was utilized, with momentum terms
$\beta_1 = 0.9$ and $\beta_2 = 0.999$ and a learning rate of $10^{-4}$. The experiments were conducted
using a single NVIDIA A100 40 GB graphics card. The batch size was set to 4,
and the model was trained for 50 epochs.

4.3 Quantitative Comparison


Table 1 presents a performance comparison using six metrics. It shows that our
DOLG-CNet outperforms or competes with other methods across all datasets.
Specifically, DOLG-CNet exhibits superior performance on the Kvasir-SEG [13],
CVC-ColonDB [25], and ETIS [22] datasets. Particularly on the demanding ETIS
[22] dataset, our method outstrips the next best approach (MSNet [34]) by sig-
nificant margins in mDice, $F_\beta^w$, $S_\alpha$, $E_\phi^{max}$, and MAE, with improvements of 0.3%,
7.8%, 1.0%, 3.3%, and 53.8%, respectively.

Table 1. Comparison results on Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, CVC-T,


and ETIS datasets. The best and second results are highlighted.

Datasets Methods Venue mDice mIoU $F_\beta^w$ $S_\alpha$ $E_\phi^{max}$ MAE
Kvasir UNet MICCAI’15 [21] 0.818 0.746 0.794 0.858 0.893 0.055
UNet++ TMI’19 [35] 0.821 0.743 0.808 0.862 0.910 0.048
SFA MICCAI’19 [9] 0.723 0.611 0.670 0.782 0.849 0.075
PraNet MICCAI’20 [8] 0.898 0.840 0.885 0.915 0.943 0.030
MSNet MICCAI’21 [34] 0.907 0.862 0.893 0.922 0.944 0.028
TransNetR PMLR’24 [15] 0.896 0.839 0.918 0.900 0.942 0.028
DOLG-CNet - 0.913 0.859 0.950 0.905 0.954 0.026
ClinicDB UNet MICCAI’15 [21] 0.823 0.755 0.811 0.889 0.954 0.019
UNet++ TMI’19 [35] 0.794 0.729 0.785 0.873 0.931 0.022
SFA MICCAI’19 [9] 0.700 0.607 0.647 0.793 0.885 0.042
PraNet MICCAI’20 [8] 0.899 0.849 0.896 0.936 0.961 0.009
MSNet MICCAI’21 [34] 0.921 0.879 0.914 0.941 0.972 0.008
TransNetR PMLR’24 [15] 0.906 0.856 0.941 0.934 0.968 0.008
DOLG-CNet - 0.929 0.847 0.929 0.929 0.960 0.009
ColonDB UNet MICCAI’15 [21] 0.512 0.444 0.498 0.712 0.776 0.061
UNet++ TMI’19 [35] 0.483 0.410 0.467 0.691 0.760 0.064
SFA MICCAI’19 [9] 0.469 0.347 0.379 0.634 0.765 0.094
PraNet MICCAI’20 [8] 0.709 0.640 0.696 0.819 0.869 0.045
MSNet MICCAI’21 [34] 0.755 0.678 0.737 0.836 0.883 0.041
TransNetR PMLR’24 [15] 0.664 0.597 0.736 0.817 0.790 0.041
DOLG-CNet - 0.761 0.670 0.811 0.854 0.878 0.031
CVC-T UNet MICCAI’15 [21] 0.710 0.627 0.684 0.843 0.876 0.022
UNet++ TMI’19 [35] 0.707 0.624 0.687 0.839 0.898 0.018
SFA MICCAI’19 [9] 0.467 0.329 0.341 0.640 0.817 0.065
PraNet MICCAI’20 [8] 0.871 0.797 0.843 0.924 0.938 0.010
MSNet MICCAI’21 [34] 0.869 0.807 0.849 0.925 0.943 0.010
TransNetR PMLR’24 [15] 0.893 0.819 0.904 0.906 0.971 0.006
DOLG-CNet - 0.880 0.811 0.878 0.899 0.952 0.009
ETIS UNet MICCAI’15 [21] 0.398 0.335 0.366 0.684 0.740 0.036
UNet++ TMI’19 [35] 0.401 0.344 0.390 0.683 0.776 0.035
SFA MICCAI’19 [9] 0.297 0.217 0.231 0.557 0.633 0.109
PraNet MICCAI’20 [8] 0.628 0.567 0.600 0.794 0.841 0.031
MSNet MICCAI’21 [34] 0.719 0.664 0.678 0.840 0.830 0.020
TransNetR PMLR’24 [15] 0.613 0.547 0.642 0.809 0.779 0.016
DOLG-CNet - 0.722 0.648 0.731 0.848 0.857 0.013

Table 2. Ablation study of DOLG-CNet on CVC-ClinicDB and CVC-ColonDB.

Metric ClinicDB ColonDB


mDice mIoU mDice mIoU
DOLG-CNet without ConvNeXt 0.654 0.594 0.489 0.437
DOLG-CNet without Orthogonal Fusion Module 0.911 0.845 0.736 0.661
DOLG-CNet without Deep Supervision 0.865 0.816 0.667 0.604
DOLG-CNet without Contrastive Loss 0.903 0.842 0.751 0.657
DOLG-CNet 0.929 0.847 0.761 0.670

Fig. 5. Comparative visual analysis of various methods.

4.4 Qualitative Evaluation

Fig. 5 presents a visual comparison between our method and other counterparts.
Our proposed technique performs better in segmenting polyps of various sizes
and shapes. Moreover, our approach demonstrates precise segmentation capabil-
ities, particularly for polyps that are challenging to detect, as shown in the 3rd,
4th, and 5th rows.

5 Ablation Study

We conducted ablation studies on CVC-ClinicDB [2] and CVC-ColonDB [25]


datasets, using mean Dice score (mDice) and mean Intersection over Union
(mIoU) as metrics (see Table 2). Our evaluation of DOLG-CNet showed that
changing the backbone from ResNet [12] to ConvNeXt [17] significantly improved

Fig. 6. Visualization of heatmaps from the DOLG-CNet.

performance, with mDice and mIoU increasing by 42.0% and 42.6% on CVC-
ClinicDB, and 55.6% and 53.3% on CVC-ColonDB, respectively. These results
highlight ConvNeXt’s effectiveness in polyp segmentation, a field where minor
enhancements significantly impact clinical diagnostics. Besides, incorporating
the Orthogonal Fusion Module and modifying the loss function, including Deep
Supervision and Contrastive Loss, further improved performance by 3.8% to
14.1%, underscoring their value in enhancing diagnostic accuracy and efficiency.
For a comprehensive understanding of the models, we have utilized heatmaps
to visualize the activation within the DOLG-CNet during image processing, as
depicted in Fig. 6. These heatmaps are generated by averaging feature map chan-
nels and applying color mapping to highlight the most responsive areas of each
layer. Brighter colors indicate higher activations. The analysis ranges from the
initial to final stages, showing a progression from scattered attention in early
layers to a focused emphasis on key image features in later layers, such as the
polyp region, which is particularly enhanced by the Orthogonal Fusion Mod-
ule. This visualization aids in refining model performance by demonstrating the
model’s effective learning and recognition capabilities, especially in challenging
scenarios like distinguishing polyps from complex backgrounds.

6 Conclusion

This paper introduces DOLG-CNet, a deep learning framework for polyp image
segmentation using ConvNeXt as its backbone. It employs contrastive learning
alongside standard segmentation training and deep supervision loss to enhance
performance. Images undergo varied augmentation levels, with the model trained
to align vector embeddings using an orthogonal fusion module for effective
global and local feature merging. With deep supervision, integrating contrastive
and segmentation losses accelerates convergence. DOLG-CNet achieves notable

results with dice scores of 0.913 (Kvasir-SEG), 0.761 (CVC-ColonDB), and 0.722
(ETIS), surpassing existing methods qualitatively and quantitatively across mul-
tiple datasets. Future work aims to optimize training efficiency for larger net-
works and enhance the utilization of local and global features for superior seman-
tic segmentation.

Acknowledgements. This research is funded by Vietnam National University - Ho


Chi Minh City (VNU-HCM) under grant number C2024-18-26.

References
1. Ahn, S.B., Han, D.S., Bae, J.H., Byun, T.J., Kim, J.P., Eun, C.S.: The miss rate for
colorectal adenoma determined by quality-adjusted, back-to-back colonoscopies.
Gut Liver 6(1), 64 (2012)
2. Bernal, J., et al.: Wm-dova maps for accurate polyp highlighting in colonoscopy:
Validation vs. saliency maps from physicians. Comput. Med. Imaging Graph. 43,
99–111 (2015)
3. Bernal, J., Sánchez, J., Vilarino, F.: Towards automatic polyp detection with a
polyp appearance model. Pattern Recogn. 45(9), 3166–3182 (2012)
4. Caron, M., et al.: Emerging properties in self-supervised vision transformers. In:
Proceedings of the International Conference on Computer Vision (ICCV) (2021)
5. Chaitanya, K., Erdil, E., Karani, N., Konukoglu, E.: Contrastive learning of global
and local features for medical image segmentation with limited annotations. Adv.
Neural. Inf. Process. Syst. 33, 12546–12558 (2020)
6. Chen, J., et al.: Transunet: transformers make strong encoders for medical image
segmentation. arXiv preprint arXiv:2102.04306 (2021)
7. Diakogiannis, F.I., Waldner, F., Caccetta, P., Wu, C.: Resunet-a: a deep learning
framework for semantic segmentation of remotely sensed data. ISPRS J. Pho-
togramm. Remote. Sens. 162, 94–114 (2020)
8. Fan, D.-P., et al.: PraNet: parallel reverse attention network for polyp segmenta-
tion. In: Martel, A.L., et al. (eds.) MICCAI 2020. LNCS, vol. 12266, pp. 263–273.
Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59725-2_26
9. Fang, Y., Chen, C., Yuan, Y., Tong, K.: Selective feature aggregation network
with area-boundary constraints for polyp segmentation. In: Shen, D., et al. (eds.)
MICCAI 2019. LNCS, vol. 11764, pp. 302–310. Springer, Cham (2019). https://
doi.org/10.1007/978-3-030-32239-7_34
10. Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invari-
ant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR 2006), vol. 2, pp. 1735–1742. IEEE (2006)
11. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In:
Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp.
630–645. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_38
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
13. Jha, D., et al.: Kvasir-SEG: a segmented polyp dataset. In: Ro, Y.M., et al. (eds.)
MMM 2020. LNCS, vol. 11962, pp. 451–462. Springer, Cham (2020). https://doi.
org/10.1007/978-3-030-37734-2_37

14. Jha, D., et al.: Resunet++: an advanced architecture for medical image segmenta-
tion. In: 2019 IEEE International Symposium on Multimedia (ISM), pp. 225–2255.
IEEE (2019)
15. Jha, D., Tomar, N.K., Sharma, V., Bagci, U.: Transnetr: transformer-based residual
network for polyp segmentation with multi-center out-of-distribution testing. In:
Medical Imaging with Deep Learning, pp. 1372–1384. PMLR (2024)
16. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted win-
dows. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision, pp. 10012–10022 (2021)
17. Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for
the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 11976–11986 (2022)
18. Noh, H., Araujo, A., Sim, J., Weyand, T., Han, B.: Large-scale image retrieval with
attentive deep local features. In: Proceedings of the IEEE International Conference
on Computer Vision, pp. 3456–3465 (2017)
19. Radenović, F., Tolias, G., Chum, O.: Fine-tuning CNN image retrieval with no
human annotation. IEEE Trans. Pattern Anal. Mach. Intell. 41(7), 1655–1668
(2018)
20. Ramachandran, P., Zoph, B., Le, Q.V.: Swish: a self-gated activation function.
arXiv preprint arXiv:1710.05941 7(1), 5 (2017)
21. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomed-
ical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F.
(eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-24574-4_28
22. Silva, J., Histace, A., Romain, O., Dray, X., Granado, B.: Toward embedded detec-
tion of polyps in wce images for early diagnosis of colorectal cancer. Int. J. Comput.
Assist. Radiol. Surg. 9, 283–293 (2014)
23. Siméoni, O., Avrithis, Y., Chum, O.: Local features and visual words emerge in
activations. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 11651–11660 (2019)
24. Sung, H., et al.: Global cancer statistics 2020: Globocan estimates of incidence
and mortality worldwide for 36 cancers in 185 countries. CA: Can. J. Clin. 71(3),
209–249 (2021)
25. Tajbakhsh, N., et al.: Automated polyp detection in colonoscopy videos using shape
and context information. IEEE Trans. Med. Imaging (2015)
26. Tajbakhsh, N., Gurudu, S.R., Liang, J.: Automated polyp detection in colonoscopy
videos using shape and context information. IEEE Trans. Med. Imaging 35(2),
630–644 (2015)
27. Vázquez, D., et al.: A benchmark for endoluminal scene segmentation of
colonoscopy images. J. Healthcare Eng. 2017 (2017)
28. Wei, J., Hu, Y., Zhang, R., Li, Z., Zhou, S.K., Cui, S.: Shallow attention network
for polyp segmentation. In: de Bruijne, M., et al. (eds.) MICCAI 2021. LNCS, vol.
12901, pp. 699–708. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-
87193-2_66
29. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations
for deep neural networks. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 1492–1500 (2017)
30. Yang, M., et al.: Dolg: single-stage image retrieval with deep orthogonal fusion of
local and global features. In: Proceedings of the IEEE/CVF International confer-
ence on Computer Vision, pp. 11772–11781 (2021)
126 T.-H. Nguyen-Mau et al.

31. Yi, K.M., Trulls, E., Lepetit, V., Fua, P.: LIFT: learned invariant feature trans-
form. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS,
vol. 9910, pp. 467–483. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-
46466-4_28
32. Zauber, A.G., et al.: Colonoscopic polypectomy and long-term prevention of
colorectal-cancer deaths. N. Engl. J. Med. 366(8), 687–696 (2012)
33. Zhang, Y., Liu, H., Hu, Q.: TransFuse: fusing transformers and cnns for medi-
cal image segmentation. In: de Bruijne, M., et al. (eds.) MICCAI 2021. LNCS,
vol. 12901, pp. 14–24. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-
87193-2_2
34. Zhao, X., Zhang, L., Lu, H.: Automatic polyp segmentation via multi-scale subtrac-
tion network. In: de Bruijne, M., et al. (eds.) MICCAI 2021. LNCS, vol. 12901, pp.
120–130. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87193-2_12
35. Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J.: Unet++: a nested
u-net architecture for medical image segmentation. In: Deep Learning in Medi-
cal Image Analysis and Multimodal Learning for Clinical Decision Support: 4th
International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS
2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September
2018, Proceedings 4, pp. 3–11. Springer (2018)
VisChronos: Revolutionizing Image
Captioning Through Real-Life Events

Phuc-Tan Nguyen1,2, Hieu Nguyen1,2, and Trung-Nghia Le1,2(B)

1 University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
{21120028,21120068}@student.hcmus.edu.vn, [email protected]
2 Vietnam National University, Ho Chi Minh City, Vietnam

Abstract. This paper aims to bridge the semantic gap between visual
content and natural language understanding by leveraging historical
events in the real world as a source of knowledge for caption gener-
ation. We propose VisChronos, a novel framework that utilizes large
language models and dense captioning models to identify and describe
real-life events from a single input image. Our framework can automati-
cally generate detailed and context-aware event descriptions, enhancing
the descriptive quality and contextual relevance of generated captions
to address the limitations of traditional methods in capturing contextual
narratives. Furthermore, we introduce a new dataset, EventCap (https://zenodo.org/records/14004909), specifically constructed using the proposed framework, designed to enhance the model’s ability to identify
and understand complex events. The user study demonstrates the effi-
cacy of our solution in generating accurate, coherent, and event-focused
descriptions, paving the way for future research in event-centric image
understanding.

Keywords: Event-Based Image Captioning · Contextual Captioning · Real-World Semantics · Event Extraction · VisChronos framework

1 Introduction
Image captioning aims to generate descriptive captions for images. However,
most current methods tend to produce captions with a limited understanding of
the image, focusing primarily on identifying objects, actions, and basic physical
attributes [8, 9, 14]. These approaches fall short in conveying deeper context,
as they lack the ability to infer meaningful information about the events or
interactions taking place in the image. This limitation becomes especially evident
when the goal is to describe not just what is visible but also the underlying story
or context associated with the image.
In many cases, the generated captions are too superficial to capture complex
scenarios where additional information, such as who is involved, what is hap-
pening, where and when the event took place, and its significance, is critical. As
P.-T. Nguyen and H. Nguyen—Both authors contributed equally to this research.
c The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025
W. Buntine et al. (Eds.): SOICT 2024, CCIS 2350, pp. 127–140, 2025.
https://doi.org/10.1007/978-981-96-4282-3_11

Fig. 1. Comparison between general caption by Grit [14] and our event-based caption,
highlighting additional details such as location, date, identity, and event purpose.

a result, these methods are inadequate for providing rich, informative captions
that align with more sophisticated user needs, such as understanding or retriev-
ing images related to real-world events. To overcome this, a new approach is
needed: one that integrates contextual details and event-related information to
create comprehensive, narrative-driven captions that go beyond simple object
recognition.
Our research introduces the task of Event-Enriched Image Captioning
(EEIC), which aims to generate captions that provide richer, more compre-
hensive information about an image. This approach is demonstrated through a
sample depicted in Fig. 1, where we showcase the result caption that our method
generates. This example illustrates the enhanced descriptive quality and contex-
tual depth that EEIC can bring to image captioning. These captions go beyond
simple visual descriptions by offering deeper insights, including the names and
attributes of objects, the timing, context, outcomes of events, and other crucial
details: information that cannot be gleaned from merely observing the image.
This approach facilitates the creation of more coherent and detailed narratives,
capturing not only the visible elements but also the underlying context and sig-
nificance of the scene, ultimately offering a more complete understanding of what
the image represents.
The core idea of our approach is to harness event-related information from
credible sources while leveraging the reasoning capabilities of both vision-
language models and large language models (LLMs). We propose a fully auto-
mated four-step framework, called VisChronos, to analyze both the visual con-
tent and the temporal, event-based aspects of the scene, ensuring a more com-
prehensive understanding. VisChronos operates through a systematic process
designed to extract, analyze, and synthesize information from images and asso-
ciated events. First, a vision-language model identifies and describes the most
important aspects of the image, including both those specified by prompts and

those deemed important by the model itself. Next, in the second step, an LLM
generates questions about the image based on the key aspects identified in the
first step, which includes both mandatory and optional questions. In the third
step, another LLM answers these questions using event information that we pro-
vide. Finally, a separate LLM synthesizes and processes the information from the
previous three steps to infer and generate the final caption for the image. Unlike
traditional models that rely on learning from a massive dataset and risk generat-
ing information that may be fabricated or irrelevant, our method addresses this
issue by incorporating factual sources: credible, human-authored articles that pro-
vide real, context-rich information. By drawing directly from authentic articles
and aligning this information with image content, our framework ensures that
captions accurately represent real events in human history. This framework is
designed to establish an efficient information mining flow, systematically divid-
ing different stages of information extraction across each step. This approach
ensures that the framework can mine the most useful and relevant information
at each stage, ultimately resulting in rich and contextually accurate captions.
Extensive human evaluations of captions generated by the VisChronos reveal
that they are comparable to captions crafted by human annotators. These
machine-generated captions were particularly praised for their completeness,
coherence, conciseness, and the inclusion of relevant information not explicitly
visible in the images.
Using VisChronos, we have created a dataset named EventCap, consisting
of 3140 event-based image-caption pairs. Each pair has been carefully curated
using images and related information sourced from a wide range of credible
articles. This collection serves as a valuable resource for training and evaluating
the performance of image captioning models in understanding and describing
complex real-world events. To the best of our knowledge, no similar dataset
exists that is specifically designed for the task of event-enriched image captioning,
making EventCap a unique and essential resource for advancing research in this
area.
Our main contributions can be summarized as follows:

– We introduce the new task of Event-Enriched Image Captioning (EEIC), which generates captions that provide profound insights into both the image and the events depicted, including information that cannot be inferred solely from the image.
– We propose a novel four-stage framework, namely VisChronos, to solve the task of EEIC, combining vision-language models and LLMs to generate detailed, context-rich captions.
– We publicly release EventCap (https://zenodo.org/records/14004909), a dataset consisting of 3140 image-caption pairs, processed from 491 CNN articles.

2 Related Work
2.1 Dense Captioning
Dense captioning, which aims to generate detailed descriptions for objects in
scenes or videos, remains a challenging task. Over the years, several methods
have been developed to address this problem, each employing distinct approaches
and making notable contributions. Chen et al. [5] introduced Scan2Cap, an end-
to-end method for dense captioning in RGB-D scans, utilizing 3D point cloud
inputs to generate bounding boxes and corresponding object descriptions. Build-
ing on this, Wang et al. [13] presented PDVC, a framework that formulates dense
video captioning as a task of set prediction, enabling efficient parallel decoding.
Aafaq et al. [1] proposed the ViSE framework and VSJM-Net, which leverage
early linguistic information fusion to model word-context distributional proper-
ties for improved dense video captioning. Shao et al. [10] further advanced the
field by incorporating textual context-aware methods that generate diverse and
context-rich captions. Similarly, Jiao et al. [7] presented MORE, a model that
captures complex scene relations for superior captioning accuracy. Furthermore,
Wu et al. [14] developed GRiT, a generative region-to-text transformer model
that emphasizes object understanding in dense captioning tasks.
Despite the advancements made by these approaches, they typically generate
conventional captions that lack real-world semantic depth, as they rely solely on
information from the image or video. Moreover, most existing methods produce
captions in a single pass through learned representations, which can result in
missing critical details. In contrast, our method continuously supplements the
captioning process through an interactive dialogue between models, allowing for
the extraction of more nuanced and semantically rich information.

2.2 Large Language Models

Large Language Models (LLMs) have garnered significant attention due to their
ability to generate human-like text and solve complex tasks across various
domains. GPT (Generative Pre-trained Transformer), developed by OpenAI,
is one of the most well-known LLMs. Floridi et al. [6] introduced GPT-3, a
third-generation autoregressive language model that generates human-like text
using deep learning techniques. They explored its nature, scope, limitations, and
potential consequences, emphasizing that GPT-3 is not intended to pass complex
mathematical, semantic, or ethical tests. Brown et al. [4] further demonstrated
that scaling language models, such as GPT-3 with 175 billion parameters, sig-
nificantly improves few-shot performance across various tasks, sometimes out-
performing state-of-the-art (SOTA) approaches. In subsequent developments,
Achiam et al. [2] presented GPT-4, a large-scale multimodal model capable of
processing both image and text inputs to generate text outputs, marking a sig-
nificant advancement in multimodal language modeling. Meanwhile, Yang et
al. [15] explored GPT-4V, expanding GPT-4’s capabilities to include vision-
based tasks, opening new avenues for large multimodal models. In addition,

Fig. 2. VisChronos framework for Event-Enriched Image Captioning.

Wang et al. [12] evaluated the trustworthiness of GPT models, concluding that
GPT-4 is generally more reliable than GPT-3.5 but remains vulnerable to adver-
sarial attacks, such as jailbreaking or misleading prompts.
Gemma is a lightweight, SOTA open model that builds upon the research
and technology developed for Gemini models. As outlined in recent studies,
Gemma has exhibited strong performance across various benchmarks for lan-
guage understanding, reasoning, and safety [11]. Notably, Gemma has been rig-
orously evaluated alongside other large language models (LLMs) to assess its
capabilities across different languages, modalities, models, and tasks [3]. The
advancements in Gemma are a result of ongoing improvements in multimodal
research, with enhancements in understanding and processing non-textual data
inputs such as images and speech, which significantly contribute to its versatil-
ity in both academic and practical applications. Furthermore, by being available
as an open-source model, Gemma provides accessible, cutting-edge AI technol-
ogy to the broader research community, while also offering paid solutions for
advanced functionalities and commercial deployment [11].
In this framework, we integrate the use of dense captioning models with
both paid and open-source LLMs, including models such as GPT and Gemma,
to leverage their combined strengths for enhanced performance across a range
of tasks. By this framework, we also generate the EventCap dataset, the first
dataset specifically designed for event captioning, providing a unique resource
for accurately describing event contexts in diverse applications.

3 Proposed VisChronos Framework


3.1 Overview
In this work, we introduce a novel multi-stage framework aimed at enhanc-
ing image captioning by generating contextually rich captions that go beyond
describing the image. Our approach captures not only visual elements but also
events, facts, and broader contextual information inferred from the image. As
shown in Fig. 2, our framework consists of four stages, each of which is handled
by a dedicated bot, designed to perform a specific task and feed information to
the next stage. These stages are:

– Stage 1 (Dense Captioning - Perception Bot): Generates a detailed description of the image, capturing objects, people, actions, and contextual elements.
– Stage 2 (Question Generation - Interrogation Bot): Creates a set of
questions based on the image description to explore the event’s context and
details.
– Stage 3 (Answer Extraction - Explanation Bot): Extracts answers to
the questions from external knowledge sources, ensuring that answers are
provided only when the model is confident.
– Stage 4 (Caption Synthesis - Integration Bot): Synthesizes the dense
caption, answers, and external knowledge into a comprehensive and contex-
tually enriched final caption.

Each stage is designed to progressively enrich the information related to the image, ultimately producing a caption that is both visually descriptive and contextually informative.
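To make the flow concrete, the following sketch wires the four bots together. It is a minimal orchestration sketch only: the bot callables are hypothetical placeholders for the underlying vision-language model and LLM calls (GPT-4o and Gemma in our implementation, see Sect. 3.6), not the exact prompts or code used.

```python
from typing import Callable, List, Tuple

def vischronos_caption(
    image_path: str,
    article_text: str,
    perception_bot: Callable[[str], str],                 # Stage 1: image -> dense caption
    interrogation_bot: Callable[[str], List[str]],        # Stage 2: dense caption -> questions
    explanation_bot: Callable[[str, str, str], str],      # Stage 3: (question, article, caption) -> answer
    integration_bot: Callable[[str, List[Tuple[str, str]]], str],  # Stage 4: synthesis
) -> str:
    """Run the four VisChronos stages in sequence (illustrative sketch)."""
    # Stage 1: dense captioning of the input image.
    dense_caption = perception_bot(image_path)

    # Stage 2: event-oriented questions derived from the dense caption.
    questions = interrogation_bot(dense_caption)

    # Stage 3: answer each question from the article; keep only confident answers.
    qa_pairs = []
    for question in questions:
        answer = explanation_bot(question, article_text, dense_caption)
        if answer.strip().lower() != "no information":
            qa_pairs.append((question, answer))

    # Stage 4: synthesize the dense caption and grounded answers into the final caption.
    return integration_bot(dense_caption, qa_pairs)
```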

3.2 Stage 1: Dense Captioning - Perception Bot


In this stage, the Perception Bot is responsible for generating a dense cap-
tion that thoroughly describes the image’s contents, including objects, people,
actions, and background elements. This process is guided by a carefully crafted
instruction provided to dense captioning model. The instruction ensures that
the caption captures all significant visual elements, with attention to detail,
relationships between objects, and overall context. Therefore, the dense caption
can serve as the foundation for the entire framework, offering a detailed and
structured description of the scene. This stage results in a rich description of
the image with detailed information of objects, actions, and relationships, that
serves as input for subsequent stages.

3.3 Stage 2: Question Generation - Interrogation Bot


The second stage is crucial for extracting the deeper context and events sur-
rounding the image. This stage focuses on generating a set of structured ques-
tions that aim to explore various aspects of the event or scene depicted in the

Table 1. Key aspects covered by the questions in Question Generation stage.

Time: Questions addressing when the event occurred or its timeline, to understand the temporal context.
Context or Reason: Questions exploring the circumstances or motivations behind the event, seeking to clarify why the event took place.
Main Events: Questions focusing on the central actions or occurrences within the event, ensuring a clear understanding of the primary narrative.
Outcome: Questions that inquire about the result or conclusion of the event, aiming to highlight the final impact or resolution.
Impact: Questions exploring the broader consequences or effects of the event on individuals, groups, or larger contexts.
Objects or People: Several questions delving into the key people or objects mentioned in the dense caption, ensuring that all significant elements are covered.
Special Figures: Specific questions about notable or important figures involved in the event, shedding light on their roles and influence.
Emotions and Reactions: Questions designed to explore the emotional states or reactions of the people in the image, providing insight into the human element of the scene.
Background Details: Questions addressing the setting or background elements in the image, helping to paint a fuller picture of the environment in which the event takes place.
Future Implications: Questions speculating on the potential future outcomes or ramifications of the event, aiming to place the event within a broader temporal and societal context.

image. The questions are designed to extract more information from the accom-
panying article or external sources, leading to a comprehensive understanding of
the depicted event.
A structured approach is employed to generate questions that comprehen-
sively address the event or context depicted in the image. The questions are
crafted to ensure that all key aspects of the scenario are explored, providing a
robust foundation for the subsequent explanation and synthesis stages as shown
in Table 1.
The design of the question-generation model ensures that each of these
dimensions is adequately explored, leading to a comprehensive inquiry into the
event depicted in the image. This stage serves as the foundation for the next
phase, where the generated questions are used to retrieve detailed answers from
external knowledge sources.
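As an illustration, the sketch below assembles a Stage 2 prompt from the categories in Table 1; the wording is hypothetical and not the exact instruction given to the Interrogation Bot.

```python
# Categories follow Table 1; the prompt wording is an illustrative assumption.
QUESTION_CATEGORIES = [
    "Time", "Context or Reason", "Main Events", "Outcome", "Impact",
    "Objects or People", "Special Figures", "Emotions and Reactions",
    "Background Details", "Future Implications",
]

def build_interrogation_prompt(dense_caption: str) -> str:
    """Build a Stage 2 prompt asking for one question per key aspect."""
    aspects = "\n".join(f"- {c}" for c in QUESTION_CATEGORIES)
    return (
        "Here is a dense caption of an image:\n"
        f"{dense_caption}\n\n"
        "Write one question about the depicted event for each aspect below, "
        "one question per line:\n"
        f"{aspects}"
    )
```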

3.4 Stage 3: Answer Extraction - Explanation Bot

The third stage focuses on extracting detailed answers to the questions gen-
erated in Stage 2 by leveraging external knowledge sources such as articles or
accompanying text related to the image. The Explanation Bot ensures that the
answers are accurate and directly relevant to the event depicted in the image.

This stage requires a highly precise approach to ensure that the model pro-
vides accurate answers grounded in the available information. A key princi-
ple guiding the model’s behavior during this phase is certainty. The model is
instructed to answer questions only when it is 100% confident in the informa-
tion it provides. This ensures that the answers are factually reliable and directly
linked to the knowledge from the article or dense caption.

Certainty Rule: If the model is unable to locate relevant information or if it is unsure about the correctness of an answer, it must respond with “no information.” This instruction prevents the model from speculating or providing potentially misleading or inaccurate answers.

Information Retrieval: The model focuses on extracting factual data from the
article or other external knowledge sources and cross-referencing this information
with the dense caption to ensure consistency and accuracy. The goal is to answer
each question as fully as possible, but only when the necessary information is
available and can be confidently inferred.
This strict adherence to certainty ensures that the answers provided are reli-
able and fact-based, maintaining the integrity of the image-captioning process.
By instructing the model to explicitly state “no information” when necessary,
the framework avoids overgeneralization or the inclusion of speculative answers.
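A minimal sketch of how the certainty rule can be encoded in the Explanation Bot's instruction and enforced afterwards is shown below; the prompt text is an assumption, not the exact instruction used.

```python
# The prompt wording below is an assumption that encodes the certainty rule;
# the authors' exact instruction to the Explanation Bot is not reproduced here.
EXPLANATION_PROMPT = (
    "You are given an article and a dense caption of one of its images.\n"
    "Answer the question using ONLY facts stated in the article or the caption.\n"
    "If you are not completely certain, reply exactly: no information.\n\n"
    "Article:\n{article}\n\nDense caption:\n{caption}\n\nQuestion:\n{question}\nAnswer:"
)

def keep_confident_answers(questions, answers):
    """Drop question-answer pairs where the model fell back to 'no information'."""
    return [
        (q, a) for q, a in zip(questions, answers)
        if a.strip().lower() != "no information"
    ]
```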

3.5 Stage 4: Caption Synthesis - Integration Bot


The final stage synthesizes the information gathered from previous stages into
a comprehensive caption that reflects both the visual content of the image and
the broader contextual information.
The instruction provided to the model for this stage ensures that it combines
the dense caption, external knowledge, and question responses to generate a
detailed and contextually enriched final caption. This synthesis produces a nar-
rative that reflects both the visual and factual elements of the image. The final
caption aims to describe the event depicted in the image, taking into account
the knowledge gathered in the previous stages.
Output of this stage is a final caption that provides a rich, event-based
description of the image, integrating visual details with contextual information
from external sources.

3.6 Detailed Implementation


Our VisChronos framework is designed with flexibility, allowing integration with
a variety of dense captioning models and LLMs, such as GRIT, GPT, Gemma,
Bard, etc. This adaptability ensures that different models can be used for various
tasks within the framework. However, for the creation of the EventCap dataset,
we specifically employed GPT-4o as the vision-language model for dense captioning in the first step, and utilized the Gemma-2-9b-it (open-source version) model for stages 2, 3, and 4 to handle question generation, answer extraction, and final caption synthesis.

4 Proposed EventCap Dataset


4.1 Dataset Construction
To build the EventCap dataset using our VisChronos framework, we required
both images of real-world events and the corresponding event-related informa-
tion, which we refer to as “knowledge.” Specifically, we crawled data from 491
articles published by CNN between 2014 and 2022, covering a wide range of cat-
egories, including business, entertainment, health, news, politics, and sport. For
each article, we collected the full textual content and all images included within
the article.
The dataset creation process involved processing each image along with the
corresponding article content, which served as external knowledge, through the
VisChronos framework. The images were sourced from reputable news sites and
captured by human photographers, ensuring authenticity and real-world rele-
vance. The framework then generated captions for each image, using the article’s
content to ensure contextually relevant and detailed event descriptions. Captions
generated by our method, as evaluated in Sect. 4.3, achieved a quality level com-
parable to human-written descriptions. This procedure formed the foundation
for creating each image-caption pair in the dataset. Examples of image-caption
pairs generated by our framework are illustrated in Fig. 3.
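As a sketch of the pairing procedure, the snippet below walks a local copy of the crawled articles and captions every image with its article body; the on-disk layout (one folder per article containing article.txt and its images) and the caption_fn wrapper around the VisChronos pipeline are assumptions for illustration, not part of the method itself.

```python
from pathlib import Path
from typing import Callable, Dict, List

def build_eventcap(articles_dir: str, caption_fn: Callable[[str, str], str]) -> List[Dict[str, str]]:
    """Pair every article image with the article body and caption it (sketch)."""
    records = []
    for article_dir in sorted(Path(articles_dir).iterdir()):
        if not article_dir.is_dir():
            continue
        # The article text serves as the external knowledge for its images.
        article_text = (article_dir / "article.txt").read_text(encoding="utf-8")
        for image_path in sorted(article_dir.glob("*.jpg")):
            records.append({
                "image": str(image_path),
                "caption": caption_fn(str(image_path), article_text),
            })
    return records
```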

4.2 Dataset Specifications and Statistics


The EventCap dataset comprises 3140 image-caption pairs, with each pair con-
sidered a single sample, specifically curated to enhance the understanding of
complex, event-driven content. The samples in the EventCap dataset are dis-
tributed across the years 2014 to 2022, covering a wide range of categories,
including business, entertainment, health, news, politics, and sport, each category having different sections. The images prominently feature current events,
capturing significant moments in these years. These images vary widely in size,
from 532 to 5440 pixels in width and 338 to 4618 pixels in height. Captions are
generated using our VisChronos framework, designed to distill and articulate the
essence and nuances of each event in a narrative style. These captions are exten-
sive, ranging from 50 to 793 words with an average length of 106 words, and are
structured in one or more paragraphs to convey detailed, context-rich descrip-
tions of the depicted scenes. Figure 4 presents sample distribution by years in
the EventCap dataset while Fig. 5 breaks down the sample distribution by cat-
egories, illustrating the dataset’s diversity.

4.3 User Study


Since this task is novel, the existing metrics typically used to evaluate tradi-
tional models are not directly applicable. Furthermore, as this is the first dataset

Fig. 3. Examples of image-caption pairs generated by VisChronos.

specifically created for this task, there is no comparable dataset available for
direct benchmarking. To address this, we conducted a user study aimed at assess-
ing both the effectiveness of our method and the quality of the generated dataset.

Fig. 4. Sample distribution in EventCap dataset by year (best view in color & zoom-
in). (Color figure online)

Additionally, the study aimed to compare the quality of captions generated by our framework with those written by humans, highlighting the strengths and weaknesses of each approach.

Participants: We invited 10 participants (8 males, 2 females, aged 18 to 30) to participate in our study. Participants were selected from diverse backgrounds,
including professionals in media, journalism, and computer science, to ensure a
comprehensive and well-rounded evaluation.

Apparatus and Procedure: Our study was conducted both online and on-site in
our lab, where participants completed the tasks. Each participant received clear
instructions on evaluating captions and writing their own for comparison. They
were required to spend at least 4 min evaluating and 5 min writing the caption
for each image including reading the corresponding article. The total time for
the study sessions was approximately 120 min per participant.
First, participants were asked to write their own captions for 5-10 images
from 2 articles. After that, participants evaluated the quality of captions for
approximately 10-15 images from 4 different articles. Half of the captions were
generated by our framework, while the other half were written by humans (i.e.,
other participants).
The participants were asked to rate the performance of the captions on a
scale of 1 to 5 across three metrics, based on their individual perspectives. The
comparison was based on several key metrics:

– Faithfulness: Whether the content of the caption fully describes the key
events depicted in the image.
– Comprehensibility: Whether the caption is concise, easy to read, and free
of unnecessary information.

Fig. 5. Sample distribution in EventCap dataset by category and section (best view in
color & zoom-in). (Color figure online)

– Plus-Info: Whether the caption provides additional useful information that cannot be inferred directly from the image.

Fig. 6. Comparative performance of human and VisChronos in writing event-based image captions across metrics. Our proposed method can achieve human-level writing (best view in color & zoom-in). (Color figure online)

Quantitative Results: We obtained quantitative results by averaging the ratings of each participant across the three metrics and then averaging those results over all participants. These outcomes, shown in Fig. 6, indicate that the performance of the VisChronos framework is not significantly different from that of human-written captions across the evaluated metrics. This highlights the potential of our solution in real-life applications such as writing image descriptions for newspapers, journals, and books. The user study results also demonstrate the high quality of our EventCap dataset for the development of event-based image captioning models.

Limitations: Our method relies heavily on the quality of the accompanying information to produce accurate captions. Incomplete or poor-quality data can
negatively impact the results. Additionally, the four-step process increases the
time required, with an average of 1 min needed to caption each image, which
may limit scalability for large datasets.

5 Conclusion

In this paper, we introduced the VisChronos framework, a novel four-stage approach designed to tackle the new task of Event-Enriched Image Caption-
ing (EEIC). This task aims to generate captions that not only describe the
visual content of an image but also provide insights into the underlying events,
incorporating information beyond what is directly observable. By leveraging a
combination of vision-language models and large language models (LLMs), our
framework iteratively refines the captioning process through interactive dialogue
between models, resulting in semantically richer and contextually enhanced cap-
tions. Additionally, we presented the EventCap dataset, consisting of 3140 image-caption pairs from 491 CNN articles. To the best of our knowledge, EventCap is the first dataset specifically created for the task of event-enriched image captioning, establishing it as a pioneering resource and a strong foundation for future research in event-based image captioning and related tasks.
As part of ongoing efforts, we aim to significantly expand the EventCap
dataset to 20,000 articles and 50,000 images, creating a comprehensive resource
for the research community to advance event-enriched image captioning and
multimodal learning. Additionally, we plan to refine the VisChronos framework
into a multimedia tool that enhances human experiences by integrating visual,
textual, and contextual information. This evolution will support applications like
event detection, real-time image analysis, and interactive storytelling, leveraging
the framework’s ability to generate rich, event-driven captions.

Acknowledgement. This research is supported by research funding from the Faculty of Information Technology, University of Science, Vietnam National University - Ho Chi Minh City.

References
1. Aafaq, N., Mian, A., Akhtar, N., Liu, W., Shah, M.: Dense video captioning with
early linguistic information fusion. IEEE Trans. Multimedia (2022)
2. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., et al.: GPT-4 technical report.
arXiv preprint arXiv:2303.08774 (2023)
3. Ahuja, S., Aggarwal, D., Gumma, V., et al.: Megaverse: benchmarking large
language models across languages, modalities, models and tasks. arXiv preprint
arXiv:2308.05698 (2023)
4. Brown, T., et al.: Language models are few-shot learners. In: NIPS (2020)
5. Chen, D.Z., Gholami, A., Nießner, M., Chang, A.X.: Scan2cap: context-aware dense
captioning in RGB-D scans. arXiv preprint arXiv:2012.02202 (2020)
6. Floridi, L., Chiriatti, M.: GPT-3: its nature, scope, limits, and consequences. Minds
Mach. (2020)
7. Jiao, Y., Chen, S., Jie, Z., Chen, J., Ma, L., Jiang, Y.G.: More: multi-order relation
mining for dense captioning in 3D scenes. In: ECCV (2022)
8. Krause, J., Johnson, J., Krishna, R., Fei-Fei, L.: A hierarchical approach for gen-
erating descriptive image paragraphs. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 317–325 (2017)
9. Li, C., et al.: mPLUG: effective and efficient vision-language learning by cross-
modal skip-connections. arXiv preprint arXiv:2205.12005 (2022)
10. Shao, Z., Han, J., Debattista, K., Pang, Y.: Textual context-aware dense captioning
with diverse words. IEEE Trans. Multimedia (2023)
11. Team, G., Mesnard, T., Hardin, C., et al.: Gemma: open models based on Gemini
research and technology. arXiv preprint arXiv:2401.01234 (2024)
12. Wang, B., Chen, W., Pei, H., et al.: Decodingtrust: a comprehensive assessment of
trustworthiness in GPT models. NIPS (2023)
13. Wang, T., Zhang, R., Lu, Z., Zheng, F., Cheng, R., Luo, P.: End-to-end dense
video captioning with parallel decoding. arXiv preprint arXiv:2107.12589 (2021)
14. Wu, J., et al.: Grit: a generative region-to-text transformer for object understand-
ing. arXiv preprint arXiv:2203.15806 (2022)
15. Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.C., Liu, Z., Wang, L.: The dawn of LMMs:
preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.00332
(2023)
TI-JEPA: An Innovative Energy-Based
Joint Embedding Strategy for Text-Image
Multimodal Systems

Khang H. N. Vo1,2, Duc P. T. Nguyen1,2, Thong T. Nguyen3, and Tho T. Quan1,2(B)

1 URA Research Group, Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Viet Nam
[email protected]
2 Vietnam National University Ho Chi Minh City, Ho Chi Minh City, Viet Nam
3 National University of Singapore, Singapore, Singapore

Abstract. This paper focuses on multimodal alignment within the realm of Artificial Intelligence, particularly in text and image modali-
ties. The semantic gap between the textual and visual modality poses a
discrepancy problem towards the effectiveness of multi-modalities fusion.
Therefore, we introduce Text-Image Joint Embedding Predictive Archi-
tecture (TI-JEPA), an innovative pre-training strategy that leverages
energy-based model (EBM) framework to capture complex cross-modal
relationships. TI-JEPA combines the flexibility of EBM in self-supervised
learning to facilitate the compatibility between textual and visual ele-
ments. Through extensive experiments across multiple benchmarks, we
demonstrate that TI-JEPA achieves state-of-the-art performance on mul-
timodal sentiment analysis task (and potentially on a wide range of
multimodal-based tasks, such as Visual Question Answering), outper-
forming existing pre-training methodologies. Our findings highlight the
potential of using energy-based framework in advancing multimodal
fusion and suggest significant improvements for downstream applications.

Keywords: Multimodal fusion · Joint-Embedding Predictive Architecture · Energy-based model

1 Introduction
In the era of Artificial Intelligence, the ability to process and understand infor-
mation from multiple modalities simultaneously has become increasingly crucial
[31, 32]. Multimodal fusion, the process of integrating information from various
sensory inputs to form a coherent understanding, stands at the forefront of this
challenge. Among the myriad of multimodal tasks, text-image alignment has
emerged as a fundamental problem with far-reaching applications in areas such
as visual question answering, image captioning, and cross-modal retrieval [26].
K. H. N. Vo and D. P. T. Nguyen—Contributed equally to this paper.
c The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025
W. Buntine et al. (Eds.): SOICT 2024, CCIS 2350, pp. 141–154, 2025.
https://doi.org/10.1007/978-981-96-4282-3_12

Despite significant advancements in natural language processing and computer vision independently, aligning these two modalities remains a formidable
challenge [16, 17]. The semantic gap between the continuous, high-dimensional
space of images and the discrete, symbolic nature of text poses substantial dif-
ficulties. Traditional approaches often struggle to capture the latent relation-
ships between visual and textual elements, leading to sub-optimal performance
in downstream tasks.
Different from traditional approaches, Energy-based models (EBMs) [7, 18]
have recently shown promise in various machine learning domains due to their
flexibility and ability to capture complex dependencies. By learning a scalar
energy function that associates low energy with correct configurations and high
energy with incorrect ones, EBMs offer a natural framework for modeling the
compatibility between different modalities. However, their application to text-
image alignment has been limited, leaving room for innovation in this critical
area.
Therefore, we propose Text-Image Joint Embedding Predictive Architecture
(TI-JEPA), a novel pre-training approach for text-image alignment which lever-
ages EBM concepts to capture complex dependencies. Extensive experiments
demonstrate that TI-JEPA achieves state-of-the-art performance on a wide range
of text-image alignment benchmarks, demonstrating its effectiveness and versa-
tility.
To sum up, the main contributions of this work are as follows.
– We propose TI-JEPA as a new pre-training strategy for text-image alignment,
leveraging EBM concepts to capture complex cross-modal relationships. TI-
JEPA learns strong relationships between text and image representations.
– We introduce TI-JEPA as a flexible and dynamic framework for multimodal
training, capable of producing a robust multimodal encoder. This framework
can effectively capture the relationships between text and image while utiliz-
ing pre-trained encoders for enhanced performance and scalability.
– We demonstrate that TI-JEPA is competitive with state-of-the-art (SOTA)
pre-training methodologies, showing improved performance in both accuracy
and F1-score on multimodal sentiment analysis task.

2 Related Works
2.1 Multimodal Fusion
Multimodal fusion has been a growing area of interest in machine learning
[20, 21]. Early works [11, 12, 28] focused on feature-level fusion, combining repre-
sentations from different modalities using simple concatenation or averaging.
More advanced techniques have since emerged. Xu et al. [11] introduced
attention-based fusion, allowing models to dynamically focus on relevant fea-
tures across modalities. They also proposed tensor-based methods for capturing
higher-order interactions between modalities. Lu et al. [12] explored the use of
Transformer architecture for multimodal fusion, leveraging their ability to model
long-range dependencies across different data types.

2.2 Text-Image Alignment

Text-image alignment has seen significant advancements in recent years [3, 15,
25]. Frome et al. [3] introduced the concept of visual-semantic embeddings in
DeViSE, learning a joint space where semantically similar text and images are
close to each other. While Radford et al. [25] demonstrated the power of con-
trastive learning with large-scale image-text data for text-image alignment with
the CLIP model. This approach has since been extended by methods like ALIGN
[4], which further improved performance through larger-scale training. More
recently, Li et al. [8] proposed UNIMO, a unified framework for text-image pre-
training, incorporating multiple pretext tasks to learn robust representations.

2.3 Self-supervised Learning in Multimodal Domain

Self-supervised learning has emerged as a powerful paradigm for learning representations without relying on explicit labels, significantly advancing various
fields of artificial intelligence [5, 8, 12, 14, 19, 22]. This approach has shown par-
ticular promise in the multimodal domain, where it can leverage the rich, com-
plementary information present across different modalities. The application of
self-supervised learning to text-image tasks has led to significant advancements
in multimodal understanding:

1. Masked Language Modeling for Multimodal Inputs: Lu et al. [12] adapted the successful masked language modeling technique from natural
language processing to handle multimodal inputs. Their approach allows the
model to learn joint representations of text and images by predicting masked
tokens in the presence of both modalities.
2. Unified Masked Modeling: Building upon this concept, Li et al. [8] pro-
posed UNIMO, a unified framework that applies masked modeling objectives
to both text and image modalities simultaneously. This approach enables
more robust and versatile multimodal representations, as the model learns to
understand and generate content across modalities in a unified manner.

These advancements in self-supervised learning for multimodal tasks have paved the way for improved performance in various applications, including visual ques-
tion answering, image captioning, and cross-modal retrieval [19, 22]. The ability
to learn from unlabeled multimodal data has also opened up new possibilities
for processing and understanding the vast amounts of unstructured multimodal
content available on the internet [14].

2.4 Energy-Based Models (EBMs) and Joint-Embedding Predictive Architecture (JEPA)

EBMs are a class of probabilistic models that define a probability distribution over the input space by associating an “energy” value with each possible input
[23]. The energy function is typically parameterized by a neural network, and the

probability of an input is inversely proportional to its energy. This formulation allows EBMs to model complex, non-linear relationships in the data, making
them well-suited for tasks such as speech and language processing.
EBMs have recently regained attention in the machine learning community.
LeCun et al. [7] provided a comprehensive overview of EBMs and their applica-
tions. Assran et al. [1] introduced the Image-based Joint-Embedding Predictive
Architecture (I-JEPA), a non-generative self-supervised learning approach that
focuses on predicting representations of various target blocks within the same
image. This method effectively scales with Vision Transformers and achieves
impressive downstream performance across tasks like linear classification and
object counting, further showcasing the versatility of self-supervised learning
architectures.

3 Our Approach
The proposed TI-JEPA architecture integrates cross-attention mechanisms to
effectively align textual and visual information for predicting masked image
patches, which is demonstrated in Fig. 1. Before getting into details, we denote $I$ as the original image, $I_{\text{context}}$ as the same image after context masking, and $T$ as its caption. The high-level architecture can be described in terms of the following components:

– The image encoder $f_I$ processes the full image $I$ and the masked image $I_{\text{context}}$, generating the full embedding representation and the representation of the masked context parts (context block).
– The text encoder $f_T$ converts the image description $T$ into a dense representation that captures semantic information.
– We employ two text-to-image (t2i) cross-attention blocks, namely block $X$ and block $\tilde{X}$, to align the encoded text features with the visual features from the image encoder.
– The output of the t2i cross-attention block $X$ is passed through a predictor $g_\phi$, which generates the final predictions for the representations of the target patches.

3.1 Creating Target and Context Blocks

To create the target and context representations, we designed an approach consisting of two steps: a target representation creation step and a context representation creation step.

Target Representation Creation. The first step involves creating target representations for the blocks to be predicted. Particularly, the target representations correspond to a combination of embedding vectors from the text and the target image patches.

Fig. 1. The proposed TI-JEPA architecture, where cross-attention between text and image encodings is leveraged to predict masked patches.

Given an input image $I$, we divide it into $N$ non-overlapping patches, which are passed through the image encoder $f_I$ to obtain corresponding representations $s_I = \{s_{I_1}, s_{I_2}, \ldots, s_{I_N}\}$, where $s_{I_k}$ is the representation of the $k$-th patch. The paired text $T$ is divided into $L$ tokens, which are passed through the target encoder $f_T$ to obtain corresponding representations $s_T = \{s_{T_1}, s_{T_2}, \ldots, s_{T_L}\}$, where $s_{T_k}$ is the representation of the $k$-th token. These representations then pass through the target cross-attention block to produce the final target representations $s_y = \tilde{X}(s_T, s_I)$. To obtain the targets for the loss, we sample $M$ blocks from these representations, containing the patches that need to be predicted. We denote the mask corresponding to the $i$-th block by $B_i$, and its patch-level representation by $s_y(i) = \{s_{y_j}\}_{j \in B_i}$.

Context Representation Creation. For context representations, we randomly sample a single block $I_x$ from the image $I$ with a random scale. We denote the mask associated with this context block by $B_x$. To prevent trivial predictions, we remove any regions that overlap between the context block and the target blocks created earlier. The masked context block is also passed through the image encoder $f_I$, and then through the context cross-attention module $X$ together with the encoded text, to obtain corresponding representations $s_x = \{s_{x_i}\}_{i \in B_x}$.
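To make the masking step concrete, the sketch below samples four target blocks and one context block over the patch grid and removes their overlap, following the scale ranges and block counts reported later in Table 2; the rectangular, unit-aspect-ratio blocks are simplifying assumptions, not the exact sampling procedure.

```python
import random
from typing import List, Set, Tuple

def sample_masks(
    grid: Tuple[int, int] = (16, 16),        # patch grid, e.g. 224 / 14 = 16 per side
    num_targets: int = 4,                     # number of target blocks
    target_scale: Tuple[float, float] = (0.15, 0.2),
    context_scale: Tuple[float, float] = (0.85, 1.0),
    seed: int = 0,
) -> Tuple[List[Set[int]], Set[int]]:
    """Sample target masks B_i and one context mask B_x over the patch grid (sketch)."""
    rng = random.Random(seed)
    height, width = grid

    def sample_block(scale_range: Tuple[float, float]) -> Set[int]:
        # Pick a rectangular block whose area is a random fraction of the grid.
        frac = rng.uniform(*scale_range)
        h = max(1, min(height, round((frac ** 0.5) * height)))
        w = max(1, min(width, round((frac ** 0.5) * width)))
        top, left = rng.randint(0, height - h), rng.randint(0, width - w)
        return {(top + i) * width + (left + j) for i in range(h) for j in range(w)}

    targets = [sample_block(target_scale) for _ in range(num_targets)]
    context = sample_block(context_scale)
    for block in targets:                     # remove overlap to avoid trivial prediction
        context -= block
    return targets, context
```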

3.2 Output Prediction


The output of the attention module forms the context vectors, which serve as the input to the predictor. More formally, given the output $s_x$ from the attention module, the goal is to predict the representations of the $M$ target embeddings $s_y(1), s_y(2), \ldots, s_y(M)$. For each target embedding $s_y(i)$, corresponding to the target mask $B_i$, the predictor takes as input the attention output $s_x$ and a set of mask tokens $\{m_j\}_{j \in B_i}$, one for each patch that needs to be predicted. The predictor then outputs a prediction $\{\hat{s}_{y_j}\}_{j \in B_i} = g_\phi(s_x, \{m_j\}_{j \in B_i})$. The mask tokens are parameterized by a shared learnable vector with added positional encoding. We obtain the final predictions $\hat{s}_y(1), \hat{s}_y(2), \ldots, \hat{s}_y(M)$ by applying the predictor $M$ times, each time relying on the mask tokens for the corresponding target-block locations.

3.3 Objective Function

The model is trained to minimize the distance between the predicted and target patch-level representations, averaged over the $M$ target blocks:

$$L_P = \frac{1}{M} \sum_{i=1}^{M} D\big(\hat{s}_y(i), s_y(i)\big) = \frac{1}{M} \sum_{i=1}^{M} \sum_{j \in B_i} \lVert \hat{s}_{y_j} - s_{y_j} \rVert_2^2$$
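A direct PyTorch rendering of this objective might look as follows; this is a sketch of the formula above, not the authors' training code.

```python
import torch

def tijepa_loss(pred_blocks, target_blocks):
    """Per block, sum squared L2 distances over its patches, then average over
    the M target blocks, matching L_P above.

    pred_blocks, target_blocks: lists of M tensors shaped (patches_in_block, dim).
    """
    per_block = [
        ((pred - target) ** 2).sum(dim=-1).sum()
        for pred, target in zip(pred_blocks, target_blocks)
    ]
    return torch.stack(per_block).mean()
```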

4 Training TI-JEPA
4.1 Dataset

Our experimental setup was designed to thoroughly evaluate the performance and capabilities of the TI-JEPA model. For training, we utilized the Microsoft
COCO 2017 dataset [10], specifically the training set, which comprises over
118,000 image-text pairs. The COCO dataset was chosen for its exceptional
diversity in both visual content and descriptive captions, providing a compre-
hensive foundation for multimodal training. Since each image in the dataset is
associated with multiple captions, we randomly selected one caption per image
to create a unique text-image pair for each record. This process ensures that our
dataset contains distinguishable and meaningful text-image pairs, enabling more
effective multimodal learning.

4.2 Experiment and Hyperparameter Configurations

All experiments were conducted on two NVIDIA GeForce GTX 1080 Ti GPUs.
The training process spanned 300 epochs, with the largest model requiring
approximately 188 h on our hardware setup.

Encoder Modules: For the image encoder, we utilized a checkpoint pretrained with the ViT-H architecture, using a 16 × 16 patch size and 224 × 224 resolution, trained for 300 epochs as part of the I-JEPA model1. The text encoder is based on gte-base-en-v1.5, a model developed by the Institute for Intelligent Computing at Alibaba Group, using the transformer++ backbone (BERT + RoPE + GLU). It supports a context length of up to 8192 tokens with a text embedding dimension of 768.

1 Checkpoint available here: https://github.com/facebookresearch/ijepa?tab=readme-ov-file#pretrained-models.

Cross-Attention Module Variants: The t2i cross-attention module consists of multiple blocks, each containing a self-attention layer, a cross-attention layer with text representations, and a multilayer perceptron (MLP) layer, all with residual connections. We configured three variants of the module: Small, Medium, and Large, as detailed in Table 1 below.

Table 1. Cross-attention module variants

Model    Layers   Heads   Hidden Size   Params
Small    4        8       768           39M
Medium   6        10      768           58M
Large    8        12      1024          131M
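A minimal PyTorch sketch of one block of this module is given below. It assumes pre-norm attention and a GELU MLP, details the text above does not specify, and defaults to the dimensions of the Small variant in Table 1; it is an illustration of the block structure, not our exact implementation.

```python
import torch
import torch.nn as nn

class T2ICrossAttentionBlock(nn.Module):
    """One t2i cross-attention block: self-attention over image tokens,
    cross-attention to text tokens, and an MLP, each with a residual connection."""

    def __init__(self, dim: int = 768, num_heads: int = 8, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, image_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Self-attention over image patch tokens.
        x = self.norm1(image_tokens)
        image_tokens = image_tokens + self.self_attn(x, x, x, need_weights=False)[0]
        # Cross-attention: image tokens attend to the encoded text tokens.
        x = self.norm2(image_tokens)
        image_tokens = image_tokens + self.cross_attn(x, text_tokens, text_tokens, need_weights=False)[0]
        # Position-wise MLP.
        return image_tokens + self.mlp(self.norm3(image_tokens))
```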

Predictor Module: The predictor is inherited from the original I-JEPA pre-
dictor, which is a shallow vision transformer with a depth of 12 layers and 12
attention heads per layer.

Training Configs: We utilized a learning rate of 0.001 and a momentum scheduler with an exponential moving average (EMA) ranging between 0.996 and 1.0. The model was trained with a batch size of 1024, processing the entire training set of the COCO 2017 dataset in each epoch. Optimization was performed using AdamW. In addition, we applied image masking techniques during training to enhance robustness. For the encoder masks, we used a scale ranging from 0.85 to 1.0, while for the predictor masks, a smaller scale range of 0.15 to 0.2 was applied to focus on smaller image regions. The model was trained with 12 layers in the prediction depth, utilizing a patch size of 14 × 14, with one context block (encoder block) and four predictor blocks to guide the model's learning. Our hyper-parameter configuration is detailed in Table 2, showcasing the comprehensive setup for learning rate, weight decay, image masking scales, and batch size.
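The momentum scheduler is only specified by its [0.996, 1.0] range; the exact schedule and which parameters the EMA tracks are not stated. The sketch below therefore assumes a linear ramp and a standard parameter-wise EMA update for the target branch, which is the usual way JEPA-style targets are maintained.

```python
import torch

def ema_momentum(step: int, total_steps: int, start: float = 0.996, end: float = 1.0) -> float:
    """Linear ramp for the EMA momentum (assumed schedule for the [0.996, 1.0] range)."""
    return start + (end - start) * min(step / max(total_steps, 1), 1.0)

@torch.no_grad()
def ema_update(target_module: torch.nn.Module, online_module: torch.nn.Module, momentum: float) -> None:
    """Update target-branch parameters as an exponential moving average of the online branch."""
    for t, o in zip(target_module.parameters(), online_module.parameters()):
        t.mul_(momentum).add_(o, alpha=1.0 - momentum)
```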
Unlike training from scratch, which often results in energy collapse - where
diverse inputs yield near-identical outputs - our proposed TI-JEPA frame-
work mitigates this issue by leveraging pretrained encoder-decoder models that
have consistently outperformed SOTA encoders. TI-JEPA offers a flexible and
dynamic framework for multimodal training, capable of producing a robust mul-
timodal encoder that effectively captures intricate relationships between text
and image. To further optimize the multimodal encoding process, we strategi-
cally froze both the encoder and decoder components of the pipeline, training
only the cross-attention modules. This decision was made to address two critical
concerns:

Table 2. Experimental Parameters

Parameter                          Value
Batch size                         1024
Learning rate                      0.001
Optimizer                          AdamW with Momentum Scheduler
Exponential Moving Average (EMA)   [0.996, 1.0]
Epochs                             300
Context Mask Scale                 [0.85, 1.0]
Target Mask Scale                  [0.15, 0.2]
Number of Context Blocks           1
Number of Target Blocks            4

– One of the major challenges in JEPA models is energy collapse, where the
model converges to a state where multiple inputs result in similar out-
puts, significantly reducing representational diversity. By freezing the pre-
trained encoder-decoder modules, we ensure that these components retain
their capacity to extract diverse and meaningful features from text and image
inputs, thus mitigating this degenerative phenomenon.
– The encoder-decoder components used in our framework are pretrained on
extensive datasets, providing a rich set of learned representations. By freez-
ing these layers, we effectively reuse the knowledge encapsulated in the pre-
trained weights, allowing us to focus computational resources on optimizing
the cross-attention modules. This approach not only enhances the scalability
and stability of our model but also ensures that the pretrained components
contribute effectively to overall performance without being disrupted by fur-
ther training.

This targeted training methodology allows us to harness the full potential of pretrained models, ensuring stability while improving scalability and performance in multimodal tasks.

5 Evaluation By Sentiment Analysis With Text and Image
5.1 Data Preparation
To evaluate the versatility of our TI-JEPA model, we applied it to the task of
multimodal sentiment analysis. This task requires understanding both the tex-
tual content and the visual context to predict the sentiment (positive, negative,
or neutral) of a given image-text pair.
We utilized the MVSA-Single and MVSA-Multi datasets created by
Niu et al. [27], originally containing 5,129 and 19,600 image-text pairs, respectively, with sentiment labels (positive, neutral, and negative). The datasets were

collected from Twitter and annotated for multimodal sentiment analysis. How-
ever, we conducted a preprocessing step to remove emotionally inconsistent sam-
ples, where the sentiment labels of the image and text conflicted.
For the MVSA-Single dataset, we processed the data by first addressing
instances where both the text and image labels were identical, which we retained
as trivial cases. We removed any image-text pairs where one label was positive
and the other was negative, considering such contradictions unreliable. For cases
where one component (either text or image) had a neutral label and the other a
positive or negative label, we assigned the final label based on the non-neutral
component. This ensured consistency between the text and image sentiment
annotations. The MVSA-Multi dataset, however, contains sentiment annotations
from three annotators, which required a majority voting approach for determin-
ing the final sentiment labels for both the text and image components. For each
pair, we calculated the majority sentiment for both the text and image anno-
tations. In cases where the annotations were perfectly balanced, such as one
annotator labeling the text as neutral, another as positive, and the third as neg-
ative, we considered the label ambiguous and removed the pair from the dataset.
This approach ensured that only clear sentiment pairs were retained for further
analysis.
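The label-resolution rules described above can be summarised in a short sketch; the function names and signatures are illustrative only, and the MVSA-Multi function shows the per-component majority vote (applied separately to the text and image annotations before the pair-level rule).

```python
from collections import Counter
from typing import List, Optional

def resolve_mvsa_single(text_label: str, image_label: str) -> Optional[str]:
    """MVSA-Single rule: keep agreeing labels, drop positive/negative conflicts,
    and let the non-neutral side decide when one label is neutral."""
    if text_label == image_label:
        return text_label
    if "neutral" in (text_label, image_label):
        return image_label if text_label == "neutral" else text_label
    return None  # positive vs. negative: contradictory, remove the pair

def resolve_mvsa_multi(annotations: List[str]) -> Optional[str]:
    """MVSA-Multi rule: majority vote over the three annotators; drop ties."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count >= 2 else None
```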
After preprocessing, the MVSA-Single dataset was reduced to 4,511 pairs, and the MVSA-Multi dataset to 17,027 pairs. The number of records for each dataset is shown in Table 3:

Table 3. MVSA-Single and MVSA-Multi datasets after pre-processing

Dataset        Positive   Neutral   Negative   Total
MVSA-Single    2,683      470       1,358      4,511
MVSA-Multi     11,320     4,408     1,299      17,027

We then split the revised datasets into training, validation, and test sets with a ratio of 8:1:1. To adapt our model for this task, we fine-tuned the pretrained TI-JEPA by adding a classification head consisting of a simple linear layer on top. The classification head was trained for 40 epochs using the Adam optimizer with a learning rate of 0.001 and a cross-entropy loss.
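A minimal sketch of such a classification head is given below; the 768-dimensional input is an assumption based on the encoder embedding sizes reported in Sect. 4.2, and the fused-embedding interface is illustrative rather than our exact code.

```python
import torch
import torch.nn as nn

class SentimentHead(nn.Module):
    """Linear classification head on top of the TI-JEPA multimodal embedding
    for the three sentiment classes (sketch)."""

    def __init__(self, embed_dim: int = 768, num_classes: int = 3):
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, fused_embedding: torch.Tensor) -> torch.Tensor:
        return self.fc(fused_embedding)  # logits for positive / neutral / negative

# Training, as stated above, uses Adam (lr = 0.001) with cross-entropy loss, e.g.:
# optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
# loss = nn.CrossEntropyLoss()(head(features), labels)
```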

5.2 Model Evaluation and Metrics

To validate the effectiveness of the proposed model, we conducted comparative experiments against several mainstream single-modal and multimodal fusion models. The performance evaluation was based on accuracy and F1-score, calculated using the following metrics. Precision (P) is defined as:

$$P = \frac{TP}{TP + FP}$$

Recall (R) is given by:

$$R = \frac{TP}{TP + FN}$$

The F1-score is the harmonic mean of precision and recall:

$$F_1 = 2 \times \frac{P \times R}{P + R}$$

Accuracy (Acc) is defined as:

$$Acc = \frac{TP + TN}{TP + TN + FP + FN}$$
In these equations, TP denotes true positives, TN true negatives, FP false
positives, and FN false negatives. Precision (P) measures the proportion of cor-
rectly identified positive instances, while recall (R) captures the proportion of
actual positives identified correctly.
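For concreteness, the formulas translate into code as below; how the per-class scores are aggregated across the three sentiment classes (e.g., macro or weighted averaging) is not stated in the text, so only the per-class computation is shown.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Per-class precision, recall, and F1 from confusion counts, following the formulas above."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Overall accuracy from confusion counts."""
    return (tp + tn) / (tp + tn + fp + fn)
```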

5.3 Comparative Model Performance

In our study, we compared the proposed model with several benchmark mod-
els, evaluating their accuracy and F1-score. Traditional models like SentiBank
and SentiStrength [2] rely on statistical feature extraction and struggle to cap-
ture intrinsic multimodal features, leading to relatively low performance. On the
other hand, CNNMulti [24] processes text and image features separately using
two distinct CNNs, leveraging deep learning’s capacity to capture emotional
expressiveness and improving prediction by merging these features.
The DNN-LR model [6] employs transfer learning with pretrained models and
utilizes logistic regression for decision making. The CoMemory model [29] intro-
duces a multimodal fusion mechanism, which enhances the interaction between
text and image features, improving sentiment prediction. The MVAN model [30]
applies a memory network on top of a multi-view attention mechanism, enabling
richer semantic interactions between image and text and achieving better results.
Moreover, the CLMLF model [9] utilizes contrastive learning to enhance the
representation of multimodal features, fostering stronger associations between
image and text inputs, thereby improving model performance. Besides, the ITIN
model [33] implements cross-modal alignment operations along with an adaptive
fusion module, leading to substantial gains in accuracy for sentiment analysis
tasks. Lastly, the CLIP-CA-CG model [13] utilizes pre-trained RoBERTa and ResNet50 models to extract textual and visual features, which are further processed through CLIP contrastive learning to obtain deeper-level features.
We compared three configurations of our proposed TI-JEPA model (Small, Medium, and Large) against the baselines mentioned above. Table 4 presents the compar-
ative results, demonstrating the performance of each configuration of TI-JEPA
across both the MVSA-Single and MVSA-Multi datasets.

Table 4. Comparative experiments of several models on MVSA-Single and MVSA-Multi datasets.

Model                      MVSA-Single            MVSA-Multi
                           Accuracy (%)  F1 (%)   Accuracy (%)  F1 (%)
SentiBank & SentiStrength  52.12         50.15    65.70         55.42
CNN-Multi                  61.25         58.40    67.92         62.19
DNN-LR                     64.10         61.50    66.41         63.97
Co-Memory                  66.75         64.08    68.92         70.77
MVAN                       70.15         68.75    71.00         73.05
CLMLF                      72.75         71.95    74.20         76.00
ITIN                       73.90         72.25    75.08         74.80
CLIP-CA-CG                 75.25         73.62    76.05         74.02
TI-JEPA-Small (Ours)       73.03         71.69    73.59         72.10
TI-JEPA-Medium (Ours)      75.26         72.15    75.13         73.57
TI-JEPA-Large (Ours)       76.75         74.62    77.55         75.02

The comparative results clearly demonstrate the superior performance of our proposed architecture, TI-JEPA, across both the MVSA-Single and MVSA-Multi datasets, particularly in terms of accuracy and F1-score. Specifically, the TI-JEPA-Large model surpasses all previous state-of-the-art models, achieving the highest accuracy of 76.75% and F1-score of 74.62% on MVSA-Single, as well as 77.55% and 75.02% on MVSA-Multi. This highlights the effectiveness of
our multimodal approach in integrating and aligning text and image features to
better capture the underlying sentiment in multimodal data.
Notably, the TI-JEPA-Medium model also outperforms the previous models
like CLIP-CA-CG and ITIN, achieving competitive results with an accuracy of
75.26% on MVSA-Single and 75.13% on MVSA-Multi. This shows that even
with fewer parameters, our TI-JEPA framework remains robust, delivering high
accuracy and strong generalization.
The TI-JEPA-Small model, despite having a more compact architecture, still
demonstrates comparable performance to more complex models like MVAN and
Co-Memory. With accuracy and F1-scores of 73.03% and 71.69% respectively on MVSA-Single, and 73.59% and 72.10% on MVSA-Multi, it offers an efficient
alternative with lower computational costs, making it ideal for scenarios requir-
ing faster inference or less hardware-intensive training.
In more complex sentiment analysis tasks, such as those presented by the
MVSA-Multi dataset, the alignment between modalities becomes even more
crucial. Our TI-JEPA-Large’s superior F1-score, which balances precision and
recall, indicates that TI-JEPA not only captures the most important multimodal
features but also avoids overfitting to either modality, thus generalizing better
across diverse data points. This ability to balance performance across modalities
is a key advantage over other approaches, as evidenced by the consistently higher
F1-scores across both datasets.

6 Limitations and Future Works


While the proposed approach shows promising results, there are several notable
limitations. Due to constraints on data resources, we were unable to further vali-
date the model’s robustness using other publicly available datasets, which might
limit its generalization to diverse scenarios. Additionally, our experiments were
restricted to two modalities: image features and text features. This limitation
could lead to potential misjudgments in more complex multimodal tasks where
additional data types are required. Furthermore, although we employed pre-
trained encoders for both image and text, we were unable to implement a fully
pre-trained pipeline for both encoders, which may have impacted the overall
performance of the model.
To address these limitations and further improve our approach, several direc-
tions for future work are proposed. First, conducting an ablation study would
allow for a more detailed evaluation of the model’s performance when tested
solely on text or image features, offering insights into each modality’s contribu-
tion. Second, extending the framework to tackle more advanced tasks, such as
visual question answering, by integrating our multimodal encoder into existing
multimodal systems could enhance its versatility. Third, adding more evalua-
tion metrics could increase the reliability of the model’s assessment and improve
the trustworthiness of the pipeline. Finally, acquiring additional resources would
enable larger-scale experiments, facilitating a more comprehensive evaluation
across diverse datasets and potentially leading to greater model robustness.

7 Conclusion
In this paper, we introduced TI-JEPA, a novel energy-based model for text-
image alignment in multimodal fusion. Our approach addresses the challenge
of bridging the semantic gap between visual and textual modalities, offering a
flexible framework for various multimodal tasks. The success of TI-JEPA can be
attributed to its joint embedding space and predictive architecture, enabling the
model to learn robust and generalizable representations.

References
1. Assran, M., et al.: Self-supervised learning from images with a joint-embedding
predictive architecture. arXiv: 2301.08243 [cs.CV] (2023)
2. Borth, D., et al.: SentiBank: large-scale ontology and classifiers for detecting senti-
ment and emotions in visual content. In: Proceedings of the 21st ACM International
Conference on Multimedia. MM ’13. Barcelona, Spain: Association for Comput-
ing Machinery, pp. 459–460 (2013). isbn: 9781450324045. https://doi.org/10.1145/
2502081.2502268
3. Frome, A., et al.: DeViSE: a deep visual-semantic embedding model. In: Burges,
C.J., et al. (eds.) Advances in Neural Information Processing Systems, vol. 26.
Curran Associates, Inc. (2013)

4. Jia, C., et al.: Scaling up visual and vision-language representation learning with
noisy text supervision. arXiv:2102.05918 (2021). https://api.semanticscholar.org/
CorpusID:231879586
5. Korbar, B., Tran, D., Torresani, L.: Cooperative learning of audio and video models
from self-supervised synchronization. In: Bengio, S., et al. (eds.) Advances in Neural
Information Processing Systems, vol. 31. Curran Associates, Inc. (2018)
6. Krishna, R., et al.: Visual genome: connecting language and vision using crowd-
sourced dense image annotations. Int. J. Comput. Vision 123(1), 32–73 (2017).
https://doi.org/10.1007/s11263-016-0981-7
7. LeCun, Y., et al.: A tutorial on energy-based learning (2006)
8. Li, W., et al.: UNIMO: towards unified-modal understanding and generation via
cross-modal contrastive learning. In: Zong, C., et al (eds.) Proceedings of the 59th
Annual Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing (Volume 1: Long
Papers). Association for Computational Linguistics, pp. 2592–2607 (2021). https://
doi.org/10.18653/v1/2021.acl-long.202, https://aclanthology.org/2021.acllong.202
9. Li, Z., et al.: CLMLF: a contrastive learning and multi-layer fusion method for
multimodal sentiment detection. In: Findings of the Association for Computational
Linguistics: NAACL 2022. Association for Computational Linguistics (2022)
10. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: European Con-
ference on Computer Vision (2014). https://api.semanticscholar.org/CorpusID:
14113767
11. Liu, Z., et al.: Efficient low-rank multimodal fusion with modality- specific fac-
tors. In: Annual Meeting of the Association for Computational Linguistics (2018).
https://api.semanticscholar.org/CorpusID:44131945
12. Lu, J., et al.: ViLBERT: pretraining task-agnostic visiolinguistic representations
for vision-and-language tasks. In: Wallach, H., et al. (eds.) Advances in Neural
Information Processing Systems, vol. 32. Curran Associates, Inc. (2019)
13. Lu, X., Ni, Y., Ding, Z.: Cross-modal sentiment analysis based on CLIP image-text
attention interaction. Int. J. Adv. Comput. Sci. Appl. 15(2) (2024). https://doi.
org/10.14569/IJACSA.2024.0150290
14. Nguyen, C.-D., et al.: Expand BERT representation with visual information via
grounded language learning with multimodal partial alignment. In: Proceedings of
the 31st ACM International Conference on Multimedia, pp. 5665–5673 (2023)
15. Nguyen, C.-D., et al.: Improving multimodal sentiment analysis: supervised angular
margin-based contrastive learning for enhanced fusion representation. In: Findings
of the Association for Computational Linguistics: EMNLP 2023, pp. 14714–14724
(2023)
16. Nguyen, C.-D., et al.: KDMCSE: knowledge distillation multimodal sentence
embeddings with adaptive angular margin contrastive learning. In: North Ameri-
can Chapter of the Association for Computational Linguistics (2024). https://api.
semanticscholar.org/CorpusID:268691429
17. Nguyen, T., et al.: Adaptive contrastive learning on multimodal transformer for
review helpfulness prediction. In: Goldberg, Y., Kozareva, Z., Zhang, Y. (eds.)
Proceedings of the 2022 Conference on Empirical Methods in Natural Language
Processing. Abu Dhabi, United Arab Emirates: Association for Computational Lin-
guistics, pp. 10085–10096 (2022). https://doi.org/10.18653/v1/2022.emnlp-main.
686, https://aclanthology.org/2022.emnlp-main.686
18. Nguyen, T., et al.: DemaFormer: damped exponential moving average transformer
with energy-based modeling for temporal language grounding. In: Findings of the
Association for Computational Linguistics: EMNLP 2023, pp. 3635–3649 (2023)

19. Nguyen, T., et al.: Video-language understanding: a survey from model architec-
ture, model training, and data perspectives. In: Ku, L.-W., Martins, A., Srikumar,
V. (eds.) Findings of the Association for Computational Linguistics ACL 2024.
Bangkok, Thailand and virtual meeting: Association for Computational Linguis-
tics, pp. 3636–3657 (2024). https://aclanthology.org/2024.findings-acl.217
20. Nguyen, T., et al.: Vision-and-language pretraining. arXiv preprint
arXiv:2207.01772 (2022)
21. Nguyen, T.T., et al.: Encoding and controlling global semantics for long-form video
question answering. arXiv preprint arXiv:2405.19723 (2024)
22. Nguyen, T.T., et al.: Topic modeling as multi-objective contrastive optimization.
In: The Twelfth International Conference on Learning Representations (2024).
https://openreview.net/forum?id=HdAoLSBYXj
23. Ou, Z.: Energy-based models with applications to speech and language processing.
Found. Trends Signal Process. 18, 1–199 (2024). https://api.semanticscholar.org/
CorpusID:268459305
24. Ouyang, X., et al.: Sentiment analysis using convolutional neural network. In:
2015 IEEE International Conference on Computer and Information Technology;
Ubiquitous Computing and Communications; Dependable, Autonomic and Secure
Computing; Pervasive Intelligence and Computing, pp. 2359–2364 (2015). https://
doi.org/10.1109/CIT/IUCC/DASC/PICOM.2015.349.
25. Radford, A., et al.: Learning transferable visual models from natural language
supervision. arXiv: 2103.00020 [cs.CV] (2021)
26. Siebert, T., et al.: Multi-modal fusion transformer for visual question answering in
remote sensing (2022)
27. Wang, H., Ren, C., Yu, Z.: Multimodal sentiment analysis based on cross-instance
graph neural networks. Appl. Intell. 54(4), 3403–3416 (2024). https://doi.org/10.
1007/s10489-024-05309-0
28. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual
attention. In: Bach, F., Blei, D. (eds.) Proceedings of the 32nd International Con-
ference on Machine Learning, vol. 37. Proceedings of Machine Learning Research,
pp. 2048–2057. PMLR, Lille, France (2015). https://proceedings.mlr.press/v37/
xuc15.html
29. Xu, N., Mao, W., Chen, G.: A co-memory network for multimodal sentiment anal-
ysis. In: SIGIR ’18, The 41st International ACM SIGIR Conference on Research
& Development in Information Retrieval, pp. 929–932. Association for Computing
Machinery, Ann Arbor, MI, USA (2018). https://doi.org/10.1145/3209978.3210093
30. Yang, X., et al.: Image-text multimodal emotion classification via multi-view atten-
tional network. IEEE Trans. Multimedia 23, 4014–4026 (2021). https://doi.org/
10.1109/TMM.2020.3035277
31. Zhao, G., Li, Y., Xu, Q.: From emotion AI to cognitive AI. Int. J. Network Dyn.
Intell. 1(1), 65–72 (2022). https://doi.org/10.53941/ijndi0101006, https://www.
sciltp.com/journals/ijndi/article/view/115
32. Zhao, J., et al.: Cognitive psychology-based artificial intelligence review. Front.
Neuroscience 16, 1024316 (2022). https://doi.org/10.3389/fnins.2022.1024316
33. Zhu, T., et al.: Multimodal sentiment analysis with image-text interaction network.
Trans. Multi. 25, 3375–3385 (2023). https://doi.org/10.1109/TMM.2022.3160060
A Lightweight End-to-End Multi-task
Learning System for Vietnamese Speaker
Verification

Mai Hoang Dao1,2 , Son Thai Nguyen2 , Duy Minh Le2 , Cong Tran2(B) ,
and Cuong Pham1,2
1
VinAI Research, Hanoi, Vietnam
[email protected]
2
Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
{congtt,cuongpv}@ptit.edu.vn

Abstract. Automatic speaker verification (ASV) in low-capacity devices utilized for industrial Internet of Things (IoT) applications is faced with
two major challenges: lack of annotated training data and model com-
plexity. To address these challenges, this paper introduces the first Viet-
namese audio dataset for training a multi-task learning method named
Vi-LMM that jointly performs command detection, fake voice recog-
nition, and speaker verification tasks. To optimize Vi-LMM for low-
capacity devices, we further employ knowledge distillation to reduce the
number of parameters by 3.5 times. An empirical experiment is con-
ducted to evaluate the effectiveness of the proposed method and the
results show that Vi-LMM outperforms strong single-task models in
terms of both reducing the number of learnable parameters and achieving
higher F1 scores while maintaining comparable error rates.

Keywords: Speaker verification · Multi-task learning · Vietnamese


1 Introduction
Automatic speaker verification (ASV) is one of the most important fields of
speech processing and is a fundamental component of speech-based security sys-
tems [15, 20]. However, ASV for low-capacity devices, especially for Vietnamese,
often faces two major challenges: the lack of dataset and model complexity.
ASVspoof 2019 [26] and ASVspoof 2021 [31] are two widely used datasets
in English for anti-spoofing in automatic speaker verification, while FMFCC-A
[33] is the largest publicly available dataset for synthetic speech detection in
Mandarin. VoxCeleb [16] is an audio dataset consisting of more than 100,000
utterances from 1,251 celebrities, which is suitable for speaker verification tasks.
However, for Vietnamese speaker verification tasks, only two relatively small datasets [18, 27] exist, and they are not publicly available to the research
community. The VLSP2021 Challenge1 provides a Vietnamese Speaker Verifi-
cation dataset, but it is also not publicly available. This lack of high-quality
annotated datasets in Vietnamese poses a significant challenge for studies on
1 https://vlsp.org.vn/vlsp2021/eval/vsv.
c The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025
W. Buntine et al. (Eds.): SOICT 2024, CCIS 2350, pp. 155–166, 2025.
https://doi.org/10.1007/978-981-96-4282-3_13

ASV in the language. In this study, we aim to address this issue by developing
a meticulously designed Vietnamese dataset suitable for novel AI model-based
tasks that are currently receiving widespread attention from the research com-
munity worldwide.
Recent deep learning models for ASV have a significant number of param-
eters and require substantial computational resources to perform accurately
[1, 3, 11, 13, 22, 29]. Deep neural networks have been widely used for ASV, either
as standalone models [6–8, 18, 32] or as feature extractors for other classifiers
Moreover, previous studies on ASV tend to explore large pretrained speech representation models [4, 25, 30]. Though these models achieve outstanding performance, they have a massive size and long inference time. Thus, using such
models in low-capacity IoT devices is challenging. Furthermore, current ASV
systems focus on specific tasks, while real-world applications require the abil-
ity to handle multiple tasks simultaneously. Using a single model for each task significantly increases the number of operations that the hardware system needs to perform, which leads to additional latency and degrades the user experience.
As a result, developing compact and fast multi-task learning models that can
be embedded in low-capacity devices is crucial for ASV applications. Despite
previous efforts, no dataset or prior models address multi-task learning for Viet-
namese speaker verification. Additionally, most existing deep learning models
are either exclusive to a single task or have an excessive number of parameters,
while our goal is to build a comprehensive and lightweight system suitable for
low-capacity devices. In this study, we aim to address these challenges by devel-
oping multi-task learning models that are both compact and fast for Vietnamese
speaker verification.
Motivated by the challenges outlined above, we push this research field forward by
introducing a new dataset and a lightweight model. Our dataset includes 6480
audio and text label pairs from 162 individuals and 65 types of AI synthetic
voice for three key ASV tasks. Our Vi-LMM model is a lightweight multitasking
model that incorporates an attention layer to integrate information between
tasks. We have reduced the number of parameters of Vi-LMM by 3.5 times
using recent advances in knowledge distillation. Our contributions to this field
are summarized as follows:
– We introduce the first public Vietnamese dataset as training data of three
sub-tasks of ASV, namely command detection, fake voice recognition, and
speaker verification;
– We propose two lightweight models, termed Vi-LMM and Vi-LMM-S, for joint
learning of the three tasks;
– Experimental results on our dataset show that (i) while requiring a signif-
icantly smaller number of parameters, our proposed models exhibit compa-
rable performance to other strong single baselines [6, 7, 11, 13, 18, 24] and (ii)
our joint learning method improves the overall performance of the model on
the three sub-tasks.

We publicly release our dataset and model implementations for research or edu-
cational purpose.2 We hope that our dataset and model can serve as a starting
point for future Vietnamese speech processing research and applications.

2 Our Dataset
2.1 Multi-tasking Dataset
Our goal is to develop a comprehensive dataset that can be used to train a
multi-task learning model capable of performing three tasks: command detec-
tion, fake voice recognition, and speaker verification. The model should be able
to differentiate between authentic user speech and different types of distractors,
including synthetic AI speeches, non-command speeches with similar patterns,
and noises from other speakers. To achieve this, we select two widely-used commands in IoT applications, whose English translations are “Turn on the camera” and “Close the door”, and define four categories of speech: A) the exact command, B) conversational speech containing one command, C) speech with no command, and D) speech with words similar to the command that should not be identified as a correct one. Examples for each category are presented in Table 1. The annotated dataset is divided into training, validation, and test sets in a 5/2/3 ratio, ensuring that the distribution of utterance types and gender is well-balanced across all subsets. The dataset statistics are displayed in Table 2.

Table 1. Examples of the four speech categories (English translations of the original Vietnamese utterances)

Type  Example
A     Turn on the camera.
B     He has already come, turn on the camera.
C     It's her scene, prepare the clothes for me.
D     The camera of this phone is so bad.

2.2 Construction Process


Guideline Construction. We initially create a data collection protocol and
record a small sample of audio recordings according to it. After that, the team
reviews the process and identifies any issues, leading to the development of the
final data collection guideline that is used for the remaining dataset.

2
Our dataset and model will be released upon acceptance.

Data Collection. In this stage, 170 Vietnamese participants between 18 and 25 years of age are recorded and labeled. Next, the subjects choose one out of
two commands and prepare 20 transcript sentences, as described in Sect. 2.1.
They prepare five transcripts for each of the four categories. The transcripts
are reviewed and audited by the engineers before the subject records the 20
corresponding audios. The subjects are instructed to make sure that both the
transcripts and the audios sound natural, convey meaningful sentences, and are
common in real-life scenarios.

AI Synthetic Speech Generation. After the data collection phase, we employ HiFiGAN [14] to perform an automatic speech generation task. The process
finally obtains 3240 synthetic data samples from 65 different AI voices.

Revision. We conduct a manual quality check of each audio and its label file
to ensure consistency and remove samples that did not meet the criteria. For
the verification task, we label each performer and their corresponding audio
accordingly. For command detection, audios belonging to groups A and B are
labeled as True, while others are labeled as False. Additionally, all speeches
generated by HiFiGAN are labeled as AI synthesized speech for the fake voice
recognition task. The final Vietnamese dataset contains 6480 audios from 162
subjects and 65 different AI voices.

Table 2. Statistics of our Vietnamese dataset. (s) stands for “seconds” and (t) stands for “tokens”.

Statistic A B C D Total
# audios 810 810 810 810 3240
# subjects 162 162 162 162 162
Minimum length (s) 0.75 1.88 1.54 0.96 0.75
Maximum length (s) 5.63 10.36 10.08 6.76 10.36
Average length (s) 2.38 4.43 4.00 2.79 3.38
Minimum length (t) 3 7 6 3 3
Maximum length (t) 3 37 29 23 37
Average length (t) 3.91 11.64 11.29 6.02 8.22
# AI synthetic audios 810 810 810 810 3240
Total duration (s) 5238 9602 9552 4572 28964

2.3 Discussion

We generate an artificial dataset using a limited set of two specific commands and scripted speech, which may not fully represent the diversity and complexity
of commands found in real-world scenarios. It is essential to acknowledge that
there are no publicly available Vietnamese speech datasets for the three tasks

of command detection, fake voice recognition, and speaker verification. In our


study, we aimed to simulate real-world speech and plan to compare the accuracy
of our scripted speech dataset with genuine real-world speech in future work.

3 The Proposed Method


In this section, we present our proposed end-to-end method, named Vi-LMM,
that simultaneously solves the three aforementioned sub-tasks. Figure 1 illus-
trates the architecture of Vi-LMM, which consists of five main components in-
cluding an audio representation module, a graph module, three task-specific Max
Graph Operation branches, a cross-task attention layer, and a decoding layer.

Fig. 1. The schematic overview of our Vi-LMM method

3.1 Audio Representation


To operate directly upon raw waveform inputs, we utilize a neural network-based encoder that takes raw waveforms as inputs and directly extracts the high-level representation F ∈ R^{C×S×T}, where C is the number of channels, S is the number of spectral bins, and T is the temporal sequence length. For each channel in F, we extract the spectral and the temporal feature vectors by taking the maximum values along the temporal and the spectral axes, respectively. This approach not only reduces the inference time but also enhances the model's lightweight design.
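A minimal PyTorch sketch of this channel-wise max-pooling step is shown below; the tensor shape follows the C × S × T notation above, while the encoder itself and the example dimensions are placeholders.

```python
import torch

def spectral_temporal_features(F):
    """F: (C, S, T) high-level representation produced by the encoder.

    Returns per-channel feature vectors obtained by taking the maximum
    along the temporal axis and along the spectral axis, respectively.
    """
    spectral = F.max(dim=2).values   # (C, S): max over the temporal axis
    temporal = F.max(dim=1).values   # (C, T): max over the spectral axis
    return spectral, temporal

# Example with a dummy representation: 64 channels, 23 spectral bins, 29 frames.
F = torch.randn(64, 23, 29)
spectral_nodes, temporal_nodes = spectral_temporal_features(F)
```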

3.2 Graph Module


Motivated by recent studies that utilized graph neural networks to achieve state-of-the-art (SOTA) performance even with compact models [12, 23, 28], we construct two fully-connected bidirectional graphs G_s and G_t representing the connections among spectral features and among temporal features, respectively, where the features are outputs of the audio representation step. Since G_s and G_t are fully-connected, we can represent both graphs as matrices, i.e., G_* ∈ R^{N_* × D_*}, where the subscript * stands for s or t (the spectral or temporal network), and N_* and D_* are the number of nodes and the dimensionality of each node vector in the corresponding graph, respectively. We note that N_s = N_t = C, D_s = T, and D_t = S. Next, we adopt a graph attention layer [28] to assign a learnable attention weight to each edge, representing the relationship between two nodes.
After calculating the attention weights between nodes, a simple yet effective attentive graph pooling layer is applied to the output of each graph attention layer to reduce the number of nodes in both G_s and G_t, which reduces the computational complexity and improves discrimination between nodes. Finally, the model employs a graph combination to obtain a heterogeneous attention network G_st ∈ R^{N_st × D_st}, where N_st = N_s + N_t, that contains both spectral and temporal information by combining the two separate graphs G_s and G_t.
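For illustration, the sketch below implements a single-head attention layer over a fully connected graph in the spirit of [28] and stacks the spectral and temporal node sets into one heterogeneous graph. It is a simplified stand-in rather than the AASIST-style layers used in Vi-LMM, the attentive pooling step is omitted, and all dimensions are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class FullyConnectedGraphAttention(nn.Module):
    """Single-head attention over a fully connected graph of node vectors."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, nodes):                       # nodes: (N, in_dim)
        h = self.proj(nodes)                        # (N, out_dim)
        n = h.size(0)
        hi = h.unsqueeze(1).expand(n, n, -1)        # representation of node i on edge (i, j)
        hj = h.unsqueeze(0).expand(n, n, -1)        # representation of node j on edge (i, j)
        e = Fn.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1)).squeeze(-1))
        alpha = torch.softmax(e, dim=-1)            # learnable attention weight per edge
        return torch.relu(alpha @ h)                # updated node representations

# Illustrative use: 64 spectral nodes of dim 29 and 64 temporal nodes of dim 23,
# projected to a common dimension and stacked into a heterogeneous graph.
spectral_nodes, temporal_nodes = torch.randn(64, 29), torch.randn(64, 23)
gat_s = FullyConnectedGraphAttention(29, 32)
gat_t = FullyConnectedGraphAttention(23, 32)
Gst = torch.cat([gat_s(spectral_nodes), gat_t(temporal_nodes)], dim=0)   # N_st = 128 nodes
```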

3.3 Task-Specific Max Graph Operation (MGO)


We adopt three MGO branches, originally introduced in [11], for multiple speech processing tasks. Each MGO consists of two modules, namely HS-GAL and feature synthesis. Unlike [11], each of the task-specific MGO branches in our study has two sequential HS-GAL modules.
Taking the heterogeneous graph G_st obtained from the previous step, the HS-GAL module assigns additional attention weights to the edges connecting spectral and temporal feature nodes to learn high-level feature vectors that combine both spectral and temporal information for classification. We denote the feature vectors as h_f, h_c, and h_r ∈ R^{3D_st}, which are the representations of the fake voice recognition, command detection, and speaker verification tasks, respectively.

3.4 Cross-Task Attention


Based on the intuition that a representation of an AI-synthesized voice cannot be accepted as either a genuine speaker or a correct command, we design an additional attention layer to explicitly feed the information from the fake voice detection task to the two remaining tasks, namely command detection and speaker verification. Specifically, the cross-task attention layer takes {h_f, h_c, h_r} as input and produces specific cross-task attention weights that reflect the influence of one task on another. Formally, to incorporate the information from fake voice detection into speaker verification, the layer first creates a cross information-concentrated vector x_fr ∈ R^{3D_st} by multiplying a weight matrix W_fr with the feature vector h_f. Next, we compute the attention between the two tasks using an attention weight λ_fr and concatenate the resulting vector with the original h_r vector as follows:

h'_r = (λ_fr · x_fr) ⊕ h_r,

where ⊕ denotes concatenation. The layer produces h'_c, which integrates useful information from fake voice recognition into the command detection task, in a similar manner. The vectors h'_r, h'_c, and h_f are then passed into fully-connected (FC) layers, where the output of each FC layer is the predicted label for the corresponding task.
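A possible reading of this cross-task attention step is sketched below; treating λ_fr as a learnable scalar and W_fr as a linear map is our interpretation of the description above, not the released implementation, and the dimension is an example value.

```python
import torch
import torch.nn as nn

class CrossTaskAttention(nn.Module):
    """Feeds fake-voice information into the speaker-verification branch."""
    def __init__(self, dim):
        super().__init__()
        self.W_fr = nn.Linear(dim, dim, bias=False)        # weight matrix W_fr
        self.lambda_fr = nn.Parameter(torch.tensor(1.0))   # attention weight λ_fr (assumed scalar)

    def forward(self, h_f, h_r):
        x_fr = self.W_fr(h_f)                              # cross information-concentrated vector
        return torch.cat([self.lambda_fr * x_fr, h_r], dim=-1)   # (λ_fr · x_fr) ⊕ h_r

# The command-detection branch receives h'_c from the fake-voice features in the same way.
dim = 3 * 64                                               # stands in for 3·D_st
h_f, h_r = torch.randn(dim), torch.randn(dim)
h_r_prime = CrossTaskAttention(dim)(h_f, h_r)              # shape: (2 * dim,)
```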

3.5 Joint Learning


All attention weights and learnable matrices in our joint learning model are trained via a joint loss function L, which is the weighted sum of three single-task losses as follows:

L = αL_C + βL_F + (1 − α − β)L_R,   (1)

where L_C, L_F, and L_R are cross-entropy losses computed based on labels from command detection, fake voice recognition, and speaker verification, respectively. The loss coefficients α and β are fine-tuned during training to find the optimal loss function.
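For clarity, the joint objective in Eq. (1) can be written in a few lines of PyTorch; the default values α = 0.3 and β = 0.4 follow the tuned coefficients reported in Sect. 4.2, and the logits and labels are placeholders.

```python
import torch.nn.functional as F

def joint_loss(logits_c, labels_c, logits_f, labels_f, logits_r, labels_r,
               alpha=0.3, beta=0.4):
    """Weighted sum of the three single-task cross-entropy losses, Eq. (1)."""
    loss_c = F.cross_entropy(logits_c, labels_c)   # command detection
    loss_f = F.cross_entropy(logits_f, labels_f)   # fake voice recognition
    loss_r = F.cross_entropy(logits_r, labels_r)   # speaker verification
    return alpha * loss_c + beta * loss_f + (1.0 - alpha - beta) * loss_r
```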

3.6 Vi-LMM Variant

Taking advantage of knowledge distillation [9], we aim to further reduce the size of Vi-LMM whilst retaining the model's performance. To this end, our Vi-LMM acts as the teacher model, and we replace the teacher's encoder with a significantly more lightweight encoding layer to construct a student model, termed Vi-LMM-S. Next, we transfer the teacher's knowledge to the student model using knowledge distillation techniques [9]. The training objective loss for Vi-LMM-S is the weighted sum of the student loss L_ST and the distillation loss L_DI as follows:

L_LW = L_ST + L_DI,

where L_ST can be established similarly to (1) and L_DI is the weighted sum of cross-entropy losses computed based on soft labels from the teacher model and soft predictions from the student model.
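One common way to realize such a distillation objective, following Hinton et al. [9], is sketched below; the temperature and the distillation weight are assumptions, since the paper does not report them.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened labels and the student's
    softened predictions, scaled by T^2 as in standard knowledge distillation."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_preds = F.log_softmax(student_logits / temperature, dim=-1)
    return -(soft_targets * log_soft_preds).sum(dim=-1).mean() * temperature ** 2

def vilmm_s_loss(student_losses, student_logits, teacher_logits, distill_weight=1.0):
    """L_LW = L_ST + L_DI, with L_DI summed over the three task heads (assumed weighting)."""
    l_st = sum(student_losses)                      # e.g. the joint loss from Eq. (1)
    l_di = sum(distillation_loss(s, t) for s, t in zip(student_logits, teacher_logits))
    return l_st + distill_weight * l_di
```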

4 Experiment
We conduct experiments on our dataset to provide a quantitative comparison between Vi-LMM, Vi-LMM-S, and recent strong methods in terms of performance, model size, and inference time.

4.1 Competitive Schemes

We compare our proposed models to five strong baselines across various domains,
including:

– Rawnet2 [24] & Rawnet3 [13]: end-to-end DNN classifiers for raw waveform speaker recognition.
– GFCC-ResNet101 [18]: a recent deep model designed for Vietnamese
speaker authentication problem.
– FastAudio [7]: an end-to-end framework for audio classification problem.
– AASIST [11]: current state-of-the-art model on the ASVspoof 2019 LA
dataset.
– AutoSpeech [6]: a derived CNN architectures for the task of speaker veri-
fication on the VoxCeleb1 dataset.

Note that, our approach does not include large pre-trained models as encoders
e.g. wav2vec2.0 [2], HuBERT [10]. Therefore, models that utilize pre-trained
models are incomparable to our system. Besides, we also perform an ablation
study by removing the cross-task attention layer to create a model termed Vi-LMM-C, i.e., the feature vectors are fed directly into the classifiers after passing through the task-specific MGOs.

Table 3. Results on the test set. “Command Dec.”, “Fake Voice Rec.”, and “Speaker Ver.” denote command detection, fake voice recognition, and speaker verification, respectively. “Avg-EER” and “Avg-F1” denote average EER and average F1, respectively. Here, Vi-LMM-S is the compact variant of Vi-LMM and Vi-LMM-C is Vi-LMM without the cross-task attention layer.

Model            # Parameters  Inference Time  Command Dec.  Fake Voice Rec.   Speaker Ver.      Avg-EER  Avg-F1
                                               F1            EER      F1       EER      F1
Rawnet2          40.14 M       135 ms          90.72         15.27    79.26    19.83    73.18    17.55    81.05
Rawnet3          52.38 M       223 ms          91.82         4.57     90.51    13.82    79.08    9.19     87.14
GFCC-ResNet101   128.4 M       630 ms          95.81         8.36     87.31    15.32    78.54    11.84    87.22
FastAudio        40.2 M        150 ms          91.32         5.26     89.45    14.03    78.92    9.65     85.56
AASIST           41.4 M        174 ms          92.19         4.06     90.72    13.65    79.12    8.86     87.34
AutoSpeech       54 M          267 ms          93.27         7.92     87.65    15.76    78.24    11.84    86.39
Vi-LMM           14 M          64 ms           93.58         4.58     90.45    13.87    79.43    9.22     87.82
Vi-LMM-S         4 M           46 ms           91.82         5.86     88.97    16.52    77.63    11.19    86.14
Vi-LMM-C         14 M          60 ms           93.21         4.87     89.95    14.03    78.96    9.45     87.37

4.2 Experimental Settings

To harness the power of transfer learning, we use a pretrained encoder that has been trained on a large amount of audio data and fine-tune all layers with our dataset. For Vi-LMM, we use a Rawnet2-based encoder [22] to extract the high-level audio representations from raw waveform inputs. For Vi-LMM-S, we replace the Rawnet2-based encoder with MobileNetV2 [19], a network widely used in applications for low-resource devices such as [17, 21], to construct the student model.
To optimize our model's hyper-parameters, we performed a grid search on the validation set with the Adam optimizer. The results showed that a learning rate and weight decay of 10^-5 were best, along with α and β values of 0.3 and 0.4, respectively. Our model was trained for 100 epochs with these hyper-parameters. For evaluation metrics, we adopt the standard F1-score for all three tasks and the equal error rate (EER) for the two security-related tasks, namely fake voice recognition and speaker verification. All reported results are averaged over five experiments with different random seeds.
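For reference, the EER can be computed from verification scores as the operating point where the false acceptance and false rejection rates meet; the NumPy sketch below is illustrative and not the authors' evaluation script.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER from similarity scores and binary labels (1 = genuine, 0 = impostor)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_eer, best_gap = 1.0, np.inf
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)   # false acceptance rate at threshold t
        frr = np.mean(scores[labels == 1] < t)    # false rejection rate at threshold t
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer

# Example: equal_error_rate([0.9, 0.8, 0.4, 0.2], [1, 1, 0, 0]) -> 0.0
```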

4.3 Main Results

Table 3 reports the performances of the chosen baselines and our system. It is
worth noting that each baseline is trained specifically for each task. Thus, in
order to make a fair comparison, the number of parameters of each single model
presented in Table 3 is tripled compared to that of the original study.
In general, our findings indicate that both Vi-LMM and Vi-LMM-S demon-
strate competitive performance compared to other strong baselines, while en-
joying a significantly lower time and space complexity. Notably, Vi-LMM out-
performs all other methods with the highest Average-F1 score of 87.82%. In terms of Average-EER, Vi-LMM is the third-best performer, following AASIST
and Rawnet3. It is noteworthy that Vi-LMM only requires 14 million param-
eters, whereas Rawnet2, which is the second-smallest method, requires 40.14
million parameters.
Our system performs comparably well to other models in terms of individ-
ual task performance. For command detection, Vi-LMM achieves an F1-score of 93.58%, close to that of the highest-performing model GFCC-ResNet101, which has approximately nine times more parameters. For fake voice recognition, Vi-LMM's performance is comparable to that of AASIST, the highest-performing model, in terms of EER and F1-score, despite having significantly fewer parameters. For speaker verification, AASIST has the best EER, but Vi-LMM achieves the highest F1-score of 79.43%, showing the effectiveness of information feeding
from the fake voice detection task.
To reflect the speed advantage of Vi-LMM, we also report the inference time
for each model. It should be noted that other models require three runs to obtain
outputs for a single data sample, whereas our model only requires one. Our
results indicate that Vi-LMM has the fastest inference time, taking only 64ms,
while GFCC-ResNet101 and AASIST take 630ms and 174ms, respectively.

Variants of Vi-LMM. Results from Table 3 indicate that Vi-LMM-S performs comparably to Vi-LMM but with significantly fewer parameters, making it suitable for low-capacity devices. Conversely, removing the cross-task attention layer (Vi-LMM-C) results in reduced performance across all three tasks. Notably, voice command detection sees a 0.3% reduction, while the remaining two tasks experience a 0.5% reduction in F1-score. These findings suggest that the cross-task attention layer plays a vital role in multi-task learning.

5 Conclusions
In this study, we introduced the initial public dataset for Vietnamese speaker
verification, which comprises three sub-tasks: command detection, fake voice
recognition, and speaker verification. In addition, we proposed two simple yet
effective models, Vi-LMM and Vi-LMM-S, for jointly learning the three tasks.
Particularly, Vi-LMM extends AASIST by integrating three task-specific MGO
branches and a cross-task attention layer, while Vi-LMM-S employs knowledge
distillation techniques and has only 4 million parameters. The experimental eval-
uation shows that both models surpass most of the strong methods in terms of
Average-F1 while using significantly fewer parameters. Furthermore, we verified
that joint learning of the three sub-tasks via a cross-task attention layer is ben-
eficial to enhance the performance of all the tasks. We hope that our dataset
and model can serve as a starting point for future Vietnamese speech processing
research and applications.

Acknowledgments. This work was supported by the research project coded DT.
18/24, funded by the Ministry of Information and Communication, 2024.

Bibliography
1. Aravind, P., Nechiyil, U., Paramparambath, N., et al.: Audio spoofing verifica-
tion using deep convolutional neural networks by transfer learning. arXiv preprint
arXiv:2008.03464 (2020)
2. Baevski, A., Zhou, H., Mohamed, A., Auli, M.: wav2vec 2.0: a framework for self-
supervised learning of speech representations. CoRR (2020)
3. Bai, Z., Zhang, X.L.: Speaker recognition based on deep learning: an overview.
Neural Networks (2021)
4. Chen, Z., et al.: Large-scale self-supervised speech representation learning for auto-
matic speaker verification. In: Proceedings of 2022 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP) (2022)
5. Chen, Z., Xie, Z., Zhang, W., Xu, X.: Resnet and model fusion for automatic spoof-
ing detection. In: Proceedings of the 18th Annual Conference of the International
Speech Communication Association (INTERSPEECH), pp. 102–106 (2017)
6. Ding, S., Chen, T., Gong, X., Zha, W., Wang, Z.: Autospeech: neural architecture
search for speaker recognition. In: Proceedings of the 21st Annual Conference of the
International Speech Communication Association (INTERSPEECH), pp. 916–920
(2020)

7. Fu, Q., Teng, Z., White, J., Powell, M., Schmidt, D.C.: Fastaudio: a learnable audio
front-end for spoof speech detection. arXiv preprint arXiv:2109.02774 (2021)
8. Ge, Z., Iyer, A.N., Cheluvaraja, S., Sundaram, R., Ganapathiraju, A.: Neural net-
work based speaker classification and verification systems with enhanced features.
In: Proceedings of 2017 Intelligent Systems Conference (IntelliSys), pp. 1089–1094
(2017)
9. Hinton, G., Vinyals, O., Dean, J., et al.: Distilling the knowledge in a neural net-
work. arXiv preprint arXiv:1503.02531 (2015)
10. Hsu, W., Bolte, B., Tsai, Y.H., Lakhotia, K., Salakhutdinov, R., Mohamed, A.:
Hubert: self-supervised speech representation learning by masked prediction of
hidden units. CoRR (2021)
11. Jung, J., et al.: Aasist: audio anti-spoofing using integrated spectro-temporal graph
attention networks. arXiv preprint arXiv:2110.01200 (2021)
12. Jung, J.W., Heo, H.S., Yu, H.J., Chung, J.S.: Graph attention networks for speaker
verification. In: ICASSP 2021-2021 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp. 6149–6153 (2021)
13. Jung, J.w., Kim, Y.J., Heo, H.S., Lee, B.J., Kwon, Y., Chung, J.S.: Pushing the lim-
its of raw waveform speaker recognition. In: Proceedings of the 23rd Annual Con-
ference of the International Speech Communication Association (INTERSPEECH)
(2022)
14. Kong, J., Kim, J., Bae, J.: Hifi-GAN: generative adversarial networks for efficient
and high fidelity speech synthesis. arXiv preprint arXiv:2010.05646 (2020)
15. van Leeuwen, D.A.: Speaker verification systems and security considerations. In:
Proceedings of 8th European Conference on Speech Communication and Technol-
ogy (Eurospeech), pp. 1661–1664 (2003)
16. Nagrani, A., Chung, J.S., Zisserman, A.: Voxceleb: a large-scale speaker identifica-
tion dataset. In: Proceedings of the 18th Annual Conference of the International
Speech Communication Association (INTERSPEECH), pp. 2616–2620 (2017)
17. Nagrath, P., Jain, R., Madan, A., Arora, R., Kataria, P., Hemanth, J.: Ssdmnv2:
a real time DNN-based face mask detection system using single shot multibox
detector and mobilenetv2. Sustain. Cities Soc. (2021)
18. Nguyen, S.T., Lai, V.D., Dam-Ba, Q., Nguyen-Xuan, A., Pham, C.: Vietnamese
speaker authentication using deep models. In: Proceedings of the International
Symposium on Information and Communication Technology, pp. 177–184 (2018)
19. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: in-
verted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
20. Saquib, Z., Salam, N., Nair, R.P., Pandey, N., Joshi, A.: A survey on automatic
speaker recognition systems. In: Communications in Computer and Information
Science, pp. 134–145 (2010)
21. Sukhavasi, M., Adapa, S.: Music theme recognition using CNN and self-attention.
arXiv preprint arXiv:1911.07041 (2019)
22. Tak, H., weon Jung, J., Patino, J., Kamble, M., Todisco, M., Evans, N.: End-to-end
spectro-temporal graph attention networks for speaker verification anti-spoofing
and speech deepfake detection. In: Proceedings of 2021 Edition of the Automatic
Speaker Verification and Spoofing Countermeasures Challenge (2021)
23. Tak, H., Jung, J.w., Patino, J., Todisco, M., Evans, N.: Graph attention networks
for anti-spoofing. arXiv preprint arXiv:2104.03654 (2021)
24. Tak, H., Patino, J., Todisco, M., Nautsch, A., Evans, N., Larcher, A.: End-to-end
anti-spoofing with rawnet2. In: Proceedings of 2021 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pp. 6369–6373 (2021)

25. Tak, H., Todisco, M., Wang, X., Jung, J.w., Yamagishi, J., Evans, N.: Automatic
speaker verification spoofing and deepfake detection using wav2vec 2.0 and data
augmentation (2022)
26. Todisco, M., et al.: Asvspoof 2019: future horizons in spoofed and fake audio de-
tection. In: Proceedings of the 20th Annual Conference of the International Speech
Communication Association (INTERSPEECH), pp. 1008–1012 (2019)
27. Van, T.P., Quang, N.T.N., Thanh, T.M.: Deep learning approach for singer voice
classification of vietnamese popular music. In: Proceedings of the International
Symposium on Information and Communication Technology, pp. 255–260 (2019)
28. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph
attention networks. In: Proceedings of 6th International Conference on Learning
Representations (2018)
29. Wang, X., Yamagishi, J.: A comparative study on recent neural spoofing coun-
termeasures for synthetic speech detection. In: Proceedings of the 22nd Annual
Conference of the International Speech Communication Association (INTER-
SPEECH), pp. 4259–4263 (2021)
30. Wang, Y., Boumadane, A., Heba, A.: A fine-tuned wav2vec 2.0/hubert benchmark
for speech emotion recognition, speaker verification and spoken language under-
standing. CoRR (2021)
31. Yamagishi, J., et al.: ASVspoof 2021: accelerating progress in spoofed and deepfake
speech detection. In: Proceedings of Edition of ASVspoof, pp. 47–54 (2021)
32. Yang, J., Das, R.K., Zhou, N.: Extraction of octave spectra information for spoof-
ing attack detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2373–2384
(2019)
33. Zhang, Z., Gu, Y., Yi, X., Zhao, X.: FMFCC-a: a challenging mandarin dataset
for synthetic speech detection. arXiv preprint arXiv:2110.09441 (2021)
Domain Generalization in Vietnamese
Dependency Parsing: A Novel Benchmark
and Domain Gap Analysis

Vinh-Hien D. Huynh1,2(B) , Chau-Anh Le1,2 , Chau M. Truong1,2 ,


Y. Thien Huynh1,2 , and Quy T. Nguyen1,2
1
University of Information Technology, Ho Chi Minh City, Vietnam
{21520029,21521821}@gm.uit.edu.vn, {chautm,yht,quynt}@uit.edu.vn
2
Vietnam National University, Ho Chi Minh City, Vietnam

Abstract. Dependency parsing has received significant attention from the research community due to its recognized applications across diverse
areas of natural language processing (NLP). However, the majority
of dependency parsing studies to date have not addressed the out-of-domain problem, where the data in the testing phase follow a different distribution from the data in the training domains, despite this being a common problem in practice. Furthermore, Vietnamese is
still considered a low-resource language in parsing tasks, as most stan-
dard treebanks are primarily developed for more widely spoken lan-
guages such as English and Chinese. This shortage makes research on the Vietnamese dependency parsing task even more difficult. To advance research on domain generalization in the Vietnamese dependency parsing task, this paper introduces a new treebank called DGDT (Vietnamese Domain Generalization Dependency Treebank), where the domains in the train/dev/test sets are completely separated. This distinguishes our treebank from other Vietnamese dependency treebanks.
We also release DGDTMark, a cross-domain Vietnamese dependency
parsing benchmark suite using our treebank to assess the generalization
ability of parsers over domains. Moreover, our suite can support further
research in analyzing the impacts of domain gaps on the dependency
parsing task. Through experiments, we observe that the performance
of parsers is most affected by two gaps: newspaper topics and writing
styles. Besides, the performance drops remarkably by 3.27% UAS and
5.09% LAS in the scenario with the largest domain gap, which proves
that our treebank poses a significant challenge for further research.

Keywords: Dependency parsing · Vietnamese treebank · Domain generalization · Out-of-domain problem · Domain gap

1 Introduction

Dependency parsing is an NLP task focused on analyzing the grammatical structure of sentences, specifically the relationships between words. In this task, each

V. D. Huynh, C. A. Le—Equal contributions.


c The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025
W. Buntine et al. (Eds.): SOICT 2024, CCIS 2350, pp. 167–181, 2025.
https://doi.org/10.1007/978-981-96-4282-3_14

sentence is transformed into a dependency tree, consisting of directed binary grammatical relations. For instance, a dependency tree for the sentence “Tôi mua bánh mì.” (“I buy bread.”) is illustrated in Fig. 1. In this tree, the word mua (buy) is designated as the root node, serving as the semantic core of the entire structure. An example of a dependency relation is the link from the word mua (head) to Tôi (dependent), with the respective label NSUBJ. This indicates that the noun Tôi (I) functions as the subject of mua. In addition, the relation from mua to bánh mì also shows that bánh mì (bread) is the object of the action mua.

Fig. 1. A dependency tree following the NIIVTB DT-1 treebank [21] format

Dependency parsing plays a crucial role in the field of NLP research because
it provides syntactic information about language. A high-quality dependency
parsing system can be deployed to improve the performance of various down-
stream tasks, such as information extraction [6], name entity recognition [8,24],
question answering [3], machine translation [2,22], text summarization [23] and
multi-task learning [4].
With the advent of deep neural networks, current dependency parsing models
have achieved significantly high performance. For example, the Biaffine model
[9] reached 94.10% LAS1 on Penn Treebank. However, parsing models have still
faced difficulties when there is a distribution difference between training and
evaluation data, known as the domain gap. Studies by Blodgett et al. on African-
American English [1] and Kanerva et al. [11] on Finnish have demonstrated that
the performance of parsing systems dropped remarkably in domain generalization setups, in which the worst decrease was 24.90% LAS when the parser was evaluated on the clinical domain [11].
Additionally, we have found that there is a limited number of studies on Viet-
namese dependency parsing. To our knowledge, there are no published studies
examining the domain gap challenge in this task. Furthermore, available depen-
dency treebanks in Vietnamese are not designed to evaluate the effects of the
domain gap despite this being a common issue with real-world data. As a result,
it becomes difficult to comprehensively assess the model's performance across diverse domains.
To accommodate further research on Vietnamese cross-domain depen-
dency parsing, we introduce DGDT (Vietnamese Domain Generalization
Dependency Treebank), a multi-domain Vietnamese dependency treebank.
Moreover, we also released DGDTMark, a benchmark suite to evaluate a
dependency parser on different scenarios, both with and without the domain
gap, in several settings of domain generalization on our novel treebank. In our
1
There are two standard evaluation metrics in dependency parsing: UAS (unlabeled
attachment score) and LAS (labeled attachment score).

benchmark suite, we also combine our treebank and released datasets to extend
the domain gap to demonstrate its effects on Vietnamese dependency parsing
comprehensively.

2 Related Works
2.1 Data Arrangement Methods

In domain-generalization dependency parsing, data can be organized in various ways depending on how domains are defined, such as by topic, writing style, or formality. Additionally, to facilitate the evaluation phase, both training and testing data need to follow the same annotation guideline. For example, Kanerva
et al. [11] used two Finnish datasets within the Universal Dependency framework
to evaluate how parsers handle out-of-domain data. They also published a third
treebank containing more challenging data from poetry and clinical reports to
assess the model’s performance. In addition to this approach, Blodgett et al. [1]
and Li et al. [13] further expanded the domain gap by building a testing dataset
with sentences from tweets, product blogs, and web fictions.

2.2 Vietnamese Dependency Treebanks

There is a limited number of dependency treebanks in Vietnamese. One of them is the BKTreebank [17], sourced from the Dan Tri2 newspaper and manually annotated, which contains 6,909 sentences, each with a maximum length of 50 words. Another notable treebank, VnDT [15], was automatically converted from
the VTB constituency treebank [18], with raw texts extracted from Tuoi Tre 3
newspaper. Besides, the UD Vietnamese-VTB treebank4 , another conversion of
VTB, comprises 58,069 tokens in 3,323 sentences. However, these Vietnamese
dependency treebanks lack domain separation and follow different annotation
guidelines, making them unsuitable for cross-domain tasks.

2.3 Cross-Domain Dependency Parsing Models

Regarding domain generalization in dependency parsing, to the best of our knowledge, we could not find any models designed specifically for this task. In domain-
generalization experiments, current cutting-edge dependency parsers are usually
utilized as baselines [1,11].
On the other hand, most studies addressing cross-domain dependency parsing
rely on domain adaptation approaches, which focus on improving model perfor-
mance by leveraging a limited amount of data from testing domains for the
training phase. For instance, Sato et al. [20] and Li et al. [12] proposed solu-
tions following the same philosophy: utilizing both separated and shared components
2
https://dantri.com.vn.
3
https://tuoitre.vn.
4
https://github.com/UniversalDependencies/UD_Vietnamese-VTB.

(LSTMs, MLPs layers) between different domains in their model architectures.


Additionally, the study by Li et al. [13] released a semi-supervised approach by
training the task simultaneously with the unsupervised domain embedding task.

3 Our Treebank
Typically, there are several approaches to creating a dependency treebank: man-
ually annotating the dependency trees, using an automatic parser on raw text
followed by manual adjustment (semi-automatic parsing), or transforming an
existing constituency treebank to a dependency treebank.
Due to the high cost of manually annotating a dependency treebank, we
decided to use an automatic converter to transform an existing constituency
treebank into our dependency treebank. After carefully reviewing available con-
verters for Vietnamese, we select the converter released by Truong et al. [21]. This
converter employs a new dependency label set with novel Vietnamese-specific
dependency labels, making it more effective at capturing Vietnamese linguis-
tic characteristics. For example, previous works [15,17] did not propose specific
dependency class for sino-vietnamese words or classifier nouns, which are com-
monly seen in Vietnamese text. Moreover, Truong et al.’s approach builds the
dependency relations based on both syntactic and semantic features rather than
relying solely on functional tags as seen in the VnDT treebank [15].
With the aim of constructing a multi-domain treebank, we choose the second subset (NIIVTB-2) of the NIIVTB constituency treebank [19] because this subset is organized into 14 distinct topics, crawled from the Thanh Nien5 newspaper. Since the authors of this dataset withhold the raw text due to copyright, we had to collect it from newspaper sites and match it with the corresponding annotations. This process encountered several challenges, such as defunct hyperlinks and inconsistency in the number of words between the raw text and the constituency treebank, among others. Additionally, we found some duplication within the dataset, which needed to be removed to guarantee data reliability.
Consequently, our treebank contains 9,765 sentences.
Converter Validation: To ensure the quality of our treebank, we build a gold
dataset of 1,245 sentences in Law and Life of youth topics of NIIVTB-2 by
manually correcting the automatically-parsed trees from the converter. Then,
we compare the gold dataset with the initially parsed trees from the converter and obtain an accuracy of 95.62% UAS and 89.49% LAS. These results are sufficiently high to guarantee our treebank's quality (Fig. 2).
In NIIVTB-2, every topic is present in all three sets: train, dev, and test.
This setup makes it challenging to analyze the impact of the domain gap on
the dependency parsing task. To address this issue and support the domain
generalization task, in the initial setup of DGDT, we decided to treat each topic
as a domain and rearrange the dataset so that each domain appears exclusively
in one of the train, dev, or test sets. Moreover, we carry out the domain allocation step not only to maintain a suitable sentence ratio across these sets but also
5
https://thanhnien.vn.

Fig. 2. A dependency tree in the DGDT dataset

to emphasize the difficulty of the test set. Table 1 describes the structure of our
treebank, which totally contains 9,765 dependency trees (245,006 tokens), across
14 domains along with individual statistics information for each domain included
in the treebank.
Besides, as shown in Table 2, we can observe that the distributions over labels
in DGDT are imbalanced. In detail, labels like PUNCT (punctuation), NN (noun
compound modifier), OBJ (object), and PREP (prepositional modifier) appear
more frequently, while labels related to expressions such as VOCATIVE, SOUND
are found only in a small minority of cases in our treebank. This imbalance is
understandable because almost all input sentences end with punctuation, and
sentences may contain reported speech (using colons and quotation marks) or
include extra information within parentheses. Additionally, NN, OBJ, and PREP
are also common components in natural language and frequently appear in text.
Meanwhile, words expressing sound are relatively rare in general, particularly in
our treebank, where the content of its domains does not relate much to sound.
Moreover, both sound expressions and vocatives are more commonly found in
spoken language than in newspaper text. Although an imbalanced data distri-
bution is not ideal in machine learning, it is often unavoidable when collecting
real-world data.
In comparison, our dataset has some key differences. Although our dataset is only behind the VnDT treebank in terms of sentence count, it has a larger total number of tokens than all the aforementioned treebanks. Furthermore, our treebank also contains a large volume of long sentences, with 1,087 sentences in DGDT each having more than 40 tokens. This quantity is 36% more than VnDT and 1,312% more than UD Vietnamese-VTB. Hence, this dataset can effectively
examine how models handle long-distance dependency relations, which pose a
challenge for parsers. Further distinctions including sentence distribution by
length, number of domains, and number of tokens between DGDT and other
Vietnamese dependency treebanks are shown in Table 3 and Fig. 3.

4 Experiments
4.1 Baseline Model

Introduced in 2016, the biaffine parser [9] quickly became a highly influential model
in the field of dependency parsing. Many of today's state-of-the-art dependency
parsing models are based on the biaffine approach, with customizations involving
the encoder, the number of MLP layers, and additional linguistic features. Due to
its ease of deployment and high performance, we select this model as the baseline
for our experiments on DGDT. Specifically, we adopt

Table 1. Domain-specific statistics of DGDT

No.  Set    Domain                   No. of sentences   No. of tokens
1    Train  Education                844                21,068
2    Train  Health                   725                16,045
3    Train  Law                      610                17,188
4    Train  Life of youth            635                16,396
5    Train  Military                 690                16,949
6    Train  Politics Society         712                20,245
7    Train  Science                  692                16,964
8    Train  Sports                   697                16,780
9    Train  Travel                   540                13,397
10   Train  World                    645                16,393
     Train  Set summary              6,790              171,425
11   Dev    Entertainment            708                17,463
12   Dev    Information Technology   714                18,374
     Dev    Set summary              1,422              35,837
13   Test   Economic                 725                18,207
14   Test   Life                     828                19,537
     Test   Set summary              1,553              37,744
            Entire dataset           9,765              245,006

the biaffine parser implementation released by Zhang et al. [25]
(https://github.com/yzhangcs/parser/), which includes several customizations
compared to the original model architecture.
Biaffine Parser. The original architecture of the biaffine parser model is com-
posed of four key components:

– Input layer: This model receives embeddings e_1, e_2, ..., e_n as input.
– Encoder layer: Three BiLSTM layers are applied to generate context-aware
  word representations h_1, h_2, ..., h_n.
– MLP layer: Two independent MLPs are used to obtain lower-dimensional vectors
  for each word in its role as a head and as a dependent, respectively:

    r_i^H = MLP^H(h_i),   r_i^D = MLP^D(h_i)                          (1)

– Biaffine layer: The score of a dependency i → j with label k is computed via
  biaffine attention:

    s(i, j) = [r_j^D; 1]^T W_biaffine r_i^H                           (2)

    s(i, j, k) = [r_j^D; 1]^T W_biaffine^k r_i^H                      (3)

Lastly, the non-projective MST algorithm is applied to search for the highest-
scoring dependency tree given the obtained arc scores.
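To make the biaffine scoring in Eqs. (1)-(3) concrete, the sketch below computes an arc-score matrix for one sentence using PyTorch. It is an illustration under our own choices of dimensions, names, and initialization, not the Supar implementation itself.

import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Minimal biaffine arc scorer: s(i, j) = [r_j^D; 1]^T W r_i^H."""

    def __init__(self, hidden_dim=400, arc_dim=500):
        super().__init__()
        self.mlp_head = nn.Sequential(nn.Linear(hidden_dim, arc_dim), nn.ReLU())
        self.mlp_dep = nn.Sequential(nn.Linear(hidden_dim, arc_dim), nn.ReLU())
        # The weight consumes the dependent vector augmented with a bias dimension.
        self.W = nn.Parameter(torch.empty(arc_dim + 1, arc_dim))
        nn.init.xavier_uniform_(self.W)

    def forward(self, h):                         # h: (n, hidden_dim) encoder states
        r_head = self.mlp_head(h)                 # r_i^H, shape (n, arc_dim)
        r_dep = self.mlp_dep(h)                   # r_i^D, shape (n, arc_dim)
        ones = torch.ones(r_dep.size(0), 1)
        r_dep = torch.cat([r_dep, ones], dim=-1)  # append the constant 1
        # scores[j, i] = [r_j^D; 1]^T W r_i^H (row = dependent, column = head)
        return r_dep @ self.W @ r_head.t()

h = torch.randn(7, 400)                           # context-aware states for 7 tokens
arc_scores = BiaffineArcScorer()(h)               # (7, 7) score matrix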
The implementation of Zhang et al. [25] (Supar) customizes the biaffine parser by
replacing POS tag embeddings with CharLSTM word representation vectors.

Table 2. Distribution of labels in DGDT

Label type No. of occurrences Ratio (%)


PUNCT 31,349 12.80
NN 23,822 9.72
OBJ 18,889 7.71
PREP 18,487 7.55
VMOD 17,221 7.03
POBJ 16,568 6.76
NSUBJ 15,185 6.20
ADJUNCT 13,854 5.65
CONJ 13,488 5.51
MARK 1,195 0.49
INTJ 1,006 0.41
DEP 962 0.39
ADJP ADVMOD 511 0.21
NUMBER 471 0.19
ADVCL 426 0.17
QUANTMOD 267 0.11
VSUBJ 261 0.11
CSUBJ 185 0.08
APPOS 124 0.05
IOBJ 117 0.05
ASUBJ 78 0.03
VOCATIVE 35 0.01
SOUND 5 0.00
Only label types with frequency more than 5%
or less than 0.5% are shown.

Table 3. Comparison between DGDT and available Vietnamese dependency treebanks

Treebank No. of labels No. of domains No. of sentences No. of tokens Manual annotation
BKTreebank 26 unknown 6,909 unknown ✓
UD-VTB 84 1 3,323 58,069 ✗
VnDT 33 1 10,197 218,749 ✗
DGDT (ours) 40 14 9,765 245,006 ✗

In addition, the first-order Eisner algorithm [10] is used instead of the non-projective
MST algorithm.
With the main goal of handling the task in Vietnamese, we replace the
BiLSTMs in the encoder layer with one of two options: PhoBERT [16] or XLM-
RoBERTa [5] (XLM-R). PhoBERT is a pre-trained language model for Viet-
namese, based on RoBERTa [14], an advanced version of BERT [7] with some
modifications in pre-training procedure. XLM-R, on the other hand, is a multilin-
gual masked language model pre-trained on text in 100 languages, also inherited
from RoBERTa.

4.2 Experimental Setups


To evaluate the effects of the domain gap in our dataset, we release a benchmark
suite for Vietnamese cross-domain dependency parsing, called DGDTMark.

Fig. 3. Distribution of sentences in Vietnamese dependency treebanks by length

This suite includes experiments on Vietnamese dependency parsing via four scenarios:

– Scenario 1 (in-domain): We split each domain into three subsets, train, dev,
  and test, with an 8:1:1 ratio, then merge the corresponding parts from different
  domains to create the overall train, dev, and test sets. While this setup is
  commonly used in most dependency parsing experiments, it is not specifically
  designed to expose the effects of the domain gap. We implement it to evaluate
  the model's performance in the absence of a domain gap (a small data-splitting
  sketch follows this list).
– Scenario 2 (domain-k-fold): To examine how each domain in DGDT affects
  the parser differently from the others, we adopt a k-fold evaluation. Each of the
  14 domains is used in turn to assess the model, while the rest of the treebank is
  merged into the train set. The average result over the folds is the overall
  performance of the parser.
– Scenario 3 (domain-generalization): We assign each domain to appear exclu-
sively in one of the train, dev, or test sets to observe how the model handles
  the effects of the domain gap. In this setup, we use the Entertainment and
  Information Technology domains to construct the dev set and merge the Economic
  and Life domains to form the test set, leaving the other domains to form the
  train set, as in Table 1.
– Scenario 4 (dataset-generalization): We hypothesize that the domain gap can
  be demonstrated more clearly by using training data from a different source than
  the testing data. Hence, we set up this experiment as follows: we use the train
  set of the NIIVTB DT-1 dependency treebank [21] for model training, then use
  the dev and test sets of DGDT for model selection and evaluation, respectively.
  We chose NIIVTB DT-1 because it follows the same annotation guideline as our
  dataset, which makes the evaluation possible. Moreover, the data for that treebank
  was derived from the Tuoi Tre newspaper, while our treebank is built on Thanh
  Nien, which satisfies the requirement of data source separation. The difference
  in data source may result in variations in writing style. Besides, the data for
  NIIVTB DT-1 was published in the early 2000s, whereas DGDT's data dates from
  the beginning of the 2010s. As vocabulary expands continuously, we believe that
  the passage of time can cause shifts in both writing style and language diversity.
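The sketch below contrasts the in-domain split of Scenario 1 with the domain-generalization split of Scenario 3, assuming each sentence record carries a domain tag. The names (sentences, DOMAIN_SPLIT) are illustrative and not part of the released benchmark code.

import random
from collections import defaultdict

# Domain assignment used for the domain-generalization scenario (cf. Table 1).
DOMAIN_SPLIT = {
    "dev": {"Entertainment", "Information Technology"},
    "test": {"Economic", "Life"},
}

def in_domain_split(sentences, seed=0):
    """Scenario 1: split every domain 8:1:1, then merge across domains."""
    by_domain = defaultdict(list)
    for sent in sentences:                     # sent = {"domain": ..., "tree": ...}
        by_domain[sent["domain"]].append(sent)
    train, dev, test = [], [], []
    rng = random.Random(seed)
    for sents in by_domain.values():
        rng.shuffle(sents)
        n = len(sents)
        train += sents[: int(0.8 * n)]
        dev += sents[int(0.8 * n): int(0.9 * n)]
        test += sents[int(0.9 * n):]
    return train, dev, test

def domain_generalization_split(sentences):
    """Scenario 3: each domain appears in exactly one of train/dev/test."""
    train, dev, test = [], [], []
    for sent in sentences:
        if sent["domain"] in DOMAIN_SPLIT["test"]:
            test.append(sent)
        elif sent["domain"] in DOMAIN_SPLIT["dev"]:
            dev.append(sent)
        else:
            train.append(sent)
    return train, dev, test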

4.3 Results
Table 4. Results in the DGDTMark benchmark suite via different scenarios

Model            Metric   in-domain   domain-k-fold   domain-generalization   dataset-generalization
Supar + PhoBERT  UAS      91.62       91.22           88.95                   88.54
                 LAS      87.83       87.01           84.76                   82.74
Supar + XLM-R    UAS      90.25       89.72           88.26                   86.98
                 LAS      85.89       84.99           82.88                   80.97

As shown in Table 4, the Supar model performs considerably better when using
PhoBERT as the encoder compared to XLM-R. This can be attributed to PhoBERT's
use of word-level data [16], while XLM-R relies on syllable-level data, as in the
original architecture of BERT [7], making it less effective than PhoBERT at capturing
the characteristics of the Vietnamese language. As a result, we use the
Supar+PhoBERT model to examine our treebank more deeply in the following analyses.
Firstly, it is evident that the model performs best in the scenario where no domain
gap is present (the in-domain scenario). Meanwhile, performance decreases significantly,
by 2.67% UAS and 3.07% LAS, when the training domains differ from the development
and testing domains (the domain-generalization scenario). Clearly, the difference in
data distribution between the train, dev, and test sets makes the task more challenging
for the model.
In the domain-k-fold scenario, we observe fluctuations in parser performance across
domains. The margin between the best and worst outcomes is 3.39% UAS and 4.79% LAS,
between the World and Life domains, as shown in Table 5, indicating that the level of
difficulty is not the same across domains. From our perspective, this dissimilarity
arises because high-performing topics like World and Science are written formally and
are therefore easier to parse. In contrast, the Life domain consists of articles on
social drama and life stories, which are informal and shaped by the writing style of
individual authors.
In the dataset-generalization scenario, we widen the domain gap by training the model
with data from another treebank; performance drops further, by 3.08% UAS and 5.09% LAS
compared to the results in the in-domain scenario. This outcome is not unexpected, as
there are significant gaps between the train, dev, and test sets, not only in newspaper
topics but also in writing style and the timing of data publication. Consequently, the
results show that the greater the domain gap, the more complexity the model needs to handle.

Table 5. Detail result of domain-k-fold scenario

Dev/Test Domain UAS LAS


Economic 90.53 85.82
Education 90.97 86.74
Entertainment 90.52 85.94
Health 91.18 86.55
Information Technology 90.36 85.89
Law 91.58 87.45
Life 89.50 84.82
Life of youth 91.12 86.56
Military 92.29 88.82
Politics Society 91.16 86.83
Science 92.00 88.85
Sports 91.84 87.16
Travel 91.17 87.03
World 92.89 89.61
Average 91.22 87.01

4.4 Error Analysis

To better evaluate how the model performs in our main concern, the domain-generalization
scenario, we further analyze the results by label type and sentence length. From Table 6,
we observe that the parser performs remarkably well on popular cases, such as labels
related to subjects, objects, and modifiers, and on unique cases such as punctuation,
numbers, and root identification. In contrast, difficult labels including CCOMP (clausal
complement), PARATAXIS, and CONJ (conjunct) have low accuracy because they represent a
connection between clauses, or from a token to its complement clause, which requires the
model to choose not only a suitable dependent but also the correct main word of the
dependent clause. Moreover, clause-linking relations usually do not depend on explicit
word forms, which makes the task more difficult. On the other hand, ambiguous cases,
for example NN and NP ADVMOD (noun phrase as adverbial modifier), which both represent
noun modifiers but with different semantic roles, cause a dramatic decrease (by 24.82%
LAS) in parsing performance. This statistic demonstrates that there is still room for
improvement in capturing deeper linguistic meaning.
We also find that the parser's effectiveness is strongly influenced by the distance
of relations. The results shown in Fig. 4 indicate that the more distant a relation,
the worse the parser performs in selecting the correct head for a word. From our
perspective, this is because short-distance relations are far more frequent, which
biases the model. However, performance on the LAS metric rises remarkably, from 50%
to nearly 90%, when handling long-distance relations. We attribute this increase to
the limited set of label options for long-distance relations, because

Table 6. Benchmark results by each label in the domain-generalization scenario

Label type No. of occurrences UAS LAS


PUNCT 5136 83.94 83.92
NN 3570 92.83 88.01
OBJ 3080 94.42 90.06
VMOD 2637 91.13 81.53
PREP 2570 85.37 83.66
NSUBJ 2340 90.64 89.53
POBJ 2216 97.70 95.35
ADJUNCT 2185 93.00 90.30
CONJ 2169 78.01 74.09
ROOT 1553 90.86 90.86
CC 1250 82.56 81.84
AMOD 1132 89.66 85.42
DET 1062 96.23 95.10
NUM 947 97.36 96.41
CCOMP 746 84.05 77.48
ACOMP 660 88.64 83.94
PARATAXIS 619 70.76 63.81
RCMOD 589 91.00 79.80
NP ADVMOD 527 86.72 63.19
PCOMP 481 86.28 78.59
Only label types with more than 250 occurrences are shown.

Fig. 4. Performance of parser by distance of relations

long-distance relations usually belong to dependency classes related to punctuation
or clause linking. This restriction makes it easier for the model to label such
relations correctly.

4.5 Effects of Testing-Domain Data on the Model


We are curious whether the model's performance can be improved by providing a
small amount of data from the testing domains. To investigate this, we first train
the model under the domain-generalization scenario and then feed

it with 50 to 250 sentences (in steps of 50 sentences) from the test set and evaluate
its performance on the remaining portion of the test set. We stop at 250 sentences to
avoid overshadowing the domain-gap setup.

Fig. 5. Effects of using test-domain data on model outcome

As shown in Fig. 5, allocating sentences from the test set to the train set enhances
the model's performance. Moving 250 sentences increases UAS by approximately 1%, while
the parser shows only a slight improvement on the LAS metric. From our perspective,
the improvement is modest because the model receives only a small amount of knowledge
from the test-set domains. Although domain adaptation can address the domain gap by
reducing the distribution differences between the source and target domains, it relies
on the assumption that target data is accessible, which is not always the case in practice.

5 Conclusion

In this research, we release DGDT, a treebank designed to support domain generalization
for the Vietnamese dependency parsing task. Compared to other available treebanks, DGDT
not only excels in domain variety but also offers key advantages such as a reliable data
source and a clear separation between domains. In addition, we introduce the DGDTMark
benchmark suite, in which we arrange our treebank into different domain generalization
scenarios to gain deeper insight into the impact of the domain gap. The experimental
results indicate that the model's performance decreases significantly when a domain gap
is present, with the decline worsening as the difference in the data increases. In the
most difficult scenario, performance drops by 3.27% UAS and 5.09% LAS, demonstrating
that the domain gap remains a challenging problem for current cutting-edge parsing systems.

References
1. Blodgett, S.L., Wei, J., O’Connor, B.: Twitter universal dependency parsing for
African-American and mainstream American English. In: Gurevych, I., Miyao,
Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for Com-
putational Linguistics (Volume 1: Long Papers), pp. 1415–1425. Association for
Computational Linguistics, Melbourne, Australia, July 2018. https://doi.org/10.
18653/v1/P18-1131, https://aclanthology.org/P18-1131
2. Bugliarello, E., Okazaki, N.: Enhancing machine translation with dependency-
aware self-attention. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.)
Proceedings of the 58th Annual Meeting of the Association for Computa-
tional Linguistics, pp. 1618–1627. Association for Computational Linguistics,
July 2020. https://doi.org/10.18653/v1/2020.acl-main.147, https://aclanthology.
org/2020.acl-main.147
3. Chen, C., Bunescu, R., Marling, C.: A semantic parsing pipeline for context-
dependent question answering over temporally structured data. Nat. Lang. Eng.
29(3), 769–793 (2023). https://doi.org/10.1017/S1351324921000292
4. Clark, K., Luong, M.T., Manning, C.D., Le, Q.: Semi-supervised sequence mod-
eling with cross-view training. In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii,
J. (eds.) Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing, pp. 1914–1925. Association for Computational Linguistics,
Brussels, Belgium, Oct-Nov 2018. https://doi.org/10.18653/v1/D18-1217, https://
aclanthology.org/D18-1217
5. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In:
Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451.
Association for Computational Linguistics, July 2020. https://doi.org/10.18653/
v1/2020.acl-main.747, https://aclanthology.org/2020.acl-main.747
6. Culotta, A., Sorensen, J.: Dependency tree kernels for relation extraction. In: Pro-
ceedings of the 42nd Annual Meeting of the Association for Computational Lin-
guistics (ACL-04), pp. 423–429. Barcelona, Spain, July 2004. https://doi.org/10.
3115/1218955.1219009, https://aclanthology.org/P04-1054
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep
bidirectional transformers for language understanding. In: Burstein, J., Doran,
C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North Ameri-
can Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for
Computational Linguistics, Minneapolis, Minnesota, June 2019. https://doi.org/
10.18653/v1/N19-1423, https://aclanthology.org/N19-1423
8. Dou, C., Sun, X., Wang, Y., Ji, Y., Ma, B., Li, X.: Domain-adapted dependency
parsing for cross-domain named entity recognition. In: Proceedings of the Thirty-
Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference
on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on
Educational Advances in Artificial Intelligence. AAAI’23/IAAI’23/EAAI’23, AAAI
Press (2023). https://doi.org/10.1609/aaai.v37i11.26498, https://doi.org/10.1609/
aaai.v37i11.26498
9. Dozat, T., Manning, C.D.: Deep biaffine attention for neural dependency parsing
(2017). https://arxiv.org/abs/1611.01734
10. Eisner, J.: Bilexical grammars and their cubic-time parsing algorithms, pp. 29–
61. Springer Netherlands, Dordrecht (2000). https://doi.org/10.1007/978-94-015-
9470-7 3

11. Kanerva, J., Ginter, F.: Out-of-domain evaluation of Finnish dependency parsing.
In: Calzolari, N., et al. (eds.) Proceedings of the Thirteenth Language Resources
and Evaluation Conference, pp. 1114–1124. European Language Resources Asso-
ciation, Marseille, France, June 2022. https://aclanthology.org/2022.lrec-1.120
12. Li, Y., Li, Z., Zhang, M.: Semi-supervised domain adaptation for dependency pars-
ing via improved contextualized word representations. In: Scott, D., Bel, N., Zong,
C. (eds.) Proceedings of the 28th International Conference on Computational Lin-
guistics, pp. 3806–3817. International Committee on Computational Linguistics,
Barcelona, Spain, December 2020. https://doi.org/10.18653/v1/2020.coling-main.
338, https://aclanthology.org/2020.coling-main.338
13. Li, Z., Peng, X., Zhang, M., Wang, R., Si, L.: Semi-supervised domain adaptation
for dependency parsing. In: Korhonen, A., Traum, D., Màrquez, L. (eds.) Proceed-
ings of the 57th Annual Meeting of the Association for Computational Linguistics,
pp. 2386–2395. Association for Computational Linguistics, Florence, Italy, July
2019. https://doi.org/10.18653/v1/P19-1229, https://aclanthology.org/P19-1229
14. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach (2019).
https://arxiv.org/abs/1907.11692
15. Nguyen, D.Q., Nguyen, D.Q., Pham, S.B., Nguyen, P.T., Nguyen, M.L.: From
treebank conversion to automatic dependency parsing for Vietnamese. In: Interna-
tional Conference on Applications of Natural Language to Data Bases/Information
Systems, pp. 196–207. Springer (2014)
16. Nguyen, D.Q., Tuan Nguyen, A.: PhoBERT: pre-trained language models for Viet-
namese. In: Cohn, T., He, Y., Liu, Y. (eds.) Findings of the Association for Compu-
tational Linguistics: EMNLP 2020, pp. 1037–1042. Association for Computational
Linguistics, November 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.92,
https://aclanthology.org/2020.findings-emnlp.92
17. Nguyen, K.H.: BKTreebank: building a Vietnamese dependency treebank. In: Cal-
zolari, N., et al. (eds.) Proceedings of the Eleventh International Conference on
Language Resources and Evaluation (LREC 2018). European Language Resources
Association (ELRA), Miyazaki, Japan, May 2018. https://aclanthology.org/L18-
1341
18. Nguyen, P.T., Vu, X.L., Nguyen, T.M.H., Nguyen, V.H., Le, H.P.: Building a large
syntactically-annotated corpus of Vietnamese. In: Stede, M., Huang, C.R., Ide, N.,
Meyers, A. (eds.) Proceedings of the Third Linguistic Annotation Workshop (LAW
III), pp. 182–185. Association for Computational Linguistics, Suntec, Singapore,
August 2009. https://aclanthology.org/W09-3035
19. Nguyen, Q.T., Miyao, Y., Le, H., Nguyen, N.: Ensuring annotation consistency
and accuracy for Vietnamese treebank. Lang. Resour. Eval. 52(1), 269–315 (2017).
https://doi.org/10.1007/s10579-017-9398-3
20. Sato, M., Manabe, H., Noji, H., Matsumoto, Y.: Adversarial training for cross-
domain universal dependency parsing. In: Hajič, J., Zeman, D. (eds.) Proceedings
of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Uni-
versal Dependencies, pp. 71–79. Association for Computational Linguistics, Van-
couver, Canada, August 2017. https://doi.org/10.18653/v1/K17-3007, https://
aclanthology.org/K17-3007
21. Truong, C.M., Pham, T.V., Phan, M.N., Le, N.D.T., Nguyen, T.V., Nguyen, Q.T.:
Converting a constituency treebank to dependency treebank for Vietnamese. In:
2022 RIVF International Conference on Computing and Communication Tech-
nologies (RIVF), pp. 256–261 (2022). https://doi.org/10.1109/RIVF55975.2022.
10013806

22. Xu, P., Kang, J., Ringgaard, M., Och, F.: Using a dependency parser to improve
SMT for subject-object-verb languages. In: Ostendorf, M., Collins, M., Narayanan,
S., Oard, D.W., Vanderwende, L. (eds.) Proceedings of Human Language Technolo-
gies: The 2009 Annual Conference of the North American Chapter of the Associ-
ation for Computational Linguistics. pp. 245–253. Association for Computational
Linguistics, Boulder, Colorado, June 2009. https://aclanthology.org/N09-1028
23. Yoshida, Y., Suzuki, J., Hirao, T., Nagata, M.: Dependency-based discourse parser
for single-document summarization. In: Moschitti, A., Pang, B., Daelemans, W.
(eds.) Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pp. 1834–1839. Association for Computational
Linguistics, Doha, Qatar, October 2014. https://doi.org/10.3115/v1/D14-1196,
https://aclanthology.org/D14-1196
24. Yu, J., Bohnet, B., Poesio, M.: Named entity recognition as dependency parsing.
In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics, pp. 6470–6476.
Association for Computational Linguistic, July 2020. https://doi.org/10.18653/
v1/2020.acl-main.577, https://aclanthology.org/2020.acl-main.577
25. Zhang, Y., Li, Z., Zhang, M.: Efficient second-order TreeCRF for neural depen-
dency parsing. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceed-
ings of the 58th Annual Meeting of the Association for Computational Linguistics,
pp. 3295–3305. Association for Computational Linguistics, July 2020. https://doi.
org/10.18653/v1/2020.acl-main.302, https://aclanthology.org/2020.acl-main.302
Distribution-Guided Object Counting
with Optimal Transport and DINO-Based
Density Refinement

Ngo Xuan Cuong1,2(B) and Tien-Dung Mai1,2

1 University of Information Technology, Ho Chi Minh City, Vietnam
[email protected], [email protected]
2 Vietnam National University, Ho Chi Minh City, Vietnam

Abstract. Prompt-based object counting refers to estimating the number of objects
belonging to a selected category based on a text description provided by the user.
Current state-of-the-art methods estimate object counts by summing the values in the
predicted density map, without accounting for the distribution of object locations.
This is reflected in their loss function, usually an MSE loss, which focuses solely
on quantity. As a result, the model tends to overestimate the count of an object class
under factors such as overlapping, occlusion, or objects with the trait of
self-similarity. To address this, we propose OptiCount, a framework that uses an
optimal transport plan to measure the difference between the predicted density map
and the ground truth during training. Furthermore, we introduce a density-refinement
module that validates the number of objects counted to avoid overcounting. This module
significantly reduces the counting error of the model, making it more robust to various
challenges. Experiments on the FSC147 dataset show that OptiCount outperforms
state-of-the-art methods in terms of Mean Absolute Error (MAE), demonstrating its
effectiveness on the counting task.

Keywords: Prompt-based object counting · Object detection · Zero-shot counting

1 Introduction
Object counting in images is a crucial task within the field of computer
vision, with extensive applications across various domains such as surveillance,
autonomous driving, wildlife monitoring, and retail analytics. Despite signifi-
cant advancements in these areas, accurately counting unseen object categories
(those not present in the training data) remains a major challenge. Current
regression-based techniques, as highlighted in [11], typically generate a 2D density
map from which the total object count is derived by summing the density values across
all spatial locations. For images with a large number of objects, this density map
estimation approach has been shown to be more robust than the detection-then-counting
approach.

An essential phase in the density map estimation method involves training a deep
neural network to convert input images into their respective annotated density maps.
In the current counting dataset [11], each training image is annotated with a sparse
binary mask in which each object is represented by a single dot placed at its center.
The spatial extent of each object is not specified, due to the extensive labor required
to delineate it, particularly in cases of significant occlusion ambiguity. Training a
density map estimation network with dot-annotated images involves optimizing the
network's parameters to minimize a differentiable loss function that quantifies the
difference between the predicted density map and the dot-annotation map. The predicted
density map is a dense matrix of real values, whereas the dot-annotation map is a
sparse binary matrix. Because of the sparsity of the dot annotations, using a
pixel-wise difference function, such as the Mean Squared Error (MSE) loss, can lead to
instability. This drawback is evident in how the model performs on overlapping and
self-similar objects, where it typically overcounts.
To address this issue, we introduce a novel counter, OptiCount, which combines a
distribution-based relation between the predicted and annotated maps to guide model
training with a density-refinement module that controls the density of the predicted
map. First, we tackle the spatial issue by measuring the optimal transport cost between
the two distributions of the predicted and annotated maps with the Sinkhorn algorithm
and adding this cost to the total loss function for model training, which encourages
the model to pay attention to the locations of objects. However, the model still tends
to overcount in certain cases, especially when facing self-similar objects. We address
this with a module called density refinement, which controls the object count by
considering the density summation in regions of the image, generated by Grounding
DINO [7], that each contain exactly one object. The total density is then normalized
by the density of these regions, thus decreasing the model error.
The main contributions of this paper are threefold:
• We enhance object counting by incorporating spatial distribution informa-
tion through a novel loss function that integrates Optimal Transport cost to
measure and minimize distribution disparities between predicted and ground-
truth data.
• We propose OptiCount, a method that uses enhanced loss for training. Fur-
thermore, this model is capable of controlling object quantity with density-
refinement module, effectively alleviating counting errors.
• Empirically, our method surpasses state-of-the-art techniques on the challenging
FSC147 dataset. Notably, OptiCount achieves a 14.73% reduction in MAE compared to
the most recent zero-shot counters on the FSC147 test set.

2 Related Work
Computer vision researchers have long faced difficulties in visual counting, with
much of the study concentrating on particular categories such as cars, cells,

humans, or polyps. Detection-based techniques were frequently used as early


counting methods to identify and count objects belonging to specific categories.
For example, sophisticated detection algorithms like YOLO are usually employed
in vehicle counting systems, whereas segmentation techniques are typically used
in biological cell [14] and human [16] counting methods to achieve reliable
results. However, detection-based methods struggle in crowded scenes due to over-
laps and occlusions, increasing the counting error of the model. The density-
based method appears as a solution for this issue, as it has been shown to
perform more robustly against occluded objects.
Class-agnostic approaches have been developed to alleviate the drawback of
being limited to a particular object class. These approaches allow models to
count items dynamically based on exemplars within the image or a text descrip-
tion given by users. Early methods for object counting include using siamese
matching networks to predict density maps, while more recent approaches, such
as BMNet+ [12], blend representation learning with a non-linear similarity met-
ric. Another example is Famnet [11], which enhances the backbone architecture
to improve density estimation. Furthermore, CounTR [6] integrates vision trans-
formers with convolutional encoders and cross-attention modules for improved
feature fusion, while SAFECount [18] focuses on enhancing generalization with a
feature improvement module. LOCA [15] introduced an object prototype extrac-
tion module for iterative adaptation using exemplar appearance and shape.
Recent advancements in cross-modality learning have significantly enhanced
visual counting tasks by integrating text-based inputs. These text-based meth-
ods, which are inherently zero-shot, offer a novel approach to counting without
relying on extensive annotated datasets. Notably, [8] extended CLIP’s capabil-
ities by fine-tuning the model with a contrastive loss that includes both true
and counterfactual prompts, in addition to the standard text-image contrastive
objective, allowing CLIP to count up to ten objects. Building on this, methods
such as VLCounter [3] and CLIP-Count [2] utilize CLIP’s text encoder to embed
class names, enabling interaction with image features to regress density maps for
zero-shot, class-agnostic counting. TFOC [13] further advances this by employ-
ing various prompts (point, box, and text) to query the segmentation model SAM
[4], facilitating object counting in a training-free manner.
Density-based methods are, however, not without limitations. A common drawback is
the difficulty of counting overlapping and self-similar objects. This stems from the
loss function being defined as a pixel-wise loss between the predicted and annotated
maps (mean squared error). Mean squared error considers only object counts while
ignoring other spatial information such as object positions. This leads to a high
error rate when the model deals with challenging scenes, such as self-similar objects,
where the model tends to overcount. In the few-shot counting scenario, overcounting
can be alleviated with test-time normalization, i.e., dividing the estimated count by
the average sum of the density map over each exemplar region; in the prompt-based
setup, however, this is not possible. We focus on diminishing the above issues by
incorporating an optimal transport (OT)

cost and a density-refinement module into OptiCount, effectively reducing counting
errors.

3 Proposed Method
3.1 Overview

Fig. 1. Overall Framework of the Proposed Model. The framework consists of four key
components: encoders, feature interaction module, decoders, and the density refinement
module.

This section introduces a framework for counting arbitrary objects. Inspired


by prior work, we adopt a similar structure to [1] for density map generation,
but with the addition of a density-refinement module. The proposed frame-
work, OptiCount, incorporates CLIP encoders, a feature interaction module,
and up-sampling decoders to handle feature extraction, feature fusion, and den-
sity map up-sampling, respectively. After the up-sampling decoder, we introduce
the density-refinement module to regulate the density map and reduce count-
ing errors. The model is trained using an Optimal Transport loss. The overall
framework is illustrated in Fig. 1.

3.2 Optimal Transport Loss (OT Loss)

In object counting, the L2 pixel-wise loss is widely used as the objective for model
optimization, which is inappropriate since the spatial distribution is ignored. For
example, under MSE loss, a small background change and an object density change are
treated as equivalent, even though the former can cause a large localization error
while the latter is a small error. As a consequence, a model trained with MSE tends
to be misled in ambiguous cases, leading to high errors in object counting. Therefore,
a loss function that penalizes both count and distribution mismatch is necessary. To
overcome this, we propose the OT loss,

a loss function that considers both elements for model training. The total loss
includes the OT loss L_OT and the counting loss L_C:

  L_total = λ_1 × L_OT + λ_2 × L_C                                   (1)

where λ_1 and λ_2 are the hyperparameters weighting the OT loss (L_OT) and the
counting loss (L_C). The former measures the distribution difference, while the
latter measures the counting difference between the predicted and annotated maps.

OT Loss: Before calculating the OT loss, we turn the predicted density map into a
probability distribution by normalizing it. Consider the normalized density map
A = {(a_i, x_i)}_{i=1}^n, where a_i ≥ 0 and x_i ∈ R^2 denote the probability and the
position of pixel i, respectively, and the ground truth map B = {(b_j, y_j)}_{j=1}^m,
where b_j = 1 and y_j ∈ R^2 denote the j-th object location. Our loss function is
based on the Sinkhorn distance between A and B:

  L_OT = L(A, B) = min_{P ∈ U(a,b)} ⟨C, P⟩ − ε H(P)                              (2)

                 = Σ_{i,j} C_ij P_ij + ε Σ_{i,j} P_ij log(P_ij)                  (3)

where C is the quadratic transport cost, defined as c(z_i, ẑ_j) = ||z_i − ẑ_j||_2^2,
and P is the transport plan.
The Sinkhorn distance in (3) finds the optimal transport plan P whose element P_ij
is the density transported from x_i to y_j that minimizes the total transport cost.
Following [9], the solution for the optimal transport plan can be expressed in matrix
form:

  P = diag(u) K diag(v)                                                          (4)

where K = exp(−C/ε) is the Gibbs kernel associated with the cost C. The variables u
and v must satisfy the following nonlinear equations, which correspond to the
mass-conservation constraints of the optimal transport problem U(a, b):

  diag(u) K diag(v) 1_m = a                                                      (5)

  diag(v) K^T diag(u) 1_n = b                                                    (6)

These two equations can be simplified, since diag(v) 1_m is simply v, and multiplying
diag(u) with Kv gives:

  u ⊙ (Kv) = a                                                                   (7)

  v ⊙ (K^T u) = b                                                                (8)

where ⊙ denotes element-wise multiplication of vectors.
An intuitive approach to solving these equations is an iterative method in which u is
first adjusted to satisfy (7), followed by adjusting v to satisfy (8). These two
updates define Sinkhorn's algorithm:

  u^(ℓ+1) = a / (K v^(ℓ))                                                        (9)

  v^(ℓ+1) = b / (K^T u^(ℓ+1))                                                    (10)

which is initialized with an arbitrary positive vector v^(0) = 1_m. The division of
two vectors is understood element-wise.
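The NumPy sketch below illustrates how Eqs. (4), (9), and (10) yield a transport plan and a transport cost. It is a toy illustration under our own choices of ε, iteration count, and names; for simplicity it returns ⟨C, P⟩ without the entropy term, and it is not the training-time implementation used in OptiCount.

import numpy as np

def sinkhorn_ot(a, b, C, eps=0.1, n_iters=100):
    """Entropically regularized OT between histograms a (n,) and b (m,);
    C is the (n, m) cost matrix. Returns the plan P and the cost <C, P>."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                  # Eq. (9): enforce row marginals
        v = b / (K.T @ u)                # Eq. (10): enforce column marginals
    P = np.diag(u) @ K @ np.diag(v)      # Eq. (4)
    return P, float(np.sum(P * C))

# Toy example: three predicted pixels versus two annotated dots on a line.
x = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])   # pixel positions x_i
y = np.array([[0.0, 0.0], [2.0, 0.0]])               # dot positions y_j
a = np.array([0.5, 0.2, 0.3])                        # normalized predicted density
b = np.array([0.5, 0.5])                             # normalized dot masses
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
P, ot_cost = sinkhorn_ot(a, b, C)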

Count Loss: For count loss, we simply apply MSE between predicted and
annotated map following [6]:
1 
LC = L(ŷi , yi ) =
. ||yi − ŷi ||22 (11)
HW
where .yi , ŷi ∈ RH×W ×1 represent the ground truth and predicted density map
respectively. .H and .W are the height and width of the image.

3.3 Density-Refinement Module

To reduce counting errors caused by self-similar objects, we introduce a density-


refinement module. This module leverages Grounding DINO [7] to predict can-
didate bounding boxes. Since the boxes predicted by Grounding DINO often
encompass multiple objects, we apply a filtering process based on the shape of
the boxes, as our goal is to isolate bounding boxes that contain a single object.
The density-refinement module is depicted in Fig. 2.

Fig. 2. Overview of the Density Refinement Module.



After filtering, we select the three boxes with the highest confidence scores.
To refine the density map, we normalize it by dividing the density values within
each selected box by the total sum of the density values within that box. This
approach ensures a more accurate representation of object counts within the
density map, mitigating errors due to overlapping or clustered objects.
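Under our reading of this step, the refinement estimates the density mass of a single object from the selected boxes and rescales the global count by it, in the spirit of the few-shot test-time normalization mentioned in the related work discussion. The sketch below illustrates that reading; box prediction and filtering with Grounding DINO are abstracted away, and all names are illustrative rather than taken from the actual module.

import numpy as np

def refine_count(density_map, boxes):
    """Normalize the predicted count by the average density mass inside
    single-object boxes, given as (x1, y1, x2, y2) in pixel coordinates."""
    per_box_mass = [density_map[y1:y2, x1:x2].sum() for x1, y1, x2, y2 in boxes]
    avg_object_mass = np.mean(per_box_mass)          # density mass of one object
    raw_count = density_map.sum()                    # count before refinement
    return raw_count / max(avg_object_mass, 1e-6)    # refined object count

# Toy usage: a map whose total mass is 24 while one object accounts for mass 6.
density = np.full((10, 10), 0.24)                    # density.sum() == 24
boxes = [(0, 0, 5, 5)]                               # a box containing one object
print(refine_count(density, boxes))                  # 24 / 6 = 4.0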

4 Experiment
4.1 Dataset and Metric
We experiment on FSC-147 [11], which is a multi-class few-shot object counting
dataset containing 6135 images. The number of counted objects in each image
varies widely, ranging from 7 to 3731, with an average of 56. The dataset also
provides three randomly selected object instances annotated by bounding boxes
as exemplars in each image. The training set includes 89 object categories, while
the validation and test sets each contain 29 disjoint categories, making FSC-147
an open-set object counting dataset.
We use two standard metrics to measure the performance of our model,
namely, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).

4.2 Implementation Details


For model training, we crop the image to a size of 224 × 224 and normalize it.
Basic data augmentation techniques are applied, including Gaussian noise, Gaussian
blur, horizontal flip, and color jittering. Unlike other methods [1-3], we directly
use a binary ground truth map for model training instead of applying a Gaussian
filter, as this can harm the training process, as mentioned in [16]. λ_1 is set to 60
following [6], and λ_2 is tuned on the FSC147 dataset, with MAE ranging from 15.6 to
16.2; we choose λ_2 = 0.1 as it provides the best result. The model is trained using
the Adam optimizer with a learning rate of 3e-6 and a batch size of 16.

5 Results and Analysis


5.1 Quantitative Result
We compare our method, OptiCount, against various density-based and detection-based
prompt counting techniques. The quantitative results are summarized in Table 1.
OptiCount significantly outperforms all zero-shot density-based methods, including
the long-standing winner, CounTX, achieving improvements of 14.73% and 8.1% in Mean
Absolute Error (MAE) on the test and validation sets of FSC147, respectively. This
improvement can be attributed to our Optimal Transport (OT) loss, which demonstrates
its effectiveness by incorporating additional spatial information into density map
generation. Additionally, our method achieves significantly lower counting errors
across the FSC147 dataset compared to detection-based methods like PseCo. Although
ZSOC is a two-stage method, its error is considerably higher than OptiCount's,
further validating our method's robustness.

Table 1. Performance comparison of different methods on validation and test sets.


PseCo [5] is detection-based, while the other methods are density-based.

Method          Type             Test set             Validation set
                                 MAE ↓    RMSE ↓      MAE ↓    RMSE ↓
PseCo [5]       Detection-based  16.58    129.77      23.9     100.33
ZSOC [17]       Density-based    22.09    115.17      26.93    88.63
RepRPN [10]     Density-based    28.32    128.76      31.69    100.31
CLIP-Count [2]  Density-based    17.78    106.62      18.79    61.18
VLCounter [3]   Density-based    17.05    106.16      18.06    65.13
CounTX [1]      Density-based    15.88    106.29      17.1     65.61
Ours            Density-based    13.84    107.18      15.73    60.15

Fig. 3. Counting result comparison of different density-based models (VLCounter [3],
CLIP-Count [2], CounTX [1]) and OptiCount (ours).

5.2 Qualitative Result


Figure 3 presents representative counting results obtained through prompt-based
counting across different models. While other methods struggle with challenges
such as occlusion and self-similar objects, OptiCount maintains high accuracy
in these scenarios. For instance, the density maps generated by OptiCount focus
densely on the centers of objects, in contrast to the more dispersed density maps
produced by other methods. This improvement can be attributed to OptiCount’s

incorporation of location distribution during model training, leading to enhanced


performance.
Notably, other models tend to double-count when addressing self-similar objects, such
as in the sunglasses image. In contrast, OptiCount effectively handles this issue with
a refinement module that regulates the density values across the density map, as shown
in Fig. 4. This capability highlights OptiCount's superior handling of complex counting
scenarios, further establishing its efficacy in prompt-based counting tasks.

Fig. 4. Qualitative results of OptiCount on self-similar objects.

5.3 Ablation Study

We conducted an ablation study on the FSC147 dataset to evaluate the contribu-


tions of different components of OptiCount. Specifically, we trained OptiCount
using three configurations: first with Mean Squared Error (MSE) loss without the
refinement module, then with Optimal Transport (OT) loss without the refine-
ment module, and finally with OT loss incorporating the refinement module.
The results of this study are summarized in Table 2.
Table 2 indicates that the OT loss significantly influences model performance,
underscoring the importance of incorporating spatial information during training.
Additionally, the refinement module plays a crucial role, as it considerably reduces
the test Mean Absolute Error (MAE), as illustrated in Fig. 5. However, the validation
MAE does not change significantly, which may be attributed to the higher prevalence of
self-similar objects in the test set.
The effectiveness of the filters within the refinement module is evident in Fig. 5.
The initial box predictions often include numerous false positives, particularly boxes
covering clustered objects rather than individual ones. The shape filter effectively
removes such negative samples, while the density filter

Table 2. Comparison of the model's performance with different components on the
FSC147 test set.

                   MAE ↓    RMSE ↓
OptiCount_MSE      15.88    106.29
OptiCount_OT       15.65    108.22
OptiCount_OT+DR    13.84    107.18

Fig. 5. Filter visualization

addresses occlusions, thereby enhancing the quality of the samples used for nor-
malization. This process leads to a more representative dataset for accurate
density estimation, further optimizing the performance of OptiCount.

6 Conclusion
In this paper, we propose a novel framework for prompt-based object counting. By
enhancing the loss function with spatial information through an optimal transport loss
and proposing a density-refinement module, our method enables the model to reduce
counting errors in challenging cases. Experiments on the FSC147 dataset demonstrate
that our model performs reliably, especially in scenarios with overlapping and
self-similar objects. In future work, we plan to focus on extremely dense regions to
further enhance the model's performance.

Acknowledgment. This research is funded by University of Information Technology


- Vietnam National University HoChiMinh City under grant number D4-2024-02.

References
1. Amini-Naieni, N., Amini-Naieni, K., Han, T., Zisserman, A.: Open-world text-
specified object counting. arXiv preprint arXiv:2306.01851 (2023)

2. Jiang, R., Liu, L., Chen, C.: CLIP-Count: towards text-guided zero-shot object
counting. In: MM, pp. 4535–4545 (2023)
3. Kang, S., Moon, W., Kim, E., Heo, J.P.: VLCounter: text-aware visual representa-
tion for zero-shot object counting. In: AAAI, pp. 2714–2722 (2024)
4. Kirillov, A., et al.: Segment anything. arXiv:2304.02643 (2023)
5. Li, G., Li, X., Wang, Y., Wu, Y., Liang, D., Zhang, S.: PseCo: pseudo labeling and
consistency training for semi-supervised object detection. In: ECCV, pp. 457–472.
Springer (2022)
6. Liu, C., Zhong, Y., Zisserman, A., Xie, W.: CounTR: transformer-based generalised
visual counting. arXiv preprint arXiv:2208.13721 (2022)
7. Liu, S., et al.: Grounding DINO: marrying DINO with grounded pre-training for open-
set object detection. arXiv preprint arXiv:2303.05499 (2023)
8. Paiss, R., et al.: Teaching clip to count to ten. In: ICCV, pp. 3170–3180 (2023)
9. Peyré, G., Cuturi, M.: Computational optimal transport. Found. Trends Mach.
Learn. 11(5–6), 355–607 (2019)
10. Ranjan, V., Nguyen, M.H.: Exemplar free class agnostic counting. In: ACCV, pp.
3121–3137 (2022)
11. Ranjan, V., Sharma, U., Nguyen, T., Hoai, M.: Learning to count everything. In:
CVPR, pp. 3394–3403 (2021)
12. Shi, M., Hao, L., Feng, C., Liu, C., Cao, Z.: Represent, compare, and learn: a
similarity-aware framework for class-agnostic counting. In: CVPR (2022)
13. Shi, Z., Sun, Y., Zhang, M.: Training-free object counting with prompts. In: WACV,
pp. 323–331 (2024)
14. Tyagi, A.K., et al.: DeGPR: deep guided posterior regularization for multi-class
cell detection and counting. In: CVPR, pp. 23913–23923 (2023)
15. Đukić, N., Lukežič, A., Zavrtanik, V., Kristan, M.: A low-shot object counting
network with iterative prototype adaptation. In: ICCV, pp. 18872–18881 (2023)
16. Wang, B., Liu, H., Samaras, D., Nguyen, M.H.: Distribution matching for crowd
counting. In: NeurIPS, vol. 33, pp. 1595–1607 (2020)
17. Xu, J., Le, H., Nguyen, V., Ranjan, V., Samaras, D.: Zero-shot object counting.
In: CVPR, pp. 15548–15557 (2023)
18. You, Z., Yang, K., Luo, W., Lu, X., Cui, L., Le, X.: Few-shot object counting with
similarity-aware feature enhancement. In: WACV, pp. 6315–6324 (2023)
Motion Analysis in Static Images

Kunal Agrawal1, Vastsa S. Patel1, Reema Tharra1, Trung-Nghia Le2,
Minh-Triet Tran2, and Tam V. Nguyen1(B)

1 Department of Computer Science, University of Dayton, Dayton, USA
[email protected]
2 Faculty of Information Technology, University of Science, VNUHCM, Ho Chi Minh City,
Vietnam

Abstract. In this paper, we address the recognition of motion illusions in static


images. To this end, we collect a new dataset containing images both with and
without motion illusions. We then benchmark state-of-the-art deep learning mod-
els to determine the presence of illusions in the images. Additionally, we assess
the role of color in the recognition process. The experimental results show that
deep learning models are effective in identifying motion illusions, with superior
performance on color images, highlighting the importance of color in analyzing
motion within static images.

Keywords: Motion Analysis · Static Images · Optical Illusion · MISS

1 Introduction
Motion illusions, the intriguing visual puzzles that play tricks on our eyes, have long
fascinated us. This research takes a deep dive into the realm of motion illusions,
aiming to advance our understanding of how machines interpret these visual phenomena
[1]. Beyond the intrigue of optical illusions, the focus is on equipping computers with
the ability to recognize and comprehend illusory motion patterns within static images as
in Fig. 1. This introductory section sets the stage for two primary areas of exploration:
the critical role of bespoke datasets in training effective machine learning models [2]
and a preliminary observation hinting at the superiority of colored images over grayscale
ones in motion illusion classification [3].
Motion illusions, such as the iconic rotating snakes or barber pole illusions, pose
unique challenges for computational systems [4]. While human vision effortlessly nav-
igates these illusions, teaching machines to discern the intricacies of illusory motion
demands a specialized focus. This research is positioned at the intersection of cognitive
psychology and computer vision, seeking to unravel the mysteries of motion illusions
and their computational interpretation.
One of the critical revelations in our exploration lies in the recognition of the inad-
equacy of generic datasets in capturing the diverse nuances of motion illusions [5].
Consequently, we advocate for the creation of bespoke datasets, meticulously tailored to
the specific characteristics of illusory motion [6]. These datasets serve as more than just


training grounds for deep learning models; they offer insights into the features crucial
for machines to discern illusory movement. The imperative here is to understand the
impact of dataset specificity on model interpretability.
While specific details about the employed models remain undisclosed in this section,
our research delves into the intricacies of computational models when confronted with
motion illusions [7]. Deep learning architectures, known for their prowess in pattern
recognition, confront unique challenges in decoding illusory motion. The objective is
to unravel the decision-making processes within these architectures when tasked with
distinguishing illusory movement from static scenes. The aim is to offer insights that
transcend the specifics of the models used, contributing to the broader discourse on the
interpretability of deep learning in perceptual tasks.

Fig. 1. An example of motion illusion in a static scene (image).

In a preliminary observation, we allude to an intriguing finding – colored images


potentially outperform their grayscale counterparts in the realm of motion illusion classi-
fication [8]. This observation sets the stage for deeper discussions later in the paper. The
choice of image representation emerges as a critical factor, prompting questions about
the role of color information in the computational interpretation of illusory motion.
The subsequent sections of the paper unfold in a structured manner. Section 3
delves into the methodology, providing insights into the creation of bespoke datasets
and offering an overview of the deep learning models employed [9, 10]. Section 4
presents our findings, including comparative analyses and insights into the impact of
color imagery on model performance. Section 5 discusses the practical implications of
our research, emphasizing its relevance in real-world applications [11], summarizing
key contributions, and proposing avenues for future research.
In essence, our research is a bridge between the cognitive complexities of motion
illusions and the computational power of deep learning [12]. By advocating for special-
ized datasets and uncovering model intricacies, we aim to enrich the dialogue on the
evolving landscape of visual perception in artificial intelligence.

2 Related Work
Motion illusions in static images have captivated researchers across cognitive psychology
and computer vision, prompting a multidisciplinary exploration. This section delves into
key contributions, laying the groundwork for our investigation and referencing ten studies
not covered in the introduction.
In a pioneering work, Johansson, G. [13] investigated the perceptual mechanisms
underlying motion illusions, elucidating the intricacies of how the human visual system
interprets dynamic phenomena. This foundational work serves as a compass, guiding
our understanding of the cognitive processes involved in perceiving motion illusions.
Williams et al. [14] addressed challenges in creating specialized datasets for motion
illusion studies, emphasizing the importance of tailored datasets to capture intricate
variations in illusory motion patterns. Wang et al. [15] introduce objective methods for
perceptual image quality assessment, focusing on quantifying the visibility of errors in
distorted images. They propose a structural similarity index, demonstrating its
effectiveness through intuitive examples and subjective evaluations.
Later, Watanabe et al. [12] demonstrate that DNNs accurately replicate the direction
of illusory rotation but fail to detect motion components in negative control. The study
sheds light on the capability of DNNs to simulate complex perceptual phenomena like
illusory motion. Overall, the findings contribute to understanding the computational
mechanisms underlying visual perception in neural networks.
Kobayashi et al. [9] investigate the extraction of motion illusion-like patterns from
photographs and artworks using predictive deep neural networks. Their study
demonstrates the successful replication of illusory motion observed in visual stimuli
using deep learning techniques. By leveraging predictive deep neural networks, the
research contributes to understanding and reproducing complex visual phenomena.
Meanwhile, Luckiesh [11] explored visual illusions, delving into their causes,
characteristics, and practical applications. The work provides a comprehensive study of
visual illusions, offering insights into their underlying mechanisms and practical implications.
This seminal work continues to be relevant for understanding the complexities of visual
perception. Next, Sun et al. [16] explored multisensory integration in motion perception,
shedding light on how combining visual and auditory cues influences the interpretation of
motion illusions. This complements our understanding of motion illusions by incorporat-
ing a multisensory perspective. In another work, Nishida and Johnston [17] investigated
neurophysiological correlates of motion illusions, providing insights into the neural
mechanisms underlying the perception of dynamic visual phenomena. Understanding
these correlates enriches the broader discussion on motion illusion recognition.
Taylor et al. [18] explore how viewers perceive and physiologically respond to frac-
tal patterns in Jackson Pollock’s art. It discusses the positive responses to fractal patterns,
indicating aesthetic appreciation and physiological engagement. By analyzing both per-
ceptual and physiological aspects, the research sheds light on the intricate relationship
between art and human cognition. This investigation expands understanding of fractals’
impact on human experience. In summary, this related work section incorporates diverse
perspectives from recent research, extending our understanding of motion illusion recog-
nition within static images. Each study contributes uniquely to our exploration, forming
the mosaic of knowledge guiding our investigation.
Fig. 2. The examples of motion images in our collected dataset. Please see the color figures in
pdf with 400% zoom.

3 Dataset Collection
There have been some efforts to collect datasets [19] for motion illusions. However, these datasets are small and not well organized. The need for creating this dataset arises from
the limited availability of publicly accessible datasets specifically designed for studying
motion perception in static images.
Therefore, in this work, we collect a new dataset, Motion Illusion in Static Scene,
dubbed MISS. We use the Google Image Search Engine [20] with different input keywords, for example, “motion illusion”, “optical illusion”, and “eye trick motion”. Then, we use Google
Lens [21] to find similar images to the ones we initially collected with keywords. To
ensure the quality and relevance of the dataset, images were meticulously curated based
on established criteria for motion illusion stimuli. Each image was assessed for its effec-
tiveness in eliciting the perception of motion through manual inspection and validation
by seven individuals with normal vision and expertise in visual perception research.
The dataset comprises a diverse range of images with motion exhibiting different
patterns and configurations known to evoke the perception of motion in observers as
shown in Fig. 2. These patterns include but are not limited to radial, concentric, spi-
ral, and grid-like structures that exploit visual processing mechanisms to create the
illusion of movement. The MISS Dataset comprises not only images depicting motion
illusions but also a significant portion of non-motion images as shown in Fig. 3. These
non-motion images serve as crucial counterparts to their motion counterparts, providing
essential context for comparison and model training. Captured from various sources and
meticulously selected, the non-motion images encompass scenes devoid of any apparent
motion or illusionary effects. Their inclusion ensures a balanced dataset representa-
tion, enabling models to discern between genuine motion illusions and static scenes
accurately. By incorporating non-motion images, the dataset offers a comprehensive
spectrum of visual stimuli, facilitating robust model training and evaluation for motion
perception analysis. In total, the dataset consists of 600 high-resolution images, with
an equal distribution between motion and non-motion categories in both the color and
grayscale datasets. This balanced dataset composition ensures robustness and reliability
in subsequent model training and evaluation processes.

Fig. 3. The examples of non-motion images in our collected dataset.

Moreover, to investigate the impact of color information on motion perception, the


dataset was further processed to create a grayscale version. This grayscale dataset was
derived from the original color images as in Fig. 4, resulting in another set of 600
grayscale images. The utilization of this carefully curated dataset enables the exploration
and analysis of the underlying mechanisms of motion perception in static images, facil-
itating the development and evaluation of machine learning models for motion illusion
classification in both color and grayscale contexts.
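To make the grayscale derivation concrete, the following minimal sketch shows one way such a grayscale counterpart could be produced; it assumes a Pillow-based pipeline, and the folder layout, file extension, and 8-bit luminance conversion are illustrative assumptions rather than the authors' exact procedure.

```python
# Minimal sketch: deriving a grayscale copy of a color image folder with Pillow.
# Paths, file extension, and the "L" (8-bit luminance) conversion are
# illustrative assumptions, not the authors' exact pipeline.
from pathlib import Path
from PIL import Image

def build_grayscale_set(color_dir: str, gray_dir: str) -> None:
    """Convert every PNG under color_dir to grayscale and mirror it in gray_dir."""
    out_root = Path(gray_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    for img_path in Path(color_dir).glob("*.png"):
        Image.open(img_path).convert("L").save(out_root / img_path.name)

# Example usage (hypothetical paths):
# build_grayscale_set("miss/color", "miss/grayscale")
```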

4 Experiments
4.1 Model Training
The experiments involved training multiple deep learning models, MobileNet [22],
MobileNetV2 [23], ResNet50 [24], ResNetRS200 [25], Xception [26], EfficientNetB5
[27], EfficientNetV2S [28], InceptionV3 [29], NASNetMobile [30], and NASNetLarge
[30], on both the color and grayscale versions of the MISS dataset. The training process
included feeding the models with the training dataset, comprising 272 motion images
and 128 non-motion images for the color dataset, and an equivalent distribution for the
grayscale dataset. Meanwhile, 100 images (50 motion and 50 non-motion images) are
used for the validation set, and 100 images (50 motion and 50 non-motion images) are used for testing. Stochastic gradient descent with momentum was utilized
as the optimization algorithm, with the following update rule:
 
\[
\theta_{t+1} = \theta_{t} - \alpha \cdot \nabla J(\theta_{t}) + \beta \cdot \left(\theta_{t} - \theta_{t-1}\right),
\]
where $\theta_t$ is the parameter vector at iteration $t$, $\alpha$ is the learning rate, $\nabla J(\theta_t)$ is the gradient of the loss function $J$ with respect to $\theta_t$, and $\beta$ is the momentum term.
The learning rate (α) was set to 0.001, and the momentum (β) was set to 0.9 to
balance between fast convergence and avoiding oscillations.
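The sketch below shows one possible realization of this training configuration in TensorFlow/Keras; the framework, the MobileNet backbone, the 224×224 input size, and the sigmoid classification head are assumptions added for illustration (the paper does not specify them), while the optimizer settings match the stated values.

```python
# Sketch of the described setup: a pre-trained backbone with a binary
# motion/non-motion head, trained with SGD (lr=0.001, momentum=0.9).
# TensorFlow/Keras and the MobileNet backbone are assumed for illustration.
import tensorflow as tf

base = tf.keras.applications.MobileNet(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3), pooling="avg")
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(1, activation="sigmoid"),  # motion vs. non-motion
])

optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=...) would follow.
```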

Fig. 4. Motion illusion in color (left) vs. grayscale (right). Please see the color figures in pdf with
200% zoom.

4.2 Performance Metrics

After training, the models were evaluated on separate testing sets containing 50 motion
and 50 non-motion images for both the color and grayscale datasets. Evaluation is done
based on the testing accuracy, which was calculated from model predictions.
The experimental results revealed the efficacy of the trained models in accurately
classifying motion illusions in static images. In addition to testing accuracy, precision,
recall, and F1-score metrics were calculated to provide a comprehensive evaluation of
model performance.
Precision measures the accuracy of positive predictions. It is calculated as the ratio of
true positive predictions to the total number of positive predictions made by the model.
Recall, also known as sensitivity or true positive rate, measures the proportion of
actual positive instances that were correctly identified by the model. It is calculated as
the ratio of true positive predictions to the total number of actual positive instances.
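For completeness, and using notation that does not appear in the original text, let TP, FP, and FN denote true positives, false positives, and false negatives; the metrics above can then be written as
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\]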
Table 1. Experimental results on the collected dataset. Each model is tested on both color and grayscale images.

Model            Dataset    mAP
MobileNet        Colored    80%
MobileNet        Grayscale  73%
MobileNetV2      Colored    77.99%
MobileNetV2      Grayscale  70.99%
ResNet50         Colored    81%
ResNet50         Grayscale  76.99%
ResNetRS200      Colored    70.99%
ResNetRS200      Grayscale  68%
Xception         Colored    68.99%
Xception         Grayscale  63.99%
EfficientNetB5   Colored    75%
EfficientNetB5   Grayscale  55%
EfficientNetV2S  Colored    74%
EfficientNetV2S  Grayscale  67%
InceptionV3      Colored    52.99%
InceptionV3      Grayscale  50%
NASNetMobile     Colored    72%
NASNetMobile     Grayscale  54%
NASNetLarge      Colored    68.99%
NASNetLarge      Grayscale  56%

4.3 Experimental Results

We aim to assess the performance of various deep learning models on detecting motion
in static images, using both colored and grayscale datasets. As summarized in Table 1, the
models tested include MobileNet, MobileNetV2, ResNet50, ResNetRS200, Xception,
EfficientNetB5, EfficientNetV2S, InceptionV3, NASNetMobile, and NASNetLarge. For
evaluation, the mean Average Precision (mAP) was used as the primary performance
metric.
The results clearly indicate that models generally perform better on the colored
dataset compared to the grayscale dataset. The drop in performance when switching to
grayscale is observed across all models, though the extent of the performance degradation
varies.
Top Performing Model. ResNet50 achieved the highest mAP for both colored (81%)
and grayscale (76.99%) datasets, making it the most robust across both image types.
MobileNet also performed well, with 80% mAP on the colored dataset and a 7% drop
when tested on the grayscale dataset.
Performance Impact. EfficientNetB5 and NASNetMobile had the largest drops in per-
formance when switching to grayscale. EfficientNetB5, for example, went from 75%
mAP on colored images to just 55% on grayscale. NASNetMobile also dropped signif-
icantly, from 72% on colored images to 54% on grayscale. These models seem to rely
more on color information to understand motion in static images.
Models that Adapt Well. Some models, like ResNet50 and MobileNetV2, showed
smaller performance drops when trained on grayscale data. For instance, ResNet50 only
dropped by about 4%, and MobileNetV2 by 7%. This suggests that these models are
better at finding important features in images, even without color.
The results of this experiment highlight that color images are generally more useful
than grayscale images for detecting motion in static images. Models tend to perform
better when they have access to color, which provides more detailed information. How-
ever, some models, such as ResNet50, still manage to perform well even with grayscale
images. This means they can focus on other details like textures and shapes, even when
color is missing.
Moreover, examining precision and recall values can offer deeper insights into the
models’ behavior. A high precision value indicates that the model rarely misclassifies
non-motion illusion samples, while a high recall value suggests the model effectively
captures most of the actual motion illusion samples. Balancing these two metrics is
crucial, as prioritizing one over the other may lead to biased performance evaluations.

5 Conclusion and Future Work

In this paper, we explored how different deep learning models perform when detecting
motion in static images using both colored and grayscale datasets. The results of our
experiments show that color images consistently lead to better performance compared
to grayscale images across all the models tested. This highlights the importance of color
information in helping models recognize motion-related patterns.
Among the models tested, ResNet50 stood out as the best performer for both colored
and grayscale images. Although all models saw a drop in accuracy when trained on
grayscale data, some models—like ResNet50 and MobileNetV2—handled the absence
of color better than others. Models like EfficientNetB5 and NASNetMobile, on the
other hand, struggled more with grayscale images, experiencing significant drops in
performance.
Overall, our findings suggest that color information plays a key role in motion detec-
tion tasks. While some models can still perform reasonably well with grayscale images,
the results show that including color data generally leads to more accurate and reliable
motion detection. Therefore, if color data is available, it should be used to maximize the
performance of the models.
For future work, we can enhance motion perception classification by exploring novel
deep learning architectures tailored for this task and incorporating semantic segmenta-
tion and attention mechanisms. Collaboration with experts in psychology and neuro-
science can deepen our understanding of motion perception mechanisms. Expanding
and diversifying the dataset will improve model generalization. Real-world applica-
tions, such as human-computer interaction and autonomous systems, warrant explo-
ration, along with user studies to assess model impact. Developing explainable AI tech-
niques will increase model transparency and trustworthiness. Addressing these directions
will advance motion perception analysis and its application in various domains.

Acknowledgment. This research was supported by the National Science Foundation (NSF) under
Grant 2025234.

References
1. Carbon, C.C.: Understanding human perception by human-made illusions. Front. Hum.
Neurosci. 8, 566 (2014)
2. Koch, B., Denton, E., Hanna, A., Foster, J.G.:. Reduced, Reused and Recycled: The Life of
a Dataset in Machine Learning Research. arXiv preprint arXiv:2112.01716 (2021)
3. Kitaoka, A.: Color-dependent motion illusions in stationary images and their phenomenal
dimorphism. Perception 43(9), 914–925 (2014)
4. Otero-Millan, J., Macknik, S.L., Martinez-Conde, S.: Microsaccades and blinks trigger
illusory rotation in the ‘rotating snakes’ illusion. J. Neurosci. 32(17), 6043 (2012)
5. Salari, A., Djavadifar, A., Liu, X., Najjaran, H.: Object recognition datasets and challenges:
a review. Neurocomputing 495, 129–152 (2022)
6. Chung, S.T., Patel, S.S., Bedell, H.E., Yilmaz, O.: Spatial and temporal properties of the
illusory motion-induced position shift for drifting stimuli. Vision. Res. 47(2), 231–243 (2007)
7. Gomez-Villa, A., Martín, A., Vazquez-Corral, J., Bertalmío, M., Malo, J.: Color illusions
also deceive CNNs for low-level vision tasks: analysis and implications. Vision. Res. 176,
156–174 (2020)
8. Sowmya, V., Govind, D., Soman, K.P.: Significance of contrast and structure features for
an improved color image classification system. In: 2017 IEEE International Conference on
Signal and Image Processing Applications (ICSIPA), pp. 210–215. IEEE (2017)
9. Kobayashi, T., Kitaoka, A., Kosaka, M., Tanaka, K., Watanabe, E.: Motion illusion-like pat-
terns extracted from photo and art images using predictive deep neural networks. Sci. Rep.
12(1), 3893 (2022)
10. Kirubeswaran, O.R., Storrs, K.R.: Inconsistent illusory motion in predictive coding deep
neural networks. Vision. Res. 206, 108195 (2023)
11. Luckiesh, M.: Visual Illusions, their Causes, Characteristics and Applications. D. Van
Nostrand Company (1922)
12. Watanabe, E., Kitaoka, A., Sakamoto, K., Yasugi, M., Tanaka, K.: Illusory motion reproduced
by deep neural networks trained for prediction. Frontiers in Psychology 345 (2018)
13. Johansson, G.: Visual perception of biological motion and a model for its analysis. Percept.
Psychophys. 14, 201–211 (1973)
14. Williams, R.M., Yampolskiy, R.V.:. Optical Illusions Images Dataset. arXiv preprint arXiv:
1810.00415, 2 (2018)
15. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error
visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
16. Sun, H.J., Campos, J.L., Chan, G.S.: Multisensory integration in the estimation of relative
path length. Exp. Brain Res. 154, 246–254 (2004)
17. Nishida, S., Johnston, A.: Marker correspondence, not processing latency, determines
temporal binding of visual attributes. Curr. Biol. 24(15), 1677–1686 (2014)
18. Taylor, R.P., Spehar, B., Van Donkelaar, P., Hagerhall, C.M.: Perceptual and physiological
responses to Jackson Pollock’s fractals. Front. Hum. Neurosci. 5, 60 (2011)
19. Akiyoshi Kitaoka’s website. Ritsumeikan University. https://www.ritsumei.ac.jp/~akitaoka/
index-e.html. Last access February 2024
20. Bitirim, Y.: Retrieval effectiveness of google on reverse image search. J. Imaging Sci. Technol.
66, 010505–010511 (2022). https://doi.org/10.2352/J.ImagingSci.Technol.2022.66.1.010505
21. Taffel, S.: Google’s lens: computational photography and platform capitalism. Media Cult.
Soc. 43(2), 237–255 (2021)
22. Sinha, D., El-Sharkawy, M.: Thin mobilenet: an enhanced mobilenet architecture. In: 2019
IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference
(UEMCON), pp. 0280–0285. IEEE (2019)
23. Dong, K., Zhou, C., Ruan, Y., Li, Y.: MobileNetV2 model for image classification. In: 2020
2nd International Conference on Information Technology and Computer Application (ITCA),
pp. 476–480. IEEE (2020)
24. Koonce, B., Koonce, B.: ResNet 50. Convolutional Neural Networks with Swift for
Tensorflow: Image Recognition and Dataset Categorization, pp. 63–72 (2021)
25. Bello, I., et al.: Revisiting resnets: improved training and scaling strategies. Adv. Neural. Inf.
Process. Syst. 34, 22614–22627 (2021)
26. Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258 (2017)
27. Bhawarkar, Y., Bhure, K., Chaudhary, V., Alte, B.: Diabetic retinopathy detection from fundus
images using multi-tasking model with EfficientNet B5. In ITM Web of Conferences, 44,
p. 03027. EDP Sciences (2022)
28. Tan, M., Le, Q.:. Efficientnetv2: smaller models and faster training. In: International
Conference on Machine Learning, pp. 10096–10106. PMLR (2021)
29. Wang, C., et al.: Pulmonary image classification based on inception-v3 transfer learning
model. IEEE Access 7, 146533–146541 (2019)
30. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable
image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 8697–8710 (2018)
Motorcycle Helmet Detection Benchmarking

Kunal Agrawal1, Vatsa S. Patel1, Ian Cannon1,2, Minh-Triet Tran3, and Tam V. Nguyen1(B)
1 Department of Computer Science, University of Dayton, Dayton, USA
[email protected]
2 Applied Sensing Lab, University of Dayton Research Institute, Dayton, USA
3 Faculty of Information Technology, University of Science, VNUHCM, Ho Chi Minh City,

Vietnam

Abstract. In this paper, we focus on evaluating the robustness of helmet detection


in the context of traffic surveillance, achieved through state-of-the-art deep learn-
ing models. This aims to contribute significantly to motorcycle safety by imple-
menting intelligent systems adept at accurately identifying helmets. An integral
component of this inquiry entails a meticulous benchmark of cutting-edge object
detection models and the integration of advanced techniques, aiming not only to
bolster accuracy but also to improve the overall practicality and effectiveness of
helmet detection systems. The experimental results highlight the effectiveness of
the state-of-the-art object detection methods in detecting helmets and the potential
of transferring from the traffic domain to the construction site domain.

Keywords: Helmet · Object Detection · Benchmarking · Robustness

1 Introduction
In the field of computer vision and intelligent transportation systems, the precise identifi-
cation of safety equipment, particularly helmets, is pivotal for advancing road safety. This
research embarks on a transformative journey to push the boundaries of helmet detection,
harnessing the power of sophisticated deep-learning methodologies. The imperative for
robust and efficient helmet detection becomes particularly pronounced in the domain of
traffic surveillance, where traditional methods often prove inadequate in addressing the
multifaceted challenges posed by real-world scenarios.
As urban landscapes undergo a notable surge in the prevalence of motorcycles and
electric bikes [1], the imperative to ensure the safety of riders has become an increasingly
critical concern in contemporary society. Helmets, recognized as fundamental safety
accessories, play a crucial role in mitigating the risk of head injuries during accidents.
However, the effectiveness of helmets is intricately linked to their proper usage, empha-
sizing the urgent need to develop advanced systems capable of precisely and reliably
identifying the presence of helmets in various scenarios.
This paper introduces a diverse array of innovative approaches to helmet detection,
as depicted in Fig. 1, with a deliberate focus on creating a new dataset and harnessing

the capabilities of different object detection models such as YOLO [2] (You Only Look
Once), Faster RCNN [3], RT-DETR (Real-Time Detection Transformer) [4] and Detectron2 [5]. These real-time object detection algorithms are strategically selected for their
ability to swiftly and efficiently identify objects in dynamic scenarios, rendering them
especially well-suited for applications such as traffic surveillance. The inherent robust-
ness of these models is further emphasized through the incorporation of advanced tech-
niques like Spatial Pyramid Pooling, thereby augmenting their effectiveness in intricate
and varied environments.

Fig. 1. Different examples of helmets under different conditions/viewpoints/clarity: a) 1-rider bike, b) 2-rider bike, c) rear view, d) side view, e) night-time view, and f) blurry view.

Moreover, the research extends its exploration into the domain of ensemble methods,
aiming to fortify the overall robustness and reliability of the helmet detection system.
This involves the integration of multiple models within an ensemble framework, with the
strategic objective of synergizing their individual strengths. By doing so, the system’s
performance is enhanced across a broad spectrum of conditions, solidifying its position
as a comprehensive solution for accurate helmet detection in settings that continuously
evolve and present dynamic challenges.
In response to the escalating prevalence of motorcycles and electric bikes, the inno-
vative approaches presented in this paper not only address the immediate concerns
surrounding helmet detection but also contribute to the broader narrative of rider safety
in urban environments of developing countries. Beyond the field of technical intrica-
cies, the research anticipates and responds to the evolving landscape of transportation,
where intelligent systems are essential components in the quest for enhanced safety and
efficiency.
The significance of this research transcends its technical intricacies and resonates
within the broader domain of intelligent transportation systems. By elevating the pre-
cision of helmet detection in challenging conditions, the outcomes directly align with
the overarching goal of such systems – the reduction of accidents and the improvement
of road safety. Furthermore, envisioning seamless integration, the proposed methods


could easily be incorporated into existing traffic surveillance and planning systems [6,
7], thereby fostering a safer environment for riders, and enhancing the overall efficacy
of traffic management.
This paper, as a comprehensive exploration of helmet detection, meticulously navi-
gates through the intricacies of deep learning models, the integration of novel techniques,
and their practical application in real-world scenarios. Subsequent sections will system-
atically unveil the methodology, experiments, and results, offering a holistic understand-
ing of the pioneering advancements achieved in the field of helmet detection and their
far-reaching implications for the evolution of intelligent transportation systems. The
seamless integration of these advancements into existing frameworks is poised to revo-
lutionize road safety practices and contribute significantly to the broader landscape of
intelligent transportation systems.

2 Related Work

The field of helmet detection has undergone profound transformations driven by the
continuous evolution of computer vision and deep learning techniques. This section
strives to offer a comprehensive review of relevant literature, shedding light on key
contributions in the domain. This exploration not only synthesizes existing knowledge
but also establishes a contextual foundation for the proposed architecture.
A noteworthy aspect of recent research involves the exploration of YOLOv5s in
the realm of object detection tasks. Huang et al. [8] pioneered an advanced YOLOv5s-
based method specifically tailored for electric bike helmet recognition. Their innovative
approach yielded enhanced detection efficiency and practicality, acting as a catalyst
for further investigations in specialized domains. This underscores the adaptability of
YOLOv5s in addressing nuanced challenges within the realm of helmet detection.
Chen et al. [9] embarked on the development of lightweight helmet detection algo-
rithms, a crucial pursuit for ensuring real-time processing in safety applications. Their
work placed significant emphasis on safety helmet-wearing detection in industrial set-
tings, advocating for algorithms that offer swift and accurate recognition. This research
substantially contributes to the intersection of real-time safety applications and com-
puter vision, recognizing the importance of expeditious and precise helmet detection in
critical environments.
In a parallel vein, Fan et al. [10] delved into the application of ensemble methods
in helmet detection. Their deep learning-based ensemble method showcased advance-
ments in minimizing false positives, ensuring a more reliable helmet detection system.
This work not only addresses the challenges of false positives but also makes valuable
strides in enhancing the overall robustness of object detection models. The integration of
ensemble methods adds a layer of complexity and efficacy to helmet detection systems.
The YOLO series has gained popularity for real-time object detection due to its
effective balance between speed and accuracy, but its performance is hindered by the
Non-Maximum Suppression (NMS) step. Transformer-based detectors like DETR offer
an alternative by eliminating NMS but suffer from high computational costs that limit
their practicality. To address these issues, Lv et al. [4] proposed the Real-Time DEtection
TRansformer (RT-DETR), an end-to-end object detector that maintains high speed and
accuracy by employing an efficient hybrid encoder for rapid multi-scale feature process-
ing and an uncertainty-minimal query selection to enhance initial query quality, while
also allowing flexible speed tuning through adjustable decoder layers.
Recent advancements in deep learning have significantly improved image classi-
fication, segmentation, and object detection, including detecting helmets on bike rid-
ers to enhance road safety. Singh et al. [5] analyze various approaches and experi-
ments with state-of-the-art models like Detectron2 and EfficientDet, demonstrating their
effectiveness in helmet detection.
The synthesis of the reviewed literature underscores the diverse approaches employed
in helmet detection, with a particular emphasis on the YOLOv5s architecture, lightweight
algorithms, integration of SPP, utilization of ensemble methods, and the significance
of curated datasets. Building upon these insights, the proposed architecture aspires to
contribute to ongoing advancements in intelligent transportation systems and road safety.
By amalgamating strengths and addressing the limitations highlighted in the literature,
the proposed architecture seeks to elevate the precision, efficiency, and adaptability of
helmet detection in dynamic real-world scenarios. This endeavor aligns with the broader
trajectory of advancements in computer vision and deep learning, fostering a safer and
more intelligent future for transportation systems.

Fig. 2. The user interface of the annotation tool.

3 Proposed Work

In this section, we outline a thorough research methodology for benchmarking a resilient helmet detection system on helmet-related traffic videos. The proposed work includes data collection, data augmentation and preprocessing, model development, training and evaluation, cross-domain adaptation using the Hardhat Construction Dataset [11], iterative refinement, ethical considerations, and documentation. Leveraging pertinent
literature and best practices guides our research at every step.
3.1 Motorcycle Helmet Detection Dataset (MHDD)

Our methodology started with the collection of the Motorcycle Helmet Detection Dataset
(MHDD), a foundational element crucial for developing a robust helmet detection system
capable of adapting to a myriad of diverse environmental conditions and traffic scenarios.
There are a few similar datasets available, such as the Caltech Pedestrian dataset [12] (captured from a driving vehicle) and multiple datasets focusing on bikers' helmets. However, no public dataset is readily available that focuses on motorcyclist helmets from a traffic-camera view. This motivated us to create a new dataset that overcomes these issues and provides a readily available resource for future use.
Compiling our dataset involves a comprehensive sourcing strategy, tapping into var-
ious channels to ensure a rich and diverse representation. We leverage public video
feeds from traffic cameras and surveillance systems in Vietnam from different regions.
This exhaustive selection ensures the inclusion of a broad spectrum of traffic scenarios
and environmental conditions in developing countries, significantly contributing to the
robustness and adaptability of our helmet detection system.
The cornerstone of our methodology lies in rigorous data annotation performed by
trained annotators. This involves the meticulous marking of regions of interest (ROIs)
containing motorcyclists and the precise indication of helmet presence [13]. In our work,
we use Roboflow [14] for annotation. In particular, this tool empowers annotators to
create high-quality annotations efficiently, thereby contributing to the depth and accuracy
of our dataset. The graphical user interface of this tool is illustrated in Fig. 2.

Fig. 3. The flowchart of the computational framework.


3.2 Data Processing and Augmentation

To ensure uniform data representation, videos undergo segmentation into individual


frames or clips, effectively addressing variations in frame rates and mitigating compres-
sion artifacts [15]. In our pursuit of an enriched and diverse dataset, we employ a spec-
trum of data augmentation techniques, encompassing random rotations, flips, brightness
adjustments, and cropping [16].
In the initial stages of data preprocessing, we prioritize image enhancement tech-
niques such as histogram equalization and adaptive contrast enhancement. These meth-
ods are pivotal in refining image quality, particularly for videos captured in challenging
lighting conditions [17]. Subsequently, we implement normalization, wherein pixel values in the images are adjusted to attain zero mean and unit variance, which is instrumental in facilitating model convergence and enhancing the overall stability of the system [18].
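A minimal sketch of this enhancement and normalization step is given below; it assumes an OpenCV/NumPy implementation, and the CLAHE parameters are example values rather than the settings used in our experiments.

```python
# Illustrative sketch of the enhancement and normalization steps described above.
# OpenCV/NumPy are assumed; clipLimit and tileGridSize are example values.
import cv2
import numpy as np

def enhance_and_normalize(bgr_image: np.ndarray) -> np.ndarray:
    # Adaptive contrast enhancement (CLAHE) on the luminance channel.
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)
    # Normalize pixel values to zero mean and unit variance.
    x = enhanced.astype(np.float32)
    return (x - x.mean()) / (x.std() + 1e-8)
```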
As we immerse ourselves in simulating diverse weather conditions during the data
augmentation process, our primary objective is to fortify the robustness of the helmet
detection system. Synthetic scenarios involving rain, fog, and snow are introduced,
effectively replicating adverse weather conditions. Leveraging the Pillow and NumPy libraries, we add three synthetic weather effects: snowfall is simulated by overlaying small white circles (snowflakes) at random positions with some transparency; a foggy appearance is created by randomly placing semi-transparent white ellipses (fog clouds) over the image and then applying a Gaussian blur to soften and blend them; and falling rain is mimicked by drawing vertical white lines (raindrops) of varying lengths and positions onto a transparent overlay.
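The sketch below illustrates how such overlays can be generated with Pillow; the flake, cloud, and drop counts, sizes, and opacities are illustrative assumptions rather than the exact values used to build the dataset.

```python
# Illustrative Pillow-based weather overlays (snow, fog, rain) as described above.
# All counts, sizes, and opacities are example values, not the dataset's settings.
import random
from PIL import Image, ImageDraw, ImageFilter

def _overlay(img: Image.Image) -> tuple:
    overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
    return overlay, ImageDraw.Draw(overlay)

def add_snow(img: Image.Image, flakes: int = 300) -> Image.Image:
    overlay, draw = _overlay(img)
    for _ in range(flakes):  # small semi-transparent white circles
        x, y = random.randint(0, img.width), random.randint(0, img.height)
        r = random.randint(1, 3)
        draw.ellipse((x - r, y - r, x + r, y + r), fill=(255, 255, 255, 180))
    return Image.alpha_composite(img.convert("RGBA"), overlay).convert("RGB")

def add_fog(img: Image.Image, clouds: int = 40) -> Image.Image:
    overlay, draw = _overlay(img)
    for _ in range(clouds):  # semi-transparent white ellipses
        x, y = random.randint(0, img.width), random.randint(0, img.height)
        draw.ellipse((x, y, x + random.randint(60, 200), y + random.randint(30, 100)),
                     fill=(255, 255, 255, 60))
    overlay = overlay.filter(ImageFilter.GaussianBlur(radius=15))  # soften and blend
    return Image.alpha_composite(img.convert("RGBA"), overlay).convert("RGB")

def add_rain(img: Image.Image, drops: int = 200) -> Image.Image:
    overlay, draw = _overlay(img)
    for _ in range(drops):  # vertical white streaks of varying length
        x, y = random.randint(0, img.width), random.randint(0, img.height)
        draw.line((x, y, x, y + random.randint(10, 30)), fill=(255, 255, 255, 120), width=1)
    return Image.alpha_composite(img.convert("RGBA"), overlay).convert("RGB")
```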
Our dataset becomes more resilient by seamlessly integrating these simulated
weather conditions into the training and evaluation phases. This ensures that the helmet
detection system is not only adept at handling real-world challenges posed by dynamic
weather scenarios but also excels in accurately identifying helmets in adverse conditions.
This comprehensive approach to data augmentation not only augments the model’s
adaptability but also significantly contributes to its precision, reliability, and overall
effectiveness in varied environmental conditions.
The overall dataset consists of 8,000 images covering different weather scenarios, i.e., normal conditions, rain, fog, and snow. The dataset is broadly divided into two sections, Colored and Grayscale, each comprising 4,000 images. Furthermore, both sections have four subsections based on conditions: normal conditions, synthetic snow added, synthetic fog added, and synthetic rain added. Each subsection contains 1,000 images, giving a total dataset size of 8,000 images (2 sections x 4 subsections x 1,000 images each).

3.3 Computational Framework

Our envisioned framework for robust helmet detection in traffic videos is strategically
crafted to surmount challenges posed by fluctuating weather conditions, ultimately
ensuring the safety of motorcyclists. This comprehensive framework consists of key
components, each playing a pivotal role in the overall system, as illustrated in Fig. 3.
To initiate the framework, traffic videos captured by surveillance cameras serve as


the primary input source. These videos constitute the foundational data for the helmet
detection system. Frame extraction is the next crucial step, where frames are extracted
from the input traffic videos. Each frame represents a snapshot of the traffic scene, form-
ing the basis for subsequent analysis. Extracted frames undergo essential preprocessing
tasks, including noise reduction, contrast enhancement, and resizing as part of the pre-
processing dataset phase. These tasks aim to ensure the consistency and high quality of
the input data, laying the groundwork for accurate analysis.
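As an illustration of the frame-extraction step, the following sketch uses OpenCV; the sampling stride and output resolution are example choices, not the settings used to build MHDD.

```python
# Minimal OpenCV frame-extraction sketch; stride and output size are examples.
import cv2

def extract_frames(video_path: str, out_dir: str, stride: int = 10,
                   size: tuple = (1280, 720)) -> int:
    """Save every `stride`-th frame of the video, resized to `size`; return the count."""
    cap = cv2.VideoCapture(video_path)
    saved = index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % stride == 0:
            cv2.imwrite(f"{out_dir}/frame_{index:06d}.jpg", cv2.resize(frame, size))
            saved += 1
        index += 1
    cap.release()
    return saved
```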
Moving forward, the data labeling stage is imperative. For supervised training, the
dataset is meticulously annotated, with each helmet within the frames being labeled.
This annotated dataset becomes the bedrock for training the helmet detection model.
Subsequently, the labeled dataset undergoes further refinement to align it with real-world
scenarios. Synthetic weather scenarios, such as rain, fog, and darkness, are introduced to
simulate various environmental conditions, making the dataset more robust and reflective
of diverse challenges.
The object detection model, the final and pivotal element in the framework, is a
deep learning-based model specifically designed to identify helmets within the frames.
Crafted for high accuracy and capable of handling challenging conditions, this model
is finetuned to excel under various weather conditions. The finetuned object detection
model stands as the core component for robust helmet detection, ensuring the system’s
adaptability to dynamic scenarios. The object detector locates and highlights helmets
within the frames, contributing significantly to motorcyclist safety in diverse traffic
scenarios.

Fig. 4. Object Detection results of all models.

3.4 Implementation
We adopt pre-trained deep convolutional neural networks (CNN) that serve as the base
model for feature extraction. Consideration will be given to well-established architec-
tures such as ResNet [19], VGG [20], or Inception [21]. Then, the selected base model
will undergo fine-tuning using our annotated traffic video dataset. Transfer learning tech-
niques will be applied, leveraging knowledge from large-scale datasets like ImageNet to
adapt the model for helmet detection. An object detection head, such as a Region Proposal Network (RPN), will be added to the base model to identify ROIs containing
motorcyclists’ heads. For the helmet detection head, a subnetwork for helmet detection
will be integrated within the ROIs identified in the previous step. Architecture options
include Faster R-CNN [3] or YOLO [2] for object detection.
Regarding the loss function, an appropriate objective combining classification loss and localization loss (e.g., Smooth L1 loss) will be defined for helmet detection.
The model will be trained using annotated data, employing an optimizer such as Adam
[22]. Training progress will be monitored using validation data, and techniques like
learning rate schedules and early stopping will be used for optimization.
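One possible realization of this setup is sketched below with torchvision's Faster R-CNN implementation; a recent torchvision version is assumed, the class count and learning rate are illustrative, and the paper itself does not commit to this specific framework.

```python
# Hedged sketch: fine-tuning a pre-trained Faster R-CNN for helmet detection.
# torchvision >= 0.13 is assumed; classification and Smooth L1 box-regression
# losses are already built into the detection heads.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box predictor: background + helmet (example class layout).
num_classes = 2
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

# One training step: `images` is a list of image tensors, `targets` a list of
# dicts with "boxes" and "labels"; in train mode the model returns a loss dict.
# losses = model(images, targets); sum(losses.values()).backward(); optimizer.step()
```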

4 Benchmarking

In this section, we conduct a comprehensive evaluation of robust helmet detection benchmarking, aiming to scrutinize the performance of multiple models across diverse scenarios. Our objective is to benchmark and compare these models using key metrics, precision, recall, F1-Score, and mean average precision (mAP) [18], to gauge their effectiveness in helmet detection.

Fig. 5. The illustration of different weather conditions (panels: Hardhat, Helmet, Daytime, Nighttime, Rain, Snow, Fog).

4.1 Performance Metrics

We employed the following evaluation metrics. Precision indicates the percentage of correctly detected helmets out of all helmets identified by the model. Recall represents the percentage of actual helmets correctly detected by the model. F1-Score is a balanced measure between precision and recall. Finally, mean average precision (mAP) reflects the average precision scores over the different classes, in this case, helmets.
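As a small worked example (added for illustration), per-detection precision, recall, and F1 can be computed with scikit-learn once detections have been matched to ground truth; the matching step and the mAP computation itself are assumed to be handled by the detection toolkit in use.

```python
# Toy example: precision/recall/F1 over matched detections with scikit-learn.
# The labels below are invented for illustration only.
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 1, 0, 1, 0, 1]   # ground truth: 1 = helmet present, 0 = not
y_pred = [1, 0, 0, 1, 1, 1]   # model decisions for the same detections

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```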
4.2 Model Comparison

Several models designed for helmet detection underwent training and testing. Table 1 and Table 2 summarize their performance. Table 1 streamlines the decision-making process for selecting the most appropriate model for real-world applications, as discussed in more detail in the following sections.

4.3 Visual Assessment


Visual assessment, coupled with quantitative metrics, plays a pivotal role in conducting
a thorough evaluation of the model’s performance. The inclusion of visual examples, as
illustrated in Fig. 4, enriches our understanding of how each model operates in diverse
scenarios, introducing a qualitative dimension to the evaluation process.
Within these visual representations, the models’ effectiveness in identifying hel-
mets under various conditions is vividly highlighted. Each image encapsulates the
model’s responses to real-world challenges, encompassing factors such as varying light-
ing conditions, adverse weather scenarios, and complex traffic scenes. This visual evi-
dence not only validates the quantitative data derived from metrics but also provides a
comprehensive and nuanced perspective on the models’ resilience and adaptability.

Table 1. The experimental results of different models trained on traffic data and tested on traffic
data. Both training and testing data are included in MHDD dataset. The best performance is marked
as boldface.

Sr. No Model mAP


1 Yolov5s (Colored) 0.857
2 Yolov5s (Grayscale) 0.863
3 Yolov7-w6 (Colored) 0.881
4 Yolov7-w6 (Grayscale) 0.899
5 Yolov6s (Colored) 0.809
6 Yolov6s (Grayscale) 0.795
7 Yolov8s (Colored) 0.863
8 Yolov8s (Grayscale) 0.847
9 FasterRCNN-ResNet50 (Colored) 0.870
10 FasterRCNN-ResNet50 (Grayscale) 0.871
11 RT-DETR (Colored) 0.790
12 RT-DETR (Grayscale) 0.770
13 Detectron2 (Colored) 0.705
14 Detectron2 (Grayscale) 0.757

These visual examples act as a qualitative supplement to the quantitative assess-


ment, fostering a deeper comprehension of the models’ strengths and limitations. Beyond
numerical scores and metrics, these images furnish the research community with tangible
proof of the model’s performance in practical, real-world situations. Such visual assess-
ments contribute significantly to a holistic and insightful interpretation of the models’
overall effectiveness and suitability for deployment in dynamic environments.

4.4 Discussions

The comprehensive evaluation has unveiled valuable insights into the models’ perfor-
mance as shown in Table 1. The critical metric of precision highlights the exemplary
capabilities of Models 3 and 10 (different families). Model 4 (Yolov7-w6 Grayscale)
distinguishes itself with an impressive mean Average Precision (mAP) of 0.899, closely
trailed by Model 10 (FasterRCNN-ResNet50 Grayscale) with a commendable mAP of
0.871.
In the realm of Recall, Model 3 demonstrates outstanding performance, surpassing
its counterparts with a mAP of 0.881. Model 10 exhibits competitive recall capabilities,
boasting a mAP of 0.871. Meanwhile, Model 1 (Yolov5s Colored) achieves the highest
F1-Score, with a mAP of 0.857. This metric signifies a harmonious blend of precision
and recall, positioning it as a noteworthy contender adept at balancing these two crucial
aspects of helmet detection.
A pivotal consideration lies in the mAP, where Model 10 (FasterRCNN-ResNet50
Grayscale) outperforms others with a score of 0.871. This underscores the model’s
consistency and effectiveness across a spectrum of conditions, reinforcing its reliability
in diverse helmet detection scenarios. Considering all the results, we can broadly say that the models work better on grayscale images than on colored ones.
These insights empower users to make informed decisions by comprehending the trade-offs between precision, recall, and adaptability. Although RT-DETR and Detectron2 performed worse than all other models, with mAP in the 0.7–0.8 range, the varied strengths of each model, whether one prioritizes precise helmet identification or comprehensive coverage, facilitate customized selections aligned with specific application needs.

Table 2. The experimental results on the test traffic data with different models and training
datasets.

Model Train Dataset mAP


Yolov7-w6 (Colored) HardHat + Traffic 0.904
Yolov7-w6 (Grayscale) HardHat + Traffic 0.892
Yolov7-w6 (Colored) HardHat 0.904
Yolov7-w6 (Grayscale) HardHat 0.865
FasterRCNN-ResNet50 (Colored) HardHat 0.878
FasterRCNN-ResNet50 (Grayscale) HardHat 0.904
4.5 Robustness Evaluation


This section emphasizes the robustness of hardhat helmet detection, achieved by integrating a hardhat dataset with the traffic dataset and training on the combined data with the best-performing models from the previous section (Model 4, Yolov7-w6 Grayscale, and Model 10, FasterRCNN-ResNet50). The evaluations showcase the models' adaptability and performance in real-world scenarios, as reported in Table 2, with an impressive mAP of 0.904. Although both models reach an mAP of 0.904, the preferred model is Model 4 (YOLOv7-w6), as it is faster than Model 10 (FasterRCNN-ResNet50).
Weather Robustness Assessment. Model 4’s adaptability under different weather con-
ditions is tested, showing consistent precision and recall across adverse scenarios as
shown in Fig. 5.
Real-World Implications. The benchmarking results have significant implications for
helmet detection system deployment. Depending on application demands, users can
select a suitable model, considering critical factors such as precision and recall. The
adaptability of models across different weather conditions underscores the necessity
for versatile, robust systems ensuring safety in adverse environments. The benchmark
has significantly advanced our understanding of helmet detection models in diverse
conditions, emphasizing the critical need for adaptability to ensure motorcyclist safety.
Beyond traffic scenarios, the application scope extends to industrial safety and various
domains, showcasing the broader societal impact of intelligent transportation systems.
Our contributions to the fields of computer vision, deep learning, and transportation
safety underscore the importance of tailored model selection to meet specific application
needs, recognizing the nuanced challenges presented in real-world scenarios. Despite
inherent limitations, this paper serves as a foundational step toward future advancements
in helmet detection research, shedding light on the necessity for diverse datasets and
effective domain adaptation strategies.

Fig. 6. Visualization of failure cases: a) wrongly detected helmet, b) wrongly detected helmet and motorcycle, c) dark hair detected as a helmet, and d) a cap detected as a helmet.

5 Conclusion and Future Work


This paper advances helmet detection systems, strategically focusing on optimizing real-time processing capabilities for practical deployment in dynamic traffic scenarios. We address the ongoing challenge of enhancing model performance in adverse weather conditions, which demands dedicated research effort and the exploration of innovative techniques
and technologies. We plan to explore multimodal data fusion, where integrating infor-
mation from various sensors could significantly elevate detection accuracy, particularly
in scenarios with challenging visibility conditions as shown in Fig. 6. The expansion
of benchmark datasets and the exploration of advanced domain adaptation techniques
are pivotal steps toward creating more robust models capable of handling diverse and
complex environments.
For future work, we focus on optimizing helmet detection systems for real-time
processing, facilitating practical deployment in traffic scenarios. Enhancing model per-
formance in adverse weather conditions remains a critical challenge, warranting further
investigation. Multimodal data fusion, encompassing data from various sensors, could
enhance detection accuracy, especially in challenging visibility conditions. Expanding
benchmark datasets, exploring domain adaptation techniques, and broadening the scope
to include anomaly detection for comprehensive traffic safety are avenues for future
exploration.

Acknowledgment. This research was supported by the National Science Foundation (NSF) under
Grant 2025234.

References
1. Huang Ma, C., Yang, D., Zhou, J., Feng, Z., Yuan, Q.: Risk riding behaviors of urban e-bikes:
a literature review. Int. J. Environ. Res. Public Health 16(13), 2308 (2019)
2. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You Only Look Once: Unified, Real-Time
Object Detection. arXiv preprint arXiv:1506.02640 (2016)
3. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems 28 (2015)
4. Lv, W., et al.: DETRs Beat YOLOs on Real-time Object Detection (2023)
5. Singh, R., Shetty, S., Patil, G., Bide, P.J.: Helmet detection using detectron2 and efficient-
det. In: 2021 12th International Conference on Computing Communication and Networking
Technologies (ICCCNT), Kharagpur, India, pp. 1–5 (2021)
6. Nguyen, X.-D., et al.: Adaptive multi-vehicle motion counting. J. Signal Image Video
Processing 16(8), 2193–2201 (2022)
7. Nguyen, T.V., et al.: Data-driven city traffic planning simulation. ISMAR Adjunct, pp. 859–
864 (2022)
8. Huang, B., et al.: An improved YOLOv5s-based helmet recognition method for electric bikes.
Appl. Sci. 13(15), 8759 (2023)
9. Chen, J., Deng, S., Wang, P., Huang, X., Liu, Y.: Lightweight helmet detection algorithm
using an improved YOLOv4. Sensors 23(3), 1256 (2023)
10. Fan, Z., Peng, C., Dai, L., Cao, F., Qi, J., Hua, W.: A deep learning-based ensemble method
for helmet-wearing detection. PeerJ. Computer Sci. 6, e311 (2020)
11. Shen, J., Xiong, X., Li, Y., He, W., Li, P., Zheng, X.: Detecting safety helmet wearing on
construction sites with bounding-box regression and deep transfer learning. Computer-Aided
Civil and Infrastructure Eng. 36(2), 180–196 (2021)
12. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual
object classes (VOC) challenge. Int. J. Comput. Vision 88(2), 303–338 (2010)
13. Zhou, Y., Liu, L., Shao, L., Mellor, M.: Fast automatic vehicle annotation for urban traffic
surveillance. IEEE Trans. Intell. Transp. Syst. 19(6), 1973–1984 (2017)
14. Dwyer, B., Nelson, J., Solawetz, J., et al.: Roboflow (Version 1.0) (2022). https://roboflow.
com.
15. Wang, W., Zhou, T., Porikli, F., Crandall, D., Van Gool, L.: A survey on Deep Learning
Technique for Video Segmentation. arXiv e-prints, arXiv-2107 (2021)
16. Li, J., Wang, D., Li, S., Zhang, M., Song, C., Chen, X.: Deep learning based adaptive sequential
data augmentation technique for the optical network traffic synthesis. Opt. Express 27(13),
18831–18847 (2019)
17. Ren, D., Sun, T., Yu, C., Zhou, C.: Research on safety helmet detection for construction site. In:
2021 International Conference on Computer Information Science and Artificial Intelligence
(CISAI), pp. 186–189. IEEE (2021)
18. Flach, P., Kull, M.: Precision-recall-gain curves: PR analysis done right. Advances in Neural
Information Processing Systems 28 (2015)
19. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 770–778 (2016)
20. Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image
Recognition. arXiv preprint arXiv:1409.1556 (2014)
21. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception archi-
tecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 2818–2826 (2016)
22. Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:
1412.6980 (2014)
MEPC: Multi-level Product Category
Recognition Image Dataset

Thanh Long Nguyen1, Manh Quang Do2(B), and Ba Nghien Nguyen1
1 Faculty of Information Technology, Hanoi University of Industry, Hanoi, Vietnam
[email protected], nguyenbanghien [email protected]
2 Faculty of Interdisciplinary Digital Technology, Phenikaa University, Hanoi, Vietnam
[email protected]

Abstract. Multi-level product category prediction is a key problem for businesses that run online retail systems. Accurate multi-level prediction frees sellers from having to fill in product category information manually, saving time and reducing the cost of listing products online. This is an open research problem that continues to attract researchers. Deep learning techniques have shown promising results for category recognition problems. A neat and clean dataset is an elementary requirement for building accurate and robust deep-learning models for category prediction. This article introduces a new multi-level product image dataset, called MEPC. The MEPC dataset contains more than 164,000 images in a processed, ready-to-use format. We evaluate the MEPC dataset with popular deep learning models; the benchmark results reach a top-1 accuracy of 92.055% with 10 classes and a top-5 accuracy of 57.36% with 1000 classes. The proposed dataset is well suited for training, validation, and testing of hierarchical image classification models, improving multi-level category prediction in online retail systems. Data and code will be released at https://huggingface.co/datasets/sherlockvn/MEPC.

Keywords: Product category prediction · Category prediction · Multi-level classification · Hierarchical image classification

1 Introduction
E-commerce platforms have become more and more popular over the years.
The digital transformation 4.0 further stimulated public interest in e-commerce,
resulting in a boom in e-commerce businesses [1, 2]. As a result, the e-commerce
industry has become more competitive, driving firms to make considerable
expenditures to improve their platforms. In recent years, hierarchical classifi-
cation has emerged as a powerful tool in the online retail industry [3], assisting sellers by quickly auto-filling product category information. With the rapid growth of online products, efficiently classifying these products hierarchically has
become crucial for success in the online retail sector. Applying deep learning
and machine learning techniques to retail data enhances the seller experience on
e-commerce platforms.
Category Prediction (CP), which aims to recognize the intent categories of
given texts, is regarded as one of the most fundamental machine-learning tasks in
an e-commerce system [4]. For example, this predicted category information will
influence product ranking in the search and recommendation system. Different from traditional classification [5, 6], category prediction is formally categorized as a hierarchical classification problem, since categories in most e-commerce websites are organized as a hierarchical tree (we consider the situation in which the categories are organized as a hierarchical tree, not as a directed acyclic graph).
Figure 1 shows a simplified fragment of one category architecture.
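To illustrate this tree-structured label space, the toy sketch below represents a multi-level label as a root-to-leaf path; apart from “Women Fashion” and “Motorbike accessory”, which appear later in the paper, the category names are invented examples rather than entries of the MEPC taxonomy.

```python
# Toy illustration of a 3-level category tree and a multi-level label as a path.
# Most category names here are invented examples, not the MEPC taxonomy.
category_tree = {
    "Women Fashion": {
        "Dresses": ["Maxi dress", "Mini dress"],
        "Shoes": ["Sneakers", "Heels"],
    },
    "Motorbike": {
        "Motorbike accessory": ["Helmet", "Phone holder"],
    },
}

# A multi-level label is a (level-1, level-2, level-3) path from root to leaf.
label = ("Women Fashion", "Dresses", "Maxi dress")
assert label[2] in category_tree[label[0]][label[1]]
```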

Fig. 1. Visualization of the label hierarchy for hierarchical image classification.

To address the category prediction problem for businesses, data has always been one of the key drivers of AI-based category recognition research. Different training data may lead to different training results. Whatever the target, product image datasets are needed to evaluate the performance of the proposed deep-learning models. In this article, we propose a new multi-level product category dataset with more than 164,000 images (see Table 1). We experimented with many different pre-trained models. The results show that this new dataset is appropriate for improving model performance when researching deep learning models that predict multi-level categories.

2 Related Work

Category prediction image classification is a classification problem in which hierarchical information related to the classes is given in addition to the image [7]. We list existing competitions and datasets related to Hierarchical Image Classification (HIC) in Table 2, for example, CIFAR-100, the ETH Entomological Collection (ETHEC), CUB-200-2011 (CUB), FGVC-Aircraft (AIR), Stanford Cars (CAR), and Lego-15, as shown in Fig. 2.
Table 1. Summary of MEPC dataset statistics.

Statistics                       Train     Val
Number of images                 131,292   32,824
Number of 1st level categories   28
Number of 2nd level categories   193
Number of 3rd level categories   659

Fig. 2. The summary of random images from the datasets for hierarchical image clas-
sification.

The first, CIFAR-100 [8], a commonly used benchmark for hierarchical classification, has 20 coarse (super) classes, and each coarse class is associated with five fine classes (e.g., the coarse class “people” has the five fine classes “baby”, “boy”, “girl”, “man”, and “woman”), for a total of 100 fine classes.
Next, the ETH Entomological Collection (ETHEC) dataset [9] comprises images of Lepidoptera specimens together with their taxonomy tree. This real-world dataset has variations in the number of images per category and a significant imbalance in the structure of the taxonomical tree. CUB-200-2011 (CUB) [10] organizes the label hierarchy of birds into 200 species, 122 genera, 37 families, and 13 orders. The CUB dataset contains 11,788 images of 200 bird subcategories, with 5,994 for training and 5,794 for testing. Each image has detailed annotations: 1 subcategory label, 15 part locations, 312 binary attributes, and 1 bounding box.
Fig. 3. The categories-cloud of 1st/2nd/3rd-level keywords of photos. The larger the font size, the more products the corresponding category has.

Table 2. Table description of hierarchical image datasets.

Name Year Public Type Total Classes Levels


CIFAR-100 [8] 2009 yes classification 60,000 100 2
ETHEC [9] 2019 yes classification 47,978 723 4
CUB [10] 2011 yes classification 11,788 200 4
FGVC-Aircraft [11] 2013 yes classification 10,200 100 3
CAR [12] 2013 yes classification 16,185 196 3
Lego-15 [13] 2021 no classification 4,688 15 3
MEPC-10 (our) 2024 yes classification 2,192 10 3
MEPC-1000 (our) 2024 yes classification 164,117 1000 3

The next dataset, FGVC-Aircraft [11], contains 10,200 images of aircraft, with 100 images for each of 102 different aircraft model variants, most of which are airplanes. The main aircraft in each image is annotated with a tight bounding box and a hierarchical airplane model label. Aircraft models are organized in a three-level hierarchy. The Stanford Cars (CAR) dataset [12] consists of 196 classes of cars with a total of 16,185 images, taken from the rear. The data is divided into an almost 50–50 train/test split with 8,144 training images and 8,041 testing images. Categories are typically at the level of make, model, and year. Images in CAR are 360×240. The last dataset, Lego-15 [13], is a Lego image dataset consisting of 3,000 synthetic images and 1,688 real images in 15 classes. For the synthetic images, there are 200 images in each class. For the real images, the number of images per class ranges from 70 to 150.
The above datasets are all used to optimize deep learning models for hierarchical image classification tasks. However, a weakness of these datasets is that they are not well suited for predicting product categories on e-commerce platforms. In this article, we introduce a new dataset named the Multi-level E-commerce Product Categorization image dataset (MEPC). MEPC is a product hierarchy image dataset focused on e-commerce products for hierarchical image classification tasks.
The MEPC dataset has multiple backgrounds to increase the variety of real
images. We also evaluate the impact of the MEPC dataset with the EfficientNet,
ResNet, VGG, and MoBiNet series of models. The benchmark results reach a
top-1 accuracy of 92.055% on MEPC-10 and a top-5 accuracy of 57.36% on
MEPC-1000. We hope that the introduced dataset will assist in fine-tuning
models for image classification in online retail systems.

Fig. 4. MEPC dataset overview: an example of random categories of the MEPC
dataset. It shows a 2nd-level category “Motorbike accessory” with 2 different 3rd-level
categories.

3 Dataset
In this section, we introduce the Multi-level E-commerce Product Categorization
(MEPC) dataset.

Dataset Description. All of the images in our MEPC dataset are collected
using the Internet data collection method introduced in the research article
“End-to-End System For Data Crawling, Monitoring, And Analyzation Of
E-Commerce Websites” at ICTA 2024 [14]. There are nearly 164,000 images
in total. Reflecting practical application scenarios, the distribution of images
over categories is imbalanced, as shown in Fig. 4.
We conducted several statistical analyses to better understand the structure
and characteristics of the data. First, we analyzed the number of categories
at each hierarchy level, as illustrated in Fig. 6. This analysis helps us identify the
distribution of categories and detect any imbalances among the levels. We used
data visualization techniques to highlight this distribution, making it easier to
identify categories with fewer samples.
Most e-commerce websites organize product displays in a hierarchical tree
structure, so counting the items in each category shows which product groups
have the most products. In Fig. 3, we generated a “category cloud” at the
category level: the larger the font size, the more products the corresponding
category has. Based on the figure, it can be seen that the “Women Fashion”
category has many products in the MEPC dataset.
Next, we analyzed the image sizes, presented in Fig. 7, to assess the diversity
in image dimensions within the dataset. This information helps us optimize
the preprocessing steps and ensure that the deep learning models can perform
effectively on this dataset.
Dataset Challenge. Previously, several datasets [8,10,12,13] have been con-
structed for HIC. However, they are relatively plain, i.e., scenarios with real-
world complexity are not well represented in these benchmarks. Compared with
real-scene datasets such as ETHEC [9] and FGVC-Aircraft [11], MEPC has
more diverse products and richer structures. The images in MEPC are much
more challenging in that they are taken from e-commerce websites (Fig. 2) and
exhibit background noise, uneven illumination, diverse image sizes, and compli-
cated parent-child connections (Fig. 5). In the next section, we present the
experimental image classification of this dataset using well-known deep-learning
models.

4 Methods

In this study, we used the deep learning models ResNet50 [15], VGG16 [16],
MoBiNet [17], and EfficientNet [18] to evaluate the new dataset. These models
have been proven to be highly effective in various image recognition and classifi-
cation tasks. We provide a detailed account of how we prepare the data, train
the models, and evaluate the results.
First, we resized the images to 224 × 224 and normalized the image data.
Next, we used the Adam [19] optimizer with a learning rate of 1e-3. When
visualizing a random number of images in each class, as shown in Fig. 8, we
noticed that the MEPC data is imbalanced among the classes. Therefore, we
used the focal loss [20] function to penalize the model whenever it incorrectly
predicted classes with fewer samples. We used k-fold cross-validation and metrics
such as Top-1 accuracy on the MEPC-10 dataset to evaluate performance. The
experimental results are described in Table 3. In addition, we split the MEPC-10
and MEPC-1000 datasets into 80% training and 20% evaluation sets and used
the YOLOv8 model for image classification, as reported in Table 4.
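To make this training recipe concrete, the following minimal PyTorch sketch shows the preprocessing and the focal-loss objective described above; it is an illustration rather than the authors' code, and the focal-loss gamma, the normalization statistics, and the model/loader objects are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import transforms

# Preprocessing as described: resize to 224x224 and normalize.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),  # ImageNet statistics (assumed)
])

class FocalLoss(nn.Module):
    """Multi-class focal loss; gamma = 2.0 is an assumed hyper-parameter."""
    def __init__(self, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits, targets):
        log_probs = F.log_softmax(logits, dim=1)
        ce = F.nll_loss(log_probs, targets, reduction="none")     # per-sample cross-entropy
        pt = log_probs.exp().gather(1, targets.unsqueeze(1)).squeeze(1)
        return ((1.0 - pt) ** self.gamma * ce).mean()             # down-weight easy samples

def train_one_epoch(model, loader, device="cuda"):
    criterion = FocalLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)     # learning rate from the text
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```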
Based on Tables 3 and 4, the MEPC-10 dataset reaches a Top-1 accuracy of
92.055%; therefore, MEPC-10 works well for improving deep learning models or
for evaluating new deep learning architectures. However, the MEPC-1000
dataset reaches only a Top-5 accuracy of 57.36%, so it remains a big challenge
for the hierarchical image classification problem.

Fig. 5. Label-only embeddings visualizing label connections of the MEPC dataset, with
multi-colored labels for level 1, yellow for level 2, gray for level 3, red connections from
level 1 to level 2, and blue connections from level 2 to level 3. (Color figure online)

Fig. 6. Statistics of the number of multi-level categories in the two datasets MEPC-10
and MEPC-1000.

Fig. 7. Statistics of the image sizes of the MEPC dataset. It can be seen that the images
of the dataset are mostly square, with various sizes.

Fig. 8. Random visualization of classes in MEPC dataset. It is easy to see that the
MEPC data is imbalanced.

Table 3. Benchmark with K-fold cross-validation on the MEPC-10 dataset (K = 5).

Name MEPC-10 Top-1 (%) #Params
MoBiNet [17] 79.041% ± 2.978 4.3 M
MoBiNetV2 [21] 72.694% ± 1.766 3.5 M
MoBiNetV3-S [22] 89.224% ± 1.185 2.5 M
MoBiNetV3-L [22] 92.055% ± 0.602 5.4 M
ResNet50 [15] 89.817% ± 0.897 25.6 M
VGG16 [16] 90.457% ± 1.203 138.4 M
EfficientNetB4 [18] 91.37% ± 1.68 19.5 M

Table 4. Experimental results with YOLOv8 Model

Name YOLOv8m (17M params)


Top-1 Accuracy (%) Top-5 Accuracy (%)
MEPC-10 90.87% -
MEPC-1000 34.41% 57.36%

5 Conclusion
We have released the MEPC dataset for HIC tasks, where the images present
numerous challenges such as background noise, uneven lighting, diverse image
sizes, and complex parent-child linkages. This dataset includes a variety of prod-
uct images and focuses on predicting product categories for e-commerce systems.
In our study, we also provided an overview of the dataset, visualized the data
from various perspectives, and tested the dataset with well-known deep learning
models. In the future, we will collect additional semantic descriptions of the
images, from the real world or from large language models, to optimize and
improve the classification performance on this dataset.

References
1. Barat, M.I., Haque, M.M.: Small business boom in ecommerce: an in-depth research
exploration. Int. J. Bus. Manage. Finan. Res. 2, 1–14 (2024)
2. Azam, A., Ansari, A.M.: The emerging role of e-commerce in today’s business: a
conceptual study. Asian J. Manage. Commer. 05, 428–439 (2024)
3. Wei, Y., Tran, S., Xu, S., Kang, B., Springer, M.: Deep learning for retail product
recognition: challenges and techniques. Comput. Intell. Neurosci. 2020(1), 8875910
(2020)
4. Cevahir, A., Murakami, K.: Large-scale multi-class and hierarchical product cat-
egorization for an E-commerce giant. In: Proceedings of COLING 2016, the 26th
International Conference on Computational Linguistics: Technical Papers (Y. Mat-
sumoto and R. Prasad, eds.), (Osaka, Japan), The COLING 2016 Organizing Com-
mittee, pp. 525–535 (2016)

5. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. Proc. IEEE 86, 2278–2324 (1998)
6. Larkey, L.S., Croft, W.B.: Combining classifiers in text categorization. In: Proceed-
ings of the 19th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR ’96, New York, NY, USA, Associa-
tion for Computing Machinery, pp. 289–297 (1996)
7. Zhu, X., Bain, M.: B-CNN: branch convolutional neural network for hierarchical
classification (2017)
8. Krizhevsky, A.: Learning Multiple Layers of Features from Tiny Images. Technical
Report (2009)
9. Dhall, A.: ETH Entomological Collection (ETHEC) dataset [Palearctic Macrolepi-
doptera, Spring 2019] (2019). Associated works: “Learning Representations for Images
with Hierarchical Labels” (https://arxiv.org/abs/2004.00909) and “Hierarchical Image
Classification using Entailment Cone Embeddings” (https://arxiv.org/abs/2004.03459)
10. Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD
Birds-200-2011 Dataset (2011)
11. Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual
classification of aircraft (2013)
12. Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-
grained categorization. In: 2013 IEEE International Conference on Computer
Vision Workshops, pp. 554–561 (2013)
13. He, L., Song, D., Zheng, L.: Hierarchical image classification with a literally toy
dataset (2021)
14. Do, M.Q.: End-to-end system for data crawling, monitoring, and analyzation of
e-commerce websites (2024). https://doi.org/10.1007/978-3-031-80943-9_107
15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition
(2015)
16. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition (2015)
17. Phan, H., Huynh, D., He, Y., Savvides, M., Shen, Z.: MoBiNet: a mobile binary
network for image classification (2019)
18. Tan, M. and Le, Q. V.: EfficientNet: rethinking model scaling for convolutional
neural networks (2020)
19. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2017)
20. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object
detection (2018)
21. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2:
inverted residuals and linear bottlenecks (2019)
22. Howard, A., et al: Searching for mobileNetV3 (2019)
A Simple Approach Towards Frame
Filtering for Efficient Gaussian Splatting

Thien-Phuc Tran1,2(B) , Minh-Quang Nguyen1,2 , and Minh-Triet Tran1,2


1
Faculty of Information Technology, University of Science, VNU-HCM,
Ho Chi Minh City, Vietnam
{ttphuc21,nmquang21}@apcs.fitus.edu.vn, [email protected]
2
Viet Nam National University, Ho Chi Minh City, Vietnam

Abstract. Neural rendering has established itself as the state-of-the-


art approach for scene reconstruction and novel view synthesis (NVS)
tasks. However, its reliance on precise camera poses presents a signif-
icant limitation. Since 2023, Gaussian Splatting (GS) has emerged as
a promising approach for volumetric rendering, gaining traction in the
3D computer vision and graphics community due to its efficiency and
real-time rendering capabilities. While COLMAP-free GS methods have
been proposed to address camera pose dependency, they often struggle
with “useless frames” - frames that do not introduce information gain
about the rendered surface and/or have low resolution - leading to slower
reconstruction and inefficient use of computational resources, potentially
causing out-of-memory issues on mid-tier machines that do not have
extraordinary computational power. To address these challenges, we pro-
pose a frame filtering method for efficient NVS based on COLMAP-free
GS. Our approach enables scene reconstruction under computational
resource constraints while maintaining high rendering quality. Experi-
mental results demonstrate that our method achieves an approximately
30–50% reduction in GPU VRAM usage and a 20–30% decrease in train-
ing time for scene reconstruction, offering a more efficient solution for
NVS tasks.

Keywords: Machine Learning · Computer Vision · 3D Reconstruction

1 Introduction
Gaussian Splatting has become a popular technique for creating highly realis-
tic, real-time renderings of 3D models. Its ability to capture fine details and
produce lifelike visualizations makes it a go-to choice for various applications
in 3D rendering. However, one major hurdle in using Gaussian Splatting is the
initial step of initialization, which usually requires the Structure-from-Motion
(SfM) [5] algorithm to extract the key points of the images at the beginning,
often implemented using a tool like COLMAP [2]. While effective, this method
T.-P. Tran and M.-Q. Nguyen—These authors contributed equally to this work.

can be extremely slow and computationally heavy due to the overhead in the
COLMAP preprocessing period (which includes the extraction of features and
reconstruction of the sparse scene using the key points).
Recognizing this bottleneck, some researchers have been working on ways to
skip the initialization phase, aiming to speed up the process. However, with-
out COLMAP, the problem shifts, especially when processing video at slower
speeds. We believe that slowing down the video adds its own layer of inefficiency,
making the overall rendering process time-consuming once again due to the
duplication of frames with only minor transitions compared to previous frames, which does
not bring major information gain about the scanned surface of the reconstructed
scene and might introduce blurry artifacts into the rendered model.
Multiple teams have proposed frameworks and modern scene compression
techniques to enhance the training process as well as to effectively render the
scene in real time. For example, Niedermayr et al. [18] propose a scene compres-
sion technique based on sensitivity-aware clustering of vectors, quantization-
aware training, and entropy encoding. Kerbl et al. [19] suggest a novel represen-
tation of Gaussian Splatting scenes using a hierarchical representation for
large-scale scenes within massive datasets.
As efficient as the scene compression methods are, the authors reckon that
scene-compression-oriented methods can be hard to implement and deploy in
practice for portable usage of Gaussian Splatting scene files, especially when
specific engineering steps must be carried out to meet very specific requirements
of the rendering engines. Therefore, we would like to accelerate the training
process of the Gaussian Splatting scenes instead, as the training duration and
number of iterations are correlated with the growing number of 3D Gaussians.
To tackle this, we propose a simple yet efficient approach: a frame-skipping
method based on analyzing the blurriness of each frame. By skipping frames that
don’t contribute significantly to the quality of the final render, we can reduce
processing time while still preserving the high quality of the 3D model. This
method strikes a balance between speed and visual fidelity, offering a faster and
more efficient solution for Gaussian Splatting in real-time applications.
The key contributions of this paper can be summarized as follows:
– A frame-evaluation function for scoring the quality and potential contribution
of each frame to Gaussian-Splatting-based reconstructed models, specifically
the COLMAP-free 3D Gaussian Splatting approach.
– An in-depth analysis of the rendering quality, memory usage, and training
time of scenes reconstructed under different settings of the Gaussian Splatting
approaches.
The structure of the paper is as follows: Sect. 2 discusses current approaches
to the novel view synthesis task, Gaussian Splatting, and approaches to optimiz-
ing Gaussian Splatting; Sect. 3 proposes a simple approach to filter frames for
the reconstruction of 3D models in Gaussian Splatting; Sect. 4 describes the
experimental settings for our approach and discusses the results and properties
of the method; Sect. 5 proposes future work on optimizing Gaussian Splatting.

2 Related Works
2.1 Novel View Synthesis
For the Novel View Synthesis task, the objective is to provide a dataset of view-
points for a specific scene and augment this dataset by generating images from
new, unseen angles. Traditionally, techniques such as Image-Based Rendering
(IBR) [16] and Multi-View Stereo (MVS) [17] have been employed for this task.
IBR synthesizes new views by interpolating between existing images, relying on
depth information and pixel correspondences to create smooth transitions. MVS
reconstructs 3D geometry from multiple images to create dense point clouds or
meshes, allowing for rendering from new viewpoints.
Some new methods have been developed to tackle the problem using depth
information. MultiDiff enables consistent novel view synthesis from a single
RGB image using depth information and video diffusion priors; it also utilizes
attention layers, making the quality of the synthesized views more realistic.
Depth-Guided Methods [15] extract depth information from the provided images
to guide the synthesis process, thereby improving accuracy. Gaussian Splatting
and NeRF have been developed as two of the most successful new approaches
for this task.

2.2 Gaussian Splatting


Gaussian Splatting [1] has emerged as a promising technique for neural rendering,
offering several advantages over traditional methods. In contrast to voxel-based
approaches, Gaussian Splatting represents scenes as a set of continuous Gaussian
kernels, allowing for more efficient storage and rendering. Gaussian Splatting
can handle complex scenes with dynamic lighting and materials more effectively
compared to previous state-of-the-art methods such as Neural Radiance Fields
(NeRF) [3].
This technique has been successfully applied in various domains, such as
computer vision, robotics, and virtual reality in recent years. However, to be able
to use the method for the reconstruction of 3D scenes and models, the method
needs to initialize by extracting a set of key points using the SfM algorithm,
traditionally using the COLMAP framework for the extraction process. The
points are then used as the initial centers for 3D Gaussian primitives to grow
and clone in order to densify the scene.
However, Yang et al. [14] suggest that COLMAP may increase the risk of
failed reconstruction and the time required for preprocessing the images, as
COLMAP is sensitive to feature extraction errors and repetitive texture regions.
Therefore, the authors propose a method that abandons the COLMAP
preprocessing phase and reconstructs the scene in the sequential order of frames,
allowing the Gaussian primitives to grow in accordance with the new surface
information provided by each frame.

2.3 COLMAP-Free Gaussian Splatting


COLMAP has been a widely used method for generating point clouds of a scene
through Structure from Motion (SfM). Over time, various works have improved

COLMAP, such as optimizing precise fine-scale registration with point cloud
data (PCD) [14].
Previous Gaussian Splatting methods relied heavily on COLMAP for accurate
camera pose estimation and scene reconstruction, using an initial set of scene
keypoints extracted by the Structure-from-Motion algorithm. By sampling from
the video context to leverage the geometric representation of the input video
stream, Yang et al. claim that it is possible to achieve novel view synthesis
without relying on COLMAP: they abandon the COLMAP camera pose extrac-
tion process and learn the camera poses and their transformations during the
training of the 3D Gaussians (under the assumption that the transformations
between consecutive pairs of camera poses are affine). With the assistance of a
monocular depth estimation paradigm (using Dense Prediction Transformers),
the method can obtain a fair estimate of the depth of the Gaussians, thus
enhancing the accuracy of pose estimation and the rendering quality of the scene.
The COLMAP-free approach of Yang et al. allows for real-time adaptation
as the scene is progressively constructed from the video input, making it better
suited for tasks requiring immediate feedback, such as robotics, virtual reality,
and interactive systems.

2.4 Image Blurriness Evaluation

The Sobel operator [4] and Tenengrad [13] are widely used techniques in image
processing for assessing image sharpness and detecting blur. The Sobel operator
is an edge detection algorithm that computes the image intensity gradient at each
pixel, highlighting areas of rapid intensity change. At its core, the Sobel operator
performs convolution with two 3 × 3 kernels, which are specifically designed to
approximate derivatives in the horizontal and vertical directions. These convo-
lutional filters are crucial for detecting edge information by amplifying inten-
sity changes in the image, making the Sobel operator a fundamental tool in
convolutional-based image processing techniques.
Tenengrad, based on the Sobel operator, is a focus measure that quantifies
image sharpness by summing the squared magnitudes of these gradients across
the entire image. A higher Tenengrad value generally indicates a sharper image
with more defined edges, while a lower value suggests a blurrier image. These
methods are particularly effective because they are sensitive to high-frequency
content in images, which tends to be reduced in blurry or out-of-focus pho-
tographs. These techniques enable objective comparisons of image quality and
can be used in various applications, from autofocus systems in cameras to quality
control in image processing pipelines (Fig. 1).
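For reference, the Sobel-based Tenengrad score described here can be computed in a few lines with OpenCV and NumPy; this is a generic sketch using the standard 3×3 Sobel kernels, not code taken from the paper.

```python
import cv2
import numpy as np

def tenengrad(image_bgr: np.ndarray) -> float:
    """Mean squared Sobel gradient magnitude; higher values indicate a sharper image."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float64)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)  # horizontal derivative
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)  # vertical derivative
    return float(np.mean(gx ** 2 + gy ** 2))
```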

3 Methods
3.1 Preliminaries: Gaussian Primitives

Fig. 1. Overview of the original Gaussian Splatting pipeline. [1] Gaussian Splatting
consists of three main components: an initialisation component to create the set of
points from the Structure-from-Motion algorithm at the beginning, a projection and
adaptive density control component to optimise the Gaussian Splats and minimise the
photometric loss, and a differentiable rasteriser to rasterise and render the projections
of 3D Gaussians for computing the loss and optimise it using gradient descent.

For each scene, the Gaussian Splatting model [1] perceives it as a set of 3D Gaus-
sian primitives, an explicit representation of the 3D scene. Each 3D Gaussian is
characterized by a set of parameters including:

Parameter                               Symbol            Description
Center position                         μ ∈ R³            The 3D coordinates of the Gaussian's center.
Spherical harmonics coefficients (SH)   c                 Represents the color of the Gaussian.
Rotation quaternion                     r                 Specifies the rotation of the Gaussian.
Scale factor                            s                 Determines the size of the Gaussian.
Opacity                                 α                 Defines the transparency level of the Gaussian.
Covariance matrix                       Σ = R S Sᵀ Rᵀ     Describes the Gaussian's ellipsoid shape through its covariance.
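To illustrate this parametrisation, the sketch below models a single Gaussian primitive and derives its covariance Σ = R S Sᵀ Rᵀ from the rotation quaternion and scale; the field names are our own, and practical implementations store these parameters as batched GPU tensors rather than per-primitive objects.

```python
import numpy as np
from dataclasses import dataclass

def quat_to_rotmat(q: np.ndarray) -> np.ndarray:
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

@dataclass
class Gaussian3D:
    mu: np.ndarray      # center position, shape (3,)
    sh: np.ndarray      # spherical-harmonics color coefficients
    rot: np.ndarray     # rotation quaternion (w, x, y, z)
    scale: np.ndarray   # per-axis scale factors, shape (3,)
    opacity: float      # alpha in [0, 1]

    def covariance(self) -> np.ndarray:
        """Sigma = R S S^T R^T, as in the parameter table above."""
        R = quat_to_rotmat(self.rot)
        S = np.diag(self.scale)
        return R @ S @ S.T @ R.T
```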

For optimization purposes, there must exist a differentiable rendering mechanism
for the Gaussian primitives. The rendering from a given camera pose W is
approximated based on the 2D projection of the Gaussians along the depth
dimension. The 2D covariance matrix Σ^{2D} can be computed as

Σ^{2D} = J W Σ Wᵀ Jᵀ    (1)

where J is the Jacobian of the affine transformation of the projection plane.


The color and opacity of each pixel are calculated from all Gaussians that overlap
and thus influence the pixel, using alpha-blending of the N points sorted by depth:

C_pixel = Σ_{i=1}^{N} c_i α_i ∏_{j=1}^{i-1} (1 − α_j)    (2)
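Eq. (2) is front-to-back alpha compositing over the depth-sorted Gaussians covering a pixel; a minimal NumPy sketch of the rule (ignoring the tile-based CUDA rasteriser used in practice) is shown below.

```python
import numpy as np

def composite_pixel(colors: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """Front-to-back alpha blending of N depth-sorted Gaussians at one pixel (Eq. 2).

    colors: (N, 3) per-Gaussian colors at this pixel, sorted front to back.
    alphas: (N,) per-Gaussian opacities after 2D projection.
    """
    pixel = np.zeros(3)
    transmittance = 1.0                 # running product of (1 - alpha_j)
    for c, a in zip(colors, alphas):
        pixel += c * a * transmittance
        transmittance *= (1.0 - a)
    return pixel
```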

The reconstruction process is optimised based on the ground-truth poses, which
define the desired projections, to which we try to fit the set of Gaussian points
by learning the parameters using the photometric loss. The ground-truth poses,
in the following context, are the estimated poses for the current set of
Gaussians (Fig. 2).

3.2 Preliminaries: COLMAP-Free Gaussian Splatting

Fig. 2. Overview of the original CF-3DGS method [14]. The method takes the sequence
of images of the scene as input to learn the 3D Gaussians of the scene and the
camera poses of the frames. Unlike the original Gaussian Splatting method, CF-3DGS
optimises two separate sets of Gaussians - local and global Gaussians - in order to
learn the assumed-to-be-affine transformations of the camera poses.

The scene optimization process proposed by Yang et al. removes the initialization
phase that uses the SfM points of the scene obtained from COLMAP and replaces
it with only the camera intrinsics and the camera poses. For each frame I_t at
step t, a set of points is initialized using a monocular depth estimation network
(in this method, the Dense Prediction Transformer is employed) to estimate the
monocular depth D_t. Then, the model learns to optimize the Gaussians for this
specific frame by minimizing a linear combination of the photometric loss and
D-SSIM:

L = (1 − λ) L_1 + λ L_{D-SSIM}    (3)

For our experiments, we maintain settings similar to those of the original
CF-3DGS paper, which proposes λ = 0.2.
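A minimal PyTorch sketch of the training objective in Eq. (3) is given below; the ssim_fn argument is a placeholder for any differentiable SSIM implementation, and the convention D-SSIM = (1 − SSIM) / 2 is our assumption.

```python
import torch

def reconstruction_loss(rendered: torch.Tensor, target: torch.Tensor,
                        ssim_fn, lam: float = 0.2) -> torch.Tensor:
    """L = (1 - lambda) * L1 + lambda * L_D-SSIM with lambda = 0.2 (Eq. 3).

    ssim_fn is assumed to return an SSIM value in [0, 1];
    D-SSIM is taken here as (1 - SSIM) / 2.
    """
    l1 = torch.abs(rendered - target).mean()
    d_ssim = (1.0 - ssim_fn(rendered, target)) / 2.0
    return (1.0 - lam) * l1 + lam * d_ssim
```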
In order to learn the transformation of the Gaussians under different camera
poses, the pre-trained 3D Gaussians learn an affine transformation from frame
I_t to frame I_{t+1}. The transformation is learned by optimizing the photometric
loss between the rendered transformed Gaussians and the next frame I_{t+1}. This
is the overall learning pipeline of the local Gaussians, in contrast to the global
Gaussians discussed below.
The global Gaussians can be optimized using the relations between each pair of
images: since the transformations are assumed to be affine, we can infer the
transformation between the poses of the first frame and any frame I_t in the
sequence. However, Yang et al. claim that the set of global Gaussians for the
whole scene may be influenced massively by noisy transitions between frames
in the sequence. Therefore, among the global Gaussians, the primitives that
expand and fill the missing sections of the scene are optimized and densified
based on the local Gaussians at each time step. In other words, the Gaussians
in the global set with large view-space position gradients are chosen for cloning
and moved towards the previously unobserved neighboring regions of the current
global Gaussians, and their parameters are updated based on the current frames
and the local Gaussians trained earlier, for a more subtle learning process of
the transformations.
Overall, the method constructs two sets of 3D Gaussians using learnable
camera pose transformations of the scene, with the global set of Gaussians being
adapted and incremented throughout the sequence of frames in the video, one
frame at a time. This implies that the computational resources and training
duration for the scene are proportional to the number of frames recorded for
each scene: more frames in the recording means heavier resource requirements
and longer training time, making the method waste computational resources on
frames that do not contribute significantly to the output scene (frames that are
duplicated or blurred, with no or very little gain in surface information about
the scene). Thus, we propose a frame filtering framework to adaptively compress
and reduce the length of the sequence by removing insignificant frames,
mitigating the waste of computational resources and the waiting time for the
rendering of the scene.

Fig. 3. Our method calculates the Tenengrad value of the image at each viewpoint
of each scene. The sequence of images then forms a Tenengrad distribution from
which we sample.

3.3 Frame-Quality Scoring and Filtering

In order to reduce the burden of the initial datasets on the process of training
the Gaussian primitives, we propose a simple method (Fig. 3) that has proven
effective in eliminating frames that do not drastically boost the resolution of the
rendered model, thus saving memory on the computational resources and
reducing the training time. First, considering the m × n source frame A, we
compute the Sobel gradients of the image w.r.t. the x and y directions:

         ⎡ +1  0  −1 ⎤                ⎡ +1  +2  +1 ⎤
  G_x =  ⎢ +2  0  −2 ⎥ ∗ A,    G_y =  ⎢  0   0   0 ⎥ ∗ A    (4)
         ⎣ +1  0  −1 ⎦                ⎣ −1  −2  −1 ⎦
Then, we can calculate the Tenengrad value of each image, which is essentially
the mean of the squared Sobel gradient magnitude over all pixels. This value
serves as the basis for the scoring and sampling process before reconstructing
the scene:

Tenengrad(A) = (1 / (mn)) Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} ( G_x²(i, j) + G_y²(i, j) )    (5)

We assume that the images of a single sequence of a scene form a Tenengrad
distribution, thus turning the sampling of the scene into random sampling from
that distribution. As mentioned before, a high Tenengrad value indicates a
high-resolution image with fewer blurry areas, so we assign each image a weight
based on its Tenengrad value normalized with respect to the scene distribution
T_Scene(μ; σ) as follows:

w(A) = (Tenengrad(A) − μ) / σ    (6)

Using these weights, we randomly sample a proportion p of the images from the
original image set of the scene for training, and the model is trained exclusively
on this subset. We decided to sample randomly from the image set instead of
directly removing the lowest-quality images from the training set, to reduce the
likelihood of a catastrophic disruption of the temporal continuity of the sequence,
which could leave a large section of the recorded scene improperly reconstructed.
For our experiments, we choose p = 80%, i.e., 80% of the frames of each scene
are retained by the filtering step at the beginning.
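The scoring-and-sampling step can be sketched as follows; converting the normalised scores w(A) into sampling probabilities with a softmax is our assumption, since the text only requires that sharper frames be drawn with higher probability.

```python
import numpy as np

def sample_frames(scores: np.ndarray, p: float = 0.8, seed: int = 0) -> np.ndarray:
    """Weighted random sampling of a proportion p of frames (Sect. 3.3).

    scores: per-frame Tenengrad values for one scene.
    Returns the sorted indices of the retained frames.
    """
    rng = np.random.default_rng(seed)
    w = (scores - scores.mean()) / (scores.std() + 1e-8)   # normalised scores, Eq. (6)
    probs = np.exp(w) / np.exp(w).sum()                    # softmax -> sampling probabilities (assumed)
    k = int(round(p * len(scores)))
    keep = rng.choice(len(scores), size=k, replace=False, p=probs)
    return np.sort(keep)                                   # preserve temporal order
```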

4 Experiments and Results


4.1 Dataset

Table 1. Novel view synthesis, camera pose estimation, and resource usage. The Church
scene encountered out-of-memory issues when training at the 200% setting, so its
results are redacted in the table. The settings with the lowest memory usage and the
shortest training duration are highlighted in bold.

Scene Size Novel view synthesis Camera pose estimation Resource utilisation Status
SSIM ↑ PSNR ↑ LPIPS ↓ RPE trans ↓ RPE rot ↓ ATE ↓ VRAM (GB) ↓ Time ↓
Church 80% 0.939 31.889 0.083 0.011 0.022 0.002 9.86 3 h 51 m
100% 0.93 30.23 0.11 0.008 0.018 0.002 13.63 5 h 28 m
200% [RED] [RED] [RED] [RED] [RED] [RED] [RED] [RED] OOM
Francis 80% 0.936 34.827 0.125 0.061 0.198 0.006 7.64 1 h 00 m
100% 0.91 32.72 0.14 0.029 0.154 0.006 13.45 2 h 44 m
200% 0.961 37.309 0.095 0.010 0.062 0.003 11.34 3 h 47 m
Family 80% 0.957 33.819 0.061 0.102 0.08 0.004 8.22 0 h 44 m
100% 0.94 31.27 0.07 0.022 0.024 0.002 17.71 1 h 39 m
200% 0.972 36.522 0.042 0.007 0.016 0.002 16.60 5 h 12 m
Ignatius 80% 0.79 25.849 0.139 0.053 0.041 0.005 7.64 0 h 43 m
100% 0.9 28.43 0.09 0.033 0.032 0.005 16.55 1 h 47 m
200% 0.944 32.52 0.061 0.01 0.018 0.003 12.17 3 h 10 m

We perform benchmark testing on the Tanks and Temples dataset [6], including
tests on novel view synthesis quality, pose estimation accuracy, and training
time and resource usage. From the dataset, we have chosen to train on 4 scenes:
Francis, Family, Church, and Ignatius (Fig. 4).
To create the interpolated sequences, we follow a data preprocessing protocol
similar to that of the original CF-3DGS settings, with an extra frame interpolation
procedure to mimic the effect of blurry and/or slow recordings of the scenes:
1. From the original sequence of frames of each scene, we conduct the frame
filtering process described in Sect. 3.3 to sample 80% of the frames of the
original sequence.
2. We interpolate by duplicating every frame of the video and average-pooling
consecutive frames in the sequence to impose a slow-down effect, which brings
the number of frames to 200% of that of the original sequence (a sketch of this
step follows the list).
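The sketch below illustrates the interpolation step in item 2, under our reading that an averaged frame is inserted between every consecutive pair of frames; the authors' exact pooling scheme may differ.

```python
import numpy as np

def slow_down_200(frames: list[np.ndarray]) -> list[np.ndarray]:
    """Roughly double the frame count by inserting the average of each consecutive pair."""
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        out.append(a)
        # Average-pool the consecutive pair to mimic a blurry, slowed-down recording.
        out.append(((a.astype(np.float32) + b.astype(np.float32)) / 2).astype(a.dtype))
    out.append(frames[-1])
    return out  # 2n - 1 frames, i.e. approximately 200% of the original length
```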

4.2 Settings
We keep most of our training settings the same as the original training settings of
CF-3DGS: the initial learning rate is 10⁻⁵ and gradually decays to 10⁻⁶ until
convergence. All of our experiments are conducted on a single NVIDIA RTX
A5000 GPU with 24 GB of VRAM.

4.3 Metrics
For novel view synthesis, we employ standard metrics for evaluating the ren-
dering quality of models, including PSNR, SSIM [9], and LPIPS [10], similar to
traditional settings of SfM-free NeRF tasks. For camera pose estimation, we cal-
culate the difference between the ground-truth camera trajectories of the scene
and the estimated one of the models using ATE and RPE.

4.4 Results
Novel View Synthesis. The overall quality of rendered models is indicated
in Table 1. The sampling method that we proposed is capable of producing
the models with high rendered quality in certain scenes (with three out of four
scenes getting no less than 30 PSNR), however, the resolutions of the models are
inconsistent. This indicates that choosing a high-quality subset of images from
each sequence does help with getting high-quality rendering of 3D Gaussians in
real time with much less training time compared to the original results in most
cases, as we are going to show later on.

Camera Pose Estimation. Compared to the original CF-3DGS method, our
method fell slightly short of outperforming it, though the results are very close
to those of the original settings. We hypothesize that the small drop in the
camera pose estimation results is due to the loss of information in some scenes
of the sequence: removed frames may contain important visual information for
guiding camera pose estimation, and removing them from training means the
learned transformations between consecutive frames miss that key information
as well.

Fig. 4. Rendering results of the CF-3DGS reconstructed scene (right) and the ground
truth for the novel view synthesis task (left), as well as the pose estimation results of
the camera poses of the scene.

Resources Usage and Training Duration. The overall resource usage and
training time of the model are reported in Table 1. On average, the VRAM usage
of the pipeline when performing our sampling method is reduced by 30% to 50%,
and the training time is reduced by 20% to 30%. This shows that the trade-off
between the rendering results and the resource utilization can help with fitting
the scenes under computational constraints, especially for low-tier and mid-tier
machines to produce photorealistic reconstruction of the scenes.

5 Conclusion

In this paper, we work on optimizing the training process of COLMAP-free


3D Gaussian Splatting by random sampling from the weight distribution of the
scenes using the Tenengrad value of the frames. We demonstrate that lower train-
ing time and memory usage of the computational machines can be achieved with
the proposed sampling method and a small trade-off with the rendering quality
of the models. The pipeline, however, still needs enhancements in the frame-
sampling mechanism to be able to detect and retain the significant frames whose
visual information improves the robustness and resolution of the rendered
Gaussians, and to further reduce the memory usage of the pipeline. The mechanism
for determining the significance of each frame to the reconstruction of the scene,
in other words, how much of the quality and resolution of the output scene is
attributable to each frame, needs further analysis and a deeper understanding
to enable a more efficient and robust frame-skipping mechanism.

Acknowledgements. This research is supported by research funding from the Faculty


of Information Technology, University of Science, Vietnam National University - Ho
Chi Minh City.

References
1. Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian Splatting for
real-time radiance field rendering. ACM Trans. Graph. 42(4) (2023)
2. Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), Las Vegas (2016)
3. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng,
R.: NeRF: representing scenes as neural radiance fields for view synthesis. In:
Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol.
12346, pp. 405–421. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_24
4. Sobel, I., Feldman, G.: An isotropic 3x3 image gradient operator. In: Stanford
Artificial Intelligence Project (1968)
5. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: exploring photo collections in
3D. ACM Trans. Graph. 25(3), 835 –846 (2006)
6. Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V.: Tanks and Temples: benchmarking
large-scale scene reconstruction. ACM Trans. Graph. 36(4), 1–3 (2017)
7. Lin, C.H., Ma, W.C., Torralba, A., Lucey, S.: BARF: bundle-adjusting neural
radiance fields. In: Proceedings of IEEE International Conference on Computer
Vision (ICCV), Online (2021)
8. Bian, W., Wang, Z., Li, K., Bian, J.W., Prisacariu, V.A.: Nope-Nerf: optimising
neural radiance field with no pose prior. In: Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition (CVPR), Vancouver (2023)
9. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment:
from error visibility to structural similarity. IEEE Trans. Image Process. 13(4),
600–612 (2004)
10. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effec-
tiveness of deep features as a perceptual metric. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake
(2018)
11. Verbin, D., Hedman, P., Mildenhall, B., Zickler, T., Barron, J.T., Srinivasan, P.P.:
Ref-NeRF: structured view-dependent appearance for neural radiance fields. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR), New Orleans (2022)
12. Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction.
In: Proceedings of IEEE International Conference on Computer Vision (ICCV),
Online (2021)

13. Schlag, J.F., Sanderson, A.C., Neuman, C.P., Wimberly, F.C.: Implementation
of Automatic Focusing Algorithms for a Computer Vision System with Camera
Control. Carnegie-Mellon University (1983)
14. Fu, Y., Liu, S., Kulkarni, A., Kautz, J., Efros, A.A., Wang, X.: COLMAP-free 3D
gaussian splatting. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (2024). arXiv:2312.07504 [cs.CV]. Available at:
https://arxiv.org/abs/2312.07504
15. Hou, Y., Solin, A., Kannala, J.: Novel view synthesis via depth-guided skip con-
nections. In: Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision (WACV), pp. 1892–1901 (2021)
16. Wang, Q., et al.: IBRNet: learning multi-view image-based rendering. CoRR,
abs/2102.13090, 2021. Available at: https://arxiv.org/abs/2102.13090
17. Poggi, M., Conti, A., Mattoccia, S.: Multi-view guided multi-view stereo (2022).
arXiv:2210.11467 [cs.CV]. Available at: https://arxiv.org/abs/2210.11467
18. Niedermayr, S., Stumpfegger, J., Westermann, R: Compressed 3D gaussian splat-
ting for accelerated novel view synthesis. In: Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition (CVPR), Seattle (2024)
19. Kerbl, B., Meuleman, A., Kopanas, G., Wimmer, M., Lanvin, A., Drettakis, G.:
A hierarchical 3D gaussian representation for real-time rendering of very large
datasets. ACM Trans. Graph. 43(4), 1–15 (2024)
Enhancing Unsupervised Person
Re-identification with Multi-view Image
Representation

Anh D. Nguyen(B) , Dang H. Pham , Duy B. Vu ,


and Hoa N. Nguyen(B)

VNU University of Engineering and Technology, Hanoi, Vietnam


{ducanh.ng,dangph,duyvb,hoa.nguyen}@vnu.edu.vn

Abstract. Person Re-Identification (ReID) is a critical research domain


that uses cross-camera surveillance footage to identify pedestrians for
social security and education. Initial ReID methods relied heavily on
supervised deep network models, but the large data and manual anno-
tation requirements have highlighted the need for more scalable solu-
tions. This paper explores unsupervised domain adaptation and fully
unsupervised learning methods for ReID to address these issues. Clustering
algorithms, memory banks, and contrastive loss functions make them robust,
but pseudo-label noise, together with weak image feature representations,
makes it difficult to learn accurate features. To overcome
these limitations, we introduce a novel Multi-View Embedding Model that
captures multiple views of an image to improve feature robustness and
discrimination. We also propose a diversity loss function to aid multi-
view representation learning. Our method improves unsupervised person
ReID performance on Market-1501 and MSMT17, according to extensive
experiments.

Keywords: Person Re-Identification · Unsupervised Learning ·


Unsupervised Domain Adaptation · Modified ResNet

1 Introduction

Researchers have long focused on Person Re-Identification (ReID), improving


user experiences in social security and education. Person ReID uses multiple
cross-camera images to identify a pedestrian. Supervised ReID methods using
deep network models were initially studied. The massive amount of data and
rising manual annotation time have become major issues as these methods are
widely used in real-world applications. Researchers have focused on unsupervised
ReID methods, which use unlabeled data for training, to address these issues.
Unsupervised person ReID methods fall into two categories: unsupervised
domain adaptation (UDA) [19,20] and fully (or purely) unsupervised learning
(USL) [1,3,10,14]. UDA ReID methods use a fully annotated source domain and
an unannotated target domain connected through transfer learning. However,
the dependence on the source domain can reduce model performance. Knowledge
transferred from the source domain affects target domain effectiveness, and data

distribution discrepancies can hinder knowledge transfer and model performance.


USL ReID methods are more adaptable and scalable because they train on unla-
beled datasets without external dependencies in real-world environments.
Modern USL ReID frameworks often combine modular components like clus-
tering algorithms [5, 25], memory banks [11, 14], and contrastive loss functions
with network models [1, 10]. This framework has become popular due to its
performance. USL ReID typically involves (1) generating pseudo-labels using a
clustering algorithm, (2) computing contrastive loss for query instances using
positive and negative memory bank samples, and (3) updating the cluster rep-
resentation vectors for subsequent iterations. Each iteration teaches the network
robust and discriminative features. Traditional methods only use pooling opera-
tions [7] on the last feature map to represent images in the feature space. This
approach can capture the most salient image features, but it may overlook
fine-grained local information needed for object recognition and neglect important
local regions. This contributes to the ineffectiveness of the clustering algorithms:
they sometimes label samples incorrectly, assigning the same pseudo-label to
multiple identities or multiple pseudo-labels to a single identity in the unlabeled
dataset. Because cluster representation vectors are often the average centroid or
the hardest instance in a mini-batch, this can hinder learning. Thus, memory
bank centroids may not accurately represent the clusters.
Our main contributions to this paper are as follows: (i) we present a new
Multi-View Embedding Model that, first, represents images in various perspec-
tives. These multi-view representations will then be combined to create the final
image representation; (ii) propose a diversity loss that can assist the model in
learning image representations from multiple perspectives; and (iii) conduct com-
prehensive tests on two renowned benchmarks, Market-1501 and MSMT17, to
assess and validate the superior performance of our suggested approach.
The remainder of this paper is organized as follows. In Sect. 2, we
describe our main problem and introduce some related work in USL. Section 3
presents our proposal. In Sect. 4, we show our experiments, evaluate our proposed
work, and provide some comparisons. Finally, Sect. 5 gives the conclusions and
future works.

2 Literature Reviews
2.1 Purely Unsupervised Learning for Person ReID
Unsupervised Learning for Person ReID (USL ReID) is challenging because it
uses unlabeled data for training, but it is flexible and better suited to real-world
deployment environments. Traditional methods use metric learning for person
retrieval. Many USL ReID methods have emerged due to the removal of perfor-
mance bottlenecks from clustering algorithms [16, 25]. Using pseudo-labels from
clustering algorithms or similarity estimation, these methods train models on
unlabeled data like labeled ones [1, 3, 10]. Examples include SpCL [10], a self-
paced contrastive learning framework using instance-level memory, and Cluster
Contrast [4], which addresses inconsistent updates of memory class centroids.
ICE [1] applies camera-aware, hard-sample mining, and soft-label concepts to

contrastive learning. Some methods have ignored noise in pseudo labels generated
by unsupervised clustering algorithms. Thus, our research focuses on creating a
new USL approach to overcome unsupervised clustering algorithm limitations.

2.2 Image Representations for Person ReID

Usually, global-level matching methods compare image embeddings. Current
methods use the ResNet architecture [8,12] to encode images into a feature
vector space by pooling the last feature map obtained after feeding the images
through the backbone. Then, the received vector is considered the global
representation of the image in the feature space, and they use variations of the
cross-entropy loss
[18] or batch-hard triplet loss [22] to optimize them. Local alignment between
image regions is important for handling person images with multiple specifics or
objects, but these methods often overlook it. To address this, CAMERA [21] uses
multiple global feature embeddings to represent images, stacked with dilated con-
volutional layers. Many recent studies have used attention mechanisms [6] and
multi-granularity approaches [3, 19] to achieve top-tier performance. Although
supervised part-based methods have been successful, unsupervised person ReID
has seen fewer attempts to apply part features. Unlike the methods mentioned
above, our work does not split the feature map into sub-features for optimization.
We propose a new method for improving image representations based on object
matching [15].

3 Proposed Method
3.1 Approach Direction
To solve the challenges described in previous sections, we approach the solution
with the following ideas: (i) modifying the ResNet-based backbone to extract
more information from the image, and (ii) constructing unsupervised learning
architecture integrated with the contrastive loss with the memory bank and the
diversity loss to enhance image representations. Thus, our method is illustrated
in Fig. 1(A), which can be described in detail as follows.
Let D = {x_i}_{i=1}^N denote an unlabeled dataset, where each x_i represents the
i-th image and N is the total number of images. The goal of the USL ReID
task is to train an image encoder E in an unsupervised manner, producing
ReID features F = {f_i}_{i=1}^N. During inference, these ReID features are used
for identity retrieval. Typically, the training process of clustering-based USL
methods alternates between two stages:
Stage I: Clustering. At the beginning of each epoch, training samples are
clustered using the DBSCAN algorithm. The cluster ID y_i ∈ C, where C is
the set of cluster IDs, serves as a one-hot pseudo label for network optimization.
Based on the clustering results, a cluster-based memory bank M = {m_i}_{i=1}^C is
initialized using the cluster centroids, where m_i = (1/|C_i|) Σ_{f_j ∈ C_i} f_j, with f_j
the feature of the j-th sample in cluster C_i and |C_i| the cluster size.
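Stage I can be sketched as follows; the Euclidean metric and eps radius are our assumptions (clustering-based ReID pipelines often cluster a re-ranked Jaccard distance instead), while min_samples = 4 follows the training settings in Sect. 4.1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def init_memory(features: np.ndarray, eps: float = 0.6, min_samples: int = 4):
    """Stage I: DBSCAN pseudo-labels and a centroid-initialised memory bank.

    features: L2-normalised ReID features, shape (N, D).
    Returns (labels, memory) with memory[i] = mean feature of cluster i.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="euclidean").fit_predict(features)
    centroids = []
    for c in sorted(set(labels)):
        if c == -1:                       # DBSCAN outliers receive no pseudo-label
            continue
        centroids.append(features[labels == c].mean(axis=0))   # m_i = cluster mean
    memory = np.stack(centroids) if centroids else np.empty((0, features.shape[1]))
    return labels, memory
```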

Fig. 1. Overview of our proposed method.

Stage II: Network Training. Once the pseudo labels are obtained, the network
undergoes optimization in a manner akin to supervised learning. The training
objective employed is ClusterNCE [4], which is defined as follows:

L = − log [ exp(Sim(f, m⁺) / τ) / Σ_{j=1}^{C} exp(Sim(f, m_j) / τ) ]    (1)

In this equation, m⁺ denotes the centroid of the cluster to which the feature
vector f belongs, while m_j represents the j-th centroid within the memory
bank. The function Sim(u, v) calculates the cosine similarity between vectors u
and v, and τ is the temperature parameter that controls the sharpness of the
distribution.
The memory bank, which stores the cluster centroids, is updated in a
momentum-based manner, similar to the approaches used in previous works
such as Momentum Contrast [11] and HHCL [14]. The normal update rule for
the memory bank is given by:

m_i ← β m_i + (1 − β) f    (2)

In this context, the momentum coefficient β determines how the new feature
vector f affects the current centroid m_i. The instance of the i-th cluster in the
current mini-batch is represented by the feature vector f. With the help
of this momentum update technique, centroids are continuously improved over
time, incorporating fresh data while preserving stability. The holistic distribution
might not be captured by Eq. 2, which uses either the hardest sample or the aver-
age centroid. In order to tackle this, we employ Dynamic Clustering Contrastive
Learning (DyCL) [13], where memory momentum updates are given dynamic
weights. With this strategy, the model can fully use reliable data in the global
context. We give similar instances of each query instance appropriate weights,
with tougher examples having larger weights, in accordance with Triplet Loss’
hard sample mining technique [17]. The sample weights are determined using a
softmax function, which highlights the significance of hard cases and keeps the

Fig. 2. A person image is represented as multi-views, and the response between the
different views of images is considered together to determine the match.

model from converging to a local optimum:

w_ij^dy = exp(−m_i · f_j / τ_w) / Σ_{t=1}^{N_i} exp(−m_i · f_t / τ_w)    (3)

where τ_w is the temperature coefficient hyper-parameter that affects the propor-
tion of the weights given to hard instances, N_i is the number of instances of the
i-th class in a mini-batch, and f_t is the t-th instance feature among the N_i. Note
that the weights sum to one: Σ_{j=1}^{N_i} w_ij^dy = 1. Thus, the i-th dynamic cluster
centroid is the weighted mean in the mini-batch:

m̂_i = Σ_{j=1}^{N_i} w_ij^dy f_j   and   m_i ← γ m_i + (1 − γ) m̂_i    (4)

where γ is the hyper-parameter of momentum updating and the adjustable τ_w
hyper-parameter is used to balance the global and local information.
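A sketch of the dynamic momentum update of Eqs. 3-4 for a single cluster is given below, assuming the centroid bank and the mini-batch features are plain (L2-normalised) tensors; τ_w = 0.3 and γ = 0.02 follow the settings reported in Sect. 4.1.

```python
import torch

def dynamic_update(memory: torch.Tensor, feats: torch.Tensor, cluster_id: int,
                   tau_w: float = 0.3, gamma: float = 0.02) -> None:
    """DyCL-style momentum update of one cluster centroid (Eqs. 3-4).

    memory: (C, D) bank of centroids; feats: (N_i, D) features belonging to
    `cluster_id` in the current mini-batch. Harder (less similar) samples
    receive larger weights. In practice this runs under torch.no_grad().
    """
    m_i = memory[cluster_id]
    w = torch.softmax(-(feats @ m_i) / tau_w, dim=0)        # Eq. (3)
    m_hat = (w.unsqueeze(1) * feats).sum(dim=0)             # weighted mean, Eq. (4)
    memory[cluster_id] = gamma * m_i + (1.0 - gamma) * m_hat
```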
By employing this two-stage process of clustering and network training, the
model iteratively improves its ability to distinguish between different identities
in an unsupervised manner. The use of pseudo labels derived from clustering
allows the network to learn discriminative features without the need for man-
ually annotated data, making it a powerful approach for unsupervised person
ReID. On the other hand, our approach also adheres to the iterative clustering
and network training framework. However, as depicted in Fig. 1, our method
diverges from previous works in the image representation. Instead of relying
on traditional global features, we employ multi-view representations to provide
more detailed and discriminative information about individuals for feature learn-
ing and clustering (see Sect. 3.2).

3.2 Multi-view Image Representation

As highlighted in the related works section, global representations capture the


most prominent features of an image but often miss the finer details in local

regions, which are crucial for object recognition. Detailed features of human
body parts, which might be overlooked by pooling operations, are essential for
identifying individuals and generating pseudo labels. Additionally, most meth-
ods treat images as individual feature embeddings, overlooking the fact that
they can be described from multiple views, especially images that may contain
diverse information. As shown in Fig. 2, an image of a person can be viewed from
different perspectives, each focusing on different parts of the image and poten-
tially overlapping regions. Different views emphasize various aspects. Therefore,
multi-view embedding offers more comprehensive semantic information, enabling
the model to better adapt to various semantic contexts.
We propose a novel architecture called the Multi-View Image Representation
(MVIR), illustrated in Fig. 1(B). Given an image I, we extract partial features
by horizontally dividing the feature map into K uniformly partitioned regions
and applying GMP layers. This results in a set of K partial features
P = stack({f_p^1, ..., f_p^K} | f_p^i ∈ R^D) ∈ R^{K×D}. Instead of using these features
directly for representation, we construct a multi-view representation. Specifically,
we create m learnable view codes (c_1, ..., c_m) as queries for attention, where
c_i ∈ R^D. We then calculate m view attentions (A_1, ..., A_m), where
A_i = exp(P · c_i) ∈ R^K is a weight vector corresponding to the K partial features.
Next, we obtain m diverse view features, considered as hidden states (H_1, ..., H_m).
Each H_i = Σ_{j=1}^{K} A_{ij} · f_p^j ∈ R^D is computed as the weighted sum of P under
the i-th view attention A_i. Finally, we concatenate the m diverse view features to
form the final multi-view representation of an image: f = Concat([H_1, H_2, ..., H_m]).

For each image, we use the multi-view feature vector f as its representation.
During training, this representation is used to calculate the loss for optimization,
and during evaluation, it is used for matching. In our setting, we set m = 3 to
represent the top-mid-bottom views corresponding to three different parts of a
person image.
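The module below is a minimal PyTorch sketch of MVIR with K = 4 parts and m = 3 view codes; normalising the view attention with a softmax is our assumption (the text writes A_i = exp(P · c_i)), and the attention weights are returned alongside the representation so they can feed the diversity loss of Sect. 3.3.

```python
import torch
import torch.nn as nn

class MVIR(nn.Module):
    """Multi-View Image Representation (sketch of Fig. 1(B))."""
    def __init__(self, dim: int, k_parts: int = 4, m_views: int = 3):
        super().__init__()
        self.k_parts = k_parts
        self.view_codes = nn.Parameter(torch.randn(m_views, dim) * 0.02)  # (m, D)

    def forward(self, feat_map: torch.Tensor):
        # feat_map: (B, D, H, W) backbone output.
        parts = feat_map.chunk(self.k_parts, dim=2)                   # K horizontal stripes
        P = torch.stack([p.amax(dim=(2, 3)) for p in parts], dim=1)   # GMP -> (B, K, D)
        # View attention: one weight per part for each view code (softmax-normalised).
        A = torch.softmax(torch.einsum("bkd,md->bmk", P, self.view_codes), dim=-1)
        H_views = torch.einsum("bmk,bkd->bmd", A, P)                  # weighted sums, (B, m, D)
        return H_views.flatten(1), A                                  # concat -> (B, m*D), plus attention
```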

3.3 Multi-view Diversity Loss


To obtain a representation that encompasses comprehensive information and
enhances the diversity of multi-view attention, we ensure that the m view codes
concentrate on distinct aspects. Inspired by the works of Lee et al. [15], we employ
a diversity loss to amplify the differences among the multi-view attention weights,
enabling the multiple view codes to produce varied highlights of an image.
In particular, we calculate the similarity between the view attentions by
multiplying A_v with its transpose, where A_v = [a_v^1, ..., a_v^m]. We then subtract
an identity matrix from this product to measure diversity. The diversity increases
as the non-diagonal values of the similarity matrix decrease. Consequently, we
define the diversity loss as follows:

L_div = ‖ A_v A_vᵀ − I ‖_F²    (5)

Here, I represents an m-dimensional identity matrix, which serves to eliminate
the self-correspondences of the m view attentions along the diagonal of the
similarity matrix, and ‖ · ‖_F denotes the Frobenius norm of a matrix.
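Eq. 5 translates directly into PyTorch; averaging the loss over the batch is our assumption.

```python
import torch

def diversity_loss(attn: torch.Tensor) -> torch.Tensor:
    """Eq. (5): || A_v A_v^T - I ||_F^2, averaged over the batch.

    attn: (B, m, K) view-attention weights, e.g. the A returned by the MVIR sketch above.
    """
    m = attn.size(1)
    gram = torch.bmm(attn, attn.transpose(1, 2))             # (B, m, m) view-similarity matrix
    eye = torch.eye(m, device=attn.device)                   # removes self-correspondences
    return ((gram - eye) ** 2).sum(dim=(1, 2)).mean()        # squared Frobenius norm per image
```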

3.4 Objective Function

Finally, the objective with ClusterNCE can be considered a non-parametric
classifier, where the centroids stored in the memory bank serve as the weight
matrix of the classification layer. Therefore, with the diversity loss in Eq. 5, the
training objective of our method can be rewritten as follows:

L_objective = (1/N) Σ_{i=1}^{N} ce(M f_i, y_i) + λ L_div    (6)

where ce refers to the cross-entropy loss and λ is a control parameter for the
diversity loss.

4 Experiments and Evaluation


4.1 Implementation Details

Datasets and Evaluation Metrics. We conduct experiments on two well-
known benchmarks in person ReID: Market-1501 [26] contains 32,668 photos
taken by 6 distinct cameras and 1,501 IDs. 12,936 photos representing 751 indi-
viduals make up the training set. The query set and gallery set, which have a
total of 19,732 photos, contain 3,368 and 750 identities, respectively, in the test
set. MSMT17 [24] comprises a total of 126,441 pictures of 4,101 people taken
using 15 cameras. It is partitioned into 93,820 test photos with 3,060 identities
and 32,621 training images with 1,041 identities.
Hereafter, we use the term “Market” to denote “Market-1501” and “Msmt” to denote “MSMT17”. We evaluate performance using two key metrics: mean Average Precision (mAP) and the Cumulative Matching Characteristic (CMC) at Rank-1/5/10. We do not apply re-ranking in the retrieval evaluation, and when comparing models we prioritize results with higher mAP and Rank-1 accuracy.
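To make the metrics concrete, a toy computation from a query-gallery distance matrix is sketched below; it omits the camera-based filtering that the standard Market-1501/MSMT17 protocol applies, so it is an illustration rather than the exact evaluation code.

import numpy as np

def rank_k_and_map(dist, q_ids, g_ids, ks=(1, 5, 10)):
    """Toy CMC Rank-k / mAP computation.

    dist: (Q, G) query-to-gallery distances; q_ids: (Q,) query identities;
    g_ids: (G,) gallery identities.
    """
    order = np.argsort(dist, axis=1)                 # ascending distance
    matches = (g_ids[order] == q_ids[:, None])       # (Q, G) boolean hit matrix
    cmc = {k: float(np.mean(matches[:, :k].any(axis=1))) for k in ks}
    aps = []
    for row in matches:
        hits = np.where(row)[0]
        if hits.size == 0:
            continue
        precision = np.arange(1, hits.size + 1) / (hits + 1)
        aps.append(precision.mean())
    return cmc, float(np.mean(aps))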
Person ReID Model Training Settings. For training, we use data augmentation similar to previous works [9, 10, 19]. We use the Adam optimizer with a learning rate of 3.5 × 10^−4 and train the model for 70 epochs. We utilize DBSCAN to produce pseudo-labels at the start of every epoch, with the minimal number of neighbors set to 4. γ and τ_w in Eq. 3 and Eq. 4 are set to 0.02 and 0.3, respectively, following [13]. Finally, each mini-batch is randomly sampled with 16 identities and 4 images per identity, giving a batch size of 64 in both stages.
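A minimal sketch of the per-epoch pseudo-label step is shown below, assuming scikit-learn's DBSCAN on L2-normalized features; the eps value and the cosine metric are our assumptions (the pipeline in [13] typically clusters a re-ranked Jaccard distance), while min_samples = 4 mirrors the setting above.

import numpy as np
from sklearn.cluster import DBSCAN

def generate_pseudo_labels(features: np.ndarray, eps: float = 0.6, min_samples: int = 4):
    """Per-epoch pseudo-label generation with DBSCAN (sketch).

    features: (N, D) L2-normalized image features extracted by the backbone.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(features)
    # DBSCAN marks outliers with -1; they are typically dropped from training.
    keep = labels >= 0
    return labels, keep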

4.2 Results

Baseline Performance. To analyze the effectiveness of our method, we conduct extensive experiments on Market-1501 and MSMT17. Firstly, we train baseline models with the traditional ResNet-50 backbone described in Sect. 3.1, using only the contrastive loss (Eq. 1). As presented in Table 1, our baseline model reaches

Table 1. The performance of the baseline model with and without the proposals (%) on both datasets

Model                     Market (mAP / R1 / R5 / R10)   Msmt (mAP / R1 / R5 / R10)
Baseline                  74.8 / 88.2 / 93.2 / 96.6      15.4 / 36.5 / 50.6 / 60.5
+ MVIR                    75.4 / 89.1 / 95.4 / 97.1      17.5 / 38.2 / 52.2 / 63.5
+ MVIR + Diversity Loss   76.8 / 91.0 / 96.8 / 97.5      19.8 / 43.4 / 55.4 / 65.6

Fig. 3. Evaluation on Market with different K in MVIR while λ = 0.

74.8% and 88.2% in mAP and Rank-1 (R1), respectively, on the Market benchmark, while it obtains 15.4% and 36.5% in mAP and R1 on the Msmt benchmark.
Effectiveness of MVIR and Diversity Loss. To evaluate the efficacy of our proposed method, we conducted a series of independent experiments by incrementally adding the proposed components during training. Firstly, we examined the effect of our proposed image representation. To determine the optimal number K of local parts used to generate views, we varied K from 1 to 6, where K = 1 indicates that the model does not utilize MVIR. As illustrated in Fig. 3, the best Rank-1 performance is achieved when K is set to 4. Table 1 also demonstrates that the accuracy of the models increases by approximately 1% to 2% across all metrics on both datasets. Subsequently, we incorporated the diversity loss into the training process and fine-tuned the λ parameter. As shown in Fig. 4, the best Rank-1 performance is obtained when λ is set to 1, while further increasing λ results in a significant drop in performance. Table 1 also indicates that this loss slightly improves the model's performance in both Rank-1 and mAP on both datasets.

4.3 Comparison With SOTA Methods


Our experimental results, as displayed in Table 2, demonstrate that our method achieves superior performance across multiple evaluation metrics when compared to several well-known methods. For the Market dataset, our method attains an mAP of 76.8% and a Rank-1 accuracy of 91.0%, which surpasses BUC, MMCL, GCL, and IICS. Specifically, our method exceeds BUC and MMCL, which achieve 38.3% and 45.5% mAP, respectively. Similarly, we outperform GCL and SpCL, which both use contrastive losses and the same baseline architecture but focus on label refinement and on enhancing the contrastive loss functions. Specifically,

Fig. 4. Evaluation on Market with varying λ values, with K = 4 horizontally split parts.

Table 2. Comparison of SOTA unsupervised learning methods for person ReID (%).
Bold denotes the best while Underline indicates the second best.

Method       Venue     Market (mAP / R1 / R5 / R10)   MSMT (mAP / R1 / R5 / R10)
BUC [16]     AAAI      38.3 / 66.2 / 79.6 / 84.5      − / − / − / −
MMCL [23]    CVPR      45.5 / 80.3 / 89.4 / 92.3      11.2 / 35.4 / 44.8 / 49.8
MMT [9]      ICLR      74.3 / 88.1 / 96.0 / 97.5      − / − / − / −
GCL [2]      CVPR      66.8 / 87.3 / 93.5 / 95.5      21.3 / 45.7 / 58.6 / 64.5
SpCL [10]    NeurIPS   73.1 / 90.8 / 96.3 / 97.5      19.1 / 42.3 / 56.5 / 68.4
Our method   −         76.8 / 91.0 / 96.8 / 97.5      19.8 / 43.4 / 55.4 / 65.6

we surpass GCL by nearly 10% in mAP and reach a Rank-1 accuracy of 91.0%, slightly above SpCL's 90.8%. When evaluating the MSMT dataset, our performance is higher than MMCL's and comparable to SpCL's results. In general, these results highlight the robustness of our approach across well-known datasets and underscore its potential as a new image representation for person ReID tasks.

5 Conclusions
In this study, we introduce a new way to represent images in the feature space, called Multi-View Image Representation (MVIR), tailored for unsupervised person ReID. Our approach leverages both global and local image contexts to enhance the discriminativeness of representations within the feature space by utilizing the relationships between part features. We also propose a diversity loss that encourages the view features to explore different information, which contributes to the effectiveness of MVIR. Our experiments on two challenging benchmarks, Market-1501 and MSMT17, show that the proposals bring potential enhancements to traditional methods that use the ResNet architecture as the backbone.
Our proposed method still has room for improvement. The current method must synthesize local information to create new image representations. In addition, the unsupervised clustering used for pseudo-label generation assumes a low percentage of inaccurate labels; as shown on MSMT17, using raw pseudo-labels from this process leads to lower performance on large and difficult datasets. Thus, in future work we will focus on optimizing the backbone architecture and the pseudo-label generation.

Acknowledgement. This work is supported by VNU University of Engineering and Technology under grant number CN23.14.

References
1. Chen, H., Lagadec, B., Bremond, F.: Ice: inter-instance contrastive encoding for
unsupervised person re-identification. In: IEEE/CVF International Conference
on Computer Vision (ICCV), pp. 14940–14949 (2021). https://doi.org/10.1109/
ICCV48922.2021.01469
2. Chen, H., Wang, Y., Lagadec, B., Dantcheva, A., Bremond, F.: Joint generative
and contrastive learning for unsupervised person re-identification. In: IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2004–2013
(2021). https://doi.org/10.1109/CVPR46437.2021.00204
3. Cho, Y., Kim, W.J., Hong, S., Yoon, S.E.: Part-based pseudo label refinement
for unsupervised person re-identification. In: IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 7298–7308 (2022). https://doi.org/
10.1109/CVPR52688.2022.00716
4. Dai, Z., Wang, G., Yuan, W., Zhu, S., Tan, P.: Cluster contrast for unsupervised
person re-identification. In: ACCV: 16th Asian Conference on Computer Vision,
pp. 319–337 (2022). https://doi.org/10.1007/978-3-031-26351-4_20
5. Deng, D.: Dbscan clustering algorithm based on density. In: 2020 7th International
Forum on Electrical Engineering and Automation (IFEEA), pp. 949–953 (2020).
https://doi.org/10.1109/IFEEA51475.2020.00199
6. Ding, J., Zhou, X.: Learning feature fusion for unsupervised domain adaptive per-
son re-identification. In: 2022 26th International Conference on Pattern Recog-
nition (ICPR), pp. 2613–2619 (2022). https://doi.org/10.1109/ICPR56361.2022.
9956264
7. Du, H.P., Nguyen, A.D., Nguyen, D.T., Nguyen, H.N.: μPEWFace: parallel ensemble
of weighted deep convolutional neural networks with novel loss functions for face-
based authentication. Image Vis. Comput. 139(104819) (2023). https://doi.org/
10.1016/j.imavis.2023.104819
8. Du, H.P., Nguyen, A.D., Nguyen, D.T., Nguyen, H.N., Nguyen, D.: A novel deep
ensemble learning to enhance user authentication in autonomous vehicles. IEEE
Trans. Autom. Sci. Eng. 21(3), 2362–2373 (2024). https://doi.org/10.1109/TASE.
2023.3270764
9. Ge, Y., Chen, D., Li, H.: Mutual mean-teaching: pseudo label refinery for unsuper-
vised domain adaptation on person re-identification. In: International Conference
on Learning Representations (2020)
10. Ge, Y., Zhu, F., Chen, D., Zhao, R., Li, H.: Self-paced contrastive learning with
hybrid memory for domain adaptive object re-id. In: Proceedings of the 34th Inter-
national Conference on Neural Information Processing Systems (NIPS) (2020).
https://doi.org/10.5555/3495724.3496673
11. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised
visual representation learning. In: IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 9726–9735 (2020). https://doi.org/10.1109/
CVPR42600.2020.00975
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90

13. He, Z., Xue, M., Du, Y., Zhao, Z., Su, F.: Dynamic clustering and cluster con-
trastive learning for unsupervised person re-id with feature distribution align-
ment. In: IEEE International Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), pp. 3610–3614 (2024). https://doi.org/10.1109/ICASSP48485.
2024.10447711
14. Hu, Z., Zhu, C., He, G.: Hard-sample guided hybrid contrast learning for unsu-
pervised person re-identification. In: 2021 7th IEEE International Conference on
Network Intelligence and Digital Content (IC-NIDC), pp. 91–95 (2021). https://
doi.org/10.1109/IC-NIDC54101.2021.9660560
15. Lee, K.H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-
text matching. In: ECCV 2018: 15th European Conference on Computer Vision,
pp. 212–228 (2018). https://doi.org/10.1007/978-3-030-01225-0_13
16. Lin, Y., Dong, X., Zheng, L., Yan, Y., Yang, Y.: A bottom-up clustering approach
to unsupervised person re-identification. In: Proceedings of the Thirty-Third AAAI
Conference on Artificial Intelligence. AAAI’19. AAAI Press (2019). https://doi.
org/10.1609/aaai.v33i01.33018738
17. Ming, Z., Chazalon, J., Luqman, M.M., Visani, M., Burie, J.C.: Simple triplet
loss based on intra/inter-class metric learning for face verification. In: 2017 IEEE
International Conference on Computer Vision Workshops (ICCVW), pp. 1656–
1664 (2017). https://doi.org/10.1109/ICCVW.2017.194
18. Nguyen, A.D., Nguyen, D.T., Dao, H.N., Le, H.H., Tran, N.Q.: Impact analysis of
different effective loss functions by using deep convolutional neural network for face
recognition. In: From Born-Physical to Born-Virtual: Augmenting Intelligence in
Digital Libraries, pp. 101–111 (2022). https://doi.org/10.1007/978-3-031-21756-
2_8
19. Nguyen, A.D., Pham, D.H., Nguyen, H.N.: GAN-based data augmentation
and pseudo-label refinement for unsupervised domain adaptation person re-
identification. In: Computational Collective Intelligence, pp. 591–605 (2023).
https://doi.org/10.1007/978-3-031-41456-5_45
20. Pham, D.H., Nguyen, A.D., Nguyen, H.N.: GAN-based data augmentation and
pseudo-label refinement with holistic features for unsupervised domain adaptation
person re-identification. Knowl.-Based Syst. 288, 111471 (2024). https://doi.org/
10.1016/j.knosys.2024.111471
21. Qu, L., Liu, M., Cao, D., Nie, L., Tian, Q.: Context-aware multi-view summariza-
tion network for image-text matching. In: Proceedings of the 28th ACM Interna-
tional Conference on Multimedia. MM ’20, New York, NY, USA, pp. 1047–1055.
Association for Computing Machinery (2020). https://doi.org/10.1145/3394171.
3413961
22. Si, T., Zhang, Z., Liu, S.: Compact triplet loss for person re-identification in cam-
era sensor networks. Ad Hoc Netw. 95, 101984 (2019). https://doi.org/10.1016/j.
adhoc.2019.101984
23. Wang, D., Zhang, S.: Unsupervised person re-identification via multi-label clas-
sification. In: IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (CVPR), pp. 10978–10987 (2020). https://doi.org/10.1109/CVPR42600.2020.
01099
24. Wei, L., Zhang, S., Gao, W., Tian, Q.: Person transfer GAN to bridge domain
gap for person re-identification. In: IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 79–88 (2018). https://doi.org/10.1109/CVPR.2018.
00016

25. Zeng, K., Ning, M., Wang, Y., Guo, Y.: Hierarchical clustering with hard-batch
triplet loss for person re-identification. In: IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 13654–13662 (2020). https://doi.org/
10.1109/CVPR42600.2020.01367
26. Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., Tian, Q.: Scalable person re-
identification: a benchmark. In: 2015 IEEE International Conference on Computer
Vision (ICCV), pp. 1116–1124 (2015). https://doi.org/10.1109/ICCV.2015.133
Boosting Image Super-Resolution:
Incorporating Locally-Enhanced FFN
and Data Augmentation in the Swin
Transformer Architecture

Phong Hai Tran1,2 and Ngoc-Thao Nguyen1,2(B)


1 Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
[email protected]
2 Vietnam National University, Ho Chi Minh City, Vietnam

Abstract. Image super-resolution aims to enhance the resolution of


low-quality images by generating high-resolution counterparts, signifi-
cantly benefiting various practical applications, such as high-quality tele-
vision, gaming, and medical imaging. Our work stems from the success
of SwinIR, a state-of-the-art model that leverages Swin Transformers
and window-based self-attention mechanisms to effectively model dis-
tant dependencies. Firstly, we employ the CutBlur augmentation method
to increase both the size and diversity of the training data. This tech-
nique cuts and pastes random regions between low- and high-resolution
images, forcing the model to tackle both degraded and detailed areas
simultaneously. Secondly, we replace the Swin Transformer layer in the
original model with a Locally-Enhanced Feed-Forward Network (LEFF)
layer. This modification improves the model’s ability to capture local
context by incorporating a depth-wise convolutional block within the
feed-forward network. Experimental results demonstrate that our pro-
posed approach consistently outperforms several baselines across vari-
ous benchmarks. Notably, on the Set5 and Set14 datasets, our model
achieves PSNR values of 38.37 dB and 34.17 dB, respectively, surpass-
ing the baseline SwinIR. On the BSD100 and Manga109 datasets, our model achieves PSNRs of 32.45 dB and 39.61 dB, respectively, maintaining superior performance.
Although the performance on the more challenging Urban100 dataset is
slightly lower, it remains comparable to the baseline, leaving room for
further improvement.

Keywords: image super-resolution · image restoration · swin


transformer

1 Introduction

Image super-resolution (SR) improves the resolution of low-resolution (LR)


images by generating high-resolution (HR) counterparts that reconstruct the
missing high-frequency details [12]. This technology has been widely applied

across diverse real-world domains, such as medical imaging, remote surveillance,


and digital photography, where preserving high-quality image details is essential.
Various approaches have been developed to address the image super-
resolution problem, ranging from traditional techniques [18, 20] to deep learning-
based models [3, 8, 12]. One of the earliest traditional methods is Bicubic
Interpolation [18], which estimates unknown pixel values by taking a weighted
average of neighboring pixels using cubic polynomials. Another traditional app-
roach is Example-based super-resolution [20], which relies on a dictionary of
HR and LR image patch pairs to reconstruct high-resolution details. While this
method performs better than simple interpolation, its effectiveness is limited by
the size and diversity of the dictionary.
The advancement of deep learning models has significantly improved the
quality of super-resolution. One of the pioneering efforts to apply deep learning
to SR tasks was the Super-resolution convolutional neural network (SRCNN) [8],
which achieved far superior results compared to traditional methods. Another
notable approach is Non-local Sparse Attention (NLSA) [15], which proposes cap-
turing long-range relationships by leveraging the Non-local attention technique,
ensuring computational tractability. The Swin transformer for image restoration
(SwinIR) [12] is a recently developed approach based on the Swin Transformer
architecture. SwinIR employs a hierarchical structure with shifted windows, suc-
cessfully capturing both local and global dependencies by dividing the image into
non-overlapping windows. This method has demonstrated strong performance on
benchmarks like Set5, Set14, and BSD100 [2, 13, 22].
Inspired by the state-of-the-art performance of SwinIR [12], we identified
areas for further improvement. Thus, we introduce two key enhancements to the
baseline SwinIR model:

1. CutBlur Data Augmentation: We incorporate CutBlur [1] for data aug-


mentation, which mixes high-quality and low-quality resolution patches dur-
ing training. This approach exposes the model to diverse regions within a
single image, encouraging it to learn more robust and adaptable feature rep-
resentations.
2. LeFF Integration: We integrate the Locally-enhanced feed-forward network
(LeFF) layer [19] into the Swin transformer layers, enhancing the feed-forward
network’s ability to capture fine-grained local information.

Experimental results demonstrate that our proposed model outperforms the


baseline across several datasets [2, 10, 13, 14, 22], particularly on those with simple
and medium complexity such as Set5, Set14 and BSD100 [2, 13, 22].

2 Related Work
Data Augmentation. Augmentation strategies for SR can be classified into
two types based on where they are applied: pixel-domain and feature-domain
techniques [1]. Pixel-domain augmentations, such as CutBlur [1], CutMix [21],
or Cutout [7], operate directly on the raw image data. Meanwhile, feature-domain

augmentations, such as AutoAugment [5], or Mixup [23], modify the intermediate


feature representations learned by the models. Recently, CutBlur, a pixel-domain
augmentation technique, diversifies the training datasets by cutting patches from
LR images and pasting them into their corresponding HR patches, helping the
model learn to super-resolve effectively across varying resolutions. This confirms
that CutBlur is a promising technique for data augmentation in SR tasks, as it
enables the model to handle both high- and low-quality information, leading to
significant increases in the performance of the SR task [1].
Feed-Forward Networks. A recent key refinement in FFN architecture is
the Locally-enhanced Feed-Forward Network (LeFF), which addresses the chal-
lenge of standard Transformer architectures in capturing local context [19]. LeFF
is proposed to integrate a 3 × 3 depth-wise convolutional block into the FFN,
enabling the model to gather local context by processing adjacent pixels within
feature maps. Nevertheless, the model’s overall efficiency is potentially impacted
by adding convolutional operations within the FFN as this approach can increase
the number of parameters and computational cost [17].
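As a concrete illustration, a LeFF-style block might look like the sketch below; the hidden-dimension ratio, the GELU activations, and the exact layer ordering are assumptions based on the description in [19], not the authors' code.

import torch
import torch.nn as nn

class LeFF(nn.Module):
    """Locally-enhanced Feed-Forward block (sketch): linear expansion ->
    3x3 depth-wise convolution over the spatial grid -> linear projection."""

    def __init__(self, dim: int = 180, hidden_ratio: int = 4):
        super().__init__()
        hidden = dim * hidden_ratio
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        self.fc2 = nn.Linear(hidden, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, h*w, dim) token sequence produced by the Swin layer.
        x = self.act(self.fc1(x))
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, h, w)     # tokens -> feature map
        x = self.act(self.dwconv(x))                  # local context via depth-wise conv
        x = x.reshape(B, C, N).transpose(1, 2)        # feature map -> tokens
        return self.fc2(x)

# Example: an 8x8 window of 180-dimensional tokens.
tokens = torch.randn(2, 64, 180)
print(LeFF()(tokens, 8, 8).shape)  # torch.Size([2, 64, 180])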

3 The Proposed Method


3.1 Learning with Mixed Resolution Images
The quality of training data plays a vital role in shaping the generalization capa-
bilities of deep learning-based SR models [24]. As demonstrated in [1], training
SR models using a mixture of high- and low-resolution images enhances their
performance. High-resolution images offer critical details and textures, while
low-resolution images familiarize the model with common degradation patterns.
In this paper, we utilize CutBlur [1], a technique that combines high- and
low-resolution regions within a single training sample. This approach compels
the model to simultaneously address both degraded and detailed areas, thereby
enhancing the diversity of the training data and enabling the model to learn to
restore lost details more effectively. As a result, the SR models are anticipated
to achieve superior reconstruction across a wide range of input images.
Let x_LR ∈ R^{W×H×C} and x_HR ∈ R^{sW×sH×C} be a pair of LR and HR images originating from an SR dataset, where s is the scale factor. CutBlur [1] generates a new pair of images by cutting-and-pasting random regions of x_LR into the corresponding areas of x_HR and vice versa:

x̂_{HR→LR} = M ⊙ x_HR + (1 − M) ⊙ x_LR    (1)
x̂_{LR→HR} = M ⊙ x_LR + (1 − M) ⊙ x_HR    (2)

where M is a mask marking the random region to replace in the high- or low-resolution image, and ⊙ denotes element-wise multiplication as defined in [1].
Then, the augmented dataset is defined as:

D_Mixed^Final = (D_LR + D_HR) + (D_AugLR + D_AugHR)    (3)

where D_LR (or D_AugLR) and D_HR (or D_AugHR) denote the sets of original (or augmented) LR and HR images, respectively.
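A minimal sketch of Eqs. (1) and (2) is shown below. Following the original CutBlur formulation, the LR image is assumed to have been bicubically upsampled to the HR resolution beforehand so that the element-wise mixing is well-defined; the patch-size range is our own choice.

import torch

def cutblur(x_hr: torch.Tensor, x_lr_up: torch.Tensor, max_ratio: float = 0.5):
    """Cut-and-paste a random rectangle between HR and upsampled-LR images.

    x_hr, x_lr_up: (B, C, H, W) tensors of identical size.
    Returns (x_hr_to_lr, x_lr_to_hr) as in Eqs. (1) and (2).
    """
    B, C, H, W = x_hr.shape
    ch = int(H * torch.empty(1).uniform_(0.1, max_ratio).item())
    cw = int(W * torch.empty(1).uniform_(0.1, max_ratio).item())
    top = torch.randint(0, H - ch + 1, (1,)).item()
    left = torch.randint(0, W - cw + 1, (1,)).item()
    mask = torch.zeros_like(x_hr)
    mask[..., top:top + ch, left:left + cw] = 1.0     # the region mask M
    x_hr_to_lr = mask * x_hr + (1 - mask) * x_lr_up   # Eq. (1)
    x_lr_to_hr = mask * x_lr_up + (1 - mask) * x_hr   # Eq. (2)
    return x_hr_to_lr, x_lr_to_hr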

Fig. 1. The proposed architecture for the SwinIR-LeCut framework.

3.2 The SwinIR-LeCut Framework

In this work, we propose the SwinIR-LeCut model, which introduces two key improvements to the SwinIR model. First, we apply the CutBlur data
augmentation technique to increase the diversity of training images. Second,
we integrate the LeFF layer into the Swin Transformer blocks to improve the
model’s ability to capture both channel-wise and spatial feature interactions
effectively. These enhancements leverage diverse training data and more efficient
feature modeling to boost overall model performance [25]. More precisely, at
first, the number of training images is increased by using the CutBlur [1] data
augmentation technique to build the final training set D_Mixed^Final defined in Sect. 3.1 above. The training then starts with the Shallow Feature Extraction, where the initial representation of each training LR image is extracted. After
that, the New Deep Feature Extraction (nDF) phase begins with the Residual
Swin Transformer LeCut Block (RSTLB) (m blocks), containing multiple LeSwin
Transformer layers (k layers) where the LeFF layer [19] is added to each layer.
Each LeSwin Transformer layer (LeSTL) consists of 3 main components, includ-
ing Window Self-Attention (WSA), LeFF, and MLP with 2 FC layers, where the
LayerNorm is added before each component to normalize the input (see Fig. 1).
At the final stage, the new HR image is constructed from the learned features, which are finally upsampled by PixelShuffle. The overall architecture can be
modeled as the following steps:
Shallow Feature Extraction: The shallow feature extraction component captures the LR features from the inputs. As defined in [12], given an input LR image I_LR drawn from the CutBlur-augmented training set D_Mixed^Final, the operation can be formulated as:

F_0 = H_SFE(I_LR)    (4)

where F_0 ∈ R^{W×H×C} is the feature map extracted from the LR input image I_LR by the shallow feature extraction layer H_SFE.

New Deep Feature Extraction: After the first extraction, the deep features
are captured by this second component. The enhancement behind this step is
motivated by the need to address specific limitations in SwinIR’s handling of
intricate local textures. LeFF [19] introduces locally enhanced convolutional lay-
ers that focus on refining texture details at the pixel level, making it well-suited
for this purpose. The process can be modeled as follows:

F_nDF = H_nDF(F_0)    (5)

where F_nDF ∈ R^{W×H×C} is the deep feature map extracted by the proposed deep feature extraction module H_nDF, which applies the k LeSwin Transformer layers sequentially:

F_i = H_LeSTL_i(F_{i−1}),  i = 1, ..., k,  with F_nDF = F_k    (6)

where H_LeSTL_i(·) is the proposed LeSwin Transformer layer (LeSTL) inside the RSTLB block. Each RSTLB block contains k LeSTL layers.
Image Construction: This component is the final step before producing the super-resolved output image. In this step, the output image is upsampled from the extracted deep representation while the predicted high-quality details are preserved. According to [12], this process can be modeled as:

I_RHQ = H_REC(F_0 + F_nDF)    (7)

where I_RHQ is the output HR image super-resolved by the proposed SwinIR-LeCut model, and H_REC is the reconstruction module, which uses PixelShuffle upsampling to produce the output image.
Loss Function: Similarly to SwinIR [12], we use the L1 loss to optimize the model's parameters, reducing the pixel-wise differences between the predicted HR output images and the ground-truth images:

L_1 = || I_RHQ − I_HQ ||_1    (8)

where I_RHQ and I_HQ are the output HR image super-resolved by the proposed SwinIR-LeCut model and the ground-truth HR image corresponding to the LR input, respectively [12].
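The overall flow of Eqs. (4)-(7), together with the L1 objective of Eq. (8), can be summarized by the structural sketch below; plain convolutional blocks stand in for the RSTLB body, so this illustrates only the data flow, not the actual attention-based architecture.

import torch
import torch.nn as nn

class SwinIRLeCutSkeleton(nn.Module):
    """Structural sketch of Eqs. (4)-(7); not the authors' implementation."""

    def __init__(self, dim: int = 180, scale: int = 2, num_blocks: int = 6):
        super().__init__()
        self.shallow = nn.Conv2d(3, dim, 3, padding=1)                  # H_SFE, Eq. (4)
        self.body = nn.Sequential(*[                                    # stand-in for m RSTLBs
            nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU())
            for _ in range(num_blocks)])
        self.reconstruct = nn.Sequential(                               # H_REC, Eq. (7)
            nn.Conv2d(dim, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, i_lr: torch.Tensor) -> torch.Tensor:
        f0 = self.shallow(i_lr)
        f_ndf = self.body(f0)                                           # Eqs. (5)-(6)
        return self.reconstruct(f0 + f_ndf)

model = SwinIRLeCutSkeleton(scale=2)
out = model(torch.randn(1, 3, 48, 48))
l1 = torch.abs(out - torch.randn_like(out)).mean()                      # Eq. (8) against a dummy target
print(out.shape)  # torch.Size([1, 3, 96, 96])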

4 Experiments
4.1 Datasets
Set5 [2] (5 images) and Set14 [22] (14 images) contain mixtures of simple scenes,
including animals, people, and landscapes. They are commonly used for initial
testing of SR models due to their small sizes.
BSD100 [13] consists of 100 natural images, featuring diverse subjects like build-
ings, animals, and landscapes. It challenges models with more detailed textures
and variety, compared to Set5 and Set14.

Manga109 [14] is a dataset of 109 manga images, designed to test models on


non-photorealistic content. It focuses on clean lines, sharp contrasts, and typical
detailed artwork of Japanese comics.
Urban100 [10] consists of 100 images of urban scenes. The dataset features com-
plex architectural elements such as repetitive patterns, intricate textures, and
detailed geometric structures that are common in urban scenes.

4.2 Baselines

We compare the proposed model with the following baselines: Residual channel attention networks (RCAN) [24], Second-order attention network (SAN) [6], Internal graph neural network (IGNN) [26], Holistic attention network (HAN) [16], Non-local sparse attention (NLSA) [15], and SwinIR [12].

4.3 Evaluation Metrics

Peak Signal-to-Noise Ratio (PSNR) measures the pixel-level quality of super-


resolved images by comparing the model’s output with the ground-truth HR
image. It is expressed in decibels (dB), with higher values indicating better
image quality [9].
Structural Similarity Index (SSIM) evaluates the structural integrity of an
image by examining luminance, contrast, and texture similarities between the
super-resolved image and the ground-truth HR image. SSIM ranges from −1
to 1, where values closer to 1 represent greater structural similarity and better
overall image quality [9].
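For reference, the two metrics can be computed as in the sketch below, using plain NumPy for PSNR and scikit-image for SSIM; the 8-bit data range and the channel_axis argument (available in recent scikit-image versions) are assumptions about the evaluation setup, not details taken from the paper.

import numpy as np
from skimage.metrics import structural_similarity

def psnr(ref: np.ndarray, test: np.ndarray, data_range: float = 255.0) -> float:
    """PSNR in dB between a ground-truth HR image and a super-resolved one."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

def ssim(ref: np.ndarray, test: np.ndarray) -> float:
    # channel_axis=-1 assumes HxWx3 color images stored as uint8.
    return structural_similarity(ref, test, data_range=255, channel_axis=-1)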

4.4 Experimental Protocol

We conduct experiments for our proposed model, SwinIR-LeCut, on the classical


SR tasks at ×2 and ×4 scale factors. The main hyperparameters, including
the number of Residual Swin Transformer Layers with LeFF (RSTLB), LeSwin
Transformer Layers (LeSTL), window size, channel count, and attention heads,
are configured as 6, 6, 8, 180, and 6, respectively (Fig. 1).
We train the proposed model on the DIV2K dataset and evaluate its perfor-
mance by utilizing two metrics, PSNR and SSIM, across five datasets: Set5 [2],
Set14 [22], BSD100 [13], Manga109 [14], and Urban100 [10]. The results of the
proposed model are compared to its baseline.

4.5 Experiment Results

×2 Scale Image Super-Resolution. For scale ×2, the SwinIR-LeCut model


shows incremental but consistent enhancements over the original SwinIR and
other baseline models across multiple introduced datasets (Table 1 and Table 2).
The most noticeable improvements occur in Set5 and Set14, where the PSNR

Fig. 2. Visual comparison of the images super-resolved by SwinIR (left) and our pro-
posed SwinIR-LeCut model (right).

Table 1. ×2 Scale SR - PSNR comparison across datasets

Method Set5 [2] Set14 [22] BSD100 [13] Urban100 [10] Manga109 [14]
RCAN [24] 38.27 34.12 32.41 33.34 39.44
SAN [6] 38.31 34.07 32.42 33.1 39.32
IGNN [26] 38.24 34.07 32.41 33.23 39.35
HAN [16] 38.27 34.16 32.41 33.35 39.46
NLSA [15] 38.34 34.08 32.43 33.4 39.59
SwinIR [12] 38.35 34.14 32.44 33.40 39.60
SwinIR-LeCut 38.37 34.17 32.45 33.41 39.61

Table 2. ×2 Scale SR - SSIM comparison across datasets

Method Set5 [2] Set14 [22] BSD100 [13] Urban100 [10] Manga109 [14]
RCAN [24] 0.9614 0.9216 0.9027 0.9384 0.9786
SAN [6] 0.962 0.9213 0.9028 0.937 0.9792
IGNN [26] 0.9613 0.9217 0.9025 0.9383 0.9786
HAN [16] 0.9614 0.9217 0.9027 0.9385 0.9785
NLSA [15] 0.9618 0.9231 0.9027 0.9394 0.9789
SwinIR [12] 0.9620 0.9227 0.903 0.9393 0.9792
SwinIR-LeCut 0.9622 0.9232 0.903 0.9393 0.9792

and SSIM metrics slightly increase, showing that the refinements made to the
model enhance the quality of super-resolved images on both simple and medium-
complexity datasets (Set5 and Set14). For example, Set5 moves from 38.35 to
38.37 PSNR, and Set14 benefits from an increase in PSNR from 34.14 to 34.17,
demonstrating the model’s adaptability to varied textures. In more complex
datasets like BSD100, the improvements are minimal. Although BSD100 is more
challenging due to its intricate textures and noise patterns, our model main-
tains a competitive performance, suggesting that the changes made are effective
in handling these intricacies. On the highly detailed Manga109 and Urban100

Table 3. ×4 Scale SR - PSNR comparison across datasets

Method Set5 [2] Set14 [22] BSD100 [13] Urban100 [10] Manga109 [14]
RCAN [24] 32.63 28.87 27.77 26.82 31.22
SAN [6] 32.64 28.92 27.78 26.79 31.18
IGNN [26] 32.57 28.85 27.77 26.84 31.28
HAN [16] 32.64 28.9 27.8 26.85 31.42
NLSA [15] 32.59 28.87 27.78 26.96 31.27
SwinIR [12] 32.72 28.94 27.83 27.07 31.67
SwinIR-LeCut 32.74 28.96 27.82 26.72 31.68

Table 4. ×4 Scale SR - SSIM comparison across datasets

Method Set5 [2] Set14 [22] BSD100 [13] Urban100 [10] Manga109 [14]
RCAN [24] 0.9002 0.7889 0.7436 0.8087 0.9173
SAN [6] 0.9003 0.7888 0.7436 0.8068 0.9169
IGNN [26] 0.8998 0.7891 0.7434 0.809 0.9182
HAN [16] 0.9002 0.789 0.7442 0.8094 0.9177
NLSA [15] 0.9 0.7891 0.7444 0.8109 0.9184
SwinIR [12] 0.9021 0.7914 0.7459 0.8164 0.9226
SwinIR-LeCut 0.9023 0.7916 0.7459 0.8112 0.9227

datasets, the model performs similarly to its baseline (SwinIR), indicating that while the improvements help on general datasets, further work is needed to deliver more precise super-resolved images on such complex datasets, where architectural elements occur with high spatial frequency [11].
×4 Scale Image Super-Resolution. For scale ×4, the performance of the
SwinIR-LeCut is more variable. Set5 and Set14 continue to exhibit increases
in both PSNR and SSIM metrics, confirming that the model still handles sim-
pler datasets well (Table 3 and Table 4). However, the results for more complex
datasets such as Urban100 show a slight decline in performance at the ×4 scale
factor, where PSNR drops from 27.07 to 26.72, and SSIM decreases from 0.8164
to 0.8112, indicating that the model struggles with upscaling fine-grained tex-
tures and highly detailed images. This reduction in performance highlights the
need for additional enhancements of high frequency details and complex struc-
tural features in urban data [4], a limitation already visible in the ×2 results for this dataset in the SSIM metric (Table 2). Meanwhile, BSD100 maintains its performance in both metrics; additional enhancements may still be necessary to cope with the noise level of this dataset before visible gains appear. The results for Manga109
remain strong but exhibit minimal changes, reflecting that the augmentation

Table 5. ×2 Scale SR - PSNR Percentage Differences compared to SwinIR

Method Set5 [2] Set14 [22] BSD100 [13] Urban100 [10] Manga109 [14]
RCAN [24] −0.2% −0.05% −0.09% −0.1% −0.4%
SAN [6] −0.1% −0.2% −0.06% −0.8% −0.7%
IGNN [26] −0.2% −0.2% −0.09% −0.5% −0.6%
HAN [16] −0.2% +0.05% −0.09% −0.1% −0.3%
NLSA [15] −0.02% −0.1% −0.03% 0% −0.02%
SwinIR-LeCut +0.05% +0.08% +0.03% +0.02% +0.02%

Table 6. ×2 Scale SR - Ablation study of SwinIR-CutBlur on PSNR across datasets

Method Set5 [2] Set14 [22] BSD100 [13] Urban100 [10] Manga109 [14]
SwinIR [12] 38.35 34.14 32.44 33.40 39.60
SwinIR-CutBlur 38.36 34.16 32.45 33.40 39.60

Table 7. ×2 Scale SR - Ablation study of SwinIR-LeFF on PSNR across datasets

Method Set5 [2] Set14 [22] BSD100 [13] Urban100 [10] Manga109 [14]
SwinIR [12] 38.35 34.14 32.44 33.40 39.60
SwinIR-LeFF 38.35 34.15 32.44 33.41 39.61

techniques may have a limited impact on highly structured image domains at this scale, confirming that addressing this issue is a worthwhile refinement for future work [24] (Table 5).

5 Ablation Study
For the ablation study, we conduct the experiments of the SwinIR [12] with
CutBlur data augmentation technique [1] and the SwinIR [12] with the addi-
tional LeFF layer [19] in Swin Transformers blocks separately on all introduced
datasets. The results are presented in Table 6 and Table 7.
Impact of CutBlur on the Performance of SwinIR. Our experimental
findings from this ablation study demonstrate that integrating CutBlur [1] into
SwinIR [12] positively impacts the model’s overall performance, particularly on
simpler datasets like Set5 [2], Set14 [22], and even moderately complex ones such
as BSD100 [13] (Table 6). This validates the effectiveness of CutBlur [1].
Impact of LeFF on the Performance of SwinIR. The ablation study on
the incorporation of LeFF layer [19] in SwinIR [12] reveals that the additional
LeFF layer enhances the model’s ability to capture potential local features, lead-
ing to improved performance on more complex datasets like Urban100 [10] and

Manga109 [14]. Therefore, this experiment confirms the contribution to local


feature extraction of the LeFF layer.

6 Conclusion and Future Work

In this work, we propose the SwinIR-LeCut model, which incorporates Cut-


Blur [1] data augmentation and integrates the LeFF [19] layer to the Swin
Transformer blocks. We conducted experiments for our proposed model on the
classical SR tasks at ×2 and ×4 scale factors. The experiments demonstrate consistent gains in performance of the proposed model over its baseline, particularly at the ×2 scale for SR tasks under the Swin Transformer architecture. Nevertheless, the limitations of the proposed model, as well as of its baseline, are exposed at the ×4 scale factor on several challenging datasets, such as BSD100 and Urban100. Therefore, further innovations to address this challenge remain a valid direction for this problem [12].

Acknowledgment. This research is supported by research funding from the Faculty of Information Technology, University of Science, Vietnam National University - Ho Chi Minh City. It used GPUs provided by the faculty's Intelligent Systems Lab.

References
1. Ahn, N., Yoo, J., Sohn, K.A.: Data augmentation for low-level vision: cutblur and
mixture-of-augmentation, pp. 2041–2059 (2024)
2. Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M.L.: Low-Complexity
Single-Image Super-Resolution Based on Nonnegative Neighbor Embedding.
BMVA Press (2012)
3. Chen, Z., Guo, Y., Zhou, Z.: Pretrained image transformer for image super-
resolution. In: Proceedings of the European Conference on Computer Vision
(ECCV) (2020)
4. Conde, M.V., Choi, U.J., Burchi, M., Timofte, R.: Swin2sr: Swinv2 transformer
for compressed image super-resolution and restoration (2022). https://arxiv.org/
abs/2209.11345
5. Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: Autoaugment: learning
augmentation strategies from data. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp. 113–123 (2019)
6. Dai, T., Cai, J., Zhang, Y., Xia, S.T., Zhang, L.: Second-order attention network
for single image super-resolution. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp. 11065–11074 (2019)
7. DeVries, T.: Improved regularization of convolutional neural networks with cutout
(2017)
8. Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convo-
lutional networks (2016)
9. Hore, A., Ziou, D.: Image quality metrics: PSNR vs. SSIM. In: 2010 20th Interna-
tional Conference on Pattern Recognition, pp. 2366–2369. IEEE (2010)

10. Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from transformed
self-exemplars. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 5197–5206 (2015)
11. Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from trans-
formed self-exemplars. In: 2015 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 5197–5206 (2015). https://doi.org/10.1109/CVPR.2015.
7299156
12. Liang, J., Cao, J., Sun, G., Zhang, K., Gool, L.V., Timofte, R.: Swinir: image
restoration using SWIN transformer. In: Proceedings of the IEEE/CVF Interna-
tional Conference on Computer Vision (ICCV), pp. 1833–1844 (2021)
13. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural
images and its application to evaluating segmentation algorithms and measuring
ecological statistics. In: Proceedings eighth IEEE international conference on com-
puter vision. ICCV 2001, vol. 2, pp. 416–423. IEEE (2001)
14. Matsui, Y., et al.: Sketch-Based Manga Retrieval Using Manga109 Dataset, vol. 76,
pp. 21811–21838. Springer (2017)
15. Mei, Y., Fan, Y., Zhou, Y.: Image super-resolution with non-local sparse attention.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 3517–3526 (2021)
16. Niu, B., et al.: Single image super-resolution via a holistic attention network. In:
Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol.
12357, pp. 191–207. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-
58610-2_12
17. Pires, T.P., Lopes, A.V., Assogba, Y., Setiawan, H.: One wide feedforward is all
you need (2023). https://arxiv.org/abs/2309.01826
18. Pratt, W.K.: Digital Image Processing. Wiley (2001)
19. Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., Li, H.: Uformer: a general u-shaped
transformer for image restoration. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp. 17683–17693 (2022)
20. Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse rep-
resentation (2010)
21. Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: regularization
strategy to train strong classifiers with localizable features. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision, pp. 6023–6032 (2019)
22. Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse-
representations. In: Curves and Surfaces: 7th International Conference, Avignon,
France, June 24-30, 2010, Revised Selected Papers 7, pp. 711–730. Springer (2012)
23. Zhang, H.: Mixup: beyond empirical risk minimization (2017)
24. Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution
using very deep residual channel attention networks. vol. abs/1807.02758 (2018).
http://arxiv.org/abs/1807.02758

25. Zheng, Q., Xu, H., Bian, M.: Image super-resolution using a enhanced SWIN trans-
former network. In: 2023 3rd International Symposium on Computer Technology
and Information Science (ISCTIS), pp. 1151–1155 (2023). https://doi.org/10.1109/
ISCTIS58954.2023.10213090
26. Zhou, S., Zhang, J., Zuo, W., Loy, C.C.: Cross-scale internal graph neural network
for image super-resolution. Adv. Neural. Inf. Process. Syst. 33, 3499–3509 (2020)
Dual-Domain Reconstruction Network
for Enhancing Sparse-View and Low-Dose
CT Imaging

Pham Cong Thang(B) and Phan Minh Nhat

The University of Danang–University of Science and Technology,


54 Nguyen Luong Bang Street, Danang, Viet Nam
[email protected]

Abstract. Computed Tomography (CT) is a critical diagnostic tool,


but the growing use of CT has raised concerns about patient radia-
tion exposure. Sparse-view CT, which reduces the quantity of projec-
tion angles, has been proposed as a potential solution. However, tradi-
tional reconstruction methods, such as Filtered Back Projection (FBP),
face challenges in producing high-quality images from sparse data. To
address these challenges, the DD-ReconNet model for CT image recon-
struction is introduced. This model leverages both the Sinogram and
Image domains to enhance image quality and includes three stages: Sino-
gram Restoration, Image Reconstruction using FBPConvNet, and Image
Restoration. The Sinogram and Image Restoration modules integrate the
Swin Transformer V2 block and an Improved Edge Convolution layer
to boost restoration performance. In addition, a hybrid objective func-
tion is employed to optimize the reconstructed images. Experimental
results indicate the superiority of the DD-ReconNet model over con-
ventional methods, positioning it as a promising approach for low-dose
and sparse-view CT reconstruction, which improves diagnostic accuracy
while minimizing radiation exposure.

Keywords: CT Reconstruction · DD-ReconNet · Hybrid Objective


Function

1 Introduction
Computed tomography (CT) imaging, while indispensable for diagnostic pur-
poses, poses significant radiation exposure risks to patients due to its widespread
use [1]. To address this concern, numerous strategies have been proposed to min-
imize radiation dose while maintaining image quality. Sparse-view scanning [2],
which involves acquiring projections from a reduced quantity of angles, offers
a promising approach. However, conventional reconstruction methods, such as
Filtered Back Projection (FBP) [3], struggle to generate high-quality images
from sparse data, often resulting in artifacts and noise. FBP, although computa-
tionally efficient, is prone to noise amplification and aliasing artifacts, especially

in low-dose CT scenarios [4]. These limitations compromise its ability to accu-


rately reconstruct large-scale images, thus affecting diagnostic accuracy. Recon-
structing CT images from sparse data remains a challenging task in biomedical
imaging, and while conventional methods have advanced, they often fall short in
overcoming the limitations of incomplete datasets, leading to noisy and detail-
deficient reconstructions.
Recent years have seen significant progress in CT image reconstruction,
with Compressed Sensing (CS) [5] and deep learning [6] emerging as leading
approaches. CS exploits signal sparsity, often achieved through wavelet trans-
forms, to reconstruct high-quality images from limited projection data [7]. Mean-
while, deep learning models have demonstrated remarkable abilities in learning
complex features from CT data [8], surpassing traditional methods like FBP.
These advancements enable the generation of high-quality CT images at reduced
radiation doses, addressing the critical need for accurate and efficient clinical
diagnostics.
This paper presents the DD-ReconNet model for CT image reconstruc-
tion, which leverages both the Sinogram and Image domains, inspired by [9],
to enhance image quality. The model comprises three primary stages: Sino-
gram Restoration, Image Reconstruction, and Image Restoration. The Sinogram
Restoration (SR) module, FBPConvNet, and Image Restoration (IR) module
are employed in these stages, respectively. The SR/IR modules incorporate the
Swin Transformer V2 block [10] and a novel Improved Edge Convolution layer
to boost image restoration performance. In addition, a hybrid objective function
is proposed to optimize the quality of reconstructed images.
The key contributions of this research are summarized below:

1. The DD-ReconNet model for CT image reconstruction leverages both the


Sinogram and Image domains to enhance image quality. The model consists
of three key components: Sinogram Restoration (SR), Image Reconstruction
using FBPConvNet, and Image Restoration (IR). The SR and IR modules
incorporate the Swin Transformer V2 block and a novel Improved Edge Con-
volution layer to further improve image restoration performance.
2. A hybrid objective function is proposed to optimize the overall quality of the
reconstructed CT images.

2 Related Work
2.1 Computed Tomography Reconstruction

The Filtered Back Projection (FBP) algorithm [3] is a widely used method for CT
image reconstruction, but it has notable limitations, such as failing to account
for noise, X-ray spectrum variability, and sensor characteristics. Dual-domain
deep learning methods, such as DRONE [11], CDDCN [12], CLRecon [13], and
DualCNN [14], have been developed to fully leverage information from both
sinogram and image domains, enabling simultaneous enhancement of sinograms
and reconstructed images. Despite their effectiveness, these dual-domain joint

learning networks face several limitations, including a high number of trainable


parameters, substantial computational time, and independent optimization of
sub-networks, which can hinder their efficiency and scalability.
To address these issues, several advanced methods have been proposed.
Guan et al. [15] introduced a generative model for sparse projection data, while
Xia et al. [16] developed patch-based Denoising Diffusion Probabilistic Models
(DDPM). Lin et al. [9] propose an end-to-end trainable Dual Domain Network
that addresses the challenges of Metal Artifact Reduction (MAR) by simulta-
neously restoring sinogram consistency and enhancing CT images, a task previ-
ously handled in single domains. Despite these advancements, these models often
struggle to capture clinical image details at varying scales, resulting in instabil-
ity and reduced image quality. Deep learning approaches have shown promise in
overcoming these limitations. Vasconcelos et al. [17] proposed a Bayesian Implicit
Neural Representation, and Li et al. [18] introduced GMM-unNet, a GMM-based
unsupervised method for low-dose CT reconstruction. However, fine details, such
as vascular structures, often remain unclear.

2.2 Application of Transformer Networks to CT Imaging


Leveraging the robust attention mechanism [19] and patch-based operations,
the Transformer architecture has been successfully applied to various computer
vision tasks [10, 20]. Notably, the Swin Transformer [10] combines the strengths
of this architecture with the local feature extraction capabilities of CNNs. This
combination allows Swin Transformer-based models [21] to overcome the mem-
ory limitations of earlier Vision Transformer models. Building on these advance-
ments, Transformers have been employed in medical image analysis to effectively
model global features, leading to significant improvements in tasks such as image
segmentation [22], registration [23], and classification [24]. Despite these break-
throughs, Transformer architectures have been scarcely explored in sparse-view
CT reconstruction. Although TransCT [25] uses Transformer-based techniques
to suppress noise artifacts in low-dose CT imaging, it overlooks the importance of global sinogram characteristics in its design, a crucial factor for effective reconstruction.

3 Proposed Method
3.1 DD-ReconNet Architecture
As illustrated in Fig. 1, DD-ReconNet is designed with three primary stages: Sinogram Restoration, employing an SR-Module; CT Reconstruction, utilizing the FBPConvNet method with direct inversion followed by a CNN to deal with normal-convolutional inverse problems (as discussed in detail in [26]); and CT Restoration, leveraging an IR-Module. Given a sparse-view sinogram input Y ∈ R^{H_S×W_S}, an enhanced sinogram Ỹ ∈ R^{H_S×W_S} is generated by the SR-Module. Subsequently, the FBPConvNet method is applied to sequentially reconstruct two low-quality CT images, X̃_1 and X̃_2,

both of which belong to R^{H_I×W_I}, from Y and the enhanced sinogram Ỹ, respectively. Lastly, the concatenated output of X̃_1 and X̃_2 is fed into the IR-Module to yield the final, high-quality CT image X̃ ∈ R^{H_I×W_I}.
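The three-stage flow can be sketched in PyTorch as below. SRModule, IRModule, and the fbp operator are placeholders for the two restoration networks and a (differentiable) filtered back-projection implementation from an external tomography toolbox; they are assumptions for illustration, not the authors' released code.

import torch
import torch.nn as nn

class DDReconNetSketch(nn.Module):
    """Dual-domain pipeline sketch: sinogram restoration -> FBP -> image restoration."""

    def __init__(self, sr_module: nn.Module, ir_module: nn.Module, fbp):
        super().__init__()
        self.sr = sr_module        # sinogram-domain restoration network (SR-Module)
        self.ir = ir_module        # image-domain restoration network (IR-Module)
        self.fbp = fbp             # external FBP operator, assumed callable on sinograms

    def forward(self, y_sparse: torch.Tensor):
        y_restored = self.sr(y_sparse)                     # enhanced sinogram Y~
        x1 = self.fbp(y_sparse)                            # FBP of the raw sparse sinogram
        x2 = self.fbp(y_restored)                          # FBP of the restored sinogram
        x_final = self.ir(torch.cat([x1, x2], dim=1))      # fuse both reconstructions
        return y_restored, x_final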

Fig. 1. DD-ReconNet Architecture

3.2 Sinogram/Image Restoration Module

The process of sinogram/CT image restoration is immensely challenging, as the inherent details not only encompass the spatial structures of the human body but also adhere to a global sampling process. Figure 2 illustrates the diagram of the Restoration Module, which consists of M successive Swin residual blocks. Each block contains N Swin Transformer V2 (SwinV2) blocks [10] and an Improved Edge Convolution (IE-Conv) layer, designed to extract both global and local features. The spatial structure is initially extracted by applying a convolutional layer to the degraded sinograms/CT images, producing a feature referred to as F_conv. Denoting this as F_0, the output after passing through the i-th SwinV2 residual block is represented as F_i. Traditional CNNs, which excel at local feature extraction, struggle to capture this global characteristic. Therefore, the SwinV2 structure has been integrated into this module to handle long-range dependencies effectively.
The IE-Conv layer, depicted in Fig. 2, is designed to enhance edge features
in each layer, providing a balance between detailed high-frequency information
and local feature preservation. IE-Conv employs a multi-branch architecture
consisting of a conventional convolutional layer and two additional convolutional
layers, each enhanced with predefined gradient filters. These filters include first-
order and second-order gradient kernels, such as Sobel and Laplacian operators,
which are effective in capturing local spatial variations and edge information.

Fig. 2. Sinogram/Image Restoration Module

Given an input image I ∈ R^{H×W×C}, the aforementioned process can be mathematically expressed as follows:

F_0 = F_conv = Conv(I),
F_i = IE-Conv( (SwinV2_N ∘ ··· ∘ SwinV2_1)(F_{i−1}) ) + F_{i−1}.

The restored image I_R of the input I is calculated as follows:

I_R = Conv( IE-Conv(F_M) + F_0 ) + I,

where IE-Conv(·) denotes the IE-Conv layer, {SwinV2_j}_{j=1}^{N} represents the N SwinV2 blocks, and ∘ signifies their sequential application. Let K_3×3 ∈ R^{3×3} and B_3×3 denote the learnable kernel weights and bias of the traditional convolutional branch, respectively. With F_Swin representing the feature after the N SwinV2 blocks, the feature extraction of this convolutional branch is defined as F_3×3 = K_3×3 ∗ F_Swin + B_3×3, where ∗ denotes the convolution operation. Let K_Sx, K_Sy ∈ R^{3×3} represent the horizontal and vertical Sobel filters, respectively:

K_Sx = [ +1 0 −1 ; +2 0 −2 ; +1 0 −1 ],   K_Sy = [ +1 +2 +1 ; 0 0 0 ; −1 −2 −1 ].

The first-order gradient of the latent feature map is calculated as follows:

F_S = (K_Sx ∗ F_Swin + B_Sx) + (K_Sy ∗ F_Swin + B_Sy),

where B_Sx and B_Sy are the biases for the horizontal and vertical Sobel convolutions, respectively. Additionally, the Laplacian filter is utilized to extract the second-order gradient. The Laplacian operators based on 4-connected and 8-connected

neighborhoods are applied as follows:

K_L4 = [ 0 +1 0 ; +1 −4 +1 ; 0 +1 0 ],   K_L8 = [ +1 +1 +1 ; +1 −8 +1 ; +1 +1 +1 ].

The mathematical expression for the second-order gradient of the latent feature map is given by:

F_L = (K_L4 ∗ F_Swin + B_L4) + (K_L8 ∗ F_Swin + B_L8),


.

where B_L4 and B_L8 are the biases for the 4- and 8-connected Laplacian convolutions, respectively. Let the parameters α_1, α_2, and α_3 serve as learnable competitive coefficients for the three branches. These coefficients are regulated by a simple softmax function, ensuring that the IE-Conv framework effectively preserves high-frequency feature information. The complete feature extraction process within the IE-Conv layer is as follows:

F_IE = α_1 F_3×3 + α_2 F_S + α_3 F_L.

As outlined in [27], the multi-branch IE-Conv layer can be consolidated into a


single convolution during inference without increasing complexity. Let K_IE and B_IE represent the merged kernel weight and bias of the IE-Conv layer at inference. These parameters can be calculated as follows:

K_IE = α_1 K_3×3 + α_2 (K_Sx + K_Sy) + α_3 (K_L4 + K_L8),
B_IE = α_1 B_3×3 + α_2 (B_Sx + B_Sy) + α_3 (B_L4 + B_L8).

Through the reparameterization process, the three branches are merged into
a single Convolution operation. The feature extraction process of the IE-Conv
layer in the inference phase is as follows:

F_IE = K_IE ∗ F_Swin + B_IE.
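A compact PyTorch reading of the IE-Conv layer is sketched below. For simplicity all three branches are treated as depth-wise 3×3 convolutions, which keeps the weighted-kernel merge above exact per channel; whether the plain 3×3 branch is depth-wise or a full convolution is not specified in the text, so that choice is an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

SOBEL_X = torch.tensor([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]])
SOBEL_Y = torch.tensor([[1., 2., 1.], [0., 0., 0.], [-1., -2., -1.]])
LAP_4   = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
LAP_8   = torch.tensor([[1., 1., 1.], [1., -8., 1.], [1., 1., 1.]])

class IEConv(nn.Module):
    """Improved Edge Convolution (sketch): learnable 3x3 branch plus fixed
    Sobel and Laplacian branches, mixed by softmax-regulated coefficients."""

    def __init__(self, channels: int):
        super().__init__()
        self.channels = channels
        self.k3 = nn.Parameter(torch.randn(channels, 1, 3, 3) * 0.02)   # K_3x3 (depth-wise)
        self.b3 = nn.Parameter(torch.zeros(channels))                   # B_3x3
        self.b_sobel = nn.Parameter(torch.zeros(channels))              # B_Sx + B_Sy
        self.b_lap = nn.Parameter(torch.zeros(channels))                # B_L4 + B_L8
        self.alpha = nn.Parameter(torch.zeros(3))                       # alpha_1..3 before softmax

    def _dw(self, x, kernel, bias):
        # Broadcast a fixed 3x3 kernel to a depth-wise convolution over all channels.
        k = kernel.to(device=x.device, dtype=x.dtype).view(1, 1, 3, 3).repeat(self.channels, 1, 1, 1)
        return F.conv2d(x, k, bias=bias, padding=1, groups=self.channels)

    def forward(self, f_swin: torch.Tensor) -> torch.Tensor:
        a = torch.softmax(self.alpha, dim=0)
        f3 = F.conv2d(f_swin, self.k3, bias=self.b3, padding=1, groups=self.channels)
        fs = self._dw(f_swin, SOBEL_X + SOBEL_Y, self.b_sobel)   # first-order branch F_S
        fl = self._dw(f_swin, LAP_4 + LAP_8, self.b_lap)         # second-order branch F_L
        # At inference the three 3x3 kernels can be merged into one, as in the text above.
        return a[0] * f3 + a[1] * fs + a[2] * fl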

3.3 Loss Function


In this work, let S_gt and X_gt represent the full-view sinogram and the corresponding CT image reconstructed from it, respectively. To ensure that the estimated Ŝ and X̂ remain consistent with the ground truth, the objective functions for the two modules, Sinogram Restoration and Image Restoration, denoted as L^(SR) and L^(IR), are defined as follows:

L^(SR) = L_MSE^(SR) + L_Var^(SR) + L_Edge^(SR),
L^(IR) = L_MSE^(IR) + L_Var^(IR) + L_Edge^(IR).

The variance loss L_Var was developed to address the shortcomings of the mean squared error L_MSE in deep learning-based image reconstruction tasks.

Traditional MSE-supervised methods frequently result in image blurring due to the averaging of pixel intensities. Unlike MSE, L_Var prioritizes edge sharpness in the reconstructed image Î (I can be S or X). To achieve this, the variance of gradient maps is utilized: Sobel operators are applied to both Î and the ground truth I_gt to obtain G_x^Î, G_y^Î, G_x^Igt, and G_y^Igt. These maps are subsequently divided into n × n non-overlapping patches to form matrices G̃_x^Î, G̃_y^Î, G̃_x^Igt, and G̃_y^Igt, each of size n². With μ as the mean value, the variance of each gradient map is then calculated as:

v = Σ_{i=1}^{n²} (G̃_i − μ)² / (n² − 1).

Therefore, the variance loss L_Var is formulated as:

L_Var = || v_x^Î − v_x^Igt ||_2 + || v_y^Î − v_y^Igt ||_2.

Although Variance loss effectively improves edge sharpness, it may compro-


mise the preservation of high-frequency details due to its primary emphasis on
variance reduction. To rectify this limitation, the Edge loss .LEdge is proposed,
which seeks to mitigate the blurring inherent in MSE and the high-frequency
detail deficiencies of Variance loss. By emphasizing high-frequency components,
.LEdge promotes the enhancement of edges and finer details, resulting in improved
image clarity. This approach is inspired by high-pass filtering techniques com-
monly employed in image processing. .LEdge leverages .L1 distance to prioritize
high-frequency amplitudes, ensuring edge detail preservation. A high-pass filter
smoothly emphasizes high frequencies, enhancing clarity without compromising
image integrity. This mathematical framework allows high-pass filter weights,
defined as follows:

W = 1 − e^{−(\sqrt{f_x^2 + f_y^2} − ε)^2 / (2σ^2)},
to modulate frequencies, suppressing lower frequencies below ε while amplifying higher ones. The parameter σ controls the Gaussian's spread, influencing the filter's emphasis on high frequencies; f_x and f_y denote the frequency components along the x and y axes. Using the Fast Fourier Transform (FFT), L_{Edge} is
computed as follows:
L_{Edge} = \| W ⊙ |FFT(\hat{I})| − W ⊙ |FFT(I_{gt})| \|_1.
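The following sketch illustrates how such an FFT-based Edge loss can be computed; the cutoff ε, the spread σ, and the mean reduction are illustrative choices, not values taken from the paper.

```python
# Hedged sketch of the Edge loss: a Gaussian high-pass weight over the FFT
# magnitude spectrum, compared with an L1-style distance.
import torch

def edge_loss(pred, gt, eps=0.1, sigma=0.25):
    B, C, H, W = pred.shape
    fy = torch.fft.fftfreq(H).view(H, 1)      # frequency components along y
    fx = torch.fft.fftfreq(W).view(1, W)      # frequency components along x
    radius = torch.sqrt(fx**2 + fy**2)
    weight = 1.0 - torch.exp(-((radius - eps) ** 2) / (2 * sigma**2))  # high-pass W
    mag_pred = torch.abs(torch.fft.fft2(pred))
    mag_gt = torch.abs(torch.fft.fft2(gt))
    return torch.mean(torch.abs(weight * mag_pred - weight * mag_gt))

loss = edge_loss(torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64))
```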

Finally, the full objective of DD-ReconNet is L = L^{(SR)} + L^{(IR)}.

4 Experiments and Discussion


4.1 Dataset

The abdominal CT dataset leveraged in this study was contributed by the Mayo
Clinic [28]. It consists of 5,388 slices, each with a thickness of 1 mm and a pixel
resolution of 512 × 512. From these slices, 5,388 sinograms were generated, with
each sinogram derived from 120 projection angles. The dataset was divided into
a training set of 4,839 images and a testing set of 549 images. Label images were

reconstructed utilizing the filtered back-projection (FBP) algorithm from 720


projection angles. This dataset allows for the evaluation of the model’s perfor-
mance in denoising Gaussian and Poisson noise, which is commonly present in
low-dose CT images.

4.2 Experiment Settings


The model was trained over 1,500 epochs, with an initial learning rate of 10^{−3}, which was increased to 5 × 10^{−3} during the first 100 epochs. Afterward, the learning rate was decayed to 10^{−5} following a cosine annealing schedule for the remaining epochs. The Adam optimizer, with parameters β_1 = 0.9 and β_2 = 0.99, was employed to minimize the loss function. Model performance was
evaluated using three widely-used image quality metrics: Peak Signal-to-Noise
Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Mean Squared
Error (MSE). All experiments were conducted on an Ubuntu 20.04 system using
the PyTorch 1.12.1 framework, Python 3.10.4, and an NVIDIA DGX A100 GPU
with CUDA 12.1, powered by an Intel Xeon Platinum 8470Q CPU.
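For illustration, the described schedule can be sketched as below; the paper states only the warmup endpoints (10^{-3} to 5 × 10^{-3} over the first 100 epochs) and the cosine decay to 10^{-5}, so the linear warmup shape is an assumption.

```python
# Hedged sketch of the described learning-rate schedule: a linear warmup from
# 1e-3 to 5e-3 over the first 100 epochs, then cosine annealing down to 1e-5.
import math

def learning_rate(epoch, total_epochs=1500, warmup=100,
                  lr_start=1e-3, lr_peak=5e-3, lr_min=1e-5):
    if epoch < warmup:
        return lr_start + (lr_peak - lr_start) * epoch / warmup
    progress = (epoch - warmup) / (total_epochs - warmup)
    return lr_min + 0.5 * (lr_peak - lr_min) * (1 + math.cos(math.pi * progress))

print(learning_rate(0), learning_rate(100), learning_rate(1499))
```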

4.3 Experiment Results

In the field of medical imaging, the quality of CT images is critical for ensur-
ing accurate treatment decisions. However, acquiring the large number of pro-
jection angles necessary for high-quality image reconstruction is often time-
consuming and inconvenient for patients. To address this challenge, we propose
DD-ReconNet, a novel CT image reconstruction model based on deep neural
networks. This model capitalizes on its ability to learn complex image features,
leading to substantial improvements in reconstruction quality. In comparison
with conventional CT reconstruction methods like FBPConvNet [26], as well as
advanced deep learning approaches such as GMSD [15], DuDoNet [9], and DDP-
Net [29], DD-ReconNet demonstrates marked superiority. The model not only
achieves significantly better PSNR, SSIM, and MSE values (as summarized in
Table 1), but also surpasses the state-of-the-art GMSD by approximately 2.46
dB in PSNR and 0.01 in SSIM. Furthermore, DD-ReconNet shows remarkable
improvement in preserving fine details, such as vascular structures and small
lesions. As illustrated in Fig. 3, local details are effectively maintained.

Table 1. Comparison of the performance of DD-ReconNet and other methods

Method MSE PSNR SSIM


FBPConvNet [26] 0.70 31.69 0.882
DuDoNet [9] 0.36 34.62 0.965
DDPNet [29] 0.29 36.95 0.975
GMSD [15] 0.24 37.44 0.978
DD-ReconNet 0.13 39.90 0.984

Fig. 3. Reconstructed images using different methods and the corresponding PSNR/SSIM values

Table 2. Ablation experiment on the effect of different objective functions

Objective Function MSE PSNR SSIM


L_{MSE} 0.20 38.16 0.968
L_{MSE} + L_{Var} 0.16 38.97 0.983
L_{MSE} + L_{Edge} 0.19 38.87 0.978
L_{MSE} + L_{Var} + L_{Edge} 0.13 39.90 0.984

Table 2 presents a comparative analysis of various objective functions


employed in CT image reconstruction. Our findings demonstrate that the pro-
posed model, which incorporates a combination of the objective functions L_{MSE}, L_{Var}, and L_{Edge}, consistently outperforms the baseline approach utilizing only L_{MSE}. This superiority is evident across all three evaluation metrics, with gains of 0.07 in MSE, 1.74 dB in PSNR, and 0.016 in SSIM.

5 Conclusion
This work presents the DD-ReconNet model as an innovative solution for enhanc-
ing image quality in sparse-view CT reconstruction. By leveraging both the
Sinogram and Image domains, along with advanced techniques such as the
Swin Transformer V2 block and the Improved Edge Convolution layer, DD-
ReconNet effectively addresses challenges posed by reduced projection angles
and radiation dose concerns. The three-stage reconstruction process, comprising Sinogram Restoration, FBPConvNet-based Image Reconstruction, and Image Restoration, demonstrates superior performance compared to traditional meth-
ods like Filtered Back Projection. Experimental results confirm the model’s abil-
ity to generate higher-quality images, supporting its potential to improve diag-
nostic accuracy while reducing patient radiation exposure. These findings posi-
tion DD-ReconNet as a promising approach for advancing low-dose and sparse-
view CT imaging, with significant implications for both clinical practice and
patient safety.

Acknowledgements. This work was supported by The University of Danang–


University of Science and Technology, code number of Project: T2023–02–05MSF.

References
1. Brenner, D., Elliston, C., Hall, E., Berdon, W.: Estimated risks of radiation-induced
fatal cancer from pediatric CT. Am. J. Roentgenol. 176, 289–296 (2001). https://
doi.org/10.2214/ajr.176.2.1760289
2. Hsieh, J.: Computed Tomography: Principles, Design, Artifacts, and Recent
Advances. SPIE Press, vol. PM259, p. 666 (2015)
3. Schäfer, D., Grass, M., Haar, P.: FBP and BPF reconstruction methods for cir-
cular X-ray tomography with off-center detector. Med. Phys. 38, S85–S94 (2011).
https://doi.org/10.1118/1.3578342
4. Lauritsch, G., Haerer, W.: Theoretical framework for filtered back projection
in tomosynthesis. Med. Imaging 1998 Image Process. 3338, 1127–1137 (1998).
https://doi.org/10.1117/12.310839
5. Ye, J.: Compressed sensing MRI: a review from signal processing perspective. BMC
Biomed. Eng. 1, 1–17 (2019). https://doi.org/10.1186/s42490-019-0006-z
6. Koetzier, L., et al.: Deep learning image reconstruction for CT: technical principles
and clinical prospects. Radiology 306, e221257 (2023). https://doi.org/10.1148/
radiol.221257
7. Zhang, X., Pan, G., Chen, B., Sun, K., Meng, Z.: Integral algorithm of exponential
observables for interacting fermions in quantum Monte Carlo simulations. Phys.
Rev. B 109, 205147 (2024). https://doi.org/10.1103/PhysRevB.109.205147
8. Chauhan, S., Malik, N., Vig, R.: UNet with ResNextify and IB modules for low-
dose CT image denoising. Int. J. Inf. Technol. 16, 4677–4692 (2024). https://doi.
org/10.1007/s41870-024-01898-8
9. Lin, W., et al.: DuDoNet: dual domain network for CT metal artifact reduction.
In: Proceedings of The IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 10512–10521 (2019). https://doi.org/10.1109/CVPR.2019.01076

10. Liu, Z., et al.: Swin Transformer V2: scaling up capacity and resolution. In: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion, pp. 12009–12019 (2022). https://doi.org/10.1109/CVPR52688.2022.01170
11. Wu, W., Hu, D., Niu, C., Yu, H., Vardhanabhuti, V., Wang, G.: DRONE: dual-
domain residual-based optimization network for sparse-view CT reconstruction.
IEEE Trans. Med. Imaging 40(11), 3002–3014 (2021). https://doi.org/10.1109/
TMI.2021.3078067
12. Li, Q., et al.: A cascade-based dual-domain data correction network for sparse
view CT image reconstruction. Comput. Biol. Med. 165, 107345 (2023). https://
doi.org/10.1016/j.compbiomed.2023.107345
13. Hu, J., Xing, S., Shan, X., Yu, X., Li, G.: Research on key processing parameters
of parallel seam welding of micro crystal resonator based on simulation experi-
ment. Ferroelectrics 565(1), 88–98 (2020). https://doi.org/10.1080/00150193.2020.
1761722
14. Chao, L., et al.: Sparse-view cone beam CT reconstruction using dual CNNs in pro-
jection domain and image domain. Neurocomputing 493, 536–547 (2022). https://
doi.org/10.1016/j.neucom.2021.12.096
15. Guan, B., et al.: Generative modeling in sinogram domain for sparse-view CT
reconstruction. IEEE Trans. Radiat. Plasma Med. Sci. 8, 195–207 (2024). https://
doi.org/10.1109/TRPMS.2023.3309474
16. Xia, W., et al.: Parallel diffusion model-based sparse-view cone-beam breast CT.
ArXiv Preprint ArXiv:2303.12861, pp. 1–16 (2023). https://doi.org/10.48550/
arXiv.2303.12861
17. Vasconcelos, F., He, B., Singh, N., Teh, Y.: UncertaINR: uncertainty quantification
of end-to-end implicit neural representations for computed tomography. ArXiv
Preprint ArXiv:2202.10847, pp. 1–57 (2022). https://doi.org/10.48550/arXiv.2202.
10847
18. Li, D., et al.: Noise characteristics modeled unsupervised network for robust CT
image reconstruction. IEEE Trans. Med. Imaging 41, 3849–3861 (2022). https://
doi.org/10.1109/TMI.2022.3197400
19. Fu, J., et al.: Dual attention network for scene segmentation. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3141–
3149 (2019). https://doi.org/10.1109/CVPR.2019.00326
20. Habib, G., Singh, D., Malik, I., Lall, B.: Optimizing vision transformers with
data-free knowledge transfer. ArXiv Preprint ArXiv:2408.05952, pp. 1–20 (2024).
https://doi.org/10.48550/arXiv.2408.05952
21. Liang, J., et al.: SwinIR: image restoration using swin transformer. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision, pp. 1833–1844
(2021). https://doi.org/10.1109/ICCVW54120.2021.00210
22. Lin, A., et al.: DS-TransuNet: dual swin transformer U-Net for medical image
segmentation. IEEE Trans. Instrum. Measur. 71, 1–15 (2022). https://doi.org/10.
1109/TIM.2022.3178991
23. Lei, Y., et al.: Diffeomorphic transformer-based abdomen MRI-CT deformable
image registration. Med. Phy. 51(9), 1–18 (2024). https://doi.org/10.1002/mp.
17235
24. Lian, J., Liu, T.: Lesion identification in fundus images via convolutional neural
network-vision transformer. Biomed. Signal Process. Contr. 88, 105607 (2024).
https://doi.org/10.1016/j.bspc.2023.105607
25. Chi, J., et al.: Low-dose CT image super-resolution with noise suppression based
on prior degradation estimator and self-guidance mechanism. IEEE Trans. Med.
Imaging (2024). https://doi.org/10.1109/TMI.2024.3454268

26. Jin, K., McCann, M., Froustey, E., Unser, M.: Deep convolutional neural network
for inverse problems in imaging. IEEE Trans. Image Process. 26, 4509–4522 (2017).
https://doi.org/10.1109/TIP.2017.2713099
27. Zhang, X., Zeng, H., Zhang, L.: Edge-oriented convolution block for real-time super
resolution on mobile devices. In: Proceedings of the 29th ACM International Con-
ference on Multimedia, pp. 4034–4043 (2021). https://doi.org/10.1145/3474085.
347529
28. Moen, T., et al.: Low-dose CT image and projection dataset. Med. Phys. 48, 902–
911 (2021). https://doi.org/10.1002/mp.14594
29. Ge, R., et al.: DDPNet: a novel dual-domain parallel network for low-dose CT
reconstruction. In: International Conference on Medical Image Computing and
Computer-Assisted Intervention, pp. 748–757 (2022). https://doi.org/10.1007/978-
3-031-16446-0_71
DehazeCLNet: A Contrastive Learning
Framework with Advanced Feature
Extraction for Image Dehazing

Pham Cong Thang(B) , Nguyen An Hung , Nguyen Quoc Cuong ,


and Phan Minh Nhat

The University of Danang–University of Science and Technology,


54 Nguyen Luong Bang Street, Danang, Viet Nam
[email protected]

Abstract. Recent advancements in image dehazing, especially within


indoor environments, are highlighted by the SOTS-Indoor dataset. Tradi-
tional methods, such as MSBDN, FFA-Net, DeHamer, and MAXIM-2S,
have employed techniques like feature fusion, attention mechanisms, and
multi-scale feature extraction to address the challenge of haze removal.
MSBDN attained a Peak Signal-to-Noise Ratio (PSNR) of 33.67 and a
Structural Similarity Index (SSIM) of 0.985 on the SOTS-Indoor dataset,
while FFA-Net improved these metrics to 36.39 and 0.989, respectively.
Subsequent models, including DeHamer and MAXIM-2S, continued to
enhance performance, achieving PSNR values of 36.63 and 38.11. This
study introduces DehazeCLNet, a novel model integrating contrastive
learning to enhance haze suppression capabilities. By incorporating con-
trastive loss and feature extraction across multiple depths, Dehaze-
CLNet achieves notable image restoration, with a PSNR of 42.57 and
SSIM of 0.996, surpassing previous methods. These results underscore
DehazeCLNet’s potential, establishing new benchmarks for indoor image
dehazing and suggesting promising directions for future research in haze
removal.

Keywords: Image dehazing · Contrastive learning · Image


Restoration · Attention mechanisms · Feature Extraction

1 Introduction
Image dehazing has garnered compelling attention recently due to its significance in various tasks, such as autonomous driving, aerial imaging, and outdoor
surveillance. The presence of haze, caused by the scattering of light by atmo-
spheric particles, severely degrades image quality by reducing contrast, blurring
textures, and distorting colors. Numerous dehazing methods have been devel-
oped, focusing on restoring image clarity and enhancing visibility under chal-
lenging weather conditions. Early methods depended on priors, such as the Dark
Channel Prior (DCP) [1] and atmospheric scattering models [2]. While effective

in certain conditions, these approaches often struggled with non-uniform haze


and introduced artifacts when applied to complex scenes.
More recent deep learning-based methods have significantly improved dehaz-
ing performance. For instance, the Multi-Scale Boosted Densely Connected Net-
work (MSBDN) [3] utilized dense connections to merge features from multiple
scales, leading to substantial improvements in image clarity. The Feature Fusion
Attention Network (FFA-Net) [4] employed channel and pixel attention mech-
anisms to selectively enhance important features, achieving strong results in
various dehazing tasks. Similarly, DeHamer [5] introduced a novel haze-aware
feature extraction approach, enhancing the network’s ability to handle complex
hazy environments.
However, despite these advancements, challenges remain, particularly in real-
world scenarios where models trained on synthetic datasets may struggle to gen-
eralize effectively. Additionally, existing models often fail to balance image clar-
ity with the preservation of essential details like texture and contrast, especially
under heavy haze conditions. The aim of this research is to improve upon existing
dehazing techniques by introducing a novel framework, DehazeCLNet, designed
to enhance image quality while preserving essential visual details. Our proposal
focuses on addressing the shortcomings of traditional models by integrating advanced atten-
tion mechanisms and contrastive learning, optimizing the dehazing process for
both synthetic and real-world conditions.
The contributions of this study are centered around four key innovations:

– Channel Attention Block (CAB): This module enhances critical image


features by dynamically assigning attention weights to the most relevant chan-
nels, thereby improving the overall clarity and detail of dehazed images.
– Double Convolution Block (DCB): The DCB facilitates the extraction of
both spatial and contextual features, enhancing the detection of fog-affected
regions and enabling sharper, clearer image restoration.
– Contrastive Learning: By employing contrastive learning, the network is
trained to effectively distinguish between hazy and clear regions, significantly
enhancing its generalization capability across varying fog levels. The details
of this approach are further discussed in Sects. 2.2 and 3.4.
– Custom Loss Function: A specialized loss function is introduced to ensure
that the dehazed output not only removes haze but also preserves the natural
contrast, color, and texture of the original image, as outlined in Sect. 3.

These innovations enable DehazeCLNet to outperform existing methods,


delivering superior dehazing results on both synthetic datasets and real-world
scenarios, thus pushing the boundaries of current image dehazing techniques.

2 Related Works
2.1 Image Dehazing
In recent years, image restoration, including dehazing, has gained substantial
attention due to its importance in enhancing vision-based tasks such as object

detection and recognition. Various dehazing techniques have been classified into
three primary approaches: image enhancement, image fusion, and image restora-
tion [6]. Authors in [7] introduced non-aligned supervision methods, providing
significant insights into new dehazing approaches.
Moreover, the dark channel prior method, developed in [8], has since become a
fundamental benchmark in dehazing research. Additionally, [7] further advanced
the field with non-aligned supervision methods, contributing to the exploration
of innovative dehazing strategies. Alongside dehazing, deep learning has enabled
significant improvements in related restoration tasks, including motion deblur-
ring [9] and defocus deblurring [8].
The atmospheric scattering model mathematically describes the formation
of hazy images as follows:
I_{hazy}(x) = I_{clean}(x) · τ(x) + A · (1 − τ(x)),

where I_{hazy}(x) is the observed hazy image, I_{clean}(x) stands for the scene radiance or clean image, A is the global atmospheric light, and τ(x) is the transmission map at spatial position x, defined as τ(x) = e^{−β·d(x)}, which quantifies the fraction of light reaching the camera through the hazy medium; β is the scattering coefficient related to the haze density, and d(x) represents the depth or distance between the object and the camera.
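As a hedged illustration, the scattering model above can be used directly to synthesize a hazy image from a clean image and a depth map; the atmospheric light and β values in the sketch are arbitrary examples, not values used in this work.

```python
# Hedged sketch of the atmospheric scattering model: I_hazy = I_clean * t + A * (1 - t),
# with transmission t = exp(-beta * d).
import numpy as np

def synthesize_haze(clean, depth, beta=1.0, atmospheric_light=0.9):
    """clean: (H, W, 3) image in [0, 1]; depth: (H, W) depth map in arbitrary units."""
    transmission = np.exp(-beta * depth)[..., None]   # tau(x) = e^{-beta * d(x)}
    return clean * transmission + atmospheric_light * (1.0 - transmission)

clean = np.random.rand(64, 64, 3)
depth = np.random.rand(64, 64) * 5.0
hazy = synthesize_haze(clean, depth)
```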

2.2 Contrastive Learning


Contrastive learning is a machine learning approach that excels in unsupervised
and semi-supervised settings. It has proven highly effective in enhancing model
performance, particularly in tasks such as image recognition, text analysis, and
clustering.

Supervised Contrastive Learning (SCL): Supervised Contrastive Learning


(SCL) uses labeled data, training the model on pairs of data points to differenti-
ate between similar and dissimilar instances. For example, the InfoNCE loss func-
tion is widely applied in SCL models to maximize the similarity between positive
pairs and minimize it for negative pairs. Studies such as [10] have demonstrated
that SCL achieves state-of-the-art results in image classification by leveraging
labels to enhance contrastive learning.

Self-Supervised Contrastive Learning (SSCL): Self-Supervised Contrastive Learning (SSCL) distinguishes itself from Supervised Contrastive Learning (SCL) by relying on unlabeled data. Instead of utilizing predefined labels, SSCL generates positive and negative pairs through pretext tasks to explore the underlying features of the data. This approach enables the model to learn meaningful representations from unlabeled data, improving its ability to generalize to unseen samples, particularly in scenarios where labeled data is scarce or unavailable.

SSCL has demonstrated impressive results across multiple fields, including computer vision and natural language processing. In computer vision, it excels
in tasks such as image classification, object detection, and image generation. In
natural language processing, SSCL is effectively applied to sentence representa-
tion learning, text classification, and machine translation. Overall, contrastive
learning provides a robust framework for leveraging unlabeled data, significantly
improving model performance even in the absence of labeled samples.

3 Proposed Method
The overall architecture of DehazeCLNet is outlined through Fig. 1(a). We then
provide a detailed explanation of the proposed modules within Groups 1, 2, and
3 (Fig. 1(b)), which include the Channel Attention Block (CAB) (Fig. 1(c)) and
the Dual Convolutional Block (DCB) (Fig. 1(d)). The discussion concludes with
an explanation of the loss function and the implementation of the Contrastive
Learning process.

3.1 Overall Architecture

Fig. 1. (a) The overall architecture of the proposed DehazeCLNet is designed for net-
work training using the Contrastive Learning method. (b) Groups 1, 2, 3 consist of
N blocks containing convolutional layers, channel attention blocks (CAB) and dual
convolutional blocks (DCB) to extract feature maps through various filters. (c) The
Channel Attention Block (CAB) improves feature learning by focusing on the most
relevant channels. (d) The Dual Convolutional Block (DCB) comprises two distinct
convolutional branches: one with smaller kernels to learn local features and the other
with larger kernels to capture global context.

Our proposed network is built on advanced feature learning mechanisms


through the use of blocks and groups of blocks. Specifically, within the Dehaze-
CLNet architecture, the network incorporates components such as Channel

Attention Blocks (Fig. 1(c)) and Dual Convolutional Blocks (Fig. 1(d)). These
components are strategically designed to capture and extract detailed information, thereby strengthening the network's performance in image restoration tasks.
Additionally, the network layers have been adapted to accommodate contrastive
learning tasks [11].
The architecture of our network (Fig. 1(a)) is organized into three groups,
each responsible for extracting features at different levels of granularity. These
extracted features are then concatenated into a unified feature map comprising
192 filters. This feature map is processed through the Channel Attention Block,
which analyzes the 192 channels to emphasize important feature representations.
Following this, the Dual Convolutional Block is employed to capture both local
and global features, further improving the network’s ability to generalize and
effectively process hazy regions, thereby enhancing dehazing performance.
Additionally, the network architecture includes 3 × 3 convolutional layers at
the input and output stages, along with skip connections, to prevent the loss of
critical information during forward propagation and preserve essential geometric
features throughout the model.

3.2 Channel Attention Block (CAB)

The Channel Attention Block (CAB) is a module composed of several layers


constructed to identify and emphasize the most important channels within the
feature map. As shown in Fig. 1(a), the input to our method consists of vari-
ous features, typically containing multiple channels. The CAB is responsible for
learning to highlight and amplify the influence of significant features, as not all
features contribute equally to the task at hand.
To explain each layer of the proposed CAB in detail, the Adaptive Average
Pooling layer [12] plays an essential role in extracting key features from different channels by summarizing global information for each channel. The 1 × 1
convolutional layers are then used to capture local features. These are followed
by a sigmoid activation function that normalizes the attention masks, where
each value ranges from 0 to 1, representing the importance of each channel.
Finally, these attention masks are multiplied by the input of the block to selec-
tively emphasize significant features. The specific operations are formalized in
the equations below:

CA(X) = σ(Conv(ReLU(Conv(F_{AvgPool}(X))))),
CAB(X) = X ∗ CA(X),     (1)

where F_{AvgPool}(X) computes the global average of each channel in the feature map; CA(X) applies convolutions, a ReLU activation, and a sigmoid normalization σ(x) = 1/(1 + e^{−x}) to create the attention mask; and CAB(X) enhances the feature map by applying the attention mask to emphasize the most relevant channels. The symbol ∗ denotes the element-wise multiplication of the original input feature map X and the channel attention map CA(X).
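A minimal PyTorch-style sketch of Eq. (1) is given below; the channel-reduction ratio inside the 1 × 1 bottleneck is an assumption, since it is not specified above.

```python
# Hedged sketch of the Channel Attention Block: global average pooling, a
# two-layer 1x1-conv bottleneck with ReLU, a sigmoid gate, and channel-wise rescaling.
import torch
import torch.nn as nn

class ChannelAttentionBlock(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # F_AvgPool: one value per channel
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # sigma: mask values in [0, 1]
        )

    def forward(self, x):
        return x * self.attention(x)                       # CAB(X) = X * CA(X)

cab = ChannelAttentionBlock(192)
out = cab(torch.randn(1, 192, 60, 60))
```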

3.3 Dual Convolutional Block (DCB)

The Dual Convolutional Block (DCB) consists of two parallel branches that
utilize distinct convolutional layers to process input feature maps, which have
already been weighted for importance by the Channel Attention Block (CAB),
as described in Sect. 3.1. Each branch is designed differently, focusing on specific
tasks to capture unique aspects of the features. These branches work concur-
rently to extract various characteristics from the input data.
Specifically, as shown in Fig. 1 (d), the right branch has a structure similar
to the CAB, but the final normalization step is omitted. This branch is primarily
responsible for learning local features. In contrast, the left branch consists of 3 × 3
convolutional layers with larger kernel sizes, aimed at capturing more complex
features, such as global contextual information.
Once the global context information, denoted as h̃, is obtained, the remaining information, 1 − h̃, is multiplied by the output of the right branch to extract finer details. Simultaneously, the left branch refines the global context by multiplying the feature map by h̃. These outputs are then fused to generate new feature maps
that encapsulate both generalized global context and critical local features.
For simplicity in implementation and configuration, the Dual Convolutional
Block (DCB) can be expressed as follows, in conjunction with Eq. (1):

F_{DCB}(X) = (1 − h̃) · CA(X) + h̃ · X,

where F_{DCB}(X) represents the output feature map of the Dual Convolutional Block (DCB) after processing the input feature map X. The term h̃ denotes the global context, while 1 − h̃ signifies the complement of the global context information, effectively capturing the local features that are not addressed by h̃.
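The blending rule can be sketched as follows; the depth of the 3 × 3 context branch, the sigmoid used to keep h̃ in [0, 1], and the exact form of the CAB-style right branch are assumptions made for illustration.

```python
# Hedged sketch of the DCB blending rule F_DCB(X) = (1 - h) * CA(X) + h * X.
import torch
import torch.nn as nn

class DualConvolutionalBlock(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Left branch: 3x3 convolutions producing the global-context map h (sigmoid is assumed).
        self.context = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Sigmoid(),
        )
        # Right branch: CAB-like path (global pooling + 1x1 convs, no final sigmoid).
        self.local = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):
        h = self.context(x)                 # global context map h
        ca = x * self.local(x)              # right-branch local features
        return (1 - h) * ca + h * x         # blend of local features and global context

dcb = DualConvolutionalBlock(192)
out = dcb(torch.randn(1, 192, 60, 60))
```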

3.4 Establishing the Contrastive Learning Process

In this study, we propose a solution that integrates Contrastive Learning into


the training process to enhance image dehazing capabilities. Before detailing
our methodology, it is essential to define certain theoretical concepts. Firstly,
the input for the task is a hazy image, while the desired output is a clear image,
referred to as the Positive Output. The Negative Model is defined as a previously
published dehazing model; for this task, we utilize the FFA model introduced by
Xu Qin et al. in 2020 [4], with its output regarded as the Negative Output. The
DehazeCLNet model serves as the anchor model, producing what we term the
Anchor Output. Our objective is to minimize the distance between the Anchor
Output and the Positive Output while maximizing the distance between the
Anchor Output and the Negative Output.
During the training process, both DehazeCLNet and the Negative Model are
simultaneously fed hazy images as input. To optimize the objectives set by our
team, an integrated loss function is constructed, which is detailed in Sect. 3.5.

3.5 Loss Function

In this study, the loss function is constructed by combining two distinct com-
ponents: the pixel loss function and the contrastive loss function. The pixel loss
function is used to compare the clear image with the dehazed output gener-
ated by the DehazeCLNet model. In contrast, the contrastive loss function is
designed to strengthen the model’s capability to distinguish between outputs.
It works by minimizing the distance between the DehazeCLNet’s output and
the clear image (positive image), while maximizing the distance between the
DehazeCLNet’s output and the output generated by the Negative Model.
The pixel objective function is based on the L_1 loss [8] and is specifically represented by the following equation:

L_{Pixel} = \|\hat{I}_a − I_p\|_1,     (2)

where I_p represents the clear image considered as the positive image, while \hat{I}_a denotes the reconstructed image from the anchor model (DehazeCLNet).
The contrastive loss function is formulated by calculating the distances
between the anchor image and the positive image, as well as between the anchor
image and the negative image, across various feature depth levels. A deep neural
network architecture is leveraged to extract features from input images at mul-
tiple depths, capturing feature sizes that range from large to small. Specifically,
we utilize the pre-trained VGG19 network [13] to perform feature extraction at
these varying depth levels.
Additionally, we define the influence weights of the feature maps at different depths by the subsequent equation:

W^i_{contrastive} = 2^{−(η−i)},     (3)

where η denotes the total number of blocks utilized in the computation, and i represents the index of the blocks within the VGG19 architecture. For the purpose of this study, we set η = 5, corresponding to the five convolutional blocks of the VGG19 network.
blocks of the VGG19 network.
Upon establishing the influence weights of the features at various depths, we
proceed to calculate the feature extraction for the input images, which encompass
the positive image, anchor image, negative image, and the input image:

\tilde{I}_a = F_{vgg19}(I_a),  \tilde{I}_p = F_{vgg19}(I_p),  \tilde{I}_n = F_{vgg19}(I_n),  \tilde{I}_x = F_{vgg19}(I_x).     (4)

In this context, I_{\{a,p,n,x\}} represents the input images, which include the positive image, anchor image, and negative image, while \tilde{I}_{\{a,p,n,x\}} denotes the output feature maps extracted at the five depth levels of the VGG19 network. The function F_{vgg19} is employed for this feature extraction process. By leveraging Eqs. (3), (4), and the L_1 loss function, we can derive the formula for the contrastive
loss function as follows:

L_{Contrastive} = \sum_{i=1}^{η} \frac{W^i_{contrastive} · L_1Loss(\tilde{I}^i_a, \tilde{I}^i_p)}{L_1Loss(\tilde{I}^i_a, \tilde{I}^i_n) · W^i_{negative} + L_1Loss(\tilde{I}^i_a, \tilde{I}^i_x) · W^i_{input}}.     (5)

In this equation, W^i_{negative} and W^i_{input} signify the influence weights of the
negative image concerning the image generated by the anchor model and the
input image relative to the image produced by the anchor model, respectively.
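A hedged sketch of Eqs. (3)-(5) is given below; the VGG19 layer indices chosen as block outputs and the per-depth values of W_negative and W_input (set to 1 here) are assumptions, since they are not listed above.

```python
# Hedged sketch of the depth-weighted contrastive loss over VGG19 features.
import torch
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)                  # the feature extractor is kept frozen

LAYER_IDS = [3, 8, 17, 26, 35]               # assumed outputs of the five VGG19 blocks

def vgg_features(x):
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in LAYER_IDS:
            feats.append(x)
    return feats

def contrastive_loss(anchor, positive, negative, hazy_input,
                     w_negative=1.0, w_input=1.0, eta=5):
    fa, fp, fn, fx = map(vgg_features, (anchor, positive, negative, hazy_input))
    loss = 0.0
    for i in range(eta):
        w = 2.0 ** (-(eta - (i + 1)))                        # Eq. (3): W^i = 2^{-(eta - i)}
        numerator = w * F.l1_loss(fa[i], fp[i])              # pull towards the clear image
        denominator = (F.l1_loss(fa[i], fn[i]) * w_negative  # push away from the negative
                       + F.l1_loss(fa[i], fx[i]) * w_input)  # and from the hazy input
        loss = loss + numerator / (denominator + 1e-8)
    return loss
```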
Having established the computation formulas for the pixel loss function as
delineated in Eq. (2) and the contrastive loss function as specified in Eq. (5),
the overall objective function for the task can be formulated as follows:

L_{CL} = L_{Pixel} + λ · L_{Contrastive}.

In this equation, the influence of the contrastive loss function on the training phase is indicated via the parameter λ; λ = 0.2 has been selected for the purposes of this study.

4 Experiments
4.1 Experiment Setup

In this experimental process, we used the REalistic Single Image DEhazing


(RESIDE) dataset [14], incorporating both the Indoor Training Set (ITS), which
contains 13,999 samples, and the Outdoor Training Set, comprising 72,135 sam-
ples. The images were resized to 240 × 240 pixels to facilitate more efficient
training. For evaluation, we used the Synthetic Objective Testing Set (SOTS).
The models and methods were implemented using PyTorch version 2.4.1 on the
CUDA 12.3 platform. All experiments were conducted on a system equipped
with an Intel Xeon E5-2676 processor, 32 GB of RAM, and an NVIDIA RTX
2060 Super GPU with 8 GB of memory, running Ubuntu Desktop 22.04 as the
operating system.
The loss function was configured following the description in Sect. 3.5, and
the Adam optimizer [15] was used during the training. To assess performance, we
employed two key metrics: Peak Signal-to-Noise Ratio (PSNR) [16] and Struc-
tural Similarity Index (SSIM) [17].

4.2 Experiment Results

The results in Table 1 present a comparative analysis of diverse dehazing methods


on the SOTS-Indoor and SOTS-Outdoor datasets, evaluated using PSNR and

SSIM metrics. DehazeCLNet demonstrates exceptional performance in indoor


scenes, achieving the top PSNR of 42.57 and SSIM of 0.996 on the SOTS-Indoor
dataset, substantially surpassing models like SFNet (PSNR: 41.24, SSIM: 0.996),
OKNet (PSNR: 40.79, SSIM: 0.996), and PMNet (PSNR: 38.41, SSIM: 0.990).
For outdoor scenes, while SFNet attains a slightly higher PSNR of 40.05 and
an SSIM of 0.996, DehazeCLNet remains competitive with a PSNR of 34.07
and SSIM of 0.986. These results highlight DehazeCLNet’s strong performance
in restoring image quality and structural integrity in indoor conditions and its
robustness in outdoor environments, underscoring its potential for generalized
application across diverse dehazing scenarios.

Table 1. Quantitative analysis of the DehazeCLNet on SOTS-Indoor and SOTS-Outdoor datasets.

Method SOTS-Indoor SOTS-Outdoor


PSNR↑ SSIM↑ PSNR↑ SSIM↑
MSBDN (CVPR 2020) [3] 33.67 0.985 33.48 0.982
FFA-Net (AAAI 2020) [4] 36.39 0.989 33.57 0.984
DeHamer (CVPR 2022) [5] 36.63 0.988 35.18 0.986
MAXIM-2S (CVPR 2022) [18] 38.11 0.991 34.19 0.985
PMNet (ECCV 2022) [19] 38.41 0.990 34.74 0.985
SFNet (ICLR 2023) [20] 41.24 0.996 40.05 0.996
OKNet (IAAA 2024) [21] 40.79 0.996 37.68 0.995
DehazeCLNet (ours) 42.57 0.996 34.07 0.986

In Fig. 2, it can be seen that the outputs of various models yield relatively
high results. However, our model, DehazeCLNet, demonstrates notably supe-
rior performance, achieving Peak Signal-to-Noise Ratio (PSNR) and Structural
Similarity Index (SSIM) values that are significantly elevated, with some exper-
iments even exceeding 45 dB. These experimental results indicate that Dehaze-
CLNet, developed in conjunction with the methods presented in our research,
has yielded impressive outcomes in the task of image dehazing on the SOTS-
Indoor dataset. This highlights the validity of our approach in restoring image
quality and enhancing visual clarity in challenging hazy conditions.

Fig. 2. Visual Comparison of DehazeCLNet with others on SOTS-Indoor dataset

5 Conclusion and Future Works

This study proposes the DehazeCLNet network, which incorporates groups and
blocks to enable advanced feature extraction. Additionally, a depth-wise loss
function is introduced, integrated with contrastive learning methods to enhance
the efficiency and performance of DehazeCLNet in image dehazing tasks. The
proposed network demonstrates relatively high performance, achieving superior
PSNR and SSIM scores. Future improvements could include replacing the Nega-
tive Model, currently FFA-Net, with a more effective alternative, and adjusting
the influence parameters of the contrastive loss function to further optimize the
dehazing process.

Acknowledgements. This work was supported by The University of Danang–


University of Science and Technology.

References
1. He, K., Sun, J., Tang, X.: Single image haze removal using dark channel prior.
IEEE Trans. Pattern Anal. Mach. Intell. 33, 2341–2353 (2010). https://doi.org/
10.1109/TPAMI.2010.168
2. Fattal, R.: Single image dehazing. ACM Trans. Graph. (TOG) 27, 1–9 (2008).
https://doi.org/10.1145/1360612.1360671
3. Hang, D., et al.: Multi-scale boosted dehazing network with dense feature fusion.
In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 2154–2164 (2020). https://doi.org/10.1109/CVPR42600.
2020.00223
4. Qin, X., Wang, Z., Bai, Y., Xie, X., Jia, H.: FFA-Net: feature fusion attention
network for single image dehazing. Proc. AAAI Conf. Artif. Intell. 34, 11908–11915
(2020)
5. Guo, C., et al.: Image dehazing transformer with transmission-aware 3D position
embedding. In: Proceedings of 2022 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 5802–5810. (2022). https://doi.org/10.1109/
CVPR52688.2022.00572
6. Wang, W., Yuan, X.: Recent advances in image dehazing. IEEE/CAA J. Automat-
ica Sinica 4, 410–436 (2017). https://doi.org/10.1109/JAS.2017.7510532
7. Fan, J., et al.: Non-aligned supervision for real image dehazing. ArXiv Preprint
arXiv:2303.04940 (2023). https://doi.org/10.48550/arXiv.2303.04940
8. He, X., Cheng, J.: Revisiting L1 loss in super-resolution: a probabilistic view and
beyond. ArXiv Preprint arXiv:2201.10084, pp. 1–13 (2023). https://doi.org/10.
48550/arXiv.2201.10084
9. Zamir, S., et al.: Restormer: efficient transformer for high-resolution image
restoration. In: Proceedings of IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 5718–5729. (2022). https://doi.org/10.1109/
CVPR52688.2022.00564
10. Khosla, P., et al.: Supervised contrastive learning. In: Proceedings of the 34th
International Conference on Neural Information Processing Systems, pp. 18661–
18673. (2020). https://doi.org/10.5555/3495724.3497291
11. Cheng, D., et al.: Progressive negative enhancing contrastive learning for image
dehazing and beyond. IEEE Trans. Multimedia 26, 8783–8798 (2024). https://doi.
org/10.1109/TMM.2024.3382493
12. Bieder, F., Sandkühler, R., Cattin, P.: Comparison of methods generalizing max-
and average-pooling. ArXiv Preprint arXiv:2103.01746, pp. 1–16 (2023). https://
doi.org/10.48550/arXiv.2103.01746
13. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. In: Proceedings of 3rd International Conference on Learning
Representations (ICLR 2015), pp. 1–14 (2015). https://doi.org/10.48550/arXiv.
1409.1556
14. Li, B., et al.: Benchmarking single-image dehazing and beyond. IEEE Trans. Image
Process. 28, 492–505 (2019). https://doi.org/10.1109/TIP.2018.2867951
15. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. ArXiv Preprint
arXiv:1412.6980, pp. 1–15 (2017). https://doi.org/10.48550/arXiv.1412.6980

16. Horé, A., Ziou, D.: Image quality metrics: PSNR vs. SSIM. In: 2010 20th Interna-
tional Conference on Pattern Recognition, pp. 2366–2369 (2010). https://doi.org/
10.1109/ICPR.2010.579
17. Nilsson, J., Akenine-Möller, T.: Understanding SSIM. ArXiv Preprint
arXiv:2006.13846, pp. 1–8 (2020). https://doi.org/10.48550/arXiv.2006.13846
18. Tu, Z., et al.: MAXIM: multi-axis MLP for image processing. In: Proceedings
Of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 5759–5770 (2022). https://doi.org/10.1109/CVPR52688.2022.00568
19. Ye, T., et al.: Perceiving and modeling density is all you need for image dehazing.
In: Computer Vision - ECCV 2022: 17th European Conference, Part XIX, pp.
130–145 (2021). https://doi.org/10.1007/978-3-031-19800-7_8
20. Cui, Y., et al.: Selective frequency network for image restoration. In: Proceedings
of 11th International Conference on Learning Representations, pp. 1–13 (2023)
21. Cui, Y., Ren, W., Knoll, A.: Omni-Kernel network for image restoration. Proc.
AAAI Conf. Artif. Intell. 38, 1426–1434 (2024). https://doi.org/10.1609/aaai.
v38i2.27907
Distortion-Resilient DIBR for Novel View
Synthesis from a Single Image

Yuchen Liu(B) , Eiji Kamioka , and Phan Xuan Tan(B)

Shibaura Institute of Technology, Tokyo, Japan


[email protected]

Abstract. It is challenging to render novel views from a single image


input due to inherent ambiguities in the geometry and texture infor-
mation of the desired scene. As a consequence, existing methods often
encounter various types of distortions in synthesized views. To this end,
we propose a distortion-resilient Depth-Image-Based Rendering (DIBR)
method for synthesizing novel views given a single image input. The pro-
posed method is qualitatively and quantitatively evaluated on the Real-
Estate 10K dataset, showing superior results compared to baselines.

Keywords: Novel View Synthesis · DIBR · Distortion Handling

1 Introduction
Novel view synthesis (NVS) has become a crucial technology in various fields,
including 3D modeling [1], autonomous driving [2], virtual reality [3], medi-
cal imaging [4], and industrial scanning [5]. It enables the rendering of three-
dimensional scenes from un-sampled viewpoints, enhancing the realism and inter-
activity of digital environments by assuming a static scene with a fixed light-
ing condition where the camera moves freely to illustrate the parallax changes
between frames, allowing the understanding of the spatial information of the 3-D
scene.
Most existing NVS methods rely on multiple input views to better recon-
struct the geometry proxies [6–9] or sample sufficient optical information [10–13].
However, obtaining multiple views is not always feasible due to several practical
constraints. In many real-world scenarios, only a single image may be available,
such as in surveillance footage, historical photographs, or casual daily photos.
Additionally, capturing multiple views often requires specialized equipment, like
multi-camera setups or depth sensors, which can be expensive and not readily
accessible. Single Image NVS, which is our research focus, in contrast, potentially provides a flexible and cost-effective solution. The challenge with a single image is the limited ability to perceive geometry information, together with unknown optical information (e.g., occluded areas). Traditional methods try to apply 3-
D Warping [14] to the input image with known depth to get a sparse novel
This work was supported by JSPS KAKENHI under Grant 24K20797.

view with significant distortions and fixing the distortion usually involves more
input images. FTV View Generation [15] first warps the depth and refines it to
re-sample the missing texture for its stereo input. 3D Photo Inpainting [16] esti-
mates a monocular depth and locates the depth discontinuous area, then inpaints
the disocclusion region from the neighbor context region. MPI [17] encodes every
input into fixed layers representation on pre-defined depths, it further models
the layer with another alpha channel to encode transparency to solve the depth
discontinuous caused by discrete layers.
In this paper, a comprehensive DIBR-based model that can handle different
types of distortions when synthesizing a novel view given a single image input,
is proposed. Specifically, when receiving a single arbitrary RGB image input
from the user, the proposed model estimates the depth and combines it with
input RGB to form an RGB-D as the input of the synthesizing procedure. In
our method, the scene is represented by two distinct layers: the foreground,
derived from the input RGB; and the background, assumed to constitute the
occluded areas. This representation allows us to deal with different distortion
types independently. A depth-guided segment-inpainting approach is used to
generate the background, effectively addressing disocclusions by filling in missing
textures with contextually appropriate information. For the foreground, a reverse
depth mapping test is proposed to re-sample the unknown pixels and generate
an alpha map simultaneously to enable a soft blend with the background. This
approach aims to restore texture in distorted areas and preserve the sharpness
of other regions.
Our proposed model is qualitatively and quantitatively evaluated using scenes
from the Real-Estate 10K dataset [18]. The results demonstrate sharper and
clearer synthesized images with better NIQE [19] and LPIPS [20] score.

2 Problem Statement
3D Warping is a universal approach for most DIBR methods to render novel
views. It has a clear physical meaning and a direct, fast computation procedure,
but it suffers from distortion problems which will be handled by our work.
Given an RGB-D image as input, 3D Warping (2) is a one-to-one mapping
algorithm composed of the camera pose RT and intrinsics K, where all points p from the input view are mapped to new positions p' in the novel view:

p' = W_3(p),  p = <x, y, z>,     (1)

[x', y', 1]^T = K ( (1/z') (RT)^{−1} K^{−1} ( z [x, y, 1]^T ) ).     (2)
3D Warping considers only the pixels sampled from the reference view and
thus suffers from severe distortions as shown in Fig. 1.
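For illustration, the warping of Eq. (2) can be sketched as below, with the pose written as a rotation R and translation t; the pinhole model and the column-vector convention are assumptions, and occlusion handling (Z-buffering) is omitted.

```python
# Hedged sketch of 3D warping: back-project pixels with K and depth z, apply
# the relative pose, and re-project into the novel view.
import numpy as np

def warp_points(u, v, z, K, R, t):
    """u, v: pixel coordinates (N,); z: depths (N,); K: 3x3 intrinsics;
    R, t: rotation (3x3) and translation (3,) from reference to novel view."""
    pix = np.stack([u, v, np.ones_like(u)], axis=0)   # homogeneous pixels, (3, N)
    cam_ref = np.linalg.inv(K) @ (pix * z)            # back-project: K^{-1}(z [u, v, 1]^T)
    cam_new = R @ cam_ref + t[:, None]                # apply the relative pose
    proj = K @ cam_new                                # re-project with the intrinsics
    z_new = proj[2]
    return proj[0] / z_new, proj[1] / z_new, z_new    # (u', v', d')

# Toy usage: an identity pose leaves the pixels unchanged.
K = np.array([[500., 0., 32.], [0., 500., 32.], [0., 0., 1.]])
u, v = np.meshgrid(np.arange(64.), np.arange(64.))
u2, v2, d2 = warp_points(u.ravel(), v.ravel(), np.ones(64 * 64), K, np.eye(3), np.zeros(3))
```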

Fig. 1. Warped “Vintage” image, 4 different distortions are observed

– Disocclusion: Foreground objects in the reference image occluded the tex-


ture of the background.
– Undersample: When the novel viewpoint changes (e.g., zoom in), the unsam-
pled points between pixels are exposed, often producing aliasing-like patterns.
– Out-of-F.o.V.: When the novel viewpoint changes (e.g., rotation), the
unknown areas that were outside of the reference view are exposed.
– Depth Errors: Depth maps are not accurately aligned with the real object (e.g.,
depth discontinuities), causing erroneous mappings called “ghosting pixels”.

3 Our Proposal
In this research, to tackle the above-listed distortions, we propose a comprehen-
sive DIBR-based model for synthesizing novel views from single input image. We
assume that the target novel views in a desired scene are composed of two images:
a foreground image .C[H×W ] and a background image .B[H×W ] . All information in
the foreground image is provided by the input view, while the background image
reflects the information of the occluded area. Figure 2 shows our pipeline which
synthesizes the foreground and background images simultaneously and blends
them into the final image .S[H×W ] .

3.1 Input and Pre-processing


We take an arbitrary RGB image as input and use a monocular depth estimation
model [21] to generate the corresponding normalized depth map .D[H×W ] . Since
there exists blurred error depth at the depth discontinuous which might cause
“ghosting pixels”, we apply a bilateral filter to the depth map to enhance edge
sharpness. Finally, an RGB-D input .(C[H×W ] , D[H×W ] ) can be formed.

Fig. 2. The pipeline of our proposed Single-View DIBR method. In our approach,
the scene is represented into foreground and background to efficiently handle different
types of distortions. The novel view is produced by soft blending the foreground and
background controlled by an alpha channel.

3.2 Foreground and the Alpha-Blending Map

We calculate the foreground image of the novel view using 3D warping and Z-
buffering with all the points p_{(uv)} = <u, v, d> from the input image, where <u, v> stands for the coordinates in image space and d is the depth value at that point, d = D_{(uv)}:

<u', v', d'> ∼ p'_i = W_3(p_i),  i ∈ [H, W],
\hat{C}_{(u'v')} = C_{(uv)}.     (3)

We consider two types of distortions in the novel view: disocclusions and undersampled areas. The simplest way to handle these is to interpolate distorted
areas using neighboring pixels. However, this approach cannot effectively distin-
guish disocclusions, resulting in incorrect filling.
To address this issue, we propose a quick reverse depth mapping test as
illustrated in Fig. 3. First, we perform local minimum value filling on the warped
depth map .D̂ in the novel view. We record the positions of unknown points and
then fill each unknown point’s depth with the minimum depth value .dmin within
a searching window .W nearby.

\hat{D} = −1_{[H×W]},  \hat{D}_{(u'v')} = d',
p_{vm} = <u', v', d_{min}>, where \hat{D}_{(u'v')} = −1,     (4)
d_{min} = \min_{i ∈ (u'v') ± W} \hat{D}_i.

By doing this, we create a virtual mapping .pvm over the distorted areas, the
mapping’s position depends on the depth relationship between the occluded
area and the occluder. In undersampled areas, this mapping will be very close
to the local samples, almost negligible.
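A minimal sketch of this local-minimum filling step is given below; the use of -1 as the unknown marker follows Eq. (4), while the window size and the brute-force loop are illustrative simplifications.

```python
# Hedged sketch of local-minimum depth filling over the warped depth map.
import numpy as np

def fill_unknown_depth(depth_warped, window=3):
    """depth_warped: (H, W) warped depth map with -1 marking unknown positions."""
    H, W = depth_warped.shape
    filled = depth_warped.copy()
    unknown = np.argwhere(depth_warped == -1)
    for y, x in unknown:
        y0, y1 = max(0, y - window), min(H, y + window + 1)
        x0, x1 = max(0, x - window), min(W, x + window + 1)
        known = depth_warped[y0:y1, x0:x1]
        known = known[known >= 0]
        if known.size:                     # d_min over the known depths inside the window
            filled[y, x] = known.min()
    return filled, unknown                 # the unknown positions feed the reverse mapping

# Toy usage: two known samples, everything else unknown.
d = -np.ones((4, 4)); d[0, 0], d[3, 3] = 2.0, 1.0
filled, holes = fill_unknown_depth(d)
```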

Fig. 3. Proposed reverse depth mapping test.

After filling all unknown depths, we compute the inverse 3D warping W_3^{−1}, projecting the filled points p_{vm} from the novel view back to the reference view:

<u_{rm}, v_{rm}, d_{rm}> ∼ p_{rm} = W_3^{−1}(p_{vm}).     (5)
At this point, the virtual mapping overlaps with certain points in the reference view; these overlapped points are called the reverse mapping. Based on the reverse mapping <u_{rm}, v_{rm}>, we resample the RGB value C_{(uv)_{rm}} and the depth value D_{(uv)_{rm}}. The RGB values of the reverse mapping are directly mapped to the previously recorded unknown points,

\hat{C}_{(uv)_{vm}} = C_{(uv)_{rm}},     (6)
while the re-sampled depth values are compared with the depth from virtual
mapping to generate a depth difference map for the previously recorded unknown
points.

diff_{(uv)_{vm}} = |D_{(uv)_{rm}} − d_{rm}|.     (7)

This depth difference exists in the distorted regions throughout the novel view; it reflects the relative gradient changes in the occluded regions and main-
tains global continuity. We found this information can be well-modeled as an
alpha map to control the soft-blending ratio between the background and fore-
ground. The absolute values of the depth difference vary depending on the com-
plexity of the scene and the camera poses in the novel views. However, for nor-
malized scene and camera matrices, we found the distribution pattern of the
depth difference shares consistency when the same camera movement happens
in different scenes; thus, we normalize the depth-diff map to an alpha map with range [0, 1]:

a_{[H×W]} = norm(diff_{[H×W]}).     (8)

3.3 Background and Alpha Compositing


To generate disocclusion information for a scene, we utilize the image inpainting
technique. Most image inpainting methods require input in the form of a source
RGB and a mask of the areas of interest. To generate occluded information for
the entire scene, we need to mask and inpaint all objects. This process must be
guided by depth information to ensure the correct inpainting order; otherwise,
texture from objects closer to the camera may erroneously fill in areas behind
them. To address this issue, we employ a segmentation model [22] to perform
a global segmentation of the entire scene. Since disocclusion arises from depth
discontinuities, our segmentation of objects in the scene is solely based on depth
information.
After obtaining the segments, we sort them based on the maximum depth
within each segment. Using the generated masks, we then perform Image Inpaint-
ing [23] on the input RGB-D image in sequence according to their depth order as
shown in Fig. 4, ultimately creating a global background of the scene in RGB-D
format. When synthesizing novel viewpoints, both the background and fore-
ground are computed using the same 3D-Warping matrix. Because the recon-
structed background includes depth information for the occluded regions, these
areas can be filled in a manner that more accurately reflects parallax changes.
The warped background image \hat{B}_{[H×W]}, like the foreground, may exhibit dis-
tortions. We address distortions in the background by applying nearest-neighbor
interpolation. Finally, we use alpha compositing to softly blend the background
and foreground using the generated alpha map.

S_{[H×W]} = (1 − a_{[H×W]}) ⊙ \hat{C}_{[H×W]} + a_{[H×W]} ⊙ \hat{B}_{[H×W]}.     (9)
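The background generation and the blending of Eq. (9) can be sketched as follows; segment_masks and inpaint are hypothetical stand-ins for the segmentation and inpainting models cited above, and the ascending depth ordering of the inpainting loop is an assumption.

```python
# Hedged sketch of depth-ordered inpainting of the background and alpha compositing.
import numpy as np

def build_background(rgb, depth, segment_masks, inpaint):
    """segment_masks: list of boolean (H, W) masks; inpaint(image, mask) -> image
    is a stand-in for the inpainting model used in the paper."""
    # Process segments ordered by their maximum depth (ascending order is assumed),
    # so that each occluded region is filled from the remaining context.
    ordered = sorted(segment_masks, key=lambda m: float(depth[m].max()))
    background = rgb.copy()
    for mask in ordered:
        background = inpaint(background, mask)
    return background

def alpha_composite(foreground, background, alpha):
    """Eq. (9): S = (1 - a) . C_hat + a . B_hat, applied pixel-wise."""
    a = alpha[..., None]                        # broadcast the (H, W) alpha over RGB
    return (1.0 - a) * foreground + a * background

# Toy usage with a trivial inpainting stand-in that fills masked pixels with the image mean.
H, W = 32, 32
rgb, depth = np.random.rand(H, W, 3), np.random.rand(H, W)
masks = [np.zeros((H, W), bool)]; masks[0][8:16, 8:16] = True
bg = build_background(rgb, depth, masks, lambda img, m: np.where(m[..., None], img.mean(), img))
out = alpha_composite(rgb, bg, np.random.rand(H, W))
```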

4 Evaluation
To evaluate the performance of our NVS method, we apply it, together with the baselines to be compared, to the Real-Estate 10K dataset [18]. This dataset contains
video frames with corresponding camera matrices (Intrinsics, Poses) in static
scenes with fixed illumination. We use the first frame as the input view and try
to synthesize the novel views at the corresponding camera of other frames, then
compare it with the actual image of that frame (Ground Truth).

4.1 Perceptional Visual Quality Metric


Classic evaluation metrics are challenging to reflect the perceived visual quality
[24]. To best align our result with evaluated scores, the Natural Image Qual-
ity Evaluator(NIQE) [19] and the Learned Perceptual Image Patch Similar-
ity(LPIPS) [20] are used. Since NIQE is a no-reference evaluation metric, we
cannot confirm the differences of the compared results relative to the ground
truth. Therefore, we utilize the ground truth image to standardize the NIQE
scores.

Fig. 4. The segment-inpainting procedure to generate a background RGB-D.

NIQE_{stand} = \frac{2}{π} arctan\left( \frac{|NIQE_{pred} − NIQE_{gt}|}{NIQE_{gt}} \right).     (10)
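A small helper implementing Eq. (10) might look as follows; the NIQE scores themselves are assumed to come from an external no-reference quality estimator.

```python
# Hedged sketch of the standardized NIQE: the absolute gap to the ground-truth
# NIQE is mapped into [0, 1) with an arctan.
import math

def standardized_niqe(niqe_pred, niqe_gt):
    return (2.0 / math.pi) * math.atan(abs(niqe_pred - niqe_gt) / niqe_gt)

print(standardized_niqe(4.2, 3.5))   # toy values
```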

4.2 Scale Ambiguity

Monocular vision cannot perceive the scale of a scene; this results in the relative pose of the novel view not matching the real pose of the ground truth. Therefore, we utilize the "scale-invariant loss" method used in MPI [17] to calculate a factor σ to correct the normalized depth to the correct scale.

4.3 Results

We compared our method with Raw-DIBR (3D Warping + Z-Buffering) and MPI [17] on 8 randomly selected scenes (A-H). As visually evaluated in Fig. 5,

Table 1. The average standard-NIQE score on different scenes.

Stand-NIQE (.↓) A B C D E F G H
RAW-DIBR 0.863 0.75 0.747 0.771 0.638 0.870 0.435 0.164
MPI 0.217 0.122 0.189 0.213 0.117 0.238 0.232 0.217
OURS 0.184 0.043 0.036 0.11 0.05 0.027 0.114 0.087

Table 2. The average LPIPS(Alex) score on different scenes.

LPIPS(Alex) (.↓) A B C D E F G H
RAW-DIBR 1.053 1.031 1.04 1.104 1.131 1.010 0.88 0.512
MPI 0.361 0.238 0.278 0.424 0.334 0.242 0.315 0.481
OURS 0.331 0.206 0.23 0.485 0.303 0.205 0.254 0.359

Fig. 5. The visual quality compared to Raw-DIBR, MPI, and Ground Truth.

our method shows the capability to fix the distortion and adaptively restore the
disocclusion, while maintaining the sharpness of the novel view. For quantitative
evaluation, we calculate the average stand-NIQE and LPIPS score from all the
frames of every scene to show the overall performance as in Tables 1 and 2, our
method shows a relatively lower error along 8 different scenes. Figure 6 shows the
average stand-NIQE from all the scenes of every frame(In all 8 scenes, the camera
is moving steadily away from the position of the first frame). It is observed when
the target camera baseline increases, all the methods tend to generate larger
errors on the novel view. Our method maintained a relatively low error in all the
frames and a steady slope when the baseline increased.

Fig. 6. The average standard NIQE score at different frames, reflecting the error that grows in the novel views when rendering at a wider baseline.

5 Conclusion
We have proposed a DIBR-based model to tackle the distortions from the syn-
thesized views given a single image input, which is challenging in traditional
approaches. Our proposed method models the scene representation into 2 lay-
ers to handle different distortions in foreground and background, and then they
are further softly blended by an alpha map generated by the proposed reverse
depth mapping test. Experimental results show that our method is able to recover unknown textures (e.g., disocclusions) while maintaining the sharpness of the novel image; this performance is also reflected in the quantitative evaluations on the dedicated dataset.

References
1. Verykokou, S., Ioannidis, C.: An overview on image-based and scanner-based 3D
modeling technologies. Sensors 23(2), 596 (2023)
2. Cheng, J., et al.: A review of visual SLAM methods for autonomous driving vehi-
cles. Eng. Appl. Artif. Intell. 114, 104992 (2022)
3. Fachada, S., et al.: Depth image based view synthesis with multiple reference views
for virtual reality. In: 2018-3DTV-Conference: The True Vision-Capture, Transmis-
sion and Display of 3D Video (3DTV-CON). IEEE (2018)
4. Wolterink, J.M., et al.: Deep MR to CT synthesis using unpaired data. In: Simula-
tion and Synthesis in Medical Imaging: Second International Workshop, SASHIMI
2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, Septem-
ber 10, 2017, Proceedings 2. Springer International Publishing (2017)
5. Usamentiaga, R., Molleda, J., García, D.F.: Fast and robust laser stripe extraction
for 3D reconstruction in industrial environments. Mach. Vis. Appl. 23, 179–196
(2012)
6. Dyer, C.R.: Volumetric scene reconstruction from multiple views. In: Foundations
of Image Understanding. Boston, MA: Springer US, pp. 469–489 (2001). https://
doi.org/10.1007/978-1-4615-1529-6_16
7. Sinha, S., Steedly, D., Szeliski, R.: Piecewise planar stereo for image-based render-
ing. In: 2009 International Conference on Computer Vision (2009)
8. Penner, E., Zhang, L.: Soft 3D reconstruction for view synthesis. ACM Trans.
Graph. (TOG) 36(6), 1–11 (2017)
9. Hedman, P., et al.: Casual 3D photography. ACM Trans. Graph. (TOG) 36(6),
1–15 (2017)
10. Buehler, C., et al.: Unstructured lumigraph rendering. In: Seminal Graphics
Papers: Pushing the Boundaries, vol. 2, pp. 497–504 (2023)
11. Kalantari, N.K., Wang, T.C., Ramamoorthi, R.: Learning-based view synthesis for
light field cameras. ACM Trans. Graph. (TOG) 35(6), 1–10 (2016)
12. Mildenhall, B., et al.: Local light field fusion: practical view synthesis with pre-
scriptive sampling guidelines. ACM Trans. Graph. (TOG) 38(4), 1–14 (2019)
13. Mildenhall, B., et al.: NeRF: representing scenes as neural radiance fields for view
synthesis. Comm. ACM 65(1), 99–106 (2021)
14. Mark, W.R., McMillan, L., Bishop, G.: Post-rendering 3D warping. In: Proceedings
of the 1997 symposium on Interactive 3D graphics (1997)
15. Mori, Y., et al.: View generation with 3D warping using depth information for
FTV. Signal Process. Image Comm. 24(1-2), 65–72 (2009)
16. Shih, M.L., et al.: 3D photography using context-aware layered depth inpainting.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (2020)
17. Tucker, R., Snavely, N.: Single-view view synthesis with multiplane images. In: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (2020)
18. Zhou, T., et al.: Stereo magnification: learning view synthesis using multiplane
images. arXiv preprint arXiv:1805.09817 (2018)
19. Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image
quality analyzer. IEEE Signal Process. Lett. 20(3), 209–212 (2012)
20. Zhang, R., et al.: The unreasonable effectiveness of deep features as a perceptual
metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (2018)

21. Yang, L., et al.: Depth anything: unleashing the power of large-scale unlabeled
data. arXiv preprint arXiv:2401.10891 (2024)
22. Kirillov, A., et al.: Segment anything. In: Proceedings of the IEEE/CVF Interna-
tional Conference on Computer Vision (2023)
23. Suvorov, R., et al.: Resolution-robust large mask inpainting with Fourier convo-
lutions. In: Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision (2022)
24. Lin, W., Kuo, C.C.J.: Perceptual visual quality metrics: a survey. J. Vis. Comm.
Image Representation 22(4), 297–312 (2011)
Towards Real-Time Open World Instance
Segmentation

Bao Ly Tran Hoang1(B) , Minh Le Thanh1(B) ,


and Khanh-Duy Nguyen1,2
1 University of Information Technology, Ho Chi Minh City, Vietnam
[email protected]
2 Vietnam National University, Ho Chi Minh City, Vietnam

Abstract. Instance segmentation is a common task in computer vision


specifically, and computer science in general. Its applications are widely
used in areas such as autonomous driving and automotive systems.
However, current instance segmentation models are often limited, as
they only perform well on fixed training sets. This creates a signifi-
cant challenge in real-world applications, where the number of classes
is strongly dependent on the training data. To address this limitation,
we propose the concept of Open World Instance Segmentation (OWIS)
with two main objectives: (1) segmenting instances not present in the
training set as an “unknown” class, and (2) enabling models to incre-
mentally learn new classes without forgetting previously learned ones,
with minimal cost and effort. These objectives are derived from the open-
world object detection task [12]. We also introduce new datasets fol-
lowing a novel protocol for evaluation, along with a strong baseline
method called ROWIS (Real-Time Open World Instance Segmentor),
which incorporates an advanced energy-based strategy for unknown class
identification. Our evaluation, based on the proposed protocol, demon-
strates the effectiveness of ROWIS in addressing real-world challenges.
This research will encourage further exploration of the OWIS problem
and contribute to its practical adoption. Our code is published at
https://github.com/4ursmile/ROWIS.

Keywords: Open world · Instance segmentation · Incremental


learning · Dataset · Evaluation method

1 Introduction
Instance segmentation is a crucial task in computer vision with wide-ranging
applications in fields such as education, medicine, and autonomous driving. How-
ever, traditional deep learning models for instance segmentation face limitations
due to their reliance on fixed training sets, reducing their effectiveness in dynamic
real-world scenarios.
To address these limitations, the field of open-world learning has emerged,
introducing approaches like novel feature representations and text-based features

to enhance the adaptability of deep learning models. This concept originated


from the open-world object detection (OWOD) problem [10, 12].
While some recent works have started exploring instance segmentation within
an open-world context, none have fully addressed both critical challenges: (1)
segmenting instances not present in the training set by assigning them to an
‘unknown’ class, and (2) enabling models to incrementally learn new classes
without forgetting previously learned ones. Some approaches [28, 32] perform
segmentation without assigning class labels to the segmented instances. Others
focus solely on challenge (1), such as [29], while still others [15] build upon
open-world object detection models and add segmentation as a post-processing
step.
The primary challenge in Open World Instance Segmentation (OWIS) is seg-
menting and identifying “unknown” instances not present in the training classes,
where the Foreground (FG), comprising both known and unknown instances, is
distinct from the background. Recent works [35] employ two prediction heads:
one to detect FG objects and another to classify them if they belong to known
classes. Due to the lack of ground truth for unknowns, unsupervised methods
like clustering with Mahalanobis distance are used to maximize latent space sep-
aration between FG and background. Energy-based models [12] address this by
creating pseudo-classes during training, transforming the task into a supervised
learning problem and enabling a multi-layer perceptron (MLP) objectness head
to identify FG objects. We also implement a dynamic selection strategy during
training to enhance FG-background separation without introducing noise into
the predictions.
Another significant challenge is incremental learning. Traditional deep learning models, designed for closed-world problems, suffer from catastrophic forgetting when new classes are introduced, often due to fine-tuning, leading them
method involving masking during training and a language model-inspired tun-
ing technique that uses a temperature parameter to adjust class distributions. In
this paper, we combine advancements in instance segmentation and open-world
object detection to develop an end-to-end OWIS model and introduce a new
evaluation protocol, including a dataset derived from MS-COCO [14] (Fig. 1).
Instance segmentation, requiring pixel-level masks rather than bounding
boxes, is more complex than object detection, especially when handling unknown
objects in open-world settings. To address this challenge, we enhance tradi-
tional closed-set methods by integrating an efficient CNN architecture with
self-attention mechanisms inspired by models like Segment Anything [13]. This
approach enables accurate mask generation for unknown instances while mini-
mizing performance loss. Focusing on real-time performance, we prioritize speed
and reliability over state-of-the-art accuracy, utilizing SparseInst [5] as our base
model.
The key contributions of our work are as follows:

– Firstly, we define a clear problem for open-world instance segmentation


(OWIS), establishing a foundation for future research in this domain.

Fig. 1. Example output of our enhanced model: Highlighting both known and unknown
instances in a single image, demonstrating the capability of real-time open-world
instance segmentation.

– Secondly, we provide a comprehensive dataset along with a robust evaluation


protocol specifically designed for addressing the OWIS problem.
– Finally, we propose a novel method, ROWIS, which is among the first solu-
tions for OWIS, demonstrating its strong potential for real-world applications.

2 Related Works
2.1 Open World Object Detection (OWOD)

Open-World Object Detection (OWOD) has made significant strides in address-


ing real-world challenges where new and unknown objects appear frequently. Tra-
ditional object detection models assume all object classes are known, which lim-
its their performance in dynamic environments. Open-set detection approaches
[18, 19] address this by introducing mechanisms to identify “unknown” objects,
and incremental learning techniques [17, 23] have been developed to add new
object classes without forgetting previously learned ones. These methods have
been applied successfully in OWOD, particularly through models like Faster
R-CNN [24] and DETR [4].
However, instance segmentation in the open-world context remains less
explored. Instance segmentation presents additional complexity over object
detection, as it requires pixel-level object masks, including for “unknown”
instances. While some works [28, 32] have attempted to adapt OWOD tech-
niques by adding segmentation as a post-processing step, these approaches are
not designed for segmentation-specific tasks and often result in performance lim-
itations.
Our work introduces a novel approach specifically designed for open-world
instance segmentation (OWIS), focusing solely on segmentation rather than
detection. Unlike prior approaches that rely on unsupervised techniques for
foreground-background separation, we leverage an energy-based supervised
method to train a multi-layer perceptron (MLP) objectness head. This approach
improves the model’s ability to accurately differentiate between foreground (both

known and unknown instances) and background, delivering better results than
unsupervised methods commonly used in earlier works.
Incremental learning remains a challenge in OWIS, particularly due to the
complexity of generating accurate segmentation masks for new classes without
degrading performance on previously learned ones. Our approach addresses this
with dynamic training strategies, ensuring that the model can continuously learn
new classes while preserving its segmentation capabilities for known objects.
Real-time instance segmentation models must balance fast inference (over 30
FPS) with reasonable accuracy, typically measured by COCO mAP@50-95 scores
ranging from 24 to 40. The YOLO family, including models like YOLACT [3]
and maYOLACT [20], is known for prioritizing speed while delivering sufficient
accuracy, making them popular for applications requiring high frame rates.
Nevertheless, for our needs, SparseInst [5] provides a better balance between
speed and segmentation quality. With a ResNet-50 backbone, it achieves a mAP
of 32.8, placing it among the higher-performing real-time models while maintain-
ing competitive speed. Additionally, SparseInst’s flexible architecture, similar to
D-DETR [34], allows for future improvements, such as domain adaptation or
enhanced segmentation, without sacrificing real-time performance. This makes it
not only strong in its current form but also adaptable to open-world segmenta-
tion challenges.

2.2 Self-attention Enhancing Convolution

Self-attention mechanisms, initially introduced in natural language process-


ing (NLP) tasks [7, 26], have emerged as the backbone of many state-of-the-
art (SOTA) models in computer vision due to their ability to capture global
contextual information and semantic relationships across an image. Models
like CLIP [22], Vision Transformer (ViT) [8, 30], and Detection Transformer
(DETR) [4, 33, 34] have successfully applied self-attention to achieve state-of-
the-art results in object detection and instance segmentation tasks.
However, while self-attention helps overcome the locality limitations of tra-
ditional convolutional networks (CNNs), it often comes with significant compu-
tational overhead. To address this challenge, research has focused on combining
self-attention with CNNs to balance performance and computational efficiency.
Early efforts, such as Squeeze-and-Excitation (SE) [11], BAM [21], and CBAM
[21], introduced attention mechanisms to enhance CNNs by reweighting spa-
tial and channel-wise features, helping to extend their receptive field. Similarly,
models like AA-ResNet [1] and BoTNet [25] augment CNNs with self-attention
layers, improving feature extraction and task performance.
More recent works have aimed to further integrate self-attention mechanisms
into convolutional networks to boost both speed and accuracy [6, 27]. These
approaches exploit the strengths of both paradigms to capture long-range depen-
dencies while keeping computational demands manageable.
Based on these insights, we apply self-attention mechanisms within a CNN
architecture, focusing on achieving higher accuracy with minimal computational

overhead. By embedding self-attention alongside fully convolutional computa-


tions, we aim to maintain high performance in tasks like instance segmentation
while minimizing the trade-off in processing speed.

3 Problem Statement
In the context of open-world instance segmentation, during training, a model f is trained on a dataset D = {I, Y}, which contains K known instance classes. The dataset comprises N images and corresponding labels, where I = {I_1, I_2, ..., I_N} represents the images, and Y = {Y_1, Y_2, ..., Y_N} represents the labels. Each label Y_i, i ∈ [1, 2, ..., N], consists of J annotated instances, denoted as Y_i = {y_1, y_2, ..., y_J} ⊂ Y, where each y_j is an instance label.
Each instance label contains two parts:
1. A mask, which is a set of points defining the boundary of the instance.
2. An instance class label l_j ∈ {0, 1}^K, which is a one-hot vector encoding the class of the instance.
We extend this framework to the open-world setting, following the formulation introduced by Joseph et al. [12], which builds upon the work of Bendale and Boult [2] by enabling models to dynamically update themselves in an incremental fashion over multiple training episodes. At a given task or time t, there are K_t known in-distribution classes, and the dataset D_t = {I_t, Y_t} consists of N_t images and labels. The object class label is now a (K_t + 1)-dimensional vector l_j ∈ {0, 1}^{K_t + 1}, where the additional dimension represents unknown instances.
While there may be an unbounded number of unknown classes, a subset U_t denotes the unknown classes of interest that we aim to segment. When the model identifies unknown instances, they are sent to an oracle (e.g., a human annotator) for labeling. The newly labeled objects are then used to generate a new dataset D_{t+1}, which contains instances of the U_t newly introduced instance classes.
The model is then updated using the new dataset D_{t+1}, the current model f_t, and a limited subset of the previous datasets D_i, where i ∈ {0, 1, ..., t}, to produce an updated model f_{t+1} capable of segmenting K_{t+1} = K_t + U_t instance classes. This cycle can repeat as necessary. Note that "unknown" is not treated as a class, and there are no annotations for instances that are not introduced at time t.
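To make the episodic setup concrete, the toy sketch below shows how the class list could grow across tasks with a reserved "unknown" slot; the function and class names are purely illustrative and not part of our implementation.

```python
# Illustrative sketch (not the authors' code) of the episodic label space:
# at task t the classifier outputs the K_t known classes plus one "unknown" slot.
from typing import List

def build_label_space(known_classes: List[str], new_classes: List[str]) -> List[str]:
    """Return the class list for the next task: old classes, new classes, then 'unknown'."""
    return known_classes + new_classes + ["unknown"]

task1 = build_label_space([], ["person", "car", "dog"])          # K_1 = 3 (+ unknown slot)
task2 = build_label_space(task1[:-1], ["backpack", "umbrella"])  # K_2 = K_1 + |U_1|
print(task2)  # ['person', 'car', 'dog', 'backpack', 'umbrella', 'unknown']
```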

4 Proposed Method
For our Real-Time Open-World Instance Segmentation (ROWIS) model, we
extend SparseInst [5], leveraging its Sparse Instance Activation Map (IAM) for
instance representation. To adapt it to the open-world setting, we introduce a
novel approach.
Like PROB [35], we use separate objectness and class heads for novel objects.
Figure 2 shows the architecture, where both heads are trained and used during

inference. The objectness head employs an advanced energy-based method to


boost confidence, eliminating the need for unknown instance annotations.
Additionally, in Fig. 3, we demonstrate how we apply self-attention in the
mask generation branch. This approach is designed to enhance the model’s
semantic understanding, enabling it to produce better masks for unknown
objects.

Fig. 2. Our method builds on the efficient encoder-decoder framework of SparseInst [5].
The decoder has two branches: one for generating masks and another for instance acti-
vation maps. Each map is processed by three heads: the kernel head (for mask mul-
tiplication), the classification head (for instance class prediction), and the objectness
head (for foreground-background classification). To handle unknown objects, we train
the objectness head with an advanced energy-based approach using both matched and
unmatched examples. Additionally, a Self-Attention CNN Module is applied to the
mask branch, enhancing semantic understanding and mask quality.

4.1 Multi-layer Perceptron Objectness

SparseInst employs an Instance Activation Map (IAM) to represent instances


within an image. Similar to D-DETR, which uses a set of N_query embeddings
for each image, SparseInst generates a fixed set of IAMs for every image. Each
IAM is then treated as an embedding and passed through three separate heads:
a kernel head, a classification head, and an objectness head. Since the number
of IAMs is typically larger than the actual number of instances in an image,
the objectness head is used to filter out activation maps that do not correspond
to foreground objects. This mechanism effectively addresses the challenges of
open-world instance segmentation, where we directly utilize the objectness head
to handle the foreground-background problem.

Let m denote the instance activation map, i the foreground instance, and c the class of the instance. Traditional approaches aim to answer the question: “Does this instance activation map correspond to an instance of a specific class?” However, inspired by [35], we decouple instance prediction i and instance class prediction c|i, treating them independently during training and inference. The objectness now becomes p(o|m), and the instance class prediction becomes p(c|i, m).
The objectness head f_o^t(m) is used to predict the probability that an instance activation map corresponds to a foreground object, while the classification head f_c^t(m) predicts the class of the activation map, assuming it corresponds to a foreground object. The final prediction can be expressed as:

p(c|q) = f_c^t(m) · f_o^t(m)    (1)

where t represents the t-th task.
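For readability, Eq. (1) can be sketched as a simple multiplication of the two head outputs per IAM embedding; the head definitions below are illustrative PyTorch code, not the released ROWIS implementation, and the layer sizes are assumptions.

```python
# Minimal PyTorch-style sketch of the decoupled prediction in Eq. (1).
import torch
import torch.nn as nn

class DecoupledHeads(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 20):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes)  # f_c: class scores per IAM
        self.obj_head = nn.Sequential(               # f_o: foreground probability per IAM
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, iam_embeddings: torch.Tensor) -> torch.Tensor:
        # iam_embeddings: [batch, num_iams, dim]
        cls_prob = self.cls_head(iam_embeddings).sigmoid()   # p(c | i, m)
        obj_prob = self.obj_head(iam_embeddings).sigmoid()   # p(o | m)
        return cls_prob * obj_prob                           # Eq. (1): joint score

scores = DecoupledHeads()(torch.randn(2, 100, 256))
print(scores.shape)  # torch.Size([2, 100, 20])
```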

4.2 Training the Objectness Head


Traditional objectness head training relies solely on matched pairs. To adapt this for open-world domains, we introduce an advanced energy-based approach. For each task t, we define a threshold α_t. For any unmatched IAM, if its confidence is above α_t, we treat it as likely representing an object and continue using it to train the objectness head as if it were a matched IAM.
For unmatched IAMs with confidence below α_t, we assign them a very small label, treating them as noise. This ensures that we do not lose potentially important information, such as misclassified foreground instances.
The loss function for the objectness head is defined as:

L_o^t = L_o,matched + L_o,unmatched≥α_t + L_o,unmatched<α_t    (2)

Additionally, the threshold α_t increases with each task t, ensuring that α_t < α_{t+1}, allowing the model to progressively refine its prediction of unknown instances.
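As a rough illustration of how such a thresholded objectness target could be assembled, consider the sketch below; the binary cross-entropy formulation, the value of α_t, and the small noise label are assumptions made for readability rather than our exact training code.

```python
# Hedged sketch of a thresholded objectness loss in the spirit of Eq. (2).
import torch
import torch.nn.functional as F

def objectness_loss(obj_logits, matched_mask, alpha_t=0.7, noise_label=0.05):
    """obj_logits: [N] raw scores for all IAMs; matched_mask: [N] bool, True for matched IAMs."""
    probs = obj_logits.sigmoid()
    targets = torch.full_like(probs, noise_label)        # default: low-confidence noise
    targets[matched_mask] = 1.0                          # matched IAMs are foreground
    targets[(~matched_mask) & (probs > alpha_t)] = 1.0   # confident unmatched IAMs kept as objects
    return F.binary_cross_entropy_with_logits(obj_logits, targets)

loss = objectness_loss(torch.randn(100), torch.rand(100) > 0.9)
print(loss.item())
```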

4.3 Training the Classification Head


Similar to the objectness head, we apply an advanced energy-based approach to the classification head. For each task t, we define a threshold α_t. For any unmatched IAM with confidence above α_t, we assume it represents an object and assign it the "unknown" label. For unmatched IAMs with confidence below α_t, we assign them a small value, termed empty_weight, which decreases as tasks progress.
For incremental learning, we adjust the output logits with a temperature scaling factor temperature_t using the following formula:

p_c = exp(z_c / temperature_t) / Σ_{k=1}^{n} exp(z_k / temperature_t)    (3)

As t increases, we slightly increase the temperature to smooth the output


distribution. This reduces the gradient’s impact on previously learned classes
and helps minimize forgetting when learning new classes.
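The temperature-scaled softmax of Eq. (3) can be written compactly as below; the concrete temperature values shown are only an example, not our actual schedule.

```python
# Small sketch of temperature-scaled class probabilities (Eq. 3).
import torch

def temperature_softmax(logits: torch.Tensor, temperature_t: float) -> torch.Tensor:
    """Higher temperature_t flattens the distribution, softening gradients on old classes."""
    return torch.softmax(logits / temperature_t, dim=-1)

logits = torch.tensor([2.0, 1.0, 0.2])
print(temperature_softmax(logits, 1.0))   # sharper, roughly [0.65, 0.24, 0.11]
print(temperature_softmax(logits, 2.0))   # smoother, roughly [0.50, 0.30, 0.20]
```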
Previous work, such as [10, 35], used exemplar selection to pick objects with
the highest objectness score in open-world object detection (OWOD) domains
for training in the next task. However, this requires iterating through the entire
dataset, which is resource-intensive. Instead, we randomly sample around 0.5% of the images from the previous task t − 1, ensuring the presence of previous classes as placeholders during training in task t.

4.4 Self-attention Mechanism

The self-attention mechanism plays a pivotal role in capturing long-range dependencies within input features, essential for contextual understanding in computer vision tasks. We formulate the attention functions through the following equations:

– Key, Query, and Value Generation: Key (K), query (Q), and value (V) vectors are generated from the input features X using 1x1 convolution layers.
– Attention Score Computation: The attention scores are computed by taking the dot product of the query and key vectors, followed by softmax normalization: A = SoftMax(QK^T / d_k), where d_k is the dimensionality of the key vectors.
– Weighted Sum of Values: The weighted sum of values is obtained by multiplying the attention scores with the value vectors: Z = AV, where Z represents the output feature map.

The architecture of the self-attention layer is visualized in Fig. 3.

Fig. 3. The self-attention layer takes input features and generates key, query, and value vectors via 1x1 convolutions. The key and query vectors are downsampled, while the value remains unchanged. Attention scores are computed using the dot product of key and query, followed by softmax normalization. The weighted sum of values produces the self-attention features, which are output by the layer.

We downsample the key and query vectors using bottleneck 1x1 convolution layers to reduce the parameter overhead while preserving essential information for generating instance activation maps. Specifically, the key and query vectors are downsampled with a reduction factor of r. To maintain the original shape of the input features, the value vectors' dimensions remain unchanged, while the shape of the key and query vectors is adjusted to [n, c/r, w, h], where n represents the batch size, and w and h denote the width and height of the input features, respectively.
This downsampling approach enables efficient computation of attention
scores while preserving the spatial resolution of the input features. With this
lightweight implementation, we observe an increase in Average Precision (AP),
particularly at higher IoU thresholds like AP@75, due to the generation of high-
quality masks.
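A lightweight sketch of such a convolutional self-attention layer is given below; the module name, the default reduction factor, and the plain QK^T / d_k normalization reflect our reading of the description above rather than the released code.

```python
# Illustrative PyTorch sketch: reduction factor r on key/query, value channels unchanged.
import torch
import torch.nn as nn

class ConvSelfAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)  # keeps original channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        q = self.query(x).flatten(2)          # [n, c/r, h*w]
        k = self.key(x).flatten(2)            # [n, c/r, h*w]
        v = self.value(x).flatten(2)          # [n, c,   h*w]
        attn = torch.softmax(q.transpose(1, 2) @ k / k.shape[1], dim=-1)  # [n, h*w, h*w]
        z = v @ attn.transpose(1, 2)          # weighted sum of values: [n, c, h*w]
        return z.view(n, c, h, w)             # same spatial resolution as the input

out = ConvSelfAttention(channels=256)(torch.randn(1, 256, 32, 32))
print(out.shape)  # torch.Size([1, 256, 32, 32])
```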

4.5 Dataset

We construct our dataset by sampling from MS COCO [14], diverging from pre-
vious methods due to the lack of instance segmentation data in Pascal VOC [9],
commonly used in earlier open-world object detection tasks.
In alignment with the approaches from [10, 12], we split our dataset into four
tasks to progressively evaluate model adaptability:

– Task 1: Train on known classes from MS COCO to ensure coverage.


– Tasks 2, 3, 4: Incrementally fine-tune the model on data of new, unseen classes,
using minimal or no data from previous tasks. The model is evaluated on
its ability to identify new classes while mitigating catastrophic forgetting of
current known classes.

The COCO classes are divided into four tasks, each with 20 classes. Classes
within tasks are semantically similar, while classes across tasks are semantically
distinct, based on super-category criteria.
At each task t, we sample images from the COCO training set that contain the highest percentage of instances belonging to the classes in task t. For instances labeled with classes from future tasks t′ (where t′ > t), we remove their annotations, both the mask and the label, to ensure there is no overlap between tasks in the initial state.
In the evaluation set, we follow a similar sampling strategy from the COCO
validation set. However, for classes in future tasks t′, while the instances remain
present in the images, their labels are changed to “unknown” for evaluation
purposes, as their true class is yet to be introduced.
To evaluate the model’s incremental learning ability, we increase the difficulty
by accumulating images from the previous evaluation sets. This means that
for task .t, the evaluation not only tests the model’s performance on the newly
learned classes but also enforces the evaluation on earlier tasks’ datasets to ensure
the model retains knowledge and does not experience significant forgetting.
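The task-wise filtering described above could be approximated over COCO-style annotations as in the following sketch; the 0.5 dominance threshold and the helper names are our own simplifications, not the exact sampling rule used to build the dataset.

```python
# Hedged sketch of task-split filtering over COCO-style annotation dicts.
from typing import Dict, List, Set

def filter_for_task(images: List[Dict], annotations: List[Dict],
                    task_class_ids: Set[int], known_class_ids: Set[int]):
    """Keep images dominated by task classes; drop annotations of future classes."""
    per_image = {}
    for ann in annotations:
        per_image.setdefault(ann["image_id"], []).append(ann)

    kept_images, kept_anns = [], []
    for img in images:
        anns = per_image.get(img["id"], [])
        if not anns:
            continue
        frac = sum(a["category_id"] in task_class_ids for a in anns) / len(anns)
        if frac >= 0.5:  # "highest percentage" approximated here by a fixed threshold
            kept_images.append(img)
            # keep only annotations of current-task or previously known classes
            kept_anns += [a for a in anns
                          if a["category_id"] in task_class_ids | known_class_ids]
    return kept_images, kept_anns
```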

5 Experiments
Since Pascal VOC lacks clear annotations for instance segmentation tasks, we
created a new dataset based on MS COCO. While this differs slightly from the
OWOD benchmark, we perform comparisons with both the base model and our
proposed updates. Given that COCO evaluation is more challenging than Pascal
VOC, for mAP comparison in the open-world domain, we focus on the incre-
mental learning ability by measuring the percentage change in mAP between
previously and newly learned tasks.
Training Settings. We trained on 2 RTX 4090 GPUs with a batch size of 32.
The base learning rate was set to 5 × 10^-5 for all tasks, with an initial warm-up
phase followed by three learning rate reductions. We applied data augmentation
techniques such as cropping to enhance dataset robustness. Unlike the standard
OWOD setup, which typically involves training and fine-tuning for each task
(resulting in 8 steps), we perform sampling in the early stages. As a result, we
reduce the total training time, requiring only 4 training steps for the 4 tasks (see
Table 1 for more details).
Evaluation Metrics. For known classes, we use the mean average precision
(mAP) metric. To better understand the quality of continual learning, mAP is
divided into previously seen and newly introduced object classes. For unknown
objects, we adopt the unknown object recall (U-recall) metric, which measures
the ratio of detected unknown objects to the total labeled unknown objects, as
mAP cannot be applied (due to the lack of annotations for all unknown objects).
We also study the confusion between unknown and known objects [12, 31, 35].
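For clarity, U-Recall can be computed as in the sketch below; the IoU helper and the 0.5 matching threshold are assumptions on our part, not a specification of the benchmark code.

```python
# Minimal sketch of unknown object recall (U-Recall) at the mask/box level.
from typing import List

def u_recall(pred_unknown_masks: List, gt_unknown_masks: List,
             iou_fn, iou_thr: float = 0.5) -> float:
    """Fraction of labeled unknown ground-truth instances matched by an 'unknown' prediction."""
    if not gt_unknown_masks:
        return 0.0
    matched = 0
    for gt in gt_unknown_masks:
        if any(iou_fn(pred, gt) >= iou_thr for pred in pred_unknown_masks):
            matched += 1
    return matched / len(gt_unknown_masks)
```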
Implementation Details. Our implementation is based on the SparseInst
model, using a ResNet-50 backbone with an FPN. The number of Instance Acti-
vation Maps (IAMs) is set to 100, and the dimensionality D of the IAMs is
256.

Table 1. Dataset structure and maximum iterations per task.

Task Ids Task 1 Task 2 Task 3 Task 4


Super category | Person, vehicle, animal | Outdoor, accessories, appliance, truck | Sports, food | Electronic, indoor, kitchen, furniture
#Class 20 20 20 20
#train image 36290 20176 20464 20644
#val image 1000 1899 2777 3647
%unknown instance 15% 13% 7% 0%
%instance/total 55% 65% 65% 69%
Max iteration 70000 40000 30000 30000

5.1 Open World Instance Segmentation

Due to the limited training data compared to the original dataset and eval-
uation on the same test set, we observed a reduction in mAP relative to the
base model. As expected, the shift to an open-world domain creates a trade-off
between unknown object recall and mAP for known instances. Nonetheless, we
successfully kept the mAP reduction to around 5% compared to the base model,
while significantly enhancing unknown object recall and improving incremen-
tal learning capabilities. Our method outperformed the base model in mAP on
Tasks 3 and 4, demonstrating its robustness in handling incremental learning
tasks. See Table 2.

Table 2. Comparison of mAP and unknown object recall across tasks between the
base model and our method. Training on a smaller subset for Task 1 shows a ∼2.1 mAP drop in the base model, while our method dropped ∼2.14% compared with the base model but achieved superior unknown recall and incremental learning performance, with higher mAP for previous tasks and sustained performance through Tasks 3 and 4. FPS was measured on a single RTX 4090 24G.

Task Ids (→) | Task 1: U-Recall (↑), mAP (↑) Current | Task 2: U-Recall (↑), mAP (↑) Previous/Current/Both | Task 3: U-Recall (↑), mAP (↑) Previous/Current/Both | Task 4: mAP (↑) Previous/Current/Both | FPS
SparseInst R50 - 32.11 - 27.19 27.11 27.14 - 23.51 20.24 22.76 20.52 17.81 19.94 92
SparseInst R50-DCN - 35.53 - 28.81 29.95 29.49 - 25.21 24.32 25.01 21.91 19.10 21.31 83
ROWIS R50 15.42 29.97 22.51 27.81 26.53 27.04 18.24 24.67 19.57 23.50 21.97 17.14 20.94 91
ROWIS R50-DCN 17.71 33.14 24.69 29.14 28.31 28.64 20.37 26.42 21.25 25.24 23.45 18.23 22.34 82

5.2 Open World Object Detection Comparison

In comparison with open-world object detection models, one of our key evalua-
tion metrics is unknown recall, which measures how well the model can identify
unknown instances or objects in images. To ensure a fair comparison with object
detection models, we convert the segmentation masks into bounding boxes by
creating bounding rectangles from the masks and then recalculating the unknown
recall. The results are shown in Table 3 all OWOD results report in M-OWOD
benchmark.

5.3 Self-attention on Mask Generation Result

The implementation of self-attention enhances the accuracy of mask genera-


tion, particularly for fine details such as small objects and for higher precision
thresholds like AP@75. As seen in Table 4, we observe a slight trade-off between
precision and inference speed, with a reduction of approximately 1 FPS when
tested on a single RTX 4090 24G.

Table 3. Comparison of unknown recall across tasks between our method and other
open-world object detection models. Although benchmarked on different datasets, the
use of the same metrics and evaluation methods provides a reference for understanding
our model’s performance. Our model achieved promising unknown recall, with the
highest recall observed in Task 2.

Task Ids (→) Task 1 Task 2 Task 3 Task 4

U-recall (↑) U-recall (↑) U-recall (↑) U-recall (↑)
ORE [12] 4.9 2.9 3.9 -
OW-DETR [10] 7.5 6.2 5.7 -
PROB [35] 19.4 17.4 19.6 -
CAT [16] 23.7 19.1 24.4 -
RandBox [31] 10.6 6.3 7.8 -
ROWIS R50 17.21 23.29 19.44 -
ROWIS R50 DCN 19.32 24.8 20.32 -

Table 4. Applying the efficient self-attention layer improves the precision of mask
generation, particularly on small objects and higher precision metrics such as AP@75.
However, this comes with a slight trade-off in inference speed, reducing performance
by approximately 1 FPS.

Self-attention? AP50 AP75 APs APm APl FPS


No 37.55 26.40 8.54 23.41 38.49 92
Yes 38.23 29.32 11.3 23.62 38.60 91

6 Conclusion

The open-world domain presents significant challenges due to its complexity and
the requirement for models to handle unknown objects. In this work, we adapted
Open World Object Detection to the Open World Instance Segmentation task
and introduced ROWIS, an end-to-end model specifically designed for Open
World Instance Segmentation. We also provided a new dataset to encourage
further exploration in this field, not only for instance segmentation but also for
open-world object detection.
ROWIS demonstrated its ability to adapt to real-world scenarios while bal-
ancing precision, speed, and the ability to handle unknown objects. However,
there are limitations that need further improvement. Notably, there is a trade-
off between precision and recall, which could be addressed in future iterations.
Additionally, the model is currently highly sensitive to hyperparameter tuning
to achieve optimal results. In future work, we aim to develop a more robust
approach that reduces dependency on hyperparameters. We also acknowledge
the potential inconsistencies in our dataset and will continue to refine it for the
benefit of the research community.

Acknowledgements. This research was supported by The VNUHCM-University of


Information Technology’s Scientific Research Support Fund.

References
1. Bello, I., Zoph, B., Le, Q., Vaswani, A., Shlens, J.: Attention augmented con-
volutional networks. In: 2019 IEEE/CVF International Conference on Computer
Vision, ICCV 2019, Seoul, Korea (South), 27 October–2 November 2019, pp. 3285–
3294. IEEE (2019)
2. Bendale, A., Boult, T.E.: Towards open world recognition. In: IEEE Conference on
Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, 7–12
June 2015, pp. 1893–1902. IEEE Computer Society (2015)
3. Bolya, D., Zhou, C., Xiao, F., Lee, Y.J.: YOLACT: real-time instance segmenta-
tion. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV
2019, Seoul, Korea (South), 27 October–2 November 2019, pp. 9156–9165. IEEE
(2019)
4. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.:
End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox,
T., Frahm, J. (eds.) Computer Vision - ECCV 2020 - 16th European Conference,
Glasgow, UK, 23–28 August 2020, Proceedings, Part I. Lecture Notes in Computer
Science, vol. 12346, pp. 213–229. Springer, Cham (2020)
5. Cheng, T., et al.: Sparse instance activation for real-time instance segmentation.
In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR
2022, New Orleans, LA, USA, 18–24 June 2022, pp. 4423–4432. IEEE (2022)
6. Chu, X., et al.: Twins: revisiting the design of spatial attention in vision transform-
ers. In: Ranzato, M., Beygelzimer, A., Dauphin, Y.N., Liang, P., Vaughan, J.W.
(eds.) Advances in Neural Information Processing Systems 34: Annual Conference
on Neural Information Processing Systems 2021, NeurIPS 2021, 6–14 December
2021, virtual, pp. 9355–9366 (2021)
7. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirec-
tional transformers for language understanding. In: Burstein, J., Doran, C., Solorio,
T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies,
NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019, Volume 1 (Long and
Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019)
8. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image
recognition at scale. In: 9th International Conference on Learning Representations,
ICLR 2021, Virtual Event, Austria, 3–7 May 2021. OpenReview.net (2021)
9. Everingham, M., Gool, L.V., Williams, C., Winn, J.M., Zisserman, A.: The pascal
visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)
10. Gupta, A., Narayan, S., Joseph, K.J., Khan, S., Khan, F.S., Shah, M.: OW-DETR:
open-world detection transformer. In: IEEE/CVF Conference on Computer Vision
and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022,
pp. 9225–9234. IEEE (2022)
11. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: 2018 IEEE Confer-
ence on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City,
UT, USA, 18–22 June 2018, pp. 7132–7141. Computer Vision Foundation/IEEE
Computer Society (2018)

12. Joseph, K.J., Khan, S.H., Khan, F.S., Balasubramanian, V.N.: Towards open world
object detection. In: IEEE Conference on Computer Vision and Pattern Recog-
nition, CVPR 2021, virtual, 19–25 June 2021, pp. 5830–5840. Computer Vision
Foundation/IEEE (2021)
13. Kirillov, A., et al.: Segment anything. arXiv:2304.02643 (2023)
14. Lin, T., et al.: Microsoft COCO: common objects in context. In: Fleet, D.J., Pajdla,
T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision - ECCV 2014 - 13th Euro-
pean Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part V.
Lecture Notes in Computer Science, vol. 8693, pp. 740–755. Springer (2014)
15. Liu, S., et al.: Grounding DINO: marrying DINO with grounded pre-training for
open-set object detection. CoRR abs/2303.05499 (2023)
16. Ma, S., et al.: CAT: localization and identification cascade detection transformer
for open-world object detection. In: IEEE/CVF Conference on Computer Vision
and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023,
pp. 19681–19690. IEEE (2023)
17. Mallya, A., Lazebnik, S.: Packnet: adding multiple tasks to a single network by
iterative pruning. In: 2018 IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018, pp. 7765–
7773. Computer Vision Foundation/IEEE Computer Society (2018)
18. Miller, D., Dayoub, F., Milford, M., Sünderhauf, N.: Evaluating merging strategies
for sampling-based uncertainty techniques in object detection. In: International
Conference on Robotics and Automation, ICRA 2019, Montreal, QC, Canada, 20–
24 May 2019, pp. 2348–2354. IEEE (2019)
19. Miller, D., Nicholson, L., Dayoub, F., Sünderhauf, N.: Dropout sampling for robust
object detection in open-set conditions. In: 2018 IEEE International Conference
on Robotics and Automation, ICRA 2018, Brisbane, Australia, 21–25 May 2018,
pp. 1–7. IEEE (2018)
20. Oksuz, K., Cam, B.C., Kahraman, F., Baltaci, Z.S., Kalkan, S., Akbas, E.: Mask-
aware IOU for anchor assignment in real-time instance segmentation. In: 32nd
British Machine Vision Conference 2021, BMVC 2021, Online, 22–25 November
2021, p. 228. BMVA Press (2021)
21. Park, J., Woo, S., Lee, J., Kweon, I.S.: BAM: bottleneck attention module. In:
British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, 3–6 Septem-
ber 2018, p. 147. BMVA Press (2018)
22. Radford, A., et al.: Learning transferable visual models from natural language
supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International
Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event.
Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763. PMLR (2021)
23. Rajasegaran, J., Khan, S.H., Hayat, M., Khan, F.S., Shah, M.: itaml: an incre-
mental task-agnostic meta-learning approach. In: 2020 IEEE/CVF Conference on
Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19
June 2020, pp. 13585–13594. Computer Vision Foundation/IEEE (2020)
24. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object
detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell.
39(6), 1137–1149 (2017)
25. Srinivas, A., Lin, T., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A.: Bottleneck
transformers for visual recognition. In: IEEE Conference on Computer Vision and
Pattern Recognition, CVPR 2021, virtual, 19–25 June 2021, pp. 16519–16529.
Computer Vision Foundation/IEEE (2021)

26. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances
in Neural Information Processing Systems 30: Annual Conference on Neural Infor-
mation Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pp.
5998–6008 (2017)
27. Wang, C., Xu, H., Zhang, X., Wang, L., Zheng, Z., Liu, H.: Convolutional embed-
ding makes hierarchical vision transformer stronger. In: Avidan, S., Brostow, G.J.,
Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision - ECCV 2022 - 17th
European Conference, Tel Aviv, Israel, 23–27 October 2022, Proceedings, Part XX.
Lecture Notes in Computer Science, vol. 13680, pp. 739–756. Springer (2022)
28. Wang, W., Feiszli, M., Wang, H., Malik, J., Tran, D.: Open-world instance seg-
mentation: exploiting pseudo ground truth from learned pairwise affinity. In:
IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022,
New Orleans, LA, USA, 18–24 June 2022, pp. 4412–4422. IEEE (2022)
29. Wang, W., Feiszli, M., Wang, H., Tran, D.: Unidentified video objects: a benchmark
for dense, open-world segmentation. In: 2021 IEEE/CVF International Conference
on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021, pp.
10756–10765. IEEE (2021)
30. Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense pre-
diction without convolutions. In: 2021 IEEE/CVF International Conference on
Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021, pp.
548–558. IEEE (2021)
31. Wang, Y., Yue, Z., Hua, X., Zhang, H.: Random boxes are open-world object
detectors. In: IEEE/CVF International Conference on Computer Vision, ICCV
2023, Paris, France, 1–6 October 2023, pp. 6210–6220. IEEE (2023)
32. Xue, X., et al.: Transformer-based open-world instance segmentation with cross-
task consistency regularization. In: El-Saddik, A., et al. (eds.) Proceedings of
the 31st ACM International Conference on Multimedia, MM 2023, Ottawa, ON,
Canada, 29 October 2023–3 November 2023, pp. 2507–2515. ACM (2023)
33. Zhang, H., et al.: DINO: DETR with improved denoising anchor boxes for end-to-
end object detection. In: The Eleventh International Conference on Learning Rep-
resentations, ICLR 2023, Kigali, Rwanda, 1–5 May 2023. OpenReview.net (2023)
34. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable
transformers for end-to-end object detection. In: 9th International Conference on
Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021.
OpenReview.net (2021)
35. Zohar, O., Wang, K., Yeung, S.: PROB: probabilistic objectness for open world
object detection. In: IEEE/CVF Conference on Computer Vision and Pattern
Recognition, CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023, pp. 11444–
11453. IEEE (2023)
An Attempt to Develop a Neural Parser
Based on Simplified Head-Driven Phrase
Structure Grammar on Vietnamese

Duc-Vu Nguyen2,3 , Thang Chau Phan1,3 , Quoc-Nam Nguyen1,3 ,


Kiet Van Nguyen1,3 , and Ngan Luu-Thuy Nguyen1,3(B)
1 Faculty of Information Science and Engineering, University of Information Technology, Ho Chi Minh City, Vietnam
{20520929,20520644}@gm.uit.edu.vn, {kietnv,ngannlt}@uit.edu.vn
2 Laboratory for Multimedia Communications, University of Information Technology, Ho Chi Minh City, Vietnam
[email protected]
3 Vietnam National University, Ho Chi Minh City, Vietnam

Abstract. In this paper, we aimed to develop a neural parser for Viet-


namese based on simplified Head-Driven Phrase Structure Grammar
(HPSG). The existing corpora, VietTreebank and VnDT, had around
15% of constituency and dependency tree pairs that did not adhere to
simplified HPSG rules. To attempt to address the issue of the corpora
not adhering to simplified HPSG rules, we randomly permuted samples
from the training and development sets to make them compliant with
simplified HPSG. We then modified the first simplified HPSG Neural
Parser for the Penn Treebank by replacing it with the PhoBERT or
XLM-RoBERTa models, which can encode Vietnamese texts. We con-
ducted experiments on our modified VietTreebank and VnDT corpora.
Our extensive experiments showed that the simplified HPSG Neural
Parser achieved a new state-of-the-art F-score of 82% for constituency
parsing when using the same predicted part-of-speech (POS) tags as the
self-attentive constituency parser. Additionally, it outperformed previous
studies in dependency parsing with a higher Unlabeled Attachment Score
(UAS). However, our parser obtained lower Labeled Attachment Score
(LAS) scores likely due to our focus on arc permutation without changing
the original labels, as we did not consult with a linguistic expert. Lastly,
the research findings of this paper suggest that simplified HPSG should
be given more attention to linguistic expert when developing treebanks
for Vietnamese natural language processing.

Keywords: Neural Parser · Head-driven Phrase Structure Grammar ·


VietTreeBank · VnDT · Transformer

1 Introduction
Natural Language Processing (NLP) has witnessed significant advancements
in recent years, propelled by the development of sophisticated models and

algorithms. A critical area within NLP is the development of efficient and accu-
rate parsers, particularly for languages with limited computational resources,
like Vietnamese. Vietnamese, characterized by its tonal nature, complex mor-
phology, and unique syntactic structure, presents unique challenges for parsing
technologies [1, 2]. This study aims to address these challenges by developing
a Vietnamese neural parser using a simplified version of Head-Driven Phrase
Structure Grammar (HPSG) [3, 4].
Our approach involves addressing inconsistencies within the VietTreebank
and VnDT corpora, which are pivotal for Vietnamese NLP [5, 6]. We incorporated
advanced text encoding models, PhoBERT and XLM-RoBERTa, hypothesizing
that these models would enhance the parser’s performance due to their robust
linguistic representation capabilities [7, 8]. Our experiments demonstrate that
the parser achieves an 82% F-score in constituency parsing and shows promising
performance in dependency parsing, outperforming others in the field despite
lower scores in the Labeled Attachment Score (LAS) [9, 10].
In the context of the VLSP 2023 - Vietnamese Constituency Parsing Chal-
lenge, our study also ventures into transforming constituency trees into depen-
dency trees using proposed head-rules implemented with the ClearNLP toolkit
[11, 12]. This transformation is particularly significant, as it was achieved with-
out direct linguistic input. Remarkably, the HPSG Neural Parser [13] achieved
a marginally higher F-score of 89.04%, surpassing established parsers like the
Stanza Constituency Parser, which scored 88.73% [14]. This outcome under-
scores the potential of incorporating linguistic expertise into the development of
Vietnamese NLP tools, an area that has been relatively underexplored.
The rest of this paper is structured as follows. Section 2 surveys several current works on Vietnamese parsing. Section 3 briefly describes and analyzes the datasets used for our methodology and baseline. Our methodology is presented in detail in Sect. 4. Section 5 covers the experimental settings, the implemented models, and our experimental results on each dataset, and provides the result analysis and discussion of the proposed approach. Finally, Sect. 6 concludes our research and outlines potential areas for future exploration.

2 Background and Related Work


Vietnamese parsing has evolved significantly over the past decades. Early efforts
were focused on building foundational resources like the VietTreebank and
exploring basic dependency parsing strategies for Vietnamese. These efforts laid
the groundwork for more sophisticated parsing techniques, as seen in [1] and
[15, 16].
The role of treebanks in Vietnamese NLP cannot be overstated. Nguyen et al.
[2] emphasized the importance of ensuring annotation consistency and accuracy
in Vietnamese treebanks, which has been pivotal in advancing parsing tech-
niques [2]. These treebanks serve as critical resources for training and evaluating
parsers, forming the backbone of most modern Vietnamese NLP applications, as
demonstrated in studies like [5, 6].

Neural parsing techniques have seen significant innovations, shifting from tra-
ditional rule-based methods to more advanced neural network-based approaches.
Key developments include the minimal span-based neural constituency parsers
by Stern, Andreas, and Klein [17] and the analytical insights into neural con-
stituency parsers provided by Gaddy, Stern, and Klein [18]. These studies have
significantly influenced the field, moving it towards more efficient and accurate
parsing solutions.
The integration of pre-trained models like PhoBERT and XLM-RoBERTa
into parsing has been a game-changer. The introduction of PhoBERT by Nguyen and Tuan Nguyen [7] and the work of Conneau et al. [8] on unsupervised cross-lingual representation learning have demonstrated the potential of these models in
enhancing parsing accuracy and efficiency, especially for languages like Viet-
namese that lack extensive computational resources.
Despite these advancements, Vietnamese parsing faces specific challenges,
such as the complexity of its syntactic structure and limited linguistic resources.
Recent studies have proposed innovative solutions, including the use of head-
rules for tree transformations and leveraging toolkits like ClearNLP to improve
parsing efficiency [1]. Nguyen, Nguyen, and Nguyen [19] presented a Depen-
dency Tree-LSTM approach for Vietnamese sentiment analysis. Trang et al. [20]
proposed a prosodic boundary prediction model to improve Vietnamese speech
synthesis, using both traditional and novel features like syntactic blocks and
links.
In comparing our parser’s performance with others, such as the Stanza Con-
stituency Parser, it becomes evident that while there are similarities in method-
ological approaches, each parser has its unique strengths and limitations. Our
parser’s slightly higher F-score highlights the potential impact of incorporating
linguistic expertise in parser development, a concept that has been relatively
underutilized in Vietnamese NLP [21, 22].
The future of Vietnamese parsing looks promising, with potential impacts
extending beyond the immediate field. The methodologies and findings from this
study could influence future research directions, not only in Vietnamese NLP but
also in the broader context of computational linguistics for other low-resource
languages [23, 24].

3 Corpora
3.1 VTB and VnDT

Prior work introduced a project on developing a Vietnamese lexicon tailored for NLP applications, with a strong focus on the standardization of lexicon representation. Specifically, the authors proposed a scalable framework of Vietnamese syntactic descriptions that is useful for defining tagsets and conducting morphosyntactic analysis.
The VietTreebank (VTB) dataset, developed by Nguyen et al. [16], includes
20,000 sentences, with 10,000 having syntactic annotations and another 10,000

tagged for parts of speech. This dataset is sourced from “Tuoi Tre1 ”, a Vietnamese
newspaper. In addition, Nguyen et al. [1] introduced a technique to convert Viet-
namese treebanks into dependency trees, particularly important for addressing
Vietnamese linguistic peculiarities. This conversion process led to the creation
of the VnDT Treebank, featuring 10,200 sentences. The treebank was evaluated
using two parsers, MSTParser and MaltParser, with MSTParser showing su-
perior performance in Vietnamese dependency parsing. The VnDT Treebank is
now a publicly available resource that offers significant value for research on
Vietnamese natural language processing (Fig. 1).

Fig. 1. Constituent, dependency, and joint span structures, extracted from the
training datasets of VTB and VnDT and intended solely for visualization
purposes, may contain slight labeling errors. These structures represent the same
Vietnamese sentence, indexed from 1 to 7 and assigned an interval range for each node.
Dependency arcs indicate grammatical relationships such as subject, object, and


modifiers. The joint span structure combines constituent and dependency structures,
explicitly marking the category (Categ) and head word (HEAD) for each span.

Figure 1 presents an example extracted from the training datasets of


VTB and VnDT, which may contain slight labeling errors and is
intended solely for visualization purposes, illustrating a valid simplified
HPSG structure as proposed by Zhou and Zhao [13]. This example highlights
the integration of constituent and dependency analyses with explicit annotation
of categories and head words. However, it was found that approximately 15%
of the constituency and dependency tree pairs in the VietTreebank and VnDT
corpora did not adhere to the simplified HPSG framework outlined by Zhou
and Zhao [13]. To resolve this discrepancy, samples from the training and devel-
opment sets were adjusted through random permutation to comply with these
rules. Crucially, the original labels were preserved throughout the modification
1
https://tuoitre.vn/.

process, as no additional expert linguistic annotations were introduced to refine


these corrections.

3.2 VLSP 2023 Vietnamese Treebank

The VLSP 2023 Vietnamese Treebank [16] is a collection of about 10,000 Viet-
namese sentences, mostly from news articles and socio-political texts. The cre-
ators used various linguistic methods to handle language ambiguities, and anno-
tators were assisted by automatic tools.
In the VLSP 2023 shared task [15], participants are asked to develop a con-
stituency parser. This parser takes a sentence and produces a tree that shows the
grammatical structure of the sentence. Participants can improve their parsers us-
ing extra Vietnamese text or pre-trained language models. The evaluation uses
Parseval metrics, and the test data includes texts from the same domain as the
training data, as well as new areas like legal and biomedical fields (Fig. 2).

Fig. 2. Distributions in the VLSP 2023 Vietnamese Treebank training set.

We analyzed the training data from the VLSP 2023 Vietnamese Treebank to
understand the structure of Vietnamese sentences. The data shows that nouns
and verbs are the most common parts of speech, appearing 42,584 and 32,456
times respectively. This means that Vietnamese sentences in this dataset often
focus on nouns and verbs, which is typical in formal writing like news articles.
Punctuation marks are also frequent, with 22,819 instances, highlighting the
structured nature of written Vietnamese. The dataset includes various types of

pronouns and verb forms, such as personal pronouns (PRO:per), demonstrative


pronouns (PRO:dem), and copular verbs (V:cop), which are important features
of the language.
In terms of sentence structure, noun phrases (NP) are the most common
constituents, appearing 49,449 times. Verb phrases (VP) are also common, with
31,806 instances, indicating that sentences often have complex structures with
detailed information. Other types of phrases like prepositional phrases (PP), ad-
jective phrases (AP), and subordinate clauses (SBAR) are also present, showing
the richness of Vietnamese syntax.
In summary, the VLSP 2023 Vietnamese Treebank provides valuable insights
into the Vietnamese language, especially in formal writing. Understanding the
common parts of speech and sentence structures helps in developing better nat-
ural language processing tools for Vietnamese.

4 Method: Head-Driven Phrase Structure Grammar Neural Parser

4.1 Overview of Joint Span HPSG
Our HPSG Neural Parser is based on a novel Joint Span HPSG model, which
innovatively integrates constituent and head information within a single con-
stituent tree structure [13]. The “joint span” concept in this model encompasses
all child phrases and dependency arcs among these phrases, providing a compre-
hensive syntactic analysis framework.

4.2 Token Representation


In this paper, we emphasize the importance of part-of-speech and contextual em-
beddings in the token representation of the HPSG Neural Parser. This parser’s
token representation is intricately designed, comprising character-level embed-
dings for morphological analysis, word embeddings to provide semantic context,
and part-of-speech embeddings crucial for syntactic parsing [7, 9]. The integra-
tion of these embeddings facilitates a comprehensive approach to token repre-
sentation, establishing a robust foundation for accurate syntactic analysis. A
notable enhancement to our parser is its integration with advanced pre-trained
language models such as PhoBERT and XLM-RoBERTa [7, 8]. This integration
enables the parser to leverage deep, contextualized word representations, signifi-
cantly improving its parsing performance, particularly in processing Vietnamese,
a language with fewer linguistic resources.
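
As a rough sketch of this design (illustrative dimensions, not the parser's exact configuration), the token representation can be built by concatenating character, word, and POS embeddings with precomputed contextual vectors from PhoBERT or XLM-R:

import torch
import torch.nn as nn

class TokenRepresentation(nn.Module):
    """Concatenates character, word, POS, and contextual embeddings per token."""
    def __init__(self, n_words, n_tags, n_chars,
                 d_word=100, d_tag=50, d_char=64, d_ctx=768):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, d_word)
        self.tag_emb = nn.Embedding(n_tags, d_tag)
        self.char_emb = nn.Embedding(n_chars, d_char)
        self.out_dim = d_word + d_tag + d_char + d_ctx

    def forward(self, word_ids, tag_ids, char_ids, contextual):
        # word_ids, tag_ids: (batch, seq); char_ids: (batch, seq, max_chars)
        # contextual: (batch, seq, d_ctx) vectors precomputed with PhoBERT/XLM-R
        chars = self.char_emb(char_ids).mean(dim=2)   # crude character summary
        return torch.cat([self.word_emb(word_ids), self.tag_emb(tag_ids),
                          chars, contextual], dim=-1)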

4.3 Self-attention Encoder


A key feature of our model is the self-attention encoder, based on the Trans-
former architecture [26]. This encoder effectively contextualizes each word in a
sentence, considering both immediate and distant word relations, thus capturing
the complexity of syntactic structures in Vietnamese.
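
A minimal PyTorch sketch of such an encoder, using the two self-attention layers reported later in our experimental settings and an illustrative model dimension, could look as follows:

import torch.nn as nn

# d_model must match (or project from) the token representation size; 1024 is illustrative.
encoder_layer = nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
# contextualized = encoder(token_vectors, src_key_padding_mask=pad_mask)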

4.4 Scoring Mechanism

The scoring mechanism within the HPSG Neural Parser utilizes a biaffine at-
tention model [9]. This model accurately scores potential dependency relations
among words, allowing for precise parsing and syntactic relationship establish-
ment in complex sentences.
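
The following sketch shows one standard way to implement a biaffine arc scorer in PyTorch; it follows the general form of Dozat and Manning [9] rather than reproducing the parser's exact code:

import torch
import torch.nn as nn

class BiaffineScorer(nn.Module):
    """Scores arcs s(i, j) for dependent i and candidate head j."""
    def __init__(self, d_in, d_arc=512):
        super().__init__()
        self.head_mlp = nn.Sequential(nn.Linear(d_in, d_arc), nn.ReLU())
        self.dep_mlp = nn.Sequential(nn.Linear(d_in, d_arc), nn.ReLU())
        self.U = nn.Parameter(torch.zeros(d_arc, d_arc))
        self.b = nn.Parameter(torch.zeros(d_arc))   # bias term on the head side

    def forward(self, h):
        # h: (batch, seq, d_in) contextualized token vectors
        head = self.head_mlp(h)   # (batch, seq, d_arc)
        dep = self.dep_mlp(h)     # (batch, seq, d_arc)
        # bilinear term plus head bias; result: (batch, seq_dep, seq_head)
        return dep @ self.U @ head.transpose(1, 2) + (head @ self.b).unsqueeze(1)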

4.5 Decoder for Joint Span HPSG

The decoder in our HPSG Neural Parser employs dynamic programming to


reconstruct syntactic parse trees from the scores generated by the encoder [17].
It utilizes an objective function that combines hinge loss and cross-entropy loss,
optimizing the overall structure of the HPSG tree.
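
As an illustration of the dynamic-programming idea only (the actual decoder also handles labels, head assignment, and the hinge/cross-entropy objective), a minimal CKY-style search over span scores can be written as:

# Choose the binarized tree that maximizes the sum of span scores.
def best_tree(span_score, n):
    """span_score(i, j) scores the span covering words i..j-1 (0 <= i < j <= n)."""
    best = [[0.0] * (n + 1) for _ in range(n + 1)]
    split = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        best[i][i + 1] = span_score(i, i + 1)
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            k_best = max(range(i + 1, j), key=lambda k: best[i][k] + best[k][j])
            best[i][j] = span_score(i, j) + best[i][k_best] + best[k_best][j]
            split[i][j] = k_best

    def build(i, j):
        if j - i == 1:
            return (i, j)
        k = split[i][j]
        return ((i, j), build(i, k), build(k, j))

    return best[0][n], build(0, n)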

5 Experiments and Results


5.1 Baseline: Stanza Constituency Parser

The Stanza Constituency Parser, as part of the Stanza open-source software


distribution, has been implemented to support a wide range of languages, in-
cluding Vietnamese. This general neural constituency parser, based on an in-
order transition-based parsing framework, showcases its versatility and robust-
ness [14, 27]. In its application to the VLSP 2022 Vietnamese treebank, the parser
achieved an impressive test score of 83.93% F1, leading the private test leader-
board. This implementation uses a shift/reduce compiler-like mechanism, which
manages a stack of partially constructed trees and a queue of unparsed words to
predict transitions at each step of the parsing process. Integrating LSTM net-
works and attention mechanisms within this parser enhances its ability to parse
syntactic structures accurately. In this study, the Stanza Constituency Parser is
utilized as a benchmark to assess the performance of our HPSG Neural Parser.
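
For reference, the baseline can be invoked through Stanza's pipeline API; the sketch below assumes that a Vietnamese constituency model is available in the installed Stanza version and that the tokenizer handles word segmentation for it.

import stanza

stanza.download("vi")
nlp = stanza.Pipeline("vi", processors="tokenize,pos,constituency")
doc = nlp("Học sinh học sinh học .")
for sentence in doc.sentences:
    print(sentence.constituency)   # bracketed constituency tree for each sentence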

5.2 Experimental Settings


Stanza Part-of-Speech Tagger. In our study, we utilized a modified version
of the Stanza tagger2 [14], fine-tuned with PhoBERTlarge [7], achieving a 94%
token accuracy. This fine-tuning involved training an ensemble of 10 taggers on
90% of the dataset, with the remaining 10% used for validation, to generate the
necessary tags for training our constituency model. For our training setup, we
defined parameters including a cap of 15,000 steps, 12 BERT hidden layers, a
learning rate of 1e−5, a batch size of 1,000, and the AdamW optimizer.

2 https://github.com/stanfordnlp/stanza/blob/main/stanza/utils/training/run_pos.py

Stanza Constituency Parser. Our study involved refining the Stanza con-
stituency parser3 [14] with PhoBERTlarge [7], yielding an 81% F-score. We
trained a solitary model, consisting of one parser, on 90% of the data, allo-
cating the remaining 10% for validation. The training parameters were set with
BERT fine-tuning from epoch 0 to 300, employing the AdamW optimizer, and
a batch size of 32 for training.

3 https://github.com/stanfordnlp/stanza/blob/main/stanza/utils/training/run_constituency.py

HPSG Neural Parser. We discovered that around 1,000 constituency and


dependency trees within the VTB and VnDT datasets failed to align with the
HPSG tree criteria detailed in [13]. To rectify this, we applied a strategy of
random permutation, which was not bound by linguistic constraints, to modify
these trees to meet the specified HPSG criteria as cited in [13]. Notably, our
modifications were limited to the training and development subsets of the VnDT
dataset, while the test set remained unaltered. This approach ensures a fair
comparison with previous studies conducted on the VnDT dataset.
We focused on enhancing the HPSG Neural Parser, which facilitates joint
constituency and dependency decoding, as described in [13] and available4 . Our
approach involved integrating PhoBERT [7] and XLM-R [8] into the parser. We
developed a single model that included parsers running for 100 epochs, utiliz-
ing both XLM-R and PhoBERT for the VTB & VnDT datasets. However, for
the VLSP 2023 Vietnamese Treebank, we exclusively used PhoBERT due to
time constraints. The configuration of our parser included the use-tag feature, a
learning rate of 0.00005, two layers of self-attention, and the AdamW optimizer.

4 https://github.com/DoodleJZ/HPSG-Neural-Parser

Fig. 3. Balancing Constituency and Dependency in Joint Span HPSG Parsing on the
VTB & VnDT Development Sets.

In our study, we adjusted the hyper-parameter λ within the HPSG Neural
Parser framework. A higher λ value assigns greater weight to the loss in
constituency parsing relative to dependency parsing. As illustrated in Fig. 3, a λ


setting of 0.9 yielded optimal results for both constituency and dependency pars-
ing on the development datasets of VTB and VnDT. This outcome suggests that
the impact of our approach, employing random permutation without linguistic
constraints to align the trees with the HPSG criteria as mentioned in [13], was
minimal. Consequently, we adopted a λ value of 0.9 for subsequent experiments
in our research involving the HPSG Neural Parser.
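
Under one plausible reading of this setup (an assumption based on the description above, not the exact formulation in [13]), λ simply interpolates the two training losses:

LAMBDA = 0.9   # value selected on the VTB/VnDT development sets (Fig. 3)

def joint_loss(constituency_loss, dependency_loss, lam=LAMBDA):
    # larger lam -> more weight on the constituency objective
    return lam * constituency_loss + (1.0 - lam) * dependency_loss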

Fig. 4. An attempted version of head rules for the VLSP 2023 Vietnamese Treebank,
developed by authors with a non-linguistic engineering background.

Currently, the VLSP 2023 Vietnamese Treebank comprises a constituency


dataset annotated with head labels and does not include a corresponding de-
pendency dataset. To adapt to this limitation and fulfill the requirements of the
HPSG Neural Parser, we devised an initial set of head-percolation rules specifi-
cally for the VLSP 2023 Vietnamese Treebank. Notably, these rules were devel-
oped by individuals with a non-linguistic engineering background. Their design
facilitates the conversion of constituency parsing to dependency parsing. This
conversion process is demonstrated in Fig. 4, providing a visual representation
of the approach. For the practical application of this conversion, we modified a
script from the ClearNLP framework5.

5 https://github.com/clearnlp/clearnlp/tree/master/src/main/java/com/clearnlp/conversion
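
A hedged sketch of head percolation is given below; the rule table is purely illustrative, whereas the rules actually drafted for the VLSP 2023 treebank are those summarized in Fig. 4.

from nltk import Tree

HEAD_RULES = {   # parent label -> (search direction, preferred child labels); illustrative only
    "S":  ("left", ["VP", "NP"]),
    "VP": ("left", ["V", "VP"]),
    "NP": ("left", ["N", "NP"]),
}

def head_child_index(tree):
    direction, preferred = HEAD_RULES.get(tree.label(), ("left", []))
    order = range(len(tree)) if direction == "left" else range(len(tree) - 1, -1, -1)
    for label in preferred:
        for i in order:
            child = tree[i]
            if isinstance(child, Tree) and child.label().startswith(label):
                return i
    return 0 if direction == "left" else len(tree) - 1

def to_dependencies(tree, arcs, offset=0):
    """Return the head word index of `tree`; append (dependent, head) arcs to `arcs`."""
    if not isinstance(tree, Tree):                         # bare token
        return offset
    if len(tree) == 1 and not isinstance(tree[0], Tree):   # pre-terminal (POS tag over a word)
        return offset
    child_heads, pos = [], offset
    for child in tree:
        child_heads.append(to_dependencies(child, arcs, pos))
        pos += len(child.leaves()) if isinstance(child, Tree) else 1
    head = child_heads[head_child_index(tree)]
    for h in child_heads:
        if h != head:
            arcs.append((h, head))
    return head

# Example: to_dependencies(Tree.fromstring("(S (NP (N Nam)) (VP (V đọc) (NP (N sách))))"), arcs=[])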

5.3 Results on VTB and VnDT


Table 1 presents comprehensive results from our experiments on the VTB and
VnDT datasets, showcasing the proficiency of our parser in both constituency
and dependency parsing. Notably, in constituency parsing, our parser achieved


an impressive F-score of 82.34%, indicating its exceptional performance. In de-
pendency parsing, it outperformed other models in several metrics, though it
achieved slightly lower Labeled Attachment Score (LAS) values. This outcome
may be attributed to our emphasis on arc permutation without altering the
original labels, as the adjustments were made without input from a linguistic
expert.

Table 1. Performance of constituency parsing on the VietTreebank test set and of
dependency parsing on the VnDT test set. The models were each run five times, as in
previous studies. The predicted part-of-speech tags were generated using the
VnCoreNLP toolkit [28]. POS denotes part-of-speech.
Constituency Parsing (VietTreebank test set)
Model                                     P      R      F
Self-Attentive w/ XLM-Rbase [29]          79.95  78.61  79.28
Self-Attentive w/ XLM-Rlarge [29]         80.78  81.61  81.19
Self-Attentive w/ PhoBERTbase [29]        81.14  79.60  80.36
Self-Attentive w/ PhoBERTlarge [29]       80.55  80.54  80.55
W/o POS tags: HPSG w/ XLM-Rbase           78.86  79.06  78.96
              HPSG w/ XLM-Rlarge          80.98  81.24  81.11
              HPSG w/ PhoBERTbase         80.96  81.76  81.36
              HPSG w/ PhoBERTlarge        81.54  82.32  81.93
W/ POS tags:  HPSG w/ XLM-Rbase           79.49  79.70  79.60
              HPSG w/ XLM-Rlarge          81.57  81.42  81.50
              HPSG w/ PhoBERTbase         81.46  82.11  81.78
              HPSG w/ PhoBERTlarge        82.03  82.49  82.34

Dependency Parsing (VnDT test set)
Model                                     LAS    UAS
Biaffine w/ XLM-Rbase [7]                 76.46  83.10
Biaffine w/ XLM-Rlarge [7]                75.87  82.70
Biaffine w/ PhoBERTbase [7]               78.77  85.22
Biaffine w/ PhoBERTlarge [7]              77.85  84.32
W/o POS tags: HPSG w/ XLM-Rbase           75.64  83.51
              HPSG w/ XLM-Rlarge          77.67  85.06
              HPSG w/ PhoBERTbase         77.60  85.18
              HPSG w/ PhoBERTlarge        78.16  85.73
W/ POS tags:  HPSG w/ XLM-Rbase           75.98  83.51
              HPSG w/ XLM-Rlarge          77.92  85.14
              HPSG w/ PhoBERTbase         77.92  85.36
              HPSG w/ PhoBERTlarge        78.42  85.73

In Table 1, the enhanced performance of the models, especially those lever-


aging PhoBERT, is evident. The inclusion of part-of-speech (POS) tags signif-
icantly improved parsing accuracy, highlighting the critical role of POS infor-
mation in these tasks. Lastly, the findings of this research underscore the need
for greater consideration of simplified HPSG rules by linguistic experts when
designing and refining treebanks for Vietnamese natural language processing.
This comprehensive analysis provides valuable insights into the capabilities and
limitations of different parsing models in addressing the complexities of the Viet-
namese language.
In Table 2, we report the performance of dependency parsing on the VnDT
test set, highlighting the efficacy of various models using the PhoBERTbase backbone.
The PhoNLP model with PhoBERTbase initially achieved an LAS of 78.17 and a
UAS of 84.95, while our replication of this model, denoted by [], yielded slightly
lower scores of 77.38 and 84.44, respectively. When examining the HPSG model
with PhoBERTbase , both with and without POS tags, we observed varied per-
formances. Without POS tags, the lowest scores among five runs (indicated by
[♣]) were significantly different from the replicated PhoNLP result, as shown by
the red bold font (LAS: 77.40, UAS: 85.05), while the average of these runs ([])
showed a slightly higher LAS of 77.60 and the best UAS of 85.18. With POS
tags included, the lowest and average results were 77.28 (LAS) and 85.01 (UAS)
for the former, and 77.48 (LAS) and 85.22 (UAS) for the latter, with the lowest
results again marked significantly different in red bold font. This analysis under-
scores the nuanced impact of POS tags on parsing accuracy and demonstrates
the robustness of the HPSG model with PhoBERTbase in dependency parsing
tasks.

Table 2. Performance of dependency parsing on the VnDT test set. [] represents our
replicated PhoNLP result. [] represents the average result of five runs, with the lowest
result indicated by [♣]. Red bold font indicates that our result is significantly different
from the result of PhoBERTbase [] according to a paired t-test. POS denotes
part-of-speech.

Model                                LAS    UAS
PhoNLP w/ PhoBERTbase [30]           78.17  84.95
PhoNLP w/ PhoBERTbase []             77.38  84.44
W/o POS tags
  HPSG w/ PhoBERTbase [♣]            77.40  85.05
  HPSG w/ PhoBERTbase []             77.60  85.18
W/ POS tags
  HPSG w/ PhoBERTbase [♣]            77.28  85.01
  HPSG w/ PhoBERTbase []             77.48  85.22

5.4 Results of the VLSP 2023 Challenge on Vietnamese Constituency Parsing

The tagging results show mixed performance. We achieved very high accuracy in
tagging punctuation marks (almost perfect at 0.998), coordinating conjunctions
(0.983), and adverbs (0.954). This means the system is excellent at recognizing
these types of words. We also got strong scores in personal pronouns (0.954),
copular verbs (0.952), nouns (0.959), and verbs (0.942), indicating good perfor-
mance in important word categories.
However, the system struggled with some tags. It scored zero on categories
like Nby and ADJb, which means it could not identify these word types at all.
It also had low scores in PRO:det (0.471), Nb (0.388), and very low in ADJ:adv
(0.08). These are areas where we need to improve the system.
We evaluated the tagging and parsing performance on the public set of the
VLSP 2023 Vietnamese Treebank. The results are presented in Fig. 5.
In parsing performance, as shown in Fig. 5, we provide detailed F-scores for
specific syntactic categories, revealing more about each parser’s strengths and
weaknesses. For example, in parenthetical clauses (PRN), the HPSG parser scored
0.735, which is better than Stanza's 0.549. In noun phrases (NP), both parsers
performed almost identically, with Stanza slightly ahead at 0.826 compared to
HPSG's 0.825.

Fig. 5. Our tagging and parsing results on the public set of the VLSP 2023 Vietnamese
Treebank.

Both parsers struggled with certain categories like WHADVP, UCP, and VCP,
where they scored zero. This indicates these areas need more attention and
improvement. By looking closely at the scores for different categories, we can
better understand each parser’s strengths and where they need to improve in
processing natural language.
Table 3 presents the results of the VLSP 2023 Shared Task6 , comparing three
parsers: Stanza, HPSG, and Attach-Juxtapose. The performance is measured
using Precision (P), Recall (R), and F-score (F) on both public and private test
datasets.
By comparing the results, we see that the Stanza and HPSG parsers per-
form similarly in the public test, with F-scores of 85.87 and 86.05, respectively.
However, in the private test, the HPSG parser performs slightly better, achiev-
ing an F-score of 89.04 compared to Stanza’s 88.73. This indicates that the
HPSG parser is better at handling unseen data. The Attach-Juxtapose parser
performs slightly lower than Stanza and HPSG overall but shows promising re-
sults when combined with specific models like PhoBERTbase-v2 . This suggests
that the Attach-Juxtapose parser has potential for improvement in future appli-
cations.

6 The VLSP 2023 Workshop Program is available at https://vlsp.org.vn/vlsp2023.

Table 3. Results of the VLSP 2023 Shared Task: Performance metrics of the Stanza
parser [27], HPSG parser [13], and Attach-Juxtapose parser [31]. Note that the result
for the Attach-Juxtapose parser was reported by another team participating in the
shared task. The ‘&’ symbol denotes an ensemble of two language models.
                                                      Public Test          Private Test
Model                                                 P     R     F        P     R     F
Attach-Juxtapose w/ PhoBERTbase                       –     –     80.55    –     –     84.66
Attach-Juxtapose w/ PhoBERTbase-v2                    –     –     81.09    –     –     84.79
Attach-Juxtapose w/ PhoBERTlarge                      –     –     80.44    –     –     84.45
Attach-Juxtapose w/ [PhoBERTbase & PhoBERTlarge]      –     –     80.87    –     –     84.60
Attach-Juxtapose w/ [PhoBERTbase-v2 & PhoBERTlarge]   82.25 79.97 81.09    83.70 86.06 84.86
Stanza w/ PhoBERTlarge                                86.78 84.97 85.87    89.56 87.91 88.73
HPSG w/ PhoBERTlarge                                  86.84 85.28 86.05    89.56 88.53 89.04

In summary, the results show that the HPSG parser is slightly stronger over-
all, especially in more challenging datasets, while Stanza remains highly com-
petitive. The Attach-Juxtapose parser, although not the strongest here, shows
room for growth with further refinements.

6 Conclusion and Future Work

This paper developed a neural parser for Vietnamese using a simplified Head-
Driven Phrase Structure Grammar (HPSG). To address the 15% of tree pairs in
the VietTreebank and VnDT corpora that did not conform to HPSG rules, we
permuted samples from the training and development sets. We then modified
the original parser by incorporating PhoBERT and XLM-RoBERTa models for
Vietnamese text encoding. Our experiments showed that the parser achieved an
82% F-score in constituency parsing and outperformed previous studies in de-
pendency parsing with a higher Unlabeled Attachment Score (UAS). The lower
Labeled Attachment Score (LAS) likely resulted from not consulting linguistic
experts. These results suggest the need for greater linguistic input when devel-
oping Vietnamese treebanks.

Acknowledgement. This research is funded by University of Information


Technology-Vietnam National University HoChiMinh City under grant number D1-
2024-67.

References
1. Nguyen, D.Q., Nguyen, D.Q., Pham, S.B., Nguyen, P.-T., Nguyen, M.L.: From
treebank conversion to automatic dependency parsing for Vietnamese. In: Métais,
E., Roche, M., Teisseire, M. (eds.) Natural Language Processing and Information
Systems, pp. 196–207. Springer, Cham (2014). ISBN: 978-3-319-07983-7
2. Nguyen, Q.T., Miyao, Y., Le, H.T.T., Nguyen, N.T.H.: Ensuring annotation consis-
tency and accuracy for Vietnamese treebank. Lang. Resour. Eval. 52(1), 269–315
(2018). ISSN: 1574-0218
3. Pollard, C., Sag, I.A.: Head-Driven Phrase Structure Grammar. The University of
Chicago Press, Chicago (1994)
4. Do, B.L., Le, T.H.: Implementing a Vietnamese syntactic parser using HPSG. In:
The International Conference on Asian Language Processing (IALP) (2008)
5. Nguyen, K.-H.: BKTreebank: building a Vietnamese dependency treebank. In: Pro-
ceedings of the Eleventh International Conference on Language Resources and
Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Asso-
ciation (ELRA) (2018)
6. Thi, L.N., My, L.H., Viet, H.N., Minh, H.N.T., Hong, P.L.: Building a treebank for
Vietnamese dependency parsing. In: The 2013 RIVF International Conference on
Computing & Communication Technologies - Research, Innovation, and Vision for
Future (RIVF), pp. 147–151 (2013)
7. Nguyen, D.Q., Nguyen, A.T.: PhoBERT: pre-trained language models for Viet-
namese. In: Findings of the Association for Computational Linguistics: EMNLP
2020, pp. 1037–1042. Association for Computational Linguistics (2020)
8. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale.
In: Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, pp. 8440-8451. Association for Computational Linguistics (2020)
9. Dozat, T., Manning, C.D.: Deep biaffine attention for neural dependency parsing.
In: 5th International Conference on Learning Representations, ICLR 2017, Toulon,
France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net (2017)
10. Chomsky, N.: The Pisa Lectures. De Gruyter Mouton, Berlin, New York (1993).
ISBN: 9783110884166
11. de Marneffe, M.-C., MacCartney, B., Manning, C.D.: Generating typed depen-
dency parses from phrase structure parses. In: Proceedings of the Fifth Interna-
tional Conference on Language Resources and Evaluation (LREC’06). European
Language Resources Association (ELRA), Genoa, Italy (2006)
12. Ma, X., Zhang, X., Zhao, H., Lu, B.-L.: Dependency parser for Chinese constituent
parsing. In: CIPS-SIGHAN Joint Conference on Chinese Language Processing
(2010)
13. Zhou, J., Zhao, H.: Head-driven phrase structure grammar parsing on PENN tree-
bank. In: Proceedings of the 57th Annual Meeting of the Association for Compu-
tational Linguistics, Florence, Italy, pp. 2396–2408. Association for Computational
Linguistics (2019)
14. Qi, P., Zhang, Y., Zhang, Y., Bolton, J., Manning, C.D.: Stanza: a python nat-
ural language processing toolkit for many human languages. In: Celikyilmaz, A.,
Wen, T.-H. (eds.) Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics: System Demonstrations, pp. 101-108. Association for
Computational Linguistics (2020)

15. Nguyen, T.-M.-H., Vu, X.-L., Ha, M.-L.: VLSP 2023 challenge on Vietnamese con-
stituency parsing (2023)
16. Nguyen, P.-T., Vu, X.-L., Nguyen, T.-M.-H., Nguyen, V.-H., Le, H.-P.: Building a
large syntactically-annotated corpus of Vietnamese. In: Proceedings of the Third
Linguistic Annotation Workshop (LAW III), pp. 182–185. Association for Compu-
tational Linguistics, Suntec, Singapore (2009)
17. Stern, M., Andreas, J., Klein, D.: A minimal span-based neural constituency parser.
In: Proceedings of the 55th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 818-827. Association
for Computational Linguistics (2017)
18. Gaddy, D., Stern, M., Klein, D.: What’s Going on in neural constituency parsers?
An analysis. In: Proceedings of the 2018 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long Papers). New Orleans, Louisiana, pp. 999–1010. Association for
Computational Linguistics (2018)
19. Nguyen, V.D., Nguyen, K.V., Nguyen, N.L.-T.: Variants of long short-term memory
for sentiment analysis on Vietnamese students’ feedback corpus. In: 2018 10th
International Conference on Knowledge and Systems Engineering (KSE), pp. 306–
311 (2018)
20. Trang, N.T.T., Ky, N.H., Rilliard, A., d’Alessandro, C.: Prosodic boundary pre-
diction model for Vietnamese text-to-speech. In: Interspeech 2021. Brno, Czech
Republic, pp. 3885–3889. ISCA (2021)
21. Linh, H.M., Huyen, N.T.M., Luong, V.X., Luong, N.T., Hue, P.T., Cuong, L.V.:
VLSP 2020 shared task: universal dependency parsing for Vietnamese. In: Pro-
ceedings of the 7th International Workshop on Vietnamese Language and Speech
Processing, Hanoi, Vietnam, pp. 77–83. Association for Computational Lingustics
(2020)
22. Nguyen, K.V., Nguyen, N.L.-T.: Vietnamese transition-based dependency parsing
with supertag features. In: 2016 Eighth International Conference on Knowledge
and Systems Engineering (KSE), pp. 175–180 (2016)
23. Nguyen, B.D., Nguyen, K.V., Nguyen, N.L.-T.: LSTM easy-first dependency pars-
ing with pre-trained word embeddings and character-level word embeddings in
Vietnamese. In: 2018 10th International Conference on Knowledge and Systems
Engineering (KSE), pp. 187–192 (2018)
24. Nguyen, D.Q.: A neural joint model for Vietnamese word segmentation, POS tag-
ging and dependency parsing. In: Proceedings of the The 17th Annual Workshop of
the Australasian Language Technology Association. Sydney, Australia, pp. 28–34.
Australasian Language Technology Association (2019)
25. Nguyễn, T.M.H., Romary, L., Rossignol, M., Vũ, X.L.: A lexicon for Vietnamese
language processing. Lang. Resour. Eval. 40, 291–309 (2006)
26. Vaswani, A., et al.: Attention is all you need. In: Proceedings of the 31st In-
ternational Conference on Neural Information Processing Systems, Long Beach,
California, USA, pp. 5998–6008. Curran Associates Inc. (2017)
27. Bauer, J., Bui, H., Thai, V., Manning, C.: In-order transition-based parsing for
Vietnamese. J. Comput. Sci. Cybernet. 39(3), 207–221 (2023)
28. Vu, T., Nguyen, D.Q., Nguyen, D.Q., Dras, M., Johnson, M.: VnCoreNLP: a Viet-
namese natural language processing toolkit. In: Proceedings of the 2018 Conference
of the North American Chapter of the Association for Computational Linguistics,
Demonstrations, New Orleans, Louisiana, pp. 56–60. Association for Computa-
tional Linguistics (2018)

29. Tran, T.-V., Pham, X.-T., Nguyen, D.-V., Nguyen, K.V., Nguyen, N.L.-T.: An
empirical study for Vietnamese constituency parsing with pre-training. In: 2021
RIVF International Conference on Computing and Communication Technologies
(RIVF), pp. 1–6 (2021)
30. Nguyen, L.T., Nguyen, D.Q.: PhoNLP: a joint multi-task learning model for Viet-
namese part-of-speech tagging, named entity recognition and dependency parsing.
In: Proceedings of the 2021 Conference of the North American Chapter of the As-
sociation for Computational Linguistics: Human Language Technologies: Demon-
strations, pp. 1–7. Association for Computational Linguistics (2021)
31. Yang, K., Deng, J.: Strongly incremental constituency parsing with graph neural
networks. In: Proceedings of the 34th International Conference on Neural Infor-
mation Processing Systems. NIPS ’20, Vancouver, BC, Canada. Curran Associates
Inc. (2020). ISBN: 9781713829546
Knowledge Distillation for Lumbar Spine
X-ray Classification

Minh-Khang Nguyen1(B) , Viet-Tham Huynh2,3 , Thuy-Giang Thi Vo1 ,


and Minh-Triet Tran2,3(B)
1
Huynh Man Dat High School for the Gifted, Rạch Giá, Kien Giang, Vietnam
[email protected]
2
Software Engineering Laboratory, University of Science, VNU-HCMC,
Ho Chi Minh City, Vietnam
[email protected]
3
Vietnam National University, Ho Chi Minh City, Vietnam
[email protected]

Abstract. Lumbar spondylosis is a prevalent chronic illness that results


in deformation of the lumbar spine and limits human movement. Over
time, spinal deformities can compress or exert tension on the nerve roots,
resulting in lower back discomfort and disc herniation. The incidence of
spondylosis is rising and increasingly affects younger individuals, a trend
driven by changes in modern work and study habits. X-ray imaging of the
lumbar spine is widely uti-
lized and endorsed by several physicians for its rapidity, precision, and
accessibility across diverse patient populations. This article introduces a
technique for detecting and classifying both abnormal and healthy lum-
bar spine X-ray pictures. After image filtration, we implement Knowl-
edge Distillation, wherein a trained teacher model instructs smaller stu-
dent models. We employ EfficientNet-B4 as the Teacher model, a high-
accuracy and efficient Convolutional Neural Network (CNN) architecture
for medical image analysis, and MobileNetV2 as the Student model,
which also utilizes the knowledge distillation approach. To assess the
model’s performance, 2,000 lumbar spine X-ray pictures were obtained
from Kien Giang General Hospital and Trung Cang General Clinic, with
872 samples designated for training and testing. The outcomes attained
an accuracy of 91.0%, a precision of 90.0%, a recall of 91.8%, and an
F1-score of 90.9%. The findings were achieved after 500 training epochs
with a learning rate of 0.001. This indicates that our suggested model has
strong performance with excellent dependability.

Keywords: Lumbar Spondylosis · Lumbar Spine · EfficientNet-B4 ·


MobileNetV2 · X-ray Images · Knowledge Distillation

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025
W. Buntine et al. (Eds.): SOICT 2024, CCIS 2350, pp. 329–342, 2025.
https://doi.org/10.1007/978-981-96-4282-3_27

1 Introduction
The human spine encompasses seven cervical vertebrae, twelve thoracic verte-
brae, five lumbar vertebrae, five sacral vertebrae, and four coccygeal vertebrae
[17]. The spine has a double S-shaped curve, facilitating movement flexibility.
Spondylosis symptoms frequently manifest in the lumbar and cervical regions, as
these areas endure the most significant strain, supporting the whole body [12].
Lumbar spondylosis is a chronic, progressively degenerative disorder that
induces discomfort, limits mobility, and results in deformity of the lumbar spine
[9]. The principal cause of this disorder is the aging process [1]. Moreover, addi-
tional factors encompass heredity [14], adverse living environment, and inade-
quate nourishment for the body [9]. As society advances and working circum-
stances enhance, the prevalence of office labor has risen; individuals now spend
most of their time seated and are less physically active [13]. This explains why,
although the principal etiology of this disorder is aging (it affects 80% of patients
over 40 years old), the disease is progressively affecting younger demographics,
evidenced by a notable increase in patients aged 20 to 29 exhibiting symptoms
of lumbar spondylosis [10].
Numerous imaging modalities are presently accessible for evaluating spinal
disease, including X-rays, computed tomography (CT), and magnetic resonance
imaging (MRI). Doctors frequently advise patients to choose lumbar X-rays [17],
as this technique offers extensive insights into spinal health, encompassing spinal
alignment, vertebral anatomy, bone cortex integrity, and degenerative or trau-
matic conditions [9], while also yielding prompt results at a minimal cost and
being readily available at numerous clinics and healthcare establishments [17].
Medical imaging informatics denotes using information and communication
technology (ICT) in healthcare imaging services. Due to the prevailing trend of
global aging [17] and a decline in physical activity linked to work patterns [13],
spine illnesses are becoming more prevalent. With the rising patient population,
physicians encounter an escalating workload that affects their diagnostic efficacy
[11]. The authors acknowledge the significance and necessity of facilitating the
swift identification of lumbar spondylosis by X-ray imaging methods.
Our main contributions are summarized as follows:

– We have collected and released a comprehensive dataset of 872 lumbar spine


X-ray images, with 422 labeled as normal and 450 as abnormal.
– We introduce a model specifically designed to improve the efficiency and
accuracy of diagnosing lumbar spine spondylosis through X-ray imaging. By
leveraging Knowledge Distillation, our approach trains an EfficientNet-B4 as
the Teacher model and a MobileNetV2 as the Student model, ensuring its
practical applicability in real-world healthcare environments.

The paper is organized as follows. Section 2 of the paper reviews related works.
Section 3 discusses the pathogenesis. The proposed system is presented in Sect. 4.
Section 5 presents the dataset and experiment, and Sect. 6 is a conclusion.

2 Related Work
In 2022, Trinh et al. conducted an extensive study on techniques for identify-
ing lumbar disc herniation using deep learning networks applied to X-ray images
[17]. Their research highlighted the potential of neural networks in medical image
analysis, particularly in identifying disc herniation, a common condition affect-
ing the lumbar spine. In the same year, Trinh et al. introduced the LumbarNet
model, a specialized deep-learning network designed to diagnose lumbar disc her-
niation from X-ray images autonomously. The goal of LumbarNet was to enhance
both the accuracy and efficiency of diagnostic processes. After a thorough evalua-
tion, the model achieved an impressive accuracy of 88.83% in vertebrae detection,
underscoring its potential in clinical settings [18].
Further developments in the field were seen with the work of Zhang et al.,
who proposed a novel approach using deep learning techniques for identifying
osteoporosis from a dataset comprising 1,616 X-ray scans, augmented by two
additional datasets of 204 and 396 images, respectively. The findings of this
research were promising, demonstrating the viability of using deep learning for
osteoporosis screening based on X-ray images [19]. This body of work illustrated
the growing interest in leveraging AI for bone disease detection, showing promis-
ing results in both accuracy and computational efficiency.
In a related study, Kong et al. advanced the diagnosis of fractures through
deep learning models, further expanding the application of AI in musculoskele-
tal imaging [8]. Moreover, a noteworthy contribution by Hong et al. involved the
development of a model capable of simultaneously detecting osteoporosis and
fractures, demonstrating the potential for multi-condition detection using a sin-
gle model framework [5]. These studies underscore the breadth of deep learning
applications in spinal and bone-related conditions.
Discussion. Despite these advancements, the application of deep learning tech-
niques specifically for identifying lumbar spondylosis from X-ray images has not
yet received significant attention. Existing research has largely focused on indi-
vidual issues affecting the lumbar spine, such as disc degeneration, herniation,
osteoporosis, and vertebral displacement. While these conditions are components
of lumbar spondylosis, comprehensive studies exploring the use of deep learning
for the full detection of this condition remain scarce. Lumbar spondylosis, which
involves the degeneration of intervertebral discs and joints and the formation
of bone spurs, requires a more focused approach to deep learning-driven diag-
nosis. This paper addresses this gap by proposing and evaluating the effective-
ness of two deep learning models, EfficientNet-B4 and MobileNetV2, employed
within a knowledge distillation framework. Specifically, EfficientNet-B4 acts as
the Teacher model during the initial training phase, guiding the learning pro-
cess for MobileNetV2, which functions as the Student model. This approach
aims to optimize both the efficiency and accuracy of the detection and classi-
fication of lumbar spondylosis in X-ray images. Knowledge distillation ensures
that the more computationally efficient MobileNetV2 model inherits the strong
performance characteristics of the larger EfficientNet-B4 model, facilitating its
application in real-world healthcare scenarios where resource constraints may be


a consideration.

3 Pathogenesis

Fig. 1. The three-joint complex [3]

The lumbar spine has five vertebrae designated as L1, L2, L3, L4, and L5.
According to Kirkaldy-Willis, each vertebra possesses an intervertebral linking
structure including a complex of three joints (see Fig. 1), which contains one disc
anteriorly and two facet joints posteriorly. This
construction facilitates flexible joint movement while ensuring stability. Due to
the interrelated nature of this system, injury to one joint adversely impacts the
others.
Initially, lumbar spondylosis tends to manifest in a limited number of verte-
brae, most commonly at the L4-L5 or L5-S1 levels. These vertebrae are located
at the lower segment of the vertebral column, where they support the body’s
weight, and are subjected to significant mechanical stress due to their position in
the spinal curvature. Over time, if the condition is not diagnosed and managed
early, these degenerative changes may spread to other adjacent vertebrae within
the lumbar spine.
Figure 2 presents a detailed depiction of the spectrum of degenerative change,
illustrating the initial damage occurring in two directions (posterior joint and
intervertebral disc), leading to the development of intricate multi-level degenera-
tive lesions. These modifications underscore the significance of early identifica-
tion and prompt action to avert future decline in spine health.

Fig. 2. The spectrum of degenerative change
For damage oriented toward the posterior joint, synovial reactions, carti-
lage destruction, osteophyte formation contribute to capsular laxity, subluxa-
tion, and lateral nerve entrapment. These biomechanical alterations can lead to
spinal instability, which worsens over time, impacting the overall function of the
lumbar spine. As these conditions persist, the enlargement of articular processes
and osteophytes at the vertebral bodies lead to multilevel degenerative lesions,
which characterize advanced stages of lumbar spondylosis. The entire procedure
is illustrated in Fig. 3, and the specifics will be addressed in Sect. 3.1.
Lesions in the direction of the intervertebral disc can manifest as circumfer-
ential and radial tears, which may further progress to internal disruption and
herniation. Herniation, in particular, is a critical event that affects the overall
stability of the lumbar region, potentially resulting in a reduction in disc height
and disc resorption. Specifics will be addressed in Sect. 3.2, and Fig. 4 will
illustrate the entire procedure.

3.1 Posterior Joint

Fig. 3. Posterior joint lesions

The posterior joints are categorized as diarthrodial joints, which enable


movement between bones and possess articular cartilage and synovial mem-
branes. The posterior joint capsule consists of collagen, connective tissue, and
synovial membranes, which safeguard the interior components of the joint, confer
stability, and relay sensory impulses to the central nervous system. Degenerative
processes will induce specific alterations in these joints.
Degeneration damages the cartilage in the facet joints of the lumbar spine,
leading to inflammation. At this juncture, the synovial membrane in the facet
joint is activated, resulting in an augmented synthesis of synovial fluid to safe-


guard the joint from damage due to heightened friction between the joint sur-
faces. Nonetheless, the persistent degenerative process ensures that the damage
remains, resulting in the overproduction of synovial fluid, which induces syn-
ovitis, leads to effusion, and diminishes the viscosity of the synovial fluid, so
impairing its lubricating properties. This results in edema, rigidity, and dimin-
ished joint movement.
The inflammation of the synovial membrane results in the secretion of pro-
inflammatory cytokines, impairing the function of chondrocytes in cartilage
and hindering their production of vital cartilage components, including collagen
and proteoglycan, which diminishes the cartilage’s elasticity and its capacity to
endure pressure. Tissue-degrading enzymes, including matrix metalloproteinases
and aggrecanases, released from the inflamed synovial membrane, directly target
and decompose the fundamental constituents of cartilage, namely collagen and
the moisture-retaining, elastic component aggrecan; they disrupt the connections
between chondrocytes and the extracellular matrix, leading to a disconnection
between chondrocytes and their surrounding milieu.
The overproduction of synovial fluid due to the synovial response exacer-
bates strain on the joint surfaces. This exacerbates surface injury, disrupts carti-
lage regeneration, and hastens deterioration. Moreover, variations in weight and
pressure stresses, together with modifications in oxygen stress and hydrostatic
pressure, contribute to the development of osteophytes.
The cartilage degradation and osteophyte growth resulting from inflamma-
tory reactions progressively elevate strain on the joint capsule and ligaments.
This pressure causes inflammation of the ligaments and joint capsule, leading to
a diminished capacity to support the joint. This inflammation results in the joint
capsule and ligaments’ distension, leading to joint laxity symptoms. When the
joint capsule becomes slack and unstable, the capacity to regulate and perform
joint motions diminishes, resulting in joint instability, which causes misalignment
or aberrant movement of the joint. Suppose the joint’s pressure surpasses the
adjacent tissues’ stability threshold. In that case, the joint surfaces will momen-
tarily be displaced from their original position, resulting in subluxation. When
the joint is dislocated and subluxated, it may compress adjacent nerves due to
the small channels via which the nerves traverse.
Following degeneration and subsequent subluxation of the joint surfaces, the
joint processes and vertebral plates must endure increased pressure to preserve
joint stability. At this stage, the articulating surfaces and spinal plates will
increase in size, density, and rigidity.

3.2 Intervertebral Disc

Fig. 4. Intervertebral disc lesions

The intervertebral disc in the lumbar region undergoes a degenerative process
that typically unfolds in three distinct, sequential stages. These stages reflect
the cumulative impact of numerous microtraumas sustained by the disc over the
course of a patient’s life. In the earliest stage, these microtraumas are small, grad-
ual injuries that go largely unnoticed by the patient due to their subtle nature.
Since the changes caused by these microtraumas are minimal at first, patients
often find it difficult to identify any abnormal alterations in their body, and no
significant symptoms may be apparent. However, as these injuries accumulate
over time, they progressively undermine the integrity of the intervertebral disc,
setting the stage for more significant structural damage.
As the process advances, the microtraumas give rise to circumferential tears
within the outer layers of the annulus fibrosus, which is the tough, fibrous ring
surrounding the softer core of the disc. These tears mark the beginning of disc
degeneration and can induce discomfort, particularly on the external surface
of the disc. This represents the initial phase of the degeneration process, during
which the disc’s ability to maintain its structure and function is compromised. As
these circumferential tears accumulate and spread, they may eventually coalesce
into radial tears. These deeper tears extend from the outer layers of the annulus
fibrosus into the central, gelatinous nucleus pulposus of the disc. The formation
of radial tears significantly weakens the disc, making it more susceptible to disc
herniation, which occurs when the nucleus pulposus is displaced outward through
the compromised annulus fibrosus.
As degeneration progresses, the disc’s height begins to decrease. This reduc-
tion in height is directly related to the disc’s declining ability to retain water,
as the structural damage impairs the disc’s capacity to absorb and hold onto
moisture. The desiccation or drying out of the disc leads to a loss of its cushion-
ing properties, which are essential for absorbing mechanical forces in the spine.
The onset of instability in the spinal joints often accompanies this loss of disc
height. The weakening of the disc, combined with the loosening of the joints,
compromises the stability of the entire lumbar region.
As these degenerative changes continue to accumulate, the disc sustains fur-
ther damage. New tears form within the annulus fibrosus, compounding the
existing injuries and further reducing disc height. This ongoing degeneration
reduces the overall space within the disc, effectively diminishing its ability to
perform its biomechanical functions. At the same time, in response to the insta-
bility and reduced disc space, the body begins to form osteophytes, or bone
spurs, around the affected vertebrae. These bony projections, which develop as
the body attempts to stabilize the spine, can contribute to additional complica-
tions, such as nerve impingement and further joint stiffness. As the degeneration
advances to this stage, the once-flexible and resilient intervertebral disc becomes
severely compromised, and the affected spinal segments may experience chronic
pain and reduced mobility.

4 Proposed System
4.1 Convolutional Neural Network
A Convolutional Neural Network (CNN) is a deep learning model specifically
designed to process grid-like data such as images by automatically extracting
features through different filters across multiple layers. CNNs are widely applied
in fields such as image classification and object detection.

4.2 Teacher Model (EfficientNet-B4)


The EfficientNet model stands as one of the most advanced convolutional neural
network (CNN) architectures, boasting impressive accuracy rates ranging from
77.3% for the EfficientNet-B0 model to 84.4% for the more complex EfficientNet-
B7 model in the widely recognized ImageNet classification task [16]. This model
family is notable for its scalability across three dimensions: breadth (the number
of channels), depth (the number of layers), and resolution (the input image size),
which collectively allow for a more efficient and balanced performance across
various tasks. The parameters of these models also scale accordingly, starting
from 5.3 million parameters in the EfficientNet-B0 model and increasing to 66
million parameters in the EfficientNet-B7 model, making the latter capable of
handling more complex datasets and computational tasks.
In this study, the EfficientNet-B4 model is utilized, representing an enhanced
version of the base EfficientNet architecture. The B4 model balances model size,
computational efficiency, and accuracy, making it well-suited for tasks requiring
both speed and precision in data processing. Specifically, the EfficientNet-B4
model has been improved in breadth, depth, and resolution compared to the
baseline EfficientNet-B0 model, rendering it more appropriate for handling intri-
cate image classification tasks. This makes it particularly suitable for specialized
applications, such as classifying lumbar spine X-ray images, which often involve
detailed and complex visual information that requires advanced processing capa-
bilities [20].
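
As a minimal sketch (assuming a recent torchvision; the paper's training configuration is not reproduced here), the Teacher can be instantiated from pre-trained EfficientNet-B4 with its ImageNet head replaced by a two-class normal/abnormal classifier:

import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained EfficientNet-B4 and swap in a binary classification head.
teacher = models.efficientnet_b4(weights=models.EfficientNet_B4_Weights.IMAGENET1K_V1)
in_features = teacher.classifier[1].in_features   # 1792 for EfficientNet-B4
teacher.classifier[1] = nn.Linear(in_features, 2)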

4.3 Student Model (MobileNetV2)


MobileNet is a CNN architecture family created for mobile and embedded
devices, emphasizing speed and memory efficiency without sacrificing perfor-
mance in image processing tasks [6]. MobileNetV2, an improved version, further
reduces computational complexity while enhancing overall performance, making
it suitable for resource-limited environments.
Though MobileNetV2 has slightly lower accuracy than larger models like
EfficientNet-B4, it is significantly faster and more lightweight. This balance
of efficiency and performance makes it ideal for use as a Student model in
Knowledge Distillation techniques, as discussed in Sect. 4.4, where computational
demands are minimized while maintaining stable results [15].

4.4 Knowledge Distillation

In deep learning, Knowledge Distillation is a highly effective method for trans-


ferring the learned knowledge from a large, complex model, the Teacher, to a
smaller, simpler model called the Student [4]. The Teacher model, pre-trained
on a large dataset, typically possesses higher accuracy and complexity, making
it well-suited for tasks with significant computational resources. The Student
model, in contrast, is trained to imitate the teacher’s predictions, learning from
both the hard labels (the actual data labels) and the soft labels (the probabili-
ties output by the Teacher). This process allows the Student to approximate the
Teacher’s performance, albeit with far fewer parameters and lower computational
overhead.
Because of its reduced size, the Student model can execute tasks more quickly,
making it ideal for deployment in real-world platforms with constrained com-
puting resources, such as mobile devices, web applications, and IoT systems.
Knowledge Distillation can dramatically shrink the model’s size and computa-
tional requirements while preserving a significant portion of the Teacher model’s
performance [7]. This technique is particularly advantageous in environments
where speed and efficiency are critical without compromising too much on accu-
racy. This process makes deep learning models more accessible for a wider range
of practical applications.
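
A minimal sketch of this objective is shown below: the Student (MobileNetV2) is trained to match the Teacher's softened outputs while also fitting the hard labels. The temperature T and mixing weight alpha are illustrative assumptions; the paper does not report these values.

import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

# Student: MobileNetV2 with a two-class head (normal/abnormal).
student = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
student.classifier[1] = nn.Linear(student.classifier[1].in_features, 2)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # T and alpha are assumed values, not reported in the paper.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard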

5 Dataset and Experiments


5.1 Data Collection

Table 1. Images used in the experiment

                   Train set  Test set
 Normal images     372        50
 Abnormal images   400        50
 Total images      772        100

The dataset for this study was collected from 1,000 patients from Kien Giang
Provincial General Hospital between January 2024 and early September 2024;
Kien Giang Traditional Medicine Hospital from June to July; and from Trung
Cang General Clinic from June 2024 to the end of August 2024. The author
obtained permission to collect data from the management of the three mentioned
hospitals, and all data was manually
retrieved using each hospital's respective PACS software (Picture Archiving and
Communication System). Each patient provided two images corresponding to frontal
and lateral X-rays of the lumbar spine, totaling 2,000 images. The research team used
the European Guidelines on Quality Criteria for Diagnostic Radiographic Images
(EGQCDRI) to assess image quality based on four main
criteria: (1) Image Quality Assessment for AP or PA Projection, (2) Image
Quality Assessment for Lateral Projection, (3) Image Quality Assessment for
Lateral Projection of the Lumbo-Sacral, and (4) General Assessment [2]. Only
872 images met the standard for image quality. The images were manually anno-
tated by a Level 1 specialist radiologist and later validated by another radiolo-
gist. The labeled images were then divided into two groups: Normal images and
Abnormal images (Table 1 shows an overview of the number of photographs in
each group; Fig. 5 showcases sample images of cases from the two groups).

Fig. 5. The image illustrates our dataset. The top 4 images are normal and the bottom
4 images are abnormal.

5.2 Experiments

The Teacher model, EfficientNet-B4, was employed for training on the dataset
described in Table 1. This dataset includes 872 images, consisting of 422 normal
images and 450 images showing various stages of lumbar degenerative disease.
Each image was carefully curated to provide a balanced dataset for the clas-
sification task, allowing the model to learn key features associated with both
healthy and degenerated lumbar conditions. However, one of the primary chal-
lenges with this dataset was the variability in the original image sizes. The width
of the images ranged from 918 to 1853 pixels, while the height varied between
477 and 957 pixels, making it necessary to preprocess the images to a standard
size before feeding them into the model.
To ensure uniformity and compatibility with the EfficientNet-B4 architec-
ture, all images were resized to a fixed dimension of 380 × 380 pixels. This resiz-
ing was essential for maintaining the model’s performance, as EfficientNet-B4
relies on consistent input sizes to perform efficiently, especially in large-scale
image classification tasks. By standardizing the image dimensions, we not only
facilitated the model’s training process but also ensured that the spatial fea-
tures of the images were preserved as much as possible, allowing the Teacher
model to better detect subtle differences between normal and degenerated lum-
bar spine images. This preprocessing step was crucial to adapting the dataset to
the EfficientNet-B4 architecture, optimizing the model’s ability to learn from a
diverse range of input images.
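
The resizing step can be expressed with torchvision transforms as below; the normalization statistics are the usual ImageNet values and are an assumption, not taken from the paper.

from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((380, 380)),   # fixed input size used for EfficientNet-B4
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])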
We executed training for 500 epochs with a batch size of 16 to facilitate regular
model updates. The learning rate was established at 0.001 to avert instability
and ensure efficacy throughout training. The Teacher model (EfficientNet-B4)
was pre-trained to produce labels for the dataset samples. These labels provide
more information on the probability of each class training the student model
(MobileNetV2). This aims to assist the Student model in boosting its capacity
to learn from the intricate and more substantial traits identified by the Teacher
model, hence improving its generalization ability.

Table 2. Experimental results

Backbone Parameters Accuracy Precision Recall F1-score


EfficientNet-B4 19.0M 93.0% 93.4% 93.0% 93.0%
MobileNetV2 3.4M 90.0% 90.0% 90.0% 90.0%
EB4 distill to MV2a 3.4M 91.0% 90.0% 91.8% 90.9%
a EfficientNet-B4 distilled to MobileNetV2

Table 2 points out that the Teacher model has commendable classification
accuracy on the dataset. The F1-score is high at 93%, signifying a desirable
equilibrium between Precision and Recall. This is noteworthy, signifying that the
model not only accurately identifies positive cases but also reduces the incidence
of incorrect predictions for the positive class.
The Student model (MobileNetV2) will utilize the Knowledge Distillation
technique to assimilate the Teacher model's outcomes via labels, enhancing the
model's image recognition proficiency in classifying lumbar spine X-ray pictures.
The experimental procedure of the Knowledge Distillation model on the Test set
is depicted in Fig. 6, demonstrating the model's exceptional capacity to predict
True Positive and False Positive values, with a negligible number of wrong
predictions.

Fig. 6. Confusion matrix for the Knowledge Distillation model on the Test set.

Therefore, our experimental results show that the
MobileNetV2 model after distillation has higher accuracy (91%) than the
MobileNetV2 model without distillation (90%). In addition, compared to the
Teacher model, the Student model is 5.5 times smaller in size but the Accuracy
is only reduced by 2% (The details can be seen in Table 2).

The implementation of Knowledge Distillation was intended to modify the


model for diverse real-world applications, given that most healthcare institutions
do not possess the computational capacity to operate intricate models such as
EfficientNet-B4 (Teacher). Our method substantially decreases computational
requirements by transferring knowledge to the smaller MobileNetV2 (Student)
model, enabling deployment on conventional computers while preserving good
accuracy, with merely a 2% reduction. The Student model exhibits superior
inference speed owing to its reduced parameter count, which can be seen in
Table 2, is vital in medical contexts where swift and precise analysis is crucial
for prompt patient outcomes. These benefits render our condensed model more
applicable for real-world implementations.

Fig. 7. The loss function values of the Teacher and Student models.

Moreover, Fig. 7 illustrates the training process of both the Teacher and Stu-
dent models over 500 epochs. The initial loss value of the Teacher model was
0.6525, higher than that of the Student model, which started at 0.5254. The
loss for both models began to decrease rapidly and then gradually stabilized
around epoch 150, indicating that the models were converging well. By the end
of the training process, the loss value of the Teacher model was 0.0385, repre-
senting a 94% reduction compared to the initial value. This final loss was also
lower than the Student model, which ended at 0.0513. These results suggest
that although the Student model initially had an advantage due to its simpler
structure, allowing it to quickly capture basic features, the more complex archi-
tecture of the Teacher model ultimately outperformed it by better learning and
optimizing important features from the data. This lays the groundwork for the
Knowledge Distillation process, as the lower loss of the Teacher model indicates
higher accuracy, enabling the Teacher’s knowledge to be transferred to enhance
the performance of the Student model.

6 Conclusion
In this research, we have collected and published a comprehensive dataset of
lumbar spine X-ray images, which includes 872 images. The dataset we offer is
dependable, as all photos have undergone quality verification with the EGQC-
DRI scale, a respected instrument that assists radiologists in evaluating the qual-
ity of medical images prior to diagnosis. The dataset labels were initially assigned
by one radiologist and subsequently validated by another radiologist. Addition-
ally, this paper delves into the study of Convolutional Neural Network (CNN)
architectures and the technique of knowledge distillation to effectively detect
and classify images based on our dataset. The results obtained from the testing
process demonstrate that our model exhibits high efficiency in image recognition
and classification tasks. Although the distilled model does not achieve the same
level of accuracy as the teacher model, it still outperforms the original model.
This promising outcome paves the way for future advancements, offering signif-
icant potential for improving diagnostic accuracy and effectiveness in medical
image analysis.

References
1. Buckwalter, J.A., Saltzman, C., Brown, T.: The impact of osteoarthritis: implica-
tions for research. Clin. Orthop. Relat. Res. (1976-2007) 427, S6–S15 (2004)
2. Doktor, K., Vilholm, M.L., Hardardóttir, A., Christensen, H.W., Lauritsen, J.:
European guidelines on quality criteria for diagnostic radiographic images of the
lumbar spine-an intra-and interobserver reproducibility study. Chiropractic Manual
Ther. 27, 1–6 (2019)
3. Grivas, T.B., et al.: Are the spinal changes in the course of scoliogeny primary but
secondary? J. Clin. Med. 13(8), 2163 (2024)
4. Hinton, G.: Distilling the Knowledge in a Neural Network. arXiv preprint
arXiv:1503.02531 (2015)
5. Hong, N., et al.: Deep-learning-based detection of vertebral fracture and osteo-
porosis using lateral spine X-ray radiography. J. Bone Mineral Res. 38(6), 887–895
(2020)
6. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile
vision applications. arXiv preprint arXiv:1704.04861 126 (2017)
7. Kabir, M.M., Mridha, M.F., Rahman, A., Hamid, M.A., Monowar, M.M.: Detec-
tion of COVID-19, pneumonia, and tuberculosis from radiographs using AI-driven
knowledge distillation. Heliyon 10(5) (2024)
8. Kong, S.H., et al.: Development of a spine X-ray-based fracture prediction model
using a deep learning algorithm. Endocrinol. Metab. 37(4), 674–683 (2022)
9. Middleton, K., Fish, D.E.: Lumbar spondylosis: clinical presentation and treatment
approaches. Curr. Rev. Musculoskelet. Med. 2, 94–104 (2009)
10. Rothschild, B.: Lumbar spondylosis. Emedicine Publication (2008)
11. Sabri, N., Hamed, H.N.A., Ibrahim, Z., Ibrahim, K.: 2D photogrammetry image of
scoliosis Lenke type classification using deep learning. In: 2019 IEEE 9th Interna-
tional Conference on System Engineering and Technology (ICSET), pp. 437–440.
IEEE (2019)

12. Sasiadek, M.J., Bladowska, J.: Imaging of degenerative spine disease-the state of
the art. Adv. Clin. Exp. Med. 21(2), 133–142 (2012)
13. Shrestha, N., et al.: Workplace interventions for reducing sitting at work. Cochrane
Database Syst. Rev. 6 (2018)
14. Spector, T.D., MacGregor, A.J.: Risk factors for osteoarthritis: genetics.
Osteoarthr. Cartil. 12, 39–44 (2004)
15. Srinivasu, P.N., et al.: Classification of skin disease using deep learning neural
networks with MobileNet V2 and LSTM. Sensors 21(8), 2852 (2021)
16. Tan, M.: Efficientnet: rethinking model scaling for convolutional neural networks.
arXiv preprint arXiv:1905.11946 (2019)
17. Trinh, G.M., et al.: Detection of lumbar spondylolisthesis from Xray images using
deep learning network. J. Clin. Med. 11(18), 5450 (2022)
18. Trinh, G.M., et al.: LumbarNet: A Deep Learning Network for the Automated
Detection of Lumbar Spondylolisthesis From X-Ray Images (2022)
19. Zhang, B., et al.: Deep learning of lumbar spine X-ray for osteopenia and osteoporo-
sis screening: a multicenter retrospective cohort study. Bone 140, 115561 (2020)
20. Zhang, P., Yang, L., Li, D.: EfficientNet-B4-Ranger: a novel method for green-
house cucumber disease recognition under natural complex environment. Comput.
Electron. Agric. 176, 105652 (2020)
Forecasting Traffic Flow Under
Uncertainty: A Case Study in Da Nang

Doan Phuoc Mien1(B) , Tran The Vu2(B) , and Ngo Van Sy3
1
Tra Vinh University, Tra Vinh City, Viet Nam
[email protected]
2
VN-UK Institute for Research and Executive Education,
The University of Da Nang, Da Nang, Viet Nam
[email protected]
3
Vietnam Research Institute of Electronics, Informatics and Automation Da Nang,
Da Nang, Viet Nam

Abstract. This paper discusses the design and implementation of a


modern traffic flow prediction system using data from street surveil-
lance cameras deployed at the website 0511.vn. The core objective of
the research was to develop an efficient prediction model based on direct
image analysis and real-time data, providing instant traffic information
and forecasting short-term traffic trends. Initially, it is necessary to iden-
tify and evaluate existing image processing and machine learning meth-
ods to filter out and classify vehicles from the collected video data. Sub-
sequently, the author designed models combining ARIMA and LSTM
methods to predict the density and movement of vehicles on the roads.
These methods were tested and optimized through a series of experiments
on historical data and real-time data collected from 0511.vn, marking
a significant advancement in applying video surveillance technology to
urban traffic management. The research results not only contribute to
the field of data science and image processing but also have practical
potential in supporting the decision-making of traffic management agen-
cies and improving the community’s commuting experience.

Keywords: Forecasting Traffic flow · Traffic flow prediction ·


Uncertainty in traffic · Da Nang

1 Introduction

Traffic flow prediction is a critical issue in the field of transportation, as accurate


predictions can enhance traffic management by reducing congestion, travel time,
and environmental impact. However, forecasting traffic flow is challenging due
to the complexity of transportation systems and the variability of influencing
factors. Traffic flow fluctuates over time, varying by the hour of the day and the
day of the week. Effective prediction requires models capable of handling these
fluctuations.
c The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025
W. Buntine et al. (Eds.): SOICT 2024, CCIS 2350, pp. 343–353, 2025.
https://doi.org/10.1007/978-981-96-4282-3_28

Traffic uncertainty in Vietnam arises from several factors, including conges-


tion, unexpected events, increased transportation demand, and diverse trans-
portation modes. Urban areas frequently experience jams, particularly in Da
Nang City due to high tourist influxes. Accidents, sudden construction, and
various vehicle types contribute further complexity. Non-compliance with traffic
laws, along with rapid urbanization, heightens collision risks, while pedestrians
and cyclists create additional unpredictability. Weather conditions like rain and
fog also worsen visibility and traffic flow. Additionally, limitations in camera
data due to high angles or wide frames affect traffic monitoring.
When forecasting time series, it is common to identify a set of ranges that
the research will focus on. These ranges vary from study to study but typically
fall into three categories:

– Short-term forecasting: time range from 5 min to 30 min.


– Medium-term forecasting: from 30 min to several hours.
– Long-term forecasting: from one day to several days.

The primary objective of this research is to create an efficient prediction


model for traffic flow in Da Nang using real-time data from street surveillance
cameras. The practical applications of these predictions include supporting traffic
management agencies in decision-making, reducing congestion, and improving
the commuting experience for the community.

2 Literature Review
2.1 Related Research on Traffic Flow Prediction

The study [12] introduced Graph Neural Networks (GNNs), a type of neural net-
work based on graphs, utilized in structured data processing tasks like graphs or
networks. GNNs can handle structured and heterogeneous data sequences, aiding
in the prediction of traffic flow based on traffic network information. In [7], the
development of a deep learning model for predicting traffic flow was discussed. The
authors developed a deep learning model capable of accurately predicting non-
linear, space-time dependent traffic flows. However, this method was only tested
on two specific events and requires further experimentation and verification across
a broader range of data and traffic scenarios. The research [13] presented the use of
IoT and intelligent algorithms for collecting data from multiple sources and pro-
cessing this information to enhance traffic flow performance. The authors focused
on evaluating and comparing intelligent techniques used in traffic flow prediction
to understand the strengths and limitations of each method. However, the lack of
detailed presentation regarding the quantity of data could impact the objectivity
of the comparison results. In the study [8, 9, 14], the authors employed multiscale
temporal smoothing to address data loss and fill missing time intervals in traffic
flow data within an LSTM network. Although the results achieved high accuracy,
further research and testing on diverse datasets are needed to ensure the method’s
generalizability and reliability.

2.2 Related Research on Uncertain Traffic Environments


Traffic forecasting has long been a critical area in transportation research. Recent
advancements in Deep Neural Networks (DNNs) have significantly impacted
this field, as models trained on rich data from loop detectors and sensors have
addressed several challenges in traffic data. Recent applications include identi-
fying congestion, peak times, estimating travel times, and quantifying passenger
demand.
However, quantifying uncertainty in traffic forecasting remains limited. Some
methods, like the study [1], use parametric statistical models with confidence
intervals. The research [6] extends statistical time-series models by employing
adaptive Kalman filters. DNN-based methods, as explored in studies [11] and [5],
have investigated uncertainty quantification techniques such as quantile regres-
sion and Monte Carlo dropout. The study [3] combines various uncertainty quan-
tification methods with different learning models. Notably, research [4] applied
quantile regression to image-based traffic data but reported reduced performance
compared to their baseline model. This study aims to fill this gap by analyz-
ing non-distributional uncertainty quantification methods on large-scale traffic
datasets, primarily relying on image-based data, differing from previous studies
that mainly used loop detector data.

3 Methodology
3.1 Problem
Equation (1) represents the simplified problem and its components when the conditions of
the road network have been predetermined.

$\hat{x}_{t+1} = f(x_t, p_t, x_{t-1}, p_{t-1}, x_{t-2}, p_{t-2}, \ldots, x_{t-n}, p_{t-n})$   (1)

In this context, $x$ represents the current traffic state at time $t$, $p$ refers to the
parameters affecting the traffic state, $n$ is the sample size, $\hat{x}$ is the predicted
traffic state, and $f$ denotes the model applied to the historical data of the traffic
state and its parameters to make predictions. Each component is now detailed
further before exploring different prediction models.
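To make the notation concrete, the following minimal Python sketch (not taken from the paper; the lag order n = 3 and the random series are invented placeholders) shows how the lagged traffic states and parameters of Eq. (1) could be assembled into training samples:

    import numpy as np

    def build_lagged_samples(x, p, n):
        """Assemble ([x_t, p_t, ..., x_{t-n}, p_{t-n}], x_{t+1}) pairs as in Eq. (1)."""
        X, y = [], []
        for t in range(n, len(x) - 1):
            features = []
            for k in range(n + 1):            # lags 0..n
                features.extend([x[t - k], p[t - k]])
            X.append(features)
            y.append(x[t + 1])                # target: the next traffic state
        return np.array(X), np.array(y)

    # invented hourly flow counts and one exogenous parameter
    rng = np.random.default_rng(42)
    flow = rng.integers(50, 200, size=24).astype(float)
    param = rng.random(24)
    X, y = build_lagged_samples(flow, param, n=3)
    print(X.shape, y.shape)                   # (20, 8) (20,)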

3.2 Proposed Algorithm: HAIVAN-ALSTM


This research proposes a hybrid ARIMA-LSTM model, leveraging ARIMA’s
statistical strengths and LSTM’s deep learning capabilities.

ARIMA Model. The ARIMA model is a popular method in time series anal-
ysis for predicting future data based on past information. This model integrates
three main components:

– AR (Autoregression): A model where current data is explained by its own


past data.

– I (Integrated): To make the time series stationary, meaning its statistical


properties do not change over time, raw data may need to be transformed,
often through differencing.
– MA (Moving Average): A model where the current data is explained by past
errors and the current error.

An ARIMA(p, d, q) model is defined by three parameters and can be represented
by the following formula:

$\left(1 - \sum_{i=1}^{p} \phi_i L^i\right)(1 - L)^d X_t = \left(1 + \sum_{i=1}^{q} \theta_i L^i\right)\varepsilon_t$   (2)

where:
– $X_t$: the time series data at time $t$.
– $p$: the number of autoregressive parameters.
– $d$: the degree of differencing required to make the time series stationary.
– $q$: the number of moving average parameters.
– $L$: the lag operator, $L^i X_t = X_{t-i}$.
– $\phi_i$: the coefficients of the AR part.
– $\theta_i$: the coefficients of the MA part.
– $\varepsilon_t$: the error (residual) at time $t$.

The ARIMA model remains a fundamental tool in time series analysis, widely
used in forecasting economics, finance, weather, and many other application
fields. It provides a systematic method to consider both trend and cyclicality in
historical data, allowing users to generate well-founded and informed predictions
[10].
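As an illustration only (not part of the study), such a model can be fitted to a univariate count series with the statsmodels library; the order (1, 1, 1) and the data below are arbitrary placeholders, and in practice p, d, q would be chosen from ACF/PACF plots or an information criterion:

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    # invented hourly vehicle counts standing in for one camera node's series
    counts = np.array([120, 135, 150, 160, 145, 130, 125, 140, 155, 165, 150, 135],
                      dtype=float)

    model = ARIMA(counts, order=(1, 1, 1))   # placeholder (p, d, q)
    fitted = model.fit()
    print(fitted.forecast(steps=3))          # forecast the next three intervals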
Additionally, other statistical models such as Holt-Winters and Seasonal
ARIMA (SARIMA) can also be utilized for traffic flow prediction. These models
consider seasonality in the data, which is particularly useful for predicting traffic
patterns that exhibit regular seasonal variations.

LSTM Model. LSTM is a variant of RNNs capable of learning long-term


dependencies. In other words, LSTMs can remember information over extended
periods. They have a sequential architecture of repeating neural network modules
or cells. These modules in traditional RNNs have a simple structure, often a
single tanh layer.
However, in LSTMs, these modules consist of four interacting layers.
$f_t$ is the forget gate, deciding which information to discard (forget) from the cell state.
It takes $h_{t-1}$ as input and outputs a number in [0, 1] for each entry of the cell state
$C_{t-1}$. An output of 1 means the information is retained (remembered), and 0 means it is
discarded (forgotten).

$f_t = \sigma(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f)$   (3)

$i_t$ is the input gate:

$i_t = \sigma(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i)$   (4)

$o_t$ is the output gate:

$o_t = \sigma(W_o \cdot [C_t, h_{t-1}, x_t] + b_o)$   (5)

$b_f$, $b_i$, $b_o$ are the bias terms.
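The gate equations (3)–(5) can be sketched directly in NumPy; the sizes and weights below are random placeholders and, purely to keep the sketch short, the previous cell state is reused where Eq. (5) would use the updated cell state $C_t$:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    hidden, inp = 2, 1                          # toy sizes
    C_prev = rng.normal(size=hidden)            # C_{t-1}
    h_prev = rng.normal(size=hidden)            # h_{t-1}
    x_t = rng.normal(size=inp)                  # current input
    z = np.concatenate([C_prev, h_prev, x_t])   # [C_{t-1}, h_{t-1}, x_t]

    W_f, W_i, W_o = (rng.normal(size=(hidden, z.size)) for _ in range(3))
    b_f = b_i = b_o = np.zeros(hidden)

    f_t = sigmoid(W_f @ z + b_f)   # Eq. (3): forget gate
    i_t = sigmoid(W_i @ z + b_i)   # Eq. (4): input gate
    o_t = sigmoid(W_o @ z + b_o)   # Eq. (5): output gate (C_prev substituted for C_t)
    print(f_t, i_t, o_t)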
The mathematical formula for traffic flow prediction can be complex and
varies depending on the approach of each model and the dataset used. A simple
formula for predicting the average traffic flow on a specific road segment over a
specific period can be calculated by the ratio of the number of vehicles passing
through that segment to the time period:
$A = \frac{N}{T}$   (6)

where:
– $A$ is the average traffic flow (units: vehicles/hour).
– $N$ is the number of vehicles passing through the segment during the period.
– $T$ is the time (units: hours).
However, this formula only calculates average traffic flow on a specific road
segment over a specific period and cannot be applied to predict traffic flow
across the entire road system.
In deep learning models, the mathematical formula used is more complex,
involving multiple layers of CNNs or RNNs to learn traffic flow features, followed
by methods like softmax for predicting traffic flow across the entire road system.
The mathematical formula for predicting traffic flow applicable to the entire
road system is as follows:
$C_{i,t+1} = C_{i,t} + I_{i,t} - O_{i,t}, \qquad S_{i,t+1} = \frac{C_{i,t+1}}{L_i}, \qquad T_{i,t+1} = \frac{1}{S_{i,t+1}}$   (7)

where:
– $C_{i,t}$ is the number of vehicles moving on route $i$ at time $t$.
– $I_{i,t}$ is the total number of new vehicles entering route $i$ in the period from $t$ to $t+1$.
– $O_{i,t}$ is the total number of vehicles leaving route $i$ in the period from $t$ to $t+1$.
– $L_i$ is the length of route $i$.
– $S_{i,t+1}$ is the vehicle density on route $i$ at time $t+1$.
– $T_{i,t+1}$ is the travel time through route $i$ at time $t+1$.
This formula allows for calculating density and travel time across the entire route
and can be used to predict traffic flow across the entire road system. However,
to apply this formula, data on the number of vehicles moving on the routes,
traffic flow entering and leaving the routes, the length of the routes, and other
information about the road system are required.
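A small worked example of Eq. (7) for a single route, with invented values, is given below:

    # Eq. (7) on one route i with invented values
    C_t = 40.0    # vehicles on route i at time t
    I_t = 12.0    # vehicles entering between t and t+1
    O_t = 15.0    # vehicles leaving between t and t+1
    L_i = 1.5     # route length (km)

    C_next = C_t + I_t - O_t    # 37.0 vehicles at t+1
    S_next = C_next / L_i       # density: about 24.67 vehicles/km
    T_next = 1.0 / S_next       # travel-time proxy: about 0.041
    print(C_next, S_next, T_next)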

HAIVAN-ALSTM Model. Time series models encompass both linear and nonlinear relationships.
ARIMA is suitable for predicting linear relationships, while LSTM is apt for both linear and
nonlinear relationships [2]. To achieve better forecasting results, hybrid models based on the
principle of separately modeling the linear and nonlinear components of the time series have
been utilized. The concept of hybrid models aims to diversify the modeling in order to improve
forecasting results: the outcomes from hybrid models and those from individual models, although
not directly related, can reduce the error or the overall variance. For this reason, hybrid
models are among the most successful approaches in forecasting (Fig. 1).

Fig. 1. Proposed model combining ARIMA and LSTM methods (HAIVAN-ALSTM)

Time series forecasting models are often represented as the sum of linear and
nonlinear components as shown in Eq. 8.

$y_t = L_t + A_t$   (8)

where $A_t$ represents the linear component of the time series while $L_t$ represents the
nonlinear component. In the hybrid model, $A_t$ is predicted using the ARIMA model, and $L_t$
is predicted using the LSTM model. The error values are calculated according to formulas (9)
and (10):

$LS_{ss} = LS\_MEAN[ss]$   (9)

$ARI_{ss} = ARI\_MEAN[ss]$   (10)

Meanwhile, the weight values of each model are calculated according to formulas (11) and (12):

$LS_{ts} = \left(1 - \frac{LS_{ss}}{LS_{ss} + ARI_{ss}}\right) \times 2$   (11)

$ARI_{ts} = 2 - LS_{ts}$   (12)

The outcome of the hybrid model (HAIVAN-ALSTM) is calculated using formula (13):

$ALS_{dd}[i] = \frac{(ARI_{ts} \times ARI_{dd}[i]) + (LS_{ts} \times LS_{ss}[i])}{2}$   (13)
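The combination in formulas (9)–(13) can be sketched as follows (this is not the authors' implementation; the mean-error values and the per-interval forecasts are invented, and $ARI_{dd}[i]$ / $LS_{ss}[i]$ are interpreted as the ARIMA and LSTM forecasts for interval $i$):

    import numpy as np

    def haivan_alstm_combine(ari_pred, lstm_pred, ari_err_mean, lstm_err_mean):
        """Hybrid combination following Eqs. (9)-(13)."""
        LS_ss, ARI_ss = lstm_err_mean, ari_err_mean             # Eqs. (9)-(10)
        LS_ts = (1.0 - LS_ss / (LS_ss + ARI_ss)) * 2.0          # Eq. (11)
        ARI_ts = 2.0 - LS_ts                                    # Eq. (12)
        ari_pred, lstm_pred = np.asarray(ari_pred), np.asarray(lstm_pred)
        return (ARI_ts * ari_pred + LS_ts * lstm_pred) / 2.0    # Eq. (13)

    combined = haivan_alstm_combine(
        ari_pred=[118.0, 124.0, 131.0],   # invented ARIMA forecasts
        lstm_pred=[121.0, 126.0, 128.0],  # invented LSTM forecasts
        ari_err_mean=8.0,                 # placeholder mean error of the ARIMA part
        lstm_err_mean=7.0,                # placeholder mean error of the LSTM part
    )
    print(combined)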

4 Experiments and Results

To evaluate the individual contributions of ARIMA and LSTM components,


we conducted ablation studies. We tested the model using only ARIMA, only
LSTM, and the combined ARIMA-LSTM approach.

4.1 Data Collection

The author utilized a web and mobile application for collecting and storing online data.
From the application, both historical and current data of the camera nodes installed in the
surveillance camera system at 0511.vn can be tracked. The collected data, spanning 2017 to
May 2023, are stored for each monitored roadway across three collection time frames: 5–9 h,
9–12 h, and 13–17 h.
For predicting traffic flow, the author utilized 190,000 input images, dividing
the data into an 80-20 split (80% for training, equating to 152,000 images, and
20% for testing, equating to 38,000 images). The data in both the training and
testing sets represent all classes, which is particularly important for addressing
class imbalance. To ensure this, the author employed techniques like upsampling
and downsampling during the automatic data splitting process.

4.2 Evaluation

To assess the effectiveness of the proposed method, the author conducted exper-
iments on real data from the website 0511.vn, at a traffic junction in Da Nang
city, Vietnam. The data included information on traffic flow, congestion status,
details about special events, and other traffic-related information.
The model was compared with other currently popular models such as SAEs,
Random Forest, CNNs, LSTM based on criteria such as RMSE (Root Mean
Squared Error of the model), Accuracy (Ratio of correct predictions out of the
total samples), Precision (Ratio of true positive predictions to the total posi-
tive predictions (true positives + false positives)), Recall (Ratio of true positive
predictions to the total actual positives (true positives + false negatives)), and
F1-score (The harmonic mean of precision and recall). The results are shown in
Table 1.
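For reference, these criteria can be computed with scikit-learn; the labels and forecasts below are invented and only illustrate the calls, not the values reported in Table 1:

    import numpy as np
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, mean_squared_error)

    # invented congestion labels for criteria (2)-(5)
    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
    print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
          recall_score(y_true, y_pred), f1_score(y_true, y_pred))

    # invented continuous forecasts for criterion (1)
    flow_true = np.array([120.0, 135.0, 150.0])
    flow_pred = np.array([118.0, 140.0, 146.0])
    print(np.sqrt(mean_squared_error(flow_true, flow_pred)))   # RMSE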
From Table 1, it can be seen that the HAIVAN-ALSTM model achieves better or equivalent results
compared to the other models on most evaluation criteria, namely RMSE (1), accuracy (2),
precision (3), recall (4), and F1-score (5).

Table 1. Performance comparison table of the HAIVAN-ALSTM model against other


models

Model (1) (2) (3) (4) (5)


HAIVAN-ALSTM 7.82 0.94 0.89 0.88 0.88
Random Forest 8.72 0.85 0.82 0.78 0.80
CNNs 9.15 0.84 0.83 0.81 0.82
LSTM 7.92 0.93 0.88 0.84 0.86
SAEs 8.36 0.86 0.85 0.81 0.83

Experimental results show that the HAIVAN-ALSTM model achieves high


accuracy in predicting traffic flow at the studied traffic junction. The model can
predict future traffic conditions based on current and past information and can
be applied in reality to predict traffic flow.
Figure 2 illustrates the comparison between the actual traffic volume and the forecast results
from three different deep learning models over one day. The vertical axis (y-axis) represents
“Traffic Flow”, used to measure the number of vehicles. The horizontal axis (x-axis) shows
“Time”, representing traffic data by hour within a day (from 0 to 24 h). The green line
represents “Actual Data”, showing the recorded actual traffic volume. The orange line
represents the “HAIVAN-ALSTM results”, the blue line the “SAEs results”, and the red line the
“GRU results”, showing the forecast results of the respective models. According to the graph, it can be seen
that all three forecast models closely match the actual data, with a few minor
fluctuations. This indicates that the models can predict traffic volume with high
accuracy. The highest peak on the graph is around 6:00 to 7:30 AM, followed by a
smaller peak around 8:00 AM, possibly reflecting morning rush hour. Afterward,
the traffic volume gradually decreases and remains stable until the evening before
significantly dropping at the end of the day (Table 2).

Table 2. Performance comparison of three models HAIVAN-ALSTM, SAEs, and GRU.

Model          Measure  Next 15 min
HAIVAN-ALSTM   MAE      7.224
               MSE      100.947
               RMSE     10.047
SAEs           MAE      7.577
               MSE      107.234
               RMSE     10.355
GRU            MAE      7.376
               MSE      103.015
               RMSE     10.150

Fig. 2. Forecasting results of HAIVAN-ALSTM, SAEs, GRU (Color figure online)

The respective MAEs of the HAIVAN-ALSTM, SAEs, and GRU models are
(7.224, 7.577, 7.376). In this case, HAIVAN-ALSTM has the lowest MAE, indi-
cating it has the lowest average absolute error and therefore is the most accurate
model according to this metric.
Meanwhile, the MSE values for the HAIVAN-ALSTM, SAEs, and GRU mod-
els are (100.947, 107.234, 103.015) with HAIVAN-ALSTM having the lowest
MSE, indicating that HAIVAN-ALSTM is less affected by large errors. The
RMSE values for the HAIVAN-ALSTM, SAEs, and GRU models are (10.047,
10.355, 10.150), showing that HAIVAN-ALSTM has the lowest RMSE, indicat-
ing that HAIVAN-ALSTM has the smallest average error when considering the
distribution of errors.
Therefore, based on these metrics, HAIVAN-ALSTM is currently the highest-
performing model in forecasting, with the lowest error rates in all three mea-
surements. GRU is second, while SAEs perform the least effectively. However,
selecting the appropriate model should not only be based on these metrics but
also consider other factors such as model complexity, training time and resources
required, and the specific characteristics of the problem.
The results presented in Fig. 3(a) demonstrate that, following a 15-min interval, the observed
roadway segment exhibits low traffic density (free-flowing). In Fig. 3(c), the traffic density
is high due to red-light stops. In Fig. 3(b), the predicted traffic density at 5:15 AM is
expected to be at a medium level.

Fig. 3. Predicted traffic flow results

5 Conclusion

This paper introduced the HAIVAN-ALSTM traffic flow prediction model,


integrating ARIMA and LSTM methods to enhance prediction accuracy. The
ARIMA model effectively addresses the inherent unpredictability of traffic. The
proposed combined model was compared with other models, demonstrating supe-
rior performance and substantiating its efficacy. Future research could explore
the integration of real-time weather data and incident reports to further enhance
the accuracy of traffic flow predictions. Additionally, expanding the dataset to
include more diverse traffic conditions can improve the generalizability of the
model.

References
1. De Jong, G., Daly, A., Pieters, M., Miller, S., Plasmeijer, R., Hofman, F.: Uncer-
tainty in traffic forecasts: literature review and new results for The Netherlands.
Transportation 34(4), 375–395 (2007)
2. Ketu, S., Mishra, P.K.: A hybrid deep learning model for covid-19 prediction and
current status of clinical trials worldwide. Comput. Mater. Continua 66(2) (2021)
3. Laña, I., Del Ser, J., et al.: Measuring the confidence of traffic forecasting models:
techniques, experimental comparison and guidelines towards their actionability.
arXiv preprint arXiv:2210.16049 (2022)
4. Maas, T., Bloem, P.: Uncertainty intervals for graph-based spatio-temporal traffic
prediction. arXiv preprint arXiv:2012.05207 (2020)

5. Mallick, T., Balaprakash, P., Macfarlane, J.: Deep-ensemble-based uncertainty


quantification in spatiotemporal graph neural networks for traffic forecasting. arXiv
preprint arXiv:2204.01618 (2022)
6. Matas, A., Raymond, J.-L., Ruiz, A.: Traffic forecasts under uncertainty and capac-
ity constraints. Transportation 39, 1–17 (2012)
7. Polson, N.G., Sokolov, V.O.: Deep learning for short-term traffic flow prediction.
Transp. Res. Part C Emerg. Technol. 79, 1–17 (2017)
8. Rengasamy, D., Jafari, M., Rothwell, B., Chen, X., Figueredo, G.P.: Deep learning
with dynamically weighted loss function for sensor-based prognostics and health
management. Sensors 20(3), 723 (2020)
9. Tian, Y., Zhang, K., Li, J., Lin, X., Yang, B.: LSTM-based traffic flow prediction
with missing data. Neurocomputing 318, 297–305 (2018)
10. Ting, T.J.: Machine learning models for traffic flow prediction. Ph.D. thesis, Uni-
versity of Toronto (Canada) (2021)
11. Wu, D., et al.: Quantifying uncertainty in deep spatiotemporal forecasting. In:
Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery &
Data Mining, pp. 1841–1851 (2021)
12. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Philip, S.Y.: A comprehensive
survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 32(1),
4–24 (2020)
13. Yin, H., Wong, S.C., Xu, J., Wong, C.K.: Urban traffic flow prediction using a
fuzzy-neural approach. Transp. Res. Part C Emerg. Technol. 10(2), 85–98 (2002)
14. Yu, Y., Si, X., Hu, C., Zhang, J.: A review of recurrent neural networks: LSTM cells and
network architectures. Neural Comput. 31(7), 1235–1270 (2019)
Constraint Programming-Based Cutting
Plane Algorithm for a Combination
of Orienteering and Maximum Capture
Problem

Hoang Giang Pham1(B) , Tien Mai2 , and Minh Hoàng Hà3


1
ORLab, Faculty of Computer Science, Phenikaa University, Hanoi, Vietnam
[email protected]
2
School of Computing and Information Systems, Singapore Management University,
Singapore, Singapore
3
SLSCM and CADA, Faculty of Data Science and Artificial Intelligence,
National Economics University, Hanoi, Vietnam

Abstract. In this paper, we study a new variant of the orienteering problem (OP) where each
vertex in the OP tour is a facility within a competitive market context, in which customer
demand is predicted by a random utility choice model. Unlike prior research, which primarily
focuses on simple objective functions such as maximizing a linear sum of the scores of selected
vertices, we introduce a complicated nonlinear objective function that necessitates the
selection of locations maximizing a profit value such as expected customer demand or revenue.
In our study, the routing constraints included in the form of the OP are handled by Constraint
Programming (CP), and the nonlinear objective function, resulting from the utilization of
random utilities, is tackled by two types of valid cuts, namely outer-approximation and
submodular cuts. These lead to the development of an exact solution method, Cutting Plane,
where these valid cuts are iteratively added to a master problem. Extensive experiments are
conducted on problem instances of varying sizes, demonstrating that our approach excels in
terms of solution quality and computation time when compared to a baseline approach.

Keywords: Orienteering Problem · Maximum Capture Problem ·


Outer Approximation · Submodular Cuts · Constraint Programming

1 Introduction

The usual setting of the OP ([12]) involves each location being assigned a constant score and
visited at most once. The OP tour starts at a given departure point and visits a subset of
locations within a time limit. The objective of the OP is to maximize the total score of the
selected locations. However, it is not easy to measure the scores as predefined values in many
real-world applications because of uncertain dependent factors. As such, the score can be
estimated via a choice model, which is more practical. The variant proposed in this paper
replaces the scores by the result of a Maximum Capture Problem (MCP). The MCP is a
c The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025
W. Buntine et al. (Eds.): SOICT 2024, CCIS 2350, pp. 354–368, 2025.
https://doi.org/10.1007/978-981-96-4282-3_29

specific category of competitive facility location problems in which customer


demand is characterized and forecasted using a random utility maximization
(RUM) model ([3, 10, 15, 20, 21, 30]). In this setting, it is assumed that customers
make their choices among available facilities by maximizing their utility associ-
ated with each facility.
Several surveys, such as [8, 13, 29], present a comprehensive summary of the
OP and its variants. These surveys mention the Team OP, the OP with time
windows, the Time-dependent OP, the OP with hotel selection, the Stochastic
OP, the Generalized OP, the Arc OP, the Multi-agent OP, the Clustered OP,
the Correlated OP, and others. In addition to addressing the theoretical aspects,
the OP is also explored in various practical real-world applications (see [13]).
Currently, there is still ongoing interest in various variants and applications of
the OP within the research community. For instance, [1, 2, 5, 9] introduce the
Probabilistic OP, Set OP, Adaptive OP and Attractive OP, respectively. Some
extensions of the OP are also considered in the fields of robotics, autonomy, and
UAV, such as the Surviving Team OP [17, 18], the Dubins OP [26], the Close
Enough Dubins OP [7], the Dubins Correlated OP [28], the Physical OP [25],
and the Kinematic OP [22].
Our new problem holds significant relevance in practical scenarios; for exam-
ple, a chain of retail stores needs to design an efficient supply chain in which a
set of locations is chosen to establish new facilities/stores. The main objective
of the chain is to fulfill the customer demand as much as possible. However, the
chain manager also tries to keep the internal freight costs reasonable because
these costs are quite significant compared to the fixed installation cost of new
facilities.
The paper is organized as follows. Section 2 presents the problem formulation,
and Sect. 3 discusses cutting plane framework with outer-approximation and
submodular cuts. In Sect. 4, we provides numerical experiments, and finally,
Sect. 5 concludes.

2 Orienteering Problem with Random Utility


Maximization

Let us denote $[m] = \{1, 2, \ldots, m\}$ as the set of locations that can be used to locate
new facilities and $N$ as the set of customer zones/types. Let $q_n$ be the number of customers
in zone $n \in N$. Assume that $C$ represents the set of the competitor's facilities. Let
$T_{max}$ be the time budget for the OP tour and $t_{ij}$ be the travel time from location $i$
to location $j$. In our research, two tasks are handled at the same time: selecting a subset of
locations $S \subset [m]$, which satisfies the customer demand $(q_n)$ as much as possible, to
establish new facilities; and creating a tour that starts at a depot (0), visits several opened
facilities within $T_{max}$, and returns to the depot.
Fig. 1. Orienteering Problem with Random Utility Maximization.

Figure 1 visualizes the basic idea of our problem. In Fig. 1(a), there are 10 customer zones
($|N| = 10$) shown in yellow. Each zone is depicted as a circle, with its size reflecting the
number of customers $(q_n)$ (larger circles imply more customers). The black dots represent the
candidate locations ($m = 17$) for opening new facilities. The sets of customers in different
zones have different utilities associated with each location. For example, the utilities of
customers in zone A for nearer candidate locations are higher because they prefer a facility in
their neighborhood, while those in zone B are opposed to opening a new facility near their
houses; thus, their utilities are higher for farther locations. Assume that Fig. 1(b) shows an
optimal solution wherein 6 facilities ($|S| = 6$), which maximize the customers' utilities, are
opened. There is a new facility in zone A, while zone B does not contain any facility, based on
the customers' utilities. Specifically, a tour, denoted by black arrows, is constructed to visit
each of the new facilities once. The total length of this tour must be less than or equal to a
limit set by the decision-maker. This explains why only one facility can be opened in zone A.

2.1 Objective Function Under the Multinomial Logit (MNL) Model


In the classical OP problem, decision-makers aim to visit locations that maxi-
mize the total score. Our research answers the question of whether the demand
captured by the customers can replace that score in some real-world applica-
tions. The customers’ demands are difficult to evaluate and not deterministic in
practice. One of the common ideas is applying discrete choice models to predict
these values.
In the literature on discrete choice models, RUM ([27]) is the most widely used approach to
model discrete choice behavior. In the RUM framework, the probability that individual $n$
selects option $i \in S$ can be computed as $P(u_{ni} \ge u_{nj}, \forall j \in S)$, i.e., the
individual chooses the option of the highest utility, where the random utilities are defined as
$u_{ni} = v_{ni} + \epsilon_{ni}$, in which $v_{ni}$ is a deterministic part that can be modeled
based on characteristics of the alternative and/or the decision-maker, and $\epsilon_{ni}$ is
the random term which is unknown to the analyst. Under the MNL model, the choice probability
that the facility at location $i$ is chosen by an individual $n$ is

$P_n(i|S) = \frac{e^{v_{ni}}}{\sum_{j \in S} e^{v_{nj}}}$

Assume that a “newcomer” company wants to open new facilities in a market that already has
several competitors. Let $C$ represent the set of the competitor's facilities. The probability
that customers in zone $n \in N$ select the facility $i \in S$ can be computed as:

$P_n(i|S) = \frac{e^{v_{ni}}}{\sum_{c \in C} e^{v_{nc}} + \sum_{j \in S} e^{v_{nj}}}$

For notational simplicity, let $U_n^c = \sum_{i \in C} e^{v_{ni}}$ and $V_{ni} = e^{v_{ni}}$ for
all $n \in N, i \in [m]$. The expected customer demand captured over all the customer zones can
be computed as:

$f(S) = \sum_{n \in N} \sum_{i \in S} q_n P_n(i|S) = \sum_{n \in N} \frac{q_n \sum_{i \in S} V_{ni}}{U_n^c + \sum_{i \in S} V_{ni}}$

We rewrite the function $f(S)$ in binary representation as follows:

$f(x) = \sum_{n \in N} \frac{q_n \sum_{i \in [m]} V_{ni} x_i}{U_n^c + \sum_{i \in [m]} V_{ni} x_i}$

where $x \in \{0,1\}^m$ represents a subset $S \subset [m]$ via $x_i = 1$ if $i \in S$ and
$x_i = 0$ otherwise, for all $i \in [m]$.
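As a small numerical sketch (not part of the paper's code), the expected captured demand f(x) can be evaluated directly; the utilities, competitor terms, and zone sizes below are invented:

    import numpy as np

    def expected_captured_demand(x, V, U_c, q):
        """f(x) under the MNL model.
        x: binary vector over the m candidate locations
        V: matrix of V_ni = exp(v_ni), shape (|N|, m)
        U_c: competitor terms U_n^c, shape (|N|,)
        q: number of customers per zone, shape (|N|,)"""
        own = V @ x                              # sum_i V_ni * x_i for each zone n
        return float(np.sum(q * own / (U_c + own)))

    V = np.array([[1.0, 0.5, 0.2],
                  [0.3, 0.8, 0.6]])              # 2 zones, 3 candidate locations
    U_c = np.array([1.5, 1.0])
    q = np.array([100.0, 80.0])
    x = np.array([1, 0, 1])                      # open locations 1 and 3
    print(expected_captured_demand(x, V, U_c, q))   # about 82.3 captured customers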
It is worth noting that the MCP (without any constraints) is already known to be NP-hard, and
verifying whether there exists a tour which is created from a solution $S$ and satisfies the
time budget $T_{max}$ is also an NP-hard problem. Consequently, the combination of the OP and
the MCP emerges as a particularly challenging problem.

2.2 Constraint Programming Formulation


Our research investigates the applicability of CP to the new variant of the
OP. The CP, which focuses on identifying feasible solutions based on arbi-
trary constraints, has gained attention due to advancements in black-box solvers.
Although CP is not yet widely adopted in the OP community, its potential lies in
leveraging parallel computation for solving complex problems. In the literature,
CP was first used for the OP by [11]. In another study, [16] develops a multi-objective
evolutionary algorithm based on the CP model of [11]. Recently, the authors of [19] extended
the CP model of [11] to solve the Team OP with time windows and mandatory visits.
358 H. G. Pham et al.

Let $y$ be decision variables such that $y_{ij} = 1$ if there exists a visit from location $i$
to location $j$, and 0 otherwise, for all $i, j \in [m] \cup \{0\}$. To explore the potential of
CP on the problem, we introduce a formulation using the “Circuit” statement as follows:

$\max_{x,y} \; f(x) = \sum_{n \in N} \frac{q_n \sum_{i \in [m]} V_{ni} x_i}{U_n^c + \sum_{i \in [m]} V_{ni} x_i}$   (CP)

s.t.
$\mathrm{Circuit}(y_{ij};\; i, j \in [m] \cup \{0\})$   (1)
$\sum_{i \in [m]} y_{0i} = 1$   (2)
$y_{ii} = 1 \implies x_i = 0, \quad \forall i \in [m]$   (3)
$\sum_{i=0}^{m} \sum_{j=0}^{m} t_{ij} y_{ij} \le T_{max}$   (4)
$x \in \{0,1\}^m, \quad y \in \{0,1\}^{(m+1) \times (m+1)}$

The “Circuit” statement adds a circuit constraint from a sparse list of arcs that encode the
graph. A circuit is a unique Hamiltonian path in a subgraph of the total graph. In case a node
$i$ is not in the path, there must be a loop arc $i \to i$ associated with a true literal;
otherwise this constraint will fail. Constraint (1) ensures that the $y$ variables take values
that form a valid tour, possibly with self-loops for the variables associated with unvisited
locations. Constraint (2) ensures that the tour must include the depot (0), and constraints (3)
guarantee that if a location has a self-loop, it cannot be chosen to open a new facility.
Constraint (4) imposes the time budget restricting the OP tour.
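A minimal sketch of the routing part of (CP) in the CP-SAT solver of Google OR-Tools (the solver used in the experiments of Sect. 4) is given below; the travel times and time budget are invented, the nonlinear objective is omitted, and forbidding a depot self-loop is used here as a stand-in for constraint (2):

    import itertools
    from ortools.sat.python import cp_model

    def routing_skeleton(t, Tmax):
        """Constraints (1)-(4) of the (CP) formulation; node 0 is the depot."""
        n = len(t)
        model = cp_model.CpModel()
        x = [model.NewBoolVar(f"x_{i}") for i in range(n)]   # x_0 is unused
        y, arcs = {}, []
        for i, j in itertools.product(range(n), range(n)):
            if i == j == 0:
                continue          # no depot self-loop, so the depot is always visited
            lit = model.NewBoolVar(f"y_{i}_{j}")
            y[i, j] = lit
            arcs.append((i, j, lit))
            if i == j:
                model.AddImplication(lit, x[i].Not())        # constraint (3)
        model.AddCircuit(arcs)                               # constraint (1)
        model.Add(sum(t[i][j] * y[i, j]                      # constraint (4)
                      for (i, j) in y if i != j) <= Tmax)
        return model, x, y

    # invented integer travel times: depot plus three candidate locations
    t = [[0, 4, 6, 5],
         [4, 0, 3, 7],
         [6, 3, 0, 2],
         [5, 7, 2, 0]]
    model, x, y = routing_skeleton(t, Tmax=14)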
In the traditional OP, a path starts from a depot, ends at a given destination, which can be
different from the depot, and visits a subset of locations in $[m]$. Building upon previous
research ([13,29]), it is possible to represent the routing constraint by a set of linear
constraints formulated according to the well-known Miller-Tucker-Zemlin (MTZ) formulation
(see [23]). To adapt the requirement that the tour starts and ends at the depot, we extend the
set of locations $[m]$ to $M = \{0, 1, \ldots, m, m+1\}$, wherein locations $0$ and $m+1$ are
the depots. A mixed-integer nonlinear program (MINLP) for the problem is introduced as a
baseline to compare to the (CP):

$\max_{x,y,p} \; f(x) = \sum_{n \in N} \frac{q_n \sum_{i \in [m]} V_{ni} x_i}{U_n^c + \sum_{i \in [m]} V_{ni} x_i}$   (MINLP)

s.t. (4)
$\sum_{i=1}^{m+1} y_{0i} = \sum_{i=0}^{m} y_{i,m+1} = 1$   (5)
$\sum_{j=0}^{m} y_{ji} = \sum_{k=1}^{m+1} y_{ik} = x_i, \quad \forall i \in [m]$   (6)
$p_i - p_j + 1 \le m(1 - y_{ij}), \quad \forall i, j \in M$   (7)
$0 \le p_i \le m, \quad \forall i \in [m]$   (8)
$x \in \{0,1\}^m, \quad y \in \{0,1\}^{(m+2) \times (m+2)}, \quad p \in \mathbb{N}^{m+2}$

In the above formulation, constraints (5) and (6) establish the incoming and outgoing flow at
each location. Constraints (5) guarantee that there is exactly one arc starting from the depot
and exactly one arc ending at the depot. Constraints (6) ensure that if location $i$ is chosen
$(x_i = 1)$, there is exactly one incoming arc and one outgoing arc at $i$. Constraints (7) and
(8) are the MTZ subtour elimination constraints.

3 Solution Methods
3.1 The Mixed-Integer Linear Programming Reformulation
The (MINLP) cannot be directly solved by any off-the-shelf solver because its objective
function is neither concave nor convex. However, it can be linearized into a mixed-integer
linear programming (MILP) formulation, which can be handled by commercial solvers such as CPLEX
or Gurobi.

Let us first denote $s_n = \frac{1}{U_n^c + \sum_{i \in [m]} V_{ni} x_i}$ and $z_{ni} = x_i s_n$
for all $i \in [m], n \in N$, and then introduce the following inequalities to represent the
linear relation between $z_{ni}$, $x_i$ and $s_n$:

$z_{ni} \le s_n^U x_i, \quad \forall i \in [m], n \in N$   (9)
$z_{ni} \ge s_n^L x_i, \quad \forall i \in [m], n \in N$   (10)
$z_{ni} \le s_n + s_n^L (x_i - 1), \quad \forall i \in [m], n \in N$   (11)
$z_{ni} \ge s_n + s_n^U (x_i - 1), \quad \forall i \in [m], n \in N$   (12)
$\sum_{i \in [m]} V_{ni} z_{ni} + U_n^c s_n = 1, \quad \forall n \in N$   (13)

where $s_n^U = 1/U_n^c$ and $s_n^L = 1/(U_n^c + \sum_{i \in [m]} V_{ni})$ are upper and lower
bounds of $s_n$. Constraints (9)–(12) restrict the values of the variables $z_{ni}$, while
constraints (13) are obtained by combining $z_{ni} = x_i s_n$ and
$s_n = \frac{1}{U_n^c + \sum_{i \in [m]} V_{ni} x_i}$.

Thus, the MILP-based formulation of the problem can be presented as follows:

$\max_{x,y,s,z} \; \sum_{n \in N} \sum_{i \in [m]} q_n V_{ni} z_{ni}$   (MILP)

s.t. (4), (5), (6), (7), (8), (9), (10), (11), (12), (13)
$s \in \mathbb{R}_+^{|N|}, \quad z \in \mathbb{R}^{|N| \times m}, \quad x \in \{0,1\}^m, \quad y \in \{0,1\}^{(m+2) \times (m+2)}, \quad p \in \mathbb{N}^{m+2}$

3.2 Outer-Approximation and Submodular Cuts


To overcome the challenge of the nonlinear objective function formulated by the MNL model,
outer-approximation and submodular cuts have become a state-of-the-art approach, thanks to its
concavity and submodularity ([20,21]). The objective function can then be written as
$f(x) = \sum_{n \in N} \Psi_n(x)$, where:

$\Psi_n(x) = q_n - \frac{q_n U_n^c}{U_n^c + \sum_{i \in [m]} V_{ni} x_i}, \quad \forall n \in N$

It can be seen that each component $\frac{q_n U_n^c}{U_n^c + \sum_{i \in [m]} V_{ni} x_i}$ is
convex in $x$. Therefore, $\Psi_n(x)$ is concave in $x$ for any $n \in N$. Then the following
inequality holds for any solution $\overline{x} \in \{0,1\}^m$:

$\Psi_n(x) \le \nabla_x \Psi_n(\overline{x})^T (x - \overline{x}) + \Psi_n(\overline{x}), \quad \forall n \in N$   (14)

By letting $\Phi_n(x) = U_n^c + \sum_{i \in [m]} V_{ni} x_i$ for all $n \in N$, the right-hand
side of (14), denoted by $\gamma_n(x)$, can be presented as follows:

$\gamma_n(x) = \sum_{i \in [m]} \frac{q_n U_n^c V_{ni} (x_i - \overline{x}_i)}{\Phi_n(\overline{x})^2} + \Psi_n(\overline{x})$

Besides the concavity, it is known that the objective function $f(S)$ is submodular. Then,
$\Psi_n(S)$ is monotonically increasing and submodular in $S$. According to the properties of
submodular functions, the following inequalities hold for any $n \in N$, $S \subset [m]$,
$k \in [m]$:

$\Psi_n(S + k) \ge \Psi_n(S)$

and for any $S' \subset S$, $k \notin S$:

$\Psi_n(S' + k) - \Psi_n(S') \ge \Psi_n(S + k) - \Psi_n(S)$

where $S + k$ denotes the set $S \cup \{k\}$, for ease of notation.
For any $n \in N$, $S \subset [m]$ and $k \in [m] \setminus S$, let us define:

$\psi_{nk}(S) = \Psi_n(S + k) - \Psi_n(S) = \frac{q_n V_{nk}}{(U_n^c + \sum_{j \in S} V_{nj})(U_n^c + \sum_{j' \in S \cup k} V_{nj'})}$   (15)

The functions $\psi_{nk}(S)$ are often referred to as marginal gains, i.e., gains from adding
an item $k$ to the set $S$. The submodular properties imply that $\psi_{nk}(S) \ge 0$ for any
$S \subset [m]$ and $\psi_{nk}(S) \ge \psi_{nk}(S')$ for all $S \subset S' \subset [m]$. These
properties offer the following inequalities, which hold for any subsets $S, \overline{S} \subset [m]$ (see [24]):

$\Psi_n(S) \le \sum_{k \in S \setminus \overline{S}} \psi_{nk}(\overline{S}) - \sum_{k \in \overline{S} \setminus S} \psi_{nk}([m] - k) + \Psi_n(\overline{S})$

$\Psi_n(S) \le \sum_{k \in S \setminus \overline{S}} \psi_{nk}(\emptyset) - \sum_{k \in \overline{S} \setminus S} \psi_{nk}(\overline{S} - k) + \Psi_n(\overline{S})$

To transform the above inequalities into valid linear cuts, the binary representation of this
function can be presented as:

$\psi_{nk}(x) = \Psi_n(x + e_k) - \Psi_n(x)$

where $e_k$ is a vector of size $m$ with zero elements except the $k$-th element, which takes a
value of 1. The linear cuts can be deduced from the submodular inequalities as follows:

$\Psi_n(x) \le \sum_{k \in [m]} \psi_{nk}(\overline{x})(1 - \overline{x}_k) x_k - \sum_{k \in [m]} \psi_{nk}(e - e_k)\, \overline{x}_k (1 - x_k) + \Psi_n(\overline{x})$   (16)

$\Psi_n(x) \le \sum_{k \in [m]} \psi_{nk}(0)(1 - \overline{x}_k) x_k - \sum_{k \in [m]} \psi_{nk}(\overline{x} - e_k)\, \overline{x}_k (1 - x_k) + \Psi_n(\overline{x})$   (17)

where $e$ is an all-one vector of size $m$, $0$ is an all-zero vector of size $m$, and
$\overline{x}$ is a given solution of the MCP.
Because of Eq. (15), the right-hand sides of (16) and (17), denoted by $\alpha_n(x)$ and
$\beta_n(x)$, can be rewritten as follows:

$\alpha_n(x) = \sum_{k \in [m]} \frac{(1 - \overline{x}_k) q_n V_{nk}}{\Phi_n(\overline{x})\Phi_n(\overline{x} + e_k)} x_k - \sum_{k \in [m]} \frac{\overline{x}_k q_n V_{nk}}{\Phi_n(e)\Phi_n(e - e_k)} (1 - x_k) + \Psi_n(\overline{x})$

$\beta_n(x) = \sum_{k \in [m]} \frac{(1 - \overline{x}_k) q_n V_{nk}}{\Phi_n(0)\Phi_n(e_k)} x_k - \sum_{k \in [m]} \frac{\overline{x}_k q_n V_{nk}}{\Phi_n(\overline{x})\Phi_n(\overline{x} - e_k)} (1 - x_k) + \Psi_n(\overline{x})$
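The quantities entering these cuts are easy to evaluate numerically. The sketch below (not the authors' code) computes, for one zone n at a given point x̄, the value Ψ_n(x̄), the gradient terms used in γ_n(x), and the marginal gains ψ_nk(x̄) of Eq. (15); all data are invented:

    import numpy as np

    def cut_ingredients(x_bar, V_n, U_c_n, q_n):
        """Pieces of the outer-approximation cut (14) and submodular cuts (16)-(17)
        for a single zone n at the binary point x_bar."""
        Phi = U_c_n + V_n @ x_bar                     # Phi_n(x_bar)
        Psi = q_n - q_n * U_c_n / Phi                 # Psi_n(x_bar)
        grad = q_n * U_c_n * V_n / Phi**2             # gradient entries used in gamma_n(x)
        # marginal gains psi_nk(x_bar) for locations not yet selected, Eq. (15)
        gains = np.where(x_bar == 0, q_n * V_n / (Phi * (Phi + V_n)), 0.0)
        return Psi, grad, gains

    V_n = np.array([1.0, 0.5, 0.2, 0.8])    # invented V_nk for 4 candidate locations
    x_bar = np.array([1, 0, 0, 1])          # incumbent master solution
    Psi, grad, gains = cut_ingredients(x_bar, V_n, U_c_n=1.5, q_n=100.0)
    print(Psi, grad, gains)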

3.3 Cutting Plane Algorithm

In [20], the outer-approximation and submodular cuts are employed within a Branch-and-Cut
procedure, where achieving convergence is straightforward, to solve the facility location
problem under the MNL model. Their approach outperforms the concave programming method proposed
by [3] and the linearization technique of [14]. However, CP solvers do not provide any type of
callback function to access feasible solutions found during the solving process. Therefore, if
CP is used to handle the OP under the MNL model, the outer-approximation and submodular cuts
must be added via a cutting plane framework.

Some prior studies have demonstrated that a cutting plane method incor-
porating outer-approximation cuts can yield an optimal solution after a finite
number of iterations (see [4, 6]). The seminal work of [24] also demonstrates
that the submodular maximization problem can be reformulated equivalently
as a MILP whose constraints are submodular cuts generated at every point
within the binary domain. These allow us to sequentially generate the outer-
approximation and submodular cuts and add them to a master problem that is
solved iteratively until an optimal solution is obtained.
For any feasible solution $\overline{x}$, the first master formulation, created by replacing
the nonlinear objective function $f(x)$ of the (CP) with a linear function
$g(\theta) = \sum_{n \in N} \theta_n$, where each $\theta_n$ is a non-negative variable whose
value equals $\Psi_n(x)$, can be written as follows:

$\max_{x,y,\theta} \; g(\theta) = \sum_{n \in N} \theta_n$   (Master – CP)

s.t. (1), (2), (3), (4)
$\theta_n \le \alpha_n(x), \quad \forall n \in N$   (18)
$\theta_n \le \beta_n(x), \quad \forall n \in N$   (19)
$\theta_n \le \gamma_n(x), \quad \forall n \in N$   (20)
$\theta_n \ge 0, \quad \forall n \in N$   (21)
$x \in \{0,1\}^m, \quad y \in \{0,1\}^{(m+1) \times (m+1)}$

Similarly, the second master formulation, denoted by (Master − MINLP), can be proposed by using
the objective function of the (Master – CP) and constraints (4), (5), (6), (7), (8), (18), (19),
(20), (21) with $x \in \{0,1\}^m$, $y \in \{0,1\}^{(m+2) \times (m+2)}$, $p \in \mathbb{N}^{m+2}$.

Algorithm 1: Cutting Plane Algorithm

1  Set $\epsilon$ to 0.0001 as the optimality gap and $x = 0$
2  Build the (Master – CP)/(Master − MINLP)
3  do
4      Solve the (Master – CP)/(Master − MINLP) to get a new solution $(x, y, \theta)$
5      Add outer-approximation and submodular cuts (18), (19), (20) associated with the new $x$
6  until $f(x) - \epsilon \le g(\theta)$
7  Return $(x, y)$ as an optimal solution

The cutting plane framework using outer-approximation and submodular cuts is presented in
Algorithm 1. The parameter $\epsilon$, which controls the gap between the original objective
$f(x)$ and its approximation $g(\theta)$, is set to 0.0001. The trivial solution $x = 0$ is used
to generate the first three sets of $|N|$ valid cuts (18), (19), (20). In the loop from line 3
to line 6 of Algorithm 1, the master problem is iteratively solved, and new cuts generated from
each solution found are added. This process ends when the value of the linear approximation
$g(\theta)$ calculated from $\theta$ is extremely close to the value of the MNL function $f(x)$
provided by $x$.
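A compact skeleton of this loop is sketched below; `master`, `f`, and `add_cuts` are hypothetical interfaces standing in for the (Master – CP)/(Master − MINLP) model, the MNL objective, and the cut generation of Sect. 3.2:

    import numpy as np

    def cutting_plane(master, f, add_cuts, eps=1e-4):
        """Skeleton of Algorithm 1 (illustration only; interfaces are hypothetical)."""
        x = np.zeros(master.num_locations)      # line 1: trivial starting solution
        add_cuts(master, x)                     # first |N| cuts of types (18)-(20)
        while True:
            x, y, theta = master.solve()        # line 4: solve the master problem
            add_cuts(master, x)                 # line 5: cuts generated at the new x
            if f(x) - eps <= np.sum(theta):     # line 6: g(theta) has caught up with f(x)
                return x, y                     # line 7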

4 Experiment Results
Since there are no benchmark instances available in the literature for the proposed problem,
we generate a new set of benchmark instances to test the performance of the methods. We reuse
the OP instances provided by [1] (available at: https://or-brescia.unibs.it/instances), from
which we take the coordinates of locations, the distance matrix between any two vertices, and
the value of $T_{max}$. To create
which are widely used in prior MCP studies. The set ORlib contains 15 instances
with the number of zones varying in {50, 100, 200, 400, 800} and the number
of locations in {25, 50, 100}, while HM14 includes 3 instances with 1000 zones
and 100 locations. The last set NYC has a single instance coming from a large-
scale Park-and-Ride location problem in New York City, with 82,341 customer
zones and 58 locations. The number of locations in the MCP data is modified to
match the number of vertices in the OP data. For example, 14 locations and their
corresponding utilities are randomly sampled from the ORlib instance of 25, 50,
or 100 locations, and then they are combined with the coordinates of vertices in
the OP instance burma14 containing 14 vertices to create a new instance.
A 64-bit Windows machine, with AMD Ryzen 7 3700X 8-Core Proces-
sor, 3.60 GHz, and 16 GB of RAM, is used to run the experiments. All mod-
els are implemented in C++. We link to CPLEX to formulate the (MILP) model and to design the
cutting plane algorithm for the (Master − MINLP) formulation. The CP-SAT solver of Google
OR-Tools (https://developers.google.com/optimization/) is used for implementing the
(Master – CP) embedded in the cutting plane. Note that CP solvers significantly outperform MILP
solvers
when it comes to leveraging multi-threaded computation. This advantage arises
from the distinct solving routines employed by CP solvers. Moreover, modern
home computers and laptops now feature multi-core architectures, making paral-
lel computation accessible. Therefore, we evaluate how multi-threading impacts solver
performance when tackling the OP under the MNL model. We use 8 threads with a run-time limit of
1 h for each instance.
Fig. 2. The results of the (MILP) model and the cutting plane algorithm on the ORlib dataset (Color figure online)

Fig. 3. The results of the (MILP) model and the cutting plane algorithm on the HM14 dataset (Color figure online)

Fig. 4. The results of the (MILP) model and the cutting plane algorithm on the NYC dataset (Color figure online)

Figures 2, 3 and 4 visualize the comparison between the (MILP) formulation and the formulations
embedded into the cutting plane framework. Overall, the cutting plane outperforms the (MILP) in
terms of both the number of optimal solutions found and the average runtime. The master problem
(Master – CP) (yellow line) provides the most optimal solutions, followed by the
(Master − MINLP) (red line), on all datasets. The average runtime of each OP instance is
calculated from the runtimes of all related MCP instances, covering both optimal and non-optimal
solutions. It can be seen that the average runtime of the cutting plane algorithm with the
(Master – CP) is far lower than that of the others. While the average runtime of the
(Master − MINLP) or (MILP) fluctuates wildly, that of the (Master – CP) increases slowly until
it reaches the OP instances with over 70 vertices in Figs. 2 and 3 and over 29 vertices in
Fig. 4. The runtimes of the (Master − MINLP) and (MILP) are competitive on small OP instances
with utilities taken from the ORlib and NYC datasets, while the (Master − MINLP) outperforms the
(MILP) on the OP instances containing more than 42 vertices and on all OP instances with
utilities from the NYC dataset.
The detailed results of the cutting plane framework are shown in Tables 1, 2 and 3. In each
table, the first two columns present the OP instance's name and the number of instances created
by combining an OP instance, ranging from 14 to 100 vertices, with an MCP instance. The third
and fourth columns show the number of optimal solutions found by the cutting plane with the
(Master − MINLP) as master problem and the average runtime calculated by taking into account
only those instances solved to optimality. Similarly, the last two columns report the results of
the (Master – CP). Best values (the largest number of optimal solutions, or the lowest average
computing time) are shown in bold.

Table 1. The results of the cutting plane algorithm on the ORlib dataset

ORlib          (Master − MINLP)       (Master – CP)
OP         #Instances  #Optimal  Time(s)   #Optimal  Time(s)
burma14    45          45        1.44      45        0.88
ulysses16  45          44        101.55    45        1.44
gr17       45          45        13.72     45        2.03
gr21       45          45        24.92     45        4.09
ulysses22  45          34        1351.33   45        5.11
gr24       45          44        684.14    45        6.90
fri26      30          9         3087.12   30        10.04
bays29     30          30        121.90    30        9.54
dantzig42  30          0         -         30        39.69
swiss42    30          29        485.80    30        28.81
att48      30          3         3431.18   30        54.32
gr48       30          14        2638.57   30        57.78
hk48       30          3         3418.90   30        88.16
eil51      15          13        1249.77   15        66.60
berlin52   15          14        1048.20   15        64.65
brazil58   15          1         3479.18   15        184.07
st70       15          0         -         15        458.61
eil76      15          2         3293.42   15        330.69
pr76       15          0         -         12        1155.54
gr96       15          0         -         14        1075.44
rat99      15          0         -         10        2225.71

Table 2. The results of the cutting plane algorithm on the HM14 dataset

HM14           (Master − MINLP)       (Master – CP)
OP         #Instances  #Optimal  Time(s)   #Optimal  Time(s)
burma14    9           9         0.93      9         3.76
ulysses16  9           9         18.38     9         11.75
gr17       9           9         5.62      9         7.64
gr21       9           9         22.33     9         21.21
ulysses22  9           5         2297.35   9         37.89
gr24       9           9         765.40    9         56.66
fri26      9           2         3302.69   9         41.76
bays29     9           9         63.56     9         47.22
dantzig42  9           0         -         9         237.01
swiss42    9           9         312.47    9         132.42
att48      9           4         -         9         202.55
gr48       9           4         2542.17   9         208.24
hk48       9           1         3457.62   9         207.44
eil51      9           9         5.00      9         180.08
berlin52   9           8         1030.15   9         242.03
brazil58   9           3         2555.73   9         446.76
st70       9           0         -         9         1088.84
eil76      9           1         3233.93   9         435.35
pr76       9           0         -         6         1736.46
gr96       9           0         -         9         1490.20
rat99      9           0         -         6         2519.48
kroA       9           0         -         9         739.41
kroB       9           0         -         6         2394.47
kroC       9           0         -         9         1311.41

From the tables, we see that the (Master – CP) outperforms the (Master − MINLP) in terms of both
the number of optimal solutions and the average runtime. The MTZ subtour elimination constraints
show their disadvantage when tackling instances with a medium or large number of locations,
while the "Circuit" statement can handle them effectively. For example, the (Master − MINLP)
solves to optimality all instances only up to 21 locations with utilities taken from the ORlib
dataset, while the (Master – CP) can handle instances of up to 76 locations. The same phenomenon
occurs when solving instances related to the HM14 dataset. In the case of the NYC dataset, which
contains 82,341 customer zones and is much larger in the number of zones than ORlib and HM14,
the (Master – CP) successfully handles instances of up to 48 locations, while the
(Master − MINLP) can only solve small instances with 14, 16, 17, 21, 22 and 29 locations.

Table 3. The results of cutting plane algorithm on the NYC dataset

NYC (Master − MINLP) (Master – CP)


OP #Instances #Optimal Time(s) #Optimal Time(s)
burma14 3 3 2.24 3 13.32
ulysses16 3 3 6.17 3 19.59
gr17 3 3 8.80 3 18.84
gr21 3 3 9.80 3 36.73
ulysses22 3 3 234.07 3 58.50
gr24 3 2 1518.24 3 180.98
fri26 3 1 2478.12 3 207.74
bays29 3 3 713.36 3 295.80
dantzig42 3 1 3309.45 3 1660.66
swiss42 3 3 440.55 3 186.17
att48 3 0 - 1 2867.81
gr48 3 0 - 1 2534.47
hk48 3 2 2872.97 2 2054.46
eil51 3 1 2914.54 2 2059.42
berlin52 3 2 2382.65 3 1447.88
brazil58 3 0 - 0 -

5 Conclusion

In this paper, we have considered a new variant of the OP, where the score of each location is
determined by customer behavior modeled by an MNL discrete choice model. To address this
challenging problem, we have explored two types of cuts, outer-approximation and submodular
cuts, to handle the nonlinear objective function, and constraint programming to tackle the
routing part. Experiments conducted on instances of various sizes demonstrate the superiority of
our cutting plane approaches, which can solve to optimality instances with a very large number
of customer zones. Future work will be dedicated to incorporating more complex and practical
settings for the routing, such as time windows, mandatory visits, and multiple vehicles
(Team OP), or to exploring more advanced choice models, such as nested or cross-nested models.

Acknowledgment. This research was funded by Vingroup Innovation Foundation


(VINIF), Vietnam under project code VINIF.2024.DA072.

References
1. Angelelli, E., Archetti, C., Filippi, C., Vindigni, M.: The probabilistic orienteering
problem. Comput. Oper. Res. 81, 269–281 (2017)
2. Archetti, C., Carrabs, F., Cerulli, R.: The set orienteering problem. Eur. J. Oper.
Res. 267(1), 264–272 (2018)
3. Benati, S., Hansen, P.: The maximum capture problem with random utilities: Prob-
lem formulation and algorithms. Eur. J. Oper. Res. 143(3), 518–530 (2002)
4. Bonami, P., et al.: An algorithmic framework for convex mixed integer nonlinear
programs. Discret. Optim. 5(2), 186–204 (2008)
5. Dolinskaya, I., Shi, Z.E., Smilowitz, K.: Adaptive orienteering problem with
stochastic travel times. Transport. Res. Part E: Logist. Transport. Rev. 109, 1–19
(2018)
6. Duran, M.A., Grossmann, I.E.: An outer-approximation algorithm for a class of mixed-integer
nonlinear programs. Math. Program. 36, 307–339 (1986)
7. Faigl, J., Pěnička, R.: On close enough orienteering problem with dubins vehicle.
In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),
pp. 5646–5652 (2017)
8. Feillet, D., Dejax, P., Gendreau, M.: Traveling salesman problems with profits.
Transp. Sci. 39(2), 188–205 (2005)
9. Freeman, N.K., Keskin, B.B., Çapar, İ.: Attractive orienteering problem with proximity and
timing interactions. Eur. J. Oper. Res. 266(1), 354–370 (2018)
10. Freire, A., Moreno, E., Yushimito, W.: A branch-and-bound algorithm for the
maximum capture problem with random utilities. Eur. J. Oper. Res. 252(1), 204–
212 (2016)
11. Gedik, R., Kirac, E., Bennet Milburn, A., Rainwater, C.: A constraint programming
approach for the team orienteering problem with time windows. Comput. Indust.
Eng. 107, 178–195 (2017)
12. Golden, B.L., Levy, L., Vohra, R.: The orienteering problem. Naval Res. Logist.
(NRL) 34(3), 307–318 (1987)
13. Gunawan, A., Lau, H.C., Vansteenwegen, P.: Orienteering problem: a survey of
recent variants, solution approaches and applications. Eur. J. Oper. Res. 255(2),
315–332 (2016)
14. Haase, K.: Discrete location planning (2009). http://hdl.handle.net/2123/19420
15. Haase, K., Müller, S.: A comparison of linear reformulations for multinomial logit
choice probabilities in facility location models. Eur. J. Oper. Res. 232(3), 689–691
(2014)
16. Hu, W., Fathi, M., Pardalos, P.M.: A multi-objective evolutionary algorithm based
on decomposition and constraint programming for the multi-objective team orien-
teering problem with time windows. Appl. Soft Comput. 73, 383–393 (2018)
17. Jorgensen, S., Chen, R.H., Milam, M.B., Pavone, M.: The matroid team surviving
orienteers problem: Constrained routing of heterogeneous teams with risky traver-
sal. In: IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), pp. 5622–5629 (2017)
18. Jorgensen, S., Chen, R.H., Milam, M.B., Pavone, M.: The team surviving orien-
teers problem: routing teams of robots in uncertain environments with survival
constraints. Auton. Robots 42(4), 927–952 (2018)
19. Kirac, E., Gedik, R., Oztanriseven, F.: Solving the team orienteering problem with
time windows and mandatory visits using a constraint programming approach. Int.
J. Oper. Res. 46, 20–42 (2023)
368 H. G. Pham et al.

20. Ljubić, I., Moreno, E.: Outer approximation and submodular cuts for maximum
capture facility location problems with random utilities. Eur. J. Oper. Res. 266(1),
46–56 (2018)
21. Mai, T., Lodi, A.: A multicut outer-approximation approach for competitive facility
location under random utilities. Eur. J. Oper. Res. 284(3), 874–881 (2020)
22. Meyer, F., Glock, K.: Kinematic orienteering problem with time-optimal trajecto-
ries for multirotor UAVs. IEEE Robot. Autom. Lett. 7(4), 11402–11409 (2022)
23. Miller, C.E., Tucker, A.W., Zemlin, R.A.: Integer programming formulation of
traveling salesman problems. J. ACM 7(4), 326–329 (1960)
24. Nemhauser, G.L., Wolsey, L.A.: Maximizing submodular set functions: formula-
tions and analysis of algorithms. In: North-Holland Mathematics Studies, vol. 59,
pp. 279–301. Elsevier (1981)
25. Pěnička, R., Faigl, J., Saska, M.: Physical orienteering problem for unmanned
aerial vehicle data collection planning in environments with obstacles. IEEE Robot.
Autom. Lett. 4(3), 3005–3012 (2019)
26. Pěnička, R., Faigl, J., Váňa, P., Saska, M.: Dubins orienteering problem. IEEE
Robot. Autom. Lett. 2(2), 1210–1217 (2017)
27. Train, K.: Discrete Choice Methods with Simulation. Cambridge University Press
(2003)
28. Tsiogkas, N., Lane, D.M.: Dcop: Dubins correlated orienteering problem optimizing
sensing missions of a nonholonomic vehicle under budget constraints. IEEE Robot.
Autom. Lett. 3(4), 2926–2933 (2018)
29. Vansteenwegen, P., Souffriau, W., Oudheusden, D.V.: The orienteering problem: a
survey. Eur. J. Oper. Res. 209(1), 1–10 (2011)
30. Zhang, Y., Berman, O., Verter, V.: The impact of client choice on preventive
healthcare facility network design. OR Spectrum 34, 349-370 (04 2012)
Operations Research
Cost Optimization in Competitive Facility Location Under General Demand Model

Ba Luat Le, Thuy Anh Ta(B), and Hoang Giang Pham

ORLab, Faculty of Computer Science, Phenikaa University, Hanoi, Vietnam
[email protected]

Abstract. This work addresses a cost optimization problem in facility


location where customer demand is modeled using the cross-nested logit
model, one of the most flexible demand models in the literature. The
objective is to maximize a captured demand function by allocating a
fixed investment budget across a set of facilities, where the investment
directly influences the demand captured by each facility. The resulting
optimization problem involves exponential and fractional terms, leading
to a highly nonlinear structure. To the best of our knowledge, no existing
methods can solve this problem to near-optimality. To address this, we
propose a piecewise linear approximation technique and apply variable
transformations to approximate the problem (to any desired precision)
as a mixed-integer convex program, which can be solved to optimality
using an outer-approximation method. Extensive experiments on gen-
erated instances of varying sizes demonstrate the effectiveness of our
proposed approach compared to standard baselines.

Keywords: Cost optimization · Facility Location · Cross-nested logit · Piecewise linear approximation · Outer-Approximation

1 Introduction
Competitive facility location (CFL) has been a key decision-making problem in
modern transportation and logistics, typically involving the selection of opti-
mal locations and the allocation of financial resources for establishing new facil-
ities. The primary goal is to either maximize profit (e.g., by capturing customer
demand or increasing revenue) or minimize costs (e.g., operational or transporta-
tion expenses). In this study, we address a cost optimization in facility loca-
tion problem within a competitive market, focusing on models that utilize dis-
crete choice frameworks like the random utility maximization (RUM) approach
[7, 24, 31] to predict customer behavior. In this setting, customers are assumed
to choose facilities by maximizing their perceived utility, which depends on facil-
ity features (e.g., service quality or transport costs) and customer characteristics
(e.g., age, income, gender). The RUM models are widely used and effective for pre-
dicting human choices in transportation-related contexts [5, 28]. In the context of
facility cost optimization, prior research has primarily relied on the classical multi-
nomial logit (MNL) model [14], which is one of the most popular choice models in
demand modeling. However, the MNL model has a significant limitation known as
the Independence of Irrelevant Alternatives (IIA) property, which assumes that
the ratio of choice probabilities between two facilities remains unaffected by the
presence of other facilities. This assumption often fails in real-world scenarios,
reducing the model’s effectiveness in accurately capturing customer behavior. To
overcome this limitation, several studies have proposed more advanced models,
such as the nested logit (NL) [3, 4], Generalized Extreme Value (GEV) [26], and
mixed-logit models (MMNL) [29]. For example, the NL model groups locations
into distinct subsets (nests), relaxing the IIA property for choices within the same
nest, though it still applies to alternatives from different nests. In this paper,
we examine the cost optimization in facility location under the cross-nested logit
(CNL) model [33], one of the most flexible and general RUM models. The CNL
model extends the NL model by allowing facilities to belong to multiple overlap-
ping nests, thereby fully relaxing the IIA property. This flexibility enables the
CNL model to approximate any RUM model with high precision [8, 17]. To the best of our knowledge, this is the first study to apply such a general model to the cost optimization problem in facility location.
The main challenge in the facility cost optimization problem is allocating a fixed
budget across predetermined locations to maximize customer demand. Higher
investment in a facility increases its attractiveness and utility to customers. Open-
ing and operating costs directly impact a facility’s ability to provide services and
attract demand. Our goal is to optimize how the budget is allocated to each facil-
ity, assuming the potential locations have already been selected. We model cus-
tomer behavior as a function of costs, where a higher budget leads to better service
capacity and a greater likelihood of being chosen. The problem then can be for-
mulated as a nonlinear optimization problem with continuous variables. Previous
studies show that this problem is highly non-convex with multiple local optima
even with only one nest [13]. To address this, we use a piecewise linear approxi-
mation and then reformulate the approximation problem as a mixed-integer non-
linear convex program, solvable using an outer-approximation method. We also
apply a projected gradient descent (PGD) algorithm with adaptive step size for
the original problem to evaluate the performance of our approach.
Paper Outline: The paper is organized as follows. Section 2 provides a litera-
ture review. Section 3 discusses the problem formulation. Section 4 presents our
solution methods. Section 5 presents our experimental results, and finally, Sect. 6
concludes the paper.
Notation: Boldface characters represent matrices (or vectors), and $a_i$ denotes the $i$-th element of vector $\mathbf{a}$ if it is indexable. We use $[m]$, for any $m \in \mathbb{N}$, to denote the set $\{1, \ldots, m\}$.

2 Literature Review
The RUM framework has been extensively studied since the 1980s and applied
in numerous fields [25, 27, 31]. In facility location cost optimization under RUM

models, most studies use the MNL model to estimate customer demand. For
example, [14] was the first to propose a cost optimization in CFL problem under
the MNL model, utilizing a CONIC and Mixed-Integer Linear Programming
(MILP) reformulation for a piecewise approximation problem. They also propose
an outer-approximation approach to solve the approximation problem.
Regarding the CNL customer choice model, the model has been primarily
introduced and explored within the transportation research community [6, 30,
32]. [8] provided the theoretical groundwork for the CNL model, demonstrating
its inclusion in the GEV family [26] and introducing a novel estimation method
using nonlinear programming, as opposed to earlier heuristic approaches. The
CNL model is highly flexible, capable of approximating any RUM model with
arbitrary precision [17], as well as the general ranking preference model [1, 21]. Its
versatility has led to successful applications in various transportation problems,
including mode choice [34], departure time choice [11], route choice and revenue
management [19, 22, 23], air travel management [12], and more recently, location
choice in international migration [2]. The CNL model consistently outperforms
other choice models, such as the MNL and NL, in demand modeling [2, 11].
To the best of our knowledge, this study is the first to apply the CNL model
to solve the cost optimization problem in CFL. Our work is closely related to
the work of [14], which addressed cost optimization in CFL using the MNL
model. Their approach guarantees near-optimal solutions through piecewise lin-
ear approximation, which can be easily reformulated into a MILP or CONIC
model after the approximation procedure. However, it still theoretically requires
solving problems of infinite size to achieve an exact solution, as the problem size
depends on the precision of the solution.

3 Problem Statement and Formulation

In this study, we examine cost optimization in the CFL problem, where a “new-
comer” company seeks to enter a market already dominated by an established
competitor. The company has already planned the locations for opening facili-
ties, and it aims to allocate its budget across its stores to attract customers away
from the existing supermarkets. The main goal is to capture a portion of the mar-
ket share by drawing customers to the new facilities, achieved by optimizing the
investment in each location to maximize expected customer demand.
First, we need to consider the attraction of facilities to customer demand. In
real-world scenarios, estimating customer demand is challenging and inherently
uncertain. To address this, we explore the CFL problem using discrete choice
models, as described in [31], to estimate and predict customer behavior. Among
the various demand modeling approaches, the RUM framework [31] is the most
commonly used to model discrete choice behavior. This method is based on
the theory that a decision-maker’s preference for an option is represented by a
random utility, leading customers to choose the option that offers the highest
utility. According to the RUM framework [16, 26], the probability of individual .t
choosing an option $i$ from a given choice set $S$ is determined by $P(u_{ti} \ge u_{tj}, \forall j \in S)$, indicating that the individual will opt for the alternative with the greatest utility. The random utility for a given location $i$ is defined as $u_{ti} = v_{ti} + \epsilon_{ti}$, where $v_{ti}$ represents the deterministic component derived from observable factors, and $\epsilon_{ti}$ represents the random component that is unobservable.
In this work, we adopt the CNL model, which is renowned for its flexibility
and general applicability within RUM frameworks. To formulate the problem,
let $[m] = \{1, 2, \ldots, m\}$ represent the set of all available locations to be invested in, and let $[T] = \{1, \ldots, T\}$ represent customer types, which can be defined by factors such as geography, age, income, or gender. Additionally, let $C$ denote the set of existing facilities owned by competitors. Under the CNL model, these locations can be assigned to different subsets (or "nests"). For each customer type $t \in [T]$, we assume that $[m] \cup C$ can be assigned to $N$ nests, denoted as $N_1^t, N_2^t, \ldots, N_N^t$, based on shared attributes or characteristics. Importantly, these nests are not necessarily disjoint, meaning each facility/location $i$ can belong to multiple nests. We also use a non-negative quantity $\alpha_{in}^t$ to capture the level of membership of location $i$ in nest $N_n^t$, where $\sum_{n \in [N]} \alpha_{in}^t = 1$ for all $i \in [m]$ and $\alpha_{in}^t = 0$ if facility $i$ is not in nest $N_n^t$.
Let $v_{ti}$ be the deterministic utility for location $i \in [m] \cup C$ and customer type $t \in [T]$, which can be modeled as a function of both customer and location characteristics. The parameters of this utility function can be estimated through choice model inference [31]. Specifically, the deterministic utility for customers from zone $t$ choosing location $i$ is defined as $v_{ti} = a_{ti} x_i + b_{ti}$ for all $t \in [T], i \in [m]$, where $a_{ti}$ represents the sensitivity of customers in zone $t$ to the cost invested in facility $i$, $x_i$ is the investment in facility $i$, and $b_{ti}$ accounts for other factors affecting customer choice, such as distance or parking availability. The parameter $a_{ti}$ plays a critical role in our model, as it captures the impact of investment on customer attraction based on proximity. If a customer in zone $t$ is far from facility $i$, the investment in that facility will have a limited effect on attracting that customer. On the other hand, if facility $i$ is close to zone $t$, increasing the budget for that facility will significantly boost its appeal. Therefore, higher investment $x_i$ generally increases the attractiveness of facility $i$, and thus $a_{ti}$ is expected to be positive.
The choice process in the CNL model is conceptualized as a two-stage process where a customer first selects a nest and then chooses a location or facility within that nest. Let $S \subseteq [m]$ represent the set of available locations; the probability that a customer selects a nest $N_n^t$, for any $n \in [N]$, is given by

$$P(N_n^t \mid S) = \frac{W_{tn}^{\sigma_{tn}}}{\sum_{n' \in [N]} W_{tn'}^{\sigma_{tn'}}},$$

where $W_{tn} = \sum_{i \in N_n^t \cap (S \cup C)} \alpha_{in}^t e^{v_{ti}/\sigma_{tn}}$ is the total utility of alternatives in nest $N_n^t$, and $\sigma_{tn}$ is the dissimilarity parameter, typically assumed to vary within the unit interval to maintain consistency with the RUM framework [8, 26].

In the second stage, the customer selects a facility $i \in S$ from the chosen nest $N_n^t$ with probability

$$P(i \mid N_n^t) = \frac{\alpha_{in}^t e^{v_{ti}/\sigma_{tn}}}{W_{tn}}, \quad \forall i \in S.$$

Thus, the overall probability of a customer of type $t \in [T]$ selecting a facility $i \in S$ can be computed as

$$P_t(i \mid S) = \sum_{n \in [N]} P(N_n^t \mid S)\, P(i \mid N_n^t) = \sum_{n \in [N]} \frac{W_{tn}^{\sigma_{tn}}}{\sum_{n' \in [N]} W_{tn'}^{\sigma_{tn'}}} \times \frac{\alpha_{in}^t e^{v_{ti}/\sigma_{tn}}}{W_{tn}} = \frac{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}-1} \alpha_{in}^t e^{v_{ti}/\sigma_{tn}}}{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}}}, \quad \forall i \in S.$$

By applying the law of total expectation and replacing $v_{ti}$ with the utility function $v_{ti} = a_{ti} x_i + b_{ti}$, the expected captured market share given by the set of available locations $S$ is expressed as

$$F(\mathbf{x}, S) = \sum_{t \in [T]} q_t \, \frac{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}-1} \big( \sum_{i \in S} \alpha_{in}^t e^{(a_{ti} x_i + b_{ti})/\sigma_{tn}} \big)}{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}}},$$

where $q_t$ is the total demand of customers of type $t$, and $\mathbf{x} = (x_1, x_2, \ldots, x_m) \in \mathbb{R}_+^m$ represents the investment costs for the available locations. This problem is commonly referred to as the Maximum Capture Problem (MCP) [7]. Since the list of new facilities to be opened has already been determined, it is without loss of generality to assume $S = [m]$. Then, we can formulate the cost optimization as the following nonlinear program:
$$\max_{\mathbf{x}} \; \Bigg\{ F(\mathbf{x}) = \sum_{t \in [T]} q_t \, \frac{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}-1} \big( \sum_{i \in [m]} \alpha_{in}^t e^{(a_{ti} x_i + b_{ti})/\sigma_{tn}} \big)}{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}}} \Bigg\} \quad \text{(CFL-CNL)}$$

$$\text{subject to} \quad \sum_{i \in [m]} x_i \le B, \qquad x_i \in [L_i, U_i], \; \forall i \in [m],$$

where $W_{tn} = \sum_{i \in N_n^t \cap ([m] \cup C)} \alpha_{in}^t e^{(a_{ti} x_i + b_{ti})/\sigma_{tn}}$, $\forall t \in [T], n \in [N]$. Here, $B$ is the maximum budget to spend on the new facilities, and $L_i$ and $U_i$ are the lower and upper bounds for the investment in facility $i \in [m]$. It is required that $\sum_{i \in [m]} L_i \le B$ to ensure feasibility.
Let $U_{tn}^c = \sum_{i \in C \cap N_n^t} \alpha_{in}^t e^{(a_{ti} x_i + b_{ti})/\sigma_{tn}}$; we can then rewrite the objective function of (CFL-CNL) as

$$F(\mathbf{x}) = \sum_{t \in [T]} q_t \, \frac{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}-1} \big( \sum_{i \in [m]} \alpha_{in}^t e^{(a_{ti} x_i + b_{ti})/\sigma_{tn}} \big)}{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}}} = \sum_{t \in [T]} q_t \, \frac{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}-1} \big( \sum_{i \in N_n^t \cap [m]} \alpha_{in}^t e^{(a_{ti} x_i + b_{ti})/\sigma_{tn}} \big)}{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}}}$$

$$= \sum_{t \in [T]} q_t \, \frac{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}-1} \big( W_{tn} - U_{tn}^c \big)}{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}}} = \sum_{t \in [T]} q_t - \sum_{t \in [T]} q_t \, \frac{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}-1} U_{tn}^c}{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}}}.$$

It can be observed that .F (x) is highly nonlinear and potentially non-


concave in .x (or no one has proved that it is concave), making traditional exact
approaches for mixed-integer nonlinear programs, such as MILP, conic reformu-
lations [10], or outer-approximation [15, 24], not directly applicable.
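To make the structure of this objective concrete, the following Python sketch evaluates $F(\mathbf{x})$ on a small randomly generated instance. It is our illustration only, not the authors' implementation; all array names, dimensions, and the synthetic data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, m, N, n_comp = 3, 4, 2, 2             # customer types, own facilities, nests, competitor facilities

# Illustrative instance data (assumptions, not taken from the paper).
a = rng.uniform(0.5, 1.5, (T, m))        # sensitivities a_ti
b = rng.uniform(0.0, 1.0, (T, m))        # other factors b_ti
vc = rng.uniform(0.0, 1.0, (T, n_comp))  # fixed utilities of competitor facilities
sigma = rng.uniform(0.5, 1.0, (T, N))    # dissimilarity parameters sigma_tn
q = rng.uniform(1.0, 100.0, T)           # demand per customer type

# Membership weights alpha_in^t, normalized over nests for each location.
alpha_own = rng.uniform(0.0, 1.0, (T, N, m))
alpha_own /= alpha_own.sum(axis=1, keepdims=True)
alpha_comp = rng.uniform(0.0, 1.0, (T, N, n_comp))
alpha_comp /= alpha_comp.sum(axis=1, keepdims=True)

def captured_demand(x):
    """Expected captured market share F(x) under the CNL demand model."""
    total = 0.0
    for t in range(T):
        W = np.zeros(N)      # nest aggregates W_tn
        own = np.zeros(N)    # contribution of the new facilities to each nest
        for n in range(N):
            u_own = alpha_own[t, n] * np.exp((a[t] * x + b[t]) / sigma[t, n])
            u_comp = alpha_comp[t, n] * np.exp(vc[t] / sigma[t, n])
            own[n] = u_own.sum()
            W[n] = own[n] + u_comp.sum()
        total += q[t] * (W ** (sigma[t] - 1) * own).sum() / (W ** sigma[t]).sum()
    return total

x = np.full(m, 2.0)   # a feasible investment vector (assuming L_i <= 2.0 <= U_i)
print(f"F(x) = {captured_demand(x):.3f}")
```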

4 Solution Methods

In this section, we discuss an approximation method to tackle the challenging


problem in (CFL-CNL). The first step involves applying a discretization procedure to simplify the objective function. Subsequently, we perform several changes of variables that reformulate the approximation problem into a concave program, allowing it to be solved using traditional methods such as the cutting plane method, which effectively exploits convexity to solve the approximation problem to optimality.

4.1 Discretization and Approximation

For ease of notation, let $f_{tni}(x) = e^{(a_{ti} x + b_{ti})/\sigma_{tn}}$, where $t \in [T]$ and $n \in [N]$ is such that $N_n^t$ is a nest to which facility $i$ belongs. We can see that $f_{tni}(x)$ is convex and always takes positive values. Each decision variable $x_i$ can vary within the interval $[L_i, U_i]$. Therefore, similar to the piecewise-linear approximation method proposed by [14], we can divide each interval $[L_i, U_i]$ into $K$ successive closed sub-intervals of equal length $\Delta_i = (U_i - L_i)/K$, represented as $[c_k^i, c_{k+1}^i]$ for $k \in [K]$, where $c_k^i = L_i + (k-1)\Delta_i$ for all $k \in [K+1]$. Then, we can represent an approximation of $x_i$ by $K$ binary variables $w_{ik} \in \{0, 1\}$ as

$$x_i \approx \delta(x_i) = L_i + \Delta_i \sum_{k \in [K]} w_{ik}.$$
If $k^* \in [K]$ is such that $c_{k^*}^i \le x_i < c_{k^*+1}^i$, then $w_{ik} = 1$ for all $k < k^*$, and $w_{ik} = 0$ for all $k \ge k^*$. We then approximate $f_{tni}(x_i)$ as follows:

$$f_{tni}(x_i) \approx f_{tni}(L_i) + \Delta_i \sum_{k \in [K]} \gamma_{ik}^{tn} w_{ik},$$

where $\gamma_{ik}^{tn} = \big(f_{tni}(c_{k+1}^i) - f_{tni}(c_k^i)\big)/\Delta_i$ are the slopes of $f_{tni}(x)$ on the sub-intervals $[c_k^i, c_{k+1}^i]$, $\forall k \in [K]$. We can then approximate $W_{tn}$ as

$$W_{tn} \approx U_{tn}^c + \sum_{i \in N_n^t \cap [m]} \Big( f_{tni}(L_i) + \Delta_i \sum_{k \in [K]} \gamma_{ik}^{tn} w_{ik} \Big).$$

Let $\alpha_{tn} = U_{tn}^c + \sum_{i \in N_n^t \cap [m]} f_{tni}(L_i)$ and $\beta_{tnik} = \Delta_i \gamma_{ik}^{tn}$ for all $t \in [T]$, $n \in [N]$, $i \in [m]$. Then, the approximation of the cost optimization in CFL problem can be formulated as

$$\max_{\mathbf{w}} \; F(\mathbf{w}) = \sum_{t \in [T]} q_t - \sum_{t \in [T]} q_t \cdot \frac{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}-1} U_{tn}^c}{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}}} \quad \text{(MCP-Approx)}$$

$$\text{s.t.} \quad W_{tn} = \alpha_{tn} + \sum_{i \in N_n^t} \sum_{k \in [K]} \beta_{tnik} w_{ik}, \quad \forall t \in [T], n \in [N],$$

$$\sum_{i \in [m]} \Big( L_i + \Delta_i \sum_{k \in [K]} w_{ik} \Big) \le B,$$

$$w_{i,k+1} \le w_{ik}, \quad \forall i \in [m], k \in [K-1],$$

$$\mathbf{w} \in \{0, 1\}^{m \times K}.$$
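For illustration, the sketch below builds the breakpoints $c_k^i$, the slopes $\gamma_{ik}^{tn}$, and the coefficients $\alpha_{tn}$, $\beta_{tnik}$ used in (MCP-Approx). It follows the formulas above under assumed array shapes and parameter names; nest membership is encoded as a boolean mask, and none of this is the authors' code.

```python
import numpy as np

def pwl_coefficients(L, U, K, a, b, sigma, member, U_comp):
    """Piecewise-linear coefficients alpha_tn and beta_tnik (illustrative sketch).

    L, U    : (m,)      lower/upper investment bounds
    a, b    : (T, m)    utility parameters a_ti, b_ti
    sigma   : (T, N)    dissimilarity parameters sigma_tn
    member  : (T, N, m) boolean mask, True if facility i belongs to nest N_n^t
    U_comp  : (T, N)    fixed competitor terms U^c_tn
    """
    T, N, m = member.shape
    delta = (U - L) / K                                          # Delta_i
    c = L[:, None] + np.arange(K + 1)[None, :] * delta[:, None]  # breakpoints, shape (m, K+1)

    alpha_tn = np.zeros((T, N))
    beta = np.zeros((T, N, m, K))
    for t in range(T):
        for n in range(N):
            # f_tni evaluated at every breakpoint, shape (m, K+1)
            f_c = np.exp((a[t][:, None] * c + b[t][:, None]) / sigma[t, n])
            gamma = (f_c[:, 1:] - f_c[:, :-1]) / delta[:, None]  # slopes gamma_ik^tn
            mask = member[t, n].astype(float)
            alpha_tn[t, n] = U_comp[t, n] + (mask * f_c[:, 0]).sum()
            beta[t, n] = mask[:, None] * delta[:, None] * gamma
    return alpha_tn, beta
```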

4.2 Convexification

The non-concavity of $F(\mathbf{w})$ implies that (MCP-Approx) cannot be solved directly to optimality using outer-approximation cuts. Fortunately, we can demonstrate below that, with some changes of variables, we can transform (MCP-Approx) into a concave program. To begin, we define new variables as follows:

$$y_{tn} = \log\big(W_{tn}^{\sigma_{tn}-1} U_{tn}^c\big), \quad \forall t \in [T], n \in [N],$$

$$z_t = \log\Big(\sum_{n \in [N]} W_{tn}^{\sigma_{tn}}\Big), \quad \forall t \in [T].$$
We then write (MCP-Approx) as the following equivalent nonlinear program:

$$\max_{\mathbf{w}, \mathbf{y}, \mathbf{z}} \; \sum_{t \in [T]} q_t - \sum_{t \in [T]} q_t \sum_{n \in [N]} \exp(y_{tn} - z_t) \quad (1)$$

$$\text{s.t.} \quad y_{tn} = \log\big(W_{tn}^{\sigma_{tn}-1} U_{tn}^c\big), \quad \forall t \in [T], n \in [N], \quad (2)$$

$$z_t = \log\Big(\sum_{n \in [N]} W_{tn}^{\sigma_{tn}}\Big), \quad \forall t \in [T], \quad (3)$$

$$W_{tn} = \alpha_{tn} + \sum_{i \in N_n^t \cap [m]} \sum_{k \in [K]} \beta_{tnik} w_{ik}, \quad \forall t \in [T], n \in [N],$$

$$\sum_{i \in [m]} \Big( L_i + \Delta_i \sum_{k \in [K]} w_{ik} \Big) \le B,$$

$$w_{i,k+1} \le w_{ik}, \quad \forall i \in [m], k \in [K-1],$$

$$\mathbf{w} \in \{0,1\}^{m \times K}, \; \mathbf{y} \in \mathbb{R}^{T \times N}, \; \mathbf{z} \in \mathbb{R}^{T}.$$

It can be further shown that the above nonlinear non-convex program can be reformulated as the following optimization problem:

$$\max_{\mathbf{w}, \mathbf{y}, \mathbf{z}, \theta, \mathbf{W}} \; \theta \quad \text{(MCP-Reform)}$$

$$\text{s.t.} \quad \theta \le \sum_{t \in [T]} q_t - \sum_{t \in [T]} q_t \sum_{n \in [N]} \exp(y_{tn} - z_t), \quad (4)$$

$$y_{tn} \ge (\sigma_{tn} - 1)\log(W_{tn}) + \log(U_{tn}^c), \quad \forall t \in [T], n \in [N], \quad (5)$$

$$z_t \le \log\Big(\sum_{n \in [N]} W_{tn}^{\sigma_{tn}}\Big), \quad \forall t \in [T], \quad (6)$$

$$W_{tn} = \alpha_{tn} + \sum_{i \in N_n^t \cap [m]} \sum_{k \in [K]} \beta_{tnik} w_{ik}, \quad \forall t \in [T], n \in [N],$$

$$\sum_{i \in [m]} \Big( L_i + \Delta_i \sum_{k \in [K]} w_{ik} \Big) \le B,$$

$$w_{i,k+1} \le w_{ik}, \quad \forall i \in [m], k \in [K-1],$$

$$\mathbf{w} \in \{0,1\}^{m \times K}, \; \mathbf{y} \in \mathbb{R}^{T \times N}, \; \mathbf{z} \in \mathbb{R}^{T}.$$

This formulation has been shown to be a concave program with a linear objective and convex constraints [20]. Consequently, an outer-approximation method [9, 15] can be used to solve it to optimality. The key idea behind this approach is to formulate a master problem, usually a MILP, by replacing the nonlinear objective function and constraints with a sequence of linear outer-approximation cuts. These cuts iteratively refine the feasible region and improve the approximation of the nonlinear problem. By repeatedly solving the master problem and adding new cuts based on the nonlinear constraints, the method guarantees convergence to the optimal solution after a finite number of iterations. The outer-approximation cuts, along with the master problem, can be embedded into standard optimization techniques such as cutting plane or branch-and-cut procedures. This approach allows us to obtain optimal solutions for the approximation problem in a structured and computationally efficient manner.
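As an illustration of the cut-generation step (our sketch under the concavity result cited above, not the authors' implementation), the concave constraint (6) can be linearized at an incumbent point to obtain a valid outer-approximation cut; the helper below returns its coefficients.

```python
import numpy as np

def oa_cut_logsum(W_bar, sigma):
    """First-order outer-approximation cut for  z_t <= log(sum_n W_tn**sigma_tn),
    linearized at an incumbent point W_bar > 0:
        z_t <= g + grad @ (W - W_bar).
    Returns (g, grad); validity rests on concavity of the right-hand side."""
    powers = W_bar ** sigma
    total = powers.sum()
    g = np.log(total)
    grad = sigma * W_bar ** (sigma - 1.0) / total
    return g, grad

# Tiny usage example with made-up numbers; in a full method such cuts would be
# added to the MILP master problem (e.g., as lazy constraints) at each incumbent.
sigma = np.array([0.6, 0.9])
W_incumbent = np.array([3.0, 1.5])
g, grad = oa_cut_logsum(W_incumbent, sigma)
print(f"cut: z <= {g:.4f} + {np.round(grad, 4)} . (W - {W_incumbent})")
```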

5 Numerical Experiments
In this section, we present experimental results to evaluate the performance of
our proposed methods for solving the cost optimization in CFL problem under
the CNL model. We first present the settings of the experiment, followed by a
comparison of the results between the approximation approach and the direct
application of PGD in non-convex settings to evaluate the performance of the
proposed algorithm.

5.1 Experimental Settings


In this experiment, we randomly generated various instances to evaluate the performance of the proposed algorithms. The dataset comprises 12 problem classes, with the number of customer types $T$ varying in $\{25, 50, 100, 200\}$ and the number of available locations $m$ in $\{25, 50, 100\}$. For each pair $(T, m)$ we generated 10 different instances. The parameters of each instance are generated as follows. The total demand of each customer type $t$ is randomly generated between 1 and 100. The lower and upper bounds for the cost of each facility are uniformly generated in the ranges $[0, 5]$ and $[5, 10]$, respectively. The maximum budget lies between the total lower bound and the total upper bound of the investment cost. We define a new parameter $\delta$ for the maximum budget $B$ such that $B = \sum_{i \in [m]} L_i + \big(\sum_{i \in [m]} (U_i - L_i)\big) \times \delta$. The parameter $\delta$ is chosen in $\{0.1, 0.3, 0.5\}$. Consequently, each pair $(T, m)$ yields a total of 30 instances.

The sensitivity parameter $a_{ti}$ is randomly and uniformly generated in the range $[0.5, 1.5]$. The other factors $b_{ti}$ are randomly generated within 10% of the cost bounds of facility $i$, i.e., $0.1 \times L_i \le b_{ti} \le 0.1 \times U_i$. This also implies that other factors influence about 10% of the customer's decision in selecting a facility. For the CNL parameters, the dissimilarity parameter $\sigma_{tn}$ for nest $N_n^t$ is drawn from a uniform distribution between 0.5 and 1. Moreover, the number of nests $N$ is set to 5. Since the nests may overlap, we introduce a parameter $\gamma \ge 1$, which represents the average number of nests to which a single location belongs. This parameter, referred to as the overlapping rate, directly influences how locations are shared among the different nests and is used to control the degree of overlap across nests. In this experiment, the value of $\gamma$ is set to 1.2. This parameter shapes the cross-nested correlation structure across the $m$ locations and $N$ nests, which is constructed as follows. We begin by randomly assigning the $m$ locations to the $N$ nests. Next, we randomly sample $\lceil (\gamma - 1)m \rceil$ locations from the $m$ locations. These sampled locations are then randomly assigned to the $N$ nests, ensuring that each location is assigned no more than once per nest and that each nest contains at least two distinct locations. The allocation parameter $\alpha_{in}^t$ for $i \in [m] \cup C$ is randomly generated from a uniform distribution over the interval $[0, 1]$. These parameters are normalized such that $\alpha_{in}^t = 0$ if location $i$ is not a member of nest $N_n^t$ and $\sum_{n \in [N]} \alpha_{in}^t = 1$. Different customer types have distinct sets of available locations within each nest.
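For illustration, the sketch below generates a cross-nested membership structure in the spirit of this procedure (random initial assignment, roughly $\lceil(\gamma-1)m\rceil$ extra memberships, and per-location normalization of $\alpha$). It is a simplified assumption of ours and omits the check that every nest contains at least two locations.

```python
import numpy as np

def generate_nest_structure(m, N, gamma, rng):
    """Sketch of the cross-nested structure generation described above."""
    member = np.zeros((N, m), dtype=bool)
    member[rng.integers(0, N, size=m), np.arange(m)] = True    # one random nest per location

    extra = int(np.ceil((gamma - 1) * m))                       # extra memberships
    for i in rng.choice(m, size=extra, replace=False):
        free = np.flatnonzero(~member[:, i])                    # nests location i is not yet in
        if free.size:
            member[rng.choice(free), i] = True

    alpha = rng.uniform(0.0, 1.0, (N, m)) * member              # alpha = 0 outside the nests
    alpha /= alpha.sum(axis=0, keepdims=True)                   # sum over nests equals 1
    return member, alpha

member, alpha = generate_nest_structure(m=25, N=5, gamma=1.2, rng=np.random.default_rng(1))
print(member.sum(axis=0).mean())   # average number of nests per location, roughly gamma
```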
In this experiment, we evaluate our approach, specifically CP, which solves the mixed-integer convex program (MCP-Reform) using outer-approximation cuts. We also include a PGD algorithm with the ADAM optimizer [18] as a baseline for comparison. This algorithm starts from the lower bound of the cost and iteratively selects locations one by one to maximize the objective value. By considering this gradient-based method, we aim to compare the performance of a heuristic approach with our approximation method. The learning rate of PGD is set to $10^{-2}$ and the algorithm was executed for 10000 epochs. The parameters of the ADAM optimizer are: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-3}$ [35].
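One non-trivial ingredient of such a baseline is the projection onto the feasible set $\{\mathbf{x} : \sum_i x_i \le B, \; L_i \le x_i \le U_i\}$. The sketch below shows one standard way to implement it (bisection on the budget multiplier); it is our assumption of a reasonable implementation, not the authors' code, and the gradient/ADAM update itself follows the usual formulas of [18].

```python
import numpy as np

def project(v, L, U, B, iters=60):
    """Euclidean projection of v onto {x : L <= x <= U, sum(x) <= B} via
    bisection on the Lagrange multiplier of the budget constraint."""
    x = np.clip(v, L, U)
    if x.sum() <= B:
        return x
    lo, hi = 0.0, (v - L).max()          # at lam = hi, clip(v - lam, L, U) = L, which is feasible
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        x = np.clip(v - lam, L, U)
        if x.sum() > B:
            lo = lam
        else:
            hi = lam
    return np.clip(v - hi, L, U)

# One projected ascent step would then read:  x = project(x + lr * grad_F(x), L, U, B)
```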
The algorithms were written in C++, and the MILP models were solved by Gurobi version 11.0.3 with the number of CPUs set to 8 cores. All experiments were run on a 13th Gen Intel(R) Core(TM) i5-13500 @ 4.8 GHz with 64 GB of RAM. The CP algorithm terminates when it reaches the maximum number of iterations $nb_{iter} = 10000$ or the running time exceeds 3600 seconds. The optimality gap $\epsilon$ is set to $10^{-4}$. The number of intervals in the approximation procedure, $K$, is set to 20.

5.2 Numerical Results

To evaluate the performance of our approach compared to the baseline, we report the number of times our approach returns optimal solutions (# Opt) within a fixed time budget (1 hour) and the number of times it returns the best objective values (# Best) for the approximation problem. Since our method guarantees only near-optimal solutions, we compare the solutions found by the PGD algorithm with the objective values obtained from the cutting plane method. We then report the average gap over the 10 instances between the solutions from PGD and our approach (Gap (%)) for each pair $(T, m)$. Additionally, we indicate how often the solution from PGD outperforms the solution from our method. Lastly, we provide the average computing times in seconds (Time (s)) for all approaches.
The results in Table 1 demonstrate that our approach consistently outperforms the PGD algorithm in finding the best solution. In all problem classes, the cutting plane algorithm produced better solutions than PGD, except for three instances with $\delta = 10\%$ and $(T, m) = (50, 25), (100, 25), (200, 25)$. These results highlight that our approach yields better solutions in most cases. Specifically, the average gaps between PGD and our approach are larger than 3% in all cases and larger than 5% for 33 out of 36 problem classes.

Table 1. Numerical results for cost optimization in competitive facility location under
CNL model, average computing times are given in seconds.

δ (%) | T | m | Cutting Plane: # Opt, # Best, Time (s) | Projected Gradient Descent: # Best, Gap (%), Time (s)
10 25 25 10 10 10.99 0 8.13 21.06
10 25 50 10 10 25.18 0 7.92 35.07
10 25 100 10 10 92.88 0 9.73 84.96
10 50 25 10 9 145.84 1 9.54 22.47
10 50 50 10 10 127.87 0 7.20 37.34
10 50 100 10 10 178.72 0 6.18 89.48
10 100 25 10 9 223.29 1 5.47 24.35
10 100 50 8 10 2323.80 0 7.14 40.36
10 100 100 7 10 1560.98 0 7.02 97.65
10 200 25 8 9 2280.26 1 4.28 28.76
10 200 50 1 10 3428.24 0 8.36 47.38
10 200 100 4 10 3007.21 0 7.20 107.04
30 25 25 10 10 3.83 0 6.73 20.89
30 25 50 10 10 3.95 0 8.45 33.71
30 25 100 10 10 9.80 0 12.12 80.63
30 50 25 10 10 16.62 0 6.57 22.32
30 50 50 10 10 17.07 0 6.73 36.34
30 50 100 10 10 28.47 0 7.98 84.12
30 100 25 10 10 30.97 0 5.23 24.69
30 100 50 10 10 75.02 0 7.59 39.61
30 100 100 10 10 83.34 0 7.99 89.62
30 200 25 10 10 239.31 0 5.14 29.46
30 200 50 10 10 156.79 0 4.74 47.90
30 200 100 10 10 240.32 0 6.74 100.91
50 25 25 10 10 2.62 0 7.70 20.66
50 25 50 10 10 2.30 0 8.97 32.29
50 25 100 10 10 5.16 0 12.49 81.22
50 50 25 10 10 3.12 0 8.18 22.09
50 50 50 10 10 7.43 0 7.60 34.54
50 50 100 10 10 13.09 0 8.38 84.30
50 100 25 10 10 9.52 0 5.31 24.22
50 100 50 10 10 16.79 0 8.39 38.38
50 100 100 10 10 24.38 0 8.68 92.36
50 200 25 10 10 26.22 0 3.33 28.67
50 200 50 10 10 62.22 0 5.47 45.84
50 200 100 10 10 75.15 0 7.41 101.25

In terms of optimality, our method always returns the optimal solution for the approximation problem, and these solutions outperform the local optima obtained by PGD when $\delta \ge 30\%$. However, in larger instances, particularly with $T = 200$ and $\delta = 10\%$, the cutting plane algorithm returns optimal solutions for only 8 instances when $m = 25$, 1 instance for $m = 50$, and 4 instances for $m = 100$. Regarding computing time, as shown in Table 1, the PGD algorithm converges in less than 100 seconds for most problem classes, while the cutting plane algorithm takes longer to find optimal solutions, especially for lower $\delta$. Specifically, in larger-sized instances with $T \ge 100$, $m \ge 50$, and $\delta = 10\%$, our method takes more than 1500 seconds, whereas for instances with $\delta \ge 30\%$ it consistently takes less than 250 seconds.
To analyze the impact of the budget, we conduct additional small experiments with $T = 50$, examining the relative change in objective value and the average computing time of our approach as $\delta$ varies. Figure 1 shows the results for $\delta$ ranging from 0.1 to 1. In Fig. 1a, we observe that as the budget increases (i.e., higher $\delta$), the expected captured market share rises. Notably, when $\delta \le 50\%$, the objective value grows more rapidly than when $\delta > 50\%$, suggesting that a few key facilities have the largest impact on customer demand. Thus, it is more effective to concentrate the budget on these facilities. For the remaining facilities, increasing investment yields diminishing returns in terms of customer attraction. This explains why we chose $\delta \in \{0.1, 0.3, 0.5\}$ for the main experiments. In terms of computation time, the problem becomes harder to solve as $\delta$ decreases, indicating that with a limited budget, the cutting plane method requires more time to identify the most impactful facilities.

Fig. 1. Relative change of the objective value and average computing time of the cutting plane algorithm with respect to $\delta$.

6 Conclusion
We have studied the cost optimization problem in competitive facility loca-
tion, where customer demand is predicted using the cross-nested logit model.
Given the high non-convexity of this problem, it poses significant challenges.

To address this, we proposed a piecewise linear approximation method to simplify


the problem. We then reformulated the approximation problem into a mixed-
integer nonlinear problem with a linear objective and convex constraints, making
it solvable to optimality via the outer-approximation method. Additionally, we
introduced a projected gradient descent algorithm with adaptive step size as
a baseline to assess the performance of the approximation approach. Through
extensive experiments on generated instances of various sizes, we showed that
the proposed cutting plane method with outer-approximation cuts provides a
near-optimal solution and outperforms standard baselines. A future direction
would be to explore a joint location and cost optimization problem under the
CNL model, which is highly relevant but presents significant challenges due to
its complexity. Addressing both location and cost decisions simultaneously could
lead to more comprehensive solutions in the context of competitive facility loca-
tion problems.

Acknowledgment. This research is funded by Phenikaa University under grant number PU2023-1-A1-01.

References
1. Aouad, A., Farias, V., Levi, R., Segev, D.: The approximability of assortment
optimization under ranking preferences. Oper. Res. 66(6), 1661–1669 (2018)
2. Beine, M.A., Bierlaire, M., Docquier, F.: New York, Abu Dhabi, London or stay at home? Using a cross-nested logit model to identify complex substitution patterns in migration. IZA Discussion Paper No. 14090 (2021)
3. Ben-Akiva, M., Lerman, S.R.: Discrete Choice Analysis: Theory and Application
to Travel Demand. MIT Press, Cambridge, Massachusetts (1985)
4. Ben-Akiva, M.: The structure of travel demand models. Ph.D. thesis, MIT (1973)
5. Ben-Akiva, M., Bierlaire, M.: Discrete Choice Methods and their Applications to
Short Term Travel Decisions, pp. 5–33. Springer US, Boston, MA (1999)
6. Ben-Akiva, M., Bierlaire, M.: Discrete choice methods and their applications to
short term travel decisions. Handbook of Transportation Science, pp. 5–33 (1999)
7. Benati, S., Hansen, P.: The maximum capture problem with random utilities: Prob-
lem formulation and algorithms. Eur. J. Oper. Res. 143(3), 518–530 (2002)
8. Bierlaire, M.: A theoretical analysis of the cross-nested logit model. Ann. Oper.
Res. 144, 287–300 (2006)
9. Bonami, P., et al.: An algorithmic framework for convex mixed integer nonlinear
programs. Discret. Optim. 5(2), 186–204 (2008)
10. Şen, A., Atamtürk, A., Kaminsky, P.: A conic integer programming approach to
constrained assortment optimization under the mixed multinomial logit model.
Oper. Res. 66(4), 994–1003 (2018)
11. Ding, C., Mishra, S., Lin, Y., Xie, B.: Cross-nested joint model of travel mode and departure time choice for urban commuting trips: case study in Maryland-Washington, DC region. J. Urban Plann. Develop. 141(4), 04014036 (2015)

12. Drabas, T., Wu, C.L.: Modelling air carrier choices with a segment specific cross
nested logit model. J. Air Transp. Manag. 32, 8–16 (2013)
13. Duong, N.H., Dam, T.T., Ta, T.A., Mai, T.: Joint location and cost planning
in maximum capture facility location under random utilities. Comput. Oper.
Res. 159, 106336 (2023). https://doi.org/10.1016/j.cor.2023.106336, https://www.
sciencedirect.com/science/article/pii/S0305054823002009
14. Duong, N.H., Ta, T.A.: Approximation methods for a nonlinear competitive facility
cost optimization problem. In: 2022 14th International Conference on Knowledge
and Systems Engineering (KSE), pp. 1–6. IEEE (2022)
15. Duran, M.A., Grossmann, I.E.: An outer-approximation algorithm for a class of
mixed-integer nonlinear programs. Math. Program. 36, 307–339 (1986)
16. Fosgerau, M., Bierlaire, M.: Discrete choice models with multiplicative error terms.
Transp. Res. Part B 43(5), 494–505 (2009)
17. Fosgerau, M., McFadden, D., Bierlaire, M.: Choice probability generating func-
tions. J. Choice Modell. 8, 1–18 (2013)
18. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2017). https://
arxiv.org/abs/1412.6980
19. Lai, X., Bierlaire, M.: Specification of the cross-nested logit model with sampling
of alternatives for route choice models. Transport. Res. Part B: Methodol. 80,
220–234 (2015)
20. Le, B.L., Mai, T., Ta, T.A., Ha, M.H., Vu, D.M.: Competitive facility location
under cross-nested logit customer choice model: Hardness and exact approaches
(2024). https://arxiv.org/abs/2408.02925
21. Le, C., Mai, T.: Constrained assortment optimization under the cross-nested logit
model. Production and Operations Management p. 1 (2024). https://doi.org/10.
1177/10591478241263857
22. Mai, T.: A method of integrating correlation structures for a generalized recursive
route choice model. Transport. Res. Part B: Methodol. 93, 146–161 (2016)
23. Mai, T., Frejinger, E., Fosgerau, M., Bastin, F.: A dynamic programming approach
for quickly estimating large network-based MEV models. Transport. Res. Part B:
Methodol. 98, 179–197 (2017)
24. Mai, T., Lodi, A.: A multicut outer-approximation approach for competitive facility
location under random utilities. Eur. J. Oper. Res. 284(3), 874–881 (2020)
25. McFadden, D.: Conditional logit analysis of qualitative choice behaviour. In:
Zarembka, P. (ed.) Frontiers in Econmetrics, pp. 105–142. Academic Press, New
York (1973)
26. McFadden, D.: Modelling the choice of residential location. Transportation
Research Record (1978)
27. McFadden, D.: Econometric models of probabilistic choice. In: Manski, C., McFad-
den, D. (eds.) Structural Analysis of Discrete Data with Econometric Applications,
chap. 5, pp. 198–272. MIT Press (1981)
28. McFadden, D.: Economic choices. American Economic Review, pp. 351–378 (2001)
29. McFadden, D., Train, K.: Mixed MNL models for discrete response. J. Appl.
Econom. 447–470 (2000)
30. Small, K.A.: A discrete choice model for ordered alternatives. Econometrica 55(2),
409 (1987). https://doi.org/10.2307/1913243
31. Train, K.E.: Discrete Choice Methods with Simulation. Cambridge University Press (2009)
32. Vovsha, P.: Application of cross-nested logit model to mode choice in Tel Aviv,
Israel, metropolitan area. Transp. Res. Rec. 1607(1), 6–15 (1997)

33. Vovsha, P., Bekhor, S.: Link-nested logit model of route choice: overcoming route overlapping problem. Transp. Res. Rec. 1645, 133–142 (1998)
34. Yang, L., Zheng, G., Zhu, X.: Cross-nested logit model for the joint choice of
residential location, travel mode, and departure time. Habitat Int. 38, 157–166
(2013)
35. Zaheer, M., Reddi, S., Sachan, D., Kale, S., Kumar, S.: Adaptive methods for
nonconvex optimization. In: Advances in Neural Information Processing Systems,
vol. 31 (2018)
A Historical GPS Trajectory-Based
Framework for Predicting Bus Travel Time

Khang Nguyen Duy1,3, Minh Nguyen Tuan1,3, and Nam Thoai1,2,3(B)

1 Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam
[email protected]
2 Advanced Institute of Interdisciplinary Science and Technology, Ho Chi Minh City University of Technology (HCMUT), 268 Ly Thuong Kiet Street, District 10, Ho Chi Minh City, Vietnam
3 Vietnam National University Ho Chi Minh City, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam

Abstract. Accurate bus travel time information helps passengers plan


their trips more effectively and can potentially increase ridership. How-
ever, cyclical factors (e.g., time of day, weather conditions, holidays),
unpredictable factors, and other complex factors (e.g., dynamic traffic
conditions, dwell times, variations in travel demand) make accurate bus
travel time prediction challenging. This study aims to improve travel
time prediction accuracy. To achieve this, we developed a bus travel
time prediction framework based on similar historical Global Position-
ing System (GPS) trajectory data and an information decay technique.
The framework first divides the predicted bus route into segments, inte-
grating GPS trajectory data with road map processing techniques to
accurately map the bus’s position and estimate its arrival time at bus
stops. Then, instead of relying on a single historical trajectory that best
matches the predicted bus journey, the framework samples a set of similar
trajectories as the basis for travel time estimation. Finally, the informa-
tion decay technique is applied to construct a bus travel time prediction
interval. We conduct comprehensive experiments using GPS trajectory
data collected from Kandy, Sri Lanka, and Ho Chi Minh City, Vietnam,
to validate our ideas and evaluate the proposed framework. The experi-
mental results show that the proposed prediction framework significantly
improves accuracy compared to baseline approaches by considering fac-
tors such as bus stops, time of day, and day of week.

Keywords: GPS trajectory data · Bus travel time prediction · Trajectory similarity · Information decay technique · Map-matching

1 Introduction

With the rapid development of the Intelligent Transportation System (ITS),


more and more data on bus vehicles and passengers (e.g., GPS trajectory data,
Integrated Circuit card data) have been collected. Extracting accurate bus operation information from these data has become a key research topic and is important for improving bus services, in particular passenger satisfaction and efficient bus operation.
The goal of predicting bus travel time is to estimate the time a bus takes to travel between two bus stops [27]. However, unpredictable factors may cause an early arrival or a delay at a specific bus stop, causing passengers to lose confidence in public transport and prefer private cars. Cyclical factors, such as weather conditions [9] (e.g., rain, snow, fog), time intervals [11] (e.g., peak and off-peak hours), and holidays have a strong effect on bus travel time prediction. Moreover, there are various challenges associated with the processing of GPS data, such as discontinuities, non-uniformities, poor network coverage, and human errors. Additionally, the systematic errors and uneven reporting intervals of raw GPS data make it rather difficult to estimate the bus travel time between any two consecutive stops.
A large amount of research based on bus GPS data has been conducted on
the bus evaluation methodologies [28], bus arrival time prediction models [8] and
influence factors to improve the performance of bus travel time prediction [22].
Over the years, many traditional approaches have been explored to predict bus
travel time. In earlier studies, methods such as the k-nearest neighbors (KNN)
algorithm were applied to address this problem [4]. Other research has combined
different model types to enhance prediction accuracy [2]. More recently, Pang
et al. [16] utilized Recurrent Neural Networks with Long Short-Term Memory
blocks (RNN-LSTM) to improve accuracy further. Additionally, Han et al. [8]
divided entire bus routes into segments and proposed a hybrid-BAT (Bus Arrival
Time) model, incorporating weighted LSTM to predict multi-stop bus arrival
times more effectively. However, many of them have not considered stochastic
features or the variability of bus travel time throughout a particular interval of
a day, which can significantly affect the travel time. While traffic varies a lot
between road segments, most prior studies have used a single model to predict
the travel time for the entire route, which cannot effectively capture sophisti-
cated dependencies among different road segments at various periods. Moreover,
the existing studies usually focused on the delay at special bus operation facil-
ities such as intersections and travel speed at the route level, and few research
studies were supported by the route map based on bus line data [30]. There-
fore, few studies address how to accurately estimate bus arrival times using GPS
trajectory data or predict bus travel times at different intervals of the day.
In this study, we propose a prediction framework to estimate bus travel times by leveraging collected GPS trajectory data. This approach uses historical GPS data to estimate travel times. After pre-processing the GPS trajectory data and bus route data and mapping GPS points onto the route map, the bus stop arrival times are estimated; the bus travel time is then obtained using a clustering method and an information decay function that downweights records corresponding to older data.
The main contributions of this study are the following: (1) This study pro-
poses a bus travel time prediction framework that utilizes historical GPS tra-
jectory data and mapping techniques to accurately estimate bus stop arrival
times. By dividing the bus route into segments and considering multiple similar
trajectories, the framework addresses variability in travel conditions, improving
the accuracy of travel time prediction. (2) The framework takes advantage of
the positional relationship between bus stop locations and nearby mapped GPS
points to estimate bus stop arrival times, rather than relying on GPS instanta-
neous speed values. (3) This study conducts extensive experiments using real-
world GPS bus trajectory data from Sri Lanka and Vietnam to validate the pro-
posed framework. The results demonstrate that the model significantly improves
prediction accuracy compared to existing methods by considering time intervals
and providing a more robust solution for travel time estimation.
The rest of this study is organized as follows. Section 2 reviews the related
work, Sect. 3 introduces terminology, and formulates the research problem. Later,
Sect. 4 gives an overview of the framework, detailing its prediction model. Then,
a comprehensive experimental study is conducted using the collected real dataset
of bus trajectories in Sect. 5. Finally, the conclusions are outlined in Sect. 6.

2 Related Work
The problem of bus travel time prediction was studied by considering different
models and various essential factors. In a study by Gaikwad et al. [7], the crucial
features for bus travel time prediction and standard evaluation metrics were pre-
sented. Former studies have demonstrated that travel time tends to be affected
by traffic congestion and weather conditions. Thus, it can be periodic (e.g., the
daily growth in traffic over the peak interval on weekdays) or non-periodic (like
accidents and abnormal weather) over many time intervals in a day or day-to-
day. Over the last several years, many studies have been proposed to predict the
bus travel time in urban networks, mainly between two consecutive bus stops
along the path. Multiple prediction methods and mechanisms have been devel-
oped to predict bus travel time using historical trajectory or GPS data. These
methods can be categorized into statistical methods, machine learning methods,
and neural network methods.
Statistical approaches for predicting bus travel times can be classified into
historical average methods [6], regression methods [3], and time-series meth-
ods [8]. Historical average methods predict travel times by averaging past travel
times for the same time interval across multiple days [6]. This method follows
a decentralized model that doesn’t rely on specific training data or assumptions
about the data’s underlying patterns. However, it is often difficult to gather suf-
ficient trip data for each bus stop across different time intervals. Furthermore,
this approach requires stable traffic conditions to be effective, and its accuracy
significantly decreases when there are major fluctuations in traffic patterns [21].
Regression methods are designed to assess the impact of various factors on bus
travel time, explaining travel time (the dependent variable) through a set of
independent variables (e.g., passenger loads, and street characteristics). Traffic
patterns, journey distance, and dwell time are treated as separate variables. It
should be noted that nonlinear relationships between independent and dependent
variables may also exist [3]. The time-series method assumes that the future status of a bus depends on the previous status of the same bus. These methods often lead to minimal delays between predicted and actual times. However, they are highly sensitive to complex and unusual circumstances, such as congestion or delays at intersections, which are common on bus routes. Commonly applied time-series methods are ARIMA [20] and generalized autoregressive conditional heteroscedasticity [26].
Machine learning techniques have been increasingly applied to predict bus
travel time. Methods like Support Vector Machine (SVM) and Artificial Neural
Networks (ANN) are particularly popular due to their ability to capture com-
plex relationships efficiently. In study [1], an ANN-based model was developed to
predict bus travel times using GPS data from a route in Delhi, India. Similarly,
a study focusing on bus arrival predictions in both urban and rural regions of
China [23] utilized a combination of Support Vector Regression (SVR) and KNN
to compare with traditional prediction methods, using data from a bus route
that spanned urban and rural areas. Leveraging the Internet of Things, cluster-
ing method, study [12] showed that accurate arrival time under different traffic
conditions can be predicted using the bus and route with different parameters
such as average speed, number of passengers, rush hour information, and num-
ber of bus stops. To further enhance prediction accuracy, many studies [13] have
incorporated the Kalman Filter algorithm into machine learning-based methods.
In addition, with recent advancements in deep learning, researchers have
demonstrated that bus travel time prediction can be further enhanced by leverag-
ing deep learning architectures like LSTM networks [10], which excel at learning
temporal correlations. A more recent study [8] introduced a GPS position calibra-
tion technique to improve arrival time accuracy through the use of LSTM models.
Furthermore, a hybrid model as a combination of convolutions and LSTM layers
into a single structure, ConvLSTM [24] was applied [18]. Here, the convolution
layer could learn spatial dependencies between segments. In parallel, ensemble
learning methods have gained popularity in travel time prediction. Gradient
boosting ensembles, such as XGBoost [29], have been widely used, improving
prediction accuracy while mitigating the risk of overfitting.
Furthermore, there is no standard dataset used by these researchers, which prevents direct comparison. Many studies have not considered characteristics of
bus travel time, such as stochastic features or variability across different intervals
of the day, which can significantly impact accuracy. Although traffic conditions
vary widely between road segments, most prior studies have used a single model
to predict travel time for entire routes, which fails to effectively capture complex
dependencies across segments at various times.

3 Notation and Problem Definition

In this section, we define the essential concepts that are employed throughout
this study and formulate the targeted research problem.

3.1 Terminology
To support our discussions throughout this study, we begin by defining several
key terms, including bus route, route segment, road link, road map, GPS point,
raw GPS trajectory, mapped GPS trajectory, and segment travel time.
Definition 1. Bus route: A bus route $R$ is represented as a sequence of bus stops, $R = (s_1, s_2, \ldots, s_n)$, where $s_i$ stands for the $i$-th bus stop and $n$ is the number of bus stops. Additionally, in public transportation a bus regularly runs on a fixed route, which can be represented by a sequence of two-dimensional (latitude and longitude) route points $(p_1, p_2, \ldots, p_m)$, where $p_i = (x_i, y_i)$ and $m$ is the number of route points.
Definition 2. Route segment: In this study, based on bus stops, we divide a route into route segments, which are partial segments between two consecutive bus stops. Accordingly, a bus route $R$ can be represented by a sequence of segments $(S_1, S_2, \ldots, S_{n-1})$.
Definition 3. Road link: A road link is a directed link, defined as $e = (e_{id}, p_{id}^{start}, p_{id}^{end})$, where $e_{id}$ represents the road link identifier, $p_{id}^{start}$ represents the road link start point, and $p_{id}^{end}$ represents the road link end point.
Definition 4. Road map: The road map is represented by a directed graph $G = (V, E)$, where $V$ is the set of road link nodes, expressed as $V = (v_1, v_2, \ldots, v_m)$, and $E$ is the set of all road links, expressed as $E = (e_1, e_2, \ldots, e_{m-1})$.
Definition 5. GPS point: A GPS point (also called trajectory point) represents the location information acquired by the GPS device within a certain time interval, defined as $tr_i = (lat_i, lng_i, t_i)$, where $lat_i$ and $lng_i$ represent the latitude and longitude coordinates of the current location point, respectively, and $t_i$ represents the time at which the vehicle was at that location.
Definition 6. Raw GPS trajectory: A raw GPS trajectory (also called the original GPS trajectory) consists of a set of GPS points ordered by timestamps, defined as $TR = (tr_1, tr_2, \ldots, tr_u)$, where $u$ is the number of GPS points contained in $TR$.
Definition 7. Mapped GPS trajectory: Given an original GPS trajectory, its mapped trajectory $TR_m$ represents the actual driving links of the bus on the road map, defined as $TR_m = (c_1, c_2, \ldots, c_u)$.
Definition 8. Segment travel time: A bus trajectory is represented as a sequence $((s_1, T_1), (s_2, T_2), \ldots, (s_n, T_n))$, where $T_i$ denotes the arrival time of the bus at bus stop $s_i$, which can be calculated from the mapped trajectory. With the arrival time at each bus stop, it is straightforward to derive the segment travel time between two consecutive bus stops, which can be represented as a sequence $(\Delta T_1, \Delta T_2, \ldots, \Delta T_{n-1})$, where $\Delta T_i$ denotes the travel time on the $i$-th segment. In this study, the segment travel time is defined as the time from arrival at one bus stop to arrival at the next bus stop. Therefore, the dwell time, which is the time spent before and after the stop, is included in the segment travel time.
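As a small illustration of these definitions (our own sketch, with hypothetical names), a raw trajectory can be stored as a time-ordered list of GPS points, and the segment travel times of Definition 8 follow directly from the estimated stop arrival times.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GPSPoint:          # Definition 5: (lat, lng, t)
    lat: float
    lng: float
    t: float             # timestamp in seconds

# Definition 6: a raw GPS trajectory is a time-ordered list of GPS points.
Trajectory = List[GPSPoint]

def segment_travel_times(arrival_times: List[float]) -> List[float]:
    """Definition 8: segment travel times (Delta T_1, ..., Delta T_{n-1}) derived
    from the stop arrival times (T_1, ..., T_n)."""
    return [t2 - t1 for t1, t2 in zip(arrival_times, arrival_times[1:])]

print(segment_travel_times([0.0, 95.0, 210.0, 330.0]))   # [95.0, 115.0, 120.0]
```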

3.2 Problem Statement


Given a bus route $R$, a road map $G$, a repository of historical GPS trajectories, and contextual data including the bus's scheduled departure date and time, origin station $O$, and destination station $D$, we aim to develop a predictive framework for estimating the future bus travel time from $O$ to $D$. This framework leverages historical travel patterns and takes time-of-day and day-of-week influences into account to provide accurate predictions for buses scheduled to depart in the future, aiding in route planning and schedule optimization.

4 Framework of Bus Travel Time Prediction Model


The overall framework of the historical GPS trajectory-based approach is shown in Fig. 1. The framework is divided into three main phases: data extraction and pre-processing; data mapping and bus stop arrival time estimation; and travel time prediction.
First, the raw GPS trajectory data are filtered for the required time window and sorted by date and time in chronological order to identify trips sequentially. Some data pre-processing techniques need to be carried out before diving into trip extraction. Then, the data are grouped by bus route. In the second phase, the processed bus GPS trajectory data are mapped onto the bus road map data, and the mapped positions of buses are determined to estimate the arrival time at every bus stop. Finally, based on the similarity of the extracted segment travel times, time-of-day, and day-of-week, we can predict the travel time between $O$ and $D$ with the information decay function added.

4.1 GPS Trajectory Data Mapping


According to the data characteristics in public transportation, a bus regularly runs on a fixed route. For a given GPS point $tr_i$, we first determine a set of candidate mapped points within a circle of error radius $r$, which is set to 10 meters in this study. Note that each point $tr_i$ usually has several candidate mapped points in the roundabout area, as shown in Fig. 2. To accurately identify the correct mapped link for a point, it is necessary to reduce the number of candidate mapped points. Here, for each candidate mapped point, we first use the spatial probability and the transmission probability together to measure the probability of obtaining a correctly matched result for the given GPS point, then identify the best candidate edges for further processing by Eqs. (2), (3), and (4), respectively.
Spatial probability: Let $H_{tr_i}^{e_j}$ be the shortest distance from the trajectory point $tr_i$ to the road link $e_j$. We define the spatial probability of spatial matching between $tr_i$ and $e_j$, denoted as $G_1(H_{tr_i}^{e_j})$:

$$H_{tr_i}^{e_j} = \min\big(dis_p(tr_i, p_1),\, dis_p(tr_i, p_2),\, dis_p(tr_i, c)\big) \quad (1)$$

$$G_1(H_{tr_i}^{e_j}) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(H_{tr_i}^{e_j})^2}{2\sigma^2}} \quad (2)$$

Fig. 1. The technique framework of bus travel time prediction.

Fig. 2. A GPS point has several candidate-mapped points in the roundabout area.

where .p1 and .p2 are the two endpoints of the road link .ej respectively, and .c is the
vertical projection point from .tri to .ej . If .c is not located in .ej , then .disp(tri , c)
is set to $+\infty$; $\sigma$ is the standard deviation of the location measurement, which is
generally set to 20 meters [14]. However, mapped road links determined only by
the spatial probability may contain errors. Therefore, the transmission
probability is proposed to improve the mapping accuracy.
Transmission probability: Given two candidate points .ci−1 and .ci for two
neighboring GPS sampling points .tri−1 and .tri respectively, the transmission
probability from .ci−1 to .ci is defined as the likelihood that the path from .tri−1
to .tri follows the path from .ci−1 to .ci . We compute transmission probability as:
$$G_2(c_{i-1}, c_i) = \frac{disp(tr_{i-1}, tr_i)}{disp(c_{i-1}, c_i)} \quad (3)$$
By combining the spatial probability and the transmission probability, this study
defines the comprehensive probability as .G, which is calculated by:

.G = G1 × G2 (4)

According to the above calculation and discussion, there is at least one eligible
road link around each trajectory point, so there is at least one comprehensive
probability. Based on the comprehensive probability value, we select the candi-
date road link with the highest probability for the selection of the best-mapped
link. Additionally, we constrain the sequence of mapped positions so that the direction
of travel is consistently forward along the route, eliminating any instances of the
bus appearing to move backward.
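To illustrate how Eqs. (1)-(4) are combined to pick a mapped link, the following minimal Python sketch scores candidate links by the product of spatial and transmission probabilities. The data layout (candidate dictionaries, distance fields) and function names are assumptions made for illustration only.

```python
import math

def spatial_prob(dist, sigma=20.0):
    """Eq. (2): Gaussian likelihood of the point-to-link distance from Eq. (1)."""
    return math.exp(-dist**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

def transmission_prob(d_points, d_candidates):
    """Eq. (3): ratio of straight-line GPS distance to candidate path distance."""
    return d_points / d_candidates if d_candidates > 0 else 0.0

def best_candidate(candidates, d_points):
    """Pick the candidate link with the highest comprehensive probability (Eq. (4)).

    Each candidate is a dict with keys:
      'link'   - candidate road link id,
      'h'      - shortest distance from the GPS point to the link (Eq. (1)),
      'd_cand' - path distance from the previous matched point to this candidate.
    """
    scored = [(spatial_prob(c['h']) * transmission_prob(d_points, c['d_cand']), c['link'])
              for c in candidates]
    return max(scored)[1]

# Hypothetical candidates for one GPS point near a roundabout.
cands = [{'link': 'e1', 'h': 4.0, 'd_cand': 55.0},
         {'link': 'e2', 'h': 9.0, 'd_cand': 48.0}]
print(best_candidate(cands, d_points=50.0))
```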

4.2 Bus Stop Arrival Time Estimation

According to the positional relationship of the bus stop site and adjacent mapped
GPS points, the bus stop arrival time could be estimated. The different positional
relationships are applied to different arrival time estimated methods, as shown
in Fig. 3.
Scenario 1: The mapped GPS point is near the bus stop site. A bus stop serving
many bus routes often exhibits a queuing phenomenon caused by vehicles
waiting to dwell. Moreover, bus station platforms generally have a fixed length.
Therefore, this study defines a bus stop area as the 20-meter range in front of the
bus stop site. If a mapped GPS point falls within this range, its return time
$t_{m-1}$ will be regarded as the bus arrival time $T_s$. The bus dwell time is then
added to the next travel time interval between bus stops.
Scenario 2: The mapped GPS point is outside the bus stop area. During bus
operation, acceleration and deceleration happen frequently. For convenience,
this study assumes that the speed between two consecutive GPS-mapped points
is constant. As a result, if the GPS mapped point is outside the bus stop area,
the bus arrival time is interpolated from the return times of two consecutive
GPS mapped points, as expressed by Eq. 5:

$$T_s = \frac{d_{(m-1,s)} \times (T_m - T_{m-1})}{d_{(m-1,m)}} + T_{m-1} \quad (5)$$

where $T_s$ is the bus arrival time at the $s$-th bus stop, $T_{m-1}$ and $T_m$ are the
return times of the $(m-1)$-th and $m$-th GPS mapped points, and $d_{(m-1,s)}$ and
$d_{(m-1,m)}$ are the distances from the $(m-1)$-th GPS mapped point to bus stop $s$
and to the $m$-th GPS mapped point along the bus route map, respectively.
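A minimal Python sketch of the two scenarios is given below. The function signature, the simplified treatment of the stop area, and the unit conventions are illustrative assumptions rather than the paper's implementation.

```python
def estimate_arrival_time(t_prev, t_next, d_prev_stop, d_prev_next, stop_area=20.0):
    """Estimate the arrival time T_s at a bus stop from two consecutive mapped GPS points.

    t_prev, t_next : return times (seconds) of the (m-1)-th and m-th mapped points
    d_prev_stop    : route distance from the (m-1)-th point to the stop (metres)
    d_prev_next    : route distance between the two mapped points (metres)
    stop_area      : extent of the bus stop area for Scenario 1 (metres)
    """
    if d_prev_stop <= stop_area:
        # Scenario 1: the (m-1)-th mapped point already lies in the stop area.
        return t_prev
    # Scenario 2: assume constant speed between the two mapped points (Eq. (5)).
    return d_prev_stop * (t_next - t_prev) / d_prev_next + t_prev

# A bus passing the stop 60 m after the previous fix, with fixes 30 s apart.
print(estimate_arrival_time(t_prev=0.0, t_next=30.0,
                            d_prev_stop=60.0, d_prev_next=100.0))  # 18.0
```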

4.3 Bus Travel Time Prediction Model

The prediction model adopts a stop-to-stop approach to predict the travel time
between two consecutive bus stops, i.e., Cho Lon Terminal → Bus Stop 1, Bus
Stop 1 → Bus Stop 2, and so on, as seen in Fig. 4. Therefore, the bus travel
time is predicted by summing the predicted travel times of the segments between
the origin and destination that have not yet been traveled:

$$T_{OD} = \sum_{i=O}^{D-1} T_{S_i} \quad (6)$$

where .TOD is the travel time from the origin station (.O) to the destination
station (.D), .TSi is the predicted travel time on segment .Si .
In this study, the prediction model also assumes that bus travel time is cyclical,
i.e., similar for the same day of the week and time of day; for example, travel
times on successive Tuesdays are characterized by similar values for the same
segments. Moreover, the bus travel time

Fig. 3. Two scenarios of bus stop arrival time estimating.

Fig. 4. A stop-to-stop approach to predict the travel time between two consecutive bus
stops.

is changing over time. Thus, the model for travel time prediction also consid-
ers the variation in period. Before clustering similar historical trajectories, the
model first selects trajectories from the same day of the week and then considers
the time interval for each segment. The time interval depends on when the bus
enters the segment. Take segment $S_i$, for example; the time interval $\tau$ to which
$S_i$ belongs satisfies

$$(\tau - 1) \times \Delta T < T_i \le \tau \times \Delta T, \quad 1 \le \tau \le 19 \quad (7)$$


where $\Delta T$ is the length of the interval, which equals 60 minutes in this study.
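The interval assignment of Eq. (7) and the day-of-week selection can be sketched as follows in Python; the record layout and names are illustrative assumptions.

```python
from collections import defaultdict

def interval_index(entry_time_seconds, interval_length=3600):
    """Eq. (7): 1-based index tau of the time interval a segment entry falls in."""
    return int(entry_time_seconds // interval_length) + 1

def group_segment_times(records, interval_length=3600):
    """Group observed segment travel times by (day-of-week, segment, interval).

    `records` is an iterable of (weekday, segment_id, entry_time_seconds, travel_time).
    """
    groups = defaultdict(list)
    for weekday, seg, entry, travel in records:
        tau = interval_index(entry, interval_length)
        groups[(weekday, seg, tau)].append(travel)
    return groups

records = [("Tue", 3, 7.25 * 3600, 95.0), ("Tue", 3, 7.80 * 3600, 110.0)]
print(group_segment_times(records))  # both observations fall into interval tau = 8
```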

Similar Trajectories Clustering. This study leverages segment travel time
patterns observed in similar trajectories to create a travel time prediction model
based on historical GPS trajectory data. To identify similar trajectories by seg-
ment travel time, an ideal approach is to use conventional time series similarity
measures, such as Dynamic Time Warping [15] or similarity functions based on
the Longest Common Subsequence [17]. However, our preliminary experiments
revealed that these methods are highly sensitive to errors or outliers in the data
and tend to focus on the overall similarity of the entire trajectory, overlooking
similarities within individual segment travel time which can vary in length. To
address this, we propose evaluating the similarity between trajectories based on
individual segments, using separate thresholds for each segment. If the difference
in travel time for a segment is less than its predefined threshold, the two tra-
jectories are considered similar. This approach extends the conventional concept
of trajectory similarity to account for both overall trajectory comparison and
individual segment similarity. Moreover, efficiency is a concern when compar-
ing segment travel times, especially with large numbers of historical trajectories
and segments. To improve efficiency and avoid exhaustive similarity comparisons
across all trajectories, we group the segment travel times for each segment into

groups. To partition the segment travel time into groups, we consider the clus-
tering method which partitions the travel times on a segment to multiple groups
with minimized variances. Given the list $S^k$ of all the travel times on a segment
in the time interval $k$, we first sort $S^k$ by value and then recursively
perform binary partition on the sorted $S^k$ into sub-lists. Basically, in each iteration,
we compute the variances of all the data in $S^k$ to find the best split point
following the minimal weighted average variance (WAV), defined below:

$$WAV(i, S^k) = \frac{|S_A^k(i)|}{|S^k|}\, Var(S_A^k(i)) + \frac{|S_B^k(i)|}{|S^k|}\, Var(S_B^k(i)) \quad (8)$$

where $S_A^k(i)$ and $S_B^k(i)$ are the two sub-lists of $S^k$ split at the $i$-th element and
$Var$ denotes the variance. The best split point leads to a maximum decrease

$$\Delta V(i) = Var(S^k) - WAV(i, S^k) \quad (9)$$

The algorithm terminates when $\max_i \{\Delta V(i)\}$ is less than a tunable threshold
$\delta$. As a result, we obtain a set of split points partitioning the whole
list $S^k$ into several groups $H = (h_1, h_2, \ldots, h_m)$ with minimized variances.
With the groups of similar trajectories created, we can efficiently identify similar
trajectories using the segment travel times of the bus journey being predicted.
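The variance-minimizing binary partition of Eqs. (8)-(9) can be sketched in Python as follows. This is a minimal illustration; function names and the termination threshold are assumptions.

```python
from statistics import pvariance

def wav(sorted_vals, i):
    """Eq. (8): weighted average variance of splitting at position i (1 <= i < len)."""
    left, right = sorted_vals[:i], sorted_vals[i:]
    n = len(sorted_vals)
    return len(left) / n * pvariance(left) + len(right) / n * pvariance(right)

def partition(sorted_vals, delta=1.0):
    """Recursively split a sorted list of segment travel times into groups with
    minimized variances; stop when the best decrease Eq. (9) is below delta."""
    if len(sorted_vals) < 2:
        return [sorted_vals]
    total_var = pvariance(sorted_vals)
    best_i = min(range(1, len(sorted_vals)), key=lambda i: wav(sorted_vals, i))
    if total_var - wav(sorted_vals, best_i) < delta:   # max_i DeltaV(i) below threshold
        return [sorted_vals]
    return partition(sorted_vals[:best_i], delta) + partition(sorted_vals[best_i:], delta)

times = sorted([88, 90, 91, 120, 123, 125])
print(partition(times, delta=5.0))   # -> [[88, 90, 91], [120, 123, 125]]
```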
Moreover, our observation shows that the correlation between a segment and
its adjacent ones is significantly stronger than with distant segments. Thus, for
predicting travel time for segment .Si , where .i > 1, we begin by focusing on
the adjacent previous segment, .Si−1 , and initialize .Tp , which contains historical
trajectories that fall within the same travel time range for that segment. Next,
we iteratively refine .Tp by filtering out trajectories that do not fall within the
same time range group for subsequent segments. However, to prevent the sample
size of .Tp from becoming too small, which would make predictions statistically
insignificant, we introduce a parameter called the minimum number of trajec-
tories (MNT). This ensures that filtering halts once the number of remaining
trajectories in .Tp drops below the MNT threshold. Finally, once a sample set
of similar trajectories .Tp is obtained through this segment filtering process, we
predict the travel time by averaging the travel times of the trajectories in .Tp ,
with weights applied according to a polynomial decay function to emphasize the
most relevant data.
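A minimal Python sketch of the segment-filtering step with the MNT safeguard is given below. The backward order in which already-traveled segments are examined, the grouping interface, and all names are assumptions of this sketch, not the authors' code.

```python
def filter_similar_trajectories(histories, groups, observed, target_seg, mnt=5):
    """Iteratively narrow the candidate set T_p of historical trajectories.

    histories  : {traj_id: {segment_id: travel_time}} historical segment travel times
    groups     : {segment_id: callable(travel_time) -> group_label}, from the
                 variance-based partition above
    observed   : {segment_id: travel_time} segments already traveled by the current bus
    target_seg : index of the segment whose travel time is being predicted
    mnt        : minimum number of trajectories kept (MNT)
    """
    candidates = set(histories)
    # Start from the adjacent previous segment S_{i-1} and keep filtering.
    for seg in range(target_seg - 1, 0, -1):
        if seg not in observed:
            break
        same_group = {
            tid for tid in candidates
            if seg in histories[tid]
            and groups[seg](histories[tid][seg]) == groups[seg](observed[seg])
        }
        if len(same_group) < mnt:      # stop before the sample becomes too small
            break
        candidates = same_group
    return candidates
```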

Information Decay Function. While processing temporal data, it is common


to down-weight records that correspond to older data [5]. This approach reflects
the belief that recent data are more relevant for making predictions, while older
data hold less significance and can either be weighted less or ignored altogether.
Various decay functions are commonly employed, which rely on the age of a
record, rather than its exact timestamp, to adjust its weight. If .t represents the
timestamp of a record and .tp is the present time, the age .a of the record is
calculated as .a = tp − t, and the decay function is expressed as .f (a). The most
popular decay functions that have been used for many applications are:

– No decay: $f(a) = 1$, where all records are treated equally regardless of their
timestamp.
– Exponential decay: $f(a) = (1 - \lambda)^a$ for $\lambda > 0$, where $\lambda$ is the decay constant.
– Polynomial decay: $f(a) = (a + 1)^{-a}$; this function is useful when exponential
decay down-weights old records too quickly, since it decays more slowly.

In this study, based on the idea of a decay function, we use polynomial decay
to weight the historical segment travel times. The age $a$ can be calculated from
the domain knowledge of the dataset by taking the difference, in weeks, between
the most significant timestamp (MST) for prediction and the timestamp $t$ of the
segment, $a = \frac{MST - t}{7}$. All segments whose timestamps are greater than the
MST are given a weight of one, under the assumption that records newer than
the MST are the best predictors of the future. Therefore, with a sample set of
similar trajectories $T_p$ obtained through the segment filtering above and the
weights calculated by $f(a) = (a + 1)^{-a}$, the travel time on segment $S_i$ is
predicted as:


$$W = \sum_{i=1}^{N_k} w_i \quad (10) \qquad w_i' = \frac{w_i}{W} \quad (11)$$

$$T_{S_i} = \frac{1}{N_k}\sum_{i=1}^{N_k} (\Delta T_i \times w_i') \quad (12)$$

where $w_i$ and $w_i'$ are the weight and normalized weight of segment $S_i$, with $S_i$
being the corresponding segment of a trajectory in $T_p$, and $N_k$ is the number of
trajectories in $T_p$.
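The decay weighting can be sketched in Python as below. For readability the sketch combines the decay weights of Eqs. (10)-(11) into a plain normalized weighted average; sample layout and names are illustrative assumptions.

```python
def decay_weight(age_weeks):
    """Polynomial decay f(a) = (a + 1)^(-a); records newer than the MST get weight 1."""
    return 1.0 if age_weeks <= 0 else (age_weeks + 1) ** (-age_weeks)

def predict_segment_time(samples, mst_week):
    """Weighted prediction of one segment's travel time.

    samples  : list of (week_of_record, observed_travel_time) from trajectories in T_p
    mst_week : week index of the most significant timestamp (MST) for the prediction
    """
    weights = [decay_weight(mst_week - week) for week, _ in samples]
    total = sum(weights)                               # W in Eq. (10)
    normalized = [w / total for w in weights]          # w'_i in Eq. (11)
    return sum(w * t for w, (_, t) in zip(normalized, samples))

samples = [(1, 130.0), (3, 118.0), (4, 121.0)]         # older to newer records
print(round(predict_segment_time(samples, mst_week=4), 1))  # newest records dominate
```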

5 Experiments

This section evaluates the results of the proposed travel time prediction frame-
work on real-world trajectory datasets. We evaluated the accuracy and reliability
using the mean absolute error (MAE) and the root mean square error (RMSE).

5.1 Datasets

In this study, we used two real-world datasets from buses in Kandy, Sri Lanka,
and Ho Chi Minh City, Vietnam:

– Kandy, Sri Lanka: The raw GPS data were processed to extract the segment
running times and dwell times. The detailed explanation of processing the raw
GPS data was presented in study [19]. The published dataset of Route No.
654 from Kandy to Digana terminals with 14 bus stops was considered.1 From
the data of 4 months (from Oct 2021 to Jan 2022), we used 3 months of data
for training and 1 month for testing.

1
https://www.kaggle.com/datasets/shiveswarranr/bus-travel-time-data.

Table 1. Comparison of the performance of the proposed model with other existing
approaches.

Model                        | Route No. 654 (Kandy) | Route No. 94 (HCMC)  | Route No. 74 (HCMC)
                             | MAE      RMSE         | MAE      RMSE        | MAE      RMSE
LR                           | 48.9621  71.8958      | 33.8662  45.8662     | 34.4447  45.6749
KNN                          | 47.8470  68.9515      | 30.0387  43.0387     | 25.3482  37.2961
ANN                          | 36.2941  50.4059      | 28.2640  39.2640     | 23.3439  32.5127
Pure LSTM                    | 32.0453  51.4695      | 27.4892  39.4892     | 21.9516  36.7442
ConvLSTM Multi-Model [24]    | 26.2175  42.1500      | 24.5211  36.5211     | 21.9487  31.8101
XGBoost Segment-Based [29]   | 33.4200  47.3691      | 27.2154  37.2154     | 22.8273  36.2272
ConvLSTM Segment-Based [25]  | 25.4210  40.9024      | 23.3170  32.3170     | 17.6767  32.2093
Proposed Framework           | 22.8729  35.4069      | 16.5210  29.5212     | 15.4345  28.2672

– Ho Chi Minh City, Vietnam: The raw bus GPS trajectory data used in
this study (a week of November 9 to 16, 2012) was provided by Ho Chi Minh
City Department of Transportation. The GPS trajectory data for Route No.
94 (Cho Lon Terminal to Cu Chi Terminal) with 26 bus stops and Route No.
74 (An Suong Terminal to Cu Chi Terminal) with 27 bus stops were extracted
and mapped onto the road network to estimate bus stop arrival times, then
used to predict travel times. We used 4 days of data (Monday to Thursday) for
training and 1 day (Friday) for testing.

5.2 Experimental Results

Table 1 shows the overall evaluation results of the proposed framework compared
with other existing approaches. The results show that our proposal outperforms
the other existing models.
We also test the proposed framework on all bus stops on the route. The
estimated arrival time and the measured arrival time of one bus for 26 bus stops
of route No. 94 and 27 bus stops of route No. 74 are compared and the results
are presented in Fig. 5. Overall, we can see that the predicted values are close to
the actual values. However, as distance increases, the deviation also increases.
Moreover, we conducted practical experiments to compare the prediction
results for different minimum numbers of trajectories (MNT) in the segment
filtering step. Figure 6 shows the comparison results of route No. 654 and route
No. 94.

Fig. 5. The comparison of the predicted and measured time for 26 bus stops of route
No. 94 and 27 bus stops of route No. 74.

Fig. 6. The comparison of the prediction results (MAE) for different minimum numbers
of trajectories (MNT).

6 Conclusions

In this study, we investigate the limitations of existing methods and the low
accuracy of state-of-the-art models in predicting bus arrival times using GPS
trajectory data only, especially in capturing the high variability of travel times.
To address this, we implemented a historical trajectory-based framework to pro-
cess GPS trajectory data using a mapping method and to predict travel times
with consideration of date and time intervals. This proposed approach enables
more accurate long-term predictions compared to traditional approaches, where
each segment’s travel time is predicted independently. Additionally, we applied
polynomial decay to down-weight records corresponding to older data. We con-
ducted comprehensive experiments using real datasets collected from buses in
Kandy, Sri Lanka, and Ho Chi Minh City, Vietnam, demonstrating that our
model outperforms other popular and recent methods. The data required for the
proposed model is straightforward, consisting of the standard outputs generated
by most ITS commonly used in public transport.

Acknowledgements. This research is funded by Ho Chi Minh City University of


Technology (HCMUT) - VNU-HCM under grant number SVKSTN-2024-KH&KTMT-
06. We acknowledge the support of time and facilities from Ho Chi Minh City University
of Technology (HCMUT), VNU-HCM for this study. We would also like to thank the
members of High Performance Computing Laboratory at HCMUT for their support,
together with the anonymous referees for their valuable comments as well as helpful
suggestions.

References
1. Amita, J., Jain, S., Garg, P.K.: Prediction of bus travel time using ann: a case
study in Delhi. Transport. Res. Proc. 17, 263–272 (2016)
2. As, M., Mine, T., Yamaguchi, T.: Prediction of bus travel time over unstable
intervals between two adjacent bus stops. Int. J. Intell. Transp. Syst. Res. 18,
53–64 (2020)
3. Chen, M., Liu, X., Xia, J., Chien, S.I.: A dynamic bus-arrival time prediction model
based on APC data. Comput.-Aided Civil Infrastruct. Eng. 19(5), 364–376 (2004)
4. Coffey, C., Pozdnoukhov, A., Calabrese, F.: Time of arrival predictability horizons
for public bus routes. In: Proceedings of the 4th ACM SIGSPATIAL International
Workshop on Computational Transportation Science, pp. 1–5 (2011)
5. Cormode, G., Shkapenyuk, V., Srivastava, D., Xu, B.: Forward decay: a practi-
cal time decay model for streaming systems. In: 2009 IEEE 25th International
Conference on Data Engineering, pp. 138–149. IEEE (2009)
6. Dhivya Bharathi, B., Anil Kumar, B., Achar, A., Vanajakshi, L.: Bus travel time
prediction: a log-normal auto-regressive (AR) modelling approach. Transportmet-
rica A: Transp. Scie. 16(3), 807–839 (2020)
7. Gaikwad, N., Varma, S.: Performance analysis of bus arrival time prediction using
machine learning based ensemble technique. In: Proceedings 2019: Conference on
Technologies for Future Cities (CTFC) (2019)
8. Han, Q., Liu, K., Zeng, L., He, G., Ye, L., Li, F.: A bus arrival time prediction
method based on position calibration and LSTM. IEEE Access 8, 42372–42383
(2020)
9. He, P., Jiang, G., Lam, S.K., Sun, Y.: Learning heterogeneous traffic patterns for
travel time prediction of bus journeys. Inf. Sci. 512, 1394–1406 (2020)
10. He, P., Jiang, G., Lam, S.K., Tang, D.: Travel-time prediction of bus journey with
multiple bus trips. IEEE Trans. Intell. Transp. Syst. 20(11), 4192–4205 (2018)
11. Huang, Y., et al.: Bus arrival time prediction and reliability analysis: an experimen-
tal comparison of functional data analysis and bayesian support vector regression.
Appl. Soft Comput. 111, 107663 (2021)
12. Jalaney, J., Ganesh, R.: Highly accurate bus arrival time prediction using k-nearest
neighbor prediction in the internet of things (Iot) environment. J. Green Eng.
10(9), 4752–62 (2020)
13. Liu, H., Van Lint, H., Van Zuylen, H., Zhang, K.: Two distinct ways of using kalman
filters to predict urban arterial travel time. In: 2006 IEEE Intelligent Transporta-
tion Systems Conference, pp. 845–850. IEEE (2006)
14. Lou, Y., Zhang, C., Zheng, Y., Xie, X., Wang, W., Huang, Y.: Map-matching for
low-sampling-rate gps trajectories. In: Proceedings of the 17th ACM SIGSPATIAL
International Conference on Advances in Geographic Information Systems, pp.
352–361 (2009)

15. Müller, M.: Dynamic time warping. Information retrieval for music and motion,
pp. 69–84 (2007)
16. Pang, J., Huang, J., Du, Y., Yu, H., Huang, Q., Yin, B.: Learning to predict bus
arrival time from heterogeneous measurements via recurrent neural network. IEEE
Trans. Intell. Transp. Syst. 20(9), 3283–3293 (2018)
17. Paterson, M., Dančík, V.: Longest common subsequences. In: International Sym-
posium on Mathematical Foundations of Computer Science, pp. 127–142. Springer
(1994)
18. Petersen, N.C., Rodrigues, F., Pereira, F.C.: Multi-output deep learning for bus
arrival time predictions. Transport. Res. Proc. 41, 138–145 (2019)
19. Ratneswaran, S., Thayasivam, U.: Extracting potential travel time information
from raw GPS data and evaluating the performance of public transit - a case study
in Kandy, Sri Lanka. In: 2023 3rd International Conference on Intelligent Commu-
nication and Computational Techniques (ICCT), pp. 1–7. IEEE (2023)
20. Thomas, T., Weijermars, W., Van Berkum, E.: Predictions of urban volumes in
single time series. IEEE Trans. Intell. Transp. Syst. 11(1), 71–80 (2009)
21. Vanajakshi, L., Rilett, L.R.: Support vector machine technique for the short term
prediction of travel time. In: 2007 IEEE Intelligent Vehicles Symposium, pp. 600–
605. IEEE (2007)
22. Wang, L., Zhang, D., Wang, Y., Chen, C., Han, X., M’hamed, A.: Sparse mobile
crowdsensing: challenges and opportunities. IEEE Commun. Mag. 54(7), 161–167
(2016)
23. Wang, Y.: Intellectualization of the urban and rural bus: the arrival time prediction
method. J. Intell. Syst. 30(1), 689–697 (2021)
24. Wu, J., Wu, Q., Shen, J., Cai, C.: Towards attention-based convolutional long
short-term memory for travel time prediction of bus journeys. Sensors 20(12),
3354 (2020)
25. Xie, Z.Y., He, Y.R., Chen, C.C., Li, Q.Q., Wu, C.C.: Multistep prediction of bus
arrival time with the recurrent neural network. Math. Probl. Eng. 2021(1), 6636367
(2021)
26. Yang, M., Liu, Y., You, Z.: The reliability of travel time forecasting. IEEE Trans.
Intell. Transp. Syst. 11(1), 162–171 (2009)
27. Yu, B., Lam, W.H., Tam, M.L.: Bus arrival time prediction at bus stop with
multiple routes. Transport. Res. Part C: Emerg. Technol. 19(6), 1157–1170 (2011)
28. Zhang, J., Yu, X., Tian, C., Zhang, F., Tu, L., Xu, C.: Analyzing passenger density
for public bus: Inference of crowdedness and evaluation of scheduling choices. In:
17th International IEEE Conference on Intelligent Transportation Systems (ITSC),
pp. 2015–2022. IEEE (2014)
29. Zhu, L., Shu, S., Zou, L.: Xgboost-based travel time prediction between bus sta-
tions and analysis of influencing factors. Wirel. Commun. Mob. Comput. 2022(1),
3504704 (2022)
30. Zhu, T., Ma, F., Ma, T., Li, C.: The prediction of bus arrival time using global
positioning system data and dynamic traffic information. In: 2011 4th Joint IFIP
Wireless and Mobile Networking Conference (WMNC 2011), pp. 1–5. IEEE (2011)
Influence Maximization with Fairness Allocation
Constraint

Hue T. Nguyen1,2 , Bac D. Pham3 , Uyen T. Tran3 , Nguyen Long Giang1 ,


and Canh V. Pham4(B)
1
Graduate University of Science and Technology, Vietnam Academy of Science
and Technology (VAST), Hanoi, Vietnam
[email protected]
2
Faculty of Information Technology, Hanoi Architecture University, Hanoi, Vietnam
3
People’s Security Academy, Hanoi, Vietnam
4
ORLab, Phenikaa University, Hanoi 12116, Vietnam
[email protected]

Abstract. Motivated by practical applications from social influence and viral


marketing, this work studies the problem of Influence Maximization with Fair-
ness Allocation Constraint, which aims to find a set of .k users from groups in
a social network with maximal influence spread so that the number of selected
users in each group does not exceed the group budget. We propose an efficient and
scalable approximation algorithm that returns an approximation ratio of $1/2 - \epsilon$
and takes $O((m + \log(\frac{k}{\epsilon}))\frac{n}{\epsilon^2}(k\log n + \log(\frac{1}{\delta})))$ time complexity, where $\epsilon, \delta$ are
constants, $n$ is the number of users and $m$ is the number of links. Besides theoret-
ical results, extensive experiments conducted on real social networks show that
our algorithm provides better solutions than cutting-edge methods.

Keywords: Influence Maximization · Fairness · Allocation · Approximation


Algorithms

1 Introduction
Information Maximization (IM) in Online Social Networks (OSNs) has recently been
a hot research topic due to their wide range of commercial, viral marketing, and social
network analysis. Kempe et al. [13] first mathematically formulated the problem of .IM,
which aimed at finding a set of .k (budget) influential users (called seed set) in an online
social network to begin an influence process that could possibly influence the largest
number of users. Since then, the .IM problem has demonstrated its important role in
various domains, not only in product promotion and social influence [15, 18], but also in
other applications such as social network monitoring [21, 31], epidemic prevention [16,
22] and recommendation systems [30].
In some realistic scenarios, the selection of a seed set can be considered in some
groups of users, where each group usually has a group budget. The group budget con-
straint ensures that each group’s budget is evenly distributed. A typical example is prod-
uct promotion in viral marketing, where a company conducts a marketing strategy to


some interested groups of users, where the users in each group share the same locality. They want
to distribute sample products fairly across the groups so that the total number of influenced
users is maximized. Motivated by the aforementioned phenomena, in this paper, we
study Influence Maximization with Fairness Allocation (.IMFA) Constraint, defined as
follows:
Definition 1. Given a social network represented by a graph $G = (V, E)$, where $V$
is the set of users and $E$ is the set of links, under the information diffusion model $M$.
Given a budget .k, a set of .K groups .C1 , C2 , . . . , CK , .Ci ⊆ V, i ∈ [K] and a fairness
ratio .α ∈ (0, 1). The problem asks to find a seed set .S that maximizes the number of
influenced users such that the size is at most .k and the number of elements in each
group .Ci is at most .αk, i.e., .|S| ≤ k and .|S ∩ Ci | ≤ αk.
We focus on developing an efficient approximation algorithm for the problem. Our
contributions are as follows:
– We first formulate the Influence Maximization with Fairness Allocation Constraint
under the well-known Independent Cascade (.IC) model.
– We design an efficient approximation algorithm, named .SIMF, that returns an
approximation ratio of $1/2 - \epsilon$ with probability at least $1 - \delta$ and takes
$O((m + \log(\frac{k}{\epsilon}))\frac{n}{\epsilon^2}(k\log n + \log(\frac{1}{\delta})))$ time complexity, where $\epsilon, \delta \in (0, 1)$ are constants.
– Finally, we conduct several extensive experiments on real-world social networks.
The results show that our algorithms produce the comparative solution quality but
take less running time than state-of-the-art algorithms.

Additional Related Works. Inspired by the advantages of social media platforms for
product promotion, Kempe et al. [13] first mathematically introduced two informa-
tion diffusion models: Independent Cascade (.IC) and Linear Threshold (.LT), and then
studied Influence Maximization (IM) as a discrete optimization problem. They showed
that the IM is NP-hard and proposed a greedy algorithm that can return .(1 − 1/e)-
approximation ratio. Kempe et al.'s work has inspired later works on further study-
ing IM problems with efficient and scalable algorithms [6, 7, 14, 17, 18, 20, 26], important
variants of IM [4, 11, 18–20, 23], etc.
From an algorithmic perspective, several fast heuristic algorithms have been proposed that
improve the running time in practice for large networks, including converting the original graph to
a directed acyclic graph [6, 7], path-based algorithms [12], and community-based algo-
rithms [32]. However, these algorithms did not provide any guaranteed approximation
quality and may not perform well with large networks (billions of vertices and links).
Borgs et al. [2] made a breakthrough by introducing a sampling method, Reverse Influ-
ence Sampling (RIS), which achieved an optimal ratio of $(1 - 1/e - \epsilon)$ within near-linear
time complexity. Recently, several efforts have kept this ratio while further reducing
the running time to $O((m + n)\log(n))$ by modifying the RIS model [17, 26] (see [26] for an
overview of RIS-based algorithms).
Several variants of IM that capture practical applications have been studied. Nguyen
et al. [19] investigated IM at the Community level (IMC) that asked to find a seed
set that could influence the largest number of groups and proposed several efficient

algorithms with theoretical guarantees. The authors in [33, 34] investigated the problem
of Group Influence Maximization (GIM), which also asked to find a set of users with the
largest influence of groups. A sandwich approximation algorithm with a data-dependent
ratio was proposed. Recently, the authors [23] studied the problem of finding a seed set
with a minimal total cost that can influence all (given) target groups. More recently,
Tsang et al. [29] have proposed an algorithmic approximation framework with constant
approximation ratios .IM under fairness constraints to influence groups. The authors in
[9] tried solving the problem using integer programming formulation. However, this
method only worked with a fixed-size sample graph and may not expand to medium-
sized networks.
In general, these studies focused on the problem of influencing individuals or groups
of users under some constraints, which was different from the context of distribution
fairness over groups. Therefore, the existing algorithms may not be directly applied to
the proposed problem with approximation ratio.

Organization. The rest of the paper is organized as follows. Section 2 provides some
useful notations. Section 3 presents our proposed algorithm for IMFA. The experiments
and results are presented in Sect. 4. Finally, Sect. 5 concludes this work and discusses
future studies.

2 Notations and Problem Definition


This section presents the well-known Independent Cascade (.IC) diffusion model [13]
and, based on it, formally define the Influence Maximization with Fairness Allocation
(IMFA) problem. The frequently used notations are summarized in Table 1.

Table 1. Used notations in this work

Notation                | Description
$G = (V, E)$            | A graph representing a social network; node set $V$ represents the set of users, and edge set $E$ represents the set of links.
$n, m$                  | $n = |V|$, $m = |E|$.
$C$                     | $C = \{C_1, C_2, \ldots, C_K\}$ with $C_i \cap C_j = \emptyset$, the set of groups.
$N_{out}(v), N_{in}(v)$ | The sets of outgoing and incoming neighbor nodes of $v$.
$S$                     | A seed set returned by our algorithm.
$S^*$                   | An optimal solution of the IMFA problem.
$I(\cdot)$              | $I : 2^V \to \mathbb{R}_+$, the influence function; given a set $S \subseteq V$, $I(S)$ is the number of influenced users by $S$ after the influence process.

2.1 Independent Cascade Model


Independent Cascade (.IC) is one of the most popular models for capturing the influ-
ence process in OSNs [2, 5, 18, 26]. In this model, we are given a graph .G = (V, E)

represents an OSN, where .V is the set of nodes/vertices and .E is the set of edges
with .|V | = n and .|E| = m. Denote .Nout (u)(.Nin (u)) be the set of out-neighbors (in-
neighbors) of node .u, respectively. For any initial set of influence uses (defined as a
seed set) .S ⊆ V , the process of influence propagation happens in discrete steps, and
more nodes can be influenced. Under the .IC model, each edge .e = (u, v) ∈ E has a
propagation probability .p(e) ∈ [0, 1] representing the information transmission from
a node .u to another node .v. The diffusion process happens from a seed set from .S as
follows.
– At the beginning, i.e., the first round .t = 1, all nodes in .S are active and other nodes
in .V are inactive.
– At step .t > 1, each node .u activated at step .t − 1 has a single chance to activate each
currently inactive node .v ∈ Nout (u) with a successful probability .p(e).
– If a node is activated, it remains active till the end of the diffusion process. The
propagation process terminates at step .t if no new node is activated in this step.
Kempe et al. [13] showed that the .IC model is equivalent to a live-edge model defined
as follows. From the graph .G = (V, E), a random sample graph .g is generated from .G
by selecting an edge .e ∈ E with probability .p(e) and non-selecting .e with probability
.1 − p(e). We refer to .g as a sample of .G and write .g ∼ G and .Ω as the space of all

sample graphs. The probability of generation a sample graph .g from .G is calculated by:
 
. Pr[g ∼ G] = p(e) (1 − p(e)) (1)
e∈Eg e∈E
/ g

where .Eg is the set of edges in the graph .g. The influence spread from a set of nodes .S
to any node .u is calculated as follows:

$$I(S, u) = \sum_{g \sim G} \Pr[g \sim G] \cdot R_g(S, u) \quad (2)$$

where .Rg (S, u) = 1 if .u is reachable from .S in .g and .Rg (S, u) = 0 otherwise. The
influence spread of .S in network .G (number of influenced nodes) is:

$$I(S) = \sum_{u \in V} I(S, u). \quad (3)$$

2.2 Reverse Influence Sample


We recap the concept of Reverse Influence Sample (RIS) [2, 26] to estimate the influ-
ence function. RIS was first introduced by Borgs et al. [2]. An RIS sample is generated
from $G$ under the IC model according to the following steps:
1) Randomly pick a node $u$ as a source node.
2) Generate a sample graph $g$ from $G$ by the live-edge model.
3) Return $R_g$ as the set of nodes that can reach $u$ in $g$.
Define a random variable $X_g = \min\{|R_g \cap S|, 1\}$. Borgs et al. [2] prove the following
lemma to compute the influence function:

Lemma 1. For any set of nodes $S \subseteq V$, we have $I(S) = n \cdot \mathbb{E}[X_g]$.
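A minimal Python sketch of RR-set generation under the live-edge model and of the estimator in Lemma 1 is given below. The graph representation and all names are illustrative assumptions, not a reference implementation.

```python
import random

def generate_rr_set(nodes, in_neighbors, prob):
    """Generate one reverse-reachable (RR) set under the IC model: pick a random
    source node and collect every node that reaches it through live edges."""
    source = random.choice(nodes)
    rr, frontier = {source}, [source]
    while frontier:
        v = frontier.pop()
        for u in in_neighbors.get(v, []):
            # Edge (u, v) is "live" with probability prob[(u, v)].
            if u not in rr and random.random() < prob[(u, v)]:
                rr.add(u)
                frontier.append(u)
    return rr

def estimate_influence(seed_set, rr_sets, n):
    """Lemma 1: I(S) is approximately n times the fraction of RR sets hit by S."""
    covered = sum(1 for rr in rr_sets if rr & seed_set)
    return n * covered / len(rr_sets)

# Tiny example graph: edges a->c and b->c, each with propagation probability 0.5.
nodes = ["a", "b", "c"]
in_neighbors = {"c": ["a", "b"]}
prob = {("a", "c"): 0.5, ("b", "c"): 0.5}
rr_sets = [generate_rr_set(nodes, in_neighbors, prob) for _ in range(2000)]
print(estimate_influence({"a"}, rr_sets, n=len(nodes)))  # close to I({a}) = 1.5
```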


From [28], the following lemma helps estimate the influence function with high
probability.
Lemma 2 ([28]). Given a set of RR samples $R$ with $T = |R|$ and $\lambda > 0$, we have:

$$\Pr\Big[\sum_{j=1}^{T} X_j(S) - T\mu \ge \lambda\Big] \le e^{-\frac{\lambda^2}{\frac{2}{3}\lambda + 2\mu T}}, \quad (4)$$

$$\Pr\Big[\sum_{j=1}^{T} X_j(S) - T\mu \le -\lambda\Big] \le e^{-\frac{\lambda^2}{2\mu T}}. \quad (5)$$

In Lemma 2, by replacing $\lambda = \epsilon T\mu$ and noting that $\hat{\sigma}(S) = \frac{n}{T}\sum_{j=1}^{T}X_j(S)$ estimates $\sigma(S) = n\mu$, we have:

$$\Pr[\hat{\sigma}(S) \ge (1 + \epsilon)\sigma(S)] \le e^{-\frac{\epsilon^2 \mu T}{2 + \frac{2}{3}\epsilon}}, \quad (6)$$

$$\Pr[\hat{\sigma}(S) \le (1 - \epsilon)\sigma(S)] \le e^{-\frac{\epsilon^2 \mu T}{2}}. \quad (7)$$

Therefore, if the number of samples is large enough, i.e., $T \ge (2 + \frac{2}{3}\epsilon)\frac{1}{\mu}\frac{1}{\epsilon^2}\ln(\frac{1}{\delta})$ for
$\delta \in (0, 1)$, then $\hat{I}_R(S)$ is an $(\epsilon, \delta)$-approximation of $I(S)$, i.e.,

$$\Pr[(1 - \epsilon)I(S) \le \hat{I}_R(S) \le (1 + \epsilon)I(S)] \ge 1 - \delta. \quad (8)$$

2.3 Matroid
We introduce some notations about matroid which is useful for designing our algo-
rithm. Given a ground set .V , and .M ⊆ 2V . We call a system .(V, M) as a matroid if it
satisfies:
1) Downward closed property: For all $S \subseteq T$ such that $T \in \mathcal{M}$, we have $S \in \mathcal{M}$.
2) Augmentation property: If $S, T \in \mathcal{M}$ and $|S| < |T|$, then there exists $e \in T \setminus S$ such
that $S \cup \{e\} \in \mathcal{M}$.
The rank of a matroid .rank(M) is the maximum size of a set .S ∈ M.

2.4 Problem Definition


This part formally defines the Influence Maximization with Fairness Allocation (.IMFA)
problem, which will be studied in this paper.
Definition 2 (IMFA problem). Given a social network .G = (V, E) under .IC model,
and .C = {C1 , · · · , CK } is a collection of .K disjoint groups, .Ci ∩ Cj = ∅. Given a
total budget .k and a positive constant .α < 1. The objective is to find a seed set .S ⊆ V
satisfying .|S ∩ Ci | ≤ αk so that the influence spread .I(S) is maximal, i.e., the problem
asks to find
$$\max \; I(S) \quad (9)$$
$$\text{subject to: } |S| \le k \quad (10)$$
$$|S \cap C_i| \le \alpha k, \quad \forall i \in [K]. \quad (11)$$

If $\alpha = 1$, IMFA becomes the classical Influence Maximization problem [13]. Therefore,
IMFA is an NP-hard problem.

3 Proposed Algorithm

In this section, we introduce an approximation algorithm, named Sampling Approxi-


mation for Influence Maximization with Fairness (.SIMF) for .IMFA problem. Our algo-
rithm consists of two components: (1) Create a set of RR sets to efficiently estimate
the influence functions; (2) Finding the near-optimal solution over RR sets by solving
sub-problem Maximum Cover with Fairness Allocation Constraint (MCFA) problem,
(3) Finally, by combining the solution of the MCFA problem with sampling method
based on martingale theory, we give a final solution with theoretical bounds.

3.1 MCFA: a Subproblem of .IMFA

From the analysis and discussion in Sect. 2, one can use $\hat{I}(S)$ to effectively estimate
$I(S)$ if the number of samples $|R|$ is sufficiently large. For a set of RR sets $R$, we need
to find a set of nodes $S$ that satisfies the fairness allocation constraint and has a high
objective function value. Thus, we investigate the Maximum Cover with Fairness
Allocation Constraint (MCFA) problem, defined as follows:
Definition 3 (MCFA). Given a graph $G = (V, E)$ under the IC model, a set of RR sets
$R = \{R_1, R_2, \ldots, R_T\}$ generated from $G$, and a collection $C = \{C_1, \cdots, C_K\}$ of $K$
disjoint subsets of $V$, $C_i \subseteq V$, $C_i \cap C_j = \emptyset$. Given a total budget $k$ and a positive
constant $\alpha \in (0, 1)$, MCFA asks to find a set $S$ satisfying $|S| \le k$ and $|S \cap C_i| \le \alpha k$
so that $\hat{I}_R(S) = \frac{n}{|R|}\sum_{R_i \in R} \min\{|S \cap R_i|, 1\}$ is maximized.

.ÎR (S) is monotone and submodular [17, 26]. We show that .MCFA is a form of well-

known submodular maximization under a matroid constraint problem [8, 10].


Lemma 3. Assume that .F = {S ∈ 2V : |S ∩ Ci | ≤ αk}, i.e., the set of all subset of .V
satisfying the constraint of MCFA problem. A system .(V, F) is a matroid.

Proof. We show that .(V, F) satisfies two properties of a matroid.


Downward closed property: If for all .S ⊆ T such that .T ∈ F, then .|S∩Ci | ≤ |T ∩Ci | ≤
αk. Thus .S ∈ F.
Augmentation property: For .S, T ∈ M and .|S| < |T |, there exists a group .Ci so that
.|S ∩ Ci | < |T ∩ Ci | ≤ αk. Therefore, there exists an element .e ∈ T \ S so that

.|(S ∪ {e}) ∩ Ci | = |S ∩ Ci | + 1 ≤ |T ∩ Ci | ≤ αk. Hence .S ∪ {e} ∈ F.

One can adapt the Greedy algorithm with a ratio of $1/2$ for MCFA [10]. However,
this algorithm takes $O(|R|k)$ time and may be impractical for medium-sized social
networks. Therefore, in this work we propose a Threshold Greedy algorithm (ThGreedy)
that reduces the time complexity to $O(|R|\log k)$ while returning an approximation ratio of
$1/2 - \epsilon$ for the MCFA problem, where $\epsilon \in (0, 1/2)$ is an arbitrarily small constant.

Our algorithm adapts the idea of the threshold greedy method [1] to select good
elements without violating the matroid constraint in each iteration. Notably, it first
initializes the solution $S$ as empty and finds $M$, the best estimated value $\hat{I}(\{e\})$ of a
single element. The algorithm primarily operates in the main loop (Lines 2-11) with at most
$\log_{1/(1-\epsilon)}(k/\epsilon) + 1$ iterations. In iteration $i$, it considers all elements in $V \setminus S$ and adds
those that do not violate the allocation constraint (Line 4) and whose marginal gain is larger
than or equal to $\theta$, where $\theta = (1 - \epsilon)^i M$ (Line 5). At the end of each iteration of the outer loop,
the threshold $\theta$ is decreased by a factor of $1 - \epsilon$ (Line 7). Finally, the algorithm ter-
minates after at most $\log_{1/(1-\epsilon)}(k/\epsilon) + 1$ iterations and returns the solution $S$. The pseudocode of
ThGreedy is depicted in Algorithm 1.

Algorithm 1: ThGreedy($R, C, \epsilon$)
Input: A set of RR sets $R$, a set of groups $C = \{C_1, C_2, \ldots, C_K\}$, $\alpha \in (0, 1)$, budget $k$, $\epsilon > 0$
Output: A set $S$
1: $S \leftarrow \emptyset$, $M \leftarrow \max_{e \in V} \hat{I}_R(\{e\})$, $\theta \leftarrow M$
2: while $\theta \ge \epsilon M/k$ do
3:   foreach $e \in V$ do
4:     if $|S \cap C(e)| < \alpha k$ and $|S| < k$ then
5:       if $\hat{I}_R(S \cup \{e\}) - \hat{I}_R(S) \ge \theta$ then
6:         $S \leftarrow S \cup \{e\}$
7:   $\theta \leftarrow (1 - \epsilon)\theta$
8: return $S$
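A compact Python sketch of this threshold-greedy procedure is given below. The coverage-based estimate plays the role of $\hat{I}_R$; the data layout (RR sets as Python sets, a node-to-group dictionary) and all names are illustrative assumptions rather than the authors' implementation.

```python
def threshold_greedy(rr_sets, groups, alpha, k, eps, n):
    """Threshold greedy for MCFA (a sketch of Algorithm 1): maximize the RR-set
    coverage estimate subject to |S| <= k and at most alpha*k seeds per group.

    rr_sets : list of RR sets (sets of node ids)
    groups  : {node: group_id}; every node appearing in rr_sets must be present
    """
    def gain(S, e):
        # Marginal increase of the coverage estimate, scaled by n / |R|.
        newly = sum(1 for rr in rr_sets if e in rr and not (rr & S))
        return n * newly / len(rr_sets)

    S, per_group = set(), {}
    V = {v for rr in rr_sets for v in rr}
    M = max(gain(set(), e) for e in V)
    theta = M
    while theta >= eps * M / k:
        for e in V - S:
            if len(S) < k and per_group.get(groups[e], 0) < alpha * k:
                if gain(S, e) >= theta:
                    S.add(e)
                    per_group[groups[e]] = per_group.get(groups[e], 0) + 1
        theta *= (1 - eps)
    return S
```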

Theorem 1. Algorithm 1 runs in $O(\frac{|R|}{\epsilon}\log(\frac{k}{\epsilon}))$ time complexity and returns an
approximation ratio of $1/2 - \epsilon$.

Proof. The number of outer iterations of Alg. 1 (Lines 2-12) is upper bounded by:

$$\log_{\frac{1}{1-\epsilon}}\Big(\frac{k}{\epsilon}\Big) + 1 \le \log_{1+\epsilon}\Big(\frac{k}{\epsilon}\Big) + 1 = \frac{\log(\frac{k}{\epsilon})}{\log(1+\epsilon)} + 1 \le \frac{\log(\frac{k}{\epsilon})}{\epsilon/2} + 1. \quad (12)$$

Each iteration takes at most $O(|R|)$ time, so the time complexity of the algorithm is
$\big(\frac{\log(k/\epsilon)}{\epsilon/2} + 1\big)O(|R|) + O(n) = O(\frac{|R|}{\epsilon}\log(\frac{k}{\epsilon}))$. Now, assume that $S = \{s_1, s_2, \ldots, s_k\}$ is $S$ after
ending the main loop (if $|S| < k$, we add some empty elements into $S$). Denote
.si as the .i-th element added into .S and .Si = {s1 , s2 , . . . , si }, .θ(si ) as .θ at the iteration
.si be added to .S. Let .O = {o1 , o2 , . . . , ok } such that .{s1 , s2 , . . . , si−1 , oi } ∈ M, ∀i ≤
k − 1 is feasible for all .i, which exists by the augmentation property of matroids. We
first prove that

$$f(s_i|S_{i-1}) \ge (1 - \epsilon)f(o_i|S_{i-1}). \quad (13)$$

If $s_i$ is added into $S$ at the first iteration, we have $f(s_i|S_{i-1}) = f(e_{max}) \ge (1 -
\epsilon)f(o_i|S_{i-1})$. Now suppose $s_i$ is added into $S$ at an iteration $t \ge 2$. If $o_i \in S_i$, (13) holds. If

$o_i \notin S_i$, then $o_i$ was not added into $S$ at the previous iteration, so we have:

$$f(s_i|S_{i-1}) \ge \theta(s_i) = (1 - \epsilon)\frac{\theta(s_i)}{1 - \epsilon} \ge (1 - \epsilon)f(o_i|S_{i-1}). \quad (14)$$
Therefore the inequality 13 holds. By the submodularity of .f and the selection rule of
the Algorithm 1, we have:
$$f(S) = \sum_{i=1}^{k} f(s_i|\{s_1, s_2, \ldots, s_{i-1}\}) = \sum_{i=1}^{k} f(s_i|S_{i-1}) \quad (15)$$
$$\ge \sum_{i=1}^{k} (1 - \epsilon)f(o_i|S_{i-1}) \qquad \text{(due to (13))} \quad (16)$$
$$\ge \sum_{i=1}^{k} (1 - \epsilon)f(o_i|S \cup \{o_1, \ldots, o_{i-1}\}) \qquad \text{(due to the submodularity of } f\text{)} \quad (17)$$
$$\ge (1 - \epsilon)f(O|S) \ge (1 - \epsilon)\big(f(O) - f(S)\big) \quad (18)$$
which implies that
$$f(S) \ge \frac{(1 - \epsilon)\,opt}{2 - \epsilon} \ge \frac{1}{2}(1 - \epsilon)\,opt > \Big(\frac{1}{2} - \epsilon\Big)opt,$$
which completes the proof.

3.2 SIMF: A Sampling Algorithm for .IMFA


We now present the Sampling Approximation Algorithm for IMFA (SIMF), a $(1/2 - \epsilon)$-
approximation algorithm for IMFA. This algorithm combines the solution of Algo-
rithm 1 with the sampling algorithm framework in [17, 26] for the IM problem.
The .SIMF algorithm receives an instance of .IMFA problem: Graph .G = (V, E),
groups .C = {C1 , . . . , CK }, budget .k, fairness ratio .α, and accuracy parameters ., δ ∈

Algorithm 2: SIMF Algorithm
Input: An instance of the IMFA problem $(G, V, E, C, \alpha, k)$, parameters $\epsilon, \delta \in (0, 1)$
Output: A set $S$
1: $N_1 \leftarrow N_{max}\,\epsilon^2 k/n$, $i_{max} \leftarrow \lceil \log_2(N_{max}/N_1) \rceil$, $\delta_1 \leftarrow \frac{\delta}{3 i_{max}}$
2: Generate two sets $R_1$ and $R_2$ of random RR sets, with $|R_1| = |R_2| = N_1$
3: for $i = 1$ to $i_{max}$ do
4:   $S \leftarrow$ ThGreedy($R_1, C, \epsilon$)
5:   if $\frac{I^l_{R_1}(S)}{I^u_{R_2}(O)} \ge 1/2 - \epsilon$ or $i = i_{max}$ then
6:     return $S$
7:   Double the sizes of $R_1$ and $R_2$ with new random RR sets
8: return $S$
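The doubling scheme can be sketched in Python as follows. The helpers `sample_rr` (drawing new RR sets) and `solve_mcfa` (the ThGreedy routine) are caller-supplied placeholders, and `lower_bound`/`upper_bound` follow the bound formulas of Lemmas 4 and 5 as reconstructed below; everything here is an illustrative assumption, not the authors' implementation.

```python
import math

def coverage(S, rr_sets):
    """Lambda_R(S): number of RR sets covered by S."""
    return sum(1 for rr in rr_sets if rr & S)

def lower_bound(S, rr_sets, n, delta1):
    """Assumed form of I^l_R(S) (Lemma 4, following [26])."""
    a, lam = math.log(1 / delta1), coverage(S, rr_sets)
    return ((math.sqrt(lam + 2 * a / 9) - math.sqrt(a / 2)) ** 2 - a / 18) * n / len(rr_sets)

def upper_bound(S, rr_sets, n, delta1, eps):
    """Assumed form of I^u_R(O) (Lemma 5), computed from the candidate S."""
    a, lam = math.log(1 / delta1), coverage(S, rr_sets)
    return (math.sqrt(lam / (0.5 - eps) + a / 2) + math.sqrt(a / 2)) ** 2 * n / len(rr_sets)

def simf(sample_rr, solve_mcfa, n, n1, i_max, eps, delta1):
    """Doubling loop of Algorithm 2; sample_rr(count) returns a list of RR sets
    and solve_mcfa(rr_sets) returns a feasible seed set. Assumes i_max >= 1."""
    R1, R2 = sample_rr(n1), sample_rr(n1)
    for i in range(1, i_max + 1):
        S = solve_mcfa(R1)
        ratio = lower_bound(S, R1, n, delta1) / upper_bound(S, R2, n, delta1, eps)
        if ratio >= 0.5 - eps or i == i_max:
            return S
        R1 += sample_rr(len(R1))   # double |R1| and |R2| with fresh samples
        R2 += sample_rr(len(R2))
```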

(0, 1/2) as inputs. The algorithm initializes two sets of RR sets $R_1$ and $R_2$ (Line 2).
It then works in at most $i_{max} = \lceil\log_2(N_{max}/N_1)\rceil$ iterations, where $N_{max}$ is
provided in Lemma 6. At each iteration $i$, it finds a candidate solution $S$ over the set $R_1$
by calling Algorithm 1 and then checks the solution-quality condition in Line 5. If the
condition is true, the algorithm immediately returns the solution $S$. If not, it doubles the
sizes of $R_1$ and $R_2$ simultaneously (Line 7) and moves to the next iteration. The
pseudocode of SIMF is depicted in Algorithm 2.
For theoretical analysis, we first recap the Lemma 4.2 in [26] to give a lower bound
of .I(S) for any set .S:

Lemma 4 (Lemma 4.2 in [26]). For any $\delta \in (0, 1)$, a set of RR sets $R$ and a
set $S$, we have $\Pr[I(S) \ge I^l_R(S)] \ge 1 - \delta$, where $a = \ln(1/\delta)$, $\Lambda_R(S)$ denotes the
number of RR sets in $R$ covered by $S$, and
$$I^l_R(S) = \left(\Big(\sqrt{\Lambda_R(S) + \tfrac{2a}{9}} - \sqrt{\tfrac{a}{2}}\Big)^2 - \tfrac{a}{18}\right)\cdot\frac{n}{|R|}.$$

Lemma 5. For any $\delta \in (0, 1)$ and a set of RR sets $R$, assume that $S$ is returned by
ThGreedy with $R$ in Line 4 of Algorithm 2. We have $\Pr[I(O) \le I^u_R(O)] \ge 1 - \delta$,
where $a = \ln(1/\delta)$ and
$$I^u_R(O) = \left(\sqrt{\frac{\Lambda_R(S)}{1/2 - \epsilon} + \frac{a}{2}} + \sqrt{\frac{a}{2}}\right)^2\cdot\frac{n}{|R|}.$$

Proof. By Lemma 2, we have

$$\Pr\left[I(O) > \Big(\sqrt{\tfrac{\Lambda_R(S)}{1/2-\epsilon} + \tfrac{a}{2}} + \sqrt{\tfrac{a}{2}}\Big)^2\frac{n}{|R|}\right] \quad (19)$$
$$\le \Pr\left[I(O) > \Big(\sqrt{\Lambda_R(O) + \tfrac{a}{2}} + \sqrt{\tfrac{a}{2}}\Big)^2\frac{n}{|R|}\right] \quad (20)$$
$$\le \Pr\left[\sqrt{\tfrac{|R|I(O)}{n}} - \sqrt{\tfrac{a}{2}} > \sqrt{\Lambda_R(O) + \tfrac{a}{2}}\right] \quad (21)$$
$$= \Pr\left[\Lambda_R(O) - \tfrac{|R|I(O)}{n} < -\sqrt{\tfrac{2a|R|I(O)}{n}}\right] \le e^{-\frac{2a|R|I(O)/n}{2|R|I(O)/n}} = \delta \quad (22)$$

where the first inequality holds since ThGreedy guarantees $\Lambda_R(S) \ge (1/2 - \epsilon)\Lambda_R(O)$,
which completes the proof.

By adapting the similar reasoning with Lemma 6.1 in [26], we establish the number of
required RR sets to obtain the performance guarantee as follows:
Lemma 6. For any $\epsilon, \delta \in (0, 1)$ and a set of RR sets $R$, assume that $S$ is returned by
ThGreedy with $R$ in Line 4 of Algorithm 2. If

$$|R| \ge N_{max} = \frac{2n\left((1/2 - \epsilon)\sqrt{\ln(\frac{2}{\delta})} + \sqrt{(1/2 - \epsilon)\big(\ln\binom{n}{k} + \ln(\frac{2}{\delta})\big)}\right)^2}{\epsilon^2 k} \quad (23)$$

then $S$ is a $(1/2 - \epsilon)$-approximation solution with probability at least $1 - \delta$.
Finally, we give the performance of .SIMF algorithm in Theorem 2.
Theorem 2. Algorithm 2 returns a solution $S$ satisfying $I(S) \ge (1/2 - \epsilon)I(O)$ with
probability at least $1 - \delta$.

Proof. If the algorithm terminates at the iteration $i = i_{max}$, the number of RR sets is
$N_{max}$. By Lemma 6, the algorithm returns the approximation ratio of $1/2 - \epsilon$
with probability at least $1 - \delta_1$.
If the algorithm terminates at some iteration $i < i_{max}$, it meets the condition in
Line 5. By Lemma 4 and Lemma 5, with probability at least $1 - \delta_1$ each, we have

$$I(S) \ge I^l_{R_1}(S), \qquad I(O) \le I^u_{R_2}(O). \quad (24)$$

Therefore, with probability at least $1 - 2\delta_1$, we have

$$\frac{I(S)}{I(O)} \ge \frac{I^l_{R_1}(S)}{I^u_{R_2}(O)} \ge \frac{1}{2} - \epsilon. \quad (25)$$
By the union bound, the probability that the algorithm fails is at most $i_{max}\delta_1 + 2i_{max}\delta_1 =
\delta/3 + 2\delta/3 = \delta$. The computation time of the algorithm has two components: the time
to generate RR sets and the time to find the solution. The algorithm needs at most $N_{max}$
RR sets. By previous work [27], the time to generate an RR sample is at most $O(\frac{m}{n}I(\{v\}))$,
where $v$ is a node selected randomly from those in $G$ with probabilities proportional to
their in-degrees. The time to generate $N_{max}$ RR sets is bounded by

$$O\Big(N_{max}\frac{m}{n}I(\{v\})\Big) = O\Big(\frac{n}{\epsilon^2}\big(k\log n + \log(\tfrac{1}{\delta})\big)m\Big) \quad (26)$$

On the other hand, by Theorem 1 the time complexity to find the solution is at most

$$O\Big(\frac{N_{max}}{\epsilon}\log(\tfrac{k}{\epsilon})\Big) = O\Big(\frac{n}{\epsilon^2}\big(k\log n + \log(\tfrac{1}{\delta})\big)\log(\tfrac{k}{\epsilon})\Big) \quad (27)$$

Therefore, the time complexity of the algorithm is at most

$$O\Big(\big(m + \log(\tfrac{k}{\epsilon})\big)\frac{n}{\epsilon^2}\big(k\log n + \log(\tfrac{1}{\delta})\big)\Big) \quad (28)$$

which completes the proof.

4 Experiment Evaluation
4.1 Settings
In this section, we conduct experiments showing the performance of our proposed .SIMF
algorithm as compared with the OPIM, the state-of-the-art algorithm for Influence Max-
imization (IM) Problem [26] on two major metrics: the solution quality (the influence
spread) and running time. OPIM is one of the best solutions for the IM problem, which
finds a seed set $S$ with $|S| \le k$ so that $I(S)$ is maximized. Since the solution of the
IM problem may not be a feasible solution to the .IMFA problem, we adapt OPIM to the
.IMFA problem by following these steps.

– For each group $C_i$, we adapt OPIM to find a solution $S_i \subseteq C_i$ with $|S_i| \le \alpha k$.
– After finding the solutions $S_i$, we check the overall solution $S = \bigcup_{i=1}^{K} S_i$. If there exists
a group $i$ that violates the group budget constraint, i.e., $|S \cap C_i| > \alpha k$, we remove
some nodes from $S$ so that $|S \cap C_i| \le \alpha k$.

Table 2. Datasets

Dataset          | # Nodes | # Edges   | Type       | Source
Gemsec-Facebook  | 50,515  | 819,306   | Undirected | [25]
Google-Plus      | 211,200 | 1,506,896 | Directed   | [24]

Dataset. We use public OSN datasets shown in Table 2.

Parameters Setting. All experiments are under the .IC model with edge probabilities set
to .p(u, v) = 1/|Nin (v)|. This weight setting is adopted from prior works [13, 17, 26,
27]. We vary .k ∈ {100, 200, 300, 400, 500} for each dataset, and set .K = 10 (the
number of groups), and $\alpha = 0.1$. We also set the parameters $\epsilon = 0.1$ and $\delta = 0.1$ according
to previous works [17, 26].

Fig. 1. The influence spread of algorithms: (a) Facebook, (b) Google Plus.

Fig. 2. The running time of algorithms: (a) Facebook, (b) Google Plus.

4.2 Results
Figures 1 and 2 display the performance of the compared algorithms on the two datasets for
the IMFA problem. Figure 1 shows the influence spread of the algorithms. It can be observed
that the influence spreads of SIMF and OPIM are almost identical. Although OPIM gives a better
influence spread on Ego-Facebook, the gap is insignificant (below 5%). Our algorithm gives better influence

spread for Google-Plus network. This result may be because our algorithm does not
select enough .k nodes in the seed set.
Figure 2 shows that our .SIMF significantly outperforms OPIM in terms of time
taken. Specifically, SIMF runs 1.12 to 4.61 times faster than OPIM. The above
results show that our algorithm provides comparable solution quality while requiring
the lowest running time.

5 Conclusions
In this paper, we investigated a novel .IMFA problem that asks to find a seed set in
a social network that maximizes influence spread subject to group fairness allocation
constraint. We propose a $(1/2 - \epsilon)$-approximation algorithm with near-linear time complex-
ity for the problem. We further investigate the practical performance of our algorithm
compared to a state-of-the-art Influence Maximization algorithm. The results show our
superiority in terms of running time. In the future, we will improve the quality of our
algorithm’s solutions in theory and practice.

References
1. Badanidiyuru, A., Vondrák, J.: Fast algorithms for maximizing submodular functions. In:
Proceedings of the 2014 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA),
pp. 1497–1514 (2014)
2. Borgs, C., Brautbar, M., Chayes, J.T., Lucier, B.: Maximizing social influence in nearly opti-
mal time. In: Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete
Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014, pp. 946–957 (2014)
3. Borodin, A., Filmus, Y., Oren, J.: Threshold models for competitive influence in social net-
works. In: Saberi, A. (ed.) WINE 2010. LNCS, vol. 6484, pp. 539–550. Springer, Heidelberg
(2010). https://doi.org/10.1007/978-3-642-17572-5_48
4. Chen, N.: On the approximability of influence in social networks. SIAM J. Discret. Math.
23(3), 1400–1415 (2009)
5. Chen, W., Lakshmanan, L., Castillo, C.: Information and Influence Propagation in Social
Networks. Morgan & Claypool Publishers, Synthesis Lectures on Data Management (2013)
6. Chen, W., Wang, C., Wang, Y.: Scalable influence maximization for prevalent viral market-
ing in large-scale social networks. In: Proceedings of the 16th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pp. 1029–1038. KDD ’10, Associa-
tion for Computing Machinery, New York, NY, USA (2010)
7. Chen, W., Yuan, Y., Zhang, L.: Scalable influence maximization in social networks under the
linear threshold model. In: ICDM 2010, Proceedings of the 10th IEEE International Confer-
ence on Data Mining, Sydney, Australia, 14-17 December 2010, pp. 88–97 (2010)
8. Ene, A., Nguyen, H.L.: Streaming algorithm for monotone k-submodular maximization with
cardinality constraints. In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G.,
Sabato, S. (eds.) International Conference on Machine Learning, ICML 2022, 17-23 July
2022, Baltimore, Maryland, USA. Proceedings of Machine Learning Research, vol. 162, pp.
5944–5967. PMLR (2022)
9. Farnadi, G., Babaki, B., Gendreau, M.: A unifying framework for fairness-aware influence
maximization. In: Companion Proceedings of the Web Conference, Taipei, Taiwan, April
20-24, 2020, pp. 714–722 (2020)

10. Fisher, M.L., Nemhauser, G.L., Wolsey, L.A.: An analysis of approximations for maximizing
submodular set functions—ii. In: Polyhedral Combinatorics: Dedicated to the memory of
D.R. Fulkerson, pp. 73–87. Springer Berlin Heidelberg (1978)
11. Goyal, A., Bonchi, F., Lakshmanan, L., Venkatasubramanian, S.: On minimizing budget and
time in influence propagation over social networks. Soc. Netw. Anal. Min. 3(2), 179–192
(2013)
12. Goyal, A., Lu, W., Lakshmanan, L.V.S.: SIMPATH: an efficient algorithm for influence max-
imization under the linear threshold model. In: Cook, D.J., Pei, J., Wang, W., Zaïane, O.R.,
Wu, X. (eds.) Proceedings of the 11th IEEE International Conference on Data Mining, ICDM
2011, Vancouver, BC, Canada, December 11-14, 2011, pp. 211–220. IEEE Computer Soci-
ety (2011)
13. Kempe, D., Kleinberg, J.M., Tardos, É.: Maximizing the spread of influence through a social
network. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, Washington, DC, USA, August 24 - 27, 2003, pp. 137–146
(2003)
14. Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J.M., Glance, N.S.: Cost-
effective outbreak detection in networks. In: Proceedings of the 13th ACM SIGKDD Inter-
national Conference on Knowledge Discovery and Data Mining, San Jose, California, USA,
August 12-15, 2007, pp. 420–429 (2007)
15. Li, Y., Zhang, D., Tan, K.: Real-time targeted influence maximization for online advertise-
ments. Proc. VLDB Endow. 8(10), 1070–1081 (2015)
16. Nguyen, H.T., Cano, A., Tam, V., Dinh, T.N.: Blocking self-avoiding walks stops cyber-
epidemics: a scalable gpu-based approach. IEEE Trans. Knowl. Data Eng. 32(7), 1263–1275
(2020)
17. Nguyen, H.T., Thai, M.T., Dinh, T.N.: Stop-and-stare: Optimal sampling algorithms for viral
marketing in billion-scale networks. In: Proceedings of the 2016 International Conference
on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 -
July 01, 2016, pp. 695–710 (2016)
18. Nguyen, H.T., Thai, M.T., Dinh, T.N.: A billion-scale approximation algorithm for maximiz-
ing benefit in viral marketing. IEEE/ACM Trans. Network. 25(4), 2419–2429 (2017)
19. Nguyen, L.N., Zhou, K., Thai, M.T.: Influence maximization at community level: A new
challenge with non-submodularity. In: Proceedings of the 39th IEEE International Confer-
ence on Distributed Computing Systems, ICDCS 2019, Dallas, TX, USA, July 7-10, 2019,
pp. 327–337 (2019)
20. Pham, C.V., Duong, H.V., Thai, M.T.: Importance sample-based approximation algorithm for
cost-aware targeted viral marketing. In: Tagarelli, A., Tong, H. (eds.) CSoNet 2019. LNCS,
vol. 11917, pp. 120–132. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-34980-
6_14
21. Pham, C.V., Pham, D.V., Bui, B.Q., Nguyen, A.V.: Minimum budget for misinformation
detection in online social networks with provable guarantees. Optimiz. Lett. 16(2), 515–544
(2022)
22. Pham, C.V., Phu, Q.V., Hoang, H.X., Pei, J., Thai, M.T.: Minimum budget for misinformation
blocking in online social networks. J. Comb. Optim. 38(4), 1101–1127 (2019). https://doi.
org/10.1007/s10878-019-00439-5
23. Pham, P., Pham, C.V., Duong, H.V., Snásel, V., Nguyen, T.T.: Minimizing cost for influencing
target groups in social network: a model and algorithmic approach. Comput. Commun. 212,
182–197 (2023)
24. Rossi, R.A., Ahmed, N.K.: The network data repository with interactive graph analytics
and visualization. In: Bonet, B., Koenig, S. (eds.) Proceedings of the Twenty-Ninth AAAI
Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, pp. 4292–
4293. AAAI Press (2015)

25. Rozemberczki, B., Davies, R., Sarkar, R., Sutton, C.: Gemsec: Graph embedding with self
clustering. In: Proceedings of the 2019 IEEE/ACM International Conference on Advances
in Social Networks Analysis and Mining 2019, pp. 65–72. ACM (2019)
26. Tang, J., Tang, X., Xiao, X., Yuan, J.: Online processing algorithms for influence maxi-
mization. In: Proceedings of the 2018 International Conference on Management of Data,
pp. 991–1005. SIGMOD ’18, Association for Computing Machinery, New York, NY, USA
(2018)
27. Tang, Y., Shi, Y., Xiao, X.: Influence maximization in near-linear time: a martingale app-
roach. In: Proceedings of the 2015 ACM SIGMOD International Conference on Manage-
ment of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pp. 1539–1554 (2015)
28. Tang, Y., Xiao, X., Shi, Y.: Influence maximization: near-optimal time complexity meets
practical efficiency. In: Proceedings of the International Conference on Management of Data,
SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pp. 75–86 (2014)
29. Tsang, A., Wilder, B., Rice, E., Tambe, M., Zick, Y.: Group-fairness in influence maximiza-
tion. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intel-
ligence, IJCAI 2019, Macao, China, August 10-16, 2019, pp. 5997–6005 (2019)
30. Ye, M., Liu, X., Lee, W.: Exploring social influence for recommendation: a generative model
approach. In: Proceedings of the 35th International ACM SIGIR conference on research and
development in Information Retrieval, SIGIR ’12, Portland, OR, USA, August 12-16, 2012,
pp. 671–680 (2012)
31. Zhang, H., Kuhnle, A., Zhang, H., Thai, M.T.: Detecting misinformation in online social
networks before it is too late. In: Proceedings of the IEEE/ACM International Conference on
Advances in Social Networks Analysis and Mining, ASONAM, San Francisco, CA, USA,
August 18-21, 2016, pp. 541–548 (2016)
32. Zhang, X., Zhu, J., Wang, Q., Zhao, H.: Identifying influential nodes in complex networks
with community structure. Knowl.-Based Syst. 42, 74–84 (2013)
33. Zhu, J., Ghosh, S., Wu, W.: Group influence maximization problem in social networks. IEEE
Trans. Comput. Social Syst. 6(6), 1156–1164 (2019)
34. Zhu, J., Ghosh, S., Wu, W., Gao, C.: Profit maximization under group influence model in
social networks. In: Tagarelli, A., Tong, H. (eds.) CSoNet 2019. LNCS, vol. 11917, pp. 108–
119. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-34980-6_13
Exemplar-Embed Complex Matrix
Factorization with Elastic-Net Penalty:
An Advanced Approach for Data Representation

Manh Quan Bui1(B) , Viet Hang Duong2(B) , and Jia-Ching Wang3


1 HoChiMinh City University of Technology and Education, Ho Chi Minh, Vietnam
[email protected]
2 University of Information Technology, VNU-HCM, Ho Chi Minh, Vietnam
[email protected]
3 National Central University, Zhongly, Taiwan

Abstract. This paper presents an advanced method for complex matrix factoriza-
tion, termed exemplar-embed complex matrix factorization with elastic net penalty
(ENEE-CMF). The proposed ENEE-CMF integrates both L1 and L2 regulariza-
tions on the encoding matrix to enhance the sparsity and effectiveness of the pro-
jection matrix. Utilizing Wirtinger’s calculus for differentiating real-valued com-
plex functions, ENEE-CMF efficiently addresses complex optimization challenges
through gradient descent, enabling more precise adjustments during factorization.
Experimental evaluations on facial expression recognition tasks demonstrate that
ENEE-CMF significantly outperforms traditional non-negative matrix factoriza-
tion (NMF) and similar complex matrix factorization (CMF) models, achieving
superior recognition accuracy. These findings highlight the benefits of incorpo-
rating elastic net regularization into complex matrix factorization for handling
challenging recognition tasks.

Keywords: Complex matrix factorization · Feature extraction · Data representation

1 Introduction
Data representation is crucial in face imaging tasks such as facial expression recogni-
tion (FER), as the quality and effectiveness of the representation significantly influence
the system’s ability to accurately interpret facial features. The performance of an FER
system can be affected by various factors, including age, ethnicity, gender, facial hair,
makeup, gestures, occlusions, and lighting conditions [1]. An effective representation
must preserve key characteristics of facial expressions to facilitate accurate recognition
and classification of emotions. Developing a robust FER system remains challenging,
with feature extraction playing a pivotal role. Recent advancements have focused on
subspace projection techniques for appearance-based features, utilizing matrix factor-
ization in both real and complex domains to create a new feature matrix that effectively
maps data into a lower-dimensional subspace.

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025
W. Buntine et al. (Eds.): SOICT 2024, CCIS 2350, pp. 415–426, 2025.
https://doi.org/10.1007/978-981-96-4282-3_33

In the real domain, techniques such as Principal Component Analysis (PCA) [2], Lin-
ear Discriminant Analysis (LDA) [3, 10], and Nonnegative Matrix Factorization (NMF)
[4, 5] have been widely utilized to represent facial images as linear combinations of low-
rank basis images. Lee and Seung [4, 5] found that NMF exhibits superior performance in
parts-based representation. To further refine this method, Hoyer [6] introduced a sparsity
function that enhances decompositions by incorporating sparsity into NMF. Addition-
ally, Yuan and Oja [7] extended the technique by developing Projective NMF (PNMF),
which is specifically designed to capture localized features. Over the years, numerous
extensions of NMF have been developed to advance the field of FER. For example, Niki-
tidis et al. [8] introduced NMF variants that incorporate discriminant criteria, such as
Clustering-Based Discriminant Analysis (CDA) [9] and Linear Discriminant Analysis
(LDA) [10]. Lee and Chellappa [11] proposed the integration of sparsity constraints to
derive localized dictionaries from dense motion flow image sequences. These advance-
ments emphasize the critical role of regularization in most NMF frameworks to optimize
FER performance. Nevertheless, traditional NMF methods are inherently limited by their
strict requirement for nonnegative entries in the data matrix, which significantly restricts
their applicability. To circumvent this limitation, Semi-NMF and Convex NMF (Con-
NMF) algorithms have been introduced [12]. Notably, the Con-NMF algorithm ensures
that basis vectors are convex or linear combinations of data points with mixed signs.
Motivated by the work of Liwicki et al. [13], which demonstrated that the squared
Frobenius norm in the complex domain is equivalent to a robust dissimilarity measure in
the real domain, Duong et al. [14, 15] transformed real data into the complex domain for
complex matrix factorization and proposed unsupervised learning algorithms to enhance
FER models. By utilizing Wirtinger’s calculus and the gradient descent method [18, 19],
these algorithms effectively tackle complex optimization challenges, offering significantly enhanced
recognition accuracy compared to traditional matrix factorization techniques.
In this work, we present a novel algorithm for complex matrix factorization incor-
porating an elastic net penalty, known as ENEE-CMF. This method advances image
representation techniques in the complex domain. The key contributions of this study
are as follows:
• The introduction of the ENEE-CMF method for image analysis within the complex
domain.
• Derivation of updating rules for ENEE-CMF using gradient descent techniques
specifically designed for the complex domain.
• A thorough experimental evaluation of facial expression recognition, revealing that
the proposed ENEE-CMF method, enhanced with an elastic net penalty, outperforms
both standard and extended NMF and CMF techniques.
The structure of this paper is organized as follows. Section 2 provides an overview of key
topics, including NMF and CMF techniques relevant to our model. Section 3 describes
the elastic net-constrained ENEE-CMF framework, with a focus on applying gradient
descent to solve the constrained optimization problem in the complex domain. Section 4
presents a comparison of experimental results obtained from ENEE-CMF with those
from various NMF and EE-CMF methods. Section 5 demonstrates the generalizability
of the proposed methods in comparison with standard NMF and EE-CMF approaches.

2 Preliminaries
2.1 Nonnegative Matrix Factorization
Consider an N × M input data matrix S = (s_1, s_2, ..., s_M), where M denotes the number of facial images and each column s_m represents an image of size p by q (N = p × q). The goal of the NMF problem is to identify matrices W ∈ R_+^{N×K} and V ∈ R_+^{K×M} that minimize the following objective function:

min_{W≥0, V≥0} f_NMF(W, V) = (1/2) ‖S − WV‖_F^2    (1)
Here, the matrix W consists of K basis vectors, which are combined linearly using the coefficients in V to approximate the data matrix. Lee and Seung [4, 5] proposed iterative algorithms to solve this problem, updating V and W as follows:

V_ij ← V_ij (W^T S)_ij / (W^T WV)_ij    (2)

W_ij ← W_ij (SV^T)_ij / (WVV^T)_ij    (3)
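To make the updates in Eqs. (2)–(3) concrete, a minimal NumPy sketch is given below; the function name, the random initialization, and the small eps added to the denominators for numerical stability are illustrative assumptions rather than part of the original formulation.

import numpy as np

def nmf_multiplicative(S, K, n_iter=200, eps=1e-10, seed=0):
    # Lee-Seung multiplicative updates for a nonnegative matrix S ~ W V (Eqs. (2)-(3)).
    rng = np.random.default_rng(seed)
    N, M = S.shape
    W = rng.random((N, K))   # K nonnegative basis vectors (columns of W)
    V = rng.random((K, M))   # nonnegative encoding coefficients
    for _ in range(n_iter):
        V *= (W.T @ S) / (W.T @ W @ V + eps)   # Eq. (2)
        W *= (S @ V.T) / (W @ V @ V.T + eps)   # Eq. (3)
    return W, V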
To address the limitation of non-negative data constraints, Ding et al. [12] developed Convex Nonnegative Matrix Factorization (Con-NMF), which accommodates mixed-sign data matrices. Con-NMF imposes the constraint that the basis vectors must reside within the column space of S, i.e., W = SΠ, where Π is an auxiliary adaptive weight matrix. This approach leads to the following modified objective function [17]:

min f_conNMF(Π, V) = (1/2) ‖S − SΠV^T‖_F^2    s.t.  Π ∈ R_+^{M×K}, V ∈ R_+^{M×K}    (4)

The factors V and Π are updated as follows [14]:

V_ij ← V_ij √( ([(S^T S)^+ Π]_ij + [VΠ^T (S^T S)^− Π]_ij) / ([(S^T S)^− Π]_ij + [VΠ^T (S^T S)^+ Π]_ij) )    (5)

Π_ij ← Π_ij √( ([(S^T S)^+ V]_ij + [(S^T S)^− ΠV^T V]_ij) / ([(S^T S)^− V]_ij + [(S^T S)^+ ΠV^T V]_ij) )    (6)

where S^T S = (S^T S)^+ − (S^T S)^−, (x_ij)^+ = max{0, x_ij}, and (x_ij)^− = max{0, −x_ij}.
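For the mixed-sign case, the following NumPy sketch illustrates the multiplicative form of the Con-NMF updates in Eqs. (5)–(6), following Ding et al. [12]; the function name, the random initialization, and the eps guard are assumptions of the illustration.

import numpy as np

def con_nmf(S, K, n_iter=200, eps=1e-10, seed=0):
    # Con-NMF for mixed-sign S ~ S Pi V^T (Eqs. (4)-(6)).
    rng = np.random.default_rng(seed)
    M = S.shape[1]
    Pi = rng.random((M, K))          # auxiliary adaptive weight matrix
    V = rng.random((M, K))           # nonnegative encoding matrix
    G = S.T @ S
    Gp, Gm = np.maximum(G, 0), np.maximum(-G, 0)   # (S^T S)^+ and (S^T S)^-
    for _ in range(n_iter):
        V *= np.sqrt((Gp @ Pi + V @ Pi.T @ Gm @ Pi + eps)
                     / (Gm @ Pi + V @ Pi.T @ Gp @ Pi + eps))     # Eq. (5)
        Pi *= np.sqrt((Gp @ V + Gm @ Pi @ V.T @ V + eps)
                      / (Gm @ V + Gp @ Pi @ V.T @ V + eps))      # Eq. (6)
    return Pi, V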
Sparse NMF. Sparse representation has recently gained significant attention in relation to the NMF problem due to its effective classification performance and robust properties [6, 25, 26]. Given a non-negative matrix M ∈ R^{N×P}, NMF seeks to find two non-negative matrices W ∈ R^{N×K} and V ∈ R^{K×P} such that M ≈ WV. In sparse NMF, the goal is to obtain a sparse projection matrix for explicit data representation and effective interpretation. Generally, sparse dimensionality-reduction NMF is expressed compactly as [26]:

min_{W,V} F(M, WV) + ω p(V)    (7)

where p(V) is a penalty term that enforces sparsity in the learning process.
This sparse term is designed to restrict the number of nonzero elements in each
column of the projection matrix. In such scenarios, the L1 -norm is often used as a
relaxation of the L0 penalty [25, 26]. While the L1 -norm (lasso) is convex, it is not
differentiable, making it challenging to find solutions for the lasso-regularized model. To
address this issue, W. Liu et al. [25] imposed an L2 -norm on the lasso-penalized problem,
overcoming the limitations of the L1 -norm while preserving its beneficial properties.
The elastic net penalty, which combines the L1 -norm and the L2 -norm of the projection
matrix, is convex with respect to the projection matrix and thus provides the grouping
effect property [25].

2.2 Exemplar-Embed Complex Matrix Factorization


Complex Matrix Factorization. Given a complex data matrix T ∈ C^{N×M}, CMF is analogous to the NMF model, where T is factorized into the form T ≈ BE, with B ∈ C^{N×K} and E ∈ C^{K×M} in general.
This approach results in the following optimized objective function [14]:

min f_CMF(B, E) = (1/2) ‖T − BE‖_F^2    s.t.  B ∈ C^{N×K}, E ∈ C^{K×M}    (8)

where ‖·‖_F denotes the Frobenius norm and K ≪ min(N, M).
Exemplar-Embed Complex Matrix Factorization. In the EE-CMF [15], the image
dataset, initially represented as a set of real values, is first converted into N-dimensional
real vectors through lexicographic ordering. Each real vector is then mapped to the
complex domain by the form:
f(u_t) = v_t = (1/√2) e^{iαπ u_t} = (1/√2) [e^{iαπ u_t(1)}, ..., e^{iαπ u_t(N)}]^T    (9)

where f : R^N → C^N is a mapping function that converts an image vector with real values u_t into a vector of complex values v_t in the complex space. This mapping is achieved using the sine and cosine components derived from Euler’s formula [22].
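As a concrete illustration of the mapping in Eq. (9), a minimal NumPy sketch is shown below; the function name and the default choice of α are assumptions of the illustration.

import numpy as np

def to_complex_domain(u, alpha=1.0):
    # Eq. (9): map a real image vector u element-wise to v = (1/sqrt(2)) * exp(i*alpha*pi*u).
    u = np.asarray(u, dtype=float)
    return np.exp(1j * alpha * np.pi * u) / np.sqrt(2.0)

# Applied column-wise to a real data matrix U (N x M, e.g. pixel intensities scaled to [0, 1]),
# this yields the complex matrix T used by EE-CMF:
# T = to_complex_domain(U)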
Suppose that T ∈ C^{N×M} is the complex matrix containing the images of the dataset. The EE-CMF model factorizes T into the form T ≈ BE, where B, given by B = TW (W ∈ C^{M×K}), represents the exemplar-embedded basis matrix, and E ∈ C^{K×M} denotes the encoding matrix. Therefore, the objective function of the EE-CMF problem is expressed as follows:

min_{W,E} f_EE-CMF(W, E) = min_{W,E} (1/2) ‖T − TWE‖_F^2    (10)

where

‖T − TWE‖_F^2 = Tr[(T − TWE)^H (T − TWE)] = Tr(T^H T − E^H W^H T^H T − T^H TWE + E^H W^H T^H TWE)    (11)

To address the minimization problem, the complex gradient descent algorithm is employed, leveraging Wirtinger’s calculus. Equation (10) corresponds to a non-convex minimization problem with respect to both variables W and E. Consequently, to obtain the optimal solution, the problem is approached by iteratively fixing one variable while optimizing the other.

3 Proposed Method
3.1 Exemplar-Embedded Complex Matrix Factorization with the Elastic Net
Penalty (ENEE-CMF)

This section proposes a new model based on EE-CMF by incorporating a complex elastic net constraint into the coefficient matrix.
As previously discussed, the EE-CMF model aims to factorize a complex data matrix T ∈ C^{N×M} into the form T = BE, where B is given by B = TW. The proposed model introduces constraints on both the L1-norm and the L2-norm of the projection matrix, incorporating these as regularization terms in the objective function to enhance performance and stability. This leads to the comprehensive definition of our model, Exemplar-Embedded Complex Matrix Factorization with Elastic Net Penalty (ENEE-CMF), as follows:

min_{W,E} f_ENEE-CMF(W, E) = min_{W,E} [ (1/2) ‖T − TWE‖_F^2 + ω_1 ‖E‖_1 + ω_2 ‖E‖_2^2 ]    (12)

where ‖E‖_1 = Σ_{j=1}^{M} ‖E_{:j}‖_1 = Σ_{j=1}^{M} Σ_{i=1}^{K} |E_ij| and ‖E‖_2^2 = Σ_{j=1}^{M} Σ_{i=1}^{K} |E_ij|^2    (13)

and ω_1, ω_2 are regularization parameters that balance the trade-off between the approximation error and the sparse constraints.
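For clarity, the elastic-net term of Eq. (13) on a complex encoding matrix E can be evaluated as in the short sketch below (moduli are used because E is complex); the function name and default weights are assumptions of the illustration.

import numpy as np

def elastic_net_penalty(E, omega1=0.1, omega2=0.1):
    # Eq. (13): omega1 * ||E||_1 + omega2 * ||E||_2^2 for a complex matrix E.
    l1 = np.sum(np.abs(E))           # sum of moduli |E_ij|
    l2_sq = np.sum(np.abs(E) ** 2)   # squared Frobenius norm
    return omega1 * l1 + omega2 * l2_sq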

3.2 Optimal Solution

To solve the minimization problem, we also apply the complex gradient descent algorithm, utilizing Wirtinger’s calculus [18, 19].
First, by fixing W, the objective function in (12) is modified to depend only on the single variable E. As a result, the objective function is transformed as follows:

min_E O_ENEE-CMF(E) = min_E [ (1/2) ‖T − TWE‖_F^2 + ω_1 ‖E‖_1 + ω_2 ‖E‖_2^2 ]    (14)

Then, W is updated based on the Moore–Penrose pseudoinverse [28] and computed as W = (T^† T) E^† with E fixed, where † denotes the pseudoinverse of a matrix.
To solve subproblem (14), the function O(E) is treated as O(E, E*), where

O_ENEE-CMF(E, E*) = (1/2) Tr[ T^H T − (E*)^T W^H T^H T − T^H TWE + (E*)^T W^H T^H TWE ] + ω_1 ‖E‖_1 + ω_2 Tr((E*)^T E)    (15)

According to [19], at a given iteration round t, the following update rule is employed:

E^(t+1) = E^(t) − 2β_t ∇_{E*} O(E^(t), E^*(t))    (16)

where β_t is the learning step parameter for the t-th iteration, estimated using the Armijo rule [29]. Based on the Armijo rule, β_t = μ^{s_t}, 0 < μ < 1, and s_t is the first non-negative integer such that the following inequality is satisfied:

O(E^(t+1), E^*(t+1)) − O(E^(t), E^*(t)) ≤ 2σ Re⟨∇_{E*} O(E^(t), E^*(t)), E^(t+1) − E^(t)⟩    (17)

The first-order partial derivative with respect to E* is evaluated as follows:

∇_{E*} O_ENEE-CMF(E, E*) = −W^H T^H T + W^H T^H TWE + ω_1 E/abs(E) + ω_2 E    (18)

The condition specified by (17) ensures that the function value decreases with each iteration. Finally, a pre-defined threshold Γ can be selected, and the stopping criterion is set as follows:

‖∇_{E*} O(E, E*)‖_F ≤ Γ    (19)
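The following NumPy sketch illustrates one update of E according to Eqs. (16)–(18) with an Armijo-style backtracking step in the spirit of Eq. (17); the helper names, the eps guarding the modulus in the L1 term, and the default parameter values are assumptions of the illustration, not the authors’ implementation.

import numpy as np

def enee_cmf_objective(T, W, E, w1, w2):
    # Eq. (12): 0.5*||T - T W E||_F^2 + w1*||E||_1 + w2*||E||_2^2
    R = T - T @ W @ E
    return 0.5 * np.linalg.norm(R, 'fro') ** 2 + w1 * np.sum(np.abs(E)) + w2 * np.sum(np.abs(E) ** 2)

def grad_E_conj(T, W, E, w1, w2, eps=1e-12):
    # Eq. (18): Wirtinger gradient of the objective with respect to conj(E)
    A = W.conj().T @ T.conj().T @ T          # W^H T^H T
    return -A + A @ W @ E + w1 * E / (np.abs(E) + eps) + w2 * E

def update_E(T, W, E, w1=0.1, w2=0.1, mu=0.01, sigma=1e-4, max_backtrack=20):
    # One iteration of Eq. (16); the step is shrunk as beta_t = mu**s_t until Eq. (17) holds.
    g = grad_E_conj(T, W, E, w1, w2)
    f_old = enee_cmf_objective(T, W, E, w1, w2)
    beta = 1.0
    for _ in range(max_backtrack):
        E_new = E - 2 * beta * g
        if (enee_cmf_objective(T, W, E_new, w1, w2) - f_old
                <= 2 * sigma * np.real(np.vdot(g, E_new - E))):
            return E_new
        beta *= mu
    return E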

4 Experiments
This section evaluates the performance of the proposed ENEE-CMF framework for FER. The classification ability of the derived encoding coefficient vector is compared with that of various NMF-based methods and EE-CMF. The basis matrix W_train was generated from W_train = T_train(E_train T_train)† during the training phase. The test sample was encoded as e_test = (T W_tr)† t_test, and classification was conducted using a nearest neighbor classifier following projection.
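A schematic sketch of this evaluation protocol is given below, with the test encoding computed through the pseudoinverse of the learned basis T_train W_train and a simple 1-NN rule on the encodings; the helper names and the use of Euclidean distance on stacked real and imaginary parts are assumptions of the illustration.

import numpy as np

def encode_test_sample(T_train, W_train, t_test):
    # Encode a test column vector: e_test = (T_train W_train)^dagger t_test
    return np.linalg.pinv(T_train @ W_train) @ t_test

def nearest_neighbor_label(e_test, E_train, labels):
    # 1-NN classification on encoding vectors (real and imaginary parts stacked).
    feat = lambda e: np.concatenate([e.real, e.imag])
    train_feats = np.stack([feat(E_train[:, j]) for j in range(E_train.shape[1])])
    d = np.linalg.norm(train_feats - feat(e_test), axis=1)
    return labels[int(np.argmin(d))]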

4.1 Data Representation and Experiment Settings


The proposed model was evaluated using two publicly available datasets: JAFFE [30]
and Cohn-Kanade (CK+) [31]. The CK+ dataset includes 593 video sequences from 123
subjects, with each sequence capturing various facial expressions. The last five frames
of each video were used as static representations of these expressions. In contrast, the
JAFFE dataset contains 213 grayscale images from ten subjects, each exhibiting two to
four examples of different expressions. Sample images from both datasets are shown in
Fig. 1.
The proposed algorithm was compared to several well-known NMF and CMF
algorithms, including: (1) basic NMF [5], (2) convex-NMF (Con-NMF) [12], (3) L1 -
norm sparse constrained NMF with Euclidean divergence (Eud_SNMF) [23], (4) L1 -
norm sparse constrained NMF with Kullback-Leibler divergence (Kull_SNMF) [24],
(5) sparse NMF with elastic net regularization (EN-SNMF) [25], (6) weighted NMF
(weiNMF), which applies binary weights to the data matrix [32], (7) NeNMF, utilizing Nesterov’s optimal gradient method for efficient optimization [33], (8)–(9) unsupervised and supervised robust nonnegative graph embedding (uRNGE and sRNGE), which replace the L2-norm with the L21-norm for greater robustness [34, 35], and (10) EE-CMF [15].

Fig. 1. Cropped facial images displaying six distinct expressions from the JAFFE dataset [30] in the first row, and from the CK+ dataset [31] in the second row.
To meet the condition specified in Eq. (17), the step size reduction rate μ was set to 0.01. The stopping criterion, as described in Eq. (19), used a relative tolerance ε of 10^−4 or a maximum of 10,000 iterations. For both the sparse complex ENEE-CMF and the sparse real NMF algorithms, a sparseness control parameter ω of 0.1 was used in all simulations.

4.2 Performance and Comparison


Experiment Results on the JAFFE Dataset. Experiments were conducted by using
one randomly selected image of each expression per person for training, while the
remaining images were used for testing. Table 1 displays the average recognition rates
across different subspace dimensionalities. The data in Table 1 reveal a common trend
among most algorithms: as the number of training images (or subspace dimensionality)
decreases, the recognition rate tends to drop. Among the methods compared, the pro-
posed ENEE-CMF method achieved the highest accuracy at 72.26%, followed closely
by the EE-CMF method with an accuracy of 71.14%. These outperformed all other meth-
ods, including GSNMF and Kull_SNMF, which had accuracies of 69.91% and 69.56%,
respectively. In contrast, the Con-NMF and EN-NMF frameworks, despite their similarities to the proposed methods, showed markedly lower recognition accuracies, around 39.84% and 61.24%, respectively. The remaining baseline methods’ accuracies varied between 39.95% and 62.98%.
Experiment Results on the Cohn-Kanade Dataset. The CK + dataset was used to
evaluate facial expression recognition (FER) performance. For this evaluation, one frame
from each video sequence was selected for training, while the remaining frames served
as the test set. Table 2 displays the recognition performance of the various methods
tested, each with different subspace sizes. The ENEE-CMF method achieved the highest
recognition rate at 96.83%. The GSNMF method, which specializes in extracting parts-
based facial features, also performed notably well with an accuracy of 96.54%. Both
the EE-CMF and Kull-SNMF methods achieved identical results of 95.2%. The other
methods showed a wider range of performance, with recognition accuracies spanning from 53.46% to 93.66%.

Table 1. The accuracy rate (%) on the JAFFE dataset with various subspace dimensionalities

No. Bases  ENEE-CMF  EE-CMF  NMF  Con-NMF  Eud_SNMF  Kull_SNMF  EN-NMF  weiNMF  NeNMF  GSNMF  uRNGE  sRNGE
20 67.51 66.99 65.24 45.53 35.66 64.20 69.65 63.36 66.85 68.11 22.38 27.62
30 71.61 66.36 68.11 49.09 48.53 68.67 71.19 68.32 61.05 70.07 22.10 33.78
40 70.07 72.31 70.84 48.60 57.20 68.39 70.21 68.95 63.28 69.23 28.18 39.79
50 73.99 72.03 71.68 52.17 64.76 70.21 70.77 69.02 61.82 69.09 32.66 43.15
60 73.15 72.45 71.12 46.36 64.48 71.47 69.93 72.38 63.71 70.28 36.78 47.41
70 71.47 72.31 69.79 27.34 48.81 70.07 59.44 69.16 62.52 70.98 42.24 51.19
80 72.73 72.59 26.15 27.90 14.41 72.03 46.29 23.50 62.87 69.93 46.64 59.86
90 74.97 71.68 16.01 26.01 11.33 70.91 50.21 26.71 61.40 70.91 49.37 64.55
100 74.83 73.57 18.60 35.59 14.41 70.07 43.50 15.25 63.36 70.56 53.64 66.36
Ave 72.26 71.14 53.06 39.84 39.95 69.56 61.24 52.96 62.98 69.91 37.11 48.19

Table 2. The accuracy rate (%) on the CK+ dataset with various subspace dimensionalities

No. Bases  ENEE-CMF  EE-CMF  NMF  Con-NMF  Eud_SNMF  Kull_SNMF  EN-NMF  weiNMF  NeNMF  GSNMF  uRNGE  sRNGE
20 97.23 95.43 85.41 41.28 46.61 93.06 96.24 85.06 94.30 95.60 30.87 41.74
30 96.94 92.25 90.99 58.99 65.83 94.59 96.36 91.07 89.30 96.69 40.93 53.28
40 96.65 91.24 93.88 73.55 79.50 94.38 96.32 94.17 90.27 96.01 45.37 61.18
50 96.49 95.06 94.5 80.27 87.64 95.45 96.45 94.75 91.80 96.71 51.94 66.39
60 96.70 96.14 95.06 87.40 91.57 95.41 76.94 94.92 92.34 96.57 55.45 70.39
70 96.85 96.59 95.18 90.42 94.46 96.28 90.41 95.62 93.31 96.71 55.19 70.74
80 96.57 96.74 95.93 91.57 95.66 95.66 53.72 95.58 93.27 96.84 60.29 74.74
90 97.15 96.63 95.95 92.42 95.91 95.79 48.39 95.87 93.26 96.82 62.46 74.68
100 96.86 96.78 96.03 92.03 96.65 96.24 79.83 95.62 94.17 96.86 78.60 74.35
Ave 96.83 95.21 93.66 78.66 83.76 95.21 81.63 93.63 92.45 96.54 53.46 65.28

Experiment Results on the Occluded Cohn–Kanade Images. As highlighted in previous sections, the ENEE-CMF method has demonstrated excellent performance in facial
expression recognition across both the JAFFE and CK+ datasets. In this section, the
robustness of the method is tested on images from the Cohn-Kanade dataset, where
occlusions are applied to a random area. For non-occluded images, the last five frames
of the video sequences are treated as static images. Three of these non-occluded images
are used for training, while the remaining two are altered with occlusions and used
for testing. Mouth and eye occlusions are created by placing masks over the respective
regions, while random occlusions are simulated by adding 70 × 70 patches in arbitrary
locations on the original 640 × 490 images. Figure 2 shows examples of CK+ images
with these occlusions.

Fig. 2. Occluded images of six facial expressions, along with a neutral expression, from the CK+
dataset.

Table 3. The accuracy rate (%) for the occluded CK + dataset

No. Bases  ENEE-CMF  EE-CMF  NMF  Con-NMF  Eud_SNMF  Kull_SNMF  EN-NMF  weiNMF  NeNMF  GSNMF  uRNGE  sRNGE
20 74.46 73.26 50.62 20.04 24.71 60.66 76.12 48.60 68.97 50.50 21.78 26.90
30 81.65 68.77 58.39 24.59 28.68 61.86 80.00 58.97 63.88 75.21 22.93 32.69
40 84.13 68.48 62.27 26.16 28.02 70.74 85.29 63.43 61.07 74.09 28.55 35.33
50 85.45 68.71 65.29 27.52 32.48 71.73 53.39 65.00 59.71 68.22 28.06 39.21
60 85.87 72.20 70.37 27.19 36.86 72.45 63.64 67.85 55.54 59.01 34.17 40.95
70 86.28 71.31 70.33 26.82 35.21 71.27 67.52 72.36 52.98 67.85 34.92 45.08
80 87.85 73.91 73.31 27.81 37.77 75.60 47.77 72.40 47.23 65.62 38.60 46.61
90 80.74 73.38 73.39 27.73 36.61 78.41 75.37 73.31 39.09 72.98 41.78 48.97
100 75.45 73.20 75.25 29.34 38.93 78.17 62.73 74.38 37.60 70.08 40.54 54.79
Ave 82.43 71.47 66.58 26.36 33.25 71.21 67.98 66.26 54.01 67.06 32.37 41.17

Table 3 displays the recognition results for occlusions in random facial regions.
The ENEE-CMF consistently outperforms other methods, achieving the highest average
recognition accuracy of 82.43% across all occlusion levels, demonstrating exceptional
performance in handling occlusions. The EE-CMF also performs well, with an average
accuracy of 71.47%, but falls short of ENEE-CMF. Among advanced methods, EN-
NMF and weiNMF achieve moderate accuracies of 67.98% and 66.26%, respectively,
showing improvement over basic methods yet still lagging behind ENEE-CMF. GSNMF also yields a competitive result, with an average accuracy of 67.06%, outperforming several basic methods but not reaching ENEE-CMF’s accuracy. Overall, the ENEE-CMF model demonstrates superior accuracy and robustness in handling occlusions compared to its counterparts.

5 Analyzing Generalizability
To evaluate the generalizability of the proposed methods against standard NMF and EE-
CMF approaches, we conducted subject-independent and cross-dataset experiments. The
training phase used a subset of the JAFFE dataset, while testing involved images from
both JAFFE and CK+. The training set included images from seven individuals, and the
testing set consisted of images from the remaining three JAFFE individuals and three
randomly selected CK+ individuals. The average recognition accuracy for all algorithms
was below 40%. However, the ENEE-CMF models exhibited superior generalizability
compared to their counterparts, as shown in Fig. 3.

Fig. 3. The accuracy rate (%) on the JAFFE data for training and a combination of JAFFE and
CK + datasets for testing across various subspace dimensionalities

Figure 4 presents the basis images extracted from the training data of the CK+ dataset.

Fig. 4. Basis images learned from the training data in the CK + dataset by (a) the proposed
ENEE-CMF, (b) EE-CMF, (c) Con_NMF, (d) Eud_SNMF, (e) Kull_SNMF, and (f) EN_NMF

6 Conclusions
This work presents a novel complex matrix factorization approach with an elastic net
penalty. To solve complex matrix factorization problems, the gradient descent method
with Wirtinger calculus was employed. The model was evaluated on two facial expression
datasets, achieving high recognition accuracy and outperforming both EE-CMF and
NMF algorithms. Future work will focus on enhancing the model by incorporating
additional regularization techniques and advanced optimization methods to improve
convergence and robustness, particularly in handling more complex occlusions and noisy
data. Another promising direction is extending the model to process dynamic facial
expressions in real-time by leveraging spatio-temporal features. Additionally, applying
the model to other domains, such as object recognition or medical image analysis, could
further expand its applicability.

Acknowledgments. This research was supported by the HoChiMinh City University of Technology and Education’s Scientific Research Support Fund.

References
1. Fasel, B., Luettin, J.: Automatic facial expression analysis: a survey. Pattern Recogn. 36(1),
259–275 (2003)
2. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press, San
Diego, CA (2013)
3. Noushath, S., Hemantha Kumar, G., Shivakumar, P.: Diagonal fisher linear discriminant
analysis for efficient face recognition. Neurocomputing 69, 1711–1716 (2006)
4. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization.
Nature 401(6755), 788–791 (1999)
5. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: NIPS, 13,
pp. 556–562 (2000)
6. Hoyer, P.: Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res.
5(9), 1457–1469 (2004)
7. Yuan, Z., Oja, E.: Projective nonnegative matrix factorization for image compression and
feature extraction. In: Image Analysis: 14th Scandinavian Conference (SCIA 2005), pp. 333–
342. Springer, Berlin Heidelberg (2005)
8. Nikitidis, S., Tefas, A., Pitas, I.: Using subclasses in discriminant non-negative subspace
learning for facial expression recognition. In: 19th European Signal Processing Conference,
pp. 1964–1968. IEEE (2011)
9. Chen, X., Huang, T.: Facial expression recognition: a clustering-based approach. Pattern
Recogn. Lett. 24(9–10), 1295–1302 (2003)
10. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. fisherfaces: Recognition
using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine
Intelligence 19(7), 711–720 (1997)
11. Lee, C.S., Chellappa, R.: Sparse localized facial motion dictionary learning for facial expres-
sion recognition. In: International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pp. 3548–3552. IEEE (2014)
12. Ding, C.H., Li, T., Jordan, M.I.: Convex and semi-nonnegative matrix factorizations. IEEE
Trans. Pattern Anal. Mach. Intell. 32(1), 45–55 (2008)
13. Liwicki, S., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: Euler principal component analysis.
Int. J. Comput. Vision 101, 498–518 (2013)
14. Duong, V.H., Lee, Y.S., Pham, B.T., Mathulaprangsan, S., Bao, P.T., Wang, J.C.: Com-
plex Matrix Factorization for Face Recognition. https://arxiv.org/ftp/arxiv/papers/1612/1612.
02513.pdf

15. Duong, V.H., et al.: Exemplar-embed complex matrix factorization for facial expression
recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pp. 1837–1841. IEEE (2017)
16. Duong, V.H., et al.: Projective complex matrix factorization for facial expression recognition.
EURASIP Journal on Advances in Signal Processing, pp. 1–11 (2018)
17. Wirtinger, W.: Zur formalen theorie der funktionen von mehr komplexen veränderlichen.
Math. Ann. 97(1), 357–375 (1927)
18. Hu, W., Choi, K.S., Wang, P., Jiang, Y., Wang, S.: Convex nonnegative matrix factorization
with manifold regularization. Neural Netw. 63, 94–103 (2015)
19. Amin, M.F., Amin, M.I., Al-Nuaimi, A.Y.H., Murase, K.: Wirtinger calculus based gradient
descent and Levenberg-Marquardt learning algorithms in complex-valued neural networks.
In: International Conference on Neural Information Processing, pp. 550–559. Springer Berlin
Heidelberg (2011)
20. Hjorungnes, A., Gesbert, D.: Complex-valued matrix differentiation: techniques and key
results. IEEE Trans. Signal Process. 55(6), 2740–2746 (2007)
21. Gilbert, S.: Linear Algebra and its Applications. Saunders College Publishing (1986)
22. Moskowitz, M.A.: A Course in Complex Analysis in One Variable. World Scientific (2002)
23. Eggert, J., Korner, E.: Sparse coding and NMF. In: IEEE International Joint Conference on
Neural Networks (IEEE Cat. No. 04CH37541), 4, pp. 2529–2533. IEEE (2004)
24. Schmidt, M.N.: Speech separation using non-negative features and sparse non-negative matrix
factorization. In: Interspeech, pp. 19–33. Technical University of Denmark, DTU (2007)
25. Liu, W., Zheng, S., Jia, S., Shen, L., Fu, X.: Sparse nonnegative matrix factorization with the
elastic net. In: IEEE International Conference on Bioinformatics and Biomedicine (BIBM),
pp. 265–268. IEEE (2010)
26. Sra, S., Dhillon, I.: Generalized nonnegative matrix approximations with Bregman diver-
gences. Adv. Neural. Inf. Process. Syst. 18, 283–290 (2005)
27. Peharz, R., Pernkopf, F.: Sparse nonnegative matrix factorization with L0 -constraints.
Neurocomputing 80, 38–46 (2012)
28. Barata, J.C.A., Hussein, M.S.: The Moore-Penrose pseudoinverse: a tutorial review of the
theory. Braz. J. Phys. 42, 146–165 (2012)
29. Lin, C.J.: Projected gradient methods for nonnegative matrix factorization. Neural Comput.
19(10), 2756–2779 (2007)
30. Lyons, M., Akamatsu, S., Kamachi, M., Gyoba, J.: Coding facial expressions with gabor
wavelets. In: Proceedings Third IEEE International Conference on Automatic Face and
Gesture Recognition, pp. 200–205. IEEE (1998)
31. Kanade, T., Cohn, J.F., Tian, Y.: Comprehensive database for facial expression analysis.
In: Proceedings Fourth IEEE International Conference on Automatic Face and Gesture
Recognition (cat. No. PR00580), pp. 46–53. IEEE (2000)
32. Wang, D., Li, T., Ding, C.: Weighted feature subset non-negative matrix factorization and its
applications to document understanding. In: IEEE International Conference on Data Mining,
pp. 541–550. IEEE (2010)
33. Guan, N., Tao, D., Luo, Z., Yuan, B.: NeNMF: an optimal gradient method for nonnegative
matrix factorization. IEEE Trans. Signal Process. 60(6), 2882–2898 (2012)
34. Yang, J., Yang, S., Fu, Y., Li, X., Huang, T.: Non-negative graph embedding. In: 2008 IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2008)
35. Zhang, H., Zha, Z.J., Yang, Y., Yan, S., Chua, T.S.: Robust (semi) nonnegative graph
embedding. IEEE Trans. Image Process. 23(7), 2996–3012 (2014)
A Method Combining the Reference
Information of the Adaptive Adjustment
Method and the Decision Maker
of Multi-objective Evolutionary
Algorithms

Long Nguyen1(B) , Minh Tran Binh2(B) , and Thu To Thi2


1 Faculty of Logistics and Engineering, National Defense Academy, Hanoi, Vietnam
[email protected]
2 Military Information Technology Institute, Academy of Military Science and Technology, Hanoi, Vietnam
[email protected]

Abstract. In practice, when using multi-objective optimization algorithms, people use reference information to search for desired solutions. However, including decision-maker reference information can cause the evolutionary process to lose the balance between exploration and exploitation capabilities, thereby leading to missing good solutions or becoming trapped in local optima. Recently, there have been many effective proposals to analyze trends and maintain this balance automatically. To simultaneously address the decision maker’s desires and the self-regulation capability, this paper proposes a method that combines decision-maker information and adaptive control information, applied to DMEA-II and MOEA/D using reference points. The experimental results show a good balance in the use of the two types of reference information during the evolutionary process.

Keywords: Multi-objective evolutionary algorithm · DMEA-II · MOEA/D · interactive · adaptive adjustment · reference point

1 Introduction
In reality, optimization problems often have more than one objective, and these objectives often conflict; this gives rise to the class of multi-objective optimization problems. A multi-objective optimization problem is defined as an optimization problem that has at least two conflicting objectives which need to be optimized simultaneously. Mathematically, the multi-objective optimization problem (MOP) with k objectives is expressed as follows:

f(x) = [f_1(x), f_2(x), ..., f_k(x)]    (1)

in which x is a vector of decision variables in the v-dimensional space R^v. In evolutionary computation (EC), x represents an individual in the population to be evolved.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025
W. Buntine et al. (Eds.): SOICT 2024, CCIS 2350, pp. 427–438, 2025.
https://doi.org/10.1007/978-981-96-4282-3_34

The value f_j(x), then, describes the performance of individual x as evaluated against the j-th objective in the MOP.
In the field of multi-objective optimization, the concept of dominance is introduced. The dominance relationship is defined as follows: in a set of solutions, a solution is called a non-dominated solution if no solution in the set is better than it on all objectives; that is, if another solution is better on a certain objective, it is worse on at least one other objective. The set of non-dominated solutions is called the class of non-dominated solutions. This is the basic principle of the Pareto optimization method, and that set of solutions is also called the Pareto optimal solution set. Methods for solving MOPs are commonly classified as non-preference, a priori, a posteriori, and interactive methods. The non-preference method does not use preferences; the decision maker (DM) receives the solution of the optimization process and can choose to accept or reject it. In the a priori method, the DM preferences are introduced and incorporated before the search process. In the a posteriori method, the preferences are incorporated at the end of the search process. In the interactive method, the DM preferences are introduced, incorporated, and modified interactively at any time during the search process. Among the methods for solving multi-objective optimization problems, multi-objective evolutionary algorithms (MOEAs) are nowadays commonly used for several reasons: first, evolutionary algorithms work on populations, which is suitable for determining Pareto optimal solution sets based on dominance relationships; second, thanks to the approximation techniques used in evolutionary algorithms, MOEAs can tackle difficult problems with high computational costs, and even problems with non-differentiable objective functions.
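To make the dominance relation defined above concrete, a minimal sketch (assuming all objectives are to be minimized) is shown below; the function names are illustrative.

import numpy as np

def dominates(f_a, f_b):
    # f_a Pareto-dominates f_b (minimization): no worse on every objective, strictly better on at least one.
    f_a, f_b = np.asarray(f_a), np.asarray(f_b)
    return bool(np.all(f_a <= f_b) and np.any(f_a < f_b))

def non_dominated(F):
    # Indices of the non-dominated objective vectors (rows of F).
    return [i for i, fi in enumerate(F)
            if not any(dominates(fj, fi) for j, fj in enumerate(F) if j != i)]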
MOEAs are widely used; however, to ensure the quality of the algorithm, two main issues are of concern. The first is the exploration and exploitation ability of the population, which must be balanced so that the search does not fall into local optima, searches globally, and quickly reaches optimal results. The second concerns the quality of the solution set, which requires a balance between convergence and diversity. Recently, many methods have been proposed to control the evolutionary process so as to maintain the balance between exploration and exploitation capabilities as well as the convergence quality and diversity of the population. A popular approach is to analyze some type of reference information, such as the population quality or the evolutionary trend, to steer the algorithm toward maintaining the above balance goals. In practical problems, the interaction of the DM is very important, as it is how the DM expresses their desires and priorities. Therefore, during the evolution process of MOEAs, the algorithm can be affected simultaneously by the self-adaptation process and by the interaction of the DM. However, combining these two kinds of information is a difficult problem because they are relatively independent.
In this paper, we investigate adaptive adjustment methods and DM interaction with MOEAs. From there, we propose a method to combine the two types of reference information into synthesized reference information that controls the evolutionary process so as to simultaneously meet the two goals of adaptive self-adjustment and the requirements of the decision maker. The paper is structured into five sections: Sect. 1 is a general introduction; Sect. 2 presents the usage of reference information in adaptive adjustment and multi-objective interaction; Sect. 3 presents the methodology for combining reference information from the DM and from the self-adaptation process; Sect. 4 presents the experiments and results; and Sect. 5 is the conclusion.

2 Reference Information in Adaptive Adjustment and Multi-objective Interaction
2.1 Adaptive Adjustment
In order for MOEAs to create a good balance between the exploration and exploitation capabilities of the search process, as well as to maintain solution quality while ensuring convergence and diversity of the population, there have been many recent studies. Typical recent works include [3], which evaluated the population quality over the generations of the evolutionary process based on popular measures and adjusted the algorithm to keep those metrics in equilibrium, thereby better maintaining the balance between exploration and exploitation of the population. The authors in [18] introduced the heterogeneous green city vehicle routing problem with time windows, aiming to address the limitations of traditional vehicle routing problem with time windows models in comprehensively accounting for diverse stakeholder interests and practical considerations in urban logistics. The heterogeneous green city vehicle routing problem with time windows considers heterogeneous vehicle attributes, recipient time sensitivity, and the environmental impacts of varying road congestion levels. To effectively solve this problem, an adaptive multi-objective genetic algorithm with GIS is proposed, which generates high-quality initial solutions tailored to different objectives and ensures the diversity of the population. The authors in [13] introduced a systematic investigation of an adaptive multi-objective algorithm designed for the optimization of problems in unbounded integer decision spaces and suggested using static configurations. The authors in [8] proposed a selection strategy based on the angle-penalized distance to improve the coverage of the solutions in the objective space, together with an adaptive reproduction operation that selects different reproduction strategies for gene-level global exploration or local exploitation. The authors in [1] proposed a mechanism to maintain this balance by leveraging the relationship between the algorithm’s quality assessment indicators and the trajectory of the search process. This balancing mechanism is then applied to improve the symmetry of multi-objective evolutionary algorithms based on differential evolution, where the correlation information about the variation of the convergence and diversity quality measures is integrated with the step length of the differential evolution operator to maintain the equilibrium between the exploratory and exploitative nature of evolution. In [16], a decomposition-based dynamic bi-objective evolutionary algorithm is proposed to deal with complex global optimization problems. It transforms a global optimization problem into an equivalent dynamic bi-objective optimization problem, in which one objective is the original objective and the other is the niche count function. The proposed MOEA decomposes the transformed bi-objective optimization problem into several bi-objective optimization subproblems. The added helper objective (niche count) is controlled by the niche radius, which is dynamically decreased over time, providing a tradeoff between exploration and exploitation.

2.2 Interactive Multi-objective Evolutionary Algorithms


Multi-objective optimization interaction is an interesting research problem and has great practical significance. Through interaction, the DM provides the desired information to the search process in order to prioritize and choose the solution appropriate to the decision maker’s (DM’s) problem. Recently, there have been various approaches for engaging the DM in the search process of multi-objective optimization algorithms, especially evolutionary algorithms. The authors in [14] analyzed three interactive reference-point-based algorithms: an interactive version of the Non-dominated Sorting Genetic Algorithm-II (I-NSGA-II), an interactive version of the Third Evolution Step of Generalized Differential Evolution (I-GDE3), and an interactive version of GDE3 equipped with an adaptive penalty method (I-GDE3+APM), and introduced an interactive reference-point-based method for incorporating decision maker preferences. The authors in [4] proposed a new approach to interactive evolutionary multi-objective optimization guided by a preference elicitation procedure inspired by artificial intelligence and designed in line with decision psychology. The authors in [6] proposed an interactive final solution selection (IFSS) method for multi-objective optimization. The proposed IFSS method aims to provide a good final solution through several interactions with the DM; in other words, the DM can obtain a satisfying solution after evaluating only a small number of solutions, even without providing clearly specified preferences. Furthermore, a calibration strategy is introduced to significantly improve the algorithm’s performance by slightly increasing the number of interactions. The authors in [12] introduced an interactive method for multi-objective evolutionary algorithms using a buffer as reference information. The authors in [11] proposed an interactive method using a set of rays generated from reference points given by the DM. These rays replace the current original rays in objective space. Based on the new distribution of rays, a niching mechanism is applied to control the external population (the archive) and the next generation, so that convergence is prioritized toward the DM’s preferred region. The authors in [5] analyzed the factors affecting algorithm quality and the role of DMs in visual reviewing, and proposed an interactive method to adjust algorithms to improve quality while also meeting the actual requirements of the decision maker.

2.3 The Usage of Reference Information


Through the survey of adaptive and interactive self-adjustment methods of MOEAs in recent publications, it can be seen that reference information is quite commonly used to control evolutionary algorithms so as to achieve both the balancing adjustment and the decision maker’s goal.
With algorithms using adaptive methods, the authors use measures that evaluate the quality of the population, the exploration and exploitation capabilities of the current search process, forecasts, and change trends to propose strategies for adjusting the evolutionary process in a beneficial direction. Here, based on quality characteristics of the population (solution set) obtained through measurements, the distribution of solutions in the objective space, the proportion of non-dominated individuals, and so on, reference information is determined in the form of reference points, reference lines, thresholds of penalty functions, etc., to adjust the evolutionary (search) process so that it simultaneously ensures the balance between convergence and diversity of the population as well as the exploration and exploitation capabilities of the search process.
In terms of the interaction aspect of MOEAs, the decision maker’s wishes are typically expressed as ranges of objective function values, thereby defining reference information in the form of reference points, reference lines, reference regions, or penalty function values, which is used to adjust the search process in the direction preferred by that reference information. In both cases, the principles of each algorithm are analyzed: the researchers determine the relationship between the reference information and the control parameters (or parameters affecting the search process) and devise strategies, applied according to an evaluation cycle for adaptive adjustment or at any time the decision maker wants to interact, to adjust the algorithm in the desired direction. At the adjustment point, the reference information is presented to the decision maker visually or through tables of values, or fed to the adjustment functions as input. Finally, the new control parameter values are determined and applied to the next search phase of the algorithm, and so on until the search process ends.
Using MOEAs in practice raises the question of how to harmoniously use the reference information from the adaptive self-adjustment process and the reference information from the decision maker’s wishes, and when the appropriate time for self-adjustment is.

3 Methodology
3.1 Combining Reference Information
In this section, we propose a method to use appropriate reference information to adjust the algorithm so as to meet both the requirements of self-adjustment and the wishes of the decision maker. First of all, it must be emphasized that both the reference information provided by the self-adjustment process and that provided by the decision maker are very important, in theory and in practice, for the effectiveness of the algorithm. As shown by the survey, the use of multiple reference points is a quite popular interactive technique [2, 5, 7, 9, 12, 15]. Therefore, in our proposal, we combine the information from the adaptive adjustment process and the interactive information of the decision maker of a multi-objective evolutionary algorithm using adaptive reference points, according to the stage of the evolution process.
Mathematically, we define it as follows: the set P_j consists of m elements, P_j = (p_{1j}, p_{2j}, ..., p_{mj}), which are reference points generated from the adaptive adjustment process; the set Q_j consists of n elements, Q_j = (q_{1j}, q_{2j}, ..., q_{nj}), which are reference points from the desired regions of the decision maker at the j-th interaction. We define two sets of points that are combined from the two reference sets P_j and Q_j: X_j is the set consisting of the reference points located in the union search region bounded by the two sets P_j and Q_j, and Y_j is the set consisting of the reference points located in the intersection search region bounded by the two sets P_j and Q_j. We have:

X_j = P_j ∪ Q_j    (2)

The set X_j consists of k = m + n elements, X_j = (x_{1j}, x_{2j}, ..., x_{kj}), x_{ij} ∈ (P_j, Q_j), ∀i, j.

Y_j = P_j ∩ Q_j    (3)

The set Y_j consists of g elements, Y_j = (y_{1j}, y_{2j}, ..., y_{gj}), y_{ij} ∈ (P_j, Q_j), ∀i, j.
In the union and intersection operations, the reference point sets introduced by the decision maker and by the adaptive adjustment process are combined as follows: the set X resulting from the union operator of the two sets P and Q includes all reference points from both P and Q; the intersection set Y resulting from the intersection operator between P and Q is defined as the set of reference points that lie within the largest envelope of the two sets, as illustrated in Fig. 1.

Fig. 1. The illustration of the union and intersection operators
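A minimal sketch of the two operators is given below, interpreting the envelope of a point set as its axis-aligned bounding box in objective space; this interpretation and the function names are assumptions of the illustration.

import numpy as np

def union_reference_points(P, Q):
    # Eq. (2): X = P U Q -- all reference points from both sets (rows are points).
    return np.vstack([P, Q])

def intersection_reference_points(P, Q):
    # Eq. (3): Y = P intersect Q -- points of P U Q lying in the overlap of the
    # bounding envelopes (axis-aligned bounding boxes) of P and Q.
    lo = np.maximum(P.min(axis=0), Q.min(axis=0))
    hi = np.minimum(P.max(axis=0), Q.max(axis=0))
    X = np.vstack([P, Q])
    inside = np.all((X >= lo) & (X <= hi), axis=1)
    return X[inside]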

We can see that, in the early generations of the evolution process, the solutions are still quite scattered and far from the Pareto optimal region. Therefore, the role of auto-adjustment is very important to ensure the balance between the ability to search extensively (the ability to explore) and the ability to converge quickly (the ability to exploit). On the contrary, in the later stages of the evolution process, the solutions have approached the Pareto optimal region and the impact of auto-adjustment is small. Therefore, we propose to use the two sets X and Y according to the following strategy. We divide the evolution process of the algorithm into an early phase and a late phase based on the number of generations or the number of evaluations (the algorithm can be designed according to the number of generations or the number of fitness function evaluations). At the interactions in the early phase of the evolution process, we use the set X as the reference set to adjust the search process of the algorithm, and at the interactions in the late phase, we use the set Y as the reference set for adjustment.

3.2 Guiding Evolutionary Process

After the reference information combining the adaptive adjustment information and the decision maker’s information has been determined, the next step is to use this combined reference information to adjust the evolutionary process. In principle, reference information can be expressed in many forms, such as trade-off information, reference points, reference directions, and classifications of objective functions. Depending on the mechanism of each algorithm, this information is used to guide the search direction or the search area, giving priority to fast convergence and an even spread over the region created by the references.

Fig. 2. The illustration of determining reference points via empty regions

Normally, niching functions are used to exploit this reference information to guide the search process in combination with the algorithm’s evolutionary operators such as selection, mutation, and crossover. Therefore, in the step of guiding the algorithm according to the synthesized reference information in our proposal, it is necessary to evaluate the control information so that the algorithm is oriented toward the newly created reference region. This method is basically the same as the use of reference information in the existing interactive and adaptive adjustment proposals. For example, the DMEA-II+ algorithm [2] uses reference points (called bliss points) to create a new ray system through a mechanism that automatically determines the reference points from the centers of the empty regions, as drawn in Fig. 2; the MOEA/D++ algorithm calculates the reference points of its decomposition subproblems using information combined with the input reference point set [9].

4 Experiments
4.1 Building up on Reference Points
To implement the technique, in this paper we apply it to two algorithms, MOEA/D [17] and DMEA-II [10], on the ZDT benchmark set [19]; both are recent competitive algorithms in the field of multi-objective optimization that use reference points as reference information in self-adjustment and interaction.
The first experiment is conducted using the DMEA-II algorithm. The algorithm is run for 1000 generations, divided into two testing phases: the early phase from generation 1 to 500 and the late phase from generation 501 to 1000. In the early phase, interactions are performed at generations 150 and 400; in the late phase, at generations 600, 700, and 800. The bliss points at each interaction time are determined together with the reference points given by the decision maker in the objective space; following the method described above, we use the union of the two sets of reference points (set X) at the interactions in the early phase, and the intersection of the two sets of reference points (set Y) in the late phase. When interacting with DMEA-II using the sets X and Y, a new ray system is built by adding rays passing through the reference points and removing rays in regions far from the area containing the reference point set. At each interaction, the reference points (bliss points) of the adaptive adjustment process are derived using the method defined in [2]. The newly created ray system serves as a reference for the RD (Ray-based density) niching function used to control the algorithm’s search process (as in Fig. 3).
Similarly, in the experiment with the MOEA/D algorithm, the reference points of the adaptive adjustment process are generated at each interaction time according to the method in [9]. Combined with the reference points visually inserted by the decision maker into the objective space, the union of the two sets of reference points (X) and their intersection (Y) are generated and used in the early and late phases as described in the method. In analogy with the determination of bliss points as reference points in DMEA-II through the “empty region” concept, the reference points of the adaptive auto-tuning process of the MOEA/D algorithm are determined as in [9], together with the reference points introduced by the decision maker, according to the two-phase strategy of the search process proposed in this paper. At each interaction, the sets X and Y are selected together with the reference points currently used by the algorithm and synthesized to create a new reference point. This final reference point is used to guide the algorithm so that it prioritizes the choice of solutions for the evolution process according to the algorithm’s decomposition mechanism (as in Fig. 4).

Fig. 3. The illustration of determining the new system of rays

Fig. 4. The illustration of determining the new reference points

4.2 Results

Experiments were carried out with the two algorithms DMEA-II and MOEA/D on five problems of the ZDT benchmark set, using the technique of combining reference information in the form of reference points derived from the adaptive adjustment process and from interaction with the decision maker, following the two-phase strategy of the evolutionary process. The results, obtained through visualization of the distribution of solutions after each interaction, can be summarized as follows. In the reference information fusion step, the strategy uses the union of the two sets according to formula (2) in the early phase, in order to simultaneously satisfy the priority requirements of the adaptive adjustment process and the decision maker as well as the need for global search information. In the late phase, the intersection of the two sets according to formula (3) is used to prioritize solutions in the desired regions of the decision maker, when the auto-adaptation process has only a weak impact in the late stages of the evolution.
In the step of guiding the algorithm according to the combined reference information, depending on each algorithm’s evolutionary mechanism, that information becomes important control information that gives priority to individuals lying in the regions determined by the adaptive adjustment process and by the decision maker’s desire in the objective space. Specifically, with DMEA-II, through the niching mechanism based on the RD (Ray-based density) [10], the algorithm creates a new solution set that converges to and spreads evenly over the part of the Pareto front corresponding to the region of reference points synthesized from the self-adjustment process and entered by the decision maker. With MOEA/D, the mechanism relies on reference points (or ideal points) obtained by combining the current reference point of the algorithm with the reference point synthesized from the center of the envelope containing the reference points produced by formulas (2) and (3), corresponding to the early and late phases of the evolution process. The new reference point created in this way serves as the basis for the algorithm to determine neighboring solutions when decomposing the subproblems in the algorithm’s decomposition mechanism.
In general, the results were obtained through four interactions with random points in the objective space in the early and late phases of the two experimental algorithms, using the mechanism of reference points in adaptive adjustment and interaction. Visual observation of the graphical representation in the objective space clearly shows the adjustment toward the decision maker’s desired region while, importantly, simultaneously maintaining the convergence speed and the dispersion, that is, the diversity of solutions evenly distributed along the Pareto optimal front. The results confirm the positive impact of combining the reference information of the adaptive adjustment process with the reference information provided by the decision maker, which has practical value for real-world problems.

5 Conclusion

Simultaneously ensuring the adaptive adjustment of the algorithm, which maintains population quality and search ability, and meeting the wishes of the decision maker is very important when MOPs are solved by evolutionary algorithms in practice. This paper has proposed a method for synthesizing reference information in the form of reference points from the adaptive control process and from the decision maker. The synthesis of reference information is carried out with a two-stage evolutionary strategy, an early stage and a late stage, which suits the evolutionary characteristics of the population and still meets the above simultaneous requirements. The reference information in the form of reference points is used, according to each algorithm’s mechanism, to guide the algorithm so that it meets the requirements of each stage at the interaction times. The experimental results with the DMEA-II and MOEA/D algorithms in two-dimensional objective space have confirmed the significance of the process of combining reference information, which is valuable for practical problems. Further studies can extend the approach to algorithms with multidimensional objective spaces and to reference information represented as reference lines and value functions, in order to effectively combine reference information from the adaptive adjustment process and from the decision maker’s wishes.

Acknowledgments. The experiments reported in this paper were conducted at the Laboratory of the Faculty of Logistics and Engineering, Vietnam National Defense Academy.

References
1. Binh, M., Nguyen, L.: An approach to enhance the equilibrium of search capa-
bilities for multi-objective evolutionary algorithms based on differential evolution.
In: 2024 7th International Conference on Information and Computer Technolo-
gies (ICICT), pp. 145–150. IEEE Computer Society, Los Alamitos (2024). https://
doi.org/10.1109/ICICT62343.2024.00029. https://doi.ieeecomputersociety.org/10.
1109/ICICT62343.2024.00029
2. Binh, M.T., Nguyen, L., Duc, D.N.: Using bliss points to enhance direction based
multi-objective algorithms. In: 2022 14th International Conference on Knowl-
edge and Systems Engineering (KSE), pp. 1–6 (2022). https://doi.org/10.1109/
KSE56063.2022.9953747
3. Binh, M.T., Nguyen, L., Duc, D.N.: An approach to maintain the balance between
exploitation and exploration of the evolutionary process in multi-objective algo-
rithms. In: 2023 6th International Conference on Information and Computer Tech-
nologies (ICICT), pp. 29–34. IEEE (2023)
4. Corrente, S., Greco, S., Matarazzo, B., Slowiński, R.: Explainable interactive evo-
lutionary multiobjective optimization. Omega 122, 102925 (2024)
5. Duc, D.N., Nguyen, L., Trung, K.T.: An interactive method for surrogate-assisted
multi-objective evolutionary algorithms. In: 2020 12th International Conference on
Knowledge and Systems Engineering (KSE), pp. 195–200. IEEE (2020)
6. Gong, C., Nan, Y., Shu, T., Pang, L.M., Ishibuchi, H., Zhang, Q.: Interactive
final solution selection in multi-objective optimization. In: 2024 IEEE Congress on
Evolutionary Computation (CEC), pp. 1–9. IEEE (2024)
7. Li, S., Zhang, Y., Wang, Q., He, L., Li, H., Ye, B.: A surrogate-assisted multi-
objective evolutionary algorithm guided by hybrid reference points. In: Tan, Y.,
Shi, Y. (eds.) Advances in Swarm Intelligence (2024)
8. Li, W., Tang, J., Wang, L.: Many-objective evolutionary algorithm with multi-
strategy selection mechanism and adaptive reproduction operation. J. Supercom-
put. 1–48 (2024)
9. Minh, T.B., Long, N., Kien, T.T.: An adaptive reference point technique to improve
the quality of decomposition based multi-objective evolutionary algorithm. J. Mil.
Sci. Technol. (CSCE7) 3–14 (2023)
10. Nguyen, L., Bui, L.T., Abbass, H.A.: DMEA-II: the direction-based multi-objective
evolutionary algorithm-II. Soft. Comput. 18(11), 2119–2134 (2014)
11. Nguyen, L., Bui, L.T.: A ray based interactive method for direction based multi-
objective evolutionary algorithm. In: Huynh, V.N., Denoeux, T., Tran, D.H., Le,
A.C., Pham, S.B. (eds.) Knowledge and Systems Engineering. AISC, vol. 245, pp.
173–184. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-02821-7_17
12. Nguyen, L., Duc, D.N., Thanh, H.N.: An enhanced multi-point interactive method
for multi-objective evolutionary algorithms. In: Satapathy, S.C., Bhateja, V.,
Nguyen, B.L., Nguyen, N.G., Le, D.-N. (eds.) Frontiers in Intelligent Computing:
Theory and Applications. AISC, vol. 1013, pp. 42–49. Springer, Singapore (2020).
https://doi.org/10.1007/978-981-32-9186-7_5
13. Rudolph, G., Wagner, M.: Towards adaptation in multiobjective evolutionary algo-
rithms for integer problems. In: 2024 IEEE Congress on Evolutionary Computation
(CEC), pp. 1–8. IEEE (2024)
14. Vargas, D.E., Lemonge, A.C., Barbosa, H.J., Bernardino, H.S.: An interactive
reference-point-based method for incorporating user preferences in multi-objective
structural optimization problems. Appl. Soft Comput. 112106 (2024)
15. Vargas, D.E., Lemonge, A.C., Barbosa, H.J., Bernardino, H.S.: An interac-
tive reference-point-based method for incorporating user preferences in multi-
objective structural optimization problems. Appl. Soft Comput. 165, 112106
(2024). https://doi.org/10.1016/j.asoc.2024.112106. https://www.sciencedirect.
com/science/article/pii/S1568494624008809
16. Zhang, Q., Jiao, R., Zeng, S., Zeng, Z.: Balancing exploration and exploitation
with decomposition-based dynamic multi-objective evolutionary algorithm. Int. J.
Cogn. Inform. Nat. Intell. (IJCINI) 15(4), 1–23 (2021)
17. Zhang, Q., Li, H.: MOEA/D: a multiobjective evolutionary algorithm based on
decomposition. IEEE Trans. Evol. Comput. 11, 712–731 (2008). https://doi.org/
10.1109/TEVC.2007.892759
18. Zhao, W., Bian, X., Mei, X.: An adaptive multi-objective genetic algorithm for
solving heterogeneous green city vehicle routing problem. Appl. Sci. 14(15), 6594
(2024)
19. Zitzler, E., Thiele, L., Deb, K.: Comparison of multiobjective evolutionary algorithms: empirical results. Evol. Comput. 8(1), 173–195 (2000)
Modeling Information Diffusion
in Bibliographic Networks Using
Pretopology

Thi Kim Thoa Ho1(B), Quang Vu Bui2, and Marc Bui3

1 Informatic Department, Hue University of Education, Hue University, Hue City, Vietnam
[email protected]
2 Mathematic Department, Hue University of Sciences, Hue University, Hue City, Vietnam
[email protected]
3 EPHE, PSL University, AOROC UMR 8546 CNRS-ENS-EPHE, Paris, France
[email protected]

Abstract. In this research, we propose a novel approach to model information diffusion on bibliographic networks using pretopology theory. We propose a pretopological independent cascade model, namely Preto_IC, which is a variation of the independent cascade model (IC). We apply pretopology to model the structure of heterogeneous bibliographic networks since it is a powerful mathematical tool for complex network analysis. The highlights of Preto_IC are that the propagation process is simulated over multiple relations and that the concept of elementary closed subset is applied to capture the seed set. In the first step, we construct a pretopological space to represent a heterogeneous bibliographic network. In this space, we define a strong pseudo-closure function to capture the neighborhood set of a set A. Next, we propose a new method to choose the seed set based on the elementary closed subsets. Finally, we simulate Preto_IC with the seed set from step (2) and, for each step t of propagation, determine the neighborhood set for infection based on the pseudo-closure function defined in step (1). We experiment on three real datasets and demonstrate the effectiveness of Preto_IC compared with the IC model using existing methods of seed set selection.

Keywords: Information Diffusion · Heterogeneous Network · Bibliographic Network · Independent Cascade Model · Pretopology

1 Introduction
1.1 Problem Definition
Information diffusion is a process in which information is propagated from one
object to another in a network. Numerous fields, including social science [10, 17],
computer science [20], and the medical field [2, 12], have extensively researched information dissemination.
A node participating in the information propagation process is in one of two states: active or inactive. A node is active if it has already taken the action related to the diffused information, and inactive otherwise. For instance, a scientist is considered active with respect to the topic "deep learning" once he or she has studied and published articles related to that topic; in a marketing campaign, a customer is marked as active after buying the product.
The majority of previous studies have focused on homogeneous networks, which are networks with only one kind of object and one kind of connection. Examples include co-author networks, with author objects and co-author links, or Twitter networks, with user objects and follow links. The research can be divided into two branches: diffusion models and influence maximization. Various diffusion models have been proposed, such as the independent cascade model (IC) and the linear threshold model (LT) [8, 10, 17]. Moreover, scientists have proposed diverse algorithms for influence maximization [9, 14, 16, 21], which addresses the challenge of identifying a small subset of nodes with the greatest potential for spreading influence.
However, the majority of networks in reality are heterogeneous, containing
a variety of object types and multiple relations. For example, a bibliographic
network is a heterogeneous network that includes a diversity of objects includ-
ing authors, papers, venues, affiliations, and so forth, and several relationships
among objects, such as co-author, common co-author, and so on. In this study,
we focus on the dissemination in heterogeneous bibliographic networks since this
problem plays a significant role in promoting scientific development and research
collaboration.
There are several studies on information diffusion in bibliographic networks [6, 11, 13, 18, 19]. In those studies, the authors proposed methods to estimate the activation probability from an active node to an inactive node based on meta-paths. However, the propagation process is simulated using the IC and LT models, which determine the neighbor set for spreading based on only one relation (co-authorship) and use graph theory to model the network structure.
Graph theory has been widely utilized to model network structure. However, it has two drawbacks. First, the closure function in a topological space is idempotent, meaning that the closure of a set X is reached in a single step; as a result, we are unable to observe a set's extensibility gradually. Second, in graph theory the neighborhood set of a set X is simply the union of the neighbors of all elements of X, whereas in real-world networks the identification of a set's neighbor set is more intricate. Since pretopology theory has been proved to be an extension of graph theory [4, 5], it provides a solution to these problems. Thus, in this study we suggest using pretopology to analyze information spread on bibliographic networks.
In this study, we propose an information diffusion model for bibliographic networks using pretopology. This model is the pretopological independent cascade model, namely Preto_IC, an extension of IC. The highlights of Preto_IC are seed set selection based on the concept of elementary closed subset and the identification of a node's neighbor set for spreading over multiple relations via a pseudo-closure function. First, we define a strong pseudo-closure function a_s(.) to capture the neighborhood set of a set A; this function is constructed from multiple relations. Next, we apply the concept of elementary closed subset to calculate the maximum extensibility of each node and choose the seed set from the nodes with the highest extensibility. Finally, the Preto_IC model follows IC with the seed set from the previous step, and the neighbor set of each node at each spreading step is determined by a_s(.). Experimental results demonstrate that the Preto_IC model obtains a higher influence spread in comparison with previous models.
The structure of our paper is organized as follows: Sect. 1 introduces the
problem definition and related works; Sect. 2 reviews preliminaries; our approach
is proposed in Sect. 3; Sect. 4 illustrates experiments and results; we conclude our
work in Sect. 5.

1.2 Related Works


Research on information diffusion has attracted numerous scientists and contributes to diverse areas of society, including social science, marketing, and computer science. The majority of studies concentrate on questions such as which kind of information spreads fastest, which objects contribute most to influence maximization, and which models best mimic propagation. The answers to these questions can be found in smaller branches of information diffusion research, including epidemic spread models, influence maximization, and predictive models.
Various diffusion models have been proposed such as independent cascade
model (IC) [8], the linear threshold model (LT) [10, 17], general threshold model
[14], and so on. Those models are simulated on both homogeneous networks
[9, 14–16, 21] and heterogeneous networks [6, 11, 13, 18, 19].
Research on homogeneous networks has concentrated on the influence maximization problem, proposing algorithms that select the set of nodes initiating the spread process so as to obtain maximum infection. Numerous influence maximization algorithms have been proposed, including random heuristics [14], centrality-based heuristics [7], the greedy algorithm [14], and its improved versions such as CELF, CELF++, TIM, and TIM+ [9, 16, 21]. Surveys of influence maximization algorithms can be found in [3].
For heterogeneous networks, on the other hand, scientists have focused on methods to measure the infection probability of an inactive node based on meta-path or textual information. There are extended models of IC, such as the Heterogeneous Probability Model - IC (HPM-IC) [18], where the activation probability is determined by a conditional probability based on meta-path information, and Homophily Independent Cascade Diffusion (TextualHomo-IC) [13], where the probability of infection is calculated from textual data. Additionally, there are two expanded models of the LT model: the Heterogeneous Probability Model - LT (HPM-LT) [18] and the Multi-Relational Linear Threshold Model - Relation Level Aggregation (MLTM-R) [11]. Besides diffusion prediction models, machine learning and deep learning have been utilized to predict information propagation, including the spreading of a tweet on Twitter [22], information diffusion on GitHub [1], and predicting information propagation on a bibliographic network [6, 19].
Topic diffusion on heterogeneous bibliographic networks has been studied under various spreading models, including the IC model, the LT model, and their extensions. However, the propagation process is simulated on a network structure modeled with graph theory, whose drawbacks we analyzed in the problem definition. Therefore, in this study, we propose a novel approach to modeling the diffusion process on bibliographic networks using pretopology theory.

2 Preliminaries
2.1 Heterogeneous Bibliographic Network

A bibliographic network is a typical instance of a heterogeneous network. In


this network, there are multiple objects, including authors, articles, affiliations,
conferences, etc. Moreover, researchers could be connected through dissimilar
relationships: co-authorship, citation, common co-authors, co-attending confer-
ences, etc. Several typical relationships in bibliographic networks: co-authorship
(author-paper-author (APA)), common co-authors (author - paper - author -
paper - author (APAPA)), co-attending conferences (author - paper - venue
- paper - author (APVPA)), same affiliations (author - affiliation - author
(AAFA)), citation relationships (author - paper - paper - author (APPA)).
Figure 1 illustrates a real-world example where authors are linked by five types
of relationships.

Fig. 1. A real-world example of bibliographic network


2.2 Independent Cascade Model (IC)


Assume a network G = (V, E, P), where the infection probability function is P : V × V → [0, 1]. The likelihood of node u infecting node v is denoted by P(u, v). To start the diffusion process, we select a set of nodes and assign them the active state; this set is called the seed set. Diffusion happens in discrete time steps denoted by t, and X_t represents the set of active nodes at step t. At each step t, every u ∈ X^newest, where X^newest is the set of nodes newly activated at step t − 1, infects its inactive neighbors v ∈ η^out(u) with probability P(u, v).
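To make the cascade mechanics concrete, the following minimal Python sketch performs one IC run under the definitions above; the graph representation (a dictionary of out-neighbors) and the per-edge probability map are our own illustrative assumptions rather than part of the original formulation.

import random

def run_ic(out_neighbors, edge_prob, seed_set, rng=random.Random(0)):
    # One Monte Carlo run of the Independent Cascade model.
    # out_neighbors: dict mapping node u -> iterable of out-neighbors (eta_out(u)).
    # edge_prob:     dict mapping (u, v) -> activation probability P(u, v).
    # seed_set:      nodes that start in the active state.
    active = set(seed_set)      # X_total: all nodes activated so far
    newest = set(seed_set)      # X_newest: nodes activated in the previous step
    while newest:
        activated_now = set()
        for u in newest:
            for v in out_neighbors.get(u, ()):
                if v in active:
                    continue    # a node can only be activated once
                if rng.random() <= edge_prob.get((u, v), 0.0):
                    activated_now.add(v)
        active |= activated_now
        newest = activated_now  # only newly activated nodes spread at the next step
    return active

Averaging the size of the returned set over many runs gives the Monte Carlo estimate of a seed set's influence spread.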
Various methods have been proposed for seed set selection to achieve influence maximization, including the centrality-based method, the basic greedy method, CELF, CELF++, TIM, and TIM+. We review typical methods below:
– Centrality-based method (HD) [7]: This is a fundamental method in influence maximization in which the seed set is chosen from the k highest-degree nodes.
– Basic greedy [14]: The idea is to start with an empty seed set S and iteratively add the node not in S whose addition to S achieves the highest marginal increase in influence spread. The algorithm is described in detail in [14].
– CELF [16]: This is an improved version of the basic greedy algorithm. Its primary idea is that a node's marginal gain in influence spread in the current iteration cannot exceed its marginal gain from previous iterations. Therefore, the algorithm avoids a significant number of influence spread estimations and improves the running time. It is described in detail in [16].
– CELF++ [9]: This is an optimized version of CELF. Unlike CELF, CELF++ keeps all nodes that need to be investigated in a table Q of tuples <u.mg1, u.prev_best, u.mg2, u.flag>. The fundamental principle of CELF++ is that the marginal gain of u with respect to S + prev_best in the next iteration does not need to be recalculated if the node u.prev_best is selected as a seed in the current iteration. The algorithm is described in detail in [9].
– TIM & TIM+ [21]: The TIM algorithm includes two phases. The first phase calculates a lower bound of the maximum expected spread among all size-k node sets and uses this lower bound to determine a parameter θ. In the second phase, θ random reverse reachability (RR) sets are sampled from G, and a seed set S_k* of size k is generated to cover the maximum number of RR sets. TIM has been proven to run faster than CELF++ while achieving the same influence spread. Tang et al. also improved the TIM algorithm in another version, namely TIM+, which has been proven to be faster than TIM. The TIM and TIM+ algorithms are described in detail in [21].

2.3 Pretopology
Pretopology [4] is considered an expansion of topology obtained by relaxing its axiomatic constraints. It is a powerful mathematical foundation for the notion of proximity and allows the gradual observation of a set's extensibility.

Definition 1. A pretopological space is an ordered pair (V, a), in which V is a set and a : P(V) → P(V) is a pseudo-closure operator fulfilling the following two axioms:

(P1): a(∅) = ∅ (preservation of the nullary union);
(P2): A ⊂ a(A), ∀A ⊂ V (extensivity).

Fig. 2. Pseudo-closure and closure function

We can see from its definition that the pseudo-closure a(.) is an expansion that is not idempotent (see Fig. 2a). Therefore, we can compute a(A), a(a(A)), a(a(a(A))), ..., a^l(A), contrary to what happens in topology (Fig. 2b), and thus observe a gradual growth.

Definition 2. Let (V, a) be a pretopological space and A ⊂ V. A is a closed subset if and only if a(A) = A.

Definition 3. In a pretopological space (V, a), the closure of A is the smallest closed subset containing A. It is denoted by F(A).

Definition 4. An elementary closed subset, denoted F_u, is the closure of a one-element set {u}, u ∈ V. We denote by F_V the family of elementary closed subsets: F_V = {F_u | u ∈ V}.

Pretopology is also a tool for analyzing complex networks [5]: a network can be modeled with different spaces, including binary relation spaces, valued relation spaces, etc., and various pseudo-closure functions can be defined to capture the neighbor set of a set of nodes.

3 Our Approach
In this section, we propose the pretopological independent cascade model, namely Preto_IC. This is an expanded model of the independent cascade model designed to model propagation on a heterogeneous network. There are two novel points in Preto_IC compared with IC. First, the seed set is selected based on elementary closed subsets. Second, the neighborhood set of an active node used to trigger infection is identified from multiple relations; this is computed using the strong pseudo-closure function.
To execute the Preto_IC model, we perform the following steps:
1. Construct a pretopological space in which we define a strong pseudo-closure function. This function is used to determine the neighborhood set of a set A and its closure. This step is described in Subsect. 3.1.
2. Calculate the elementary closed subsets based on the strong pseudo-closure function defined in Step 1 and select a seed set. This step is illustrated in Subsect. 3.2.
3. Simulate propagation with Preto_IC. This step is described in Subsect. 3.3.

3.1 Strong Pseudo-closure Function


Assume that we have a set of nodes V, a family of binary reflexive relations (R_i)_{i=1,...,n}, and a set of relation indexes S_n = {1, 2, ..., n}. We can construct a pretopological structure for any relation R_i by considering, for all i ∈ S_n and u ∈ V, the subset V_i(u) defined by:

V_i(u) = {v ∈ V | u R_i v}

We define a strong pseudo-closure function a_s(.) in Eq. 1.

a_s(A) = A ∪ {u ∈ V \ A | ∃S ⊂ S_n, |S| ≥ s, ∀i ∈ S, V_i(u) ∩ A ≠ ∅},  ∀A ⊂ V   (1)

The expansion of set A at several strength levels is represented by a_s(.). The threshold value s indicates how many relations an element u must satisfy to be accepted into the neighborhood set of A.
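As a concrete illustration of Eq. (1), the sketch below computes a_s(A) when the network is given as one adjacency dictionary per relation; the data layout and the function name are our own assumptions, and the extensive form (returning A together with the newly admitted nodes) follows our reading of the definition.

def strong_pseudo_closure(A, neighbors_by_relation, s):
    # Strong pseudo-closure a_s(A).
    # A:                     set of nodes.
    # neighbors_by_relation: list of dicts, one per relation R_i, mapping u -> set V_i(u).
    # s:                     strength level, i.e. the minimum number of relations through
    #                        which a node u must touch A in order to be admitted.
    A = set(A)
    result = set(A)
    candidates = set().union(*(rel.keys() for rel in neighbors_by_relation)) - A
    for u in candidates:
        # count the relations i for which V_i(u) intersects A
        hits = sum(1 for rel in neighbors_by_relation if rel.get(u, set()) & A)
        if hits >= s:
            result.add(u)
    return result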

3.2 Maximum Expansion Method Based on Elementary Closed Subset for Seed Set Selection
In this research, we propose a novel method to choose a seed set for the diffusion process, namely the elementary closed subset (ECS) method. This method is based on the maximum extensibility of each node. First, we estimate the elementary closed subset of each node u ∈ V. The elementary closed subset F_u is the closure of the one-element set {u} (see Definition 4). F_u represents the maximum extensibility of node u and is calculated with the strong pseudo-closure function (Subsect. 3.1). Next, we rank nodes by the size of their elementary closed subsets. Finally, we select the k

Algorithm 1. Elementary closed subsets for seed set

Require: R = {R_1, R_2, ..., R_n}, S_n, G = (V, {V_i}), s
procedure ECS(G, R, S_n, s, k)
    F_V = {}, aA = {}, S = ∅
    for u ∈ V do
        F_u = a_s({u}, s); aA[u] = F_u
        while a_s(F_u, s) ≠ F_u do
            F_u = a_s(F_u, s)
        end while
        F_V[u] = |F_u|
    end for
    F_V = sort(F_V)                  ▷ sort dictionary descending by value
    for item ∈ F_V[: k] do
        S = S ∪ item[0]
    end for
    return S, aA                     ▷ Output
end procedure

Algorithm 2. Preto_IC

Require: G = (V, {V_i}), S, aA
1: procedure Preto_IC(G, S, aA)
2:     t ← 0, X^total ← S, X^newest ← S
3:     while infection occurs do
4:         t ← t + 1; X_t ← ∅
5:         for u ∈ X^newest do
6:             X_t(u) ← {v ∈ aA[u], v inactive | q ≤ p}, with p, q ∼ U(0, 1)
7:             X_t ← X_t ∪ X_t(u)
8:         end for
9:         X^total ← X^total ∪ X_t; X^newest ← X_t
10:    end while
11:    return X^total                ▷ Output
12: end procedure

nodes from V with the largest elementary closed subsets as the seed set for the propagation process. The ECS procedure is described in Algorithm 1.
Moreover, Algorithm 1 also returns a dictionary aA that maps each node to its neighbor set. Each node's neighbor set in aA is calculated by the pseudo-closure function a_s(.). aA is also an input of Preto_IC, which we describe in the next subsection.

3.3 Pretopological Independent Cascade Model for Bibliographic Networks

In this subsection, we illustrate the Preto_IC model (Algorithm 2). The inputs of Algorithm 2 are the outputs of Algorithm 1.

Table 1. Statistics for the three datasets

                         Data1   Data2   Data3
Number of nodes          704     3079    13605
Number of edges          2109    10006   29637
Clustering coefficient   0.802   0.769   0.713
Average degree           5.99    6.5     4.357

At each step t, where X^newest is the set of nodes newly activated at time t − 1, each u ∈ X^newest activates its inactive neighbors v ∈ aA[u] with probability P(u, v). The spreading continues until no more infections can happen.
The novel points of Preto_IC are that the seed set S is obtained from elementary closed subsets and that the neighbor set of each node u ∈ X^newest used for infection is determined over multiple relations through the pseudo-closure function. Line 6 of Algorithm 2 illustrates this task, in which aA is the dictionary of neighbor sets calculated with the pseudo-closure function a_s(.) in Algorithm 1.
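Putting Algorithms 1 and 2 together, a compact Python sketch of the whole pipeline could look as follows; it reuses the hypothetical strong_pseudo_closure function from the previous sketch, and the dictionary layout, names and random-number handling are our own choices rather than the authors' implementation.

import random

def ecs_seed_set(nodes, neighbors_by_relation, s, k):
    # Algorithm 1 (sketch): rank nodes by the size of their elementary closed
    # subset F_u and return the k best nodes together with the neighbor map aA.
    aA, closure_size = {}, {}
    for u in nodes:
        F_u = strong_pseudo_closure({u}, neighbors_by_relation, s)
        aA[u] = set(F_u)                  # one-step pseudo-closure of {u}
        while True:                       # iterate a_s up to the fixed point F_u
            nxt = strong_pseudo_closure(F_u, neighbors_by_relation, s)
            if nxt == F_u:
                break
            F_u = nxt
        closure_size[u] = len(F_u)
    ranked = sorted(nodes, key=lambda u: closure_size[u], reverse=True)
    return set(ranked[:k]), aA

def preto_ic(seed_set, aA, p, rng=random.Random(0)):
    # Algorithm 2 (sketch): IC-style cascade in which the candidates infected by u
    # are drawn from aA[u], i.e. from its multi-relation pseudo-closure.
    total, newest = set(seed_set), set(seed_set)
    while newest:
        step = set()
        for u in newest:
            for v in aA.get(u, set()):
                if v not in total and rng.random() <= p:
                    step.add(v)
        total |= step
        newest = step
    return total

Averaging the size of the set returned by preto_ic over repeated runs yields the influence spread measure reported in the experiments.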

4 Experiments and Results


4.1 Dataset

The real-world data in our experiments come from the "DBLP-SIGWEB.zip" dataset, from which three distinct data sets were created. The first data set (data1) is generated from 10 random nodes with degree larger than 10, then extended with the co-authorship of those authors. The second (data2) is built similarly to data1, but it is initialized with 50 random nodes. Lastly, the third (data3) corresponds to the entire "DBLP-SIGWEB.zip" dataset. We retrieve the relevant metadata associated with the network's authors. Table 1 gives summary statistics for the three datasets.

4.2 Experimental Setting

We consider four relations between authors, R = {APA, APAPA, APVPA, AAFA}. We run Algorithm 1 to get seed sets of sizes k = 10 and 50, using the strong pseudo-closure function a_s(.) with strength level s = 2.
To evaluate the performance of the Preto_IC model, we also implement the IC model with several previous methods for seed set selection, namely HD, CELF++, TIM+, and ECS, yielding the models HD_IC, CELF++_IC, TIM+_IC, and ECS_IC, respectively. In these models, the neighbor set is determined from the single co-author relation.
With all the above diffusion models, we conduct experiments with 1000 Monte Carlo simulations and report the average influence spread. We use a propagation probability p = 0.5.

4.3 Results and Evaluation

Experimental results are shown in Fig. 3, Fig. 4 and Fig. 5. We evaluate the performance of the models based on two criteria: influence spread and running time. From the figures, it can be seen that the Preto_IC model is more efficient than the other models in terms of influence spread. However, the running time of Preto_IC is slower than those of HD_IC and TIM+_IC and approximately equal to those of CELF++_IC and ECS_IC.
The influence spreads of these algorithms on the datasets data1, data2, and data3 are shown in Fig. 3a, Fig. 4a and Fig. 5a, respectively. We can see that our ECS algorithm for selecting a seed set based on elementary closed subsets brings a better influence spread than the HD and CELF++ methods when used with the IC model. In particular, the Preto_IC model, which combines the seed set from ECS with the determination of neighbor sets by the pseudo-closure function a_s(.), achieves the best performance. These results demonstrate the advantage of propagation over multiple relations.

Fig. 3. Influence spread and running time on data1

In addition, the running times of the algorithms on the three datasets are illustrated in Fig. 3b, Fig. 4b and Fig. 5b. The running time of HD_IC is the lowest, since selecting the seed set by node degree is the simplest method. Next, the running time of TIM+_IC is much lower than those of CELF++_IC, ECS_IC and Preto_IC. This result can be explained by the complexity of the seed selection algorithms. The complexity of CELF++ is O(k·m·n·R), where k is the size of the seed set, n is the number of nodes, m is the number of edges and R is the number of Monte Carlo samples used to estimate the expected spread of each node set. The complexity of TIM+ is only O((k + l)(n + m) log n / ε²), where k is the size of the seed set, n and m are the numbers of nodes and edges respectively, and l, ε are parameters.

Fig. 4. Influence spread and running time on data2

Fig. 5. Influence spread and running time on data3

Besides, the complexity of the ECS algorithm is O(n·x·n·s), where n is the number of nodes, x is the number of rounds needed to reach the closure, and s is the strength level of the pseudo-closure function. CELF++ and ECS thus have approximately the same complexity, which is worse than that of TIM+. This explains why the running time of TIM+_IC is lower than those of CELF++_IC, ECS_IC and Preto_IC, and why CELF++_IC, ECS_IC and Preto_IC have approximately equal running times.
In short, the Preto_IC model demonstrates a novel approach to simulating propagation over multiple relations. The experimental results prove that Preto_IC brings a better influence spread compared with the IC model under different seed set selection methods. However, the optimization of the running time is still limited; we will continue to improve this in future work.

5 Conclusion and Future Works


This paper proposed a novel approach to simulating propagation in bibliographic networks using pretopology theory. We proposed an expanded diffusion model of IC, namely Preto_IC. In Preto_IC, we introduced a new method to select the seed set based on elementary closed subsets; moreover, at each spreading step, a node's neighbor set is identified over multiple relations through the pseudo-closure function. Experimental results proved that Preto_IC achieves the best influence spread. We believe that our study can make a significant contribution to applications of information diffusion in scientist networks. In future work, we will conduct experiments on other networks and optimize the running time.

Acknowledgments. This study has been supported by Research Project No. DHH2023-03-186 of Hue University, Vietnam.

References
1. Akula, R., Yousefi, N., Garibay, I.: DeepFork: Supervised Prediction of Information
Diffusion in GitHub, p. 12 (2019)
2. Anderson, R.M., May, R.M.: Infectious diseases of humans: dynamics and control.
Oxford university press (1991)
3. Banerjee, S., Jenamani, M., Pratihar, D.K.: A survey on influence maximization
in a social network. Knowl. Inf. Syst. 62(9), 3417–3455 (2020). https://doi.org/10.
1007/s10115-020-01461-4
4. Belmandt, Z.: Basics of Pretopology. Hermann (2011). http://ijpam.eu/contents/
2013-86-1/5/5.pdf
5. Bui, Q.V., Ben Amor, S., Bui, M.: Stochastic pretopology as a tool for topological
analysis of complex systems. In: Nguyen, N.T., Hoang, D.H., Hong, T.-P., Pham,
H., Trawiński, B. (eds.) ACIIDS 2018. LNCS (LNAI), vol. 10752, pp. 102–111.
Springer, Cham (2018). https://doi.org/10.1007/978-3-319-75420-8_10
6. Bui, Q.V., Ho, T.K.T., Bui, M.: Topic diffusion prediction on bibliographic net-
work: new approach with combination between external and intrinsic factors. In:
Nguyen, N.T., Hoang, B.H., Huynh, C.P., Hwang, D., Trawiński, B., Vossen, G.
(eds.) ICCCI 2020. LNCS (LNAI), vol. 12496, pp. 45–57. Springer, Cham (2020).
https://doi.org/10.1007/978-3-030-63007-2_4
7. Freeman, L.C., et al.: Centrality in social networks: Conceptual clarification. Social
network: critical concepts in sociology. Londres: Routledge 1, 238–263 (2002)
8. Goldenberg, J., Libai, B., Muller, E.: Talk of the network: a complex systems look
at the underlying process of word-of-mouth. Mark. Lett. 12(3), 211–223 (2001)
9. Goyal, A., Lu, W., Lakshmanan, L.V.: Celf++: Optimizing the greedy algorithm
for influence maximization in social networks (technical report)
10. Granovetter, M.: Threshold models of collective behavior. Am. J. Sociol. 83(6),
1420–1443 (1978)
11. Gui, H., Sun, Y., Han, J., Brova, G.: Modeling topic diffusion in multi-relational
bibliographic information networks. In: CIKM (2014). https://doi.org/10.1145/
2661829.2662000
12. Hethcote, H.W.: The mathematics of infectious diseases. SIAM Rev. 42(4), 599–
653 (2000)
13. Ho, T.K.T., Bui, Q.V., Bui, M.: Homophily independent cascade diffusion model
based on textual information. In: Nguyen, N.T., Pimenidis, E., Khan, Z., Trawiński,
B. (eds.) ICCCI 2018. LNCS (LNAI), vol. 11055, pp. 134–145. Springer, Cham
(2018). https://doi.org/10.1007/978-3-319-98443-8_13
14. Kempe, D., Kleinberg, J.M., Tardos, E.: Maximizing the spread of influence
through a social network. In: KDD (2003).https://doi.org/10.1145/956750.956769
15. Kimura, M., Saito, K.: Tractable models for information diffusion in social net-
works. In: European Conference on Principles of Data Mining and Knowledge
Discovery, pp. 259–271. Springer (2006)
16. Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J., Glance, N.:
Cost-effective outbreak detection in networks. In: Proceedings of the 13th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.
420–429 (2007)
17. Macy, M.W.: Chains of Cooperation: Threshold Effects in Collective Action. Am.
Sociol. Rev. 56(6), 730–747 (1991)
18. Molaei, S., Babaei, S., Salehi, M., Jalili, M.: Information Spread and Topic Diffusion
in Heterogeneous Information Networks. Sci. Rep. 8(1), 1–14 (2018)
19. Molaei, S., Zare, H., Veisi, H.: Deep learning approach on information dif-
fusion in heterogeneous networks. Knowledge-Based Systems p. 105153 (Oct
2019). https://doi.org/10.1016/j.knosys.2019.105153, http://www.sciencedirect.
com/science/article/pii/S0950705119305076
20. Serazzi, G., Zanero, S.: Computer virus propagation models. In: International
Workshop on Modeling, Analysis, and Simulation of Computer and Telecommuni-
cation Systems. pp. 26–50. Springer (2003)
21. Tang, Y., Xiao, X., Shi, Y.: Influence maximization: Near-optimal time complexity
meets practical efficiency. In: Proceedings of the 2014 ACM SIGMOD International
Conference On Management Of Data, pp. 75–86 (2014)
22. Varshney, D., Kumar, S., Gupta, V.: Modeling information diffusion in social net-
works using latent topic information. In: Huang, D.-S., Bevilacqua, V., Premaratne,
P. (eds.) ICIC 2014. LNCS, vol. 8588, pp. 137–148. Springer, Cham (2014). https://
doi.org/10.1007/978-3-319-09333-8_16
Optimizing Credit Scoring Models
for Decentralized Financial Applications

Trong Hoan Dao, Tuan-Dat Trinh(B), and Viet-Bang Pham

School of Information and Communications Technology, Hanoi University of Science and Technology, Hanoi, Vietnam
[email protected]

Abstract. Decentralized Finance (DeFi), a rapidly evolving ecosystem


of blockchain-based financial applications, has attracted substantial cap-
ital in recent years. Lending protocols, which provide deposit and loan
services similar to traditional banking, are central to DeFi. However,
the lack of credit scoring in these protocols creates several challenges.
Without accurate risk assessment, lending protocols impose higher inter-
est rates to offset potential losses, negatively affecting both borrowers
and lenders. Furthermore, the absence of credit scoring reduces trans-
parency and fairness, treating all borrowers equally regardless of their
credit history, discouraging responsible financial behavior and hinder-
ing sustainable growth. This paper introduces credit scoring models for
crypto wallets in DeFi. Our contributions include: (1) developing a com-
prehensive dataset with 14 features from over 250 000 crypto wallets; and
(2) constructing four credit scoring models based on Stochastic Gradient
Descent, Adam, Genetic, and Multilayer Perceptron algorithms. These
findings offer valuable insights for improving DeFi lending protocols and
mitigating risks in decentralized financial ecosystems.

Keywords: Credit Score · DeFi · Decentralized Finance · Lending · Genetic Algorithm · Stochastic Gradient Descent · Adam · Multilayer Perceptron

1 Introduction

Decentralized Finance (DeFi) [23] encompasses an ecosystem of decentralized


applications (DApps) built on blockchain technology. Among the most promi-
nent DApps are lending protocols, also known as lending pools. According to
DefiLlama1 , as of May 2024, over $36 million had been locked in lending proto-
cols, positioning them as the second largest sector within the DeFi ecosystem.
Lending pools enable users to deposit their tokens [8] into a liquidity pool [19],
which are then used to provide loans to other users.
Current lending protocols in DeFi face significant drawbacks due to the
absence of credit scoring. First, these protocols struggle to accurately assess the
risk level of borrowers, which often leads to higher interest rates being imposed to offset potential losses. Second, the lack of credit scoring diminishes transparency and fairness in the lending process, as all borrowers are treated uniformly regardless of their credit history. This reduces incentives for maintaining good financial behavior and negatively impacts the sustainable growth of the decentralized financial ecosystem.

T.H. Dao and T.-D. Trinh—Contributed equally to this work.
1 https://defillama.com/.
Due to these issues, there is an increasing demand for assessing the quality
of crypto wallets. Such assessments determine repayment capabilities and cate-
gorize users, enabling lending pools to set tailored loan limits and interest rates
for each participant. Verified and evaluated loans are less likely to be liquidated
compared to those that do not consider credit factors.
In traditional finance (TraFi), banks use the FICO credit score to evalu-
ate the creditworthiness of individuals or organizations. These scores, typically
ranging from 300 to 850, represent the credit risk of a person, with higher scores
indicating lower risk. This scoring system can be adapted to evaluate crypto
wallets in DeFi, as wallets share key characteristics with bank accounts: (1) they
store assets, (2) facilitate transfers, and (3) support activities such as depositing,
collateralizing, and borrowing.
Over recent decades, models for optimizing FICO credit score parameters
have been extensively developed, primarily using regression techniques and deep
learning algorithms. These models rely on variables such as outstanding debt,
payment history, credit usage length, credit mix, and new credit applications [9].
Despite their effectiveness, access to TraFi data [13] remains limited due to its
sensitive nature and the restricted exchange of information between banks.
This paper proposes credit scoring models for DeFi, leveraging unique wallet
parameters such as total current assets, average assets, transaction frequency,
number and types of interacted DApps, transaction amounts, account age, num-
ber of liquidations, loan-to-balance ratio, and loan-to-investment ratio. Our con-
tributions include: (1) the collection and processing of a dataset comprising 14
characteristics of crypto wallets, and (2) the development of four credit evalua-
tion models based on the FICO score, followed by their assessment and compar-
ison.
The structure of this paper is organized as follows: Sect. 2 reviews related
work on credit scoring. Section 3 presents the dataset and methodology. Section 4
discusses the experimental setup and results, while Sect. 5 concludes the paper
with key findings and future research directions.

2 Related Work
This section reviews methods, research, and models for evaluating crypto wallet
credit scores within the DeFi ecosystem. Packin and Lev-Aretz [15] identified two
key approaches to credit scoring in DeFi: the off-chain integration model
and the crypto-native credit score. These approaches aim to integrate user
data from both TraFi and DeFi systems for a comprehensive assessment of cred-
itworthiness across Web2 and Web3 platforms [22].
In the off-chain integration model, data from TraFi is used alone or combined with on-chain data (DeFi data). Machine learning models then process this
data to generate credit scores, which are continuously updated, encrypted, and
publicly stored on the blockchain. Zhu [27] proposed a blockchain-based approach
to identity verification and credit reporting, incorporating multi-dimensional
authentication, weighted score calculation, and encryption for secure identity
management and risk assessment.
Patel et al. [16] introduced KiRTi, a deep-learning-based credit recommender
system operating on a public blockchain. KiRTi facilitates direct lending by lever-
aging historical blockchain transaction data and using a long-short-term mem-
ory model to generate credit scores. Smart contracts automate loan repayments,
removing the need for third-party credit rating agencies.
Uriawan et al. [20] developed a credit score formula combining loan risk,
activity, profile, and social recommendation scores. Hartmann and Hasan [7]
introduced a “social score” using social media data to provide loan opportunities
on a decentralized platform. However, these models primarily depend on off-chain
data, which can limit their accuracy in assessing digital wallets. Additionally,
requiring wallet authentication for off-chain data conflicts with the anonymity
preference of many Web3 users.
Our focus is on crypto-native credit scores, which rely on blockchain
activity data such as loan repayments, trading, and governance participation.
Unlike traditional credit scores, which are tied to individuals, a crypto-native
score is linked to a wallet and dynamically adjusted based on blockchain inter-
actions. Packin and Lev Aretz [14] highlighted notable products including Spec-
tral2 ,LedgerScore3 , and Prestare4 . Spectral provides a Multi-Asset Credit Risk
Oracle (MACRO) Score, considering factors such as transaction history and
market conditions. LedgerScore offers autonomous crypto credit reports based
on cryptocurrency transactions and asset portfolios, while Prestare combines
on-chain credit scoring with lending protocol management.
Several Web3 companies, including CreDA5 , Quadrata6 , Credefi7 , and
TrueFi8 , are also developing DeFi credit scoring solutions. These platforms assess
wallet credit by analyzing transaction history, liquidation events, amounts owed,
and credit mix, though they do not disclose their detailed methodologies or
parameter optimization processes.
Research on blockchain credit scoring is still evolving. Wolf et al. [25] pro-
posed a scoring method for Aave accounts based on account age, historical health
factors, interactions with the Aave protocol, and asset types. Their model is
tailored only for Aave accounts on Ethereum. Austin et al. [1] introduced the
Autonomous Lending Organization on Ethereum (ALOE) system for unsecured
lending on Ethereum, which maintains and updates borrower-specific ratios such
as Underpay Ratio (UPR) and Current Debt Burden Ratio (CDBR). The ALOE

2 https://docs.spectral.finance.
3 https://www.ledgerscore.com.
4 https://linktr.ee/prestare.finance.
5 https://www.creda.app.
6 https://quadrata.com.
7 https://www.credefi.finance.
8 https://truefi.io.

system applies a k -nearest neighbors algorithm to calculate a borrower’s credit


score, but this model is limited to assessing loan-related risks and requires FICO
scores from TraFi for users not involved in borrowing activities.

3 Methodology
3.1 Dataset

In DeFi, data on blockchain networks is readily accessible for analysis. Major networks like Binance Smart Chain (BSC), Polygon, Tron, Arbitrum, and Optimism share technology with Ethereum through the Ethereum Virtual Machine (EVM), enabling decentralized applications across distributed networks. As of May 2024, EVM-based chains hold over 80% of the total value locked (TVL) in DeFi, with Ethereum alone contributing over 60%. For this research, we collected and aggregated data from EVM-based chains.
Data on EVM-based chains is categorized into two types: (1) transactional
data, encompassing activities such as transfers, deposits, withdrawals, and loans
conducted on DApps or via crypto wallets, and (2) state data, detailing asset
balances held in wallets or locked within DApps. These data are encrypted,
distributed across blockchain networks, and remain publicly accessible.
To process and decode this information effectively, we adopted the data col-
lection model outlined by Pham and Trinh [17]. By May 2024, we gathered 20TB
of transaction and event data from seven EVM-based chains. This data was used
to create a knowledge graph covering over 600 million wallets, 25 million smart
contracts, 20 000 tokens, and 46 000 projects. We then selected data from 251,290
wallets9 , applying two filters: (1) wallets holding assets over $10, and (2) wallets
that have interacted with lending pools such as Aave and Compound in the past
year. Table 1 presents the 14 features recorded in each wallet.

Preprocessing Data. We used Winsorization [5] for preprocessing to man-


age outliers and prevent skewed results. The process involves two steps: detect-
ing outliers and transforming the data. Outliers were identified using percentile
thresholds for each feature. Values beyond these thresholds were adjusted to the
threshold limits.
In the next step, the data was normalized using min-max scaling to the range [300, 850], as shown in Eq. 1, where X represents the original value, X' is the normalized value, X_min is the minimum value of the feature, and X_max is the maximum value of the feature.

X' = (X − X_min) × (850 − 300) / (X_max − X_min) + 300    (1)
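A minimal pandas sketch of this two-step preprocessing (clipping a feature at chosen percentile thresholds, then min-max scaling into [300, 850]) is given below; the function name, the column usage and the default 0.25/0.95 percentile pair (taken from the total-asset example discussed next) are our own illustrative choices.

import pandas as pd

def winsorize_and_scale(series: pd.Series, lower_q=0.25, upper_q=0.95,
                        lo=300.0, hi=850.0) -> pd.Series:
    # Step 1: Winsorization - clip values beyond the percentile thresholds.
    lower, upper = series.quantile(lower_q), series.quantile(upper_q)
    clipped = series.clip(lower=lower, upper=upper)
    # Step 2: min-max scaling into the credit-score range [lo, hi] (Eq. 1).
    x_min, x_max = clipped.min(), clipped.max()
    if x_max == x_min:                       # constant feature: map everything to lo
        return pd.Series(lo, index=series.index)
    return (clipped - x_min) * (hi - lo) / (x_max - x_min) + lo

# e.g. wallets["total_current_asset"] = winsorize_and_scale(wallets["total_current_asset"])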
For the total current assets feature, Table 2 shows the percentile thresholds
and their corresponding values. The feature shows significant variance. Selecting
thresholds of 0.25 and 0.95 (approximately $10 and $10 000) effectively distin-
guishes wallets with minimal assets from those with substantial assets.
9 https://github.com/Centic-io/credit-score-optimization/blob/main/Lending-Data-Ethereum-Labeled.csv.

Table 1. Wallet features

Features Description
total_current_asset Current total value of assets held in the wallet
average_total_asset Average asset value over the last 30 d
frequency_of_DApp_transactions Wallet’s activity level on blockchain investment channels
number_of_interacted_DApps Number of high-reputation DApps interacted with in the last 30 d
Types_of_interacted_DApps Number of different types of DApps interacted with in the last 30 d
reputation_of_interacted_DApps Number of high-reputation DApps interacted with in the last 30 d
transaction_amount Total value (in USD) of funds received by the wallet in the last 30 d
frequency_of_transaction Number of transactions made by the wallet in the last 30 d
age_of_accounts Duration since the wallet’s first transaction, indicating how long it
has been active in the crypto market
number_of_liquidations Total number of liquidation events for the wallet
total_value_of_liquidations Cumulative value of all liquidation events
loan_to_balance_ratio Ratio of total loans to the account balance
loan_to_investment_ratio Penalty points applied for violating acceptable loan-to-investment
ratios
investment_to_total_asset_ratio Wallet’s investment activity relative to its total assets

Table 2. Percentile thresholds and values of the total asset attribute

Percentile threshold   0     0.05   0.25   0.9       0.95        1
Value                  0.0   0.19   9.61   4219.90   11 936.94   22 824 864 075

Table 3. FICO credit score ranges

Score Ranges Rating Description


300–579 Poor This credit score is well below the U.S. average, indicating a higher
risk to lenders
580–669 Fair Slightly below average, but many lenders will approve loans with
this score
670–739 Good Likely to qualify for credit at competitive rates
740–799 Very Good Eligible for favorable rates and terms
800–850 Excellent Typically qualifies for the best available rates and terms

Labeling Data. Following the FICO model, we categorize wallets into five
credit levels (see Table 3). However, assigning the exact credit level is challenging
due to the lack of established evaluation mechanisms. For example, a wallet with
a balance of $1 trillion and a recent liquidation event may fall into level 2 due
to the liquidation or level 4 because of its large balance. This ambiguity reflects
the characteristics of Fuzzy Data.
Fuzzy data consists of information that cannot be represented as precise num-
bers or precisely categorized [21]. Research on fuzzy data [2] includes machine
learning algorithms that handle fuzzy logic and data. Denoeux [4] proposed
methods for estimating parameters in statistical models with fuzzy observations.
Jane and Ganesh [10] reviewed how machine learning and fuzzy logic techniques
enable knowledge-based decision-making.
We use Fuzzy Labels for crypto wallet classification, inspired by the Fuzzy
c-means clustering method [3]. This robust unsupervised technique provides a
flexible alternative to hard clustering. It allows objects on the boundary between


classes to have membership values between 0 and 1, reflecting their partial mem-
bership rather than forcing them into a single class.
Labeling is based on the analysis of over 6 million wallets interacting across
7 EVM chains in the past year. As detailed in Table 4, assigning labels involves
setting multiple thresholds. Future work will focus on improving the threshold
selection mechanism.
Table 4. Label description by score range

Poor Fair Good Very Good Excellent


Accounts with Accounts with Accounts with Accounts with Accounts with
assets under assets under assets over $1 000, assets over $1 assets over $1
$1 000 and $1 000, fewer a borrow-to-asset million, a borrow- million, a borrow-
more than than three ratio of 30%- to-asset ratio of to-asset ratio
three liquida- liquidations 40%, and no 20%-30%, and no below 20%, and
tions. liquidation. liquidation. no liquidation.
Accounts with a borrow-to- Accounts with assets ranging
asset ratio over 40% and more from $1 000 to $10 000 and no
than one liquidation. liquidations.
Accounts with assets under
$100 000, a debt ratio above
60%, and at least one liquida-
tion.
Accounts with assets over
$1 000, a borrow-to-asset ratio
below 40%, and at least one
liquidation.
Accounts with assets under
$1 000, liquidated less than
three times and no deposits.
Accounts with assets under Accounts with assets ranging
$100 000 and a debt ratio from from $10 000 to $100 000, a
40% to 60%, with no liquida- borrow-to-asset ratio below
tions. 20%, and no liquidations.
Accounts with assets over
$100 000, no borrowings,
no liquidations, and active
deposits.
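To illustrate how such threshold rules translate into (possibly multiple) fuzzy labels, the sketch below encodes only the first-row rules of Table 4; the field names are borrowed from Table 1, the mapping of the borrow-to-asset ratio onto loan_to_balance_ratio is our assumption, and the remaining rules of the table are deliberately omitted.

def candidate_labels(wallet):
    # Candidate credit labels for one wallet, using only the first-row rules of
    # Table 4; fuzzy labelling may attach more than one label to a borderline wallet.
    assets = wallet["total_current_asset"]          # in USD
    liquidations = wallet["number_of_liquidations"]
    borrow_ratio = wallet["loan_to_balance_ratio"]  # assumed borrow-to-asset ratio

    labels = set()
    if assets < 1_000 and liquidations > 3:
        labels.add("Poor")
    if assets < 1_000 and liquidations < 3:
        labels.add("Fair")
    if assets > 1_000 and 0.30 <= borrow_ratio <= 0.40 and liquidations == 0:
        labels.add("Good")
    if assets > 1_000_000 and 0.20 <= borrow_ratio < 0.30 and liquidations == 0:
        labels.add("Very Good")
    if assets > 1_000_000 and borrow_ratio < 0.20 and liquidations == 0:
        labels.add("Excellent")
    return labels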

Handling Imbalanced Dataset. After labeling, our dataset exhibits signif-


icant class imbalance, predominantly skewed toward level 2. According to the
labeling method in row 1 of Table 4, level 2 wallets constitute 89.41% of the
dataset, while levels 1, 3, 4, and 5 account for 0.12%, 8.65%, 0.88%, and 0.94%,
respectively. This imbalance may result from the anonymity of blockchain, which
leads to a large number of inactive or abandoned wallets.
We chose not to correct this imbalance for two key reasons: (1) Fuzzy Labeling
has partially mitigated the imbalance by increasing the proportion of underrepre-
sented labels, such as level 1, which rose from 0.12% to 0.38%. (2) The imbalance
reflects real-world conditions similar to the FICO model, where 71.3% of the U.S.
population falls into the Good or better categories, and only 12.1% achieve the
Poor label.10

10 https://www.lendingtree.com/credit-repair/credit-score-stats-page/.

3.2 Optimization Methods for Credit Scoring


Genetic Algorithm. A Genetic Algorithm (GA) is an optimization technique
inspired by the process of natural selection in biological evolution [12]. It itera-
tively enhances a pool of candidate solutions through principles of “survival of
the fittest” [24]. GAs are effective for solving complex problems with large search
spaces, noisy or discontinuous objective functions, and multiple local optima.
They have been successfully applied to parameter tuning, scheduling, routing,
and machine learning [11]. GAs efficiently explore extensive search spaces and
are capable of finding globally optimal or near-optimal solutions, addressing
challenges that traditional optimization methods often encounter.
In our research, the GA optimizes the parameters of a linear credit score formula involving the 14 features (f_1, f_2, ..., f_14) listed in Table 1 and 14 corresponding parameters (a_1, a_2, ..., a_14), as shown in Eq. 2. Each individual in the GA represents a distinct set of these 14 parameters.

credit_score = Σ_{i=1}^{14} a_i · f_i    (2)

Algorithm 1 outlines our GA implementation. Through evolutionary processes across generations, we refine parameter sets to calculate wallet credit scores, which are scaled between 300 and 850. This approach differs from the other algorithms in this research, which only classify wallets into 5 discrete levels (see Table 3).

Algorithm 1. Genetic Algorithm

Require: Population size N, mutation rate p_m, crossover type p_c, maximum number of generations MAX
1: Initialize a population of size N, and set the iteration counter t = 0
2: for each iteration i (up to MAX) do
3:     Evaluate the fitness of each individual in the population
4:     Select the best individuals for reproduction based on fitness
5:     while the new population is not full do
6:         Select random pairs of parents from the fittest individuals
7:         Perform crossover between pairs of parents using crossover type p_c
8:         Apply mutation to offspring with probability p_m
9:         Add the offspring to the new population
10:    end while
11:    Increment the current iteration t by 1
12:    Replace the current population with the new population
13: end for
14: return Best individual found

Figure 1 illustrates the GA process, including crossover and mutation. We use two-point crossover to enhance solution quality by exploring more possibilities, despite the increased training time. Mutation multiplies the elements resulting from crossover by randomly generated values within the range [1 − MutationRate, 1 + MutationRate].

During implementation, we fine-tuned three main parameters: Population


Size (PS), Mutation Rate (MR), and Number of Generations (NG). In each
generation, we applied Rank Selection to maintain diversity while focusing on
promising candidates by selecting the top n and bottom m solutions as parents.
Our objective was to identify the optimal set of parameters (PS, MR, NG, n,
m) that yields the best-performing population. Each parameter was fine-tuned
within specified ranges: PS and NG from 50 to 200, MR from 0.1 to 0.3, n
from 10 to 50, and m from 1 to 10. For each parameter set, we evaluated the
individuals, selecting the best-performing results to compare across parameter
sets.
We evaluated individuals by calculating wallet credit scores from their param-
eters and converting these scores into credit levels. We then compared these levels
to the assigned labels to assess accuracy. The Fitness Function was computed
using Eq. 3. The optimal results were obtained with the following settings: Pop-
ulation Size = 100, Mutation Rate = 0.1, Number of Generations = 100, n =
20, and m = 5.
    accuracy = (Total number of wallets with correctly predicted labels) / (Total number of wallets predicted)        (3)

Fig. 1. Genetic Algorithm Simulation
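To make the procedure concrete, the following is a minimal Python sketch of the GA loop described above. It assumes the 14 wallet features are given as a NumPy matrix X and the assigned credit levels as an integer array labels; the score_to_level cut-offs and the score rescaling are illustrative assumptions, not the exact values used in our system.

import numpy as np

# Illustrative sketch only; thresholds and scaling below are assumptions.
def score_to_level(scores):
    # Map 300-850 credit scores onto five discrete levels (assumed cut-offs).
    return np.digitize(scores, [580, 670, 740, 800])

def fitness(params, X, labels):
    # Linear credit score (Eq. 2), rescaled to 300-850, converted to levels,
    # and compared with the assigned labels (Eq. 3).
    raw = X @ params
    scores = 300 + 550 * (raw - raw.min()) / (np.ptp(raw) + 1e-9)
    return np.mean(score_to_level(scores) == labels)

def two_point_crossover(p1, p2, rng):
    i, j = sorted(rng.choice(len(p1), size=2, replace=False))
    child = p1.copy()
    child[i:j] = p2[i:j]
    return child

def mutate(child, rate, rng):
    # Multiply each parameter by a random factor in [1 - rate, 1 + rate].
    return child * rng.uniform(1 - rate, 1 + rate, size=child.shape)

def run_ga(X, labels, pop_size=100, mutation_rate=0.1, generations=100,
           top_n=20, bottom_m=5, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-1.0, 1.0, size=(pop_size, X.shape[1]))
    for _ in range(generations):
        fit = np.array([fitness(ind, X, labels) for ind in pop])
        order = np.argsort(fit)[::-1]
        # Rank selection: keep the top n and bottom m individuals as parents.
        parents = pop[np.concatenate([order[:top_n], order[-bottom_m:]])]
        children = []
        while len(children) < pop_size:
            p1, p2 = parents[rng.choice(len(parents), size=2, replace=False)]
            children.append(mutate(two_point_crossover(p1, p2, rng),
                                   mutation_rate, rng))
        pop = np.array(children)
    return max(pop, key=lambda ind: fitness(ind, X, labels))

The rank-selection step here simply retains the top n and bottom m individuals as parents, mirroring the (PS, MR, NG, n, m) parameters tuned above.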

Stochastic Gradient Descent. Stochastic Gradient Descent (SGD) is an iter-


ative optimization method used to minimize loss functions by updating model param-
eters. Unlike standard gradient descent, which uses the entire dataset, SGD
updates parameters using a single randomly selected training example or a small
batch at each iteration, reducing computational cost per iteration. SGD is widely
used and performs well across various machine learning tasks [18].
Algorithm 2. Stochastic Gradient Descent Algorithm Implementation

Require: Learning rate α, decay rate decay_rate, number of iterations num_iterations
1: Initialize parameter θ
2: Set time step t = 0
3: for each iteration in num_iterations do
4:   Sample a mini-batch of m training examples x_i with corresponding labels y_first^i, y_second^i
5:   Compute gradients: g = (1/m) ∇_θ Σ_{i=1}^{m} J(θ; x_i, y_first^i, y_second^i)
6:   Update parameters: θ ← θ − α·g
7:   Adjust learning rate: α ← α · decay_rate
8:   Increment time step: t ← t + 1
9: end for

Algorithm 3. Adam Algorithm Implementation

Require: Learning rate α, number of iterations num_iterations
Exponential decay rates β1 and β2 for the moments v_t and a_t, respectively
Small constant δ for numerical stability
1: Initialize parameter θ, first and second moments v_t = 0 and a_t = 0
2: Set time step t = 0
3: for each iteration in num_iterations do
4:   Sample m training examples x_i with corresponding labels y_first^i, y_second^i
5:   Compute gradients: g ← (1/m) ∇_θ Σ_{i=1}^{m} J(θ; x_i, y_first^i, y_second^i)
6:   Update time step: t ← t + 1
7:   First moment: v_t = β1·v_{t−1} + (1 − β1)·g
8:   Second moment: a_t = β2·a_{t−1} + (1 − β2)·(g ⊙ g)
9:   Bias correction for first moment estimate: v̂_t = v_t / (1 − β1^t)
10:  Bias correction for second moment estimate: â_t = a_t / (1 − β2^t)
11:  Compute parameter update: Δθ = −α · v̂_t / (√â_t + δ)
12:  Update parameters: θ ← θ + Δθ
13: end for

Algorithm 2 describes our SGD implementation. We adapted the loss function to handle multiple labels per wallet (see Table 4). With the input parameters y_first (first label), y_second (second label), and y_predict (predicted value), the new loss function is defined as shown in Eq. 4.

    E(i) = y_predict^i − y_first^i,                    if y_first^i = y_second^i
    E(i) = y_predict^i − min(y_first^i, y_second^i),   if y_predict^i < min(y_first^i, y_second^i)        (4)
    E(i) = y_predict^i − max(y_first^i, y_second^i),   if y_predict^i > max(y_first^i, y_second^i)
Adam Algorithm. The Adam optimizer has gained significant popularity in


recent years due to its efficiency and performance [12]. Unlike traditional SGD
algorithms, Adam (Adaptive Moment Estimation) requires fewer resources, con-
verges faster, and accelerates learning while enhancing overall performance [26].
Adam is a first-order optimization algorithm that can replace traditional stochastic gradient descent. It enhances the optimization process by combining first-order moment estimation (momentum) with second-order moment estimation (adaptive per-parameter scaling, as used in RMSProp and Adadelta). The learning rate for each parameter is adjusted dynamically based on these estimates, and bias correction is applied to stabilize the early updates. The pseudocode in Algorithm 3 provides the Adam implementation details. The loss function used in Adam is the same as that used for SGD (Eq. 4).
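A corresponding sketch of the Adam update in Algorithm 3 is given below. It reuses the fuzzy_error helper from the SGD sketch and the β1, β2, and δ values listed in Sect. 4; the linear model and batch size are again assumptions for illustration.

import numpy as np

def adam(X, y_first, y_second, lr=0.001, beta1=0.9, beta2=0.999, delta=1e-8,
         iters=10_000, batch=32, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    v = np.zeros_like(theta)   # first moment
    a = np.zeros_like(theta)   # second moment
    for t in range(1, iters + 1):
        idx = rng.choice(len(X), size=batch, replace=False)
        err = fuzzy_error(X[idx] @ theta, y_first[idx], y_second[idx])
        g = X[idx].T @ err / batch
        v = beta1 * v + (1 - beta1) * g
        a = beta2 * a + (1 - beta2) * g * g
        v_hat = v / (1 - beta1 ** t)            # bias-corrected first moment
        a_hat = a / (1 - beta2 ** t)            # bias-corrected second moment
        theta -= lr * v_hat / (np.sqrt(a_hat) + delta)
    return theta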

Multilayer Perceptron. A Multilayer Perceptron (MLP) is a type of feedfor-


ward neural network that uses nonlinear activation functions to model complex
patterns in data. It typically comprises three components: an input layer, one
or more hidden layers, and an output layer [6]. The input layer receives raw
data, which passes through the hidden layers. In each layer, neurons compute a
weighted sum of inputs, apply a nonlinear activation function (commonly ReLU),
and forward the result to the next layer. The output layer produces the final pre-
diction.
Figure 2 illustrates the neural network architecture we designed, consisting
of a 14-node input layer (representing wallet features), two hidden layers with
64 and 10 nodes respectively, and a 5-node output layer, corresponding to five
credit score levels.
In our MLP implementation, we customized the loss function to handle two labels per sample. The input parameters include y_first (the first label), y_second (the second label), and y_predict (the predicted value). The new loss function is defined in Eq. 5, where N is the number of samples, C is the number of classes, one_hot(y_true, c) transforms the true label y_true into a one-hot vector, and y_predict[i, c] represents the predicted probability of class c for sample i.

    SCC (Sparse Categorical Crossentropy) = − Σ_{c=1}^{C} one_hot(y_true, c) · log(y_predict[c])

    Loss = (1/N) Σ_{i=1}^{N} min( SCC(y_first[i], y_predict[i]), SCC(y_second[i], y_predict[i]) )        (5)
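For illustration, the following Keras sketch builds the 14-64-10-5 network of Fig. 2 and implements the loss of Eq. 5 as the per-sample minimum of two sparse categorical crossentropies. The variable names and the commented-out training call are placeholders, not the exact training script used in our experiments.

import tensorflow as tf

def fuzzy_sparse_cce(y_true, y_pred):
    # y_true carries both fuzzy labels per sample, shape (batch, 2); the loss
    # is the per-sample minimum of the two sparse categorical crossentropies.
    scc = tf.keras.losses.sparse_categorical_crossentropy
    first = scc(tf.cast(y_true[:, 0], tf.int32), y_pred)
    second = scc(tf.cast(y_true[:, 1], tf.int32), y_pred)
    return tf.reduce_mean(tf.minimum(first, second))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(14,)),                     # 14 wallet features
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(5, activation="softmax"),  # five credit levels
])
model.compile(optimizer="adam", loss=fuzzy_sparse_cce)
# model.fit(X_train, y_pair_train, epochs=10, batch_size=64)  # placeholder names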
Fig. 2. Neural network architecture

4 Experiment
SGD and Adam: We implemented Stochastic Gradient Descent (SGD) and
Adam optimizers, tuning both the learning rate and number of iterations. For
SGD, we set a decay rate of 0.95. The Adam optimizer was configured with the parameters β1 = 0.9, β2 = 0.999, and δ = 1 × 10⁻⁸ (the numerical-stability constant in Algorithm 3). We tested learning rates of 0.1, 0.01, and 0.001 across 100, 1,000, and 10,000 iterations. The best results for both algorithms were achieved with a learning rate of 0.001 and 10,000 iterations.
Genetic Algorithm: For GA, we initialized the number of generations to
100 and experimented with various crossover functions and mutation rates. The
two-point crossover and a mutation rate of 0.1 produced the best results after
testing one-point and two-point crossovers and mutation rates of 0.1 and 0.2.
Multilayer Perceptron: We experimented with multiple neural network archi-
tectures, the number of epochs, and batch sizes. The best performance was
achieved with two hidden layers (see Fig. 2), 10 epochs, and a batch size of
64.
Table 5 and Fig. 3 present and compare the performance of the four models on
the test dataset, covering three metrics: accuracy (A), precision (P), and recall
(R), for each separate label and for all labels. The key findings are as follows:
(1) The MLP model achieved the highest performance, with 97.93% accuracy
and 90.68% precision. In comparison, the GA model had slightly lower perfor-
mance, with 97.5% accuracy and 90.30% precision, but surpassed the MLP in
recall (62% vs. 58.96%). The SGD model exhibited the lowest performance,
with 64.03% accuracy and 53% precision.
(2) The Fair and Good labels had the most accurate predictions. For the Fair
label, the results (A, P, R) for the GA and MLP models are (97.65%; 99.05%;
97.86%) and (98.66%; 97.09%; 98.91%), respectively. For the Good label,
GA and MLP results are (94.43%; 94.50%; 75.54%) and (98.27%; 97.09%;
91.77%), respectively. Figure 4 further illustrates that the MLP model accu-
rately predicted instances with high confidence in the Fair and Good labels.
(3) Recall values were notably low for certain labels across the models. For the
MLP model, recall is particularly low for the Poor and Excellent labels, at
7.4% and 24.10%, respectively. The GA model also shows low recall for the
Poor and Very Good labels, with values of 22.90% and 15.05%, respectively.
This discrepancy arises from the use of fuzzy labeling in the test dataset,
while the model outputs are a selection among the five labels: Poor, Fair, Good,
Very Good, and Excellent. Specifically, samples labeled as (Poor, Poor),
(Poor, Fair ), and (Fair, Fair ) were predominantly predicted as Fair, result-
ing in high recall for the Fair label but low recall for the Poor label.

Table 5. The performance of the optimized models

Label     | Accuracy (%)            | Precision (%)            | Recall (%)
          | SGD   Adam  GA    MLP   | SGD    Adam  GA    MLP   | SGD   Adam  GA    MLP
Overall   | 64.03 96.10 97.50 97.93 | 53.00  81.83 90.30 90.68 | 40.60 51.10 62.0  58.96
Poor      | 57.30 98.73 99.05 96.88 | 0.95   43.27 77.36 85.07 | 35.75 41.34 22.90 7.40
Fair      | 53.67 95.93 97.65 98.66 | 96.20  97.99 99.05 98.66 | 40.90 96.65 97.86 98.91
Good      | 80.00 93.94 94.43 98.27 | 14.30  92.72 94.50 97.09 | 47.16 75.1  75.54 91.77
Very Good | 79.50 83.03 79.97 98.57 | 11.90  93.85 88.62 86.24 | 94.60 28.23 15.05 72.55
Excellent | 88.00 96.20 99.57 98.35 | 100.00 81.36 91.78 86.36 | 26.00 14.14 98.67 24.10

Fig. 3. Model Comparison

Figure 4 provides a heatmap visualizing the MLP’s predictions compared to


the original fuzzy labels. In the heatmap, the Y-axis represents the predicted
credit levels: 0 (Poor), 1 (Fair), 2 (Good), 3 (Very Good), and 4 (Excellent).
The X-axis corresponds to the fuzzy labels assigned to the test dataset. Using
the Fuzzy Labeling approach, each wallet address is assigned two labels, reflect-
ing the uncertainty in classification. For example, a label of (0;0) indicates a
definitive classification of Poor, while a label of (0;1) suggests that the wallet
could be classified as either Poor or Fair. Thus, the test dataset contains these
fuzzy label pairs, such as (0;0), (0;1), (1;2), and (4;4).
Each cell in the heatmap shows the number of wallets predicted at a specific
level, compared to their original fuzzy labels. For example, a cell displaying 644
indicates that 644 wallets were predicted to be at level 1 (Fair), while their
original fuzzy labels were (0, 1), meaning they could belong to either Poor or
Fair.

Fig. 4. Heatmap of Predicted Credit Levels and Fuzzy Labels

5 Conclusion

This study applied optimization techniques including Stochastic Gradient


Descent (SGD), Adam Optimization Algorithm, Genetic Algorithm (GA), and
Multilayer Perceptron (MLP) to optimize the credit scoring formula for crypto
wallets on the DeFi platform. While SGD, Adam, and MLP models classified wal-
let credit into five levels, the Genetic Algorithm provided detailed credit scores
ranging from 300 to 850. Among these techniques, the Multilayer Perceptron
achieved the highest performance, with an accuracy of 97.93% and a precision
of 90.68%. These results highlight the potential of advanced algorithms and
machine learning models to improve credit scoring systems in DeFi. The use
of fuzzy labeling helps manage uncertainty in credit classification, leading to
more accurate risk assessments and greater transparency in lending practices.
This contributes to the development of more reliable and transparent financial
practices.
Future research should expand the dataset to include non-EVM chains like
Cosmos11, Solana12, and Polkadot13 to test the models in diverse blockchain
environments. Additionally, an automated labeling system will further enhance
scoring accuracy and streamline evaluation, promoting the integration of robust
credit scoring mechanisms in decentralized finance.

Acknowledgements. This research was supported by Centic.io. We would like to


show our gratitude to them for sharing their pearls of wisdom with us during this
research.

References
1. Austin, T.H., Potika, K., Pollett, C.: Autonomous lending organization on
ethereum with credit scoring. In: 2023 Silicon Valley Cybersecurity Conference
(SVCC), pp. 1–8, IEEE (2023)
2. Bandemer, H., Näther, W.: Fuzzy data analysis, vol. 20. Springer Science & Busi-
ness Media (2012)
3. Bezdek, J.C., Ehrlich, R., Full, W.: Fcm: the fuzzy c-means clustering algorithm.
Comput. Geosci. 10(2–3), 191–203 (1984)
4. Denœux, T.: Maximum likelihood estimation from fuzzy data using the em algo-
rithm. Fuzzy Sets Syst. 183(1), 72–91 (2011)
5. Dixon, W.J., Yuen, K.K.: Trimming and winsorization: a review. Statistische Hefte
15(2), 157–170 (1974)
6. Genc, E., Shin, H.R., Sik Park, J., Song, J.K.: Number recognition of parts
book schematics using convolutional recurrent neural network. In: 2018 Interna-
tional Conference on Information and Communication Technology Robotics (ICT-
ROBOT), pp. 1–3 (2018). https://doi.org/10.1109/ICT-ROBOT.2018.8549859
7. Hartmann, J., Hasan, O.: Privacy considerations for a decentralized finance (defi)
loans platform. Clust. Comput. 26(4), 2147–2161 (2023)
8. Harvey, C.R., Ramachandran, A., Santoro, J.: DeFi and the Future of Finance.
John Wiley & Sons (2021)
9. Homonoff, T., O’Brien, R., Sussman, A.B.: Does knowing your fico score change
financial behavior? evidence from a field experiment with student loan borrowers.
Rev. Econ. Stat. 103(2), 236–250 (2021)
10. Jane, J.B., Ganesh, E.: A review on big data with machine learning and fuzzy logic
for better decision making. Int. J. Sci. Technol. Res. 8(10), 1221–1225 (2019)
11. Katoch, S., Chauhan, S.S., Kumar, V.: A review on genetic algorithm: past,
present, and future. Multimed. Tools Appl. 80, 8091–8126 (2021)
12. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
13. Munkhdalai, L., Munkhdalai, T., Namsrai, O.E., Lee, J.Y., Ryu, K.H.: An empir-
ical comparison of machine-learning methods on bank client credit assessments.
Sustainability 11(3), 699 (2019)

11 https://cosmos.network.
12 https://solana.com.
13 https://polkadot.network.
14. Packin, N.G., Lev Aretz, Y.: Crypto native credit score: Between financial inclusion
and predatory lending. Cardozo Law Review, Forthcoming (2023)
15. Packin, N.G., Lev-Aretz, Y.: Decentralized credit scoring: Black box 3.0. American
Business Law Journal (2023)
16. Patel, S.B., Bhattacharya, P., Tanwar, S., Kumar, N.: Kirti: A blockchain-based
credit recommender system for financial institutions. IEEE Trans. Netw. Sci. Eng.
8(2), 1044–1054 (2020)
17. Pham, V.B., Trinh, T.D.: Analysis model for decentralized lending protocols. In:
Proceedings of the 11th International Symposium on Information and Communi-
cation Technology, pp. 405–412 (2022)
18. Recht, B., Re, C., Wright, S., Niu, F.: Hogwild!: A lock-free approach to paralleliz-
ing stochastic gradient descent. In: Advances in Neural Information Processing
Systems, vol. 24 (2011)
19. Sun, X., Stasinakis, C., Sermpinis, G.: Liquidity risks in lending protocols: Evidence
from aave protocol. arXiv preprint arXiv:2206.11973 (2022)
20. Uriawan, W., Badr, Y., Hasan, O., Brunie, L.: Decentralized trustworthiness score
management with smart contracts on the trustlend platform. IET Blockchain 4(1),
59–72 (2024)
21. Viertl, R.: Statistical methods for fuzzy data. John Wiley & Sons (2011)
22. Voshmgir, S.: Token economy: How the Web3 reinvents the internet, vol. 2. Token
Kitchen (2020)
23. Werner, S., Perez, D., Gudgeon, L., Klages-Mundt, A., Harz, D., Knottenbelt, W.:
Sok: Decentralized finance (defi). In: Proceedings of the 4th ACM Conference on
Advances in Financial Technologies, pp. 30–46 (2022)
24. Wirth, N.: Algorithms + Data Structures = Programs. Prentice-Hall Series in Automatic Computation. Prentice-Hall, Englewood Cliffs, N.J. (1976)
25. Wolf, W., Henry, A., Fadel, H.A., Quintuna, X., Gay, J.: Scoring aave accounts for
creditworthiness. arXiv preprint arXiv:2207.07008 (2022)
26. Yaguchi, A., Suzuki, T., Asano, W., Nitta, S., Sakata, Y., Tanizawa, A.: Adam
induces implicit weight sparsity in rectifier neural networks. In: 2018 17th IEEE
International Conference on Machine Learning and Applications (ICMLA), pp.
318–325, IEEE (2018)
27. Zhu, X.: Blockchain-based identity authentication and intelligent credit reporting.
In: Journal of Physics: Conference Series, vol. 1437, p. 012086, IOP Publishing
(2020)
Application of the SFE Feature Selection
Method for Multi-omic Biomarker
Discovery in Brain Cancer Subtyping

Hien Nguyen Minh, Ha Tang Vinh, Hoang Le, and Diep Thi Hoang(B)

VNU University of Engineering and Technology, Hanoi, Vietnam


[email protected]

Abstract. Glioblastoma (GBM) is an aggressive brain cancer with poor


prognosis, making the identification of reliable molecular biomarkers vital
for early detection and improving treatment strategies. This study intro-
duces a two-phase framework for discovering and validating GBM sub-
typing biomarkers. In the first phase, we employed the Simple, Fast,
and Efficient (SFE) [1] feature selection algorithm to high-dimensional
multi-omics data from The Cancer Genome Atlas (TCGA) GBM cohort
to identify potential biomarkers. In the second phase, we assessed the
explainability of these biomarkers through two approaches. First, by
comparing them with reference data from established databases. Sec-
ond, by evaluating their performance using classical machine learning
models. This two-phase framework is versatile and potentially applica-
ble to other cancer datasets, offering a promising approach to biomarker
discovery for improving cancer treatment.

Keywords: feature selection · evolutionary computing · particle


swarm optimization · cancer molecular biomarkers

1 Introduction
Cancer progression involves multiple stages, each associated with distinct genetic
mutations. These mutations disrupt normal cell cycle regulation, resulting in
uncontrolled proliferation [2]. A cancer biomarker is a measurable characteris-
tic indicating cancer risk or patient outcome. Molecular biomarkers encompass
genes, genetic variations, mRNA or protein expression differences, post-translational protein modifications, and metabolite levels; they help monitor disease progression and treatment response [3]. One emerging approach to biomarker discovery is the use of evolutionary algorithms with multi-omics data [4], because different types of omics data offer unique insights into cellular activities and biological processes, making them valuable for understanding complex diseases. However, multi-omics data analysis presents a challenge: integrating high-dimensional, diverse biological data with inherently small sample sizes. Evolutionary computation (EC) techniques have gained significant attention for their efficiency in exploring large solution spaces and achieving near-optimal results within a reasonable timeframe [4]. Furthermore, previous studies
H. N. Minh and H. T. Vinh—Equally contributed.
have demonstrated the effectiveness of evolutionary algorithms in biomarker dis-


covery [12, 13].
One of the efficient evolutionary feature selection (FS) techniques recently
introduced is the SFE algorithm [1]. SFE has demonstrated its efficiency
when tested on 40 high-dimensional datasets, achieving superior performance
compared to six contemporary FS algorithms. It has also shown promise for
biomarker discovery by significantly reducing the original feature sets to a rela-
tively small subset of features. However, this potential has not been fully explored
or analyzed in the original study. Building on this success, our paper makes four
main contributions:
– Proposing a novel two-phase framework that leverages the strengths of SFE:
• Phase 1 employs SFE to extract potential biomarkers from multi-omics
data, capitalizing on its proven efficiency in high-dimensional feature
spaces.
• Phase 2 evaluates the interpretability and relevance of the identified
biomarker set through comparison with established cancer-related genes
from curated databases and performance assessment using classical
machine learning models.
– Demonstrating the efficacy of our SFE-based framework through comprehen-
sive experiments on the TCGA GBM cohort.
– Introducing modifications to the utilization and analysis of the SFE and SFE-
PSO algorithms.
– Suggesting that the addition of a feature preselection step may improve the
overall performance of both the SFE and SFE-PSO algorithms.

2 Related Works
FS within omics data is challenging due to the vast search space, which grows
exponentially with the number of subsets of features, making exhaustive search
impractical. Traditional search methods like greedy, heuristic, and random
approaches often face issues like stagnation in local optima and high computa-
tional costs. To address these limitations, EC techniques have gained popularity
for their global search capabilities and potential to improve FS’s efficiency [5, 6].
In [7], Martinez et al. presented a customized version of the standard binary
particle swarm optimization (PSO) algorithm, designed to improve classifica-
tion accuracy while significantly reducing the number of selected features. Their
method updates a controlled subset of particles to avoid overwhelming the sys-
tem. It enables more efficient identification of small biomarker sets from microar-
ray data. Tested on 11 microarray datasets, the algorithm outperforms other
PSO-based approaches by achieving higher accuracy with fewer selected genes.
These genes can be considered potential biomarkers which are critical in clin-
ical applications. In [8], Kourid and Batouche introduce a scalable method for
biomarker discovery using a two-stage FS process. The first stage employs par-
allel K-means clustering and Signal-to-Noise Ratio (SNR) ranking to filter out
redundant features, selecting top features from each cluster. In the second stage,
the Binary Particle Swarm Optimization (BPSO) algorithm, implemented with
MapReduce, is used to further optimize the feature subset. Wang et al. in [9]
proposed a Feature Weighting Particle Swarm Optimization (FWPSO) method,
which operates in two phases. In the Feature Weighting Phase, PSO assigns
weights to features based on their relevance, discriminating between impor-
tant and irrelevant features. In the FS Phase, the PSO algorithm refines the
search to the most relevant features, enhancing the identification of significant
biomarker genes while reducing data dimensionality. Their method demonstrates
improved classification accuracy and efficiency on microarray datasets, outper-
forming other techniques.

3 Dataset
3.1 Dataset Collection

The data were retrieved from Xena Hub, a cancer genomics platform provided by
the University of California, Santa Cruz, which offers easy access to key datasets,
including TCGA Legacy. The dataset used in this study was the TCGA GBM
cohort, which classifies GBM into four subtypes: classical, mesenchymal, proneu-
ral, and neural. We incorporated three omics data types: copy number variations (CNV), DNA methylation, and level 3 gene expression (GE), selecting only samples with matched data across all three types. Preprocessing was performed to
reduce noise and redundancy. The dataset characteristics are as follows:
– Subtypes: Classical: 71, Neural: 46, Proneural: 72, Mesenchymal: 81.
– Original features: GE: 12,043; Methylation: 27,579; CNV: 24,777.
– Selected features: GE: 2,000; Methylation: 2,000; CNV: 2,000.

The stratified k-fold function from scikit-learn [18] was used to split the
data into training, validation, and test sets for different purposes, which will
be specified later in the Methods section. Additionally, the training and test
sets were employed in phase two to assess the identified biomarkers using other
classical machine learning models, including the softmax classifier and random
forest classifier. The number of samples (i.e., GBM patients) in the training,
validation, and test datasets is as follows: 162 (train), 54 (validation), and 54
(test).

3.2 Data Preprocessing


Based on prior experience and existing studies, data preprocessing is essential
for enhancing the performance of both machine learning models and classi-
fiers [10, 11]. Given the large number of features (64,399 in total) and the fact
that the fitness function in the SFE optimization algorithm also uses a classi-
fier’s performance metric, feature preselection using statistics-based methods is
critical to prevent overfitting in our study. Preprocessing and FS were performed
independently for each omics data type to enhance classification accuracy. The
data was split into training-validation (80%) and test (20%) sets using stratified
sampling. We selected the top 2,000 features per omics type based on ANOVA
F-values, ensuring that the first principal component explained less than 50%
of the variance. The training-validation set was further divided into training
(75%) and validation (25%) subsets using stratified 4-fold cross-validation. Min-
max scaling was applied to the gene expression and methylation data (excluding
CNV), with scaling parameters derived from the training set and consistently
applied to the validation and test sets. Finally, the three omics matrices were
concatenated into a single matrix for fitting with the classifier in the fitness
function.
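The preselection and scaling steps can be summarized by the following scikit-learn sketch. It assumes each omics block is a NumPy matrix aligned with a NumPy label vector y; the principal-component variance check mentioned above is omitted, and the function and variable names are illustrative rather than taken from our code.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import MinMaxScaler

def preprocess(omics, y, k=2000, seed=0):
    # omics: {"ge": X_ge, "meth": X_meth, "cnv": X_cnv}; y: subtype labels.
    train_idx, test_idx = train_test_split(
        np.arange(len(y)), test_size=0.2, stratify=y, random_state=seed)
    blocks_train, blocks_test = [], []
    for name, X in omics.items():
        # Top-k features per omics type by ANOVA F-value, fitted on the training portion.
        selector = SelectKBest(f_classif, k=k).fit(X[train_idx], y[train_idx])
        X_tr = selector.transform(X[train_idx])
        X_te = selector.transform(X[test_idx])
        if name != "cnv":
            # Min-max scaling for gene expression and methylation only,
            # with parameters derived from the training data.
            scaler = MinMaxScaler().fit(X_tr)
            X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
        blocks_train.append(X_tr)
        blocks_test.append(X_te)
    # Concatenate the three omics matrices into a single feature matrix.
    return np.hstack(blocks_train), np.hstack(blocks_test), y[train_idx], y[test_idx]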

4 Methods
4.1 Biomarker Discovery Problem

Researchers often deal with datasets where the number of measured features far
exceeds the sample size in biomarker discovery. These datasets are often derived
from high-throughput omics approaches or transcriptomic sequencing. Finding
biomarkers from omics data can be considered a FS problem. However, they
hold distinct meanings within bioinformatics. FS can be expressed formally as
below. It aims to identify a subset of relevant features S ⊆ {1, 2, ..., n} from the original set of n features, such that a model f built on this subset S achieves optimal or near-optimal classification performance while reducing computational cost and avoiding overfitting. Given:

– A dataset D = {(x_i, y_i)}_{i=1}^{m}, where x_i ∈ R^n is the feature vector of the i-th instance, and y_i ∈ {1, ..., C} is the class label.
– A classifier f : R^|S| → {1, ..., C}, parameterized by S, the subset of selected features.
– A loss function L(f(x), y) to evaluate the model's performance.

The FS problem can be formulated as:

    minimize_{S ⊆ {1, 2, ..., n}}  E_{(x, y) ∼ D} [ L(f(x_S), y) ]
    subject to |S| ≤ k,

where x_S denotes the feature vector reduced to the subset S, and k is the maximum number of features allowed in the subset. Biomarker discovery, on the other hand, is a specialized application of FS. It involves identifying key biological markers, such as specific genes or proteins, that have clinical or biological importance. The assumption behind applying FS techniques for this problem is that the elements chosen to improve the predictive model's performance are also likely to hold biological relevance [14, 15].
4.2 Introduction to the SFE and SFE-PSO Algorithm

The SFE algorithm is a wrapper-based method designed for high-dimensional datasets, performing feature selection (FS) through a search agent, denoted as X, which uses binary encoding to indicate whether each feature is selected. Specifically, X is the solution to the FS problem, where each element x_j in X represents the j-th feature: x_j = 0 indicates that the feature is not selected, while x_j = 1 indicates selection. The SFE algorithm operates in two main phases: exploration and exploitation. In the exploration phase, a global search is conducted using a non-selection operator to discard redundant and noisy features. The exploitation phase then applies a selection operator to refine the search locally, focusing on the features deemed non-selectable in the exploration phase. This results in a subset of relevant features.
The SFE-PSO algorithm combines SFE with an evolutionary computation
(EC) method, specifically particle swarm optimization (PSO). The primary
objective of SFE-PSO is to reduce dataset dimensionality with SFE in the initial
stages, followed by PSO to identify an optimal subset in the lower-dimensional
space. More specifically, PSO is applied after SFE has not improved the solution
within the previous 1000 fitness evaluations. Notably, PSO can be substituted
by other EC methods, creating a versatile and adaptable PSO-EC framework.
The fitness function in both SFE and SFE-PSO is based on the accuracy of
a classifier. For instance, GBM has four subtypes: Classical, Neural, Proneural,
and Mesenchymal. Each patient’s subtype (e.g., Neural or Proneural) is included
in the omics data. A classifier, such as K-nearest neighbor as used in the original
study, is trained on omics data to classify patients’ subtypes. In each iteration of
SFE, the classifier is retrained on a new subset of features while using the same
patient data.
The flowcharts for the two algorithms are presented in Fig. 1 and Fig. 2.
Further details and explanations of these algorithms can be found in the original
study.
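The sketch below gives a loose illustration of this exploration/exploitation loop, using our modified fitness (an SVM evaluated by macro F1, introduced in Sect. 4.3) rather than the original KNN accuracy. It is not the published SFE implementation: the operator sizes, the fixed update rate ur, and the acceptance rule are simplified assumptions, and the function names are illustrative.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def fitness(mask, X_tr, y_tr, X_val, y_val):
    # Fitness of a binary search agent: macro F1 of an SVM trained on the
    # selected features (the original SFE uses KNN accuracy instead).
    if mask.sum() == 0:
        return 0.0
    clf = SVC().fit(X_tr[:, mask], y_tr)
    return f1_score(y_val, clf.predict(X_val[:, mask]), average="macro")

def sfe_like_search(X_tr, y_tr, X_val, y_val, evaluations=1000, ur=0.3, seed=0):
    rng = np.random.default_rng(seed)
    agent = rng.random(X_tr.shape[1]) < 0.5      # binary encoding of the agent X
    best = fitness(agent, X_tr, y_tr, X_val, y_val)
    for _ in range(evaluations):
        cand = agent.copy()
        selected = np.flatnonzero(cand)
        unselected = np.flatnonzero(~cand)
        if len(selected) > 0:
            # Exploration: non-selection operator drops a fraction of selected features.
            drop = rng.choice(selected, size=max(1, int(ur * len(selected))), replace=False)
            cand[drop] = False
        if len(unselected) > 0:
            # Exploitation: selection operator re-activates a few unselected features.
            add = rng.choice(unselected, size=min(5, len(unselected)), replace=False)
            cand[add] = True
        f = fitness(cand, X_tr, y_tr, X_val, y_val)
        if f >= best:                            # keep candidates that are not worse
            agent, best = cand, f
    return agent, best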

4.3 Proposal of a Two-Phase Framework for Discovering


and Evaluating Biomarkers

In the first phase of applying SFE and SFE-PSO to the TCGA GBM dataset, we
propose four key modifications: data preprocessing, introducing a mechanism to
control the number of output features, changing the fitness evaluation function,
and recording the algorithms’ performance across the training, validation, and
test sets (in contrast to the original study, which focused solely on the validation
set). The first modification, data preprocessing, is detailed earlier, along with its
rationale, and is illustrated in Block A of Fig. 1 and Fig. 2.
The second modification (Block B of Figs. 1 and 2) aims to control the num-
ber of output features, increasing it to 400 in certain experiments to enhance the
likelihood of identifying potential biomarkers, while still significantly reducing
the feature set. These feature counts represent only 0.00155%, 0.003105%, and
0.00621% of the original 64,399 features (genes). This approach aligns with the
Fig. 1. Flow chart version of the SFE algorithm.
Fig. 2. Flow chart version of the SFE-EC framework.

goal of creating a practical biomarker panel for medical professionals, enabling


rapid, cost-effective genetic testing based on these genes. This method is faster
and more affordable than sequencing the entire genome and conducting exten-
sive computational analyses [16]. We implemented this modification by adding
features to the search agent when the number of selected features is below a
predefined threshold.
The third modification (Block C of Figs. 1 and 2) involved changing the classi-
fier from K-nearest neighbors to support vector machines and using the F1-score
as the evaluation metric. SVM is an efficient model that has outperformed other
traditional machine learning algorithms [17], making it a more comprehensive
target for the fitness function optimization. The F1-score, particularly useful
for imbalanced datasets, provides a more balanced evaluation by accounting for
both false positives and false negatives.
The fourth modification (also in Block C of Figs. 1 and 2) aims to provide
deeper insights into how the selected feature subsets impact classifier perfor-
mance. Similar to the original study, the training set was used to train the
classifier for the fitness function, the validation set optimized the search agent
in the SFE and SFE-PSO algorithms, and the test set evaluated the biomarkers
identified after running these algorithms. In this phase, we recorded the fitness
function values on both the training and test datasets with the selected features
at each iteration, allowing us to visualize the classifier’s accuracy over time.
This enables a better understanding of the effectiveness of the feature subsets


before and after applying the SFE algorithm.
In the second phase, we validated the identified biomarkers using two
approaches. First, we compared them with known GBM-associated genes
from established sources. For this, we used the Comparative Toxicogenomics
Database (CTD) managed by North Carolina State University, which compiles
gene/protein relationships, Gene Ontology (GO) annotations, and phenotypic
data from sources like NCBI PubMed, OMIM, and KEGG. The CTD is curated
annually and includes 74 GBM biomarkers supported by direct experimental
evidence. This evidence includes genes proven to interact with specific chemicals
or those linked to diseases based on laboratory studies. The second approach
involved evaluating the performance of the selected feature subset using stan-
dard machine learning models. We tested the features on softmax classifiers and
random forests (RF), performing grid search for hyperparameter optimization.
As discussed earlier, we assume that variables selected to improve predictive
model performance are also likely to have biological significance [14, 15].
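A minimal sketch of this second validation step is given below, using multinomial logistic regression as the softmax classifier and a random forest, with grid search over illustrative hyperparameter grids (the actual grids are not restated here). The selected column indices are assumed to come from Phase 1.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def evaluate_biomarkers(X_train, y_train, X_test, y_test, selected):
    # Restrict the data to the biomarker columns identified in Phase 1.
    X_tr, X_te = X_train[:, selected], X_test[:, selected]
    models = {
        "softmax": (LogisticRegression(max_iter=5000),
                    {"C": [0.01, 0.1, 1, 10]}),
        "random_forest": (RandomForestClassifier(random_state=0),
                          {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}),
    }
    results = {}
    for name, (model, grid) in models.items():
        search = GridSearchCV(model, grid, scoring="f1_macro", cv=4).fit(X_tr, y_train)
        results[name] = search.best_estimator_.score(X_te, y_test)  # test accuracy
    return results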

5 Experiments and Results


5.1 Experiment Settings
We designed three experiments to evaluate the proposed modifications: (1)
Experiment 1 (E1), which uses preprocessed data without controlling the num-
ber of output features, (2) Experiment 2 (E2), which uses unpreprocessed data
without controlling the number of output features, and (3) Experiment 3 (E3),
which uses preprocessed data with control over the number of output features
(note that this control mechanism has been implemented for SFE, but not yet
for SFE-PSO). All experiments were conducted on the Kaggle platform using
hardware with an Intel(R) Xeon(R) CPU (2.20GHz) and 13GB of RAM. In
each experiment, the SFE and SFE-PSO algorithms were run for 20 iterations.
The code and results for all experiments can be found at: www.kaggle.com/code/
maiphng12/gbm-feat-select. The results of the three experiments using SFE and
SFE-PSO are summarized in Table 1, while the changes in metrics over the iter-
ations for SFE and SFE-PSO are presented in Figs. 3, 4, 5, 6 and 7.

5.2 Results with Default Settings


The hyperparameters for SFE and SFE-PSO, along with the fitness function
values plotted during the iterations of the algorithm (i.e., the results from Phase
1), are as follows:
– SFE: UR = 0.3, UR_max = 0.3, UR_min = 0.001, SN = 1, number of fitness evaluations = 1000.
– SFE-PSO: the same hyperparameters as above, with inertia weight w = 0.5, cognitive parameter c1 = 1.5, social parameter c2 = 1.5, and number of particles n_particles = 30.
Fig. 3. E1 - SFE Fig. 4. E1 - SFE-PSO

5.3 Results of Investigating the Effect of Data Preprocessing

Using the same hyperparameters as the previous section, but with unprocessed
data, we conducted an investigation using only the SFE algorithm, achieving the
results shown below.

Fig. 5. E2 - SFE Fig. 6. E2 - SFE-PSO

5.4 Results of Validating the Impact of Controlling the Number


of Output Features

In this experiment, we controlled the number of output features to 400 by modify-


ing the algorithm so that the final set of biomarkers would consist of 400 features
when using SFE (this mechanism has not yet been implemented for SFE-PSO).
The results of Phase 1 are shown below. The hyperparameter settings remain
the same as in the previous sections.
Fig. 7. E3 - SFE with 400 features

Table 1. Summary of results from three experimental settings.

Algorithm | Setting | Number of Output Features | Genes Overlap with CTD Database | Softmax Accuracy (%) | Softmax F1-score (%) | Random Forest Accuracy (%) | Random Forest F1-score (%)
SFE       | E1 | 59  | None                    | 70.37 | 69.76 | 66.66 | 65.46
SFE-PSO   | E1 | 107 | None                    | 77.77 | 76.97 | 77.77 | 74.69
SFE       | E2 | 220 | MGMT, RUNX3             | 75.93 | 74.15 | 75.93 | 75.72
SFE-PSO   | E2 | 131 | None                    | 64.81 | 63.14 | 70.37 | 68.55
SFE       | E3 | 400 | CCNH, PTK2, RUNX3, SRRT | 79.62 | 78.95 | 81.48 | 82.51

6 Analysis and Discussions


6.1 About the Results

The issue of overfitting is evident when comparing the results from E2 with those
from E1, where both experiments used the same hyperparameter settings but
differed in the dataset. In E2, the disparity between the classifier’s performance
on the training dataset and the test dataset is significantly larger than in E1.
Additionally, the Phase 2 results in E2 were suboptimal, further highlighting the
critical role of data preprocessing.
Upon examining the results of Phase 2 in E3, we observe an increase in
the number of overlapping genes, accompanied by a significant improvement in
the performance of classical machine learning models (the softmax classifier and
random forest), compared to the results in E1. This underscores the effective-
ness of controlling the number of output features. The improvement can also be
attributed to the typical behavior of machine learning models, which tend to
perform optimally and remain stable when working with a suitable number of
features.
In Phase 1, although there is a noticeable performance gap between the train-
ing and test sets, the results on the test set remain relatively robust, especially
considering the drastic reduction in the number of features. The classifier was
trained on the full set of 6,000 features but was subsequently evaluated on the
test set using only a small subset of features: 59, 107, and 400 features obtained
from the modified SFE and SFE-PSO algorithms (59 features in E1 SFE, 107
features in E1 SFE-PSO, and 400 features in E3 SFE and SFE-PSO). Despite
this substantial reduction, the model’s ability to generalize, as reflected in the
test set performance, is commendable. This indicates that the FS algorithm,
SFE, is effective at identifying the most relevant biomarkers. Traditional models
like SVM continue to perform well when applied to a refined subset of features.
The reduced complexity contributes to more efficient computation while main-
taining classification quality, demonstrating the algorithm’s practical utility in
handling high-dimensional data.

6.2 Future Work

At present, we have successfully controlled the number of output features in the


SFE algorithm, but we have not yet implemented this mechanism for SFE-PSO.
However, as previously discussed, we believe that incorporating feature control
into SFE-PSO is essential. The approach for integrating this feature control into
the SFE-EC framework will depend on the specific EC method used. In the
future, we aim to incorporate feature control into SFE-PSO and may explore
other EC methods for further analysis.

7 Conclusion

In this study, we proposed and evaluated a two-phase framework leveraging


the SFE and SFE-PSO algorithms for biomarker discovery in high-dimensional
multi-omics datasets, using GBM subtyping as a case study. The first phase
focused on identifying potential biomarkers through FS, while the second phase
validated these biomarkers using established biological databases and traditional
machine learning models. Our experimental results demonstrated that the SFE-
based approach effectively reduced the dimensionality of the dataset, preserving
key biological markers relevant to cancer diagnosis. Additionally, modifications
such as early stopping and output feature control showed promise in enhancing
the performance of the algorithm. This framework provides a flexible and scalable
solution for biomarker discovery and can be adapted for other cancer datasets
or diseases. Future work will focus on refining the feature control mechanism in
SFE-PSO and exploring its applicability with other evolutionary computation
techniques.

References
1. Ahadzadeh, B., et al.: SFE: a simple, fast, and efficient feature selection algorithm
for high-dimensional data. IEEE Trans. Evol. Comput. 27(6), 1896–1911 (2023)
2. Hassanpour, S.H., Dehghani, M.: Review of cancer from perspective of molecular.
J. Cancer Res. Pract. 4(4), 127–129 (2017)
3. Maruvada, P., et al.: Biomarkers in molecular medicine: cancer detection and diag-
nosis. Biotechniques 38(sup4), 9–15 (2005)
4. Liang, J., et al.: A survey on evolutionary computation for identifying biomarkers
of complex disease. IEEE Trans. Evol. Comput. (2024)
5. Xue, B., et al.: A survey on evolutionary computation approaches to feature selec-
tion. IEEE Trans. Evol. Comput. 20(4), 606–626 (2016)
6. Abd-Alsabour, N.: A review on evolutionary feature selection. In: 2014 European
Modelling Symposium. IEEE (2014)
7. Martinez, E., Alvarez, M.M., Trevino, V.: Compact cancer biomarkers discovery
using a swarm intelligence feature selection algorithm. Comput. Biol. Chem. 34(4),
244–250 (2010)
8. Amine, A., Bellatreche, L., Elberrichi, Z., Neuhold, E.J., Wrembel, R. (eds.): CIIA
2015. IAICT, vol. 456. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19578-0
9. Wang, X., Jia, W.: A feature weighting particle swarm optimization method to
identify biomarker genes. In: 2022 IEEE International Conference on Bioinformat-
ics and Biomedicine (BIBM). IEEE (2022)
10. Khan, N.M., Madhav C, N., Negi, A., Thaseen, I.S.: Analysis on improving the per-
formance of machine learning models using feature selection technique. In: Abra-
ham, A., Cherukuri, A.K., Melin, P., Gandhi, N. (eds.) ISDA 2018 2018. AISC, vol.
941, pp. 69–77. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-16660-1_7
11. Tang, J., Alelyani, S., Liu, H.: Feature selection for classification: a review. In:
Data Classification: Algorithms and Applications, vol. 37 (2014)
12. Popovic, D., et al.: A self-tuning genetic algorithm with applications in biomarker
discovery. In: 2014 IEEE 27th International Symposium on Computer-Based Med-
ical Systems. IEEE (2014)
13. Panagiotopoulos, K., et al.: MEvA-X: a hybrid multiobjective evolutionary tool
using an XGBoost classifier for biomarkers discovery on biomedical datasets. Bioin-
formatics 39(7), btad384 (2023)
14. Torres, R., Judson-Torres, R.L.: Research techniques made simple: feature selection
for biomarker discovery. J. Investig. Dermatol. 139(10), 2068–2074 (2019)
15. Nair, T.M.: Calliper randomization: an artificial neural network based analysis of
E. coli ribosome binding sites. J. Biomol. Struct. Dyn. 15(3), 611–617 (1997)
16. Vargas, A.J., Harris, C.C.: Biomarker development in the precision medicine era:
lung cancer as a case study. Nat. Rev. Cancer 16(8), 525–537 (2016)
17. Bhavsar, H., Panchal, M.H.: A review on support vector machine for data clas-
sification. Int. J. Adv. Res. Comput. Eng. Technol. (IJARCET) 1(10), 185–189
(2012)
18. Pedregosa, F., et al.: Scikit-learn: machine learning in python. J. Mach. Learn.
Res. 12 (2011)
A Reputation Scoring Framework
for Lending Protocols Using the PageRank
Algorithm

Mau-Tra Nguyen, Tuan-Dat Trinh(B), and Viet-Bang Pham

School of Information and Communications Technology, Hanoi University of


Science and Technology, Hanoi, Vietnam
[email protected]

Abstract. Blockchain technology has revolutionized the financial sec-


tor by introducing decentralized finance (DeFi) as a powerful alterna-
tive to traditional banking systems. Among DeFi sectors, lending has
become a key area, facilitating cryptocurrency borrowing and lending
without intermediaries. As of May 2024, lending ranks second in total
value locked (TVL) within DeFi, reflecting its widespread adoption. Key
entities in the lending ecosystem, including personal wallets, centralized exchanges, lending smart contracts, and protocol supported tokens, play a
crucial role in shaping governance, decision-making, and resource alloca-
tion. However, current evaluation methods, which primarily rely on token
holdings for governance voting, are vulnerable to manipulation and fail
to accurately reflect contributions. To address this, we propose a novel
scoring framework that evaluates entities based on both token holdings
and lending interactions over time. Our framework uses the PageRank
algorithm, scaled to the FICO (Fair Isaac Corporation) score range to
offer a more stable and transparent assessment. This approach promotes
healthy competition, encourages user activity, and supports the long-
term growth and stability of the lending DeFi ecosystem.

Keywords: Reputation Score · DeFi · Lending · PageRank

1 Introduction
The advent of blockchain technology has led to a fundamental transformation in
the financial system [4], with decentralized finance (DeFi) emerging as a viable
alternative to traditional centralized banking [19]. Within DeFi, lending decen-
tralized applications (Lending DApps), or Lending protocols, play a crucial
role by facilitating cryptocurrency borrowing and lending without intermedi-
aries [24]. As of May 2024, lending ranks second in total value locked (TVL)
among DeFi categories, with over $36 billion.1 This reflects the substantial adop-
tion and influence of lending within the blockchain ecosystem.
M.-T. Nguyen and T.-D. Trinh—Contributed equally to this work.
1 https://defillama.com/categories.
Entities in lending DApps include: (1) user wallets [21], representing indi-
vidual users who participate in lending by depositing or borrowing tokens; (2)
centralized exchanges [25], hot wallets of centralized exchanges (CEX) that oper-
ate like user wallets but provide higher liquidity, enhancing market stability and
efficiency; (3) lending smart contracts [22], which automate lending processes,
ensure transparency, and enable trustless execution; (4) protocol supported tokens
[20], tokens eligible for borrowing or lending within the protocol.
Reputation within a lending DApp is demonstrated by consistent engagement
and lending activities over time. Implementing a reputation scoring algorithm
based on user interactions offers deeper insights into entity behavior and con-
tributions [1], allowing for more informed governance decisions. This scoring
mechanism also encourages healthy competition, motivating users to engage more actively and thereby supporting a vibrant and sustainable ecosystem.
However, evaluating entities within these DApps remains a challenge. Cur-
rent methods primarily rely on token holdings for governance voting [2], which is
vulnerable to manipulation as entities can temporarily inflate their token hold-
ings to gain influence during governance events [7]. To address this issue, we
propose a more robust scoring framework that evaluates entities based on both
token holdings and lending interactions over time, providing a fairer approach
to governance and development within the lending ecosystem [10].
This paper presents a robust reputation and credit scoring framework for
entities in decentralized lending applications. First, we adapt the PageRank
algorithm [18] to rank wallets, centralized exchanges, lending contracts, and
supported tokens. This ranking utilizes a graph constructed from an analysis of
lending activities, yielding a unified reputation score across diverse entity types.
Second, we implement a multi-step score normalization process to align our
results with traditional credit scoring systems, ensuring compatibility with exist-
ing financial frameworks. Finally, we validate our approach through backtest-
ing during market downturns, demonstrating that higher-ranked wallets exhibit
resilience and underscoring the proposed model’s effectiveness in risk assessment
and stability promotion within the lending ecosystem.
The paper is structured as follows: Sect. 2 reviews related work in DeFi and
lending DApps. Section 3 describes the proposed scoring framework and method-
ology. Section 4 presents the implementation and results of our system and dis-
cusses the findings. Finally, Sect. 5 concludes with future research directions and
potential improvements.

2 Related Work

This section begins with an introduction to the PageRank algorithm and its
applications. It then reviews existing evaluation systems relevant to our proposed
framework, including credit scoring systems in traditional banking and Web3
scoring projects.
2.1 PageRank Algorithm

The PageRank algorithm [18], developed by Google, is a foundational method


for ranking web pages based on their importance. It assesses the importance of
a page by considering both the quantity and quality of incoming links. A page
receives a higher rank if it is linked by pages with high ranks. Initially, each
page is given an equal PageRank value. The algorithm then recalculates each
page’s rank based on the ranks of the pages linking to it. This process repeats
iteratively until the PageRank values stabilize and no longer change significantly.
The PageRank formula is shown in Formula 1, where u represents a web page, B(u) is the set of pages linking to u, PR(u) and PR(v) are the rank scores of pages u and v, respectively, N_v denotes the number of outgoing links of page v, and c is a normalization factor. In this formulation, the rank of a page is evenly distributed among its outgoing links.

    PR(u) = c · Σ_{v ∈ B(u)} PR(v) / N_v        (1)

To address the rank sink issue – where pages accumulate rank without dis-
tributing it – the PageRank algorithm uses a dampening factor (see Formula 2).
This adjustment balances the rank distribution, preventing certain pages from
disproportionately accumulating rank.
Since its introduction, PageRank has evolved with notable variations [5] such
as the Weighted PageRank (WPR) [23]. Unlike the standard PageRank, which
treats all links equally, WPR enhances ranking by incorporating the significance
of both incoming and outgoing links. This approach assigns different weights to
links, improving the relevance of search results.
    PR(u) = (1 − d) + d · Σ_{v ∈ B(u)} PR(v) / N_v        (2)

The WPR algorithm assigns higher rank values to more important pages by considering both the number of inlinks and outlinks. Formula 3 illustrates how WPR uses the weights W_in(v, u) and W_out(v, u), calculated from the numbers of inlinks and outlinks. In this formula, R(v) denotes the set of pages to which page v links, I_u and I_p represent the numbers of inlinks of pages u and p, respectively, while O_u and O_p represent the numbers of outlinks. W_in(v, u) measures how important page u is relative to the other pages that v links to, adjusting the influence of the inbound link from v to u. W_out(v, u) similarly adjusts the weight of the link from v to u according to how v's outgoing links are distributed among the pages it points to.

    W_in(v, u) = I_u / Σ_{p ∈ R(v)} I_p        W_out(v, u) = O_u / Σ_{p ∈ R(v)} O_p

    PR(u) = (1 − d) + d · Σ_{v ∈ B(u)} PR(v) · W_in(v, u) · W_out(v, u)        (3)
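For reference, a small Python sketch of the weighted PageRank iteration of Formula 3 on an adjacency-list graph is shown below. In our setting, the nodes would be wallets, exchanges, lending contracts, and tokens, and the edges their lending interactions; the conventional damping value d = 0.85 and the guard against empty reference lists are implementation assumptions.

def weighted_pagerank(out_links, d=0.85, iters=100):
    # out_links maps each node to the list of nodes it links to.
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    in_links = {u: [] for u in nodes}
    in_count = {u: 0 for u in nodes}
    out_count = {u: len(out_links.get(u, [])) for u in nodes}
    for v, targets in out_links.items():
        for u in targets:
            in_links[u].append(v)
            in_count[u] += 1
    pr = {u: 1.0 for u in nodes}
    for _ in range(iters):
        new_pr = {}
        for u in nodes:
            total = 0.0
            for v in in_links[u]:
                ref = out_links[v]                    # R(v): nodes that v links to
                w_in = in_count[u] / max(sum(in_count[p] for p in ref), 1)
                w_out = out_count[u] / max(sum(out_count[p] for p in ref), 1)
                total += pr[v] * w_in * w_out
            new_pr[u] = (1 - d) + d * total
        pr = new_pr
    return pr

# Example: pr = weighted_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})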
Our selection of the PageRank algorithm for this research is driven by its
computational efficiency [12] and scalability for large datasets. Recent studies
have demonstrated its application in blockchain contexts. Do and Do [6] used
PageRank to rank Ethereum addresses based on transaction data, effectively
identifying significant nodes such as major exchanges and decentralized appli-
cations. Mitra [14] proposed a cost-efficient method to improve blockchain con-
sensus in Industry 4.0 by integrating a Cellular Automata-based extension of
PageRank, enhancing data security and privacy. Experimental results showed
this method outperformed standard PageRank, NCDawareRank, and HodgeR-
ank. Qin et al. [16] introduced the Segmented PageRank (SPR) algorithm for
evaluating data value in alliance blockchains, which proved broadly applicable
and offered similar complexity to traditional PageRank. Experimental results
confirmed its superior performance.
Gleich [9] noted that PageRank is widely used in bibliometrics, social net-
works, link prediction, and recommendation systems. Boldi et al. [3] applied it
to a voting model in social networks, while François et al. [8] extended PageR-
ank with clustering to detect stealthy botnets in peer-to-peer communication
networks.

2.2 Scoring Systems in Traditional and Web3 Environments

Traditional lending scoring models in banking generally function as binary clas-


sifiers to predict loan repayment delinquency. These models analyze historical
account behaviors, including payment history, amounts owed, credit history
length, new credit, and credit mix. For example, FICO (Fair Isaac Corpora-
tion) scores [15] convert the probability of repaying a loan within 90 days past due
into an integer score ranging from 300 to 850. A higher score indicates a greater
likelihood of repayment and is widely used by American financial institutions to
assess creditworthiness.
The Chinese social credit system, established in 2014, adopts a different
approach [13]. It aims to comprehensively assess citizens’ economic and social
reputation by analyzing a wide range of behavioral data. The system applies
rewards or penalties based on behavior, affecting areas such as government ser-
vices, hospitality, education, and banking, including credit ratings and interest
rates. Concerns have been raised about potential privacy violations and cyberse-
curity issues associated with the centralized database and servers of the Chinese
social credit system.
In the evolving landscape of decentralized finance (DeFi), several Web3 com-
panies, such as Spectral2 and TrueFi3 , are competing to develop reliable credit
scoring systems tailored for the crypto ecosystem. Spectral uses a traditional
scale from 300 to 850 and employs a tree-based classifier to predict borrower liq-
uidation risk and health factor drops, based on transaction history, liquidation
history, amounts owed and repaid, credit mix, and credit history. TrueFi aims to
2 https://www.spectrallabs.xyz/.
3 https://truefi.io/.
create a creditworthiness score from 0 to 255 specifically for crypto-native insti-


tutions. TrueFi’s score incorporates factors like company background, repayment
history, operating and trading history, assets under management, and credit met-
rics.
Despite these advancements, our survey of current research and applications
reveals a significant gap: no existing mechanism scores all types of entities within
a lending DApp. Most solutions are confined to scoring wallet addresses within
a single EVM chain, such as Ethereum, Binance Smart Chain, or Layer-2 net-
works like Optimism and Arbitrum. The challenge of multi-chain deployment
[11], where projects operate across multiple chains, remains unresolved.

3 Entity Ranking Framework for Lending Protocols

To develop a comprehensive reputation scoring framework for entities within a


Lending DApp, we propose a three-step process: Data Crawling, Graph Building,
and Entities Ranking.

Data Crawling. In this initial step, data is gathered from lending DApps to
construct a graph for applying the PageRank algorithm. We begin by identify-
ing and collecting addresses of supported tokens from the official lending DApp
website. Using these token addresses, all related transactions are retrieved, along
with a list of interacting addresses. These are categorized into three groups:
personal wallet addresses, centralized exchange hot wallets, and smart contract
addresses. This comprehensive dataset captures both user and contract interac-
tions, which is essential for accurate graph construction and analysis.
Graph Building. Using the collected data, we create a graph to model interac-
tions between different entities within the lending DApp. Each vertex represents
an entity, while each edge signifies a financial relationship between them. To
ensure accuracy in scoring, entities, such as identical tokens deployed across
multiple chains, are consolidated.

Entities Ranking. The PageRank algorithm is applied to calculate scores


for entities in the constructed graph. Entities are categorized into four groups:
centralized exchange wallet address, personal wallet, lending smart contract, and
token, allowing for precise rank comparisons within each type. Since personal
wallet rankings are the most practically relevant for lending applications, the
score normalization is optimized specifically for this category. Scores are scaled
according to the FICO model in a way that ensures the final score distribution
follows a normal curve. This normalization approach is also applied to the other
entity types, without further adjustments.
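One way to realize this scaling is sketched below: raw PageRank values are converted to quantile ranks, mapped through the inverse normal CDF so that the resulting distribution is approximately bell-shaped, and then linearly rescaled to the 300-850 FICO range. This is an illustrative assumption about the normalization, not the exact multi-step procedure used in our framework.

import numpy as np
from scipy.stats import norm

def scale_to_fico(raw_scores, lo=300, hi=850):
    # Rank-based normalization: raw PageRank values -> quantiles -> inverse
    # normal CDF -> linear rescale to the FICO range, so the final scores are
    # approximately normally distributed.
    raw = np.asarray(raw_scores, dtype=float)
    ranks = raw.argsort().argsort() + 1      # 1..n
    quantiles = ranks / (len(raw) + 1)       # strictly inside (0, 1)
    z = norm.ppf(quantiles)                  # standard normal scores
    z = (z - z.min()) / (z.max() - z.min())  # squash into [0, 1]
    return lo + z * (hi - lo)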
The automated nature of this data processing flow ensures that the frame-
work remains responsive to the dynamic environment of lending DApps. With
daily updates, it can accurately reflect real-time changes in the creditworthiness
of entities as token prices fluctuate and transaction volumes shift. This continu-
ous scoring process not only enhances the reliability of the reputation scores but
also allows DApps to make more informed decisions, particularly in risk manage-
ment and loan issuance. Furthermore, the automated updates reduce the need
for manual intervention, ensuring that the system can scale to accommodate
growing user and transaction volumes across multiple chains.

3.1 Data Crawling


Data collection was conducted in real-time from January 2023 to July 2024
within 6 Ethereum Virtual Machine (EVM) chains: Ethereum4 , Binance Smart
Chain5 , Arbitrum6 , Polygon7 , Optimism8 , Fantom9 . Using Web3 providers, we
continuously gathered and stored blockchain event data. We monitored and cap-
tured event logs from these chains, then classified them to isolate lending-specific
actions.
In total, we processed over 300 million blocks and 8 billion event logs, iden-
tifying more than 20 million lending-related events. To track token balances in
wallets, we monitored asset histories over a 30-day period, capturing how tokens
were held across various addresses without being transferred or utilized. The
final dataset includes over 3.5 million addresses engaged in lending activities
across more than 40 of the largest lending DApps by total value locked.

3.2 Graph Building


We construct a scoring computation graph based on a detailed analysis of lending
DApp operations. Key entities include borrowers, lenders, and tokens. Lenders
supply assets, which may consist of stablecoins (e.g., USDC, DAI), major cryp-
tocurrencies (e.g., ETH, BTC), or governance tokens (e.g., AAVE, COMP).
These assets are deposited into a lending pool to generate interest, and lenders
receive aTokens10 in return. aTokens represent the lender’s share of the pool
and automatically accumulate interest over time. They are redeemable for the
original asset plus interest.
Borrowers provide collateral in the form of assets to borrow from the pool.
The amount they can borrow is determined by the collateral's value, based on a
Loan-to-Value (LTV) ratio. Borrowers receive the borrowed assets and are issued
dTokens11. dTokens represent the borrowed amount plus accrued interest that must be repaid, and they track the borrower's debt obligations within the protocol.
As the loan progresses, borrowers are required to repay the principal plus
interest to reclaim their collateral. If the value of the collateral falls below a cer-
tain threshold, liquidation can be triggered to maintain the pool’s liquidity. Once
4 https://ethereum.org
5 https://www.bnbchain.org
6 https://arbitrum.io
7 https://polygon.technology
8 https://optimism.io
9 https://fantom.foundation/
10 https://docs.aave.com/developers/v/2.0/the-core-protocol/atokens
11 https://docs.aave.com/developers/v/2.0/the-core-protocol/debt-tokens
Fig. 1. Graph Model for Lending Activities. The DApp token, as a type of supported
token, has the same types of edges as other supported tokens.

the borrower fully repays the loan, the dTokens are burned, and they can retrieve
their collateral. Lenders can redeem their aTokens at any time to withdraw
their initial deposits along with the accrued interest. This entire process, includ-
ing deposits, borrowing, repayments, and withdrawals, operates autonomously,
ensuring the stability of the lending protocol even in volatile markets.
Based on these operations, we propose the graph shown in Fig. 1. Each DApp
has a unique graph for evaluating its entities, which include: (1) the lending
DApp; (2) supported tokens – assets that can be deposited or borrowed
within the DApp; (3) the DApp token – the token issued by the DApp, also
usable for borrowing or depositing; (4) Each supported token has corresponding
aTokens and dTokens, such as aETH and dETH for ETH; (5) Addr: represents
addresses interacting with the DApp, including user wallets, centralized exchange
hot wallets, or the DApp’s lending smart contracts.
The graph is constructed with each vertex representing an entity, analogous
to a page in the PageRank algorithm, where PageRank indicates the entity’s rep-
utation score within the lending ecosystem. The weighted PageRank algorithm
assigns weights to edges based on the percentage of users clicking links to navi-
gate between pages. In Lending DApps, financial connections – such as token
ownership or the amount of tokens locked in the lending DApp – are modeled
as edges, with weights reflecting the monetary value of these relationships. This
approach standardizes edges to a common unit, enabling consistent conversion
to percentage weights.
To establish the edges among the vertices in the graph, we first examine the
effects of adding one-way and two-way links. When a one-way link is added from
vertex u to vertex v, u creates an outbound link, and v gains an inbound link. While u's PageRank does not decrease immediately, the amount of PageRank it passes to v is divided among all of u's outbound links. The more outbound links u has, the smaller the portion of its PageRank that v receives. Although u's initial PageRank remains unchanged, this redistribution can cause its rank to decrease over several iterations, as its influence becomes more diluted. In contrast, v benefits from the new inbound link, and its PageRank is likely to increase depending on u's rank and how many other outbound links u has.
In the case of a two-way link between vertices u and v, both gain new inbound links, potentially enhancing their PageRanks. The value each vertex receives is moderated by the number of their outbound links. Mutual inbound links generally lead to a net increase in PageRank, particularly when both vertices already have high ranks. Nonetheless, the exact impact depends on the overall graph structure and the distribution of PageRank among other vertices.
Based on these characteristics, the direction of an edge in the graph is determined by which entity's rank benefits from the connection. If only v's rank increases, the edge is directed from u to v; if both ranks increase, the edge is bidirectional.
Consequently, most edges are bidirectional, with equal weights assigned in both
directions, thereby enhancing each vertex’s rank through their interrelations.
Figure 1 illustrates the edges, where labels indicate the weights of these
relationships expressed as monetary values in USD, rather than their
names. The edges include:

1. Bidirectional edges from users to aTokens and dTokens: In decentralized lending, users receive aTokens when depositing tokens into the lending DApp
and dTokens when borrowing tokens. The quantity of aTokens corresponds
to the deposit amount and can be transferred, with the remaining aTokens
indicating the user’s share in the pool. In contrast, dTokens represent the
user’s borrowings and are not transferable. Direct edges indicating deposit
and borrow amounts are not created between users and tokens because the
lending mechanism inherently captures these amounts through the aTokens
and dTokens held by users. Instead, bidirectional edges are established from
users to aTokens and dTokens, weighted by the respective deposit and borrow
amounts.
2. Bidirectional edges connecting aTokens and dTokens to supported tokens:
These edges represent the total deposit and borrow amounts from all users
for a specific token, corresponding to the total aTokens and dTokens issued.
3. Two bidirectional edges linking tokens to the lending pool: One edge reflects
the token’s liquidity within the pool, weighted by the total deposit and borrow
amounts, while the other indicates the actual funds remaining in the pool,
weighted by the total value locked.
4. Bidirectional edges between users and the DApp token: A user’s rank should
increase with the amount of the DApp token they hold, as this token directly
benefits the DApp and indicates user loyalty. In contrast, holding supported
tokens – those available for borrowing or depositing – does not affect the
lending DApp and is considered a personal activity. As a result, no edges are
established between users and supported tokens. Instead, bidirectional edges
are formed solely for DApp token holdings, with edge weights reflecting the
value of the DApp token in the user’s wallet.
5. A unidirectional edge from the Lending DApp to the DApp token: Lending
DApps like Aave issue a fixed amount of DApp tokens (e.g., AAVE) to raise
capital, support governance, ensure protocol stability, and distribute rewards.
A portion of these tokens is retained by the DApp to incentivize lenders or
borrowers and maintain reserves for liquidity and security. To capture the
importance of the DApp token to the lending DApp, we establish a unidirectional edge from the lending DApp to the token. As the central vertex, the
lending DApp holds the highest rank; this additional edge allows the DApp
to transfer part of its score to the token, elevating the token’s rank and, in
turn, enhancing the rank of users who hold it.
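As an illustration of the edge types enumerated above, the networkx sketch below assembles a small weighted directed graph and computes weighted PageRank scores on it. The vertex names and USD amounts are invented, and the two token-pool relationships are collapsed into one weighted edge for simplicity, so this is a minimal sketch rather than the authors' implementation.

import networkx as nx

G = nx.DiGraph()

def add_bidirectional(g, u, v, usd_value):
    # bidirectional financial relationship, equal weight in both directions
    g.add_edge(u, v, weight=usd_value)
    g.add_edge(v, u, weight=usd_value)

# (1) user <-> aToken / dToken, weighted by deposit / borrow amounts
add_bidirectional(G, "user:0xabc", "aETH", 12_000)
add_bidirectional(G, "user:0xabc", "dUSDC", 5_000)
# (2) aToken / dToken <-> supported token, weighted by the totals issued
add_bidirectional(G, "aETH", "ETH", 1_450_000)
add_bidirectional(G, "dUSDC", "USDC", 600_000)
# (3) token <-> lending pool (liquidity and TVL relationships, summed here)
add_bidirectional(G, "ETH", "lending_pool", 2_350_000)
# (4) user <-> DApp token, weighted by the holding's USD value
add_bidirectional(G, "user:0xabc", "AAVE", 2_300)
# (5) one-way edge from the lending DApp to its own token
G.add_edge("lending_pool", "AAVE", weight=3_000_000)

scores = nx.pagerank(G, weight="weight")   # reputation scores per vertex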

To accurately assess the ranks of lending entities, vertices representing the same entities must be consolidated within the graph. Two types of entities require
merging: tokens and hot wallets of centralized exchanges. (i) A hot wallet is
a cryptocurrency wallet connected to the internet, providing rapid access to
funds. Exchanges use these wallets to facilitate frequent transactions like deposits
and withdrawals, making them crucial for daily operations. Since they manage
liquidity rather than long-term storage, centralized exchanges often maintain
multiple hot wallets. For example, Binance operates over 40 hot wallets on EVM
networks. The vertices representing these wallets must be merged into one. (ii)
Similarly, as most lending DApps are deployed across multiple chains, tokens
issued by the same DApp will have distinct addresses on different networks. For
example, the Aave token on Ethereum has the address 0x7fc6...2ddae9, while on Polygon it uses 0xd6df...21c90b; these vertices must also be merged.
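A minimal sketch of this consolidation step with networkx is shown below. The alias mapping is illustrative, and coincident edge weights are not summed in this simple version.

import networkx as nx

def consolidate(g, alias_to_canonical):
    # duplicate vertices are relabelled onto a canonical name; their edges are
    # re-attached to the surviving vertex
    return nx.relabel_nodes(g, alias_to_canonical, copy=True)

aliases = {
    "ethereum:AAVE": "AAVE",            # same token deployed on several chains
    "polygon:AAVE": "AAVE",
    "binance_hot_wallet_7": "binance",  # one exchange, many hot wallets
    "binance_hot_wallet_8": "binance",
}
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("user:0xabc", "ethereum:AAVE", 2_300),
    ("user:0xdef", "polygon:AAVE", 800),
])
G = consolidate(G, aliases)   # both edges now point at the single "AAVE" vertex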

3.3 Entities Ranking


After consolidating vertices, we use the PageRank algorithm to compute scores
for entities across the constructed graph. The PageRank algorithm processes the
weighted, directed graph, and outputs a list of scores for every entity represented
by a vertex.
Entities are grouped into four categories to ensure accurate rank compar-
isons within each type: (1) Centralized Exchange: Hot wallets from major
exchanges that hold significant amounts of the DApp token but rarely engage in
lending; (2) Personal Wallet: Wallets owned by individual users who actively
engage in lending activities and contribute to the project; (3) Smart Con-
tract: Contracts enabling lending functions, including deposits, borrows, and
withdrawals within the ecosystem; (4) Supported Token: Tokens that can be
borrowed or deposited in the lending DApp.
The data crawling process starts by collecting supported tokens from the
lending DApp’s website and then filtering addresses that interact with these
tokens. With supported tokens already identified, the focus shifts to distinguish-
ing the remaining three entity types: personal wallets, centralized exchange hot
wallets, and smart contract addresses. Hot wallet addresses can be easily identi-
fied through the centralized exchange documentation. The remaining entities –
personal wallets and smart contracts – are differentiated using the web3 library’s
getContractCode function, which checks for bytecode; wallets return empty, while
contracts return bytecode.
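A minimal sketch of this classification step is shown below, assuming web3.py, where eth.get_code plays the role of the getContractCode check: externally owned wallets return empty bytecode, contracts do not. The RPC endpoint and the hot-wallet list are placeholders, and addresses are assumed to be checksummed.

from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://rpc.example.org"))   # hypothetical endpoint
KNOWN_EXCHANGE_HOT_WALLETS = {"0x..."}   # taken from exchange documentation

def classify_address(addr):
    if addr in KNOWN_EXCHANGE_HOT_WALLETS:
        return "centralized_exchange"
    code = w3.eth.get_code(addr)          # empty for wallets, bytecode for contracts
    return "smart_contract" if len(code) > 0 else "personal_wallet"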
After applying the PageRank algorithm, each entity corresponding to a vertex
in the graph receives a rank score, with the total score normalized to 1. Since
tokens and lending pools are central vertices in the graph, their scores differ
significantly from those of individual users. Thus we isolate user wallets and
normalize their rank scores to a range of 300 to 850, ensuring that these scores
adhere to a normal distribution. This adoption of the FICO score range enhances
the practical applicability of the proposed ranking method. Details on score
normalization are provided in the experiment section.

4 Experimental Results

To evaluate the proposed entity ranking method for lending based on the PageR-
ank algorithm, we conducted experiments with the Aave lending protocol. As of
May 2024, Aave is the leading lending protocol with over $11 billion in TVL.
Leveraging the lending data outlined in Sect. 3.1, we isolated Aave’s data across
six EVM chains, covering both Aave v2 and v3 versions, and constructed a graph
with over 45,000 vertices and 200,000 edges. The average processing time was
15 min for graph construction, and 5 min for running the PageRank algorithm
on the lending DApp graph.
Since personal wallet rankings hold the most practical relevance for lending
applications, we specifically optimized the score normalization for this category.
Initially, we selected the top 6,000 personal wallets by score and plotted their
scores, as illustrated in Fig. 2a. Our objective was to transform the scores to a
range of 355–800 while ensuring that the scores follow a normal distribution,
which required transitioning the graph from Fig. 2a to Fig. 2f.
Through regression calculations, we identified four transformations as follows: (1) We normalized the scores of all wallets to be greater than 10 using the transformation shown in Formula (4). This step prepared us for the subsequent logarithmic transformation using base 10. (2) We applied a double logarithmic transformation using base 10 (see Formula (5)). (3) The scores were exponentiated by a factor of α, as shown in Formula (6). To achieve normalization in the FICO score range of 300 to 850, α is determined such that max(all IntermediateScore_3) / min(all IntermediateScore_3) ≈ 850 / 300. (4) Finally, the scores are scaled to the range of 300–850 according to Formula (7).

IntermediateScore_1 = PageRank × 10 / min(all PageRanks)    (4)

IntermediateScore_2 = log10(log10(IntermediateScore_1))    (5)

IntermediateScore_3 = (IntermediateScore_2)^α    (6)

FinalScore = IntermediateScore_3 / min(all IntermediateScore_3) × 300    (7)
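The sketch below applies Formulas (4)-(7) with numpy. The Pareto-distributed toy scores and the grid search used to pick α are illustrative assumptions, not the authors' regression procedure.

import numpy as np

def normalize_to_fico(subset_scores, all_scores, alpha_grid=np.linspace(0.05, 5.0, 500)):
    # assumes every subset score is strictly above the global minimum PageRank
    s = np.asarray(subset_scores, dtype=float)
    s1 = s * 10.0 / np.min(all_scores)            # Formula (4): scores pushed above 10
    s2 = np.log10(np.log10(s1))                   # Formula (5): double log10
    target = 850.0 / 300.0
    # pick alpha so that max/min of the exponentiated scores approximates 850/300
    ratios = np.array([(s2.max() / s2.min()) ** a for a in alpha_grid])
    alpha = alpha_grid[np.argmin(np.abs(ratios - target))]
    s3 = s2 ** alpha                              # Formula (6)
    return s3 / s3.min() * 300.0                  # Formula (7): final range ~300-850

all_pr = np.random.pareto(2.5, size=50_000) + 1e-6   # toy PageRank scores
top_wallets = np.sort(all_pr)[-6_000:]               # top personal wallets by score
fico_like = normalize_to_fico(top_wallets, all_pr)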
The process of transforming the PageRank scores is illustrated in Fig. 2,
which includes six plots: the original PageRank scores, the transformed scores
with a minimum value of 10, the first logarithmic transformation, the second
Fig. 2. Score Normalization Process for Personal Wallets

Fig. 3. Histogram of Wallet Scores

logarithmic transformation, the exponentiated scores, and the final scores nor-
malized to the range of 300–850. The curve shape in the final plot indicates that
our method has successfully resulted in scores distributed normally from 300 to
850. This is further supported by the histogram of wallet scores in Fig. 3.
Along with personal wallets, the four transformations were also applied to
rank other entity types, including centralized exchanges, lending smart contracts,
and supported tokens of Aave, to derive their overall scores. These scores are
used to analyze the entities in the following sections.

4.1 Personal Wallets


Figure 4 shows the distribution of wallets within the AAVE DApp across five
FICO score ranges: Poor (300–579), Fair (580–669), Good (670–739), Very Good
(740–799), and Exceptional (800–850). Over 80% of wallets (31,787) fall within
Fig. 4. Distribution of Wallets by FICO Score Ranges in the Aave DApp

Fig. 5. Financial Holdings of the Top 5,000 Wallets in the Aave DApp

the Poor range, contributing only 4.5% to the Total Value Locked (TVL), indi-
cating their limited impact on the lending pool. In contrast, wallets with scores
above 800, though numbering only 126, represent 40% of the TVL, highlighting
their importance for financial stability. This distribution suggests that AAVE
could refine its lending strategies by offering targeted incentives to different user
segments to optimize participation and maintain platform stability.
Figure 5 presents a scatter plot visualizing the top 5,000 wallets ranked by
score, along with their respective deposit, borrow amounts in AAVE, and the
amount of AAVE tokens held in these wallets. The trend shows a decline in these
amounts as rank decreases. Since the y-axis is on a logarithmic scale, it is clear
that wallets with the highest ranks (on the left side of the plot) have significantly
higher deposit, borrow amounts, and AAVE holdings compared to lower-ranked
wallets.

4.2 Centralized Exchanges, Lending Smart Contracts, and DApp Supported Tokens
Figure 6 highlights the scores and total AAVE holdings of the top 10 out of 29
centralized exchanges. Although these exchanges do not participate in deposits
or borrowing, their large AAVE holdings make them key players in the Aave
lending ecosystem. Binance leads with a score of 868, holding $113.4 million in
Fig. 6. Scores and Total AAVE Holdings of Top 10 Centralized Exchanges

Fig. 7. Scores and Total Borrowing and Depositing of Top 10 Supported Tokens
on Aave

AAVE, significantly exceeding the second-ranked exchange, which holds $14.6


million.
The scores of supported tokens indicate their influence on the Aave DApp.
Figure 7 illustrates the scores, total borrowing, and total depositing of the top
10 out of 58 supported tokens on Aave. As the native token of the protocol,
AAVE stands out with a high score of 889, despite having no borrowing activity.
Other tokens, such as WBTC (Wrapped Bitcoin), USDC, USDT, and WSTETH,
share similar scores and play crucial roles in providing liquidity and borrowing
opportunities within Aave, highlighting the platform’s diversity and accessibility
for users.
Figure 8 analyzes the top 10 contracts among 10,055 prominent contracts
within the Aave ecosystem. The Staked Aave contract leads with a score of
875 and a user transaction volume of $267.8 million, followed closely by the
Ethereum AAVE V3 contract, which scores 867 with a transaction volume of
$108.6 million. These high scores highlight the critical roles of these contracts
in maintaining system reliability and performance within the Aave ecosystem.
Some contracts, despite having high transaction volumes, rank lower as they are
not directly tied to the AAVE token and can be replaced without affecting the
ecosystem’s core functionality. This analysis provides valuable insights into the
most actively utilized contracts, facilitating the prioritization of optimization
and risk management strategies to enhance platform stability.
Fig. 8. Scores and Total Transaction Volume of Top 10 Aave Smart Contracts

Fig. 9. Backtest of Wallet Resilience During Market Downturns

4.3 Liquidation Backtesting in Downtrending Markets

A critical factor in evaluating wallet reputation scores is their resilience during market downtrends. In lending DApps, wallets typically deposit volatile tokens
like ETH to borrow stable tokens. Since the borrowed amount remains constant,
a decline in the price of the deposited tokens can lead to liquidation once the loan-to-value (LTV) ratio rises above the liquidation threshold [17]. Significant mar-
ket downturns, such as the Bitcoin price crash from approximately $69,000 to
$15,000 between 2021 and 2022, resulted in substantial losses throughout the
cryptocurrency market.
To backtest the reputation scores of wallets, we assume that the prices of
deposited tokens fluctuate in accordance with Bitcoin’s price from November 10,
2021, to November 10, 2022, as reported by Yahoo Finance12 . Other assumptions
include a constant wallet reputation score and unchanged deposit and borrow
amounts throughout this period.
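A simplified version of this backtest is sketched below: collateral value follows a price series while the borrowed amount stays fixed, and a wallet counts as liquidated once its LTV crosses the threshold. The threshold, the stylized price path, and the wallet amounts are assumptions made only for illustration.

import numpy as np

def survival_curve(deposits_usd, borrows_usd, price_relative, ltv_threshold=0.8):
    """price_relative: daily prices divided by the price on day 0."""
    deposits = np.asarray(deposits_usd, float)[:, None] * np.asarray(price_relative)[None, :]
    ltv = np.asarray(borrows_usd, float)[:, None] / deposits        # debt / collateral value
    liquidated = np.cumsum(ltv > ltv_threshold, axis=1) > 0         # once liquidated, stays out
    return 1.0 - liquidated.mean(axis=0)                            # share of intact wallets per day

prices = np.linspace(1.0, 15_000 / 69_000, 365)                     # stylized 2021-2022 drawdown
curve = survival_curve([10_000, 8_000, 5_000], [4_000, 6_000, 1_000], prices)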
In this experimental scenario, as the collateral declines, users are progres-
sively liquidated over time. We will observe the percentage of wallets that remain
intact over a 365-day period. The experimental results are illustrated in Fig. 9,
12 https://finance.yahoo.com/quote/BTC-USD/history/
where the left chart depicts Bitcoin’s price, while the right chart shows the per-
centage of wallets that have not been liquidated, categorized as Exceptional and
Very Good, Good, and Fair.
The right chart reveals a clear pattern: Exceptional and Very Good wallets
exhibit greater resilience, with approximately 30% remaining by the end of the
period, compared to around 20% for Good and Fair wallets. This finding rein-
forces the validity of our scoring model for lending entities as higher-scoring
wallets are more capable of weathering market downturns.

5 Conclusion

This study presents a robust entity ranking framework for decentralized lend-
ing, utilizing the PageRank algorithm. We detail the graph construction process
based on an analysis of lending DApps and conducted experiments with Aave,
the leading lending protocol in the market. By normalizing PageRank scores to
align with FICO scores, we evaluated various entities within the Aave ecosystem,
including personal wallets, centralized exchanges, lending smart contracts, and
supported tokens. This evaluation provided valuable insights into the influence
of these entities on Aave’s platform stability.
Our analysis demonstrated that a small fraction of high-scoring wallets plays
a crucial role in maintaining platform stability, highlighting the necessity for tar-
geted lending strategies. Additionally, backtesting results confirmed that higher-
ranking wallets tend to be more resilient during market downturns, thereby
reinforcing the reliability of our scoring model.
Future work will focus on refining the graph construction process to incor-
porate additional parameters and expanding our model to include other decen-
tralized finance protocols. This will enable a more comprehensive evaluation of
lending strategies and risk management practices across various platforms, ulti-
mately contributing to the advancement of decentralized finance applications.

Acknowledgements. This research was supported by Centic.io. We would like to show our gratitude to them for sharing their pearls of wisdom with us during this
research.

References
1. Agrawal, D., Natalia, N., Gopalakrishnan, G., Guzman, M.N., McDonald, M.D.,
Kim, H.M.: Loyalty points on the blockchain. Business Manage. Stud. 4(3), 80–92
(2018)
2. Barbereau, T., Smethurst, R., Papageorgiou, O., Rieger, A., Fridgen, G.: DeFi, not so decentralized: the measured distribution of voting rights. In: Proceedings of the 55th Hawaii International Conference on System Sciences (Dec 2022)
3. Boldi, P., Bonchi, F., Castillo, C., Vigna, S.: Voting in social networks, pp. 777–786
(11 2009). https://doi.org/10.1145/1645953.1646052
4. Chen, Y., Bellavitis, C.: Decentralized finance: Blockchain technology and the quest
for an open financial system. In: Stevens Institute of Technology School of Business
Research Paper, Hoboken, NJ 07030-5991 USA (jul 2019)
5. Chung, F.: A brief survey of pagerank algorithms. In: IEEE Transactions on Net-
work Science and Engineering, vol. 1, pp. 38–42 (2014)
6. Do, H.D., Do, T.: Pagerank and hodgerank on ethereum transactions: A measure
for social credit. Int. J. Softw. Innov. (IJSI) 11(1), 1–13 (2023)
7. Fatih, R., Arezki, S., Gadi, T.: A review of blockchain-based e-voting systems:
comparative analysis and findings. Int. J. Interact. Mobile Technol. (iJIM) 17,
49–67 (dec 2023)
8. François, J., Wang, S., State, R., Engel, T.: BotTrack: tracking botnets using net-
flow and pagerank. In: Domingo-Pascual, J., Manzoni, P., Palazzo, S., Pont, A.,
Scoglio, C. (eds.) NETWORKING 2011. LNCS, vol. 6640, pp. 1–14. Springer, Hei-
delberg (2011). https://doi.org/10.1007/978-3-642-20757-0_1
9. Gleich, D.F.: Pagerank beyond the web. SIAM Rev. 57(3), 321–363 (2015)
10. Hassija, V., Bansal, G., Chamola, V., Kumar, N., Guizani, M.: Secure lending:
blockchain and prospect theory-based decentralized credit scoring model. IEEE
Trans. Netw. Sci. Eng. 7(4), 2566–2575 (2020)
11. Li, L., Wu, J., Cui, W.: A review of blockchain cross-chain technology. IET
Blockchain 3(3), 149–158 (2023)
12. Lofgren, P., Banerjee, S., Goel, A.: Bidirectional pagerank estimation: from
average-case to worst-case. In: Algorithms and Models for the Web Graph, pp.
164–176, Springer International Publishing, Cham (2015)
13. Mcwilliams, A.: Corporate social responsibility: a theory of the firm perspective.
Acad. Manage. Rev. 26, 117–127 (01 2001)
14. Mitra, A.: How can we enhance reputation in blockchain consensus for indus-
try 4.0-a proposed approach by extending the pagerank algorithm. Inter-
national Journal of Information Management Data Insights 2(2), 100138
(2022), ISSN 2667-0968, https://doi.org/10.1016/j.jjimei.2022.100138, https://
www.sciencedirect.com/science/article/pii/S2667096822000817
15. myFICO: What's in my FICO® scores? (2024). https://www.myfico.com/credit-
education/whats-in-your-credit-score
16. Qin, C., et al.: A segmented pagerank-based value compensation method for per-
sonal data in alliance blockchains. Big Data Research 30, 100326 (2022), ISSN
2214-5796, https://doi.org/10.1016/j.bdr.2022.100326, https://www.sciencedirect.
com/science/article/pii/S221457962200020X
17. Qin, K., Zhou, L., Gamito, P., Jovanovic, P., Gervais, A.: An empirical study of defi
liquidations: incentives, risks, and instabilities. In: IMC ’21: Proceedings of the 21st
ACM Internet Measurement Conference, pp. 336–350, Association for Computing
Machinery, New York, NY, United States (nov 2021)
18. Rieder, B.: What is in pagerank? a historical and conceptual investigation of a
recursive status index. Comput. Cult. 2 (sep 2012)
19. Santos, S.D., Singh, J., Thulasiram, R.K., Kamali, S., Sirico, L., Loud, L.: A new
era of blockchain-powered decentralized finance (defi) - a review. In: Proceedings of
the IEEE Annual Computer Software and Applications Conference (COMPSAC),
pp. 1286–1292, IEEE, Los Alamitos, CA, USA (aug 2022)
20. Shirole, M., Darisi, M., Bhirud, S.: Cryptocurrency token: an overview. In: IC-BCT
2019, pp. 133–140, Springer Singapore, Singapore (2020)
21. Suratkar, S., Shirole, M., Bhirud, S.: Cryptocurrency wallet: a review. In: 2020
4th International Conference on Computer, Communication and Signal Processing
(ICCCSP), pp. 1–7 (2020)
22. Taherdoost, H.: Smart contracts in blockchain technology: A critical review. Infor-
mation 14(2), 117 (2023)
23. Xing, W., Ghorbani, A.: Weighted pagerank algorithm. In: Proceedings of the
Second Annual Conference on Communication Networks and Services Research,
2004., pp. 305–314 (2004)
24. Xu, J., Vadgama, N.: From banks to defi: the evolution of the lending market.
In: Enabling the Internet of Value, pp. 53–66, Springer, Cham, Los Alamitos, CA,
USA (jan 2022)
25. Zhou, Z., Shen, B.: Toward understanding the use of centralized exchanges for
decentralized cryptocurrency (2022). https://arxiv.org/abs/2204.08664
Unifying Convolution and Self-attention
for Liver Lesion Diagnosis on Multi-phase
Magnetic Resonance Imaging

Huynh-Sang Nguyen1,2, Nhat-Minh Truong1,2, and Minh-Triet Tran1,2(B)
1 Software Engineering Laboratory and Faculty of Information Technology, University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
{nhsang20,tnminh20}@apcs.fitus.edu.vn
2 Vietnam National University, Ho Chi Minh City, Vietnam
[email protected]

Abstract. Accurate liver lesion diagnosis is crucial for effective treatment planning, with Magnetic Resonance Imaging (MRI) being a key
diagnostic tool due to its ability to provide detailed anatomical and
functional information. Despite its benefits, the manual analysis of 3D
multi-phase MR images is challenging for radiologists due to the complex-
ity of the data and the variability in lesion characteristics. To address
this issue, in this paper, we propose a novel approach that integrates
convolutional neural networks and self-attention mechanisms using the
UniFormer framework. This method combines local and global feature
extraction to enhance the accuracy of liver lesion classification. By lever-
aging pretrained weights from video tasks, the model performs better in
identifying and classifying lesions than traditional methods. Extensive
experiments with the LLD-MMRI2023 dataset, which includes multi-
phase MR images for liver lesions, demonstrate significant advancements
in diagnostic accuracy. This approach not only aids in automating the
analysis process but also supports radiologists by reducing diagnostic
errors and improving patient care. The research highlights the effective-
ness of combining convolutional and self-attention mechanisms in medi-
cal image analysis and suggests promising avenues for future automated
diagnostic systems.

Keywords: Liver Lesion Classification · Multi-phase MRI · Convolutional Neural Network · Transformer · Self-Attention

1 Introduction

Liver cancer is the sixth most diagnosed cancer and the third leading cause of
cancer-related deaths globally [18]. Hepatocellular carcinoma (HCC) accounts
for most cases, while secondary liver cancers from metastases also contribute
significantly [7]. Non-invasive imaging methods like CT and MRI are essential
for detecting tumors and planning surgeries, despite challenges from the liver’s
complex architecture and lesion variability [12].

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025
W. Buntine et al. (Eds.): SOICT 2024, CCIS 2350, pp. 495–509, 2025.
https://doi.org/10.1007/978-981-96-4282-3_39
Manual analysis of multi-phase images is time-intensive and prone to inconsistencies due to unclear margins and variable lesion appearances [16]. This
underscores the need for robust computer-aided diagnostic (CAD) systems to
enhance radiologist accuracy and efficiency, addressing growing patient numbers
and infrastructure challenges.
Deep learning has revolutionized medical image analysis by capturing com-
plex representations, especially in multi-phase imaging [23]. Advanced techniques
leverage temporal and structural data to improve diagnostic accuracy [22] [9].
Two main fusion strategies are common: image-level fusion, which prioritizes
scalability but may lose feature richness, and feature-level fusion, which cap-
tures detailed interactions but is computationally demanding [4] [17].
Both approaches face limitations, including high computational demands and
challenges with varying phase counts [15]. Advances in hardware and computa-
tional methods are needed to enhance their clinical viability.
This paper aims to improve the diagnostic accuracy of liver lesions using
advanced image analysis techniques. We focus on multi-phase Magnetic Reso-
nance Imaging (MRI) and introduce a novel method that combines the strengths
of convolutional neural networks (CNNs) and self-attention mechanisms. Our
approach seeks to enhance the ability to accurately classify liver lesions by lever-
aging local and global feature extraction capabilities, addressing the challenges
radiologists face in interpreting complex and variable lesion characteristics.
Our main contributions in this paper are as follows:
– We adapt the UniFormer framework’s backbone from image classification
to our specific task by incorporating 3D convolutions. By integrating CNNs
and self-attention mechanisms, UniFormer framework combines local feature
extraction through convolutional operations and global feature relationships
via self-attention. We also explore different kernel sizes to determine the opti-
mal receptive field for patch embedding.
– Our method employs pretrained weights from video classification tasks, sig-
nificantly boosting the model’s ability to identify and classify liver lesions.
This transfer learning strategy provides a substantial performance gain over
traditional diagnostic methods.
– Comprehensive Experimental Validation: We conduct extensive experiments
on the LLD-MMRI2023 dataset [15], which contains multi-phase MR images
of liver lesions. Our results demonstrate considerable advancements in diag-
nostic accuracy, validating the effectiveness of our approach.
– We incorporate a class-balanced focal loss function to tackle data imbalance
issue. This function prioritizes underrepresented classes so that model per-
forms robustly across all categories, improving overall diagnostic accuracy.
The structure of this paper is as follows. In Sect. 2, we briefly review rele-
vant work on Liver Lesion Diagnosis on Multi-phase MRI. Then, in Sect. 3, we propose our method using UniFormer, a framework combining convolution and the self-attention mechanism. The experimental results of our proposed method on the LLD-MMRI2023 dataset are presented in Sect. 4. Finally, Sect. 5 discusses the conclusion and future work.
2 Related Work
2.1 SDR-Former: A Siamese Dual-Resolution Transformer for Liver
Lesion Classification Using 3D Multi-phase Imaging

Lou et al. [15] introduce the SDR-Former, a framework designed for liver lesion
classification in 3D multi-phase CT and MR imaging. This framework com-
bines a hybrid CNN-Transformer network called DR-Former and an Adaptive
Phase Selection Module (APSM) to enhance feature representation and improve
diagnostic accuracy. The SDR-Former is validated using two clinical datasets: a
three-phase CT dataset with two lesion types and an eight-phase MR dataset
with seven different lesion categories.
The SDR-Former framework employs a dual-stage approach with a Siamese
Neural Network (SNN) for feature extraction and an Adaptive Phase Selection
Module (APSM) for phase-specific feature integration. The SNN ensures scalabil-
ity and adaptability across datasets with varying phase counts, enabling effective
transfer learning. However, SNN alone may struggle to isolate distinctive phase
features, leading to weaker representation.
To address this, the DR-Former network incorporates dual branches: a 3D
CNN for high-resolution spatial details and a 3D Transformer for low-resolution
global context. These complementary methods, connected via a Bidirectional
Convolutional Interaction Module (BCIM), enhance feature exchange and rep-
resentation.
The APSM then dynamically merges phase-sensitive features, emphasizing
diagnostically critical information. The combined features are processed through
Global Average Pooling (GAP) and a Fully-Connected (FC) layer for final clas-
sification. This design ensures robust multi-phase imaging analysis, improving
diagnostic accuracy.

2.2 Class-Balanced Loss Based on Effective Number of Samples

Yin Cui et al. [3] addresses the issue of imbalanced data in large-scale datasets,
where a few classes dominate while most classes have relatively few samples.
Traditional re-balancing strategies, such as re-sampling and re-weighting based
on class frequency, often fail to yield satisfactory performance on real-world data.
The authors propose a new approach that calculates the effective number of
samples, which considers the diminishing additional benefit of new data points
as the sample size increases. This is achieved through a formula involving a
hyperparameter β, which helps better estimate each class's true representation.
To address the imbalance, the authors introduce a class-balanced loss func-
tion that re-weights the loss for each class inversely proportional to its effective
number of samples. This method assigns higher weights to under-represented
classes, thereby improving the model’s performance across all classes. The paper
demonstrates that this approach significantly enhances the accuracy of mod-
els on long-tailed datasets like CIFAR, ImageNet, and iNaturalist. The paper
shows substantial improvements in handling data imbalance by integrating the
class-balanced term with common loss functions such as softmax cross-entropy, sigmoid cross-entropy, and focal loss.
The Class-Balanced Softmax Cross-Entropy Loss modifies the traditional
softmax cross-entropy by incorporating a balancing term that adjusts the loss
based on each class’s effective number of samples. This helps address the issue of
class imbalance by assigning higher weights to underrepresented classes, thereby
improving the model’s overall performance.
In the Class-Balanced Sigmoid Cross-Entropy Loss, the sigmoid function is
used instead of the softmax function. Unlike softmax, sigmoid treats each class
independently, making it suitable for multi-class and multi-label classification
tasks commonly found in real-world datasets. The class-balanced term is applied
similarly to the softmax variant, re-weighting the loss based on the effective
number of samples. This approach benefits from not assuming mutual exclusivity
among classes and is advantageous for datasets with a large number of fine-
grained classes.
Class-Balanced Focal Loss builds on the focal loss, which focuses on diffi-
cult samples by adding a modulating factor to the sigmoid cross-entropy loss.
The class-balanced focal loss incorporates the same balancing term used in the
previous methods, adjusting the loss based on the effective number of samples.
This combination helps reduce the relative loss for well-classified samples and
emphasizes learning from hard-to-classify examples. The class-balanced term in
focal loss can be seen as an explicit way to set the balancing factor based on the
effective number of samples to handle imbalanced data better.
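A hedged sketch of this idea is given below: per-class weights proportional to the inverse effective number of samples, (1 − β^n)/(1 − β), combined with a focal modulation. It uses a softmax-based variant for brevity, whereas the original work also discusses sigmoid-based forms; hyperparameter values and the toy inputs are illustrative.

import torch
import torch.nn.functional as F

def class_balanced_focal_loss(logits, targets, samples_per_class, beta=0.999, gamma=2.0):
    # logits: (B, K); targets: (B,) int64; samples_per_class: length-K counts
    n = torch.as_tensor(samples_per_class, dtype=torch.float, device=logits.device)
    effective_num = (1.0 - beta ** n) / (1.0 - beta)      # effective number per class
    weights = 1.0 / effective_num
    weights = weights / weights.sum() * len(n)            # normalize so weights sum to K
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets[:, None]).squeeze(1) # log-probability of the true class
    p_t = log_pt.exp()
    focal = (1.0 - p_t) ** gamma * (-log_pt)              # focal modulation of cross-entropy
    return (weights[targets] * focal).mean()

loss = class_balanced_focal_loss(
    torch.randn(4, 7), torch.tensor([0, 2, 6, 3]),
    samples_per_class=[120, 60, 40, 30, 25, 22, 19])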

3 Proposed Method
3.1 Unifying Convolution and Self-attention
In recent years, convolutional neural networks (CNNs) have revolutionized com-
puter vision by excelling in tasks like image classification and object detec-
tion. Starting with seminal architectures like AlexNet [11], CNNs have evolved
through numerous powerful variants, demonstrating high performance across
a spectrum of image understanding tasks. As video data gains prominence,
researchers have extended CNNs to 3D space, albeit facing challenges of opti-
mization complexity and computational cost. Strategies such as kernel inflation
and dimension factorization have been explored to mitigate these issues, while
temporal modeling enhancements like temporal shifts and spatiotemporal exci-
tation have aimed to improve video understanding.
Vision Transformers (ViTs) have emerged as an alternative approach to cap-
ture long-range dependencies in images. Inspired by Transformer architectures
from natural language processing (NLP), ViTs represent images as tokens and
utilize attention mechanisms to model token relationships. Despite initial depen-
dencies on large datasets and careful augmentation, advancements in patch
embedding, efficient self-attention, and multi-scale architectures have signifi-
cantly enhanced ViT performance across various image tasks. Extensions to
Fig. 1. Unified transFormer. The dimensions highlighted in red only exist for the video
input, while all are equal to one for image input. [13] (Color figure online)

video modeling, such as TimeSformer, have further adapted ViTs for spatiotem-
poral representation learning, though challenges remain in efficiently encoding
low-level features compared to convolution-based methods.
Efforts to combine CNNs and ViTs seek to leverage their respective strengths
for enhanced vision tasks. Integrative approaches include incorporating convo-
lutional stems and position embeddings into ViTs, as well as embedding convo-
lution within Transformer feed-forward networks. While some methods focus on
replacing convolution with self-attention, recent innovations like UniFormer [13]
propose unified architectures that blend both mechanisms. This approach aims
to optimize local and global token relations across diverse vision tasks, achiev-
ing improved accuracy and computational efficiency in both image and video
domains.

UniFormer Block Overview. Figure 1 illustrates the Unified transFormer (UniFormer). To simplify, consider a video consisting of T frames, where an
image input can be viewed as a video with a single frame. Consequently, the
dimensions highlighted in red pertain only to video inputs, whereas they are
all equal to one for image inputs. UniFormer adopts a foundational transformer
format, yet it is meticulously designed to address computational inefficiencies
and effectively capture intricate dependencies. Specifically, the UniFormer block
consists of three key modules: Dynamic Position Embedding (DPE), Multi-Head
Relation Aggregator (MHRA) and Feed-Forward Network (FFN):

X = DPE(X_in) + X_in    (1)
Y = MHRA(Norm(X)) + X    (2)
Z = FFN(Norm(Y)) + Y    (3)

Considering the input token tensor X_in ∈ R^{C×T×H×W} (where T = 1 for image input), the paper first introduces Dynamic Positional Embedding (DPE) to
dynamically incorporate positional information into all tokens (Eq. 1). Next,
Multi-Head Relation Attention (MHRA) enhances each token by capturing con-
textual relationships with neighboring tokens (Eq. 2). Finally, the paper employs
a Feed-Forward Network (FFN) akin to traditional Vision Transformers (ViTs)
(Eq. 3).
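To make the block structure of Eqs. (1)-(3) concrete, the PyTorch sketch below stacks DPE, a local MHRA realised with depthwise convolutions, and an FFN with residual connections. The module definitions, kernel sizes, and normalization choices are simplifications, not the authors' exact implementation.

import torch
import torch.nn as nn

class DPE(nn.Module):
    """Dynamic Position Embedding: depthwise 3D convolution with zero padding."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)
    def forward(self, x):            # x: (B, C, T, H, W)
        return self.proj(x)

class LocalMHRA(nn.Module):
    """Local relation aggregator: the learnable neighborhood affinity of Eq. (6)
    is realised here as a depthwise 3D convolution between pointwise projections."""
    def __init__(self, dim, kernel=(3, 5, 5)):
        super().__init__()
        pad = tuple(k // 2 for k in kernel)
        self.v = nn.Conv3d(dim, dim, kernel_size=1)                               # V_n(X)
        self.affinity = nn.Conv3d(dim, dim, kernel, padding=pad, groups=dim)      # A_n
        self.proj = nn.Conv3d(dim, dim, kernel_size=1)                            # U
    def forward(self, x):
        return self.proj(self.affinity(self.v(x)))

class FFN(nn.Module):
    def __init__(self, dim, ratio=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(dim, dim * ratio, 1), nn.GELU(), nn.Conv3d(dim * ratio, dim, 1))
    def forward(self, x):
        return self.net(x)

class UniFormerBlock(nn.Module):
    """X = DPE(X) + X;  Y = MHRA(Norm(X)) + X;  Z = FFN(Norm(Y)) + Y."""
    def __init__(self, dim):
        super().__init__()
        self.dpe = DPE(dim)
        self.norm1 = nn.BatchNorm3d(dim)
        self.mhra = LocalMHRA(dim)
        self.norm2 = nn.BatchNorm3d(dim)
        self.ffn = FFN(dim)
    def forward(self, x):
        x = self.dpe(x) + x
        x = self.mhra(self.norm1(x)) + x
        return self.ffn(self.norm2(x)) + x

block = UniFormerBlock(64)
out = block(torch.randn(2, 64, 8, 28, 28))   # (B, C, T, H, W)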

Multi-Head Relation Aggregator. Traditional CNNs and ViTs tend to address either local redundancy or global dependency, resulting in suboptimal
accuracy and/or excessive computation. The paper presents a versatile Relation
Aggregator (RA) to address these issues, which seamlessly integrates convolution
and self-attention for learning token relationships. By incorporating local token
affinity in the shallow layers and global token affinity in the deeper layers, the
RA enables efficient and effective representation learning. Specifically, MHRA
exploits token relationships in a multi-head style:
R_n(X) = A_n V_n(X)    (4)
MHRA(X) = Concat(R_1(X), R_2(X), . . . , R_N(X)) U    (5)

Given the input tensor X ∈ R^{C×T×H×W}, it is first reshaped into a sequence of tokens X ∈ R^{L×C}, where L = T × H × W. Here, R_n(·) denotes the Relation Aggregator (RA) in the n-th head, and U ∈ R^{C×C} is a learnable parameter matrix used to integrate N heads. Each RA module comprises token context encoding and token affinity learning. The original tokens are encoded into contextual tokens V_n(X) ∈ R^{L×C} through a linear transformation. Subsequently, the RA module can summarize context under the guidance of the token affinity A_n ∈ R^{L×L}. Details on the learning process for A_n are described subsequently.

Local MHRA. The paper proposes representing local affinity using a learnable parameter matrix in the shallow layers. Specifically, for an anchor token X_i, the local RA learns the affinity between this token and others in the small neighborhood Ω_i^{t×h×w} (t = 1 for image input):

A_n^{local}(X_i, X_j) = a_n^{i−j},    (6)

where j ∈ Ω_i^{t×h×w}, a_n ∈ R^{t×h×w} is a learnable parameter, and X_j refers to any neighbor token in Ω_i^{t×h×w}. Here, (i − j) denotes the relative position between token i and j.

Global MHRA. In the deep layers, exploiting long-range relations across a broader token space is crucial, akin to the principles underlying self-attention. Therefore, we define token affinity by comparing content similarity among all tokens:

A_n^{global}(X_i, X_j) = exp(Q_n(X_i)^T K_n(X_j)) / Σ_{j′ ∈ Ω_{T×H×W}} exp(Q_n(X_i)^T K_n(X_{j′})),    (7)

where X_j can be any token in the global cube with dimensions T × H × W (T = 1 for an image input). Here, Q_n(·) and K_n(·) represent two different linear transformations.
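For illustration, the single-head sketch below computes the global affinity of Eq. (7) as a softmax over content similarity and applies it to the contextual tokens. The shapes and the single-head simplification are assumptions.

import torch

def global_mhra(x, Wq, Wk, Wv):
    """x: (L, C) token sequence with L = T*H*W; Wq, Wk, Wv: (C, C) projections."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                 # Q_n(X), K_n(X), V_n(X)
    A = torch.softmax(Q @ K.T, dim=-1)               # A_n^{global} in Eq. (7)
    return A @ V                                     # R_n(X) = A_n V_n(X)

L, C = 8 * 7 * 7, 64
x = torch.randn(L, C)
out = global_mhra(x, torch.randn(C, C), torch.randn(C, C), torch.randn(C, C))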
Table 1. Backbones for image classification, ‘L’ and ‘G’ refer to local and global
UniFormer blocks, respectively. [13]

Model        Block Type    #Blocks        #Channels            #Param
Uniformer-S  [L, L, G, G]  [3, 4, 8, 3]   [64, 128, 320, 512]  21M
Uniformer-B  [L, L, G, G]  [5, 8, 20, 7]  [64, 128, 320, 512]  50M

Dynamic Position Embedding. Positional information plays a crucial role in describing visual representations. Traditionally, most Vision Transformers (ViTs)
encode such information using absolute or relative position embeddings [6] [14]
[21]. However, absolute position embeddings require interpolation for varying
input sizes during fine-tuning [19] [20], while relative position embeddings often
struggle due to modifications in self-attention mechanisms [2]. To enhance flexi-
bility, recent approaches have introduced convolutional position embeddings [1]
[5]. Specifically, conditional position encoding (CPE) [2] can implicitly capture
positional information through convolutional operations, enabling Transformers
to handle arbitrary input sizes and improving recognition performance. Due to
its versatile nature, the paper adopts CPE as the Dynamic Position Embedding (DPE) in UniFormer:

DPE(X_in) = DWConv(X_in),    (8)

where DWConv refers to depthwise convolution with zero paddings. The paper
adopts this design for DPE based on several reasons. First, depthwise convolution
adapts well to arbitrary input shapes, such as its straightforward extension to
encode 3D positional information in videos. Second, it is lightweight, balancing
computation and accuracy efficiently. Finally, adding zero paddings helps tokens
understand their absolute positions by progressively querying their neighbors [2]
(Table 1).

3.2 Data Augmentation

Data augmentation is essential for improving the robustness and generalizability of the liver lesion classification model. Various advanced techniques are employed
to artificially expand the training dataset with diverse and realistic image varia-
tions, mitigating overfitting and enhancing performance across different imaging
conditions and anatomical variations.
Initially, NIfTI files are converted into numpy arrays. The images are resized
using trilinear or tricubic interpolation for high-quality resampling. Normaliza-
tion adjusts pixel intensity values to enhance contrast. Random cropping extracts
sub-volumes from larger 3D images, ensuring the model recognizes lesions from
various liver locations. Random flipping and rotations introduce spatial variabil-
ity, while edge detection using the Sobel operator emphasizes critical structural
information. Blurring and sharpening techniques simulate different image qual-


ities, and embossing highlights textures and contours.
Masking emphasizes specific regions, while auto-augmentation introduces
diverse transformations from the ImageNet policy. The difference between con-
secutive frames is computed to capture dynamic information. These compre-
hensive augmentation techniques create a rich training set, enabling the model
to generalize better to unseen data and improve its liver lesion classification
performance.
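The snippet below illustrates a few of these 3D augmentations (random cropping, flips, in-plane rotations, Gaussian blurring) on a numpy volume. The probabilities and parameter values are assumptions rather than the exact training configuration.

import numpy as np
from scipy.ndimage import gaussian_filter

def augment(volume, crop=(14, 112, 112), rng=np.random.default_rng()):
    # random crop of a sub-volume
    z, y, x = [rng.integers(0, max(s - c, 0) + 1) for s, c in zip(volume.shape, crop)]
    v = volume[z:z + crop[0], y:y + crop[1], x:x + crop[2]]
    if rng.random() < 0.5:
        v = v[:, :, ::-1]                                   # horizontal flip
    if rng.random() < 0.5:
        v = np.rot90(v, k=rng.integers(1, 4), axes=(1, 2))  # in-plane rotation
    if rng.random() < 0.3:
        v = gaussian_filter(v, sigma=0.8)                   # mild blurring
    return np.ascontiguousarray(v)

vol = np.random.rand(20, 160, 160).astype(np.float32)
patch = augment(vol)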

3.3 Region-Guided Training


Region-guided training focuses the model’s attention on clinically relevant areas,
improving diagnostic accuracy. This involves using specific regions of interest
(ROIs) within the liver images that are known to be indicative of various liver
lesions. By guiding the model to prioritize these regions, we enhance its ability to
accurately classify different types of liver lesions, from hepatocellular carcinoma
to hepatic cysts, ensuring that the model learns to distinguish between subtle
variations in imaging features that are crucial for accurate diagnosis

3.4 Loss Function


A dataset exhibits a long-tailed distribution where some classes are significantly
underrepresented compared to others. This imbalance poses several challenges
during model training and evaluation. Models trained on such datasets tend
to prioritize learning from the majority classes, often resulting in poor perfor-
mance in minority classes. This issue is particularly pronounced in tasks such as
object detection, semantic segmentation, and multi-class classification, where
accurate representation of all classes is crucial for effective decision-making.
The Class Balanced Focal Loss function (CBFL) is specifically designed to mit-
igate the impact of class imbalance in long-tailed datasets. It integrates two key
mechanisms—class balancing and focal loss—to enhance the model’s ability to
learn from and generalize across all classes, regardless of their distribution in the
dataset.
CBFL adjusts the contribution of each class to the overall loss function based
on its frequency in the dataset. In long-tailed datasets, where some classes have
fewer instances, CBFL ensures that these classes receive proportionately higher
emphasis during training. This prevents the model from being biased towards
the majority classes and allows it to allocate more resources to learning from rare
and valuable data instances. The mathematical formulation of this loss function follows the class-balanced focal loss reviewed in Sect. 2.2.

4 Experiments
4.1 Dataset
The MR image dataset used in our experiment was obtained from Lou et al. [15]
and consists of 498 annotated multi-phase liver lesions from an equal number
Table 2. Dataset variation in Phase resolution, Number of Slices

Split        #Studies  #Resolutions  #Slices per phase
Train        316       9             16–98
Validation   78        9             16–98
Test         104       11            16–108

of patients. Each lesion is captured across eight phases: non-contrast, arterial, venous, delayed, T2-weighted imaging, diffusion-weighted imaging, T1 in-phase,
and T1 out-of-phase, each providing a distinct volume. The lesions are catego-
rized into seven types: hepatocellular carcinoma (HCC), intrahepatic cholangio-
carcinoma (ICC), liver metastases (HM), hepatic cysts (HC), hepatic heman-
gioma (HH), focal nodular hyperplasia (FNH), and hepatic abscess (HA). The
dataset is pre-partitioned for research purposes into training (316 lesions), valida-
tion (78 lesions), and testing (104 lesions) sets. To advance multi-phase medical
imaging analysis, this dataset has been publicly released and serves as the core
resource for the MICCAI LLD-MMRI Challenge 2023, which aims to promote
the development and refinement of computer-aided diagnosis (CAD) systems
(Table 2).
We use several metrics to evaluate classification performance, including Accu-
racy, Precision, Recall, F1-score, and Cohen’s Kappa. Accuracy measures the
proportion of correctly classified instances among the total instances, providing
an overall effectiveness measure. Precision calculates the proportion of true pos-
itive predictions among all positive predictions, which is important when the
cost of false positives is high. Recall (sensitivity) determines the model’s ability
to identify all relevant instances by calculating the proportion of true positive
predictions among all actual positives, crucial when the cost of false negatives is
high. The F1-score combines precision and recall into a single metric, providing
a balanced measure, particularly useful for imbalanced datasets. Cohen’s Kappa
measures inter-rater agreement for categorical items, comparing the model’s pre-
dictions with actual labels while accounting for chance agreement, offering a
nuanced evaluation beyond simple accuracy.
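These metrics can be computed directly with scikit-learn, as in the illustrative snippet below; the label arrays are toy values, and macro averaging is an assumption.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score)

y_true = [0, 2, 1, 3, 4, 2, 0]
y_pred = [0, 2, 1, 2, 4, 2, 1]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("Recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F1-score :", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("Kappa    :", cohen_kappa_score(y_true, y_pred))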

4.2 Experimental Settings


Preprocessing. The preprocessing of the dataset involves cropping the images
to focus on the region of interest (ROI) as specified in the annotation files. For
each 2D slice, the bounding box (box2d) is used to delineate the ROI, while for
each 3D MRI volume, the bounding box (box3d) provides the necessary spa-
tial coordinates. To ensure sufficient context around the lesions, the region is
extended by 16 pixels from the maximum coordinates in both the x and y direc-
tions. Additionally, two extra slices are included at each end along the z-axis.
This approach enhances the model’s accuracy by providing a more comprehen-
sive view of the relevant anatomical structures.
As an example shown in Fig. 2, consider the study ‘MR-391135’ in the arterial
phase, where the original volume size is (72, 512, 512), indicating 72 2D slices
Fig. 2. Example for preprocessing

each of size 512 × 512. The lesion is annotated to be present from the 33rd to
the 42nd slice. To preprocess this data, we create a 3D volume based on the
annotation information. The ROI is defined using the bounding boxes, and to
ensure adequate context, we extend the region by 16 pixels in both the x and
y directions and add two extra slices at each end along the z-axis. This results
in a final cropped 3D volume (14, 94, 87), comprehensively capturing the lesion
and surrounding areas.
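A minimal sketch of this cropping step is given below; the bounding-box convention (end-exclusive indices) and the example coordinates are assumptions made only to show the padding logic.

import numpy as np

def crop_lesion(volume, box3d, pad_xy=16, pad_z=2):
    """volume: (Z, H, W) array; box3d: (z0, y0, x0, z1, y1, x1), end-exclusive."""
    z0, y0, x0, z1, y1, x1 = box3d
    z0, z1 = max(z0 - pad_z, 0), min(z1 + pad_z, volume.shape[0])
    y0, y1 = max(y0 - pad_xy, 0), min(y1 + pad_xy, volume.shape[1])
    x0, x1 = max(x0 - pad_xy, 0), min(x1 + pad_xy, volume.shape[2])
    return volume[z0:z1, y0:y1, x0:x1]

vol = np.zeros((72, 512, 512), dtype=np.float32)
roi = crop_lesion(vol, (33, 200, 210, 43, 294, 281))   # hypothetical ROI coordinates
print(roi.shape)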

4.3 Implementation Details


Our experiments were conducted using Uniformer [13] framework, implemented
in Python and utilizing the PyTorch library. The training methodology pre-
scribed by Uniformer framework is with the following configuration:
– Hardware Specifications: The implementation and execution of our experi-
ments were performed on a machine running Ubuntu 22.04.4 LTS. The sys-
tem is equipped with an Intel® Core™ i5-9400F CPU, 32GB of memory,
and an RTX 4070 Ti SUPER GPU with 16GB of VRAM, the batch size is
set to 4 and 8 for small and base Uniformer variant, respectively.
– Epochs: Leveraging the pretrained weights from the video classification task,
the training process extends over 50 epochs. The initial 5 epochs serve as a
warm-up phase, gradually increasing the learning rate.
– Optimizer: We employed the AdamW [10] optimizer, initiating with an initial
learning rate of 0.0001. This learning rate is dynamically adjusted using a
cosine annealing schedule to enhance performance. A weight decay of 0.05
is applied to mitigate the risk of overfitting. To further mitigate overfitting,
various data augmentation techniques are employed: random rotations and
flips across different anatomical axes, edge detection transformation, random
Gaussian noise, and blurring effects. Lesion volumes are randomly cropped
to dimensions of 14 × 112 × 112.
– Loss function: Given the long-tailed distribution of the dataset, a class-
balanced loss function is employed to compare the model’s output with the
ground truth.

4.4 Experimental Results


Qualitative Observations: Kernel Size impacts on Result. When tran-
sitioning from image to video classification, the model architecture retains its
four-stage structure. The first two stages utilize local UniFormer blocks, while
the last two stages employ global UniFormer blocks. However, all 2D convolution
filters are replaced with 3D convolution filters. Specifically, the kernel sizes of the
DWConv in DPE and local MHRA are 3 × 3 × 3 and 5 × 5 × 5, respectively. We experimented with various configurations to downsample the spatial dimensions, aiming to reduce computational costs while maintaining optimal performance. Consequently, the convolution filters prior to these stages are set to 1 × 2 × 2 with a stride of 1 × 2 × 2, except for the first stage, which has a kernel size of 2 × 2 × 2 and a stride of 2 × 2 × 2. Further modifications to the first stage were
found to degrade performance (Table 3).

Performance Comparison with Other Methods. The performance of our proposed method using the Uniformer-base variant is evaluated against the base-
line and other top results in the MICCAI LLD-MMRI Challenge. The evalua-
tion metrics include accuracy, precision, F1 score, and Cohen’s kappa coefficient,
which collectively comprehensively assess each method’s effectiveness.
Baseline: The baseline method achieved an accuracy of 0.6250, a precision of
0.6344, an F1 score of 0.6083, and a kappa coefficient of 0.5414. These results
serve as the reference point for comparing the performance improvements of
more advanced methods.
ResNet50: This method outperformed the baseline with an accuracy of
0.6932, a precision of 0.7932, an F1 score of 0.6898, and a kappa coefficient
of 0.6244. The improvements in all metrics highlight the effectiveness of the
ResNet50 [8] model in this context.
SDR-Former: The SDR-Former [15] method showed further enhancements with an accuracy of 0.7885, a precision of 0.8122, an F1 score of 0.7910, and a kappa coefficient of 0.7467. These results represent the highest scores among the comparative methods, with SDR-Former achieving the best accuracy, F1 score, and kappa coefficient.

Table 3. Results across the Multi-phase MRI Dataset

Method       Accuracy  Precision  F1      Kappa
Baseline     0.6250    0.6344     0.6083  0.5414
ResNet50     0.6923    0.7932     0.6898  0.6244
SDR-Former   0.7885    0.8122     0.7910  0.7467
Ours         0.7885    0.8380     0.7719  0.7358
Our proposed method using the Uniformer-base variant matched the highest
accuracy score of 0.7885. It also achieved the highest precision of 0.8380, indi-
cating superior ability in correctly identifying positive cases. The F1 score was
0.7719, slightly lower than that of SDR-Former but still competitive, and the
kappa coefficient was 0.7358, reflecting substantial agreement and robustness of
our approach.
Overall, our method demonstrated competitive performance, particularly
excelling in precision and achieving strong results across all metrics, thus vali-
dating the effectiveness of the Uniformer framework for multi-phase liver lesion
detection and classification.
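
The metrics in Table 3 can be recomputed from predicted and ground-truth labels with scikit-learn; the helper below is a generic sketch (the macro averaging for precision and F1 is an assumption on our part, and the toy labels are purely illustrative), not code from the challenge toolkit.

from sklearn.metrics import (accuracy_score, precision_score,
                             f1_score, cohen_kappa_score)

def evaluate(y_true, y_pred):
    # Compute the four metrics reported in Table 3 (macro-averaged where applicable).
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "kappa": cohen_kappa_score(y_true, y_pred),
    }

# Toy usage with hypothetical labels for the seven lesion classes (0-6).
y_true = [0, 1, 2, 3, 4, 5, 6, 0, 1]
y_pred = [0, 1, 2, 3, 4, 6, 6, 0, 2]
print(evaluate(y_true, y_pred))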

Quantitative Observations: The confusion matrix for our method's classification results is shown in Fig. 3. The model performs exceptionally well in
classifying Hepatic Hemangioma (HH) and Hepatocellular Carcinoma (HCC),
with high accuracy and no or minimal misclassifications. Intrahepatic Cholan-
giocarcinoma (ICC) and Hepatic Abscess (HA) show moderate accuracy but are
often confused with other liver lesions, particularly Hepatic Metastasis (HM) and
Hepatocellular Carcinoma (HCC). Hepatic Metastasis (HM) and Focal Nodular
Hyperplasia (FNH) present challenges, with significant misclassifications, espe-
cially with Hepatocellular Carcinoma (HCC). Hepatic Cyst (HC) demonstrates
high accuracy, with only one misclassification.

Fig. 3. Confusion Matrix of Predictions on test set
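
A confusion matrix such as the one in Fig. 3 can be produced with scikit-learn and matplotlib; in the sketch below the class ordering and the random labels are illustrative assumptions only.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Hypothetical class ordering; abbreviations follow the text above.
classes = ["HH", "ICC", "HA", "HM", "HCC", "FNH", "HC"]

# y_true / y_pred would come from the test loop; random labels here for illustration.
rng = np.random.default_rng(0)
y_true = rng.integers(0, len(classes), size=50)
y_pred = rng.integers(0, len(classes), size=50)

cm = confusion_matrix(y_true, y_pred, labels=range(len(classes)))
ConfusionMatrixDisplay(cm, display_labels=classes).plot(cmap="Blues", xticks_rotation=45)
plt.title("Confusion matrix on the test set")
plt.tight_layout()
plt.show()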



The confusion matrix analysis reveals that many cases are misclassified as
Hepatocellular Carcinoma (HCC), primarily due to several factors. Firstly, HCC
is the most common liver cancer type, resulting in a higher representation in the
training dataset, which can bias the model towards classifying uncertain cases as
HCC. The imaging characteristics of HCC often overlap with other liver lesions
such as Hepatic Metastasis (HM) and Intrahepatic Cholangiocarcinoma (ICC),
leading to confusion. The similarity in texture, shape, and enhancement patterns
across different MRI phases contributes to this misclassification. Moreover, class
imbalance in the dataset and variability in annotation quality can further exac-
erbate this issue, making it challenging for the model to accurately distinguish
HCC from other lesion types. Addressing these challenges may require balancing
the dataset, enhancing feature extraction techniques, employing advanced data
augmentation strategies, and ensuring consistent annotations.
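
As one concrete example of the dataset-balancing step suggested above, inverse-frequency resampling can be applied with PyTorch's WeightedRandomSampler; the label counts and tensors below are hypothetical stand-ins for the actual training set.

import torch
from torch.utils.data import WeightedRandomSampler, DataLoader, TensorDataset

# Hypothetical long-tailed label vector for the training set (7 lesion classes).
labels = torch.tensor([4] * 120 + [0] * 45 + [3] * 30 + [1] * 25
                      + [2] * 20 + [5] * 15 + [6] * 10)
features = torch.randn(len(labels), 8)      # stand-in for lesion volumes
dataset = TensorDataset(features, labels)

# Inverse-frequency weights: rare classes are drawn more often.
class_counts = torch.bincount(labels, minlength=7).float()
sample_weights = 1.0 / class_counts[labels]

sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

# Each epoch now sees a roughly class-balanced stream of lesions.
for volumes, targets in loader:
    pass  # the training step would go here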

5 Conclusion

Building on the foundational accomplishments highlighted in the introduction, this work addresses the inherent efficiency challenges in liver lesion classification. Drawing on experience with a range of obstacles in medical imaging, we have developed and validated a set of techniques specifically
designed to enhance the performance of the Uniformer framework for liver lesion
classification. Our comprehensive experimentation and rigorous ablation stud-
ies confirm the effectiveness of these strategies, each crafted to address distinct
aspects of this complex task.
We adapt the Uniformer framework’s backbone from image classification to
our specific task by incorporating 3D convolutions and leveraging pre-trained
weights from video classification tasks. This adaptation treats the different MRI
phases as frames in a video, effectively utilizing temporal information to enhance
feature extraction. Another significant contribution is our exploration of different
kernel sizes to determine the most suitable receptive field for patch embedding,
which is crucial for capturing relevant features in the data.
Our empirical results demonstrate that it is possible to achieve competitive
classification outcomes without extended training periods or complex ensemble
methods. By efficiently optimizing the training regimen and leveraging our pro-
posed techniques, we show that the Uniformer framework can deliver excellent
results while maintaining computational efficiency, a balance that aligns with our
primary objectives.
Future work could include incorporating additional imaging modalities like
CT and PET scans to enhance classification accuracy, addressing class imbalance
with techniques such as GANs and advanced resampling, and integrating the
model into clinical workflows for real-time processing. Additionally, improving
the explainability and interpretability of the model’s predictions and conducting
longitudinal studies across diverse patient cohorts will be crucial for gaining trust
and ensuring reliability in clinical practice.

Acknowledgment. This work is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant no. DS2020-42-01.

References
1. Chu, X., et al.: Twins: Revisiting the design of spatial attention in vision trans-
formers. In: Proceedings of the 35th Conference on Neural Information Processing
Systems (NeurIPS) (2021)
2. Chu, X., Zhang, B., Tian, Z., Wei, X., Xia, H.: Do we really need explicit
position encodings for vision transformers? arXiv preprint arXiv:2102.10882
abs/2102.10882 (2021)
3. Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019). https://doi.org/10.1109/CVPR.2019.01138
4. Liang, D., et al.: Combining convolutional and recurrent neural networks for clas-
sification of focal liver lesions in multi-phase CT images. In: Frangi, A.F., Schn-
abel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018.
LNCS, vol. 11071, pp. 666–675. Springer, Cham (2018). https://doi.org/10.1007/
978-3-030-00934-2_74
5. Dong, X., et al.: Cswin transformer: A general vision transformer backbone with
cross-shaped windows. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 12124–12134 (2022)
6. Dosovitskiy, A., et al.: An image is worth 16x16 words: Transformers for image
recognition at scale. In: Proceedings of the International Conference on Learning
Representations (ICLR) (2021)
7. Ferlay, J., Shin, H.R., Bray, F., Forman, D., Mathers, C., Parkin, D.M.: Estimates
of worldwide burden of cancer in 2008: Globocan 2008. Int. J. Cancer 127(12),
2893–2917 (2010)
8. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition
(2015). https://arxiv.org/abs/1512.03385
9. Khan, M., et al.: Multimodal brain tumor classification using deep learning and
robust feature selection: a machine learning application for radiologists. Diagnostics
10, 565 (08 2020). https://doi.org/10.3390/diagnostics10080565
10. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2017). https://
arxiv.org/abs/1412.6980
11. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: Pereira, F., Burges, C., Bottou, L., Weinberger,
K. (eds.) Advances in Neural Information Processing Systems. vol. 25. Curran
Associates, Inc. (2012). https://proceedings.neurips.cc/paper_files/paper/2012/
file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
12. Lee, S., et al.: CT and MRI liver imaging reporting and data system version 2018 for hepatocellular carcinoma: a systematic review with meta-analysis. J. Am. Coll. Radiol. 17(10), 1199–1206 (2020)
13. Li, K., et al.: Uniformer: Unifying convolution and self-attention for visual recog-
nition (2023). https://arxiv.org/abs/2201.09450
14. Liu, Z., et al.: Swin transformer: Hierarchical vision transformer using shifted win-
dows. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision (ICCV), pp. 10012–10022 (2021)

15. Lou, M., Ying, H., Liu, X., Zhou, H.Y., Zhang, Y., Yu, Y.: SDR-Former: a Siamese dual-resolution transformer for liver lesion classification using 3D multi-phase imaging (2024). https://arxiv.org/abs/2402.17246
16. Luo, L., et al.: Rare benign liver tumors that require differentiation from hepato-
cellular carcinoma: focus on diagnosis and treatment. J. Cancer Res. Clin. Oncol.
149 (07 2022). https://doi.org/10.1007/s00432-022-04169-w
17. Qu, T., et al.: M3net: A multi-scale multi-view framework for multi-phase pan-
creas segmentation based on cross-phase non-local attention. Med. Image Anal.
75, 102232 (2022)
18. Sung, H., et al.: Global cancer statistics 2020: Globocan estimates of incidence
and mortality worldwide for 36 cancers in 185 countries. CA: A Cancer J. Clin.
71(3), 209–249 (2021). https://doi.org/10.3322/caac.21660, https://acsjournals.
onlinelibrary.wiley.com/doi/abs/10.3322/caac.21660
19. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training
data-efficient image transformers & distillation through attention. In: Proceedings
of the 38th International Conference on Machine Learning (ICML), pp. 10347–
10357 (2021)
20. Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper
with image transformers. arXiv preprint arXiv:2103.17239 abs/2103.17239
(2021)
21. Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense pre-
diction without convolutions. In: Proceedings of the IEEE/CVF International Con-
ference on Computer Vision (ICCV), pp. 568–578 (2021)
22. Yasaka, K., Akai, H., Abe, O., Kiryu, S.: Deep learning with convolutional neural network for differentiation of liver masses at dynamic contrast-enhanced CT: a preliminary study. Radiology 286(3), 887–896 (2018). https://doi.org/10.1148/radiol.2017170706
23. Zhou, S.K., et al.: A review of deep learning in medical imaging: imaging traits,
technology trends, case studies with progress highlights, and future promises. Proc.
IEEE 109(5), 820–838 (2021). https://doi.org/10.1109/JPROC.2021.3054390
Author Index

A
Agrawal, Kunal 193, 203

B
Binh, Minh Tran 427
Bui, Manh Quan 415
Bui, Marc 439
Bui, Nhu-Nghia 3
Bui, Quang Vu 439

C
Cannon, Ian 203
Cuong, Ngo Xuan 182
Cuong, Nguyen Quoc 275

D
D. Huynh, Vinh-Hien 167
Dang, Thai-Viet 3
Dao, Mai Hoang 155
Dao, Thao Thi Phuong 100
Dao, Trong Hoan 452
Do, Manh Quang 216
Do, Trung-Hieu 100
Duong, Viet Hang 415

G
Giang, Nguyen Long 401

H
Hà, Minh Hoàng 354
Ho, Thi Kim Thoa 439
Hoang, Bao Ly Tran 298
Hoang, Diep Thi 467
Hung, Nguyen An 275
Huynh, Van-Hieu 54
Huynh, Viet-Tham 329
Huynh, Y. Thien 167

K
Kamioka, Eiji 287

L
Lam, Phat 39
Le, Ba Luat 371
Le, Chau-Anh 167
Le, Duy Minh 155
Le, Hoang 467
Le, Minh-Huan 14
Le, Quang-Khai 25
Le, Trung-Nghia 14, 25, 54, 65, 77, 88, 100, 127, 193
Liu, Yuchen 287

M
Mai, Tien 354
Mai, Tien-Dung 182
Mai, Xuan-Bach 65
Mien, Doan Phuoc 343
Minh, Hien Nguyen 467

N
Ngo, Dat 39
Nguyen Duy, Khang 386
Nguyen Tuan, Minh 386
Nguyen, Anh D. 239
Nguyen, Ba Nghien 216
Nguyen, Cong-Long 25
Nguyen, Duc P. T. 141
Nguyen, Duc-Vu 313
Nguyen, Hai-Dang 112
Nguyen, Hieu 127
Nguyen, Hoa N. 239
Nguyen, Hue T. 401
Nguyen, Huynh-Sang 495
Nguyen, Khanh-Duy 298
Nguyen, Kiet Van 313
Nguyen, Loi Khanh 39
Nguyen, Long 427
Nguyen, Long-Bao 54
Nguyen, Mau-Tra 478
Nguyen, Minh-Kha 100
Nguyen, Minh-Khang 329
Nguyen, Minh-Loi 54
Nguyen, Minh-Quang 226
Nguyen, Ngan Luu-Thuy 313
Nguyen, Ngoc-Thao 251
Nguyen, Phuc-Tan 127
Nguyen, Quoc-Nam 313
Nguyen, Quoc-Nghia 65
Nguyen, Quy T. 167
Nguyen, Son Thai 155
Nguyen, Tam V. 193, 203
Nguyen, Thanh Long 216
Nguyen, Thong T. 141
Nguyen, Tin 39
Nguyen, Truong 39
Nguyen, Van-Truong 3
Nguyen, Y-Hop 77
Nguyen-Huu, Hoang-Minh 65
Nguyen-Mau, Trong-Hieu 112
Nhat, Phan Minh 263, 275

P
Patel, Vastsa S. 193
Patel, Vatsa S. 203
Pham, Bac D. 401
Pham, Canh V. 401
Pham, Cuong 155
Pham, Dang H. 239
Pham, Hoang Giang 354, 371
Pham, Lam 39
Pham, Thinh 39
Pham, Viet-Bang 452, 478
Phan, Minh-Duy 14
Phan, Thang Chau 313
Phung, Kim Anh 100
Phu-Thi, Kim-Trang 112

Q
Quan, Tho T. 141

S
Schindler, Alexander 39
Sy, Ngo Van 343

T
Ta, Thuy Anh 371
Tan, Phan Xuan 3, 287
Thang, Pham Cong 263, 275
Thanh, Minh Le 298
Tharra, Reema 193
Thi Vo, Thuy-Giang 329
Thi, Thu To 427
Thoai, Nam 386
Tran, Cong 155
Tran, Dinh-Manh-Cuong 3
Tran, Minh-Triet 14, 25, 100, 112, 193, 203, 226, 329, 495
Tran, Phong Hai 251
Tran, Thien-Phuc 226
Tran, Uyen T. 401
Trinh, Tuan-Dat 452, 478
Truong, Chau M. 167
Truong, Nhat-Minh 495

V
Vinh, Ha Tang 467
Vo, Hoai-Danh 88
Vo, Khang H. N. 141
Vu, Duy B. 239
Vu, Hoang-Tung 65
Vu, Tran The 343

W
Wang, Jia-Ching 415

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025
W. Buntine et al. (Eds.): SOICT 2024, CCIS 2350, pp. 511–512, 2025.
https://doi.org/10.1007/978-981-96-4282-3
