Data Science and Emerging Technologies
Proceedings of DaSET 2023
Lecture Notes on Data Engineering
and Communications Technologies
Volume 191
Series Editor
Fatos Xhafa, Technical University of Catalonia, Barcelona, Spain
The aim of the book series is to present cutting-edge engineering approaches to data
technologies and communications. It publishes the latest advances in the engineering
task of building and deploying distributed, scalable, and reliable data infrastructures
and communication systems.
The series has a prominent applied focus on data technologies and communications,
with the aim of promoting the bridge from fundamental research on data science and
networking to data engineering and communications that lead to industry products,
business knowledge, and standardisation.
Indexed by SCOPUS, INSPEC, EI Compendex.
All books published in the series are submitted for consideration in Web of Science.
Yap Bee Wah · Dhiya Al-Jumeily OBE ·
Michael W. Berry
Editors
Michael W. Berry
University of Tennessee
Knoxville, TN, USA
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Teknologi Malaysia, Assoc. Prof. Dr. Chin Kim On, Universiti Malaysia Sabah, Dato’
Dr. Amirudin Abdul Wahab, Cybersecurity Malaysia, Ms. Puteri Anis Aneeza binti
Zakaria, Statworks Group, and Mr. Raja Segaran, MDEC, Malaysia. All the distin-
guished speakers shared various data science and emerging technologies perspectives
and projects which are beneficial for academics and industry practitioners.
We would like to thank Professor Emeritus Tan Sri Dato’ Sri. Ir. Dr. Sahol Hamid
Bin Abu Bakar, Vice-Chancellor of UNITAR International University for his great
leadership, advice, and support of local and international academic activities to foster
collaborations that lead to the exchange of knowledge and skills for research with
impactful outcomes for social and economic prosperity.
We also thank the Series Editor of the Springer Lecture Notes on Data Engineering
and Communications Technologies for the opportunity to organize this guest-edited
volume. We are grateful to Mr. Aninda Bose (Senior Publishing Editor, Springer
India Pvt. Ltd.) and Mr. Radhakrishnan Madhavamani for the excellent collaboration,
patience, and help during the preparation of this volume. We are confident that the
volume will provide insightful information to researchers, practitioners, and graduate
students in the areas of data science, artificial intelligence, and emerging technologies
which are important in this digital information era. Last but not least, we thank all the
DaSET 2023 committees for working tirelessly to ensure a successful conference.
The book presents selected papers from the Second International Conference on Data
Science and Emerging Technologies (DaSET 2023), held online at UNITAR Inter-
national University, Malaysia, from December 4–5, 2023. This book aims to present
current research and applications of data science and emerging technologies. The
deployment of data science and emerging technology contributes to the achievement
of the Sustainable Development Goals for social inclusion, environmental sustain-
ability, and economic prosperity. Emerging technologies such as artificial intelli-
gence and blockchain are useful for various domains such as marketing, health care,
finance, banking, environmental, and agriculture. Innovations in the field of artifi-
cial intelligence continue to shape the future of work across nearly every industry.
Data Science has a transformative effect on the economy, industry, and society.
An important grand challenge in data science is to determine how developments
in computational and social-behavioral sciences can be combined to improve well-
being, emergency response, sustainability, and civic engagement in a well-informed,
data-driven society. The topics of this book include, but are not limited to: artificial
intelligence, machine and deep learning, statistical learning, and health and industrial
applications.
Contents
Artificial Intelligence
A Comparative Study of Lemmatization Approaches for Rojak
Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Liu Jun Yoon, Xuan Yi Tan, Khai Yin Lim, Chi Wee Tan,
Ling Ern Cheng, and Jenny Tan
Multi-aspect Extraction in Indonesian Reviews Through
Multi-label Classification Using Pre-trained BERT Models . . . . . . . . . . . . 17
Nur Hayatin, Suraya Alias, Lai Po Hung, and Yuliana Setiowati
Artificial Intelligence (AI) Empowered Sign Language Recognition
Using Hybrid Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Ambar Saxena, Nailya Sultanova, Jamila Mustafina,
and Noor Lees Ismail
The Performance of GPT-3.5 in Summarizing Scientific and News
Articles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Sabkat Arshad, Muhammad Yaqoob, and Tahir Mehmood
Wound Stage Recognition Using YOLOv5 . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Clair Abela and Frankie Inguanez
Harvest Palm Tree Based on Detection Through 2D LiDAR Sensor
Using Power Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Luqman Hakim Bin Yusof, Abdulaziz Yahya Yahya Al-Nahari,
Danny Ngo Lung Yao, and Normaiza Mohamad
Enhancing Security Surveillance Through Business Intelligence
with NVIDIA DeepStream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Vishal Pednekar, Nidhi Shettigar, and Sayli Tawhare
Fuzzified Hybrid Metaheuristics for QoS-Aware Service
Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Hadi Naghavipour, Farhad Nadi, and Ali Aitizaz
Machine/Deep Learning
Fraudulent E-Commerce Website Detection Using Convolutional
Neural Network Based on Image Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Nurfazrina Mohd Zamry, Anazida Zainal, Eric Khoo,
Mohamad Nizam Kassim, and Zanariah Zainudin
A Generic Framework for Ransomware Prediction
and Classification with Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . 137
Saaman Nadeem, Tahir Mehmood, and Muhammad Yaqoob
Leveraging Gamification for Engaged Learning in Online Teaching
and Learning Experiences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Norshahriah Abdul Wahab, A’tifah Hanim Rosli,
Syarifah Bahiyah Rahayu Syed Mansoor,
Norazliana Akmal Jamaludin, and Siti Hajar Adam
Sentiment Analysis Using Large Language Models: A Case Study
of GPT-3.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Farhad Nadi, Hadi Naghavipour, Tahir Mehmood,
Alliesya Binti Azman, Jeetha A/P Nagantheran, Kezia Sim Kui Ting,
Nor Muhammad Ilman Bin Nor Adnan, Roshene A/P Sivarajan,
Suita A/P Veerah, and Romi Fadillah Rahmat
Telecom Customer Experience Analysis Using Sentiment Analysis
and Natural Language Processing—Comparative Study . . . . . . . . . . . . . . 169
Ahmed Mohamed Abdou Ahmed, Abdulaziz Al-Nahari,
Raghad Al-Shabandar, Chong Kim Loy, and A. H. Mohammed
Efficient Time Series of Smoothing and Auto-regressive Forecasting
Models for Predicting Police Officer Fatalities in the USA . . . . . . . . . . . . . 181
Danush Nagappan, Manoj Jayabalan, Ahmad Alanezi, Farhad Nadi,
and Thomas Coombs
Multimodal Emotion Recognition Using Attention-Based Model
with Language, Audio, and Video Modalities . . . . . . . . . . . . . . . . . . . . . . . . . 193
Disha Sharma, Manoj Jayabalan, Nailya Sultanova, Jamila Mustafina,
and Danny Ngo Lung Yao
Comparative Analysis of Emotion Recognition Using Large
Language Models and Conventional Machine Learning . . . . . . . . . . . . . . . 211
Mangu Soujanya Rao, Thomas Coombs, Normaiza Binti Mohamad,
Vinay Kumar, and Manoj Jayabalan
The Impact of Clustering-Based Sequential Multivariate Outliers
Detection in Handling Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
Mety Agustini, Kartika Fithriasari, and Dedy Dwi Prastyo
Statistical Learning
A Case Study via Bayesian Network: Investigating Factors
Influencing Student Academic Performance in Online Teaching
and Learning During COVID-19 Pandemic . . . . . . . . . . . . . . . . . . . . . . . . . . 303
Zheng Ning Looi, Poh Choo Song, Huai Tein Lim, and Sing Yan Looi
Harnessing the XGBoost Ensemble for Intelligent Prediction
and Identification of Factors with a High Impact on Air Quality:
A Case Study of Urban Areas in Jakarta Province, Indonesia . . . . . . . . . . 319
Wahyu Wibowo, Harun Al Azies, Susi A. Wilujeng,
and Shuzlina Abdul-Rahman
Modeling Earthquake Catalog in Sumatra by Space–Time
Epidemic-Type Aftershock Sequences Model: Combining
Davidon–Fletcher–Powell and Stochastic Declustering Algorithms . . . . . 335
Christopher Andreas, Achmad Choiruddin, and Dedy Dwi Prastyo
Small Area Estimation of Mean Years of Schooling Under Time
Series and Cross-sectional Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
Reny Ari Noviyanti, Setiawan, and Agnes Tuti Rumiati
Probabilistic Seismic Hazard Analysis for Sulawesi-Maluku
Region of Indonesia Using the Space–Time Epidemic-Type
Aftershock Sequence Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
Sonia Faradilla, Achmad Choiruddin, and Bambang Widjanarko Otok
About the Editors
Professor Yap Bee Wah is the director of the Research and Consultancy Center at
UNITAR International University Malaysia. She is the founding and general chair for
DaSET2022: International Conference on Data Science and Emerging Technologies
and the editor of its proceedings, published in the Lecture Notes on Data Engineering
and Communications Technologies series by Springer. She was the conference
chair of the International Conference on Soft Computing in Data Science (2015–
2019 and 2021) and an editor of the SCDS conference proceedings published in the
Springer CCIS series. She is also one of the editors of the book titled Supervised
and Unsupervised Learning for Data Science published by Springer Nature Switzer-
land AG 2020. She has actively published papers in ISI- and Scopus-indexed journals such as
Expert Systems with Applications, Journal of Statistical Computation and Simula-
tion, Communications in Statistics-Computation and Simulation, Journal of Clinical
and Translational Endocrinology, and Computers, Materials and Continua.
Professor Michael W. Berry is the co-author and an editor of sixteen books covering
topics in scientific computing, information retrieval, text/data mining, and data
science. He is the co-editor of the Soft Computing in Data Science volumes from
2015 to 2021 and Data Science and Emerging Technologies 2022 proceedings by
Springer. He is also the co-author of popular books published by Society for Industrial
and Applied Mathematics: Understanding Search Engines: Mathematical Modeling
and Text Retrieval, Second Edition, and Computational Information Retrieval. He has
published over 115 refereed journal and conference publications. He is a member of
SIAM, ACM, MAA, ASEE, and the IEEE Computer Society and is on the editorial
board of Foundations of Data Science (AIMS) and the SIAM Journal on Matrix
Analysis and Applications (SIAM). He is also a certified program evaluator for
the Computing Accreditation Commission (CAC) of the Accreditation Board for
Engineering and Technology, Inc. (ABET).
Artificial Intelligence
A Comparative Study of Lemmatization
Approaches for Rojak Language
Liu Jun Yoon, Xuan Yi Tan, Khai Yin Lim, Chi Wee Tan, Ling Ern Cheng,
and Jenny Tan
99.90% and 92.27% test set accuracy for Malay and English, respectively, whereas
the deep learning-based approach with POS tags produced the worst results of 79.78%
and 91.15%.
1 Introduction
A lemmatizer plays an important role in the preprocessing stage of most natural language
processing (NLP) applications such as information retrieval systems (IRS) and machine
translation systems (MTS). This is because most NLP applications require the extraction of
a valid and linguistically meaningful lemma from an inflectional word.
For example, in MTS, the first phase involves analyzing the input text
written in the source language for its grammatical correctness. Hence, a lemmatizer
is required to perform text normalization to break down inflectional words into roots
and affixes [1, 2]. By performing lemmatization, different inflected forms of a word
are grouped into a common root, which is the base form or dictionary form of a
word, known as the lemma, that is, a valid and linguistically meaningful root. For
example, if the words "play, plays, and played" are given to a lemmatizer, "play"
will be returned by the lemmatizer as the lemma [2]. Since lemmatization is widely
used in the preprocessing of most NLP applications, lemmatizers that are able to give
more accurate lemmas have to be studied.
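To make this concrete, the short example below uses NLTK's off-the-shelf WordNetLemmatizer for English. It is not the lemmatizer developed in this study, only a minimal illustration of how inflected forms map to a common lemma; it assumes the NLTK WordNet data has been downloaded.

# Minimal illustration of lemmatization using NLTK's English WordNetLemmatizer
# (an existing tool, not the Rojak-language lemmatizer proposed in this paper).
# Setup assumption: pip install nltk, then nltk.download("wordnet").
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("plays"))            # noun by default -> "play"
print(lemmatizer.lemmatize("played", pos="v"))  # tagged as a verb -> "play"
print(lemmatizer.lemmatize("play"))             # already a lemma  -> "play"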
Code-mixing is a widespread phenomenon in Asia that mixes two or more
languages. In Malaysia, English has been the focal mixing language practiced with
Malay, resulting in the creation of Rojak language [3, 4]. For example, Malaysians
tend to write “That is such a canggih gadget, I want one too”. Due to the rapid spread
of Rojak language, a lemmatizer capable of lemmatizing the Rojak language, namely
a combination of Malay and English, is needed [5]. Besides, all existing stemmers
and lemmatizers for Malay or Indonesian confuse roots with stems or lemmas. For
instance, although the Sastrawi stemmer is a stemmer, it returns roots instead of
stems, while Malaya also inherits this problem. This is because they do not consider
the parts-of-speech (POS) tags when performing stemming. Conversely, MorphInd,
which is currently the most widely used morphological analyzer for Indonesian, regards
roots as lemmas. No existing tools provide both stems and lemmas [6]. Therefore, an approach
capable of lemmatizing inflected Malay and English words, constituting the Rojak
language, into its respective lemmas is needed. This study encompasses three primary
objectives: (1) to introduce an approach capable of lemmatizing Rojak language,
which is able to handle English and Malay simultaneously, without requiring the
user to manually identify and separate the language; (2) to compare the performance
of rule-based, corpus-based, machine learning, and deep learning-based lemmatiza-
tion approaches; and (3) to evaluate the impact of POS tags on the performance of
machine learning and deep learning-based lemmatizers.
2 Literature Review
Recently, a lemmatizer that uses a novel dictionary lookup approach has been
proposed for the Urdu language [7]. In the approach, the lemma returned depends on
the corpus used to train the model. The lemma of a given word was only returned if
both the word and its corresponding POS tag were present in the corpus. If the word
and POS tag were not found, the lemmatizer would return the word as it was passed
in, without any changes. Any differences caused by spelling mistakes or different
blank spaces from the user input words would result in a no match in the corpus. The
proposed lemmatizer was evaluated with and without POS tagging. The lemmatizer
obtained the highest accuracy of 76.44% when words without POS tags were lemmatized.
On the other hand, words that were passed in together with POS tags obtained a lower
accuracy of 66.79%.
A lemmatizer for Icelandic known as Nefnir was proposed in 2019 [1]. Nefnir was
developed from rules derived from a morphological database known as Database of
Modern Icelandic Inflection (DMII). Suffix substitution rules were derived from the
database and used to lemmatize tagged text. New rules were generated to minimize
the number of errors in the training set until no further reduction in the error count.
In Nefnir, it was assumed all word forms are identical to their lemma. A list of rules
was generated for all the errors. The rule that minimizes the number of remaining
errors was selected and applied to the training set until the number of errors no
longer decreased. Rules were generated only if they correctly lemmatized at least
two examples in the training set. The evaluation
of Nefnir was performed to determine the accuracy of Nefnir in lemmatizing words
with correct POS tags and words that are automatically tagged with POS tagger
which was IceTagger. The accuracy achieved by Nefnir with correct POS tags was
99.55%, while for words that are tagged automatically with IceTagger, the accuracy
was 96.88%. It was shown that Nefnir accuracy dropped when lemmatization was
performed on words that were automatically tagged by POS tagger.
Another rule-based lemmatizer that uses the longest-affix-match approach was
proposed for Kannada inflectional words [2]. In the approach, the input word that
contains prefix or suffix would be applied with a set of linguistic rules to get the
appropriate lemma. Prefixes and suffixes for Kannada inflected nouns and finite verbs
were collected manually from Kannada grammar textbooks [8]. A root dictionary was
created from Kannada dictionary “Kannada Rathnakosha”. The proposed lemmatizer
searched for a lemma in the root dictionary, and if the lemma was not found, the
lemmatizer would append the obtained lemma to the root dictionary. This further
improved the performance of the lemmatizer. The proposed lemmatizer was tested
on four datasets with lemmatization performed on official circulars that achieved
an accuracy of 85.72%, newspaper 95.80%, legal documents 97.08%, and All India
Radio news 95.39%. The proposed lemmatizer achieved above 85% accuracy on all
four datasets.
In 2020, lemmatization of the Russian language based on machine learning algo-
rithms was proposed [8]. Vectorized word forms obtained from open dictionaries
were fitted into various machine learning regression models, which were decision
tree, random forest, extra tree, and bagging. Decision tree produced the highest accu-
racy on the lemmatization of real-world corpora, ABBYY corpus and Open-Corpora
corpus, with the accuracy of 75.61% and 70.88%, respectively.
A deep learning sequence-to-sequence approach was proposed in 2021 to perform
automatic Romanian lemmatization [9]. The encoder and the decoder in the
sequence-to-sequence model for lemmatization of Romanian words contain a single
long short-term memory (LSTM) layer. The encoder and the decoder were enriched
with one or two additional LSTM layers to improve the system’s accuracy. When the
deep learning models were evaluated on Romanian Explicative Dictionary (DEX)
dataset, one layer LSTM-based architecture achieved the highest accuracy at both
word and character levels with an accuracy of 95.93% and 97.29%, respectively.
When POS information was included, the system’s accuracy increased by 3.39% at
word level and by 2.14% at character level resulting in an accuracy of 99.32% and
99.43%, respectively. The model’s accuracy improved when POS information was
included.
3 Proposed Methodology
This section describes the proposed framework and the workflow for developing
a lemmatizer for Rojak language texts that mix English and Malay. Figure 1
shows the overall framework of the general process design.
Two corpora, one for the English language and another for the Malay language,
were utilized in constructing the lemmatizer designed for the Rojak language. In
light of the findings [10] indicating a similarity of over 90% between the Malay
and Indonesian lexicons, an Indonesian corpus was used due to its relatively greater
availability of resources as compared to Malay language. In this study, the English
Web Treebank (EWT) corpus and Indonesian GSD corpus were employed [11–13].
Both the corpora were split into train, development, and test sets in the CoNLL-U
file format. The train and development sets were combined to form the train set. Each
set consists of sentences made up of words, where various information is provided
for each word such as ID, FORM, LEMMA, UPOS, and XPOS. For building the
lemmatizers, only the FORM, LEMMA, and UPOS which are universal POS tags
Fig. 1 Overall architecture of the general process design consisting of a the development of
lemmatization and b the Rojak language lemmatization
are used. UPOS can be used to tag parts-of-speech for all languages, and there are 17
UPOS tags in the UD framework [14]. The number of sentences and words contained
in each dataset is as shown in Table 1.
Lowercasing was performed in this stage to ensure all words can be mapped or
lemmatized despite the different text casings in the corpus.
Lemmatizers for the English and Malay languages were individually constructed
using distinct methods. Four different lemmatizers, consisting of rule-based, corpus-
based, machine learning-based, and deep learning-based were developed. The
training or development process for each method was replicated to create two
lemmatizers of the same method for the distinct languages, namely English and
Malay.
3.5 Rule-Based
Lowercasing was performed in rule-based lemmatizer on both the train and test sets.
The rule-based lemmatizer utilizes information regarding word forms, lemmas, and
POS tags to generate rules, as demonstrated in Table 2 for English and Table 3 for
the Malay language, enabling it to carry out lemmatization. During the lemmatizer
building stage, the initial step involves constructing a corpus that encompasses all
word forms and lemmas. This corpus serves as a reference for the lemmatizer to
determine whether a generated lemma represents a meaningful or dictionary-form of
a word. Subsequently, rules for lemmatizing English and Malay words were formu-
lated based on linguistic knowledge of the languages, observations from the corpora,
and insights garnered from prior research. The overall workflow for the rule-based
lemmatizer is outlined in Algorithm 1.
(Algorithm 1, final step: otherwise, the received word is returned as the lemma.)
The rules for lemmatizing English words were derived from linguistic knowledge
and observing the word-lemma pairs in the training set. As there are many irregular
verbs and plural forms in English, some irregular words found from the training set
are explicitly added to the list of rules for lemmatizing English words to increase
coverage and improve accuracy. As for lemmatizing Malay words, the rules are
mainly derived from the findings of previous studies supplemented by linguistic
knowledge [17].
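The rule-based idea can be sketched as follows. The suffix rules and the tiny lexicon below are illustrative assumptions, not the actual rules of Tables 2 and 3 or the full Algorithm 1; the sketch only shows the general pattern of stripping an affix, validating the candidate against a corpus of known word forms, and falling back to the input word otherwise.

# Illustrative rule-based lemmatizer: strip a known suffix and accept the candidate
# only if it appears in the reference corpus. SUFFIX_RULES and the lexicon are
# made-up examples for this sketch, not the rule set used in the paper.
SUFFIX_RULES = {
    "NOUN": ["es", "s"],           # example English plural endings
    "VERB": ["ed", "ing", "s"],    # example English verbal endings
    "MALAY": ["kan", "an", "i"],   # example Malay suffixes
}

lexicon = {"play", "makan", "ajar"}          # known word forms / lemmas

def rule_based_lemma(word: str, pos: str) -> str:
    word = word.lower()                      # lowercasing, as described in Sect. 3.5
    if word in lexicon:
        return word
    for suffix in SUFFIX_RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in lexicon:         # keep only meaningful lemmas
                return candidate
    return word                              # fallback: return the word unchanged

print(rule_based_lemma("plays", "NOUN"))     # -> play
print(rule_based_lemma("ajaran", "MALAY"))   # -> ajar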
3.6 Corpus-Based
As for the corpus-based lemmatizer, the training and development datasets from the
UD Treebanks for the two languages were used as training data. The test set was used
for evaluating the corpus-based lemmatizer to determine its accuracy on unseen data.
The idea behind this approach is to build a corpus
containing all the possible lemmas for a word according to the different POS tags,
for each language. After the corpus has been built, it can then be used to look for a
matching word form and POS tag to return its corresponding lemma, given an input
word and POS tag.
Therefore, to build the corpus, lowercasing was first performed on the UD Tree-
banks data. The words and lemma from the training set were then used to build a
dictionary corpus for the lemmatizer. Words in the training set that are not in the
dictionary were added to the corpus together with the POS tag. At the same time,
the POS tag of the word was checked if it exists in the corpus. In cases where the
POS tag was not found in the corpus, it was appended to the corpus along with its
lemma. Hence, to use the lemmatizer, it takes a word and its POS tag as input and
then checks the corpus to determine whether the given word and POS tag exist in the
corpus. If they exist, the corresponding lemma is returned; otherwise, the received
word is returned.
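A minimal sketch of this lookup is given below. The CoNLL-U parsing is simplified to column positions only and the file name is a placeholder; the essential point is the (word form, UPOS) to lemma dictionary and the fallback to the input word.

# Illustrative corpus-based lemmatizer built from a CoNLL-U training file.
# "train.conllu" is a placeholder path; FORM, LEMMA and UPOS are columns 2, 3 and 4.
def build_lookup(conllu_path: str) -> dict:
    lookup = {}
    with open(conllu_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip() or line.startswith("#"):
                continue                                  # skip blanks and comments
            cols = line.rstrip("\n").split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue                                  # skip multiword/empty tokens
            form, lemma, upos = cols[1].lower(), cols[2].lower(), cols[3]
            lookup.setdefault((form, upos), lemma)
    return lookup

def corpus_lemma(word: str, upos: str, lookup: dict) -> str:
    # Return the stored lemma for a seen (word, POS) pair, otherwise the word itself.
    return lookup.get((word.lower(), upos), word.lower())

# Example usage (assuming the placeholder training file exists):
# lookup = build_lookup("train.conllu")
# print(corpus_lemma("played", "VERB", lookup))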
For the machine learning-based lemmatizer, the train and development sets were combined
to form a train set for training the lemmatizer. Feature extraction was then performed
on the words and lemmas through fastText word embeddings. In this study, decision
tree regression model (DTR) was selected as it managed to yield the highest accuracy
during the lemmatization process on the real-world corpora for the Russian language
[8]. Thus, DTR was expected to be able to produce outstanding results in Rojak
language lemmatizer. The DTR takes the word embedding of the inflected word as
the input and learns to predict its corresponding output, which is the word embedding
of the lemma of the input word. Depending on the type of approach, it can also take
the encoded POS tag of the inflected word as input, allowing the DTR to learn from
both the word and POS tag.
The DTR is a model based on the decision tree (DT) algorithm that learns using a
tree structure that contains a root node, decision nodes, edges, and leaf nodes. It starts
from the root node by selecting an attribute and splitting values as its starting point
and continuously repeats this process to generate decision nodes, until it reaches its
leaf nodes. The leaf nodes contain the values of the final prediction generated by
the DT. In this case, the DTR uses the DT algorithm to perform lemmatization that
is posed as a regression problem, whereby the predicted lemmas are represented by
their feature vectors, which are sequences of continuous values. Using the generated
tree, a path leading to the leaf nodes can be found to obtain the predicted values, which
can also be expressed in rule form. Hence, the DTR is able to perform predictions
and generate results for performing lemmatization [18].
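A rough sketch of this setup is shown below. The toy embedding table, the 100-dimensional vectors, and the nearest-neighbour decoding of the predicted vector back to a lemma string are assumptions made for illustration; the paper only states that fastText embeddings of the inflected word (optionally with an encoded POS tag) are mapped to the embedding of its lemma.

# Illustrative machine learning-based lemmatizer: a DecisionTreeRegressor maps the
# embedding of an inflected word to the embedding of its lemma; the predicted vector
# is then decoded to the nearest known lemma. The embeddings here are random
# stand-ins for fastText vectors and exist only to make the sketch runnable.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
vocab = ["play", "played", "plays", "makan", "memakan"]
emb = {w: rng.normal(size=100) for w in vocab}       # stand-in for fastText vectors

def embed(word: str) -> np.ndarray:
    return emb.get(word, rng.normal(size=100))       # unseen words get a random vector

pairs = [("played", "play"), ("plays", "play"), ("memakan", "makan")]
X = np.stack([embed(w) for w, _ in pairs])           # inputs: inflected-word vectors
y = np.stack([embed(l) for _, l in pairs])           # targets: lemma vectors

dtr = DecisionTreeRegressor(random_state=0).fit(X, y)

def ml_lemma(word: str) -> str:
    pred = dtr.predict(embed(word).reshape(1, -1))[0]
    lemmas = ["play", "makan"]
    dists = [np.linalg.norm(pred - emb[l]) for l in lemmas]   # nearest-neighbour decode
    return lemmas[int(np.argmin(dists))]

print(ml_lemma("plays"))                             # -> play (on this toy data)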
3.9 Evaluation
Both the English and Malay lemmatizers built on the different methods were eval-
uated using the accuracy (Eq. 1). To perform the evaluation, the test sets of the two
corpora were used.
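Equation 1 is not reproduced in this extract; the standard definition assumed for it is

\text{Accuracy} = \frac{\text{number of words lemmatized correctly}}{\text{total number of words in the test set}} \times 100\%.  (1)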
Table 5 Train and test set accuracy using the (a) rule-based, (b) corpus-based, (c) machine learning-based (without POS tags), and (d) deep learning-based (without POS tags) lemmatizers

            (a) Rule-based        (b) Corpus-based      (c) Machine learning    (d) Deep learning
            Train (%)  Test (%)   Train (%)  Test (%)   Train (%)  Test (%)     Train (%)  Test (%)
Malay       91.22      90.61      99.94      99.90      99.08      89.64        86.81      85.22
English     94.03      94.45      99.79      92.27      99.44      93.26        91.31      91.15
Average     92.625     92.53      99.865     96.085     99.26      91.45        89.06      88.19
rules were formulated based on the train set. Therefore, a sufficiently large and high-
quality train set ensures that the generated rules encompass a wide range of word
forms, ultimately leading to a considerably good outcome.
As for the machine learning-based method without POS tags, a relatively large
difference between the train and test accuracies is observed. The difference is more
noticeable in Malay, around 10%, and somewhat less in English, about 6%. This
suggests a slight overfitting on the training data, which fails to generalize well with
new unseen data.
Lastly, the deep learning-based lemmatizer without POS tags, namely the LSTM lemmatizer,
was found to have the lowest accuracy for both languages. The lower performance
of the LSTM lemmatizer may be due to the considerably low number of training data
as deep learning models often require a substantial amount of training data. The model
might not be able to fully capture the nuances of the language when the training data
are limited. Besides that, the lack of an extensive hyperparameter search on the
LSTM may be a reason for the rather poor performance.
Additionally, both the machine learning and deep learning-based approaches were
experimented with the inclusion of POS tags. This allows a comparative analysis
between the methods with and without the POS tags, which could then facilitate an
assessment of whether POS tags would help in the learning and prediction processes.
Upon observing Tables 6 and 7, it can be concluded that the inclusion of POS
tags not only fails to enhance but in fact deteriorates the lemmatization outcomes
in certain cases. This finding is consistent with [1] but contradicts
with [9]. The decline in accuracy could potentially be due to the model’s inability to
effectively utilize and learn from the meaningful information from the POS tags.
In this context, the POS tags might introduce noises into the model, causing it to
be unable to learn from the data properly which in turn leads to the accuracy drop.
Another possible reason is that there may be inconsistent lemmas for the same word
form and POS tag in the train data. Learning from such data can cause confusion for
the model during prediction as there is no exact lemma that the model is able to map
the word to, thereby affecting the overall accuracy of the model.
Lastly, the reduction in the accuracy of the model with POS tags might also be
attributed to different words with different POS tagging sharing the same lemma.
This makes it harder for the model to extract information and patterns from the data,
which therefore results in a reduction in accuracy.
Table 6 Train and test set accuracy of the Malay and English machine learning-based lemmatizers

                      Train set accuracy (%)   Test set accuracy (%)
Without POS tag
  Malay               99.08                    89.64
  English             99.44                    93.26
With POS tag
  Malay               97.34                    87.20
  English             97.80                    91.94
Table 7 Train and test set accuracy of the Malay and English deep learning-based lemmatizers

                      Train set accuracy (%)   Test set accuracy (%)
Without POS tag
  Malay               86.81                    85.22
  English             91.31                    91.15
With POS tag
  Malay               80.53                    79.78
  English             90.27                    91.15
5 Conclusion
In this study, various lemmatization methods have been employed to compare the
performance of the lemmatizers on the UD Treebanks test sets of the English and
Malay languages. To verify the effectiveness of the proposed lemmatizers, the rule-
based, corpus-based, machine learning-based, and deep learning-based lemmatiza-
tion methods were conducted. Among all the lemmatization methods, the corpus-
based lemmatizer was found to outperform the others. The deep learning-based
method, on the other hand, performed the worst. A more detailed analysis as to
the fine-tuning of the model will be deferred to future investigation. Additionally,
the effect of POS tags on the performance of the lemmatizer has also been compared
for the machine learning and deep learning-based lemmatizers. Results have shown
that POS tags do not contribute any value to the lemmatizers for both methods
in the English and Malay languages.
References
Multi-aspect Extraction in Indonesian Reviews Through Multi-label Classification Using Pre-trained BERT Models
Nur Hayatin, Suraya Alias, Lai Po Hung, and Yuliana Setiowati
N. Hayatin
Informatics Department, University of Muhammadiyah Malang, Malang, Indonesia
N. Hayatin · S. Alias (B) · L. P. Hung
Computing and Informatics Faculty, University Malaysia Sabah, Kota Kinabalu, Malaysia
e-mail: [email protected]
Y. Setiowati
Informatics and Computing Department, Politeknik Elektronika Negeri Surabaya, Surabaya,
Indonesia
1 Introduction
Aspects are attributes or components of the product or service [1]. Aspects can be
written directly in the form of a word or phrase in a sentence explicitly. Aspect
extraction is a task for automatically identifying and categorizing specific aspects
of a product or service mentioned in the text. Sentiment analysis needs an aspect
level to obtain a more fine-grained opinion analysis [2]. This idea leads to Aspect-
Based Sentiment Analysis (ABSA), a primary task to extract and summarize people’s
opinions and also the aspects contained therein.
Previous researchers have extracted aspects well [3, 4]; however, they only extract
a single aspect. In general, one comment has more than two sentences and can contain
multiple aspects. For instance, the sentence "strategic location, and the room is very
comfortable" contains two aspects, "location" and "room". If such a sentence is assigned
only one relevant aspect, the information in the user's opinion cannot be captured
completely. This is a challenge that must be overcome, even though the computations
and annotations for supervised learning are more complex than when extracting only
one aspect.
Recent advancements in deep learning techniques such as transformer-based
models have shown promising results in aspect extraction tasks. These models can
capture complex relationships between words and sentences, making them suitable
for extracting fine-grained aspects [5, 6]. BERT is a model pre-trained on a large
corpus of data in a self-supervised manner that operates using the bidirectional
encoder mechanism of the transformer [7]. Several studies have successfully
utilized the BERT architecture for the multi-label classification approach
in sentiment analysis tasks [8, 9]. Inspired by these findings, our research aims to
extract multiple aspects from Indonesian reviews based on multi-label classification
using a pre-trained BERT model to better capture the context of sentences.
Our work differs from previous works. In this research, we handle multi-aspect
extraction from Indonesian hotel reviews, where BERT's architecture was chosen for
multi-label classification because its bidirectional encoder architecture better
captures the context of sentences. Other than that, we carry out tests with various
pre-trained models to improve classifier performance. To the best of our knowledge,
we are the first study to implement Indonesian pre-trained BERT models for multi-aspect
extraction. We implemented various pre-trained BERT models, including the recent
Indonesian pre-trained models IndoBERT [10] and IndoBERTWiki [11]. We also implemented
IndoBERTweet [12], the first large-scale pre-trained model built from Indonesian tweets.
2 Related Works
In general, the approach of aspect categorization is divided into two main groups:
supervised and unsupervised approaches [13]. Both have strengths as well as
shortcomings. A previous study implemented the multi-label text classification
method to detect dangerous speech in Twitter texts based on aspects [14]. Their
method showed good performance by implementing a Naive Bayes (NB) classifier;
however, the study had to handle imbalanced data issues within each aspect and
extend the vocabulary. Other methods used for extracting aspects
in Indonesian reviews data are Conditional Random Field (CRF) [15, 16], Latent
Dirichlet Allocation (LDA) [17], word vector similarity [18], and rule-based [4,
19]. These techniques are relatively inexpensive because they do not require data
labeling. However, they have a weakness when it comes to collecting the many pattern
combinations and various words used to express opinions. In addition, the majority of
them need the initial seed words for each aspect to be determined manually for
classifying reviews.
In the supervised approach, the aspect extraction task involves labeled data,
where researchers have worked with various machine learning techniques such as
Support Vector Machine (SVM), Naïve Bayes (NB), and K-Nearest Neighbor
(KNN) [20]. Deep learning techniques, which have shown good results, are also
popularly used in aspect categorization, such as the Recurrent Neural Network (RNN)
for handling aspect detection [5]. Another method is the Gated Recurrent Unit (GRU),
a sequential deep learning model, which was used for aspect spread detection
[21]. A recent study developed semi-supervised deep learning by implementing a Graph
Convolutional Network (GCN) for aspect classification to improve ABSA [22].
In terms of multi-aspect extraction, a previous study employed multi-label
classification using deep learning approaches focused on multi-label aspect
classification of Indonesian hotel reviews. The study used a combination of a
Convolutional Neural Network (CNN) and Extreme Gradient Boosting (XGB) [6], although
the model is quite sensitive to misspellings. Other models proposed by [23] and [24]
extracted explicit aspects from Indonesian reviews using the Bidirectional Encoder
Representations from Transformers (BERT) model. The results showed good performance
using BERT-multilingual as the pre-trained model. However, many basic words cannot
be captured, because the BERT-multilingual model contains text from various languages
and is not specific to Indonesian text. Therefore, a pre-trained model fitted to
Indonesian text is needed. Table 1 shows several studies related to our work.
3 Methodology
Fig. 1 Architecture of multi-aspect extraction through multi-label classification using BERT pre-
trained model
API based on BERT pre-trained model, which automatically selects the appropriate
tokenizer based on the checkpoint available on the hub.
The next process is encoding which refers to the process of converting text data
into numerical representations that can be understood and processed by the model.
BERT utilizes the encoder element of the transformer. Each token produced by
the previous process is then converted into a WordPiece embedding vector by
BERT. Furthermore, the embeddings are then passed through the pre-trained BERT
model which is used for classification. The example of converting the result of text
tokens to text embeddings can be seen in Fig. 2.
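A minimal sketch of this tokenization and encoding step with the Hugging Face transformers auto classes is given below. The checkpoint name, the maximum sequence length of 128, and the multi-label classification head with eight aspect labels are assumptions made for illustration, not the authors' exact configuration.

# Illustrative tokenization/encoding for multi-label aspect classification with an
# Indonesian BERT checkpoint. Checkpoint name, max_length and the multi-label head
# configuration are assumptions for this sketch.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "indobenchmark/indobert-base-p2"        # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=8,                                    # the eight hotel aspects
    problem_type="multi_label_classification",       # sigmoid outputs + BCE loss
)

review = "lokasi strategis, dan kamarnya sangat nyaman"
encoded = tokenizer(review, truncation=True, padding="max_length",
                    max_length=128, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoded).logits                 # shape (1, 8): one score per aspect
print(logits.shape)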
data consists of raw texts only, without human labeling. BERT models help machines
understand and interpret the meaning of the text. It uses immediately preceding text
to understand the context. It also checks the relationships of words within a sentence
to give the actual meaning of words. BERT ensures words with the same meaning
will have a similar representation.
In this study, we implemented 10 different pre-trained BERT models to get the
best classifier. To the best of our knowledge, we are the first study to implement
Indonesian pre-trained BERT models for the multi-aspect extraction task on Indonesian
reviews. All pre-trained BERT models used in this research are uncased models. We
describe each pre-trained BERT model used in the experiment as follows:
BERT-Base [7] is the smaller version of the BERT model. It comprises 12 trans-
former layers, with 12 self-attention heads in each layer. The training data used for
BERT-base is from the English language.
BERT-Large [7] is the larger version of the BERT model. It consists of 24 trans-
former layers, each with 16 self-attention heads. The total number of parameters in
BERT-large is significantly higher compared to BERT-base. The training data used
for BERT-large is from the English language.
BERT-Multilingual [7] is a variant of the BERT model that is designed to handle
multiple languages. It has the same underlying architecture as other BERT models,
consisting of transformer layers with self-attention mechanisms. The training data
used for BERT multilingual includes text from various languages, enabling it to
capture the nuances and contextual information specific to each language.
IndoBERT [10] is a recent Indonesian pre-trained BERT model from IndoNLU
which was trained with around 4-billion-word corpus (Indo4B), more than 20 GB
of text data. There are several versions of Indobenchmark-BERT; in this study, we use
two: Indobenchmark-BERT-base and Indobenchmark-BERT-large, both in the p2 version.
IndoConvBERT1 is a variant of the Indonesian-BERT model that is pre-trained
on Indo4B, an Indonesian corpus which consists of around 4B words, and a small
Twitter dump. This model follows a different pre-training procedure and sequence length.
IndoBERTWiki [11] is a recent Indonesian pre-trained BERT model adapted from
BERT model which was trained by combining more than 220 million words from
three primary sources: Indonesian Wikipedia (74 million words), news articles (522
million words), and an Indonesian Web Corpus (90 million words).
1 https://huggingface.co/Wikidepia/IndoConvBERT-base.
4.1 Dataset
The dataset contains user reviews gathered from the TripAdvisor travel website.
We selected comments for Indonesian hotel reviews only. More than 5000 comments were
scraped; however, this research labeled only about a thousand of them manually due to
limited resources. The total data used for the experiment is 1085 rows, which are
split into 80% for training (868 rows) and 20% for validation (217 rows).
After collecting the data, we listed aspects related to the hotel domain. These
aspects later become class labels in the classification process. There are several
references that use different aspect lists for the hotel domain. We decided to combine
aspect lists in the hotel domain from several references, i.e., SPACE [3], HoASA
(Indonesian ABSA dataset) [10], and also from some popular travel web such as
Tripadvisor, Booking.com, and Agoda. Finally, we used eight aspects, i.e., ‘room’,
‘location’, ‘cleanliness’, ‘building’, ‘food’, ‘service’, ‘value’, and ‘facilities’. The
description for each aspect is presented in Table 2.
Based on statistical analysis, one comment has more than two sentences and
can contain more than one aspect, which we call a multi-aspect review. The data
contain a total of 4720 sentences, and the corpus contains 58,012 words. The average
comment length is 4 sentences. In terms of word length, the average length per comment
is 53–54 words, while the average length per sentence is 16–17 words. Of the 1085
sample comments, the majority, around 89%, contain multiple aspects. An example of
sentences containing multiple aspects is presented in Fig. 3.
2 https://huggingface.co/cahya/BERT-base-indonesian-522M.
3 https://huggingface.co/cahya/BERT-base-indonesian-1.5G.
provide good results for aspect extraction in Indonesian hotel reviews, with F1-scores
of 0.7388 and 0.7104, respectively. We also implemented the pre-trained BERT-multilingual
model and obtained an increase in the F1-score up to 0.8046. The pre-trained
BERT-multilingual model gives higher results than the two earlier models; this is
because the pre-trained BERT-multilingual model is trained to handle multiple languages,
whereas the two models mentioned are specifically trained for the English language,
making them less suitable for an Indonesian dataset.
Moreover, we used various Indonesian pre-trained BERT models in the experiment.
Based on the experimental results presented in Table 3, the pre-trained
Indonesian-BERT-1.5G and IndoBERT-large models demonstrate the highest superiority,
with an impressive F1-score of up to 0.84, the Indonesian-BERT-1.5G score being 0.0008
higher than that of IndoBERT-large. However, the pre-trained Indonesian-BERT-1.5G model
shows the highest accuracy and ROC, of 44.24% and 0.8734, respectively, compared to
the other techniques, with the lowest loss value of 0.3029.
On the other hand, the pre-trained IndoBERT-base and Indonesian-BERT-522M models
exhibit comparable performance, with an F1-score of 0.83. The pre-trained
IndoBERTWiki-base model achieved an F1-score of 0.80, similar to the BERT-multilingual
model but higher than IndoBERTweet, which only obtained an F1-score of 0.7936. Notably,
the pre-trained IndoConvBERT model exhibits the lowest performance, yielding an
F1-score of 0.6941.
For the second experiment (DoE #2), we employ several machine learning tech-
niques for extracting aspects and then compare their results with BERT. The machine
learning techniques implemented are Decision Tree (DT), Naïve
Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), Random
Forest (RF), Extreme Gradient Boosting (XGB), and Multilayers Perceptron (MLP).
Besides that, we also used Long Short-Term Memory (LSTM) for the deep learning
approach.
Table 4 Techniques performance comparison

                    Technique   F1     Acc (%)
Machine learning    DT          0.77   30.41
                    NB          0.69    7.37
                    LR          0.74   22.58
                    SVM         0.77   27.19
                    RF          0.79   27.65
                    XGB         0.84   41.47
                    MLP         0.69   18.43
Deep learning       LSTM        0.67   38.71
                    BERT        0.84   44.24
Table 4 presents the F1-score and accuracy for each technique. The results show
that XGB produced the highest performance among the machine learning techniques,
with an F1-score of 0.84 and an accuracy of 41.47%. Meanwhile, NB showed the lowest
accuracy of 7.37% and the lowest F1-score of 0.69, similar to MLP. In the deep learning
approach, LSTM produced an F1-score of 0.67 and an accuracy of 38.71%, a considerably
lower performance than BERT's. Overall, in this case, the only machine learning
technique whose performance is comparable with BERT is XGB. However, in terms of
accuracy, BERT is higher than XGB, at 44.24% and 41.47%, respectively. This indicates
that BERT is more powerful in overcoming imbalanced data than XGB.
Table 5 shows the example of aspect prediction results for both single-and multi-
aspect from the original hotel reviews. Single-aspect review is represented in Review
ID 2209, while the example of multi-aspect review is presented in Review ID 2572.
The embed column contains the embedding results of BERT, which are represented as
sequential numbers. Meanwhile, the output represents the binary result of classification,
i.e., the aspect class prediction. The value "1.0" represents a predicted aspect, while
"0.0" indicates the aspect is not predicted. Furthermore, the aspect predictions are
converted from binary values to aspect labels.
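This binary-to-label conversion can be sketched as follows; the 0.5 threshold on the sigmoid outputs is an assumption, since the text only states that a value of "1.0" marks a predicted aspect.

# Illustrative conversion of multi-label model outputs to aspect labels.
# The 0.5 cut-off on the sigmoid probabilities is an assumed threshold.
import torch

ASPECTS = ["room", "location", "cleanliness", "building",
           "food", "service", "value", "facilities"]

def decode_aspects(logits: torch.Tensor, threshold: float = 0.5) -> list:
    probs = torch.sigmoid(logits)                    # independent probability per aspect
    binary = (probs >= threshold).int().squeeze(0).tolist()
    return [aspect for aspect, flag in zip(ASPECTS, binary) if flag == 1]

# Example with made-up logits for one review:
logits = torch.tensor([[2.1, 1.3, -0.7, -1.2, -0.4, 0.2, -2.0, -1.5]])
print(decode_aspects(logits))                        # -> ['room', 'location', 'service']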
Based on the experimental results, we found that all the pre-trained BERT models can
provide good results for the extraction of multiple aspects from Indonesian hotel reviews.
In the same case, BERT is also superior to the machine learning techniques and to another
deep learning technique: in this research, we compared BERT with LSTM, and LSTM gives
lower performance than both BERT and the other ML techniques. This shows that deep
learning does not always provide higher performance than ML techniques. It is also
possible that extra effort is needed to fine-tune the parameters so that the model
fits the dataset.
Overall, to handle multi-aspect extraction on Indonesian hotel reviews, the best
performance is produced when using a pre-trained Indonesian-BERT-1.5G model
with an F1-score of 0.8421 and an accuracy of 44.24%. This mechanism makes
use of contextual information captured by BERT to carry out classification tasks
effectively. Pre-training in large-scale corpora is capable of helping BERT learn
5 Conclusion
References
20. Manik LP et al (2020) Aspect-based sentiment analysis on candidate character traits in Indone-
sian presidential election. In: 2020 International conference on radar, antenna, microwave,
electronics, and telecommunications (ICRAMET), pp 224–228
21. Ismet HT, Mustaqim T, Purwitasari D (2022) Aspect based sentiment analysis of product review
using memory network. Sci J Inf 9(1):73–83
22. Chamid AA (2023) Graph-based semi-supervised deep learning for Indonesian aspect-based
sentiment analysis
23. Yanuar MR, Shiramatsu S (2020) Aspect extraction for tourist spot review in Indonesian
language using BERT. In: 2020 International conference on artificial intelligence in information
and communication (ICAIIC 2020), pp 298–302
24. Azhar AN (2020) Fine-tuning pretrained multilingual BERT model for Indonesian aspect-based
sentiment analysis
25. Vaswani A et al (2017) Attention is all you need. Adv Neural Inf Process Syst 2017:5999–6009
26. Ruskanda FZ, Widyantoro DH, Purwarianti A (2018) Comparative study on language rule
based methods for aspect extraction in sentiment analysis. In: 2018 International conference
on Asian language processing (IALP 2018), Bandung, Indonesia, November 15–17, 2018, pp
56–61
Artificial Intelligence (AI) Empowered
Sign Language Recognition Using
Hybrid Neural Networks
Ambar Saxena, Nailya Sultanova, Jamila Mustafina, and Noor Lees Ismail
A. Saxena
Liverpool John Moores University, Liverpool, UK
e-mail: [email protected]
N. Sultanova · J. Mustafina (B)
Kazan Federal University, Kazan, Russia
e-mail: [email protected]
N. Sultanova
e-mail: [email protected]
N. L. Ismail
UNITAR International University, Petaling Jaya, Malaysia
e-mail: [email protected]
1 Introduction
There are over 300 sign languages used by 70 million deaf people worldwide,
according to the World Federation of the Deaf (Dr. Joseph J. Murray, 2018).
However, most communication technologies have been designed for spoken or
written language, creating barriers for sign language users in society. While tools such as Imo and
WhatsApp have become ubiquitous in daily life, they are not always accessible or
effective for deaf people. This can create difficulties in communication between the
deaf community and the hearing majority. The development of sign language recog-
nition technology has the potential to bridge this communication gap and facilitate
more seamless and inclusive communication for the deaf community.
Artificial Intelligence (AI) has made significant advancements in the field of
computer vision in recent years. Computer vision is the ability of machines to inter-
pret and understand images and video. With the help of AI techniques such as deep
learning, computer vision algorithms can now recognize objects, faces, and other
important features in images and videos with remarkable accuracy.
Deep learning techniques such as convolutional neural networks (CNNs) have
been particularly effective in computer vision. CNNs can automatically learn and
identify features within images, making them highly effective in object recognition
and classification. Other AI techniques such as recurrent neural networks (RNNs)
and generative adversarial networks (GANs) have also been applied in computer
vision with promising results.
One of the key challenges in developing sign language recognition systems is
the variability of sign language across different regions and cultures, as well as
the variability in the way that different individuals use sign language. In addition
to the technical challenges of sign language recognition, there are also practical
considerations, such as the limited availability of training data, especially for less
commonly used sign languages or dialects.
To address this variability, researchers have developed techniques for adapting
hybrid neural networks to the specific characteristics of individual signers, in order
to improve the accuracy of the model’s translations.
By combining the strengths of convolutional neural networks (CNNs) and recur-
rent neural networks (RNNs), researchers have been able to develop hybrid models
that can accurately recognize sign language gestures and translate them into spoken
or written language.
The primary purpose of this research is to examine hybrid neural networks while
training various CNN and ensemble networks to increase the accuracy and precision
of sign language recognition. The aim of this study is to use deep learning networks
in artificial intelligence to recognize sign language gestures automatically.
2 Background
The majority of studies have concentrated on detecting static sign language gestures
in pictures or video clips acquired in a testing setting. Image capture, data
pre-processing, image segmentation, feature extraction, and classification are the
five categories into which the process of image-based sign language recognition
(SLR) is divided (Fig. 1).
Many researchers use previously defined datasets like ASL Image Dataset
(ASLID), ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)—2010,
ASL Gesture Dataset 2012, RWTH-Phoenix-Weather dataset, ArSL databases, and
the SIGNUM [1]. A few researchers develop their own datasets for training due
to the scarcity of sign language datasets for specific countries.
Wadhawan and Kumar concentrate on widely used scalable networks that are
transferred in transfer learning. AlexNet, which builds on LeNet, is a commonly used
deep learning architecture. They evaluated more than 50 CNN models. The data were
further evaluated using multiple optimizers, and it was determined that the suggested
technique attained the maximum training accuracy of 99.72% on color images
and 99.90% on grayscale images [2].
Rathi [3] used the GoogLeNet neural network, which has depth in both directions
and is a very deep network with high accuracy.
Masood Sarfaraz and Thuwal [4] classified 2624 ASL gestures using a pre-trained
VGG16 model. Simonyan and Zisserman [5] presented the VGG network architec-
ture. Although the ResNet architecture can be successfully trained at deep depths,
VGG-16 networks are regarded as being exceptionally deep. Sadly, VGG has two
significant drawbacks: it is quite slow during training, and the network weights are
quite large.
Convolutional neural networks (CNNs) are widely accepted as the most significant
deep learning models for image recognition and classification.
3 Methodology
The methodology used in this study includes the essential sign language recognition system operations: input data selection, pre-processing of the data, transformation into a structured and understandable data format, dataset balancing, application of supervised deep learning techniques, and evaluation of model performance using evaluation metrics.
This study used two datasets, consisting of images of alphabets from the American Sign Language (ASL) and images of sign language digits, respectively (Figs. 2 and 3).
Another typical pre-processing strategy, called augmentation, supplements the original data set with modified copies of existing photos. Common affine transformations include scaling, rotation, and others; these expose the neural network to a wider variety of examples.
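As an illustration only (not the authors' exact pipeline; the generator, parameter values, and directory layout below are assumptions), such affine augmentation can be set up in Keras as follows:

# Hedged sketch: affine augmentation for sign-language images (illustrative only).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rescale=1.0 / 255,       # normalise pixel values
    rotation_range=15,       # small random rotations
    zoom_range=0.1,          # random scaling
    width_shift_range=0.1,   # horizontal translation
    height_shift_range=0.1,  # vertical translation
    horizontal_flip=True,    # mirror images
    validation_split=0.2,    # hold out part of the data for validation
)

# Assumed directory of ASL images, one sub-folder per class (hypothetical path).
train_gen = augmenter.flow_from_directory(
    "asl_images/", target_size=(64, 64), subset="training", class_mode="categorical")
val_gen = augmenter.flow_from_directory(
    "asl_images/", target_size=(64, 64), subset="validation", class_mode="categorical")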
Convolutional neural networks were used for the model training. After classification, the classifier placed each preprocessed sign image in the appropriate category. A dataset of several ASL signs was used to train the classifier. After the dataset had been separated into training and validation sets, the training set comprised 80% of the data. To bring unpredictability into the deep learning model training process, the dataset must be randomly shuffled; this prevents the model from favoring specific parameters. The following CNN networks were trained and evaluated:
• A few different CNN networks (from shallow to deep layer counts).
• An ensemble of a few CNNs of varying resolutions followed by classification using an LSTM (a hedged sketch of this configuration follows the list).
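The sketch below is purely illustrative of such a hybrid configuration: the layer counts, filter sizes, number of classes, and the way the CNN branch outputs are stacked into a short sequence for the LSTM are assumptions, not the exact architecture trained in this study.

# Hedged sketch: ensemble of CNN branches followed by an LSTM classifier (illustrative only).
from tensorflow.keras import layers, models

NUM_CLASSES = 29           # assumed number of ASL classes
INPUT_SHAPE = (64, 64, 3)  # assumed image size

def cnn_branch(filters):
    """One small CNN feature extractor; depth and filter counts are assumptions."""
    inp = layers.Input(shape=INPUT_SHAPE)
    x = layers.Conv2D(filters, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(filters * 2, 3, activation="relu", padding="same")(x)
    x = layers.GlobalAveragePooling2D()(x)
    return models.Model(inp, x)

branches = [cnn_branch(f) for f in (16, 32, 64)]  # CNNs of varying capacity

image = layers.Input(shape=INPUT_SHAPE)
# Project each branch output to a common width and stack them as a length-3 sequence.
projected = [layers.Dense(64, activation="relu")(b(image)) for b in branches]
stacked = layers.Concatenate(axis=1)([layers.Reshape((1, 64))(p) for p in projected])
combined = layers.LSTM(64)(stacked)               # LSTM combines the branch features
output = layers.Dense(NUM_CLASSES, activation="softmax")(combined)

hybrid = models.Model(image, output)
hybrid.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])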
The effectiveness of the network obtained in this study for American Sign Language
recognition will be evaluated in relation to the model training parameters, which
include the number of layers, filters, and optimizers. The study evaluated the perfor-
mance of the approach on the test dataset, using four distinct measures to analyze
its effectiveness. This process typically involves calculating various metrics, such
as accuracy, precision, recall, and F1-score, as well as visualizing the results using
confusion matrices and training and loss graphs.
In this study, the models are compared using the following evaluation measures:
Training Loss refers to the error that the model makes on the training set during
training.
Validation Loss. The purpose of the validation set is to evaluate the performance of
the model on new, unseen data, and to prevent overfitting.
A Confusion Matrix is a table that summarizes the performance of a classification
model on a set of test data for which the true values are known.
Accuracy. This measures the overall percentage of images that were correctly
classified by the model.
Recall (Sensitivity). This measures the proportion of true positive predictions among
all actual positive instances. It indicates how well the model was able to detect a
specific class.
F1 Score. This is the harmonic mean of precision and recall and provides a single value that summarizes the model's performance.
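To make these definitions concrete, the short sketch below computes the same measures with scikit-learn; the label arrays are placeholders, not values from this study.

# Hedged sketch: computing the evaluation metrics described above (placeholder labels).
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])  # placeholder ground-truth classes
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])  # placeholder model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="weighted"))
print("Recall   :", recall_score(y_true, y_pred, average="weighted"))
print("F1-score :", f1_score(y_true, y_pred, average="weighted"))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))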
This section details the outcomes of the model-building experiments and includes the results for each trained model. It provides an evaluation of the trained models, including metrics such as accuracy, precision, and recall, and compares the models tested with the proposed hybrid neural network model.
Training and loss curves are provided for each model, and Table 1 compares the training and loss for all models.
Model-1 achieved 79% accuracy on the training dataset and 87% on the test dataset. Accuracy decreased with Model-2, owing to the increased filter size, to 62% on the training data and 78% on the test data. With one additional convolutional layer and a smaller filter size, Model-3 achieved 86% training and 93% test accuracy. Overall, the proposed hybrid model, which combines all of the other models in an ensemble followed by an LSTM classifier, achieved 98% accuracy on the training dataset and 99% on the test dataset.
2. Model 2: The precision score is 0.8005, which means that when the model predicts a positive label, it is correct 80.05% of the time. The recall score is 0.7794, which means that the model correctly identifies 77.94% of all positive instances. The F1-score is 0.7803, which is the harmonic mean of precision and recall. This model has relatively low accuracy, but the precision score is higher than the recall score, which suggests that it tends to make more false negative errors.
3. Model 3: This model achieved an accuracy of 0.9389, which is quite high. The
precision and recall scores are both above 0.93, with values of 0.9428 and 0.9389,
respectively. The F1-score is 0.9390. This model has the highest precision and
recall scores among the four models, indicating that it is better at both minimizing
false positives and false negatives.
4. Hybrid Model: This model achieved the highest accuracy of 0.9934, which is
significantly higher than the other models. The precision and recall scores are
also very high, with values of 0.9935 and 0.9934, respectively. The F1-score is
0.9934. This model has very high precision and recall, which means it performs
very well in minimizing both false positives and false negatives (Fig. 5).
After training with the training and validation datasets, the proposed hybrid model was used to predict the unseen test dataset and was evaluated with a confusion matrix and by plotting test images with predicted and actual labels.
4.5 Prediction
Unseen test data were predicted using the proposed hybrid model, and Fig. 7 shows images with actual labels against predicted labels. The model predicted all labels correctly. In summary, the hybrid model has the highest accuracy and precision/recall scores among the four models, while Model 2 has the lowest accuracy and a higher precision score compared to its recall score. Model 1 and Model 3 have relatively balanced precision and recall scores.
5 Conclusions
This study explored artificial intelligence-based sign language recognition using CNNs and, further, a hybrid neural network. This work presented a comprehensive overview of recent research in the sign language recognition field and discussed various aspects around it.
The aim of this study was to train a hybrid neural network to recognize American Sign Language letters and digits with higher accuracy. Experiments performed in this study showed that the proposed hybrid model outperformed the other CNN models across the evaluation metrics. The results suggest that the proposed models can achieve high performance in ASL recognition, which can have important applications in improving communication for the hearing-impaired.
This study showed that an ensemble of different models can improve sign language recognition accuracy over the individual models. There are many different types of neural networks, each with its strengths and weaknesses. Hybrid neural networks allow developers to create models that leverage the strengths of each network while minimizing their weaknesses. Although hybrid neural networks have many advantages, they can be more complex to design and train than traditional neural networks.
Acknowledgements The authors thank the UNITAR management for supporting the publication
of this paper.
References
10. Sharma S, Kumar K (2021) ASL-3DCNN: American sign language recognition technique using 3-D convolutional neural networks. Multimedia Tools Appl 80(17):26319–26331. https://doi.org/10.1007/s11042-021-10768-5
11. Bheda V, Radpour D (2017) Using deep convolutional networks for gesture recognition in
American sign language. CoRR, abs/1710.06836. http://arxiv.org/abs/1710.06836
12. Rao GA, Syamala K, Kishore PV, Sastry ASCS (2018) Deep convolutional neural networks
for sign language recognition. In: 2018 Conference on signal processing and communication
engineering systems, SPACES, pp 194–197
13. Koller O, Zargaran S, Ney H, Bowden R (2018) Deep sign: enabling robust statistical continuous sign language recognition via hybrid CNN-HMMs. Int J Comput Vision 126(12):1311–1325. https://doi.org/10.1007/s11263-018-1121-3
14. Kandel I, Castelli M, Popovič A (2021) Comparing stacking ensemble techniques to improve musculoskeletal fracture image classification. J Imaging 7(6). https://www.mdpi.com/2313-433X/7/6/100
The Performance of GPT-3.5 in Summarizing Scientific and News Articles
1 Introduction
In recent years, there have been significant advancements in the field of natural
language processing (NLP). NLP entails creating computational models and techniques that allow machines to comprehend, interpret, and produce language similar to that of humans [1]. The advancements in NLP include improvements in language models, machine translation, deep learning and transfer learning [2]. Over the past decade, we have seen a fast-paced evolution in large language models (LLMs). This has been possible due to advancements in deep learning techniques as well as the availability of high-performance computing hardware. This evolution started with the inception of the word2vec model (a simple two-layered neural network) and the development of continuous bag of words (CBOW) and continuous skip-gram techniques [3]. Moreover, transformers made a significant impact in NLP by altering encoder–decoder models into attention-based models [4]. This was followed by BERT [5] and GPT [6]. These models have evolved to the extent that pre-trained LLMs can now be applied to many different NLP tasks in a zero-shot configuration [7, 8]. This year, models like Falcon and Llama-2 [9] have completely revolutionized the field of NLP. These models are trained with billions of parameters, which enables them to achieve state-of-the-art performance on several tasks including text generation,
text prediction, and text summarization.
Text summarization is among the most researched topics in natural language
processing as it gives a brief overview of the text and is being used in a wide range of
applications. In this study, we aim to analyse the capabilities of the GPT-3.5 model
for the text summarization task and discuss its suitability for scientific articles versus
news articles. To the best of our knowledge, no such study has yet been conducted to
evaluate the capabilities of GPT-3.5 and understand how well it captures the context
of complex scientific articles and simple but not trivial news articles.
2 Related Work
Pre-trained large language models (LLMs) came into play with models like embed-
dings from language models (ELMO) [10, 11] and generative pre-trained trans-
former (GPT), which are impressive in their ability to do a wide variety of NLP
tasks, including text generation and comprehension [7]. These models are trained on
billions of parameters and have achieved remarkable breakthroughs in setting new
benchmarks on a variety of NLP tasks [6, 12]. It is argued that GPT-3 has acquired
linguistic knowledge and can comprehend semantic information [8]. Moreover, ten
different LLMs were compared including models like OPT [13], Instruct GPT [14]
and GPT-3 [12], and it was concluded that summaries from state-of-the-art LLMs are on par with summaries written by humans. Similarly, very recently Llama-2 has been released
which is a series of fine-tuned LLMs ranging from 7 to 70 billion parameters [9]. It
has outperformed Falcon 7-B and 40-B on several benchmarks. These benchmarks
include coding, reasoning, knowledge tests, etc. These LLMs can be easily used for
zero-shot tasks [7, 8]. These models make use of transformers, which utilize a self-attention mechanism to focus on different parts of the input sequence [4]. They also make use of positional encodings to capture the order and position of words, which is important to fully capture the significance of words in a sentence. The
foundational architecture for all LLMs like GPT, BERT, etc. is based on transformers.
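As a minimal, illustrative sketch of the scaled dot-product self-attention these architectures rely on (the shapes and random values are placeholders, not taken from any model discussed here):

# Hedged sketch: scaled dot-product self-attention (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); W*: projection matrices for queries, keys, and values."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each token attends to the others
    return softmax(scores) @ V                # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                   # 5 tokens, 16-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # -> (5, 16)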
For text summarization, BERTSUM has been introduced [15, 16]. It is a modified
variant of BERT specifically designed for text summarization [5]. It generates extrac-
tive summaries and is trained on the CNN/DailyMail dataset [17]. On the other hand,
BERTSUMABS is used widely to generate abstractive summaries. Unlike extractive
summaries, abstractive summaries are generated by the model using its learnt vocab-
ulary rather than extracting sentences from the original document. It is worth noting
that these models are trained on news-related datasets like CNN/DailyMail dataset
[17], the New York Times Annotated Corpus [18], and XSum dataset [19]. One reason
behind this could be that the vocabulary of news-related datasets is straightforward
and easier to comprehend as their sources are much simpler in nature. However, BERT
(extractive) performs poorly and generates grammatically incorrect summaries even
when dealing with simple student essays. Similarly, its performance deteriorates even
further when asked to summarize a research article as it starts including noise in its
generated summaries [20]. Interestingly, most of the text summarization work has
been carried out on general news-related articles. The task of summarizing research
papers has not been given much attention. In this direction, a recent study proposed
a model that compares BERT and GPT-2 and is trained/fine-tuned on COVID-19
medical research articles [21]. The dataset consists of only 59,000 scholarly articles
(at the time) [22]. They concluded that fine-tuning GPT-2 only gave a reasonable
abstractive summary and that it was still below human-level performance.
Research papers are not only complex in nature but also contain noise (equations,
figures, tables, etc.) that needs proper distillation in order to produce high-quality
summaries. Therefore, summarizing research articles continues to remain an open
challenge. Models that are fairly recent like Llama-2 and Falcon have not yet even
been tested on their text summarization capabilities. A recent study has analysed
the GPT series for text summarization, but the study has been conducted on the
SAMSum dataset, a dialogue-based corpus with rather simpler text [23]. This work
compares the performance of GPT-3.5 in summarizing scientific research articles and
news data. This will quantify GPT-3.5’s ability to comprehend complex scientific
literature compared to news articles.
3 Methods
The aim of this study is to evaluate the performance of GPT-3.5 for text summarization
on both scientific and news articles. The evaluation process involves generating summaries of 100 scientific articles collected from the arXiv STEM dataset and 100 news articles sampled from the CNN/DailyMail dataset. Using the
API, the GPT-3.5 model is prompted to generate summaries of both scientific and
news articles.
For research papers, their respective abstracts are used as reference summaries,
whereas the CNN/DailyMail dataset contains article highlights which briefly describe
the theme of an article. These highlights act as reference summaries for news articles.
The average abstract length of a scientific article is 150 words, whereas the average
length of the news highlights is 50 words. To match the number of words in the
reference summaries, GPT-3.5 is prompted to generate summaries in 150 words for
scientific articles and 50 words for news highlights. The generated summaries are then
compared to the reference summaries and ROUGE scores (ROUGE-I, ROUGE-II,
and ROUGE-L) are computed.
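As an illustration of this scoring step, the sketch below uses the open-source rouge_score package, which is one common choice rather than necessarily the implementation used in this study; the example texts are placeholders.

# Hedged sketch: ROUGE-1/2/L F1 between a reference summary and a generated summary.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The study evaluates GPT-3.5 summaries of scientific and news articles."
generated = "GPT-3.5 generated summaries of scientific and news articles are evaluated."

for name, score in scorer.score(reference, generated).items():
    print(f"{name}: F1 = {score.fmeasure:.3f}")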
The scientific articles were selected from the arXiv STEM dataset. It is a subset of
the arXiv dataset, which consists of 1.7 million research articles and papers in different
categories. We selected a hundred research articles from the arXiv STEM dataset to
assess the performance of GPT-3.5 in generating summaries for these articles. The
news articles are sampled from the CNN/DailyMail dataset which contains 300,000
news articles.
After selecting the datasets, the next step was to pre-process them in order to
make them usable for the model. The arXiv STEM dataset provides abstracts, a DOI
number, title, author names, publisher, and the year in which the paper was published.
However, it does not provide full-length papers. Another issue with this dataset is
that the DOI number is missing for many articles. Therefore, we selected 100 papers
at random with a condition on the DOI number, i.e. the chosen article was dropped
if its DOI number was missing. Using the DOI number, the articles were fetched
from the arXiv website, and a smaller dataset of a hundred papers was prepared. Since
the papers were in Portable Document Format (PDF), they could not be fed directly
into the model. Therefore, the next step involved converting them to text using an
open-source Python library called “PyPdf”.
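A minimal sketch of this conversion step is shown below, assuming the current pypdf package (referred to above as "PyPdf"); the file path is a placeholder.

# Hedged sketch: extracting plain text from a downloaded arXiv PDF.
from pypdf import PdfReader

def pdf_to_text(path: str) -> str:
    reader = PdfReader(path)
    # Concatenate the text of every page; layout artefacts may still need cleaning.
    return "\n".join(page.extract_text() or "" for page in reader.pages)

text = pdf_to_text("papers/example_arxiv_paper.pdf")  # placeholder path
print(text[:500])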
In contrast to the arXiv STEM dataset, CNN/DailyMail includes both the news
articles and their corresponding highlights. This made it more straightforward to
generate summaries from the articles without the need for pre-processing. A hundred
articles were sampled from this dataset and fed into the model.
3.2 GPT-3.5
GPT-3.5 is available using an API from OpenAI; however, there is a token limitation
for free usage. The limit is 4096 tokens, which works out well for the CNN/DailyMail
dataset as they are not lengthy documents. However, this limitation makes it difficult
to process scientific articles from the arXiv STEM dataset, which consists of research papers that are far too long to fit within this limit. For this reason, we
used the GPT-3.5 16 k context window model to process the papers by omitting the
abstracts and the references from the papers. Summaries were then generated with
GPT-3.5 16 k for 100 scientific articles selected from the arXiv STEM dataset and
100 news articles sampled from the CNN/DailyMail dataset, respectively.
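A sketch of how such a prompt can be issued through the OpenAI Python client is given below; the model name, prompt wording, temperature, and word limits are assumptions for illustration, not the exact configuration used in this study.

# Hedged sketch: prompting GPT-3.5 to summarize an article via the OpenAI API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize(article_text: str, word_limit: int) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-16k",  # long-context variant for full-length papers
        messages=[
            {"role": "system", "content": "You summarize articles concisely."},
            {"role": "user", "content": f"Summarize the following article in about "
                                        f"{word_limit} words:\n\n{article_text}"},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

# 150-word summaries for papers, 50-word summaries for news articles (as described above).
paper_summary = summarize("...full paper text without abstract and references...", 150)
news_summary = summarize("...news article text...", 50)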
4 Results
Using the GPT-3.5 OpenAI API, the pre-trained model is prompted to generate
summaries of the 100 scientific articles. Then the ROUGE scores were calculated to
evaluate the similarity between reference summaries (paper abstract) and generated
summaries. Table 1 shows the maximum, minimum, and mean F1 values for all three
types of ROUGE scores (ROUGE-I, ROUGE-II, and ROUGE-L). We observed that
the highest F1 value remained consistently high across the three ROUGE measures: unigram (ROUGE-I), bigram (ROUGE-II), and l-gram (ROUGE-L), which indicates that the
generated summary captures most of the information in the reference summary (paper
abstract). The best, worst, and average summaries generated by GPT-3.5 are presented
in Table 2.
Table 2 Partial extracts from the best, worst, and average summaries generated by GPT-3.5 for scientific articles with the corresponding reference summaries (paper abstracts)

Best (ROUGE score 0.88)
Reference summary (abstract): The complex Monoceros Loop SNR/Rosette Nebula region contains several potential sources of very high-energy (VHE) gamma-ray emission and two as yet unidentified high-energy EGRET sources. Sensitive VHE observations are required to probe acceleration processes in this region. The H.E.S.S. telescope array has been used to search for very high-energy gamma-ray sources in this region
Generated summary (GPT-3.5): The HESS telescope array has discovered a new gamma-ray source, HESS J0632 + 057, located close to the rim of the Monoceros SNR. This source is unresolved by HESS and has no clear counterpart at other wavelengths but is possibly associated with the weak X-ray source 1RXS J063258.3 + 054857, the Be-star MWC 148, and/or the lower energy gamma-ray source 3EG J0634 + 0521

Worst (ROUGE score 0.16)
Reference summary (abstract): I describe the different classes of Active Galactic Nuclei (AGN) and the basic tenets of unified schemes. I then review the properties of the extragalactic sources detected in the GeV and TeV bands, showing that the vast majority of them belong to the very rare blazar class
Generated summary (GPT-3.5): Blazars, a type of active galactic nucleus (AGN), dominate the gamma-ray sky. Blazars are characterized by irregular, rapid variability, high polarization, and a non-thermal continuum that extends from radio to gamma-ray wavelengths. They are believed to be powered by relativistic jets of plasma emitted from a central black hole

Average (ROUGE score 0.35)
Reference summary (abstract): We consider a single ion confined in a trap under radiation of two travelling waves of lasers. In the strong-excitation regime and without the restriction of Lamb-Dicke limit, the Hamiltonian of the system is similar to a driving Jaynes–Cummings model without rotating wave approximation (RWA). The approach we developed enables us to present a complete eigen solutions, which makes it available to compare with the solutions under the RWA
Generated summary (GPT-3.5): Researchers have investigated the system of a single trapped ion experiencing two travelling wave lasers in the strong-excitation regime (SER) and in the absence of the Lamb-Dicke limit. They found that the ground state in the non-rotating wave approximation (RWA) case is energetically lower than the ground state in the RWA case. This new ground state could be useful for quantum information storage and quantum computing
Beginning with the best summary generated by GPT-3.5, it is evident that all the
technical details of this paper are captured by the generated summary. Moreover,
it consists of vocabulary that is seen in the reference summary. ROUGE has given
it the highest score on all three measures, given the fact that it captures the overall
theme very well and has reproduced important facts and figures (Table 2). The worst summary generated by GPT-3.5, which received the lowest ROUGE score, deviates from the theme of the reference summary and does not cover its context (Table 2). However, it summarizes the paper itself very well. After reading through the
papers with the least scoring summaries, we found two reasons for lower ROUGE
scores: (i) the reference summary (abstract) is too short, providing limited insights
into the original papers, and (ii) the abstract is too general and does not cover the theme of
the paper.
In the average scoring summary, it is observed that the context of the text has
been well captured by the model (Table 2). However, the summary is generated with
different vocabulary, and ROUGE lacks the ability to evaluate semantics. Such is the case with most of the average-scoring summaries, which calls into question the suitability of ROUGE for text evaluation. GPT can generate abstractive summaries, whereas surface-level evaluation of text can result in incorrectly scored summaries, as is evident from this example.
We used the same GPT-3.5 pre-trained model to generate summaries of the 100 news
articles and calculated ROUGE scores to evaluate the similarity between reference
summaries (news highlights) and generated summaries. Table 3 shows the maximum,
minimum and mean F1 values for all three types of ROUGE scores (ROUGE-I,
ROUGE-II, and ROUGE-L). As expected, the highest F1 value persistently remained
high on three ROUGE measures unigram (ROUGE-I), bigram (ROUGE-II), and
l-gram (ROUGE-L). However, in contrast to our expectations, the news articles
achieved the maximum ROUGE-I score of 0.67 much lower than the maximum
score for scientific articles (Tables 1, 3). To investigate this further, the best, worst,
and average summaries generated by GPT-3.5 are presented in Table 4.
Consistent with our observation for scientific articles, the best-scoring generated summary picked up all the important points from the news article that are part of the reference summary (news highlights). This shows that GPT-3.5 was able to rightly prioritize the important factors, similar to the reference summary.

Table 4 Partial extracts from the best, worst, and average summaries generated by GPT-3.5 for news articles with the corresponding reference summaries (news highlights)

Best (ROUGE score 0.67)
Reference summary (highlights): YouTube questions address taxes, the Bible, abortion, gun control. Giuliani, Romney, Huckabee spar over immigration. McCain challenges Paul over suggestion to bring troops home from Iraq. Nearly 5000 videos for the GOP debate; 2000 more than the Democratic debate
Generated summary (GPT-3.5): The CNN/YouTube GOP presidential debate was marked by acrimony and attacks between the Republican candidates. They engaged in a free-for-all, discussing immigration, the Iraq war, abortion, gun control, and their beliefs in the Bible

Worst (ROUGE score 0.04)
Reference summary (highlights): If you build it, the tourists will come to your museum. Museums for hobos, medical oddities, and trash. Kentucky Museum is where dummies go to die
Generated summary (GPT-3.5): The text describes ten unusual museums around the USA. These include the SPAM Museum, the National Museum of Funeral History, the Hobo Museum, the Mutter Museum, the Barnum Museum, the Conspiracy Museum, the Museum of Question

Average (ROUGE score 0.31)
Reference summary (highlights): President Bush says Tony Snow "will battle cancer and win". Job of press secretary "has been a dream for me", Snow says. Snow leaving on September 14, will be succeeded by Dana Perino
Generated summary (GPT-3.5): White House press secretary Tony Snow will step down from his post on September 14 due to his ongoing cancer treatment. He will be replaced by Deputy Press Secretary Dana Perino. President Bush expressed his sadness at Snow's resignation and praised his contributions
Although the reference summary is written by the author of the news articles, GPT-3.5
could still better summarize the article in a related context (see Table 4).
Moving to the generated summary securing the least ROUGE score, it is obvious
that the reference summary itself is vaguely written and does not really tell much
about the article itself. On the other hand, just by reading the GPT-3.5 summary, one
can easily guess what the article is about (Table 4). This puts into question the article
highlights from the original dataset and their usage as ground truth for the articles.
Finally, considering the average-scoring summary with a ROUGE score of 0.31, it does capture the context of the news article. However, the summary is generated with distinct vocabulary, which resulted in a lower ROUGE score. The other possible reason is that the news highlights (reference summary) are written in the typical writing style followed by the authors of news articles, which differs from the gener-
ated summary. As observed for scientific articles, ROUGE heavily relies on similar
vocabulary, and this example yet again puts its wide acceptability into question.
5 Discussion
Although we have used a different model from the GPT-3 text-davinci-002, our
model has also generated similar results. The study concluded that almost all avail-
able reference-based and reference-free metrics are ineffective when it comes to
evaluating summaries. For the reference-based metric ROUGE, our findings are
consistent with the literature [27]. Moreover, given the functioning mechanism of
reference-based text similarity measures, BLEU and METEOR would have also
resulted in similar low-scoring summaries. This points towards the long-standing
issue related to evaluating text as modern LLMs are now reproducing human-like
summaries [8]. Another study concluded that human annotators prefer summaries
generated by GPT-3 in comparison to fine-tuned models which were trained on large
summarization datasets [27].
In this study, we extend the analysis to see how well GPT-3.5 can summarize scientific papers as opposed to simpler news articles. Although the summaries generated for scientific papers received better ROUGE scores, the overall scores are close to those of
previous studies which have been carried out using automatic reference-based eval-
uation metrics [8, 27]. Moreover, it has been demonstrated that low-quality refer-
ence summaries result in lower scores by automatic metrics due to little correlation
between the summaries produced by these advanced LLMs and reference summaries
[8]. In our case, the average and poor scoring summaries for both scientific and news
datasets also indicate a similar issue with these reference summaries. Manual analysis
of the summaries produced in this study also confirms that the generated summaries
are concise and contain novel vocabulary, but ROUGE has failed to evaluate them.
Finally in agreement with the literature, this study emphasizes the need for a better
evaluation metric for text summarization.
6 Conclusion
This study evaluates the capability of GPT-3.5 on the task of text summarization. The
aim was to quantify the model’s contextual understanding and capacity to produce
precise and succinct summaries for scientific and news articles. We demonstrate that
GPT-3.5 can precisely summarize both scientific and news articles, with a slightly
higher ROUGE score for scientific summaries. In addition, the summaries generated
by GPT-3.5 demonstrate its ability to absorb and summarize complicated information
to capture the theme of both scientific and simple but not trivial news articles. It was
also noted that the reference summaries included in the datasets had flaws and were
not always accurate reflections of the articles’ context and content. As a result, the
generated summaries received lower ROUGE scores. Additionally, we anticipate that
the lower ROUGE scores reflect a limitation of ROUGE itself, as it heavily relies on
similar vocabulary. Future studies need to develop a semantic evaluation metric for
text summarization. Additionally, the creation of datasets containing high-quality
reference summaries will significantly enhance our ability to accurately evaluate the
summarization capabilities of LLMs across diverse datasets.
Acknowledgements The authors thank the UNITAR International University for supporting the
publication of this paper.
References
1. Mehmood T, Gerevini AE, Lavelli A, Olivato M, Serina I (2023) Distilling knowledge with a teacher's multitask model for biomedical named entity recognition. Information 14(5)
2. Mehmood T, Serina I, Lavelli A, Gerevini A (2020) Knowledge distillation techniques for
biomedical named entity recognition. In: Proceedings of the 4th workshop on natural language
for artificial intelligence (NL4AI 2020) co-located with the 19th International conference of
the Italian Association for artificial intelligence (AI*IA 2020), Anywhere, November 25–27th,
2020. CEUR Workshop Proceedings, vol 2735, pp 141–156. CEUR-WS.org
3. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in
vector space. arXiv preprint arXiv:1301.3781
4. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I
(2017) Attention is all you need. Adv Neural Inf Process Syst 30
5. Kenton JDMWC, Toutanova LK (2019) Bert: pre-training of deep bidirectional transformers
for language understanding. In: Proceedings of NAACL-HLT, vol 1, p 2
6. Radford A, Narasimhan K, Salimans T, Sutskever I et al (2018) Improving language
understanding by generative pre-training
7. Qin C, Zhang A, Zhang Z, Chen J, Yasunaga M, Yang D (2023) Is chatgpt a general-purpose
natural language processing task solver? arXiv preprint arXiv:2302.06476
8. Zhang T, Ladhak F, Durmus E, Liang P, McKeown K, Hashimoto TB (2023) Benchmarking
large language models for news summarization. arXiv preprint arXiv:2301.13848
9. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, Bashlykov N, Batra S, Bhargava
P, Bhosale S et al (2023) Llama 2: open foundation and fine-tuned chat models. arXiv preprint
arXiv:2307.09288
10. Mehmood T, Serina I, Lavelli A, Putelli L, Gerevini A (2023) On the use of knowledge transfer
techniques for biomedical named entity recognition. Future Internet 15(2):79
11. Peters M, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep
contextualized word representations
12. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P,
Sastry G, Askell A et al (2020) Language models are few-shot learners. Adv Neural Inf Process
Syst 33:1877–1901
13. Zhang S, Roller S, Goyal N, Artetxe M, Chen M, Chen S, Dewan C, Diab M, Li X, Lin XV
et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.
01068
14. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, Zhang C, Agarwal S, Slama
K, Ray A et al (2022) Training language models to follow instructions with human feedback.
Adv Neural Inf Process Syst 35:27730–27744
15. Liu Y (2019) Fine-tune bert for extractive summarization. arXiv preprint arXiv:1903.10318
16. Liu Y, Lapata M (2019) Text summarization with pretrained encoders. arXiv preprint arXiv:
1908.08345
17. Hermann KM, Kocisky T, Grefenstette E, Espeholt L, Kay W, Suleyman M, Blunsom P (2015)
Teaching machines to read and comprehend. Adv Neural Inf Process Syst 28
18. Sandhaus E (2008) The New York Times annotated corpus (linguistic data consortium,
Philadelphia)
19. Narayan S, Cohen SB, Lapata M (2018) Don’t give me the details, just the summary! topic-
aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.
08745
20. Miller D (2019) Leveraging bert for extractive text summarization on lectures. arXiv preprint
arXiv:1906.04165
21. Kieuvongngam V, Tan B, Niu Y (2020) Automatic text summarization of COVID-19 medical
research articles using bert and gpt-2. arXiv preprint arXiv:2006.01997
22. Wang LL, Lo K, Chandrasekhar Y, Reas R, Yang J, Burdick D, Eide D, Funk K, Katsis Y,
Kinney R et al (2020) Cord-19: The covid-19 open research dataset. ArXiv
23. Gliwa B, Mochol I, Biesek M, Wawer A (2019) Samsum corpus: a human-annotated dialogue
dataset for abstractive summarization. arXiv preprint arXiv:1911.12237
24. Lin CY (2004) Rouge: a package for automatic evaluation of summaries. In: Text summarization
branches out, pp 74–81
25. Chakraborty R, Bhavsar M, Dandapat SK, Chandra J (2019) Tweet summarization of news
articles: an objective ordering-based perspective. IEEE Trans Comput Soc Syst 6(4):761–777
26. Ghanem FA, Padma M, Alkhatib R (2023) Automatic short text summarization techniques in
social media platforms. Future Internet 15(9):311
27. Goyal T, Li JJ, Durrett G (2022) News summarization and evaluation in the era of GPT-3. arXiv
preprint arXiv:2209.12356
Wound Stage Recognition Using YOLOv5
Abstract Pressure injuries or pressure sores are lesions that develop over the bony
prominences of the body. It is noted from professional nurses and from first-hand
experience that such wounds are often wrongly classified, making the healing process
difficult and painful for the patient, when it does not need to be the case. This study
aims to research the use of modern computer vision and artificial intelligence tech-
niques to aid in the classification of pressure ulcers to a degree superior to that
achieved by humans. The YOLOv5 algorithm was trained on various combinations
of three data sets. Single-phase and multi-phase architecture models were developed
and trained for the different data set configurations to identify the ideal scenario
of classification for the different stages of pressure ulcers. A single-phase architec-
ture model trained using the public data sets and enhanced with the novel data set
improves on the current state of the art. While the single-phase architecture was
deemed the better option for the wound stage classification throughout this study,
both architectures had their own strong points. The multi-phase architecture was
found to be better suited for smaller data sets, while the single-phase architecture
was better suited for larger ones. It was also found that, for both architectures, the combination of three data sets is sufficiently varied, whereas fewer data sets are not. This study
is further strengthened through the input of different viewpoints that complement the
lead researchers’ work.
1 Introduction
Pressure injuries (PIs), also known as pressure ulcers or pressure sores, are lesions
that develop over bony prominences of the body. These areas include the heels of
the feet, sacral area, hip, and the area of the coccyx. Other areas not as frequently
affected include the elbows, knees, ankles, back of the cranium, and the back of the
shoulders. These develop because of prolonged pressure or shearing in that area.
PIs usually affect people who suffer from a mobility impairment and simultaneously
have poor nutrition and hydration.
Apart from the obvious pain and discomfort, there are various complications that
could come with PIs. One of the most significant risks associated with pressure
ulcers is an infection, which can spread to deeper layers of the skin, leading to
conditions such as cellulitis, septic arthritis, and osteomyelitis, as well as potentially
life-threatening conditions like septicaemia, septic shock, and necrotising fasciitis
or gas gangrene. In addition to these health risks, pressure ulcers can also delay
healing, leading to chronic wounds and decreased mobility, affecting a person’s
quality of life, and potentially resulting in permanent disability or death. The Agency for Healthcare Research and Quality (AHRQ) estimates that 60,000 individuals die yearly from PIs acquired during their hospital stay. Treatment for pressure
ulcers can be costly and time consuming, requiring specialised medical care and
ongoing management, which can put a significant burden on both the patient and the
healthcare system.
Currently, the most common way that PIs are classified is through assessment tools such as the Braden Scale and the Waterlow Scale [1]. These break down various factors, such as mobility and sensory perception; the clinician conducting the assessment assigns a score to each prompt, and finally all the scores are added up and classification is done based on the total score. Treatment of PIs can be carried out through multiple mediums, such as dressings, which help to maintain a moist environment to maximise wound-bed epithelialisation and fix any nutritional deficiencies. Mobilisation and frequent turning also aid healing, as they relieve pressure and make sure that the wound does not get worse. This study focuses on classifying the
different stages of PIs to be used by both medical and non-medical professionals.
The hypothesis being addressed in this research is: By making use of computer
vision, it is possible to identify the different stages of PIs to further aid in their
treatment and healing. This research is addressing the following research questions:
1. By making use of computer vision, is it possible to classify the severity of skin
wounds?
2. What are the key features needed for a data set in such research?
3. How do different data sets contribute to the current state of the art?
4. Is a single-phase or multi-phase approach ideal for the chosen skin injury?
5. How does the proposed model architecture compare to the current state of the
art?
6. How can nurses or student nurses benefit from such research?
2 Background
Object detection and localisation via computer vision have been researched to aid
individuals with visual impairment [2] yet also to aid medical professionals in the
correct localisation [3] and classification of medical alignments such as PIs [4].
These are areas of injured skin, which could go on to affect underlying tissues
and muscles. As mentioned, PIs develop because of prolonged pressure or shearing
on an area, usually affect people who suffer from a mobility impairment, and simul-
taneously have poor nutrition and hydration. They are divided into four different
stages. Stage one is where the area is red and warm to the touch. During stage two,
the area will have an open sore or blister. In the third stage, the area is hollow, but
tendons, bones, and muscles are not exposed. Stage four is where the wound reaches
the muscle, ligaments, or bone. Apart from these four stages, a PI can be unstageable,
which means that there is full-thickness tissue loss, and the wound bed is necrotic,
meaning that the tissue is dead.1 The treatment of a single grade IV PI costs around
£14,108 [5], so it is always best to prevent PIs rather than treat them. Multiple risk
assessments, such as the Waterlow score, the DESIGN tool, and the SSKIN assess-
ment, have been established to determine the risk of someone developing a PI and
aid in the prevention of development. However, according to Barnard and Copson
[6], an estimated 73% of skin damage referred to the Northern Lincolnshire and
Goole Hospitals NHS Foundation Trust, from October 2015 to December 2015, was
misclassified. This is due to human error, possibly because the person performing the
risk assessment is not knowledgeable enough about the different stages of pressure
ulcers.
As of today, there have been multiple studies carried out to evaluate images of
wounds with the aid of machine learning; however, most of the focus has been on
wound segmentation and depth perception. To be able to distinguish between stages,
certain features need to be present. In the first stage of PIs, there is no outline around
the wound as the skin is still intact during this stage. The wound is also lighter in
colour, as during this stage the skin only starts to get red, so having areas highly
concentrated with pink to light red pixels indicates a ‘Stage 01’ PI.
In the second stage, there will be a light outline around the wound as the skin
would have broken; however, the wound will not be deep. While the colour of the
wound can be a bright red, it will not be a dark red as that will indicate heavy bleeding.
Wounds in the class ‘Stage 03’ have a thick border, showing wound depth. They
also feature bright to dark red colours, showing active bleeding; usually, these wounds
have a dark border. When the wound is a lighter red, it will usually have a light border.
During this stage, there is also no large amount of white in the wound itself, as this
would insinuate bone or muscle showing.
Throughout ‘Stage 04’, there is white inside the wound itself, insinuating that
there is bone or muscle showing. However, most of the surface area of the wound is
not white, and it will be a shade of red.
1 https://www.hopkinsmedicine.org/health/conditions-and-diseases/bedsores.
The ‘Unstageable’ class has the most distinct features as during this stage the
wound has gone from being able to be healed to dead tissue. The wound features
during this stage are the colours: black, yellow, and light tan. There is no red in the
wound as that indicates living cells.
Object localisation refers to the identification of one or multiple objects in an
image with a bounding box around each object.2 Being able to locate a wound
successfully is the first step to a functional automated wound healing system. Patel
[3] focused solely on developing a wound locator. A deep neural network (DNN)
was used to determine the time progress of wound healing in terms of wound colour
composition. The data set was collected from Advancing the Zenith of Healthcare
(AZH) Wound and Vascular Centre, and it contains a total of 1010 wound images,
with three different types of ulcers being included: diabetic foot ulcers (DFU), venous
ulcers (VU), and PIs. For testing, 56 images were collected from the Medetec database
[7]. The data set was then augmented by rotating, flipping, and blurring the images.
After augmenting, the data set had a total of 4050 images. In the end, 3645 images
were used to train, while 405 images were used for testing. The best results obtained
were for YOLOv5 with a precision of 0.925, recall of 0.905, F1-Score of 0.915, and
mean Average Precision (mAP) of 0.939.
Image segmentation is where the recognised object’s location is not outlined by
bounding boxes but by highlighting the pixels containing the object, which makes the object easier to analyse. Wang et al. [8] researched fully automatic wound
segmentation and did not find an adequate public data set, and so collaborated with
the AZH Wound and Vascular Centre.3 It was made sure that no identifiable features
of the patients were included in the images. The data set was collected over 2 years
and included a total of 1109 foot ulcer images from 889 patients under uncontrolled
illumination conditions with various backgrounds. Bounding boxes were then placed
on the data set by YOLOv3, the images were then augmented, and a training set of
3645 photos and a validation set of 405 photos were produced. The data set was also
made public [9]. The YOLOv3 model was run with a batch size of 8 for 273 epochs,
and this has resulted in a mAP of 0.939. A convolutional neural network (CNN),
MobileNet-v2, was then used to segment the wound from the images, and they did
not aim to further classify the wounds. Apart from testing their model on their data
set, they also tested it on the Medetec [7] data set, which contains 358 images of
various wounds. They tested both data sets on different models, and the best results
were obtained using the Medetec data set as shown in Table 1.
Elmogy et al. [10] aimed to develop an automatic segmentation system to detect
and segment PI red, blue, green (RGB) coloured images. The system proposed
extracts three different ulcer tissues: granulation being red; necrotic eschar being
black; and slough being yellow. Their data set consisted of 193 PI RGB images.
Thirty-six of the images were provided by IMQ Igurco Oficina Central4 with a reso-
lution of 1024 × 1024 pixels, and the other 157 images were obtained from the
2 https://machinelearningmastery.com/object-recognition-with-deep-learning/.
3 https://azhcenters.com/.
4 https://igurco.imq.es/.
Medetec [7] wound database with a resolution of 1024 × 731 pixels. The data set
was divided into 60% for training, 10% for validation, and 30% for testing. A four-
fold cross-validation technique was applied. A convolutional neural network (CNN)
was used for segmenting the PI tissues, making it quite fast.
To validate their results, they compared their study with four different segmen-
tation systems. These systems are the linear combinations of the discrete Gaussians
(LCDG) segmentation technique, the Fuzzy C-Means (FCM) technique, the Otsu
technique, and finally, a system from their previous work which was based on two
different CNN networks and only extracted features from hue, saturation, intensity
(HSI) colour space [11]. Apart from this validation method, they further validated
their system by using a fourfold cross-validation method, which achieved an average
accuracy of 0.94, getting a consistent result when compared to the first method.
The results from their former work [11] and current work were similar, showing that
changing the colour space does not have a high impact; however, since in their current
work they used one CNN network, as opposed to two, as in their former work, their
current system is more efficient.
Furthering the methods is object detection and classification. The output of this
method includes the image inputted, along with bounding boxes around objects and
a class label that accompanies each bounding box. Veredas et al. [4] focused on wound
image evaluation, and their data set was divided into: necrotic, slough, healing, and
granulation. They partnered with clinicians, who took colourised photos of PIs of
patients with home care, accumulating a data set of 113 photos. The next step was a
preliminary segmentation process, which was carried out with a k-means clustering
algorithm, where light reflections, clinician fingers, blood stains, and medical equip-
ment were removed from the photos. This was done to divide the images into groups
with similar characteristics based only on the wound itself. The second stage then
defined a region of interest of the estimated wound-bed area, getting a peri-ulcer data
set. In the third and final segmentation stage, the k-means clustering algorithm was
once again used, to process pixels in and out of the wound bed, this time making use
of different colour space combinations. The data set was then tested on three different
machine learning algorithms, support vector machine (SVM), feed-forward neural
network (FNN), and random forest (RF).
To determine which algorithm worked best, the performance rates or performance
metrics were compared, and Veredas et al. [4] utilised accuracy as their main perfor-
mance rate. In the wound-bed set, SVMs achieved the best overall performance rate,
with differences of 0.0257 and 0.0052 when compared to FNNs and RFs, respectively.
In the peri-ulcer set, SVMs and RFs gave the highest accuracy rates once again, with
no significant difference between them (p ≈ 0.003). However, there was a signif-
icant difference observed between the two previous algorithms mentioned and the
FNN algorithm (p < 0.01). From this study, it was found that SVMs and RFs gave
the highest performing rates when classifying wound-bed patterns. All three models
have a high efficacy score when identifying patterns from the peri-ulcer area.
A list of research on the subject matter and the datasets used is shown in Table 2.
It can be seen that while most studies gathered their own data sets from various
sources, all data sets but one, the one by [8], are private. Due to this, public wound data sets remain scarce.
3 Methodology
Many PIs are misclassified, even with the numerous tools that exist to properly clas-
sify them. The aim of this study is to show that if we make use of the current technolo-
gies, it will be possible to create a solution to help educate healthcare professionals
or students and offer a guide to classifying PIs.
The two public data sets found are [7] (Medetec) and [8] (Wang). The Medetec data
set contains 358 wound images, 183 images being pressure ulcers at the time of
collecting the images. The Wang data set contained 1010 training images and 1009
validation images of pressure ulcers. Some images from both data sets were of the
same patient only a few weeks apart, and due to this, not all images were used as they
were not significantly different. The images from both data sets were not augmented
‘Stage 01’ and unstageable PI images underwent smoothing. ‘Stage 02’ and ‘Stage
03’ were not smoothed since they could have been easily mistaken for each other as
the features that differentiate them were blurred out. ‘Stage 04’ was also ruled out
from undergoing the smoothing method as it could be mistaken for an unstageable
PI. Each image would be flipped both vertically and horizontally. For the healthy skin
data set each image underwent flipping and smoothing, increasing the total number
to 693 images. Since the data set was not balanced with the wound data set, it was
decided that it would undergo further augmentations, and this time noise (5%) and
saturation (between − 35% and + 35%) were applied.
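As an illustrative sketch only (the exact tooling and probabilities used in this study are not stated; the albumentations library and the values below are assumptions), such an augmentation pipeline could be expressed as:

# Hedged sketch: flip / smoothing / noise / saturation augmentation for wound images.
import albumentations as A
import cv2

augment = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.GaussianBlur(blur_limit=(3, 5), p=0.3),    # "smoothing" applied to selected classes
    A.GaussNoise(var_limit=(5.0, 20.0), p=0.3),  # roughly 5% noise
    A.HueSaturationValue(hue_shift_limit=0, sat_shift_limit=35,  # saturation +/- 35%
                         val_shift_limit=0, p=0.5),
])

image = cv2.imread("wound_example.jpg")          # placeholder file name
augmented = augment(image=image)["image"]
cv2.imwrite("wound_example_aug.jpg", augmented)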
In this study, the performance of two algorithms, YOLOv5 and YOLOv7, was eval-
uated on the accumulated data set. The YOLO algorithm was chosen due to its
low-resource requirements, and since the intended use is to be integrated within a
mobile app, we believe that this algorithm would be the ideal way forward. Results
indicated that YOLOv5 outperformed YOLOv7, which led to the decision to discard
the latter algorithm for the time being. However, it is worth mentioning that with more
time, potential solutions to improve the performance of YOLOv7 could be explored.
Listing 1 demonstrates the configurations used in the execution of YOLOv5.
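Listing 1 itself is not reproduced here; as a hedged illustration, a YOLOv5 training run with the standard ultralytics/yolov5 repository is typically launched as below, where the image size, batch size, epoch count, and file names are assumptions rather than the study's actual settings.

# Hedged sketch: launching a YOLOv5 training run from the ultralytics/yolov5 repository.
# The flags exist in that repository's train.py; the values are placeholders.
import subprocess

subprocess.run([
    "python", "train.py",
    "--img", "640",                    # input image size
    "--batch", "16",                   # batch size
    "--epochs", "100",                 # number of training epochs
    "--data", "pressure_injury.yaml",  # dataset config: image paths and class names (placeholder)
    "--weights", "yolov5s.pt",         # start from pretrained small YOLOv5 weights
    "--name", "pi_single_phase",       # run name for the results folder
], check=True)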
5 https://activeageing.gov.mt/st-vincent-de-paul-long-term-care-facility/?lang=en.
Experiment 1
An experiment was carried out using YOLO to attempt to answer the fourth research
question. In this, the PI data set was run once with all stages and then run another
time in four different phases. Phase 00 checked whether the images given contained
a wound or not. If not, the model stops there; however, if a wound is found, the model
moves on to the next stage. Phase 01 focused on identifying if the PI in the photo
was stageable or unstageable. From there, if the PI is stageable, it moves on to Phase
02 which will further classify the PI into either Stage 1/2, Stage 3, or Stage 4. For
this phase, stages 1 and 2 are grouped together as they are often mistaken for each
other. Moving on to Phase 03, the model will determine whether the PI in the photo
falls in Stage 1 or Stage 2. The single-phase classification focuses on the single-class
classification of PIs, where a picture is given to the model, and it determines if it is
classified as one of the stages or if it is unstageable. Since during this experiment the
model was divided into various phases, each phase had its own data set.
Experiment 2
The second experiment addresses the third and fifth research questions. The model
architecture chosen from the previous experiment was repeated with the various
data sets used throughout this study. The breakdown of data sets trained during this
experiment was as follows: the Medetec data set on its own, the Wang data set on its
own, the Medetec data set alongside the novel data set, the Wang data set combined
with the novel data set. This was done so that there could be a comparison for which
combination of data sets performed best for specific phases, while also trying to
determine if one data set contributes more than another.
Experiment 3
For the third experiment, a testing data set was created, consisting of an equal number
of instances per data set, per stage, where possible. Whenever this was not possible,
due to a lack of images, it was replaced by images from the other data sets. The
aim was to have two images, per stage per data set, and this was to eliminate the
possibility of a data set bias. From each training run carried out during the second
experiment, the best model was saved so that it could be tested on the testing data
set accumulated during this stage.
Professional Evaluation
Semi-structured interviews were conducted with nurses. The questions were open-
ended so that the participants could elaborate and give as much information as possible [16]. This helped in obtaining information about whether such a system could prove beneficial during a nurse's day-to-day work, which key features would be needed for the system to be functional, and any recommendations on how the system could progress further in the future.
4 Results
4.1 Experiment 1
The first experiment was carried out to compare a single-stage architecture with a
multi-stage architecture to see which approach is better for wound detection. Table 3
outlines the result. For the multi-phase architecture, weighted results are provided
for a fair comparison.
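How the weighting is performed is not detailed in the text; a simple sketch, assuming each phase is weighted by its number of test images (placeholder values only), would be:

# Hedged sketch: weighted average of per-phase metrics (placeholder values only).
def weighted_average(metric_by_phase, support_by_phase):
    total = sum(support_by_phase.values())
    return sum(metric_by_phase[p] * support_by_phase[p] for p in metric_by_phase) / total

precision_by_phase = {"phase00": 0.95, "phase01": 0.90, "phase02": 0.85, "phase03": 0.80}
images_by_phase = {"phase00": 120, "phase01": 90, "phase02": 60, "phase03": 30}
print(round(weighted_average(precision_by_phase, images_by_phase), 3))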
Each metric had a significantly higher result in the single-stage architecture when
compared to the average of the multi-stage architecture. After evaluating the results,
the single-phase architecture was deemed the best option when it comes to detecting
wounds. However, to further solidify this result, the multi-stage architecture was also
utilised in the second experiment.
4.2 Experiment 2
Multi-phase Architecture
The weighted average for each stage was calculated for the multiple data sets to be
able to properly compare the results. For most of the data set combinations which
did not include all three data sets at once (Medetec, Medetec + Novel, Wang, Wang
+ Novel), the ‘Unstageable’ class performed best in the multi-stage architecture; this is because it is the class with the fewest models to pass through before reaching the final classification. The results have not been included here for brevity, especially
since the single-phase architecture has performed better.
Single-Phase Architecture
The results for each stage are not included here for brevity but are summarised in the following. As with the multi-stage architecture, the overall results for the Medetec data set are lowered significantly because the class ‘Stage 02’ contains no images.
Even though ‘Stage 03’ had the lowest metrics among all stages in the Medetec data set, it still performed well with a PPV of 0.899, TPR of 0.743, ACC of 0.842, and mAP@0.5 of 0.926. These results suggest that the model was able to accurately classify and localise wounds, despite the data set being quite small in size. In regard to mAP@0.5, the model performed well across all stages, with scores ranging from 0.926 to 0.988, indicating that the model is able to localise and detect accurately.
When looking at the overall metrics, the single-phase architecture with the combination of Medetec and Novel data sets performed better than the one with only the Medetec data set in terms of PPV, TPR, ACC, and mAP@0.5. This is due to the class ‘Stage 02’ being added to the data set. When looking at the PPV of each class, apart from ‘Stage 02’ and ‘Unstageable’, each class scores higher with the Medetec data set on its own.
In terms of PPV and TPR, the Wang data set outperformed the Medetec data set, both with and without the Novel addition. The best PPV of 0.955 and TPR of 0.923 were both achieved in ‘Stage 01’, while the worst PPV of 0.726 and TPR of 0.621 were both in ‘Stage 02’. On the other hand, the best ACC of 0.982 was achieved in ‘Stage 01’, while the worst ACC of 0.853 was in ‘Unstageable’. Finally, the best mAP@0.5 of 0.868 was achieved in ‘Unstageable’, while the worst mAP@0.5 of 0.667 was in ‘Stage 02’. These metrics show that the model struggles most to detect the class ‘Stage 02’, which could be due to the class being under-represented or not containing many distinguishing features.
When incorporating the Novel data set with the Wang data set, the model had a significantly lower PPV and TPR (differences of 0.022 and 0.024, respectively), an insignificantly lower ACC (0.002), and an insignificantly higher mAP@0.5 (0.004). Adding
images from a different source might have thrown off the model training since most
of the images used were from the Wang data set.
The Medetec model, which was the lowest performer among the single-phase
architecture models, still outperformed all the models in the multi-stage architec-
ture. With both architectures, the addition of the Novel data set resulted in lower
performance; however, when all three data sets were combined, the model outperformed the rest. This could indicate that two data sets together are not diverse enough to work well, whereas three data sets are.
4.3 Experiment 3
A testing data set was produced, which consisted of 36 images, 6 for each class. For
each data set, two images were chosen per class for the testing data set; however, when
there was a class not represented in a data set, the images were taken from the other
remaining data sets instead. Another six were added from the healthy skin data set so
that all were equally represented for the multi-stage architecture. The images were
evaluated based on each individual box that was correctly or incorrectly identified
or completely missed.
The single-stage architecture trained with all data sets performed the best. For the classes ‘Stage 02’, ‘Stage 03’, and ‘Stage 04’, all wounds were detected, with none missed. ‘Stage 01’ performed the worst with 0.833, further supporting that the ‘Stage 01’ class could benefit from a higher number of images, as this stage is less visually noticeable. Overall, the model achieved a correct-detection rate of 0.947.
When comparing the results presented in this study with the studies presented earlier, a direct comparison could be made with the studies in the ‘Wound localisation’ and ‘Classifying wounds’ sections. These studies were chosen because they have similar classes and end goals to this study.
The ‘Wound localisation’ section contained only one study [3], and Phase 00 was chosen for the comparison. The author of [3] compiled their own data set, supplementing it with images from the Medetec data set. Their best model achieved a PPV of 0.925, a TPR of 0.905, and a mAP of 0.939. When compared with this study (0.952, 0.913, and 0.941, respectively), the model presented in Phase 00 performed slightly better.
Moving on to papers in the ‘Classifying wounds’ section, three of the papers were able to provide a direct comparison with the models presented in this study. The first paper was that by [4], who divided their classes into ‘necrotic’, ‘slough’, ‘healing’, and ‘granulation’, with a data set obtained from various clinicians. Three of the four classes were used for the comparison: the ‘necrotic’ and ‘slough’ classes were compared with the ‘Unstageable’ class, and the ‘granulation’ class was compared with the ‘Stage 03’ class, as any wounds containing those features would belong to the corresponding class. The wound data set results of [4] showed an ACC of 0.878, which was higher than the multi-stage architecture by 0.404 but lower than the single-stage architecture by 0.056.
Yilmaz et al. [12] divided their classes into ‘Stage 1’, ‘DTPI’, ‘Stage 2’, ‘Stage 3’,
‘Stage 4’, and ‘Unstageable’. ‘DTPI’, which stands for Deep Tissue Pressure Injury, refers to skin areas whose colour is persistently dark red, purple, or maroon and does not turn lighter when pressed. These areas may occur on skin that is unbroken or damaged, and they may appear as blisters filled with blood. The cause of these skin changes is damage to the soft tissues beneath the skin’s surface. DTPI are not usually classified with other pressure injuries; when they are, they are commonly put into ‘Stage 03’, ‘Stage 04’, or ‘Unstageable’ due to a lack of information.6 Due to
this, the overall accuracy of the results presented by Yilmaz et al. [12] was directly
compared to the results from this study. Yilmaz et al. [12] used Medetec to accumulate
their data set. The best model from the study conducted by [12] achieved an ACC
of 0.759, a PPV of 0.732, and a TPR of 0.759. When compared to the multi-phase
architecture, the model by Yilmaz et al. performed significantly better across all
metrics. However, when compared to the single-phase architecture, it performed
significantly worse, with a difference of 0.175 for ACC, 0.143 for PPV, and 0.183
for TPR.
Lau et al. [14] used various public data sets to accumulate their own. They divided their study into the classes ‘Stage 1’, ‘Stage 2’, ‘Stage 3’, ‘Stage 4’, and ‘Unstageable’, so a direct comparison could be made. The study by Lau et al. [14] achieved
an ACC of 0.632, which was found to be significantly lower than the single-stage
architecture by a margin of 0.302. Furthermore, the multi-stage architecture reported
a lower ACC than [14]’s study, with a difference of 0.158. Each paper in the ‘Classi-
fying Wounds’ section supports the notion that a single-stage architecture is superior
to a multi-stage architecture.
Five participants were chosen for the interview process of this study. Each participant was shown a video of how the prototype works and was then told that the final result would be more polished as an actual application. Four of the participants were
qualified nurses, while one was a student nurse. The qualified nurses had experience
ranging from 2 to 39 years.
In regard to experience with PI detection, the participants ranged from having no experience to having attended some lectures. Overall, no participant had extensive training in handling PIs. While nurses are not always the ones treating PIs, they see and document them day to day, so a good understanding is important.
The next question was about the current version of the PI scale used in their
place of work or placement. The student was not aware of any consistent scale used;
however, the nurses mentioned the Waterlow Scale, the Braden Scale, or a shortened
version of one of them. None of the nurses interviewed throughout this study found
these scales helpful as they did not allow for a personalised analysis and were seen
to be biased.
Participants were asked to rate how easy the prototype would be to use. One
participant stated that the software was easy to use (1 out of 10), while another
participant stated that it was relatively difficult (8 on the scale). The participant who found the software difficult to use attributed this to a feeling of technological illiteracy. The other three participants all stated that the software seemed relatively easy to use, as the interface was uncluttered and did not combine multiple functionalities in one view.

6 https://www.calmedmedical.com/description-of-pressure-ulcer-staging.
Four out of the five participants stated that they would be interested in integrating
the software in their day-to-day work. The one participant who was not interested explained that they work in the emergency department, where PIs are not something they deal with.
4.6 Limitations
Initially this study was going to focus on two different skin wounds, PIs and burn
wounds. However, despite the completion of the required forms and persistent follow-ups, the acquisition of a primary burns data set was not possible, and the publicly available data sets were not substantial. Due to these limitations, the burn wound approach had to be abandoned.
Another limitation faced throughout this study was regarding the PI data set.
While a primary data set was gathered, it was not large enough to make our own
data set consisting of only primary images. This was primarily due to the fact that
only one hospital was willing to share its PI images, coupled with the relatively short
duration of the study, which resulted in a limited number of images being collected.
5 Conclusion
This rigorous study has focused on the use of computer vision and artificial intelli-
gence techniques for the assisted automated classification of pressure ulcer stages.
In order to pursue the aim of this research, a locally sourced novel data set has been compiled and used to complement the two most cited and commonly accepted public data sets on the subject matter. Furthermore, this research has considered different architectures to address the limitations of the data sets and the challenges of this research problem. The findings of this study were also corroborated and supported by student and professional nurses.
Depending on the size of the data set and possibly the type of wound, different
approaches might need to be taken; however, throughout this research, it was found
that it is possible to classify the severity of skin wounds using CV.
The defining features in the images of a PI data set were found to be very important and must remain clear and undiminished during any data augmentation or data transformation processes. For this reason, the augmentations considered in this study such as contrast enhancement, segmentation, and image shifting are not recommended, whereas augmentations such as flipping or rotation are appropriate.
When comparing the single-stage architecture to the multi-stage architecture with various data set configurations, it was found that two data sets together are not diverse enough; however, complementing both public data sets with the Novel data set improves diversity. This
observation is corroborated by the fact that in both the single-stage and multi-stage architectures, the addition of the Novel data set to a single public data set resulted in lower performance; however, when all three data sets were combined, the results outperformed the other models. This finding indicates that having diverse image sources is beneficial, as the model performs better while also being less prone to bias.
When comparing the proposed models with various other works that focused on classifying wounds, it was found that the multi-stage architecture performed worse than the current state of the art, whereas the single-stage architecture performed better. Moreover, when compared with papers that focused solely on identifying the location of a wound, for which only the first phase of the multi-stage architecture was considered, the wound detection of the multi-stage architecture yielded superior results.
When evaluating the results of the interviews conducted, the prototype was seen to be easy to use overall (with a mean rating of 4.2 out of 10, where 1 denotes easiest). Furthermore, four out of
the five participants were interested in the software being integrated into their daily
work as it would make PIs easier to understand and could lead to fewer human errors.
This study revealed several key insights. One is that the multi-stage architecture would be better suited to a smaller data set: throughout the experiments with the various data set combinations, the multi-stage architecture consistently performed better when the data set presented was smaller in size. Another takeaway is that, for the single-stage architecture, the number of images always directly correlated with how well a model performed, showing the importance of a large, balanced data set. Furthermore, with both architectures, the addition of the Novel data set to a single public data set resulted in lower-performing models, indicating that two data sets are not diverse enough to work well together, whereas three data sets are: when all three data sets were combined, the models outperformed the rest.
References
1. Chou R, Dana T, Bougatsos C, Blazina I, Starmer AJ, Reitel K, Buckley DI (2013) Pressure ulcer
risk assessment and prevention: a systematic comparative effectiveness review. Ann Internal
Med 159(1):28–38. https://doi.org/10.7326/0003-4819-159-1-201307020-00006
2. Khan W, Hussain A, Khan BM, Crockett KA (2023) Outdoor mobility aid for people with visual
impairment: Obstacle detection and responsive framework for the scene perception during the
outdoor mobility of people with visual impairment. Expert Syst Appl 228:120464. https://doi.
org/10.1016/J.ESWA.2023.120464
3. Patel (2020) Deep learning-based object detection in wound images. Master’s thesis, University
of Wisconsin Milwaukee
4. Veredas FJ, Luque-Baena RM, Martín-Santos FJ, Morilla-Herrera JC, Morente L (2015) Wound image evaluation with machine learning. Neurocomputing. https://doi.org/10.1016/j.neucom.2014.12.091
5. Hunter IA, Edwards KJ (2017) Managing pressure sores. Surg Infect (Larchmt) 35(9):505–510.
https://doi.org/10.1016/j.mpsur.2017.06.008
6. Barnard JA, Copson DL (2016) Increasing the accuracy of pressure ulcer classification using
a pressure ulcer guidance tool. Wounds UK 12(4)
Abstract The palm oil industry in Malaysia is encountering challenges that need
innovative and technology-driven solutions. This study focuses on palm tree detection
where 2D LiDAR sensors are utilized to collect data like distance and reflection
strength. Through analysis, the gathered data are compared with an array of trend
lines to ascertain the optimal data relationship. Among the equations considered,
including linear, logarithmic, polynomial, and power equations, the power equation emerges as the best fit for the detection algorithm. The chosen equation is integrated into the ESP32 firmware. The algorithm's efficacy is evaluated through its accuracy in identifying palm trees. The algorithm exhibits an accuracy rate of 98%, attesting to its proficiency in discerning palm trees within plantations.
1 Introduction
Over the decades, the palm oil sector has contributed to Malaysia's economy. The sector has undergone substantial expansion since the early 1970s, leading to the conversion of approximately 5.8 million hectares of land into palm oil plantations, the employment of over 650,000 individuals, and a yield of 19.5 million tons of palm oil in 2018 [1]. However, recent years have
seen the emergence of significant challenges. One of the challenges was the ban on
Malaysian palm oil imports in Europe due to hazardous glycidyl fatty acid esters
(GE) detected in Malaysian fruit, resulting from pesticide use on plantations [2].
This ban caused substantial financial losses. Furthermore, heavy reliance on manual
labor, amplified by the COVID-19 pandemic, poses challenges. The shortage of
2 Literature Review
georeferencing of LiDAR data are achieved through the GPS device’s accurate sensor
location information.
Usually, the LiDAR laser emits a light pulse, a swift burst of light lasting nanosec-
onds, traveling just centimeters before reflection [5]. This time taken for the pulse’s
round trip, known as “time-of-flight,” is measured by the LiDAR system. By multi-
plying the time-of-flight by the speed of light, the system accurately calculates the
target distance. Illumination of the target with laser light enables measurement of
the reflected light’s wavelength and arrival time, providing data to deduce a digital
representation of the target’s distance.
Additional properties of the reflected light, including intensity, wavelength,
surface characteristics, color, and chemical composition, can be measured by LiDAR
systems. By integrating these measurements with distance data, LiDAR generates
highly precise three-dimensional environmental representations. Equation (1) is used
to calculate the distance [6], where D is distance, c is the speed of light, and T is the
time required by the light to travel. Measurement accuracy in LiDAR is influenced by factors such as laser pulse rate, beam divergence, and receiver sensitivity; adjustments such as increasing the pulse rate and minimizing beam divergence enhance distance-measurement precision, while amplifying receiver sensitivity refines intensity measurements.
$$D = \frac{cT}{2} \qquad (1)$$
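As a brief worked example (with an invented pulse time, not a value from the paper), Eq. (1) can be evaluated directly:

```python
# Illustrative time-of-flight distance calculation (Eq. 1); the round-trip time is a made-up value.
C = 299_792_458  # speed of light in m/s

def tof_distance(t_seconds: float) -> float:
    # Divide by 2 because the pulse travels to the target and back.
    return C * t_seconds / 2

print(tof_distance(33.4e-9))  # roughly 5.0 m for a 33.4 ns round trip
```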
A study applying median filter preprocessing to the classification of oil palm trees with convolutional neural networks on LiDAR images concluded that its impact on classification effectiveness is limited because salt-and-pepper noise levels are low, with results slightly lower than in previous research without median filter preprocessing [7]. LiDAR technology has also been implemented to categorize the ripeness of oil palm fresh fruit bunches (FFB) using sensor data and to create a distribution map in the context of Malaysia's palm oil industry [8].
Initially, data collection from various oil palm fields is imperative, involving field
preparation, data gathering, and data accuracy verification, as these data serve as
the foundation for developing a precise algorithm to estimate oil palm tree expan-
sion and production. Subsequently, data analysis advances to plotting graphs, visu-
ally depicting the relationships between variables. After data plotting, a trendline
is determined through visual validation, assessing the line that best fits the data’s
overall trend. This trendline is further validated using mathematical graph equa-
tions, including linear, logarithmic, polynomial, and power models, to ensure reli-
able and confident decision-making support. Once the optimal algorithm is selected,
the next phase involves firmware development for an ESP32 microcontroller, inte-
grating the chosen algorithm and a communication program for seamless 2D LiDAR
sensor connectivity. The firmware design includes an indicator light to provide real-
time feedback during palm tree detection, enhancing usability. Well-documented and
organized firmware construction ensures easy maintenance and future updates. Thor-
ough testing confirms compliance with original requirements, resulting in a high-
quality, efficient palm tree detection solution. To interface sensors with the ESP32
and manage physical components, an electronic circuit will be meticulously crafted,
emphasizing reliability and data processing capability. Real-world deployment in an
oil palm setting evaluates the algorithm’s accuracy and circuit performance, guiding
potential applications in the agricultural industry and shaping the project’s future
direction.
To comprehensively collect data from the LiDAR sensor, a data sampling process is
required, utilizing a serial port terminal application for data viewing and recording.
This terminal tool must accommodate specific communication settings tailored to
the LiDAR sensor, including baud rate, bit length, and stop bit type. After evaluating
various options, CoolTerm version 2 serial port terminal has been chosen for its
versatility and reliability, being a highly capable freeware supporting a wide range
of serial port configurations. Moreover, CoolTerm's recording features are crucial for the efficient and accurate accumulation of palm tree data, a vital aspect of this project's goals.
To construct the algorithm, it is essential to retrieve at least two parameters from
the LiDAR sensor, facilitating the establishment of data relationships. Unlike most
LiDAR sensors on the market, which typically provide a single output, measuring the
distance from the sensor to the object, the Benewake TFMini Plus—Micro-LiDAR
offers a distinctive capability. This LiDAR sensor furnishes two types of readings: the
object’s distance in centimeters and the object’s surface reflection value, presented
as a generic value without specific measurement units. While its primary application
is distance measurement, it offers two correlated values. The TFMini Plus, func-
tioning on the Time of Flight (ToF) principle, can measure distances as close as
10 cm (±5 cm up to 6 m) and as far as 12 m (±1% starting at 6 m). Its effectiveness
in detecting distances varies depending on lighting conditions and target reflectivity,
yet its compact dimensions, measuring only 35 × 18.5 × 21 mm, enable its inte-
gration into applications typically reserved for smaller sensors. Distinguishing itself
from the original TFMini, the TFMini Plus boasts an IP65 enclosure, ensuring dust
tightness and water resistance, and has passed drone-level vibration tests, broadening
its range of applications. Powered by a simple 5 V source, it communicates data via
the Universal Asynchronous Receiver/Transmitter (UART) serial protocol, enabling
direct reading through a USB-to-Serial Converter. Table 1 shows the configuration
of TFMini Plus LiDAR.
Table 1 Configuration of TFMini Plus LiDAR

Default baud rate: 115,200
Data bits: 8
Stop bit: 1
Parity: None
Table 2 shows the data code for the TFMini Plus—LiDAR. The manufacturer has
structured the sensor’s output data into a 9-byte data frame, including frame head,
distance value (Dist_L and Dist_H), signal strength (Strength_L and Strength_H),
temperature (Temp_L and Temp_H), and checksum, all transmitted in hexadecimal
format. Specifically, bytes 2 and 3 carry the distance value to the object, allowing
for centimeter measurement, while bytes 4 and 5 convey the reflection strength,
facilitating reflection strength determination.
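A minimal sketch of reading this 9-byte frame over UART is shown below; pyserial is used for illustration, the serial port name is a placeholder, and the baud rate follows Table 1. The checksum check assumes the standard TFMini Plus convention (low byte of the sum of the first eight bytes).

```python
# Hypothetical reader for the TFMini Plus 9-byte data frame described in Table 2.
import serial

FRAME_HEAD = 0x59  # each frame starts with two 0x59 bytes

def read_frame(port: serial.Serial):
    while True:
        # Synchronise on the two-byte frame head.
        if port.read(1) == b"\x59" and port.read(1) == b"\x59":
            body = port.read(7)
            frame = bytes([FRAME_HEAD, FRAME_HEAD]) + body
            # Checksum: low byte of the sum of the first 8 bytes must equal byte 9.
            if sum(frame[:8]) & 0xFF != frame[8]:
                continue
            dist_cm = frame[2] | (frame[3] << 8)    # Dist_L, Dist_H
            strength = frame[4] | (frame[5] << 8)   # Strength_L, Strength_H
            return dist_cm, strength

with serial.Serial("/dev/ttyUSB0", 115200, timeout=1) as lidar:  # placeholder port
    print(read_frame(lidar))
```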
An extensive survey of the plantation areas is conducted to ensure that the data
collected accurately represent the full spectrum of palm tree ages and sizes present.
It is noteworthy that palm trees have a productive lifespan ranging from 3 to 25 years,
during which they yield the most. As they age and grow in size, the accuracy of the sensor readings can be affected. Therefore, accounting for the diverse ages and sizes of palm
trees in the plantations is crucial to gathering reliable and relevant data for the study.
To ensure an accurate reflection of the entire range of palm tree ages and sizes, the
research will employ a systematic data collection process, measuring trees of varying
ages and sizes. This meticulous approach enables a comprehensive examination of
palm tree plantation characteristics. Specifically, data sampling will focus on palm
trees aged between 5 and 10 years, a critical period marked by significant changes
in trunk diameter, as diameter alterations impact sensor readings due to variations
in surface area and reflected infrared light. The sensor will be positioned at a height
of 1.2 m above the ground to ensure sampling from the trunk rather than the base,
where the presence of roots results in a greater diameter.
i. The data sample will be grouped by the age of the palm tree.
ii. Each age group will comprise five data sets.
iii. Each data set will cover ten fixed distances.
iv. The data will then be stored in the format shown in Table 3.
v. Samples will be taken for palm trees at the age of 5. The distance from the sensor to the palm tree will be fixed and the strength value will be measured.
vi. This procedure will be repeated for palm trees at multiple ages until 25 data sets are collected.
After successfully collecting the data, the focus shifts to establishing the correla-
tion between distance and reflection strength in order to identify a suitable mathemat-
ical equation that aligns with the data trend. This involves loading the collected data
into Microsoft Excel for further analysis, where all datasets are grouped by distance,
and the average reflection strength is computed for each. This process is repeated for
all distances across all datasets. Following data preprocessing, a chart is generated
to visualize the trend and relationship, and a trend line is drawn in the chart. Four
types of mathematical equations are then tested to assess their compatibility with
the data trend. Microsoft Excel’s chart tools are employed to generate formulas for
each equation type. Multiple equations like linear equations, logarithmic equations,
polynomial equations, and power equations are generated and individually compared
with the actual collected data. When distance values are input into the formula, they
should produce reflection strength values closest to the actual averaged data obtained
from the process. Error percentages are calculated for each mathematical equation,
with the equation demonstrating the lowest error percentage being chosen as the final
formula for use in real-world testing. Table 4 shows the virtual distance generated
by different equations.
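A minimal sketch of this comparison is given below (the paper's analysis was carried out in Microsoft Excel); the averaged readings are invented for illustration, only the power model is fitted here, and the use of scipy/numpy is an assumption rather than the authors' tooling.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative averaged readings: actual distance (cm) vs. mean reflection strength.
distance = np.array([50, 100, 150, 200, 250, 300, 350], dtype=float)
strength = np.array([5200, 2600, 1750, 1320, 1060, 890, 760], dtype=float)

def power_model(x, a, b):
    # Power equation y = a * x^b, with y the virtual distance and x the strength (cf. Eq. 2).
    return a * np.power(x, b)

# Fit the power model so that strength predicts distance.
params, _ = curve_fit(power_model, strength, distance, p0=(1e5, -1.0))
virtual = power_model(strength, *params)

# Efficiency per point = virtual / actual distance; an ideal fit sums to 7 for 7 points.
efficiency = virtual / distance
print("fitted a, b:", params)
print("sum of efficiencies:", efficiency.sum())
```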
To assess the equation’s performance, a comparison is made between the virtual
distance and the actual distance, and the efficiency is measured by dividing the virtual
value by the actual value. These efficiency scores are then totaled for all seven data
points under evaluation, and the equation’s overall efficiency is determined. Ideally,
the sum of efficiencies should closely approach a total of seven, given the seven
data points being tested. The equation demonstrating the sum of efficiencies closest
to seven will be selected for incorporation into the algorithm. Table 5 shows the
efficiency of each equation. Based on Table 5, the power equation achieved better
efficiency.
After the comparison, the power equation given as Eq. (2) was chosen, where y is the virtual distance and x is the actual strength. The power equation is then coded into the ESP32 firmware. For algorithm testing, the electronic circuit is equipped
with two LiDAR sensors, enabling the system to detect palm trees on both the left and
right sides of the pathway. The firmware writing process involves implementing the
Real-Time Operating System (RTOS) capability of the ESP32, which enhances code
efficiency and overall system operation. The firmware is structured into several tasks:
obtaining LiDAR sensor data and performing detection logic, acquiring data from
the GPS module, and recording detection activity into internal memory, including
location coordinates. Before executing the detection algorithm, the firmware must
capture raw data from both LiDAR sensors. This is achieved through dedicated tasks
in the RTOS, with two independent tasks looping every 100 ms to parse distance and
strength values to perform the detection. The detection function validates readings,
calculates estimated distance using a specific equation, establishes a margin value
for flexibility in detection logic, and compares actual distance readings to the esti-
mated distance within a certain range to produce a “true” output indicating palm tree
detection. Furthermore, the firmware retrieves location coordinates when a palm tree
is detected, records UTC time and stores data in its internal memory, aiding in result
measurement.
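The detection step can be sketched language-agnostically as follows (shown in Python for readability, whereas the actual firmware runs as C/C++ RTOS tasks on the ESP32); the power-equation coefficients and the margin value are placeholder assumptions, since Eq. (2) itself is not reproduced in this excerpt.

```python
# Placeholder power-equation coefficients (Eq. 2) and detection margin; illustrative only.
A, B = 1.0e5, -0.9
MARGIN_CM = 15

def estimated_distance(strength: int) -> float:
    # Virtual distance predicted from the reflection strength via the power equation.
    return A * (strength ** B)

def palm_tree_detected(actual_cm: int, strength: int) -> bool:
    if actual_cm <= 0 or strength <= 0:
        return False  # reject invalid readings
    virtual_cm = estimated_distance(strength)
    # A reading counts as a palm tree when actual and virtual distances agree within the margin.
    return abs(actual_cm - virtual_cm) <= MARGIN_CM
```

In the firmware this check runs inside the 100 ms LiDAR tasks, and a "true" result triggers the indicator light and the GPS/UTC logging described above.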
Fig. 2 Installation of an electronic system and LiDAR sensor on the spraying machine
5 Conclusion
The palm tree detection algorithm successfully detects the palm tree and eases the
spraying process. However, there is still room for improvement. To enhance the
algorithm’s accuracy, an effective strategy is to expand the volume of sample data
used during its development. Achieving this involves capturing more data across
diverse environmental settings and locations. Furthermore, the algorithm’s precision
Acknowledgements The authors thank UNITAR International University for the publication of
this research.
References
1. Malaysian Palm Oil Council (2019) Monthly palm oil trade statistics. Retrieved from https://
www.mpoc.org.my/Monthly-Palm-Oil-Trade-Statistics-Jan-Dec-2018
2. New Straits Times (2019) EU decision on palm oil not surprising, says Malaysia.
Retrieved from https://www.nst.com.my/news/nation/2019/03/471097/eu-decisionpalm-oil-
not-surprising-says-malaysia
3. Ho J, Phang SK, Mun H (2021) 2-D UAV navigation solution with LIDAR sensor under GPS-
denied environment. J Phys: Conf Ser 2120:012026. https://doi.org/10.1088/1742-6596/2120/
1/012026
4. Zhang H, Zhang K, Chen C (2020) 2D lidar-based indoor mapping with SLAM. Sensors
20(15):4259. https://doi.org/10.3390/s20154259
5. Setyawan A, Taftazani I, Bahri S, Noviana E, Faridatunnisa M (2022) Drone LiDAR application
for 3D city model. J Appl Geospat Inform 6:572–576. https://doi.org/10.30871/jagi.v6i1.3462
6. Kim G, Ashraf I, Eom J, Park Y (2021) Concurrent firing light detection and ranging system for
autonomous vehicles. Rem Sens 13(9):1767. https://doi.org/10.3390/rs13091767
7. Aliandra SR, Prasvita DS (2022) Application of median filter method for classification of oil
palm tree on LiDAR images. In: 2022 international conference on informatics, multimedia,
cyber and information system (ICIMCIS), Jakarta, Indonesia, pp 441–444. https://doi.org/10.
1109/ICIMCIS56303.2022.10017880
8. Husin HS, Amar N, Bakar Sajak AA, Sallehin Mohd Kassim M (2021) Distribution map of
oil palm fresh fruit bunch using LiDAR. In: 2021 12th international conference on information
and communication systems (ICICS), Valencia, Spain, 2021, pp 4–9. https://doi.org/10.1109/
ICICS52457.2021.9464575
Enhancing Security Surveillance
Through Business Intelligence
with NVIDIA DeepStream
1 Introduction
Security incidents at commercial and public premises such as theft, break-ins, and
vandalism are unfortunate but common occurrences in today’s society. Traditional
security systems, while effective to some extent, often fail to provide user-defined
insightful analytics and real-time security alerts, leaving commercial and public
2 Related Work
A heuristic is used to determine pipeline cut points, and the methodology is evalu-
ated with six object detection networks on an NVIDIA Jetson AGX Xavier board.
Results show significant throughput improvements of 81% to 91% over the base-
line GPU-only inference. The proposed approach shows promise for accelerating
object detection, and future work can explore scalability to even larger networks and datasets.
The survey [4] reviewed transfer learning techniques for visual categorization
tasks and highlighted the importance of knowledge transfer. Methods for transferring
knowledge were discussed at both the feature representation and classifier levels. The
study emphasized the need to avoid negative knowledge transfer and deal with the
many-to-one adaptation problem. The challenge for transfer learning in the future is
to mine helpful information from noisy source domain data and extend methods to
deal with large-scale data.
The study [5] focuses on improving the performance of ResNet and DenseNet
models by adjusting the training pipeline and applying a weighted sum of individual
losses for regression models. Additionally, it suggests further training for VGG19
models with early stopping until convergence before moving to the next training
stage. These adjustments aim to overcome abnormal validation losses and improve
overall model performance.
A novel framework [6] integrating human hand detection and pose estimation is
proposed based on deeply learned networks. The shared convolutional layers effi-
ciently detect human hands and compute their pose configuration. An energy function
is proposed to optimize the predicted result. Experimental results show high relia-
bility for real-world applications. Future work includes improving estimator network
accuracy and enhancing input data quality while maintaining efficient computation.
A discriminative single-shot segmentation tracker, D3S [7], narrows the gap between visual object tracking and video object segmentation. A single-shot network
applies two target models with complementary geometric properties, one invariant
to a broad range of transformations, including non-rigid deformations, the other
assuming a rigid object to achieve both strong resilience and real-time target segmen-
tation simultaneously. Without per-dataset fine-tuning and trained only for segmen-
tation as the primary output, D3S outperforms all trackers on VOT2016, VOT2018,
and GOT-10k benchmarks and performs close to the state-of-the-art trackers on the
TrackingNet.
Another class of tracker, called Siamese trackers [8], has evolved in the direction
of generative templates. Siamese trackers apply a backbone pre-trained offline with
general targets such that object-background discrimination is maximized by corre-
lation between the search region and target template extracted in the first frame [8].
Segmentation of moving objects is a central problem in the emerging field of video
object segmentation (VOS) [9, 10].
Recent studies [11–15] have demonstrated impressive outcomes, yet they rely
on extensive deep networks, often necessitating fine-tuning and resulting in slow
processing. Simultaneously, Hu et al. [16] and Chen et al. [17] introduced a method
for segmentation through feature matching in the initial frame, significantly reducing
processing time. However, these approaches are less effective in addressing the visual object tracking task, especially when dealing with small, rapidly moving objects that exhibit limited appearance changes in short videos. Consequently, the methods proposed by Hu et al. [16] and Chen et al. [17] are less proficient in visual object tracking involving such objects. This paper [8] aims to bridge the gap between
visual object tracking and video object segmentation.
3 Proposed Architecture
and thresholds. The plugin has been modified to send data to the analytics server by
publishing it to an Apache Kafka topic for further processing.
To display the results, the pipeline incorporates the NvMultiStreamTiler plugin to
lay out the camera streams in a grid arrangement, NvVideoConvert plugins convert
the video from the proprietary NV12 format to a suitable standard format for display,
and the NvDsOsd plugin for rendering the video on a display.
The video output is also available for streaming over the Real-Time Streaming Protocol, so it can be accessed remotely. This functionality is extended from the NvDsOsd plugin by converting the video feed to NV12 format and, after applying a suitable caps filter, encoding it with the H264 or H265 NvV4L2 encoder. The encoded video is then converted to an RTP payload and sent to a UDP sink for streaming.
4 Proposed System
URSA works in six stages to provide user-defined real-time insightful analytics and
security alerts. They are as follows:
Stage 1: Camera Setup and Configuration.
The first stage involves the user adding the surveillance cameras using their RTSP
URLs to URSA. Other information, such as the name and location of cameras, can
also be specified. Additionally, the analytics settings for each camera are configured.
Some analysis types allow users to define boundary lines and areas of interest for
analysis. The user can also choose the type of analysis they expect from the camera
feed, e.g., overcrowding detection.
Stage 2: Building and Initializing DeepStream Pipeline.
The DeepStream pipeline, as depicted in Fig. 2 in Chap. 3, is initialized with config-
uration files for its components, such as primary and secondary inferences, tracker,
and analytics. Key configurations include models such as PeopleNet, Spatial–Temporal Graph Convolutional Network (ST-GCN), and FaceNet for person detection, body posture detection, and criminal face detection, respectively. PeopleNet, a combination of ResNet34 (for feature extraction) and DetectNet V2 (for inference), is the primary inference model. The tracker parameters include the minimum object confidence and the tracking algorithm (NVIDIA® NvDCF, which uses a Discriminative Correlation Filter). User-defined analytics configuration is
loaded into the NvDsAnalytics plugin to extract relevant inference data. The pipeline
includes plugins to display the output video on a connected device and stream it using
the Real-Time Streaming Protocol (RTSP).
Stage 3: Loading Camera Sources.
Once the pipeline is configured and initialized, multiple camera feeds are connected
to its source. The DeepStream pipeline is configured with the details of enabled
cameras, including their names, locations, RTSP URLs, and other metadata that are
loaded from the database. Each camera gets an ID sequentially within the pipeline,
and the results are published to the sink against the camera IDs.
Stage 4: Inferring from the Live Camera Feeds.
Inference is performed on the input video signal by leveraging the deep learning
models discussed in Stage 2. PeopleNet, the primary inference model, detects people
in the video feed and generates a corresponding bounding box. The area within this
bounding box is then subjected to secondary inference using ST-GCN for body pose
detection and FaceNet models for facial recognition. Inference is made solely on the
data inside the bounding box, and in the absence of a matching class, the models
produce no outputs. The inference stage will output the bounding box data for all the
processed frames. This data needs to be refined and normalized to obtain analytics.
This task will be achieved in the forthcoming stage.
Stage 5: Analyzing the Raw Inference Data.
The vast raw inference data is filtered to highlight unique features for each detected
class. Consecutive duplicate data is irrelevant, and thus, it gets discarded at this stage.
The remaining data is analyzed to determine person counts, line crossings, and people
in specific areas. The analysis compares the overlaps and direction of movement of
the primary object’s bounding box with the user-defined lines and areas.
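As a rough illustration of these rules (not the authors' implementation), a person count inside a user-defined area and a simple line-crossing test on consecutive bounding-box centres could be written as follows; the region coordinates and alert threshold are placeholders.

```python
# Hypothetical analytics rules on bounding-box centres; coordinates and threshold are placeholders.
AREA = (400, 300, 700, 500)          # user-defined region: x1, y1, x2, y2
LINE = ((100, 400), (600, 400))      # user-defined counting line: two endpoints
OVERCROWD_THRESHOLD = 4

def centre(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def in_area(box):
    cx, cy = centre(box)
    x1, y1, x2, y2 = AREA
    return x1 <= cx <= x2 and y1 <= cy <= y2

def side_of_line(point):
    # Sign of the cross product tells which side of LINE the point lies on.
    (ax, ay), (bx, by) = LINE
    px, py = point
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def crossed_line(prev_box, curr_box):
    # A crossing occurs when the centre changes side between consecutive frames.
    return side_of_line(centre(prev_box)) * side_of_line(centre(curr_box)) < 0

def overcrowded(boxes):
    return sum(in_area(b) for b in boxes) > OVERCROWD_THRESHOLD
```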
Finally, the analysis results are published on multiple Kafka topics for further analysis on the analytics server. Kafka is a messaging service between systems, wherein the messages published on a particular "topic" can be accessed by the systems subscribed to it. With more computing power, the analytics server performs complex calculations and aggregations to generate interpretable insights and alerts, which are stored in a Cassandra database. This stage includes computing aggregations over a period of time, or even making future predictions.
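A minimal sketch of the analytics-server side of this flow is shown below, assuming the kafka-python and cassandra-driver packages; the topic, keyspace, table, and field names are placeholder assumptions.

```python
# Hypothetical consumer: read analytics events from a Kafka topic and persist them to Cassandra.
import json
from kafka import KafkaConsumer
from cassandra.cluster import Cluster

consumer = KafkaConsumer(
    "ursa-analytics",                      # placeholder topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

session = Cluster(["127.0.0.1"]).connect("ursa")   # placeholder keyspace
insert = session.prepare(
    "INSERT INTO events (camera_id, ts, event_type, count) VALUES (?, ?, ?, ?)"
)

for message in consumer:
    event = message.value
    session.execute(insert, (event["camera_id"], event["ts"],
                             event["event_type"], event["count"]))
```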
Stage 6: Creating Insights, Alerts and Reports.
After processing the raw data, insights and alerts can be generated in real time or near
real time from the results stored in the Cassandra database. The analytics dashboard
provides control over functions like preparing reports from aggregated data over
time and making future predictions as required by the end users. The results can be
displayed by means of meaningful and appropriate graphs and charts.
URSA incorporates three primary deep learning models, namely PeopleNet, ST-
GCN, and FaceNet. Each of these models was trained separately, on a separate dataset, using a technique known as transfer learning. This task of adapting the models to our use cases was done with the NVIDIA Tao Toolkit. We chose this method as it allows us to build highly performant and accurate models using only a fraction of the time and data needed to train a model from scratch. Transfer learning has proved to be an effective training method for many AI applications. The training was carried
out on a NVIDIA® GeForce® GTX 1650 Mobile GPU with 4 GB GDDR5 video
memory. The GPU has a CUDA compute capability of 7.5.
PeopleNet model, which is also the primary inference model that detects people
in a scene, has been trained on a dataset consisting of images collected by us
from different crowded locations such as railway stations, supermarkets, and educa-
tional institutions, as well as from Konstantin Verner's "Human Detection Dataset" on Kaggle.
We collected about 300 images ourselves and combined them with about 550 images from Konstantin's dataset. The images were then labeled individually using
LabelImg as shown in Fig. 3.
The original trainable pre-trained model has a general accuracy of 82.17%. We
trained this model for 25 epochs, and it resulted in an accuracy of 96.59% using FP16
calibration on our dataset.
Figure 4 shows the training progress of PeopleNet over the course of 25 epochs. The trained model was pruned to achieve optimum performance and
accuracy. The final model has an accuracy of about 95%.
Similar to how the dataset was created for PeopleNet, the dataset for FaceNet was
also created by annotating the faces of people using LabelImg and using their names
as the image label as shown in Fig. 5. For FaceNet, we created our dataset by using
about 40 images of the same person from different angles.
The original pre-trained model has a general accuracy of 83.85%. After training
the model for 15 epochs, the accuracy is 84.68% using INT8 calibration. After
pruning, the model is 83.92% accurate. Figure 6 shows the training progress of
FaceNet over the course of 15 epochs.
URSA monitors a crowd outside a busy railway station during peak morning hours
in real time. Here, the crowd movement to and from the platform and the staircase
leading up to the bus terminal is being tracked. The green lines in the scene measure
the number of people moving through these areas in both directions. To prevent
overcrowding, the orange box at the base of the staircase sends alerts when the number
of people in the area exceeds the set threshold (4 in this case). The yellow box tracks
the number of people in the region and their average dwell time, providing valuable
insights for crowd management. The system also monitors the total number of people
appearing in the camera frame. All the lines, boxes, and thresholds discussed above
are defined by the user as per the requirements as shown in Figs. 8, 9, 10, and 11.
Fig. 8 Real-time surveillance of the busy crowd outside the railway station
6 Conclusion
References
1. Gongguo Z, Junhao W (2021) An improved small target detection method based on Yolo V3.
IEEE
2. Badave H, Kuber M (2021) Evaluation of person recognition accuracy based on openpose
parameters. IEEE
3. Jin E, Wei Jangryul Kim J, Tan S, Lee J, Ha S (2021) Deep learning inference parallelization
on heterogeneous processors with TensorRT. IEEE
4. Shao L, Zhu F (2014) Transfer learning for visual categorization: a survey. IEEE
5. Hong Nga C, Nguyen K-T, Tran NC, Wang JC (2021) Transfer learning for gender and age
prediction. IEEE
6. Chen T-Y, Wu M-Y, Hsieh Y-H, Fu L-C (2017) Deep learning for integrated hand detection
and pose estimation. IEEE
7. Lukežič A, Matas J, Kristan M (2020) D3S—a discriminative single shot segmentation tracker.
IEEE
8. Bertinetto L, Valmadre J, Golodetz S, Miksik O, Torr PHS (2016) Staple: complementary
learners for real-time tracking. Comp Vis Patt Recogn
9. Perazzi F, Pont-Tuset J, McWilliams B, Van Gool L, Gross M, Sorkine-Hornung A (2016) A
benchmark dataset and evaluation methodology for video object segmentation. Comp Vis Patt
Recogn
10. Xu N, Yang L, Fan Y, Yue D, Liang Y, Yang J, Huang T (2018) YouTubeVOS: a large-scale
video object segmentation benchmark. arXiv:1809.03327
11. Voigtlaender P, Chai Y, Schroff F, Adam H, Leibe B, Chen L-C (2019) FEELVOS: Fast end-
to-end embedding learning for video object segmentation. Comp Vis Patt Recogn
12. Caelles S, Maninis K-K, Pont-Tuset J, Leal-Taixe L, Cremers D, Van Gool L (2017) Oneshot
video object segmentation. Comp Vis Patt Recogn
13. Voigtlaender P, Leibe B (2017) Online adaptation of convolutional neural networks for video
object segmentation. Proc Br Mach Vision Conferen
14. Cheng J, Tsai Y-H, Hung W-H, Wang S, Yang M-H (2018) Fast and accurate online video
object segmentation via tracking parts. Comp Vis Patt Recogn
15. Yang L, Wang Y, Xiong X, Yang J, Katsaggelos AKK (2018) Efficient video object
segmentation via network modulation. Comp Vis Patt Recogn
16. Hu Y-T, Huang J-B, Schwing AG (2018) VideoMatch: Matching based Video Object
Segmentation. Proc Euro Conf Comput Vision
17. Chen Y, Pont-Tuset J, Montes A, Van Gool L (2018) Blazingly fast video object segmentation
with pixelwise metric learning. Comp Vis Patt Recogn
Fuzzified Hybrid Metaheuristics
for QoS-Aware Service Composition
1 Introduction
The problem of service composition can be divided into three stages. The first stage is task decomposition, in which the submitted tasks are broken down into smaller subtasks, denoted by $T = \{T_1, T_2, T_3, \ldots, T_n\}$, according to an abstract workflow. The second stage, service discovery, finds services that satisfy user requirements by selecting the optimal set of services. From a technical perspective, service discovery is a matching endeavour to find a service set from the service repository that fulfils user requirements. Metaheuristics are applied at this stage to solve the optimization problem, based on semantic similarities between subtasks and resource descriptions. The third stage validates the optimal selection, ensuring the overall QoS is optimized. The aggregation model used to compute the overall service composition quality and the normalization formulas are given in the following:
$$\mathrm{Max(QoS)} = \mathrm{Max} \sum_{k=1}^{r} w_k \times \mathrm{Norm}(Q_k) \qquad (1)$$

$$\sum_{k=1}^{r} w_k = 1 \qquad (2)$$

$$\mathrm{Norm}(Q) =
\begin{cases}
\dfrac{Q(S) - Q_{\min}}{Q_{\max} - Q_{\min}} & \text{if } Q \text{ is positive and } Q_{\max} - Q_{\min} \neq 0,\\[6pt]
\dfrac{Q_{\max} - Q(S)}{Q_{\max} - Q_{\min}} & \text{if } Q \text{ is negative and } Q_{\max} - Q_{\min} \neq 0,\\[6pt]
1 & \text{otherwise.}
\end{cases} \qquad (3)$$
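A minimal sketch of Eqs. (1)–(3) is given below, using the attribute weights reported later in the experimental setup (w = [0.18, 0.17, 0.11, 0.39, 0.15] for throughput, success ability, reliability, compliance, and best practice); the candidate service values and min/max bounds are illustrative assumptions.

```python
def normalize(q, q_min, q_max, positive=True):
    # Eq. (3): scale a QoS value to [0, 1]; positive attributes are benefit-type,
    # negative attributes are cost-type.
    if q_max == q_min:
        return 1.0
    return (q - q_min) / (q_max - q_min) if positive else (q_max - q) / (q_max - q_min)

# Weights for throughput, success ability, reliability, compliance, best practice (sum to 1, Eq. 2).
weights = [0.18, 0.17, 0.11, 0.39, 0.15]
positive_flags = [True, True, True, False, False]

def composite_qos(values, mins, maxs):
    # Eq. (1): weighted sum of normalized attribute values for one candidate composition.
    return sum(w * normalize(v, lo, hi, pos)
               for w, v, lo, hi, pos in zip(weights, values, mins, maxs, positive_flags))

# Illustrative candidate with per-attribute min/max bounds over the service repository.
print(composite_qos([43.1, 0.95, 0.89, 0.78, 0.80],
                    [1.0, 0.5, 0.4, 0.6, 0.5],
                    [60.0, 1.0, 1.0, 1.0, 1.0]))
```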
2 Related Works
Table 1 (continued)

Reference | Approach | Fitness | Metaheuristics | Hybrid operator
Sadeghiram et al. (2018) [28] | Tree-based | SAW | GA | K-Means
Jatoth et al. (2018) [29] | Vector-based | SAW | GA | Guided mutation
Liu et al. (2018) [30] | Vector-based | Independent | NSGA-II | Differential optimization technique
Xu et al. (2018) [31] | Graph-based | SAW | PSO | Evolutionary operator
Alayed et al. (2019) [32] | Graph-based | SAW | ACO | Swap concept to escape the trap of local optima
Sadouki et al. (2019) [33] | Vector-based | Independent | EHO | Evolutionary operator of GA
Bouzary et al. (2019) [34] | Vector-based | SAW | GWO | Evolutionary operator of GA
Gao et al. (2020) [35] | Vector-based | TFNSAW | AIS | Clonal selection algorithm
Bhaskar et al. (2020) [36] | Vector-based | SAW | GWO | Levy flight
Yang et al. (2020) [37] | Vector-based | Independent | GWO | Backward learning strategy
Li et al. (2020) [38] | Vector-based | SAW | HHO | Ergodic operator
Zhang et al. (2021) [39] | Vector-based | SAW | ABC | Group initialization strategy
Wang, Ma et al. (2022) [40] | Vector-based | SAW | EDA | Local search procedure
The previous works indicate the lack of an adaptive search mechanism in which algorithm operators can readjust based on the fitness of the generated population in the course of the search process. Furthermore, a purely stochastic mechanism leads to premature or slow convergence, a fundamental shortcoming of metaheuristics. Slow convergence affects time complexity, while premature convergence limits the ability of the search to locate the global optimum. The current approach in service composition has overwhelmingly been the application of nature-inspired metaheuristics, and in a vast number of hybrid models, nature-inspired algorithm operators are employed to achieve better results. This paper aims to use fuzzy logic to overcome the problems associated with slow convergence and
A fuzzy set $Z$ [42] is a generalization of a crisp, or Boolean, set defined in a universe of discourse $X$. In this definition, $Z$ is a linguistic label that defines the fuzzy set through the word $Z$. The fuzzy set $Z$ is characterized by a membership function $\mu_Z(x)$, which provides a measure of the degree of similarity of an element $x$ from $X$ to the fuzzy set $Z$; it takes values in the interval $[0, 1]$, as defined in Eq. (4). Therefore, a generic variable $x_c$ can be represented using multiple fuzzy sets $\{Z_c^1, Z_c^2, \ldots, Z_c^m\}$, each one modelled by a membership function, as follows:

$$\{\mu_{Z_c^1}(x_c), \mu_{Z_c^2}(x_c), \ldots, \mu_{Z_c^m}(x_c)\}$$
consists of mathematical functions. Since the available knowledge for the design of
the fuzzy system conceived in our approach includes functional, local behaviours,
the Mamdani inference model has been used for system modelling [44].
The key idea is to achieve an adaptive search mechanism rather than conventional
metaheuristics where search operators are constant. The adaptive search mecha-
nism has shown to be an effective strategy that leads to fast convergence methods
immune to the trap of local optima. An existing study has benefited from probability theory, chaos theory, and statistical analysis in the process of population generation in order to produce high-quality solutions, particularly with respect to population-based metaheuristics, which are the current research paradigm for service composition problems.
The flowchart illustrated in Fig. 2 gives a detailed description of the proposed method. In this approach, the evolutionary algorithm operators, crossover and mutation, are tuned by predefined fuzzy linguistic rules with three levels (low, medium, and high) for iteration, fitness variance, best fitness, and mean fitness. In this setting, nine rules have been defined to generate high-quality fitness values, as detailed in Figs. 1 and 2.
The proposed solution is based on a guided search mechanism with a fuzzy system acting as an algorithm operator. The idea behind the proposed algorithm is an adaptive search process rather than traditional stochastic exploration of the search space with fixed operators. In the proposed approach, the population diversity of every iteration is evaluated to generate adaptive algorithm operators according to the fuzzy rule matrix in a fuzzy inference interface. The details are given in Algorithm 2.
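As a rough illustration of this adaptive mechanism (not the authors' Algorithm 2, and using a simplified weighted-rule defuzzification rather than a full Mamdani system), crossover and mutation rates could be adjusted from population statistics as follows; all membership breakpoints and rule outputs are illustrative assumptions.

```python
# Hypothetical fuzzy-adaptive GA operators: population diversity (fitness variance)
# drives crossover/mutation rates through simple triangular memberships and rules.
import statistics

def tri(x, a, b, c):
    # Triangular membership function with peak at b.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_rates(fitnesses):
    var = statistics.pvariance(fitnesses)
    low = tri(var, -0.1, 0.0, 0.05)
    med = tri(var, 0.0, 0.05, 0.1)
    high = tri(var, 0.05, 0.1, 0.2)
    # Rules: low diversity -> raise mutation to explore; high diversity -> raise crossover to exploit.
    total = max(low + med + high, 1e-9)
    mutation = (low * 0.30 + med * 0.10 + high * 0.02) / total
    crossover = (low * 0.60 + med * 0.80 + high * 0.95) / total
    return crossover, mutation

print(fuzzy_rates([0.61, 0.63, 0.62, 0.64, 0.62]))
```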
This section analyses the behaviour of the proposed algorithm under different settings. All experiments were executed on an Intel Xeon(R) CPU E3-1225 v5 @ 3.30 GHz with 64 GB of RAM under MATLAB® 9.6. We used the QWS VER2 data set, in which throughput, success ability, and reliability are positive attributes, while compliance and best practice are deemed negative attributes. The weights of the quality attributes are defined in the same order and denoted by w = [0.18, 0.17, 0.11, 0.39, 0.15].
In order to have a holistic empirical evaluation, the genetic algorithm was selected from the evolutionary algorithms and PSO from swarm intelligence. Moreover, a hybrid version of GA and PSO is included to showcase the hybridization of two metaheuristics. In addition, the Eagle Strategy, a metaheuristic with a robust exploration mechanism, is also implemented to highlight algorithm behaviour across various scenarios. The proposed model is a fuzzy evolutionary algorithm, detailed in Table 2.
Owing to the stochastic nature of metaheuristics, the near-optimal solution is not unique. Therefore, a viable empirical study requires experiments on multiple instances. In this study, each algorithm was run ten times with one hundred iterations.
Swarm intelligence methods such as PSO and ES suffer from local optima entrap-
ment. Moreover, the hybridization of two metaheuristics GAPSO did not accelerate
convergence. In Table 3, primary performance metrics, including execution time,
global fitness value, and hit-rate, are summarized for ten runs and a hundred iterations.
The evolutionary algorithm has shown quite satisfying results in exploring global
optima. Nonetheless, it suffers from slow convergence. Figure 3 illustrates the box
plot for the proposed fuzzy genetic algorithm compared to traditional GA for ten
runs. The following chart shows how fuzzification of metaheuristics can improve the
time of execution and also minimize the stochastic behaviour of metaheuristics.
Metaheuristic [45, 46] methods are generally complex processes with several random
operations and stochastic subroutines. Therefore, conducting a complexity analysis
from a deterministic point of view is impractical. For that reason, the hit-rate (%) is
used to evaluate each algorithm's computational effort. Hit-rate reflects the number of iterations an algorithm requires to reach the highest fitness value, that is, the global optimum. This metric is system-independent and provides a measurement independent of implementation tools or hardware specifications. The time complexity of each algorithm iteration depends not only on the hardware specifications but also on the choice of tools, such as Python, MATLAB, or R. In order to have a fair comparison, the hit-rate is computed for each algorithm in a setting with an equally converged search process. Equation 5 shows the computational logic for hit-rate.
$$\mathrm{HitRate}\,(\%) = \frac{It_{\mathrm{con}}}{It_{\max}} \times 100 \qquad (5)$$
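For instance (with invented numbers), an algorithm that first reaches the global-best fitness at iteration 62 of a 100-iteration run has a hit-rate of 62%:

```python
def hit_rate(it_converged: int, it_max: int) -> float:
    # Eq. (5): fraction of the iteration budget used before reaching the global best, in percent.
    return it_converged / it_max * 100

print(hit_rate(62, 100))  # 62.0
```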
5 Conclusion
Acknowledgements The main author wants to thank Professor Witold Pedrycz for providing
precise comments and the Centre for Research and Consultancy, UNITAR International University.
References
19. Bao L, Zhao F, Shen M, Qi Y, Chen P (2016) An orthogonal genetic algorithm for qos-aware service composition. Comput J 59(12):1857–1871
20. Seghir F, Khababa A (2018) A hybrid approach using genetic and fruit fly optimization algorithms for qos-aware cloud service composition. J Intell Manuf 29(8):1773–1792
21. da Silva AS, Mei Y, Ma H, Zhang M Particle swarm optimization with sequence-like indirect
representation for web service composition. In: Evolutionary computation in combinatorial
optimization, Springer, pp 400–405
22. Hossain MS, Moniruzzaman M, Muhammad G, Ghoneim A, Alamri A (2016) Big data-
driven service composition using parallel clustered particle swarm optimization in mobile
environment. IEEE Trans Serv Comput 9(5):806–817
23. Zhou J, Yao X (2017) A hybrid artificial bee colony algorithm for optimal selection of qos-based
cloud manufacturing service composition. Int J Adv Manuf Technol 88(9–12):3371–3387
24. Podili P, Pattanaik K, Rana PS (2017) Bat and hybrid bat meta-heuristic for quality of service-
based web service selection. J Intell Syst 26(1):123–137
25. Savarala BB, Chella PR (2017) An improved fruit fly optimization algorithm for qos aware
cloud service composition. Int J Intell Eng Syst 10(5):105–114
26. Chifu VR, Pop CB, Salomie I, Chifu ES (2017) Hybrid honey bees mating optimization algo-
rithm for identifying the near-optimal solution in web service composition. Comput Inform
36(5):1143–1172
27. Zhou J, Yao X, Lin Y, Chan FT, Li Y (2018) An adaptive multi-population differential artificial
bee colony algorithm for many-objective service composition in cloud manufacturing. Inf Sci
456:50–82
28. Sadeghiram S, Ma H, Chen G (2018) Cluster-guided genetic algorithm for distributed data-
intensive web service composition. In: 2018 IEEE congress on evolutionary computation
(CEC), IEEE, pp 1–7
29. Jatoth C, Gangadharan G, Fiore U, Buyya R (2018) Qos-aware big service composition using
mapreduce based evolutionary algorithm with guided mutation. Futur Gener Comput Syst
86:1008–1018
30. Liu L, Gu S, Fu D, Zhang M, Buyya R (2018) A new multi-objective evolutionary algorithm
for inter-cloud service composition. TIIS 12(1):1–20
31. Xu X, Rong H, Pereira E, Trovati M (2018) Predatory search-based chaos turbo particle
swarm optimization (ps-ctpso): a new particle swarm optimization algorithm for web service
combination problems. Fut Generat Comput Syst Int J Escience 89:375–386
32. Alayed H, Dahan F, Alfakih T, Mathkour H, Arafah M (2019) Enhancement of ant colony
optimization for qos-aware web service selection. IEEE Access 7:97041–97051
33. Sadouki SC, Tari A (2019) Multi-objective and discrete elephants herding optimization
algorithm for qos aware web service composition. RAIRO-Operat Res 53(2):445–459
34. Bouzary H, Chen FF (2019) A hybrid grey wolf optimizer algorithm with evolutionary operators
for optimal qos-aware service composition and optimal selection in cloud manufacturing. Int
J Adv Manufact Technol 101(9–12):2771–2784
35. Gao M, Chen M, Liu A, Ip WH, Yung KL (2020) Optimization of microservice composition
based on artificial immune algorithm considering fuzziness and user preference. IEEE Access
8:26385–26404
36. Bhaskar B, Jatoth C, Gangadharan G, Fiore U (2020) A mapreduce-based modified grey wolf
optimizer for qos-aware big service composition. Concurren Comput Pract Exp 32(8):e5351
37. Yang Y, Yang B, Wang S, Jin T, Li S (2020) An enhanced multi-objective grey wolf optimizer
for service composition in cloud manufacturing. Appl Soft Comput 87:106003
38. Li C, Li J, Chen H (2020) A meta-heuristic-based approach for qos-aware service composition.
IEEE Access 8:69579–69592
39. Zhang S, Shao Y, Zhou L (2021) Optimized artificial bee colony algorithm for web service
composition problem. Int J Mach Learn Comput 11(5):327–332
40. Wang C, Ma H, Chen G, Hartmann S (2022) Memetic eda-based approaches to qos-aware fully
automated semantic web service composition. IEEE Trans Evol Comput 26(3):570–584
41. Cuevas E, Zaldívar D, Pérez-Cisneros M (2018) Metaheuristic algorithms based on fuzzy logic,
Springer, pp 167–218
42. Zadeh LA (1965) Fuzzy sets. Inf Control 8(3):338–353
43. Mamdani EH, Assilian S (1975) An experiment in linguistic synthesis with a fuzzy logic
controller. Int J Man Mach Stud 7(1):1–13
44. Bagis A, Konar M (2016) Comparison of sugeno and mamdani fuzzy models optimized by artificial bee colony algorithm for nonlinear system modelling. Trans Inst Measur Control 38(5):579–592
45. Naghavipour H, Soon TK, Idris MYI, Namvar M, Salleh RB, Gani A (2021) Hybrid meta-
heuristics for qos-aware service composition: a systematic mapping study. IEEE Access
1–25
46. Naghavipour H, Idris MYIB, Soon TK, Salleh RB, Gani A (2022) Hybrid metaheuristics using
rough sets for qos-aware service composition. IEEE Access 10:112609–112628
Machine/Deep Learning
Fraudulent E-Commerce Website
Detection Using Convolutional Neural
Network Based on Image Features
1 Introduction
In the current digital and connected era, the Internet is an indispensable element of daily life, offering many benefits but also posing several challenges. While e-commerce and digital websites are continually growing, the number of fraudulent websites is also increasing. Foreign currency exchange (Forex), gold and other precious-metal investment, Ponzi and pyramid schemes, multi-level marketing, online shopping, and e-commerce websites are among the most popular types of fraudulent websites [1]. These fraudulent websites appear to be genuine, making it hard for consumers to distinguish them from legitimate websites. Consumers lose money as a result of fraudulent e-commerce websites selling fake products, while the digital marketing industry suffers equally [2]. Moreover, fraudulent e-commerce not only misleads people but also damages the image of legitimate online stores [3]. Additionally, the fraudulent activity may lead to other cybercrimes that involve the stealing and misuse of the victim's sensitive personal information, such as bank account information, social security numbers, or credit card numbers. Over the years, millions of dollars have been lost because of fraudulent websites [1, 3–5].
Researchers have proposed a variety of approaches for detecting fraudulent websites, including heuristic methods, machine learning-based methods, and, more recently, deep learning-based methods [1, 3, 6, 7]. However, due to several challenges, existing measures are often insufficient for detecting fraudulent websites. As described in [1, 3], these challenges are as follows. First, the current trend in web programming technologies makes online scraping more complicated and hinders access to fraudulent websites' contents. Second, with the diverse range of fraudulent web activities (e-commerce fraud, MLM, Forex, etc.), it is challenging to propose a comprehensive, all-in-one solution for fraudulent website detection. Next, the rapid proliferation of fraudulent websites renders static measures irrelevant and necessitates a dynamic solution. Finally, the effectiveness of these models is undermined by fraudsters' attempts to conceal, deceive, obstruct, and circumvent the fraudulent website detection models.
Given these challenges, an efficient website detection technique needs to be designed. Utilizing deep learning, this study proposes a technique for detecting fraudulent websites based on website images. Our research focuses on fraudulent e-commerce sites. This paper is divided into five sections. Section 2 highlights several significant studies on fraudulent website detection, covering both the detection of fraudulent websites and deep learning methods. Section 3 outlines the proposed model and experimental design. Section 4 presents and discusses the findings, while Sect. 5 concludes the study.
2 Related Work
The two primary methods used by fraudulent websites to operate are the duplication and cloning of legitimate websites and the development of websites with short lifespans [8]. Meanwhile, [9] describes phishing websites as fraudulent websites that pose as legitimate and deceive people into engaging with them in order to obtain their sensitive data. The emergence of fake websites may be due to a variety of factors, including their genuine appearance, an inadequate level of user awareness, and fraudsters' ability to circumvent many of the measures currently in use to safeguard against them [10]. In these studies, fraudulent websites are also referred to as phishing websites and fake websites. A few efforts at detecting fraudulent websites are proposed in [8, 11–13].
The authors of [10] proposed a new fraudulent website detection system based on statistical learning theory (SLT). A combination of textual features (word phrases, lexical measures, spelling and grammar), URL features (URL text, anchor text), source code features (HTML and JavaScript, coding style), image features (metadata, image contents), and linkage features (site level, page level) was used to identify fraudulent websites. Their evaluation results showed an accuracy of 96% using SLT on a dataset of 900 fraudulent websites. Another study [3] used three types of features, including image features, to detect fraudulent e-commerce websites. The Image Moments technique was used to extract features from the screenshotted images, while four classification techniques, namely Logistic Regression, Random Forest, Decision Tree, and XGBoost, were used to classify fraudulent websites. XGBoost showed the best accuracy among these classification techniques, with 80% detection accuracy.
Deep learning has proven to be a particularly useful strategy due to its capacity to manage enormous volumes of data. Deep learning has had an impact not just on conventional machine learning approaches but also on tasks involving human perception. Models with hidden layers outperform traditional approaches, particularly in pattern identification. Convolutional neural networks (CNN), a variant of the artificial neural network (ANN), are among the most widely used deep neural networks. CNNs may be trained with either supervised or unsupervised machine learning techniques and, despite their origins in general data processing, are mainly utilized in image detection and processing at the pixel level. A neural network is made up of many interconnected nodes named "neurons" that are organized into three layers: the input layer, the hidden layer, and the output layer. CNNs have been a hub of research in a variety of areas, with successful applications in natural language processing, climate catastrophe prediction, and many more.
CNNs have recently been used successfully on a variety of computer vision tasks such as object detection, image classification, image segmentation, and object tracking [14], as well as in many domains such as medical image detection (e.g., MRI and CT scans), remote sensing imagery, document analysis, handwriting analysis, and more. Convolutional neural networks (CNN) use filters to capture an image's local structures and, as a result, perform effectively in computer vision applications [15]. CNN was used by [6, 16] to learn and categorize webpage screenshot images. Their collection comprises a total of 6104 samples, including 2375 malicious websites and 3729 safe websites. Then, 60% of the samples were randomly chosen as a training set, 20% as a validation set, and 20% as a test set, resulting in 3662 samples in the training set, 1221 samples in the validation set, and 1221 samples in the test set.
CNN and other deep learning techniques are also employed in [3, 6, 13–15]. The authors of [17] integrated a CNN and a recurrent neural network (RNN) to identify false news in two datasets (ISO and FA-KES). To train, a recurrent neural network (RNN) employs sequential data processing. It is called recurrent because the result of every time step is utilized as input to the next time step, which is accomplished by remembering the previous time step's output. Another work [18] carried out false news identification using CNN, Bidirectional LSTM, and ResNet on four fake news datasets. The back-translation method is employed in the data augmentation process in that paper to reduce data imbalances between classes. Another CNN-based fake news detection method is proposed in [15], which addresses an American presidential election dataset. Both text and image data are utilized to train the TI-CNN simultaneously, and the proposed convolutional neural network trains substantially faster than LSTM and several other RNN models. The study in [4] recommends a hybrid phishing website detection technique based on CNN and random forest (RF). The technique can be used to assess whether a URL is valid without having access to the website's content or using other services.
Convolutional neural networks (CNN) provide the foundation of the proposed fraudulent website detection model framework, as previously mentioned. The experimental dataset consists of e-commerce website screenshots gathered from [3]. Meanwhile, three optimizers, namely Adam, Root Mean Square Propagation (RMSprop), and Stochastic Gradient Descent (SGD), are used to measure the detection accuracy.
The data utilized in this study were obtained from both legitimate and fraudulent e-commerce websites in [3]. Approximately 500 websites were crawled, of which 258 were legitimate and 272 were fraudulent. Each website's official page was manually screenshotted to obtain the images. The distribution of websites evaluated in the study is shown in Table 1. A sample of images is shown in Fig. 1.
Fig. 1 A sample of (a) a legitimate [19] and (b) a fraudulent [20] e-commerce website screenshot
This study employs a convolutional neural network (CNN) to classify legitimate and fraudulent websites. In these experiments, the same image dataset from [3] was used for CNN classification of fraudulent and legitimate website images. A total of 530 website images were collected, comprising 272 fraudulent website images and 258 legitimate website images. The images were partitioned into 80% (370 images) as the training set, 10% (80 images) as the testing set, and the remaining 10% (80 images) as the validation set. The experiment used the same setup as [6], as follows:
• The input layer is a screenshot of the page with the size adjusted. The size is W ∗ L ∗ 3, where 3 represents the three colour channels. The size of the input images is adjusted to W ∗ L ∗ 3, where W = L = 256.
• Each of the first two convolutions has 32 output filters (i.e., M1 = M2 = 32), and the third one has 64 output filters (i.e., M3 = 64).
• The size of the convolution kernel is 3 ∗ 3, where K1 = K2 = K3 = 3.
• The output of the convolution layer is then mapped nonlinearly.
• A fully connected layer with 64 neurons (i.e., N = 64) is connected after the convolution and pooling layers.
• The prediction output layer accounts for image prediction and determines the corresponding website category.
• Considering that it is a binary classification problem, we adopt the sigmoid function. If the sigmoid value is bigger than 0.5, the corresponding site is classified as a phishing (fraudulent) website; otherwise, it is recognized as a legitimate website.
• The optimizer used in the CNN model is the Adam optimization algorithm, an extension of stochastic gradient descent that has recently seen broader adoption for deep learning applications in computer vision and natural language processing.
• The parameters for image data augmentation are also set: rescale = 1/255, shear_range = 0.2, zoom_range = 0.2, horizontal_flip = True, batch_size is set to 32, and the number of epochs is set to 61 in our experiment (a minimal code sketch of this configuration is given after Eq. (3) below).
A CNN model of four layers, consisting of three spatial convolution layers and one fully connected layer, was created in accordance with the setup in [6]. Random flips, shifts, shears, and other operations are then applied as data augmentation. The input image sizes are transformed to W × L × 3, where W = L = 256. For the convolution process, the first two convolutions have 32 output filters each, and the third one has 64 output filters. The convolution kernel has a size of 3 × 3, where K1 = K2 = K3 = 3. Meanwhile, a nonlinear mapping is applied to the convolution layer's output. The convolutional layers and max-pooling layers are both used for automated image feature extraction. ReLU was chosen as the activation function for the CNN, as specified in Eq. (1), due to its fast convergence and the simplicity of computing its gradient.
sigmoid(x) = 1 / (1 + e−x)    (2)

loss(x, z) = −(x[i] × log(z[i]) + (1 − x[i]) × log(1 − z[i]))    (3)
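The following Keras sketch is a minimal illustration of the configuration above; it is not the authors' or [6]'s code, and the pooling sizes, layer ordering, and data directory layout are assumptions made for the example.

```python
# Minimal sketch of the described CNN: three convolution blocks with 32/32/64
# filters and 3x3 kernels, a 64-neuron fully connected layer, a sigmoid output,
# the Adam optimizer, and the stated augmentation parameters. Pooling sizes and
# the directory layout are assumptions, not settings taken from the paper.
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

W = L = 256  # input image size

model = models.Sequential([
    layers.Input(shape=(W, L, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),   # M1 = 32, K1 = 3
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),   # M2 = 32, K2 = 3
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),   # M3 = 64, K3 = 3
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),            # N = 64
    layers.Dense(1, activation="sigmoid"),          # > 0.5 -> fraudulent/phishing
])
model.compile(optimizer=optimizers.Adam(),
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Data augmentation with the parameters listed above.
train_gen = ImageDataGenerator(rescale=1.0 / 255, shear_range=0.2,
                               zoom_range=0.2, horizontal_flip=True)
# Hypothetical directory layout: one sub-folder per class (fraudulent / legitimate).
train_flow = train_gen.flow_from_directory("data/train", target_size=(W, L),
                                           batch_size=32, class_mode="binary")
# model.fit(train_flow, epochs=61)
```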
The experimental results on the aforementioned dataset measure the detection accuracy based on three optimizers, namely Adam, Root Mean Square Propagation (RMSprop), and Stochastic Gradient Descent (SGD). The number of epochs was varied: 61 as set in [6], plus two additional settings of 100 and 120. The experiment also measured F1-scores and training time. The F1-score, a standard machine learning metric, combines precision and recall to assess model accuracy. It is especially useful for imbalanced datasets, where each class has a different number of samples, providing a more reliable evaluation than plain accuracy. The F1-score is calculated as in Eq. (4).
F1 = (2 × Precision × Recall) / (Precision + Recall)    (4)

where:

Precision = TruePositive / (TruePositive + FalsePositive)    (5)

Recall = TruePositive / (TruePositive + FalseNegative)    (6)
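As a quick illustration of Eqs. (4)-(6), the snippet below computes the three metrics for hypothetical binary labels (1 = fraudulent, 0 = legitimate); the arrays are invented for the example.

```python
# Illustration of Eqs. (4)-(6) on hypothetical binary labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)                              # Eq. (5)
recall = tp / (tp + fn)                                 # Eq. (6)
f1 = 2 * precision * recall / (precision + recall)      # Eq. (4)
print(precision, recall, f1)                            # 0.75 0.75 0.75
```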
Meanwhile, the training time in CNNs denotes the period required for the model to learn from the input data and fine-tune its parameters. This process entails providing the CNN with a labeled dataset, enabling it to make predictions, and subsequently refining its internal parameters (weights and biases) based on the prediction errors. The results of the experiment are shown in Table 2.
SGD is a very basic optimizer that is rarely utilized in applications due to its poor processing speed. Another issue with the approach is that it uses the same learning rate for every epoch, which results in the lowest accuracy across all three epoch settings. Meanwhile, the RMSProp optimizer automatically adjusts the learning rate and selects a different learning rate for each parameter. Unlike SGD training, which maintains a single learning rate, the Adam optimizer updates the learning rate for each network weight individually. Moreover, Adam combines the best characteristics of RMSProp and other algorithms. The Adam optimizer produces better results than the other optimization methods, has a shorter computation time, and requires fewer tuning parameters. As a result, Adam is suggested as the default optimizer for most applications. The training and validation accuracy and loss are illustrated in Figs. 4, 5, and 6.
The experimental accuracy is also compared with the accuracy reported in [3], as shown in Table 3. The best optimizer from the experiments, the Adam optimizer, is compared with the Logistic Regression, Decision Tree, Random Forest, and XGBoost results from [3].
Table 2 The CNN model accuracy, F1-score (using the training and testing sets), and training time

(a) Accuracy (%)
Optimizer   Epoch 61   Epoch 100   Epoch 120
Adam        74.0       71.0        68.0
RMSprop     63.0       66.0        52.0
SGD         66.0       66.0        66.0

(b) F1-score (%)
Optimizer   Epoch 61   Epoch 100   Epoch 120
Adam        74.0       72.0        69.0
RMSprop     62.0       67.0        48.0
SGD         52.0       52.0        52.0

(c) Training time (ms)
Optimizer   Epoch 61        Epoch 100        Epoch 120
Adam        693,708.314     1,168,302.057    1,371,541.660
RMSprop     715,048.149     1,154,158.749    1,325,656.689
SGD         680,314.043     1,138,732.898    1,365,106.503
The accuracy results show that the CNN technique using the Adam optimizer outperformed almost all the techniques in [3] except XGBoost and Decision Tree. The XGBoost technique is the most recent evolution of the Decision Tree, with an optimized gradient boosting algorithm that uses parallel processing, tree pruning, missing value handling, and regularization to counteract bias and overfitting during training. Meanwhile, in the CNN technique, batches of augmented data are generated and passed through the convolution layers and one fully connected layer, enriching the training data and helping to avoid overfitting.
Fig. 4 The training and validation accuracy as well loss based on [6] setup using Adam optimizer
with different epoch
Fig. 5 The training and validation accuracy as well loss based on [6] setup using RMSprop
optimizer with different epoch
Fig. 6 The training and validation accuracy as well loss based on [6] setup using SGD optimizer
with different epoch
5 Conclusion
Websites have evolved into a venue for cybercriminals to prey on victims and spread
fraud and cybercrime. Despite researchers’ efforts, most measures are still unable to
keep up with the rapid increase and variation of fraudulent websites. This paper aims to address this problem by utilizing a deep learning-based fraudulent website detection model that employs website images with CNN techniques to detect fraudulent websites. The experimental findings showed that the fraudulent website detection model based on the CNN algorithm achieved acceptable results compared to previous work, with an accuracy of 66.0–74.0%. Future work will focus on analyzing and implementing other deep learning-based methods to increase detection accuracy.
Acknowledgements The authors would like to acknowledge the Universiti Teknologi Malaysia
for supporting this study under the UTM RA ICONIC Grant (Q.J130000.4351.09G61).
References
Abstract With the increase in the development of technology, the threat of “Ran-
somware” has also increased especially towards organizations. Ransomware is a
malicious software that encrypts all the user’s data or system and demands a ransom
payment for decryption. Despite various machine learning approaches proposed for
ransomware detection, they often fail to identify those threats accurately in time,
thus leading to data loss and victimization. This research introduces a novel frame-
work, primarily based on static analysis of ransomware and predicting the pres-
ence of ransomware on users’ systems by monitoring a defined set of ransomware
activities. In this study, we used the Resilient Information Systems Security (RISS)
ransomware dataset, encompassing 582 ransomware samples from 11 distinct fami-
lies and 982 instances of goodware. We proposed a generic neural network frame-
work for the identification of ransomware and compared the performance of artificial
neural networks (ANN) and deep neural networks (DNN) in terms of accurately clas-
sifying ransomware and goodware. The suggested framework secured an accuracy
of 98.56% with ANNs, and achieved a slightly better performance (99.06%) when
ANN was replaced with DNN. Our results showed that a basic ANN can achieve
performance comparable to that of a DNN for ransomware detection. In future work,
we plan to evaluate the performance of the proposed framework in a real-time setting.
S. Nadeem
Department of Computer Science, University of Management and Technology, Lahore, Pakistan
T. Mehmood (B)
School of Information Technology, UNITAR International University, Selangor, Malaysia
e-mail: [email protected]
M. Yaqoob
Department of Computer Science, University of Hertfordshire, Hatfield, UK
1 Introduction
Ransomware is malicious code or malware that poisons a user's computer and quickly spreads to encrypt data or to lock the device. In this way, the data become inaccessible to the victim, and the attacker requests payment from the victim to decrypt it. Payment is mostly requested in Bitcoin or other cryptocurrencies that cannot be tracked [1]. Companies and individuals around the world are being attacked using ransomware.
The main goal of ransomware is to make money by using malicious software.
One reason for the popularity of ransomware is that the cyber-security area is
not fully ready for its recovery and mitigation. When there are security breaches
it becomes easier for hackers to attack the system. Several security systems have been proposed to detect and mitigate threats, identify changes in data, and track abnormal behaviour [2–4]. Such proposed security frameworks also include the signature-based approach or one-hand rule [5]. This method requires the user to assess the system using a hand rule to determine whether or not it is infected by a virus. Another popular approach is based on machine learning methods, which learn
[6–8].
Nowadays, the computer security industry tends to create rules and signatures
manually and automatically through machine learning and statistical methods. Rules
and signature-based methods can achieve lower false positives, whereas machine
learning models provide researchers with powerful tools to create models for the
inspection of ransomware, leading to greater accuracy and scalability. However,
the performance of these machine learning algorithms is highly dependent on
the features. Redundant and irrelevant features deteriorate the performance of the
machine learning algorithms [9].
In this paper, we propose a generic framework and evaluate it with two models, (i) an artificial neural network (ANN) and (ii) a deep neural network (DNN), to predict and classify both ransomware and goodware. These models are integrated into the suggested framework for the characterization of the essential static ransomware features required for ransomware prediction. To ensure that the feature sets are retained, we utilize shared data standards to identify the most pertinent static characteristics among a multitude of factors. Importantly, these approaches are also well-suited for
identifying new ransomware variants that share similar characteristics. In terms of
accuracy, our proposed framework stands out with an impressive 99.06%, surpassing
other machine learning classifiers. Furthermore, the framework can be easily adapted
to real-time settings. Finally, regardless of the model, the framework works well for
both ANN and DNN, reaching an accuracy of 98.56% and 99.06%, respectively.
1.1 Background
Ransomware is a type of malware that locks devices, holding the user’s data hostage
until a ransom is paid. Ransomware preys on people’s fears, the vulnerabilities of
their devices, their lack of security awareness, and their frustration. The countries
most frequently targeted by ransomware include the United States, Japan, the United
Kingdom, Italy, Germany, and Russia, primarily due to their heavy reliance on the
Internet [10].
There are two primary types of ransomware: Crypto ransomware and Locker
ransomware, as illustrated in Fig. 1. Crypto ransomware encrypts the user’s files
or private information, demanding payment for decryption. Conversely, Locker
ransomware locks the system, preventing user access until the ransom is paid.
Ransomware is highly profitable, as it effectively seizes control of the digital lives
of many system users [11].
Locker ransomware denies the user access to the system’s interface, restricts the
availability of system resources, and limits certain abilities, such as keyboard func-
tionality, while inner system functionalities might remain unlocked. The keyboard
and mouse, however, are locked. Locker ransomware spreads through social
engineering, phishing campaigns, and restricted sites.
Crypto ransomware, instead of locking the user's interface and denying access to the system, targets the user's important files and data stored on the system. The
overall functionality of the system remains unaffected. Only data files are encrypted,
allowing the victim to use the computer, but preventing access to the encrypted files.
The process cycle of the ransomware is shown in Fig. 2.
2 Literature Review
Fig. 2 The ransomware process cycle displays all the steps for the attack, from the initial infection
to the eventual payment of the ransom
abnormal behaviour in the system registry. A specific tool, PEid, is employed for
Windows ransomware detection. Users are advised to take precautionary steps, such
as online backups of files and important data like images and related documents [15].
A model called “Cyber Army Modelling and Simulation” has been presented for
detecting malware circulating over typical network behaviours [16]. The proposed
approach used the National Cyber Range (NCR) to generate data and provide results
for networks of different sizes in various test cases.
Furthermore, clustering algorithms have been utilized to select relevant features
for training classifiers to detect different malware types, including unknown malware
[17]. Malware, once replicated, can create entirely new variants. The study of
malware and executable files is conducted in groups to identify new malware with
high accuracy. Based on classifiers, malware detection can achieve accuracy rates of
up to 99.11% regarding file sizes.
To analyse and classify malware using artificial intelligence techniques and
categorize them into different groups, a behavioural-based methodology has been
proposed in [18]. Two methods, intrusion detection and honeypot techniques, were
employed to gather malware samples. Using a virtual machine platform such as
Cuckoo Sandbox, a model was developed to generate behaviour-based reports for
each sample, which were then manually analysed. AI technology is used to gener-
alize malware samples into worms or Trojan horses. The limitation lies in the manual
analysis, which is not suitable for a large number of concurrently generated malware
samples [18].
Another paper introduces a scalable method for identifying and clustering
malware with similar behaviour. The approach involves dynamic analysis to capture
execution traces of malware, which are then generalized into behavioural profiles.
These profiles are then fed into the proposed clustering algorithm. The proposed
system has been employed to collect, recognize, and cluster malware in real-world
setting [19].
Hwang et al. [20] developed a model based on a two-stage approach involving
Markov models and Random Forest Models to classify ransomware. The authors
focused on creating a sequence pattern of Windows-based API calls associated with
ransomware characteristics. To control false positives and true negatives, another
model based on Random Forest machine learning was built. The two-stage model
achieved an overall accuracy of 97.3%, with a false positive rate of 4.8% and true
negative rate of 1.5%.
Arivudainambi et al. [21] presented a model for the classification and detection
of ransomware in live traffic. The author elucidates how ransomware infiltrates and
compromises devices, followed by an analysis of live ransomware classification.
This model employs deep learning algorithms and classification patterns based on
network traffic characteristics. For instance, it examines the behaviour of well-known
ransomware to develop a real-time live traffic-based model. Various methodologies
and algorithms for ransomware detection have been introduced, yet they often exhibit
shortcomings.
In another study [23], an experimental platform was constructed to compare
the performance of various machine learning-based algorithms, including Random
Forest, Gradient Boosting Decision Trees, Neural Networks using Multilayer Percep-
tron, and three types of Support Vector Machine kernels for ransomware detection,
using a ransomware dataset. The experiments involved the analysis of executable files
and the measurement of opcode frequencies. The primary objective was to identify
algorithms best suited for developing ransomware detection models and systems.
The results indicated that Random Forest, Gradient Boosting Decision Trees, and
Support Vector Machine outperformed neural networks.
3 System Design
The Resilient Information Systems Security (RISS) ransomware dataset [24], which includes 582 ransomware and 942 goodware instances, was collected and analysed using the Cuckoo Sandbox. It is noteworthy that many ransomware samples belonging to different families share common features or behaviour patterns; however, numerous detection techniques struggle to identify them. The absence of automatic detection in many techniques contributes to delays in ransomware detection. This issue can be addressed through the application of artificial intelligence and machine learning-based approaches. Importantly, the proposed approach does not require a rigid framework for implementation. Ransomware samples are randomly selected from the database, and artificial neural networks are employed to predict and classify ransomware into their respective families.
3.1 Dataset
The dataset used in this research paper comprises 582 samples of ransomware span-
ning 11 different families and 942 samples of goodware as shown in Table 1.
It includes both older and newer ransomware samples. The gathered ransomware
samples represent the prevailing versions and variations currently encountered with
a significant portion being crypto ransomware types.
This framework is designed for the RISS ransomware dataset (Fig. 3), where data
scaling was unnecessary as the dataset had already been appropriately scaled. In the
feature extraction phase, Principal Component Analysis (PCA) is utilized to extract
the most significant features. Simultaneously, to ensure generalization, K-fold cross-
validation is used, where the value of K is set to 10. The use of tenfold cross-validation
ensures that each data point is used for both training and testing, reducing the risk of
overfitting and enhancing the model's generalization capability. Following this pre-processing, a classification model (ANN or DNN) is applied to differentiate between instances of ransomware and goodware. Additionally, the family of the ransomware is also identified in the next step.

Table 1 Ransomware families and the number of samples for each family contained in the dataset

ID    Family           Samples
1     Citroni          50
2     CryptLocker      107
3     CryptoWall       46
4     Kollah           25
5     Kovter           64
6     Locker           97
7     Matsnu           59
8     Pgpcoder         4
9     Reveton          90
10    TeslaCrypt       6
11    Trojan-Ransom    34
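A minimal scikit-learn sketch of this pipeline (PCA for feature extraction, tenfold cross-validation, and a neural-network classifier) is given below; the number of retained components, the hidden-layer size, and the placeholder data are illustrative assumptions rather than the authors' settings.

```python
# Sketch of the described pipeline: PCA feature extraction, 10-fold
# cross-validation, and an ANN-style classifier. The feature matrix here is
# random placeholder data standing in for the RISS features (582 ransomware
# + 942 goodware = 1524 rows); component count and layer size are assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1524, 300))      # placeholder feature matrix
y = rng.integers(0, 2, size=1524)     # 1 = ransomware, 0 = goodware (placeholder)

pipeline = make_pipeline(
    PCA(n_components=50),                          # feature extraction
    MLPClassifier(hidden_layer_sizes=(1000,),      # single ReLU hidden layer
                  activation="relu", max_iter=300),
)
scores = cross_val_score(
    pipeline, X, y,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))
print("mean 10-fold accuracy:", scores.mean())
```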
Artificial Neural Network Model Artificial neural networks are inspired by the
biological nervous system. The biologically inspired neurons communicate with
each other via electrical pulses known as spikes [25]. An artificial neural network
contains input, hidden, and output layers, and can be trained on complex data [26].
When analysing big and noisy data, neural networks are very efficient in extracting
hidden patterns and non-linear complicated relationships in the target variable.
In this work, we use an artificial neural network (ANN)-based classifier. The ANN model comprises an input layer, an output layer, and a hidden layer with 1000 neurons. The input layer accommodates the dataset's features, the hidden layer employs the Rectified Linear Unit (ReLU) activation function for pattern learning, and the output layer has a number of nodes equal to the number of classes. In the case of ransomware detection, the output is binary (ransomware or goodware).
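For concreteness, a minimal Keras sketch of such an ANN is given below; the input dimensionality and the separate 11-way softmax head for family classification are assumptions made for illustration, not the authors' exact configuration.

```python
# Minimal sketch of the described ANN: one hidden layer of 1000 ReLU neurons
# and an output layer with one node per class. The binary detector uses a
# sigmoid output; the 11-way softmax variant (one node per family in Table 1)
# is an assumed extension for family classification.
from tensorflow import keras
from tensorflow.keras import layers

n_features = 50   # e.g. number of retained principal components (assumption)

binary_ann = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(1000, activation="relu"),
    layers.Dense(1, activation="sigmoid"),       # ransomware vs goodware
])
binary_ann.compile(optimizer="adam", loss="binary_crossentropy",
                   metrics=["accuracy"])

family_ann = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(1000, activation="relu"),
    layers.Dense(11, activation="softmax"),      # 11 ransomware families
])
family_ann.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```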
Fig. 3 The proposed framework for statically analysing the RISS dataset employs K-fold batch splitting, PCA for feature extraction, and a classifier (ANN or DNN) to detect and classify ransomware
Fig. 4 Model implementation of ANN shows the method used for feature extraction, for data
splitting and the basic working of the model
Fig. 5 Model implementation of DNN shows the method used for feature extraction, for data
splitting, and the basic working of the model
Our preliminary results show that both the ANN and the DNN consistently demonstrate high training accuracy throughout the training batches, reaching average values of around 99.48–99.64% (Table 2). This indicates that both models are capable of learning and fitting the training data well, capturing the underlying patterns in the dataset. When we consider the testing accuracy, both models perform remarkably well, with the ANN achieving an average testing accuracy of 98.56%, while the DNN performs slightly better with an average testing accuracy of 99.06%. Both models can correctly classify ransomware samples in the testing set, which is crucial for the practical application of ransomware detection systems.
We also observe that both models show stability in their performance as the batch number progresses. There is no significant fluctuation in accuracy from one batch to another, indicating that both models are able to generalize well to different subsets of the data. Moreover, the models are not sensitive to variations in the training data and are likely to perform reliably on new, unseen data. Additionally, the small gaps between training and testing accuracies suggest robust generalization and minimal overfitting for both models. This implies that the models have learned meaningful patterns in the data rather than memorizing it.
The purpose of this research is to analyse different types of ransomware attacks and to
provide an efficient conceptual framework that can be applied in real-time detection
and classification of ransomware. The performance of the suggested framework is
evaluated with both artificial neural networks (ANNs) and deep neural networks
(DNNs). Our findings indicate that the proposed framework is network-agnostic,
with DNNs achieving a slightly higher accuracy of 99.06% compared to ANNs
at 98.56%. Moreover, our demonstration reveals that the performance of a basic
ANN is comparable to that of a DNN in ransomware classification. However, since both the ANN and the DNN are trained on the RISS dataset, the proposed framework may fail to predict ransomware with new features. Therefore, future studies need to investigate the significance of deep models and focus on developing robust real-time ransomware detection systems capable of effectively handling emerging ransomware variants with new features.
Acknowledgements The authors thank the UNITAR International University supporting the
publication of this paper.
References
4. Lin PP (2006) System security threats and controls. The CPA J 76(7):58
5. Mehmood T, Helmi B (2016) Svm for network anomaly detection using aco feature subset. In:
2015 International symposium on mathematical sciences and computing research (iSMSC),
IEEE, pp 121–126
6. Bae SI, Lee GB, Im EG (2020) Ransomware detection using machine learning algorithms.
Concurr Comput Pract Exp 32(18):e5422
7. Fernando DW, Komninos N, Chen T (2020) A study on the evolution of ransomware detection
using machine learning and deep learning techniques. IoT 1(2):551–604
8. Noorbehbahani F, Rasouli F, Saberi M (2019) Analysis of machine learning techniques for ransomware detection. In: 2019 16th International ISC (Iranian Society of Cryptology) conference on information security and cryptology (ISCISC), IEEE, pp 128–133
9. Mehmod T, Rais HBM (2016) Ant colony optimization and feature selection for intrusion
detection. In: Advances in machine learning and signal processing, Springer International
Publishing, Cham, pp 305–312
10. Sajjan RS, Ghorpade VR (2017) Ransomware attacks: Radical menace for cloud computing. In:
2017 International conference on wireless communications, signal processing and networking
(WiSPNET), IEEE, pp 1640–1646
11. Liska A, Gallo T (2016) Ransomware: defending against digital extortion. O’Reilly Media,
Inc.
12. Nari S, Ghorbani AA (2015) Automated malware classification based on network behavior. In:
2013 International conference on computing, networking and communications (ICNC), IEEE,
pp 642–647
13. Rao V, Hande K (2017) A comparative study of static, dynamic and hybrid analysis techniques
for android malware detection. Int J Eng Develop Res 5(2):1433–1436
14. Grosse K, Papernot N, Manoharan P, Backes M, McDaniel P (2020) Adversarial examples
for malware detection. In: Computer security–ESORICS 2017: 22nd European symposium on
research in computer security, Oslo, Norway, September 11–15, Proceedings, Part II 22, pp
62–79
15. Zavarsky P, Lindskog D et al (2016) Experimental analysis of ransomware on windows and
android platforms: evolution and characterization. Procedia Comput Sci 465–472
16. Brown S, Henz B, Brown H, Edwards M, Russell M, Mercurio J (2015) Validation of network
simulation model with emulation using example malware. In: MILCOM 2015–2015 IEEE
military communications conference, pp 1264–1269
17. Sahay SK, Sharma A (2016) Grouping the executables to detect malwares with high accuracy.
Procedia Comput Sci 78:667–674
18. Rieck K, Holz T, Willems C, Düssel P, Laskov P (2008) Learning and classification of malware behavior. In: International conference on detection of intrusions and malware, and vulnerability assessment, Springer, pp 108–125
19. Bayer U, Comparetti PM, Hlauschek C, Kruegel C, Kirda E (2009) Scalable, behavior-based
malware clustering. In: NDSS, vol 9, pp 8–11
20. Hwang J, Kim J, Lee S, Kim K (2020) Two-stage ransomware detection using dynamic analysis
and machine learning techniques. Wireless Pers Commun 112:2597–2609
21. Arivudainambi D, KA VK, Visu P et al (2020) Ransomware traffic classification using deep
learning models: ransomware traffic classification. Int J Web Port (IJWP) 12(1):1–11
22. Kok S, Abdullah A, Jhanjhi N, Supramaniam M (2019) Ransomware, threat and detection
techniques: A review. Int J Comput Sci Netw Secur 19(2):136
23. Dion Y, Brohi SN (2020) An experimental study to evaluate the performance of machine learning algorithms in ransomware detection. J Eng Sci Technol 15(2):967–981
24. Resilient Information Systems Security (RISS) ransomware dataset. Retrieved from http://rissgroup.org/ransomware-dataset/. Accessed on 05 Jan 2023
25. Eluyode O, Akomolafe DT (2013) Comparative study of biological and artificial neural
networks. Euro J Appl Eng Sci Res 2(1):36–4
26. Pini M, Scalvini A, Liaqat MU, Ranzi R, Serina I, Mehmood T (2020) Evaluation of machine
learning techniques for inflow prediction in lake como, italy. In: Knowledge-based and intel-
ligent information & engineering systems: proceedings of the 24th international conference
KES-2020, Virtual Event, 16–18 September 2020. Procedia Computer Science, vol 176, pp
918–927. Elsevier
27. Sgandurra D, Muñoz-González L, Mohsen R, Lupu EC (2016) Automated dynamic analysis of ransomware: benefits, limitations and use for detection. arXiv preprint arXiv:1609.03020
Leveraging Gamification for Engaged
Learning in Online Teaching
and Learning Experiences
the way for a new era of enhanced and interactive learning experiences, equipping
educators to harness the potential of gamification and adapt education to the evolving
needs of contemporary learners.
1 Introduction
In today’s rapidly evolving educational landscape, the realms of teaching and learning
(TnL) have undergone a profound transformation, fuelled by the integration of tech-
nology and the shifting dynamics of student engagement [1]. As the new generation
of learners, often referred to as ICT natives, immerse themselves in a digital world,
the traditional paradigms of education have given way to innovative approaches that
tap into the technological tools familiar to them. This article delves into the inter-
section of teaching and learning, exploring the captivating realm of gamification
as a dynamic strategy to foster motivation, engagement, and meaningful learning
experiences in the digital era.
The components that comprise effective teaching and learning have expanded
beyond the confines of conventional instruction. Educators now shoulder the respon-
sibility of not merely disseminating information, but guiding students in their explo-
ration of knowledge and development of essential skills. As the modern classroom
extends beyond physical walls into the digital realm, educators face the challenge of
maintaining motivation and engagement in remote and online learning environments
[2]. It is within this context that gamification emerges as a potent solution, seam-
lessly blending educational content with elements drawn from the world of gaming
to create a stimulating and rewarding learning journey.
1.1 Gamification
rapid evolution, burgeoning applications, and its role as a catalyst for innovation and
enhanced learning experiences.
The Gartner Hype Cycle for 2011 predicted that by 2014, over 70% of
Fortune Global 2000 organizations would have adopted gamification in some way
[6]. Numerous encouraging outcomes have been documented, demonstrating the
potential of gamification in conjunction with both conventional and cutting-edge
approaches in educational settings to enhance students’ overall learning experiences,
inspire and involve them, and foster desired behaviour [6]. Many companies have
integrated gamification into various fields, such as Nike in sport and eBay in online shopping. Similarly, companies use gamification for marketing, nudging their customers' behaviour to encourage them to purchase products or visit their websites. At present, Malaysia is one of the countries that has carried out the most empirical research into the use of Augmented Reality and gamification [7], and it is also known as an active country in examining the use of Augmented Reality and gamification in education [7].
This article embarks on an exploration of the synergistic relationship between
technology-enhanced education and the principles of gamification. As online
teaching and learning (TnL) becomes increasingly prevalent, it is imperative to under-
stand how to harness the power of gamified experiences to fuel students’ intrinsic
motivation, curiosity, and active participation. By investigating the impact of gami-
fication on engagement, this article seeks to unravel the intricacies of incorporating
game elements like points, badges, leader boards, and multimedia analytics into the
fabric of educational design. Through a comprehensive analysis of existing liter-
ature, this study aims to provide insights into how gamification can revolutionize
the teaching and learning landscape, creating a dynamic and interactive space that
resonates with the digital generation’s learning preferences and aspirations.
2 Methodology
The methodology for this study aimed to gather insights into the effectiveness and
efficiency of gamification in online teaching and learning environments. It involved
a purposive sampling approach and employed a structured questionnaire to eval-
uate the impact of gamification on student engagement and learning outcomes. The
primary aim of the study was to gather insights into the effectiveness and efficiency
of gamification design in an online learning environment.
2.1 Participants
The study was conducted with a total of 30 students enrolled in the Computer Science
Department at National Defence University of Malaysia. Participants were selected
purposively to ensure representation from various age groups and genders.
2.2 Questionnaire
Participants were approached and informed about the study’s purpose and the volun-
tary nature of their participation. The structured questionnaire, featuring a Likert
scale ranging from 1 (Strongly Disagree) to 5 (Strongly Agree), was administered
electronically to the participants to facilitate straightforward data collection and
record-keeping. Subsequently, quantitative data analysis techniques were employed
to process the collected data. The analysis primarily involved calculating percent-
ages to summarize and interpret participants’ responses, providing a quantitative
perspective on the effectiveness and efficiency of gamification in online teaching
and learning environments.
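As an illustration of this percentage-based analysis, the short sketch below tabulates a hypothetical set of 30 Likert responses for a single item; the values are invented and do not reproduce the survey data.

```python
# Summarize Likert responses (1 = Strongly Disagree ... 5 = Strongly Agree)
# as percentages; the response list is hypothetical example data.
from collections import Counter

responses = [5, 4, 5, 3, 4, 5, 5, 4, 2, 5, 4, 5, 3, 5, 4,
             5, 4, 5, 5, 3, 4, 5, 4, 5, 5, 4, 5, 4, 5, 5]

counts = Counter(responses)
total = len(responses)
for level in range(1, 6):
    pct = 100.0 * counts.get(level, 0) / total
    print(f"Likert {level}: {counts.get(level, 0)} responses ({pct:.1f}%)")

agree = 100.0 * (counts.get(4, 0) + counts.get(5, 0)) / total
print(f"Agree or strongly agree: {agree:.1f}%")
```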
3 Findings
In the dynamic world of education, fostering student engagement has long been
a pivotal goal, particularly in the context of online learning environments where
the physical boundaries of the traditional classroom are transcended. One of the
most promising strategies to achieve this goal is the integration of gamification
elements into the learning process. This study delves into a fascinating facet of this
phenomenon: how online gamification systems can serve as catalysts for actively
engaging students in small-group discussions.
Question 2 of the survey, designed as a Likert scale ranging from 1 (Strongly
Disagree) to 5 (Strongly Agree), provided a nuanced perspective on the effectiveness
of this gamification element. The results, as illustrated in Fig. 3, were nothing short
of remarkable. A staggering 80% of respondents concurred with the statement that
merit and demerit activities during online learning serve as a catalyst for engaging
students’ passion for learning.
This finding carries profound implications for the realm of education, as it high-
lights the potential of gamification to not only enhance student engagement but also
to ignite a fervent interest in learning. By strategically integrating the principles of
merit and demerit within the online learning context, educators can create a dynamic
environment where students are not merely passive recipients of knowledge but
enthusiastic participants in their own educational journey. This approach, rooted in
gamification, offers a promising avenue for shaping a generation of learners who
are not just motivated but genuinely passionate about the pursuit of knowledge. It
underscores the transformative potential of gamification as a powerful ally in the
quest to make learning a vibrant and fulfilling experience for students in the digital
age.
Figure 4 highlights participants' opinions regarding learning through Augmented Reality (AR), unveiling intriguing insights into the role of AR in education. Notably, 65% of respondents expressed their enthusiasm, describing learning
with AR as “exciting.” An additional 25% of participants highlighted the facilitative
nature of AR, perceiving it as a tool that eases the learning process. This collective
sentiment of excitement and facilitation echoes the transformative potential of AR in
education, positioning it as an engaging and supportive force in the learning journey.
The data underscores the notion that AR can captivate students’ interest, turning the
educational experience into an exciting adventure.
Figure 5 offers a comprehensive summary of the survey responses to questions four (Q4), five (Q5), six (Q6), and seven (Q7), unveiling valuable insights into the
impact of gamification on the learning experience. These responses, illuminated by
the participants, underscore the transformative potential of gamification in educa-
tion. Based on survey question number four (Q4), a remarkable 80% of participants
voiced their agreement with the statement that Augmented Reality (AR) can pique
students’ interest in learning techniques more effectively compared to conventional
learning methods. This outcome aligns with the broader narrative of gamification’s
role in capturing student attention and motivation, providing a new dimension to
the learning process. Based on the green bars in Fig. 5, an impressive 90% of respon-
dents acknowledged the allure of receiving a reward after engaging with a video
game. This sentiment underscores the excitement and motivation that gamification
elements, such as rewards and recognition, can offer students in their learning journey.
These participant responses underscore the significance of gamification as a means
to ignite student interest, motivation, and active engagement. They provide empirical
evidence of gamification’s ability to captivate learners’ attention and stimulate their
enthusiasm for the educational content. In the ever-evolving landscape of education,
these insights are invaluable, offering a compelling argument for the continued explo-
ration and integration of gamification in teaching and learning contexts. As proven in
survey question number seven (Q7), which delves into the heart of the matter, an over-
whelming 93% of participants expressed their agreement with the idea that the online
gamification system effectively encourages their active participation in small-group
discussions. This finding solidifies the pivotal role of gamification in transforming
passive online interactions into dynamic and engaging discussions, where students
are not just participants but enthusiastic contributors.
The findings from Fig. 6 illuminate the profound impact of gamification on the
level of student participation. Notably, an impressive 80% of respondents unequiv-
ocally agreed that online gamification effectively encourages students to participate
actively in online discussions. This revelation underscores the transformative power
of gamification, reshaping what was once a passive exchange of ideas into a vibrant,
interactive discourse. The beauty of gamification lies in its ability to infuse elements
of competition, goals, and rewards into the learning experience. When applied to
online discussions, this approach ignites a newfound enthusiasm among students to
actively contribute their insights and ideas. The gamification system, through points,
badges, and leader boards, not only taps into students’ competitive spirit but also
offers them clear objectives to strive for. As a result, students are not mere bystanders
in discussions; they become enthusiastic participants, driven by the desire to excel
and earn recognition.
This transformation is not just about increased activity; it signifies a paradigm shift
in how students engage with course material and interact with their peers. It encour-
ages them to think critically, articulate their thoughts effectively, and constructively
challenge their peers’ ideas. Gamification, therefore, emerges as a potent tool in
addressing the age-old challenge of passive online discussions. It brings a newfound
vibrancy to the digital classroom, fostering an environment where students actively
seek to contribute, collaborate, and learn from one another. In the era of remote and
online learning, this discovery heralds a new dawn, promising more engaging and
enriching educational experiences for students and educators alike.
One of the fascinating discoveries in this study revolves around the implemen-
tation of merit and demerit points as motivational tools within the gamified online
learning platform. This innovative approach harnesses the intrinsic motivation of
students, tapping into their inherent desire for recognition, achievement, and rewards.
In this context, “merit” represents the attainment of a reward, while “demerit”
signifies a loss of privilege or reward.
The investigation into the integration of gamification in online teaching and learning
environments yielded insightful findings that shed light on its potential to enhance
student engagement and learning outcomes. The study, conducted with a purpo-
sive sample of 30 students from the Computer Science Department at the National
Defence University of Malaysia, delved into the effectiveness and efficiency of
gamification design.
The incorporation of merit and demerit points as motivational tools within the
gamified online learning platform emerged as another noteworthy finding. Remark-
ably, 80% of respondents concurred that merit and demerit activities during online
learning effectively engage students’ passion for learning. This result underscores the
importance of recognizing students’ intrinsic motivation and their desire for achieve-
ment and recognition. By implementing gamification elements that reward desirable
behaviours and penalize undesirable ones, educators tap into students’ innate drive to
excel and stand out. This finding not only validates the effectiveness of gamification
as a motivational strategy but also suggests that it can play a pivotal role in shaping
students’ attitudes toward learning, transforming them from passive recipients of
knowledge into active and enthusiastic participants.
The findings presented in Fig. 2 illuminate the profound impact of Augmented
Reality (AR) on student engagement. This collective sentiment reflects the transfor-
mative potential of AR in education, positioning it as an engaging and supportive
force in the learning journey. These responses suggest that AR can effectively capti-
vate students’ interest, making the educational experience an exciting adventure. By
providing an interactive and immersive learning environment, AR holds the promise
of elevating engagement, fostering curiosity, and enhancing the overall educational
experience. The data presented in Fig. 3 underscores the transformative potential
of gamification in education and its synergy with Augmented Reality (AR). Survey
question number four (Q4) revealed that 80% of participants believe that AR can
more effectively pique students’ interest in learning techniques compared to conven-
tional methods. Additionally, the survey unveiled that 90% of respondents found the
idea of receiving rewards after engaging with a video game exciting, highlighting
the motivating power of gamification elements. This alignment between AR and
gamification speaks to the ability of these innovative approaches to capture student
attention and motivation, redefining the learning process.
The study’s investigation into the impact of gamification on student engagement
revealed compelling results. A significant majority of participants, a striking 80%,
expressed agreement with the statement that the online gamification system effec-
tively encourages active participation in small-group discussions. This outcome
underscores the pivotal role gamification plays in transforming the often-passive
nature of online discussions into dynamic and interactive exchanges. Gamification
elements, such as points, badges, and leader boards, introduce a competitive and
goal-oriented dimension that motivates students to actively contribute to discus-
sions. This finding highlights gamification’s potential to bridge the engagement
gap in remote and online learning environments, offering educators a potent tool for
fostering collaboration and knowledge sharing among students.
Fig. 2 Countries that carried out empirical study research into the use of Augmented Reality (AR) and gamification [7]
The results of this study carry significant implications for the future of online
teaching and learning. Gamification emerges as a promising approach to address the
perennial challenge of student engagement in digital education settings. Its ability
environment that excites and engages students, making learning a thrilling adven-
ture. Meanwhile, gamification offers a range of elements like rewards, points, and
leader boards that can motivate and inspire learners, transforming them from passive
recipients of knowledge into enthusiastic participants. Together, these approaches
can capture students’ interest, foster curiosity, and elevate the overall educational
experience, creating a dynamic and engaging digital classroom.
5 Conclusion
In conclusion, the integration of gamification into online teaching and learning expe-
riences presents an exciting and transformative path for education in the digital age.
This article has illuminated the dynamic synergy between traditional education and
the engaging world of gaming, showcasing how gamification can fuel motivation,
foster active participation, and enhance learning outcomes. The traditional role of
educators has evolved beyond imparting knowledge; it now includes guiding students
in their exploration of knowledge and skills. As online teaching and learning continue
to gain prevalence, harnessing the power of gamified experiences becomes impera-
tive. This article has delved into the impact of gamification on student engagement
and found compelling evidence of its effectiveness.
Our study with Computer Science students at the National Defence University
of Malaysia underscores how gamification effectively encourages active participa-
tion in online discussions and ignites students’ passion for learning. These insights
affirm gamification’s ability to create dynamic and interactive learning environments,
aligning with its growing adoption across various sectors. As gamification transcends
traditional boundaries, its transformative potential in shaping user behaviour and
interaction is evident. Incorporating gamification principles into online education
is a promising avenue for revolutionizing educational practices. By tapping into
students’ intrinsic motivation for achievement, recognition, and engagement, gami-
fication offers an innovative approach to fostering dynamic and effective learning
experiences.
In an era where the digital generation shapes the educational landscape, the explo-
ration of gamification’s multifaceted impact on motivation, engagement, and learning
outcomes is a crucial step. It equips educators with a powerful tool to adapt educa-
tion to the evolving preferences and needs of contemporary learners. In summary,
gamification is more than a trend; it’s a pathway to creating education that is not
only engaging but also effective. By embracing gamification, educators can navigate
the dynamic intersection of technology and education, ensuring that students are
well-prepared for a future that demands adaptability, creativity, and active partici-
pation. The gamified classroom is not just a concept; it’s a reality that promises to
revolutionize education for generations to come.
References
1 Introduction
media into positive, negative, or neutral sentiment at different levels, such as sentence
level, document level, or features level sentiment analysis. Sentiment analysis has
various applications, including understanding public perception of eco-friendly trans-
formation, air quality, predicting box office success, analyzing public sentiment
during global events like pandemics, brand management, customer feedback analysis,
market research, social media insights, finance, health care, and more [2, 13].
Using traditional methods, developing accurate sentiment analysis models for
all languages poses challenges [13]. Using traditional methods, sentiment analysis
involves strict adherence to specific protocols, including pre-processing, feature
extraction, and model creation. Initially, a supplied dataset typically undergoes
preliminary processing stages like tokenization and stemming. Then, various tech-
niques are employed during the feature extraction phase to extract relevant features.
A model is subsequently constructed based on these extracted features. Finally, the
resulting model can be implemented to perform sentiment analysis [9]. It’s impor-
tant to note that this process must be repeated for each context since the constructed
models are only trained on specific contexts.
The advent of sizable language models represents a groundbreaking progres-
sion in the domain of natural language processing, providing a fresh perspective
on sentiment analysis that challenges established techniques. These models, with
their unparalleled magnitude and capacity to acquire complex linguistic patterns,
possess the capability to supplant traditional sentiment analysis models. Divergent
from specialized models that necessitate meticulous feature engineering and domain-
specific fine-tuning, Large Language Models are inherently general in nature [7].
Their comprehensive pretraining on diverse text sources grants them a broad compre-
hension of language subtleties and contextual relationships, facilitating the identifi-
cation of sentiment-carrying elements in text. By presenting textual prompts, these
models can produce sentiment predictions with remarkable accuracy, eliminating the
need for task-specific feature extraction and intricate model development. Addition-
ally, their adaptability enables them to capture emerging language trends and adjust
to varying domains without requiring extensive retraining. Hence, Large Language
Models are well-positioned to streamline and enhance sentiment analysis procedures
by providing a versatile and unified alternative to conventional, application-specific
models. Nevertheless, ethical considerations, such as the mitigation of bias and trans-
parency, must be meticulously addressed to ensure the responsible and equitable
deployment of such models in sentiment analysis applications.
In the sentiment analysis literature, three main approaches are commonly employed.
Machine learning-based methods: these approaches utilize machine learning algorithms for text classification into predefined label categories, such as positive, negative, or neutral. Commonly used algorithms include Naive Bayes [12], Support Vector Machines (SVM) [11], and Random Forest [4].
Lexicon-based methods: lexicon-based methods rely on predefined dictionaries containing words associated with positive or negative sentiments. The sentiment score of a text is determined by summing the scores of the words in the text [1].
Hybrid methods: hybrid methods combine both machine learning and lexicon-based approaches to enhance the accuracy of sentiment analysis.
2 Datasets
Two datasets were employed in this research study. The initial dataset used is the
IMDB Review Dataset, commonly referred to as the Large Movie Review Dataset
[6]. This dataset is suitable for comparative analysis, containing 25,000 film reviews
categorized as either positive or negative. In our study, we selected the first 501
reviews from each category within the test set, resulting in a total of 1,002 reviews
for analysis.
The second dataset was curated from Google Maps reviews of 124 distinct univer-
sities in Malaysia. To gather these reviews, the Google Maps API was employed.
Due to restrictions imposed on the free version of the Google Maps API, the data
collection process was limited to capturing a maximum of five reviews from each
university. After the removal of empty reviews and those containing only scores
without accompanying text, a total of 592 reviews remained for analysis. These
retained reviews underwent a manual evaluation, wherein they were assigned one
of three sentiment labels: ‘Positive,’ ‘Negative,’ or ‘Neutral.’ It has to be mentioned
that the constructed dataset has been made publicly available for download [8].
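As an illustration of how such a collection pipeline can be set up, the sketch below queries the Google Maps Place Details endpoint, which returns at most five reviews per place on the standard tier; the API key, place IDs, and field handling are placeholders rather than the exact script used in this study.

```python
# Sketch: collecting Google Maps reviews for a list of universities.
# API key and place IDs are placeholders; the Place Details endpoint
# returns at most five reviews per place, matching the limit described above.
import requests

API_KEY = "YOUR_GOOGLE_MAPS_API_KEY"                                # placeholder
PLACE_IDS = ["ChIJexampleUniversity1", "ChIJexampleUniversity2"]    # placeholders

def fetch_reviews(place_id):
    url = "https://maps.googleapis.com/maps/api/place/details/json"
    params = {"place_id": place_id, "fields": "name,reviews", "key": API_KEY}
    result = requests.get(url, params=params, timeout=10).json().get("result", {})
    # Drop reviews that contain only a score and no accompanying text.
    return [(result.get("name"), r["rating"], r["text"])
            for r in result.get("reviews", []) if r.get("text", "").strip()]

rows = [row for pid in PLACE_IDS for row in fetch_reviews(pid)]
print(f"Collected {len(rows)} non-empty reviews")
```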
3 Methodology
We employed the OpenAI API to utilize GPT-3.5 for sentiment analysis. Reviews
from each dataset under scrutiny were processed by GPT-3.5 along with corre-
sponding prompts, as given in Table 1.
These prompts were strategically constructed to direct GPT-3.5 in allocating
classifications to the input texts. To evaluate the impact of diverse prompts on the
outcomes, we employed two distinct prompts for the Google Maps dataset. In the
case of the IMDB dataset, the reviews were labeled with two distinct categories, positive and negative. Accordingly, the prompt was structured to instruct GPT-3.5 to classify the reviews under either of these two categories. In contrast, for the Google Maps reviews, a different approach was taken: GPT-3.5 was tasked with assigning three distinct labels to the reviews, aligning with our predefined labeling scheme.
Table 1 Prompts used for instructing GPT-3.5 to perform the task
Prompt for IMDB movie review dataset: You will be provided with a movie review, and your task is to classify its sentiment forcefully to as either positive or negative. The final result should be in the format of ‘sentiment:’ followed by one of the following two labels based on sentiment of the review: ‘positive’ or ‘negative’
First prompt for Google Map review dataset: You will be provided with a Google Map review of a university, and your task is to classify its sentiment as either positive, negative, or neutral
Second prompt for Google Map review dataset: You will be provided with a Google Map review of from a university, and your task is to classify its sentiment forcefully to as either positive, negative, or neutral. The final result should be in the format of ‘sentiment:’ followed by one of the following two labels based on sentiment of the review: ‘positive’, ‘negative’, or ‘neutral’
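A minimal sketch of how each review can be passed to GPT-3.5 through the OpenAI API is shown below; the client calls, the 'gpt-3.5-turbo' model name, and the simple parsing of the returned string are illustrative assumptions, with the prompt text taken from Table 1.

```python
# Sketch: classifying one review with GPT-3.5 via the OpenAI chat completions API.
# Model name and response parsing are assumptions; the prompt follows Table 1.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

IMDB_PROMPT = (
    "You will be provided with a movie review, and your task is to classify its sentiment "
    "forcefully to as either positive or negative. The final result should be in the format "
    "of 'sentiment:' followed by one of the following two labels based on sentiment of the "
    "review: 'positive' or 'negative'"
)

def classify(review_text, system_prompt=IMDB_PROMPT):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": review_text},
        ],
    )
    raw = response.choices[0].message.content.strip().lower()
    # Extract the label from the "sentiment: <label>" format requested by the prompt.
    return raw.split("sentiment:")[-1].strip().strip(".")

print(classify("A gorgeous film, but the plot never quite comes together."))
```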
The present study introduces GPT-3.5 as a potent Large Language Model (LLM) and
compares its performance with conventional methodologies. The results emphasize
that the LLM exhibits a competitive capacity to classify reviews into distinct senti-
ment categories. A notable advantage of GPT-3.5 is its ability to eliminate the need
for intricate pre-processing or feature extraction steps on textual data, which makes it
an attractive option. Nevertheless, it is crucial to acknowledge the inherent expenses
involved in deploying GPT-3.5 for diverse tasks. The convenience it provides, saving
substantial time, could be perceived as a cost-saving factor when compared with
conventional approaches. This aspect gains paramount significance when dealing
with varying contextual domains. Unlike traditional methods that require a complete
iteration of the model construction or lexicon assembly process, LLM models stand
out. They offer the advantageous capability to seamlessly adapt to different contexts
without encountering limitations. This adaptability underscores a pivotal strength of
LLMs over their traditional counterparts.
Table 3 shows the confusion matrix of applying GPT-3.5 on the Google Map
reviews dataset. Table 4 provides an overview of GPT-3.5’s performance on the
Google Maps review dataset when using the first and second prompts. Notably, with
the first prompt, GPT-3.5 failed to label one review (as given in row 3 of Table 2).
Consequently, the overall accuracy of GPT-3.5 in predicting ternary class labels
using the first prompt was 92.22%. In contrast, when utilizing the second prompt, the
accuracy improved slightly to 92.57%, and GPT-3.5 successfully assigned labels to
all reviews. A detailed report of various measures for each class in this experiment is
presented in Table 4. Notably, the accuracy for the neutral class was 93.57%, while
other metrics hover around the 50% mark. Conversely, for the remaining two classes,
GPT-3.5 consistently delivers superior results, with a majority exceeding 91%.
Table 5 shows the details of the confusion matrix for the IMDB dataset. As a result,
the overall accuracy of the GPT-3.5 for binary sentiment analysis was 94.11%.
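As a quick sanity check, this accuracy and the F1-score later reported in Table 7 can be reproduced directly from the counts in Table 5; the small computation below assumes one fixed reading of the matrix orientation, which affects only how precision and recall are labeled.

```python
# Sketch: deriving accuracy and F1 from the Table 5 confusion matrix
# (rows read here as true labels; the other reading only swaps precision/recall).
tp, fn = 461, 40    # positive reviews: correctly / incorrectly classified
fp, tn = 19, 482    # negative reviews: incorrectly / correctly classified

accuracy = (tp + tn) / (tp + fn + fp + tn)            # 943 / 1002 = 0.9411
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)    # = 0.9399

print(f"accuracy={accuracy:.4f}, f1={f1:.4f}")
```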
Table 6 presents a comprehensive breakdown of measures for each sentiment
class. Notably, all measures surpass or equal 92%.
A comparison between the GPT-3.5 results on the IMDB movie review dataset
and three other proposed algorithms from the literature [3] is presented in Table 7.
As can be seen, the GPT-3.5 was able to perform better than all three algorithms
over all measures. This suggests that GPT-3.5, as a Large Language Model (LLM),
can be reliably used for performing sentiment analysis tasks. Moreover, its ease of
Table 2 GPT-3.5 outputs for Google Map reviews that were more than 50 characters long when using the first prompt

1 Review (label: positive): Good university, good place, good view, and good environment but maybe lack of public transportation for tourists and lack of shops, and the coffee shop next to the library does not accept cash you can’t buy some drinks when tired.
GPT-3.5 output: The sentiment of the review is mixed. It contains positive aspects such as a good university, good place, good view, and good environment. However, it also mentions some negative points such as a lack of public transportation for tourists, lack of shops, and the inconvenience of the coffee shop not accepting cash. Overall, the sentiment can be classified as neutral

2 Review (label: positive): World’s most global university. Students from different countries study there, and you have an opportunity to meet students from all over the world. Great campus, design, and students. Everywhere in the world, it is the creative thinkers and the normbreakers who are driving the development of the world’s most successful companies and the world’s most powerful economies. tan sri Limkokwing
GPT-3.5 output: Overall, the sentiment of this review is positive. The reviewer highlights the university’s global nature, diverse student body, great campus design, and students. They also mention the importance of creative thinkers and normbreakers in driving successful companies and economies

3 Review (label: positive): University of choice
GPT-3.5 output: I’m sorry, but I cannot provide a sentiment analysis for a university without the actual review. Could you please provide the review for me to analyze?

4 Review (label: negative): Facilities are age old, hope serious attention is given in upgrading them. It’s not only quality of education but also the basic needs of students. Don’t run like a prison dorm. Please take this as a positive feedback
GPT-3.5 output: The sentiment of this review is mixed. While the reviewer mentions that the facilities are age old and hopes for upgrades, they also mention that it is important to address the basic needs of students. They further state that the university should not run like a prison dorm. However, they end the review by asking the university to take their feedback as positive. Overall, the sentiment can be classified as neutral

5 Review (label: neutral): Pros friendly security guards, they always help when you need auditorium is modern have restroom at any floor and big parking lot. cons food and drink in the canteen is very expensive lift is going super slow, you must wait in the long time and the small way to motorcycle in or out.
GPT-3.5 output: The sentiment of the review is a mix of positive and negative. Positive aspects: friendly security guards, modern auditorium, restrooms on every floor, big parking lot. Negative aspects: expensive food and drinks in the canteen, slow lift, long waiting time, small way for motorcycles to enter or exit. Overall, the sentiment can be classified as neutral, as the positive and negative aspects balance each other out
Table 3 Confusion matrix of the GPT-3.5 output for Google Map reviews using the first prompt (second prompt values in parentheses)
Positive Neutral Negative
Positive 422 (428) 17 (11) 5 (6)
Neutral 9 (13) 21 (15) 9 (11)
Negative 3 (2) 3 (1) 102 (105)
Table 4 Evaluation measures for GPT-3.5 on the Google Maps review dataset using the first prompt (second prompt values in parentheses)
Sentiment Accuracy Precision Recall F1 Score
Positive 94.25 (94.59) 0.95 (0.96) 0.97 (0.97) 0.96 (0.96)
Neutral 93.57 (93.92) 0.54 (0.38) 0.51 (0.56) 0.53 (0.45)
Negative 96.62 (96.62) 0.94 (0.97) 0.88 (0.86) 0.91 (0.91)
Table 5 Confusion matrix of the GPT3.5 output for IMDB movie reviews
Positive Negative
Positive 461 40
Negative 19 482
use stands out when compared to traditional methods, especially because it does not
require any specific training on the context in which it will be utilized.
Another significant advantage we have observed in LLMs is their proficiency in
processing poorly written English reviews, including those containing typos. For
example, when examining the reviews in Table 2, you will notice that although the
reviews are not well-written, the results produced by GPT-3.5 are still satisfactory.
Table 7 Comparison of the GPT-3.5 results on the sentiment analysis task with the CNN, LSTM, and LSTM-CNN algorithms reported in [3]
Evaluation measure CNN LSTM LSTM-CNN GPT-3.5
Accuracy 0.90 0.88 0.89 0.9411
Recall 0.95 0.82 0.90 0.9604
Specificity 0.84 0.90 0.87 0.9234
Precision 0.87 0.90 0.87 0.9202
F-score 0.91 0.86 0.88 0.9399
This suggests that GPT-3.5 was able to infer the context of the reviews effectively.
5 Conclusion
The present study introduces GPT-3.5 as a potent Large Language Model (LLM)
and compares its performance with conventional methodologies. The results empha-
size that the LLM exhibits a competitive capacity to classify reviews into distinct
sentiment categories. A notable advantage of GPT-3.5 is its ability to eliminate the
need for intricate pre-processing or feature extraction steps on textual data, which
makes it an attractive option. However, it’s important to acknowledge the inherent
expenses of deploying GPT-3.5 for various tasks. Its efficiency in saving both time
and computational costs related to pre-processing can be considered a cost-saving
advantage compared to conventional approaches. This aspect gains paramount signif-
icance when dealing with varying contextual domains. Unlike traditional methods
that require a complete iteration of the model construction or lexicon assembly
process, LLM models stand out. They offer the advantageous capability to seam-
lessly adapt to different contexts without encountering limitations. This adaptability
underscores a pivotal strength of LLMs over their traditional counterparts.
The GPT-3.5, an exceptional Large Language Model, has been extensively trained
on diverse textual data [14]. However, it’s possible that publicly accessible datasets,
such as the IMDB dataset, were included in its training corpus. Therefore, evaluating
its performance under such circumstances may not be entirely justified. To address
this concern, we utilized Google Maps reviews to ensure that the GPT-3.5 model was
not exposed to this data during training.
In our experimental analysis of the Google review dataset using the initial prompt
(see Table 1), we obtained descriptive outputs from GPT-3.5 that necessitated metic-
ulous post-processing. Although the prompts explicitly instructed the model to solely
assign labels to the input text, the resultant outputs were not consistently straightfor-
ward. This compelled us to conduct further analysis on the GPT-generated results.
While this phenomenon was observed in only a limited number of reviews, the
descriptive nature of the output necessitated additional processing.
Table 2 presents a selection of such results obtained when utilizing the first prompt
with Google Map reviews. Notably, there were six instances where the GPT-3.5
outputs were descriptive, requiring careful extraction of the assigned labels from
the output. In addressing this challenge, we introduced a second prompt, explicitly
instructing GPT-3.5 to generate its response in a predefined format. Subsequently,
the generation of descriptive answers ceased entirely, consistently yielding outputs
conforming to the specified format. This observation underscores the pivotal role
of well-structured prompts when harnessing GPT-3.5’s capabilities. Prompt engi-
neering refers to the process of designing and refining prompts or instructions given
to AI models to generate desired outputs. It involves finding suitable prompts for
specific tasks and improving the effectiveness of prompts through various methods
and techniques. The goal is to enhance the performance and capabilities of AI models
in different domains such as conversational agents, visual and image processing, and
text-to-image generation [5].
References
1. Bonta V et al (2019) A comprehensive study on lexicon based approaches for sentiment analysis.
AJCST. 8(S2):1–6. https://doi.org/10.51983/ajcst-2019.8.S2.2037
2. Feldman R (2013) Techniques and applications for sentiment analysis. Commun ACM
56(4):82–89. https://doi.org/10.1145/2436256.2436274
3. Haque MdR et al (2019) Performance analysis of different neural networks for sentiment
analysis on IMDb movie reviews. In: 2019 3rd international conference on electrical, computer
and telecommunication engineering (ICECTE), pp. 161–164 IEEE, Rajshahi, Bangladesh.
https://doi.org/10.1109/ICECTE48615.2019.9303573
4. Karthika P et al (2019) Sentiment analysis of social media network using random forest algo-
rithm. In: 2019 IEEE international conference on intelligent techniques in control, optimization
and signal processing (INCOS). IEEE, Tamil Nadu, India, pp 1–5. https://doi.org/10.1109/INC
OS45849.2019.8951367
5. Lo LS (2023) The CLEAR path: a framework for enhancing information literacy through
prompt engineering. J Acad Librarianship 49(4):102720. https://doi.org/10.1016/j.acalib.2023.
102720
6. Maas A et al (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th
annual meeting of the association for computational linguistics: human language technologies,
pp 142–150
7. Mirchandani S et al (2023) Large language models as general pattern machines. https://doi.
org/10.48550/ARXIV.2307.04721
8. Nadi F Malaysian universities google map review. https://github.com/pharhadnadi/Malaysian
UniversitiesGoogleMapReview
9. Nandwani P, Verma R (2021) A review on sentiment analysis and emotion detection from text.
Soc Netw Anal Min 11(1):81. https://doi.org/10.1007/s13278-021-00776-6
10. Serrano-Guerrero J et al (2015) Sentiment analysis: a review and comparative analysis of web
services. Inf Sci 311:18–38
11. Styawati S, Mustofa K (2019) A support vector machine-firefly algorithm for movie opinion
data classification. Indonesian J Comput Cybern Syst 13(3):219. https://doi.org/10.22146/ijccs.
41302
12. Wongkar M, Angdresey A (2019) Sentiment analysis using naive bayes algorithm of the data
crawler: twitter. In: 2019 fourth international conference on informatics and computing (ICIC).
IEEE, Semarang, Indonesia, pp 1–5. https://doi.org/10.1109/ICIC47613.2019.8985884
13. Yadav J (2023) Sentiment analysis on social media. Qeios. https://doi.org/10.32388/YF9X04
14. Zhou C et al (2023) A comprehensive survey on pretrained foundation models: a history from
BERT to ChatGPT. https://doi.org/10.48550/ARXIV.2302.09419
Telecom Customer Experience Analysis
Using Sentiment Analysis and Natural
Language Processing—Comparative
Study
A. M. A. Ahmed
Iraqi Prime Minister Office, Baghdad, Iraq
A. Al-Nahari · C. K. Loy
School of Information Technology, UNITAR International University, Petaling Jaya, Selangor,
Malaysia
R. Al-Shabandar (B) · A. H. Mohammed
Advisory Office for Scientific, Iraqi Prime Minister Office, Academic Affairs and Artificial
Intelligence Applications, Baghdad, Iraq
e-mail: [email protected]
1 Introduction
2 Related Works
Social media has become one of the main sources of users’ opinions and customer sentiment toward any product. It reflects real public opinion toward any phenomenon, incident, or even governmental policy. Much research has been conducted during the last decade on sentiment analysis and how to precisely classify public opinions into different polarities. The majority of this research concentrates on the English language, and few studies focus on other morphologically rich languages like Arabic. Moreover, sentiment analysis and opinion mining research has touched several domains and industries such as health, sports, politics, technology, and marketing [6].
The author in [7] used a public Twitter dataset consisting of 2,000 tweets divided into positive and negative classes, which were processed and tokenized using 4G techniques. After tokenization, the Khoja stemmer is used to remove stop words, and TF-IDF is used to count word occurrences as the selected feature input to a discriminative multinomial Naïve Bayes (DMNB) classifier. The DMNB classifier achieved a weighted average recall of 0.875 and a weighted average precision of 0.876 for negative and positive tweets, while accuracy and F-score were 87.5%, an improvement over the approaches in related work.
The work in [8] explored Arabic sentiment analysis of online learning tweets in Saudi Arabia during the COVID-19 pandemic. The tweets were manually annotated, and N-grams of different sizes along with TF-IDF were used for feature extraction before input into the classification models. Six different models were tested and compared, including Naïve Bayes, support vector machine, logistic regression, KNN, random forest, and XGBoost. Logistic regression with unigram TF-IDF achieved the best accuracy (89.9%) and recall (89.9%); however, the best precision (96.8%) was achieved by SVM with bigram TF-IDF [8].
The authors in [4] created an Arabic sentiment lexicon through automatic translation of English sentiment lexicons and manual extraction of sentiment words translated to Jordanian dialect; the lexicon was used to train three machine learning classification models (Naïve Bayes, KNN, and support vector machine) on Facebook comments related to telecom operators’ services in Jordan. Lexicon-based labeling achieved very good accuracy (98%), and the support vector machine was found to be the most accurate of the three tested models, with 97.8% accuracy. The reference in [9] proposed a new approach to detect churn in Saudi telecom companies using sentiment analysis of Arabic tweets. Several approaches were applied to detect customer satisfaction and compared with a questionnaire designed for the customers whose tweets were analyzed using the proposed approach.
3 Methodology
The dataset was collected from Kaggle and consists of 10,000 unlabeled Arabic tweets from the stc customer care account on Twitter [10]. The tweets in this dataset span 42 days, from February 17, 2022, to March 30, 2022, and were written by 6,084 unique users.
The dataset, loaded and read using Pandas, has five columns: an unnamed index column with values from 0 to 10,000; a datetime column with the detailed date and time of each posted tweet; a tweet id key column containing a unique id for each tweet; a text column containing the body of the tweets shared by different users; and finally a username column containing the user identity as per Twitter identification.
Various text preprocessing techniques are considered in this paper: data annotation, tokenization, lowercasing or normalization, removal of stop words and unnecessary words, stemming, and lemmatization.
In this study, the Twitter-XLM-RoBERTa-base model was used to annotate the tweets as either negative, positive, or neutral. The XLM-RoBERTa-base model was trained on ~198M multilingual tweets. An instance of the XLM-RoBERTa-based model was initialized using the Auto classes from Hugging Face, and a batch size of 32 was used without changing any model parameters [11].
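A compact sketch of this weak-labeling step is given below; the exact Hugging Face checkpoint name ('cardiffnlp/twitter-xlm-roberta-base-sentiment') and the sample tweets are assumptions consistent with the description above.

```python
# Sketch: annotating Arabic tweets with a Twitter XLM-RoBERTa sentiment model.
# Checkpoint name is an assumption; the study used a batch size of 32.
from transformers import pipeline

annotator = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",  # assumed checkpoint
)

tweets = ["الخدمة ممتازة شكراً لكم", "النت بطيء جداً اليوم"]   # sample tweets
labels = [p["label"].lower() for p in annotator(tweets)]
print(labels)  # e.g. ['positive', 'negative']
```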
In the text preprocessing phase, various steps were executed to ensure precise data transformation and to prepare the text in a format suitable as input to the topic modeling and classification tasks. The implementation of text preprocessing in this research follows the most common steps used in Arabic text preprocessing, including removing hashtags, account names, and web links; excluding non-Arabic words; and removing stop words, unnecessary spaces, emojis, and punctuation.
Using regular expressions, hashtags, account names, and web links were removed successfully. Figure 1 shows a sample of tweets before and after removing hashtags, account names, and web links; the English translation is added below each Arabic tweet in blue, before and after processing, so that it can be understood by non-Arabic speakers.
Emojis were removed by compiling all emoji formats and substituting them with empty strings using regular expressions.
The NLTK toolkit was used to remove Arabic stop words. While inspecting the word cloud generated with ar_wordcloud, it was found that many words were repeated without adding meaning to the tweets. To avoid their influence on the proposed models, a list of such words was added to the stop-word list and removed from all tweets.
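A condensed sketch of these cleaning steps, using regular expressions and the NLTK Arabic stop-word list, is shown below; the specific patterns and the extra domain-specific stop words are illustrative assumptions rather than the exact rules used in this study.

```python
# Sketch: cleaning Arabic tweets (links, mentions, hashtags, non-Arabic characters,
# emojis, punctuation) and removing stop words with NLTK.
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
arabic_stops = set(stopwords.words("arabic"))
arabic_stops |= {"والله", "ياليت"}   # illustrative extra words spotted in the word cloud

LINKS_MENTIONS_TAGS = re.compile(r"(https?://\S+)|(www\.\S+)|(@\w+)|(#\w+)")
NON_ARABIC = re.compile(r"[^\u0600-\u06FF\s]")   # keeps Arabic letters and spaces only

def clean_tweet(text):
    text = LINKS_MENTIONS_TAGS.sub(" ", text)    # drop web links, mentions, hashtags
    text = NON_ARABIC.sub(" ", text)             # drop emojis, Latin text, punctuation
    tokens = [t for t in text.split() if t not in arabic_stops]
    return " ".join(tokens)

print(clean_tweet("@stc_care النت بطيء جداً 😡 https://t.co/xyz"))
```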
The target of topic modeling is to discover the most repeated topics that customers write their tweets about. Two different models were used for the topic modeling task: Latent Dirichlet Allocation (LDA) and BERTopic.
3.3.2 BERTopic
Two sets of experiments were conducted in this work. In the first experiment, the NLP model BERT is utilized, while in the second, fine-tuned MARBERT is used. The outputs of the NLP models were used as inputs to machine learning classifiers (SVM, XGBoost). To evaluate classification performance, accuracy, precision, recall, and F1-score were used. More details about the models can be found below.
The data must be converted into a form that is suitable for the BERT model. Therefore, the data was tokenized, and the pretrained ‘BERT-base’ model is used to undertake the initial training process. Padding and truncation were set in the tokenization process so that each sentence or tweet has the same length: short sentences were padded with zeros, long sentences were cut off at max_length, and the output was set to return PyTorch tensors.
The result of tokenizing a text is a dictionary that contains the input_ids, which are the tokens expressed as numbers, and the attention_mask, which indicates whether a token is a [PAD] token. The texts’ [CLS] hidden states were then obtained by running the model on the tokenized sentences or tweets. The [CLS] hidden state is then used as input to a support vector machine initialized from the sklearn library, with a linear kernel and ‘ovo’ decision_function_shape in order to handle multi-class classification over the three labels: neutral, positive, and negative sentiments.
The same BERT-base model hidden states were used to extract features from the tweets, this time paired with the XGBoost machine learning classifier. The Auto classes were used for text tokenization with the same settings and configurations; the only difference in this model is that the [CLS] hidden state is input to an XGBClassifier from the XGBoost library with 50 estimators.
The MARBERT specialized Arabic dialect model is utilized in our work with the aim of extracting the tweets’ hidden-state features as input to a support vector machine classifier. The Ammar-alhaj-ali/Arabic MARBERT-sentiment model and AutoTokenizer are used to extract the textual features. The resulting dictionary of input_ids and attention_mask is used in the MARBERT model to get the texts’ [CLS] hidden states. The [CLS] hidden state is then used as input to a support vector machine initialized from the sklearn library, with a linear kernel and ‘ovo’ decision_function_shape in order to handle multi-class classification over the three labels: neutral, positive, and negative sentiments [16, 17].
3.4.5 MARBERT
The findings demonstrate that SVM outperforms XGBoost when utilizing the BERT-
base model, which was primarily pre-trained on the English language and subse-
quently fine-tuned on Arabic language telecom domain data in this study. SVM
achieves an F1-score of 0.71, while XGBoost achieves an F1-score of 0.65. This
significant performance gap between the two machine learning classifiers diminishes
Fine-tuned MARBERT exhibits the potential for enhanced feature extraction through
its embedding layers, leveraging the advantages of pre-trained Arabic dialects and
modern standard Arabic corpora.
The data for the second experiment is collected from the Kaggle tweets dataset
[10]. We use the MARBERT Arabic-specific transformer LMs model as the source
model for Table 2 and Fig. 3 [16].
The study effectively employs the latest techniques of topic modeling and text classifi-
cation based on pre-trained bidirectional encoder representations from transformers
(BERT). BERTopic leverages transformers, and c-TF-IDF is utilized to generate
distinct dense clusters, facilitating the identification of easily interpretable topics.
This research marks the first application of BERTopic to Arabic telecom-related
text, successfully generating six clusters for 50% of the tweets, aided by manual
assessment and topic reduction.
References
12. Lenka RK, Coombs T, Assi S, Jayabalan M, Mustafina J, Liatsis P, Al-Hamid A, Al-Sudani
S, Ismail NL, Al-Jumeily OBE D (2022) Evaluation of extractive and abstract methods in text
summarization. In: The international conference on data science and emerging technologies.
Springer Nature Singapore, Singapore, pp 535–546
13. Mehta R, Varma V (2023) LLM-RM at SemEval-2023 Task 2: Multilingual Complex NER
using XLM-RoBERTa. ArXiv preprint arXiv:2305.03300
14. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3(Jan):993–
1022
15. Sheikha H (2020) Text mining Twitter social media for Covid-19: comparing latent semantic
analysis and latent Dirichlet allocation
16. Abdul-Mageed M, Elmadany AR, Nagoudi EMB (2021) ARBERT & MARBERT: Deep bidirectional transformers for Arabic. In: ACL-IJCNLP 2021—59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing, proceedings of the conference, pp 7088–7105
17. Komang Ananta Aryadinata I, Pangesti D, Anugerah GB, Aditya IE, Ruldeviyani Y (2021)
Sentiment analysis of 5G network implementation in Indonesia using twitter data
Efficient Time Series of Smoothing
and Auto-regressive Forecasting Models
for Predicting Police Officer Fatalities
in the USA
Danush Nagappan, Manoj Jayabalan, Ahmad Alanezi, Farhad Nadi,
and Thomas Coombs
Abstract With police deaths rising, predicting the number of police deaths is now of significant importance, and such predictions are necessary so that precautions can be taken to limit the impact of an associated reduction in police numbers on the police force, the public, and the government. The aim of this research is to compare different time series forecasting models and find the most efficient model for predicting police deaths occurring in the USA. The dataset used in this study consists of details of police officers who died in service on duty in the USA; it comprises a total of 26,269 records between January 3, 1791, and June 3, 2022, and was obtained through the Kaggle data repository. A total of four smoothing models and four auto-regressive models
were used and compared in this research. The smoothing models had better RMSE
and MAPE scores, with HWES being the best-performing model. In summary, the
HWES model performed the best on the USA police deaths dataset by producing
less error compared to the other smoothing and auto-regressive models used in this
research.
D. Nagappan · M. Jayabalan
Liverpool John Moores University, Liverpool, United Kingdom
A. Alanezi
Saad Alabdullah Police Academy, Al Jahra, Kuwait
F. Nadi
School of Information Technology, UNITAR International University, Petaling Jaya, Malaysia
T. Coombs (B)
British American Tobacco, Southampton, United Kingdom
e-mail: [email protected]
1 Introduction
Recent years have witnessed a concerning increase in police fatalities in the USA,
primarily attributed to the heightened accessibility of firearms among the public
and the accompanying surge in gang-related violence [1–3]. This rising trend in
annual police casualties, combined with an uptick in gun-related incidents nation-
wide, presents a growing threat to public safety. This threat carries the potential for
severe economic and substance abuse ramifications [4–6].
The year 2021 marked a grim milestone, with the USA recording its highest annual
tally of police fatalities. Consequently, it has become imperative to develop predictive
models for estimating the number of police fatalities. Such models can serve as a
foundation for enhancing the protection of police officers at higher risk, augmenting
training for vulnerable officers, and improving the recruitment and training of new
officers [7–10].
Recent research has demonstrated the utility of time series data analysis in fore-
casting various phenomena, such as COVID-19 cases and identifying crime-prone
areas along with the timing of such events [7]. This methodological approach holds
promise for informing strategies to mitigate the risks associated with police fatalities
in the USA.
Time series forecasting methods are employed to predict future events using
historical data. Their primary purpose is to assist in making more informed deci-
sions. These forecasting techniques find wide application in various fields, including
predicting population trends, sales figures, inventory levels, workforce demand,
weather patterns, and disease outbreaks [8]. Visualizing time series data is often
done by creating graphs, which can reveal underlying patterns and trends within the
data.
To assess the performance of time series forecasting models, researchers
commonly employ error metrics such as the mean absolute percentage error (MAPE)
and root mean square error (RMSE) [7, 9, 10]. These metrics help researchers deter-
mine which model is the most accurate, as they quantify the level of error associated
with each model’s predictions [9, 10]. This information is vital for selecting the most
reliable forecasting model.
Predicting the number of police deaths is highly important to take necessary
precautions and reduce the impact of such deaths on the police force, the public,
and the government due to a decrease in police numbers. In a study conducted in a
Chinese city, a dataset from a police station was utilized, which included information
on robberies, thefts, and burglaries. The research concluded that the ARIMA model
provided more accurate forecasts compared to other models [11–13].
Another study by Cesario [11] focused on crime forecasting using the AR and
ARIMA models in Chicago from 2001 to 2014, achieving an 84% accuracy rate
[11]. Kumar (2018) collected crime data from Indian datasets and employed prepro-
cessing algorithms to make the data stationary. Subsequently, SES, HES, HWES,
and ARIMA models were used for prediction purposes [8]. The results indicated
that all models produced a 95% confidence interval.
2 Methodology
2.1 Dataset
The dataset employed in this study comprises information on police officers who lost
their lives while on duty in the USA. It includes data on their names, dates of death,
officer ranks, affiliated departments, the respective states of service, and the causes
of their demise. The dataset encompasses a total of 26,269 records spanning from
January 3, 1791, to June 3, 2022. We accessed this dataset from the Kaggle data
repository [14]. The utilization of this extensive time frame stems from our inten-
tion to leverage historical data for the purpose of forecasting future police officer
fatalities in the USA.
To ensure the suitability of the dataset for our research, we implemented several data
preprocessing steps. Initially, the dataset underwent filtration to exclude instances
related to K9 dog deaths, as our study exclusively focused on human fatalities [18, 19].
Subsequently, we excluded data concerning COVID-19-associated deaths among
police officers, as their inclusion introduced inaccuracies into the prediction of police
fatalities.
In the process of feature selection, we carefully considered the variables that would
be fed into our time series forecasting models. We retained two essential features: (1)
the combination of years and months in which police officers died, and (2) the number
of deaths occurring in a specific month of a given year. Any remaining variables that
did not align with our research objectives were omitted from the analysis.
Addressing missing values, particularly months containing zero values, required special attention. We replaced these zero values with the next lowest value, 1. This adjustment helped mitigate errors without significantly impacting the final forecast output.
Subsequently, we divided the dataset into training and testing sets. The training
dataset encompassed the initial 180 years, spanning from 1791 to 1970, while the
remaining data, covering the years from 1971 to 2022, constituted the test set. This
partitioning allowed us to evaluate the performance of our time series forecasting
models effectively. Figure 1 depicts the resulting time series after the preprocessing step was applied.
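A simplified sketch of this preparation with pandas is shown below; the file name, column names, and filter strings are assumptions about the Kaggle CSV rather than its documented schema.

```python
# Sketch: building the monthly police-fatality series and the train/test split.
# File name, column names, and filter values are assumptions about the Kaggle data.
import pandas as pd

df = pd.read_csv("police_deaths_usa.csv", parse_dates=["date"])                 # assumed schema
df = df[~df["rank"].str.contains("K9", case=False, na=False)]                   # drop K9 dog deaths
df = df[~df["cause_of_death"].str.contains("covid", case=False, na=False)]      # drop COVID-19 deaths

monthly = df.set_index("date").resample("MS").size().rename("deaths")
monthly = monthly.clip(lower=1)     # raise zero-count months to the next lowest value, 1

train = monthly[:"1970-12"]         # first 180 years, 1791-1970
test = monthly["1971-01":]          # 1971-2022
```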
Figure 2 shows the methodological approach of this study through a process flow diagram. Multiple methodologies available in time series analysis are used to predict future values and to compare which of these models performs better on this data. The models used in this study are SES, HES, HWES additive, HWES multiplicative, AR, MA, ARMA, ARIMA, and SARIMA. The simple exponential smoothing (SES) technique is used for data that has no seasonality or trend, only a level. Holt’s exponential smoothing (HES) model is used in this study to check whether the model performs well with only the trend and level of our police deaths data. The Holt-Winters exponential smoothing (HWES) model is used in our research to predict based on the trend, level, and seasonality components of the series; it gave satisfactory prediction values in other studies, so we assume that the HWES model might perform well on our data and yield good forecast values. The auto-regressive (AR) prediction model uses past values to find future values; the dependent value is found using the independent values. In the moving average (MA) model, the future is predicted by considering the residuals of past prediction errors. ARMA consists of both AR and MA components and is therefore called the auto-regressive moving average model; this research uses it to check whether performance is adequate without the integrated component on our police dataset. The seasonal auto-regressive integrated moving average (SARIMA) model captures seasonality as well. SARIMA has two sets of elements, non-seasonal and seasonal: the non-seasonal terms are used to make the data stationary, and the seasonal terms apply the differencing over the season in the data.
In the initial step of our analysis, we subjected the data used in the auto-regressive
models to an Augmented Dickey-Fuller (ADF) test to ascertain its stationarity. The
obtained p-value of 0.58 suggested non-stationarity, as it exceeded the conventional
threshold of 0.05. To rectify this issue and ensure stationarity, we performed a Box-
Cox transformation using the SciPy library. Subsequently, the transformed stationary
data was utilized to generate autocorrelation function (ACF), as shown in Fig. 3, and
partial autocorrelation function (PACF), as shown in Fig. 4, plots via the statsmodels
library, helping us determine the appropriate lag order.
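The stationarity check and transformation just described can be sketched as follows with statsmodels and SciPy; the series name follows the preprocessing sketch above and the number of lags plotted is illustrative.

```python
# Sketch: Augmented Dickey-Fuller test, Box-Cox transformation, and ACF/PACF plots.
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

adf_stat, p_value, *rest = adfuller(train)       # p-value around 0.58 -> non-stationary
print(f"ADF statistic={adf_stat:.3f}, p-value={p_value:.2f}")

transformed, lam = stats.boxcox(train)           # Box-Cox needs strictly positive values

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
plot_acf(transformed, lags=24, ax=axes[0])       # guides the MA lag order
plot_pacf(transformed, lags=24, ax=axes[1])      # guides the AR lag order
plt.show()
```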
In the next stage of our analysis, we employed the root mean square error (RMSE)
and mean absolute percentage error (MAPE) evaluation metrics to fine-tune various
hyperparameters for forecasting time series data. These metrics played a pivotal role
in enhancing the accuracy and effectiveness of our forecasting model.
3 Results
The results obtained from each model will be compared to find which model produced the lowest error value and performed best on the US police deaths dataset.
Simple exponential smoothing (SES) model. The initial analysis of police deaths
data involved employing a straightforward exponential smoothing model, which
yielded RMSE and MAPE values of 6.01 and 43.63, respectively, with a smoothing
level set at 0.5. Subsequently, adjusting the smoothing level to 0.7 resulted in revised
RMSE and MAPE values of 5.81 and 41.83, respectively. Given that a smoothing
level of 0.7 produced the most favorable error values, this parameter was chosen as
the optimal setting for further analysis.
Holt’s exponential smoothing model. By applying a smoothing level of 0.7 from
the SES model and subsequently modifying the smoothing slope to 0.5 using Holt’s
smoothing method, we obtained RMSE and MAPE values of 367.75 and 2475.08,
respectively. Further adjustment of the smoothing slope to 0.1 resulted in RMSE and
MAPE values of 36.52 and 234.55, respectively. Finally, when the smoothing slope
was set to 0.06, the RMSE and MAPE values reached 4.68 and 27.24. Notably, this
method outperformed the SES model in terms of its ability to identify the underlying
level and decreasing trend within the dataset.
Holt-Winters exponential smoothing additive model. The initial selection
of smoothing parameters for the models, namely a smoothing level of 0.7 and
a smoothing slope value of 0.06, was accompanied by the choice of a seasonal
smoothing value of 0.2, resulting in respective RMSE and MAPE values of 22.57
and 140.80. Maintaining all other variables constant while adjusting the smoothing
slope value to 0.055 led to improved model performance, yielding RMSE and MAPE
values of 18.38 and 111.89. Subsequently, a modification in the seasonal smoothing
parameter from 0.2 to 0.01 resulted in even more favorable RMSE and MAPE values,
amounting to 6.53 and 32.77. Further refinement by reducing the seasonal smoothing
to 0.0025 yielded the most optimal RMSE and MAPE values of 4.62 and 28.13, and
the model successfully identified the presence of level, decreasing trend, and season-
ality in the dataset, as illustrated in Fig. 5. It is noteworthy that the magnitude of the
seasonal pattern in the data did not exhibit a direct correlation with the series value.
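For reference, a sketch of fitting this best-performing additive configuration with statsmodels is given below; seasonal_periods=12 and the error helpers are assumptions, and older statsmodels releases name the trend parameter smoothing_slope rather than smoothing_trend.

```python
# Sketch: Holt-Winters additive model with the tuned smoothing parameters,
# evaluated with RMSE and MAPE on the held-out 1971-2022 period.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

hwes = ExponentialSmoothing(train, trend="add", seasonal="add", seasonal_periods=12)
fit = hwes.fit(smoothing_level=0.7, smoothing_trend=0.055, smoothing_seasonal=0.0025)

forecast = fit.forecast(len(test))
rmse = float(np.sqrt(np.mean((test.values - forecast.values) ** 2)))
mape = float(np.mean(np.abs((test.values - forecast.values) / test.values)) * 100)
print(f"RMSE={rmse:.2f}, MAPE={mape:.2f}")
```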
Holt-Winters exponential smoothing multiplicative model. For the final
smoothing method, smoothing level (0.7), smoothing slope (0.055) and smoothing
seasonal (0.3) values were initially used and yielded RMSE and MAPE values of
25.16 and 156.17, respectively. Then, seasonal smoothing value was changed from
0.3 to 0.0025, resulting in RMSE and MAPE values of 8.24 and 42.71. Finally, the seasonal
smoothing value again was adjusted to 0.001 and yielded RMSE and MAPE values
of 4.83 and 26.89. Although this did achieve a lower error rate, the HWES additive
model performed better as the seasonal pattern data did not increase with the increase
in data values and did not decrease with a decrease in the data values.
Simple auto-regressive model. The first auto-regressive model used is the simple AR method. As mentioned in the methodology (2.4 Model Implementation and Hyperparameter Tuning), when using the ACF and PACF we obtained the lag order as 1, and
therefore, we will keep the order as 1, 0, 0. This achieved RMSE and MAPE values
of 15.03 and 107.63, respectively. The AR model did not perform well on this dataset
due to failing to capture the downward trend of the test dataset.
Moving average model. By maintaining the lag order at 1 (model order 0, 0,
1), we obtained RMSE and MAPE values of 22.80 and 165.12, respectively. The
moving average model again like the auto-regressive model did not perform well on
the police deaths dataset as it failed to capture the downward trend of the test dataset.
Auto-regressive moving average model. The lag order is kept the same (1) for the auto-regressive moving average model, with an order of 1, 0, 1 as listed in Table 1. For the ARMA
model, these specific value inputs yielded RMSE and MAPE values of 23.20 and
168.19 and did not perform well due to it, again, failing to capture the decreasing
trend of the test dataset.
Auto-regressive integrated moving average model. Using the same lag order of
1, we got the order of 1, 1, 1 for the ARIMA model, which when these values were
given as inputs, we achieved RMSE and MAPE values of 23.20 and 168.19. This
model performed the same as ARMA and again did not perform well on the police
deaths dataset as the downtrend of the test dataset failed to be captured.
Seasonal auto-regressive integrated moving average methods. The final auto-
regressive model used was SARIMA which the same lag order of 1 was used as
input to this model (model order: 1, 1, 1 and seasonal order: 1, 1, 1, 12). This yielded
RMSE and MAPE values of 15.00 and 106.52 and performed better than the other
auto-regressive models as it captured seasonality. However, the performance was not
satisfactory when compared to the smoothing models as it failed to capture the police
death dataset trend.
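A corresponding sketch for the SARIMA configuration reported here (order (1, 1, 1), seasonal order (1, 1, 1, 12)) using statsmodels' SARIMAX is shown below; as before, the train and test series come from the earlier preprocessing sketch.

```python
# Sketch: SARIMA(1, 1, 1)(1, 1, 1, 12) fit and forecast with statsmodels SARIMAX.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

sarima = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = sarima.fit(disp=False)

pred = result.forecast(steps=len(test))
rmse = float(np.sqrt(np.mean((test.values - pred.values) ** 2)))
mape = float(np.mean(np.abs((test.values - pred.values) / test.values)) * 100)
print(f"RMSE={rmse:.2f}, MAPE={mape:.2f}")
```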
The performance of the HWES additive model on the police deaths dataset is demon-
strated in Table 1, where it effectively captures the level, trend, and seasonality in
police fatalities. The selection of this model was based on its competitive perfor-
mance in comparison to other research models, as indicated by its RMSE and MAPE
values of 4.62 and 28.13, respectively [15, 16].
Table 1 RMSE and MAPE values of different time series forecasting models
Method Smoothing level Smoothing slope Smoothing seasonal Order RMSE MAPE
SES 0.5 – – – 6.01 43.63
SES 0.9 – – – 5.9 42.67
SES 0.7 – – – 5.81 41.83
HES 0.7 0.5 – – 367.75 2475.08
HES 0.7 0.1 – – 36.52 234.55
HES 0.7 0.06 – – 4.68 27.24
HWES additive 0.7 0.06 0.2 – 22.57 140.80
HWES additive 0.7 0.055 0.2 – 18.38 111.89
HWES additive 0.7 0.055 0.01 – 6.53 32.77
HWES additive 0.7 0.055 0.0025 – 4.62 28.13
HWES multiplicative 0.7 0.055 0.3 – 25.16 156.17
HWES multiplicative 0.7 0.055 0.0025 – 8.24 42.71
HWES multiplicative 0.7 0.055 0.001 – 4.83 26.89
AR – – – (1, 0, 0) 15.03 107.63
MA – – – (0, 0, 1) 22.80 165.12
ARMA – – – (1, 0, 1) 23.20 168.19
ARIMA – – – (1, 1, 1) 23.20 168.19
SARIMA – – – (1, 1, 1) 15.00 106.52
4 Conclusions
Table 2 Comparison of evaluation measures across different time series prediction models in literature
Previous literature Model used Evaluation measure
Deethong and Boonnam (2022) [21] HWES R-squared = 0.723, MAPE = 168.135
Jayaraman et al. (2022) [22] SARIMA RMSE = 18.758, MAPE = 8.455
Chordia and Pawar (2021) [23] ARIMA RMSE = 1180.22, MAPE = 1.184
Cesario et al. (2016) [11] ARIMA RMSE = 28.86, MAPE = 0.16
Chen et al. (2008) [12] ARIMA RMSE = 56.94, MAPE = 9.48
Kamarudin et al. (2021) [20] Multilayer Perceptron Neural Network RMSE = 40.08, MAPE = 20.92
Daryl et al. (2021) [24] SARIMA RMSE = 112.21, MAPE = 36.05
Goswami and Kandali (2020) [25] SARIMA RMSE = 200, MAPE = 10.7
Dutta et al. (2020) [26] SARIMA RMSE = 18.08
This study HWES additive RMSE = 4.62, MAPE = 28.13
Acknowledgements The authors would like to express their gratitude to UNITAR International University for supporting the publication of this research.
References
1. Mustard DB (2001) The impact of gun laws on police deaths. J Law and Econom 44(S2):635–
657
2. Krieger N, Chen JT, Waterman PD, Kiang MV, Feldman J (2015) Police killings and police
deaths are public health data and can be counted. PLoS Med 12(12):e1001915
3. Edwards F, Esposito MH, Lee H (2018) Risk of police-involved death by race/ethnicity and
place, United States, 2012–2018. Am J Public Health 108(9):1241–1248
4. Jennings JT, Rubado ME (2017) Preventing the use of deadly force: the relationship between
police agency policies and rates of officer-involved gun deaths. Public Adm Rev 77(2):217–226
5. Anon (2022) US police deaths on the rise - BBC. [online] BBC. Available at: https://www.bbc.
com/news/world-us-canada-61218611. Accessed 25 Aug 2023
6. Coombs T, Abdelkader A, Ginige T, Van Calster P, Assi S (2023) Understanding synthetic
drug analogues among the homeless population from the perspectives of the public: thematic
analysis of Twitter data. J Substance Use 11:1–8
7. Cihan P (2021) ARIMA-based forecasting of total COVID-19 cases in the USA and India. In:
2021 29th signal processing and communications applications conference (SIU), pp 1–4
8. Kumar M, Athulya S, Minu MMB, Vinodini VMD, Lakshmi AKG, Anjana S, Manojkumar TK
(2018) Forecasting of annual crime rate in India: a case study. In: 2018 International conference
on advances in computing, communications and informatics (ICACCI), pp 2087–2092
9. Yadav R, Sheoran SK (2018) Crime prediction using auto regression techniques for time series
data. In: 2018 3rd international conference and workshops on recent advances and innovations
in engineering (ICRAIE), pp 1–5
10. Hartono AP, Luhur CR, Qomariyah NN (2022) Forecasting vaccination growth for COVID-19
using machine learning. In: 2022 5th international conference on computing and informatics
(ICCI), pp 356–363
11. Cesario E, Catlett C, Talia D (2016) Forecasting crimes using autoregressive models. In: 2016
IEEE 14th Intl conference on dependable, autonomic and secure computing, 14th Interna-
tional conference on pervasive intelligence and computing, 2nd international conference on big
data intelligence and computing and cyber science and technology congress (DASC/PiCom/
DataCom/CyberSciTech), pp 795–802
12. Chen P, Yuan H, Shu X (2008) Forecasting crime using the ARIMA model. In: 2008 Fifth
international conference on fuzzy systems and knowledge discovery, pp 627–630
13. Abdurrahman M, Irawan B, Latuconsina R (2017) Flood forecasting using holt-winters
exponential smoothing method and geographic information system. In: 2017 International
conference on control, electronics, renewable energy and communications (ICCREC), pp
159–163
14. Koli M (2023) Police deaths in USA from 1791 to 2022: USA law enforcement line of duty
deaths, Version 1. Retrieved 1 Sept 2023 from https://www.kaggle.com/datasets/mayureshk
oli/police-deaths-in-usa-from-1791-to-2022
15. Saha AS, Haque M, Alam MGR (2021) Covid-19 confirmed-cases prediction in SAARC coun-
tries through machine learning. In: 2021 Third international conference on inventive research
in computing applications (ICIRCA), pp 889–894
16. Shimizu S, Shin S (2021) Applicability of SARIMA model in Tokyo population migration forecast. In: 2021 14th international conference on human system interaction (HSI), pp 1–4
17. Shukla S, Ramaprasad R, Pasari S, Sheoran S (2022) Statistical analysis and forecasting of
wind speed. In: 2022 4th international conference on energy, power and environment (ICEPE),
pp 1–6
18. Alloghani M, Al-Jumeily D, Hussain A, Mustafina J, Baker T, Aljaaf AJ (2020) Implementation
of machine learning and data mining to improve cybersecurity and limit vulnerabilities to cyber
attacks. Nature-inspired computation in data mining and machine learning, pp 47–76
19. Abbas G, Tanveer M, Abbas ZH, Waqas M, Baker T, Al-Jumeily OBED (2021) A secure
remote user authentication scheme for 6LoWPAN-based Internet of Things. PLoS ONE
16(11):e0258279
20. Kamarudin KA, Wan Ismail WA, Harymawan I, Shafie R (2021) Types of politically connected firms and analysts' earnings forecast. J Appl Acc Res 22(5):883–913
21. Deethong T, Boonnam N (2022) Forecasting analysis of the durian yield trends in southern Thailand using Holt-Winters exponential smoothing method and Box-Jenkins' techniques. In: 2022 Joint international conference on digital arts, media and technology with ECTI northern section conference on electrical, electronics, computer and telecommunications engineering (ECTI DAMT & NCON), pp 29–32. https://doi.org/10.1109/ECTIDAMTNCON53731.2022.9720330
22. Jayaraman V, Parthasarathy S, Lakshminarayanan AR (2022) Forecasting the emission of greenhouse gases from the waste using SARIMA model. In: 2022 6th international conference on trends in electronics and informatics (ICOEI), pp 99–106
23. Chordia S, Pawar Y (2021) Analyzing and forecasting COVID-19 outbreak in India. In: 2021 11th international conference on cloud computing, data science & engineering (Confluence), pp 1059–1066
24. Winata DA, Kumara S, Suhartono D (2021) Predicting stock market prices using time series SARIMA. In: 2021 1st international conference on computer science and artificial intelligence (ICCSAI), pp 92–99. https://doi.org/10.1109/ICCSAI53272.2021.9609720
25. Goswami K, Kandali AB (2020) Electricity demand prediction using data driven forecasting scheme: ARIMA and SARIMA for real-time load data of Assam. In: 2020 international conference on computational performance evaluation (ComPE), pp 570–574
26. Dutta B, Barman MP, Patowary AN (2020) Application of ARIMA model for forecasting road accident deaths in India. Int J Agric Stat Sci 16(2):607–615
Multimodal Emotion Recognition Using
Attention-Based Model with Language,
Audio, and Video Modalities
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 193
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_15
retrieved from RoBERTa with other modality-based features. A model that improves on BERT is introduced as part of this work and helps to boost performance.
1 Introduction
2 Background
the best accuracy. The outputs from the video and audio paths were fused using the k-Nearest Neighbor (k-NN) or artificial neural network (ANN) algorithm to classify combinations of features.
Xu et al. [9] provided a model based on head fusion, which builds on the multi-head attention mechanism, in order to make SER more accurate. It utilized a model known as an attention-based convolutional neural network (ACNN).
To solve this problem, research papers study bimodal and trimodal correlations for data fusion by moving from unimodal vectors to bimodal and then trimodal vectors using deep neural networks. The Multimodal Transformer (MulT) is a way to deal with both problems in a general, end-to-end way without explicitly aligning the data [3]: (1) data that do not match up because each modality's sequences are sampled at different rates, and (2) long-term dependencies between elements from different modalities. At the heart of this model is directional pairwise cross-modal attention, which attends to interactions between multimodal sequences across different time steps and latently adapts streams from one modality to another.
Siriwardhana et al. [1] proposed a Transformer-based self-supervised learning (SSL) method for multimodal emotion recognition. Pretrained SSL models for text, audio, and faces, namely RoBERTa [4], Wav2Vec [10], and Fabnet [11], respectively, were used to extract embeddings. A special CLS token, which can sum up the sequence's data, was prepended to the Wav2Vec (A) and Fabnet (V) embeddings before applying self-attention. After modifying the embeddings, the audio, video, and text embeddings are fed into six IMA transformer blocks to capture cross-modal data. The Hadamard product of the CLS tokens of the same core modality was used to retain the most relevant data.
Wang T, Hou Y, Zhou D, and Zhang Q proposed a contextual-attention-based multimodal neural network framework that uses conversational information from both the target speaker and the other speaker to determine how someone feels from what they say [12].
Lee et al. [2] presented a new multimodal approach to emotion recognition that improves the BERT model by combining it with different features from the language, audio, and visual modalities. In particular, they improve the BERT model because the audio and visual modalities differ from text in many ways. They introduce the self-multi-attention fusion module, the multi-attention fusion module, and the video fusion module, which are attention-based multimodal fusion mechanisms that use the recently proposed transformer architecture, and they look into the best ways to combine fine-grained representations of audio and visual features into a single embedding while incorporating a pretrained BERT model.
Our research considers acoustic, visual, and textual cues simultaneously to
improve emotion recognition performance over unimodal approaches, as the trimodal
emotion recognition model shows high performance and robustness by comprehen-
sively considering multiple modalities. Lastly, we applied the attention mechanism’s
capacity to incorporate multimodal aspects as the attention mechanism has recently
drawn some interest in the field due to its usefulness. Additionally, transfer learning
methods that extract features from pretrained networks have made substantial recent
advancements.
3 Methodology
In this paper, we propose the model shown in Fig. 1. It takes raw video and breaks it down into "utterance units" to obtain audio, video, and text information. Toolkits and pretrained models are used to extract deep-learning features from the audio and visual modalities. In the self-multi-attention fusion module, the different visual and audio features are linked, and a high-level representation is taken out of the fusion process. RoBERTa is used to extract the text representations. The multi-head attention fusion module finally predicts the person's emotional state.
3.1 Dataset
Our tests are conducted on the public IEMOCAP, CMU-MOSI, and CMUMOSEI
datasets, all of which represent the current state of the art in benchmarking video
sentiment analysis.
For multimodal sentiment and emotion recognition, CMU-MOSI includes 3702 brief speech video snippets taken from YouTube movie reviews. Each statement is labeled on a sentiment scale from −3 to +3, where −3 denotes an exceedingly negative mood and +3 denotes an exceptionally pleasant one.
The CMU-MOSEI dataset is also drawn from YouTube, like CMU-MOSI, except it is much bigger. CMU-MOSEI has 23,453 video movie reviews, making it the world's largest multimodal analysis dataset. Additionally, there are 22,856 annotated statements in that data. Like the CMU-MOSI dataset, CMU-MOSEI contains sentiment scores ranging from −3 to +3.
The IEMOCAP dataset comprises 12 h of data from ten actors across five sessions. In this study, we focused on four distinct emotions: joy, sadness, rage, and apathy.
In our method, visual, audio, and textual features are used to improve the performance
of emotion recognition. We experiment with the CMU-MOSI, CMU-MOSEI, and
IEMOCAP Database.
First, we used the concatenation method to combine the different kinds of features
(handcraft and bottleneck features) that were taken from the audio and video data.
It uses a relative positional encoding method to let states be reused without confusing the timing between frames. Because audio and visual features have different sizes than text features, we changed both the Fabnet (V) and Wav2Vec (A) embeddings by appending a trainable vector termed CLS and then applying self-attention to each embedding sequence. The self-attention fusion module figures out how strongly different parts of a single modality are related and pulls out a sympathetic representation.
We use multi-head attention to combine the attended parts of each modality. In the multi-head mechanism, scaled dot-product attention is performed multiple times in parallel. We exploit not only the context information between each modality's features but also the textual representation from RoBERTa, together with two attention mechanisms (self-multi-attention fusion and multi-attention fusion) that combine these features. Both self-multi-attention fusion and multi-attention fusion have a residual connection and layer normalization.
4 Analysis
For our experiments, we used four sets of data that were available to the public. All of these datasets include speech, video, and text. The proposed method has been compared with the best results published so far for each dataset. Accuracy, F1 score, mean absolute error (MAE), and correlation coefficient were the main evaluation measures. The IEMOCAP dataset was made up of common emotion categories like "happy", "angry", "sad", "neutral", and "excitement". The CMU-MOSI and CMU-MOSEI datasets have sentiment scores ranging from −3 to +3.
Initially, three pretrained models, RoBERTa, Wav2Vec, and Fabnet, were used to extract features from the raw data modalities. Dimensions and maximum lengths of training sequences vary by modality. Pretrained Wav2Vec and RoBERTa models are accessible from the Fairseq code base and were employed to extract the SSL audio and text features. Faces were cropped from video frames using the Retina-Face facial
recognition model to obtain video features. Every face-containing frame was then routed through the pretrained Fabnet model to acquire video modality characteristics. The features retrieved from these models possess a large embedding size and a lengthy sequence length, so we created a technique whereby a single embedding can represent the lengthy embedding sequence associated with each modality. We changed both the Fabnet embeddings (V) and Wav2Vec embeddings (A) by appending a trainable vector termed CLS and then applying self-attention to every embedding sequence in order to extract significant features from each modality. The symbols Q, K, V, and d_Q refer to the Query, Key, Value, and the dimensionality of the Query vector, respectively.
The use of the CLS token (which stands for classification) in the embedding sequences was influenced by how RoBERTa and BERT models describe entire sequences with a first, unique token termed CLS. Because the self-attention mechanism of BERT-based models is bidirectional (past and future), the first token of a sequence, the CLS token, encodes all the information to its right, that is, the future sequence. Consequently, the CLS token is used as a compressed representation to tackle classification problems such as sentiment analysis. In our model, we prepended CLS tokens to the embedding sequences of Wav2Vec and Fabnet because neither of these models has an architecture similar to BERT. Since RoBERTa is a BERT-based model, the text embedding sequences were left unchanged. Access to three CLS tokens representing the three modalities enabled us to compute attention efficiently and develop a straightforward late fusion mechanism.
V_embedding = Self-Attention([CLS]_V, Fabnet_seq),   (1)

A_embedding = Self-Attention([CLS]_A, Wav2Vec_seq),   (2)

Self-Attention(Q, K, V) = softmax(QKᵀ / √d_Q) V.   (3)
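To make Eqs. (1)–(3) concrete, a minimal sketch of prepending a trainable CLS vector to a modality's embedding sequence and applying single-head scaled dot-product self-attention is given below. PyTorch, the module name, and the dimensions are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (assumptions: PyTorch; illustrative dimensions and names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLSSelfAttention(nn.Module):
    """Prepends a trainable [CLS] vector and applies single-head self-attention (Eq. 3)."""

    def __init__(self, dim: int):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, dim))   # trainable [CLS] token
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, length, dim), e.g. Fabnet or Wav2Vec embedding sequences
        batch = seq.size(0)
        x = torch.cat([self.cls.expand(batch, -1, -1), seq], dim=1)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)   # QK^T / sqrt(d_Q)
        attended = F.softmax(scores, dim=-1) @ v                 # softmax(.) V
        return attended[:, 0]   # the attended [CLS] row summarizes the sequence
```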
The fundamental objective of IMA fusion layers was to facilitate the exchange of
pertinent data across modalities. IMA fusion layers were developed to incorporate
information from representations of various modalities into the representation of
one modality. The multi-head attention fusion layer functions similarly to the self-
attention fusion layer, with the exception that it generates a Query (Q) vector using
the CLS token of one modality and the Key (K)–Value (V) vectors from the embed-
ding sequences of other modalities. The inputs to the multi-head fusion layer are
three embedding sequences, with the first token of each embedding sequence being the CLS token. IMA attention is computed between the CLS token of one modality and the complete embedding sequence of another modality, as the CLS token of each modality summarizes the sequence's information. There are therefore six multi-head attention transformer blocks, and every transformer block's Q vector is computed using one modality's CLS token and another modality's whole embedding sequence.
Considering their similarities, the six CLS embeddings can be divided into three pairs, each pair derived from a single core modality. The Hadamard (element-wise) product is computed between the token pairings that share the same core modality; it is used to extract the mutual information contained in the two CLS representations of each pair.
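A rough sketch of one such cross-modal attention block and the Hadamard pairing of CLS representations, under the same PyTorch assumption, is shown below; the shapes and names are illustrative.

```python
# Minimal sketch (assumptions: PyTorch; illustrative embedding size and head count).
import torch
import torch.nn as nn

dim, heads = 256, 4
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

def ima_block(cls_token: torch.Tensor, other_seq: torch.Tensor) -> torch.Tensor:
    """Query = CLS of one modality; Key/Value = full embedding sequence of another modality."""
    attended, _ = cross_attn(query=cls_token.unsqueeze(1), key=other_seq, value=other_seq)
    return attended.squeeze(1)

# Six such blocks yield six CLS embeddings; pairs that share the same core modality are
# fused with a Hadamard (element-wise) product before the final prediction layer.
def hadamard_fuse(cls_a: torch.Tensor, cls_b: torch.Tensor) -> torch.Tensor:
    return cls_a * cls_b
```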
• Accuracy is the proportion of correct predictions over total predictions.
• Precision is the fraction of relevant instances among the retrieved instances.
• Recall is the fraction of relevant instances that were retrieved.
• The F1 score is defined as the harmonic mean of recall and precision. Measure of
a model’s accuracy on a dataset.
• MAE (mean absolute error) represents the difference between the original and predicted values, obtained by averaging the absolute differences over the dataset.
• True Positive (TP)—predicted positive and it is positive.
• True Negative (TN)—predicted negative and it is negative.
• False Positive (FP)—predicted positive and it is negative.
• False Negative (FN)—predicted negative and it is positive (Fig. 2).
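A small sketch of how these measures can be computed with scikit-learn is given below; the arrays are tiny placeholders rather than results from the paper.

```python
# Minimal sketch (assumptions: scikit-learn; placeholder arrays instead of real predictions).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_absolute_error

y_true, y_pred = [1, 0, 1, 1, 0], [1, 0, 0, 1, 1]               # placeholder emotion labels
acc  = accuracy_score(y_true, y_pred)                            # correct predictions / total predictions
prec = precision_score(y_true, y_pred)                           # TP / (TP + FP)
rec  = recall_score(y_true, y_pred)                              # TP / (TP + FN)
f1   = f1_score(y_true, y_pred)                                  # harmonic mean of precision and recall
mae  = mean_absolute_error([2.5, -1.0, 0.0], [2.0, -1.5, 0.5])   # mean |actual - predicted| for sentiment scores
```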
Fig. 3 Accuracy comparison between our model and state of the art for CMU-MOSI dataset
We carried out a series of ablation studies that employ the CMU-MOSEI dataset to comprehend the effect of various constituents of the proposed fusion mechanism. CMU-MOSEI was used because it possesses the greatest number of training examples relative to the other datasets. Primarily, three types of ablation studies were performed, characterized as follows:
• Ablation of the inputs of the text, speech, and video modalities.
• Ablation of the application of multi-head attention.
• A study on the utilization of the Hadamard product.
Table 1 depicts the results of the ablation studies. The letters A, V, and T stand for audio, video, and text, respectively. Among the unimodal inputs, T shows the highest performance. Text and video used together in the bimodal T + V setting also show high performance. Comparing the trimodal results with our final model shows how well Hadamard-product-based information extraction works. Finally, the results show that using the Hadamard computation improves the accuracy of seven-class and binary sentiment by about 3% while reducing the number of trainable parameters. This is because combining the six vectors directly adds three times as many parameters to the final prediction layer as
Fig. 4 F1 score comparison between our model and state of the art for CMU-MOSI dataset
Fig. 5 Accuracy comparison between our model and state of the art for CMUMOSEI dataset
Fig. 6 F1 score comparison between our model and state of the art for CMUMOSEI dataset
using the Hadamard computation. This result also shows that combining the text modality helps to better recognize emotions.
Table 2 shows that our model for the CMU-MOSI dataset does better than the current best methods in most respects. Experimental results on the CMU-MOSI dataset show that the proposed HFU-BERT model improves performance by 0.53 and 0.21 over the lowest state-of-the-art performance on Acc-7 and Acc-2, respectively. But there is still a difference between our model and [2], which shows that there is room for further improvement.
For the CMU-MOSEI dataset, compared with various state-of-the-art models, this model does better in F1 score, MAE, Acc-2, and Corr. For IEMOCAP, the model demonstrates a better F1 score and accuracy for each emotion compared with earlier studies.
To sum up, our research proposed a model that jointly employs a pretrained BERT model for MMLs and heterogeneous feature unification for video and audio for the purpose of multimodal emotion recognition. The proposed model combined auditory and visual modalities into different features, and RoBERTa was used to fine-tune the model. The proposed method beat the state of the art on three difficult benchmarks: CMU-MOSI, CMU-MOSEI, and the IEMOCAP dataset. The higher computation necessitated by the additional trainable weights and hyperparameters represented a possible constraint of our proposed model.
Fig. 7 Accuracy comparison between our model and state of the art for IEMOCAP dataset
Fig. 8 F1 score comparison between our model and state of the art for IEMOCAP dataset
6 Conclusion
multi-head attention to extract significant features from various modalities and fuse
data across modalities, respectively.
Our research proposed a model that jointly employs a pretrained BERT model for MMLs and heterogeneous feature unification for video and audio for the purpose of multimodal emotion recognition. The proposed model combined auditory and visual modalities into different features, and RoBERTa was used to fine-tune the model. The proposed method beat the state of the art on three difficult benchmarks: CMU-MOSI, CMU-MOSEI, and the IEMOCAP dataset. The higher computation necessitated by the additional trainable weights and hyperparameters represented a possible constraint of our proposed model.
Automatic emotion detection systems are becoming an increasingly essential component of human–computer interaction as the field of emotion identification becomes increasingly important. Nevertheless, multimodal emotion detection is complicated by the fact that there is insufficient data. Transfer learning is a method that we recommend using to solve this problem. Transfer learning takes
advantage of pretrained models, such as BERT, and attention-based mechanisms,
such as self-attention and multi-head attention, to extract relevant features from
multiple modalities and fuse data across modalities, respectively.
This study focused on video, audio, and text; however, we aim to investigate addi-
tional modalities in future studies. Independently trained models for each modality
were used together with pretrained models; however, current research reveals that certain algorithms can jointly learn text and video knowledge for tasks such as video question answering.
References
8. Song KS, Nho YH, Seo JH, Kwon DS (2018) Decision-level fusion method for emotion recog-
nition using multimodal emotion recognition information. In: 15th International conference on
ubiquitous robots, UR 2018, pp 472–476
9. Xu M, Zhang F, Zhang W (2021) Head fusion: improving the accuracy and robustness of speech
emotion recognition on the IEMOCAP and RAVDESS dataset. IEEE Access 9:74539–74549
10. Zhao Z, Wang Y, Wang Y (2022) Multi-level fusion of Wav2vec 2.0 and BERT for multimodal
emotion recognition
11. Wiles O, Sophia Koepke A, Zisserman A (2018) Self-supervised learning of a facial attribute
embedding from video. In: British Machine Vision Conference 2018, BMVC
12. Wang T, Hou Y, Zhou D, Zhang Q (2021) A contextual attention network for multimodal
emotion recognition in conversation. In: Proceedings of the international joint conference on
neural networks, 2021-July
13. Pham H, Liang PP, Manzini T, Morency LP, Póczos B (2019) Found in translation: Learning
robust joint representations by cyclic translations between modalities. In: 33rd AAAI confer-
ence on artificial intelligence, AAAI 2019, 31st innovative applications of artificial intelligence
conference, IAAI 2019 and the 9th AAAI symposium on educational advances in artificial
intelligence, EAAI 2019, Shaffer 2018, pp 6892–6899
Comparative Analysis of Emotion
Recognition Using Large Language
Models and Conventional Machine
Learning
Abstract Emotions are significant aspects of human existence and influence inter-
action between individuals and groups, influencing how we think and behave. In this
research, we aim to use conventional and neural network models to identify emotions
from textual data and compare which performed best. The Go Emotions dataset
contained 27 different emotions across 58,000 samples. The approach involves
modelling the conventional machine learning models and the neural network-based
models and comparing the results over test dataset and choosing the best model.
Upon comparing the classification reports for the conventional and neural network-
based models on the Ekman taxonomy, conventional machine learning algorithms
were outperformed by neural network-based models which gained almost 10% more
than conventional models. Conventional models averaged the values around 50%
of macro-average F1-score except for the KNN classifier which performed poorly
getting the macro-average F1-score of 21%. BERT classifier with Ekman taxonomy
including neutral emotion had a macro-average precision of 55% and a sensitivity
of 68%. This classifier also achieved a macro-average F1-score of 61%.
While the RoBERTa classifier had a macro-average precision of 65%, the recall, or
sensitivity, was found to be 53%. This study clearly states that neural network-based
models outperformed conventional models. Our study proposed BERT model which
achieved a macro-average F1-score of 0.50 across Go Emotion taxonomy.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 211
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_16
1 Introduction
2 Method
2.1 Dataset
The data used in this study were obtained from Google and comprised of Reddit
comments manually annotated by humans and classified with 27 different emotion
categories posted between January 2005 and January 2019. Emotion taxonomy
included four ambiguous emotions, 11 negative emotions, 12 positive emotions
and 1 neutral emotion categories. The dataset, known as the Go Emotion dataset,
consisted of almost 58,000 comments from Reddit users. The goal was to gather a
sizable dataset with a concentration on conversational data, where emotion is a key
component of communication. The Reddit platform offers a sizable, publicly acces-
sible volume of content that includes direct user-to-user communication, making it
a significant source for emotion research. Reddit comments from subreddits with at least 10,000 comments, posted from 2005 through January 2019, were used to build Go Emotions, with deleted and non-English comments being removed [20].
In the Go Emotion taxonomy, the data were labelled using indices of each emotion in text format. Encoding the emotions was done by converting the index strings into lists of indices using a comma separator, mapping each index to the actual emotion label, and applying one-hot encoding. After encoding, text cleaning was done in the following steps: (1) emojis are converted into text, (2) contractions are expanded, (3) acronyms and typing mistakes are fixed, (4) words with repeated letters are normalized, (5) special characters and numbers are replaced with spaces, and finally (6) double spaces are eliminated. Figure 1 shows the data pre-processing used for this dataset.
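A minimal sketch of the label encoding and cleaning steps described above is shown below. The column names and the emoji and contraction helpers are assumptions; the paper does not name the libraries it used, and the acronym/typo fixes of step (3) would require a custom dictionary.

```python
# Minimal sketch (assumptions: a pandas DataFrame with a comma-separated 'labels' column,
# and the third-party 'emoji' and 'contractions' packages).
import re
import emoji
import contractions
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

def encode_labels(df: pd.DataFrame, n_labels: int = 28):
    # "3,27" -> [3, 27] -> one-hot row over the 27 emotions plus neutral
    index_lists = df["labels"].astype(str).str.split(",").apply(lambda xs: [int(x) for x in xs])
    return MultiLabelBinarizer(classes=list(range(n_labels))).fit_transform(index_lists)

def clean_text(text: str) -> str:
    text = emoji.demojize(text)                    # (1) emojis -> text
    text = contractions.fix(text)                  # (2) expand contractions
    # (3) acronym and typo fixes omitted: they need a hand-built dictionary
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)     # (4) squeeze repeated letters ("soooo" -> "soo")
    text = re.sub(r"[^A-Za-z\s]", " ", text)       # (5) replace special characters and numbers with spaces
    return re.sub(r"\s{2,}", " ", text).strip()    # (6) remove double spaces
```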
Data analysis consisted of exploratory and text analysis. For the exploratory analysis, the number of emotions used to annotate each sample and the number of samples per emotion were examined. Word analysis was used to visualise the top 10 words that occur most frequently for each emotion.
Table 1 shows that the proportions of emotions in train, validation, and test
datasets are similar. This is essential in order to avoid adding an additional bias
when evaluating predictive models.
Stochastic Gradient Descent (SGD). The performance evaluation shows that Ekman
taxonomy with and without neutral emotion resulted in the highest F1-macro-average
score of 57 and 50%, respectively. Go Emotion and Ekman taxonomy (with and
without neutral emotion) produced the highest precision macro score of 65, 64,
and 64%, respectively. Moreover, Ekman taxonomy obtained the best recall macro-
average score of 53 and 47%.
Support Vector Machines (SVM). Again, the Ekman taxonomy with and without
neutral emotion resulted in the highest F1-macro-average score of 57 and 50%,
respectively. Go Emotion and Ekman taxonomy with and without neutral emotion achieved precision macro-average scores of 65, 64, and 64%, respectively. Ekman taxonomy without neutral emotion achieved the highest recall macro-average score of 53%.
K-Nearest Neighbours (KNN). Ekman taxonomy without neutral emotion achieved
the highest F1-macro-average score of 46%, highest precision macro-average score of
60%, and highest recall score of 41%. However, KNN achieved lower F1, precision,
and recall macro-average scores compared to the two previous models.
Decision Tree. The decision tree model for Ekman taxonomy without neutral emotion achieved F1, precision, and recall macro-average scores of 48, 60, and 47%, respectively.
72% from 70%. The precision and recall scores were 62 and 66% and did not improve
from the BERT model.
ELECTRA. Finally, the ELECTRA model with Ekman taxonomy with and without
neutral emotion performances were evaluated. The best performing model was
ELECTRA with Ekman taxonomy without neutral emotion with F1, precision, and
recall scores of 66, 68, and 67%, respectively. The ELECTRA technique did not
perform better than the BERT technique (Table 5).
Table 6 Comparison of previous research with this research on the Go Emotion dataset

Previous literature | Predictive model | Evaluation measure
Kamath et al., 2022 | RoBERTa | F1-score = 47
Singh et al., 2021 | BERT + CDP | F1-score = 52.34
Demszky et al., 2020 | BERT | F1-score = 46
This research | BERT | F1-score = 50
4 Conclusions
This study was conducted using the Go Emotions dataset, which consists of 58 k Reddit comments that have been manually annotated by humans and classified with 27 different emotion categories. There were three variables in the dataset: the text, the emotion label index, and the ID. Data pre-processing was done to correct spelling mistakes, remove unwanted tokens such as stop words and special symbols, and convert the data into clean text. Then, exploratory data analysis was
performed, and the results were interpreted to understand the number of emotions
per sample. Here, cardinality varies from 1 to 5 emotions per sample, with a
large majority of samples that have only 1 or 2 emotions assigned. In summary,
the BERT neural network algorithm achieved the best accuracy and efficiency at
predicting emotion from the Go Emotion dataset. The number of class samples per
emotion was also explored to understand the class balancing of the train dataset
and showed that the data were skewed due to 30% being labelled as neutral. Upon
comparing the classification reports for the conventional and neural network-based
models on the Ekman taxonomy, conventional machine learning algorithms like
Stochastic Gradient Descent, Support Vector Machines, decision tree, and random
forest performed well, while the neural network-based models like BERT, RoBERTa, and ELECTRA performed exceptionally well, gaining almost 10% more than the conventional models. Conventional models averaged the values around 50%
of macro-average F1-score except for the KNN classifier which performed poorly
getting the macro-average F1-score of 21%. Here, the BERT classifier with Ekman
taxonomy including neutral emotion had a macro-average precision of 55% and a
sensitivity of 68%. This classifier also outperformed the macro-average F1-score by
106 61%. While the RoBERTa classifier had a macro-average precision of 65%, the
recall, or sensitivity, was found to be 53%. Future research should look into using
fine-grained datasets with low skew with the aim to improve model performance and
also use a hybrid model structure.
References
5. Rajendram SM, Mirnalinee TT (2022) Contextual emotion detection on text using gaussian
process and tree based classifiers. Intell Data Anal 261:119–132
6. Senn S, Tlachac M, Flores R, Rundensteiner E (2022) Ensembles of BERT for depression clas-
sification. In: 2022 44th annual international conference of the IEEE engineering in medicine &
biology society (EMBC). [online] IEEE, pp 4691–4694. Available at: https://ieeexplore.ieee.
org/document/9871120/
7. Ab Nasir AF, Seok Nee E, Sern Choong C, Shahrizan Abdul Ghani A, Abdul Majeed APP,
Adam A, Furqan M (2020) Text-based emotion prediction system using machine learning
approach. In: IOP conference series: materials science and engineering. Institute of Physics
Publishing
8. Chowanda A, Sutoyo R, Meiliana ST (2021) Exploring text-based emotions recognition
machine learning techniques on social media conversation. Procedia Comput Sci 821–828
9. Saad MM, Jamil N, Hamzah R (2018) Evaluation of support vector machine and decision tree
for emotion recognition of malay folklores. Bull Electr Eng Inform 73:479–486
10. Alm CO, Roth D, Sproat R (2005) Emotions from text: machine learning for text-based emotion
prediction. In: Proceedings of human language technology conference and conference on
empirical methods in natural language processing, pp 579–586
11. Desmet B, Hoste V (2013) Emotion detection in suicide notes. Exp Syst Appl 4016:6351–6358
12. Gievska S, Koroveshovski K, Chavdarova T (2014) A hybrid approach for emotion detection
in support of affective interaction. In: 2014 IEEE international conference on data mining
workshop, IEEE, pp 352–359
13. Gupta A, Sahu H, Nanecha N, Kumar P, Roy PP, Chang V (2019) Enhancing text using emotion
detected from EEG signals. J Grid Comput 172:325–340
14. Acheampong FA, Nunoo-Mensah H, Chen W (2021) Transformer models for text-based
emotion detection: a review of BERT-based approaches. Artif Intell Rev 548:5789–5829
15. Santhosh Baboo S, Amirthapriya M (2022) Sentiment analysis and automatic emotion detection
analysis of twitter using machine learning classifiers. Int J Mechan Eng
16. Lee SYM, Chen Y, Huang C-R (2010) A text-driven rule-based system for emotion cause
detection. In: Proceedings of the NAACL HLT 2010 workshop on computational approaches
to analysis and generation of emotion in text, pp 45–53
17. Tao J (2004) Context based emotion detection from text input. In: 8th international confer-
ence on spoken language processing, ICSLP 2004. International Speech Communication
Association, pp 1337–1340
18. Yousaf A, Umer M, Sadiq S, Ullah S, Mirjalili S, Rupapara V, Nappi M (2021) Emotion
recognition by textual tweets classification using voting classifier (LR-SGD). IEEE Access
9:6286–6295
19. Deshpande M, Rao V (2017) Depression detection using emotion artificial intelligence.
GoEmotions: a dataset of fine-grained emotions. arXiv preprint arXiv:2005.00547
20. Singh G, Brahma D, Rai P, Modi A (2021) Fine-grained emotion prediction by modeling
emotion definitions. In: 2021 9th international conference on affective computing and
intelligent interaction, ACII 2021. Institute of Electrical and Electronics Engineers Inc.
The Impact of Clustering-Based
Sequential Multivariate Outliers
Detection in Handling Missing Values
Abstract The presence of missing values is a common issue that frequently leads
to incomplete data in a wide range of research. They diminish the accessibility of
the dataset that can be utilized and degrade the statistical power of the analysis.
A significant focus in numerous studies has been directed toward the methods of
missing value imputation. In cases where the dataset includes outliers, the impu-
tation of missing values might be incorrect or significantly deviate from the actual
values. One of the challenges that impacts the quality of data is the handling of missing
values and outliers simultaneously. Several studies removed outliers before imputing
missing values or deleted observations with missing values before detecting outliers.
The removal approach leads to a lack of information included within the data. Other
researchers integrate clustering methods into the process of missing value imputation
to mitigate the impact of outliers and data variations, thereby enhancing the accu-
racy of the imputation model. This paper proposes a new clustering-based sequen-
tial multivariate outlier detection (SMOD) method to effectively handle incomplete data in the presence of outliers. The method is applied to an official economic statistics dataset that involves outliers, under a missing value rate scenario of about 50 per cent. In comparison with a well-known and widely used clustering technique, i.e.,
model-based clustering (MBC), the proposed method works well in missing value
imputation.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 221
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_17
1 Introduction
The existence of missing values and outliers is one of the issues that affects data
quality. Removing missing values and outliers can potentially introduce bias in
parameter estimation and weaken the generalizability of the findings [1]. Statistical
analysis that disregards missing values might result in the loss of information, hence
diminishing the statistical power and increasing standard errors [2]. Several statistical
procedures have been designed to impute the data [3]. Before performing anal-
ysis of datasets containing missing values and outliers, an editing process is required
to make the data a complete dataset. Failure to accurately edit data can result in data
that is not suitable for statistical inference and statistical analysis is vulnerable to
violations of assumptions.
Currently, imputation-based procedures for handling missing values, whether
using statistical or machine learning approaches, have emerged as promising
solutions for addressing incomplete datasets [4].
For many decades, imputation for large datasets has been carried out using complicated methodologies. Rubin [5] introduced multiple imputation as a method to address nonresponse in surveys. Each completed dataset is simulated and then analyzed using standard methods. The results are subsequently combined to generate estimates and confidence intervals that incorporate the additional uncertainty that single imputation ignores.
Clustering is one of the techniques used to detect the presence of outliers [6].
Several multiple imputation techniques include clustering techniques as an approach
to imputing missing values [7]. Clustering approaches are employed to mitigate the
impact of outliers and data variations, hence enhancing the accuracy of the imputation
model [8]. However, only a few cluster techniques are insensitive to outliers [9].
This study introduces a novel clustering method that can be used in missing value
imputation processing with a special focus on solving outliers. The distinguishing
factor of this study is the utilization of a sequential multivariate outlier detection
(SMOD) technique for grouping. The performance of the addition of the introduced
clustering technique will be compared to the already well-known and widely used
method of clustering, i.e., model-based clustering.
2 Literature Review
for multiple imputation of missing data. Tsai et al. [12] defined a class center-
based approach for efficiently generating effective imputation results. Xue et al.
[13] constructed a mixture-based multiple imputation model to handle the variance
of predictive models. Lin et al. [14] proposed a new hybrid multiple imputation
framework and cluster monitoring system based on deep neural network. Samad
et al. [8] introduced missing value imputation using clustering and deep learning
within multiple imputation framework.
Unfortunately, the majority of existing research has not put significant emphasis on the development of an imputation approach that takes into consideration the presence of outliers. Hence, this research presents a clustering approach that relies on the principle
of identifying multivariate outliers through the use of robust distances in handling
missing values. A detailed discussion of this methodology will be provided in the
next section. Multiple imputation uses predictive mean matching model in the ‘mice’
package of R software [15].
One method used to check the presence of a multivariate outlier is the calculation of
the Mahalanobis distance [16], which is defined as follows:
MD_i = [(x_i − µ)ᵀ Σ⁻¹ (x_i − µ)]^{1/2},  i = 1, 2, …, n,   (1)
where µ is the mean vector of the sample and ∑ is the variance–covariance matrix.
An observation is said to be an outlier if its Mahalanobis distance exceeds √(χ²_{p;1−α/2}), where p is the degrees of freedom. However, the Mahalanobis distance is very sensitive to outliers, which can distort the estimates of µ and Σ, resulting in non-outlier data being misidentified as outliers. To overcome this, Rousseeuw and van Zomeren introduced a robust distance (RD) that is more robust to outliers [17]. RD
uses an estimate of the variance–covariance matrix and mean based on the minimum
covariance determinant (MCD).
Assuming x_1, x_2, …, x_n is a random sample of n observed objects in the p-dimensional space R^p, the MCD estimator seeks a subset of h of the n observations whose variance–covariance matrix has the smallest determinant.
RD_i = [(x_i − µ_MCD)ᵀ Σ_MCD⁻¹ (x_i − µ_MCD)]^{1/2},  i = 1, 2, …, n,   (4)
where µmcd is a robust sample mean estimator based on MCD (average of the h
subset of the sample):
µ_MCD = (1/h) \sum_{j=1}^{h} x_j,   (5)

Σ_MCD = (1/(h − 1)) \sum_{j=1}^{h} (x_j − µ_MCD)(x_j − µ_MCD)ᵀ.   (6)
Rousseeuw and van Zomeren [17] used the RD and a quantile of the χ²_p distribution as a cut-off value to detect multivariate outliers. The ith observation with

RD_i ≥ √(χ²_{p;1−α/2})   (7)

is identified as a multivariate outlier.
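As an illustration of Eqs. (4)–(7), a minimal sketch using scikit-learn's MinCovDet to compute robust distances and flag observations beyond the chi-square cut-off is given below. This is an alternative to the R 'mvoutlier' workflow cited by the authors, not their actual implementation, and the alpha value is an assumption.

```python
# Minimal sketch (assumptions: NumPy array X of shape (n, p); alpha = 0.05 as in Eq. (7)).
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def robust_outliers(X: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    mcd = MinCovDet().fit(X)                                   # mu_MCD and Sigma_MCD from the MCD subset
    rd = np.sqrt(mcd.mahalanobis(X))                           # robust distances RD_i, Eq. (4)
    cutoff = np.sqrt(chi2.ppf(1 - alpha / 2, df=X.shape[1]))   # sqrt(chi^2_{p; 1 - alpha/2})
    return rd >= cutoff                                        # True where an observation is flagged as an outlier
```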
Σ_g = λ_g D_g A_g D_gᵀ,   (8)

with λ_g the scalar value that shows the volume of the ellipse.
The normalized root mean square error (NRMSE) is one of numerous metrics used to
evaluate the imputation model of several variables that have far different unit values.
The discrepancy between the imputed values and the true values on the dataset is
quantified using the root mean squared error (RMSE) divided by the mean of the
actual values as the normalized root mean square error (NRMSE). The measurement
requirements are expressed mathematically in Eqs. (9) and (10) as stated below.
RMSE = √( \sum_{i=1}^{m} (X_i − X̂_i)² / m ),   (9)

NRMSE = RMSE / ( (1/m) \sum_{i=1}^{m} X_i ),   (10)

with m the number of imputed values, X_i the actual value, and X̂_i the imputed value.
The techniques presenting lower values of NRMSE demonstrate superior accuracy
in the model imputation of missing values.
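A direct translation of Eqs. (9) and (10) into Python is sketched below; the arrays stand in for the actual and imputed values at the m missing positions.

```python
# Minimal sketch: NRMSE over the m imputed positions (Eqs. 9-10).
import numpy as np

def nrmse(actual: np.ndarray, imputed: np.ndarray) -> float:
    rmse = np.sqrt(np.mean((actual - imputed) ** 2))   # Eq. (9)
    return rmse / np.mean(actual)                      # Eq. (10): normalize by the mean of the actual values
```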
Davies and Gather [20] drew an important distinction between single-step and sequential procedures for outlier detection. Single-step techniques are able to discover all outliers simultaneously, in contrast to the sequential procedure of eliminating
or adding data points. In the sequential process, at each step, every observation is
assessed to determine if it fits the criteria for being identified as an outlier. The data
point with the highest outlyingness measure is subjected to outlier detection. In the
event that an observation is identified as an outlier, it is subsequently removed from
the dataset, and the process is repeated. The process will stop if no more observations
are detected as outliers.
Fig. 1 [Sequential outlier detection steps; subsets labelled Data.2, Data.2.2, Data.2.2.2, …]
The boxes with red lines show observations detected as outliers, while boxes with black lines show the reverse. At the last step in Fig. 1, there were no red-line boxes, which meant no further observations were identified as outliers.
Drawing upon the work of Filzmoser et al. [21], this paper utilizes a robust distance metric that was created for the purpose of detecting outliers in multivariate data [17]. The current update on this work entails the implementation of a sequential multivariate outlier detection approach, which differs from the sequential procedure of Davies and Gather.
The results of outlier detection on the first sequence of the dataset are examined, both for the subset identified as outliers and for the subset not considered outliers. Subsequently, outlier detection is applied to each subset once again. The procedure persists until a specific threshold is attained, such as sequential outlier detection up to four steps. The process is anticipated to yield data clusters, or smaller subsets of variation, because it focuses on identifying the existence of outliers from the very beginning of data grouping. The two figures depict a comparison between the previously employed sequential outlier detection approach and the newly suggested sequential outlier detection procedure.
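A rough sketch of this splitting idea, reusing the robust-distance flag from the earlier snippet, is given below. Only the stopping depth of four steps follows the text; the function names, data structures, and recursion details are illustrative assumptions.

```python
# Minimal sketch (assumption: `robust_outliers` from the earlier snippet; X is an (n, p) array).
import numpy as np

def smod_split(X: np.ndarray, depth: int = 4):
    """Recursively split into outlier / non-outlier subsets for up to `depth` steps."""
    if depth == 0 or len(X) == 0:
        return [X]
    flags = robust_outliers(X)
    outliers, inliers = X[flags], X[~flags]
    if len(outliers) == 0:          # no further outliers detected: stop splitting
        return [X]
    # Both subsets are examined again in the next sequence, unlike the
    # remove-and-repeat procedure of Davies and Gather.
    return smod_split(outliers, depth - 1) + smod_split(inliers, depth - 1)

# The resulting leaf subsets are treated as clusters in the subsequent imputation step.
```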
In Fig. 2, the boxes filled in gray (outlined in either black or white) mark the subsets of data selected as clusters. For comparability, the number of clusters is determined
based on the optimum clusters derived from other methods, namely model-based
clustering. The MBC employs two criteria, namely Akaike information criterion
(AIC) and Bayesian information criterion (BIC), to determine the optimal number
of clusters. This research has not developed a criterion in determining the optimum
number of clusters in clustering-based SMOD.
The primary objective of this article is to mitigate substantial deficiencies in data
by including clusters into the imputation procedure and evaluate the efficacy of the
proposed clustering approach in comparison with current clustering techniques. This
article presents the following contributions. Initially, we do imputation on the datasets
without using clustering. Furthermore, we conducted comprehensive analyses to
illustrate the impact of including clustering analysis on the accuracy of imputation.
Thirdly, a comparison is made between the performance of the pre-existing clustering
approach: model-based clustering (MBC) and the proposed method: clustering-based
sequential multivariate outlier detection (SMOD).
4 Discussion
The data utilized in this study were obtained from the 2016 Economic Census for
Medium and Large Establishments (MLE) in the province of East Java. The official
economic statistics dataset has three variables: turnover (X 1 ), measured in million
Indonesian rupiah (IDR), cost (X 2 ), also measured in million IDR, and worker (X 3 ),
measured as the number of individuals. The compilation includes official economic
information and encompasses a substantial sample size of over 8,000 enterprises.
From a conceptual standpoint, the quantity of employees directly corresponds to
the scale of human resources inside an organization. Consequently, it can be inferred
that an escalation in the number of workers would likely result in a proportional rise in turnover. Moreover, the magnitude of expenses serves as an indicator of
a company’s scale of operations. Consequently, a positive correlation is anticipated
between the growth in expenses and the corresponding increase in turnover [22].
The original dataset does not contain the missing values. To facilitate the research
process, a scenario that includes the missed values for three variables is created
using the ’ampute’ function of the MICE package in the R program. This means that
a missing value has an actual known value.
Figure 3 represents the composition of the dataset, which comprises 4,382
observed values (OV) and 4,170 occurrences of missing values (MV). These values relate to the 8,552 complete records for variables X 1 , X 2 , and X 3 . The observed values
are represented in the uppermost row, where bars are depicted as blue. The missing
values are denoted by bars that have been colored in red.
Table 1 presents the descriptive statistics of the dataset, together with the corre-
sponding rate of missing values. There is a significant range between the highest
(Max) and lowest (Min) values. The range variation among the datasets is quite
wide. Hence, the standard deviation is greater than the mean. The datasets also
contain some outliers, as shown in Fig. 4.
Table 1 Descriptive statistics and missing value rate of case study data (columns: Dataset, Min, Mean, Max, SD, Numbers of OV and MV, Missing rate (%))
Figures 5 and 6 show the distance-distance plot of outlier detection using robust
distance for first step and second step. The process of outlier detections is
accomplished in four steps.
Sequential multivariate outlier detection generates a variety of dataset combina-
tions that can be utilized to create clusters.
The discrepancy in the cluster membership counts between the MBC and SMOD
approaches is notable in Table 2.
According to the findings, the ideal number of groups for MBC is six. This number was then applied to the clustering-based SMOD
approach, which resulted in the formation of data subsets that can be seen as clusters, as illustrated in Fig. 7.
Fig. 7 [Tree of data subsets formed by clustering-based SMOD: the full dataset (n = 8,552) is split into Data.1 (n = 1,881) and Data.2 (n = 6,741), and further sequential splits yield terminal subsets ranging from n = 0 to n = 4,054]
Fig. 8 Plot of plausible value for all clusters of model-based clustering (MBC) for predictive mean
matching imputation (iteration = 20)
Fig. 9 Plot of plausible value for all clusters of clustering-based sequential multivariate outlier
detection (SMOD) for predictive mean matching imputation (iteration = 20)
5 Conclusion
This paper compares the effectiveness of two clustering strategies for dealing with
missing values. The proposed method has better imputation accuracy for some variables, as indicated by lower NRMSE values. The suggested method yields outcomes that are
comparable to those of commonly used clustering techniques. However, the inclusion
of clustering significantly improves the accuracy of all variables. In this paper, we
continue to cluster the entire dataset before creating a scenario with missing values.
This study aims to demonstrate that clustering datasets improves model imputation
accuracy. In our upcoming research, we propose to perform clustering on incom-
plete datasets. It is anticipated that this can be a solution to the frequent occurrence
of missing values in the data survey, especially in the official statistics dataset. For
the extended research, we will add more datasets and simulations for generalization
of outcomes.
References
12. Tsai CF, Li ML, Lin WC (2018) A class center based approach for missing value imputation.
Knowl-Based Syst 151:124–135. https://doi.org/10.1016/j.knosys.2018.03.026
13. Xue Y, Klabjan D, Luo Y (2016) Mixture-based multiple imputation model for clinical data
with a temporal dimension. In: Proceedings of 2019 IEEE international conference on Big
Data, Big Data 2019, pp 245–252. https://doi.org/10.1109/BigData47090.2019.9005672
14. Lin J, Li NH, Alam MA, Ma Y (2019) Data-driven missing data imputation in cluster monitoring
system based on deep neural network. Appl Intell 50(3):860–877. https://doi.org/10.1007/s10
489-019-01560-y
15. van Buuren S, Oudshoorn CGM (2007) MICE: multivariate imputation by chained equations.
R Packag Version 1(3):2007
16. Bedrick EJ, Lapidus J, Powell JF (2000) Estimating the Mahalanobis distance from mixed
continuous and discrete data. Biometrics 56(2):394–401. https://doi.org/10.1111/j.0006-341X.
2000.00394.x
17. Rousseeuw PJ, van Zomeren BC (1990) Unmasking multivariate outliers and leverage points:
rejoinder. J Am Stat Assoc 85(411):648. https://doi.org/10.2307/2289999
18. Filzmoser P, Gschwandtner M (2022) Package ‘mvoutlier’ R Package. version [Online].
Available: http://cstat.tuwien.ac.at/filz
19. Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3):803–821
20. Davies L, Gather U (1993) The identification of multiple outliers. J Am Stat Assoc 88(423):782–
792. https://doi.org/10.1080/01621459.1993.10476339
21. Filzmoser P, Garrett RG, Reimann C (2005) Multivariate outlier detection in exploration
geochemistry. Comput Geosci 31(5):579–587. https://doi.org/10.1016/j.cageo.2004.11.013
22. Takahashi M, Ito T (2013) Multiple imputation of missing values in economic surveys: compar-
ison of competing algorithms. In: Proceedings 59th ISI world statistics congress, no. August,
pp 3240–3245
Sarcasm Detection in Newspaper
Headlines
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 237
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_18
1 Introduction
2 Related Work
3 Methods
The dataset is obtained by merging the headlines from two widespread news websites
called “huffingtonpost.com” (HuffPost) and “theonion.com” (Onion). The news
headlines dataset for sarcasm detection is less prone to noise as the entries are direct headlines written by professional journalists. All the sarcastic posts are gathered from the Onion, which is known for publishing sarcastic news versions of current events, and the real (non-sarcastic) headlines are collected
from HuffPost [25]. The dataset contains 26,710 headlines where around 56% of
headlines are non-sarcastic and the remaining 44% are sarcastic. This ratio of 56:44
is sufficient for the problem statement, and there is no need to use oversampling
or under-sampling techniques. Since the Onion only writes sarcastic headlines, the
labels (or tags) we have for sarcasm are accurate, unlike Twitter data, where many
messages are responses to other messages.
3.2 Pre-processing
The raw data contains various types of characters and numerical values that can
affect the performance of the model. First, all upper-case words are converted to
lower-case words, then punctuations are removed followed by the removal of special
characters using the regular expression library. In the next phase, stop words such as "from", "she", "it", and "are" are removed. Such words are commonly used in all sentences; however, in natural language processing and text mining applications they are not very useful as they carry little information [35]. Thus, they
were removed from all headlines in the dataset. It is also important to transform
words into their base forms. Numerous applications, such as web search, labelling,
and information retrieval, use lemmatization which involves extracting the correct
lemma from a word and then returning its dictionary form [30]. To perform this
lemmatization, we used Wordnet Lemmatizer. The data is then divided into 80%
train set, 10% validation set and 10% test set.
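A minimal sketch of the cleaning, stop-word removal, and WordNet lemmatization steps is shown below; the exact regular expressions are assumptions, since the paper does not list them.

```python
# Minimal sketch (assumptions: NLTK resources 'stopwords' and 'wordnet' available for download).
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(headline: str) -> str:
    text = headline.lower()                                      # lower-case all words
    text = re.sub(r"[^a-z\s]", " ", text)                        # drop punctuation, digits, special characters
    tokens = [t for t in text.split() if t not in STOP]          # remove stop words
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)     # return dictionary (base) forms
```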
Vectorization and Feature Extraction: To perform machine learning on text data, one of the first steps is to transform a document into a vector representation. This process is referred to as feature extraction or text vectorization, and it converts the features into numerical representations. Although machine learning algorithms can perform various mathematical operations, they cannot process raw text data [21]. Instead, they can only process a numerical representation of the text in a vector, and this processing is done through vectorization, which enables machines to recognize and interpret the contents of the text [33]. In this work, we implement two text vectorization techniques: (i) Bag of Words (BoW) and (ii) Term Frequency–Inverse Document Frequency (TF–IDF).
The Bag of Words (BoW) technique converts text documents into numbers, and the result is stored as key-value pairs (a dictionary). Each dictionary item has a key (a word) and a value (the number of times this word appears in the headline). Since BoW only represents the number of times a word appears in a document, it does not consider the context or the order in which words appear, which is important for natural language processing and document classification. Other drawbacks of BoW vectorization include the sparsity of the matrix and vocabulary shortage, i.e. new words that appear only in the test set are not captured in the vocabulary [11].
TF–IDF, a statistical measure, evaluates a word's relevance to a document by considering its frequency within the document and its rarity across multiple documents. It combines term frequency (word occurrences in a document) with inverse document frequency (word prevalence across a document set). Term frequency is determined by the number of times a word appears in a document, normalized for document length; it can be calculated in various ways, including the raw count or the count relative to the most frequent word in the document (see Eq. 1). Inverse document frequency quantifies a word's rarity in the entire document set: it is computed by dividing the total number of documents by the number of documents containing the word and applying a logarithm. Common, widely used words thus receive values approaching 0, while rarer words receive higher values
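As a concrete illustration of the two techniques, the sketch below uses scikit-learn's CountVectorizer and TfidfVectorizer (an assumed implementation choice; the chapter does not specify its tooling).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

headlines = ["area man wins lottery again",
             "man reads newspaper on train",
             "lottery winner loses winning ticket"]

# Bag of Words: each headline becomes a vector of raw word counts (no order, no context).
bow = CountVectorizer(ngram_range=(1, 1))          # (1, 2) would add Bigram features
X_bow = bow.fit_transform(headlines)
print(dict(zip(bow.get_feature_names_out(), X_bow.toarray()[0])))

# TF-IDF: counts are re-weighted so that words common to all headlines receive low scores.
tfidf = TfidfVectorizer(ngram_range=(1, 1))
X_tfidf = tfidf.fit_transform(headlines)
print(X_tfidf.shape)
```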
Fig. 1 Machine learning algorithms pipeline (a). The layered architecture of BiLSTM model (b)
Decision Tree Classifier: The decision tree classifier analyses the characteristics of a dataset to predict its class. It begins at the root node and progresses to subsequent nodes, comparing attributes at each node and following the corresponding branches until it reaches the end of the tree [39]. The process begins with the complete dataset; then, using an Attribute Selection Measure (ASM), the algorithm selects the best attribute to divide the root node into subsets, creating further decision nodes. This process is repeated until nodes can no longer be split. The most commonly used attribute selection measure is based on information gain. Information gain quantifies the reduction in uncertainty (entropy) achieved by splitting a dataset on a particular attribute (see Eq. 4) [40]. It measures the amount of information a feature provides about a given class.
Information Gain = Entropy(parent) − Σ_k (|subset_k| / |parent|) · Entropy(subset_k)    (4)
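A small worked example of Eq. (4), computing entropy from class proportions, is given below; the binary split itself is hypothetical.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # 4 sarcastic, 4 non-sarcastic headlines
left   = np.array([1, 1, 1, 0])               # subset where the attribute holds
right  = np.array([1, 0, 0, 0])               # subset where it does not

gain = entropy(parent) - sum(len(s) / len(parent) * entropy(s) for s in (left, right))
print(round(gain, 3))   # ~0.189 bits of uncertainty removed by this split
```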
Random Forest Classifier: The Random Forest (RF) algorithm is a collection of decision trees [29] trained through bagging or bootstrapping methods. Its purpose is to provide a robust prediction by averaging the outputs of the individual trees. Moreover, RF algorithms can mitigate the overfitting commonly associated with individual decision trees. They create multiple decision trees, each with a root, decision nodes and leaves. The leaf nodes represent a tree's output, and the root node is the initial point where the data is split based on a chosen feature. Furthermore, RF classifiers are good at handling missing data and generating reliable predictions with minimal hyperparameter tuning. They effectively counter overfitting, outperforming individual decision trees in terms of accuracy [3].
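For orientation, a minimal sketch of the pipeline in Fig. 1a, fitting the four classifiers on TF–IDF features, is given below; the hyperparameters and the placeholder training data are assumptions, not the authors' exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}

# Placeholders standing in for the cleaned headlines and their sarcasm labels.
X_train = ["scientist discover water mars", "area man read term condition",
           "stock market close higher", "nation celebrate weekend again"]
y_train = [0, 1, 0, 1]

for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), clf)
    pipe.fit(X_train, y_train)
    print(name, accuracy_score(y_train, pipe.predict(X_train)))
```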
Bi-directional Long Short-Term Memory (BiLSTM): The Long Short-Term Memory (LSTM) network was developed to overcome the long-term dependency problem and retains information for a long time [19]. This is accomplished using a separate memory cell called the cell state, which is passed sequentially through all LSTM units, as shown in Fig. 2. The LSTM usually learns from the beginning of the sentence to its end in a forward manner. However, to obtain the full context of a sentence, it is better to read through it in a backward manner too. This is done in the Bi-directional LSTM (BiLSTM), where two hidden units are used to capture the information of the sentence in both forward and backward directions. This enables the model to capture dependencies in both past and future contexts, enhancing its ability to understand temporal patterns [9].
Compared to conventional recurrent neural networks (RNNs), LSTMs are
designed to tackle the issue of vanishing or exploding gradients, allowing for the
retention of relevant information over longer sequences. They achieve this by main-
taining a cell state, acting as a form of memory that the network can access and
update over time. An LSTM model comprises three gates: input, output, and forget. The forget gate decides what information to discard from the cell state, the input gate determines what new information to store in the cell state, and the output gate controls what part of the cell state is passed on as the output. These gates operate together to update the cell state and produce an output that contributes to the model's predictions [38].
Bi-directional LSTMs process information in both directions. Therefore, they are
useful where context from both past and future events is needed, e.g. in natural
language processing and speech recognition. By combining the outputs from both
directions, the model can make more informed and accurate predictions. This
architectural enhancement greatly expands the capabilities of traditional LSTMs
in understanding complex temporal relationships in data [8].
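A minimal Keras sketch of the layered BiLSTM architecture in Fig. 1b (embedding, bidirectional LSTM, sigmoid output) is shown below; the layer sizes and vocabulary size are illustrative assumptions, not the exact configuration used in the chapter.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM = 20000, 128   # assumed values

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Bidirectional(layers.LSTM(64)),   # forward and backward passes over the headline
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),   # sarcastic vs non-sarcastic
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train_padded, y_train, validation_data=(X_val_padded, y_val), epochs=5)
```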
approach, as it produced comparably stable accuracy for both the Unigram and Bigram analyses.
On the other hand, when the vectorization was changed to Term Frequency–Inverse Document Frequency (TF–IDF), the performance of the Logistic Regression and Naive Bayes classifiers remained consistent in the Unigram analysis, achieving accuracies of 80.32% and 80.54%, respectively, as given in Table 2. In the Bigram analysis, Logistic Regression outperformed the other algorithms, achieving an accuracy of 80.09%. We again notice that the accuracy of the Naive Bayes algorithm dropped for the Bigram analysis, similar to the results of the BoW vectorization method presented in Table 1.
Our results show that sarcasm detection in news headlines is invariant to the vectorization technique: BoW and TF–IDF have little or no effect on the performance of these four machine learning algorithms (see Tables 1 and 2). One reason behind this invariance could be the short length of newspaper headlines. Furthermore, the performance of three of the machine learning algorithms (Logistic Regression, Random Forest and decision tree) remains stable across the two tokenization techniques (Unigram and Bigram); the exception is Naive Bayes, which reports high precision (80.9%) only in the Unigram analysis, regardless of the vectorization method (BoW, TF–IDF).
Furthermore, we extended our analysis to include a Bi-directional LSTM model,
a preferred model for natural language processing as compared to the traditional
machine learning algorithms [32, 16]. In the case of sarcasm detection in news
headlines, the BiLSTM model achieved an accuracy of 88.48% on the test set (93%
on the train set and 90% on the validation set). The proximity of these accuracy
values indicates minimal overfitting. The performance of the model is reported in
Table 3.
Table 1 Results of machine learning models with the Bag of Words (BoW) vectorization technique with Unigram and Bigram analysis

| Algorithm | Unigram accuracy | Unigram recall | Unigram precision | Unigram F1 score | Bigram accuracy | Bigram recall | Bigram precision | Bigram F1 score |
|---|---|---|---|---|---|---|---|---|
| Logistic Regression | 80.68 | 80.63 | 80.68 | 80.53 | 81.01 | 81.11 | 81.01 | 80.77 |
| Naive Bayes Classifier | 80.98 | 81.25 | 80.98 | 80.65 | 72.95 | 79.53 | 72.95 | 69.68 |
| Decision Tree Classifier | 72.86 | 72.74 | 72.86 | 72.77 | 73.50 | 73.32 | 73.50 | 73.31 |
| Random Forest Classifier | 76.28 | 76.86 | 76.28 | 75.54 | 75.17 | 76.91 | 75.17 | 73.81 |
Table 2 Results of machine learning models with the Term Frequency–Inverse Document Frequency (TF–IDF) vectorization technique with Unigram and Bigram analysis

| Algorithm | Unigram accuracy | Unigram recall | Unigram precision | Unigram F1 score | Bigram accuracy | Bigram recall | Bigram precision | Bigram F1 score |
|---|---|---|---|---|---|---|---|---|
| Logistic Regression | 80.32 | 80.33 | 80.32 | 80.11 | 80.09 | 80.20 | 80.10 | 79.81 |
| Naive Bayes Classifier | 80.54 | 81.01 | 80.98 | 80.54 | 72.62 | 79.41 | 72.59 | 69.95 |
| Decision Tree Classifier | 71.77 | 71.70 | 71.78 | 71.74 | 71.64 | 71.41 | 71.64 | 71.29 |
| Random Forest Classifier | 77.12 | 77.37 | 77.12 | 76.62 | 75.62 | 76.78 | 75.62 | 74.55 |
It is observed that BiLSTM and Logistic Regression are the best choices for sarcasm detection in news headlines, yielding average accuracies of 88% and 81%, respectively. The performance of the BiLSTM model can be attributed to its architectural
design which facilitates the retention of extensive sequences of data while effec-
tively filtering out irrelevant information. Moreover, the model uses both backward
and forward LSTM layers, along with a sigmoid layer, to decide which parts of the
information are important for the output [23].
5 Conclusion
This work investigates the performance of four machine learning algorithms (Logistic Regression, Naive Bayes, decision tree, and Random Forest) and one deep learning model (Bi-directional LSTM) for sarcasm detection in news headlines. We demonstrate that sarcasm detection in news headlines is invariant to the vectorization technique: BoW and TF–IDF have little or no effect on the performance of the machine learning algorithms. Moreover, the performance of three of the machine learning algorithms (Logistic Regression, Random Forest and decision tree) remains stable across the two tokenization techniques (Unigram and Bigram), whereas Naive Bayes achieved high precision only with the Unigram analysis. In a nutshell, we conclude that BiLSTM is the preferred model for sarcasm detection in news headlines.
In future work, we plan to evaluate the performance of the BiLSTM model on a
combined dataset sourced from multiple platforms, such as Twitter, Facebook, Insta-
gram, Reddit, news headlines, and various e-commerce websites. Another interesting
direction is to compare the performance of the BiLSTM model with pre-trained Large
Language Models (LLMs) for sarcasm detection, focusing on efficiency in terms of
speed and computational cost.
Acknowledgements The authors thank UNITAR International University for supporting the publication of this paper.
References
16. Lin R (2022) Comment texts sentiment analysis based on improved bi-LSTM and Naive
Bayes. In: 2022 international conference on data analytics, computing and artificial intelligence
(ICDACAI), pp 407–412. IEEE
17. Maynard DG, Greenwood MA (2014) Who cares about sarcastic tweets? Investigating the
impact of sarcasm on sentiment analysis. In: LREC 2014 proceedings. ELRA
18. Mehmood T, Gerevini AE, Lavelli A, Olivato M, Serina I (2023) Distilling knowledge with a
teacher’s multitask model for biomedical named entity recognition. Information 14(5):255
19. Mehmood T, Gerevini A, Lavelli A, Serina I (2019) Leveraging multi-task learning for biomed-
ical named entity recognition. In: AI*IA 2019—advances in artificial intelligence—XVIIIth
international conference of the Italian Association for Artificial Intelligence, Rende, Italy,
19–22 Nov 2019, Proceedings. Lecture notes in computer science, vol 11946, pp 431–444.
Springer, Berlin
20. Mehmood T, Gerevini A, Lavelli A, Serina I (2019) Multi-task learning applied to biomedical
named entity recognition task. In: Proceedings of the sixth italian conference on computational
linguistics, Bari, Italy, 13–15 Nov 2019. CEUR Workshop Proceedings, vol 2481. CEUR-
WS.org
21. Mehmood T, Gerevini AE, Lavelli A, Serina I (2020) Combining multi-task learning with
transfer learning for biomedical named entity recognition. In: Knowledge based and intelligent
information & engineering systems: proceedings of the 24th international conference KES-
2020, Virtual Event, 16–18 Sept 2020. Procedia Computer Science, vol 176, pp 848–857.
Elsevier
22. Mehmood T, Md Rais HB (2016) Machine learning algorithms in context of intrusion detection.
In: 2016 3rd international conference on computer and information sciences (ICCOINS), pp
369–373
23. Mehmood T, Serina I, Lavelli A, Putelli L, Gerevini A (2023) On the use of knowledge transfer
techniques for biomedical named entity recognition. Fut Internet 15(2):79
24. Mishra A, Kanojia D, Nagar S, Dey K, Bhattacharyya P (2017) Harnessing cognitive features
for sarcasm detection. arXiv preprint arXiv:1701.05574
25. Misra R (2022) News headlines dataset for sarcasm detection. arXiv preprint arXiv:2212.06035
26. Mouthami K, Devi KN, Bhaskaran VM (2013) Sentiment analysis and classification based
on textual reviews. In: 2013 international conference on Information communication and
embedded systems (ICICES). pp 271–276. IEEE
27. Ortigosa A, Martín JM, Carro RM (2014) Sentiment analysis in facebook and its application
to e-learning. Comput Hum Behav 31:527–541
28. Pawar N, Bhingarkar S (2020) Machine learning based sarcasm detection on twitter data. In:
2020 5th international conference on communication and electronics systems (ICCES), pp
957–961. IEEE
29. Pini M, Scalvini A, Liaqat MU, Ranzi R, Serina I, Mehmood T (2020) Evaluation of machine
learning techniques for inflow prediction in Lake Como, Italy. In: Knowledge-based and intel-
ligent information & engineering systems: proceedings of the 24th international conference
KES-2020, Virtual Event, 16–18 Sept 2020. Procedia Computer Science, vol 176, pp 918–927.
Elsevier
30. Plisson J, Lavrac N, Mladenic D et al (2004) A rule based approach to word lemmatization.
In: Proceedings of IS, vol 3, pp 83–86
31. Porwal S, Ostwal G, Phadtare A, Pandey M, Marathe MV (2018) Sarcasm detection using
recurrent neural network. In: 2018 second international conference on intelligent computing
and control systems (ICICCS), pp 746–748. IEEE
32. Preethi V et al (2021) Survey on text transformation using bi-LSTM in natural language
processing with text data. Turkish J Comput Math Educ (TURCOMAT) 12(9):2577–2585
33. Prokhorov S, Safronov V (2019) AI for AI: what NLP techniques help researchers find the
right articles on NLP. In: 2019 international conference on artificial intelligence: applications
and innovations (IC-AIAI), pp 76–765. IEEE
34. Rajadesingan A, Zafarani R, Liu H (2015) Sarcasm detection on twitter: a behavioral modeling
approach. In: Proceedings of the eighth ACM international conference on web search and data
mining, pp 97–106
K. Dhayalan (B)
Liverpool John Moores University, Liverpool, UK
e-mail: [email protected]
N. Sultanova · J. Mustafina
Kazan Federal University, Kazan, Russia
e-mail: [email protected]
J. Mustafina
e-mail: [email protected]
P. Daud
School of Information Technology, UNITAR International University, Petaling Jaya, Selangor,
Malaysia
e-mail: [email protected]
1 Introduction
2 Background
The language feature extraction process improves the performance of NER models for low-resourced languages. Regex-based feature extraction and morphological analyzers have been explored by different NER models for the Tamil language.
Numerical values, dates and times are not needed for this task, so these values are considered noise in the dataset. Extracting and removing them helps to reduce complexity. Rajendran et al. extracted these values using regular expressions (REGEXP) [2].
Nzeyimana and Rubungo explored a supervised method for sub-word tokenization for Kinyarwanda, a low-resourced language from Africa. A language model architecture is proposed in which most input words are represented by the morphological parse generated by a morphological analyzer. The BPE tokenizer of BERT is used for words that the morphological analyzer cannot decompose, such as foreign words, misspellings and proper names [5].
In Tamil, morphs are usually suffixed to the root or base. Most of the words
in Tamil have two different parts: (1) base and (2) grammatical formatives. It is
possible to add grammatical formatives through six morphophonological operations,
which include affixation, incorporation, compounding, cliticization, doubling and
stem mutation. Sarveswaran et al. proposed morphological parsing techniques to extract morphological features [6].
A considerable amount of named entity recognition work has been done for the Tamil language using machine learning and deep learning algorithms.
The Random Kitchen Sink (RKS) algorithm is used to classify nonlinear data such as text. Unlike conventional nonlinear kernel methods, RKS works faster and performs better with less data. RKS depends only on the feature size and ignores the number of data points. This RKS algorithm was used by Abinaya N., Anand Kumar M. and Soman K. P. to identify the named entities person, location and organization. The model was trained with the FIRE NER dataset, and the results were compared with SVM and CRF models.
Bidirectional long short-term memory captures sequence information in both directions, forward (left to right) and backward (right to left). This Bi-LSTM algorithm has been used to predict named entities in Tamil, Telugu, Kannada, Malayalam and Hindi [7]. The model is also trained on the FIRE NER corpus to identify the named entities person, location and organization.
BERT is a transformer-based model designed to be pre-trained on unlabelled text and then fine-tuned for specific NLP tasks. BERT-based models learn language features automatically, which reduces labour-intensive and skill-dependent feature engineering. A BERT pre-trained model was used to extract chemical and protein entities from Spanish biomedical texts [8]. Clinical named entity recognition in the Chinese language is explored by Chang et al. [9].
The multilingual BERT (Pires et al., n.d.) is trained on 104 languages, including Tamil [10]. The entire Tamil Wikipedia text was extracted to train mBERT and normalized against the other language datasets by oversampling/under-sampling. Baumann explored and evaluated the multilingual BERT model on NER for German and English datasets [11].
The XLM-R model is trained on around 100 languages, including Tamil and the other Dravidian languages (Telugu, Kannada and Malayalam) [12]. This model was selected in the expectation of better accuracy, as it is trained on more than 2.5 terabytes of Common Crawl data.
XLM-R Base and Large were explored for the Turkish NER dataset [13]. mBERT and BERTurk models were also explored along with XLM-R, and all the models were tried with and without a CRF layer. The models with the CRF layer perform better than those without it.
Indic BERT is trained using datasets sourced from news crawls in 11 Indian languages and English [14]; Tamil is one of these 11 languages. This model has the unique advantage of being trained only on Indian languages (plus English), unlike mBERT and XLM-R, which are trained on more than 100 languages.
Multilingual language models like mBERT and XLM-R are mostly trained on 100+ languages, which limits the capacity devoted to any single Indian language. MuRIL addresses this challenge [15]: it is specifically built and trained for Indian languages and supports 16 Indian languages and English. Sharma, Morwal and Agarwal used MuRIL to extract named entities from Hindi texts. They explored two variants of the MuRIL architecture: (1) with a fully connected and SoftMax layer, and (2) fully connected with CRF layers. The performance results are listed in Table 1.
3 Methodology
The proposed NER model includes feature extraction methods and named entity
recognition model (Fig. 1).
WikiANN-ta, the dataset used in this task, is a silver-standard multilingual dataset with 25.6k sentences extracted from the Tamil Wikipedia and annotated with person, location and organization entities. The data is split into train, validation and test with
22.6k, 1.5k and 1.5k examples as per the split ratio of the huggingface WikiANN-ta
dataset. The data is annotated in the standard IOB format. Table 2 describes each tag.
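For reference, the WikiANN Tamil split is also distributed through the Hugging Face datasets hub; a hedged loading sketch (not necessarily the route the authors used) is shown below.

```python
from datasets import load_dataset

wikiann_ta = load_dataset("wikiann", "ta")   # Tamil subset, IOB-tagged
print(wikiann_ta)                            # train / validation / test splits

example = wikiann_ta["train"][0]
print(example["tokens"])
print(example["ner_tags"])                   # integer ids for O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC
```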
The training dataset has 22.6k records. The training data has 16.8k I-ORG tags, which is 28% of all tags; it has 13.1k B-LOC tags and 9.7k I-PER tags, which are 21% and 16% of the total tag count, respectively. B-PER and I-LOC tags are 11% each of the total, with 6.9k and 6.4k tags, respectively.
The validation dataset has 1.5k records. Like the training data, it also has the most I-ORG tags: 1.3k, which is 29% of all tags. It has 831 I-LOC tags and 759 B-LOC tags, which are 19% and 17% of the total tag count. B-PER and I-PER tags are 9% and 12% of the total, with 401 and 536 tags, respectively.
The test dataset has 1.5k records. The test dataset has 1592 I-ORG tags, which is 34% of all tags. It has 671 B-ORG tags and 659 B-LOC tags, about 14% each of the total tag count. I-LOC and I-PER tags are 13% of the total each, with 609 and 603 tags, respectively.
The proposed model is based on a hybrid deep neural network approach which uses BERT-variant models for token representations. The architecture of the BERT-variant language models has 12 encoder layers with 12 self-attention heads and a hidden dimension of 768, so every input token is represented by a 768-dimensional vector. The output of the last layer of the BERT-variant models (mBERT, Indic BERT, XLM-R, MuRIL Large, MuRIL base) is passed to the NER model, which is based on the spacy.TransitionBasedParser.v2 architecture with state type "ner". Out of the five transformer models, MuRIL base is selected for further fine-tuning, as it performs better in terms of accuracy and execution time. There are no out-of-context numeric values such as dates and times in the WikiANN dataset. Dataset sentences do contain numeric values, all of which are tagged as "O"; removing them helps to reduce the class imbalance ratio.
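To illustrate the token-representation side, the sketch below fine-tunes a MuRIL-style encoder as a plain token classifier with the Hugging Face Transformers library. Note that the chapter itself feeds the encoder output into a spaCy TransitionBasedParser.v2 head, which is not reproduced here; the model name and label set are assumptions based on the public MuRIL checkpoint and the WikiANN tag scheme.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "google/muril-base-cased",                 # 12 encoder layers, 768-dimensional hidden states
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

enc = tokenizer("தமிழ்நாடு அரசு", return_tensors="pt")   # example Tamil text
logits = model(**enc).logits                              # (1, sequence_length, 7) per-token scores
print(logits.shape)
```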
3.4 Optimizers
Stochastic gradient descent, Adam and RAdam optimizers are explored as part of
this task.
The MuRIL-based NER model is trained and validated with the same datasets, and the performance of the MuRIL model is better than mBERT, Indic BERT and XLM-R.
In the training phase, the MuRIL base model wrongly predicts 324 I-LOC tags as I-ORG, and 134 B-LOC tags are wrongly predicted as B-ORG. The wrong-prediction pattern looks the same as for the Indic BERT and XLM-R predictions on the validation dataset (Fig. 2).
The F1-score of the training phase is 84.92%, as given in Table 3, and the performance is increased by 1.6% compared to the XLM-R model.
In the testing phase, the MuRIL base model wrongly predicts 143 I-LOC tags as I-ORG, followed by 94 wrong predictions of O tags as I-PER and 90 I-PER tags as B-PER (Fig. 3).
The F1-score of the test phase, 90.15%, is given in Table 4, and the performance is increased by 2% compared to the XLM-R model.
4.2 MuRIL-Large
The MuRIL-Large NER model is trained and validated with the same datasets, and its performance is better than all the BERT models tried before.
In the training phase, the MuRIL Large model wrongly predicts 273 I-LOC tags as I-ORG, 134 O tags as B-PER and 128 B-LOC tags as B-ORG. The wrong-prediction pattern looks the same as for the Indic BERT and XLM-R predictions on the validation dataset (Fig. 4).
The F1-score of the training phase is 85.51%, as given in Table 5, and the performance is increased by 0.5% compared to the MuRIL base model.
In the testing phase, the MuRIL-Large model wrongly predicts 141 I-LOC tags as I-ORG, followed by 90 wrong predictions of I-PER tags as B-PER (Fig. 5).
The F1-score of the test phase, 90.78%, is given in Table 6, and the performance is increased by 0.6% compared to the MuRIL base model.
Precision, recall and F1-score of the above five BERT-based NER models are
compared, and the optimized model is chosen for further fine-tuning.
4.3.1 Precision
In test dataset predictions, MuRIL Large model performs better than all other models
with the highest precision score of 0.9053. The precision score of the MuRIL base
is 0.8961 which is close to MuRIL-Large (0.9% less) (Fig. 6).
4.3.2 Recall
The recall score follows the same pattern as the precision score. In test dataset
predictions, MuRIL Large model performs better than all other models with the
highest recall score of 0.9104. The recall score of the MuRIL base is 0.9069 which
is close to MuRIL-Large (0.35% less) (Fig. 7).
4.3.3 F1-score
The recall and precision scores have the same pattern, so F1-score follows the same.
In test dataset predictions, MuRIL Large model performs better than all other models
with the highest F1-score of 0.9078. The F1-score of the MuRIL base is 0.9015 which
is close to MuRIL-Large (0.69% less) (Fig. 8).
Fig. 9 MuRIL base NER comparison in test dataset—with/without REGEX and morphological
analysis
The MuRIL base model performs better with the categorical cross-entropy loss function, which is the default; the F1-score difference between categorical and sequential categorical cross-entropy is 0.0004 (Fig. 10).
The MuRIL base model also performs better with the Adam optimizer, which is the default, with an F1-score difference of 0.0023 from the next-best optimizer (Fig. 11).
[Fig. 12: F1-score of the MuRIL base NER model across learning rates]
The MuRIL base model performs better with the learning rate 5.E−05. The F1-score
difference is 0.0016 between 5.E−05 and 6.E−05 (Fig. 12).
The MuRIL base model performs better with the dropout rate 0.1. The F1-score
difference is 0.0008 between 0.1 and 0.3 (Fig. 13).
MuRIL base and MuRIL large NER models achieve state-of-the-art results with the highest F1-scores of 90.15 and 90.78, respectively; the improvement from MuRIL base to MuRIL Large is minimal.
[Fig. 13: F1-score of the MuRIL base NER model across dropout rates]
5 Conclusion
This research examines in detail the advantages of using language models such as BERT, multilingual BERT (mBERT), XLM-RoBERTa, Indic BERT and Multilingual Representations for Indian Languages (MuRIL) in NER models with an ensemble approach.
The most appropriate optimizers (stochastic gradient descent, RAdam, Adam), loss functions (categorical cross-entropy and negative log-likelihood) and evaluation measures (precision, recall and F1-score) matching predictions against the gold standard are discussed in detail.
The advantages of language model pre-training methods for the NER task are discussed in detail, as are data augmentation methods such as knowledge-based systems for multilingual named entity recognition, external context retrieval and cooperative learning.
The findings of the experiment show that the NER models with MuRIL base and MuRIL Large perform better than the other models. The F1-score of MuRIL base is almost equal to that of MuRIL-Large, but the NER model with MuRIL base trains and predicts faster owing to its smaller number of encoder layers: MuRIL base has only 12 encoder layers, 50% of those in MuRIL-Large. So MuRIL base is considered the optimal model in terms of both accuracy and run-time performance.
As a result, two different NER models are built: (1) MuRIL Large followed by a feed-forward (FF) network, with the Adam optimizer, categorical cross-entropy loss function, learning rate of 5.E−05 and dropout rate of 0.1, for better accuracy; and (2) MuRIL base followed by an FF network, with the Adam optimizer, categorical cross-entropy loss function, learning rate of 5.E−05 and dropout rate of 0.1, for better run time.
Further work can involve class balancing techniques, as there is a class imbalance issue in the WikiAnn-ta dataset.
Multilingual pre-trained models have produced state-of-the-art results in NLP tasks for low-resourced languages, but recent research shows that pre-training BERT with a high-quality monolingual corpus can outperform the multilingual models. Language model pre-training generally improves performance for any NLP task. Fine-tuning language models such as XLM-R and MuRIL Large using the T-NER library may improve the performance of the model.
Performance can be significantly improved by fetching the document-level context of each sentence and feeding it into the NER model. The document-level context of a sentence is retrieved by searching for semantically relevant texts using a search engine. This mirrors how manual annotators work: they use their domain knowledge to annotate named entities and consult a search engine whenever there is a gap.
Ensembling is one of the proven approaches for complex datasets. A two-step supervised ensemble classification may improve performance: (1) generating classification results using multiple weak classifiers, and (2) integrating those results using a consistency function with voting schemes.
The major contribution of this study is that the proposed NER model can be used to retrieve people, organizations and places from documents in these domains and in other NLP applications such as information extraction, machine translation and question answering. The model can also be used for information extraction from government documents of Tamil Nadu, India, Tamil news articles, and law and medical documents in Tamil.
Acknowledgements The authors thank UNITAR International University for supporting this
research’s publication.
References
1. Abinaya N, Anand Kumar M, Soman KP (2015) Randomized kernel approach for named entity
recognition in Tamil. Indian J Sci Technol 824
2. Rajendran S, Cn S, Srinivasan R, Subalalitha CN (2019) Automated named entity recognition
from Tamil documents 2019. In: IEEE 1st international conference on energy, systems and
information processing, ICESIP. IEEE (2019). https://www.researchgate.net/publication/338362244
3. Hariharan V, Anand Kumar M, Soman KP (2019) Named entity recognition in Tamil language
using recurrent based sequence model. In: Lecture notes in networks and systems, pp 91–99.
Springer
4. Mohseni M, Fondazione AT, Kessler B (n.d.) MorphoBERT: a Persian NER system with BERT
and morphological analysis
5. Nzeyimana A, Rubungo AN (n.d.) KinyaBERT: a morphology-aware Kinyarwanda language
model. https://github.com, last accessed 26 Nov 2022
6. Sarveswaran K, Dias G, Butt M (2021) ThamizhiMorph: a morphological parser for the Tamil
language. Mach Transl 35:37–70. https://doi.org/10.1007/s10590-021-09261-5
7. Nagesh Bhattu S, Satya Krishna N, Somayajulu DVLN (n.d.) idrbt-team-a@IECSIL-FIRE-
2018: named entity recognition of Indian languages using Bi-LSTM. http://www.idc.upenn.edu, last accessed 5 Feb 2022
8. Sun C, Yang Z (n.d.) Transfer learning in biomedical named entity recognition: an evaluation
of BERT in the PharmaCoNER task, pp 100–104
9. Chang Y, Kong L, Jia K, Meng Q (2021) Chinese named entity recognition method based
on BERT. In: 2021 IEEE international conference on data science and computer application
(ICDSCA), pp 294–299. IEEE. https://ieeexplore.ieee.org/document/9650256/, last accessed
23 Jan 2022
Transformer-Based Named Entity Recognition Model—Tamil Language 267
10. Pires T, Schlinger E, Garrette D (n.d.) How multilingual is Multilingual BERT? pp 4996–5001.
https://github.com/google-research/bert, last accessed 12 Feb 2022
11. Baumann A (2019) Multilingual language models for named entity recognition in German and
English. In: Proceedings of the student research workshop associated with RANLP 2019, pp
21–27. Incoma Ltd. https://acl-bg.org/proceedings/2019/RANLPStud%202019/pdf/RANLPStud004.pdf
12. Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave E, Ott M,
Zettlemoyer L, Stoyanov V (n.d.) Unsupervised cross-lingual representation learning at scale.
https://github.com/facebookresearch/cc, last accessed 12 Feb 2022
13. Aras G, Makaroğlu D, Demir S, Cakir A (2021) An evaluation of recent neural sequence
tagging models in Turkish named entity recognition. Exp Syst Appl 182
14. Kakwani D, Kunchukuttan A, Golla S, Gokul NC, Bhattacharyya A, Khapra MM, Kumar P
(n.d.) IndicNLPSuite: monolingual corpora, evaluation benchmarks and pre-trained multilin-
gual language models for Indian languages, vol 3, p 5. https://github.com/AI4Bharat/indicnlp_catalog, last accessed 12 Feb 2022
15. Khanuja S, Bansal D, Mehtani S, Khosla S, Dey A, Gopalan B, Margam DK, Aggarwal P,
Nagipogu RT, Dave S, Gupta S, Chandra S, Gali B, Subramanian V, Talukdar P (n.d.) MuRIL:
multilingual representations for Indian languages. https://oscar-corpus.com, last accessed July
Feb 2022
16. Sharma R, Morwal S, Agarwal B (2022) Named entity recognition using neural language model
and CRF for Hindi language. Comput Speech Lang 74:101356
A Comparative Study of Methods
for Topic Modelling in News Articles
Abstract The past few decades have seen an increase in textual data and in the influence of the news media. With the rise in available data, especially textual data from news media, it is imperative to be able to categorise news topics quickly. The primary aim of this research is to suggest a method for automatically identifying news topics in articles. The dataset used was the news category dataset published on Kaggle, comprising 210,294 headlines and abstracts from HuffPost between 2012 and 2022, with a total of 42 categories and six columns. Traditional modelling techniques did not perform well in comparison with Top2Vec, NMF or BERTopic. This research confirms the efficacy of Top2Vec and BERTopic, followed by NMF, LDA and LSA, for analysing news category data from a human-interpretation perspective. Though BERTopic was able to deduce 1145 topics from the data, it could not discard unwanted words such as "to", "say" and "for" which add no value to the topic semantics. In summary, TF-IDF proved to be the best feature extraction technique and Top2Vec the best topic modelling technique.
1 Introduction
Textual data provides one of the most significant sources of knowledge and informa-
tion throughout history till the present day [1]. The past few decades have seen an
increase in the influence of the news media on the public via the dissemination of
S. D. Rajan · M. Jayabalan
Liverpool John Moores University, Liverpool, UK
T. Coombs (B)
British American Tobacco, Southampton, UK
e-mail: [email protected]
N. A. Ismail
UNITAR International University, Petaling Jaya, Selangor, Malaysia
2 Method
2.1 Dataset
The dataset used in this research was the news category dataset published on Kaggle, comprising 210,294 headlines and abstracts from HuffPost between 2012 and 2022. The dataset is composed of a total of 42 categories and six columns (category; online article; web link; headline; short description; date published).
The models were implemented with five unsupervised topic modelling techniques: latent semantic analysis (LSA), latent Dirichlet allocation (LDA), non-negative matrix factorisation (NMF), BERTopic and Top2Vec. The first approach was LSA, which used the singular value decomposition (SVD) function of the open-source Python library scikit-learn with ten n_components and the randomised algorithm to build the model. We implemented SVD with dimensions of 5, 15 and 20, using a vectoriser to derive features for LSA. To build the LDA model, we used scikit-learn's LDA module with ten n_components, the online learning method and a learning offset of 25, achieving a coherence score of 0.41. This was followed by hyperparameter tuning to find the optimum k, done by grid search over the range 5 to 40. An optimum k value of 20 was found for this dataset, achieving a coherence of 0.43, and the model was visualised using pyLDAvis to generate inter-topic distance maps. In contrast to LSA, NMF is a non-probabilistic decomposition algorithm using matrix factorisation, belonging to the group of linear-algebraic algorithms.
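A condensed sketch of these scikit-learn set-ups is shown below, with toy documents and a reduced number of components (the study itself used ten n_components); the vectoriser choices are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

docs = ["election results shake the markets",
        "new vaccine trial shows promise",
        "team wins the championship final",
        "markets rally after election night"]

# LSA: TF-IDF features followed by randomised truncated SVD.
X_tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
lsa = TruncatedSVD(n_components=2, algorithm="randomized").fit(X_tfidf)

# LDA: count features with the online learning method and learning_offset=25, as in the text.
X_counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                learning_offset=25.0, random_state=0).fit(X_counts)
print(lsa.components_.shape, lda.components_.shape)
```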
The LSA model uses bag of words (BoW) features and was built using the Gensim module. Next, the coherence measure (c_v) was plotted against the number of topics to find the optimal number of topics with good coherence. It was observed that after topic nine there was a notable drop in coherence; hence, nine topics were chosen as "k", giving a coherence score of 0.29 for the LSA model.
For LDA, on the other hand, hyperparameter tuning was performed to find the optimum "k", showing that increasing the number of topics increases the coherence score until a peak at 25 topics. Therefore, the optimal number of topics for the LDA model was chosen as 25, with a coherence score of 0.43. However, the high coherence with 25 topics did not reflect good topics, as many of the keywords were repeated: the model parameters need fine-tuning to minimise overlap. Based on this, the LDA model parameters were adjusted to chunk size = 10,000, iterations = 100 and passes = 40, giving a model coherence of 0.39. With the above observations, it can be confirmed that the LDA model performs better than the LSA model after hyperparameter tuning.
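For illustration, a Gensim sketch of the LDA configuration and the c_v coherence evaluation referred to above is given below; chunksize, iterations and passes follow the values quoted in the text, while the tokenised corpus is a toy stand-in.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["election", "markets", "rally"],
         ["vaccine", "trial", "health"],
         ["team", "championship", "final"],
         ["markets", "election", "economy"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               chunksize=10000, iterations=100, passes=40, random_state=1)

coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                           coherence="c_v").get_coherence()
print(coherence)
```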
The NMF model was built using Gensim module with special attention focused on
hyperparameter tuning (kappa, chunk size, passes). Figure 2 shows the NMF model
with a coherence score of 0.42 and topic number 25.
LDA, on the other hand, is more focused on COVID-19-related topics and generates repetitive topics about politics. NMF appears to outperform LDA, as the NMF model distinguishes between all the indicated topics. However, because both LDA and NMF rely on hyperparameters, the results are not unexpected, but neither had a thorough grasp of the phenomenon. The next two approaches focus on embedding techniques to overcome this issue.
Fig. 2 Hyperparameter tuning for optimal topic number for LDA model
still, Top2Vec was able to identify diverse topics compared to BERTopic, which produced multiple concepts and many overlapping topics. On the other hand, all the traditional topic modelling techniques, including LDA and LSA, failed to extract meaningful topics. Since NMF was able to provide some intriguing topics, this study recommends NMF over LDA.
The term score decline per topic shows the importance of each keyword in the extracted topic. It is an interactive Plotly line chart with the term rank on the x-axis and the c-TF-IDF score on the y-axis. Another interesting visualisation is the inter-topic distance map, which depicts the distance between topics: themes that are highly distinct from one another are far apart, whereas similar topics are closer together. As shown in Fig. 5, there are six topic groups covering all the topics; within the same topic group, topics with similar semantic meanings are clustered together.
4 Conclusions
In summary, TF-IDF proved to be the best feature extraction technique and Top2Vec the best topic modelling technique. Traditional modelling techniques did not perform well in comparison with Top2Vec, NMF or BERTopic. Though BERTopic was able to deduce 1145 topics from the data, it could not discard unwanted words such as "to", "say" and "for" which add no value to the topic semantics. Additionally, BERTopic requires an extra data pre-processing step to remove noisy words. Special attention should be paid to pre-processing and to optimising the model hyperparameters when the model in scope is LDA, LSA or NMF, as these can require domain knowledge and sufficient resources. It should also be noted that each of these models may perform differently on different datasets. Future studies should use the Spark environment on the same dataset, which offers faster hyperparameter tuning for the traditional algorithms. This will save media analysts from categorising news articles manually.
Acknowledgements The authors would like to thank UNITAR International University for
supporting the publication of this paper.
References
1. Zhou Z, Qin J, Xiang X, Tan Y, Liu Q, Xiong NN (2020) News text topic clustering optimized
method based on TF-iDF algorithm on spark. Comput Mater Continua 621:217–231
2. Van Dijk TA (1995) Power and the news media. Polit Commun Action 6(1):9–36
3. Schudson M (2002) The news media as political institutions. Annu Rev Polit Sci 5(1):249–269
4. Holt K, Ustad Figenschou T, Frischlich L (2019) Key dimensions of alternative news media.
Digit J 7(7):860–869
5. Alam KM, Hemel MTH, Muhaiminul Islam SM, Akther A (2020) Bangla news trend obser-
vation using LDA based topic modeling. In: ICCIT 2020—23rd international conference on
computer and information technology, proceedings, pp 19–21
6. Xia L, Luo D, Zhang C, Wu Z (2019) A survey of topic models in text classification. In: 2019
2nd international conference on artificial intelligence and big data, ICAIBD 2019, pp 244–250
7. Zosa E, Granroth-Wilding M (2019) Multilingual dynamic topic model. In: International
conference recent advances in natural language processing, RANLP, 2019-Septe, pp 1388–1396
8. Tabassum A, Patil RR (2020) A survey on text pre-processing & feature extraction techniques
in natural language processing. Int Res J Eng Technol [online] 4864–4867. Available at: www.irjet.net
9. Sethia K, Saxena M, Goyal M, Yadav RK (2022) Framework for topic modeling using BERT,
LDA and K-means. In: 2022 2nd international conference on advance computing and innovative
technologies in engineering, ICACITE 2022, pp 2204–2208
10. Rahmawati D, Khodra ML (2016) Word2vec semantic representation in multilabel classifica-
tion for Indonesian news article. In: 4th IGNITE conference and 2016 international conference
on advanced informatics: concepts, theory and application, ICAICTA 2016, pp 0–5
11. Qiang J, Qian Z, Li Y, Yuan Y, Wu X (2022) Short text topic modeling techniques, applications,
and performance: a survey. IEEE Trans Knowl Data Eng 343:1427–1445
12. Rahamat Basha S, Rani JK (2019) A comparative approach of dimensionality reduction
techniques in text classification. Eng Technol Appl Sci Res 96:4974–4979
A Comparative Study of Methods for Topic Modelling in News Articles 277
13. Bansal S, Srivastava A, Arora A (2017) Topic modeling driven content based jobs recommen-
dation engine for recruitment industry. Procedia Comput Sci [online] 122:865–872. Available
at: https://doi.org/10.1016/j.procs.2017.11.448
14. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
15. Chen F, Xie S, Li X, Li S, Tang J, Wang T (2019a) What topics do images say: a 70 neural
image captioning model with topic representation. In: Proceedings—2019 IEEE international
conference on multimedia and expo workshops, ICMEW 2019, pp 447–452
16. Deng X, Smith R, Quintin G (2020) Semi-supervised learning approach to discover enterprise
user insights from feedback and support [online]. Available at: http://arxiv.org/abs/2007.09303
17. Zhang F, Gao W, Fang Y (2019) News title classification based on sentence-LDA model and
word embedding. In: Proceedings—2019 international conference on machine learning, big
data and business intelligence, MLBDBI 2019, pp 237–240
18. Mohamed AHH, Tawfik H, Norton L, Al-Jumeily D (2011, April) e-HTAM: a technology
acceptance model for electronic health. In: 2011 international conference on innovations in
information technology, pp 134–138. IEEE
19. Al-Jumeily D, Hussain A, Alghamdi M, Dobbins C, Lunn J (2015) Educational crowdsourcing
to support the learning of computer programming. Res Pract Technol Enhanc Learn 10:1–15
Application of Deep Learning Algorithms
to Terahertz Images for Detection
of Concealed Objects
Abstract The safety of the public at large venues is of utmost importance, and it is therefore important to detect threats from concealed objects, especially on human bodies. Terahertz imaging has gained popularity over the last few years due to its ability to detect objects concealed under fabric without harming humans or invading their privacy. However, terahertz images suffer from poor resolution and a low signal-to-noise ratio. Deep learning, especially feedforward neural networks, has recently shown high performance in classifying images. Therefore, this study utilised deep learning for the detection of concealed objects on human bodies using the public active terahertz imaging (ATZ) dataset, which contains images of 11 classes of tiny objects. An end-to-end framework was applied, involving image enhancement using wavelet filtering, object localisation using a skip-attention generative adversarial network (SAGAN) and classification using a forward-forward mixed convolution with Gaussian affinity network (FMCGNet). The results showed that wavelet filtering did not affect the performance of the deep learning models. SAGAN showed a performance accuracy of 68.4%, which was related to insufficient training of the model, which requires high computational power. However, the accuracy of classifying images with anomalies was high when FMCGNet was applied, reflected in a true positive rate of 91.9% even though it was applied to an imbalanced dataset. In summary, the proposed end-to-end framework successfully identified concealed objects on human bodies in an efficient way. Future work involves applying this approach to different types of imaging datasets to further understand its strengths and limitations.
1 Introduction
2 Methods
2.1 Dataset
The dataset used in this study was the public active terahertz imaging (ATZ) dataset, which contains images of tiny items [20]. As with any terahertz (THz) images, the images are extremely noisy and of low contrast [9, 21]. Moreover, the images contain tiny objects of different shapes, sizes, scales and orientations, which makes the dataset imbalanced.
The dataset contains 3157 images with 1349 objects of 11 different classes, relating to the material and variety of the objects in the images. Images were split into training (n = 2555) and test (n = 602) sets. The 11 classes are: Kitchen Knife (KK) (n = 100); Gun (GA) (n = 116); Metal Dagger (MD) (n = 64); Scissors (SS) (n = 96); Water Bottle (WB) (n = 107); Ceramic Knife (CK) (n = 129); Cell Phone (CP) (n = 129); Key Chain (KC) (n = 78); Leather Wallet (LW) (n = 78); Cigarette Lighter (CL) (n = 163); and Unknown (UN) (n = 289). The dataset has 1218 images with objects and 1349 images without objects. Qualitative statistics of the dataset show an image size and format of 225 × 880 JPEG, a resolution of 5 × 5 mm and object sizes in pixels (min–max) of 390–13,390. File format details are given in Table 1.
The locations of objects varied across images; each object was located in one of these positions: left arm (LA); abdomen (S); right arm (RA); back (B); chest (C); left thigh (LT); hip (N); waist (W); right calf (RL); left calf (LL); or right thigh (RT).
Considering the challenges in the dataset highlighted above, it was not possible to perform classical supervised learning, so an end-to-end detection framework was adopted (Fig. 1). The framework comprises three stages: image enhancement by wavelet filtering [6], a skip-attention generative adversarial network (SAGAN) [19] and a forward-forward mixed convolution with Gaussian affinity network (FMCGNet) [18].
In this framework, SAGAN detects anomalies in images, whereas FMCGNet classifies them. Wavelet filtering removed the noise from the images by isolating the noise from the signal using the biorthogonal wavelet transformation [6].
SAGAN is a type of GAN that is able to identify unusual objects in an image by flagging them as anomalies. Unlike a traditional GAN, SAGAN generates indications (of anomalies) from all feature locations rather than only from low-resolution feature locations [19]. As with other GAN models, it consists of a generator and a discriminator network. Because of the small sample size of the present dataset, depthwise separable convolution (DSC) was applied in order to overcome the overfitting problem [19]. A U-Net architecture was used, and three objective functions were optimised during training, related to image reconstruction loss, latent representation loss and adversarial loss. For evaluating the SAGAN model, two metrics were used: the final anomaly score and the raw anomaly score.
FMCGNet builds on the forward-forward algorithm, first proposed by Hinton to overcome limitations of backpropagation, which must have complete knowledge of the computation performed in the forward pass and otherwise cannot proceed [18]. In addition, backpropagation in huge networks, where billions of parameters are perturbed, does not consider variables perturbing simultaneously and averages out the gradient noise produced by other perturbing variables, which introduces error in the algorithm's outcome [22]. FMCGNet addresses this error by calculating gradients locally, with a local objective function at each layer, which yields more accuracy in separating positive from negative samples. This accuracy is achieved by adjusting the weights in each layer to raise the goodness for positive samples and reduce the goodness for negative samples [18]. Two types of goodness functions were used for updating the weights: the sum of squares and the negative sum of squares.
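A toy sketch of this forward-forward training rule (a local goodness objective per layer, with goodness defined as the sum of squared activations) is given below in PyTorch; it uses a simple fully connected layer rather than the convolutional, Gaussian-affinity layers of FMCGNet itself.

```python
import torch
import torch.nn.functional as F

class FFLayer(torch.nn.Module):
    """One forward-forward layer with its own local objective."""

    def __init__(self, d_in, d_out, threshold=2.0, lr=1e-4):
        super().__init__()
        self.linear = torch.nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # Normalise the input so only its direction (not the previous layer's goodness) matters.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-6)
        return torch.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)   # goodness of positive samples
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)   # goodness of negative samples
        # Push goodness above the threshold for positives and below it for negatives.
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()

layer = FFLayer(64, 32)
x_pos, x_neg = torch.randn(8, 64), torch.randn(8, 64)   # stand-ins for positive/negative patches
for _ in range(5):
    print(layer.train_step(x_pos, x_neg))
```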
The dataset analysed in this study was not balanced in terms of the number of objects present. Moreover, the height and width of individual objects varied, and their dimensions did not all follow a normal distribution. Table 2 summarises the dimensions of the 11 objects with the mean and median of each. The width of the objects ranged from 12 pixels (for the cigarette lighter) to 132 pixels (for the kitchen knife), and the minimum height of objects ranged from 15 pixels (for the cigarette lighter) to 49 pixels (for the gun).
Thus, the objects were relatively smaller than the usual sizes that have been reported for them [9]. Of all these objects, kitchen knives were larger than the other items and cigarette lighters were the smallest. This is visible in both the dimensions and the areas of the objects: kitchen knives showed areas in the range of 6000–10,000 square pixels, whereas cigarette lighters were below 1000 square pixels (Fig. 2).
The location these items were concealed in varied depending on whether the
person concealing them was a male or female. Figure 3 shows a scatter of the different
types of items across both genders.
Table 2 Width and height for the different objects evaluated in the study

| Object | Width min | Width median | Width mean | Width max | Height min | Height median | Height mean | Height max |
|---|---|---|---|---|---|---|---|---|
| Cell phone | 26 | 42 | 44.1 | 77 | 48 | 68.5 | 67.4 | 83 |
| Ceramic knife | 14 | 38 | 39.5 | 78 | 27 | 74 | 70.8 | 89 |
| Cigarette lighter | 12 | 27.5 | 28.6 | 52 | 15 | 39 | 36.7 | 56 |
| Gun | 26 | 59 | 61.5 | 96 | 49 | 85 | 83.8 | 107 |
| Key chain | 18 | 31 | 31 | 52 | 21 | 33.5 | 35 | 56 |
| Kitchen knife | 31 | 56 | 64.1 | 132 | 45 | 133 | 128 | 152 |
| Leather wallet | 36 | 50 | 50.5 | 66 | 36 | 59 | 57.5 | 70 |
| Metal dagger | 14 | 37 | 38.3 | 74 | 43 | 80 | 80.4 | 108 |
| Scissors | 23 | 41 | 43.1 | 81 | 35 | 80.5 | 77.7 | 100 |
| Water bottle | 20 | 34 | 34.4 | 62 | 43 | 92.5 | 89.2 | 129 |
Fig. 2 Box plots of the areas (left and middle) and scatter plot of the height and width (right) of
the different objects
Fig. 3 Items scatter from the front and back side across both genders
Hence, males hid more objects on their backs than females did. On the other hand, both genders concealed equal volumes of objects in their calf (back and front) and abdominal areas. Moreover, both genders kept their hands free. Figure 4 shows the percentage of different objects across body parts.
Authentic images were prepared by combining the original ATZ images (128 × 128) with 20% overlapping to form the authentic images, followed by truncating each image into 27 patches (Fig. 5). Patches falling outside the human region and those with 99% dark pixels (based on a grey threshold value of 30) were excluded. Wavelet-based transformation techniques and global threshold-based denoising enhanced the input images. The denoising settings used were: {'wavelet': 'sym4', 'method': 'VisuShrink', 'decomposition_level': 2, 'mode': 'soft'}. This resulted in a mean peak SNR of 45.46 for 'sym4' wavelets and 45.08 for 'bior4.4' wavelets. Image augmentation was performed by rotating images or changing the brightness settings, in order to generate sufficient negative images for the FMCGNet model to learn from. Additional negative images were generated using a linear classifier with a softmax function [23]. After the generation of negative images, anomalies in the data were detected using SAGAN, where images with at least one object were treated as abnormal (n = 4725) and the remaining as normal (n = 71,744). Only normal images were included in the training set, whereas the test set contained a mixture of normal and abnormal images in a ratio of 1:1.
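A minimal sketch of the wavelet denoising step with the settings quoted above ('sym4', VisuShrink, decomposition level 2, soft thresholding), using scikit-image on a synthetic stand-in image, is shown below.

```python
import numpy as np
from skimage.restoration import denoise_wavelet
from skimage.metrics import peak_signal_noise_ratio

rng = np.random.default_rng(0)
clean = np.zeros((128, 128))
clean[40:90, 50:80] = 1.0                                       # toy "object" patch
noisy = np.clip(clean + rng.normal(0, 0.2, clean.shape), 0, 1)  # additive Gaussian noise

denoised = denoise_wavelet(noisy, wavelet="sym4", method="VisuShrink",
                           mode="soft", wavelet_levels=2, rescale_sigma=True)
print(peak_signal_noise_ratio(clean, noisy, data_range=1.0),
      peak_signal_noise_ratio(clean, denoised, data_range=1.0))
```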
Unlike traditional deep learning algorithms, FMCGNet trains each of its layers with both positive and negative images [18, 24]; therefore, both positive and negative patches were used in the training procedure. The ratio of training to test set was 4:1, with the training set containing 7604 images and the test set containing 1900 images.
For FMCGNet development and evaluation, an intermediate dataset with a clear class imbalance was used to challenge the Gaussian affinity loss function [10, 26]. Five layers were used in the FMCGNet model, where each layer had a separate loss function. The training parameters included the Adam optimiser with lr = 0.0001, an affinity weighting parameter of λ = 0.5 and a goodness threshold of g = 2. Hybrid negative data made FMCGNet learn faster after being trained for 30 epochs, as seen in a goodness score of 3.95. Moreover, the true positive rate was high (TPR = 0.9188), which confirmed the model's accuracy. In addition, the model showed an FPR of 0.1005, an F1-score of 0.8996, an accuracy of 0.8501 and a goodness of 2.015. These results were not affected by batch size, training or time.
4 Conclusion
and an anomaly score of 0.1228, which showed that further training was required. On the other hand, FMCGNet showed very high accuracy, with a TPR of 91.9% when applied to the imbalanced dataset, demonstrating that it outperformed other machine learning models. The limitations encountered in the study related to sample size as well as computing power. Future work involves building on the FMCGNet architecture by applying it to larger datasets in order to maximise its performance.
Acknowledgements The authors thank UNITAR International University for supporting the publication of this research.
References
1. Chen HM, Lee S, Rao RM, Slamani MA, Varshney PK (2005) Imaging for concealed weapon
detection: a tutorial overview of development in imaging sensors and processing. IEEE Signal
Process Mag 22(2):52–61
2. Kowalski M, Kastek M, Palka N, Polakowski H, Szustakowski M, Piszczek M (2013) Inves-
tigation of concealed objects detection in visible, infrared and terahertz ranges of radiation.
Photon Lett Poland 5(4):167–169
3. Helal S, Sarieddeen H, Dahrouj H, Al-Naffouri TY, Alouini MS (2022) Signal processing and
machine learning techniques for terahertz sensing: an overview. IEEE Signal Process Mag
39(5):42–62
4. Hu BB, Nuss MC (1995) Imaging with terahertz waves. Opt Lett 20(16):1716–1718
5. Cheng YZ, Withayachumnankul W, Upadhyay A, Headland D, Nie Y, Gong RZ et al (2014)
Ultrabroadband reflective polarization convertor for terahertz waves. Appl Phys Lett 105(18)
6. Danso S, Liping S, Hu D, Odoom J, Quancheng L, Mushtag M (2022) Security inspection image
processing methods applying wavelet transform filters on Terahertz active images. Revista de
Investigaciones Universidad del Quindío 34(1):37–51
7. Velayudhan D, Hassan T, Damiani E, Werghi N (2022) Recent advances in baggage threat
detection: a comprehensive and systematic survey. ACM Comput Surv 55(8):1–38
8. Ahmed A, Velayudhan D, Hassan T, Hassan B, Dias J, Werghi N (2022, May) Baggage threat
detection under extreme class imbalance. In: 2022 2nd international conference on digital
futures and transformative technologies (ICoDT2), pp 1–6. IEEE
9. Liang D, Pan J, Yu Y, Zhou H (2019) Concealed object segmentation in terahertz imaging via
adversarial learning. Optik 185:1104–1114
10. Sun Q, He Y, Liu K, Fan S, Parrott EP, Pickwell-MacPherson E (2017) Recent advances in
terahertz technology for biomedical applications. Quant Imaging Med Surg 7(3):345
11. Li R, Li C, Li H, Wu S, Fang G (2018) Study of automatic detection of concealed targets in
passive terahertz images for intelligent security screening. IEEE Trans Terahertz Sci Technol
9(2):165–176
12. Guo Y, Liu Y, Oerlemans A, Lao S, Wu S, Lew MS (2016) Deep learning for visual
understanding: a review. Neurocomputing 187:27–48
13. Wang J, Zhu H, Wang SH, Zhang YD (2021) A review of deep learning on medical image
analysis. Mob Netw Appl 26:351–380
14. Suganyadevi S, Seethalakshmi V, Balasamy K (2022) A review on deep learning in medical
image analysis. Int J Multimed Inform Retriev 11(1):19–38
15. Bharathi PT, Subashini P (2011) Optimization of image processing techniques using neural
networks: a review. WSEAS Trans Inf Sci Appl 8(8):300–328
16. Al-Saffar AAM, Tao H, Talab MA (2017, October) Review of deep convolution neural network
in image classification. In: 2017 International conference on radar, antenna, microwave,
electronics, and telecommunications (ICRAMET), pp 26–31. IEEE
17. Chen L, Li S, Bai Q, Yang J, Jiang S, Miao Y (2021) Review of image classification algorithms
based on convolutional neural networks. Rem Sens 13(22):4712
18. Hinton G (2022) The forward-forward algorithm: some preliminary investigations. arXiv
preprint arXiv:2212.13345
19. Liu G, Lan S, Zhang T, Huang W, Wang W (2021, September) SAGAN: skip-attention GAN
for anomaly detection. In: 2021 IEEE international conference on image processing (ICIP), pp
2468–2472. IEEE
20. Terahertz Imaging Dataset (2022) Available at: https://github.com/LingLIx/THz_Dataset.
Accessed: 31 Aug 2023
21. Shen X, Dietlein CR, Grossman E, Popovic Z, Meyer FG (2008) Detection and segmentation
of concealed objects in terahertz images. IEEE Trans Image Process 17(12):2465–2475
22. Karner C, Kazeev V, Petersen PC (2022) Limitations of gradient descent due to numerical
instability of backpropagation. arXiv preprint arXiv:2210.00805
23. Lin D, Lin Z, Sun L, Toh KA, Cao J (2017, May) LLC encoded BoW features and softmax
regression for microscopic image classification. In: 2017 IEEE international symposium on
circuits and systems (ISCAS), pp 1–4. IEEE
24. Yeh AB, Huwang L, Wu YF (2004) A likelihood-ratio-based EWMA control chart for
monitoring variability of multivariate normal processes. IIE Trans 36(9):865–879
25. Yang C, Lan S, Huang W, Wang W, Liu G, Yang H et al (2022, September) A transformer-
based GAN for anomaly detection. In: International conference on artificial neural networks.
Springer Nature, Cham, pp 345–357
26. Hayat M, Khan S, Zamir SW, Shen J, Shao L (2019) Gaussian affinity for max-margin class
imbalanced learning. In: Proceedings of the IEEE/CVF international conference on computer
vision, pp 6469–6479
27. Xiao H, Zhang R, Wang H, Zhu F, Zhang C, Dai H, Zhou Y (2018, October) R-PCNN method to rapidly detect objects on THz images in human body security checks. In: 2018 IEEE SmartWorld, ubiquitous intelligence & computing, advanced & trusted computing, scalable computing & communications, cloud & big data computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pp 1777–1782. IEEE
Multivariate Comparative Analysis
of Statistical and Deep Learning Models
for Prediction Hardware Failure
Saurabh Gupta, Raghad Alshabandar, Chong Kim Loy,
and Ammar H. Mohammed
S. Gupta (B)
Gurgaon, India
e-mail: [email protected]
R. Alshabandar · A. H. Mohammed
Baghdad, Iraq
C. K. Loy
UNITAR International University, Petaling Jaya, Selangor, Malaysia
e-mail: [email protected]
1 Introduction
The COVID-19 pandemic has accelerated the adoption of digital products across
various sectors, including businesses and personal use, regardless of age. This surge
in digital dependence has intensified the demand for high availability of applications
hosted on in-house servers or cloud infrastructure. According to available online data,
even a few minutes of application downtime can result in significant financial losses
for organizations and individuals. Each year, several large companies experience
server or application outages. In December 2022, Amazon AWS faced an outage
that disrupted services globally. Reports indicate that the outage was caused by “an
impairment of several network devices,” leading to increased API error rates [1].
Similarly, in October 2021, Facebook, Instagram, and WhatsApp went down for six
hours, affecting 3.5 billion people worldwide [2].
While the concept of Consistency, Availability, and Partition Tolerance (CAP)
is crucial for ensuring system reliability, it is becoming increasingly evident that
additional measures are necessary to maintain seamless cloud operations. As orga-
nizations embrace AI/ML, they need to develop more robust MLOps or AIOps
solutions that can effectively detect anomalies and predict faults [3].
Despite the growing use of machine learning for hardware failure prediction,
most previous studies have relied on a single type of feature, such as SMART indi-
cators or CPU utilization metrics. Only a limited number of studies have explored
hybrid feature spaces. To date, no study has focused on comparing different ML/AI
algorithms solely on server health-monitored data to predict system failures. This is
primarily because OPS teams typically have access to both server health data and
success or failure logs for application calls [4].
This paper proposes a multivariate analysis approach to identify the most effective
technique for predicting cloud server failures based on usage patterns (CPU, memory,
and network). By forecasting server failures using current usage data, organizations
can prevent cloud failures, reduce associated costs, and minimize downtime for online
and mobile applications hosted on the affected cloud infrastructure.
The performance of deep learning models for failure prediction in virtual machines
will be compared with baseline models to determine the most suitable approach for
cloud server failure prediction. Additionally, the study will provide valuable insights
to guide future research in developing novel methodologies for failure prediction.
The remainder of this paper is structured as follows: Section 2 presents a compre-
hensive literature review of related work. Section 3 details the methodology and
presents the obtained results. Finally, Section 4 outlines the conclusions and discusses
potential future research directions.
2 Related Works
To identify and prevent cloud-based failures before they materialize, the authors of [5] advocate a “combined system metrics approach”. They integrated system indicators such as CPU utilization with SMART hard drive data. The system was rigorously evaluated using over 100 cloud server datasets and four AI algorithms: random forest, gradient boosting, long short-term memory, and gated recurrent unit. The results were subsequently subjected to correlation analysis. Random forest (0.99 accuracy) and gradient boosting (0.99) emerged as the most effective methods.
A study by [5] demonstrated the feasibility of predicting failure using a support
vector machine in a cloud environment, highlighting the sensitivity to the failure
threshold. To prevent operational system failures on individual hosts, Chalermar-
rewong et al. [6] employed multivariate analysis, analyzing log files that docu-
ment factors such as system resources (CPU, RAM, and network throughput). The
precision achieved ranged from 97.32 to 99.89, depending on the system.
One study [7] proposed a machine learning approach to predict virtual machine
(VM) failure using a time series stochastic model. The proposed method utilizes
an autoregressive integrated moving average (ARIMA) failure predictor module to
forecast VM failure. The method’s performance was evaluated on non-stationary
failure traces of VMs. The observations revealed that the proposed method accurately
predicts failure with a mean absolute error (MAE) of 0.0344786.
Shi et al. [8] combined generative adversarial networks (GANs) with an improved long short-term memory (LSTM) model called convolution-LSTM (C-LSTM) to predict hard disk failure. Their findings indicate that C-LSTM achieves promising results with an AUC of 0.76.
The same study [7] reports that the ARIMA-based predictor achieves satisfactory performance, with a root mean square error (RMSE) of 0.0457443, a mean absolute error (MAE) of 0.0344786, and a mean absolute scaled error (MASE) of 0.6036391.
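To make the ARIMA-based approach concrete, the sketch below fits an ARIMA model to a toy utilisation trace with statsmodels and reports the MAE; the order (1, 1, 1) and the synthetic series are illustrative assumptions, not the configuration used in [7].

```python
# Illustrative ARIMA forecasting sketch; order and data are assumptions.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
cpu = pd.Series(50 + np.cumsum(rng.normal(0, 1, 300)))  # toy non-stationary trace

train, test = cpu[:250], cpu[250:]
model = ARIMA(train, order=(1, 1, 1)).fit()      # (p, d, q) chosen for illustration
forecast = model.forecast(steps=len(test))

print("MAE:", mean_absolute_error(test, forecast))
```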
3 Methods
3.1 Dataset
The dataset utilized in this study was obtained from the Bitbrain open-source data
repository [9]. It comprises 1750 virtual machine (VM) data points suitable for
multivariate analysis. The dataset encompasses a comprehensive range of features, including CPU usage, memory usage, and network transmitted throughput.
This research employed various data preprocessing techniques to ensure the dataset's quality and suitability for model training. These included identifying and correcting structural errors, removing inaccurate or corrupted data, and eliminating duplicate entries. The data were then normalized using MinMaxScaler, which scales each feature to a range between 0 and 1 and helps improve the performance of some machine learning models. Missing values were carefully addressed by either removing the corresponding data points or imputing them using appropriate techniques.
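A minimal sketch of this preprocessing pipeline, using pandas and scikit-learn, is shown below; the file name and column names are illustrative placeholders, not the actual Bitbrains field names.

```python
# Preprocessing sketch: duplicate removal, missing-value handling, and
# Min-Max scaling to [0, 1]. File and column names are assumed for illustration.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("vm_metrics.csv")             # hypothetical VM trace file
df = df.drop_duplicates()                      # remove duplicate entries
df = df.dropna(subset=["cpu_usage_pct"])       # drop rows missing the target
df = df.fillna(df.median(numeric_only=True))   # impute remaining gaps

features = ["cpu_usage_pct", "memory_usage_kb", "net_transmit_kbps"]
scaler = MinMaxScaler()                        # scales each feature to [0, 1]
df[features] = scaler.fit_transform(df[features])
```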
4 Results
The machine learning models were employed to forecast CPU utilization, memory utilization, and network throughput. The coefficient of determination (R2) was used to assess the models' fit to the data, as shown in Table 2. A higher R2 value indicates a better fit, while a negative R2 value suggests the model fits worse than a simple mean predictor. The table also displays the predicted CPU utilization, memory utilization, and network transmitted throughput for each model. Based on the results, the Bi-LSTM model performed the best, achieving the highest R2 score for memory utilization and the second-highest R2 score for CPU usage. The ARIMA model exhibited a negative R2 score for CPU utilization, and the AR model yielded a very low R2 score for memory consumption.
Across all three variables, the Bi-LSTM model emerges as the most suitable choice for
predicting system failures and enhancing system reliability.
Figure 1 presents a scatter plot of the Bi-LSTM model, visualizing the actual versus predicted values based on R2 for the three main features: CPU usage (%), memory usage (KB), and network transmitted throughput.
The findings of this study demonstrate that the Bi-LSTM model consistently
outperforms other statistical and deep learning models in predicting system resource
utilization, including CPU utilization, memory utilization, and network throughput.
Its superior performance, evident in both R2 scores and MAE values, suggests that
the Bi-LSTM model is a powerful tool for anticipating system failures and enhancing
system reliability.
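For illustration, a Bi-LSTM forecaster of the kind compared here can be sketched with Keras as follows; the window length, layer width, and training settings are assumptions rather than the configuration used in this study, and the input series is synthetic.

```python
# Bi-LSTM sketch for one-step-ahead forecasting of a scaled usage series.
# Architecture and hyper-parameters are illustrative assumptions.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense
from sklearn.metrics import r2_score

def make_windows(series, window=12):
    # Turn a 1-D series into (samples, window, 1) inputs and next-step targets.
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return np.array(X)[..., np.newaxis], np.array(y)

series = np.random.rand(500)          # placeholder for a scaled CPU trace
X, y = make_windows(series)

model = Sequential([
    Bidirectional(LSTM(32), input_shape=(X.shape[1], 1)),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

preds = model.predict(X[-10:])
print("R2 on the last windows:", r2_score(y[-10:], preds.ravel()))
```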
This study investigated the application of multivariate time series analysis to predict
system failures using the “BitsBrain” dataset. Following data preprocessing and
feature selection to extract meaningful information, a range of basic and advanced
models were employed to forecast CPU consumption, memory utilization, and
network transmitted throughput.
References
1. Yu G, Chen P, Chen H, Guan Z, Huang Z, Jing L, Weng T, Sun X, Li X (2021) Microrank: end-to-
end latency issue localization with extended spectrum analysis in microservice environments.
In: Proceedings of the web conference 2021, pp 3087–3098
2. Anon (2021) Mark Zuckerberg apologises for Facebook, WhatsApp disruption—The
Economic Times [Online]. economictimes.indiatimes.com. Available from: https://economict
imes.indiatimes.com/tech/technology/mark-zuckerberg-apologises-for-facebook-whatsapp-
disruption/articleshow/86772424.cms. Accessed 3 Nov 2022
3. Abro JH, Li C, Shafiq M, Vishnukumar A, Mewada S, Malpani K, Osei-Owusu J (2022) Artifi-
cial intelligence enabled effective fault prediction techniques in cloud computing environment
for improving resource optimization. Sci Program 2022:1–7
4. Litoiu M, Watts I, Wigglesworth J (2021) The 13th CASCON workshop on cloud computing:
engineering AIOps. In: Proceedings of the 31st annual international conference on computer
science and software engineering, pp 280–281
5. Tehrani C, Beer R, Popp H et al (2017) Education 4.0—fostering student’s performance
with machine learning methods. In: 2017 IEEE 23rd international symposium for design and
technology in electronic packaging (SIITME), pp 438–443. IEEE
6. Chalermarrewong T, Achalakul T, See SCW (2012) Failure prediction of data centers using
time series and fault tree analysis. In: 2012 IEEE 18th international conference on parallel and
distributed systems, pp 794–799. IEEE
7. Rawat A, Sushil R, Agarwal A, Sikander A (2021) A new approach for vm failure prediction
using stochastic model in cloud. IETE J Res 67(2):165–172
8. Shi J, Du J, Ren Y, Li B, Zou J, Zhang A (2022) Convolution-LSTM-based mechanical hard
disk failure prediction by sensoring SMART indicators. J Sens 2022:1–15
9. Bitbrains (n.d.) GWA-T-12 Bitbrains [Online]. Bitbrains. Available from: http://gwa.ewi.tud
elft.nl/datasets/gwa-t-12-bitbrains. Accessed 3 Nov 2022
10. Jain R, Chetty P (2020) Introduction to the autoregressive integrated moving average (ARIMA)
model. [Online]. projectguru.in. Available from: https://www.projectguru.in/introduction-to-
the-autoregressive-integrated-moving-average-arima-model/. Accessed: 27 Dec 2022
11. Triebe O, Laptev N, Rajagopal R (2019) Ar-net: a simple auto-regressive neural network for
time-series. arXiv preprint arXiv:1911.12436
12. Zhang N, Zhang Y, Lu H (2011) Seasonal autoregressive integrated moving average and support
vector machine models: prediction of short-term traffic flow on freeways. Transp Res Rec
2215(1):85–92
13. Ostertagová E, Ostertag O (2011, September) The simple exponential smoothing model. In: The
4th international conference on modelling of mechanical and mechatronic systems, Technical
University of Košice, Slovak Republic, proceedings of conference, pp 380–384
14. Yu Y, Si X, Hu C, Zhang J (2019) A review of recurrent neural networks: LSTM cells and
network architectures. Neural Comput 31(7):1235–1270
15. Sun Q, Jankovic MV, Bally L, Mougiakakou SG (2018, November) Predicting blood glucose
with an lstm and bi-lstm based deep neural network. In: 2018 14th symposium on neural
networks and applications (NEUREL), pp 1–5. IEEE
Statistical Learning
A Case Study via Bayesian Network:
Investigating Factors Influencing Student
Academic Performance in Online
Teaching and Learning During
COVID-19 Pandemic
Zheng Ning Looi, Poh Choo Song, Huai Tein Lim, and Sing Yan Looi
1 Introduction
Generally, ways of improving students' academic performance have been studied globally, for instance in South Africa [1], the Midwestern United States [2], Australia [3], Vietnam [4] and Indonesia [5]. In Malaysia, there are 114 universities, including 20 state universities, 50 private universities, 34 private university colleges, and ten branch campuses of international universities [6]. Acceptance of changes to the educational system, e.g., implementing online teaching and learning (OTL), was accelerated in these universities during the difficult circumstances of the COVID-19 pandemic. COVID-19 is the disease caused by the SARS-CoV-2 coronavirus; it was first identified at the end of 2019 during an outbreak in Wuhan, China. The Malaysian government imposed a movement control order in response to the large number of infections, and hence OTL was implemented. However, there are challenges confronting beginning educators in conducting OTL. As [4] claimed, most students are not interested in online learning due to a lack of interaction and unstable Internet connections. Likewise, [7] noted that students had poorer sleep quality than employees during the COVID-19 pandemic. Additionally, about 56% of students in the Southern Philippines avoided associating with others during the COVID-19 pandemic [8]. Overall, the COVID-19 pandemic has caused students to lose motivation for learning [9, 10] and has altered their learning attitudes [11]. These conditions will certainly affect the quality of education for the new generation in the future.
Conventionally, two categories of factors, internal and external, influence students' academic performance [12]. Classroom atmosphere, learning facilities, and the instructor's role are internal influences; [2] found that instructor quality strongly affects students' academic achievement. Family influence, financial difficulties, and social issues are examples of external causes. However, students undergoing OTL during the COVID-19 pandemic may view academic performance somewhat differently from the conventional viewpoint. According to [13] and [14], time management and online learning are closely related, since students have more flexible ways of managing their time. Past studies also suggest that academic achievement is strongly correlated with self-efficacy [3] and positively correlated with student attendance [15]. Moreover, many students experienced environmental issues in OTL during the pandemic [16].
Therefore, to better understand the variables affecting undergraduate students' academic performance in OTL during the COVID-19 pandemic, this study (1) statistically determined a relationship network between academic achievement (the dependent variable) and the independent variables; and (2) evaluated the prediction accuracy of the best-performing model by comparing the Bayesian network and support vector regression during model testing. Under limited resources, a better causal network can help higher education institutions strengthen the significant variables that raise students' academic performance, mitigate unfavourable variables, and predict academic performance.
2 Literature Review
70% of the total attendance for every course [28]. A study of the impact of attendance patterns on academic performance and wellness among medical students at the University of Central Florida during the COVID-19 pandemic was conducted by Salzman et al. [15]. The study found a weak but statistically significant connection between academic achievement and attendance. This is in line with [30], where some students show up for class to comply with the university's attendance regulations but pay little attention to the lecture.
Sleep quality is crucial for a fresh mind and good health, and could therefore affect academic performance [31, 32]. According to a survey in [32], students who sleep more than nine hours have a higher CGPA than those who sleep less than six hours. However, about 60% of college students still experience daytime sleepiness and insufficient sleep. Moreover, [7] found that the amount and quality of sleep among students and university staff were strongly affected during the COVID-19 lockdown. In contrast, [33] found no connection between academic achievement and sleep quality among medical students.
Moreover, a student's desire to learn and academic performance may depend on family interaction. According to [34], a healthy family dynamic lowers a student's cortisol levels and enables the student to concentrate on their studies. Additionally, parents who interact with their children through games can support them in recognizing their growth, well-being, and academic standing [35]. However, during the COVID-19 pandemic, about 54–56% of high school students avoided social interaction [36], and some postgraduate students had no social contact with family and friends [8]. These conditions may affect both the learning process and academic performance.
Learning attitudes can relate positively to academic achievement [37–40]. Lee et al. [11] stated that secondary school students' academic performance was correlated with their learning attitude during the COVID-19 pandemic. Furthermore, how a student feels about a topic may be the key to cultivating a better learning attitude [41] and self-efficacy [42]. Likewise, in India, undergraduate students from the science stream have higher levels of academic self-efficacy than those from the arts and commerce streams [43]. Self-efficacy can therefore encourage good learning behavior and thus improve academic achievement [42, 43], including during the COVID-19 pandemic [3]. Self-efficacy can be seen as a form of self-confidence; as [42] stated, it is the belief that one is capable of carrying out and accomplishing a task.
Financial stress encountered by students, such as student loans, tuition fees, reference book purchases, computing equipment purchases, and stationery costs, can affect their studies [8, 44]. The majority of undergraduate students have financial problems; [45] revealed that 60% of undergraduate students had trouble paying their tuition fees and monthly expenses. To cover their debts, students may skip classes for part-time jobs or live frugally, which can affect their learning progress. The situation worsened with the COVID-19 pandemic: [46] found that students' emotional and financial well-being were affected during this period. On the flip side, however, the flexibility of OTL may help reduce financial stress.
In this study, ten potential independent variables adopted from past studies are compiled and investigated concurrently to determine the core factors affecting and predicting academic performance in online learning during the COVID-19 pandemic (Fig. 1).
The investigated independent variables for this paper are teaching method, time management, environmental factors, attendance, sleep quality, family influence, learning attitudes, financial stress, motivation, and self-efficacy, while the dependent variable is academic performance in terms of CGPA.
3 Methodology
In this study, Year 3 and Year 4 undergraduate students from various faculties in Universiti Tunku Abdul Rahman are the target group, since they have experienced both learning methods, i.e., face-to-face mode and online virtual mode. Students from the Faculty of Medicine and Health Science (FMHS) and the Faculty of Creative Industries (FCI) are not involved in this study, as these faculties did not implement online teaching and learning (OTL) during the pandemic. Ethics approval was obtained from the university, and the participants provided written informed consent to participate in the survey. There were 158 respondents in the study. The sample size was selected based on the total population, confidence level and margin of error, which are about 4000, 80% and 5%, respectively. Convenience sampling was applied in view of time and budget constraints, as it is faster than other sampling techniques, especially at crucial moments [50]. The questionnaire is divided into two sections. The first section concerns the respondent's personal information, such as gender, age, place of residence, ethnicity, nationality, faculty, year of study, medium of academic study, parent's educational level, CGPA, and monthly household income. The second section assesses the predictor and predicted variables.
Technically, Bayesian networks, a type of probabilistic graphical model, are useful for probabilistic inference, decision-making, and prediction, although they can be computationally expensive and may require a large amount of training data [51]. Nevertheless, they can also manage missing data and are applicable to small data sets [52], a capability quite similar to the support vector machine. Furthermore, a Bayesian network can handle uncertainty and can be combined with other Bayesian analytic techniques such as Markov chain Monte Carlo (MCMC). However, Hill-Climbing (HC) structure learning does not guarantee an optimal solution; thus, more sophisticated neighborhood search algorithms such as Tabu search [53], or other metaheuristics [54], have been suggested as improvements. Although Hill-Climbing is comparable to the Tabu search method [53], Tabu search can prevent the loss of a good solution and reach an optimal solution by reselecting restricted objects into the model [55]. Therefore, in this study, the Bayesian network model selected from Hill-Climbing and Tabu search is compared with support vector regression to find the best network; see Fig. 2. Furthermore, we consider the correlation between the predictor and the predicted variable.
A Bayesian network is made up of two components: a set of conditional probability distributions and a directed acyclic graph [51]. It represents a joint probability distribution over a group of variables. An example of a directed acyclic graph is shown in Fig. 3. The directed arc between nodes A and C indicates a probabilistic dependence between an electrical failure and a computer failure. Node A and node B are conditionally independent given node C. Additionally, a conditional probability distribution is built for each node based on the directed graph.
Generally, there are three types of structure learning methods for Bayesian networks: score-based, constraint-based, and hybrid. [53] gives a comprehensive review of these structure learning methods: Tabu search and Hill-Climbing are score-based learning algorithms; constraint-based learning algorithms include Grow-Shrink, Incremental Association Markov Blanket (IAMB), Fast Incremental Association Markov Blanket (FastIAMB), and Max–Min Parents and Children (MMPC); and hybrid learning techniques include Max–Min Hill-Climbing (MMHC).
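As an illustration of score-based structure learning, the sketch below runs a Hill-Climbing search with a BIC score using the pgmpy library (assuming its HillClimbSearch/BicScore estimator API); the data file and its columns are placeholders, and this is not the modelling pipeline actually used in this study, which also applies Tabu search.

```python
# Score-based Bayesian network structure learning sketch (Hill-Climbing + BIC).
# The survey file and its discretised columns are assumed placeholders.
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore

df = pd.read_csv("survey.csv")   # hypothetical discretised survey responses

search = HillClimbSearch(df)
dag = search.estimate(scoring_method=BicScore(df))

print("Learned edges:", sorted(dag.edges()))
```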
The support vector machine (SVM) was created by Vapnik. It can be classified into two categories: support vector classification (SVC) and support vector regression (SVR) [58]. The aim of SVR is to find a fitting line, called a hyperplane, that fits a set of data points [59]. Boundary lines are two lines drawn around the hyperplane to form a margin [60]. Moreover, SVR uses a collection of kernel functions, such as sigmoidal, polynomial, and Gaussian kernels, to transform the data into the necessary dimensions [60]. In addition, SVR has certain benefits: it is resistant to outliers, and the decision model is easy to apply and to update when there are changes. According to [59] and [60], it has high prediction accuracy and can operate with a minimal dataset. Therefore, in this study, SVR is implemented and compared with the BN to assess prediction accuracy. 80% of the data are used to train the model, and 20% are used to test it. Mean absolute error (MAE), mean absolute percentage error (MAPE) and root mean square error (RMSE) are used to compare the accuracy of BN and SVR. MAE is defined as:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \quad (2)$$

where n is the number of errors and |y_i − ŷ_i| is the absolute error, in which y_i is the actual value and ŷ_i is the forecast value. MAPE is defined as:

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|. \quad (3)$$

RMSE is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}. \quad (4)$$
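A brief sketch of how the SVR baseline and these three error measures can be computed is given below; the random feature matrix (sized only to echo the 158 respondents and six predictors) and the RBF kernel choice are illustrative assumptions rather than the study's actual data or tuning.

```python
# SVR with an 80/20 split and MAE / MAPE / RMSE computed as in Eqs. (2)-(4).
# X and y are random placeholders for the encoded predictors and CGPA band.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X = np.random.rand(158, 6)                         # 158 respondents, 6 predictors
y = np.random.randint(1, 9, 158).astype(float)     # CGPA band 1..8

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

svr = SVR(kernel="rbf").fit(X_tr, y_tr)
pred = svr.predict(X_te)

mae = np.mean(np.abs(y_te - pred))                 # Eq. (2)
mape = np.mean(np.abs((y_te - pred) / y_te))       # Eq. (3)
rmse = np.sqrt(np.mean((y_te - pred) ** 2))        # Eq. (4)
print(f"MAE={mae:.3f}  MAPE={mape:.3f}  RMSE={rmse:.3f}")
```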
4 Result
According to the findings shown in Table 1, the lowest BIC score of Tabu search was slightly higher than that of HC. Hence, the final trained network in this study is determined by Tabu search. Figure 5 illustrates the factors affecting students' CGPA academic performance. We found relationships between student CGPA and Faculty, Age, Types of gadgets (gadgetTypes), Attendance, Self-efficacy (SelfE), and Teaching method (TeachMS).
Based on the output, student CGPA and faculty have a causal relationship. The measures used by different faculties to evaluate students' academic performance can vary; for instance, some lecturers evaluate a student's summative assessment based on their assignments. There is also a link between the types of gadgets used and student CGPA. For instance, students who use mobile devices for studying may experience more eye strain because of the narrower screen, making it harder to focus and causing them to lose interest in online learning.
Moreover, students' learning attitudes with regard to attendance and self-efficacy are vital for academic performance. Student absenteeism affects learning progress, and a student with low self-efficacy may have less faith in their abilities and less interest in performing well on learning assignments. Lastly, educational strategies have a substantial impact on student academic performance. Different educators employ various educational practices, and an appropriate teaching approach may lead to smoother and more effective learning progress. Only the variables linked to CGPA are used to train the SVR model; therefore, age, faculty, gadget types, teaching method, attendance, and self-efficacy are used in training the model.
Figure 4 illustrates the CGPA results predicted by BN and SVR, respectively. For a clearer comparison, we decomposed the 4-point CGPA scale into 8 bands, where ‘1’ denotes a CGPA of 3.80–4.00, ‘2’ denotes 3.50–3.79, ‘3’ denotes 3.20–3.49, ‘4’ denotes 2.90–3.19, ‘5’ denotes 2.60–2.89, ‘6’ denotes 2.30–2.59, ‘7’ denotes 2.00–2.29, and ‘8’ denotes 0.00–1.99. The red dots in Fig. 4 represent the original students' CGPA, and the blue line is the prediction based on the trained model.
The results in Table 2 show that the CGPA predictive accuracy of BN is superior to that of SVR. All three predictive measures, MAE, MAPE and RMSE, are lower for BN than for SVR. The MAPE of BN shows that the CGPA prediction is about 14% off from the actual CGPA on average, which is about 5 percentage points lower than that of SVR (Fig. 5).
5 Conclusion
Fig. 4 Prediction by
Bayesian network and
support vector regression
References
1. Ganyaupfu EM (2013) Teaching methods and students’ academic performance. Int J Humanit
Soc Sci Invention 2(9):29–35
2. Rugutt J, Chemosit C (2005) A study of factors that influence college academic achievement:
a structural equation modeling approach
3. Talsma K, Robertson K, Thomas C, Norris K (2021) COVID-19 beliefs, self-efficacy and
academic performance in first-year university students: cohort comparison and mediation
Analysis. Front Psychol 12
4. Pham TTT, Le HA, Do DT (2021) The factors affecting students’ online learning outcomes
during the COVID-19 pandemic: a Bayesian exploratory factor analysis. Educ Res Int 2021:1–
13
26. Ali M, Tazilah MDABK, Ahmad MF, Adnan NS, Awang N (2016) The relationship between
environmental factors, learning style and self-management towards academic performance of
accounting students: a case of Uniten. South East Asia J Contemp Bus Econ Law 10(1)
27. Ragpala E (2021) Environmental factors that affect the academic performance of senior high
school students during COVID-19 pandemic. Int J Soc Sci Curr Future Res Trends 11(1):1–19
28. Ancheta RF, Daniel D, Ahmad R (2021) Effect of class attendance on academic performance.
Eur J Educ Stud 8(9)
29. Al Shenawi H, Yaghan R, Almarabheh A, Al Shenawi N (2021) The relationship between atten-
dance and academic performance of undergraduate medical students during surgical clerkship.
BMC Med Educ 21(1)
30. Macfarlane B (2013) The surveillance of learning: a critical analysis of university attendance
policies. High Educ Q 67(4):358–373
31. CDC healthy schools: sleep and health [Online]. Available at: https://www.cdc.gov/healthysc
hools/sleep.htm#:~:text=Students%20should%20get%20the%20proper (2019)
32. Learning center: sleeping to succeed—learning center [Online]. Available at: https://learningc
enter.unc.edu/tips-and-tools/sleeping-to-succeed/ (2016)
33. Jalali R, Khazaie H, Khaledi Paveh B, Hayrani Z, Menati L (2020) The effect of sleep quality
on students’ academic achievement. Adv Med Educ Pract 11:497–502
34. Kohnke K (2019) Familial stress and academic performance: the role of an
outside stress factor on the academic success of college students [Online]. Avail-
able at: https://rex.libraries.wsu.edu/esploro/outputs/essay/Familial-Stress-and-Academic-Per
formance-The/99900590739701842 (2019)
35. Ðurišić M, Bunijevac M (2017) Parental involvement as a important factor for successful educa-
tion 7(3)
36. Baloran ET (2020) Knowledge, attitudes, anxiety, and coping strategies of students during
COVID-19 pandemic. J Loss Trauma 25(8):1–8
37. Bakar KA et al (2010) Relationships between university students’ achievement motivation,
attitude and academic performance in Malaysia. Procedia Soc Behav Sci 2(2):4906–4910
38. Educational research techniques: attitude and behavior of students [Online]. Avail-
able at: https://educationalresearchtechniques.com/2021/01/03/attitude-and-behavior-of-stu
dents/#comments (2021)
39. Kafle T, Kongawad NB (2019) Students’ attitude in teaching learning activities—ignited minds
journals. Multi Acad Res 16(5):478–482
40. Ismaili Y (2021) Evaluation of students’ attitude toward distance learning during the pandemic
(Covid-19): a case study of ELTE university. On the Horizon 29(1):17–30
41. Mahzan Awang M et al (2013) Students’ attitudes and their academic performance in
nationhood education. Int Educ Stud 6(11)
42. Doménech-Betoret F, Abellán-Roselló L, Gómez-Artiga A (2017) Self-efficacy, satisfaction,
and academic achievement: the mediator role of students’ expectancy-value beliefs. Front
Psychol 8
43. Bhati K, Baral R, Meher V (2022) Academic self-efficacy and academic performance among
undergraduate students in relation to gender and streams of education. Indonesian J Contemp
Educ 4(2):80–88
44. Archuleta KL, Dale A, Spann SM (2013) College students and financial distress: exploring
debt, financial satisfaction, and financial anxiety. J Fin Couns Plann 24(2):50–62
45. Grabmeier J (2015) 70 percent of college students stressed about finances. osu.edu [Online].
Available at: https://news.osu.edu/70-percent-of-college-students-stressed-about-finances/
46. Haji Wahab MZ, Othman K (2021) Impact of COVID-19 on student’s emotional and financial
aspects in the higher learning institutions. SEISENSE J Manag 4(4):1–15
47. Marler EK et al (2021) The impact of COVID-19 on university students’ academic motivation,
social connection, and psychological well-being. Sch Teach Learn Psychol
48. Muhammad AS, Bakar NA, Mijinyawa SI, Halabi KA (2021) Impact of motivation on students’
academic performance: a case study of university sultan Zainal Abidin students. Am J Innov
Res Appl Sci 13(1–6):203–207
Abstract This article aims to develop an accurate air quality prediction model to
handle Jakarta’s air pollution challenges. In this study, data from air quality moni-
toring stations’ conventional air pollution indexes was employed. In the research
phase, data is explored, SMOTE is used to manage imbalances, and XGBoost is
used to develop a model with the best parameters. The evaluation stage shows the
model's ability to predict air quality. With an accuracy rate of 99.516%, an F1-score of 99.528%, and a recall rate of 99.509%, the results were remarkable. These performance indicators show the model's exceptional ability to classify and predict air quality levels. Furthermore, this study investigates the significance of various variables in predicting air quality. A thorough evaluation of measures such as the weight, gain, total gain, and cover indicators reveals the importance of numerous aspects. While SO2 also helps predict air quality, the prominence of PM2.5 across several measures reveals a significant influence. This study contributes to a better understanding of the
W. Wibowo (B)
Department of Business Statistics, Faculty of Vocational Studies, Institut Teknologi Sepuluh
Nopember, 60111 Surabaya, Indonesia
e-mail: [email protected]
H. Al Azies
Faculty of Computer Science, Study Program in Informatics Engineering, Universitas Dian
Nuswantoro, 50131 Semarang, Indonesia
Research Center for Materials Informatics, Faculty of Computer Science, Universitas Dian
Nuswantoro, 50131 Semarang, Indonesia
S. A. Wilujeng
Department of Environmental Engineering, Faculty of Civil, Planning, and Geo Engineering,
Institut Teknologi Sepuluh Nopember, 60111 Surabaya, Indonesia
S. Abdul-Rahman
College of Computing, Informatics and Mathematics, Universiti Teknologi MARA, 40450 Shah
Alam, Selangor, Malaysia
1 Introduction
The consequences of air pollution are deeply concerning. Air pollution has had
major implications, including increasing dangers to human health [1]. According
to the World Health Organization (WHO), the number of deaths from air pollution
reached an alarming 4.2 million in 2019 [2]. The situation is compounded further by
the growing number of vehicles in Indonesia, which emit hazardous exhaust pollutants
[3]. Carbon monoxide (CO) concentrations are also a factor in global warming and
temperature fluctuations on Earth [4]. The amount of contaminants in the air has
significantly increased recently due to Indonesia’s rapid economic growth and social
development [5]. The main concern is the assessment of environmental air quality
and the control of air pollution, which is a serious issue with effects on many facets
of daily life, the environment, and the ecosystem as a whole, particularly in urban
areas like Indonesia’s DKI Jakarta Province. Urban air quality is under threat and
becoming a source of concern, notwithstanding the rapid economic expansion and
the rapid rate of urbanization [6].
Government departments use the Air Quality Index (AQI) to notify the public
about the present or predicted levels of air pollution and to monitor air quality [7].
The higher the AQI score, the larger the share of the population likely to suffer an adverse health impact, which can be severe. As a result, timely air quality prediction is critical for the government to understand the pattern of changes in air quality, and it can be used to implement efficient air pollution control and management strategies.
The World Air Quality Index (WAQI), which updates its data in real time, gives
Jakarta an AQI score of 167. Jakarta is now ranked second on the list of the world’s
most polluted cities, trailing only Dubai City in the United Arab Emirates, which
holds the top spot with an AQI score of 176. This ranking demonstrates the significant
problems these two cities are facing with air pollution, which has a detrimental effect
on both the general public’s health and the environment.
Monitoring air quality alone is inadequate in addressing the root causes of air
pollution. In recent years, various methods for predicting air pollution have emerged.
While the Gaussian dispersion model is commonly used in air pollution research
[8, 9], other statistical techniques [10–12] can also predict pollutant levels. These
models, based on physical principles, often lack detailed data on pollution sources
and related factors. To overcome this limitation, a more comprehensive approach
is needed, and machine learning algorithms [13, 14] offer a promising solution.
This study distinguishes itself from the previously cited studies by taking a novel and different approach. Although some prior studies have used various machine learning approaches and artificial neural networks to predict air quality, this study focuses specifically on the Jakarta air quality scenario. To facilitate reader comprehension and navigation, this publication is divided into several sections. Section 2 reviews related work in air quality prediction research. Section 3 presents the applied research framework, which serves as the foundation for a full understanding of the issue. Section 4 presents the study's numerical results, which include crucial information regarding Jakarta's air quality. The article concludes with Sect. 5, which presents the conclusions and suggestions for future directions.
2 Related Works
This literature review gives a general overview of the methodologies used to predict air quality with machine learning techniques. Ma et al. [15] studied the application of the XGBoost algorithm, developing a machine learning model that predicts daily PM2.5 concentrations by combining estimates of PM2.5 mass concentration with measured data and meteorological variables; the outcomes demonstrate improved performance over the WRF-Chem model [15]. Pan [16] used the XGBoost algorithm to estimate hourly PM2.5 concentrations and compared its effectiveness against several alternative techniques; the outcomes demonstrate XGBoost's advantage in predicting air quality [16]. Wang and Jing [17] predicted the hourly AQI in Shijiazhuang, Hebei Province, using the XGBoost model; with a stronger correlation coefficient and a lower mean absolute error (MAE), the trained model outperforms the other approaches [17]. The aforementioned research has contributed significantly to the application of the XGBoost algorithm in air quality prediction; nonetheless, there are several significant differences between these studies. With a focus on the DKI Jakarta Province region of Indonesia, this research aims to establish a thorough evaluation methodology specifically for forecasting the AQI in metropolitan areas. The goal is to offer a technique for predicting air quality that can be applied to complex urban air pollution issues.
As a result, the focus of this research will be on establishing an evaluation frame-
work to predict AQI levels using machine learning approaches. In this context, the
XGBoost algorithm is critical in overcoming the shortcomings of prior techniques.
Furthermore, the Synthetic Minority Oversampling Technique (SMOTE) was used in this work to address the imbalance in the number of samples across classes. This enables the developed model to cope with the data imbalance and provide more accurate predictions of Jakarta's air quality. Additionally, the XGBoost method and a randomized search approach are employed for parameter tuning throughout the model construction phase. To find the parameter settings that yield the most accurate predictions, the model is run through a randomly sampled collection of potential parameter combinations. This method improves the model's performance and its ability to predict Jakarta's air quality.
This study employs a research framework composed of four primary steps to analyze air quality in the DKI Jakarta Province area (Fig. 1). This methodological approach is deliberately designed to understand the components that influence air quality and to build accurate predictive models. Each stage of the methodology has a specific goal that contributes to the overall success of the research. The first stage is data collection, which describes how data is obtained from existing sources and serves as the foundation for this study. The following phase, data processing, explains the actions performed to prepare the data before it is used to train the model. The next stage, model development, describes the process of developing the model or algorithm for predicting air quality. The final stage, model performance analysis, evaluates the performance of the developed model.
Fig. 1 Research framework for air quality prediction models in Jakarta, Indonesia
This research dataset examines the Air Quality Index (AQI) as recorded by five
air quality monitoring sites located throughout DKI Jakarta Province in 2021. This
information comes from a reliable source, Jakarta Open Data, which can be viewed
through the official website: https://data.jakarta.go.id/. The AQI is an important
metric for assessing air quality in a given area. Based on the degree of air pollu-
tion by various pollutants such as PM10, PM25, SO2 , CO, NO2 , and O3 , this index
provides an overview of how good or bad air quality is at a given time. This study's attribute variables include PM10, which refers to particulate matter with a diameter of less than 10 µm [18]. These particles are inhalable and can harm human health, particularly the respiratory tract. The PM25 attribute refers to particulate matter (PM2.5) with a diameter of less than 2.5 µm; these particles are smaller than PM10, enter the lungs more easily, and have more detrimental effects on health [18]. The SO2 attribute measures sulfur dioxide in the air, a gas produced by human activities such as the burning of fossil fuels that is harmful to both the environment and human health [18]. The symbol CO
represents the concentration of carbon monoxide (CO). Numerous things, including
motor vehicles, can emit CO, which can be hazardous if inhaled in large quanti-
ties. O3 denotes the amount of ozone (O3 ) in the atmosphere. Ozone is a dangerous
contaminant on the Earth’s surface, but it also serves as a natural barrier against ultra-
violet light in the upper atmosphere. The amount of nitrogen dioxide (NO2 ) in the
atmosphere is referred to as NO2 [18]. NO2 , which is mostly created by human activ-
ities such as fuel combustion, may affect the human respiratory system. Meanwhile, the target variable is the predicted category of the air pollution standard index (PSI), the outcome of the Air Pollution Standard Index calculation. PSI is an air quality metric that describes how safe or dangerous the air we breathe is; to indicate the level of air pollution, its categories cover levels such as “good,” “moderate,” and “unhealthy” air quality.
The data acquired in the previous stage will go through a preprocessing process at
this level. This procedure ensures that the data is of good quality and ready for use
in model building. This stage is divided into three major sub-stages. Exploratory
Data Analysis (EDA): At this step, the data is examined to discover patterns and
other information that can aid in a deeper understanding of the data’s features [19].
In this study, the correlation matrix was also used to investigate the association.
This EDA stage will enable researchers to have a deeper understanding of the data
being used. Imbalanced Data Handling Stage: There is frequently an imbalance in the
number of samples in each target class in the dataset used for air quality prediction.
For example, the number of samples in the “good” or “moderate” category could be
significantly more than the number of samples in the “unhealthy” category. Because
of this imbalance, the model may perform better when predicting the majority class,
while the minority class may perform poorly. The Synthetic Minority Oversampling
Technique (SMOTE) is an oversampling technique that focuses on minority classes
by developing synthetic samples that look like existing minority samples [20].
SMOTE takes samples from the minority class and generates new synthetic samples by interpolating between existing points [20]. This increases the number of samples in the minority class without simply duplicating existing samples, so the minority class becomes more balanced with the majority class. This helps the model learn more effectively and avoid bias toward
the dominant class. Data splitting: After processing, the data is divided into training
and test data in a predetermined proportion. Training data is used to train the model,
whereas test data is used to evaluate the model’s performance. The proportion of
training data to test data in this study is 70:30 [21], with the majority of the data used
to train the model (70%), and just a small percentage (30%) used for testing. This
ratio ensures that the model has enough opportunities to learn from the training data
and that it also performs well on the test data.
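A minimal sketch of this balancing and splitting step, using imbalanced-learn and scikit-learn, is shown below; the file path and column names are assumptions for illustration, not the actual Jakarta Open Data schema.

```python
# SMOTE oversampling followed by a stratified 70:30 train/test split.
# The CSV path and column names are illustrative assumptions.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

df = pd.read_csv("jakarta_air_quality_2021.csv")
X = df[["pm10", "pm25", "so2", "co", "o3", "no2"]]
y = df["category"]                      # PSI category: good / moderate / unhealthy

X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.3, stratify=y_bal, random_state=42)
print(len(X_train), "training samples,", len(X_test), "test samples")
```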
The model development stage is critical in this study, since it builds the air quality prediction system using the XGBoost algorithm. XGBoost was selected for a variety of reasons, including its ability to handle complex data and give superior results in classification tasks, and because of its strong reputation in the data science and artificial intelligence communities [22]. Extreme gradient boosting (XGBoost) is an ensemble learning algorithm that combines several simple predictive models (weak learners) into a stronger model (strong learner) [23]. This is done using boosting techniques that can reduce the bias and variance in the model. Before training the model, it is necessary to determine the optimal parameters for the XGBoost algorithm. This step was performed using a random search method [24], which tries various combinations of parameters and measures their performance using cross-validation [25]. The optimal parameters are selected based on the best performance during cross-validation. Once the optimal parameters are found through the random search, the XGBoost model is trained using the training data and these parameters with the K-Fold validation technique [26]. In this technique, the training data is split into several folds and the model is trained and evaluated on each fold, ensuring that the developed model can generalize well to a wide variety of data. By using the XGBoost algorithm with optimized parameters and K-Fold validation in the evaluation, this model development stage aims to produce an air quality prediction model that has high performance and can generalize to data that has never been seen before [26].
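The random search with K-Fold cross-validation described above might be sketched as follows; the parameter grid mirrors the configurations listed later in this chapter, while the number of iterations, folds, and scoring metric are illustrative assumptions rather than the authors' exact settings.

```python
# Randomized search over XGBoost parameters with stratified K-Fold validation.
# The parameter grid follows the configurations reported in this chapter;
# n_iter, folds, and scoring are assumed illustrative choices.
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Encode the PSI categories as integers, as required by XGBClassifier.
# X_train, y_train come from a split like the one sketched earlier.
y_enc = LabelEncoder().fit_transform(y_train)

param_dist = {
    "subsample": [0.6, 0.7, 0.8],
    "n_estimators": [100, 200],
    "max_depth": [3, 4, 5],
    "learning_rate": [0.01, 0.02],
    "gamma": [0, 0.1, 0.2],
}

search = RandomizedSearchCV(
    estimator=XGBClassifier(eval_metric="mlogloss"),
    param_distributions=param_dist,
    n_iter=20,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1_macro",
    random_state=42,
)
search.fit(X_train, y_enc)
print("Best parameters:", search.best_params_)
```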
The XGBoost algorithm was used in this study to provide accurate and thorough
air quality predictions. XGBoost is widely recognized for its outstanding capabilities
in various machine learning applications, especially in classification tasks. As such,
it is an obvious choice for developing strong models that provide precise forecasts of
air quality. Predicting air quality can be difficult when dealing with data that is unbal-
anced, meaning that some categories of air quality occur less frequently than others.
Crucially, the study approach combines XGBoost with the Synthetic Minority Over-
sampling Technique (SMOTE), which balances the dataset by creating new instances
of the minority class artificially. This guarantees that the model is free from biases
and that it can provide precise predictions for all categories of air quality. Moreover,
the usefulness of XGBoost is emphasized by its ability to optimize parameters. This
implies that the model’s predictive power can be further increased by fine-tuning it
to match the particular features of air quality data. Additionally, XGBoost makes the
inner workings of the model transparent, allowing insights into the factors driving
air quality predictions to be extracted. This interpretability enables researchers to
make defensible conclusions by promoting a deeper comprehension of the intricate
interactions between variables and air quality. In addition to its effectiveness and
interpretability, the selection of XGBoost contributes to the larger field of machine
learning applications that tackle environmental issues. It’s a start in the right direction
toward developing long-term solutions to the problems caused by air pollution.
The final stage is model evaluation, in which the predictive model's performance is tested using important metrics such as accuracy, F1-score, and recall [27]. Accuracy is a metric that measures how successfully a classification model predicts each class in a dataset; it is computed by dividing the number of correct predictions across all classes by the total number of predictions made by the model [28]. Recall is a metric that assesses how well a model can detect real positive cases among all positive cases in a dataset. In multi-class classification, recall is calculated independently for each class. Recall is key in instances where avoiding false negatives (failing to recognize true positive cases) is critical. The F1-score is a metric that combines recall and precision, the model's ability to recognize positives correctly. In multi-class classification, the F1-score can be calculated independently for each class, and then the average across classes is determined [28].
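For reference, these metrics can be computed with scikit-learn as sketched below; the toy labels and the macro-averaging choice are illustrative assumptions.

```python
# Accuracy, macro-averaged recall and F1-score for multi-class predictions.
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = [0, 2, 1, 0, 2, 2, 1]      # toy ground-truth class labels
y_pred = [0, 2, 1, 0, 1, 2, 1]      # toy model predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Recall  :", recall_score(y_true, y_pred, average="macro"))
print("F1-score:", f1_score(y_true, y_pred, average="macro"))
```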
Before going on to the next step of modeling to predict air quality based on the
standard air pollution index in DKI Jakarta Province, this stage is an exploratory
data analysis (EDA), which attempts to understand the features of the data.
According to Fig. 2, a high correlation between PM10 and PM25 particle concen-
trations shows that changes in PM10 typically follow changes in PM25. The modestly positive connection between PM10 and SO2 and between PM10 and NO2, however, suggests that gaseous pollutants like SO2 and NO2 can coexist with large airborne particles. The weak association between the pollution parameters (CO, O3, SO2, and NO2) and air quality suggests that air quality is also influenced by weather, pollution sources, and other intricate interactions. Some weak negative relationships between pollution parameters, such as those between SO2 and NO2 and between O3 and NO2, indicate that one pollutant can become more concentrated while another becomes less concentrated. The next stage is to verify the
number of classes in the air quality target variable. The goal of this stage is to deter-
mine whether or not there is an imbalance in the distribution of data among air quality
categories. If the distribution of data between air quality classes is highly unequal,
the performance of the model to be created may suffer, especially if the model tends
to favor the majority class while ignoring the minority class.
According to preliminary examination, the class distribution of air quality param-
eters appears to be unbalanced. This experiment uses the SMOTE approach to
manage uneven data to address this imbalance. Using the features of existing minority
samples, this method generates synthetic data for the minority class. As seen in Fig. 3,
the data distribution was adequately balanced after applying the SMOTE technique.
After handling the data imbalance using the SMOTE approach, the next step in this
research is to divide the data into training data and test data with a ratio of 70:30,
resulting in 2408 training samples and 1033 test samples. The objective of this division is to ensure that, in addition to fitting the training data, the developed model can perform well on data that it has never seen before. After the data has been split, the next step is to build a model using the XGBoost technique, one of the most efficient and widely used algorithms for classification tasks. The XGBoost algorithm will have several parameters tuned in this study to determine the ideal configuration. These are the parameters that were used:
a. Subsample: This setting regulates the number of samples utilized in each itera-
tion of the tree-building process. Since only a small portion of the data is used
in each iteration, choosing lower values for the subsample can aid in preventing
overfitting [29]. Configurations 0.6, 0.7, and 0.8 are used in this investigation.
b. n_estimators: this parameter specifies the number of trees to be built by the algorithm [29]. A larger number of trees generally increases the model's ability to capture complex patterns in the data. Configurations of 100 and 200 n_estimators are used in this study.
Fig. 2 Correlation heat map of numeric attributes: each color’s darker contrast denotes a strong
link, either positive or negative. Brighter contrast, on the other hand, denotes a weak or absent link
Fig. 3 Comparison of the distribution of air quality data before and after handling the imbalance
c. max_depth: variations of 3, 4, and 5 are used. This parameter controls the maximum depth of each tree in the ensemble [30]. The right setting for max_depth is necessary to avoid overfitting [31]; if the trees are allowed too much depth, the model may fit noise in the training data.
d. Learning rate: the learning rate determines how much the tree weights are adjusted at each iteration [32]. Choosing the right learning rate affects model convergence and overall performance; if the learning rate is too high, the model may fail to reach the global minimum. Values of 0.01 and 0.02 are used in this study.
e. Gamma: Gamma configurations 0, 0.1, and 0.2 were applied in this experiment.
When a tree is formed, the gamma value controls when the nodes split [33].
The right gamma value can help prevent overfitting by controlling the tree’s
complexity.
f. colsample_bytree: This option regulates the proportion of features used to build each tree. The model can generalize better and be more resilient to overfitting when a feature subset is selected [34]. The settings are 0.4, 0.5, and 0.6.
A random search over these configurations identified the optimal parameter combination for predicting air quality in Jakarta. This parameter combination is used to train the model on all training data.
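A minimal Python sketch of this workflow, assuming the xgboost, imbalanced-learn, and scikit-learn packages and a synthetic dataset in place of the Jakarta air pollution data, could look as follows; the parameter grids mirror the configurations listed above, while the dataset, n_iter, and random seeds are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): SMOTE balancing, a 70:30 split,
# and a 5-fold randomized search over the XGBoost settings described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Synthetic, imbalanced three-class data standing in for the air-quality dataset
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, weights=[0.6, 0.3, 0.1], random_state=42)

X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)            # balance the classes
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.3, random_state=42, stratify=y_bal)   # 70:30 split

param_distributions = {
    "subsample": [0.6, 0.7, 0.8],
    "n_estimators": [100, 200],
    "max_depth": [3, 4, 5],
    "learning_rate": [0.01, 0.02],
    "gamma": [0, 0.1, 0.2],
    "colsample_bytree": [0.4, 0.5, 0.6],
}

search = RandomizedSearchCV(
    XGBClassifier(objective="multi:softmax", eval_metric="mlogloss"),
    param_distributions, n_iter=20, cv=5, scoring="f1_macro", random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)
```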
The model’s performance in each fold is shown by the results of fivefold cross-
validation (CV) [35]. Table 2 shows that the model’s performance at each fold is
quite good. For each fold, the accuracy, F1-score, and average recall are in the
0.9937 to 0.9980 range. This shows that the model can predict Jakarta's air quality accurately. Overall, Fold 5 has the highest F1-score and recall, indicating that the model performs best when categorizing data in this fold. Based on the performance evaluations in each fold, the 5th fold is the best choice to represent the overall performance of the generated XGBoost model. Since the performance results for each fold are good and consistent, the best model is chosen to make predictions on data that has never been seen before (test data). In this case, the best model can
be selected based on previous experimental discoveries of the best parameters and
folds. The best model may be employed to predict air quality in Jakarta based on
new data.
Figure 4 shows the confusion matrix, which reflects the model’s predicted results
on the test data. This confusion matrix illustrates how well the model classifies each
class. The algorithm correctly identified 324 samples as “good,” whereas 0 samples
were incorrectly labeled as “good.” The model correctly classified 369 samples in the
“Medium” category, whereas no samples were incorrectly classified as “Medium.”
The program correctly identifies 335 samples as “unhealthy,” but five samples are
incorrectly labeled as such. Overall, the model does an excellent job of classifying air
into “good” and “medium” categories with flawless accuracy and precision. However,
there are some predicted inaccuracies in the “unhealthy” air class. This may be an
area that requires additional effort to improve model performance in classifying
underrepresented classes. The confusion matrix shows the results of the model’s
performance on test data (testing), which received an accuracy value of 0.995160.
The accuracy of the model is measured by how well it classifies the data as a whole.
With an accuracy value of roughly 0.995, it indicates that the model successfully
predicted 99.5% of all test data. Furthermore, the F1-score, the harmonic mean of precision and recall, is 0.995288. The F1-score value of roughly 0.995 implies a balance between precision and recall, indicating that the model performs well both in precision and in recognizing true positives. The last metric is the recall value,
which compares the model’s ability to recognize true positives to the total number
of actual positives. A recall value of roughly 0.995 suggests that the model is quite good at identifying positive cases.
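For completeness, a hedged sketch of how such a confusion matrix and the accompanying test-set metrics can be produced with scikit-learn is given below; the arrays are placeholders rather than the study's predictions.

```python
# Illustrative sketch: confusion matrix and per-class report for test-set predictions.
from sklearn.metrics import confusion_matrix, classification_report

y_test = ["good", "good", "medium", "unhealthy", "medium", "unhealthy"]    # hypothetical true labels
y_hat = ["good", "good", "medium", "unhealthy", "unhealthy", "unhealthy"]  # hypothetical predictions

print(confusion_matrix(y_test, y_hat, labels=["good", "medium", "unhealthy"]))
print(classification_report(y_test, y_hat, digits=4))
```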
After determining the optimal model, a more in-depth examination of the signifi-
cance of variables is required to comprehend the role and influence of each feature on
Jakarta air quality predictions. The importance of each variable on model predictions
is measured by the variable’s significance. According to this theory, high-importance
features have a larger impact on prediction choices than low-importance features. The
significance of factors in predicting Jakarta’s air quality can shed light on the charac-
teristics that are most crucial in determining air pollution levels. This information can
help anyone understand the factors that affect air quality, including environmental
scientists, policymakers, and the general public.
The variable importance measuring approach in this study makes use of several
indicators, including weight, gain, total gain, and cover as shown in Fig. 5. Each
metric offers a unique perspective on the significance of a feature in terms of affecting
the model’s predictions. The PM25 feature highlights three indicators: gain, total
gain, and cover. This reveals that PM25 has a major impact on air quality estimates.
Because PM2.5 is a fine particulate emitted by a range of sources, including autos,
industry, and combustion, its dominance in three indices indicates that PM2.5 levels
have a substantial impact on Jakarta’s air pollution. Controlling PM2.5 emissions
could be the primary focus of efforts to enhance air quality. Despite having low
relevance values in two indicators (gain and cover), SO2 has a considerable impact
on air quality predictions, as evidenced by its high weight value. SO2 levels in the
air have a significant role in affecting air quality. The high importance of SO2 in the
“Weight” indicator may indicate that fluctuations in SO2 levels in the air can have
a large impact on overall air quality. Even though it dominates only the weight indicator,
it is critical to continue monitoring and controlling SO2 levels to maintain good air
quality. CO has a low significance value for all indicators, indicating that CO has no
meaningful impact on predicting air quality in the context of this study. However, it
should be highlighted that CO levels should be regularly monitored because this gas
can come from sources such as motor vehicles and combustion and can have major
health consequences.
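The indicators discussed above can be read directly from a fitted XGBoost booster; the sketch below is illustrative, using synthetic data in place of the air-quality features.

```python
# Illustrative sketch: extracting the weight, gain, total_gain, and cover importance
# indicators from a fitted XGBoost classifier.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=6,
                           n_classes=3, random_state=0)
model = XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)

booster = model.get_booster()
for metric in ("weight", "gain", "total_gain", "cover"):
    # get_score returns {feature_name: value}; features never used in a split are omitted
    print(metric, booster.get_score(importance_type=metric))
```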
5 Conclusion
According to the findings of this study, the XGBoost parameters improved using
random search have a considerable impact on model performance. The optimal
parameter combination, with a subsample of 0.7, n_estimators of 200, max_depth of 5, learning rate of 0.01, gamma of 0.1, and colsample_bytree of 0.6, may
offer reliable and consistent predictions of air quality. The fivefold cross-validation
technique yields a model with consistent performance across all folds. The accu-
racy, F1-score, and recall values in each fold range from 0.9937 to 0.9980. This
demonstrates that the algorithm can accurately estimate air quality in Jakarta and is consistent across different conditions. Furthermore, the test data evaluation results are
excellent. The model has an accuracy of 0.995160, an F1-score of 0.995288, and a
recall of 0.995098. This demonstrates that the model is highly good at categorizing
air quality.
PM2.5 and SO2 are the key variables influencing air quality predictions in Jakarta,
according to an examination of the importance of variables. This model can be used to
understand and address the air pollution issues in Jakarta since PM2.5 has a substan-
tial impact on three critical indicators, while SO2 has a strong impact on the “Weight”
indicator, demonstrating its importance in air quality prediction. In their
efforts to preserve better air quality, environmental scientists, legislators, and the
general public can benefit greatly from this knowledge.
References
1. Domingo JL, Rovira J (2020) Effects of air pollutants on the transmission and severity of respi-
ratory viral infections. Environ Res 187:109650. https://doi.org/10.1016/J.ENVRES.2020.
109650
2. Liu M et al (2019) Population susceptibility differences and effects of air pollution on cardio-
vascular mortality: epidemiological evidence from a time-series study. Environ Sci Pollut Res
26(16):15943–15952. https://doi.org/10.1007/S11356-019-04960-2/FIGURES/1
3. Lestari P, Arrohman MK, Damayanti S, Klimont Z (2022) Emissions and spatial distribution of
air pollutants from anthropogenic sources in Jakarta. Atmos Pollut Res 13(9):101521. https://
doi.org/10.1016/J.APR.2022.101521
4. Mehmood I et al (2020) Carbon cycle in response to global warming. Environ Clim Plant Veg
Growth 1–15. https://doi.org/10.1007/978-3-030-49732-3_1/COVER
5. Raihan A, Muhtasim DA, Pavel MI, Faruk O, Rahman M (2022) An econometric analysis of
the potential emission reduction components in Indonesia. Cleaner Prod Lett 3:100008. https://
doi.org/10.1016/J.CLPL.2022.100008
6. McGranahan G, Songsore J, Kjellén M (2021) Sustainability, poverty and urban environmental
transitions. In: The Earthscan reader in sustainable cities, pp 107–133.. https://doi.org/10.4324/
9781315800462-8
7. Abulude F, Abulude I, Oluwagbayide S, Afolayan S, Ishaku D (2021) Air quality index: case
of one-day monitoring of 253 urban and suburban towns in Nigeria. Environ Sci Proc 8(1):4.
https://doi.org/10.3390/ECAS2021-10342
8. Tiwari A et al (2019) Considerations for evaluating green infrastructure impacts in microscale
and macroscale air pollution dispersion models. Sci Total Environ 672:410–426. https://doi.
org/10.1016/J.SCITOTENV.2019.03.350
9. Masih A (2019) Machine learning algorithms in air quality modeling. Global J Environ Sci
Manag 5(4):515–534. https://doi.org/10.22034/GJESM.2019.04.10
10. Wang H, Yilihamu Q, Yuan M, Bai H, Xu H, Wu J (2020) Prediction models of soil heavy
metal(loid)s concentration for agricultural land in Dongli: a comparison of regression and
random forest. Ecol Indic 119:106801. https://doi.org/10.1016/J.ECOLIND.2020.106801
11. Abdullah S, Ismail M, Ahmed AN, Abdullah AM (2019) Forecasting particulate matter concen-
tration using linear and non-linear approaches for air quality decision support. Atmosphere
10(11):667. https://doi.org/10.3390/ATMOS10110667
12. Su X, An J, Zhang Y, Zhu P, Zhu B (2020) Prediction of ozone hourly concentrations by support
vector machine and kernel extreme learning machine using wavelet transformation and partial
least squares methods. Atmos Pollut Res 11(6):51–60. https://doi.org/10.1016/J.APR.2020.
02.024
13. Ali M, Dewan A, Sahu AK, Taye MM (2023) Understanding of machine learning with deep
learning: architectures, workflow, applications and future directions. Computers 12(5):91.
https://doi.org/10.3390/COMPUTERS12050091
14. Otok BW, Suharsono A, Purhadi, Standsyah RE, Al Azies H (2022) Partitional clustering of
underdeveloped area infrastructure with unsupervised learning approach: a case study in the
Island of Java, Indonesia. J Reg City Plann 33(2):77–196. https://doi.org/10.5614/JPWK.2022.
33.2.3
33. Liang Y et al (2019) Product marketing prediction based on XGboost and LightGBM algorithm.
In: ACM international conference proceeding series, pp 150–15. https://doi.org/10.1145/335
7254.3357290
34. Parsa M (2021) A data augmentation approach to XGboost-based mineral potential mapping:
an example of carbonate-hosted ZnPb mineral systems of Western Iran. J Geochem Explor
228:106811. https://doi.org/10.1016/J.GEXPLO.2021.106811
35. Haumahu JP, Permana SDH, Yaddarabullah Y (2021) Fake news classification for Indone-
sian news using Extreme Gradient Boosting (XGBoost). IOP Conf Ser Mater Sci Eng
1098(5):052081. https://doi.org/10.1088/1757-899X/1098/5/052081
Modeling Earthquake Catalog
in Sumatra by Space–Time
Epidemic-Type Aftershock Sequences
Model: Combining
Davidon–Fletcher–Powell and Stochastic
Declustering Algorithms
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 335
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_25
1 Introduction
The remainder of the paper is presented as follows. Section 2 describes the earth-
quake dataset while Sect. 3 details the methodology. We discuss the results in Sect. 4
and conclude the paper in Sect. 5.
2 Data Description
The dataset includes coordinates (latitude, longitude), time, and magnitude of earth-
quake occurrences in Sumatra during 1 July 2005–30 June 2023, obtained from
the United States Geological Survey (USGS). We focus only on the major-shallow
earthquakes (magnitude ≥ 4 and depth < 60 km) which are often classified as a type
of earthquake causing great damage [17]. The Sumatra study area is defined as the region within −6.125° to 5.806° latitude and 95.010° to 108.545° longitude (see Fig. 1). Based
on Fig. 1, the distribution of earthquakes on the island of Sumatra tends to be domi-
nated by the western and southern regions of Sumatra. In addition, most earthquakes
originate from marine areas close to Sumatra. Table 1 depicts the data structure for
N earthquake events, consisting of n complementary events and N − n main events.
In this study, we divide the research data into two parts, namely the main events and
complementary events. The division is done to increase the accuracy of the modeling.
The existence of a complementary event is useful for identifying previous patterns of
earthquake occurrence, especially aftershocks in the past which can have an effect on
the next time period. In this study, we divide main events and complementary events
based on the length of time duration. The main event is defined from 01 July 2017
to 30 June 2023. Meanwhile, complementary events are defined differently in the
four model scenarios as shown in Table 2. In previous studies, complementary events
were determined by the researcher directly [11, 13, 14]. The model scenario is used
to identify the best choice of complementary events. Because the use of different
complementary events affects the computation time and convergence of parameters,
we also conduct a review of the use of complementary events in the Space–Time
ETAS model.
From 1 July 2005 to 30 June 2023, there have been 5142 earthquakes that have
shallow depth and moderate strength (with 1104 earthquake events defined as main
events). One source of earthquakes on the island of Sumatra is the subduction zone off
the west coast of Sumatra. This is due to the Indo-Australian Plate subducting beneath the Eurasian Plate. The collision boundary between the two plates can be
observed in the form of a deep-sea trench line west of Sumatra to the Andaman
Islands [18, 19]. Therefore, the distribution of earthquake events on the island of
Sumatra is dominated in the western region. During this time period, the frequency of
earthquakes on Sumatra Island varied. The highest frequency of earthquakes occurred
on 12 September 2007, namely 66 times. Meanwhile, during other time periods, the
frequency of earthquakes on the island of Sumatra is less than 66 times per day.
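As an illustration of the selection rules described above, a pandas sketch applied to a hypothetical USGS catalog export might look like the following; the file name and column names (time, latitude, longitude, depth, mag) follow the common USGS CSV format and are assumptions, not details taken from the paper.

```python
# Illustrative sketch: selecting major-shallow events (magnitude >= 4, depth < 60 km)
# inside the study window and bounding box from a hypothetical USGS catalog export.
import pandas as pd

catalog = pd.read_csv("sumatra_usgs_catalog.csv", parse_dates=["time"])

mask = (
    (catalog["mag"] >= 4)
    & (catalog["depth"] < 60)
    & (catalog["time"] >= "2005-07-01")
    & (catalog["time"] <= "2023-06-30")
    & catalog["latitude"].between(-6.125, 5.806)
    & catalog["longitude"].between(95.010, 108.545)
)
events = catalog.loc[mask, ["time", "latitude", "longitude", "depth", "mag"]]
print(len(events), "major-shallow events retained")
```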
Fig. 1 Earthquake distribution map on Sumatra Island. The red line indicates the study area. The
dots scattered in the study area are marked point patterns that show the distribution of earthquakes
with certain magnitude values on the island of Sumatra, Indonesia, since 01 July 2005 until 30 June
2023
3 Methodology
Equation (3) contains a number of constraints for parameter values, namely μ > 0,
A > 0, α > 0, c > 0, p > 1, D > 0, γ > 0, and q > 1. Equation (3) also has
various important features for each earthquake [11].
a. The main earthquake is assumed to be an immigrant in the branching process of
earthquake events. The background seismicity rate is denoted as ũ(x, y):
In this study, the background seismicity rate is estimated using a kernel estimator
via an isotropic bivariate Gaussian function (ϕ(x)) with a bandwidth h [13].
b. Each aftershock is independent of other aftershocks. The average number of
aftershocks resulting from an earthquake with a magnitude of m is expressed as
κ(m):
c. The probability distribution from the time of the main earthquake until after-
shocks occur is assumed to be a function of the time lag of the main earthquake,
namely g(t − τ), where τ is the time between earthquake events:
$$g_{c,p}(t - t_i) = \begin{cases} \dfrac{p-1}{c}\left(1 + \dfrac{t - t_i}{c}\right)^{-p}, & t - t_i > 0 \\[4pt] 0, & t - t_i \le 0 \end{cases}$$
In this study, the function is defined as the radial symmetry density function [13].
e. The magnitude distribution of the main earthquakes is independent of location, the magnitude distribution of the aftershocks is independent of the magnitude of the main earthquake, and the magnitude distributions of the main earthquakes and the aftershocks are identical.
Based on Eqs. (2) and (3), there are nine parameters, namely β, μ, A, α, c, p, D,
γ , q, which need to be estimated in the Space–Time ETAS model.
where
$$l_1(\beta|C_t) = \log(\beta)\sum_{i=1}^{N}\delta_i - \beta\sum_{i=1}^{N}\delta_i(m_i - m_0); \text{ and}$$
$$l_2(\theta|C_t) = \sum_{i=1}^{N}\delta_i\log\big(\lambda_\theta(t_i, x_i, y_i|C_t)\big) - \int_{t_{\mathrm{start}}}^{t_{\mathrm{start}}+T}\iint_S \lambda_\theta(t, x, y|C_t)\,\mathrm{d}x\,\mathrm{d}y\,\mathrm{d}t.$$
The value of δ_i = 1 if event i is a main event and δ_i = 0 otherwise. Equation (4) has two main parameter components, namely the parameter β and the vector parameter θ = (μ, A, α, c, p, D, γ, q)^T.
Parameter estimation in the Space–Time ETAS model is carried out using the
Maximum Likelihood Estimation (MLE) method. The MLE method produces the closed-form estimate $\hat{\beta} = \sum_{i=1}^{N}\delta_i \big/ \sum_{i=1}^{N}\delta_i(m_i - m_0)$. However, for the vector parameter θ, the MLE method produces an equation that is not in closed form. Due to the
complexity of l2 (θ|Ct ), numerical methods must be considered. For this reason,
parameter estimation is continued using numerical iteration. In this case, a radial
partition is performed to approximate the value in the integral form in l2 (θ|Ct ). The
integral value of each partition is approximated using a transformation from Cartesian
coordinates to polar coordinates. Furthermore, the estimation of vector parameter θ
is obtained by minimizing ξ (θ ) = −l2 (θ |Ct ) through the Davidon–Fletcher–Powell
(DFP) algorithm and continued with stochastic declustering (SD) algorithm [11–13].
Table 3 describes the DFP algorithm and SD algorithm, which require ∇ξ(θ) (the gradient of ξ(θ)), the matrix H_k (the Hessian matrix of ξ(θ)), and û_0(x_j, y_j) as inputs. In this study, the convergence limit used is ε = 0.001.
The computation is done with the help of the ETAS packages available in R. In
this case, the initial value for the vector parameter θ is $\mu = N/(4T|S|)$, A = 0.01, α =
1, c = 0.01, p = 1.3, D = 0.01, γ = 1, and q = 1, where T is the length of time
duration of the main event [13, 14]. After getting the complete parameter estimation
values, the Akaike information criterion (AIC) value can be calculated. The AIC
value serves as a measure of optimization of a model [13]. In this study, there are
four model scenarios whose optimization values are compared. The AIC value is
obtained through the following formula [20, 21]:
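In its standard form,
$$\mathrm{AIC} = -2\,l\big(\hat{\theta}\big) + 2K,$$
where $l(\hat{\theta})$ is the maximized log-likelihood and $K$ is the number of estimated parameters (nine in the Space–Time ETAS model); the model scenario with the smallest AIC is preferred.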
(Table 3, excerpt) Step 6 of the DFP algorithm sets the scaling constant $c_2 = 1/(\eta_k^{T} H_k \eta_k)$, and step 7 updates the matrix as $H_{k+1} = H_k + c_1 h_k d_k^{T} - c_2 H_k \eta_k \eta_k^{T} H_k$. The SD algorithm monitors the relative changes $e_2 = \max_j\{\hat{u}_{k+1}(x_j, y_j)/\hat{u}_k(x_j, y_j)\} - 1$ and $e_3 = \xi(\hat{\theta}_{k+1})/\xi(\hat{\theta}_k) - 1$. The process is repeated from step (1) with k = k + 1 until the convergence criterion is reached: $\hat{\theta}_{k+1} - \hat{\theta}_k < \varepsilon$ or $e_1 < \varepsilon$, $e_2 < \varepsilon$, and $e_3 < \varepsilon$, with $\nabla\xi(\hat{\theta}_{k+1}) < \varepsilon$. The outputs are $\hat{\theta}_{k+1}$, $H_{k+1}$, and $\hat{u}_{k+1}(x, y)$.
Based on research data, the highest magnitude value of 8.4 occurred on 12 September
2007. This earthquake has a hypocenter located about 122 km southwest of Bengkulu, Sumatra. In general, the magnitude value of earthquakes on the island of
Sumatra tends to be dominated by a magnitude value of 4 (with a mean of 4.59 and
Fig. 2 a Histogram of the values of magnitude and b magnitude plots with log values of the number
of earthquake occurrences on Sumatra Island
a median of 4.50), whereas a very high magnitude value has a small frequency of
occurrence. The distribution of the magnitude values is presented in Fig. 2a.
Figure 2a shows that the earthquakes on Sumatra Island were dominated by
moderate earthquakes with a magnitude of 4.35 to 4.45, namely 621 times. Based
on Fig. 2b, it can be seen that, in general, the plot of magnitude against the log of the number of earthquake occurrences forms a linear trend across all model scenarios. This is in accordance with the Gutenberg–Richter law [14]. The Gutenberg–Richter law states that the logarithm of the frequency of earthquakes is linearly related to the magnitude, namely log10(N) = a − bm, where a and b are parameters.
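The Gutenberg–Richter relationship can be checked numerically by regressing the base-10 logarithm of cumulative event counts on magnitude; the short sketch below uses synthetic magnitudes rather than the Sumatra catalog, and the binning choices are illustrative.

```python
# Illustrative sketch: estimating the Gutenberg-Richter parameters a and b
# from log10(N) = a - b*m, using synthetic magnitudes with b close to 1.
import numpy as np

rng = np.random.default_rng(0)
mags = 4 + rng.exponential(scale=1 / np.log(10), size=5000)   # synthetic catalog, m >= 4

m_grid = np.arange(4.0, 6.01, 0.1)
counts = np.array([(mags >= m).sum() for m in m_grid])        # cumulative counts N(>= m)

slope, intercept = np.polyfit(m_grid, np.log10(counts), 1)    # fit log10 N = a - b*m
print(f"a = {intercept:.2f}, b = {-slope:.2f}")
```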
Furthermore, Space–Time ETAS modeling is carried out based on the model
scenario defined in Table 2. In this case, it is necessary to carry out diagnostics of
stationarity assumptions for time and data completeness before estimating parame-
ters. This is because the Space–Time ETAS model is built on these two assumptions.
Diagnostics of the two assumptions are presented through a magnitude plot with the
log value of the number of earthquake occurrences and a time plot with the cumu-
lative number of earthquake occurrences. In the previous section, it has been shown
that research data follow Gutenberg–Richter law (see Fig. 2b). This indicates that
the assumption of data completeness has been met.
Meanwhile, the assumption of stationarity needs to be tested in each modeling
scenario. Figure 3 shows the time plot with the cumulative number of earthquake
occurrences for each model. The time plot against the cumulative number of earth-
quake occurrences which forms a linear trend shows that the earthquake occurrences
are stationary with respect to time. Model 1 has a linear pattern in the study period, but a nonlinear pattern in the complementary events. Therefore, it can be concluded that model 1 only fulfills the assumption of completeness; its parameters are still estimated, however, as material for comparing results. On the other hand, models 2, 3, and 4 fulfill all assumptions, namely stationarity and data completeness.
It is very important to fulfill the assumptions in the Space–Time ETAS modeling.
Violation of the assumptions can cause the estimation results to fail to converge or
the estimation results obtained to be illogical to interpret (if they converge) [22].
Parameter estimation results for each model are presented in Table 4. In this case, the
estimated value for the parameter β is identical for each model because the solution is
closed form. Meanwhile, other parameters have different estimation results because
they are obtained by a numerical approach. For all models, the parameters c and D
are not significant because they have a larger standard error value than the estimation
results (see Table 4). This means that the difference in time duration used for the
complementary events does not affect the significance of the estimated parameters
of the Space–Time ETAS model. This indicates that the length of complementary
events does not have a significant impact.
To evaluate the four models, we compare using several metrics with the results
shown in Table 5. Model 4 outperforms the other models in terms of log-likelihood,
AIC, number of stochastic declustering iterations, and execution time. In this case,
the optimal model is characterized by the largest log-likelihood value, the minimum
AIC value, the least number of iterations, and the fastest execution time.
In addition, Table 5 shows that the longer duration of time on complementary
events is not always directly proportional to the AIC value (see, e.g., models 2 and
3). Furthermore, the role of data stationarity is very important to the efficiency of
computation time. In Fig. 3, it has been shown that model 1 does not meet the
assumption of stationarity. If parameter estimation is still carried out on data with
this case, the computation time becomes very long. Meanwhile, data that meets the
assumption of stationarity will tend to have faster computation time. Therefore, the
selection of the length of the duration of the complementary events can be done
Fig. 4 Comparison of the number of iterations of the DFP algorithm and the SD algorithm for each
model
by considering the stationarity of the data. If the research data used as the main
event is not stationary, then a certain length of time duration can be selected as a
complementary event to make the data stationary.
The time required for computation is also affected by the volume of data. In general, smaller amounts of data require less computation time. Figure 4 shows a comparison
of the number of iterations of the DFP algorithm and the SD algorithm for each model.
In Fig. 4, it can be seen that the number of iterations of the DFP algorithm in model
2 at each stochastic declustering stage is 19; 9; 5; 3; and 2. This is the least number
of iterations when compared to iterations in other models. However, this does not mean that model 2 has the fastest computing time; in fact, the fastest computation time occurs in model 4, which has less data than model 2. Thus, for time efficiency, we suggest that the complementary event does not need to be longer than the main event; as long as the data meet the assumptions of the Space–Time ETAS model, the optimal modeling can be carried out.
Model interpretation and earthquake hazard mapping on Sumatra Island are based
on model 4. The parameter estimation results in model 4 show that there are two
parameters that are not significant out of nine parameters. Each of the estimated parameter values can be used to identify patterns of temporal seismicity. The significant β parameter shows that the earthquake magnitude affects the probability of an earthquake occurring on the island of Sumatra. In addition, the
significant parameters A and α indicate that the number of aftershocks on the island
of Sumatra is affected by the magnitude. The α value of 1.7464 measures the effi-
ciency of the magnitude of an earthquake in producing aftershocks. Meanwhile, the
p value of 1.1740 indicates the aftershock decay rate. Based on the estimation results,
Fig. 5 a Background seismicity rate, b total spatial rate, c clustering coefficient, and d conditional
intensity function value at the end of the study period
the results can be visualized to facilitate the interpretation of the model as shown in
Fig. 5.
The value of the background seismicity rate of the main earthquake in each region
of Sumatra Island is dominated along the west and north coasts of Sumatra Island
(see Fig. 5a). The rate value is between 0.00 and 0.25. Meanwhile, the total spatial
rate for both the main and aftershocks is mapped in Fig. 5b. These results tend to be
almost the same as the background seismicity rate.
On the other hand, the mapping of aftershocks can be seen based on the value of the
clustering coefficient in Fig. 5c. Some regions have very high clustering coefficient
values. This means that the potential for aftershocks after the main earthquake in the
area is very high. It can be seen that the high coefficient values are in the sea area
and land area of Sumatra Island. In the sea area, the west coast of Sumatra Island has
a very high potential for aftershocks. Meanwhile, in the mainland, the provinces of
Aceh and North Sumatra have the potential for very high aftershocks. The mapping
of the conditional intensity function at the end of the earthquake period on Sumatra
Island is presented in Fig. 5d. The probability of an earthquake occurring at the end
of the study period has the same distribution pattern as the background seismicity
rate chart. This means that the potential for an earthquake to occur in the future will
still have the same intensity as the previous incident.
The results of this study are used to create a hazard map for earthquake-prone areas,
both the main earthquake and aftershocks on the island of Sumatra. The hazard map is
presented in Fig. 6. Figure 6 shows that the main earthquake potential in the Sumatra
region is dominated by land and sea areas. Meanwhile, the potential for aftershocks to
occur is more dominated in the sea area, especially along the west coast of the island
of Sumatra. This shows that the regions of Aceh, North Sumatra, West Sumatra,
Bengkulu, and Lampung are provinces that are prone to earthquakes, especially the
main earthquake. In addition, small island areas on the west coast of Sumatra Island,
such as the Simoloe, Sikolo, Nias, Mentawai, and Enggano Islands, are areas that
are prone to earthquakes, both main and aftershocks. In general, aftershocks have
a smaller magnitude than the main earthquake. The frequency of aftershocks will
decrease over time. In this study, we set a minimum bandwidth threshold value of
0.05 degrees in the parameter estimation process. This means that an earthquake may occur right at an identified risk point or in its surroundings, but still within a radius of 0.05 degrees.
This fact is in accordance with BMKG records regarding earthquake-prone areas
on Sumatra Island [23]. In the Aceh region, most of the epicenters occurred on the
mainland. This is due to the geographical conditions of Aceh which are flanked by
two segments, namely the Aceh segment and the Seulimeum segment. In addition,
Simoloe Island in the province of Aceh is also an area that is very prone to earthquakes
and tsunamis due to its proximity to the subduction zone. In the same way, this also
occurs in the Nias Islands, whose geographical conditions are close to the subduction
zone. Another earthquake-prone area is the province of West Sumatra. The high risk
of earthquakes in West Sumatra is caused by the presence of 7 segments of the Suma-
tran fault located in West Sumatra, namely the Siulak segment, the Tutup segment, the
Sumani segment, the Sianok segment, the Sumpur segment, the Barumun segment,
and the Angkola segment. The high risk of earthquakes on the island of Sumatra
is inseparable from its geographical conditions which are traversed by subduction
zones, faults, and volcanic routes [24, 25].
5 Conclusion
By combining the DFP and SD algorithms, the Space–Time ETAS model can be used
to model a complete earthquake catalog. The results of Space–Time ETAS modeling
on earthquake data on Sumatra Island provide good earthquake hazard mapping
results because the Space–Time ETAS model is built based on the characteristics of
earthquake events by considering the main earthquake and aftershocks. In addition,
this model also accommodates the use of complementary events as supporting data
to improve model accuracy. In this study, we recommend considering the length of
the duration of the complementary event based on the length of the duration of the
Fig. 6 Hazard map for main earthquakes and aftershocks on Sumatra Island. The map shows that
the characteristics of the earthquakes that occurred on the island of Sumatra were often followed
by aftershocks. Many earthquakes occur around the small islands on the west coast of Sumatra
Island. Number 1 represents the Simoloe Island area, number 2 represents Sikolo Island, number 3
represents Nias Island, number 4 represents the Mentawai Islands, and number 5 represents Enggano
Island
main event. In addition, the assumption of stationarity and data completeness is also
important in terms of the efficiency of computation time. With this approach, the
most optimal Space–Time ETAS model can be obtained as measured by the AIC
value. Thus, a hazard mapping of earthquakes is obtained, especially shallow earth-
quakes with a minimum magnitude of 4 on the island of Sumatra. In general, shallow
earthquakes with a minimum magnitude of 4 have caused damage. Therefore, appro-
priate mitigation steps can be prepared in areas that are prone to earthquakes. Further
research can be carried out by constructing the effects of geological variables into
the Space–Time ETAS model so that the effects of these variables can be measured.
Acknowledgements The first author thanks Lembaga Pengelola Dana Pendidikan (LPDP) for the
financial support to conduct postgraduate study at the Institut Teknologi Sepuluh Nopember. The
study is partially supported by the grant from Institut Teknologi Sepuluh Nopember No 1735/PKS/
ITS/2023. We also thank three reviewers for their constructive comments.
References
1. Ritchie H, Rosado P, Roser M (2023) Natural disasters, our world in data, https://ourworldi
ndata.org/natural-disasters. Accessed 22 July 2023
2. Parwanto NB, Oyama T (2014) A statistical analysis and comparison of historical earthquake
and tsunami disasters in Japan and Indonesia. Int J Disaster Risk Reduction 7:122–141
3. Jena R, Pardhan B, Beydoun G (2020) Earthquake vulnerability assessment in northern Sumatra
province by using a multi-criteria decision-making model. Int J Disaster Risk Reduction
46:101518
4. USGS, https://earthquake.usgs.gov/earthquakes. Accessed 01 June 2023
5. Cremen G, Galasso C (2020) Earthquake early warning: recent advances and perspectives.
Earth Sci Rev 205:103184
6. Sediono, Mardianto MFF, Ulyah SM, Pangestu AA, Susanti R, Firdaus HA, Andreas C (2022)
The modelling of earthquake magnitude in the southern part of java island using geographically
weighted regression. Commun Math Biol Neurosci 2022:13
7. Tim Pusat Studi Gempa Nasional. Peta Sumber dan Bahaya Gempa Indonesia Tahun 2017.
Kementeria Pekerjaan Umum dan Perumahan Rakyat, Jakarta (2017)
8. Choiruddin A, Aisah Trisnisa F, Iriawan N (2021) Quantifying the effect of geological factors on
distribution of earthquake occurrences by inhomogeneous cox processes. Pure Appl Geophys
178:1579–1592
9. Choiruddin A, Susanto TY, Metrikasari R (2021) Two-step estimation for modeling the earth-
quake occurrences in Sumatra by Neyman-Scott Cox point processes. Soft Comput Data Sci
1489:146–159
10. Ogata Y, Zhuang J (2006) Space-time ETAS models and an improved extension. Tectonophysics
413:13–23
11. Zhuang J, Ogata Y, Vere-Jones D (2002) Stochastic declustering of space-time earthquake
occurrences. J Am Statistical Assoc 97(458):369–380
12. Zhuang J (2011) Next-day earthquake forecasts for the japan region generated by the ETAS
model. Earth Planets Space 63:207–216
13. Jalilian A (2019) ETAS: an R package for fitting the space-time ETAS model to earthquake
data. J Stat Softw 88(1):1–39
14. Ogata Y (1998) Space-time point-process models for earthquake occurrences. Ann Inst Stat
Math 50(2):379–402
15. Davoudi N, Tavakoli HR, Zare M, Jalilian A (2020) Aftershock probabilistic seismic hazard
analysis for Bushehr Province in Iran using ETAS model. Natural Hazard 100:1159–1170
16. Nicolis O, Chiodi M, Adelfio G (2015) Windowed ETAS models with application to the Chilean
seismic catalogs. Spatial Statistics 14:151–165
17. BPBD, https://bpbd.ntbprov.go.id/pages/gempa-bumi. Accessed 05 Aug 2023
18. Natawidjaja DH, Triyoso W (2007) The Sumateran fault zone—from source to hazard. J
Earthquake Tsunami 1(1):21–47
19. Wijaya AR (2023) Model ETAS Spatio-Temporal pada Analisis Pemetaan Intensitas Kegem-
paan di Wilayah Sumatera. Jambura J Math 5(1):179–188
20. Choiruddin A, Hannanu FF, Mateu J, Fitriyanah V (2023) COVID-19 transmission risk in
Surabaya and Sidoarjo: an inhomogeneous marked poisson point process approach. Stoch Env
Res Risk Assess 37:2271–2282
21. Choiruddin A, Susanto TY, Husain A, Kartikasari YM (2023) Kppmenet: Combining the
kppm and Elastic Net Regularization for Inhomogeneous Cox Point Process with Correlated Covariates. J Appl Stat 1–14
22. Zhuang J (2012) Long-term earthquake forecasts based on the epidemic-type aftershock
sequence (ETAS) model for short-term clustering. Res Geophys 2(1):e8
23. BMKG. Katalog Gempa Bumi Indonesia: Relokasi Hiposenter dan Implikasi Tektonik. Bidang
Informasi Gempabumi dan Peringatan Dini Tsunami, Badan Meteorologi Klimatologi dan
Geofisika, Jakarta (2021)
24. Susanto TY, Choiruddin A, Purnomo JDT (2023) On the earthquake distribution modeling in
Sumatra by cauchy cluster process: comparing log-linear and log-additive intensity models.
Sains Malaysiana 52(2):655–557
25. Kartikasari YM, Choiruddin A (2022) Analisis Risiko Gempabumi di Sumatera dengan Cauchy
Cluster Process. Inferensi 5(2):123–134
Small Area Estimation of Mean Years
of Schooling Under Time Series
and Cross-sectional Models
Abstract Small area estimation develops within the framework of time series and
cross-sectional models. The restricted estimation maximum likelihood method was
used to obtain the empirical best linear unbiased prediction for small areas and its
mean squared error estimators. The model focuses on applying statistical models that
permit borrowing strength over area and time. The process uses regularly conducted
survey data, where the areas of interest are observed repeatedly under a predeter-
mined scheme. The time series and cross-sectional models were able to capture
heterogeneity across area and time, so it can be used to enhance sample size effec-
tiveness, thereby minimizing mean squared error and producing a more accurate
estimation. The simulation results show that the degree of correlation parameters
significantly affects the efficiency of the model. The application to estimate mean
years of schooling at the sub-district level in Langkat Regency, North Sumatra,
Indonesia, for the period of 2018–2021 showed that the time correlation coefficient
was 0.3758, the variance of the area random effect was 1.1125, and the variance of
the area-time random effect was 0.3241. The estimations derived from time series
and cross-sectional models had a lower mean squared error than those obtained from
the Fay-Herriot models and direct estimation.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 353
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_26
1 Introduction
Small area estimation is primarily concerned with methods for acquiring precise
regional statistics when the available sample data is insufficient to make statistically
significant inferences with sufficient precision. In the past few years, the worldwide
demand for accurate estimations of small areas has significantly increased [1]. In
addition, it is difficult to obtain a sufficient sample size to provide an accurate direct
estimation for all areas.
Increasing the sample size can enhance the precision of the estimation in a small
area. Nevertheless, sample size is positively correlated with financial resources,
human resources, and volume of time. Therefore, it is important to use indirect esti-
mations that leverage the notion of “borrowing strength” by using some values related
to the variables of interest from the corresponding areas. This approach effectively
increases the effectiveness of the sample size [1–5].
Small area estimation is a subfield method of statistics that integrates samples from
surveys with constrained populations based on a model of statistics. The monograph
by Rao [3] presents comprehensive explanations of this theory, as do the critical
analyses conducted by Ghosh and Rao [1], Pfeffermann [6], and Jiang and Lahiri
[5]. The small area estimation method consists of two components: the fixed effects, obtained through the use of auxiliary variables, and the random effects, obtained by considering area-specific variation. Two distinct estimation
methods can be used to improve precision, especially borrowing strength across areas
and/or borrowing strength over multiple time periods.
Fay and Herriot [7] proposed the model of borrowing strength across areas to
derive log per-capita income. Several surveys are conducted repeatedly over a period
of time. Within the framework of repeated surveys, the Fay-Herriot models were
not able to capture heterogeneity both over area and time. Therefore, Rao-Yu [8,
9] proposed time series and cross-sectional models that combine information from
different time periods in various cross-sections. Hence, it can be used to capture
heterogeneity both over area and time; the model is known as the Rao-Yu models.
The Rao-Yu models are a modification of the Fay-Herriot models by adding an area-
time random effects component that follows a first-order autoregressive process. Rao
and Yu [8, 9], Esteban et al. [10], and Fay and Diallo [11] described that significant
gains in the efficiency of estimation can be achieved by using a model with area-time
random effects.
The variance components of random effects on the empirical best linear unbi-
ased prediction (EBLUP) estimator and its mean squared error (MSE) can be esti-
mated using the method of moments, maximum likelihood, or restricted estima-
tion maximum likelihood (REML). In this study, we presented the REML method.
According to Rao [3], the REML method considers the loss in degrees of freedom
by using transformed data. Molina et al. [12] showed that the REML estimator is
more robust and produces an unbiased estimator even at small sample sizes.
Therefore, this study focuses on applying the REML method to estimate the
unknown variance components of random effects. The final goal of this study was
to apply developed area-level models to derive the EBLUP and MSE of model 1
(Fay-Herriot model) and model 2 (Rao-Yu model), with a focus on estimating the
mean years of schooling at the sub-district level in Langkat Regency, North Sumatra,
Indonesia, for the period of 2018–2021.
2 Literature Review
The Fay-Herriot models consist of two stages. The sampling model is used to
depict the sampling error of direct estimators in the first stage. Let θd represent
the characteristic of interest in the d-th area and yd represent a direct estimator of θd .
yd = θd + ed , d = 1, ..., D, (1)
The symbol D denotes the areas, and the sampling errors ed are independent, with
ed ∼ iid N (0, σe2 ). In the second stage, assuming that θd is related to area-specific
auxiliary variables xd = (x1d , . . . , xqd ) through a linear model
Henderson [13] describes the best linear unbiased prediction (BLUP) approach
as a common method for obtaining the parameter estimation of random effects in
Eq. (3). The BLUP estimator for the d-th area d = (1, ..., D) can be written as
$$\tilde{\theta}_d^{FH}(\sigma_v^2) = \mathbf{x}_d'\tilde{\beta} + \frac{\sigma_v^2 z_d^2}{\sigma_v^2 z_d^2 + \sigma_e^2}\left(y_d - \mathbf{x}_d'\tilde{\beta}\right), \qquad (4)$$
The EBLUP is obtained by replacing $\sigma_v^2$ with its estimator $\hat{\sigma}_v^2$:
$$\hat{\theta}_d^{FH}(\hat{\sigma}_v^2) = \mathbf{x}_d'\hat{\beta} + \frac{\hat{\sigma}_v^2 z_d^2}{\hat{\sigma}_v^2 z_d^2 + \sigma_e^2}\left(y_d - \mathbf{x}_d'\hat{\beta}\right), \qquad (5)$$
where
$$\hat{\pi}_d = \frac{\hat{\sigma}_v^2 z_d^2}{\hat{\sigma}_v^2 z_d^2 + \sigma_e^2}$$
with
$$\hat{\beta} = \left(\sum_{d=1}^{D}\frac{\mathbf{x}_d\mathbf{x}_d'}{\hat{\sigma}_v^2 z_d^2 + \sigma_e^2}\right)^{-1}\sum_{d=1}^{D}\frac{\mathbf{x}_d y_d}{\hat{\sigma}_v^2 z_d^2 + \sigma_e^2}.$$
The mean squared error of the EBLUP estimator of Fay-Herriot models is defined
as follows:
$$\mathrm{MSE}\big(\hat{\theta}_d^{H}\big) \approx g_{1d}(\sigma_v^2) + g_{2d}(\sigma_v^2) + g_{3d}(\sigma_v^2) \qquad (6)$$
where
$$g_{1d}(\sigma_v^2) = \frac{\sigma_v^2 z_d^2 \sigma_e^2}{\sigma_v^2 z_d^2 + \sigma_e^2} = \pi_d\sigma_e^2,$$
$$g_{2d}(\sigma_v^2) = (1-\pi_d)^2\,\mathbf{x}_d'\left(\sum_{d=1}^{D}\frac{\mathbf{x}_d\mathbf{x}_d'}{\sigma_v^2 z_d^2 + \sigma_e^2}\right)^{-1}\mathbf{x}_d,$$
$$g_{3d}(\sigma_v^2) = \sigma_e^4 z_d^4\left(\sigma_v^2 z_d^2 + \sigma_e^2\right)^{-3}\overline{V}\big(\hat{\sigma}_v^2\big),$$
where $\overline{V}(\hat{\sigma}_v^2) = 2\Big[\sum_{d=1}^{D} z_d^4\big(\hat{\sigma}_v^2 z_d^2 + \sigma_e^2\big)^{-2}\Big]^{-1}$.
The estimator of MSE θ̂dH is determined similarly to Prasad and Rao [14] by
replacing the estimators of σv2 . It reduces to
$$\mathrm{mse}\big(\hat{\theta}_d^{H}\big) \approx g_{1d}(\hat{\sigma}_v^2) + g_{2d}(\hat{\sigma}_v^2) + 2g_{3d}(\hat{\sigma}_v^2). \qquad (7)$$
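To make Eqs. (4)–(7) concrete, the following numpy sketch evaluates the EBLUP and its estimated MSE for simulated data; the data, the value of $\hat{\sigma}_v^2$, and the variable names are illustrative assumptions, since in practice $\hat{\sigma}_v^2$ comes from a method such as REML.

```python
# Illustrative numpy sketch of the Fay-Herriot EBLUP and its MSE estimator
# (Eqs. (4)-(7)), with made-up data; sigma_v2 would come from REML in practice.
import numpy as np

rng = np.random.default_rng(1)
D, q = 15, 3
X = np.column_stack([np.ones(D), rng.normal(size=(D, q - 1))])  # x_d with intercept
z = np.ones(D)                                                   # z_d
se2 = rng.uniform(0.2, 0.6, size=D)                              # known sampling variances
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.8, size=D)
sigma_v2 = 0.6                                                   # assumed REML estimate

V = sigma_v2 * z**2 + se2
A = (X.T / V) @ X                                                # sum of x_d x_d' / V_d
beta = np.linalg.solve(A, (X.T / V) @ y)
pi = sigma_v2 * z**2 / V

eblup = X @ beta + pi * (y - X @ beta)                           # Eq. (5)

g1 = pi * se2
g2 = (1 - pi) ** 2 * np.einsum("dj,jk,dk->d", X, np.linalg.inv(A), X)
var_sv2 = 2.0 / np.sum(z**4 / V**2)                              # asymptotic variance of sigma_v2
g3 = se2**2 * z**4 / V**3 * var_sv2
mse = g1 + g2 + 2 * g3                                           # Eq. (7)
print(np.round(eblup[:5], 3), np.round(mse[:5], 4))
```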
Several sample surveys are conducted periodically, with partial replacement of the
sample elements. To improve the efficiency of repeated surveys, it is possible to
enhance the accuracy by leveraging information from both small areas and previous
time periods. Rao and Yu [8, 9] introduced a modification to the Fay-Herriot model
that allows for the analysis of both time series and cross-sectional data. The Rao-Yu
where the variable $y_{dt}$ represents a direct estimator of the area-specific variable of interest for area d (d = 1, ..., D) at time instant t (t = 1, ..., T), an unbiased estimator of $\theta_{dt}$; $\mathbf{x}_{dt}$ is the vector of q auxiliary variables; $v_d$ is a random effect for small area d, with $v_d \sim$ iid $N(0, \sigma_v^2)$; the error $e_{dt}$ is independent, $e_{dt} \sim$ iid $N(0, \sigma_e^2)$ with known $\sigma_e^2$'s; and the random effects $u_{dt}$ follow an AR(1) process with common variance $\sigma_\varepsilon^2$ and correlation parameter $\rho$. In matrix notation, the model is
y = Xβ + Zw + e, (12)
where
$\mathbf{y} = (y_{11}, \ldots, y_{1T}; \ldots; y_{D1}, \ldots, y_{DT})' = (\mathbf{y}_1', \ldots, \mathbf{y}_D')'$, $\mathbf{X} = (\mathbf{X}_1', \ldots, \mathbf{X}_D')'$, $\mathbf{X}_d = (\mathbf{x}_{d1}, \ldots, \mathbf{x}_{dT})'$, $\mathbf{Z} = \mathbf{I}_D \otimes \mathbf{Z}_d$, $\mathbf{Z}_d = (\mathbf{1}_T, \mathbf{I}_T)$, $\mathbf{w} = (\mathbf{w}_1', \ldots, \mathbf{w}_D')'$, $\mathbf{w}_d = (v_d, \mathbf{u}_d')'$, $\mathbf{u}_d = (u_{d1}, \ldots, u_{dT})'$, $\mathbf{e} = (\mathbf{e}_1', \ldots, \mathbf{e}_D')'$, $\mathbf{e}_d = (e_{d1}, \ldots, e_{dT})'$, $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_q)'$.
matrix containing the components $\rho^{|t-t'|}/(1-\rho^2)$. The $v_d$, $\mathbf{u}_d$, and $\mathbf{e}_d$ are independent, so the covariance matrix of $\mathbf{y}$ is $\mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}' + \mathbf{R}$, with $\mathbf{G} = \mathbf{I}_D \otimes \mathbf{G}_d$ and $\mathbf{R} = \mathrm{blockdiag}_d(\mathbf{R}_d)$.
Based on the model (11), the BLUP estimator of $\theta_{dT}$ for the current occasion T is
$$\tilde{\theta}_{dT}^{RY} = \mathbf{x}_{dT}'\tilde{\beta} + \left(\sigma_v^2\mathbf{1}_T' + \sigma_\varepsilon^2\boldsymbol{\gamma}_T'\right)\left(\sigma_v^2\mathbf{J}_T + \sigma_\varepsilon^2\boldsymbol{\Gamma} + \mathbf{R}_d\right)^{-1}\left(\mathbf{y}_d - \mathbf{X}_d\tilde{\beta}\right) \qquad (13)$$
where $\mathbf{J}_T = \mathbf{1}_T\mathbf{1}_T'$ is a T × T matrix with all elements equal to 1 and $\boldsymbol{\gamma}_T$ is the T-th row or column of $\boldsymbol{\Gamma}$.
In general, parameters σv2 , σε2 , and ρ are unknown and estimated by σ̂v2 , σ̂ε2 , and
ρ̂, respectively. The EBLUP is derived by substituting the unknown parameters with
the estimators.
$$\hat{\theta}_{dT}^{RY} = \mathbf{x}_{dT}'\hat{\beta} + \left(\hat{\sigma}_v^2\mathbf{1}_T' + \hat{\sigma}_\varepsilon^2\hat{\boldsymbol{\gamma}}_T'\right)\left(\hat{\sigma}_v^2\mathbf{J}_T + \hat{\sigma}_\varepsilon^2\hat{\boldsymbol{\Gamma}} + \mathbf{R}_d\right)^{-1}\left(\mathbf{y}_d - \mathbf{X}_d\hat{\beta}\right) \qquad (14)$$
where $\hat{\beta} = \left(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}$.
In this study, the parameter estimation was obtained based on the restricted esti-
mation maximum likelihood method. The log-likelihood function associated with
the model (12) is equal to
$$l_{\mathrm{reml}} = -\frac{DT-q}{2}\log(2\pi) + \frac{1}{2}\log\left|\mathbf{X}'\mathbf{X}\right| - \frac{1}{2}\log\left|\mathbf{V}\right| - \frac{1}{2}\log\left|\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right| - \frac{1}{2}\mathbf{y}'\mathbf{P}\mathbf{y} \qquad (15)$$
where
$$\mathbf{P} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}\left(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{V}^{-1}, \quad \mathbf{P}\mathbf{V}\mathbf{P} = \mathbf{P}, \quad \mathbf{P}\mathbf{X} = \mathbf{0}.$$
Let $\delta = (\sigma_v^2, \sigma_\varepsilon^2, \rho)'$; then $\mathbf{P}_a = \dfrac{\partial\mathbf{P}}{\partial\delta_a} = -\mathbf{P}\dfrac{\partial\mathbf{V}(\delta)}{\partial\delta_a}\mathbf{P} = -\mathbf{P}\mathbf{V}_a\mathbf{P}$, for a = 1, 2, 3.
The estimator of the variance components is derived by maximizing the likelihood function (15). The score vector is obtained from the partial derivatives of $l_{\mathrm{reml}}$ in (15) with respect to the a-th element of $\delta$,
$$S_a = \frac{\partial l_{\mathrm{reml}}}{\partial\delta_a} = -\frac{1}{2}\mathrm{tr}\left(\mathbf{P}\mathbf{V}_a\right) + \frac{1}{2}\mathbf{y}'\mathbf{P}\mathbf{V}_a\mathbf{P}\mathbf{y}, \quad a = 1, 2, 3 \qquad (16)$$
where $\mathbf{V}_a = \partial\mathbf{V}/\partial\delta_a$ is the partial derivative of $\mathbf{V}$ with respect to the a-th element of $\delta$.
By taking second-order partial derivatives of (15) with respect to δ with a, b-th
element, modifying the sign, and taking expectations, the Fisher information matrix
components are obtained
1
Fab = tr (Pa Pb ), a, b = 1, 2, 3 (17)
2
The variance components are then updated by the Fisher scoring iteration
$$\hat{\delta}^{(k+1)} = \hat{\delta}^{(k)} + \mathbf{F}^{-1}\big(\hat{\delta}^{(k)}\big)\,\mathbf{S}\big(\hat{\delta}^{(k)}\big), \qquad (18)$$
where S represents the column vector of scores and F represents the Fisher informa-
tion matrix. The estimation of the random effect variance component will be obtained
once the iteration of Eq. (18) has converged.
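The Fisher scoring step in Eq. (18) can be written as a short iterative routine; the sketch below is generic and assumes user-supplied functions for the score vector and Fisher information matrix (hypothetical names, not part of the original text).

```python
# Illustrative sketch of the Fisher scoring iteration in Eq. (18).
# score(delta) and fisher_info(delta) are hypothetical user-supplied functions
# returning S(delta) (length-3 vector) and F(delta) (3x3 matrix) for
# delta = (sigma_v^2, sigma_eps^2, rho).
import numpy as np

def fisher_scoring(delta0, score, fisher_info, tol=1e-6, max_iter=100):
    delta = np.asarray(delta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(fisher_info(delta), score(delta))  # F^{-1} S
        delta_new = delta + step
        if np.max(np.abs(delta_new - delta)) < tol:               # convergence check
            return delta_new
        delta = delta_new
    return delta
```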
Based on Prasad and Rao [14] and Das et al. [15], the mean square error of the
Rao-Yu models is defined as follows:
$$\mathrm{MSE}\big(\hat{\theta}_{dT}^{RY}\big) \approx g_1(\delta) + g_2(\delta) + g_3(\delta) \qquad (19)$$
where
$$g_1(\delta) = \mathbf{a}'\mathbf{Z}\mathbf{T}\mathbf{Z}'\mathbf{a},$$
$$g_2(\delta) = \left(\mathbf{a}'\mathbf{X} - \mathbf{a}'\mathbf{Z}\mathbf{T}\mathbf{Z}'\mathbf{R}^{-1}\mathbf{X}\right)\mathbf{Q}\left(\mathbf{X}'\mathbf{a} - \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z}\mathbf{T}\mathbf{Z}'\mathbf{a}\right),$$
$$g_3(\delta) \approx \mathrm{tr}\left\{\left(\nabla\mathbf{b}'\right)\mathbf{V}\left(\nabla\mathbf{b}'\right)'\,E\left[\big(\hat{\delta}-\delta\big)\big(\hat{\delta}-\delta\big)'\right]\right\}$$
with $\mathbf{Q} = \left(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right)^{-1}$, $\mathbf{T} = \mathbf{G} - \mathbf{G}\mathbf{Z}'\mathbf{V}^{-1}\mathbf{Z}\mathbf{G}$, and $\mathbf{b}' = \mathbf{a}'\mathbf{Z}\mathbf{G}\mathbf{Z}'\mathbf{V}^{-1}$.
According to Das et al. [15] and Rao and Molina [4], the mean square error of
EBLUP (19) can be approximated as
$$\mathrm{mse}\big(\hat{\theta}_{dT}^{RY}\big) \approx g_1\big(\hat{\delta}\big) + g_2\big(\hat{\delta}\big) + 2g_3\big(\hat{\delta}\big). \qquad (20)$$
3 Simulation Studies
This section presents the simulation studies used to analyze the performance of the
mean square error estimators. The purpose of the simulations was to determine what
could be obtained by using Rao-Yu models with the time correlation parameter ρ.
The auxiliary and interest variables for area (d = 1, ..., D) and time (t = 1, ..., T )
were
$$\mathrm{MSE} = \frac{1}{DT}\sum_{d=1}^{D}\sum_{t=1}^{T}\mathrm{MSE}_{dt}, \qquad \mathrm{CV} = \frac{1}{DT}\sum_{d=1}^{D}\sum_{t=1}^{T}\mathrm{CV}_{dt}$$
To evaluate the impact of the number of areas, time, and time correlation, we set
D = 10, 20, 30 areas and both short (T = 3), moderate (T = 5), and long (T = 8)
at time. We used ρ = 0.2 for low correlation, ρ = 0.5 for moderate correlation, and
ρ = 0.9 for high correlation. The results of the simulations to evaluate the impact
of the number of area, time, and time correlation on the mean squared error and
coefficient of variation are presented in Table 1. The best results in mean square error
are achieved when the correlation parameter is high (ρ = 0.9). The mean squared
error was decreased consistently as the correlation parameter (ρ) increased.
The mean squared error of EBLUP based on the Rao-Yu models for each of
D = 100 areas, with T = 8 time instants, and time correlation ρ = 0.2, 0.5, 0.9
are presented in Fig. 1.
The simulation results showed that an increased degree of correlation would lead
to a decrease in mean squared error and coefficient of variation. When the correlation
parameter of the Rao-Yu models was high (ρ = 0.9), it correspondingly produced
smaller mean squared error and coefficient of variation. Furthermore, the mean
squared error and coefficient of variation increased consistently as the number of
areas (D) increased, but it decreased as the time (T ) increased.
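The simulation design above can be reproduced schematically by drawing area effects, AR(1) area-time effects, and sampling errors; the following sketch uses illustrative variances and a single auxiliary variable, since the exact generating values are not restated here.

```python
# Illustrative sketch: generating data from the Rao-Yu model
# y_dt = x_dt' beta + v_d + u_dt + e_dt with AR(1) area-time effects u_dt.
import numpy as np

def simulate_rao_yu(D, T, rho, sigma_v2=1.0, sigma_eps2=1.0, sigma_e2=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(D, T))                        # single auxiliary variable (illustrative)
    beta = 1.0
    v = rng.normal(scale=np.sqrt(sigma_v2), size=D)    # area random effects

    u = np.zeros((D, T))                               # AR(1) area-time effects
    u[:, 0] = rng.normal(scale=np.sqrt(sigma_eps2 / (1 - rho**2)), size=D)
    for t in range(1, T):
        u[:, t] = rho * u[:, t - 1] + rng.normal(scale=np.sqrt(sigma_eps2), size=D)

    e = rng.normal(scale=np.sqrt(sigma_e2), size=(D, T))   # sampling errors
    y = beta * x + v[:, None] + u + e
    return y, x

y, x = simulate_rao_yu(D=30, T=8, rho=0.9)
print(y.shape)
```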
[Fig. 1: Mean squared error of the EBLUP under the Rao-Yu models plotted against the number of areas (10–100) for ρ = 0.2, 0.5, and 0.9]
4 Application
The Fourth Sustainable Development Goal (SDG) ensures equitable and inclusive,
high-quality education and opportunities for lifelong learning for all. Mean years of
schooling are included in the education indicators of the SDG. Furthermore, the mean
years of schooling also serve as an instrument to measure the Human Development
Index, providing data regarding the average level of education in a certain area. The
indicator defines the years of schooling as the duration of formal education attended
by individuals aged 25 years and older. The mean years of schooling are derived from
the National Socioeconomic Surveys (Susenas) data that is conducted periodically by
Statistics Indonesia. The mean years of schooling are examined for the purpose
of monitoring and evaluating the basic education sector. The Susenas is a twice-
yearly survey that is designed to provide direct estimates of variables at the national,
provincial, and district/city levels. The estimation of the mean years of schooling at
a smaller area level is necessary for the policies formulation.
This study aims to provide an estimation of the mean years of schooling at 23
sub-districts in Langkat Regency, North Sumatra, Indonesia, for the period of 2018–
2021. The distribution of the mean years of schooling is shown in Fig. 2. It shows
that the distribution of mean years of schooling fits a normal distribution.
For the correspond area-specific auxiliary variables, we use five variables, which
are population (X 1 ), harvested area of paddy (X 2 ), planting area of plantation crops
(X 3 ), number of industries (X 4 ), and educational facilities per 10,000 residents (X 5 ).
We apply logarithm transformation on variables X 1 , X 2 , X 3 , and X 4 to obtain a
suitable model. The regression parameter estimates and their corresponding standard
errors from model 1 and model 2 are presented in Table 2.
Table 2 shows that the models consist of fixed effects and variance component
estimates of the random effect. For the fixed effect, of the five auxiliary variables,
there were three variables significant for estimating mean years of schooling, and
they are population (X 1 ), harvested area of paddy (X 2 ), and number of industries
(X 4 ). Based on the standard errors of parameter estimates, model 2 obtained a lower
standard error than model 1. It can be concluded that model 2 is more efficient
compared to model 1.
Meanwhile, for the variance component that used for capturing heterogeneity,
model 1 produced an area random effect, while model 2 produced an area random
effect, area-time random effect, and autocorrelation coefficients. The variance of the
area random effect in model 1 is higher than in model 2. This means that model 1
is better at capturing heterogeneity over the area. Although there was a variation in
how they captured heterogeneity, both models produced EBLUP and mean square
error that were quite similar. On the other hand, model 2 shows that the variance of the area-time random effect (1.1125) was greater than the variance of the area random effect (0.3241), a component of variation that model 1 was unable to capture. The first-order
autocorrelation coefficients for model 2 were 0.3758. Although the correlation across
time of model 2 is low, this cannot be disregarded in the context of estimation.
According to the model parameters, as shown in Table 2, the small area model of
mean years of schooling is defined as follows:
Model 1
$$\hat{\theta}_d^{FH} = -0.628 + 1.107\log X_{1d} - 0.119\log X_{2d} - 0.431\log X_{4d} + \hat{v}_d \qquad (22)$$
Model 2
$$\hat{\theta}_{dt}^{RY} = -0.186 + 0.978\log X_{1dt} - 0.092\log X_{2dt} - 0.292\log X_{4dt} + \hat{v}_d + \hat{u}_{dt} \qquad (23)$$
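Expressed as code, the fitted linking models in Eqs. (22) and (23) can be evaluated for a given sub-district as in the sketch below; the input values and predicted random effects are placeholders, and the use of the natural logarithm is an assumption, since the base of the log transformation is not stated.

```python
# Illustrative sketch: evaluating the fitted small area models (22) and (23).
# The auxiliary values and random-effect predictions below are hypothetical;
# the natural logarithm is an assumption for the "log" in Eqs. (22)-(23).
import numpy as np

def model_1(x1, x2, x4, v_hat):
    # Eq. (22): Fay-Herriot model with log-transformed auxiliaries
    return -0.628 + 1.107 * np.log(x1) - 0.119 * np.log(x2) - 0.431 * np.log(x4) + v_hat

def model_2(x1, x2, x4, v_hat, u_hat):
    # Eq. (23): Rao-Yu model with area and area-time random effects
    return -0.186 + 0.978 * np.log(x1) - 0.092 * np.log(x2) - 0.292 * np.log(x4) + v_hat + u_hat

print(model_1(x1=1.0e5, x2=5.0e3, x4=40, v_hat=0.1))
print(model_2(x1=1.0e5, x2=5.0e3, x4=40, v_hat=0.1, u_hat=-0.05))
```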
Table 3 Comparison of the mean years of schooling estimator in Langkat Regency by sub-district,
2018–2021
Sub-district 2018 2019 2020 2021
Dir M1 M2 Dir M1 M2 Dir M1 M2 Dir M1 M2
1 Bahorok 7.91 7.97 7.92 7.85 7.94 7.85 6.49 7.05 6.68 7.26 7.50 7.33
2 Sirapit 8.71 8.40 8.65 7.30 7.28 7.30 6.15 6.78 6.74 7.12 7.15 7.11
3 Salapian 8.51 8.45 8.51 9.30 8.92 9.15 7.03 7.21 7.06 6.94 7.33 7.05
4 Kutambaru 8.16 8.11 8.16 9.11 8.83 9.06 6.71 7.13 6.99 6.13 6.48 6.18
5 Sei Bingai 7.96 7.99 7.97 9.40 9.05 9.31 6.08 6.53 6.23 7.58 7.64 7.58
6 Kuala 9.39 9.24 9.38 7.76 7.80 7.77 7.52 7.71 7.53 6.52 6.99 6.65
7 Selesai 9.67 9.57 9.66 9.62 9.55 9.62 8.34 8.42 8.34 6.97 7.30 7.02
8 Binjai 9.70 9.36 9.63 8.60 8.46 8.57 7.69 7.81 7.70 7.49 7.68 7.51
9 Stabat 9.39 9.31 9.38 8.56 8.57 8.56 8.78 8.76 8.77 8.89 8.84 8.88
10 Wampu 9.15 8.95 9.12 7.25 7.40 7.28 7.82 7.88 7.80 6.78 7.14 6.87
11 Batang Serangan 9.26 8.75 8.93 7.90 7.91 7.90 6.55 7.05 6.76 9.20 8.61 8.75
12 Sawit Seberang 8.89 8.84 8.89 10.10 9.75 9.98 6.44 6.92 6.54 7.08 7.42 7.13
13 Padang Tualang 9.06 8.93 9.05 8.90 8.66 8.85 8.43 8.37 8.38 6.74 7.13 6.83
14 Hinai 8.46 8.35 8.45 8.82 8.49 8.73 7.87 7.85 7.87 7.33 7.38 7.33
15 Secanggang 9.27 9.15 9.26 7.55 7.56 7.55 7.31 7.42 7.32 6.89 7.03 6.91
16 Tanjung Pura 9.47 9.31 9.46 9.57 9.47 9.57 7.78 7.84 7.79 7.46 7.60 7.47
17 Gebang 7.58 7.72 7.66 8.90 8.81 8.89 6.47 6.94 6.78 8.53 8.32 8.45
18 Babalan 9.60 9.47 9.59 9.00 8.89 8.99 8.33 8.30 8.32 6.56 7.13 6.75
19 Sei Lepan 8.40 8.42 8.41 8.43 8.39 8.42 6.80 7.24 6.96 7.16 7.34 7.19
20 Brandan Barat 8.43 8.06 8.28 6.33 6.56 6.41 6.50 6.69 6.53 7.39 7.34 7.28
21 Besitang 8.75 8.66 8.74 6.53 6.88 6.66 6.38 6.83 6.51 6.86 7.07 6.90
22 Pangkalan Susu 9.79 9.45 9.72 8.23 8.16 8.23 8.32 8.20 8.30 7.68 7.66 7.67
23 Pematang Jaya 8.79 8.79 8.79 7.50 7.50 7.50 7.69 7.36 7.27 5.51 5.88 5.66
The observed range of direct estimates of mean years of schooling in Langkat was 5.51 to 10.10 years for the period 2018–2021. The highest value (10.10 years) corresponds to a population that, on average, attained formal education up to the upper secondary level, while the lowest value (5.51 years) corresponds to a population that, on average, had not completed elementary education. For model 1, the range of estimates was 5.88 to 9.75 years, and for model 2 it was 5.66 to 9.98 years. The results show that the ranges of estimates obtained from models 1 and 2 were narrower than those of the direct estimates.
The best model selection for estimating mean years of schooling can be determined
by evaluating the mean square error. Figure 3 presents a comparison between the
mean square error obtained from direct estimation and the EBLUP from model 1
and model 2. The best model is characterized by the lowest mean square error value.
The results indicate that the mean square error of the EBLUP under model 2 was smaller than that of the direct estimate and of the EBLUP under model 1: the average mean square errors were 0.01766 for model 2, 0.01976 for model 1, and 0.02509 for the direct estimate. This shows that the EBLUP under model 2 provides higher accuracy in parameter estimation due to its lower mean square error.
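As a small numerical restatement of this comparison, the reported average mean square errors can be compared directly; the values below are exactly those given in the text.

```python
# Average MSE values reported above for each estimator
avg_mse = {"Direct": 0.02509, "Model 1 (FH)": 0.01976, "Model 2 (RY)": 0.01766}

best = min(avg_mse, key=avg_mse.get)   # estimator with the lowest average MSE
print(f"Best estimator by average MSE: {best} ({avg_mse[best]})")
```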
Fig. 3 Comparison of the mean square error (MSE) of the direct estimator, model 1, and model 2 by sub-district and year, 2018–2021
6 Conclusions
The application of the small area estimation method can effectively optimize sample size efficiency, and this leads to a decrease in the standard error. To enhance the accuracy of the area-level model, we proposed time series and cross-sectional models that borrow strength over time. This can be achieved by integrating an area-time random
effects component into a linear time model at the area level. The selection of suit-
able linking models and the presence of sufficient auxiliary variables are important
elements in acquiring indirect estimators. The simulation results show that the degree
of correlation parameters produced from time series and cross-sectional models has
an important impact on the effectiveness of the models.
The restricted maximum likelihood (REML) method, which was applied to derive the empirical best linear unbiased predictions for small areas and their mean squared errors, is more reliable and provides an unbiased estimator. The estimation using the
time series and cross-sectional models can obtain a lower mean square error than
the direct estimation and Fay-Herriot models. This indicates that the time series and
cross-sectional models are more effective in estimating mean years of schooling at
a small area level.
In future work, we will develop multivariate time series and cross-sectional models for small area estimation. Variables obtained from repeated surveys in real applications may be correlated. Thus, considering the correlation
between variables of interest, it is possible to significantly enhance the accuracy of
parameter estimates in small area estimation by using multivariate time series and
cross-sectional models.
Acknowledgements The main author of this research article would like to thank BPS and ITS
for providing her with the opportunity to study in the Statistics Department of ITS. Additionally,
the authors extend their gratitude to all the entities that have made significant contributions to the
successful completion of this research.
References
1. Ghosh M, Rao JNK (1994) Small area estimation: an appraisal. Stat Sci 9:55–93
2. Rao JNK (1999) Some recent advances in model-based small area estimation. Surv Methodol
25:175–186
3. Rao JNK (2003) Small area estimation. Wiley
4. Rao JNK, Molina I (2015) Small area estimation, 2nd edn. Wiley, New York
5. Jiang J, Lahiri P (2006) Mixed model prediction and small area estimation. TEST 15:1–96
6. Pfeffermann D (2002) Small area estimation—new developments and directions. Int Stat Rev
70(1):125–143
7. Fay RE, Herriot RA (1979) Estimates of income for small places: an application of James-Stein
procedures to census data. J Am Stat Assoc 74:269–277
8. Rao JNK, Yu M (1992) Small area estimation combining time series and cross sectional data.
In: Proceedings of the joint statistical meetings, pp 1–9
9. Rao JNK, Yu M (1994) Small area estimation by combining time series and cross sectional
data. Can J Stat 22:511–528
10. Esteban MD, Morales D, Pérez A, Santamaría L (2012) Small area estimation of poverty
proportions under area-level time models. Comput Stat Data Anal 56(10):2840–2855
11. Fay RE, Planty M, Diallo MS (2013) Small area estimation alternatives for the National crime
victimization survey. In: Proceedings of the joint statistical meetings, pp 1544–1557
12. Molina I, Rao JNK, Datta GS (2015) Small area estimation under a Fay-Herriot model with
preliminary testing for the presence of random area effects. Surv Methodol 41(1):1–19
13. Henderson CR (1953) Estimation of variance and covariance components. Biometrics 9:226–
252
14. Prasad NGN, Rao JNK (1990) The estimation of the mean squared error of small-area
estimators. J Am Stat Assoc 85:163–171
15. Das K, Jiang J, Rao JNK (2004) Mean squared error of empirical predictor. Ann Stat 32:818–840
Probabilistic Seismic Hazard Analysis
for Sulawesi-Maluku Region of Indonesia
Using the Space–Time Epidemic-Type
Aftershock Sequence Model
Abstract Sulawesi and Maluku are the only regions in Indonesia where three major
plates (Eurasian, Indo-Australian and Pacific) meet, resulting in the most seismi-
cally active region in Indonesia. It is important to perform probabilistic seismic
hazard analysis to understand the seismic activity in this region, study aftershock
sequences, and provide comprehensive seismic hazard maps. In this study, we
consider spatiotemporal analysis and fit a space–time epidemic-type aftershock sequence (space–time ETAS) model to earthquake catalog data in Sulawesi-Maluku, Indonesia. The method combines the Davidon-Fletcher-Powell (DFP) algo-
rithm and stochastic declustering algorithm. We conclude that the model has good
performance and can be used to construct seismic hazard maps based on the generated
model. The results of the seismic map provide information on the total rate of earth-
quake events occurring around the Maluku Sea, with the clustering of earthquake
events spread almost throughout Sulawesi and Maluku.
1 Introduction
The collision of one plate with another can cause natural disasters such as earth-
quakes. In Indonesia, earthquakes occur in almost all regions because most of
Indonesia is located in the Pacific Ring of Fire. There are seven areas in Indonesia
that are suspected to be subduction zones, including the Maluku Sea area bordering
Sulawesi [1, 2]. Earthquakes in the Sulawesi and Maluku regions are caused by tectonic plate movements: both regions lie where three major plates, namely the Eurasian, Indo-Australian, and Pacific plates, meet (see Fig. 1), so the two regions are highly seismically active. Therefore, it is important to study the seismic activity in the Sulawesi-Maluku region, especially using earthquake catalog data that include coordinate location, occurrence time, and magnitude.
In Indonesia, statistical models based on point processes have been considered, but the earthquake catalog components have so far been analyzed separately. Examples include point processes that consider only the coordinate location, such as the log-Gaussian Cox point process (LGCP) [3–5] and the Neyman–Scott Cox point process (NSCP) [5–8], as well as point processes that consider only the time of earthquake occurrence, such as ETAS [9–11]. However, no study has considered the complete earthquake catalog.
Regarding the complex model, Ogata [12] has incorporated a complete catalog
of earthquake events (times, locations, magnitudes) using the space–time ETAS
model. The space–time ETAS model is now widely used for seismicity analysis
to explain the probability of main and aftershocks and to detect changes in seis-
micity [13]. The space–time ETAS model covers the behavior of earthquake events
as a function of time, location (latitude and longitude), and magnitude, given histor-
ical earthquake data [14]. The parameters are estimated using maximum likelihood
estimation (MLE) by applying stochastic declustering iterations using the Davidon-
Fletcher-Powell (DFP) method [15]. The results are combined with the peak ground
acceleration (PGA) information to determine how much seismicity is present in the
region. The space–time ETAS model has been considered for earthquake modeling
in, for example, Japan [12, 14, 16], USA [17], China [13], Taiwan [18], and Iran [15,
19]. To our knowledge, such a model has not been applied for earthquake modeling
in the Sulawesi-Maluku region.
In this study, we implement the space–time ETAS model for probabilistic seismic
hazard analysis in Sulawesi-Maluku by considering catalog data including time
occurrences, coordinate locations, and magnitudes. The model is evaluated using
residual analysis and diagnostic plot. The resulting model is used as the basis for
developing the seismic hazard map in Sulawesi-Maluku.
2 Methodology
2.1 Dataset
This study uses earthquake catalog data consisting of time, longitude, latitude,
and magnitude. The data were obtained from the United States Geological Survey
(USGS) (https://earthquake.usgs.gov/earthquakes/search/) from January 1, 2008, to
July 31, 2023, resulting in 4842 earthquake events. The variables are listed in Table 1.
The map of earthquake locations in the Sulawesi region is located at latitude and
longitude in the range of − 6.5° S to 3° N and 118° to 126° E, while the map of
earthquake locations in the Maluku region is located at the latitude and longitude in
the range of − 3° S to 6° N and 124° to 132° E [1]. The map is shown in Fig. 1.
The space–time ETAS model is a model that can be used to estimate the occurrence
of aftershocks in a given area using time, location (latitude and longitude), and
magnitude variables that depend on historical earthquake data [14]. The conditional
intensity function of the space–time ETAS model can be defined as in Eq. (1),

$\lambda_{\beta,\theta}(t, x, y, m \mid H_t) = v_{\beta}(m)\, \lambda_{\theta}(t, x, y \mid H_t), \quad (1)$

where $v_{\beta}(m) = \beta \exp[-\beta(m - m_0)]$, $m \ge m_0$, is the magnitude probability density function (Eq. (2)), and

$\lambda_{\theta}(t, x, y \mid H_t) = \tilde{u}(x, y) + \sum_{i:\, t_i < t} \kappa_{A,\alpha}(m_i)\, g_{c,p}(t - t_i)\, f_{D,\gamma,q}(x - x_i, y - y_i; m_i), \quad (3)$
where Ht is the earthquake occurrence until the time before t and Eq. (3) can be
explained as follows:
$\tilde{u}(x, y) = \mu u(x, y)$ represents the background seismicity rate, which can also be referred to as the mainshock (base) rate [1]; it is assumed to be stationary in time, with spatial function $u(x, y)$ and $\mu > 0$.
κA,α (m i ) = A exp[α(m i − m 0 )] is the contribution to the seismicity rate of the ith
event that has occurred previously [16] or the expected number of events triggered by
aftershocks resulting from an event of (magnitude) m i [15]. κA,α (m i ) can be referred
to as the probability of occurrence of an aftershock of magnitude m i , where m i ≥ m 0
while A > 0 and α > 0 are unknown parameters.
gc, p (t − ti ) is the probability density function of an earthquake occurring at time
(t) that previously occurred at time (ti ), where c > 0 and p > 1 are unknown
parameters. Based on the modified Omori law, gc, p (t − ti ) can be described by
Eq. (4).
$g_{c,p}(t - t_i) = \begin{cases} \dfrac{p-1}{c}\left(1 + \dfrac{t - t_i}{c}\right)^{-p}, & t - t_i > 0 \\ 0, & t - t_i \le 0 \end{cases} \quad (4)$
$f_{D,\gamma,q}(x - x_i, y - y_i; m_i) = \dfrac{q-1}{\pi D \exp[\gamma(m_i - m_0)]}\left(1 + \dfrac{(x - x_i)^2 + (y - y_i)^2}{D \exp[\gamma(m_i - m_0)]}\right)^{-q}. \quad (5)$
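To make Eqs. (3)–(5) concrete, the following minimal Python/NumPy sketch evaluates the triggering components κ, g, and f and sums their product over past events to obtain the conditional intensity at a given space–time point. The parameter values, the constant background rate, and the tiny synthetic catalog are illustrative assumptions only, not the fitted values reported later in this chapter.

```python
import numpy as np

def kappa(m, A, alpha, m0):
    """Expected number of events triggered by an event of magnitude m (kappa term in Eq. 3)."""
    return A * np.exp(alpha * (m - m0))

def g(dt, c, p):
    """Modified Omori temporal density, Eq. (4); zero for non-positive time lags."""
    dt = np.asarray(dt, dtype=float)
    pos = np.maximum(dt, 0.0)
    return np.where(dt > 0, (p - 1.0) / c * (1.0 + pos / c) ** (-p), 0.0)

def f(dx, dy, m, D, gamma, q, m0):
    """Spatial triggering density, Eq. (5)."""
    s = D * np.exp(gamma * (m - m0))
    r2 = dx ** 2 + dy ** 2
    return (q - 1.0) / (np.pi * s) * (1.0 + r2 / s) ** (-q)

def conditional_intensity(t, x, y, catalog, u_bg, theta, m0):
    """lambda_theta(t, x, y | H_t) from Eq. (3), summing over events with t_i < t."""
    A, c, alpha, p, D, q, gamma = theta
    ti, xi, yi, mi = (catalog[k] for k in ("t", "x", "y", "m"))
    past = ti < t
    trig = (kappa(mi[past], A, alpha, m0)
            * g(t - ti[past], c, p)
            * f(x - xi[past], y - yi[past], mi[past], D, gamma, q, m0))
    return u_bg + trig.sum()

# Illustrative parameters (A, c, alpha, p, D, q, gamma) and a tiny synthetic catalog
theta = (0.4, 0.01, 0.6, 1.2, 0.01, 2.0, 0.5)
catalog = {"t": np.array([1.0, 3.5]), "x": np.array([125.0, 126.0]),
           "y": np.array([1.0, 2.0]), "m": np.array([5.0, 6.2])}
print(conditional_intensity(5.0, 125.5, 1.5, catalog, u_bg=0.05, theta=theta, m0=4.5))
```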
The log-likelihood function of the model is given by Eq. (6),

$l(\beta, \theta \mid H_T) = \sum_{i=1}^{N} \delta_i \log \lambda_{\beta,\theta}(t_i, x_i, y_i, m_i \mid H_{t_i}) - \int_{m_0}^{\infty}\int_{t_{\text{start}}}^{t_{\text{start}}+T}\iint_{S} \lambda_{\beta,\theta}(t, x, y, m \mid H_t)\, \mathrm{d}x\, \mathrm{d}y\, \mathrm{d}t\, \mathrm{d}m, \quad (6)$
where $\delta_i = 1$ if event i is a target event and $\delta_i = 0$ if event i is a complementary (initial) event. In Eq. (1), the parameters β and θ can be separated because $\int_{m_0}^{\infty} v_{\beta}(m)\, \mathrm{d}m = 1$. Thus, the log-likelihood function can be written as Eq. (7), $l(\beta, \theta \mid H_T) = l_1(\beta \mid H_T) + l_2(\theta \mid H_T)$.
$l_1(\beta \mid H_T) = \sum_{i=1}^{N} \delta_i \log v_{\beta}(m_i). \quad (8)$
The estimate of β is obtained using the MLE method by maximizing $l_1(\beta \mid H_T)$. The component $l_2(\theta \mid H_T)$ can be described as in Eq. (9):
$l_2(\theta \mid H_T) = \sum_{i=1}^{N} \delta_i \log \lambda_{\theta}(t_i, x_i, y_i \mid H_{t_i}) - \int_{t_{\text{start}}}^{t_{\text{start}}+T}\iint_{S} \lambda_{\theta}(t, x, y \mid H_t)\, \mathrm{d}x\, \mathrm{d}y\, \mathrm{d}t, \quad (9)$
where

$p_{ij} = \begin{cases} \dfrac{\kappa_{A,\alpha}(m_i)\, g_{c,p}(t_j - t_i)\, f_{D,\gamma,q}(x_j - x_i, y_j - y_i; m_i)}{\lambda_{\theta}(t_j, x_j, y_j \mid H_{t_j})}, & t_j > t_i, \\ 0, & t_j \le t_i. \end{cases} \quad (11)$
The variable bandwidth kernel estimator and a better estimate for u(x, y) can be
defined as Eq. (12).
$\hat{u}(x, y) = \frac{1}{T}\sum_{j=1}^{N} \left(1 - \hat{p}_j\right) \varphi\!\left(x - x_j, y - y_j; h_j\right), \quad (12)$
where
$h_j = \max\!\left(h_{\min}, r_{j, n_p}\right),$
where h min is the minimum threshold bandwidth value and r j, n p is the distance
between the jth event location and its nearest neighbor n p [15]. If the initial estimate
u(x, y) is replaced by a new estimate from Eq. (12), then the estimates θ and u(x, y)
are repeated until the estimates converge. Thus, the total spatial intensity function
can be estimated as described in Eq. (13).
$\hat{\Lambda}(x, y) = \frac{1}{T}\sum_{j=1}^{N} \varphi\!\left(x - x_j, y - y_j; h_j\right). \quad (13)$
Thus, Eq. (13) together with $\hat{u}(x, y)$ gives $\hat{\omega}(x, y) = 1 - \hat{u}(x, y)/\hat{\Lambda}(x, y)$ as an estimate of the clustering coefficient [15].
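The kernel estimates in Eqs. (12) and (13) can be sketched as follows. The Gaussian kernel ϕ, the bandwidths h_j, and the declustering probabilities p̂_j are taken here as given illustrative inputs (in practice p̂_j is produced by the stochastic declustering iteration), so this is a sketch of the computation rather than the chapter's fitted surfaces.

```python
import numpy as np

def gauss2d(dx, dy, h):
    """Isotropic 2-D Gaussian kernel with bandwidth h."""
    return np.exp(-(dx**2 + dy**2) / (2.0 * h**2)) / (2.0 * np.pi * h**2)

def background_rate(x, y, xs, ys, h, p_trig, T):
    """u-hat(x, y) from Eq. (12): events weighted by their background probability 1 - p_j."""
    return np.sum((1.0 - p_trig) * gauss2d(x - xs, y - ys, h)) / T

def total_spatial_intensity(x, y, xs, ys, h, T):
    """Total spatial intensity estimate from Eq. (13)."""
    return np.sum(gauss2d(x - xs, y - ys, h)) / T

# Illustrative inputs: event locations, bandwidths, triggering probabilities, time span
xs = np.array([125.0, 125.4, 126.1]); ys = np.array([1.0, 1.2, 2.0])
h  = np.array([0.3, 0.2, 0.4]);       p_trig = np.array([0.1, 0.7, 0.4])
T  = 10.0
u   = background_rate(125.5, 1.1, xs, ys, h, p_trig, T)
lam = total_spatial_intensity(125.5, 1.1, xs, ys, h, T)
print(u, lam, 1.0 - u / lam)   # last value: clustering coefficient omega-hat at this point
```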
The algorithm using the DFP method and the stochastic declustering method can
be described in Tables 2 and 3.
The space–time ETAS model requires residual analysis and diagnostic plots to
assess model fit and visualization and ease of computation. Residual analysis in
this model calculates only temporal and spatial residuals. If the model used is
capable of describing temporal and spatial variations in the data, the expected
results are $R(I \times B; h) \approx 0$, $R_{\text{temp}}(I; h) \approx 0$, and $R_{\text{spat}}(B_i; h) \approx 0$ for each $I \subset [t_{\text{start}}, t_{\text{start}} + T]$ and $B \subset S$ [15]. For the diagnostic plot in this model, the transformation time $\tau_i$ is plotted against the i-th event. A significant deviation from the straight line y = x indicates a weakness of the model. The
deviation occurs because there are some characteristics of the data that cannot be
accounted for by the model [21].
To examine the occurrence of earthquakes, a map of the region is used. Since this study uses data from the Sulawesi-Maluku earthquake catalog, the catalog is mapped in Fig. 2.
Figure 2 presents the map of seismicity at latitude and longitude coordinates around the range of −6.5° S to 3° N and 118° to 126° E that occurred between January
1, 2008, and July 29, 2023. Earthquake events tend to occur along the Maluku and
North Sulawesi Seas and around Central Sulawesi, specifically around the city of
Palu (see Fig. 1). Figure 2b, c shows that there are earthquake events around the
coordinates of latitude 1° S to 2° N and around the coordinates of longitude 125° to
130° E. Most of the earthquake events occur around the Maluku and North Sulawesi
Fig. 2 Visualization of seismicity points with depth ≤ 70 km and magnitude ≥ 4.5 from 2008
to 2023: a Seismicity map in Sulawesi and Maluku regions, b relationship of latitude coordinate
point with earthquake occurrence time, c relationship of longitude coordinate point with earthquake
occurrence time, d relationship of magnitude with earthquake occurrence time, e relationship of
number of earthquake occurrences at time (time stationarity) with earthquake occurrence time,
f relationship of predicted value of log10 N mag (predicted value of Gutenberg Richter Law) based
on actual earthquake data with magnitude
Seas, in the range of roughly the 1000th to 5000th events of the catalog, with considerable magnitudes ranging from 7 to 7.7. The largest earthquake, with a magnitude of 7.7, occurred as the 256th event on January 3, 2009, and a magnitude-7 earthquake occurred as the 4622nd event on January 18, 2023, as can be seen in Fig. 2d. In Fig. 2f, the small black circles show the magnitude against the predicted value of log10 Nmag (the value predicted by the Gutenberg-Richter law) based on the actual earthquake data, lying parallel to the linear (blue dashed) line. The conclusion that can be drawn is that the greater the magnitude, the fewer earthquakes occur in Sulawesi-Maluku. Meanwhile, Fig. 2e shows a plot of the cumulative number of events against time, marked by a blue dashed line, which follows a linear line parallel to the plot of the number of earthquake events at time (t), marked by a black line; it can therefore be said that the background seismicity rate satisfies stationarity in time.
Table 4 Parameter estimation for the space–time ETAS model
Parameters Estimation StdErr Conclusion
β 3.287 0.003 Significant
μ 0.669 0.007 Significant
A 0.406 0.017 Significant
c 0.011 0.042 Not significant
α 0.649 0.035 Significant
p 1.160 0.004 Significant
D 0.011 0.036 Not significant
q 2.225 0.013 Significant
γ 0 3.044 × 10^8 Not significant
The parameter $\hat{\mu}$ indicates a background rate of earthquake events in the Sulawesi and Maluku regions of 0.669 shocks/day, and the parameter $\hat{A}$ indicates a triggering intensity for the whole catalog of 0.406 events/(deg² × day), while the parameter $\hat{D}$ explains the influence of the coordinate location of the mainshock event by 0.011 deg². The parameter $\hat{\alpha}$ clarifies the influence of the magnitude of the background event that triggers the occurrence of aftershocks, at 0.649 magnitude⁻¹, and the parameter $\hat{\gamma} = 0$ indicates that the spatial triggering effect is independent of magnitude; this means that the magnitude is far more influential than the coordinate location of the aftershock occurrences. The value of the parameter $\hat{p} > 1$ means that the number of aftershocks decreases within a short period of time, and the parameter $\hat{c} = 0.011$ day characterizes the time scale of the seismic activity in Sulawesi-Maluku caused by the shaking of the earth's crust. The value of the parameter $\hat{q}$ is 2.225; the larger the value of $\hat{q}$, the faster the decrease in the number of aftershocks after the main earthquake occurred. These results are sufficient to show that the space–time ETAS model gives good results on the earthquake data in Sulawesi-Maluku.
Fig. 3 Visualization of seismicity map prediction from a background seismicity rate, b total spatial
intensity, c clustering coefficient, and d conditional intensity
The results of the background seismicity rate estimation in Fig. 3a gave better
results because it obtained fewer earthquakes than the total spatial intensity esti-
mation in Fig. 3b. Therefore, areas with high seismicity rates indicate that not all
earthquakes occur in the interior, but when a main earthquake occurs, it will be
followed by aftershocks with a high probability of earthquake occurrence. The esti-
mated probability of earthquake occurrence can be seen from the background and
triggered events in Fig. 4.
Figure 4 shows that the detection of earthquake probabilities in the Sulawesi-
Maluku region extends almost along the Maluku Sea, where triggered or aftershocks
(subsequent earthquakes) have a relatively high probability in the region. This is
indicated by the red plus sign symbol which is clustered quite high along the Maluku
Sea, almost comparable to the background or main earthquake indicated by the black
small circle symbol which has a high probability of earthquake occurrence along the
Maluku Sea at latitude coordinate points around 0° to 6° N and longitude coordinate
points around 125° to 130° E.
To evaluate the quality of the model, residual analysis and diagnostic plots of the space–time ETAS model are examined. In the temporal residual analysis, there is a large deviation in the residuals where the estimated intensity function has its highest peak. Figure 5a, b shows the earthquake events that occurred around the Maluku Sea with a large PGA value of about 0.5 g at coordinates between 0° and 4° N and about 125° and 130° E. The spatial residual analysis therefore captures a large number of earthquake events in Sulawesi-Maluku around the Maluku Sea
Fig. 4 Visualization of the estimated probability declustering the actual data of real earthquake
events and their triggers in Sulawesi-Maluku from 2008 to 2023
and obtains quite complex results. This is also supported by the results of plotting the
transformation time τi against the cumulative number of i-th earthquake in Fig. 5c
where the plot follows the red line (theoretical cumulative earthquake events against
time change). Similar results are obtained in the Q-Q plot of $U_i$ in Fig. 5d, where the theoretical and observed quantile points are aligned along the red (linear) line, so it is concluded that the earthquake data in Sulawesi-Maluku fulfill the assumption
of U (0, 1) distribution. Thus, from the results of residual analysis and diagnostic
plots, the space–time ETAS model obtained good model quality, although there was
an indication of a deviation in data characteristics.
Fig. 5 Visualization of residual plots and diagnostic plots of the space–time ETAS model: a temporal residuals, b spatial residuals, c time plots of the transformation of $\tau_i$ versus i, d Q-Q plots of $U_i$
4 Conclusions
In this study, we considered the space–time ETAS model to perform probabilistic seismic hazard analysis for Sulawesi-Maluku. We applied the Davidon-Fletcher-Powell (DFP) and stochastic declustering algorithms to estimate the parameters. The parameter estimates indicate a background rate of earthquake events in the Sulawesi and Maluku regions of 0.669 shocks/day, with a triggering intensity for the total earthquake catalog of 0.406 events/(deg² × day). The influence of the coordinate location of the mainshock event is 0.011 deg², and the influence of the magnitude of the background event that triggered the aftershocks is 0.649 magnitude⁻¹, where the magnitude is far more influential than the coordinate location of the aftershock occurrences. Meanwhile, the number of aftershocks in Sulawesi-Maluku decreases within a short period of time, and the time scale of the seismic activity caused by the shaking of the earth's crust is about 0.011 day. Based on the results of residual analysis and diagnostic plots, the
space–time ETAS model obtained good model quality. These results can be used to
provide more accurate estimates of earthquake events. The estimation of earthquake
events from the seismic hazard map results provides information on the total rate of
earthquake events around the Maluku Sea, with clusters of earthquake events spread
almost throughout Sulawesi and Maluku. The modeling in this study can be used
by the government and local communities to predict the occurrence of earthquakes,
especially in the Sulawesi and Maluku regions. For future study, the space–time
ETAS model can be extended by including the effect of geological variables such
as subduction and faulting. In the spatial case, this direction has been studied by
Choiruddin et al. [3, 8].
Acknowledgements The authors gratefully acknowledge financial support from the Institut
Teknologi Sepuluh Nopember for this work, under the project scheme of the Publication
Writing and IPR Incentive Program (PPHKI) 2024. We also thank the four reviewers for their
constructive comments.
References
1. Tim Pusat Studi Gempa Nasional, Peta Sumber dan Bahaya Gempa Indonesia Tahun 2017,
Bandung: Pusat Penelitian dan Pengembangan Perumahan dan Pemukiman Badan Penelitian
dan Pengembangan Kementrian Pekerjaan Umum dan Perumahan Rakyat (2017)
2. Otok BW, Soeparno K, Rahmawati I, Azies HA, Isnawati (2021) Structural equation modeling
on decision making in understanding disasters in Maluku. In: AIP conference proceedings.
Indonesia, pp 020003-1–020003-8
3. Choiruddin A, Susanto TY, Husain A, Kartikasari YM (2023) kppmenet: combining the kppm
and elastic net regularization for inhomogeneous Cox point process with correlated covariates.
J Appl Stat 1–14
4. Sakdiyah K, Choiruddin A (2021) Model inhomogeneous log-Gaussian cox process (LGCP)
untuk Pemetaan Risiko Gempa Bumi di Sumatra. Jurnal Sains dan Seni ITS 9(2):D108–D114
5. Choiruddin A, Trisnisa AF, Iriawan N (2021) Quantifying the effect of geological factors on
distribution of earthquake occurrences by inhomogeneous cox processes. Pure Appl Geophys
1579–1592. Springer Nature Switzerland AG
6. Susanto TY, Choiruddin A, Purnomo JDT (2023) On the earthquake distribution modeling in
Sumatra by Cauchy cluster process: comparing log-linear and log-additive intensity models.
Sains Malaysiana 52(2):655–667
7. Choiruddin A, Susanto TY, Metrikasari R (2021) Two-Step estimation for modeling the
earthquake occurrences in Sumatra by Neyman–Scott cox point processes. In: International
conference on soft computing in data science. Springer, Singapore, pp 146–159
8. Mukhti TO, Choiruddin A, Purhadi (2021) Generalized additive poisson models for quantifying
geological factors effect on the earthquake risk mapping. In: ICMSDS 2020. IPB University,
pp 1–11
9. Zhuang J, Ogata Y, Wang T (2017) Data completeness of the Kumamoto earthquake sequence
in the JMA catalog and its influence on the estimation of the ETAS parameters. Earth Planets
Space 1–12
10. Bansal AR, Dimri VP, Babu KK (2013) Epidemic type aftershock sequence (ETAS) modeling
of northeastern Himalayan seismicity. J Seismol 255–264
11. Kumazawa T, Ogata Y (2013) Quantitative description of induced seismic activity before
and after the 2011 Tohoku-Oki earthquake by nonstationary ETAS models. J Geophys Res
6165–6182
12. Ogata Y (1998) Space-time point-process models for earthquake occurrences. Ann Inst Stat
Math 50(2):397–402
13. Guo Y, Zhuang J, Zhou S (2015) An improved space-time ETAS model for inverting the rupture
geometry from seismicity triggering. J Geophys Res Solid Earth 120:3309–3323
14. Ogata Y, Zhuang J (2006) Space–time ETAS models and an improved extension. Tectono-
physics 413:13–23
15. Jalilian A (2019) ETAS: an R package for fitting the space-time ETAS model to earthquake
data. J Stat Softw 88:1–39
16. Zhuang J (2012) Long-term earthquake forecasts based on the epidemic-type aftershock
sequence (ETAS) model for short-term clustering. Res Geophys 2(e8):52–57
17. Guo Y, Zhuang J, Zhou S (2015) A hypocentral version of the space–time ETAS model.
Geophys J Int 203:366–372
18. Zhuang J, Chang CP, Ogata Y, Chen YI (2005) A study on the background and clustering
seismicity in the Taiwan region by using point process models. J Geophys Res 110(B05S18):1–
12
19. Davoudi N, Tavakoli HR, Zare M, Jalilian A (2020) Aftershock probabilistic seismic hazard
analysis for Bushehr province in Iran using ETAS model. Nat Hazards 100:1159–1170
20. Tim Revisi Peta Gempa Indonesia 2010, Peta Hazard Gempa Indonesia 2010 sebagai
Acuan Dasar Perencnaan dan Perancangan Infrastruktur Tahan Gempa, Jakarta: Kementerian
Pekerjaan Umum (2010)
21. Nicolis O, Chiodi M, Adelfio G (2015) Windowed ETAS models with application to the Chilean
seismic catalogs. Spatial Statistics 14:151–165
Application of Time Series Regression,
Double Seasonal ARIMA, and Long
Short-Term Memory for Short-Term
Electricity Load Forecasting
1 Introduction
Electricity has become a primary need of society. As a vital source of power, electricity can significantly improve social welfare in fields such as the economy, education, health, industry, housing, offices, and revenue. In 2020, 2021,
and 2022, the total electricity load in East Java was approximately 37.87 TWh,
40.68 TWh, and 43.72 TWh, respectively [1]. The increase in electricity load was
driven by economic recovery and increased business activities in the post-COVID-19
pandemic. PT Perusahaan Listrik Negara (PLN) Persero is Indonesia’s main elec-
tricity provider which distributes electricity to customers. One of the challenges in
distributing electricity is the fluctuation of electricity load over time which is related
to customer’s electricity consumption habits. For example, we create a time series
plot as in Fig. 1 that shows electricity load patterns from January 3rd to January 16th ,
2020. The figure shows that customer’s electricity consumption patterns are similar
on a daily basis. It shows that during the day there is a peak load, while at night the
load is lower. Moreover, on Saturdays and Sundays there are significant decreases
in electricity consumption compared to weekdays. With this in mind, electricity load must be estimated accurately because electricity cannot be stored. If more electricity is generated than demanded, the excess is wasted and the generator power should be lowered. Meanwhile, if less electricity is generated than demanded, it may cause power outages and require backup-plant operation. Therefore, forecasting methods
are recommended in order to obtain accurate load forecasts.
Electricity load data have a double seasonal pattern on a daily and weekly basis [2]. For this reason, we apply forecasting methods that can capture the double seasonal pattern. The first is time series regression (TSR), a practical forecasting method that is suitable for data with low variance at each aggregation time. Since customers' daily electricity consumption patterns are similar and result in low variance at each aggregation time, this method is worth using. Its practicality is similar to that of ordinary linear regression. However, TSR rarely yields non-autocorrelated residuals because it is less effective in capturing the double seasonal lags, although this can be handled with the double seasonal ARIMA method. Few studies have used TSR for electricity load forecasting. A study on input selection for TSR using random forest and stepwise methods was conducted by Sintyaningrum [3], whereas other studies [4, 5] use TSR in a hybrid model. Hence, this method needs further evaluation as a non-hybrid method.
Autoregressive Integrated Moving Average (ARIMA) is a conventional forecasting method that requires data stationarity in mean and variance to identify the lag orders.
ARIMA estimates parameters in the form of lag orders so that forecasting can be
compatible with the actual data over time. Double seasonal ARIMA (DSARIMA)
as the second method applied in the present study is the development of ARIMA
designed for forecasting data with double seasonal pattern. This method estimates
parameters in the form of double seasonal orders. It has been applied in many elec-
tricity load forecasting studies, such as Mohamed et al. [6], Mado et al. [7], Khusna
and Suhartono [8], and many more.
Neural network (NN) is a type of machine learning where the algorithm in
processing data is inspired by the human brain. This type gives “machine” the ability
to capture patterns and to solve various problems. This study's third method, long short-term memory (LSTM), derives from NN. This method is capable of saving the output of the previous cell [9], which gives it a powerful ability to handle long time-lag dependencies and thus to capture double seasonal patterns. In addition, its concise procedure and absence of distributional assumptions make this method commonly used in forecasting studies, without
exception in electricity load forecasting, such as Bouktif et al. [10], Bai et al. [11],
Abbasimehr et al. [12], and many more.
This is a fresh study that, for the first time, compares practical, conventional, and machine learning forecasting methods for electricity load forecasting. The selected models are those that have proven effective at forecasting electricity loads for each method. In addition, several scenarios and treatments are undertaken in this study in order to obtain more accurate forecasts. Therefore, this
study aims to provide the best model from time series regression, double seasonal
ARIMA, and long short-term memory to forecast electricity load in East Java.
Ultimately, the best overall model is obtained.
2 Methodology
The observation data in this study are secondary data provided by PT PLN Unit
Pelayanan Pengatur Beban (UP2B) Jawa Timur [1]. The variable is half-hourly
recorded electricity load of the East Java Province from the period of January 1st ,
2020, to January 31st , 2023, in Mega Watt (MW) units with a total of 54,096 time
series data. Short-term electricity load forecasting is for periods of several hours to
1 week to gain accurate forecasts. Hence, the data were divided into two, in-samples
and out-samples, respectively, from January 1st , 2020, to January 24th , 2023 (53,760
time series data) and January 25th , 2023, to January 31st , 2023 (336 time series data).
Time series regression is a forecasting method that uses regression analysis to identify the effect of independent variables on the dependent variable $Z_t$, which depends on a time function (t) [4]. The independent variables are defined as dummy variables.
In this study, time series regression uses dummy variables such as trend (t), daily
seasonal (Jr,t ), weekly seasonal (Hh,t ), and inter-seasonal interactions (Ir,h,t ). The
time series regression model can be expressed as in Eq. (2),

$Z_t = \hat{\beta}_0 + \hat{\tau}\, t + \sum_{r=1}^{48} \hat{\vartheta}_r J_{r,t} + \sum_{h=1}^{7} \hat{\eta}_h H_{h,t} + \sum_{h=1}^{7}\sum_{r=1}^{48} \hat{\upsilon}_{r,h} I_{r,h,t} + \varepsilon_t, \quad (2)$
where β̂0 , τ̂ , ϑ̂r , η̂h , and υ̂r,h are the estimated parameters for intercept, trend, daily
seasonal, weekly seasonal, and inter-seasonal interaction, respectively. The model
thus has 1 + 1 + 48 + 7 + (7 × 48) = 393 parameters and 392 independent
variables.
This model has many independent variables; consequently, the model is not effec-
tive for forecasting. In order to solve this problem, the stepwise method is utilized
to eliminate independent variables by combining two stages: forward
stepwise selection and backward elimination [13]. The time series regression utilizes
Ordinary Least Square (OLS) for estimating the parameters. OLS is one of the param-
eter estimation methods by minimizing the sum of squared errors. The OLS formula
with matrix approach is given by Eq. (3) [14].
( )−1 T
β̂ = X T X X Z, (3)
where Z, X, and β̂ are the vector of Z t , the matrix of independent variables (vector
1 and all dummy variables), and the vector of estimated parameters, respectively.
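As a minimal sketch of how the design matrix behind Eq. (2) can be assembled and the parameters estimated by OLS as in Eq. (3), the snippet below builds the trend, half-hourly, daily, and interaction dummies for a short synthetic half-hourly series; the synthetic series and the fact that no reference categories are dropped are illustrative simplifications, not the study's actual model.

```python
import numpy as np

def design_matrix(n, periods_per_day=48, days_per_week=7):
    """Trend plus half-hourly, daily, and interaction dummies as in Eq. (2) (no reference dropped)."""
    t = np.arange(1, n + 1)
    halfhour = (t - 1) % periods_per_day                  # r = 0..47
    day = ((t - 1) // periods_per_day) % days_per_week    # h = 0..6
    J = np.eye(periods_per_day)[halfhour]                 # daily-seasonal dummies J_{r,t}
    H = np.eye(days_per_week)[day]                        # weekly-seasonal dummies H_{h,t}
    I = (H[:, :, None] * J[:, None, :]).reshape(n, -1)    # interaction dummies I_{r,h,t}
    # 1 + 1 + 48 + 7 + 336 = 393 columns, matching the parameter count in the text
    return np.column_stack([np.ones(n), t, J, H, I])

rng = np.random.default_rng(0)
n = 4 * 7 * 48                                            # four synthetic weeks of half-hourly data
X = design_matrix(n)
z = 4000 + 0.01 * np.arange(n) + rng.normal(0, 30, n)     # illustrative load series (MW)
# Collinear columns (no reference dropped) are handled by the minimum-norm least-squares solution
beta_hat, *_ = np.linalg.lstsq(X, z, rcond=None)          # OLS estimate, Eq. (3)
print(beta_hat[:3])
```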
The residual assumptions for time series regression model are commonly the same
as ordinary linear regression [15]. The independent assumption is a state where there
is no autocorrelation in the residual series. In this study, ACF plot is utilized to identify
this state. If there is a cut-off lag, it indicates that there is autocorrelation in the residual
series, or vice versa. In order to overcome the autocorrelation, lag variables are
added to the model, especially for lags that have strong autocorrelation. The identic
assumption also known as homoscedasticity is a state where the residual variations
are identical or not significantly different. This assumption is not necessarily satisfied,
because it only works on cross-sectional data [16]. The normally distributed assumption is a state where the residual series is normally distributed, usually denoted as $\varepsilon_t \sim N(0, \sigma^2)$. While outliers should be detected and handled, the high number of outliers in the electricity load data leads to a large computational time; thus, the normally distributed residual assumption is not necessarily satisfied.
$Z'_t = \frac{Z_t - Z_{\min}}{Z_{\max} - Z_{\min}}. \quad (6)$
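A minimal sketch of the min–max transformation in Eq. (6) and its inverse (needed later to map forecasts back to MW) is shown below with an illustrative array.

```python
import numpy as np

z = np.array([3900.0, 4200.0, 4800.0, 5100.0])    # illustrative load values (MW)
z_min, z_max = z.min(), z.max()

z_scaled = (z - z_min) / (z_max - z_min)           # Eq. (6)
z_back   = z_scaled * (z_max - z_min) + z_min      # inverse transform back to MW
print(z_scaled, np.allclose(z_back, z))
```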
Figure 2 shows the architecture of LSTM. As we can see, it starts from entering the
inputs until finally gaining the prediction (output). The steps of LSTM are elaborated
as follows [12].
Step 1. Execute the LSTM to obtain the forecast value using Eqs. (7)–(12), where $W_O$ is the weight of gate $O$, $b_O$ is the bias of gate $O$, and $O$ denotes the aforementioned gates (a numerical sketch of these gate computations is given at the end of this subsection).

$f_t = \sigma\left(W_f [h_{t-1}, X_t] + b_f\right), \quad (7)$

$i_t = \sigma\left(W_i [h_{t-1}, X_t] + b_i\right), \quad (8)$

$\tilde{c}_t = \tanh\left(W_c [h_{t-1}, X_t] + b_c\right), \quad (9)$

$c_t = f_t * c_{t-1} + i_t * \tilde{c}_t, \quad (10)$

$o_t = \sigma\left(W_o [h_{t-1}, X_t] + b_o\right), \quad (11)$

$h_t = o_t * \tanh(c_t). \quad (12)$
Step 2. Calculate residuals and calculate Mean Square Error (MSE) as the loss.
Step 3. The residuals are back-propagated to every gate. This step is popularly
called back-propagation through time [9].
Step 4. Optimizer updates the weights and biases of each gate by utilizing the
loss.
These steps are repeated until the stop condition test is met or the loss value has
converged. The forecasting value is in the form of minmax transformation, and thus it
must be reverse-transformed to compare it with the actual data. Hyperparameters are
parameters used in running machine learning algorithms to get optimal forecasting
performance. Epoch is defined as the number of iterations used in modeling. The
more the number of epochs, the more the weights change in the network. Optimizer
is used to find the optimal weight, so as to minimize the errors and maximize the
accuracy. This study uses Adaptive Moment estimation (ADAM) optimizer which
outperforms other optimizers [21].
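The gate computations in Eqs. (7)–(12) can be traced numerically with a single forward step of one LSTM cell, as in the NumPy sketch below; the tiny weight matrices and the input vector are random illustrative values, whereas in an actual model they are learned through back-propagation through time with the ADAM optimizer as described above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step following Eqs. (7)-(12); W and b hold the four gate parameters."""
    z = np.concatenate([h_prev, x_t])              # [h_{t-1}, X_t]
    f_t = sigmoid(W["f"] @ z + b["f"])             # forget gate, Eq. (7)
    i_t = sigmoid(W["i"] @ z + b["i"])             # input gate, Eq. (8)
    c_tilde = np.tanh(W["c"] @ z + b["c"])         # candidate cell state, Eq. (9)
    c_t = f_t * c_prev + i_t * c_tilde             # cell state update, Eq. (10)
    o_t = sigmoid(W["o"] @ z + b["o"])             # output gate, Eq. (11)
    h_t = o_t * np.tanh(c_t)                       # hidden state, Eq. (12)
    return h_t, c_t

rng = np.random.default_rng(1)
n_in, n_hid = 5, 3                                 # e.g., 5 lag inputs, 3 hidden units (illustrative)
W = {k: rng.normal(0, 0.1, (n_hid, n_hid + n_in)) for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
x = rng.normal(0, 1, n_in)                         # one scaled input vector
h, c = lstm_step(x, h, c, W, b)
print(h)
```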
In this study, we measure forecasting accuracy by means of Root Mean Square Error
(RMSE) and symmetric Mean Absolute Percentage Error (sMAPE) through out-
sample approach. RMSE is used to determine the best model from each method
while sMAPE is used to determine the best overall model. The formulas are defined
in Eqs. (13) and (14), respectively [22]. The Z t+l and Ẑ t+l in sequence are the
out-sample data and its forecast at the l-th step, with L the number of out-sample observations.

$\mathrm{RMSE} = \sqrt{\frac{1}{L}\sum_{l=1}^{L}\left(Z_{t+l} - \hat{Z}_{t+l}\right)^{2}}, \quad (13)$

$\mathrm{sMAPE} = \frac{1}{L}\sum_{l=1}^{L} \frac{\left|Z_{t+l} - \hat{Z}_{t+l}\right|}{\frac{1}{2}\left(Z_{t+l} + \hat{Z}_{t+l}\right)} \times 100\%. \quad (14)$
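The two accuracy measures in Eqs. (13) and (14) can be computed directly, as in the short sketch below with illustrative actual and forecast values.

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean square error, Eq. (13)."""
    return np.sqrt(np.mean((actual - forecast) ** 2))

def smape(actual, forecast):
    """Symmetric MAPE, Eq. (14), in percent."""
    return np.mean(np.abs(actual - forecast) / ((actual + forecast) / 2.0)) * 100.0

actual   = np.array([4100.0, 4250.0, 4500.0, 4700.0])   # illustrative out-sample loads (MW)
forecast = np.array([4080.0, 4300.0, 4450.0, 4760.0])
print(rmse(actual, forecast), smape(actual, forecast))
```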
Figure 3 shows the time series plot of electricity load from January 1st , 2020, to
January 31st, 2023. The plot shows an upward trend caused by population growth, which increases electricity needs from facilities, domestic demand, companies, and other sources. There was a phenomenon of decreasing electricity consumption during Eid al-Fitr and several days before and after it. This phenomenon is explained by people returning to their hometowns, so that the houses left behind consumed almost no electricity. In addition, some industries did not operate due to collective leave, thus reducing the use of electricity.
Electricity load is found to have different averages and variances at each half-hour.
The data frame was organized by collecting data from each same day at each half-
hour in order to compute averages and variances. Figures 4 and 5 in sequence show
the averages and the variances of electricity load per half-hour at each individual
day. In Fig. 4, the average electricity load on Saturday and Sunday is less than
that on weekdays. This is explained by less activities consuming electricity during
weekends as most people using the time for rest. In addition, it is also the case for
some industries and offices which are off during such times. Meanwhile, 18:00 is the peak of electricity consumption, because people rush to complete their work. After 18:00, electricity consumption decreases, signifying that most people have completed their work, while the rest are still working. In Fig. 5, every day's pattern
is almost identical. During working hours, the variance pattern has a fluctuation
which shows an increase from 07:00 to around noon and then a decrease up to 18:00.
People’s activities during these hours are various, and thus electricity consumptions
have a higher variance than that during other hours. It is apparent that on Sunday,
the consumption is lower in variance than that on other days since most people do
less diverse activities on Sunday. It is noted that from 11:00 to 13:00 there is a fall
both in average and in variance of the electricity load. This is attributed to the lunch
break during which most activities are stopped.
The analysis of this method was divided into two input types. In the first, all input variables were analyzed, whereas in the second, only input variables suspected to have an effect (based on the study by Sintyaningrum [3]) were analyzed. The other models
were adjustments to the predefined models (1st model and 5th model). The models
are summarized in Table 1. The reference variables are J48,t H7,t , I48,1,t , I48,2,t , …,
I48,6,t , I1,7,t , I2,7,t , …, I48,7,t , that are used for comparison with other aggregation
times.
The evaluations for the first input type are shown in Table 2, with k is the number
of parameters. The number of parameters in the 1st model is 337 parameters, which
is the total 393 parameters subtracted by 56 parameters of the reference variable. The
2nd model is shown to yield the lowest out-sample RMSE value, and thus it becomes
the best model for the first input type. The evaluations for the second input type are
shown in Table 3. Since the 7th model shows the lowest out-sample RMSE value, it
becomes the best model for the second input type. There are many significant lags,
especially around the double seasonal lag (see Fig. 6), indicating that both models did
not satisfy the independent residual assumption. Adding lag variables as independent
variables based on significant lags is required.
We added lag variables Z t−1 , Z t−48 , and Z t−336 representing the non-seasonal
lag, the first seasonal lag, and the second seasonal lag, respectively. Two new models
were identified, that are the 8th model and the 9th model which were in sequence
the definitions of the 2nd and 7th models added with the lag variables for analysis.
Therefore, those new models were executed to forecast. The evaluations for the
modified models are shown in Table 4. It is noted that the number of parameters in
the 8th model is higher than in the 2nd model, and likewise for the 9th model compared to the 7th model,
due to the stepwise method. This indicates that with the addition of lag variables,
the models have more dummy variables that have a significant effect. Out-sample
RMSE value of the 9th model is lower than the 8th model, thus the 9th model is the
best model for the time series regression method. Nevertheless, there are still many
significant lags, especially around the first and second seasonal lags (see Fig. 7). The
addition of lag variables did not satisfy the assumption, indicating the existence of
significant lags not being included in the modified models.
The rounded value from the Box-Cox analysis is −0.34. However, Fig. 3 shows that the electricity load has the same range of data for each time shift; the several falls in electricity load around Eid al-Fitr are what cause the Box-Cox analysis to indicate non-stationarity in variance. Therefore, we considered the data to be stationary in variance and did not transform them. Figure 8 shows the ACF plot, which indicates non-stationarity in mean because the ACF dies down extremely slowly. Hence, differencing was carried out at the lags where the ACF dies down extremely slowly, namely lags 1, 48, and 336.
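The differencing described above (non-seasonal lag 1 and seasonal lags 48 and 336) can be sketched with pandas as follows; the synthetic series is an illustrative stand-in for the half-hourly load.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 3 * 336                                      # three synthetic weeks of half-hourly data
z = pd.Series(4000 + 10 * np.sin(2 * np.pi * np.arange(n) / 48) + rng.normal(0, 5, n))

# Differencing at lag 1 (non-seasonal), 48 (daily), and 336 (weekly), as done before fitting DSARIMA
w = z.diff(1).diff(48).diff(336).dropna()
print(len(z), len(w))                            # 1008 observations reduce by 1 + 48 + 336 = 385
```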
Fig. 9 a ACF plot, and b PACF plot of electricity load after differences
Figure 9 shows the ACF and PACF plots after differencing; the ACF no longer dies down extremely slowly, indicating that the data are stationary in mean. As we can see in Fig. 9, the ACF plot cuts off at lags 48 and 336, while the PACF plot dies down at each seasonal lag. Thus, we also identified the
conjecture models that incorporate the order of ARMA. As a result, the conjecture
models are multiplicative double seasonal ARIMA. Based on these identifications,
four conjecture models were obtained and presented in Table 5.
The evaluations for the conjecture models are shown in Table 6. All models
already have significant parameters and satisfy the white noise assumption. However,
all models did not satisfy the normally distributed residuals assumption due to large
number of outliers. The lowest out-sample RMSE value is shown by model 3 making
it to be the best model for the Double Seasonal ARIMA method.
The variables in the form of lags from the first 336 lags were selected based on
the highest autocorrelation value. In type 1, we took three lags with the highest
autocorrelation value and took Z t−48 , Z t−336 to represent the double seasonal lags.
In type 2, we took five lags with the highest autocorrelation value. The summary of
these scenarios is presented in Table 7. In this study, we used three hidden layers
with number of neurons equal to 64, 32, and 32. The epoch combinations were set
equal to 50, 100, 150, and 200. LSTM modeling cannot directly gain minimum
out-sample RMSE, so it is necessary to do several iterations. Hence, boxplots of
out-sample RMSE were created and presented to serve as the comparison between
combinations.
In type 1, epoch = 200 is the best combination for having the lowest median and
mean value of out-sample RMSE (see Fig. 10a). As for in type 2, epoch = 200 is also
the best combination for the same reason (see Fig. 10b). In spite of this, Interquartile
Range (IQR) Box of out-sample RMSE with epoch = 200 in type 2 is smaller than
that in type 1 as we can see in Fig. 10. This demonstrates that the forecasting with
the combination of epoch = 200 in the type 2 gains steadier forecasts. Therefore,
LSTM model with epoch = 200 and Z t−1 , Z t−2 , Z t−3 , Z t−4 , Z t−336 as the inputs are
the best model for the long short-term memory method.
Fig. 11 Time series plot of best model forecasts for each method
Up to this point, we have determined one best model from each of the three methods.
Forecasting comparison, then, needs to be drawn in order to notice the forecasting
performance of the three best models. Hence, we created Fig. 11 to show their
comparison in out-sample data. The figure indicates that the forecasts of each model
quite accurately fit the actual data across the out-sample horizon. This can be inferred
that each method has successfully forecasted the electricity load. The best overall
model is determined by the lowest out-sample sMAPE value. For simplicity, the
values are compared through the visualization of the iterative plot of out-sample
sMAPE, as presented in Fig. 12. Notice, Fig. 12 shows that the out-sample sMAPE at
l = 1 is around 2–3.5%, but it falls significantly thereafter. The farther the forecasting
time, the steadier the iterative line, so that the models have been consistently good
at forecasting electricity load in a 1-week period. The line of iterative out-sample
sMAPE of the DSARIMA model is the lowest among other models. Therefore,
double seasonal ARIMA is the best overall model among the others. Finally, we can
conclude that double seasonal ARIMA is the best method in forecasting electricity
load.
In this study, we have applied time series regression, double seasonal ARIMA, and
long short-term memory methods for forecasting electricity load. We have deter-
mined the best model of each method. The best model of time series regression uses the variables suspected to have an effect, without the reference variables, together with the lag variables $Z_{t-1}$, $Z_{t-48}$, and $Z_{t-336}$, selected by the stepwise elimination method. This model resulted in a quite low out-sample RMSE value of 47.734.
However, the model does not satisfy the independent residual assumption. Next, the
best model of double seasonal ARIMA is double seasonal ARIMA ([2, 10, 11, 12,
15, 16, 17, 18, 19, 20, 21, 22, 23], 1, [1, 2, 3, 7, 8, 30, 34, 35, 39, 40, 41, 42, 43, 44, 45,
46, 47, 48]) (0, 1, 1)48 (0, 1, 1)336 . This model gained a quite low out-sample RMSE
value of 35.631. It has successfully satisfied the white noise assumption. Lastly, the
best model of long short-term memory is the LSTM (5, 64, 32, 32, 1) with epoch =
200 and Z t−1 , Z t−2 , Z t−3 , Z t−4 , Z t−336 as inputs. This model obtained a quite low
out-sample RMSE value of 65.104 in average. Then, we compared those best models
by the lowest out-sample sMAPE value in order to obtain the best overall model.
The best model of TSR, DSARIMA, and LSTM acquired extremely low out-sample
sMAPE values equal to 0.644%, 0.55%, and 0.971%, respectively. Having the lowest
out-sample sMAPE value, the double seasonal ARIMA is concluded to be the best
overall model for forecasting electricity load in this study. There is scope for further studies to obtain lower errors and more accurate forecasts. In time series regression, the independent residual assumption was not satisfied, indicating that there were still significant lags that had not been included in the model. Future studies could include more lag variables to obtain a better TSR model and to try to satisfy this assumption. In general, future studies could also consider
external factors that may affect electricity load such as weather or temperature.
References
Iis Dewi Ratih, Ketut Buda Artana, Heri Kuswanto, Emmy Pratiwi,
and Muhammad Farhan Nuari
Abstract Ship collisions are the type of accident with the highest percentage of
investigations, making them the type of accident with a high variation in causes.
Additionally, ship collisions pose a serious threat because they occur between two
different vessels, resulting in material losses and loss of life. This condition makes
ship collisions a serious problem that requires prevention efforts and adjustment of existing conditions. This study aims to model the causes of ship collisions in
Indonesia to determine the probability of a ship experiencing a collision or a near
miss. The modeling will be conducted using the Bayesian network method. The
Bayesian network model is based on the factors that cause ship collisions, relying on
past incidents and written reports from National Transportation Safety Committee
(NTSC) investigations and judgments from the Maritime Court. The purpose of this
study is to identify the factors that cause ship collisions, determine the probability
of a ship experiencing a collision, and identify the factors that contribute the most
to the probability of ship collisions in Indonesia through sensitivity analysis. The
results obtained from the model, with a 70% weight for training data, show that the
probability of a ship experiencing a collision during a dangerous condition is 63%,
with an accuracy and sensitivity of 93.75% and 100%, respectively. According to
the model, the factors with the greatest influence are “crew competence,” “decision
making,” “maneuverability,” and “ship communication.”
1 Introduction
Ship accidents are a crucial topic that requires extensive research due to their high
risk and potential for significant losses. According to the National Transportation
Safety Committee (NTSC), ship collisions rank second in terms of the percentage of
ship accidents investigated, following burnt ships. IMO MSC/255(84) and Indonesian
Law (UU) No. 17 of 2008 on Shipping state that the NTSC investigates only a portion
of ship accidents. The purpose of these investigations is to identify potential issues
and assess the problems arising from the incidents, with the aim of improving safety
measures in the future. This highlights the importance of analyzing ship collisions,
which have the highest investigated ratio among the accident types handled by the
NTSC and involve a wide variety of causes. In-depth analysis can lead to the development of effective
solutions for reducing the occurrence of subsequent ship collisions.
In the analysis of ship collision accidents, ship accident modeling plays a funda-
mental role, especially in the modeling of causation probability for estimating the
frequency of ship accidents. The estimation of ship accident frequency can be carried
out using a calculation model provided as follows [1]:
$$\lambda_{col} = P_c \cdot N_G. \qquad (1)$$
This equation defines the frequency of ship collisions in terms of two components:
geometric probability ($N_G$) and causation probability ($P_c$) [2]. Geometric probability
represents the number of candidate ships that have the potential to experience a
collision, while causation probability represents the likelihood of a ship failing to
avoid a collision when it is on a collision course. The value of causation probability
is influenced by various factors that are represented as interconnected nodes, leading
to the root cause. Some of these nodes include human error, weather conditions, and
technical failures [3].
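Equation (1) is a simple product of the two components. The sketch below only illustrates the arithmetic with made-up values for $P_c$ and $N_G$; it is not the study's data:

```python
def collision_frequency(p_c: float, n_g: float) -> float:
    """Eq. (1): expected collision frequency = causation probability x
    number of geometric collision candidates."""
    return p_c * n_g

# Illustrative values only: 120 collision candidates per year on a route,
# causation probability 1.5e-4 (hypothetical figure).
print(collision_frequency(1.5e-4, 120))   # -> 0.018 collisions per year
```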
Based on the concept, each water area has a unique causation probability value,
which distinguishes it from other water areas. Unlike previous studies that used
default causation probability values [1], this research aims to determine causation
probability values based on real conditions using the Bayesian network method.
According to the literature study conducted, [4] states that the Bayesian network is
the best method for estimating the causation probability value.
Several previous studies have focused on examining cases related to the risk of
ship collisions. For instance, studies conducted by researchers [5–7] involved ship
collision risk analysis using default causation probability (Pc) values from IALA
Waterway Risk Assessment Program (IWRAP). These studies indicated that the Pc
results obtained from the analysis in the Akashi and Dover Straits could be gener-
alized and applied to all waters worldwide. In the analysis of ship collisions in the
Sunda Strait, [8–10] employed the fault tree analysis (FTA) modeling method to determine
Pc values. However, the modeling in these studies was based on factors and oppor-
tunities derived from references rather than historical data specific to accidents in
the Sunda Strait. Furthermore, other studies such as those conducted by researchers
[11, 12] employed the Bayesian network method for modeling causation probability.
However, these studies utilized a simplified approach with limited reference data.
Therefore, there is a need to develop and update the conditions based on the latest
available data and establish a more detailed network, which will be undertaken in
this current study.
For this reason, in this paper, Bayesian network is utilized to model the factors
contributing to ship collisions (head-on, overtaking, and crossing) specifically for
Indonesian waters. Consequently, if necessary, appropriate measures can be taken
promptly to ensure safety and minimize potential losses resulting from ship collisions.
2 Literature Review

A Bayesian network rests on Bayes' theorem, which updates the probability of an event $X_p$ once a related event $X_{p+1}$ is observed:

$$P(X_p \mid X_{p+1}) = \frac{P(X_{p+1} \mid X_p)\,P(X_p)}{P(X_{p+1})}$$

where:
$P(X_p \mid X_{p+1})$ = the probability of event $X_p$ occurring given that event $X_{p+1}$ has already happened
$P(X_{p+1} \mid X_p)$ = the probability of event $X_{p+1}$ occurring given that event $X_p$ has already happened
$P(X_p)$ = the probability of event $X_p$
$P(X_{p+1})$ = the probability of event $X_{p+1}$
A Bayesian network is an explicit description of direct dependencies among a
set of variables. This description takes the form of a directed graph and a set of node
probability tables (NPT). The directed graph, or Bayesian network structure, consists
of a set of nodes and edges. Nodes correspond to variables, and edges connect directly
dependent variables. An edge from $X_p$ to $X_{p+1}$ encodes the assumption that there is
a direct causal dependence or influence of $X_p$ on $X_{p+1}$; the node $X_p$ is then referred
to as the parent node of $X_{p+1}$. No cycles occur within the graph (e.g., if there is an
edge from $X_p$ to $X_{p+1}$ and from $X_{p+1}$ to $X_{p+2}$, there is no edge from $X_{p+2}$ back to $X_p$). This
avoids circular reasoning. Each node $X_p$ has an associated probability table
called the node probability table (NPT) of $X_p$ [14].
If there are $p$ random variables $X_1, X_2, X_3, \ldots, X_p$ in a directed acyclic graph with
$p$ numbered nodes, let node $j$ $(1 \le j \le p)$ of the graph be associated with variable $X_j$.
The graph is a Bayesian network representing the variables $X_1, X_2, X_3, \ldots, X_p$ if

$$P(X_1, X_2, X_3, \ldots, X_p) = \prod_{j=1}^{p} P(X_j \mid Parents(X_j)). \qquad (2)$$

The left-hand side is the joint probability distribution (JPD), i.e., the probability of all
events occurring simultaneously.
BN, also known as Bayesian belief network (BBN), is constructed using graphical
structures and Bayes’ theorem. It is a widely accepted formalism for representing
uncertain knowledge (subjective or objective) and for efficient reasoning with it. A
complete BN model comprises both qualitative and quantitative components. The
qualitative component of a BN is a directed acyclic graph, where nodes and edges
represent system variables and their causal dependencies. The quantitative compo-
nent of the BN model is presented through a set of conditional probabilities or
probability distributions for each child node given its parent nodes in the network
[15].
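As a concrete illustration of the factorization in Eq. (2), the following sketch evaluates the joint probability of a tiny three-node network. The node names and CPT values are hypothetical and are not taken from the study's model:

```python
# Tiny Bayesian network: Competence -> Decision -> Collision (hypothetical CPTs).
p_competence = {"good": 0.8, "poor": 0.2}                      # prior node
p_decision = {  # P(Decision | Competence)
    ("good",): {"good": 0.9, "bad": 0.1},
    ("poor",): {"good": 0.4, "bad": 0.6},
}
p_collision = {  # P(Collision | Decision)
    ("good",): {"yes": 0.05, "no": 0.95},
    ("bad",):  {"yes": 0.60, "no": 0.40},
}

def joint(competence, decision, collision):
    """Eq. (2): P(X1,...,Xp) = prod_j P(Xj | Parents(Xj))."""
    return (p_competence[competence]
            * p_decision[(competence,)][decision]
            * p_collision[(decision,)][collision])

# P(Collision = yes) by summing the joint over all parent configurations.
p_yes = sum(joint(c, d, "yes") for c in p_competence for d in ("good", "bad"))
print(p_yes)
```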
The holdout validation method randomly divides the data into two categories: training
data and testing data. The training data are used by the Bayesian network algorithm
to construct the classification model, while the testing data are used to evaluate the
model that has been formed. Estimation and prediction are then carried out to assess the
reliability of the classification model built from the training data [16] (Fig. 1).
The holdout validation method is chosen to evaluate the Bayesian network model
that has been formed. Holdout validation is the most common approach for evaluating
machine learning models. However, when the dataset used for training and evaluating
the machine learning model is small, performance metrics that rely on this validation
and testing set composition can be sensitive, and the resulting performance measures
are often unreliable [17].
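A minimal sketch of the holdout split described above, using scikit-learn's train_test_split with the 70:30 scenario; the feature matrix and labels are random placeholders, not the NTSC data:

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Placeholder feature matrix (108 vessels x 30 causal variables) and labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(108, 30))
y = rng.integers(0, 2, size=108)          # 1 = collision, 0 = near miss

# 70% training / 30% testing, as in one of the scenarios reported in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)
```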
Performance evaluation measure (PEM) is a step taken to measure system perfor-
mance. PEM is performed on the training data to evaluate the created model. PEM is
usually depicted in a table called a confusion matrix [18]. The confusion matrix table
describes the performance of a model in a specific manner. Each row of the matrix
represents the actual class of the data, and each column represents the predicted class
of the data (Table 1).
A Bayesian Network for Classifying and Predicting Ship Collision 407
The confusion matrix table consists of values for true positive, false positive, false
negative, and true negative. True positive indicates how many data points are actually
positive in class, and the model also predicts them as positive. True negative indicates
how many data points are actually negative in class, and the model predicts them as
negative. False positive indicates how many data points are actually negative, but the
model predicts them as positive. False negative indicates how many data points are
actually positive, but the model predicts them as negative. Through the two criteria
of the confusion matrix, we can obtain [16]:
1. Accuracy is the ratio of correct predictions to the total number of observations,
representing how often the model classifies correctly. The formula for accuracy is
shown in Eq. 3.

$$\text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{True Negative} + \text{False Positive} + \text{False Negative}}. \qquad (3)$$

2. Sensitivity or recall is the ratio of correct positive predictions to the actual positives.
Sensitivity explains how often the model predicts positive when the actual class
is positive. The formula for sensitivity is shown in Eq. 4.

$$\text{Sensitivity} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}. \qquad (4)$$
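Both measures follow directly from the confusion-matrix counts. A minimal sketch of Eqs. (3) and (4) with illustrative counts (not the study's confusion matrix):

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (3): proportion of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    """Eq. (4): proportion of actual positives predicted as positive (recall)."""
    return tp / (tp + fn)

# Illustrative counts only.
tp, tn, fp, fn = 20, 10, 1, 1
print(f"accuracy = {accuracy(tp, tn, fp, fn):.4f}")
print(f"sensitivity = {sensitivity(tp, fn):.4f}")
```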
3 Methodology
The data used includes chronological information on each ship collision incident
documented in the NTSC investigation reports and the court decisions from the
Shipping Court from 2007 to 2021. The data used consists of 44 investigated ship
collision incidents involving a total of 108 vessels. The target variable is “collision,”
and there are 30 causal variables associated with it. These causal variables cover
human error, technical failure, and environmental factors. The causal
variables are good seamanship, crew competence, uncertain, ship communication,
understanding of ship characteristics, understanding of water, inexperience, decision
making, crew health, number of crew, dual task, fatigue, situational awareness, visual
observation, daylight, master, understanding of navigation and communication sign,
navigation and communication utilized, navigation and communication equipment,
preventive timing, pilot, maneuverability, technical failure, ship type, water type,
wind velocity, wave height, ship length, ship breadth, ship draft, ship speed. The
following are the steps in this research:
a. Build a Bayesian network structure by estimating the sequence of events leading
to ship collisions.
b. Determine the parameters (prior probabilities) once the Bayesian network structure
is formed.
c. Determine the parameters (prior probabilities) of each cause of ship collisions.
The prior probability is the degree of belief in an event that is used when no other
information is available to assess its probability; once new information becomes
known, this probability must be updated.
d. Create a conditional probability table (CPT).
Conditional probability is the probability of an event B given that event A has
occurred. A table that contains these probabilities for every possible value of A and B
is called a conditional probability table (CPT).
e. Create a joint probability distribution (JPD).
The joint probability distribution gives the probability of co-occurrence for all
possible combinations of values of variables A and B. Accordingly, the joint
distribution for a cause of accident is obtained by multiplying the conditional
probability by the prior probability.
f. Calculate the posterior probability.
The posterior probability is calculated from the joint probability distribution
obtained above and is then used to compute the probability value of each cause
of the accident.
g. Model validation.
Validation uses the holdout method, in which the data are divided into training
and testing sets. The training data are used to build the model, while the testing
data are used to validate the model that has been formed. The training-to-testing
composition scenarios are 60:40, 70:30, 80:20, and 90:10.
h. Sensitivity Analysis
In this particular case, the research focuses on analyzing 44 ship collision incident
chronologies that have occurred in Indonesian waters, as documented in the investi-
gation reports of the NTSC and the trial decisions of the Shipping Court. The factors
identified in these reports serve as the basis for constructing the Bayesian network
model. The relationships between these factors are determined based on the chrono-
logical sequence of events influenced by the predetermined factors. The Bayesian
network model structure, depicting the nodes representing these factors, is illustrated
in Fig. 3.
Prior nodes are factors or variables that do not have antecedent factors. In the
Bayesian network model used in this study, there are several nodes that serve as prior
nodes. These nodes include "daylight," "master," "Navcom equipment," "number
of crew," "inexperience," "crew health," "crew competence," "ship communication,"
"pilot," "understanding of Navcom sign," "maneuverability," and "technical failure."
In the Bayesian network model, a child node is a node whose probability value
depends on one or more parent nodes. To calculate the probability at a child node,
Bayes’ theorem is utilized, as described previously.
In this study, the joint probability calculations are conducted once all the indi-
vidual probabilities are obtained. The software tool “GeNIe” is utilized to assist in
these joint probability calculations. Additionally, “GeNIe” is also used to construct
the structure of the Bayesian network, allowing for the visualization of the joint prob-
ability distributions for each node. Figure 3 illustrates the structure of the Bayesian
network, which is used to classify the probability of a ship experiencing a collision
or surviving a collision (near miss). The “GeNIe” software enables the display of the
joint probability distributions for the nodes within the network, aiding in the analysis
and inference process.
The sensitivity analysis in this research is conducted using the software "GeNIe." It
involves analyzing the sensitivity of the model based on significant alterations in the
probability of specific events. This analysis is performed by simulating the effects of
a 100% change in the probability of these events. The sensitivity analysis results are
presented in the form of a tornado diagram, which displays the top ten factors that
contribute the most to the increase or decrease in the probability of the target node
(“Collision = Collision”) (Fig. 4).
The sensitivity analysis results indicate that when the conditional probability of
"collision = collision" given "good seamanship = no," "decision making = bad,"
"maneuverability = good," "technical failure = no," and "preventive timing = bad"
is lowered to 0%, the causation probability decreases to 0.426, a difference of 0.079.
This means that good maneuverability of the ship and the absence of technical failure
are important factors in decreasing the causation probability, even when preventive
timing is bad, seamanship is not good, and decision making is bad. There are also
other circumstances that lead to lower causation probability values; "crew health = fit,"
for example, has a significant impact on lowering the causation probability compared
with other circumstances. If its probability is decreased from 93 to 0%, the causation
probability decreases to 0.564, a difference of 0.071. The overall results of the sensitivity analysis
for head-on collisions are presented in the tornado diagram shown in Fig. 2. This
diagram visually represents the top ten factors that contribute the most to the increase
or decrease in the probability of the target node (“collision = collision”) for head-on
collisions. It provides valuable insights into the relative importance of these factors
and their impact on the occurrence of head-on collisions.
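The idea behind such a tornado-style sensitivity analysis can be sketched by sweeping one parent's probability while holding the rest of a hypothetical network fixed and recording the change in the target probability; the structure and numbers below are invented for illustration only:

```python
# Hypothetical two-parent model P(collision | seamanship, timing), used only to
# illustrate the sweep behind a tornado diagram (values are not the study's).
p_timing_bad = 0.3
p_coll = {  # P(collision = yes | seamanship, timing)
    ("good", "good"): 0.02, ("good", "bad"): 0.20,
    ("poor", "good"): 0.30, ("poor", "bad"): 0.70,
}

def p_collision(p_seamanship_poor):
    total = 0.0
    for s, ps in (("poor", p_seamanship_poor), ("good", 1 - p_seamanship_poor)):
        for t, pt in (("bad", p_timing_bad), ("good", 1 - p_timing_bad)):
            total += ps * pt * p_coll[(s, t)]
    return total

baseline = p_collision(0.25)
for p in (0.0, 0.25, 0.5, 1.0):        # sweep P(seamanship = poor)
    print(f"P(poor seamanship)={p:.2f} -> P(collision)={p_collision(p):.3f} "
          f"(delta {p_collision(p) - baseline:+.3f})")
```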
5 Conclusion
Based on the analysis conducted on Bayesian network modeling for predicting ship-
ship collision in Indonesia, here is a summary of the key findings:
1. Factors Influencing Ship Collisions: The analysis considered 31 factors that influ-
ence the occurrence of ship collisions. These factors were identified based on the
investigation report by NTSC and the results of the Shipping Court trial. Some of
the factors included daylight, master, navigational equipment, number of crew,
crew competence, ship communication, pilot, and technical failure, among others.
From all nodes, “maneuverability,” “technical failure,” “crew health,” and “crew
competence” are the most significant factors influencing collisions.
2. Bayesian Network Modeling: The results of the Bayesian network modeling
showed that the probability of a ship experiencing a potential collision condition
was determined to be 63%, while the probability of experiencing a collision or
near miss was 37%. The accuracy of the model was measured to be 93.75%.
These findings provide valuable insights into the factors influencing ship colli-
sions, the probabilities associated with collision occurrences, and the frequency of
collisions in the Sunda Strait. They can be used to inform decision-making processes,
improve safety measures, and minimize the risks of ship collisions in the designated
area.
References
11. Sukma RA, Handani DW, Nugroho TF (2021) Risk assessment of ship collision on FSO
Pertamina Abherka and oil spill modelling due to structural damage
12. Mulyadi Y, Kobayashi E, Wakabayashi N, Pitana T, Wahyudi (2014) Development of ship
sinking frequency model over subsea pipeline for Madura strait using AIS data. WMU J
Maritime Affairs 13(1):43–594. https://doi.org/10.1007/s13437-013-0049-2
13. Usadha NR (2017) Implementasi Bayesian Network Untuk Perhitungan Probabilitas Pada
Penilaian Risiko Pipa Bawah Laut oleh Faktor Kapal. Limits: J Math Appl 14(1):61. https://
doi.org/10.12962/limits.v14i1.2250
14. Fenton N, Neil M (2013) Risk assessment and decision analysis with Bayesian networks. CRC
Press
15. Wu Z, Shen Y, Wang H, Wu M (2019) Assessing urban flood disaster risk using Bayesian
network model and GIS applications. Geomat Nat Haz Risk 10(1):2163–2184. https://doi.org/
10.1080/19475705.2019.1685010
16. Hermawan H, Yoannita Y (2018) Komparasi metode evaluasi pada credit scoring data mining.
1:22–25
17. Maleki F, Muthukrishnan N, Ovens K, Reinhold C, Forghani R (2020) Machine learning
algorithm validation. Neuroimaging Clin N Am 30(4):433–445. https://doi.org/10.1016/j.nic.
2020.08.004
18. Imron A (2019) Analisis sentimen terhadap tempat wisata di kabupaten rembang menggunakan
metode
19. Akbar A, Supono RA (2022) Prediksi kelancaran piutang pelanggan pada PT. Citra Ina feed-
mill dengan menggunakan algoritma naïve bayes dan K-Nearest neighbors. Jurnal Ilmiah
Edunomika 6(1):558. https://doi.org/10.29040/jie.v6i1.4692
Outlier Detection in Simultaneous
Equations with Panel Data
1 Introduction
$$y = X\beta + \varepsilon, \qquad (1)$$

The corresponding ordinary least squares residual can be written as:

$$r = y - \hat{y} = y - X(X'X)^{-1}X'y = y - Hy = (I - H)y$$
The panel data model takes the form $y_{it} = x_{it}\beta + d_i'\alpha + \varepsilon_{it}$, where index $i$ represents the
individual (cross-section unit), index $t$ represents the time period, $x_{it}$ is a $1 \times p$ row
vector of regressors that does not include a constant, $d_i'\alpha$ is the heterogeneity or
individual effect, and $d_i$ is a $1 \times n$ row vector containing the constant and group-specific
variables. Outlier detection in the proposed panel data model uses a balanced panel with
fixed effects. In stacked form the fixed effect model can be written as:

$$y = X\beta + D\alpha + \varepsilon, \qquad (4)$$
$$\begin{aligned}
r &= y - \hat{y} \\
  &= y - (X\hat{\beta} + D\hat{\alpha}) \\
  &= y - \left( X\left[X'M_D X\right]^{-1} X'M_D\, y + D(D'D)^{-1}D'\left(y - X\left[X'M_D X\right]^{-1} X'M_D\, y\right) \right) \\
  &= y - (Ay + B(y - Ay)) \\
  &= y - (A + B - BA)y \\
  &= (I - (A + B - BA))y \\
  &= (I - P)y
\end{aligned}$$

with

$$A = X\left[X'M_D X\right]^{-1} X'M_D, \quad B = D(D'D)^{-1}D', \quad P = A + B - BA.$$
$$t_{it} = \frac{r_{it}}{\hat{\sigma}\sqrt{1 - p_{it,it}}},$$

where $r_{it}$ is the residual for observation unit $it$, $\hat{\sigma}^2 = r'r/(nT - k)$, and $p_{it,it}$ is the
corresponding diagonal element of matrix $P$. The observation is detected as an outlier if
$|t_{it}| > t_{\gamma/2,\, nT-k}$.
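A direct, small-scale transcription of the ISR formulas above is sketched below; it builds the projection matrices explicitly, so it is only practical for small panels, and the toy data are not the ACFTA dataset used later in the paper:

```python
import numpy as np
from scipy import stats

def isr_panel_fixed_effects(y, X, D, alpha=0.05):
    """Internally studentized residuals for the LSDV fixed-effects model,
    transcribing the matrices A, B, P defined in the text."""
    nT, k = X.shape[0], X.shape[1] + D.shape[1]
    M_D = np.eye(nT) - D @ np.linalg.inv(D.T @ D) @ D.T
    A = X @ np.linalg.inv(X.T @ M_D @ X) @ X.T @ M_D
    B = D @ np.linalg.inv(D.T @ D) @ D.T
    P = A + B - B @ A
    r = (np.eye(nT) - P) @ y
    sigma2 = (r @ r) / (nT - k)
    t = r / np.sqrt(sigma2 * (1.0 - np.diag(P)))
    cutoff = stats.t.ppf(1 - alpha / 2, df=nT - k)
    return t, np.abs(t) > cutoff          # studentized residuals, outlier flags

# Toy balanced panel: n = 4 units, T = 5 periods, 2 regressors (invented data).
rng = np.random.default_rng(1)
n, T = 4, 5
X = rng.normal(size=(n * T, 2))
D = np.kron(np.eye(n), np.ones((T, 1)))   # unit dummies
y = X @ np.array([1.5, -0.7]) + D @ rng.normal(size=n) + rng.normal(size=n * T)
t_it, flags = isr_panel_fixed_effects(y, X, D)
print(np.round(t_it, 2), flags.sum())
```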
The panel data simultaneous equation model is a simultaneous equation model with
panel data. The testing procedure is the same as ordinary simultaneous equations,
only adding panel data effects. The lth panel simultaneous equation with fixed effects
can be written as follows [12]:
$$y_l = Y_l\tau_l + X_l\beta_l + D\alpha_l + \varepsilon_l, \quad l = 1, 2, \ldots, L \qquad (5)$$

with $\varepsilon_l' = (\varepsilon_{l11}, \ldots, \varepsilon_{lnT})$, $\alpha_l' = (\alpha_{l1}, \ldots, \alpha_{ln}) \otimes 1_T'$, and $1_T$ a vector of ones of size $T \times 1$.
Each model must be just identified or over-identified. There are various methods for
parameter estimation in simultaneous equations. This study is limited to Instrumental
Variables (IV) and Two-Stage Least Squares (2SLS). Estimation method uses IV if
the model is just identified, and 2SLS if the model is over-identified.
Equation (5) can be simplified to $y_l = X_l^{*}\beta_l^{*} + D\alpha_l + \varepsilon_l$, where $X_l^{*} = [Y_l \;\; X_l]$ and
$\beta_l^{*} = (\tau_l', \beta_l')'$. When the model is just identified, the IV estimator of $\beta_l^{*}$ is
$\hat{\beta}_{l(IV)}^{*} = (Z_l'M_D X_l^{*})^{-1} Z_l'M_D\, y_l$, whereas when the model is over-identified, the 2SLS
estimator of $\beta_l^{*}$ can be written as:

$$\hat{\beta}_{l(2SLS)}^{*} = \left[X_l^{*\prime} Z_l (Z_l'Z_l)^{-1} M_D Z_l' X_l^{*}\right]^{-1} \left[X_l^{*\prime} Z_l (Z_l'Z_l)^{-1} M_D Z_l' y_l\right],$$
$$\hat{\alpha} = (D_l'D_l)^{-1} D_l'\,(y_l - X_l^{*}\hat{\beta}_{l(2SLS)}^{*}).$$
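For intuition, a generic 2SLS estimator can be sketched in a few lines of numpy. Note that this is the standard cross-sectional form $\beta = (X^{*\prime}P_Z X^{*})^{-1} X^{*\prime}P_Z y$; for the panel case above, $y$, $X^{*}$ and $Z$ would first be transformed with the fixed-effects projection $M_D$. The data below are simulated for illustration only:

```python
import numpy as np

def two_stage_least_squares(y, X_star, Z):
    """Generic 2SLS: beta = [X*'Pz X*]^{-1} X*'Pz y with Pz = Z(Z'Z)^{-1}Z'."""
    Pz = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
    return np.linalg.solve(X_star.T @ Pz @ X_star, X_star.T @ Pz @ y)

# Toy data: one endogenous regressor instrumented by two exogenous instruments.
rng = np.random.default_rng(2)
n = 200
z1, z2 = rng.normal(size=n), rng.normal(size=n)
u = rng.normal(size=n)
x_endog = 0.8 * z1 - 0.5 * z2 + 0.7 * u + rng.normal(size=n)  # correlated with error
y = 2.0 + 1.5 * x_endog + u
X_star = np.column_stack([np.ones(n), x_endog])
Z = np.column_stack([np.ones(n), z1, z2])
print(two_stage_least_squares(y, X_star, Z))   # close to [2.0, 1.5]
```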
In the case where the model is just identified, the residual of the model can be written as:

$$\begin{aligned}
r_l &= y_l - \hat{y}_l \\
    &= y_l - (X_l^{*}\hat{\beta}_{l(IV)}^{*} + D\hat{\alpha}) \\
    &= y_l - \left( X_l^{*}(Z_l'M_D X_l^{*})^{-1} Z_l'M_D\, y_l + D_l(D_l'D_l)^{-1}D_l'\left(y_l - X_l^{*}(Z_l'M_D X_l^{*})^{-1} Z_l'M_D\, y_l\right) \right) \\
    &= y - (My + N(y - My)) \\
    &= y - (M + N - NM)y \\
    &= (I - (M + N - NM))y \\
    &= (I - Q)y
\end{aligned}$$

with

$$M = X_l^{*}(Z_l'M_D X_l^{*})^{-1} Z_l'M_D, \quad N = D_l(D_l'D_l)^{-1}D_l', \quad Q = M + N - NM.$$
Analogous to the previous panel data model, the outlier observation in the simul-
taneous equation can be detected by ISR. The ISR for simultaneous equation with
panel data model can be written as:
$$t_{it} = \frac{r_{it}}{\hat{\sigma}\sqrt{1 - q_{it,it}}},$$

where $r_{it}$ is the residual for observation unit $it$, $\hat{\sigma}^2 = r'r/(nT - k)$, and $q_{it,it}$ is the
corresponding diagonal element of matrix $Q$. The observation is detected as an outlier if
$|t_{it}| > t_{\gamma/2,\, nT-k}$.
The empirical application in this study uses data for the ten member countries of the ASEAN-
China Free Trade Area (ACFTA), i.e., Brunei Darussalam (BRN), Cambodia (KHM),
China (CHN), Indonesia (IDN), Malaysia (MYS), Myanmar (MMR), Philippines
(PHL), Singapore (SGP), Thailand (THA), and Vietnam (VNM) during the period
2007–2021. This research data is sourced from World Development Indicators
(WDI), World Bank. Endogenous variables consist of GDP (y1 ) and FDI (y2 ), while
exogenous variables consist of total exports and imports to GDP or Trade (x 1 ) and
labor force participation rate for 15–64 year olds or Labor (x 2 ). These variables are
useful for economic analysis, policy planning, and evaluation of a country’s economic
performance. The research model specification is as follows:
In Sect. 5.1, both Eqs. (6) and (7) are considered to be independent of each other,
so each is modeled separately to detect the presence of outliers in the panel data
model. In Sect. 5.2, both Eqs. (6) and (7) are considered simultaneous, to detect the
presence of outliers in the simultaneous equation with panel data.
Outlier detection in the panel data models with fixed effects is carried out using the
Least Squares Dummy Variable (LSDV) estimation method.
Model for GDP
Model estimation results for GDP (Eq. 6) can be seen in Table 1. Meanwhile, the
results of outlier detection in the GDP model can be seen in Table 2.
Based on ISR (t it ) with a critical value of t-table (α = 5%, df = 147), there are
eight observation units that are detected as outliers. These outliers occur in the same
country at different times. Observations categorized as outliers are those that are in
the left and right tails of the distribution of the studentized residuals.
Gray [14] proposed the L-R plot to display a case's influence against its leverage.
This study uses the plot of ISR versus leverage to identify high-influence and
high-leverage observations. This can be seen in Fig. 1.
The 107th observation is high leverage but not an outlier. A high-leverage obser-
vation is data that has extreme values in the predictor or outliers on the X-axis.
Fig. 1 Plot studentized residual versus leverage on GDP’s model. Observations detected as outliers
are observations that are outside the dashed red boundary line
This high leverage can be ignored because it does not affect the model estimation.
Therefore, high leverage is not classified as an outlier.
Model for FDI
Model estimation results for FDI (Eq. 7) can be seen in Table 3. Meanwhile, the
results of outlier detection in the FDI model can be seen in Table 4.
Based on ISR (t it ) with a critical value of t-table (α = 5%, df = 148), there are
five observation units that are detected as outliers. Outliers spread across multiple
countries and years. This result showed that observations categorized as outliers are
those that are in the left tail of the distribution of the studentized residuals.
A plot of studentized residual versus leverage can be seen in Fig. 2. The 107th
observation has both high leverage and high influence and is therefore categorized
as an outlier, emphasizing that it is an influential observation.
Equations (6) and (7) are considered a system of simultaneous equations, with Eq. 6
as the first equation and Eq. 7 as the second equation, so the estimation rule follows
simultaneous equation estimation. The estimation used IV method.
Fig. 2 Plot studentized residual versus leverage on FDI’s model. Observations detected as outliers
are observations that are outside the dashed red boundary line
Identification
Identification of simultaneous equation systems using order conditions, referring to
[15]. These can be seen in Table 5.
Only the second equation can be estimated using IV, while the first equation cannot
be estimated due to a lack of information about the predetermined variables.
Simultaneous Equation Model
The estimation results of the second equation using the IV method can be seen in
Table 6, while the results of outlier detection can be seen in Table 7.
Based on ISR (t it ) with a critical value of t-table (α = 5%, df = 148), there are six
observation units that are detected as outliers. Observations detected as outliers are
part of the observations detected in Tables 2 and 4, except for the 77th observation.
Fig. 3 Plot studentized residual versus leverage on second equation. Observations detected as
outliers are observations that are outside the dashed red boundary line
A plot between studentized residual and leverage can be seen in Fig. 3. The
distribution of leverage is more spread out than in the previous two plots.
6 Discussion
Outlier detection in panel data models has not been done by many researchers, espe-
cially outlier detection in simultaneous equations with panel data. Most researchers
directly use robust estimation to overcome the presence of outliers without detecting
them first. This study adopts [9, 10] for detecting outliers simply using ISR. Ronner
[16] previously detected outliers in simultaneous equations of cross-sectional data
using ESR.
This study uses a dataset of GDP, FDI, and its influencing factors in the ACFTA
region. Modeling this dataset is used for economic analysis, policy planning, and
evaluation of a country’s economic performance. Therefore, outlier detection in this
modeling is very important to ensure that the estimates obtained are not misleading.
Initially, outlier detection is performed on the panel data equations separately without
426 S. Ismadyaliana et al.
considering the two equations to be interconnected. The results of the initial stage of
detection indicate that there are eight outlier observations in the GDP model and five
outlier observations in the FDI model. Outliers spread among individuals at certain
times. Outliers in time and outliers in individuals are both mutually independent.
Outlier detection with ISR mainly considers the aspect of influence observation.
Observations detected purely as high leverage are not categorized as outliers, but
both high influence and high leverage are classified as outliers. Another advantage
of outlier detection with ISR is that it can detect more than one outlier at once. ISR is
based on residuals, so outlier observations will be different when done with different
types of modeling.
In the end, the two equations are considered simultaneous equations, and then
outlier detection is performed. The results of outlier detection in the simultaneous
equation obtained six observations that were detected as outliers. The outlier obser-
vations on the simultaneous equations are mostly part of the outlier observations on
the equations detected separately previously.
7 Conclusions
This study aims to detect outliers in the panel data model and in simultaneous equations
with panel data in a simple way using ISR, whose calculation is based on the residual
values. The results indicate that the observations flagged as outliers vary with the type
of modeling: each model yields its own residuals, so the observations detected as
outliers also differ. The outlier detection results for the simultaneous equations show
that the outlier observations are a subset of the outliers found in the separate equations
that compose them. After detecting outliers, the potential impact of the outliers should
first be determined by comparing results with and without them; the handling strategy
then depends on the nature of the outliers, one option being a robust estimation method.
A future study can determine the best outlier handling strategy.
Acknowledgements The authors are grateful to the Ministry of Education, Culture, Research
and Technology Indonesia under the PPS-PDD grant, BPS-Statistics Indonesia, and ITS-Surabaya
Indonesia, which support this research.
References
1. Barnett V, Lewis T (1995) Outliers in statistical data. J Oper Res Soc 46:1034
2. Hawkins DM (1980) Identification of outliers. Chapman & Hall. https://doi.org/10.1007/978-
94-015-3994-4
3. Hair JF, Black WC, Babin BJ, Anderson RE (2018) Multivariate data analysis, 8th edn
4. Chandola V, Banerjee A, Kumar V (2009) Anomaly detection. Comput Mater Contin 14:1–22
5. Hair JF, Black WC, Babin BJ, Anderson RE (2009) Multivariate data analysis, 7th edn
Outlier Detection in Simultaneous Equations with Panel Data 427
6. Lyu Y (2015) Detection of outliers in panel data of intervention effects model based on variance
of remainder disturbance. Math Probl Eng 2015
7. Edgeworth F (1887) On discordant observations. Lond Edinb Dublin Philos Mag J Sci 23:364–
375
8. Thompson R (1985) A note on restricted maximum likelihood estimation with an alternative
outlier model. J R Stat Soc Ser B 47:53–55
9. Rousseeuw PJ, Leroy AM (1987) Robust regression and outlier detection. J R Stat Soc Series
A (Stat Soc) 152
10. Cook RD, Weisberg S (1982) Residuals and influence in regression. Chapman and Hall. https://
doi.org/10.1016/s0169-4332(96)90178-6
11. Greene WH (2012) Econometric analysis, 7th edn. Pearson UK
12. Cornwell C, Schmidt P, Wyhowski D (1992) Simultaneous equations and panel data. J Econom
51:151–181
13. Johnston J, Dinardo J (1997) Econometric methods. McGraw-Hill
14. Gray JB (1986) A simple graphic for assessing influence in regression. J Stat Comput Simul
24:121–134
15. Koutsoyiannis A (1977) Theory of econometrics: an introductory exposition of econometric
methods
16. Ronner AE (1982) Part I: proceedings in computational statistics 1:12. https://doi.org/10.1007/
978-3-642-51461-6
Assessing Departmental Efficiency
at Sepuluh Nopember Institute
of Technology: A Comparative Study
Using Classical and Advanced Data
Envelopment Analysis
1 Introduction
countries, especially in the field of education, such as Australia (see [8]), Taiwan
(see [9]), or China (see [10]). One study that used DEA to measure efficiency in
Indonesian schools is [11], which reports that Muhammadiyah Elementary Schools in
Kulon Progo Regency show a greater share of efficient performance (highest 65%,
lowest 20%) than those in Bantul Regency (highest 33%, lowest 10%). Another study
examined the efficiency of world-class universities in Indonesia and found the most
efficient universities to be the University of Indonesia and Diponegoro University [12].
Furthermore, [13] employed fuzzy DEA to measure the relative efficiency of Indonesia's
top tertiary institutions in 2015; the results show that only half of the top Indonesian
universities perform at their best, indicating that improvements are needed. At the
department level, efficiency measurement was previously carried out by [14] with a
comprehensive explanation and benchmark; however, that work did not consider
different schemes, such as teaching or research, and used only classical DEA models.
One of the drawbacks of the classical model is the assumption that adding input will
increase output in the same proportion; moreover, the ranking of the most efficient
departments cannot be identified. Consequently, additional or advanced models were
developed from the classic CCR and BCC models. An advanced model can handle data
under broader assumptions than the classical model and can also improve the
discrimination among DMUs [7]. Thus, this study compares the sensitivity of the
classical and advanced DEA models, namely the slacks-based measure (SBM) and the
super-efficiency DEA, to measure the efficiency of the departments at ITS in terms of
three schemes: teaching and learning, research and development, and both combined
(overall indicator).
2 Methodology
2.1 Efficiency
Efficiency is the comparison between the actual output and the frontier output. The
basis for measuring efficiency is the ratio of output to input. Efficiency can be divided
into technical (output-oriented) efficiency, the ability of a system to achieve maximum
output from a given set of inputs, and allocative (input-oriented) efficiency, the ability
of a system to use minimum input to produce a certain amount of output [8].
Data envelopment analysis (DEA) was first introduced by Charnes, Cooper, and Rhodes
in 1978. DEA is a mathematical programming technique based on linear programming
that is used to evaluate the efficiency of a decision-making unit (work unit) responsible
for using several inputs to obtain targeted outputs. One of the advantages of DEA is
that it can measure efficiency with more than one input and output. However, the
disadvantages of DEA are that the approach is very sensitive to extreme observations,
hypothesis testing is difficult to do, and it only measures relative efficiency, not absolute
efficiency. The assumption used is that there is no random error; consequently, any
deviation from the frontier is treated as inefficiency [3].
However, in real-life scenarios, efficiency is defined by multiple inputs and outputs.
DEA overcomes the shortcoming of other methods by measuring efficiency as a ratio
of weighted outputs to weighted inputs. Efficiency can be rewritten by incorporating a
set of decision-making units (DMUs), as shown in Eq. 1. Each DMU-$j$, $j = 1, 2, \ldots, n$,
has input variables $x$ and output variables $y$ with weights $v_i$ and $\mu_r$, respectively, where
$i = 1, 2, \ldots, m$ indexes the input variables and $r = 1, 2, \ldots, s$ indexes the output
variables. As such, the efficiency score of each DMU can be expressed as the ratio of
output to input:

$$e_j = \frac{\sum_{r=1}^{s} \mu_r y_{rj}}{\sum_{i=1}^{m} v_i x_{ij}} \qquad (1)$$

In theory, each DMU is free to choose the weights that are most advantageous in
calculating its relative efficiency, provided that the same weights do not result in
efficiency scores greater than 1 for any DMU.
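For a single DMU and a fixed set of weights, Eq. (1) is just a ratio of weighted sums; the sketch below uses invented figures (not the ITS data) purely to show the computation:

```python
import numpy as np

def efficiency_ratio(outputs, inputs, mu, v):
    """Eq. (1): weighted sum of outputs over weighted sum of inputs for one DMU."""
    return float(np.dot(mu, outputs) / np.dot(v, inputs))

# Hypothetical department (invented figures):
inputs  = np.array([40, 25, 10, 900])          # lecturers, doctoral lecturers, staff, students
outputs = np.array([210, 0.05])                # graduates, championship ratio
v  = np.array([0.01, 0.005, 0.002, 0.0005])    # chosen input weights
mu = np.array([0.002, 1.0])                    # chosen output weights
print(efficiency_ratio(outputs, inputs, mu, v))
```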
There are two classic DEA models, namely CCR and BCC. These models represent the
early stage of DEA and form the basis for other DEA models [7].
CCR Model
The Charnes, Cooper, and Rhodes (CCR) model uses the assumption of constant
returns to scale, in which the ratio of output to input values is constant, i.e., an addition
to the inputs increases the output proportionally. Another assumption of this model is
that each DMU operates at an optimal scale [11]. Solving Eq. (1) normally requires
transformation into the following CCR model [4]. The multiplier output-oriented model
can be written as follows:

$$\min \; q = \sum_{i=1}^{m} v_i x_{io}$$

subject to:

$$\sum_{i=1}^{m} v_i x_{ij} - \sum_{r=1}^{s} \mu_r y_{rj} \ge 0, \qquad \sum_{r=1}^{s} \mu_r y_{ro} = 1, \qquad \mu_r, v_i > 0 \qquad (2)$$
The equivalent output-oriented envelopment form is:

$$\max \; \phi$$

subject to

$$\sum_{j=1}^{n} \lambda_j y_{rj} \ge \phi\, y_{rk}; \qquad \sum_{j=1}^{n} \lambda_j x_{ij} \le x_{ik}; \qquad \lambda_j \ge 0 \qquad (3)$$
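A minimal sketch of solving the output-oriented envelopment problem in Eq. (3) as a linear program with scipy.optimize.linprog; the four-DMU example data are hypothetical and only illustrate the mechanics:

```python
import numpy as np
from scipy.optimize import linprog

def ccr_output_oriented(X, Y, k):
    """Eq. (3) for DMU k: max phi s.t. sum_j lam_j*y_rj >= phi*y_rk and
    sum_j lam_j*x_ij <= x_ik, lam >= 0. X is (m inputs x n DMUs), Y is (s x n)."""
    m, n = X.shape
    s = Y.shape[0]
    c = np.r_[-1.0, np.zeros(n)]               # minimize -phi
    A_out = np.c_[Y[:, [k]], -Y]               # phi*y_rk - sum_j lam_j*y_rj <= 0
    A_in = np.c_[np.zeros((m, 1)), X]          # sum_j lam_j*x_ij <= x_ik
    A_ub = np.vstack([A_out, A_in])
    b_ub = np.r_[np.zeros(s), X[:, k]]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (n + 1))
    return -res.fun                            # phi*; values above 1 indicate output shortfall

# Hypothetical mini example: 2 inputs, 1 output, 4 DMUs (not the ITS data).
X = np.array([[4.0, 7.0, 8.0, 4.0],
              [3.0, 3.0, 1.0, 2.0]])
Y = np.array([[1.0, 1.0, 1.0, 1.0]])
for k in range(4):
    print(f"DMU {k}: phi* = {ccr_output_oriented(X, Y, k):.3f}")
```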
BCC Model
Subsequently, Banker, Charnes, and Cooper extended the CCR-DEA model by assuming
variable returns to scale (VRS). The BCC model distinguishes itself from the CCR model
by adding the constraint $\sum_{j=1}^{n} \lambda_j = 1$.
The CCR model assumes constant returns to scale (CRS), which means that an increase
in input by a certain percentage will proportionally increase output by the same
percentage. On the other hand, the BCC model assumes variable returns to scale (VRS),
which means that an increase in input does not necessarily result in a proportional
increase in output. DMU-$j$ is considered efficient if and only if $\phi^* = 1$.
The classical DEA approach to determining DMU efficiency cannot rank the efficient
units; this is the main drawback of the model. Another weakness is that when the
number of input and output variables is large relative to the number of DMUs, the
sensitivity of the classical DEA model is reduced. To overcome these weaknesses,
researchers extended the basic DEA models to increase their discriminatory power.
Some of the advanced models are as follows [7].
Slacks-Based Measure (SBM)
The slacks-based measure is a model developed from the classic model. Basically,
the classical model assumes that changes in input and output occur proportionally.
However, in a system, the addition of input does not always increase the output
proportionally [15]. The slacks-based measure emphasizes the proportional reduction
of input excess (input slacks) or the proportional addition of output deficiencies
(output slacks) from the previous model which was still inefficient because it had not
yet reached the frontier line. The slack value provides important information about
the efforts that need to be made by an inefficient DMU to achieve efficient status. A
nonzero input slack indicates an excess of input, and a nonzero slack output indicates
a lack of output [16]. Output-oriented SBM based on CCR model can be shown as
follows:
s
1 1 sr+
= max 1 +
φ s r =1 yr k
Subject to
n
λ j xi j + si− = xio
j=1
n
(4)
λ j yr j − sr+ = φyr 0
j=1
λ j , si− , sr+ ≥ 0
Super-Efficiency Model
The DEA super-efficiency model was introduced by Andersen and Petersen. This model
performs very well; efficient DMUs are separated from inefficient DMUs and can be
ranked [7]. The DEA super-efficiency model has been widely used in fields such as
measuring the efficiency of an area, ranking efficient DMUs, and detecting highly
efficient DMUs. In the output-oriented case, the efficiency score of the SBM model is
made to reflect efficiency by using the maximum possible output, with the objective
function subject to:

$$\sum_{j=1}^{n} x_{ij}\lambda_j + s_i^{-} = x_{io}, \qquad \sum_{j=1}^{n} y_{rj}\lambda_j - s_r^{+} = \varphi\, y_{r0} \qquad (5)$$
Table 1 Schemes of efficiency measurement and the combinations of input and output variables

Input variables (all schemes):
- Number of lecturers
- Number of lecturers with doctoral degrees
- Number of staff
- Number of students

Output variables:
- Scheme 1 (Teaching and learning): number of graduates; number of students' national championships/total students
- Scheme 2 (Research and development): total publications in Scopus-indexed international journals or equivalent/total lecturers; cumulative international co-authorship publications/total lecturers
- Scheme 3 (Overall performance): number of graduates; number of students' national championships/total students; total publications in Scopus-indexed international journals or equivalent/total lecturers; cumulative international co-authorship publications/total lecturers; overall performance achievement
3 Research Design
The data used in this research is secondary data obtained from the unit program
management and control of ITS. The data needed is departmental data based on the
variables defined in this study. There are three schemes of efficiency measurement
that will be carried out with different combinations of input and output variables
which are shown in Table 1.
Table 2 displays the statistical analysis of the input and output variables.
Table 2 shows that the Electrical Engineering department has the highest total
number of lecturers (57) and the highest number of doctoral lecturers (43). In contrast,
the Actuarial department has the lowest number of lecturers (9) and the lowest number
of doctoral lecturers (1). The Civil Engineering department has the highest number
of non-academic staff (26), while the Actuarial department has the least (3). The
Actuarial department’s lower staff numbers may be attributed to its status as a new
department. The Electrical Engineering department also has the highest number of
students (1384), while the Information Technology department has the lowest (282).
In terms of graduate numbers, Electrical Engineering leads with 1417 graduates,
while Development Studies has the fewest at 1 due to its recent establishment; its single
graduate completed the program in 3.5 years. Instrumentation Engineering has the
highest national championship ratio at 12.73%, while Geomatics Engineering trails with
0.62%. Chemical Engineering excels in Scopus-indexed international publications per
lecturer at 3.08%, whereas Water Transportation lags at 0.53%. Materials and
Metallurgical Engineering leads in international co-authorship publications at 12.64%,
contrasting with Actuarial at only 0.11%. Biology excels with a performance
achievement of 124%, while Electrical Engineering and Automation trails at 81%.
Overall, the average performance achievement at ITS is 106% with a variance of
0.7%, signaling departments generally surpassing targets.
were employed. The efficiency scores based on teaching and learning performance
are detailed in Table 3.
Table 3 shows the teaching and learning efficiency scores for each department
at ITS where efficient departments’ scores are written in bold. The CCR and BCC
models produce different efficiency scores where the CCR model produces three
efficient departments, namely the Physics Engineering department, the Product
Design department, and the Instrumentation Engineering department. It is note-
worthy that some departments, despite achieving relatively low CCR scores, have
BCC scores that equal one. The BCC model produces 15 efficient departments
such as the Actuarial Science department, Physics Engineering department, Systems
and Industrial Engineering department, Civil Engineering department, Geophysics,
Marine Engineering, Marine Transportation department, Electrical Engineering,
Biomedical Engineering department, Information technology department, Product
Design department, Development Studies department, Civil Infrastructure Engi-
neering department, Industrial Chemical Engineering department, and Instrumenta-
tion Engineering department, where departments that are efficient using CCR model
measurements are also efficient when measured using the BCC model.
Efficiency scores below 1 indicate that departments are inefficient, with lower scores
reflecting less efficiency. For instance, using the CCR model in Table 3, Physics
has an efficiency score of 0.752, while Mathematics scores 0.607. This suggests
Mathematics needs a greater output increase to achieve efficiency. The additional
improvement required for an inefficient department, according to the CCR model,
is calculated using the formula 1/φ − 1, where φ is the efficiency score. Using the
example, Physics needs a 13.5% increase (1/0.752 − 1), and Mathematics requires a
64.7% increase (1/0.607 − 1). On the other hand, the SBM determines not only the
needed output increase but also the input decrease, by subtracting the efficiency score
from 1. In Table 3, the teaching and learning efficiency score for Physics using SBM
is 0.477; the simultaneous optimization of output and input is calculated as 1 − 0.477,
giving 52.3%.
In Table 3, some efficient departments also serve as references for inefficient
departments, with the number of efficient departments indicated in superscript. For
instance, according to the BCC model, the Actuarial Science department can be a
reference for two inefficient departments. However, this does not apply to all effi-
cient departments. For example, although Biomedical Engineering is efficient, it
is not utilized as a reference for inefficient departments as the efficiency score is
written without superscript. Among these 15 efficient departments using the BCC
model, 12 departments comprise the peer group. However, two departments, namely
Development Studies and Civil Engineering, do not appear in the peer group for any
inefficient departments.
Advanced DEA models, specifically the super-efficiency model derived from the
CCR model, consistently yield identical efficiency scores. In contrast, the SBM, also
based on the CCR model, provides unique efficiency scores. The super-efficiency
model distinguishes departments with scores exceeding 1 and facilitates rankings.
In this evaluation, the Department of Instrumentation Engineering stands out as the
most super-efficient, followed by Product Design and Physics Engineering. Notably,
Table 3 (continued)

No. | DMU (department)            | CCR   | BCC   | SBM   | Super
37  | Instrumentation engineering | 1(14) | 1(6)  | 1(9)  | 7.075(15)
38  | Business statistics         | 0.735 | 0.757 | 0.513 | 0.735

Note: In the CCR, BCC, and SBM models, an efficient department is assigned an efficiency score of 1. Conversely, in the super-efficiency model, an efficient department is characterized by a score exceeding 1; a higher score beyond 1 indicates that the department is more efficient than other efficient departments. The superscript denotes the frequency of appearance of each efficient department in the reference set.
previously inefficient departments maintain their scores and rankings in the super-
efficiency assessment. The Product Design and Physics Department have the highest
peer counts across all models, indicating their technical efficiency. Greater peer
counts for a department indicate the degree of its robustness when compared to
other efficient departments [16]. Identifying top-performing departments can guide
performance improvements and assist university managers in setting effective targets.
The efficiency scores for research and development at the ITS departments, using four
input and two output variables (Table 1), are presented in Table 4.
Table 4 displays the research and development efficiency scores for each depart-
ment at ITS, with efficient departments highlighted in bold. Notably, the CCR and
BCC models yield different efficiency scores. The CCR model identifies three effi-
cient departments: Biomedical Engineering, Information Technology, and Instru-
mentation Engineering. In contrast, the BCC model identifies nine efficient depart-
ments, including Chemistry, Biology, Actuarial, Chemical Engineering, Materials
and Metallurgical Engineering, Biomedical, Information Technology, Development
Studies, and Instrumentation Engineering. Interestingly, departments deemed effi-
cient by the CCR model also maintain efficiency when measured using the BCC
model. By the result of this analysis, it is observed that the Product Design and
Physic Engineering which appeared to be efficient in the initial scheme are not
performing well for research and development scheme. Furthermore, the department
of Development Studies is not treated as good practice for this model as it does not
appear in the peer group for any inefficient departments.
According to super-efficiency model with efficiency score greater than 1, the
top three efficient departments, in order, are Instrumentation Engineering, Biomed-
ical Engineering, and Information Technology which means that these departments
perform well in the field concerning research. Conversely, departments with effi-
ciency scores below 1 are deemed inefficient. Moreover, according to all models, the
Information Technology and Biomedical Engineering Department have the highest
Table 4 (continued)

No. | DMU (department)            | CCR   | BCC   | SBM   | Super
37  | Instrumentation engineering | 1(7)  | 1(13) | 1(8)  | 3.202(8)
38  | Business statistics         | 0.445 | 0.514 | 0.110 | 0.445

Note: In the CCR, BCC, and SBM models, an efficient department is assigned an efficiency score of 1. Conversely, in the super-efficiency model, an efficient department is characterized by a score exceeding 1; a higher score beyond 1 indicates that the department is more efficient than other efficient departments. The superscript denotes the frequency of appearance of each efficient department in the reference set.
peer counts. Therefore, they are deemed the most purely technically efficient depart-
ments for research although they only secure third and second place in the super-
efficiency model. The interpretation for the percentage of output shortage to achieve
efficiency for inefficient departments remains consistent with the initial scheme.
Table 5 (continued)

No. | DMU (department)            | CCR   | BCC   | SBM   | Super
37  | Instrumentation engineering | 1(5)  | 1(6)  | 1(8)  | 7.075(9)
38  | Business statistics         | 0.866 | 0.914 | 0.342 | 0.866

Note: In the CCR, BCC, and SBM models, an efficient department is assigned an efficiency score of 1. Conversely, in the super-efficiency model, an efficient department is characterized by a score exceeding 1; a higher score beyond 1 indicates that the department is more efficient than other efficient departments. The superscript denotes the frequency of appearance of each efficient department in the reference set.
The average efficiency scores for the teaching and learning, research and development,
and overall schemes are shown in Table 6.
Table 6 reveals that the average efficiency score for overall assessment at the
ITS Department is significantly higher than the other two schemes. Additionally,
the research and development aspect in most ITS departments needs improvement,
with average scores falling below 0.5 for three out of four models. Furthermore,
Fig. 1 illustrates the sensitivity analysis for each department. The sensitivity for each
department is described by a super-efficiency score for each scheme.
Figure 1 depicts that the overall efficiency score (purple line) and the teaching
and learning efficiency score (orange line) are relatively higher than the research
and development efficiency score (blue line), except for Biomedical Engineering,
Computer Engineering, Development Study, Materials and Metallurgical Engi-
neering, Information Technology, Actuarial, and Geophysical Engineering. This
figure clearly highlights the need for improvement in research and development.
5 Conclusion
This study assesses the performance of academic departments at ITS Surabaya using
various DEA models, aiming to identify strengths in teaching and learning, research
and development, and overall performance. The findings reveal that over 50% of
departments at Sepuluh Nopember Institute of Technology are not yet efficient,
emphasizing the need for enhancements and resource optimization. The overall
performance assessment indicates that only 12 out of 38 departments exhibit satis-
factory efficiency. Many departments require improvements across different activi-
ties, particularly in research work. A significant observation is the underutilization
of academic staff in the Electrical Engineering department for research, develop-
ment, and teaching activities, despite a high number of lecturers and doctoral degree
holders.
The consistently efficient performance of the Instrumentation Engineering Depart-
ment across all measurement schemes and models positions it as a benchmark for
resource management and optimal outputs. Other departments, policymakers, and
management systems can learn from its efficiency. Besides Instrumentation Engi-
neering, peer analysis highlights the Department of Product Design and Physics
Engineering as references for inefficient departments in teaching and learning prac-
tices. Similarly, the Departments of Information Technology and Biomedical Engi-
neering can serve as examples for inefficient departments aiming to improve their
performance through enhanced research practices such as workshops and training
programs.
Mira Meilisa, Bambang Widjanarko Otok, and Jerry Dwi Trijoyo Purnomo
1 Introduction
In the first thousand days of life, a person’s life process begins [1]. It is a crucial time
that affects growth from childhood to adulthood [2]. One growth issue that remains
a challenge for children around the world is stunting. Stunting is a nutritional status
condition characterized by a young individual’s length or height being dispropor-
tionately short in comparison with their age. The measurement follows the WHO child
growth standards, expressed as a deviation in standard deviations from the median [3].
Stunting, a malnutrition condition linked to previous nutrient deficiencies, is classified
as a chronic nutritional issue. Growth and development into maturity will be impeded
if the child's nutritional needs are not met; conversely, when those needs are met and
fully utilized, growth and development proceed without hindrance [4, 5]. The World
Health Organization (WHO) estimated a global stunting prevalence of 22%, or 149.2
million children, in 2020.
Indonesia is the fifth country with the highest stunting rate in the world and second
in Southeast Asia. The prevalence of stunting in Indonesia in 2018 was 30.8%, which
decreased to 22.67% in 2019, 26.92% in 2020, and 24.4% in 2021 [6]. Because stunting
prevalence in Indonesia is still above the WHO threshold of 20%, stunting is a public
health problem that requires further attention. The Indonesian government's national
strategy for stunting prevention includes a commitment to address stunting [7].
Identifying the causes of stunting is one of the efforts to reduce its prevalence.
Examining the contributing factors is one way to lower the incidence of stunting. This
requires a mathematical model that establishes the association between the response
variable and the predictor variables, thereby elucidating the underlying patterns of their
relationship. Such relationships can be characterized by employing regression analysis.
Parametric, semiparametric, and nonparametric methods can be used in regression
analysis. If the form of the regression curve is known, the parametric method can be
utilized; otherwise, the nonparametric method is suggested. Since nonparametric regression
may identify regression curve patterns of unknown shape, it provides a great deal of
flexibility when estimating regression curves. One of the nonparametric regression
models developed by Friedman is the multivariate adaptive regression splines [8]. The
MARS model is an advanced truncated spline and recursive partitioning regression
(RPR) combination. The MARS model can take into consideration interactions and
additive effects between predictor variables. Additionally, categorical and continuous
response data can be utilized with the MARS model. MARS can also process data
with large samples and high dimensionality. MARS selects the model whose knots
give the smallest generalized cross-validation (GCV) value [8].
Considerable academic research has explored extensions and applications of the MARS
model: Poisson regression has been used to modify MARS [9], and bootstrap aggregating
multivariate adaptive regression splines has been applied in an observational study of
diabetes cases [10]. A stepwise approach was used to estimate parameters for tuberculosis
(TBC) cases in Lamongan [11]. Additionally, the causal factors of hemorrhagic fever were
identified in [11, 12], and [13] considered both continuous and categorical responses.
The process of ANOVA decomposition and
importance variable analysis in the context of multivariate adaptive regression splines
is discussed in [14]. The research focuses on the estimation of hourly global solar
radiation in Hong Kong using the MARS technique [15]. Parameter estimation of
spatial error model multivariate adaptive generalized Poisson regression spline [16].
A hybrid model of spatial autoregressive-multivariate adaptive generalized Poisson
regression spline [17]. This research will use the multivariate adaptive regression
splines (MARS) model combined with the fuzzy clustering means (FCM) approach
[18], resulting in a model called multivariate adaptive fuzzy clustering means regres-
sion splines (MAFCMRS) due to patterns in data that were previously unknown.
Furthermore, the heterogeneity of the process is accounted for in the model’s forma-
tion, which continues within the coil. With a degree of membership, data is grouped
into a data center cluster, with each cluster examining the influencing factors. One of
the five provinces in Indonesia with the greatest prevalence rate is Southeast Sulawesi
[6]. As a result, this study will examine the determinants of stunting in the province
of Southeast Sulawesi through the lens of the incident. The selection of the best
cluster uses the validity of the Xie-Beni index [19].
The MARS model is a nonparametric regression model that can be used to relate
response and predictor variables when the data pattern is unknown. MARS is a
combination of truncated splines and recursive partitioning regression [8]. The
MARS model can be written as:
$$
\begin{aligned}
y_i &= f\left(x_{1i}, x_{2i}, \ldots, x_{pi}\right) + \varepsilon_i; \quad i = 1, 2, \ldots, n \\
    &= \alpha_0 + \sum_{m=1}^{M} \alpha_m \prod_{k=1}^{K_m} \left[ s_{km}\left( x_{v(k,m)i} - t_{km} \right) \right]_{+} + \varepsilon_i \qquad (1) \\
    &= \alpha_0 + \sum_{m=1}^{M} \alpha_m B_{mi}(\mathbf{x}, \mathbf{t}) + \varepsilon_i
\end{aligned}
$$
where
$y_i$: response variable
$\alpha_0$: constant basis function parameter
$\alpha_m$: parameter of the $m$-th non-constant basis function
$M$: number of non-constant basis functions
$K_m$: maximum interaction at the $m$-th basis function
$s_{km}$: basis function sign, $+1$ if the knot lies on the right or $-1$ if the knot lies on the left
$x_{v(k,m)i}$: $v$-th predictor variable
$t_{km}$: knot value at the $k$-th interaction and $m$-th basis function
$B_{mi}(\mathbf{x}, \mathbf{t})$: $m$-th basis function
The MARS model is formed by first determining the knots and basis functions of
each predictor variable by plotting each predictor against the response variable. An
optimal number of knots yields a good MARS model, after which the forward and
backward stepwise passes of the modified recursive partitioning algorithm can be
carried out, with the number of knots adjusted to the behavior of the data. The
combination of basis functions with the smallest GCV value constitutes the best
model.
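For illustration, the hinge basis functions in Eq. (1) and the GCV criterion can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation; the effective-parameter count passed to `gcv` is left to the caller, since MARS implementations penalize knots differently.

```python
import numpy as np

def hinge(x, t, sign=1):
    # Truncated (hinge) basis function [s(x - t)]_+ from Eq. (1)
    return np.maximum(sign * (x - t), 0.0)

def gcv(y, y_hat, effective_params):
    # Generalized cross-validation: mean squared error inflated by the
    # effective number of parameters C(M) of the fitted MARS model
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    return np.mean((y - y_hat) ** 2) / (1.0 - effective_params / n) ** 2
```

Candidate models produced by the forward and backward stepwise passes are then compared through their GCV values, and the model with the smallest value is retained.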
FCM is one of the most widely applied clustering methods and can be modified in its application
[20]. FCM was first proposed by Jim Bezdek [21]. It is a technique for clustering
data into clusters with certain membership degrees [22]. Clustering with FCM
produces a list of cluster centers and a membership value of each observation in each
cluster; this information is used to define the membership functions that assign fuzzy
membership values. FCM is an algorithm for clustering a dataset
$X = \{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^p$ into $C$ groups based on the objective function:
$$J(\mathbf{U}, \mathbf{V}) = \sum_{c=1}^{C} \sum_{i=1}^{N} (u_{ci})^{m} \left\| x_i - v_c \right\|^2 \qquad (2)$$
where $x_i$ is the vector of the $i$-th observation ($i = 1, 2, \ldots, N$), $v_c$ is the $c$-th cluster
center, $m$ is the fuzziness exponent, and $u_{ci}$ is the membership value of the $i$-th
observation in the $c$-th cluster, initialized randomly. The membership values of an
object satisfy the condition:

$$\sum_{c=1}^{C} u_{ci} = 1, \quad i = 1, 2, \ldots, N; \; u_{ci} \in [0, 1] \qquad (3)$$
The objective function in Eq. (2) is minimized subject to the condition in Eq. (3). The
cluster centers and memberships are updated iteratively so that the clusters move to
the right locations with the right values:
$$u_{ci} = \left[ \sum_{c^{*}=1}^{C} \left( \frac{\left\| x_i - v_c \right\|^2}{\left\| x_i - v_{c^{*}} \right\|^2} \right)^{\frac{1}{m-1}} \right]^{-1} \qquad (4)$$
The iteration process will be performed until the specified threshold value is
obtained.
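The alternating update of cluster centers and memberships described above can be written compactly in NumPy. The following is a minimal sketch of Eqs. (2)–(4), not the authors' implementation; the initialization and stopping rule are illustrative assumptions.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, max_iter=100, tol=1e-5, seed=0):
    # X: (N, p) data matrix; returns membership matrix U (C, N) and centers V (C, p)
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((n_clusters, N))
    U /= U.sum(axis=0)                                   # columns sum to 1, as in Eq. (3)
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)     # weighted cluster centers
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2) + 1e-12
        a = d2 ** (1.0 / (m - 1.0))
        U_new = 1.0 / (a * (1.0 / a).sum(axis=0, keepdims=True))  # membership update, Eq. (4)
        if np.abs(U_new - U).max() < tol:                # stop once memberships stabilize
            return U_new, V
        U = U_new
    return U, V
```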
This research uses secondary data on stunting prevalence from the Southeast Sulawesi
Provincial Health Office in 2020. The observation units are the 222 sub-districts
within the districts/cities of the province, with the variables listed in Table 1.
Indonesia has the third-highest stunting prevalence in the Southeast Asia region, and
Southeast Sulawesi is among the provinces with a high prevalence, 30.2% in 2020 [6].
Figure 1 shows a scatter plot of each variable used. The figure clearly shows a
nonlinear relationship between the response variable and the predictor variables: the
points are scattered without forming a specific pattern, so one suitable alternative
model is multivariate adaptive fuzzy clustering means regression splines.
Clusters are built using the FCM algorithm. Several scenarios for the number of
clusters are formed and then assessed using the Xie-Beni validity index, with the
results shown in Table 2.
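For reference, the Xie-Beni index divides the compactness term of Eq. (2) by N times the smallest squared separation between cluster centers, so smaller values indicate a better partition. A possible sketch, reusing the membership matrix U and centers V returned by an FCM run, is given below (names are illustrative):

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    # compactness: weighted within-cluster scatter, as in Eq. (2)
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    compactness = ((U ** m) * d2).sum()
    # separation: smallest squared distance between two distinct centers
    C = V.shape[0]
    sep = min(((V[a] - V[b]) ** 2).sum()
              for a in range(C) for b in range(C) if a != b)
    return compactness / (X.shape[0] * sep)
```

The number of clusters giving the smallest index value is retained.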
The smallest Xie-Beni validity value is obtained for the five-cluster scenario, so the
MAFCMRS model is then formed with five clusters. Table 3 summarizes the best
model obtained for each cluster.
Table 3 shows a different number of observations n in each cluster. In terms of
R-squared, every cluster exceeds 70%, and the largest share of observations falls in
cluster 4. The MAFCMRS parameter estimates for cluster 1 in Table 3 are obtained
from the combination BF = 12, MI = 2, and MO = 1, giving the smallest GCV value
of 3.58475 and an R-squared of 84.59%.
Table 3 Summary of the results of the best model building experiments for MAFCMRS

Model | n | GCV | Combination (BF, MI, MO) | R-squared
Cluster 1 | 53 | 3.58475 | 12, 2, 1 | 0.84596
Cluster 2 | 15 | 0.35607 | 12, 3, 1 | 0.99239
Cluster 3 | 44 | 2.89847 | 12, 3, 1 | 0.73666
Cluster 4 | 63 | 7.16807 | 12, 2, 1 | 0.80705
Cluster 5 | 47 | 6.79629 | 8, 2, 1 | 0.85474

Description: generalized cross-validation (GCV), maximum observation (MO), minimum interaction (MI), and basis functions (BF)
The basis function coefficients of the best MAFCMRS model for cluster 1 are
reported in Table 4. All four predictor variables are significant. The variable with an
importance level exceeding 86% is the number of health centers, which indicates
that this variable has an important effect on the prediction of stunting prevalence.
Table 4 Basis function coefficient test for stunting prevalence in Southeast Sulawesi province in cluster 1

Coefficient | Estimate | Std. error | Pr(>|t|)
(Intercept) | 7.40E+00 | 4.53E-16 | <2E-16*
h(x3 - 181) | -2.23E-02 | 1.56E-18 | <2E-16*
h(181 - x3) | 5.57E-02 | 5.90E-18 | <2E-16*
h(x1 - 20) | -2.08E-02 | 8.54E-18 | <2E-16*
h(20 - x1) | -4.42E-01 | 3.44E-17 | <2E-16*
h(x4 - 12) | 7.29E-01 | 5.32E-17 | <2E-16*
h(12 - x4) | 3.62E-01 | 4.56E-17 | <2E-16*
h(181 - x3) * h(x4 - 10) | -4.76E-03 | 4.95E-19 | <2E-16*
h(181 - x3) * h(10 - x4) | -5.17E-05 | 9.00E-19 | <2E-16*
h(x2 - 9) | 1.00E-01 | 3.24E-17 | <2E-16*
h(9 - x2) | -3.11E-01 | 6.16E-17 | <2E-16*

Note: * Significant at level 0.05
The MAFCMRS parameter estimates for cluster 2 in Table 3 are obtained from the
combination BF = 12, MI = 3, and MO = 1, with the smallest GCV value of 0.35607
and an R-squared of 99.24%. The basis function coefficients for cluster 2 are shown
in Table 5. All four predictor variables are significant; the variable with an importance
level exceeding 85% is the number of neonate visits, which indicates that this variable
has an important effect on the prediction of stunting prevalence.
Table 5 Basis function coefficient test for stunting prevalence in Southeast Sulawesi province in cluster 2

Coefficient | Estimate | Std. error | Pr(>|t|)
(Intercept) | 3.74E+01 | 8.10E-15 | <2E-16*
h(x1 - 17) | -3.65E-02 | 2.18E-16 | <2E-16*
h(17 - x1) | -4.43E-01 | 6.84E-16 | <2E-16*
h(x3 - 143) | -9.11E-03 | 4.95E-17 | <2E-16*
h(143 - x3) | 1.49E-01 | 9.05E-17 | <2E-16*
h(x2 - 3) | 1.17E+00 | 1.98E-15 | <2E-16*
h(3 - x2) | -1.09E+00 | 3.44E-15 | <2E-16*
x2 * h(143 - x3) | -7.80E-01 | 9.02E-16 | <2E-16*
x2 * h(143 - x3) * x4 | 5.82E-02 | 6.94E-17 | <2E-16*

Note: * Significant at level 0.05
Table 6 Basis function coefficient test for stunting prevalence in Southeast Sulawesi province in cluster 3

Coefficient | Estimate | Std. error | Pr(>|t|)
(Intercept) | 2.58E+00 | 6.32E-16 | <2E-16*
h(x3 - 84) | -2.08E-02 | 5.05E-18 | <2E-16*
h(84 - x3) | -2.16E-02 | 2.40E-17 | <2E-16*
x1 * h(84 - x3) | -1.41E-02 | 6.94E-18 | <2E-16*
h(x4 - 9) | -9.30E-02 | 1.39E-16 | <2E-16*
h(9 - x4) | 4.23E-01 | 8.54E-17 | <2E-16*
h(x3 - 84) * x4 | 9.47E-04 | 3.22E-19 | <2E-16*
x1 * h(84 - x3) * x4 | 1.56E-03 | 6.10E-19 | <2E-16*

Note: * Significant at level 0.05
The MAFCMRS parameter estimates for cluster 3 in Table 3 are obtained from the
combination BF = 12, MI = 3, and MO = 1, with the smallest GCV value of 2.89847
and an R-squared of 73.67%. The basis function coefficients of the best model for
cluster 3 are presented in Table 6. All four predictor variables are significant; the
variable with an importance level exceeding 89% is the number of babies with low
birth weight, which indicates that this variable has an important effect on the
prediction of stunting prevalence.
The MAFCMRS parameter estimates for cluster 4 in Table 3 are obtained from the
combination BF = 12, MI = 2, and MO = 1, with the smallest GCV value of 7.16807
and an R-squared of 80.7%.
Table 7 Basis function coefficient test for stunting prevalence in Southeast Sulawesi province in cluster 4

Coefficient | Estimate | Std. error | Pr(>|t|)
(Intercept) | 1.81E+01 | 4.51E-15 | <2E-16*
h(x3 - 103) | -2.35E-02 | 1.57E-17 | <2E-16*
h(103 - x3) | 1.98E-01 | 1.38E-16 | <2E-16*
h(x1 - 21) | -1.99E-02 | 5.85E-17 | <2E-16*
h(21 - x1) | -5.66E-01 | 3.34E-16 | <2E-16*
h(103 - x3) * x4 | -3.74E-03 | 9.81E-18 | <2E-16*
h(x4 - 5) | -1.88E-01 | 4.19E-16 | <2E-16*
h(5 - x4) | 6.08E-01 | 9.13E-16 | <2E-16*
x2 * h(x4 - 5) | 2.38E-02 | 2.19E-17 | <2E-16*

Note: * Significant at level 0.05
The basis function coefficients of the MAFCMRS model for cluster 4 are reported
in Table 7. All four predictor variables are significant; the variable with an importance
level exceeding 90% is the number of pregnant women with chronic energy
deficiency (CED), which indicates that this variable has an important effect on the
prediction of stunting prevalence.
The MAFCMRS parameter estimates for cluster 5 in Table 3 are obtained from the
combination BF = 8, MI = 2, and MO = 1, with a GCV value of 6.79629 and an
R-squared of 85.47%. The basis function coefficients of the MAFCMRS model for
cluster 5 are displayed in Table 8.
The interpretation of the basis functions in cluster 5 is as follows:
1. $Bf_1 = h(X_3 - 84)$
The coefficient is significant when the neonate visit ratio exceeds 84. Each additional
unit decreases stunting prevalence by 0.0285, holding the other basis functions
constant.
2. $Bf_6 = h(11 - X_4)$
The coefficient is significant when the number of health centers is less than 11. Each
additional unit increases stunting prevalence by 0.0597, holding the other basis
functions constant.
3. $Bf_8 = X_1 \cdot h(84 - X_3)$
The coefficient is significant for the interaction between the number of pregnant
women with chronic energy deficiency and a neonate visit ratio below 84. Each
additional unit increases the prevalence of stunting by 0.001083, holding the other
basis functions constant.
Table 8 shows that all four predictor variables employed have significant coefficients.
The variable with an importance level over 80% is the number of integrated health
posts ($X_4$). This indicates that the variable has a significant impact on the prediction
of stunting prevalence.
Table 8 Basis function coefficient test for stunting prevalence in Southeast Sulawesi province in cluster 5

Coefficient of basis function | Estimate | Std. error | p-value
(Intercept) | 2.62E+04 | 1.58E-12 | <2E-16*
h(x3 - 84) | -2.09E+01 | 5.94E-15 | <2E-16*
h(84 - x3) | 2.19E+02 | 5.70E-14 | <2E-16*
h(x1 - 15) | 7.18E+01 | 5.02E-14 | <2E-16*
h(15 - x1) | -8.84E+02 | 2.10E-13 | <2E-16*
h(x4 - 11) | 1.68E+02 | 1.27E-13 | <2E-16*
h(11 - x4) | 5.97E+02 | 2.78E-13 | <2E-16*
x1 * h(11 - x4) | -1.17E+01 | 6.14E-15 | <2E-16*
x1 * h(84 - x3) | -1.08E+00 | 9.51E-16 | <2E-16*

Note: * Significant at level 0.05
Figure 2 maps the clusters formed, selected on the basis of the Xie-Beni validity
index values for the five-cluster solution. The comparison between the actual values
and the values predicted by the MAFCMRS model for each cluster is shown in
Fig. 3, which indicates no substantial difference between the actual and predicted
values of the MAFCMRS model.
Fig. 3 Comparison of the actual response value with the predicted response value of each cluster (a cluster 1, b cluster 2, c cluster 3, d cluster 4, e cluster 5)
7 Conclusion
The MAFCMRS model is a nonparametric regression model that combines the
multivariate adaptive regression splines model with fuzzy clustering means. Its
formation accounts for heterogeneity in the data while remaining continuous at the
knots. The cluster validity analysis showed that five clusters gave the best result
according to the Xie-Beni validity index. The best MAFCMRS model is formed by
a combination of BF, MI, and MO that differs across clusters. The coefficient tests
of the MAFCMRS model revealed that the four predictor variables involved, namely
the number of pregnant women with chronic energy deficiency (CED), the number
of babies with low birth weight, the number of neonate visits, and the number of
health centers, have a significant effect on the prevalence of stunting in Southeast
Sulawesi province at the 5% significance level.
References
1. Yuningsih Y (2022) Hubungan status Gizi dengan stunting pada Balita Oksitosin. J Ilm
Kebidanan 9(2):102–109. https://doi.org/10.35316/oksitosin.v9i2.1845
2. Menkokesra (2013) Kerangka Kebijakan Gerakan Nasional Percepatan Perbaikan Gizi Dalam
Rangka Seribu Hari Pertama Kehidupan. Republik Indonesia
3. RI MK (2020) Peraturan Mentri Kesehatan Republik Indonesia Nomor 2 Tahun 2020
4. Mugianti S, Mulyadi A, Anam AK, Najah ZL (2018) Faktor Penyebab Anak Stunting Usia
25–60 Bulan di Kecamatan Sukorejo Kota Blitar. J Ners dan Kebidanan 5(3):268–278. https://
doi.org/10.26699/jnk.v5i3.art.p268-278
5. Ernawati A (2020) Gambaran Penyebab Balita Stunting di Desa Lokus Stunting Kabupaten
Pati. J Litbang Media Inf Penelitian, Pengemb dan IPTEK 16(2):77–94. https://doi.org/10.
33658/jl.v16i2.194
6. Kemenkes RI (2021) Hasil Studi Status Gizi Indonesia (SSGI) Tingkat Nasional, Provinsi dan
Kabupaten/Kota
7. Kementerian PPN/Bappenas (2019) Rencana Pembangunan Jangka Menengah Nasional 2020–
2024
8. Friedman JH (1991) Multivariate adaptive regression splines. Ann Stat 19(1):1–141
9. Prihastuti Yasmirullah SD, Otok BW, Trijoyo Purnomo JD, Prastyo DD (2021) Modification
of multivariate adaptive regression spline (MARS). J Phys Conf Ser 1863(1):012078. https://
doi.org/10.1088/1742-6596/1863/1/012078
10. Otok BW, Putra RY, Sutikno Y, Yasmirullah SDP (2020) Bootstrap aggregating multi-
variate adaptive regression spline for observational studies in diabetes cases. Syst Rev Pharm
11(8):406–413. https://doi.org/10.31838/srp.2020.8.59
11. Yasmirullah SDP, Otok BW, Purnomo JDT, Prastyo DD (2021) Parameter estimation of multi-
variate adaptive regression spline (MARS) with stepwise approach to multi drug-resistant
tuberculosis (MDR-TB) modeling in lamongan regency. J Phys Conf Ser 1752(1). https://doi.
org/10.1088/1742-6596/1752/1/012017
12. Sriningsih R, Otok BW, Sutikno S (2023) Determination of the best multivariate adaptive
geographically weighted generalized poisson regression splines model employing generalized
cross-validation in dengue fever cases. Methods 10. https://doi.org/10.1016/j.mex.2023.102174
13. Sriningsih R, Otok BW, Sutikno W (2021) Factors affecting the number of dengue fever cases
in West Sumatra province using the multivariate adaptive regression splines (MARS) approach.
J Phys Conf Ser 1722(1). https://doi.org/10.1088/1742-6596/1722/1/012094
14. Otok BW, Rumiati AT, Ampulembang AP, Al Azies H (2023) Anova decomposition and impor-
tance variable proccess in multivariate adaptive regression splines. IJASEIT 13(3). https://doi.
org/10.18517/ijaseit
15. Li DHW, Chen W, Li S, Lou S (2019) Estimation of hourly global solar radiation using multi-
variate adaptive regression spline (MARS)—a case study of Hong Kong. Energy 186. https://
doi.org/10.1016/j.energy.2019.115857
16. Yasmirullah SDP, Otok BW, Purnomo JDT, Prastyo DD (2023) Parameter estimation of
spatial error model—multivariate adaptive generalized poisson regression spline. Eng Lett
31(3):1265–1272. https://doi.org/10.5267/j.dsl.2023.7.004
17. Yasmirullah SDP, Otok BW, Trijoyo Purnomo JD, Prastyo DD (2023) A hybrid model of
spatial autoregressive-multivariate adaptive generalized Poisson regression spline. Decis Sci
Lett 12:721–728. https://doi.org/10.5267/dsl.2023.7.004
18. De Andrés J, Lorca P, De Cos Juez FJ, Sánchez-Lasheras F (2011) Bankruptcy forecasting: a
hybrid approach using fuzzy c-means clustering and multivariate adaptive regression splines
(MARS). Expert Syst Appl 38(3):1866–1875. https://doi.org/10.1016/j.eswa.2010.07.117
19. Muranishi M, Honda K, Notsu A (2014) Xie-beni-type fuzzy cluster validation in fuzzy co-
clustering of documents and keywords. Adv Intell Syst Comput 270. https://doi.org/10.1007/
978-3-319-05515-2
20. Miyamoto S, Ichihashi H, Honda K (2008) Algorithms for fuzzy clustering. Springer-verlag
Berlin Heidelberg
21. Bezdek JC, Ehrlich R, Full W (1984) FCM: The fuzzy c-means clustering algorithm. Comput
Geosci 10(2–3):191–203. https://doi.org/10.1016/0098-3004(84)90020-7
22. Zarinbal M, Fazel Zarandi MH, Turksen IB (2014) Relative entropy fuzzy c-means clustering.
Inf Sci (Ny) 260:74–97. https://doi.org/10.1016/j.ins.2013.11.004
Statistical Inferences for Multivariate
Generalized Gamma Regression Model
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 463
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_33
1 Introduction
The linear regression model is a popular tool for studying the relationship between a response
variable and one or more predictor variables. However, it relies on a strong
normality assumption that is often difficult to satisfy. To handle this issue, several
regression models have been proposed for asymmetric continuous distributions,
including gamma, exponential, log-normal, Weibull, and Rayleigh regression [1–9].
Each of these regression models, however, has been developed separately.
In this study, we explore generalized gamma regression to model response variables
with positive skew. The generalized gamma distribution is advantageous because it
contains many other distributions as special cases, such as the gamma, Weibull,
exponential, chi-squared, Erlang, Rayleigh, and half-normal distributions [10]; our
study therefore also applies to regression models based on these continuous distributions.
Although the generalized gamma distribution was introduced in the 1960s [11,
12], further development has relied on re-parametrization to enhance flexi-
bility [10, 13]. In particular, Yasin et al. [14] propose a univariate generalized gamma
regression by modeling its mean as a function of predictor variables. In this study,
we extend the work of Yasin et al. [14] to the multivariate case, develop the method
for parameter estimation, and derive the statistics for hypothesis testing. Finally, we
fit the multivariate generalized gamma regression to model educational indicators in
Central Java involving a number of predictor variables.
The rest of the article is organized as follows. Section 2 details the multivariate
generalized gamma distribution and the corresponding regression model, followed by
the techniques for parameter estimation and hypothesis testing. We evaluate our model
and method using simulated data (Sect. 3) and real data on educational indicators
(Sect. 4). Section 5 provides the conclusion.
2 Methods
where $\mathbf{y} = [y_1\ y_2\ \cdots\ y_K]^T$, $\Theta = [\lambda\ \tau\ \boldsymbol{\theta}^T\ \boldsymbol{\delta}^T]^T$ is the vector of MGG distribution
parameters, $\boldsymbol{\theta}^T = [\theta_1\ \theta_2\ \cdots\ \theta_k\ \cdots\ \theta_K]$, $\boldsymbol{\delta}^T = [\delta_1\ \delta_2\ \cdots\ \delta_k\ \cdots\ \delta_K]$, $\lambda, \tau, \theta_k > 0$,
$\delta_k \in \mathbb{R}$, $y_k > 0$, $y_{(k-1)} + \delta_k < y_k$ for $k > 1$, $\delta_1 < y_1$, and $f(\mathbf{y}\mid\Theta) = 0$ otherwise. Hence,
Yk follows a GG distribution with theoretical mean as in Eq. (2).
$$E(Y_k) = \frac{\Gamma\!\left(\lambda + \frac{1}{\tau}\right)}{\Gamma(\lambda)} \sum_{r=1}^{k} \theta_r + \sum_{r=1}^{k} \delta_r. \qquad (2)$$
where $Y_{ki}$ is the $k$-th response variable at the $i$-th observation, $x_{ji}$ is the $j$-th predictor
variable at the $i$-th observation, $\mathbf{x}_i^T = [1\ x_{1i}\ x_{2i}\ \cdots\ x_{ji}\ \cdots\ x_{pi}]$, $i = 1, 2, \ldots, n$, $n$
represents the sample size, $\boldsymbol{\beta}_k = [\beta_{0k}\ \beta_{1k}\ \cdots\ \beta_{jk}\ \cdots\ \beta_{pk}]^T$, $k = 1, 2, \ldots, K$,
$p$ is the number of predictor variables, and $K$ is the number of response
variables. Therefore, the joint PDF of the MGGR model at the $i$-th observation is
$$f(\mathbf{y}_i \mid \Theta_{MR}) = \frac{\tau^{K} \exp\!\left( -\left( \frac{y_{1i} - \delta_1}{\theta_1(\mathbf{x}_i)} \right)^{\tau} - \sum_{k=2}^{K} \left( \frac{y_{ki} - y_{(k-1)i} - \delta_k}{\theta_k(\mathbf{x}_i)} \right)^{\tau} \right)}{\Gamma(\lambda)^{K} \left( \prod_{k=1}^{K} \theta_k(\mathbf{x}_i) \right) \left( \frac{y_{1i} - \delta_1}{\theta_1(\mathbf{x}_i)} \right)^{1 - \tau\lambda} \prod_{k=2}^{K} \left( \frac{y_{ki} - y_{(k-1)i} - \delta_k}{\theta_k(\mathbf{x}_i)} \right)^{1 - \tau\lambda}}, \qquad (4)$$
where $\mathbf{y}_i$ is the random vector of response variables at the $i$-th observation, and
$\Theta_{MR} = [\boldsymbol{\beta}_1^T\ \boldsymbol{\beta}_2^T\ \cdots\ \boldsymbol{\beta}_k^T\ \cdots\ \boldsymbol{\beta}_K^T\ \lambda\ \tau\ \boldsymbol{\delta}^T]^T$ represents the parameters of the MGGR model. This
study operates under the assumption that alterations in the predictor values solely
impact the scale parameter’s value. Therefore, building on Eqs. (2) and (3), we can
express the scale parameter element for each response variable using Eqs. (5) and
(6).
$$\theta_1(\mathbf{x}_i) = \frac{\Gamma(\lambda)}{\Gamma\!\left(\lambda + \frac{1}{\tau}\right)} \left( \exp\!\left(\mathbf{x}_i^T \boldsymbol{\beta}_1\right) - \delta_1 \right), \qquad (5)$$

$$\theta_k(\mathbf{x}_i) = \frac{\Gamma(\lambda)}{\Gamma\!\left(\lambda + \frac{1}{\tau}\right)} \left( \exp\!\left(\mathbf{x}_i^T \boldsymbol{\beta}_k\right) - \exp\!\left(\mathbf{x}_i^T \boldsymbol{\beta}_{(k-1)}\right) - \delta_k \right), \quad k = 2, 3, \ldots, K. \qquad (6)$$
To estimate the parameters of the MGGR model, we employ the maximum like-
lihood estimation (MLE) method. However, setting the first-order derivatives of the
log-likelihood function to zero does not yield a closed-form solution for each
parameter. As a result, we require an optimization technique that involves
numerical iteration, and we use the BHHH algorithm for this purpose. We selected
this algorithm because it does not necessitate the calculation of a second derivative
for creating the Hessian matrix [16, 17]. The gradient vector and Hessian matrix of
the BHHH algorithm are shown in Eqs. (8) and (9).
$$\mathbf{g}(\Theta_{MR}) = \sum_{i=1}^{n} \mathbf{g}_i(\Theta_{MR}), \qquad (8)$$

$$\mathbf{H}(\Theta_{MR}) = -\sum_{i=1}^{n} \mathbf{g}_i(\Theta_{MR})\, \mathbf{g}_i(\Theta_{MR})^T, \qquad (9)$$
where $\mathbf{g}_i(\Theta_{MR})$ represents the gradient vector for the $i$-th individual observation,
expressed by

$$\mathbf{g}_i(\Theta_{MR}) = \left[ \frac{\partial \log f(\mathbf{y}_i\mid\Theta_{MR})}{\partial \boldsymbol{\beta}_1^T}\ \ \frac{\partial \log f(\mathbf{y}_i\mid\Theta_{MR})}{\partial \boldsymbol{\beta}_2^T}\ \cdots\ \frac{\partial \log f(\mathbf{y}_i\mid\Theta_{MR})}{\partial \boldsymbol{\beta}_K^T}\ \ \frac{\partial \log f(\mathbf{y}_i\mid\Theta_{MR})}{\partial \lambda}\ \ \frac{\partial \log f(\mathbf{y}_i\mid\Theta_{MR})}{\partial \tau}\ \ \frac{\partial \log f(\mathbf{y}_i\mid\Theta_{MR})}{\partial \boldsymbol{\delta}^T} \right]^T. \qquad (10)$$
Thus, the MGGR model parameter estimator is obtained by the following steps:
• Step 1. Define the initial value $\hat{\Theta}_{MR}^{(0)} = \left[ \hat{\boldsymbol{\beta}}_1^{(0)T}\ \hat{\boldsymbol{\beta}}_2^{(0)T}\ \cdots\ \hat{\boldsymbol{\beta}}_K^{(0)T}\ \hat{\lambda}^{(0)}\ \hat{\tau}^{(0)}\ \hat{\boldsymbol{\delta}}^{(0)T} \right]^T$, where $\hat{\lambda}^{(0)}$, $\hat{\tau}^{(0)}$, and $\hat{\boldsymbol{\delta}}^{(0)}$ are derived from the estimates of the distribution parameters of the response variables [18], while $\hat{\boldsymbol{\beta}}_k^{(0)}$ is obtained from the univariate generalized gamma regression (UGGR) model parameters for each response variable [14].
• Step 2. Determine the stopping criteria of the BHHH iteration. This study uses a tolerance limit of $\varepsilon = 10^{-8}$ or, alternatively, a maximum of $M = 150$ iterations. These choices balance accuracy and efficiency: the algorithm terminates either when it achieves a high level of accuracy or when it reaches the predefined limit on the number of iterations.
• Step 3. Run the BHHH iteration

$$\hat{\Theta}_{MR}^{(m+1)} = \hat{\Theta}_{MR}^{(m)} - \mathbf{H}^{-1}\!\left(\hat{\Theta}_{MR}^{(m)}\right) \mathbf{g}\!\left(\hat{\Theta}_{MR}^{(m)}\right), \qquad (11)$$

where $m = 1, 2, \ldots, m^{*}$.
• Step 4. If $\left\| \hat{\Theta}_{MR}^{(m^{*}+1)} - \hat{\Theta}_{MR}^{(m^{*})} \right\| < \varepsilon$, the iterations stop at the $m^{*}$-th iteration; alternatively, the iteration is terminated when it reaches the $M$-th iteration. The final iteration provides the estimator value when convergence occurs.
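Since the BHHH Hessian in Eq. (9) is built only from per-observation gradients, the iteration of Eq. (11) can be sketched generically as below. The function returning the matrix of per-observation gradients of the MGGR log-likelihood must be supplied by the user; this is an illustrative sketch, not the authors' code.

```python
import numpy as np

def bhhh(theta0, per_obs_gradient, tol=1e-8, max_iter=150):
    """Minimal BHHH update of Eq. (11).

    per_obs_gradient(theta) must return an (n, q) array whose i-th row is
    g_i(theta), the gradient of the i-th observation's log-likelihood.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        G = per_obs_gradient(theta)          # shape (n, q)
        g = G.sum(axis=0)                    # Eq. (8): total gradient
        H = -(G.T @ G)                       # Eq. (9): outer-product Hessian
        step = np.linalg.solve(H, g)         # H^{-1} g
        theta_new = theta - step             # Eq. (11)
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```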
The significance of the regression parameters of the MGGR model is tested simultaneously
with the null hypothesis $H_0: \beta_{jk} = 0$, $j = 1, 2, \ldots, p$, $k = 1, 2, \ldots, K$, against the
alternative $H_1$: at least one $\beta_{jk} \neq 0$.
Suppose that $\Omega = \left\{ \boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \ldots, \boldsymbol{\beta}_k, \ldots, \boldsymbol{\beta}_K, \lambda, \tau, \boldsymbol{\delta} \right\}$ represents the set of parameters
under the population and $\omega = \left\{ \beta_{\omega 01}, \beta_{\omega 02}, \ldots, \beta_{\omega 0k}, \ldots, \beta_{\omega 0K}, \lambda_{\omega}, \tau_{\omega}, \boldsymbol{\delta}_{\omega} \right\}$ repre-
sents the parameters assumed to be true under the null hypothesis. Note that
$\hat{\Omega}$ and $\hat{\omega}$ are estimators obtained by the MLE method. Therefore, using the MLRT
method, the test statistic for this hypothesis is

$$G^2 = 2\left( \ln L\left(\hat{\Omega}\right) - \ln L(\hat{\omega}) \right). \qquad (12)$$

For large sample sizes, the test statistic $G^2$ follows a Chi-squared distribution with
$pK$ degrees of freedom [17]. Thus, the null hypothesis is rejected if $G^2 > \chi^2_{(1-\alpha);\,pK}$.
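Assuming the maximized log-likelihoods under the full and null models are available, the decision rule of Eq. (12) can be illustrated as follows (function and argument names are illustrative):

```python
from scipy.stats import chi2

def simultaneous_test(loglik_full, loglik_null, p, K, alpha=0.05):
    # Eq. (12): G^2 compared with the chi-squared quantile on p*K degrees of freedom
    G2 = 2.0 * (loglik_full - loglik_null)
    critical = chi2.ppf(1.0 - alpha, df=p * K)
    return G2, critical, G2 > critical
```

With p = 6 predictors and K = 3 responses, `chi2.ppf(0.95, 18)` reproduces the 28.869 threshold used in Sect. 4.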
The significance of the MGGR model regression parameters is partially tested
with the null hypothesis $H_0: \beta_{jk} = 0$ against the alternative hypothesis $H_1: \beta_{jk} \neq 0$,
using the test statistic

$$Z_{jk} = \frac{\hat{\beta}_{jk}}{\sqrt{\widehat{\operatorname{var}}\left(\hat{\beta}_{jk}\right)}}, \qquad (13)$$

where $\widehat{\operatorname{var}}(\hat{\beta}_{jk})$ is the main diagonal element of the matrix $\left[ -\mathbf{H}\left(\hat{\Theta}_{MR}\right) \right]^{-1}$ corre-
sponding to $\hat{\beta}_{jk}$. The null hypothesis is rejected when $\left| Z_{jk} \right| > Z_{\alpha/2}$.
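Once the diagonal of $[-\mathbf{H}(\hat{\Theta}_{MR})]^{-1}$ corresponding to the regression coefficients has been extracted, the partial test of Eq. (13) can be vectorized; the sketch below is illustrative only.

```python
import numpy as np
from scipy.stats import norm

def wald_z(beta_hat, var_hat, alpha=0.05):
    # Eq. (13): Z = beta_hat / sqrt(var(beta_hat)); reject H0 when |Z| > z_{alpha/2}
    z = np.asarray(beta_hat) / np.sqrt(np.asarray(var_hat))
    return z, np.abs(z) > norm.ppf(1.0 - alpha / 2.0)
```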
3 Simulation Study
We used simulation studies to check if our new estimation method works well. In
this simulation, we looked at different sample sizes: 50, 100, 150, and 200 samples,
and we repeated each scenario 500 times. We picked these numbers based on what
makes sense statistically, what our computers can handle, and to make sure our results
are reliable and can be applied widely. To simulate the MGGR model, we needed
to create data that matches how the predictor variables are related to the response
variables, which follow the MGG distribution. Here is a step-by-step guide on how
we did that:
Step 1: We start by setting up the MGGR model as described in Eq. (3). For our
study, we work with three predictor variables.
Step 2: We specify the parameters. In our research, we consider two parameter
models: the bivariate and trivariate models, which are presented in Table 1.
Step 3: Generate predictor variables from uniform distributions:
$X_1 \sim U(10, 90)$, $X_2 \sim U(10, 25)$, and $X_3 \sim U(10, 50)$.
Step 4: Calculate the scale parameter of the k-th response variable at the i-th
observation by applying Eqs. (5) and (6), as sketched after this list.
Step 5: Create response variables based on the multivariate generalized gamma
(MGG) distribution. We derive the shape and threshold parameters from the distri-
bution parameters provided in Table 1. The scale parameters are obtained from
the calculations made in Step 4.
Step 6: Execute the parameter estimation process for the MGGR model, following
the steps outlined in Sect. 2.2.
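A compact sketch of Steps 3–5 is given below. It assumes that an MGG response vector can be generated as the cumulative sum of independent generalized-gamma increments, which is consistent with Eqs. (2) and (4), and that if W follows a standard gamma distribution with shape λ then δ + θ·W^(1/τ) has the generalized gamma distribution used here. Parameter values and names are illustrative and do not reproduce Table 1.

```python
import numpy as np
from scipy.special import gamma as Gamma

def simulate_mggr(beta, lam, tau, delta, n, rng):
    # beta: (p+1, K) coefficient matrix with the intercept in the first row
    # delta: length-K threshold parameters
    X = np.column_stack([np.ones(n),
                         rng.uniform(10, 90, n),    # Step 3: X1 ~ U(10, 90)
                         rng.uniform(10, 25, n),    #         X2 ~ U(10, 25)
                         rng.uniform(10, 50, n)])   #         X3 ~ U(10, 50)
    K = beta.shape[1]
    ratio = Gamma(lam) / Gamma(lam + 1.0 / tau)
    mu = np.exp(X @ beta)                            # exp(x_i^T beta_k), shape (n, K)

    theta = np.empty_like(mu)                        # Step 4: Eqs. (5) and (6)
    theta[:, 0] = ratio * (mu[:, 0] - delta[0])
    for k in range(1, K):
        theta[:, k] = ratio * (mu[:, k] - mu[:, k - 1] - delta[k])

    # Step 5: responses built from independent generalized-gamma increments
    W = rng.gamma(shape=lam, scale=1.0, size=(n, K))
    increments = np.asarray(delta) + theta * W ** (1.0 / tau)
    Y = np.cumsum(increments, axis=1)                # Y_k = Y_{k-1} + increment
    return X, Y
```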
We assessed the accuracy of our parameter estimation by computing the mean
for each estimated parameter, as detailed in Table 2. Furthermore, we examined
the bias, variance, and root of mean squared error (RMSE) associated with these
parameter estimates. Figures 1 and 2 present these results graphically, providing a
visual representation of the performance of our parameter estimation procedure. This
comprehensive evaluation helps us gauge the robustness and reliability of the MGGR
model in capturing the underlying relationships in our simulated data.
Based on the insights from Table 2, our proposed procedure demonstrates robust-
ness in accurately estimating MGGR model parameters across both small and large
sample sizes. The average values of the estimated parameters closely align with
the true parameters, affirming the method’s reliability. Furthermore, the patterns
observed in Figs. 1 and 2 are noteworthy. They illustrate that the variance and root of
mean squared error (RMSE) associated with bivariate and trivariate model parame-
ters tend to decrease as the number of samples increases. This trend signifies that the
Fig. 1 Bias, variance, and RMSE of bivariate generalized gamma regression model parameters
Fig. 2 Bias, variance, and RMSE of trivariate generalized gamma regression model parameters
estimated parameter values become more precise and consistent with an increase in
sample size. However, it is essential to highlight that this trend does not extend to
the bias of the estimated parameters. Both Figs. 1 and 2 indicate no clear relation-
ship between bias and sample size. Consequently, our proposed procedure consis-
tently provides parameter estimations for the MGGR model with relatively low bias,
offering dependable and accurate results.
In this study, we utilized secondary data spanning from 2017 to 2021, sourced from
BPS Central Java Province. Our observation unit encompassed the 35 districts and
cities within Central Java Province [19]. The data employed pertained to education
statistics in Central Java, which included indicators related to education participation,
specifically the school participation rate (SPR) for individuals aged 16 to 18, and the
gross participation rate (GPR) at the high school or equivalent level. Additionally, we
considered education development indicators, such as the mean years of schooling
(MYS). These indicators were examined in conjunction with various factors believed
to influence them. These factors included GRDP per capita, the percentage of the
population living in poverty, gender ratio, percentage of households with access to
proper sanitation, labor force participation rate, and teacher-student ratio at the junior
high school level. Descriptive statistics for each of the research variables are presented
in Table 3, offering a comprehensive overview of the data characteristics.
The visual depiction of the relationships between response variables and predictor
variables is effectively showcased through a matrix plot, as demonstrated in Fig. 3.
This figure provides a visual summary of the correlations between each predictor
variable and each response variable, and the observed patterns align with established
theories.
Among the predictor variables, GRDP per capita and the Percentage of households
with access to proper sanitation exhibit a positive correlation with the response
variables. In contrast, the percentage of people living in poverty, gender ratio, labor
force participation rate, and teacher-student ratio show a negative correlation with
the response variables.
Notably, Y 1 (mean years of schooling) exhibits the most significant correlation
with its six predictors. This suggests that the predictor variables offer a better expla-
nation for the variation in the Y 1 variable compared to other response variables,
emphasizing its importance in our analysis.
Subsequently, the fit of the MGG distribution was evaluated using the Kolmogorov–
Smirnov (KS) test. The null hypothesis states that the observed data follow the MGG
distribution, whereas the alternative hypothesis states that they do not [18]. The KS
test produced a test statistic of 0.052564 and a p-value of 0.7189, so there is no
compelling evidence to reject the null hypothesis. In other words, the school
participation rate (SPR), gross participation rate (GPR), and mean years of schooling
(MYS) are well described by the multivariate generalized gamma (MGG) distribution.
In our analysis, we employed the MGGR model to identify which predictor vari-
ables had a significant impact on the response variable. The initial parameter values
required for estimating the MGGR model are provided in Table 4. Subsequently,
we present the parameter estimation results in Table 5 considering three response
variables and six predictor variables. Within this table, the p-values for each param-
eter are used to gauge the significance of individual predictors. At a significance level
of 10%, the outcomes are as follows: All predictors significantly affect Y 1 , except
for variable X 5 . Variables X 3 , X 4 , and X 6 notably influence Y 2 . Meanwhile, Y 3 is
substantially affected by variables X 3 , X 4 , and X 5 . These findings offer valuable
insights into the complex relationships between predictor and response variables,
providing a comprehensive understanding of the model’s parameters.
The model’s significance can be assessed concurrently using Wilk’s likelihood
ratio statistics derived from the MLRT. The calculated test statistic is 337.433, while
the quantile of the Chi-squared distribution with 18 degrees of freedom and a signif-
icance level of α = 0.05 is 28.869. The calculated test statistic exceeds the critical
value of the Chi-squared distribution, leading to the rejection of the null hypothesis.
It implies that the six predictor factors collectively influence the response variables
considerably.
Based on the outcomes of a single response modeling (see Table 4) and multi-
variate modeling using MGGR (see Table 5), it can be seen that the difference in
coefficient sign only occurs for variable X 1 in responses Y 2 , Y 3 , and X 5 in response
Y 1 . The presence of this phenomenon can be ascribed to the presence of a correlation
between the response variables. As a comparison, it can also be seen from the matrix
plot in Fig. 3. Therefore, based on the correlation values in this figure, it can be
concluded that the MGGR model is more reasonable than univariate modeling.
5 Conclusion
To sum up, the multivariate generalized gamma regression (MGGR) model has
emerged as a robust and effective tool for estimating parameters in the context of
responses adhering to the MGG distribution. By leveraging the maximum likelihood
estimation (MLE) method and the BHHH algorithm for optimization, the model
consistently delivered reliable results. Our application of the MGGR model to analyze
three education indicators within the framework of six predictor variables revealed
the clear advantage of employing multivariate modeling over its univariate counter-
part. This approach not only deepened our understanding of the interrelationships
among variables but also highlighted that parameter significance may vary when
assessed individually in univariate models. It is important to note that addressing the
challenge of collinearity among responses remains an avenue for further exploration
in the future research. In essence, our study underscores the efficacy of the MGGR
model in parameter estimation and emphasizes the merits of adopting a multivariate
approach when confronting intricate data relationships. This work sets the stage
for future research endeavors aimed at resolving the complexities associated with
collinearity among responses in statistical modeling.
Acknowledgements The authors would like to express deep appreciation for the financial support
provided by Balai Pembiayaan Pendidikan Tinggi (BPPT) or Central for Higher Education Funding
and Lembaga Pengelola Dana Pendidikan (LPDP) under the Ministry of Education, Culture,
Research, and Technology of the Republic of Indonesia.
References
15. Mathai AM, Moschopoulos PG (1992) A form of multivariate gamma distribution. Ann Inst
Stat Math 44(1):97–106
16. Hayati FN, Otok BW (2018) Parameter estimation and statistical test of mixed geographically
weighted bivariate Weibull regression (MGWBWR). In: 2018 international symposium on
advanced intelligent informatics (SAIN), pp 78–83. IEEE, Yogyakarta
17. Rahayu A, Purhadi S, Prastyo DD (2020) Multivariate gamma regression: parameter estimation,
hypothesis testing, and its application. Symmetry 12(5):1–17
18. Yasin H, Purhadi, Choiruddin A (2023) Parameter estimation and the goodness-of-fit test for the
multivariate generalized gamma distribution. In: 2023 international conference on computer,
control, informatics and its applications (IC3INA), pp 382–387. IEEE, Bandung, Indonesia
19. BPS-Statistics of Jawa Tengah Province, LNCS Homepage. https://jateng.bps.go.id/. Last
accessed 17 Feb 2023
Health and Industrial Applications
W@rk: Attendance Application
Framework Using Blockchain
Technology
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 479
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_34
1 Introduction
The onset of the COVID-19 pandemic in early 2020 had an unintended impact on
business, accelerating the adoption of digital technologies and resulting in changes
to work patterns due to the lockdown. This led to a shift from traditional work
practices restricted by time, location, and structure to the widespread adoption of
remote work that relied on tools such as video conferencing, collaborative software,
and online accessibility. During the lockdown period, many organizations faced chal-
lenges in recording and monitoring employee attendance using conventional systems
and procedures such as the attendance register system or through an on-premises
biometric-attendance system for them to clock in and out. Furthermore, these atten-
dance systems are prone to intentional or unintentional modification of the system,
the components of systems, and their data. Blockchain technology plays a key role
in mitigating several issues faced by the global community before and after the
pandemic. Blockchain was first proposed in [1] and has since evolved rapidly from
a prototype concept, receiving considerable attention in earlier studies. It combines
three technologies: a consensus protocol, peer-to-peer (P2P) networking, and
private-key cryptography. This emerging technology promotes a better architecture
with transparency, trust, security, and immutability of transactions.
This paper proposes a blockchain-based attendance framework with a tamper-proof
mechanism. The blockchain stores employee attendance infor-
mation through its shared database or ledger in digital, decentralized, and secure
transactions. The data updated from the attendance can be observed in a public mode.
This accessibility makes the data trustworthy and transparent, leading to a trace for
any tampered datasets. As per reference [2], blockchain has five basic components:
consensus mechanism, P2P network, validity rules, ledger, and cryptography. Here,
cryptographic blocks are mainly utilized to collect the transactions across several
nodes that can be either private or public. A public blockchain is open to everybody,
whereas a private blockchain is limited to authorized participants only [3]. However,
different consensus protocols are discussed in blockchain technology, which helps
to ensure the essential agreement over the ledger status and contribution by the other
participants in the blockchain network. Apart from blockchain, many organizations
can continue to use biometric technology such as face recognition systems as part
of their digital attendance authentication mechanism [4]. By using a computer or
laptop with a high-quality web camera as input during the attendance check, the
facial recognition system saves employee attendance into the database. The work in
[5] proposes a system implemented in four steps: finding all the faces, posing and
projecting the faces, encoding the faces, and finding the person's name from the
encoding. Another technology that can be applied to the attendance system is through
tracking the location of the employee within the designated geo-fence. The location-
based system enables employees who are working away from home to be tracked
for attendance records. In [6], a student attendance system is developed for schools
and colleges combining geo-fencing using the geo-fencing API from Google Play
services by acquiring the live location of students and a geo-fence of the class area.
In this paper, we introduce W@RK, a contactless attendance framework that
embeds face recognition and location-based tracking for recording employee
attendance using blockchain technology. The aim is to monitor employee attendance
and empower employees to work from anywhere. The objective is to develop an
alternative contactless attendance application. Agile software development method-
ology is adopted to develop this attendance application. A facial recognition feature
is used to scan the face of employees to authenticate themselves through verification
services. While the location is detected using geo-fencing, the blockchain technology
records and verifies the attendance of the employee. The facial data is recorded on a
blockchain, along with a timestamp and other relevant information such as location.
The findings show the proposed W@RK framework can record and verify atten-
dance securely and efficiently. As the use of blockchain technology continues to
grow, we expect more organizations will adopt facial attendance systems to track the
attendance of employees, students, and other individuals at events and meetings.
The rest of this paper is organized as follows. Section 2 provides background and
previous work on blockchain adoption in attendance systems. The related works of
face recognition systems in an attendance system are discussed in this section. The
methodology of W@RK is described in Sect. 3. The design of the W@RK framework
is presented in Sect. 4. Section 5 concludes the research work and discusses the future
scope of this work.
2 Background Study
Studies that have used face recognition integrated with blockchain systems are still limited. The study
of [11] utilizes a convolutional neural network for face recognition and stores the
attendance data on the blockchain system. The attendance system with the blockchain
has not been limited to organizations’ attendance, but it has also been designed
for conference [12] attendance and class attendance at a university [13, 14]. The
attendance system discussed in [12] performs auto-track operations for the user’s
behavior over the screen, device location, and allotted time during the session for
which the user makes the subscription. In [13], the attendance system used Wi-Fi that
allows students to record their attendance through a mobile phone or laptop and the
record will be saved on a distributed and decentralized blockchain. As the worker's
location is important when they log into the attendance system, to verify whether
they are working from home or the office, we propose the W@RK attendance
framework, which combines a face recognition feature and location-based tracking
for recording employee attendance data using blockchain technology.
3 Methodology
The first component is the biometric recognition method, which uses facial recognition
as the selected biometric. Facial recognition identifies and verifies a user's identity
using their face, and this choice of biometric authentication has several advantages.
Facial recognition is non-intrusive [21] and does not require physical contact, unlike
fingerprint and iris scanning. Users simply look into the camera for authentication,
making it user-friendly and easy to adopt. The method is also easily accessible, since
many users can authenticate through the front-facing cameras of smartphones and
computers, which gives it high acceptance [22], robustness [23], convenience [24],
speed, contactless hygiene [25], and low sensitivity to age [26]. It is nevertheless
necessary to address the limitations of facial recognition, including detection accuracy,
privacy concerns, and algorithmic bias. Organizations should collect facial images of
their employees from various angles and under various lighting conditions to ensure
accuracy, and preprocess them to enhance quality and standardize them for analysis.
A verification step of this kind is sketched below.
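As an illustration of the verification step, the sketch below uses the open-source `face_recognition` library to compare a stored employee photograph with a freshly captured image; file paths and the matching tolerance are illustrative assumptions, not part of the W@RK specification.

```python
import face_recognition

def verify_employee(known_image_path, captured_image_path, tolerance=0.6):
    # Load the enrolled photograph and the image captured at check-in
    known = face_recognition.load_image_file(known_image_path)
    captured = face_recognition.load_image_file(captured_image_path)
    known_encodings = face_recognition.face_encodings(known)
    captured_encodings = face_recognition.face_encodings(captured)
    if not known_encodings or not captured_encodings:
        return False  # no face found in one of the images
    # compare_faces returns one boolean per known encoding
    match = face_recognition.compare_faces(
        [known_encodings[0]], captured_encodings[0], tolerance=tolerance)
    return bool(match[0])
```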
The geo-fencing and blockchain components are set up through the following steps:
• Set the geofence: determine the geographic area where attendance will be tracked using geo-fencing.
• Test and refine: test the accuracy of attendance tracking and refine it as needed.
• Choose a blockchain platform: select the blockchain platform best suited for W@RK.
• Develop the smart contract: write the smart contract that will govern the interactions on the blockchain.
• Integrate with the system: integrate the smart contract with W@RK via an API.
4.4 Integration
The previous section discusses the components and features that collectively form
the entirety of the W@RK project. In this section, we will present the integration
process of the components and the data flow of the W@RK. Figure 5 shows the
flowchart analysis of the proposed W@RK. The flowchart module starts with the face
scanning system. Initially, the face is read through the system and tries to identify the
images from the existing database. If the face is recognized from the server location,
it will try to collect the location identity from the auto-attendee location tracking.
Here, the location tracking is performed through LBS and uses GPS for position
tracking. If the face is not recognized, the system will go back and start scanning
the face for another verification attempt. Each step defined in the flowchart is
explained in detail below. To make the attendance information secure
and transparent, blockchain technology is employed. A transaction of attendance
record along with the transaction address, private key, and public key is written in
the blockchain. The transaction will be added to the blockchain node if it is valid.
In a blockchain network, only valid transactions can adhere to predefined rules and
conditions that ensure the blockchain ledger’s integrity and consistency. Certain
modification and enhancement of the W@RK system is motivated by Mohammad
Azahari et al. [20].
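To illustrate how hash-linking makes attendance records tamper-evident, the following conceptual sketch chains each record to the hash of the previous block. It deliberately omits the consensus, key-management, and smart-contract layers of a real blockchain, and field names are assumptions rather than the W@RK implementation.

```python
import hashlib
import json
import time

class AttendanceLedger:
    def __init__(self):
        # the chain starts from a fixed genesis block
        self.chain = [{"index": 0, "prev_hash": "0" * 64,
                       "record": "genesis", "timestamp": time.time()}]

    def _hash(self, block):
        # deterministic SHA-256 digest of the block contents
        return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

    def add_record(self, employee_id, location, face_verified):
        prev = self.chain[-1]
        block = {
            "index": prev["index"] + 1,
            "prev_hash": self._hash(prev),          # link to the previous block
            "record": {"employee_id": employee_id, "location": location,
                       "face_verified": face_verified},
            "timestamp": time.time(),
        }
        # a simple validity rule: only face-verified check-ins are appended
        if face_verified:
            self.chain.append(block)
        return block
```

Because each block stores the hash of its predecessor, changing any earlier record invalidates every subsequent hash, which is the property the attendance framework relies on.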
Initially, the proposed system uses the computer’s or phone’s front camera to
capture the facial image. Once the camera captures the image, it sends it into
processing. Here, the identified facial image is checked in the labeled data. If it
matches the existing images, it will process the next steps in the flowchart. The next
phase discussed in the flowchart is facial recognition. The proposed system uses
facial recognition software to identify the person by searching various images. The
software used in this process compares the facial features of the images in the labeled
database of the stored faces. If the software finds the match of the facial image, then
it recognizes the person. If it is not identified, then the process ends. The third phase
of the flowchart discusses location detection. To identify the attendee’s location, the
system uses GPS and other location tracking systems to determine the person’s loca-
tion. Location detection using GPS and LBS is an essential part of event management.
This technology will allow the organizer to monitor the movement of the attendees
and improve the event experience, security, and safety. GPS is the satellite naviga-
tion system that provides accurate location information through multiple satellites.
GPS-enabled smartphones, computers, tablets, and other tracking devices are used
to calculate a precise position.
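As a simple illustration of the geo-fence check, the reported GPS position can be compared with the fence center using the haversine distance; the radius and function names below are illustrative assumptions.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    # great-circle distance between two points, in metres
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def inside_geofence(lat, lon, fence_lat, fence_lon, radius_m=200.0):
    # attendance is accepted only when the device is within the fence radius
    return haversine_m(lat, lon, fence_lat, fence_lon) <= radius_m
```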
To verify the attendee’s location, GPS is used to track the participants’ real-time
position. Attendees must carry a device with GPS receivers allowing the event orga-
nizers to observe their movements in the venue. LBS uses the location information
provided by the positioning technologies such as GPS. This technology leverages
the GPS data and offers other services based on the user’s location while navigating,
mapping, geotagging, and proximity-based notifications. For the attendee’s location,
LBS provides location-specific information and is critical in identifying the event
participants. LBS acquires maps, personal schedules, event-specific notifications, and
other nearby amenities for this process. LBS applications may also suggest networking
opportunities among attendees with similar affiliations. Recently, almost all event
management systems have provided a web-based application for attendees equipped
with LBS and GPS capabilities.
Attendees can use the applications on their smart devices, which allows the event
organizers to monitor and track the attendee’s movements. In addition, other wear-
able devices, such as RFID wristbands and smart badges, can also be distributed
to the participants, which provides the exact location for tracking. This event app
processes real-time data and updates the position of the attendees over the event
mapping section. Even though the GPS provides high accuracy in outdoor environ-
ments, it has signal attenuation for indoor environments, which introduces limita-
tions. This problem can be solved using Bluetooth Low Energy (BLE) beacons or
Wi-Fi positioning systems, which provide location information within indoor areas,
conference rooms, and exhibition halls. Collecting these data allows the event appli-
cation to provide the attendee’s better-personalized experiences. For instance, this
facility may provide information about the nearby exhibitions, sessions, or schedules
based on their interest.
The proposed attendance system uses blockchain technology to store atten-
dance records securely, transparently, and tamper-proof. The stored data includes
the attendee’s name, date, time, and location and can be encrypted using cryp-
tographic hashes to prevent unauthorized access. The system integration of smart
contracts simplifies auditing and reduces administrative burden, making the system
highly trustworthy and fraud-resistant. Furthermore, implementing a blockchain-
based attendance system may have unforeseen consequences, such as increased stress
5 Conclusions
References
1. Nakamoto S (2008) Bitcoin: a peer-to-peer electronic cash system. [Online]. Available: https://
bitcoin.org/bitcoin.pdf
2. Garrick H, Michel R (2017) Global blockchain benchmarking study. Cambridge Centre for
Alternative Finance, Cambridge
3. Mamun Q (2022) Blockchain technology in the future of healthcare. Smart Health 23:100223
4. Shashikala HK, Shakya SN, Panjiyar P, Upreti AS, Dadapeer S (2022) Attendance monitoring
system using face recognition. Int J Inform Technol Res Appl 1(3):15–22. https://doi.org/10.
5281/zenodo.7385439
5. Sarath Krishnan P, Manikuttan A (2022) Attendance management system using facial recogni-
tion. In: 2022 International conference on computing, communication, security and intelligent
systems (IC3SIS), Kochi, India, pp 1–6. https://doi.org/10.1109/IC3SIS54991.2022.9885693
6. Babatunde AN, Oke AA, Babatunde RS, Ibitoye O, Jimoh ER (2022) Mobile based student
attendance system using geo-fencing with timing and face recognition. J Adv Mathem Comput
Sci 9(2):75–90. https://doi.org/10.22624/AIMS/MATHS/V9N1P8
7. Ardina H, Nugraha IGBB (2019) Design of a blockchain-based employee attendance system.
In: 2019 International conference on ICT for smart society (ICISS). vol 7. IEEE
8. Zhang Q (2022) Attendance system based on blockchain and face recognition. In: 2022
International conference on smart applications, communications and networking (SmartNets),
November, IEEE, pp 1–6
9. Adhinata FD, Tanjung NAF, Widayat W, Pasfica GR, Satura FR (2022) Real-time masked face
recognition using FaceNet and supervised machine learning. In: Proceedings of the 2nd inter-
national conference on electronics, biomedical engineering, and health informatics: ICEBEHI
2021, 3–4 November, Surabaya, Indonesia, June, Singapore, Springer Nature Singapore, pp
189–202
10. Adetiba E, Opara AE, Ajayi OT, Owolabi FO (2021) DeepFacematch: a convolutional neural
network model for contactless attendance on e-SIWES Portal. In: Information and communica-
tion technology and applications: third international conference, ICTA 2020, Minna, Nigeria,
November 24–27, 2020, Revised Selected Papers 3, Springer International Publishing, pp
196–205
11. Yadav AR, Kumar J, Anumeha AKA, Kumar R (2021) Contactless attendance system: a health-
care approach to prevent spreading of COVID-19. people, vol 5. pp 6
12. Vubangsi M, Al-Turjman F (2022) Design and implementation of a conference attendance
monitoring system using blockchain and AI technologies. In: 2022 International conference
on artificial intelligence in everything (AIE), IEEE, pp 197–202
13. Sarumi UA, Ameen ZS, Al-Turjman F, Altrjman C, Mubarak AS (2022) A novel attendance
system via integrated wifi and blockchain technologies. In: 2022 International conference on
artificial intelligence in everything (AIE), August, IEEE, pp 209–215
14. Bálint K (2021) Possibilities for the utilization of an automatized, electronic blockchain-based,
students’ attendance register, using a universities’ modern security cameras. Acta Polytechnica
Hungarica 18(2):127–142
15. Petersen J (2016) MEAN Web Application development with Agile Kanban. [Online].
Available: https://digitalcommons.wou.edu/honors_theses/105
16. Kaleel SB, Harishankar S (2013) Applying Agile methodology in
mobile software engineering: android application development and its challenges. [Online].
Available: http://digitalcommons.ryerson.ca/compsci_techrpts
17. Flora HK, Chande SV, Wang X (2014) Adopting an agile approach for the development of
mobile applications
18. Abrahamsson P, Salo O, Ronkainen J, Warsta J (2002) Agile software development methods:
review and analysis. [Online]. Available: http://www.vtt.fi/inf/pdf/publications/2002/P478.pdf
19. Sultania AK (2015) Developing software product and test automation software using Agile
methodology. In: Proceedings of the 2015 3rd international conference on computer, commu-
nication, control and information technology, C3IT 2015, March 2015. https://doi.org/10.1109/
C3IT.2015.7060120
20. Mohammad Azahari A, Ahmad A, Rahayu S (2021) Contactless attendance method with face
recognition, body temperature measurement and GPS system using blockchain technology. In:
Zakaria Z, Emamian SS (eds) Advances in electrical and electronic engineering and computer
science. Lecture notes in electrical engineering, vol 741. Springer, Singapore. https://doi.org/
10.1007/978-981-33-6490-5_8
21. Lai X, Patrick Rau P-L (2021) Has facial recognition technology been misused? A public
perception model of facial recognition scenarios. Comput Human Behav 124:106894. ISSN
0747-5632.https://doi.org/10.1016/j.chb.2021.106894
22. Kauba C, Prommegger B, Uhl A (2019) Combined fully contactless finger and hand vein
capturing device with a corresponding dataset. Sensors 19:5014. https://doi.org/10.3390/s19
225014
23. Yang B, Yang S, Zhu X, Qi M, Li H, Lv Z, Cheng X, Wang F (2023) Computer vision technology
for monitoring of indoor and outdoor environments and HVAC equipment: a review. Sensors
23:6186. https://doi.org/10.3390/s23136186
24. van Hoof J, Marston HR, Kazak JK, Buffel T (2021) Ten questions concerning age-friendly
cities and communities and the built environment. Build Environ 199:107922. ISSN 0360-
1323. https://doi.org/10.1016/j.buildenv.2021.107922
25. Yan L, Shi Y, Wei M, Wu Y (2023) Multi-feature fusing local directional ternary pattern for
facial expressions signal recognition based on video communication system. Alexandria Eng
J 63:307–320. ISSN 1110-0168. https://doi.org/10.1016/j.aej.2022.08.003
26. Gode CS, Khobragade AS, Thanekar C, Thengadi O, Lakde K (2023) Face recognition-
based attendance system. In: Smys S, Tavares JMRS, Shi F (eds) Computational Vision and
bio-inspired computing. advances in intelligent systems and computing, vol 1439. Springer,
Singapore. https://doi.org/10.1007/978-981-19-9819-5_12
27. Rajeshkumar G, Braveen M, Venkatesh R, Josephin Shermila P, Ganesh Prabu B, Veerasamy B,
Bharathi B, Jeyam A (2023) Smart office automation via faster R-CNN based face recognition
and internet of things. Measurement: Sensors 27:100719. ISSN 2665-9174. https://doi.org/10.1016/
j.measen.2023.100719
28. Falayi A, Wang Q, Liao W, Yu W (2023) Survey of distributed and decentralized iot securities:
approaches using deep learning and blockchain technology. Future Internet 15:178. https://doi.
org/10.3390/fi15050178
29. Mokhamed T, Talib MA, Moufti MA, Abbas S, Khan F (2023) The potential of blockchain
technology in dental healthcare: a literature review. Sensors 23:3277. https://doi.org/10.3390/
s23063277
30. Waseem M, Adnan Khan M, Goudarzi A, Fahad S, Sajjad IA, Siano P (2023) Incorporation
of blockchain technology for different smart grid applications: architecture, prospects, and
challenges. Energies 16:820. https://doi.org/10.3390/en16020820
31. Zhang T, Li P, Wang N (2023) Multi-period price competition of blockchain-technology-
supported and traditional platforms under network effect. Int J Prod Res 61(11):3829–3843.
https://doi.org/10.1080/00207543.2021.1884308
32. KN RP, Rajat (2023) The intelligent information integrity model to ensure the database protec-
tion using blockchain in cloud networking. In: 2023 International conference on distributed
computing and electrical circuits and electronics (ICDCECE), Ballar, India, 2023, pp 1–7.
https://doi.org/10.1109/ICDCECE57866.2023.10151301
Exploring the Impact of COVID-19
on Individuals’ Mental Health Through
Cluster Analysis
Abstract The COVID-19 pandemic has had a significant impact on mental health,
resulting in anxiety and other issues among many individuals due to the lockdowns
implemented to curb its spread. With the world moving toward 2030, it is crucial to
reduce premature mortality from non-communicable diseases through prevention and
treatment. Sustainable Development Goal (SDG) 3 emphasizes prioritizing mental
health and well-being to address the increasing burden of mental health issues. The
study utilized text clustering through the K-Means algorithm to gain a better under-
standing of the mental health issues people are facing. The Term Frequency-Inverse
Document Frequency (TF-IDF) was used to determine each word’s weight after
extracting tweets from Twitter and preprocessing the data. The K-Means clustering
algorithm was then applied to the data, which revealed that the clusters could be
classified into three categories of mental health: ‘stress,’ ‘depression,’ and ‘pressure.’
It was found that using three clusters provided more dependable outcomes, since using
more than three clusters tended to produce overlapping mental health conditions. This
study sheds light on the mental health problems that people face during the COVID-
19 pandemic, which can help guide efforts to support those in need. Moreover, it
would be more beneficial to incorporate Bahasa Malaysia in future research since
little exploration has been done on this language despite it being Malaysia’s
official language. By adopting a holistic approach and prioritizing mental health, we
can work toward ensuring a healthier and happier future for everyone.
1 Introduction
The COVID-19 pandemic has caused many changes worldwide, such as lockdowns
and economic slowdowns, which have resulted in job losses and increased mental
health concerns. The impact on mental health has been significant; unfortunately,
resources have been inadequate to tackle this issue. It is crucial to prioritize both phys-
ical and mental health when it comes to handling the pandemic. Due to the outbreak
of COVID-19, many countries have enforced lockdowns to prevent the spread of
the disease. This has also resulted in declining economic growth as many economic
activities had to be suspended. As a result of the COVID-19 pandemic, many individ-
uals have lost their jobs, which has contributed to an increase in depressive symptoms
[1]. The worst outcome of a pandemic is death, which affects the population’s size and
its mental and physical health. Mental health encompasses emotional, psychological, and
social well-being, impacting how individuals think, feel, act, manage stress, interact
with others, and make decisions. In times of epidemic, there is a strong correla-
tion between emotional distress and social disorder, which is often influenced by
the spread of infectious diseases and the psychological responses of those affected.
Despite this, mental health and well-being resources are often lacking in the face
of pandemics. It is crucial to prioritize psychological and psychiatric needs during
any phase of pandemic management, alongside testing, critical patient care, and
reducing transmission. This is essential to ensure the overall health and well-being
of individuals and communities impacted by pandemics, as [2] emphasized.
Finding a mentally unstable person could be a challenge. However, one approach
is to analyze the language and content of their written communication, as this can
provide insight into their thought patterns and potential issues [3]. Following this
methodology, our study utilizes an unsupervised machine learning technique, K-
Means clustering, to identify mental health-related keywords during the COVID-19
pandemic, building upon the work of previous researchers in this area [11, 12].
Sabaruddin [4] analyzed mental health tweets during Malaysia’s first Movement
Control Order period, and they also created a Malay tweet dataset using six unique
search strategy keywords associated with mental health, including hashtags related to
COVID-19 and emotions like sadness and anger. Common supervised machine
learning classifiers, Naive Bayes and support vector machine, were applied to predict
the sentiment of the tweets. The six keywords associated with mental health were
#KitaJagakita, #covid19, sedih (sad), marah (angry), kecewa (disappointed), and
bunuh diri (suicide). These keywords were found to be associated
with the emotions felt by Malaysians during the early stages of the pandemic. The
study’s findings showed that the SVM algorithm had the highest accuracy rate of
approximately 84%.
Then, Lee [5] analyzed the public attitudes toward COVID-19 on Twitter and
found that the top keywords related to the tweets were ‘wash your hands,’ ‘stay
home,’ ‘wear a mask,’ ‘social distancing,’ ‘quarantine,’ and ‘mental,’ indicating that
people are concerned about both their physical and mental health. This study looked
at how people felt about COVID-19 on the well-known social network platform
Twitter. The main goals of this research are to find out what COVID-19-related topics
individuals are talking about, why they are interested in them, and how their feelings
have changed over time. Additionally, the study looks for any connections between
the outbreak and conspiracy theories previously unknown to the broader public.
Furthermore, Hou et al. [6] studied how the public uses social media. They
collected popular Weibo texts mentioning COVID-19 with the keywords ‘coronavirus’
and ‘pneumonia’ between December 27, 2019, and May 31, 2020. Data
mining and text analysis were used to show the trend in the level of public atten-
tion at different stages. They also conducted topic analysis, sentiment analysis, and
a correlation analysis between public interest level and the number of COVID-19-
related cases. However, this study only focuses on Sina Weibo, and public attention
data on Facebook and Twitter is omitted.
Therefore, this research aims to explore and identify the crucial terms associated
with mental health issues in Malaysia amid the COVID-19 outbreak through an
analysis of tweets shared on Twitter. The clustering technique was utilized to cluster
the tweets due to its ability to recognize hidden data patterns that might not be
noticeable at first glance. Grouping similar data points can emphasize relationships
and trends we might otherwise have missed [7–10]. Thus, K-Means clustering was chosen to perform
this study’s clustering task. In this paper, we have organized our content in a specific
manner. Firstly, we have presented all related works in the mental health field in
the introduction section. Secondly, we will discuss the proposed methods in Sect. 2,
followed by a discussion of the experimental work and analysis of results in Sect. 3.
Lastly, we conclude in the final section by discussing the obtained results and future
work.
2 Proposed Methodology
Figure 1 shows the methodology process for this research. Three processes were
involved in executing this research: (1) data collection, (2) data preprocessing, and
(3) clustering.
1. Data Collection
For this study, we gathered tweets from April 2020 until the end of the lockdown in
Malaysia in October 2021. The data was scraped from Twitter using the Twint Python
library, based on the mental health keywords listed in Table 1. We saved the data in a CSV file
with several properties, but only the ‘tweet’ attribute is relevant. Then we combined
the tweets in a new file and removed any duplicates by looping through the documents.
This ensured that the dataset contains all the tweets related to the keywords [13, 14].
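A minimal sketch of this collection and deduplication step is shown below, assuming Twint is used as described; the keyword list and file names are illustrative, since Table 1 is not reproduced here.

```python
import twint
import pandas as pd

KEYWORDS = ["stress", "depression", "pressure"]   # placeholder for the Table 1 keywords

# Scrape tweets for each mental health keyword over the lockdown period
for keyword in KEYWORDS:
    c = twint.Config()
    c.Search = keyword
    c.Since = "2020-04-01"
    c.Until = "2021-10-31"
    c.Lang = "en"
    c.Store_csv = True
    c.Output = f"{keyword}_tweets.csv"
    twint.run.Search(c)

# Combine the per-keyword files, keep only the 'tweet' attribute, and drop duplicates
frames = [pd.read_csv(f"{k}_tweets.csv") for k in KEYWORDS]
tweets = pd.concat(frames, ignore_index=True)[["tweet"]].drop_duplicates(subset="tweet")
tweets.to_csv("combined_tweets.csv", index=False)
```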
2. Data Preprocessing
In the data preprocessing stage, the raw Twitter data is converted into an under-
standable and accessible format before proceeding to the clustering stage. This step
is crucial for the model to learn data trends precisely. It comprises tokenization,
normalization, and calculating term frequency-inverse document frequency. Before
tokenization, the data undergoes preprocessing to eliminate any Bahasa Malaysia
phrases, enabling only English terms to be processed. After the dataset is cleaned,
the data transformation step concludes, and the processed data is then saved in a CSV
file for further processing with TF-IDF calculation.
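The TF-IDF weight referred to here follows the standard formulation (libraries such as scikit-learn apply smoothed variants of it):

$$w_{t,d} = \mathrm{tf}_{t,d} \times \log\frac{N}{\mathrm{df}_t}$$

where tf_{t,d} is the frequency of term t in tweet d, N is the total number of tweets, and df_t is the number of tweets containing term t.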
A. Tokenization
The technique of tokenization is used to separate the words in a textual document. This
is especially helpful for textual data, where sentences are decomposed into smaller
units or tokens and to eliminate unwanted characters such as commas and spaces. For
this research, each tweet is considered as a document that is treated as a string and
passed through the tokenization process to extract the tokens present in the document.
B. Normalization
Normalization enables uniform processing by putting every word on an equal footing.
It can be done by the following steps (a minimal code sketch follows the list):
1. Case normalization: converting all text to lowercase or uppercase, so that no
redundant word forms with the same meaning remain.
2. Punctuation removal: removing all punctuation, symbols, and numbers to keep
only words that will be meaningful.
3. Stopword removal: removing common words with little meaning, such as ‘the,’
‘and,’ ‘is,’ etc.
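A minimal sketch of the tokenization and normalization steps is given below; NLTK is assumed here, as the paper does not name the preprocessing library.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads: nltk.download("punkt"); nltk.download("stopwords")
STOPWORDS = set(stopwords.words("english"))

def normalize(tweet):
    text = tweet.lower()                    # 1. case normalization
    text = re.sub(r"[^a-z\s]", " ", text)   # 2. drop punctuation, symbols, numbers
    tokens = word_tokenize(text)            # tokenization into word units
    return [t for t in tokens if t not in STOPWORDS]  # 3. stopword removal

print(normalize("Feeling so stressed & anxious during MCO 2.0!!!"))
# -> ['feeling', 'stressed', 'anxious', 'mco']
```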
After reducing the dimensionality, the K-Means model was fitted to the newly
reduced dataset using the ‘fit(X)’ method, where X was a 2D array-like object. The
K-Means algorithm used is as follows:
1. Initialize k centroids, where k is the number of clusters to create. In this case,
several numbers of clusters were tested: k = 3, k = 4, and k = 5.
2. At each of up to m iterations (m is the maximum iteration number), the distance
from each document to each of the k cluster centers is calculated, and each
document is assigned to the cluster with the shortest distance.
3. Recompute the centroid of each cluster as the mean of all the data points
assigned to that cluster.
4. Steps 2 and 3 are repeated until the centroids no longer change or a maximum
number of iterations is reached. Final clusters are formed by the data points
assigned to each centroid.
In this project, the ‘predict(X)’ method was utilized to allocate clusters for each
tweeted data point. After the clusters were assigned, we grouped the tweets dataset
based on their cluster and computed the average for words in each cluster. The
resulting list contains words and their frequencies, sorted in descending order to
identify the most frequent words in each cluster. By assigning k values to the K-
Means algorithm in scikit-learn, clusters were formed. It is important to note that
each cluster from each k value had different tweets assigned to it.
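A minimal sketch of this clustering step with scikit-learn is shown below. The dimensionality-reduction method is not named in the extracted text, so TruncatedSVD is used purely as a placeholder, and the parameter values are illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

tweets = pd.read_csv("combined_tweets.csv")["tweet"]

# TF-IDF weighting of the cleaned tweets
X_tfidf = TfidfVectorizer(max_features=5000).fit_transform(tweets)

# Dimensionality reduction before clustering (placeholder method)
X = TruncatedSVD(n_components=100, random_state=42).fit_transform(X_tfidf)

for k in (3, 4, 5):
    km = KMeans(n_clusters=k, max_iter=300, random_state=42)
    km.fit(X)               # fit(X) on the reduced 2D array
    labels = km.predict(X)  # predict(X) assigns each tweet to a cluster
    print(k, pd.Series(labels).value_counts().to_dict())
```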
The inter-cluster results were visualized using word clouds. Word clouds provide an easy way to spot the
most significant keywords in text data. For each cluster, we generated a word cloud
showcasing the top 100 words with larger font sizes, indicating a higher frequency of
occurrence. Although word clouds are useful in identifying and highlighting signif-
icant keywords, they have limitations in representing clusters of related words and
capturing the text’s quality accurately. Additionally, a lack of attributes or informa-
tion will reduce the accuracy of word clouds, which can lead to a loss of nuance and
context, as explained in the previous section.
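A minimal sketch of the per-cluster word clouds, assuming the Python wordcloud package (the sample token lists are drawn from Table 3 for illustration):

```python
from collections import Counter
from wordcloud import WordCloud

# tokens_by_cluster maps each cluster id to the tokens of its assigned tweets
tokens_by_cluster = {
    0: ["stress", "time", "release", "covid", "game", "stress"],
    1: ["depress", "bipolar", "mood", "post", "video"],
    2: ["pressure", "depression", "mental", "illness", "suicide"],
}

for cluster_id, tokens in tokens_by_cluster.items():
    freqs = Counter(tokens)
    wc = WordCloud(width=800, height=400, background_color="white",
                   max_words=100)   # keep only the 100 most frequent words
    wc.generate_from_frequencies(freqs)
    wc.to_file(f"cluster_{cluster_id}_wordcloud.png")
```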
Based on the word clouds, we compared the impact of different clusters on text
analysis, as presented in Table 3. Each cluster was represented by the words with
the highest frequency in that cluster, which became the defining category for the
mental health associated with that cluster. However, some words appeared in multiple
clusters, making it difficult to determine which cluster belonged to which mental
health category. To address this issue, we found that setting k = 3 produced the most
reliable and accurate results, with each cluster distinctly associated with the mental
health categories of ‘stress,’ ‘depress,’ and ‘pressure,’ as shown in Fig. 5. Then, from
the word clouds, we analyzed and identified associated words related to the three
desired categories (Stress, Depress, and Pressure), shown in Table 4.
And finally, we assigned mental health categories to each cluster based on the
identified keywords and labeled them accordingly (Table 3).
Table 3 List of top words related to mental health categories for all clusters, k
Cluster 0 (Stress): Stress, main, time, release, covid, game, happy, call, member, start
Cluster 1 (Depress): Depress, bipolar, depressing, mood, post, video, lawyer, cane, mention, series
Cluster 2 (Pressure): Pressure, depression, mental, illness, suicide, blood, cooker, distress, depress, like
Cluster 3 (Stress): Stress, mistake, cut, review, rate, relief, environment, exam, baby, order
We observed that the k = 3 clustering method accurately clustered the dataset into
three distinct mental health categories, whereas the k = 4 and k = 5 methods split
some categories across two clusters. Hence, the K-Means clustering algorithm with
k = 3 produced the most accurate results in clustering the given dataset.
4 Conclusion
To summarize, the K-Means algorithm revealed a similar graph structure for different
k clusters but with overlapping outcomes of cluster groups for higher k values.
The dataset’s objectives were met by analyzing the TF-IDF weight results and top
keywords from tweets about mental health during COVID-19. The frequencies of
these keywords were discovered through exploratory data analysis, and the mental
health category of each cluster was determined after visualizing and analyzing the
results. To present the most significant keywords in each cluster, we utilized word
clouds, with different font sizes indicating their frequency of occurrence. Our findings
suggest that the optimum k value for clustering the tweets into three types of mental
health—‘stress,’ ‘depress,’ and ‘pressure’—is 3. The related keywords for cluster
‘stress’ include time, release, COVID, and game, while those for cluster ‘pressure’ are
depression, mental illness, and suicide. Lastly, for cluster ‘depress,’ the keywords
consist of bipolar, mood, post, and video. However, it is important to note that
word clouds capture only vocabulary, not quality, and may not always represent the
clusters accurately.
Acknowledgements The authors would like to express their gratitude towards UNITAR Inter-
national University for funding this research under UNITAR Internal research grant “Machine
Learning Based Model for The Early Prediction of Potential Pandemic Infection Clusters” and the
College of Computing of Universiti Teknologi MARA (UiTM) for providing an excellent research
environment to carry out this study.
References
1. Posel D, Oyenubi A, Kollamparambil U (2021) Job loss and mental health during the COVID-
19 lockdown: Evidence from South Africa. PLoS ONE 16(3 March):1–9. https://doi.org/10.
1371/journal.pone.0249352
2. Cullen W, Gulati G, Kelly BD (2020) Mental health in the COVID-19 pandemic. QJM
113(5):311–312. https://doi.org/10.1093/QJMED/HCAA110
3. Zhang T (2020) Data Mining can play a critical role in COVID-19 linked mental health studies,
January
4. Sabaruddin RA (2021) Malay tweets: discovering mental health situation during COVID-19
pandemic in Malaysia, pp 58–63
5. Lee JH (2021) Understanding public attitudes toward COVID-19 with twitter. In: 2021 IEEE
systems and information engineering design symposium, SIEDS 2021, pp 1–6. https://doi.org/
10.1109/SIEDS52267.2021.9483708
6. Hou K, Hou T, Cai L (2021) Public attention about COVID-19 on social media: an investigation
based on data mining and text analysis. Personality and Individual Differences 175(December
2020):110701. https://doi.org/10.1016/j.paid.2021.110701
7. Ahmad A, Yusof R, Zulkifli NSA, Ismail MN (2021) J Inform Commun Technol 20(4):651–676.
https://doi.org/10.32890/jict2021.20.4.8
8. Huan Z, Pengzhou Z, Zeyang G (2018) K-means text dynamic clustering algorithm based on
KL divergence. In: Proceedings—17th IEEE/ACIS international conference on computer and
information science, ICIS 2018, pp 659–663. https://doi.org/10.1109/ICIS.2018.8466385
9. Gustriansyah R, Suhandi N, Antony F (2019) Clustering optimization in RFM analysis based
on k-means. Indonesian J Electri Eng Comput Sci 18(1):470–477. https://doi.org/10.11591/ije
ecs.v18.i1.pp470-477
10. Marutho D, Hendra Handaka S, Wijaya E, Muljono (2018) The determination of cluster number
at k-mean using elbow method and purity evaluation on headline news. In: Proceedings—
2018 international seminar on application for technology of information and communication:
Abstract The risk map for infectious disease shows the importance of the Geograph-
ical Information System (GIS) and spatial social network analysis and visualisation
(SSNAV) as a preparedness and response tool to strengthen the capacity for assessing
health risks. The current mapping method still needs to be revised to detect the poten-
tial risk areas of the disease due to the need for more dynamic spatial and social
elements, especially in identifying human mobility effects in detecting missing tuber-
culosis (TB) cases. This study has combined GIS-MCDM and SSNAV techniques to
evaluate whether this combination will enhance TB’s general existing disease hotspot
mapping in Klang, Selangor. The social network structure of selected TB cases in
Klang as actors (nodes) and human mobility (home-workplace) data as edges has
been used to investigate social network mobility structures, analyse the relationships
among the nodes and study their edges regarding their network centrality. The main
finding has revealed that the higher the node’s centrality in the network structure,
the higher the chance the node influences the TB spread in the whole network after
comparing the network graph results with the GIS mapping technique. Combining
these techniques increases the existing mapping capabilities towards enhancing the
understanding of how diseases move through the population and creating a reliable
potential risk map in Malaysia.
1 Introduction
Tuberculosis (TB) causes about 1.5 million yearly deaths [1]. Most of these cases
occur in developing countries, where job opportunities are often the pull factor for the
foreign workforce. Consequently, the influx of foreign workforce, i.e. immigrants,
becomes one of the main factors in the spread of TB [2]. Malaysia is no exception
and is one of the countries that have recorded a constant number of cases. Cases of
TB in Malaysia recorded a higher mortality rate than expected by the World Health
Organization (WHO). In addition, a medical officer from the Institute of Respiratory
Medicine (IPR) Kuala Lumpur, Dr. Nurhayati Mohd. Marzuki said that in 2017, there
were 26,168 active TB cases in the country involving 2098 deaths.
According to the Health Director-General, Dr. Noor Hisham Abdullah (2019),
Selangor and Sabah had the highest number of TB cases, 5071 and 5008, respectively,
in 2018. The mortality rate for TB cases in 2018 was 6.6 per 100,000, an increase
from 6.5 per 100,000 in 2017. Sabah recorded the highest death toll, and a follow-
up investigation in Selangor indicates a need for more comprehensive control and
detection of new cases.
The Malaysian Ministry of Health (MOH) has established guidelines for control-
ling the disease, but these guidelines must be implemented more efficiently and kept
up to date. Indeed, findings from the national TB technical records in
2015 indicate that screening methods for high-risk groups must be further strength-
ened to increase the detection rate of TB cases [3]. Altshuler et al. in [1] also supported
the statement that the current method still fails to detect the spread of the disease and
diagnose TB patients in remote areas. This situation is due to several factors, partic-
ularly the lack of effectiveness of existing methods or less comprehensive systems
in detecting TB cases.
For example, although the molecular epidemiological technique has the advan-
tage of diagnosing TB in the human body [4], it does not consider geographical
and environmental factors. TB cases are influenced not only by human factors
but also by environmental risk factors such as the quality
of the water flow of a place, the type of settlements and human mobility [4]. Other
factors include poverty, inadequate care and medical facilities, the environment and
unhealthy lifestyles [5–12].
Therefore, the current method must be combined with other techniques to improve
the case management of this complex disease. A geospatial or GIS technique of
applying risk mapping can enhance the detection of actual TB transmission in
high-risk areas. It may also improve targeted inspection efforts on field contact
tracing. Several experts explained spatial methods used in studying TB epidemi-
ology, including the spatial distribution, spatial clusters, and predictors of spatial
patterns for the role of congregate settings and the household [3, 13, 14].
The researchers identified TB hotspots and highlighted the potential benefits of
increased community coverage in reducing cases, emphasising the need for effective
population interventions targeting high-risk areas. However, GIS and SNA must be
combined to tackle social dynamic or spatial movement elements. A local researcher
[5] has developed a GIS-MCDM mapping technique for identifying the TB hotspots
in Malaysia. However, the study also suggested considering human factors in the
map for significant predictive spots. This study is conducted to create an SSNAV and
GIS-based risk mapping to limit the spread of TB and detect potentially risky areas
in Klang, Selangor.
GIS applications encompass a wide range of uses across many fields due to the development
of GIS technology over the past few decades. It facilitates business by substan-
tially saving time, cost, and effort in nearly every sector [15]. GIS is used to raise
awareness and spread information about the environment, natural resources, possible
hazards and risks, and planned urban routes [16]. As the primary use of GIS, mapping
is seen as essential: maps are the visual representation of the information that GIS
stores in the database. For many entities in the modern world, maps constitute a basic
necessity.
Numerous new industries, such as health care and education, have incorporated
GIS into their use cases. The burden of epidemics, their geographic distribution, and
the factors influencing health-related states or occurrences in a given population have
all been mapped using GIS [17]. The geographic distributions of illness prevalence
(including communicable and non-communicable diseases), the trend of disease
transmission, and the spatial modelling of environmental factors influencing disease
occurrences have all been mapped using GIS [18]. GIS is also utilised for temporal
analysis, cause-and-effect analysis, cognate models, and geographical modelling
[19].
Multicriteria decision-making (MCDM) applications are also combined with GIS
to make choices or decisions based on multiple criteria or parameters. These appli-
cations evaluate and compare alternatives against criteria, considering each crite-
rion’s relative importance. In finance, decision-makers often need to select the best
combination of investments based on various factors such as risk, return, liquidity,
and diversification. The MCDM method helps evaluate investment alternatives and
construct optimal portfolios that align with the investor’s goals and risk preferences.
In the context of health and epidemiology, GIS-MCDM has been applied in site
selection for hospital or emergency medical service facilities [20, 21] and in iden-
tifying high-risk populations of disease [5, 6, 22]. GIS researchers [5] have also
suggested the implementation of social-human movement or network factors for
producing an integrative geospatial assessment of the potential spots of TB.
Social network analysis (SNA) has numerous applications, especially for
analysing social media networks. SNA is widely used to analyse social media
networks like Facebook, Twitter, and LinkedIn. It helps understand user interactions,
identify influential users, detect communities or interest groups, and track informa-
tion diffusion patterns in online social networks. By examining the network ties and
social influence within a population, researchers can understand how information,
attitudes, and behaviours influence the communities.
SNA has been applied in epidemiology to study the spread of diseases, under-
stand transmission patterns, and inform public health interventions, especially in
contact tracing efforts during disease outbreaks. Researchers can identify individ-
uals who may have been exposed to an infectious disease by mapping their contacts
and interactions [7]. This analysis also helps identify potential cases, assess the trans-
mission risk, effectively target interventions, and identify clusters or subgroups with
higher connectivity that may be at a greater risk of spreading disease. The SNA can
be utilised towards prevention and control strategies for those groups via shaping
human behaviours and interactions in social systems [23].
3 Research Methodology
Due to its TB scenario and environment dynamics, this study was conducted in the
Klang district in Selangor, Malaysia. As shown in Fig. 1, the detailed methodology
workflow is planned in stages to achieve the research objectives. First, the study
determines TB risk factors and ranks them based on their specific category per expert
recommendation from the Selangor State Health Department (JKNS), followed by
other stages.
Data were collected based on the five influential risk factors
contributing to local TB, as shown in Table 1. Inclusively, the study utilised two
types of data: non-spatial and spatial. Spatial data contain information about a specific
geographical location on the earth’s surface, such as coordinates on the base map. In
contrast, non-spatial data are independent of geographic location, such as risk factors
for TB.
A five-Likert scale and rank-sum were selected as the MCDM technique for
data processing risk factors, weight, and mapping. This technique is relatively more
straightforward and does not involve complex pairwise comparisons. For example,
the risk calculation group factor consists of five sub-factors in MCDM, and each
sub-factor has its indicator to make up a sum of 5 for risk group rank (Table 2). TB’s
patient characteristics and risk level calculation are shown in Formula (1), where
the HGR_Scale for every TB case (0 to 25) was ranked based on classes from the
factor range scale of 1 to 5.
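Formula (1) itself is not reproduced in this extract. Assuming the overall risk level is simply the sum of the five sub-factor scores, a plausible form (not the paper's exact notation) would be:

$$\text{HGR\_Scale} = \sum_{i=1}^{5} s_i, \qquad s_i \in [0, 5]$$

where $s_i$ is the score assigned to the i-th risk sub-factor, giving an overall range of 0 to 25 that is then re-ranked into five classes.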
After determining the overall risk level (HGR_Scale) of the TB, the risk mapping
of TB cases was created using inverse distance weighted (IDW) interpolation. This
method was conducted after using reclassification and weighted overlay tools and
overlaying the actual cases of TB, as illustrated in Fig. 2. The map shows five cate-
gories: the blue-coloured class indicates low-risk areas, and the red-coloured class
shows high-risk areas. The numbers of TB are distributed based on the patient’s
location or address, which was also included in the map.
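For reference, IDW estimates the value at an unsampled location as a distance-weighted average of nearby observations; this is the standard formulation rather than one quoted from the paper:

$$\hat{z}(x_0) = \frac{\sum_{i=1}^{n} w_i\, z(x_i)}{\sum_{i=1}^{n} w_i}, \qquad w_i = \frac{1}{d(x_0, x_i)^{p}}$$

where $z(x_i)$ are the observed risk values at the TB case locations, $d(x_0, x_i)$ is the distance from the prediction point to observation i, and p is the power parameter (commonly p = 2).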
The SNA technique was also used for analysing the patterns of relationships
and the structure of social networks to gain insights into various aspects of social
behaviour, information flow, and influence within a network. SNA is frequently
represented graphically using nodes (points) and edges (lines) to show the complexity
of social networking. SNA can be categorised into some types of approaches, such
as degree centrality and betweenness centrality.
The social network centrality connectivity measure looks at in-degree (number
of inbound links) and out-degree (number of outbound links) as distinct measures.
Network centrality measures a node’s influence based on the number of links it has
to other nodes in the network. In this way, the network centrality of the disease
network indicates how well connected a node is, how many links its connections have,
and so on through the network. The detailed information needed in this analysis is
shown below.
Nodes = Home Location and Workplace Location.
Edges =
• Source (Home Location)
• Target (Group of Workplace Locality)
• Weight.
For example, from the TB cases in the Klang spreadsheet, a further filtering process
was applied and a new spreadsheet was created containing complete home address and
workplace data for use as nodes in the human mobility study. The origin–destination data of
this human mobility consist of 518 nodes. The formatting and attribute table of nodes
were also extracted before being imported into Gephi software. After importing the
node attribute table onto Gephi, the node visualisation, as shown in Fig. 3, consists
of 518 nodes. The line in the network graph has a direction since this edge represents
human mobility (OD), which consists of source-target attributes. The source node
represents the home location, and the target represents the workplace location. The
edge visualisation appears after importing the edge attribute table onto Gephi.
Selected TB cases with complete home and workplace address information partic-
ipated in this SNA analysis, which comprised 139 workplace nodes. A selected node
was used as the primary node, and buffering analysis was made from ArcGIS with a
diameter of 250 m, similar to the high-risk rank for the factory distance in one of the
TB risk factors in this study. Target nodes (workplace) that fall within the buffer zone
will be grouped into one group as a target. A new column was added as a weight to
be used to analyse their social network graph and their centrality.
The home and destination of human mobility data are tagged as a unique ID to
make it easier to interpret and read when exporting to SNA software. In Fig. 4, the
edge (human mobility) weightage was taken from the result of the GIS-MCDM final
score. The eigenvector values were scaled in 5 levels, the same as the GIS-MCDM
process, and the value was used as a map interpolation parameter. The human mobility
analysis needs to be carried out with the help of SNA and integrated with GIS to make
this study more realistic and reliable. These filtered data are used as a separate risk
factor to export into social network analysis software to study human mobility, social
network graphs, and network centrality (Fig. 4). The number on the nodes represents
the group of workplace IDs, while the arrow shows the amount of TB casework at
the same workplace group. In the social network graph below, the highest-centrality
nodes appear more prominent and darker.
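The study performs this analysis in Gephi; the sketch below shows an analogous computation in Python with networkx, with assumed column names for the origin–destination edge list.

```python
import pandas as pd
import networkx as nx

# Edge list: source = home location, target = workplace group,
# weight = GIS-MCDM final score (file and column names are illustrative)
edges = pd.read_csv("tb_mobility_edges.csv")

G = nx.from_pandas_edgelist(edges, source="source", target="target",
                            edge_attr="weight", create_using=nx.DiGraph())

# In-degree and out-degree treated as distinct connectivity measures
in_deg = nx.in_degree_centrality(G)
out_deg = nx.out_degree_centrality(G)

# Eigenvector centrality, weighted by the GIS-MCDM score on each edge
eig = nx.eigenvector_centrality(G, weight="weight", max_iter=1000)

# The highest-centrality workplace groups are candidate transmission hotspots
print(sorted(eig.items(), key=lambda kv: kv[1], reverse=True)[:10])
```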
Generally, the two leading indicators used in this study are environmental and human
factors. These essential risk indicators of TB are used in this study, as suggested by
previous studies (Table 3). Five factors were chosen: urbanisation, built-up area,
population, and type of housing as environmental factors while risk group as human
factor. Environmental factors, including climate and land usage, affect TB bacterium
survival, operations, and growth. Environmental variables influence daily life and
are essential in causing health disparities across geographic locations. Human and
environmental factors considerably impact human health directly or indirectly. Mean-
while, human is a field that considers both the physical and mental features of persons
at high risk of contracting TB.
Based on these five factors, land use (urbanisation) and housing type are potential
dominant factors in the study area. According to TB scholars [30], the burden of
tuberculosis (TB) disease is typically thought to be higher in urban settings than in
rural settings due to overcrowding, high disease (HIV) prevalence, and occupational
transmission. However, it is still being determined whether there is a difference in the
prevalence of recurring tuberculosis between rural and urban areas. Scientists [31],
on the other hand, agreed that the patterns depended on dwelling types. Housing
features could be helpful in an ecological study of tuberculosis. These findings have
significant design and health implications for Asian cities that are becoming denser
and higher. TB is considered a poor man’s sickness. It has also been linked to poor
living conditions. This study investigates the link between TB outcome and housing
characteristics, which reflects socioeconomic standing.
The high-risk area of the existing high TB risk factor is shown in Fig. 5 using
interpolation and overlay analysis of GIS as studied by ref. [7]. These two techniques
are static presentations because they depend on the quality of the disease data, which
only refers to the patient’s place of residence. Infectious diseases also spread due
to the patient’s place of mobility, such as working and recreational places. Besides
this, interpolation results are only as good as the quality of the input data. If there are
errors or biases in the data, these can lead to misleading results. Similarly, overlay
analysis heavily relies on the accuracy and quality of input data layers. If the data
used in the analysis are not accurate, up to date, or complete, the analysis can produce
ambiguous results. To mitigate these drawbacks, it is essential to use high-quality data and apply
appropriate statistical techniques such as human mobility data and SNA techniques.
The map of the TB hotspot area mainly focuses on the central part of Klang district.
The dots point represents TB distribution cases, and raster grouping values were
created into classes in the form of five quantile classes. Red represents a higher-risk
area of TB cases, while blue represents a low-risk area of TB cases.
In this integration technique, the additional factor, the eigenvector centrality value
from the SNA technique, was exported into GIS-MCDM analysis. The result shows
a target (workplace) using integration GIS-MCDM from TB risk factor rank and
SNA that have already been determined. The five classes shown in the map (Fig. 6)
represent additional eigenvector centrality values.
SNA offers a multidimensional view of the complex interactions between individ-
uals and their social environments rather than the geographical proximity elements
as displayed in GIS-MCDM techniques. Based on social connections and interac-
tions, SNA can help identify critical locations or communities with higher TB risk
(Table 4). Public health officials can pinpoint areas where TB transmission might be
more likely by analysing patterns of social relationships, such as who interacts with
whom and how frequently.
SNA can also reveal how TB spreads within communities by tracing infection
paths through social networks. This understanding is crucial for designing targeted
interventions and controlling outbreaks. In local exploratory findings, [10] has also
demonstrated the capabilities of visual analytics and exploratory analysis of mobility
data to enhance the static GIS mapping of TB. The findings also have
shown that the SNA of human mobility factors can potentially influence infectious
disease by integrating with GIS mapping. The area where the high-centrality nodes
overlap the TB hotspot, in the centre of the map, is primarily red.
The high-risk comparative map in Table 4 shows that the existing TB risk area
map only depicts a few high-risk areas compared to other analysis techniques. The
high-risk area shows that integrating GIS-MCDM with SNA produces more high-risk
areas. It shows that new high-risk areas emerge when SNA integration is used, potentially
revealing new locations of high-risk TB areas. Using integration techniques from
GIS-MCDM and SNA depicts the highest percentage of high-risk areas compared
to other techniques. This highest risk area percentage proves the role of SNA-GIS
in enhancing the general mapping of TB and detecting potential TB areas such as
using GIS and GIS-MCDM only as conducted by previous studies [5–7, 11, 12, 24,
28, 32, 33].
5 Conclusion
Risk disease mapping is one approach in geospatial technology for analysis and
communication in public health and epidemiology. Human mobility is one of the
main factors that can cause TB outbreaks in Malaysia, but this factor is considered
silent in producing a precision disease mapping. Therefore, this study has combined
GIS-MCDM and SNA methods to evaluate the results better than general TB disease
mapping using GIS-based techniques. SNA is based on the theoretical constructs
of sociology and the mathematical foundations of graph theory, where it is used to
study network structure and understand how it influences health. The combination of GIS-MCDM
and SNA techniques has identified risk areas more precisely because the result of
previously performed TB case overlay and verification with a risk map using GIS-
MCDM-SNA or SSNAV is better than the result produced using only the GIS-MCDM
or general GIS mapping method. SNA is also proven to be a beneficial technique to
better understand the movement of individuals between specific locations and the
contacts among different groups of people, both of which are essential for predicting the
future movement and interaction patterns of infectious diseases.
References
11. Rasam ARA, Mohd Shariff N, Dony JF, Maheswaran P (2017) Mapping risk areas of tubercu-
losis using knowledge-driven GIS model in Shah Alam, Malaysia. Pertanika J Soc Sci Humanit
2:135–144
12. Ridzuan N, Abdul Rasam A, Isa M, Shafie F (2021) Spatial interaction between lifestyles
and tuberculosis: an expert and public participatory GIS in Malaysia. Int J Geoinformatics
17(5):178–192. https://doi.org/10.52939/ijg.v17i5.2033
13. Gómez-Barroso D, Rodriguez-Valín E, Ramis R, Cano R (2013) Spatio-temporal analysis of
tuberculosis in Spain, 2008–2010. Int J Tuberc Lung Dis. https://doi.org/10.5588/ijtld.12.0702
14. Tomita A, Smith CM, Lessells RJ, Pym A, Grant AD, de Oliveira T, Tanser F (2019) Space-time
clustering of recently-diagnosed tuberculosis and impact of A.R.T. scale-up: evidence from an
H.I.V. hyper-endemic rural South African population. Sci Rep. https://doi.org/10.1038/s41598-
019-46455-7
15. Shah P (2021) IGiS for managing cities smartly-urban geoinformatics. Int Arch Photogramm,
Remote Sens Spatial Inf Sci—ISPRS Arch. https://doi.org/10.5194/isprs-Archives-XLVI-4-
W5-2021-509-2021
16. Patil NB, Nalawade SS (2016) Applications of GIS to smart cities. Int J Recent Adv Eng
Technol
17. Masimalai P (2014) Remote sensing and geographic information systems (GIS) as the applied
public health and environmental epidemiology. Int J Med Sci Public Health. https://doi.org/10.
5455/ijmsph.2014.081020141
18. Sabesan S, Palaniyandi M, Das PK, Michael E (2000) Mapping lymphatic filariasis in India.
Ann Trop Med Parasitol. https://doi.org/10.1080/00034983.2000.11813582
19. Palaniyandi M (2013) GIS for mapping updates of spatial spread and the ecological reasoning
of JE transmission in India (1956–2012). J Geomatics
20. Boyacı AÇ, Şişman A (2022) Pandemic hospital site selection: a GIS-based MCDM approach
employing pythagorean fuzzy sets. Environ Sci Pollut Res Int 29(2):1985–1997. https://doi.
org/10.1007/s11356-021-15703-7
21. Liu K (2022) GIS-based MCDM framework combined with coupled multi-hazard assessment
for site selection of post-earthquake emergency medical service facilities in Wenchuan, China.
Int J Disaster Risk Reduction 73:102873. https://doi.org/10.1016/j.ijdrr.2022.102873
22. Hijriani H, Cahyani A (2021) Web GIS-based assessment using S.A.W. methods to identify
high-risk areas of tuberculosis transmission and incidence in Bandar Lampung City. J Phys:
Conf Ser 1751:012033. https://doi.org/10.1088/1742-6596/1751/1/012033
23. Mekouar S (2018) Social network analysis: basic concepts, tools, and applications. In: Hand-
book of research on contemporary perspectives on web-based systems. https://doi.org/10.4018/
978-1-5225-5384-7.ch017
24. Kaur KK, Said SM, Lim PY, Ismail SNS (2020) Urbanisation and tuberculosis in Peninsular,
Malaysia (2011–2015). Malays J Med Health Sci
25. Liu Y, Li X, Wang W, Li Z, Hou M, He Y, Guo X (2012) Investigation of space-time clusters
and geospatial hot spots for the occurrence of tuberculosis in Beijing. Int J Tuberc Lung Dis.
https://doi.org/10.5588/ijtld.11.0255
26. De Queiroga RPF, de Sá LD, Nogueira JA, de Lima ERV, Silva ACO, Pinheiro PGOD, Braga
JU (2012) Spatial distribution of tuberculosis and relationship with living conditions in an
urban area of Campina Grande—2004 to 2007. Rev Bras Epidemiol. https://doi.org/10.1590/
S1415-790X2012000100020
27. Feske ML, Teeter LD, Musser JM, Graviss EA (2011) Including the third dimension: a spatial
analysis of TB cases in Houston Harris county. Tuberculosis. https://doi.org/10.1016/j.tube.
2011.10.006
28. Maciel EL, Pan W, Dietze R, Peres RL, Vinhas SA, Ribeiro FK, Palaci M, Rodrigues RR,
Zandonade E, Golub JE (2010) Spatial patterns of pulmonary tuberculosis incidence and their
relationship to socioeconomic status in Vitoria, Brazil. Int J Tuberc Lung Dis 14(11):1395–1402
29. Lapshina IS, Tsybikova EB, Kotlovskiy MY (2022) Groups at high risk of developing respira-
tory tuberculosis among adult population of Kaluga oblast. Tuberc Lung Dis. https://doi.org/
10.21292/2075-1230-2022-100-11-20-28
30. Mutembo S, Mutanga JN, Musokotwane K, Kanene C, Dobbin K, Yao X, Whalen CC (2019)
Urban-rural disparities in treatment outcomes among recurrent TB cases in Southern Province,
Zambia. BMC Infect Dis. https://doi.org/10.1186/s12879-019-4709-5
31. Cobey S, Gouma S, Parkhouse K, Chambers BS, Ertl HC, Schmader KE, Grad YH (2018)
Poor immunogenicity, not vaccine strain egg adaptation, may explain the low H3N2 influenza
vaccine effectiveness in 2012–2013. Clin Infect Dis. https://doi.org/10.1093/cid/ciy097
32. Azewan MDH, Abdul Rasam AR (2020) Disease mapping and health analysis using free and
open source software for geospatial (FOSS4G): an exploratory qualitative study of tuberculosis.
In: Alias N, Yusof R (eds) Charting the sustainable future of ASEAN in science and technology.
Springer, Singapore. https://doi.org/10.1007/978-981-15-3434-8_43
33. Abdul Rasam AR, Jumali WNS, Abdul Jalil I, Muhamad Jaelani L (2023) Susceptibility risk
index mapping of population at tuberculosis epidemic risk. J ASIAN Behav Stud 8(24):53–65.
https://doi.org/10.21834/jabs.v8i24.423
Evaluation of Machine Learning
Algorithms for Early Prediction of Liver
Disease
1 Introduction
The liver is a major organ in the body and the main metabolic site of xenobiotics,
medicines and nutrients [1]. Liver disease contributes to the deaths of around 2 million
people globally [2]. Diagnosing liver disease involves measuring certain biomarkers
in blood such as enzymes and proteins [3]. Common biomarkers used for liver func-
tion testing include total bilirubin (TB), direct bilirubin (DB), alkaline phosphatase
(ALP), alanine transaminase (ALT), aspartate aminotransferase (AST), total proteins
(TP), albumin (Alb) and albumin-to-globulin (AG) ratio [3]. TB indicates the level
of bilirubin in blood, where bilirubin informs about the amount of waste generated
by the liver. DB is the conjugated bilirubin. ALP and AST are enzymes present in
the liver and other organs. Alb and globulin are proteins in which levels change in
case of liver disease.
However, the levels of these biomarkers may not change in early stages of liver
disease and are mostly identified when the liver is partly damaged [4]. In such cases,
early diagnosis of liver disease is life-saving and prevents many complications.
In this respect, machine learning algorithms (MLAs) have proven successful in
diagnosing of medical conditions and thereby reducing the risk linked to these condi-
tions. MLAs have been applied in classification and prediction of different liver
conditions. For instance, classification and prediction of hepatitis and cirrhosis was
made using Naïve Bayes (NB) and support vector machine (SVM) where SVM
was shown to be a better and more efficient classifier in terms of time and accuracy [5].
SVM also showed high accuracy in classifying liver diseases when compared
to other algorithms such as Bayesian network and random forest (RF) [6]. SVM, as
well as logistic regression (LR) classifier, also showed high performance in predicting
liver disease with accuracy > 75% [7]. NB and LR alongside other models including
RF and artificial neural networks (ANN) were used to identify patients at risk of fatty
liver for patients above 30 years old [8]. This study highlighted the importance of
addressing missing values in datasets and normalization, and applied the synthetic
minority oversampling technique (SMOTE) [8].
Subsequently, SMOTE was applied in another study for predicting liver disease
and improved accuracy to > 98% [9]. Similar results were obtained in the study by Gupta
et al. [10] where SMOTE proved effective in balancing the data prior to applying
the MLAs [10]. However, in the latter study, RF and adaptive boosting (AdaBoost)
gave higher classification accuracy than other MLAs. Likewise, in another study
RF gave an accuracy of 98% when used for classifying liver disease based on liver
function tests [11].
Considering the differences in the results between previous studies, this study
has built on the findings of previous studies by evaluating seven MLAs for classi-
fying liver diseases based on characteristic features related to liver function tests.
These algorithms included logistic regression (LR), decision tree (DT), random
forest (RF), AdaBoost, extreme gradient boost (XGBoost), support vector machine
(SVM) and Naïve Bayes (NB). Prior to application of the analytics the study
assessed the data preprocessing approach in terms of data cleaning, imputation and
exploratory data analysis. Models were then evaluated after application with and
without hyperparameter tuning.
In particular, the research tackled the following objectives: (a) building a predictive
model for early diagnosis of liver disease; (b) analyzing risk factors related to liver
disease; (c) comparing several MLAs in terms of accuracy of identification of liver
diseases; and (d) evaluating the performance of MLAs based on evaluation metrics
including accuracy, recall/sensitivity, AUC-ROC and F1-score.
2 Methods
2.1 Dataset
Dataset used in this study was the Indian Liver Patient Dataset (ILPD) (https://archive.
ics.uci.edu/dataset/225/ilpd+indian+liver+patient+dataset) and was obtained from
University of California Irvine Machine Learning Repository [12]. ILPD was gener-
ated from patients in Northeast of Andhra Pradesh (India). It consisted of 11 features
for 583 patients that were: gender of patients, presence or absence of a liver disease
and biomarkers (of liver function tests level in blood). Table 1 shows the details of
the biomarkers in the dataset. The gender distribution of males to females in the
dataset was 441:142. Moreover, 416 (71.4%) patients had liver disease. All patients
included in the dataset were of age below 90 years old.
Seven machine learning models were used for classification and included LR, DT,
RF, AdaBoost, XGBoost, SVM and NB. The software requirements for applying the
models included: data format: CSV; language: Python 3.8.10; package manager:
Anaconda 2.1.2; interactive computing platform: Jupyter Notebook 6.4.12; and
Python libraries: Pandas, NumPy, Matplotlib and Seaborn, Scikit and Sklearn.
The LR model was based on probability and predicted a binary outcome, namely the
presence or absence of liver disease [14]. DT encompassed tree-like models that
utilized a natural human-like decision process by splitting the data into multiple sets
and further splitting those sets into other sets, and so on [15]. RF comprised an
ensemble method that combines multiple decision trees, each grown individually;
thus, predictions were based on a group of models [1]. XGBoost was
based on gradient boosting (GB) with log loss as the cost function, but used a more
advanced implementation of GB that can handle missing values [16]. AdaBoost
was based on adaptive boosting and was useful for short decision trees [17].
SVM divided datasets based on decision boundaries in n-dimensional space [1]. NB
was based on Bayes’ theorem with the strong naïve assumption that the
features are independent of one another [18]. Each model was evaluated before and after
hyperparameters’ tuning. Table 2 shows the hyperparameters used for each model.
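Table 2 is not reproduced in this extract. The sketch below shows how the seven models could be trained and tuned with scikit-learn's GridSearchCV; the hyperparameter grids, file name and column names are assumptions for illustration, not the paper's actual settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

df = pd.read_csv("ilpd.csv")               # ILPD records (583 rows, assumed headers)
df = df.dropna()                           # drop the few records with missing values
X = pd.get_dummies(df.drop(columns=["Selector"]), columns=["Gender"])
y = (df["Selector"] == 1).astype(int)      # 1 = liver disease

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    "LR": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "DT": (DecisionTreeClassifier(), {"max_depth": [3, 5, 10]}),
    "RF": (RandomForestClassifier(), {"n_estimators": [100, 300]}),
    "AdaBoost": (AdaBoostClassifier(), {"n_estimators": [50, 100]}),
    "XGBoost": (XGBClassifier(eval_metric="logloss"), {"max_depth": [3, 6]}),
    "SVM": (SVC(probability=True), {"C": [1, 10], "kernel": ["rbf", "linear"]}),
    "NB": (GaussianNB(), {}),
}

for name, (model, grid) in models.items():
    search = GridSearchCV(model, grid, scoring="accuracy", cv=5)
    search.fit(X_train, y_train)
    print(name, round(search.score(X_test, y_test), 3))
```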
For model validation, the following metrics were used: accuracy, precision, recall,
area under the receiver operating characteristic curve (AUC-ROC) and F1-score [4].
These metrics are calculated taking into account true positives (TP), false positives
(FP), true negatives (TN) and false negatives (FN). Accuracy was calculated as the
number of correct predictions divided by the total number of predictions (Eq. 1).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
Precision was calculated as the number of true positives divided by the total
number of positive predictions (Eq. 2).
$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$
Recall was defined as the ability of the model to detect positive samples (Eq. 3).
$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$
AUC-ROC indicated the area under the ROC that separated the TP from the FP
(Eq. 4).
$$\text{AUC} = \int_{0}^{1} \text{ROC}(t)\, dt \tag{4}$$
Here, TP are true positives, TN are true negatives, FP are false positives, and
FN are false negatives.
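Continuing the previous sketch, these metrics map directly onto scikit-learn helpers, which compute Eqs. (1)–(4) from the confusion-matrix counts:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Evaluate one tuned model (here, the last fitted GridSearchCV) on the test set
y_pred = search.predict(X_test)               # predicted class labels
y_prob = search.predict_proba(X_test)[:, 1]   # probability of the positive class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("AUC-ROC  :", roc_auc_score(y_test, y_prob))
```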
Exploratory data analysis (EDA) showed that men outnumbered women and repre-
sented most patients: the numbers of men and women in the study were 441
(75.6%) and 142 (24.4%) respectively. Liver disease in men was high in patients >
40 years old and that explained the high percentage of patients with liver diseases in
the dataset (71.4%) [19].
EDA was followed by correlation analysis that indicated high correlation among
ALT, AST, TB, DB and liver disease [3]. On the other hand, age and TP did not show
a high correlation with liver disease (Fig. 2). Thus, the correlation map indicated
that age was the least important variable, whereas AG ratio and Alb were the most
important.
ML models showed high performance in classifying liver disease. This was evident
in the measured metrics (Table 3), which presents the performance of the models prior
to hyperparameter tuning.
In this sense, the accuracy of the models was in the range of 0.66–0.8 with the
XGBoost model showing the highest accuracy. This latter model showed high perfor-
mance in other metrics. Thus, values for recall, AUC-ROC, precision and F1-score
of the XGBoost model were 0.89, 0.8, 0.87 and 0.78 respectively. It is worth noting
that NB showed a higher recall of 0.95 yet performed poorly on the other
metrics.
Of all the models, LR and SVM showed the worst performance, which confirmed
the findings in other studies. In other studies, RF and XGBoost showed metric values
in the range of 76–100% and that outweighed other algorithms such as LR, SVM,
KNN and DT [4, 7, 20, 21].
Yet in the present study, performance of SVM improved post-tuning of hyperpa-
rameters (Table 3). This was seen in the recall value that increased from 86 to 99%.
Moreover, all the other metrics’ values for SVM were above 96%. This assured the
4 Conclusions
In conclusion, the objectives of the study were achieved: seven MLAs for predicting
liver disease at an early stage were successfully established. The MLAs showed high
prediction accuracy based on eight features related to liver function tests, with
performance depending on the tuning of the hyperparameters. Hyperparameter tuning
improved the predictions of several models, including SVM, RF, XGBoost and AdaBoost.
However, the models were developed on only 583 patients' records, and such a limited
sample size could give rise to overfitting. Future work will therefore benefit from
applying the machine learning models to a larger sample. Moreover, including patient
characteristics related to socioeconomic status, lifestyle, health state, beliefs and
certain protected characteristics could further improve the classification of the disease.
Acknowledgements The authors would like to thank UNITAR International University for
supporting this paper.
References
Ruipeng Tang, Narendra Kumar Aridas, and Mohamad Sofian Abu Talip
Abstract With the promotion of the Internet of Things, big data, and other technologies
in agricultural production, farmers have become accustomed to searching for crop
cultivation information on the Internet. However, traditional information search is
time-consuming, inefficient, and inaccurate, so farmers urgently need an efficient, fast,
and accurate information recommendation method. This study therefore proposes an
agricultural information recommendation model based on a matrix decomposition
knowledge graph algorithm (MDKG algorithm). It introduces a matrix decomposition
algorithm based on the knowledge graph, which constructs and trains the interaction
matrix between users and agricultural information. It then solves the decomposed
interaction matrix and obtains the correlation score between users and agricultural
information through calculation. Experimental results show that the MDKG algorithm
outperforms the DKN and RippleNet algorithms in the AUC and ACC of information flow
clicks and in the recall and precision of cold start recommendations, which indicates
that the algorithm can deeply mine user preference characteristics to improve
click-through rates and performs better in cold start scenarios. It can also better
alleviate the negative impact of data sparseness, which improves the efficiency with
which farmers obtain agricultural information.
1 Introduction
With the rapid growth of agricultural information, traditional keyword and similar-word
search methods can no longer meet farmers' needs for crop cultivation information
because they are time-consuming and of low accuracy [1]. Although the traditional
collaborative filtering (CF) recommendation method can summarize a user's search
habits and preferences from the user's past search content [2], it requires item data
to be available when the recommendation model is trained; otherwise, it faces a cold
start problem, and it is difficult to query the side features of entity resources. The
concept of the Knowledge Graph (KG) [3] was first proposed by Google. It takes keyword
retrieval as its core and builds a "question–answer" model to present effective
information to users. Since information in the agricultural field is extensive and has
certain regionality, timeliness, and diversity, knowledge graphs for agricultural
information retrieval have become a current research hotspot.
Yusof et al. [4] pointed out that mobile application systems have been widely used in
people's daily lives; their work proposes an expert system for diagnosing crop diseases
and insect pests, which is a useful contribution to the agricultural field. In order to
improve agricultural productivity and economic benefits, Komiya and Abe [5] suggested
the use of information and communication technology: they acquired agricultural
knowledge from web resource databases and established an agricultural knowledge system
grounded in interconnected open data. Wang et al. [6] constructed the Ningxia rice
knowledge map with the help of professional books and rice data centers. First, the
model and data layers were constructed; then knowledge fusion, reasoning, and
completion were performed to form the rice knowledge map. Wang et al. [7] constructed
a knowledge graph of rice diseases. They used rice disease entries in large-scale
knowledge bases such as Baidu Encyclopedia as data sources and applied pattern-matching
methods to extract knowledge from the data. After data integration and other knowledge
graph construction processes, rice diseases were presented by visualizing the disease
knowledge graphs. Cui et al. [8] proposed using the rich agricultural technology
resources on the Internet to build an agricultural question-and-answer system based on
a set of common rice questions, which could address the high agricultural technology
requirements, shortage of funds, and insufficient manpower in rice planting operations.
Liu et al. [9] proposed using deep learning methods to build an agricultural production
question-and-answer system, since systems that search the Internet for answers about
agricultural planting knowledge return a large amount of useless information and
require manual secondary screening.
The above studies on knowledge graphs in agriculture mostly focus on the data of a
particular crop in the agricultural Internet of Things, such as rice diseases, and
rarely address farmers' retrieval of agricultural information on the Internet. Although
the knowledge graph can transform the recommendation problem into a ranking problem and
allow a question search to be directed to the answer, it is difficult to combine
knowledge from different data sources in a single knowledge graph. It
2 The MDKG Algorithm
The algorithm uses the knowledge graph to represent agricultural information as entity
vectors, providing data support for the later learning and training of the matrix
decomposition algorithm [10]. It then constructs the interaction matrix between users
and agricultural information and matches the entity characteristics; the matrix
decomposition model selects the t nearest agricultural information and user matrices
for training; finally, the matrix decomposition model is used to solve the decomposed
user and agricultural information matrices, and the prediction score between a user and
agricultural information is obtained through calculation. The implementation process of
this algorithm is shown in Fig. 1.
The interaction matrix is constructed between agricultural information and users, and
the user characteristic matrix and the agricultural information characteristic matrix
are obtained by decomposing it, as shown in Formula (1):

Z = X Y^K    (1)

In Formula (1), X_{i \times r} represents the r-dimensional user matrix after matrix
decomposition, Y_{j \times r} represents the r-dimensional agricultural information
matrix after matrix decomposition, and Z represents the interaction matrix composed of
the r-dimensional user matrix and agricultural information matrix. The objective
function of the model training process is shown in Formula (2):
H = \sum_{m=1}^{i}\sum_{n=1}^{j}\left(X_m Y_n^K - z_{mn}\right)S_{mn} + \omega_1\left\|X_m\right\|_R^2 + \omega_2\left\|Y_n\right\|_R^2    (2)
In Formula (2), S_{mn} indicates whether the user's rating of the recommended
information exists: if a rating exists, S_{mn} is 1; otherwise, S_{mn} is 0. z_{mn}
represents the user's true rating of the recommended information, X_m and Y_n represent
the r-dimensional user and agricultural information feature vectors after matrix
decomposition, X_m Y_n^K represents the predicted rating, and (X_m Y_n^K - z_{mn})
represents the error between the predicted and actual values. \|\cdot\|_R^2 denotes the
Frobenius norm, and \omega_1 and \omega_2 are the regularization weights constraining
the user and agricultural information feature matrices. In order to address the fitting
problem of the recommendation model, this study introduces \omega-parameter weighted
regularization. It uses the data parameters between the user, c_{x_m}, and the
agricultural information, c_{y_n}, to reduce the transition probability of the objective
function. The improved loss function is shown in Formula (3):
H = \sum_{m=1}^{i}\sum_{n=1}^{j}\left(X_m Y_n^K - z_{mn}\right)S_{mn} + c_{x_m}\left\|X_m\right\|_R^2 + c_{y_n}\left\|Y_n\right\|_R^2    (3)
The matrix decomposition algorithm is used to solve the decomposed user matrix
and agricultural information matrix, and the prediction score between the user and
agricultural information is obtained through calculation. The process of matrix
decomposition is shown in Fig. 2.
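For illustration, a minimal NumPy sketch of regularized matrix factorization in the spirit of Formulas (2) and (3) is given below; the toy interaction matrix, latent dimension and learning rate are assumptions, a squared-error form of the loss is used, and the knowledge graph similarity term of Formula (4) is omitted.

```python
# Minimal sketch of regularized matrix factorization (in the spirit of
# Formulas (2)-(3)); data and hyperparameters are illustrative assumptions,
# not the authors' settings.
import numpy as np

def factorize(Z, S, r=8, lr=0.01, reg=0.05, epochs=200, seed=0):
    """Z: user x item rating matrix; S: 0/1 mask of observed ratings."""
    rng = np.random.default_rng(seed)
    i, j = Z.shape
    X = rng.normal(scale=0.1, size=(i, r))   # user factors
    Y = rng.normal(scale=0.1, size=(j, r))   # agricultural information factors
    for _ in range(epochs):
        E = S * (X @ Y.T - Z)                # error on observed entries only
        X -= lr * (E @ Y + reg * X)          # gradient step with L2 penalty
        Y -= lr * (E.T @ X + reg * Y)
    return X, Y

# Hypothetical toy interaction matrix (0 = unobserved):
Z = np.array([[5, 0, 3], [4, 2, 0], [0, 1, 4]], dtype=float)
S = (Z > 0).astype(float)
X, Y = factorize(Z, S)
print(np.round(X @ Y.T, 2))                  # predicted user-item scores
```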
The objective function expression after matrix decomposition is:
Fig. 2 Matrix decomposition process
H = \sum_{m=1}^{i}\sum_{n=1}^{j}\left(X_m Y_n^K - z_{mn}\right)S_{mn} + \omega_1\sum_{m=1}^{i} c_{x_m}\left\|X_m\right\|_R^2 + \omega_2\sum_{n=1}^{j} c_{y_n}\left\|Y_n\right\|_R^2 + \omega_3\sum_{n=1}^{j}\left(Y_n - \sum_{Y_t} G(Y_n, Y_t)\,Y_t\right)    (4)
In Formula (4), \sum_{m=1}^{i}\sum_{n=1}^{j}\left(X_m Y_n^K - z_{mn}\right) is built by
the matrix decomposition model, the second and third terms are regularization terms to
avoid overfitting, and the last term is the potential information that uses the
knowledge graph to represent the similarity of agricultural information. c_{y_n}
represents the j neighbor sets of agricultural information y_n, and G(Y_n, Y_t) is the
similarity function. This study uses the cosine similarity function [11], whose value
range is [−1, 1], as shown in Formula (5):
G(Y_n, Y_t) = \frac{\sum_{m=1}^{r}\left(X_{mw} Y_{nw}\right)}{\sqrt{\sum_{m=1}^{r}\left(X_{mw}\right)^2}\,\sqrt{\sum_{n=1}^{r}\left(Y_{mw}\right)^2}}    (5)
In Formula (5), w represents the dimension obtained through model training, and its
normalization is shown in Formula (6). This study uses the gradient descent method to
minimize the objective function, solving the user feature matrix X and the agricultural
information feature matrix Y to establish a feedback model [12], as shown in
Formula (7):
U(X, Y) = \sum_{n=1}^{i} S_{mn}\left(X_m Y_n^K - z_{mn}\right)Y_n + \omega_1\sum_{m} j_{x_i} Y_m    (7)
After solving the feature matrices generated from the interaction matrix, the feature
values are mapped into a v-dimensional space. The closer the geometric distance between
features, the stronger their correlation. The correlation between q_n and q_m is
obtained through vector cosine similarity, as shown in Formula (8):
G(q_m, q_n) = \frac{\sum_{f=1}^{j} q_{mf}\, q_{nf}}{\sqrt{\sum_{f=1}^{j}\left(q_{mf}\right)^2}\,\sqrt{\sum_{f=1}^{j}\left(q_{nf}\right)^2}}    (8)
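A small sketch of the cosine similarity used in Formulas (5) and (8) is shown below; the feature vectors are hypothetical.

```python
# Small sketch of the cosine similarity in Formulas (5) and (8);
# the feature vectors below are hypothetical.
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity in [-1, 1] between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

q_m = np.array([0.4, 0.1, 0.8])
q_n = np.array([0.5, 0.0, 0.7])
print(round(cosine_similarity(q_m, q_n), 3))
```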
After processing the training set, the feature vector V_{X_m M_n} is constructed as
shown in Formula (9):

V_{X_m M_n} = \left\{ G(X_m, M_n)_1, G(X_m, M_n)_2, \ldots, G(X_m, M_n)_{\varphi} \right\}    (9)
The training set (X_m, V_{X_m M_n}) is input into the system resource recommendation
model, which obtains a decision function through the optimization function. It also
generates a list of recommendation results and obtains a weight set under
multi-dimensional features. This set is the user's short-term recommendation basis.
According to the survey, users are interested in agricultural information resources,
and a user's interest completely determines the user's preference for a certain
resource. This study therefore adds the impact of user preferences to the
recommendation results, uses knowledge graph technology to analyze users' long-term and
short-term preferences, and adds user preference models and impact feedback models to
the knowledge graph. In this way, accurate recommendations to the user can still be
achieved when the data are reduced or the user's interests change. The weight update
model between hybrid knowledge graph entities is shown in Formula (10):
In Formula (10), HKW_{mn} represents the weight value between the characteristic values
m and n in the agricultural information, w_{mn} represents the interest degree value
obtained by the user interest migration model for the agricultural information in the
knowledge graph, and e represents the relationship between user m and agricultural
information n. score represents the user's rating of the agricultural information, and
\delta represents the normalization factor. The normalized initial weight of
\delta \times score is 1 in order to avoid excessively high ratings affecting accuracy.
The user similarity G(X_m, M_n) from centralized training is deeply fused, combining
the agricultural information features of the knowledge graph with user interest
migration. The similarity feature G(X_m, M_n)_{mix} is extracted, and a hybrid feature
model is established, as shown in Formula (11):
3 Experimental Design
In order to verify the recommendation effect of the proposed method, this study
conducts experimental work in the environment of the TensorFlow deep learning framework
and Python 3.7.1. The experiment crawled the public dataset of the agricultural
economics section of the U.S. Agricultural Information Service website. After
processing the crawled data in the background, users' interaction data on agricultural
economics were obtained, and the interaction dataset between agricultural economics
items and target users was constructed. In the experiment, the dataset was divided into
a training set and a test set at a ratio of 7:3 [13], and the average result of
multiple experiments was used as the final result. The experimental data are shown in
Table 1.
The recommendation model in this study is used to find the relationship between
information and users, so it is important to measure performance in the two scenarios
of information flow clicks [14] and cold start recommendations [15]. In the information
flow click scenario, the Area Under the Curve (AUC) and Model Accuracy Rate (ACC) are
selected as evaluation indicators to predict users' click rate on agricultural
information. The recall and precision rates for different sample sizes are selected as
the evaluation indices in the cold start scenario. The two scenarios are evaluated
through the above indicators.
(1) Information flow click prediction
The AUC in this experiment is the area under the Receiver Operating Characteristic
(ROC) curve. The true positive and false positive rates are obtained from the
prediction results and used as the ordinate and abscissa, respectively, to generate the
ROC curve, and the area under the ROC curve is used to judge the quality of the model.
The AUC is the probability that a positive example, based on the prediction results, is
ranked before a negative example [16], as shown in Formula (12):

AUC = \frac{2\sum_{\delta}\mathrm{sort}_{\delta} - \alpha(\alpha + 1)}{2\alpha\beta}    (12)
ACC = \frac{TP + TN}{P + N}    (13)
Cold start clicks are one of the important indicators for measuring the performance of
a recommendation model, as they reflect the model's ability to alleviate the data
sparsity problem [17]. Recall represents the ratio of the number of true positive
examples to the total number of positive examples, and precision represents the ratio
of the number of true positive examples to the number of all predicted positive
examples. The higher the recall and precision, the better the cold start recommendation
performance of the model. These indicators are shown in Formulas (14) and (15).
Recall = \frac{TP}{TP + FN}    (14)

Precision = \frac{TP}{TP + FP}    (15)
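As a worked illustration of the rank-based AUC in Formula (12), the sketch below reads α as the number of positive examples and β as the number of negative examples (an assumption, since the excerpt does not define them); the labels and scores are hypothetical, and ACC, recall and precision follow directly from the confusion matrix as in Formulas (13)–(15).

```python
# Sketch of the rank-based AUC in Formula (12); no tie handling, for
# illustration only.  Labels and scores below are hypothetical.
import numpy as np

def auc_rank(y_true, y_score):
    """AUC = (2 * sum of ranks of positives - a*(a+1)) / (2*a*b)."""
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score), dtype=float)
    ranks[order] = np.arange(1, len(y_score) + 1)    # 1-based ranks, ascending score
    a = int(np.sum(y_true == 1))                     # number of positive examples
    b = int(np.sum(y_true == 0))                     # number of negative examples
    return (2 * ranks[y_true == 1].sum() - a * (a + 1)) / (2 * a * b)

y_true  = np.array([1, 0, 1, 1, 0, 0])
y_score = np.array([0.9, 0.3, 0.8, 0.6, 0.4, 0.2])
print(round(auc_rank(y_true, y_score), 3))           # 1.0 for this toy example
```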
In order to obtain the optimal parameters for model performance, this study uses the
AUC as the evaluation index, where dim is the feature dimension, len is the link length
during preference propagation, size is the size of the preference diffusion set of each
layer, \varepsilon_1 represents the weight of the knowledge graph embedding loss, and
\varepsilon_2 represents the weight of the L2 regularization term. After parameter
testing, the parameters of the model are set to feature dimension dim = 8, preference
propagation link length len = 2, preference diffusion set size per layer size = 32,
loss weight \varepsilon_1 = 0.5, and L2 regularization term weight
\varepsilon_2 = 10^{-7}.
Fig. 3 AUC model scores for users receiving information of different magnitudes
Fig. 4 ACC model scoring for users receiving different amounts of information
Fig. 5 Cold start recall for different proportions of training data sets
Fig. 6 Cold start recommendation accuracy after training on different proportions of data
4 Conclusion
References
1. Sunitha M, Adilakshmi T (2018) Music recommendation system with user-based and item-
based collaborative filtering technique [M]. Networking communication and data knowledge
engineering. Springer, pp 267–278
2. Bo L, Ping G, Yue S et al (2015) Optimization and simulation of agricultural knowledge
collaborative filtering recommendation algorithm based on user implicit evaluation [J]. Internet
Things Technol 5(08):76–79
3. Singhal A (2012) Introducing the knowledge graph: things, not strings. Official Google Blog
4. Yusof MM, Rosli NF, Othman M, et al (2018) M-DCocoa: M-agriculture expert system for
diagnosing cocoa plant diseases [M]. Recent advances on soft computing and data mining
5. Komiya K, Abe Y (2013) Question answering system using Q & A site corpus query expansion
and answer candidate evaluation [J]. SpringerPlus
6. Wang D (2020) Research and application of Ningxia rice knowledge graph construction method
[D]. Northern University for Nationalities
7. Wang S (2019) Construction and research on rice disease prevention and control model based
on knowledge graph [D]. Hunan Agricultural University
8. Cui B (2018) Research on sentence similarity calculation in rice FAQ question and answer
system [D]. Nanjing Agricultural University
9. Liu G (2019) Research and development of intelligent question-answering system for agri-
cultural production based on deep learning [D]. Capital University of Economics and
Business
10. Wang Y, Sun Y (2021) Matrix decomposition recommendation algorithm based on GNN [J].
J Comput Eng Appl 57(19)
11. Liao B, Zhang T, Yu J et al (2019) QPR-NN: a recommendation algorithm combining quadratic
polynomial regression and neural network [J]. J Xi’an Jiaotong Univ 53(9):79–87+136
12. Gu J, Xie Z, Wu J, et al (2019) Parallel collaborative filtering recommendation algorithm based
on graph walking [J]. J Intell Syst 14(04):743–751
13. Ge S, Rum SNM (2022) Key points location in infrared images of the human body based on
Mscf-ResNet. Future Internet 14:15. https://doi.org/10.3390/fi14010015
14. Zhao W, Ren X (2020) Matrix decomposition recommendation algorithm integrating label
information and time effect [J]. Inform Control 49(4):472–477,488
15. Shengbao H, Huawei Y, Xiaohui L et al (2022) Cold start recommendation algorithm based on
fused similarity and hierarchical clustering [J]. Small Microcomput Syst 5:985–991
16. Kamarasan M (2016) Unified technique for colour image retrieval based on orthogonal
polynomial model and multiresolution features[J]. Int J Image Data Fusion 7(3):243–263
17. Yue X, Tang D, Shu H et al (2020) Research on improvement of collaborative filtering
recommendation algorithm based on data sparsity [J]
18. Wang H, Zhang F, Xie X et al (2018) DKN: deep knowledge-aware network for news
recommendation [C]. Proceedings of the 2018 world wide web conference, pp 1835–1844
19. Wang H, Zhang F, Wang J et al (2018) Ripplenet: propagating user preferences on the knowledge
graph for recommender systems [C]. Proceedings of the 27th ACM international conference
on information and knowledge management, pp 417–426
20. Zhang R, Zhao N, Fu L et al (2022) Ultrasonic diagnosis method for stainless steel weld defects
based on multi-domain feature fusion[J]. Sens Rev 42(2):214–229
Performance of the Auxiliary
Information Based Hybrid EWMA Chart
with Fast Initial Response
Peh Sang Ng, Huai Tein Lim, Wai Chung Yeong, and Sajal Saha
Abstract The integration of the fast initial response (FIR) feature into a control
chart helps to increase the shift detection speed for processes with start-up quality
problems or inefficient corrective measures following an off-target process. Owing to
this salient feature, this study investigates the performance of the FIR auxiliary
information based hybrid EWMA (denoted as the AIB-HEW-FIR) chart by using the
regression estimator. To evaluate the efficiency of the AIB-HEW-FIR chart in signalling
an off-target process, the Average Run Length (ARL) and Expected ARL (EARL) are used as
performance indicators, and their values are computed using Monte Carlo simulation. The
proposed chart is then evaluated and compared with the auxiliary information based
hybrid EWMA (AIB-HEW) chart without the FIR feature. Results show that the AIB-HEW-FIR
chart is superior to the AIB-HEW chart in detecting infrequent changes in the process
mean.
P. S. Ng · H. T. Lim (B)
Department of Physical and Mathematical Science, Faculty of Science, Universiti Tunku Abdul
Rahman, 31900 Kampar, Perak, Malaysia
e-mail: [email protected]
P. S. Ng
e-mail: [email protected]
W. C. Yeong
School of Mathematical Sciences, Sunway University, 47500 Petaling Jaya, Malaysia
e-mail: [email protected]
S. Saha
Department of Mathematics, International University of Business Agriculture and Technology,
Dhaka, Bangladesh
e-mail: [email protected]
1 Introduction
As claimed by Lucas and Crosier [15], the off-target process can be caused by
start-up problems or ineffective corrective measures taken to address the off-target
process. To improve the detection ability of the CUSUM chart against the start-up
abnormalities, Lucas and Crosier [15] proposed to integrate the fast initial response
(FIR) feature into the CUSUM chart. Subsequently, Lucas and Saccucci [16] and
Steiner [17] integrated the FIR feature into the one-sided and two-sided EWMA
charts, respectively, and the results show that the proposed charts are superior to
their existing counterparts (i.e., without the FIR feature).
Recent studies on FIR include those by Ng et al. [18] and Haq and Bibi [12], in which
the FIR feature was adopted in the auxiliary information-based EWMA and dual CUSUM
charts, respectively, for process monitoring. They showed that the proposed FIR charts
surpassed their existing counterparts.
In this paper, we introduce the FIR feature into the AIB-HEW chart using a
regression-type estimator for process monitoring. The following section gives the
properties of the AIB-HEW-FIR chart, as well as its implementation steps. Section 2
then elaborates on the average run length (ARL) and expected ARL (EARL) computations of
the proposed chart; the performance of the AIB-HEW-FIR chart and the comparison between
the AIB-HEW-FIR and AIB-HEW charts are presented in that section as well. Lastly, the
main conclusions and possible future studies are given in Sect. 3.
Suppose an auxiliary variable B is correlated with the study variable A, where the
correlation between A and B is denoted as \rho. Let (A_{i,j}, B_{i,j}), for
j = 1, 2, \ldots, n, denote the bivariate random sample of size n at time
i (= 1, 2, \ldots) from a bivariate normal distribution, i.e.,
(A_{i,j}, B_{i,j}) \sim N_2(\mu_A, \mu_B, \sigma_A^2, \sigma_B^2, \rho). Here, \mu_A
and \mu_B are, respectively, the population means of variables A and B, while
\sigma_A^2 and \sigma_B^2 are, respectively, their variances. Note that
\mu_A = \mu_{A_0} + \delta\sigma_A, where \delta is the size of the standardized mean
shift and \mu_A = \mu_{A_0} (the in-control
where \bar{A}_i and \bar{B}_i, respectively, denote the ith sample means of A and B.
The mean and variance of D_{A_i} are then E(D_{A_i}) = \mu_A and
V(D_{A_i}) = \sigma_A^2(1 - \rho^2)/n, respectively. Note that n is the sample size of
the chart.
Based on D_{A_i} in Eq. (1), the plotting statistic of the AIB-HEW-FIR chart is given
as:
where
V_i = \lambda_2 D_{A_i} + (1 - \lambda_2)V_{i-1}, \quad 0 < \lambda_2 \le 1.
E(W_i) = \mu_{A_0}    (3)
and
Var(W_i) = \frac{\sigma_A^2\left(1 - \rho^2\right)}{n}\,\frac{\lambda_1^2\lambda_2^2}{\left(\lambda_1 - \lambda_2\right)^2}\left[\sum_{c=1}^{2}\frac{(1 - \lambda_c)^2\left(1 - (1 - \lambda_c)^{2i}\right)}{1 - (1 - \lambda_c)^2} - \frac{2(1 - \lambda_1)(1 - \lambda_2)\left(1 - (1 - \lambda_1)^i(1 - \lambda_2)^i\right)}{1 - (1 - \lambda_1)(1 - \lambda_2)}\right].    (4)
where a is a proportion of the distance from the starting value and b is the adjustment
parameter, defined as \frac{(-2/\log(1-a)) - 1}{19}. To obtain a 50% head start,
Steiner [17] recommended setting the values of a and b to 0.5 and 0.3, respectively.
Following the mean and variance of W_i in Eqs. (3) and (4), respectively, and the FIR
feature in Eq. (5), the upper control limit (UCL_i), central line (CL_i), and lower
control limit (LCL_i) of the AIB-HEW-FIR chart are then derived as:
UCL_i = \mu_{A_0} + K(FA)\sqrt{Var(W_i)},    (6a)

CL_i = \mu_{A_0},    (6b)

and

LCL_i = \mu_{A_0} - K(FA)\sqrt{Var(W_i)},    (6c)
Fig. 2 Mechanism of the AIB-HEW-FIR chart for the exact shift size
Fig. 3 Mechanism of the AIB-HEW-FIR chart for the unknown shift size
To compute the run length properties of control charts, the Monte Carlo simulation
approach was adopted by Haq [8], Haq et al. [10], Abbas et al. [11], and Ng et al. [14]
for the HEW, AIB double EWMA t, AIB EWMA, and AIB VSI EWMA t charts, respectively. In a
similar manner, this paper adopts Monte Carlo simulation with 50,000 replications to
determine the run length properties (i.e., the ARL and EARL) of the AIB-HEW-FIR chart
under the assumption that \mu_{A_0} = \mu_B = 0 and \sigma_A = \sigma_B = 1. Here, the
ARL measures the average number of plotted plotting statistics needed before the first
out-of-control signal is triggered by the control chart when the shift size is known by
the practitioners. The ARL is further categorized as the in-control ARL (ARL0) when the
process is on-target or the out-of-control ARL (ARL1) when the process is off-target,
whereas the EARL measures the expected average number of plotted plotting statistics
needed by a chart to signal an out-of-control situation over a shift interval
(i.e., (\delta_{min}, \delta_{max})) when the exact shift size is not known by the
practitioners.
Note that the value of the in-control EARL is set to be similar to that of ARL0, while
the EARL is known as the out-of-control EARL (EARL1) when the process is off-target. A
control chart that results in smaller ARL1 and EARL1 values is desirable, as this
indicates that the off-target process is detected quickly. A control chart is
considered to perform better than competing charts if it yields smaller ARL1 and EARL1
values when all the charts have the same ARL0.
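A hedged sketch of such a Monte Carlo run-length simulation is given below. Because Eqs. (1), (2) and (5) are not reproduced above, the regression estimator for D_Ai, the outer EWMA recursion for W_i and the Steiner-type FIR factor are assumptions based on the cited literature (Haq [8]; Steiner [17]) rather than the authors' exact definitions.

```python
# Hedged sketch of a Monte Carlo run-length estimate for an AIB hybrid EWMA
# chart with a Steiner-type FIR adjustment; D_Ai, the W_i recursion and the
# FIR factor below are assumptions, not the authors' exact code.
import numpy as np

def run_length(K, lam1=0.10, lam2=0.25, rho=0.50, n=1, delta=0.0,
               a=0.5, b=0.3, max_len=100_000, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    mu_A0 = mu_B = 0.0
    sig_A = sig_B = 1.0                     # in-control assumptions used in the paper
    cov = [[sig_A**2, rho * sig_A * sig_B], [rho * sig_A * sig_B, sig_B**2]]
    var_D = sig_A**2 * (1 - rho**2) / n
    V = W = mu_A0
    for i in range(1, max_len + 1):
        AB = rng.multivariate_normal([mu_A0 + delta * sig_A, mu_B], cov, size=n)
        D = AB[:, 0].mean() + rho * (sig_A / sig_B) * (mu_B - AB[:, 1].mean())  # assumed Eq. (1)
        V = lam2 * D + (1 - lam2) * V                                           # given recursion
        W = lam1 * V + (1 - lam1) * W                                           # assumed Eq. (2)
        s = sum((1 - lam)**2 * (1 - (1 - lam)**(2 * i)) / (1 - (1 - lam)**2)
                for lam in (lam1, lam2))
        s -= (2 * (1 - lam1) * (1 - lam2) * (1 - ((1 - lam1) * (1 - lam2))**i)
              / (1 - (1 - lam1) * (1 - lam2)))
        var_W = var_D * lam1**2 * lam2**2 / (lam1 - lam2)**2 * s                # Eq. (4)
        fa = 1 - (1 - a)**(1 + b * (i - 1))                                     # assumed Eq. (5)
        if abs(W - mu_A0) > K * fa * np.sqrt(var_W):                            # limits (6a)-(6c)
            return i
    return max_len

def arl(K, reps=5_000, **kw):
    rng = np.random.default_rng(1)
    return np.mean([run_length(K, rng=rng, **kw) for _ in range(reps)])

# e.g., an out-of-control ARL for one tabulated K value (slow but illustrative):
# print(arl(K=2.6394, lam2=0.25, rho=0.50, delta=0.50, reps=1_000))
```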
Table 1 presents the K and ARL values for the AIB-HEW and AIB-HEW-FIR
charts when the parameter values of n = 1, ρ ∈ {0, 0.25, 0.50, 0.75, 0.90, 0.95},
(λ1 , λ2 ) ∈ {(0.10, 0.1000001), (0.10, 0.25), (0.10, 0.50), (0.10, 0.75)}, δ ∈ {0,
0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 2.00}, and ARL0 is set to be close to 500. Note
that the values of n and the combination of (λ1 , λ2 ) adopted in this study are set to
be similar to the HEW chart proposed by Haq [8]. From Table 1, the AIB-HEW-
FIR chart is shown to be more efficient than the AIB-HEW chart for all cases when
δ ≤ 1.5, that is, by resulting in smaller ARL1 values. For example, when ρ =
0.95 and δ = 0.25, the ARL1 values for the AIB-HEW-FIR (AIB-HEW) chart are
given as 7.514 (10.692), 7.771 (11.366), 7.799 (11.777), and 7.742 (11.980) when
(λ1 , λ2 ) = (0.1, 0.1000001), (0.1, 0.25), (0.1, 0.5), and (0.1, 0.75), respectively.
Table 1 also shows that the ARL1 values of the AIB-HEW-FIR chart decrease for
all combinations of (λ1 , λ2 ) when the shift size (δ) increases. For example, when
(λ1 , λ2 ) = (0.1, 0.1000001) and ρ = 0.25 is considered, the ARL1 values for δ =
0.25, 0.50, 0.75, 1.00, 1.25, 1.50, and 2.00 are 64.920, 17.540, 7.984, 4.508, 2.865,
2.032, and 1.340, respectively. This indicates that a larger δ is easier to be detected
and thus the number of samples needed to detect the off-target process is smaller.
Additionally, when the ρ value increases, the ARL1 values of the AIB-HEW-FIR
chart decrease. For example, the ARL1 values in Table 1 drop from 20.538 to 19.135,
15.245, 8.764, 3.745, and 1.984 as ρ increases from 0 to 0.25, 0.50, 0.75, 0.90, and
0.95, respectively, when (λ1 , λ2 ) = (0.1, 0.25) and δ = 0.50.
Table 1 ARL values for the AIB-HEW and AIB-HEW-FIR charts with the corresponding K parameter when λ1 = 0.10, λ2 ∈ {0.1000001, 0.25, 0.50, 0.75},
ρ ∈ {0, 0.25, 0.50, 0.75, 0.90, 0.95}, and δ ∈ {0, 0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 2.00} at ARL0 that is close to 500
λ1 = 0.10, λ2 = 0.1000001
AIB-HEW AIB-HEW-FIR AIB-HEW AIB-HEW-FIR AIB-HEW AIB-HEW-FIR
ARL ARL ARL ARL ARL ARL
ρ=0 ρ = 0.25 ρ = 0.50
δ K = 2.3627 K = 2.4729 K = 2.3647 K= 2.4762 K = 2.3605 K = 2.4723
0 500.238 500.285 500.116 500.058 500.176 500.310
0.25 79.637 67.892 75.881 64.920 62.209 52.404
0.50 23.808 18.686 22.545 17.540 18.551 14.151
0.75 11.927 8.508 11.300 7.984 9.350 6.401
1.00 7.327 4.801 6.971 4.508 5.744 3.569
1.25 5.023 3.051 4.764 2.865 3.920 2.305
1.50 3.683 2.148 3.485 2.032 2.885 1.688
2.00 2.286 1.390 2.178 1.340 1.827 1.201
ρ = 0.75 ρ = 0.90 ρ = 0.95
δ K = 2.3623 K = 2.4701 K = 2.3634 K = 2.4726 K = 2.3645 K = 2.4758
0 500.124 500.275 500.073 500.055 500.124 500.176
0.25 38.708 31.410 18.816 14.299 10.692 7.514
2.00 1.295 1.040 1.000 1.000 1.000 1.000
λ1 = 0.10, λ2 = 0.25
δ K = 2.5482 K = 2.6430 K = 2.5462 K = 2.6395 K = 2.5453 K = 2.6394
0 500.044 500.094 500.058 500.094 500.029 500.014
0.25 92.969 81.104 87.659 75.961 71.684 61.681
0.50 26.347 20.538 24.764 19.135 20.175 15.245
0.75 12.685 8.923 11.994 8.328 9.879 6.619
1.00 7.748 4.953 7.338 4.622 6.071 3.673
1.25 5.336 3.144 5.047 2.943 4.185 2.381
1.50 3.936 2.226 3.721 2.097 3.098 1.746
2.00 2.464 1.431 2.343 1.375 1.971 1.227
ρ = 0.75 ρ = 0.90 ρ = 0.95
δ K = 2.5470 K = 2.6417 K = 2.5486 K = 2.6443 K= 2.5503 K = 2.6426
0 500.061 500.061 500.176 500.068 500.003 500.107
0.25 44.107 36.090 20.438 15.459 11.366 7.771
0.50 12.563 8.764 6.146 3.745 3.537 1.984
0.75 6.252 3.814 3.135 1.769 1.869 1.189
1.00 3.878 2.187 1.995 1.236 1.289 1.031
1.25 2.695 1.536 1.464 1.068 1.074 1.003
1.50 2.025 1.248 1.198 1.018 1.012 1.000
2.00 1.372 1.048 1.020 1.001 1.000 1.000
λ1 = 0.10, λ2 = 0.50
δ K = 2.6816 K = 2.7690 K = 2.6790 K = 2.7687 K = 2.6790 K= 2.7671
0 500.044 500.284 500.116 500.094 500.046 500.094
0.25 99.630 86.927 93.798 81.724 76.924 65.917
0.50 27.810 21.501 26.072 19.931 21.166 15.669
0.75 13.167 8.991 12.427 8.358 10.193 6.569
1.00 7.968 4.893 7.518 4.571 6.226 3.632
1.25 5.457 3.110 5.171 2.910 4.301 2.373
1.50 4.032 2.217 3.818 2.094 3.193 1.752
2.00 2.557 1.445 2.429 1.392 2.053 1.240
0.25 47.012 38.513 21.441 15.954 11.777 7.799
0.50 13.031 8.819 6.301 3.698 3.634 1.986
0.75 6.419 3.781 3.234 1.776 1.948 1.203
1.00 3.986 2.187 2.081 1.251 1.336 1.036
1.25 2.786 1.553 1.526 1.076 1.093 1.004
1.50 2.111 1.263 1.239 1.020 1.017 1.000
2.00 1.427 1.054 1.027 1.001 1.000 1.000
λ1 = 0.10, λ2 = 0.75
δ K = 2.7611 K = 2.8494 K = 2.7592 K = 2.8464 K = 2.7585 K = 2.8468
0 500.132 500.284 500.014 500.130 500.009 500.056
0.25 102.003 89.380 96.250 83.311 78.805 67.313
0.50 28.481 21.644 26.681 20.020 21.606 15.739
0.75 13.411 8.938 12.635 8.261 10.365 6.500
1.00 8.075 4.817 7.619 4.489 6.307 3.572
1.25 5.528 3.056 5.232 2.865 4.355 2.340
1.50 4.086 2.193 3.871 2.071 3.240 1.743
2.00 2.600 1.447 2.471 1.394 2.092 1.247
ρ = 0.75 ρ = 0.90 ρ = 0.95
δ K = 2.7613 K= 2.8491 K = 2.7624 K = 2.8495 K = 2.7627 K= 2.8521
0 500.092 500.014 500.107 500.264 500.009 500.061
0.25 48.148 39.005 21.931 16.030 11.980 7.742
0.50 13.255 8.743 6.393 3.631 3.681 1.972
0.75 6.508 3.715 3.280 1.768 1.988 1.209
1.00 4.035 2.164 2.120 1.256 1.366 1.039
1.25 2.828 1.550 1.562 1.081 1.108 1.004
1.50 2.155 1.269 1.265 1.022 1.021 1.000
2.00 1.461 1.058 1.033 1.001 1.000 1.000
Next, we look at the EARL1 performance of the AIB-HEW-FIR chart when the
exact shift size is unknown. All the parameters used in computing the ARL1 values
in Table 1 are used to compute the EARL1 values in Table 2, except the exact δ
is replaced by (δmin , δmax ) = {(0.1, 0.5), (0.5, 1.0), (1.0, 1.5), (1.5, 2.0)}. Table 2
presents the EARL1 performance for the AIB-HEW-FIR and AIB-HEW charts.
Table 2 (for an unknown shift size) demonstrates the same observations as in
Table 1 (for an exact shift size). It is found that smaller EARL1 values are obtained
when larger ρ, λ2 , and (δmin , δmax ) are considered. In addition, the charts’ compar-
isons presented in Table 2 show that the proposed AIB-HEW-FIR chart is superior
to the AIB-HEW chart for all pairs of (λ1 , λ2 ) and (δmin , δmax ), across different ρ
values. For instance, when (λ1 , λ2 ) = (0.1, 0.25) and ρ = 0.25 are considered, the
AIB-HEW-FIR chart results in smaller EARL1 values, i.e., 82.678, 9.486, 3.106,
and 1.681, as compared to 93.191, 13.301, 5.223, and 2.995 for the AIB-HEW chart
when the given intervals (δmin , δmax ) are set as (0.1, 0.5), (0.5, 1.0), (1.0, 1.5), and
(1.5, 2.0), respectively. Thus, the numerical examples in this section show that the
FIR feature enhances the sensitivity of the AIB-HEW chart in detecting infrequent
change in the process mean for both known and unknown shift sizes.
3 Conclusion
This paper investigates the effectiveness of the FIR feature on the AIB-HEW chart in
process monitoring by using Monte Carlo simulations to compute the run length
properties. The run length properties, which include the ARL and EARL, are used to
evaluate the performance of the AIB-HEW-FIR chart. Both the ARL and EARL results show
that the AIB-HEW-FIR chart performs better than the AIB-HEW chart (i.e., it gives
smaller ARL1 and EARL1 values) by allowing infrequent changes in the process mean to be
detected more quickly. This indicates that the FIR feature is useful in enhancing the
detection ability of the AIB-HEW chart and thus provides a practical benefit for
practitioners, as an out-of-control situation can be identified more quickly in a
process with start-up problems. Faster detection of an off-target process will lead to
cost savings through lower production of inferior products, as well as time savings by
reducing rework, across all areas of manufacturing processes. The proposed chart would
therefore appear to have wide applications in industry and manufacturing.
This study can be further extended by considering multiple auxiliary variables in
process monitoring, as more information can be used in designing the AIB-HEW-FIR chart,
which may result in more desirable performance. In addition, future research can
consider integrating the FIR feature into different types of control charts to enhance
the detection speed for processes with start-up quality problems. This study compares
the performance of the proposed AIB-HEW-FIR chart with the traditional AIB-HEW chart
(i.e., without the FIR feature) to highlight the salient feature of integrating the FIR
concept. To evaluate whether the proposed AIB-HEW-FIR chart is more sensitive than its
competitors, future studies can evaluate and
Table 2 EARL1 values for the AIB-HEW and AIB-HEW-FIR charts when λ1 = 0.10, λ2 ∈
{0.1000001, 0.25, 0.50, 0.75}, ρ ∈ {0, 0.25, 0.50, 0.75, 0.90, 0.95}, and various (δmin , δmax ) pairs
at ARL0 that is close to 500
λ1 = 0.10, λ2 = 0.1000001
AIB-HEW AIB-HEW-FIR AIB-HEW AIB-HEW-FIR AIB-HEW AIB-HEW-FIR
EARL1 EARL1 EARL1 EARL1 EARL1 EARL1
ρ=0 ρ = 0.25 ρ = 0.50
δmin δmax K = 2.3627 K = 2.4729 K = 2.3647 K = 2.4762 K = 2.3605 K = 2.4723
0.1 0.5 85.865 75.576 82.201 72.322 69.252 60.231
0.5 1.0 13.111 9.563 12.433 9.010 10.263 7.221
1.0 1.5 5.204 3.215 4.934 3.024 4.072 2.437
1.5 2.0 2.905 1.711 2.759 1.632 2.297 1.404
ρ = 0.75 ρ = 0.90 ρ = 0.95
δmin δmax K = 2.3623 K = 2.4701 K = 2.3634 K = 2.4726 K = 2.3645 K = 2.4758
0.1 0.5 45.396 38.047 22.565 17.846 12.702 9.458
0.5 1.0 6.517 4.231 3.233 1.948 1.914 1.266
1.0 1.5 2.603 1.559 1.422 1.077 1.073 1.006
1.5 2.0 1.548 1.110 1.061 1.004 1.002 1.000
λ1 = 0.10, λ2 = 0.25
δmin δmax K = 2.5482 K = 2.6430 K = 2.5462 K = 2.6395 K = 2.5453 K = 2.6394
0.1 0.5 97.906 87.508 93.191 82.678 78.968 69.375
0.5 1.0 14.097 10.167 13.301 9.486 10.938 7.549
1.0 1.5 5.520 3.316 5.223 3.106 4.336 2.511
1.5 2.0 3.117 1.768 2.955 1.681 2.473 1.444
ρ = 0.75 ρ = 0.90 ρ = 0.95
δmin δmax K = 2.5470 K = 2.6417 K = 2.5486 K = 2.6443 K = 2.5503 K = 2.6426
0.1 0.5 51.743 44.111 25.320 20.202 13.968 10.347
0.5 1.0 6.905 4.381 3.455 2.010 2.053 1.293
1.0 1.5 2.795 1.609 1.515 1.090 1.101 1.008
1.5 2.0 1.658 1.127 1.086 1.006 1.003 1.000
λ1 = 0.10, λ2 = 0.50
δmin δmax K = 2.6816 K = 2.7690 K = 2.6790 K = 2.7687 K = 2.6790 K = 2.7671
0.1 0.5 103.388 92.326 98.425 87.897 83.638 73.597
0.5 1.0 14.685 10.330 13.831 9.645 11.331 7.605
1.0 1.5 5.658 3.284 5.353 3.084 4.451 2.499
1.5 2.0 3.214 1.774 3.049 1.691 2.562 1.457
ρ = 0.75 ρ = 0.90 ρ = 0.95
δmin δmax K = 2.6808 K = 2.7715 K = 2.6818 K = 2.7699 K = 2.6830 K = 2.7712
0.1 0.5 54.930 46.973 26.754 21.165 14.638 10.707
0.5 1.0 7.103 4.367 3.558 2.009 2.129 1.304
1.0 1.5 2.888 1.621 1.577 1.098 1.124 1.009
1.5 2.0 1.729 1.137 1.107 1.006 1.004 1.000
λ1 = 0.10, λ2 = 0.75
δmin δmax K = 2.7611 K = 2.8494 K = 2.7592 K = 2.8464 K = 2.7585 K = 2.8468
0.1 0.5 105.412 94.216 100.405 89.231 85.272 74.966
0.5 1.0 14.955 10.314 14.085 9.596 11.517 7.552
1.0 1.5 5.732 3.236 5.424 3.035 4.508 2.468
1.5 2.0 3.260 1.766 3.096 1.684 2.605 1.458
ρ = 0.75 ρ = 0.90 ρ = 0.95
δmin δmax K = 2.7613 K = 2.8491 K = 2.7624 K = 2.8495 K = 2.7627 K = 2.8521
0.1 0.5 56.204 47.652 27.382 21.445 14.935 10.788
0.5 1.0 7.207 4.301 3.611 1.993 2.168 1.306
1.0 1.5 2.934 1.616 1.612 1.102 1.139 1.010
1.5 2.0 1.767 1.143 1.122 1.007 1.005 1.000
make performance comparisons between the AIB-HEW-FIR chart and other types of control
charts with the FIR concept.
References
1. Montgomery DC (2012) Introduction to statistical quality control, 7th edn. Wiley, New York
2. Khan Z, Gulistan M, Hashim R, Yaqoob N, Chammam W (2020) Design of S-control chart for
neutrosophic data: an application to manufacturing industry. J Intell Fuzzy Syst 38(4):4743–
4751
3. Ali S, Abbas Z, Nazir HZ, Riaz M, Zhang X, Li Y (2021) On developing sensitive nonpara-
metric mixed control charts with application to manufacturing industry. Qual Reliab Eng Int
37(6):2699–2723
4. Yang SF, Arnold BC (2016) Monitoring process variance using an ARL-unbiased EWMA-p
control chart. Qual Reliab Eng Int 32(3):1227–1235
5. Zhou W, Zheng Z, Xie W (2017) A control-chart-based queueing approach for service facility
maintenance with energy-delay tradeoff. Eur J Oper Res 261(2):613–625
Performance of the Auxiliary Information Based Hybrid EWMA Chart … 561
6. Suman G, Prajapati D (2018) Control chart applications in healthcare: a literature review. Int
J Metrol Qual Eng 9:5
7. Roberts S (2000) Control chart tests based on geometric moving averages. Technometrics
42(1):97–101
8. Haq A (2013) A new hybrid exponentially weighted moving average control chart for
monitoring process mean. Qual Reliab Eng Int 29(7):1015–1025
9. Riaz M (2008) Monitoring process mean level using auxiliary information. Stat Neerlandica
62(4):458–481
10. Haq A, Ejaz S, Lee MH, Khan M (2021) A new double EWMA-t chart with auxiliary
information for the process mean. Qual Reliab Eng Int 37(8):3381–3394
11. Abbas N, Riaz M, Does RJ (2014) An EWMA-type control chart for monitoring the process
mean using auxiliary information. Commun Stat-Theor Meth 43(16):3485–3498
12. Haq A, Bibi L (2022) The dual CUSUM charts with auxiliary information for process mean.
Commun Stat-Simul Comput 51(1):164–189
13. Noor-ul-Amin M, Javaid A, Hanif M, Dogu E (2022) Performance of maximum EWMA control
chart in the presence of measurement error using auxiliary information. Commun Stat-Simul
Comput 51(9):5482–5506
14. Ng PS, Khoo MB, Saha S, Lee MH (2022) A variable sampling interval EWMA t chart with
auxiliary information—a robustness study in the presence of estimation error. Alex Eng J
61(8):6043–6059
15. Lucas JM, Crosier RB (1982) Fast initial response for CUSUM quality-control schemes: give
your CUSUM a head-start. Technometrics 24:199–205
16. Lucas JM, Saccucci MS (1990) Exponentially weighted moving average control schemes:
properties and enhancements. Technometrics 32(1):1–12
17. Steiner SH (1999) EWMA control charts with time-varying control limits and fast initial
response. J Qual Technol 31(1):75–86
18. Ng PS, Lau YJ, Lim HT, Yeong WC (2022) Fast initial response for auxiliary information based
EWMA chart for the process mean. In: Proceedings of the 11th international conference on
robotics, vision, signal processing and power applications: enhancing research and innovation
through the fourth industrial revolution. Springer, Singapore, pp 677–683
Evaluation of Machine Learning Models
for Breast Cancer Detection
in Microarray Gene Expression Profiles
Abstract Breast cancer (BC) is a leading global health challenge, with survival
rates varying significantly across regions due to socio-economic disparities and
healthcare accessibility. This research seeks to identify the most efficient machine
learning (ML) classifier for precise BC classification using gene expression data.
Utilizing the CuMiDa database’s microarray BC dataset, which includes 35,983 gene
biomarkers from 146 breast adenocarcinoma patients and 143 normal subjects, the
study employed R-programming for data pre-processing and feature selection. The
Boruta algorithm pinpointed 214 key biomarkers, and the dataset was subsequently
balanced using the SMOTE technique. Among the seven ML classifiers assessed,
the support vector machine (SVM) showcased superior performance metrics such
as sensitivity, specificity, and accuracy, while naïve Bayes (NB) underperformed.
A thorough examination of the BC dataset revealed that SVM is the premier ML
classifier, highlighting its potential for enhancing BC predictive modelling.
1 Introduction
Breast cancer (BC) is the dominant type of cancer in many countries worldwide.
This condition, characterized by rapid and uncontrolled cell growth, predominantly
affects the breast’s duct and lobules. BC is classified based on the conversion of breast
M. N. Abdullah (B)
School of Mathematical Sciences, College of Computing, Informatics and Mathematics,
Universiti Teknologi MARA, Tapah Campus, 35400 Tapah, Perak, Malaysia
e-mail: [email protected]
Y. B. Wah
School of Information Technology, UNITAR International University, Tierra Crest, Jalan SS6/3,
47301 Petaling Jaya, Selangor, Malaysia
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 563
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_40
564 M. N. Abdullah and Y. B. Wah
cells into malignant forms. The main subtypes are invasive ductal carcinoma, which
originates in the ducts and may spread to surrounding breast tissue, and invasive
lobular carcinoma, which begins in the lobules and can metastasize to adjacent organs
and tissues [1].
According to Global Cancer Observatory (GLOBOCAN) 2020 estimates, BC
accounted for 2,261,419 new cases worldwide, leading to approximately 685,000
deaths [2]. Specifically, in Malaysia, BC recorded the highest incidence with 8,418
cases, marking it as the foremost cancer in the region. It was second in mortality with
3,503 deaths. The cumulative risk was 5.29 for incidence and 2.24 for mortality, with
a 5-year prevalence rate of 187.18 cases per 100,000 individuals [3].
BC survival rates exhibit geographical differences, attributed to socio-economic
disparities in access to diagnosis and treatment. In high-income nations, 10 to 15%
of BC cases are identified at stage IV; this figure is substantially higher in middle-
and low-income countries. Limited access to specialized cancer centres and compre-
hensive healthcare correlates with higher mortality rates. Furthermore, around 30%
of patients diagnosed at an early stage risk relapse, even with optimal treatment [4].
There is an urgent need to enhance current diagnostic and therapeutic BC proto-
cols. While numerous prognostic markers have been identified, the accuracy of BC
prognosis remains limited, evidenced by the increasing number of BC tumours. The
introduction of novel prognostic biomarkers, known for their sensitivity and speci-
ficity, is crucial for improving BC treatment and patient outcomes [5]. Genetically,
BC susceptibility arises from common variants identified in genome-wide associ-
ation studies (GWAS) and rarer coding variants. Despite their significance, these
genetic markers explain less than half the familial relative risk (FRR) of BC [6].
Microarray technology has transformed genetic analysis by enabling high-
throughput profiling of gene expression levels. It aids in identifying vital biomarkers
linked to breast cancer and highlights patterns that influence disease progression and
therapeutic response [7, 8]. A comparative analysis of machine learning (ML)
classification models on a microarray breast cancer dataset is pivotal. The CuMiDa database
offers extensive genetic data, while multiple ML models, such as support vector
machine (SVM), random forest (RF), and naïve Bayes (NB), allow a thorough eval-
uation of predictive capabilities. Such assessments are vital to ascertain the most
accurate and robust model for BC classification [9].
Each ML classifier possesses distinctive strengths: SVM is adept with high-
dimensional datasets [10], RF employs multiple decision trees to improve accuracy
[11], and NB uses a probabilistic method to classify samples [12]. Additionally,
k-nearest neighbour (k-NN) operates on proximity principles, classifying samples
based on their nearest neighbour [13]. Given the diversity of these models, this study
endeavours to determine the ML classifier that excels in utilizing gene expression
data for accurate BC classification.
2 Methodology
This study utilized Gene Expression Omnibus (GEO) breast cancer data from Curated
Microarray Database (CuMiDa) [GSE70947] available at https://sbcb.inf.ufrgs.br/
cumida. CuMiDa datasets have been used in more than 30,000 studies of the GEO
database [19]. This dataset encapsulates gene expression profiles from breast cancer
samples, rendering it a pivotal resource for predictive modelling.
gini(T) = 1 - \sum_{j=1}^{n} p_j^2,

P(C = c_k \mid X_i = x_i) = \frac{P(C = c_k)\,P(X = x_i \mid C = c_k)}{P(X_i = x_i)},

x_i, where

d\left(x', x_i\right) = \sqrt{\left(x' - x_i\right)^T\left(x' - x_i\right)}.
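The following small sketch illustrates two of the quantities above, the Gini impurity of a node and the Euclidean distance used by k-NN; the inputs are hypothetical.

```python
# Small illustrative sketch: Gini impurity of a node and the Euclidean
# distance used by k-NN.  The inputs below are hypothetical.
import numpy as np

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the class proportions p_j."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p**2))

def euclidean(x_new, x_i):
    """d(x', x_i) = sqrt((x' - x_i)^T (x' - x_i))."""
    diff = np.asarray(x_new) - np.asarray(x_i)
    return float(np.sqrt(diff @ diff))

print(round(gini([1, 1, 0, 0, 0]), 3))               # 0.48 for a 2/3 class split
print(round(euclidean([1.0, 2.0], [4.0, 6.0]), 3))   # 5.0
```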
LR is one of the most widely used models when the dependent variable is dichotomous.
Penalized logistic regression may be the better choice when there is multicollinearity
and a high number of dimensions, as it provides an alternative procedure for variable
selection compared with the conventional approach [33]. The penalized LR methods used
in this study were ridge LR, lasso LR and elastic net LR; thus, this study applies only
these three penalized LR models in the comparison instead of ordinary binary LR.
As a fundamental binary LR, the log-likelihood of LR can be defined as

l(\beta) = l(\beta; y, X) = \sum_{i=1}^{n}\sum_{j=1}^{p} y_i x_{ij}\beta_j - \sum_{i=1}^{n}\log\left(1 + \exp\left(\sum_{j=1}^{p} x_{ij}\beta_j\right)\right) [34].
For ridge LR, a penalty term is added to the log-likelihood:

l^{ridge}(\beta) = l(\beta) - \lambda\sum_{j=1}^{p}\beta_j^2.
The parameter λ in the ridge function was the tuning parameter determined via
tenfold cross validation.
For lasso LR, the penalized log-likelihood is as follows:

l^{lasso}(\beta) = l(\beta) - \lambda\sum_{j=1}^{p}\left|\beta_j\right|.
The lasso has an advantage over ridge regression since it retains only the important
predictors in the final model, which also improves model interpretability. The
parameter λ in the lasso function was the tuning parameter determined via tenfold cross
validation.
The last penalized method considered in this study was elastic net LR, whose penalized
log-likelihood is defined as follows:

l^{elasticnet}(\beta) = l(\beta) - \lambda_1\sum_{j=1}^{p}\beta_j^2 - \lambda_2\sum_{j=1}^{p}\left|\beta_j\right|.
The dataset was then randomly split into 70% training data and 30% testing data,
striking a balance between model training and evaluation; the 70/30 split ensures a
model learns well yet remains generalizable to new data. Furthermore, each classifier
underwent 1000 iterations to account for potential variation in the random data
splitting, ensuring the reliability and consistency of the results.
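A sketch of this repeated 70/30 evaluation protocol is given below; the study used R, so the Python code, placeholder data and the SVM used as the example classifier are illustrative assumptions.

```python
# Sketch of the repeated 70/30 split protocol described above (the study
# used R); the data and classifier here are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import recall_score, accuracy_score, f1_score

X, y = make_classification(n_samples=286, n_features=214, random_state=0)
metrics = {"sensitivity": [], "specificity": [], "f1": [], "accuracy": []}

for rep in range(1000):                        # 1000 random splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                               stratify=y, random_state=rep)
    y_hat = SVC().fit(X_tr, y_tr).predict(X_te)
    metrics["sensitivity"].append(recall_score(y_te, y_hat, pos_label=1))
    metrics["specificity"].append(recall_score(y_te, y_hat, pos_label=0))
    metrics["f1"].append(f1_score(y_te, y_hat))
    metrics["accuracy"].append(accuracy_score(y_te, y_hat))

for name, vals in metrics.items():
    print(f"{name}: mean={np.mean(vals):.4f}, sd={np.std(vals, ddof=1):.4f}")
```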
Several performance measures were employed to provide a comprehensive evaluation of
the classifiers. Sensitivity, specificity, accuracy, error rate, and F-measure collectively
offer a multidimensional view of the predictive capabilities of each classifier.
Next, a comparative analysis of the classifier performance was done. Results were
meticulously analysed to draw meaningful comparisons and identify potential trends.
The best performing classifier, determined by the predefined performance measure,
was selected for further analysis, ensuring the highest predictive accuracy.
Finally, the feature importance was determined from the best ML classifier perfor-
mance. Features were sorted based on the best predictive performance measure,
providing invaluable insight into their relative importance in the classification
process.
In conclusion, this study stands as a testament to the power of ML classifiers
in breast cancer research. The selected classifier, based on the defined performance
Fig. 1 Flow of methodological process
3 Results
The microarray breast cancer dataset applied in this study was collected from the
CuMiDa database and consisted of 35,983 microarray genes (biomarkers) with 146 breast
adenocarcinoma patients and 143 normal subjects. The objective of this study was to
compare the predictive ability of ML classifier models on this microarray breast cancer
dataset. R programming with the RStudio IDE was used to perform the analysis. Before
proceeding to the analysis, we checked for possible missing values and outliers. There
were no missing values detected in the dataset, but some outliers were detected. The
outliers were not removed from the dataset because this would reduce the sample size;
instead, they were substituted with the median values of the data.
Then, Boruta's algorithm for feature selection was applied to reduce the number of
genes unimportant to breast cancer. Boruta's algorithm is a wrapper feature selection
method that utilizes a random forest classifier. The initial analysis found 172
biomarkers confirmed as important, 35,753 biomarkers confirmed as unimportant, and 56
tentative (unconfirmed) biomarkers. A tentative rough fix analysis was then performed
to decide the importance of the 56 remaining tentative biomarkers. In total, 214
biomarkers were confirmed as important, and the remaining 35,767 biomarkers were
confirmed as unimportant for the prediction of breast cancer. Next, since the data were
not well balanced, we applied an undersampling method using the "ROSE" function in R,
reducing the sample in the breast cancer group to balance it with the normal group.
After undersampling, the dataset consisted of 143 breast cancer patients and 143 normal
subjects. To measure the predictive ability of the ML classifiers, the data were split
into training (70%) and testing (30%) sets to balance training depth with testing
robustness for model validation. The ML classifiers were then fitted using the training
data, and the predictive measures were obtained using the test data. To ensure the
robustness of the results, we repeated the process 1000 times and summarized the means
of each performance measure for the ML classifiers. Based on Table 1, the seven ML
classifiers were evaluated over 1000 iterations, and the mean and standard deviation of
each performance measure, namely sensitivity, specificity, F-measure, accuracy, and
error rate, were recorded for both the training and test sets.
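For readers working in Python, a rough analogue of this feature-selection and balancing pipeline is sketched below; the original study used Boruta and the ROSE package in R, so the BorutaPy and imbalanced-learn calls, as well as the synthetic stand-in data, are assumptions.

```python
# Hedged Python analogue of the feature-selection and balancing steps (the
# study itself used Boruta and "ROSE" in R); synthetic stand-in data only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy                              # pip install Boruta
from imblearn.under_sampling import RandomUnderSampler   # pip install imbalanced-learn

# Stand-in with a mild 146/143 class imbalance and many irrelevant features.
X, y = make_classification(n_samples=289, n_features=500, n_informative=30,
                           weights=[0.505], random_state=0)

rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=0)
boruta = BorutaPy(rf, n_estimators="auto", random_state=0)
boruta.fit(X, y)                                 # shadow-feature based selection
X_selected = X[:, boruta.support_]               # confirmed-important biomarkers

X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X_selected, y)
print(X_selected.shape, np.bincount(y_bal))      # balanced classes after undersampling
```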
The evaluation of various ML classifiers has yielded insightful results. SVM
emerges as the standout performer, demonstrating exceptional sensitivity, specificity,
F-measure, and accuracy across both the training and testing datasets. Moreover,
its remarkably low error rate underlines its robustness in making accurate predic-
tions. RF exhibits commendable performance, albeit slightly trailing behind SVM
in terms of accuracy on the testing set. However, RF does display lower sensitivity
and specificity values, suggesting a potential trade-off between precision and recall.
In contrast, NB lags behind the other classifiers, displaying the lowest sensitivity
and specificity scores. This is further compounded by a relatively high error rate,
indicating its limitations in accurately classifying instances in this context. Moving
on, k-NN showcases noteworthy performance with a high sensitivity on the testing
Table 1 Comparison of ML classifiers on 214 breast cancer biomarkers (n breast cancer = 143,
n normal = 143)
Classifier Data Sensitivity Specificity F-measure Accuracy Error rate
(mean (sd)) (mean (sd)) (mean (sd)) (mean (mean
(sd)) (sd))
SVM Train 0.9475 (0.0137) 0.9804 0.9634 (0.0081) 0.9640 0.0361
(0.0081) (0.0080) (0.0080)
Test 0.9091 (0.0415) 0.9455 0.9258 (0.0260) 0.9273 0.0727
(0.0347) (0.0249) (0.0249)
RF Train 0.9009 (0.0157) 0.9326 0.9154 (0.0110) 0.9168 0.0832
(0.0141) (0.0107) (0.0107)
Test 0.9051 (0.0449) 0.9350 0.9187 (0.0303) 0.9200 0.0800
(0.0436) (0.0296) (0.0296)
NB Train 0.8590 (0.0192) 0.9481 0.8989 (0.0129) 0.9036 0.0964
(0.0097) (0.0117) (0.0117)
Test 0.8537 (0.0483) 0.9372 0.8905 (0.0301) 0.8955 0.1045
(0.0389) (0.0277) (0.0277)
k-NN Train 0.8687 (0.0217) 0.9772 0.9184 (0.0135) 0.9229 0.0771
(0.0096) (0.0121) (0.0121)
Test 0.9470 (0.0361) 0.8640 0.8932 (0.0354) 0.8994 0.1006
(0.04282) (0.0313) (0.0313)
Ridge LR Train 0.8422 (0.0153) 0.9747 0.9019 (0.0106) 0.9084 0.0916
(0.0087) (0.0094) (0.0094)
Test 0.8161 (0.0593) 0.9674 0.8819 (0.0375) 0.8917 0.1083
(0.0244) (0.0314) (0.0314)
Lasso LR Train 0.9707 (0.0276) 0.9912 0.9806 (0.0187) 0.9810 0.0190
(0.0107) (0.0181) (0.0181)
Test 0.8772 (0.0547) 0.9354 0.9029 (0.0332) 0.9063 0.0937
(0.0369) (0.0303) (0.0303)
Elastic net Train 0.9490 (0.0236) 0.9853 0.9664 (0.0158) 0.9672 0.0328
LR (0.0099) (0.0153) (0.0153)
Test 0.8836 (0.0511) 0.9536 0.9151 (0.0285) 0.9186 0.0814
(0.0301) (0.0257) (0.0257)
1000 repetitions were done to evaluate the predictive performance of ML classifiers
Ridge LR, lasso LR, and elastic net LR exhibit comparable performance, each with respectable sensitivity and specificity values; lasso LR performs slightly better on several metrics, particularly on the training set. To sum up, SVM and the regularized logistic regression methods (ridge LR, lasso LR, and elastic net LR) emerge as the most reliable classifiers, offering a good balance between sensitivity and specificity. Conversely, NB remains among the weaker classifiers, particularly in terms of sensitivity. Figure 2 summarizes the performance values on the testing dataset graphically for easier comparison of the ML classifiers.
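Because the comparison above turns on the balance between sensitivity and specificity (and how these relate to precision and recall), the standard confusion-matrix definitions behind the metrics in Table 1 are restated below as a minimal Python sketch; the function name and the example counts are illustrative and not drawn from the study.

def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)        # recall: proportion of positives detected
    specificity = tn / (tn + fp)        # proportion of negatives detected
    precision = tp / (tp + fp)          # proportion of predicted positives that are correct
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
    }

# Example with hypothetical counts (not taken from the study):
# classification_metrics(tp=130, fp=10, tn=133, fn=13)

In these terms, k-NN's high test sensitivity but low specificity means it detects most positive samples while misclassifying more of the negative ones.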
4 Conclusion
Acknowledgements The authors would like to thank Universiti Teknologi MARA (UiTM) Perak,
Tapah Campus for providing the facilities to conduct this study. This research was supported by
university research funding under the MyRA grant (600-RMC/GPM LPHD 5/3 (098/2022)).
Author Index
A
Abdullah, Mohammad Nasir, 563
Abdul-Rahman, Shuzlina, 319
Abela, Clair, 63
Adam, Siti Hajar, 149
Adnan, Nor Muhammad Ilman Bin Nor, 161
Afghan, Hafez, 385
Agustini, Mety, 221
Ahmad, Azlin, 493
Ahmed, Ahmed Mohamed Abdou, 169
Aitizaz, Ali, 105
Alanezi, Ahmad, 181
Al-Hamid, Abdullah, 521
Alias, Suraya, 17
Al-Jumeily OBE, Dhiya, 279, 521
Al-Nahari, Abdulaziz, 169, 493
Al-Nahari, Abdulaziz Yahya Yahya, 79
Alsaleem, Manea, 279
Al-Shabandar, Raghad, 169
Alshabandar, Raghad, 291
Amir, Siti Nabilah Mohd Abdul Hakim, 493
Andreas, Christopher, 335
Aridas, Narendra Kumar, 531
Aridinanti, Lucia, 429
Arshad, Sabkat, 49
Artana, Ketut Buda, 403
Assi, Sulaf, 279, 521
Azahari, Afiqah M., 479
Azies Al, Harun, 319
Azman, Alliesya Binti, 161

B
Binti Mohamad, Normaiza, 211

C
Cheng, Ling Ern, 3
Chilpuri, Vishnu Sai Reddy, 237
Choiruddin, Achmad, 335, 369, 463
Coombs, Thomas, 181, 211, 269

D
Daud, Paridah, 251
Dhayalan, Karthi, 251

F
Faradilla, Sonia, 369
Fithriasari, Kartika, 221

G
Geddam, Sushmitha, 521

H
Halip, Mohd Hazali Mohamed, 479
Hayatin, Nur, 17
Hung, Lai Po, 17

I
Inguanez, Frankie, 63
Ismadyaliana, Suci, 415
Ismail, Noor Azma, 269
Ismail, Noor Lees, 33

J
Jalil, Ilham Abdul, 505
K
Kaha, Putra Roskhairul Fitri, 479
Kassim, Mohamad Nizam, 123
Khoo, Eric, 123
Khusna, Hidayatul, 385
Kumar, Vinay, 211
Kuswanto, Heri, 403

L
Lim, Huai Tein, 303, 545
Lim, Khai Yin, 3
Looi, Sing Yan, 303
Looi, Zheng Ning, 303
Loy, Chong Kim, 169, 291

M
Mansoor, Syarifah Bahiyah Rahayu Syed, 149
Mehmood, Tahir, 49, 137, 161, 237
Meilisa, Mira, 447
Mohamad, Normaiza, 79
Mohammed, A. H., 169
Mohammed, Ammar H., 279, 291
Mubarik, M. Naufal, 429
Mustafina, Jamila, 33, 193, 251

N
Nadeem, Saaman, 137, 237
Nadi, Farhad, 105, 161, 181
Nagantheran, Jeetha A/P, 161
Nagappan, Danush, 181
Naghavipour, Hadi, 105, 161, 521
Ng, Peh Sang, 545
Nidhi Shettigar, 91
Noviyanti, Reny Ari, 353
Nuari, Muhammad Farhan, 403
Nuriman, Muhammad Alifian, 429

O
Otok, Bambang Widjanarko, 369, 447

P
Prastyo, Dedy Dwi, 221, 335
Pratiwi, Emmy, 403

R
Rahayu, Syarifah Bahiyah, 479
Rahmat, Romi Fadillah, 161
Rajan, Swapna D., 269
Rasam, Abdul Rauf Abdul, 505
Ratih, Iis Dewi, 403
Rosli, A’tifah Hanim, 149
Rumiati, Agnes Tuti, 353

S
Saha, Sajal, 545
Sardar, Soumen, 279
Saurabh Gupta, 291
Saxena, Ambar, 33
Sayli Tawhare, 91
Setiawan, 353, 415
Setiowati, Yuliana, 17
Sharma, Disha, 193
Sivarajan, Roshene A/P, 161
Song, Poh Choo, 303
Soujanya Rao, Mangu, 211
Sultanova, Nailya, 33, 193, 251

T
Talip, Mohamad Sofian Abu, 531
Tan, Chi Wee, 3
Tang, Ruipeng, 531
Tan, Jenny, 3
Tan, Xuan Yi, 3
Ting, Kezia Sim Kui, 161

V
Veerah, Suita A/P, 161
Venkatesan, K., 479
Vishal Pednekar, 91

W
Wahab, Norshahriah Abdul, 149
Wibowo, Wahyu, 319
Wildani, Zakiatul, 429
Wilujeng, Susi A., 319
Wulandari, Sri Pingit, 429

Y
Yao, Danny Ngo Lung, 79, 193