Data Science and Emerging Technologies
Proceedings of DaSET 2023
Lecture Notes on Data Engineering
and Communications Technologies
Volume 191
Series Editor
Fatos Xhafa, Technical University of Catalonia, Barcelona, Spain
The aim of the book series is to present cutting-edge engineering approaches to data
technologies and communications. It publishes the latest advances in the engineering
task of building and deploying distributed, scalable, and reliable data infrastructures
and communication systems.
The series has a prominent applied focus on data technologies and communications,
with the aim of promoting the bridge from fundamental research on data science and
networking to data engineering and communications that lead to industry products,
business knowledge, and standardisation.
Indexed by SCOPUS, INSPEC, EI Compendex.
All books published in the series are submitted for consideration in Web of Science.
Yap Bee Wah · Dhiya Al-Jumeily OBE ·
Michael W. Berry
Editors
Michael W. Berry
University of Tennessee
Knoxville, TN, USA
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Teknologi Malaysia, Assoc. Prof. Dr. Chin Kim On, Universiti Malaysia Sabah, Dato’
Dr. Amirudin Abdul Wahab, Cybersecurity Malaysia, Ms. Puteri Anis Aneeza binti
Zakaria, Statworks Group, and Mr. Raja Segaran, MDEC, Malaysia. All the distin-
guished speakers shared various data science and emerging technologies perspectives
and projects which are beneficial for academics and industry practitioners.
We would like to thank Professor Emeritus Tan Sri Dato’ Sri. Ir. Dr. Sahol Hamid
Bin Abu Bakar, Vice-Chancellor of UNITAR International University for his great
leadership, advice, and support of local and international academic activities to foster
collaborations that lead to the exchange of knowledge and skills for research with
impactful outcomes for social and economic prosperity.
We also thank the Series Editor of the Springer Lecture Notes on Data Engineering
and Communications Technologies for the opportunity to organize this guest-edited
volume. We are grateful to Mr. Aninda Bose (Senior Publishing Editor, Springer
India Pvt. Ltd.) and Mr. Radhakrishnan Madhavamani for the excellent collaboration,
patience, and help during the preparation of this volume. We are confident that the
volume will provide insightful information to researchers, practitioners, and graduate
students in the areas of data science, artificial intelligence, and emerging technologies
which are important in this digital information era. Last but not least, we thank all the
DaSET 2023 committees for working tirelessly to ensure a successful conference.
The book presents selected papers from the Second International Conference on Data
Science and Emerging Technologies (DaSET 2023), held online at UNITAR Inter-
national University, Malaysia, from December 4–5, 2023. This book aims to present
current research and applications of data science and emerging technologies. The
deployment of data science and emerging technology contributes to the achievement
of the Sustainable Development Goals for social inclusion, environmental sustain-
ability, and economic prosperity. Emerging technologies such as artificial intelli-
gence and blockchain are useful for various domains such as marketing, health care,
finance, banking, environmental, and agriculture. Innovations in the field of artifi-
cial intelligence continue to shape the future of work across nearly every industry.
Data Science has a transformative effect on the economy, industry, and society.
An important grand challenge in data science is to determine how developments
in computational and social-behavioral sciences can be combined to improve well-
being, emergency response, sustainability, and civic engagement in a well-informed,
data-driven society. The topics of this book include, but are not limited to: artificial
intelligence, machine and deep learning, statistical learning, and health and industrial
applications.
Contents
Artificial Intelligence
A Comparative Study of Lemmatization Approaches for Rojak
Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Liu Jun Yoon, Xuan Yi Tan, Khai Yin Lim, Chi Wee Tan,
Ling Ern Cheng, and Jenny Tan
Multi-aspect Extraction in Indonesian Reviews Through
Multi-label Classification Using Pre-trained BERT Models . . . . . . . . . . . . 17
Nur Hayatin, Suraya Alias, Lai Po Hung, and Yuliana Setiowati
Artificial Intelligence (AI) Empowered Sign Language Recognition
Using Hybrid Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Ambar Saxena, Nailya Sultanova, Jamila Mustafina,
and Noor Lees Ismail
The Performance of GPT-3.5 in Summarizing Scientific and News
Articles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Sabkat Arshad, Muhammad Yaqoob, and Tahir Mehmood
Wound Stage Recognition Using YOLOv5 . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Clair Abela and Frankie Inguanez
Harvest Palm Tree Based on Detection Through 2D LiDAR Sensor
Using Power Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Luqman Hakim Bin Yusof, Abdulaziz Yahya Yahya Al-Nahari,
Danny Ngo Lung Yao, and Normaiza Mohamad
Enhancing Security Surveillance Through Business Intelligence
with NVIDIA DeepStream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Vishal Pednekar, Nidhi Shettigar, and Sayli Tawhare
Fuzzified Hybrid Metaheuristics for QoS-Aware Service
Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Hadi Naghavipour, Farhad Nadi, and Ali Aitizaz
Machine/Deep Learning
Fraudulent E-Commerce Website Detection Using Convolutional
Neural Network Based on Image Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Nurfazrina Mohd Zamry, Anazida Zainal, Eric Khoo,
Mohamad Nizam Kassim, and Zanariah Zainudin
A Generic Framework for Ransomware Prediction
and Classification with Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . 137
Saaman Nadeem, Tahir Mehmood, and Muhammad Yaqoob
Leveraging Gamification for Engaged Learning in Online Teaching
and Learning Experiences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Norshahriah Abdul Wahab, A’tifah Hanim Rosli,
Syarifah Bahiyah Rahayu Syed Mansoor,
Norazliana Akmal Jamaludin, and Siti Hajar Adam
Sentiment Analysis Using Large Language Models: A Case Study
of GPT-3.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Farhad Nadi, Hadi Naghavipour, Tahir Mehmood,
Alliesya Binti Azman, Jeetha A/P Nagantheran, Kezia Sim Kui Ting,
Nor Muhammad Ilman Bin Nor Adnan, Roshene A/P Sivarajan,
Suita A/P Veerah, and Romi Fadillah Rahmat
Telecom Customer Experience Analysis Using Sentiment Analysis
and Natural Language Processing—Comparative Study . . . . . . . . . . . . . . 169
Ahmed Mohamed Abdou Ahmed, Abdulaziz Al-Nahari,
Raghad Al-Shabandar, Chong Kim Loy, and A. H. Mohammed
Efficient Time Series of Smoothing and Auto-regressive Forecasting
Models for Predicting Police Officer Fatalities in the USA . . . . . . . . . . . . . 181
Danush Nagappan, Manoj Jayabalan, Ahmad Alanezi, Farhad Nadi,
and Thomas Coombs
Multimodal Emotion Recognition Using Attention-Based Model
with Language, Audio, and Video Modalities . . . . . . . . . . . . . . . . . . . . . . . . . 193
Disha Sharma, Manoj Jayabalan, Nailya Sultanova, Jamila Mustafina,
and Danny Ngo Lung Yao
Comparative Analysis of Emotion Recognition Using Large
Language Models and Conventional Machine Learning . . . . . . . . . . . . . . . 211
Mangu Soujanya Rao, Thomas Coombs, Normaiza Binti Mohamad,
Vinay Kumar, and Manoj Jayabalan
The Impact of Clustering-Based Sequential Multivariate Outliers
Detection in Handling Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
Mety Agustini, Kartika Fithriasari, and Dedy Dwi Prastyo
Statistical Learning
A Case Study via Bayesian Network: Investigating Factors
Influencing Student Academic Performance in Online Teaching
and Learning During COVID-19 Pandemic . . . . . . . . . . . . . . . . . . . . . . . . . . 303
Zheng Ning Looi, Poh Choo Song, Huai Tein Lim, and Sing Yan Looi
Harnessing the XGBoost Ensemble for Intelligent Prediction
and Identification of Factors with a High Impact on Air Quality:
A Case Study of Urban Areas in Jakarta Province, Indonesia . . . . . . . . . . 319
Wahyu Wibowo, Harun Al Azies, Susi A. Wilujeng,
and Shuzlina Abdul-Rahman
Modeling Earthquake Catalog in Sumatra by Space–Time
Epidemic-Type Aftershock Sequences Model: Combining
Davidon–Fletcher–Powell and Stochastic Declustering Algorithms . . . . . 335
Christopher Andreas, Achmad Choiruddin, and Dedy Dwi Prastyo
Small Area Estimation of Mean Years of Schooling Under Time
Series and Cross-sectional Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
Reny Ari Noviyanti, Setiawan, and Agnes Tuti Rumiati
Probabilistic Seismic Hazard Analysis for Sulawesi-Maluku
Region of Indonesia Using the Space–Time Epidemic-Type
Aftershock Sequence Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
Sonia Faradilla, Achmad Choiruddin, and Bambang Widjanarko Otok
About the Editors
Professor Yap Bee Wah is the director of the Research and Consultancy Center at
UNITAR International University Malaysia. She is the founding and general chair for
DaSET2022: International Conference on Data Science and Emerging Technologies
and the editor of its proceedings, published in the Lecture Notes on Data Engineering
and Communications Technologies series by Springer. She was the conference
chair of the International Conference on Soft Computing in Data Science (2015–
2019 and 2021) and an editor of the SCDS conference proceedings published in the
Springer CCIS series. She is also one of the editors of the book titled Supervised
and Unsupervised Learning for Data Science published by Springer Nature Switzer-
land AG 2020. She has actively published papers in ISI- and Scopus-indexed journals such as
Expert Systems with Applications, Journal of Statistical Computation and Simula-
tion, Communications in Statistics-Computation and Simulation, Journal of Clinical
and Translational Endocrinology, and Computers, Materials and Continua.
Professor Michael W. Berry is the co-author and an editor of sixteen books covering
topics in scientific computing, information retrieval, text/data mining, and data
science. He is the co-editor of the Soft Computing in Data Science volumes from
2015 to 2021 and Data Science and Emerging Technologies 2022 proceedings by
Springer. He is also the co-author of popular books published by Society for Industrial
and Applied Mathematics: Understanding Search Engines: Mathematical Modeling
and Text Retrieval, Second Edition, and Computational Information Retrieval. He has
published over 115 refereed journal and conference publications. He is a member of
SIAM, ACM, MAA, ASEE, and the IEEE Computer Society and is on the editorial
board of Foundations of Data Science (AIMS) and the SIAM Journal on Matrix
Analysis and Applications (SIAM). He is also a certified program evaluator for
the Computing Accreditation Commission (CAC) of the Accreditation Board for
Engineering and Technology, Inc. (ABET).
Artificial Intelligence
A Comparative Study of Lemmatization
Approaches for Rojak Language
Liu Jun Yoon, Xuan Yi Tan, Khai Yin Lim, Chi Wee Tan, Ling Ern Cheng,
and Jenny Tan
99.90% and 92.27% test set accuracy for Malay and English, respectively, whereas
the deep learning-based approach with POS tags produced the worst results of 79.78%
and 91.15%.
1 Introduction
A lemmatizer plays an important role in the preprocessing stage of most natural language
processing (NLP) applications such as information retrieval systems (IRS) and machine
translation systems (MTS). This is because most NLP applications require the extraction of
a valid and linguistically meaningful lemma from an inflectional word.
For example, in MTS, the first phase involves analyzing the input text
written in the source language for its grammatical correctness. Hence, a lemmatizer
is required to perform text normalization to break down inflectional words into roots
and affixes [1, 2]. By performing lemmatization, different inflected forms of a word
are grouped into a common root, which is the base form or dictionary form of a
word, known as the lemma, that is, a valid and linguistically meaningful root. For
example, if the words "play, plays, and played" are given to a lemmatizer, "play"
will be returned by the lemmatizer as the lemma [2]. Since lemmatization is widely
used in the preprocessing of most NLP applications, lemmatizers that are able to give
more accurate lemmas have to be studied.
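To make this concrete, the short example below uses NLTK's off-the-shelf WordNetLemmatizer for English. It is not the lemmatizer developed in this study, only a minimal illustration of how inflected forms map to a common lemma; it assumes the NLTK WordNet data has been downloaded.

# Minimal illustration of lemmatization using NLTK's English WordNetLemmatizer
# (an existing tool, not the Rojak-language lemmatizer proposed in this paper).
# Setup assumption: pip install nltk, then nltk.download("wordnet").
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("plays"))            # noun by default -> "play"
print(lemmatizer.lemmatize("played", pos="v"))  # tagged as a verb -> "play"
print(lemmatizer.lemmatize("play"))             # already a lemma  -> "play"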
Code-mixing is a widespread phenomenon in Asia that mixes two or more
languages. In Malaysia, English has been the focal mixing language practiced with
Malay, resulting in the creation of Rojak language [3, 4]. For example, Malaysians
tend to write “That is such a canggih gadget, I want one too”. Due to the rapid spread
of Rojak language, a lemmatizer capable of lemmatizing the Rojak language, namely
a combination of Malay and English, is needed [5]. Besides, all existing stemmers
and lemmatizers for Malay or Indonesian confuse roots with stems or lemmas. For
instance, although the Sastrawi stemmer is a stemmer, it returns roots instead of
stems, while Malaya also inherits this problem. This is because they do not consider
the parts-of-speech (POS) tags when performing stemming. Conversely, MorphInd,
which is currently the most widely used morphological analyzer for Indonesian, regards
roots as lemmas. No existing tools provide both stems and lemmas [6]. Therefore, an approach
capable of lemmatizing inflected Malay and English words, constituting the Rojak
language, into its respective lemmas is needed. This study encompasses three primary
objectives: (1) to introduce an approach capable of lemmatizing Rojak language,
which is able to handle English and Malay simultaneously, without requiring the
user to manually identify and separate the language; (2) to compare the performance
of rule-based, corpus-based, machine learning, and deep learning-based lemmatiza-
tion approaches; and (3) to evaluate the impact of POS tags on the performance of
machine learning and deep learning-based lemmatizers.
2 Literature Review
Recently, a lemmatizer that uses a novel dictionary lookup approach has been
proposed for the Urdu language [7]. In the approach, the lemma returned depends on
the corpus used to train the model. The lemma of a given word was only returned if
both the word and its corresponding POS tag were present in the corpus. If the word
and POS tag were not found, the lemmatizer would return the word as it was passed
in, without any changes. Any differences caused by spelling mistakes or different
blank spaces from the user input words would result in a no match in the corpus. The
proposed lemmatizer was evaluated with and without POS tagging. The lemmatizer
obtained the highest accuracy of 76.44% when words without POS tags were lemmatized.
On the other hand, words that were passed in together with POS tags obtained a lower
accuracy of 66.79%.
A lemmatizer for Icelandic known as Nefnir was proposed in 2019 [1]. Nefnir was
developed from rules derived from a morphological database known as Database of
Modern Icelandic Inflection (DMII). Suffix substitution rules were derived from the
database and used to lemmatize tagged text. New rules were generated to minimize
the number of errors in the training set until no further reduction in the error count.
In Nefnir, it was assumed all word forms are identical to their lemma. A list of rules
was generated for all the errors. The rule that minimizes the number of remaining
errors was selected and applied to the training set until the number of errors no
longer decreased. Rules were generated only if they correctly lemmatized at least
two examples in the training set. The evaluation
of Nefnir was performed to determine the accuracy of Nefnir in lemmatizing words
with correct POS tags and words that are automatically tagged with POS tagger
which was IceTagger. The accuracy achieved by Nefnir with correct POS tags was
99.55%, while for words that are tagged automatically with IceTagger, the accuracy
was 96.88%. It was shown that Nefnir accuracy dropped when lemmatization was
performed on words that were automatically tagged by POS tagger.
Another rule-based lemmatizer that uses the longest-affix-match approach was
proposed for Kannada inflectional words [2]. In the approach, the input word that
contains prefix or suffix would be applied with a set of linguistic rules to get the
appropriate lemma. Prefixes and suffixes for Kannada inflected nouns and finite verbs
were collected manually from Kannada grammar textbooks [8]. A root dictionary was
created from Kannada dictionary “Kannada Rathnakosha”. The proposed lemmatizer
searched for a lemma in the root dictionary, and if the lemma was not found, the
lemmatizer would append the obtained lemma to the root dictionary. This further
improved the performance of the lemmatizer. The proposed lemmatizer was tested
on four datasets with lemmatization performed on official circulars that achieved
an accuracy of 85.72%, newspaper 95.80%, legal documents 97.08%, and All India
Radio news 95.39%. The proposed lemmatizer achieved above 85% accuracy on all
four datasets.
In 2020, lemmatization of the Russian language based on machine learning algo-
rithms was proposed [8]. Vectorized word forms obtained from open dictionaries
were fitted into various machine learning regression models, which were decision
tree, random forest, extra tree, and bagging. Decision tree produced the highest accu-
racy on the lemmatization of real-world corpora, ABBYY corpus and Open-Corpora
corpus, with the accuracy of 75.61% and 70.88%, respectively.
A deep learning sequence-to-sequence approach was proposed in 2021 to perform
automatic Romanian lemmatization [9]. The encoder and the decoder in the
sequence-to-sequence model for lemmatization of Romanian words contain a single
long short-term memory (LSTM) layer. The encoder and the decoder were enriched
with one or two additional LSTM layers to improve the system’s accuracy. When the
deep learning models were evaluated on Romanian Explicative Dictionary (DEX)
dataset, one layer LSTM-based architecture achieved the highest accuracy at both
word and character levels with an accuracy of 95.93% and 97.29%, respectively.
When POS information was included, the system’s accuracy increased by 3.39% at
word level and by 2.14% at character level resulting in an accuracy of 99.32% and
99.43%, respectively. The model’s accuracy improved when POS information was
included.
3 Proposed Methodology
This section describes the proposed framework and the workflow for developing
a lemmatizer for Rojak language texts that mix English and Malay. Figure 1
shows the overall framework of the general process design.
Two corpora, one for the English language and another for the Malay language,
were utilized in constructing the lemmatizer designed for the Rojak language. In
light of the findings [10] indicating a similarity of over 90% between the Malay
and Indonesian lexicons, an Indonesian corpus was used due to its relatively greater
availability of resources as compared to Malay language. In this study, the English
Web Treebank (EWT) corpus and Indonesian GSD corpus were employed [11–13].
Both the corpora were split into train, development, and test sets in the CoNLL-U
file format. The train and development sets were combined to form the train set. Each
set consists of sentences made up of words, where various information is provided
for each word such as ID, FORM, LEMMA, UPOS, and XPOS. For building the
lemmatizers, only the FORM, LEMMA, and UPOS which are universal POS tags
Fig. 1 Overall architecture of the general process design consisting of a the development of
lemmatization and b the Rojak language lemmatization
are used. UPOS can be used to tag parts-of-speech for all languages, and there are 17
UPOS tags in the UD framework [14]. The number of sentences and words contained
in each dataset is as shown in Table 1.
Lowercasing was performed in this stage to ensure all words can be mapped or
lemmatized despite the different text casings in the corpus.
Lemmatizers for the English and Malay languages were individually constructed
using distinct methods. Four different lemmatizers, consisting of rule-based, corpus-
based, machine learning-based, and deep learning-based were developed. The
training or development process for each method was replicated to create two
lemmatizers of the same method for the distinct languages, namely English and
Malay.
3.5 Rule-Based
Lowercasing was performed in rule-based lemmatizer on both the train and test sets.
The rule-based lemmatizer utilizes information regarding word forms, lemmas, and
POS tags to generate rules, as demonstrated in Table 2 for English and Table 3 for
the Malay language, enabling it to carry out lemmatization. During the lemmatizer
building stage, the initial step involves constructing a corpus that encompasses all
word forms and lemmas. This corpus serves as a reference for the lemmatizer to
determine whether a generated lemma represents a meaningful or dictionary-form of
a word. Subsequently, rules for lemmatizing English and Malay words were formu-
lated based on linguistic knowledge of the languages, observations from the corpora,
and insights garnered from prior research. The overall workflow for the rule-based
lemmatizer is outlined in Algorithm 1.
(Algorithm 1, final step: otherwise, the received word is returned as the lemma.)
The rules for lemmatizing English words were derived from linguistic knowledge
and observing the word-lemma pairs in the training set. As there are many irregular
verbs and plural forms in English, some irregular words found from the training set
are explicitly added to the list of rules for lemmatizing English words to increase
coverage and improve accuracy. As for lemmatizing Malay words, the rules are
mainly derived from the findings of previous studies supplemented by linguistic
knowledge [17].
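The rule-based idea can be sketched as follows. The suffix rules and the tiny lexicon below are illustrative assumptions, not the actual rules of Tables 2 and 3 or the full Algorithm 1; the sketch only shows the general pattern of stripping an affix, validating the candidate against a corpus of known word forms, and falling back to the input word otherwise.

# Illustrative rule-based lemmatizer: strip a known suffix and accept the candidate
# only if it appears in the reference corpus. SUFFIX_RULES and the lexicon are
# made-up examples for this sketch, not the rule set used in the paper.
SUFFIX_RULES = {
    "NOUN": ["es", "s"],           # example English plural endings
    "VERB": ["ed", "ing", "s"],    # example English verbal endings
    "MALAY": ["kan", "an", "i"],   # example Malay suffixes
}

lexicon = {"play", "makan", "ajar"}          # known word forms / lemmas

def rule_based_lemma(word: str, pos: str) -> str:
    word = word.lower()                      # lowercasing, as described in Sect. 3.5
    if word in lexicon:
        return word
    for suffix in SUFFIX_RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in lexicon:         # keep only meaningful lemmas
                return candidate
    return word                              # fallback: return the word unchanged

print(rule_based_lemma("plays", "NOUN"))     # -> play
print(rule_based_lemma("ajaran", "MALAY"))   # -> ajar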
3.6 Corpus-Based
As for the corpus-based lemmatizer, the training and development datasets from the
UD Treebanks for the two languages were used as training data. The test set was used
for evaluating the corpus-based lemmatizer to determine its accuracy on unseen data.
The idea behind this approach is to build a corpus
containing all the possible lemmas for a word according to the different POS tags,
for each language. After the corpus has been built, it can then be used to look for a
matching word form and POS tag to return its corresponding lemma, given an input
word and POS tag.
Therefore, to build the corpus, lowercasing was first performed on the UD Tree-
banks data. The words and lemma from the training set were then used to build a
dictionary corpus for the lemmatizer. Words in the training set that are not in the
dictionary were added to the corpus together with the POS tag. At the same time,
the POS tag of the word was checked if it exists in the corpus. In cases where the
POS tag was not found in the corpus, it was appended to the corpus along with its
lemma. Hence, to use the lemmatizer, it takes a word and its POS tag as input and
then checks the corpus to determine whether the given word and POS tag exist in the
corpus. If they exist, the corresponding lemma is returned; otherwise, the received
word is returned.
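A minimal sketch of this lookup is given below. The CoNLL-U parsing is simplified to column positions only and the file name is a placeholder; the essential point is the (word form, UPOS) to lemma dictionary and the fallback to the input word.

# Illustrative corpus-based lemmatizer built from a CoNLL-U training file.
# "train.conllu" is a placeholder path; FORM, LEMMA and UPOS are columns 2, 3 and 4.
def build_lookup(conllu_path: str) -> dict:
    lookup = {}
    with open(conllu_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip() or line.startswith("#"):
                continue                                  # skip blanks and comments
            cols = line.rstrip("\n").split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue                                  # skip multiword/empty tokens
            form, lemma, upos = cols[1].lower(), cols[2].lower(), cols[3]
            lookup.setdefault((form, upos), lemma)
    return lookup

def corpus_lemma(word: str, upos: str, lookup: dict) -> str:
    # Return the stored lemma for a seen (word, POS) pair, otherwise the word itself.
    return lookup.get((word.lower(), upos), word.lower())

# Example usage (assuming the placeholder training file exists):
# lookup = build_lookup("train.conllu")
# print(corpus_lemma("played", "VERB", lookup))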
For the machine learning-based lemmatizer, the train and development sets were combined
to form a train set for training the lemmatizer. Feature extraction was then performed
on the words and lemmas through fastText word embeddings. In this study, decision
tree regression model (DTR) was selected as it managed to yield the highest accuracy
during the lemmatization process on the real-world corpora for the Russian language
[8]. Thus, DTR was expected to be able to produce outstanding results in Rojak
language lemmatizer. The DTR takes the word embedding of the inflected word as
the input and learns to predict its corresponding output, which is the word embedding
of the lemma of the input word. Depending on the type of approach, it can also take
the encoded POS tag of the inflected word as input, allowing the DTR to learn from
both the word and POS tag.
The DTR is a model based on the decision tree (DT) algorithm that learns using a
tree structure that contains a root node, decision nodes, edges, and leaf nodes. It starts
from the root node by selecting an attribute and splitting values as its starting point
and continuously repeats this process to generate decision nodes, until it reaches its
leaf nodes. The leaf nodes contain the values of the final prediction generated by
the DT. In this case, the DTR uses the DT algorithm to perform lemmatization that
is posed as a regression problem, whereby the predicted lemmas are represented by
their feature vectors, which are sequences of continuous values. Using the generated
tree, a path leading to the leaf nodes can be found to obtain the predicted values, which
can also be expressed in rule form. Hence, the DTR is able to perform predictions
and generate results for performing lemmatization [18].
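A rough sketch of this setup is shown below. The toy embedding table, the 100-dimensional vectors, and the nearest-neighbour decoding of the predicted vector back to a lemma string are assumptions made for illustration; the paper only states that fastText embeddings of the inflected word (optionally with an encoded POS tag) are mapped to the embedding of its lemma.

# Illustrative machine learning-based lemmatizer: a DecisionTreeRegressor maps the
# embedding of an inflected word to the embedding of its lemma; the predicted vector
# is then decoded to the nearest known lemma. The embeddings here are random
# stand-ins for fastText vectors and exist only to make the sketch runnable.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
vocab = ["play", "played", "plays", "makan", "memakan"]
emb = {w: rng.normal(size=100) for w in vocab}       # stand-in for fastText vectors

def embed(word: str) -> np.ndarray:
    return emb.get(word, rng.normal(size=100))       # unseen words get a random vector

pairs = [("played", "play"), ("plays", "play"), ("memakan", "makan")]
X = np.stack([embed(w) for w, _ in pairs])           # inputs: inflected-word vectors
y = np.stack([embed(l) for _, l in pairs])           # targets: lemma vectors

dtr = DecisionTreeRegressor(random_state=0).fit(X, y)

def ml_lemma(word: str) -> str:
    pred = dtr.predict(embed(word).reshape(1, -1))[0]
    lemmas = ["play", "makan"]
    dists = [np.linalg.norm(pred - emb[l]) for l in lemmas]   # nearest-neighbour decode
    return lemmas[int(np.argmin(dists))]

print(ml_lemma("plays"))                             # -> play (on this toy data)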
3.9 Evaluation
Both the English and Malay lemmatizers built on the different methods were eval-
uated using the accuracy (Eq. 1). To perform the evaluation, the test sets of the two
corpora were used.
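Equation 1 is not reproduced in this extract; the standard definition assumed for it is

\text{Accuracy} = \frac{\text{number of words lemmatized correctly}}{\text{total number of words in the test set}} \times 100\%.  (1)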
Table 5 Train and test set accuracy using the (a) rule-based, (b) corpus-based, (c) machine learning-based (without POS tags), and (d) deep learning-based (without POS tags) lemmatizers

            (a) Rule-based        (b) Corpus-based      (c) Machine learning    (d) Deep learning
            Train (%)  Test (%)   Train (%)  Test (%)   Train (%)  Test (%)     Train (%)  Test (%)
Malay       91.22      90.61      99.94      99.90      99.08      89.64        86.81      85.22
English     94.03      94.45      99.79      92.27      99.44      93.26        91.31      91.15
Average     92.625     92.53      99.865     96.085     99.26      91.45        89.06      88.19
rules were formulated based on the train set. Therefore, a sufficiently large and high-
quality train set ensures that the generated rules encompass a wide range of word
forms, ultimately leading to a considerably good outcome.
As for the machine learning-based method without POS tags, a relatively large
difference between the train and test accuracies is observed. The difference is more
noticeable in Malay, around 10%, and somewhat less in English, about 6%. This
suggests a slight overfitting on the training data, which fails to generalize well with
new unseen data.
Lastly, the deep learning-based lemmatizer without POS tags, namely the LSTM lemmatizer,
was found to have the lowest accuracy for both languages. The lower performance
of the LSTM lemmatizer may be due to the considerably low number of training data
as deep learning models often require a substantial amount of training data. The model
might not be able to fully capture the nuances of the language when the training data
are limited. Besides that, the lack of an extensive hyperparameter search on the
LSTM may be a reason for the rather poor performance.
Additionally, both the machine learning and deep learning-based approaches were
experimented with the inclusion of POS tags. This allows a comparative analysis
between the methods with and without the POS tags, which could then facilitate an
assessment of whether POS tags would help in the learning and prediction processes.
Upon observing Tables 6 and 7, it can be concluded that the inclusion of POS
tags not only fails to enhance but in fact deteriorates the lemmatization outcomes
in certain cases. This finding is consistent with [1] but contradicts
with [9]. The decline in accuracy could potentially be due to the model’s inability to
effectively utilize and learn from the meaningful information from the POS tags.
In this context, the POS tags might introduce noises into the model, causing it to
be unable to learn from the data properly which in turn leads to the accuracy drop.
Another possible reason is that there may be inconsistent lemmas for the same word
form and POS tag in the train data. Learning from such data can cause confusion for
the model during prediction as there is no exact lemma that the model is able to map
the word to, thereby affecting the overall accuracy of the model.
Lastly, the reduction in the accuracy of the model with POS tags might also be
attributed to different words with different POS tagging sharing the same lemma.
This makes it harder for the model to extract information and patterns from the data,
which therefore results in a reduction in accuracy.
Table 6 Train and test set accuracy of the Malay and English machine learning-based lemmatizers

                      Train set accuracy (%)   Test set accuracy (%)
Without POS tag
  Malay               99.08                    89.64
  English             99.44                    93.26
With POS tag
  Malay               97.34                    87.20
  English             97.80                    91.94
Table 7 Train and test set accuracy of the Malay and English deep learning-based lemmatizers

                      Train set accuracy (%)   Test set accuracy (%)
Without POS tag
  Malay               86.81                    85.22
  English             91.31                    91.15
With POS tag
  Malay               80.53                    79.78
  English             90.27                    91.15
5 Conclusion
In this study, various lemmatization methods have been employed to compare the
performance of the lemmatizers on the UD Treebanks test sets of the English and
Malay languages. To verify the effectiveness of the proposed lemmatizers, the rule-
based, corpus-based, machine learning-based, and deep learning-based lemmatiza-
tion methods were conducted. Among all the lemmatization methods, the corpus-
based lemmatizer was found to outperform the others. The deep learning-based
method, on the other hand, performed the worst. A more detailed analysis as to
the fine-tuning of the model will be deferred to future investigation. Additionally,
the effect of POS tags on the performance of the lemmatizer has also been compared
for the machine learning and deep learning-based lemmatizers. Results have shown
that POS tags do not contribute any value to the lemmatizers for both methods
in the English and Malay languages.
References
Multi-aspect Extraction in Indonesian Reviews Through Multi-label Classification Using Pre-trained BERT Models
Nur Hayatin, Suraya Alias, Lai Po Hung, and Yuliana Setiowati
N. Hayatin
Informatics Department, University of Muhammadiyah Malang, Malang, Indonesia
N. Hayatin · S. Alias (B) · L. P. Hung
Computing and Informatics Faculty, University Malaysia Sabah, Kota Kinabalu, Malaysia
e-mail: [email protected]
Y. Setiowati
Informatics and Computing Department, Politeknik Elektronika Negeri Surabaya, Surabaya,
Indonesia
1 Introduction
Aspects are attributes or components of the product or service [1]. Aspects can be
written directly in the form of a word or phrase in a sentence explicitly. Aspect
extraction is a task for automatically identifying and categorizing specific aspects
of a product or service mentioned in the text. Sentiment analysis needs an aspect
level to obtain a more fine-grained opinion analysis [2]. This idea leads to Aspect-
Based Sentiment Analysis (ABSA), a primary task to extract and summarize people’s
opinions and also the aspects contained therein.
Previous researchers have extracted aspects well [3, 4]; however, they only extract
a single aspect. In general, one comment has more than two sentences and can contain
multiple aspects. For instance, the sentence "strategic location, and the room is very
comfortable" contains two aspects, "location" and "room". If such a sentence is assigned
only one relevant aspect, the information in the user's opinion cannot be captured
completely. This is a challenge that must be overcome, even though the computations
and annotations for supervised learning are more complex than when extracting only
one aspect.
Recent advancements in deep learning techniques such as transformer-based
models have shown promising results in aspect extraction tasks. These models can
capture complex relationships between words and sentences, making them suitable
for extracting fine-grained aspects [5, 6]. BERT is a model pre-trained on a large
corpus of data in a self-supervised manner that operates using the bidirectional
encoder mechanism of the transformer [7]. Several studies have successfully
utilized the BERT architecture for the multi-label classification approach
in sentiment analysis tasks [8, 9]. Inspired by these findings, our research aims to
extract multiple aspects from Indonesian reviews based on multi-label classification
using a pre-trained BERT model to better capture the context of sentences.
Our work differs from previous works. In this research, we handle multi-aspect
extraction from Indonesian hotel reviews, where BERT's architecture was chosen for
multi-label classification because its bidirectional encoder architecture better
captures the context of sentences. Other than that, we carry out tests with various
pre-trained models to improve classifier performance. To the best of our knowledge,
we are the first study to implement Indonesian pre-trained BERT models for multi-aspect
extraction. We implemented various pre-trained BERT models, including the recent
Indonesian pre-trained models IndoBERT [10] and IndoBERTWiki [11]. We also implemented
IndoBERTweet [12], the first large-scale pre-trained model built from Indonesian tweets.
2 Related Works
In general, the approach of aspect categorization is divided into two main groups:
supervised and unsupervised approaches [13]. Both have strengths as well as
shortcomings. A previous study implemented the multi-label text classification
method to detect dangerous speech in Twitter texts based on aspects [14]. Their
method showed good performance by implementing a Naive Bayes (NB) classifier;
however, the study had to handle imbalanced data issues within each aspect and
extend the vocabulary. Other methods used for extracting aspects
in Indonesian reviews data are Conditional Random Field (CRF) [15, 16], Latent
Dirichlet Allocation (LDA) [17], word vector similarity [18], and rule-based [4,
19]. These techniques are relatively inexpensive because they do not require data
labeling. However, they have a weakness when it comes to collecting the many pattern
combinations and various words used to express opinions. In addition, the majority of
them need the initial seed words for each aspect to be determined manually for
classifying reviews.
In the supervised approach, the aspect extraction task involves labeled data,
where researchers have worked with various machine learning techniques such as
Support Vector Machine (SVM), Naïve Bayes (NB), and K-Nearest Neighbor
(KNN) [20]. Deep learning techniques, which have shown good results, are also
popularly used in aspect categorization, such as the Recurrent Neural Network (RNN)
for handling aspect detection [5]. Another method is the Gated Recurrent Unit (GRU),
a sequential deep learning model, which was used for aspect spread detection
[21]. A recent study developed semi-supervised deep learning by implementing a Graph
Convolutional Network (GCN) for aspect classification to improve ABSA [22].
In terms of multi-aspect extraction, a previous study employed multi-label
classification using deep learning approaches focused on multi-label aspect
classification of Indonesian hotel reviews. The study used a combination of a
Convolutional Neural Network (CNN) and Extreme Gradient Boosting (XGB) [6], although
the model is quite sensitive to misspellings. Other models proposed by [23] and [24]
extracted explicit aspects from Indonesian reviews using the Bidirectional Encoder
Representations from Transformers (BERT) model. The results showed good performance
using BERT-multilingual as the pre-trained model. However, many basic words cannot
be captured, because the BERT-multilingual model contains text from various languages
and is not specific to Indonesian text. Therefore, a pre-trained model fitted to
Indonesian text is needed. Table 1 shows several studies related to our work.
3 Methodology
Fig. 1 Architecture of multi-aspect extraction through multi-label classification using BERT pre-
trained model
API based on BERT pre-trained model, which automatically selects the appropriate
tokenizer based on the checkpoint available on the hub.
The next process is encoding which refers to the process of converting text data
into numerical representations that can be understood and processed by the model.
BERT utilizes the encoder element of the transformer. Each token produced by
the previous process is then converted into a WordPiece embedding vector by
BERT. Furthermore, the embeddings are then passed through the pre-trained BERT
model which is used for classification. The example of converting the result of text
tokens to text embeddings can be seen in Fig. 2.
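A minimal sketch of this tokenization and encoding step with the Hugging Face transformers auto classes is given below. The checkpoint name, the maximum sequence length of 128, and the multi-label classification head with eight aspect labels are assumptions made for illustration, not the authors' exact configuration.

# Illustrative tokenization/encoding for multi-label aspect classification with an
# Indonesian BERT checkpoint. Checkpoint name, max_length and the multi-label head
# configuration are assumptions for this sketch.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "indobenchmark/indobert-base-p2"        # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=8,                                    # the eight hotel aspects
    problem_type="multi_label_classification",       # sigmoid outputs + BCE loss
)

review = "lokasi strategis, dan kamarnya sangat nyaman"
encoded = tokenizer(review, truncation=True, padding="max_length",
                    max_length=128, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoded).logits                 # shape (1, 8): one score per aspect
print(logits.shape)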
data consists of raw texts only, without human labeling. BERT models help machines
understand and interpret the meaning of the text. It uses immediately preceding text
to understand the context. It also checks the relationships of words within a sentence
to give the actual meaning of words. BERT ensures words with the same meaning
will have a similar representation.
In this study, we implemented 10 different pre-trained BERT models to get the
best classifier. To the best of our knowledge, we are the first study to implement
Indonesian pre-trained BERT models for the multi-aspect extraction task on Indonesian
reviews. All pre-trained BERT models used in this research are uncased models. We
describe each pre-trained BERT model used in the experiment as follows:
BERT-Base [7] is the smaller version of the BERT model. It comprises 12 trans-
former layers, with 12 self-attention heads in each layer. The training data used for
BERT-base is from the English language.
BERT-Large [7] is the larger version of the BERT model. It consists of 24 trans-
former layers, each with 16 self-attention heads. The total number of parameters in
BERT-large is significantly higher compared to BERT-base. The training data used
for BERT-large is from the English language.
BERT-Multilingual [7] is a variant of the BERT model that is designed to handle
multiple languages. It has the same underlying architecture as other BERT models,
consisting of transformer layers with self-attention mechanisms. The training data
used for BERT multilingual includes text from various languages, enabling it to
capture the nuances and contextual information specific to each language.
IndoBERT [10] is a recent Indonesian pre-trained BERT model from IndoNLU
which was trained with around 4-billion-word corpus (Indo4B), more than 20 GB
of text data. There are several versions of Indobenchmark-BERT; in this study, we use
two: Indobenchmark-BERT-base and Indobenchmark-BERT-large, both in the p2 version.
IndoConvBERT1 is a variant of the Indonesian-BERT model that is pre-trained
on Indo4B, an Indonesian corpus which consists of around 4B words, and a small
Twitter dump. This model follows a different pre-training procedure and sequence length.
IndoBERTWiki [11] is a recent Indonesian pre-trained BERT model adapted from
BERT model which was trained by combining more than 220 million words from
three primary sources: Indonesian Wikipedia (74 million words), news articles (522
million words), and an Indonesian Web Corpus (90 million words).
1 https://huggingface.co/Wikidepia/IndoConvBERT-base.
4.1 Dataset
The dataset contains user reviews gathered from the TripAdvisor travel website.
We selected comments for Indonesian hotel reviews only. More than 5000 comments were
scraped; however, this research labeled only about a thousand of them manually due to
limited resources. The total data used for the experiment is 1085 rows, which are
split into 80% for training (868 rows) and 20% for validation (217 rows).
After collecting the data, we listed aspects related to the hotel domain. These
aspects later become class labels in the classification process. There are several
references that use different aspect lists for the hotel domain. We decided to combine
aspect lists in the hotel domain from several references, i.e., SPACE [3], HoASA
(Indonesian ABSA dataset) [10], and also from some popular travel web such as
Tripadvisor, Booking.com, and Agoda. Finally, we used eight aspects, i.e., ‘room’,
‘location’, ‘cleanliness’, ‘building’, ‘food’, ‘service’, ‘value’, and ‘facilities’. The
description for each aspect is presented in Table 2.
Based on statistical analysis, one comment has more than two sentences and
can contain more than one aspect, which we call a multi-aspect review. The data
contain a total of 4720 sentences, and the corpus contains 58,012 words. The average
comment length is 4 sentences. In terms of word length, the average length per comment
is 53–54 words, while the average length per sentence is 16–17 words. Of the 1085
sample comments, the majority, around 89%, contain multiple aspects. An example of
sentences containing multiple aspects is presented in Fig. 3.
2 https://huggingface.co/cahya/BERT-base-indonesian-522M.
3 https://huggingface.co/cahya/BERT-base-indonesian-1.5G.
provide good results for aspect extraction in Indonesian hotel reviews, with F1-scores
of 0.7388 and 0.7104, respectively. We also implemented the pre-trained BERT-multilingual
model and obtained an increase in the F1-score up to 0.8046. The pre-trained
BERT-multilingual model gives higher results than the two earlier models; this is
because the pre-trained BERT-multilingual model is trained to handle multiple languages,
whereas the two models mentioned are specifically trained for the English language,
making them less suitable for an Indonesian dataset.
Moreover, we used various Indonesian pre-trained BERT models in the experiment.
Based on the experimental results presented in Table 3, the pre-trained
Indonesian-BERT-1.5G and IndoBERT-large models demonstrate the highest superiority,
with an impressive F1-score of up to 0.84, the Indonesian-BERT-1.5G score being 0.0008
higher than that of IndoBERT-large. However, the pre-trained Indonesian-BERT-1.5G model
shows the highest accuracy and ROC, of 44.24% and 0.8734, respectively, compared to
the other techniques, with the lowest loss value of 0.3029.
On the other hand, the pre-trained IndoBERT-base and Indonesian-BERT-522M models
exhibit comparable performance, with an F1-score of 0.83. The pre-trained
IndoBERTWiki-base model achieved an F1-score of 0.80, similar to the BERT-multilingual
model but higher than IndoBERTweet, which only obtained an F1-score of 0.7936. Notably,
the pre-trained IndoConvBERT model exhibits the lowest performance, yielding an
F1-score of 0.6941.
For the second experiment (DoE #2), we employ several machine learning tech-
niques for extracting aspects and then compare their results with BERT. The machine
learning techniques implemented are Decision Tree (DT), Naïve
Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), Random
Forest (RF), Extreme Gradient Boosting (XGB), and Multilayers Perceptron (MLP).
Besides that, we also used Long Short-Term Memory (LSTM) for the deep learning
approach.
Table 4 Techniques performance comparison

                    Technique   F1     Acc (%)
Machine learning    DT          0.77   30.41
                    NB          0.69    7.37
                    LR          0.74   22.58
                    SVM         0.77   27.19
                    RF          0.79   27.65
                    XGB         0.84   41.47
                    MLP         0.69   18.43
Deep learning       LSTM        0.67   38.71
                    BERT        0.84   44.24
Table 4 presents the F1-score and accuracy for each technique. The results show
that XGB produced the highest performance among the machine learning techniques,
with an F1-score of 0.84 and an accuracy of 41.47%. Meanwhile, NB showed the lowest
accuracy of 7.37% and the lowest F1-score of 0.69, similar to MLP. In the deep learning
approach, LSTM produced an F1-score of 0.67 and an accuracy of 38.71%, a considerably
lower performance than BERT's. Overall, in this case, the only machine learning
technique whose performance is comparable with BERT is XGB. However, in terms of
accuracy, BERT is higher than XGB, at 44.24% and 41.47%, respectively. This indicates
that BERT is more powerful in overcoming imbalanced data than XGB.
Table 5 shows the example of aspect prediction results for both single-and multi-
aspect from the original hotel reviews. Single-aspect review is represented in Review
ID 2209, while the example of multi-aspect review is presented in Review ID 2572.
The embed column contains the embedding results of BERT, which are represented as
sequential numbers. Meanwhile, the output represents the binary result of classification,
i.e., the aspect class prediction. The value "1.0" represents a predicted aspect, while
"0.0" indicates the aspect is not predicted. Furthermore, the aspect predictions are
converted from binary values to aspect labels.
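This binary-to-label conversion can be sketched as follows; the 0.5 threshold on the sigmoid outputs is an assumption, since the text only states that a value of "1.0" marks a predicted aspect.

# Illustrative conversion of multi-label model outputs to aspect labels.
# The 0.5 cut-off on the sigmoid probabilities is an assumed threshold.
import torch

ASPECTS = ["room", "location", "cleanliness", "building",
           "food", "service", "value", "facilities"]

def decode_aspects(logits: torch.Tensor, threshold: float = 0.5) -> list:
    probs = torch.sigmoid(logits)                    # independent probability per aspect
    binary = (probs >= threshold).int().squeeze(0).tolist()
    return [aspect for aspect, flag in zip(ASPECTS, binary) if flag == 1]

# Example with made-up logits for one review:
logits = torch.tensor([[2.1, 1.3, -0.7, -1.2, -0.4, 0.2, -2.0, -1.5]])
print(decode_aspects(logits))                        # -> ['room', 'location', 'service']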
Based on the experimental results, we found that all the pre-trained BERT models can
provide good results for the extraction of multiple aspects from Indonesian hotel reviews.
In the same case, BERT is also superior to the machine learning techniques and to another
deep learning technique: in this research, we compared BERT with LSTM, and LSTM gives
lower performance than both BERT and the other ML techniques. This shows that deep
learning does not always provide higher performance than ML techniques. It is also
possible that extra effort is needed to fine-tune the parameters so that the model
fits the dataset.
Overall, to handle multi-aspect extraction on Indonesian hotel reviews, the best
performance is produced when using a pre-trained Indonesian-BERT-1.5G model
with an F1-score of 0.8421 and an accuracy of 44.24%. This mechanism makes
use of contextual information captured by BERT to carry out classification tasks
effectively. Pre-training in large-scale corpora is capable of helping BERT learn
5 Conclusion
References
20. Manik LP et al (2020) Aspect-based sentiment analysis on candidate character traits in Indone-
sian presidential election. In: 2020 International conference on radar, antenna, microwave,
electronics, and telecommunications (ICRAMET), pp 224–228
21. Ismet HT, Mustaqim T, Purwitasari D (2022) Aspect based sentiment analysis of product review
using memory network. Sci J Inf 9(1):73–83
22. Chamid AA (2023) Graph-based semi-supervised deep learning for Indonesian aspect-based
sentiment analysis
23. Yanuar MR, Shiramatsu S (2020) Aspect extraction for tourist spot review in Indonesian
language using BERT. In: 2020 International conference on artificial intelligence in information
and communication (ICAIIC 2020), pp 298–302
24. Azhar AN (2020) Fine-tuning pretrained multilingual BERT model for Indonesian aspect-based
sentiment analysis
25. Vaswani A et al (2017) Attention is all you need. Adv Neural Inf Process Syst 2017:5999–6009
26. Ruskanda FZ, Widyantoro DH, Purwarianti A (2018) Comparative study on language rule
based methods for aspect extraction in sentiment analysis. In: 2018 International conference
on Asian language processing (IALP 2018), Bandung, Indonesia, November 15–17, 2018, pp
56–61
Artificial Intelligence (AI) Empowered
Sign Language Recognition Using
Hybrid Neural Networks
Ambar Saxena, Nailya Sultanova, Jamila Mustafina, and Noor Lees Ismail
A. Saxena
Liverpool John Moores University, Liverpool, UK
e-mail: [email protected]
N. Sultanova · J. Mustafina (B)
Kazan Federal University, Kazan, Russia
e-mail: [email protected]
N. Sultanova
e-mail: [email protected]
N. L. Ismail
UNITAR International University, Petaling Jaya, Malaysia
e-mail: [email protected]
1 Introduction
There are over 300 sign languages used by 70 million deaf people worldwide,
according to the World Federation of the Deaf (Dr. Joseph J. Murray, 2018).
However, most communication technologies have been designed for spoken or
written language, creating barriers for sign language users in society. While tools such as Imo and
WhatsApp have become ubiquitous in daily life, they are not always accessible or
effective for deaf people. This can create difficulties in communication between the
deaf community and the hearing majority. The development of sign language recog-
nition technology has the potential to bridge this communication gap and facilitate
more seamless and inclusive communication for the deaf community.
Artificial Intelligence (AI) has made significant advancements in the field of
computer vision in recent years. Computer vision is the ability of machines to inter-
pret and understand images and video. With the help of AI techniques such as deep
learning, computer vision algorithms can now recognize objects, faces, and other
important features in images and videos with remarkable accuracy.
Deep learning techniques such as convolutional neural networks (CNNs) have
been particularly effective in computer vision. CNNs can automatically learn and
identify features within images, making them highly effective in object recognition
and classification. Other AI techniques such as recurrent neural networks (RNNs)
and generative adversarial networks (GANs) have also been applied in computer
vision with promising results.
One of the key challenges in developing sign language recognition systems is
the variability of sign language across different regions and cultures, as well as
the variability in the way that different individuals use sign language. In addition
to the technical challenges of sign language recognition, there are also practical
considerations, such as the limited availability of training data, especially for less
commonly used sign languages or dialects.
To address this variability, researchers have developed techniques for adapting
hybrid neural networks to the specific characteristics of individual signers, in order
to improve the accuracy of the model’s translations.
By combining the strengths of convolutional neural networks (CNNs) and recur-
rent neural networks (RNNs), researchers have been able to develop hybrid models
that can accurately recognize sign language gestures and translate them into spoken
or written language.
The primary purpose of this research is to examine hybrid neural networks while
training various CNN and ensemble networks to increase the accuracy and precision
of sign language recognition. The aim of this study is to use deep learning networks
in artificial intelligence to recognize sign language gestures automatically.
2 Background
The majority of studies have concentrated on detecting static sign language gestures
in pictures or video clips acquired in a testing setting. Image capture, data
pre-processing, image segmentation, feature extraction, and classification are the
five categories into which the process of image-based sign language recognition
(SLR) is divided (Fig. 1).
Many researchers use previously defined datasets like ASL Image Dataset
(ASLID), ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)—2010,
ASL Gesture Dataset 2012, RWTH-Phoenix-Weather dataset, ArSL databases, and
the SIGNUM [1]. A few researchers develop their own datasets for training due
to the scarcity of sign language datasets for specific countries.
Wadhawan and Kumar concentrate on widely used scalable networks that are
transferred in transfer learning. AlexNet, which builds on LeNet, is a commonly used
deep learning architecture. They evaluated more than 50 CNN models. The data were
further evaluated using multiple optimizers, and it was determined that the suggested
technique attained the maximum training accuracy of 99.72% on color images
and 99.90% on grayscale images [2].
Rathi [3] used the GoogLeNet neural network, which has depth in both directions
and is a very deep network with high accuracy.
Masood Sarfaraz and Thuwal [4] classified 2624 ASL gestures using a pre-trained
VGG16 model. Simonyan and Zisserman [5] presented the VGG network architec-
ture. Although the ResNet architecture can be successfully trained at deep depths,
VGG-16 networks are regarded as being exceptionally deep. Sadly, VGG has two
significant drawbacks: it is quite slow during training, and the network weights are
quite large.
Convolutional neural networks (CNNs) are widely accepted as the most significant
deep learning models for image recognition and classification.
3 Methodology
The methodology used in this study includes the essential sign language recognition system operations: input data selection, pre-processing of the data, transformation into a structured and understandable data format, dataset balancing, application of supervised deep learning techniques, and evaluation of model performance using evaluation metrics.
This study used two datasets, consisting of images of alphabets from the American Sign Language (ASL) and images of sign language digits, respectively (Figs. 2 and 3).
Another typical pre-processing strategy, called augmentation, supplements the original data set with modified copies of existing photos. Common affine transformations include scaling, rotation, and others; these expose the neural network to a wider variety of examples.
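As an illustration only (not the authors' exact pipeline; the generator, parameter values, and directory layout below are assumptions), such affine augmentation can be set up in Keras as follows:

# Hedged sketch: affine augmentation for sign-language images (illustrative only).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rescale=1.0 / 255,       # normalise pixel values
    rotation_range=15,       # small random rotations
    zoom_range=0.1,          # random scaling
    width_shift_range=0.1,   # horizontal translation
    height_shift_range=0.1,  # vertical translation
    horizontal_flip=True,    # mirror images
    validation_split=0.2,    # hold out part of the data for validation
)

# Assumed directory of ASL images, one sub-folder per class (hypothetical path).
train_gen = augmenter.flow_from_directory(
    "asl_images/", target_size=(64, 64), subset="training", class_mode="categorical")
val_gen = augmenter.flow_from_directory(
    "asl_images/", target_size=(64, 64), subset="validation", class_mode="categorical")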
Convolutional neural networks were used for the model training. After classification, the classifier placed each preprocessed sign image in the appropriate category. A dataset of several ASL signs was used to train the classifier. After the dataset had been separated into training and validation sets, the training set comprised 80% of the data. To bring unpredictability into the deep learning model training process, the dataset must be randomly shuffled; this prevents the model from favoring specific parameters. The following CNN networks were trained and evaluated:
• A few different CNN networks (from shallow to deep layer counts).
• An ensemble of a few CNNs of varying resolutions followed by classification using an LSTM (a hedged sketch of this configuration follows the list).
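The sketch below is purely illustrative of such a hybrid configuration: the layer counts, filter sizes, number of classes, and the way the CNN branch outputs are stacked into a short sequence for the LSTM are assumptions, not the exact architecture trained in this study.

# Hedged sketch: ensemble of CNN branches followed by an LSTM classifier (illustrative only).
from tensorflow.keras import layers, models

NUM_CLASSES = 29           # assumed number of ASL classes
INPUT_SHAPE = (64, 64, 3)  # assumed image size

def cnn_branch(filters):
    """One small CNN feature extractor; depth and filter counts are assumptions."""
    inp = layers.Input(shape=INPUT_SHAPE)
    x = layers.Conv2D(filters, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(filters * 2, 3, activation="relu", padding="same")(x)
    x = layers.GlobalAveragePooling2D()(x)
    return models.Model(inp, x)

branches = [cnn_branch(f) for f in (16, 32, 64)]  # CNNs of varying capacity

image = layers.Input(shape=INPUT_SHAPE)
# Project each branch output to a common width and stack them as a length-3 sequence.
projected = [layers.Dense(64, activation="relu")(b(image)) for b in branches]
stacked = layers.Concatenate(axis=1)([layers.Reshape((1, 64))(p) for p in projected])
combined = layers.LSTM(64)(stacked)               # LSTM combines the branch features
output = layers.Dense(NUM_CLASSES, activation="softmax")(combined)

hybrid = models.Model(image, output)
hybrid.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])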
The effectiveness of the network obtained in this study for American Sign Language
recognition will be evaluated in relation to the model training parameters, which
include the number of layers, filters, and optimizers. The study evaluated the perfor-
mance of the approach on the test dataset, using four distinct measures to analyze
its effectiveness. This process typically involves calculating various metrics, such
as accuracy, precision, recall, and F1-score, as well as visualizing the results using
confusion matrices and training and loss graphs.
In this study, the models are compared using the following evaluation measures:
Training Loss refers to the error that the model makes on the training set during
training.
Validation Loss. The purpose of the validation set is to evaluate the performance of
the model on new, unseen data, and to prevent overfitting.
A Confusion Matrix is a table that summarizes the performance of a classification
model on a set of test data for which the true values are known.
Accuracy. This measures the overall percentage of images that were correctly
classified by the model.
Recall (Sensitivity). This measures the proportion of true positive predictions among
all actual positive instances. It indicates how well the model was able to detect a
specific class.
F1 Score. This is the harmonic mean of precision and recall and provides a single value that summarizes the model's performance.
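To make these definitions concrete, the short sketch below computes the same measures with scikit-learn; the label arrays are placeholders, not values from this study.

# Hedged sketch: computing the evaluation metrics described above (placeholder labels).
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])  # placeholder ground-truth classes
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])  # placeholder model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="weighted"))
print("Recall   :", recall_score(y_true, y_pred, average="weighted"))
print("F1-score :", f1_score(y_true, y_pred, average="weighted"))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))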
This section details the outcomes of the model-building experiments and includes the results for each trained model. It provides an evaluation of the trained models, including metrics such as accuracy, precision, and recall, and compares the models tested with the proposed hybrid neural network model.
Training and loss curves are provided for each model, and Table 1 compares the training and loss for all models.
Model-1 achieved 79% accuracy on the training dataset and 87% on the test dataset. Accuracy decreased with Model-2, owing to the increased filter size, to 62% on the training data and 78% on the test data. With one additional convolutional layer and a smaller filter size, Model-3 achieved 86% training and 93% test accuracy. Overall, the proposed hybrid model, which combines all of the other models in an ensemble followed by an LSTM classifier, achieved 98% accuracy on the training dataset and 99% on the test dataset.
2. Model 2: The precision score is 0.8005, which means that when the model predicts a positive label, it is correct 80.05% of the time. The recall score is 0.7794, which means that the model correctly identifies 77.94% of all positive instances. The F1-score is 0.7803, which is the harmonic mean of precision and recall. This model has relatively low accuracy, but the precision score is higher than the recall score, which suggests that it tends to make more false negative errors.
3. Model 3: This model achieved an accuracy of 0.9389, which is quite high. The
precision and recall scores are both above 0.93, with values of 0.9428 and 0.9389,
respectively. The F1-score is 0.9390. This model has the highest precision and
recall scores among the four models, indicating that it is better at both minimizing
false positives and false negatives.
4. Hybrid Model: This model achieved the highest accuracy of 0.9934, which is
significantly higher than the other models. The precision and recall scores are
also very high, with values of 0.9935 and 0.9934, respectively. The F1-score is
0.9934. This model has very high precision and recall, which means it performs
very well in minimizing both false positives and false negatives (Fig. 5).
After training with the training and validation datasets, the proposed hybrid model was used to predict the unseen test dataset and was evaluated with a confusion matrix and by plotting test images with predicted and actual labels.
4.5 Prediction
Unseen test data were predicted using the proposed hybrid model, and Fig. 7 shows images with actual labels against predicted labels. The model predicted all labels correctly. In summary, the hybrid model has the highest accuracy and precision/recall scores among the four models, while Model 2 has the lowest accuracy and a higher precision score compared to its recall score. Model 1 and Model 3 have relatively balanced precision and recall scores.
5 Conclusions
This study explored artificial intelligence-based sign language recognition using CNNs and, further, a hybrid neural network. This work presented a comprehensive overview of recent research in the sign language recognition field and discussed various aspects around it.
The aim of this study was to train a hybrid neural network to recognize American Sign Language letters and digits with higher accuracy. Experiments performed in this study showed that the proposed hybrid model outperformed the other CNN models across the evaluation metrics. The results suggest that the proposed models can achieve high performance in ASL recognition, which can have important applications in improving communication for the hearing-impaired.
This study showed that an ensemble of different models can improve sign language recognition accuracy over the individual models. There are many different types of neural networks, each with its strengths and weaknesses. Hybrid neural networks allow developers to create models that leverage the strengths of each network while minimizing their weaknesses. Although hybrid neural networks have many advantages, they can be more complex to design and train than traditional neural networks.
Acknowledgements The authors thank the UNITAR management for supporting the publication
of this paper.
References
10. Sharma S, Kumar K (2021) ASL-3DCNN: American sign language recognition technique using 3-D convolutional neural networks. Multimedia Tools Appl 80(17):26319–26331. https://doi.org/10.1007/s11042-021-10768-5
11. Bheda V, Radpour D (2017) Using deep convolutional networks for gesture recognition in
American sign language. CoRR, abs/1710.06836. http://arxiv.org/abs/1710.06836
12. Rao GA, Syamala K, Kishore PV, Sastry ASCS (2018) Deep convolutional neural networks
for sign language recognition. In: 2018 Conference on signal processing and communication
engineering systems, SPACES, pp 194–197
13. Koller O, Zargaran S, Ney H, Bowden R (2018) Deep sign: enabling robust statistical continuous sign language recognition via hybrid CNN-HMMs. Int J Comput Vision 126(12):1311–1325. https://doi.org/10.1007/s11263-018-1121-3
14. Kandel I, Castelli M, Popovič A (2021) Comparing stacking ensemble techniques to improve musculoskeletal fracture image classification. J Imaging 7(6). https://www.mdpi.com/2313-433X/7/6/100
The Performance of GPT-3.5 in Summarizing Scientific and News Articles
1 Introduction
In recent years, there have been significant advancements in the field of natural
language processing (NLP). NLP entails creating computational models and techniques that allow machines to comprehend, interpret, and produce language similar to that of humans [1]. The advancements in NLP include improvements in language models, machine translation, deep learning and transfer learning [2]. Over the past decade, we have seen a fast-paced evolution in large language models (LLMs). This has been possible due to advancements in deep learning techniques as well as the availability of high-performance computing hardware. This evolution started with the inception of the word2vec model (a simple two-layered neural network) and the development of continuous bag of words (CBOW) and continuous skip-gram techniques [3]. Moreover, transformers made a significant impact in NLP by altering encoder–decoder models into attention-based models [4]. This was followed by BERT [5] and GPT [6]. These models have evolved to the extent that pre-trained LLMs can now be applied to many different NLP tasks in a zero-shot configuration [7, 8]. This year, models like Falcon and Llama-2 [9] have completely revolutionized the field of NLP. These models are trained with billions of parameters, which enables them to achieve state-of-the-art performance on several tasks including text generation,
text prediction, and text summarization.
Text summarization is among the most researched topics in natural language
processing as it gives a brief overview of the text and is being used in a wide range of
applications. In this study, we aim to analyse the capabilities of the GPT-3.5 model
for the text summarization task and discuss its suitability for scientific articles versus
news articles. To the best of our knowledge, no such study has yet been conducted to
evaluate the capabilities of GPT-3.5 and understand how well it captures the context
of complex scientific articles and simple but not trivial news articles.
2 Related Work
Pre-trained large language models (LLMs) came into play with models like embed-
dings from language models (ELMO) [10, 11] and generative pre-trained trans-
former (GPT), which are impressive in their ability to do a wide variety of NLP
tasks, including text generation and comprehension [7]. These models are trained on
billions of parameters and have achieved remarkable breakthroughs in setting new
benchmarks on a variety of NLP tasks [6, 12]. It is argued that GPT-3 has acquired
linguistic knowledge and can comprehend semantic information [8]. Moreover, ten
different LLMs were compared including models like OPT [13], Instruct GPT [14]
and GPT-3 [12], and it was concluded that summaries from state-of-the-art LLMs are on par with summaries written by humans. Similarly, very recently Llama-2 has been released
which is a series of fine-tuned LLMs ranging from 7 to 70 billion parameters [9]. It
has outperformed Falcon 7-B and 40-B on several benchmarks. These benchmarks
include coding, reasoning, knowledge tests, etc. These LLMs can be easily used for
zero-shot tasks [7, 8]. These models make use of transformers, which utilize a self-attention mechanism to focus on different parts of the input sequence [4]. They also make use of positional encodings to capture the order and position of words, which is important to fully capture the significance of words in a sentence. The
foundational architecture for all LLMs like GPT, BERT, etc. is based on transformers.
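As a minimal, illustrative sketch of the scaled dot-product self-attention these architectures rely on (the shapes and random values are placeholders, not taken from any model discussed here):

# Hedged sketch: scaled dot-product self-attention (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); W*: projection matrices for queries, keys, and values."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each token attends to the others
    return softmax(scores) @ V                # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                   # 5 tokens, 16-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # -> (5, 16)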
For text summarization, BERTSUM has been introduced [15, 16]. It is a modified
variant of BERT specifically designed for text summarization [5]. It generates extrac-
tive summaries and is trained on the CNN/DailyMail dataset [17]. On the other hand,
BERTSUMABS is used widely to generate abstractive summaries. Unlike extractive
summaries, abstractive summaries are generated by the model using its learnt vocab-
ulary rather than extracting sentences from the original document. It is worth noting
that these models are trained on news-related datasets like CNN/DailyMail dataset
[17], the New York Times Annotated Corpus [18], and XSum dataset [19]. One reason
behind this could be that the vocabulary of news-related datasets is straightforward
and easier to comprehend as their sources are much simpler in nature. However, BERT
(extractive) performs poorly and generates grammatically incorrect summaries even
when dealing with simple student essays. Similarly, its performance deteriorates even
further when asked to summarize a research article as it starts including noise in its
generated summaries [20]. Interestingly, most of the text summarization work has
been carried out on general news-related articles. The task of summarizing research
papers has not been given much attention. In this direction, a recent study proposed
a model that compares BERT and GPT-2 and is trained/fine-tuned on COVID-19
medical research articles [21]. The dataset consists of only 59,000 scholarly articles
(at the time) [22]. They concluded that fine-tuning GPT-2 only gave a reasonable
abstractive summary and that it was still below human-level performance.
Research papers are not only complex in nature but also contain noise (equations,
figures, tables, etc.) that needs proper distillation in order to produce high-quality
summaries. Therefore, summarizing research articles continues to remain an open
challenge. Models that are fairly recent like Llama-2 and Falcon have not yet even
been tested on their text summarization capabilities. A recent study has analysed
the GPT series for text summarization, but the study has been conducted on the
SAMSum dataset, a dialogue-based corpus with rather simpler text [23]. This work
compares the performance of GPT-3.5 in summarizing scientific research articles and
news data. This will quantify GPT-3.5’s ability to comprehend complex scientific
literature compared to news articles.
3 Methods
The aim of this study is to evaluate the performance of GPT-3.5 for text summarization
on both scientific and news articles. The evaluation process involves generating summaries of 100 scientific articles collected from the arXiv STEM dataset and 100 news articles sampled from the CNN/DailyMail dataset. Using the
API, the GPT-3.5 model is prompted to generate summaries of both scientific and
news articles.
For research papers, their respective abstracts are used as reference summaries,
whereas the CNN/DailyMail dataset contains article highlights which briefly describe
the theme of an article. These highlights act as reference summaries for news articles.
The average abstract length of a scientific article is 150 words, whereas the average
length of the news highlights is 50 words. To match the number of words in the
reference summaries, GPT-3.5 is prompted to generate summaries in 150 words for
scientific articles and 50 words for news highlights. The generated summaries are then
compared to the reference summaries and ROUGE scores (ROUGE-I, ROUGE-II,
and ROUGE-L) are computed.
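As an illustration of this scoring step, the sketch below uses the open-source rouge_score package, which is one common choice rather than necessarily the implementation used in this study; the example texts are placeholders.

# Hedged sketch: ROUGE-1/2/L F1 between a reference summary and a generated summary.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The study evaluates GPT-3.5 summaries of scientific and news articles."
generated = "GPT-3.5 generated summaries of scientific and news articles are evaluated."

for name, score in scorer.score(reference, generated).items():
    print(f"{name}: F1 = {score.fmeasure:.3f}")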
The scientific articles were selected from the arXiv STEM dataset. It is a subset of
the arXiv dataset, which consists of 1.7 million research articles and papers in different
categories. We selected a hundred research articles from the arXiv STEM dataset to
assess the performance of GPT-3.5 in generating summaries for these articles. The
news articles are sampled from the CNN/DailyMail dataset which contains 300,000
news articles.
After selecting the datasets, the next step was to pre-process them in order to
make them usable for the model. The arXiv STEM dataset provides abstracts, a DOI
number, title, author names, publisher, and the year in which the paper was published.
However, it does not provide full-length papers. Another issue with this dataset is
that the DOI number is missing for many articles. Therefore, we selected 100 papers
at random with a condition on the DOI number, i.e. the chosen article was dropped
if its DOI number was missing. Using the DOI number, the articles were fetched
from the arXiv website, and a smaller dataset of a hundred papers was prepared. Since
the papers were in Portable Document Format (PDF), they could not be fed directly
into the model. Therefore, the next step involved converting them to text using an
open-source Python library called “PyPdf”.
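A minimal sketch of this conversion step is shown below, assuming the current pypdf package (referred to above as "PyPdf"); the file path is a placeholder.

# Hedged sketch: extracting plain text from a downloaded arXiv PDF.
from pypdf import PdfReader

def pdf_to_text(path: str) -> str:
    reader = PdfReader(path)
    # Concatenate the text of every page; layout artefacts may still need cleaning.
    return "\n".join(page.extract_text() or "" for page in reader.pages)

text = pdf_to_text("papers/example_arxiv_paper.pdf")  # placeholder path
print(text[:500])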
In contrast to the arXiv STEM dataset, CNN/DailyMail includes both the news
articles and their corresponding highlights. This made it more straightforward to
generate summaries from the articles without the need for pre-processing. A hundred
articles were sampled from this dataset and fed into the model.
3.2 GPT-3.5
GPT-3.5 is available using an API from OpenAI; however, there is a token limitation
for free usage. The limit is 4096 tokens, which works out well for the CNN/DailyMail
dataset as they are not lengthy documents. However, this limitation makes it difficult
to process scientific articles from the arXiv STEM dataset, which consists of research papers that are far too long to fit within this limit. For this reason, we
used the GPT-3.5 16 k context window model to process the papers by omitting the
abstracts and the references from the papers. Summaries were then generated with
GPT-3.5 16 k for 100 scientific articles selected from the arXiv STEM dataset and
100 news articles sampled from the CNN/DailyMail dataset, respectively.
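A sketch of how such a prompt can be issued through the OpenAI Python client is given below; the model name, prompt wording, temperature, and word limits are assumptions for illustration, not the exact configuration used in this study.

# Hedged sketch: prompting GPT-3.5 to summarize an article via the OpenAI API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize(article_text: str, word_limit: int) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-16k",  # long-context variant for full-length papers
        messages=[
            {"role": "system", "content": "You summarize articles concisely."},
            {"role": "user", "content": f"Summarize the following article in about "
                                        f"{word_limit} words:\n\n{article_text}"},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

# 150-word summaries for papers, 50-word summaries for news articles (as described above).
paper_summary = summarize("...full paper text without abstract and references...", 150)
news_summary = summarize("...news article text...", 50)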
4 Results
Using the GPT-3.5 OpenAI API, the pre-trained model is prompted to generate
summaries of the 100 scientific articles. Then the ROUGE scores were calculated to
evaluate the similarity between reference summaries (paper abstract) and generated
summaries. Table 1 shows the maximum, minimum, and mean F1 values for all three
types of ROUGE scores (ROUGE-I, ROUGE-II, and ROUGE-L). We observed that
the highest F1 value remained consistently high across the three ROUGE measures: unigram (ROUGE-I), bigram (ROUGE-II), and l-gram (ROUGE-L), which indicates that the
generated summary captures most of the information in the reference summary (paper
abstract). The best, worst, and average summaries generated by GPT-3.5 are presented
in Table 2.
Table 2 Partial extracts from the best, worst, and average summaries generated by GPT-3.5 for scientific articles with the corresponding reference summaries (paper abstracts)

Best (ROUGE score 0.88)
Reference summary (abstract): The complex Monoceros Loop SNR/Rosette Nebula region contains several potential sources of very high-energy (VHE) gamma-ray emission and two as yet unidentified high-energy EGRET sources. Sensitive VHE observations are required to probe acceleration processes in this region. The H.E.S.S. telescope array has been used to search for very high-energy gamma-ray sources in this region
Generated summary (GPT-3.5): The HESS telescope array has discovered a new gamma-ray source, HESS J0632 + 057, located close to the rim of the Monoceros SNR. This source is unresolved by HESS and has no clear counterpart at other wavelengths but is possibly associated with the weak X-ray source 1RXS J063258.3 + 054857, the Be-star MWC 148, and/or the lower energy gamma-ray source 3EG J0634 + 0521

Worst (ROUGE score 0.16)
Reference summary (abstract): I describe the different classes of Active Galactic Nuclei (AGN) and the basic tenets of unified schemes. I then review the properties of the extragalactic sources detected in the GeV and TeV bands, showing that the vast majority of them belong to the very rare blazar class
Generated summary (GPT-3.5): Blazars, a type of active galactic nucleus (AGN), dominate the gamma-ray sky. Blazars are characterized by irregular, rapid variability, high polarization, and a non-thermal continuum that extends from radio to gamma-ray wavelengths. They are believed to be powered by relativistic jets of plasma emitted from a central black hole

Average (ROUGE score 0.35)
Reference summary (abstract): We consider a single ion confined in a trap under radiation of two travelling waves of lasers. In the strong-excitation regime and without the restriction of Lamb-Dicke limit, the Hamiltonian of the system is similar to a driving Jaynes–Cummings model without rotating wave approximation (RWA). The approach we developed enables us to present a complete eigen solutions, which makes it available to compare with the solutions under the RWA
Generated summary (GPT-3.5): Researchers have investigated the system of a single trapped ion experiencing two travelling wave lasers in the strong-excitation regime (SER) and in the absence of the Lamb-Dicke limit. They found that the ground state in the non-rotating wave approximation (RWA) case is energetically lower than the ground state in the RWA case. This new ground state could be useful for quantum information storage and quantum computing
Beginning with the best summary generated by GPT-3.5, it is evident that all the
technical details of this paper are captured by the generated summary. Moreover,
it consists of vocabulary that is seen in the reference summary. ROUGE has given
it the highest score on all three measures, given the fact that it captures the overall
theme very well and has reproduced important facts and figures (Table 2). The worst summary generated by GPT-3.5, which received the lowest ROUGE score, deviates from the theme of the reference summary and does not cover its context (Table 2). However, it summarizes the paper itself very well. After reading through the
papers with the least scoring summaries, we found two reasons for lower ROUGE
scores: (i) the reference summary (abstract) is too short, providing limited insights
into the original papers, and (ii) the abstract is too general and does not cover the theme of
the paper.
In the average scoring summary, it is observed that the context of the text has
been well captured by the model (Table 2). However, the summary is generated with
different vocabulary, and ROUGE lacks the ability to evaluate semantics. Such is the case with most of the average-scoring summaries, which calls into question the suitability of ROUGE for text evaluation. GPT can generate abstractive summaries, whereas surface-level evaluation of text can result in incorrectly scored summaries, as is evident from this example.
We used the same GPT-3.5 pre-trained model to generate summaries of the 100 news
articles and calculated ROUGE scores to evaluate the similarity between reference
summaries (news highlights) and generated summaries. Table 3 shows the maximum,
minimum and mean F1 values for all three types of ROUGE scores (ROUGE-I,
ROUGE-II, and ROUGE-L). As expected, the highest F1 value persistently remained
high on three ROUGE measures unigram (ROUGE-I), bigram (ROUGE-II), and
l-gram (ROUGE-L). However, in contrast to our expectations, the news articles
achieved the maximum ROUGE-I score of 0.67 much lower than the maximum
score for scientific articles (Tables 1, 3). To investigate this further, the best, worst,
and average summaries generated by GPT-3.5 are presented in Table 4.
Consistent with our observation for scientific articles, the best-scoring generated summary picked up all the important points from the news article that are part of the reference summary (news highlights). This shows that GPT-3.5 was able to rightly prioritize the important factors, similar to the reference summary.

Table 4 Partial extracts from the best, worst, and average summaries generated by GPT-3.5 for news articles with the corresponding reference summaries (news highlights)

Best (ROUGE score 0.67)
Reference summary (highlights): YouTube questions address taxes, the Bible, abortion, gun control. Giuliani, Romney, Huckabee spar over immigration. McCain challenges Paul over suggestion to bring troops home from Iraq. Nearly 5000 videos for the GOP debate; 2000 more than the Democratic debate
Generated summary (GPT-3.5): The CNN/YouTube GOP presidential debate was marked by acrimony and attacks between the Republican candidates. They engaged in a free-for-all, discussing immigration, the Iraq war, abortion, gun control, and their beliefs in the Bible

Worst (ROUGE score 0.04)
Reference summary (highlights): If you build it, the tourists will come to your museum. Museums for hobos, medical oddities, and trash. Kentucky Museum is where dummies go to die
Generated summary (GPT-3.5): The text describes ten unusual museums around the USA. These include the SPAM Museum, the National Museum of Funeral History, the Hobo Museum, the Mutter Museum, the Barnum Museum, the Conspiracy Museum, the Museum of Question

Average (ROUGE score 0.31)
Reference summary (highlights): President Bush says Tony Snow "will battle cancer and win". Job of press secretary "has been a dream for me", Snow says. Snow leaving on September 14, will be succeeded by Dana Perino
Generated summary (GPT-3.5): White House press secretary Tony Snow will step down from his post on September 14 due to his ongoing cancer treatment. He will be replaced by Deputy Press Secretary Dana Perino. President Bush expressed his sadness at Snow's resignation and praised his contributions
Although the reference summary is written by the author of the news articles, GPT-3.5
could still better summarize the article in a related context (see Table 4).
Moving to the generated summary securing the least ROUGE score, it is obvious
that the reference summary itself is vaguely written and does not really tell much
about the article itself. On the other hand, just by reading the GPT-3.5 summary, one
can easily guess what the article is about (Table 4). This puts into question the article
highlights from the original dataset and their usage as ground truth for the articles.
Finally, considering the average-scoring summary with a ROUGE score of 0.31, it does capture the context of the news article. However, the summary is generated with distinct vocabulary, which resulted in a lower ROUGE score. The other possible reason is that the news highlights (reference summary) are written in the typical writing style followed by the authors of news articles, which differs from the gener-
ated summary. As observed for scientific articles, ROUGE heavily relies on similar
vocabulary, and this example yet again puts its wide acceptability into question.
5 Discussion
Although we have used a different model from the GPT-3 text-davinci-002, our
model has also generated similar results. The study concluded that almost all avail-
able reference-based and reference-free metrics are ineffective when it comes to
evaluating summaries. For the reference-based metric ROUGE, our findings are
consistent with the literature [27]. Moreover, given the functioning mechanism of
reference-based text similarity measures, BLEU and METEOR would have also
resulted in similar low-scoring summaries. This points towards the long-standing
issue related to evaluating text as modern LLMs are now reproducing human-like
summaries [8]. Another study concluded that human annotators prefer summaries
generated by GPT-3 in comparison to fine-tuned models which were trained on large
summarization datasets [27].
In this study, we extend the analysis to see how well GPT-3.5 can summarize scientific papers as opposed to simpler news articles. Although the summaries generated for scientific papers received better ROUGE scores, the overall scores are close to those of
previous studies which have been carried out using automatic reference-based eval-
uation metrics [8, 27]. Moreover, it has been demonstrated that low-quality refer-
ence summaries result in lower scores by automatic metrics due to little correlation
between the summaries produced by these advanced LLMs and reference summaries
[8]. In our case, the average and poor scoring summaries for both scientific and news
datasets also indicate a similar issue with these reference summaries. Manual analysis
of the summaries produced in this study also confirms that the generated summaries
are concise and contain novel vocabulary, but ROUGE has failed to evaluate them.
Finally in agreement with the literature, this study emphasizes the need for a better
evaluation metric for text summarization.
6 Conclusion
This study evaluates the capability of GPT-3.5 on the task of text summarization. The
aim was to quantify the model’s contextual understanding and capacity to produce
precise and succinct summaries for scientific and news articles. We demonstrate that
GPT-3.5 can precisely summarize both scientific and news articles, with a slightly
higher ROUGE score for scientific summaries. In addition, the summaries generated
by GPT-3.5 demonstrate its ability to absorb and summarize complicated information
to capture the theme of both scientific and simple but not trivial news articles. It was
also noted that the reference summaries included in the datasets had flaws and were
not always accurate reflections of the articles’ context and content. As a result, the
generated summaries received lower ROUGE scores. Additionally, we anticipate that
the lower ROUGE scores reflect a limitation of ROUGE itself, as it heavily relies on
similar vocabulary. Future studies need to develop a semantic evaluation metric for
text summarization. Additionally, the creation of datasets containing high-quality
reference summaries will significantly enhance our ability to accurately evaluate the
summarization capabilities of LLMs across diverse datasets.
Acknowledgements The authors thank the UNITAR International University for supporting the
publication of this paper.
References
1. Mehmood T, Gerevini AE, Lavelli A, Olivato M, Serina I (2023) Distilling knowledge with a teacher's multitask model for biomedical named entity recognition. Information 14(5)
2. Mehmood T, Serina I, Lavelli A, Gerevini A (2020) Knowledge distillation techniques for
biomedical named entity recognition. In: Proceedings of the 4th workshop on natural language
for artificial intelligence (NL4AI 2020) co-located with the 19th International conference of
the Italian Association for artificial intelligence (AI*IA 2020), Anywhere, November 25–27th,
2020. CEUR Workshop Proceedings, vol 2735, pp 141–156. CEUR-WS.org
3. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in
vector space. arXiv preprint arXiv:1301.3781
4. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I
(2017) Attention is all you need. Adv Neural Inf Process Syst 30
5. Kenton JDMWC, Toutanova LK (2019) Bert: pre-training of deep bidirectional transformers
for language understanding. In: Proceedings of NAACL-HLT, vol 1, p 2
6. Radford A, Narasimhan K, Salimans T, Sutskever I et al (2018) Improving language
understanding by generative pre-training
7. Qin C, Zhang A, Zhang Z, Chen J, Yasunaga M, Yang D (2023) Is chatgpt a general-purpose
natural language processing task solver? arXiv preprint arXiv:2302.06476
8. Zhang T, Ladhak F, Durmus E, Liang P, McKeown K, Hashimoto TB (2023) Benchmarking
large language models for news summarization. arXiv preprint arXiv:2301.13848
9. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, Bashlykov N, Batra S, Bhargava
P, Bhosale S et al (2023) Llama 2: open foundation and fine-tuned chat models. arXiv preprint
arXiv:2307.09288
10. Mehmood T, Serina I, Lavelli A, Putelli L, Gerevini A (2023) On the use of knowledge transfer
techniques for biomedical named entity recognition. Future Internet 15(2):79
11. Peters M, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep
contextualized word representations
12. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P,
Sastry G, Askell A et al (2020) Language models are few-shot learners. Adv Neural Inf Process
Syst 33:1877–1901
13. Zhang S, Roller S, Goyal N, Artetxe M, Chen M, Chen S, Dewan C, Diab M, Li X, Lin XV
et al (2022) Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.
01068
14. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, Zhang C, Agarwal S, Slama
K, Ray A et al (2022) Training language models to follow instructions with human feedback.
Adv Neural Inf Process Syst 35:27730–27744
15. Liu Y (2019) Fine-tune bert for extractive summarization. arXiv preprint arXiv:1903.10318
16. Liu Y, Lapata M (2019) Text summarization with pretrained encoders. arXiv preprint arXiv:
1908.08345
17. Hermann KM, Kocisky T, Grefenstette E, Espeholt L, Kay W, Suleyman M, Blunsom P (2015)
Teaching machines to read and comprehend. Adv Neural Inf Process Syst 28
18. Sandhaus E (2008) The New York Times annotated corpus (linguistic data consortium,
Philadelphia)
19. Narayan S, Cohen SB, Lapata M (2018) Don’t give me the details, just the summary! topic-
aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.
08745
20. Miller D (2019) Leveraging bert for extractive text summarization on lectures. arXiv preprint
arXiv:1906.04165
21. Kieuvongngam V, Tan B, Niu Y (2020) Automatic text summarization of COVID-19 medical
research articles using bert and gpt-2. arXiv preprint arXiv:2006.01997
22. Wang LL, Lo K, Chandrasekhar Y, Reas R, Yang J, Burdick D, Eide D, Funk K, Katsis Y,
Kinney R et al (2020) Cord-19: The covid-19 open research dataset. ArXiv
23. Gliwa B, Mochol I, Biesek M, Wawer A (2019) Samsum corpus: a human-annotated dialogue
dataset for abstractive summarization. arXiv preprint arXiv:1911.12237
24. Lin CY (2004) Rouge: a package for automatic evaluation of summaries. In: Text summarization
branches out, pp 74–81
25. Chakraborty R, Bhavsar M, Dandapat SK, Chandra J (2019) Tweet summarization of news
articles: an objective ordering-based perspective. IEEE Trans Comput Soc Syst 6(4):761–777
26. Ghanem FA, Padma M, Alkhatib R (2023) Automatic short text summarization techniques in
social media platforms. Future Internet 15(9):311
27. Goyal T, Li JJ, Durrett G (2022) News summarization and evaluation in the era of GPT-3. arXiv
preprint arXiv:2209.12356
Wound Stage Recognition Using YOLOv5
Abstract Pressure injuries or pressure sores are lesions that develop over the bony
prominences of the body. It is noted from professional nurses and from first-hand
experience that such wounds are often wrongly classified, making the healing process
difficult and painful for the patient, when it does not need to be the case. This study
aims to research the use of modern computer vision and artificial intelligence tech-
niques to aid in the classification of pressure ulcers to a degree superior to that
achieved by humans. The YOLOv5 algorithm was trained on various combinations
of three data sets. Single-phase and multi-phase architecture models were developed
and trained for the different data set configurations to identify the ideal scenario
of classification for the different stages of pressure ulcers. A single-phase architec-
ture model trained using the public data sets and enhanced with the novel data set
improves on the current state of the art. While the single-phase architecture was
deemed the better option for the wound stage classification throughout this study,
both architectures had their own strong points. The multi-phase architecture was
found to be better suited for smaller data sets, while the single-phase architecture
was better suited for larger ones. It was also found that, for both architectures, the combination of three data sets is sufficiently varied, whereas fewer data sets are not. This study
is further strengthened through the input of different viewpoints that complement the
lead researchers’ work.
1 Introduction
Pressure injuries (PIs), also known as pressure ulcers or pressure sores, are lesions
that develop over bony prominences of the body. These areas include the heels of
the feet, sacral area, hip, and the area of the coccyx. Other areas not as frequently
affected include the elbows, knees, ankles, back of the cranium, and the back of the
shoulders. These develop because of prolonged pressure or shearing in that area.
PIs usually affect people who suffer from a mobility impairment and simultaneously
have poor nutrition and hydration.
Apart from the obvious pain and discomfort, there are various complications that
could come with PIs. One of the most significant risks associated with pressure
ulcers is an infection, which can spread to deeper layers of the skin, leading to
conditions such as cellulitis, septic arthritis, and osteomyelitis, as well as potentially
life-threatening conditions like septicaemia, septic shock, and necrotising fasciitis
or gas gangrene. In addition to these health risks, pressure ulcers can also delay
healing, leading to chronic wounds and decreased mobility, affecting a person’s
quality of life, and potentially resulting in permanent disability or death. The Agency for Healthcare Research and Quality (AHRQ) estimates that 60,000 individuals die yearly from PIs acquired during their hospital stay. Treatment for pressure
ulcers can be costly and time consuming, requiring specialised medical care and
ongoing management, which can put a significant burden on both the patient and the
healthcare system.
Currently, the most common way that PIs are classified is through assessment tools such as the Braden Scale and the Waterlow Scale [1]. These break down various factors, such as mobility and sensory perception; the clinician conducting the assessment assigns a score to each prompt, and finally all the scores are added up and classification is done based on the total score. Treatment of PIs can be carried out through multiple mediums, such as dressings, which help to maintain a moist environment to maximise wound-bed epithelialisation and fix any nutritional deficiencies. Mobilisation and frequent turning also aid healing, as they relieve pressure and make sure that the wound does not get worse. This study focuses on classifying the
different stages of PIs to be used by both medical and non-medical professionals.
The hypothesis being addressed in this research is: By making use of computer
vision, it is possible to identify the different stages of PIs to further aid in their
treatment and healing. This research is addressing the following research questions:
1. By making use of computer vision, is it possible to classify the severity of skin
wounds?
2. What are the key features needed for a data set in such research?
3. How do different data sets contribute to the current state of the art?
4. Is a single-phase or multi-phase approach ideal for the chosen skin injury?
5. How does the proposed model architecture compare to the current state of the
art?
6. How can nurses or student nurses benefit from such research?
2 Background
Object detection and localisation via computer vision have been researched to aid
individuals with visual impairment [2] yet also to aid medical professionals in the
correct localisation [3] and classification of medical alignments such as PIs [4].
These are areas of injured skin, which could go on to affect underlying tissues
and muscles. As mentioned, PIs develop because of prolonged pressure or shearing
on an area, usually affect people who suffer from a mobility impairment, and simul-
taneously have poor nutrition and hydration. They are divided into four different
stages. Stage one is where the area is red and warm to the touch. During stage two,
the area will have an open sore or blister. In the third stage, the area is hollow, but
tendons, bones, and muscles are not exposed. Stage four is where the wound reaches
the muscle, ligaments, or bone. Apart from these four stages, a PI can be unstageable,
which means that there is full-thickness tissue loss, and the wound bed is necrotic,
meaning that the tissue is dead.1 The treatment of a single grade IV PI costs around
£14,108 [5], so it is always best to prevent PIs rather than treat them. Multiple risk
assessments, such as the Waterlow score, the DESIGN tool, and the SSKIN assess-
ment, have been established to determine the risk of someone developing a PI and
aid in the prevention of development. However, according to Barnard and Copson
[6], an estimated 73% of skin damage referred to the Northern Lincolnshire and
Goole Hospitals NHS Foundation Trust, from October 2015 to December 2015, was
misclassified. This is due to human error, possibly because the person performing the
risk assessment is not knowledgeable enough about the different stages of pressure
ulcers.
As of today, there have been multiple studies carried out to evaluate images of
wounds with the aid of machine learning; however, most of the focus has been on
wound segmentation and depth perception. To be able to distinguish between stages,
certain features need to be present. In the first stage of PIs, there is no outline around
the wound as the skin is still intact during this stage. The wound is also lighter in
colour, as during this stage the skin only starts to get red, so having areas highly
concentrated with pink to light red pixels indicates a ‘Stage 01’ PI.
In the second stage, there will be a light outline around the wound as the skin
would have broken; however, the wound will not be deep. While the colour of the
wound can be a bright red, it will not be a dark red as that will indicate heavy bleeding.
Wounds in the class ‘Stage 03’ have a thick border, showing wound depth. They
also feature bright to dark red colours, showing active bleeding; usually, these wounds
have a dark border. When the wound is a lighter red, it will usually have a light border.
During this stage, there is also no large amount of white in the wound itself, as this
would insinuate bone or muscle showing.
Throughout ‘Stage 04’, there is white inside the wound itself, insinuating that
there is bone or muscle showing. However, most of the surface area of the wound is
not white, and it will be a shade of red.
1 https://www.hopkinsmedicine.org/health/conditions-and-diseases/bedsores.
The ‘Unstageable’ class has the most distinct features as during this stage the
wound has gone from being able to be healed to dead tissue. The wound features
during this stage are the colours: black, yellow, and light tan. There is no red in the
wound as that indicates living cells.
Object localisation refers to the identification of one or multiple objects in an
image with a bounding box around each object.2 Being able to locate a wound
successfully is the first step to a functional automated wound healing system. Patel
[3] focused solely on developing a wound locator. A deep neural network (DNN)
was used to determine the time progress of wound healing in terms of wound colour
composition. The data set was collected from Advancing the Zenith of Healthcare
(AZH) Wound and Vascular Centre, and it contains a total of 1010 wound images,
with three different types of ulcers being included: diabetic foot ulcers (DFU), venous
ulcers (VU), and PIs. For testing, 56 images were collected from the Medetec database
[7]. The data set was then augmented by rotating, flipping, and blurring the images.
After augmenting, the data set had a total of 4050 images. In the end, 3645 images
were used to train, while 405 images were used for testing. The best results obtained
were for YOLOv5 with a precision of 0.925, recall of 0.905, F1-Score of 0.915, and
mean Average Precision (mAP) of 0.939.
Image segmentation is where the recognised object’s location is not outlined by
bounding boxes but by highlighting the pixels containing the object, which makes the object easier to analyse. Wang et al. [8] researched fully automatic wound
segmentation and did not find an adequate public data set, and so collaborated with
the AZH Wound and Vascular Centre.3 It was made sure that no identifiable features
of the patients were included in the images. The data set was collected over 2 years
and included a total of 1109 foot ulcer images from 889 patients under uncontrolled
illumination conditions with various backgrounds. Bounding boxes were then placed
on the data set by YOLOv3, the images were then augmented, and a training set of
3645 photos and a validation set of 405 photos were produced. The data set was also
made public [9]. The YOLOv3 model was run with a batch size of 8 for 273 epochs,
and this has resulted in a mAP of 0.939. A convolutional neural network (CNN),
MobileNet-v2, was then used to segment the wound from the images, and they did
not aim to further classify the wounds. Apart from testing their model on their data
set, they also tested it on the Medetec [7] data set, which contains 358 images of
various wounds. They tested both data sets on different models, and the best results
were obtained using the Medetec data set as shown in Table 1.
Elmogy et al. [10] aimed to develop an automatic segmentation system to detect
and segment PI red, blue, green (RGB) coloured images. The system proposed
extracts three different ulcer tissues: granulation being red; necrotic eschar being
black; and slough being yellow. Their data set consisted of 193 PI RGB images.
Thirty-six of the images were provided by IMQ Igurco Oficina Central4 with a reso-
lution of 1024 × 1024 pixels, and the other 157 images were obtained from the
2 https://machinelearningmastery.com/object-recognition-with-deep-learning/.
3 https://azhcenters.com/.
4 https://igurco.imq.es/.
Medetec [7] wound database with a resolution of 1024 × 731 pixels. The data set
was divided into 60% for training, 10% for validation, and 30% for testing. A four-
fold cross-validation technique was applied. A convolutional neural network (CNN)
was used for segmenting the PI tissues, making it quite fast.
To validate their results, they compared their study with four different segmen-
tation systems. These systems are the linear combinations of the discrete Gaussians
(LCDG) segmentation technique, the Fuzzy C-Means (FCM) technique, the Otsu
technique, and finally, a system from their previous work which was based on two
different CNN networks and only extracted features from hue, saturation, intensity
(HSI) colour space [11]. Apart from this validation method, they further validated
their system by using a fourfold cross-validation method, which achieved an average
accuracy of 0.94, getting a consistent result when compared to the first method.
The results from their former work [11] and current work were similar, showing that
changing the colour space does not have a high impact; however, since in their current
work they used one CNN network, as opposed to two, as in their former work, their
current system is more efficient.
Furthering the methods is object detection and classification. The output of this
method includes the image inputted, along with bounding boxes around objects and
a class label that accompanies each bounding box. Veredas et al. [4] focused on wound
image evaluation, and their data set was divided into: necrotic, slough, healing, and
granulation. They partnered with clinicians, who took colourised photos of PIs of
patients with home care, accumulating a data set of 113 photos. The next step was a
preliminary segmentation process, which was carried out with a k-means clustering
algorithm, where light reflections, clinician fingers, blood stains, and medical equip-
ment were removed from the photos. This was done to divide the images into groups
with similar characteristics based only on the wound itself. The second stage then
defined a region of interest of the estimated wound-bed area, getting a peri-ulcer data
set. In the third and final segmentation stage, the k-means clustering algorithm was
once again used, to process pixels in and out of the wound bed, this time making use
of different colour space combinations. The data set was then tested on three different
machine learning algorithms, support vector machine (SVM), feed-forward neural
network (FNN), and random forest (RF).
To determine which algorithm worked best, the performance rates or performance
metrics were compared, and Veredas et al. [4] utilised accuracy as their main perfor-
mance rate. In the wound-bed set, SVMs achieved the best overall performance rate,
with differences of 0.0257 and 0.0052 when compared to FNNs and RFs, respectively.
In the peri-ulcer set, SVMs and RFs gave the highest accuracy rates once again, with
no significant difference between them (p ≈ 0.003). However, there was a signif-
icant difference observed between the two previous algorithms mentioned and the
FNN algorithm (p < 0.01). From this study, it was found that SVMs and RFs gave
the highest performing rates when classifying wound-bed patterns. All three models
have a high efficacy score when identifying patterns from the peri-ulcer area.
A list of research on the subject matter and the datasets used is shown in Table 2.
It can be seen that while most studies gathered their own data sets from various
sources, all data sets but one, the one by [8], are private. Due to this, public wound data sets remain scarce.
3 Methodology
Many PIs are misclassified, even with the numerous tools that exist to properly clas-
sify them. The aim of this study is to show that if we make use of the current technolo-
gies, it will be possible to create a solution to help educate healthcare professionals
or students and offer a guide to classifying PIs.
The two public data sets found are [7] (Medetec) and [8] (Wang). The Medetec data
set contains 358 wound images, 183 images being pressure ulcers at the time of
collecting the images. The Wang data set contained 1010 training images and 1009
validation images of pressure ulcers. Some images from both data sets were of the
same patient only a few weeks apart, and due to this, not all images were used as they
were not significantly different. The images from both data sets were not augmented
‘Stage 01’ and unstageable PI images underwent smoothing. ‘Stage 02’ and ‘Stage
03’ were not smoothed since they could have been easily mistaken for each other as
the features that differentiate them were blurred out. ‘Stage 04’ was also ruled out
from undergoing the smoothing method as it could be mistaken for an unstageable
PI. Each image would be flipped both vertically and horizontally. For the healthy skin
data set each image underwent flipping and smoothing, increasing the total number
to 693 images. Since the data set was not balanced with the wound data set, it was
decided that it would undergo further augmentations, and this time noise (5%) and
saturation (between − 35% and + 35%) were applied.
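As an illustrative sketch only (the exact tooling and probabilities used in this study are not stated; the albumentations library and the values below are assumptions), such an augmentation pipeline could be expressed as:

# Hedged sketch: flip / smoothing / noise / saturation augmentation for wound images.
import albumentations as A
import cv2

augment = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.GaussianBlur(blur_limit=(3, 5), p=0.3),    # "smoothing" applied to selected classes
    A.GaussNoise(var_limit=(5.0, 20.0), p=0.3),  # roughly 5% noise
    A.HueSaturationValue(hue_shift_limit=0, sat_shift_limit=35,  # saturation +/- 35%
                         val_shift_limit=0, p=0.5),
])

image = cv2.imread("wound_example.jpg")          # placeholder file name
augmented = augment(image=image)["image"]
cv2.imwrite("wound_example_aug.jpg", augmented)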
In this study, the performance of two algorithms, YOLOv5 and YOLOv7, was eval-
uated on the accumulated data set. The YOLO algorithm was chosen due to its
low-resource requirements, and since the intended use is to be integrated within a
mobile app, we believe that this algorithm would be the ideal way forward. Results
indicated that YOLOv5 outperformed YOLOv7, which led to the decision to discard
the latter algorithm for the time being. However, it is worth mentioning that with more
time, potential solutions to improve the performance of YOLOv7 could be explored.
Listing 1 demonstrates the configurations used in the execution of YOLOv5.
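Listing 1 itself is not reproduced here; as a hedged illustration, a YOLOv5 training run with the standard ultralytics/yolov5 repository is typically launched as below, where the image size, batch size, epoch count, and file names are assumptions rather than the study's actual settings.

# Hedged sketch: launching a YOLOv5 training run from the ultralytics/yolov5 repository.
# The flags exist in that repository's train.py; the values are placeholders.
import subprocess

subprocess.run([
    "python", "train.py",
    "--img", "640",                    # input image size
    "--batch", "16",                   # batch size
    "--epochs", "100",                 # number of training epochs
    "--data", "pressure_injury.yaml",  # dataset config: image paths and class names (placeholder)
    "--weights", "yolov5s.pt",         # start from pretrained small YOLOv5 weights
    "--name", "pi_single_phase",       # run name for the results folder
], check=True)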
5 https://activeageing.gov.mt/st-vincent-de-paul-long-term-care-facility/?lang=en.
Experiment 1
An experiment was carried out using YOLO to attempt to answer the fourth research
question. In this, the PI data set was run once with all stages and then run another
time in four different phases. Phase 00 checked whether the images given contained
a wound or not. If not, the model stops there; however, if a wound is found, the model
moves on to the next stage. Phase 01 focused on identifying if the PI in the photo
was stageable or unstageable. From there, if the PI is stageable, it moves on to Phase
02 which will further classify the PI into either Stage 1/2, Stage 3, or Stage 4. For
this phase, stages 1 and 2 are grouped together as they are often mistaken for each
other. Moving on to Phase 03, the model will determine whether the PI in the photo
falls in Stage 1 or Stage 2. The single-phase classification focuses on the single-class
classification of PIs, where a picture is given to the model, and it determines if it is
classified as one of the stages or if it is unstageable. Since during this experiment the
model was divided into various phases, each phase had its own data set.
Experiment 2
The second experiment addresses the third and fifth research questions. The model
architecture chosen from the previous experiment was repeated with the various
data sets used throughout this study. The breakdown of data sets trained during this
experiment was as follows: the Medetec data set on its own, the Wang data set on its
own, the Medetec data set alongside the novel data set, the Wang data set combined
with the novel data set. This was done so that there could be a comparison for which
combination of data sets performed best for specific phases, while also trying to
determine if one data set contributes more than another.
Experiment 3
For the third experiment, a testing data set was created, consisting of an equal number
of instances per data set, per stage, where possible. Whenever this was not possible,
due to a lack of images, it was replaced by images from the other data sets. The
aim was to have two images, per stage per data set, and this was to eliminate the
possibility of a data set bias. From each training run carried out during the second
experiment, the best model was saved so that it could be tested on the testing data
set accumulated during this stage.
Professional Evaluation
Semi-structured interviews were conducted with nurses. The questions were open-
ended so that the participants could elaborate and give as much information as possible [16]. This helped in obtaining information about whether such a system could prove beneficial during a nurse's day-to-day work, which key features would be needed for the system to be functional, and any recommendations on how the system could progress further in the future.
4 Results
4.1 Experiment 1
The first experiment was carried out to compare a single-stage architecture with a
multi-stage architecture to see which approach is better for wound detection. Table 3
outlines the result. For the multi-phase architecture, weighted results are provided
for a fair comparison.
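How the weighting is performed is not detailed in the text; a simple sketch, assuming each phase is weighted by its number of test images (placeholder values only), would be:

# Hedged sketch: weighted average of per-phase metrics (placeholder values only).
def weighted_average(metric_by_phase, support_by_phase):
    total = sum(support_by_phase.values())
    return sum(metric_by_phase[p] * support_by_phase[p] for p in metric_by_phase) / total

precision_by_phase = {"phase00": 0.95, "phase01": 0.90, "phase02": 0.85, "phase03": 0.80}
images_by_phase = {"phase00": 120, "phase01": 90, "phase02": 60, "phase03": 30}
print(round(weighted_average(precision_by_phase, images_by_phase), 3))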
Each metric had a significantly higher result in the single-stage architecture when
compared to the average of the multi-stage architecture. After evaluating the results,
the single-phase architecture was deemed the best option when it comes to detecting
wounds. However, to further solidify this result, the multi-stage architecture was also
utilised in the second experiment.
4.2 Experiment 2
Multi-phase Architecture
The weighted average for each stage was calculated for the multiple data sets to be
able to properly compare the results. For most of the data set combinations which
did not include all three data sets at once (Medetec, Medetec + Novel, Wang, Wang
+ Novel), the ‘Unstageable’ class performed best in the multi-stage architecture; this is because it is the class with the fewest models to pass through before reaching the final classification. The results have not been included here for brevity, especially
since the single-phase architecture has performed better.
Single-Phase Architecture
The results for each stage are not included here for brevity but are summarised in the following. As with the multi-stage architecture, the overall results for the Medetec data set are lowered significantly because the class ‘Stage 02’ contains no images.
Even though ‘Stage 03’ had the lowest metrics among all stages in the Medetec data set, it still performed well with a PPV of 0.899, TPR of 0.743, ACC of 0.842, and mAP@0.5 of 0.926. These results suggest that the model was able to accurately classify and localise wounds, despite the data set being quite small in size. In regard to mAP@0.5, the model performed well across all stages, with scores ranging from 0.926 to 0.988, indicating that the model is able to localise and detect accurately.
When looking at the overall metrics, the single-phase architecture with the combination of Medetec and Novel data sets performed better than the one with only the Medetec data set in terms of PPV, TPR, ACC, and mAP@0.5. This is due to the class ‘Stage 02’ being added to the data set. When looking at the PPV of each class, apart from ‘Stage 02’ and ‘Unstageable’, each class scores higher with the Medetec data set on its own.
In terms of PPV and TPR, the Wang data set outperformed the Medetec data set, both with and without the Novel addition. The best PPV of 0.955 and TPR of 0.923 were both achieved in ‘Stage 01’, while the worst PPV of 0.726 and TPR of 0.621 were both in ‘Stage 02’. On the other hand, the best ACC of 0.982 was achieved in ‘Stage 01’, while the worst ACC of 0.853 was in ‘Unstageable’. Finally, the best mAP@0.5 of 0.868 was achieved in ‘Unstageable’, while the worst mAP@0.5 of 0.667 was in ‘Stage 02’. These metrics show that the model struggles most to detect the class ‘Stage 02’, which could be due to the class being under-represented or not containing many distinguishing features.
When incorporating the Novel data set with the Wang data set, the model had a significantly lower PPV and TPR (differences of 0.022 and 0.024, respectively), an insignificantly lower ACC (0.002), and an insignificantly higher mAP@0.5 (0.004). Adding
images from a different source might have thrown off the model training since most
of the images used were from the Wang data set.
The Medetec model, which was the lowest performer among the single-phase
architecture models, still outperformed all the models in the multi-stage architec-
ture. With both architectures, the addition of the Novel data set resulted in lower
performance; however, when all three data sets were combined, the model outperformed the rest. This could indicate that two data sets together are not diverse enough to work well, whereas three data sets are.
4.3 Experiment 3
A testing data set was produced, which consisted of 36 images, 6 for each class. For
each data set, two images were chosen per class for the testing data set; however, when
there was a class not represented in a data set, the images were taken from the other
remaining data sets instead. Another six were added from the healthy skin data set so
that all were equally represented for the multi-stage architecture. The images were
evaluated based on each individual box that was correctly or incorrectly identified
or completely missed.
The single-stage architecture trained with all data sets performed the best. For the classes ‘Stage 02’, ‘Stage 03’, and ‘Stage 04’, all wounds were detected, with none missed. ‘Stage 01’ performed the worst with 0.833, further supporting that the ‘Stage 01’ class could benefit from a higher number of images, as this stage is less visually noticeable. Overall, the model achieved a correct-detection rate of 0.947.
When comparing the results presented in this study with the studies presented earlier, a direct comparison could be made with the studies in the ‘Wound localisation’ and ‘Classifying wounds’ sections. These studies were chosen because they have similar classes and end goals to this study.
The ‘Wound localisation’ section contained only one study [3], and Phase 00 was chosen for the comparison. The author of [3] compiled their own data set, supplementing it with images from the Medetec data set. Their best model achieved a PPV of 0.925, a TPR of 0.905, and a mAP of 0.939. When compared with this study (0.952, 0.913, and 0.941, respectively), the model presented in Phase 00 performed slightly better.
Moving on to papers in the ‘Classifying wounds’ section, three of the papers were able to provide a direct comparison with the models presented in this study. The first paper was that by [4], who divided their classes into ‘necrotic’, ‘slough’, ‘healing’, and ‘granulation’, with a data set obtained from various clinicians. Three of the four classes were used for the comparison: the ‘necrotic’ and ‘slough’ classes were compared with the ‘Unstageable’ class, and the ‘granulation’ class was compared with the ‘Stage 03’ class, as any wounds containing those features would belong to the corresponding class. The wound data set results of [4] showed an ACC of 0.878, which was higher than the multi-stage architecture by 0.404 but lower than the single-stage architecture by 0.056.
Yilmaz et al. [12] divided their classes into ‘Stage 1’, ‘DTPI’, ‘Stage 2’, ‘Stage 3’,
‘Stage 4’, and ‘Unstageable’. ‘DTPI’, which stands for Deep Tissue Pressure Injury, refers to skin areas whose colour is persistently dark red, purple, or maroon and does not turn lighter when pressed. These areas may occur on skin that is unbroken or damaged, and they may appear as blisters filled with blood. The cause of these skin changes is damage to the soft tissues beneath the skin’s surface. DTPI are not usually classified with other pressure injuries; when they are, they are commonly put into ‘Stage 03’, ‘Stage 04’, or ‘Unstageable’ due to a lack of information.6 Due to
this, the overall accuracy of the results presented by Yilmaz et al. [12] was directly
compared to the results from this study. Yilmaz et al. [12] used Medetec to accumulate
their data set. The best model from the study conducted by [12] achieved an ACC
of 0.759, a PPV of 0.732, and a TPR of 0.759. When compared to the multi-phase
architecture, the model by Yilmaz et al. performed significantly better across all
metrics. However, when compared to the single-phase architecture, it performed
significantly worse, with a difference of 0.175 for ACC, 0.143 for PPV, and 0.183
for TPR.
Lau et al. [14] used various public data sets to accumulate their own. They divided their study into the classes ‘Stage 1’, ‘Stage 2’, ‘Stage 3’, ‘Stage 4’, and ‘Unstageable’, so a direct comparison could be made. The study by Lau et al. [14] achieved
an ACC of 0.632, which was found to be significantly lower than the single-stage
architecture by a margin of 0.302. Furthermore, the multi-stage architecture reported
a lower ACC than [14]’s study, with a difference of 0.158. Each paper in the ‘Classi-
fying Wounds’ section supports the notion that a single-stage architecture is superior
to a multi-stage architecture.
Five participants were chosen for the interview process of this study. Each participant was shown a video of how the prototype works and was then told that the final result would be more polished as an actual application. Four of the participants were
qualified nurses, while one was a student nurse. The qualified nurses had experience
ranging from 2 to 39 years.
In regard to experience with PI detection, the participants ranged from having no experience to having attended some lectures. Overall, no participant had extensive training in handling PIs. While nurses are not always the ones treating PIs, they see and document them day to day, so a good understanding is important.
The next question was about the current version of the PI scale used in their
place of work or placement. The student was not aware of any consistent scale used;
however, the nurses mentioned the Waterlow Scale, the Braden Scale, or a shortened
version of one of them. None of the nurses interviewed throughout this study found
these scales helpful as they did not allow for a personalised analysis and were seen
to be biased.
Participants were asked to rate how easy the prototype would be to use. One
participant stated that the software was easy to use (1 out of 10), while another
participant stated that it was relatively difficult (8 on the scale). The participant who found the software difficult to use attributed this to a feeling of technological illiteracy. The other three participants all stated that the software seemed relatively easy to use, as the interface was uncluttered and did not combine multiple functionalities in one view.

6 https://www.calmedmedical.com/description-of-pressure-ulcer-staging.
Four out of the five participants stated that they would be interested in integrating
the software in their day-to-day work. The one participant who was not interested explained that they work in the emergency department, where PIs are not something they deal with.
4.6 Limitations
Initially this study was going to focus on two different skin wounds, PIs and burn
wounds. However, despite the completion of the required forms and persistent follow-ups, the acquisition of a primary burns data set was not possible, and the publicly available data sets were not substantial. Due to these limitations, the burn wound approach had to be abandoned.
Another limitation faced throughout this study was regarding the PI data set.
While a primary data set was gathered, it was not large enough to make our own
data set consisting of only primary images. This was primarily due to the fact that
only one hospital was willing to share its PI images, coupled with the relatively short
duration of the study, which resulted in a limited number of images being collected.
5 Conclusion
This rigorous study has focused on the use of computer vision and artificial intelli-
gence techniques for the assisted automated classification of pressure ulcer stages.
In order to pursue the aim of this research, a locally sourced novel data set has been compiled and used to complement the two most cited and commonly accepted public data sets on the subject matter. Furthermore, this research has considered different architectures to address the limitations of the data sets and the challenges of this research problem. The findings of this study were also corroborated and supported by student and professional nurses.
Depending on the size of the data set and possibly the type of wound, different
approaches might need to be taken; however, throughout this research, it was found
that it is possible to classify the severity of skin wounds using CV.
The defining features in the images of a PI data set were found to be very important and must remain clear and undiminished during any data augmentation or data transformation processes. For this reason, the augmentations considered in this study such as contrast enhancement, segmentation, and image shifting are not recommended, whereas augmentations such as flipping or rotation are appropriate.
When comparing the single-stage architecture to the multi-stage architecture with various data set configurations, it was found that two data sets together are not diverse enough; however, complementing both public data sets with the Novel data set improves diversity. This
observation is corroborated by the fact that in both the single-stage and multi-stage architectures, the addition of the Novel data set to a single public data set resulted in lower performance; however, when all three data sets were combined, the results outperformed the other models. This finding indicates that having diverse image sources is beneficial, as the model performs better while also being less prone to bias.
When comparing the proposed models with various other works that focused on classifying wounds, it was found that the multi-stage architecture performed worse than the current state of the art, whereas the single-stage architecture performed better. Moreover, when compared with papers that focused solely on identifying the location of a wound, for which only the first phase of the multi-stage architecture was considered, the wound detection of the multi-stage architecture yielded superior results.
When evaluating the results of the interviews conducted, the prototype was seen to be easy to use overall (with a mean rating of 4.2 out of 10, where 1 denotes easiest). Furthermore, four out of
the five participants were interested in the software being integrated into their daily
work as it would make PIs easier to understand and could lead to fewer human errors.
This study revealed several key insights. One is that the multi-stage architecture would be better suited to a smaller data set: throughout the experiments with the various data set combinations, the multi-stage architecture consistently performed better when the data set presented was smaller in size. Another takeaway is that, for the single-stage architecture, the number of images always directly correlated with how well a model performed, showing the importance of a large, balanced data set. Furthermore, with both architectures, the addition of the Novel data set to a single public data set resulted in lower-performing models, indicating that two data sets are not diverse enough to work well together, whereas three data sets are: when all three data sets were combined, the models outperformed the rest.
References
1. Chou R, Dana T, Bougatsos C, Blazina I, Starmer AJ, Reitel K, Buckley DI (2013) Pressure ulcer
risk assessment and prevention: a systematic comparative effectiveness review. Ann Internal
Med 159(1):28–38. https://doi.org/10.7326/0003-4819-159-1-201307020-00006
2. Khan W, Hussain A, Khan BM, Crockett KA (2023) Outdoor mobility aid for people with visual
impairment: Obstacle detection and responsive framework for the scene perception during the
outdoor mobility of people with visual impairment. Expert Syst Appl 228:120464. https://doi.
org/10.1016/J.ESWA.2023.120464
3. Patel (2020) Deep learning-based object detection in wound images. Master’s thesis, University
of Wisconsin Milwaukee
4. Veredas FJ, Luque-Baena RM, Martín-Santos FJ, Morilla-Herrera JC, Morente L (2015) Wound image evaluation with machine learning. Neurocomputing. https://doi.org/10.1016/j.neucom.2014.12.091
5. Hunter IA, Edwards KJ (2017) Managing pressure sores. Surg Infect (Larchmt) 35(9):505–510.
https://doi.org/10.1016/j.mpsur.2017.06.008
6. Barnard JA, Copson DL (2016) Increasing the accuracy of pressure ulcer classification using
a pressure ulcer guidance tool. Wounds UK 12(4)
Abstract The palm oil industry in Malaysia is encountering challenges that need
innovative and technology-driven solutions. This study focuses on palm tree detection
where 2D LiDAR sensors are utilized to collect data like distance and reflection
strength. Through analysis, the gathered data are compared with an array of trend
lines to ascertain the optimal data relationship. Among the equations considered,
including linear, logarithmic, polynomial, and power equations, the power equation emerges as the best fit for the detection algorithm. The chosen equation is integrated into the ESP32 firmware. The algorithm's efficacy is evaluated through its accuracy in identifying palm trees. The algorithm exhibits an accuracy rate of 98%, attesting to its proficiency in discerning palm trees within plantations.
1 Introduction
Over the decades, the palm oil sector has contributed to Malaysia's economy. The sector has undergone substantial expansion since the early 1970s, leading to the conversion of approximately 5.8 million hectares of land into palm oil plantations, the employment of over 650,000 individuals, and a yield of 19.5 million tons of palm oil in 2018 [1]. However, recent years have
seen the emergence of significant challenges. One of the challenges was the ban on
Malaysian palm oil imports in Europe due to hazardous glycidyl fatty acid esters
(GE) detected in Malaysian fruit, resulting from pesticide use on plantations [2].
This ban caused substantial financial losses. Furthermore, heavy reliance on manual
labor, amplified by the COVID-19 pandemic, poses challenges. The shortage of
2 Literature Review
georeferencing of LiDAR data are achieved through the GPS device’s accurate sensor
location information.
Usually, the LiDAR laser emits a light pulse, a swift burst of light lasting nanosec-
onds, traveling just centimeters before reflection [5]. This time taken for the pulse’s
round trip, known as “time-of-flight,” is measured by the LiDAR system. By multi-
plying the time-of-flight by the speed of light, the system accurately calculates the
target distance. Illumination of the target with laser light enables measurement of
the reflected light’s wavelength and arrival time, providing data to deduce a digital
representation of the target’s distance.
Additional properties of the reflected light, including intensity, wavelength,
surface characteristics, color, and chemical composition, can be measured by LiDAR
systems. By integrating these measurements with distance data, LiDAR generates
highly precise three-dimensional environmental representations. Equation (1) is used
to calculate the distance [6], where D is distance, c is the speed of light, and T is the
time required by the light to travel. Measurement accuracy in LiDAR is influenced by factors such as laser pulse rate, beam divergence, and receiver sensitivity; adjustments such as increasing the pulse rate and minimizing beam divergence enhance distance-measurement precision, while amplifying receiver sensitivity refines intensity measurements.
$$D = \frac{cT}{2} \qquad (1)$$
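As a brief worked example (with an invented pulse time, not a value from the paper), Eq. (1) can be evaluated directly:

```python
# Illustrative time-of-flight distance calculation (Eq. 1); the round-trip time is a made-up value.
C = 299_792_458  # speed of light in m/s

def tof_distance(t_seconds: float) -> float:
    # Divide by 2 because the pulse travels to the target and back.
    return C * t_seconds / 2

print(tof_distance(33.4e-9))  # roughly 5.0 m for a 33.4 ns round trip
```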
A study applying median filter preprocessing to the classification of oil palm trees with convolutional neural networks on LiDAR images concluded that its impact on classification effectiveness is limited because salt-and-pepper noise levels are low, with results slightly lower than in previous research without median filter preprocessing [7]. LiDAR technology has also been implemented to categorize the ripeness of oil palm fresh fruit bunches (FFB) using sensor data and to create a distribution map in the context of Malaysia's palm oil industry [8].
Initially, data collection from various oil palm fields is imperative, involving field
preparation, data gathering, and data accuracy verification, as these data serve as
the foundation for developing a precise algorithm to estimate oil palm tree expan-
sion and production. Subsequently, data analysis advances to plotting graphs, visu-
ally depicting the relationships between variables. After data plotting, a trendline
is determined through visual validation, assessing the line that best fits the data’s
overall trend. This trendline is further validated using mathematical graph equa-
tions, including linear, logarithmic, polynomial, and power models, to ensure reli-
able and confident decision-making support. Once the optimal algorithm is selected,
the next phase involves firmware development for an ESP32 microcontroller, inte-
grating the chosen algorithm and a communication program for seamless 2D LiDAR
sensor connectivity. The firmware design includes an indicator light to provide real-
time feedback during palm tree detection, enhancing usability. Well-documented and
organized firmware construction ensures easy maintenance and future updates. Thor-
ough testing confirms compliance with original requirements, resulting in a high-
quality, efficient palm tree detection solution. To interface sensors with the ESP32
and manage physical components, an electronic circuit will be meticulously crafted,
emphasizing reliability and data processing capability. Real-world deployment in an
oil palm setting evaluates the algorithm’s accuracy and circuit performance, guiding
potential applications in the agricultural industry and shaping the project’s future
direction.
To comprehensively collect data from the LiDAR sensor, a data sampling process is
required, utilizing a serial port terminal application for data viewing and recording.
This terminal tool must accommodate specific communication settings tailored to
the LiDAR sensor, including baud rate, bit length, and stop bit type. After evaluating
various options, CoolTerm version 2 serial port terminal has been chosen for its
versatility and reliability, being a highly capable freeware supporting a wide range
of serial port configurations. Moreover, CoolTerm's recording features are crucial for the efficient and accurate accumulation of palm tree data, a vital aspect of this project's goals.
To construct the algorithm, it is essential to retrieve at least two parameters from
the LiDAR sensor, facilitating the establishment of data relationships. Unlike most
LiDAR sensors on the market, which typically provide a single output, measuring the
distance from the sensor to the object, the Benewake TFMini Plus—Micro-LiDAR
offers a distinctive capability. This LiDAR sensor furnishes two types of readings: the
object’s distance in centimeters and the object’s surface reflection value, presented
as a generic value without specific measurement units. While its primary application
is distance measurement, it offers two correlated values. The TFMini Plus, func-
tioning on the Time of Flight (ToF) principle, can measure distances as close as
10 cm (±5 cm up to 6 m) and as far as 12 m (±1% starting at 6 m). Its effectiveness
in detecting distances varies depending on lighting conditions and target reflectivity,
yet its compact dimensions, measuring only 35 × 18.5 × 21 mm, enable its inte-
gration into applications typically reserved for smaller sensors. Distinguishing itself
from the original TFMini, the TFMini Plus boasts an IP65 enclosure, ensuring dust
tightness and water resistance, and has passed drone-level vibration tests, broadening
its range of applications. Powered by a simple 5 V source, it communicates data via
the Universal Asynchronous Receiver/Transmitter (UART) serial protocol, enabling
direct reading through a USB-to-Serial Converter. Table 1 shows the configuration
of TFMini Plus LiDAR.
Table 1 Configuration of TFMini Plus LiDAR

Default baud rate: 115,200
Data bits: 8
Stop bit: 1
Parity: None
Table 2 shows the data code for the TFMini Plus—LiDAR. The manufacturer has
structured the sensor’s output data into a 9-byte data frame, including frame head,
distance value (Dist_L and Dist_H), signal strength (Strength_L and Strength_H),
temperature (Temp_L and Temp_H), and checksum, all transmitted in hexadecimal
format. Specifically, bytes 2 and 3 carry the distance value to the object, allowing
for centimeter measurement, while bytes 4 and 5 convey the reflection strength,
facilitating reflection strength determination.
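A minimal sketch of reading this 9-byte frame over UART is shown below; pyserial is used for illustration, the serial port name is a placeholder, and the baud rate follows Table 1. The checksum check assumes the standard TFMini Plus convention (low byte of the sum of the first eight bytes).

```python
# Hypothetical reader for the TFMini Plus 9-byte data frame described in Table 2.
import serial

FRAME_HEAD = 0x59  # each frame starts with two 0x59 bytes

def read_frame(port: serial.Serial):
    while True:
        # Synchronise on the two-byte frame head.
        if port.read(1) == b"\x59" and port.read(1) == b"\x59":
            body = port.read(7)
            frame = bytes([FRAME_HEAD, FRAME_HEAD]) + body
            # Checksum: low byte of the sum of the first 8 bytes must equal byte 9.
            if sum(frame[:8]) & 0xFF != frame[8]:
                continue
            dist_cm = frame[2] | (frame[3] << 8)    # Dist_L, Dist_H
            strength = frame[4] | (frame[5] << 8)   # Strength_L, Strength_H
            return dist_cm, strength

with serial.Serial("/dev/ttyUSB0", 115200, timeout=1) as lidar:  # placeholder port
    print(read_frame(lidar))
```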
An extensive survey of the plantation areas is conducted to ensure that the data
collected accurately represent the full spectrum of palm tree ages and sizes present.
It is noteworthy that palm trees have a productive lifespan ranging from 3 to 25 years,
during which they yield the most. As they age and grow in size, the accuracy of the sensor readings can be affected. Therefore, accounting for the diverse ages and sizes of palm
trees in the plantations is crucial to gathering reliable and relevant data for the study.
To ensure an accurate reflection of the entire range of palm tree ages and sizes, the
research will employ a systematic data collection process, measuring trees of varying
ages and sizes. This meticulous approach enables a comprehensive examination of
palm tree plantation characteristics. Specifically, data sampling will focus on palm
trees aged between 5 and 10 years, a critical period marked by significant changes
in trunk diameter, as diameter alterations impact sensor readings due to variations
in surface area and reflected infrared light. The sensor will be positioned at a height
of 1.2 m above the ground to ensure sampling from the trunk rather than the base,
where the presence of roots results in a greater diameter.
i. The data sample will be grouped by the age of the palm tree.
ii. Each age group will comprise five data sets.
iii. Each data set will cover ten fixed distances.
iv. The data will then be stored in the format shown in Table 3.
v. Samples will be taken for palm trees at the age of 5. The distance from the sensor to the palm tree will be fixed and the strength value will be measured.
vi. This procedure will be repeated for palm trees at multiple ages until 25 data sets are collected.
After successfully collecting the data, the focus shifts to establishing the correla-
tion between distance and reflection strength in order to identify a suitable mathemat-
ical equation that aligns with the data trend. This involves loading the collected data
into Microsoft Excel for further analysis, where all datasets are grouped by distance,
and the average reflection strength is computed for each. This process is repeated for
all distances across all datasets. Following data preprocessing, a chart is generated
to visualize the trend and relationship, and a trend line is drawn in the chart. Four
types of mathematical equations are then tested to assess their compatibility with
the data trend. Microsoft Excel’s chart tools are employed to generate formulas for
each equation type. Multiple equations like linear equations, logarithmic equations,
polynomial equations, and power equations are generated and individually compared
with the actual collected data. When distance values are input into the formula, they
should produce reflection strength values closest to the actual averaged data obtained
from the process. Error percentages are calculated for each mathematical equation,
with the equation demonstrating the lowest error percentage being chosen as the final
formula for use in real-world testing. Table 4 shows the virtual distance generated
by different equations.
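A minimal sketch of this comparison is given below (the paper's analysis was carried out in Microsoft Excel); the averaged readings are invented for illustration, only the power model is fitted here, and the use of scipy/numpy is an assumption rather than the authors' tooling.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative averaged readings: actual distance (cm) vs. mean reflection strength.
distance = np.array([50, 100, 150, 200, 250, 300, 350], dtype=float)
strength = np.array([5200, 2600, 1750, 1320, 1060, 890, 760], dtype=float)

def power_model(x, a, b):
    # Power equation y = a * x^b, with y the virtual distance and x the strength (cf. Eq. 2).
    return a * np.power(x, b)

# Fit the power model so that strength predicts distance.
params, _ = curve_fit(power_model, strength, distance, p0=(1e5, -1.0))
virtual = power_model(strength, *params)

# Efficiency per point = virtual / actual distance; an ideal fit sums to 7 for 7 points.
efficiency = virtual / distance
print("fitted a, b:", params)
print("sum of efficiencies:", efficiency.sum())
```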
To assess the equation’s performance, a comparison is made between the virtual
distance and the actual distance, and the efficiency is measured by dividing the virtual
value by the actual value. These efficiency scores are then totaled for all seven data
points under evaluation, and the equation’s overall efficiency is determined. Ideally,
the sum of efficiencies should closely approach a total of seven, given the seven
data points being tested. The equation demonstrating the sum of efficiencies closest
to seven will be selected for incorporation into the algorithm. Table 5 shows the
efficiency of each equation. Based on Table 5, the power equation achieved better
efficiency.
After the comparison, the power equation given as Eq. (2) was chosen, where y is the virtual distance and x is the actual strength. The power equation is then coded into the ESP32 firmware. For algorithm testing, the electronic circuit is equipped
with two LiDAR sensors, enabling the system to detect palm trees on both the left and
right sides of the pathway. The firmware writing process involves implementing the
Real-Time Operating System (RTOS) capability of the ESP32, which enhances code
efficiency and overall system operation. The firmware is structured into several tasks:
obtaining LiDAR sensor data and performing detection logic, acquiring data from
the GPS module, and recording detection activity into internal memory, including
location coordinates. Before executing the detection algorithm, the firmware must
capture raw data from both LiDAR sensors. This is achieved through dedicated tasks
in the RTOS, with two independent tasks looping every 100 ms to parse distance and
strength values to perform the detection. The detection function validates readings,
calculates estimated distance using a specific equation, establishes a margin value
for flexibility in detection logic, and compares actual distance readings to the esti-
mated distance within a certain range to produce a “true” output indicating palm tree
detection. Furthermore, the firmware retrieves location coordinates when a palm tree
is detected, records UTC time and stores data in its internal memory, aiding in result
measurement.
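The detection step can be sketched language-agnostically as follows (shown in Python for readability, whereas the actual firmware runs as C/C++ RTOS tasks on the ESP32); the power-equation coefficients and the margin value are placeholder assumptions, since Eq. (2) itself is not reproduced in this excerpt.

```python
# Placeholder power-equation coefficients (Eq. 2) and detection margin; illustrative only.
A, B = 1.0e5, -0.9
MARGIN_CM = 15

def estimated_distance(strength: int) -> float:
    # Virtual distance predicted from the reflection strength via the power equation.
    return A * (strength ** B)

def palm_tree_detected(actual_cm: int, strength: int) -> bool:
    if actual_cm <= 0 or strength <= 0:
        return False  # reject invalid readings
    virtual_cm = estimated_distance(strength)
    # A reading counts as a palm tree when actual and virtual distances agree within the margin.
    return abs(actual_cm - virtual_cm) <= MARGIN_CM
```

In the firmware this check runs inside the 100 ms LiDAR tasks, and a "true" result triggers the indicator light and the GPS/UTC logging described above.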
Fig. 2 Installation of an electronic system and LiDAR sensor on the spraying machine
5 Conclusion
The palm tree detection algorithm successfully detects the palm tree and eases the
spraying process. However, there is still room for improvement. To enhance the
algorithm’s accuracy, an effective strategy is to expand the volume of sample data
used during its development. Achieving this involves capturing more data across
diverse environmental settings and locations. Furthermore, the algorithm’s precision
Acknowledgements The authors thank UNITAR International University for the publication of
this research.
References
1. Malaysian Palm Oil Council (2019) Monthly palm oil trade statistics. Retrieved from https://
www.mpoc.org.my/Monthly-Palm-Oil-Trade-Statistics-Jan-Dec-2018
2. New Straits Times (2019) EU decision on palm oil not surprising, says Malaysia.
Retrieved from https://www.nst.com.my/news/nation/2019/03/471097/eu-decisionpalm-oil-
not-surprising-says-malaysia
3. Ho J, Phang SK, Mun H (2021) 2-D UAV navigation solution with LIDAR sensor under GPS-
denied environment. J Phys: Conf Ser 2120:012026. https://doi.org/10.1088/1742-6596/2120/
1/012026
4. Zhang H, Zhang K, Chen C (2020) 2D lidar-based indoor mapping with SLAM. Sensors
20(15):4259. https://doi.org/10.3390/s20154259
5. Setyawan A, Taftazani I, Bahri S, Noviana E, Faridatunnisa M (2022) Drone LiDAR application
for 3D city model. J Appl Geospat Inform 6:572–576. https://doi.org/10.30871/jagi.v6i1.3462
6. Kim G, Ashraf I, Eom J, Park Y (2021) Concurrent firing light detection and ranging system for
autonomous vehicles. Rem Sens 13(9):1767. https://doi.org/10.3390/rs13091767
7. Aliandra SR, Prasvita DS (2022) Application of median filter method for classification of oil
palm tree on LiDAR images. In: 2022 international conference on informatics, multimedia,
cyber and information system (ICIMCIS), Jakarta, Indonesia, pp 441–444. https://doi.org/10.
1109/ICIMCIS56303.2022.10017880
8. Husin HS, Amar N, Bakar Sajak AA, Sallehin Mohd Kassim M (2021) Distribution map of
oil palm fresh fruit bunch using LiDAR. In: 2021 12th international conference on information
and communication systems (ICICS), Valencia, Spain, 2021, pp 4–9. https://doi.org/10.1109/
ICICS52457.2021.9464575
Enhancing Security Surveillance
Through Business Intelligence
with NVIDIA DeepStream
1 Introduction
Security incidents at commercial and public premises such as theft, break-ins, and
vandalism are unfortunate but common occurrences in today’s society. Traditional
security systems, while effective to some extent, often fail to provide user-defined
insightful analytics and real-time security alerts, leaving commercial and public
2 Related Work
A heuristic is used to determine pipeline cut points, and the methodology is evalu-
ated with six object detection networks on an NVIDIA Jetson AGX Xavier board.
Results show significant throughput improvements of 81% to 91% over the base-
line GPU-only inference. The proposed approach shows promise for accelerating
object detection, and future work can explore scalability to even larger networks and datasets.
The survey [4] reviewed transfer learning techniques for visual categorization
tasks and highlighted the importance of knowledge transfer. Methods for transferring
knowledge were discussed at both the feature representation and classifier levels. The
study emphasized the need to avoid negative knowledge transfer and deal with the
many-to-one adaptation problem. The challenge for transfer learning in the future is
to mine helpful information from noisy source domain data and extend methods to
deal with large-scale data.
The study [5] focuses on improving the performance of ResNet and DenseNet
models by adjusting the training pipeline and applying a weighted sum of individual
losses for regression models. Additionally, it suggests further training for VGG19
models with early stopping until convergence before moving to the next training
stage. These adjustments aim to overcome abnormal validation losses and improve
overall model performance.
A novel framework [6] integrating human hand detection and pose estimation is
proposed based on deeply learned networks. The shared convolutional layers effi-
ciently detect human hands and compute their pose configuration. An energy function
is proposed to optimize the predicted result. Experimental results show high relia-
bility for real-world applications. Future work includes improving estimator network
accuracy and enhancing input data quality while maintaining efficient computation.
A discriminative single-shot segmentation tracker, D3S [7], narrows the gap between visual object tracking and video object segmentation. A single-shot network
applies two target models with complementary geometric properties, one invariant
to a broad range of transformations, including non-rigid deformations, the other
assuming a rigid object to achieve both strong resilience and real-time target segmen-
tation simultaneously. Without per-dataset fine-tuning and trained only for segmen-
tation as the primary output, D3S outperforms all trackers on VOT2016, VOT2018,
and GOT-10k benchmarks and performs close to the state-of-the-art trackers on the
TrackingNet.
Another class of tracker, called Siamese trackers [8], has evolved in the direction
of generative templates. Siamese trackers apply a backbone pre-trained offline with
general targets such that object-background discrimination is maximized by corre-
lation between the search region and target template extracted in the first frame [8].
Segmentation of moving objects is a central problem in the emerging field of video
object segmentation (VOS) [9, 10].
Recent studies [11–15] have demonstrated impressive outcomes, yet they rely
on extensive deep networks, often necessitating fine-tuning and resulting in slow
processing. Simultaneously, Hu et al. [16] and Chen et al. [17] introduced a method
for segmentation through feature matching in the initial frame, significantly reducing
processing time. However, these approaches are less effective in addressing the visual object tracking task, especially when dealing with small, rapidly moving objects that exhibit limited appearance changes in short videos. Consequently, the methods proposed by Hu et al. [16] and Chen et al. [17] are less proficient in visual object tracking involving such objects. This paper [8] aims to bridge the gap between
visual object tracking and video object segmentation.
3 Proposed Architecture
and thresholds. The plugin has been modified to send data to the analytics server by
publishing it to an Apache Kafka topic for further processing.
To display the results, the pipeline incorporates the NvMultiStreamTiler plugin to
lay out the camera streams in a grid arrangement, NvVideoConvert plugins convert
the video from the proprietary NV12 format to a suitable standard format for display,
and the NvDsOsd plugin for rendering the video on a display.
The video output is also available for streaming over the Real-Time Streaming Protocol, so it can be accessed remotely. This functionality is extended from the NvDsOsd plugin by converting the video feed to NV12 format and, after applying a suitable caps filter, encoding it with the H264 or H265 NvV4L2 encoder. The encoded video is then converted to an RTP payload and sent to a UDP sink for streaming.
4 Proposed System
URSA works in six stages to provide user-defined real-time insightful analytics and
security alerts. They are as follows:
Stage 1: Camera Setup and Configuration.
The first stage involves the user adding the surveillance cameras using their RTSP
URLs to URSA. Other information, such as the name and location of cameras, can
also be specified. Additionally, the analytics settings for each camera are configured.
Some analysis types allow users to define boundary lines and areas of interest for
analysis. The user can also choose the type of analysis they expect from the camera
feed, e.g., overcrowding detection.
Stage 2: Building and Initializing DeepStream Pipeline.
The DeepStream pipeline, as depicted in Fig. 2 in Chap. 3, is initialized with config-
uration files for its components, such as primary and secondary inferences, tracker,
and analytics. Key configurations include models such as PeopleNet, Spatial–Temporal Graph Convolutional Network (ST-GCN), and FaceNet for person detection, body posture detection, and criminal face detection, respectively. PeopleNet, a combination of ResNet34 (for feature extraction) and DetectNet V2 (for inference), is the primary inference model. The tracker parameters include the minimum object confidence and the tracking algorithm (NVIDIA® NvDCF, which uses a Discriminative Correlation Filter). User-defined analytics configuration is
loaded into the NvDsAnalytics plugin to extract relevant inference data. The pipeline
includes plugins to display the output video on a connected device and stream it using
the Real-Time Streaming Protocol (RTSP).
Stage 3: Loading Camera Sources.
Once the pipeline is configured and initialized, multiple camera feeds are connected
to its source. The DeepStream pipeline is configured with the details of enabled
cameras, including their names, locations, RTSP URLs, and other metadata that are
loaded from the database. Each camera gets an ID sequentially within the pipeline,
and the results are published to the sink against the camera IDs.
Stage 4: Inferring from the Live Camera Feeds.
Inference is performed on the input video signal by leveraging the deep learning
models discussed in Stage 2. PeopleNet, the primary inference model, detects people
in the video feed and generates a corresponding bounding box. The area within this
bounding box is then subjected to secondary inference using ST-GCN for body pose
detection and FaceNet models for facial recognition. Inference is made solely on the
data inside the bounding box, and in the absence of a matching class, the models
produce no outputs. The inference stage will output the bounding box data for all the
processed frames. This data needs to be refined and normalized to obtain analytics.
This task will be achieved in the forthcoming stage.
Stage 5: Analyzing the Raw Inference Data.
The vast raw inference data is filtered to highlight unique features for each detected
class. Consecutive duplicate data is irrelevant, and thus, it gets discarded at this stage.
The remaining data is analyzed to determine person counts, line crossings, and people
in specific areas. The analysis compares the overlaps and direction of movement of
the primary object’s bounding box with the user-defined lines and areas.
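As a rough illustration of these rules (not the authors' implementation), a person count inside a user-defined area and a simple line-crossing test on consecutive bounding-box centres could be written as follows; the region coordinates and alert threshold are placeholders.

```python
# Hypothetical analytics rules on bounding-box centres; coordinates and threshold are placeholders.
AREA = (400, 300, 700, 500)          # user-defined region: x1, y1, x2, y2
LINE = ((100, 400), (600, 400))      # user-defined counting line: two endpoints
OVERCROWD_THRESHOLD = 4

def centre(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def in_area(box):
    cx, cy = centre(box)
    x1, y1, x2, y2 = AREA
    return x1 <= cx <= x2 and y1 <= cy <= y2

def side_of_line(point):
    # Sign of the cross product tells which side of LINE the point lies on.
    (ax, ay), (bx, by) = LINE
    px, py = point
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def crossed_line(prev_box, curr_box):
    # A crossing occurs when the centre changes side between consecutive frames.
    return side_of_line(centre(prev_box)) * side_of_line(centre(curr_box)) < 0

def overcrowded(boxes):
    return sum(in_area(b) for b in boxes) > OVERCROWD_THRESHOLD
```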
Finally, the analysis results are published on multiple Kafka topics for further analysis on the analytics server. Kafka is a messaging service between systems, wherein the messages published on a particular "topic" can be accessed by the systems subscribed to it. With more computing power, the analytics server performs complex calculations and aggregations to generate interpretable insights and alerts, which are stored in a Cassandra database. This stage includes computing aggregations over a period of time, or even making future predictions.
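A minimal sketch of the analytics-server side of this flow is shown below, assuming the kafka-python and cassandra-driver packages; the topic, keyspace, table, and field names are placeholder assumptions.

```python
# Hypothetical consumer: read analytics events from a Kafka topic and persist them to Cassandra.
import json
from kafka import KafkaConsumer
from cassandra.cluster import Cluster

consumer = KafkaConsumer(
    "ursa-analytics",                      # placeholder topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

session = Cluster(["127.0.0.1"]).connect("ursa")   # placeholder keyspace
insert = session.prepare(
    "INSERT INTO events (camera_id, ts, event_type, count) VALUES (?, ?, ?, ?)"
)

for message in consumer:
    event = message.value
    session.execute(insert, (event["camera_id"], event["ts"],
                             event["event_type"], event["count"]))
```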
Stage 6: Creating Insights, Alerts and Reports.
After processing the raw data, insights and alerts can be generated in real time or near
real time from the results stored in the Cassandra database. The analytics dashboard
provides control over functions like preparing reports from aggregated data over
time and making future predictions as required by the end users. The results can be
displayed by means of meaningful and appropriate graphs and charts.
URSA incorporates three primary deep learning models, namely PeopleNet, ST-
GCN, and FaceNet. Each of these models was trained separately, on a separate dataset, using a technique known as transfer learning. This task of adapting the models to our use cases was done with the NVIDIA Tao Toolkit. We chose this method as it allows us to build highly performant and accurate models using only a fraction of the time and data needed to train a model from scratch. Transfer learning has proved to be an effective training method for many AI applications. The training was carried
out on a NVIDIA® GeForce® GTX 1650 Mobile GPU with 4 GB GDDR5 video
memory. The GPU has a CUDA compute capability of 7.5.
PeopleNet model, which is also the primary inference model that detects people
in a scene, has been trained on a dataset consisting of images collected by us
from different crowded locations such as railway stations, supermarkets, and educa-
tional institutions, as well as from Konstantin Verner's "Human Detection Dataset" on Kaggle.
We collected about 300 images ourselves and combined them with about 550 images from Konstantin's dataset. The images were then labeled individually using
LabelImg as shown in Fig. 3.
The original trainable pre-trained model has a general accuracy of 82.17%. We
trained this model for 25 epochs, and it resulted in an accuracy of 96.59% using FP16
calibration on our dataset.
Figure 4 shows the training progress of PeopleNet over the course of 25 epochs. The trained model was pruned to achieve optimum performance and
accuracy. The final model has an accuracy of about 95%.
Similar to how the dataset was created for PeopleNet, the dataset for FaceNet was
also created by annotating the faces of people using LabelImg and using their names
as the image label as shown in Fig. 5. For FaceNet, we created our dataset by using
about 40 images of the same person from different angles.
The original pre-trained model has a general accuracy of 83.85%. After training
the model for 15 epochs, the accuracy is 84.68% using INT8 calibration. After
pruning, the model is 83.92% accurate. Figure 6 shows the training progress of
FaceNet over the course of 15 epochs.
URSA monitors a crowd outside a busy railway station during peak morning hours
in real time. Here, the crowd movement to and from the platform and the staircase
leading up to the bus terminal is being tracked. The green lines in the scene measure
the number of people moving through these areas in both directions. To prevent
overcrowding, the orange box at the base of the staircase sends alerts when the number
of people in the area exceeds the set threshold (4 in this case). The yellow box tracks
the number of people in the region and their average dwell time, providing valuable
insights for crowd management. The system also monitors the total number of people
appearing in the camera frame. All the lines, boxes, and thresholds discussed above
are defined by the user as per the requirements as shown in Figs. 8, 9, 10, and 11.
Fig. 8 Real-time surveillance of the busy crowd outside the railway station
6 Conclusion
References
1. Gongguo Z, Junhao W (2021) An improved small target detection method based on Yolo V3.
IEEE
2. Badave H, Kuber M (2021) Evaluation of person recognition accuracy based on openpose
parameters. IEEE
3. Jin E, Wei Jangryul Kim J, Tan S, Lee J, Ha S (2021) Deep learning inference parallelization
on heterogeneous processors with TensorRT. IEEE
4. Shao L, Zhu F (2014) Transfer learning for visual categorization: a survey. IEEE
5. Hong Nga C, Nguyen K-T, Tran NC, Wang JC (2021) Transfer learning for gender and age
prediction. IEEE
6. Chen T-Y, Wu M-Y, Hsieh Y-H, Fu L-C (2017) Deep learning for integrated hand detection
and pose estimation. IEEE
7. Lukežič A, Matas J, Kristan M (2020) D3S—a discriminative single shot segmentation tracker.
IEEE
8. Bertinetto L, Valmadre J, Golodetz S, Miksik O, Torr PHS (2016) Staple: complementary
learners for real-time tracking. Comp Vis Patt Recogn
9. Perazzi F, Pont-Tuset J, McWilliams B, Van Gool L, Gross M, Sorkine-Hornung A (2016) A
benchmark dataset and evaluation methodology for video object segmentation. Comp Vis Patt
Recogn
10. Xu N, Yang L, Fan Y, Yue D, Liang Y, Yang J, Huang T (2018) YouTubeVOS: a large-scale
video object segmentation benchmark. arXiv:1809.03327
11. Voigtlaender P, Chai Y, Schroff F, Adam H, Leibe B, Chen L-C (2019) FEELVOS: Fast end-
to-end embedding learning for video object segmentation. Comp Vis Patt Recogn
12. Caelles S, Maninis K-K, Pont-Tuset J, Leal-Taixe L, Cremers D, Van Gool L (2017) Oneshot
video object segmentation. Comp Vis Patt Recogn
13. Voigtlaender P, Leibe B (2017) Online adaptation of convolutional neural networks for video
object segmentation. Proc Br Mach Vision Conferen
14. Cheng J, Tsai Y-H, Hung W-H, Wang S, Yang M-H (2018) Fast and accurate online video
object segmentation via tracking parts. Comp Vis Patt Recogn
15. Yang L, Wang Y, Xiong X, Yang J, Katsaggelos AKK (2018) Efficient video object
segmentation via network modulation. Comp Vis Patt Recogn
16. Hu Y-T, Huang J-B, Schwing AG (2018) VideoMatch: Matching based Video Object
Segmentation. Proc Euro Conf Comput Vision
17. Chen Y, Pont-Tuset J, Montes A, Van Gool L (2018) Blazingly fast video object segmentation
with pixelwise metric learning. Comp Vis Patt Recogn
Fuzzified Hybrid Metaheuristics
for QoS-Aware Service Composition
1 Introduction
The problem of service composition can be divided into three stages. The first stage is task decomposition, in which the submitted tasks are broken down into smaller subtasks, denoted by $T = \{T_1, T_2, T_3, \ldots, T_n\}$, according to an abstract workflow. The second stage, service discovery, finds services that satisfy user requirements by selecting the optimal set of services. From a technical perspective, service discovery is a matching endeavour to find a service set from the service repository that fulfils user requirements. Metaheuristics are applied at this stage to solve the optimization problem, based on semantic similarities between subtasks and resource descriptions. The third stage validates the optimal selection, ensuring the overall QoS is optimized. The aggregation model used to compute the overall service composition quality and the normalization formulas are given in the following:
$$\mathrm{Max(QoS)} = \mathrm{Max} \sum_{k=1}^{r} w_k \times \mathrm{Norm}(Q_k) \qquad (1)$$

$$\sum_{k=1}^{r} w_k = 1 \qquad (2)$$

$$\mathrm{Norm}(Q) =
\begin{cases}
\dfrac{Q(S) - Q_{\min}}{Q_{\max} - Q_{\min}} & \text{if } Q \text{ is positive and } Q_{\max} - Q_{\min} \neq 0,\\[6pt]
\dfrac{Q_{\max} - Q(S)}{Q_{\max} - Q_{\min}} & \text{if } Q \text{ is negative and } Q_{\max} - Q_{\min} \neq 0,\\[6pt]
1 & \text{otherwise.}
\end{cases} \qquad (3)$$
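A minimal sketch of Eqs. (1)–(3) is given below, using the attribute weights reported later in the experimental setup (w = [0.18, 0.17, 0.11, 0.39, 0.15] for throughput, success ability, reliability, compliance, and best practice); the candidate service values and min/max bounds are illustrative assumptions.

```python
def normalize(q, q_min, q_max, positive=True):
    # Eq. (3): scale a QoS value to [0, 1]; positive attributes are benefit-type,
    # negative attributes are cost-type.
    if q_max == q_min:
        return 1.0
    return (q - q_min) / (q_max - q_min) if positive else (q_max - q) / (q_max - q_min)

# Weights for throughput, success ability, reliability, compliance, best practice (sum to 1, Eq. 2).
weights = [0.18, 0.17, 0.11, 0.39, 0.15]
positive_flags = [True, True, True, False, False]

def composite_qos(values, mins, maxs):
    # Eq. (1): weighted sum of normalized attribute values for one candidate composition.
    return sum(w * normalize(v, lo, hi, pos)
               for w, v, lo, hi, pos in zip(weights, values, mins, maxs, positive_flags))

# Illustrative candidate with per-attribute min/max bounds over the service repository.
print(composite_qos([43.1, 0.95, 0.89, 0.78, 0.80],
                    [1.0, 0.5, 0.4, 0.6, 0.5],
                    [60.0, 1.0, 1.0, 1.0, 1.0]))
```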
2 Related Works
Table 1 (continued)

Reference | Approach | Fitness | Metaheuristics | Hybrid operator
Sadeghiram et al. (2018) [28] | Tree-based | SAW | GA | K-Means
Jatoth et al. (2018) [29] | Vector-based | SAW | GA | Guided mutation
Liu et al. (2018) [30] | Vector-based | Independent | NSGA-II | Differential optimization technique
Xu et al. (2018) [31] | Graph-based | SAW | PSO | Evolutionary operator
Alayed et al. (2019) [32] | Graph-based | SAW | ACO | Swap concept to escape the trap of local optima
Sadouki et al. (2019) [33] | Vector-based | Independent | EHO | Evolutionary operator of GA
Bouzary et al. (2019) [34] | Vector-based | SAW | GWO | Evolutionary operator of GA
Gao et al. (2020) [35] | Vector-based | TFNSAW | AIS | Clonal selection algorithm
Bhaskar et al. (2020) [36] | Vector-based | SAW | GWO | Levy flight
Yang et al. (2020) [37] | Vector-based | Independent | GWO | Backward learning strategy
Li et al. (2020) [38] | Vector-based | SAW | HHO | Ergodic operator
Zhang et al. (2021) [39] | Vector-based | SAW | ABC | Group initialization strategy
Wang, Ma et al. (2022) [40] | Vector-based | SAW | EDA | Local search procedure
The previous works indicate the lack of an adaptive search mechanism in which algorithm operators can readjust based on the fitness of the generated population in the course of the search process. Furthermore, a purely stochastic mechanism leads to premature or slow convergence, a fundamental shortcoming of metaheuristics. Slow convergence affects time complexity, while premature convergence limits the ability of the search to locate the global optimum. The current approach in service composition has overwhelmingly been the application of nature-inspired metaheuristics, and in a vast number of hybrid models, nature-inspired algorithm operators are employed to achieve better results. This paper aims to use fuzzy logic to overcome the problems associated with slow convergence and
A fuzzy set $Z$ [42] is a generalization of a crisp, or Boolean, set defined in a universe of discourse $X$. In this definition, $Z$ is a linguistic label that defines the fuzzy set through the word $Z$. The fuzzy set $Z$ is characterized by a membership function $\mu_Z(x)$, which provides a measure of the degree of similarity of an element $x$ from $X$ to the fuzzy set $Z$; it takes values in the interval $[0, 1]$, as defined in Eq. (4). Therefore, a generic variable $x_c$ can be represented using multiple fuzzy sets $\{Z_c^1, Z_c^2, \ldots, Z_c^m\}$, each one modelled by a membership function, as follows:

$$\{\mu_{Z_c^1}(x_c), \mu_{Z_c^2}(x_c), \ldots, \mu_{Z_c^m}(x_c)\}$$
consists of mathematical functions. Since the available knowledge for the design of
the fuzzy system conceived in our approach includes functional, local behaviours,
the Mamdani inference model has been used for system modelling [44].
The key idea is to achieve an adaptive search mechanism rather than conventional
metaheuristics where search operators are constant. The adaptive search mecha-
nism has shown to be an effective strategy that leads to fast convergence methods
immune to the trap of local optima. An existing study has benefited from probability theory, chaos theory, and statistical analysis in the process of population generation in order to produce high-quality solutions, particularly with respect to population-based metaheuristics, which are the current research paradigm for service composition problems.
The flowchart illustrated in Fig. 2 gives a detailed description of the proposed method. In this approach, the evolutionary algorithm operators, crossover and mutation, are tuned by predefined fuzzy linguistic rules with three levels (low, medium, and high) for iteration, fitness variance, best fitness, and mean fitness. In this setting, nine rules have been defined to generate high-quality fitness values, as detailed in Figs. 1 and 2.
The proposed solution is based on a guided search mechanism with a fuzzy system acting as an algorithm operator. The idea behind the proposed algorithm is an adaptive search process rather than traditional stochastic exploration of the search space with fixed operators. In the proposed approach, the population diversity of every iteration is evaluated to generate adaptive algorithm operators according to the fuzzy rule matrix in a fuzzy inference interface. The details are given in Algorithm 2.
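As a rough illustration of this adaptive mechanism (not the authors' Algorithm 2, and using a simplified weighted-rule defuzzification rather than a full Mamdani system), crossover and mutation rates could be adjusted from population statistics as follows; all membership breakpoints and rule outputs are illustrative assumptions.

```python
# Hypothetical fuzzy-adaptive GA operators: population diversity (fitness variance)
# drives crossover/mutation rates through simple triangular memberships and rules.
import statistics

def tri(x, a, b, c):
    # Triangular membership function with peak at b.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzy_rates(fitnesses):
    var = statistics.pvariance(fitnesses)
    low = tri(var, -0.1, 0.0, 0.05)
    med = tri(var, 0.0, 0.05, 0.1)
    high = tri(var, 0.05, 0.1, 0.2)
    # Rules: low diversity -> raise mutation to explore; high diversity -> raise crossover to exploit.
    total = max(low + med + high, 1e-9)
    mutation = (low * 0.30 + med * 0.10 + high * 0.02) / total
    crossover = (low * 0.60 + med * 0.80 + high * 0.95) / total
    return crossover, mutation

print(fuzzy_rates([0.61, 0.63, 0.62, 0.64, 0.62]))
```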
This section analyses the behaviour of the proposed algorithm under different settings. All experiments were executed on an Intel Xeon(R) CPU E3-1225 v5 @ 3.30 GHz with 64 GB of RAM under MATLAB® 9.6. We used the QWS VER2 data set, in which throughput, success ability, and reliability are positive attributes, while compliance and best practice are deemed negative attributes. The weights of the quality attributes are defined in the same order and denoted by w = [0.18, 0.17, 0.11, 0.39, 0.15].
In order to have a holistic empirical evaluation, the genetic algorithm was selected from the evolutionary algorithms and PSO from swarm intelligence. Moreover, a hybrid version of GA and PSO is included to showcase the hybridization of two metaheuristics. In addition, the Eagle Strategy, a metaheuristic with a robust exploration mechanism, is also implemented to highlight algorithm behaviour across various scenarios. The proposed model is a fuzzy evolutionary algorithm, detailed in Table 2.
Owing to the stochastic nature of metaheuristics, the near-optimal solution is not unique. Therefore, a viable empirical study requires experiments on multiple instances. In this study, each algorithm was run ten times with one hundred iterations.
Swarm intelligence methods such as PSO and ES suffer from local optima entrap-
ment. Moreover, the hybridization of two metaheuristics GAPSO did not accelerate
convergence. In Table 3, primary performance metrics, including execution time,
global fitness value, and hit-rate, are summarized for ten runs and a hundred iterations.
The evolutionary algorithm has shown quite satisfying results in exploring global
optima. Nonetheless, it suffers from slow convergence. Figure 3 illustrates the box
plot for the proposed fuzzy genetic algorithm compared to traditional GA for ten
runs. The following chart shows how fuzzification of metaheuristics can improve the
time of execution and also minimize the stochastic behaviour of metaheuristics.
Metaheuristic [45, 46] methods are generally complex processes with several random
operations and stochastic subroutines. Therefore, conducting a complexity analysis
from a deterministic point of view is impractical. For that reason, the hit-rate (%) is
used to evaluate each algorithm's computational effort. Hit-rate reflects the number of iterations an algorithm requires to reach the highest fitness value, that is, the global optimum. This metric is system-independent and provides a measurement independent of implementation tools or hardware specifications. The time complexity of each algorithm iteration depends not only on the hardware specifications but also on the choice of tools, such as Python, MATLAB, or R. In order to have a fair comparison, the hit-rate is computed for each algorithm in a setting with an equally converged search process. Equation 5 shows the computational logic for hit-rate.
$$\mathrm{HitRate}\,(\%) = \frac{It_{\mathrm{con}}}{It_{\max}} \times 100 \qquad (5)$$
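For instance (with invented numbers), an algorithm that first reaches the global-best fitness at iteration 62 of a 100-iteration run has a hit-rate of 62%:

```python
def hit_rate(it_converged: int, it_max: int) -> float:
    # Eq. (5): fraction of the iteration budget used before reaching the global best, in percent.
    return it_converged / it_max * 100

print(hit_rate(62, 100))  # 62.0
```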
5 Conclusion
Acknowledgements The main author wants to thank Professor Witold Pedrycz for providing
precise comments and the Centre for Research and Consultancy, UNITAR International University.
References
19. Bao L, Zhao F, Shen M, Qi Y, Chen P (2016) An orthogonal genetic algorithm for qos-aware service composition. Comput J 59(12):1857–1871
20. Seghir F, Khababa A (2018) A hybrid approach using genetic and fruit fly optimization algorithms for qos-aware cloud service composition. J Intell Manuf 29(8):1773–1792
21. da Silva AS, Mei Y, Ma H, Zhang M Particle swarm optimization with sequence-like indirect
representation for web service composition. In: Evolutionary computation in combinatorial
optimization, Springer, pp 400–405
22. Hossain MS, Moniruzzaman M, Muhammad G, Ghoneim A, Alamri A (2016) Big data-
driven service composition using parallel clustered particle swarm optimization in mobile
environment. IEEE Trans Serv Comput 9(5):806–817
23. Zhou J, Yao X (2017) A hybrid artificial bee colony algorithm for optimal selection of qos-based
cloud manufacturing service composition. Int J Adv Manuf Technol 88(9–12):3371–3387
24. Podili P, Pattanaik K, Rana PS (2017) Bat and hybrid bat meta-heuristic for quality of service-
based web service selection. J Intell Syst 26(1):123–137
25. Savarala BB, Chella PR (2017) An improved fruit fly optimization algorithm for qos aware
cloud service composition. Int J Intell Eng Syst 10(5):105–114
26. Chifu VR, Pop CB, Salomie I, Chifu ES (2017) Hybrid honey bees mating optimization algo-
rithm for identifying the near-optimal solution in web service composition. Comput Inform
36(5):1143–1172
27. Zhou J, Yao X, Lin Y, Chan FT, Li Y (2018) An adaptive multi-population differential artificial
bee colony algorithm for many-objective service composition in cloud manufacturing. Inf Sci
456:50–82
28. Sadeghiram S, Ma H, Chen G (2018) Cluster-guided genetic algorithm for distributed data-
intensive web service composition. In: 2018 IEEE congress on evolutionary computation
(CEC), IEEE, pp 1–7
29. Jatoth C, Gangadharan G, Fiore U, Buyya R (2018) Qos-aware big service composition using
mapreduce based evolutionary algorithm with guided mutation. Futur Gener Comput Syst
86:1008–1018
30. Liu L, Gu S, Fu D, Zhang M, Buyya R (2018) A new multi-objective evolutionary algorithm
for inter-cloud service composition. TIIS 12(1):1–20
31. Xu X, Rong H, Pereira E, Trovati M (2018) Predatory search-based chaos turbo particle
swarm optimization (ps-ctpso): a new particle swarm optimization algorithm for web service
combination problems. Fut Generat Comput Syst Int J Escience 89:375–386
32. Alayed H, Dahan F, Alfakih T, Mathkour H, Arafah M (2019) Enhancement of ant colony
optimization for qos-aware web service selection. IEEE Access 7:97041–97051
33. Sadouki SC, Tari A (2019) Multi-objective and discrete elephants herding optimization
algorithm for qos aware web service composition. RAIRO-Operat Res 53(2):445–459
34. Bouzary H, Chen FF (2019) A hybrid grey wolf optimizer algorithm with evolutionary operators
for optimal qos-aware service composition and optimal selection in cloud manufacturing. Int
J Adv Manufact Technol 101(9–12):2771–2784
35. Gao M, Chen M, Liu A, Ip WH, Yung KL (2020) Optimization of microservice composition
based on artificial immune algorithm considering fuzziness and user preference. IEEE Access
8:26385–26404
36. Bhaskar B, Jatoth C, Gangadharan G, Fiore U (2020) A mapreduce-based modified grey wolf
optimizer for qos-aware big service composition. Concurren Comput Pract Exp 32(8):e5351
37. Yang Y, Yang B, Wang S, Jin T, Li S (2020) An enhanced multi-objective grey wolf optimizer
for service composition in cloud manufacturing. Appl Soft Comput 87:106003
38. Li C, Li J, Chen H (2020) A meta-heuristic-based approach for qos-aware service composition.
IEEE Access 8:69579–69592
39. Zhang S, Shao Y, Zhou L (2021) Optimized artificial bee colony algorithm for web service
composition problem. Int J Mach Learn Comput 11(5):327–332
40. Wang C, Ma H, Chen G, Hartmann S (2022) Memetic eda-based approaches to qos-aware fully
automated semantic web service composition. IEEE Trans Evol Comput 26(3):570–584
41. Cuevas E, Zaldívar D, Pérez-Cisneros M (2018) Metaheuristic algorithms based on fuzzy logic,
Springer, pp 167–218
42. Zadeh LA (1965) Fuzzy sets. Inf Control 8(3):338–353
43. Mamdani EH, Assilian S (1975) An experiment in linguistic synthesis with a fuzzy logic
controller. Int J Man Mach Stud 7(1):1–13
44. Bagis A, Konar M (2016) Comparison of sugeno and mamdani fuzzy models optimized by artificial bee colony algorithm for nonlinear system modelling. Trans Inst Measur Control 38(5):579–592
45. Naghavipour H, Soon TK, Idris MYI, Namvar M, Salleh RB, Gani A (2021) Hybrid meta-
heuristics for qos-aware service composition: a systematic mapping study. IEEE Access
1–25
46. Naghavipour H, Idris MYIB, Soon TK, Salleh RB, Gani A (2022) Hybrid metaheuristics using
rough sets for qos-aware service composition. IEEE Access 10:112609–112628
Machine/Deep Learning
Fraudulent E-Commerce Website
Detection Using Convolutional Neural
Network Based on Image Features
1 Introduction
In the current digital and connected era, the Internet is an indispensable element of daily life, offering many benefits but also posing several challenges. While e-commerce and digital websites are continually growing, the number of fraudulent websites is also increasing. Foreign currency exchange (Forex), gold and other precious-metal investment, Ponzi and pyramid schemes, multi-level marketing, online shopping, and e-commerce websites are among the most popular types of fraudulent websites [1]. These fraudulent websites appear to be genuine, making it hard for consumers to distinguish them from legitimate websites. Consumers lose money as a result of fraudulent e-commerce websites selling fake products, while the digital marketing industry suffers equally [2]. Moreover, fraudulent e-commerce not only misleads people but also damages the image of legitimate online stores [3]. Additionally, the fraudulent activity may lead to other cybercrimes that involve the stealing and misuse of the victim's sensitive personal information, such as bank account information, social security numbers, or credit card numbers. Over the years, millions of dollars have been lost because of fraudulent websites [1, 3–5].
Researchers have proposed a variety of approaches for detecting fraudulent websites, including heuristic methods, machine learning-based methods, and, more recently, deep learning-based methods [1, 3, 6, 7]. However, due to several challenges, existing measures are often insufficient for detecting fraudulent websites. As described in [1, 3], these challenges are as follows. First, the current trend in web programming technologies makes online scraping more complicated and hinders access to fraudulent websites' contents. Second, with the diverse range of fraudulent web activities (e-commerce fraud, MLM, Forex, etc.), it is challenging to propose a comprehensive, all-in-one solution for fraudulent website detection. Next, the rapid proliferation of fraudulent websites renders static measures irrelevant and necessitates a dynamic solution. Finally, the effectiveness of these models is undermined by fraudsters' attempts to conceal, deceive, obstruct, and circumvent the fraudulent website detection models.
Given these challenges, an efficient website detection technique needs to be designed. Utilizing deep learning, this study proposes a technique for detecting fraudulent websites based on website images. Our research focuses on fraudulent e-commerce sites. This paper is divided into five sections. Section 2 highlights several significant studies on fraudulent website detection, covering both the detection of fraudulent websites and deep learning methods. Section 3 outlines the proposed model and experimental design. Section 4 presents and discusses the findings, while Sect. 5 concludes the study.
2 Related Work
The two primary methods used by fraudulent websites to operate are the duplication and cloning of legitimate websites and the development of websites with short lifespans [8]. Meanwhile, [9] describes phishing websites as fraudulent websites that pose as legitimate and deceive people into engaging with them in order to obtain their sensitive data. The emergence of fake websites may be due to a variety of factors, including their genuine appearance, an inadequate level of user awareness, and fraudsters' ability to circumvent many of the measures currently in use to safeguard against them [10]. In these studies, fraudulent websites are also referred to as phishing websites and fake websites. A few efforts at detecting fraudulent websites are proposed in [8, 11–13].
The authors of [10] proposed a new fraudulent website detection system based on statistical learning theory (SLT). A combination of textual features (word phrases, lexical measures, spelling and grammar), URL features (URL text, anchor text), source code features (HTML and JavaScript, coding style), image features (metadata, image contents), and linkage features (site level, page level) was used to identify fraudulent websites. Their evaluation results showed an accuracy of 96% using SLT on a dataset of 900 fraudulent websites. Another study [3] used three types of features, including image features, to detect fraudulent e-commerce websites. The Image Moments technique was used to extract features from the screenshotted images, while four classification techniques, namely Logistic Regression, Random Forest, Decision Tree, and XGBoost, were used to classify fraudulent websites. XGBoost showed the best accuracy among these classification techniques, with 80% detection accuracy.
Deep learning has proven to be a particularly useful strategy due to its capacity to manage enormous volumes of data. Deep learning has had an impact not just on conventional machine learning approaches but also on tasks involving human perception. Models with hidden layers outperform traditional approaches, particularly in pattern identification. Convolutional neural networks (CNN), a variant of the artificial neural network (ANN), are among the most widely used deep neural networks. CNNs may be trained with either supervised or unsupervised machine learning techniques and, despite their origins in general data processing, are mainly utilized in image detection and processing at the pixel level. A neural network is made up of many interconnected nodes named "neurons" that are organized into three layers: the input layer, the hidden layer, and the output layer. CNNs have been a hub of research in a variety of areas, with successful applications in natural language processing, climate catastrophe prediction, and many more.
CNNs have recently been used successfully on a variety of computer vision tasks such as object detection, image classification, image segmentation, and object tracking [14], as well as in many domains such as medical image detection (e.g., MRI and CT scans), remote sensing imagery, document analysis, handwriting analysis, and more. Convolutional neural networks (CNN) use filters to capture an image's local structures and, as a result, perform effectively in computer vision applications [15]. CNN was used by [6, 16] to learn and categorize webpage screenshot images. Their collection comprises a total of 6104 samples, including 2375 malicious websites and 3729 safe websites. Then, 60% of the samples were randomly chosen as a training set, 20% as a validation set, and 20% as a test set, resulting in 3662 samples in the training set, 1221 samples in the validation set, and 1221 samples in the test set.
CNN and other deep learning techniques are also employed in [3, 6, 13–15]. The authors of [17] integrated a CNN and a recurrent neural network (RNN) to identify false news in two datasets (ISO and FA-KES). To train, a recurrent neural network (RNN) employs sequential data processing. It is called recurrent because the result of every time step is utilized as input to the next time step, which is accomplished by remembering the previous time step's output. Another work [18] carried out false news identification using CNN, Bidirectional LSTM, and ResNet on four fake news datasets. The back-translation method is employed in the data augmentation process in that paper to reduce data imbalances between classes. Another CNN-based fake news detection method is proposed in [15], which addresses an American presidential election dataset. Both text and image data are utilized to train the TI-CNN simultaneously, and the proposed convolutional neural network trains substantially faster than LSTM and several other RNN models. The study in [4] recommends a hybrid phishing website detection technique based on CNN and random forest (RF). The technique can be used to assess whether a URL is valid without having access to the website's content or using other services.
Convolutional neural networks (CNN) provide the foundation of the proposed fraudulent website detection model framework, as previously mentioned. The experimental dataset consists of e-commerce website screenshots gathered from [3]. Meanwhile, three optimizers, namely Adam, Root Mean Square Propagation (RMSprop), and Stochastic Gradient Descent (SGD), are used to measure the detection accuracy.
The data utilized in this study were obtained from both legitimate and fraudulent e-commerce websites in [3]. Approximately 500 websites were crawled, of which 258 were legitimate and 272 were fraudulent. Each website's official page was manually screenshotted to obtain the images. The distribution of websites evaluated in the study is shown in Table 1. A sample of images is shown in Fig. 1.
Fig. 1 A sample of (a) a legitimate [19] and (b) a fraudulent [20] e-commerce website screenshot
This study employs a convolutional neural network (CNN) to classify legitimate and fraudulent websites. In these experiments, the same image dataset from [3] was used for CNN classification of fraudulent and legitimate website images. A total of 530 website images were collected, comprising 272 fraudulent website images and 258 legitimate website images. The images were partitioned into 80% (370 images) as the training set, 10% (80 images) as the testing set, and the remaining 10% (80 images) as the validation set. The experiment used the same setup as [6], as follows:
• The input layer is a screenshot of the page with the size adjusted. The size is W ∗ L ∗ 3, where 3 represents the three colour channels. The size of the input images is adjusted to W ∗ L ∗ 3, where W = L = 256.
• Each of the first two convolutions has 32 output filters (i.e., M1 = M2 = 32), and the third one has 64 output filters (i.e., M3 = 64).
• The size of the convolution kernel is 3 ∗ 3, where K1 = K2 = K3 = 3.
• The output of the convolution layer is then mapped nonlinearly.
• A fully connected layer with 64 neurons (i.e., N = 64) is connected after the convolution and pooling layers.
• The prediction output layer accounts for image prediction and determines the corresponding website category.
• Considering that it is a binary classification problem, we adopt the sigmoid function. If the sigmoid value is bigger than 0.5, the corresponding site is classified as a phishing (fraudulent) website; otherwise, it is recognized as a legitimate website.
• The optimizer used in the CNN model is the Adam optimization algorithm, an extension of stochastic gradient descent that has recently seen broader adoption for deep learning applications in computer vision and natural language processing.
• The parameters for image data augmentation are also set: rescale = 1/255, shear_range = 0.2, zoom_range = 0.2, horizontal_flip = True, batch_size is set to 32, and the number of epochs is set to 61 in our experiment (a minimal code sketch of this configuration is given after Eq. (3) below).
A CNN model of four layers, consisting of three spatial convolution layers and one fully connected layer, was created in accordance with the setup in [6]. Random flips, shifts, shears, and other operations are then applied as data augmentation. The input image sizes are transformed to W × L × 3, where W = L = 256. For the convolution process, the first two convolutions have 32 output filters each, and the third one has 64 output filters. The convolution kernel has a size of 3 × 3, where K1 = K2 = K3 = 3. Meanwhile, a nonlinear mapping is applied to the convolution layer's output. The convolutional layers and max-pooling layers are both used for automated image feature extraction. ReLU was chosen as the activation function for the CNN, as specified in Eq. (1), due to its fast convergence and the simplicity of computing its gradient.
sigmoid(x) = 1 / (1 + e−x)    (2)

loss(x, z) = −(x[i] × log(z[i]) + (1 − x[i]) × log(1 − z[i]))    (3)
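The following Keras sketch is a minimal illustration of the configuration above; it is not the authors' or [6]'s code, and the pooling sizes, layer ordering, and data directory layout are assumptions made for the example.

```python
# Minimal sketch of the described CNN: three convolution blocks with 32/32/64
# filters and 3x3 kernels, a 64-neuron fully connected layer, a sigmoid output,
# the Adam optimizer, and the stated augmentation parameters. Pooling sizes and
# the directory layout are assumptions, not settings taken from the paper.
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

W = L = 256  # input image size

model = models.Sequential([
    layers.Input(shape=(W, L, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),   # M1 = 32, K1 = 3
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),   # M2 = 32, K2 = 3
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),   # M3 = 64, K3 = 3
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),            # N = 64
    layers.Dense(1, activation="sigmoid"),          # > 0.5 -> fraudulent/phishing
])
model.compile(optimizer=optimizers.Adam(),
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Data augmentation with the parameters listed above.
train_gen = ImageDataGenerator(rescale=1.0 / 255, shear_range=0.2,
                               zoom_range=0.2, horizontal_flip=True)
# Hypothetical directory layout: one sub-folder per class (fraudulent / legitimate).
train_flow = train_gen.flow_from_directory("data/train", target_size=(W, L),
                                           batch_size=32, class_mode="binary")
# model.fit(train_flow, epochs=61)
```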
The experimental results on the aforementioned dataset measure the detection accuracy based on three optimizers, namely Adam, Root Mean Square Propagation (RMSprop), and Stochastic Gradient Descent (SGD). The number of epochs was varied: 61 as set in [6], plus two additional settings of 100 and 120. The experiment also measured F1-scores and training time. The F1-score, a standard machine learning metric, combines precision and recall to assess model accuracy. It is especially useful for imbalanced datasets, where each class has a different number of samples, providing a more reliable evaluation than plain accuracy. The F1-score is calculated as in Eq. (4).
F1 = (2 × Precision × Recall) / (Precision + Recall)    (4)

where:

Precision = TruePositive / (TruePositive + FalsePositive)    (5)

Recall = TruePositive / (TruePositive + FalseNegative)    (6)
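As a quick illustration of Eqs. (4)-(6), the snippet below computes the three metrics for hypothetical binary labels (1 = fraudulent, 0 = legitimate); the arrays are invented for the example.

```python
# Illustration of Eqs. (4)-(6) on hypothetical binary labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)                              # Eq. (5)
recall = tp / (tp + fn)                                 # Eq. (6)
f1 = 2 * precision * recall / (precision + recall)      # Eq. (4)
print(precision, recall, f1)                            # 0.75 0.75 0.75
```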
Meanwhile, the training time in CNNs denotes the period required for the model to learn from the input data and fine-tune its parameters. This process entails providing the CNN with a labeled dataset, enabling it to make predictions, and subsequently refining its internal parameters (weights and biases) based on the prediction errors. The results of the experiment are shown in Table 2.
SGD is a very basic optimizer that is rarely utilized in applications due to its poor processing speed. Another issue with the approach is that it uses the same learning rate for every epoch, which results in the lowest accuracy across all three epoch settings. Meanwhile, the RMSProp optimizer automatically adjusts the learning rate and selects a different learning rate for each parameter. Unlike SGD training, which maintains a single learning rate, the Adam optimizer updates the learning rate for each network weight individually. Moreover, Adam combines the best characteristics of RMSProp and other algorithms. The Adam optimizer produces better results than the other optimization methods, has a shorter computation time, and requires fewer tuning parameters. As a result, Adam is suggested as the default optimizer for most applications. The training and validation accuracy and loss are illustrated in Figs. 4, 5, and 6.
The experimental accuracy is also compared with the accuracy reported in [3], as shown in Table 3. The best optimizer from the experiments, the Adam optimizer, is compared with the Logistic Regression, Decision Tree, Random Forest, and XGBoost results from [3].
Table 2 The CNN model accuracy, F1-score (using the training and testing sets), and training time

(a) Accuracy (%)
Optimizer   Epoch 61   Epoch 100   Epoch 120
Adam        74.0       71.0        68.0
RMSprop     63.0       66.0        52.0
SGD         66.0       66.0        66.0

(b) F1-score (%)
Optimizer   Epoch 61   Epoch 100   Epoch 120
Adam        74.0       72.0        69.0
RMSprop     62.0       67.0        48.0
SGD         52.0       52.0        52.0

(c) Training time (ms)
Optimizer   Epoch 61        Epoch 100        Epoch 120
Adam        693,708.314     1,168,302.057    1,371,541.660
RMSprop     715,048.149     1,154,158.749    1,325,656.689
SGD         680,314.043     1,138,732.898    1,365,106.503
The accuracy results show that the CNN technique using the Adam optimizer outperformed almost all the techniques in [3] except XGBoost and Decision Tree. The XGBoost technique is the most recent evolution of the Decision Tree, with an optimized gradient boosting algorithm that uses parallel processing, tree pruning, missing value handling, and regularization to counteract bias and overfitting during training. Meanwhile, in the CNN technique, batches of augmented data are generated and passed through the convolution layers and one fully connected layer, enriching the training data and helping to avoid overfitting.
Fig. 4 The training and validation accuracy as well loss based on [6] setup using Adam optimizer
with different epoch
Fig. 5 The training and validation accuracy as well loss based on [6] setup using RMSprop
optimizer with different epoch
Fig. 6 The training and validation accuracy as well loss based on [6] setup using SGD optimizer
with different epoch
5 Conclusion
Websites have evolved into a venue for cybercriminals to prey on victims and spread
fraud and cybercrime. Despite researchers’ efforts, most measures are still unable to
keep up with the rapid increase and variation of fraudulent websites. This paper aims to address this problem by utilizing a deep learning-based fraudulent website detection model that employs website images with CNN techniques to detect fraudulent websites. The experimental findings showed that the fraudulent website detection model based on the CNN algorithm achieved acceptable results compared to previous work, with an accuracy of 66.0–74.0%. Future work will focus on analyzing and implementing other deep learning-based methods to increase detection accuracy.
Acknowledgements The authors would like to acknowledge the Universiti Teknologi Malaysia
for supporting this study under the UTM RA ICONIC Grant (Q.J130000.4351.09G61).
References
Abstract With the increase in the development of technology, the threat of “Ran-
somware” has also increased especially towards organizations. Ransomware is a
malicious software that encrypts all the user’s data or system and demands a ransom
payment for decryption. Despite various machine learning approaches proposed for
ransomware detection, they often fail to identify those threats accurately in time,
thus leading to data loss and victimization. This research introduces a novel frame-
work, primarily based on static analysis of ransomware and predicting the pres-
ence of ransomware on users’ systems by monitoring a defined set of ransomware
activities. In this study, we used the Resilient Information Systems Security (RISS)
ransomware dataset, encompassing 582 ransomware samples from 11 distinct fami-
lies and 982 instances of goodware. We proposed a generic neural network frame-
work for the identification of ransomware and compared the performance of artificial
neural networks (ANN) and deep neural networks (DNN) in terms of accurately clas-
sifying ransomware and goodware. The suggested framework secured an accuracy
of 98.56% with ANNs, and achieved a slightly better performance (99.06%) when
ANN was replaced with DNN. Our results showed that a basic ANN can achieve
performance comparable to that of a DNN for ransomware detection. In future work,
we plan to evaluate the performance of the proposed framework in a real-time setting.
S. Nadeem
Department of Computer Science, University of Management and Technology, Lahore, Pakistan
T. Mehmood (B)
School of Information Technology, UNITAR International University, Selangor, Malaysia
e-mail: [email protected]
M. Yaqoob
Department of Computer Science, University of Hertfordshire, Hatfield, UK
1 Introduction
Ransomware is malicious code or malware that poisons a user's computer and quickly spreads to encrypt data or to lock the device. In this way, the data become inaccessible to the victim, and the attacker requests payment from the victim to decrypt it. Payment is mostly requested in Bitcoin or other cryptocurrencies that cannot be tracked [1]. Companies and individuals around the world are being attacked using ransomware.
The main goal of ransomware is to make money by using malicious software.
One reason for the popularity of ransomware is that the cyber-security area is
not fully ready for its recovery and mitigation. When there are security breaches
it becomes easier for hackers to attack the system. Several security systems have been proposed to detect and mitigate threats, identify changes in data, and track abnormal behaviour [2–4]. Such proposed security frameworks also include the signature-based approach or one-hand rule [5]. This method requires the user to assess the system using a hand rule to determine whether or not it is infected by a virus. Another popular approach is based on machine learning methods, which learn
[6–8].
Nowadays, the computer security industry tends to create rules and signatures
manually and automatically through machine learning and statistical methods. Rules
and signature-based methods can achieve lower false positives, whereas machine
learning models provide researchers with powerful tools to create models for the
inspection of ransomware, leading to greater accuracy and scalability. However,
the performance of these machine learning algorithms is highly dependent on
the features. Redundant and irrelevant features deteriorate the performance of the
machine learning algorithms [9].
In this paper, we propose a generic framework and evaluate it with two models, (i) an artificial neural network (ANN) and (ii) a deep neural network (DNN), to predict and classify both ransomware and goodware. These models are integrated into the suggested framework for the characterization of the essential static ransomware features required for ransomware prediction. To ensure that the feature sets are retained, we utilize shared data standards to identify the most pertinent static characteristics among a multitude of factors. Importantly, these approaches are also well-suited for
identifying new ransomware variants that share similar characteristics. In terms of
accuracy, our proposed framework stands out with an impressive 99.06%, surpassing
other machine learning classifiers. Furthermore, the framework can be easily adapted
to real-time settings. Finally, regardless of the model, the framework works well for
both ANN and DNN, reaching an accuracy of 98.56% and 99.06%, respectively.
1.1 Background
Ransomware is a type of malware that locks devices, holding the user’s data hostage
until a ransom is paid. Ransomware preys on people’s fears, the vulnerabilities of
their devices, their lack of security awareness, and their frustration. The countries
most frequently targeted by ransomware include the United States, Japan, the United
Kingdom, Italy, Germany, and Russia, primarily due to their heavy reliance on the
Internet [10].
There are two primary types of ransomware: Crypto ransomware and Locker
ransomware, as illustrated in Fig. 1. Crypto ransomware encrypts the user’s files
or private information, demanding payment for decryption. Conversely, Locker
ransomware locks the system, preventing user access until the ransom is paid.
Ransomware is highly profitable, as it effectively seizes control of the digital lives
of many system users [11].
Locker ransomware denies the user access to the system’s interface, restricts the
availability of system resources, and limits certain abilities, such as keyboard func-
tionality, while inner system functionalities might remain unlocked. The keyboard
and mouse, however, are locked. Locker ransomware spreads through social
engineering, phishing campaigns, and restricted sites.
Crypto ransomware, instead of locking the user's interface and denying access to the system, targets the user's important files and data stored on the system. The
overall functionality of the system remains unaffected. Only data files are encrypted,
allowing the victim to use the computer, but preventing access to the encrypted files.
The process cycle of the ransomware is shown in Fig. 2.
2 Literature Review
Fig. 2 The ransomware process cycle displays all the steps for the attack, from the initial infection
to the eventual payment of the ransom
abnormal behaviour in the system registry. A specific tool, PEid, is employed for
Windows ransomware detection. Users are advised to take precautionary steps, such
as online backups of files and important data like images and related documents [15].
A model called “Cyber Army Modelling and Simulation” has been presented for
detecting malware circulating over typical network behaviours [16]. The proposed
approach used the National Cyber Range (NCR) to generate data and provide results
for networks of different sizes in various test cases.
Furthermore, clustering algorithms have been utilized to select relevant features
for training classifiers to detect different malware types, including unknown malware
[17]. Malware, once replicated, can create entirely new variants. The study of
malware and executable files is conducted in groups to identify new malware with
high accuracy. Based on classifiers, malware detection can achieve accuracy rates of
up to 99.11% regarding file sizes.
To analyse and classify malware using artificial intelligence techniques and
categorize them into different groups, a behavioural-based methodology has been
proposed in [18]. Two methods, intrusion detection and honeypot techniques, were
employed to gather malware samples. Using a virtual machine platform such as
Cuckoo Sandbox, a model was developed to generate behaviour-based reports for
each sample, which were then manually analysed. AI technology is used to gener-
alize malware samples into worms or Trojan horses. The limitation lies in the manual
analysis, which is not suitable for a large number of concurrently generated malware
samples [18].
Another paper introduces a scalable method for identifying and clustering
malware with similar behaviour. The approach involves dynamic analysis to capture
execution traces of malware, which are then generalized into behavioural profiles.
These profiles are then fed into the proposed clustering algorithm. The proposed
system has been employed to collect, recognize, and cluster malware in real-world
setting [19].
Hwang et al. [20] developed a model based on a two-stage approach involving
Markov models and Random Forest Models to classify ransomware. The authors
focused on creating a sequence pattern of Windows-based API calls associated with
ransomware characteristics. To control false positives and true negatives, another
model based on Random Forest machine learning was built. The two-stage model
achieved an overall accuracy of 97.3%, with a false positive rate of 4.8% and true
negative rate of 1.5%.
Arivudainambi et al. [21] presented a model for the classification and detection
of ransomware in live traffic. The author elucidates how ransomware infiltrates and
compromises devices, followed by an analysis of live ransomware classification.
This model employs deep learning algorithms and classification patterns based on
network traffic characteristics. For instance, it examines the behaviour of well-known
ransomware to develop a real-time live traffic-based model. Various methodologies
and algorithms for ransomware detection have been introduced, yet they often exhibit
shortcomings.
In another study [23], an experimental platform was constructed to compare
the performance of various machine learning-based algorithms, including Random
Forest, Gradient Boosting Decision Trees, Neural Networks using Multilayer Percep-
tron, and three types of Support Vector Machine kernels for ransomware detection,
using a ransomware dataset. The experiments involved the analysis of executable files
and the measurement of opcode frequencies. The primary objective was to identify
algorithms best suited for developing ransomware detection models and systems.
The results indicated that Random Forest, Gradient Boosting Decision Trees, and
Support Vector Machine outperformed neural networks.
3 System Design
The Resilient Information Systems Security (RISS) ransomware dataset [24], which includes 582 ransomware and 942 goodware instances, was collected and analysed using the Cuckoo Sandbox. It is noteworthy that many ransomware samples belonging to different families share common features or behaviour patterns; however, numerous detection techniques struggle to identify them. The absence of automatic detection in many techniques contributes to delays in ransomware detection. This issue can be addressed through the application of artificial intelligence and machine learning-based approaches. Importantly, the proposed approach does not require a rigid framework for implementation. Ransomware samples are randomly selected from the database, and artificial neural networks are employed to predict and classify ransomware into their respective families.
3.1 Dataset
The dataset used in this research paper comprises 582 samples of ransomware span-
ning 11 different families and 942 samples of goodware as shown in Table 1.
It includes both older and newer ransomware samples. The gathered ransomware
samples represent the prevailing versions and variations currently encountered with
a significant portion being crypto ransomware types.
This framework is designed for the RISS ransomware dataset (Fig. 3), where data
scaling was unnecessary as the dataset had already been appropriately scaled. In the
feature extraction phase, Principal Component Analysis (PCA) is utilized to extract
the most significant features. Simultaneously, to ensure generalization, K-fold cross-
validation is used, where the value of K is set to 10. The use of tenfold cross-validation
ensures that each data point is used for both training and testing, reducing the risk of
overfitting and enhancing the model's generalization capability. Following this pre-processing, a classification model (ANN or DNN) is applied to differentiate between instances of ransomware and goodware. Additionally, the family of the ransomware is also identified in the next step.

Table 1 Ransomware families and the number of samples for each family contained in the dataset

ID    Family           Samples
1     Citroni          50
2     CryptLocker      107
3     CryptoWall       46
4     Kollah           25
5     Kovter           64
6     Locker           97
7     Matsnu           59
8     Pgpcoder         4
9     Reveton          90
10    TeslaCrypt       6
11    Trojan-Ransom    34
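A minimal scikit-learn sketch of this pipeline (PCA for feature extraction, tenfold cross-validation, and a neural-network classifier) is given below; the number of retained components, the hidden-layer size, and the placeholder data are illustrative assumptions rather than the authors' settings.

```python
# Sketch of the described pipeline: PCA feature extraction, 10-fold
# cross-validation, and an ANN-style classifier. The feature matrix here is
# random placeholder data standing in for the RISS features (582 ransomware
# + 942 goodware = 1524 rows); component count and layer size are assumptions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1524, 300))      # placeholder feature matrix
y = rng.integers(0, 2, size=1524)     # 1 = ransomware, 0 = goodware (placeholder)

pipeline = make_pipeline(
    PCA(n_components=50),                          # feature extraction
    MLPClassifier(hidden_layer_sizes=(1000,),      # single ReLU hidden layer
                  activation="relu", max_iter=300),
)
scores = cross_val_score(
    pipeline, X, y,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))
print("mean 10-fold accuracy:", scores.mean())
```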
Artificial Neural Network Model Artificial neural networks are inspired by the
biological nervous system. The biologically inspired neurons communicate with
each other via electrical pulses known as spikes [25]. An artificial neural network
contains input, hidden, and output layers, and can be trained on complex data [26].
When analysing big and noisy data, neural networks are very efficient in extracting
hidden patterns and non-linear complicated relationships in the target variable.
In this work, we use an artificial neural network (ANN)-based classifier. The ANN model comprises an input layer, an output layer, and a hidden layer with 1000 neurons. The input layer accommodates the dataset's features, the hidden layer employs the Rectified Linear Unit (ReLU) activation function for pattern learning, and the output layer has a number of nodes equal to the number of classes. In the case of ransomware detection, the output is binary (ransomware or goodware).
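For concreteness, a minimal Keras sketch of such an ANN is given below; the input dimensionality and the separate 11-way softmax head for family classification are assumptions made for illustration, not the authors' exact configuration.

```python
# Minimal sketch of the described ANN: one hidden layer of 1000 ReLU neurons
# and an output layer with one node per class. The binary detector uses a
# sigmoid output; the 11-way softmax variant (one node per family in Table 1)
# is an assumed extension for family classification.
from tensorflow import keras
from tensorflow.keras import layers

n_features = 50   # e.g. number of retained principal components (assumption)

binary_ann = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(1000, activation="relu"),
    layers.Dense(1, activation="sigmoid"),       # ransomware vs goodware
])
binary_ann.compile(optimizer="adam", loss="binary_crossentropy",
                   metrics=["accuracy"])

family_ann = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(1000, activation="relu"),
    layers.Dense(11, activation="softmax"),      # 11 ransomware families
])
family_ann.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```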
Fig. 3 The proposed framework for statically analysing the RISS dataset employs K-fold batch splitting, PCA for feature extraction, and a classifier (ANN or DNN) to detect and classify ransomware
Fig. 4 Model implementation of ANN shows the method used for feature extraction, for data
splitting and the basic working of the model
Fig. 5 Model implementation of DNN shows the method used for feature extraction, for data
splitting, and the basic working of the model
Our preliminary results show that both the ANN and the DNN consistently demonstrate high training accuracy throughout the training batches, reaching average values of around 99.48–99.64% (Table 2). This indicates that both models are capable of learning and fitting the training data well, capturing the underlying patterns in the dataset. When we consider the testing accuracy, both models perform remarkably well, with the ANN achieving an average testing accuracy of 98.56%, while the DNN performs slightly better with an average testing accuracy of 99.06%. Both models can correctly classify ransomware samples in the testing set, which is crucial for the practical application of ransomware detection systems.
We also observe that both models show stability in their performance as the batch number progresses. There is no significant fluctuation in accuracy from one batch to another, indicating that both models are able to generalize well to different subsets of the data. Moreover, the models are not sensitive to variations in the training data and are likely to perform reliably on new, unseen data. Additionally, the small gaps between training and testing accuracies suggest robust generalization and minimal overfitting for both models. This implies that the models have learned meaningful patterns in the data rather than memorizing it.
The purpose of this research is to analyse different types of ransomware attacks and to
provide an efficient conceptual framework that can be applied in real-time detection
and classification of ransomware. The performance of the suggested framework is
evaluated with both artificial neural networks (ANNs) and deep neural networks
(DNNs). Our findings indicate that the proposed framework is network-agnostic,
with DNNs achieving a slightly higher accuracy of 99.06% compared to ANNs
at 98.56%. Moreover, our demonstration reveals that the performance of a basic
ANN is comparable to that of a DNN in ransomware classification. However, since both the ANN and the DNN are trained on the RISS dataset, the proposed framework may fail to predict ransomware with new features. Therefore, future studies need to investigate the significance of deep models and focus on developing robust real-time ransomware detection systems capable of effectively handling emerging ransomware variants with new features.
Acknowledgements The authors thank the UNITAR International University supporting the
publication of this paper.
References
4. Lin PP (2006) System security threats and controls. The CPA J 76(7):58
5. Mehmood T, Helmi B (2016) Svm for network anomaly detection using aco feature subset. In:
2015 International symposium on mathematical sciences and computing research (iSMSC),
IEEE, pp 121–126
6. Bae SI, Lee GB, Im EG (2020) Ransomware detection using machine learning algorithms.
Concurr Comput Pract Exp 32(18):e5422
7. Fernando DW, Komninos N, Chen T (2020) A study on the evolution of ransomware detection
using machine learning and deep learning techniques. IoT 1(2):551–604
8. Noorbehbahani F, Rasouli F, Saberi M (2019) Analysis of machine learning techniques for ransomware detection. In: 2019 16th International ISC (Iranian Society of Cryptology) conference on information security and cryptology (ISCISC), IEEE, pp 128–133
9. Mehmod T, Rais HBM (2016) Ant colony optimization and feature selection for intrusion
detection. In: Advances in machine learning and signal processing, Springer International
Publishing, Cham, pp 305–312
10. Sajjan RS, Ghorpade VR (2017) Ransomware attacks: Radical menace for cloud computing. In:
2017 International conference on wireless communications, signal processing and networking
(WiSPNET), IEEE, pp 1640–1646
11. Liska A, Gallo T (2016) Ransomware: defending against digital extortion. O’Reilly Media,
Inc.
12. Nari S, Ghorbani AA (2015) Automated malware classification based on network behavior. In:
2013 International conference on computing, networking and communications (ICNC), IEEE,
pp 642–647
13. Rao V, Hande K (2017) A comparative study of static, dynamic and hybrid analysis techniques
for android malware detection. Int J Eng Develop Res 5(2):1433–1436
14. Grosse K, Papernot N, Manoharan P, Backes M, McDaniel P (2020) Adversarial examples
for malware detection. In: Computer security–ESORICS 2017: 22nd European symposium on
research in computer security, Oslo, Norway, September 11–15, Proceedings, Part II 22, pp
62–79
15. Zavarsky P, Lindskog D et al (2016) Experimental analysis of ransomware on windows and
android platforms: evolution and characterization. Procedia Comput Sci 465–472
16. Brown S, Henz B, Brown H, Edwards M, Russell M, Mercurio J (2015) Validation of network
simulation model with emulation using example malware. In: MILCOM 2015–2015 IEEE
military communications conference, pp 1264–1269
17. Sahay SK, Sharma A (2016) Grouping the executables to detect malwares with high accuracy.
Procedia Comput Sci 78:667–674
18. Rieck K, Holz T, Willems C, Düssel P, Laskov P (2008) Learning and classification of malware behavior. In: International conference on detection of intrusions and malware, and vulnerability assessment, Springer, pp 108–125
19. Bayer U, Comparetti PM, Hlauschek C, Kruegel C, Kirda E (2009) Scalable, behavior-based
malware clustering. In: NDSS, vol 9, pp 8–11
20. Hwang J, Kim J, Lee S, Kim K (2020) Two-stage ransomware detection using dynamic analysis
and machine learning techniques. Wireless Pers Commun 112:2597–2609
21. Arivudainambi D, KA VK, Visu P et al (2020) Ransomware traffic classification using deep
learning models: ransomware traffic classification. Int J Web Port (IJWP) 12(1):1–11
22. Kok S, Abdullah A, Jhanjhi N, Supramaniam M (2019) Ransomware, threat and detection
techniques: A review. Int J Comput Sci Netw Secur 19(2):136
23. Dion Y, Brohi SN (2020) An experimental study to evaluate the performance of machine learning algorithms in ransomware detection. J Eng Sci Technol 15(2):967–981
24. Resilient Information Systems Security (RISS) ransomware dataset. Retrieved from http://rissgroup.org/ransomware-dataset/. Accessed on 05 Jan 2023
25. Eluyode O, Akomolafe DT (2013) Comparative study of biological and artificial neural
networks. Euro J Appl Eng Sci Res 2(1):36–4
26. Pini M, Scalvini A, Liaqat MU, Ranzi R, Serina I, Mehmood T (2020) Evaluation of machine
learning techniques for inflow prediction in lake como, italy. In: Knowledge-based and intel-
ligent information & engineering systems: proceedings of the 24th international conference
KES-2020, Virtual Event, 16–18 September 2020. Procedia Computer Science, vol 176, pp
918–927. Elsevier
27. Sgandurra D, Muñoz-González L, Mohsen R, Lupu EC (2016) Automated dynamic analysis of ransomware: benefits, limitations and use for detection. arXiv preprint arXiv:1609.03020
Leveraging Gamification for Engaged
Learning in Online Teaching
and Learning Experiences
the way for a new era of enhanced and interactive learning experiences, equipping
educators to harness the potential of gamification and adapt education to the evolving
needs of contemporary learners.
1 Introduction
In today’s rapidly evolving educational landscape, the realms of teaching and learning
(TnL) have undergone a profound transformation, fuelled by the integration of tech-
nology and the shifting dynamics of student engagement [1]. As the new generation
of learners, often referred to as ICT natives, immerse themselves in a digital world,
the traditional paradigms of education have given way to innovative approaches that
tap into the technological tools familiar to them. This article delves into the inter-
section of teaching and learning, exploring the captivating realm of gamification
as a dynamic strategy to foster motivation, engagement, and meaningful learning
experiences in the digital era.
The components that comprise effective teaching and learning have expanded
beyond the confines of conventional instruction. Educators now shoulder the respon-
sibility of not merely disseminating information, but guiding students in their explo-
ration of knowledge and development of essential skills. As the modern classroom
extends beyond physical walls into the digital realm, educators face the challenge of
maintaining motivation and engagement in remote and online learning environments
[2]. It is within this context that gamification emerges as a potent solution, seam-
lessly blending educational content with elements drawn from the world of gaming
to create a stimulating and rewarding learning journey.
1.1 Gamification
rapid evolution, burgeoning applications, and its role as a catalyst for innovation and
enhanced learning experiences.
The Gartner Hype Cycle for 2011 predicted that by 2014, over 70% of
Fortune Global 2000 organizations would have adopted gamification in some way
[6]. Numerous encouraging outcomes have been documented, demonstrating the
potential of gamification in conjunction with both conventional and cutting-edge
approaches in educational settings to enhance students’ overall learning experiences,
inspire and involve them, and foster desired behaviour [6]. Many companies have
integrated gamification into various fields, such as Nike in sport and eBay in online shopping. Similarly, companies use gamification for marketing, nudging their customers' behaviour to encourage them to purchase products or visit their websites. At present, Malaysia is one of the countries that has carried out the most empirical research into the use of Augmented Reality and gamification [7], and it is also known as an active country in examining the use of Augmented Reality and gamification in education [7].
This article embarks on an exploration of the synergistic relationship between
technology-enhanced education and the principles of gamification. As online
teaching and learning (TnL) becomes increasingly prevalent, it is imperative to under-
stand how to harness the power of gamified experiences to fuel students’ intrinsic
motivation, curiosity, and active participation. By investigating the impact of gami-
fication on engagement, this article seeks to unravel the intricacies of incorporating
game elements like points, badges, leader boards, and multimedia analytics into the
fabric of educational design. Through a comprehensive analysis of existing liter-
ature, this study aims to provide insights into how gamification can revolutionize
the teaching and learning landscape, creating a dynamic and interactive space that
resonates with the digital generation’s learning preferences and aspirations.
2 Methodology
The methodology for this study aimed to gather insights into the effectiveness and
efficiency of gamification in online teaching and learning environments. It involved
a purposive sampling approach and employed a structured questionnaire to eval-
uate the impact of gamification on student engagement and learning outcomes. The
primary aim of the study was to gather insights into the effectiveness and efficiency
of gamification design in an online learning environment.
2.1 Participants
The study was conducted with a total of 30 students enrolled in the Computer Science
Department at National Defence University of Malaysia. Participants were selected
purposively to ensure representation from various age groups and genders.
2.2 Questionnaire
Participants were approached and informed about the study’s purpose and the volun-
tary nature of their participation. The structured questionnaire, featuring a Likert
scale ranging from 1 (Strongly Disagree) to 5 (Strongly Agree), was administered
electronically to the participants to facilitate straightforward data collection and
record-keeping. Subsequently, quantitative data analysis techniques were employed
to process the collected data. The analysis primarily involved calculating percent-
ages to summarize and interpret participants’ responses, providing a quantitative
perspective on the effectiveness and efficiency of gamification in online teaching
and learning environments.
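As an illustration of this percentage-based analysis, the short sketch below tabulates a hypothetical set of 30 Likert responses for a single item; the values are invented and do not reproduce the survey data.

```python
# Summarize Likert responses (1 = Strongly Disagree ... 5 = Strongly Agree)
# as percentages; the response list is hypothetical example data.
from collections import Counter

responses = [5, 4, 5, 3, 4, 5, 5, 4, 2, 5, 4, 5, 3, 5, 4,
             5, 4, 5, 5, 3, 4, 5, 4, 5, 5, 4, 5, 4, 5, 5]

counts = Counter(responses)
total = len(responses)
for level in range(1, 6):
    pct = 100.0 * counts.get(level, 0) / total
    print(f"Likert {level}: {counts.get(level, 0)} responses ({pct:.1f}%)")

agree = 100.0 * (counts.get(4, 0) + counts.get(5, 0)) / total
print(f"Agree or strongly agree: {agree:.1f}%")
```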
3 Findings
In the dynamic world of education, fostering student engagement has long been
a pivotal goal, particularly in the context of online learning environments where
the physical boundaries of the traditional classroom are transcended. One of the
most promising strategies to achieve this goal is the integration of gamification
elements into the learning process. This study delves into a fascinating facet of this
phenomenon: how online gamification systems can serve as catalysts for actively
engaging students in small-group discussions.
Question 2 of the survey, designed as a Likert scale ranging from 1 (Strongly
Disagree) to 5 (Strongly Agree), provided a nuanced perspective on the effectiveness
of this gamification element. The results, as illustrated in Fig. 3, were nothing short
of remarkable. A staggering 80% of respondents concurred with the statement that
merit and demerit activities during online learning serve as a catalyst for engaging
students’ passion for learning.
This finding carries profound implications for the realm of education, as it high-
lights the potential of gamification to not only enhance student engagement but also
to ignite a fervent interest in learning. By strategically integrating the principles of
merit and demerit within the online learning context, educators can create a dynamic
environment where students are not merely passive recipients of knowledge but
enthusiastic participants in their own educational journey. This approach, rooted in
gamification, offers a promising avenue for shaping a generation of learners who
are not just motivated but genuinely passionate about the pursuit of knowledge. It
underscores the transformative potential of gamification as a powerful ally in the
quest to make learning a vibrant and fulfilling experience for students in the digital
age.
Figure 4 highlights participants' opinions regarding learning through Augmented Reality (AR), unveiling intriguing insights into the role of AR in education. Notably, 65% of respondents expressed their enthusiasm, describing learning
with AR as “exciting.” An additional 25% of participants highlighted the facilitative
nature of AR, perceiving it as a tool that eases the learning process. This collective
sentiment of excitement and facilitation echoes the transformative potential of AR in
education, positioning it as an engaging and supportive force in the learning journey.
The data underscores the notion that AR can captivate students’ interest, turning the
educational experience into an exciting adventure.
Figure 5 offers a comprehensive summary of the survey responses to questions four (Q4), five (Q5), six (Q6), and seven (Q7), unveiling valuable insights into the
impact of gamification on the learning experience. These responses, illuminated by
the participants, underscore the transformative potential of gamification in educa-
tion. Based on survey question number four (Q4), a remarkable 80% of participants
voiced their agreement with the statement that Augmented Reality (AR) can pique
students’ interest in learning techniques more effectively compared to conventional
learning methods. This outcome aligns with the broader narrative of gamification’s
role in capturing student attention and motivation, providing a new dimension to
the learning process. Based on the green bars in Fig. 5, an impressive 90% of respon-
dents acknowledged the allure of receiving a reward after engaging with a video
game. This sentiment underscores the excitement and motivation that gamification
elements, such as rewards and recognition, can offer students in their learning journey.
These participant responses underscore the significance of gamification as a means
to ignite student interest, motivation, and active engagement. They provide empirical
evidence of gamification’s ability to captivate learners’ attention and stimulate their
enthusiasm for the educational content. In the ever-evolving landscape of education,
these insights are invaluable, offering a compelling argument for the continued explo-
ration and integration of gamification in teaching and learning contexts. As proven in
survey question number seven (Q7), which delves into the heart of the matter, an over-
whelming 93% of participants expressed their agreement with the idea that the online
gamification system effectively encourages their active participation in small-group
discussions. This finding solidifies the pivotal role of gamification in transforming
passive online interactions into dynamic and engaging discussions, where students
are not just participants but enthusiastic contributors.
The findings from Fig. 6 illuminate the profound impact of gamification on the
level of student participation. Notably, an impressive 80% of respondents unequiv-
ocally agreed that online gamification effectively encourages students to participate
actively in online discussions. This revelation underscores the transformative power
of gamification, reshaping what was once a passive exchange of ideas into a vibrant,
interactive discourse. The beauty of gamification lies in its ability to infuse elements
of competition, goals, and rewards into the learning experience. When applied to
online discussions, this approach ignites a newfound enthusiasm among students to
actively contribute their insights and ideas. The gamification system, through points,
badges, and leader boards, not only taps into students’ competitive spirit but also
offers them clear objectives to strive for. As a result, students are not mere bystanders
in discussions; they become enthusiastic participants, driven by the desire to excel
and earn recognition.
This transformation is not just about increased activity; it signifies a paradigm shift
in how students engage with course material and interact with their peers. It encour-
ages them to think critically, articulate their thoughts effectively, and constructively
challenge their peers’ ideas. Gamification, therefore, emerges as a potent tool in
addressing the age-old challenge of passive online discussions. It brings a newfound
vibrancy to the digital classroom, fostering an environment where students actively
seek to contribute, collaborate, and learn from one another. In the era of remote and
online learning, this discovery heralds a new dawn, promising more engaging and
enriching educational experiences for students and educators alike.
One of the fascinating discoveries in this study revolves around the implemen-
tation of merit and demerit points as motivational tools within the gamified online
learning platform. This innovative approach harnesses the intrinsic motivation of
students, tapping into their inherent desire for recognition, achievement, and rewards.
In this context, “merit” represents the attainment of a reward, while “demerit”
signifies a loss of privilege or reward.
The investigation into the integration of gamification in online teaching and learning
environments yielded insightful findings that shed light on its potential to enhance
student engagement and learning outcomes. The study, conducted with a purpo-
sive sample of 30 students from the Computer Science Department at the National
Defence University of Malaysia, delved into the effectiveness and efficiency of
gamification design.
The incorporation of merit and demerit points as motivational tools within the
gamified online learning platform emerged as another noteworthy finding. Remark-
ably, 80% of respondents concurred that merit and demerit activities during online
learning effectively engage students’ passion for learning. This result underscores the
importance of recognizing students’ intrinsic motivation and their desire for achieve-
ment and recognition. By implementing gamification elements that reward desirable
behaviours and penalize undesirable ones, educators tap into students’ innate drive to
excel and stand out. This finding not only validates the effectiveness of gamification
as a motivational strategy but also suggests that it can play a pivotal role in shaping
students’ attitudes toward learning, transforming them from passive recipients of
knowledge into active and enthusiastic participants.
The findings presented in Fig. 2 illuminate the profound impact of Augmented
Reality (AR) on student engagement. This collective sentiment reflects the transfor-
mative potential of AR in education, positioning it as an engaging and supportive
force in the learning journey. These responses suggest that AR can effectively capti-
vate students’ interest, making the educational experience an exciting adventure. By
providing an interactive and immersive learning environment, AR holds the promise
of elevating engagement, fostering curiosity, and enhancing the overall educational
experience. The data presented in Fig. 3 underscores the transformative potential
of gamification in education and its synergy with Augmented Reality (AR). Survey
question number four (Q4) revealed that 80% of participants believe that AR can
more effectively pique students’ interest in learning techniques compared to conven-
tional methods. Additionally, the survey unveiled that 90% of respondents found the
idea of receiving rewards after engaging with a video game exciting, highlighting
the motivating power of gamification elements. This alignment between AR and
gamification speaks to the ability of these innovative approaches to capture student
attention and motivation, redefining the learning process.
The study’s investigation into the impact of gamification on student engagement
revealed compelling results. A significant majority of participants, a striking 80%,
expressed agreement with the statement that the online gamification system effec-
tively encourages active participation in small-group discussions. This outcome
underscores the pivotal role gamification plays in transforming the often-passive
nature of online discussions into dynamic and interactive exchanges. Gamification
elements, such as points, badges, and leader boards, introduce a competitive and
goal-oriented dimension that motivates students to actively contribute to discus-
sions. This finding highlights gamification’s potential to bridge the engagement
gap in remote and online learning environments, offering educators a potent tool for
fostering collaboration and knowledge sharing among students.
Fig. 2 Countries that carried out empirical study research into the use of Augmented Reality (AR) and gamification [7]
The results of this study carry significant implications for the future of online
teaching and learning. Gamification emerges as a promising approach to address the
perennial challenge of student engagement in digital education settings. Its ability
environment that excites and engages students, making learning a thrilling adven-
ture. Meanwhile, gamification offers a range of elements like rewards, points, and
leader boards that can motivate and inspire learners, transforming them from passive
recipients of knowledge into enthusiastic participants. Together, these approaches
can capture students’ interest, foster curiosity, and elevate the overall educational
experience, creating a dynamic and engaging digital classroom.
5 Conclusion
In conclusion, the integration of gamification into online teaching and learning expe-
riences presents an exciting and transformative path for education in the digital age.
This article has illuminated the dynamic synergy between traditional education and
the engaging world of gaming, showcasing how gamification can fuel motivation,
foster active participation, and enhance learning outcomes. The traditional role of
educators has evolved beyond imparting knowledge; it now includes guiding students
in their exploration of knowledge and skills. As online teaching and learning continue
to gain prevalence, harnessing the power of gamified experiences becomes impera-
tive. This article has delved into the impact of gamification on student engagement
and found compelling evidence of its effectiveness.
Our study with Computer Science students at the National Defence University
of Malaysia underscores how gamification effectively encourages active participa-
tion in online discussions and ignites students’ passion for learning. These insights
affirm gamification’s ability to create dynamic and interactive learning environments,
aligning with its growing adoption across various sectors. As gamification transcends
traditional boundaries, its transformative potential in shaping user behaviour and
interaction is evident. Incorporating gamification principles into online education
is a promising avenue for revolutionizing educational practices. By tapping into
students’ intrinsic motivation for achievement, recognition, and engagement, gami-
fication offers an innovative approach to fostering dynamic and effective learning
experiences.
In an era where the digital generation shapes the educational landscape, the explo-
ration of gamification’s multifaceted impact on motivation, engagement, and learning
outcomes is a crucial step. It equips educators with a powerful tool to adapt educa-
tion to the evolving preferences and needs of contemporary learners. In summary,
gamification is more than a trend; it’s a pathway to creating education that is not
only engaging but also effective. By embracing gamification, educators can navigate
the dynamic intersection of technology and education, ensuring that students are
well-prepared for a future that demands adaptability, creativity, and active partici-
pation. The gamified classroom is not just a concept; it’s a reality that promises to
revolutionize education for generations to come.
References
1 Introduction
media into positive, negative, or neutral sentiment at different levels, such as sentence
level, document level, or features level sentiment analysis. Sentiment analysis has
various applications, including understanding public perception of eco-friendly trans-
formation, air quality, predicting box office success, analyzing public sentiment
during global events like pandemics, brand management, customer feedback analysis,
market research, social media insights, finance, health care, and more [2, 13].
Using traditional methods, developing accurate sentiment analysis models for
all languages poses challenges [13]. Using traditional methods, sentiment analysis
involves strict adherence to specific protocols, including pre-processing, feature
extraction, and model creation. Initially, a supplied dataset typically undergoes
preliminary processing stages like tokenization and stemming. Then, various tech-
niques are employed during the feature extraction phase to extract relevant features.
A model is subsequently constructed based on these extracted features. Finally, the
resulting model can be implemented to perform sentiment analysis [9]. It’s impor-
tant to note that this process must be repeated for each context since the constructed
models are only trained on specific contexts.
The advent of sizable language models represents a groundbreaking progres-
sion in the domain of natural language processing, providing a fresh perspective
on sentiment analysis that challenges established techniques. These models, with
their unparalleled magnitude and capacity to acquire complex linguistic patterns,
possess the capability to supplant traditional sentiment analysis models. Divergent
from specialized models that necessitate meticulous feature engineering and domain-
specific fine-tuning, Large Language Models are inherently general in nature [7].
Their comprehensive pretraining on diverse text sources grants them a broad compre-
hension of language subtleties and contextual relationships, facilitating the identifi-
cation of sentiment-carrying elements in text. By presenting textual prompts, these
models can produce sentiment predictions with remarkable accuracy, eliminating the
need for task-specific feature extraction and intricate model development. Addition-
ally, their adaptability enables them to capture emerging language trends and adjust
to varying domains without requiring extensive retraining. Hence, Large Language
Models are well-positioned to streamline and enhance sentiment analysis procedures
by providing a versatile and unified alternative to conventional, application-specific
models. Nevertheless, ethical considerations, such as the mitigation of bias and trans-
parency, must be meticulously addressed to ensure the responsible and equitable
deployment of such models in sentiment analysis applications.
In the sentiment analysis literature, three main approaches are commonly employed.
Machine learning-based methods: these approaches utilize machine learning algorithms for text classification into predefined label categories, such as positive, negative, or neutral. Commonly used algorithms include Naive Bayes [12], Support Vector Machines (SVM) [11], and Random Forest [4].
Lexicon-based methods: lexicon-based methods rely on predefined dictionaries containing words associated with positive or negative sentiments. The sentiment score of a text is determined by summing the scores of the words in the text [1].
Hybrid methods: hybrid methods combine both machine learning and lexicon-based approaches to enhance the accuracy of sentiment analysis.
2 Datasets
Two datasets were employed in this research study. The initial dataset used is the
IMDB Review Dataset, commonly referred to as the Large Movie Review Dataset
[6]. This dataset is suitable for comparative analysis, containing 25,000 film reviews
categorized as either positive or negative. In our study, we selected the first 501
reviews from each category within the test set, resulting in a total of 1,002 reviews
for analysis.
The second dataset was curated from Google Maps reviews of 124 distinct univer-
sities in Malaysia. To gather these reviews, the Google Maps API was employed.
Due to restrictions imposed on the free version of the Google Maps API, the data
collection process was limited to capturing a maximum of five reviews from each
university. After the removal of empty reviews and those containing only scores
without accompanying text, a total of 592 reviews remained for analysis. These
retained reviews underwent a manual evaluation, wherein they were assigned one
of three sentiment labels: ‘Positive,’ ‘Negative,’ or ‘Neutral.’ It has to be mentioned
that the constructed dataset has been made publicly available for download [8].
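As an illustration of how such a collection pipeline can be set up, the sketch below queries the Google Maps Place Details endpoint, which returns at most five reviews per place on the standard tier; the API key, place IDs, and field handling are placeholders rather than the exact script used in this study.

```python
# Sketch: collecting Google Maps reviews for a list of universities.
# API key and place IDs are placeholders; the Place Details endpoint
# returns at most five reviews per place, matching the limit described above.
import requests

API_KEY = "YOUR_GOOGLE_MAPS_API_KEY"                                # placeholder
PLACE_IDS = ["ChIJexampleUniversity1", "ChIJexampleUniversity2"]    # placeholders

def fetch_reviews(place_id):
    url = "https://maps.googleapis.com/maps/api/place/details/json"
    params = {"place_id": place_id, "fields": "name,reviews", "key": API_KEY}
    result = requests.get(url, params=params, timeout=10).json().get("result", {})
    # Drop reviews that contain only a score and no accompanying text.
    return [(result.get("name"), r["rating"], r["text"])
            for r in result.get("reviews", []) if r.get("text", "").strip()]

rows = [row for pid in PLACE_IDS for row in fetch_reviews(pid)]
print(f"Collected {len(rows)} non-empty reviews")
```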
3 Methodology
We employed the OpenAI API to utilize GPT-3.5 for sentiment analysis. Reviews
from each dataset under scrutiny were processed by GPT-3.5 along with corre-
sponding prompts, as given in Table 1.
These prompts were strategically constructed to direct GPT-3.5 in allocating
classifications to the input texts. To evaluate the impact of diverse prompts on the
outcomes, we employed two distinct prompts for the Google Maps dataset. In the
case of the IMDB dataset, the reviews were labeled with two distinct categories, positive and negative. Accordingly, the prompt was structured to instruct GPT-3.5 to classify the reviews under either of these two categories. In contrast, for the Google Maps reviews, a different approach was taken: GPT-3.5 was tasked with assigning three distinct labels to the reviews, aligning with our predefined labeling scheme.
Table 1 Prompts used for instructing GPT-3.5 to perform the task
Prompt for IMDB movie review dataset: You will be provided with a movie review, and your task is to classify its sentiment forcefully to as either positive or negative. The final result should be in the format of ‘sentiment:’ followed by one of the following two labels based on sentiment of the review: ‘positive’ or ‘negative’
First prompt for Google Map review dataset: You will be provided with a Google Map review of a university, and your task is to classify its sentiment as either positive, negative, or neutral
Second prompt for Google Map review dataset: You will be provided with a Google Map review of from a university, and your task is to classify its sentiment forcefully to as either positive, negative, or neutral. The final result should be in the format of ‘sentiment:’ followed by one of the following two labels based on sentiment of the review: ‘positive’, ‘negative’, or ‘neutral’
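A minimal sketch of how each review can be passed to GPT-3.5 through the OpenAI API is shown below; the client calls, the 'gpt-3.5-turbo' model name, and the simple parsing of the returned string are illustrative assumptions, with the prompt text taken from Table 1.

```python
# Sketch: classifying one review with GPT-3.5 via the OpenAI chat completions API.
# Model name and response parsing are assumptions; the prompt follows Table 1.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

IMDB_PROMPT = (
    "You will be provided with a movie review, and your task is to classify its sentiment "
    "forcefully to as either positive or negative. The final result should be in the format "
    "of 'sentiment:' followed by one of the following two labels based on sentiment of the "
    "review: 'positive' or 'negative'"
)

def classify(review_text, system_prompt=IMDB_PROMPT):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": review_text},
        ],
    )
    raw = response.choices[0].message.content.strip().lower()
    # Extract the label from the "sentiment: <label>" format requested by the prompt.
    return raw.split("sentiment:")[-1].strip().strip(".")

print(classify("A gorgeous film, but the plot never quite comes together."))
```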
The present study introduces GPT-3.5 as a potent Large Language Model (LLM) and
compares its performance with conventional methodologies. The results emphasize
that the LLM exhibits a competitive capacity to classify reviews into distinct senti-
ment categories. A notable advantage of GPT-3.5 is its ability to eliminate the need
for intricate pre-processing or feature extraction steps on textual data, which makes it
an attractive option. Nevertheless, it is crucial to acknowledge the inherent expenses
involved in deploying GPT-3.5 for diverse tasks. The convenience it provides, saving
substantial time, could be perceived as a cost-saving factor when compared with
conventional approaches. This aspect gains paramount significance when dealing
with varying contextual domains. Unlike traditional methods that require a complete
iteration of the model construction or lexicon assembly process, LLM models stand
out. They offer the advantageous capability to seamlessly adapt to different contexts
without encountering limitations. This adaptability underscores a pivotal strength of
LLMs over their traditional counterparts.
Table 3 shows the confusion matrix of applying GPT-3.5 on the Google Map
reviews dataset. Table 4 provides an overview of GPT-3.5’s performance on the
Google Maps review dataset when using the first and second prompts. Notably, with
the first prompt, GPT-3.5 failed to label one review (as given in row 3 of Table 2).
Consequently, the overall accuracy of GPT-3.5 in predicting ternary class labels
using the first prompt was 92.22%. In contrast, when utilizing the second prompt, the
accuracy improved slightly to 92.57%, and GPT-3.5 successfully assigned labels to
all reviews. A detailed report of various measures for each class in this experiment is
presented in Table 4. Notably, the accuracy for the neutral class was 93.57%, while
other metrics hover around the 50% mark. Conversely, for the remaining two classes,
GPT-3.5 consistently delivers superior results, with a majority exceeding 91%.
Table 5 shows the details of the confusion matrix for the IMDB dataset. As a result,
the overall accuracy of the GPT-3.5 for binary sentiment analysis was 94.11%.
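As a quick sanity check, this accuracy and the F1-score later reported in Table 7 can be reproduced directly from the counts in Table 5; the small computation below assumes one fixed reading of the matrix orientation, which affects only how precision and recall are labeled.

```python
# Sketch: deriving accuracy and F1 from the Table 5 confusion matrix
# (rows read here as true labels; the other reading only swaps precision/recall).
tp, fn = 461, 40    # positive reviews: correctly / incorrectly classified
fp, tn = 19, 482    # negative reviews: incorrectly / correctly classified

accuracy = (tp + tn) / (tp + fn + fp + tn)            # 943 / 1002 = 0.9411
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)    # = 0.9399

print(f"accuracy={accuracy:.4f}, f1={f1:.4f}")
```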
Table 6 presents a comprehensive breakdown of measures for each sentiment
class. Notably, all measures surpass or equal 92%.
A comparison between the GPT-3.5 results on the IMDB movie review dataset
and three other proposed algorithms from the literature [3] is presented in Table 7.
As can be seen, the GPT-3.5 was able to perform better than all three algorithms
over all measures. This suggests that GPT-3.5, as a Large Language Model (LLM),
can be reliably used for performing sentiment analysis tasks. Moreover, its ease of
Table 2 GPT-3.5 outputs for Google Map reviews that were more than 50 characters long when using the first prompt

1 Review (label: positive): Good university, good place, good view, and good environment but maybe lack of public transportation for tourists and lack of shops, and the coffee shop next to the library does not accept cash you can’t buy some drinks when tired.
GPT-3.5 output: The sentiment of the review is mixed. It contains positive aspects such as a good university, good place, good view, and good environment. However, it also mentions some negative points such as a lack of public transportation for tourists, lack of shops, and the inconvenience of the coffee shop not accepting cash. Overall, the sentiment can be classified as neutral

2 Review (label: positive): World’s most global university. Students from different countries study there, and you have an opportunity to meet students from all over the world. Great campus, design, and students. Everywhere in the world, it is the creative thinkers and the normbreakers who are driving the development of the world’s most successful companies and the world’s most powerful economies. tan sri Limkokwing
GPT-3.5 output: Overall, the sentiment of this review is positive. The reviewer highlights the university’s global nature, diverse student body, great campus design, and students. They also mention the importance of creative thinkers and normbreakers in driving successful companies and economies

3 Review (label: positive): University of choice
GPT-3.5 output: I’m sorry, but I cannot provide a sentiment analysis for a university without the actual review. Could you please provide the review for me to analyze?

4 Review (label: negative): Facilities are age old, hope serious attention is given in upgrading them. It’s not only quality of education but also the basic needs of students. Don’t run like a prison dorm. Please take this as a positive feedback
GPT-3.5 output: The sentiment of this review is mixed. While the reviewer mentions that the facilities are age old and hopes for upgrades, they also mention that it is important to address the basic needs of students. They further state that the university should not run like a prison dorm. However, they end the review by asking the university to take their feedback as positive. Overall, the sentiment can be classified as neutral

5 Review (label: neutral): Pros friendly security guards, they always help when you need auditorium is modern have restroom at any floor and big parking lot. cons food and drink in the canteen is very expensive lift is going super slow, you must wait in the long time and the small way to motorcycle in or out.
GPT-3.5 output: The sentiment of the review is a mix of positive and negative. Positive aspects: friendly security guards, modern auditorium, restrooms on every floor, big parking lot. Negative aspects: expensive food and drinks in the canteen, slow lift, long waiting time, small way for motorcycles to enter or exit. Overall, the sentiment can be classified as neutral, as the positive and negative aspects balance each other out
Table 3 Confusion matrix of the GPT-3.5 output for Google Map reviews using the first prompt (second prompt values in parentheses)
Positive Neutral Negative
Positive 422 (428) 17 (11) 5 (6)
Neutral 9 (13) 21 (15) 9 (11)
Negative 3 (2) 3 (1) 102 (105)
Table 4 Evaluation measures for GPT-3.5 on the Google Maps review dataset using the first prompt (second prompt values in parentheses)
Sentiment Accuracy Precision Recall F1 Score
Positive 94.25 (94.59) 0.95 (0.96) 0.97 (0.97) 0.96 (0.96)
Neutral 93.57 (93.92) 0.54 (0.38) 0.51 (0.56) 0.53 (0.45)
Negative 96.62 (96.62) 0.94 (0.97) 0.88 (0.86) 0.91 (0.91)
Table 5 Confusion matrix of the GPT3.5 output for IMDB movie reviews
Positive Negative
Positive 461 40
Negative 19 482
use stands out when compared to traditional methods, especially because it does not
require any specific training on the context in which it will be utilized.
Another significant advantage we have observed in LLMs is their proficiency in
processing poorly written English reviews, including those containing typos. For
example, when examining the reviews in Table 2, you will notice that although the
reviews are not well-written, the results produced by GPT-3.5 are still satisfactory.
Table 7 Comparison of the GPT-3.5 results on the sentiment analysis task with the CNN, LSTM, and LSTM-CNN algorithms reported in [3]
Evaluation measure CNN LSTM LSTM-CNN GPT-3.5
Accuracy 0.90 0.88 0.89 0.9411
Recall 0.95 0.82 0.90 0.9604
Specificity 0.84 0.90 0.87 0.9234
Precision 0.87 0.90 0.87 0.9202
F-score 0.91 0.86 0.88 0.9399
This suggests that GPT-3.5 was able to infer the context of the reviews effectively.
5 Conclusion
The present study introduces GPT-3.5 as a potent Large Language Model (LLM)
and compares its performance with conventional methodologies. The results empha-
size that the LLM exhibits a competitive capacity to classify reviews into distinct
sentiment categories. A notable advantage of GPT-3.5 is its ability to eliminate the
need for intricate pre-processing or feature extraction steps on textual data, which
makes it an attractive option. However, it’s important to acknowledge the inherent
expenses of deploying GPT-3.5 for various tasks. Its efficiency in saving both time
and computational costs related to pre-processing can be considered a cost-saving
advantage compared to conventional approaches. This aspect gains paramount signif-
icance when dealing with varying contextual domains. Unlike traditional methods
that require a complete iteration of the model construction or lexicon assembly
process, LLM models stand out. They offer the advantageous capability to seam-
lessly adapt to different contexts without encountering limitations. This adaptability
underscores a pivotal strength of LLMs over their traditional counterparts.
The GPT-3.5, an exceptional Large Language Model, has been extensively trained
on diverse textual data [14]. However, it’s possible that publicly accessible datasets,
such as the IMDB dataset, were included in its training corpus. Therefore, evaluating
its performance under such circumstances may not be entirely justified. To address
this concern, we utilized Google Maps reviews to ensure that the GPT-3.5 model was
not exposed to this data during training.
In our experimental analysis of the Google review dataset using the initial prompt
(see Table 1), we obtained descriptive outputs from GPT-3.5 that necessitated metic-
ulous post-processing. Although the prompts explicitly instructed the model to solely
assign labels to the input text, the resultant outputs were not consistently straightfor-
ward. This compelled us to conduct further analysis on the GPT-generated results.
While this phenomenon was observed in only a limited number of reviews, the
descriptive nature of the output necessitated additional processing.
Table 2 presents a selection of such results obtained when utilizing the first prompt
with Google Map reviews. Notably, there were six instances where the GPT-3.5
outputs were descriptive, requiring careful extraction of the assigned labels from
the output. In addressing this challenge, we introduced a second prompt, explicitly
instructing GPT-3.5 to generate its response in a predefined format. Subsequently,
the generation of descriptive answers ceased entirely, consistently yielding outputs
conforming to the specified format. This observation underscores the pivotal role
of well-structured prompts when harnessing GPT-3.5’s capabilities. Prompt engi-
neering refers to the process of designing and refining prompts or instructions given
to AI models to generate desired outputs. It involves finding suitable prompts for
specific tasks and improving the effectiveness of prompts through various methods
and techniques. The goal is to enhance the performance and capabilities of AI models
in different domains such as conversational agents, visual and image processing, and
text-to-image generation [5].
References
1. Bonta V et al (2019) A comprehensive study on lexicon based approaches for sentiment analysis.
AJCST. 8(S2):1–6. https://doi.org/10.51983/ajcst-2019.8.S2.2037
2. Feldman R (2013) Techniques and applications for sentiment analysis. Commun ACM
56(4):82–89. https://doi.org/10.1145/2436256.2436274
3. Haque MdR et al (2019) Performance analysis of different neural networks for sentiment
analysis on IMDb movie reviews. In: 2019 3rd international conference on electrical, computer
and telecommunication engineering (ICECTE), pp. 161–164 IEEE, Rajshahi, Bangladesh.
https://doi.org/10.1109/ICECTE48615.2019.9303573
4. Karthika P et al (2019) Sentiment analysis of social media network using random forest algo-
rithm. In: 2019 IEEE international conference on intelligent techniques in control, optimization
and signal processing (INCOS). IEEE, Tamil Nadu, India, pp 1–5. https://doi.org/10.1109/INC
OS45849.2019.8951367
5. Lo LS (2023) The CLEAR path: a framework for enhancing information literacy through
prompt engineering. J Acad Librarianship 49(4):102720. https://doi.org/10.1016/j.acalib.2023.
102720
6. Maas A et al (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th
annual meeting of the association for computational linguistics: human language technologies,
pp 142–150
7. Mirchandani S et al (2023) Large language models as general pattern machines. https://doi.
org/10.48550/ARXIV.2307.04721
8. Nadi F Malaysian universities google map review. https://github.com/pharhadnadi/Malaysian
UniversitiesGoogleMapReview
9. Nandwani P, Verma R (2021) A review on sentiment analysis and emotion detection from text.
Soc Netw Anal Min 11(1):81. https://doi.org/10.1007/s13278-021-00776-6
10. Serrano-Guerrero J et al (2015) Sentiment analysis: a review and comparative analysis of web
services. Inf Sci 311:18–38
11. Styawati S, Mustofa K (2019) A support vector machine-firefly algorithm for movie opinion
data classification. Indonesian J Comput Cybern Syst 13(3):219. https://doi.org/10.22146/ijccs.
41302
12. Wongkar M, Angdresey A (2019) Sentiment analysis using naive bayes algorithm of the data
crawler: twitter. In: 2019 fourth international conference on informatics and computing (ICIC).
IEEE, Semarang, Indonesia, pp 1–5. https://doi.org/10.1109/ICIC47613.2019.8985884
13. Yadav J (2023) Sentiment analysis on social media. Qeios. https://doi.org/10.32388/YF9X04
14. Zhou C et al (2023) A comprehensive survey on pretrained foundation models: a history from
BERT to ChatGPT. https://doi.org/10.48550/ARXIV.2302.09419
Telecom Customer Experience Analysis
Using Sentiment Analysis and Natural
Language Processing—Comparative
Study
A. M. A. Ahmed
Iraqi Prime Minister Office, Baghdad, Iraq
A. Al-Nahari · C. K. Loy
School of Information Technology, UNITAR International University, Petaling Jaya, Selangor,
Malaysia
R. Al-Shabandar (B) · A. H. Mohammed
Advisory Office for Scientific, Iraqi Prime Minister Office, Academic Affairs and Artificial
Intelligence Applications, Baghdad, Iraq
e-mail: [email protected]
1 Introduction
2 Related Works
Social media has become one of the main sources of users’ opinions and customer sentiment toward any product. It reflects real public opinion toward any phenomenon, incident, or even governmental policy. Much research has been conducted during the last decade on sentiment analysis and how to precisely classify public opinions into different polarities. The majority of this research concentrates on the English language, and few studies focus on other morphologically rich languages like Arabic. Moreover, sentiment analysis and opinion mining research has touched several domains and industries such as health, sports, politics, technology, and marketing [6].
The author in [7] used a public Twitter dataset consisting of 2,000 tweets divided into positive and negative classes, which were processed and tokenized using 4G techniques. After tokenization, the Khoja stemmer is used to remove stop words, and TF-IDF is used to count word occurrences as the selected feature input to a discriminative multinomial Naïve Bayes (DMNB) classifier. The DMNB classifier achieved a weighted average recall of 0.875 and a weighted average precision of 0.876 for negative and positive tweets, while accuracy and F-score were 87.5%, an improvement over the approaches in related work.
The work in [8] explored Arabic sentiment analysis of online learning tweets in Saudi Arabia during the COVID-19 pandemic. The tweets were manually annotated, and N-grams of different sizes along with TF-IDF were used for feature extraction before input into the classification models. Six different models were tested and compared, including Naïve Bayes, support vector machine, logistic regression, KNN, random forest, and XGBoost. Logistic regression with unigram TF-IDF achieved the best accuracy (89.9%) and recall (89.9%); however, the best precision (96.8%) was achieved by SVM with bigram TF-IDF [8].
The authors in [4] created an Arabic sentiment lexicon through automatic translation of English sentiment lexicons and manual extraction of sentiment words translated to Jordanian dialect; the lexicon was used to train three machine learning classification models (Naïve Bayes, KNN, and support vector machine) on Facebook comments related to telecom operators’ services in Jordan. Lexicon-based labeling achieved very good accuracy (98%), and the support vector machine was found to be the most accurate of the three tested models, with 97.8% accuracy. The reference in [9] proposed a new approach to detect churn in Saudi telecom companies using sentiment analysis of Arabic tweets. Several approaches were applied to detect customer satisfaction and compared with a questionnaire designed for the customers whose tweets were analyzed using the proposed approach.
3 Methodology
The dataset was collected from Kaggle and consists of 10,000 unlabeled Arabic tweets from the stc customer care account on Twitter [10]. The tweets in this dataset span 42 days, from February 17, 2022, to March 30, 2022, and were written by 6,084 unique users.
The dataset, loaded and read using Pandas, has five columns: an unnamed index column with values from 0 to 10,000; a datetime column with the detailed date and time of each posted tweet; a tweet id key column containing a unique id for each tweet; a text column containing the body of the tweets shared by different users; and finally a username column containing the user identity as per Twitter identification.
Various text preprocessing techniques are considered in this paper: data annotation, tokenization, lowercasing or normalization, removal of stop words and unnecessary words, stemming, and lemmatization.
In this study, the Twitter-XLM-RoBERTa-base model was used to annotate the tweets as either negative, positive, or neutral. The XLM-RoBERTa-base model was trained on ~198M multilingual tweets. An instance of the XLM-RoBERTa-based model was initialized using the Auto classes from Hugging Face, and a batch size of 32 was used without changing any model parameters [11].
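A compact sketch of this weak-labeling step is given below; the exact Hugging Face checkpoint name ('cardiffnlp/twitter-xlm-roberta-base-sentiment') and the sample tweets are assumptions consistent with the description above.

```python
# Sketch: annotating Arabic tweets with a Twitter XLM-RoBERTa sentiment model.
# Checkpoint name is an assumption; the study used a batch size of 32.
from transformers import pipeline

annotator = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",  # assumed checkpoint
)

tweets = ["الخدمة ممتازة شكراً لكم", "النت بطيء جداً اليوم"]   # sample tweets
labels = [p["label"].lower() for p in annotator(tweets)]
print(labels)  # e.g. ['positive', 'negative']
```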
In the text preprocessing phase, various steps were executed to ensure precise data transformation and to prepare the text in a format suitable as input to the topic modeling and classification tasks. The implementation of text preprocessing in this research follows the most common steps used in Arabic text preprocessing, including removing hashtags, account names, and web links; excluding non-Arabic words; and removing stop words, unnecessary spaces, emojis, and punctuation.
Using regular expressions, hashtags, account names, and web links were removed successfully. Figure 1 shows a sample of tweets before and after removing hashtags, account names, and web links; the English translation is added below each Arabic tweet in blue, before and after processing, so that it can be understood by non-Arabic speakers.
Emojis were removed by compiling all emoji formats and substituting them with empty strings using regular expressions.
The NLTK toolkit was used to remove Arabic stop words. While inspecting the word cloud generated with ar_wordcloud, it was found that many words were repeated without adding meaning to the tweets. To avoid their influence on the proposed models, a list of such words was added to the stop-word list and removed from all tweets.
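A condensed sketch of these cleaning steps, using regular expressions and the NLTK Arabic stop-word list, is shown below; the specific patterns and the extra domain-specific stop words are illustrative assumptions rather than the exact rules used in this study.

```python
# Sketch: cleaning Arabic tweets (links, mentions, hashtags, non-Arabic characters,
# emojis, punctuation) and removing stop words with NLTK.
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
arabic_stops = set(stopwords.words("arabic"))
arabic_stops |= {"والله", "ياليت"}   # illustrative extra words spotted in the word cloud

LINKS_MENTIONS_TAGS = re.compile(r"(https?://\S+)|(www\.\S+)|(@\w+)|(#\w+)")
NON_ARABIC = re.compile(r"[^\u0600-\u06FF\s]")   # keeps Arabic letters and spaces only

def clean_tweet(text):
    text = LINKS_MENTIONS_TAGS.sub(" ", text)    # drop web links, mentions, hashtags
    text = NON_ARABIC.sub(" ", text)             # drop emojis, Latin text, punctuation
    tokens = [t for t in text.split() if t not in arabic_stops]
    return " ".join(tokens)

print(clean_tweet("@stc_care النت بطيء جداً 😡 https://t.co/xyz"))
```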
The target of topic modeling is to discover the most repeated topics that customers write their tweets about. Two different models were used for the topic modeling task: Latent Dirichlet Allocation (LDA) and BERTopic.
3.3.2 BERTopic
Two sets of experiments were conducted in this work. In the first experiment, the NLP model BERT is utilized, while in the second, fine-tuned MARBERT is used. The outputs of the NLP models were used as inputs to machine learning classifiers (SVM, XGBoost). To evaluate classification performance, accuracy, precision, recall, and F1-score were used. More details about the models can be found below.
The data must be converted into a form that is suitable for the BERT model. Therefore, the data was tokenized, and the pretrained ‘BERT-base’ model is used to undertake the initial training process. Padding and truncation were set in the tokenization process so that each sentence or tweet has the same length: short sentences were padded with zeros, long sentences were cut off at max_length, and the output was set to return PyTorch tensors.
The result of tokenizing a text is a dictionary that contains the input_ids, which are the tokens expressed as numbers, and the attention_mask, which indicates whether a token is a [PAD] token. The texts’ [CLS] hidden states were then obtained by running the model on the tokenized sentences or tweets. The [CLS] hidden state is then used as input to a support vector machine initialized from the sklearn library, with a linear kernel and ‘ovo’ decision_function_shape in order to handle multi-class classification over the three labels: neutral, positive, and negative sentiments.
The same BERT-base model hidden states were used to extract features from the tweets, this time paired with the XGBoost machine learning classifier. The Auto classes were used for text tokenization with the same settings and configurations; the only difference in this model is that the [CLS] hidden state is input to an XGBClassifier from the XGBoost library with 50 estimators.
The MARBERT specialized Arabic dialect model is utilized in our work with the aim of extracting the tweets’ hidden-state features as input to a support vector machine classifier. The Ammar-alhaj-ali/Arabic MARBERT-sentiment model and AutoTokenizer are used to extract the textual features. The resulting dictionary of input_ids and attention_mask is used in the MARBERT model to get the texts’ [CLS] hidden states. The [CLS] hidden state is then used as input to a support vector machine initialized from the sklearn library, with a linear kernel and ‘ovo’ decision_function_shape in order to handle multi-class classification over the three labels: neutral, positive, and negative sentiments [16, 17].
3.4.5 MARBERT
The findings demonstrate that SVM outperforms XGBoost when utilizing the BERT-
base model, which was primarily pre-trained on the English language and subse-
quently fine-tuned on Arabic language telecom domain data in this study. SVM
achieves an F1-score of 0.71, while XGBoost achieves an F1-score of 0.65. This
significant performance gap between the two machine learning classifiers diminishes
Fine-tuned MARBERT exhibits the potential for enhanced feature extraction through
its embedding layers, leveraging the advantages of pre-trained Arabic dialects and
modern standard Arabic corpora.
The data for the second experiment is collected from the Kaggle tweets dataset
[10]. We use the MARBERT Arabic-specific transformer LMs model as the source
model for Table 2 and Fig. 3 [16].
The study effectively employs the latest techniques of topic modeling and text classifi-
cation based on pre-trained bidirectional encoder representations from transformers
(BERT). BERTopic leverages transformers, and c-TF-IDF is utilized to generate
distinct dense clusters, facilitating the identification of easily interpretable topics.
This research marks the first application of BERTopic to Arabic telecom-related
text, successfully generating six clusters for 50% of the tweets, aided by manual
assessment and topic reduction.
References
12. Lenka RK, Coombs T, Assi S, Jayabalan M, Mustafina J, Liatsis P, Al-Hamid A, Al-Sudani
S, Ismail NL, Al-Jumeily OBE D (2022) Evaluation of extractive and abstract methods in text
summarization. In: The international conference on data science and emerging technologies.
Springer Nature Singapore, Singapore, pp 535–546
13. Mehta R, Varma V (2023) LLM-RM at SemEval-2023 Task 2: Multilingual Complex NER
using XLM-RoBERTa. ArXiv preprint arXiv:2305.03300
14. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3(Jan):993–
1022
15. Sheikha H (2020) Text mining Twitter social media for Covid-19: comparing latent semantic
analysis and latent Dirichlet allocation
16. Abdul-Mageed M, Elmadany AR, Nagoudi EMB (2021) ARBERT & MARBERT: Deep bidirectional transformers for Arabic. In: ACL-IJCNLP 2021—59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing, proceedings of the conference, pp 7088–7105
17. Komang Ananta Aryadinata I, Pangesti D, Anugerah GB, Aditya IE, Ruldeviyani Y (2021)
Sentiment analysis of 5G network implementation in Indonesia using twitter data
Efficient Time Series of Smoothing
and Auto-regressive Forecasting Models
for Predicting Police Officer Fatalities
in the USA
Danush Nagappan, Manoj Jayabalan, Ahmad Alanezi, Farhad Nadi,
and Thomas Coombs
Abstract With police deaths rising, predicting the number of police deaths is now of significant importance, and such predictions are necessary so that precautions can be taken to limit the impact of an associated reduction in police numbers on the police force, the public, and the government. The aim of this research is to compare different time series forecasting models and find the most efficient model for predicting police deaths occurring in the USA. The dataset used in this study consists of details of police officers who died in service on duty in the USA; it comprises a total of 26,269 records between January 3, 1791, and June 3, 2022, and was obtained through the Kaggle data repository. A total of four smoothing models and four auto-regressive models
were used and compared in this research. The smoothing models had better RMSE
and MAPE scores, with HWES being the best-performing model. In summary, the
HWES model performed the best on the USA police deaths dataset by producing
less error compared to the other smoothing and auto-regressive models used in this
research.
D. Nagappan · M. Jayabalan
Liverpool John Moores University, Liverpool, United Kingdom
A. Alanezi
Saad Alabdullah Police Academy, Al Jahra, Kuwait
F. Nadi
School of Information Technology, UNITAR International University, Petaling Jaya, Malaysia
T. Coombs (B)
British American Tobacco, Southampton, United Kingdom
e-mail: [email protected]
1 Introduction
Recent years have witnessed a concerning increase in police fatalities in the USA,
primarily attributed to the heightened accessibility of firearms among the public
and the accompanying surge in gang-related violence [1–3]. This rising trend in
annual police casualties, combined with an uptick in gun-related incidents nation-
wide, presents a growing threat to public safety. This threat carries the potential for
severe economic and substance abuse ramifications [4–6].
The year 2021 marked a grim milestone, with the USA recording its highest annual
tally of police fatalities. Consequently, it has become imperative to develop predictive
models for estimating the number of police fatalities. Such models can serve as a
foundation for enhancing the protection of police officers at higher risk, augmenting
training for vulnerable officers, and improving the recruitment and training of new
officers [7–10].
Recent research has demonstrated the utility of time series data analysis in fore-
casting various phenomena, such as COVID-19 cases and identifying crime-prone
areas along with the timing of such events [7]. This methodological approach holds
promise for informing strategies to mitigate the risks associated with police fatalities
in the USA.
Time series forecasting methods are employed to predict future events using
historical data. Their primary purpose is to assist in making more informed deci-
sions. These forecasting techniques find wide application in various fields, including
predicting population trends, sales figures, inventory levels, workforce demand,
weather patterns, and disease outbreaks [8]. Visualizing time series data is often
done by creating graphs, which can reveal underlying patterns and trends within the
data.
To assess the performance of time series forecasting models, researchers
commonly employ error metrics such as the mean absolute percentage error (MAPE)
and root mean square error (RMSE) [7, 9, 10]. These metrics help researchers deter-
mine which model is the most accurate, as they quantify the level of error associated
with each model’s predictions [9, 10]. This information is vital for selecting the most
reliable forecasting model.
Predicting the number of police deaths is highly important to take necessary
precautions and reduce the impact of such deaths on the police force, the public,
and the government due to a decrease in police numbers. In a study conducted in a
Chinese city, a dataset from a police station was utilized, which included information
on robberies, thefts, and burglaries. The research concluded that the ARIMA model
provided more accurate forecasts compared to other models [11–13].
Another study by Cesario [11] focused on crime forecasting using the AR and
ARIMA models in Chicago from 2001 to 2014, achieving an 84% accuracy rate
[11]. Kumar (2018) collected crime data from Indian datasets and employed prepro-
cessing algorithms to make the data stationary. Subsequently, SES, HES, HWES,
and ARIMA models were used for prediction purposes [8]. The results indicated
that all models produced a 95% confidence interval.
2 Methodology
2.1 Dataset
The dataset employed in this study comprises information on police officers who lost
their lives while on duty in the USA. It includes data on their names, dates of death,
officer ranks, affiliated departments, the respective states of service, and the causes
of their demise. The dataset encompasses a total of 26,269 records spanning from
January 3, 1791, to June 3, 2022. We accessed this dataset from the Kaggle data
repository [14]. The utilization of this extensive time frame stems from our inten-
tion to leverage historical data for the purpose of forecasting future police officer
fatalities in the USA.
To ensure the suitability of the dataset for our research, we implemented several data
preprocessing steps. Initially, the dataset underwent filtration to exclude instances
related to K9 dog deaths, as our study exclusively focused on human fatalities [18, 19].
Subsequently, we excluded data concerning COVID-19-associated deaths among
police officers, as their inclusion introduced inaccuracies into the prediction of police
fatalities.
In the process of feature selection, we carefully considered the variables that would
be fed into our time series forecasting models. We retained two essential features: (1)
the combination of years and months in which police officers died, and (2) the number
of deaths occurring in a specific month of a given year. Any remaining variables that
did not align with our research objectives were omitted from the analysis.
Addressing missing values, particularly months containing zero values, required special attention. We replaced these zero values with the next lowest value, 1. This adjustment helped mitigate errors without significantly impacting the final forecast output.
Subsequently, we divided the dataset into training and testing sets. The training
dataset encompassed the initial 180 years, spanning from 1791 to 1970, while the
remaining data, covering the years from 1971 to 2022, constituted the test set. This
partitioning allowed us to evaluate the performance of our time series forecasting
models effectively. Figure 1 depicts the resulting time series after the preprocessing step was applied.
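A simplified sketch of this preparation with pandas is shown below; the file name, column names, and filter strings are assumptions about the Kaggle CSV rather than its documented schema.

```python
# Sketch: building the monthly police-fatality series and the train/test split.
# File name, column names, and filter values are assumptions about the Kaggle data.
import pandas as pd

df = pd.read_csv("police_deaths_usa.csv", parse_dates=["date"])                 # assumed schema
df = df[~df["rank"].str.contains("K9", case=False, na=False)]                   # drop K9 dog deaths
df = df[~df["cause_of_death"].str.contains("covid", case=False, na=False)]      # drop COVID-19 deaths

monthly = df.set_index("date").resample("MS").size().rename("deaths")
monthly = monthly.clip(lower=1)     # raise zero-count months to the next lowest value, 1

train = monthly[:"1970-12"]         # first 180 years, 1791-1970
test = monthly["1971-01":]          # 1971-2022
```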
Figure 2 shows the methodological approach of this study through a process flow diagram. Multiple methodologies available in time series analysis are used to predict future values and to compare which of these models performs better on this data. The models used in this study are SES, HES, HWES additive, HWES multiplicative, AR, MA, ARMA, ARIMA, and SARIMA. The simple exponential smoothing (SES) technique is used for data that has no seasonality or trend, only a level. Holt’s exponential smoothing (HES) model is used in this study to check whether the model performs well with only the trend and level of our police deaths data. The Holt-Winters exponential smoothing (HWES) model is used in our research to predict based on the trend, level, and seasonality components of the series; it gave satisfactory prediction values in other studies, so we assume that the HWES model might perform well on our data and yield good forecast values. The auto-regressive (AR) prediction model uses past values to find future values; the dependent value is found using the independent values. In the moving average (MA) model, the future is predicted by considering the residuals of past prediction errors. ARMA consists of both AR and MA components and is therefore called the auto-regressive moving average model; this research uses it to check whether performance is adequate without the integrated component on our police dataset. The seasonal auto-regressive integrated moving average (SARIMA) model captures seasonality as well. SARIMA has two sets of elements, non-seasonal and seasonal: the non-seasonal terms are used to make the data stationary, and the seasonal terms apply the differencing over the season in the data.
In the initial step of our analysis, we subjected the data used in the auto-regressive
models to an Augmented Dickey-Fuller (ADF) test to ascertain its stationarity. The
obtained p-value of 0.58 suggested non-stationarity, as it exceeded the conventional
threshold of 0.05. To rectify this issue and ensure stationarity, we performed a Box-
Cox transformation using the SciPy library. Subsequently, the transformed stationary
data was utilized to generate autocorrelation function (ACF), as shown in Fig. 3, and
partial autocorrelation function (PACF), as shown in Fig. 4, plots via the statsmodels
library, helping us determine the appropriate lag order.
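The stationarity check and transformation just described can be sketched as follows with statsmodels and SciPy; the series name follows the preprocessing sketch above and the number of lags plotted is illustrative.

```python
# Sketch: Augmented Dickey-Fuller test, Box-Cox transformation, and ACF/PACF plots.
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

adf_stat, p_value, *rest = adfuller(train)       # p-value around 0.58 -> non-stationary
print(f"ADF statistic={adf_stat:.3f}, p-value={p_value:.2f}")

transformed, lam = stats.boxcox(train)           # Box-Cox needs strictly positive values

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
plot_acf(transformed, lags=24, ax=axes[0])       # guides the MA lag order
plot_pacf(transformed, lags=24, ax=axes[1])      # guides the AR lag order
plt.show()
```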
In the next stage of our analysis, we employed the root mean square error (RMSE)
and mean absolute percentage error (MAPE) evaluation metrics to fine-tune various
hyperparameters for forecasting time series data. These metrics played a pivotal role
in enhancing the accuracy and effectiveness of our forecasting model.
3 Results
The results obtained from each model will be compared to find which model produced the lowest error value and performed best on the US police deaths dataset.
Simple exponential smoothing (SES) model. The initial analysis of police deaths
data involved employing a straightforward exponential smoothing model, which
yielded RMSE and MAPE values of 6.01 and 43.63, respectively, with a smoothing
level set at 0.5. Subsequently, adjusting the smoothing level to 0.7 resulted in revised
RMSE and MAPE values of 5.81 and 41.83, respectively. Given that a smoothing
level of 0.7 produced the most favorable error values, this parameter was chosen as
the optimal setting for further analysis.
Holt’s exponential smoothing model. By applying a smoothing level of 0.7 from
the SES model and subsequently modifying the smoothing slope to 0.5 using Holt’s
smoothing method, we obtained RMSE and MAPE values of 367.75 and 2475.08,
respectively. Further adjustment of the smoothing slope to 0.1 resulted in RMSE and
MAPE values of 36.52 and 234.55, respectively. Finally, when the smoothing slope
was set to 0.06, the RMSE and MAPE values reached 4.68 and 27.24. Notably, this
method outperformed the SES model in terms of its ability to identify the underlying
level and decreasing trend within the dataset.
Holt-Winters exponential smoothing additive model. The initial selection
of smoothing parameters for the models, namely a smoothing level of 0.7 and
a smoothing slope value of 0.06, was accompanied by the choice of a seasonal
smoothing value of 0.2, resulting in respective RMSE and MAPE values of 22.57
and 140.80. Maintaining all other variables constant while adjusting the smoothing
slope value to 0.055 led to improved model performance, yielding RMSE and MAPE
values of 18.38 and 111.89. Subsequently, a modification in the seasonal smoothing
parameter from 0.2 to 0.01 resulted in even more favorable RMSE and MAPE values,
amounting to 6.53 and 32.77. Further refinement by reducing the seasonal smoothing
to 0.0025 yielded the most optimal RMSE and MAPE values of 4.62 and 28.13, and
the model successfully identified the presence of level, decreasing trend, and season-
ality in the dataset, as illustrated in Fig. 5. It is noteworthy that the magnitude of the
seasonal pattern in the data did not exhibit a direct correlation with the series value.
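For reference, a sketch of fitting this best-performing additive configuration with statsmodels is given below; seasonal_periods=12 and the error helpers are assumptions, and older statsmodels releases name the trend parameter smoothing_slope rather than smoothing_trend.

```python
# Sketch: Holt-Winters additive model with the tuned smoothing parameters,
# evaluated with RMSE and MAPE on the held-out 1971-2022 period.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

hwes = ExponentialSmoothing(train, trend="add", seasonal="add", seasonal_periods=12)
fit = hwes.fit(smoothing_level=0.7, smoothing_trend=0.055, smoothing_seasonal=0.0025)

forecast = fit.forecast(len(test))
rmse = float(np.sqrt(np.mean((test.values - forecast.values) ** 2)))
mape = float(np.mean(np.abs((test.values - forecast.values) / test.values)) * 100)
print(f"RMSE={rmse:.2f}, MAPE={mape:.2f}")
```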
Holt-Winters exponential smoothing multiplicative model. For the final
smoothing method, smoothing level (0.7), smoothing slope (0.055) and smoothing
seasonal (0.3) values were initially used and yielded RMSE and MAPE values of
25.16 and 156.17, respectively. Then, seasonal smoothing value was changed from
0.3 to 0.0025, resulting in RMSE and MAPE values of 8.24 and 42.71. Finally, the seasonal
smoothing value again was adjusted to 0.001 and yielded RMSE and MAPE values
of 4.83 and 26.89. Although this did achieve a lower error rate, the HWES additive
model performed better as the seasonal pattern data did not increase with the increase
in data values and did not decrease with a decrease in the data values.
Simple auto-regressive model. The first auto-regressive model used is the simple AR method. As mentioned in the methodology (2.4 Model Implementation and Hyperparameter Tuning), when using the ACF and PACF we obtained the lag order as 1, and
therefore, we will keep the order as 1, 0, 0. This achieved RMSE and MAPE values
of 15.03 and 107.63, respectively. The AR model did not perform well on this dataset
due to failing to capture the downward trend of the test dataset.
Moving average model. By maintaining the lag order at 1 (model order 0, 0,
1), we obtained RMSE and MAPE values of 22.80 and 165.12, respectively. The
moving average model again like the auto-regressive model did not perform well on
the police deaths dataset as it failed to capture the downward trend of the test dataset.
Auto-regressive moving average model. The lag order is kept the same (1) for the auto-regressive moving average model, with an order of 1, 0, 1 as listed in Table 1. For the ARMA
model, these specific value inputs yielded RMSE and MAPE values of 23.20 and
168.19 and did not perform well due to it, again, failing to capture the decreasing
trend of the test dataset.
Auto-regressive integrated moving average model. Using the same lag order of
1, we got the order of 1, 1, 1 for the ARIMA model, which when these values were
given as inputs, we achieved RMSE and MAPE values of 23.20 and 168.19. This
model performed the same as ARMA and again did not perform well on the police
deaths dataset as the downtrend of the test dataset failed to be captured.
Seasonal auto-regressive integrated moving average methods. The final auto-
regressive model used was SARIMA which the same lag order of 1 was used as
input to this model (model order: 1, 1, 1 and seasonal order: 1, 1, 1, 12). This yielded
RMSE and MAPE values of 15.00 and 106.52 and performed better than the other
auto-regressive models as it captured seasonality. However, the performance was not
satisfactory when compared to the smoothing models as it failed to capture the police
death dataset trend.
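A corresponding sketch for the SARIMA configuration reported here (order (1, 1, 1), seasonal order (1, 1, 1, 12)) using statsmodels' SARIMAX is shown below; as before, the train and test series come from the earlier preprocessing sketch.

```python
# Sketch: SARIMA(1, 1, 1)(1, 1, 1, 12) fit and forecast with statsmodels SARIMAX.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

sarima = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = sarima.fit(disp=False)

pred = result.forecast(steps=len(test))
rmse = float(np.sqrt(np.mean((test.values - pred.values) ** 2)))
mape = float(np.mean(np.abs((test.values - pred.values) / test.values)) * 100)
print(f"RMSE={rmse:.2f}, MAPE={mape:.2f}")
```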
The performance of the HWES additive model on the police deaths dataset is demon-
strated in Table 1, where it effectively captures the level, trend, and seasonality in
police fatalities. The selection of this model was based on its competitive perfor-
mance in comparison to other research models, as indicated by its RMSE and MAPE
values of 4.62 and 28.13, respectively [15, 16].
Table 1 RMSE and MAPE values of different time series forecasting models
Method Smoothing level Smoothing slope Smoothing seasonal Order RMSE MAPE
SES 0.5 – – – 6.01 43.63
SES 0.9 – – – 5.9 42.67
SES 0.7 – – – 5.81 41.83
HES 0.7 0.5 – – 367.75 2475.08
HES 0.7 0.1 – – 36.52 234.55
HES 0.7 0.06 – – 4.68 27.24
HWES additive 0.7 0.06 0.2 – 22.57 140.80
HWES additive 0.7 0.055 0.2 – 18.38 111.89
HWES additive 0.7 0.055 0.01 – 6.53 32.77
HWES additive 0.7 0.055 0.0025 – 4.62 28.13
HWES multiplicative 0.7 0.055 0.3 – 25.16 156.17
HWES multiplicative 0.7 0.055 0.0025 – 8.24 42.71
HWES multiplicative 0.7 0.055 0.001 – 4.83 26.89
AR – – – (1, 0, 0) 15.03 107.63
MA – – – (0, 0, 1) 22.80 165.12
ARMA – – – (1, 0, 1) 23.20 168.19
ARIMA – – – (1, 1, 1) 23.20 168.19
SARIMA – – – (1, 1, 1) 15.00 106.52
4 Conclusions
Table 2 Comparison of evaluation measures across different time series prediction models in literature
Previous literature Model used Evaluation measure
Deethong and Boonnam (2022) [21] HWES R-squared = 0.723, MAPE = 168.135
Jayaraman et al. (2022) [22] SARIMA RMSE = 18.758, MAPE = 8.455
Chordia and Pawar (2021) [23] ARIMA RMSE = 1180.22, MAPE = 1.184
Cesario et al. (2016) [11] ARIMA RMSE = 28.86, MAPE = 0.16
Chen et al. (2008) [12] ARIMA RMSE = 56.94, MAPE = 9.48
Kamarudin et al. (2021) [20] Multilayer Perceptron Neural Network RMSE = 40.08, MAPE = 20.92
Daryl et al. (2021) [24] SARIMA RMSE = 112.21, MAPE = 36.05
Goswami and Kandali (2020) [25] SARIMA RMSE = 200, MAPE = 10.7
Dutta et al. (2020) [26] SARIMA RMSE = 18.08
This study HWES additive RMSE = 4.62, MAPE = 28.13
Acknowledgements The authors would like to express their gratitude to UNITAR International University for supporting the publication of this research.
References
1. Mustard DB (2001) The impact of gun laws on police deaths. J Law and Econom 44(S2):635–
657
2. Krieger N, Chen JT, Waterman PD, Kiang MV, Feldman J (2015) Police killings and police
deaths are public health data and can be counted. PLoS Med 12(12):e1001915
3. Edwards F, Esposito MH, Lee H (2018) Risk of police-involved death by race/ethnicity and
place, United States, 2012–2018. Am J Public Health 108(9):1241–1248
4. Jennings JT, Rubado ME (2017) Preventing the use of deadly force: the relationship between
police agency policies and rates of officer-involved gun deaths. Public Adm Rev 77(2):217–226
5. Anon (2022) US police deaths on the rise - BBC. [online] BBC. Available at: https://www.bbc.
com/news/world-us-canada-61218611. Accessed 25 Aug 2023
6. Coombs T, Abdelkader A, Ginige T, Van Calster P, Assi S (2023) Understanding synthetic
drug analogues among the homeless population from the perspectives of the public: thematic
analysis of Twitter data. J Substance Use 11:1–8
7. Cihan P (2021) ARIMA-based forecasting of total COVID-19 cases in the USA and India. In:
2021 29th signal processing and communications applications conference (SIU), pp 1–4
8. Kumar M, Athulya S, Minu MMB, Vinodini VMD, Lakshmi AKG, Anjana S, Manojkumar TK
(2018) Forecasting of annual crime rate in India: a case study. In: 2018 International conference
on advances in computing, communications and informatics (ICACCI), pp 2087–2092
9. Yadav R, Sheoran SK (2018) Crime prediction using auto regression techniques for time series
data. In: 2018 3rd international conference and workshops on recent advances and innovations
in engineering (ICRAIE), pp 1–5
10. Hartono AP, Luhur CR, Qomariyah NN (2022) Forecasting vaccination growth for COVID-19
using machine learning. In: 2022 5th international conference on computing and informatics
(ICCI), pp 356–363
11. Cesario E, Catlett C, Talia D (2016) Forecasting crimes using autoregressive models. In: 2016
IEEE 14th Intl conference on dependable, autonomic and secure computing, 14th Interna-
tional conference on pervasive intelligence and computing, 2nd international conference on big
data intelligence and computing and cyber science and technology congress (DASC/PiCom/
DataCom/CyberSciTech), pp 795–802
12. Chen P, Yuan H, Shu X (2008) Forecasting crime using the ARIMA model. In: 2008 Fifth
international conference on fuzzy systems and knowledge discovery, pp 627–630
13. Abdurrahman M, Irawan B, Latuconsina R (2017) Flood forecasting using holt-winters
exponential smoothing method and geographic information system. In: 2017 International
conference on control, electronics, renewable energy and communications (ICCREC), pp
159–163
14. Koli M (2023) Police deaths in USA from 1791 to 2022: USA law enforcement line of duty
deaths, Version 1. Retrieved 1 Sept 2023 from https://www.kaggle.com/datasets/mayureshk
oli/police-deaths-in-usa-from-1791-to-2022
15. Saha AS, Haque M, Alam MGR (2021) Covid-19 confirmed-cases prediction in SAARC coun-
tries through machine learning. In: 2021 Third international conference on inventive research
in computing applications (ICIRCA), pp 889–894
16. Shimizu S, Shin S (2021) Applicability of SARIMA model in Tokyo population migration forecast. In: 2021 14th international conference on human system interaction (HSI), pp 1–4
17. Shukla S, Ramaprasad R, Pasari S, Sheoran S (2022) Statistical analysis and forecasting of
wind speed. In: 2022 4th international conference on energy, power and environment (ICEPE),
pp 1–6
18. Alloghani M, Al-Jumeily D, Hussain A, Mustafina J, Baker T, Aljaaf AJ (2020) Implementation
of machine learning and data mining to improve cybersecurity and limit vulnerabilities to cyber
attacks. Nature-inspired computation in data mining and machine learning, pp 47–76
19. Abbas G, Tanveer M, Abbas ZH, Waqas M, Baker T, Al-Jumeily OBED (2021) A secure
remote user authentication scheme for 6LoWPAN-based Internet of Things. PLoS ONE
16(11):e0258279
20. Kamarudin KA, Wan Ismail WA, Harymawan I, Shafie R (2021) Types of politically connected firms and analysts' earnings forecast. J Appl Acc Res 22(5):883–913
21. Deethong T, Boonnam N (2022) Forecasting analysis of the durian yield trends in southern Thailand using Holt-Winters exponential smoothing method and Box-Jenkins' techniques. In: 2022 Joint international conference on digital arts, media and technology with ECTI northern section conference on electrical, electronics, computer and telecommunications engineering (ECTI DAMT & NCON), pp 29–32. https://doi.org/10.1109/ECTIDAMTNCON53731.2022.9720330
22. Jayaraman V, Parthasarathy S, Lakshminarayanan AR (2022) Forecasting the emission of greenhouse gases from the waste using SARIMA model. In: 2022 6th international conference on trends in electronics and informatics (ICOEI), pp 99–106
23. Chordia S, Pawar Y (2021) Analyzing and forecasting COVID-19 outbreak in India. In: 2021 11th international conference on cloud computing, data science & engineering (Confluence), pp 1059–1066
24. Winata DA, Kumara S, Suhartono D (2021) Predicting stock market prices using time series SARIMA. In: 2021 1st international conference on computer science and artificial intelligence (ICCSAI), pp 92–99. https://doi.org/10.1109/ICCSAI53272.2021.9609720
25. Goswami K, Kandali AB (2020) Electricity demand prediction using data driven forecasting scheme: ARIMA and SARIMA for real-time load data of Assam. In: 2020 international conference on computational performance evaluation (ComPE), pp 570–574
26. Dutta B, Barman MP, Patowary AN (2020) Application of ARIMA model for forecasting road accident deaths in India. Int J Agric Stat Sci 16(2):607–615
Multimodal Emotion Recognition Using
Attention-Based Model with Language,
Audio, and Video Modalities
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 193
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_15
retrieved from RoBERTa with other modality-based features. A model that improves on BERT is introduced as part of this work and helps to boost performance.
1 Introduction
2 Background
the best accuracy. The outputs from the video and audio paths were fused using the k-Nearest Neighbor (k-NN) or artificial neural network (ANN) algorithm to classify combinations of features.
Xu et al. [9] provided a model based on head fusion, which builds on the multi-head attention mechanism, in order to make SER more accurate. It utilized a model known as an attention-based convolutional neural network (ACNN).
To solve this problem, research papers study bimodal and trimodal correlations for data fusion by moving from unimodal vectors to bimodal and then trimodal vectors using deep neural networks. The Multimodal Transformer (MulT) is a way to deal with both problems in a general, end-to-end way without explicitly aligning the data [3]: (1) data that do not match up because each modality's sequences are sampled at different rates, and (2) long-term dependencies between elements from different modalities. At the heart of this model is directional pairwise cross-modal attention, which attends to interactions between multimodal sequences across different time steps and latently adapts streams from one modality to another.
Siriwardhana et al. [1] proposed a Transformer-based self-supervised learning (SSL) method for multimodal emotion recognition. Pretrained SSL models for text, audio, and faces, namely RoBERTa [4], Wav2Vec [10], and Fabnet [11], respectively, were used to extract embeddings. A special CLS token, which can sum up the sequence's data, was prepended to the Wav2Vec (A) and Fabnet (V) embeddings before applying self-attention. After modifying the embeddings, the audio, video, and text embeddings are fed into six IMA transformer blocks to capture cross-modal data. The Hadamard product of the CLS tokens of the same core modality was used to retain the most relevant data.
Wang T, Hou Y, Zhou D, and Zhang Q proposed a contextual-attention-based multimodal neural network framework that uses conversational information from both the target speaker and the other speaker to determine how someone feels from what they say [12].
Lee et al. [2] presented a new multimodal approach to emotion recognition that improves the BERT model by combining it with different features from the language, audio, and visual modalities. In particular, they improve the BERT model because the audio and visual modalities differ from text in many ways. They introduce the self-multi-attention fusion module, the multi-attention fusion module, and the video fusion module, which are attention-based multimodal fusion mechanisms that use the recently proposed transformer architecture, and they look into the best ways to combine fine-grained representations of audio and visual features into a single embedding while incorporating a pretrained BERT model.
Our research considers acoustic, visual, and textual cues simultaneously to
improve emotion recognition performance over unimodal approaches, as the trimodal
emotion recognition model shows high performance and robustness by comprehen-
sively considering multiple modalities. Lastly, we applied the attention mechanism’s
capacity to incorporate multimodal aspects as the attention mechanism has recently
drawn some interest in the field due to its usefulness. Additionally, transfer learning
methods that extract features from pretrained networks have made substantial recent
advancements.
3 Methodology
In this paper, we propose the model shown in Fig. 1. It takes raw video and breaks it down into "utterance units" to obtain audio, video, and text information. Toolkits and pretrained models are used to extract deep-learning features from the audio and visual modalities. In the self-multi-attention fusion module, the different visual and audio features are linked, and a high-level representation is taken out of the fusion process. RoBERTa is used to extract the text representations. The multi-head attention fusion module finally predicts the person's emotional state.
3.1 Dataset
Our tests are conducted on the public IEMOCAP, CMU-MOSI, and CMUMOSEI
datasets, all of which represent the current state of the art in benchmarking video
sentiment analysis.
For multimodal sentiment and emotion recognition, CMU-MOSI includes 3702 brief speech video snippets taken from YouTube movie reviews. Each statement is labeled on a sentiment scale from −3 to +3, where −3 denotes an exceedingly negative mood and +3 denotes an exceptionally pleasant one.
The CMU-MOSEI dataset is also drawn from YouTube, like CMU-MOSI, except it is much bigger. CMU-MOSEI has 23,453 video movie reviews, making it the world's largest multimodal analysis dataset. Additionally, there are 22,856 annotated statements in that data. Like the CMU-MOSI dataset, CMU-MOSEI contains sentiment scores ranging from −3 to +3.
The IEMOCAP dataset comprises 12 h of data from ten actors across five sessions. In this study, we focused on four distinct emotions: joy, sadness, rage, and apathy.
In our method, visual, audio, and textual features are used to improve the performance
of emotion recognition. We experiment with the CMU-MOSI, CMU-MOSEI, and
IEMOCAP Database.
First, we used the concatenation method to combine the different kinds of features
(handcraft and bottleneck features) that were taken from the audio and video data.
It uses a relative positional encoding method to let states be reused without confusing the timing between frames. Because audio and visual features have different sizes than text features, we changed both the Fabnet (V) and Wav2Vec (A) embeddings by appending a trainable vector termed CLS and then applying self-attention to each embedding sequence. The self-attention fusion module figures out how strongly different parts of a single modality are related and pulls out a sympathetic representation.
We use multi-head attention to combine the attended parts of each modality. In the multi-head mechanism, scaled dot-product attention is performed multiple times in parallel. We exploit not only the context information between each modality's features but also the textual representation from RoBERTa, together with two attention mechanisms (self-multi-attention fusion and multi-attention fusion) that combine these features. Both self-multi-attention fusion and multi-attention fusion have a residual connection and layer normalization.
4 Analysis
For our experiments, we used four sets of data that were available to the public. All of these datasets include speech, video, and text. The proposed method has been compared with the best results published so far for each dataset. Accuracy, F1 score, mean absolute error (MAE), and correlation coefficient were the main evaluation measures. The IEMOCAP dataset was made up of common emotion categories like "happy", "angry", "sad", "neutral", and "excitement". The CMU-MOSI and CMU-MOSEI datasets have sentiment scores ranging from −3 to +3.
Initially, three pretrained models, RoBERTa, Wav2Vec, and Fabnet, were used to extract features from the raw data modalities. Dimensions and maximum lengths of training sequences vary by modality. Pretrained Wav2Vec and RoBERTa models are accessible from the Fairseq code base and were employed to extract the SSL audio and text features. Faces were cropped from video frames using the Retina-Face facial
recognition model to obtain video features. Every face-containing frame was then routed through the pretrained Fabnet model to acquire video modality characteristics. The features retrieved from these models possess a large embedding size and a lengthy sequence length, so we created a technique whereby a single embedding can represent the lengthy embedding sequence associated with each modality. We changed both the Fabnet embeddings (V) and Wav2Vec embeddings (A) by appending a trainable vector termed CLS and then applying self-attention to every embedding sequence in order to extract significant features from each modality. The symbols Q, K, V, and d_Q refer to the Query, Key, Value, and the dimensionality of the Query vector, respectively.
The use of the CLS token (which stands for classification) in the embedding sequences was influenced by how RoBERTa and BERT models describe entire sequences with a first, unique token termed CLS. Because the self-attention mechanism of BERT-based models is bidirectional (past and future), the first token of a sequence, the CLS token, encodes all the information to its right, that is, the future sequence. Consequently, the CLS token is used as a compressed representation to tackle classification problems such as sentiment analysis. In our model, we prepended CLS tokens to the embedding sequences of Wav2Vec and Fabnet because neither of these models has an architecture similar to BERT. Since RoBERTa is a BERT-based model, the text embedding sequences were left unchanged. Access to three CLS tokens representing the three modalities enabled us to compute attention efficiently and develop a straightforward late fusion mechanism.
V_embedding = Self-Attention([CLS]_V, Fabnet_seq),   (1)

A_embedding = Self-Attention([CLS]_A, Wav2Vec_seq),   (2)

Self-Attention(Q, K, V) = softmax(QKᵀ / √d_Q) V.   (3)
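To make Eqs. (1)–(3) concrete, a minimal sketch of prepending a trainable CLS vector to a modality's embedding sequence and applying single-head scaled dot-product self-attention is given below. PyTorch, the module name, and the dimensions are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (assumptions: PyTorch; illustrative dimensions and names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLSSelfAttention(nn.Module):
    """Prepends a trainable [CLS] vector and applies single-head self-attention (Eq. 3)."""

    def __init__(self, dim: int):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, dim))   # trainable [CLS] token
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, length, dim), e.g. Fabnet or Wav2Vec embedding sequences
        batch = seq.size(0)
        x = torch.cat([self.cls.expand(batch, -1, -1), seq], dim=1)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)   # QK^T / sqrt(d_Q)
        attended = F.softmax(scores, dim=-1) @ v                 # softmax(.) V
        return attended[:, 0]   # the attended [CLS] row summarizes the sequence
```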
The fundamental objective of IMA fusion layers was to facilitate the exchange of
pertinent data across modalities. IMA fusion layers were developed to incorporate
information from representations of various modalities into the representation of
one modality. The multi-head attention fusion layer functions similarly to the self-
attention fusion layer, with the exception that it generates a Query (Q) vector using
the CLS token of one modality and the Key (K)–Value (V) vectors from the embed-
ding sequences of other modalities. The inputs to the multi-head fusion layer are
three embedding sequences, with the first token of each embedding sequence being the CLS token. IMA attention is computed between the CLS token of one modality and the complete embedding sequence of another modality, as the CLS token of each modality summarizes the sequence's information. There are therefore six multi-head attention transformer blocks, and every transformer block's Q vector is computed using one modality's CLS token and another modality's whole embedding sequence.
Considering their similarities, the six CLS embeddings can be divided into three pairs, each pair derived from a single core modality. The Hadamard (element-wise) product is computed between the token pairings that share the same core modality; it is used to extract the mutual information contained in the two CLS representations of each pair.
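A rough sketch of one such cross-modal attention block and the Hadamard pairing of CLS representations, under the same PyTorch assumption, is shown below; the shapes and names are illustrative.

```python
# Minimal sketch (assumptions: PyTorch; illustrative embedding size and head count).
import torch
import torch.nn as nn

dim, heads = 256, 4
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

def ima_block(cls_token: torch.Tensor, other_seq: torch.Tensor) -> torch.Tensor:
    """Query = CLS of one modality; Key/Value = full embedding sequence of another modality."""
    attended, _ = cross_attn(query=cls_token.unsqueeze(1), key=other_seq, value=other_seq)
    return attended.squeeze(1)

# Six such blocks yield six CLS embeddings; pairs that share the same core modality are
# fused with a Hadamard (element-wise) product before the final prediction layer.
def hadamard_fuse(cls_a: torch.Tensor, cls_b: torch.Tensor) -> torch.Tensor:
    return cls_a * cls_b
```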
• Accuracy is the proportion of correct predictions over total predictions.
• Precision is the fraction of relevant instances among the retrieved instances.
• Recall is the fraction of relevant instances that were retrieved.
• The F1 score is defined as the harmonic mean of recall and precision. Measure of
a model’s accuracy on a dataset.
• MAE (mean absolute error) represents the difference between the original and predicted values, obtained by averaging the absolute differences over the dataset.
• True Positive (TP)—predicted positive and it is positive.
• True Negative (TN)—predicted negative and it is negative.
• False Positive (FP)—predicted positive and it is negative.
• False Negative (FN)—predicted negative and it is positive (Fig. 2).
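A small sketch of how these measures can be computed with scikit-learn is given below; the arrays are tiny placeholders rather than results from the paper.

```python
# Minimal sketch (assumptions: scikit-learn; placeholder arrays instead of real predictions).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_absolute_error

y_true, y_pred = [1, 0, 1, 1, 0], [1, 0, 0, 1, 1]               # placeholder emotion labels
acc  = accuracy_score(y_true, y_pred)                            # correct predictions / total predictions
prec = precision_score(y_true, y_pred)                           # TP / (TP + FP)
rec  = recall_score(y_true, y_pred)                              # TP / (TP + FN)
f1   = f1_score(y_true, y_pred)                                  # harmonic mean of precision and recall
mae  = mean_absolute_error([2.5, -1.0, 0.0], [2.0, -1.5, 0.5])   # mean |actual - predicted| for sentiment scores
```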
Fig. 3 Accuracy comparison between our model and state of the art for CMU-MOSI dataset
We carried out a series of ablation studies that employ the CMU-MOSEI dataset to comprehend the effect of various constituents of the proposed fusion mechanism. CMU-MOSEI was used because it possesses the greatest number of training examples relative to the other datasets. Primarily, three types of ablation studies were performed, characterized as follows:
• Ablation of the inputs of the text, speech, and video modalities.
• Ablation of the application of multi-head attention.
• A study on the utilization of the Hadamard product.
Table 1 depicts the results of the ablation studies. The letters A, V, and T stand for audio, video, and text, respectively. Among the unimodal inputs, T shows the highest performance. Text and video used together in the bimodal T + V setting also show high performance. Comparing the trimodal results with our final model shows how well Hadamard-product-based information extraction works. Finally, the results show that using the Hadamard computation improves the accuracy of seven-class and binary sentiment by about 3% while reducing the number of trainable parameters. This is because combining the six vectors directly adds three times as many parameters to the final prediction layer as
Fig. 4 F1 score comparison between our model and state of the art for CMU-MOSI dataset
Fig. 5 Accuracy comparison between our model and state of the art for CMUMOSEI dataset
Fig. 6 F1 score comparison between our model and state of the art for CMUMOSEI dataset
using the Hadamard computation. This result also shows that combining the text modality helps to better recognize emotions.
Table 2 shows that our model for the CMU-MOSI dataset does better than the current best methods in most respects. Experimental results on the CMU-MOSI dataset show that the proposed HFU-BERT model improves performance by 0.53 and 0.21 over the lowest state-of-the-art performance on Acc-7 and Acc-2, respectively. But there is still a difference between our model and [2], which shows that there is room for further improvement.
For the CMU-MOSEI dataset, compared with various state-of-the-art models, this model does better in F1 score, MAE, Acc-2, and Corr. For IEMOCAP, the model demonstrates a better F1 score and accuracy for each emotion compared with earlier studies.
To sum up, our research proposed a model that jointly employs a pretrained BERT model for MMLs and heterogeneous feature unification for video and audio for the purpose of multimodal emotion recognition. The proposed model combined auditory and visual modalities into different features, and RoBERTa was used to fine-tune the model. The proposed method beat the state of the art on three difficult benchmarks: CMU-MOSI, CMU-MOSEI, and the IEMOCAP dataset. The higher computation necessitated by the additional trainable weights and hyperparameters represented a possible constraint of our proposed model.
Fig. 7 Accuracy comparison between our model and state of the art for IEMOCAP dataset
Fig. 8 F1 score comparison between our model and state of the art for IEMOCAP dataset
6 Conclusion
multi-head attention to extract significant features from various modalities and fuse
data across modalities, respectively.
Our research proposed a model that jointly employs a pretrained BERT model for MMLs and heterogeneous feature unification for video and audio for the purpose of multimodal emotion recognition. The proposed model combined auditory and visual modalities into different features, and RoBERTa was used to fine-tune the model. The proposed method beat the state of the art on three difficult benchmarks: CMU-MOSI, CMU-MOSEI, and the IEMOCAP dataset. The higher computation necessitated by the additional trainable weights and hyperparameters represented a possible constraint of our proposed model.
Automatic emotion detection systems are becoming an increasingly essential component of human–computer interaction as the field of emotion identification becomes increasingly important. Nevertheless, multimodal emotion detection is complicated by the fact that there is insufficient data. Transfer learning is a method that we recommend using to solve this problem. Transfer learning takes
advantage of pretrained models, such as BERT, and attention-based mechanisms,
such as self-attention and multi-head attention, to extract relevant features from
multiple modalities and fuse data across modalities, respectively.
This study focused on video, audio, and text; however, we aim to investigate addi-
tional modalities in future studies. Independently trained models for each modality
were used together with pretrained models; however, current research reveals that certain algorithms can jointly learn text and video knowledge for tasks such as video question answering.
References
8. Song KS, Nho YH, Seo JH, Kwon DS (2018) Decision-level fusion method for emotion recog-
nition using multimodal emotion recognition information. In: 15th International conference on
ubiquitous robots, UR 2018, pp 472–476
9. Xu M, Zhang F, Zhang W (2021) Head fusion: improving the accuracy and robustness of speech
emotion recognition on the IEMOCAP and RAVDESS dataset. IEEE Access 9:74539–74549
10. Zhao Z, Wang Y, Wang Y (2022) Multi-level fusion of Wav2vec 2.0 and BERT for multimodal
emotion recognition
11. Wiles O, Sophia Koepke A, Zisserman A (2018) Self-supervised learning of a facial attribute
embedding from video. In: British Machine Vision Conference 2018, BMVC
12. Wang T, Hou Y, Zhou D, Zhang Q (2021) A contextual attention network for multimodal
emotion recognition in conversation. In: Proceedings of the international joint conference on
neural networks, 2021-July
13. Pham H, Liang PP, Manzini T, Morency LP, Póczos B (2019) Found in translation: Learning
robust joint representations by cyclic translations between modalities. In: 33rd AAAI confer-
ence on artificial intelligence, AAAI 2019, 31st innovative applications of artificial intelligence
conference, IAAI 2019 and the 9th AAAI symposium on educational advances in artificial
intelligence, EAAI 2019, Shaffer 2018, pp 6892–6899
Comparative Analysis of Emotion
Recognition Using Large Language
Models and Conventional Machine
Learning
Abstract Emotions are significant aspects of human existence and influence inter-
action between individuals and groups, influencing how we think and behave. In this
research, we aim to use conventional and neural network models to identify emotions
from textual data and compare which performed best. The Go Emotions dataset
contained 27 different emotions across 58,000 samples. The approach involves
modelling the conventional machine learning models and the neural network-based
models and comparing the results over test dataset and choosing the best model.
Upon comparing the classification reports for the conventional and neural network-
based models on the Ekman taxonomy, conventional machine learning algorithms
were outperformed by neural network-based models which gained almost 10% more
than conventional models. Conventional models averaged the values around 50%
of macro-average F1-score except for the KNN classifier which performed poorly
getting the macro-average F1-score of 21%. BERT classifier with Ekman taxonomy
including neutral emotion had a macro-average precision of 55% and a sensitivity
of 68%. This classifier also achieved a macro-average F1-score of 61%.
While the RoBERTa classifier had a macro-average precision of 65%, the recall, or
sensitivity, was found to be 53%. This study clearly states that neural network-based
models outperformed conventional models. Our study proposed BERT model which
achieved a macro-average F1-score of 0.50 across Go Emotion taxonomy.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 211
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_16
1 Introduction
2 Method
2.1 Dataset
The data used in this study were obtained from Google and comprised of Reddit
comments manually annotated by humans and classified with 27 different emotion
categories posted between January 2005 and January 2019. Emotion taxonomy
included four ambiguous emotions, 11 negative emotions, 12 positive emotions
and 1 neutral emotion categories. The dataset, known as the Go Emotion dataset,
consisted of almost 58,000 comments from Reddit users. The goal was to gather a
sizable dataset with a concentration on conversational data, where emotion is a key
component of communication. The Reddit platform offers a sizable, publicly acces-
sible volume of content that includes direct user-to-user communication, making it
a significant source for emotion research. Reddit comments from subreddits with at least 10,000 comments, posted from 2005 through January 2019, were used to build Go Emotions, with deleted and non-English comments being removed [20].
In the Go Emotion taxonomy, the data were labelled using indices of each emotion in text format. Encoding the emotions was done by converting the index strings into lists of indices using a comma separator, mapping each index to the actual emotion label, and applying one-hot encoding. After encoding, text cleaning was done in the following steps: (1) emojis are converted into text, (2) contractions are expanded, (3) acronyms and typing mistakes are fixed, (4) words with repeated letters are normalized, (5) special characters and numbers are replaced with spaces, and finally (6) double spaces are eliminated. Figure 1 shows the data pre-processing used for this dataset.
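A minimal sketch of the label encoding and cleaning steps described above is shown below. The column names and the emoji and contraction helpers are assumptions; the paper does not name the libraries it used, and the acronym/typo fixes of step (3) would require a custom dictionary.

```python
# Minimal sketch (assumptions: a pandas DataFrame with a comma-separated 'labels' column,
# and the third-party 'emoji' and 'contractions' packages).
import re
import emoji
import contractions
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

def encode_labels(df: pd.DataFrame, n_labels: int = 28):
    # "3,27" -> [3, 27] -> one-hot row over the 27 emotions plus neutral
    index_lists = df["labels"].astype(str).str.split(",").apply(lambda xs: [int(x) for x in xs])
    return MultiLabelBinarizer(classes=list(range(n_labels))).fit_transform(index_lists)

def clean_text(text: str) -> str:
    text = emoji.demojize(text)                    # (1) emojis -> text
    text = contractions.fix(text)                  # (2) expand contractions
    # (3) acronym and typo fixes omitted: they need a hand-built dictionary
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)     # (4) squeeze repeated letters ("soooo" -> "soo")
    text = re.sub(r"[^A-Za-z\s]", " ", text)       # (5) replace special characters and numbers with spaces
    return re.sub(r"\s{2,}", " ", text).strip()    # (6) remove double spaces
```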
Data analysis consisted of exploratory and text analysis. For the exploratory analysis, the number of emotions used to annotate each sample and the number of samples per emotion were examined. Word analysis was used to visualise the top 10 words that occur most frequently for each emotion.
Table 1 shows that the proportions of emotions in train, validation, and test
datasets are similar. This is essential in order to avoid adding an additional bias
when evaluating predictive models.
Stochastic Gradient Descent (SGD). The performance evaluation shows that Ekman
taxonomy with and without neutral emotion resulted in the highest F1-macro-average
score of 57 and 50%, respectively. Go Emotion and Ekman taxonomy (with and
without neutral emotion) produced the highest precision macro score of 65, 64,
and 64%, respectively. Moreover, Ekman taxonomy obtained the best recall macro-
average score of 53 and 47%.
Support Vector Machines (SVM). Again, the Ekman taxonomy with and without
neutral emotion resulted in the highest F1-macro-average score of 57 and 50%,
respectively. Go Emotion and Ekman taxonomy with and without neutral emotion achieved precision macro-average scores of 65, 64, and 64%, respectively. Ekman taxonomy without neutral emotion achieved the highest recall macro-average score of 53%.
K-Nearest Neighbours (KNN). Ekman taxonomy without neutral emotion achieved
the highest F1-macro-average score of 46%, highest precision macro-average score of
60%, and highest recall score of 41%. However, KNN achieved lower F1, precision,
and recall macro-average scores compared to the two previous models.
Decision Tree. The decision tree model for Ekman taxonomy without neutral emotion achieved F1, precision, and recall macro-average scores of 48, 60, and 47%, respectively.
72% from 70%. The precision and recall scores were 62 and 66% and did not improve
from the BERT model.
ELECTRA. Finally, the ELECTRA model with Ekman taxonomy with and without
neutral emotion performances were evaluated. The best performing model was
ELECTRA with Ekman taxonomy without neutral emotion with F1, precision, and
recall scores of 66, 68, and 67%, respectively. The ELECTRA technique did not
perform better than the BERT technique (Table 5).
Table 6 Comparison of previous research with this research on the Go Emotion dataset

Previous literature | Predictive model | Evaluation measure
Kamath et al., 2022 | RoBERTa | F1-score = 47
Singh et al., 2021 | BERT + CDP | F1-score = 52.34
Demszky et al., 2020 | BERT | F1-score = 46
This research | BERT | F1-score = 50
4 Conclusions
This study was conducted using the Go Emotions dataset, which consists of 58 k Reddit comments that have been manually annotated by humans and classified with 27 different emotion categories. There were three variables in the dataset: the text, the emotion label index, and the ID. Data pre-processing was done to correct spelling mistakes, remove unwanted tokens such as stop words and special symbols, and convert the data into clean text. Then, exploratory data analysis was
performed, and the results were interpreted to understand the number of emotions
per sample. Here, cardinality varies from 1 to 5 emotions per sample, with a
large majority of samples that have only 1 or 2 emotions assigned. In summary,
the BERT neural network algorithm achieved the best accuracy and efficiency at
predicting emotion from the Go Emotion dataset. The number of class samples per
emotion was also explored to understand the class balancing of the train dataset
and showed that the data were skewed due to 30% being labelled as neutral. Upon
comparing the classification reports for the conventional and neural network-based
models on the Ekman taxonomy, conventional machine learning algorithms like
Stochastic Gradient Descent, Support Vector Machines, decision tree, and random
forest performed well, while the neural network-based models like BERT, RoBERTa, and ELECTRA performed exceptionally well, gaining almost 10% more than the conventional models. Conventional models averaged the values around 50%
of macro-average F1-score except for the KNN classifier which performed poorly
getting the macro-average F1-score of 21%. Here, the BERT classifier with Ekman
taxonomy including neutral emotion had a macro-average precision of 55% and a
sensitivity of 68%. This classifier also outperformed the macro-average F1-score by
106 61%. While the RoBERTa classifier had a macro-average precision of 65%, the
recall, or sensitivity, was found to be 53%. Future research should look into using
fine-grained datasets with low skew with the aim to improve model performance and
also use a hybrid model structure.
References
5. Rajendram SM, Mirnalinee TT (2022) Contextual emotion detection on text using gaussian
process and tree based classifiers. Intell Data Anal 261:119–132
6. Senn S, Tlachac M, Flores R, Rundensteiner E (2022) Ensembles of BERT for depression clas-
sification. In: 2022 44th annual international conference of the IEEE engineering in medicine &
biology society (EMBC). [online] IEEE, pp 4691–4694. Available at: https://ieeexplore.ieee.
org/document/9871120/
7. Ab Nasir AF, Seok Nee E, Sern Choong C, Shahrizan Abdul Ghani A, Abdul Majeed APP,
Adam A, Furqan M (2020) Text-based emotion prediction system using machine learning
approach. In: IOP conference series: materials science and engineering. Institute of Physics
Publishing
8. Chowanda A, Sutoyo R, Meiliana ST (2021) Exploring text-based emotions recognition
machine learning techniques on social media conversation. Procedia Comput Sci 821–828
9. Saad MM, Jamil N, Hamzah R (2018) Evaluation of support vector machine and decision tree
for emotion recognition of malay folklores. Bull Electr Eng Inform 73:479–486
10. Alm CO, Roth D, Sproat R (2005) Emotions from text: machine learning for text-based emotion
prediction. In: Proceedings of human language technology conference and conference on
empirical methods in natural language processing, pp 579–586
11. Desmet B, Hoste V (2013) Emotion detection in suicide notes. Exp Syst Appl 4016:6351–6358
12. Gievska S, Koroveshovski K, Chavdarova T (2014) A hybrid approach for emotion detection
in support of affective interaction. In: 2014 IEEE international conference on data mining
workshop, IEEE, pp 352–359
13. Gupta A, Sahu H, Nanecha N, Kumar P, Roy PP, Chang V (2019) Enhancing text using emotion
detected from EEG signals. J Grid Comput 172:325–340
14. Acheampong FA, Nunoo-Mensah H, Chen W (2021) Transformer models for text-based
emotion detection: a review of BERT-based approaches. Artif Intell Rev 548:5789–5829
15. Santhosh Baboo S, Amirthapriya M (2022) Sentiment analysis and automatic emotion detection
analysis of twitter using machine learning classifiers. Int J Mechan Eng
16. Lee SYM, Chen Y, Huang C-R (2010) A text-driven rule-based system for emotion cause
detection. In: Proceedings of the NAACL HLT 2010 workshop on computational approaches
to analysis and generation of emotion in text, pp 45–53
17. Tao J (2004) Context based emotion detection from text input. In: 8th international confer-
ence on spoken language processing, ICSLP 2004. International Speech Communication
Association, pp 1337–1340
18. Yousaf A, Umer M, Sadiq S, Ullah S, Mirjalili S, Rupapara V, Nappi M (2021) Emotion
recognition by textual tweets classification using voting classifier (LR-SGD). IEEE Access
9:6286–6295
19. Deshpande M, Rao V (2017) Depression detection using emotion artificial intelligence.
GoEmotions: a dataset of fine-grained emotions. arXiv preprint arXiv:2005.00547
20. Singh G, Brahma D, Rai P, Modi A (2021) Fine-grained emotion prediction by modeling
emotion definitions. In: 2021 9th international conference on affective computing and
intelligent interaction, ACII 2021. Institute of Electrical and Electronics Engineers Inc.
The Impact of Clustering-Based
Sequential Multivariate Outliers
Detection in Handling Missing Values
Abstract The presence of missing values is a common issue that frequently leads
to incomplete data in a wide range of research. They diminish the accessibility of
the dataset that can be utilized and degrade the statistical power of the analysis.
A significant focus in numerous studies has been directed toward the methods of
missing value imputation. In cases where the dataset includes outliers, the impu-
tation of missing values might be incorrect or significantly deviate from the actual
values. One of the challenges that impacts the quality of data is the handling of missing
values and outliers simultaneously. Several studies removed outliers before imputing
missing values or deleted observations with missing values before detecting outliers.
The removal approach leads to a lack of information included within the data. Other
researchers integrate clustering methods into the process of missing value imputation
to mitigate the impact of outliers and data variations, thereby enhancing the accu-
racy of the imputation model. This paper proposes a new clustering-based sequen-
tial multivariate outlier detection (SMOD) method to effectively handle incomplete data in the presence of outliers. The method is applied to an official economic statistics dataset that involves outliers, under a missing value rate scenario of about 50 per cent. In comparison with a well-known and widely used clustering technique, i.e.,
model-based clustering (MBC), the proposed method works well in missing value
imputation.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 221
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_17
1 Introduction
The existence of missing values and outliers is one of the issues that affects data
quality. Removing missing values and outliers can potentially introduce bias in
parameter estimation and weaken the generalizability of the findings [1]. Statistical
analysis that disregards missing values might result in the loss of information, hence
diminishing the statistical power and increasing standard errors [2]. Several statistical
procedures have been designed to impute the data [3]. Before performing anal-
ysis of datasets containing missing values and outliers, an editing process is required
to make the data a complete dataset. Failure to accurately edit data can result in data
that is not suitable for statistical inference and statistical analysis is vulnerable to
violations of assumptions.
Currently, imputation-based procedures for handling missing values, whether
using statistical or machine learning approaches, have emerged as promising
solutions for addressing incomplete datasets [4].
For many decades, imputation for large datasets has been carried out using complicated methodologies. Rubin [5] introduced multiple imputation as a method to address nonresponse in surveys. Each completed dataset is simulated and then analyzed using standard methods. The results are subsequently combined to generate estimates and confidence intervals that incorporate the additional uncertainty that single imputation ignores.
Clustering is one of the techniques used to detect the presence of outliers [6].
Several multiple imputation techniques include clustering techniques as an approach
to imputing missing values [7]. Clustering approaches are employed to mitigate the
impact of outliers and data variations, hence enhancing the accuracy of the imputation
model [8]. However, only a few cluster techniques are insensitive to outliers [9].
This study introduces a novel clustering method that can be used in missing value
imputation processing with a special focus on solving outliers. The distinguishing
factor of this study is the utilization of a sequential multivariate outlier detection
(SMOD) technique for grouping. The performance of the addition of the introduced
clustering technique will be compared to the already well-known and widely used
method of clustering, i.e., model-based clustering.
2 Literature Review
for multiple imputation of missing data. Tsai et al. [12] defined a class center-
based approach for efficiently generating effective imputation results. Xue et al.
[13] constructed a mixture-based multiple imputation model to handle the variance
of predictive models. Lin et al. [14] proposed a new hybrid multiple imputation
framework and cluster monitoring system based on deep neural network. Samad
et al. [8] introduced missing value imputation using clustering and deep learning
within multiple imputation framework.
Unfortunately, the majority of existing research has not put significant emphasis on the development of an imputation approach that takes into consideration the presence of outliers. Hence, this research presents a clustering approach that relies on the principle
of identifying multivariate outliers through the use of robust distances in handling
missing values. A detailed discussion of this methodology will be provided in the
next section. Multiple imputation uses predictive mean matching model in the ‘mice’
package of R software [15].
One method used to check the presence of a multivariate outlier is the calculation of
the Mahalanobis distance [16], which is defined as follows:
MD_i = [(x_i − µ)ᵀ Σ⁻¹ (x_i − µ)]^{1/2},  i = 1, 2, …, n,   (1)
where µ is the mean vector of the sample and ∑ is the variance–covariance matrix.
An observation is said to be an outlier if its Mahalanobis distance exceeds √(χ²_{p;1−α/2}), where p is the degrees of freedom. However, the Mahalanobis distance is very sensitive to outliers, which can distort the estimates of µ and Σ, resulting in non-outlier data being misidentified as outliers. To overcome this, Rousseeuw and van Zomeren introduced a robust distance (RD) that is more robust to outliers [17]. RD
uses an estimate of the variance–covariance matrix and mean based on the minimum
covariance determinant (MCD).
Assuming x_1, x_2, …, x_n is a random sample of n observed objects in the p-dimensional space R^p, the MCD estimator seeks a subset of h of the n observations whose variance–covariance matrix has the smallest determinant.
RD_i = [(x_i − µ_MCD)ᵀ Σ_MCD⁻¹ (x_i − µ_MCD)]^{1/2},  i = 1, 2, …, n,   (4)
where µmcd is a robust sample mean estimator based on MCD (average of the h
subset of the sample):
µ_MCD = (1/h) \sum_{j=1}^{h} x_j,   (5)

Σ_MCD = (1/(h − 1)) \sum_{j=1}^{h} (x_j − µ_MCD)(x_j − µ_MCD)ᵀ.   (6)
Rousseeuw and van Zomeren [17] used the RD and a quantile of the χ²_p distribution as a cut-off value to detect multivariate outliers. The ith observation with

RD_i ≥ √(χ²_{p;1−α/2})   (7)

is identified as a multivariate outlier.
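As an illustration of Eqs. (4)–(7), a minimal sketch using scikit-learn's MinCovDet to compute robust distances and flag observations beyond the chi-square cut-off is given below. This is an alternative to the R 'mvoutlier' workflow cited by the authors, not their actual implementation, and the alpha value is an assumption.

```python
# Minimal sketch (assumptions: NumPy array X of shape (n, p); alpha = 0.05 as in Eq. (7)).
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def robust_outliers(X: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    mcd = MinCovDet().fit(X)                                   # mu_MCD and Sigma_MCD from the MCD subset
    rd = np.sqrt(mcd.mahalanobis(X))                           # robust distances RD_i, Eq. (4)
    cutoff = np.sqrt(chi2.ppf(1 - alpha / 2, df=X.shape[1]))   # sqrt(chi^2_{p; 1 - alpha/2})
    return rd >= cutoff                                        # True where an observation is flagged as an outlier
```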
Σ_g = λ_g D_g A_g D_gᵀ,   (8)

with λ_g the scalar value that shows the volume of the ellipse.
The normalized root mean square error (NRMSE) is one of numerous metrics used to
evaluate the imputation model of several variables that have far different unit values.
The discrepancy between the imputed values and the true values on the dataset is
quantified using the root mean squared error (RMSE) divided by the mean of the
actual values as the normalized root mean square error (NRMSE). The measurement
requirements are expressed mathematically in Eqs. (9) and (10) as stated below.
RMSE = √( \sum_{i=1}^{m} (X_i − X̂_i)² / m ),   (9)

NRMSE = RMSE / ( (1/m) \sum_{i=1}^{m} X_i ),   (10)

with m the number of imputed values, X_i the actual value, and X̂_i the imputed value.
The techniques presenting lower values of NRMSE demonstrate superior accuracy
in the model imputation of missing values.
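A direct translation of Eqs. (9) and (10) into Python is sketched below; the arrays stand in for the actual and imputed values at the m missing positions.

```python
# Minimal sketch: NRMSE over the m imputed positions (Eqs. 9-10).
import numpy as np

def nrmse(actual: np.ndarray, imputed: np.ndarray) -> float:
    rmse = np.sqrt(np.mean((actual - imputed) ** 2))   # Eq. (9)
    return rmse / np.mean(actual)                      # Eq. (10): normalize by the mean of the actual values
```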
Davies and Gather [20] drew an important distinction between single-step and sequential procedures for outlier detection. Single-step techniques are able to discover all outliers simultaneously, in contrast to the sequential procedure of eliminating
or adding data points. In the sequential process, at each step, every observation is
assessed to determine if it fits the criteria for being identified as an outlier. The data
point with the highest outlyingness measure is subjected to outlier detection. In the
event that an observation is identified as an outlier, it is subsequently removed from
the dataset, and the process is repeated. The process will stop if no more observations
are detected as outliers.
Fig. 1 [Sequential outlier detection steps; subsets labelled Data.2, Data.2.2, Data.2.2.2, …]
The boxes with red lines show observations detected as outliers, while boxes with black lines show the reverse. At the last step in Fig. 1, there were no red-line boxes, which meant no further observations were identified as outliers.
Drawing upon the work of Filzmoser et al. [21], this paper utilizes a robust distance metric that was created for the purpose of detecting outliers in multivariate data [17]. The current update on this work entails the implementation of a sequential multivariate outlier detection approach, which differs from the sequential procedure of Davies and Gather.
The results of outlier detection on the first sequence of the dataset are examined, both for the subset identified as outliers and for the subset not considered outliers. Subsequently, outlier detection is applied to each subset once again. The procedure persists until a specific threshold is attained, such as sequential outlier detection up to four steps. The process is anticipated to yield data clusters, or smaller subsets of variation, because it focuses on identifying the existence of outliers from the very beginning of data grouping. The two figures depict a comparison between the previously employed sequential outlier detection approach and the newly suggested sequential outlier detection procedure.
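A rough sketch of this splitting idea, reusing the robust-distance flag from the earlier snippet, is given below. Only the stopping depth of four steps follows the text; the function names, data structures, and recursion details are illustrative assumptions.

```python
# Minimal sketch (assumption: `robust_outliers` from the earlier snippet; X is an (n, p) array).
import numpy as np

def smod_split(X: np.ndarray, depth: int = 4):
    """Recursively split into outlier / non-outlier subsets for up to `depth` steps."""
    if depth == 0 or len(X) == 0:
        return [X]
    flags = robust_outliers(X)
    outliers, inliers = X[flags], X[~flags]
    if len(outliers) == 0:          # no further outliers detected: stop splitting
        return [X]
    # Both subsets are examined again in the next sequence, unlike the
    # remove-and-repeat procedure of Davies and Gather.
    return smod_split(outliers, depth - 1) + smod_split(inliers, depth - 1)

# The resulting leaf subsets are treated as clusters in the subsequent imputation step.
```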
In Fig. 2, the boxes filled in gray (outlined in either black or white) mark the subsets of data selected as clusters. For comparability, the number of clusters is determined
based on the optimum clusters derived from other methods, namely model-based
clustering. The MBC employs two criteria, namely Akaike information criterion
(AIC) and Bayesian information criterion (BIC), to determine the optimal number
of clusters. This research has not developed a criterion in determining the optimum
number of clusters in clustering-based SMOD.
The primary objective of this article is to mitigate substantial deficiencies in data
by including clusters into the imputation procedure and evaluate the efficacy of the
proposed clustering approach in comparison with current clustering techniques. This
article presents the following contributions. Initially, we do imputation on the datasets
without using clustering. Furthermore, we conducted comprehensive analyses to
illustrate the impact of including clustering analysis on the accuracy of imputation.
Thirdly, a comparison is made between the performance of the pre-existing clustering
approach: model-based clustering (MBC) and the proposed method: clustering-based
sequential multivariate outlier detection (SMOD).
4 Discussion
The data utilized in this study were obtained from the 2016 Economic Census for
Medium and Large Establishments (MLE) in the province of East Java. The official
economic statistics dataset has three variables: turnover (X 1 ), measured in million
Indonesian rupiah (IDR), cost (X 2 ), also measured in million IDR, and worker (X 3 ),
measured as the number of individuals. The compilation includes official economic
information and encompasses a substantial sample size of over 8,000 enterprises.
From a conceptual standpoint, the quantity of employees directly corresponds to
the scale of human resources inside an organization. Consequently, it can be inferred
that an escalation in the number of workers would likely result in a proportional rise in turnover. Moreover, the magnitude of expenses serves as an indicator of
a company’s scale of operations. Consequently, a positive correlation is anticipated
between the growth in expenses and the corresponding increase in turnover [22].
The original dataset does not contain the missing values. To facilitate the research
process, a scenario that includes the missed values for three variables is created
using the ’ampute’ function of the MICE package in the R program. This means that
a missing value has an actual known value.
Figure 3 represents the composition of the dataset, which comprises 4,382
observed values (OV) and 4,170 occurrences of missing values (MV). These values relate to the 8,552 complete records for variables X 1 , X 2 , and X 3 . The observed values
are represented in the uppermost row, where bars are depicted as blue. The missing
values are denoted by bars that have been colored in red.
Table 1 presents the descriptive statistics of the dataset, together with the corre-
sponding rate of missing values. There is a significant range between the highest
(Max) and lowest (Min) values. The range variation among the datasets is quite
wide. Hence, the standard deviation is greater than the mean. The datasets also
contain some outliers, as shown in Fig. 4.
Table 1 Descriptive statistics and missing value rate of case study data (columns: Dataset, Min, Mean, Max, SD, Numbers of OV and MV, Missing rate (%))
Figures 5 and 6 show the distance-distance plot of outlier detection using robust
distance for first step and second step. The process of outlier detections is
accomplished in four steps.
Sequential multivariate outlier detection generates a variety of dataset combina-
tions that can be utilized to create clusters.
The discrepancy in the cluster membership counts between the MBC and SMOD
approaches is notable in Table 2.
According to the findings, the ideal number of groups for MBC is six. This number was then applied to the clustering-based SMOD
approach, which resulted in the formation of data subsets that can be seen as clusters, as illustrated in Fig. 7.
Fig. 7 [Tree of data subsets formed by clustering-based SMOD: the full dataset (n = 8,552) is split into Data.1 (n = 1,881) and Data.2 (n = 6,741), and further sequential splits yield terminal subsets ranging from n = 0 to n = 4,054]
Fig. 8 Plot of plausible value for all clusters of model-based clustering (MBC) for predictive mean
matching imputation (iteration = 20)
Fig. 9 Plot of plausible value for all clusters of clustering-based sequential multivariate outlier
detection (SMOD) for predictive mean matching imputation (iteration = 20)
5 Conclusion
This paper compares the effectiveness of two clustering strategies for dealing with
missing values. The proposed method has better imputation accuracy for some variables, as indicated by lower NRMSE values. The suggested method yields outcomes that are
comparable to those of commonly used clustering techniques. However, the inclusion
of clustering significantly improves the accuracy of all variables. In this paper, we
continue to cluster the entire dataset before creating a scenario with missing values.
This study aims to demonstrate that clustering datasets improves model imputation
accuracy. In our upcoming research, we propose to perform clustering on incom-
plete datasets. It is anticipated that this can be a solution to the frequent occurrence
of missing values in the data survey, especially in the official statistics dataset. For
the extended research, we will add more datasets and simulations for generalization
of outcomes.
References
12. Tsai CF, Li ML, Lin WC (2018) A class center based approach for missing value imputation.
Knowl-Based Syst 151:124–135. https://doi.org/10.1016/j.knosys.2018.03.026
13. Xue Y, Klabjan D, Luo Y (2016) Mixture-based multiple imputation model for clinical data
with a temporal dimension. In: Proceedings of 2019 IEEE international conference on Big
Data, Big Data 2019, pp 245–252. https://doi.org/10.1109/BigData47090.2019.9005672
14. Lin J, Li NH, Alam MA, Ma Y (2019) Data-driven missing data imputation in cluster monitoring
system based on deep neural network. Appl Intell 50(3):860–877. https://doi.org/10.1007/s10
489-019-01560-y
15. van Buuren S, Oudshoorn CGM (2007) MICE: multivariate imputation by chained equations.
R Packag Version 1(3):2007
16. Bedrick EJ, Lapidus J, Powell JF (2000) Estimating the Mahalanobis distance from mixed
continuous and discrete data. Biometrics 56(2):394–401. https://doi.org/10.1111/j.0006-341X.
2000.00394.x
17. Rousseeuw PJ, van Zomeren BC (1990) Unmasking multivariate outliers and leverage points:
rejoinder. J Am Stat Assoc 85(411):648. https://doi.org/10.2307/2289999
18. Filzmoser P, Gschwandtner M (2022) Package ‘mvoutlier’ R Package. version [Online].
Available: http://cstat.tuwien.ac.at/filz
19. Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3):803–821
20. Davies L, Gather U (1993) The identification of multiple outliers. J Am Stat Assoc 88(423):782–
792. https://doi.org/10.1080/01621459.1993.10476339
21. Filzmoser P, Garrett RG, Reimann C (2005) Multivariate outlier detection in exploration
geochemistry. Comput Geosci 31(5):579–587. https://doi.org/10.1016/j.cageo.2004.11.013
22. Takahashi M, Ito T (2013) Multiple imputation of missing values in economic surveys: compar-
ison of competing algorithms. In: Proceedings 59th ISI world statistics congress, no. August,
pp 3240–3245
Sarcasm Detection in Newspaper
Headlines
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 237
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_18
1 Introduction
2 Related Work
3 Methods
The dataset is obtained by merging the headlines from two widespread news websites
called “huffingtonpost.com” (HuffPost) and “theonion.com” (Onion). The news
headlines dataset for sarcasm detection is less prone to noise as the entries are direct headlines written by professional journalists. All the sarcastic posts are gathered from the Onion, which is known for publishing sarcastic news versions of current events, and the real (non-sarcastic) headlines are collected
from HuffPost [25]. The dataset contains 26,710 headlines where around 56% of
headlines are non-sarcastic and the remaining 44% are sarcastic. This ratio of 56:44
is sufficient for the problem statement, and there is no need to use oversampling
or under-sampling techniques. Since the Onion only writes sarcastic headlines, the
labels (or tags) we have for sarcasm are accurate, unlike Twitter data, where many
messages are responses to other messages.
3.2 Pre-processing
The raw data contains various types of characters and numerical values that can
affect the performance of the model. First, all upper-case words are converted to
lower-case words, then punctuations are removed followed by the removal of special
characters using the regular expression library. In the next phase, stop words such as "from", "she", "it", and "are" are removed. Such words are commonly used in all sentences; however, in natural language processing and text mining applications they are not very useful as they carry little information [35]. Thus, they
were removed from all headlines in the dataset. It is also important to transform
words into their base forms. Numerous applications, such as web search, labelling,
and information retrieval, use lemmatization which involves extracting the correct
lemma from a word and then returning its dictionary form [30]. To perform this
lemmatization, we used Wordnet Lemmatizer. The data is then divided into 80%
train set, 10% validation set and 10% test set.
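A minimal sketch of the cleaning, stop-word removal, and WordNet lemmatization steps is shown below; the exact regular expressions are assumptions, since the paper does not list them.

```python
# Minimal sketch (assumptions: NLTK resources 'stopwords' and 'wordnet' available for download).
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(headline: str) -> str:
    text = headline.lower()                                      # lower-case all words
    text = re.sub(r"[^a-z\s]", " ", text)                        # drop punctuation, digits, special characters
    tokens = [t for t in text.split() if t not in STOP]          # remove stop words
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)     # return dictionary (base) forms
```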
Vectorization and Feature Extraction: To perform machine learning on text data, one of the first steps is to transform a document into a vector representation. This process is referred to as feature extraction or text vectorization, and it converts the features into numerical representations. Although machine learning algorithms can perform various mathematical operations, they cannot process raw text data [21]. Instead, they can only process a numerical representation of the text in a vector, and this processing is done through vectorization, which enables machines to recognize and interpret the contents of the text [33]. In this work, we implement two text vectorization techniques: (i) Bag of Words (BoW) and (ii) Term Frequency–Inverse Document Frequency (TF–IDF).
The Bag of Words (BoW) technique converts text documents into numbers, and the result is stored as key-value pairs (a dictionary). Each dictionary item has a key (a word) and a value (the number of times this word appears in the headline). Since BoW only represents the number of times a word appears in a document, it does not consider the context or the order in which words appear, which is important for natural language processing and document classification. Other drawbacks of BoW vectorization include the sparsity of the matrix and vocabulary shortage, i.e. new words that appear only in the test set are not captured in the vocabulary [11].
TF–IDF, a statistical measure, evaluates a word's relevance to a document by considering its frequency within the document and its rarity across multiple documents. It combines term frequency (word occurrences in a document) with inverse document frequency (word prevalence across a document set). Term frequency is determined by the number of times a word appears in a document, normalized for document length; it can be calculated in various ways, including the raw count or the count relative to the most frequent word in the document (see Eq. 1). Inverse document frequency quantifies a word's rarity in the entire document set: it is computed by dividing the total number of documents by the number of documents containing the word and applying a logarithm. Common, widely used words thus receive values approaching 0, while rarer words receive higher values
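As a concrete illustration of the two techniques, the sketch below uses scikit-learn's CountVectorizer and TfidfVectorizer (an assumed implementation choice; the chapter does not specify its tooling).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

headlines = ["area man wins lottery again",
             "man reads newspaper on train",
             "lottery winner loses winning ticket"]

# Bag of Words: each headline becomes a vector of raw word counts (no order, no context).
bow = CountVectorizer(ngram_range=(1, 1))          # (1, 2) would add Bigram features
X_bow = bow.fit_transform(headlines)
print(dict(zip(bow.get_feature_names_out(), X_bow.toarray()[0])))

# TF-IDF: counts are re-weighted so that words common to all headlines receive low scores.
tfidf = TfidfVectorizer(ngram_range=(1, 1))
X_tfidf = tfidf.fit_transform(headlines)
print(X_tfidf.shape)
```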
Fig. 1 Machine learning algorithms pipeline (a). The layered architecture of BiLSTM model (b)
Decision Tree Classifier: The decision tree classifier analyses the characteristics of a dataset to predict its class. It begins at the root node and progresses to subsequent nodes, comparing attributes at each node and following the corresponding branches until it reaches the end of the tree [39]. The process begins with the complete dataset; then, using an Attribute Selection Measure (ASM), the algorithm selects the best attribute to divide the root node into subsets, creating further decision nodes. This process is repeated until nodes can no longer be split. The most commonly used attribute selection measure is based on information gain. Information gain quantifies the reduction in uncertainty (entropy) achieved by splitting a dataset on a particular attribute (see Eq. 4) [40]. It measures the amount of information a feature provides about a given class.
Information Gain = Entropy(parent) − Σ_k (|subset_k| / |parent|) · Entropy(subset_k)    (4)
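A small worked example of Eq. (4), computing entropy from class proportions, is given below; the binary split itself is hypothetical.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # 4 sarcastic, 4 non-sarcastic headlines
left   = np.array([1, 1, 1, 0])               # subset where the attribute holds
right  = np.array([1, 0, 0, 0])               # subset where it does not

gain = entropy(parent) - sum(len(s) / len(parent) * entropy(s) for s in (left, right))
print(round(gain, 3))   # ~0.189 bits of uncertainty removed by this split
```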
Random Forest Classifier: The Random Forest (RF) algorithm is a collection of decision trees [29] trained through bagging or bootstrapping methods. Its purpose is to provide a robust prediction by averaging the outputs of the individual trees. Moreover, RF algorithms can mitigate the overfitting commonly associated with individual decision trees. They create multiple decision trees, each with a root, decision nodes and leaves. The leaf nodes represent a tree's output, and the root node is the initial point where the data is split based on a chosen feature. Furthermore, RF classifiers are good at handling missing data and generating reliable predictions with minimal hyperparameter tuning. They effectively counter overfitting, outperforming individual decision trees in terms of accuracy [3].
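For orientation, a minimal sketch of the pipeline in Fig. 1a, fitting the four classifiers on TF–IDF features, is given below; the hyperparameters and the placeholder training data are assumptions, not the authors' exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}

# Placeholders standing in for the cleaned headlines and their sarcasm labels.
X_train = ["scientist discover water mars", "area man read term condition",
           "stock market close higher", "nation celebrate weekend again"]
y_train = [0, 1, 0, 1]

for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), clf)
    pipe.fit(X_train, y_train)
    print(name, accuracy_score(y_train, pipe.predict(X_train)))
```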
Bi-directional Long Short-Term Memory (BiLSTM): The Long Short-Term Memory (LSTM) network was developed to overcome the long-term dependency problem and retains information for a long time [19]. This is accomplished using a separate memory cell called the cell state, which is passed sequentially through all LSTM units, as shown in Fig. 2. The LSTM usually learns from the beginning of the sentence to its end in a forward manner. However, to obtain the full context of a sentence, it is better to read through it in a backward manner too. This is done in the Bi-directional LSTM (BiLSTM), where two hidden units are used to capture the information of the sentence in both forward and backward directions. This enables the model to capture dependencies in both past and future contexts, enhancing its ability to understand temporal patterns [9].
Compared to conventional recurrent neural networks (RNNs), LSTMs are
designed to tackle the issue of vanishing or exploding gradients, allowing for the
retention of relevant information over longer sequences. They achieve this by main-
taining a cell state, acting as a form of memory that the network can access and
update over time. An LSTM model comprises three gates: input, output, and forget. The forget gate decides what information to discard from the cell state, the input gate determines what new information to store in the cell state, and the output gate controls what part of the cell state is passed on as the output. These gates operate together to update the cell state and produce an output that contributes to the model's predictions [38].
Bi-directional LSTMs process information in both directions. Therefore, they are
useful where context from both past and future events is needed, e.g. in natural
language processing and speech recognition. By combining the outputs from both
directions, the model can make more informed and accurate predictions. This
architectural enhancement greatly expands the capabilities of traditional LSTMs
in understanding complex temporal relationships in data [8].
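A minimal Keras sketch of the layered BiLSTM architecture in Fig. 1b (embedding, bidirectional LSTM, sigmoid output) is shown below; the layer sizes and vocabulary size are illustrative assumptions, not the exact configuration used in the chapter.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM = 20000, 128   # assumed values

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Bidirectional(layers.LSTM(64)),   # forward and backward passes over the headline
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),   # sarcastic vs non-sarcastic
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train_padded, y_train, validation_data=(X_val_padded, y_val), epochs=5)
```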
approach, as it produced comparably stable accuracy for both the Unigram and Bigram analyses.
On the other hand, when the vectorization was changed to Term Frequency–Inverse Document Frequency (TF–IDF), the performance of the Logistic Regression and Naive Bayes classifiers remained consistent in the Unigram analysis, achieving accuracies of 80.32% and 80.54%, respectively, as given in Table 2. In the Bigram analysis, Logistic Regression outperformed the other algorithms, achieving an accuracy of 80.09%. We again notice that the accuracy of the Naive Bayes algorithm dropped for the Bigram analysis, similar to the results of the BoW vectorization method presented in Table 1.
Our results show that sarcasm detection in news headlines is invariant to the vectorization technique: BoW and TF–IDF have little or no effect on the performance of these four machine learning algorithms (see Tables 1 and 2). One reason behind this invariance could be the short length of newspaper headlines. Furthermore, the performance of three of the machine learning algorithms (Logistic Regression, Random Forest and decision tree) remains stable across the two tokenization techniques (Unigram and Bigram); the exception is Naive Bayes, which reports high precision (80.9%) only in the Unigram analysis, regardless of the vectorization method (BoW, TF–IDF).
Furthermore, we extended our analysis to include a Bi-directional LSTM model,
a preferred model for natural language processing as compared to the traditional
machine learning algorithms [32, 16]. In the case of sarcasm detection in news
headlines, the BiLSTM model achieved an accuracy of 88.48% on the test set (93%
on the train set and 90% on the validation set). The proximity of these accuracy
values indicates minimal overfitting. The performance of the model is reported in
Table 3.
Table 1 Results of machine learning models with the Bag of Words (BoW) vectorization technique with Unigram and Bigram analysis

| Algorithm | Unigram accuracy | Unigram recall | Unigram precision | Unigram F1 score | Bigram accuracy | Bigram recall | Bigram precision | Bigram F1 score |
|---|---|---|---|---|---|---|---|---|
| Logistic Regression | 80.68 | 80.63 | 80.68 | 80.53 | 81.01 | 81.11 | 81.01 | 80.77 |
| Naive Bayes Classifier | 80.98 | 81.25 | 80.98 | 80.65 | 72.95 | 79.53 | 72.95 | 69.68 |
| Decision Tree Classifier | 72.86 | 72.74 | 72.86 | 72.77 | 73.50 | 73.32 | 73.50 | 73.31 |
| Random Forest Classifier | 76.28 | 76.86 | 76.28 | 75.54 | 75.17 | 76.91 | 75.17 | 73.81 |
Table 2 Results of machine learning models with the Term Frequency–Inverse Document Frequency (TF–IDF) vectorization technique with Unigram and Bigram analysis

| Algorithm | Unigram accuracy | Unigram recall | Unigram precision | Unigram F1 score | Bigram accuracy | Bigram recall | Bigram precision | Bigram F1 score |
|---|---|---|---|---|---|---|---|---|
| Logistic Regression | 80.32 | 80.33 | 80.32 | 80.11 | 80.09 | 80.20 | 80.10 | 79.81 |
| Naive Bayes Classifier | 80.54 | 81.01 | 80.98 | 80.54 | 72.62 | 79.41 | 72.59 | 69.95 |
| Decision Tree Classifier | 71.77 | 71.70 | 71.78 | 71.74 | 71.64 | 71.41 | 71.64 | 71.29 |
| Random Forest Classifier | 77.12 | 77.37 | 77.12 | 76.62 | 75.62 | 76.78 | 75.62 | 74.55 |
It is observed that BiLSTM and Logistic Regression are the best choices for sarcasm detection in news headlines, yielding average accuracies of 88% and 81%, respectively. The performance of the BiLSTM model can be attributed to its architectural
design which facilitates the retention of extensive sequences of data while effec-
tively filtering out irrelevant information. Moreover, the model uses both backward
and forward LSTM layers, along with a sigmoid layer, to decide which parts of the
information are important for the output [23].
5 Conclusion
This work investigates the performance of four machine learning algorithms (Logistic Regression, Naive Bayes, decision tree, and Random Forest) and one deep learning model (Bi-directional LSTM) for sarcasm detection in news headlines. We demonstrate that sarcasm detection in news headlines is invariant to the vectorization technique: BoW and TF–IDF have little or no effect on the performance of the machine learning algorithms. Moreover, the performance of three of the machine learning algorithms (Logistic Regression, Random Forest and decision tree) remains stable across the two tokenization techniques (Unigram and Bigram), whereas Naive Bayes achieved high precision only with the Unigram analysis. In a nutshell, we conclude that BiLSTM is the preferred model for sarcasm detection in news headlines.
In future work, we plan to evaluate the performance of the BiLSTM model on a
combined dataset sourced from multiple platforms, such as Twitter, Facebook, Insta-
gram, Reddit, news headlines, and various e-commerce websites. Another interesting
direction is to compare the performance of the BiLSTM model with pre-trained Large
Language Models (LLMs) for sarcasm detection, focusing on efficiency in terms of
speed and computational cost.
Acknowledgements The authors thank UNITAR International University for supporting the publication of this paper.
References
16. Lin R (2022) Comment texts sentiment analysis based on improved bi-LSTM and Naive
Bayes. In: 2022 international conference on data analytics, computing and artificial intelligence
(ICDACAI), pp 407–412. IEEE
17. Maynard DG, Greenwood MA (2014) Who cares about sarcastic tweets? Investigating the
impact of sarcasm on sentiment analysis. In: LREC 2014 proceedings. ELRA
18. Mehmood T, Gerevini AE, Lavelli A, Olivato M, Serina I (2023) Distilling knowledge with a
teacher’s multitask model for biomedical named entity recognition. Information 14(5):255
19. Mehmood T, Gerevini A, Lavelli A, Serina I (2019) Leveraging multi-task learning for biomed-
ical named entity recognition. In: AI*IA 2019—advances in artificial intelligence—XVIIIth
international conference of the Italian Association for Artificial Intelligence, Rende, Italy,
19–22 Nov 2019, Proceedings. Lecture notes in computer science, vol 11946, pp 431–444.
Springer, Berlin
20. Mehmood T, Gerevini A, Lavelli A, Serina I (2019) Multi-task learning applied to biomedical
named entity recognition task. In: Proceedings of the sixth italian conference on computational
linguistics, Bari, Italy, 13–15 Nov 2019. CEUR Workshop Proceedings, vol 2481. CEUR-
WS.org
21. Mehmood T, Gerevini AE, Lavelli A, Serina I (2020) Combining multi-task learning with
transfer learning for biomedical named entity recognition. In: Knowledge based and intelligent
information & engineering systems: proceedings of the 24th international conference KES-
2020, Virtual Event, 16–18 Sept 2020. Procedia Computer Science, vol 176, pp 848–857.
Elsevier
22. Mehmood T, Md Rais HB (2016) Machine learning algorithms in context of intrusion detection.
In: 2016 3rd international conference on computer and information sciences (ICCOINS), pp
369–373
23. Mehmood T, Serina I, Lavelli A, Putelli L, Gerevini A (2023) On the use of knowledge transfer
techniques for biomedical named entity recognition. Fut Internet 15(2):79
24. Mishra A, Kanojia D, Nagar S, Dey K, Bhattacharyya P (2017) Harnessing cognitive features
for sarcasm detection. arXiv preprint arXiv:1701.05574
25. Misra R (2022) News headlines dataset for sarcasm detection. arXiv preprint arXiv:2212.06035
26. Mouthami K, Devi KN, Bhaskaran VM (2013) Sentiment analysis and classification based
on textual reviews. In: 2013 international conference on Information communication and
embedded systems (ICICES). pp 271–276. IEEE
27. Ortigosa A, Martín JM, Carro RM (2014) Sentiment analysis in facebook and its application
to e-learning. Comput Hum Behav 31:527–541
28. Pawar N, Bhingarkar S (2020) Machine learning based sarcasm detection on twitter data. In:
2020 5th international conference on communication and electronics systems (ICCES), pp
957–961. IEEE
29. Pini M, Scalvini A, Liaqat MU, Ranzi R, Serina I, Mehmood T (2020) Evaluation of machine
learning techniques for inflow prediction in Lake Como, Italy. In: Knowledge-based and intel-
ligent information & engineering systems: proceedings of the 24th international conference
KES-2020, Virtual Event, 16–18 Sept 2020. Procedia Computer Science, vol 176, pp 918–927.
Elsevier
30. Plisson J, Lavrac N, Mladenic D et al (2004) A rule based approach to word lemmatization.
In: Proceedings of IS, vol 3, pp 83–86
31. Porwal S, Ostwal G, Phadtare A, Pandey M, Marathe MV (2018) Sarcasm detection using
recurrent neural network. In: 2018 second international conference on intelligent computing
and control systems (ICICCS), pp 746–748. IEEE
32. Preethi V et al (2021) Survey on text transformation using bi-LSTM in natural language
processing with text data. Turkish J Comput Math Educ (TURCOMAT) 12(9):2577–2585
33. Prokhorov S, Safronov V (2019) AI for AI: what NLP techniques help researchers find the
right articles on NLP. In: 2019 international conference on artificial intelligence: applications
and innovations (IC-AIAI), pp 76–765. IEEE
34. Rajadesingan A, Zafarani R, Liu H (2015) Sarcasm detection on twitter: a behavioral modeling
approach. In: Proceedings of the eighth ACM international conference on web search and data
mining, pp 97–106
K. Dhayalan (B)
Liverpool John Moores University, Liverpool, UK
e-mail: [email protected]
N. Sultanova · J. Mustafina
Kazan Federal University, Kazan, Russia
e-mail: [email protected]
J. Mustafina
e-mail: [email protected]
P. Daud
School of Information Technology, UNITAR International University, Petaling Jaya, Selangor,
Malaysia
e-mail: [email protected]
1 Introduction
2 Background
The language feature extraction process improves the performance of NER models for low-resourced languages. Regex-based feature extraction and morphological analyzers have been explored by different NER models for the Tamil language.
Numerical values, dates and times are not needed for this task, so these values are considered noise in the dataset. Extracting and removing them helps to reduce complexity. Rajendran et al. extracted these values using regular expressions (REGEXP) [2].
Nzeyimana and Rubungo explored a supervised method for sub-word tokenization for Kinyarwanda, a low-resourced language from Africa. A language model architecture is proposed in which most input words are represented by the morphological parse generated by a morphological analyzer. The BPE tokenizer of BERT is used for words that the morphological analyzer cannot decompose, such as foreign words, misspellings and proper names [5].
In Tamil, morphs are usually suffixed to the root or base. Most of the words
in Tamil have two different parts: (1) base and (2) grammatical formatives. It is
possible to add grammatical formatives through six morphophonological operations,
which include affixation, incorporation, compounding, cliticization, doubling and
stem mutation. Sarveswaran et al. proposed morphological parsing techniques to extract morphological features [6].
A considerable amount of named entity recognition work has been done for the Tamil language using machine learning and deep learning algorithms.
The Random Kitchen Sink (RKS) algorithm is used to classify nonlinear data such as text. Unlike conventional nonlinear kernel methods, RKS works faster and performs better with less data. RKS depends only on the feature size and ignores the number of data points. This RKS algorithm was used by Abinaya N., Anand Kumar M. and Soman K. P. to identify the named entities person, location and organization. The model was trained with the FIRE NER dataset, and the results were compared with SVM and CRF models.
Bidirectional long short-term memory captures sequence information in both directions, forward (left to right) and backward (right to left). This Bi-LSTM algorithm has been used to predict named entities in Tamil, Telugu, Kannada, Malayalam and Hindi [7]. The model is also trained on the FIRE NER corpus to identify the named entities person, location and organization.
BERT is a transformer-based model designed to be pre-trained on unlabelled text and then fine-tuned for specific NLP tasks. BERT-based models learn language features automatically, which reduces labour-intensive and skill-dependent feature engineering. A BERT pre-trained model was used to extract chemical and protein entities from Spanish biomedical texts [8]. Clinical named entity recognition in the Chinese language is explored by Chang et al. [9].
The multilingual BERT (Pires et al., n.d.) is trained on 104 languages, including Tamil [10]. The entire Tamil Wikipedia text was extracted to train mBERT and normalized against the other language datasets by oversampling/under-sampling. Baumann explored and evaluated the multilingual BERT model on NER for German and English datasets [11].
The XLM-R model is trained on around 100 languages, including Tamil and the other Dravidian languages (Telugu, Kannada and Malayalam) [12]. This model was selected in the expectation of better accuracy, as it is trained on more than 2.5 terabytes of Common Crawl data.
XLM-R Base and Large were explored for the Turkish NER dataset [13]. mBERT and BERTurk models were also explored along with XLM-R, and all the models were tried with and without a CRF layer. The models with the CRF layer perform better than those without it.
Indic BERT is trained using datasets sourced from news crawls in 11 Indian languages and English [14]; Tamil is one of these 11 languages. This model has the unique advantage of being trained only on Indian languages (plus English), unlike mBERT and XLM-R, which are trained on more than 100 languages.
Multilingual language models like mBERT and XLM-R are mostly trained on 100+ languages, which limits the capacity devoted to any single Indian language. MuRIL addresses this challenge [15]: it is specifically built and trained for Indian languages and supports 16 Indian languages and English. Sharma, Morwal and Agarwal used MuRIL to extract named entities from Hindi texts. They explored two variants of the MuRIL architecture: (1) with a fully connected and SoftMax layer, and (2) fully connected with CRF layers. The performance results are listed in Table 1.
3 Methodology
The proposed NER model includes feature extraction methods and named entity
recognition model (Fig. 1).
WikiANN-ta, the dataset used in this task, is a silver-standard multilingual dataset with 25.6k sentences extracted from the Tamil Wikipedia and annotated with person, location and organization entities. The data is split into train, validation and test with
22.6k, 1.5k and 1.5k examples as per the split ratio of the huggingface WikiANN-ta
dataset. The data is annotated in the standard IOB format. Table 2 describes each tag.
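For reference, the WikiANN Tamil split is also distributed through the Hugging Face datasets hub; a hedged loading sketch (not necessarily the route the authors used) is shown below.

```python
from datasets import load_dataset

wikiann_ta = load_dataset("wikiann", "ta")   # Tamil subset, IOB-tagged
print(wikiann_ta)                            # train / validation / test splits

example = wikiann_ta["train"][0]
print(example["tokens"])
print(example["ner_tags"])                   # integer ids for O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC
```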
The training dataset has 22.6k records. The training data has 16.8k I-ORG tags, which is 28% of all tags; it has 13.1k B-LOC tags and 9.7k I-PER tags, which are 21% and 16% of the total tag count, respectively. B-PER and I-LOC tags are 11% each of the total, with 6.9k and 6.4k tags, respectively.
The validation dataset has 1.5k records. Like the training data, it also has the most I-ORG tags: 1.3k, which is 29% of all tags. It has 831 I-LOC tags and 759 B-LOC tags, which are 19% and 17% of the total tag count. B-PER and I-PER tags are 9% and 12% of the total, with 401 and 536 tags, respectively.
The test dataset has 1.5k records. The test dataset has 1592 I-ORG tags, which is 34% of all tags. It has 671 B-ORG tags and 659 B-LOC tags, about 14% each of the total tag count. I-LOC and I-PER tags are 13% of the total each, with 609 and 603 tags, respectively.
The proposed model is based on a hybrid deep neural network approach which uses BERT-variant models for token representations. The architecture of the BERT-variant language models has 12 encoder layers with 12 self-attention heads and a hidden dimension of 768, so every input token is represented by a 768-dimensional vector. The output of the last layer of the BERT-variant models (mBERT, Indic BERT, XLM-R, MuRIL Large, MuRIL base) is passed to the NER model, which is based on the spacy.TransitionBasedParser.v2 architecture with state type "ner". Out of the five transformer models, MuRIL base is selected for further fine-tuning, as it performs better in terms of accuracy and execution time. There are no out-of-context numeric values such as dates and times in the WikiANN dataset. Dataset sentences do contain numeric values, all of which are tagged as "O"; removing them helps to reduce the class imbalance ratio.
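To illustrate the token-representation side, the sketch below fine-tunes a MuRIL-style encoder as a plain token classifier with the Hugging Face Transformers library. Note that the chapter itself feeds the encoder output into a spaCy TransitionBasedParser.v2 head, which is not reproduced here; the model name and label set are assumptions based on the public MuRIL checkpoint and the WikiANN tag scheme.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "google/muril-base-cased",                 # 12 encoder layers, 768-dimensional hidden states
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

enc = tokenizer("தமிழ்நாடு அரசு", return_tensors="pt")   # example Tamil text
logits = model(**enc).logits                              # (1, sequence_length, 7) per-token scores
print(logits.shape)
```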
3.4 Optimizers
Stochastic gradient descent, Adam and RAdam optimizers are explored as part of
this task.
The MuRIL-based NER model is trained and validated with the same datasets, and the performance of the MuRIL model is better than mBERT, Indic BERT and XLM-R.
In the training phase, the MuRIL base model wrongly predicts 324 I-LOC tags as I-ORG, and 134 B-LOC tags are wrongly predicted as B-ORG. The wrong-prediction pattern looks the same as for the Indic BERT and XLM-R predictions on the validation dataset (Fig. 2).
The F1-score of the training phase is 84.92%, as given in Table 3, and the performance is increased by 1.6% compared to the XLM-R model.
In the testing phase, the MuRIL base model wrongly predicts 143 I-LOC tags as I-ORG, followed by 94 wrong predictions of O tags as I-PER and 90 I-PER tags as B-PER (Fig. 3).
The F1-score of the test phase, 90.15%, is given in Table 4, and the performance is increased by 2% compared to the XLM-R model.
4.2 MuRIL-Large
The MuRIL-Large NER model is trained and validated with the same datasets, and its performance is better than all the BERT models tried before.
In the training phase, the MuRIL Large model wrongly predicts 273 I-LOC tags as I-ORG, 134 O tags as B-PER and 128 B-LOC tags as B-ORG. The wrong-prediction pattern looks the same as for the Indic BERT and XLM-R predictions on the validation dataset (Fig. 4).
The F1-score of the training phase is 85.51%, as given in Table 5, and the performance is increased by 0.5% compared to the MuRIL base model.
In the testing phase, the MuRIL-Large model wrongly predicts 141 I-LOC tags as I-ORG, followed by 90 wrong predictions of I-PER tags as B-PER (Fig. 5).
The F1-score of the test phase, 90.78%, is given in Table 6, and the performance is increased by 0.6% compared to the MuRIL base model.
Precision, recall and F1-score of the above five BERT-based NER models are
compared, and the optimized model is chosen for further fine-tuning.
4.3.1 Precision
In test dataset predictions, MuRIL Large model performs better than all other models
with the highest precision score of 0.9053. The precision score of the MuRIL base
is 0.8961 which is close to MuRIL-Large (0.9% less) (Fig. 6).
4.3.2 Recall
The recall score follows the same pattern as the precision score. In test dataset
predictions, MuRIL Large model performs better than all other models with the
highest recall score of 0.9104. The recall score of the MuRIL base is 0.9069 which
is close to MuRIL-Large (0.35% less) (Fig. 7).
4.3.3 F1-score
The recall and precision scores have the same pattern, so F1-score follows the same.
In test dataset predictions, MuRIL Large model performs better than all other models
with the highest F1-score of 0.9078. The F1-score of the MuRIL base is 0.9015 which
is close to MuRIL-Large (0.69% less) (Fig. 8).
Fig. 9 MuRIL base NER comparison in test dataset—with/without REGEX and morphological
analysis
The MuRIL base model performs better with the categorical cross-entropy loss function, which is the default; the F1-score difference between categorical and sequential categorical cross-entropy is 0.0004 (Fig. 10).
The MuRIL base model also performs better with the Adam optimizer, which is the default, with an F1-score difference of 0.0023 from the next-best optimizer (Fig. 11).
[Fig. 12: F1-score of the MuRIL base NER model across learning rates]
The MuRIL base model performs better with the learning rate 5.E−05. The F1-score
difference is 0.0016 between 5.E−05 and 6.E−05 (Fig. 12).
The MuRIL base model performs better with the dropout rate 0.1. The F1-score
difference is 0.0008 between 0.1 and 0.3 (Fig. 13).
MuRIL base and MuRIL large NER models achieve state-of-the-art results with the highest F1-scores of 90.15 and 90.78, respectively; the improvement from MuRIL base to MuRIL Large is minimal.
[Fig. 13: F1-score of the MuRIL base NER model across dropout rates]
5 Conclusion
This research examines in detail the advantages of using language models such as BERT, multilingual BERT (mBERT), XLM-RoBERTa, Indic BERT and Multilingual Representations for Indian Languages (MuRIL) in NER models with an ensemble approach.
The most appropriate optimizers (stochastic gradient descent, RAdam, Adam), loss functions (categorical cross-entropy and negative log-likelihood) and evaluation measures (precision, recall and F1-score) matching predictions against the gold standard are discussed in detail.
The advantages of language model pre-training methods for the NER task are discussed in detail, as are data augmentation methods such as knowledge-based systems for multilingual named entity recognition, external context retrieval and cooperative learning.
The findings of the experiment show that the NER models with MuRIL base and MuRIL Large perform better than the other models. The F1-score of MuRIL base is almost equal to that of MuRIL-Large, but the NER model with MuRIL base trains and predicts faster owing to its smaller number of encoder layers: MuRIL base has only 12 encoder layers, 50% of those in MuRIL-Large. So MuRIL base is considered the optimal model in terms of both accuracy and run-time performance.
As a result, two different NER models are built: (1) MuRIL Large followed by a feed-forward (FF) network, with the Adam optimizer, categorical cross-entropy loss function, learning rate of 5.E−05 and dropout rate of 0.1, for better accuracy; and (2) MuRIL base followed by an FF network, with the Adam optimizer, categorical cross-entropy loss function, learning rate of 5.E−05 and dropout rate of 0.1, for better run time.
Further work can involve class balancing techniques, as there is a class imbalance issue in the WikiAnn-ta dataset.
Multilingual pre-trained models have produced state-of-the-art results in NLP tasks for low-resourced languages, but recent research shows that pre-training BERT with a high-quality monolingual corpus can outperform the multilingual models. Language model pre-training generally improves performance for any NLP task. Fine-tuning language models such as XLM-R and MuRIL Large using the T-NER library may improve the performance of the model.
Performance can be significantly improved by fetching the document-level context of each sentence and feeding it into the NER model. The document-level context of a sentence is retrieved by searching for semantically relevant texts using a search engine. This mirrors how manual annotators work: they use their domain knowledge to annotate named entities and consult a search engine whenever there is a gap.
Ensembling is one of the proven approaches for complex datasets. A two-step supervised ensemble classification may improve performance: (1) generating classification results using multiple weak classifiers, and (2) integrating those results using a consistency function with voting schemes.
The major contribution of this study is that the proposed NER model can be used to retrieve people, organizations and places from documents in these domains and in other NLP applications such as information extraction, machine translation and question answering. The model can also be used for information extraction from government documents of Tamil Nadu, India, Tamil news articles, and law and medical documents in Tamil.
Acknowledgements The authors thank UNITAR International University for supporting this
research’s publication.
References
1. Abinaya N, Anand Kumar M, Soman KP (2015) Randomized kernel approach for named entity
recognition in Tamil. Indian J Sci Technol 824
2. Rajendran S, Cn S, Srinivasan R, Subalalitha CN (2019) Automated named entity recognition
from Tamil documents 2019. In: IEEE 1st international conference on energy, systems and
information processing, ICESIP. IEEE (2019). https://www.researchgate.net/publication/338362244
3. Hariharan V, Anand Kumar M, Soman KP (2019) Named entity recognition in Tamil language
using recurrent based sequence model. In: Lecture notes in networks and systems, pp 91–99.
Springer
4. Mohseni M, Fondazione AT, Kessler B (n.d.) MorphoBERT: a Persian NER system with BERT
and morphological analysis
5. Nzeyimana A, Rubungo AN (n.d.) KinyaBERT: a morphology-aware Kinyarwanda language
model. https://github.com, last accessed 26 Nov 2022
6. Sarveswaran K, Dias G, Butt M (2021) ThamizhiMorph: a morphological parser for the Tamil
language. Mach Transl 35:37–70. https://doi.org/10.1007/s10590-021-09261-5
7. Nagesh Bhattu S, Satya Krishna N, Somayajulu DVLN (n.d.) idrbt-team-a@IECSIL-FIRE-
2018: named entity recognition of Indian languages using Bi-LSTM. http://www.idc.upenn.edu, last accessed 5 Feb 2022
8. Sun C, Yang Z (n.d.) Transfer learning in biomedical named entity recognition: an evaluation
of BERT in the PharmaCoNER task, pp 100–104
9. Chang Y, Kong L, Jia K, Meng Q (2021) Chinese named entity recognition method based
on BERT. In: 2021 IEEE international conference on data science and computer application
(ICDSCA), pp 294–299. IEEE. https://ieeexplore.ieee.org/document/9650256/, last accessed
23 Jan 2022
Transformer-Based Named Entity Recognition Model—Tamil Language 267
10. Pires T, Schlinger E, Garrette D (n.d.) How multilingual is Multilingual BERT? pp 4996–5001.
https://github.com/google-research/bert, last accessed 12 Feb 2022
11. Baumann A (2019) Multilingual language models for named entity recognition in German and
English. In: Proceedings of the student research workshop associated with RANLP 2019, pp
21–27. Incoma Ltd. https://acl-bg.org/proceedings/2019/RANLPStud%202019/pdf/RANLPStud004.pdf
12. Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave E, Ott M,
Zettlemoyer L, Stoyanov V (n.d.) Unsupervised cross-lingual representation learning at scale.
https://github.com/facebookresearch/cc, last accessed 12 Feb 2022
13. Aras G, Makaroğlu D, Demir S, Cakir A (2021) An evaluation of recent neural sequence
tagging models in Turkish named entity recognition. Exp Syst Appl 182
14. Kakwani D, Kunchukuttan A, Golla S, Gokul NC, Bhattacharyya A, Khapra MM, Kumar P
(n.d.) IndicNLPSuite: monolingual corpora, evaluation benchmarks and pre-trained multilin-
gual language models for Indian languages, vol 3, p 5. https://github.com/AI4Bharat/indicnlp_catalog, last accessed 12 Feb 2022
15. Khanuja S, Bansal D, Mehtani S, Khosla S, Dey A, Gopalan B, Margam DK, Aggarwal P,
Nagipogu RT, Dave S, Gupta S, Chandra S, Gali B, Subramanian V, Talukdar P (n.d.) MuRIL:
multilingual representations for Indian languages. https://oscar-corpus.com, last accessed July
Feb 2022
16. Sharma R, Morwal S, Agarwal B (2022) Named entity recognition using neural language model
and CRF for Hindi language. Comput Speech Lang 74:101356
A Comparative Study of Methods
for Topic Modelling in News Articles
Abstract The past few decades have seen an increase in textual data and in the influence of the news media. With the rise in available data, especially textual data from news media, it is imperative to be able to categorise news topics quickly. The primary aim of this research is to suggest a method for automatically identifying news topics in articles. The dataset used was the news category dataset published on Kaggle, comprising 210,294 headlines and abstracts from HuffPost between 2012 and 2022, with a total of 42 categories and six columns. Traditional modelling techniques did not perform well in comparison with Top2Vec, NMF or BERTopic. This research confirms the efficacy of Top2Vec and BERTopic, followed by NMF, LDA and LSA, for analysing news category data from a human-interpretation perspective. Though BERTopic was able to deduce 1145 topics from the data, it could not discard unwanted words such as "to", "say" and "for" which add no value to the topic semantics. In summary, TF-IDF proved to be the best feature extraction technique and Top2Vec the best topic modelling technique.
1 Introduction
Textual data provides one of the most significant sources of knowledge and informa-
tion throughout history till the present day [1]. The past few decades have seen an
increase in the influence of the news media on the public via the dissemination of
S. D. Rajan · M. Jayabalan
Liverpool John Moores University, Liverpool, UK
T. Coombs (B)
British American Tobacco, Southampton, UK
e-mail: [email protected]
N. A. Ismail
UNITAR International University, Petaling Jaya, Selangor, Malaysia
2 Method
2.1 Dataset
The dataset used in this research was the news category dataset published on Kaggle, comprising 210,294 headlines and abstracts from HuffPost between 2012 and 2022. The dataset is composed of a total of 42 categories and six columns (category; online article; web link; headline; short description; date published).
The models were implemented with five unsupervised topic modelling techniques: latent semantic analysis (LSA), latent Dirichlet allocation (LDA), non-negative matrix factorisation (NMF), BERTopic and Top2Vec. The first approach was LSA, which used the singular value decomposition (SVD) function of the open-source Python library scikit-learn with ten n_components and the randomised algorithm to build the model. We implemented SVD with dimensions of 5, 15 and 20, using a vectoriser to derive features for LSA. To build the LDA model, we used scikit-learn's LDA module with ten n_components, the online learning method and a learning offset of 25, achieving a coherence score of 0.41. This was followed by hyperparameter tuning to find the optimum k, done by grid search over the range 5 to 40. An optimum k value of 20 was found for this dataset, achieving a coherence of 0.43, and the model was visualised using pyLDAvis to generate inter-topic distance maps. In contrast to LSA, NMF is a non-probabilistic decomposition algorithm using matrix factorisation, belonging to the group of linear-algebraic algorithms.
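A condensed sketch of these scikit-learn set-ups is shown below, with toy documents and a reduced number of components (the study itself used ten n_components); the vectoriser choices are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

docs = ["election results shake the markets",
        "new vaccine trial shows promise",
        "team wins the championship final",
        "markets rally after election night"]

# LSA: TF-IDF features followed by randomised truncated SVD.
X_tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
lsa = TruncatedSVD(n_components=2, algorithm="randomized").fit(X_tfidf)

# LDA: count features with the online learning method and learning_offset=25, as in the text.
X_counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                learning_offset=25.0, random_state=0).fit(X_counts)
print(lsa.components_.shape, lda.components_.shape)
```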
The LSA model uses bag of words (BoW) features and was built using the Gensim module. Next, the coherence measure (c_v) was plotted against the number of topics to find the optimal number of topics with good coherence. It was observed that after topic nine there was a notable drop in coherence; hence, nine topics were chosen as "k", giving a coherence score of 0.29 for the LSA model.
For LDA, on the other hand, hyperparameter tuning was performed to find the optimum "k", showing that increasing the number of topics increases the coherence score until a peak at 25 topics. Therefore, the optimal number of topics for the LDA model was chosen as 25, with a coherence score of 0.43. However, the high coherence with 25 topics did not reflect good topics, as many of the keywords were repeated: the model parameters need fine-tuning to minimise overlap. Based on this, the LDA model parameters were adjusted to chunk size = 10,000, iterations = 100 and passes = 40, giving a model coherence of 0.39. With the above observations, it can be confirmed that the LDA model performs better than the LSA model after hyperparameter tuning.
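For illustration, a Gensim sketch of the LDA configuration and the c_v coherence evaluation referred to above is given below; chunksize, iterations and passes follow the values quoted in the text, while the tokenised corpus is a toy stand-in.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["election", "markets", "rally"],
         ["vaccine", "trial", "health"],
         ["team", "championship", "final"],
         ["markets", "election", "economy"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               chunksize=10000, iterations=100, passes=40, random_state=1)

coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                           coherence="c_v").get_coherence()
print(coherence)
```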
The NMF model was built using Gensim module with special attention focused on
hyperparameter tuning (kappa, chunk size, passes). Figure 2 shows the NMF model
with a coherence score of 0.42 and topic number 25.
LDA, on the other hand, is more focused on COVID-19-related topics and generates repetitive topics about politics. NMF appears to outperform LDA, as the NMF model distinguishes between all the indicated topics. However, because both LDA and NMF rely on hyperparameters, the results are not unexpected, but neither had a thorough grasp of the phenomenon. The next two approaches focus on embedding techniques to overcome this issue.
Fig. 2 Hyperparameter tuning for optimal topic number for LDA model
still, Top2Vec was able to identify diverse topics compared to BERTopic, which produced multiple concepts and many overlapping topics. On the other hand, all the traditional topic modelling techniques, including LDA and LSA, failed to extract meaningful topics. Since NMF was able to provide some intriguing topics, this study recommends NMF over LDA.
The term score decline per topic shows the importance of each keyword in the extracted topic. It is an interactive Plotly line chart with the term rank on the x-axis and the c-TF-IDF score on the y-axis. Another interesting visualisation is the inter-topic distance map, which depicts the distance between topics: themes that are highly distinct from one another are far apart, whereas similar topics are closer together. As shown in Fig. 5, there are six topic groups covering all the topics; within the same topic group, topics with similar semantic meanings are clustered together.
4 Conclusions
In summary, TF-IDF proved to be the best feature extraction technique and Top2Vec the best topic modelling technique. Traditional modelling techniques did not perform well in comparison with Top2Vec, NMF or BERTopic. Though BERTopic was able to deduce 1145 topics from the data, it could not discard unwanted words such as "to", "say" and "for" which add no value to the topic semantics. Additionally, BERTopic requires an extra data pre-processing step to remove noisy words. Special attention should be paid to pre-processing and to optimising the model hyperparameters when the model in scope is LDA, LSA or NMF, as these can require domain knowledge and sufficient resources. It should also be noted that each of these models may perform differently on different datasets. Future studies should use the Spark environment on the same dataset, which offers faster hyperparameter tuning for the traditional algorithms. This will save media analysts from categorising news articles manually.
Acknowledgements The authors would like to thank UNITAR International University for
supporting the publication of this paper.
References
1. Zhou Z, Qin J, Xiang X, Tan Y, Liu Q, Xiong NN (2020) News text topic clustering optimized
method based on TF-iDF algorithm on spark. Comput Mater Continua 621:217–231
2. Van Dijk TA (1995) Power and the news media. Polit Commun Action 6(1):9–36
3. Schudson M (2002) The news media as political institutions. Annu Rev Polit Sci 5(1):249–269
4. Holt K, Ustad Figenschou T, Frischlich L (2019) Key dimensions of alternative news media.
Digit J 7(7):860–869
5. Alam KM, Hemel MTH, Muhaiminul Islam SM, Akther A (2020) Bangla news trend obser-
vation using LDA based topic modeling. In: ICCIT 2020—23rd international conference on
computer and information technology, proceedings, pp 19–21
6. Xia L, Luo D, Zhang C, Wu Z (2019) A survey of topic models in text classification. In: 2019
2nd international conference on artificial intelligence and big data, ICAIBD 2019, pp 244–250
7. Zosa E, Granroth-Wilding M (2019) Multilingual dynamic topic model. In: International
conference recent advances in natural language processing, RANLP, 2019-Septe, pp 1388–1396
8. Tabassum A, Patil RR (2020) A survey on text pre-processing & feature extraction techniques
in natural language processing. Int Res J Eng Technol [online] 4864–4867. Available at: www.irjet.net
9. Sethia K, Saxena M, Goyal M, Yadav RK (2022) Framework for topic modeling using BERT,
LDA and K-means. In: 2022 2nd international conference on advance computing and innovative
technologies in engineering, ICACITE 2022, pp 2204–2208
10. Rahmawati D, Khodra ML (2016) Word2vec semantic representation in multilabel classifica-
tion for Indonesian news article. In: 4th IGNITE conference and 2016 international conference
on advanced informatics: concepts, theory and application, ICAICTA 2016, pp 0–5
11. Qiang J, Qian Z, Li Y, Yuan Y, Wu X (2022) Short text topic modeling techniques, applications,
and performance: a survey. IEEE Trans Knowl Data Eng 343:1427–1445
12. Rahamat Basha S, Rani JK (2019) A comparative approach of dimensionality reduction
techniques in text classification. Eng Technol Appl Sci Res 96:4974–4979
A Comparative Study of Methods for Topic Modelling in News Articles 277
13. Bansal S, Srivastava A, Arora A (2017) Topic modeling driven content based jobs recommen-
dation engine for recruitment industry. Procedia Comput Sci [online] 122:865–872. Available
at: https://doi.org/10.1016/j.procs.2017.11.448
14. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
15. Chen F, Xie S, Li X, Li S, Tang J, Wang T (2019a) What topics do images say: a 70 neural
image captioning model with topic representation. In: Proceedings—2019 IEEE international
conference on multimedia and expo workshops, ICMEW 2019, pp 447–452
16. Deng X, Smith R, Quintin G (2020) Semi-supervised learning approach to discover enterprise
user insights from feedback and support [online]. Available at: http://arxiv.org/abs/2007.09303
17. Zhang F, Gao W, Fang Y (2019) News title classification based on sentence-LDA model and
word embedding. In: Proceedings—2019 international conference on machine learning, big
data and business intelligence, MLBDBI 2019, pp 237–240
18. Mohamed AHH, Tawfik H, Norton L, Al-Jumeily D (2011, April) e-HTAM: a technology
acceptance model for electronic health. In: 2011 international conference on innovations in
information technology, pp 134–138. IEEE
19. Al-Jumeily D, Hussain A, Alghamdi M, Dobbins C, Lunn J (2015) Educational crowdsourcing
to support the learning of computer programming. Res Pract Technol Enhanc Learn 10:1–15
Application of Deep Learning Algorithms
to Terahertz Images for Detection
of Concealed Objects
Abstract The safety of the public at large venues is of utmost importance, and it is therefore important to detect threats from concealed objects, especially on human bodies. Terahertz imaging has gained popularity over the last few years due to its ability to detect objects concealed under fabric without harming humans or invading their privacy. However, terahertz images suffer from poor resolution and a low signal-to-noise ratio. Deep learning, especially feedforward neural networks, has recently shown high performance in classifying images. Therefore, this study utilised deep learning for the detection of concealed objects on human bodies using the public active terahertz imaging (ATZ) dataset, which contains images of 11 classes of tiny objects. An end-to-end framework was applied, involving image enhancement using wavelet filtering, object localisation using a skip-attention generative adversarial network (SAGAN) and classification using a forward-forward mixed convolution with Gaussian affinity network (FMCGNet). The results showed that wavelet filtering did not affect the performance of the deep learning models. SAGAN showed a performance accuracy of 68.4%, which was related to insufficient training of the model, which requires high computational power. However, the accuracy of classifying images with anomalies was high when FMCGNet was applied, reflected in a true positive rate of 91.9% even though it was applied to an imbalanced dataset. In summary, the proposed end-to-end framework successfully identified concealed objects on human bodies in an efficient way. Future work involves applying this approach to different types of imaging datasets to further understand its strengths and limitations.
1 Introduction
2 Methods
2.1 Dataset
The dataset used in this study was the public active terahertz imaging (ATZ) dataset, which contains images of tiny items [20]. As with any terahertz (THz) images, the images are extremely noisy and of low contrast [9, 21]. Moreover, the images contain tiny objects of different shapes, sizes, scales and orientations, which makes the dataset imbalanced.
The dataset contains 3157 images with 1349 objects of 11 different classes, relating to the material and variety of the objects in the images. Images were split into training (n = 2555) and test (n = 602) sets. The 11 classes are: Kitchen Knife (KK) (n = 100); Gun (GA) (n = 116); Metal Dagger (MD) (n = 64); Scissors (SS) (n = 96); Water Bottle (WB) (n = 107); Ceramic Knife (CK) (n = 129); Cell Phone (CP) (n = 129); Key Chain (KC) (n = 78); Leather Wallet (LW) (n = 78); Cigarette Lighter (CL) (n = 163); and Unknown (UN) (n = 289). The dataset has 1218 images with objects and 1349 images without objects. Qualitative statistics of the dataset show an image size and format of 225 × 880 JPEG, a resolution of 5 × 5 mm and object sizes in pixels (min–max) of 390–13,390. File format details are given in Table 1.
The locations of objects varied across images; each object was located in one of these positions: left arm (LA); abdomen (S); right arm (RA); back (B); chest (C); left thigh (LT); hip (N); waist (W); right calf (RL); left calf (LL); or right thigh (RT).
Considering the challenges in the dataset highlighted above, it was not possible to perform classical supervised learning, so an end-to-end detection framework was adopted (Fig. 1). The framework comprises three stages: image enhancement by wavelet filtering [6], a skip-attention generative adversarial network (SAGAN) [19] and a forward-forward mixed convolution with Gaussian affinity network (FMCGNet) [18].
In this framework, SAGAN detects anomalies in images, whereas FMCGNet classifies them. Wavelet filtering removed the noise from the images by isolating the noise from the signal using the biorthogonal wavelet transformation [6].
SAGAN is a type of GAN that is able to identify unusual objects in an image by flagging them as anomalies. Unlike a traditional GAN, SAGAN generates indications (of anomalies) from all feature locations rather than only from low-resolution feature locations [19]. As with other GAN models, it consists of a generator and a discriminator network. Because of the small sample size of the present dataset, depthwise separable convolution (DSC) was applied in order to overcome the overfitting problem [19]. A U-Net architecture was used, and three objective functions were optimised during training, related to image reconstruction loss, latent representation loss and adversarial loss. For evaluating the SAGAN model, two metrics were used: the final anomaly score and the raw anomaly score.
FMCGNet builds on the forward-forward algorithm, first proposed by Hinton to overcome limitations of backpropagation, which must have complete knowledge of the computation performed in the forward pass and otherwise cannot proceed [18]. In addition, backpropagation in huge networks, where billions of parameters are perturbed, does not consider variables perturbing simultaneously and averages out the gradient noise produced by other perturbing variables, which introduces error in the algorithm's outcome [22]. FMCGNet addresses this error by calculating gradients locally, with a local objective function at each layer, which yields more accuracy in separating positive from negative samples. This accuracy is achieved by adjusting the weights in each layer to raise the goodness for positive samples and reduce the goodness for negative samples [18]. Two types of goodness functions were used for updating the weights: the sum of squares and the negative sum of squares.
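A toy sketch of this forward-forward training rule (a local goodness objective per layer, with goodness defined as the sum of squared activations) is given below in PyTorch; it uses a simple fully connected layer rather than the convolutional, Gaussian-affinity layers of FMCGNet itself.

```python
import torch
import torch.nn.functional as F

class FFLayer(torch.nn.Module):
    """One forward-forward layer with its own local objective."""

    def __init__(self, d_in, d_out, threshold=2.0, lr=1e-4):
        super().__init__()
        self.linear = torch.nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # Normalise the input so only its direction (not the previous layer's goodness) matters.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-6)
        return torch.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)   # goodness of positive samples
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)   # goodness of negative samples
        # Push goodness above the threshold for positives and below it for negatives.
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()

layer = FFLayer(64, 32)
x_pos, x_neg = torch.randn(8, 64), torch.randn(8, 64)   # stand-ins for positive/negative patches
for _ in range(5):
    print(layer.train_step(x_pos, x_neg))
```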
The dataset analysed in this study was not balanced in terms of the number of objects present. Moreover, the height and width of individual objects varied, and their dimensions did not all follow a normal distribution. Table 2 summarises the dimensions of the 11 objects with the mean and median of each. The width of the objects ranged from 12 pixels (for the cigarette lighter) to 132 pixels (for the kitchen knife), and the minimum height of objects ranged from 15 pixels (for the cigarette lighter) to 49 pixels (for the gun).
Thus, the objects were relatively smaller than the usual sizes that have been reported for them [9]. Of all these objects, kitchen knives were larger than the other items and cigarette lighters were the smallest. This is visible in both the dimensions and the areas of the objects: kitchen knives showed areas in the range of 6000–10,000 square pixels, whereas cigarette lighters were below 1000 square pixels (Fig. 2).
The location these items were concealed in varied depending on whether the
person concealing them was a male or female. Figure 3 shows a scatter of the different
types of items across both genders.
Table 2 Width and height for the different objects evaluated in the study

| Object | Width min | Width median | Width mean | Width max | Height min | Height median | Height mean | Height max |
|---|---|---|---|---|---|---|---|---|
| Cell phone | 26 | 42 | 44.1 | 77 | 48 | 68.5 | 67.4 | 83 |
| Ceramic knife | 14 | 38 | 39.5 | 78 | 27 | 74 | 70.8 | 89 |
| Cigarette lighter | 12 | 27.5 | 28.6 | 52 | 15 | 39 | 36.7 | 56 |
| Gun | 26 | 59 | 61.5 | 96 | 49 | 85 | 83.8 | 107 |
| Key chain | 18 | 31 | 31 | 52 | 21 | 33.5 | 35 | 56 |
| Kitchen knife | 31 | 56 | 64.1 | 132 | 45 | 133 | 128 | 152 |
| Leather wallet | 36 | 50 | 50.5 | 66 | 36 | 59 | 57.5 | 70 |
| Metal dagger | 14 | 37 | 38.3 | 74 | 43 | 80 | 80.4 | 108 |
| Scissors | 23 | 41 | 43.1 | 81 | 35 | 80.5 | 77.7 | 100 |
| Water bottle | 20 | 34 | 34.4 | 62 | 43 | 92.5 | 89.2 | 129 |
Fig. 2 Box plots of the areas (left and middle) and scatter plot of the height and width (right) of
the different objects
Fig. 3 Items scatter from the front and back side across both genders
Hence, males hid more objects on their backs than females did. On the other hand, both genders concealed equal volumes of objects in their calf (back and front) and abdominal areas. Moreover, both genders kept their hands free. Figure 4 shows the percentage of different objects across body parts.
Authentic images were prepared by combining the original ATZ images (128 × 128) with 20% overlapping to form the authentic images, followed by truncating each image into 27 patches (Fig. 5). Patches falling outside the human region and those with 99% dark pixels (based on a grey threshold value of 30) were excluded. Wavelet-based transformation techniques and global threshold-based denoising enhanced the input images. The denoising settings used were: {'wavelet': 'sym4', 'method': 'VisuShrink', 'decomposition_level': 2, 'mode': 'soft'}. This resulted in a mean peak SNR of 45.46 for 'sym4' wavelets and 45.08 for 'bior4.4' wavelets. Image augmentation was performed by rotating images or changing the brightness settings, in order to generate sufficient negative images for the FMCGNet model to learn from. Additional negative images were generated using a linear classifier with a softmax function [23]. After the generation of negative images, anomalies in the data were detected using SAGAN, where images with at least one object were treated as abnormal (n = 4725) and the remaining as normal (n = 71,744). Only normal images were included in the training set, whereas the test set contained a mixture of normal and abnormal images in a ratio of 1:1.
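A minimal sketch of the wavelet denoising step with the settings quoted above ('sym4', VisuShrink, decomposition level 2, soft thresholding), using scikit-image on a synthetic stand-in image, is shown below.

```python
import numpy as np
from skimage.restoration import denoise_wavelet
from skimage.metrics import peak_signal_noise_ratio

rng = np.random.default_rng(0)
clean = np.zeros((128, 128))
clean[40:90, 50:80] = 1.0                                       # toy "object" patch
noisy = np.clip(clean + rng.normal(0, 0.2, clean.shape), 0, 1)  # additive Gaussian noise

denoised = denoise_wavelet(noisy, wavelet="sym4", method="VisuShrink",
                           mode="soft", wavelet_levels=2, rescale_sigma=True)
print(peak_signal_noise_ratio(clean, noisy, data_range=1.0),
      peak_signal_noise_ratio(clean, denoised, data_range=1.0))
```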
Unlike traditional deep learning algorithms, FMCGNet trains each of its layers with both positive and negative images [18, 24]; therefore, both positive and negative patches were used in the training procedure. The ratio of training to test set was 4:1, with the training set containing 7604 images and the test set containing 1900 images.
For FMCGNet development and evaluation, an intermediate dataset with a clear class imbalance was used to challenge the Gaussian affinity loss function [10, 26]. Five layers were used in the FMCGNet model, where each layer had a separate loss function. The training parameters included the Adam optimiser with lr = 0.0001, an affinity weighting parameter of λ = 0.5 and a goodness threshold of g = 2. Hybrid negative data made FMCGNet learn faster after being trained for 30 epochs, as seen in a goodness score of 3.95. Moreover, the true positive rate was high (TPR = 0.9188), which confirmed the model's accuracy. In addition, the model showed an FPR of 0.1005, an F1-score of 0.8996, an accuracy of 0.8501 and a goodness of 2.015. These results were not affected by batch size, training or time.
4 Conclusion
and an anomaly score of 0.1228, which showed that further training was required. On the other hand, FMCGNet showed very high accuracy, with a TPR of 91.9% when applied to the imbalanced dataset, demonstrating that it outperformed other machine learning models. The limitations encountered in the study related to sample size as well as computing power. Future work involves building on the FMCGNet architecture by applying it to larger datasets in order to maximise its performance.
Acknowledgements The authors thank UNITAR International University for supporting the publication of this research.
References
1. Chen HM, Lee S, Rao RM, Slamani MA, Varshney PK (2005) Imaging for concealed weapon
detection: a tutorial overview of development in imaging sensors and processing. IEEE Signal
Process Mag 22(2):52–61
2. Kowalski M, Kastek M, Palka N, Polakowski H, Szustakowski M, Piszczek M (2013) Inves-
tigation of concealed objects detection in visible, infrared and terahertz ranges of radiation.
Photon Lett Poland 5(4):167–169
3. Helal S, Sarieddeen H, Dahrouj H, Al-Naffouri TY, Alouini MS (2022) Signal processing and
machine learning techniques for terahertz sensing: an overview. IEEE Signal Process Mag
39(5):42–62
4. Hu BB, Nuss MC (1995) Imaging with terahertz waves. Opt Lett 20(16):1716–1718
5. Cheng YZ, Withayachumnankul W, Upadhyay A, Headland D, Nie Y, Gong RZ et al (2014)
Ultrabroadband reflective polarization convertor for terahertz waves. Appl Phys Lett 105(18)
6. Danso S, Liping S, Hu D, Odoom J, Quancheng L, Mushtag M (2022) Security inspection image
processing methods applying wavelet transform filters on Terahertz active images. Revista de
Investigaciones Universidad del Quindío 34(1):37–51
7. Velayudhan D, Hassan T, Damiani E, Werghi N (2022) Recent advances in baggage threat
detection: a comprehensive and systematic survey. ACM Comput Surv 55(8):1–38
8. Ahmed A, Velayudhan D, Hassan T, Hassan B, Dias J, Werghi N (2022, May) Baggage threat
detection under extreme class imbalance. In: 2022 2nd international conference on digital
futures and transformative technologies (ICoDT2), pp 1–6. IEEE
9. Liang D, Pan J, Yu Y, Zhou H (2019) Concealed object segmentation in terahertz imaging via
adversarial learning. Optik 185:1104–1114
10. Sun Q, He Y, Liu K, Fan S, Parrott EP, Pickwell-MacPherson E (2017) Recent advances in
terahertz technology for biomedical applications. Quant Imaging Med Surg 7(3):345
11. Li R, Li C, Li H, Wu S, Fang G (2018) Study of automatic detection of concealed targets in
passive terahertz images for intelligent security screening. IEEE Trans Terahertz Sci Technol
9(2):165–176
12. Guo Y, Liu Y, Oerlemans A, Lao S, Wu S, Lew MS (2016) Deep learning for visual
understanding: a review. Neurocomputing 187:27–48
13. Wang J, Zhu H, Wang SH, Zhang YD (2021) A review of deep learning on medical image
analysis. Mob Netw Appl 26:351–380
14. Suganyadevi S, Seethalakshmi V, Balasamy K (2022) A review on deep learning in medical
image analysis. Int J Multimed Inform Retriev 11(1):19–38
15. Bharathi PT, Subashini P (2011) Optimization of image processing techniques using neural
networks: a review. WSEAS Trans Inf Sci Appl 8(8):300–328
16. Al-Saffar AAM, Tao H, Talab MA (2017, October) Review of deep convolution neural network
in image classification. In: 2017 International conference on radar, antenna, microwave,
electronics, and telecommunications (ICRAMET), pp 26–31. IEEE
17. Chen L, Li S, Bai Q, Yang J, Jiang S, Miao Y (2021) Review of image classification algorithms
based on convolutional neural networks. Rem Sens 13(22):4712
18. Hinton G (2022) The forward-forward algorithm: some preliminary investigations. arXiv
preprint arXiv:2212.13345
19. Liu G, Lan S, Zhang T, Huang W, Wang W (2021, September) SAGAN: skip-attention GAN
for anomaly detection. In: 2021 IEEE international conference on image processing (ICIP), pp
2468–2472. IEEE
20. Terahertz Imaging Dataset (2022) Available at: https://github.com/LingLIx/THz_Dataset.
Accessed: 31 Aug 2023
21. Shen X, Dietlein CR, Grossman E, Popovic Z, Meyer FG (2008) Detection and segmentation
of concealed objects in terahertz images. IEEE Trans Image Process 17(12):2465–2475
22. Karner C, Kazeev V, Petersen PC (2022) Limitations of gradient descent due to numerical
instability of backpropagation. arXiv preprint arXiv:2210.00805
23. Lin D, Lin Z, Sun L, Toh KA, Cao J (2017, May) LLC encoded BoW features and softmax
regression for microscopic image classification. In: 2017 IEEE international symposium on
circuits and systems (ISCAS), pp 1–4. IEEE
24. Yeh AB, Huwang L, Wu YF (2004) A likelihood-ratio-based EWMA control chart for
monitoring variability of multivariate normal processes. IIE Trans 36(9):865–879
25. Yang C, Lan S, Huang W, Wang W, Liu G, Yang H et al (2022, September) A transformer-
based GAN for anomaly detection. In: International conference on artificial neural networks.
Springer Nature, Cham, pp 345–357
26. Hayat M, Khan S, Zamir SW, Shen J, Shao L (2019) Gaussian affinity for max-margin class
imbalanced learning. In: Proceedings of the IEEE/CVF international conference on computer
vision, pp 6469–6479
27. Xiao H, Zhang R, Wang H, Zhu F, Zhang C, Dai H, Zhou Y (2018, October) R-PCNN method to rapidly detect objects on THz images in human body security checks. In: 2018 IEEE SmartWorld, ubiquitous intelligence & computing, advanced & trusted computing, scalable computing & communications, cloud & big data computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pp 1777–1782. IEEE
Multivariate Comparative Analysis
of Statistical and Deep Learning Models
for Prediction Hardware Failure
Saurabh Gupta, Raghad Alshabandar, Chong Kim Loy,
and Ammar H. Mohammed
S. Gupta (B)
Gurgaon, India
e-mail: [email protected]
R. Alshabandar · A. H. Mohammed
Baghdad, Iraq
C. K. Loy
UNITAR International University, Petaling Jaya, Selangor, Malaysia
e-mail: [email protected]
1 Introduction
The COVID-19 pandemic has accelerated the adoption of digital products across
various sectors, including businesses and personal use, regardless of age. This surge
in digital dependence has intensified the demand for high availability of applications
hosted on in-house servers or cloud infrastructure. According to available online data,
even a few minutes of application downtime can result in significant financial losses
for organizations and individuals. Each year, several large companies experience
server or application outages. In December 2022, Amazon AWS faced an outage
that disrupted services globally. Reports indicate that the outage was caused by “an
impairment of several network devices,” leading to increased API error rates [1].
Similarly, in October 2021, Facebook, Instagram, and WhatsApp went down for six
hours, affecting 3.5 billion people worldwide [2].
While the concept of Consistency, Availability, and Partition Tolerance (CAP)
is crucial for ensuring system reliability, it is becoming increasingly evident that
additional measures are necessary to maintain seamless cloud operations. As orga-
nizations embrace AI/ML, they need to develop more robust MLOps or AIOps
solutions that can effectively detect anomalies and predict faults [3].
Despite the growing use of machine learning for hardware failure prediction,
most previous studies have relied on a single type of feature, such as SMART indi-
cators or CPU utilization metrics. Only a limited number of studies have explored
hybrid feature spaces. To date, no study has focused on comparing different ML/AI
algorithms solely on server health-monitored data to predict system failures. This is
primarily because OPS teams typically have access to both server health data and
success or failure logs for application calls [4].
This paper proposes a multivariate analysis approach to identify the most effective
technique for predicting cloud server failures based on usage patterns (CPU, memory,
and network). By forecasting server failures using current usage data, organizations
can prevent cloud failures, reduce associated costs, and minimize downtime for online
and mobile applications hosted on the affected cloud infrastructure.
The performance of deep learning models for failure prediction in virtual machines
will be compared with baseline models to determine the most suitable approach for
cloud server failure prediction. Additionally, the study will provide valuable insights
to guide future research in developing novel methodologies for failure prediction.
The remainder of this paper is structured as follows: Section 2 presents a compre-
hensive literature review of related work. Section 3 details the methodology and
presents the obtained results. Finally, Section 4 outlines the conclusions and discusses
potential future research directions.
2 Related Works
To identify and prevent cloud-based failures before they materialize, the authors of [5] advocate a “combined system metrics approach”. They integrated system indicators such as CPU utilization with SMART hard drive data. The system was rigorously evaluated using over 100 cloud server datasets and four AI algorithms: random forest, gradient boosting, long short-term memory, and gated recurrent unit. The results were subsequently subjected to correlation analysis. Random forest (0.99 accuracy) and gradient boosting (0.99) emerged as the most effective methods.
A study by [5] demonstrated the feasibility of predicting failure using a support
vector machine in a cloud environment, highlighting the sensitivity to the failure
threshold. To prevent operational system failures on individual hosts, Chalermar-
rewong et al. [6] employed multivariate analysis, analyzing log files that docu-
ment factors such as system resources (CPU, RAM, and network throughput). The
precision achieved ranged from 97.32 to 99.89, depending on the system.
One study [7] proposed a machine learning approach to predict virtual machine
(VM) failure using a time series stochastic model. The proposed method utilizes
an autoregressive integrated moving average (ARIMA) failure predictor module to
forecast VM failure. The method’s performance was evaluated on non-stationary
failure traces of VMs. The observations revealed that the proposed method accurately
predicts failure with a mean absolute error (MAE) of 0.0344786.
Shi et al. [8] combined generative adversarial networks (GANs) with an improved long short-term memory (LSTM) model called convolution-LSTM (C-LSTM) to predict hard disk failure. Their findings indicate that C-LSTM achieves promising results with an AUC of 0.76.
The same study [7] reports that the ARIMA-based predictor achieves satisfactory performance, with a root mean square error (RMSE) of 0.0457443, a mean absolute error (MAE) of 0.0344786, and a mean absolute scaled error (MASE) of 0.6036391.
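To make the ARIMA-based approach concrete, the sketch below fits an ARIMA model to a toy utilisation trace with statsmodels and reports the MAE; the order (1, 1, 1) and the synthetic series are illustrative assumptions, not the configuration used in [7].

```python
# Illustrative ARIMA forecasting sketch; order and data are assumptions.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
cpu = pd.Series(50 + np.cumsum(rng.normal(0, 1, 300)))  # toy non-stationary trace

train, test = cpu[:250], cpu[250:]
model = ARIMA(train, order=(1, 1, 1)).fit()      # (p, d, q) chosen for illustration
forecast = model.forecast(steps=len(test))

print("MAE:", mean_absolute_error(test, forecast))
```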
3 Methods
3.1 Dataset
The dataset utilized in this study was obtained from the Bitbrain open-source data
repository [9]. It comprises 1750 virtual machine (VM) data points suitable for
multivariate analysis. The dataset encompasses a comprehensive range of features, including CPU usage, memory usage, and network transmitted throughput.
This research employed various data preprocessing techniques to ensure the dataset's quality and suitability for model training. These included identifying and correcting structural errors, removing inaccurate or corrupted data, and eliminating duplicate entries. The data were then normalized using MinMaxScaler, which scales each feature to a range between 0 and 1 and helps improve the performance of some machine learning models. Missing values were carefully addressed by either removing the corresponding data points or imputing them using appropriate techniques.
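A minimal sketch of this preprocessing pipeline, using pandas and scikit-learn, is shown below; the file name and column names are illustrative placeholders, not the actual Bitbrains field names.

```python
# Preprocessing sketch: duplicate removal, missing-value handling, and
# Min-Max scaling to [0, 1]. File and column names are assumed for illustration.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("vm_metrics.csv")             # hypothetical VM trace file
df = df.drop_duplicates()                      # remove duplicate entries
df = df.dropna(subset=["cpu_usage_pct"])       # drop rows missing the target
df = df.fillna(df.median(numeric_only=True))   # impute remaining gaps

features = ["cpu_usage_pct", "memory_usage_kb", "net_transmit_kbps"]
scaler = MinMaxScaler()                        # scales each feature to [0, 1]
df[features] = scaler.fit_transform(df[features])
```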
4 Results
The machine learning models were employed to forecast CPU utilization, memory utilization, and network throughput. The coefficient of determination (R2) was used to assess the models' fit to the data, as shown in Table 2. A higher R2 value indicates a better fit, while a negative R2 value suggests the model fits worse than a simple mean predictor. The table also displays the predicted CPU utilization, memory utilization, and network transmitted throughput for each model. Based on the results, the Bi-LSTM model performed the best, achieving the highest R2 score for memory utilization and the second-highest R2 score for CPU usage. The ARIMA model exhibited a negative R2 score for CPU utilization, and the AR model yielded a very low R2 score for memory consumption.
Across all three variables, the Bi-LSTM model emerges as the most suitable choice for
predicting system failures and enhancing system reliability.
Figure 1 presents a scatter plot of the Bi-LSTM model, visualizing the actual versus predicted values based on R2 for the three main features: CPU usage (%), memory usage (KB), and network transmitted throughput.
The findings of this study demonstrate that the Bi-LSTM model consistently
outperforms other statistical and deep learning models in predicting system resource
utilization, including CPU utilization, memory utilization, and network throughput.
Its superior performance, evident in both R2 scores and MAE values, suggests that
the Bi-LSTM model is a powerful tool for anticipating system failures and enhancing
system reliability.
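For illustration, a Bi-LSTM forecaster of the kind compared here can be sketched with Keras as follows; the window length, layer width, and training settings are assumptions rather than the configuration used in this study, and the input series is synthetic.

```python
# Bi-LSTM sketch for one-step-ahead forecasting of a scaled usage series.
# Architecture and hyper-parameters are illustrative assumptions.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense
from sklearn.metrics import r2_score

def make_windows(series, window=12):
    # Turn a 1-D series into (samples, window, 1) inputs and next-step targets.
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return np.array(X)[..., np.newaxis], np.array(y)

series = np.random.rand(500)          # placeholder for a scaled CPU trace
X, y = make_windows(series)

model = Sequential([
    Bidirectional(LSTM(32), input_shape=(X.shape[1], 1)),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

preds = model.predict(X[-10:])
print("R2 on the last windows:", r2_score(y[-10:], preds.ravel()))
```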
This study investigated the application of multivariate time series analysis to predict
system failures using the “BitsBrain” dataset. Following data preprocessing and
feature selection to extract meaningful information, a range of basic and advanced
models were employed to forecast CPU consumption, memory utilization, and
network transmitted throughput.
References
1. Yu G, Chen P, Chen H, Guan Z, Huang Z, Jing L, Weng T, Sun X, Li X (2021) Microrank: end-to-
end latency issue localization with extended spectrum analysis in microservice environments.
In: Proceedings of the web conference 2021, pp 3087–3098
2. Anon (2021) Mark Zuckerberg apologises for Facebook, WhatsApp disruption—The
Economic Times [Online]. economictimes.indiatimes.com. Available from: https://economict
imes.indiatimes.com/tech/technology/mark-zuckerberg-apologises-for-facebook-whatsapp-
disruption/articleshow/86772424.cms. Accessed 3 Nov 2022
3. Abro JH, Li C, Shafiq M, Vishnukumar A, Mewada S, Malpani K, Osei-Owusu J (2022) Artifi-
cial intelligence enabled effective fault prediction techniques in cloud computing environment
for improving resource optimization. Sci Program 2022:1–7
4. Litoiu M, Watts I, Wigglesworth J (2021) The 13th CASCON workshop on cloud computing:
engineering AIOps. In: Proceedings of the 31st annual international conference on computer
science and software engineering, pp 280–281
5. Tehrani C, Beer R, Popp H et al (2017) Education 4.0—fostering student’s performance
with machine learning methods. In: 2017 IEEE 23rd international symposium for design and
technology in electronic packaging (SIITME), pp 438–443. IEEE
6. Chalermarrewong T, Achalakul T, See SCW (2012) Failure prediction of data centers using
time series and fault tree analysis. In: 2012 IEEE 18th international conference on parallel and
distributed systems, pp 794–799. IEEE
7. Rawat A, Sushil R, Agarwal A, Sikander A (2021) A new approach for vm failure prediction
using stochastic model in cloud. IETE J Res 67(2):165–172
8. Shi J, Du J, Ren Y, Li B, Zou J, Zhang A (2022) Convolution-LSTM-based mechanical hard
disk failure prediction by sensoring SMART indicators. J Sens 2022:1–15
9. Bitbrains (n.d.) GWA-T-12 Bitbrains [Online]. Bitbrains. Available from: http://gwa.ewi.tud
elft.nl/datasets/gwa-t-12-bitbrains. Accessed 3 Nov 2022
10. Jain R, Chetty P (2020) Introduction to the autoregressive integrated moving average (ARIMA)
model. [Online]. projectguru.in. Available from: https://www.projectguru.in/introduction-to-
the-autoregressive-integrated-moving-average-arima-model/. Accessed: 27 Dec 2022
11. Triebe O, Laptev N, Rajagopal R (2019) Ar-net: a simple auto-regressive neural network for
time-series. arXiv preprint arXiv:1911.12436
12. Zhang N, Zhang Y, Lu H (2011) Seasonal autoregressive integrated moving average and support
vector machine models: prediction of short-term traffic flow on freeways. Transp Res Rec
2215(1):85–92
13. Ostertagová E, Ostertag O (2011, September) The simple exponential smoothing model. In: The
4th international conference on modelling of mechanical and mechatronic systems, Technical
University of Košice, Slovak Republic, proceedings of conference, pp 380–384
14. Yu Y, Si X, Hu C, Zhang J (2019) A review of recurrent neural networks: LSTM cells and
network architectures. Neural Comput 31(7):1235–1270
15. Sun Q, Jankovic MV, Bally L, Mougiakakou SG (2018, November) Predicting blood glucose
with an lstm and bi-lstm based deep neural network. In: 2018 14th symposium on neural
networks and applications (NEUREL), pp 1–5. IEEE
Statistical Learning
A Case Study via Bayesian Network:
Investigating Factors Influencing Student
Academic Performance in Online
Teaching and Learning During
COVID-19 Pandemic
Zheng Ning Looi, Poh Choo Song, Huai Tein Lim, and Sing Yan Looi
1 Introduction
Generally, ways of improving students' academic performance have been studied globally, for instance in South Africa [1], the Midwestern United States [2], Australia [3], Vietnam [4] and Indonesia [5]. In Malaysia, there are 114 universities, including 20 state universities, 50 private universities, 34 private university colleges, and ten branch campuses of international universities [6]. Acceptance of changes to the educational system, e.g., implementing online teaching and learning (OTL), was accelerated in these universities during the difficult circumstances of the COVID-19 pandemic. COVID-19 is the disease caused by the SARS-CoV-2 coronavirus; it was first identified at the end of 2019 during an outbreak in Wuhan, China. The Malaysian government imposed a movement control order in response to the large number of infections, and hence OTL was implemented. However, there are challenges confronting beginning educators in conducting OTL. As [4] claimed, most students are not interested in online learning due to a lack of interaction and unstable Internet connections. Likewise, [7] noted that students had poorer sleep quality than employees during the COVID-19 pandemic. Additionally, about 56% of students in the Southern Philippines avoided associating with others during the COVID-19 pandemic [8]. Overall, the COVID-19 pandemic has caused students to lose motivation for learning [9, 10] and has altered their learning attitudes [11]. These conditions will certainly affect the quality of education for the new generation in the future.
Conventionally, two categories of factors, internal and external, influence students' academic performance [12]. Classroom atmosphere, learning facilities, and the instructor's role are internal influences; [2] found that instructor quality strongly affects students' academic achievement. Family influence, financial difficulties, and social issues are examples of external causes. However, students undergoing OTL during the COVID-19 pandemic may view academic performance somewhat differently from the conventional viewpoint. According to [13] and [14], time management and online learning are closely related, since students have more flexible ways of managing their time. Past studies also suggest that academic achievement is strongly correlated with self-efficacy [3] and positively correlated with student attendance [15]. Moreover, many students experienced environmental issues in OTL during the pandemic [16].
Therefore, to better understand the variables affecting undergraduate students' academic performance in OTL during the COVID-19 pandemic, this study (1) statistically determined a relationship network between academic achievement (the dependent variable) and the independent variables; and (2) evaluated the prediction accuracy of the best-performing model by comparing the Bayesian network and support vector regression during model testing. Under limited resources, a better causal network can help higher education institutions strengthen the significant variables that raise students' academic performance, mitigate unfavourable variables, and predict academic performance.
2 Literature Review
70% of the total attendance for every course [28]. A study of the impact of attendance patterns on academic performance and wellness among medical students at the University of Central Florida during the COVID-19 pandemic was conducted by Salzman et al. [15]. The study found a weak but statistically significant connection between academic achievement and attendance. This is in line with [30], where some students show up for class to comply with the university's attendance regulations but pay little attention to the lecture.
Sleep quality is crucial for a fresh mind and good health, and could therefore affect academic performance [31, 32]. According to a survey in [32], students who sleep more than nine hours have a higher CGPA than those who sleep less than six hours. However, about 60% of college students still experience daytime sleepiness and insufficient sleep. Moreover, [7] found that the amount and quality of sleep among students and university staff were strongly affected during the COVID-19 lockdown. In contrast, [33] found no connection between academic achievement and sleep quality among medical students.
Moreover, a student's desire to learn and academic performance may depend on family interaction. According to [34], a healthy family dynamic lowers a student's cortisol levels and enables the student to concentrate on their studies. Additionally, parents who interact with their children through games can support them in recognizing their growth, well-being, and academic standing [35]. However, during the COVID-19 pandemic, about 54–56% of high school students avoided social interaction [36], and some postgraduate students had no social contact with family and friends [8]. These conditions may affect both the learning process and academic performance.
Learning attitudes can relate positively to academic achievement [37–40]. Lee et al. [11] stated that secondary school students' academic performance was correlated with their learning attitude during the COVID-19 pandemic. Furthermore, how a student feels about a topic may be the key to cultivating a better learning attitude [41] and self-efficacy [42]. Likewise, in India, undergraduate students from the science stream have higher levels of academic self-efficacy than those from the arts and commerce streams [43]. Self-efficacy can therefore encourage good learning behavior and thus improve academic achievement [42, 43], including during the COVID-19 pandemic [3]. Self-efficacy can be seen as a form of self-confidence; as [42] stated, it is the belief that one is capable of carrying out and accomplishing a task.
Financial stress encountered by students, such as student loans, tuition fees, reference book purchases, computing equipment purchases, and stationery costs, can affect their studies [8, 44]. The majority of undergraduate students have financial problems; [45] revealed that 60% of undergraduate students had trouble paying their tuition fees and monthly expenses. To cover their debts, students may skip classes for part-time jobs or live frugally, which can affect their learning progress. The situation worsened with the COVID-19 pandemic: [46] found that students' emotional and financial well-being were affected during this period. On the flip side, however, the flexibility of OTL may help reduce financial stress.
In this study, ten potential independent variables adopted from past studies are compiled and investigated concurrently to determine the core factors affecting and predicting academic performance in online learning during the COVID-19 pandemic (Fig. 1).
The investigated independent variables for this paper are teaching method, time management, environmental factors, attendance, sleep quality, family influence, learning attitudes, financial stress, motivation, and self-efficacy, while the dependent variable is academic performance in terms of CGPA.
3 Methodology
In this study, Year 3 and Year 4 undergraduate students from various faculties in Universiti Tunku Abdul Rahman are the target group, since they have experienced both learning methods, i.e., face-to-face mode and online virtual mode. Students from the Faculty of Medicine and Health Science (FMHS) and the Faculty of Creative Industries (FCI) are not involved in this study, as these faculties did not implement online teaching and learning (OTL) during the pandemic. Ethics approval was obtained from the university, and the participants provided written informed consent to participate in the survey. There were 158 respondents in the study. The sample size was selected based on the total population, confidence level and margin of error, which are about 4000, 80% and 5%, respectively. Convenience sampling was applied in view of time and budget constraints, as it is faster than other sampling techniques, especially at crucial moments [50]. The questionnaire is divided into two sections. The first section concerns the respondent's personal information, such as gender, age, place of residence, ethnicity, nationality, faculty, year of study, medium of academic study, parent's educational level, CGPA, and monthly household income. The second section assesses the predictor and predicted variables.
Technically, Bayesian networks, a type of probabilistic graphical model, are useful for probabilistic inference, decision-making, and prediction, although they can be computationally expensive and may require a large amount of training data [51]. Nevertheless, they can also manage missing data and are applicable to small data sets [52], a capability quite similar to the support vector machine. Furthermore, a Bayesian network can handle uncertainty and can be combined with other Bayesian analytic techniques such as Markov chain Monte Carlo (MCMC). However, Hill-Climbing (HC) structure learning does not guarantee an optimal solution; thus, more sophisticated neighborhood search algorithms such as Tabu search [53], or other metaheuristics [54], have been suggested as improvements. Although Hill-Climbing is comparable to the Tabu search method [53], Tabu search can prevent the loss of a good solution and reach an optimal solution by reselecting restricted objects into the model [55]. Therefore, in this study, the Bayesian network model selected from Hill-Climbing and Tabu search is compared with support vector regression to find the best network; see Fig. 2. Furthermore, we consider the correlation between the predictor and the predicted variable.
A Bayesian network is made up of two components: a set of conditional probability distributions and a directed acyclic graph [51]. It represents a joint probability distribution over a group of variables. An example of a directed acyclic graph is shown in Fig. 3. The directed arc between nodes A and C indicates a probabilistic dependence between an electrical failure and a computer failure. Node A and node B are conditionally independent given node C. Additionally, a conditional probability distribution is built for each node based on the directed graph.
Generally, there are three types of structure learning methods for Bayesian networks: score-based, constraint-based, and hybrid. [53] gives a comprehensive review of these structure learning methods: Tabu search and Hill-Climbing are score-based learning algorithms; constraint-based learning algorithms include Grow-Shrink, Incremental Association Markov Blanket (IAMB), Fast Incremental Association Markov Blanket (FastIAMB), and Max–Min Parents and Children (MMPC); and hybrid learning techniques include Max–Min Hill-Climbing (MMHC).
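As an illustration of score-based structure learning, the sketch below runs a Hill-Climbing search with a BIC score using the pgmpy library (assuming its HillClimbSearch/BicScore estimator API); the data file and its columns are placeholders, and this is not the modelling pipeline actually used in this study, which also applies Tabu search.

```python
# Score-based Bayesian network structure learning sketch (Hill-Climbing + BIC).
# The survey file and its discretised columns are assumed placeholders.
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore

df = pd.read_csv("survey.csv")   # hypothetical discretised survey responses

search = HillClimbSearch(df)
dag = search.estimate(scoring_method=BicScore(df))

print("Learned edges:", sorted(dag.edges()))
```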
The support vector machine (SVM) was created by Vapnik. It can be classified into two categories: support vector classification (SVC) and support vector regression (SVR) [58]. The aim of SVR is to find a fitting line, called a hyperplane, that fits a set of data points [59]. Boundary lines are two lines drawn around the hyperplane to form a margin [60]. Moreover, SVR uses a collection of kernel functions, such as sigmoidal, polynomial, and Gaussian kernels, to transform the data into the necessary dimensions [60]. In addition, SVR has certain benefits: it is resistant to outliers, and the decision model is easy to apply and to update when there are changes. According to [59] and [60], it has high prediction accuracy and can operate with a minimal dataset. Therefore, in this study, SVR is implemented and compared with the BN to assess prediction accuracy. 80% of the data are used to train the model, and 20% are used to test it. Mean absolute error (MAE), mean absolute percentage error (MAPE) and root mean square error (RMSE) are used to compare the accuracy of BN and SVR. MAE is defined as:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \quad (2)$$

where n is the number of errors and |y_i − ŷ_i| is the absolute error, in which y_i is the actual value and ŷ_i is the forecast value. MAPE is defined as:

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|. \quad (3)$$

RMSE is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}. \quad (4)$$
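A brief sketch of how the SVR baseline and these three error measures can be computed is given below; the random feature matrix (sized only to echo the 158 respondents and six predictors) and the RBF kernel choice are illustrative assumptions rather than the study's actual data or tuning.

```python
# SVR with an 80/20 split and MAE / MAPE / RMSE computed as in Eqs. (2)-(4).
# X and y are random placeholders for the encoded predictors and CGPA band.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X = np.random.rand(158, 6)                         # 158 respondents, 6 predictors
y = np.random.randint(1, 9, 158).astype(float)     # CGPA band 1..8

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

svr = SVR(kernel="rbf").fit(X_tr, y_tr)
pred = svr.predict(X_te)

mae = np.mean(np.abs(y_te - pred))                 # Eq. (2)
mape = np.mean(np.abs((y_te - pred) / y_te))       # Eq. (3)
rmse = np.sqrt(np.mean((y_te - pred) ** 2))        # Eq. (4)
print(f"MAE={mae:.3f}  MAPE={mape:.3f}  RMSE={rmse:.3f}")
```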
4 Result
According to the findings shown in Table 1, the lowest BIC score of Tabu search was slightly higher than that of HC. Hence, the final trained network in this study is determined by Tabu search. Figure 5 illustrates the factors affecting students' CGPA academic performance. We found relationships between student CGPA and Faculty, Age, Types of gadgets (gadgetTypes), Attendance, Self-efficacy (SelfE), and Teaching method (TeachMS).
Based on the output, student CGPA and faculty have a causal relationship. The measures used by different faculties to evaluate students' academic performance can vary; for instance, some lecturers evaluate a student's summative assessment based on their assignments. There is also a link between the types of gadgets used and student CGPA. For instance, students who use mobile devices for studying may experience more eye strain because of the narrower screen, making it harder to focus and causing them to lose interest in online learning.
Moreover, students' learning attitudes with regard to attendance and self-efficacy are vital for academic performance. Student absenteeism affects learning progress, and a student with low self-efficacy may have less faith in their abilities and less interest in performing well on learning assignments. Lastly, educational strategies have a substantial impact on student academic performance. Different educators employ various educational practices, and an appropriate teaching approach may lead to smoother and more effective learning progress. Only the variables linked to CGPA are used to train the SVR model; therefore, age, faculty, gadget types, teaching method, attendance, and self-efficacy are used in training the model.
Figure 4 illustrates the CGPA results predicted by BN and SVR, respectively. For a clearer comparison, we decomposed the 4-point CGPA scale into 8 bands, where ‘1’ denotes a CGPA of 3.80–4.00, ‘2’ denotes 3.50–3.79, ‘3’ denotes 3.20–3.49, ‘4’ denotes 2.90–3.19, ‘5’ denotes 2.60–2.89, ‘6’ denotes 2.30–2.59, ‘7’ denotes 2.00–2.29, and ‘8’ denotes 0.00–1.99. The red dots in Fig. 4 represent the original students' CGPA, and the blue line is the prediction based on the trained model.
The results in Table 2 show that the CGPA predictive accuracy of BN is superior to that of SVR. All three predictive measures, MAE, MAPE and RMSE, are lower for BN than for SVR. The MAPE of BN shows that the CGPA prediction is about 14% off from the actual CGPA on average, which is about 5 percentage points lower than that of SVR (Fig. 5).
5 Conclusion
Fig. 4 Prediction by
Bayesian network and
support vector regression
References
1. Ganyaupfu EM (2013) Teaching methods and students’ academic performance. Int J Humanit
Soc Sci Invention 2(9):29–35
2. Rugutt J, Chemosit C (2005) A study of factors that influence college academic achievement:
a structural equation modeling approach
3. Talsma K, Robertson K, Thomas C, Norris K (2021) COVID-19 beliefs, self-efficacy and
academic performance in first-year university students: cohort comparison and mediation
Analysis. Front Psychol 12
4. Pham TTT, Le HA, Do DT (2021) The factors affecting students’ online learning outcomes
during the COVID-19 pandemic: a Bayesian exploratory factor analysis. Educ Res Int 2021:1–
13
26. Ali M, Tazilah MDABK, Ahmad MF, Adnan NS, Awang N (2016) The relationship between
environmental factors, learning style and self-management towards academic performance of
accounting students: a case of Uniten. South East Asia J Contemp Bus Econ Law 10(1)
27. Ragpala E (2021) Environmental factors that affect the academic performance of senior high
school students during COVID-19 pandemic. Int J Soc Sci Curr Future Res Trends 11(1):1–19
28. Ancheta RF, Daniel D, Ahmad R (2021) Effect of class attendance on academic performance.
Eur J Educ Stud 8(9)
29. Al Shenawi H, Yaghan R, Almarabheh A, Al Shenawi N (2021) The relationship between atten-
dance and academic performance of undergraduate medical students during surgical clerkship.
BMC Med Educ 21(1)
30. Macfarlane B (2013) The surveillance of learning: a critical analysis of university attendance
policies. High Educ Q 67(4):358–373
31. CDC healthy schools: sleep and health [Online]. Available at: https://www.cdc.gov/healthysc
hools/sleep.htm#:~:text=Students%20should%20get%20the%20proper (2019)
32. Learning center: sleeping to succeed—learning center [Online]. Available at: https://learningc
enter.unc.edu/tips-and-tools/sleeping-to-succeed/ (2016)
33. Jalali R, Khazaie H, Khaledi Paveh B, Hayrani Z, Menati L (2020) The effect of sleep quality
on students’ academic achievement. Adv Med Educ Pract 11:497–502
34. Kohnke K (2019) Familial stress and academic performance: the role of an
outside stress factor on the academic success of college students [Online]. Avail-
able at: https://rex.libraries.wsu.edu/esploro/outputs/essay/Familial-Stress-and-Academic-Per
formance-The/99900590739701842 (2019)
35. Ðurišić M, Bunijevac M (2017) Parental involvement as a important factor for successful educa-
tion 7(3)
36. Baloran ET (2020) Knowledge, attitudes, anxiety, and coping strategies of students during
COVID-19 pandemic. J Loss Trauma 25(8):1–8
37. Bakar KA et al (2010) Relationships between university students’ achievement motivation,
attitude and academic performance in Malaysia. Procedia Soc Behav Sci 2(2):4906–4910
38. Educational research techniques: attitude and behavior of students [Online]. Avail-
able at: https://educationalresearchtechniques.com/2021/01/03/attitude-and-behavior-of-stu
dents/#comments (2021)
39. Kafle T, Kongawad NB (2019) Students’ attitude in teaching learning activities—ignited minds
journals. Multi Acad Res 16(5):478–482
40. Ismaili Y (2021) Evaluation of students’ attitude toward distance learning during the pandemic
(Covid-19): a case study of ELTE university. On the Horizon 29(1):17–30
41. Mahzan Awang M et al (2013) Students’ attitudes and their academic performance in
nationhood education. Int Educ Stud 6(11)
42. Doménech-Betoret F, Abellán-Roselló L, Gómez-Artiga A (2017) Self-efficacy, satisfaction,
and academic achievement: the mediator role of students’ expectancy-value beliefs. Front
Psychol 8
43. Bhati K, Baral R, Meher V (2022) Academic self-efficacy and academic performance among
undergraduate students in relation to gender and streams of education. Indonesian J Contemp
Educ 4(2):80–88
44. Archuleta KL, Dale A, Spann SM (2013) College students and financial distress: exploring
debt, financial satisfaction, and financial anxiety. J Fin Couns Plann 24(2):50–62
45. Grabmeier J (2015) 70 percent of college students stressed about finances. osu.edu [Online].
Available at: https://news.osu.edu/70-percent-of-college-students-stressed-about-finances/
46. Haji Wahab MZ, Othman K (2021) Impact of COVID-19 on student’s emotional and financial
aspects in the higher learning institutions. SEISENSE J Manag 4(4):1–15
47. Marler EK et al (2021) The impact of COVID-19 on university students’ academic motivation,
social connection, and psychological well-being. Sch Teach Learn Psychol
48. Muhammad AS, Bakar NA, Mijinyawa SI, Halabi KA (2021) Impact of motivation on students’
academic performance: a case study of university sultan Zainal Abidin students. Am J Innov
Res Appl Sci 13(1–6):203–207
Abstract This article aims to develop an accurate air quality prediction model to
handle Jakarta’s air pollution challenges. In this study, data from air quality moni-
toring stations’ conventional air pollution indexes was employed. In the research
phase, data is explored, SMOTE is used to manage imbalances, and XGBoost is
used to develop a model with the best parameters. The evaluation stage shows the
model's ability to predict air quality. With an accuracy rate of 99.516%, an F1-score of 99.528%, and a recall rate of 99.509%, the results were remarkable. These performance indicators show the model's exceptional ability to classify and predict air quality levels. Furthermore, this study investigates the significance of various variables in predicting air quality. A thorough evaluation of measures such as the weight, gain, total gain, and cover indicators reveals the importance of numerous aspects. While SO2 also helps predict air quality, the prominence of PM2.5 across several measures reveals a significant influence. This study contributes to a better understanding of the
W. Wibowo (B)
Department of Business Statistics, Faculty of Vocational Studies, Institut Teknologi Sepuluh
Nopember, 60111 Surabaya, Indonesia
e-mail: [email protected]
H. Al Azies
Faculty of Computer Science, Study Program in Informatics Engineering, Universitas Dian
Nuswantoro, 50131 Semarang, Indonesia
Research Center for Materials Informatics, Faculty of Computer Science, Universitas Dian
Nuswantoro, 50131 Semarang, Indonesia
S. A. Wilujeng
Department of Environmental Engineering, Faculty of Civil, Planning, and Geo Engineering,
Institut Teknologi Sepuluh Nopember, 60111 Surabaya, Indonesia
S. Abdul-Rahman
College of Computing, Informatics and Mathematics, Universiti Teknologi MARA, 40450 Shah
Alam, Selangor, Malaysia
1 Introduction
The consequences of air pollution are deeply concerning. Air pollution has had
major implications, including increasing dangers to human health [1]. According
to the World Health Organization (WHO), the number of deaths from air pollution
reached an alarming 4.2 million in 2019 [2]. The situation is compounded further by
the growing number of vehicles in Indonesia, which emit hazardous exhaust pollutants
[3]. Carbon monoxide (CO) concentrations are also a factor in global warming and
temperature fluctuations on Earth [4]. The amount of contaminants in the air has
significantly increased recently due to Indonesia’s rapid economic growth and social
development [5]. The main concern is the assessment of environmental air quality
and the control of air pollution, which is a serious issue with effects on many facets
of daily life, the environment, and the ecosystem as a whole, particularly in urban
areas like Indonesia’s DKI Jakarta Province. Urban air quality is under threat and
becoming a source of concern, notwithstanding the rapid economic expansion and
the rapid rate of urbanization [6].
Government departments use the Air Quality Index (AQI) to notify the public
about the present or predicted levels of air pollution and to monitor air quality [7].
The higher the AQI score, the larger the share of the population likely to suffer an adverse health impact, which can be severe. As a result, timely air quality prediction is critical for the government to understand the pattern of changes in air quality, and it can be used to implement efficient air pollution control and management strategies.
The World Air Quality Index (WAQI), which updates its data in real time, gives
Jakarta an AQI score of 167. Jakarta is now ranked second on the list of the world’s
most polluted cities, trailing only Dubai City in the United Arab Emirates, which
holds the top spot with an AQI score of 176. This ranking demonstrates the significant
problems these two cities are facing with air pollution, which has a detrimental effect
on both the general public’s health and the environment.
Monitoring air quality alone is inadequate in addressing the root causes of air
pollution. In recent years, various methods for predicting air pollution have emerged.
While the Gaussian dispersion model is commonly used in air pollution research
[8, 9], other statistical techniques [10–12] can also predict pollutant levels. These
models, based on physical principles, often lack detailed data on pollution sources
and related factors. To overcome this limitation, a more comprehensive approach
is needed, and machine learning algorithms [13, 14] offer a promising solution.
This study distinguishes itself from the previously cited studies by taking a novel and different approach. Although some prior studies have used various machine learning approaches and artificial neural networks to predict air quality, this study focuses specifically on the Jakarta air quality scenario. To facilitate reader comprehension and navigation, this publication is divided into several sections. Section 2 reviews related work in air quality prediction research. Section 3 presents the applied research framework, which serves as the foundation for a full understanding of the issue. Section 4 presents the study's numerical results, which include crucial information regarding Jakarta's air quality. The article concludes with Sect. 5, which presents the conclusions and suggestions for future directions.
2 Related Works
This literature review gives a general overview of the methodologies used to predict air quality with machine learning techniques. Ma et al. [15] studied the application of the XGBoost algorithm, developing a machine learning model that predicts daily PM2.5 concentrations by combining estimates of PM2.5 mass concentration with measured data and meteorological variables; the outcomes demonstrate improved performance over the WRF-Chem model [15]. Pan [16] used the XGBoost algorithm to estimate hourly PM2.5 concentrations and compared its effectiveness against several alternative techniques; the outcomes demonstrate XGBoost's advantage in predicting air quality [16]. Wang and Jing [17] predicted the hourly AQI in Shijiazhuang, Hebei Province, using the XGBoost model; with a stronger correlation coefficient and a lower mean absolute error (MAE), the trained model outperforms the other approaches [17]. The aforementioned research has contributed significantly to the application of the XGBoost algorithm in air quality prediction; nonetheless, there are several significant differences between these studies. With a focus on the DKI Jakarta Province region of Indonesia, this research aims to establish a thorough evaluation methodology specifically for forecasting the AQI in metropolitan areas. The goal is to offer a technique for predicting air quality that can be applied to complex urban air pollution issues.
As a result, the focus of this research will be on establishing an evaluation frame-
work to predict AQI levels using machine learning approaches. In this context, the
XGBoost algorithm is critical in overcoming the shortcomings of prior techniques.
Furthermore, the Synthetic Minority Oversampling Technique (SMOTE) was used in this work to address the imbalance in the number of samples across classes. This enables the developed model to cope with the data imbalance and provide more accurate predictions of Jakarta's air quality. Additionally, the XGBoost method and a randomized search approach are employed for parameter tuning throughout the model construction phase. To find the parameter settings that yield the most accurate predictions, the model is run through a randomly sampled collection of potential parameter combinations. This method improves the model's performance and its ability to predict Jakarta's air quality.
This study employs a research framework composed of four primary steps to analyze air quality in the DKI Jakarta Province area (Fig. 1). This methodological approach is deliberately designed to understand the components that influence air quality and to build accurate predictive models. Each stage of the methodology has a specific goal that contributes to the overall success of the research. The first stage is data collection, which describes how data is obtained from existing sources and serves as the foundation for this study. The following phase, data processing, explains the actions performed to prepare the data before it is used to train the model. The next stage, model development, describes the process of developing the model or algorithm for predicting air quality. The final stage, model performance analysis, evaluates the performance of the developed model.
Fig. 1 Research framework for air quality prediction models in Jakarta, Indonesia
This research dataset examines the Air Quality Index (AQI) as recorded by five
air quality monitoring sites located throughout DKI Jakarta Province in 2021. This
information comes from a reliable source, Jakarta Open Data, which can be viewed
through the official website: https://data.jakarta.go.id/. The AQI is an important
metric for assessing air quality in a given area. Based on the degree of air pollu-
tion by various pollutants such as PM10, PM25, SO2 , CO, NO2 , and O3 , this index
provides an overview of how good or bad air quality is at a given time. This study's attribute variables include PM10, which refers to particulate matter with a diameter of less than 10 µm [18]. These particles are inhalable and can harm human health, particularly the respiratory tract. The PM25 attribute refers to particulate matter (PM2.5) with a diameter of less than 2.5 µm; these particles are smaller than PM10, enter the lungs more easily, and have more detrimental effects on health [18]. The SO2 attribute measures sulfur dioxide in the air, a gas produced by human activities such as the burning of fossil fuels that is harmful to both the environment and human health [18]. The symbol CO
represents the concentration of carbon monoxide (CO). Numerous things, including
motor vehicles, can emit CO, which can be hazardous if inhaled in large quanti-
ties. O3 denotes the amount of ozone (O3 ) in the atmosphere. Ozone is a dangerous
contaminant on the Earth’s surface, but it also serves as a natural barrier against ultra-
violet light in the upper atmosphere. The amount of nitrogen dioxide (NO2 ) in the
atmosphere is referred to as NO2 [18]. NO2 , which is mostly created by human activ-
ities such as fuel combustion, may affect the human respiratory system. Meanwhile, the target variable is the predicted category of the air pollution standard index (PSI), the outcome of the Air Pollution Standard Index calculation. PSI is an air quality metric that describes how safe or dangerous the air we breathe is; to indicate the level of air pollution, its categories cover levels such as “good,” “moderate,” and “unhealthy” air quality.
The data acquired in the previous stage will go through a preprocessing process at
this level. This procedure ensures that the data is of good quality and ready for use
in model building. This stage is divided into three major sub-stages. Exploratory
Data Analysis (EDA): At this step, the data is examined to discover patterns and
other information that can aid in a deeper understanding of the data’s features [19].
In this study, the correlation matrix was also used to investigate the association.
This EDA stage will enable researchers to have a deeper understanding of the data
being used. Imbalanced Data Handling Stage: There is frequently an imbalance in the
number of samples in each target class in the dataset used for air quality prediction.
For example, the number of samples in the “good” or “moderate” category could be
significantly more than the number of samples in the “unhealthy” category. Because
of this imbalance, the model may perform better when predicting the majority class,
while the minority class may perform poorly. The Synthetic Minority Oversampling
Technique (SMOTE) is an oversampling technique that focuses on minority classes
by developing synthetic samples that look like existing minority samples [20].
SMOTE takes samples from the minority class and generates new synthetic samples by interpolating between existing points [20]. This increases the number of samples in the minority class without simply duplicating existing samples, so the minority class becomes more balanced with the majority class. This helps the model learn more effectively and avoid bias toward
the dominant class. Data splitting: After processing, the data is divided into training
and test data in a predetermined proportion. Training data is used to train the model,
whereas test data is used to evaluate the model’s performance. The proportion of
training data to test data in this study is 70:30 [21], with the majority of the data used
to train the model (70%), and just a small percentage (30%) used for testing. This
ratio ensures that the model has enough opportunities to learn from the training data
and that it also performs well on the test data.
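A minimal sketch of this balancing and splitting step, using imbalanced-learn and scikit-learn, is shown below; the file path and column names are assumptions for illustration, not the actual Jakarta Open Data schema.

```python
# SMOTE oversampling followed by a stratified 70:30 train/test split.
# The CSV path and column names are illustrative assumptions.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

df = pd.read_csv("jakarta_air_quality_2021.csv")
X = df[["pm10", "pm25", "so2", "co", "o3", "no2"]]
y = df["category"]                      # PSI category: good / moderate / unhealthy

X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.3, stratify=y_bal, random_state=42)
print(len(X_train), "training samples,", len(X_test), "test samples")
```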
The model development stage is critical in this study, since it builds the air quality prediction system using the XGBoost algorithm. XGBoost was selected for a variety of reasons, including its ability to handle complex data and give superior results in classification tasks, and because of its strong reputation in the data science and artificial intelligence communities [22]. Extreme gradient boosting (XGBoost) is an ensemble learning algorithm that combines several simple predictive models (weak learners) into a stronger model (strong learner) [23]. This is done using boosting techniques that can reduce the bias and variance in the model. Before training the model, it is necessary to determine the optimal parameters for the XGBoost algorithm. This step was performed using a random search method [24], which tries various combinations of parameters and measures their performance using cross-validation [25]. The optimal parameters are selected based on the best performance during cross-validation. Once the optimal parameters are found through the random search, the XGBoost model is trained using the training data and these parameters with the K-Fold validation technique [26]. In this technique, the training data is split into several folds and the model is trained and evaluated on each fold, ensuring that the developed model can generalize well to a wide variety of data. By using the XGBoost algorithm with optimized parameters and K-Fold validation in the evaluation, this model development stage aims to produce an air quality prediction model that has high performance and can generalize to data that has never been seen before [26].
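The random search with K-Fold cross-validation described above might be sketched as follows; the parameter grid mirrors the configurations listed later in this chapter, while the number of iterations, folds, and scoring metric are illustrative assumptions rather than the authors' exact settings.

```python
# Randomized search over XGBoost parameters with stratified K-Fold validation.
# The parameter grid follows the configurations reported in this chapter;
# n_iter, folds, and scoring are assumed illustrative choices.
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Encode the PSI categories as integers, as required by XGBClassifier.
# X_train, y_train come from a split like the one sketched earlier.
y_enc = LabelEncoder().fit_transform(y_train)

param_dist = {
    "subsample": [0.6, 0.7, 0.8],
    "n_estimators": [100, 200],
    "max_depth": [3, 4, 5],
    "learning_rate": [0.01, 0.02],
    "gamma": [0, 0.1, 0.2],
}

search = RandomizedSearchCV(
    estimator=XGBClassifier(eval_metric="mlogloss"),
    param_distributions=param_dist,
    n_iter=20,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1_macro",
    random_state=42,
)
search.fit(X_train, y_enc)
print("Best parameters:", search.best_params_)
```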
The XGBoost algorithm was used in this study to provide accurate and thorough
air quality predictions. XGBoost is widely recognized for its outstanding capabilities
in various machine learning applications, especially in classification tasks. As such,
it is an obvious choice for developing strong models that provide precise forecasts of
air quality. Predicting air quality can be difficult when dealing with data that is unbal-
anced, meaning that some categories of air quality occur less frequently than others.
Crucially, the study approach combines XGBoost with the Synthetic Minority Over-
sampling Technique (SMOTE), which balances the dataset by creating new instances
of the minority class artificially. This guarantees that the model is free from biases
and that it can provide precise predictions for all categories of air quality. Moreover,
the usefulness of XGBoost is emphasized by its ability to optimize parameters. This
implies that the model’s predictive power can be further increased by fine-tuning it
to match the particular features of air quality data. Additionally, XGBoost makes the
inner workings of the model transparent, allowing insights into the factors driving
air quality predictions to be extracted. This interpretability enables researchers to
make defensible conclusions by promoting a deeper comprehension of the intricate
interactions between variables and air quality. In addition to its effectiveness and
interpretability, the selection of XGBoost contributes to the larger field of machine
learning applications that tackle environmental issues. It’s a start in the right direction
toward developing long-term solutions to the problems caused by air pollution.
The final stage is model evaluation, in which the predictive model's performance is tested using important metrics such as accuracy, F1-score, and recall [27]. Accuracy is a metric that measures how successfully a classification model predicts each class in a dataset; it is computed by dividing the number of correct predictions across all classes by the total number of predictions made by the model [28]. Recall is a metric that assesses how well a model can detect real positive cases among all positive cases in a dataset. In multi-class classification, recall is calculated independently for each class. Recall is key in instances where avoiding false negatives (failing to recognize true positive cases) is critical. The F1-score is a metric that combines recall and precision, the model's ability to recognize positives correctly. In multi-class classification, the F1-score can be calculated independently for each class, and then the average across classes is determined [28].
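For reference, these metrics can be computed with scikit-learn as sketched below; the toy labels and the macro-averaging choice are illustrative assumptions.

```python
# Accuracy, macro-averaged recall and F1-score for multi-class predictions.
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = [0, 2, 1, 0, 2, 2, 1]      # toy ground-truth class labels
y_pred = [0, 2, 1, 0, 1, 2, 1]      # toy model predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Recall  :", recall_score(y_true, y_pred, average="macro"))
print("F1-score:", f1_score(y_true, y_pred, average="macro"))
```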
Before going on to the next step of modeling to predict air quality based on the
standard air pollution index in DKI Jakarta Province, this stage is an exploratory
data analysis (EDA), which attempts to understand the features of the data.
According to Fig. 2, a high correlation between PM10 and PM25 particle concen-
trations shows that changes in PM10 typically follow changes in PM25. The modestly positive connection between PM10 and SO2 and between PM10 and NO2, however, suggests that gaseous pollutants like SO2 and NO2 can coexist with large airborne particles. The weak association between the pollution parameters (CO, O3, SO2, and NO2) and air quality suggests that air quality is also influenced by weather, pollution sources, and other intricate interactions. Some weak negative relationships between pollution parameters, such as those between SO2 and NO2 and between O3 and NO2, indicate that one pollutant can become more concentrated while another becomes less concentrated. The next stage is to verify the
number of classes in the air quality target variable. The goal of this stage is to deter-
mine whether or not there is an imbalance in the distribution of data among air quality
categories. If the distribution of data between air quality classes is highly unequal,
the performance of the model to be created may suffer, especially if the model tends
to favor the majority class while ignoring the minority class.
According to preliminary examination, the class distribution of air quality param-
eters appears to be unbalanced. This experiment uses the SMOTE approach to
manage uneven data to address this imbalance. Using the features of existing minority
samples, this method generates synthetic data for the minority class. As seen in Fig. 3,
the data distribution was adequately balanced after applying the SMOTE technique.
After handling the data imbalance using the SMOTE approach, the next step in this
research is to divide the data into training data and test data with a ratio of 70:30,
resulting in 2408 training samples and 1033 test samples. The objective of this division is to ensure that, in addition to fitting the training data, the developed model can perform well on data that it has never seen before. After the data has been split, the next step is to build a model using the XGBoost technique, one of the most efficient and widely used algorithms for classification tasks. The XGBoost algorithm will have several parameters tuned in this study to determine the ideal configuration. These are the parameters that were used:
a. Subsample: This setting regulates the number of samples utilized in each itera-
tion of the tree-building process. Since only a small portion of the data is used
in each iteration, choosing lower values for the subsample can aid in preventing
overfitting [29]. Configurations 0.6, 0.7, and 0.8 are used in this investigation.
b. n_estimators: this parameter specifies the number of trees to be built by the algorithm [29]. A larger number of trees generally increases the model's ability to capture complex patterns in the data. Configurations of 100 and 200 n_estimators are used in this study.
Fig. 2 Correlation heat map of numeric attributes: each color’s darker contrast denotes a strong
link, either positive or negative. Brighter contrast, on the other hand, denotes a weak or absent link
Fig. 3 Comparison of the distribution of air quality data before and after handling the imbalance
c. max_depth: variations of 3, 4, and 5 are used. This parameter controls the maximum depth of each tree in the ensemble [30]. The right setting for max_depth is necessary to avoid overfitting [31]; if the trees are allowed too much depth, the model may fit noise in the training data.
d. Learning rate: the learning rate determines how much the tree weights are adjusted at each iteration [32]. Choosing the right learning rate affects model convergence and overall performance; if the learning rate is too high, the model may fail to reach the global minimum. Values of 0.01 and 0.02 are used in this study.
e. Gamma: Gamma configurations 0, 0.1, and 0.2 were applied in this experiment.
When a tree is formed, the gamma value controls when the nodes split [33].
The right gamma value can help prevent overfitting by controlling the tree’s
complexity.
f. colsample_bytree: This option regulates the proportion of features used to build each tree. The model can generalize better and be more resilient to overfitting when a feature subset is selected [34]. The settings are 0.4, 0.5, and 0.6.
A random search over these configurations identified the optimal parameter combination for predicting air quality in Jakarta. This parameter combination is used to train the model on all training data.
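A minimal Python sketch of this workflow, assuming the xgboost, imbalanced-learn, and scikit-learn packages and a synthetic dataset in place of the Jakarta air pollution data, could look as follows; the parameter grids mirror the configurations listed above, while the dataset, n_iter, and random seeds are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): SMOTE balancing, a 70:30 split,
# and a 5-fold randomized search over the XGBoost settings described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Synthetic, imbalanced three-class data standing in for the air-quality dataset
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, weights=[0.6, 0.3, 0.1], random_state=42)

X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)            # balance the classes
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.3, random_state=42, stratify=y_bal)   # 70:30 split

param_distributions = {
    "subsample": [0.6, 0.7, 0.8],
    "n_estimators": [100, 200],
    "max_depth": [3, 4, 5],
    "learning_rate": [0.01, 0.02],
    "gamma": [0, 0.1, 0.2],
    "colsample_bytree": [0.4, 0.5, 0.6],
}

search = RandomizedSearchCV(
    XGBClassifier(objective="multi:softmax", eval_metric="mlogloss"),
    param_distributions, n_iter=20, cv=5, scoring="f1_macro", random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)
```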
The model’s performance in each fold is shown by the results of fivefold cross-
validation (CV) [35]. Table 2 shows that the model’s performance at each fold is
quite good. For each fold, the accuracy, F1-score, and average recall are in the
0.9937 to 0.9980 range. This shows that the model can predict Jakarta's air quality accurately. Overall, Fold 5 has the highest F1-score and recall, indicating that the model performs best when categorizing data in this fold. Based on the performance evaluations in each fold, the 5th fold is the best choice to represent the overall performance of the generated XGBoost model. Since the performance results for each fold are good and consistent, the best model is chosen to make predictions on data that has never been seen before (test data). In this case, the best model can
be selected based on previous experimental discoveries of the best parameters and
folds. The best model may be employed to predict air quality in Jakarta based on
new data.
Figure 4 shows the confusion matrix, which reflects the model’s predicted results
on the test data. This confusion matrix illustrates how well the model classifies each
class. The algorithm correctly identified 324 samples as “good,” whereas 0 samples
were incorrectly labeled as “good.” The model correctly classified 369 samples in the
“Medium” category, whereas no samples were incorrectly classified as “Medium.”
The program correctly identifies 335 samples as “unhealthy,” but five samples are
incorrectly labeled as such. Overall, the model does an excellent job of classifying air
into “good” and “medium” categories with flawless accuracy and precision. However,
there are some predicted inaccuracies in the “unhealthy” air class. This may be an
area that requires additional effort to improve model performance in classifying
underrepresented classes. The confusion matrix shows the results of the model’s
performance on test data (testing), which received an accuracy value of 0.995160.
The accuracy of the model is measured by how well it classifies the data as a whole.
With an accuracy value of roughly 0.995, it indicates that the model successfully
predicted 99.5% of all test data. Furthermore, the F1-score, the harmonic mean of precision and recall, is 0.995288. The F1-score value of roughly 0.995 implies a balance between precision and recall, indicating that the model performs well both in precision and in recognizing true positives. The last metric is the recall value,
which compares the model’s ability to recognize true positives to the total number
of actual positives. A recall value of roughly 0.995 suggests that the model is quite good at identifying positive cases.
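For completeness, a hedged sketch of how such a confusion matrix and the accompanying test-set metrics can be produced with scikit-learn is given below; the arrays are placeholders rather than the study's predictions.

```python
# Illustrative sketch: confusion matrix and per-class report for test-set predictions.
from sklearn.metrics import confusion_matrix, classification_report

y_test = ["good", "good", "medium", "unhealthy", "medium", "unhealthy"]    # hypothetical true labels
y_hat = ["good", "good", "medium", "unhealthy", "unhealthy", "unhealthy"]  # hypothetical predictions

print(confusion_matrix(y_test, y_hat, labels=["good", "medium", "unhealthy"]))
print(classification_report(y_test, y_hat, digits=4))
```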
After determining the optimal model, a more in-depth examination of the signifi-
cance of variables is required to comprehend the role and influence of each feature on
Jakarta air quality predictions. The importance of each variable on model predictions
is measured by the variable’s significance. According to this theory, high-importance
features have a larger impact on prediction choices than low-importance features. The
significance of factors in predicting Jakarta’s air quality can shed light on the charac-
teristics that are most crucial in determining air pollution levels. This information can
help anyone understand the factors that affect air quality, including environmental
scientists, policymakers, and the general public.
The variable importance measuring approach in this study makes use of several
indicators, including weight, gain, total gain, and cover as shown in Fig. 5. Each
metric offers a unique perspective on the significance of a feature in terms of affecting
the model’s predictions. The PM25 feature highlights three indicators: gain, total
gain, and cover. This reveals that PM25 has a major impact on air quality estimates.
Because PM2.5 is a fine particulate emitted by a range of sources, including autos,
industry, and combustion, its dominance in three indices indicates that PM2.5 levels
have a substantial impact on Jakarta’s air pollution. Controlling PM2.5 emissions
could be the primary focus of efforts to enhance air quality. Despite having low
relevance values in two indicators (gain and cover), SO2 has a considerable impact
on air quality predictions, as evidenced by its high weight value. SO2 levels in the
air have a significant role in affecting air quality. The high importance of SO2 in the
“Weight” indicator may indicate that fluctuations in SO2 levels in the air can have
a large impact on overall air quality. Even though it dominates only the weight indicator,
it is critical to continue monitoring and controlling SO2 levels to maintain good air
quality. CO has a low significance value for all indicators, indicating that CO has no
meaningful impact on predicting air quality in the context of this study. However, it
should be highlighted that CO levels should be regularly monitored because this gas
can come from sources such as motor vehicles and combustion and can have major
health consequences.
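The indicators discussed above can be read directly from a fitted XGBoost booster; the sketch below is illustrative, using synthetic data in place of the air-quality features.

```python
# Illustrative sketch: extracting the weight, gain, total_gain, and cover importance
# indicators from a fitted XGBoost classifier.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=6,
                           n_classes=3, random_state=0)
model = XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)

booster = model.get_booster()
for metric in ("weight", "gain", "total_gain", "cover"):
    # get_score returns {feature_name: value}; features never used in a split are omitted
    print(metric, booster.get_score(importance_type=metric))
```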
5 Conclusion
According to the findings of this study, the XGBoost parameters improved using
random search have a considerable impact on model performance. The optimal
parameter combination, with a subsample of 0.7, n_estimators of 200, max_depth of 5, learning rate of 0.01, gamma of 0.1, and colsample_bytree of 0.6, may
offer reliable and consistent predictions of air quality. The fivefold cross-validation
technique yields a model with consistent performance across all folds. The accu-
racy, F1-score, and recall values in each fold range from 0.9937 to 0.9980. This
demonstrates that the algorithm can accurately estimate air quality in Jakarta and is consistent across different conditions. Furthermore, the test data evaluation results are
excellent. The model has an accuracy of 0.995160, an F1-score of 0.995288, and a
recall of 0.995098. This demonstrates that the model is highly good at categorizing
air quality.
PM2.5 and SO2 are the key variables influencing air quality predictions in Jakarta,
according to an examination of the importance of variables. This model can be used to
understand and address the air pollution issues in Jakarta since PM2.5 has a substan-
tial impact on three critical indicators, while SO2 has a strong impact on the “Weight”
indicator, demonstrating its importance in air quality prediction. In their
efforts to preserve better air quality, environmental scientists, legislators, and the
general public can benefit greatly from this knowledge.
References
1. Domingo JL, Rovira J (2020) Effects of air pollutants on the transmission and severity of respi-
ratory viral infections. Environ Res 187:109650. https://doi.org/10.1016/J.ENVRES.2020.
109650
2. Liu M et al (2019) Population susceptibility differences and effects of air pollution on cardio-
vascular mortality: epidemiological evidence from a time-series study. Environ Sci Pollut Res
26(16):15943–15952. https://doi.org/10.1007/S11356-019-04960-2/FIGURES/1
3. Lestari P, Arrohman MK, Damayanti S, Klimont Z (2022) Emissions and spatial distribution of
air pollutants from anthropogenic sources in Jakarta. Atmos Pollut Res 13(9):101521. https://
doi.org/10.1016/J.APR.2022.101521
4. Mehmood I et al (2020) Carbon cycle in response to global warming. Environ Clim Plant Veg
Growth 1–15. https://doi.org/10.1007/978-3-030-49732-3_1/COVER
5. Raihan A, Muhtasim DA, Pavel MI, Faruk O, Rahman M (2022) An econometric analysis of
the potential emission reduction components in Indonesia. Cleaner Prod Lett 3:100008. https://
doi.org/10.1016/J.CLPL.2022.100008
6. McGranahan G, Songsore J, Kjellén M (2021) Sustainability, poverty and urban environmental
transitions. In: The Earthscan reader in sustainable cities, pp 107–133.. https://doi.org/10.4324/
9781315800462-8
7. Abulude F, Abulude I, Oluwagbayide S, Afolayan S, Ishaku D (2021) Air quality index: case
of one-day monitoring of 253 urban and suburban towns in Nigeria. Environ Sci Proc 8(1):4.
https://doi.org/10.3390/ECAS2021-10342
8. Tiwari A et al (2019) Considerations for evaluating green infrastructure impacts in microscale
and macroscale air pollution dispersion models. Sci Total Environ 672:410–426. https://doi.
org/10.1016/J.SCITOTENV.2019.03.350
9. Masih A (2019) Machine learning algorithms in air quality modeling. Global J Environ Sci
Manag 5(4):515–534. https://doi.org/10.22034/GJESM.2019.04.10
10. Wang H, Yilihamu Q, Yuan M, Bai H, Xu H, Wu J (2020) Prediction models of soil heavy
metal(loid)s concentration for agricultural land in Dongli: a comparison of regression and
random forest. Ecol Indic 119:106801. https://doi.org/10.1016/J.ECOLIND.2020.106801
11. Abdullah S, Ismail M, Ahmed AN, Abdullah AM (2019) Forecasting particulate matter concen-
tration using linear and non-linear approaches for air quality decision support. Atmosphere
10(11):667. https://doi.org/10.3390/ATMOS10110667
12. Su X, An J, Zhang Y, Zhu P, Zhu B (2020) Prediction of ozone hourly concentrations by support
vector machine and kernel extreme learning machine using wavelet transformation and partial
least squares methods. Atmos Pollut Res 11(6):51–60. https://doi.org/10.1016/J.APR.2020.
02.024
13. Ali M, Dewan A, Sahu AK, Taye MM (2023) Understanding of machine learning with deep
learning: architectures, workflow, applications and future directions. Computers 12(5):91.
https://doi.org/10.3390/COMPUTERS12050091
14. Otok BW, Suharsono A, Purhadi, Standsyah RE, Al Azies H (2022) Partitional clustering of
underdeveloped area infrastructure with unsupervised learning approach: a case study in the
Island of Java, Indonesia. J Reg City Plann 33(2):77–196. https://doi.org/10.5614/JPWK.2022.
33.2.3
33. Liang Y et al (2019) Product marketing prediction based on XGboost and LightGBM algorithm.
In: ACM international conference proceeding series, pp 150–15. https://doi.org/10.1145/335
7254.3357290
34. Parsa M (2021) A data augmentation approach to XGboost-based mineral potential mapping:
an example of carbonate-hosted ZnPb mineral systems of Western Iran. J Geochem Explor
228:106811. https://doi.org/10.1016/J.GEXPLO.2021.106811
35. Haumahu JP, Permana SDH, Yaddarabullah Y (2021) Fake news classification for Indone-
sian news using Extreme Gradient Boosting (XGBoost). IOP Conf Ser Mater Sci Eng
1098(5):052081. https://doi.org/10.1088/1757-899X/1098/5/052081
Modeling Earthquake Catalog
in Sumatra by Space–Time
Epidemic-Type Aftershock Sequences
Model: Combining
Davidon–Fletcher–Powell and Stochastic
Declustering Algorithms
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 335
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_25
1 Introduction
The remainder of the paper is presented as follows. Section 2 describes the earth-
quake dataset while Sect. 3 details the methodology. We discuss the results in Sect. 4
and conclude the paper in Sect. 5.
2 Data Description
The dataset includes coordinates (latitude, longitude), time, and magnitude of earth-
quake occurrences in Sumatra during 1 July 2005–30 June 2023, obtained from
the United States Geological Survey (USGS). We focus only on the major-shallow
earthquakes (magnitude ≥ 4 and depth < 60 km) which are often classified as a type
of earthquake causing great damage [17]. The Sumatra study area is defined as the region within −6.125° to 5.806° latitude and 95.010° to 108.545° longitude (see Fig. 1). Based
on Fig. 1, the distribution of earthquakes on the island of Sumatra tends to be domi-
nated by the western and southern regions of Sumatra. In addition, most earthquakes
originate from marine areas close to Sumatra. Table 1 depicts the data structure for
N earthquake events, consisting of n complementary events and N − n main events.
In this study, we divide the research data into two parts, namely the main events and
complementary events. The division is done to increase the accuracy of the modeling.
The existence of a complementary event is useful for identifying previous patterns of
earthquake occurrence, especially aftershocks in the past which can have an effect on
the next time period. In this study, we divide main events and complementary events
based on the length of time duration. The main event is defined from 01 July 2017
to 30 June 2023. Meanwhile, complementary events are defined differently in the
four model scenarios as shown in Table 2. In previous studies, complementary events
were determined by the researcher directly [11, 13, 14]. The model scenario is used
to identify the best choice of complementary events. Because the use of different
complementary events affects the computation time and convergence of parameters,
we also conduct a review of the use of complementary events in the Space–Time
ETAS model.
From 1 July 2005 to 30 June 2023, there have been 5142 earthquakes that have
shallow depth and moderate strength (with 1104 earthquake events defined as main
events). One source of earthquakes on the island of Sumatra is the subduction zone off
the west coast of Sumatra. This is due to the Indo-Australian Plate subducting beneath the Eurasian Plate. The collision boundary between the two plates can be
observed in the form of a deep-sea trench line west of Sumatra to the Andaman
Islands [18, 19]. Therefore, the distribution of earthquake events on the island of
Sumatra is dominated in the western region. During this time period, the frequency of
earthquakes on Sumatra Island varied. The highest frequency of earthquakes occurred
on 12 September 2007, namely 66 times. Meanwhile, during other time periods, the
frequency of earthquakes on the island of Sumatra is less than 66 times per day.
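As an illustration of the selection rules described above, a pandas sketch applied to a hypothetical USGS catalog export might look like the following; the file name and column names (time, latitude, longitude, depth, mag) follow the common USGS CSV format and are assumptions, not details taken from the paper.

```python
# Illustrative sketch: selecting major-shallow events (magnitude >= 4, depth < 60 km)
# inside the study window and bounding box from a hypothetical USGS catalog export.
import pandas as pd

catalog = pd.read_csv("sumatra_usgs_catalog.csv", parse_dates=["time"])

mask = (
    (catalog["mag"] >= 4)
    & (catalog["depth"] < 60)
    & (catalog["time"] >= "2005-07-01")
    & (catalog["time"] <= "2023-06-30")
    & catalog["latitude"].between(-6.125, 5.806)
    & catalog["longitude"].between(95.010, 108.545)
)
events = catalog.loc[mask, ["time", "latitude", "longitude", "depth", "mag"]]
print(len(events), "major-shallow events retained")
```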
Fig. 1 Earthquake distribution map on Sumatra Island. The red line indicates the study area. The
dots scattered in the study area are marked point patterns that show the distribution of earthquakes
with certain magnitude values on the island of Sumatra, Indonesia, since 01 July 2005 until 30 June
2023
3 Methodology
Equation (3) contains a number of constraints for parameter values, namely μ > 0,
A > 0, α > 0, c > 0, p > 1, D > 0, γ > 0, and q > 1. Equation (3) also has
various important features for each earthquake [11].
a. The main earthquake is assumed to be an immigrant in the branching process of
earthquake events. The background seismicity rate is denoted as ũ(x, y):
In this study, the background seismicity rate is estimated using a kernel estimator
via an isotropic bivariate Gaussian function (ϕ(x)) with a bandwidth h [13].
b. Each aftershock is independent of other aftershocks. The average number of
aftershocks resulting from an earthquake with a magnitude of m is expressed as
κ(m):
c. The probability distribution from the time of the main earthquake until after-
shocks occur is assumed to be a function of the time lag of the main earthquake,
namely g(t − τ), where τ is the time between earthquake events:
$$g_{c,p}(t - t_i) = \begin{cases} \dfrac{p-1}{c}\left(1 + \dfrac{t - t_i}{c}\right)^{-p}, & t - t_i > 0 \\[4pt] 0, & t - t_i \le 0 \end{cases}$$
In this study, the function is defined as the radial symmetry density function [13].
e. The magnitude distribution of the main earthquakes is independent of location, the magnitude distribution of the aftershocks is independent of the magnitude of the main earthquake, and the magnitude distributions of the main earthquakes and the aftershocks are identical.
Based on Eqs. (2) and (3), there are nine parameters, namely β, μ, A, α, c, p, D,
γ , q, which need to be estimated in the Space–Time ETAS model.
where
$$l_1(\beta|C_t) = \log(\beta)\sum_{i=1}^{N}\delta_i - \beta\sum_{i=1}^{N}\delta_i(m_i - m_0); \text{ and}$$
$$l_2(\theta|C_t) = \sum_{i=1}^{N}\delta_i\log\big(\lambda_\theta(t_i, x_i, y_i|C_t)\big) - \int_{t_{\mathrm{start}}}^{t_{\mathrm{start}}+T}\iint_S \lambda_\theta(t, x, y|C_t)\,\mathrm{d}x\,\mathrm{d}y\,\mathrm{d}t.$$
The value of δ_i = 1 if event i is a main event and δ_i = 0 otherwise. Equation (4) has two main parameter components, namely the parameter β and the vector parameter θ = (μ, A, α, c, p, D, γ, q)^T.
Parameter estimation in the Space–Time ETAS model is carried out using the
Maximum Likelihood Estimation (MLE) method. The MLE method produces the closed-form estimate $\hat{\beta} = \sum_{i=1}^{N}\delta_i \big/ \sum_{i=1}^{N}\delta_i(m_i - m_0)$. However, for the vector parameter θ, the MLE method produces an equation that is not in closed form. Due to the
complexity of l2 (θ|Ct ), numerical methods must be considered. For this reason,
parameter estimation is continued using numerical iteration. In this case, a radial
partition is performed to approximate the value in the integral form in l2 (θ|Ct ). The
integral value of each partition is approximated using a transformation from Cartesian
coordinates to polar coordinates. Furthermore, the estimation of vector parameter θ
is obtained by minimizing ξ (θ ) = −l2 (θ |Ct ) through the Davidon–Fletcher–Powell
(DFP) algorithm and continued with stochastic declustering (SD) algorithm [11–13].
Table 3 describes the DFP algorithm and SD algorithm, which require ∇ξ(θ) (the gradient of ξ(θ)), the matrix H_k (the Hessian matrix of ξ(θ)), and û_0(x_j, y_j) as inputs. In this study, the convergence limit used is ε = 0.001.
The computation is done with the help of the ETAS packages available in R. In
this case, the initial value for the vector parameter θ is $\mu = N/(4T|S|)$, A = 0.01, α =
1, c = 0.01, p = 1.3, D = 0.01, γ = 1, and q = 1, where T is the length of time
duration of the main event [13, 14]. After getting the complete parameter estimation
values, the Akaike information criterion (AIC) value can be calculated. The AIC
value serves as a measure of optimization of a model [13]. In this study, there are
four model scenarios whose optimization values are compared. The AIC value is
obtained through the following formula [20, 21]:
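In its standard form,
$$\mathrm{AIC} = -2\,l\big(\hat{\theta}\big) + 2K,$$
where $l(\hat{\theta})$ is the maximized log-likelihood and $K$ is the number of estimated parameters (nine in the Space–Time ETAS model); the model scenario with the smallest AIC is preferred.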
(Table 3, excerpt) Step 6 of the DFP algorithm sets the scaling constant $c_2 = 1/(\eta_k^{T} H_k \eta_k)$, and step 7 updates the matrix as $H_{k+1} = H_k + c_1 h_k d_k^{T} - c_2 H_k \eta_k \eta_k^{T} H_k$. The SD algorithm monitors the relative changes $e_2 = \max_j\{\hat{u}_{k+1}(x_j, y_j)/\hat{u}_k(x_j, y_j)\} - 1$ and $e_3 = \xi(\hat{\theta}_{k+1})/\xi(\hat{\theta}_k) - 1$. The process is repeated from step (1) with k = k + 1 until the convergence criterion is reached: $\hat{\theta}_{k+1} - \hat{\theta}_k < \varepsilon$ or $e_1 < \varepsilon$, $e_2 < \varepsilon$, and $e_3 < \varepsilon$, with $\nabla\xi(\hat{\theta}_{k+1}) < \varepsilon$. The outputs are $\hat{\theta}_{k+1}$, $H_{k+1}$, and $\hat{u}_{k+1}(x, y)$.
Based on research data, the highest magnitude value of 8.4 occurred on 12 September
2007. This earthquake has a hypocenter located about 122 km southwest of Bengkulu, Sumatra. In general, the magnitude value of earthquakes on the island of
Sumatra tends to be dominated by a magnitude value of 4 (with a mean of 4.59 and
Fig. 2 a Histogram of the values of magnitude and b magnitude plots with log values of the number
of earthquake occurrences on Sumatra Island
a median of 4.50), whereas a very high magnitude value has a small frequency of
occurrence. The distribution of the magnitude values is presented in Fig. 2a.
Figure 2a shows that the earthquakes on Sumatra Island were dominated by
moderate earthquakes with a magnitude of 4.35 to 4.45, namely 621 times. Based
on Fig. 2b, it can be seen that, in general, the plot of magnitude against the log of the number of earthquake occurrences forms a linear trend across all model scenarios. This is in accordance with the Gutenberg–Richter law [14]. The Gutenberg–Richter law states that the logarithm of the frequency of earthquakes is linearly related to the magnitude, namely log10(N) = a − bm, where a and b are parameters.
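The Gutenberg–Richter relationship can be checked numerically by regressing the base-10 logarithm of cumulative event counts on magnitude; the short sketch below uses synthetic magnitudes rather than the Sumatra catalog, and the binning choices are illustrative.

```python
# Illustrative sketch: estimating the Gutenberg-Richter parameters a and b
# from log10(N) = a - b*m, using synthetic magnitudes with b close to 1.
import numpy as np

rng = np.random.default_rng(0)
mags = 4 + rng.exponential(scale=1 / np.log(10), size=5000)   # synthetic catalog, m >= 4

m_grid = np.arange(4.0, 6.01, 0.1)
counts = np.array([(mags >= m).sum() for m in m_grid])        # cumulative counts N(>= m)

slope, intercept = np.polyfit(m_grid, np.log10(counts), 1)    # fit log10 N = a - b*m
print(f"a = {intercept:.2f}, b = {-slope:.2f}")
```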
Furthermore, Space–Time ETAS modeling is carried out based on the model
scenario defined in Table 2. In this case, it is necessary to carry out diagnostics of
stationarity assumptions for time and data completeness before estimating parame-
ters. This is because the Space–Time ETAS model is built on these two assumptions.
Diagnostics of the two assumptions are presented through a magnitude plot with the
log value of the number of earthquake occurrences and a time plot with the cumu-
lative number of earthquake occurrences. In the previous section, it has been shown
that research data follow Gutenberg–Richter law (see Fig. 2b). This indicates that
the assumption of data completeness has been met.
Meanwhile, the assumption of stationarity needs to be tested in each modeling
scenario. Figure 3 shows the time plot with the cumulative number of earthquake
occurrences for each model. The time plot against the cumulative number of earth-
quake occurrences which forms a linear trend shows that the earthquake occurrences
are stationary with respect to time. Model 1 has a linear pattern in the study period, but a nonlinear pattern in the complementary events. Therefore, it can be concluded that model 1 only fulfills the assumption of completeness; its parameters are still estimated, however, as material for comparing results. On the other hand, models 2, 3, and 4 fulfill all assumptions, namely stationarity and data completeness.
It is very important to fulfill the assumptions in the Space–Time ETAS modeling.
Violation of the assumptions can cause the estimation results to fail to converge or
the estimation results obtained to be illogical to interpret (if they converge) [22].
Parameter estimation results for each model are presented in Table 4. In this case, the
estimated value for the parameter β is identical for each model because the solution is
closed form. Meanwhile, other parameters have different estimation results because
they are obtained by a numerical approach. For all models, the parameters c and D
are not significant because they have a larger standard error value than the estimation
results (see Table 4). This means that the difference in time duration used for the
complementary events does not affect the significance of the estimated parameters
of the Space–Time ETAS model. This indicates that the length of complementary
events does not have a significant impact.
To evaluate the four models, we compare using several metrics with the results
shown in Table 5. Model 4 outperforms the other models in terms of log-likelihood,
AIC, number of stochastic declustering iterations, and execution time. In this case,
the optimal model is characterized by the largest log-likelihood value, the minimum
AIC value, the least number of iterations, and the fastest execution time.
In addition, Table 5 shows that the longer duration of time on complementary
events is not always directly proportional to the AIC value (see, e.g., models 2 and
3). Furthermore, the role of data stationarity is very important to the efficiency of
computation time. In Fig. 3, it has been shown that model 1 does not meet the
assumption of stationarity. If parameter estimation is still carried out on data with
this case, the computation time becomes very long. Meanwhile, data that meets the
assumption of stationarity will tend to have faster computation time. Therefore, the
selection of the length of the duration of the complementary events can be done
Fig. 4 Comparison of the number of iterations of the DFP algorithm and the SD algorithm for each
model
by considering the stationarity of the data. If the research data used as the main
event is not stationary, then a certain length of time duration can be selected as a
complementary event to make the data stationary.
The time required for computation is also affected by the volume of data. In general, smaller amounts of data require less computation time. Figure 4 shows a comparison
of the number of iterations of the DFP algorithm and the SD algorithm for each model.
In Fig. 4, it can be seen that the number of iterations of the DFP algorithm in model
2 at each stochastic declustering stage is 19; 9; 5; 3; and 2. This is the least number
of iterations when compared to iterations in other models. However, this does not mean that model 2 has the fastest computing time; in fact, the fastest computation time occurs in model 4, which has less data than model 2. Thus, for time efficiency, we suggest that the complementary event does not need to be longer than the main event; as long as the data meet the assumptions of the Space–Time ETAS model, the optimal modeling can be carried out.
Model interpretation and earthquake hazard mapping on Sumatra Island are based
on model 4. The parameter estimation results in model 4 show that there are two
parameters that are not significant out of nine parameters. Each of the estimated parameter values can be used to identify patterns of temporal seismicity. The significant β parameter shows that the earthquake magnitude affects the probability of an earthquake occurring on the island of Sumatra. In addition, the
significant parameters A and α indicate that the number of aftershocks on the island
of Sumatra is affected by the magnitude. The α value of 1.7464 measures the effi-
ciency of the magnitude of an earthquake in producing aftershocks. Meanwhile, the
p value of 1.1740 indicates the aftershock decay rate. Based on the estimation results,
Fig. 5 a Background seismicity rate, b total spatial rate, c clustering coefficient, and d conditional
intensity function value at the end of the study period
the results can be visualized to facilitate the interpretation of the model as shown in
Fig. 5.
The value of the background seismicity rate of the main earthquake in each region
of Sumatra Island is dominated along the west and north coasts of Sumatra Island
(see Fig. 5a). The rate value is between 0.00 and 0.25. Meanwhile, the total spatial
rate for both the main and aftershocks is mapped in Fig. 5b. These results tend to be
almost the same as the background seismicity rate.
On the other hand, the mapping of aftershocks can be seen based on the value of the
clustering coefficient in Fig. 5c. Some regions have very high clustering coefficient
values. This means that the potential for aftershocks after the main earthquake in the
area is very high. It can be seen that the high coefficient values are in the sea area
and land area of Sumatra Island. In the sea area, the west coast of Sumatra Island has
a very high potential for aftershocks. Meanwhile, in the mainland, the provinces of
Aceh and North Sumatra have the potential for very high aftershocks. The mapping
of the conditional intensity function at the end of the earthquake period on Sumatra
Island is presented in Fig. 5d. The probability of an earthquake occurring at the end
of the study period has the same distribution pattern as the background seismicity
rate chart. This means that the potential for an earthquake to occur in the future will
still have the same intensity as the previous incident.
The results of this study are used to create a hazard map for earthquake-prone areas,
both the main earthquake and aftershocks on the island of Sumatra. The hazard map is
presented in Fig. 6. Figure 6 shows that the main earthquake potential in the Sumatra
region is dominated by land and sea areas. Meanwhile, the potential for aftershocks to
occur is more dominated in the sea area, especially along the west coast of the island
of Sumatra. This shows that the regions of Aceh, North Sumatra, West Sumatra,
Bengkulu, and Lampung are provinces that are prone to earthquakes, especially the
main earthquake. In addition, small island areas on the west coast of Sumatra Island,
such as the Simoloe, Sikolo, Nias, Mentawai, and Enggano Islands, are areas that
are prone to earthquakes, both main and aftershocks. In general, aftershocks have
a smaller magnitude than the main earthquake. The frequency of aftershocks will
decrease over time. In this study, we set a minimum bandwidth threshold value of
0.05 degrees in the parameter estimation process. This means that an earthquake may occur right at an identified risk point or in its surroundings, but still within a radius of 0.05 degrees.
This fact is in accordance with BMKG records regarding earthquake-prone areas
on Sumatra Island [23]. In the Aceh region, most of the epicenters occurred on the
mainland. This is due to the geographical conditions of Aceh which are flanked by
two segments, namely the Aceh segment and the Seulimeum segment. In addition,
Simoloe Island in the province of Aceh is also an area that is very prone to earthquakes
and tsunamis due to its proximity to the subduction zone. In the same way, this also
occurs in the Nias Islands, whose geographical conditions are close to the subduction
zone. Another earthquake-prone area is the province of West Sumatra. The high risk
of earthquakes in West Sumatra is caused by the presence of 7 segments of the Suma-
tran fault located in West Sumatra, namely the Siulak segment, the Tutup segment, the
Sumani segment, the Sianok segment, the Sumpur segment, the Barumun segment,
and the Angkola segment. The high risk of earthquakes on the island of Sumatra
is inseparable from its geographical conditions which are traversed by subduction
zones, faults, and volcanic routes [24, 25].
5 Conclusion
By combining the DFP and SD algorithms, the Space–Time ETAS model can be used
to model a complete earthquake catalog. The results of Space–Time ETAS modeling
on earthquake data on Sumatra Island provide good earthquake hazard mapping
results because the Space–Time ETAS model is built based on the characteristics of
earthquake events by considering the main earthquake and aftershocks. In addition,
this model also accommodates the use of complementary events as supporting data
to improve model accuracy. In this study, we recommend considering the length of
the duration of the complementary event based on the length of the duration of the
Fig. 6 Hazard map for main earthquakes and aftershocks on Sumatra Island. The map shows that
the characteristics of the earthquakes that occurred on the island of Sumatra were often followed
by aftershocks. Many earthquakes occur around the small islands on the west coast of Sumatra
Island. Number 1 represents the Simoloe Island area, number 2 represents Sikolo Island, number 3
represents Nias Island, number 4 represents the Mentawai Islands, and number 5 represents Enggano
Island
main event. In addition, the assumption of stationarity and data completeness is also
important in terms of the efficiency of computation time. With this approach, the
most optimal Space–Time ETAS model can be obtained as measured by the AIC
value. Thus, a hazard mapping of earthquakes is obtained, especially shallow earth-
quakes with a minimum magnitude of 4 on the island of Sumatra. In general, shallow
earthquakes with a minimum magnitude of 4 have caused damage. Therefore, appro-
priate mitigation steps can be prepared in areas that are prone to earthquakes. Further
research can be carried out by constructing the effects of geological variables into
the Space–Time ETAS model so that the effects of these variables can be measured.
Acknowledgements The first author thanks Lembaga Pengelola Dana Pendidikan (LPDP) for the
financial support to conduct postgraduate study at the Institut Teknologi Sepuluh Nopember. The
study is partially supported by the grant from Institut Teknologi Sepuluh Nopember No 1735/PKS/
ITS/2023. We also thank three reviewers for their constructive comments.
References
1. Ritchie H, Rosado P, Roser M (2023) Natural disasters, our world in data, https://ourworldi
ndata.org/natural-disasters. Accessed 22 July 2023
2. Parwanto NB, Oyama T (2014) A statistical analysis and comparison of historical earthquake
and tsunami disasters in Japan and Indonesia. Int J Disaster Risk Reduction 7:122–141
3. Jena R, Pardhan B, Beydoun G (2020) Earthquake vulnerability assessment in northern Sumatra
province by using a multi-criteria decision-making model. Int J Disaster Risk Reduction
46:101518
4. USGS, https://earthquake.usgs.gov/earthquakes. Accessed 01 June 2023
5. Cremen G, Galasso C (2020) Earthquake early warning: recent advances and perspectives.
Earth Sci Rev 205:103184
6. Sediono, Mardianto MFF, Ulyah SM, Pangestu AA, Susanti R, Firdaus HA, Andreas C (2022)
The modelling of earthquake magnitude in the southern part of java island using geographically
weighted regression. Commun Math Biol Neurosci 2022:13
7. Tim Pusat Studi Gempa Nasional. Peta Sumber dan Bahaya Gempa Indonesia Tahun 2017.
Kementeria Pekerjaan Umum dan Perumahan Rakyat, Jakarta (2017)
8. Choiruddin A, Aisah Trisnisa F, Iriawan N (2021) Quantifying the effect of geological factors on
distribution of earthquake occurrences by inhomogeneous cox processes. Pure Appl Geophys
178:1579–1592
9. Choiruddin A, Susanto TY, Metrikasari R (2021) Two-step estimation for modeling the earth-
quake occurrences in Sumatra by Neyman-Scott Cox point processes. Soft Comput Data Sci
1489:146–159
10. Ogata Y, Zhuang J (2006) Space-time ETAS models and an improved extension. Tectonophysics
413:13–23
11. Zhuang J, Ogata Y, Vere-Jones D (2002) Stochastic declustering of space-time earthquake
occurrences. J Am Statistical Assoc 97(458):369–380
12. Zhuang J (2011) Next-day earthquake forecasts for the japan region generated by the ETAS
model. Earth Planets Space 63:207–216
13. Jalilian A (2019) ETAS: an R package for fitting the space-time ETAS model to earthquake
data. J Stat Softw 88(1):1–39
14. Ogata Y (1998) Space-time point-process models for earthquake occurrences. Ann Inst Stat
Math 50(2):379–402
15. Davoudi N, Tavakoli HR, Zare M, Jalilian A (2020) Aftershock probabilistic seismic hazard
analysis for Bushehr Province in Iran using ETAS model. Natural Hazard 100:1159–1170
16. Nicolis O, Chiodi M, Adelfio G (2015) Windowed ETAS models with application to the Chilean
seismic catalogs. Spatial Statistics 14:151–165
17. BPBD, https://bpbd.ntbprov.go.id/pages/gempa-bumi. Accessed 05 Aug 2023
18. Natawidjaja DH, Triyoso W (2007) The Sumateran fault zone—from source to hazard. J
Earthquake Tsunami 1(1):21–47
19. Wijaya AR (2023) Model ETAS Spatio-Temporal pada Analisis Pemetaan Intensitas Kegem-
paan di Wilayah Sumatera. Jambura J Math 5(1):179–188
20. Choiruddin A, Hannanu FF, Mateu J, Fitriyanah V (2023) COVID-19 transmission risk in
Surabaya and Sidoarjo: an inhomogeneous marked poisson point process approach. Stoch Env
Res Risk Assess 37:2271–2282
21. Choiruddin A, Susanto TY, Husain A, Kartikasari YM (2023) Kppmenet: Combining the
kppm and Elastic Net Regularization for Inhomogeneous Cox Point Process with Correlated Covariates. J Appl Stat 1–14
22. Zhuang J (2012) Long-term earthquake forecasts based on the epidemic-type aftershock
sequence (ETAS) model for short-term clustering. Res Geophys 2(1):e8
23. BMKG. Katalog Gempa Bumi Indonesia: Relokasi Hiposenter dan Implikasi Tektonik. Bidang
Informasi Gempabumi dan Peringatan Dini Tsunami, Badan Meteorologi Klimatologi dan
Geofisika, Jakarta (2021)
24. Susanto TY, Choiruddin A, Purnomo JDT (2023) On the earthquake distribution modeling in
Sumatra by cauchy cluster process: comparing log-linear and log-additive intensity models.
Sains Malaysiana 52(2):655–557
25. Kartikasari YM, Choiruddin A (2022) Analisis Risiko Gempabumi di Sumatera dengan Cauchy
Cluster Process. Inferensi 5(2):123–134
Small Area Estimation of Mean Years
of Schooling Under Time Series
and Cross-sectional Models
Abstract Small area estimation develops within the framework of time series and
cross-sectional models. The restricted estimation maximum likelihood method was
used to obtain the empirical best linear unbiased prediction for small areas and its
mean squared error estimators. The model focuses on applying statistical models that
permit borrowing strength over area and time. The process uses regularly conducted
survey data, where the areas of interest are observed repeatedly under a predeter-
mined scheme. The time series and cross-sectional models were able to capture
heterogeneity across area and time, so it can be used to enhance sample size effec-
tiveness, thereby minimizing mean squared error and producing a more accurate
estimation. The simulation results show that the degree of correlation parameters
significantly affects the efficiency of the model. The application to estimate mean
years of schooling at the sub-district level in Langkat Regency, North Sumatra,
Indonesia, for the period of 2018–2021 showed that the time correlation coefficient
was 0.3758, the variance of the area random effect was 1.1125, and the variance of
the area-time random effect was 0.3241. The estimations derived from time series
and cross-sectional models had a lower mean squared error than those obtained from
the Fay-Herriot models and direct estimation.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 353
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_26
1 Introduction
Small area estimation is primarily concerned with methods for acquiring precise
regional statistics when the available sample data is insufficient to make statistically
significant inferences with sufficient precision. In the past few years, the worldwide
demand for accurate estimations of small areas has significantly increased [1]. In
addition, it is difficult to obtain a sufficient sample size to provide an accurate direct
estimation for all areas.
Increasing the sample size can enhance the precision of the estimation in a small
area. Nevertheless, sample size is positively correlated with financial resources,
human resources, and volume of time. Therefore, it is important to use indirect esti-
mations that leverage the notion of “borrowing strength” by using some values related
to the variables of interest from the corresponding areas. This approach effectively
increases the effectiveness of the sample size [1–5].
Small area estimation is a subfield method of statistics that integrates samples from
surveys with constrained populations based on a model of statistics. The monograph
by Rao [3] presents comprehensive explanations of this theory, as do the critical
analyses conducted by Ghosh and Rao [1], Pfeffermann [6], and Jiang and Lahiri
[5]. The small area estimation method consists of two components: the fixed effects, obtained through the use of auxiliary variables, and the random effects, obtained by considering area-specific variation. Two distinct estimation
methods can be used to improve precision, especially borrowing strength across areas
and/or borrowing strength over multiple time periods.
Fay and Herriot [7] proposed the model of borrowing strength across areas to
derive log per-capita income. Several surveys are conducted repeatedly over a period
of time. Within the framework of repeated surveys, the Fay-Herriot models were
not able to capture heterogeneity both over area and time. Therefore, Rao-Yu [8,
9] proposed time series and cross-sectional models that combine information from
different time periods in various cross-sections. Hence, it can be used to capture
heterogeneity both over area and time; the model is known as the Rao-Yu models.
The Rao-Yu models are a modification of the Fay-Herriot models by adding an area-
time random effects component that follows a first-order autoregressive process. Rao
and Yu [8, 9], Esteban et al. [10], and Fay and Diallo [11] described that significant
gains in the efficiency of estimation can be achieved by using a model with area-time
random effects.
The variance components of random effects on the empirical best linear unbi-
ased prediction (EBLUP) estimator and its mean squared error (MSE) can be esti-
mated using the method of moments, maximum likelihood, or restricted estima-
tion maximum likelihood (REML). In this study, we presented the REML method.
According to Rao [3], the REML method considers the loss in degrees of freedom
by using transformed data. Molina et al. [12] showed that the REML estimator is
more robust and produces an unbiased estimator even at small sample sizes.
Therefore, this study focuses on applying the REML method to estimate the
unknown variance components of random effects. The final goal of this study was
to apply developed area-level models to derive the EBLUP and MSE of model 1
(Fay-Herriot model) and model 2 (Rao-Yu model), with a focus on estimating the
mean years of schooling at the sub-district level in Langkat Regency, North Sumatra,
Indonesia, for the period of 2018–2021.
2 Literature Review
The Fay-Herriot models consist of two stages. The sampling model is used to
depict the sampling error of direct estimators in the first stage. Let θd represent
the characteristic of interest in the d-th area and yd represent a direct estimator of θd .
yd = θd + ed , d = 1, ..., D, (1)
The symbol D denotes the areas, and the sampling errors ed are independent, with
ed ∼ iid N (0, σe2 ). In the second stage, assuming that θd is related to area-specific
auxiliary variables xd = (x1d , . . . , xqd ) through a linear model
Henderson [13] describes the best linear unbiased prediction (BLUP) approach
as a common method for obtaining the parameter estimation of random effects in
Eq. (3). The BLUP estimator for the d-th area d = (1, ..., D) can be written as
$$\tilde{\theta}_d^{FH}(\sigma_v^2) = \mathbf{x}_d'\tilde{\beta} + \frac{\sigma_v^2 z_d^2}{\sigma_v^2 z_d^2 + \sigma_e^2}\left(y_d - \mathbf{x}_d'\tilde{\beta}\right), \qquad (4)$$
The EBLUP is obtained by replacing $\sigma_v^2$ with its estimator $\hat{\sigma}_v^2$:
$$\hat{\theta}_d^{FH}(\hat{\sigma}_v^2) = \mathbf{x}_d'\hat{\beta} + \frac{\hat{\sigma}_v^2 z_d^2}{\hat{\sigma}_v^2 z_d^2 + \sigma_e^2}\left(y_d - \mathbf{x}_d'\hat{\beta}\right), \qquad (5)$$
where
$$\hat{\pi}_d = \frac{\hat{\sigma}_v^2 z_d^2}{\hat{\sigma}_v^2 z_d^2 + \sigma_e^2}$$
with
$$\hat{\beta} = \left(\sum_{d=1}^{D}\frac{\mathbf{x}_d\mathbf{x}_d'}{\hat{\sigma}_v^2 z_d^2 + \sigma_e^2}\right)^{-1}\sum_{d=1}^{D}\frac{\mathbf{x}_d y_d}{\hat{\sigma}_v^2 z_d^2 + \sigma_e^2}.$$
The mean squared error of the EBLUP estimator of Fay-Herriot models is defined
as follows:
$$\mathrm{MSE}\big(\hat{\theta}_d^{H}\big) \approx g_{1d}(\sigma_v^2) + g_{2d}(\sigma_v^2) + g_{3d}(\sigma_v^2) \qquad (6)$$
where
$$g_{1d}(\sigma_v^2) = \frac{\sigma_v^2 z_d^2 \sigma_e^2}{\sigma_v^2 z_d^2 + \sigma_e^2} = \pi_d\sigma_e^2,$$
$$g_{2d}(\sigma_v^2) = (1-\pi_d)^2\,\mathbf{x}_d'\left(\sum_{d=1}^{D}\frac{\mathbf{x}_d\mathbf{x}_d'}{\sigma_v^2 z_d^2 + \sigma_e^2}\right)^{-1}\mathbf{x}_d,$$
$$g_{3d}(\sigma_v^2) = \sigma_e^4 z_d^4\left(\sigma_v^2 z_d^2 + \sigma_e^2\right)^{-3}\overline{V}\big(\hat{\sigma}_v^2\big),$$
where $\overline{V}(\hat{\sigma}_v^2) = 2\Big[\sum_{d=1}^{D} z_d^4\big(\hat{\sigma}_v^2 z_d^2 + \sigma_e^2\big)^{-2}\Big]^{-1}$.
The estimator of MSE θ̂dH is determined similarly to Prasad and Rao [14] by
replacing the estimators of σv2 . It reduces to
$$\mathrm{mse}\big(\hat{\theta}_d^{H}\big) \approx g_{1d}(\hat{\sigma}_v^2) + g_{2d}(\hat{\sigma}_v^2) + 2g_{3d}(\hat{\sigma}_v^2). \qquad (7)$$
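To make Eqs. (4)–(7) concrete, the following numpy sketch evaluates the EBLUP and its estimated MSE for simulated data; the data, the value of $\hat{\sigma}_v^2$, and the variable names are illustrative assumptions, since in practice $\hat{\sigma}_v^2$ comes from a method such as REML.

```python
# Illustrative numpy sketch of the Fay-Herriot EBLUP and its MSE estimator
# (Eqs. (4)-(7)), with made-up data; sigma_v2 would come from REML in practice.
import numpy as np

rng = np.random.default_rng(1)
D, q = 15, 3
X = np.column_stack([np.ones(D), rng.normal(size=(D, q - 1))])  # x_d with intercept
z = np.ones(D)                                                   # z_d
se2 = rng.uniform(0.2, 0.6, size=D)                              # known sampling variances
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.8, size=D)
sigma_v2 = 0.6                                                   # assumed REML estimate

V = sigma_v2 * z**2 + se2
A = (X.T / V) @ X                                                # sum of x_d x_d' / V_d
beta = np.linalg.solve(A, (X.T / V) @ y)
pi = sigma_v2 * z**2 / V

eblup = X @ beta + pi * (y - X @ beta)                           # Eq. (5)

g1 = pi * se2
g2 = (1 - pi) ** 2 * np.einsum("dj,jk,dk->d", X, np.linalg.inv(A), X)
var_sv2 = 2.0 / np.sum(z**4 / V**2)                              # asymptotic variance of sigma_v2
g3 = se2**2 * z**4 / V**3 * var_sv2
mse = g1 + g2 + 2 * g3                                           # Eq. (7)
print(np.round(eblup[:5], 3), np.round(mse[:5], 4))
```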
Several sample surveys are conducted periodically, with partial replacement of the
sample elements. To improve the efficiency of repeated surveys, it is possible to
enhance the accuracy by leveraging information from both small areas and previous
time periods. Rao and Yu [8, 9] introduced a modification to the Fay-Herriot model
that allows for the analysis of both time series and cross-sectional data. The Rao-Yu
where the variable $y_{dt}$ represents a direct estimator of the area-specific variable of interest for area d (d = 1, ..., D) at time instant t (t = 1, ..., T), an unbiased estimator of $\theta_{dt}$; $\mathbf{x}_{dt}$ is the vector of q auxiliary variables; $v_d$ is a random effect for small area d, with $v_d \sim$ iid $N(0, \sigma_v^2)$; the error $e_{dt}$ is independent, $e_{dt} \sim$ iid $N(0, \sigma_e^2)$ with known $\sigma_e^2$'s; and the random effects $u_{dt}$ follow an AR(1) process with common variance $\sigma_\varepsilon^2$ and correlation parameter $\rho$. In matrix notation, the model is
y = Xβ + Zw + e, (12)
where
$\mathbf{y} = (y_{11}, \ldots, y_{1T}; \ldots; y_{D1}, \ldots, y_{DT})' = (\mathbf{y}_1', \ldots, \mathbf{y}_D')'$, $\mathbf{X} = (\mathbf{X}_1', \ldots, \mathbf{X}_D')'$, $\mathbf{X}_d = (\mathbf{x}_{d1}, \ldots, \mathbf{x}_{dT})'$, $\mathbf{Z} = \mathbf{I}_D \otimes \mathbf{Z}_d$, $\mathbf{Z}_d = (\mathbf{1}_T, \mathbf{I}_T)$, $\mathbf{w} = (\mathbf{w}_1', \ldots, \mathbf{w}_D')'$, $\mathbf{w}_d = (v_d, \mathbf{u}_d')'$, $\mathbf{u}_d = (u_{d1}, \ldots, u_{dT})'$, $\mathbf{e} = (\mathbf{e}_1', \ldots, \mathbf{e}_D')'$, $\mathbf{e}_d = (e_{d1}, \ldots, e_{dT})'$, $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_q)'$.
matrix containing the components $\rho^{|t-t'|}/(1-\rho^2)$. The $v_d$, $\mathbf{u}_d$, and $\mathbf{e}_d$ are independent, so the covariance matrix of $\mathbf{y}$ is $\mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}' + \mathbf{R}$, with $\mathbf{G} = \mathbf{I}_D \otimes \mathbf{G}_d$ and $\mathbf{R} = \mathrm{blockdiag}_d(\mathbf{R}_d)$.
Based on the model (11), the BLUP estimator of $\theta_{dT}$ for the current occasion T is
$$\tilde{\theta}_{dT}^{RY} = \mathbf{x}_{dT}'\tilde{\beta} + \left(\sigma_v^2\mathbf{1}_T' + \sigma_\varepsilon^2\boldsymbol{\gamma}_T'\right)\left(\sigma_v^2\mathbf{J}_T + \sigma_\varepsilon^2\boldsymbol{\Gamma} + \mathbf{R}_d\right)^{-1}\left(\mathbf{y}_d - \mathbf{X}_d\tilde{\beta}\right) \qquad (13)$$
where $\mathbf{J}_T = \mathbf{1}_T\mathbf{1}_T'$ is a T × T matrix with all elements equal to 1 and $\boldsymbol{\gamma}_T$ is the T-th row or column of $\boldsymbol{\Gamma}$.
In general, parameters σv2 , σε2 , and ρ are unknown and estimated by σ̂v2 , σ̂ε2 , and
ρ̂, respectively. The EBLUP is derived by substituting the unknown parameters with
the estimators.
$$\hat{\theta}_{dT}^{RY} = \mathbf{x}_{dT}'\hat{\beta} + \left(\hat{\sigma}_v^2\mathbf{1}_T' + \hat{\sigma}_\varepsilon^2\hat{\boldsymbol{\gamma}}_T'\right)\left(\hat{\sigma}_v^2\mathbf{J}_T + \hat{\sigma}_\varepsilon^2\hat{\boldsymbol{\Gamma}} + \mathbf{R}_d\right)^{-1}\left(\mathbf{y}_d - \mathbf{X}_d\hat{\beta}\right) \qquad (14)$$
where $\hat{\beta} = \left(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}$.
In this study, the parameter estimation was obtained based on the restricted esti-
mation maximum likelihood method. The log-likelihood function associated with
the model (12) is equal to
$$l_{\mathrm{reml}} = -\frac{DT-q}{2}\log(2\pi) + \frac{1}{2}\log\left|\mathbf{X}'\mathbf{X}\right| - \frac{1}{2}\log\left|\mathbf{V}\right| - \frac{1}{2}\log\left|\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right| - \frac{1}{2}\mathbf{y}'\mathbf{P}\mathbf{y} \qquad (15)$$
where
$$\mathbf{P} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}\left(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{V}^{-1}, \quad \mathbf{P}\mathbf{V}\mathbf{P} = \mathbf{P}, \quad \mathbf{P}\mathbf{X} = \mathbf{0}.$$
Let $\delta = (\sigma_v^2, \sigma_\varepsilon^2, \rho)'$; then $\mathbf{P}_a = \dfrac{\partial\mathbf{P}}{\partial\delta_a} = -\mathbf{P}\dfrac{\partial\mathbf{V}(\delta)}{\partial\delta_a}\mathbf{P} = -\mathbf{P}\mathbf{V}_a\mathbf{P}$, for a = 1, 2, 3.
The estimator of the variance components is derived by maximizing the likelihood function (15). The score vector is obtained from the partial derivatives of $l_{\mathrm{reml}}$ in (15) with respect to the a-th element of $\delta$,
$$S_a = \frac{\partial l_{\mathrm{reml}}}{\partial\delta_a} = -\frac{1}{2}\mathrm{tr}\left(\mathbf{P}\mathbf{V}_a\right) + \frac{1}{2}\mathbf{y}'\mathbf{P}\mathbf{V}_a\mathbf{P}\mathbf{y}, \quad a = 1, 2, 3 \qquad (16)$$
where $\mathbf{V}_a = \partial\mathbf{V}/\partial\delta_a$ is the partial derivative of $\mathbf{V}$ with respect to the a-th element of $\delta$.
By taking second-order partial derivatives of (15) with respect to δ with a, b-th
element, modifying the sign, and taking expectations, the Fisher information matrix
components are obtained
1
Fab = tr (Pa Pb ), a, b = 1, 2, 3 (17)
2
The variance components are then updated by the Fisher scoring iteration
$$\hat{\delta}^{(k+1)} = \hat{\delta}^{(k)} + \mathbf{F}^{-1}\big(\hat{\delta}^{(k)}\big)\,\mathbf{S}\big(\hat{\delta}^{(k)}\big), \qquad (18)$$
where S represents the column vector of scores and F represents the Fisher informa-
tion matrix. The estimation of the random effect variance component will be obtained
once the iteration of Eq. (18) has converged.
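The Fisher scoring step in Eq. (18) can be written as a short iterative routine; the sketch below is generic and assumes user-supplied functions for the score vector and Fisher information matrix (hypothetical names, not part of the original text).

```python
# Illustrative sketch of the Fisher scoring iteration in Eq. (18).
# score(delta) and fisher_info(delta) are hypothetical user-supplied functions
# returning S(delta) (length-3 vector) and F(delta) (3x3 matrix) for
# delta = (sigma_v^2, sigma_eps^2, rho).
import numpy as np

def fisher_scoring(delta0, score, fisher_info, tol=1e-6, max_iter=100):
    delta = np.asarray(delta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(fisher_info(delta), score(delta))  # F^{-1} S
        delta_new = delta + step
        if np.max(np.abs(delta_new - delta)) < tol:               # convergence check
            return delta_new
        delta = delta_new
    return delta
```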
Based on Prasad and Rao [14] and Das et al. [15], the mean square error of the
Rao-Yu models is defined as follows:
$$\mathrm{MSE}\big(\hat{\theta}_{dT}^{RY}\big) \approx g_1(\delta) + g_2(\delta) + g_3(\delta) \qquad (19)$$
where
$$g_1(\delta) = \mathbf{a}'\mathbf{Z}\mathbf{T}\mathbf{Z}'\mathbf{a},$$
$$g_2(\delta) = \left(\mathbf{a}'\mathbf{X} - \mathbf{a}'\mathbf{Z}\mathbf{T}\mathbf{Z}'\mathbf{R}^{-1}\mathbf{X}\right)\mathbf{Q}\left(\mathbf{X}'\mathbf{a} - \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z}\mathbf{T}\mathbf{Z}'\mathbf{a}\right),$$
$$g_3(\delta) \approx \mathrm{tr}\left\{\left(\nabla\mathbf{b}'\right)\mathbf{V}\left(\nabla\mathbf{b}'\right)'\,E\left[\big(\hat{\delta}-\delta\big)\big(\hat{\delta}-\delta\big)'\right]\right\}$$
with $\mathbf{Q} = \left(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right)^{-1}$, $\mathbf{T} = \mathbf{G} - \mathbf{G}\mathbf{Z}'\mathbf{V}^{-1}\mathbf{Z}\mathbf{G}$, and $\mathbf{b}' = \mathbf{a}'\mathbf{Z}\mathbf{G}\mathbf{Z}'\mathbf{V}^{-1}$.
According to Das et al. [15] and Rao and Molina [4], the mean square error of
EBLUP (19) can be approximated as
$$\mathrm{mse}\big(\hat{\theta}_{dT}^{RY}\big) \approx g_1\big(\hat{\delta}\big) + g_2\big(\hat{\delta}\big) + 2g_3\big(\hat{\delta}\big). \qquad (20)$$
3 Simulation Studies
This section presents the simulation studies used to analyze the performance of the
mean square error estimators. The purpose of the simulations was to determine what
could be obtained by using Rao-Yu models with the time correlation parameter ρ.
The auxiliary and interest variables for area (d = 1, ..., D) and time (t = 1, ..., T )
were
$$\mathrm{MSE} = \frac{1}{DT}\sum_{d=1}^{D}\sum_{t=1}^{T}\mathrm{MSE}_{dt}, \qquad \mathrm{CV} = \frac{1}{DT}\sum_{d=1}^{D}\sum_{t=1}^{T}\mathrm{CV}_{dt}$$
To evaluate the impact of the number of areas, time, and time correlation, we set
D = 10, 20, 30 areas and both short (T = 3), moderate (T = 5), and long (T = 8)
at time. We used ρ = 0.2 for low correlation, ρ = 0.5 for moderate correlation, and
ρ = 0.9 for high correlation. The results of the simulations to evaluate the impact
of the number of area, time, and time correlation on the mean squared error and
coefficient of variation are presented in Table 1. The best results in mean square error
are achieved when the correlation parameter is high (ρ = 0.9). The mean squared
error was decreased consistently as the correlation parameter (ρ) increased.
The mean squared error of EBLUP based on the Rao-Yu models for each of
D = 100 areas, with T = 8 time instants, and time correlation ρ = 0.2, 0.5, 0.9
are presented in Fig. 1.
The simulation results showed that an increased degree of correlation would lead
to a decrease in mean squared error and coefficient of variation. When the correlation
parameter of the Rao-Yu models was high (ρ = 0.9), it correspondingly produced
smaller mean squared error and coefficient of variation. Furthermore, the mean
squared error and coefficient of variation increased consistently as the number of
areas (D) increased, but it decreased as the time (T ) increased.
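The simulation design above can be reproduced schematically by drawing area effects, AR(1) area-time effects, and sampling errors; the following sketch uses illustrative variances and a single auxiliary variable, since the exact generating values are not restated here.

```python
# Illustrative sketch: generating data from the Rao-Yu model
# y_dt = x_dt' beta + v_d + u_dt + e_dt with AR(1) area-time effects u_dt.
import numpy as np

def simulate_rao_yu(D, T, rho, sigma_v2=1.0, sigma_eps2=1.0, sigma_e2=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(D, T))                        # single auxiliary variable (illustrative)
    beta = 1.0
    v = rng.normal(scale=np.sqrt(sigma_v2), size=D)    # area random effects

    u = np.zeros((D, T))                               # AR(1) area-time effects
    u[:, 0] = rng.normal(scale=np.sqrt(sigma_eps2 / (1 - rho**2)), size=D)
    for t in range(1, T):
        u[:, t] = rho * u[:, t - 1] + rng.normal(scale=np.sqrt(sigma_eps2), size=D)

    e = rng.normal(scale=np.sqrt(sigma_e2), size=(D, T))   # sampling errors
    y = beta * x + v[:, None] + u + e
    return y, x

y, x = simulate_rao_yu(D=30, T=8, rho=0.9)
print(y.shape)
```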
[Fig. 1: Mean squared error of the EBLUP under the Rao-Yu models plotted against the number of areas (10–100) for ρ = 0.2, 0.5, and 0.9]
4 Application
The Fourth Sustainable Development Goal (SDG) ensures equitable and inclusive,
high-quality education and opportunities for lifelong learning for all. Mean years of
schooling are included in the education indicators of the SDG. Furthermore, the mean
years of schooling also serve as an instrument to measure the Human Development
Index, providing data regarding the average level of education in a certain area. The
indicator defines the years of schooling as the duration of formal education attended
by individuals aged 25 years and older. The mean years of schooling are derived from
the National Socioeconomic Surveys (Susenas) data that is conducted periodically by
Statistics Indonesia. The mean years of schooling are examined for the purpose
of monitoring and evaluating the basic education sector. The Susenas is a twice-
yearly survey that is designed to provide direct estimates of variables at the national,
provincial, and district/city levels. The estimation of the mean years of schooling at
a smaller area level is necessary for the policies formulation.
This study aims to provide an estimation of the mean years of schooling at 23
sub-districts in Langkat Regency, North Sumatra, Indonesia, for the period of 2018–
2021. The distribution of the mean years of schooling is shown in Fig. 2. It shows
that the distribution of mean years of schooling fits a normal distribution.
For the correspond area-specific auxiliary variables, we use five variables, which
are population (X 1 ), harvested area of paddy (X 2 ), planting area of plantation crops
(X 3 ), number of industries (X 4 ), and educational facilities per 10,000 residents (X 5 ).
We apply logarithm transformation on variables X 1 , X 2 , X 3 , and X 4 to obtain a
suitable model. The regression parameter estimates and their corresponding standard
errors from model 1 and model 2 are presented in Table 2.
Table 2 shows that the models consist of fixed effects and variance component
estimates of the random effect. For the fixed effect, of the five auxiliary variables,
there were three variables significant for estimating mean years of schooling, and
they are population (X 1 ), harvested area of paddy (X 2 ), and number of industries
(X 4 ). Based on the standard errors of parameter estimates, model 2 obtained a lower
standard error than model 1. It can be concluded that model 2 is more efficient
compared to model 1.
Meanwhile, for the variance component that used for capturing heterogeneity,
model 1 produced an area random effect, while model 2 produced an area random
effect, area-time random effect, and autocorrelation coefficients. The variance of the
area random effect in model 1 is higher than in model 2. This means that model 1
is better at capturing heterogeneity over the area. Although there was a variation in
how they captured heterogeneity, both models produced EBLUP and mean square
error that were quite similar. On the other hand, model 2 shows that the variance of the area-time random effect (1.1125) was greater than the variance of the area random effect (0.3241), a component of variation that model 1 was unable to capture. The first-order
autocorrelation coefficients for model 2 were 0.3758. Although the correlation across
time of model 2 is low, this cannot be disregarded in the context of estimation.
According to the model parameters, as shown in Table 2, the small area model of
mean years of schooling is defined as follows:
Model 1
$$\hat{\theta}_d^{FH} = -0.628 + 1.107\log X_{1d} - 0.119\log X_{2d} - 0.431\log X_{4d} + \hat{v}_d \qquad (22)$$
Model 2
$$\hat{\theta}_{dt}^{RY} = -0.186 + 0.978\log X_{1dt} - 0.092\log X_{2dt} - 0.292\log X_{4dt} + \hat{v}_d + \hat{u}_{dt} \qquad (23)$$
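Expressed as code, the fitted linking models in Eqs. (22) and (23) can be evaluated for a given sub-district as in the sketch below; the input values and predicted random effects are placeholders, and the use of the natural logarithm is an assumption, since the base of the log transformation is not stated.

```python
# Illustrative sketch: evaluating the fitted small area models (22) and (23).
# The auxiliary values and random-effect predictions below are hypothetical;
# the natural logarithm is an assumption for the "log" in Eqs. (22)-(23).
import numpy as np

def model_1(x1, x2, x4, v_hat):
    # Eq. (22): Fay-Herriot model with log-transformed auxiliaries
    return -0.628 + 1.107 * np.log(x1) - 0.119 * np.log(x2) - 0.431 * np.log(x4) + v_hat

def model_2(x1, x2, x4, v_hat, u_hat):
    # Eq. (23): Rao-Yu model with area and area-time random effects
    return -0.186 + 0.978 * np.log(x1) - 0.092 * np.log(x2) - 0.292 * np.log(x4) + v_hat + u_hat

print(model_1(x1=1.0e5, x2=5.0e3, x4=40, v_hat=0.1))
print(model_2(x1=1.0e5, x2=5.0e3, x4=40, v_hat=0.1, u_hat=-0.05))
```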
Table 3 Comparison of the mean years of schooling estimator in Langkat Regency by sub-district,
2018–2021
Sub-district 2018 2019 2020 2021
Dir M1 M2 Dir M1 M2 Dir M1 M2 Dir M1 M2
1 Bahorok 7.91 7.97 7.92 7.85 7.94 7.85 6.49 7.05 6.68 7.26 7.50 7.33
2 Sirapit 8.71 8.40 8.65 7.30 7.28 7.30 6.15 6.78 6.74 7.12 7.15 7.11
3 Salapian 8.51 8.45 8.51 9.30 8.92 9.15 7.03 7.21 7.06 6.94 7.33 7.05
4 Kutambaru 8.16 8.11 8.16 9.11 8.83 9.06 6.71 7.13 6.99 6.13 6.48 6.18
5 Sei Bingai 7.96 7.99 7.97 9.40 9.05 9.31 6.08 6.53 6.23 7.58 7.64 7.58
6 Kuala 9.39 9.24 9.38 7.76 7.80 7.77 7.52 7.71 7.53 6.52 6.99 6.65
7 Selesai 9.67 9.57 9.66 9.62 9.55 9.62 8.34 8.42 8.34 6.97 7.30 7.02
8 Binjai 9.70 9.36 9.63 8.60 8.46 8.57 7.69 7.81 7.70 7.49 7.68 7.51
9 Stabat 9.39 9.31 9.38 8.56 8.57 8.56 8.78 8.76 8.77 8.89 8.84 8.88
10 Wampu 9.15 8.95 9.12 7.25 7.40 7.28 7.82 7.88 7.80 6.78 7.14 6.87
11 Batang Serangan 9.26 8.75 8.93 7.90 7.91 7.90 6.55 7.05 6.76 9.20 8.61 8.75
12 Sawit Seberang 8.89 8.84 8.89 10.10 9.75 9.98 6.44 6.92 6.54 7.08 7.42 7.13
13 Padang Tualang 9.06 8.93 9.05 8.90 8.66 8.85 8.43 8.37 8.38 6.74 7.13 6.83
14 Hinai 8.46 8.35 8.45 8.82 8.49 8.73 7.87 7.85 7.87 7.33 7.38 7.33
15 Secanggang 9.27 9.15 9.26 7.55 7.56 7.55 7.31 7.42 7.32 6.89 7.03 6.91
16 Tanjung Pura 9.47 9.31 9.46 9.57 9.47 9.57 7.78 7.84 7.79 7.46 7.60 7.47
17 Gebang 7.58 7.72 7.66 8.90 8.81 8.89 6.47 6.94 6.78 8.53 8.32 8.45
18 Babalan 9.60 9.47 9.59 9.00 8.89 8.99 8.33 8.30 8.32 6.56 7.13 6.75
19 Sei Lepan 8.40 8.42 8.41 8.43 8.39 8.42 6.80 7.24 6.96 7.16 7.34 7.19
20 Brandan Barat 8.43 8.06 8.28 6.33 6.56 6.41 6.50 6.69 6.53 7.39 7.34 7.28
21 Besitang 8.75 8.66 8.74 6.53 6.88 6.66 6.38 6.83 6.51 6.86 7.07 6.90
22 Pangkalan Susu 9.79 9.45 9.72 8.23 8.16 8.23 8.32 8.20 8.30 7.68 7.66 7.67
23 Pematang Jaya 8.79 8.79 8.79 7.50 7.50 7.50 7.69 7.36 7.27 5.51 5.88 5.66
The observed range of direct estimates of mean years of schooling in Langkat was 5.51 to 10.10 years for the period 2018–2021. The highest value (10.10 years) corresponds to a population that, on average, attained formal education up to the upper secondary level, while the lowest value (5.51 years) corresponds to a population that, on average, had not completed elementary education. For model 1, the range of estimates was 5.88 to 9.75 years, and for model 2 it was 5.66 to 9.98 years. The results show that the ranges of estimates obtained from models 1 and 2 were narrower than those of the direct estimates.
The best model selection for estimating mean years of schooling can be determined
by evaluating the mean square error. Figure 3 presents a comparison between the
mean square error obtained from direct estimation and the EBLUP from model 1
and model 2. The best model is characterized by the lowest mean square error value.
The results indicate that the mean square error of the EBLUP under model 2 was smaller than that of the direct estimate and of the EBLUP under model 1: the average mean square errors were 0.01766 for model 2, 0.01976 for model 1, and 0.02509 for the direct estimate. This shows that the EBLUP under model 2 provides higher accuracy in parameter estimation due to its lower mean square error.
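As a small numerical restatement of this comparison, the reported average mean square errors can be compared directly; the values below are exactly those given in the text.

```python
# Average MSE values reported above for each estimator
avg_mse = {"Direct": 0.02509, "Model 1 (FH)": 0.01976, "Model 2 (RY)": 0.01766}

best = min(avg_mse, key=avg_mse.get)   # estimator with the lowest average MSE
print(f"Best estimator by average MSE: {best} ({avg_mse[best]})")
```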
Fig. 3 Comparison of the mean square error (MSE) of the direct estimator, model 1, and model 2 by sub-district and year, 2018–2021
6 Conclusions
The application of the small area estimation method can effectively optimize sample size efficiency, and this leads to a decrease in the standard error. To enhance the accuracy of the area-level model, we proposed time series and cross-sectional models that borrow strength over time. This can be achieved by integrating an area-time random
effects component into a linear time model at the area level. The selection of suit-
able linking models and the presence of sufficient auxiliary variables are important
elements in acquiring indirect estimators. The simulation results show that the degree
of correlation parameters produced from time series and cross-sectional models has
an important impact on the effectiveness of the models.
The restricted maximum likelihood (REML) method, which was applied to derive the empirical best linear unbiased predictions for small areas and their mean squared errors, is more reliable and provides an unbiased estimator. The estimation using the
time series and cross-sectional models can obtain a lower mean square error than
the direct estimation and Fay-Herriot models. This indicates that the time series and
cross-sectional models are more effective in estimating mean years of schooling at
a small area level.
In future work, we will develop multivariate time series and cross-sectional models for small area estimation. Variables obtained from repeated surveys in real applications may be correlated. Thus, considering the correlation
between variables of interest, it is possible to significantly enhance the accuracy of
parameter estimates in small area estimation by using multivariate time series and
cross-sectional models.
Acknowledgements The main author of this research article would like to thank BPS and ITS
for providing her with the opportunity to study in the Statistics Department of ITS. Additionally,
the authors extend their gratitude to all the entities that have made significant contributions to the
successful completion of this research.
References
1. Ghosh M, Rao JNK (1994) Small area estimation: an appraisal. Stat Sci 9:55–93
2. Rao JNK (1999) Some recent advances in model-based small area estimation. Surv Methodol
25:175–186
3. Rao JNK (2003) Small area estimation. Wiley
4. Rao JNK, Molina I (2015) Small area estimation, 2nd edn. Wiley, New York
5. Jiang J, Lahiri P (2006) Mixed model prediction and small area estimation. TEST 15:1–96
6. Pfeffermann D (2002) Small area estimation—new developments and directions. Int Stat Rev
70(1):125–143
7. Fay RE, Herriot RA (1979) Estimates of income for small places: an application of James-Stein
procedures to census data. J Am Stat Assoc 74:269–277
8. Rao JNK, Yu M (1992) Small area estimation combining time series and cross sectional data.
In: Proceedings of the joint statistical meetings, pp 1–9
9. Rao JNK, Yu M (1994) Small area estimation by combining time series and cross sectional
data. Can J Stat 22:511–528
10. Esteban MD, Morales D, Pérez A, Santamaría L (2012) Small area estimation of poverty
proportions under area-level time models. Comput Stat Data Anal 56(10):2840–2855
11. Fay RE, Planty M, Diallo MS (2013) Small area estimation alternatives for the National crime
victimization survey. In: Proceedings of the joint statistical meetings, pp 1544–1557
12. Molina I, Rao JNK, Datta GS (2015) Small area estimation under a Fay-Herriot model with
preliminary testing for the presence of random area effects. Surv Methodol 41(1):1–19
13. Henderson CR (1953) Estimation of variance and covariance components. Biometrics 9:226–
252
14. Prasad NGN, Rao JNK (1990) The estimation of the mean squared error of small-area
estimators. J Am Stat Assoc 85:163–171
15. Das K, Jiang J, Rao JNK (2004) Mean squared error of empirical predictor. Ann Stat 32:818–840
Probabilistic Seismic Hazard Analysis
for Sulawesi-Maluku Region of Indonesia
Using the Space–Time Epidemic-Type
Aftershock Sequence Model
Abstract Sulawesi and Maluku are the only regions in Indonesia where three major
plates (Eurasian, Indo-Australian and Pacific) meet, resulting in the most seismi-
cally active region in Indonesia. It is important to perform probabilistic seismic
hazard analysis to understand the seismic activity in this region, study aftershock
sequences, and provide comprehensive seismic hazard maps. In this study, we
consider spatiotemporal analysis and fit a space–time epidemic-type aftershock sequence (space–time ETAS) model to earthquake catalog data in Sulawesi-Maluku, Indonesia. The method combines the Davidon-Fletcher-Powell (DFP) algo-
rithm and stochastic declustering algorithm. We conclude that the model has good
performance and can be used to construct seismic hazard maps based on the generated
model. The results of the seismic map provide information on the total rate of earth-
quake events occurring around the Maluku Sea, with the clustering of earthquake
events spread almost throughout Sulawesi and Maluku.
1 Introduction
The collision of one plate with another can cause natural disasters such as earth-
quakes. In Indonesia, earthquakes occur in almost all regions because most of
Indonesia is located in the Pacific Ring of Fire. There are seven areas in Indonesia
that are suspected to be subduction zones, including the Maluku Sea area bordering
Sulawesi [1, 2]. Earthquakes in the Sulawesi and Maluku regions are caused by tectonic plate movements: both regions lie where three major plates, namely the Eurasian, Indo-Australian, and Pacific plates, meet (see Fig. 1), so the two regions are highly seismically active. Therefore, it is important to study the seismic activity in the Sulawesi-Maluku region, especially using earthquake catalog data that include coordinate location, occurrence time, and magnitude.
In Indonesia, statistical models based on point processes have been considered, but the earthquake catalog components have so far been analyzed separately. Examples include point processes that consider only the coordinate location, such as the log-Gaussian Cox point process (LGCP) [3–5] and the Neyman–Scott Cox point process (NSCP) [5–8], as well as point processes that consider only the time of earthquake occurrence, such as ETAS [9–11]. However, no study has considered the complete earthquake catalog.
Regarding the complex model, Ogata [12] has incorporated a complete catalog
of earthquake events (times, locations, magnitudes) using the space–time ETAS
model. The space–time ETAS model is now widely used for seismicity analysis
to explain the probability of main and aftershocks and to detect changes in seis-
micity [13]. The space–time ETAS model covers the behavior of earthquake events
as a function of time, location (latitude and longitude), and magnitude, given histor-
ical earthquake data [14]. The parameters are estimated using maximum likelihood
estimation (MLE) by applying stochastic declustering iterations using the Davidon-
Fletcher-Powell (DFP) method [15]. The results are combined with the peak ground
acceleration (PGA) information to determine how much seismicity is present in the
region. The space–time ETAS model has been considered for earthquake modeling
in, for example, Japan [12, 14, 16], USA [17], China [13], Taiwan [18], and Iran [15,
19]. To our knowledge, such a model has not been applied for earthquake modeling
in the Sulawesi-Maluku region.
In this study, we implement the space–time ETAS model for probabilistic seismic
hazard analysis in Sulawesi-Maluku by considering catalog data including time
occurrences, coordinate locations, and magnitudes. The model is evaluated using
residual analysis and diagnostic plot. The resulting model is used as the basis for
developing the seismic hazard map in Sulawesi-Maluku.
2 Methodology
2.1 Dataset
This study uses earthquake catalog data consisting of time, longitude, latitude,
and magnitude. The data were obtained from the United States Geological Survey
(USGS) (https://earthquake.usgs.gov/earthquakes/search/) from January 1, 2008, to
July 31, 2023, resulting in 4842 earthquake events. The variables are listed in Table 1.
The map of earthquake locations in the Sulawesi region is located at latitude and
longitude in the range of − 6.5° S to 3° N and 118° to 126° E, while the map of
earthquake locations in the Maluku region is located at the latitude and longitude in
the range of − 3° S to 6° N and 124° to 132° E [1]. The map is shown in Fig. 1.
The space–time ETAS model is a model that can be used to estimate the occurrence
of aftershocks in a given area using time, location (latitude and longitude), and
magnitude variables that depend on historical earthquake data [14]. The conditional
intensity function of the space–time ETAS model can be defined as in Eq. (1),

$\lambda_{\beta,\theta}(t, x, y, m \mid H_t) = v_{\beta}(m)\, \lambda_{\theta}(t, x, y \mid H_t), \quad (1)$

where $v_{\beta}(m) = \beta \exp[-\beta(m - m_0)]$, $m \ge m_0$, is the magnitude probability density function (Eq. (2)), and

$\lambda_{\theta}(t, x, y \mid H_t) = \tilde{u}(x, y) + \sum_{i:\, t_i < t} \kappa_{A,\alpha}(m_i)\, g_{c,p}(t - t_i)\, f_{D,\gamma,q}(x - x_i, y - y_i; m_i), \quad (3)$
where Ht is the earthquake occurrence until the time before t and Eq. (3) can be
explained as follows:
$\tilde{u}(x, y) = \mu u(x, y)$ represents the background seismicity rate, which can also be referred to as the mainshock (base) rate [1]; it is assumed to be stationary in time, with spatial function $u(x, y)$ and $\mu > 0$.
κA,α (m i ) = A exp[α(m i − m 0 )] is the contribution to the seismicity rate of the ith
event that has occurred previously [16] or the expected number of events triggered by
aftershocks resulting from an event of (magnitude) m i [15]. κA,α (m i ) can be referred
to as the probability of occurrence of an aftershock of magnitude m i , where m i ≥ m 0
while A > 0 and α > 0 are unknown parameters.
gc, p (t − ti ) is the probability density function of an earthquake occurring at time
(t) that previously occurred at time (ti ), where c > 0 and p > 1 are unknown
parameters. Based on the modified Omori law, gc, p (t − ti ) can be described by
Eq. (4).
$g_{c,p}(t - t_i) = \begin{cases} \dfrac{p-1}{c}\left(1 + \dfrac{t - t_i}{c}\right)^{-p}, & t - t_i > 0 \\ 0, & t - t_i \le 0 \end{cases} \quad (4)$
$f_{D,\gamma,q}(x - x_i, y - y_i; m_i) = \dfrac{q-1}{\pi D \exp[\gamma(m_i - m_0)]}\left(1 + \dfrac{(x - x_i)^2 + (y - y_i)^2}{D \exp[\gamma(m_i - m_0)]}\right)^{-q}. \quad (5)$
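To make Eqs. (3)–(5) concrete, the following minimal Python/NumPy sketch evaluates the triggering components κ, g, and f and sums their product over past events to obtain the conditional intensity at a given space–time point. The parameter values, the constant background rate, and the tiny synthetic catalog are illustrative assumptions only, not the fitted values reported later in this chapter.

```python
import numpy as np

def kappa(m, A, alpha, m0):
    """Expected number of events triggered by an event of magnitude m (kappa term in Eq. 3)."""
    return A * np.exp(alpha * (m - m0))

def g(dt, c, p):
    """Modified Omori temporal density, Eq. (4); zero for non-positive time lags."""
    dt = np.asarray(dt, dtype=float)
    pos = np.maximum(dt, 0.0)
    return np.where(dt > 0, (p - 1.0) / c * (1.0 + pos / c) ** (-p), 0.0)

def f(dx, dy, m, D, gamma, q, m0):
    """Spatial triggering density, Eq. (5)."""
    s = D * np.exp(gamma * (m - m0))
    r2 = dx ** 2 + dy ** 2
    return (q - 1.0) / (np.pi * s) * (1.0 + r2 / s) ** (-q)

def conditional_intensity(t, x, y, catalog, u_bg, theta, m0):
    """lambda_theta(t, x, y | H_t) from Eq. (3), summing over events with t_i < t."""
    A, c, alpha, p, D, q, gamma = theta
    ti, xi, yi, mi = (catalog[k] for k in ("t", "x", "y", "m"))
    past = ti < t
    trig = (kappa(mi[past], A, alpha, m0)
            * g(t - ti[past], c, p)
            * f(x - xi[past], y - yi[past], mi[past], D, gamma, q, m0))
    return u_bg + trig.sum()

# Illustrative parameters (A, c, alpha, p, D, q, gamma) and a tiny synthetic catalog
theta = (0.4, 0.01, 0.6, 1.2, 0.01, 2.0, 0.5)
catalog = {"t": np.array([1.0, 3.5]), "x": np.array([125.0, 126.0]),
           "y": np.array([1.0, 2.0]), "m": np.array([5.0, 6.2])}
print(conditional_intensity(5.0, 125.5, 1.5, catalog, u_bg=0.05, theta=theta, m0=4.5))
```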
The log-likelihood function of the model is given by Eq. (6),

$l(\beta, \theta \mid H_T) = \sum_{i=1}^{N} \delta_i \log \lambda_{\beta,\theta}(t_i, x_i, y_i, m_i \mid H_{t_i}) - \int_{m_0}^{\infty}\int_{t_{\text{start}}}^{t_{\text{start}}+T}\iint_{S} \lambda_{\beta,\theta}(t, x, y, m \mid H_t)\, \mathrm{d}x\, \mathrm{d}y\, \mathrm{d}t\, \mathrm{d}m, \quad (6)$
where $\delta_i = 1$ if event i is a target event and $\delta_i = 0$ if event i is a complementary (initial) event. In Eq. (1), the parameters β and θ can be separated because $\int_{m_0}^{\infty} v_{\beta}(m)\, \mathrm{d}m = 1$. Thus, the log-likelihood function can be written as Eq. (7), $l(\beta, \theta \mid H_T) = l_1(\beta \mid H_T) + l_2(\theta \mid H_T)$.
$l_1(\beta \mid H_T) = \sum_{i=1}^{N} \delta_i \log v_{\beta}(m_i). \quad (8)$
The estimate of β is obtained using the MLE method by maximizing $l_1(\beta \mid H_T)$. The component $l_2(\theta \mid H_T)$ can be described as in Eq. (9):
$l_2(\theta \mid H_T) = \sum_{i=1}^{N} \delta_i \log \lambda_{\theta}(t_i, x_i, y_i \mid H_{t_i}) - \int_{t_{\text{start}}}^{t_{\text{start}}+T}\iint_{S} \lambda_{\theta}(t, x, y \mid H_t)\, \mathrm{d}x\, \mathrm{d}y\, \mathrm{d}t, \quad (9)$
where

$p_{ij} = \begin{cases} \dfrac{\kappa_{A,\alpha}(m_i)\, g_{c,p}(t_j - t_i)\, f_{D,\gamma,q}(x_j - x_i, y_j - y_i; m_i)}{\lambda_{\theta}(t_j, x_j, y_j \mid H_{t_j})}, & t_j > t_i, \\ 0, & t_j \le t_i. \end{cases} \quad (11)$
The variable bandwidth kernel estimator and a better estimate for u(x, y) can be
defined as Eq. (12).
$\hat{u}(x, y) = \frac{1}{T}\sum_{j=1}^{N} \left(1 - \hat{p}_j\right) \varphi\!\left(x - x_j, y - y_j; h_j\right), \quad (12)$
where
$h_j = \max\!\left(h_{\min}, r_{j, n_p}\right),$
where h min is the minimum threshold bandwidth value and r j, n p is the distance
between the jth event location and its nearest neighbor n p [15]. If the initial estimate
u(x, y) is replaced by a new estimate from Eq. (12), then the estimates θ and u(x, y)
are repeated until the estimates converge. Thus, the total spatial intensity function
can be estimated as described in Eq. (13).
$\hat{\Lambda}(x, y) = \frac{1}{T}\sum_{j=1}^{N} \varphi\!\left(x - x_j, y - y_j; h_j\right). \quad (13)$
Thus, Eq. (13) together with $\hat{u}(x, y)$ gives $\hat{\omega}(x, y) = 1 - \hat{u}(x, y)/\hat{\Lambda}(x, y)$ as an estimate of the clustering coefficient [15].
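The kernel estimates in Eqs. (12) and (13) can be sketched as follows. The Gaussian kernel ϕ, the bandwidths h_j, and the declustering probabilities p̂_j are taken here as given illustrative inputs (in practice p̂_j is produced by the stochastic declustering iteration), so this is a sketch of the computation rather than the chapter's fitted surfaces.

```python
import numpy as np

def gauss2d(dx, dy, h):
    """Isotropic 2-D Gaussian kernel with bandwidth h."""
    return np.exp(-(dx**2 + dy**2) / (2.0 * h**2)) / (2.0 * np.pi * h**2)

def background_rate(x, y, xs, ys, h, p_trig, T):
    """u-hat(x, y) from Eq. (12): events weighted by their background probability 1 - p_j."""
    return np.sum((1.0 - p_trig) * gauss2d(x - xs, y - ys, h)) / T

def total_spatial_intensity(x, y, xs, ys, h, T):
    """Total spatial intensity estimate from Eq. (13)."""
    return np.sum(gauss2d(x - xs, y - ys, h)) / T

# Illustrative inputs: event locations, bandwidths, triggering probabilities, time span
xs = np.array([125.0, 125.4, 126.1]); ys = np.array([1.0, 1.2, 2.0])
h  = np.array([0.3, 0.2, 0.4]);       p_trig = np.array([0.1, 0.7, 0.4])
T  = 10.0
u   = background_rate(125.5, 1.1, xs, ys, h, p_trig, T)
lam = total_spatial_intensity(125.5, 1.1, xs, ys, h, T)
print(u, lam, 1.0 - u / lam)   # last value: clustering coefficient omega-hat at this point
```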
The algorithm using the DFP method and the stochastic declustering method can
be described in Tables 2 and 3.
The space–time ETAS model requires residual analysis and diagnostic plots to
assess model fit and visualization and ease of computation. Residual analysis in
this model calculates only temporal and spatial residuals. If the model used is
capable of describing temporal and spatial variations in the data, the expected
results are $R(I \times B; h) \approx 0$, $R_{\text{temp}}(I; h) \approx 0$, and $R_{\text{spat}}(B_i; h) \approx 0$ for each $I \subset [t_{\text{start}}, t_{\text{start}} + T]$ and $B \subset S$ [15]. For the diagnostic plot in this model, the transformation time $\tau_i$ is plotted against the i-th event. A significant deviation from the straight line y = x indicates a weakness of the model. The
deviation occurs because there are some characteristics of the data that cannot be
accounted for by the model [21].
To examine the occurrence of earthquakes, a map of the region is used. Since this study uses data from the Sulawesi-Maluku earthquake catalog, the catalog is mapped in Fig. 2.
Figure 2 presents the map of seismicity at latitude and longitude coordinates around the range of −6.5° S to 3° N and 118° to 126° E that occurred between January
1, 2008, and July 29, 2023. Earthquake events tend to occur along the Maluku and
North Sulawesi Seas and around Central Sulawesi, specifically around the city of
Palu (see Fig. 1). Figure 2b, c shows that there are earthquake events around the
coordinates of latitude 1° S to 2° N and around the coordinates of longitude 125° to
130° E. Most of the earthquake events occur around the Maluku and North Sulawesi
Fig. 2 Visualization of seismicity points with depth ≤ 70 km and magnitude ≥ 4.5 from 2008
to 2023: a Seismicity map in Sulawesi and Maluku regions, b relationship of latitude coordinate
point with earthquake occurrence time, c relationship of longitude coordinate point with earthquake
occurrence time, d relationship of magnitude with earthquake occurrence time, e relationship of
number of earthquake occurrences at time (time stationarity) with earthquake occurrence time,
f relationship of predicted value of log10 N mag (predicted value of Gutenberg Richter Law) based
on actual earthquake data with magnitude
Seas, in the range of roughly the 1000th to 5000th events of the catalog, with considerable magnitudes ranging from 7 to 7.7. The largest earthquake, with a magnitude of 7.7, occurred as the 256th event on January 3, 2009, and a magnitude-7 earthquake occurred as the 4622nd event on January 18, 2023, as can be seen in Fig. 2d. In Fig. 2f, the small black circles show the magnitude against the predicted value of log10 Nmag (the value predicted by the Gutenberg-Richter law) based on the actual earthquake data, lying parallel to the linear (blue dashed) line. The conclusion that can be drawn is that the greater the magnitude, the fewer earthquakes occur in Sulawesi-Maluku. Meanwhile, Fig. 2e shows a plot of the cumulative number of events against time, marked by a blue dashed line, which follows a linear line parallel to the plot of the number of earthquake events at time (t), marked by a black line; it can therefore be said that the background seismicity rate satisfies stationarity in time.
Table 4 Parameter estimation for the space–time ETAS model
Parameters Estimation StdErr Conclusion
β 3.287 0.003 Significant
μ 0.669 0.007 Significant
A 0.406 0.017 Significant
c 0.011 0.042 Not significant
α 0.649 0.035 Significant
p 1.160 0.004 Significant
D 0.011 0.036 Not significant
q 2.225 0.013 Significant
γ 0 3.044 × 10^8 Not significant
The parameter $\hat{\mu}$ indicates a background rate of earthquake events in the Sulawesi and Maluku regions of 0.669 shocks/day, and the parameter $\hat{A}$ indicates a triggering intensity for the whole catalog of 0.406 events/(deg² × day), while the parameter $\hat{D}$ explains the influence of the coordinate location of the mainshock event by 0.011 deg². The parameter $\hat{\alpha}$ clarifies the influence of the magnitude of the background event that triggers the occurrence of aftershocks, at 0.649 magnitude⁻¹, and the parameter $\hat{\gamma} = 0$ indicates that the spatial triggering effect is independent of magnitude; this means that the magnitude is far more influential than the coordinate location of the aftershock occurrences. The value of the parameter $\hat{p} > 1$ means that the number of aftershocks decreases within a short period of time, and the parameter $\hat{c} = 0.011$ day characterizes the time scale of the seismic activity in Sulawesi-Maluku caused by the shaking of the earth's crust. The value of the parameter $\hat{q}$ is 2.225; the larger the value of $\hat{q}$, the faster the decrease in the number of aftershocks after the main earthquake occurred. These results are sufficient to show that the space–time ETAS model gives good results on the earthquake data in Sulawesi-Maluku.
Fig. 3 Visualization of seismicity map prediction from a background seismicity rate, b total spatial
intensity, c clustering coefficient, and d conditional intensity
The results of the background seismicity rate estimation in Fig. 3a gave better
results because it obtained fewer earthquakes than the total spatial intensity esti-
mation in Fig. 3b. Therefore, areas with high seismicity rates indicate that not all
earthquakes occur in the interior, but when a main earthquake occurs, it will be
followed by aftershocks with a high probability of earthquake occurrence. The esti-
mated probability of earthquake occurrence can be seen from the background and
triggered events in Fig. 4.
Figure 4 shows that the detection of earthquake probabilities in the Sulawesi-
Maluku region extends almost along the Maluku Sea, where triggered or aftershocks
(subsequent earthquakes) have a relatively high probability in the region. This is
indicated by the red plus sign symbol which is clustered quite high along the Maluku
Sea, almost comparable to the background or main earthquake indicated by the black
small circle symbol which has a high probability of earthquake occurrence along the
Maluku Sea at latitude coordinate points around 0° to 6° N and longitude coordinate
points around 125° to 130° E.
To evaluate the quality of the model, residual analysis and diagnostic plots of the space–time ETAS model are examined. In the temporal residual analysis, there is a large deviation in the residuals where the estimated intensity function has its highest peak. Figure 5a, b shows the earthquake events that occurred around the Maluku Sea with a large PGA value of about 0.5 g at coordinates between 0° and 4° N and about 125° and 130° E. The spatial residual analysis therefore captures a large number of earthquake events in Sulawesi-Maluku around the Maluku Sea
Fig. 4 Visualization of the estimated probability declustering the actual data of real earthquake
events and their triggers in Sulawesi-Maluku from 2008 to 2023
and obtains quite complex results. This is also supported by the results of plotting the
transformation time τi against the cumulative number of i-th earthquake in Fig. 5c
where the plot follows the red line (theoretical cumulative earthquake events against
time change). Similar results are obtained in the Q-Q plot of $U_i$ in Fig. 5d, where the theoretical and observed quantile points are aligned along the red (linear) line, so it is concluded that the earthquake data in Sulawesi-Maluku fulfill the assumption
of U (0, 1) distribution. Thus, from the results of residual analysis and diagnostic
plots, the space–time ETAS model obtained good model quality, although there was
an indication of a deviation in data characteristics.
Fig. 5 Visualization of residual plots and diagnostic plots of the space–time ETAS model: a temporal residuals, b spatial residuals, c time plots of the transformation of $\tau_i$ versus i, d Q-Q plots of $U_i$
4 Conclusions
In this study, we considered the space–time ETAS model to perform probabilistic seismic hazard analysis for Sulawesi-Maluku. We applied the Davidon-Fletcher-Powell (DFP) and stochastic declustering algorithms to estimate the parameters. The parameter estimates indicate a background rate of earthquake events in the Sulawesi and Maluku regions of 0.669 shocks/day, with a triggering intensity for the total earthquake catalog of 0.406 events/(deg² × day). The influence of the coordinate location of the mainshock event is 0.011 deg², and the influence of the magnitude of the background event that triggered the aftershocks is 0.649 magnitude⁻¹, where the magnitude is far more influential than the coordinate location of the aftershock occurrences. Meanwhile, the number of aftershocks in Sulawesi-Maluku decreases within a short period of time, and the time scale of the seismic activity caused by the shaking of the earth's crust is about 0.011 day. Based on the results of residual analysis and diagnostic plots, the
space–time ETAS model obtained good model quality. These results can be used to
provide more accurate estimates of earthquake events. The estimation of earthquake
events from the seismic hazard map results provides information on the total rate of
earthquake events around the Maluku Sea, with clusters of earthquake events spread
almost throughout Sulawesi and Maluku. The modeling in this study can be used
by the government and local communities to predict the occurrence of earthquakes,
especially in the Sulawesi and Maluku regions. For future study, the space–time
ETAS model can be extended by including the effect of geological variables such
as subduction and faulting. In the spatial case, this direction has been studied by
Choiruddin et al. [3, 8].
Acknowledgements The authors gratefully acknowledge financial support from the Institut
Teknologi Sepuluh Nopember for this work, under the project scheme of the Publication
Writing and IPR Incentive Program (PPHKI) 2024. We also thank the four reviewers for their
constructive comments.
References
1. Tim Pusat Studi Gempa Nasional, Peta Sumber dan Bahaya Gempa Indonesia Tahun 2017,
Bandung: Pusat Penelitian dan Pengembangan Perumahan dan Pemukiman Badan Penelitian
dan Pengembangan Kementrian Pekerjaan Umum dan Perumahan Rakyat (2017)
2. Otok BW, Soeparno K, Rahmawati I, Azies HA, Isnawati (2021) Structural equation modeling
on decision making in understanding disasters in Maluku. In: AIP conference proceedings.
Indonesia, pp 020003-1–020003-8
3. Choiruddin A, Susanto TY, Husain A, Kartikasari YM (2023) kppmenet: combining the kppm
and elastic net regularization for inhomogeneous Cox point process with correlated covariates.
J Appl Stat 1–14
4. Sakdiyah K, Choiruddin A (2021) Model inhomogeneous log-Gaussian cox process (LGCP)
untuk Pemetaan Risiko Gempa Bumi di Sumatra. Jurnal Sains dan Seni ITS 9(2):D108–D114
5. Choiruddin A, Trisnisa AF, Iriawan N (2021) Quantifying the effect of geological factors on
distribution of earthquake occurrences by inhomogeneous cox processes. Pure Appl Geophys
1579–1592. Springer Nature Switzerland AG
6. Susanto TY, Choiruddin A, Purnomo JDT (2023) On the earthquake distribution modeling in
Sumatra by Cauchy cluster process: comparing log-linear and log-additive intensity models.
Sains Malaysiana 52(2):655–667
7. Choiruddin A, Susanto TY, Metrikasari R (2021) Two-Step estimation for modeling the
earthquake occurrences in Sumatra by Neyman–Scott cox point processes. In: International
conference on soft computing in data science. Springer, Singapore, pp 146–159
8. Mukhti TO, Choiruddin A, Purhadi (2021) Generalized additive poisson models for quantifying
geological factors effect on the earthquake risk mapping. In: ICMSDS 2020. IPB University,
pp 1–11
9. Zhuang J, Ogata Y, Wang T (2017) Data completeness of the Kumamoto earthquake sequence
in the JMA catalog and its influence on the estimation of the ETAS parameters. Earth Planets
Space 1–12
10. Bansal AR, Dimri VP, Babu KK (2013) Epidemic type aftershock sequence (ETAS) modeling
of northeastern Himalayan seismicity. J Seismol 255–264
11. Kumazawa T, Ogata Y (2013) Quantitative description of induced seismic activity before
and after the 2011 Tohoku-Oki earthquake by nonstationary ETAS models. J Geophys Res
6165–6182
12. Ogata Y (1998) Space-time point-process models for earthquake occurrences. Ann Inst Stat
Math 50(2):397–402
13. Guo Y, Zhuang J, Zhou S (2015) An improved space-time ETAS model for inverting the rupture
geometry from seismicity triggering. J Geophys Res Solid Earth 120:3309–3323
14. Ogata Y, Zhuang J (2006) Space–time ETAS models and an improved extension. Tectono-
physics 413:13–23
15. Jalilian A (2019) ETAS: an R package for fitting the space-time ETAS model to earthquake
data. J Stat Softw 88:1–39
16. Zhuang J (2012) Long-term earthquake forecasts based on the epidemic-type aftershock
sequence (ETAS) model for short-term clustering. Res Geophys 2(e8):52–57
17. Guo Y, Zhuang J, Zhou S (2015) A hypocentral version of the space–time ETAS model.
Geophys J Int 203:366–372
18. Zhuang J, Chang CP, Ogata Y, Chen YI (2005) A study on the background and clustering
seismicity in the Taiwan region by using point process models. J Geophys Res 110(B05S18):1–
12
19. Davoudi N, Tavakoli HR, Zare M, Jalilian A (2020) Aftershock probabilistic seismic hazard
analysis for Bushehr province in Iran using ETAS model. Nat Hazards 100:1159–1170
20. Tim Revisi Peta Gempa Indonesia 2010, Peta Hazard Gempa Indonesia 2010 sebagai
Acuan Dasar Perencnaan dan Perancangan Infrastruktur Tahan Gempa, Jakarta: Kementerian
Pekerjaan Umum (2010)
21. Nicolis O, Chiodi M, Adelfio G (2015) Windowed ETAS models with application to the Chilean
seismic catalogs. Spatial Statistics 14:151–165
Application of Time Series Regression,
Double Seasonal ARIMA, and Long
Short-Term Memory for Short-Term
Electricity Load Forecasting
1 Introduction
Electricity has become a primary need of society. As a vital source of power, electricity can significantly improve social welfare in fields such as the economy, education, health, industry, housing, offices, and revenue. In 2020, 2021,
and 2022, the total electricity load in East Java was approximately 37.87 TWh,
40.68 TWh, and 43.72 TWh, respectively [1]. The increase in electricity load was
driven by economic recovery and increased business activities in the post-COVID-19
pandemic. PT Perusahaan Listrik Negara (PLN) Persero is Indonesia’s main elec-
tricity provider which distributes electricity to customers. One of the challenges in
distributing electricity is the fluctuation of electricity load over time which is related
to customer’s electricity consumption habits. For example, we create a time series
plot as in Fig. 1 that shows electricity load patterns from January 3rd to January 16th ,
2020. The figure shows that customer’s electricity consumption patterns are similar
on a daily basis. It shows that during the day there is a peak load, while at night the
load is lower. Moreover, on Saturdays and Sundays there are significant decreases
in electricity consumption compared to weekdays. With this in mind, electricity load must be estimated accurately because electricity cannot be stored. If more electricity is generated than demanded, the excess is wasted and the generator power should be lowered. Meanwhile, if less electricity is generated than demanded, it may cause power outages and require backup-plant operation. Therefore, forecasting methods
are recommended in order to obtain accurate load forecasts.
Electricity load data have a double seasonal pattern on a daily and weekly basis [2]. For this reason, we apply forecasting methods that can capture the double seasonal pattern. The first is time series regression (TSR), a practical forecasting method that is suitable for data with low variance at each aggregation time. Since customers' daily electricity consumption patterns are similar and result in low variance at each aggregation time, this method is worth using. Its practicality is similar to that of ordinary linear regression. However, TSR rarely yields non-autocorrelated residuals because it is less effective in capturing the double seasonal lags, although this can be handled with the double seasonal ARIMA method. Few studies have used TSR for electricity load forecasting. A study on input selection for TSR using random forest and stepwise methods was conducted by Sintyaningrum [3], whereas other studies [4, 5] use TSR in a hybrid model. Hence, this method needs further evaluation as a non-hybrid method.
Autoregressive Integrated Moving Average (ARIMA) is a conventional forecasting method that requires data stationarity in mean and variance to identify the lag orders.
ARIMA estimates parameters in the form of lag orders so that forecasting can be
compatible with the actual data over time. Double seasonal ARIMA (DSARIMA)
as the second method applied in the present study is the development of ARIMA
designed for forecasting data with double seasonal pattern. This method estimates
parameters in the form of double seasonal orders. It has been applied in many elec-
tricity load forecasting studies, such as Mohamed et al. [6], Mado et al. [7], Khusna
and Suhartono [8], and many more.
Neural network (NN) is a type of machine learning where the algorithm in
processing data is inspired by the human brain. This type gives “machine” the ability
to capture patterns and to solve various problems. This study's third method, long short-term memory (LSTM), derives from NN. This method is capable of saving the output of the previous cell [9], which gives it a powerful ability to handle long time-lag dependencies and thus to capture double seasonal patterns. In addition, its concise procedure and absence of distributional assumptions make this method commonly used in forecasting studies, without
exception in electricity load forecasting, such as Bouktif et al. [10], Bai et al. [11],
Abbasimehr et al. [12], and many more.
This is a fresh study that, for the first time, compares practical, conventional, and machine learning forecasting methods for electricity load forecasting. The selected models are those that have proven effective at forecasting electricity loads for each method. In addition, several scenarios and treatments are undertaken in this study in order to obtain more accurate forecasts. Therefore, this
study aims to provide the best model from time series regression, double seasonal
ARIMA, and long short-term memory to forecast electricity load in East Java.
Ultimately, the best overall model is obtained.
2 Methodology
The observation data in this study are secondary data provided by PT PLN Unit
Pelayanan Pengatur Beban (UP2B) Jawa Timur [1]. The variable is half-hourly
recorded electricity load of the East Java Province from the period of January 1st ,
2020, to January 31st , 2023, in Mega Watt (MW) units with a total of 54,096 time
series data. Short-term electricity load forecasting is for periods of several hours to
1 week to gain accurate forecasts. Hence, the data were divided into two, in-samples
and out-samples, respectively, from January 1st , 2020, to January 24th , 2023 (53,760
time series data) and January 25th , 2023, to January 31st , 2023 (336 time series data).
Time series regression is a forecasting method that uses regression analysis to identify the effect of independent variables on the dependent variable $Z_t$, which depends on a time function (t) [4]. The independent variables are defined as dummy variables.
In this study, time series regression uses dummy variables such as trend (t), daily
seasonal (Jr,t ), weekly seasonal (Hh,t ), and inter-seasonal interactions (Ir,h,t ). The
time series regression model can be expressed as in Eq. (2),

$Z_t = \hat{\beta}_0 + \hat{\tau}\, t + \sum_{r=1}^{48} \hat{\vartheta}_r J_{r,t} + \sum_{h=1}^{7} \hat{\eta}_h H_{h,t} + \sum_{h=1}^{7}\sum_{r=1}^{48} \hat{\upsilon}_{r,h} I_{r,h,t} + \varepsilon_t, \quad (2)$
where β̂0 , τ̂ , ϑ̂r , η̂h , and υ̂r,h are the estimated parameters for intercept, trend, daily
seasonal, weekly seasonal, and inter-seasonal interaction, respectively. The model
thus has 1 + 1 + 48 + 7 + (7 × 48) = 393 parameters and 392 independent
variables.
This model has many independent variables; consequently, the model is not effec-
tive for forecasting. In order to solve this problem, the stepwise method is utilized
to eliminate independent variables by combining two stages: forward
stepwise selection and backward elimination [13]. The time series regression utilizes
Ordinary Least Square (OLS) for estimating the parameters. OLS is one of the param-
eter estimation methods by minimizing the sum of squared errors. The OLS formula
with matrix approach is given by Eq. (3) [14].
( )−1 T
β̂ = X T X X Z, (3)
where Z, X, and β̂ are the vector of Z t , the matrix of independent variables (vector
1 and all dummy variables), and the vector of estimated parameters, respectively.
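As a minimal sketch of how the design matrix behind Eq. (2) can be assembled and the parameters estimated by OLS as in Eq. (3), the snippet below builds the trend, half-hourly, daily, and interaction dummies for a short synthetic half-hourly series; the synthetic series and the fact that no reference categories are dropped are illustrative simplifications, not the study's actual model.

```python
import numpy as np

def design_matrix(n, periods_per_day=48, days_per_week=7):
    """Trend plus half-hourly, daily, and interaction dummies as in Eq. (2) (no reference dropped)."""
    t = np.arange(1, n + 1)
    halfhour = (t - 1) % periods_per_day                  # r = 0..47
    day = ((t - 1) // periods_per_day) % days_per_week    # h = 0..6
    J = np.eye(periods_per_day)[halfhour]                 # daily-seasonal dummies J_{r,t}
    H = np.eye(days_per_week)[day]                        # weekly-seasonal dummies H_{h,t}
    I = (H[:, :, None] * J[:, None, :]).reshape(n, -1)    # interaction dummies I_{r,h,t}
    # 1 + 1 + 48 + 7 + 336 = 393 columns, matching the parameter count in the text
    return np.column_stack([np.ones(n), t, J, H, I])

rng = np.random.default_rng(0)
n = 4 * 7 * 48                                            # four synthetic weeks of half-hourly data
X = design_matrix(n)
z = 4000 + 0.01 * np.arange(n) + rng.normal(0, 30, n)     # illustrative load series (MW)
# Collinear columns (no reference dropped) are handled by the minimum-norm least-squares solution
beta_hat, *_ = np.linalg.lstsq(X, z, rcond=None)          # OLS estimate, Eq. (3)
print(beta_hat[:3])
```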
The residual assumptions for time series regression model are commonly the same
as ordinary linear regression [15]. The independent assumption is a state where there
is no autocorrelation in the residual series. In this study, ACF plot is utilized to identify
this state. If there is a cut-off lag, it indicates that there is autocorrelation in the residual
series, or vice versa. In order to overcome the autocorrelation, lag variables are
added to the model, especially for lags that have strong autocorrelation. The identic
assumption also known as homoscedasticity is a state where the residual variations
are identical or not significantly different. This assumption is not necessarily satisfied,
because it only works on cross-sectional data [16]. The normally distributed assumption is a state where the residual series is normally distributed, usually denoted as $\varepsilon_t \sim N(0, \sigma^2)$. While outliers should be detected and handled, the high number of outliers in the electricity load data leads to a large computational time; thus, the normally distributed residual assumption is not necessarily satisfied.
$Z'_t = \frac{Z_t - Z_{\min}}{Z_{\max} - Z_{\min}}. \quad (6)$
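A minimal sketch of the min–max transformation in Eq. (6) and its inverse (needed later to map forecasts back to MW) is shown below with an illustrative array.

```python
import numpy as np

z = np.array([3900.0, 4200.0, 4800.0, 5100.0])    # illustrative load values (MW)
z_min, z_max = z.min(), z.max()

z_scaled = (z - z_min) / (z_max - z_min)           # Eq. (6)
z_back   = z_scaled * (z_max - z_min) + z_min      # inverse transform back to MW
print(z_scaled, np.allclose(z_back, z))
```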
Figure 2 shows the architecture of LSTM. As we can see, it starts from entering the
inputs until finally gaining the prediction (output). The steps of LSTM are elaborated
as follows [12].
Step 1. Execute the LSTM to obtain the forecast value using Eqs. (7)–(12), where $W_O$ is the weight of gate $O$, $b_O$ is the bias of gate $O$, and $O$ denotes the aforementioned gates (a numerical sketch of these gate computations is given at the end of this subsection).

$f_t = \sigma\left(W_f [h_{t-1}, X_t] + b_f\right), \quad (7)$

$i_t = \sigma\left(W_i [h_{t-1}, X_t] + b_i\right), \quad (8)$

$\tilde{c}_t = \tanh\left(W_c [h_{t-1}, X_t] + b_c\right), \quad (9)$

$c_t = f_t * c_{t-1} + i_t * \tilde{c}_t, \quad (10)$

$o_t = \sigma\left(W_o [h_{t-1}, X_t] + b_o\right), \quad (11)$

$h_t = o_t * \tanh(c_t). \quad (12)$
Step 2. Calculate residuals and calculate Mean Square Error (MSE) as the loss.
Step 3. The residuals are back-propagated to every gate. This step is popularly
called back-propagation through time [9].
Step 4. Optimizer updates the weights and biases of each gate by utilizing the
loss.
These steps are repeated until the stop condition test is met or the loss value has
converged. The forecasting value is in the form of minmax transformation, and thus it
must be reverse-transformed to compare it with the actual data. Hyperparameters are
parameters used in running machine learning algorithms to get optimal forecasting
performance. Epoch is defined as the number of iterations used in modeling. The
more the number of epochs, the more the weights change in the network. Optimizer
is used to find the optimal weight, so as to minimize the errors and maximize the
accuracy. This study uses Adaptive Moment estimation (ADAM) optimizer which
outperforms other optimizers [21].
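The gate computations in Eqs. (7)–(12) can be traced numerically with a single forward step of one LSTM cell, as in the NumPy sketch below; the tiny weight matrices and the input vector are random illustrative values, whereas in an actual model they are learned through back-propagation through time with the ADAM optimizer as described above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step following Eqs. (7)-(12); W and b hold the four gate parameters."""
    z = np.concatenate([h_prev, x_t])              # [h_{t-1}, X_t]
    f_t = sigmoid(W["f"] @ z + b["f"])             # forget gate, Eq. (7)
    i_t = sigmoid(W["i"] @ z + b["i"])             # input gate, Eq. (8)
    c_tilde = np.tanh(W["c"] @ z + b["c"])         # candidate cell state, Eq. (9)
    c_t = f_t * c_prev + i_t * c_tilde             # cell state update, Eq. (10)
    o_t = sigmoid(W["o"] @ z + b["o"])             # output gate, Eq. (11)
    h_t = o_t * np.tanh(c_t)                       # hidden state, Eq. (12)
    return h_t, c_t

rng = np.random.default_rng(1)
n_in, n_hid = 5, 3                                 # e.g., 5 lag inputs, 3 hidden units (illustrative)
W = {k: rng.normal(0, 0.1, (n_hid, n_hid + n_in)) for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
x = rng.normal(0, 1, n_in)                         # one scaled input vector
h, c = lstm_step(x, h, c, W, b)
print(h)
```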
In this study, we measure forecasting accuracy by means of Root Mean Square Error
(RMSE) and symmetric Mean Absolute Percentage Error (sMAPE) through out-
sample approach. RMSE is used to determine the best model from each method
while sMAPE is used to determine the best overall model. The formulas are defined
in Eqs. (13) and (14), respectively [22]. The Z t+l and Ẑ t+l in sequence are the
out-sample data and its forecast at the l-th step, with L the number of out-sample observations.

$\mathrm{RMSE} = \sqrt{\frac{1}{L}\sum_{l=1}^{L}\left(Z_{t+l} - \hat{Z}_{t+l}\right)^{2}}, \quad (13)$

$\mathrm{sMAPE} = \frac{1}{L}\sum_{l=1}^{L} \frac{\left|Z_{t+l} - \hat{Z}_{t+l}\right|}{\frac{1}{2}\left(Z_{t+l} + \hat{Z}_{t+l}\right)} \times 100\%. \quad (14)$
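The two accuracy measures in Eqs. (13) and (14) can be computed directly, as in the short sketch below with illustrative actual and forecast values.

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean square error, Eq. (13)."""
    return np.sqrt(np.mean((actual - forecast) ** 2))

def smape(actual, forecast):
    """Symmetric MAPE, Eq. (14), in percent."""
    return np.mean(np.abs(actual - forecast) / ((actual + forecast) / 2.0)) * 100.0

actual   = np.array([4100.0, 4250.0, 4500.0, 4700.0])   # illustrative out-sample loads (MW)
forecast = np.array([4080.0, 4300.0, 4450.0, 4760.0])
print(rmse(actual, forecast), smape(actual, forecast))
```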
Figure 3 shows the time series plot of electricity load from January 1st , 2020, to
January 31st, 2023. The plot shows an upward trend caused by population growth, which increases electricity needs from facilities, domestic demand, companies, and other sources. There was a phenomenon of decreasing electricity consumption during Eid al-Fitr and several days before and after it. This phenomenon is explained by people returning to their hometowns, so that the houses left behind consumed almost no electricity. In addition, some industries did not operate due to collective leave, thus reducing the use of electricity.
Electricity load is found to have different averages and variances at each half-hour.
The data frame was organized by collecting data from each same day at each half-
hour in order to compute averages and variances. Figures 4 and 5 in sequence show
the averages and the variances of electricity load per half-hour at each individual
day. In Fig. 4, the average electricity load on Saturday and Sunday is less than
that on weekdays. This is explained by less activities consuming electricity during
weekends as most people using the time for rest. In addition, it is also the case for
some industries and offices which are off during such times. Meanwhile, 18:00 is the peak of electricity consumption, because people rush to complete their work. After 18:00, electricity consumption decreases, signifying that most people have completed their work, while the rest are still working. In Fig. 5, every day's pattern
is almost identical. During working hours, the variance pattern has a fluctuation
which shows an increase from 07:00 to around noon and then a decrease up to 18:00.
People’s activities during these hours are various, and thus electricity consumptions
have a higher variance than that during other hours. It is apparent that on Sunday,
the consumption is lower in variance than that on other days since most people do
less diverse activities on Sunday. It is noted that from 11:00 to 13:00 there is a fall
both in average and in variance of the electricity load. This is attributed to the lunch
break during which most activities are stopped.
The analysis of this method was divided into two input types. In the first, all input variables were analyzed, whereas in the second, only input variables suspected to have an effect (based on the study by Sintyaningrum [3]) were analyzed. The other models
were adjustments to the predefined models (1st model and 5th model). The models
are summarized in Table 1. The reference variables are J48,t H7,t , I48,1,t , I48,2,t , …,
I48,6,t , I1,7,t , I2,7,t , …, I48,7,t , that are used for comparison with other aggregation
times.
The evaluations for the first input type are shown in Table 2, with k is the number
of parameters. The number of parameters in the 1st model is 337 parameters, which
is the total 393 parameters subtracted by 56 parameters of the reference variable. The
2nd model is shown to yield the lowest out-sample RMSE value, and thus it becomes
the best model for the first input type. The evaluations for the second input type are
shown in Table 3. Since the 7th model shows the lowest out-sample RMSE value, it
becomes the best model for the second input type. There are many significant lags,
especially around the double seasonal lag (see Fig. 6), indicating that both models did
not satisfy the independent residual assumption. Adding lag variables as independent
variables based on significant lags is required.
We added lag variables Z t−1 , Z t−48 , and Z t−336 representing the non-seasonal
lag, the first seasonal lag, and the second seasonal lag, respectively. Two new models
were identified, that are the 8th model and the 9th model which were in sequence
the definitions of the 2nd and 7th models added with the lag variables for analysis.
Therefore, those new models were executed to forecast. The evaluations for the
modified models are shown in Table 4. It is noted that the number of parameters in
the 8th model is higher than in the 2nd model, and likewise for the 9th model compared to the 7th model,
due to the stepwise method. This indicates that with the addition of lag variables,
the models have more dummy variables that have a significant effect. Out-sample
RMSE value of the 9th model is lower than the 8th model, thus the 9th model is the
best model for the time series regression method. Nevertheless, there are still many
significant lags, especially around the first and second seasonal lags (see Fig. 7). The
addition of lag variables did not satisfy the assumption, indicating the existence of
significant lags not being included in the modified models.
The rounded value from the Box-Cox analysis is −0.34. However, Fig. 3 shows that the electricity load has the same range of data for each time shift; the several falls in electricity load around Eid al-Fitr are what cause the Box-Cox analysis to indicate non-stationarity in variance. Therefore, we considered the data to be stationary in variance and did not transform them. Figure 8 shows the ACF plot, which indicates non-stationarity in mean because the ACF dies down extremely slowly. Hence, differencing was carried out at the lags where the ACF dies down extremely slowly, namely lags 1, 48, and 336.
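The differencing described above (non-seasonal lag 1 and seasonal lags 48 and 336) can be sketched with pandas as follows; the synthetic series is an illustrative stand-in for the half-hourly load.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 3 * 336                                      # three synthetic weeks of half-hourly data
z = pd.Series(4000 + 10 * np.sin(2 * np.pi * np.arange(n) / 48) + rng.normal(0, 5, n))

# Differencing at lag 1 (non-seasonal), 48 (daily), and 336 (weekly), as done before fitting DSARIMA
w = z.diff(1).diff(48).diff(336).dropna()
print(len(z), len(w))                            # 1008 observations reduce by 1 + 48 + 336 = 385
```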
Fig. 9 a ACF plot, and b PACF plot of electricity load after differences
Figure 9 shows the ACF and PACF plots after differencing; the ACF no longer dies down extremely slowly, indicating that the data are stationary in mean. As we can see in Fig. 9, the ACF plot cuts off at lags 48 and 336, while the PACF plot dies down at each seasonal lag. Thus, we also identified the
conjecture models that incorporate the order of ARMA. As a result, the conjecture
models are multiplicative double seasonal ARIMA. Based on these identifications,
four conjecture models were obtained and presented in Table 5.
The evaluations for the conjecture models are shown in Table 6. All models
already have significant parameters and satisfy the white noise assumption. However,
all models did not satisfy the normally distributed residuals assumption due to large
number of outliers. The lowest out-sample RMSE value is shown by model 3 making
it to be the best model for the Double Seasonal ARIMA method.
The variables in the form of lags from the first 336 lags were selected based on
the highest autocorrelation value. In type 1, we took three lags with the highest
autocorrelation value and took Z t−48 , Z t−336 to represent the double seasonal lags.
In type 2, we took five lags with the highest autocorrelation value. The summary of
these scenarios is presented in Table 7. In this study, we used three hidden layers
with number of neurons equal to 64, 32, and 32. The epoch combinations were set
equal to 50, 100, 150, and 200. LSTM modeling cannot directly gain minimum
out-sample RMSE, so it is necessary to do several iterations. Hence, boxplots of
out-sample RMSE were created and presented to serve as the comparison between
combinations.
In type 1, epoch = 200 is the best combination for having the lowest median and
mean value of out-sample RMSE (see Fig. 10a). As for in type 2, epoch = 200 is also
the best combination for the same reason (see Fig. 10b). In spite of this, Interquartile
Range (IQR) Box of out-sample RMSE with epoch = 200 in type 2 is smaller than
that in type 1 as we can see in Fig. 10. This demonstrates that the forecasting with
the combination of epoch = 200 in the type 2 gains steadier forecasts. Therefore,
LSTM model with epoch = 200 and Z t−1 , Z t−2 , Z t−3 , Z t−4 , Z t−336 as the inputs are
the best model for the long short-term memory method.
Fig. 11 Time series plot of best model forecasts for each method
Up to this point, we have determined one best model from each of the three methods.
Forecasting comparison, then, needs to be drawn in order to notice the forecasting
performance of the three best models. Hence, we created Fig. 11 to show their
comparison in out-sample data. The figure indicates that the forecasts of each model
quite accurately fit the actual data across the out-sample horizon. This can be inferred
that each method has successfully forecasted the electricity load. The best overall
model is determined by the lowest out-sample sMAPE value. For simplicity, the
values are compared through the visualization of the iterative plot of out-sample
sMAPE, as presented in Fig. 12. Notice, Fig. 12 shows that the out-sample sMAPE at
l = 1 is around 2–3.5%, but it falls significantly thereafter. The farther the forecasting
time, the steadier the iterative line, so that the models have been consistently good
at forecasting electricity load in a 1-week period. The line of iterative out-sample
sMAPE of the DSARIMA model is the lowest among other models. Therefore,
double seasonal ARIMA is the best overall model among the others. Finally, we can
conclude that double seasonal ARIMA is the best method in forecasting electricity
load.
In this study, we have applied time series regression, double seasonal ARIMA, and
long short-term memory methods for forecasting electricity load. We have deter-
mined the best model of each method. The best model of time series regression uses the variables suspected to have an effect, without the reference variables, together with the lag variables $Z_{t-1}$, $Z_{t-48}$, and $Z_{t-336}$, selected by the stepwise elimination method. This model resulted in a quite low out-sample RMSE value of 47.734.
However, the model does not satisfy the independent residual assumption. Next, the
best model of double seasonal ARIMA is double seasonal ARIMA ([2, 10, 11, 12,
15, 16, 17, 18, 19, 20, 21, 22, 23], 1, [1, 2, 3, 7, 8, 30, 34, 35, 39, 40, 41, 42, 43, 44, 45,
46, 47, 48]) (0, 1, 1)48 (0, 1, 1)336 . This model gained a quite low out-sample RMSE
value of 35.631. It has successfully satisfied the white noise assumption. Lastly, the
best model of long short-term memory is the LSTM (5, 64, 32, 32, 1) with epoch =
200 and Z t−1 , Z t−2 , Z t−3 , Z t−4 , Z t−336 as inputs. This model obtained a quite low
out-sample RMSE value of 65.104 in average. Then, we compared those best models
by the lowest out-sample sMAPE value in order to obtain the best overall model.
The best model of TSR, DSARIMA, and LSTM acquired extremely low out-sample
sMAPE values equal to 0.644%, 0.55%, and 0.971%, respectively. Having the lowest
out-sample sMAPE value, the double seasonal ARIMA is concluded to be the best
overall model for forecasting electricity load in this study. There is scope for further studies to obtain lower errors and more accurate forecasts. In time series regression, the independent residual assumption was not satisfied, indicating that there were still significant lags that had not been included in the model. Future studies could include more lag variables to obtain a better TSR model and to try to satisfy this assumption. In general, future studies could also consider
external factors that may affect electricity load such as weather or temperature.
References
Iis Dewi Ratih, Ketut Buda Artana, Heri Kuswanto, Emmy Pratiwi,
and Muhammad Farhan Nuari
Abstract Ship collisions are the type of accident with the highest percentage of
investigations, making them the type of accident with a high variation in causes.
Additionally, ship collisions pose a serious threat because they occur between two
different vessels, resulting in material losses and loss of life. This condition makes
ship collisions a serious problem that requires prevention efforts and adjustment of existing conditions. This study aims to model the causes of ship collisions in
Indonesia to determine the probability of a ship experiencing a collision or a near
miss. The modeling will be conducted using the Bayesian network method. The
Bayesian network model is based on the factors that cause ship collisions, relying on
past incidents and written reports from National Transportation Safety Committee
(NTSC) investigations and judgments from the Maritime Court. The purpose of this
study is to identify the factors that cause ship collisions, determine the probability
of a ship experiencing a collision, and identify the factors that contribute the most
to the probability of ship collisions in Indonesia through sensitivity analysis. The
results obtained from the model, with a 70% weight for training data, show that the
probability of a ship experiencing a collision during a dangerous condition is 63%,
with an accuracy and sensitivity of 93.75% and 100%, respectively. According to
the model, the factors with the greatest influence are “crew competence,” “decision
making,” “maneuverability,” and “ship communication.”
1 Introduction
Ship accidents are a crucial topic that requires extensive research due to their high
risk and potential for significant losses. According to the National Transportation
Safety Committee (NTSC), ship collisions rank second in terms of the percentage of
ship accidents investigated, following burnt ships. IMO MSC/255(84) and Indonesian
Law (UU) No. 17 of 2008 on Shipping state that the NTSC investigates only a portion
of ship accidents. The purpose of these investigations is to identify potential issues
and assess the problems arising from the incidents, with the aim of improving safety
measures in the future. This highlights the importance of analyzing ship collisions,
which have the highest investigated ratio among the accident types handled by the
NTSC and involve a wide variety of causes. In-depth analysis can lead to the development of effective
solutions for reducing the occurrence of subsequent ship collisions.
In the analysis of ship collision accidents, ship accident modeling plays a funda-
mental role, especially in the modeling of causation probability for estimating the
frequency of ship accidents. The estimation of ship accident frequency can be carried
out using a calculation model provided as follows [1]:
$$\lambda_{col} = P_c \cdot N_G. \qquad (1)$$
This equation defines the frequency of ship collisions in terms of two components:
geometric probability ($N_G$) and causation probability ($P_c$) [2]. Geometric probability
represents the number of candidate ships that have the potential to experience a
collision, while causation probability represents the likelihood of a ship failing to
avoid a collision when it is on a collision course. The value of causation probability
is influenced by various factors that are represented as interconnected nodes, leading
to the root cause. Some of these nodes include human error, weather conditions, and
technical failures [3].
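Equation (1) is a simple product of the two components. The sketch below only illustrates the arithmetic with made-up values for $P_c$ and $N_G$; it is not the study's data:

```python
def collision_frequency(p_c: float, n_g: float) -> float:
    """Eq. (1): expected collision frequency = causation probability x
    number of geometric collision candidates."""
    return p_c * n_g

# Illustrative values only: 120 collision candidates per year on a route,
# causation probability 1.5e-4 (hypothetical figure).
print(collision_frequency(1.5e-4, 120))   # -> 0.018 collisions per year
```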
Based on the concept, each water area has a unique causation probability value,
which distinguishes it from other water areas. Unlike previous studies that used
default causation probability values [1], this research aims to determine causation
probability values based on real conditions using the Bayesian network method.
According to the literature study conducted, [4] states that the Bayesian network is
the best method for estimating the causation probability value.
Several previous studies have focused on examining cases related to the risk of
ship collisions. For instance, studies conducted by researchers [5–7] involved ship
collision risk analysis using default causation probability (Pc) values from IALA
Waterway Risk Assessment Program (IWRAP). These studies indicated that the Pc
results obtained from the analysis in the Akashi and Dover Straits could be gener-
alized and applied to all waters worldwide. In the analysis of ship collisions in the
Sunda Strait, [8–10] employed the fault tree analysis (FTA) modeling method to determine
Pc values. However, the modeling in these studies was based on factors and oppor-
tunities derived from references rather than historical data specific to accidents in
the Sunda Strait. Furthermore, other studies such as those conducted by researchers
[11, 12] employed the Bayesian network method for modeling causation probability.
However, these studies utilized a simplified approach with limited reference data.
Therefore, there is a need to develop and update the conditions based on the latest
available data and establish a more detailed network, which will be undertaken in
this current study.
For this reason, in this paper, Bayesian network is utilized to model the factors
contributing to ship collisions (head-on, overtaking, and crossing) specifically for
Indonesian waters. Consequently, if necessary, appropriate measures can be taken
promptly to ensure safety and minimize potential losses resulting from ship collisions.
2 Literature Review

A Bayesian network rests on Bayes' theorem, which updates the probability of an event $X_p$ once a related event $X_{p+1}$ is observed:

$$P(X_p \mid X_{p+1}) = \frac{P(X_{p+1} \mid X_p)\,P(X_p)}{P(X_{p+1})}$$

where:
$P(X_p \mid X_{p+1})$ = the probability of event $X_p$ occurring given that event $X_{p+1}$ has already happened
$P(X_{p+1} \mid X_p)$ = the probability of event $X_{p+1}$ occurring given that event $X_p$ has already happened
$P(X_p)$ = the probability of event $X_p$
$P(X_{p+1})$ = the probability of event $X_{p+1}$
A Bayesian network is an explicit description of direct dependencies among a
set of variables. This description takes the form of a directed graph and a set of node
probability tables (NPT). The directed graph, or Bayesian network structure, consists
of a set of nodes and edges. Nodes correspond to variables, and edges connect directly
dependent variables. An edge from $X_p$ to $X_{p+1}$ encodes the assumption that there is
a direct causal dependence or influence of $X_p$ on $X_{p+1}$; the node $X_p$ is then referred
to as the parent node of $X_{p+1}$. No cycles occur within the graph (e.g., if there is an
edge from $X_p$ to $X_{p+1}$ and from $X_{p+1}$ to $X_{p+2}$, there is no edge from $X_{p+2}$ back to $X_p$). This
avoids circular reasoning. Each node $X_p$ has an associated probability table
called the node probability table (NPT) of $X_p$ [14].
If there are $p$ random variables $X_1, X_2, X_3, \ldots, X_p$ in a directed acyclic graph with
$p$ numbered nodes, let node $j$ $(1 \le j \le p)$ of the graph be associated with variable $X_j$.
The graph is a Bayesian network representing the variables $X_1, X_2, X_3, \ldots, X_p$ if

$$P(X_1, X_2, X_3, \ldots, X_p) = \prod_{j=1}^{p} P(X_j \mid Parents(X_j)). \qquad (2)$$

The left-hand side is the joint probability distribution (JPD), i.e., the probability of all
events occurring simultaneously.
BN, also known as Bayesian belief network (BBN), is constructed using graphical
structures and Bayes’ theorem. It is a widely accepted formalism for representing
uncertain knowledge (subjective or objective) and for efficient reasoning with it. A
complete BN model comprises both qualitative and quantitative components. The
qualitative component of a BN is a directed acyclic graph, where nodes and edges
represent system variables and their causal dependencies. The quantitative compo-
nent of the BN model is presented through a set of conditional probabilities or
probability distributions for each child node given its parent nodes in the network
[15].
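As a concrete illustration of the factorization in Eq. (2), the following sketch evaluates the joint probability of a tiny three-node network. The node names and CPT values are hypothetical and are not taken from the study's model:

```python
# Tiny Bayesian network: Competence -> Decision -> Collision (hypothetical CPTs).
p_competence = {"good": 0.8, "poor": 0.2}                      # prior node
p_decision = {  # P(Decision | Competence)
    ("good",): {"good": 0.9, "bad": 0.1},
    ("poor",): {"good": 0.4, "bad": 0.6},
}
p_collision = {  # P(Collision | Decision)
    ("good",): {"yes": 0.05, "no": 0.95},
    ("bad",):  {"yes": 0.60, "no": 0.40},
}

def joint(competence, decision, collision):
    """Eq. (2): P(X1,...,Xp) = prod_j P(Xj | Parents(Xj))."""
    return (p_competence[competence]
            * p_decision[(competence,)][decision]
            * p_collision[(decision,)][collision])

# P(Collision = yes) by summing the joint over all parent configurations.
p_yes = sum(joint(c, d, "yes") for c in p_competence for d in ("good", "bad"))
print(p_yes)
```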
The holdout validation method randomly divides the data into two categories: training
data and testing data. The training data are used by the Bayesian network algorithm
to construct the classification model, while the testing data are used to evaluate the
model that has been formed. Estimation and prediction are then carried out to assess the
reliability of the classification model built from the training data [16] (Fig. 1).
The holdout validation method is chosen to evaluate the Bayesian network model
that has been formed. Holdout validation is the most common approach for evaluating
machine learning models. However, when the dataset used for training and evaluating
the machine learning model is small, performance metrics that rely on this validation
and testing set composition can be sensitive, and the resulting performance measures
are often unreliable [17].
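A minimal sketch of the holdout split described above, using scikit-learn's train_test_split with the 70:30 scenario; the feature matrix and labels are random placeholders, not the NTSC data:

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Placeholder feature matrix (108 vessels x 30 causal variables) and labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(108, 30))
y = rng.integers(0, 2, size=108)          # 1 = collision, 0 = near miss

# 70% training / 30% testing, as in one of the scenarios reported in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)
```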
Performance evaluation measure (PEM) is a step taken to measure system perfor-
mance. PEM is performed on the training data to evaluate the created model. PEM is
usually depicted in a table called a confusion matrix [18]. The confusion matrix table
describes the performance of a model in a specific manner. Each row of the matrix
represents the actual class of the data, and each column represents the predicted class
of the data (Table 1).
A Bayesian Network for Classifying and Predicting Ship Collision 407
The confusion matrix table consists of values for true positive, false positive, false
negative, and true negative. True positive indicates how many data points are actually
positive in class, and the model also predicts them as positive. True negative indicates
how many data points are actually negative in class, and the model predicts them as
negative. False positive indicates how many data points are actually negative, but the
model predicts them as positive. False negative indicates how many data points are
actually positive, but the model predicts them as negative. Through the two criteria
of the confusion matrix, we can obtain [16]:
1. Accuracy is the ratio of correct predictions to the total number of observations,
representing how often the model classifies correctly. The formula for accuracy is
shown in Eq. 3.

$$\text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{True Negative} + \text{False Positive} + \text{False Negative}}. \qquad (3)$$

2. Sensitivity or recall is the ratio of correct positive predictions to the actual positives.
Sensitivity explains how often the model predicts positive when the actual class
is positive. The formula for sensitivity is shown in Eq. 4.

$$\text{Sensitivity} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}. \qquad (4)$$
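Both measures follow directly from the confusion-matrix counts. A minimal sketch of Eqs. (3) and (4) with illustrative counts (not the study's confusion matrix):

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (3): proportion of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    """Eq. (4): proportion of actual positives predicted as positive (recall)."""
    return tp / (tp + fn)

# Illustrative counts only.
tp, tn, fp, fn = 20, 10, 1, 1
print(f"accuracy = {accuracy(tp, tn, fp, fn):.4f}")
print(f"sensitivity = {sensitivity(tp, fn):.4f}")
```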
3 Methodology
The data used includes chronological information on each ship collision incident
documented in the NTSC investigation reports and the court decisions from the
Shipping Court from 2007 to 2021. The data used consists of 44 investigated ship
collision incidents involving a total of 108 vessels. The target variable is “collision,”
and there are 30 causal variables associated with it. These causal variables cover
human error, technical failure, and environmental factors. The causal
variables are good seamanship, crew competence, uncertain, ship communication,
understanding of ship characteristics, understanding of water, inexperience, decision
making, crew health, number of crew, dual task, fatigue, situational awareness, visual
observation, daylight, master, understanding of navigation and communication sign,
navigation and communication utilized, navigation and communication equipment,
preventive timing, pilot, maneuverability, technical failure, ship type, water type,
wind velocity, wave height, ship length, ship breadth, ship draft, ship speed. The
following are the steps in this research:
a. Build a Bayesian network structure by estimating the sequence of events leading
to ship collisions.
b. Determine the parameters (prior probabilities) once the Bayesian network structure
is formed.
c. Determine the parameters (prior probabilities) of each cause of ship collisions.
The prior probability is the degree of belief in an event that is used when no other
information is available to assess its probability; once new information becomes
known, this probability must be updated.
d. Create a conditional probability table (CPT).
Conditional probability is the probability of an event B given that event A has
occurred. A table that contains these probabilities for every possible value of A and B
is called a conditional probability table (CPT).
e. Create a joint probability distribution (JPD).
The joint probability distribution gives the probability of co-occurrence for all
possible combinations of values of variables A and B. Accordingly, the joint
distribution for a cause of accident is obtained by multiplying the conditional
probability by the prior probability.
f. Calculate the posterior probability.
The posterior probability is calculated from the joint probability distribution
obtained above and is then used to compute the probability value of each cause
of the accident.
g. Model validation.
Validation uses the holdout method, in which the data are divided into training
and testing sets. The training data are used to build the model, while the testing
data are used to validate the model that has been formed. The training-to-testing
composition scenarios are 60:40, 70:30, 80:20, and 90:10.
h. Sensitivity Analysis
In this particular case, the research focuses on analyzing 44 ship collision incident
chronologies that have occurred in Indonesian waters, as documented in the investi-
gation reports of the NTSC and the trial decisions of the Shipping Court. The factors
identified in these reports serve as the basis for constructing the Bayesian network
model. The relationships between these factors are determined based on the chrono-
logical sequence of events influenced by the predetermined factors. The Bayesian
network model structure, depicting the nodes representing these factors, is illustrated
in Fig. 3.
Prior nodes are factors or variables that do not have antecedent factors. In the
Bayesian network model used in this study, there are several nodes that serve as prior
nodes. These nodes include "daylight," "master," "Navcom equipment," "number
of crew," "inexperience," "crew health," "crew competence," "ship communication,"
"pilot," "understanding of Navcom sign," "maneuverability," and "technical failure."
In the Bayesian network model, a child node is a node whose probability value
depends on one or more parent nodes. To calculate the probability at a child node,
Bayes’ theorem is utilized, as described previously.
In this study, the joint probability calculations are conducted once all the indi-
vidual probabilities are obtained. The software tool “GeNIe” is utilized to assist in
these joint probability calculations. Additionally, “GeNIe” is also used to construct
the structure of the Bayesian network, allowing for the visualization of the joint prob-
ability distributions for each node. Figure 3 illustrates the structure of the Bayesian
network, which is used to classify the probability of a ship experiencing a collision
or surviving a collision (near miss). The “GeNIe” software enables the display of the
joint probability distributions for the nodes within the network, aiding in the analysis
and inference process.
The sensitivity analysis in this research is conducted using the software "GeNIe." It
involves analyzing the sensitivity of the model based on significant alterations in the
probability of specific events. This analysis is performed by simulating the effects of
a 100% change in the probability of these events. The sensitivity analysis results are
presented in the form of a tornado diagram, which displays the top ten factors that
contribute the most to the increase or decrease in the probability of the target node
(“Collision = Collision”) (Fig. 4).
The sensitivity analysis results indicate that when the conditional probability of
"collision = collision" given "good seamanship = no," "decision making = bad,"
"maneuverability = good," "technical failure = no," and "preventive timing = bad"
is lowered to 0%, the causation probability decreases to 0.426, a difference of 0.079.
This means that good maneuverability of the ship and the absence of technical failure
are important factors in decreasing the causation probability, even when preventive
timing is bad, seamanship is not good, and decision making is bad. There are also
other circumstances that lead to lower causation probability values; "crew health = fit,"
for example, has a significant impact on lowering the causation probability compared
with other circumstances. If its probability is decreased from 93 to 0%, the causation
probability decreases to 0.564, a difference of 0.071. The overall results of the sensitivity analysis
for head-on collisions are presented in the tornado diagram shown in Fig. 2. This
diagram visually represents the top ten factors that contribute the most to the increase
or decrease in the probability of the target node (“collision = collision”) for head-on
collisions. It provides valuable insights into the relative importance of these factors
and their impact on the occurrence of head-on collisions.
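The idea behind such a tornado-style sensitivity analysis can be sketched by sweeping one parent's probability while holding the rest of a hypothetical network fixed and recording the change in the target probability; the structure and numbers below are invented for illustration only:

```python
# Hypothetical two-parent model P(collision | seamanship, timing), used only to
# illustrate the sweep behind a tornado diagram (values are not the study's).
p_timing_bad = 0.3
p_coll = {  # P(collision = yes | seamanship, timing)
    ("good", "good"): 0.02, ("good", "bad"): 0.20,
    ("poor", "good"): 0.30, ("poor", "bad"): 0.70,
}

def p_collision(p_seamanship_poor):
    total = 0.0
    for s, ps in (("poor", p_seamanship_poor), ("good", 1 - p_seamanship_poor)):
        for t, pt in (("bad", p_timing_bad), ("good", 1 - p_timing_bad)):
            total += ps * pt * p_coll[(s, t)]
    return total

baseline = p_collision(0.25)
for p in (0.0, 0.25, 0.5, 1.0):        # sweep P(seamanship = poor)
    print(f"P(poor seamanship)={p:.2f} -> P(collision)={p_collision(p):.3f} "
          f"(delta {p_collision(p) - baseline:+.3f})")
```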
5 Conclusion
Based on the analysis conducted on Bayesian network modeling for predicting ship-
ship collision in Indonesia, here is a summary of the key findings:
1. Factors Influencing Ship Collisions: The analysis considered 31 factors that influ-
ence the occurrence of ship collisions. These factors were identified based on the
investigation report by NTSC and the results of the Shipping Court trial. Some of
the factors included daylight, master, navigational equipment, number of crew,
crew competence, ship communication, pilot, and technical failure, among others.
From all nodes, “maneuverability,” “technical failure,” “crew health,” and “crew
competence” are the most significant factors influencing collisions.
2. Bayesian Network Modeling: The results of the Bayesian network modeling
showed that the probability of a ship experiencing a potential collision condition
was determined to be 63%, while the probability of experiencing a collision or
near miss was 37%. The accuracy of the model was measured to be 93.75%.
These findings provide valuable insights into the factors influencing ship colli-
sions, the probabilities associated with collision occurrences, and the frequency of
collisions in the Sunda Strait. They can be used to inform decision-making processes,
improve safety measures, and minimize the risks of ship collisions in the designated
area.
References
11. Sukma RA, Handani DW, Nugroho TF (2021) Risk assessment of ship collision on FSO
Pertamina Abherka and oil spill modelling due to structural damage
12. Mulyadi Y, Kobayashi E, Wakabayashi N, Pitana T, Wahyudi (2014) Development of ship
sinking frequency model over subsea pipeline for Madura strait using AIS data. WMU J
Maritime Affairs 13(1):43–594. https://doi.org/10.1007/s13437-013-0049-2
13. Usadha NR (2017) Implementasi Bayesian Network Untuk Perhitungan Probabilitas Pada
Penilaian Risiko Pipa Bawah Laut oleh Faktor Kapal. Limits: J Math Appl 14(1):61. https://
doi.org/10.12962/limits.v14i1.2250
14. Fenton N, Neil M (2013) Risk assessment and decision analysis with Bayesian networks. CRC
Press
15. Wu Z, Shen Y, Wang H, Wu M (2019) Assessing urban flood disaster risk using Bayesian
network model and GIS applications. Geomat Nat Haz Risk 10(1):2163–2184. https://doi.org/
10.1080/19475705.2019.1685010
16. Hermawan H, Yoannita Y (2018) Komparasi metode evaluasi pada credit scoring data mining.
1:22–25
17. Maleki F, Muthukrishnan N, Ovens K, Reinhold C, Forghani R (2020) Machine learning
algorithm validation. Neuroimaging Clin N Am 30(4):433–445. https://doi.org/10.1016/j.nic.
2020.08.004
18. Imron A (2019) Analisis sentimen terhadap tempat wisata di kabupaten rembang menggunakan
metode
19. Akbar A, Supono RA (2022) Prediksi kelancaran piutang pelanggan pada PT. Citra Ina feed-
mill dengan menggunakan algoritma naïve bayes dan K-Nearest neighbors. Jurnal Ilmiah
Edunomika 6(1):558. https://doi.org/10.29040/jie.v6i1.4692
Outlier Detection in Simultaneous
Equations with Panel Data
1 Introduction
$$y = X\beta + \varepsilon, \qquad (1)$$

The corresponding ordinary least squares residual can be written as:

$$r = y - \hat{y} = y - X(X'X)^{-1}X'y = y - Hy = (I - H)y$$
The panel data model takes the form $y_{it} = x_{it}\beta + d_i'\alpha + \varepsilon_{it}$, where index $i$ represents the
individual (cross-section unit), index $t$ represents the time period, $x_{it}$ is a $1 \times p$ row
vector of regressors that does not include a constant, $d_i'\alpha$ is the heterogeneity or
individual effect, and $d_i$ is a $1 \times n$ row vector containing the constant and group-specific
variables. Outlier detection in the proposed panel data model uses a balanced panel with
fixed effects. In stacked form the fixed effect model can be written as:

$$y = X\beta + D\alpha + \varepsilon, \qquad (4)$$
$$\begin{aligned}
r &= y - \hat{y} \\
  &= y - (X\hat{\beta} + D\hat{\alpha}) \\
  &= y - \left( X\left[X'M_D X\right]^{-1} X'M_D\, y + D(D'D)^{-1}D'\left(y - X\left[X'M_D X\right]^{-1} X'M_D\, y\right) \right) \\
  &= y - (Ay + B(y - Ay)) \\
  &= y - (A + B - BA)y \\
  &= (I - (A + B - BA))y \\
  &= (I - P)y
\end{aligned}$$

with

$$A = X\left[X'M_D X\right]^{-1} X'M_D, \quad B = D(D'D)^{-1}D', \quad P = A + B - BA.$$
$$t_{it} = \frac{r_{it}}{\hat{\sigma}\sqrt{1 - p_{it,it}}},$$

where $r_{it}$ is the residual for observation unit $it$, $\hat{\sigma}^2 = r'r/(nT - k)$, and $p_{it,it}$ is the
corresponding diagonal element of matrix $P$. The observation is detected as an outlier if
$|t_{it}| > t_{\gamma/2,\, nT-k}$.
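A direct, small-scale transcription of the ISR formulas above is sketched below; it builds the projection matrices explicitly, so it is only practical for small panels, and the toy data are not the ACFTA dataset used later in the paper:

```python
import numpy as np
from scipy import stats

def isr_panel_fixed_effects(y, X, D, alpha=0.05):
    """Internally studentized residuals for the LSDV fixed-effects model,
    transcribing the matrices A, B, P defined in the text."""
    nT, k = X.shape[0], X.shape[1] + D.shape[1]
    M_D = np.eye(nT) - D @ np.linalg.inv(D.T @ D) @ D.T
    A = X @ np.linalg.inv(X.T @ M_D @ X) @ X.T @ M_D
    B = D @ np.linalg.inv(D.T @ D) @ D.T
    P = A + B - B @ A
    r = (np.eye(nT) - P) @ y
    sigma2 = (r @ r) / (nT - k)
    t = r / np.sqrt(sigma2 * (1.0 - np.diag(P)))
    cutoff = stats.t.ppf(1 - alpha / 2, df=nT - k)
    return t, np.abs(t) > cutoff          # studentized residuals, outlier flags

# Toy balanced panel: n = 4 units, T = 5 periods, 2 regressors (invented data).
rng = np.random.default_rng(1)
n, T = 4, 5
X = rng.normal(size=(n * T, 2))
D = np.kron(np.eye(n), np.ones((T, 1)))   # unit dummies
y = X @ np.array([1.5, -0.7]) + D @ rng.normal(size=n) + rng.normal(size=n * T)
t_it, flags = isr_panel_fixed_effects(y, X, D)
print(np.round(t_it, 2), flags.sum())
```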
The panel data simultaneous equation model is a simultaneous equation model with
panel data. The testing procedure is the same as ordinary simultaneous equations,
only adding panel data effects. The lth panel simultaneous equation with fixed effects
can be written as follows [12]:
$$y_l = Y_l\tau_l + X_l\beta_l + D\alpha_l + \varepsilon_l, \quad l = 1, 2, \ldots, L \qquad (5)$$

with $\varepsilon_l' = (\varepsilon_{l11}, \ldots, \varepsilon_{lnT})$, $\alpha_l' = (\alpha_{l1}, \ldots, \alpha_{ln}) \otimes 1_T'$, and $1_T$ a vector of ones of size $T \times 1$.
Each model must be just identified or over-identified. There are various methods for
parameter estimation in simultaneous equations. This study is limited to Instrumental
Variables (IV) and Two-Stage Least Squares (2SLS). Estimation method uses IV if
the model is just identified, and 2SLS if the model is over-identified.
Equation (5) can be simplified to $y_l = X_l^{*}\beta_l^{*} + D\alpha_l + \varepsilon_l$, where $X_l^{*} = [Y_l \;\; X_l]$ and
$\beta_l^{*} = (\tau_l', \beta_l')'$. When the model is just identified, the IV estimator of $\beta_l^{*}$ is
$\hat{\beta}_{l(IV)}^{*} = (Z_l'M_D X_l^{*})^{-1} Z_l'M_D\, y_l$, whereas when the model is over-identified, the 2SLS
estimator of $\beta_l^{*}$ can be written as:

$$\hat{\beta}_{l(2SLS)}^{*} = \left[X_l^{*\prime} Z_l (Z_l'Z_l)^{-1} M_D Z_l' X_l^{*}\right]^{-1} \left[X_l^{*\prime} Z_l (Z_l'Z_l)^{-1} M_D Z_l' y_l\right],$$
$$\hat{\alpha} = (D_l'D_l)^{-1} D_l'\,(y_l - X_l^{*}\hat{\beta}_{l(2SLS)}^{*}).$$
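For intuition, a generic 2SLS estimator can be sketched in a few lines of numpy. Note that this is the standard cross-sectional form $\beta = (X^{*\prime}P_Z X^{*})^{-1} X^{*\prime}P_Z y$; for the panel case above, $y$, $X^{*}$ and $Z$ would first be transformed with the fixed-effects projection $M_D$. The data below are simulated for illustration only:

```python
import numpy as np

def two_stage_least_squares(y, X_star, Z):
    """Generic 2SLS: beta = [X*'Pz X*]^{-1} X*'Pz y with Pz = Z(Z'Z)^{-1}Z'."""
    Pz = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
    return np.linalg.solve(X_star.T @ Pz @ X_star, X_star.T @ Pz @ y)

# Toy data: one endogenous regressor instrumented by two exogenous instruments.
rng = np.random.default_rng(2)
n = 200
z1, z2 = rng.normal(size=n), rng.normal(size=n)
u = rng.normal(size=n)
x_endog = 0.8 * z1 - 0.5 * z2 + 0.7 * u + rng.normal(size=n)  # correlated with error
y = 2.0 + 1.5 * x_endog + u
X_star = np.column_stack([np.ones(n), x_endog])
Z = np.column_stack([np.ones(n), z1, z2])
print(two_stage_least_squares(y, X_star, Z))   # close to [2.0, 1.5]
```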
In the case where the model is just identified, the residual of the model can be written as:

$$\begin{aligned}
r_l &= y_l - \hat{y}_l \\
    &= y_l - (X_l^{*}\hat{\beta}_{l(IV)}^{*} + D\hat{\alpha}) \\
    &= y_l - \left( X_l^{*}(Z_l'M_D X_l^{*})^{-1} Z_l'M_D\, y_l + D_l(D_l'D_l)^{-1}D_l'\left(y_l - X_l^{*}(Z_l'M_D X_l^{*})^{-1} Z_l'M_D\, y_l\right) \right) \\
    &= y - (My + N(y - My)) \\
    &= y - (M + N - NM)y \\
    &= (I - (M + N - NM))y \\
    &= (I - Q)y
\end{aligned}$$

with

$$M = X_l^{*}(Z_l'M_D X_l^{*})^{-1} Z_l'M_D, \quad N = D_l(D_l'D_l)^{-1}D_l', \quad Q = M + N - NM.$$
Analogous to the previous panel data model, the outlier observation in the simul-
taneous equation can be detected by ISR. The ISR for simultaneous equation with
panel data model can be written as:
$$t_{it} = \frac{r_{it}}{\hat{\sigma}\sqrt{1 - q_{it,it}}},$$

where $r_{it}$ is the residual for observation unit $it$, $\hat{\sigma}^2 = r'r/(nT - k)$, and $q_{it,it}$ is the
corresponding diagonal element of matrix $Q$. The observation is detected as an outlier if
$|t_{it}| > t_{\gamma/2,\, nT-k}$.
The empirical application in this study uses data for the ten member countries of the ASEAN-
China Free Trade Area (ACFTA), i.e., Brunei Darussalam (BRN), Cambodia (KHM),
China (CHN), Indonesia (IDN), Malaysia (MYS), Myanmar (MMR), Philippines
(PHL), Singapore (SGP), Thailand (THA), and Vietnam (VNM) during the period
2007–2021. This research data is sourced from World Development Indicators
(WDI), World Bank. Endogenous variables consist of GDP (y1 ) and FDI (y2 ), while
exogenous variables consist of total exports and imports to GDP or Trade (x 1 ) and
labor force participation rate for 15–64 year olds or Labor (x 2 ). These variables are
useful for economic analysis, policy planning, and evaluation of a country’s economic
performance. The research model specification is as follows:
In Sect. 5.1, both Eqs. (6) and (7) are considered to be independent of each other,
so each is modeled separately to detect the presence of outliers in the panel data
model. In Sect. 5.2, both Eqs. (6) and (7) are considered simultaneous, to detect the
presence of outliers in the simultaneous equation with panel data.
Outlier detection in the panel data models with fixed effects is carried out using the
Least Squares Dummy Variable (LSDV) estimation method.
Model for GDP
Model estimation results for GDP (Eq. 6) can be seen in Table 1. Meanwhile, the
results of outlier detection in the GDP model can be seen in Table 2.
Based on ISR (t it ) with a critical value of t-table (α = 5%, df = 147), there are
eight observation units that are detected as outliers. These outliers occur in the same
country at different times. Observations categorized as outliers are those that are in
the left and right tails of the distribution of the studentized residuals.
Gray [14] proposed the L-R plot to display a case's influence against its leverage.
This study uses the plot of ISR versus leverage to identify high-influence and
high-leverage observations. This can be seen in Fig. 1.
The 107th observation is high leverage but not an outlier. A high-leverage obser-
vation is data that has extreme values in the predictor or outliers on the X-axis.
Fig. 1 Plot studentized residual versus leverage on GDP’s model. Observations detected as outliers
are observations that are outside the dashed red boundary line
This high leverage can be ignored because it does not affect the model estimation.
Therefore, high leverage is not classified as an outlier.
Model for FDI
Model estimation results for FDI (Eq. 7) can be seen in Table 3. Meanwhile, the
results of outlier detection in the FDI model can be seen in Table 4.
Based on ISR (t it ) with a critical value of t-table (α = 5%, df = 148), there are
five observation units that are detected as outliers. Outliers spread across multiple
countries and years. This result showed that observations categorized as outliers are
those that are in the left tail of the distribution of the studentized residuals.
A plot of studentized residual versus leverage can be seen in Fig. 2. The 107th
observation has both high leverage and high influence and is therefore categorized
as an outlier, emphasizing that it is an influential observation.
Equations (6) and (7) are considered a system of simultaneous equations, with Eq. 6
as the first equation and Eq. 7 as the second equation, so the estimation rule follows
simultaneous equation estimation. The estimation used IV method.
Fig. 2 Plot studentized residual versus leverage on FDI’s model. Observations detected as outliers
are observations that are outside the dashed red boundary line
Identification
Identification of simultaneous equation systems using order conditions, referring to
[15]. These can be seen in Table 5.
Only the second equation can be estimated using IV, while the first equation cannot
be estimated due to a lack of information about the predetermined variables.
Simultaneous Equation Model
The estimation results of the second equation using the IV method can be seen in
Table 6, while the results of outlier detection can be seen in Table 7.
Based on ISR (t it ) with a critical value of t-table (α = 5%, df = 148), there are six
observation units that are detected as outliers. Observations detected as outliers are
part of the observations detected in Tables 2 and 4, except for the 77th observation.
Fig. 3 Plot studentized residual versus leverage on second equation. Observations detected as
outliers are observations that are outside the dashed red boundary line
A plot between studentized residual and leverage can be seen in Fig. 3. The
distribution of leverage is more spread out than in the previous two plots.
6 Discussion
Outlier detection in panel data models has not been done by many researchers, espe-
cially outlier detection in simultaneous equations with panel data. Most researchers
directly use robust estimation to overcome the presence of outliers without detecting
them first. This study adopts [9, 10] for detecting outliers simply using ISR. Ronner
[16] previously detected outliers in simultaneous equations of cross-sectional data
using ESR.
This study uses a dataset of GDP, FDI, and its influencing factors in the ACFTA
region. Modeling this dataset is used for economic analysis, policy planning, and
evaluation of a country’s economic performance. Therefore, outlier detection in this
modeling is very important to ensure that the estimates obtained are not misleading.
Initially, outlier detection is performed on the panel data equations separately without
426 S. Ismadyaliana et al.
considering the two equations to be interconnected. The results of the initial stage of
detection indicate that there are eight outlier observations in the GDP model and five
outlier observations in the FDI model. Outliers spread among individuals at certain
times. Outliers in time and outliers in individuals are both mutually independent.
Outlier detection with ISR mainly considers the aspect of influence observation.
Observations detected purely as high leverage are not categorized as outliers, but
both high influence and high leverage are classified as outliers. Another advantage
of outlier detection with ISR is that it can detect more than one outlier at once. ISR is
based on residuals, so outlier observations will be different when done with different
types of modeling.
In the end, the two equations are considered simultaneous equations, and then
outlier detection is performed. The results of outlier detection in the simultaneous
equation obtained six observations that were detected as outliers. The outlier obser-
vations on the simultaneous equations are mostly part of the outlier observations on
the equations detected separately previously.
7 Conclusions
This study aims to detect outliers in the panel data model and in simultaneous equations
with panel data in a simple way using ISR, whose calculation is based on the residual
values. The results indicate that the observations flagged as outliers vary with the type
of modeling: each model yields its own residuals, so the observations detected as
outliers also differ. The outlier detection results for the simultaneous equations show
that the outlier observations are a subset of the outliers found in the separate equations
that compose them. After detecting outliers, the potential impact of the outliers should
first be determined by comparing results with and without them; the handling strategy
then depends on the nature of the outliers, one option being a robust estimation method.
A future study can determine the best outlier handling strategy.
Acknowledgements The authors are grateful to the Ministry of Education, Culture, Research
and Technology Indonesia under the PPS-PDD grant, BPS-Statistics Indonesia, and ITS-Surabaya
Indonesia, which support this research.
References
1. Barnett V, Lewis T (1995) Outliers in statistical data. J Oper Res Soc 46:1034
2. Hawkins DM (1980) Identification of outliers. Chapman & Hall. https://doi.org/10.1007/978-
94-015-3994-4
3. Hair JF, Black WC, Babin BJ, Anderson RE (2018) Multivariate data analysis, 8th edn
4. Chandola V, Banerjee A, Kumar V (2009) Anomaly detection. Comput Mater Contin 14:1–22
5. Hair JF, Black WC, Babin BJ, Anderson RE (2009) Multivariate data analysis, 7th edn
Outlier Detection in Simultaneous Equations with Panel Data 427
6. Lyu Y (2015) Detection of outliers in panel data of intervention effects model based on variance
of remainder disturbance. Math Probl Eng 2015
7. Edgeworth F (1887) On discordant observations. Lond Edinb Dublin Philos Mag J Sci 23:364–
375
8. Thompson R (1985) A note on restricted maximum likelihood estimation with an alternative
outlier model. J R Stat Soc Ser B 47:53–55
9. Rousseeuw PJ, Leroy AM (1987) Robust regression and outlier detection. J R Stat Soc Series
A (Stat Soc) 152
10. Cook RD, Weisberg S (1982) Residuals and influence in regression. Chapman and Hall. https://
doi.org/10.1016/s0169-4332(96)90178-6
11. Greene WH (2012) Econometric analysis, 7th edn. Pearson UK
12. Cornwell C, Schmidt P, Wyhowski D (1992) Simultaneous equations and panel data. J Econom
51:151–181
13. Johnston J, Dinardo J (1997) Econometric methods. McGraw-Hill
14. Gray JB (1986) A simple graphic for assessing influence in regression. J Stat Comput Simul
24:121–134
15. Koutsoyiannis A (1977) Theory of econometrics: an introductory exposition of econometric
methods
16. Ronner AE (1982) Part I: proceedings in computational statistics 1:12. https://doi.org/10.1007/
978-3-642-51461-6
Assessing Departmental Efficiency
at Sepuluh Nopember Institute
of Technology: A Comparative Study
Using Classical and Advanced Data
Envelopment Analysis
1 Introduction
countries, especially in the field of education, such as Australia (see [8]), Taiwan
(see [9]), or China (see [10]). One study that used DEA to measure efficiency in
Indonesian schools is [11], which reports that Muhammadiyah Elementary Schools in
Kulon Progo Regency show a greater share of efficient performance (highest 65%,
lowest 20%) than those in Bantul Regency (highest 33%, lowest 10%). Another study
examined the efficiency of world-class universities in Indonesia and found the most
efficient universities to be the University of Indonesia and Diponegoro University [12].
Furthermore, [13] employed fuzzy DEA to measure the relative efficiency of Indonesia's
top tertiary institutions in 2015; the results show that only half of the top Indonesian
universities perform at their best, indicating that improvements are needed. At the
department level, efficiency measurement was previously carried out by [14] with a
comprehensive explanation and benchmark; however, that work did not consider
different schemes, such as teaching or research, and used only classical DEA models.
One of the drawbacks of the classical model is the assumption that adding input will
increase output in the same proportion; moreover, the ranking of the most efficient
departments cannot be identified. Consequently, additional or advanced models were
developed from the classic CCR and BCC models. An advanced model can handle data
under broader assumptions than the classical model and can also improve the
discrimination among DMUs [7]. Thus, this study compares the sensitivity of the
classical and advanced DEA models, namely the slacks-based measure (SBM) and the
super-efficiency DEA, to measure the efficiency of the departments at ITS in terms of
three schemes: teaching and learning, research and development, and both combined
(overall indicator).
2 Methodology
2.1 Efficiency
Efficiency is the comparison between the actual output and the frontier output. The
basis for measuring efficiency is the ratio of output to input. Efficiency can be divided
into technical (output-oriented) efficiency, the ability of a system to achieve maximum
output from a given set of inputs, and allocative (input-oriented) efficiency, the ability
of a system to use minimum input to produce a certain amount of output [8].
Data envelopment analysis (DEA) was first introduced by Charnes, Cooper, and Rhodes
in 1978. DEA is a mathematical programming technique based on linear programming
that is used to evaluate the efficiency of a decision-making unit (work unit) responsible
for using several inputs to obtain targeted outputs. One of the advantages of DEA is
that it can measure efficiency with more than one input and output. However, the
disadvantages of DEA are that the approach is very sensitive to extreme observations,
hypothesis testing is difficult to do, and it only measures relative efficiency, not absolute
efficiency. The assumption used is that there is no random error; consequently, any
deviation from the frontier is treated as inefficiency [3].
However, in real-life scenarios, efficiency is defined by multiple inputs and outputs.
DEA overcomes the shortcoming of other methods by measuring efficiency as a ratio
of weighted outputs to weighted inputs. Efficiency can be rewritten by incorporating a
set of decision-making units (DMUs), as shown in Eq. 1. Each DMU-$j$, $j = 1, 2, \ldots, n$,
has input variables $x$ and output variables $y$ with weights $v_i$ and $\mu_r$, respectively, where
$i = 1, 2, \ldots, m$ indexes the input variables and $r = 1, 2, \ldots, s$ indexes the output
variables. As such, the efficiency score of each DMU can be expressed as the ratio of
output to input:

$$e_j = \frac{\sum_{r=1}^{s} \mu_r y_{rj}}{\sum_{i=1}^{m} v_i x_{ij}} \qquad (1)$$

In theory, each DMU is free to choose the weights that are most advantageous in
calculating its relative efficiency, provided that the same weights do not result in
efficiency scores greater than 1 for any DMU.
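For a single DMU and a fixed set of weights, Eq. (1) is just a ratio of weighted sums; the sketch below uses invented figures (not the ITS data) purely to show the computation:

```python
import numpy as np

def efficiency_ratio(outputs, inputs, mu, v):
    """Eq. (1): weighted sum of outputs over weighted sum of inputs for one DMU."""
    return float(np.dot(mu, outputs) / np.dot(v, inputs))

# Hypothetical department (invented figures):
inputs  = np.array([40, 25, 10, 900])          # lecturers, doctoral lecturers, staff, students
outputs = np.array([210, 0.05])                # graduates, championship ratio
v  = np.array([0.01, 0.005, 0.002, 0.0005])    # chosen input weights
mu = np.array([0.002, 1.0])                    # chosen output weights
print(efficiency_ratio(outputs, inputs, mu, v))
```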
There are two classic DEA models, namely CCR and BCC. These models represent the
early stage of DEA and form the basis for other DEA models [7].
CCR Model
The Charnes, Cooper, and Rhodes (CCR) model uses the assumption of constant
returns to scale, in which the ratio of output to input values is constant, i.e., an addition
to the inputs increases the output proportionally. Another assumption of this model is
that each DMU operates at an optimal scale [11]. Solving Eq. (1) normally requires
transformation into the following CCR model [4]. The multiplier output-oriented model
can be written as follows:

$$\min \; q = \sum_{i=1}^{m} v_i x_{io}$$

subject to:

$$\sum_{i=1}^{m} v_i x_{ij} - \sum_{r=1}^{s} \mu_r y_{rj} \ge 0, \qquad \sum_{r=1}^{s} \mu_r y_{ro} = 1, \qquad \mu_r, v_i > 0 \qquad (2)$$
The equivalent output-oriented envelopment form is:

$$\max \; \phi$$

subject to

$$\sum_{j=1}^{n} \lambda_j y_{rj} \ge \phi\, y_{rk}; \qquad \sum_{j=1}^{n} \lambda_j x_{ij} \le x_{ik}; \qquad \lambda_j \ge 0 \qquad (3)$$
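A minimal sketch of solving the output-oriented envelopment problem in Eq. (3) as a linear program with scipy.optimize.linprog; the four-DMU example data are hypothetical and only illustrate the mechanics:

```python
import numpy as np
from scipy.optimize import linprog

def ccr_output_oriented(X, Y, k):
    """Eq. (3) for DMU k: max phi s.t. sum_j lam_j*y_rj >= phi*y_rk and
    sum_j lam_j*x_ij <= x_ik, lam >= 0. X is (m inputs x n DMUs), Y is (s x n)."""
    m, n = X.shape
    s = Y.shape[0]
    c = np.r_[-1.0, np.zeros(n)]               # minimize -phi
    A_out = np.c_[Y[:, [k]], -Y]               # phi*y_rk - sum_j lam_j*y_rj <= 0
    A_in = np.c_[np.zeros((m, 1)), X]          # sum_j lam_j*x_ij <= x_ik
    A_ub = np.vstack([A_out, A_in])
    b_ub = np.r_[np.zeros(s), X[:, k]]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (n + 1))
    return -res.fun                            # phi*; values above 1 indicate output shortfall

# Hypothetical mini example: 2 inputs, 1 output, 4 DMUs (not the ITS data).
X = np.array([[4.0, 7.0, 8.0, 4.0],
              [3.0, 3.0, 1.0, 2.0]])
Y = np.array([[1.0, 1.0, 1.0, 1.0]])
for k in range(4):
    print(f"DMU {k}: phi* = {ccr_output_oriented(X, Y, k):.3f}")
```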
BCC Model
Subsequently, Banker, Charnes, and Cooper extended the CCR-DEA model by assuming
variable returns to scale (VRS). The BCC model distinguishes itself from the CCR model
by adding the constraint $\sum_{j=1}^{n} \lambda_j = 1$.
The CCR model assumes constant returns to scale (CRS), which means that an increase
in input by a certain percentage will proportionally increase output by the same
percentage. On the other hand, the BCC model assumes variable returns to scale (VRS),
which means that an increase in input does not necessarily result in a proportional
increase in output. DMU-$j$ is considered efficient if and only if $\phi^* = 1$.
The classical DEA approach to determining DMU efficiency cannot rank the efficient
units; this is the main drawback of the model. Another weakness is that when the
number of input and output variables is large relative to the number of DMUs, the
sensitivity of the classical DEA model is reduced. To overcome these weaknesses,
researchers extended the basic DEA models to increase their discriminatory power.
Some of the advanced models are as follows [7].
Slacks-Based Measure (SBM)
The slacks-based measure is a model developed from the classic model. Basically,
the classical model assumes that changes in input and output occur proportionally.
However, in a system, the addition of input does not always increase the output
proportionally [15]. The slacks-based measure emphasizes the proportional reduction
of input excess (input slacks) or the proportional addition of output deficiencies
(output slacks) from the previous model which was still inefficient because it had not
yet reached the frontier line. The slack value provides important information about
the efforts that need to be made by an inefficient DMU to achieve efficient status. A
nonzero input slack indicates an excess of input, and a nonzero slack output indicates
a lack of output [16]. Output-oriented SBM based on CCR model can be shown as
follows:
s
1 1 sr+
= max 1 +
φ s r =1 yr k
Subject to
n
λ j xi j + si− = xio
j=1
n
(4)
λ j yr j − sr+ = φyr 0
j=1
λ j , si− , sr+ ≥ 0
Super-Efficiency Model
The DEA super-efficiency model was introduced by Andersen and Petersen. This model
performs very well; efficient DMUs are separated from inefficient DMUs and can be
ranked [7]. The DEA super-efficiency model has been widely used in fields such as
measuring the efficiency of an area, ranking efficient DMUs, and detecting highly
efficient DMUs. In the output-oriented case, the efficiency score of the SBM model is
made to reflect efficiency by using the maximum possible output, with the objective
function subject to:

$$\sum_{j=1}^{n} x_{ij}\lambda_j + s_i^{-} = x_{io}, \qquad \sum_{j=1}^{n} y_{rj}\lambda_j - s_r^{+} = \varphi\, y_{r0} \qquad (5)$$
Table 1 Schemes of efficiency measurement and the combinations of input and output variables

Input variables (all schemes):
- Number of lecturers
- Number of lecturers with doctoral degrees
- Number of staff
- Number of students

Output variables:
- Scheme 1 (Teaching and learning): number of graduates; number of students' national championships/total students
- Scheme 2 (Research and development): total publications in Scopus-indexed international journals or equivalent/total lecturers; cumulative international co-authorship publications/total lecturers
- Scheme 3 (Overall performance): number of graduates; number of students' national championships/total students; total publications in Scopus-indexed international journals or equivalent/total lecturers; cumulative international co-authorship publications/total lecturers; overall performance achievement
3 Research Design
The data used in this research is secondary data obtained from the unit program
management and control of ITS. The data needed is departmental data based on the
variables defined in this study. There are three schemes of efficiency measurement
that will be carried out with different combinations of input and output variables
which are shown in Table 1.
Table 2 displays the statistical analysis of the input and output variables.
Table 2 shows that the Electrical Engineering department has the highest total
number of lecturers (57) and the highest number of doctoral lecturers (43). In contrast,
the Actuarial department has the lowest number of lecturers (9) and the lowest number
of doctoral lecturers (1). The Civil Engineering department has the highest number
of non-academic staff (26), while the Actuarial department has the least (3). The
Actuarial department’s lower staff numbers may be attributed to its status as a new
department. The Electrical Engineering department also has the highest number of
students (1384), while the Information Technology department has the lowest (282).
In terms of graduate numbers, Electrical Engineering leads with 1417 graduates,
while Development Studies has the fewest at 1 due to its recent establishment; its single
graduate completed the program in 3.5 years. Instrumentation Engineering has the
highest national championship ratio at 12.73%, while Geomatics Engineering trails with
0.62%. Chemical Engineering excels in Scopus-indexed international publications per
lecturer at 3.08%, whereas Water Transportation lags at 0.53%. Materials and
Metallurgical Engineering leads in international co-authorship publications at 12.64%,
contrasting with Actuarial at only 0.11%. Biology excels with a performance
achievement of 124%, while Electrical Engineering and Automation trails at 81%.
Overall, the average performance achievement at ITS is 106% with a variance of
0.7%, signaling departments generally surpassing targets.
were employed. The efficiency scores based on teaching and learning performance
are detailed in Table 3.
Table 3 shows the teaching and learning efficiency scores for each department
at ITS where efficient departments’ scores are written in bold. The CCR and BCC
models produce different efficiency scores where the CCR model produces three
efficient departments, namely the Physics Engineering department, the Product
Design department, and the Instrumentation Engineering department. It is note-
worthy that some departments, despite achieving relatively low CCR scores, have
BCC scores that equal one. The BCC model produces 15 efficient departments
such as the Actuarial Science department, Physics Engineering department, Systems
and Industrial Engineering department, Civil Engineering department, Geophysics,
Marine Engineering, Marine Transportation department, Electrical Engineering,
Biomedical Engineering department, Information technology department, Product
Design department, Development Studies department, Civil Infrastructure Engi-
neering department, Industrial Chemical Engineering department, and Instrumenta-
tion Engineering department, where departments that are efficient using CCR model
measurements are also efficient when measured using the BCC model.
Efficiency scores below 1 indicate that departments are inefficient, with lower scores
reflecting less efficiency. For instance, using the CCR model in Table 3, Physics
has an efficiency score of 0.752, while Mathematics scores 0.607. This suggests
Mathematics needs a greater output increase to achieve efficiency. The additional
improvement required for an inefficient department, according to the CCR model,
is calculated using the formula 1/φ − 1, where φ is the efficiency score. Using the
example, Physics needs a 13.5% increase (1/0.752 − 1), and Mathematics requires a
64.7% increase (1/0.607 − 1). On the other hand, the SBM determines not only the
needed output increase but also the input decrease, by subtracting the efficiency score
from 1. In Table 3, the teaching and learning efficiency score for Physics using SBM
is 0.477; the simultaneous optimization of output and input is calculated as 1 − 0.477,
giving 52.3%.
In Table 3, some efficient departments also serve as references for inefficient
departments, with the number of efficient departments indicated in superscript. For
instance, according to the BCC model, the Actuarial Science department can be a
reference for two inefficient departments. However, this does not apply to all effi-
cient departments. For example, although Biomedical Engineering is efficient, it
is not utilized as a reference for inefficient departments as the efficiency score is
written without superscript. Among these 15 efficient departments using the BCC
model, 12 departments comprise the peer group. However, two departments, namely
Development Studies and Civil Engineering, do not appear in the peer group for any
inefficient departments.
Advanced DEA models, specifically the super-efficiency model derived from the
CCR model, consistently yield identical efficiency scores. In contrast, the SBM, also
based on the CCR model, provides unique efficiency scores. The super-efficiency
model distinguishes departments with scores exceeding 1 and facilitates rankings.
In this evaluation, the Department of Instrumentation Engineering stands out as the
most super-efficient, followed by Product Design and Physics Engineering. Notably,
Table 3 (continued)

No. | DMU (department)            | CCR   | BCC   | SBM   | Super
37  | Instrumentation engineering | 1(14) | 1(6)  | 1(9)  | 7.075(15)
38  | Business statistics         | 0.735 | 0.757 | 0.513 | 0.735

Note: In the CCR, BCC, and SBM models, an efficient department is assigned an efficiency score of 1. Conversely, in the super-efficiency model, an efficient department is characterized by a score exceeding 1; a higher score beyond 1 indicates that the department is more efficient than other efficient departments. The superscript denotes the frequency of appearance of each efficient department in the reference set.
previously inefficient departments maintain their scores and rankings in the super-
efficiency assessment. The Product Design and Physics Department have the highest
peer counts across all models, indicating their technical efficiency. Greater peer
counts for a department indicate the degree of its robustness when compared to
other efficient departments [16]. Identifying top-performing departments can guide
performance improvements and assist university managers in setting effective targets.
The efficiency scores for research and development at the ITS departments, using four
input and two output variables (Table 1), are presented in Table 4.
Table 4 displays the research and development efficiency scores for each depart-
ment at ITS, with efficient departments highlighted in bold. Notably, the CCR and
BCC models yield different efficiency scores. The CCR model identifies three effi-
cient departments: Biomedical Engineering, Information Technology, and Instru-
mentation Engineering. In contrast, the BCC model identifies nine efficient depart-
ments, including Chemistry, Biology, Actuarial, Chemical Engineering, Materials
and Metallurgical Engineering, Biomedical, Information Technology, Development
Studies, and Instrumentation Engineering. Interestingly, departments deemed effi-
cient by the CCR model also maintain efficiency when measured using the BCC
model. By the result of this analysis, it is observed that the Product Design and
Physic Engineering which appeared to be efficient in the initial scheme are not
performing well for research and development scheme. Furthermore, the department
of Development Studies is not treated as good practice for this model as it does not
appear in the peer group for any inefficient departments.
According to super-efficiency model with efficiency score greater than 1, the
top three efficient departments, in order, are Instrumentation Engineering, Biomed-
ical Engineering, and Information Technology which means that these departments
perform well in the field concerning research. Conversely, departments with effi-
ciency scores below 1 are deemed inefficient. Moreover, according to all models, the
Information Technology and Biomedical Engineering Department have the highest
Table 4 (continued)

No. | DMU (department)            | CCR   | BCC   | SBM   | Super
37  | Instrumentation engineering | 1(7)  | 1(13) | 1(8)  | 3.202(8)
38  | Business statistics         | 0.445 | 0.514 | 0.110 | 0.445

Note: In the CCR, BCC, and SBM models, an efficient department is assigned an efficiency score of 1. Conversely, in the super-efficiency model, an efficient department is characterized by a score exceeding 1; a higher score beyond 1 indicates that the department is more efficient than other efficient departments. The superscript denotes the frequency of appearance of each efficient department in the reference set.
peer counts. Therefore, they are deemed the most purely technically efficient depart-
ments for research although they only secure third and second place in the super-
efficiency model. The interpretation for the percentage of output shortage to achieve
efficiency for inefficient departments remains consistent with the initial scheme.
Table 5 (continued)

No. | DMU (department)            | CCR   | BCC   | SBM   | Super
37  | Instrumentation engineering | 1(5)  | 1(6)  | 1(8)  | 7.075(9)
38  | Business statistics         | 0.866 | 0.914 | 0.342 | 0.866

Note: In the CCR, BCC, and SBM models, an efficient department is assigned an efficiency score of 1. Conversely, in the super-efficiency model, an efficient department is characterized by a score exceeding 1; a higher score beyond 1 indicates that the department is more efficient than other efficient departments. The superscript denotes the frequency of appearance of each efficient department in the reference set.
The average efficiency scores for the teaching and learning, research and development,
and overall schemes are shown in Table 6.
Table 6 reveals that the average efficiency score for overall assessment at the
ITS Department is significantly higher than the other two schemes. Additionally,
the research and development aspect in most ITS departments needs improvement,
with average scores falling below 0.5 for three out of four models. Furthermore,
Fig. 1 illustrates the sensitivity analysis for each department. The sensitivity for each
department is described by a super-efficiency score for each scheme.
Figure 1 depicts that the overall efficiency score (purple line) and the teaching
and learning efficiency score (orange line) are relatively higher than the research
and development efficiency score (blue line), except for Biomedical Engineering,
Computer Engineering, Development Study, Materials and Metallurgical Engi-
neering, Information Technology, Actuarial, and Geophysical Engineering. This
figure clearly highlights the need for improvement in research and development.
5 Conclusion
This study assesses the performance of academic departments at ITS Surabaya using
various DEA models, aiming to identify strengths in teaching and learning, research
and development, and overall performance. The findings reveal that over 50% of
departments at Sepuluh Nopember Institute of Technology are not yet efficient,
emphasizing the need for enhancements and resource optimization. The overall
performance assessment indicates that only 12 out of 38 departments exhibit satis-
factory efficiency. Many departments require improvements across different activi-
ties, particularly in research work. A significant observation is the underutilization
of academic staff in the Electrical Engineering department for research, develop-
ment, and teaching activities, despite a high number of lecturers and doctoral degree
holders.
The consistently efficient performance of the Instrumentation Engineering Depart-
ment across all measurement schemes and models positions it as a benchmark for
resource management and optimal outputs. Other departments, policymakers, and
management systems can learn from its efficiency. Besides Instrumentation Engi-
neering, peer analysis highlights the Department of Product Design and Physics
Engineering as references for inefficient departments in teaching and learning prac-
tices. Similarly, the Departments of Information Technology and Biomedical Engi-
neering can serve as examples for inefficient departments aiming to improve their
performance through enhanced research practices such as workshops and training
programs.
Mira Meilisa, Bambang Widjanarko Otok, and Jerry Dwi Trijoyo Purnomo
1 Introduction
In the first thousand days of life, a person’s life process begins [1]. It is a crucial time
that affects growth from childhood to adulthood [2]. One growth issue that remains
a challenge for children around the world is stunting. Stunting is a nutritional status
condition characterized by a young individual’s length or height being dispropor-
tionately short in comparison with their age. The measurement follows the WHO child
growth standards, expressed as a deviation in standard deviations from the median [3].
Stunting, a malnutrition condition linked to previous nutrient deficiencies, is classified
as a chronic nutritional issue. Growth and development into maturity will be impeded
if the child's nutritional needs are not met; conversely, when those needs are met and
fully utilized, growth and development proceed without hindrance [4, 5]. The World
Health Organization (WHO) estimated a global stunting prevalence of 22%, or 149.2
million children, in 2020.
Indonesia is the fifth country with the highest stunting rate in the world and second
in Southeast Asia. The prevalence of stunting in Indonesia in 2018 was 30.8%, which
decreased to 22.67% in 2019, 26.92% in 2020, and 24.4% in 2021 [6]. Because stunting
prevalence in Indonesia is still above the WHO threshold of 20%, stunting is a public
health problem that requires further attention. The Indonesian government's national
strategy for stunting prevention includes a commitment to address stunting [7].
Identifying the causes of stunting is one of the efforts to reduce its prevalence.
Examining the contributing factors is one way to lower the incidence of stunting. This
requires a mathematical model that establishes the association between the response
variable and the predictor variables, thereby elucidating the underlying patterns of their
relationship. Such relationships can be characterized by employing regression analysis.
Parametric, semiparametric, and nonparametric methods can be used in regression
analysis. If the form of the regression curve is known, the parametric method can be
utilized; otherwise, the nonparametric method is suggested. Since nonparametric regression
may identify regression curve patterns of unknown shape, it provides a great deal of
flexibility when estimating regression curves. One of the nonparametric regression
models developed by Friedman is the multivariate adaptive regression splines [8]. The
MARS model is an advanced truncated spline and recursive partitioning regression
(RPR) combination. The MARS model can take into consideration interactions and
additive effects between predictor variables. Additionally, categorical and continuous
response data can be utilized with the MARS model. MARS can also process data
with large samples and high dimensionality. MARS selects the model whose knots
give the smallest generalized cross-validation (GCV) value [8].
Considerable academic research has explored extensions and applications of the MARS
model: Poisson regression has been used to modify MARS [9], and bootstrap aggregating
multivariate adaptive regression splines has been applied in an observational study of
diabetes cases [10]. A stepwise approach was used to estimate parameters for tuberculosis
(TBC) cases in Lamongan [11]. Additionally, the causal factors of hemorrhagic fever were
identified in [11, 12], and [13] considered both continuous and categorical responses.
The process of ANOVA decomposition and
importance variable analysis in the context of multivariate adaptive regression splines
is discussed in [14]. The research focuses on the estimation of hourly global solar
radiation in Hong Kong using the MARS technique [15]. Parameter estimation of
spatial error model multivariate adaptive generalized Poisson regression spline [16].
A hybrid model of spatial autoregressive-multivariate adaptive generalized Poisson
regression spline [17]. This research will use the multivariate adaptive regression
splines (MARS) model combined with the fuzzy clustering means (FCM) approach
[18], resulting in a model called multivariate adaptive fuzzy clustering means regres-
sion splines (MAFCMRS) due to patterns in data that were previously unknown.
Furthermore, the heterogeneity of the process is accounted for in the model’s forma-
tion, which continues within the coil. With a degree of membership, data is grouped
into a data center cluster, with each cluster examining the influencing factors. One of
the five provinces in Indonesia with the greatest prevalence rate is Southeast Sulawesi
[6]. As a result, this study will examine the determinants of stunting in the province
of Southeast Sulawesi through the lens of the incident. The selection of the best
cluster uses the validity of the Xie-Beni index [19].
The MARS model is a nonparametric regression model that can be used to relate
response and predictor variables when the data pattern is unknown. MARS is a
combination of truncated splines and recursive partitioning regression [8]. The
MARS model can be written as:
$$
\begin{aligned}
y_i &= f\left(x_{1i}, x_{2i}, \ldots, x_{pi}\right) + \varepsilon_i; \quad i = 1, 2, \ldots, n \\
    &= \alpha_0 + \sum_{m=1}^{M} \alpha_m \prod_{k=1}^{K_m} \left[ s_{km}\left( x_{v(k,m)i} - t_{km} \right) \right]_{+} + \varepsilon_i \qquad (1) \\
    &= \alpha_0 + \sum_{m=1}^{M} \alpha_m B_{mi}(\mathbf{x}, \mathbf{t}) + \varepsilon_i
\end{aligned}
$$
where
$y_i$: response variable
$\alpha_0$: constant basis function parameter
$\alpha_m$: parameter of the $m$-th non-constant basis function
$M$: number of non-constant basis functions
$K_m$: maximum interaction at the $m$-th basis function
$s_{km}$: basis function sign, $+1$ if the knot lies on the right or $-1$ if the knot lies on the left
$x_{v(k,m)i}$: $v$-th predictor variable
$t_{km}$: knot value at the $k$-th interaction and $m$-th basis function
$B_{mi}(\mathbf{x}, \mathbf{t})$: $m$-th basis function
The MARS model is formed by first determining the knots and basis functions of
each predictor variable by plotting each predictor against the response variable. An
optimal number of knots yields a good MARS model, after which the forward and
backward stepwise passes of the modified recursive partitioning algorithm can be
carried out, with the number of knots adjusted to the behavior of the data. The
combination of basis functions with the smallest GCV value constitutes the best
model.
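For illustration, the hinge basis functions in Eq. (1) and the GCV criterion can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation; the effective-parameter count passed to `gcv` is left to the caller, since MARS implementations penalize knots differently.

```python
import numpy as np

def hinge(x, t, sign=1):
    # Truncated (hinge) basis function [s(x - t)]_+ from Eq. (1)
    return np.maximum(sign * (x - t), 0.0)

def gcv(y, y_hat, effective_params):
    # Generalized cross-validation: mean squared error inflated by the
    # effective number of parameters C(M) of the fitted MARS model
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    return np.mean((y - y_hat) ** 2) / (1.0 - effective_params / n) ** 2
```

Candidate models produced by the forward and backward stepwise passes are then compared through their GCV values, and the model with the smallest value is retained.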
FCM is one of the most widely applied clustering methods and can be modified in its application
[20]. FCM was first proposed by Jim Bezdek [21]. It is a technique for clustering
data into clusters with certain membership degrees [22]. Clustering with FCM
produces a list of cluster centers and a membership value of each observation in each
cluster; this information is used to define the membership functions that assign fuzzy
membership values. FCM is an algorithm for clustering a dataset
$X = \{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^p$ into $C$ groups based on the objective function:
$$J(\mathbf{U}, \mathbf{V}) = \sum_{c=1}^{C} \sum_{i=1}^{N} (u_{ci})^{m} \left\| x_i - v_c \right\|^2 \qquad (2)$$
where $x_i$ is the vector of the $i$-th observation ($i = 1, 2, \ldots, N$), $v_c$ is the $c$-th cluster
center, $m$ is the fuzziness exponent, and $u_{ci}$ is the membership value of the $i$-th
observation in the $c$-th cluster, initialized randomly. The membership values of an
object satisfy the condition:

$$\sum_{c=1}^{C} u_{ci} = 1, \quad i = 1, 2, \ldots, N; \; u_{ci} \in [0, 1] \qquad (3)$$
The objective function in Eq. (2) is minimized subject to the condition in Eq. (3). The
cluster centers and memberships are updated iteratively so that the clusters move to
the right locations with the right values:
$$u_{ci} = \left[ \sum_{c^{*}=1}^{C} \left( \frac{\left\| x_i - v_c \right\|^2}{\left\| x_i - v_{c^{*}} \right\|^2} \right)^{\frac{1}{m-1}} \right]^{-1} \qquad (4)$$
The iteration process will be performed until the specified threshold value is
obtained.
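The alternating update of cluster centers and memberships described above can be written compactly in NumPy. The following is a minimal sketch of Eqs. (2)–(4), not the authors' implementation; the initialization and stopping rule are illustrative assumptions.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, max_iter=100, tol=1e-5, seed=0):
    # X: (N, p) data matrix; returns membership matrix U (C, N) and centers V (C, p)
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((n_clusters, N))
    U /= U.sum(axis=0)                                   # columns sum to 1, as in Eq. (3)
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)     # weighted cluster centers
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2) + 1e-12
        a = d2 ** (1.0 / (m - 1.0))
        U_new = 1.0 / (a * (1.0 / a).sum(axis=0, keepdims=True))  # membership update, Eq. (4)
        if np.abs(U_new - U).max() < tol:                # stop once memberships stabilize
            return U_new, V
        U = U_new
    return U, V
```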
This research uses secondary data on stunting prevalence from the Southeast Sulawesi
Provincial Health Office in 2020. The observation units are the 222 sub-districts
within the districts/cities of the province, with the variables listed in Table 1.
Indonesia has the third-highest stunting prevalence in the Southeast Asia region, and
Southeast Sulawesi is among the provinces with a high prevalence, 30.2% in 2020 [6].
Figure 1 shows a scatter plot of each variable used. The figure clearly shows a
nonlinear relationship between the response variable and the predictor variables: the
points are scattered without forming a specific pattern, so one suitable alternative
model is multivariate adaptive fuzzy clustering means regression splines.
Clusters are built using the FCM algorithm. Several scenarios for the number of
clusters are formed and then assessed using the Xie-Beni validity index, with the
results shown in Table 2.
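For reference, the Xie-Beni index divides the compactness term of Eq. (2) by N times the smallest squared separation between cluster centers, so smaller values indicate a better partition. A possible sketch, reusing the membership matrix U and centers V returned by an FCM run, is given below (names are illustrative):

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    # compactness: weighted within-cluster scatter, as in Eq. (2)
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    compactness = ((U ** m) * d2).sum()
    # separation: smallest squared distance between two distinct centers
    C = V.shape[0]
    sep = min(((V[a] - V[b]) ** 2).sum()
              for a in range(C) for b in range(C) if a != b)
    return compactness / (X.shape[0] * sep)
```

The number of clusters giving the smallest index value is retained.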
The smallest Xie-Beni validity value is obtained for the five-cluster scenario, so the
MAFCMRS model is then formed with five clusters. Table 3 summarizes the best
model obtained for each cluster.
Table 3 shows a different number of observations n in each cluster. In terms of
R-squared, every cluster exceeds 70%, and the largest share of observations falls in
cluster 4. The MAFCMRS parameter estimates for cluster 1 in Table 3 are obtained
from the combination BF = 12, MI = 2, and MO = 1, giving the smallest GCV value
of 3.58475 and an R-squared of 84.59%.
Table 3 Summary of the results of the best model building experiments for MAFCMRS

Model | n | GCV | Combination (BF, MI, MO) | R-squared
Cluster 1 | 53 | 3.58475 | 12, 2, 1 | 0.84596
Cluster 2 | 15 | 0.35607 | 12, 3, 1 | 0.99239
Cluster 3 | 44 | 2.89847 | 12, 3, 1 | 0.73666
Cluster 4 | 63 | 7.16807 | 12, 2, 1 | 0.80705
Cluster 5 | 47 | 6.79629 | 8, 2, 1 | 0.85474

Description: generalized cross-validation (GCV), maximum observation (MO), minimum interaction (MI), and basis functions (BF)
The basis function coefficients of the best MAFCMRS model for cluster 1 are
reported in Table 4. All four predictor variables are significant. The variable with an
importance level exceeding 86% is the number of health centers, which indicates
that this variable has an important effect on the prediction of stunting prevalence.
Table 4 Basis function coefficient test for stunting prevalence in Southeast Sulawesi province in cluster 1

Coefficient | Estimate | Std. error | Pr(>|t|)
(Intercept) | 7.40E+00 | 4.53E-16 | <2E-16*
h(x3 - 181) | -2.23E-02 | 1.56E-18 | <2E-16*
h(181 - x3) | 5.57E-02 | 5.90E-18 | <2E-16*
h(x1 - 20) | -2.08E-02 | 8.54E-18 | <2E-16*
h(20 - x1) | -4.42E-01 | 3.44E-17 | <2E-16*
h(x4 - 12) | 7.29E-01 | 5.32E-17 | <2E-16*
h(12 - x4) | 3.62E-01 | 4.56E-17 | <2E-16*
h(181 - x3) * h(x4 - 10) | -4.76E-03 | 4.95E-19 | <2E-16*
h(181 - x3) * h(10 - x4) | -5.17E-05 | 9.00E-19 | <2E-16*
h(x2 - 9) | 1.00E-01 | 3.24E-17 | <2E-16*
h(9 - x2) | -3.11E-01 | 6.16E-17 | <2E-16*

Note: * Significant at level 0.05
The MAFCMRS parameter estimates for cluster 2 in Table 3 are obtained from the
combination BF = 12, MI = 3, and MO = 1, with the smallest GCV value of 0.35607
and an R-squared of 99.24%. The basis function coefficients for cluster 2 are shown
in Table 5. All four predictor variables are significant; the variable with an importance
level exceeding 85% is the number of neonate visits, which indicates that this variable
has an important effect on the prediction of stunting prevalence.
Table 5 Basis function coefficient test for stunting prevalence in Southeast Sulawesi province in cluster 2

Coefficient | Estimate | Std. error | Pr(>|t|)
(Intercept) | 3.74E+01 | 8.10E-15 | <2E-16*
h(x1 - 17) | -3.65E-02 | 2.18E-16 | <2E-16*
h(17 - x1) | -4.43E-01 | 6.84E-16 | <2E-16*
h(x3 - 143) | -9.11E-03 | 4.95E-17 | <2E-16*
h(143 - x3) | 1.49E-01 | 9.05E-17 | <2E-16*
h(x2 - 3) | 1.17E+00 | 1.98E-15 | <2E-16*
h(3 - x2) | -1.09E+00 | 3.44E-15 | <2E-16*
x2 * h(143 - x3) | -7.80E-01 | 9.02E-16 | <2E-16*
x2 * h(143 - x3) * x4 | 5.82E-02 | 6.94E-17 | <2E-16*

Note: * Significant at level 0.05
Table 6 Basis function coefficient test for stunting prevalence in Southeast Sulawesi province in cluster 3

Coefficient | Estimate | Std. error | Pr(>|t|)
(Intercept) | 2.58E+00 | 6.32E-16 | <2E-16*
h(x3 - 84) | -2.08E-02 | 5.05E-18 | <2E-16*
h(84 - x3) | -2.16E-02 | 2.40E-17 | <2E-16*
x1 * h(84 - x3) | -1.41E-02 | 6.94E-18 | <2E-16*
h(x4 - 9) | -9.30E-02 | 1.39E-16 | <2E-16*
h(9 - x4) | 4.23E-01 | 8.54E-17 | <2E-16*
h(x3 - 84) * x4 | 9.47E-04 | 3.22E-19 | <2E-16*
x1 * h(84 - x3) * x4 | 1.56E-03 | 6.10E-19 | <2E-16*

Note: * Significant at level 0.05
The MAFCMRS parameter estimates for cluster 3 in Table 3 are obtained from the
combination BF = 12, MI = 3, and MO = 1, with the smallest GCV value of 2.89847
and an R-squared of 73.67%. The basis function coefficients of the best model for
cluster 3 are presented in Table 6. All four predictor variables are significant; the
variable with an importance level exceeding 89% is the number of babies with low
birth weight, which indicates that this variable has an important effect on the
prediction of stunting prevalence.
The MAFCMRS parameter estimates for cluster 4 in Table 3 are obtained from the
combination BF = 12, MI = 2, and MO = 1, with the smallest GCV value of 7.16807
and an R-squared of 80.7%.
Table 7 Basis function coefficient test for stunting prevalence in Southeast Sulawesi province in cluster 4

Coefficient | Estimate | Std. error | Pr(>|t|)
(Intercept) | 1.81E+01 | 4.51E-15 | <2E-16*
h(x3 - 103) | -2.35E-02 | 1.57E-17 | <2E-16*
h(103 - x3) | 1.98E-01 | 1.38E-16 | <2E-16*
h(x1 - 21) | -1.99E-02 | 5.85E-17 | <2E-16*
h(21 - x1) | -5.66E-01 | 3.34E-16 | <2E-16*
h(103 - x3) * x4 | -3.74E-03 | 9.81E-18 | <2E-16*
h(x4 - 5) | -1.88E-01 | 4.19E-16 | <2E-16*
h(5 - x4) | 6.08E-01 | 9.13E-16 | <2E-16*
x2 * h(x4 - 5) | 2.38E-02 | 2.19E-17 | <2E-16*

Note: * Significant at level 0.05
The basis function coefficients of the MAFCMRS model for cluster 4 are reported
in Table 7. All four predictor variables are significant; the variable with an importance
level exceeding 90% is the number of pregnant women with chronic energy
deficiency (CED), which indicates that this variable has an important effect on the
prediction of stunting prevalence.
The MAFCMRS parameter estimates for cluster 5 in Table 3 are obtained from the
combination BF = 8, MI = 2, and MO = 1, with a GCV value of 6.79629 and an
R-squared of 85.47%. The basis function coefficients of the MAFCMRS model for
cluster 5 are displayed in Table 8.
The interpretation of the basis functions in cluster 5 is as follows:
1. $Bf_1 = h(X_3 - 84)$
The coefficient is significant when the neonate visit ratio exceeds 84. Each additional
unit decreases stunting prevalence by 0.0285, holding the other basis functions
constant.
2. $Bf_6 = h(11 - X_4)$
The coefficient is significant when the number of health centers is less than 11. Each
additional unit increases stunting prevalence by 0.0597, holding the other basis
functions constant.
3. $Bf_8 = X_1 \cdot h(84 - X_3)$
The coefficient is significant for the interaction between the number of pregnant
women with chronic energy deficiency and a neonate visit ratio below 84. Each
additional unit increases the prevalence of stunting by 0.001083, holding the other
basis functions constant.
Table 8 shows that all four predictor variables employed have significant coefficients.
The variable with an importance level over 80% is the number of integrated health
posts ($X_4$). This indicates that the variable has a significant impact on the prediction
of stunting prevalence.
Table 8 Basis function coefficient test for stunting prevalence in Southeast Sulawesi province in cluster 5

Coefficient of basis function | Estimate | Std. error | p-value
(Intercept) | 2.62E+04 | 1.58E-12 | <2E-16*
h(x3 - 84) | -2.09E+01 | 5.94E-15 | <2E-16*
h(84 - x3) | 2.19E+02 | 5.70E-14 | <2E-16*
h(x1 - 15) | 7.18E+01 | 5.02E-14 | <2E-16*
h(15 - x1) | -8.84E+02 | 2.10E-13 | <2E-16*
h(x4 - 11) | 1.68E+02 | 1.27E-13 | <2E-16*
h(11 - x4) | 5.97E+02 | 2.78E-13 | <2E-16*
x1 * h(11 - x4) | -1.17E+01 | 6.14E-15 | <2E-16*
x1 * h(84 - x3) | -1.08E+00 | 9.51E-16 | <2E-16*

Note: * Significant at level 0.05
Figure 2 maps the clusters formed, selected on the basis of the Xie-Beni validity
index values for the five-cluster solution. The comparison between the actual values
and the values predicted by the MAFCMRS model for each cluster is shown in
Fig. 3, which indicates no substantial difference between the actual and predicted
values of the MAFCMRS model.
Fig. 3 Comparison of the actual response value with the predicted response value of each cluster (a cluster 1, b cluster 2, c cluster 3, d cluster 4, e cluster 5)
7 Conclusion
The MAFCMRS model is a nonparametric regression model that combines the
multivariate adaptive regression splines model with fuzzy clustering means. Its
formation accounts for heterogeneity in the data while remaining continuous at the
knots. The cluster validity analysis showed that five clusters gave the best result
according to the Xie-Beni validity index. The best MAFCMRS model is formed by
a combination of BF, MI, and MO that differs across clusters. The coefficient tests
of the MAFCMRS model revealed that the four predictor variables involved, namely
the number of pregnant women with chronic energy deficiency (CED), the number
of babies with low birth weight, the number of neonate visits, and the number of
health centers, have a significant effect on the prevalence of stunting in Southeast
Sulawesi province at the 5% significance level.
References
1. Yuningsih Y (2022) Hubungan status Gizi dengan stunting pada Balita Oksitosin. J Ilm
Kebidanan 9(2):102–109. https://doi.org/10.35316/oksitosin.v9i2.1845
2. Menkokesra (2013) Kerangka Kebijakan Gerakan Nasional Percepatan Perbaikan Gizi Dalam
Rangka Seribu Hari Pertama Kehidupan. Republik Indonesia
3. RI MK (2020) Peraturan Mentri Kesehatan Republik Indonesia Nomor 2 Tahun 2020
4. Mugianti S, Mulyadi A, Anam AK, Najah ZL (2018) Faktor Penyebab Anak Stunting Usia
25–60 Bulan di Kecamatan Sukorejo Kota Blitar. J Ners dan Kebidanan 5(3):268–278. https://
doi.org/10.26699/jnk.v5i3.art.p268-278
5. Ernawati A (2020) Gambaran Penyebab Balita Stunting di Desa Lokus Stunting Kabupaten
Pati. J Litbang Media Inf Penelitian, Pengemb dan IPTEK 16(2):77–94. https://doi.org/10.
33658/jl.v16i2.194
6. Kemenkes RI (2021) Hasil Studi Status Gizi Indonesia (SSGI) Tingkat Nasional, Provinsi dan
Kabupaten/Kota
7. Kementerian PPN/Bappenas (2019) Rencana Pembangunan Jangka Menengah Nasional 2020–
2024
8. Friedman JH (1991) Multivariate adaptive regression splines. Ann Stat 19(1):1–141
9. Prihastuti Yasmirullah SD, Otok BW, Trijoyo Purnomo JD, Prastyo DD (2021) Modification
of multivariate adaptive regression spline (MARS). J Phys Conf Ser 1863(1):012078. https://
doi.org/10.1088/1742-6596/1863/1/012078
10. Otok BW, Putra RY, Sutikno Y, Yasmirullah SDP (2020) Bootstrap aggregating multi-
variate adaptive regression spline for observational studies in diabetes cases. Syst Rev Pharm
11(8):406–413. https://doi.org/10.31838/srp.2020.8.59
11. Yasmirullah SDP, Otok BW, Purnomo JDT, Prastyo DD (2021) Parameter estimation of multi-
variate adaptive regression spline (MARS) with stepwise approach to multi drug-resistant
tuberculosis (MDR-TB) modeling in lamongan regency. J Phys Conf Ser 1752(1). https://doi.
org/10.1088/1742-6596/1752/1/012017
12. Sriningsih R, Otok BW, Sutikno S (2023) Determination of the best multivariate adaptive
geographically weighted generalized poisson regression splines model employing generalized
cross-validation in dengue fever cases. Methods 10. https://doi.org/10.1016/j.mex.2023.102174
13. Sriningsih R, Otok BW, Sutikno W (2021) Factors affecting the number of dengue fever cases
in West Sumatra province using the multivariate adaptive regression splines (MARS) approach.
J Phys Conf Ser 1722(1). https://doi.org/10.1088/1742-6596/1722/1/012094
14. Otok BW, Rumiati AT, Ampulembang AP, Al Azies H (2023) Anova decomposition and impor-
tance variable proccess in multivariate adaptive regression splines. IJASEIT 13(3). https://doi.
org/10.18517/ijaseit
15. Li DHW, Chen W, Li S, Lou S (2019) Estimation of hourly global solar radiation using multi-
variate adaptive regression spline (MARS)—a case study of Hong Kong. Energy 186. https://
doi.org/10.1016/j.energy.2019.115857
16. Yasmirullah SDP, Otok BW, Purnomo JDT, Prastyo DD (2023) Parameter estimation of
spatial error model—multivariate adaptive generalized poisson regression spline. Eng Lett
31(3):1265–1272. https://doi.org/10.5267/j.dsl.2023.7.004
17. Yasmirullah SDP, Otok BW, Trijoyo Purnomo JD, Prastyo DD (2023) A hybrid model of
spatial autoregressive-multivariate adaptive generalized Poisson regression spline. Decis Sci
Lett 12:721–728. https://doi.org/10.5267/dsl.2023.7.004
18. De Andrés J, Lorca P, De Cos Juez FJ, Sánchez-Lasheras F (2011) Bankruptcy forecasting: a
hybrid approach using fuzzy c-means clustering and multivariate adaptive regression splines
(MARS). Expert Syst Appl 38(3):1866–1875. https://doi.org/10.1016/j.eswa.2010.07.117
19. Muranishi M, Honda K, Notsu A (2014) Xie-beni-type fuzzy cluster validation in fuzzy co-
clustering of documents and keywords. Adv Intell Syst Comput 270. https://doi.org/10.1007/
978-3-319-05515-2
20. Miyamoto S, Ichihashi H, Honda K (2008) Algorithms for fuzzy clustering. Springer-verlag
Berlin Heidelberg
21. Bezdek JC, Ehrlich R, Full W (1984) FCM: The fuzzy c-means clustering algorithm. Comput
Geosci 10(2–3):191–203. https://doi.org/10.1016/0098-3004(84)90020-7
22. Zarinbal M, Fazel Zarandi MH, Turksen IB (2014) Relative entropy fuzzy c-means clustering.
Inf Sci (Ny) 260:74–97. https://doi.org/10.1016/j.ins.2013.11.004
Statistical Inferences for Multivariate
Generalized Gamma Regression Model
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 463
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_33
1 Introduction
The linear regression model is a popular tool for studying the relationship between a response
variable and one or more predictor variables. However, it relies on a strong
normality assumption that is often difficult to satisfy. To handle this issue, several
regression models have been proposed for asymmetric continuous distributions,
including gamma, exponential, log-normal, Weibull, and Rayleigh regression [1–9].
Each of these regression models, however, has been developed separately.
In this study, we explore generalized gamma regression to model response variables
with positive skew. The generalized gamma distribution is advantageous because it
contains many other distributions as special cases, such as the gamma, Weibull,
exponential, chi-squared, Erlang, Rayleigh, and half-normal distributions [10]; our
study therefore also applies to regression models based on these continuous distributions.
Although the generalized gamma distribution was introduced in the 1960s [11,
12], further development has relied on re-parametrization to enhance flexi-
bility [10, 13]. In particular, Yasin et al. [14] propose a univariate generalized gamma
regression by modeling its mean as a function of predictor variables. In this study,
we extend the work of Yasin et al. [14] to the multivariate case, develop the method
for parameter estimation, and derive the statistics for hypothesis testing. Finally, we
fit the multivariate generalized gamma regression to model educational indicators in
Central Java involving a number of predictor variables.
The rest of the article is organized as follows. Section 2 details the multivariate
generalized gamma distribution and the corresponding regression model, followed by
the techniques for parameter estimation and hypothesis testing. We evaluate our model
and method using simulated data (Sect. 3) and real data on educational indicators
(Sect. 4). Section 5 provides the conclusion.
2 Methods
where $\mathbf{y} = [y_1\ y_2\ \cdots\ y_K]^T$, $\Theta = [\lambda\ \tau\ \boldsymbol{\theta}^T\ \boldsymbol{\delta}^T]^T$ is the vector of MGG distribution
parameters, $\boldsymbol{\theta}^T = [\theta_1\ \theta_2\ \cdots\ \theta_k\ \cdots\ \theta_K]$, $\boldsymbol{\delta}^T = [\delta_1\ \delta_2\ \cdots\ \delta_k\ \cdots\ \delta_K]$, $\lambda, \tau, \theta_k > 0$,
$\delta_k \in \mathbb{R}$, $y_k > 0$, $y_{(k-1)} + \delta_k < y_k$ for $k > 1$, $\delta_1 < y_1$, and $f(\mathbf{y}\mid\Theta) = 0$ otherwise. Hence,
Yk follows a GG distribution with theoretical mean as in Eq. (2).
$$E(Y_k) = \frac{\Gamma\!\left(\lambda + \frac{1}{\tau}\right)}{\Gamma(\lambda)} \sum_{r=1}^{k} \theta_r + \sum_{r=1}^{k} \delta_r. \qquad (2)$$
where $Y_{ki}$ is the $k$-th response variable at the $i$-th observation, $x_{ji}$ is the $j$-th predictor
variable at the $i$-th observation, $\mathbf{x}_i^T = [1\ x_{1i}\ x_{2i}\ \cdots\ x_{ji}\ \cdots\ x_{pi}]$, $i = 1, 2, \ldots, n$, $n$
represents the sample size, $\boldsymbol{\beta}_k = [\beta_{0k}\ \beta_{1k}\ \cdots\ \beta_{jk}\ \cdots\ \beta_{pk}]^T$, $k = 1, 2, \ldots, K$,
$p$ is the number of predictor variables, and $K$ is the number of response
variables. Therefore, the joint PDF of the MGGR model at the $i$-th observation is
$$f(\mathbf{y}_i \mid \Theta_{MR}) = \frac{\tau^{K} \exp\!\left( -\left( \frac{y_{1i} - \delta_1}{\theta_1(\mathbf{x}_i)} \right)^{\tau} - \sum_{k=2}^{K} \left( \frac{y_{ki} - y_{(k-1)i} - \delta_k}{\theta_k(\mathbf{x}_i)} \right)^{\tau} \right)}{\Gamma(\lambda)^{K} \left( \prod_{k=1}^{K} \theta_k(\mathbf{x}_i) \right) \left( \frac{y_{1i} - \delta_1}{\theta_1(\mathbf{x}_i)} \right)^{1 - \tau\lambda} \prod_{k=2}^{K} \left( \frac{y_{ki} - y_{(k-1)i} - \delta_k}{\theta_k(\mathbf{x}_i)} \right)^{1 - \tau\lambda}}, \qquad (4)$$
where $\mathbf{y}_i$ is the random vector of response variables at the $i$-th observation, and
$\Theta_{MR} = [\boldsymbol{\beta}_1^T\ \boldsymbol{\beta}_2^T\ \cdots\ \boldsymbol{\beta}_k^T\ \cdots\ \boldsymbol{\beta}_K^T\ \lambda\ \tau\ \boldsymbol{\delta}^T]^T$ represents the parameters of the MGGR model. This
study operates under the assumption that alterations in the predictor values solely
impact the scale parameter’s value. Therefore, building on Eqs. (2) and (3), we can
express the scale parameter element for each response variable using Eqs. (5) and
(6).
$$\theta_1(\mathbf{x}_i) = \frac{\Gamma(\lambda)}{\Gamma\!\left(\lambda + \frac{1}{\tau}\right)} \left( \exp\!\left(\mathbf{x}_i^T \boldsymbol{\beta}_1\right) - \delta_1 \right), \qquad (5)$$

$$\theta_k(\mathbf{x}_i) = \frac{\Gamma(\lambda)}{\Gamma\!\left(\lambda + \frac{1}{\tau}\right)} \left( \exp\!\left(\mathbf{x}_i^T \boldsymbol{\beta}_k\right) - \exp\!\left(\mathbf{x}_i^T \boldsymbol{\beta}_{(k-1)}\right) - \delta_k \right), \quad k = 2, 3, \ldots, K. \qquad (6)$$
To estimate the parameters of the MGGR model, we employ the maximum like-
lihood estimation (MLE) method. However, setting the first-order derivatives of the
log-likelihood function to zero does not yield a closed-form solution for each
parameter. As a result, we require an optimization technique that involves
numerical iteration, and we use the BHHH algorithm for this purpose. We selected
this algorithm because it does not necessitate the calculation of a second derivative
for creating the Hessian matrix [16, 17]. The gradient vector and Hessian matrix of
the BHHH algorithm are shown in Eqs. (8) and (9).
$$\mathbf{g}(\Theta_{MR}) = \sum_{i=1}^{n} \mathbf{g}_i(\Theta_{MR}), \qquad (8)$$

$$\mathbf{H}(\Theta_{MR}) = -\sum_{i=1}^{n} \mathbf{g}_i(\Theta_{MR})\, \mathbf{g}_i(\Theta_{MR})^T, \qquad (9)$$
where $\mathbf{g}_i(\Theta_{MR})$ represents the gradient vector for the $i$-th individual observation,
expressed by

$$\mathbf{g}_i(\Theta_{MR}) = \left[ \frac{\partial \log f(\mathbf{y}_i\mid\Theta_{MR})}{\partial \boldsymbol{\beta}_1^T}\ \ \frac{\partial \log f(\mathbf{y}_i\mid\Theta_{MR})}{\partial \boldsymbol{\beta}_2^T}\ \cdots\ \frac{\partial \log f(\mathbf{y}_i\mid\Theta_{MR})}{\partial \boldsymbol{\beta}_K^T}\ \ \frac{\partial \log f(\mathbf{y}_i\mid\Theta_{MR})}{\partial \lambda}\ \ \frac{\partial \log f(\mathbf{y}_i\mid\Theta_{MR})}{\partial \tau}\ \ \frac{\partial \log f(\mathbf{y}_i\mid\Theta_{MR})}{\partial \boldsymbol{\delta}^T} \right]^T. \qquad (10)$$
Thus, the MGGR model parameter estimator is obtained by the following steps:
• Step 1. Define the initial value $\hat{\Theta}_{MR}^{(0)} = \left[ \hat{\boldsymbol{\beta}}_1^{(0)T}\ \hat{\boldsymbol{\beta}}_2^{(0)T}\ \cdots\ \hat{\boldsymbol{\beta}}_K^{(0)T}\ \hat{\lambda}^{(0)}\ \hat{\tau}^{(0)}\ \hat{\boldsymbol{\delta}}^{(0)T} \right]^T$, where $\hat{\lambda}^{(0)}$, $\hat{\tau}^{(0)}$, and $\hat{\boldsymbol{\delta}}^{(0)}$ are derived from the estimates of the distribution parameters of the response variables [18], while $\hat{\boldsymbol{\beta}}_k^{(0)}$ is obtained from the univariate generalized gamma regression (UGGR) model parameters for each response variable [14].
• Step 2. Determine the stopping criteria of the BHHH iteration. This study uses a tolerance limit of $\varepsilon = 10^{-8}$ or, alternatively, a maximum of $M = 150$ iterations. These choices balance accuracy and efficiency: the algorithm terminates either when it achieves a high level of accuracy or when it reaches the predefined limit on the number of iterations.
• Step 3. Run the BHHH iteration

$$\hat{\Theta}_{MR}^{(m+1)} = \hat{\Theta}_{MR}^{(m)} - \mathbf{H}^{-1}\!\left(\hat{\Theta}_{MR}^{(m)}\right) \mathbf{g}\!\left(\hat{\Theta}_{MR}^{(m)}\right), \qquad (11)$$

where $m = 1, 2, \ldots, m^{*}$.
• Step 4. If $\left\| \hat{\Theta}_{MR}^{(m^{*}+1)} - \hat{\Theta}_{MR}^{(m^{*})} \right\| < \varepsilon$, the iterations stop at the $m^{*}$-th iteration; alternatively, the iteration is terminated when it reaches the $M$-th iteration. The final iteration provides the estimator value when convergence occurs.
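Since the BHHH Hessian in Eq. (9) is built only from per-observation gradients, the iteration of Eq. (11) can be sketched generically as below. The function returning the matrix of per-observation gradients of the MGGR log-likelihood must be supplied by the user; this is an illustrative sketch, not the authors' code.

```python
import numpy as np

def bhhh(theta0, per_obs_gradient, tol=1e-8, max_iter=150):
    """Minimal BHHH update of Eq. (11).

    per_obs_gradient(theta) must return an (n, q) array whose i-th row is
    g_i(theta), the gradient of the i-th observation's log-likelihood.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        G = per_obs_gradient(theta)          # shape (n, q)
        g = G.sum(axis=0)                    # Eq. (8): total gradient
        H = -(G.T @ G)                       # Eq. (9): outer-product Hessian
        step = np.linalg.solve(H, g)         # H^{-1} g
        theta_new = theta - step             # Eq. (11)
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```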
The significance of the regression parameters of the MGGR model is tested simultaneously
with the null hypothesis $H_0: \beta_{jk} = 0$, $j = 1, 2, \ldots, p$, $k = 1, 2, \ldots, K$, against the
alternative $H_1$: at least one $\beta_{jk} \neq 0$.
Suppose that $\Omega = \left\{ \boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \ldots, \boldsymbol{\beta}_k, \ldots, \boldsymbol{\beta}_K, \lambda, \tau, \boldsymbol{\delta} \right\}$ represents the set of parameters
under the population and $\omega = \left\{ \beta_{\omega 01}, \beta_{\omega 02}, \ldots, \beta_{\omega 0k}, \ldots, \beta_{\omega 0K}, \lambda_{\omega}, \tau_{\omega}, \boldsymbol{\delta}_{\omega} \right\}$ repre-
sents the parameters assumed to be true under the null hypothesis. Note that
$\hat{\Omega}$ and $\hat{\omega}$ are estimators obtained by the MLE method. Therefore, using the MLRT
method, the test statistic for this hypothesis is

$$G^2 = 2\left( \ln L\left(\hat{\Omega}\right) - \ln L(\hat{\omega}) \right). \qquad (12)$$

For large sample sizes, the test statistic $G^2$ follows a Chi-squared distribution with
$pK$ degrees of freedom [17]. Thus, the null hypothesis is rejected if $G^2 > \chi^2_{(1-\alpha);\,pK}$.
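Assuming the maximized log-likelihoods under the full and null models are available, the decision rule of Eq. (12) can be illustrated as follows (function and argument names are illustrative):

```python
from scipy.stats import chi2

def simultaneous_test(loglik_full, loglik_null, p, K, alpha=0.05):
    # Eq. (12): G^2 compared with the chi-squared quantile on p*K degrees of freedom
    G2 = 2.0 * (loglik_full - loglik_null)
    critical = chi2.ppf(1.0 - alpha, df=p * K)
    return G2, critical, G2 > critical
```

With p = 6 predictors and K = 3 responses, `chi2.ppf(0.95, 18)` reproduces the 28.869 threshold used in Sect. 4.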
The significance of the MGGR model regression parameters is partially tested
with the null hypothesis $H_0: \beta_{jk} = 0$ against the alternative hypothesis $H_1: \beta_{jk} \neq 0$,
using the test statistic

$$Z_{jk} = \frac{\hat{\beta}_{jk}}{\sqrt{\widehat{\operatorname{var}}\left(\hat{\beta}_{jk}\right)}}, \qquad (13)$$

where $\widehat{\operatorname{var}}(\hat{\beta}_{jk})$ is the main diagonal element of the matrix $\left[ -\mathbf{H}\left(\hat{\Theta}_{MR}\right) \right]^{-1}$ corre-
sponding to $\hat{\beta}_{jk}$. The null hypothesis is rejected when $\left| Z_{jk} \right| > Z_{\alpha/2}$.
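Once the diagonal of $[-\mathbf{H}(\hat{\Theta}_{MR})]^{-1}$ corresponding to the regression coefficients has been extracted, the partial test of Eq. (13) can be vectorized; the sketch below is illustrative only.

```python
import numpy as np
from scipy.stats import norm

def wald_z(beta_hat, var_hat, alpha=0.05):
    # Eq. (13): Z = beta_hat / sqrt(var(beta_hat)); reject H0 when |Z| > z_{alpha/2}
    z = np.asarray(beta_hat) / np.sqrt(np.asarray(var_hat))
    return z, np.abs(z) > norm.ppf(1.0 - alpha / 2.0)
```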
3 Simulation Study
We used simulation studies to check if our new estimation method works well. In
this simulation, we looked at different sample sizes: 50, 100, 150, and 200 samples,
and we repeated each scenario 500 times. We picked these numbers based on what
makes sense statistically, what our computers can handle, and to make sure our results
are reliable and can be applied widely. To simulate the MGGR model, we needed
to create data that matches how the predictor variables are related to the response
variables, which follow the MGG distribution. Here is a step-by-step guide on how
we did that:
Step 1: We start by setting up the MGGR model as described in Eq. (3). For our
study, we work with three predictor variables.
Step 2: We specify the parameters. In our research, we consider two parameter
models: the bivariate and trivariate models, which are presented in Table 1.
Step 3: Generate predictor variables from uniform distributions:
$X_1 \sim U(10, 90)$, $X_2 \sim U(10, 25)$, and $X_3 \sim U(10, 50)$.
Step 4: Calculate the scale parameter of the k-th response variable at the i-th
observation by applying Eqs. (5) and (6), as sketched after this list.
Step 5: Create response variables based on the multivariate generalized gamma
(MGG) distribution. We derive the shape and threshold parameters from the distri-
bution parameters provided in Table 1. The scale parameters are obtained from
the calculations made in Step 4.
Step 6: Execute the parameter estimation process for the MGGR model, following
the steps outlined in Sect. 2.2.
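A compact sketch of Steps 3–5 is given below. It assumes that an MGG response vector can be generated as the cumulative sum of independent generalized-gamma increments, which is consistent with Eqs. (2) and (4), and that if W follows a standard gamma distribution with shape λ then δ + θ·W^(1/τ) has the generalized gamma distribution used here. Parameter values and names are illustrative and do not reproduce Table 1.

```python
import numpy as np
from scipy.special import gamma as Gamma

def simulate_mggr(beta, lam, tau, delta, n, rng):
    # beta: (p+1, K) coefficient matrix with the intercept in the first row
    # delta: length-K threshold parameters
    X = np.column_stack([np.ones(n),
                         rng.uniform(10, 90, n),    # Step 3: X1 ~ U(10, 90)
                         rng.uniform(10, 25, n),    #         X2 ~ U(10, 25)
                         rng.uniform(10, 50, n)])   #         X3 ~ U(10, 50)
    K = beta.shape[1]
    ratio = Gamma(lam) / Gamma(lam + 1.0 / tau)
    mu = np.exp(X @ beta)                            # exp(x_i^T beta_k), shape (n, K)

    theta = np.empty_like(mu)                        # Step 4: Eqs. (5) and (6)
    theta[:, 0] = ratio * (mu[:, 0] - delta[0])
    for k in range(1, K):
        theta[:, k] = ratio * (mu[:, k] - mu[:, k - 1] - delta[k])

    # Step 5: responses built from independent generalized-gamma increments
    W = rng.gamma(shape=lam, scale=1.0, size=(n, K))
    increments = np.asarray(delta) + theta * W ** (1.0 / tau)
    Y = np.cumsum(increments, axis=1)                # Y_k = Y_{k-1} + increment
    return X, Y
```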
We assessed the accuracy of our parameter estimation by computing the mean
for each estimated parameter, as detailed in Table 2. Furthermore, we examined
the bias, variance, and root of mean squared error (RMSE) associated with these
parameter estimates. Figures 1 and 2 present these results graphically, providing a
visual representation of the performance of our parameter estimation procedure. This
comprehensive evaluation helps us gauge the robustness and reliability of the MGGR
model in capturing the underlying relationships in our simulated data.
Based on the insights from Table 2, our proposed procedure demonstrates robust-
ness in accurately estimating MGGR model parameters across both small and large
sample sizes. The average values of the estimated parameters closely align with
the true parameters, affirming the method’s reliability. Furthermore, the patterns
observed in Figs. 1 and 2 are noteworthy. They illustrate that the variance and root of
mean squared error (RMSE) associated with bivariate and trivariate model parame-
ters tend to decrease as the number of samples increases. This trend signifies that the
Fig. 1 Bias, variance, and RMSE of bivariate generalized gamma regression model parameters
Fig. 2 Bias, variance, and RMSE of trivariate generalized gamma regression model parameters
estimated parameter values become more precise and consistent with an increase in
sample size. However, it is essential to highlight that this trend does not extend to
the bias of the estimated parameters. Both Figs. 1 and 2 indicate no clear relation-
ship between bias and sample size. Consequently, our proposed procedure consis-
tently provides parameter estimations for the MGGR model with relatively low bias,
offering dependable and accurate results.
In this study, we utilized secondary data spanning from 2017 to 2021, sourced from
BPS Central Java Province. Our observation unit encompassed the 35 districts and
cities within Central Java Province [19]. The data employed pertained to education
statistics in Central Java, which included indicators related to education participation,
specifically the school participation rate (SPR) for individuals aged 16 to 18, and the
gross participation rate (GPR) at the high school or equivalent level. Additionally, we
considered education development indicators, such as the mean years of schooling
(MYS). These indicators were examined in conjunction with various factors believed
to influence them. These factors included GRDP per capita, the percentage of the
population living in poverty, gender ratio, percentage of households with access to
proper sanitation, labor force participation rate, and teacher-student ratio at the junior
high school level. Descriptive statistics for each of the research variables are presented
in Table 3, offering a comprehensive overview of the data characteristics.
The visual depiction of the relationships between response variables and predictor
variables is effectively showcased through a matrix plot, as demonstrated in Fig. 3.
This figure provides a visual summary of the correlations between each predictor
variable and each response variable, and the observed patterns align with established
theories.
Among the predictor variables, GRDP per capita and the Percentage of households
with access to proper sanitation exhibit a positive correlation with the response
variables. In contrast, the percentage of people living in poverty, gender ratio, labor
force participation rate, and teacher-student ratio show a negative correlation with
the response variables.
Notably, Y 1 (mean years of schooling) exhibits the most significant correlation
with its six predictors. This suggests that the predictor variables offer a better expla-
nation for the variation in the Y 1 variable compared to other response variables,
emphasizing its importance in our analysis.
Subsequently, the fit of the MGG distribution was evaluated using the Kolmogorov–
Smirnov (KS) test. The null hypothesis states that the observed data follow the MGG
distribution, whereas the alternative hypothesis states that they do not [18]. The KS
test produced a test statistic of 0.052564 and a p-value of 0.7189, so there is no
compelling evidence to reject the null hypothesis. In other words, the school
participation rate (SPR), gross participation rate (GPR), and mean years of schooling
(MYS) are well described by the multivariate generalized gamma (MGG) distribution.
In our analysis, we employed the MGGR model to identify which predictor vari-
ables had a significant impact on the response variable. The initial parameter values
required for estimating the MGGR model are provided in Table 4. Subsequently,
we present the parameter estimation results in Table 5 considering three response
variables and six predictor variables. Within this table, the p-values for each param-
eter are used to gauge the significance of individual predictors. At a significance level
of 10%, the outcomes are as follows: All predictors significantly affect Y 1 , except
for variable X 5 . Variables X 3 , X 4 , and X 6 notably influence Y 2 . Meanwhile, Y 3 is
substantially affected by variables X 3 , X 4 , and X 5 . These findings offer valuable
insights into the complex relationships between predictor and response variables,
providing a comprehensive understanding of the model’s parameters.
The model’s significance can be assessed concurrently using Wilk’s likelihood
ratio statistics derived from the MLRT. The calculated test statistic is 337.433, while
the quantile of the Chi-squared distribution with 18 degrees of freedom and a signif-
icance level of α = 0.05 is 28.869. The calculated test statistic exceeds the critical
value of the Chi-squared distribution, leading to the rejection of the null hypothesis.
It implies that the six predictor factors collectively influence the response variables
considerably.
Based on the outcomes of a single response modeling (see Table 4) and multi-
variate modeling using MGGR (see Table 5), it can be seen that the difference in
coefficient sign only occurs for variable X 1 in responses Y 2 , Y 3 , and X 5 in response
Y 1 . The presence of this phenomenon can be ascribed to the presence of a correlation
between the response variables. As a comparison, it can also be seen from the matrix
plot in Fig. 3. Therefore, based on the correlation values in this figure, it can be
concluded that the MGGR model is more reasonable than univariate modeling.
5 Conclusion
To sum up, the multivariate generalized gamma regression (MGGR) model has
emerged as a robust and effective tool for estimating parameters in the context of
responses adhering to the MGG distribution. By leveraging the maximum likelihood
estimation (MLE) method and the BHHH algorithm for optimization, the model
consistently delivered reliable results. Our application of the MGGR model to analyze
three education indicators within the framework of six predictor variables revealed
the clear advantage of employing multivariate modeling over its univariate counter-
part. This approach not only deepened our understanding of the interrelationships
among variables but also highlighted that parameter significance may vary when
assessed individually in univariate models. It is important to note that addressing the
challenge of collinearity among responses remains an avenue for further exploration
in the future research. In essence, our study underscores the efficacy of the MGGR
model in parameter estimation and emphasizes the merits of adopting a multivariate
approach when confronting intricate data relationships. This work sets the stage
for future research endeavors aimed at resolving the complexities associated with
collinearity among responses in statistical modeling.
Acknowledgements The authors would like to express deep appreciation for the financial support
provided by Balai Pembiayaan Pendidikan Tinggi (BPPT) or Central for Higher Education Funding
and Lembaga Pengelola Dana Pendidikan (LPDP) under the Ministry of Education, Culture,
Research, and Technology of the Republic of Indonesia.
References
15. Mathai AM, Moschopoulos PG (1992) A form of multivariate gamma distribution. Ann Inst
Stat Math 44(1):97–106
16. Hayati FN, Otok BW (2018) Parameter estimation and statistical test of mixed geographically
weighted bivariate Weibull regression (MGWBWR). In: 2018 international symposium on
advanced intelligent informatics (SAIN), pp 78–83. IEEE, Yogyakarta
17. Rahayu A, Purhadi S, Prastyo DD (2020) Multivariate gamma regression: parameter estimation,
hypothesis testing, and its application. Symmetry 12(5):1–17
18. Yasin H, Purhadi, Choiruddin A (2023) Parameter estimation and the goodness-of-fit test for the
multivariate generalized gamma distribution. In: 2023 international conference on computer,
control, informatics and its applications (IC3INA), pp 382–387. IEEE, Bandung, Indonesia
19. BPS-Statistics of Jawa Tengah Province, LNCS Homepage. https://jateng.bps.go.id/. Last
accessed 17 Feb 2023
Health and Industrial Applications
W@rk: Attendance Application
Framework Using Blockchain
Technology
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 479
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_34
1 Introduction
The onset of the COVID-19 pandemic in early 2020 had an unintended impact on
business, accelerating the adoption of digital technologies and resulting in changes
to work patterns due to the lockdown. This led to a shift from traditional work
practices restricted by time, location, and structure to the widespread adoption of
remote work that relied on tools such as video conferencing, collaborative software,
and online accessibility. During the lockdown period, many organizations faced chal-
lenges in recording and monitoring employee attendance using conventional systems
and procedures such as the attendance register system or through an on-premises
biometric-attendance system for them to clock in and out. Furthermore, these atten-
dance systems are prone to intentional or unintentional modification of the system,
the components of systems, and their data. Blockchain technology plays a key role
in mitigating several issues faced by the global community before and after the
pandemic. Blockchain was first proposed in [1] and has since evolved rapidly from
a prototype concept, receiving considerable attention in earlier studies. It combines
three technologies: a consensus protocol, peer-to-peer (P2P) networking, and
private-key cryptography. This emerging technology promotes a better architecture
with transparency, trust, security, and immutability of transactions.
This paper proposes a blockchain-based attendance framework with a tamper-proof
mechanism. The blockchain stores employee attendance infor-
mation through its shared database or ledger in digital, decentralized, and secure
transactions. The data updated from the attendance can be observed in a public mode.
This accessibility makes the data trustworthy and transparent, leading to a trace for
any tampered datasets. As per reference [2], blockchain has five basic components:
consensus mechanism, P2P network, validity rules, ledger, and cryptography. Here,
cryptographic blocks are mainly utilized to collect the transactions across several
nodes that can be either private or public. A public blockchain is open to everybody,
whereas a private blockchain is limited to authorized participants only [3]. However,
different consensus protocols are discussed in blockchain technology, which helps
to ensure the essential agreement over the ledger status and contribution by the other
participants in the blockchain network. Apart from blockchain, many organizations
can continue to use biometric technology such as face recognition systems as part
of their digital attendance authentication mechanism [4]. By using a computer or
laptop with a high-quality web camera as input during the attendance check, the
facial recognition system saves employee attendance into the database. The work in
[5] proposes a system implemented in four steps: finding all the faces, posing and
projecting the faces, encoding the faces, and finding the person's name from the
encoding. Another technology that can be applied to the attendance system is through
tracking the location of the employee within the designated geo-fence. The location-
based system enables employees who are working away from home to be tracked
for attendance records. In [6], a student attendance system is developed for schools
and colleges combining geo-fencing using the geo-fencing API from Google Play
services by acquiring the live location of students and a geo-fence of the class area.
In this paper, we introduce W@RK, a contactless attendance framework that
embeds face recognition and location-based tracking for recording employee
attendance using blockchain technology. The aim is to monitor employee attendance
and empower employees to work from anywhere. The objective is to develop an
alternative contactless attendance application. Agile software development method-
ology is adopted to develop this attendance application. A facial recognition feature
is used to scan the face of employees to authenticate themselves through verification
services. While the location is detected using geo-fencing, the blockchain technology
records and verifies the attendance of the employee. The facial data is recorded on a
blockchain, along with a timestamp and other relevant information such as location.
The findings show the proposed W@RK framework can record and verify atten-
dance securely and efficiently. As the use of blockchain technology continues to
grow, we expect more organizations will adopt facial attendance systems to track the
attendance of employees, students, and other individuals at events and meetings.
The rest of this paper is organized as follows. Section 2 provides background and
previous work on blockchain adoption in attendance systems. The related works of
face recognition systems in an attendance system are discussed in this section. The
methodology of W@RK is described in Sect. 3. The design of the W@RK framework
is presented in Sect. 4. Section 5 concludes the research work and discusses the future
scope of this work.
2 Background Study
Studies that have used face recognition integrated with blockchain systems are still limited. The study
of [11] utilizes a convolutional neural network for face recognition and stores the
attendance data on the blockchain system. The attendance system with the blockchain
has not been limited to organizations’ attendance, but it has also been designed
for conference [12] attendance and class attendance at a university [13, 14]. The
attendance system discussed in [12] performs auto-track operations for the user’s
behavior over the screen, device location, and allotted time during the session for
which the user makes the subscription. In [13], the attendance system used Wi-Fi that
allows students to record their attendance through a mobile phone or laptop and the
record will be saved on a distributed and decentralized blockchain. As the worker's
location is important when they log into the attendance system, to verify whether
they are working from home or the office, we propose the W@RK attendance
framework, which combines a face recognition feature and location-based tracking
for recording employee attendance data using blockchain technology.
3 Methodology
The first component is the biometric recognition method, which uses facial recognition
as the selected biometric. Facial recognition identifies and verifies a user's identity
using their face, and this choice of biometric authentication has several advantages.
Facial recognition is non-intrusive [21] and does not require physical contact, unlike
fingerprint and iris scanning. Users simply look into the camera for authentication,
making it user-friendly and easy to adopt. The method is also easily accessible, since
many users can authenticate through the front-facing cameras of smartphones and
computers, which gives it high acceptance [22], robustness [23], convenience [24],
speed, contactless hygiene [25], and low sensitivity to age [26]. It is nevertheless
necessary to address the limitations of facial recognition, including detection accuracy,
privacy concerns, and algorithmic bias. Organizations should collect facial images of
their employees from various angles and under various lighting conditions to ensure
accuracy, and preprocess them to enhance quality and standardize them for analysis.
A verification step of this kind is sketched below.
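As an illustration of the verification step, the sketch below uses the open-source `face_recognition` library to compare a stored employee photograph with a freshly captured image; file paths and the matching tolerance are illustrative assumptions, not part of the W@RK specification.

```python
import face_recognition

def verify_employee(known_image_path, captured_image_path, tolerance=0.6):
    # Load the enrolled photograph and the image captured at check-in
    known = face_recognition.load_image_file(known_image_path)
    captured = face_recognition.load_image_file(captured_image_path)
    known_encodings = face_recognition.face_encodings(known)
    captured_encodings = face_recognition.face_encodings(captured)
    if not known_encodings or not captured_encodings:
        return False  # no face found in one of the images
    # compare_faces returns one boolean per known encoding
    match = face_recognition.compare_faces(
        [known_encodings[0]], captured_encodings[0], tolerance=tolerance)
    return bool(match[0])
```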
The geo-fencing and blockchain components are set up through the following steps:
• Set the geofence: determine the geographic area where attendance will be tracked using geo-fencing.
• Test and refine: test the accuracy of attendance tracking and refine it as needed.
• Choose a blockchain platform: select the blockchain platform best suited for W@RK.
• Develop the smart contract: write the smart contract that will govern the interactions on the blockchain.
• Integrate with the system: integrate the smart contract with W@RK via an API.
4.4 Integration
The previous section discusses the components and features that collectively form
the entirety of the W@RK project. In this section, we will present the integration
process of the components and the data flow of the W@RK. Figure 5 shows the
flowchart analysis of the proposed W@RK. The flowchart module starts with the face
scanning system. Initially, the face is read through the system and tries to identify the
images from the existing database. If the face is recognized from the server location,
it will try to collect the location identity from the auto-attendee location tracking.
Here, the location tracking is performed through LBS and uses GPS for position
tracking. If the face is not recognized, the system will go back and start scanning
the face for another verification attempt. Each step defined in the flowchart is
explained in detail below. To make the attendance information secure
and transparent, blockchain technology is employed. A transaction of attendance
record along with the transaction address, private key, and public key is written in
the blockchain. The transaction will be added to the blockchain node if it is valid.
In a blockchain network, only valid transactions can adhere to predefined rules and
conditions that ensure the blockchain ledger’s integrity and consistency. Certain
modification and enhancement of the W@RK system is motivated by Mohammad
Azahari et al. [20].
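To illustrate how hash-linking makes attendance records tamper-evident, the following conceptual sketch chains each record to the hash of the previous block. It deliberately omits the consensus, key-management, and smart-contract layers of a real blockchain, and field names are assumptions rather than the W@RK implementation.

```python
import hashlib
import json
import time

class AttendanceLedger:
    def __init__(self):
        # the chain starts from a fixed genesis block
        self.chain = [{"index": 0, "prev_hash": "0" * 64,
                       "record": "genesis", "timestamp": time.time()}]

    def _hash(self, block):
        # deterministic SHA-256 digest of the block contents
        return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

    def add_record(self, employee_id, location, face_verified):
        prev = self.chain[-1]
        block = {
            "index": prev["index"] + 1,
            "prev_hash": self._hash(prev),          # link to the previous block
            "record": {"employee_id": employee_id, "location": location,
                       "face_verified": face_verified},
            "timestamp": time.time(),
        }
        # a simple validity rule: only face-verified check-ins are appended
        if face_verified:
            self.chain.append(block)
        return block
```

Because each block stores the hash of its predecessor, changing any earlier record invalidates every subsequent hash, which is the property the attendance framework relies on.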
Initially, the proposed system uses the computer’s or phone’s front camera to
capture the facial image. Once the camera captures the image, it sends it into
processing. Here, the identified facial image is checked in the labeled data. If it
matches the existing images, it will process the next steps in the flowchart. The next
phase discussed in the flowchart is facial recognition. The proposed system uses
facial recognition software to identify the person by searching various images. The
software used in this process compares the facial features of the images in the labeled
database of the stored faces. If the software finds the match of the facial image, then
it recognizes the person. If it is not identified, then the process ends. The third phase
of the flowchart discusses location detection. To identify the attendee’s location, the
system uses GPS and other location tracking systems to determine the person’s loca-
tion. Location detection using GPS and LBS is an essential part of event management.
This technology will allow the organizer to monitor the movement of the attendees
and improve the event experience, security, and safety. GPS is the satellite naviga-
tion system that provides accurate location information through multiple satellites.
GPS-enabled smartphones, computers, tablets, and other tracking devices are used
to calculate a precise position.
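As a simple illustration of the geo-fence check, the reported GPS position can be compared with the fence center using the haversine distance; the radius and function names below are illustrative assumptions.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    # great-circle distance between two points, in metres
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def inside_geofence(lat, lon, fence_lat, fence_lon, radius_m=200.0):
    # attendance is accepted only when the device is within the fence radius
    return haversine_m(lat, lon, fence_lat, fence_lon) <= radius_m
```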
To verify the attendee’s location, GPS is used to track the participants’ real-time
position. Attendees must carry a device with GPS receivers allowing the event orga-
nizers to observe their movements in the venue. LBS uses the location information
provided by the positioning technologies such as GPS. This technology leverages
the GPS data and offers other services based on the user’s location while navigating,
mapping, geotagging, and proximity-based notifications. For the attendee’s location,
LBS provides location-specific information and is critical in identifying the event
participants. LBS acquires maps, personal schedules, event-specific notifications, and
other nearby amenities for this process. LBS applications may also suggest networking
opportunities among attendees with similar affiliations. Recently, almost all event
management systems have provided a web-based application for attendees equipped
with LBS and GPS capabilities.
Attendees can use the applications on their smart devices, which allows the event
organizers to monitor and track the attendee’s movements. In addition, other wear-
able devices, such as RFID wristbands and smart badges, can also be distributed
to the participants, which provides the exact location for tracking. This event app
processes real-time data and updates the position of the attendees over the event
mapping section. Even though the GPS provides high accuracy in outdoor environ-
ments, it has signal attenuation for indoor environments, which introduces limita-
tions. This problem can be solved using Bluetooth Low Energy (BLE) beacons or
Wi-Fi positioning systems, which provide location information within indoor areas,
conference rooms, and exhibition halls. Collecting these data allows the event appli-
cation to provide the attendee’s better-personalized experiences. For instance, this
facility may provide information about the nearby exhibitions, sessions, or schedules
based on their interest.
The proposed attendance system uses blockchain technology to store atten-
dance records securely, transparently, and tamper-proof. The stored data includes
the attendee’s name, date, time, and location and can be encrypted using cryp-
tographic hashes to prevent unauthorized access. The system integration of smart
contracts simplifies auditing and reduces administrative burden, making the system
highly trustworthy and fraud-resistant. Furthermore, implementing a blockchain-
based attendance system may have unforeseen consequences, such as increased stress
5 Conclusions
References
1. Nakamoto S (2008) Bitcoin: a peer-to-peer electronic cash system. [Online]. Available: https://
bitcoin.org/bitcoin.pdf
2. Garrick H, Michel R (2017) Global blockchain benchmarking study. Cambridge Centre for
Alternative Finance, Cambridge
3. Mamun Q (2022) Blockchain technology in the future of healthcare. Smart Health 23:100223
4. Shashikala HK, Shakya SN, Panjiyar P, Upreti AS, Dadapeer S (2022) Attendance monitoring
system using face recognition. Int J Inform Technol Res Appl 1(3):15–22. https://doi.org/10.
5281/zenodo.7385439
5. Sarath Krishnan P, Manikuttan A (2022) Attendance management system using facial recogni-
tion. In: 2022 International conference on computing, communication, security and intelligent
systems (IC3SIS), Kochi, India, pp 1–6. https://doi.org/10.1109/IC3SIS54991.2022.9885693
6. Babatunde AN, Oke AA, Babatunde RS, Ibitoye O, Jimoh ER (2022) Mobile based student
attendance system using geo-fencing with timing and face recognition. J Adv Mathem Comput
Sci 9(2):75–90. https://doi.org/10.22624/AIMS/MATHS/V9N1P8
7. Ardina H, Nugraha IGBB (2019) Design of a blockchain-based employee attendance system.
In: 2019 International conference on ICT for smart society (ICISS). vol 7. IEEE
8. Zhang Q (2022) Attendance system based on blockchain and face recognition. In: 2022
International conference on smart applications, communications and networking (SmartNets),
November, IEEE, pp 1–6
9. Adhinata FD, Tanjung NAF, Widayat W, Pasfica GR, Satura FR (2022) Real-time masked face
recognition using FaceNet and supervised machine learning. In: Proceedings of the 2nd inter-
national conference on electronics, biomedical engineering, and health informatics: ICEBEHI
2021, 3–4 November, Surabaya, Indonesia, June, Singapore, Springer Nature Singapore, pp
189–202
10. Adetiba E, Opara AE, Ajayi OT, Owolabi FO (2021) DeepFacematch: a convolutional neural
network model for contactless attendance on e-SIWES Portal. In: Information and communica-
tion technology and applications: third international conference, ICTA 2020, Minna, Nigeria,
November 24–27, 2020, Revised Selected Papers 3, Springer International Publishing, pp
196–205
11. Yadav AR, Kumar J, Anumeha AKA, Kumar R (2021) Contactless attendance system: a health-
care approach to prevent spreading of COVID-19. people, vol 5. pp 6
12. Vubangsi M, Al-Turjman F (2022) Design and implementation of a conference attendance
monitoring system using blockchain and AI technologies. In: 2022 International conference
on artificial intelligence in everything (AIE), IEEE, pp 197–202
13. Sarumi UA, Ameen ZS, Al-Turjman F, Altrjman C, Mubarak AS (2022) A novel attendance
system via integrated wifi and blockchain technologies. In: 2022 International conference on
artificial intelligence in everything (AIE), August, IEEE, pp 209–215
14. Bálint K (2021) Possibilities for the utilization of an automatized, electronic blockchain-based,
students’ attendance register, using a universities’ modern security cameras. Acta Polytechnica
Hungarica 18(2):127–142
15. Petersen J (2016) MEAN Web Application development with Agile Kanban. [Online].
Available: https://digitalcommons.wou.edu/honors_theses/105
16. Kaleel SB, Harishankar S (2013) Applying Agile methodology in
mobile software engineering: android application development and its challenges. [Online].
Available: http://digitalcommons.ryerson.ca/compsci_techrpts
17. Flora HK, Chande SV, Wang X (2014) Adopting an agile approach for the development of
mobile applications
18. Abrahamsson P, Salo O, Ronkainen J, Warsta J (2002) Agile software development methods:
review and analysis. [Online]. Available: http://www.vtt.fi/inf/pdf/publications/2002/P478.pdf
19. Sultania AK (2015) Developing software product and test automation software using Agile
methodology. In: Proceedings of the 2015 3rd international conference on computer, commu-
nication, control and information technology, C3IT 2015, March 2015. https://doi.org/10.1109/
C3IT.2015.7060120
20. Mohammad Azahari A, Ahmad A, Rahayu S (2021) Contactless attendance method with face
recognition, body temperature measurement and GPS system using blockchain technology. In:
Zakaria Z, Emamian SS (eds) Advances in electrical and electronic engineering and computer
science. Lecture notes in electrical engineering, vol 741. Springer, Singapore. https://doi.org/
10.1007/978-981-33-6490-5_8
21. Lai X, Patrick Rau P-L (2021) Has facial recognition technology been misused? A public
perception model of facial recognition scenarios. Comput Human Behav 124:106894. ISSN
0747-5632.https://doi.org/10.1016/j.chb.2021.106894
22. Kauba C, Prommegger B, Uhl A (2019) Combined fully contactless finger and hand vein
capturing device with a corresponding dataset. Sensors 19:5014. https://doi.org/10.3390/s19
225014
23. Yang B, Yang S, Zhu X, Qi M, Li H, Lv Z, Cheng X, Wang F (2023) Computer vision technology
for monitoring of indoor and outdoor environments and HVAC equipment: a review. Sensors
23:6186. https://doi.org/10.3390/s23136186
24. van Hoof J, Marston HR, Kazak JK, Buffel T (2021) Ten questions concerning age-friendly
cities and communities and the built environment. Build Environ 199:107922. ISSN 0360-
1323. https://doi.org/10.1016/j.buildenv.2021.107922
25. Yan L, Shi Y, Wei M, Wu Y (2023) Multi-feature fusing local directional ternary pattern for
facial expressions signal recognition based on video communication system. Alexandria Eng
J 63:307–320. ISSN 1110-0168. https://doi.org/10.1016/j.aej.2022.08.003
26. Gode CS, Khobragade AS, Thanekar C, Thengadi O, Lakde K (2023) Face recognition-
based attendance system. In: Smys S, Tavares JMRS, Shi F (eds) Computational Vision and
bio-inspired computing. advances in intelligent systems and computing, vol 1439. Springer,
Singapore. https://doi.org/10.1007/978-981-19-9819-5_12
27. Rajeshkumar G, Braveen M, Venkatesh R, Josephin Shermila P, Ganesh Prabu B, Veerasamy B,
Bharathi B, Jeyam A (2023) Smart office automation via faster R-CNN based face recognition
and internet of things. Measurement: Sensors 27:100719. ISSN 2665-9174. https://doi.org/10.1016/
j.measen.2023.100719
28. Falayi A, Wang Q, Liao W, Yu W (2023) Survey of distributed and decentralized iot securities:
approaches using deep learning and blockchain technology. Future Internet 15:178. https://doi.
org/10.3390/fi15050178
29. Mokhamed T, Talib MA, Moufti MA, Abbas S, Khan F (2023) The potential of blockchain
technology in dental healthcare: a literature review. Sensors 23:3277. https://doi.org/10.3390/
s23063277
30. Waseem M, Adnan Khan M, Goudarzi A, Fahad S, Sajjad IA, Siano P (2023) Incorporation
of blockchain technology for different smart grid applications: architecture, prospects, and
challenges. Energies 16:820. https://doi.org/10.3390/en16020820
31. Zhang T, Li P, Wang N (2023) Multi-period price competition of blockchain-technology-
supported and traditional platforms under network effect. Int J Prod Res 61(11):3829–3843.
https://doi.org/10.1080/00207543.2021.1884308
32. KN RP, Rajat (2023) The intelligent information integrity model to ensure the database protec-
tion using blockchain in cloud networking. In: 2023 International conference on distributed
computing and electrical circuits and electronics (ICDCECE), Ballar, India, 2023, pp 1–7.
https://doi.org/10.1109/ICDCECE57866.2023.10151301
Exploring the Impact of COVID-19
on Individuals’ Mental Health Through
Cluster Analysis
Abstract The COVID-19 pandemic has had a significant impact on mental health,
resulting in anxiety and other issues among many individuals due to the lockdowns
implemented to curb its spread. With the world moving toward 2030, it is crucial to
reduce premature mortality from non-communicable diseases through prevention and
treatment. Sustainable Development Goal (SDG) 3 emphasizes prioritizing mental
health and well-being to address the increasing burden of mental health issues. The
study utilized text clustering through the K-Means algorithm to gain a better under-
standing of the mental health issues people are facing. The Term Frequency-Inverse
Document Frequency (TF-IDF) was used to determine each word’s weight after
extracting tweets from Twitter and preprocessing the data. The K-Means clustering
algorithm was then applied to the data, which revealed that the clusters could be
classified into three categories of mental health: ‘stress,’ ‘depression,’ and ‘pressure.’
It was found that using three clusters provided more dependable outcomes, since using
more than three clusters tended to produce overlapping mental health conditions. This
study sheds light on the mental health problems that people face during the COVID-
19 pandemic, which can help guide efforts to support those in need. Moreover, it
would be more beneficial to incorporate Bahasa Malaysia in future research since
little exploration has been done on this language despite it being Malaysia’s
official language. By adopting a holistic approach and prioritizing mental health, we
can work toward ensuring a healthier and happier future for everyone.
1 Introduction
The COVID-19 pandemic has caused many changes worldwide, such as lockdowns
and economic slowdowns, which have resulted in job losses and increased mental
health concerns. The impact on mental health has been significant; unfortunately,
resources have been inadequate to tackle this issue. It is crucial to prioritize both phys-
ical and mental health when it comes to handling the pandemic. Due to the outbreak
of COVID-19, many countries have enforced lockdowns to prevent the spread of
the disease. This has also resulted in declining economic growth as many economic
activities had to be suspended. As a result of the COVID-19 pandemic, many individ-
uals have lost their jobs, which has contributed to an increase in depressive symptoms
[1]. The worst outcome of a pandemic is death, which affects the population’s size and
its mental and physical health. Mental health encompasses emotional, psychological, and
social well-being, impacting how individuals think, feel, act, manage stress, interact
with others, and make decisions. In times of epidemic, there is a strong correla-
tion between emotional distress and social disorder, which is often influenced by
the spread of infectious diseases and the psychological responses of those affected.
Despite this, mental health and well-being resources are often lacking in the face
of pandemics. It is crucial to prioritize psychological and psychiatric needs during
any phase of pandemic management, alongside testing, critical patient care, and
reducing transmission. This is essential to ensure the overall health and well-being
of individuals and communities impacted by pandemics, as [2] emphasized.
Finding a mentally unstable person could be a challenge. However, one approach
is to analyze the language and content of their written communication, as this can
provide insight into their thought patterns and potential issues [3]. Following this
methodology, our study utilizes an unsupervised machine learning technique, K-
Means clustering, to identify mental health-related keywords during the COVID-19
pandemic, building upon the work of previous researchers in this area [11, 12].
Sabaruddin [4] analyzed mental health tweets during Malaysia’s first Movement
Control Order period, and they also created a Malay tweet dataset using six unique
search strategy keywords associated with mental health, including hashtags related to
COVID-19 and emotions like sadness and anger. Common supervised machine
learning classifiers, Naive Bayes and support vector machine, were applied to predict
the sentiment of the tweets. The six keywords associated with mental health were
#KitaJagakita, #covid19, sedih (sad), marah (angry), kecewa (disappointed), and
bunuh diri (suicide). These keywords were found to be associated
with the emotions felt by Malaysians during the early stages of the pandemic. The
study’s findings showed that the SVM algorithm had the highest accuracy rate of
approximately 84%.
Then, Lee [5] analyzed the public attitudes toward COVID-19 on Twitter and
found that the top keywords related to the tweets were ‘wash your hands,’ ‘stay
home,’ ‘wear a mask,’ ‘social distancing,’ ‘quarantine,’ and ‘mental,’ indicating that
people are concerned about both their physical and mental health. This study looked
at how people felt about COVID-19 on the well-known social network platform
Twitter. The main goals of this research are to find out what COVID-19-related topics
individuals are talking about, why they are interested in them, and how their feelings
have changed over time. Additionally, the study looks for any connections between
the outbreak and conspiracy theories previously unknown to the broader public.
Furthermore, Hou et al. [6] studied how the public uses social media. They
collected popular Weibo texts mentioning COVID-19 with the keywords ‘coronavirus’
and ‘pneumonia’ between December 27, 2019, and May 31, 2020. Data
mining and text analysis were used to show the trend in the level of public atten-
tion at different stages. They also conducted topic analysis, sentiment analysis, and
a correlation analysis between public interest level and the number of COVID-19-
related cases. However, this study only focuses on Sina Weibo, and public attention
data on Facebook and Twitter is omitted.
Therefore, this research aims to explore and identify the crucial terms associated
with mental health issues in Malaysia amid the COVID-19 outbreak through an
analysis of tweets shared on Twitter. The clustering technique was utilized to cluster
the tweets due to its ability to recognize hidden data patterns that might not be
noticeable at first glance. Grouping similar data points can emphasize relationships
and trends we might otherwise have missed [7–10]. Thus, K-Means clustering was chosen to perform
this study’s clustering task. In this paper, we have organized our content in a specific
manner. Firstly, we have presented all related works in the mental health field in
the introduction section. Secondly, we will discuss the proposed methods in Sect. 2,
followed by a discussion of the experimental work and analysis of results in Sect. 3.
Lastly, we conclude in the final section by discussing the obtained results and future
work.
2 Proposed Methodology
Figure 1 shows the methodology process for this research. Three processes were
involved in executing this research: (1) data collection, (2) data preprocessing, and
(3) clustering.
1. Data Collection
For this study, we gathered tweets from April 2020 until the end of the lockdown in
Malaysia in October 2021. The data was scraped from Twitter using the Twint Python
library, based on the mental health keywords listed in Table 1. We saved the data in a CSV file
with several properties, but only the ‘tweet’ attribute is relevant. Then we combined
the tweets in a new file and removed any duplicates by looping through the documents.
This ensured that the dataset contains all the tweets related to the keywords [13, 14].
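A minimal sketch of this collection and deduplication step is shown below, assuming Twint is used as described; the keyword list and file names are illustrative, since Table 1 is not reproduced here.

```python
import twint
import pandas as pd

KEYWORDS = ["stress", "depression", "pressure"]   # placeholder for the Table 1 keywords

# Scrape tweets for each mental health keyword over the lockdown period
for keyword in KEYWORDS:
    c = twint.Config()
    c.Search = keyword
    c.Since = "2020-04-01"
    c.Until = "2021-10-31"
    c.Lang = "en"
    c.Store_csv = True
    c.Output = f"{keyword}_tweets.csv"
    twint.run.Search(c)

# Combine the per-keyword files, keep only the 'tweet' attribute, and drop duplicates
frames = [pd.read_csv(f"{k}_tweets.csv") for k in KEYWORDS]
tweets = pd.concat(frames, ignore_index=True)[["tweet"]].drop_duplicates(subset="tweet")
tweets.to_csv("combined_tweets.csv", index=False)
```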
2. Data Preprocessing
In the data preprocessing stage, the raw Twitter data is converted into an under-
standable and accessible format before proceeding to the clustering stage. This step
is crucial for the model to learn data trends precisely. It comprises tokenization,
normalization, and calculating term frequency-inverse document frequency. Before
tokenization, the data undergoes preprocessing to eliminate any Bahasa Malaysia
phrases, enabling only English terms to be processed. After the dataset is cleaned,
the data transformation step concludes, and the processed data is then saved in a CSV
file for further processing with TF-IDF calculation.
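The TF-IDF weight referred to here follows the standard formulation (libraries such as scikit-learn apply smoothed variants of it):

$$w_{t,d} = \mathrm{tf}_{t,d} \times \log\frac{N}{\mathrm{df}_t}$$

where tf_{t,d} is the frequency of term t in tweet d, N is the total number of tweets, and df_t is the number of tweets containing term t.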
A. Tokenization
The technique of tokenization is used to separate the words in a textual document. This
is especially helpful for textual data, where sentences are decomposed into smaller
units or tokens and to eliminate unwanted characters such as commas and spaces. For
this research, each tweet is considered as a document that is treated as a string and
passed through the tokenization process to extract the tokens present in the document.
B. Normalization
Normalization enables uniform processing by putting every word on an equal footing.
It can be done by the following steps (a minimal code sketch follows the list):
1. Case normalization: converting all text to lowercase or uppercase, so that no
redundant word forms with the same meaning remain.
2. Punctuation removal: removing all punctuation, symbols, and numbers to keep
only words that will be meaningful.
3. Stopword removal: removing common words with little meaning, such as ‘the,’
‘and,’ ‘is,’ etc.
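A minimal sketch of the tokenization and normalization steps is given below; NLTK is assumed here, as the paper does not name the preprocessing library.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads: nltk.download("punkt"); nltk.download("stopwords")
STOPWORDS = set(stopwords.words("english"))

def normalize(tweet):
    text = tweet.lower()                    # 1. case normalization
    text = re.sub(r"[^a-z\s]", " ", text)   # 2. drop punctuation, symbols, numbers
    tokens = word_tokenize(text)            # tokenization into word units
    return [t for t in tokens if t not in STOPWORDS]  # 3. stopword removal

print(normalize("Feeling so stressed & anxious during MCO 2.0!!!"))
# -> ['feeling', 'stressed', 'anxious', 'mco']
```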
After reducing the dimensionality, the K-Means model was fitted to the newly
reduced dataset using the ‘fit(X)’ method, where X was a 2D array-like object. The
K-Means algorithm used is as follows:
1. Initialize k centroids, where k is the number of clusters to create. In this case,
several numbers of clusters were tested: k = 3, k = 4, and k = 5.
2. At each of up to m iterations (m is the maximum iteration number), the distance
from each document to each of the k cluster centers is calculated, and each
document is assigned to the cluster with the shortest distance.
3. Recompute the centroid of each cluster as the mean of all the data points
assigned to that cluster.
4. Steps 2 and 3 are repeated until the centroids no longer change or a maximum
number of iterations is reached. Final clusters are formed by the data points
assigned to each centroid.
In this project, the ‘predict(X)’ method was utilized to allocate clusters for each
tweeted data point. After the clusters were assigned, we grouped the tweets dataset
based on their cluster and computed the average for words in each cluster. The
resulting list contains words and their frequencies, sorted in descending order to
identify the most frequent words in each cluster. By assigning k values to the K-
Means algorithm in scikit-learn, clusters were formed. It is important to note that
each cluster from each k value had different tweets assigned to it.
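A minimal sketch of this clustering step with scikit-learn is shown below. The dimensionality-reduction method is not named in the extracted text, so TruncatedSVD is used purely as a placeholder, and the parameter values are illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

tweets = pd.read_csv("combined_tweets.csv")["tweet"]

# TF-IDF weighting of the cleaned tweets
X_tfidf = TfidfVectorizer(max_features=5000).fit_transform(tweets)

# Dimensionality reduction before clustering (placeholder method)
X = TruncatedSVD(n_components=100, random_state=42).fit_transform(X_tfidf)

for k in (3, 4, 5):
    km = KMeans(n_clusters=k, max_iter=300, random_state=42)
    km.fit(X)               # fit(X) on the reduced 2D array
    labels = km.predict(X)  # predict(X) assigns each tweet to a cluster
    print(k, pd.Series(labels).value_counts().to_dict())
```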
The inter-cluster results were visualized using word clouds. Word clouds provide an easy way to spot the
most significant keywords in text data. For each cluster, we generated a word cloud
showcasing the top 100 words with larger font sizes, indicating a higher frequency of
occurrence. Although word clouds are useful in identifying and highlighting signif-
icant keywords, they have limitations in representing clusters of related words and
capturing the text’s quality accurately. Additionally, a lack of attributes or informa-
tion will reduce the accuracy of word clouds, which can lead to a loss of nuance and
context, as explained in the previous section.
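A minimal sketch of the per-cluster word clouds, assuming the Python wordcloud package (the sample token lists are drawn from Table 3 for illustration):

```python
from collections import Counter
from wordcloud import WordCloud

# tokens_by_cluster maps each cluster id to the tokens of its assigned tweets
tokens_by_cluster = {
    0: ["stress", "time", "release", "covid", "game", "stress"],
    1: ["depress", "bipolar", "mood", "post", "video"],
    2: ["pressure", "depression", "mental", "illness", "suicide"],
}

for cluster_id, tokens in tokens_by_cluster.items():
    freqs = Counter(tokens)
    wc = WordCloud(width=800, height=400, background_color="white",
                   max_words=100)   # keep only the 100 most frequent words
    wc.generate_from_frequencies(freqs)
    wc.to_file(f"cluster_{cluster_id}_wordcloud.png")
```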
Based on the word clouds, we compared the impact of different clusters on text
analysis, as presented in Table 3. Each cluster was represented by the words with
the highest frequency in that cluster, which became the defining category for the
mental health associated with that cluster. However, some words appeared in multiple
clusters, making it difficult to determine which cluster belonged to which mental
health category. To address this issue, we found that setting k = 3 produced the most
reliable and accurate results, with each cluster distinctly associated with the mental
health categories of ‘stress,’ ‘depress,’ and ‘pressure,’ as shown in Fig. 5. Then, from
the word clouds, we analyzed and identified associated words related to the three
desired categories (Stress, Depress, and Pressure), shown in Table 4.
And finally, we assigned mental health categories to each cluster based on the
identified keywords and labeled them accordingly (Table 3).
Table 3 List of top words related to mental health categories for all clusters, k
Cluster 0 (Stress): Stress, main, time, release, covid, game, happy, call, member, start
Cluster 1 (Depress): Depress, bipolar, depressing, mood, post, video, lawyer, cane, mention, series
Cluster 2 (Pressure): Pressure, depression, mental, illness, suicide, blood, cooker, distress, depress, like
Cluster 3 (Stress): Stress, mistake, cut, review, rate, relief, environment, exam, baby, order
We observed that the k = 3 clustering method accurately clustered the dataset into
three distinct mental health categories, whereas the k = 4 and k = 5 methods split
some categories across two clusters. Hence, the K-Means clustering algorithm with
k = 3 produced the most accurate results in clustering the given dataset.
4 Conclusion
To summarize, the K-Means algorithm revealed a similar graph structure for different
k clusters but with overlapping outcomes of cluster groups for higher k values.
The dataset’s objectives were met by analyzing the TF-IDF weight results and top
keywords from tweets about mental health during COVID-19. The frequencies of
these keywords were discovered through exploratory data analysis, and the mental
health category of each cluster was determined after visualizing and analyzing the
results. To present the most significant keywords in each cluster, we utilized word
clouds, with different font sizes indicating their frequency of occurrence. Our findings
suggest that the optimum k value for clustering the tweets into three types of mental
health—‘stress,’ ‘depress,’ and ‘pressure’—is 3. The related keywords for cluster
‘stress’ include time, release, COVID, and game, while those for cluster ‘pressure’ are
depression, mental illness, and suicide. Lastly, for cluster ‘depress,’ the keywords
consist of bipolar, mood, post, and video. However, it is important to note that
word clouds capture only vocabulary, not quality, and may not always represent the
clusters accurately.
Acknowledgements The authors would like to express their gratitude towards UNITAR Inter-
national University for funding this research under UNITAR Internal research grant “Machine
Learning Based Model for The Early Prediction of Potential Pandemic Infection Clusters” and the
College of Computing of Universiti Teknologi MARA (UiTM) for providing an excellent research
environment to carry out this study.
References
1. Posel D, Oyenubi A, Kollamparambil U (2021) Job loss and mental health during the COVID-
19 lockdown: Evidence from South Africa. PLoS ONE 16(3 March):1–9. https://doi.org/10.
1371/journal.pone.0249352
2. Cullen W, Gulati G, Kelly BD (2020) Mental health in the COVID-19 pandemic. QJM
113(5):311–312. https://doi.org/10.1093/QJMED/HCAA110
3. Zhang T (2020) Data Mining can play a critical role in COVID-19 linked mental health studies,
January
4. Sabaruddin RA (2021) Malay tweets: discovering mental health situation during COVID-19
pandemic in Malaysia, pp 58–63
5. Lee JH (2021) Understanding public attitudes toward COVID-19 with twitter. In: 2021 IEEE
systems and information engineering design symposium, SIEDS 2021, pp 1–6. https://doi.org/
10.1109/SIEDS52267.2021.9483708
6. Hou K, Hou T, Cai L (2021) Public attention about COVID-19 on social media: an investigation
based on data mining and text analysis. Personality and Individual Differences 175(December
2020):110701. https://doi.org/10.1016/j.paid.2021.110701
7. Ahmad A, Yusof R, Zulkifli NSA, Ismail MN (2021) J Inform Commun Technol 20(4):651–676.
https://doi.org/10.32890/jict2021.20.4.8
8. Huan Z, Pengzhou Z, Zeyang G (2018) K-means text dynamic clustering algorithm based on
KL divergence. In: Proceedings—17th IEEE/ACIS international conference on computer and
information science, ICIS 2018, pp 659–663. https://doi.org/10.1109/ICIS.2018.8466385
9. Gustriansyah R, Suhandi N, Antony F (2019) Clustering optimization in RFM analysis based
on k-means. Indonesian J Electri Eng Comput Sci 18(1):470–477. https://doi.org/10.11591/ije
ecs.v18.i1.pp470-477
10. Marutho D, Hendra Handaka S, Wijaya E, Muljono (2018) The determination of cluster number
at k-mean using elbow method and purity evaluation on headline news. In: Proceedings—
2018 international seminar on application for technology of information and communication:
Abstract The risk map for infectious disease shows the importance of the Geograph-
ical Information System (GIS) and spatial social network analysis and visualisation
(SSNAV) as a preparedness and response tool to strengthen the capacity for assessing
health risks. The current mapping method still needs to be revised to detect the poten-
tial risk areas of the disease due to the need for more dynamic spatial and social
elements, especially in identifying human mobility effects in detecting missing tuber-
culosis (TB) cases. This study has combined GIS-MCDM and SSNAV techniques to
evaluate whether this combination will enhance TB’s general existing disease hotspot
mapping in Klang, Selangor. The social network structure of selected TB cases in
Klang as actors (nodes) and human mobility (home-workplace) data as edges has
been used to investigate social network mobility structures, analyse the relationships
among the nodes and study their edges regarding their network centrality. The main
finding has revealed that the higher the node’s centrality in the network structure,
the higher the chance the node influences the TB spread in the whole network after
comparing the network graph results with the GIS mapping technique. Combining
these techniques increases the existing mapping capabilities towards enhancing the
understanding of how diseases move through the population and creating a reliable
potential risk map in Malaysia.
1 Introduction
Tuberculosis (TB) causes about 1.5 million yearly deaths [1]. Most of these cases
occur in developing countries, where job opportunities are often the pull factor for the
foreign workforce. Consequently, the influx of foreign workforce, i.e. immigrants,
becomes one of the main factors in the spread of TB [2]. Malaysia is no exception
and is one of the countries that have recorded a constant number of cases. Cases of
TB in Malaysia recorded a higher mortality rate than expected by the World Health
Organization (WHO). In addition, a medical officer from the Institute of Respiratory
Medicine (IPR) Kuala Lumpur, Dr. Nurhayati Mohd. Marzuki said that in 2017, there
were 26,168 active TB cases in the country involving 2098 deaths.
According to the Health Director-General, Dr. Noor Hisham Abdullah (2019),
Selangor and Sabah had the highest number of TB cases, 5071 and 5008, respectively,
in 2018. The mortality rate for TB cases in 2018 was 6.6 per 100,000, an increase
from 6.5 per 100,000 in 2017. Sabah recorded the highest death toll, and a follow-
up investigation in Selangor indicates a need for more comprehensive control and
detection of new cases.
The Malaysian Ministry of Health (MOH) has established guidelines for control-
ling the disease, but these guidelines must be implemented more efficiently and kept
up to date. Indeed, findings from the national TB technical records in
2015 indicate that screening methods for high-risk groups must be further strength-
ened to increase the detection rate of TB cases [3]. Altshuler et al. in [1] also supported
the statement that the current method still fails to detect the spread of the disease and
diagnose TB patients in remote areas. This situation is due to several factors, partic-
ularly the lack of effectiveness of existing methods or less comprehensive systems
in detecting TB cases.
For example, although the molecular epidemiological technique has the advan-
tage of diagnosing TB in the human body [4], it does not consider geographical
and environmental factors. TB cases are influenced not only by human factors
but also by environmental risk factors such as the quality
of the water flow of a place, the type of settlements and human mobility [4]. Other
factors include poverty, inadequate care and medical facilities, the environment and
unhealthy lifestyles [5–12].
Therefore, the current method must be combined with other techniques to improve
the case management of this complex disease. A geospatial or GIS technique of
applying risk mapping can enhance the detection of actual TB transmission in
high-risk areas. It may also improve targeted inspection efforts on field contact
tracing. Several experts explained spatial methods used in studying TB epidemi-
ology, including the spatial distribution, spatial clusters, and predictors of spatial
patterns for the role of congregate settings and the household [3, 13, 14].
The researchers identified TB hotspots and highlighted the potential benefits of
increased community coverage in reducing cases, emphasising the need for effective
population interventions targeting high-risk areas. However, GIS and SNA must be
combined to tackle social dynamic or spatial movement elements. A local researcher
[5] has developed a GIS-MCDM mapping technique for identifying the TB hotspots
in Malaysia. However, the study also suggested considering human factors in the
map for significant predictive spots. This study is conducted to create an SSNAV and
GIS-based risk mapping to limit the spread of TB and detect potentially risky areas
in Klang, Selangor.
GIS applications encompass a wide range of uses across many fields due to the development
of GIS technology over the past few decades. It facilitates business by substan-
tially saving time, cost, and effort in nearly every sector [15]. GIS is used to raise
awareness and spread information about the environment, natural resources, possible
hazards and risks, and planned urban routes [16]. As the primary use of GIS, mapping
is seen as essential: maps are the visual representation of the information that GIS
stores in the database. For many entities in the modern world, maps constitute a basic
necessity.
Numerous new industries, such as health care and education, have incorporated
GIS into their use cases. The burden of epidemics, their geographic distribution, and
the factors influencing health-related states or occurrences in a given population have
all been mapped using GIS [17]. The geographic distributions of illness prevalence
(including communicable and non-communicable diseases), the trend of disease
transmission, and the spatial modelling of environmental factors influencing disease
occurrences have all been mapped using GIS [18]. GIS is also utilised for temporal
analysis, cause-and-effect analysis, cognate models, and geographical modelling
[19].
Multicriteria decision-making (MCDM) applications are also combined with GIS
to make choices or decisions based on multiple criteria or parameters. These appli-
cations evaluate and compare alternatives against criteria, considering each crite-
rion’s relative importance. In finance, decision-makers often need to select the best
combination of investments based on various factors such as risk, return, liquidity,
and diversification. The MCDM method helps evaluate investment alternatives and
construct optimal portfolios that align with the investor’s goals and risk preferences.
In the context of health and epidemiology, GIS-MCDM has been applied in site
selection for hospital or emergency medical service facilities [20, 21] and in iden-
tifying high-risk populations of disease [5, 6, 22]. GIS researchers [5] have also
suggested the implementation of social-human movement or network factors for
producing an integrative geospatial assessment of the potential spots of TB.
Social network analysis (SNA) has numerous applications, especially for
analysing social media networks. SNA is widely used to analyse social media
networks like Facebook, Twitter, and LinkedIn. It helps understand user interactions,
identify influential users, detect communities or interest groups, and track informa-
tion diffusion patterns in online social networks. By examining the network ties and
social influence within a population, researchers can understand how information,
attitudes, and behaviours influence the communities.
SNA has been applied in epidemiology to study the spread of diseases, under-
stand transmission patterns, and inform public health interventions, especially in
contact tracing efforts during disease outbreaks. Researchers can identify individ-
uals who may have been exposed to an infectious disease by mapping their contacts
and interactions [7]. This analysis also helps identify potential cases, assess the trans-
mission risk, effectively target interventions, and identify clusters or subgroups with
higher connectivity that may be at a greater risk of spreading disease. The SNA can
be utilised towards prevention and control strategies for those groups via shaping
human behaviours and interactions in social systems [23].
3 Research Methodology
Due to its TB scenario and environment dynamics, this study was conducted in the
Klang district in Selangor, Malaysia. As shown in Fig. 1, the detailed methodology
workflow is planned in stages to achieve the research objectives. First, the study
determines TB risk factors and ranks them based on their specific category per expert
recommendation from the Selangor State Health Department (JKNS), followed by
other stages.
Data were collected based on the five influential risk factors
contributing to local TB, as shown in Table 1. Inclusively, the study utilised two
types of data: non-spatial and spatial. Spatial data contain information about a specific
geographical location on the earth’s surface, such as coordinates on the base map. In
contrast, non-spatial data are independent of geographic location, such as risk factors
for TB.
A five-Likert scale and rank-sum were selected as the MCDM technique for
data processing risk factors, weight, and mapping. This technique is relatively more
straightforward and does not involve complex pairwise comparisons. For example,
the risk calculation group factor consists of five sub-factors in MCDM, and each
sub-factor has its indicator to make up a sum of 5 for risk group rank (Table 2). TB’s
patient characteristics and risk level calculation are shown in Formula (1), where
the HGR_Scale for every TB case (0 to 25) was ranked based on classes from the
factor range scale of 1 to 5.
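Formula (1) itself is not reproduced in this extract. Assuming the overall risk level is simply the sum of the five sub-factor scores, a plausible form (not the paper's exact notation) would be:

$$\text{HGR\_Scale} = \sum_{i=1}^{5} s_i, \qquad s_i \in [0, 5]$$

where $s_i$ is the score assigned to the i-th risk sub-factor, giving an overall range of 0 to 25 that is then re-ranked into five classes.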
After determining the overall risk level (HGR_Scale) of the TB, the risk mapping
of TB cases was created using inverse distance weighted (IDW) interpolation. This
method was conducted after using reclassification and weighted overlay tools and
overlaying the actual cases of TB, as illustrated in Fig. 2. The map shows five cate-
gories: the blue-coloured class indicates low-risk areas, and the red-coloured class
shows high-risk areas. The numbers of TB are distributed based on the patient’s
location or address, which was also included in the map.
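For reference, IDW estimates the value at an unsampled location as a distance-weighted average of nearby observations; this is the standard formulation rather than one quoted from the paper:

$$\hat{z}(x_0) = \frac{\sum_{i=1}^{n} w_i\, z(x_i)}{\sum_{i=1}^{n} w_i}, \qquad w_i = \frac{1}{d(x_0, x_i)^{p}}$$

where $z(x_i)$ are the observed risk values at the TB case locations, $d(x_0, x_i)$ is the distance from the prediction point to observation i, and p is the power parameter (commonly p = 2).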
The SNA technique was also used for analysing the patterns of relationships
and the structure of social networks to gain insights into various aspects of social
behaviour, information flow, and influence within a network. SNA is frequently
represented graphically using nodes (points) and edges (lines) to show the complexity
of social networking. SNA can be categorised into some types of approaches, such
as degree centrality and betweenness centrality.
The social network centrality connectivity measure looks at in-degree (number
of inbound links) and out-degree (number of outbound links) as distinct measures.
Network centrality measures a node’s influence based on the number of links it has
to other nodes in the network. In this way, the network centrality of the disease
network indicates how well connected a node is, how many links its connections have,
and so on through the network. The detailed information needed in this analysis is
shown below.
Nodes = Home Location and Workplace Location.
Edges =
• Source (Home Location)
• Target (Group of Workplace Locality)
• Weight.
For example, from the TB cases in the Klang spreadsheet, a further filtering process
was applied and a new spreadsheet was created containing complete home address and
workplace data for use as nodes in the human mobility study. The origin–destination data of
this human mobility consist of 518 nodes. The formatting and attribute table of nodes
were also extracted before being imported into Gephi software. After importing the
node attribute table onto Gephi, the node visualisation, as shown in Fig. 3, consists
of 518 nodes. The line in the network graph has a direction since this edge represents
human mobility (OD), which consists of source-target attributes. The source node
represents the home location, and the target represents the workplace location. The
edge visualisation appears after importing the edge attribute table onto Gephi.
Selected TB cases with complete home and workplace address information partic-
ipated in this SNA analysis, which comprised 139 workplace nodes. A selected node
was used as the primary node, and buffering analysis was made from ArcGIS with a
diameter of 250 m, similar to the high-risk rank for the factory distance in one of the
TB risk factors in this study. Target nodes (workplace) that fall within the buffer zone
will be grouped into one group as a target. A new column was added as a weight to
be used to analyse their social network graph and their centrality.
The home and destination of human mobility data are tagged as a unique ID to
make it easier to interpret and read when exporting to SNA software. In Fig. 4, the
edge (human mobility) weightage was taken from the result of the GIS-MCDM final
score. The eigenvector values were scaled in 5 levels, the same as the GIS-MCDM
process, and the value was used as a map interpolation parameter. The human mobility
analysis needs to be carried out with the help of SNA and integrated with GIS to make
this study more realistic and reliable. These filtered data are used as a separate risk
factor to export into social network analysis software to study human mobility, social
network graphs, and network centrality (Fig. 4). The number on the nodes represents
the group of workplace IDs, while the arrow shows the amount of TB casework at
the same workplace group. In the social network graph below, the highest-centrality
nodes appear more prominent and darker.
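The study performs this analysis in Gephi; the sketch below shows an analogous computation in Python with networkx, with assumed column names for the origin–destination edge list.

```python
import pandas as pd
import networkx as nx

# Edge list: source = home location, target = workplace group,
# weight = GIS-MCDM final score (file and column names are illustrative)
edges = pd.read_csv("tb_mobility_edges.csv")

G = nx.from_pandas_edgelist(edges, source="source", target="target",
                            edge_attr="weight", create_using=nx.DiGraph())

# In-degree and out-degree treated as distinct connectivity measures
in_deg = nx.in_degree_centrality(G)
out_deg = nx.out_degree_centrality(G)

# Eigenvector centrality, weighted by the GIS-MCDM score on each edge
eig = nx.eigenvector_centrality(G, weight="weight", max_iter=1000)

# The highest-centrality workplace groups are candidate transmission hotspots
print(sorted(eig.items(), key=lambda kv: kv[1], reverse=True)[:10])
```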
Generally, the two leading indicators used in this study are environmental and human
factors. These essential risk indicators of TB are used in this study, as suggested by
previous studies (Table 3). Five factors were chosen: urbanisation, built-up area,
population, and type of housing as environmental factors while risk group as human
factor. Environmental factors, including climate and land usage, affect TB bacterium
survival, operations, and growth. Environmental variables influence daily life and
are essential in causing health disparities across geographic locations. Human and
environmental factors considerably impact human health directly or indirectly. Mean-
while, human is a field that considers both the physical and mental features of persons
at high risk of contracting TB.
Based on these five factors, land use (urbanisation) and housing type are potential
dominant factors in the study area. According to TB scholars [30], the burden of
tuberculosis (TB) disease is typically thought to be higher in urban settings than in
rural settings due to overcrowding, high disease (HIV) prevalence, and occupational
transmission. However, it is still being determined whether there is a difference in the
prevalence of recurring tuberculosis between rural and urban areas. Scientists [31],
on the other hand, agreed that the patterns depended on dwelling types. Housing
features could be helpful in an ecological study of tuberculosis. These findings have
significant design and health implications for Asian cities that are becoming denser
and higher. TB is considered a poor man’s sickness. It has also been linked to poor
living conditions. This study investigates the link between TB outcome and housing
characteristics, which reflects socioeconomic standing.
The high-risk area of the existing high TB risk factor is shown in Fig. 5 using
interpolation and overlay analysis of GIS as studied by ref. [7]. These two techniques
are static presentations because they depend on the quality of the disease data, which
only refers to the patient’s place of residence. Infectious diseases also spread due
to the patient’s place of mobility, such as working and recreational places. Besides
this, interpolation results are only as good as the quality of the input data. If there are
errors or biases in the data, these can lead to misleading results. Similarly, overlay
analysis heavily relies on the accuracy and quality of input data layers. If the data
used in the analysis are not accurate, up to date, or complete, the analysis can produce
ambiguous results. To mitigate these drawbacks, it is essential to use high-quality data and apply
appropriate statistical techniques such as human mobility data and SNA techniques.
The map of the TB hotspot area mainly focuses on the central part of Klang district.
The dots point represents TB distribution cases, and raster grouping values were
created into classes in the form of five quantile classes. Red represents a higher-risk
area of TB cases, while blue represents a low-risk area of TB cases.
In this integration technique, the additional factor, the eigenvector centrality value
from the SNA technique, was exported into GIS-MCDM analysis. The result shows
a target (workplace) using integration GIS-MCDM from TB risk factor rank and
SNA that have already been determined. The five classes shown in the map (Fig. 6)
represent additional eigenvector centrality values.
SNA offers a multidimensional view of the complex interactions between individ-
uals and their social environments rather than the geographical proximity elements
as displayed in GIS-MCDM techniques. Based on social connections and interac-
tions, SNA can help identify critical locations or communities with higher TB risk
(Table 4). Public health officials can pinpoint areas where TB transmission might be
more likely by analysing patterns of social relationships, such as who interacts with
whom and how frequently.
SNA can also reveal how TB spreads within communities by tracing infection
paths through social networks. This understanding is crucial for designing targeted
interventions and controlling outbreaks. In local exploratory findings, [10] has also
demonstrated the capabilities of visual analytics and exploratory analysis of mobility
data to enhance the static GIS mapping of TB. The findings also have
shown that the SNA of human mobility factors can potentially influence infectious
disease by integrating with GIS mapping. The area where the high-centrality nodes
overlap the TB hotspot, in the centre of the map, is primarily red.
The high-risk comparative map in Table 4 shows that the existing TB risk area
map only depicts a few high-risk areas compared to other analysis techniques. The
high-risk area shows that integrating GIS-MCDM with SNA produces more high-risk
areas. It shows that new high-risk areas emerge when SNA integration is used, potentially
revealing new locations of high-risk TB areas. Using integration techniques from
GIS-MCDM and SNA depicts the highest percentage of high-risk areas compared
to other techniques. This highest risk area percentage proves the role of SNA-GIS
in enhancing the general mapping of TB and detecting potential TB areas such as
using GIS and GIS-MCDM only as conducted by previous studies [5–7, 11, 12, 24,
28, 32, 33].
5 Conclusion
Risk disease mapping is one approach in geospatial technology for analysis and
communication in public health and epidemiology. Human mobility is one of the
main factors that can cause TB outbreaks in Malaysia, but this factor is considered
silent in producing a precision disease mapping. Therefore, this study has combined
GIS-MCDM and SNA methods to evaluate the results better than general TB disease
mapping using GIS-based techniques. SNA is based on the theoretical constructs
of sociology and the mathematical foundations of graph theory, where it is used to
study network structure and understand how it influences health. The combination of GIS-MCDM
and SNA techniques has identified risk areas more precisely because the result of
previously performed TB case overlay and verification with a risk map using GIS-
MCDM-SNA or SSNAV is better than the result produced using only the GIS-MCDM
or general GIS mapping method. SNA is also proven to be a beneficial technique to
better understand the movement of individuals between specific locations and the
contacts among different groups of people, both of which are essential for predicting the
future movement and interaction patterns of infectious diseases.
References
11. Rasam ARA, Mohd Shariff N, Dony JF, Maheswaran P (2017) Mapping risk areas of tubercu-
losis using knowledge-driven GIS model in Shah Alam, Malaysia. Pertanika J Soc Sci Humanit
2:135–144
12. Ridzuan N, Abdul Rasam A, Isa M, Shafie F (2021) Spatial interaction between lifestyles
and tuberculosis: an expert and public participatory GIS in Malaysia. Int J Geoinformatics
17(5):178–192. https://doi.org/10.52939/ijg.v17i5.2033
13. Gómez-Barroso D, Rodriguez-Valín E, Ramis R, Cano R (2013) Spatio-temporal analysis of
tuberculosis in Spain, 2008–2010. Int J Tuberc Lung Dis. https://doi.org/10.5588/ijtld.12.0702
14. Tomita A, Smith CM, Lessells RJ, Pym A, Grant AD, de Oliveira T, Tanser F (2019) Space-time
clustering of recently-diagnosed tuberculosis and impact of A.R.T. scale-up: evidence from an
H.I.V. hyper-endemic rural South African population. Sci Rep. https://doi.org/10.1038/s41598-
019-46455-7
15. Shah P (2021) IGiS for managing cities smartly-urban geoinformatics. Int Arch Photogramm,
Remote Sens Spatial Inf Sci—ISPRS Arch. https://doi.org/10.5194/isprs-Archives-XLVI-4-
W5-2021-509-2021
16. Patil NB, Nalawade SS (2016) Applications of GIS to smart cities. Int J Recent Adv Eng
Technol
17. Masimalai P (2014) Remote sensing and geographic information systems (GIS) as the applied
public health and environmental epidemiology. Int J Med Sci Public Health. https://doi.org/10.
5455/ijmsph.2014.081020141
18. Sabesan S, Palaniyandi M, Das PK, Michael E (2000) Mapping lymphatic filariasis in India.
Ann Trop Med Parasitol. https://doi.org/10.1080/00034983.2000.11813582
19. Palaniyandi M (2013) GIS for mapping updates of spatial spread and the ecological reasoning
of JE transmission in India (1956–2012). J Geomatics
20. Boyacı AÇ, Şişman A (2022) Pandemic hospital site selection: a GIS-based MCDM approach
employing pythagorean fuzzy sets. Environ Sci Pollut Res Int 29(2):1985–1997. https://doi.
org/10.1007/s11356-021-15703-7
21. Liu K (2022) GIS-based MCDM framework combined with coupled multi-hazard assessment
for site selection of post-earthquake emergency medical service facilities in Wenchuan, China.
Int J Disaster Risk Reduction 73:102873. https://doi.org/10.1016/j.ijdrr.2022.102873
22. Hijriani H, Cahyani A (2021) Web GIS-based assessment using S.A.W. methods to identify
high-risk areas of tuberculosis transmission and incidence in Bandar Lampung City. J Phys:
Conf Ser 1751:012033. https://doi.org/10.1088/1742-6596/1751/1/012033
23. Mekouar S (2018) Social network analysis: basic concepts, tools, and applications. In: Hand-
book of research on contemporary perspectives on web-based systems. https://doi.org/10.4018/
978-1-5225-5384-7.ch017
24. Kaur KK, Said SM, Lim PY, Ismail SNS (2020) Urbanisation and tuberculosis in Peninsular,
Malaysia (2011–2015). Malays J Med Health Sci
25. Liu Y, Li X, Wang W, Li Z, Hou M, He Y, Guo X (2012) Investigation of space-time clusters
and geospatial hot spots for the occurrence of tuberculosis in Beijing. Int J Tuberc Lung Dis.
https://doi.org/10.5588/ijtld.11.0255
26. De Queiroga RPF, de Sá LD, Nogueira JA, de Lima ERV, Silva ACO, Pinheiro PGOD, Braga
JU (2012) Spatial distribution of tuberculosis and relationship with living conditions in an
urban area of Campina Grande—2004 to 2007. Rev Bras Epidemiol. https://doi.org/10.1590/
S1415-790X2012000100020
27. Feske ML, Teeter LD, Musser JM, Graviss EA (2011) Including the third dimension: a spatial
analysis of TB cases in Houston Harris county. Tuberculosis. https://doi.org/10.1016/j.tube.
2011.10.006
28. Maciel EL, Pan W, Dietze R, Peres RL, Vinhas SA, Ribeiro FK, Palaci M, Rodrigues RR,
Zandonade E, Golub JE (2010) Spatial patterns of pulmonary tuberculosis incidence and their
relationship to socioeconomic status in Vitoria, Brazil. Int J Tuberc Lung Dis 14(11):1395–1402
29. Lapshina IS, Tsybikova EB, Kotlovskiy MY (2022) Groups at high risk of developing respira-
tory tuberculosis among adult population of Kaluga oblast. Tuberc Lung Dis. https://doi.org/
10.21292/2075-1230-2022-100-11-20-28
30. Mutembo S, Mutanga JN, Musokotwane K, Kanene C, Dobbin K, Yao X, Whalen CC (2019)
Urban-rural disparities in treatment outcomes among recurrent TB cases in Southern Province,
Zambia. BMC Infect Dis. https://doi.org/10.1186/s12879-019-4709-5
31. Cobey S, Gouma S, Parkhouse K, Chambers BS, Ertl HC, Schmader KE, Grad YH (2018)
Poor immunogenicity, not vaccine strain egg adaptation, may explain the low H3N2 influenza
vaccine effectiveness in 2012–2013. Clin Infect Dis. https://doi.org/10.1093/cid/ciy097
32. Azewan MDH, Abdul Rasam AR (2020) Disease mapping and health analysis using free and
open source software for geospatial (FOSS4G): an exploratory qualitative study of tuberculosis.
In: Alias N, Yusof R (eds) Charting the sustainable future of ASEAN in science and technology.
Springer, Singapore. https://doi.org/10.1007/978-981-15-3434-8_43
33. Abdul Rasam AR, Jumali WNS, Abdul Jalil I, Muhamad Jaelani L (2023) Susceptibility risk
index mapping of population at tuberculosis epidemic risk. J ASIAN Behav Stud 8(24):53–65.
https://doi.org/10.21834/jabs.v8i24.423
Evaluation of Machine Learning
Algorithms for Early Prediction of Liver
Disease
1 Introduction
The liver is a major organ in the body and the main metabolic site of xenobiotics,
medicines and nutrients [1]. Liver disease contributes to the deaths of around 2 million
people globally [2]. Diagnosing liver disease involves measuring certain biomarkers
in blood such as enzymes and proteins [3]. Common biomarkers used for liver func-
tion testing include total bilirubin (TB), direct bilirubin (DB), alkaline phosphatase
(ALP), alanine transaminase (ALT), aspartate aminotransferase (AST), total proteins
(TP), albumin (Alb) and albumin-to-globulin (AG) ratio [3]. TB indicates the level
of bilirubin in blood, where bilirubin informs about the amount of waste generated
by the liver. DB is the conjugated bilirubin. ALP and AST are enzymes present in
the liver and other organs. Alb and globulin are proteins in which levels change in
case of liver disease.
However, the levels of these biomarkers may not change in early stages of liver
disease and are mostly identified when the liver is partly damaged [4]. In such cases,
early diagnosis of liver disease is life-saving and prevents many complications.
In this respect, machine learning algorithms (MLAs) have proven successful in
diagnosing of medical conditions and thereby reducing the risk linked to these condi-
tions. MLAs have been applied in classification and prediction of different liver
conditions. For instance, classification and prediction of hepatitis and cirrhosis was
made using Naïve Bayes (NB) and support vector machine (SVM) where SVM
was shown to be a better and more efficient classifier in terms of time and accuracy [5].
SVM also showed high accuracy in classifying liver diseases when compared
to other algorithms such as Bayesian network and random forest (RF) [6]. SVM, as
well as logistic regression (LR) classifier, also showed high performance in predicting
liver disease with accuracy > 75% [7]. NB and LR alongside other models including
RF and artificial neural networks (ANN) were used to identify patients at risk of fatty
liver for patients above 30 years old [8]. This study highlighted the importance of
addressing missing values in datasets and normalization, and applied the synthetic
minority oversampling technique (SMOTE) [8].
Subsequently, SMOTE was applied in another study for predicting liver disease
and improved accuracy to > 98% [9]. Similar results were obtained in the study by Gupta
et al. [10] where SMOTE proved effective in balancing the data prior to applying
the MLAs [10]. However, in the latter study, RF and adaptive boosting (AdaBoost)
gave higher classification accuracy than other MLAs. Likewise, in another study
RF gave an accuracy of 98% when used for classifying liver disease based on liver
function tests [11].
Considering the differences in the results between previous studies, this study
has built on the findings of previous studies by evaluating seven MLAs for classi-
fying liver diseases based on characteristic features related to liver function tests.
These algorithms included logistic regression (LR), decision tree (DT), random
forest (RF), AdaBoost, extreme gradient boost (XGBoost), support vector machine
(SVM) and Naïve Bayes (NB). Prior to application of the analytics the study
assessed the data preprocessing approach in terms of data cleaning, imputation and
exploratory data analysis. Models were then evaluated after application with and
without hyperparameter tuning.
In particular, the research tackled the following objectives: (a) building a predictive
model for early diagnosis of liver disease; (b) analyzing risk factors related to liver
disease; (c) comparing several MLAs in terms of accuracy of identification of liver
diseases; and (d) evaluating the performance of MLAs based on evaluation metrics
including accuracy, recall/sensitivity, AUC-ROC and F1-score.
2 Methods
2.1 Dataset
Dataset used in this study was the Indian Liver Patient Dataset (ILPD) (https://archive.
ics.uci.edu/dataset/225/ilpd+indian+liver+patient+dataset) and was obtained from
University of California Irvine Machine Learning Repository [12]. ILPD was gener-
ated from patients in Northeast of Andhra Pradesh (India). It consisted of 11 features
for 583 patients that were: gender of patients, presence or absence of a liver disease
and biomarkers (of liver function tests level in blood). Table 1 shows the details of
the biomarkers in the dataset. The gender distribution of males to females in the
dataset was 441:142. Moreover, 416 (71.4%) patients had liver disease. All patients
included in the dataset were of age below 90 years old.
Seven machine learning models were used for classification and included LR, DT,
RF, AdaBoost, XGBoost, SVM and NB. The software requirements for applying the
models included: data format: CSV; language: Python 3.8.10; package manager:
Anaconda 2.1.2; interactive computing platform: Jupyter Notebook 6.4.12; and
Python libraries: Pandas, NumPy, Matplotlib and Seaborn, Scikit and Sklearn.
The LR model was based on probability and predicted a binary outcome, namely the
presence or absence of liver disease [14]. DT encompassed tree-like models that
utilized a natural human-like decision process by splitting the data into multiple sets
and further splitting those sets into other sets, and so on [15]. RF comprised an
ensemble method that combines multiple decision trees, each grown individually;
thus, predictions were based on a group of models [1]. XGBoost was
based on gradient boosting (GB) with log loss as the cost function, but used a more
advanced implementation of GB that can handle missing values [16]. AdaBoost
was based on adaptive boosting and was useful for short decision trees [17].
SVM divided datasets based on decision boundaries in n-dimensional space [1]. NB
was based on Bayes’ theorem with the strong naïve assumption that the
features are independent of one another [18]. Each model was evaluated before and after
hyperparameters’ tuning. Table 2 shows the hyperparameters used for each model.
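Table 2 is not reproduced in this extract. The sketch below shows how the seven models could be trained and tuned with scikit-learn's GridSearchCV; the hyperparameter grids, file name and column names are assumptions for illustration, not the paper's actual settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

df = pd.read_csv("ilpd.csv")               # ILPD records (583 rows, assumed headers)
df = df.dropna()                           # drop the few records with missing values
X = pd.get_dummies(df.drop(columns=["Selector"]), columns=["Gender"])
y = (df["Selector"] == 1).astype(int)      # 1 = liver disease

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    "LR": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "DT": (DecisionTreeClassifier(), {"max_depth": [3, 5, 10]}),
    "RF": (RandomForestClassifier(), {"n_estimators": [100, 300]}),
    "AdaBoost": (AdaBoostClassifier(), {"n_estimators": [50, 100]}),
    "XGBoost": (XGBClassifier(eval_metric="logloss"), {"max_depth": [3, 6]}),
    "SVM": (SVC(probability=True), {"C": [1, 10], "kernel": ["rbf", "linear"]}),
    "NB": (GaussianNB(), {}),
}

for name, (model, grid) in models.items():
    search = GridSearchCV(model, grid, scoring="accuracy", cv=5)
    search.fit(X_train, y_train)
    print(name, round(search.score(X_test, y_test), 3))
```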
For model validation, the following metrics were used: accuracy, precision, recall,
area under the receiver operating characteristic curve (AUC-ROC) and F1-score [4].
These metrics are calculated taking into account true positives (TP), false positives
(FP), true negatives (TN) and false negatives (FN). Accuracy was calculated as the
number of correct predictions divided by the total number of predictions (Eq. 1).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
Precision was calculated as the number of true positives divided by the total
number of positive predictions (Eq. 2).
$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$
Recall was defined as the ability of the model to detect positive samples (Eq. 3).
$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$
AUC-ROC indicated the area under the ROC that separated the TP from the FP
(Eq. 4).
$$\text{AUC} = \int_{0}^{1} \text{ROC}(t)\, dt \tag{4}$$
Here, TP are true positives, TN are true negatives, FP are false positives, and
FN are false negatives.
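Continuing the previous sketch, these metrics map directly onto scikit-learn helpers, which compute Eqs. (1)–(4) from the confusion-matrix counts:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Evaluate one tuned model (here, the last fitted GridSearchCV) on the test set
y_pred = search.predict(X_test)               # predicted class labels
y_prob = search.predict_proba(X_test)[:, 1]   # probability of the positive class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("AUC-ROC  :", roc_auc_score(y_test, y_prob))
```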
Exploratory data analysis (EDA) showed that men outnumbered women and repre-
sented most patients: the numbers of men and women in the study were 441
(75.6%) and 142 (24.4%) respectively. Liver disease in men was high in patients >
40 years old and that explained the high percentage of patients with liver diseases in
the dataset (71.4%) [19].
EDA was followed by correlation analysis that indicated high correlation among
ALT, AST, TB, DB and liver disease [3]. On the other hand, age and TP did not show
a high correlation with liver disease (Fig. 2). Thus, the correlation map indicated
that age was the least important variable, whereas AG ratio and Alb were the most
important.
ML models showed high performance in classifying liver disease. This was evident
in the measured metrics (Table 3), which presents the performance of the models prior
to hyperparameter tuning.
In this sense, the accuracy of the models was in the range of 0.66–0.8 with the
XGBoost model showing the highest accuracy. This latter model showed high perfor-
mance in other metrics. Thus, values for recall, AUC-ROC, precision and F1-score
of the XGBoost model were 0.89, 0.8, 0.87 and 0.78 respectively. It is worth noting
that NB showed a higher recall of 0.95 yet performed poorly on the other
metrics.
Of all the models, LR and SVM showed the worst performance, which confirmed
the findings in other studies. In other studies, RF and XGBoost showed metric values
in the range of 76–100% and that outweighed other algorithms such as LR, SVM,
KNN and DT [4, 7, 20, 21].
Yet in the present study, performance of SVM improved post-tuning of hyperpa-
rameters (Table 3). This was seen in the recall value that increased from 86 to 99%.
Moreover, all the other metrics’ values for SVM were above 96%. This assured the
4 Conclusions
In conclusion, the objectives of the study were achieved: seven MLAs for predicting
liver disease at an early stage were successfully established. The MLAs showed high
prediction accuracy based on eight features related to liver function tests, with
performance depending on the tuning of the hyperparameters. Hyperparameter tuning
improved the predictions of several models, including SVM, RF, XGBoost and AdaBoost.
However, the models were developed on only 583 patients' records, and such a limited
sample size could give rise to overfitting. Future work will therefore benefit from
applying the machine learning models to a larger sample. Moreover, including patient
characteristics related to socioeconomic status, lifestyle, health state, beliefs and
certain protected characteristics could further improve the classification of the disease.
Acknowledgements The authors would like to thank UNITAR International University for
supporting this paper.
References
Ruipeng Tang, Narendra Kumar Aridas, and Mohamad Sofian Abu Talip
Abstract With the promotion of the Internet of Things, big data, and other technologies
in agricultural production, farmers have become accustomed to searching for crop
cultivation information on the Internet. However, traditional information search is
time-consuming, inefficient, and inaccurate, so farmers urgently need an efficient, fast,
and accurate information recommendation method. This study therefore proposes an
agricultural information recommendation model based on a matrix decomposition
knowledge graph algorithm (MDKG algorithm). It introduces a matrix decomposition
algorithm based on the knowledge graph, which constructs and trains the interaction
matrix between users and agricultural information. It then solves the decomposed
interaction matrix and obtains the correlation score between users and agricultural
information through calculation. Experimental results show that the MDKG algorithm
outperforms the DKN and RippleNet algorithms in the AUC and ACC of information flow
clicks and in the recall and precision of cold start recommendations, which indicates
that the algorithm can deeply mine user preference characteristics to improve
click-through rates and performs better in cold start scenarios. It can also better
alleviate the negative impact of data sparseness, which improves the efficiency with
which farmers obtain agricultural information.
1 Introduction
With the rapid growth of agricultural information, traditional keyword and similar-word
search methods can no longer meet farmers' needs for crop cultivation information
because they are time-consuming and of low accuracy [1]. Although the traditional
collaborative filtering (CF) recommendation method can summarize a user's search
habits and preferences from the user's past search content [2], it requires item data
to be available when the recommendation model is trained; otherwise, it faces a cold
start problem, and it is difficult to query the side features of entity resources. The
concept of the Knowledge Graph (KG) [3] was first proposed by Google. It takes keyword
retrieval as its core and builds a "question–answer" model to present effective
information to users. Since information in the agricultural field is extensive and has
certain regionality, timeliness, and diversity, knowledge graphs for agricultural
information retrieval have become a current research hotspot.
Yusof et al. [4] pointed out that mobile application systems have been widely used in
people's daily lives; their work proposes an expert system for diagnosing crop diseases
and insect pests, which is a useful contribution to the agricultural field. In order to
improve agricultural productivity and economic benefits, Komiya and Abe [5] suggested
the use of information and communication technology: they acquired agricultural
knowledge from web resource databases and established an agricultural knowledge system
grounded in interconnected open data. Wang et al. [6] constructed the Ningxia rice
knowledge map with the help of professional books and rice data centers. First, the
model and data layers were constructed; then knowledge fusion, reasoning, and
completion were performed to form the rice knowledge map. Wang et al. [7] constructed
a knowledge graph of rice diseases. They used rice disease entries in large-scale
knowledge bases such as Baidu Encyclopedia as data sources and applied pattern-matching
methods to extract knowledge from the data. After data integration and other knowledge
graph construction processes, rice diseases were presented by visualizing the disease
knowledge graphs. Cui et al. [8] proposed using the rich agricultural technology
resources on the Internet to build an agricultural question-and-answer system based on
a set of common rice questions, which could address the high agricultural technology
requirements, shortage of funds, and insufficient manpower in rice planting operations.
Liu et al. [9] proposed using deep learning methods to build an agricultural production
question-and-answer system, since systems that search the Internet for answers about
agricultural planting knowledge return a large amount of useless information and
require manual secondary screening.
The above studies on knowledge graphs in agriculture mostly focus on the data of a
particular crop in the agricultural Internet of Things, such as rice diseases, and
rarely address farmers' retrieval of agricultural information on the Internet. Although
the knowledge graph can transform the recommendation problem into a ranking problem and
allow a question search to be directed to the answer, it is difficult to combine
knowledge from different data sources in a single knowledge graph. It
2 The MDKG Algorithm
The algorithm uses the knowledge graph to represent agricultural information as entity
vectors, providing data support for the later learning and training of the matrix
decomposition algorithm [10]. It then constructs the interaction matrix between users
and agricultural information and matches the entity characteristics; the matrix
decomposition model selects the t nearest agricultural information and user matrices
for training; finally, the matrix decomposition model is used to solve the decomposed
user and agricultural information matrices, and the prediction score between a user and
agricultural information is obtained through calculation. The implementation process of
this algorithm is shown in Fig. 1.
The interaction matrix is constructed between agricultural information and users, and
the user characteristic matrix and the agricultural information characteristic matrix
are obtained by decomposing it, as shown in Formula (1):

Z = X Y^K    (1)

In Formula (1), X_{i \times r} represents the r-dimensional user matrix after matrix
decomposition, Y_{j \times r} represents the r-dimensional agricultural information
matrix after matrix decomposition, and Z represents the interaction matrix composed of
the r-dimensional user matrix and agricultural information matrix. The objective
function of the model training process is shown in Formula (2):
H = \sum_{m=1}^{i}\sum_{n=1}^{j}\left(X_m Y_n^K - z_{mn}\right)S_{mn} + \omega_1\left\|X_m\right\|_R^2 + \omega_2\left\|Y_n\right\|_R^2    (2)
In Formula (2), S_{mn} indicates whether the user's rating of the recommended
information exists: if a rating exists, S_{mn} is 1; otherwise, S_{mn} is 0. z_{mn}
represents the user's true rating of the recommended information, X_m and Y_n represent
the r-dimensional user and agricultural information feature vectors after matrix
decomposition, X_m Y_n^K represents the predicted rating, and (X_m Y_n^K - z_{mn})
represents the error between the predicted and actual values. \|\cdot\|_R^2 denotes the
Frobenius norm, and \omega_1 and \omega_2 are the regularization weights constraining
the user and agricultural information feature matrices. In order to address the fitting
problem of the recommendation model, this study introduces \omega-parameter weighted
regularization. It uses the data parameters between the user, c_{x_m}, and the
agricultural information, c_{y_n}, to reduce the transition probability of the objective
function. The improved loss function is shown in Formula (3):
H = \sum_{m=1}^{i}\sum_{n=1}^{j}\left(X_m Y_n^K - z_{mn}\right)S_{mn} + c_{x_m}\left\|X_m\right\|_R^2 + c_{y_n}\left\|Y_n\right\|_R^2    (3)
The matrix decomposition algorithm is used to solve the decomposed user matrix
and agricultural information matrix, and the prediction score between the user and
agricultural information is obtained through calculation. The process of matrix
decomposition is shown in Fig. 2.
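For illustration, a minimal NumPy sketch of regularized matrix factorization in the spirit of Formulas (2) and (3) is given below; the toy interaction matrix, latent dimension and learning rate are assumptions, a squared-error form of the loss is used, and the knowledge graph similarity term of Formula (4) is omitted.

```python
# Minimal sketch of regularized matrix factorization (in the spirit of
# Formulas (2)-(3)); data and hyperparameters are illustrative assumptions,
# not the authors' settings.
import numpy as np

def factorize(Z, S, r=8, lr=0.01, reg=0.05, epochs=200, seed=0):
    """Z: user x item rating matrix; S: 0/1 mask of observed ratings."""
    rng = np.random.default_rng(seed)
    i, j = Z.shape
    X = rng.normal(scale=0.1, size=(i, r))   # user factors
    Y = rng.normal(scale=0.1, size=(j, r))   # agricultural information factors
    for _ in range(epochs):
        E = S * (X @ Y.T - Z)                # error on observed entries only
        X -= lr * (E @ Y + reg * X)          # gradient step with L2 penalty
        Y -= lr * (E.T @ X + reg * Y)
    return X, Y

# Hypothetical toy interaction matrix (0 = unobserved):
Z = np.array([[5, 0, 3], [4, 2, 0], [0, 1, 4]], dtype=float)
S = (Z > 0).astype(float)
X, Y = factorize(Z, S)
print(np.round(X @ Y.T, 2))                  # predicted user-item scores
```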
The objective function expression after matrix decomposition is:
Fig. 2 Matrix decomposition process
H = \sum_{m=1}^{i}\sum_{n=1}^{j}\left(X_m Y_n^K - z_{mn}\right)S_{mn} + \omega_1\sum_{m=1}^{i} c_{x_m}\left\|X_m\right\|_R^2 + \omega_2\sum_{n=1}^{j} c_{y_n}\left\|Y_n\right\|_R^2 + \omega_3\sum_{n=1}^{j}\left(Y_n - \sum_{Y_t} G(Y_n, Y_t)\,Y_t\right)    (4)
In Formula (4), \sum_{m=1}^{i}\sum_{n=1}^{j}\left(X_m Y_n^K - z_{mn}\right) is built by
the matrix decomposition model, the second and third terms are regularization terms to
avoid overfitting, and the last term is the potential information that uses the
knowledge graph to represent the similarity of agricultural information. c_{y_n}
represents the j neighbor sets of agricultural information y_n, and G(Y_n, Y_t) is the
similarity function. This study uses the cosine similarity function [11], whose value
range is [−1, 1], as shown in Formula (5):
G(Y_n, Y_t) = \frac{\sum_{m=1}^{r}\left(X_{mw} Y_{nw}\right)}{\sqrt{\sum_{m=1}^{r}\left(X_{mw}\right)^2}\,\sqrt{\sum_{n=1}^{r}\left(Y_{mw}\right)^2}}    (5)
In Formula (5), w represents the dimension obtained through model training, and its
normalization is shown in Formula (6). This study uses the gradient descent method to
minimize the objective function, solving the user feature matrix X and the agricultural
information feature matrix Y to establish a feedback model [12], as shown in
Formula (7):
U(X, Y) = \sum_{n=1}^{i} S_{mn}\left(X_m Y_n^K - z_{mn}\right)Y_n + \omega_1\sum_{m} j_{x_i} Y_m    (7)
After solving the feature matrices generated from the interaction matrix, the feature
values are mapped into a v-dimensional space. The closer the geometric distance between
features, the stronger their correlation. The correlation between q_n and q_m is
obtained through vector cosine similarity, as shown in Formula (8):
G(q_m, q_n) = \frac{\sum_{f=1}^{j} q_{mf}\, q_{nf}}{\sqrt{\sum_{f=1}^{j}\left(q_{mf}\right)^2}\,\sqrt{\sum_{f=1}^{j}\left(q_{nf}\right)^2}}    (8)
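A small sketch of the cosine similarity used in Formulas (5) and (8) is shown below; the feature vectors are hypothetical.

```python
# Small sketch of the cosine similarity in Formulas (5) and (8);
# the feature vectors below are hypothetical.
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity in [-1, 1] between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

q_m = np.array([0.4, 0.1, 0.8])
q_n = np.array([0.5, 0.0, 0.7])
print(round(cosine_similarity(q_m, q_n), 3))
```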
After processing the training set, the feature vector V_{X_m M_n} is constructed as
shown in Formula (9):

V_{X_m M_n} = \left\{ G(X_m, M_n)_1, G(X_m, M_n)_2, \ldots, G(X_m, M_n)_{\varphi} \right\}    (9)
The training set (X_m, V_{X_m M_n}) is input into the system resource recommendation
model, which obtains a decision function through the optimization function. It also
generates a list of recommendation results and obtains a weight set under
multi-dimensional features. This set is the user's short-term recommendation basis.
According to the survey, users are interested in agricultural information resources,
and a user's interest completely determines the user's preference for a certain
resource. This study therefore adds the impact of user preferences to the
recommendation results, uses knowledge graph technology to analyze users' long-term and
short-term preferences, and adds user preference models and impact feedback models to
the knowledge graph. In this way, accurate recommendations to the user can still be
achieved when the data are reduced or the user's interests change. The weight update
model between hybrid knowledge graph entities is shown in Formula (10):
In Formula (10), HKW_{mn} represents the weight value between the characteristic values
m and n in the agricultural information, w_{mn} represents the interest degree value
obtained by the user interest migration model for the agricultural information in the
knowledge graph, and e represents the relationship between user m and agricultural
information n. score represents the user's rating of the agricultural information, and
\delta represents the normalization factor. The normalized initial weight of
\delta \times score is 1 in order to avoid excessively high ratings affecting accuracy.
The user similarity G(X_m, M_n) from centralized training is deeply fused, combining
the agricultural information features of the knowledge graph with user interest
migration. The similarity feature G(X_m, M_n)_{mix} is extracted, and a hybrid feature
model is established, as shown in Formula (11):
3 Experimental Design
In order to verify the recommendation effect of the proposed method, this study
conducts experimental work in the environment of the TensorFlow deep learning framework
and Python 3.7.1. The experiment crawled the public dataset of the agricultural
economics section of the U.S. Agricultural Information Service website. After
processing the crawled data in the background, users' interaction data on agricultural
economics were obtained, and the interaction dataset between agricultural economics
items and target users was constructed. In the experiment, the dataset was divided into
a training set and a test set at a ratio of 7:3 [13], and the average result of
multiple experiments was used as the final result. The experimental data are shown in
Table 1.
The recommendation model in this study is used to find the relationship between
information and users, so it is important to measure performance in the two scenarios
of information flow clicks [14] and cold start recommendations [15]. In the information
flow click scenario, the Area Under the Curve (AUC) and Model Accuracy Rate (ACC) are
selected as evaluation indicators to predict users' click rate on agricultural
information. The recall and precision rates for different sample sizes are selected as
the evaluation indices in the cold start scenario. The two scenarios are evaluated
through the above indicators.
(1) Information flow click prediction
The AUC in this experiment is the area under the Receiver Operating Characteristic
(ROC) curve. The true positive and false positive rates are obtained from the
prediction results and used as the ordinate and abscissa, respectively, to generate the
ROC curve, and the area under the ROC curve is used to judge the quality of the model.
The AUC is the probability that a positive example, based on the prediction results, is
ranked before a negative example [16], as shown in Formula (12):

AUC = \frac{2\sum_{\delta}\mathrm{sort}_{\delta} - \alpha(\alpha + 1)}{2\alpha\beta}    (12)
ACC = \frac{TP + TN}{P + N}    (13)
Cold start clicks are one of the important indicators for measuring the performance of
a recommendation model, as they reflect the model's ability to alleviate the data
sparsity problem [17]. Recall represents the ratio of the number of true positive
examples to the total number of positive examples, and precision represents the ratio
of the number of true positive examples to the number of all predicted positive
examples. The higher the recall and precision, the better the cold start recommendation
performance of the model. These indicators are shown in Formulas (14) and (15).
Recall = \frac{TP}{TP + FN}    (14)

Precision = \frac{TP}{TP + FP}    (15)
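As a worked illustration of the rank-based AUC in Formula (12), the sketch below reads α as the number of positive examples and β as the number of negative examples (an assumption, since the excerpt does not define them); the labels and scores are hypothetical, and ACC, recall and precision follow directly from the confusion matrix as in Formulas (13)–(15).

```python
# Sketch of the rank-based AUC in Formula (12); no tie handling, for
# illustration only.  Labels and scores below are hypothetical.
import numpy as np

def auc_rank(y_true, y_score):
    """AUC = (2 * sum of ranks of positives - a*(a+1)) / (2*a*b)."""
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score), dtype=float)
    ranks[order] = np.arange(1, len(y_score) + 1)    # 1-based ranks, ascending score
    a = int(np.sum(y_true == 1))                     # number of positive examples
    b = int(np.sum(y_true == 0))                     # number of negative examples
    return (2 * ranks[y_true == 1].sum() - a * (a + 1)) / (2 * a * b)

y_true  = np.array([1, 0, 1, 1, 0, 0])
y_score = np.array([0.9, 0.3, 0.8, 0.6, 0.4, 0.2])
print(round(auc_rank(y_true, y_score), 3))           # 1.0 for this toy example
```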
In order to obtain the optimal parameters for model performance, this study uses the
AUC as the evaluation index, where dim is the feature dimension, len is the link length
during preference propagation, size is the size of the preference diffusion set of each
layer, \varepsilon_1 represents the weight of the knowledge graph embedding loss, and
\varepsilon_2 represents the weight of the L2 regularization term. After parameter
testing, the parameters of the model are set to feature dimension dim = 8, preference
propagation link length len = 2, preference diffusion set size per layer size = 32,
loss weight \varepsilon_1 = 0.5, and L2 regularization term weight
\varepsilon_2 = 10^{-7}.
Fig. 3 AUC model scores for users receiving information of different magnitudes
Fig. 4 ACC model scoring for users receiving different amounts of information
Fig. 5 Cold start recall for different proportions of training data sets
Fig. 6 Cold start recommendation accuracy after training on different proportions of data
4 Conclusion
References
1. Sunitha M, Adilakshmi T (2018) Music recommendation system with user-based and item-
based collaborative filtering technique [M]. Networking communication and data knowledge
engineering. Springer, pp 267–278
2. Bo L, Ping G, Yue S et al (2015) Optimization and simulation of agricultural knowledge
collaborative filtering recommendation algorithm based on user implicit evaluation [J]. Internet
Things Technol 5(08):76–79
3. Singhal A (2012) Introducing the knowledge graph: things, not strings. Official Google Blog
4. Yusof MM, Rosli NF, Othman M, et al (2018) M-DCocoa: M-agriculture expert system for
diagnosing cocoa plant diseases [M]. Recent advances on soft computing and data mining
5. Komiya K, Abe Y (2013) Question answering system using Q & A site corpus query expansion
and answer candidate evaluation [J]. SpringerPlus
6. Wang D (2020) Research and application of Ningxia rice knowledge graph construction method
[D]. Northern University for Nationalities
7. Wang S (2019) Construction and research on rice disease prevention and control model based
on knowledge graph [D]. Hunan Agricultural University
8. Cui B (2018) Research on sentence similarity calculation in rice FAQ question and answer
system [D]. Nanjing Agricultural University
9. Liu G (2019) Research and development of intelligent question-answering system for agri-
cultural production based on deep learning [D]. Capital University of Economics and
Business
10. Wang Y, Sun Y (2021) Matrix decomposition recommendation algorithm based on GNN [J].
J Comput Eng Appl 57(19)
11. Liao B, Zhang T, Yu J et al (2019) QPR-NN: a recommendation algorithm combining quadratic
polynomial regression and neural network [J]. J Xi’an Jiaotong Univ 53(9):79–87+136
12. Gu J, Xie Z, Wu J, et al (2019) Parallel collaborative filtering recommendation algorithm based
on graph walking [J]. J Intell Syst 14(04):743–751
13. Ge S, Rum SNM (2022) Key points location in infrared images of the human body based on
Mscf-ResNet. Future Internet 14:15. https://doi.org/10.3390/fi14010015
14. Zhao W, Ren X (2020) Matrix decomposition recommendation algorithm integrating label
information and time effect [J]. Inform Control 49(4):472–477,488
15. Shengbao H, Huawei Y, Xiaohui L et al (2022) Cold start recommendation algorithm based on
fused similarity and hierarchical clustering [J]. Small Microcomput Syst 5:985–991
16. Kamarasan M (2016) Unified technique for colour image retrieval based on orthogonal
polynomial model and multiresolution features[J]. Int J Image Data Fusion 7(3):243–263
17. Yue X, Tang D, Shu H et al (2020) Research on improvement of collaborative filtering
recommendation algorithm based on data sparsity [J]
18. Wang H, Zhang F, Xie X et al (2018) DKN: deep knowledge-aware network for news
recommendation [C]. Proceedings of the 2018 world wide web conference, pp 1835–1844
19. Wang H, Zhang F, Wang J et al (2018) Ripplenet: propagating user preferences on the knowledge
graph for recommender systems [C]. Proceedings of the 27th ACM international conference
on information and knowledge management, pp 417–426
20. Zhang R, Zhao N, Fu L et al (2022) Ultrasonic diagnosis method for stainless steel weld defects
based on multi-domain feature fusion[J]. Sens Rev 42(2):214–229
Performance of the Auxiliary
Information Based Hybrid EWMA Chart
with Fast Initial Response
Peh Sang Ng, Huai Tein Lim, Wai Chung Yeong, and Sajal Saha
Abstract The integration of the fast initial response (FIR) feature into a control
chart helps to increase the shift detection speed for processes with start-up quality
problems or inefficient corrective measures following an off-target process. Owing to
this salient feature, this study investigates the performance of the FIR auxiliary
information based hybrid EWMA (denoted as the AIB-HEW-FIR) chart by using the
regression estimator. To evaluate the efficiency of the AIB-HEW-FIR chart in signalling
an off-target process, the Average Run Length (ARL) and Expected ARL (EARL) are used as
performance indicators, and their values are computed using Monte Carlo simulation. The
proposed chart is then evaluated and compared with the auxiliary information based
hybrid EWMA (AIB-HEW) chart without the FIR feature. Results show that the AIB-HEW-FIR
chart is superior to the AIB-HEW chart in detecting infrequent changes in the process
mean.
P. S. Ng · H. T. Lim (B)
Department of Physical and Mathematical Science, Faculty of Science, Universiti Tunku Abdul
Rahman, 31900 Kampar, Perak, Malaysia
e-mail: [email protected]
P. S. Ng
e-mail: [email protected]
W. C. Yeong
School of Mathematical Sciences, Sunway University, 47500 Petaling Jaya, Malaysia
e-mail: [email protected]
S. Saha
Department of Mathematics, International University of Business Agriculture and Technology,
Dhaka, Bangladesh
e-mail: [email protected]
1 Introduction
As claimed by Lucas and Crosier [15], the off-target process can be caused by
start-up problems or ineffective corrective measures taken to address the off-target
process. To improve the detection ability of the CUSUM chart against the start-up
abnormalities, Lucas and Crosier [15] proposed to integrate the fast initial response
(FIR) feature into the CUSUM chart. Subsequently, Lucas and Saccucci [16] and
Steiner [17] integrated the FIR feature into the one-sided and two-sided EWMA
charts, respectively, and the results show that the proposed charts are superior to
their existing counterparts (i.e., without the FIR feature).
Recent studies on FIR include those by Ng et al. [18] and Haq and Bibi [12], in which
the FIR feature was adopted in the auxiliary information-based EWMA and dual CUSUM
charts, respectively, for process monitoring. They showed that the proposed FIR charts
surpassed their existing counterparts.
In this paper, we introduce the FIR feature into the AIB-HEW chart using a
regression-type estimator for process monitoring. The following section gives the
properties of the AIB-HEW-FIR chart, as well as its implementation steps. Section 2
then elaborates on the average run length (ARL) and expected ARL (EARL) computations of
the proposed chart; the performance of the AIB-HEW-FIR chart and the comparison between
the AIB-HEW-FIR and AIB-HEW charts are presented in that section as well. Lastly, the
main conclusions and possible future studies are given in Sect. 3.
Suppose an auxiliary variable B is correlated with the study variable A, where the
correlation between A and B is denoted as \rho. Let (A_{i,j}, B_{i,j}), for
j = 1, 2, \ldots, n, denote the bivariate random sample of size n at time
i (= 1, 2, \ldots) from a bivariate normal distribution, i.e.,
(A_{i,j}, B_{i,j}) \sim N_2(\mu_A, \mu_B, \sigma_A^2, \sigma_B^2, \rho). Here, \mu_A
and \mu_B are, respectively, the population means of variables A and B, while
\sigma_A^2 and \sigma_B^2 are, respectively, their variances. Note that
\mu_A = \mu_{A_0} + \delta\sigma_A, where \delta is the size of the standardized mean
shift and \mu_A = \mu_{A_0} (the in-control
where \bar{A}_i and \bar{B}_i, respectively, denote the ith sample means of A and B.
The mean and variance of D_{A_i} are then E(D_{A_i}) = \mu_A and
V(D_{A_i}) = \sigma_A^2(1 - \rho^2)/n, respectively. Note that n is the sample size of
the chart.
Based on D_{A_i} in Eq. (1), the plotting statistic of the AIB-HEW-FIR chart is given
as:
where
V_i = \lambda_2 D_{A_i} + (1 - \lambda_2)V_{i-1}, \quad 0 < \lambda_2 \le 1.
E(W_i) = \mu_{A_0}    (3)
and
Var(W_i) = \frac{\sigma_A^2\left(1 - \rho^2\right)}{n}\,\frac{\lambda_1^2\lambda_2^2}{\left(\lambda_1 - \lambda_2\right)^2}\left[\sum_{c=1}^{2}\frac{(1 - \lambda_c)^2\left(1 - (1 - \lambda_c)^{2i}\right)}{1 - (1 - \lambda_c)^2} - \frac{2(1 - \lambda_1)(1 - \lambda_2)\left(1 - (1 - \lambda_1)^i(1 - \lambda_2)^i\right)}{1 - (1 - \lambda_1)(1 - \lambda_2)}\right].    (4)
where a is a proportion of the distance from the starting value and b is the adjustment
parameter, defined as \frac{(-2/\log(1-a)) - 1}{19}. To obtain a 50% head start,
Steiner [17] recommended setting the values of a and b to 0.5 and 0.3, respectively.
Following the mean and variance of W_i in Eqs. (3) and (4), respectively, and the FIR
feature in Eq. (5), the upper control limit (UCL_i), central line (CL_i), and lower
control limit (LCL_i) of the AIB-HEW-FIR chart are then derived as:
UCL_i = \mu_{A_0} + K(FA)\sqrt{Var(W_i)},    (6a)

CL_i = \mu_{A_0},    (6b)

and

LCL_i = \mu_{A_0} - K(FA)\sqrt{Var(W_i)},    (6c)
Fig. 2 Mechanism of the AIB-HEW-FIR chart for the exact shift size
Fig. 3 Mechanism of the AIB-HEW-FIR chart for the unknown shift size
To compute the run length properties of control charts, the Monte Carlo simulation
approach was adopted by Haq [8], Haq et al. [10], Abbas et al. [11], and Ng et al. [14]
for the HEW, AIB double EWMA t, AIB EWMA, and AIB VSI EWMA t charts, respectively. In a
similar manner, this paper adopts Monte Carlo simulation with 50,000 replications to
determine the run length properties (i.e., the ARL and EARL) of the AIB-HEW-FIR chart
under the assumption that \mu_{A_0} = \mu_B = 0 and \sigma_A = \sigma_B = 1. Here, the
ARL measures the average number of plotted plotting statistics needed before the first
out-of-control signal is triggered by the control chart when the shift size is known by
the practitioners. The ARL is further categorized as the in-control ARL (ARL0) when the
process is on-target or the out-of-control ARL (ARL1) when the process is off-target,
whereas the EARL measures the expected average number of plotted plotting statistics
needed by a chart to signal an out-of-control situation over a shift interval
(i.e., (\delta_{min}, \delta_{max})) when the exact shift size is not known by the
practitioners.
Note that the value of the in-control EARL is set to be similar to that of ARL0, while
the EARL is known as the out-of-control EARL (EARL1) when the process is off-target. A
control chart that results in smaller ARL1 and EARL1 values is desirable, as this
indicates that the off-target process is detected quickly. A control chart is
considered to perform better than competing charts if it yields smaller ARL1 and EARL1
values when all the charts have the same ARL0.
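A hedged sketch of such a Monte Carlo run-length simulation is given below. Because Eqs. (1), (2) and (5) are not reproduced above, the regression estimator for D_Ai, the outer EWMA recursion for W_i and the Steiner-type FIR factor are assumptions based on the cited literature (Haq [8]; Steiner [17]) rather than the authors' exact definitions.

```python
# Hedged sketch of a Monte Carlo run-length estimate for an AIB hybrid EWMA
# chart with a Steiner-type FIR adjustment; D_Ai, the W_i recursion and the
# FIR factor below are assumptions, not the authors' exact code.
import numpy as np

def run_length(K, lam1=0.10, lam2=0.25, rho=0.50, n=1, delta=0.0,
               a=0.5, b=0.3, max_len=100_000, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    mu_A0 = mu_B = 0.0
    sig_A = sig_B = 1.0                     # in-control assumptions used in the paper
    cov = [[sig_A**2, rho * sig_A * sig_B], [rho * sig_A * sig_B, sig_B**2]]
    var_D = sig_A**2 * (1 - rho**2) / n
    V = W = mu_A0
    for i in range(1, max_len + 1):
        AB = rng.multivariate_normal([mu_A0 + delta * sig_A, mu_B], cov, size=n)
        D = AB[:, 0].mean() + rho * (sig_A / sig_B) * (mu_B - AB[:, 1].mean())  # assumed Eq. (1)
        V = lam2 * D + (1 - lam2) * V                                           # given recursion
        W = lam1 * V + (1 - lam1) * W                                           # assumed Eq. (2)
        s = sum((1 - lam)**2 * (1 - (1 - lam)**(2 * i)) / (1 - (1 - lam)**2)
                for lam in (lam1, lam2))
        s -= (2 * (1 - lam1) * (1 - lam2) * (1 - ((1 - lam1) * (1 - lam2))**i)
              / (1 - (1 - lam1) * (1 - lam2)))
        var_W = var_D * lam1**2 * lam2**2 / (lam1 - lam2)**2 * s                # Eq. (4)
        fa = 1 - (1 - a)**(1 + b * (i - 1))                                     # assumed Eq. (5)
        if abs(W - mu_A0) > K * fa * np.sqrt(var_W):                            # limits (6a)-(6c)
            return i
    return max_len

def arl(K, reps=5_000, **kw):
    rng = np.random.default_rng(1)
    return np.mean([run_length(K, rng=rng, **kw) for _ in range(reps)])

# e.g., an out-of-control ARL for one tabulated K value (slow but illustrative):
# print(arl(K=2.6394, lam2=0.25, rho=0.50, delta=0.50, reps=1_000))
```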
Table 1 presents the K and ARL values for the AIB-HEW and AIB-HEW-FIR
charts when the parameter values of n = 1, ρ ∈ {0, 0.25, 0.50, 0.75, 0.90, 0.95},
(λ1 , λ2 ) ∈ {(0.10, 0.1000001), (0.10, 0.25), (0.10, 0.50), (0.10, 0.75)}, δ ∈ {0,
0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 2.00}, and ARL0 is set to be close to 500. Note
that the values of n and the combination of (λ1 , λ2 ) adopted in this study are set to
be similar to the HEW chart proposed by Haq [8]. From Table 1, the AIB-HEW-
FIR chart is shown to be more efficient than the AIB-HEW chart for all cases when
δ ≤ 1.5, that is, by resulting in smaller ARL1 values. For example, when ρ =
0.95 and δ = 0.25, the ARL1 values for the AIB-HEW-FIR (AIB-HEW) chart are
given as 7.514 (10.692), 7.771 (11.366), 7.799 (11.777), and 7.742 (11.980) when
(λ1 , λ2 ) = (0.1, 0.1000001), (0.1, 0.25), (0.1, 0.5), and (0.1, 0.75), respectively.
Table 1 also shows that the ARL1 values of the AIB-HEW-FIR chart decrease for
all combinations of (λ1 , λ2 ) when the shift size (δ) increases. For example, when
(λ1 , λ2 ) = (0.1, 0.1000001) and ρ = 0.25 is considered, the ARL1 values for δ =
0.25, 0.50, 0.75, 1.00, 1.25, 1.50, and 2.00 are 64.920, 17.540, 7.984, 4.508, 2.865,
2.032, and 1.340, respectively. This indicates that a larger δ is easier to be detected
and thus the number of samples needed to detect the off-target process is smaller.
Additionally, when the ρ value increases, the ARL1 values of the AIB-HEW-FIR
chart decrease. For example, the ARL1 values in Table 1 drop from 20.538 to 19.135,
15.245, 8.764, 3.745, and 1.984 as ρ increases from 0 to 0.25, 0.50, 0.75, 0.90, and
0.95, respectively, when (λ1 , λ2 ) = (0.1, 0.25) and δ = 0.50.
Table 1 ARL values for the AIB-HEW and AIB-HEW-FIR charts with the corresponding K parameter when λ1 = 0.10, λ2 ∈ {0.1000001, 0.25, 0.50, 0.75},
ρ ∈ {0, 0.25, 0.50, 0.75, 0.90, 0.95}, and δ ∈ {0, 0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 2.00} at ARL0 that is close to 500
λ1 = 0.10, λ2 = 0.1000001
AIB-HEW AIB-HEW-FIR AIB-HEW AIB-HEW-FIR AIB-HEW AIB-HEW-FIR
ARL ARL ARL ARL ARL ARL
ρ=0 ρ = 0.25 ρ = 0.50
δ K = 2.3627 K = 2.4729 K = 2.3647 K= 2.4762 K = 2.3605 K = 2.4723
0 500.238 500.285 500.116 500.058 500.176 500.310
0.25 79.637 67.892 75.881 64.920 62.209 52.404
0.50 23.808 18.686 22.545 17.540 18.551 14.151
0.75 11.927 8.508 11.300 7.984 9.350 6.401
1.00 7.327 4.801 6.971 4.508 5.744 3.569
1.25 5.023 3.051 4.764 2.865 3.920 2.305
1.50 3.683 2.148 3.485 2.032 2.885 1.688
2.00 2.286 1.390 2.178 1.340 1.827 1.201
ρ = 0.75 ρ = 0.90 ρ = 0.95
δ K = 2.3623 K = 2.4701 K = 2.3634 K = 2.4726 K = 2.3645 K = 2.4758
0 500.124 500.275 500.073 500.055 500.124 500.176
0.25 38.708 31.410 18.816 14.299 10.692 7.514
2.00 1.295 1.040 1.000 1.000 1.000 1.000
λ1 = 0.10, λ2 = 0.25
δ K = 2.5482 K = 2.6430 K = 2.5462 K = 2.6395 K = 2.5453 K = 2.6394
0 500.044 500.094 500.058 500.094 500.029 500.014
0.25 92.969 81.104 87.659 75.961 71.684 61.681
0.50 26.347 20.538 24.764 19.135 20.175 15.245
0.75 12.685 8.923 11.994 8.328 9.879 6.619
1.00 7.748 4.953 7.338 4.622 6.071 3.673
1.25 5.336 3.144 5.047 2.943 4.185 2.381
1.50 3.936 2.226 3.721 2.097 3.098 1.746
2.00 2.464 1.431 2.343 1.375 1.971 1.227
ρ = 0.75 ρ = 0.90 ρ = 0.95
δ K = 2.5470 K = 2.6417 K = 2.5486 K = 2.6443 K= 2.5503 K = 2.6426
0 500.061 500.061 500.176 500.068 500.003 500.107
0.25 44.107 36.090 20.438 15.459 11.366 7.771
0.50 12.563 8.764 6.146 3.745 3.537 1.984
0.75 6.252 3.814 3.135 1.769 1.869 1.189
1.00 3.878 2.187 1.995 1.236 1.289 1.031
1.25 2.695 1.536 1.464 1.068 1.074 1.003
1.50 2.025 1.248 1.198 1.018 1.012 1.000
2.00 1.372 1.048 1.020 1.001 1.000 1.000
λ1 = 0.10, λ2 = 0.50
δ K = 2.6816 K = 2.7690 K = 2.6790 K = 2.7687 K = 2.6790 K= 2.7671
0 500.044 500.284 500.116 500.094 500.046 500.094
0.25 99.630 86.927 93.798 81.724 76.924 65.917
0.50 27.810 21.501 26.072 19.931 21.166 15.669
0.75 13.167 8.991 12.427 8.358 10.193 6.569
1.00 7.968 4.893 7.518 4.571 6.226 3.632
1.25 5.457 3.110 5.171 2.910 4.301 2.373
1.50 4.032 2.217 3.818 2.094 3.193 1.752
2.00 2.557 1.445 2.429 1.392 2.053 1.240
0.25 47.012 38.513 21.441 15.954 11.777 7.799
0.50 13.031 8.819 6.301 3.698 3.634 1.986
0.75 6.419 3.781 3.234 1.776 1.948 1.203
1.00 3.986 2.187 2.081 1.251 1.336 1.036
1.25 2.786 1.553 1.526 1.076 1.093 1.004
1.50 2.111 1.263 1.239 1.020 1.017 1.000
2.00 1.427 1.054 1.027 1.001 1.000 1.000
λ1 = 0.10, λ2 = 0.75
δ K = 2.7611 K = 2.8494 K = 2.7592 K = 2.8464 K = 2.7585 K = 2.8468
0 500.132 500.284 500.014 500.130 500.009 500.056
0.25 102.003 89.380 96.250 83.311 78.805 67.313
0.50 28.481 21.644 26.681 20.020 21.606 15.739
0.75 13.411 8.938 12.635 8.261 10.365 6.500
1.00 8.075 4.817 7.619 4.489 6.307 3.572
1.25 5.528 3.056 5.232 2.865 4.355 2.340
1.50 4.086 2.193 3.871 2.071 3.240 1.743
2.00 2.600 1.447 2.471 1.394 2.092 1.247
ρ = 0.75 ρ = 0.90 ρ = 0.95
δ K = 2.7613 K= 2.8491 K = 2.7624 K = 2.8495 K = 2.7627 K= 2.8521
0 500.092 500.014 500.107 500.264 500.009 500.061
0.25 48.148 39.005 21.931 16.030 11.980 7.742
0.50 13.255 8.743 6.393 3.631 3.681 1.972
0.75 6.508 3.715 3.280 1.768 1.988 1.209
1.00 4.035 2.164 2.120 1.256 1.366 1.039
1.25 2.828 1.550 1.562 1.081 1.108 1.004
1.50 2.155 1.269 1.265 1.022 1.021 1.000
2.00 1.461 1.058 1.033 1.001 1.000 1.000
Next, we look at the EARL1 performance of the AIB-HEW-FIR chart when the
exact shift size is unknown. All the parameters used in computing the ARL1 values
in Table 1 are used to compute the EARL1 values in Table 2, except the exact δ
is replaced by (δmin , δmax ) = {(0.1, 0.5), (0.5, 1.0), (1.0, 1.5), (1.5, 2.0)}. Table 2
presents the EARL1 performance for the AIB-HEW-FIR and AIB-HEW charts.
Table 2 (for an unknown shift size) demonstrates the same observations as in
Table 1 (for an exact shift size). It is found that smaller EARL1 values are obtained
when larger ρ, λ2 , and (δmin , δmax ) are considered. In addition, the charts’ compar-
isons presented in Table 2 show that the proposed AIB-HEW-FIR chart is superior
to the AIB-HEW chart for all pairs of (λ1 , λ2 ) and (δmin , δmax ), across different ρ
values. For instance, when (λ1 , λ2 ) = (0.1, 0.25) and ρ = 0.25 are considered, the
AIB-HEW-FIR chart results in smaller EARL1 values, i.e., 82.678, 9.486, 3.106,
and 1.681, as compared to 93.191, 13.301, 5.223, and 2.995 for the AIB-HEW chart
when the given intervals (δmin , δmax ) are set as (0.1, 0.5), (0.5, 1.0), (1.0, 1.5), and
(1.5, 2.0), respectively. Thus, the numerical examples in this section show that the
FIR feature enhances the sensitivity of the AIB-HEW chart in detecting infrequent
change in the process mean for both known and unknown shift sizes.
3 Conclusion
This paper investigates the effectiveness of the FIR feature on the AIB-HEW chart in
process monitoring by using Monte Carlo simulations to compute the run length
properties. The run length properties, which include the ARL and EARL, are used to
evaluate the performance of the AIB-HEW-FIR chart. Both the ARL and EARL results show
that the AIB-HEW-FIR chart performs better than the AIB-HEW chart (i.e., it gives
smaller ARL1 and EARL1 values) by allowing infrequent changes in the process mean to be
detected more quickly. This indicates that the FIR feature is useful in enhancing the
detection ability of the AIB-HEW chart and thus provides a practical benefit for
practitioners, as an out-of-control situation can be identified more quickly in a
process with start-up problems. Faster detection of an off-target process will lead to
cost savings through lower production of inferior products, as well as time savings by
reducing rework, across all areas of manufacturing processes. The proposed chart would
therefore appear to have wide applications in industry and manufacturing.
This study can be further extended by considering multiple auxiliary variables in
process monitoring, as more information can be used in designing the AIB-HEW-FIR chart,
which may result in more desirable performance. In addition, future research can
consider integrating the FIR feature into different types of control charts to enhance
the detection speed for processes with start-up quality problems. This study compares
the performance of the proposed AIB-HEW-FIR chart with the traditional AIB-HEW chart
(i.e., without the FIR feature) to highlight the salient feature of integrating the FIR
concept. To evaluate whether the proposed AIB-HEW-FIR chart is more sensitive than its
competitors, future studies can evaluate and
Table 2 EARL1 values for the AIB-HEW and AIB-HEW-FIR charts when λ1 = 0.10, λ2 ∈
{0.1000001, 0.25, 0.50, 0.75}, ρ ∈ {0, 0.25, 0.50, 0.75, 0.90, 0.95}, and various (δmin , δmax ) pairs
at ARL0 that is close to 500
λ1 = 0.10, λ2 = 0.1000001
AIB-HEW AIB-HEW-FIR AIB-HEW AIB-HEW-FIR AIB-HEW AIB-HEW-FIR
EARL1 EARL1 EARL1 EARL1 EARL1 EARL1
ρ=0 ρ = 0.25 ρ = 0.50
δmin δmax K = 2.3627 K = 2.4729 K = 2.3647 K = 2.4762 K = 2.3605 K = 2.4723
0.1 0.5 85.865 75.576 82.201 72.322 69.252 60.231
0.5 1.0 13.111 9.563 12.433 9.010 10.263 7.221
1.0 1.5 5.204 3.215 4.934 3.024 4.072 2.437
1.5 2.0 2.905 1.711 2.759 1.632 2.297 1.404
ρ = 0.75 ρ = 0.90 ρ = 0.95
δmin δmax K = 2.3623 K = 2.4701 K = 2.3634 K = 2.4726 K = 2.3645 K = 2.4758
0.1 0.5 45.396 38.047 22.565 17.846 12.702 9.458
0.5 1.0 6.517 4.231 3.233 1.948 1.914 1.266
1.0 1.5 2.603 1.559 1.422 1.077 1.073 1.006
1.5 2.0 1.548 1.110 1.061 1.004 1.002 1.000
λ1 = 0.10, λ2 = 0.25
δmin δmax K = 2.5482 K = 2.6430 K = 2.5462 K = 2.6395 K = 2.5453 K = 2.6394
0.1 0.5 97.906 87.508 93.191 82.678 78.968 69.375
0.5 1.0 14.097 10.167 13.301 9.486 10.938 7.549
1.0 1.5 5.520 3.316 5.223 3.106 4.336 2.511
1.5 2.0 3.117 1.768 2.955 1.681 2.473 1.444
ρ = 0.75 ρ = 0.90 ρ = 0.95
δmin δmax K = 2.5470 K = 2.6417 K = 2.5486 K = 2.6443 K = 2.5503 K = 2.6426
0.1 0.5 51.743 44.111 25.320 20.202 13.968 10.347
0.5 1.0 6.905 4.381 3.455 2.010 2.053 1.293
1.0 1.5 2.795 1.609 1.515 1.090 1.101 1.008
1.5 2.0 1.658 1.127 1.086 1.006 1.003 1.000
λ1 = 0.10, λ2 = 0.50
δmin δmax K = 2.6816 K = 2.7690 K = 2.6790 K = 2.7687 K = 2.6790 K = 2.7671
0.1 0.5 103.388 92.326 98.425 87.897 83.638 73.597
0.5 1.0 14.685 10.330 13.831 9.645 11.331 7.605
1.0 1.5 5.658 3.284 5.353 3.084 4.451 2.499
1.5 2.0 3.214 1.774 3.049 1.691 2.562 1.457
ρ = 0.75 ρ = 0.90 ρ = 0.95
δmin δmax K = 2.6808 K = 2.7715 K = 2.6818 K = 2.7699 K = 2.6830 K = 2.7712
0.1 0.5 54.930 46.973 26.754 21.165 14.638 10.707
0.5 1.0 7.103 4.367 3.558 2.009 2.129 1.304
1.0 1.5 2.888 1.621 1.577 1.098 1.124 1.009
1.5 2.0 1.729 1.137 1.107 1.006 1.004 1.000
λ1 = 0.10, λ2 = 0.75
δmin δmax K = 2.7611 K = 2.8494 K = 2.7592 K = 2.8464 K = 2.7585 K = 2.8468
0.1 0.5 105.412 94.216 100.405 89.231 85.272 74.966
0.5 1.0 14.955 10.314 14.085 9.596 11.517 7.552
1.0 1.5 5.732 3.236 5.424 3.035 4.508 2.468
1.5 2.0 3.260 1.766 3.096 1.684 2.605 1.458
ρ = 0.75 ρ = 0.90 ρ = 0.95
δmin δmax K = 2.7613 K = 2.8491 K = 2.7624 K = 2.8495 K = 2.7627 K = 2.8521
0.1 0.5 56.204 47.652 27.382 21.445 14.935 10.788
0.5 1.0 7.207 4.301 3.611 1.993 2.168 1.306
1.0 1.5 2.934 1.616 1.612 1.102 1.139 1.010
1.5 2.0 1.767 1.143 1.122 1.007 1.005 1.000
make performance comparisons between the AIB-HEW-FIR chart and other types of control
charts with the FIR concept.
References
1. Montgomery DC (2012) Introduction to statistical quality control, 7th edn. Wiley, New York
2. Khan Z, Gulistan M, Hashim R, Yaqoob N, Chammam W (2020) Design of S-control chart for
neutrosophic data: an application to manufacturing industry. J Intell Fuzzy Syst 38(4):4743–
4751
3. Ali S, Abbas Z, Nazir HZ, Riaz M, Zhang X, Li Y (2021) On developing sensitive nonpara-
metric mixed control charts with application to manufacturing industry. Qual Reliab Eng Int
37(6):2699–2723
4. Yang SF, Arnold BC (2016) Monitoring process variance using an ARL-unbiased EWMA-p
control chart. Qual Reliab Eng Int 32(3):1227–1235
5. Zhou W, Zheng Z, Xie W (2017) A control-chart-based queueing approach for service facility
maintenance with energy-delay tradeoff. Eur J Oper Res 261(2):613–625
Performance of the Auxiliary Information Based Hybrid EWMA Chart … 561
6. Suman G, Prajapati D (2018) Control chart applications in healthcare: a literature review. Int
J Metrol Qual Eng 9:5
7. Roberts S (2000) Control chart tests based on geometric moving averages. Technometrics
42(1):97–101
8. Haq A (2013) A new hybrid exponentially weighted moving average control chart for
monitoring process mean. Qual Reliab Eng Int 29(7):1015–1025
9. Riaz M (2008) Monitoring process mean level using auxiliary information. Stat Neerlandica
62(4):458–481
10. Haq A, Ejaz S, Lee MH, Khan M (2021) A new double EWMA-t chart with auxiliary
information for the process mean. Qual Reliab Eng Int 37(8):3381–3394
11. Abbas N, Riaz M, Does RJ (2014) An EWMA-type control chart for monitoring the process
mean using auxiliary information. Commun Stat-Theor Meth 43(16):3485–3498
12. Haq A, Bibi L (2022) The dual CUSUM charts with auxiliary information for process mean.
Commun Stat-Simul Comput 51(1):164–189
13. Noor-ul-Amin M, Javaid A, Hanif M, Dogu E (2022) Performance of maximum EWMA control
chart in the presence of measurement error using auxiliary information. Commun Stat-Simul
Comput 51(9):5482–5506
14. Ng PS, Khoo MB, Saha S, Lee MH (2022) A variable sampling interval EWMA t chart with
auxiliary information—a robustness study in the presence of estimation error. Alex Eng J
61(8):6043–6059
15. Lucas JM, Crosier RB (1982) Fast initial response for CUSUM quality-control schemes: give
your CUSUM a head-start. Technometrics 24:199–205
16. Lucas JM, Saccucci MS (1990) Exponentially weighted moving average control schemes:
properties and enhancements. Technometrics 32(1):1–12
17. Steiner SH (1999) EWMA control charts with time-varying control limits and fast initial
response. J Qual Technol 31(1):75–86
18. Ng PS, Lau YJ, Lim HT, Yeong WC (2022) Fast initial response for auxiliary information based
EWMA chart for the process mean. In: Proceedings of the 11th international conference on
robotics, vision, signal processing and power applications: enhancing research and innovation
through the fourth industrial revolution. Springer, Singapore, pp 677–683
Evaluation of Machine Learning Models
for Breast Cancer Detection
in Microarray Gene Expression Profiles
Abstract Breast cancer (BC) is a leading global health challenge, with survival
rates varying significantly across regions due to socio-economic disparities and
healthcare accessibility. This research seeks to identify the most efficient machine
learning (ML) classifier for precise BC classification using gene expression data.
Utilizing the CuMiDa database’s microarray BC dataset, which includes 35,983 gene
biomarkers from 146 breast adenocarcinoma patients and 143 normal subjects, the
study employed R-programming for data pre-processing and feature selection. The
Boruta algorithm pinpointed 214 key biomarkers, and the dataset was subsequently
balanced using the SMOTE technique. Among the seven ML classifiers assessed,
the support vector machine (SVM) showcased superior performance metrics such
as sensitivity, specificity, and accuracy, while naïve Bayes (NB) underperformed.
A thorough examination of the BC dataset revealed that SVM is the premier ML
classifier, highlighting its potential for enhancing BC predictive modelling.
1 Introduction
Breast cancer (BC) is the dominant type of cancer in many countries worldwide.
This condition, characterized by rapid and uncontrolled cell growth, predominantly
affects the breast’s duct and lobules. BC is classified based on the conversion of breast
M. N. Abdullah (B)
School of Mathematical Sciences, College of Computing, Informatics and Mathematics,
Universiti Teknologi MARA, Tapah Campus, 35400 Tapah, Perak, Malaysia
e-mail: [email protected]
Y. B. Wah
School of Information Technology, UNITAR International University, Tierra Crest, Jalan SS6/3,
47301 Petaling Jaya, Selangor, Malaysia
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 563
Y. Bee Wah et al. (eds.), Data Science and Emerging Technologies, Lecture Notes on
Data Engineering and Communications Technologies 191,
https://doi.org/10.1007/978-981-97-0293-0_40
564 M. N. Abdullah and Y. B. Wah
cells into malignant forms. The main subtypes are invasive ductal carcinoma, which
originates in the ducts and may spread to surrounding breast tissue, and invasive
lobular carcinoma, which begins in the lobules and can metastasize to adjacent organs
and tissues [1].
According to Global Cancer Observatory (GLOBOCAN) 2020 estimates, BC
accounted for 2,261,419 new cases worldwide, leading to approximately 685,000
deaths [2]. Specifically, in Malaysia, BC recorded the highest incidence with 8,418
cases, marking it as the foremost cancer in the region. It was second in mortality with
3,503 deaths. The cumulative risk was 5.29 for incidence and 2.24 for mortality, with
a 5-year prevalence rate of 187.18 cases per 100,000 individuals [3].
BC survival rates exhibit geographical differences, attributed to socio-economic
disparities in access to diagnosis and treatment. In high-income nations, 10 to 15%
of BC cases are identified at stage IV; this figure is substantially higher in middle-
and low-income countries. Limited access to specialized cancer centres and compre-
hensive healthcare correlates with higher mortality rates. Furthermore, around 30%
of patients diagnosed at an early stage risk relapse, even with optimal treatment [4].
There is an urgent need to enhance current diagnostic and therapeutic BC proto-
cols. While numerous prognostic markers have been identified, the accuracy of BC
prognosis remains limited, evidenced by the increasing number of BC tumours. The
introduction of novel prognostic biomarkers, known for their sensitivity and speci-
ficity, is crucial for improving BC treatment and patient outcomes [5]. Genetically,
BC susceptibility arises from common variants identified in genome-wide associ-
ation studies (GWAS) and rarer coding variants. Despite their significance, these
genetic markers explain less than half the familial relative risk (FRR) of BC [6].
Microarray technology has transformed genetic analysis by enabling high-
throughput profiling of gene expression levels. It aids in identifying vital biomarkers
linked to breast cancer and highlights patterns that influence disease progression and
therapeutic response [7, 8]. A comparative analysis of machine learning (ML)
classification models on a microarray breast cancer dataset is pivotal. The CuMiDa database
offers extensive genetic data, while multiple ML models, such as support vector
machine (SVM), random forest (RF), and naïve Bayes (NB), allow a thorough eval-
uation of predictive capabilities. Such assessments are vital to ascertain the most
accurate and robust model for BC classification [9].
Each ML classifier possesses distinctive strengths: SVM is adept with high-
dimensional datasets [10], RF employs multiple decision trees to improve accuracy
[11], and NB uses a probabilistic method to classify samples [12]. Additionally,
k-nearest neighbour (k-NN) operates on proximity principles, classifying samples
based on their nearest neighbour [13]. Given the diversity of these models, this study
endeavours to determine the ML classifier that excels in utilizing gene expression
data for accurate BC classification.
2 Methodology
This study utilized Gene Expression Omnibus (GEO) breast cancer data from Curated
Microarray Database (CuMiDa) [GSE70947] available at https://sbcb.inf.ufrgs.br/
cumida. CuMiDa datasets have been used in more than 30,000 studies of the GEO
database [19]. This dataset encapsulates gene expression profiles from breast cancer
samples, rendering it a pivotal resource for predictive modelling.
gini(T) = 1 - \sum_{j=1}^{n} p_j^2,

P(C = c_k \mid X_i = x_i) = \frac{P(C = c_k)\,P(X = x_i \mid C = c_k)}{P(X_i = x_i)},

x_i, where

d\left(x', x_i\right) = \sqrt{\left(x' - x_i\right)^T\left(x' - x_i\right)}.
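The following small sketch illustrates two of the quantities above, the Gini impurity of a node and the Euclidean distance used by k-NN; the inputs are hypothetical.

```python
# Small illustrative sketch: Gini impurity of a node and the Euclidean
# distance used by k-NN.  The inputs below are hypothetical.
import numpy as np

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the class proportions p_j."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p**2))

def euclidean(x_new, x_i):
    """d(x', x_i) = sqrt((x' - x_i)^T (x' - x_i))."""
    diff = np.asarray(x_new) - np.asarray(x_i)
    return float(np.sqrt(diff @ diff))

print(round(gini([1, 1, 0, 0, 0]), 3))               # 0.48 for a 2/3 class split
print(round(euclidean([1.0, 2.0], [4.0, 6.0]), 3))   # 5.0
```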
LR is one of the most widely used models when the dependent variable is dichotomous.
Penalized logistic regression may be the better choice when there is multicollinearity
and a high number of dimensions, as it provides an alternative procedure for variable
selection compared with the conventional approach [33]. The penalized LR methods used
in this study were ridge LR, lasso LR and elastic net LR; thus, this study applies only
these three penalized LR models in the comparison instead of ordinary binary LR.
As a fundamental binary LR, the log-likelihood of LR can be defined as

l(\beta) = l(\beta; y, X) = \sum_{i=1}^{n}\sum_{j=1}^{p} y_i x_{ij}\beta_j - \sum_{i=1}^{n}\log\left(1 + \exp\left(\sum_{j=1}^{p} x_{ij}\beta_j\right)\right) [34].
For ridge LR, a penalty term is added to the log-likelihood:

l^{ridge}(\beta) = l(\beta) - \lambda\sum_{j=1}^{p}\beta_j^2.
The parameter λ in the ridge function was the tuning parameter determined via
tenfold cross validation.
For lasso LR, the penalized log-likelihood is as follows:

l^{lasso}(\beta) = l(\beta) - \lambda\sum_{j=1}^{p}\left|\beta_j\right|.
The lasso has an advantage over ridge regression since it retains only the important
predictors in the final model, which also improves model interpretability. The
parameter λ in the lasso function was the tuning parameter determined via tenfold cross
validation.
The last penalized method considered in this study was elastic net LR, whose penalized
log-likelihood is defined as follows:

l^{elasticnet}(\beta) = l(\beta) - \lambda_1\sum_{j=1}^{p}\beta_j^2 - \lambda_2\sum_{j=1}^{p}\left|\beta_j\right|.
The dataset was then randomly split into 70% training data and 30% testing data,
striking a balance between model training and evaluation; the 70/30 split ensures a
model learns well yet remains generalizable to new data. Furthermore, each classifier
underwent 1000 iterations to account for potential variation in the random data
splitting, ensuring the reliability and consistency of the results.
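A sketch of this repeated 70/30 evaluation protocol is given below; the study used R, so the Python code, placeholder data and the SVM used as the example classifier are illustrative assumptions.

```python
# Sketch of the repeated 70/30 split protocol described above (the study
# used R); the data and classifier here are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import recall_score, accuracy_score, f1_score

X, y = make_classification(n_samples=286, n_features=214, random_state=0)
metrics = {"sensitivity": [], "specificity": [], "f1": [], "accuracy": []}

for rep in range(1000):                        # 1000 random splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                               stratify=y, random_state=rep)
    y_hat = SVC().fit(X_tr, y_tr).predict(X_te)
    metrics["sensitivity"].append(recall_score(y_te, y_hat, pos_label=1))
    metrics["specificity"].append(recall_score(y_te, y_hat, pos_label=0))
    metrics["f1"].append(f1_score(y_te, y_hat))
    metrics["accuracy"].append(accuracy_score(y_te, y_hat))

for name, vals in metrics.items():
    print(f"{name}: mean={np.mean(vals):.4f}, sd={np.std(vals, ddof=1):.4f}")
```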
Several performance measures were employed to provide a comprehensive evaluation of
the classifiers. Sensitivity, specificity, accuracy, error rate, and F-measure collectively
offer a multidimensional view of the predictive capabilities of each classifier.
Next, a comparative analysis of the classifier performance was done. Results were
meticulously analysed to draw meaningful comparisons and identify potential trends.
The best performing classifier, determined by the predefined performance measure,
was selected for further analysis, ensuring the highest predictive accuracy.
Finally, the feature importance was determined from the best ML classifier perfor-
mance. Features were sorted based on the best predictive performance measure,
providing invaluable insight into their relative importance in the classification
process.
In conclusion, this study stands as a testament to the power of ML classifiers
in breast cancer research. The selected classifier, based on the defined performance
Fig. 1 Flow of methodological process
3 Results
The microarray breast cancer dataset applied in this study was collected from the
CuMiDa database and consisted of 35,983 microarray genes (biomarkers) with 146 breast
adenocarcinoma patients and 143 normal subjects. The objective of this study was to
compare the predictive ability of ML classifier models on this microarray breast cancer
dataset. R programming with the RStudio IDE was used to perform the analysis. Before
proceeding to the analysis, we checked for possible missing values and outliers. There
were no missing values detected in the dataset, but some outliers were detected. The
outliers were not removed from the dataset because this would reduce the sample size;
instead, they were substituted with the median values of the data.
Then, Boruta's algorithm for feature selection was applied to reduce the number of
genes unimportant to breast cancer. Boruta's algorithm is a wrapper feature selection
method that utilizes a random forest classifier. The initial analysis found 172
biomarkers confirmed as important, 35,753 biomarkers confirmed as unimportant, and 56
tentative (unconfirmed) biomarkers. A tentative rough fix analysis was then performed
to decide the importance of the 56 remaining tentative biomarkers. In total, 214
biomarkers were confirmed as important, and the remaining 35,767 biomarkers were
confirmed as unimportant for the prediction of breast cancer. Next, since the data were
not well balanced, we applied an undersampling method using the "ROSE" function in R,
reducing the sample in the breast cancer group to balance it with the normal group.
After undersampling, the dataset consisted of 143 breast cancer patients and 143 normal
subjects. To measure the predictive ability of the ML classifiers, the data were split
into training (70%) and testing (30%) sets to balance training depth with testing
robustness for model validation. The ML classifiers were then fitted using the training
data, and the predictive measures were obtained using the test data. To ensure the
robustness of the results, we repeated the process 1000 times and summarized the means
of each performance measure for the ML classifiers. Based on Table 1, the seven ML
classifiers were evaluated over 1000 iterations, and the mean and standard deviation of
each performance measure, namely sensitivity, specificity, F-measure, accuracy, and
error rate, were recorded for both the training and test sets.
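For readers working in Python, a rough analogue of this feature-selection and balancing pipeline is sketched below; the original study used Boruta and the ROSE package in R, so the BorutaPy and imbalanced-learn calls, as well as the synthetic stand-in data, are assumptions.

```python
# Hedged Python analogue of the feature-selection and balancing steps (the
# study itself used Boruta and "ROSE" in R); synthetic stand-in data only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy                              # pip install Boruta
from imblearn.under_sampling import RandomUnderSampler   # pip install imbalanced-learn

# Stand-in with a mild 146/143 class imbalance and many irrelevant features.
X, y = make_classification(n_samples=289, n_features=500, n_informative=30,
                           weights=[0.505], random_state=0)

rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=0)
boruta = BorutaPy(rf, n_estimators="auto", random_state=0)
boruta.fit(X, y)                                 # shadow-feature based selection
X_selected = X[:, boruta.support_]               # confirmed-important biomarkers

X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X_selected, y)
print(X_selected.shape, np.bincount(y_bal))      # balanced classes after undersampling
```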
The evaluation of various ML classifiers has yielded insightful results. SVM
emerges as the standout performer, demonstrating exceptional sensitivity, specificity,
F-measure, and accuracy across both the training and testing datasets. Moreover,
its remarkably low error rate underlines its robustness in making accurate predic-
tions. RF exhibits commendable performance, albeit slightly trailing behind SVM
in terms of accuracy on the testing set. However, RF does display lower sensitivity
and specificity values, suggesting a potential trade-off between precision and recall.
In contrast, NB lags behind the other classifiers, displaying the lowest sensitivity
and specificity scores. This is further compounded by a relatively high error rate,
indicating its limitations in accurately classifying instances in this context. Moving
on, k-NN showcases noteworthy performance with a high sensitivity on the testing
Table 1 Comparison of ML classifiers on 214 breast cancer biomarkers (n breast cancer = 143,
n normal = 143)
Classifier Data Sensitivity Specificity F-measure Accuracy Error rate
(mean (sd)) (mean (sd)) (mean (sd)) (mean (mean
(sd)) (sd))
SVM Train 0.9475 (0.0137) 0.9804 0.9634 (0.0081) 0.9640 0.0361
(0.0081) (0.0080) (0.0080)
Test 0.9091 (0.0415) 0.9455 0.9258 (0.0260) 0.9273 0.0727
(0.0347) (0.0249) (0.0249)
RF Train 0.9009 (0.0157) 0.9326 0.9154 (0.0110) 0.9168 0.0832
(0.0141) (0.0107) (0.0107)
Test 0.9051 (0.0449) 0.9350 0.9187 (0.0303) 0.9200 0.0800
(0.0436) (0.0296) (0.0296)
NB Train 0.8590 (0.0192) 0.9481 0.8989 (0.0129) 0.9036 0.0964
(0.0097) (0.0117) (0.0117)
Test 0.8537 (0.0483) 0.9372 0.8905 (0.0301) 0.8955 0.1045
(0.0389) (0.0277) (0.0277)
k-NN Train 0.8687 (0.0217) 0.9772 0.9184 (0.0135) 0.9229 0.0771
(0.0096) (0.0121) (0.0121)
Test 0.9470 (0.0361) 0.8640 0.8932 (0.0354) 0.8994 0.1006
(0.04282) (0.0313) (0.0313)
Ridge LR Train 0.8422 (0.0153) 0.9747 0.9019 (0.0106) 0.9084 0.0916
(0.0087) (0.0094) (0.0094)
Test 0.8161 (0.0593) 0.9674 0.8819 (0.0375) 0.8917 0.1083
(0.0244) (0.0314) (0.0314)
Lasso LR Train 0.9707 (0.0276) 0.9912 0.9806 (0.0187) 0.9810 0.0190
(0.0107) (0.0181) (0.0181)
Test 0.8772 (0.0547) 0.9354 0.9029 (0.0332) 0.9063 0.0937
(0.0369) (0.0303) (0.0303)
Elastic net Train 0.9490 (0.0236) 0.9853 0.9664 (0.0158) 0.9672 0.0328
LR (0.0099) (0.0153) (0.0153)
Test 0.8836 (0.0511) 0.9536 0.9151 (0.0285) 0.9186 0.0814
(0.0301) (0.0257) (0.0257)
1000 repetitions were done to evaluate the predictive performance of ML classifiers
Ridge LR, lasso LR, and elastic net LR exhibit comparable performance, each with respectable sensitivity and specificity values; lasso LR performs slightly better on several metrics, particularly on the training set. To sum up, SVM and the regularized logistic regression methods (ridge LR, lasso LR, and elastic net LR) emerge as the most reliable classifiers, offering a good balance between sensitivity and specificity. Conversely, NB remains among the weaker classifiers, particularly in terms of sensitivity. Figure 2 summarizes the performance values on the testing dataset graphically for easier comparison of the ML classifiers.
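Because the comparison above turns on the balance between sensitivity and specificity (and how these relate to precision and recall), the standard confusion-matrix definitions behind the metrics in Table 1 are restated below as a minimal Python sketch; the function name and the example counts are illustrative and not drawn from the study.

def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)        # recall: proportion of positives detected
    specificity = tn / (tn + fp)        # proportion of negatives detected
    precision = tp / (tp + fp)          # proportion of predicted positives that are correct
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
    }

# Example with hypothetical counts (not taken from the study):
# classification_metrics(tp=130, fp=10, tn=133, fn=13)

In these terms, k-NN's high test sensitivity but low specificity means it detects most positive samples while misclassifying more of the negative ones.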
4 Conclusion
Acknowledgements The authors would like to thank Universiti Teknologi MARA (UiTM) Perak,
Tapah Campus for providing the facilities to conduct this study. This research was supported by
university research funding under the MyRA grant (600-RMC/GPM LPHD 5/3 (098/2022)).
Author Index
A
Abdullah, Mohammad Nasir, 563
Abdul-Rahman, Shuzlina, 319
Abela, Clair, 63
Adam, Siti Hajar, 149
Adnan, Nor Muhammad Ilman Bin Nor, 161
Afghan, Hafez, 385
Agustini, Mety, 221
Ahmad, Azlin, 493
Ahmed, Ahmed Mohamed Abdou, 169
Aitizaz, Ali, 105
Alanezi, Ahmad, 181
Al-Hamid, Abdullah, 521
Alias, Suraya, 17
Al-Jumeily OBE, Dhiya, 279, 521
Al-Nahari, Abdulaziz, 169, 493
Al-Nahari, Abdulaziz Yahya Yahya, 79
Alsaleem, Manea, 279
Al-Shabandar, Raghad, 169
Alshabandar, Raghad, 291
Amir, Siti Nabilah Mohd Abdul Hakim, 493
Andreas, Christopher, 335
Aridas, Narendra Kumar, 531
Aridinanti, Lucia, 429
Arshad, Sabkat, 49
Artana, Ketut Buda, 403
Assi, Sulaf, 279, 521
Azahari, Afiqah M., 479
Azies Al, Harun, 319
Azman, Alliesya Binti, 161

B
Binti Mohamad, Normaiza, 211

C
Cheng, Ling Ern, 3
Chilpuri, Vishnu Sai Reddy, 237
Choiruddin, Achmad, 335, 369, 463
Coombs, Thomas, 181, 211, 269

D
Daud, Paridah, 251
Dhayalan, Karthi, 251

F
Faradilla, Sonia, 369
Fithriasari, Kartika, 221

G
Geddam, Sushmitha, 521

H
Halip, Mohd Hazali Mohamed, 479
Hayatin, Nur, 17
Hung, Lai Po, 17

I
Inguanez, Frankie, 63
Ismadyaliana, Suci, 415
Ismail, Noor Azma, 269
Ismail, Noor Lees, 33

J
Jalil, Ilham Abdul, 505
K
Kaha, Putra Roskhairul Fitri, 479
Kassim, Mohamad Nizam, 123
Khoo, Eric, 123
Khusna, Hidayatul, 385
Kumar, Vinay, 211
Kuswanto, Heri, 403

L
Lim, Huai Tein, 303, 545
Lim, Khai Yin, 3
Looi, Sing Yan, 303
Looi, Zheng Ning, 303
Loy, Chong Kim, 169, 291

M
Mansoor, Syarifah Bahiyah Rahayu Syed, 149
Mehmood, Tahir, 49, 137, 161, 237
Meilisa, Mira, 447
Mohamad, Normaiza, 79
Mohammed, A. H., 169
Mohammed, Ammar H., 279, 291
Mubarik, M. Naufal, 429
Mustafina, Jamila, 33, 193, 251

N
Nadeem, Saaman, 137, 237
Nadi, Farhad, 105, 161, 181
Nagantheran, Jeetha A/P, 161
Nagappan, Danush, 181
Naghavipour, Hadi, 105, 161, 521
Ng, Peh Sang, 545
Nidhi Shettigar, 91
Noviyanti, Reny Ari, 353
Nuari, Muhammad Farhan, 403
Nuriman, Muhammad Alifian, 429

O
Otok, Bambang Widjanarko, 369, 447

P
Prastyo, Dedy Dwi, 221, 335
Pratiwi, Emmy, 403

R
Rahayu, Syarifah Bahiyah, 479
Rahmat, Romi Fadillah, 161
Rajan, Swapna D., 269
Rasam, Abdul Rauf Abdul, 505
Ratih, Iis Dewi, 403
Rosli, A’tifah Hanim, 149
Rumiati, Agnes Tuti, 353

S
Saha, Sajal, 545
Sardar, Soumen, 279
Saurabh Gupta, 291
Saxena, Ambar, 33
Sayli Tawhare, 91
Setiawan, 353, 415
Setiowati, Yuliana, 17
Sharma, Disha, 193
Sivarajan, Roshene A/P, 161
Song, Poh Choo, 303
Soujanya Rao, Mangu, 211
Sultanova, Nailya, 33, 193, 251

T
Talip, Mohamad Sofian Abu, 531
Tan, Chi Wee, 3
Tang, Ruipeng, 531
Tan, Jenny, 3
Tan, Xuan Yi, 3
Ting, Kezia Sim Kui, 161

V
Veerah, Suita A/P, 161
Venkatesan, K., 479
Vishal Pednekar, 91

W
Wahab, Norshahriah Abdul, 149
Wibowo, Wahyu, 319
Wildani, Zakiatul, 429
Wilujeng, Susi A., 319
Wulandari, Sri Pingit, 429

Y
Yao, Danny Ngo Lung, 79, 193