
This article has been accepted for publication in IEEE Access.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3348342


BERT-based Model for Aspect-Based Sentiment Analysis for Analyzing Arabic Open-ended Survey Responses: A Case Study
KHLOUD A. ALSHAIKH¹, OMAIMA ALMATRAFI¹, AND YOOSEF B. ABUSHARK²
¹Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
²Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
Corresponding author: Khloud A. Alshaikh ([email protected])

ABSTRACT Educational institutions typically gather feedback from beneficiaries through formal surveys. Offering open-ended questions allows students to express their opinions about matters that may not have been measured directly by closed-ended questions. However, responses to open-ended questions are typically overlooked because of the time and effort required to analyze them. Aspect-based sentiment analysis (ABSA) can automate the extraction of fine-grained information from text. This study aims to 1) examine the performance of different BERT-based models for aspect term extraction on Arabic text sourced from educational institution surveys, and 2) develop a system that automates the ABSA process to label survey responses automatically. As a case study, an end-to-end system was developed to extract aspect terms, identify their polarity, map the extracted aspects to their respective categories, and aggregate category polarity. The models were evaluated on an in-house dataset. The results showed that FAST-LCF-ATEPC, a multilingual checkpoint, outperformed the other models, including AraBERT, MARBERT, and QARiB, in the aspect term extraction task, with an F1 score of 0.58. Hence, it was used for aspect term polarity classification, achieving an F1 score of 0.86. Mapping aspects to their respective categories using a predefined list yielded an average F1 score of 0.98. Furthermore, the aspect polarities were aggregated per category to summarize the overall polarity of each category. The developed system can help Arabic educational institutions harness the valuable information in responses to open-ended survey questions, allowing decision-makers to better allocate resources and improve facilities, services, and students' learning experiences.

INDEX TERMS Arabic ABSA, aspect extraction, aspect-based sentiment analysis, BERT-based model,
education, polarity classification.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION
Universities and higher education institutions worldwide allocate significant financial resources to enhance their services in order to retain existing students and attract new ones [1]. Student satisfaction and opinions about the university's service quality are very important because they directly affect students' impressions and the institution's reputation [2]. Students can express their thoughts through official surveys published at the institutional level. These surveys usually include closed- and open-ended questions. Closed-ended questions are specific and easy to analyze. In contrast, open-ended questions give students the opportunity to express their opinions and sentiments [1]. This type of question is valuable because it encourages students to express their thoughts and feelings and to provide useful information about personal experiences [3], [4]. However, these textual responses require more analysis effort to extract helpful information and the sentiments they contain. Such analyses also consume considerable human time, especially when the number of responses is large and the questions cover more than one aspect [3], [4]. Students' responses typically concern university aspects, such as services, professors, and buildings, and express their feelings (positive, negative, or neutral) toward these aspects. Extracting useful information from textual responses therefore calls for an automated system that can analyze the text and detect the sentiment expressed toward each element (aspect) mentioned in the response.

Sentiment analysis, or opinion mining, is an active area of natural language processing [5]. Its main task is to classify the opinions expressed in a text [6]; an extracted opinion is typically classified by its polarity as positive, negative, or neutral [7]. Sentiment analysis operates at three levels: document, sentence, and aspect [8]. Although document- and sentence-level analyses are useful for some applications, they are insufficient for applications that need fine-grained information about a particular aspect. In such cases, aspect-based sentiment analysis (ABSA) is used. In principle, an ABSA system receives a set of texts (product reviews, comments, forum discussions, etc.) that discuss a specific entity, identifies the main aspects of the entity, and detects the sentiment expressed toward each aspect [7]. The detailed sentiment information produced by ABSA can be highly valuable in various domains. Despite this benefit, ABSA has not been extensively applied in the educational domain. In addition, most prior work on ABSA has focused on English, with a limited number of studies targeting Arabic and other languages [9].

Arabic is the primary spoken language of approximately 422 million speakers worldwide [10]. It is a rich language with a large vocabulary, varied sentence structures, and words with multiple meanings. It has approximately 10,000 roots and more than 900 noun and verb forms based on their morphology [11]. This results in a variety of derivational morphologies and structural forms, which increase the sparsity of morphemes and words [9] as well as the complexity of the analysis.

An advanced system is needed to analyze students' survey responses written in Arabic, categorize them according to various aspects of the university, and identify students' sentiments toward these aspects. Such a system will support the integration of student feedback into decision-making processes and help university leaders allocate resources and improve the quality of the services provided. Thus, there is a need to examine the literature and identify approaches that can improve ABSA for the Arabic language, as well as its effectiveness in analyzing educational data.

This study aims to fill this gap by examining a transfer learning approach to ABSA in the Arabic educational context and evaluating its performance. Different models based on bidirectional encoder representations from transformers (BERT) were evaluated for aspect extraction using an in-house dataset of open-ended survey responses collected at King Abdulaziz University (KAU). The best-performing model was used to classify the polarity of each aspect. The aspects were then mapped to their categories, and the results were summarized per category by simply counting the polarities of the aspects in each category. To the best of our knowledge, this is the first study to assess ABSA on Arabic surveys in the educational domain.

II. BACKGROUND
A. ASPECT-BASED SENTIMENT ANALYSIS
ABSA produces fine-grained sentiment information that is useful for many applications in various domains. ABSA consists of four tasks: aspect term extraction (T1), aspect polarity classification (T2), aspect category mapping (T3), and category polarity (T4). T1 extracts all the words or phrases (aspects) whose sentiment polarity needs to be determined; these aspects can be explicit or implicit. This task has been performed using both supervised and unsupervised methods [12]. T2 assigns a sentiment polarity to each extracted aspect [13]. T3 identifies the category using a multilabel classifier that assigns each entity to multiple labels, where a label consists of entities and aspects. T4 assigns a sentiment polarity to each identified category. Figure 1 shows an example of these tasks.

FIGURE 1. ABSA tasks.

Assuming that there are only two reviews for a restaurant, tasks T1, T2, and T3 are assigned as shown in Figure 1. For T4, the overall category polarity in this example is positive for food and negative for service: pasta and steak are rated positive, yielding an overall positive polarity for the food category, whereas the waiter is rated negative, yielding an overall negative polarity for the service category.

B. DEEP LEARNING
Deep learning is a rising machine learning technique that uses a hierarchy of layers to progressively extract higher-level features. During training, the higher layers build on the complex compositional nonlinear functions of the lower layers, so layers higher in the hierarchy learn more abstract representations than the lower ones. Consequently, each layer analyzes its input and produces an output that feeds the next layer [14], [15]. A variety of architectures, such as deep neural networks, convolutional neural networks, recurrent neural networks (RNN), and recursive neural networks, support analysis in many fields, especially fine-grained processing with a large number of layers [14]. Additionally, word embedding, long short-term memory (LSTM), and bidirectional LSTM are deep learning concepts that make it possible to handle various types of data, such as text, images, and videos [16].

C. TRANSFORMER-BASED TRANSFER LEARNING
Transfer learning is an emerging machine learning technique that uses existing knowledge to solve problems in a different domain and produces state-of-the-art prediction results [17]. Transfer learning methods are used extensively in computer vision tasks such as anomalous activity detection, object classification, and image captioning. Moreover, transfer-learning-based methods such as BERT have been successful in several natural language processing (NLP) tasks [18], including sentiment analysis [17]. BERT is a pre-trained language model developed by Google in 2018. It uses a deep neural network architecture with an attention component and is designed to process sequential data, such as text, and learn the contextual relationships between words [19].

III. LITERATURE REVIEW
To determine the most appropriate approach, a comprehensive literature review on ABSA in the education domain and on Arabic ABSA approaches was conducted. The "ABSA in educational domain" subsection presents the existing empirical studies in the educational domain, reviewing their data sources, approaches, and ABSA tasks, based on a methodical and exhaustive search using queries consisting of the keywords ("aspect-based sentiment analysis" OR "ABSA") AND "education". Because approaches for ABSA on Arabic datasets in other domains also needed to be examined, the "Arabic ABSA approaches" subsection reviews the literature retrieved with the keywords ("aspect-based sentiment analysis" OR "ABSA") AND "Arabic". These studies are organized into three groups (unsupervised learning, supervised learning, and deep learning) and presented in detail.

A. ABSA IN EDUCATIONAL DOMAIN
Most ABSA studies in the educational field have been conducted on English-language datasets. The aim of these studies was to help academic institutions identify and address student issues through feedback analysis. The data for these studies were primarily gathered from social media platforms such as Twitter and Facebook [20], [21], [22], [23]. Other studies used data collected by the institution, such as MOOC platforms or traditional institutional surveys [24], [25], [26]. The methods used in these studies included semantic relatedness and sentiment polarity categorization. The researchers employed various classical machine learning algorithms, such as k-means clustering, naive Bayes, linear regression, and support vector machines (SVM), while only two studies employed deep neural networks (LSTM) [24], [26]. We also found that all studies focused on the aspect extraction and polarity classification tasks, with the exception of one study that used a combination of machine learning- and lexicon-based approaches to accomplish all four tasks [23].

ABSA has also been applied to education in languages other than English, albeit to a limited extent. In Serbia, [1] employed ML algorithms to perform T1 and T2. They examined student reviews on the "Oceni profesora" ("Rate my professors") website to gain insights into the teaching faculty, courses, and programs offered by the Faculty of Technical Sciences. Another study in Indonesia used an unsupervised lexicon-based method for both tasks [27], based on feedback from recent online learning graduates of BINUS (Bina Nusantara University). Moreover, [28] proposed a hybrid feature selection method to address T1 and T2 in Arabic tweets related to Qassim University. It extracts aspects related to the education domain, such as teaching quality, services, and activities. The purpose of that study was to improve the SVM classifier for ABSA by reducing the number of features used. The results showed that the hybrid method improved SVM classifier performance, with F1 scores of 0.70 for T1 and 0.71 for T2. Table 1 summarizes prior work on ABSA in the educational domain, showing the year of publication, the target language, the data source, the approach used, and the tasks covered by each paper. As shown, there is a lack of ABSA research on the Arabic language in the educational domain. This research aims to contribute in this direction, benefiting researchers and practitioners.

TABLE 1. A summary of prior work on ABSA in the educational field.
Year | Ref. | Language | Dataset | Approach (Type/Name) | ABSA Tasks
2017 | [20] | English | Tweets of online student feedback | k-means clustering and naïve Bayes classification | T1, T2
2017 | [22] | English | Review sites and social media networks | OpenNLP POS; Stanford NLP library | T1, T2
2019 | [24] | English | Last five years of student feedback at Sukkur IBA University | Supervised two-layered LSTM model | T1, T2
2019 | [23] | English | Students' comments from social media | Machine learning and lexicon-based | T1, T2, T3, T4
2020 | [26] | English | Students' reviews on MOOCs from Coursera and students' feedback in traditional classroom settings | Word embeddings (fastText, GloVe, Word2Vec, MOOC) with CNN and LSTM | T2, T3
2022 | [21] | English | UAE COVID-19 education-related data | Logistic Regression; Linear SVC; Multinomial NB; Random Forest | T1, T2
2022 | [25] | English | Coursera dataset from Kaggle | Unsupervised and semi-supervised LDA | T1, T2
2020 | [1] | Serbian | Student reviews of the Faculty of Technical Sciences in Serbia and a corpus of online reviews from the "Rate my professors" website | Rule-based and dictionary-based components | T1, T2
2022 | [27] | Indonesian | Students' feedback from Bina Nusantara University | Unsupervised lexicon-based method | T1, T2
2020 | [28] | Arabic | Tweets related to Qassim University in KSA | Hybrid feature selection method to enhance SVM | T1, T2

B. ARABIC ABSA APPROACHES
1) UNSUPERVISED APPROACHES
A comparative study was conducted to test and assess various lexicon-based approaches for ABSA tasks T3 and T4 on 63,000 human-annotated book reviews [29]. This work was later extended with enhanced lexicon-based approaches on the same book review dataset, achieving results that exceeded those of the previous study, particularly for T4 (accuracy: 0.88) and T3 (F1 score: 0.24) [30]. Several studies have combined two approaches or models to produce superior models. [31] combined corpus- and lexicon-based approaches to address tasks T2 and T4 using a large-scale Arabic book review dataset. Furthermore, [32] proposed a hybrid approach that combines lexicons with rule-based models to address T1 and T2 in reviews of Arabic government applications. The authors aimed to develop rules, techniques, and lexicons to address the challenges of sentiment analysis. The results showed an increase in accuracy compared with the baseline models.

2) SUPERVISED APPROACHES
Supervised approaches rely on training with labeled data so that the model learns to predict the output for new input. Various studies have used the Arabic-language hotel review dataset as a benchmark to evaluate their proposed approaches or models. The authors of [9] proposed a framework for applying ABSA to Arabic and suggested using an SVM approach for tasks T1, T2, and T3. [33] considered morphological, syntactic, and semantic features to address task T2, in addition to T1 and T3. The authors examined multiple classification methods, such as naïve Bayes, Bayes networks, decision trees, k-nearest neighbor (K-NN), and SVM. The results showed that models developed with the supervised learning approach performed better than combined lexicons with rule-based models, and that SVM performed best among the classifiers for all tasks in the study. Moreover, [12] evaluated various classifier techniques for T1, and the results showed that the adaptive boosting (AdaBoost) classifier achieved the best results compared with previous methods in terms of precision (97%) and recall (96.9%).

3) DEEP LEARNING APPROACHES
A study by [34] compared two pretrained word-embedding models for ABSA: fastText Arabic Wikipedia and AraVec Web. An SVM classifier was trained for tasks T1 and T2 on a dataset of 5,000 Arabic tweets related to airline services that were manually labeled for ABSA. The study showed that extracting features with word embeddings improved SVM classifier performance. Results were slightly better with the fastText Arabic Wikipedia embedding than with AraVec-Web, indicating the usefulness of word embeddings for sentiment analysis.

Other studies used the Arabic-language hotel review dataset to evaluate their proposed approaches or models. [35] applied a deep RNN and SVM to hotel reviews to address tasks T1, T2, and T3. The results showed that the SVM outperformed the deep RNN. However, the authors suggested enhancing the proposed deep learning approach by assessing different LSTM networks and using word embeddings such as fastText. [36] applied these suggestions by using LSTM neural networks for T1 and T2. The results showed that the method outperformed the baseline (an SVM trained with N-gram features) for both T1 and T2. Furthermore, [37] applied two deep learning models: the convolutional independent LSTM model (C-IndyLSTM) for T1 and the memory-based recurrent attention model (MBRA) for T3. The C-IndyLSTM model is based on a convolutional neural network and stacked independent long short-term memory, whereas the MBRA model is based on stacked bidirectional independent LSTM, a position-weighting mechanism, and multiple attention layers. Moreover, [38] applied two deep learning models based on GRU neural networks. The first model, BGRU-CNN-CRF, combines a bidirectional GRU, CNN, and CRF for T1. The second model, IAN-BGRU, is an interactive attention network used for T2.
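Tasks T3 and T4 from Section II-A can be made concrete with the restaurant example in Figure 1. The Python sketch below is a minimal illustration under stated assumptions, not the system developed in this study. It maps each extracted aspect to a category using a predefined term list (the abstract describes T3 as list-based mapping) and derives each category's polarity by counting the polarities of its aspects (the introduction describes T4 as counting). The category list `CATEGORY_LIST`, the helper names, and the rule that a tie between the top polarities becomes "neutral" are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

# HYPOTHETICAL predefined aspect-to-category list for T3. The terms mirror the
# restaurant example in Figure 1; a real system would use a curated Arabic list.
CATEGORY_LIST: Dict[str, set] = {
    "food": {"pasta", "steak"},
    "service": {"waiter"},
}


def map_to_category(aspect: str, category_list: Dict[str, set] = CATEGORY_LIST) -> Optional[str]:
    """T3: return the first category whose term list contains the aspect, else None."""
    term = aspect.strip().lower()
    for category, terms in category_list.items():
        if term in terms:
            return category
    return None


def aggregate_category_polarity(
    aspect_polarities: Iterable[Tuple[str, str]],
) -> Dict[str, Tuple[str, Counter]]:
    """T4: count aspect polarities per category and keep the most frequent one.

    Ties between the top polarities are reported as "neutral" (an assumption).
    Aspects that have no category in the list are skipped.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for aspect, polarity in aspect_polarities:
        category = map_to_category(aspect)
        if category is not None:
            counts[category][polarity] += 1

    summary: Dict[str, Tuple[str, Counter]] = {}
    for category, counter in counts.items():
        ranked = counter.most_common()
        tie = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        summary[category] = ("neutral" if tie else ranked[0][0], counter)
    return summary


if __name__ == "__main__":
    # Output of T1 + T2 for the two restaurant reviews in Figure 1.
    extracted = [("pasta", "positive"), ("steak", "positive"), ("waiter", "negative")]
    for category, (overall, counter) in aggregate_category_polarity(extracted).items():
        print(category, overall, dict(counter))
    # food positive {'positive': 2}
    # service negative {'negative': 1}
```

Keeping the raw counts next to the overall label lets a report show how many aspects supported each category verdict, not only the majority polarity.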


Recently, increased attention has been paid to large pre-trained language models, such as BERT and its variants, because they achieve superior results on a variety of NLP tasks. [39] proposed BERT with a simple linear classification layer to accomplish T2 only. Experiments on three Arabic datasets (hotel reviews, book reviews, and Arabic news) showed accuracies of 89.51%, 73.23%, and 85.73%, respectively. The researchers aim to accomplish T1 and T3 in future work. [40] proposed a transfer learning method using the AraBERT pre-trained language model to accomplish tasks T1 and T3.

Most previous studies handled the T1 and T2 tasks individually or sequentially, designing an independent model for each task. Other studies, however, performed T1 and T2 jointly through multi-task learning. [41] developed a lightweight ABSA framework called Python aspect-based sentiment analysis (PyABSA), which can be used for T1 and T2. The models were trained on various datasets, including restaurants, laptops, MOOCs, Twitter, and other domains, in eight languages (including the Arabic SemEval-2016 Task 5 dataset). The Arabic dataset was used to evaluate the BERT-ATESC, Fast-LCF-ASESC, and LCF-ATESC models. The BERT-ATESC model achieved the best results, with an F1 score of 71.18% for T1 and T2. Furthermore, [42] tested a transfer learning approach using Arabic-BERT-CRF for tasks T1 and T2 on a human-annotated Arabic ABSA dataset. The experimental results demonstrated that the model outperformed the baseline model, which relied on conditional random fields (CRF) with features extracted using named entity recognition (NER), POS tagging, parsing, and semantic analysis, as well as other recently proposed models such as AraBERT, MARBERT, and CamelBERT-MSA. [43] proposed a multi-task learning approach called local context focus-aspect term extraction and polarity classification (LCF-ATEPC), with AraBERT as a shared layer for Arabic contextual text representation, to accomplish T1 and T2 simultaneously. The reference hotel and product review datasets were used. In addition, the authors proposed a data augmentation technique for T2 that generates synthetic data using back-translation and synonym replacement. The results showed that the proposed model outperformed the baseline models on both datasets for both single- and multi-task approaches, achieving state-of-the-art performance. Table 2 compares the Arabic ABSA studies summarized above in terms of year of publication, data source, approach, and results on the covered tasks.

Overall, Arabic ABSA has evolved significantly over the years, transitioning from lexicon-based approaches to deep learning techniques. Lexicon-based approaches were simple but suffered from scalability constraints and an inability to adapt to context-dependent nuances in sentiment. Supervised learning methods improved scalability but required substantial amounts of labeled data and feature engineering. In recent years, however, deep learning has emerged as the dominant approach in ABSA, largely due to the Transformer-based BERT model. BERT has demonstrated remarkable effectiveness in understanding contextual information, capturing complex language patterns, and addressing prior limitations. Moreover, BERT brought transfer learning to the forefront of NLP: it learns general language representations through pre-training on vast text corpora, and subsequent fine-tuning on task-specific data significantly reduces the need for extensive labeled data, making it a well-suited choice for this research.

As part of our research, we selected the latest and top-performing BERT-based models from the literature: LCF and AraBERT. The FAST-LCF-ATEPC model stood out because it efficiently performs aspect term extraction and aspect polarity classification simultaneously. AraBERT, on the other hand, was specifically designed and trained on Arabic data, making it a promising model. While AraBERT has shown significant improvement over baseline approaches in various Arabic NLP tasks, it has been outperformed by MARBERT [44]. QARiB has also performed well in Arabic NLP tasks such as SA and NER, but its performance on Arabic ABSA has not yet been evaluated. Therefore, it is essential to experiment with the most promising BERT-based models for Arabic ABSA and evaluate their performance on related data in order to develop an effective ABSA system for educational institutions. Table 3 provides an overview of the BERT models used, highlighting their respective areas of focus, advantages, and limitations. Each model is also explained separately in the methodology section.

IV. RESEARCH CONTRIBUTION
Unlike prior work, this study examines and applies ABSA methods in a domain that has received limited attention: Arabic-language text obtained from the educational sector. This sector has not been extensively studied, and the effectiveness of pre-trained models such as BERT, which perform well in various NLP tasks, remains unexplored at the intersection of Arabic ABSA and the educational sector. Models trained for one domain may not perform as well in another, which calls for rigorous evaluation of different ABSA models on Arabic text derived from educational data. The main contributions of this research are 1) to examine the performance of different BERT-based models for aspect term extraction on Arabic text sourced from educational institution surveys, and 2) to develop a system that automates the ABSA process so that survey responses are labeled automatically. This research has significant implications, including improving the quality

TABLE 2. A summary of prior work on ABSA in Arabic.
Ref. | Year | Dataset | Approach (Type/Name) | ABSA Tasks/Results

Unsupervised Learning
[31] | 2020 | Book reviews (LABR) | Corpus-based and lexicon-based approach | T2: Acc. (80.5%); T4: Acc. (78%)
[32] | 2020 | UAE government mobile application reviews | Lexicon with rule-based | T1: F1 (92.50%); T2: Acc. (95.81%)

Supervised Learning
[45] | 2015 | News posts on social media | CRF vs. J48 | Best: T1: J48, F1 (0.82); T2: CRF, Acc. (0.87)
[9] | 2017 | Hotels' reviews (SemEval 2015) | SVM | T1: F1 (30.978); T2: Acc. (76.421); T3: F1 (40.336); T4: F1 (18.806)
[35] | 2018 | Hotels' reviews (SemEval 2016) | RNN vs. SVM, trained along with lexical, word, syntactic, morphological, and semantic features | Best: T1: SVM, F1 (89.8%); T2: SVM, Acc. (95.4%); T3: SVM, F1 (89.8%)
[33] | 2019 | Hotels' reviews (SemEval 2016) | SVM, K-Nearest Neighbor, Decision Tree, Bayesian Networks, and Naïve Bayes | Best: T1: SVM, F1 (93.4); T2: SVM, F1 (89.9); T3: SVM, Acc. (95.4)
[12] | 2021 | Hotels' reviews (SemEval 2016) | ADAL system (AdaBoost classifier: rule-based with machine learning methods) | T1: P. (97); T1: R. (97)

Deep Learning
[34] | 2019 | Airline services tweets | fastText vs. AraVec with SVM classifier | Best: T1: fastText, F1 (79); T2: fastText, Acc. (89)
[36] | 2019 | Hotels' reviews (SemEval 2016) | T1: Bi-LSTM-CRF (fastText); T2: INSIGHT-1 (CNN) | T1: F1 (69.98); T2: Acc. (82.7)
[37] | 2021 | Hotels' reviews (SemEval 2016) | T2: MBRA; T3: C-IndyLSTM | T2: F1 (58.05); T3: Acc. (87.31)
[38] | 2021 | Hotels' reviews (SemEval 2016) | T1: CNN-BGRU-CRF (fastText); T2: IAN-BGRU | T1: F1 (70.67); T2: Acc. (83.98)
[40] | 2022 | News posts on social media | BiLSTM + CRF; BERT + linear classification layer; BERT + CRF; BERT + BiLSTM + CRF; BERT + BiGRU + CRF (*BERT refers to AraBERT) | Best: T1: BERT + BiGRU + CRF, F1 (88.1)
[39] | 2022 | Hotels' reviews (SemEval 2016); ABSA book reviews (HAAD); News posts on social media | BERT with a simple linear classification layer | T2: Acc. (89.51) on hotel reviews; Acc. (73.23) on book reviews; Acc. (85.73) on news posts
[41] | 2022 | Hotels' reviews (SemEval 2016) | BERT-ATESC; Fast-LCF-ASESC; LCF-ATESC | Best: BERT-ATESC, T1: F1 (71.18); T2: F1 (71.18)
[42] | 2023 | ABSA book reviews (HAAD) | AraBERT and text classification using CRF | T1: F1 (47.63); T2: Acc. (95.23)
[43] | 2023 | Hotels' reviews (SemEval 2016) | LCF-ATEPC model + AraBERT: AR-LCF-ATEPC_Fusion; AR-LCF-ATEPC_CWD; AR-LCF-ATEPC_CMD | Best: AR-LCF-ATEPC_Fusion, T1: F1 (75.94); T2: Acc. (91.5)

TABLE 3. Overview of the used pre-trained BERT models.

FAST-LCF-ATEPC [41], [46]
Focus:
• Joint task of aspect term extraction and aspect polarity classification.
• Applies self-attention and local context focus techniques to the aspect term extraction task.
Related work:
• English [46], [41], [47]
• Chinese [46], [41]
• Dutch, Spanish, French, Turkish, Russian, and Arabic (hotels’ reviews) [41]
Advantages:
• Extracts aspect terms and infers their polarity synchronously.
• Integrates the pre-trained BERT model, using separate layers for local and global context modeling.
• Achieved new state-of-the-art performance, especially in the F1 score of the T1 task.
• Achieved state-of-the-art performance on seven ABSA datasets.
Limitation:
• Experimental results show that, on some datasets, it performs better when handling a single task (T1 or T2) than in joint multitask ABSA.

AraBERT [48]
Focus:
• Specifically designed for the Arabic language.
• Captures the linguistic characteristics and nuances of Arabic.
Related work:
• Hotels’ reviews:
  ◦ T1: [49]
  ◦ T3: [50], [51]
  ◦ T1, T2: [52]
  ◦ T2, T4: [53]
• Book reviews (HAAD):
  ◦ T2, T4: [53]
• Social media:
  ◦ News posts about the 2014 Gaza attacks: T1, T3: [40]; T2, T4: [53]
  ◦ Tweets about food delivery service reviews (T1, T2, T3): [54]
  ◦ Tweets about sports, politics, and economics (T2): [55]
Advantages:
• Performs better on Arabic NLP tasks than generic multilingual models.
• Provides better language-specific understanding.
• Achieved state-of-the-art performance on most tested Arabic NLP tasks.
• Evaluated on SA, question answering, and NER.
• Its Arabic-specific design makes it a strong candidate for Arabic ABSA tasks.
Limitation:
• Performance may be suboptimal for certain Arabic dialects or domains.
• May not generalize well to low-resource Arabic varieties.

MARBERT [44]
Focus:
• Multilingual BERT-based model that handles code-switching scenarios and allows cross-lingual transfer learning.
Related work:
• Tweets about sports, politics, and economics (T2): [55]
Advantages:
• Enables better performance on Arabic and related languages.
• Provides the benefits of cross-lingual transfer learning.
• Outperformed AraBERT.
Limitation:
• Performance may be weaker than that of language-specific models such as AraBERT on Arabic-specific tasks.
• Less effective for languages less similar to Arabic.

QARiB [56]
Focus:
• Designed specifically for Arabic, addressing dialectal variation within the language.
Related work:
• Not yet evaluated on ABSA tasks.
Advantages:
• Captures linguistic features specific to Arabic, improving performance on NLP tasks.
• Achieved state-of-the-art results on emotion and NER tasks.
Limitation:
• Performance may not generalize well to other Arabic dialects or languages, and it may be less effective for tasks involving standard Arabic or other dialects.
• Dialect-specific models like QARiB for other Arabic dialects, and approaches that handle dialectal variation effectively, remain to be explored.

of education and enhancing user satisfaction. Moreover, automatically identifying areas of concern or success can inform policymakers and help allocate resources to meet the evolving needs of students and educators. The benefits of this research are not limited to educational institutions, which can expedite analysis by automating the four steps of ABSA; they also extend to the advancement of natural language processing research, particularly for the Arabic language.

V. METHODOLOGY
This section introduces the datasets and models used. It then describes the approach used to develop the aspect-based sentiment analysis system and, lastly, defines the performance measures used to evaluate the different tasks in this research.

A. DATA
This section describes the steps involved in building a


reliable annotated dataset from an educational context for testing and evaluation. In the first step, data were collected from the KAU service evaluation survey. The responses were usually written in formal Arabic, with a maximum of 200 characters, and students expressed their opinions candidly about the university’s main aspects. In total, 1815 responses were collected; 91 of them were written in English, and 218 were garbled (containing only spaces, numbers, or symbols). The remaining 1506 responses were analyzed. Sample responses to the open-ended question “Any other additions that were not mentioned in the questionnaire that you would like to mention?” are shown in Table 4.

TABLE 4. Examples of responses to the open-ended question.

Arabic response | Translation
شكرا جامعتي وفرتي لي المسكن والمأكل والعلم والبيئة الافضل كونها جامعة حكومية | Thank you, my university, for providing me with housing, food, education, and the best environment, being a public university
توفير مكتبة خاصة بكل كلية، توفير مواقف سيارات للطالبات وأماكن استراحة للسائقين | Providing a library for each college, providing car parking for female students and resting places for drivers.
التخصصات المتوفرة لاتناسب احتياج سوق العمل | The available specializations do not fit the needs of the labor market

In the second step, the responses were annotated manually by three KAU employees: A, B, and C. Detailed guidelines were provided to help the annotators extract aspects and identify their polarity and categories. After the annotated data were received, they were explored and cleaned. Table 5 provides three examples of responses to this question along with their human annotations, showing a sample of the in-house dataset.

In the third step, the labeled responses were evaluated using Cohen’s kappa, which measures the agreement between two annotators, to ensure the quality of the annotation process [57]. Cohen’s kappa was computed for each pair of annotators (A and B, A and C, and B and C) separately for aspect, polarity, and category. As shown in Table 6, annotators B and C had the best agreement for polarity and category. For the aspect, their kappa of 0.70 was slightly lower than that of A and C (0.74) but still indicates substantial agreement, and the B and C pair yielded the largest number of aspects compared with the A and B and A and C pairs. Moreover, B and C showed almost perfect agreement for polarity and category, with kappa values of 0.92 and 0.87, respectively.

TABLE 6. Annotators’ Cohen’s kappa for aspect, polarity, and category.

Annotators | Aspect | Polarity | Category
A and B | 0.72 | 0.82 | 0.86
A and C | 0.74 | 0.89 | 0.82
B and C | 0.70 | 0.92 | 0.87

Based on these results, annotators B and C were selected to construct the golden dataset. When B and C disagreed, the annotation of A was consulted. Only aspects whose aspect term, polarity, and category matched were retained, resulting in 448 responses that contained 639 aspects related to 13 categories. The polarity of these aspects is skewed toward negative (512 aspects, ~80%). This is expected because people tend to recall and report negative experiences or thoughts more than positive ones, a phenomenon known as negativity bias [58], [59].
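The pairwise values in Table 6 follow the standard Cohen’s kappa formula. As a minimal sketch (our illustration, not the authors’ code), it can be computed in Python as shown below. The label lists are hypothetical, and scikit-learn’s `sklearn.metrics.cohen_kappa_score` would serve the same purpose as a library routine.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items in the same order.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each annotator's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotators must label the same, non-empty list of items")
    n = len(labels_a)
    # Observed agreement: share of items that received the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of both marginal counts.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label for every item; kappa is
        # undefined here, so perfect agreement is reported by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical polarity labels for four aspects (not data from the study).
annotator_b = ["negative", "negative", "positive", "negative"]
annotator_c = ["negative", "positive", "positive", "negative"]
print(round(cohens_kappa(annotator_b, annotator_c), 2))  # 0.5
```

In the study, such a function would be applied three times per label type (aspect, polarity, category), once for each annotator pair.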

TABLE 5. Examples of labeled responses.

Response | Aspects | Polarity | Category
Thank you, my university, for providing me with housing, food, education, and the best environment, being a public university (شكرا جامعتي وفرتي لي المسكن والمأكل والعلم والبيئة الافضل كونها جامعة حكومية) | my university (جامعتي) | Positive (إيجابي) | Impression of the University and personal skills (الصورة الذهنية للجامعة والمهارات الشخصية)
 | housing (المسكن); food (المأكل); environment (البيئة) | Positive (إيجابي) | University infrastructure and public services (البنية التحتية للجامعة والخدمات العامة)
 | education (العلم) | Positive (إيجابي) | Educational process (العملية التعليمية)
Providing a library for each college, providing car parking for female students and resting places for drivers (توفير مكتبة خاصة بكل كلية، توفير مواقف سيارات للطالبات وأماكن استراحة للسائقين) | library (مكتبة); car parking (مواقف سيارات); resting places for drivers (أماكن استراحة للسائقين) | Negative (سلبي) | University infrastructure and public services (البنية التحتية للجامعة والخدمات العامة)
The available specializations do not fit the needs of the labor market (التخصصات المتوفرة لاتناسب احتياج سوق العمل) | specializations (التخصصات) | Negative (سلبي) | Educational process (العملية التعليمية)

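The golden-dataset rule described in the data section can be read as a set operation on (aspect, polarity, category) triples: keep what B and C agree on, and let A settle triples proposed by only one of them. The Python sketch below is a hypothetical formalization of that reading, not the authors’ procedure. The function names, the English glosses, and the assumption that responses left with no aspect are dropped are all illustrative.

```python
from typing import Dict, Set, Tuple

# One annotation: (aspect term, polarity, category).
Annotation = Tuple[str, str, str]


def build_gold(b: Set[Annotation], c: Set[Annotation],
               a: Set[Annotation]) -> Set[Annotation]:
    """Gold labels for one response under an assumed reading of the rule:
    keep what B and C agree on; for labels proposed by only one of B and C,
    keep those that A proposed identically; drop everything else."""
    agreed = b & c        # B and C match on aspect, polarity, and category
    disputed = b ^ c      # proposed by exactly one of B and C
    return agreed | (disputed & a)  # A's annotation settles the dispute


def build_gold_dataset(responses: Dict[str, Dict[str, Set[Annotation]]]
                       ) -> Dict[str, Set[Annotation]]:
    """Apply build_gold to each response; responses left with no aspect are dropped."""
    gold: Dict[str, Set[Annotation]] = {}
    for response_id, by_annotator in responses.items():
        labels = build_gold(by_annotator["B"], by_annotator["C"], by_annotator["A"])
        if labels:
            gold[response_id] = labels
    return gold


INFRA = "University infrastructure and public services"
EDU = "Educational process"

# Hypothetical annotations (English glosses); r1 mirrors the first Table 5 response.
example = {
    "r1": {
        "B": {("housing", "positive", INFRA), ("education", "positive", EDU)},
        "C": {("housing", "positive", INFRA), ("education", "positive", INFRA)},
        "A": {("education", "positive", EDU)},
    },
    "r2": {"B": {("food", "positive", INFRA)}, "C": set(), "A": set()},
}
gold = build_gold_dataset(example)
print(sorted(gold))  # ['r1']
```

Under this reading, A can only break ties between B and C. A cannot add an aspect that neither B nor C proposed.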

B. MODELS

1) FAST-LCF-ATEPC MODEL
FAST-LCF-ATEPC, proposed in 2021, is a multitask learning model based on self-attention and local context focus (LCF) mechanisms that integrates the pretrained BERT model. Unlike other models, it extracts aspect terms and infers their polarity synchronously [46]. It employs two separate BERT layers to capture the global and local context, respectively. To enable simultaneous multitask training, each input sequence is split into tokens, and each token is assigned two labels: the first indicates whether the token is part of an aspect, and the second denotes the polarity of the aspect associated with the token.
PyABSA, an open-source framework, provides different versions of FAST-LCF-ATEPC trained on the SemEval 2016 Arabic dataset [41]. The checkpoints used in this study were multilingual, multilingual-256, and multilingual-256-2. As shown in Table 7, they differ mainly in the number of training languages and the size of the embedding layer. The multilingual checkpoint is trained on a dataset with 5 languages (English, French, German, Italian, and Spanish) and uses an embedding layer of size 768. The multilingual-256 checkpoint is also trained on 5 languages but uses a smaller embedding layer of size 256, which reduces the model’s memory footprint and can speed up training on smaller datasets. The multilingual-256-2 checkpoint is trained on a larger dataset with 15 languages and also uses an embedding layer of size 256, which allows the model to generalize better across languages and reduces the likelihood of overfitting to any particular language [41], [46]. Because of these differences in embedding size and language diversity, all three checkpoints had to be evaluated empirically to determine which one yields the best results.

TABLE 7. FAST-LCF-ATEPC checkpoints.

Checkpoint | multilingual | multilingual-256 | multilingual-256-2
Dataset language | 5 languages | 5 languages | 15 languages
Embedding layer size | 768 | 256 | 256

2) ARABERT MODEL
AraBERT was developed in 2021 as a pretrained BERT model specifically for the Arabic language, with the aim of achieving the same success that BERT achieved for English. In addition to the BERT-base configuration, AraBERT employs two pre-training tasks. The first is Masked Language Modeling (MLM), improved by forcing the model to predict whole words instead of getting hints from parts of a word. The second is Next Sentence Prediction (NSP), which helps the model understand the relationship between two sentences and is useful for many language-understanding tasks, such as question answering. AraBERT was trained on a large-scale corpus of Modern Standard Arabic (MSA) extracted from Arabic news media, containing 70 million sentences and 3 billion words. The authors evaluated the model on three downstream NLP tasks: SA, question answering, and NER. They compared its performance with that of Google’s multilingual BERT and other state-of-the-art approaches, and AraBERT achieved state-of-the-art performance on most of the tested Arabic NLP tasks [48].

3) MARBERT MODEL
MARBERT is an Arabic-focused transformer language model developed in 2021. Unlike AraBERT, MARBERT is trained on data from Twitter (one billion Arabic tweets), which include both MSA and diverse Arabic dialects. MARBERT uses the same network architecture as BERT but omits the next sentence prediction objective because tweets are short. MARBERT was evaluated on six NLP tasks: sentiment analysis, topic classification, dialect identification, question answering, NER, and social meaning. According to [44], MARBERT performed significantly better than AraBERT on these six tasks.

4) QARIB MODEL
QARiB is a pretrained model developed in 2021 [56]. Its authors trained five BERT models that differ in training-set size, linguistic preprocessing, and text variety (formal MSA and informal Arabic dialects). The MSA texts include data extracted from newswire sources, online Arabic newspaper websites, and movie and TV subtitles, whereas the dialectal text comes from Twitter. The corpus contained 180 M sentences and 440 M tweets, comprising 2.7 B words. According to [56], QARiB achieved state-of-the-art results on several tasks, such as emotion detection, NER, and offensive-language detection.

C. ABSA TASKS
Our research objective was to develop an aspect-based sentiment analysis system for KAU that facilitates the analysis of Arabic survey responses. To achieve this, we conducted experiments to determine the most suitable model for our task. Using the PyABSA framework [41], we evaluated three models: FAST-LCF-ATEPC (multilingual), FAST-LCF-ATEPC (multilingual-256), and FAST-LCF-ATEPC (multilingual-256-2). These models are designed to perform T1 (identifying the aspect term) and T2 (assigning its polarity) simultaneously. In addition, we fine-tuned three


pre-trained language models designed for Arabic NLP tasks, namely AraBERT [48], MARBERT [44], and QARiB [56]. All models were fine-tuned for T1 using a reference multilingual ABSA dataset (SemEval2016-ABSA Task 5) with 9620 examples [60]. Additionally, we added a named entity recognition (NER) layer to perform the aspect-extraction task (T1). NER is an NLP technique for automatically finding and categorizing words or phrases in text that refer to real-world objects, such as people, groups, places, dates, and amounts. Figure 2 illustrates the four tasks we performed to develop an end-to-end ABSA system in this study. The input was Arabic survey responses. The first task (T1) involved extracting aspects from each response, and we conducted six experiments to compare the models on this task. The best-performing model, that is, the one with the highest F1 score for aspect extraction, was then used in T2, which involved identifying the polarity associated with each aspect. In the third task (T3), each extracted aspect was mapped to a category using a predefined list of categories and their associated aspects curated from the golden dataset. After completing T3 for all responses, we presented the extracted aspects, polarity, and category for each response, as shown in Table 8. In the final task (T4), we aggregated the results for each category by counting the polarities of its related aspects, which allowed us to assign an overall polarity to each category.

FIGURE 2. An end-to-end ABSA framework used in the study.

D. PERFORMANCE MEASURES
In this study, various evaluation metrics were used. For T1, because aspects were extracted directly from the responses rather than selected from a predefined list, message understanding conference (MUC) metrics were used to obtain detailed results [61]. MUC is one of the earliest and longest-running efforts to evaluate language-understanding technologies, and it is particularly useful for text-processing problems such as sentiment analysis and information extraction [62]. MUC considers five categories of outcomes: correct (COR), incorrect (INC), partial (PAR), missing (MIS), and spurious (SPU), defined by comparing a model’s responses against the golden annotation. We used COR, INC, MIS, and SPU and dropped PAR because we counted partial matches as COR. For example, the aspect “university” was considered COR as long as it is part of the actual aspect “KAU university” in the golden dataset; the two do not need to be identical. Recall (R), precision (P), and F score (F1) were calculated as secondary metrics from MUC-5 because they are commonly used to compare models, as shown in (1)–(3):

$$\text{Recall}\ (R) = \frac{\mathrm{COR}}{\mathrm{possible}} \tag{1}$$

$$\text{Precision}\ (P) = \frac{\mathrm{COR}}{\mathrm{actual}} \tag{2}$$

$$\text{F score}\ (F1) = \frac{2 \cdot P \cdot R}{P + R} \tag{3}$$

where possible = COR + INC + MIS and actual = COR + INC + SPU.

For polarity classification (T2) and category mapping (T3), a confusion matrix was used to report detailed classification performance. Four commonly used classification metrics were computed from the confusion matrix: P, R, F1, and accuracy (Acc) [63]. The overall category polarity (T4) is obtained by summing the polarities of the aspects that belong to the same category, which provides an overall representation of the results.

VI. EXPERIMENTAL RESULTS
The experiments to evaluate T1 were performed on an educational dataset of 448 responses. The first three experiments used the FAST-LCF-ATEPC model with different checkpoints: multilingual, multilingual-256, and multilingual-256-2. The remaining three experiments used AraBERT, MARBERT, and QARiB, respectively. Default hyperparameters were used for all models: embedding size 100, batch size 32, 8 epochs, and a learning rate of 5e-5. The experiments were implemented in Python, with PyTorch as the deep learning framework. A snapshot of the output results is shown in Table 8.

TABLE 8. Snapshot of the output results.

Response | Aspect | Polarity | Category
أشعر بالسعادة والفخر كوني أحد منتسبي هذه الجامعة (I feel happy and proud to be a member of this university) | الجامعة (the university) | Positive | Impression of the University and personal skills
عدم نظافة الاكل في الكفتريات (The food in the cafeterias is not clean) | الاكل (food) | Negative | University infrastructure and public services
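The four-step flow in Figure 2, which produces rows like those in Table 8 and the per-category summary, can be illustrated with a small Python sketch. This is not the study’s implementation, and several parts are assumptions. It uses a BIO tag set (B-ASP, I-ASP, O) paired with a per-token polarity, mirroring the two labels per token described for FAST-LCF-ATEPC. The lexicon entries are copied from Table 5 as hypothetical examples. The names `decode_aspects`, `map_category`, and `summarize`, the fallback label "Unmapped", and the "tied" label are illustrative.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

INFRA = "University infrastructure and public services"
EDU = "Educational process"

# Hypothetical aspect-to-category lexicon. The study curated its list from the
# golden dataset; these entries are copied from the Table 5 examples only.
ASPECT_TO_CATEGORY = {
    "جامعتي": "Impression of the University and personal skills",
    "المسكن": INFRA,
    "المأكل": INFRA,
    "مواقف سيارات": INFRA,
    "العلم": EDU,
    "التخصصات": EDU,
}


def decode_aspects(tokens: List[str], tags: List[str],
                   polarities: List[Optional[str]]) -> List[Tuple[str, Optional[str]]]:
    """T1 + T2: join per-token (BIO tag, polarity) labels into (aspect, polarity) pairs.

    Assumes the tag set {B-ASP, I-ASP, O}; an aspect takes the polarity of its first token.
    """
    aspects: List[Tuple[str, Optional[str]]] = []
    span: List[str] = []
    span_polarity: Optional[str] = None
    for token, tag, polarity in zip(tokens, tags, polarities):
        if tag == "I-ASP" and span:
            span.append(token)  # continue the open aspect span
            continue
        if span:  # any other tag closes the open span
            aspects.append((" ".join(span), span_polarity))
            span, span_polarity = [], None
        if tag in ("B-ASP", "I-ASP"):  # a stray I-ASP is treated as a span start
            span, span_polarity = [token], polarity
    if span:
        aspects.append((" ".join(span), span_polarity))
    return aspects


def map_category(aspect: str) -> str:
    """T3: look the aspect up in the predefined aspect-to-category list."""
    return ASPECT_TO_CATEGORY.get(aspect, "Unmapped")


def summarize(records: Iterable[Tuple[str, str, str]],
              min_words: int = 3) -> Dict[str, Tuple[int, int, str]]:
    """T4: count positive/negative aspects per category and report the majority share.

    Each record is (response text, aspect, polarity). Responses with fewer than
    min_words words are skipped, mirroring the filter used for Table 12.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for response, aspect, polarity in records:
        if len(response.split()) < min_words:
            continue
        counts[map_category(aspect)][polarity] += 1
    summary: Dict[str, Tuple[int, int, str]] = {}
    for category, tally in counts.items():
        pos, neg = tally["positive"], tally["negative"]
        if pos + neg == 0:
            continue
        label = "positive" if pos > neg else "negative" if neg > pos else "tied"
        share = round(100 * max(pos, neg) / (pos + neg))
        summary[category] = (pos, neg, f"{share}% {label}")
    return summary


# "Providing car parking for female students": one negative aspect spanning two tokens.
tokens = ["توفير", "مواقف", "سيارات", "للطالبات"]
tags = ["O", "B-ASP", "I-ASP", "O"]
polarities = [None, "negative", "negative", None]
for aspect, polarity in decode_aspects(tokens, tags, polarities):
    print(polarity, "->", map_category(aspect))  # negative -> University infrastructure and public services
```

For example, `summarize` returns `(13, 42, '76% negative')` for a category with 13 positive and 42 negative aspects, which has the same form as the rows in Table 12.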


TABLE 9. Summary of T1 experiment results.

Metric | FAST-LCF-ATEPC (Multilingual) | FAST-LCF-ATEPC (Multilingual-256) | FAST-LCF-ATEPC (Multilingual-256-2) | AraBERT | MARBERT | QARiB
Responses (#) | 314 | 297 | 341 | 368 | 319 | 328
COR (#) | 205 | 172 | 201 | 179 | 145 | 146
INC (#) | 55 | 66 | 71 | 46 | 58 | 66
MIS (#) | 134 | 151 | 107 | 80 | 129 | 120
SPU (#) | 54 | 59 | 69 | 143 | 116 | 116
R | 0.52 | 0.44 | 0.53 | 0.59 | 0.44 | 0.44
P | 0.65 | 0.58 | 0.59 | 0.49 | 0.45 | 0.45
F1 | 0.58 | 0.50 | 0.55 | 0.54 | 0.44 | 0.44
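Because Table 9 lists the raw MUC counts, its R, P, and F1 rows can be recomputed from (1)–(3). The Python sketch below is our illustration, not the authors’ code. The `lenient_match` helper is a hypothetical token-subset reading of the rule that counts “university” as correct for the gold aspect “KAU university”.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MucCounts:
    """MUC outcome counts for T1; partial matches (PAR) are folded into COR."""
    cor: int  # correct
    inc: int  # incorrect
    mis: int  # missing: gold aspects the model did not produce
    spu: int  # spurious: model output absent from the golden dataset

    @property
    def possible(self) -> int:
        return self.cor + self.inc + self.mis

    @property
    def actual(self) -> int:
        return self.cor + self.inc + self.spu

    def scores(self) -> Tuple[float, float, float]:
        """Return (recall, precision, F1) as defined in (1)-(3)."""
        recall = self.cor / self.possible if self.possible else 0.0
        precision = self.cor / self.actual if self.actual else 0.0
        if precision + recall == 0:
            return recall, precision, 0.0
        return recall, precision, 2 * precision * recall / (precision + recall)


def lenient_match(predicted: str, gold: str) -> bool:
    """Hypothetical lenient matcher: a prediction whose tokens all occur in the
    gold aspect (e.g., "university" vs. "KAU university") counts as correct."""
    predicted_tokens = predicted.split()
    return bool(predicted_tokens) and set(predicted_tokens) <= set(gold.split())


# FAST-LCF-ATEPC (multilingual) counts from Table 9.
r, p, f1 = MucCounts(cor=205, inc=55, mis=134, spu=54).scores()
print(f"R={r:.2f} P={p:.2f} F1={f1:.2f}")  # R=0.52 P=0.65 F1=0.58
```

For this column, `actual` equals 314, the value in the table’s first row. Applied to the other columns of Table 9, the same function reproduces the reported R, P, and F1 values to within 0.01.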

A. ASPECT EXTRACTION RESULTS
The results of the six experiments are presented in Table 9. The FAST-LCF-ATEPC (multilingual) model achieved the best results for T1, with an F1 score of 0.58, a precision of 0.65, and a recall of 0.52. One possible reason is its larger embedding layer (768 dimensions versus 256 for the other two checkpoints; Table 7). A larger layer allows more expressive representations because it provides a higher-dimensional space in which tokens can be represented, which may enable the model to capture more nuanced relationships and semantic information between words. The model extracted aspects from 314 of the 448 responses; for the remaining 134 responses, no aspects could be extracted. According to the MUC-5 metrics, the model correctly extracted aspects from 205 responses, whereas 55 responses contained incorrect aspects and 54 contained spurious aspects that were not in the golden dataset. We believe that fine-tuning and data augmentation could further improve the model’s ability to extract aspects from the remaining responses.
AraBERT, on the other hand, extracted aspects from the largest number of responses (368), which could be due to its training tailored to Arabic and its ability to capture the nuances of the language. However, it also extracted the highest number of spurious aspects (143), which lowered its precision and F1 score. MARBERT and QARiB performed worse, possibly because their pre-training data included various dialects in addition to formal Arabic, whereas the survey responses were mostly written in formal Arabic. Overall, these observations highlight opportunities for improving aspect extraction for Arabic in the educational domain.
The domain we work in is relatively unexplored, and, as noted in [64], no technique can guarantee good performance in all domains. We therefore chose not to compare our results with existing work in other domains, to avoid potentially misleading comparisons. Regarding existing work in the educational domain, we found only one study that used Arabic data (collected from Twitter), and it applied an SVM method, which is completely different from the method used in this study. Instead, we compared various BERT-based methods on our dataset of Arabic text from the educational domain to determine the best method for Arabic ABSA in education.

B. POLARITY CLASSIFICATION RESULTS
The polarity classification task determined the polarity of each extracted aspect. For this task, the FAST-LCF-ATEPC (multilingual) model was used because it achieved the best results for T1. As Table 10 shows, the model labeled 29% of the 300 correctly extracted aspects as positive, 70% as negative, and 1% as neutral. In the golden dataset, by contrast, 14% of these 300 aspects had a positive polarity and 86% had a negative polarity. Because the golden dataset contained no neutral polarity, the aspects predicted as neutral were removed from the results.
The model was then evaluated using a confusion matrix, P, R, Acc, and F1, as shown in Figure 3 and Table 11. In the confusion matrix, the rows represent the actual numbers of negative and positive aspects in the golden dataset, and the columns represent the polarity predicted by the FAST-LCF-ATEPC (multilingual) model. The data were unbalanced, with 255 negative aspects and 42 positive aspects. Only one positive aspect was incorrectly classified as negative, whereas 47 negative aspects were incorrectly classified as positive.
As shown in Table 11, the accuracy of the model is 84%. For negative polarity, the precision was 100% (rounded; 208 of 209 negative predictions were correct), the recall was 82%, and the F1 score was 90%. For positive polarity, the precision was 47%, the recall was 98%, and the F1 score was 63%.


TABLE 10. Polarity classification statistics.

Aspect polarity | FAST-LCF-ATEPC model | Golden annotated data
Positive | 88 | 42
Negative | 209 | 258
Neutral (removed) | 3 | 0
Total correctly extracted aspects | 300 | 300

TABLE 11. Precision, recall, F1, accuracy, and weighted average of polarity classification.

Polarity | Precision | Recall | F1 score
Negative | 1.00 | 0.82 | 0.90
Positive | 0.47 | 0.98 | 0.63
Overall: accuracy 0.84; weighted avg. 0.86
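The counts given in the polarity results determine the full 2×2 confusion matrix behind Table 11: 255 negative and 42 positive gold aspects, one positive aspect predicted as negative, and 47 negative aspects predicted as positive. The Table 11 values can therefore be recomputed. The Python sketch below assumes the standard definitions of precision, recall, F1, and support-weighted averaging, which the paper does not spell out. The function name is illustrative.

```python
from typing import Dict, List, Sequence, Tuple


def classification_metrics(matrix: List[List[int]], labels: Sequence[str]
                           ) -> Tuple[Dict[str, Tuple[float, float, float]], float, float]:
    """Per-class (precision, recall, F1), accuracy, and support-weighted F1.

    matrix[i][j] counts aspects whose gold label is labels[i] and whose
    predicted label is labels[j] (rows = actual, columns = predicted).
    """
    total = sum(sum(row) for row in matrix)
    per_class: Dict[str, Tuple[float, float, float]] = {}
    weighted_f1 = 0.0
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        support = sum(matrix[i])                   # row total
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[label] = (precision, recall, f1)
        weighted_f1 += f1 * support / total
    accuracy = sum(matrix[i][i] for i in range(len(labels))) / total
    return per_class, accuracy, weighted_f1


# Polarity confusion matrix reconstructed from the counts in the text:
# 255 gold-negative aspects (47 predicted positive) and 42 gold-positive
# aspects (1 predicted negative), after removing the 3 neutral predictions.
polarity_matrix = [[208, 47],
                   [1, 41]]
per_class, accuracy, weighted_f1 = classification_metrics(polarity_matrix, ["negative", "positive"])
for label, (p, r, f1) in per_class.items():
    print(f"{label}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
print(f"accuracy={accuracy:.2f} weighted F1={weighted_f1:.2f}")
# negative: P=1.00 R=0.82 F1=0.90
# positive: P=0.47 R=0.98 F1=0.63
# accuracy=0.84 weighted F1=0.86
```

The column totals of this matrix (209 negative and 88 positive predictions) match the model counts in Table 10. The support-weighted F1 of 0.86 matches the weighted average reported in Table 11.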

FIGURE 3. Confusion matrix of model polarity classification results.

C. CATEGORY MAPPING RESULTS
In this task, each aspect extracted by the model was mapped to a category. There were 13 categories, including university infrastructure and public services, medical administration and its services, and libraries and their services. To achieve this, a predefined list of categories was constructed from the golden dataset, and the assigned category was evaluated against the human-assigned category for each sample.
In the confusion matrix (Figure 4), the rows represent the actual categories in the golden dataset, and the columns represent the categories assigned using the predefined list. The data were unbalanced. For category (0), for example, 148 aspects were labeled correctly and three were misclassified. Category assignment achieved an overall accuracy of 0.98 and a weighted-average F1 score of 0.98.

FIGURE 4. Confusion matrix of model category mapping result.

D. CATEGORY POLARITY RESULTS
In this section, the results are summarized to provide the overall polarity for each category. Table 12 summarizes the number of positive and negative aspects extracted by the model and the overall polarity of each category. Table 12 can inform decision makers about the services that need to be improved, as people tend to leave written feedback when they want to complain. In this summary, all short responses with fewer than three words were removed from the analysis for two reasons: first, they did not contain an explicit aspect; second, they typically expressed general positive sentiments, such as “Thank you” or “Nothing,” which are not valuable to decision makers.

VII. CONCLUSION AND FUTURE WORK
In this study, we evaluated different BERT-based models for Arabic ABSA in the educational domain: FAST-LCF-ATEPC (multilingual), FAST-LCF-ATEPC (multilingual-256), FAST-LCF-ATEPC (multilingual-256-2), AraBERT, MARBERT, and QARiB. These models were fine-tuned using a reference multilingual ABSA dataset (SemEval2016-ABSA Task 5). Six experiments were performed to determine the best method for extracting aspect terms, and the best result was achieved with the FAST-LCF-ATEPC (multilingual) model. This model performs T1 and T2 simultaneously by extracting aspect terms and classifying their polarities. This gives it an advantage over pipeline solutions, which use a separate model for each task and feed the output of the T1 model into the T2 model, potentially propagating errors from one step to the next. The end-to-end ABSA system achieved an F1 score of 0.58 for aspect extraction (T1), an accuracy of 0.84 for polarity classification (T2), and an accuracy of 0.98 for category mapping (T3), whose outputs were aggregated into per-category polarity summaries (T4). Future research should explore new methods for aspect extraction, where there is still room for improvement, and investigate other methods for optimizing T2. This study contributes to the body of knowledge by enriching research on the Arabic language and in the educational field. Educational institutions can use the system to analyze open-ended Arabic responses more efficiently and improve their services.


TABLE 12. Summary of the overall polarity for each category.

Category ID | #Positive | #Negative | Overall polarity
1 | 0 | 5 | 100% negative
2 | 0 | 1 | 100% negative
3 | 13 | 42 | 76% negative
4 | 33 | 15 | 69% positive
5 | 0 | 2 | 100% negative
6 | 0 | 1 | 100% negative
7 | 0 | 2 | 100% negative
8 | 5 | 12 | 71% negative
9 | 4 | 5 | 56% negative
10 | 0 | 4 | 100% negative
11 | 0 | 3 | 100% negative
12 | 32 | 117 | 78% negative
13 | 0 | 4 | 100% negative

REFERENCES
[1] N. Nikolić, O. Grljević, and A. Kovačević, “Aspect-based sentiment analysis of reviews in the domain of higher education,” The Electronic Library, vol. 38, no. 1, pp. 44–64, Feb. 2020, doi: 10.1108/EL-06-2019-0140.
[2] P. A. Rauschnabel, N. Krey, B. J. Babin, and B. S. Ivens, “Brand management in higher education: The University Brand Personality Scale,” Journal of Business Research, vol. 69, no. 8, pp. 3077–3086, Aug. 2016, doi: 10.1016/j.jbusres.2016.01.023.
[3] V. Baburajan, J. D. A. E Silva, and F. C. Pereira, “Open-Ended Versus Closed-Ended Responses: A Comparison Study Using Topic Modeling and Factor Analysis,” IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 4, pp. 2123–2132, Apr. 2021, doi: 10.1109/TITS.2020.3040904.
[4] M. Saarela, J. Lahtonen, M. Ruoranen, A. Mäkeläinen, T. Antikainen, and T. Kärkkäinen, “Automatic Profiling of Open-Ended Survey Data on Medical Workplace Teaching,” International Journal of Emerging Technologies in Learning (iJET), vol. 14, no. 05, Art. no. 05, Mar. 2019, doi: 10.3991/ijet.v14i05.9639.
[5] B. Liu, “Sentiment analysis and opinion mining,” Synthesis Lectures on Human Language Technologies, vol. 5, no. 1, pp. 1–184, May 2012, doi: 10.2200/S00416ED1V01Y201204HLT016.
[6] H. Liu, I. Chatterjee, M. Zhou, X. S. Lu, and A. Abusorrah, “Aspect-Based Sentiment Analysis: A Survey of Deep Learning Methods,” IEEE Transactions on Computational Social Systems, vol. 7, no. 6, pp. 1358–1375, Dec. 2020, doi: 10.1109/TCSS.2020.3033302.
[7] N. K. Laskari and S. K. Sanampudi, “Aspect Based Sentiment Analysis Survey,” IOSR Journal of Computer Engineering (IOSR-JCE), vol. 18, no. 2, pp. 24–28, 2016, doi: 10.9790/0661-18212428.
[8] K. Schouten and F. Frasincar, “Survey on Aspect-Level Sentiment

study,” 2016 11th International Conference for Internet Technology and Secured Transactions, ICITST 2016, pp. 98–103, Feb. 2017, doi: 10.1109/ICITST.2016.7856675.
[10] E. Alsharhan and A. Ramsay, “Investigating the effects of gender, dialect, and training size on the performance of Arabic speech recognition,” Language Resources and Evaluation, vol. 54, no. 4, pp. 975–998, Oct. 2020, doi: 10.1007/s10579-020-09505-5.
[11] K. Shaalan, S. Siddiqui, M. Alkhatib, and A. Abdel Monem, “Challenges in Arabic Natural Language Processing,” in Computational Linguistics, Speech and Image Processing for Arabic Language, Series on Language Processing, Pattern Recognition, and Intelligent Systems, vol. 4. World Scientific, 2017, pp. 59–83, doi: 10.1142/9789813229396_0003.
[12] S. Trigui, I. Boujelben, S. Jamoussi, and Y. Ben Ayed, “ADAL system: Aspect detection for Arabic language,” in Advances in Intelligent Systems and Computing, Springer, Cham, Dec. 2021, pp. 31–40, doi: 10.1007/978-3-030-49336-3_4.
[13] A. Sabeeh and R. K. Dewang, “Comparison, classification and survey of aspect based sentiment analysis,” in Communications in Computer and Information Science, Springer, Singapore, Jul. 2019, pp. 612–629, doi: 10.1007/978-981-13-3140-4_55.
[14] H. H. Do, P. W. C. Prasad, A. Maag, and A. Alsadoon, “Deep Learning for Aspect-Based Sentiment Analysis: A Comparative Review,” Expert Systems with Applications, vol. 118, pp. 272–299, Mar. 2019, doi: 10.1016/j.eswa.2018.10.003.
[15] L. Deng and D. Yu, “Deep learning: Methods and applications,” Foundations and Trends in Signal Processing, vol. 7, no. 3–4, pp. 197–387, Jun. 2013, doi: 10.1561/2000000039.
[16] A. Kumar and A. Sharan, “Deep Learning-Based Frameworks for Aspect-Based Sentiment Analysis,” pp. 139–158, 2020, doi: 10.1007/978-981-15-1216-2_6.
[17] R. Liu, Y. Shi, C. Ji, and M. Jia, “A Survey of Sentiment Analysis Based on Transfer Learning,” IEEE Access, 2019, doi: 10.1109/ACCESS.2019.2925059.
[18] H. Gandhi and V. Attar, “Transfer Learning for Aspect Term Polarity Determination,” Solid State Technology, vol. 63, pp. 956–968, Oct. 2020.
[19] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2019), vol. 1, pp. 4171–4186, Oct. 2018.
[20] M. Sivakumar and U. S. Reddy, “Aspect based sentiment analysis of students opinion using machine learning techniques.” Accessed: Sep. 25, 2023. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8365231
[21] H. Ismail, A. Khalil, N. Hussein, and R. Elabyad, “Triggers and Tweets: Implicit Aspect-Based Sentiment and Emotion Analysis of Community Chatter Relevant to Education Post-COVID-19,” Big Data and Cognitive Computing, vol. 6, no. 3, Art. no. 3, 2022, doi: 10.3390/bdcc6030099.
[22] L. Balachandran and A. Kirupananda, “Online reviews evaluation system for higher education institution: An aspect based sentiment analysis tool.” Accessed: Sep. 25, 2023. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8294118
[23] G. S. Chauhan, P. Agrawal, and Y. K. Meena, “Aspect-Based Sentiment Analysis of Students’ Feedback to Improve Teaching–Learning Process,” in Information and Communication Technology for Intelligent Systems, S. C. Satapathy and A. Joshi, Eds., Smart Innovation, Systems and Technologies. Singapore: Springer, 2019, pp. 259–266, doi: 10.1007/978-981-13-1747-7_25.
[24] I. Sindhu, S. Muhammad Daudpota, K. Badar, M. Bakhtyar, J. Baber and M. Nurunnabi, “Aspect-Based Opinion Mining on
Analysis,” IEEE Transactions on Knowledge and Data
Student’s Feedback for Faculty Teaching Performance
Engineering, vol. 28, no. 3, pp. 813–830, Mar. 2016, doi:
Evaluation.” Accessed: Sep. 25, 2023. [Online]. Available:
10.1109/TKDE.2015.2485209.
https://ieeexplore.ieee.org/abstract/document/8763969
[9] M. Al-Smadi, O. Qwasmeh, B. Talafha, M. Al-Ayyoub, Y.
[25] J. Melba Rosalind and S. Suguna, “Predicting Students’
Jararweh, and E. Benkhelifa, “An enhanced framework for aspect-
Satisfaction Towards Online Courses Using Aspect-Based
based sentiment analysis of Hotels’ reviews: Arabic reviews case
Sentiment Analysis,” in Computer, Communication, and Signal

VOLUME XX, 2017 5

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Processing, E. J. Neuhold, X. Fernando, J. Lu, S. Piramuthu, and A. Chandrabose, Eds., in IFIP Advances in Information and Communication Technology. Cham: Springer International Publishing, 2022, pp. 20–35. doi: 10.1007/978-3-031-11633-9_3.
[26] Z. Kastrati, A. S. Imran, and A. Kurti, “Weakly Supervised Framework for Aspect-Based Sentiment Analysis on Students’ Reviews of MOOCs.” Accessed: Sep. 25, 2023. [Online]. Available: https://ieeexplore.ieee.org/document/9110884
[27] Y. Heryadi, B. D. Wijanarko, D. F. Murad, C. Tho, and K. Hashimoto, “Aspect-based Sentiment Analysis for Improving Online Learning Program Based on Student Feedback.” Accessed: Sep. 25, 2023. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9865450
[28] M. Alassaf and A. M. Qamar, “Aspect-Based Sentiment Analysis of Arabic Tweets in the Education Sector Using a Hybrid Feature Selection Method.” Accessed: Sep. 25, 2023. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9299026
[29] I. Obaidat, R. Mohawesh, M. Al-Ayyoub, M. Al-Smadi, and Y. Jararweh, “Enhancing the determination of aspect categories and their polarities in Arabic reviews using lexicon-based approaches,” in 2015 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies, AEECT 2015, Institute of Electrical and Electronics Engineers Inc., Dec. 2015. doi: 10.1109/AEECT.2015.7360595.
[30] M. Al-Smadi, I. Obaidat, M. Al-Ayyoub, R. Mohawesh, and Y. Jararweh, “Using enhanced lexicon-based approaches for the determination of aspect categories and their polarities in Arabic reviews,” International Journal of Information Technology and Web Engineering, vol. 11, no. 3, pp. 15–31, 2016, doi: 10.4018/IJITWE.2016070102.
[31] R. Masadeh, S. Al-Azzam, and B. Hammo, “A hybrid approach of lexicon-based and corpus-based techniques for Arabic book aspect and review polarity detection,” International Journal of Advanced Trends in Computer Science and Engineering, vol. 9, no. 4, pp. 4336–4340, 2020, doi: 10.30534/ijatcse/2020/24942020.
[32] S. Areed, O. Alqaryouti, B. Siyam, and K. Shaalan, “Aspect-Based Sentiment Analysis for Arabic Government Reviews,” in Studies in Computational Intelligence, vol. 874, Springer, Cham, 2020, pp. 143–162. doi: 10.1007/978-3-030-34614-0_8.
[33] M. Al-Smadi, M. Al-Ayyoub, Y. Jararweh, and O. Qawasmeh, “Enhancing Aspect-Based Sentiment Analysis of Arabic Hotels’ reviews using morphological, syntactic and semantic features,” Information Processing and Management, vol. 56, no. 2, pp. 308–319, Mar. 2019, doi: 10.1016/j.ipm.2018.01.006.
[34] M. M. Ashi, M. A. Siddiqui, and F. Nadeem, “Pre-trained Word Embeddings for Arabic Aspect-Based Sentiment Analysis of Airline Tweets,” in Advances in Intelligent Systems and Computing, Springer, Cham, Sep. 2019, pp. 241–251. doi: 10.1007/978-3-319-99010-1_22.
[35] M. Al-Smadi, O. Qawasmeh, M. Al-Ayyoub, Y. Jararweh, and B. Gupta, “Deep Recurrent neural network vs. support vector machine for aspect-based sentiment analysis of Arabic hotels’ reviews,” Journal of Computational Science, vol. 27, pp. 386–393, Jul. 2018, doi: 10.1016/j.jocs.2017.11.006.
[36] M. Al-Smadi, B. Talafha, M. Al-Ayyoub, and Y. Jararweh, “Using Long Short-Term Memory Deep Neural Networks for Aspect-Based Sentiment Analysis of Arabic Reviews,” International Journal of Machine Learning and Cybernetics, vol. 10, no. 8, pp. 2163–2175, Mar. 2019, doi: 10.1007/s13042-018-0799-4.
[37] S. Al-Dabet, S. Tedmori, and M. AL-Smadi, “Enhancing Arabic aspect-based sentiment analysis using deep learning models,” Computer Speech and Language, vol. 69, p. 101224, Sep. 2021, doi: 10.1016/j.csl.2021.101224.
[38] M. M. Abdelgwad, T. H. A. Soliman, A. I. Taloba, and M. F. Farghaly, “Arabic aspect based sentiment analysis using bidirectional GRU based models,” Journal of King Saud University - Computer and Information Sciences, Jan. 2021, doi: 10.1016/j.jksuci.2021.08.030.
[39] M. M. Abdelgwad, T. H. A. Soliman, and A. I. Taloba, “Arabic aspect sentiment polarity classification using BERT,” Journal of Big Data, vol. 9, no. 1, p. 115, Dec. 2022, doi: 10.1186/s40537-022-00656-6.
[40] R. Bensoltane and T. Zaki, “Towards Arabic aspect-based sentiment analysis: a transfer learning-based approach,” Social Network Analysis and Mining, vol. 12, no. 1, pp. 1–16, Dec. 2022, doi: 10.1007/s13278-021-00794-4.
[41] H. Yang and K. Li, “PyABSA: Open Framework for Aspect-based Sentiment Analysis,” 2022, doi: 10.48550/arXiv.2208.01368.
[42] H. Chouikhi, M. Alsuhaibani, and F. Jarray, “BERT-Based Joint Model for Aspect Term Extraction and Aspect Polarity Detection in Arabic Text,” Electronics, vol. 12, no. 3, Art. no. 3, Jan. 2023, doi: 10.3390/electronics12030515.
[43] A. Fadel, O. Abulnaja, and M. Saleh, “Multi-Task Learning Model with Data Augmentation for Arabic Aspect-Based Sentiment Analysis,” CMC, vol. 75, no. 2, pp. 4419–4444, 2023, doi: 10.32604/cmc.2023.037112.
[44] M. Abdul-Mageed, A. Elmadany, and E. M. B. Nagoudi, “ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic.” arXiv, Jun. 07, 2021. doi: 10.48550/arXiv.2101.01785.
[45] M. Al-Smadi, M. Al-Ayyoub, H. Al-Sarhan, and Y. Jararweh, “Using Aspect-Based Sentiment Analysis to Evaluate Arabic News Affect on Readers,” Proceedings - 2015 IEEE/ACM 8th International Conference on Utility and Cloud Computing, UCC 2015, pp. 436–441, 2015, doi: 10.1109/UCC.2015.78.
[46] H. Yang, B. Zeng, J. Yang, Y. Song, and R. Xu, “A Multi-task Learning Model for Chinese-oriented Aspect Polarity Classification and Aspect Term Extraction.” arXiv, Feb. 12, 2020. Accessed: Mar. 02, 2023. [Online]. Available: http://arxiv.org/abs/1912.07976
[47] A. Boumhidi, A. Benlahbib, and E. H. Nfaoui, “Cross-Platform Reputation Generation System Based on Aspect-Based Sentiment Analysis,” IEEE Access, vol. 10, pp. 2515–2531, 2022, doi: 10.1109/ACCESS.2021.3139956.
[48] W. Antoun, F. Baly, and H. Hajj, “AraBERT: Transformer-based Model for Arabic Language Understanding.” arXiv, Mar. 07, 2021. doi: 10.48550/arXiv.2003.00104.
[49] A. S. Fadel, M. E. Saleh, and O. A. Abulnaja, “Arabic Aspect Extraction Based on Stacked Contextualized Embedding With Deep Learning,” IEEE Access, vol. 10, pp. 30526–30535, 2022, doi: 10.1109/ACCESS.2022.3159252.
[50] M. A. Almasre, “Enhance the Aspect Category Detection in Arabic Language using AraBERT and Text Augmentation.” Accessed: Oct. 03, 2023. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/10067648
[51] R. Bensoltane and T. Zaki, “Comparing word embedding models for Arabic aspect category detection using a deep learning-based approach,” E3S Web Conf., vol. 297, p. 01072, 2021, doi: 10.1051/e3sconf/202129701072.
[52] H. Chouikhi, F. Jarray, and M. Alsuhaibani, “A Sequence-to-Sequence Neural Network for Joint Aspect Term Extraction and Aspect Term Sentiment Classification Tasks,” 2023, p. 123. doi: 10.5220/0011620500003393.
[53] A. / S. Dahou, “Aspect-based Sentiment Classification Model Employing Dialect Normalization and Deep Learning,” Thesis, University Ahmed DRAIA of Adrar, 2022. Accessed: Oct. 04, 2023. [Online]. Available: https://dspace.univ-adrar.edu.dz/jspui/handle/123456789/7924
[54] I. Al-Jarrah, A. M. Mustafa, and H. Najadat, “Aspect-Based Sentiment Analysis for Arabic Food Delivery Reviews,” ACM Trans. Asian Low-Resour. Lang. Inf. Process., vol. 22, no. 7, pp. 200:1–200:18, Jul. 2023, doi: 10.1145/3605146.
[55] A. Israeli, A. Naaman, Y. Nahum, R. Assi, S. Fine, and K. Bar, “Love Me, Love Me Not: Human-Directed Sentiment Analysis in Arabic,” in Proceedings of the Third International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2022) co-located with ICNLSP 2022, Trento, Italy: Association for Computational Linguistics, Dec. 2022, pp. 22–30. Accessed: Oct. 03, 2023. [Online]. Available: https://aclanthology.org/2022.nsurl-1.4
[56] A. Abdelali, S. Hassan, H. Mubarak, K. Darwish, and Y. Samih, “Pre-Training BERT on Arabic Tweets: Practical Considerations.” arXiv, Feb. 21, 2021. doi: 10.48550/arXiv.2102.10684.
[57] R. G. Pontius and M. Millones, “Death to Kappa: birth of quantity disagreement and allocation disagreement for accuracy assessment,” International Journal of Remote Sensing, vol. 32, no. 15, pp. 4407–4429, Aug. 2011, doi: 10.1080/01431161.2011.552923.
[58] P. Rozin and E. B. Royzman, “Negativity Bias, Negativity Dominance, and Contagion,” Pers Soc Psychol Rev, vol. 5, no. 4, pp. 296–320, Nov. 2001, doi: 10.1207/S15327957PSPR0504_2.
[59] R. F. Baumeister, E. Bratslavsky, C. Finkenauer, and K. D. Vohs, “Bad is Stronger than Good,” Review of General Psychology, vol. 5, no. 4, pp. 323–370, Dec. 2001, doi: 10.1037/1089-2680.5.4.323.
[60] M. Pontiki et al., “SemEval-2016 task 5: aspect based sentiment analysis,” in Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), Association for Computational Linguistics, 2016, pp. 19–30. Accessed: Jun. 03, 2023. [Online]. Available: http://hdl.handle.net/1854/LU-8131987
[61] N. Chinchor and B. Sundheim, “MUC-5 Evaluation Metrics,” in Fifth Message Understanding Conference (MUC-5): Proceedings of a Conference Held in Baltimore, Maryland, August 25–27, 1993, 1993. Accessed: Jan. 31, 2023. [Online]. Available: https://aclanthology.org/M93-1007
[62] L. Hirschman, “The Evolution of evaluation: Lessons from the Message Understanding Conferences,” Computer Speech & Language, vol. 12, no. 4, pp. 281–305, Oct. 1998, doi: 10.1006/csla.1998.0102.
[63] D. Krstinić, M. Braović, L. Šerić, and D. Božić-Štulić, “Multi-label Classifier Performance Evaluation with Confusion Matrix,” in Computer Science & Information Technology, AIRCC Publishing Corporation, Jun. 2020, pp. 01–14. doi: 10.5121/csit.2020.100801.
[64] R. Hajrizi and K. P. Nuçi, “Aspect-Based Sentiment Analysis in Education Domain.” arXiv, Oct. 03, 2020. Accessed: May 30, 2023. [Online]. Available: http://arxiv.org/abs/2010.01429

KHLOUD A. ALSHAIKH received the bachelor’s degree in information technology in 2018. She is currently pursuing the master’s degree in information systems with the Faculty of Computing and Information Technology, King Abdulaziz University (KAU). Her research interests include machine learning and deep learning.

OMAIMA A. ALMATRAFI received the B.S. degree in computer science from King Abdulaziz University (KAU), Jeddah, Saudi Arabia, in 2008, and the M.S. degree in information systems and the Ph.D. degree in information technology from George Mason University, USA, in 2013 and 2018, respectively. She is currently an Assistant Professor with the Department of Information Systems, KAU. Her research has been published in several conferences and academic journals. Her research interests include computer-supported collaborative learning, learning analytics, and educational data mining.

YOOSEF B. ABUSHARK is currently an Associate Professor with the Computer Science Department, King Abdulaziz University (KAU). His research interests are in software engineering, with a focus on engineering intelligent systems and building agent-based simulations. He has published several research outcomes in leading venues.