Natural Language Processing (NLP) and Machine Learning (ML)—Theory and Applications
Edited by
Florentina Hristea and Cornelia Caragea
Printed Edition of the Special Issue Published in Mathematics
www.mdpi.com/si/mathematics
Natural Language Processing (NLP) and Machine Learning (ML)—Theory and Applications
Editors
Florentina Hristea
Cornelia Caragea
MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin
Editors
Florentina Hristea, University of Bucharest, Romania
Cornelia Caragea, University of Illinois at Chicago, USA
Editorial Office
MDPI
St. Alban-Anlage 66
4052 Basel, Switzerland
This is a reprint of articles from the Special Issue published online in the open access journal
Mathematics (ISSN 2227-7390) (available at: https://www.mdpi.com/si/mathematics/Natural
Language Processing Machine Learning).
For citation purposes, cite each article independently as indicated on the article page online and as
indicated below:
LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. Journal Name Year, Volume Number,
Page Range.
© 2022 by the authors. Articles in this book are Open Access and distributed under the Creative
Commons Attribution (CC BY) license, which allows users to download, copy and build upon
published articles, as long as the author and publisher are properly credited, which ensures maximum
dissemination and a wider impact of our publications.
The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons
license CC BY-NC-ND.
Contents
Josiane Mothe
Analytics Methods to Understand Information Retrieval Effectiveness—A Survey
Reprinted from: Mathematics 2022, 10, 2135, doi:10.3390/math10122135
Santosh Kumar Banbhrani, Bo Xu, Hongfei Lin and Dileep Kumar Sajnani
Taylor-ChOA: Taylor-Chimp Optimized Random Multimodal Deep Learning-Based Sentiment
Classification Model for Course Recommendation
Reprinted from: Mathematics 2022, 10, 1354, doi:10.3390/math10091354
Christopher Haynes, Marco A. Palomino, Liz Stuart, David Viira, Frances Hannon,
Gemma Crossingham and Kate Tantam
Automatic Classification of National Health Service Feedback
Reprinted from: Mathematics 2022, 10, 983, doi:10.3390/math10060983
Mihailo Škorić, Ranka Stanković, Milica Ikonić Nešić, Joanna Byszuk and Maciej Eder
Parallel Stylometric Document Embeddings with Deep Learning Based Language Models in
Literary Authorship Attribution
Reprinted from: Mathematics 2022, 10, 838, doi:10.3390/math10050838
Jianquan Ouyang and Mengen Fu
Improving Machine Reading Comprehension with Multi-Task Learning and Self-Training
Reprinted from: Mathematics 2022, 10, 310, doi:10.3390/math10030310
Mihai Masala, Stefan Ruseti, Traian Rebedea, Mihai Dascalu, Gabriel Gutu-Robu
and Stefan Trausan-Matu
Identifying the Structure of CSCL Conversations Using String Kernels
Reprinted from: Mathematics 2021, 9, 3330, doi:10.3390/math9243330
About the Editors
Florentina Hristea
Florentina Hristea, Ph.D., is currently Full Professor in the Department of Computer Science at
the University of Bucharest, Romania. Here, she received both her B.S. degree in mathematics and
computer science and Ph.D. degree in mathematics, in 1984 and 1996, respectively. She received her
habilitation in computer science from this same university, in 2017, with the habilitation thesis “Word
Sense Disambiguation with Application in Information Retrieval”. Her current field of research is
artificial intelligence, with specialization in knowledge representation, natural language processing
(NLP) and human language technologies (HLT), computational linguistics, as well as computational
statistics and data analysis with applications in NLP. She has been Principal Investigator of several
national and international interdisciplinary research and development projects and is an Expert Evaluator of the European Commission in the fields of NLP and HLT. Professor Hristea is the author or co-author of 9 books, 2 book chapters, and various scientific papers, of which 32 are articles in peer-reviewed scholarly journals. She is the author of an outlier detection algorithm named after her (Outlier Detection, Hristea Algorithm. Encyclopedia of Statistical Sciences, Second Edition, Vol. 9, N. Balakrishnan, Campbell B. Read, and Brani Vidakovic, Editors-in-Chief. Wiley, New York, pp. 5885–5886, 2005) and is an elected member of the ISI (International Statistical Institute).
She is also a member of GWA (Global WordNet Association). Professor Hristea was a Fulbright
Research Fellow at Princeton University, USA, an Invited Professor at the University of Toulouse,
France, and has been a visiting scientist at Heidelberg Institute for Theoretical Studies, Germany;
University of Toulouse Paul Sabatier III, France; Institut de Recherche en Informatique de Toulouse,
France; and the École Polytechnique “Polytech Montpellier”, France.
Cornelia Caragea
Cornelia Caragea, Ph.D., is currently Full Professor in the Department of Computer Science at
the University of Illinois at Chicago, USA, and Adjunct Associate Professor at Kansas State University,
USA. She received her B.S. in computer science from the University of Bucharest in 1997 and her Ph.D. degree
in computer science from Iowa State University, USA, in 2009. Her research interests are in natural
language processing, artificial intelligence, deep learning, and information retrieval. From 2012 to
2017, she was Assistant Professor at the University of North Texas; from 2017 to 2018, she was Associate
Professor at Kansas State University, where she has been Adjunct Associate Professor since 2018.
Professor Caragea has received more than USD 4.5M in NSF funding for her research initiatives,
including ten NSF grants as the Principal Investigator. She is author or co-author of more than
125 articles in international journals and 7 book chapters, tutorials, and other technical reports. Professor
Caragea has been invited to give talks and presentations at more than 30 national and international
conferences. She is a member of the Association for Computing Machinery and the Association for
the Advancement of Artificial Intelligence.
mathematics
Editorial
Preface to the Special Issue “Natural Language Processing
(NLP) and Machine Learning (ML)—Theory and Applications”
Florentina Hristea 1, * and Cornelia Caragea 2, *
1 Department of Computer Science, Faculty of Mathematics and Computer Science, University of Bucharest,
010014 Bucharest, Romania
2 Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
* Correspondence: [email protected] (F.H.); [email protected] (C.C.)
Natural language processing (NLP) is one of the most important technologies in use
today, especially due to the large and growing amount of online text, which needs to
be understood in order to fully ascertain its enormous value. During the last decade,
machine learning techniques have led to higher accuracies involving many types of NLP
applications. Although numerous machine learning models have been developed for NLP
applications, recently, deep learning approaches have achieved remarkable results across
many NLP tasks. This Special Issue has focused on the use and exploration of current
advances in machine learning and deep learning for a great variety of NLP topics, belonging
to a broad spectrum of research areas that are concerned with computational approaches to
natural language.
The paper authored by Mothe [1] concentrates on better understanding information
retrieval system effectiveness when taking into account the system and the query, while
the other existing dimensions (document collection, effectiveness measures) are left in the
background. The paper reviews the literature of the field from this perspective and provides
a clear negative answer to the basic but essential question: “Can we design a transparent
model in terms of its performance on a query?” The review concludes there is “lack of full understanding of system effectiveness according to the context although it has been possible to adapt the query processing to some contexts successfully”. It equally concludes that, so far, neither the system component analysis, nor the query features analysis has proven successful “in explaining when and why a particular system fails on a particular query”. This leaves room for further analyses, which prove to be necessary.
The paper authored by Donaj and Maučec [2] reports the results of a systematic analysis of adding morphological information into neural machine translation (NMT) system training, with special reference to languages with complex morphology. Experiments are performed on corpora of different sizes for the English–Slovene language pair, and conclusions are drawn for a domain-specific translation system and for a general-domain translation system. The authors conclude that NMT systems can benefit from additional morphological information when one of the languages in the translation pair is morphologically complex, with benefits depending on the size of the training corpora, on the form in which morphological information is injected into the corpora, as well as on the translation direction. We hope the conclusions of this paper will stimulate further research in order to see if they could apply to other language pairs containing English and highly inflected languages.
The paper authored by Nisioi et al. [3] studies the degree to which translated texts preserve linguistic features of dialectal varieties. The paper provides the first translation-related result (to the best of our knowledge), showing that translated texts depend not only on the source language, but also on the dialectal varieties of the source language, with machine translation being impacted by them. These authors show that automatically distinguishing between the dialectal varieties is possible, with high accuracy, even after
can be used in future research in stylometry and in NLP, with focus on the authorship
attribution task.
The paper authored by Badache et al. [9] proposes unsupervised and supervised methods to estimate temporal-aware contradictions in online course reviews. It studies
contradictory opinions in MOOC comments with respect to specific aspects (e.g., speaker,
quiz, slide), by exploiting ratings, sentiment, and course sessions where comments were
generated. The contradiction estimation is based on review ratings and on sentiment
polarity in the comments around specific aspects, such as “lecturer”, “presentation”, etc.
The reviews are time dependent, since users may stop interacting and the course contents
may evolve. Thus, the reviews taken into account must be considered as grouped by course
sessions. The contribution is threefold: (a) defining the notion of subjective contradiction
around aspects, then estimating its intensity as a function of sentiment polarity, ratings
and temporality; (b) developing a data set to evaluate the contradiction intensity measure,
which was annotated based on a user study; (c) comparing the unsupervised method
with supervised methods with automatic criteria selection. The data set is collected from
coursera.org and is in English. The results prove that the standard deviation of the ratings,
the standard deviation of the polarities, and the number of reviews represent suitable
features for estimating the contradiction intensity and for predicting the intensity classes.
The paper authored by Fuad and Al-Yahya [10] aims to explore the effectiveness of
cross-lingual transfer learning in building an end-to-end Arabic task-oriented dialogue
system (DS), using the mT5 transformer model. The Arabic-TOD dataset was used in
training and testing the model. In order to address the problem of the small Arabic
dialogue dataset, the authors present cross-lingual transfer learning using three different
approaches: mSeq2Seq, Cross-lingual Pre-training (CPT), and Mixed-Language Pre-training
(MLT). The conclusion of this research is that cross-lingual transfer learning can improve
the system performance of Arabic in the case of small datasets. It is also shown that results
can be improved by increasing the training dataset size. This research and the corresponding results can be used as a baseline for future studies aiming to build robust end-to-end Arabic task-oriented DSs that address complex real-life scenarios.
The paper authored by Ouyang and Fu [11] is concerned with improving machine
reading comprehension (MRC) by using multi-task learning and self-training. In order to
meet the complex requirements of real-life scenarios, these authors construct a multi-task
fusion training reading comprehension model based on the BERT pre-training model. The
proposed model is designed for three specific tasks only, which leaves an open window
toward future study. It uses the BERT pre-training model to obtain contextual represen-
tations, which are then shared by three downstream sub-modules for span extraction,
yes/no question answering, and unanswerable questions. Since the created model re-
quires large amounts of labeled training data, self-training is additionally used to generate
pseudo-labeled training data, in order to improve the model’s accuracy and generalization
performance. The proposed approach improves on existing results in terms of the F1 metric.
The paper authored by Curiac et al. [12] discusses the evaluation of research trends by
taking into account research publication latency. To our knowledge, this is the first work
that explicitly considers research publication latency as a parameter in the trend evaluation
process. A new trend detection methodology, which mixes auto-ARIMA prediction with
Mann–Kendall trend evaluations, is presented. Research publication latency is introduced
as a new parameter that needs to be considered when evaluating research trends from
journal paper metadata, mainly within rapidly evolving scientific fields. The performed
simulations use paper metadata collected from IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems and provide convincing results.
The paper authored by Masala et al. [13] introduces a method for discovering semantic
links embedded within chat conversations using string kernels, word embeddings, and
neural networks. The identification of these semantic links has become increasingly neces-
sary since the mixture of multiple and often concurrent discussion threads leads to topic
mixtures and makes it difficult to follow multi-participant conversation logs. The authors
3
Mathematics 2022, 10, 2481
come to very clear conclusions: “string kernels are very effective at utterance level, while
state-of-the-art semantic similarity models under-perform when used for utterance similar-
ity. Besides higher accuracy, string kernels are also a lot faster and, if used in conjunction
with a neural network on top of them, achieve state of the art results with a small number
of parameters”.
The paper authored by Vanetik and Litvak [14] uses deep ensemble learning in order
to extract definitions from generic and mathematical domains. The paper concentrates
on automatic detection of one-sentence definitions in mathematical and general texts, for
which this problem can be viewed as a binary classification of sentences into definitions
and non-definitions. Since the general definition domain and the mathematical domain are quite different, it is observed that cross-domain transfer learning performs significantly worse than traditional single-domain learning. The superiority of the ensemble approach
for both domains is empirically shown, together with the fact that BERT does not perform
well on this task. Experiments performed on four datasets clearly show the superiority of
ensemble voting over multiple state-of-the-art methods.
Finally, the paper authored by Burdick et al. [15] presents a systematic analysis of dif-
ferent curriculum learning strategies and different batching strategies. The three considered
tasks are text classification, sentence and phrase similarity, and part-of-speech tagging, for
which multiple datasets are used in the experiments. The paper takes into account different
combinations of curriculum learning and batching strategies across the three mentioned
downstream tasks. While a single strategy does not perform equally well on all tasks,
it is shown that, overall, cumulative batching performs better than basic batching. We especially note the general conclusion that “the observed batching variation is something that researchers should consider” more in the future.
We hereby note the large range of research topics that have been touched upon within this Special Issue, showing the diversity and the dynamics of a continuously evolving field, which underpins one of the most important technologies in use today, that of natural language processing (NLP). This Special Issue has provided a platform for researchers
to present their novel work in the domain of NLP and its applications, with a focus on
applications of machine learning and deep learning in this field. We hope that this will help
to foster future research in NLP and all related fields.
As Guest Editors of this Special Issue we would like to express our gratitude to the
47 authors who contributed their articles. We are equally grateful to a great number
of dedicated reviewers, whose valuable comments and suggestions helped improve the
quality of the submitted papers, as well as to the MDPI editorial staff, who helped greatly
during the entire process of creating this Special Issue.
References
1. Mothe, J. Analytics Methods to Understand Information Retrieval Effectiveness—A Survey. Mathematics 2022, 10, 2135. [CrossRef]
2. Donaj, G.; Sepesy Maučec, M. On the Use of Morpho-Syntactic Description Tags in Neural Machine Translation with Small and
Large Training Corpora. Mathematics 2022, 10, 1608. [CrossRef]
3. Nisioi, S.; Uban, A.S.; Dinu, L.P. Identifying Source-Language Dialects in Translation. Mathematics 2022, 10, 1431. [CrossRef]
4. Banbhrani, S.K.; Xu, B.; Lin, H.; Sajnani, D.K. Taylor-ChOA: Taylor-Chimp Optimized Random Multimodal Deep Learning-Based
Sentiment Classification Model for Course Recommendation. Mathematics 2022, 10, 1354. [CrossRef]
5. Haynes, C.; Palomino, M.A.; Stuart, L.; Viira, D.; Hannon, F.; Crossingham, G.; Tantam, K. Automatic Classification of National
Health Service Feedback. Mathematics 2022, 10, 983. [CrossRef]
6. Dascălu, Ș.; Hristea, F. Towards a Benchmarking System for Comparing Automatic Hate Speech Detection with an Intelligent
Baseline Proposal. Mathematics 2022, 10, 945. [CrossRef]
7. Savini, E.; Caragea, C. Intermediate-Task Transfer Learning with BERT for Sarcasm Detection. Mathematics 2022, 10, 844.
[CrossRef]
8. Škorić, M.; Stanković, R.; Ikonić Nešić, M.; Byszuk, J.; Eder, M. Parallel Stylometric Document Embeddings with Deep Learning
Based Language Models in Literary Authorship Attribution. Mathematics 2022, 10, 838. [CrossRef]
9. Badache, I.; Chifu, A.-G.; Fournier, S. Unsupervised and Supervised Methods to Estimate Temporal-Aware Contradictions in
Online Course Reviews. Mathematics 2022, 10, 809. [CrossRef]
10. Fuad, A.; Al-Yahya, M. Cross-Lingual Transfer Learning for Arabic Task-Oriented Dialogue Systems Using Multilingual Trans-
former Model mT5. Mathematics 2022, 10, 746. [CrossRef]
11. Ouyang, J.; Fu, M. Improving Machine Reading Comprehension with Multi-Task Learning and Self-Training. Mathematics 2022,
10, 310. [CrossRef]
12. Curiac, C.-D.; Banias, O.; Micea, M. Evaluating Research Trends from Journal Paper Metadata, Considering the Research
Publication Latency. Mathematics 2022, 10, 233. [CrossRef]
13. Masala, M.; Ruseti, S.; Rebedea, T.; Dascalu, M.; Gutu-Robu, G.; Trausan-Matu, S. Identifying the Structure of CSCL Conversations
Using String Kernels. Mathematics 2021, 9, 3330. [CrossRef]
14. Vanetik, N.; Litvak, M. Definition Extraction from Generic and Mathematical Domains with Deep Ensemble Learning. Mathematics
2021, 9, 2502. [CrossRef]
15. Burdick, L.; Kummerfeld, J.K.; Mihalcea, R. To Batch or Not to Batch? Comparing Batching and Curriculum Learning Strategies
across Tasks and Datasets. Mathematics 2021, 9, 2234. [CrossRef]
mathematics
Article
Analytics Methods to Understand Information Retrieval
Effectiveness—A Survey
Josiane Mothe
INSPE, IRIT UMR5505 CNRS, Université Toulouse Jean-Jaurès, 118 Rte de Narbonne, F-31400 Toulouse, France;
[email protected]; Tel.: +33-5-61556444
Abstract: Information retrieval aims to retrieve the documents that answer users’ queries. A typical
search process consists of different phases for which a variety of components have been defined in
the literature, each one having a set of hyper-parameters to tune. Different studies have focused on how and how much the components and their hyper-parameters affect system performance in terms of effectiveness; others have focused on the query factor. The aim of these studies is to better understand information retrieval system effectiveness. This paper reviews the literature of this domain. It depicts how data analytics has been used in IR to gain a better understanding of system effectiveness. This review concludes that we lack a full understanding of system effectiveness related to the context in which the system operates, though it has been possible to adapt the query processing to some contexts successfully. This review also concludes that, even if it is possible to distinguish effective from non-effective systems for a query set, neither the system component analysis nor the query features analysis has been successful in explaining when and why a particular system fails on a particular query.
Keywords: information systems; information retrieval; system effectiveness; search engine; IR system
analysis; data analytics; query processing chain
the same query or queries, the same collection of documents, and the same effectiveness
measure. This variability in effectiveness due to the system is also referred to as the system factor of variability in effectiveness.
Figure 1. An online search process consists of four main phases to retrieve an ordered list of
documents that answer the user’s query. The component used at each phase has various hyper-
parameters to tune.
thanks to some tools [18,19]. The generated systems differ in the components used or their hyper-parameters. Such chains have the advantage of being deterministic in the sense that the components they use are fully described and known, and thus allow deeper analyses. The effect of individual components or hyper-parameters can be studied.
Finally, some studies developed models that aim at predicting the performance of a
given system on a query for a document collection [20–22]. The predictive models can help
in understanding the system effectiveness. If they are transparent, it is possible to know
what the most important features are, what deep learning approaches seldom do.
The challenge for IR we are considering in this paper is:
• Can we better understand IR system effectiveness, that is to say the successes and
failures of systems, using data analytics methods?
The sub-objectives targeted here are:
• Did the literature allow conclusions to be drawn from the analysis of international
evaluation campaigns and the analysis of the participants’ results?
• Did data-driven analysis, based on a thorough examination of IR components and
hyper-parameters, lead to different or better conclusions?
• Did we learn from query performance prediction?
The more long-term challenge is:
• Can system effectiveness understanding be used in a comprehensive way in IR to solve
system failures and to design more effective systems? Can we design a transparent
model in terms of its performance on a query?
This paper reviews the literature of this domain. It covers analyses on the system factor,
that is to say the effects of components and hyper-parameters on the system effectiveness.
It also covers the query factor through studies that analyse the variability due to the queries.
Cross-effects are also mentioned. This paper does not cover the question of relevance,
although it is in essence related to the effectiveness calculation. It does not cover query
performance prediction either.
Rather, this paper questions the understanding we have of information retrieval thanks to data analytics methods and provides an overview of which methods have been used, in relation to the angle of effectiveness understanding the studies focused on.
The rest of this paper is organised as follows: Section 2 presents the related work.
Section 3 presents the material and methods. Section 4 reports on the results of analyses
conducted on participants obtained at evaluation campaigns. Section 5 covers the system
factor and analyses of results obtained with systematically generated query processing
chains. Section 6 is about the analyses on the query effect and cross effects. Section 7
discusses the reported work in terms of its potential impact for IR and concludes this paper.
2. Related Work
To the best of our knowledge, there is no survey published on this specific challenge.
Related work mainly consists of surveys that study a particular IR component. Other related studies concern relevance in IR, query difficulty and query performance prediction, and fairness and transparency in IR.
(e.g., Wordnet, top ranked documents, . . . ), candidate feature extraction method, feature selection method, and the expanded query representation. With regard to effectiveness, they report mean average precision on TREC collections (sparse results). Mean average precision is the average of average precision over a query set. Average precision is one of the main evaluation measures in IR. It is the area under the precision–recall curve, which, in practice, is replaced with an approximation based on the precision at every position in the ranked sequence of documents where a relevant document is retrieved (see https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173, accessed on 15 May 2022). The authors concluded that, for query expansion, linguistic techniques are considered less effective than statistics-based methods. In particular, local analysis seems to perform better than corpus-based analysis. The authors also mentioned that the methods seem to be complementary and that this should be exploited more. Their final conclusion is that the best choice depends on many factors, among which are the type of collection being queried, the availability and features of the external data, and the type of queries. The authors did not detail the link between these features and the choice of a query expansion mechanism.
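Since average precision and MAP recur throughout this survey, a minimal sketch of the usual computation may be helpful; the ranked relevance lists below are made-up placeholders, and the optional total_relevant argument stands for the full definition, which divides by the total number of relevant documents in the collection rather than by the number of relevant documents retrieved.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """AP for one query: mean of the precision values measured at each
    rank where a relevant document is retrieved.
    `ranked_relevance` is the binary relevance (1/0) of the ranked results."""
    hits, precision_at_hits = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_at_hits.append(hits / rank)
    denominator = total_relevant if total_relevant else hits
    return sum(precision_at_hits) / denominator if denominator else 0.0

def mean_average_precision(all_queries):
    """MAP: AP averaged over a query set."""
    return sum(average_precision(r) for r in all_queries) / len(all_queries)

# Hypothetical ranked judgements for three queries of one system.
runs = [[1, 0, 1, 0, 0], [0, 1, 1, 0, 1], [0, 0, 0, 0, 1]]
print(mean_average_precision(runs))
```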
Moral et al. [25] focus on stemming algorithms applied in indexing and query pre-processing and on their effect. They considered mainly rule-based stemmers and classified the stemmers according to their features, such as their strength, i.e., the aggressiveness with which the stemmer removes the terminations of the terms, the number of rules and suffixes considered, and their use of a recoding phase, partial matching, and constraint rules. They also
compared the algorithms according to their conflation rate or index compression factor.
The authors did not compare the algorithms in terms of effectiveness but rather refer to
other papers for this aspect.
We can also mention the study by Kamphuis et al., in which they considered 8 variants of the BM25 scoring function [26]. The authors considered 3 TREC collections and used average precision at 30 documents. Precision at 30 documents is the proportion of relevant documents within the retrieved document list when this list is considered up to the 30th retrieved document. They show that there is no significant effectiveness difference between the different implementations of BM25.
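For reference, one common formulation of the BM25 scoring function is given below; the variants compared in [26] differ in details such as the exact IDF component and the document length normalisation.

```latex
\mathrm{BM25}(q,d) = \sum_{t \in q}
\log\!\left(\frac{N - n_t + 0.5}{n_t + 0.5}\right)
\cdot
\frac{f_{t,d}\,(k_1 + 1)}{f_{t,d} + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}
```

Here, N is the number of documents in the collection, n_t the number of documents containing term t, f_{t,d} the frequency of t in document d, |d| the document length, avgdl the average document length, and k_1 and b are hyper-parameters.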
These analyses focus on a single component and do not, strictly speaking, analyse the results obtained, but rather compare them using typical reporting means (mainly tables of effectiveness measures averaged over queries), as presented in Section 2.3.
papers.) to analyse the hyper-parameters of the method one developed. Analysing the results is generally performed by comparing the results in terms of effectiveness in tables or graphs that show the effectiveness for different values of the hyper-parameters (see Figure 2, which represents typical reports on comparisons of methods and hyper-parameters in IR papers). In these figures and tables, the parameter values change and either different effectiveness measures or different evaluation collections, or both, are reported.
The purpose here is to emphasise that, even if extensive experimental evaluation is reported in IR papers, the reports mainly take the form of tables and curves, which are low-level data analysis representations that we do not discuss in the rest of this paper.
property of CA compared to PCA is that individuals and features can be observed all
together in the same projected space. Factorial analyses are used in the context of IR
effectiveness analysis.
Clustering methods are a family of methods that aim to group together similar objects or individuals. This group includes agglomerative clustering and k-means. In agglomerative clustering, each individual initially corresponds to a cluster; at each processing step, the two closest clusters are merged; the process ends when there is a single cluster. The minimum value of the error sum of squares is used as the Ward criterion to choose the pair of clusters to merge [35]. The resulting dendrogram (tree-like structure) can be cut at any level to produce a partition of objects. Depending on its level, the cut will result in either numerous but homogeneously composed clusters or few but heterogeneously composed clusters. Another popular clustering method is k-means, where a number of seeds, corresponding to the desired number of clusters, are chosen. Objects are associated with the closest seed. Objects can then be re-allocated to a different cluster if they are closer to the centroid of another cluster. For system effectiveness analysis, clustering can be used to group queries, systems or even measures.
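As a concrete illustration, the following sketch applies both methods to a system-by-topic effectiveness matrix with SciPy and scikit-learn; the matrix values are randomly generated placeholders rather than real TREC results.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical effectiveness matrix: 30 systems (rows) x 50 topics (columns),
# each cell holding e.g. the average precision of a system on a topic.
effectiveness = rng.random((30, 50))

# Agglomerative clustering of systems with the Ward criterion;
# the dendrogram is then cut into 4 clusters.
dendrogram = linkage(effectiveness, method="ward")
system_clusters = fcluster(dendrogram, t=4, criterion="maxclust")

# k-means clustering of topics (transpose so that topics become rows).
topic_clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(effectiveness.T)

print(system_clusters)
print(topic_clusters)
```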
Although the previous methods are usually considered as descriptive ones, the two
other groups of methods are predictive methods. That means they are used to predict either
a class (e.g., for a qualitative variable) or a value (e.g., for a continuous variable).
Regression methods aim to approach the value of a dependent variable (the variable
to be predicted) considering one or several independent variables (the variables or features
that are used to predict). The regression is based on a function model with one or more parameters (e.g., a linear function in linear regression; a polynomial, . . . ). Logistic regression is used for the case where the variable to explain is binary (e.g., the individual belongs to a class or not). It is used, for example, in query performance prediction.
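As a minimal illustration on randomly generated placeholder data (the feature names and the easy/hard threshold are assumptions, not values from the surveyed studies), the sketch below fits a linear regression that predicts a continuous effectiveness value and a logistic regression that predicts a binary easy/hard label from query features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
# Hypothetical query features (e.g., max IDF, avg IDF, query length) for 100 queries.
features = rng.random((100, 3))
ap = rng.random(100)                 # actual average precision per query (placeholder)
is_easy = (ap > 0.3).astype(int)     # binary label from an arbitrary threshold

reg = LinearRegression().fit(features, ap)          # predict a continuous value
clf = LogisticRegression().fit(features, is_easy)   # predict a class

print(reg.coef_)
print(clf.predict(features[:5]))
```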
Decision trees are a family of non-parametric supervised learning methods that are used for classification and regression. The resulting model is able to predict the value of a target variable by learning simple decision rules inferred from the data features. CART [36] and random forests [37] are the most popular among these methods. They have been shown to be very competitive methods. An extra advantage is that the model can combine both quantitative and qualitative variables. In addition, the obtained models are explainable. For system effectiveness analysis, the target variable is an effectiveness measurement or a class of query difficulty (easy, hard, or medium, for example). The system hyper-parameters or query features are used to infer the rules.
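Because the explainability of tree-based models is what makes them attractive here, the sketch below trains a small decision tree on hypothetical query features and prints the inferred rules; the feature names and difficulty labels are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
feature_names = ["idf_max", "idf_avg", "query_length"]   # illustrative query features
X = rng.random((200, 3))
# Placeholder difficulty classes: 0 = easy, 1 = medium, 2 = hard.
y = rng.integers(0, 3, size=200)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# The learned rules are directly readable, which is what makes such
# models attractive for explaining effectiveness.
print(export_text(tree, feature_names=feature_names))
```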
In this study, we do not consider deep learning methods as means to analyse and
understand information retrieval effectiveness. Deep learning is more and more popular in IR, but these models still lack interpretability. The artificial intelligence community is re-investigating the explainability and interpretability challenge of neural network based models [38]. For example, a recent review focused on explainable recommendation systems [39]. Still, model explainability is mainly based on model interpretability, and prominent interpretable models are more conventional machine learning ones, such as regression models and decision tree models [39].
Figure 3. The 3D matrices obtained from participants’ results in shared ad hoc information retrieval tasks, which report effectiveness measurements for systems, topics, and effectiveness measures, can be transformed into 2D matrices that fit many data analysis methods.
Such a 3D matrix can be transformed into a 2D one for a given effectiveness measure
where the two remaining dimensions are the systems and the topics. The resulting matrix
can then be used as an input to many of the data analysis methods we presented in the
previous sub-section where individuals are the systems represented according to the topics,
or using the transposed matrix, individuals are topics represented according to the systems.
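In practice, shared-task results often come as a long table of (system, topic, measure, value) records; a sketch of the transformation into such a 2D matrix with pandas, on placeholder records, could look as follows.

```python
import pandas as pd

# Placeholder records as they might be extracted from trec_eval outputs:
# one row per (system, topic, measure).
records = pd.DataFrame({
    "system":  ["sysA", "sysA", "sysB", "sysB"],
    "topic":   [301, 302, 301, 302],
    "measure": ["map", "map", "map", "map"],
    "value":   [0.21, 0.45, 0.19, 0.52],
})

# Fix one effectiveness measure, then pivot: systems as rows, topics as columns.
ap = records[records["measure"] == "map"]
matrix = ap.pivot(index="system", columns="topic", values="value")
print(matrix)      # individuals = systems described by topics
print(matrix.T)    # transposed: individuals = topics described by systems
```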
We can also have more information at our disposal on systems or on topics or on
both. In that case, the data structures can be more complex. For example, if we consider
a given effectiveness measure, systems can be represented by different features (e.g., the
components that are used, their hyperparameters, . . . ). In the same way, topics can come
with various features (e.g., linguistic or statistical features, . . . ) (see Figure 4).
Figure 4. More complex data structures can be used that integrate features on topics, on systems or
on both.
Finally, some studies consider aggregated values. For example, rather than considering each query individually, we can consider values aggregated across queries; this is commonly used to compare systems and methods at an upper level. On the other hand, it is possible not to consider each system individually but to aggregate the results across systems (see Figure 5).
Figure 6. A 2D matrix representing the effectiveness of different systems (X axis) on different topics
(Y axis). This matrix is an extract of the one representing the AP (effectiveness measure) for TREC 7
ad hoc participants on the topic set of that track.
Analyses were produced in web track overviews. In Web track 2009 [43], the organisers reported the plot representing the effectiveness of participants’ systems considering two evaluation measures, the mean subtopic recall and the mean precision (see Figure 7). This analysis showed that the two measures correlate, which means that a system A that is better than a system B on one of the two measures is also better when considering the second measure. In other words, effective systems are effective whatever the measure used.
Figure 7. Effective systems are effective whatever the measure used. Web track 09 participants’ results considering mean subtopic recall (X-axis) and mean precision (Y-axis); each dot is a participant system. Figure reprinted with permission from [43], Copyright 2009, Charles Clarke et al.
In web track 2014 [44], the authors provided a different type of analysis with box plots
that show the dispersion of the effectiveness measurements for each of the topics, across
participants’ systems (see Figure 8). This type of view informs on the impact of the system
factor on the results. The smaller the box, the smaller the importance of the system factor is.
Some easy queries, for which the median effectiveness is high, and some hard queries, for which the median effectiveness is low, have packed results (e.g., the easy topic 285 in Figure 8 and the hard topic 255—not presented here). Both types can also have dispersed results (e.g., the easy topic 298 in Figure 8 and the hard topic 269—not presented here).
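Such per-topic box plots can be produced directly from a system-by-topic effectiveness matrix; a minimal matplotlib sketch on random placeholder data is given below.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
# Placeholder effectiveness matrix: 40 participant systems (rows) x 10 topics (columns).
effectiveness = rng.random((40, 10))
topic_ids = [str(t) for t in range(251, 261)]

# One box per topic (column): the spread across systems reflects the system factor.
plt.boxplot(effectiveness)
plt.xticks(range(1, len(topic_ids) + 1), topic_ids)
plt.xlabel("topic")
plt.ylabel("effectiveness (e.g., err@20)")
plt.show()
```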
Figure 8. Among the easiest topics according to the median effectiveness of the participants’ results, there are both topics with very diverse system effectiveness results (e.g., topic 298) and very similar ones (e.g., topic 285)—Web track 2014—topics are ordered according to the decreasing err@20 of the best system. Figure reprinted with permission from [44], Copyright 2014, Kevyn Collins-Thompson et al.
On the same type of matrices as the ones Banks et al. used (see Figure 6), Dinçer et al. [10], Mothe et al. [11], and Bigot et al. [15] applied factorial analyses more successfully. These studies showed that topics and systems are indeed linked. Principal Component Analysis (PCA) and Correspondence Analysis were used.
In Figure 9 we can see PCA applied to a matrix that reports average precision for different queries (variables, matrix columns) by different systems (individuals, matrix rows) at the TREC 12 Robust track. We can see in the bottom left part systems that behave similarly on the same queries (they fail on the same queries, succeed on the same queries) and that behave differently from the other systems. We can see another group of systems in the top left corner of the figure. Similar results were reported in [11], where PCA was applied to a matrix that studied recall at the TREC Novelty track. In both studies, the results showed that there are not just two groups of systems, thus emphasising that systems behave differently on different queries but that some systems have similar profiles (behave similarly on the same queries).
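A sketch of such an analysis with scikit-learn, projecting systems onto the first two principal components of a topic-wise average precision matrix, is given below; the matrix is filled with random placeholder values rather than actual TREC runs.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Placeholder matrix: 60 participant systems (rows) x 50 topics (columns),
# cells holding the average precision of a system on a topic.
ap_matrix = rng.random((60, 50))

pca = PCA(n_components=2)
projected = pca.fit_transform(ap_matrix)   # systems in the plane of the first 2 components

# Systems that are close in this plane tend to succeed and fail on the same topics.
print(pca.explained_variance_ratio_)
print(projected[:5])
```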
Figure 9. System failure and effectiveness depend on queries—not all systems succeed or fail on the
same queries. The visualisation shows the two first principal components of a Principal Component
Analysis, where the system effectiveness data are obtained for each topic by each participant’s run. MAP measure of TREC 12 Robust Track participants’ runs. Figure reprinted with permission
from [10], Copyright 2007, John Wiley and Sons.
These results are complementary to Mizzaro and Robertson’s findings, which show that “easy” queries, the ones that systems can on average answer pretty well, are best to distinguish good systems from bad systems [13].
Kompaore et al. [45] showed that queries can be clustered in a more sophisticated way,
based on their linguistic features. The authors extracted 13 linguistic features that they used
Figure 10. The first-ranked system differs according to the query clusters. The rank of the system is on the Y-axis and the system is on the X-axis. Blue diamonds correspond to the ranks of the systems when considering all the queries, pink squares when considering query cluster 1, brown triangles are for query cluster 2, and green crosses for cluster 3. Systems on the X-axis are ordered according to decreasing average effectiveness on the query set. Figure reprinted with permission from [45], Copyright 2007, Kompaore et al.
5. Analyses Based on Systems That Were Generated for the Study—The System Factor
The system factor is the factor that was mentioned first in shared tasks: systems do not perform identically. Thanks to the results the participants’ systems obtained in shared tasks, it has been possible to identify which techniques or systems work better on average over queries, but because the description of those systems was not detailed enough, it has not been possible to study the important factors within the systems. This is what some
studies aimed to analyse.
Two research groups deeply studied this specific point: the Information Management
Systems Research Group of the University of Padua in Italy, starting from 2011 [48] and
the Information System Research Group of the University of Toulouse in France, starting
from 2011 [49]. Google Scholar was used to find the first pieces of work related to automatically generated IR chains with the objective of analysing the component effects.
Although the two cited works did not obtain many citations, they mark the starting point
of this new research track.
The analyses based on synthetic data are in line with the idea developed in 2010 in [46] for an evaluation framework where components would be run locally and where intermediate outputs would be uploaded so that component effects could be analysed more deeply; evaluation as a service has further developed the same idea [50].
One of the first implementations of the automatic generation of a large series of query
processing chains was the one in Louedec et al. [18]; in line with the ideas in [46,51]
also implemented in [19]. It was made possible because of the valuable work that has
been performed in Glasgow on Terrier [5] to implement IR components from the litera-
ture. Other platforms can also serve this purpose, such as Lemur/Indri (https://www.
lemurproject.org/ accessed on 15 May 2022); https://sourceforge.net/p/lemur/wiki/
Figure 11. The choice of the weighting model has more impact than the stemmer used. Individual
boxplots represent average precision on the TREC 7 and 8 topics when a given component is used in
a query processing chain—80,000 query processing chains or component combinations were used.
Figure reprinted with permission from [7], Copyright 2015, J.UCS.
Another important study is the one from Ferro and Silvello [16] followed up by [17]
from the same authors. On TREC 8, they considered the cross effect of the component
choice in the query processing chain. They considered three components: the stop list
(5 variants), the stemmer (5 variants), and the weighting model (16 variants), for a total of
400 possible different combinations. As an effectiveness measure, they also used average
precision. They show that the variant of the stop word list used during indexing does not have a huge impact, but the system should use one (see Figure 12, subfigures A and B, where the blue—no stopword—curve is systematically below the others, and C, where the starting point of the curve—no stopword—is lower than the rest of the curves). They also showed that, given a stopword list is used, the weighting model has the strongest impact among the three studied components (subfigures B and D, where waves show that the systems have different effectiveness).
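The grid of configurations used in such studies can be generated mechanically; the sketch below only enumerates a 5 × 5 × 16 grid mirroring [16] and relies on a hypothetical run_chain() function standing in for an actual retrieval platform such as Terrier (the component names are placeholders, not the exact variants of the cited study).

```python
import random
from itertools import product

# Illustrative component inventories: 5 stop lists, 5 stemmers, 16 weighting models.
stop_lists = ["none", "stop1", "stop2", "stop3", "stop4"]
stemmers = ["none", "porter", "krovetz", "snowball", "lovins"]
weighting_models = [f"wmodel_{i}" for i in range(16)]

def run_chain(stop_list, stemmer, model):
    """Hypothetical stand-in for indexing, retrieval and evaluation.
    A real implementation would call an IR platform (e.g., Terrier)
    and return the mean average precision of this configuration."""
    return random.random()   # placeholder score

results = {
    combo: run_chain(*combo)
    for combo in product(stop_lists, stemmers, weighting_models)
}
print(len(results))                    # 400 configurations
print(max(results, key=results.get))   # most effective combination (on fake scores)
```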
Additionally, related to these studies, CLAIRE [52] is a system to explore IR results.
When analysing TREC ad hoc and web tracks, the findings are consistent with previous
analyses: dirichletLM is the weakest weighting model among the studied ones, and IR weighting models suffer from the absence of a stop list and of a stemmer. By such exploration, the
authors were able to show which is the most promising combination of components
for a system on a collection (e.g., jskls model equipped with a snowball stop list and a
porter stemmer).
These data analytics studies made it possible to understand the influence of the components and their possible cross-effects. They use the results obtained on different collections; they do not aim to predict results.
Figure 12. Interaction between component choices. The curves used in this representation are somewhat misleading since the variables are not continuous, but they can nevertheless be understood; we thus kept the original figures from [16], where we added letters to each sub-figure for clarity. On the first row, the stop list effect is shown for different stemmers (A) and different weighting models (B). On the second row, the effect of stemmers is reported for different stop lists (C) and different weighting models (D). On the last row, the weighting model effect is reported for different stop lists (E) and different stemmers (F). Figure adapted with permission from [16], Copyright 2016, Nicola Ferro et al.
links span (the distance between syntactically linked words) was shown to be inversely correlated to precision, while the polysemy value, the number of semantic classes a given word belongs to in WordNet, was shown to be inversely correlated to recall.
Molina [74] developed a resource where 258 features have been extracted from queries
from Robust, WT10G and GOV2 collections. Among the features extracted from the query
only are aggregations (minimum, maximum, mean, quartiles, standard deviation, and total) over query terms of the number of synonyms, hyponyms, meronyms and sister terms from WordNet. The authors did not provide a deep analysis of how much each of these features correlates with the system performance.
Hauff et al. [60] surveyed 22 pre-retrieval features from the literature at that time,
from which some use the query only to be computed. The authors categorised these
features into four groups, depending on the underlying heuristic: specificity, ambiguity,
term relatedness, and ranking sensitivity. Intuitively, the higher the term specificity the
easier the query. Inversely, the higher the term ambiguity, the more difficult the query is.
When analysed on three TREC collections, the authors found weak correlation between
the system performance and features based on the query only. Among them, the features related to term ambiguity were the most correlated to system performance, in line with [21]. Features that consider information on the documents were found to be more correlated
to performance than the ones based on the query only.
The query effect was also studied in relation to the document collection that is searched, by considering information resulting from the document indexing. These query features are grounded on the same idea as term weighting is for indexing: terms are not equivalent and their frequency in documents matters. Inverse document frequency, based on the number of documents in which the term occurs, has specifically been used, but other features were also developed [59,61,75].
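A minimal sketch of the IDF-based pre-retrieval features used later in Table 1 (the maximum and the average IDF over query terms) is given below; the collection statistics are toy placeholders, and a real system would read them from the index.

```python
import math

# Placeholder collection statistics (a real system would read them from the index).
N = 1_000_000                                  # number of documents in the collection
doc_freq = {"information": 120_000, "retrieval": 15_000, "effectiveness": 4_000}

def idf(term):
    # Smoothed IDF; other variants exist.
    return math.log(N / (1 + doc_freq.get(term, 0)))

def idf_features(query_terms):
    values = [idf(t) for t in query_terms]
    return {"idf_max": max(values), "idf_avg": sum(values) / len(values)}

print(idf_features(["information", "retrieval", "effectiveness"]))
```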
Finally, the query effect was also studied considering post-retrieval features [57,62–67,69–73].
Post-retrieval predictors are categorised into clarity-based, score-based, and robustness-based
approaches [22,68]. They imply that a first document retrieval is performed using the query
before the features can be calculated. Post-retrieval features mainly used the document scores.
Considered individually, post-retrieval features have been consistently shown as better
predictors than pre-retrieval ones. It appeared, however, that an individual feature, either
pre- or post-retrieval, is not enough to predict whether the system is going to fail or not or
to predict its effectiveness. That means that none of these individual features “explains” the system performance.
Indeed, many studies have reported weak correlation values for individual fea-
tures [60,71,76] with the actual system effectiveness. When considering a single feature, the
correlation values differ from one collection to another and from one feature to another.
Moreover, they are weak. For example, Hauff et al. [60] report 396 correlation values, among which only 13 are over 0.5. Hashemi et al. [77] reported 216 correlation values, including the ones obtained with a new neural network-based predictor, with a maximum value of 0.58 and a median of 0.17. Chifu et al. [71] reported 312 values, none of which is above 0.50. In the same way, Khodabakhsh and Bagheri report 952 correlation values, none of which is above 0.46 [73]. When correlations are low, it is even likelier that there is either very weak or no correlation at all between the predicted value (here effectiveness) and the feature used to predict. Table 1 and Figure 13 illustrate this. For this illustration, we took IDFMax and IDFAVG, which are considered the best pre-retrieval features [60,69,78], as well as BM25, a post-retrieval feature. We can see that, with a correlation of 0.29 for BM25 or 0.24 for IDF (see Table 1), there is no correlation between the two variables, as depicted in the scatter plots (see Figure 13).
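The three correlation coefficients reported in Table 1 can be computed with SciPy; the sketch below uses placeholder predictor and effectiveness values and also draws the scatter plot that, as argued here, should always be inspected alongside the coefficient.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
# Placeholder values: one predictor (e.g., IDF_MAX) and the actual ndcg per query.
predictor = rng.random(100)
ndcg = rng.random(100)

for name, func in [("Pearson", stats.pearsonr),
                   ("Spearman", stats.spearmanr),
                   ("Kendall", stats.kendalltau)]:
    coef, p_value = func(predictor, ndcg)
    print(f"{name}: {coef:.3f} (p={p_value:.4f})")

# Always look at the scatter plot: a moderate coefficient can hide
# the absence of any usable relationship.
plt.scatter(predictor, ndcg)
plt.xlabel("predictor value")
plt.ylabel("ndcg")
plt.show()
```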
Papers on query performance prediction seldom plot the predicted and actual values, which is however an appropriate means to check whether the correlation exists or not. As a counter-example, we should recall here that the Anscombe’s quartet [79] effect on the Pearson correlation illustrates that even a Pearson correlation of up to 0.816 can be obtained with no real correlation between the two studied variables (see Figure 14).
Table 1. Correlation between query features and ndcg. WT10G TREC collection. * marks the usual <0.05 p-Value significance.

                 Feature
Measure       BM25_MAX   BM25_STD   IDF_MAX   IDF_AVG
Pearson ρ     0.294 *    0.232 *    0.095     0.127
  p-Value     0.0034     0.0224     0.3531    0.2125
Spearman r    0.260 *    0.348 *    0.236 *   0.196
  p-Value     0.0100     <0.001     0.0202    0.0544
Kendall τ     0.172 *    0.230 *    0.159 *   0.136 *
  p-Value     0.0128     <0.001     0.0215    0.0485
Figure 13. No correlation using pre- or post-predictors with the actual effectiveness—IDF pre-
retrieval predictor and BM25 post-retrieval predictor (X-axis) and ndcg (Y-axis) values on WT10G
TREC collection. Although the correlation values are up to 0.35, there is no correlation.
queries only. However, we know from Mizzaro and Robertson [13] that easy queries do not help to distinguish effective systems from non-effective ones.
Figure 14. A Pearson correlation value higher than 0.8 does not mean the two variables are correlated. Anscombe’s quartet presents four datasets that share the same mean, the same number of values, and the same Pearson correlation value (ρ = 0.816), but for which this latter value does not always mean the two variables X and Y are correlated. X and Y are not correlated in #4 despite the high ρ value. In #2, X and Y are perfectly correlated but not in a linear way (Pearson cannot measure other than linear correlations). #1 and #3 illustrate two cases of linear correlation. Figures generated from the data in [79].
Figure 15. Predicted AP is correlated to actual AP for easy queries (the ones on the right part of the plot), although they are sparse. Figure reprinted with permission from Roy et al. [78], Copyright 2019, Elsevier.
With regard to (2), rather than considering individual system performance, Mizzaro et al. [82] focused on the average of average precision values over systems, in order to detect the queries for which systems will fail in general. The authors showed that the correlation is more obvious between the predictor and the average system effectiveness than it was in other studies between the predictor and a single system (see Figure 16).
This also calls for the need to try to understand the relationship between the query factor and the system factor.
Figure 16. AS feature [83] is correlated to the average effectiveness of a set of systems. TREC7
Adhoc collection. Pearson correlation between AAP and (a) QF [66], (b) AS [83]. Dots correspond to
actual and predicted AAP for individual topics; the cones represent the confidence intervals. Figure
reprinted with permission from [82], Copyright 2018, Mizzaro et al.
6.2. Relationship between the Query Factor and the System Factor
In Section 5, we reported studies on system variabilities and the effect of the com-
ponents used at each search phase on the results when averaged over query sets, not
considering deeply the query effect.
Here, we consider studies that were conducted at a finer grain. These pieces of work
tried to understand the very link between the level of query difficulty (or level of system
effectiveness) and the system components or hyperparameters.
This was concretely carried out in Ayter et al., where 80,000 combinations of components and hyper-parameters were evaluated on the 100 queries of the TREC 7 and 8 ad hoc track
topics. The combinations differed on the stemmer used to index the documents (4 different
stemmers were used), the topic tags used to build the query, the weighting model (7
different models), the query expansion model (6 different models) and the query expansion
hyper-parameters which take different values as well. The authors showed that the most
important parameters, the ones that influence the results the most, depend on the difficulty
of the query, that is to say whether relevant documents will be easily found by the search
engine or not [7] (see Figure 17).
In Figure 17a, where only the easy queries are analysed, we can see that the most influential parameter is the query expansion model used, because this is the one on which the tree first splits, here for the value c, which corresponds to the Info query expansion model. The retrieval or matching model is the second most influential parameter. For hard queries, however, the most influential parameter is the topic part used for the query. In that research, the authors either used the title only, or the other topic fields, narrative and description, that provide more information on the users’ need related to the topic. The leaves of the tree give the decision for a query, “easy” (good performance), “average” or “hard” (bad performance), when following a branch from the root to the leaf. The main overall conclusion is that the influential parameters are not the same for easy and hard queries, giving the intuition that the best performance cannot be obtained by applying the same process whatever the queries are. This was further analysed in [53], where more TREC collections were studied with the same conclusions.
These results are in favour of considering the system component parameters not at a global level, as grid search or other optimisation methods do in IR, but rather at a finer grain.
• C3: some queries are easy for all the systems while others are hard for all (see Figure 18, left-side part), but systems do not always fail or succeed on the same queries (see Figure 18, right-side part). Some systems have a similar profile: they fail/succeed on the same queries.
However, it was not possible to understand system successes and failures.
Regarding C1, we also considered the participants’ results from the first 7 years of TREC ad hoc, for a total of 496 systems (or runs), and considered the 130 effectiveness measures from trec_eval that evaluate (system, query) pairs, as in [84]. Correlations when considering pairs of measurements for a given topic and a given system are high, which means that it is possible to distinguish between effective and non-effective systems and that this does not depend on the measure used (see Figure 19).
Figure 18. Some queries are easy for all the systems, some are hard for all, and others depend on the system. On TREC topic 297, all the analysed systems obtained at least 0.5 as NDCG@20, half of them obtained more than 0.65, and some obtained 0.8, which is high. For topic 255, all the systems but 3 failed; only one obtained more than 0.3. The right-part boxplot, as opposed to the left-side ones, shows that for topic 298 the system effectiveness values have a large span, from 0 to almost 1.
Figure 19. When considering a given system and a given query, the effectiveness measure used to
compare the systems does not matter much: all are strongly correlated. Pearson correlation values
between two effectiveness measurements on two measures for a given (system, query) pair. Correla-
tions are represented using a divergent palette (a central colour, yellow, and 2 shades depending on
whether the values go for negative—red—or positive values—blue).
that they may be just small variants of one another (e.g., different hyper-parameters
but basically the same query processing chain, using the same components).
Different attempts to extract more information from official runs were not conclusive.
From automatically generated query processing chains, we have deep knowledge
of the systems: we know exactly which components are used and with which hyper-parameters.
From the analyses that used these data, we can conclude:
• C4: some components and hyper-parameters are more influential than others and
informed choices can be made;
• C5: the choice of the most appropriate components depends on the query level
of difficulty.
Regarding C4: the choice of some components or of their hyper-parameters has
a huge impact on system effectiveness, while for others there is little or no impact on
effectiveness. This means that if one wants to tune or decide on only a few of the parameters
rather than all of them, they should start with the more influential ones. Moreover, we know what the best
decision is for some components: a stoplist should be used, and a stemmer should be used, but
the choice of the stemmer does not matter much; among the weighting models
implemented in Terrier, DirichletLM should be avoided and BB2 is a much better option.
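As an illustration of how one might rank components by influence before tuning (a coarse sketch, not the method used in the cited studies), the following snippet compares, for each component, the spread of mean effectiveness across its possible values; the file and column names are hypothetical.

import pandas as pd

# Hypothetical table of automatically generated runs: one row per
# (configuration, query) pair with the component choices and an AP score.
runs = pd.read_csv("generated_runs.csv")
components = ["stoplist", "stemmer", "weighting_model", "query_expansion_model"]

# For each component, spread between the best and worst average effectiveness
# over its possible values: a coarse indicator of how influential it is.
influence = {
    c: runs.groupby(c)["ap"].mean().max() - runs.groupby(c)["ap"].mean().min()
    for c in components
}
for name, spread in sorted(influence.items(), key=lambda kv: -kv[1]):
    print(f"{name:25s} {spread:.4f}")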
Regarding C5: the choice of the most appropriate query processing chain is related
to the query level of difficulty. In other words, different queries need different processing
chains. This means that if we want to increase system effectiveness in the future, we
should not just tune the system on a per-collection basis: grid search or any other more
sophisticated version of parameter optimisation is not enough. What we need instead is to
adapt the processing chain to the query.
From query analyses, we can conclude:
• C6: neither a single query feature nor a combination of features has been proven to explain
system effectiveness;
• C7: query features can nevertheless explain system effectiveness to some extent.
Despite their apparent contradiction, C6 and C7 are in fact complementary. Some
query features, or combinations of them, seem to be accurate for predicting not individual
system effectiveness but average system effectiveness; there is also some success in predicting easy
queries. Systems are, however, more easily distinguishable on the difficult queries than on
the easy ones, for which their successes are more homogeneous. Up to now, the
accuracy of features or feature combinations has not demonstrated that they can explain
system effectiveness; the reported correlation values are seldom over 0.5, and the more
complex studies do not report scatter plots.
Although we do not yet understand the factors of system effectiveness well, the studies
show that no single system, even when effective on average over a query set, is able to answer
all the queries well (mainly C5, in addition to C3 and C4). Advanced IR techniques can be
grounded on this knowledge. Selective query expansion (SQE), for example, where a meta-
system has two alternative component combinations that differ in using or omitting
the automatic query reformulation phase, made use of the fact that some queries benefit
from being expanded while others do not [85–87]. SQE has not been proven to be very
effective, certainly due to both the limited number of configurations (two different query
processing chains) and the relatively poor learning techniques available at
that time. Selective query processing extends the SQE concept: the system decides which
one, from a possibly large number of component combinations, should be used for each
individual query [10,15,88]. Here, the results were more conclusive. For example, Bigot
et al. [89] developed a model that learns the best query processing chain to use for a query
based on subsets of documents. Although this makes the method applicable to repeated
queries only, it can be an appropriate approach for real-world web search engines. Deveaud
et al. [90] learn to rank the query processing chains for each new query. They used 20,000
different query processing chains. However, this very large number of combinations makes
the approach difficult to use in real-world systems. Arslan and Dinçer [47] developed a meta-model
that uses eight term-weighting models that can be chosen among for any new query.
The meta-model has to be trained, as well as the term-weighting models; this is performed
by a grid search optimisation, which limits the usability. Mothe and Ullah [91] present
an approach to optimise the set of query processing chains that can be chosen among in a
selective query processing strategy. It is based on a risk-sensitive function that optimises
the possible gain of considering a specific query processing chain. The authors show that 20
query processing chains is a good trade-off between the cost of maintaining different query
processing chains and the gain in effectiveness. Still, they do not explain the successes
and failures.
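To make the idea of selective query processing concrete, here is a small sketch in which a meta-model learns to route each query to one of several candidate processing chains from pre-retrieval query features; it only illustrates the general principle of the selective approaches discussed above, and the feature and label names are hypothetical.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical training data: pre-retrieval features of each query and the id
# of the processing chain that performed best on it.
data = pd.read_csv("query_features_best_chain.csv")
features = ["query_length", "avg_idf", "max_idf", "clarity_score"]
X, y = data[features], data["best_chain_id"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
selector = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# At query time, each new query is routed to the chain the model predicts.
print("chain selection accuracy:", selector.score(X_te, y_te))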
Thus, to the question "Can we design a transparent model in terms of its performance
on a query?", I am tempted to answer: "No, not at this stage of IR knowledge; further analyses
are needed". I am convinced that data analytics methods can be investigated further to
analyse the amount of data that has been generated by the community, both in shared
tasks and in labs while tuning systems.
The robustness of the findings across collections would also be worth investigating in
the future.
Abbreviations
The following abbreviations are used in this manuscript:
AP Average Precision
CA Correspondence Analysis
CIKM Conference on Information and Knowledge Management
CLEF Conference and Labs of the Evaluation Forum
IR Information Retrieval
MAP Mean Average Precision
PCA Principal Component Analysis
QE Query Expansion
QPP Query Performance Prediction
SIGIR Conference of the Association for Computing Machinery Special Interest Group in Information Retrieval
SQE Selective Query Expansion
TREC Text Retrieval Conference
References
1. Salton, G.; Wong, A.; Yang, C.S. A vector space model for automatic indexing. Commun. ACM 1975, 18, 613–620. [CrossRef]
2. Robertson, S.E.; Jones, K.S. Relevance weighting of search terms. J. Am. Soc. Inf. Sci. 1976, 27, 129–146. [CrossRef]
3. Robertson, S.; Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond; Now Publishers Inc.: Delft, The Netherlands,
2009; pp. 333–389.
4. Ponte, J.M.; Croft, W.B. A Language Modeling Approach to Information Retrieval. In Proceedings of the 21st Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’98, Melbourne, Australia, 24–28 August
1998; ACM: New York, NY, USA, 1998; pp. 275–281. [CrossRef]
5. Ounis, I.; Amati, G.; Plachouras, V.; He, B.; Macdonald, C.; Johnson, D. Terrier information retrieval platform. In European
Conference on Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2005; pp. 517–519.
6. Taylor, M.; Zaragoza, H.; Craswell, N.; Robertson, S.; Burges, C. Optimisation methods for ranking functions with multiple
parameters. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, Arlington,
VA, USA, 6–11 November 2006; pp. 585–593.
7. Ayter, J.; Chifu, A.; Déjean, S.; Desclaux, C.; Mothe, J. Statistical analysis to establish the importance of information retrieval
parameters. J. Univers. Comput. Sci. 2015, 21, 1767–1789.
8. Tague-Sutcliffe, J.; Blustein, J. A Statistical Analysis of the TREC-3 Data; NIST Special Publication SP: Washington, DC, USA, 1995;
p. 385.
9. Banks, D.; Over, P.; Zhang, N.F. Blind men and elephants: Six approaches to TREC data. Inf. Retr. 1999, 1, 7–34. [CrossRef]
10. Dinçer, B.T. Statistical principal components analysis for retrieval experiments. J. Am. Soc. Inf. Sci. Technol. 2007, 58, 560–574.
[CrossRef]
11. Mothe, J.; Tanguy, L. Linguistic analysis of users’ queries: Towards an adaptive information retrieval system. In Proceedings of
the 2007 Third International IEEE Conference on Signal-Image Technologies and Internet-Based System, Shanghai, China, 16–18
December 2007; pp. 77–84.
12. Harman, D.; Buckley, C. The NRRC reliable information access (RIA) workshop. In Proceedings of the 27th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK, 25–29 July 2004; pp. 528–529.
13. Mizzaro, S.; Robertson, S. HITS hits TREC: Exploring IR evaluation results with network analysis. In Proceedings of the 30th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands,
23–27 July 2007; pp. 479–486.
14. Harman, D.; Buckley, C. Overview of the reliable information access workshop. Inf. Retr. 2009, 12, 615. [CrossRef]
15. Bigot, A.; Chrisment, C.; Dkaki, T.; Hubert, G.; Mothe, J. Fusing different information retrieval systems according to query-topics:
A study based on correlation in information retrieval systems and TREC topics. Inf. Retr. 2011, 14, 617. [CrossRef]
16. Ferro, N.; Silvello, G. A general linear mixed models approach to study system component effects. In Proceedings of the 39th
International ACM SIGIR conference on Research and Development in Information Retrieval, Pisa, Italy, 17–21 July 2016; pp. 25–34.
17. Ferro, N.; Silvello, G. Toward an anatomy of IR system component performances. J. Assoc. Inf. Sci. Technol. 2018, 69, 187–200.
[CrossRef]
18. Louedec, J.; Mothe, J. A massive generation of ir runs: Demonstration paper. In Proceedings of the IEEE 7th International
Conference on Research Challenges in Information Science (RCIS), Paris, France, 29–31 May 2013; pp. 1–2.
19. Wilhelm, T.; Kürsten, J.; Eibl, M. A tool for comparative ir evaluation on component level. In Proceedings of the 34th International
ACM SIGIR Conference on Research and Development in Information Retrieval, Beijing, China, 24–28 July 2011; pp. 1291–1292.
20. Carmel, D.; Yom-Tov, E.; Darlow, A.; Pelleg, D. What makes a query difficult? In Proceedings of the 29th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle,WA, USA, 6–11 August 2006; pp. 390–397.
21. Mothe, J.; Tanguy, L. Linguistic features to predict query difficulty. In ACM Conference on Research and Development in Information
Retrieval, SIGIR, Predicting Query Difficulty-Methods and Applications Workshop; ACM: New York, NY, USA, 2005; pp. 7–10.
22. Zamani, H.; Croft, W.B.; Culpepper, J.S. Neural query performance prediction using weak supervision from multiple signals. In
Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor,
MI, USA, 8–12 July 2018; pp. 105–114.
23. Carpineto, C.; Romano, G. A survey of automatic query expansion in information retrieval. ACM Comput. Surv. (CSUR) 2012,
44, 1–50. [CrossRef]
24. Azad, H.K.; Deepak, A. Query expansion techniques for information retrieval: A survey. Inf. Process. Manag. 2019, 56, 1698–1735.
[CrossRef]
25. Moral, C.; de Antonio, A.; Imbert, R.; Ramírez, J. A survey of stemming algorithms in information retrieval. Inf. Res. Int. Electron.
J. 2014, 19, n1.
26. Kamphuis, C.; de Vries, A.P.; Boytsov, L.; Lin, J. Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring
Variants. In Advances in Information Retrieval; Jose, J.M., Yilmaz, E., Magalhães, J., Castells, P., Ferro, N., Silva, M.J., Martins, F.,
Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 28–34.
27. Mizzaro, S. How many relevances in information retrieval? Interact. Comput. 1998, 10, 303–320. [CrossRef]
28. Ruthven, I. Relevance behaviour in TREC. J. Doc. 2014, 70, 1098–1117. [CrossRef]
29. Hofstätter, S.; Lin, S.C.; Yang, J.H.; Lin, J.; Hanbury, A. Efficiently teaching an effective dense retriever with balanced topic
aware sampling. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information
Retrieval, Virtual Event, 11–15 July 2021; pp. 113–122.
30. Breslow, N.E.; Clayton, D.G. Approximate inference in generalized linear mixed models. J. Am. Stat. Assoc. 1993, 88, 9–25.
31. McCullagh, P.; Nelder, J.A. Generalized Linear Models, 2nd ed.; Chapman and Hall: London, UK, 1989.
32. Dumais, S.T. LSA and information retrieval: Getting back to basics. Handb. Latent Semant. Anal. 2007, 293–322.
33. Sarwar, B.; Karypis, G.; Konstan, J.; Riedl, J. Application of Dimensionality Reduction in Recommender System—A Case Study; Technical
Report; Department of Computer Science and Engineering, University of Minnesota: Minneapolis, MN, USA, 2000.
34. Benzécri, J.P. Statistical analysis as a tool to make patterns emerge from data. In Methodologies of Pattern Recognition; Elsevier:
Amsterdam, The Netherlands, 1969; pp. 35–74.
35. Ward, J.H., Jr. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 1963, 58, 236–244. [CrossRef]
36. Breiman, L.; Friedman, J.; Olshen, R.; Stone, C. Classification and regression trees (CART). Biometrics 1984, 40, 358–361.
37. Ho, T.K. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition,
Montreal, QC, Canada, 14–16 August 1995; Volume 1, pp. 278–282.
38. Gunning, D. Explainable Artificial Intelligence; Defense Advanced Research Projects Agency (DARPA): Arlington, VA, USA, 2017; p. 2.
39. Zhang, Y.; Chen, X. Explainable recommendation: A survey and new perspectives. Found. Trends® Inf. Retr. 2020, 14, 1–101.
[CrossRef]
40. Harman, D. Overview of the First Text Retrieval Conference (trec-1); NIST Special Publication SP: Washington, DC, USA, 1992;
pp. 1–532.
41. Harman, D. Overview of the first TREC conference. In Proceedings of the 16th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, Pittsburgh, PA, USA, 27 June–1 July 1993; pp. 36–47.
42. Buckley, C.; Mitra, M.; Walz, J.A.; Cardie, C. SMART high precision: TREC 7; NIST Special Publication SP: Washington, DC, USA,
1999; pp. 285–298.
43. Clarke, C.L.; Craswell, N.; Soboroff, I. Overview of the Trec 2009 Web Track; Technical Report; University of Waterloo: Waterloo,
ON, Canada, 2009.
44. Collins-Thompson, K.; Macdonald, C.; Bennett, P.; Diaz, F.; Voorhees, E.M. TREC 2014 Web Track Overview; Technical Report;
University of Michigan: Ann Arbor, MI, USA, 2015.
45. Kompaore, D.; Mothe, J.; Baccini, A.; Dejean, S. Query clustering and IR system detection. Experiments on TREC data. In
Proceedings of the ACM International Workshop for Ph. D. Students in Information and Knowledge Management (ACM PIKM
2007), Lisboa, Portugal, 5–10 November 2007.
46. Hanbury, A.; Müller, H. Automated component–level evaluation: Present and future. In International Conference of the Cross-
Language Evaluation Forum for European Languages; Springer: Berlin/Heidelberg, Germany, 2010; pp. 124–135.
47. Arslan, A.; Dinçer, B.T. A selective approach to index term weighting for robust information retrieval based on the frequency
distributions of query terms. Inf. Retr. J. 2019, 22, 543–569. [CrossRef]
48. Di Buccio, E.; Dussin, M.; Ferro, N.; Masiero, I.; Santucci, G.; Tino, G. Interactive Analysis and Exploration of Experimental
Evaluation Results. In European Workshop on Human-Computer Interaction and Information Retrieval EuroHCIR; Citeseer: Nijmegen,
The Netherlands, 2011; pp. 11–14.
49. Compaoré, J.; Déjean, S.; Gueye, A.M.; Mothe, J.; Randriamparany, J. Mining information retrieval results: Significant IR
parameters. In Proceedings of the First International Conference on Advances in Information Mining and Management,
Barcelona, Spain, 23–29 October 2011; Volume 74.
50. Hopfgartner, F.; Hanbury, A.; Müller, H.; Eggel, I.; Balog, K.; Brodt, T.; Cormack, G.V.; Lin, J.; Kalpathy-Cramer, J.; Kando, N.; et al.
Evaluation-as-a-service for the computational sciences: Overview and outlook. J. Data Inf. Qual. (JDIQ) 2018, 10, 1–32. [CrossRef]
51. Kürsten, J.; Eibl, M. A large-scale system evaluation on component-level. In European Conference on Information Retrieval; Springer:
Berlin/Heidelberg, Germany, 2011; pp. 679–682.
52. Angelini, M.; Fazzini, V.; Ferro, N.; Santucci, G.; Silvello, G. CLAIRE: A combinatorial visual analytics system for information
retrieval evaluation. Inf. Process. Manag. 2018, 54, 1077–1100. [CrossRef]
53. Dejean, S.; Mothe, J.; Ullah, M.Z. Studying the variability of system setting effectiveness by data analytics and visualization.
In International Conference of the Cross-Language Evaluation Forum for European Languages; Springer: Cham, Switzerland, 2019;
pp. 62–74.
54. De Loupy, C.; Bellot, P. Evaluation of document retrieval systems and query difficulty. In Proceedings of the Second International
Conference on Language Resources and Evaluation (LREC 2000) Workshop, Athens, Greece, 31 May–2 June 2000; pp. 32–39.
55. Banerjee, S.; Pedersen, T. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the IJCAI 2003,
Acapulco, Mexico, 9–15 August 2003; pp. 805–810.
56. Patwardhan, S.; Pedersen, T. Using WordNet-based context vectors to estimate the semantic relatedness of concepts. In
Proceedings of the Workshop on Making Sense of Sense: Bringing Psycholinguistics and Computational Linguistics Together,
Trento, Italy, 4 April 2006.
57. Cronen-Townsend, S.; Zhou, Y.; Croft, W.B. Predicting query performance. In Proceedings of the 25th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 11–15 August 2002; pp. 299–306.
58. Scholer, F.; Williams, H.E.; Turpin, A. Query association surrogates for web search. J. Am. Soc. Inf. Sci. Technol. 2004, 55, 637–650.
[CrossRef]
59. He, B.; Ounis, I. Inferring query performance using pre-retrieval predictors. In International Symposium on String Processing and
Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2004; pp. 43–54.
60. Hauff, C.; Hiemstra, D.; de Jong, F. A survey of pre-retrieval query performance predictors. In Proceedings of the 17th ACM
Conference on Information and Knowledge Management, Napa Valley, CA, USA, 26–30 October 2008; pp. 1419–1420.
61. Zhao, Y.; Scholer, F.; Tsegay, Y. Effective pre-retrieval query performance prediction using similarity and variability evidence. In
European Conference on Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2008; pp. 52–64.
62. Sehgal, A.K.; Srinivasan, P. Predicting performance for gene queries. In Proceedings of the ACM SIGIR 2005 Workshop on
Predicting Query Difficulty-Methods and Applications. Available online: http://www.haifa.il.ibm.com/sigir05-qp (accessed on
15 May 2022).
63. Zhou, Y.; Croft, W.B. Ranking robustness: A novel framework to predict query performance. In Proceedings of the 15th ACM
International Conference on Information and Knowledge Management, Arlington, VA, USA, 6–11 November 2006; pp. 567–574.
64. Vinay, V.; Cox, I.J.; Milic-Frayling, N.; Wood, K. On ranking the effectiveness of searches. In Proceedings of the 29th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle,WA, USA, 6–11 August
2006; pp. 398–404.
65. Aslam, J.A.; Pavlu, V. Query hardness estimation using Jensen-Shannon divergence among multiple scoring functions. In
European Conference on Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2007; pp. 198–209.
66. Zhou, Y.; Croft, W.B. Query performance prediction in web search environments. In Proceedings of the 30th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, 23–27 July 2007;
pp. 543–550.
67. Shtok, A.; Kurland, O.; Carmel, D. Predicting query performance by query-drift estimation. In Conference on the Theory of
Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2009; pp. 305–312.
68. Carmel, D.; Yom-Tov, E. Estimating the query difficulty for information retrieval. Synth. Lect. Inf. Concepts Retr. Serv. 2010, 2, 1–89.
69. Cummins, R.; Jose, J.; O’Riordan, C. Improved query performance prediction using standard deviation. In Proceedings of the
34th International ACM SIGIR Conference on Research and Development in Information Retrieval, Beijing, China, 24–28 July
2011; pp. 1089–1090.
70. Roitman, H.; Erera, S.; Weiner, B. Robust standard deviation estimation for query performance prediction. In Proceedings of
the ACM SIGIR International Conference on Theory of Information Retrieval, Amsterdam, The Netherlands, 1–4 October 2017;
pp. 245–248.
71. Chifu, A.G.; Laporte, L.; Mothe, J.; Ullah, M.Z. Query performance prediction focused on summarized letor features. In
Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor,
MI, USA, 8–12 July 2018; pp. 1177–1180.
72. Zhang, Z.; Chen, J.; Wu, S. Query performance prediction and classification for information search systems. In Asia-Pacific
Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data; Springer: Cham,
Switzerland, 2018; pp. 277–285.
73. Khodabakhsh, M.; Bagheri, E. Semantics-enabled query performance prediction for ad hoc table retrieval. Inf. Process. Manag.
2021, 58, 102399. [CrossRef]
74. Molina, S.; Mothe, J.; Roques, D.; Tanguy, L.; Ullah, M.Z. IRIT-QFR: IRIT query feature resource. In International Conference of the
Cross-Language Evaluation Forum for European Languages; Springer: Cham, Switzerland, 2017; pp. 69–81.
75. Macdonald, C.; He, B.; Ounis, I. Predicting query performance in intranet search. In Proceedings of the SIGIR 2005 Query
Prediction Workshop, Salvador, Brazil, 15–19 August 2005.
76. Faggioli, G.; Zendel, O.; Culpepper, J.S.; Ferro, N.; Scholer, F. sMARE: A new paradigm to evaluate and understand query
performance prediction methods. Inf. Retr. J. 2022, 25, 94–122. [CrossRef]
77. Hashemi, H.; Zamani, H.; Croft, W.B. Performance Prediction for Non-Factoid Question Answering. In Proceedings of the 2019
ACM SIGIR International Conference on Theory of Information Retrieval, Paris, France, 21–25 July 2019; pp. 55–58.
78. Roy, D.; Ganguly, D.; Mitra, M.; Jones, G.J. Estimating Gaussian mixture models in the local neighbourhood of embedded word
vectors for query performance prediction. Inf. Process. Manag. 2019, 56, 1026–1045. [CrossRef]
79. Anscombe, F.J. Graphs in Statistical Analysis. Am. Stat. 1973, 27, 17–21.
80. Grivolla, J.; Jourlin, P.; de Mori, R. Automatic Classification of Queries by Expected Retrieval Performance; SIGIR: Salvador, Brazil, 2005.
81. Raiber, F.; Kurland, O. Query-performance prediction: Setting the expectations straight. In Proceedings of the 37th International
ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Australia, 6–11 July 2014; pp. 13–22.
82. Mizzaro, S.; Mothe, J.; Roitero, K.; Ullah, M.Z. Query performance prediction and effectiveness evaluation without relevance
judgments: Two sides of the same coin. In Proceedings of the 41st International ACM SIGIR Conference on Research &
Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 1233–1236.
83. Aslam, J.A.; Savell, R. On the Effectiveness of Evaluating Retrieval Systems in the Absence of Relevance Judgments. In
Proceedings of the 26th ACM SIGIR, Toronto, ON, Canada, 28 July–1 August 2003; pp. 361–362.
84. Baccini, A.; Déjean, S.; Lafage, L.; Mothe, J. How many performance measures to evaluate information retrieval systems? Knowl.
Inf. Syst. 2012, 30, 693–713. [CrossRef]
85. Amati, G.; Carpineto, C.; Romano, G. Query difficulty, robustness, and selective application of query expansion. In European
Conference on Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2004; pp. 127–137.
86. Cronen-Townsend, S.; Zhou, Y.; Croft, W.B. A framework for selective query expansion. In Proceedings of the Thirteenth ACM
International Conference on Information and Knowledge Management, Washington, DC, USA, 8–13 November 2004; pp. 236–237.
87. Zhao, L.; Callan, J. Automatic term mismatch diagnosis for selective query expansion. In Proceedings of the 35th International
ACM SIGIR Conference on Research and Development in Information Retrieval, Portland, OR, USA, 12–16 August 2012;
pp. 515–524.
88. Deveaud, R.; Mothe, J.; Ullah, M.Z.; Nie, J.Y. Learning to Adaptively Rank Document Retrieval System Configurations. ACM
Trans. Inf. Syst. (TOIS) 2018, 37, 3. [CrossRef]
89. Bigot, A.; Déjean, S.; Mothe, J. Learning to Choose the Best System Configuration in Information Retrieval: The Case of Repeated
Queries. J. Univers. Comput. Sci. 2015, 21, 1726–1745.
90. Deveaud, R.; Mothe, J.; Nie, J.Y. Learning to Rank System Configurations. In Proceedings of the 25th ACM International on
Conference on Information and Knowledge Management, CIKM ’16, Indianapolis, IN, USA, 24–28 October 2016; ACM: New
York, NY, USA, 2016; pp. 2001–2004.
91. Mothe, J.; Ullah, M.Z. Defining an Optimal Configuration Set for Selective Search Strategy-A Risk-Sensitive Approach. In
Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Online, 1–5 November 2021;
pp. 1335–1345.
Article
On the Use of Morpho-Syntactic Description Tags in Neural
Machine Translation with Small and Large Training Corpora
Gregor Donaj * and Mirjam Sepesy Maučec
Abstract: With the transition to neural architectures, machine translation achieves very good qual-
ity for several resource-rich languages. However, the results are still much worse for languages
with complex morphology, especially if they are low-resource languages. This paper reports the
results of a systematic analysis of adding morphological information into neural machine translation
system training. Translation systems presented and compared in this research exploit morphological
information from corpora in different formats. Some formats join semantic and grammatical infor-
mation and others separate these two types of information. Semantic information is modeled using
lemmas and grammatical information using Morpho-Syntactic Description (MSD) tags. Experiments
were performed on corpora of different sizes for the English–Slovene language pair. The conclusions
were drawn for a domain-specific translation system and for a translation system for the general
domain. With MSD tags, we improved the performance by up to 1.40 and 1.68 BLEU points in the
two translation directions. We found that systems with training corpora in different formats improve
the performance differently depending on the translation direction and corpora size.
Keywords: neural machine translation; POS tags; MSD tags; inflected language; data sparsity; corpora size

MSC: 68T50
1. Introduction
In the last decade, research in machine translation has seen the transition from statistical
models to neural net-based models for most mainstream languages and also some other
languages. At the same time, researchers and developers got access to more and more parallel
corpora, which are essential for training machine translation systems. However, many
languages can still be considered low-resource languages.
Before the widespread use of neural machine translation (NMT), statistical machine
translation (SMT) was the predominant approach. Additional linguistic information was
added to deal with data sparsity or the morphological complexity of some languages.
Often, part-of-speech (POS) or morpho-syntactic description (MSD) tags were included in
SMT systems in one way or another. The tags can be included on the source side, the
target side, or on both sides of the translation direction. This can be done either in the
alignment or in the training and translation phase.
Since the emergence of NMT, relatively few studies have explored the use of additional
linguistic information for machine translation.
In this paper, we want to give an overview of the available studies and present
experiments on using MSD tags for the translation between English and Slovene, a
morphologically complex language. In order to provide a more comprehensive look at the
usefulness of MSD tags, we perform several sets of experiments on a general domain
corpus and a domain-specific corpus by using different training corpora sizes and methods
to include MSD tags. In addition, we explore possibilities to reduce MSD tags, which can
become rather complex in morphologically rich languages.
The rest of the paper is organized as follows. In Section 2 we present related work in
machine translation and morphology. In Section 3 we present our experimental system, the
used corpora, all corpora pre-processing steps, and the designed experiments. The results
of the experiments are presented in Section 4 and discussed in Section 5. Our conclusions
are presented in Section 6.
2. Related Work
2.1. Machine Translation
At first, machine translation was guided by rules. Corpus-based approaches followed.
They were notably better and easier to use than the preceding rule-based technologies.
Statistical machine translation (SMT) [1] is a corpus-based approach, where translations
are generated on the basis of statistical models whose parameters are derived from the
analysis of bilingual and monolingual text corpora. For a long time, SMT was the dominant
approach with the best results. Migration from SMT to neural machine translation (NMT)
started in 2013. In [2], a first attempt was made, where a class of probabilistic continuous
translation models was defined that was purely based on continuous representations for
words, phrases, and sentences and did not rely on alignments or phrasal translation units
like in SMT. Their models obtain a perplexity with respect to gold translations that was
more than 43% lower than that of the state-of-the-art SMT models. Using NMT was, in
general, a significant step forward in machine translation quality. NMT is an approach
to machine translation that uses an artificial neural network. Unlike SMT systems, which
have separate knowledge models (i.e., language model, translation model, and reordering
model), all parts of the NMT model are trained jointly in a large neural model. Since 2015,
NMT systems have been shown to perform better than SMT systems for many language
pairs [3,4]. For example, NMT generates outputs that lower the overall post-edit effort
with respect to the best PBMT system by 26% for the German–English language pair [5].
NMT outperformed SMT in terms of automatic scores and human evaluation for German,
Greek, and Russian [6]. In recent years NMT has also been applied to inflectional languages
such as Slovene, Serbian, and Croatian. Experiments in [7] showed that on a reduced
training dataset with around two million sentences, SMT outperformed the NMT neural
models for those languages. In [8], NMT regularly outperformed SMT for the Slovene–
English language pair. In the present paper, we will more systematically analyze NMT
performance using training corpora of different sizes.
h_t = \mathrm{sigm}\left(W^{hx} x_t + W^{hh} h_{t-1}\right), \qquad y_t = W^{yh} h_t, \qquad (1)
where W^{hx}, W^{hh}, and W^{yh} are the parameter matrices from the input layer to the hidden layer,
from the hidden layer to itself, and from the hidden layer to the output layer, while x_t, h_t,
and y_t denote the input, hidden, and output layer vectors at word t, respectively.
The output sequence is a translation in the target language. The RNN presumes an
alignment between the input and the output sequence, which is unknown. Furthermore, input
and output sequences have different lengths. The solution is to map the input sequence to
a fixed-sized vector using one RNN and then another RNN to map the vector to the target
sequence. Another important phenomenon in natural languages is long-term dependencies
which have been successfully modeled by using long short-term memory (LSTM) [9].
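The following PyTorch sketch illustrates the encoder-decoder idea described above: one LSTM maps the source sequence to a fixed-size state and a second LSTM generates the target sequence from it. It is only a minimal illustration with arbitrary placeholder sizes, without attention, and is far simpler than the NMT systems used later in the paper.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the whole source sentence into the final hidden/cell state.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decode the target sentence conditioned on that state (teacher forcing).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)            # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq(src_vocab=60000, tgt_vocab=60000)
src = torch.randint(0, 60000, (2, 7))       # a toy batch of source token ids
tgt = torch.randint(0, 60000, (2, 9))       # a toy batch of target token ids
print(model(src, tgt).shape)                # torch.Size([2, 9, 60000])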
Here, p(y_t | v, y_1, ..., y_{t-1}) is represented with a softmax function over all words in
the vocabulary.
Not all parts of the input sequence are relevant to producing an output word. The concept
of attention was introduced to identify the parts of the input sequence that are relevant [10].
The context vector c_t captures relevant input information to help predict the current target
word y_t. Given the target hidden state h_t and the context vector c_t, a simple concatenation layer
is used to combine the information from both vectors and produce a hidden attentional state.
The attentional vector \tilde{h}_t is then fed through the softmax layer:
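Assuming the standard global-attention formulation (which the notation with \tilde{h}_t, c_t, and a concatenation layer suggests), the attentional state and the output distribution take roughly the following form:

\tilde{h}_t = \tanh\left( W_c \, [\, c_t ; h_t \,] \right), \qquad
p(y_t \mid y_{<t}, x) = \operatorname{softmax}\left( W_s \, \tilde{h}_t \right)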
LSTM helps to deal with long sequences, but it still fails to maintain the global
information of the source sentence. Afterwards, the transformer architecture was proposed,
which handles the entire input sequence at once and does not iterate word by word.
The superior performance of the transformer architecture was reported in [11].
information, at decoding time. In factored models, a word is not a token but a vector of
factors. Usually, factors in each word are the surface form of a word, lemma, POS tag,
additional morphological information, etc. Factored models use more than one translation
step and one generation step. Each translation step corresponds to one factor, e.g., lemma,
POS factor, MSD feature vector, etc. Translation from the sequence of factors on the source
side to the sequence of factors on the target side is trained. In the generation step, translated
factors are combined into the final surface form of a word. Factored model training uses
factors the same way as words in word-based models. In [1], gains of up to 2 BLEU points
were reported for factored models over standard phrase-based models for the English–
German language pair. Grammatical coherence was also improved.
In [12] a noisier channel is proposed, which extends the usual noisy channel metaphor
by suggesting that an “English” source signal is first distorted into a morphologically
neutral language. Then morphological processes represent a further distortion of the
signal, which can be modeled independently. Noisier channel implementation uses the
information extracted from surface word forms, lemmatized forms, and truncated forms.
Truncated forms are generated using a length limit of six characters. This means that
at most the first six characters of each word are taken into account, and the rest
is discarded. Hierarchical grammar rules are then induced for each type of information.
Finally, all three grammars are combined, and a hierarchical phrase-based decoder is built.
The authors reported a 10% relative improvement of the BLEU metric when translating
into English from Czech, a language with considerable inflectional complexity.
In [13] the focus is on the three common morphological problems: gender, number, and
the determiner clitic. The translation is performed in multiple steps, which differ from the
steps used in factored translation. The decoding procedure is not changed. Only the train-
ing corpus is enriched with new features: lemmas, POS tags, and morphological features.
Classic phrase-based SMT models are trained on this enriched corpus. They afterward use
conditional random fields for morphological class prediction. They reported improvement
of BLEU by 1 point on a medium-size training set (approximately 5 million words) for
translation from English to Arabic.
In [14,15] the quality of SMT was improved by applying models that predict word
forms from their stems using extensive morphological and syntactic information from both
the source and target languages. Their inflection generation model was the maximum
entropy Markov model, and it was trained independently of the SMT system. The authors
reported a small BLEU improvement (<0.5) when translating from English to Russian and
a larger BLEU improvement of 2 when translating English to Arabic. A similar approach
was used in [16]. They trained a discriminative model to predict inflections of target words
from rich source-side annotations. They also used their model to generate artificial word-
level and phrase-level translations, that were added to the translation model. The authors
reported BLEU improvement by 2 points when translating from English to Russian.
sentence pairs between English and Romanian. Using additional synthetic parallel data,
they again achieved comparable improvements.
Lemma and morphological factors were also used as output features [18,19]. Results for
English to French translation were improved by more than 1.0 BLEU using a training
corpus with 2 million sentence pairs. They also used an approach that generates new words
controlled by linguistic knowledge. They halved the number of unknown words in the
output. Using factors in the output reduces the target side vocabulary and allows the model
to generate word forms that were never seen in the training data [18]. A factored system
can support a larger vocabulary because it can generate words from the lemma and factors
vocabularies, which is advantageous when data is sparse.
The decoder learns considerably less morphology than the encoder in the NMT archi-
tecture. In [20], the authors found that the decoder needs assistance from the encoder and
the attention mechanism to generate correct target morphology. Three ways to explicitly
inject morphology in the decoder were explored: joint generation, joint-data learning, and
multi-task learning. Multi-task learning outperformed the other two methods. A 0.2 BLEU
improvement was reported for English to Czech translation. Authors argued that by having
larger corpora, the improvement would be larger.
Unseen words can also be generated by using character-level NMT or sub-word
units, determined by the byte-pair encoding (BPE) algorithm [21]. Character-level NMT
outperformed NMT with BPE sub-words when processing unknown words, but it per-
formed worse than BPE-based systems when translating long-range phenomena of morpho-
syntactic agreement. Another possibility is to use morpheme-based segmentations [22].
A new architecture was proposed to enrich character-based NMT with morphological
information in a morphology table as an external knowledge source [23]. Morphology
tables increase the capacity of the model by increasing the number of network parameters.
The proposed extension improved BLEU by 0.55 for English to Russian translation using
2.1 million sentence pairs in the training corpus.
The authors in [24] also investigate different factors on the target side. Their experi-
ments include three representations varying in the quantity of grammatical information
they contain. In the first representation, all the lexical and grammatical information was
encoded in a single factor. In the second representation, only a well-chosen subset of
morphological features was kept in the first factor, and the second factor corresponded to
the POS tag. In the third representation, the first factor was a lemma, and the second was
the POS tag. In some experiments, BPE splitting was also used. They reported that carefully
selected morphological information improved the translation results by 0.56 BLEU for
English to Czech translation and 0.89 BLEU for English to Latvian translation.
In [25], the authors focused on information about the subject's gender. Regular
source-language words were annotated with the grammatical gender information of the
corresponding target-language words. This gender information improved the BLEU
results by 0.4.
Most often, NMT systems rely only on large amounts of raw text data and do not use
any additional knowledge source. Linguistically motivated systems can help to overcome
data sparsity, generalize, and disambiguate, especially when the dataset is small.
focus on data preparation rather than on NMT architecture. We used the same NMT
architecture through all experiments. We only modify the data used for training the NMT
systems, similarly to some systems presented in [20]. The advantage of our approach is that
it can be easily transferred to other language pairs, as appropriate tools for lemmatization
and MSD tagging are already available for several morphologically complex languages.
The literature review also shows that often changes in the architecture of the models are
required, or that special translation tools are used, which enable a factored representation
of the input tokens. However, the most widely used NMT tools do not enable such a
representation. Thus, our approach can be more easily included in practical applications,
such as the use of the OPUS-MT plugin for translation tools, which uses Marian NMT [26].
The main contributions of this paper are: (1) a review of literature on using morpholog-
ical information in SMT and NMT; (2) an empirical comparison of the influence of different
types of morphologically tagged corpora on translation results in both translation direc-
tions separately; (3) a comparison of the effect of morphologically pre-processed training
corpora with regard to the training corpus size; (4) a comparison of results obtained on
domain-specific corpus and general corpus.
3. Methods
3.1. Corpora
In our experiments, we used the freely available Europarl and ParaCrawl corpora
for the English–Slovene language pair. Both can be downloaded at the OPUS collection
website (https://opus.nlpl.eu/, accessed on 28 January 2022).
Europarl is extracted from the proceedings of the European Parliament and is con-
sidered a domain-specific corpus. It consists of approximately 600,000 aligned parallel
segments of text. ParaCrawl was built by web crawling and automatic alignment. We
consider it to be a more general domain corpus. It consists of approximately 3.7 million
aligned segments of text. Both corpora are available for several language pairs. The above
sizes refer to the English–Slovene language pair. Experiments were performed in the same
way separately on both corpora. Therefore, all processing steps apply the same way for
both corpora.
We divided the corpora into three parts by using 2000 randomly selected segment pairs
as the development set, another 2000 randomly selected segment pairs as the evaluation
set and the remaining segments as the training set. For some later experiments, we further
divided the training set into ten equally sized parts.
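A minimal sketch of such a split, assuming two line-aligned corpus files with hypothetical names:

import random

# The two input files are assumed to be aligned line by line.
with open("corpus.en", encoding="utf-8") as f_en, open("corpus.sl", encoding="utf-8") as f_sl:
    pairs = list(zip(f_en.read().splitlines(), f_sl.read().splitlines()))

random.seed(42)
random.shuffle(pairs)
splits = {"dev": pairs[:2000], "eval": pairs[2000:4000], "train": pairs[4000:]}

for name, part in splits.items():
    with open(f"{name}.en", "w", encoding="utf-8") as out_en, \
         open(f"{name}.sl", "w", encoding="utf-8") as out_sl:
        for en, sl in part:
            out_en.write(en + "\n")
            out_sl.write(sl + "\n")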
Punctuation normalization mainly ensures the correct placing of spaces around punctu-
ation symbols and replaces some UTF-8 punctuation symbols with their ASCII counterparts.
The most significant effect from the pre-processing steps can be seen between the normal-
ized text (which after post-processing is also called de-truecased and de-tokenized text)
and the tokenized and truecased text. A couple of examples from the ParaCrawl corpus are
presented in Table 2. From them, we see that tokenization separates punctuation symbols
into separate tokens by inserting spaces around them. Note that, in the second example,
the point (“.”) is not tokenized, as it is used as a decimal symbol. Truecasing then converts
all words to their lowercase forms, unless they are proper names.
De-truecasing and de-tokenization is the reverse process, i.e., rejoining punctuation
symbols and capitalizing the first word in a segment (sentence). These steps are done in
the post-processing stage in order to obtain a translation in the normalized form according
to the grammar of the target language.
Table 2. Comparison between normalized segments (NM), and tokenized and truecased segments (TC).
Form Segment
NM After the fourth hour, Moda plays a major role.
TC After the fourth hour, Moda plays a major role .
NM User manual PDF file, 2.6 MB, published 19 October 2018
TC User manual PDF file, 2.6 MB, published 19 October 2018
Table 3. Example of an MSD tagged and lemmatized sentence in English (EN) and Slovene (SL).
The English tagset defines 58 different MSD tags for words and punctuations, all of
which appear in the used corpora. The Slovene tagset defines 1902 different MSD tags.
However, only 1263 of them appear in the Europarl training corpus, and 1348 of them
appear in the ParaCrawl training corpus.
Format    Segment
W         let us be honest .
W+M       let MSD:VV us MSD:PP be MSD:VB honest MSD:JJ . MSD:.
WM        let-MSD:VV us-MSD:PP be-MSD:VB honest-MSD:JJ .-MSD:.
L+M       LEM:let MSD:VV LEM:us MSD:PP LEM:be MSD:VB LEM:honest MSD:JJ LEM:. MSD:SENT
LM        LEM:let-MSD:VV LEM:us-MSD:PP LEM:be-MSD:VB LEM:honest-MSD:JJ LEM:.-MSD:SENT
WW-MM     let us be honest . <MSD> MSD:VV MSD:PP MSD:VB MSD:JJ MSD:.
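A small sketch of how the corpus formats listed above can be produced from (word, lemma, MSD tag) triples; it reproduces the English example from the table, with the punctuation tag simplified to a single value:

# Tagged input taken from the example above: (word, lemma, MSD tag) triples.
tagged = [("let", "let", "VV"), ("us", "us", "PP"), ("be", "be", "VB"),
          ("honest", "honest", "JJ"), (".", ".", ".")]

def to_format(tokens, fmt):
    if fmt == "W":
        return " ".join(w for w, _, _ in tokens)
    if fmt == "W+M":
        return " ".join(f"{w} MSD:{m}" for w, _, m in tokens)
    if fmt == "WM":
        return " ".join(f"{w}-MSD:{m}" for w, _, m in tokens)
    if fmt == "L+M":
        return " ".join(f"LEM:{l} MSD:{m}" for _, l, m in tokens)
    if fmt == "LM":
        return " ".join(f"LEM:{l}-MSD:{m}" for _, l, m in tokens)
    if fmt == "WW-MM":
        words = " ".join(w for w, _, _ in tokens)
        tags = " ".join(f"MSD:{m}" for _, _, m in tokens)
        return f"{words} <MSD> {tags}"
    raise ValueError(fmt)

for fmt in ["W", "W+M", "WM", "L+M", "LM", "WW-MM"]:
    print(fmt, "->", to_format(tagged, fmt))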
We build word-based vocabularies from both training corpora, containing 60,000 words
on each translation side for the experiments with the Europarl corpus and 200,000 words
on each side in the experiments with the ParaCrawl corpus.
The reason for the different sizes lies in the different text coverages given a particular
vocabulary size. Table 5 shows the out-of-vocabulary (OOV) rates on the evaluation set in
both corpora. We see that those rates are far lower in the domain-specific corpus Europarl.
The remaining OOV rate at the larger vocabularies is due to words that do not appear in
the training set at all. On the other hand, the OOV rates are higher in the general-domain
corpus ParaCrawl, where a more diverse vocabulary is to be expected. Therefore, a more
extensive vocabulary is needed to obtain reasonable results. From the data, we also see
that OOV rates are significantly greater in Slovene, which is due to the high number of
word forms.
For the systems that use MSD tags as separate tokens (W+M, WW-MM), the vocabulary
was simply extended with the MSD tags. In the system which uses the words and MSD
tags concatenated into one token (WM), the vocabulary was extended so that it includes
all words of the original vocabulary in combination with all possible corresponding MSD
tags that appear in the training set. Some word forms can have different grammatical
categories. For example, feminine gender words ending in “-e” typically can have four
different tags given their case and number: (1) singular and dative, (2) singular and locative,
(3) dual and nominative, and (4) dual and accusative. This is similarly true for most words
(primarily other nouns, adjectives, and verbs). Hence, the size of the vocabulary increases
substantially for concatenated combinations.
In the systems using lemmas instead of words, the vocabularies are similarly built.
The final sizes of all used vocabularies are presented in Table 6.
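As a sketch of how the concatenated WM vocabulary can be collected (the file name is hypothetical, and the word and tag shown in the comment are only illustrative):

from collections import Counter

word_counts, wm_counts = Counter(), Counter()
with open("train.wm.sl", encoding="utf-8") as f:        # one WM-format segment per line
    for line in f:
        for token in line.split():                      # e.g. "miza-MSD:Ncfsn"
            wm_counts[token] += 1
            word_counts[token.rsplit("-MSD:", 1)[0]] += 1

# Keep the most frequent words, then every observed word+MSD combination of them.
top_words = {w for w, _ in word_counts.most_common(60000)}
wm_vocab = {t for t in wm_counts if t.rsplit("-MSD:", 1)[0] in top_words}
print(len(top_words), "words ->", len(wm_vocab), "word+MSD tokens")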
Often BPE or other data-driven methods for splitting words into subword units are
used to improve the results of NMT. However, since those methods are not linguistically
informed, they can produce erroneous surface forms by concatenating incompatible sub-
word units. Also, the generation of MSD tags is based on full word forms and cannot
be performed on subword units. For this reason, we decided to avoid using data-driven
segmentation in our paper.
We used BLEU for the validation during training and the final evaluation of the models.
BLEU is a widely used metric to evaluate the performance of machine translation systems,
and it is defined by
\mathrm{BLEU} = \min\left(1, \frac{\mathrm{len}_o}{\mathrm{len}_r}\right) \cdot \left( \prod_{i=1}^{4} \mathrm{prec}_i \right)^{1/4}, \qquad (5)
where len_o is the length of the evaluated translation, len_r is the length of the reference
translation, and prec_i is the ratio between the number of matched n-grams and the total
number of n-grams of order i between the reference translation and the evaluated translation.
The machine translation output was post-processed, i.e., de-truecased and de-tokenized,
during validation and the final evaluation. Post-processing also included removing the MSD tags
and, in the WW-MM format, the special <MSD> tag. In the two systems that generate the lemma and MSD
tag on the target side, we also had to employ a model that determines the correct
word form.
Additionally, we performed statistical significance tests on all results compared to the
baseline results. We used the paired bootstrap resampling test [29], a widely used test in
machine translation, with a p-value threshold of 0.05 [12,13,17,23,25].
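A sketch of the paired bootstrap resampling test, scored here with sacrebleu's corpus_bleu; SacreBLEU also ships its own implementation of this test, and the toy segments below are placeholders:

import random
import sacrebleu

def paired_bootstrap(refs, base_hyps, new_hyps, n_samples=1000, seed=0):
    rng = random.Random(seed)
    indices = list(range(len(refs)))
    not_better = 0
    for _ in range(n_samples):
        sample = [rng.choice(indices) for _ in indices]        # resample segments with replacement
        r = [refs[i] for i in sample]
        base_bleu = sacrebleu.corpus_bleu([base_hyps[i] for i in sample], [r]).score
        new_bleu = sacrebleu.corpus_bleu([new_hyps[i] for i in sample], [r]).score
        not_better += (new_bleu <= base_bleu)
    return not_better / n_samples                              # rough one-sided p-value

# Toy usage with placeholder segments:
refs = ["the cat sits on the mat .", "we agree completely ."]
base = ["the cat sit on mat .", "we agree ."]
new = ["the cat sits on the mat .", "we agree completely ."]
print("p =", paired_bootstrap(refs, base, new, n_samples=200))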
Table 7. An example segment tagged with full and reduced MSD tags.
Form Segment
Words Gospod predsednik , hvala za besedo .
Full MSD tags Ncmsn Ncmsn , Ncfsn Sa Ncfsa .
Reduced MSD tags Nmsn Nmsn , Nfsn Sa Nfsa .
The reduction of the MSD tag complexity resulted in a decreased number of different
tags. In the Europarl training corpus, the number was reduced from 1263 to 418, and in the
ParaCrawl training corpus from 1348 to 428.
3.7. Tools
In the pre-processing and post-processing steps, we used several scripts, which are
part of the MOSES statistical machine translation toolkit [30]. Those steps are cleanup,
tokenization, truecasing, de-truecasing, and de-tokenization. We used the SacreBLEU [31]
evaluation tool for scoring and statistical tests.
MSD tagging and lemmatization were performed using TreeTagger [32] and the ac-
companying pre-trained models for Slovene and English.
Translation model training and translation were performed using Marian NMT [33]
on NVIDIA A100 graphical processing units.
4. Results
4.1. Translation Systems
Table 8 shows translation results from English to Slovene for all combinations of
corpora formats, and Table 9 shows the results for the same experiments from Slovene to
English on the Europarl corpus. In both cases, the result in the first line and first column is
the baseline result (surface word form).
Below all results, we show p-values for the paired bootstrap resampling test between
the given result and the baseline result.
Table 8. BLEU scores and p-values for the translations from English to Slovene for different systems
on the Europarl corpus. Rows give the source-side format and columns the target-side format; each cell
shows the BLEU score and, in parentheses, the p-value of the paired bootstrap test against the baseline.

Source\Target   W                 W+M               WM                L+M               LM                WW-MM
W               35.97 (baseline)  37.17 (p<0.001)   36.77 (p=0.003)   36.96 (p=0.002)   34.78 (p=0.001)   36.78 (p=0.004)
W+M             37.12 (p<0.001)   37.37 (p<0.001)   37.18 (p<0.001)   36.88 (p=0.003)   35.90 (p=0.319)   37.28 (p<0.001)
WM              36.48 (p=0.037)   37.21 (p<0.001)   36.14 (p=0.195)   36.16 (p=0.188)   35.15 (p=0.008)   36.67 (p=0.012)
L+M             37.24 (p<0.001)   36.57 (p=0.033)   36.73 (p=0.011)   36.31 (p=0.117)   34.92 (p=0.001)   36.66 (p=0.011)
LM              35.90 (p=0.305)   36.20 (p=0.158)   35.77 (p=0.202)   35.52 (p=0.071)   33.64 (p<0.001)   36.01 (p=0.342)
WW-MM           23.83 (p<0.001)   27.84 (p<0.001)   25.03 (p<0.001)   36.71 (p=0.014)   23.91 (p<0.001)   28.51 (p<0.001)
Table 9. BLEU scores and p-values for the translations from Slovene to English for different systems
on the Europarl corpus. Rows give the source-side format and columns the target-side format; each cell
shows the BLEU score and, in parentheses, the p-value of the paired bootstrap test against the baseline.

Source\Target   W                 W+M               WM                L+M               LM                WW-MM
W               40.35 (baseline)  40.48 (p=0.222)   40.33 (p=0.366)   39.14 (p<0.001)   39.10 (p<0.001)   40.35 (p=0.421)
W+M             41.16 (p=0.002)   40.55 (p=0.177)   40.97 (p=0.012)   40.12 (p=0.146)   39.46 (p=0.001)   40.54 (p=0.167)
WM              40.19 (p=0.193)   39.95 (p=0.069)   40.15 (p=0.153)   38.86 (p<0.001)   38.76 (p<0.001)   39.60 (p=0.003)
L+M             42.03 (p<0.001)   41.80 (p<0.001)   41.53 (p<0.001)   40.40 (p=0.347)   40.27 (p=0.288)   41.01 (p=0.006)
LM              39.66 (p=0.004)   39.87 (p=0.040)   39.70 (p=0.007)   38.79 (p<0.001)   38.50 (p<0.001)   39.83 (p=0.029)
WW-MM           35.03 (p<0.001)   39.73 (p=0.012)   39.80 (p=0.022)   38.93 (p<0.001)   38.39 (p<0.001)   39.59 (p=0.002)
When comparing the baseline results with the other models, we see that some combinations
give better results and others give worse results than the baseline model.
In the direction from English to Slovene, the best results are achieved when using
words and MSD tags as separate tokens on both translation sides. On the other hand,
in the direction from Slovene to English, the best results are achieved with lemmas and
MSD tags as separate tokens on the source side and only the surface word form on the
target side. We will refer to these two systems as the improved systems for their respective
translation directions.
Tables 10 and 11 show the same results as the previous two tables, albeit on the
ParaCrawl corpus. They show that the best results are obtained with the same
combination of formats as on the Europarl corpus.
Table 10. BLEU scores and p-values for the translations from English to Slovene for different systems
on the ParaCrawl corpus. Rows give the source-side format and columns the target-side format; each cell
shows the BLEU score and, in parentheses, the p-value of the paired bootstrap test against the baseline.

Source\Target   W                 W+M               WM                L+M               LM                WW-MM
W               44.09 (baseline)  44.51 (p=0.103)   44.19 (p=0.306)   41.78 (p<0.001)   41.79 (p<0.001)   44.65 (p=0.059)
W+M             44.41 (p=0.149)   44.80 (p=0.025)   39.35 (p<0.001)   42.00 (p<0.001)   41.50 (p<0.001)   44.65 (p=0.053)
WM              44.01 (p=0.315)   43.96 (p=0.255)   44.64 (p=0.071)   41.82 (p<0.001)   41.88 (p<0.001)   44.28 (p=0.213)
L+M             42.95 (p=0.001)   41.89 (p<0.001)   42.50 (p<0.001)   39.52 (p<0.001)   39.75 (p<0.001)   42.58 (p<0.001)
LM              42.41 (p<0.001)   41.91 (p<0.001)   42.63 (p<0.001)   40.41 (p<0.001)   40.04 (p<0.001)   42.34 (p<0.001)
WW-MM           44.35 (p=0.167)   43.50 (p=0.052)   44.29 (p=0.227)   41.83 (p<0.001)   41.06 (p<0.001)   44.15 (p=0.354)
Table 11. BLEU scores and p-values for the translations from Slovene to English for different systems
on the ParaCrawl corpus. Rows give the source-side format and columns the target-side format; each cell
shows the BLEU score and, in parentheses, the p-value of the paired bootstrap test against the baseline.

Source\Target   W                 W+M               WM                L+M               LM                WW-MM
W               47.89 (baseline)  47.45 (p=0.088)   47.34 (p=0.045)   43.63 (p<0.001)   42.88 (p<0.001)   47.67 (p=0.179)
W+M             47.84 (p=0.340)   48.02 (p=0.259)   47.44 (p=0.087)   43.73 (p<0.001)   42.70 (p<0.001)   47.53 (p=0.124)
WM              47.81 (p=0.314)   47.67 (p=0.187)   46.88 (p=0.005)   43.12 (p<0.001)   43.23 (p<0.001)   47.57 (p=0.136)
L+M             48.19 (p=0.153)   48.06 (p=0.221)   48.00 (p=0.285)   43.98 (p<0.001)   43.11 (p<0.001)   47.60 (p=0.181)
LM              46.79 (p=0.002)   47.06 (p=0.016)   46.84 (p=0.003)   42.82 (p<0.001)   43.23 (p<0.001)   46.87 (p=0.002)
WW-MM           47.53 (p=0.127)   47.53 (p=0.126)   41.99 (p<0.001)   42.74 (p<0.001)   42.76 (p<0.001)   47.46 (p=0.085)
(Figure 1: BLEU versus relative training corpus size, 10% to 100%, for the systems Word form, W+M to W+M, and W+M to W+M with reduced MSD tags.)
Figure 1. Translation results for the selected translation systems from English to Slovene using
Europarl with respect to relative training corpora size.
(Figure 2: BLEU versus relative training corpus size, 10% to 100%, for the systems Word form, L+M to Word, and L+M to Word with reduced MSD tags.)
Figure 2. Translation results for the selected translation systems from Slovene to English using
Europarl with respect to relative training corpora size.
(Figure 3: BLEU versus relative training corpus size, 5% to 100%, for the systems Word form, W+M to W+M, and W+M to W+M with reduced MSD tags.)
Figure 3. Translation results for the selected translation systems from English to Slovene using
ParaCrawl with respect to relative training corpora size.
(Figure 4: BLEU versus relative training corpus size, 5% to 100%, for the systems Word form, L+M to Word, and L+M to Word with reduced MSD tags.)
Figure 4. Translation results for the selected translation systems from Slovene to English using
ParaCrawl with respect to relative training corpora size.
the results using transformers were slightly worse. When using transformers to translate
from English to Slovene, the baseline system gave a BLEU score of 30.90, and the improved
system (W+M to W+M) gave a score of 32.64. When translating from Slovene to English,
the baseline system gave a BLEU score of 34.97, and the improved system (L+M to Words)
gave a score of 36.21. Overall, the results were approximately 5 points below the results
of the LSTM models. However, we can observe similar improvements and trends as with
the LSTM models. Thus, the choice of model architecture does not impact our conclusions.
Finally, we decided to use the better-performing models.
(Figure 5, example 4, source text: "Kot je v svojem odličnem govoru dejala gospa Essayah ....")
Figure 5. Selected translation examples from Slovene to English with the source text, reference
translation, baseline system translation and improved system translation using lemmas and MSD
tags (L+M to W).
In many cases, the difference consists of some words in one translation being replaced
with synonyms or words with only slightly different meanings. Such are the first and second
examples in Figure 5. In the first example, the first marked difference gives a better score to
the improved system, while the second difference gives a better score to the baseline system.
Other such pairs we found include this–that, final–last, manage–deal with, and firstly–first of all. Similar differences can be seen in the translations from English to Slovene.
In the second example, we see that both systems generated different translations than
what is in the reference translation.
In the third example, we see a restructuring of the sentence that results in a better
score with the improved system, although both systems give a correct translation.
Examples that refer specifically to morphology in English translation are mostly found
in prepositions. This can be seen in the last two examples. In the fourth example, the
improved system produces a better translation, and in the fifth example, the baseline system
produces a better translation.
With regard to morphology, a more interesting comparison can be done when translat-
ing from English to Slovene. Figure 6 shows four examples to illustrate some differences
between the baseline and the improved system.
1 Source text : ... important incidents are a lot for one man.
Figure 6. Selected translation examples from English to Slovene with the source text, reference
translation, baseline system translation and improved system translation using words and MSD tags
(W+M to W+M).
In the first example, we have a sentence that ends with “a lot for one man”. We can
see that the improved system produces a translation that is identical (in the last part of the
sentence) to the reference translation. The baseline system, on the other hand, places the
word “veliko” (engl: much) at the end of the sentence, which is still grammatically correct.
However, the baseline system translates “one man” as “en človek”, which is wrongly in the nominative case instead of the correct accusative “enega človeka”.
In the second example, we have a sentence with a reference translation that does not
include the word for judgment but instead refers only to “this”. Both translation systems
added the phrase for “this judgment”. However, it was added in different cases. The
baseline system produced “To sodbo”, which can be the accusative or instrumental case
(the last of the six cases in Slovene grammar), both of which are grammatically incorrect.
The improved system produced the correct nominative form “Ta sodba”. There is another
difference in the translations—the word “presenetilo” (engl: to surprise) is in the incorrect
neutral gender form in the baseline system and in the correct feminine gender form in the
improved system.
The third example is interesting from a different angle. The reference translation
literally translates to “I agree completely”, while both systems translate the phrase “I could
not agree more”, and the improved system also adds “S tem” (engl: with this). Interestingly,
the word “could” is translated by the baseline system to the feminine form “mogla” and by the improved system to the masculine form “mogel”. Both translations are grammatically
correct, and since the source text is gender neutral, both translations can be considered
correct. The reference translation is also gender neutral.
In the fourth example, we have the abbreviation “ZDA” (engl: USA), which in Slovene is used in the plural. The baseline system produced the singular verb “je” (engl: is), while the improved system gave the correct plural form “so” (engl: are). This example clearly shows the usefulness of MSD tags, as they more often lead to grammatically correct translations. We speculate that this particular translation is improved because the MSD tag for “ZDA” included the information that it is a noun in the feminine form. The baseline translation, on the other hand, may be wrong because “US” is treated as a singular noun in the source text (and in English in general); hence, the English singular form “is” was translated to the Slovene singular form.
5. Discussion
The results obtained in experiments indicate the role of MSD tags in translation be-
tween English and Slovene. The first set of experiments, when translating from English
to Slovene, showed that the best performance was achieved with models using words
and MSD tags on both sides (W+M to W+M). To generate correct Slovene word forms,
morphological information is needed. In Table 8, we see that the improvement on the
Europarl corpus is 1.40 BLEU points. Examining Figures 1 and 3, we can see these models
start outperforming the baseline models when using 30% of the Europarl training corpora
(187,000 segments). They also outperformed the baseline models on the ParaCrawl cor-
pus from the smallest tested size, 5% (186,000 segments), and up to 30% of the full size
(1.1 million segments). The results on those corpora sizes are also statistically significant.
At larger corpora sizes, these models at some data points also outperform the baseline
models. However, the results are no longer statistically significant.
In the first set of experiments, for the translation from Slovene to English, we found
the best performance with the models that use lemmas and MSD tags on the source side
and words on the target side (L+M to W). Separating the meaning (i.e., the lemma) from the morphology (i.e., the MSD) on the Slovene side is beneficial, as almost no morphological information is needed to generate correct English translations. Many different Slovene word forms that share the same lemma translate to the same English word. Using lemmas and MSD tags instead of words also reduces data sparsity to a great extent. In Table 9,
we see that the improvement on the Europarl corpus is 1.68 BLEU points. Examining
Figures 2 and 4, we can see these models outperformed the baseline models at all data
points on the Europarl corpus and on the ParaCrawl corpus up to 70% of its full size
(2.6 million segments). These results are statistically significant, while the results from 80%
of the ParaCrawl corpus upwards are not.
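The p-values in Tables 8–11 and the significance statements above follow the test of Koehn [29], i.e., paired bootstrap resampling over the sentence-aligned test set. A minimal sketch of such a test, using the sacrebleu package only for BLEU computation; the sample count and variable names are illustrative, not the authors' exact implementation:

    import random
    import sacrebleu

    def paired_bootstrap_p(sys_a, sys_b, refs, n_samples=1000, seed=12345):
        """Approximate p-value for 'system A is not better than system B' via
        paired bootstrap resampling over sentence-aligned outputs."""
        rng = random.Random(seed)
        n = len(refs)
        wins_a = 0
        for _ in range(n_samples):
            idx = [rng.randrange(n) for _ in range(n)]   # resample sentences with replacement
            bleu_a = sacrebleu.corpus_bleu([sys_a[i] for i in idx], [[refs[i] for i in idx]]).score
            bleu_b = sacrebleu.corpus_bleu([sys_b[i] for i in idx], [[refs[i] for i in idx]]).score
            if bleu_a > bleu_b:
                wins_a += 1
        return 1.0 - wins_a / n_samples

For example, comparing the W+M-to-W+M output against the word-form baseline on the same test sentences would yield an estimate comparable to the reported p-values.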
We can also examine the results in Tables 8–11 for format combinations other than
the best. We see that models using lemmas on the target side generally perform the worst,
regardless of the source side. Additionally, such models require the conversion from
lemmas and MSD tags to surface word forms in the post-processing steps, making them
more difficult to use. However, when lemmas are used as separate tokens with MSD
tags on the source side, the models perform better than the baseline model, except on the
ParaCrawl corpus when translating from English to Slovene.
We can also compare results where the MSD tags are separate tokens with the results
where they are concatenated either to words or to lemmas. With a few exceptions, combi-
nations with separate tokens perform better. We assume that this is due to the vocabulary
sizes. When MSD tags are concatenated with words or lemmas, the vocabulary size in-
creases significantly. Consequently, the model might need to learn more parameters to
attain a comparable performance. However, this would mean that the models are no longer
comparable and thus not suited for our research.
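To make the compared formats concrete, the following sketch shows how one tagged sentence could be rendered in the formats discussed above, given (word, lemma, MSD) triples from a lemmatizer and MSD tagger. The separator character, the illustrative MSD tags, and the exact token layout are assumptions, and the WW-MM variant is omitted because its layout is not restated here:

    # Each token is a (word, lemma, MSD) triple; the tags below are only illustrative.
    sentence = [("ZDA", "ZDA", "Ncfpn"), ("so", "biti", "Va-r3p-n"), ("velike", "velik", "Agpfpn")]

    def to_format(tokens, fmt):
        if fmt == "W":     # plain word forms (baseline)
            return " ".join(w for w, _, _ in tokens)
        if fmt == "W+M":   # words and MSD tags as separate, interleaved tokens
            return " ".join(f"{w} {m}" for w, _, m in tokens)
        if fmt == "WM":    # MSD tag concatenated to the word form
            return " ".join(f"{w}|{m}" for w, _, m in tokens)
        if fmt == "L+M":   # lemmas and MSD tags as separate tokens
            return " ".join(f"{l} {m}" for _, l, m in tokens)
        if fmt == "LM":    # MSD tag concatenated to the lemma
            return " ".join(f"{l}|{m}" for _, l, m in tokens)
        raise ValueError(fmt)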
Next, we can compare results from models that use full MSD tags and their coun-
terparts with reduced MSD tags. The differences are mostly small. However, we can see
that in most cases, models with full MSD tags outperform models with reduced MSD
tags. The most noticeable exception is in Figure 2 at 10% of the training corpora size.
We speculate that this result is due to data sparsity with the very small training corpora size
(60,000 segments). Such corpora sizes might still be found for very specific domains. These
results indicate that MSD reduction does not result in significant further improvements
in translation performance, or might even lower it. Still, reducing the complexity of MSD tags to their most important features could have benefits, as such tagging might be faster and more precise. This would be important in practical translation systems, where speed matters.
Our approach to this research consisted of data preparation rather than a modification
of the translation architecture and available tools. This has the advantage that the approach
can easily be transferred to other language pairs, as appropriate tools for lemmatization
and MSD tagging are already available for several morphologically complex languages.
Several examples show that differences between translation systems often consist of
producing synonyms or sentences with a different structure. Systems can generate gram-
matically correct and meaningful translations but receive worse scores from an automatic evaluation metric such as BLEU, as it compares them against a single gold-standard translation.
One method to alleviate this problem is the use of manual evaluation. The drawback here
is cost and time. This further emphasizes the need for evaluation sets with several possible
reference translations for each sentence on the source side.
It is more difficult to draw a conclusion when comparing the results in Figure 1 with
the results in Figure 3, i.e., the same translation direction in different domains. All results
show a dependency on corpora size, but the size of the Europarl training corpus is about
17% of the size of the ParaCrawl training corpus. We selected the first data point for the
ParaCrawl corpus to be 5% of the full size, which is fairly similar to 30% of the Europarl
corpus. In both figures, we can see a similar trend in the results up to the full size of the Europarl corpus, which is approximately 17% of the ParaCrawl corpus. Comparing those results in Figures 1 and 3, we can see that the improved system (W+M) outperforms the baseline system by approximately the same amount. The same can be seen in the other
translation direction from Figures 2 and 4. Since we compare different domains with
different test sets, it is not reasonable to compare exact numbers. From our results, we
can only conclude that we found no strong indications for a difference in the inclusion of
linguistic knowledge between the domain-specific system and the general domain system.
6. Conclusions
In this research, we presented a systematic comparison of translation systems trained
on corpora in a variety of formats and of different sizes. Such a systematic approach for
empirical evaluation of a large number of NMT systems has only recently become possible
due to the availability of high-performance computing systems. Such systems employ a
large number of graphical processing units, which are needed for training NMT systems.
In this research, we were able to show that NMT systems can benefit from additional
morphological information if one of the languages in the translation pair is morphologically
complex. We were also able to show that those benefits depend on the form in which
morphological information is added to the corpora and the translation direction.
We were able to show which combination of corpora formats gives the best results
when translating to or from a morphologically complex language. Those conclusions may
apply to other language pairs of English and inflected languages. However, for translation
pairs consisting of two complex languages, the experiments would have to be repeated to
determine the best format on the source and the target side.
Mainly, we were able to confirm that the benefits heavily depend on the size of the
training corpora. In the paper, we present empirical evidence for corpora of different
sizes. Thus, we were able to give specific corpora sizes at which the improvements cease
to be statistically significant. On the other hand, we found that the same systems that
outperform the baseline system on the small domain-specific corpus also outperform the
baseline system on the larger general-domain corpus.
Hence, we would argue that the inclusion of morphological information into NMT
is mostly beneficial for specific domains or, in general, for language pairs with a small
amount of parallel data. However, even when a larger amount of training data is available,
translation performance can still be improved.
Our qualitative analysis also showed that not all differences between the systems
are recognized as improvements. Further work may include testing such systems with
evaluation sets that have several reference translations or include a manual evaluation.
The presented approach for reducing the complexity of MSD tags was based on
grammatical knowledge, and it brought only slightly improved translation accuracy. We
would argue, however, that the tagging speed in practical applications would benefit from
simpler tags. Here, further work may consist of testing data-driven approaches to reduce
their complexity, retain translation performance, and increase the tagging speed.
One of the challenges of machine translation is vocabulary size, especially for highly inflected languages. There are methods to alleviate this problem by using word
splitting. Future work in this area might include combining the presented addition of lin-
guistic knowledge with data-driven and knowledge-driven approaches for word splitting,
e.g., BPE and stem-ending splitting.
Author Contributions: Conceptualization, G.D.; methodology, G.D. and M.S.M.; software, G.D.;
formal analysis, G.D. and M.S.M.; writing, G.D. and M.S.M.; visualization, G.D. All authors have
read and agreed to the published version of the manuscript.
Funding: This work was supported by the Slovenian Research Agency (research core funding
No.P2-0069-Advanced Methods of Interaction in Telecommunications).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Publicly available datasets were analyzed in this study. The data
can be found here: https://opus.nlpl.eu/Europarl.php (accessed on 22 January 2021) and https:
//opus.nlpl.eu/ParaCrawl.php (accessed on 22 January 2021).
Acknowledgments: The authors thank the HPC RIVR (www.hpc-rivr.si, accessed on 31 January 2022)
consortium for the use of the HPC system VEGA at the Institute of Information Science (IZUM).
They also want to thank the authors of the Europarl and ParaCrawl parallel corpora.
Conflicts of Interest: The authors declare no conflict of interest.
Identifying Source-Language Dialects in Translation
Sergiu Nisioi *,† , Ana Sabina Uban *,† and Liviu P. Dinu
Human Language Technologies Center, Faculty of Mathematics and Computer Science, University of Bucharest,
Academiei 14, 010014 Bucharest, Romania; [email protected]
* Correspondence: [email protected] (S.N.); [email protected] (A.S.U.)
† These authors contributed equally to this work.
Abstract: In this paper, we aim to explore the degree to which translated texts preserve linguistic
features of dialectal varieties. We release a dataset of augmented annotations to the Proceedings of
the European Parliament that cover dialectal speaker information, and we analyze different classes
of written English covering native varieties from the British Isles. Our analyses aim to discuss the
discriminatory features between the different classes and to reveal words whose usage differs between
varieties of the same language. We perform classification experiments and show that automatically
distinguishing between the dialectal varieties is possible with high accuracy, even after translation,
and propose a new explainability method based on embedding alignments in order to reveal specific
differences between dialects at the level of the vocabulary.
1. Introduction
Computational approaches in Translation studies enforced the idea that translated
texts (regarding translations, we will use the abbreviation SL to define source language
and TL for target language) have specific linguistic characteristics that make them struc-
turally different from other types of language production that take place directly in the target language. Translations are considered a sub-language (translationese) of the target language [1–3] and studies [4–9] imply that translated texts have similar characteristics irrespective of the target language of translation (translation universals). Universals emerge from psycholinguistic phenomena such as simplification—“the tendency to make do with less words” in the target language [10,11], standardization—the tendency for translators to choose more “habitual options offered by a target repertoire” instead of reconstructing the original textual relations [5], or explicitation—the tendency to produce more redundant constructs in the target language in order to explain the source language structures [12,13].
In addition, translated texts also exhibit patterns of language transfer or interference [14]—a phenomenon inspired by second-language acquisition, indicating certain source-language structures that get transferred into the target text. Using text mining and statistical analysis, researchers were able to identify such features [3,15,16] up to the point of reconstructing phylogenetic trees from translated texts [17,18].
Investigations with respect to translationese identification have strong potential for improving machine translation, as [19,20] pointed out for statistical machine translation, and more recently [21] showed that the effect of translationese can impact the system rankings of submissions made to the yearly shared tasks organized by the Conference on Machine Translation [22]. Ref. [23] show that a transformer-based neural machine translation (NMT) system can obtain better fluency and adequacy scores in terms of human evaluation, when the model accounts for the impact of translationese.
While the majority of translation research has been focused on how different source languages impact translations, to our knowledge, little research has addressed the properties of the source language that stem from dialectal or non-native varieties, or how and to what degree they are preserved in translated texts.
In our work, we intend to bring this research question forward and investigate whether
dialectal varieties produce different types of translationese and whether this hypothesis
holds for machine-translated texts. We construct a selection of dialectal varieties based
on the Proceedings of the European Parliament, covering utterances of speakers from
the British Isles and equivalent sentence-aligned translations into French. Our results
imply that interference in translated texts does not depend solely on the source language
(SL), rather, different language varieties of the same SL can affect the final translated text.
Translations exhibit different characteristics depending on whether the original text was
produced by speakers of different regional varieties of the same language.
To our knowledge, this is the first result of its kind extracted from a stylistically
uniform multi-author corpus using principles of statistical learning and our contribution
can be summarized as follows:
1. We build and release an augmented version of the EuroParl [24] corpus that contains
information about speakers’ language and place of birth.
2. We investigate whether the dialectal information extracted is machine-learnable,
considering that the texts in the European Parliament go through a thorough process of
editing before being published.
3. Using sentence-aligned equivalent documents in French, we analyze to what degree
dialectal features of the SL are preserved in the translated texts. Additionally, we employ
a transformer architecture to generate English to French translations [25] and investigate
whether dialectal varieties impact the machine-translated texts.
4. For each dialectal variety we fine-tune monolingual embeddings and align them to
extract words whose usage differs between varieties of the same language. We analyze and
interpret the classification results given the sets of aligned word pairs between different
classes of speakers.
We perform a series of experiments to achieve our research goals. To observe the differences between our classes and to gain additional insights from the obtained classification performance, we compare several different solutions: we use a variety of linguistic features as well as several model architectures, including log-entropy-based logistic regression and neural networks. For the second stage of our experiments, we choose state-of-the-art methods used in lexical replacement tasks (including lexical semantic change detection [26] and bilingual lexicon induction [27]), based on non-contextual word embeddings and vector space alignment algorithms, in order to produce a shared embedding space that allows us to more closely compare word usage across the different varieties. We publicly share our code and detailed results (https://github.com/senisioi/dialectal_varieties, accessed on 30 November 2021), as well as the produced dataset.
the representative country where there could be multiple official languages. We ignore
speakers for whom the location is missing or who were invited as guests in the Parliament
(e.g., speeches by the 14th Dalai Lama). We also acknowledge that speakers sometimes
employ external teams to write their official speeches and that EuroParl transcriptions are
strongly edited before being published in their final form.
Statistics regarding the corpus are rendered in Table 1, where we notice that the group of speakers from Wales and the ones with an Unknown source are underrepresented, with a small amount of data; we therefore exclude these categories from our experiments.
Table 1. Extracted statistics: mean and standard deviation sentence length, and type-token ratio
(TTR). Both TTR and average sentence length are statistically significant under a permutation test,
with p-value < 0.01 for original English documents from Scotland, pair-wise for: England vs. Scotland
and Ireland vs. Scotland.
We render the sentence length mean and standard deviation, and the overall type/token
ratio to highlight shallow information with respect to the lexical variety of the texts. At first
glance, original English texts appear to have shorter sentences and smaller type-token
ratios (smaller lexical variety) compared to their French counterparts. Rich lexical variety
in translated texts has been previously linked [31,32] to the explicitation phenomenon.
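The significance statement in the caption of Table 1 refers to a permutation test; a minimal sketch of such a test for the difference in average sentence length between two groups (the number of permutations and the use of the absolute difference in means as the test statistic are assumptions):

    import random

    def permutation_test(sample_a, sample_b, n_perm=10000, seed=0):
        """Two-sided permutation test for a difference in means between two samples."""
        rng = random.Random(seed)
        def mean(xs):
            return sum(xs) / len(xs)
        observed = abs(mean(sample_a) - mean(sample_b))
        pooled = list(sample_a) + list(sample_b)
        n_a = len(sample_a)
        extreme = 0
        for _ in range(n_perm):
            rng.shuffle(pooled)                 # random relabeling of the pooled values
            if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
                extreme += 1
        return extreme / n_perm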
In addition, we construct a machine-translated corpus of French sentences using a transformer-based [33] neural machine translation model trained in a distributed fashion using the fairseq-py library (https://github.com/pytorch/fairseq, accessed on 30 November 2021). Ref. [25] report state-of-the-art results on English-to-French translation for the WMT'14 dataset [34]. We acknowledge that the parallel data used for training the transformer also contains EuroParl v7 (http://statmt.org/europarl/, accessed on 30 November 2021) [24], alongside the Common Crawl, French-English 10⁹, News Commentary, and United Nations parallel corpora. It is likely that the model has already “seen” similar data during its training, which could lead to more fluent automatic translations. In our work, we aim to see whether the dialectal information influences the machine-generated output.
3. Experimental Setup
In our experiments, we use statistical learning tools to observe the structural differ-
ences between our classes. Our aim is to minimize any type of classification bias that
could appear because of unbalanced classes, topic, parliamentary sessions, and specific user
utterances, with the purpose of exposing grammatical structures that shape the dialectal
varieties. To minimize the effect of uniform parliamentary sessions, we shuffle all the
sentences for each dialectal variety. The data are split into equally-sized documents of
approximately 2000 tokens to ensure the features are well represented in each document,
following previous work on translationese identification [3,15,35,36]. Splitting is done by
preserving the sentence boundary, each document consisting of approximately 66 sentences.
Larger classes are downsampled multiple times and evaluation scores are reported as an
average across all samples of equally-sized classes. To compare the classification of the
same documents across different languages, we construct a test set of 40 sentence-aligned
chunks. When not mentioned otherwise, we report the average 10-fold cross-validation
scores across multiple down-samplings. We illustrate the stages performed from data
collection to pre-processing and classification in Figure 1.
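A minimal sketch of the chunking step described above, assuming the sentences of a variety have already been shuffled and whitespace-tokenized; the 2000-token target follows the setup above, while the function and variable names are illustrative:

    def chunk_into_documents(sentences, target_tokens=2000):
        """Group whole sentences into documents of roughly target_tokens tokens,
        never splitting inside a sentence."""
        documents, current, n_tokens = [], [], 0
        for sentence in sentences:
            current.append(sentence)
            n_tokens += len(sentence.split())
            if n_tokens >= target_tokens:
                documents.append(" ".join(current))
                current, n_tokens = [], 0
        if current:
            documents.append(" ".join(current))
        return documents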
where N is the number of documents in the corpus and pij is defined by the normalized
frequency of term i in document j.
To normalize the pij values, we divide by the global frequency in the corpus:
p_ij = tf_ij / ( ∑_{j=1}^{N} tf_ij )
The final weight of a feature is computed by multiplying the entropy with the log weight:
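Assuming the standard log-entropy scheme (a global entropy weight per term, 1 + ∑_j p_ij log p_ij / log N, multiplied by a log-scaled local term frequency), a minimal sketch of this weighting is:

    import numpy as np

    def log_entropy_weights(tf):
        """tf: (n_terms, n_docs) raw term-frequency matrix.
        Returns log-entropy weighted features, assuming the standard formulation
        w_ij = g_i * log(1 + tf_ij) with g_i = 1 + sum_j p_ij*log(p_ij) / log(N)."""
        n_docs = tf.shape[1]
        p = tf / np.maximum(tf.sum(axis=1, keepdims=True), 1e-12)   # p_ij = tf_ij / sum_j tf_ij
        plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
        g = 1.0 + plogp.sum(axis=1) / np.log(n_docs)                # global entropy weight per term
        return g[:, None] * np.log1p(tf)                            # multiply by the local log weight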
Features
Function words (FW) consist of conjunctions, preposition, adverbs, determiners, aux-
iliary and modal verbs, pronouns, qualifiers, and question words. Some function words
are also part of the closed class because languages rarely introduce changes (historically)
in this vocabulary subset [42,43]. They possess primarily a grammatical meaning, and their frequency in a document reflects syntactic constructions that are particular to style. This
word category has a long history of usage, being the primary features of analysis for the
identification of authorship, translationese, or dialectal varieties [15,36,44,45] since they
tend to be less biased by the topics or content covered in the texts. Ref. [46] argue that
different brain functions are used to process the closed class and the open class of words.
Pronouns are a subclass of function words that have been previously tied to explic-
itation [12,47–49], translators showing an increased usage of personal pronouns. In our
experiments, we observed that these features play a more important role in distinguishing
human- and machine-translated dialectal varieties than they do for original English texts.
Part of Speech n-grams are useful for capturing shallow syntactic constructs. We
extract PoS bigrams and trigrams from our texts using the latest version of spaCy 3.2 [50]
transformer models for English based on RoBERTa [51] and the French model based on
CamemBERT [52] (Transformer-based models latest release https://spacy.io/usage/v3-2
accessed on 30 November 2021). We insert an additional token in the PoS list (SNTSEP) that
indicates whether the next token is sentence end, in this way we hope to cover syntactic
constructions that are typical to start/end the sentences. Unlike the previous features
and due to the sheer size of possible combinations, PoS tag n-grams have a tendency to
be sparsely represented in documents. This may lead to accurate classifications without
exposing an underlying linguistic difference between the classes. To alleviate this, we have
capped the total number of n-grams to 2000 and further conducted experiments with a list
of 100 PoS n-grams curated using a Recursive Feature Elimination method [53] for both
English and French corpora.
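A minimal sketch of this feature extraction, assuming the standard spaCy transformer pipelines (en_core_web_trf for English, fr_dep_news_trf for French) and coarse PoS tags; the counting and capping details are simplified relative to the setup described above:

    import spacy
    from collections import Counter

    nlp = spacy.load("en_core_web_trf")   # RoBERTa-based English pipeline

    def pos_ngram_counts(text, n_values=(2, 3)):
        """Count PoS bigrams and trigrams, with SNTSEP marking sentence boundaries."""
        doc = nlp(text)
        tags = []
        for sent in doc.sents:
            tags.extend(token.pos_ for token in sent)
            tags.append("SNTSEP")
        counts = Counter()
        for n in n_values:
            counts.update(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
        return counts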
Word n-grams including function and content words. For each text, we replace all the
entities discovered by spaCy with the corresponding entity type, including proper nouns
that could potentially cover places, locations, nationality, and countries, but also numeric
entities such as percentages and dates which could bias the classification. Using this feature
set, we hope to understand how much the semantics of the texts, the usages and choices
of content words, and potentially the topics addressed by different speakers contribute
to separating between language varieties. This feature set is biased by the topics that are
repeatedly addressed by different groups, given their regional interests in Scotland, Ireland
or England. Furthermore, the feature set can potentially introduce sparsity and in order
to alleviate this, we enforce a strict limit and cap the total number of allowed n-grams to
the most frequent 300. We experimented with smaller numbers of word n-grams (100, 200)
and observed that the majority of features were comprised of function word combinations
and expressions such as: “I would like”, “member states”, “SNTSEP However”, “we
should”, “the commissioner”, “mr president I”. We also experimented with larger numbers
of n-grams: 400, 500 which easily achieved perfect accuracy due to the topic information
embedded in the higher dimension.
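A minimal sketch of the entity masking used for this feature set (replacing every spaCy-detected entity with its type label); how overlapping spans, casing, or tokenization were handled by the authors is not specified, so this is only illustrative:

    def mask_entities(text, nlp):
        """Replace each named entity with its entity type, e.g., GPE, NORP, DATE, PERCENT."""
        doc = nlp(text)
        pieces, last = [], 0
        for ent in doc.ents:
            pieces.append(text[last:ent.start_char])
            pieces.append(ent.label_)
            last = ent.end_char
        pieces.append(text[last:])
        return "".join(pieces)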
Convolutional Neural Networks (CNN) are able to extract relevant semantic contexts
from texts by learning filters over sequences of representations. We apply convolutions over
word sequences (including all words in the text) using one 1-dimensional convolutional
layer of 10 filters, with filter size 3, followed by a max pooling and an output layer.
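A sketch of a classifier matching this description (one 1-dimensional convolution with 10 filters of width 3, global max pooling, and an output layer); the vocabulary size, embedding dimension, and number of classes are placeholders rather than the authors' settings:

    import torch
    import torch.nn as nn

    class TextCNN(nn.Module):
        def __init__(self, vocab_size=20000, emb_dim=100, n_classes=3):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)
            self.conv = nn.Conv1d(emb_dim, out_channels=10, kernel_size=3)
            self.output = nn.Linear(10, n_classes)

        def forward(self, token_ids):                      # token_ids: (batch, seq_len)
            x = self.embedding(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
            x = torch.relu(self.conv(x))                   # (batch, 10, seq_len - 2)
            x = torch.max(x, dim=2).values                 # max pooling over the sequence
            return self.output(x)                          # class logits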
4. Results
Table 2 contains the full set of classification results that compare the logistic re-
gression log-entropy-based method with different features and the convolutional neural
networks’ results.
Content-dependent methods based on word n-grams and convolutional networks
stand out as having the largest scores (above 0.9 for English and French) even though
proper nouns and entities have been removed beforehand. This is an indicator that speakers
belonging to these regions are classified based on different specific topics addressed in
the European Parliament. Translated dialects appear to be easier to separate with word
n-grams. Manually analysing the entity tagging and removal process for French, we could
observe that markers of location (irlandais, britannique) were not completely removed from
the texts. Content words as features for text classification are less relevant than content-
independent features to test linguistic hypotheses. We can also observe here that CNNs
obtain slightly lower scores for this task, possibly due to the small size of the dataset and
the 2000-word length of the input classification documents.
Table 2. Average F1 scores for distinguishing dialectal varieties and their translations into French.
Values in bold indicate the best accuracy obtained using topic-independent features. The feature set
100 PoS En are the most representative n-grams for classifying original English documents and simi-
larly 100 PoS Fr, for the human-translated French documents. Word n-grams (limited to a maximum
of 300 most frequent) and convolutional neural networks (CNN) are covering content words and are
biased by topic, therefore we do not highlight the classification scores of the two methods.
The magnitude of logistic regression coefficients can give an estimate of feature impor-
tance for classification corresponding to each class. A manual inspection (The supplemen-
tary material contains the full set of features ordered by importance.) of the most important
classification features from Table 3 shows that certain debates have (key)words acting as
good discriminators between our classes. The topics hint towards political tensions with
respect to Northern Ireland, fishing rights, and state-policies in the region.
Table 3. The top most relevant word n-grams in binary text classification scenarios.
From the total most frequent words in the corpus, several function words appear to
have a high importance for English: regard, must, we, to, s, you, he, sure; and French: de l,
son, donc, nous, votre, tres, mais, ont, de m, deja. Table 3 contains several words marked as
important in separating different classes, where we can observe that dialectal varieties are
potentially influenced by function word and more specifically pronoun usage.
Topic-independent features that include function words, pronouns, and PoS n-grams
yield relatively high scores for original English texts, indicating that the place of birth is
a valid indicator of dialectal information for our particular dataset. The translations show
significantly lower scores, but still above 0.8 for 3-way classification on both human and
machine -translated versions. These features are an indicator of the grammatical structures
that transfer from source to target language and we highlight in boldface the highest
scores for each. PoS n-grams tend to achieve the highest classification scores among all
experiments, when taking into account the sparse high dimensionality of each classification
example. When restricting the overall set to the 100 most frequently occurring PoS n-grams,
unsurprisingly, the overall scores drop by 10%. While taking into account this drop, we
can still observe a fair degree of pair-wise separation between the classes. Furthermore,
the curated list of PoS n-grams is language independent and we used the list extracted
from the French corpus to classify the annotated documents in English and vice-versa.
Original English can be separated with an F1 score ranging from 0.71 to 0.82 when using
PoS n-gram features extracted from the French data. A similar phenomenon occurs for the translated French documents, which can be separated with an F1 score ranging from 0.71 to 0.8 using PoS n-grams extracted from the English data. This is a clear indicator that shallow
syntactic constructs that are specific to each class are transferred during translation into the
resulting documents.
With respect to machine-translated texts, it appears that all the classification exper-
iments achieve slightly higher scores than the equivalent human-translated data. Since
machine-generated translations are more rudimentary, it could very well be that the origi-
nal dialectal patterns are simply amplified or mistranslated into the target language, thus
generating the proper conditions to achieve statistical separation between the classes.
Pronouns show the opposite result on both machine and human translation outputs.
Original English dialectal varieties are weakly classifiable using pronouns, with England vs. Scotland achieving at best a 0.74 score. Pronouns appear to be better markers of separation in translated texts, these words being markers of explicitation, as previous research hypothesised [12,47,48]. The results show that pronoun distribution in translation accentuates original patterns of the texts, mainly due to explicitation, a phenomenon that appears to be
mimicked by machine-translation systems trained on human translations. For example,
the most important pronouns in English classification are: we, this, anyone, anybody, several,
everyone, what. For French we observe several different personal pronouns of high impor-
tance: human translations: nous, l, j, les, la, m, je; and machine translation: nous, j, la, en, l,
qui, m, quoi que, celles.
5. Classification Analysis
Given the high accuracy obtained on both English and French version of the corpora
using PoS n-grams, we render in Table 4 the average confusion matrices across all cross-
validation epochs for these results. On the English side of the table, the largest confusion is
between speakers from England and Scotland, while on the French translations, the confu-
sions are more uniformly distributed. From this result, it becomes clear that translations
preserve certain syntactic aspects of the source-language dialect, although the differences
between the classes are slightly lost in the process.
We have also constructed a comparable train/test split designed with the same doc-
uments in both English and French classification scenarios. The first four rows of Table 5
render the percentage of documents from the test set classified with the same label in
both English and French human-translated versions. The process is similar to computing
an accuracy score of the French classifications given the English equivalent as the gold
standard. The result gives us an estimation of the number of test documents that have
the same distinguishing pattern w.r.t a feature type. From Table 5 we confirm the fact
that pronouns have different roles in translated texts—showing little overlap between the
predictions on the French test set vs. the English equivalent. Function words and PoS
n-grams have slightly higher overlap percentages, again, proving that certain grammatical
patterns transfer from the dialectal varieties onto the French translation. Word n-grams
and CNNs share the highest prediction similarities between the two test sets. We believe
this is to a lesser degree due to source-language transfer, rather it corroborates that topics
addressed in the debates determine similar classification patterns across languages.
Table 4. Comparison of average confusion matrices for original English and French classification experiments using PoS n-gram features.
French translations (rows and columns in the order En, Ir, Sc):
  En: 80, 10, 10
  Ir: 7.5, 80, 12.5
  Sc: 5, 5, 90
Original English (rows and columns in the order En, Ir, Sc):
  En: 85, 5, 10
  Ir: 4.5, 91, 4.5
  Sc: 0, 5, 95
Table 5. The percentage of documents from the test set classified with the same label in both English
and French translated versions. The last row compares the 3-way classification similarities between
dialectal classification of documents from human and machine translated output.
For French human vs. machine translation, we present only the 3-way classification
similarities (last row in Table 5), since the pair-wise versions have similar values. In this case
we observe the divergence between human- and machine- generated translations in terms
of different features. The output produced by the transformer-based NMT system does not
resemble typical human language in terms of function words distribution, as seen in the
low amount of classification overlap (67%). However, the machine appears to do better at
imitating translationese explicitation, given the higher importance of pronouns in classification
(0.71 F1 score and 70% overlap between human and machine translation classifications).
Similarly, the ability of PoS n-grams to distinguish English varieties with a 0.83 F1 score and
with 78% similarity to the classification of human translations, indicates that dialectal syntactic
structures are reasonably preserved in both human- and machine- translation. Content-
wise, both CNNs and word n-grams lead to similar classification patterns on the test
set (84% overlap and 0.95 avg. F1 score). Overall, the dialectal markers yield prediction
correlations between machine- and human- generated translations.
6. Words in Context
Using weights learned by a linear model to infer feature importance can be useful for
interpreting the behavior of the classifier and explain some of the underlying linguistic
mechanisms that distinguish between the classes considered, in our case - language varieties.
Nevertheless, this method has its limits: based on feature importance in our classifier we are
essentially only able to find differences between the word distributions of two corpora, in
terms of frequencies. In order to gain more insight into the phenomena behind the aspects
of language that make different varieties of English distinguishable with such high accuracy,
we propose a novel method for feature analysis based on aligned word embedding spaces in
order to identify word pairs which are used differently in two corpora to be compared.
Aligned embedding spaces have previously been exploited in various tasks in com-
putational linguistics, from bilingual lexicon induction [27], to tracking word sense evo-
lution [54], and identifying lexical replacements [55,56]. We also propose using word
embedding spaces for finding lexical replacements, this time across dialectal varieties.
By training word embeddings on two different corpora, and comparing their structural
differences, we can go beyond word frequency distribution, and look into the specific
lexical differences that distinguish word usage between texts in the two classes. If the
feature weight method could tell us, for example, that English speakers use maybe more
than the Irish do, the embedding alignment based method should be able to show exactly
how that word is used differently, what word Irish speakers use instead in their speech, in
the form of a word analogy: where English speakers say maybe, Irish speakers say X.
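A minimal sketch of this procedure, assuming word2vec embeddings trained per variety with gensim and an orthogonal Procrustes alignment over the shared vocabulary; taking one minus the cosine similarity between a word's aligned vector and its counterpart is one reasonable instantiation of the misalignment score, not necessarily the authors' exact formulation:

    import numpy as np
    from gensim.models import Word2Vec
    from scipy.linalg import orthogonal_procrustes

    def align_and_score(tokenized_sents_a, tokenized_sents_b, dim=100):
        """Train two embedding spaces, rotate space A onto space B, and score each
        shared word by how far its usage diverges between the two corpora."""
        wv_a = Word2Vec(sentences=tokenized_sents_a, vector_size=dim, min_count=5).wv
        wv_b = Word2Vec(sentences=tokenized_sents_b, vector_size=dim, min_count=5).wv
        shared = [w for w in wv_a.index_to_key if w in wv_b.key_to_index]
        A = np.stack([wv_a[w] for w in shared])
        B = np.stack([wv_b[w] for w in shared])
        R, _ = orthogonal_procrustes(A, B)                  # orthogonal map from space A to space B
        A_rot = A @ R
        cos = (A_rot * B).sum(1) / (np.linalg.norm(A_rot, axis=1) * np.linalg.norm(B, axis=1))
        scores = dict(zip(shared, 1.0 - cos))               # higher score = more divergent usage
        # nearest neighbour of a word's rotated vector in the other space, e.g.:
        # wv_b.similar_by_vector(A_rot[shared.index("maybe")], topn=1)
        return scores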
For each pair of dialects, we train embeddings, perform embedding space alignments
and extract nearest neighbors for misaligned word pairs. Table 6 shows some descriptive
statistics of the distribution of misalignment scores for all misaligned words in each corpus
pair. A higher average misalignment score should point to a bigger difference in patterns
of word usage between two corpora. The ranking of dialectal “similarities” inferred in
this way is still maintained after translation into French, although differences in language
usage seem to be reduced after translation.
In Figure 2 we plot the distribution of misalignment scores for all words (including
non-misaligned ones) and all dialect pairs, along with French translations. The distribution
is skewed, with most words having a score in the vicinity of 0, showing similar usage
patterns in the two corpora.
[Figure 2: three panels, (a) Sc-Ir, (b) Ir-En, (c) Sc-En.]
Figure 2. Distribution of misalignment scores across dialect pairs for English and French data sets.
Table 6. Average and standard deviation of misalignment scores for all corpus pairs, in original and translated versions.
Varieties   English (mean / std)   French (mean / std)
En-Sc       0.049 / 0.040          0.048 / 0.040
En-Ir       0.063 / 0.053          0.058 / 0.048
Ir-Sc       0.053 / 0.048          0.050 / 0.040
We take a closer look at some examples of misaligned words (The full set of aligned
words is available in the supplementary material along with their corresponding similarity
scores.) that are used differently across corpora in Table 7. The method unveils word pairs
that capture differences in topic content between the two corpora, further corroborating
that topic contributes to distinguishing between texts written by speakers of different
English varieties. Such an example is the pair Scotland/England, which captures mentions
of proper nouns: in contexts where Scottish speakers say Scotland, the English say England.
The same occurs in the case of irlandais and écossais for the French translations of Irish and
Scottish texts.
More interestingly, the method helps capture an underlying stylistic dimension of
content word usage as well, by identifying misaligned pairs of words with the same
meaning (synonyms). Content-independent features are traditionally employed in stylistic
analyses in order to remove bias from topic. Our analysis shows content words can
encapsulate a stylistic dimension as well, and should not be ignored when considering
aspects of the language independent from topic.
Table 7. Examples of misaligned words and their nearest neighbors across the compared corpora.
Corpora       Word             Nearest Neighbor
En-Sc         England          Scotland
              reply            answer
              but              however
              extremely        very
En-Sc (fr)    aspiration       ambition
              recommandation   proposition
Ir-Sc         plan             program
              she              he
Ir-Sc (fr)    plan             programme
              irlandais        écossais
Ir-En         absolutely       perfectly
              keep             hold
Ir-En (fr)    absolument       vraiment
              comprendre       croire
To express the same concept, the Irish tend to use plan where the Scottish say program,
and the same pattern can be observed in the translated versions of the texts: the French
words plan and programme are nearest neighbors. The same is true for the Irish absolutely
versus and the English perfectly, translated as absolument and vraiment in French. In
addition, several example pairs may still yield an unwanted nearest neighbor, as it is the
case for unless vs. if, indeed vs. nevertheless. These examples show that a certain threshold
must be enforced in order to filter them out. A few examples of function words also stand
out, such as very and extremely, which distinguish Scottish from English speakers. This
last pair is also consistent with the feature importance analysis from our logistic regression
results.
7. Conclusions
We construct an augmented version of the English-French parallel EuroParl corpus
that contains additional speaker information pointing to native regional dialects from the
British Isles (We will open-source the data and code for reproducing the experiments). The
corpus has several properties useful for the joint investigation of dialectal and translated
varieties: it is stylistically uniform, the speeches are transcribed and normalized by the
same editing process, there are multiple professional translators and speakers, and the
translators always translate into their mother tongue.
Our experimental setup brings forward the first translation-related result (to the best
of our knowledge) showing that translated texts depend not only on the source language,
but also on the dialectal varieties of the source language. In addition, we show that
machine translation is impacted by the dialectal varieties, since the output of a state-of-the-
art transformer-based system preserves (or exacerbates, see Table 2) syntactic and topic-
independent information specific to these language varieties. With respect to pronouns, we show
that these are discriminating markers for dialectal varieties in both human- and machine-
translations (as a source of explicitation), being less effective on original English texts.
We provide a computational framework to understand the lexical choices made by
speakers from different groups and we release the pairs of extracted content words in the
supplementary material. The embeddings-based method offers promising insights into the
word choices and usages in different contexts and we are currently working on filtering
aligned pairs and adapting it to phrases.
Author Contributions: Investigation, S.N. and A.S.U.; Methodology, S.N. and A.S.U.; Supervision,
L.P.D. All authors have read and agreed to the published version of the manuscript.
Funding: This research was partially funded by two grants of the Ministry of Research, Innovation
and Digitization, Unitatea Executiva pentru Finantarea Invatamantului Superior, a Cercetarii, Dez-
voltarii si Inovarii-CNCS/CCCDI—UEFISCDI, CoToHiLi project, project number 108, within PNCDI
III, and CCCDI—UEFISCDI, INTEREST project, project number 411PED/2020, code PN-III-P2-2.1-
PED-2019-2271, within PNCDI III.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
Ethical Considerations: The data we release with this paper, including speaker information, is
publicly available in an electronic format on the European Parliament Website at https://www.
europarl.europa.eu/ (accessed on 30 November 2021).
References
1. Toury, G. Search of a Theory of Translation; The Porter Institute for Poetics and Semiotics; Tel Aviv University: Tel Aviv, Israel, 1980.
2. Gellerstam, M. Translationese in Swedish novels translated from English. In Translation Studies in Scandinavia; Wollin, L.,
Lindquist, H., Eds.; CWK Gleerup: Lund, Sweden, 1986; pp. 88–95.
3. Baroni, M.; Bernardini, S. A New Approach to the Study of Translationese: Machine-learning the Difference between Original
and Translated Text. Lit. Linguist. Comput. 2006, 21, 259–274. [CrossRef]
4. Baker, M. Corpus Linguistics and Translation Studies: Implications and Applications. In Text and Technology: In Honour of John
Sinclair; Baker, M., Francis, G., Tognini-Bonelli, E., Eds.; John Benjamins: Amsterdam, The Netherlands, 1993; pp. 233–252.
5. Toury, G. Descriptive Translation Studies and beyond; John Benjamins: Amsterdam, PA, USA, 1995.
6. Mauranen, A.; Kujamäki, P. (Eds.) Translation Universals: Do They Exist? John Benjamins: Amsterdam, The Netherlands, 2004.
7. Laviosa, S. Universals. In Routledge Encyclopedia of Translation Studies, 2nd ed.; Baker, M., Saldanha, G., Eds.; Routledge: New York,
NY, USA, 2008; pp. 288–292.
8. Xiao, R.; Dai, G. Lexical and grammatical properties of Translational Chinese: Translation universal hypotheses reevaluated from
the Chinese perspective. Corpus Linguist. Linguist. Theory 2014, 10, 11–55. [CrossRef]
9. Bernardini, S.; Ferraresi, A.; Miličević, M. From EPIC to EPTIC—Exploring simplification in interpreting and translation from an
intermodal perspective. Target. Int. J. Transl. Stud. 2016, 28, 61–86. [CrossRef]
10. Blum-Kulka, S.; Levenston, E.A. Universals of lexical simplification. In Strategies in Interlanguage Communication; Faerch, C.,
Kasper, G., Eds.; Longman: London, UK, 1983; pp. 119–139.
11. Vanderauwera, R. Dutch Novels Translated into English: The Transformation of a “Minority” Literature; Rodopi: Amsterdam,
The Netherlands, 1985.
12. Blum-Kulka, S. Shifts of Cohesion and Coherence in Translation. In Interlingual and Intercultural Communication Discourse and
Cognition in Translation and Second Language Acquisition Studies; House, J., Blum-Kulka, S., Eds.; Gunter Narr Verlag: Tübingen,
Germany, 1986; Volume 35, pp. 17–35.
13. Øverås, L. In Search of the Third Code: An Investigation of Norms in Literary Translation. Meta 1998, 43, 557–570. [CrossRef]
14. Toury, G. Interlanguage and its Manifestations in Translation. Meta 1979, 24, 223–231. [CrossRef]
15. Koppel, M.; Ordan, N. Translationese and Its Dialects. In Proceedings of the 49th Annual Meeting of the Association
for Computational Linguistics: Human Language Technologies, Stroudsburg, PA, USA, 19–24 June 2011; Association for
Computational Linguistics: Portland, OR, USA, 2011; pp. 1318–1326.
16. Rabinovich, E.; Wintner, S. Unsupervised Identification of Translationese. Trans. Assoc. Comput. Linguist. 2015, 3, 419–432.
[CrossRef]
17. Rabinovich, E.; Ordan, N.; Wintner, S. Found in Translation: Reconstructing Phylogenetic Language Trees from Translations.
arXiv 2017, arXiv:1704.07146.
18. Chowdhury, K.D.; España-Bonet, C.; van Genabith, J. Understanding Translationese in Multi-view Embedding Spaces. In Proceed-
ings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 6056–6062.
19. Kurokawa, D.; Goutte, C.; Isabelle, P. Automatic Detection of Translated Text and its Impact on Machine Translation. In
Proceedings of the MT-Summit XII, Ottawa, ON, Canada, 26–30 August 2009; pp. 81–88.
20. Lembersky, G.; Ordan, N.; Wintner, S. Improving Statistical Machine Translation by Adapting Translation Models to Translationese.
Comput. Linguist. 2013, 39, 999–1023. [CrossRef]
21. Zhang, M.; Toral, A. The Effect of Translationese in Machine Translation Test Sets. arXiv 2019, arXiv:1906.08069.
22. Ondrej, B.; Chatterjee, R.; Christian, F.; Yvette, G.; Barry, H.; Matthias, H.; Philipp, K.; Qun, L.; Varvara, L.; Christof, M.; et
al. Findings of the 2017 conference on machine translation (wmt17). In Proceedings of the Second Conference on Machine
Translation, Copenhagen, Denmark, 7–8 September 2017; The Association for Computational Linguistics: Stroudsburg, PA, USA,
2017; pp. 169–214.
23. Graham, Y.; Haddow, B.; Koehn, P. Statistical Power and Translationese in Machine Translation Evaluation. In Proceedings of the
2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 72–81.
24. Koehn, P. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the Tenth Machine Translation
Summit, AAMT, Phuket, Thailand, 13–15 September 2005; pp. 79–86.
25. Ott, M.; Edunov, S.; Grangier, D.; Auli, M. Scaling Neural Machine Translation. In Proceedings of the Third Conference on
Machine Translation: Research Papers, Belgium, Brussels, 31 October–1 November 2018; pp. 1–9.
26. Schlechtweg, D.; McGillivray, B.; Hengchen, S.; Dubossarsky, H.; Tahmasebi, N. SemEval-2020 task 1: Unsupervised lexical
semantic change detection. arXiv 2020, arXiv:2007.11464.
27. Zou, W.Y.; Socher, R.; Cer, D.; Manning, C.D. Bilingual word embeddings for phrase-based machine translation. In Proceedings of
the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1393–1398.
28. Pym, A.; Grin, F.; Sfreddo, C.; Chan, A.L. The Status of the Translation Profession in the European Union; Anthem Press: London, UK, 2013.
29. Rabinovich, E.; Wintner, S.; Lewinsohn, O.L. A Parallel Corpus of Translationese. In Proceedings of the International Conference
on Intelligent Text Processing and Computational Linguistics, Konya, Turkey, 3–9 April 2016.
30. Nisioi, S.; Rabinovich, E.; Dinu, L.P.; Wintner, S. A Corpus of Native, Non-native and Translated Texts. In Proceedings of the
Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, 23–28 May 2016.
31. Olohan, M.; Baker, M. Reporting That in Translated English: Evidence for Subconscious Processes of Explicitation? Across Lang.
Cult. 2000, 1, 141–158. [CrossRef]
32. Zufferey, S.; Cartoni, B. A multifactorial analysis of explicitation in translation. Target 2014, 26, 361–384. [CrossRef]
33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need.
Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
34. Bojar, O.; Buck, C.; Federmann, C.; Haddow, B.; Koehn, P.; Leveling, J.; Monz, C.; Pecina, P.; Post, M.; Saint-Amand, H.; et
al. Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of the Ninth Workshop on Statistical
Machine Translation, Baltimore, MD, USA, 26–27 June 2014; Association for Computational Linguistics: Baltimore, MA, USA,
2014; pp. 12–58.
35. Ilisei, I.; Inkpen, D.; Pastor, G.C.; Mitkov, R. Identification of Translationese: A Machine Learning Approach. In Proceedings of
the CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, Iaşi, Romania,
21–27 March 2010; Gelbukh, A.F., Ed.; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6008, pp. 503–511.
36. Rabinovich, E.; Nisioi, S.; Ordan, N.; Wintner, S. On the Similarities Between Native, Non-native and Translated Texts. arXiv
2016, arXiv:1609.03204.
37. Dumais, S. Improving the retrieval of information from external sources. Behav. Res. Methods Instruments Comput. 1991, 23,
229–236. [CrossRef]
38. Jarvis, S.; Bestgen, Y.; Pepper, S. Maximizing Classification Accuracy in Native Language Identification. In Proceedings of the
Eighth Workshop on Innovative Use of NLP for Building Educational Applications, Atlanta, GA, USA, 13 June 2013; Association
for Computational Linguistics: Atlanta, Georgia, 2013; pp. 111–118.
39. Fan, R.E.; Chang, K.W.; Hsieh, C.J.; Wang, X.R.; Lin, C.J. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn.
Res. 2008, 9, 1871–1874.
40. Malmasi, S.; Evanini, K.; Cahill, A.; Tetreault, J.R.; Pugh, R.A.; Hamill, C.; Napolitano, D.; Qian, Y. A Report on the 2017 Native
Language Identification Shared Task. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational
Applications, Copenhagen, Denmark, 8 September 2017; pp. 62–75.
41. Zampieri, M.; Malmasi, S.; Scherrer, Y.; Samardžić, T.; Tyers, F.; Silfverberg, M.; Klyueva, N.; Pan, T.L.; Huang, C.R.;
Ionescu, R.T.; et al. A Report on the Third VarDial Evaluation Campaign. In Proceedings of the Sixth Workshop on NLP for
Similar Languages, Varieties and Dialects, Minneapolis, MN, USA, 7 June 2019; Association for Computational Linguistics: Ann
Arbor, MI, USA, 2019; pp. 1–16.
42. Koppel, M.; Akiva, N.; Dagan, I. Feature instability as a criterion for selecting potential style markers. J. Am. Soc. Inf. Sci. Technol.
2006, 57, 1519–1525. [CrossRef]
43. Dediu, D.; Cysouw, M. Some structural aspects of language are more stable than others: A comparison of seven methods. PLoS
ONE 2013, 8, e55009.
44. Mosteller, F.; Wallace, D.L. Inference in an authorship problem: A comparative study of discrimination methods applied to the
authorship of the disputed Federalist Papers. J. Am. Stat. Assoc. 1963, 58, 275–309.
45. Nisioi, S. Feature Analysis for Native Language Identification. In Proceedings of the 16th International Conference on
Computational Linguistics and Intelligent Text Processing (CICLing 2015), Cairo, Egypt, 14–20 April 2015; Gelbukh, A.F., Ed.;
Springer: Berlin/Heidelberg, Germany, 2015.
46. Münte, T.F.; Wieringa, B.M.; Weyerts, H.; Szentkuti, A.; Matzke, M.; Johannes, S. Differences in brain potentials to open and
closed class words: Class and frequency effects. Neuropsychologia 2001, 39, 91–102. [CrossRef]
47. Olohan, M. Leave it out! Using a Comparable Corpus to Investigate Aspects of Explicitation in Translation. Cadernos de Tradução
2002, 1, 153–169.
48. Zhang, X.; Kruger, H.K.; Fang, J. Explicitation in children’s literature translated from English to Chinese: A corpus-based study of
personal pronouns. Perspectives 2020, 28, 717–736. [CrossRef]
49. Volansky, V.; Ordan, N.; Wintner, S. On the Features of Translationese. Digit. Scholarsh. Humanit. 2015, 30, 98–118. [CrossRef]
50. Honnibal, M.; Montani, I.; Van Landeghem, S.; Boyd, A. spaCy: Industrial-Strength Natural Language Processing in Python. 2020.
Available online: https://spacy.io/ (accessed on 30 November 2021).
51. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly
optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692.
52. Martin, L.; Muller, B.; Ortiz Suárez, P.J.; Dupont, Y.; Romary, L.; de la Clergerie, É.; Seddah, D.; Sagot, B. CamemBERT: A Tasty
French Language Model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online,
5–10 July 2020; pp. 7203–7219.
53. Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn.
2002, 46, 389–422. [CrossRef]
54. Hamilton, W.L.; Leskovec, J.; Jurafsky, D. Diachronic word embeddings reveal statistical laws of semantic change. arXiv 2016,
arXiv:1605.09096.
55. Szymanski, T. Temporal word analogies: Identifying lexical replacement with diachronic word embeddings. In Proceedings of
the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, BC, Canada,
30 July–4 August 2017; pp. 448–453.
56. Uban, A.; Ciobanu, A.M.; Dinu, L.P. Studying Laws of Semantic Divergence across Languages using Cognate Sets. In Proceedings
of the 1st International Workshop on Computational Approaches to Historical Language Change, Florence, Italy, 2 August 2019;
Association for Computational Linguistics: Florence, Italy, 2019; pp. 161–166.
mathematics
Article
Taylor-ChOA: Taylor-Chimp Optimized Random Multimodal
Deep Learning-Based Sentiment Classification Model for
Course Recommendation
Santosh Kumar Banbhrani 1, *, Bo Xu 1 , Hongfei Lin 1 and Dileep Kumar Sajnani 2
1 School of Computer Science and Technology, Dalian University of Technology, Ganjingzi District,
Dalian 116024, China; [email protected] (B.X.); hfl[email protected] (H.L.)
2 School of Computer Science and Engineering, Southeast University, Nanjing 210096, China;
[email protected]
* Correspondence: [email protected]
Abstract: Course recommendation is a key for achievement in a student’s academic path. However, it
is challenging to appropriately select course content among numerous online education resources, due
to the differences in users’ knowledge structures. Therefore, this paper develops a novel sentiment
classification approach for recommending the courses using Taylor-chimp Optimization Algorithm
enabled Random Multimodal Deep Learning (Taylor ChOA-based RMDL). Here, the proposed Taylor
ChOA is newly devised by the combination of the Taylor concept and Chimp Optimization Algorithm
(ChOA). Initially, course review is done to find the optimal course, and thereafter feature extraction
is performed for extracting the various significant features needed for further processing. Finally,
sentiment classification is done using RMDL, which is trained by the proposed optimization algorithm,
named TaylorChOA. Thus, the positively reviewed courses are obtained from the classified sentiments for
improving the course recommendation procedure. Extensive experiments are conducted using the
E-Khool dataset and Coursera course dataset. Empirical results demonstrate that the Taylor ChOA-based RMDL model significantly outperforms state-of-the-art methods for course recommendation tasks.
Keywords: chimp optimization algorithm; course recommendation; E-learning; long short-term memory; random multimodal deep learning; sentiment classification
MSC: 68T50
Citation: Banbhrani, S.K.; Xu, B.; Lin, H.; Sajnani, D.K. Taylor-ChOA: Taylor-Chimp Optimized Random Multimodal Deep Learning-Based Sentiment Classification Model for Course Recommendation. Mathematics 2022, 10, 1354. https://doi.org/10.3390/math10091354
2. Related Work
(a) Hierarchical Approach:
Chao Yang et al. [12] introduced a hierarchical attention network oriented towards
crowd intelligence (HANCI) for addressing rating prediction problems. This method
extracted more exact user choices and item latent features. Although valuable reviews and
significant words provided a positive degree of explanation for the recommendation, this
model failed to analyze the recommendation performance by explaining the method at
the feature level. Hansi Zeng and Qingyao Ai [14] developed a hierarchical self-attentive
convolution network (HSACN) for modeling reviews in recommendation systems. This
model attained superior performance by extracting efficient item and user representations
from reviews. However, this method suffers from computational complexity problems.
(b) Deep Learning Approach:
Qinglong Li and Jaekyeong Kim [10] introduced a novel deep learning-enabled course
recommender system (DÉCOR) for sustainable improvement in education. This method re-
duced the information overloading problems. In addition, it achieved superior performance
in feature information extraction. However, this method does not consider larger datasets to
train the domain recommendation systems. Aminu Da'u et al. [20] modeled a multi-channel
deep convolutional neural network (MCNN) for recommendation systems. The model
was more effective in using review text and hence achieved significant improvements.
However, this method suffers from data redundancy problems. Chao Wang et al. [21]
devised a demand-aware collaborative Bayesian variational network (DCBVN) for course
recommendation. This method offered accurate and explainable recommendations. This
model was more robust against sparse and cold start problems. However, this method had
higher time complexity.
(c) Query-based Approach:
Muhammad Sajid Rafiq et al. [22] introduced a query optimization method for course
recommendation. This model improved the categorization of action verbs to a more precise
level. However, the accuracy of online query optimization and course recommendation
was not improved using this technique.
(d) Other Approaches:
Yi Bai et al. [19] devised a joint summarization and pre-trained recommendation
(JSPTRec) for the recommendation based on reviews. This method learned improved
semantic representations of reviews for items and users. However, the accuracy of rate
prediction needed to be improved. Mohd Suffian Sulaiman et al. [23] designed a fuzzy logic
approach for recommending the optimal courses for learners. This method significantly
helped the students choose their course based on interest and skill. However, the sentiment
analysis of user reviews was not considered for effective performance.
3. Proposed Method
The overall architecture of TaylorChOA-based RMDL method for sentiment analysis-
based course recommendation, illustrated in Figure 1, contains several components. The de-
tail of each component is presented next.
Initially, the input review data are presented to the matrix construction phase to
construct the matrix based on learners’ preferences. Thereafter, the constructed matrix
is presented to the course grouping phase so that similar courses are grouped in one
group, whereas different courses are grouped in another group using DEC [24]. When the
query arrives, course matching is performed using the RV coefficient to identify the best
course groups from overall course groups. After finding the best course group, relevant
scholar retrieval and matching are performed between the user query and best course
group using the Bhattacharya coefficient to find the best course. Once course review is
performed, sentimental classification is carried out by extracting the significant features,
such as SentiWordNet-based statistical features, classification-specific features, and TF-IDF
features. Finally, sentiment classification is done using RMDL [25] that is trained by the
developed TaylorChOA, which is the integration of the Taylor concept [26] and ChOA [27].
Finally, the positively recommended reviews are provided to the users. Figure 1 portrays a
schematic representation of the sentiment analysis-based course recommendation model
using the proposed TaylorChOA-based RMDL.
Scholar list: Let the set of scholars be expressed as
$D_s = \{ S_i \}, \quad 1 < i \leq n$ (1)
where n represents the total number of scholars, and Si denotes ith scholar. Each scholar
learns a specific course. Let the course list be expressed as
$D_c = \{ C_j \}, \quad 1 < j \leq m$ (2)
where m represents the total number of courses. Course preference matrix: let the list of courses preferred by scholar i be expressed as
$U_i = \{ C_1^i, C_2^i, \ldots, C_l^i, \ldots, C_k^i \}$ (3)
where $C_l^i$ represents the lth course preferred by scholar i, $U_i$ indicates the courses preferred by scholar i, and the total number of preferred courses is specified as k.
Course preference binary matrix: Once the course preference matrix Ui is generated,
the course preference binary matrix BUi is performed based on the courses preferred, which
is denoted as 0 and 1. For each course, the corresponding binary values of every course
are given in the binary sequence. If a scholar preferred a course, then it is represented as 1,
otherwise it is represented as 0. The course preference binary matrix is expressed as
$B^{U_i} = \begin{cases} 1, & C_l^i \in C_j \\ 0, & \text{otherwise} \end{cases}$ (4)
where BUi represents the course preference binary matrix for the scholar i.
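As a concrete illustration of Equations (2)-(4), the short sketch below builds the binary course-preference matrix from per-scholar course lists; the scholar and course names are hypothetical and the dictionary layout is only one possible representation, not the paper's implementation.

```python
# Hypothetical scholars and courses; a minimal rendering of Equations (2)-(4).
course_list = ["C1", "C2", "C3", "C4"]          # D_c, the full course list
preferences = {                                  # U_i, courses preferred per scholar
    "S1": ["C1", "C3"],
    "S2": ["C2"],
}

def binary_preference_matrix(preferences, course_list):
    # One row per scholar, one column per course: 1 if the course is preferred, else 0.
    return {
        scholar: [1 if course in prefs else 0 for course in course_list]
        for scholar, prefs in preferences.items()
    }

print(binary_preference_matrix(preferences, course_list))
# {'S1': [1, 0, 1, 0], 'S2': [0, 1, 0, 0]}
```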
Course subscription matrix: The course subscription binary matrix UL j specifies the
scholar who searches for a particular course. Thus, the courses searched by scholar are
given as
$UL_j = \{ s_1^j, s_2^j, \ldots, s_p^j, \ldots, s_x^j \}$ (5)
where $s_p^j$ indicates the jth course searched by the pth scholar, and x denotes the total number of scholars.
Course subscription binary matrix: After generating the course subscription matrix
UL j , the course subscription binary matrix BUL j is constructed based on courses subscribed,
which is represented either as 0 or 1. For each course, the corresponding binary values for
the subscribed course are given in the binary sequence. If the scholar searched for a course,
it is denoted as 1, otherwise it is denoted as 0. The course subscription binary matrix is
given as
$B^{UL_j} = \begin{cases} 1, & S_p^j \in S_i \\ 0, & \text{otherwise} \end{cases}$ (6)
Course grouping using DEC: In DEC, each scholar's course preference vector is first embedded and then softly assigned to cluster centroids, where $f_\theta(y_i)$ denotes the embedding of sample $y_i$, the degree of freedom of the Student's t-distribution is represented as α, and $H_{ij}$ denotes the probability of assigning sample i to cluster j.
KL divergence optimization: KL divergence optimization is designed for refining
the clusters iteratively by understanding their assignments with higher confidence using
the auxiliary target function. It computes the divergence between the auxiliary target distribution $a_i$ and the soft assignment $b_i$:
$L = KL(P \| Q) = \sum_i \sum_j a_{ij} \log \frac{a_{ij}}{b_{ij}}$ (8)
Furthermore, the computation is done by initially raising to the second power and
thereafter normalizing the outcome by frequency per cluster.
$a_{ij} = \frac{b_{ij}^2 / f_j}{\sum_{j'} b_{ij'}^2 / f_{j'}}$ (9)
where $f_j = \sum_i b_{ij}$ represents the soft cluster frequency. Hence, the DEC algorithm
effectively improves low confidence prediction results.
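The refinement step of Equations (8) and (9) can be sketched as follows; this is an illustrative NumPy rendering (not the authors' code), with b holding the soft assignments and a the auxiliary target distribution.

```python
# Illustrative NumPy rendering of Equations (8)-(9); b are soft assignments.
import numpy as np

def target_distribution(b):
    # a_ij = (b_ij^2 / f_j) / sum_j' (b_ij'^2 / f_j'), with f_j = sum_i b_ij.
    f = b.sum(axis=0)
    weight = b ** 2 / f
    return weight / weight.sum(axis=1, keepdims=True)

def kl_divergence(a, b, eps=1e-12):
    # L = KL(P || Q) = sum_ij a_ij * log(a_ij / b_ij)
    return float(np.sum(a * np.log((a + eps) / (b + eps))))

b = np.array([[0.7, 0.2, 0.1],
              [0.4, 0.4, 0.2],
              [0.1, 0.1, 0.8]])
a = target_distribution(b)
print(a.round(3))
print("KL(a || b) =", round(kl_divergence(a, b), 4))
```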
The process of course grouping is done to group similar courses into their groups.
The course grouping is performed among the scholars and courses. Let the course group
obtained by deep embedded clustering be expressed as
G = { G1 , G2 , . . . , Gn } (10)
where n denotes the total number of groups. Thus, the output obtained by the course
grouping in finding and grouping the course is denoted as G.
Query sequence: Let the sequence of queries submitted by the scholars be expressed as
$Q_z = \{ q_1, q_2, \ldots, q_d, \ldots, q_r \}$ (11)
where qd specifies the total number of courses in query d and r represents the total number
of queries.
Binary query sequence: The sequence of queries is transformed to binary query se-
quence formulated as
$B^{Q_z} = \begin{cases} 1, & q_d \in C_j \\ 0, & \text{otherwise} \end{cases}$ (12)
where qd denotes the number of courses in query d and BQz represents the binary query sequence.
Course matching using RV coefficient: The course grouping is done using the RV
coefficient by considering the course grouped sequence G and binary query sequence BQz .
Moreover, the RV coefficient is defined as a multivariate generalization of the squared Pearson correlation coefficient, and it takes values within the range of 0 to 1. It measures the proximity of two sets of points characterized in a matrix form. The RV coefficient equation is given as follows:
$RV(B^{Q_z}, G) = \frac{\mathrm{Cov}(B^{Q_z}, G)}{\mathrm{Var}(B^{Q_z})\, \mathrm{Var}(G)}$ (13)
where RV indicates the RV coefficient between BQz , G , BQz denotes the binary sequence,
G specifies the grouped course, Cov represents the co-variance of ( BQz , G ), and Var specifies
the variance of ( BQz , G ).
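A small illustration of the course-matching score follows, using the scalar form written in Equation (13) rather than the general matrix RV coefficient; the binary query vector and the group vectors are toy values.

```python
# Toy course-matching sketch using the scalar form of Equation (13).
import numpy as np

def rv_score(query_vec, group_vec):
    q = np.asarray(query_vec, dtype=float)
    g = np.asarray(group_vec, dtype=float)
    cov = np.cov(q, g)[0, 1]                       # Cov(B^Qz, G)
    denom = np.var(q, ddof=1) * np.var(g, ddof=1)  # Var(B^Qz) * Var(G)
    return cov / denom if denom else 0.0

binary_query = [1, 0, 1, 0, 1]
course_groups = {"G1": [1, 0, 1, 0, 0], "G2": [0, 1, 0, 1, 0]}
best_group = max(course_groups, key=lambda g: rv_score(binary_query, course_groups[g]))
print(best_group)   # expected: G1
```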
Best course retrieval: the best courses retrieved by the scholar from the selected group are given as
$R^C = \{ r_{y_1}, r_{y_2}, \ldots, r_{y_i}, \ldots, r_{y_w} \}$ (14)
where w represents the total number of best courses, and $r_{y_i}$ denotes the best course retrieved by the scholar i.
Binary best course group: For each best course group, the corresponding binary values
for the retrieved best course are given in a binary sequence. If the best course is retrieved
by the scholar, it is indicated as 1, otherwise it is denoted as 0.
$B^{R_C} = \begin{cases} 1, & r_{y_i} \in C_j \\ 0, & \text{otherwise} \end{cases}$ (15)
Matching query and best course group using Bhattacharya coefficient: Once the
scholar retrieved the best course, the binary query sequence BQz and the best course
group B Rc are compared using the Bhattacharya coefficient. The Bhattacharyya distance
computes the similarity of two probability distributions, and the equation is expressed as
$BC(B^{Q_z}, B^{R_C}) = \sum_{x \in X} \sqrt{P(B^{Q_z}(x)) \cdot P(B^{R_C}(x))}$ (16)
where BC indicates the Bhattacharya coefficient. Once the query and best group binary
sequence are matched, the minimum value distance is chosen as the best course based on
the Bhattacharya coefficient. The output of matching result is scholar preferred courses,
and it is expressed as Cb , given as
Cb = {C1 , C2 , . . . , Ch } (17)
where Ch signifies courses preferred by a scholar that are the best courses. The best course
Cb undergoes a sentimental classification process to verify whether the recommended
course is good or bad. Algorithm 1 provides the Pseudo-code of course review framework.
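The matching of Equation (16) can be sketched as below, assuming the binary sequences are first normalized into probability distributions; the candidate names are hypothetical, and the tie-breaking by the smallest Bhattacharyya distance −ln(BC) follows the description above rather than any code released with the paper.

```python
# Illustrative matching of a binary query sequence against candidate courses
# via the Bhattacharyya coefficient of Equation (16).
import math

def to_distribution(binary_seq):
    total = sum(binary_seq) or 1
    return [v / total for v in binary_seq]

def bhattacharyya_coefficient(p, q):
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

query = [1, 0, 1, 1, 0]
candidates = {"C1": [1, 0, 1, 0, 0], "C2": [0, 1, 0, 0, 1]}

distances = {}
for name, seq in candidates.items():
    bc = bhattacharyya_coefficient(to_distribution(query), to_distribution(seq))
    distances[name] = float("inf") if bc == 0 else -math.log(bc)   # Bhattacharyya distance
print(min(distances, key=distances.get))   # expected: C1
```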
Feature extraction: From the reviews of the best course, the following features are extracted.
(a) SentiWordNet-based statistical features: Each word in a review is scored with SentiWordNet, where $\varphi_m(p)$ represents the positive score, $\varphi_m(n)$ denotes the negative score, and h specifies the SentiWordNet scoring function. The resulting SentiWordNet feature is denoted as $F_1$. With the SentiWordNet scores, statistical features, such as the mean and variance, are computed using the expressions given below.
(i) Mean: The mean value is computed by taking the average of SentiWordNet score
for every word from the review, given as
$\mu = \frac{1}{|U(x_n)|} \times \sum_{n=1}^{|U(x_n)|} U(x_n)$ (19)
where n represents the overall words, U ( xn ) signifies the SentiWordNet score of each
review, and |U ( xn )| represents the overall scores obtained from the word.
(ii) Variance: The variance σ is computed based on the value of the mean, given as
$\sigma = \frac{\sum_{n=1}^{|U(x_n)|} |x_n - \mu|}{|U(x_n)|}$ (20)
where μ signifies the mean value. Thus, the SentiWordNet-based feature considers the
positive and negative scores of each word in the review, and from that, the statistical
features, like mean and variance, are computed.
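A compact sketch of Equations (19) and (20) for a single review is given below, assuming the per-word SentiWordNet scores have already been looked up; the numeric scores are made up for illustration.

```python
# Hypothetical per-word SentiWordNet scores for one review; Equations (19)-(20).
def sentiment_stats(scores):
    n = len(scores)
    mean = sum(scores) / n                                 # Eq. (19)
    dispersion = sum(abs(s - mean) for s in scores) / n    # Eq. (20) as written
    return mean, dispersion

review_scores = [0.25, 0.0, 0.5, -0.125, 0.375]
print(sentiment_stats(review_scores))
```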
(b) Classification-specific features: The various classification specific features, such
as capitalized words, numerical words, punctuation marks, and elongated words are
explained below.
(i) All caps: The feature f 1 specifies the all-caps feature, which represents the overall
capitalized words in a review, expressed as
$f_1 = \sum_{m=1}^{b} w_m^C$ (21)
where wCm indicates the total number of words with upper case letters. It considers a value
0 or 1 concerning the state that relies on the absence or presence of capitalized words as
formulated below:
$w_m^C = \begin{cases} 1, & \text{if the word is capitalized} \\ 0, & \text{otherwise} \end{cases}$ (22)
Here, the feature f 1 is in the dimension of [10,000 × 1].
(ii) Number of numerical words: The number of text characters or numerical digits
used to show numerals are represented as f 2 with the dimension [10,000 × 1].
(iii) Punctuation: The punctuation feature f 3 may be an apostrophe, dot, or exclamation
mark present in a review:
$f_3 = \sum_{m=1}^{b} S_p^m$ (23)
where Sm p represents the overall punctuation present in the mth review. Here, S p is given a
value of 1 for the punctuation that occurred in the review and 0 for other cases. Moreover,
the feature f 3 has the dimension of [10,000 × 1].
(iv) Elongated words: The feature f 4 represents the elongated words that have a
character repeated more than two times in a review and is given as
$f_4 = \sum_{m=1}^{b} w_m^E$ (24)
where $w_m^E$ specifies the number of elongated words present in the mth review. The term is given a value of 1 for every elongated word in the review and 0 in the absence of an elongated word. Furthermore, the elongated word feature f 4 holds the size of [10,000 × 1].
The classification-specific features are signified as F2 by combining the four extracted features, given as
F2 = { f 1 , f 2 , f 3 , f 4 } (25)
where f 1 denotes the all-caps feature, f 2 signifies the numerical word feature, f 3 specifies
the punctuation feature, and f 4 indicates the elongated word feature.
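An illustrative implementation of the four classification-specific features f1–f4 for one review follows; the whitespace tokenization and the regular expression used to detect elongated words are simplifications, not the authors' exact rules.

```python
# Simplified extraction of the classification-specific features f1-f4 for one review.
import re
import string

def classification_features(review: str):
    tokens = [t.strip(string.punctuation) for t in review.split()]
    f1 = sum(1 for t in tokens if t.isalpha() and t.isupper())    # all-caps words
    f2 = sum(1 for t in tokens if t.isdigit())                    # numerical words
    f3 = sum(1 for ch in review if ch in string.punctuation)      # punctuation marks
    f4 = sum(1 for t in tokens if re.search(r"(.)\1{2,}", t))     # elongated words
    return [f1, f2, f3, f4]

print(classification_features("This course is GREAT!!! Sooo good, 10 out of 10."))
# [1, 2, 5, 1]
```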
(c) TF-IDF: TF-IDF [29] is used to create a composite weight for every term in each of
the review data. TF measures how frequently a term occurs in review data, whereas IDF
measures how significant a term is. The TF-IDF score is computed as
$F_3 = C \, \frac{\log(1 + \varphi_1)}{\log(\varphi_2)}$ (26)
where C specifies the total number of review data, term frequency is denoted as φ1 , φ2 rep-
resents the inverse document frequency, and F3 implies the TF-IDF feature with dimension
[1 × 50].
Furthermore, the features extracted are incorporated together to form a feature vector
F for reducing the complexity in classifying the sentiments, which is expressed as
F = { F1 , F2 , F3 } (27)
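As a rough sketch of assembling this final feature vector, the snippet below uses scikit-learn's TfidfVectorizer (whose weighting differs in detail from Equation (26)) and stands in hypothetical arrays for the F1 and F2 components.

```python
# Illustrative assembly of the final feature vector F = {F1, F2, F3}.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["great course, very clear lectures",
           "poor pacing and confusing assignments"]

vectorizer = TfidfVectorizer(max_features=50)     # F3 capped at dimension [1 x 50]
f3 = vectorizer.fit_transform(reviews).toarray()

f1 = np.array([[0.35, 0.10], [0.05, 0.22]])       # hypothetical SentiWordNet mean/variance
f2 = np.array([[0, 0, 1, 0], [0, 1, 2, 0]])       # hypothetical f1-f4 counts

features = np.hstack([f1, f2, f3])                # one feature vector per review
print(features.shape)
```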
Figure 2. An illustration of random multimodel deep learning for sentiment analysis-based course rec-
ommendation.
(i) DNN: DNN architecture is designed with multi-classes where every learning model
is generated at random. Here, the overall layer and its nodes are randomly assigned.
Moreover, this model utilizes a standard back-propagation algorithm using activation
functions. The output layer has a softmax function to perform the classification and is
given as
$f(x) = \frac{1}{1 + e^{-x}} \in (0, 1)$ (28)
$f(x) = \max(0, x)$ (29)
The output of DNN is denoted as Do .
(ii) RNN: RNN assigns additional weights to the sequence of data points. The infor-
mation about the preceding nodes is considered in a very sophisticated manner to perform
the effective semantic assessment of the dataset structure.
$x_y = F_f(x_{y-1}, h_y, \theta)$ (30)
$x_y = U_{rec}\, \kappa(x_{y-1}) + U_{in} h_y + A$ (31)
Here, xy signifies the state at the time y, and hy denotes the input at phase y. In addition,
the recurrent matrix weight and input weight are represented as Urec and Uin , the bias is
represented as A, and κ indicates the element-wise operator.
Long short-term memory (LSTM): LSTM is a class of RNN that is used to maintain
long-term relevancy in an improved manner. This LSTM network effectively addresses the
vanishing gradient issue. LSTM consists of a chain-like structure and utilizes multiple gates
for handling huge amounts of data. The step-by-step procedure of LSTM cell is expressed
as follows:
$F_d = R(w_F [p_d, q_{d-1}] + H_F)$ (32)
$\tilde{C}_d = \tanh(w_C [p_d, q_{d-1}] + H_C)$ (33)
$r_d = R(w_r [p_d, q_{d-1}] + H_r)$ (34)
$J_d = F_d \ast \tilde{C}_d + r_d J_{d-1}$ (35)
$M_d = R(w_M [p_d, q_{d-1}] + H_M)$ (36)
$q_d = M_d \tanh(J_d)$ (37)
where Fd represents the input gate, Cd specifies the candidate memory cell, rd denotes the
forget gate activation, and Jd defines the new memory cell value. Here, Md and qd specify
the output gate value.
Gated recurrent unit (GRU): GRU is a gating strategy for RNN that consists of two
gates. Here, GRU does not have internal memory, and the step by step procedure for GRU
cells is given as
$N_d = R_l(w_N p_d + V_N q_{d-1} + H_z)$ (38)
where $N_d$ implies the update gate vector at step d, $p_d$ denotes the input vector, the various parameters are termed as w, V, and H, and $R_l$ represents the activation function.
Here, z denotes the overall random models, and $t_{d,z}$ specifies the output of model z for a data point d; these outputs are utilized for classifying the sentiments, k ∈ {0, 1}. The output space uses a majority vote for the final $\hat{t}_d$, and the equation is expressed as
$\hat{t}_d = \left[ \hat{t}_d^1, \ldots, \hat{t}_d^e, \ldots, \hat{t}_d^N \right]$ (42)
where t̂d specifies the classification label of review or data point of Ed ∈ { ad , bd } for e, and
t̂d is represented as follows:
$\hat{t}_{d,z} = \arg\max_k \, \mathrm{softmax}(t_{d,z}^{*})$ (43)
After training the RMDL model, the final classification is computed using a majority
vote of DNN, CNN, and RNN models, which improve the accuracy and robustness of the
results. The final result obtained from the RMDL is indicated as Cτ .
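The ensemble step can be sketched as a plain majority vote over the per-model predictions, in the spirit of Equations (42) and (43); the model outputs below are illustrative only.

```python
# Illustrative majority vote over the random models, in the spirit of Eqs. (42)-(43).
from collections import Counter

def majority_vote(per_model_predictions):
    # per_model_predictions: one list of predicted labels per random model.
    n_samples = len(per_model_predictions[0])
    final = []
    for i in range(n_samples):
        votes = [preds[i] for preds in per_model_predictions]
        final.append(Counter(votes).most_common(1)[0][0])
    return final

dnn_preds = [1, 0, 1, 1]
cnn_preds = [1, 1, 1, 0]
rnn_preds = [0, 0, 1, 1]
print(majority_vote([dnn_preds, cnn_preds, rnn_preds]))   # [1, 0, 1, 1]
```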
(b) Training of RMDL using the proposed TaylorChOA: The training procedure of
RMDL [25] is performed using the developed optimization method, known as TaylorChOA.
The developed TaylorChOA is designed by the incorporation of the Taylor concept and
ChOA. ChOA [27] is motivated by the characteristics of chimps for hunting prey. It is
mainly accomplished for solving the problems based on convergence speed by learning
through the high dimensional neural network. In addition, the independent groups have
different mechanisms for updating the parameters to explore the chimp with diverse
competence in search space. The dynamic strategies effectively balance the global and
local search problems. The Taylor concept [26] exploits the preliminary dataset and the
standard form of the system for validating the Taylor series expansion in terms of a specific
degree. The incorporation of the Taylor series with the ChOA shows the effectiveness
of the developed scheme and minimized the computational complexity. The algorithmic
procedure of the proposed TaylorChOA algorithm is illustrated below.
(i) Initialization: Let us consider the chimp population as Zi (i = 1, 2, . . . , m) in the
solution space N, and the parameters are initialized as n, u, v, and r. Here, n specifies the
non-linear factor, u implies the chaotic vector v, and r denotes the vectors.
(ii) Calculate fitness measure: The fitness measure is accomplished for calculating the
optimal solution using the error function and is expressed as
$\xi = \frac{1}{\delta} \sum_{\tau=1}^{\delta} [E_\tau - C_\tau]^2$ (44)
where ξ signifies the fitness measure, $E_\tau$ specifies the target output, δ indicates the overall training samples, and the output of the RMDL model is denoted as $C_\tau$.
(iii) Driving and chasing the prey: The prey is chased during the exploitation and
exploration phases. The mathematical expression used for driving and chasing the prey is
expressed as
Z (s + 1) = Zprey (s) − x · y (45)
where s represents the current iteration, x signifies the coefficient vector, Zprey implies the
vector of prey position, y indicates driving the prey, and the position vector of chimp is
specified as Z. Here, y is expressed as
y = r.Zprey (s) − u.Z (s) (46)
$Z(s+1) = Z(s) + \frac{Z'(s)}{1!} + \frac{Z''(s)}{2!}$ (50)
where
$Z'(s) = \frac{Z(s) - Z(s-k)}{k}$ (51)
$Z''(s) = \frac{Z(s) - 2Z(s-k) + Z(s-2k)}{k^2}$ (52)
$Z(s+1) = Z(s) + \frac{Z(s) - Z(s-1)}{1!} + \frac{Z(s) - 2Z(s-1) + Z(s-2)}{2!}$ (53)
$Z(s+1) = Z(s)\left(1 + 1 + \frac{1}{2}\right) - Z(s-1) - \frac{2Z(s-1)}{2} + \frac{Z(s-2)}{2}$ (54)
By substituting Equation (56) in Equation (50), the equation becomes
$Z(s+1) = \frac{Z_1 + Z_2 + Z_3 + Z_4}{4}$ (56)
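A minimal sketch of the Taylor-series position update of Equations (50)–(54) with step k = 1 is shown below; how this update is blended with the attacker, chaser, barrier and driver positions of Equation (56) is not shown, and the sample positions are arbitrary.

```python
# Sketch of the Taylor-series position update (Equations (50)-(54), step k = 1).
import numpy as np

def taylor_update(z_s, z_s1, z_s2):
    # z_s, z_s1, z_s2: positions at iterations s, s-1 and s-2.
    first = z_s - z_s1                        # finite-difference first derivative
    second = z_s - 2 * z_s1 + z_s2            # finite-difference second derivative
    return z_s + first / 1.0 + second / 2.0   # Eq. (53)

z_prev2 = np.array([0.0, 1.0])
z_prev1 = np.array([0.5, 1.2])
z_curr = np.array([0.9, 1.3])
print(taylor_update(z_curr, z_prev1, z_prev2))   # [1.25 1.35]
```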
(v) Prey attacking: In this prey attacking phase, the chimps attack the prey and end the
hunting operation once the prey starts moving. To mathematically formulate the attacking
behavior, the value must be decreased.
(vi) Searching for prey (exploration phase): The exploration process is performed
based on the position of the attacker, chaser, barrier, and driver chimps. Moreover, chimps
deviate to search for the prey and aggregate to chase the prey.
(vii) Social incentive: To acquire social meeting and related social motivation in the
final phase, the chimps release their hunting potential. To model this process, there is a
50% chance to prefer between the normal position update strategy and chaotic model for
updating the position of chimps during the optimization.
Precision: Precision is a measure that defines the proportion of true positives to the sum of true positives and false positives, given as
$\delta = \frac{A}{A + B}$ (58)
where δ specifies the precision, A denotes the true positives, and B signifies the false positives.
Recall: Recall is a measure that defines the proportion of true positives to the summing
up of false negatives and true positives, and the equation is given as
$\omega = \frac{A}{A + E}$ (59)
where the recall measure is signified as ω, and E symbolizes the false negatives.
F1-score: This is a statistical measure of the accuracy of a test or an individual based
on the recall and precision, which is given as
$F_m = 2 \times \frac{\delta \cdot \omega}{\delta + \omega}$ (60)
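For reference, Equations (58)–(60) correspond to the standard definitions sketched below; the counts used in the example are illustrative.

```python
# Standard definitions matching Equations (58)-(60); counts are illustrative.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(p, r):
    return 2 * p * r / (p + r)

A, B, E = 85, 10, 15        # true positives, false positives, false negatives
p, r = precision(A, B), recall(A, E)
print(round(p, 3), round(r, 3), round(f1_score(p, r), 3))   # 0.895 0.85 0.872
```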
5.1. Results Based on E-Khool Dataset, with Respect to Number of Iterations (10 to 50)
5.1.1. Performance Analysis Based on Cluster Size = 3
Figure 3 presents the performance analysis of the developed technique with iterations
by varying the queries with cluster size = 3. Figure 3a presents the assessment based on
precision. For the number of query 1, the precision value measured by the developed
TaylorChOA-based RMDL with iteration 10 is 0.795, iteration 20 is 0.825, iteration 30
is 0.836, iteration 40 is 0.847, and iteration 50 is 0.854. Figure 3b portrays the analysis
using recall.
By considering the number of query 2, the value of recall computed by the developed
TaylorChOA-based RMDL with iteration 10 is 0.825, iteration 20 is 0.847, iteration 30
is 0.874, iteration 40 is 0.885, and iteration 50 is 0.895. The analysis using F1-score is
depicted in Figure 3c. When the number of query is 3, the value of F1-score computed by
the developed TaylorChOA-based RMDL with iteration 10 is 0.830, iteration 20 is 0.854,
iteration 30 is 0.886, iteration 40 is 0.900, and iteration 50 is 0.919.
Figure 3. Performance analysis with cluster size = 3 using E-Khool dataset: (a) precision, (b) recall,
and (c) F1-score.
Figure 4. Performance analysis with cluster size = 4 using E-Khool dataset: (a) precision, (b) recall,
and (c) F1-score.
Figure 5. Comparative analysis with cluster size = 3 using E-Khool dataset: (a) precision, (b) recall,
and (c) F1-score.
Figure 6. Comparative analysis with cluster size = 4 using E-Khool dataset: (a) precision, (b) recall,
and (c) F1-score.
Table 1. Comparison of proposed TaylorChOA-based RMDL with existing methods using E-Khool
dataset, in terms of precision, recall, and F1-score.
Using cluster size = 4, the maximum precision of 0.936 is computed by the devel-
oped TaylorChOA-based RMDL, whereas the precision value computed by the existing
methods, such as HSACN, MCNN, Query Optimization, and DCBVN is 0.674, 0.798, 0.825,
and 0.905, respectively. Likewise, the higher recall of 0.941 is computed by the developed
TaylorChOA-based RMDL, whereas the recall value computed by the existing methods,
such as HSACN, MCNN, Query Optimization, and DCBVN is 0.695, 0.814, 0.854, and 0.925,
respectively. Moreover, the F1-score value obtained by the HSACN is 0.685, MCNN is 0.806,
Query Optimization is 0.839, DCBVN is 0.915, and TaylorChOA-based RMDL is 0.938.
Thus, the developed TaylorChOA-based RMDL outperformed various existing methods
and achieved better performance with the maximum precision of 0.936, maximum recall of
0.944, and maximum F1-score of 0.938.
5.2. Results Based on Coursera Course Dataset with Respect to the Number of Iterations (10 to 50)
5.2.1. Performance Analysis Based on Cluster Size = 3
Figure 7 presents the performance analysis of the developed technique with iterations
by varying the queries with cluster size = 3. Figure 7a presents the assessment based
on precision.
Figure 7. Performance analysis with cluster size = 3 using Coursera Course Dataset: (a) precision,
(b) recall, and (c) F1-score.
For the number of query 1, the precision value measured by the developed TaylorChOA-
based RMDL with iteration 10 is 0.795, iteration 20 is 0.825, iteration 30 is 0.836, iteration 40
is 0.847, and iteration 50 is 0.854. Figure 7b portrays the analysis using recall. By considering
the number of query 2, the value of recall computed by the developed TaylorChOA-based
RMDL with iteration 10 is 0.825, iteration 20 is 0.847, iteration 30 is 0.874, iteration 40 is
0.885, and iteration 50 is 0.895. The analysis using F1-score is depicted in Figure 7c. When
the number of query is 3, the value of F1-score computed by the developed TaylorChOA-
based RMDL with iteration 10 is 0.863, iteration 20 is 0.871, iteration 30 is 0.886, iteration 40
is 0.890, and iteration 50 is 0.907.
Figure 8. Performance analysis with cluster size = 4 using Coursera Course Dataset: (a) precision,
(b) recall, and (c) F1-score.
5.2.3. Analysis Based on Cluster Size = 3 in Terms of Precision, Recall, and F1-Score
Figure 9 portrays the assessment with cluster size = 3 by varying the number of queries
using the performance measures, such as precision, recall, and F1-score.
Figure 9a presents the analysis in terms of precision. When number of query is 1,
the precision value measured by the developed TaylorChOA-based RMDL is 0.839, whereas
the precision value measured by the existing methods, such as HSACN, MCNN, Query
Optimization, and DCBVN is 0.556, 0.669, 0.716, and 0.816, respectively. The analysis
based on recall measure is portrayed in Figure 9b. By considering the number of query
as 2, the developed TaylorChOA-based RMDL measured a recall value of 0.878, whereas
the value of recall computed by the existing methods, such as HSACN, MCNN, Query
Optimization, and DCBVN is 0.606, 0.743, 0.769, and 0.849, respectively. The assessment
using F1-score is shown in Figure 9c. The F1-score value attained by the HSACN, MCNN,
Query Optimization, DCBVN, and developed TaylorChOA-based RMDL is 0.615, 0.756,
0.800, 0.870, and 0.907, respectively, when considering the number of query as 3.
Figure 9. Comparative analysis with cluster size = 3 using Coursera Course Dataset: (a) precision,
(b) recall, and (c) F1-score.
5.2.4. Analysis Based on Cluster Size = 4 in Terms of Precision, Recall, and F1-Score
The analysis with cluster size = 4 using the evaluation metrics, by varying the number
of queries is portrayed in Figure 10.
The analysis using precision is shown in Figure 10a. When considering the number
of query as 1, the developed TaylorChOA-based RMDL computed a precision value of
0.836, whereas the precision value achieved by the existing methods, such as HSACN,
MCNN, Query Optimization, and DCBVN is 0.584, 0.668, 0.725, and 0.812, respectively.
Figure 10b presents the assessment using recall. The recall values obtained by the HSACN,
MCNN, Query Optimization, DCBVN, and developed TaylorChOA-based RMDL are 0.629,
0.765, 0.798, 0.874, and 0.899, respectively, for the number of query 2. The analysis in
terms of F1-score is presented in Figure 10c. When the number of query is 3, the F1-score
value of HSACN is 0.643, MCNN is 0.781, Query Optimization is 0.824, DCBVN is 0.899,
and developed TaylorChOA-based RMDL is 0.917.
Table 2 explains the comparative discussion of the developed Taylor ChOA-based
RMDL technique in comparison with the existing techniques using the Coursera Course
dataset for the number of query 4. With cluster size = 3, the maximum precision of 0.908,
maximum recall of 0.928, and maximum F1-score of 0.919 are computed by the developed
Taylor ChOA-based RMDL method. Using cluster size = 4, the maximum precision of
0.919 is computed by the developed Taylor ChOA-based RMDL, whereas the precision
value computed by the existing methods, such as HSACN, MCNN, Query Optimization,
and DCBVN is 0.667, 0.776, 0.813, and 0.899, respectively. Likewise, the higher recall of 0.926
is computed by the developed Taylor ChOA-based RMDL, and the F1-score value is 0.925.
From this table, it is clear that the developed Taylor ChOA-based RMDL outperformed
various existing methods.
Table 3 shows the computational time of proposed and existing methods for query = 1.
The proposed system has the minimum computational time of 127.25 s and 133.84 s for
E-Khool dataset, and Coursera Course dataset, respectively.
Figure 10. Comparative analysis with cluster size = 4 using Coursera Course Dataset: (a) precision,
(b) recall, and (c) F1-score.
Table 2. Comparison of proposed TaylorChOA-based RMDL with existing methods using Coursera
Course dataset, in terms of precision, recall, and F1-score.
Author Contributions: S.K.B. designed and wrote the paper; H.L. supervised the work; S.K.B.
performed the experiments with advice from B.X.; and D.K.S. organized and proofread the paper. All
authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data used in the experiments are publicly available. Details have
been given in Section 4.1.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
References
1. Wen-Shung Tai, D.; Wu, H.-J.; Li, P.-H. Effective e-learning recommendation system based on self-organizing maps and association
mining. Electron. Libr. 2008, 26, 329–344. [CrossRef]
2. Persky, A.M.; Joyner, P.U.; Cox, W.C. Development of a course review process. Am. J. Pharm. Educ. 2012, 76, 130. [CrossRef]
3. Guanchen, W.; Kim, M.; Jung, H. Personal customized recommendation system reflecting purchase criteria and product reviews
sentiment analysis. Int. J. Electr. Comput. Eng. 2021, 11, 2399–2406. [CrossRef]
4. Gunawan, A.; Cheong, M.L.F.; Poh, J. An Essential Applied Statistical Analysis Course using RStudio with Project-Based
Learning for Data Science. In Proceedings of the 2018 IEEE International Conference on Teaching, Assessment, and Learning for
Engineering (TALE), Wollongong, Australia, 4–7 December 2018; pp. 581–588.
5. Assami, S.; Daoudi, N.; Ajhoun, R. A Semantic Recommendation System for Learning Personalization in Massive Open Online
Courses. Int. J. Recent Contrib. Eng. Sci. IT 2020, 8, 71–80. [CrossRef]
6. Hua, Z.; Wang, Y.; Xu, X.; Zhang, B.; Liang, L. Predicting corporate financial distress based on integration of support vector
machine and logistic regression. Expert Syst. Appl. 2007, 33, 434–440. [CrossRef]
7. Aher, S.B.; Lobo, L. Best combination of machine learning algorithms for course recommendation system in e-learning. Int. J.
Comput. Appl. 2012, 41. [CrossRef]
8. Tarus, J.K.; Niu, Z.; Mustafa, G. Knowledge-based recommendation: A review of ontology-based recommender systems for
e-learning. Artif. Intell. Rev. 2018, 50, 21–48. [CrossRef]
9. Zhang, H.; Huang, T.; Lv, Z.; Liu, S.; Zhou, Z. MCRS: A course recommendation system for MOOCs. Multimed. Tools Appl. 2018,
77, 7051–7069. [CrossRef]
10. Li, Q.; Kim, J. A Deep Learning-Based Course Recommender System for Sustainable Development in Education. Appl. Sci. 2021,
11, 8993. [CrossRef]
11. Almahairi, A.; Kastner, K.; Cho, K.; Courville, A. Learning distributed representations from reviews for collaborative filtering. In
Proceedings of the 9th ACM Conference on Recommender Systems, Vienna, Austria, 16–20 September 2015; pp. 147–154.
12. Yang, C.; Zhou, W.; Wang, Z.; Jiang, B.; Li, D.; Shen, H. Accurate and Explainable Recommendation via Hierarchical Attention
Network Oriented Towards Crowd Intelligence. Knowl.-Based Syst. 2021, 213, 106687. [CrossRef]
13. Zheng, L.; Noroozi, V.; Yu, P.S. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the
Tenth ACM International Conference on Web Search and Data Mining, Cambridge, UK, 6–10 February 2017; pp. 425–434.
14. Zeng, H.; Ai, Q. A Hierarchical Self-attentive Convolution Network for Review Modeling in Recommendation Systems. arXiv
2020, arXiv:2011.13436.
15. Dong, X.; Ni, J.; Cheng, W.; Chen, Z.; Zong, B.; Song, D.; Liu, Y.; Chen, H.; De Melo, G. Asymmetrical hierarchical networks
with attentive interactions for interpretable review-based recommendation. In Proceedings of the AAAI Conference on Artificial
Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7667–7674.
16. Wang, H.; Wu, F.; Liu, Z.; Xie, X. Fine-grained interest matching for neural news recommendation. In Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics, Seattle, WA, USA, 5–10 July 2020; pp. 836–845.
17. Bansal, T.; Belanger, D.; McCallum, A. Ask the gru: Multi-task learning for deep text recommendations. In Proceedings of the
10th ACM Conference on Recommender Systems, Boston, MA, USA, 15–19 September 2016; pp. 107–114.
18. Tay, Y.; Luu, A.T.; Hui, S.C. Multi-pointer co-attention networks for recommendation. In Proceedings of the 24th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 2309–2318.
19. Bai, Y.; Li, Y.; Wang, L. A Joint Summarization and Pre-Trained Model for Review-Based Recommendation. Information 2021,
12, 223. [CrossRef]
20. Da’u, A.; Salim, N.; Rabiu, I.; Osman, A. Recommendation system exploiting aspect-based opinion mining with deep learning
method. Inf. Sci. 2020, 512, 1279–1292.
21. Wang, C.; Zhu, H.; Zhu, C.; Zhang, X.; Chen, E.; Xiong, H. Personalized Employee Training Course Recommendation with Career
Development Awareness. In Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; pp. 1648–1659.
22. Rafiq, M.S.; Jianshe, X.; Arif, M.; Barra, P. Intelligent query optimization and course recommendation during online lectures in
E-learning system. J. Ambient. Intell. Humaniz. Comput. 2021, 12, 10375–10394. [CrossRef]
23. Sulaiman, M.S.; Tamizi, A.A.; Shamsudin, M.R.; Azmi, A. Course recommendation system using fuzzy logic approach. Indones. J.
Electr. Eng. Comput. Sci. 2020, 17, 365–371. [CrossRef]
24. Xie, J.; Girshick, R.; Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proceedings of the International
Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 478–487.
25. Kowsari, K.; Heidarysafa, M.; Brown, D.E.; Meimandi, K.J.; Barnes, L.E. Rmdl: Random multimodel deep learning for
classification. In Proceedings of the 2nd International Conference on Information System and Data Mining, Lakeland, FL, USA,
9–1 April 2018; pp. 19–28.
26. Mangai, S.A.; Sankar, B.R.; Alagarsamy, K. Taylor series prediction of time series data with error propagated by artificial neural
network. Int. J. Comput. Appl. 2014, 89, 41–47.
27. Khishe, M.; Mosavi, M.R. Chimp optimization algorithm. Expert Syst. Appl. 2020, 149, 113338. [CrossRef]
28. Ohana, B.; Tierney, B. Sentiment classification of reviews using SentiWordNet. In Proceedings of the IT&T, Dublin, Ireland, 22–23
October 2009.
29. Christian, H.; Agus, M.P.; Suhartono, D. Single document automatic text summarization using term frequency-inverse document
frequency (TF-IDF). ComTech Comput. Math. Eng. Appl. 2016, 7, 285–294. [CrossRef]
mathematics
Article
Automatic Classification of National Health Service Feedback
Christopher Haynes 1, *, Marco A. Palomino 1, *, Liz Stuart 1 , David Viira 2 , Frances Hannon 2 ,
Gemma Crossingham 2 and Kate Tantam 2
1 School of Engineering, Computing and Mathematics, University of Plymouth, Plymouth PL4 8AA, UK;
[email protected]
2 Faculty of Health, University Hospitals Plymouth, Derriford Rd., Plymouth PL6 8DH, UK;
[email protected] (D.V.); [email protected] (F.H.); [email protected] (G.C.);
[email protected] (K.T.)
* Correspondence: [email protected] (C.H.); [email protected] (M.A.P.)
Abstract: Text datasets come in an abundance of shapes, sizes and styles. However, determining
what factors limit classification accuracy remains a difficult task which is still the subject of intensive
research. Using a challenging UK National Health Service (NHS) dataset, which contains many char-
acteristics known to increase the complexity of classification, we propose an innovative classification
pipeline. This pipeline switches between different text pre-processing, scoring and classification
techniques during execution. Using this flexible pipeline, a high level of accuracy has been achieved
in the classification of a range of datasets, attaining a micro-averaged F1 score of 93.30% on the
Reuters-21578 “ApteMod” corpus. An evaluation of this flexible pipeline was carried out using
a variety of complex datasets compared against an unsupervised clustering approach. The paper
describes how classification accuracy is impacted by an unbalanced category distribution, the rare
use of generic terms and the subjective nature of manual human classification.
Keywords: NLP; classification; clustering; text pre-processing; machine learning; National Health Service (NHS)
MSC: 68T50
Citation: Haynes, C.; Palomino, M.A.; Stuart, L.; Viira, D.; Hannon, F.; Crossingham, G.; Tantam, K. Automatic Classification of National Health Service Feedback. Mathematics 2022, 10, 983. https://doi.org/10.3390/math10060983
Academic Editors: Florentina Hristea, Cornelia Caragea and Victor Mitrana
Received: 9 February 2022; Accepted: 16 March 2022; Published: 18 March 2022
1. Introduction
The quantity of digital documents generally available is ever-growing. Classification of these documents is widely accepted as being essential as this reduces the time spent on analysis. However, manual classification is both time consuming and prone to human error. Therefore, the demand for techniques to automatically classify and categorize documents continues to increase. Automatic classification supports the process of enabling researchers to carry out deeper analysis on text corpora. Practical applications of classification include library cataloguing and indexing [1], email spam detection and filtration [2] and sentiment analysis [3].
Since text documents and datasets exhibit such a wide variety of differences and combinations of features, it is impossible to adopt a standardized classification approach. One such feature is the length of the text available as input to the classification process. For example, a newspaper article is likely to contain significantly more text than a tweet, thus providing a larger vocabulary which will aid classification. Another important feature of text is the intended purpose of the text. The intended use of text can significantly affect the author's style and choice of vocabulary. Let us suppose you are required to determine Amazon product categories based on the descriptions of the products. Clearly, certain keywords are highly likely to appear on similar products, and those keywords are likely to have the same intended meaning wherever they are used. In contrast, consider the scenario where you are required to determine whether a tweet exhibited a positive or negative sentiment. In this case, the same keyword used in multiple tweets may have completely
different meanings based on the context and tone of the author. The intended sentiment
could vary considerably.
Of course, objectivity and subjectivity can also affect the accuracy of the classification
process. If the same text, based on Amazon products and tweets, was used for manual
classification, then there is likely to be more consensus on the category of a product
description than on the sentiment of a tweet. This is due to the inherent objectivity of the
categories. Aside from these examples, there are numerous other features of a dataset
which can limit classification accuracy [4].
This paper describes the analysis of a complex UK National Health Service (NHS) pa-
tient feedback dataset which contains many of the elements known to restrict the accuracy
of automatic classification. Throughout experimentation, several pre-processing and ma-
chine learning techniques are used to investigate the complexities in the NHS dataset and
their effect on classification accuracy. Subsequently, an unsupervised clustering approach
is applied to the NHS dataset to explore and identify underlying natural classifications in
the data.
Section 2 describes existing work on automatic text classification and provides a
theoretical background of the approaches used. Section 3 establishes our research problem
statement and introduces the datasets used, incorporating both the NHS dataset and
the benchmarking datasets used for evaluation. Section 4 details the pre-processing and
classification pipeline, followed by the results of our experiments in Section 5. Finally, the
findings and conclusions are discussed in Section 6.
Figure 1. Procedural diagram of the processes used in automatic text classification approaches, where
rhomboids represent data and rectangles represent processes.
The first stage of the pipeline, text pre-processing, primarily focuses on techniques
to extract the most valuable information from raw text data [13]. Typically, this involves
reducing the amount of superfluous text, minimizing duplication and tagging words based
on their type or meaning. This is achieved by techniques such as the following; a brief illustrative sketch of these steps is given after the list:
• Tokenizing. This technique splits text into sentences and words to identify parts of
speech (POS) such as nouns, verbs and adjectives. This creates options for selective
text removal and further processing.
• Stop word removal. This technique removes commonly occurring words which are
unlikely to give extra value or meaning to the text. Examples include the words “the”,
“and”, “for” and “of”. There are many different open-source, stop word lists [14],
which have been used in multiple applications with differing levels of success.
• Stemming or lemmatization. In this technique words are replaced with differing suf-
fixes, whilst maintaining the same common stem. For example, the words “thanking”,
“thankful” and “thanks” would all be stemmed to the word “thank”. Some of the
most popular stemming algorithms, such as the Porter Stemmer algorithm [15], use a
truncation approach, which although fast can often result in mistakes as it is aimed
purely at the syntax of the word whilst ignoring the semantics. A slightly more robust,
but slower, approach would be lemmatizing, which uses POS to infer context. Thus, it
reduces words to a more meaningful root. For example, the words “am”, “are” and
“is” would all be lemmatized to the root verb “be”.
• Further cleaning can also occur depending on the raw text data, such as removing
URLs, specific unique characters or words identified by POS tagging.
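As an illustration only (not the exact implementation used later in this work), the following minimal Python sketch chains tokenizing, stop word removal, stemming and lemmatization using the NLTK library; the example sentence and the downloaded resource names are illustrative assumptions and may vary slightly between NLTK versions.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-off downloads of the NLTK resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "Thanking the team for the supportive care given to the patients."

tokens = nltk.word_tokenize(text.lower())             # tokenizing
tokens = [t for t in tokens if t.isalpha()]           # drop punctuation and numbers
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]   # stop word removal

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])                   # e.g. "thanking" -> "thank"
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])  # e.g. "given" -> "give"
```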
The second stage of the pipeline, Word Scoring, involves transforming the text into
a quantitative form. This can be to increase the weighting of words or phrases which are
deemed more important to the meaning of a document. Different scoring measures can be
applied. These include:
• Term Frequency Inverse Document Frequency (TF-IDF), a measure which scores a
word within a document based on the inverse proportion in which it appears in the
corpus [16]. Therefore, a word will be assigned a higher score if it is common in
the scored document, but rare in all the other documents in the same corpus. The
advantage of this measure is that it is quick to calculate. The disadvantage is that
synonyms, plurals and misspelled words would all be treated as completely different
words.
• TextRank is a graph-based text ranking metric, derived from Google’s PageRank
algorithm [17]. When used to identify keywords, each word in a document is deemed
to be a vertex in an undirected graph, where an edge exists for each occurrence of
a pair of words within a given sentence. Subsequently, each edge in this graph is
deemed to be a “vote” for the vertex linked to it. Vertices with higher numbers of votes
are deemed to be of higher importance, and the votes which they cast are weighted
higher. By iterating through this process, the value for each vertex will converge to a
score representing its importance. Note that, in contrast to the TF-IDF, which scores
relative to a corpus, TextRank only evaluates the importance of a word within the
given document.
• Rapid Automatic Keyword Extraction (RAKE) is an algorithm used to identify key-
words and multi-word key-phrases [18]. RAKE was originally designed to work on
individual documents focusing on the observation that most keyword-phrases con-
tain multiple words, but very few stop words. Through the combination of a phrase
delimiter, word delimiter and stop word list, RAKE identifies the most important
keywords/key-phrases in a document and weights them accordingly.
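As an aside, TF-IDF is commonly defined as tfidf(t, d) = tf(t, d) · log(N/df(t)), where N is the number of documents in the corpus and df(t) is the number of documents containing term t. The short sketch below, which assumes the scikit-learn library (whose TfidfVectorizer uses a smoothed variant of this formula) and invented example documents, shows how such scores can be inspected.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example corpus of three short documents.
docs = [
    "the staff were kind and supportive",
    "thank you for the supportive team",
    "the ward was clean and safe",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

# Terms shared across documents (e.g. "supportive") receive lower weights
# than terms unique to the scored document.
for idx in tfidf[0].nonzero()[1]:
    print(terms[idx], round(float(tfidf[0, idx]), 3))
```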
The third stage of the pipeline is Feature Generation. It is essential to produce an
input which can be used for the machine learning classifier. In general, these inputs need
to be fixed length vectors containing normalized real numbers. The common approach is
the BoW, which creates an n-length vector representing every unique word in a corpus.
This vector can then be used as a template to generate a feature mask for each document.
This resultant feature mask would also be an n-length vector. To produce the feature mask,
each word in the document would be identified within the BoW vector. Subsequently, its
corresponding index in the feature mask vector would be set, whilst all other positions in
the vector would be reset to 0. For example, suppose there is a corpus of solely the following
two sentences: “This is good” and “This is bad”. Its corresponding BoW vocabulary would
consist of four words {This, is, good, bad}. The first sentence would be represented by the
vector {1, 1, 1, 0}, whilst the second sentence would be represented by the vector {1, 1, 0, 1}.
In this example, the value used to set the feature vector is simply the binary representation
(0 or 1) of whether the word exists in the given document. Alternatively, the values used
to set the vector could also be any scoring metric, such as those discussed above. The
BoW model is limited by the fact that it does not represent the ordering of words in their
original document. BoW can also be memory intensive on large corpuses, since the feature
mask of each document has to represent the vocabulary of the entire corpus. Therefore,
for a vocabulary of size n and a corpus of m documents, a matrix of size n × m is required to
represent all feature masks. Further, as the size of n increases, each feature mask will also
contain more 0 values, making the data increasingly sparse [19]. There are alternatives to
the BoW model which attempt to resolve the issue of word ordering. The most common is
the n-gram representation, where n represents the number of words or characters in a given
sequence [20]. The BoW model could be considered an n-gram representation where n is
set to 1, also known as a unigram. For example, given the sentence “This is an example”,
an n-gram of n = 2 (bigram) would be the set of ordered words {“This is”, “is an”, “an
example”}. This enables sequences of words to be represented as features. In some text
classification tasks, bigrams have proved to be more efficient than the BoW model [21].
However, it also follows that as n increases, n-gram approaches are increasingly affected
by the size of the corpus and the corresponding memory required for processing [22].
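The BoW example above can be reproduced with scikit-learn's CountVectorizer, which also provides n-gram features; note that the vectorizer lowercases tokens and orders the vocabulary alphabetically, so the column order differs from the prose example (scikit-learn ≥ 1.0 is assumed for get_feature_names_out).

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is good", "This is bad"]

# Binary unigram BoW: each column marks whether a vocabulary word occurs.
unigram = CountVectorizer(binary=True)
X_uni = unigram.fit_transform(corpus)
print(unigram.get_feature_names_out())  # ['bad' 'good' 'is' 'this']
print(X_uni.toarray())                  # [[0 1 1 1], [1 0 1 1]]

# Bigram (n = 2) features preserve some local word order.
bigram = CountVectorizer(ngram_range=(2, 2))
print(bigram.fit(corpus).get_feature_names_out())  # ['is bad' 'is good' 'this is']
```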
Word embedding is an alternative to separating the preprocessing of text (second stage)
from the scoring of text (third stage) of the generalized pipeline. It can be used to represent
words in vector space, thus encompassing both sections [8]. A popular model for generating
these vectors is word2vec [23], which consists of two similar techniques to perform these
transformations: (i) continuous BoW and (ii) continuous skip-gram. Both these processes
operate on sequences of words and support encoding text of any length into a fixed-length
vector, whilst maintaining some of the original similarities between words [24]. Similar techniques
can be used on word vectors to produce sentence vectors, and then on sentence vectors
to produce document vectors. These vectors result in high-level representations of the
document, with a less sparse representation and a smaller memory footprint than n-gram
and BoW models. They often outperform the n-gram and BoW models with classification
tasks [25], but they do have a much greater computational complexity which increases
processing time.
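A minimal word embedding sketch using the gensim library (not a library used in this work; gensim ≥ 4.0 is assumed) is given below; the sg parameter switches between the continuous BoW and continuous skip-gram variants, and the toy corpus is invented.

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus of pre-tokenized sentences.
sentences = [
    ["the", "staff", "were", "kind", "and", "supportive"],
    ["thank", "you", "for", "the", "supportive", "team"],
    ["the", "team", "were", "kind", "to", "patients"],
]

# sg=0 selects continuous BoW, sg=1 continuous skip-gram.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

print(model.wv["supportive"].shape)            # fixed-length vector, here (50,)
print(model.wv.most_similar("kind", topn=2))   # nearest words in the embedding space
```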
The fourth and final stage of the pipeline, Classification, uses the feature masks in
training a classification model. There are many different viable classifiers available; some of
the most widely used approaches in text classification are k-nearest-neighbor (KNN) [26],
naïve Bayes (NB) [27], neural networks (NN) [28] and support vector machines (SVM) [9].
Each of these classifiers has numerous variations, each with its own advantages and
disadvantages. The specific variations used in our proposed pipeline will be discussed in
further detail in Section 4.
(LFE) dataset. It is composed of solely positive written feedback given to staff by patients
and colleagues. These data were organized into 24 themes; categories where the same
sentiment is expressed using slightly different terminology. Subsequently, each item of
text (phrase, sentence) was manually classified into one or more theme. Each item may
be associated with multiple themes which are ordered. The first theme can be considered
the primary theme. As this paper focuses on single-label classification, only the primary
themes will be used. Note that the full list of the themes and their descriptions is available
in Appendix A, Table A1. The LFE dataset has several characteristics which intrinsically
make its automatic classification a difficult task:
1. The dataset consists of 2307 items. Due to some items having the text or theme omitted, only
2224 items were deemed viable for classification. This is relatively small compared to
most datasets used in text classification.
2. The length of each text item is short; the average item contained 49.7 words. The
shortest text item is 2 words long and the longest text item is 270 words long.
3. The number of themes is large with respect to the size of the dataset. Even if the
themes were evenly distributed, this would result in an average of less than 93 text
items in each category.
4. The distribution of the themes is not balanced. For example, the largest theme
“Supportive” is the primary theme for 439 items (19.74%). The smallest theme “Safe
Care” is the primary theme for solely 1 item (0.04%). The number of items per category
has a standard deviation of 111.23 items. The distribution for the remaining theme
categories is also uneven, see Figure 2.
5. Since all the text is positive feedback, many of the text items share a similar vocabulary
and tone regardless of the theme category to which they belong. For example, the
phrase “Thank you” appears in 807 items (36.29%). However, only 61 items (2.74%)
belong to the primary theme of “Thank You”.
6. The themes are of a subjective nature, dependent on individual interpretation so
they could be viewed in different ways. For example, the theme “Teamwork” is
not objectively independent of the theme “Leadership”. Thus, there may be some
abstract overlap between these themes. Furthermore, there is no definitive measure to
determine which theme is more important than another for a given text item, making
the choice of the primary theme equally subjective.
Given the classification challenges posed by the LFE dataset, it was important to bench-
mark results. Thus, all experiments are compared to both well-known text classification
datasets and other datasets which share one or more of the characteristics with the LFE
dataset.
The first benchmark dataset was the “ApteMod” split of the Reuters-21578 dataset
(Reuters). This consists of short articles from the Reuters financial newswire service pub-
lished in 1987. This split solely contains documents which have been manually classified as
belonging to at least one topic, making this dataset ideal for text classification. This dataset
is already sub-divided into a training set and testing set. Since k-fold cross validation
was used, the training and testing sets were combined. Finally, since multiple topics were not
assigned with any order of precedence, items which had been assigned to more than one topic were
removed. Although this dataset does not share many of the classification challenges of
the LFE dataset, it is widely used in text classification [29,30]. Thus, it provided indirect
comparisons with other work in this field.
Figure 2. Chart of LFE theme category distributions, where the size of a bubble denotes the number
of occurrences of each text item for a particular theme category. Note that the position of the bubbles
is synthetic, solely used to portray that overlap occurs between themes.
Three other datasets were chosen since each share one of the characteristics of the LFE
dataset.
• The “Amazon Hierarchical Reviews” dataset is a sample of reviews from different
products on Amazon, along with the corresponding product categories. Amazon uses
a hierarchical product category model, so that items can be categorized at different
levels of granularity. Each item within this dataset is categorized in three levels. For
example, at level 1 a product could be in the “toys/games” category. At level 2, it
could be in the more specific “games” category. At level 3, it could be in the more
specific “jigsaw puzzles” category. This dataset was selected as it provides a direct
comparison of classification accuracy, when considering the relative dataset volume
compared to the number of categories.
• The “Twitter COVID Sentiment” dataset is a curation of tweets from March and
April 2020 which mentioned the words “coronavirus” or “COVID”. This dataset was
manually classified within one of the following five sentiments: extremely negative,
negative, neutral, positive or extremely positive. The source dataset had been split
into a training set and a testing set. As with the Reuters dataset, these two subsets
were combined.
• The “Twitter Tweet Genre” dataset is a small selection of tweets which have been man-
ually classified into one of the following four high level genres: sports, entertainment,
medical and politics.
Each of these datasets share some of the complex characteristics of the LFE dataset
described at the start of this section. Table 1 presents and compares these similarities. The
full specification of all the datasets is available in Table 2.
Table 1. Comparison of the datasets based on the complexity of their classification characteristics.
The numbered characteristics refer to the list at the start of this section.
Twitter COVID Sentiment
  Characteristic 2: The average number of words in a tweet is 27.8, with the shortest (2) containing 1 word and the longest containing 58 words.
  Characteristic 6: Sentiment analysis in general has a subjective nature to the classifications given [31]. This dataset also has some specific cases where two very similar tweets have been given opposing sentiments; an example can be found in Appendix B.
Twitter Tweet Genre
  Characteristic 1: This dataset consists of only 1161 documents.
  Characteristic 2: The average number of words in a tweet is 16, with the shortest (2) containing 1 word and the longest containing 27 words.
(1) The shortest review in this dataset contains 0 words and only punctuation. However, this was discounted as it was removed during pre-processing and not used. (2) Some tweets in this dataset contain 0 words and only URLs, retweets or user handles. However, these were discounted as they were removed during pre-processing and not used.
Table 2. Full specification of datasets. All values presented in this table represent the raw datasets
prior to removal of any invalid entries, pre-processing or text cleaning.
Currently, the LFE dataset is manually classified by hospital staff, who have to read
each text item and assign it to a theme. Therefore, we are the first to experiment with
applying automatic text classification to this dataset. The Amazon Hierarchical Reviews,
Twitter COVID Sentiment and Twitter Tweet Genre datasets were primarily selected for
their similar characteristics to the LFE dataset. However, another advantage they provided
was that they contained extremely current data having all been published in 2020 (April,
September and January, respectively). Although these datasets were useful for our inves-
tigation into how dataset characteristics affect classification accuracy, it was difficult to
draw direct comparisons with related work in this field due to the dataset originality. The
Reuters dataset was selected because of its wide use in this field as a benchmark, allowing
direct comparisons of our novel pipeline results to other well-documented work.
Some of the seminal work in automatic text classification on the Reuters dataset was
by Joachims [32]. Through the novel use of support vector machines, a micro averaged
precision-recall breakeven score of 86.4 was achieved across the 90 categories which con-
tained at least one training and one testing example. After this, researchers have used
many different configurations of the Reuters dataset for their analysis. Some have used
the exact same subset but applied different feature selection methods [33], while other
work has focused on only the top 10 largest categories [34,35]. Unfortunately, the wide
range of feature selection and category variation limits reliable comparison. However, by
selecting related work with either (i) similar pre-processing and feature selection methods, or
(ii) similar category variation, we aim to ensure our proposed pipeline is performing with
comparable levels of accuracy.
4. Methodology
For this research, a software package was developed using the Python programming
language (Version 3.7), making use of the NumPy (Version 1.19.5) [36] and Pandas (Version
1.2.3) [37] libraries for efficient data structures and file input and output. The core concept
was to develop an intuitive data processing and classification pipeline based on flexibility,
thus enabling the user to easily select different pre-processing and classification techniques
each time the pipeline is executed. As discussed in Section 2, classification often uses a gen-
eralized flow of data through a document classification pipeline. The approach presented
in this paper follows this model. Figure 3 shows an overview of the pipeline developed.
Since the LFE dataset contained a range of proper nouns which provided no benefit to
the classification task, these were removed to optimize the time required for each experiment.
The Stanford named entity recognition system (Stanford NER) [38] was used to tag any
names, locations and organizations in the raw text. Subsequently, these were removed from
the dataset. In total, 3126 proper nouns were removed. A manual scan was performed
to confirm that most cases were covered. Some notable exceptions were the name “June”
(which was most likely mistaken for the month) and the word “trust” when used in the
phrase “NHS trust”. Neither of these were successfully tagged by Stanford NER. The
dataset used in this work is the final version with the names, locations and organizations
removed.
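The snippet below sketches this kind of entity removal using the NER processor of the Stanza toolkit introduced in the next paragraph; the original work used the Stanford NER system [38], so the exact calls and the invented example sentence are assumptions rather than the authors' code.

```python
import stanza

stanza.download("en")  # one-off download of the English models
nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

text = "Jane Smith from the Northside Hospital Trust thanked the ward team."
doc = nlp(text)

# Remove spans tagged as people, organizations or locations.
entities = {ent.text for ent in doc.ents if ent.type in {"PERSON", "ORG", "GPE", "LOC"}}
cleaned = text
for span in entities:
    cleaned = cleaned.replace(span, "")
print(cleaned)
```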
To maintain similarity in the pre-processing approaches, the Stanford core NLP
pipeline [39], provided through the Python Stanza toolkit (Version 1.2) [40], was used
where possible. This provided the tools for tokenizing the text, and lemmatizing words.
However, Stanford core NLP did not provide any stemming options, so an implementation
of Porter Stemmer [41] from the Python natural language toolkit (NLTK) (Version 3.5) [42]
was used. NLTK also provided the list of common English stop words for stop word re-
moval. The remaining pre-processing techniques (removal of numeric characters, removal
of single character words and punctuation removal) were all developed for this project.
The final pre-processing component was used to specifically clean the text of the Twitter
collections. This consisted of removing URLs, “#” symbols from Twitter hashtags, “@”
symbols from user handles and retweet text (“RT”). This was the final part of the software
developed for this project’s source code.
A selection of four word scoring metrics is made available in the pipeline. RAKE [18]
was used via an implementation available in the rake-nltk (Version 1.0.4) [43] Python
module. A TextRank [17] implementation was designed and developed, derived from an
article and source code by Liang [44]. An implementation of TF-IDF and a term frequency
model were also developed.
Within the stage of feature extraction, the BoW model was developed using standard
Python collections. These were originally stored as lists and subsequently converted to dictionar-
ies to optimize the look-up speed when generating feature masks. Feature masks were
represented in NumPy arrays to reduce memory overhead and execution time.
Figure 3. Representation of the text processing and classification pipeline, showing each stage
described in Section 2. Rhomboids represent data, rectangles represent processes and diamonds
represent decisions which can be modified within the software parameters.
All classifiers used in this publication originate from the scikit-learn (Version 0.24.1) [45]
machine learning library. This library was selected since (i) it provides tested classifier
implementations, (ii) it offers a range of statistical scoring methods and (iii) it is a popular
library used in similar literature, thereby enabling direct comparison with other work in this field.
For this project a set of wrapper classes were designed for the scikit-learn classifiers. All
classifier wrappers were developed upon an abstract base class to increase code reuse, speed
up implementation of new classifiers and ensure a standardized set of method calls through
overriding. The base class and all child classes are available in the “Classifiers” package
within the source code. A link to the full source code can be found in the Supplementary
Materials Section.
Within the final stage of Classification, four of the most common text classifiers
are provided. These are k-nearest neighbor (KNN), complement weighted naïve Bayes
(CNB), multi-layer perceptron (MLP) and support vector machine (SVM). Tuning the hyper
parameters of each of these classifiers, for every dataset, would have produced too much
variability in the results. Therefore, each classifier was tuned to the LFE dataset; the same
hyper parameters were used on all datasets. When tuning was performed, only one variable
was tuned at a time, the remainder of the pipeline remained constant, see Figure 4.
Figure 4. Representation of the processing pipeline used for hyper parameter tuning. From the
text input all the way through the processes of tokenizing, lemmatizing, additional text cleaning,
TF-IDF scoring, BoW modeling, creation of feature masks to the final stage of using the complement
weighted naïve bayes classifier.
The first classifier, KNN, determines the category of an item based on the categories
of the nearest neighbors in feature space. The core parameter to set is the value of k; the
number of neighbors which should be considered when classifying a new item. To define
k, a range of values were tested, and their accuracy was assessed based on their F1 score.
See the full results in Appendix C, Table A2. The value of k was defined as 23. Work by
Tan [46] suggested weighting neighbors based on their distance may improve results when
working with unbalanced text corpuses. Thus, this parameter was also tuned. However,
when this was applied to the LFE dataset, uniform weighting produced better results. The
results of these tests are shown in Appendix C, Table A3.
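A sketch of the tested configurations, assuming the scikit-learn KNeighborsClassifier, is shown below; only the retained values (k = 23, uniform weighting) correspond to the final pipeline.

```python
from sklearn.neighbors import KNeighborsClassifier

# Retained configuration after tuning (Appendix C, Tables A2 and A3).
knn_uniform = KNeighborsClassifier(n_neighbors=23, weights="uniform")

# Distance-weighted variant tested for comparison, following Tan [46].
knn_distance = KNeighborsClassifier(n_neighbors=23, weights="distance")
```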
The second classifier, Complement weighted Naïve Bayes (CNB), is a specialized
version of multinomial Naïve Bayes (MNB). This approach is reported to perform better
on imbalanced text classification datasets, by improving some of the assumptions made in
MNB. Specifically, it focuses on correcting the assumption that features are independent,
and it attempts to improve the weight selection of the MNB decision boundary. This
approach did not require any hyper parameter tuning, and the scikit-learn CNB was
implemented as described by Rennie et al. [27].
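A minimal sketch of this classifier, assuming scikit-learn and an upstream TF-IDF representation, is given below.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Complement naive Bayes with its default parameters, chained after TF-IDF.
cnb_pipeline = make_pipeline(TfidfVectorizer(), ComplementNB())
# cnb_pipeline.fit(train_texts, train_labels) would train on raw text items.
```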
The third classifier provided is based on the multi-layer perceptron (MLP). There
are multiple modern text classification approaches which use deep learning variants of
neural networks. Some notable examples are convolutional neural networks (CNN) [47]
and recurrent neural networks (RNN) [11], both of which have been used extensively in
this field. These approaches have a substantial computational overhead for feature creation.
Therefore, deep learning would have been too unwieldy for some of the datasets used
in this work. Furthermore, scikit-learn does not provide an implementation of CNN or
RNN neural network architectures. Therefore, their use would require another library,
reducing the quality of any comparisons made between classifiers. For these reasons, a
more traditional MLP architecture, with a single hidden layer, was used instead. The
main parameter to tune for this model was the number of neurons used in the hidden
layer. There is much discussion on how to optimize selection of this parameter, but the
general rule of thumb is to select the floored mean between the number of input neurons
and output neurons as defined below:
$n_{\text{hidden}} = \left\lfloor \frac{n_{\text{input}} + n_{\text{output}}}{2} \right\rfloor$
The remaining MLP hyper parameters are the default values from scikit-learn and
a full list of these can be found in Appendix C, Table A4. The MLP was also set to stop
training early if there was no change in the validation score, within a tolerance bound of
1 × 10−4 , over ten epochs.
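A hypothetical instantiation mirroring the values listed in Appendix C, Table A4 is sketched below; the input and output sizes are illustrative placeholders rather than the actual LFE dimensions.

```python
from sklearn.neural_network import MLPClassifier

n_input, n_output = 5000, 24          # placeholder vocabulary size and number of themes
n_hidden = (n_input + n_output) // 2  # floored mean, as described above

mlp = MLPClassifier(
    hidden_layer_sizes=(n_hidden,),
    activation="relu",        # RELU
    solver="adam",            # Adam weight optimization
    alpha=1e-4,               # regularization term
    batch_size=64,
    max_iter=200,             # max epochs
    early_stopping=True,
    tol=1e-4,                 # early stopping tolerance
    n_iter_no_change=10,      # early stopping iteration range
)
```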
For the fourth classifier, SVM, it has been reported that the selection of a linear kernel
is more effective for text classification problems than non-linear kernels [48]. Four of the
most commonly used kernels were tested, and the tests confirmed that this was also the case with
the LFE dataset. Therefore, a linear kernel was selected for use in this classifier. The results
of the tests are found in Appendix C, Table A5. To account for the class imbalance in the
LFE dataset, each class is weighted proportionally in the SVM to reduce bias.
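A minimal sketch of this configuration, assuming the scikit-learn SVC estimator, is given below.

```python
from sklearn.svm import SVC

# Linear kernel with class weights inversely proportional to class frequency,
# compensating for the imbalance between LFE themes.
svm = SVC(kernel="linear", class_weight="balanced")
```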
Aside from these supervised classification approaches, an unsupervised model was
also developed using scikit-learn and the same classification wrapper class structure. The
purpose of this was to examine whether any natural clusters form within the LFE dataset,
to enable a wider range of comparisons. K-means [49] was selected as the unsupervised
approach, where k represents the number of groups the data should be clustered into.
To tune this parameter, two metrics were recorded for a range of potential values of k:
the j-squared error and the silhouette score [50]. A lower j-squared error represents a
smaller average distance from any given data point to the centroid of its cluster, and a
higher silhouette score represents an item exhibiting a greater similarity to its own cluster,
compared to other clusters. Therefore, an optimal k value should minimize the j-squared
error whilst maximizing the silhouette score. However, the j-squared error is likely to trend lower
as more clusters are added, leading to diminishing returns for larger values of k. So, it is
better suited to examining where the benefit starts to drop off; this is often referred to as
finding the "elbow" in the graph.
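A sketch of this tuning loop is given below, assuming scikit-learn and random data standing in for the document feature masks; the inertia_ attribute of KMeans is taken here to correspond to what is called the j-squared error above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def tune_kmeans(X, k_values):
    """Record the j-squared error (inertia) and the average silhouette score
    for each candidate number of clusters."""
    results = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        results.append((k, km.inertia_, silhouette_score(X, km.labels_)))
    return results

# Random stand-in for the feature masks.
X = np.random.rand(200, 50)
for k, inertia, sil in tune_kmeans(X, [2, 8, 13, 16, 20]):
    print(f"k={k:3d}  j-squared error={inertia:8.2f}  silhouette={sil:.3f}")
```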
Appendix C, Figure A1 shows the graph comparing the j-squared error and the average
silhouette score for all clusters. From this analysis it was difficult to define the optimal
value of k, since the j-squared error trended downwards almost linearly, and the average
silhouette score was low for all values of k. Therefore, the LFE dataset was clustered using
different small values of k, {2, 8, 13, 16, 20}, which performed better.
5. Results
This research evaluates how fundamental differences in dataset volume, category
distribution and subjective manual classification affect the accuracy of automatic document
classification. All experiments were performed on the same computer. It had the following
hardware specification: Intel Core i5-8600K, 6 cores at 3.6 GHz. RAM: 32 GB DDR4. GPU:
Gigabyte Nvidia GeForce GTX 1060 6 GB VRAM. The Stanza toolkit for Stanford core
NLP supported GPU parallelization, and all experiments exploited this feature. The scikit-
learn library did not have any GPU enabled options, so all classification was processed by
the CPU.
During experiments, each dataset was tested for the given variables. A fivefold cross
validation was used, and the mean score for each validation is reported. If not otherwise
stated, all other elements of the pipeline are identical to the constant processing pipeline,
described in Section 4. The core metric used for evaluating accuracy was the F1 score,
which combines both precision and recall into a single measure. This was recorded as
both a micro average and a macro average.
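A hypothetical evaluation call of this kind, with synthetic stand-ins for the feature masks and the primary themes, is sketched below using scikit-learn's cross_validate.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import ComplementNB

# Synthetic stand-ins: X would be the matrix of feature masks, y the primary themes.
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.integers(0, 4, size=100)

scores = cross_validate(ComplementNB(), X, y, cv=5,
                        scoring=("f1_micro", "f1_macro"))
print(scores["test_f1_micro"].mean(), scores["test_f1_macro"].mean())
```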
Tables 3–5 contain the experimental results yielded by the evaluation of changes to the
different sections of the proposed pipeline (pre-processing, word scoring and classification
respectively). Based on the results from Tables 3–5, the optimum pipeline for each dataset
was tested and the results can be found in Table 6.
Table 3. Evaluates the effect of different pre-processing techniques on the accuracy of classification.
The same processing pipeline is maintained aside from pre-processing (TF-IDF, BoW, CNB).
Table 4. Evaluates the effect of using different word scoring techniques on the accuracy of classifi-
cation. The same processing pipeline is maintained aside from word scoring (stop words removal,
words lemmatization, additional text cleaning, BoW, CNB).
Table 5. Evaluates the use of different classifiers. The same processing pipeline is maintained aside
from the classifier (stop words removed, words lemmatized, additional text cleaning, TF-IDF, BoW).
Table 6. Optimum pipeline result for each dataset with F1 micro averaged score.
To benchmark the accuracy of our pipeline against other related work on automatic
text classification, Table 7 presents our results on the Reuters corpus compared to the works
mentioned in Section 3.2. As stated in this previous section, it should be noted that a
direct comparison of these results is difficult due to the differences in document/category
reduction, pre-processing approaches and feature selection. However, the results presented
suggest that the approach outlined in this paper produces comparable accuracy to other
state-of-the-art approaches.
Table 7. Comparison of the highest achieved micro-averaged score of our pipeline (shown in bold),
compared to other published automatic text classification results on the “ApteMod” split of the
Reuters-21578 corpus. Accuracy metrics are all F1 scores, except Joachims which is the precision-
recall breakeven point.
Automatic Text Classification Approach                    Overall Accuracy (Micro-Averaged)
Joachims, T. SVM [32]                                     0.864
Banerjee, S. et al. SVC [33]                              0.870
Ghiassi, M. et al. "DAN2" MLP [34] 1                      0.910
Zdrojewska, A. et al. Feed-forward MLP with ADAM [35]     0.924
Haynes, C. et al. Novel Pipeline MLP                      0.933
1 This score is not stated explicitly but was calculated as the average of the F1 testing scores provided in the
referenced paper.
6. Discussion
6.1. Practical Implications
Based on the results of our experiments we will discuss the two research questions
introduced in Section 3.1.
(R1) The NHS is likely to adopt our approach to automatically classify feedback. This
means we have successfully reduced the workload of NHS staff by providing a tool
which can be used in place of manual classification. Therefore, the answer to (R1)
is positive. Although our proposed classification pipeline attained a lower micro-
averaged F1 score on the LFE dataset compared to the benchmark datasets, given
the limitations of the dataset, the NHS has found this better than the alternative of
manually classifying future datasets.
(R2) The performance of the classification pipeline published in this paper is evaluated by
comparing it against the results of the Reuters dataset with other published work. In this
research, a micro-averaged F1 score of 93.30% was achieved. As shown in Table 7, that
accuracy outperforms the seminal SVM approaches of Joachims [32], which achieved
a micro-averaged breakeven point of 86.40%. Furthermore, the classification pipeline
performed in-line with or surpassed more recent approaches [33–35]; demonstrating that
this classification pipeline produces high accuracy results on other datasets. Therefore,
the answer to (R2) is positive.
The first test was used to evaluate how evenly the text items are distributed for
different values of k (2, 8, 13, 16, 20 and 150). For any number of clusters, a similar trend
emerged, where one cluster would account for between 51% and 98% of all the items. The
remaining items were thinly spread between the remaining categories. Figures 5 and 6
depict how the data is unevenly distributed with k values of 20 and 150, respectively.
Therefore, there is no evidence of a natural separation for most of the text items in the LFE
dataset. Thus, the items are either sufficiently generic that they get clustered into one large
group, or they are overly similar, leading to the formation of a limited number of smaller clusters.
Figure 5. Distribution of LFE dataset when clustered using k-Means where k = 20.
Figure 6. Distribution of LFE dataset when clustered using k-Means where k = 150.
of magnitude higher than across the rest of the clusters. As the value of k varied, this cluster
appeared with a 96.6% overlap with a cluster in k = 2, and a 79.3% overlap with a cluster in
k = 13.
Furthermore, when k = 8, there was an even smaller cluster identified which contained
only 9, out of the total 2307, items. This tiny cluster came from sequential items in the
dataset, which share almost exactly the same words. It appears that someone submitted
multiple “excellence texts” for a range of different staff. They copied and pasted the same
text framework, just changing the name, organization or slightly rewording the text. So,
after using the NER, cleaning, lemmatization and removing the stop words, all these items
are virtually identical. What is also interesting in this cluster is how the word “fantastic”
appeared in every entry whereas it only appears in 5.98% of the whole corpus. This shows
one of the downsides to TF-IDF in this case, as words which have no bearing on the
classification are getting scored highly due to their rarity across the rest of the corpus. This
also supports the argument that common terms, unrelated to the category, could be limiting
the classification accuracy. A full breakdown of the occurrence of the most common words
in each cluster when k = 8 is shown in Table 8; it indicates that a general theme can be
manually identified for most of the clusters.
Table 8. The percentage of times words appeared in each text item, in each cluster, when k = 8. Only
the five largest clusters are shown, since the other three clusters only contained a single item each.
Bold values represent statistically significant values (p < 0.5) in a given cluster when compared to
their occurrence in the entire dataset. The case (upper or lower) of the words was not considered.
Consider how the large remaining cluster has a similar distribution of words when
compared to the full dataset. This suggests that this large cluster is a ‘catch-all’ for all the
items not specific enough to be classified elsewhere. This reinforces the conclusion that
rarely used but generic terms in the LFE dataset are biasing the accuracy of classification.
This could also explain why the simple scoring metric of term count was optimal for
the LFE data. The other single word scoring methods (Text Rank and TF-IDF) both give
higher weight to words which are common in a given item, in comparison to the rest of the
corpus. However, in this dataset the most commonly used words are actually those that
most closely represent the categories:
• “Thank” appears in 43.44% of items.
• “Support” appears in 32.10% of items.
• “Work” appears in 41.46% of items, “Hard” appears in 15.83% of items.
When you consider there are categories specifically for “Thank you”, “Supportive”
and “Hard Work”, it is clear these terms being underweighted could be another limiting
factor of the LFE dataset. The limiting factor of subjective manual classification is evident
in this same analysis. Although “Thank” appears in 43.44% of items, only 2.74% of the
items have a primary theme of "Thank you". A specific example of this can be seen
in one of the text items, after it has been lemmatized and stop words have been removed.
Consider the text “ruin humor wrong sort quickly good day much always go help support thank”.
This seems quite generic and contains many keywords which might suggest “Supportive”
or “Thank you” as the category. However, this text item was manually classified with a
primary theme of “Positive Attitude” and a secondary theme of “Hard Work” despite it
not having any of the common keywords associated with these themes.
Overall, the data suggest that the common limiting factors of classifying the LFE
dataset are also present when it is clustered. Indeed, this means that there is an intrinsic
limitation on the ability to classify this specific dataset.
Supplementary Materials: The full source code for the developed software tool can be downloaded
from: https://github.com/ChristopherHaynes/LFEDocumentClassifier (accessed on 8 February 2022).
Author Contributions: Conceptualization, C.H. and M.A.P.; methodology, C.H. and M.A.P.; software,
C.H.; validation, C.H.; formal analysis, C.H.; investigation, C.H. and M.A.P.; resources, D.V., F.H.
and G.C.; data curation, D.V., F.H. and G.C. and C.H.; writing—original draft preparation, C.H.;
writing—review and editing, C.H., M.A.P. and L.S.; visualization, C.H. and L.S.; supervision, M.A.P.
and L.S.; project administration, L.S. and K.T. All authors have read and agreed to the published
version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: “Reuters-21578 (ApteMod)”: Used in our software package via Python
NLTK platform, direct download available at http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.
Retrieved 25 April 2021. "Amazon Hierarchical Reviews": Yury Kashnitsky. Hierarchical
text classification. (April 2020). Version 1. Retrieved 29 April 2021 from https://www.kaggle.com/
kashnitsky/hierarchical-text-classification/version/1. “Twitter COVID Sentiment”: Aman Miglani.
Coronavirus tweet NLP—Text Classification. (September 2020). Version 1. Retrieved 27 April 2021 from
https://www.kaggle.com/datatattle/covid-19-nlp-text-classification/version/1. “Twitter Tweet
Genre”: Pradeep. Text (Tweet) Classification (January 2020). Version 1. Retrieved 28 April 2021 from
https://www.kaggle.com/pradeeptrical/text-tweet-classification/version/1.
Acknowledgments: The authors are grateful to the NHS Information Governance for allowing us to
make use of the anonymized LFE dataset.
Conflicts of Interest: The authors declare no conflict of interest.
Appendix A
Theme Description
Above and Beyond: Performing in excess of the expectations or demands.
Adaptability: Being able to adjust to new conditions.
Communication: Clearly conveying ideas and tasks to others.
Coping with Pressure: Adjusting to unusual demands or stressors.
Dedicated: Devoting oneself to a task or purpose.
Education: Achieving personal improvement through training/schooling.
Efficient: Working in a well-organized and competent way.
Hard Work: Working with a great deal of effort or endurance.
Initiative: Taking the opportunity to act before others do.
Innovative: Introducing new ideas; original and creative in thinking.
Kindness: Being friendly and considerate to colleagues and/or patients.
Leadership: Influencing others in a group by taking charge of a situation.
Morale: Providing confidence and enthusiasm to a group.
Patient Focus: Prioritizing patient care above other tasks.
Positive Attitude: Showing optimism about situations and interactions.
Reliable: Consistently good in quality or performance.
Safe Care: Taking all necessary steps to ensure safety protocols are met.
Staffing: Covering extra shifts when there is illness/absence.
Supportive: Providing encouragement or emotional help.
Teamwork: Collaboration with a group to perform well on a given task.
Technical Excellence: Producing successful results based on specialist expertise.
Time: Devoting additional time to colleagues and/or patients.
Thank You: Giving a direct compliment to a member of staff.
Well-Being: Making a colleague and/or patient comfortable and happy.
Appendix B
Examples of similar tweets from the Twitter COVID Sentiment dataset which have
similar content, but were given opposing manual classifications:
• "My food stock is not the only one which is empty . . . PLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. Stay calm, stay safe. #COVID19france #COVID_19 #COVID19 #coronavirus #confinement #Confinementotal #ConfinementGeneral https://t.co/zrlG0Z520j". Manually classified as "extremely negative".
• "Me, ready to go at supermarket during the #COVID19 outbreak. Not because I'm paranoid, but because my food stock is literally empty. The #coronavirus is a serious thing, but please, don't panic. It causes shortage . . . #CoronavirusFrance #restezchezvous #StayAtHome #confinement https://t.co/usmuaLq72n". Manually classified as "positive".
Appendix C
Table A2. KNN tuning: how the F1 score varied (micro and macro averaged) as the value of k was
altered. Tests performed using the constant processing pipeline. Selected k value shown in bold.
Table A3. KNN tuning: F1 score (micro and macro averaged) using uniform weighting compared
to distance weighting. Although the F1 macro average score is higher for distance weighting, micro
averaging is less susceptible to fluctuations from class imbalance, therefore this was chosen as the
deciding factor. Tests performed using the constant processing pipeline. Selected weighting method
shown in bold.
Parameter Type/Value
Activation function: Rectified linear unit function (RELU)
Weight optimization algorithm: Adam
Max epochs: 200
Batch size: 64
Alpha (regularization term): 0.0001
Beta (decay rate): 0.9
Epsilon (numerical stability): 1 × 10−8
Early stopping tolerance: 1 × 10−4
Early stopping iteration range: 10
Table A5. SVM tuning: F1 score (micro and macro averaged) for different kernels. Although the F1
macro average score is higher for the linear kernel, micro averaging is less susceptible to fluctuations
from class imbalance, therefore this was chosen as the deciding factor. Tests performed using the
constant processing pipeline. Selected kernel shown in bold.
Figure A1. K-Means Tuning Graph. Comparison of how the j-squared error and average silhouette
score vary for differing numbers of clusters (k).
References
1. Pong, J.Y.-H.; Kwok, R.C.-W.; Lau, R.Y.-K.; Hao, J.-X.; Wong, P.C.-C. A comparative study of two automatic document classification
methods in a library setting. J. Inf. Sci. 2007, 34, 213–230. [CrossRef]
2. Androutsopoulos, I.; Koutsias, J.; Chandrinos, K.V.; Paliouras, G.; Spyropoulos, C.D. An evaluation of Naive Bayesian anti-spam
filtering. In Proceedings of the Workshop on Machine Learning in the New Information Age; Potamias, G., Moustakis, V., van Someren,
M., Eds.; Springer: Barcelona, Spain, 2000; pp. 9–17.
3. Connelly, A.; Kuri, V.; Palomino, M. Lack of consensus among sentiment analysis tools: A suitability study for SME firms. In
Proceedings of the 8th Language and Technology Conference, Poznań, Poland, 17–19 November 2017; pp. 54–58.
4. Meyer, B.J.F. Prose Analysis: Purposes, Procedures, and Problems 1. In Understanding Expository Text; Understanding Expository
Text; Routledge: Oxfordshire, England, UK, 2017; pp. 11–64.
5. Kim, S.-B.; Han, K.-S.; Rim, H.-C.; Myaeng, S.H. Some effective techniques for naive bayes text classification. IEEE Trans. Knowl.
Data Eng. 2006, 18, 1457–1466.
6. Ge, L.; Moh, T.-S. Improving text classification with word embedding. In Proceedings of the 2017 IEEE International Conference
on Big Data (Big Data), Boston, MA, USA, 11–14 December 2017; pp. 1796–1805.
7. Zhang, Y.; Jin, R.; Zhou, Z.-H. Understanding bag-of-words model: A statistical framework. Int. J. Mach. Learn. Cybern. 2010, 1,
43–52. [CrossRef]
8. Wang, S.; Zhou, W.; Jiang, C. A survey of word embeddings based on deep learning. Computing 2020, 102, 717–740. [CrossRef]
9. Manevitz, L.M.; Yousef, M. One-class SVMs for document classification. J. Mach. Learn. Res. 2001, 2, 139–154.
10. Ting, S.L.; Ip, W.H.; Tsang, A.H.C. Is Naive Bayes a good classifier for document classification. Int. J. Softw. Eng. Its 2011, 5, 37–46.
11. Lai, S.; Xu, L.; Liu, K.; Zhao, J. Recurrent convolutional neural networks for text classification. In Proceedings of the Twenty-ninth
AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015.
12. Conneau, A.; Schwenk, H.; Barrault, L.; Lecun, Y. Very deep convolutional networks for text classification. Ki Künstliche Intell.
2016, 26, 357–363.
13. Kannan, S.; Gurusamy, V.; Vijayarani, S.; Ilamathi, J.; Nithya, M. Preprocessing techniques for text mining. Int. J. Comput. Sci.
Commun. Netw. 2014, 5, 7–16.
14. Nothman, J.; Qin, H.; Yurchak, R. Stop word lists in free open-source software packages. In Proceedings of the Workshop for NLP
Open Source Software (NLP-OSS), Melbourne, Australia, 20 July 2018; pp. 7–12.
15. Jivani, A.G. A comparative study of stemming algorithms. Int. J. Comp. Tech. Appl 2011, 2, 1930–1938.
16. Ramos, J. Using tf-idf to determine word relevance in document queries. In Proceedings of the First Instructional Conference on
Machine Learning, Citeseer, Banff, AB, Canada, 27 February–1 March 2011; Volume 242, pp. 29–48.
17. Mihalcea, R.; Tarau, P. Textrank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural
Language Processing, Barcelona, Spain, 2 July 2004; pp. 404–411.
18. Rose, S.; Engel, D.; Cramer, N.; Cowley, W. Automatic keyword extraction from individual documents. Text. Min. Appl. Theory
2010, 1, 1–20.
19. Ljungberg, B.F. Dimensionality reduction for bag-of-words models: PCA vs. LSA. Semanticscholar. Org. 2019. Available online:
http://cs229.stanford.edu/proj2017/final-reports/5163902.pdf (accessed on 8 February 2022).
20. Cavnar, W.B.; Trenkle, J.M. N-gram-based text categorization. In Proceedings of the SDAIR-94, 3rd Annual Symposium on
Document Analysis and Information Retrieval, Las Vegas, NV, USA, 1 June 1994; Volume 161175.
21. Ogada, K.; Mwangi, W.; Cheruiyot, W. N-gram based text categorization method for improved data mining. J. Inf. Eng. Appl.
2015, 5, 35–43.
22. Schonlau, M.; Guenther, N.; Sucholutsky, I. Text mining with n-gram variables. Stata J. 2017, 17, 866–881. [CrossRef]
23. Church, K.W. Word2Vec. Nat. Lang. Eng. 2016, 23, 155–162. [CrossRef]
24. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013,
arXiv:1301.3781.
25. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings
of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1480–1489.
26. Han, E.-H.S.; Karypis, G.; Kumar, V. Text categorization using weight adjusted k-nearest neighbor classification. In Proceedings
of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Hong Kong, China, 16–18 April 2001; pp. 53–65.
27. Rennie, J.D.; Shih, L.; Teevan, J.; Karger, D.R. Tackling the poor assumptions of naive bayes text classifiers. In Proceedings of the
20th International Conference on Machine Learning (ICML-03), Washington, DC, USA, 21–24 August 2003; pp. 616–623.
28. Wermter, S. Neural network agents for learning semantic text classification. Inf. Retrieva 2000, 3, 87–103. [CrossRef]
29. Yang, Y.; Liu, X. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, Berkeley, CA, USA, 15–19 August 1999; pp. 42–49.
30. Frank, E.; Bouckaert, R.R. Naive Bayes for Text Classification with Unbalanced Classes. In Lecture Notes in Computer Science;
Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2006; pp. 503–510. [CrossRef]
31. Liu, B. Sentiment analysis and subjectivity. Handb. Nat. Lang. Process. 2010, 2, 627–666.
32. Joachims, T. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the
European Conference on Machine Learning, Bilbao, Spain, 13–17 September 1998; pp. 137–142.
33. Banerjee, S.; Majumder, P.; Mitra, M. Re-evaluating the need for modelling term-dependence in text classification problems. arXiv
2017, arXiv:1710.09085.
34. Ghiassi, M.; Olschimke, M.; Moon, B.; Arnaudo, P. Automated text classification using a dynamic artificial neural network model.
Expert Syst. Appl. 2012, 39, 10967–10976. [CrossRef]
35. Zdrojewska, A.; Dutkiewicz, J.; J˛edrzejek, C.; Olejnik, M. Comparison of the Novel Classification Methods on the Reuters-21578
Corpus. In Proceedings of the Multimedia and Network Information Systems: Proceedings of the 11th International Conference
MISSI, Wrocław, Poland, 12–14 September 2018; Volume 833, p. 290.
36. Harris, C.R.; Millman, K.J.; van der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.
Array programming with NumPy. Version 1.19.5. Nature 2020, 585, 357–362. [CrossRef]
37. McKinney, W. Data structures for statistical computing in python. Version 1.2.3. In Proceedings of the 9th Python in Science
Conference, Austin, TX, USA, 9−15 July 2010; Volume 445, pp. 51–56.
38. Finkel, J.R.; Grenager, T.; Manning, C.D. Incorporating non-local information into information extraction systems by gibbs
sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Stroudsburg,
PA, USA, 25–30 June 2005; pp. 363–370.
39. Manning, C.D.; Surdeanu, M.; Bauer, J.; Finkel, J.R.; Bethard, S.; McClosky, D. The Stanford CoreNLP natural language processing
toolkit. In Proceedings of the 52nd annual meeting of the association for computational linguistics: System demonstrations,
Baltimore, MD, USA, 23–24 June 2014; pp. 55–60.
40. Qi, P.; Zhang, Y.; Zhang, Y.; Bolton, J.; Manning, C.D. Stanza: A Python Natural Language Processing Toolkit for Many Human
Languages. Version 1.2. arXiv 2020, arXiv:2003.07082.
41. Porter, M.F. An algorithm for suffix stripping. Program 1980, 14, 3. [CrossRef]
42. Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit; Version 3.5;
O’Reilly Media, Inc.: Cambridge, UK, 2009.
43. Sharma, V.B. Rake-Nltk. Version 1.0.4 Software. Available online: https://pypi.org/project/rake-nltk/ (accessed on 18 March 2021).
44. Liang, X. Towards Data Science—Understand TextRank for Keyword Extraction by Python. Available online: https:
//towardsdatascience.com/textrank-for-keyword-extraction-by-python-c0bae21bcec0 (accessed on 15 April 2021).
45. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.
Scikit-learn: Machine learning in Python, Version 0.24.1. J. Mach. Learn. Res. 2011, 12, 2825–2830.
46. Tan, S. Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Syst. Appl. 2005, 28, 667–671. [CrossRef]
47. Kalchbrenner, N.; Grefenstette, E.; Blunsom, P. A convolutional neural network for modelling sentences. arXiv 2014, arXiv:1404.2188.
48. Zhang, W.; Yoshida, T.; Tang, X. Text classification based on multi-word with support vector machine. Knowl.-Based Syst. 2008, 21,
879–886. [CrossRef]
49. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley
Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, 1 January 1967; Volume 1, pp. 281–297. Available
online: https://projecteuclid.org/proceedings/berkeley-symposium-on-mathematical-statistics-andprobability/proceedings-
of-the-fifth-berkeley-symposium-on-mathematical-statisticsand/Chapter/Some-methods-for-classification-and-analysis-of-
multivariateobservations/bsmsp/1200512992 (accessed on 8 February 2022).
50. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987,
20, 53–65. [CrossRef]
51. Catal, C.; Diri, B. Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction
problem. Inf. Sci. 2009, 179, 1040–1058. [CrossRef]
52. Barbedo, J.G.A. Impact of dataset size and variety on the effectiveness of deep learning and transfer learning for plant disease
classification. Comput. Electron. Agric. 2018, 153, 46–53. [CrossRef]
mathematics
Article
Towards a Benchmarking System for Comparing Automatic
Hate Speech Detection with an Intelligent Baseline Proposal
Ștefan Dascălu † and Florentina Hristea *,†
Abstract: Hate Speech is a frequent problem occurring among Internet users. Recent regulations are
being discussed by U.K. representatives (“Online Safety Bill”) and by the European Commission,
which plans on introducing Hate Speech as an "EU crime". The recent legislation passed in
order to combat this kind of speech places the burden of identification on the hosting websites, and
often within a tight time frame (24 h in France and Germany). These constraints make automatic
Hate Speech detection a very important topic for major social media platforms. However, recent
literature on Hate Speech detection lacks a benchmarking system that can evaluate how different
approaches compare against each other regarding the prediction made concerning different types
of text (short snippets such as those present on Twitter, as well as lengthier fragments). This paper
intended to deal with this issue and to take a step forward towards the standardization of testing
for this type of natural language processing (NLP) application. Furthermore, this paper explored
different transformer and LSTM-based models in order to evaluate the performance of multi-task
and transfer learning models used for Hate Speech detection. Some of the results obtained in this
paper surpassed the existing ones. The paper concluded that transformer-based models have the best
performance on all studied Datasets.
Keywords: BERT; transfer learning; multi-task learning; RoBERTa; LSTM; Hate Speech detection
MSC: 68T50
Citation: Dascălu, Ș.; Hristea, F. Towards a Benchmarking System for Comparing Automatic Hate Speech Detection with an Intelligent Baseline Proposal. Mathematics 2022, 10, 945. https://doi.org/10.3390/math10060945
In light of these facts concerning varying definitions, this article based its detections on
the more constraining characterization of Hate Speech, which is the one applied in the U.K.
To further refine the presented term’s discussion, many Datasets that are featured in
this paper address the distinction between Hate Speech and Offensive Language by further
splitting the two into distinct labels. This distinction is explained in the Venn diagram
presented in Figure 1.
Figure 1. Venn diagram of positive labels used in the Datasets studied by this paper.
this paper was to test multi-task learning and how it compares with single-task learning
with modern transformer-based models for Hate Speech classification. Furthermore, a
comparison is made between the accuracies obtained by solely using a transformer embedding
and those obtained by combining it with pretrained word embeddings such as GloVe or ELMo,
which can help boost performance. We also derived our own embeddings by using different transformers in
order to average the contextual representation of each token (as in the case of the word-level
ones derived in [8]) and then compared the results with the other models.
The article is organized into five sections, each introducing a new descriptive layer
of this topic. The first (current) section explains how different countries treat Hate Speech
and the importance of developing a capable detector for it. The section equally outlines
the objectives of this paper. The second section refers to previous work in the domain of
Hate Speech detection and to how our contribution complements this preexisting work.
The third section defines the actual methods used to preprocess the data, as well as the
models that were used in order to improve accuracy on the data classification objectives.
The fourth section discusses the results obtained using the most relevant methods and
how they compare to different results existing throughout the literature. The final section
summarizes our work so far and offers a short description of what can be achieved using
similar methods with respect to other types of NLP applications.
2. Related Work
Recent studies [9,10] pertaining to the analysis of Hate Speech reveal that keywords
are not enough in order to classify its usage. The semantic analysis of the text is required
in order to isolate other usages that do not have the intention or meaning required for the
phrase to be classified as Hate Speech. The possible usages of Hate Speech words outlined
in [10] are:
• Homonymous, in which the word has multiple meanings and in the current sentence
is used with a different meaning that does not constitute Hate Speech;
• Derogatory, this being the form that should be labeled as Hate Speech. This form is
characterized by the expression of hate towards that community;
• Non-derogatory, non-appropriative, when the keywords are used to draw attention to
the word itself or another related purpose;
• Appropriative, when it is used by a community member for an alternative purpose.
Figure 2 outlines examples of the presented ways of using Hate Speech key terms.
The topic of Hate Speech detection is a very active part of the NLP field [8,15–18].
Every modern advance in textual classification has been subsequently followed by an
implementation in the field of Hate Speech detection [15,19–21].
One of the first modern articles was authored by Waseem and Hovy [19] and addressed Hate Speech detection on Twitter. Their paper used character n-grams as features and the labels racism, sexism, and normal. The authors used a logistic regression classifier with 10-fold cross-validation in order to make sure that the performance of the model does not depend on the way the test and training data are selected. The preprocessing solution employed by the authors was removing stop-words, special Twitter-related words (marking retweets), user names, and punctuation. This type of solution is commonly used in approaches where sequential features of the text are not used.
Another article, which constitutes the base of modern Hate Speech detection, is [12].
The authors proposed [12] a three-label Dataset that identifies the need to classify offensive
tweets in addition to the Hate Speech ones of the Waseem and Hovy Dataset [19]. Fur-
thermore, the conclusion of the performed experiments was that Hate Speech can be very
hard to distinguish from Offensive Language, and because there are far more examples of
Offensive Language in the real world and the boundaries of Hate Speech and Offensive
Speech intersect, almost 40% of the Hate Speech labeled tweets were misclassified by the
best model proposed in [12]. The article used classic machine learning methods such as
logistic regression, naive Bayes, SVMs, etc. It employed a model for dimensionality reduc-
tion, and as preprocessing steps, it used TF-IDF on bigram, unigram, and trigram features.
Furthermore, stemming was first performed on the words using the Porter stemmer, and
the word-based features were POS tagged. Sentiment analysis tools were used for feature
engineering, computing the overall sentiment score of the tweet. It is important to specify
that only 26% of the tagged Hate Speech tweets were labeled unanimously by all annotators,
denoting the subjectivity involved in labeling these kinds of tweets.
Modern approaches are highlighted by the use of embeddings and neural networks
capable of processing sequences of tokens. One such article is [17], in which the authors
used ELMo embeddings for the detection of Hate Speech in tweets. Another technique that
was present in this work was using one Dataset to improve performance on the primary
Dataset (transfer learning).
The paper implemented this with a single shared network architecture and separate task-specific
MLP layers. In addition to these, the article compared the usage of LSTMs [22] and
GRUs [23], noting that LSTMs are superior with respect to prediction accuracy. As pre-
processing steps, the authors removed URLs and added one space between punctuation.
This helps word embedding models such as GloVe, which would consider the combina-
tion of words and punctuation as a new word that it would not recognize. This article
also introduced the concept of domain adaptation to Hate Speech detection. By using
another Dataset to train the LSTM embeddings, the embeddings would be used to ex-
tract Hate-Speech-related features from the text, which would increase accuracy with
subsequent Datasets.
One major downside of Hate Speech detection applications is given by the existing Datasets, which are very unbalanced (with respect to the number of Hate Speech labels) and contain a small number of samples. This means that research in this field needs to make use of inventive methods in order to improve accuracy. One such method is the joint modeling [21] approach, which makes use of complex architectures and similar Datasets, such as the sentiment analysis ones, in order to improve the performance of the model on the Hate Speech classification task. In [21], the use of LSTMs over other network architectures was also emphasized.
One of the more interesting models presented in [21] is parameter-based sharing. This
method distinguishes itself by using embeddings that are learned from secondary tasks
and are combined with the primary task embeddings based on a learnable parameter. This
allows the model to adjust how the secondary task contributes to the primary one, based
on the relatedness of the tasks.
As preprocessing, the tweets had their mentions and URLs converted to special tokens;
they were lower-cased, and hashtags were mapped to words. The mapping of hashtags
to words is important because the tweets were small (the most common tweet size was
33 characters), so for understanding a Tweet’s meaning, we need to extract as much
information from it as possible. Further improvement upon previous architectures within
this model is the use of an additive attention layer.
As embeddings, GloVe and ELMo were used jointly, by concatenating the output of
both models before passing it to the LSTM layer in order to make a task-specific embedding.
This article [21] also compared the difference between transfer and multi-task learning,
with both having some performance boost over simple single-task learning.
Finally, there are multiple recent papers that made use of the transformer architecture
in order to achieve higher accuracies [8,15,16,18]. One such transformer-based method [8]
that uses the Ethos [24] Dataset employs a fairly straightforward approach. It uses BERT
to construct its own word embeddings. By taking every corpus word and the context in
which it occurs and constructing an embedding, the authors argued that averaged word embeddings would have an advantage over other word embedding techniques. Then, by
using a bidirectional LSTM layer, the embeddings were transformed into task-specific ones
in order to improve accuracy, after which a dense layer made the prediction. The accuracy
improved overall when this method was used as opposed to simple BERT classification. An-
other article that made use of the transformer architecture was [15], introducing AngryBERT.
The model was composed of a multi-task learning architecture where BERT embeddings
were used for learning on two Datasets at the same time, while LSTM layers were used to
learn specific tasks by themselves. This is similar to the approach proposed in [21], but the
main encoder used a transformer-based network instead of a GloVe + ELMo embedding
LSTM network. The joining of embeddings was performed with a learnable parameter as
in [21]. Another interesting trend is that of [18], where a data-centric approach was used. By
fine-tuning BERT [25] on a secondary task that consisted of a 600,000-sentence-long Dataset,
the model was able to generate a greater performance when trained on the primary Dataset.
Our approach complements the existing ones by proposing a method of encoding tokens that relies on sub-word tokenization in order to handle new words and still benefits from the performance of the transformer, while increasing speed by running the transformer only once, to derive the embeddings. Fine-tuning a RoBERTa transformer is still superior to using transformer-based token embeddings in terms of performance; however, if training speed is a concern, then transformer-based token embeddings are a viable alternative to the traditional multi-task learning methods and to the transformer-based paradigms.
Because there are different granularity levels for the labels, as well as the auxiliary
data used by some models, in what follows, all the data collection processes that took place
in the recreation of some of the more interesting state-of-the-art projects are explained.
There are multiple types of Datasets already available, but the main concern of deciding
which Datasets to use or how to construct a Dataset for the specific features of Hate Speech
detection is choosing what features to leverage in this task and how those features can
then be gathered in production environments. The main component of any Dataset should
be the text itself, as this component allows us to examine the idea of the author of the
sample at that particular time. Although it might be convoluted and new meanings of
words appear all the time, the text feature is the easiest to retrieve from the perspective
of a social platform owner as it is already in its database. This component is required for
every employed Dataset and was constructed or harvested. If this was not possible, then
the Dataset was discarded.
The Datasets chosen have different distributions of text length and different sources,
which means that they were harvested from different social media platforms and use differ-
ent labeling schemes. The most frequently used labeling approach is crowdsourcing via Amazon Mechanical Turk, but other crowdsourcing platforms [12,20,26] are used as well. Finally, other authors
manually annotated their Dataset or sought professional help from members of different
marginalized communities and gender studies students [10,19].
By selecting different types of Datasets, the comprehensive testing of methods and
the discovery of how different approaches to feature extraction and text embedding affect
performance can be facilitated.
Labels of different samples can be binary or multi-class, as a result of evaluating them
with multiple techniques, such as the intended meaning of the slur (sexist, racist, etc.) or
what kind of slur word is being used (classifying each token as being Hate Speech or not).
However, in order to standardize the labels, we split the Datasets into two groups: 3-label Datasets and 2-label Datasets (the latter when they do not contain Offensive Language labels). This was performed by merging or computing labels.
Secondary components that could help us determine if a person is posting Hate Speech
content are: comments given by other people to that person’s post, user group names,
user account creation timestamp (accounts participating in Hate Speech tend to be deleted
so new accounts are created). Another secondary component that is recorded in some
Datasets is user responses to the comment (not used for detection, but for generating
responses to Hate Speech content). Some of these components were used, and the obtained
accuracy was compared to the accuracy without them, in order to determine if there was an
accuracy improvement over using text snippets only. Auxiliary Datasets, such as sentiment
classification ones, can also be used (in multi-task learning). These Datasets do not have
any constraints as they were only used to train the embedding vectors and, as such, to
help the primary model’s accuracy. During our experiments, in addition to the Datasets
presented in Figure 3, the SemEval 18 [27] Task1 Dataset (sentiment classification Dataset)
and OLID (Hate Speech detection Dataset) were used as auxiliary Datasets in the sections
where multi-task machine learning was performed.
A problem that was found while evaluating possible Datasets to be used for testing
was that there are so many Datasets in the domain of Hate Speech detection, that the
approaches presented alongside them have not been benchmarked on the same data.
3.1.7. Hate Speech Dataset from a White Supremacy Forum (the HATE Dataset)
The HATE Dataset, presented in [29], is interesting because instead of searching a
platform by keywords, which have a high probability of being associated with Hate Speech,
the authors used web-scraping on a white supremacist forum (Stormfront) in order to
improve random sampling chances. In addition to this, the Dataset has a skip class under which content that cannot be meaningfully labeled is classified. Furthermore, for some samples, the surrounding context was used for annotation. We removed the skipped samples and those that needed additional context before training the models.
• TF-IDF, which is the most basic type of encoding, being commonly used in com-
bination with classic machine learning techniques or neural networks that cannot
accept sequence-based features. For the purpose of this paper, the scikit-learn [30]
implementation was used. This method of embedding words does not hold any of
the information offered by the context in which the word is used or the way in which
a word is written. Instead, the TF-IDF approach is used after the user has identified
the words that he/she wants to keep track of in his/her data corpus (usually the most
frequent ones after filtering stop words). Then, a weight is computed for each chosen word: the count of the word in a particular document (the term frequency) is scaled down according to how common the word is across the corpus (the inverse document frequency). In its classic form, tf-idf(t, d) = tf(t, d) · log(N/df(t)), where tf(t, d) is the number of times term t occurs in document d, N is the number of documents, and df(t) is the number of documents containing t (a brief scikit-learn sketch is given after this list).
• RoBERTa [34] embeddings are similar to BERT's model outputs: the embeddings produced by the model are sequences of vectors, one for each sub-word. Besides the architectural difference from ELMo and GloVe, BERT and RoBERTa use different types of tokenizers. BERT uses WordPiece tokenization, which can map whole words or sub-words to the indices sent to the model. This method starts with a base vocabulary and keeps adding the sub-word merge with the best likelihood until the vocabulary cap is reached, which makes it a middle ground between the character-based ELMo embeddings and the word embeddings that GloVe uses. As opposed to the other types of tokenizers, RoBERTa uses byte pair encoding: all the characters are in the vocabulary at the start, and they are then merged into pairs based on how frequently they appear together in the text, whereas WordPiece chooses the new pair by the likelihood increase on the training data (see the tokenizer sketch after this list).
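As an illustration, the following minimal sketch shows how TF-IDF features of the kind described above might be extracted with scikit-learn; the vocabulary cap, n-gram range, and stop-word filtering are assumptions, not the exact settings used in the experiments.

```python
# Minimal sketch of TF-IDF feature extraction with scikit-learn (assumed settings).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "example tweet about a benign topic",
    "another short text sample from a forum",
]

# Keep the most frequent terms (after removing English stop words) and weight
# each term frequency by its inverse document frequency.
vectorizer = TfidfVectorizer(
    max_features=10000,   # assumed vocabulary cap
    ngram_range=(1, 3),   # unigram, bigram, and trigram features as in [12]
    stop_words="english",
)
X = vectorizer.fit_transform(corpus)  # sparse matrix: samples x features
print(X.shape)
```

The difference between the WordPiece and byte-pair-encoding tokenizers can likewise be inspected directly with the HuggingFace Transformers library; the checkpoints and the example sentence are only illustrative.

```python
# Sketch contrasting BERT's WordPiece tokenizer with RoBERTa's byte-level BPE tokenizer.
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")

text = "Hateful comments are unacceptable"

# WordPiece marks sub-word continuations with '##'.
print(bert_tok.tokenize(text))
# Byte-level BPE marks tokens that start a new word with a leading 'Ġ'.
print(roberta_tok.tokenize(text))
```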
3.3. Evaluation
Hate Speech predictors focus on the macro F1 metric. This metric is especially im-
portant because the benign samples far outweigh the offensive ones. F1 is calculated as
the harmonic mean of the recall and precision metrics. Furthermore, all metrics shown in this paper are macro-averaged, which means that every class contributes equally to the metric. The
sampling procedure most used to evaluate classification models on a segmented Dataset
is K-fold cross-validation. In the case of our tests, we chose 10-fold because it is the most
popular for Hate Speech classification [8,19,21].
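A minimal sketch of this evaluation protocol is given below, assuming a generic scikit-learn classifier and a precomputed feature matrix; the stratified splitting and the logistic regression placeholder are assumptions, while the actual experiments used the models described in the following sections.

```python
# Sketch of macro F1 evaluated with 10-fold cross-validation (scikit-learn).
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def cross_validated_macro_f1(X, y, n_splits=10, seed=0):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf = LogisticRegression(max_iter=1000)  # placeholder classifier
        clf.fit(X[train_idx], y[train_idx])
        preds = clf.predict(X[test_idx])
        # 'macro' averages F1 over classes, so the rare Hate Speech class
        # counts as much as the majority class.
        scores.append(f1_score(y[test_idx], preds, average="macro"))
    return float(np.mean(scores))
```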
3.4. Preprocessing
In order to make use of all the Dataset features, we needed to preprocess them according to the requirements of each tokenizing module, so as to form the word vectors and embeddings, as well as to reproduce the preprocessing steps that some of the Datasets in use already employ.
The first step is adhering to GloVe's special token creation rules. Following the token definitions in the GloVe paper, a regex-based script was created to generate the tokens needed by GloVe. The next preprocessing step was mandated by the HateXplain Dataset, which is already distributed in a processed state. By searching for the pattern <Word>, its tokenizing program can be reverse engineered, at the end of which the embeddings for these special tokens (over 20 in the Dataset) are reconstructed. Some of the words are censored, so, in order to improve the accuracy of BERT, a translation dictionary can be created to map the censored forms back to the original words.
The final step was constructing two preprocessing pipelines: one for the HateXplain Dataset, in order to bring it to a standard format, and another preprocessing function for the other texts. Further modifications were needed for the other Datasets depending on how they were stored.
Because the study dealt with two different types of texts, the preprocessing should be different for each type. For the tweet/short-text Datasets, it is recommended that as much information as possible be kept, as these are normally very short. As such, hashtags should be converted into text, and emojis and text emoticons as well. For the longer texts, however, cleaner information had priority over harvesting all the available information, one such difference being the deletion of emoticons instead of replacing them, as in the tweet Datasets.
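As an illustration, a simplified preprocessing function for the tweet/short-text Datasets might look as follows; the regex rules are assumptions meant to convey the idea, not the exact script used in the experiments.

```python
# Illustrative short-text preprocessing: URLs and mentions become special tokens,
# hashtags are split into words, punctuation is spaced out, and text is lower-cased.
import re

def preprocess_short_text(text: str) -> str:
    text = re.sub(r"https?://\S+", " <url> ", text)  # replace URLs
    text = re.sub(r"@\w+", " <user> ", text)          # replace user mentions

    def split_hashtag(match):
        # Split CamelCase hashtags into separate words.
        return " <hashtag> " + re.sub(r"(?<=[a-z])(?=[A-Z])", " ", match.group(1))

    text = re.sub(r"#(\w+)", split_hashtag, text)
    text = re.sub(r"([!?.,:;])", r" \1 ", text)        # add spaces around punctuation
    return re.sub(r"\s+", " ", text).strip().lower()

print(preprocess_short_text("Look at this https://t.co/x @someone #HateSpeechDetection!!"))
```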
As such, the proposed pipeline for these kinds of models was, after preprocessing,
composed of the following word embedding extraction methods:
• N-gram char level features;
• N-gram word features;
• POS of the word;
• TF (raw, log, and inverse frequencies)-IDF (no importance weight, inverse document frequency, redundancy);
3.5. Optimizers
An optimizer is an algorithm that helps the neural network converge, by modifying
its parameters, such as the weights and learning rate, or by employing a different learning
rate for every network weight.
SGD [37] with momentum is one of the simpler optimizer algorithms. It uses a single global learning rate, as opposed to the Adam optimizer [38], and follows the gradient with a fixed step size to minimize the loss function. Momentum is used to accelerate the gradient update and to escape local minima.
Adam [38] is a good choice over stochastic gradient descent because it has a learning rate for every parameter and not a global learning rate as classical gradient descent algorithms. The original Adam paper [38] claimed to combine the advantages of AdaGrad and RMSProp: the magnitudes of the parameter updates are invariant to a rescaling of the gradient, the parameter step sizes are bounded by the step-size hyperparameter, etc. The upper hand Adam has as an optimizer comes from its step-size update rule.
RMSProp is different from Adam as it does not have a bias correction term and can lead
to divergence, as proven in [38]. Another difference is the way RMSProp uses momentum
to generate the parameter updates.
AdaGrad [39] is suitable for sparse data because it adapts the learning rate to the frequency with which features occur in the data. It was used in [32] because of the sparsity of some words. AdaGrad is similar to Adam in that it has a learning rate for every parameter. Its main fault is the accumulation of the squared gradients, which are always positive; for large amounts of data, this is detrimental, as the accumulated sum keeps growing and the effective learning rate becomes very small.
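For reference, the optimizers discussed above can be instantiated in PyTorch as follows; the placeholder model and the learning rates are common defaults rather than the values used in the experiments, and only one optimizer would be used at a time.

```python
# The optimizer choices discussed above, configured in PyTorch (illustrative settings).
import torch
import torch.nn as nn

model = nn.Linear(768, 3)  # placeholder model

sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # one global lr + momentum
adam = torch.optim.Adam(model.parameters(), lr=1e-3)              # per-parameter adaptive steps
rmsprop = torch.optim.RMSprop(model.parameters(), lr=1e-3)        # like Adam, but no bias correction
adagrad = torch.optim.Adagrad(model.parameters(), lr=1e-2)        # accumulates squared gradients
```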
3.6. Models
In order to test the methods, we propose the pipeline in Figure 4.
Figure 4. The proposed experimental pipeline: the Datasets are preprocessed, converted into tensors (indexes of embeddings and labels), and passed through a Dataloader to the model, with a loss function and scheduler driving training.
In order to streamline the process, the Datasets can be saved after preprocessing and
then loaded up so as to not repeat this part of the experiment on subsequent tests. In
addition to this, while performing the cross-validation, the model weights can simply be reset before each new fold, as opposed to building and training a new model for the next folds.
In addition to the token length, one must consider the batch size. It is important to
make sure that the GPU can handle the size of the model training/prediction data
flow.
When testing on the tweet-based Datasets, the method chosen was to keep all input
words and pad the shorter input samples to the same length. For tweet samples, this is
easy as there is a limit to how many words a tweet can contain. As for our Datasets, the
limit ranged from 70 to 90 tokens. Some words can be split by the sub-word tokenizer
into two or more tokens, so based on what method of tokenization is chosen, the limit
was either 70 or 90 tokens. However, handling inputs from Gab or Reddit, with sizes
of over 330 words, required a different approach. For such Datasets, the distribution of text lengths was plotted, and the length below which the smallest 95% of the samples fall was chosen as a cutoff point. The longer texts were then truncated to that length, and all the shorter texts were padded to the same length;
• The second approach was packing the variable-length sentences and unpacking them after the LSTM layer has finished. This approach is supported by PyTorch. It works by letting the LSTM stop processing each sequence where its true input ends, so the padding is ignored.
These models performed poorly in comparison with the LSTM and transformer networks.
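A minimal sketch of the second (packing) approach, using PyTorch's pack/pad utilities with illustrative tensor sizes, is shown below.

```python
# Packing variable-length sequences so the BiLSTM ignores padding (PyTorch).
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=300, hidden_size=128, batch_first=True, bidirectional=True)

batch = torch.randn(4, 90, 300)            # 4 padded samples, max length 90
lengths = torch.tensor([90, 65, 33, 12])   # true (unpadded) lengths, sorted descending

packed = pack_padded_sequence(batch, lengths, batch_first=True, enforce_sorted=True)
packed_out, _ = lstm(packed)
# Unpack back to a padded tensor after the LSTM layer has finished.
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(output.shape)  # torch.Size([4, 90, 256])
```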
(Figure: single-task architecture. The tokenized sentence (W1, W2, ..., Wn) is embedded using the GloVe dictionary, transformer word vectors, or an ELMo network, passed through a BiLSTM with dropout and attention, and classified by an MLP.)
(Figure: learnable-parameter multi-task architecture. The tokenized sentence (W1, W2, ..., Wn) is embedded with the GloVe dictionary and an ELMo network, passed through separate BiLSTMs for the primary and the auxiliary task, combined through the Alpha parameter, and fed through dropout, attention, and task-specific MLPs that output the predictions for the primary and the auxiliary task.)
When the BiLSTM results were to be concatenated, SoftMax was applied to the parameter tensor, which normalizes its 2 values into a probability distribution (they are complementary, summing to 1). Each task-specific embedding is then multiplied by its corresponding weight. Another modification was that the common BiLSTM layer no longer existed, so that each task built its own interpretation of the input sequence. As a sanity check, similar to the one in the previous example, a model was built and trained in the same manner, but without connecting the BiLSTM of the auxiliary model to the Alpha parameter. If both models performed identically, this would mean that there was no performance gain from using a flexible-parameter multi-task model.
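A minimal sketch of this learnable parameter is given below, with illustrative tensor shapes; the module boundary is an assumption, and only the Alpha mechanism itself is shown.

```python
# Learnable "Alpha" parameter: SoftMax-normalized weights scale the primary- and
# auxiliary-task BiLSTM outputs before they are concatenated.
import torch
import torch.nn as nn

class AlphaCombiner(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(2))  # two raw values, normalized at run time

    def forward(self, primary, auxiliary):
        w = torch.softmax(self.alpha, dim=0)       # complementary weights summing to 1
        return torch.cat([w[0] * primary, w[1] * auxiliary], dim=-1)

combiner = AlphaCombiner()
primary = torch.randn(4, 90, 256)        # BiLSTM output of the primary task
auxiliary = torch.randn(4, 90, 256)      # BiLSTM output of the auxiliary task
combined = combiner(primary, auxiliary)  # (4, 90, 512)
```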
adjusting the optimizer, learning rate, and batch size to the ones in the above guidelines, as
well as retraining for a small number of epochs [11].
The final method in which BERT can be used is for embedding vector creation. By
freezing the weights as in the previous model, we could use the sequential output of BERT
as a word embedding vector. In some of the experiments, the coupling of transformer-
based models together with GloVe embeddings was tested in multiple multi-task learning
use cases, as well as in single-task use cases, as depicted in Figure 8. BERT was used in
combination with GloVe in two ways: by using it as an embedding vector or as a sequential
model. The training was identical to the above-described process.
While training the transformer-based models, one problem was the high cost of training, as well as the longer training time compared with LSTM-based networks; a good workaround was proposed in [8]. Instead of taking the word-level embeddings as in [8], we used the RoBERTa tokenizer to obtain token-level embeddings for multiple Datasets. Using those embeddings, we then trained a multi-task architecture based on bidirectional LSTM networks with an additive attention mechanism.
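A minimal sketch of such an additive attention layer over the BiLSTM outputs is shown below; the dimensions are illustrative and padding masks are omitted.

```python
# Additive (Bahdanau-style) attention pooling over a sequence of BiLSTM states.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, states):                          # (batch, seq_len, hidden_dim)
        energy = torch.tanh(self.proj(states))          # additive scoring
        weights = torch.softmax(self.score(energy), dim=1)
        return (weights * states).sum(dim=1)            # weighted sum: (batch, hidden_dim)

attention = AdditiveAttention(hidden_dim=256)
pooled = attention(torch.randn(4, 90, 256))             # (4, 256)
```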
The pipeline began with the sentence, which was tokenized using the standard
BERT/RoBERTa tokenizer (depending on the model used). Then, each sample was passed through the model (the standard pretrained model, with no gradients computed, to save GPU memory), and after obtaining the embeddings, we assigned them to each of the input tokens. We summed up all the different representations a token received within the Dataset and, in the end, averaged the vectors for each token by dividing the summed token vector by the number of times the token occurred throughout the Dataset. Finally, we stored the result as a PyTorch embedding layer, because lookups were much faster than with a Python dictionary.
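The following sketch illustrates this embedding-averaging procedure with the publicly available roberta-base checkpoint; batching, GPU placement, and special-token handling are omitted, and the two sample sentences are placeholders.

```python
# Deriving static token-level embeddings by averaging RoBERTa's contextual
# representations of each token over the Dataset (simplified sketch).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base").eval()

samples = ["first preprocessed sample", "another sample of text"]

dim = model.config.hidden_size
sums = torch.zeros(tokenizer.vocab_size, dim)  # running sum of vectors per token id
counts = torch.zeros(tokenizer.vocab_size)     # number of occurrences per token id

with torch.no_grad():                          # no gradients, to save memory
    for text in samples:
        enc = tokenizer(text, return_tensors="pt")
        hidden = model(**enc).last_hidden_state[0]           # (seq_len, dim)
        for token_id, vec in zip(enc["input_ids"][0], hidden):
            sums[token_id] += vec
            counts[token_id] += 1

averaged = sums / counts.clamp(min=1).unsqueeze(1)
# Store the averaged vectors as a frozen PyTorch embedding layer for fast lookup.
embedding = torch.nn.Embedding.from_pretrained(averaged, freeze=True)
```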
(Figure: BERT-based model variants and the token-embedding derivation pipeline. The pooled output of BERT's last layer is fed to an MLP, or the last-layer sequence output is fed to an LSTM and an MLP; alternatively, the tokenized sentence (W1, W2, ..., Wn) is embedded with the GloVe dictionary, transformer word vectors, or an ELMo network and passed through a BiLSTM with dropout, attention, and an MLP. For the token embeddings, preprocessed samples are passed through the RoBERTa tokenizer and model, and the output vectors are summed per token and divided by the token count to obtain averaged token vectors.)
4. Results
In this section, we outline the top-performing approaches that can contribute to an
intelligent baseline proposal for Hate Speech detection. These top-performing approaches
were selected from among the previously described models and were tested on each
selected Dataset. The test results can be seen in Tables 1–5. These tests used only the
textual features of the samples. The results were rounded to the nearest hundredth (so the
approximation error was ±0.005).
The results obtained on the HateXplain Dataset using RoBERTa were marginally better
than those obtained in [11] without the addition of metadata collected during training and
on par with those trained with additional annotation features. The results on the white supremacist forum Dataset could not be compared with the ones from [29], as the authors only used a subset of about 2000 samples from their own Dataset in order to balance the classes.
The results obtained on the Qian Dataset were better (77.5 F1 → 88 F1 macro) than those
obtained in the original article [20].
For the Gao and Huang Dataset [28], the F1 score improved from 0.6, obtained with an ensemble model (with additional textual features), to 0.71, obtained
using the RoBERTa transformer model on the text samples alone. On the Davidson Tweet
Dataset, our approach had a marginally lower weighted F1 score of 90 (when trained for
macro F1) compared with [15], which had a score of 91. However, if the model was trained
for weighted performance (without early stopping), the same F1 score was obtained. As
for the Waseem Dataset, our results were better (81 F1 compared with 80 F1) than those
obtained in [21], which used the same variant of the Dataset from [19]. The numerical
results of all the performed experiments (see Tables 1–5 and A1–A13) point to the fact that
using transformer models yielded better accuracies overall than using token embeddings.
surrounding this topic, we note the absence of a benchmarking system that could evaluate
how different approaches compare against one another. In this context, the present paper
intended to take a step forward towards the standardization of testing for Hate Speech
detection. More precisely, our study intended to determine if sentiment analysis Datasets, which by far outweigh Hate Speech ones in terms of data availability, can help train a Hate Speech detection model when multi-task learning is used. Our aim was
ultimately that of establishing a Dataset list with different features, available for testing and
comparisons, since a standardization of the test Datasets used in Hate Speech detection is
still lacking, making it difficult to compare the existing models. The secondary objective of
the paper was to propose an intelligent baseline as part of this standardization, by testing
multi-task learning, and to determine how it compares with single-task learning with
modern transformer-based models in the case of the Hate Speech detection problem.
The results of all the performed experiments point to the fact that using transformer models yielded better accuracies overall than using token embeddings. When speed is a concern, using token embeddings is the preferred approach, as training them used fewer resources and yielded a comparable accuracy. Using auxiliary Datasets helped the models achieve higher accuracies only if the Datasets had a similar classification objective (sentiment analysis and Offensive Language detection were complementary to Hate Speech) and similar textual features (using tweet Datasets as auxiliaries for longer-text Datasets can be detrimental to convergence).
Although the detection of Hate Speech has evolved over the last few years, with text
classification evolving as well, there still are multiple improvement opportunities. A great
advancement would be to train a transformer on a very large Hate Speech Dataset and then
test the improvement on specific tasks. In addition to these, various other Dataset-related
topics should be addressed, such as the connection between users or different hashtags and
movement names and how these factor into detecting Hate Speech. That issue could be
addressed by a combination between traditional approaches and graph neural networks.
Furthermore, it would be interesting to have access to a Dataset with all the indicators a social media platform has to offer.
One downside of current Datasets is the fact that, during our experiments, as well as
during other experiments in similar works, non-derogatory, non-appropriative, as well as
appropriative uses of Hate Speech words were very hard to classify, as they represent a
very small portion of the annotated tweets. These kinds of usages should not be censored, since censoring them would itself impact the targeted groups. Another example of Hate-Speech-related word
usage, that is hard to classify, would be quoting a Hate Speech fragment of text with the
intent to criticize it. Improvements in neural networks would have to be made in order to
treat these kinds of usages.
As future work, we plan on expanding our testing on Datasets that subcategorize
the detection of Hate Speech into different, fine-grained labels. One such Dataset that
was used in this paper is the Waseem and Hovy Dataset [19]; another one should be
SemEval-2019 Task 5 HatEval [40], which categorizes Hate Speech based on the target
demographic. It would be interesting to observe how well features learned during a more
general classification could be used to increase the classification accuracy of a finer-grained
task. Moreover, another direction that this research could take is to expand the types of
NLP applications for which the methods are tested. One such example would be handling
Hate-Speech-related problems such as misogyny identification [41]. We hope the present
study will stimulate and facilitate future research concerning this up-to-date topic, in all of
these directions and in others, as well.
Author Contributions: Ș.D. and F.H. equally contributed to all aspects of the work. All authors have
read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Informed Consent Statement: Not applicable.
Data Availability Statement: The datasets used in our experiments are available online at the following links: 1. https://github.com/zeeraktalat/hatespeech—Waseem Dataset; 2. https://github.com/t-davidson/hate-speech-and-offensive-language—Davidson Dataset; 3. https://github.com/sjtuprog/fox-news-comments—Gao&Huang Dataset; 4. https://github.com/hate-alert/HateXplain—Xplain Dataset; 5. https://github.com/networkdynamics/slur-corpus—Kurrek Dataset; 6. https://github.com/jing-qian/A-Benchmark-Dataset-for-Learning-to-Intervene-in-Online-Hate-Speech—Qian Dataset; 7. https://github.com/Vicomtech/hate-speech-dataset—Hate Dataset, all accessed on 28 January 2022.
Acknowledgments: We thank three anonymous reviewers for their comments and suggestions,
which helped improve our paper.
Conflicts of Interest: The authors declare no conflict of interest.
Appendix A
Detailed results obtained from evaluating different models on the benchmark Datasets
are given here. All results feature accuracy, macro and weighted recall, precision, and
F1-scores.
Table A1. Scores obtained on the benchmark Datasets using the K nearest neighbors algorithm.
Table A2. Scores obtained on the benchmark Datasets using the SVM algorithm.
Table A3. Scores obtained on the benchmark Datasets using the decision trees algorithm.
Table A4. Scores obtained on the benchmark Datasets using the random forest algorithm.
Table A5. Scores obtained on the benchmark Datasets using the GloVe embeddings with 2 LSTM
layers and max pooling.
Table A6. Scores obtained on the benchmark Datasets using the convolutional neural network model.
The embeddings were made with the help of GloVe vectors.
Table A7. Scores obtained on the benchmark Datasets using the bidirectional recurrent neural
network layers. The embeddings were made with the help of GloVe vectors.
Table A8. Scores obtained on the benchmark Datasets using one-way interconnected recurrent layers.
The embeddings were made with the help of GloVe vectors.
Table A9. Scores obtained on the benchmark Datasets using the CNN-GRU method. The embeddings
were made with the help of GloVe vectors.
Table A10. Scores obtained on the benchmark Datasets using the learnable parameter encoding
model. Embedding vectors were made with both ELMo and GloVe and then concatenated. The
auxiliary Dataset is the SemEval 18 [27] Task1 Dataset.
Table A11. Scores obtained on the benchmark Datasets using the hard learning multi-task model.
Embedding vectors were made with both ELMo and GloVe and then concatenated. The auxiliary
Dataset is the SemEval 18 [27] Task1 Dataset.
Table A12. Scores obtained on the benchmark Datasets using the multi-task learning method in
which only the last layers were separated. Embedding vectors were made with both GloVe and ELMo
and then concatenated. The auxiliary Dataset is the SemEval 18 [27] Task1 Dataset.
Table A13. Scores obtained on the benchmark Datasets using 2 LSTM layers and embedding the
words with both ELMo and GloVe embeddings.
References
1. Framework Decision on Combating Certain Forms and Expressions of Racism and Xenophobia by Means of Criminal Law. 2008.
Available online: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=LEGISSUM%3Al33178 (accessed on 25 January 2022).
2. United States Department of Justice—Learn about Hate Crimes. Available online: https://www.justice.gov/hatecrimes/learn-
about-hate-crimes (accessed on 25 January 2022).
3. Council Framework Decision 2008/913/JHA of 28 November 2008 on Combating Certain Forms and Expressions of Racism and
Xenophobia by Means of Criminal Law. Available online: https://ec.europa.eu/commission/presscorner/detail/en/IP_21_6561
(accessed on 25 January 2022).
4. Barron, J.A. Internet Access, Hate Speech and the First Amendment. First Amend. L. Rev. 2019, 18, 1. [CrossRef]
5. Facebook Reports Third Quarter 2021 Results. 2021. Available online: https://investor.fb.com/investor-news/press-release-
details/2021/Facebook-Reports-Third-Quarter-2021-Results/default.aspx (accessed on 25 January 2022).
6. Twitter Reports Third Quarter 2021 Results. 2021. Available online: https://s22.q4cdn.com/826641620/files/doc_financials/2021/q3/Final-Q3’21-earnings-release.pdf (accessed on 25 January 2022).
7. Xia, M.; Field, A.; Tsvetkov, Y. Demoting Racial Bias in Hate Speech Detection. In Proceedings of the Eighth International
Workshop on Natural Language Processing for Social Media, Online, 10 July 2020; Association for Computational Linguistics:
Stroudsburg, PA, USA, 2020; pp. 7–14. [CrossRef]
8. Rajput, G.; Punn, N.S.; Sonbhadra, S.K.; Agarwal, S. Hate Speech Detection Using Static BERT Embeddings. In Proceedings of the
Big Data Analytics: 9th International Conference, BDA 2021, Virtual Event, 15–18 December 2021; Springer: Berlin/Heidelberg,
Germany, 2021; pp. 67–77. [CrossRef]
9. Brown, A. What is Hate Speech? Part 1: The myth of hate. Law Philos. 2017, 36, 419–468. [CrossRef]
10. Kurrek, J.; Saleem, H.M.; Ruths, D. Towards a comprehensive taxonomy and large-scale annotated corpus for online slur usage.
In Proceedings of the Fourth Workshop on Online Abuse and Harms, Online, 20 November 2020; Association for Computational
Linguistics: Stroudsburg, PA, USA, 2020; pp. 138–149.
11. Mathew, B.; Saha, P.; Yimam, S.M.; Biemann, C.; Goyal, P.; Mukherjee, A. HateXplain: A Benchmark Dataset for Explainable Hate
Speech Detection. arXiv 2021, arXiv:2012.10289.
12. Davidson, T.; Warmsley, D.; Macy, M.; Weber, I. Automated Hate Speech detection and the problem of offensive language. In
Proceedings of the 11th International AAAI Conference on Web and Social Media, Montreal, QC, Canada, 15–18 May 2017;
Volume 11.
13. Nobata, C.; Tetreault, J.; Thomas, A.; Mehdad, Y.; Chang, Y. Abusive language detection in online user content. In Proceedings of
the 25th International Conference on World Wide Web, Montreal, QC, Canada, 11–15 April 2016; pp. 145–153.
14. Plaza-Del-Arco, F.M.; Molina-González, M.D.; Ureña-López, L.A.; Martín-Valdivia, M.T. A Multi-Task Learning Approach to
Hate Speech Detection Leveraging Sentiment Analysis. IEEE Access 2021, 9, 112478–112489. [CrossRef]
15. Awal, M.; Cao, R.; Lee, R.K.W.; Mitrović, S. AngryBERT: Joint Learning Target and Emotion for Hate Speech Detection. In
Advances in Knowledge Discovery and Data Mining, Proceedings of the 25th Pacific-Asia Conference, PAKDD 2021, Virtual Event, 11–14
May 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 701–713. [CrossRef]
16. Sarwar, S.M.; Murdock, V. Unsupervised Domain Adaptation for Hate Speech Detection Using a Data Augmentation Approach.
arXiv 2021, arXiv:2107.12866.
17. Rizoiu, M.A.; Wang, T.; Ferraro, G.; Suominen, H. Transfer Learning for Hate Speech Detection in Social Media. arXiv 2019,
arXiv:1906.03829.
18. Bokstaller, J.; Patoulidis, G.; Zagidullina, A. Model Bias in NLP–Application to Hate Speech Classification using transfer learning
techniques. arXiv 2021, arXiv:2109.09725.
19. Waseem, Z.; Hovy, D. Hateful symbols or hateful people? predictive features for hate speech detection on twitter. In Proceedings
of the NAACL Student Research Workshop, San Diego, CA, USA, 12–17 June 2016; pp. 88–93.
20. Qian, J.; Bethke, A.; Liu, Y.; Belding-Royer, E.M.; Wang, W.Y. A Benchmark Dataset for Learning to Intervene in Online Hate
Speech. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for
Computational Linguistics: Stroudsburg, PA, USA, 2019.
21. Rajamanickam, S.; Mishra, P.; Yannakoudakis, H.; Shutova, E. Joint Modelling of Emotion and Abusive Language Detection. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4270–4279.
[CrossRef]
22. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
23. Cho, K.; van Merriënboer, B.; Bahdanau, D.; Bengio, Y. On the Properties of Neural Machine Translation: Encoder–Decoder
Approaches. In Proceedings of the SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha,
Qatar, 25 October 2014; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 103–111. [CrossRef]
24. Mollas, I.; Chrysopoulou, Z.; Karlos, S.; Tsoumakas, G. ETHOS: A multi-label Hate Speech detection Dataset. Complex Intell. Syst.
2022. [CrossRef]
25. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understand-
ing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Association for
Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [CrossRef]
26. Ousidhoum, N.; Lin, Z.; Zhang, H.; Song, Y.; Yeung, D.Y. Multilingual and Multi-Aspect Hate Speech Analysis. In Proceedings of
the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics:
Stroudsburg, PA, USA, 2019; pp. 4675–4684. [CrossRef]
27. Mohammad, S.M.; Bravo-Marquez, F.; Salameh, M.; Kiritchenko, S. SemEval-2018 Task 1: Affect in Tweets. In Proceedings of the
International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, USA, 5–6 June 2018.
28. Gao, L.; Huang, R. Detecting Online Hate Speech Using Context Aware Models. In Proceedings of the International Conference
Recent Advances in Natural Language Processing, RANLP 2017, Varna, Bulgaria, 2–8 September 2017; INCOMA Ltd.: Shoumen,
Bulgaria, 2017; pp. 260–266. [CrossRef]
29. De Gibert Bonet, O.; Perez Miguel, N.; García-Pablos, A.; Cuadros, M. Hate Speech Dataset from a White Supremacy Forum.
In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), Brussels, Belgium, 31 October 2018; pp. 11–20.
[CrossRef]
30. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.;
et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
31. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013,
arXiv:1301.3781.
32. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference
on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
33. Gardner, M.; Grus, J.; Neumann, M.; Tafjord, O.; Dasigi, P.; Liu, N.F.; Peters, M.; Schmitz, M.; Zettlemoyer, L. AllenNLP: A Deep
Semantic Natural Language Processing Platform. In Proceedings of the Workshop for NLP Open Source Software (NLP-OSS),
Melbourne, Australia, 20 July 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 1–6. [CrossRef]
34. Zhuang, L.; Wayne, L.; Ya, S.; Jun, Z. A Robustly Optimized BERT Pre-training Approach with Post-training. In Proceedings of
the 20th Chinese National Conference on Computational Linguistics, Huhhot, China, 13–15 August 2021; Chinese Information
Processing Society of China: Beijing, China, 2021; pp. 1218–1227.
35. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations.
In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; Association for Computational
Linguistics: Stroudsburg, PA, USA, 2018; pp. 2227–2237. [CrossRef]
36. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch:
An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037.
37. Robbins, H.E. A Stochastic Approximation Method. Ann. Math. Stat. 2007, 22, 400–407. [CrossRef]
38. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning
Representations, Banff, AB, Canada, 14–16 April 2014.
39. Lydia, A.; Francis, S. Adagrad—An Optimizer for Stochastic Gradient Descent. Int. J. Inf. Comput. 2019, 6, 566–568.
40. Basile, V.; Bosco, C.; Fersini, E.; Nozza, D.; Patti, V.; Rangel Pardo, F.M.; Rosso, P.; Sanguinetti, M. SemEval-2019 Task 5: Multilin-
gual Detection of Hate Speech Against Immigrants and Women in Twitter. In Proceedings of the 13th International Workshop on
Semantic Evaluation, Minneapolis, MN, USA, 6–7 June 2019; Association for Computational Linguistics: Stroudsburg, PA, USA,
2019; pp. 54–63. [CrossRef]
41. Fersini, E.; Rosso, P.; Anzovino, M.E. Overview of the Evalita 2018 Task on Automatic Misogyny Identification (AMI). In
Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop
(EVALITA 2018), Co-Located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy, 12–13
December 2018; Volume 2263.
mathematics
Article
Intermediate-Task Transfer Learning with BERT for
Sarcasm Detection
Edoardo Savini and Cornelia Caragea *
Abstract: Sarcasm detection plays an important role in natural language processing as it can impact
the performance of many applications, including sentiment analysis, opinion mining, and stance
detection. Despite substantial progress on sarcasm detection, the research results are scattered
across datasets and studies. In this paper, we survey the current state-of-the-art and present strong
baselines for sarcasm detection based on BERT pre-trained language models. We further improve
our BERT models by fine-tuning them on related intermediate tasks before fine-tuning them on
our target task. Specifically, relying on the correlation between sarcasm and (implied negative)
sentiment and emotions, we explore a transfer learning framework that uses sentiment classification
and emotion detection as individual intermediate tasks to infuse knowledge into the target task of
sarcasm detection. Experimental results on three datasets that have different characteristics show
that the BERT-based models outperform many previous models.
MSC: 68T50
Still, despite substantial progress on sarcasm detection, the research results are scattered
across datasets and studies.
In this paper, we aim to further our understanding of what works best across several
textual datasets for our target task: sarcasm detection. To this end, we present strong
baselines based on BERT pre-trained language models [16]. We further propose to improve
our BERT models by fine-tuning them on related intermediate tasks before fine-tuning them
on our target task so that inductive bias is incorporated from related tasks [17]. We study
the performance of our BERT models on three datasets of different sizes and characteristics,
collected from the Internet Argument Corpus (IAC) [11], Reddit [18], and Twitter [7]. Table 1
shows examples of sarcastic comments from each of the three datasets. As we can see from
the table, the dataset constructed by Oraby et al. [11] contains long comments, while the
other two datasets have comments with fairly short lengths. Our purpose is to analyze the
effectiveness of BERT and intermediate-task transfer learning with BERT on the sarcasm
detection task and find a neural framework able to accurately predict sarcasm in many
types of social platforms, from discussion forums to microblogs.
Oraby et al. [11]: “And, let’s see, when did the job loss actually start?, Oh yes.. We can trace the troubles starting in 2007, with a big melt down in August/September of 2008. Let’s see.. Obama must have been a terrible president to have caused that.. oh WAIT. That wasn’t Obama, that was BUSH.. Excuse Me.”
Khodak et al. [18]: “Obama is in league with ISIS, he wins the shittiest terrorist fighter award.”
Mishra et al. [7]: “I can’t even wait to go sit at this meeting at the highschool.”
2. Related Work
Experiments on automatic sarcasm detection represent a recent field of study. The first
investigations made on text were focused on discovering lexical indicators and syntactic
cues that could be used as features for sarcasm detection [6,11]. In fact, at the beginning,
sarcasm recognition was considered as a simple text classification task. Many studies
focused on recognizing interjections, punctuation symbols, intensifiers, hyperboles [19],
emoticons [20], exclamations [21], and hashtags [22] in sarcastic comments. More recently,
Wallace et al. [4] showed that many classifiers fail when dealing with sentences where
context is needed. Therefore, newer works studied also parental comments or historical
tweets of the writer [3,23,24].
3. Baseline Modeling
3.1. BERT Pre-Trained Language Model
The BERT pre-trained language model [16] has pushed performance boundaries on
many natural language understanding tasks. We fine-tune BERT bert-base-uncased from
the HuggingFace Transformers library [36] on our target task, i.e., sarcasm detection, with
an added single linear layer on top as a sentence classifier that uses the final hidden state
corresponding to the [CLS] token.
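A minimal sketch of this baseline, built with the HuggingFace Transformers library, is shown below; the class name and the example sentence are illustrative, and training details are omitted.

```python
# bert-base-uncased with a single linear layer over the final hidden state of [CLS].
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class BertSarcasmClassifier(nn.Module):
    def __init__(self, num_labels: int = 2):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_state = outputs.last_hidden_state[:, 0]  # final hidden state of the [CLS] token
        return self.classifier(cls_state)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertSarcasmClassifier()
enc = tokenizer(["Oh great, another meeting."], return_tensors="pt", padding=True)
logits = model(enc["input_ids"], enc["attention_mask"])
```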
can further improve the performance of our BERT models on the sarcasm detection target
task. Figure 1 shows the steps taken in this transfer learning framework.
Next, we discuss our target task and the intermediate tasks used for transfer learning.
EmoNet: “It’s just so great to have baseball back. #happy” (label: joy (1))
IMDB: “I rented this movie primarily because it had Meg Ryan in it, and I was disappointed to see that her role is really a mere supporting one. Not only is she not on screen much, but nothing her character does is essential to the plot. Her character could be written out of the story without changing it much.” (label: 0)
4. Data
To evaluate our models, we focus our attention on datasets with different characteris-
tics, retrieved from different social media sites and having different sizes. Our first dataset
is the Sarcasm V2 Corpus (https://nlds.soe.ucsc.edu/sarcasm2, accessed on 23 March
2021), created and made available by Oraby et al. [11]. Then, given the small size of this
first dataset, we test our models also on a large-scale self-annotated corpus for sarcasm,
SARC (http://nlp.cs.princeton.edu/SARC/, accessed on 23 March 2021), made available
by Khodak et al. [18]. Last, in order to verify the efficacy of our transfer learning model on a
dataset having a similar structure to the one used by our intermediate task, we selected also
a dataset from Twitter (http://www.cfilt.iitb.ac.in/cognitive-nlp/, accessed on 29 March
2021), created by Mishra et al. [7]. The datasets are discussed below.
Sarcasm V2 Corpus. Sarcasm V2 is a dataset released by Oraby et al. [11]. It is a highly
diverse corpus of sarcasm developed using syntactical cues and crowd-sourced annotation.
It contains 4692 lines having both Quote and Response sentences from dialogue examples
on political debates from the Internet Argument Corpus (IAC 2.0). The data is collected and
divided into three categories: General Sarcasm (Gen, 3260 sarcastic comments and 3260
non-sarcastic comments), Rhetorical Questions (RQ, 851 rhetorical questions and 851 non-
rhetorical questions) and Hyperbole (Hyp, 582 hyperboles and 582 non-hyperboles). We
use the Gen Corpus for our experiments and select only the text of the Response sentence
for our sarcasm detection task.
SARC. The Self-Annotated Reddit Corpus (SARC) was introduced by Khodak et al. [18].
It contains more than a million sarcastic and non-sarcastic statements retrieved from Reddit
with some contextual information, such as author details, score, and parent comment. Reddit
is a social media site in which users can communicate on topic-specific discussion forums
called subreddits, each titled by a post called submission. People can vote and reply to the
submissions or to their comments, creating a tree-like structure. This guarantees that every
comment has its “parent”. The main feature of the dataset is the fact that sarcastic sentences
are directly annotated by the authors themselves, through the inclusion of the marker “/s” in
their comments. This method provides reliable and trustworthy data. Another important aspect is
that almost every comment is made of one sentence.
As the SARC dataset has many variants (Main Balanced, Main Unbalanced, and Pol),
in order to make our analyses more consistent with the Sarcasm V2 Corpus, we run our
experiments only on the first version of the Main Balanced dataset, composed of an equal
distribution of both sarcastic (505,413) and non-sarcastic (505,413) statements (total train
size: 1,010,826). The authors also provide a balanced test set of 251,608 comments, which
we use for model evaluation.
SARCTwitter. To test our models on comments with a structure more similar to the
EmoNet ones, we select the benchmark dataset used by Majumder et al. [28] and created
by Mishra et al. [7]. The dataset consists of 994 tweets from Twitter, manually annotated
by seven readers with both sarcastic and sentiment information, i.e., each tweet has two
labels, one for sentiment and one for sarcasm. Out of 994 tweets, 383 are labeled as positive
(sentiment) and the remaining 611 are labeled as negative (sentiment). Additionally, out
of these 994 tweets, 350 are labeled as sarcastic and the remaining 644 are labeled as non-
sarcastic. The dataset also contains eye-movement data of the readers, which we ignored in our experiment, as our focus is to detect sarcasm solely from the text content. We refer to this
dataset as SARCTwitter.
5. Experiments
5.1. Implementation Details
To obtain a reliable and well-performing model, we studied a supervised learning
approach on the three sarcasm datasets. We implement our models using the AllenNLP
library [42] and the HuggingFace Transformers library [36]. To perform our experiments, we use the AWS platform with EC2 instances (Ubuntu Deep Learning AMI), each with one GPU, in a PyTorch environment.
Each input sentence is passed through our pre-trained Base Uncased BERT. We then
utilize the semantic content in the first special token [CLS] and feed it into a linear layer.
We then apply softmax [43] to compute the class probability and output the label with the
highest probability.
We iterate over each dataset with a mini-batch size of 16. We use the AdaGrad optimizer [44] with the gradient clipping threshold set to 5.0. We tune hyper-parameters on the validation set of each dataset. For every epoch, we compute the F1-score and accuracy. The training is stopped (for both the target task and the intermediate tasks) once the average F1 on the validation set ceases to grow for several consecutive epochs (the patience is set to 5).
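A schematic sketch of this setup is given below; the tiny linear model, random data, and learning rate are placeholders, and only the batch size, gradient clipping, and early-stopping logic reflect the settings described above.

```python
# Mini-batches of 16, AdaGrad with gradient clipping at 5.0, early stopping with patience 5.
import torch
import torch.nn as nn
from sklearn.metrics import f1_score

model = nn.Linear(768, 2)  # placeholder for the BERT-based classifier
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)  # lr is an assumed value
loss_fn = nn.CrossEntropyLoss()

X_train, y_train = torch.randn(64, 768), torch.randint(0, 2, (64,))
X_val, y_val = torch.randn(32, 768), torch.randint(0, 2, (32,))

best_f1, patience, bad_epochs = 0.0, 5, 0
for epoch in range(100):
    model.train()
    for i in range(0, len(X_train), 16):                  # mini-batch size 16
        optimizer.zero_grad()
        loss = loss_fn(model(X_train[i:i + 16]), y_train[i:i + 16])
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()

    model.eval()
    with torch.no_grad():
        preds = model(X_val).argmax(dim=1)
    val_f1 = f1_score(y_val.numpy(), preds.numpy(), average="macro")
    if val_f1 > best_f1:
        best_f1, bad_epochs = val_f1, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                        # early stopping
            break
```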
Table 6 shows the performance of BERT with intermediate task fine-tuning on the
validation set for each task. The corresponding BERT models were transferred and further
fine-tuned on the target task.
6. Results
In this section, we discuss prior works for each dataset and present comparison results.
Table 10. Results on the Sarcasm V2 dataset. Bold font shows best performance overall.
Model F1-Score
BERT (no intermediate pre-training) 80.59
BERT + TransferEmoNet 78.56
BERT + TransferEmoNetSent 80.58
BERT + TransferIMDB 80.85
BiLSTM (ELMo) 76.03
CNN (ELMo) 76.46
SVM with N-Grams (Oraby et al. [11]) 72.00
SVM with W2V (Oraby et al. [11]) (SOTA) 74.00
In this experiment, the model pre-trained on the IMDB dataset achieves state-of-the-
art performance, outperforming the vanilla BERT model by 0.3%. The increase may be
explained by the fact that the features of the Sarcasm V2 comments are more similar to the
ones of movie reviews rather than tweets. That is, they are much longer in length than the
tweets’ lengths and this difference in lengths brings additional challenges to the models.
The expressions of sentiment/emotions in EmoNet, i.e., lexical cues, are more obvious in
EmoNet compared with IMDB. Thus, the model struggles more on the IMDB dataset, and
hence, is able to learn better and more robust parameters since these examples are more
challenging to the model, and therefore, more beneficial for learning. This also explains the
lack of improvement for the TransferEmoNetSent model. However, the outcomes of this
experiment underline that there is correlation between sarcasm and sentiment. BERT is
able to outperform previous approaches on this dataset and acts as a strong baseline. Using
BERT with intermediate task transfer learning can push the performance further.
Table 11. Results on the SARC dataset. Bold font shows best performance overall.
Models F1-Score
No personality features
BERT (no intermediate pre-training) 77.49
BERT + TransferEmoNet 77.22
BERT + TransferEmoNetSent 77.53
BERT + TransferIMDB 77.48
BiLSTM (ELMo) 76.27
Bag-of-words (Hazarika et al. [27]) 64.00
CNN (Hazarika et al. [27]) 66.00
CASCADE (Hazarika et al. [27]) (no personality features) 66.00
With personality features
CNN-SVM (Poria et al. [25]) 68.00
CUE-CNN (Amir et al. [13]) 69.00
CASCADE (Hazarika et al. [27]) (with personality features) (SOTA) 77.00
This behavior can be explained by the fact that comments from discussion forums, such
as Reddit, are quite different in terms of content, expressiveness, and topic from the other
social platforms of our intermediate tasks. For example, SARC comments can vary in length from three or four words to hundreds, while the IMDB movie reviews are generally much longer,
composed of multiple sentences, whereas EmoNet tweets usually consist of just one or two
sentences. In addition, on EmoNet the sentiment pattern is more pronounced as people are
more prone to describe their emotional state on Twitter. In SARC, probably also because
of the topics covered (e.g., politics, videogames), the emotion pattern is more implicit and
harder to detect. In the movie reviews, on the other hand, the sentiment is quite explicit but
the length of the sentences may cause a loss of information for the classifier and the sarcastic
content is almost nonexistent. However, the sentiment information from EmoNet slightly improved the efficacy of the simple BERT classification, making our TransferEmoNetSent model the new state of the art on the SARC dataset.
These results support the pattern discovered on the Sarcasm V2 dataset, highlighting
BERT as the best-performing model and underlining the importance of sentiment in sarcasm
classification. This statement will be confirmed by our last experiment.
Even in this experiment, BERT is shown to be the most suitable model for sarcasm detection, outperforming the BiLSTM model by more than 1%.
Table 12. Results on SARCTwitter dataset. Bold font shows best performance overall.
Model F1-Score
BERT (no intermediate pre-training) 96.34
BERT + TransferEmoNet 96.71
BERT + TransferEmoNetSent 97.43
BERT + TransferIMDB 95.96
BiLSTM (ELMo) 95.10
CNN only text (Mishra et al. [29]) 85.63
CNN gaze + text (Mishra et al. [29]) 86.97
GRU+MTL (Majumder et al. [28]) (SOTA) 90.67
Our results show that sarcasm can be recognized automatically with a good performance without even having to further use contextual information, such as users’ historical comments or parent comments. We
also explored a transfer learning framework to exploit the correlation between sarcasm
and the sentiment or emotions conveyed in the text, and found that an intermediate task
training on a correlated task can improve the effectiveness of the base BERT models, with
sentiment having a higher impact than emotions on the performance, especially on sarcasm
detection datasets that are small in size. We thus established new state-of-the-art results
on three datasets for sarcasm detection. Specifically, the improvement in performance of
BERT-based models (with and without intermediate task transfer learning) compared with
previous works on sarcasm detection is significant and is as high as 11.53%. We found
that the BERT models that use only the message content perform better than models that
leverage additional information from a writer’s history encoded as personality features
in prior work. We found this result to be remarkable. Moreover, if the dataset size for the
target task—sarcasm detection—is small then intermediate task transfer learning (with
sentiment as the intermediate task) can improve the performance further.
We believe that our models can be used as strong baselines for new research on this task, and we expect that, by enhancing the models with contextual data, such as user embeddings, in future work, new state-of-the-art performance can be reached. Integrating
multiple intermediate tasks at the same time could potentially improve the performance
further, although caution should be taken to avoid the loss of knowledge from the general
domain while learning from the intermediate tasks. We make our code available to further
research in this area.
Author Contributions: Both authors, E.S. and C.C., contributed ideas and the overall conceptualization of the project. E.S. wrote the code/implementation of the project and provided an initial draft of the paper. Both authors worked on the writing and polishing of the paper and addressed the reviewers’ comments. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Science Foundation (NSF) and Amazon Web
Services under grant number NSF-IIS: BIGDATA #1741353. Any opinions, findings, and conclusions
expressed here are those of the authors and do not necessarily reflect the views of NSF.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The datasets used in our experiments are available online at the fol-
lowing links: the Sarcasm V2 Corpus at https://nlds.soe.ucsc.edu/sarcasm2 (accessed on 23 March
2021), SARC at http://nlp.cs.princeton.edu/SARC/ (accessed on 23 March 2021), and SARCTwitter
at http://www.cfilt.iitb.ac.in/cognitive-nlp/ (accessed on 29 March 2021).
Acknowledgments: We thank our anonymous reviewers for their constructive comments and feed-
back, which helped improve our paper. This research is supported in part by the National Science
Foundation and Amazon Web Services (for computing resources).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Maynard, D.; Greenwood, M. Who cares about Sarcastic Tweets? Investigating the Impact of Sarcasm on Sentiment Analysis. In
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31
May 2014; pp. 4238–4243.
2. Sykora, M.; Elayan, S.; Jackson, T.W. A qualitative analysis of sarcasm, irony and related #hashtags on Twitter. Big Data Soc. 2020, 7.
[CrossRef]
3. Joshi, A.; Sharma, V.; Bhattacharyya, P. Harnessing context incongruity for sarcasm detection. In Proceedings of the 53rd Annual
Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language
Processing (Short Papers), Beijing, China, 26–31 July 2015; Volume 2.
4. Wallace, B.C.; Choe, D.K.; Kertz, L.; Charniak, E. Humans Require Context to Infer Ironic Intent (so Computers Probably do, too).
In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers), Baltimore, MD,
USA, 22–27 June 2014; Volume 2, pp. 512–516. [CrossRef]
154
Mathematics 2022, 10, 844
5. Oraby, S.; El-Sonbaty, Y.; Abou El-Nasr, M. Exploring the Effects of Word Roots for Arabic Sentiment Analysis. In Proceedings of
the Sixth International Joint Conference on Natural Language Processing, Nagoya, Japan, 14–19 October 2013; pp. 471–479.
6. Riloff, E.; Qadir, A.; Surve, P.; De Silva, L.; Gilbert, N.; Huang, R. Sarcasm as Contrast between a Positive Sentiment and Negative
Situation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21
October 2013; pp. 704–714.
7. Mishra, A.; Kanojia, D.; Bhattacharyya, P. Predicting readers’ sarcasm understandability by modeling gaze behavior. In
Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–16 February 2016.
8. Khodak, M.; Risteski, A.; Fellbaum, C.; Arora, S. Automated WordNet Construction Using Word Embeddings. In Proceedings of
the 1st Workshop on Sense, Concept and Entity Representations and Their Applications, Valencia, Spain, 4 April 2017; pp. 12–23.
[CrossRef]
9. Oprea, S.V.; Magdy, W. iSarcasm: A Dataset of Intended Sarcasm. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, Online, 5–10 July 2020; pp. 1279–1289. Available online: https://aclanthology.org/2020.acl-main.118/
(accessed on 12 February 2021).
10. Chauhan, D.S.; Dhanush, S.R.; Ekbal, A.; Bhattacharyya, P. Sentiment and Emotion help Sarcasm? A Multi-task Learning
Framework for Multi-Modal Sarcasm, Sentiment and Emotion Analysis. In Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, Online, 5–10 July 2020; pp. 4351–4360. [CrossRef]
11. Oraby, S.; Harrison, V.; Reed, L.; Hernandez, E.; Riloff, E.; Walker, M. Creating and Characterizing a Diverse Corpus of Sarcasm
in Dialogue. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Los Angeles,
CA, USA, 13–15 September 2016; pp. 31–41.
12. Liebrecht, C.; Kunneman, F.; van den Bosch, A. The perfect solution for detecting sarcasm in tweets #not. In Proceedings of the
4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Atlanta, GA, USA, 14 June
2013; pp. 29–37.
13. Amir, S.; Wallace, B.C.; Lyu, H.; Silva, P.C.M.J. Modelling context with user embeddings for sarcasm detection in social media.
arXiv 2016, arXiv:1607.00976.
14. Ghosh, A.; Veale, T. Fracking sarcasm using neural network. In Proceedings of the Workshop on Computational Approaches to
Subjectivity, Sentiment and Social Media Analysis, San Diego, CA, USA, 16 June 2016.
15. Joshi, A.; Tripathi, V.; Patel, K.; Bhattacharyya, P.; Carman, M. Are word embedding-based features useful for sarcasm detection?
arXiv 2016, arXiv:1610.00883.
16. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding.
arXiv 2018, arXiv:1810.04805.
17. Pruksachatkun, Y.; Phang, J.; Liu, H.; Htut, P.M.; Zhang, X.; Pang, R.Y.; Vania, C.; Kann, K.; Bowman, S.R. Intermediate-Task
Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work? arXiv 2020,
arXiv:2005.00628.
18. Khodak, M.; Saunshi, N.; Vodrahalli, K. A Large Self-Annotated Corpus for Sarcasm. In Proceedings of the Eleventh International
Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018.
19. Kreuz, R.J.; Caucci, G.M. Lexical influences on the perception of sarcasm. In Proceedings of the Workshop on computational
approaches to Figurative Language, Rochester, NY, USA, 26 April 2007; pp. 1–4.
20. Carvalho, P.; Sarmento, L.; Silva, M.J.; De Oliveira, E. Clues for detecting irony in user-generated contents: Oh...!! it’s so easy;-).
In Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion, Hong Kong, China, 6
November 2009; pp. 53–56.
21. Tsur, O.; Davidov, D.; Rappoport, A. ICWSM—A great catchy name: Semi-supervised recognition of sarcastic sentences in online
product reviews. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, Washington, DC,
USA, 23–26 May 2010.
22. Davidov, D.; Tsur, O.; Rappoport, A. Semi-supervised recognition of sarcastic sentences in twitter and amazon. In Proceedings of
the Fourteenth Conference on Computational Natural Language Learning, Uppsala, Sweden, 15–16 July 2010; pp. 107–116.
23. Rajadesingan, A.; Zafarani, R.; Liu, H. Sarcasm detection on twitter: A behavioral modeling approach. In Proceedings of the
Eighth ACM International Conference on Web Search and Data Mining, Shanghai, China, 2–6 February 2015; pp. 97–106.
24. Bamman, D.; Smith, N.A. Contextualized sarcasm detection on twitter. In Proceedings of the Ninth International AAAI
Conference on Web and Social Media, Oxford, UK, 26–29 May 2015.
25. Poria, S.; Cambria, E.; Hazarika, D.; Vij, P. A deeper look into sarcastic tweets using deep convolutional neural networks. arXiv
2016, arXiv:1610.08815.
26. Zhang, M.; Zhang, Y.; Fu, G. Tweet sarcasm detection using deep neural network. In Proceedings of the COLING 2016, The 26th
International Conference on Computational Linguistics, Osaka, Japan, 11–16 December 2016; pp. 2449–2460.
27. Hazarika, D.; Poria, S.; Gorantla, S.; Cambria, E.; Zimmermann, R.; Mihalcea, R. CASCADE: Contextual Sarcasm Detection in
Online Discussion Forums. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM,
USA, 20–26 August 2018; pp. 1837–1848.
28. Majumder, N.; Poria, S.; Peng, H.; Chhaya, N.; Cambria, E.; Gelbukh, A.F. Sentiment and Sarcasm Classification with Multitask
Learning. arXiv 2019, arXiv:1901.08014.
155
Mathematics 2022, 10, 844
29. Mishra, A.; Dey, K.; Bhattacharyya, P. Learning cognitive features from gaze data for sentiment and sarcasm classification using
convolutional neural network. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics,
Vancouver, BC, Canada, 30 July 2017–4 August 2017; pp. 377–387.
30. Plepi, J.; Flek, L. Perceived and Intended Sarcasm Detection with Graph Attention Networks. In Findings of the Association for
Computational Linguistics: EMNLP 2021; Association for Computational Linguistics: Punta Cana, Dominican Republic, 2021;
pp. 4746–4753. [CrossRef]
31. Cai, Y.; Cai, H.; Wan, X. Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model. In Proceedings of the 57th
Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July 2019–2 August 2019; pp. 2506–2515.
[CrossRef]
32. Li, L.; Levi, O.; Hosseini, P.; Broniatowski, D. A Multi-Modal Method for Satire Detection using Textual and Visual Cues. In
Proceedings of the 3rd NLP4IF Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda, Barcelona,
Spain, 20 December 2020; pp. 33–38.
33. Wang, X.; Sun, X.; Yang, T.; Wang, H. Building a Bridge: A Method for Image-Text Sarcasm Detection Without Pretraining on
Image-Text Data. In Proceedings of the First International Workshop on Natural Language Processing Beyond Text, Online, 20
November 2020; pp. 19–29. [CrossRef]
34. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [CrossRef]
35. Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language
Tasks. In Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.,
Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32.
36. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. HuggingFace’s
Transformers: State-of-the-art Natural Language Processing. arXiv 2019, arXiv:1910.03771.
37. Phang, J.; Févry, T.; Bowman, S.R. Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks.
arXiv 2018, arXiv:1811.01088.
38. Abdul-Mageed, M.; Ungar, L. EmoNet: Fine-Grained Emotion Detection with Gated Recurrent Neural Networks. In Proceedings
of the 55th Annual Meeting of the Association for Computational Linguistics (Long Papers), Vancouver, BC, Canada, 30 July
2017–4 August 2017; Volume 1, pp. 718–728. [CrossRef]
39. Maas, A.L.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning Word Vectors for Sentiment Analysis. In Proceedings
of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR,
USA, 19–24 June 2011; pp. 142–150.
40. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
41. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods
in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1746–1751. [CrossRef]
42. Gardner, M.; Grus, J.; Neumann, M.; Tafjord, O.; Dasigi, P.; Liu, N.; Peters, M.; Schmitz, M.; Zettlemoyer, L. Allennlp: A deep
semantic natural language processing platform. arXiv 2018, arXiv:1803.07640.
43. Bridle, J.S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern
recognition. In Neurocomputing; Springer: Berlin/Heidelberg, Germany, 1990; pp. 227–236.
44. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res.
2011, 12, 2121–2159.
45. Mikolov, T.; Chen, K.; Corrado, G.S.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of
the 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, AZ, USA, 2–4 May 2013; Workshop Track
Proceedings.
Parallel Stylometric Document Embeddings with
Deep Learning Based Language Models in Literary
Authorship Attribution
Mihailo Škorić 1,*, Ranka Stanković 1, Milica Ikonić Nešić 2, Joanna Byszuk 3 and Maciej Eder 3
1 Faculty of Mining and Geology, University of Belgrade, Djusina 7, 11120 Belgrade, Serbia;
[email protected]
2 Faculty of Philology, University of Belgrade, Studentski Trg 3, 11000 Belgrade, Serbia;
milica.ikonic.nesic@fil.bg.ac.rs
3 Institute of Polish Language, Polish Academy of Sciences, al. Mickiewicza 31, 31-120 Kraków, Poland;
[email protected] (J.B.); [email protected] (M.E.)
* Correspondence: [email protected]
Abstract: This paper explores the effectiveness of parallel stylometric document embeddings in
solving the authorship attribution task by testing a novel approach on literary texts in 7 different
languages, totaling in 7051 unique 10,000-token chunks from 700 PoS and lemma annotated doc-
uments. We used these documents to produce four document embedding models using Stylo R
package (word-based, lemma-based, PoS-trigrams-based, and PoS-mask-based) and one document
embedding model using mBERT for each of the seven languages. We created further derivations
of these embeddings in the form of average, product, minimum, maximum, and l 2 norm of these
document embedding matrices and tested them both including and excluding the mBERT-based
document embeddings for each language. Finally, we trained several perceptrons on the portions of
the dataset in order to procure adequate weights for a weighted combination approach. We tested
Citation: Škorić, M.; Stanković, R.;
standalone (two baselines) and composite embeddings for classification accuracy, precision, recall,
Ikonić Nešić, M.; Byszuk, J.; Eder, M.
Parallel Stylometric Document
weighted-average, and macro-averaged F1 -score, compared them with one another and have found
Embeddings with Deep Learning that for each language most of our composition methods outperform the baselines (with a couple of
Based Language Models in Literary methods outperforming all baselines for all languages), with or without mBERT inputs, which are
Authorship Attribution. Mathematics found to have no significant positive impact on the results of our methods.
2022, 10, 838. https://doi.org/
10.3390/math10050838 Keywords: document embeddings; authorship attribution; language modelling; parallel architectures;
stylometry; language processing pipelines
Academic Editors: Florentina Hristea
and Cornelia Caragea
MSC: 68T50
Received: 30 January 2022
Accepted: 28 February 2022
Published: 7 March 2022
1.1. Stylometry
Both distant reading and AA belong to a broader theoretical framework of stylometry.
Specifically, stylometry is a method of statistical analysis of texts, and it is applied, among
other things, to distinguish between authorial literary styles. For each individual author,
there is an assumption that he/she exhibits a distinct style of writing [14]. This fundamental
notion makes it possible to use stylometric methodology to differentiate between documents
written by different authors and solve the AA task [15].
Performance of particular stylometric methods strongly depends on the choice of
language features as the relevant style-markers. The determination of which features are
the best to use for particular tasks and how they should be ordered has been a subject of
many debates over the years. The earliest approaches relied solely on words, and examined
differences in their use for particular authors [7], or general differences using lists of most
frequent words (MFW) [11]. Further studies experimented with various types of features,
with discussions on whether words should be lemmatized [14,16,17].
Evert et al. [13] discussed AA based on distance measures, different performance of
diverse distance measures, and normalization strategies, as well as specificity for language
families. Instead of relying on a specified number of MFW, they identified a set of discrimi-
nant words by using the method of recursive feature elimination. By repeatedly training a
support vector classifier and pruning the least important ones, they obtained a minimal
set of features for optimal performance. The resulting set contained function words and
not so common content words. Eder and Byszuk [18] also experimented with changing the
order of MFW on the list of used features and its influence on the accuracy of classification,
confirming that the most discriminative features do not necessarily overlap with the MFW.
Among the non-word approaches, most attempts were made using chunks of subsequent
letters (so called character n-grams) [19], or grammatical features.
Weerasinghe and Greenstadt [20] used the following textual features: character n-grams (TF-IDF values for character n-grams, where 1 ≤ n ≤ 6), PoS tag n-grams (TF-IDF
value of PoS tag trigrams), special characters (TF-IDF values for 31 pre-defined special
characters), frequencies of function words (179 NLTK stopwords), number of characters
and tokens in the document, average number of characters per word, per document,
distribution of word-lengths (1–10), vocabulary richness, PoS tag chunks, and noun and
verb phrase construction. For each document pair, they extracted stylometric features from
the documents and used the absolute difference between the feature vectors as input to the
classifier. They built a logistic regression model trained on a small dataset, and a neural
network based model trained on the large dataset.
One of the most interesting and recent proposals was made by Camps et al. [3], who
attempted stylometric analysis of medieval vernacular texts, noting that the scribal variation
and errors introduced over the centuries complicate the investigations. To counter this
textual variance, they developed a workflow combining handwritten text recognition and
stylometric analysis performed using a variety of lexical and grammatical features to the
study of a corpus of hagiographic works, examining potential authorial groupings in a
vastly anonymous corpus.
Despite the overall good performance of these various approaches, MFW still proved
to be most effective in discrimination between authors. Popular tools for conducting
stylometric analyses like Stylo R package [21] are still suggesting the use of word tokens, or
word or character n-grams, while also supporting further deviations from classic MFW.
Apart from the above shallow text representations, often referred to as bag-of-words models, recent studies in AA are also exploring context-aware representations, or features
that take into account contextual information, usually extracted using neural networks.
Kocher and Savoy [22] proposed two new AA classifiers using distributed language
representation, where the nearby context of each word in a document was used to create
a vector-space representation for either authors or texts, and cosine similarities between
these representations were used for authorship-based classification. The evaluations using
the k-nearest neighbors (k-NNs) on four test collections indicated good performance of
that method, which in some cases outperformed even the state-of-the-art methods. Salami
and Momtazi [23] proposed a poetry AA model based on recurrent convolutional neural
networks, which captured temporal and spatial features using either a poem or a single
verse as an input. This model was shown to significantly outperform other state-of-the-
art models.
Segarra et al. [24] used normalized word adjacency networks, i.e., relational structures between function words, as stylometric information for AA. These networks
express grammatical relationships between words but do not carry lexical meaning on their
own. For long profiles with more than 60,000 words, they achieve high attribution accuracy
even when distinguishing between a large number of authors, and also achieve reasonable
rates for short texts (i.e., newspaper articles), if the number of possible authors is small.
Similarly, Marinho et al. [25] presented another study that focuses on solving the AA task
using complex networks, but by focusing on network subgraphs (motifs) as features to
identify different authors.
Finally, state-of-the-art stylometry is also exploring a combination of several represen-
tations. Such a setup is usually referred to as parallel architecture. Arguably, the use of a
heterogeneous classifier that combines independent classifiers with different approaches
usually outperforms the ones obtained using a single classifier [26,27]. Segarra et al. [24]
also showed that word adjacency networks and frequencies capture different stylometric
aspects and that their combination can halve the error rate of existing methods.
2. Dataset
The COST Action “Distant Reading for European Literary History” (https://www.
distant-reading.net, accessed on 29 January 2022) coordinates the creation of a multilingual
European Literary Text Collection (ELTeC) [31]. This resource will be used to establish best
practices and develop innovative methods of Distant Reading for the multiple European
literary traditions. Its core will contain at least 10 linguistically annotated sub-collections of 100 novels each, comparable in their internal structure, in at least 10 different European languages, totaling at least 1000 annotated full-text novels. The extended ELTeC will take the total number of full-text novels to at least 2500.
In order to create representative sub-collections for the corresponding languages, the novels were selected to evenly represent (1) novels of various sizes: short (10,000–50,000 words), medium (50,000–100,000 words), and long (more than 100,000 words); (2) four 20-year time periods: T1 [1840–1859], T2 [1860–1879], T3 [1880–1899], and T4 [1900–1920]; (3) the number of reprints, as a measure of canonicity (novels known to a wider audience vs. completely forgotten ones); and (4) female and male authors [32].
Multiple encoding levels are provided in the ELTeC scheme: at level–0, only the bare minimum of markup is permitted, while at level–1 a slightly richer encoding is defined. At level–2, additional information is introduced to support various kinds of linguistic processing, with part-of-speech (PoS) tags, named entities, and lemmas being mandatory.
In its current version, the ELTeC contains comparable corpora for 17 European
languages, with each intended to be a balanced sample of 100 novels from the period
1840 to 1920. The current total number of novels is 1355 (104,084,631 words), with 10
languages reaching a collection of 100 encoded in level–1: Czech, German, English,
French, Hungarian, Polish, Portuguese, Romanian, Slovenian, and Serbian. The cur-
rent state in ELTeC corpus building can be seen in a github overview web page (https:
//distantreading.github.io/ELTeC, accessed on 29 January 2022). The action is set to finish
by the end of April 2022, so more novels are expected to be added. Each novel is supported by metadata concerning its production and reception, aiming to become a reliable basis for comparative work in data-driven textual analysis [31]. All novels and transformation scripts are available on GitHub for browsing and download, and more curated versions are published periodically on Zenodo (https://zenodo.org/communities/eltec, accessed
on 29 January 2022). ELTeC sub-collections and their derivations are already being used, for
example in TXM (https://txm.gitpages.huma-num.fr/textometrie, accessed on 29 January
2022), SketchEngine [33,34], or for word embedding development.
The level–2 ELTeC collection currently contains 7 sub-collections of 100 novels for the
following languages: German, English, French, Hungarian, Portuguese, Slovenian, and
Serbian (as of December 2021). For the research in this paper, we used these 700 novels,
since in this iteration each token is supplied with lemma and PoS tag as required for the
experiment. The second column of Table 1 presents the number of words per language
sub-collection, totaling in 58,061,996 for these 7 languages, while the third column contains
the number of tokens, totaling in 73,692,461.
For the purpose of this experiment, we produced four document representations for each novel, each in the form of a vertical text, consisting of: (1) words (the verticalized original text of the novel); (2) lemmas (the verticalized lemmatized text); (3) PoS tags (each token in the verticalized text is replaced by its PoS tag); and (4) masked text, where tokens tagged ADJ, NOUN, PROPN, ADV, VERB, AUX, NUM, SYM, and X were substituted with their PoS tag, tokens tagged DET and PRON were substituted with their lemma, and the remaining tokens (ADP, CCONJ, INTJ, PART, PUNCT, SCONJ) were left unchanged, as inspired by [35].
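A minimal Python sketch of how the four vertical representations can be derived from the level–2 annotation is given below; the tag sets mirror the description above (UD-style labels), and the function name and example are ours.

```python
# Tags whose tokens are replaced by the PoS tag itself in the masked representation.
MASK_WITH_POS = {"ADJ", "NOUN", "PROPN", "ADV", "VERB", "AUX", "NUM", "SYM", "X"}
# Tags whose tokens are replaced by their lemma.
MASK_WITH_LEMMA = {"DET", "PRON"}
# Everything else (ADP, CCONJ, INTJ, PART, PUNCT, SCONJ) is kept unchanged.


def representations(annotated_tokens):
    """Build the four vertical representations from (word, lemma, pos) triples."""
    words, lemmas, pos_tags, masked = [], [], [], []
    for word, lemma, pos in annotated_tokens:
        words.append(word)
        lemmas.append(lemma)
        pos_tags.append(pos)
        if pos in MASK_WITH_POS:
            masked.append(pos)
        elif pos in MASK_WITH_LEMMA:
            masked.append(lemma)
        else:
            masked.append(word)
    return words, lemmas, pos_tags, masked


doc = [("The", "the", "DET"), ("old", "old", "ADJ"), ("captain", "captain", "NOUN"),
       ("sailed", "sail", "VERB"), (".", ".", "PUNCT")]
print(representations(doc)[3])   # ['the', 'ADJ', 'NOUN', 'VERB', '.']
```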
Keeping in mind the remarkable variation in size of the novels within and across particular language collections, we applied chunking, relying on the results presented in [36] and on the well-known phenomenon that attribution effectiveness grows with the number of words analyzed and, at a certain point, tends to stabilize or slightly decrease [37].
After a few calibrating experiments with different sizes of chunks, we chose the 10,000
token sample size as the most representative. Each novel was split into chunks of exactly
10,000 tokens, with the last, shorter chunk, being excluded. This resulted in a dataset
consisting of 28,204 chunks (documents)—7051 chunks per each of the 4 aforementioned
document representations. Table 1 also presents the number of chunks for each language
sub-collection. The produced dataset was used as the base for all further processing in this
research, with each language collection considered separately.
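The chunking step itself is straightforward; a small sketch is given below (our own illustration, with a hypothetical helper name).

```python
def chunk_tokens(tokens, size=10_000):
    """Split a verticalized novel into chunks of exactly `size` tokens;
    the trailing, shorter chunk is discarded, as described above."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


novel = ["tok"] * 25_345
chunks = chunk_tokens(novel)
print(len(chunks), len(chunks[0]))   # 2 10000 -- the 5345-token remainder is dropped
```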
3. Workflow
In this section, we will explore the generation of all 19 different document embedding types we envisioned. Firstly, we created five baseline, standalone embeddings: four based on stylometry and one based on a deep-learning language model (Figure 1). Based on those, we derived 10 more using 5 simple combination techniques and, finally, 4 more using weight-based linear combinations.
trigrams for chunks containing PoS tags and bigrams for chunks containing PoS-masked
words. For these chunks we picked top 300 and 500 features, while for the original and
lemmatized chunks we picked the 800 most frequent features.
Figure 1. A flowchart depicting a path from the annotated novels to document embeddings using
multilingual BERT and Stylo R package methods.
Figure 2. A flowchart describing the path from documents to document embeddings using the Stylo R package, with the n-gram transformation being applied only when generating non-unigram-based document representations.
For each representation, we calculated the cosine delta distance (also known as the Würzburg distance) [12] between every two chunks, based on the previously obtained frequency tables. The distances were put together in symmetric, hollow matrices $D_t$, in which every cell $a_{i,j}$, $i, j \in \{1, \dots, k\}$, represents the distance between two documents, and $k$ is the number of documents for a specific language. Thus,
$$
D_t =
\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,k} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,k} \\
\vdots  & \vdots  & \ddots & \vdots  \\
a_{k,1} & a_{k,2} & \cdots & a_{k,k}
\end{bmatrix}
=
\begin{bmatrix}
0 & * & \cdots & * \\
* & 0 & \cdots & * \\
\vdots & \vdots & \ddots & \vdots \\
* & * & \cdots & 0
\end{bmatrix}, \tag{1}
$$
where $t \in \{word, pos, lemma, masked\}$ and $*$ denotes a numerical distance.
These four matrices, Dword , D pos , Dlemma and Dmasked , produced for each of the seven
languages, are used to convey document embeddings grouped by a document representa-
tion method. Each one contains mutual document distances for the same set of documents
with distances differing between matrices as they were obtained using different representa-
tions of the same documents.
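While these matrices are computed with the Stylo R package in the paper, the underlying cosine delta computation can be sketched in Python as follows. This is a simplified illustration under our own assumptions about feature counting, not the Stylo implementation itself.

```python
import numpy as np
from collections import Counter


def cosine_delta_matrix(docs, n_features=800):
    """docs: list of token lists in one of the four representations.
    Returns a symmetric, hollow distance matrix D_t as in Equation (1)."""
    # 1. most frequent features over the whole sub-collection
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    features = [f for f, _ in corpus_counts.most_common(n_features)]

    # 2. relative frequencies of those features per document
    counters = [Counter(doc) for doc in docs]
    freqs = np.array([[c[f] / len(doc) for f in features]
                      for c, doc in zip(counters, docs)])

    # 3. z-score each feature column (the standardisation behind the delta family)
    z = (freqs - freqs.mean(axis=0)) / (freqs.std(axis=0) + 1e-12)

    # 4. cosine distance between z-score vectors = cosine (Wuerzburg) delta
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    sims = (z @ z.T) / (norms @ norms.T + 1e-12)
    return 1.0 - sims   # zeros on the diagonal
```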
Figure 3. Flowchart describing the path from documents to document embeddings using mBERT.
In the first step, each document was split into sentences using several regular expres-
sions. Since mBERT requires each sentence to have 512 tokens or less, longer sentences
were trimmed. Sentences were tokenized using Google AI’s BERT tokenizer [28] with 110k
shared WordPiece vocabulary (provided by the mBERT authors). The model assigns each
token in a sentence with a 768 × 1 word embedding. Those are then summed into sentence
tensors of the same length. All sentence tensors in each document are averaged into a
single 768 × 1 tensor, which is used as a document representation.
If there are $k$ documents, then there will be $k$ document tensors $\vec{v}_1, \vec{v}_2, \dots, \vec{v}_k$. Using the cosine similarity
$$
d_{i,j} = \frac{\langle \vec{v}_i, \vec{v}_j \rangle}{\|\vec{v}_i\| \cdot \|\vec{v}_j\|}
$$
between vector pairs, we get distances between the documents represented by those vectors $\vec{v}_i, \vec{v}_j$, $i, j \in \{1, \dots, k\}$, with the final product being the document embedding matrix:
$$
D_{bert} =
\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,k} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,k} \\
\vdots  & \vdots  & \ddots & \vdots  \\
a_{k,1} & a_{k,2} & \cdots & a_{k,k}
\end{bmatrix}, \tag{2}
$$
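A condensed sketch of this mBERT pipeline with the Hugging Face transformers library is given below; the regular expression used for sentence splitting and the conversion of cosine similarity into a distance (1 − similarity) are our assumptions, and the checkpoint name is the publicly available multilingual cased model.

```python
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
mbert = BertModel.from_pretrained("bert-base-multilingual-cased")


def document_tensor(text):
    """Sum token embeddings into sentence tensors, then average them
    into a single 768-dimensional document representation."""
    sentences = re.split(r"(?<=[.!?])\s+", text)   # crude stand-in for the paper's regexes
    sentence_tensors = []
    with torch.no_grad():
        for sent in sentences:
            enc = tokenizer(sent, truncation=True, max_length=512, return_tensors="pt")
            token_embeddings = mbert(**enc).last_hidden_state[0]    # (n_tokens, 768)
            sentence_tensors.append(token_embeddings.sum(dim=0))    # sentence tensor
    return torch.stack(sentence_tensors).mean(dim=0)                # document tensor


def mbert_distance_matrix(doc_tensors):
    """Pairwise matrix over all document tensors, cf. Equation (2)."""
    v = torch.stack(doc_tensors)
    v = v / v.norm(dim=1, keepdim=True)
    sims = v @ v.T                      # cosine similarities d_ij
    return 1.0 - sims                   # assumed conversion to a distance
```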
where the entries of a weighted composite distance matrix are computed as
$$
b_{i,j} = \frac{1}{C}\left( a_{i,j}^{(1)} w^{(1)} + a_{i,j}^{(2)} w^{(2)} + \dots + a_{i,j}^{(n)} w^{(n)} \right), \quad i, j \in \{1, \dots, k\}. \tag{9}
$$
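Once the base matrices are available, Equation (9) amounts to an element-wise weighted sum; a minimal numpy sketch (our illustration) follows.

```python
import numpy as np


def weighted_composite(distance_matrices, weights, C=1.0):
    """Equation (9): element-wise weighted combination of base distance matrices
    (D_word, D_pos, D_lemma, D_masked and, optionally, D_bert)."""
    return sum(w * D for w, D in zip(weights, distance_matrices)) / C

# The simple compositions used elsewhere in the paper can be obtained analogously,
# e.g. mean: np.mean(np.stack(ms), axis=0), or max: np.max(np.stack(ms), axis=0).
```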
Figure 4. Visualisation of perceptron inputs and targeted outputs during the training phase—inputs
for each iteration being distances between the same documents from four (or five) different embed-
dings and desired outputs of either 0 or 1.
Adam optimizer [38] with standard initial learning rate of 0.01 was used to train all of
the perceptrons used in this research. The number of inputs with the desired output of 0
was truncated to the number of inputs with desired output of 1 in order to avoid author
verification bias, and the distances between the documents that belong to the same novel
were excluded from the training. The number of inputs for each training was afterwards
truncated to 6384, which was the size of the available set for the smallest dataset, in order
to avoid language bias in a multilingual setting. These inputs and outputs were split into
training and validation sets in a 9:1 ratio, and the batch sizes were fixed at 64. The final
epoch number was set to 356 according to the other parameters and several training inspections. During these inspections, we looked specifically for the average number of epochs across languages before the validation error rate trend changes from descending to ascending, which indicates over-fitting. Once the training of the perceptrons was completed, the model was used to predict the final weights, as shown in Figure 5. The weights were then used to solve Equation (9). It has to be emphasized that, when the procured weights are used, C = 1 in Equation (10) for any method, because the weights were normalized to sum up to 1. The normalized weights are presented in Table 2.
Figure 5. Visualisation of trained perceptron inputs and outputs—inputs being distances between the
same documents from four (or five) different embeddings and the output being their new weighted-
based scalar distance.
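A sketch of the weight-procuring perceptron is given below. The sigmoid/binary cross-entropy objective and the exact target convention (0 or 1 per document pair, as in Figure 4) are our reading of the setup rather than the authors' published code; the batch size, learning rate, epoch count, optimizer, and 9:1 split follow the description above.

```python
import torch
import torch.nn as nn


def train_weight_perceptron(X, y, epochs=356, lr=0.01, batch_size=64):
    """X: (n_pairs, n_inputs) float tensor of distances between the same document
    pair taken from the 4 (or 5) base embeddings; y: desired output of 0.0 or 1.0
    per pair (Figure 4). Returns the learned input weights, normalised to sum to 1."""
    perceptron = nn.Linear(X.shape[1], 1)                          # single-layer perceptron
    optimizer = torch.optim.Adam(perceptron.parameters(), lr=lr)   # Adam [38], lr 0.01
    loss_fn = nn.BCEWithLogitsLoss()

    split = int(0.9 * len(X))            # 9:1 train/validation split; the held-out 10%
    train_X, train_y = X[:split], y[:split]   # is used only to monitor over-fitting

    for _ in range(epochs):                                  # 356 epochs, batches of 64
        perm = torch.randperm(len(train_X))
        for start in range(0, len(train_X), batch_size):
            idx = perm[start:start + batch_size]
            optimizer.zero_grad()
            loss = loss_fn(perceptron(train_X[idx]).squeeze(-1), train_y[idx])
            loss.backward()
            optimizer.step()

    weights = perceptron.weight.detach().squeeze(0)
    return weights / weights.sum()                           # normalise to sum up to 1
```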
A total of 28 perceptrons were trained, 14 not using the mBERT (upper) and 14 using
the mBERT inputs (bottom half of Table 2). Out of those 14, 7 were trained on inputs
from each language, respectively (labeled lng and lng_b, with lng being the acronym of the language used for the training), and 7 were trained on multilingual inputs, each excluding one language (labeled universal_excl_lng and universal_b_excl_lng).
Using the obtained weights and previously created document embedding matrices,
we generated four new ones for each language using Equation (9). In order to avoid any
bias problem, two strategies were employed. The first strategy was to, for each language,
use universal weights where that language was excluded from their training, e.g., applying the universal_excl_deu weights to German (deu). Thus, from Equations (9) and (10) we derived:
$$
D_{weights\_universal\_excl\_lng} = \sum_{t} \left( w_{universal\_excl\_lng\_t} \, D_t \right), \tag{13}
$$
$$
D_{weights\_universal\_b\_excl\_lng} = \sum_{s} \left( w_{universal\_b\_excl\_lng\_s} \, D_s \right), \tag{14}
$$
for t ∈ {word, pos, lemma, masked} and for s ∈ {word, pos, lemma, masked, bert}. We used
these two formulas to generate two document embeddings based on universally trained
weights, one without and one with the use of mBERT based embeddings.
Our second strategy was to involve direct transfer-learning and produce weight-based
embeddings using the weights trained on a different language dataset, as another way
to avoid the training-testing bias. Suitable weights to compute the document embedding
matrix for a given language were selected through comparison of Euclidean distances
of the trained weights for all languages. Results of these comparisons are presented in
Table 3 (distances were calculated separately for perceptrons without and with mBERT
input, presented in the upper and bottom half of the table, respectively).
Table 3. Euclidean distances between weights acquired through perceptron training, with bold
indicating the shortest distances between different language weights.
Two new embeddings were generated for each language based on the nearest Euclidean neighbor among the trained weights. For example, the Serbian (srp) embeddings were calculated with the French (fra) weights (without mBERT) and with the Hungarian (hun) weights (with mBERT input), as shown by the bold values in the upper and lower parts of Table 3, respectively. Thus,
based on Equations (9) and (10), we derived:
$$
D_{weights\_transfer\_lng} = \sum_{t} \left( w_{xlng\_t} \, D_t \right), \tag{15}
$$
$$
D_{weights\_transfer\_b\_lng} = \sum_{s} \left( w_{xlng\_b\_s} \, D_s \right), \tag{16}
$$
for $t \in \{word, pos, lemma, masked\}$ and $s \in \{word, pos, lemma, masked, bert\}$, where $xlng$ denotes the nearest Euclidean neighbor according to the distances between trained weights presented in Table 3.
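The selection of the donor language for this transfer can be sketched as follows (our own illustration, where trained_weights plays the role of Table 2).

```python
import numpy as np


def nearest_weight_set(target_lang, trained_weights):
    """trained_weights: dict mapping a language code to its trained weight vector.
    Returns the other language whose weights lie closest (Euclidean) to the target's,
    i.e., the xlng used in Equations (15) and (16)."""
    target = np.asarray(trained_weights[target_lang])
    distances = {lang: np.linalg.norm(target - np.asarray(w))
                 for lang, w in trained_weights.items() if lang != target_lang}
    return min(distances, key=distances.get)
```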
4. Results
The results reported in this section rely on the following supervised classification
setup. The evaluation was carried out for each of the 19 document embeddings (4 from
Section 3.1, 1 from Section 3.2, 10 from Section 3.3, and 4 from Section 3.4) computed for
each of the 7 languages, totaling in 133 evaluated document embeddings. Only the authors
represented by at least two novels were chosen for the evaluation subset, in order to achieve
a closed-set attribution scenario. All of their chunks (documents) were evaluated against
all the other documents, excluding the ones originating from the same novel, in order to
avoid easy hits.
Each resulting subset from the original document embeddings matrix contained pairwise
comparisons (distances) between the selected documents and classification was thus performed
by identifying the minimal distance for each document, which is equivalent to using the k-NN
classifier with k = 1. If a document’s nearest neighbour originates from another novel of the
same author, it is considered a hit. In this section, we will report the overall performance for each
document embeddings matrix via accuracy, precision, recall, weighted-average F1 -score, and
macro-averaged F1 -score, as well as the statistical significance of the procured results. It should
be noted that due to the nature of our test, where the domain of possible authors outnumbers
the domain of expected authors, the macro-averaged F1 -score reflects the potential domain
reduction, where the methods that predict fewer authors tend to have higher scores.
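The classification step can be sketched as follows; this is a simplified illustration of the 1-NN evaluation described above, with names of our own choosing.

```python
import numpy as np


def attribution_accuracy(D, authors, novels, eval_mask):
    """D: composite distance matrix; authors/novels: per-chunk labels;
    eval_mask: True for chunks whose author has at least two novels.
    A chunk is a hit when its nearest neighbour (k-NN with k = 1), taken from
    a different novel, was written by the same author."""
    hits, total = 0, 0
    for i in np.where(eval_mask)[0]:
        candidates = [j for j in range(len(authors)) if novels[j] != novels[i]]
        nearest = min(candidates, key=lambda j: D[i, j])   # minimal distance
        hits += int(authors[nearest] == authors[i])
        total += 1
    return hits / total
```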
4.1. Baseline
As already mentioned, using the most frequent words as features has been the primary
method of solving AA tasks for many decades. Therefore, we marked the word-based
embeddings results as our primary baseline (baseline 1), while looking for improvements in
accuracy and in weighted-averaged F1 -score across all the remaining embeddings.
Recently, however, for some highly-inflected languages, most frequent lemmas emerged
as a better alternative to most frequent words [39]. The PoS tags and the document rep-
resentation with masked words, where PoS labels are used to mask predefined set of PoS
classes, also achieved good results for specific problems [35]. In evaluation of this exper-
iment we used the following document representations: most frequent words, lemmas,
PoS trigrams, and PoS-masked bigrams (Dword , Dlemma , D pos and Dmasked ), as the secondary
baseline methods. Specifically, we used the best performing method (from the above list)
for each language as a respective secondary baseline (baseline 2).
To check the agreement between the two evaluation measures, we computed the Pearson correlation coefficient between the accuracy and weighted-average $F_1$ scores,
$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
$$
where $n$ is the sample size, $x_i$, $y_i$ are the individual data points indexed with $i$, and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean (and analogously for $\bar{y}$). Given the very high correlation between the two measures of performance, we decided to focus on one of these (i.e., accuracy) in the further presentation.
Table 4. Accuracy scores obtained in authorship attribution task evaluation, with italic indicating the
baseline and bold indicating best performing methods for each language (baseline and overall).
Embedding Base deu eng fra hun por slv srp avg
bert 0.7129 0.5561 0.4444 0.6991 0.5925 0.5042 0.4918 0.5716
word 0.9203 0.8175 0.7561 0.9812 0.7245 0.7188 0.7279 0.8066
pos 0.8370 0.6632 0.6125 0.8088 0.7509 0.6958 0.7279 0.7280
lemma 0.9212 0.8351 0.7507 0.9781 0.8000 0.7729 0.7082 0.8237
masked 0.7346 0.6439 0.7046 0.9185 0.8113 0.7208 0.7016 0.7479
baseline 1 0.9203 0.8175 0.7561 0.9812 0.7245 0.7188 0.7279 0.8066
baseline 2 0.9212 0.8351 0.7561 0.9812 0.8113 0.7729 0.7279 0.8294
mean 0.9420 0.8158 0.8238 0.9937 0.8717 0.7646 0.7967 0.8583
mean_b 0.9420 0.8175 0.8238 0.9937 0.8717 0.7646 0.8000 0.8590
max 0.9257 0.8298 0.8049 0.9906 0.8226 0.7854 0.7541 0.8447
max_b 0.9257 0.8298 0.8049 0.9906 0.8226 0.7854 0.7541 0.8447
min 0.8433 0.6649 0.6341 0.8150 0.7547 0.6979 0.7344 0.7349
min_b 0.7129 0.5561 0.4444 0.6991 0.5925 0.5042 0.4918 0.5716
product 0.9375 0.8035 0.8157 0.9875 0.8792 0.7625 0.8000 0.8551
product_b 0.9058 0.7474 0.7453 0.9718 0.8189 0.7792 0.7836 0.8217
l2-norm 0.9466 0.8193 0.8293 0.9969 0.8717 0.7646 0.7869 0.8593
l2-norm_b 0.9466 0.8193 0.8293 0.9969 0.8717 0.7646 0.7869 0.8593
weights_transfer 0.9547 0.8421 0.8347 0.9969 0.8415 0.7854 0.7967 0.8646
weights_transfer_b 0.9538 0.8404 0.8320 0.9969 0.8415 0.7917 0.7869 0.8633
weights_universal 0.9475 0.8316 0.8347 0.9906 0.8151 0.7812 0.7934 0.8563
weights_universal_b 0.9484 0.8386 0.8293 0.9906 0.8075 0.7771 0.7934 0.8550
Table 5. Weighted-average F1 -scores obtained through authorship attribution task evaluation, with
italic indicating the baseline and bold indicating best performing methods for each language (baseline
and overall).
Embedding Base deu eng fra hun por slv srp avg
bert 0.7423 0.5966 0.4912 0.7403 0.6510 0.5170 0.5226 0.6087
word 0.9387 0.8588 0.7992 0.9904 0.7742 0.7259 0.7518 0.8341
pos 0.8611 0.7200 0.6485 0.8675 0.8167 0.7181 0.7364 0.7669
lemma 0.9391 0.8753 0.7951 0.9840 0.8588 0.7860 0.7414 0.8542
masked 0.7822 0.7017 0.7718 0.9433 0.8705 0.7377 0.7140 0.7887
baseline 1 0.9387 0.8588 0.7992 0.9904 0.7742 0.7259 0.7518 0.8341
baseline 2 0.9391 0.8753 0.7992 0.9904 0.8705 0.7860 0.7518 0.8589
mean 0.9579 0.8484 0.8551 0.9968 0.9163 0.7771 0.8120 0.8805
mean_b 0.9579 0.8493 0.8551 0.9968 0.9163 0.7771 0.8174 0.8814
max 0.9436 0.8698 0.8412 0.9952 0.8790 0.7979 0.7740 0.8715
max_b 0.9436 0.8698 0.8412 0.9952 0.8790 0.7979 0.7740 0.8715
min 0.8679 0.7199 0.6717 0.8741 0.8223 0.7212 0.7388 0.7737
min_b 0.7423 0.5966 0.4912 0.7403 0.6510 0.5170 0.5226 0.6087
product 0.9529 0.8364 0.8446 0.9933 0.9202 0.7775 0.8160 0.8773
product_b 0.9196 0.7734 0.7848 0.9789 0.8672 0.7947 0.8042 0.8461
l2-norm 0.9604 0.8509 0.8583 0.9984 0.9147 0.7777 0.8067 0.8810
l2-norm_b 0.9604 0.8509 0.8583 0.9984 0.9147 0.7777 0.8067 0.8810
weights_transfer 0.9660 0.8772 0.8644 0.9984 0.8866 0.7960 0.8220 0.8872
weights_transfer_b 0.9646 0.8770 0.8630 0.9984 0.8851 0.7988 0.8036 0.8844
weights_universal 0.9617 0.8658 0.8644 0.9937 0.8641 0.7934 0.8169 0.8800
weights_universal_b 0.9623 0.8735 0.8602 0.9952 0.8566 0.7878 0.8181 0.8791
The complete results for all metrics used in the evaluation (accuracy, precision, recall,
weighted and macro-averaged F1 -score) for each language and embedding method are
shown in the Appendix A Tables A1–A7.
The total improvement of each composite method over the primary and secondary
baseline scores is shown in percentages in Table 6, followed by its visual representation
in Figure 6, a heat map of the accuracy improvement of each composite method over the
primary (left) and the secondary (right) baseline, for each language inspected.
Figure 6. Heat map visualization representing the improvement of accuracy over primary (left) and
secondary baseline (right) for each language, with yellow meaning low, green meaning high, and
white meaning no improvement.
Table 6. Accuracy scores with percentual increase/decrease between the primary (upper) and
secondary (lower) baseline method and each composite method for each examined language, with
the highest improvements for each language indicated in bold.
Table 7. Percentual increase/decrease in accuracy when using the mBERT embeddings as composition
input grouped by composition method and language, with results omitted if there is no change.
Figure 7. Heat map visualization representing the accuracy improvement in including mBERT inputs
(left) and excluding them (right), with yellow meaning low, green meaning high, and white meaning
no improvement.
5. Discussion
According to the accuracy scores presented in Table 4, the best scores for the baseline
methods were divided mostly among word-based and lemma-based embeddings. Word-
based embeddings performed best for fra (0.7561), hun (0.9812), and srp (0.7279), while
lemma-based embeddings performed best for deu (0.9212), eng (0.8351) and slv (0.7729) for
accuracy. PoS-mask-based embeddings were best-performing only for por (0.8113) and
PoS-based embeddings matched the best accuracy score for srp (0.7279). These findings
undoubtedly answer RQ1:
RQ1: There is no single best document representation method suitable for the AA task across
European languages.
with all but mBERT-based embeddings marking the baseline for at least one language.
From the accuracy score improvement presented in the Table 6 (upper part) and its
visualization in Figure 6, it can be observed that most of our composite embeddings (min
being a clear outlier) outperform the primary baseline for most of the assessed languages,
with four methods improving accuracy for all languages both with and without mBERT
inputs. As for the more strict secondary baseline (represented by the best base method
accuracy for each language), the improvements are presented in Table 6 (lower part). Our
composition-based methods outperform this baseline by 10.4% for fra, 9.91% for srp, 8.37%
for por, 3.64% for deu, 2.43% for slv, 1.6% for hun, and 0.84% for eng using the respective
top-performing methods. Using the Newcombe-Wilson continuity-corrected test, we prove
the statistical significance of these results for at least four languages (fra, srp, por and deu),
while the improvements are present but debatable for the rest. In the case of hun, it should
be noted that the baseline was already at 0.9812 and, considering this, our method actually
reduced the error rate by 83% (from 0.0188 to 0.0031), which is an outstanding improvement.
As for slv, the statistical significance of improvement was corroborated only against the
primary baseline. With a definite improvement for at least four languages, these findings
answer RQ2:
RQ2: Several document embeddings can be combined in order to produce improved results for the
said task.
showing that they can indeed be used together in order to produce improved results for the
said task, and that this method outperforms the established baseline for most languages.
This is particularly significant given previous attempts at using lemmas and PoS as features,
described in the Introduction, which presented them as worse classifiers than most frequent
words.
The results of our weight-based combination methods, as presented in Table 6 and
Figure 6, demonstrate that adding weights to the inputs in a parallel architecture can induce
further improvements of the results.
The weights–trans f er method, based on training weights on one and then applying
them to distances from another language in a linear composition, was found to be the best
performing solution for four out of seven languages (deu, eng, fra, and slv), and it matched
the best solution for one language (hun). It was only outperformed by other compositions
for two languages (por and srp), where the best performing method was found to be product-
based simple composition. Note, however, that for srp the difference between the product
method and the weights–trans f er method was neglectable (0.8000 vs. 0.7967). With an
average improvement of 4.47% across all languages (Figure 8), weights–trans f er was found
to be the best performing composition method, giving the answer to RQ3:
RQ3: Adding weights to the inputs in a parallel architecture can induce further improvements of
the results.
Data from Table 7, as visualized in Figure 7, show that in a few cases the achievement
was gained by including deep learning-based transformations of the document in a parallel
architecture, with up to 2.19% for accuracy for slv in product_b over product. These results
address RQ4:
RQ4: Including deep learning-based transformations of the document in a parallel architecture can
improve the results of the said architecture.
however, most of these improvements are statistically insignificant and it is apparent that
for the majority of the methods there was no improvement in using mBERT. Moreover, the
results deteriorate when mBERT’s embeddings are composed with other methods, which is
most likely due to the model not being trained nor fine-tuned on this task [41,42].
It should also be noted that distances between documents produced by calculating
cosine similarity over mBERT outputs were by far lower (average of 0.0085) than the ones
produced by Stylo R package (average of 0.9998). This resulted in them being completely
ignored by the max composition method, and consequently made the results for max and
max_b identical. For the same reasons, the distances produced by mBERT were picked
for the min method every time, which resulted in mean_b being equal to bert (Table 4).
Arguably, this explains why the min method never outperforms the baseline. A similar
behaviour can be observed for the l2-norm method, where the final distance was squared.
This leads to even smaller values and thus exponentially decreases the impact of mBERT on
the final distance (resulting in equal results for l2-norm and l2-norm_b). The same remark
applies to the mean method, except that here the impact decreases linearly rather than
exponentially, which resulted in nearly identical results for mean and mean_b, as shown
in Table 7 and Figure 7. With the exception of the min method, the only opportunity for
the mBERT embeddings to actually influence the composite matrices were, firstly, the
product-based simple composition, where the distance was multiplied by the product of
all the other distances and, secondly, the weight-based methods, where the distance was
multiplied by its optimized weight. In the case of the product method, it was shown that it
negatively impacts the accuracy in six out of seven languages with a decrease of up to 8.63%
(Table 7). As for the weight-based embeddings, the results are split, with some methods
using the mBERT inputs outperforming the ones not using it. However, it must be noted
that the weights of the mBERT inputs were set to low negative (gravitating around −0.045)
during the training of all the 14 perceptrons using them, thus diminishing their impact on
the final scores.
A summary of the improvements is presented in Figure 8, where the best performing
composition methods were selected. The bar stands for the average percentual increase
of accuracy scores of the six methods across all seven languages, while the points stand
for the gain for each method and for each distinct language. It can be seen that the l2-
norm, with an average improvement of 3.8%, is the best performing simple composition
method. This is a valuable observation for AA tasks relying on limited resources, since the
aggregation of simple features does not involve external language models (e.g., mBERT
or trained weights), and requires less execution time. However, weights_trans f er is the
best performing method overall with 4.471% average improvement. This is also the only
method achieving improvements for each of our scrutinized languages.
Figure 8. Average accuracy score gain over the best baseline method across seven languages for
selected composition methods.
The benefits of this research mainly result from the use of a multilingual dataset, as
this marks an effort to verify methods using multiple languages, including ones that are
somewhat under-researched, particularly when it comes to stylometric applications (i.e.,
Hungarian, Serbian, Slovenian). Examining such a number of languages at a time provides
us with a rare opportunity to generalize our findings. Using such diverse, yet similarly
designed, corpora was possible thanks to the international COST Action, in which the
authors actively participate, and which led to the creation of comparable representative
corpora of European literature. This study further advances the Action’s initiatives towards
development of distant reading methods and tools, and towards the analysis of European literature at a distance. The use of multilingual corpora allowed us to conduct transfer
learning through document embeddings for one language, using the weights trained on
other languages’ datasets. The reasonable performance of the above method is in line with
the findings of another outcome of the Action [43], which found that different types of
information combined together improved the performance of BERT-based direct speech
classifier. Secondly, the use of ELTeC level–2 corpora, which contain rich grammatical
annotation that is encoded in a way that facilitates cross-language comparisons, allowed
us to use information about lemmas and PoS. By examining them both on their own,
and in combined embeddings, we were able to determine that combined lexical and
grammatical information outperforms traditional word-based approaches. Finally, the
paper also contributes to the efforts of making artificial intelligence and neural network
based stylometric endeavors more transparent. While mBERT-based classifications are
largely obtained in a black-box manner, the use of a shallow neural network in calculating
weights produces clear and interpretable values.
This research also brought one unexpected result: the discovery of an unknown author. Namely, the author of the novel Beogradske tajne (Belgrade’s Secrets) in the Serbian sub-collection had been unknown (according to the National and University libraries), but the computed distances suggested that the author is Todorović Pera. Further research showed that the same suspicion had been raised by some historians.
Future research will, firstly, extend the application to other languages, as more text
sub-collections are expected to be published within the same project. This would focus on
the use of the best performing simple composition methods and the previously trained
universal weights. Expanding the list of baseline methods with more different features
(e.g., character n-grams, word length, punctuation frequency) and using them in further
compositions, is also an obvious next step. We also expect that fine-tuning mBERT for the AA task should produce the expected results and allow for further investigation of RQ4, making it another area demanding further study.
Another aspect we intend to test is the effect of document lengths on our proposed
methods. Initial calibration tests, during which we settled on the fixed document length of 10,000 tokens, suggested an extended gap in accuracy scores between the baseline and composition
methods when dealing with shorter texts (2000 and 5000 tokens). We suspect that, since the
baseline accuracy is lower when working with shorter texts, there is an increased possibility
for improvement using the combination methods.
6. Conclusions
In this research, we tested standalone (word-based, lemma-based, PoS-based, and
PoS mask-based) and composition-based embeddings (derived from the standalone ones
using different methods of combining them on a matrix level, e.g., the mean, product, or
l2 norm of the matrices), compared them with one another, and found that for most of the
examined languages most of our methods outperform the baseline. We found that
our composition-based embeddings outperform the best baseline by a significant margin
for four languages: German, French, Portuguese, and Serbian, and also bring some
improvement for Hungarian and Slovenian. Our transfer-learning-based method
weights_transfer also outperformed the best baseline for every language, averaging
nearly a 5% improvement. On the other hand, we found no statistically significant impact of
Author Contributions: Conceptualization, M.Š. and R.S.; methodology, M.Š. and M.I.N.; software,
M.Š.; validation, M.E., M.I.N. and M.Š.; formal analysis, M.Š.; investigation, J.B.; resources, R.S.; data
curation, R.S.; writing—original draft preparation, M.Š.; writing—review and editing, all authors;
visualization, M.Š.; supervision, R.S. and M.E.; project administration, J.B.; funding acquisition, J.B.
All authors have read and agreed to the published version of the manuscript.
Funding: This research and work on the ELTeC corpora was performed in the framework of the
project Distant Reading for European Literary History, COST (European Cooperation in Science and
Technology) Action CA16204, with funding from the Horizon 2020 Framework Programme of the EU.
Data Availability Statement: All of the data produced by this experiment as well as the complete
code, which can be used to reproduce the results of the study, is publicly available as a repository
at https://github.com/procesaur/parallel-doc-embeds (accessed on 29 January 2022). Original
document collections (ELTeC level–2) that were used to create datasets for this experiment are
publicly available on GitHub for each language, respectively. https://github.com/COST-ELTeC/
ELTeC-deu/tree/master/level2 (accessed on 29 January 2022); https://github.com/COST-ELTeC/
ELTeC-eng/tree/master/level2 (accessed on 29 January 2022); https://github.com/COST-ELTeC/
ELTeC-fra/tree/master/level2 (accessed on 29 January 2022); https://github.com/COST-ELTeC/
ELTeC-hun/tree/master/level2 (accessed on 29 January 2022); https://github.com/COST-ELTeC/
ELTeC-por/tree/master/level2 (accessed on 29 January 2022); https://github.com/COST-ELTeC/
ELTeC-slv/tree/master/level2 (accessed on 29 January 2022); https://github.com/COST-ELTeC/
ELTeC-srp/tree/master/level2 (accessed on 29 January 2022).
Acknowledgments: The authors thank the numerous contributors to the ELTeC text collection,
especially the members of the COST Action CA16204 Distant Reading for European Literary History:
WG1 for preparing the text collections and WG2 for making the annotated versions available.
Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design
of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript,
or in the decision to publish the results.
Abbreviations
The following abbreviations are used in this manuscript:
AA Authorship Attribution
BERT Bidirectional Encoder Representations from Transformers
CNN Convolutional Neural Network
ELTeC European Literary Text Collection
k-NN k-Nearest Neighbors
mBERT Multilingual BERT
NLP Natural Language Processing
NLTK Natural Language Toolkit
PoS Part of Speech
TF-IDF Term Frequency–Inverse Document Frequency
Appendix A
Complete evaluation results (accuracy, precision, recall, and weighted and macro-averaged
F1-scores) for each language are presented below (Tables A1–A7).
Table A1. Complete evaluation results for German (deu).
Embedding Base    Accuracy    Precision    Recall    Weighted F1-Score    Macro F1-Score
bert 0.7129 0.8184 0.7129 0.7423 0.2582
word 0.9203 0.9790 0.9203 0.9387 0.4283
pos 0.8370 0.9154 0.8370 0.8611 0.3468
lemma 0.9212 0.9644 0.9212 0.9391 0.4815
masked_2 0.7346 0.8949 0.7346 0.7822 0.2979
mean 0.9420 0.9863 0.9420 0.9579 0.4915
mean_b 0.9420 0.9863 0.9420 0.9579 0.4915
max 0.9257 0.9781 0.9257 0.9436 0.4623
max_b 0.9257 0.9781 0.9257 0.9436 0.4623
min 0.8433 0.9215 0.8433 0.8679 0.3714
min_b 0.7129 0.8184 0.7129 0.7423 0.2582
product 0.9375 0.9826 0.9375 0.9529 0.4773
product_b 0.9058 0.9556 0.9058 0.9196 0.4609
l2-norm 0.9466 0.9856 0.9466 0.9604 0.5053
l2-norm_b 0.9466 0.9856 0.9466 0.9604 0.5053
weights_fra 0.9547 0.9871 0.9547 0.9660 0.5403
weights_fra_b 0.9538 0.9854 0.9538 0.9646 0.5520
weights_universal-deu 0.9475 0.9861 0.9475 0.9617 0.5387
weights_universal_b-deu 0.9484 0.9861 0.9484 0.9623 0.5571
Table A2. Complete evaluation results for English (eng).
Embedding Base    Accuracy    Precision    Recall    Weighted F1-Score    Macro F1-Score
bert 0.5561 0.6796 0.5561 0.5966 0.1134
word 0.8175 0.9482 0.8175 0.8588 0.2215
pos 0.6632 0.8415 0.6632 0.72 0.133
lemma 0.8351 0.9645 0.8351 0.8753 0.217
masked_2 0.6439 0.8261 0.6439 0.7017 0.1329
mean 0.8158 0.9244 0.8158 0.8484 0.2052
mean_b 0.8175 0.9248 0.8175 0.8493 0.2166
max 0.8298 0.9523 0.8298 0.8698 0.2047
max_b 0.8298 0.9523 0.8298 0.8698 0.2047
min 0.6649 0.8385 0.6649 0.7199 0.1329
min_b 0.5561 0.6796 0.5561 0.5966 0.1134
product 0.8035 0.9223 0.8035 0.8364 0.2031
product_b 0.7474 0.8679 0.7474 0.7734 0.1717
l2-norm 0.8193 0.9236 0.8193 0.8509 0.2232
l2-norm_b 0.8193 0.9236 0.8193 0.8509 0.2232
weights_slv 0.8421 0.9487 0.8421 0.8772 0.2419
weights_slv_b 0.8404 0.9519 0.8404 0.877 0.235
weights_universal-eng 0.8316 0.9451 0.8316 0.8658 0.2292
weights_universal_b-eng 0.8386 0.9464 0.8386 0.8735 0.2405
Table A3. Complete evaluation results for French (fra).
Embedding Base    Accuracy    Precision    Recall    Weighted F1-Score    Macro F1-Score
bert 0.4444 0.6286 0.4444 0.4912 0.1390
word 0.7561 0.8832 0.7561 0.7992 0.2927
pos 0.6125 0.7563 0.6125 0.6485 0.2210
lemma 0.7507 0.9025 0.7507 0.7951 0.2842
masked_2 0.7046 0.8886 0.7046 0.7718 0.2539
mean 0.8238 0.9272 0.8238 0.8551 0.3757
mean_b 0.8238 0.9278 0.8238 0.8551 0.3755
max 0.8049 0.9298 0.8049 0.8412 0.2938
max_b 0.8049 0.9298 0.8049 0.8412 0.2938
min 0.6341 0.7625 0.6341 0.6717 0.2422
min_b 0.4444 0.6286 0.4444 0.4912 0.1390
product 0.8157 0.9164 0.8157 0.8446 0.3711
product_b 0.7453 0.8677 0.7453 0.7848 0.2987
l2-norm 0.8293 0.9276 0.8293 0.8583 0.3882
l2-norm_b 0.8293 0.9276 0.8293 0.8583 0.3882
weights_slv 0.8347 0.9295 0.8347 0.8644 0.3943
weights_slv_b 0.8320 0.9294 0.8320 0.8630 0.3710
weights_universal-fra 0.8347 0.9295 0.8347 0.8644 0.3827
weights_universal_b-fra 0.8293 0.9269 0.8293 0.8602 0.3699
Table A4. Complete evaluation results for Hungarian (hun).
Embedding Base    Accuracy    Precision    Recall    Weighted F1-Score    Macro F1-Score
bert 0.6991 0.8204 0.6991 0.7403 0.2987
word 0.9812 1.0000 0.9812 0.9904 0.6423
pos 0.8088 0.9596 0.8088 0.8675 0.2905
lemma 0.9781 0.9909 0.9781 0.9840 0.7219
masked_2 0.9185 0.9822 0.9185 0.9433 0.4289
mean 0.9937 1.0000 0.9937 0.9968 0.8433
mean_b 0.9937 1.0000 0.9937 0.9968 0.8433
max 0.9906 1.0000 0.9906 0.9952 0.7834
max_b 0.9906 1.0000 0.9906 0.9952 0.7834
min 0.8150 0.9661 0.8150 0.8741 0.2860
min_b 0.6991 0.8204 0.6991 0.7403 0.2987
product 0.9875 1.0000 0.9875 0.9933 0.7791
product_b 0.9718 0.9887 0.9718 0.9789 0.8254
l2-norm 0.9969 1.0000 0.9969 0.9984 0.9157
l2-norm_b 0.9969 1.0000 0.9969 0.9984 0.9157
weights_srp 0.9969 1.0000 0.9969 0.9984 0.9157
weights_srp_b 0.9969 1.0000 0.9969 0.9984 0.9157
weights_universal-hun 0.9906 0.9970 0.9906 0.9937 0.8412
weights_universal_b-hun 0.9906 1.0000 0.9906 0.9952 0.7826
Table A5. Complete evaluation results for Portuguese (por).
Embedding Base    Accuracy    Precision    Recall    Weighted F1-Score    Macro F1-Score
bert 0.5925 0.8853 0.5925 0.6510 0.1152
word 0.7245 0.9051 0.7245 0.7742 0.2112
pos 0.7509 0.9646 0.7509 0.8167 0.1709
lemma 0.8000 0.9688 0.8000 0.8588 0.2826
masked_2 0.8113 0.9752 0.8113 0.8705 0.2083
mean 0.8717 0.9861 0.8717 0.9163 0.3076
mean_b 0.8717 0.9861 0.8717 0.9163 0.3076
max 0.8226 0.9673 0.8226 0.8790 0.2778
max_b 0.8226 0.9673 0.8226 0.8790 0.2778
min 0.7547 0.9735 0.7547 0.8223 0.1724
min_b 0.5925 0.8853 0.5925 0.6510 0.1152
product 0.8792 0.9863 0.8792 0.9202 0.3333
product_b 0.8189 0.9572 0.8189 0.8672 0.2722
l2-norm 0.8717 0.9811 0.8717 0.9147 0.3493
l2-norm_b 0.8717 0.9811 0.8717 0.9147 0.3493
weights_slv 0.8415 0.9626 0.8415 0.8866 0.3250
weights_slv_b 0.8415 0.9593 0.8415 0.8851 0.3108
weights_universal-por 0.8151 0.9523 0.8151 0.8641 0.2920
weights_universal_b-por 0.8075 0.9473 0.8075 0.8566 0.2774
Table A6. Complete evaluation results for Slovenian (slv).
Embedding Base    Accuracy    Precision    Recall    Weighted F1-Score    Macro F1-Score
bert 0.5042 0.5991 0.5042 0.5170 0.2819
word 0.7188 0.8148 0.7188 0.7259 0.4221
pos 0.6958 0.7906 0.6958 0.7181 0.4226
lemma 0.7729 0.8530 0.7729 0.7860 0.5034
masked_2 0.7208 0.8093 0.7208 0.7377 0.4866
mean 0.7646 0.8603 0.7646 0.7771 0.5455
mean_b 0.7646 0.8603 0.7646 0.7771 0.5455
max 0.7854 0.8622 0.7854 0.7979 0.4819
max_b 0.7854 0.8622 0.7854 0.7979 0.4819
min 0.6979 0.7941 0.6979 0.7212 0.4363
min_b 0.5042 0.5991 0.5042 0.5170 0.2819
product 0.7625 0.8600 0.7625 0.7775 0.5482
product_b 0.7792 0.8657 0.7792 0.7947 0.4955
l2-norm 0.7646 0.8602 0.7646 0.7777 0.5455
l2-norm_b 0.7646 0.8602 0.7646 0.7777 0.5455
weights_eng 0.7854 0.8692 0.7854 0.7960 0.5411
weights_eng_b 0.7917 0.8600 0.7917 0.7988 0.4956
weights_universal-slv 0.7812 0.8743 0.7812 0.7934 0.5398
weights_universal_b-slv 0.7771 0.8668 0.7771 0.7878 0.5530
Table A7. Complete evaluation results for Serbian (srp).
Embedding Base    Accuracy    Precision    Recall    Weighted F1-Score    Macro F1-Score
bert 0.4918 0.6537 0.4918 0.5226 0.2441
word 0.7279 0.8941 0.7279 0.7518 0.4200
pos 0.7279 0.8312 0.7279 0.7364 0.3765
lemma 0.7082 0.8973 0.7082 0.7414 0.3824
masked_2 0.7016 0.7880 0.7016 0.7140 0.3676
mean 0.7967 0.9692 0.7967 0.8120 0.5383
mean_b 0.8000 0.9692 0.8000 0.8174 0.5404
max 0.7541 0.8896 0.7541 0.7740 0.4676
max_b 0.7541 0.8896 0.7541 0.7740 0.4676
min 0.7344 0.8360 0.7344 0.7388 0.3888
min_b 0.4918 0.6537 0.4918 0.5226 0.2441
product 0.8000 0.9668 0.8000 0.8160 0.5389
product_b 0.7836 0.9406 0.7836 0.8042 0.4453
l2-norm 0.7869 0.9622 0.7869 0.8067 0.5310
l2-norm_b 0.7869 0.9622 0.7869 0.8067 0.5310
weights_fra 0.7967 0.9709 0.7967 0.8220 0.5331
weights_hun_b 0.7869 0.9428 0.7869 0.8036 0.5098
weights_universal-srp 0.7934 0.9655 0.7934 0.8169 0.5464
weights_universal_b-srp 0.7934 0.9673 0.7934 0.8181 0.5325
References
1. Moretti, F. Conjectures on World Literature. New Left Rev. 2000, 1, 54–68.
2. El, S.E.M.; Kassou, I. Authorship analysis studies: A survey. Int. J. Comput. Appl. 2014, 86, 22–29.
3. Camps, J.B.; Clérice, T.; Pinche, A. Stylometry for Noisy Medieval Data: Evaluating Paul Meyer’s Hagiographic Hypothesis.
arXiv 2020, arXiv:2012.03845.
4. Stamatatos, E.; Koppel, M. Plagiarism and authorship analysis: Introduction to the special issue. Lang. Resour. Eval. 2011, 45, 1–4.
[CrossRef]
5. Yang, M.; Chow, K.P. Authorship Attribution for Forensic Investigation with Thousands of Authors. In Proceedings of the ICT
Systems Security and Privacy Protection; Cuppens-Boulahia, N., Cuppens, F., Jajodia, S., Abou El Kalam, A., Sans, T., Eds.; Springer:
Berlin/Heidelberg, Germany, 2014; pp. 339–350.
6. Iqbal, F.; Binsalleeh, H.; Fung, B.C.; Debbabi, M. Mining writeprints from anonymous e-mails for forensic investigation. Digit.
Investig. 2010, 7, 56–64. [CrossRef]
7. Mendenhall, T.C. The characteristic curves of composition. Science 1887, 11, 237–246. [CrossRef]
8. Mosteller, F.; Wallace, D.L. Inference & Disputed Authorship: The Federalist; CSLI Publications: Stanford, CA, USA, 1964.
9. Stamatatos, E. A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 2009, 60, 538–556. [CrossRef]
10. Jockers, M.L.; Witten, D.M. A comparative study of machine learning methods for authorship attribution. Lit. Linguist. Comput.
2010, 25, 215–223. [CrossRef]
11. Burrows, J. ‘Delta’: A Measure of Stylistic Difference and a Guide to Likely Authorship. Lit. Linguist. Comput. 2002, 17, 267–287.
[CrossRef]
12. Evert, S.; Proisl, T.; Vitt, T.; Schöch, C.; Jannidis, F.; Pielström, S. Towards a better understanding of Burrows’s Delta in literary
authorship attribution. In Proceedings of the Fourth Workshop on Computational Linguistics for Literature, Denver, CO, USA, 4
June 2015; pp. 79–88.
13. Evert, S.; Proisl, T.; Schöch, C.; Jannidis, F.; Pielström, S.; Vitt, T. Explaining Delta, or: How do distance measures for authorship
attribution work? Presented at Corpus Linguistics 2015, Lancaster, UK, 21–24 July 2015.
14. Kestemont, M. Function Words in Authorship Attribution. From Black Magic to Theory? In Proceedings of the 3rd Workshop on
Computational Linguistics for Literature (CLfL@EACL), Gothenburg, Sweden, 27 April 2014; pp. 59–66.
15. Sarwar, R.; Li, Q.; Rakthanmanon, T.; Nutanong, S. A scalable framework for cross-lingual authorship identification. Inf. Sci. 2018, 465, 323–339.
[CrossRef]
16. Rybicki, J.; Eder, M. Deeper Delta across genres and languages: Do we really need the most frequent words? Lit. Linguist. Comput.
2011, 26, 315–321. [CrossRef]
17. Górski, R.; Eder, M.; Rybicki, J. Stylistic fingerprints, POS tags and inflected languages: A case study in Polish. In Proceedings of
the Qualico 2014: Book of Abstracts; Palacky University: Olomouc, Czech Republic, 2014; pp. 51–53.
18. Eder, M.; Byszuk, J. Feature selection in authorship attribution: Ordering the wordlist. In Digital Humanities 2019: Book of Abstracts;
Utrecht University: Utrecht, The Netherlands, 2019; Chapter 0930, p. 1.
19. Kestemont, M.; Luyckx, K.; Daelemans, W. Intrinsic Plagiarism Detection Using Character Trigram Distance Scores—Notebook
for PAN at CLEF 2011. In Proceedings of the CLEF 2011 Labs and Workshop, Notebook Papers, Amsterdam, The Netherlands,
19–22 September 2011.
20. Weerasinghe, J.; Greenstadt, R. Feature Vector Difference based Neural Network and Logistic Regression Models for Authorship
Verification. In Proceedings of the Notebook for PAN at CLEF 2020, Thessaloniki, Greece, 22–25 September 2020; Volume 2695.
21. Eder, M.; Rybicki, J.; Kestemont, M. Stylometry with R: A package for computational text analysis. R J. 2016, 8, 107–121. [CrossRef]
22. Kocher, M.; Savoy, J. Distributed language representation for authorship attribution. Digit. Scholarsh. Humanit. 2018, 33, 425–441.
[CrossRef]
23. Salami, D.; Momtazi, S. Recurrent convolutional neural networks for poet identification. Digit. Scholarsh. Humanit. 2020,
36, 472–481. [CrossRef]
24. Segarra, S.; Eisen, M.; Ribeiro, A. Authorship Attribution Through Function Word Adjacency Networks. Trans. Sig. Proc. 2015,
63, 5464–5478. [CrossRef]
25. Marinho, V.Q.; Hirst, G.; Amancio, D.R. Authorship Attribution via Network Motifs Identification. In Proceedings of the 2016
5th Brazilian Conference on Intelligent Systems (BRACIS), Recife, Brazil, 9–12 October 2016; pp. 355–360.
26. Stamatatos, E.; Daelemans, W.; Verhoeven, B.; Juola, P.; López-López, A.; Potthast, M.; Stein, B. Overview of the Author
Identification Task at PAN 2014. CLEF (Work. Notes) 2014, 1180, 877–897.
27. Akimushkin, C.; Amancio, D.R.; Oliveira, O.N. On the role of words in the network structure of texts: Application to authorship
attribution. Phys. A Stat. Mech. Its Appl. 2018, 495, 49–58. [CrossRef]
28. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
arXiv 2018, arXiv:1810.04805.
29. Iyer, A.; Vosoughi, S. Style Change Detection Using BERT—Notebook for PAN at CLEF 2020. In Proceedings of the CLEF 2020
Labs and Workshops, Notebook Papers, Thessaloniki, Greece, 22–25 September 2020; Cappellato, L., Eickhoff, C., Ferro, N.,
Névéol, A., Eds.; CEUR-WS: Aachen, Germany, 2020.
30. Fabien, M.; Villatoro-Tello, E.; Motlicek, P.; Parida, S. BertAA: BERT fine-tuning for Authorship Attribution. In Proceedings of the
17th International Conference on Natural Language Processing (ICON), Patna, India, 18–21 December 2020; NLP Association of
India (NLPAI): Indian Institute of Technology Patna: Patna, India, 2020; pp. 127–137.
31. Burnard, L.; Schöch, C.; Odebrecht, C. In search of comity: TEI for distant reading. J. Text Encoding Initiat. 2021, 2021, 1–21.
[CrossRef]
32. Schöch, C.; Patras, R.; Erjavec, T.; Santos, D. Creating the European Literary Text Collection (ELTeC): Challenges and Perspectives.
Mod. Lang. Open 2021, 1, 25. [CrossRef]
33. Kilgarriff, A.; Rychly, P.; Smrz, P.; Tugwell, D. The Sketch Engine. In Proceedings of the Eleventh EURALEX International
Congress, Lorient, France, 6–10 July 2004; pp. 105–116.
34. Kilgarriff, A.; Baisa, V.; Bušta, J.; Jakubíček, M.; Kovář, V.; Michelfeit, J.; Rychlý, P.; Suchomel, V. The Sketch Engine: Ten years on.
Lexicography 2014, 1, 7–36. [CrossRef]
35. Embarcadero-Ruiz, D.; Gómez-Adorno, H.; Embarcadero-Ruiz, A.; Sierra, G. Graph-Based Siamese Network for Authorship
Verification. Mathematics 2022, 10, 277. [CrossRef]
36. Eder, M. Does Size Matter? Authorship Attribution, Small Samples, Big Problem. In Digital Humanities 2010: Conference Abstracts;
King’s College London: London, UK, 2010; pp. 132–134.
37. Eder, M. Style-markers in authorship attribution: A cross-language study of the authorial fingerprint. Stud. Pol. Linguist. 2011,
6, 99–114.
38. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on
Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015; Conference Track Proceedings; Bengio, Y., LeCun, Y.,
Eds.; International Conference on Representation Learning (ICLR): La Jolla, CA, USA, 2015; pp. 1–15.
39. Eder, M.; Piasecki, M.; Walkowiak, T. An open stylometric system based on multilevel text analysis. Cogn. Stud. Études Cognitives
2017, 17, 1–26. [CrossRef]
40. Newcombe, R.G. Estimation for the difference between independent proportions: comparison of eleven methods. Stat. Med.
1998, 17, 873–890. [CrossRef]
41. Ehrmanntraut, A.; Hagen, T.; Konle, L.; Jannidis, F. Type-and Token-based Word Embeddings in the Digital Humanities. In
Proceedings of the Conference on Computational Humanities Research 2021, Amsterdam, The Netherlands, 17–19 November
2021; Volume 2989, pp. 16–38.
42. Brunner, A.; Tu, N.D.T.; Weimer, L.; Jannidis, F. To BERT or not to BERT-Comparing Contextual Embeddings in a Deep Learning
Architecture for the Automatic Recognition of four Types of Speech, Thought and Writing Representation. In Proceedings of the
5th Swiss Text Analytics Conference (SwissText) and 16th Conference on Natural Language Processing (KONVENS), Zurich,
Switzerland, 23–25 June 2020; pp. 1–11.
43. Byszuk, J.; Woźniak, M.; Kestemont, M.; Leśniak, A.; Łukasik, W.; Šeļa, A.; Eder, M. Detecting direct speech in multilingual
collection of 19th-century novels. In Proceedings of the LT4HALA 2020-1st Workshop on Language Technologies for Historical
and Ancient Languages, Marseille, France, 11–16 May 2020; pp. 100–104.
mathematics
Article
Unsupervised and Supervised Methods to Estimate
Temporal-Aware Contradictions in Online Course Reviews
Ismail Badache, Adrian-Gabriel Chifu * and Sébastien Fournier
Department of Computer Science, Aix Marseille Université, CNRS, LIS, 13007 Marseille, France;
[email protected] (I.B.); [email protected] (S.F.)
* Correspondence: [email protected]
Abstract: The analysis of user-generated content on the Internet has become increasingly popular for
a wide variety of applications. One particular type of content is represented by the user reviews for
programs, multimedia, products, and so on. Investigating the opinion contained by reviews may
help in following the evolution of the reviewed items and thus in improving their quality. Detecting
contradictory opinions in reviews is crucial when evaluating the quality of the respective resource.
This article aims to estimate the contradiction intensity (strength) in the context of online courses
(MOOC). This estimation was based on review ratings and on sentiment polarity in the comments,
with respect to specific aspects, such as “lecturer”, “presentation”, etc. Between course sessions, users
stop reviewing, and also, the course contents may evolve. Thus, the reviews are time dependent,
and this is why they should be considered grouped by the course sessions. Having this in mind, the
contribution of this paper is threefold: (a) defining the notion of subjective contradiction around
specific aspects and then estimating its intensity based on sentiment polarity, review ratings, and
temporality; (b) developing a dataset to evaluate the contradiction intensity measure, which was
annotated based on a user study; (c) comparing our unsupervised method with supervised methods
with automatic feature selection, over the dataset. The dataset collected from coursera.org is in
English. It includes 2244 courses and 73,873 user-generated reviews of those courses. The results
proved that the standard deviation of the ratings, the standard deviation of the polarities, and the
number of reviews are suitable features for predicting the contradiction intensity classes. Among the
supervised methods, the J48 decision trees algorithm yielded the best performance, compared to the
naive Bayes model and the SVM model.
Citation: Badache, I.; Chifu, A.-G.; Fournier, S. Unsupervised and Supervised Methods to Estimate
Temporal-Aware Contradictions in Online Course Reviews. Mathematics 2022, 10, 809.
https://doi.org/10.3390/math10050809
Academic Editors: Florentina Hristea, Cornelia Caragea and Ioannis G. Tsoulos
Keywords: sentiment analysis; aspect detection; temporality; rating; feature evaluation;
contradiction intensity
was a contradiction among the opinions on the same aspect, and then to measure the
intensity of this contradiction. This measure gives the user reading the reviews a metric
indicating whether the reviews all (or almost all) point in the same direction, positive
or negative, or whether there is a large divergence of opinion on a specific aspect, in which
case it is difficult to tell whether that aspect is rated positively or negatively. The measure
thus enabled us to alert the user to the points of disagreement present in the comments for
particular aspects, highlighting the aspects that are most subject to each person's
appreciation. Table 1 shows an example of contradictory comments from an online course,
concerning the "Lesson" aspect.
Table 1. Contradictory opinions example around the “Lesson” aspect, with polarities (Pol.) and
ratings (Rat.).
Source | Text on the Left | Aspect | Text on the Right | Pol. | Rat.
Course | I thought the | Lesson | were really boring, never enjoyable | −0.9 | 2
Course | I enjoyed very much the | Lesson | and I had a very good time | +0.9 | 5
Discovering the subject of the controversy is then often seen as a problem of classification
aiming to find out which documents, paragraphs, or sentences are controversial and which
are not or to discover the subject itself. Balasubramanyan et al. [20] used a semi-supervised
latent variable model to detect the topics and the degree of polarization these topics caused.
Dori-Hacohen and Allan [21,22] treated the problem as a binary classification: whether the
web page has a controversial topic or not. To perform this classification, the authors looked
for Wikipedia pages that corresponded to the given web page, but displayed a degree of con-
troversy. Garimella et al. [23] constructed a conversation graph on a topic, then partitioned
the conversation graph to identify potential points of controversy, and measured the contro-
versy’s level based on the characteristics of the graph. Guerra et al. [24] constructed a metric
based on the analysis of the boundary in a graph between two communities. The metric
was applied to the analysis of communities on Twitter by constructing graphs from retweets.
Jang and Allan [25] constructed a controversial language model based on DBpedia. Lin and
Hauptmann [26] proposed to measure the proportion, if any, by which two collections of
documents were different. In order to quantify this proportion, they used a measure based
on statistical distribution divergence. Popescu and Pennacchiotti [19] proposed to detect
controversies on Twitter. Three different models were suggested. They were all based on
supervised learning using linear regression. Sriteja et al. [27] performed an analysis of the
reaction of social media users to press articles dealing with controversial issues. In particu-
lar, they used sentiment analysis and word matching to accomplish this task. Other works
sought to quantify controversies. For instance, Morales et al. [28] quantified polarity via
the propagation of opinions of influential users on Twitter. Garimella et al. [29] proposed
the use of a graph-based measure by measuring the level of separation of communities
within the graph.
Quite close to our work is the concept of "point of view", also known in the literature
as the notion of “collective opinions”, where a collective opinion is the set of ideas shared
by a group. There is also a proximity with the work on the notion of controversy, but
with generally less opposition in the case of “points of view” than in that of “controversy”.
In a sense, the notion of points of view can also be seen as a controversy on a smaller
scale. Among the significant works on the concept of “points of view” is [30], which used
the multi-view Latent Dirichlet Allocation (LDA) model. In addition to topic modeling
at the word level, as LDA performs, the model uses a variable that gives the point of
view at the document level. This model was applied to the discovery of points of view in
essays. Cohen and Ruths [31] developed a supervised-learning-based system for point of
view detection in social media. The approach treats viewpoint detection as a classification
problem. The model used to perform this task was an SVM. Similarly, Conover et al. [32]
developed a system based on SVM and took into account social interactions in social
networks in order to classify viewpoints. Paul and Girju [33] used the Topic-Aspect Model
(TAM) by hijacking the model using aspects as viewpoints. The authors of [34] used an
unsupervised approach inspired by LDA, based on Dirichlet distributions and discrete
variables, to identify the users’ point of view. Trabelsi and Zaïane [35] used the Joint Topic
Viewpoint (JTV) model, which jointly models themes and viewpoints. JTV defines themes
and viewpoint assignments at the word level and viewpoint distributions at the document
level. JTV considers all words as opinion words, without distinguishing between opinion
words and topic words. Thonet et al. [36] presented VODUM, an unsupervised topic model
designed to jointly discover viewpoints, topics, and opinions in text.
A line of research that is also relatively similar to our work, even if it does not always
involve the notion of subjectivity, is the problem of detecting expressions of restraint or
disagreement. This problem has been widely addressed in
the literature. In particular, Galley et al. [37] used a maximum entropy classifier. They
the literature. In particular, Galley et al. [37] used a maximum entropy classifier. They
first identified adjacent pairs using a classification based on maximum entropy from a
set of lexical, temporal, and structural characteristics. They then ranked these pairs as
agreement or disagreement. Menini and Tonelli [38] used an SVM. They also used different
characteristics based on the feelings expressed in the text (negative or positive). They also
used semantic features (word embeddings, cosine similarity, and entailment). Mukherjee
and Liu [39] adopted a semi-supervised approach to identify expressions of contention in
discussion forums.
Another closely related line of research concerns position (stance) detection. Stance
detection is a classification problem where the position of the author of a text is obtained
in the form of a category label from the set: favorable, against, neither. Among the works
on the notion of “stance”, Mohammad et al. [40,41] used an SVM and relatively simple
features based on N-grams of words and characters. In [42], the authors used a bi-Long
Short-Term Memory (LSTM) in order to detect the position of the author of the text. The
input of their model was a word embedding based on Word2Vec. Gottopati et al. [43] used
an unsupervised approach to detect the position of an author. In order to perform this task,
they used a template based on collapsed Gibbs sampling. Johnson and Goldwasser [44]
used a weakly supervised method to extract the way questions are formulated and to extract
the temporal activity patterns of politicians on Twitter. Their method was applied to the
tweets of popular politicians and issues related to the 2016 election. Qiu et al. [34,45] used
a regression-based latent factor model, which jointly models user arguments, interactions,
and attributes. Somasundaran and Wiebe [46] used an SVM employing characteristics based
on feelings and argumentation of opinions, as well as targets of feelings and argumentation.
For more details on this topic, refer to [47].
Our work is relatively close to the detection and analysis of points of view, albeit
with some differences. We focused on opposing points of view. Indeed, our subject
of study relates to the subjective oppositions expressed by several individuals. It is this
subjective opposition, with the formulation of an opinion using feelings, which we call
“contradiction” within the article, that we tried to capture. Our work can also be considered
close to the work on stance since we looked at oppositions. However, the observed
oppositions were not the same since they were not favorable or unfavorable toward an assertion,
but rather positive or negative about an aspect. The main difference is yet again in the
strong expression of subjectivity, which may be absent in the expression of stance. We
did not consider our work as being exactly in the domain of controversies since there was
no constructed argumentation. Indeed, we considered in our research reviews that were
independent of each other, and we were not in the case of a discussion as in a forum, for
example. Moreover, unlike most other authors, we did not only try to find out if there was a
contradiction among several individuals. We measured the strength of this contradiction
in order to obtain a level in the contradiction evaluation.
3.1. Preprocessing
Two dimensions were combined to measure the strength of the disagreement during
a session: the polarity around the aspect and the rating linked with the review. Together,
they define the so-called “review-aspect”. We utilized a dispersion function based on these
dimensions to measure the intensity of disagreement between opposing viewpoints.
1. We computed a threshold that corresponds to the duration of the jump. This was
performed on a per course basis and was based on the average time gaps between
reviews (for instance, there was a gap of 35 d for the “Engagement and Nurture
Marketing Strategies” lecture);
2. We grouped the reviews with respect to the above-mentioned threshold, on a per
course basis;
3. We kept only the important sessions by suppressing the so-called “false sessions”
(sessions that contained only a very low number of reviews).
Only the review clusters that had a significant number of reviews were considered for
the evaluation. For instance, clusters resulting after the use of K-means [56] that contained
only one or two reviews were discarded.
Figure 2. Review distribution with respect to the time dimension, for the lecture entitled “Engagement
and Nurture Marketing Strategies”.
Algorithm 1 describes the review groupings with respect to the course sessions. The
next preprocessing step was the feature extraction for the review groups.
Example 1. Let D be a resource (document, e.g., course) and re its associated review. Table 2
illustrates the five steps for aspect extraction from a review.
Algorithm 1: The reviews of a resource are grouped according to the time period (session) in which they were written.
Input: Days_Threshold (DsT), List_Reviews (LRs)
Output: Groups_of_Reviews (GRs)
1  GRs ← ∅ ;   // Creation and initialization of the output list of groups of reviews generated for a specific resource (in our case, a "course")
   // C1 and C2 are the centroids of each of the k clusters (Cluster1 and Cluster2), i.e., the sufficient/deficient reviews groups
19 if C1 > C2 then
20     Target_Cluster = Cluster1;
21 else
22     Target_Cluster = Cluster2;
23 end
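Since the intermediate steps of Algorithm 1 are not reproduced here, the following Python sketch only illustrates the grouping procedure described above (a per-course threshold from the average gap between reviews, splitting at large gaps, and discarding "false sessions" via 2-means clustering of the session sizes); the function and variable names are ours, not the authors'.

# Illustrative sketch of the session grouping described above (not the exact Algorithm 1).
import numpy as np
from sklearn.cluster import KMeans

def group_reviews_by_session(reviews):
    """reviews: list of (timestamp, text) tuples with datetime timestamps, sorted in time."""
    times = [t for t, _ in reviews]
    gaps = [(b - a).days for a, b in zip(times, times[1:])]
    threshold = np.mean(gaps) if gaps else 0            # per-course jump duration (e.g., ~35 days)

    sessions, current = [], [reviews[0]]
    for prev, cur in zip(reviews, reviews[1:]):
        if (cur[0] - prev[0]).days > threshold:          # a gap longer than the threshold starts a new session
            sessions.append(current)
            current = []
        current.append(cur)
    sessions.append(current)

    if len(sessions) < 2:
        return sessions
    sizes = np.array([[len(s)] for s in sessions], dtype=float)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sizes)
    keep = int(np.argmax(km.cluster_centers_.ravel()))   # cluster whose centroid has more reviews
    return [s for s, lab in zip(sessions, km.labels_) if lab == keep]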
Table 2 depicts the five steps. First, we computed the frequencies of the terms in the
review set (as an example, the terms “course”, “material”, “assignment”, “content”, and
“lecturer” occurred 44,219, 3286, 3118, 2947, and 2705 times, respectively). Secondly, we
grammatically labeled each word (“NN” meaning singular noun and “NNS” meaning
plural noun). Thirdly, only nominal category terms were selected. Fourthly, we retained
only the nouns surrounded by terms belonging to the SentiWordNet dictionary (“Michael is
a wonderful lecturer delivering the lessons in an easy to understand manner.”). Finally, we
considered as useful aspects only those nouns that were among the most frequent in the
corpus of reviews (the useful aspects in these reviews were lecturer, lesson, presentation, slide,
and assignment).
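As an illustration of these five steps, the following Python sketch (our own simplification, not the authors' implementation) uses NLTK part-of-speech tagging and a tiny sentiment word list as a stand-in for the SentiWordNet dictionary; the window size and the number of retained aspects are illustrative.

# Simplified sketch of the five aspect-extraction steps described above.
from collections import Counter
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' resources

SENTIMENT_WORDS = {"wonderful", "boring", "easy", "enjoyable", "good", "bad"}  # stand-in lexicon

def extract_aspects(reviews, top_n=22, window=3):
    tokens = [nltk.word_tokenize(r.lower()) for r in reviews]
    freqs = Counter(w for toks in tokens for w in toks)             # step 1: term frequencies
    candidates = set()
    for toks in tokens:
        for i, (word, tag) in enumerate(nltk.pos_tag(toks)):        # step 2: PoS labelling
            if tag not in ("NN", "NNS"):                            # step 3: keep nominal terms only
                continue
            context = toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]
            if SENTIMENT_WORDS & set(context):                      # step 4: a sentiment word nearby
                candidates.add(word)
    return sorted(candidates, key=lambda w: freqs[w], reverse=True)[:top_n]  # step 5: most frequent nouns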
After constructing the aspect list characterizing the dataset, the sentiment polarity
must be computed. The sentiment analysis method used for this is described in the
following section.
Definition 1. There is a contradiction between two portions of review-aspects ra1 and ra2 containing
an aspect, where ra1, ra2 ∈ D (document), when the opinions (polarities) around the aspect are
opposite (i.e., pol(ra1) × pol(ra2) ≤ 0). After several empirical experiments, we defined the
review-aspect ra as a five-word snippet before and after the aspect in review re.
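A direct reading of Definition 1 can be sketched as follows (illustrative only; the polarity values are assumed to come from the sentiment analysis step).

# Sketch of Definition 1: a review-aspect is a five-word snippet around the aspect term,
# and two review-aspects contradict when their polarities are of opposite sign.
def review_aspect(review_tokens, aspect, k=5):
    """Snippet of k words before and after the first occurrence of the aspect term."""
    i = review_tokens.index(aspect)
    return review_tokens[max(0, i - k):i + k + 1]

def contradict(pol_ra1, pol_ra2):
    """True when the two polarities are opposite (or one of them is neutral)."""
    return pol_ra1 * pol_ra2 <= 0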
Contradiction intensity was estimated using two dimensions: the polarity pol_i and the rating
rat_i of the review-aspect ra_i. Let each ra_i be a point on the plane with coordinates (pol_i, rat_i).
Our hypothesis was that the greater the distance (i.e., dispersion) between the values related
to each review-aspect ra_i of the same document D, the greater the contradiction intensity.
The dispersion indicator with respect to the centroid ra_centroid, with coordinates
$(\overline{pol}, \overline{rat})$, is as follows:

$$Disp(ra_i^{pol,rat}, D) = \frac{1}{n} \sum_{i=1}^{n} Distance(pol_i, rat_i) \qquad (1)$$

$$Distance(pol_i, rat_i) = \sqrt{(pol_i - \overline{pol})^2 + (rat_i - \overline{rat})^2} \qquad (2)$$

$Distance(pol_i, rat_i)$ represents the distance between the point ra_i of the scatter plot
and the centroid ra_centroid (see Figure 3), and n is the number of ra_i. The two quantities pol_i
and rat_i are represented on different scales; thus, their normalization becomes necessary.
Since the polarity pol_i is normalized by design, we only needed to normalize the rating
values. We propose the following equation for normalization: $rat_i = \frac{rat_i - 3}{2}$, so that $rat_i \in [-1, 1]$.
In what follows, the divergence from the centroid of ra_i is denoted by $Disp(ra_i^{pol,rat}, D)$. Its
value varies according to the following:
$$c_i = \frac{|rat_i - pol_i|}{2n} \qquad (5)$$
For a data point, if the values of the two dimensions were farther apart, our assumption
was that such a point should be considered of high importance. We hypothesized that
a positive aspect in a low-rating review should have a higher weight, and vice versa.
Therefore, an importance coefficient was computed for each data point, based on the
absolute value difference between the values over both dimensions. The division by
2n represents a normalization by the maximum value of the difference in absolute value
(max(|rat_i − pol_i|) = 2) and by n. For instance, for a polarity of −1 and a rating of 1, the
coefficient is 1/n (|−1 − 1|/2n = 2/2n = 1/n), and for a polarity of 1 and a rating of 1,
the coefficient is 0 (|1 − 1|/2n = 0).
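Putting Equations (1), (2) and (5) together, a minimal Python sketch of the dispersion-based intensity is given below. The exact way the importance coefficients enter the weighted-centroid configuration (Equations (3) and (4), not reproduced on this page) is not shown, so using c_i as centroid weights here is only our assumption.

# Sketch of the dispersion measure (Eqs. (1)-(2)) and importance coefficient (Eq. (5)).
import numpy as np

def contradiction_intensity(polarities, ratings, weighted=False):
    pol = np.asarray(polarities, dtype=float)                # already in [-1, 1]
    rat = (np.asarray(ratings, dtype=float) - 3.0) / 2.0      # normalize 1..5 stars to [-1, 1]
    n = len(pol)
    c = np.abs(rat - pol) / (2 * n)                           # importance coefficients, Eq. (5)
    if weighted:                                              # assumption: c_i used as centroid weights
        centroid = (np.average(pol, weights=c), np.average(rat, weights=c))
    else:
        centroid = (pol.mean(), rat.mean())
    dists = np.sqrt((pol - centroid[0]) ** 2 + (rat - centroid[1]) ** 2)  # Eq. (2)
    return dists.mean()                                       # Eq. (1): mean distance to centroid

# e.g. the two "Lesson" review-aspects from Table 1:
print(contradiction_intensity([-0.9, 0.9], [2, 5]))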
4. Experimental Evaluation
This section presents the performed experiments and their results. After the introduc-
tion of our corpus and its study, the section presents and discusses the results obtained in
comparison with our baseline. We then present the experiments that allowed us to select the
features that gave the best results with the learning-based algorithms. The section ends
with a presentation and comparison of the results obtained with the SVM, J48, and naive
Bayes algorithms.
4.1.1. Data
We are not aware of the existence of a standard dataset for evaluating the contradiction
intensity (strength). Therefore, we built our own dataset by collecting 73,873 reviews
and their ratings corresponding to 2244 English courses from coursera.org via its API
https://building.coursera.org/app-platform/catalog (accessed on 5 January 2022) and
web page parsing. This was performed during the time interval 10–14 October 2016. More
detailed statistics on this Coursera dataset are depicted in Figure 4. Our entire test dataset,
We were able to automatically capture 22 useful aspects from the set of reviews (see
Figure 5). Figure 5 presents the statistics on the 22 detected aspects; for example, for the
Slide aspect, we recorded: 56 one-star ratings, 64 two-star ratings, 81 three-star ratings,
121 four-star ratings, 115 five-star ratings, 131 reviews with negative polarity, 102 reviews
with positive polarity, as well as 192 reviews and 41 courses concerning this aspect.
1. The sentiment class for each review-aspect of 1100 courses was assessed by 3 users
(assessors). Users must only judge the polarity of the involved sentiment class;
2. The degree of contradiction between these review-aspects (see Figure 6) was assessed
by 3 new users.
Annotation corresponding to the above judgment was performed manually. For each
aspect, on average, 22 review-aspects per course were judged (in total: 66,104 review-aspects
of 1100 courses, i.e., 50 courses for each aspect). Exactly 3 users evaluated each aspect.
A 3-level assessment scale (Negative, Neutral, Positive) was employed for the senti-
ment evaluation in the review-aspects, in a per-course manner, and a 5-level assessment
scale (Very Low, Low, Strong, Very Strong, and Not Contradictory) was employed for the
contradiction evaluation, as depicted in Figure 6.
Using Cohen’s Kappa coefficient k [60], we estimated the agreement degree among
the assessors for each aspect. In order to obtain a unique Kappa value, we calculated the
pairwise Kappa of assessors, and then, we computed the average.
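A minimal sketch of this agreement computation (assuming the three assessors' labels are aligned per review-aspect) is given below.

# Average pairwise Cohen's Kappa over all pairs of assessors.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def average_pairwise_kappa(annotations):
    """annotations: one label sequence per assessor (same length, same item order)."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa_score(a, b) for a, b in pairs) / len(pairs)

# e.g. three assessors labelling five review-aspects on the Negative/Neutral/Positive scale:
print(average_pairwise_kappa([["Pos", "Neg", "Neu", "Pos", "Neg"],
                              ["Pos", "Neg", "Neu", "Pos", "Pos"],
                              ["Pos", "Neg", "Pos", "Pos", "Neg"]]))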
For each aspect from all the reviews, the distribution of the Kappa values is shown in
Figure 7. The variation of the measure of agreement was between 0.60 and 0.91. Among
the assessors, the average level of agreement was equal to 80%. Such a score corresponds
to a strong agreement. Between the assessors who performed the sentiment annotation,
the Kappa coefficient value was k = 0.78 (78% agreement), which also indicates a substan-
tial agreement.
Figure 7. Distribution of the Kappa values k per aspect. <0 poor agreement, 0.0–0.2 slight agreement,
0.21–0.4 fair agreement, 0.41–0.6 moderate agreement, 0.61–0.8 substantial agreement, and 0.81–1
perfect agreement.
Table 3. Correlation values with respect to the accuracy levels (WITHOUT considering review
session). “∗ ” represents significance with a p-value < 0.05 and “∗∗ ” represents significance with a
p-value < 0.01.
dispersion. Moreover, when considering the users’ sentiment judgments (Table 3 (b)), we
obtained better results than when considering sentiment analysis models (Table 3, baseline
and (a)). The improvements went from 35% for (baseline) (Pearson: 0.45, compared to 0.61)
to 50% for (b) (Pearson: 0.45, compared to 0.68). The correlation coefficient conclusions
stand for the precision as well. One may notice that a loss of 21% in terms of sentiment
analysis accuracy (100–79%) led to a 34% loss in terms of precision.
Config (2): weighted centroid. This configuration yielded positive correlation values as
well (0.51, 0.80, 0.87). One may note that the results when considering the weight of
the centroids were better than when this particular weight was ignored. Compared to
the averaged centroid (Config (1)), the improvements were 13% for naive Bayes, 31% for
SentiNeuron, and 28% for the manual judgments, respectively. This trend was confirmed
for the precision results as well. Thus, the sentiment analysis model significantly impacted
the estimation quality of the studied contradictions.
Table 4. Correlation values with respect to accuracy levels (WITH considering review session).
“∗ ” represents significance with a p-value < 0.05 and “∗∗ ” represents significance with a p-value < 0.01.
with the performance improvements, the clustering of reviews with respect to their course
sessions being helpful in particular. When the sentiment analysis method performed well,
the global results were also improved.
For the experiments, 50 courses were randomly selected from the dataset, for each
of the 22 aspects. Thus, we obtained a total of 1100 courses (instances). The intensity
contradiction classes were then established, with respect to specific aspects. There were
four classes: Very Low (230 courses), Low (264 courses), Strong (330 courses), and Very Strong
(276 courses), with respect to the judgments provided by the annotators.
Since the distribution of the courses by class was not balanced and in order to avoid
a possible model bias that would assign more observations than normal to the majority
class, we applied a sub-sampling approach, and we obtained a balanced collection of
230 individuals by class, therefore a total of 920 courses.
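This balancing step can be sketched as follows (the column name is hypothetical).

# Minimal sketch of the sub-sampling step: keep 230 randomly chosen courses per
# contradiction intensity class so that every class is represented equally.
import pandas as pd

def balance_by_class(df, label_col="intensity_class", per_class=230, seed=0):
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(n=per_class, random_state=seed)))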
After obtaining the balanced dataset, we applied the feature selection mechanisms on
it. We performed five-fold cross-validation (a machine learning step widely employed for
hyperparameter optimization).
The feature selection algorithms output feature significance scores for the four estab-
lished classes. Their inner workings are different. They may be based on feature importance
ranking (e.g., FilteredAttributeEval), or on the feature selection frequency during the cross-
validation step (e.g., FilteredSubsetEval). We mention that we employed the default Weka
parameter settings for these methods.
Since we applied five-fold cross-validation over the ten features, n = 10. The results
concerning the selected features are summarized in Figure 9. There were two classes of
selection algorithms:
• Based on ranking metrics to sort the features (marked by Rank in the figure);
• Based on the occurrence frequency during the cross-validation step (marked by #Folds
in the figure).
One may note that a good feature has either a high rank or a high frequency.
The strongest features, by both the #Folds and the Rank metrics, were f10: VarPol,
f9: VarRat, f1: #NegRev, and f2: #PosRev. The features with average importance were
f3: #TotalRev, f4: #Rat1, and f8: #Rat5, except for the case of CfsSubsetEval, for which the
features f4 and f8 were not selected. The weakest features were f5: #Rat2, f6: #Rat3, and
f7: #Rat4.
1. For the CfsSubsetEval algorithm, the selected features were: f1: #NegRev, f2: #PosRev, f3: #TotalRev, f9: VarRat, and f10: VarPol;
2. For the WrapperSubsetEval algorithm, the selected features were: f1, f2, f3, f4: #Rat1, f8: #Rat5, f9, and f10;
3. For the other algorithms, all the features were selected: f1: #NegRev, f2: #PosRev, f3: #TotalRev, f4: #Rat1, f5: #Rat2, f6: #Rat3, f7: #Rat4, f8: #Rat5, f9: VarRat, and f10: VarPol.
Algorithm            Features
CfsSubsetEval        f1, f2, f3, f9, f10
WrapperSubsetEval    f1, f2, f3, f4, f8, f9, f10
Other algorithms     f1, f2, f3, f4, f5, f6, f7, f8, f9, f10
Regarding the input feature vector, we needed to decide how many features to consider,
either all of them or only those proposed by feature selection. For the latter, we also had to
decide which machine learning algorithm should exploit them.
This type of discussion was conducted by Hall and Holmes [59]. They argued about
the effectiveness of several feature selection methods by crossing them with several ma-
chine learning algorithms. They matched the best feature selection and machine learning
techniques, since they noticed varying performance during the experiments. Inspired by
their findings [59], we used the following couples of learning methods and feature selection
algorithms:
• Feature selection: CfsSubsetEval (CFS) and WrapperSubsetEval (WRP); machine learning
algorithm: naive Bayes;
• Feature selection: ReliefFAttributeEval (RLF); machine learning algorithm: J48 (the C4.5
implementation);
• Feature selection: SVMAttributeEval (SVM); machine learning algorithm: multi-class
SVM (SMO function on Weka).
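The experiments themselves were run in Weka; purely as an illustration of this setup, the following scikit-learn sketch evaluates analogous classifier/feature-selection pairs with five-fold cross-validation. DecisionTreeClassifier, SVC, GaussianNB and SelectKBest are only stand-ins for J48, SMO, naive Bayes and the Weka attribute selectors, respectively, not the tools actually used.

# Rough scikit-learn analogue of the evaluation setup (the paper itself uses Weka).
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def evaluate(X, y):
    """X: (n_courses, 10) feature matrix (f1..f10); y: contradiction intensity classes."""
    models = {
        "naive Bayes + selection": make_pipeline(SelectKBest(f_classif, k=5), GaussianNB()),
        "SVM (all features)": SVC(),
        "decision tree (all features)": DecisionTreeClassifier(),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="precision_macro")
        print(f"{name}: mean precision = {scores.mean():.3f}")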
The naive Bayes algorithm represents the baseline, and statistical significance tests
(paired t-test) were conducted to compare the performances. The results are shown in
Table 6. Significance (p-value < 0.05) is marked by ∗ , and strong significance (p-value < 0.01)
is marked by ∗∗ , in the table. We next discuss the obtained results.
The results with the naive Bayes model: This model yielded precision values of 0.72
and 0.68, corresponding to the WRP and the CFS feature selection algorithms, respectively.
The feature selection algorithms overcame the performance obtained when considering
all the features, which maxed out at 0.60, in terms of precision. Thus, the feature selection
mechanisms helped the learning process of the machine learning algorithms. The classes
for which the highest precision was obtained were Very Low, Strong, and Very Strong. The
remaining class, Low, could not yield more than 0.46 in terms of precision, in the case of the
WRP selection algorithm.
Table 6. Precision results for the machine learning techniques. Significance (p-value < 0.05) is marked
by ∗ , and strong significance (p-value < 0.01) is marked by ∗∗ .
The results with the SVM model: This model yielded better performance, compared
to the naive Bayes classifier. The relative improvements of the SVM model, compared to
naive Bayes, went from 14% in the case of WRP to 21% in the case of CFS. One should
note that this model managed to yield better performance for the difficult class (Low). The
feature selection algorithm SVMAttributeEval did not improve the performance, compared
to considering all the features together. This behavior may occur because the performance
was already quite high.
The results with the J48 model: This decision trees model yielded the best perfor-
mance in terms of precision, when considering all the features. The relative improvements
were 17%, with respect to the SVM model, 33% with respect to the naive Bayes model
with the WRP selection algorithm, and finally, 41% with respect to the naive Bayes model
with the CFS selection. The most difficult class for the other models, Low, obtained a
performance of 92% in terms of precision, meaning relative improvements ranging from
28% to 142%, with respect to the other learning models. Moreover, the improvements were
significant for all the involved classes. On the other hand, feature selection did not bring
any improvement this time. As for the SVM model, this non-improvement must surely
be due to the fact that the performance of the algorithm was already extremely high, and
consequently, the impact of feature selection was very marginal.
In what follows, we compared the best results obtained by the two methods of contra-
diction intensity estimation. We refer to the unsupervised method, based on the review-
aspect dispersion function taking into account the review sessions (as in Table 4), and to the
supervised method, based on several features extracted by the selection algorithms within
the learning process (see the average precision in Table 6). In terms of precision, naive Bayes
used with the CFS feature selection algorithm registered the lowest precision result (68%),
as can be seen in Tables 4 and 6. SVM performed relatively better than the unsupervised
method with all of its configurations using an averaged centroid. Moreover, SVM even
outperformed naive Bayes used with CFS and WRP with an improvement rate of 21% and
14%, respectively. However, the majority of the results obtained with the unsupervised
method using the weighted centroid significantly outperformed those obtained using the
averaged centroid or even those obtained by the supervised method using naive Bayes
and SVM. In all these experiments, the best results were obtained by the J48 decision trees
algorithm using the RLF selection algorithm. J48 recorded significant improvement rates
over all other configurations, using both supervised and unsupervised methods: 17%, 33%,
and 41%, over SVM, naive Bayes (WRP), and naive Bayes (CFS), respectively. Table 7 shows
in detail the different improvement rates between J48 and the other configurations.
Table 7. Rates of improvement between the decision trees J48 and the various other configurations.
To sum up, the results clearly showed that the contradiction intensity can be predicted
by the J48 machine learning model, with good performance. The feature selection methods
proved to be effective for one case out of three, with respect to the learning models (for
naive Bayes). This similar performance between the versions with and without feature
selection shows that, after a certain performance level yielded by the machine learning
algorithm, the feature selection impact stayed quite limited. We conclude that the courses
having highly divergent reviews were prone to containing contradictions with several
intensity levels.
5. Conclusions
This research focused on the estimation of the contradiction intensity in texts, more
particularly in MOOC course reviews. Unlike most other authors, we did not only
try to find out if a contradiction occurred, but we were concerned with measuring its
strength. The contradiction was identified around the aspects that generated the difference
in opinions within the reviews. We hypothesized that the contradiction occurred when
the sentiment polarities around these aspects were divergent. This paper’s proposal to
quantify the contradiction intensity was twofold, consisting of an unsupervised approach
and a supervised one, respectively. Within the unsupervised approach, the review-aspects
were represented as a function that estimated the dispersion (more intense contradictions
occurred when the sentiment polarities and the ratings were dispersed in the bi-dimensional
space characterized by sentiment polarity and ratings, respectively). The other idea was
to group the reviews by sessions (the time dimension), allowing an effective treatment
to avoid fake contradictions. The supervised approach considered several features and
learned to predict the contradiction intensity. We hypothesized that the ratings and the
sentiment polarities around an aspect may be useful as features to estimate the intensity of
the contradictions. When the sentiment polarities and the ratings were diverse (in terms of
the standard deviation), the chances of the contradictions being intense increased.
For the unsupervised approach, the weighted centroid configuration, coupled with the re-
view sessions (considering the time dimension of the reviews), yielded the best performances.
For the supervised approach, the features VarPol, VarRat, #PosRev, and #NegRev
had the best chances to correctly predict the intensity classes for the contradictions. The
feature selection study proved to be effective for one case out of three, with respect to the
learning models (for naive Bayes). Thus, feature selection may be beneficial for the learning
models that did not perform very well. The best performance was obtained by the J48
decision trees algorithm. This model was followed, in terms of precision, by the SVM
model and, lastly, by the naive Bayes model.
The most important limitation of our proposal is that the models depend on the
quality of the sentiment polarity estimation and of the aspect extraction method. That is
why our future work will focus on finding ways of selecting methods for sentiment polarity
estimation and aspect extraction that would be the most appropriate for this task.
Additionally, we aim to conduct larger-scale experiments, over various data types,
since the promising results obtained so far motivate us to investigate this topic even further.
Author Contributions: Conceptualization, A.-G.C. and S.F.; Data curation, I.B.; Formal analysis, I.B.;
Supervision, S.F.; Validation, A.-G.C.; Writing (original draft), I.B., A.-G.C. and S.F.; Writing (review
and editing), I.B., A.-G.C. and S.F. All authors have read and agreed to the published version of
the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Badache, I.; Boughanem, M. Harnessing Social Signals to Enhance a Search. IEEE/WIC/ACM 2014, 1, 303–309.
2. Badache, I.; Boughanem, M. Emotional social signals for search ranking. SIGIR 2017, 3, 1053–1056.
3. Badache, I.; Boughanem, M. Fresh and Diverse Social Signals: Any impacts on search? In Proceedings of the CHIIR ’17:
Proceedings of the 2017 Conference on Conference Human Information Interaction and Retrieval, Oslo, Norway, 7–11 March
2017; pp. 155–164.
4. Kim, S.; Zhang, J.; Chen, Z.; Oh, A.H.; Liu, S. A Hierarchical Aspect-Sentiment Model for Online Reviews. In Proceedings of the
Twenty-Seventh AAAI Conference on Artificial Intelligence, Bellevue, WA, USA, 14–18 July 2013.
5. Poria, S.; Cambria, E.; Ku, L.; Gui, C.; Gelbukh, A.F. A Rule-Based Approach to Aspect Extraction from Product Reviews. In
Proceedings of the Second Workshop on Natural Language Processing for Social Media, SocialNLP@COLING 2014, Dublin,
Ireland, 24 August 2014; pp. 28–37. [CrossRef]
6. Wang, L.; Cardie, C. A Piece of My Mind: A Sentiment Analysis Approach for Online Dispute Detection. In Proceedings of the
52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, Baltimore, MD, USA, 22–27 June 2014; Short
Papers; Volume 2, pp. 693–699.
7. Harabagiu, S.M.; Hickl, A.; Lacatusu, V.F. Negation, Contrast and Contradiction in Text Processing. In Proceedings of the
Twenty-First National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence
Conference, Boston, MA, USA, 16–20 July 2006; pp. 755–762.
8. de Marneffe, M.; Rafferty, A.N.; Manning, C.D. Finding Contradictions in Text. In Proceedings of the 46th Annual Meeting of the
Association for Computational Linguistics, Columbus, OH, USA, 15–20 June 2008; pp. 1039–1047.
9. Tsytsarau, M.; Palpanas, T.; Denecke, K. Scalable discovery of contradictions on the web. In Proceedings of the 19th International
Conference on World Wide Web, WWW 2010, Raleigh, NC, USA, 26–30 April 2010; pp. 1195–1196. [CrossRef]
10. Tsytsarau, M.; Palpanas, T.; Denecke, K. Scalable detection of sentiment-based contradictions. DiversiWeb WWW 2011, 11, 105–112.
11. Yazi, F.S.; Vong, W.T.; Raman, V.; Then, P.H.H.; Lunia, M.J. Towards Automated Detection of Contradictory Research Claims in
Medical Literature Using Deep Learning Approach. In Proceedings of the 2021 Fifth International Conference on Information
Retrieval and Knowledge Management (CAMP), Pahang, Malaysia, 15–16 June 2021; pp. 116–121. [CrossRef]
12. Hsu, C.; Li, C.; Sáez-Trumper, D.; Hsu, Y. WikiContradiction: Detecting Self-Contradiction Articles on Wikipedia. In Proceedings
of the IEEE International Conference on Big Data (IEEE BigData 2021), Orlando, FL, USA, 15–18 December 2021.
13. Sepúlveda-Torres, R. Automatic Contradiction Detection in Spanish. In Proceedings of the Doctoral Symposium on Natural
Language Processing from the PLN.net Network, Baeza, Spain, 19–20 October 2021.
14. Rahimi, Z.; Shamsfard, M. Contradiction Detection in Persian Text. arXiv 2021, arXiv:2107.01987.
15. Pielka, M.; Sifa, R.; Hillebrand, L.P.; Biesner, D.; Ramamurthy, R.; Ladi, A.; Bauckhage, C. Tackling Contradiction Detection in
German Using Machine Translation and End-to-End Recurrent Neural Networks. In Proceedings of the 2020 25th International
Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 6696–6701. [CrossRef]
16. Păvăloaia, V.D.; Teodor, E.M.; Fotache, D.; Danileţ, M. Opinion Mining on Social Media Data: Sentiment Analysis of User
Preferences. Sustainability 2019, 11, 4459. [CrossRef]
17. Mohammad, S.M.; Turney, P.D. Crowdsourcing a word–emotion association lexicon. Comput. Intell. 2013, 29, 436–465. [CrossRef]
18. Al-Ayyoub, M.; Rabab’ah, A.; Jararweh, Y.; Al-Kabi, M.N.; Gupta, B.B. Studying the controversy in online crowds’ interactions.
Appl. Soft Comput. 2018, 66, 557–563. [CrossRef]
19. Popescu, A.M.; Pennacchiotti, M. Detecting controversial events from twitter. In Proceedings of the 19th ACM International
Conference on Information and Knowledge Management, Toronto, ON, Canada, 26–30 October 2010; pp. 1873–1876.
20. Balasubramanyan, R.; Cohen, W.W.; Pierce, D.; Redlawsk, D.P. Modeling polarizing topics: When do different political
communities respond differently to the same news? In Proceedings of the Sixth International AAAI Conference on Weblogs and
Social Media, Dublin, Ireland, 4–8 June 2012.
21. Dori-Hacohen, S.; Allan, J. Detecting controversy on the web. In Proceedings of the 22nd ACM international conference on
Information & Knowledge Management, San Francisco, CA, USA, 27 October–1 November 2013; pp. 1845–1848.
22. Dori-Hacohen, S.; Allan, J. Automated Controversy Detection on the Web. In Advances in Information Retrieval; Hanbury, A.,
Kazai, G., Rauber, A., Fuhr, N., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 423–434.
23. Garimella, K.; Morales, G.D.F.; Gionis, A.; Mathioudakis, M. Quantifying Controversy in Social Media. In Proceedings of the
Ninth ACM International Conference on Web Search and Data Mining, San Francisco, CA, USA, 22–25 February 2016; pp. 33–42.
[CrossRef]
24. Guerra, P.C.; Meira, W., Jr.; Cardie, C.; Kleinberg, R. A measure of polarization on social media networks based on community
boundaries. In Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media, Cambridge, MA, USA,
8–11 July 2013.
25. Jang, M.; Allan, J. Improving Automated Controversy Detection on the Web. In Proceedings of the 39th International ACM
SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy, 17–21 July 2016; pp. 865–868.
[CrossRef]
26. Lin, W.H.; Hauptmann, A. Are these documents written from different perspectives? A test of different perspectives based on
statistical distribution divergence. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th
Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Sydney, Australia,
6–8 July 2006; pp. 1057–1064.
27. Sriteja, A.; Pandey, P.; Pudi, V. Controversy Detection Using Reactions on Social Media. In Proceedings of the 2017 IEEE
International Conference on Data Mining Workshops (ICDMW), New Orleans, LA, USA, 18–21 November 2017; pp. 884–889.
28. Morales, A.; Borondo, J.; Losada, J.C.; Benito, R.M. Measuring political polarization: Twitter shows the two sides of Venezuela.
Chaos Interdiscip. J. Nonlinear Sci. 2015, 25, 033114. [CrossRef] [PubMed]
29. Garimella, K.; Morales, G.D.F.; Gionis, A.; Mathioudakis, M. Quantifying Controversy on Social Media. Trans. Soc. Comput. 2018,
1. [CrossRef]
30. Ahmed, A.; Xing, E.P. Staying informed: Supervised and semi-supervised multi-view topical analysis of ideological perspective.
In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Association for Computational
Linguistics, Stroudsburg, PA, USA, 9–11 October 2010; pp. 1140–1150.
31. Cohen, R.; Ruths, D. Classifying political orientation on Twitter: It’s not easy! In Proceedings of the Seventh International AAAI
Conference on Weblogs and Social Media, Cambridge, MA, USA, 8–11 July 2013.
32. Conover, M.D.; Gonçalves, B.; Ratkiewicz, J.; Flammini, A.; Menczer, F. Predicting the political alignment of twitter users.
In Proceedings of the 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third
International Conference on Social Computing, Boston, MA, USA, 9–11 October 2011; pp. 192–199.
33. Paul, M.; Girju, R. A Two-Dimensional Topic-Aspect Model for Discovering Multi-Faceted Topics. In Proceedings of the AAAI’10:
Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2010; AAAI Press: Atlanta, GA, USA,
2010; pp. 545–550.
34. Qiu, M.; Jiang, J. A Latent Variable Model for Viewpoint Discovery from Threaded Forum Posts. In Proceedings of the 2013
Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,
Atlanta, Georgia, 9–14 June 2013; Association for Computational Linguistics: Atlanta, GA, USA, 2013; pp. 1031–1040.
35. Trabelsi, A.; Zaiane, O.R. Mining contentious documents using an unsupervised topic model based approach. In Proceedings of
the 2014 IEEE International Conference on Data Mining, Shenzhen, China, 14–17 December 2014; pp. 550–559.
36. Thonet, T.; Cabanac, G.; Boughanem, M.; Pinel-Sauvagnat, K. VODUM: A topic model unifying viewpoint, topic and opinion
discovery. In European Conference on Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2016; pp. 533–545.
37. Galley, M.; McKeown, K.; Hirschberg, J.; Shriberg, E. Identifying agreement and disagreement in conversational speech: Use
of bayesian networks to model pragmatic dependencies. In Proceedings of the 42nd Annual Meeting on Association for
Computational Linguistics, Stroudsburg, PA, USA, 21–26 July 2004; p. 669.
38. Menini, S.; Tonelli, S. Agreement and disagreement: Comparison of points of view in the political domain. In Proceedings of the
COLING 2016, the 26th International Conference on Computational Linguistics, Technical Papers, Osaka, Japan, 11–16 December
2016; pp. 2461–2470.
39. Mukherjee, A.; Liu, B. Mining contentions from discussions and debates. In Proceedings of the 18th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, Beijing, China, 12–16 August 2012; pp. 841–849.
40. Mohammad, S.; Kiritchenko, S.; Zhu, X. NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets. In
Proceedings of the 7th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2013, Atlanta, GA, USA, 14–15
June 2013; pp. 321–327.
41. Mohammad, S.; Kiritchenko, S.; Sobhani, P.; Zhu, X.; Cherry, C. SemEval-2016 Task 6: Detecting Stance in Tweets. In Proceedings
of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, CA, USA, 10–11 June 2016; Association
for Computational Linguistics: San Diego, CA, USA; pp. 31–41. [CrossRef]
42. Augenstein, I.; Rocktäschel, T.; Vlachos, A.; Bontcheva, K. Stance detection with bidirectional conditional encoding. arXiv 2016,
arXiv:1606.05464.
43. Gottipati, S.; Qiu, M.; Sim, Y.; Jiang, J.; Smith, N. Learning topics and positions from debatepedia. In Proceedings of the 2013
Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1858–1868.
44. Johnson, K.; Goldwasser, D. “All I know about politics is what I read in Twitter”: Weakly Supervised Models for Extracting
Politicians’ Stances From Twitter. In Proceedings of the COLING 2016, the 26th International Conference on Computational
Linguistics, Technical Papers, Osaka, Japan, 11–16 December 2016; pp. 2966–2977.
45. Qiu, M.; Sim, Y.; Smith, N.A.; Jiang, J. Modeling user arguments, interactions, and attributes for stance prediction in online debate
forums. In Proceedings of the 2015 SIAM International Conference on Data Mining, SIAM, Vancouver, BC, Canada, 30 April–2
May 2015; pp. 855–863.
46. Somasundaran, S.; Wiebe, J. Recognizing stances in ideological on-line debates. In Proceedings of the NAACL HLT 2010
Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, Los Angeles, CA, USA, 10–12 June
2010; pp. 116–124.
47. Küçük, D.; Can, F. Stance detection: A survey. ACM Comput. Surv. (CSUR) 2020, 53, 1–37. [CrossRef]
48. Hu, M.; Liu, B. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25 August 2004; pp. 168–177. [CrossRef]
49. Hamdan, H.; Bellot, P.; Béchet, F. Lsislif: CRF and Logistic Regression for Opinion Target Extraction and Sentiment Polarity
Analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2015, Denver, CO,
USA, 4–5 June 2015; pp. 753–758.
50. Titov, I.; McDonald, R.T. Modeling online reviews with multi-grain topic models. In Proceedings of the 17th International
Conference on World Wide Web, WWW 2008, Beijing, China, 21–25 April 2008; pp. 111–120. [CrossRef]
51. Tulkens, S.; van Cranenburgh, A. Embarrassingly Simple Unsupervised Aspect Extraction. In Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics; Association for Computational Linguistics, Online, 5–10 July 2020;
pp. 3182–3187. [CrossRef]
52. Turney, P.D. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. In
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002;
pp. 417–424.
53. Pang, B.; Lee, L.; Vaithyanathan, S. Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proceedings of
the 2002 Conference on Empirical Methods in Natural Language Processing, EMNLP 2002, Philadelphia, PA, USA, 6–7 July 2002.
54. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.Y.; Potts, C. Recursive Deep Models for Semantic Composition-
ality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,
EMNLP 2013, Seattle, WA, USA, 18–21 October 2013; A Meeting of SIGDAT, a Special Interest Group of the ACL; Grand Hyatt
Seattle: Seattle, WA, USA, 2013; pp. 1631–1642.
55. Radford, A.; Józefowicz, R.; Sutskever, I. Learning to Generate Reviews and Discovering Sentiment. arXiv 2017, arXiv:1704.01444.
56. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 21 June–18 July 1965.
57. McAuley, J.J.; Pandey, R.; Leskovec, J. Inferring Networks of Substitutable and Complementary Products. In Proceedings of the
21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10–13 August
2015; pp. 785–794. [CrossRef]
58. Looks, M.; Herreshoff, M.; Hutchins, D.; Norvig, P. Deep Learning with Dynamic Computation Graphs. In Proceedings of the 5th
International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017.
59. Hall, M.A.; Holmes, G. Benchmarking Attribute Selection Techniques for Discrete Class Data Mining. IEEE Trans. Knowl. Data
Eng. 2003, 15, 1437–1447. [CrossRef]
60. Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46. [CrossRef]
61. Pearson, E.S.; Stephens, M.A. The Ratio of Range to Standard Deviation in the Same Normal Sample. Biometrika 1964, 51, 484–487.
[CrossRef]
62. Vosecky, J.; Leung, K.W.; Ng, W. Searching for Quality Microblog Posts: Filtering and Ranking Based on Content Analysis and
Implicit Links. In Proceedings of the Database Systems for Advanced Applications—17th International Conference, DASFAA
2012, Busan, Korea, 15–19 April 2012; pp. 397–413. [CrossRef]
63. Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann: Burlington, MA, USA, 1993.
64. Yuan, Q.; Cong, G.; Magnenat-Thalmann, N. Enhancing naive bayes with various smoothing methods for short text classification.
In Proceedings of the 21st World Wide Web Conference, WWW 2012, Lyon, France, 16–20 April 2012; pp. 645–646. [CrossRef]
mathematics
Article
Cross-Lingual Transfer Learning for Arabic Task-Oriented
Dialogue Systems Using Multilingual Transformer Model mT5
Ahlam Fuad * and Maha Al-Yahya
Department of Information Technology, College of Computer and Information Sciences, King Saud University,
P.O. Box 145111, Riyadh 4545, Saudi Arabia; [email protected]
* Correspondence: [email protected] or [email protected]
Abstract: Due to the promising performance of pre-trained language models for task-oriented
dialogue systems (DS) in English, some efforts to provide multilingual models for task-oriented
DS in low-resource languages have emerged. These efforts still face a long-standing challenge due
to the lack of high-quality data for these languages, especially Arabic. To circumvent the cost and
time-intensive data collection and annotation, cross-lingual transfer learning can be used when few
training data are available in the low-resource target language. Therefore, this study aims to explore
the effectiveness of cross-lingual transfer learning in building an end-to-end Arabic task-oriented DS
using the mT5 transformer model. We use the Arabic task-oriented dialogue dataset (Arabic-TOD)
in the training and testing of the model. We present the cross-lingual transfer learning deployed
with three different approaches: mSeq2Seq, Cross-lingual Pre-training (CPT), and Mixed-Language
Pre-training (MLT). We obtain good results for our model compared to results reported in the literature for the Chinese language under the same settings. Furthermore, cross-lingual transfer learning deployed with the MLT approach outperforms the other two approaches. Finally, we show that our results can be
improved by increasing the training dataset size.
Keywords: cross-lingual transfer learning; task-oriented dialogue systems; Arabic language; mixed-language pre-training; multilingual transformer model; mT5; natural language processing
MSC: 68T50
Citation: Fuad, A.; Al-Yahya, M. Cross-Lingual Transfer Learning for Arabic Task-Oriented Dialogue Systems Using Multilingual Transformer Model mT5. Mathematics 2022, 10, 746. https://doi.org/10.3390/math10050746
the first work to examine the cross-lingual transfer ability of mT5 on Arabic task-oriented
DS in few-shot scenarios. We aimed to answer two research questions:
- To what extent is cross-lingual transfer learning effective in end-to-end Arabic task-
oriented DS using the mT5 model?
- To what extent does the size of the training dataset affect the quality of Arabic task-
oriented DS in few-shot scenarios?
The rest of the paper is organized as follows: Related work in the area of task-oriented
DS for multilingual models and cross-lingual transfer learning is explored in Section 2.
Section 3 delineates the methodology of cross-lingual transfer learning used in this research.
In Section 4, we present our experimental setup and the dataset used. The results and
findings are discussed in Section 5. Finally, in Section 6, we summarize the research and
highlight avenues for future research.
2. Related Works
In task-oriented DS, high-resource languages are those with many dataset samples, whereas low-resource languages have only a few. Therefore, it is important
to provide datasets for these languages to advance research on end-to-end task-oriented
DS. Cross-lingual transfer learning is a common and effective approach to build end-
to-end task-oriented DS in low-resource languages [8,11,12]. Machine-translation and
multilingual representations are the most widely used approaches in cross-lingual research. The authors of [8] introduced a multilingual dataset for task-oriented DS. It contains many annotated English utterances, but only a few English utterances translated and annotated into Spanish and Thai, covering the weather, alarm, and reminder domains. They evaluated three cross-lingual transfer learning approaches: translating the training data, cross-lingual contextual word representations, and a multilingual machine-translation encoder. Their experiments showed that the latter two approaches outperformed translating the training data. However, translating the training data achieved the best results in zero-shot settings, where no target-language data are available. Moreover, they showed that joint training on both the high-resource and low-resource target languages improved performance on the target language.
In [11], Liu et al. proposed an attention-informed Mixed-Language Training (MLT)
approach. This is a zero-shot adaptation approach for cross-lingual task-oriented DS. They
used a code-switching approach, where code-switching sentences are generated from
source-language sentences in English by replacing particular selected source words with
their translations in German and Italian. They used task-related parallel word pairs in order
to generate code-switching (mixed language) sentences. They obtained better generalization
in the target language because of the inter-lingual semantics across languages. Their model
achieved a significantly better performance with the zero-shot setting for both cross-lingual
dialogue state tracking (DST) and natural language understanding (NLU) tasks than other
approaches using a large amount of bilingual data.
Lin et al. [12] proposed a bilingual multidomain dataset for end-to-end task-oriented
DS (BiToD). BiToD contains over 7000 multidomain dialogues for both English (EN) and
Chinese (ZH), with a large bilingual knowledge base (KB). They trained their models
using mT5 and mBART [13]. They evaluated their system in monolingual, bilingual, and
cross-lingual settings. In the monolingual setting, they trained and tested the models on
either English or Chinese dialogue data. In the bilingual setting, they trained the models on
dialogue data in both languages. In the cross-lingual setting, they restricted training to data in one language and used transfer learning to study the transferability of knowledge
from a high-resource language to a low-resource language. For end-to-end task evaluation,
they used: BLEU; Task Success Rate (TSR), to measure if the system provided the correct
entity and answered all the requested information of a specific task; Dialogue Success Rate
(DSR), to assess if the system completed all the tasks in the dialogue; and API Call Accuracy
(APIAcc), to assess if the system generated a correct API call. In addition, they used the joint
goal accuracy (JGA) metric to measure the performance of the DST. They suggested the
[Table: Study | Dataset | Dataset Size (Train/Validate/Test) | Metrics]
In addition to the previous studies, Louvan et al. [15] suggested using a data aug-
mentation approach to resolve the data scarcity in task-oriented DS. Data augmentation
aims to produce extra training data, and it has proven successful in different NLP tasks for English data [15]. As such, the authors studied its performance for task-oriented DS on non-English data. They evaluated their approach on five languages: Italian, Spanish, Hindi,
Turkish, and Thai. They found that data augmentation improved the performance for all
languages. Furthermore, data augmentation improved the performance of the mBERT
model, especially for a slot-filling task.
To the best of our knowledge, the study of [12] is the most closely related work to
ours. The authors leveraged mT5 and mBART to fine-tune a task-oriented dialogue task
based on a new bilingual dialogue dataset. Although the performance of pre-trained language models for task-oriented DS in English has encouraged the emergence of multilingual models for task-oriented DS in low-resource languages, there is still a gap between the system performance of low-resource and high-resource languages, due to the lack of high-quality data in the low-resource languages. To the best of our knowledge, no study exists for Arabic task-oriented DS using multilingual models. Therefore, given the current landscape of cross-lingual transfer learning and the performance achieved by multilingual language models in task-oriented DS, we aimed to explore how useful mT5 can be in building an Arabic end-to-end task-oriented DS in cross-lingual settings.
3. Methodology
The workflow of our task-oriented DS is based on a single Seq2Seq model using the pre-trained multilingual model mT5 [10]. D denotes a dialogue session, i.e., a sequence of user utterances (Ut) and system utterances (St) at turn t, where D = {U1, S1, . . ., Ut, St}. The interaction between the user and system at turn t creates a dialogue history (Ht) that holds all the previous utterances of both the user and the system within the context window size (w), where Ht = {Ut−w, St−w, . . ., St−1, Ut}. At each turn t of a dialogue session, the system tracks the dialogue state (Bt) and knowledge state (Kt), then generates the response (R). We first set the dialogue state and knowledge state to empty strings, B0 and K0, respectively. The input at turn t is composed of the current dialogue history (Ht), the previous dialogue state (Bt−1), and the previous knowledge state (Kt−1). At each turn, the dialogue state is updated from (Bt−1) to (Bt) by producing the Levenshtein Belief Spans at turn t (Levt), which hold the updated information. Finally, the system is queried with the constraints determined by the dialogue state in order to generate the API name, and the knowledge state is then updated from (Kt−1) to (Kt). Both Kt and the API name are used to generate the response, which is returned to the user.
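To make this turn-level formulation concrete, the following minimal sketch shows one way the input and target sequences could be serialized at turn t; the marker strings and helper names are illustrative assumptions, since the paper does not specify the exact serialization format.

# Minimal sketch of composing the Seq2Seq input and target at turn t.
# The marker strings (<history>, <belief>, <kb>, <lev>, <api>, <response>)
# are illustrative; the paper does not list the exact serialization tokens.

def build_turn_input(history, prev_belief, prev_knowledge, window=2):
    """Compose the model input from the dialogue history H_t (restricted to the
    last `window` user/system exchanges), the previous dialogue state B_{t-1},
    and the previous knowledge state K_{t-1}."""
    recent = history[-(2 * window):]
    return ("<history> " + " ".join(recent)
            + " <belief> " + prev_belief
            + " <kb> " + prev_knowledge)

def build_turn_target(lev_spans, api_name, response):
    """The target holds the Levenshtein belief spans Lev_t, the API name queried
    with the belief-state constraints, and the system response."""
    return "<lev> " + lev_spans + " <api> " + api_name + " <response> " + response

# A single (hypothetical) turn:
x = build_turn_input(
    ["User: I need a cheap hotel.", "System: In which area?", "User: Downtown."],
    prev_belief="", prev_knowledge="")
y = build_turn_target("hotels price_level=cheap location=downtown",
                      "hotels_search", "I found three cheap hotels downtown.")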
3.1. Dataset
Arabic-TOD, a multidomain Arabic dataset, was used for training and evaluating
end-to-end task-oriented DS. This dataset was created by manually translating the original
English BiToD dataset into Arabic. Of 3689 English dialogues, the Arabic-TOD dataset con-
tained 1500 dialogues with 30,000 utterances covering four domains (Hotels, Restaurants,
Weather, and Attractions). The dataset comprised 14 intent types and 26 slot types. The
Arabic-TOD dataset was preprocessed and prepared for the training step, then divided into
67%, 7%, and 26% for training, validation, and testing, respectively.
4. Experiments
We investigated the effectiveness of the powerful multilingual model mT5 for few-
shot cross-lingual learning for end-to-end Arabic task-oriented DS. In the cross-lingual
transfer learning, we transferred the knowledge from a high-resource language (English) to
a low-resource language (Arabic). We used three approaches from the literature [8,11,12]: mSeq2Seq, cross-lingual pre-training (CPT), and mixed-language pre-training (MLT).
mSeq2Seq approach: In this setting, we took the existing pre-trained mSeq2Seq model mT5 and fine-tuned it directly on the Arabic dialogue data.
Cross-lingual pre-training (CPT) approach: In this setting, we pre-trained the mSeq2Seq
model mT5 on English, and then fine-tuned the pre-trained models on the Arabic dia-
logue data.
Mixed-language pre-training (MLT) approach: In this setting, we used a KB (dictionary)
that contained a mixed-lingual context (Arabic and English) for most of the entities. As
such, we generated the mixed-language training data by replacing the most task-related
keyword entities in English with corresponding keyword entities in Arabic from a parallel
dictionary for both input and output sequences. The process of generating the mixed-
language context is shown in Figure 1. In this setting, we initially pre-trained the mSeq2Seq
model mT5 with the generated mixed language-training dialogue data, then fine-tuned the
pre-trained models on the Arabic dialogue data. During the training, our model learned to
capture the most task-related keywords in the mixed utterances, which helped the model
capture other, less important task-related words that have similar semantics, e.g., days of the week such as "السبت" (Saturday) and "الثلاثاء" (Tuesday).
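As a rough illustration of this code-switching step, the sketch below swaps English task-related keyword entities for Arabic counterparts taken from a parallel dictionary; the dictionary entries and the simple whole-word replacement are assumptions made for the example, not the exact BiToD KB procedure.

# Sketch of mixed-language training-data generation: English keyword entities
# are replaced by their Arabic equivalents from a parallel (EN -> AR) dictionary.
# The example entries below are illustrative, not taken from the BiToD KB.

import re

parallel_dict = {
    "Saturday": "السبت",
    "Tuesday": "الثلاثاء",
    "cheap": "رخيص",
    "hotel": "فندق",
}

def code_switch(utterance: str, dictionary: dict) -> str:
    """Replace task-related English keywords with Arabic translations,
    producing a mixed-language utterance for pre-training."""
    mixed = utterance
    for en, ar in dictionary.items():
        # whole-word, case-insensitive replacement
        mixed = re.sub(rf"\b{re.escape(en)}\b", ar, mixed, flags=re.IGNORECASE)
    return mixed

print(code_switch("I want a cheap hotel for Saturday", parallel_dict))
# -> "I want a رخيص فندق for السبت"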
In all three approaches, we used the English BiToD dataset [12]. The Arabic-TOD dataset contained 10% of the English one. We intended to investigate whether transferring the
pre-trained multilingual language model mT5 to task-oriented DS could handle the paucity
of the Arabic dialogue data.
Experiment Setup
We set up our experimental framework using the multilingual model mT5-small. We
used the PyTorch framework [18] and the Transformers library [19]. We set the optimizer
to AdamW [20] with a 0.0005 learning rate. We set the dialogue context window size at 2
and the batch size at 128, based on the best results in the existing literature. We first trained
the models on English dialogues for 8 epochs, then fine-tuned the model on Arabic for 10 epochs. All training runs took about 12 h using Google Colab.
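A condensed sketch of this setup, using the Hugging Face Transformers API, is given below; the data loaders and the label handling are placeholders, so it outlines the configuration rather than reproducing the authors' exact training script.

# Sketch of the fine-tuning configuration: mT5-small with AdamW (lr = 5e-4),
# first trained on English BiToD dialogues, then fine-tuned on Arabic-TOD.
# `english_loader` / `arabic_loader` are assumed to yield (inputs, targets)
# batches of plain strings; for real training, pad tokens in the labels are
# normally replaced with -100 so that they are ignored by the loss.

import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

def run_epochs(loader, epochs):
    model.train()
    for _ in range(epochs):
        for inputs, targets in loader:
            enc = tokenizer(list(inputs), return_tensors="pt", padding=True, truncation=True)
            labels = tokenizer(list(targets), return_tensors="pt", padding=True, truncation=True).input_ids
            loss = model(**enc, labels=labels).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# CPT-style schedule used here: 8 epochs on English dialogues, then 10 on Arabic.
# run_epochs(english_loader, epochs=8)
# run_epochs(arabic_loader, epochs=10)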
In [12], Chinese was considered the low-resource target language, with 10% of the training dialogues, while in this work, Arabic was the low-resource target language, also with 10% of the training dialogues.
Table 2. Cross-lingual experiment results of dialogue state tracking (DST) and end-to-end dialogue
generation with 10% of Arabic-TOD dataset (AR) compared to 10% of Chinese dialogues in BiToD
(ZH) [12]. Bold numbers indicate the best result according to the column’s metric value.
Model | TSR | DSR | APIAcc | BLEU | JGA
mSeq2Seq approach
AR | 18.63 | 3.72 | 15.26 | 9.55 | 17.67
ZH [12] | 4.16 | 2.20 | 6.67 | 3.30 | 12.63
CPT approach
AR | 42.16 | 14.18 | 46.63 | 23.09 | 32.71
ZH [12] | 43.27 | 23.70 | 49.70 | 13.89 | 51.40
MLT approach
AR | 42.16 | 14.49 | 46.77 | 23.98 | 32.75
ZH [12] | 49.20 | 27.17 | 50.55 | 14.44 | 55.05
Our model outperformed the Chinese model when the cross-lingual model was deployed with the mSeq2Seq approach, while the Chinese model outperformed ours with the other two approaches, except for BLEU. However, these poor results can be improved by increasing the Arabic-TOD dataset size under the few-shot learning scenarios. Given the small size of the Arabic-TOD dataset, we re-ran the experiment using all the data of the Arabic-TOD dataset, which comprises 27% of the size of the Chinese dataset in [12]. In so doing, we improved the results of the Arabic models, as shown in Table 3. We found a large improvement for the first approach, which still outperformed the Chinese model. Furthermore, Arabic obtained better results than Chinese in terms of TSR, APIAcc, and BLEU when the cross-lingual model was deployed with the CPT approach. The cross-lingual model deployed with the MLT approach outperformed the Chinese model in terms of APIAcc and BLEU.
Table 3. Results of DST and end-to-end dialogue generation for the three approaches in the cross-lingual setting using the full Arabic-TOD dataset. Numbers in bold font indicate values superior to those of the Chinese model reported in Table 2.
Approach | TSR | DSR | APIAcc | BLEU | JGA
AR (mSeq2Seq) | 42.88 | 13.95 | 48.68 | 29.28 | 35.74
AR (CPT) | 47.18 | 18.14 | 52.10 | 31.16 | 36.32
AR (MLT) | 48.10 | 18.84 | 52.58 | 31.74 | 37.17
Overall, our findings indicate the excellent cross-lingual transferability of the multilingual language model mT5. Moreover, the MLT approach improved the performance of few-shot cross-lingual learning, which indicates that a bilingual KB can facilitate cross-lingual knowledge transfer in low-resource scenarios, such as Arabic. In addition, the JGA values were relatively small for the Arabic models, due to the difficulty and multiplicity of the tasks in the Arabic-TOD dataset. As we mentioned earlier, we only translated the task-related keywords in the dialogues and did not translate names,
locations, and addresses, which in turn made parsing the Arabic utterances easier in a cross-lingual setting.
In summary, the cross-lingual setting is an effective approach for building an Arabic
end-to-end task-oriented DS, in cases in which there is a scarcity of training data. Our
results can be considered a baseline for the future of Arabic conversational systems.
Table 4. Few-shot learning results of the cross-lingual model deployed with the MLT approach on the Arabic-TOD dataset using training datasets of different sizes. Bold numbers indicate the best result according to the column's metric value.
Dataset Size | TSR | DSR | APIAcc | BLEU | JGA
5% | 30.09 | 10.23 | 33.07 | 20.26 | 24.85
10% | 34.90 | 11.86 | 37.89 | 20.87 | 28.26
20% | 40.73 | 14.42 | 44.47 | 23.84 | 32.05
50% | 42.16 | 14.88 | 48.51 | 24.94 | 34.03
100% | 48.10 | 18.84 | 52.58 | 31.74 | 37.17
Author Contributions: Conceptualization, A.F. and M.A.-Y.; methodology, A.F.; software, A.F.;
validation, A.F.; formal analysis, A.F. and M.A.-Y.; investigation, A.F. and M.A.-Y.; resources, A.F.
and M.A.-Y.; data curation, A.F.; writing—original draft preparation, A.F.; writing—review and
editing, A.F. and M.A.-Y.; visualization, A.F. and M.A.-Y.; supervision, M.A.-Y.; project administration,
M.A.-Y.; funding acquisition, M.A.-Y. All authors have read and agreed to the published version of
the manuscript.
Funding: This research is supported by a grant from the Researchers Supporting Project No. RSP-
2021/286, King Saud University, Riyadh, Saudi Arabia.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Acknowledgments: The authors extend their appreciation to the Researchers Supporting Project
number RSP-2021/286, King Saud University, Riyadh, Saudi Arabia.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. McTear, M. Conversational AI: Dialogue Systems, Conversational Agents, and Chatbots; Morgan & Claypool Publishers LLC: San Rafael,
CA, USA, 2020; Volume 13.
2. Wu, C.S.; Madotto, A.; Hosseini-Asl, E.; Xiong, C.; Socher, R.; Fung, P. Transferable multi-domain state generator for task-oriented
dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy,
28 July–2 August 2019; pp. 808–819. [CrossRef]
3. Peng, B.; Zhu, C.; Li, C.; Li, X.; Li, J.; Zeng, M.; Gao, J. Few-shot Natural Language Generation for Task-Oriented Dialog. arXiv
2020, arXiv:2002.12328. [CrossRef]
4. Yang, Y.; Li, Y.; Quan, X. UBAR: Towards Fully End-to-End Task-Oriented Dialog Systems with GPT-2. arXiv 2020,
arXiv:2012.03539.
5. Hosseini-Asl, E.; McCann, B.; Wu, C.S.; Yavuz, S.; Socher, R. A simple language model for task-oriented dialogue. Adv. Neural Inf.
Process. Syst. 2020, 33, 20179–20191.
6. AlHagbani, E.S.; Khan, M.B. Challenges facing the development of the Arabic chatbot. In Proceedings of the First International
Workshop on Pattern Recognition 2016, Tokyo, Japan, 11–13 May 2016; Volume 10011, p. 7. [CrossRef]
7. Liu, Z.; Shin, J.; Xu, Y.; Winata, G.I.; Xu, P.; Madotto, A.; Fung, P. Zero-shot cross-lingual dialogue systems with transferable latent
variables. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 1297–1303.
[CrossRef]
8. Schuster, S.; Shah, R.; Gupta, S.; Lewis, M. Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings
of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 3795–3805. [CrossRef]
9. Zhou, X.; Dong, D.; Wu, H.; Zhao, S.; Yu, D.; Tian, H.; Liu, X.; Yan, R. Multi-view response selection for human-computer
conversation. In Proceedings of the EMNLP 2016—Conference on Empirical Methods in Natural Language Processing, Austin,
TX, USA, 1–5 November 2016; pp. 372–381.
10. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual
Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 483–498. [CrossRef]
11. Liu, Z.; Winata, G.I.; Lin, Z.; Xu, P.; Fung, P. Attention-informed mixed-language training for zero-shot cross-lingual task-
oriented dialogue systems. In Proceedings of the AAAI 2020—34th Conference on Artificial Intelligence, New York, NY, USA,
7–12 February 2020; pp. 8433–8440. [CrossRef]
12. Lin, Z.; Madotto, A.; Winata, G.I.; Xu, P.; Jiang, F.; Hu, Y.; Shi, C.; Fung, P. BiToD: A Bilingual Multi-Domain Dataset For
Task-Oriented Dialogue Modeling. arXiv 2021, arXiv:2106.02787.
13. Liu, Y.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvininejad, M.; Lewis, M.; Zettlemoyer, L. Multilingual denoising pre-training for
neural machine translation. Trans. Assoc. Comput. Linguist. 2020, 8, 726–742. [CrossRef]
14. Sitaram, S.; Chandu, K.R.; Rallabandi, S.K.; Black, A.W. A Survey of Code-switched Speech and Language Processing. arXiv 2019,
arXiv:1904.00784.
15. Louvan, S.; Magnini, B. Simple data augmentation for multilingual NLU in task oriented dialogue systems. In Proceedings
of the Seventh Italian Conference on Computational Linguistics CLIC-IT 2020, Bologna, Italy, 30 November–2 December 2020;
Volume 2769. [CrossRef]
16. Henderson, M.; Thomson, B.; Williams, J. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting
of the Special Interest Group on Discourse and Dialogue (SIGDIAL), Philadelphia, PA, USA, 18–20 June 2014; pp. 263–272.
[CrossRef]
17. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of
the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318.
[CrossRef]
18. PyTorch. Available online: https://pytorch.org/ (accessed on 17 November 2021).
19. Huggingface/Transformers: Transformers: State-of-the-Art Natural Language Processing for Pytorch, TensorFlow, and JAX.
Available online: https://github.com/huggingface/transformers (accessed on 17 November 2021).
20. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. In Proceedings of the 7th International Conference on Learning
Representations, New Orleans, LA, USA, 6–9 May 2019.
mathematics
Article
Improving Machine Reading Comprehension with Multi-Task
Learning and Self-Training
Jianquan Ouyang * and Mengen Fu
College of Computer Cyberspace Security, Xiangtan University, Xiangtan 411105, China; [email protected]
* Correspondence: [email protected]
Yes/no question answering and unanswerable question tasks account for 14% and 3%, respectively. This requires multi-task learning.
Multi-task learning is a field of machine learning in which multiple tasks are learned
in parallel while using a shared representation [9–11]. Compared with learning multiple
tasks individually, this joint learning effectively increases the sample size for training the
model, thus leading to performance improvement by increasing the generalization of the
model [12]. To solve multi-tasking MRC is important, and some straightforward solutions
have been proposed. Liu et al. [13] appended an empty word token to the context and
added a simple classification layer for the MRC model. Hu et al. [14] used two auxiliary
losses, independent span loss to predict plausible answers and independent no answer
loss to determine the answerability of the question. Further, an extra verifier is used to
determine whether the predicted answer is contained by the input snippets. Back et al. [15]
developed an attention-based satisfaction score to compare question embeddings with
candidate answer embeddings. Zhang et al. [16] proposed a verifier layer, which is a linear
layer applied to context embeddings weighted by the start and end distributions over
contextual word representations concatenated to [CLS] token representation for BERT. The
above studies are based on the SQuAD2.0 dataset, but the SQuAD2.0 dataset only includes
two different tasks. We want the model to be able to handle more tasks simultaneously,
such as the CAIL2019 reading comprehension dataset, which contains three different tasks.
Researchers usually set up three types of auxiliary losses to jointly train the model. We think that too many auxiliary losses may hurt overall model training, so we propose to fuse the three different task outputs into a new span extraction output with only one loss function for global training.
Early neural reading models typically used various attention mechanisms to build
interdependent representations of passages and questions, then predict answer boundaries
in turn. A wide range of attention models have been employed, including Attention
Sum Reader [17], Gated attention Reader [18], Self-matching Network [19], Attention over
Attention Reader [20], and Bi-attention Network [21]. Recently, PrLMs have dominated
the design of encoders for MRC with great success. These PrLMs include ELMo [22],
GPT [23], BERT [24], XLNet [25], Roberta [26], ALBERT [27], and ELECTRA [28]. They
bring impressive performance improvements for a wide range of NLP tasks for two main
reasons: (1) language models are pre-trained on a large-scale text corpus, which allows
the models to learn generic language features and serve as a knowledge base; (2) thanks
to the Transformer architecture, language models enjoy a powerful feature representation
learning capability to capture higher-order, long-range dependencies in text.
In the model training phase, since our model requires a large amount of labeled train-
ing data, which are often expensive to obtain or unavailable in many tasks, we additionally
use self-training to generate pseudo-labeled training data to train our model to improve
the accuracy and generalization performance of the model. Self-training [29,30] is a widely
used semi-supervised learning method [31]. Most related studies follow the framework of
traditional Self-Training and Co-Training and focus on designing better policies for select-
ing confident samples: train the base model (student) on a small amount of labeled data, apply it to pseudo-label task-specific unlabeled data, use the pseudo-labels to augment the labeled data, and retrain the student model iteratively.
shown to obtain state-of-the-art performance in tasks such as image classification [32,33],
few-shot text classification [34,35], and neural machine translation [36,37], and has shown
complementary advantages to unsupervised pre-training [32].
In this paper, we propose a machine reading comprehension model based on multi-task fusion training, built on the BERT pre-trained model. The model uses BERT to obtain contextual representations, which are shared by three downstream sub-modules for span extraction, yes/no question answering, and unanswerable questions. Next, we fuse the outputs of the three sub-modules into a new span extraction output, and we use a cross-entropy loss function for global training. We use a self-training method to generate
pseudo-labeled data from unlabeled data to expand labeled data iteratively, thus improving
the model performance and achieving better generalization. The experiments show that
our model can efficiently handle different tasks. We achieved 83.2 EM and 86.7 F1 scores
on the SQuAD2.0 dataset and 73.0 EM and 85.3 F1 scores on the CAIL2019 dataset.
2.2. Method
We focus on three reading comprehension tasks: span extraction, Yes/No question
answering, and unanswerable questions. They all can be described as a triplet <P, Q, A>,
where P is a passage and Q is a question about P. When Q is a span extraction question, the correct answer A is a text span in P; when Q is a Yes/No question, the correct answer A is the text "YES" or "NO"; when Q is an unanswerable question, the correct answer A is the null string. We conducted experiments on both the SQuAD2.0 and CAIL2019 reading comprehension datasets. For a span extraction question, our model should predict the begin and end positions in passage P and extract the text span as answer A; for a Yes/No question, it should return the text "YES" or "NO"; and for an unanswerable question, it should return a null string.
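This triplet formulation can be captured by a small container such as the following sketch, in which the field names and example texts are purely illustrative:

# Illustrative container for the <P, Q, A> triplet; the answer field covers all
# three task types: a span of P, the strings "YES"/"NO", or "" (unanswerable).

from dataclasses import dataclass

@dataclass
class MRCExample:
    passage: str    # P
    question: str   # Q
    answer: str     # A: span text, "YES"/"NO", or "" for unanswerable

ex_span = MRCExample("The court fined the company 5000 yuan.", "How much was the fine?", "5000 yuan")
ex_yesno = MRCExample("The defendant repaid the loan in full.", "Was the loan repaid?", "YES")
ex_null = MRCExample("The contract was signed in 2015.", "Who witnessed the signing?", "")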
Our MRC model mainly includes two aspects of improvement: the multi-task fusion
training model and the self-training method. The multi-task fusion training model mainly
focuses on the fusion of multi-task outputs and training methods. Self-training mainly
focuses on generating pseudo-labeled training data with high confidence from unlabeled
data to expand the training data.
2.2.1. Embedding
We concatenate the question and passage texts as the input, which is first represented as embedding vectors to feed the encoder layer. In detail, the input texts are first tokenized into word pieces (subword tokens). Let T = {t1, t2, ..., tn} denote the sequence of subword tokens of length n. For each token, the embedding vector is the sum of its token embedding, position embedding, and token-type embedding.
We take X = {x1, x2, ..., xn} as the output of the encoding layer, which is an embedding
feature vector of encoded sentence tokens of length n. Then, the embedded feature vector
is fed to the interaction layer to obtain the contextual representation vector.
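A minimal sketch of this embedding step is shown below; the vocabulary size, maximum length, and hidden dimension are placeholder values, and the layer simply sums the three lookups described above.

# Sketch of the embedding layer: the input representation is the sum of
# token, position, and token-type embeddings. Sizes are illustrative.

import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, max_len=512, hidden=768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.pos = nn.Embedding(max_len, hidden)
        self.typ = nn.Embedding(2, hidden)   # 0 = question segment, 1 = passage segment

    def forward(self, token_ids, type_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions) + self.typ(type_ids)

emb = InputEmbedding()
x = emb(torch.tensor([[101, 2054, 2003, 102]]), torch.tensor([[0, 0, 0, 0]]))
print(x.shape)   # torch.Size([1, 4, 768])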
2.2.2. Interaction
Following Devlin et al. [24], the encoded sequence X is processed using multiple layers of Transformers [38] to learn the contextual representation vector. In what follows, we use H = {h1, h2, ..., hn} to denote the last-layer hidden states of the input sequence, and Hc to denote the last-layer hidden state of the first token (the classification token) in the sequence.
where y_i^s and y_i^e represent the true start and end positions of sample i, respectively, and N is the number of samples.
For prediction, given output start and end probability sequences S and E, we calculate
the span score score_span, the unanswerable score score_ua, the yes-answer score score_yes, and the no-answer score score_no:
We thus obtain four answer scores and choose the answer with the largest score as the final answer.
Figure 3. An overview of our base model. T is the sequence of subword tokens, and X is the output of the encoding layer. The task-specific learning module sets up three sub-modules to generate probability sequences or values for the different answers of each sample. CAT denotes concatenation of the input sequences or values.
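Since the score equations themselves are not reproduced above, the sketch below shows one plausible realization of this prediction step, assuming that the null, yes, and no decisions are read off dedicated positions concatenated ahead of the passage tokens; the exact formulas used in the paper may differ.

# Hypothetical realization of the answer-selection step: compute a span score,
# an unanswerable score, and yes/no scores from the start (S) and end (E)
# probability sequences, then return the answer with the largest score.
# The use of dedicated positions for "", "YES", and "NO" is an assumption.

def pick_answer(S, E, tokens, special=None, max_answer_len=30):
    special = special or {"": 0, "YES": 1, "NO": 2}   # positions of the fused outputs
    first = max(special.values()) + 1                 # first real passage position

    # Best passage span (start <= end, bounded length).
    span_score, best_span = float("-inf"), (first, first)
    for i in range(first, len(S)):
        for j in range(i, min(i + max_answer_len, len(S))):
            if S[i] + E[j] > span_score:
                span_score, best_span = S[i] + E[j], (i, j)

    # Scores of the three non-span answers, read off their dedicated positions.
    scores = {ans: S[pos] + E[pos] for ans, pos in special.items()}
    scores["<span>"] = span_score

    best = max(scores, key=scores.get)
    if best == "<span>":
        i, j = best_span
        return " ".join(tokens[i:j + 1])
    return best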
The unlabeled dataset can be described as a tuple <Pt, Qt>, where Pt = {p1, p2, ..., pn} is a set of passages and Qt = {q11, q12, ..., qnm} is the set of questions over Pt. To obtain pseudo-labels, we run the model on the unlabeled dataset Dt, and Algorithm 2 explains the labeling process. For each unlabeled example (ci, qi), the model gives predicted answers Aij = {aij1, aij2, ..., aijp} and the corresponding confidences Eij = {eij1, eij2, ..., eijp}. Then we use a threshold to filter the data; only the examples whose maximum confidence is above the threshold are retained. For each question, the answer with the maximum confidence is used as the pseudo-label.
In the span extraction task, the model answers questions by generating two probability
distributions for contextual tokens. One is the start position and the other is the end position.
The answer is then extracted by choosing the span between the start and end positions.
Here, we consider probability as a measure of confidence. More specifically, we take
the distributions of the start and end tokens as input, then filter out unlikely candidates
(for example, candidates whose end token precedes the start token) and perform a beam
search with the sum of the start/end distributions as the confidence. The threshold was set
to 0.95, which means that only those examples with the most confident answers scoring
greater than 0.95 were used in the self-training phase.
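The confidence filtering can be sketched as follows; the simple top-k search over start/end probabilities stands in for the beam search described above, and the function names are illustrative.

# Sketch of pseudo-label selection: keep an unlabeled example only if its most
# confident predicted span scores above the threshold (0.95), and use that span
# as the pseudo-label. start_probs / end_probs are the start and end distributions.

import numpy as np

THRESHOLD = 0.95

def best_span(start_probs, end_probs, top_k=20, max_answer_len=30):
    """Return (confidence, (i, j)) for the most confident candidate span,
    discarding candidates whose end precedes the start or that are too long."""
    best_conf, best = float("-inf"), None
    starts = np.argsort(start_probs)[::-1][:top_k]
    ends = np.argsort(end_probs)[::-1][:top_k]
    for i in starts:
        for j in ends:
            if i <= j < i + max_answer_len:
                conf = start_probs[i] + end_probs[j]   # summed confidence
                if conf > best_conf:
                    best_conf, best = conf, (int(i), int(j))
    return best_conf, best

def make_pseudo_label(start_probs, end_probs, tokens):
    conf, span = best_span(np.asarray(start_probs), np.asarray(end_probs))
    if span is None or conf <= THRESHOLD:
        return None                                    # drop low-confidence examples
    i, j = span
    return " ".join(tokens[i:j + 1])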
F1 score = 2 × (precision × recall)/(precision + recall) (3)
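For reference, the sketch below computes this metric with the usual SQuAD-style token overlap; whitespace tokenization is a simplification (Chinese evaluation would typically operate on characters).

# Token-overlap F1 between a predicted and a gold answer: precision and recall
# are computed over the shared tokens, then combined as in Equation (3).

from collections import Counter

def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the defendant repaid the loan", "the loan was repaid"))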
3.1.2. Setup
We use the available pre-trained models as encoders to build the baseline MRC models:
BERT, ALBERT, and ELECTRA. Our implementations are based on the public PyTorch implementations in the Transformers library. We use pre-trained language model weights in the
encoder module of the model, using all official hyperparameters. For fine-tuning in our
task, we set the initial learning rate in {2 × 10−5 , 3 × 10−5 } with a warmup rate of 0.1,
and L2 weight decay of 0.01. Specifically, 2 × 10−5 for BERT, 3 × 10−5 for ALBERT and
ELECTRA. For the batch size, BERT is set to 20, ALBERT and ELECTRA are set to 8. In
all experiments, the maximum number of epochs was set to 3. Texts are tokenized using
wordpieces [39], with a maximum length of 512. Hyperparameters were selected using the
development set.
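A sketch of how these hyperparameters might be wired together with the Transformers library is given below; the model head, step counts, and data handling are placeholders rather than the authors' exact code.

# Sketch of the optimization setup: AdamW with L2 weight decay 0.01, a linear
# schedule with 10% warmup, and a task-dependent learning rate (2e-5 for BERT,
# 3e-5 for ALBERT/ELECTRA). The step count below is a placeholder.

import torch
from transformers import AutoModel, get_linear_schedule_with_warmup

model = AutoModel.from_pretrained("bert-base-uncased")
learning_rate = 2e-5                       # 3e-5 for ALBERT / ELECTRA
num_training_steps = 3 * 1000              # 3 epochs x steps per epoch (placeholder)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),   # warmup rate of 0.1
    num_training_steps=num_training_steps,
)

# Inside the training loop, after loss.backward():
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()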
3.2. Results
Tables 1 and 2 compare our model with the current more advanced models on the
SQuAD2.0 and CAIL2019 datasets. In the English dataset SQuAD2.0, our model improves
compared to the NeurQuRI and BERT models but lags behind the ALBERT and ELECTRA
models. This is because our model is constructed based on the pre-trained language
model, and the ability of the pre-trained model to learn generic language representations is
important for the overall model performance. Currently available pre-trained language
models for machine-reading comprehension tasks include BERT, ALBERT, and ELECTRA.
The ALBERT model, an improved version of the BERT model, can effectively improve downstream performance on multi-sentence encoding tasks through three improvements:
factorized embedding parameterization, cross-layer parameter sharing, and inter-sentence
coherence loss. The ELECTRA model proposes a model training framework borrowed
from the design of GAN networks and a new pre-training task of replaced token detection.
These two improvements allow ELECTRA to learn more contextual representations than
BERT. Theoretically, our model can be designed based on the ALBERT or ELECTRA pre-
trained model. However, the problem is that the training time of the ALBERT or ELECTRA
pre-trained model is much longer than that of the BERT model, and our self-training
method requires iterative training of the model. As such, the time cost is unacceptable.
In the Chinese reading comprehension dataset CAIL2019, we did not find an available
Chinese ALBERT pre-training model, so we did not conduct a comparison experiment
of ALBERT models. Our model improves compared to NeurQuRI, BERT, and ELECTRA
models. Benefiting from our task-specific learning module, our model can identify samples of different tasks and produce correct answers better than the other models.
Table 1. The results (%) for the SQuAD2.0 dataset. Time denotes the training time of the model; ST denotes our self-training method (Section 2.2.4).
We fuse the outputs of the different tasks into a new span extraction output and train the model using a cross-entropy loss function. This effectively prevents imbalanced task losses from confusing the overall model training. Our self-training method is also applied to the model successfully. The experiments show that, compared to our base model, the self-training method increases EM by 0.3 and F1 by 0.3 on the SQuAD2.0 dataset, and EM by 0.9 and F1 by 1.1 on the CAIL2019 reading comprehension dataset.
Table 2. The results (%) for the CAIL2019 reading comprehension dataset. Time denotes the training time of the model; ST denotes our self-training method (Section 2.2.4).
4. Discussion
This paper compares the performance of three methods: MTL with fusion training, MTL with three auxiliary losses, and a pipeline approach. We implemented three comparative experiments on the CAIL2019 dataset. Based on the idea of Hu et al. [14], we adopt a BERT model to obtain contextual representations and use three auxiliary loss functions to process the outputs of the different task modules; the training loss of the whole model is the weighted sum of the losses of the three modules. The pipeline approach uses a BERT model for training the extractive MRC task, turns the non-extractive MRC tasks into a classification task, and learns the classification task directly over the sentences in each passage. Finally, the predicted results of the two models are combined and the scores are calculated.
The specific experimental comparison results are shown in Table 3. It can be seen that the performance of the pipeline approach is lower than that of the other two approaches. This may be because the pipeline approach only optimizes the loss of one task and lacks interaction with the other tasks. However, the pipeline approach also has its advantages, such as better controllability of the results for each subtask and the convenience of manually inspecting each subtask. In multi-task learning, the magnitude of a task's loss is considered a perceived similarity measure; imbalanced tasks in this regard produce widely varying gradients, which can confuse model training. MTL with three auxiliary losses uses task weighting coefficients to help normalize the gradient magnitudes across the tasks. However, its specific implementation, which employs an average task-loss weighting coefficient, does not achieve balance among tasks, and this may bias the model toward a particular task during training. Our research revealed that the outputs of the different MRC task modules can be fused into a single sequence and fed into one loss function for training. The model can then automatically learn the weighting between the different tasks during training and make reasonable judgments. Experiments show that our model, although lower than MTL with three auxiliary losses in the EM metric, exceeds it by 0.8% in the F1 metric.
Table 3. Comparison of the three training methods on the CAIL2019 dataset (%).
Training Method | EM | F1
MTL with fusion training | 72.1 | 84.2
MTL with three auxiliary losses | 72.6 | 83.4
Pipeline method | 70.6 | 81.0
It is well known that the size of the labeled dataset affects the performance of the
pre-trained model for downstream task fine-tuning. We then apply self-training to the
models trained above to investigate the effect of labeled datasets of different sizes on the
effectiveness of the self-training method for the base model. We set the size of the unlabeled
dataset used for the self-training approach to 100%. Figure 4 shows that the evaluation
performance of base model fine-tuning is highly dependent on the size of the domain
labeled data, and the self-training method always improves the evaluation performance of
the base model. However, as the evaluation performance of the base model improves, the
self-training method has less and less improvement on the effectiveness of the base model.
Figure 4. Results of the self-training improvement for labeled datasets of different sizes. (a) SQuAD 2.0 dataset. (b) CAIL2019 Reading Comprehension dataset.
5. Conclusions
In this paper, we construct a multi-task fusion training reading comprehension model based on a BERT pre-trained model. The model uses BERT to obtain contextual representations, which are then shared by three downstream sub-modules for span extraction, yes/no question answering, and unanswerable questions. We then fuse the outputs of the three sub-modules into a new span extraction output and use a cross-entropy loss function on the fused output for global training. We use self-training to generate pseudo-labeled training data to train our model, improving its accuracy and generalization performance. However, our self-training approach requires iteratively training a model, which is time-consuming and needs further optimization. Our model
is designed for only three specific tasks and cannot be widely applied to more machine
reading comprehension task scenarios. We hope to explore more effective multi-task
methods for machine-reading comprehension in the future.
Author Contributions: Conceptualization J.O. and M.F.; methodology, J.O. and M.F.; software, M.F.;
validation, J.O. and M.F.; formal analysis, J.O. and M.F.; investigation, J.O. and M.F.; resources,
J.O.; data curation, J.O. and M.F.; writing—original draft preparation, M.F.; writing—review and
editing, J.O. and M.F.; visualization, M.F.; supervision, J.O.; project administration, J.O. and M.F.;
funding acquisition, J.O. and M.F. All authors have read and agreed to the published version of
the manuscript.
Funding: This research has been supported by Key Projects of the Ministry of Science and Technology
of the People's Republic of China (2020YFC0832401).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The CAIL2019 reading comprehension datasets used in our study are
available at https://github.com/china-ai-law-challenge/CAIL2019. The SQuAD 2.0 datasets are
available at https://rajpurkar.github.io/SQuAD-explorer.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Hermann, K.M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; Blunsom, P. Teaching machines to read and
comprehend. Adv. Neural Inf. Process. Syst. 2015, 28, 1693–1701.
2. Zhang, Z.; Yang, J.; Zhao, H. Retrospective Reader for Machine Reading Comprehension. In Proceedings of the AAAI Conference on
Artificial Intelligence, Virtual, 2–9 February 2021; AAAI Press: Palo Alto, CA, USA, 2021; Volume 35, pp. 14506–14514.
3. Xie, Q.; Lai, G.; Dai, Z.; Hovy, E. Large-scale Cloze Test Dataset Created by Teachers. In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2344–2356.
4. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings
of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, USA, 1–5 November 2016;
pp. 2383–2392.
5. Inoue, N.; Stenetorp, P.; Inui, K. R4C: A Benchmark for Evaluating RC Systems to Get the Right Answer for the Right Reason. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, 6–8 July 2020;
Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 6740–6750.
6. Rajpurkar, P.; Jia, R.; Liang, P. Know what you don’t know: Unanswerable questions for SQuAD. arXiv 2018, arXiv:1806.03822.
7. Reddy, S.; Chen, D.; Manning, C.D. Coqa: A conversational question answering challenge. Trans. Assoc. Comput. Linguist. 2019, 7,
249–266. [CrossRef]
8. Xiao, C.; Zhong, H.; Guo, Z.; Tu, C.; Liu, Z.; Sun, M.; Xu, J. Cail2019-scm: A dataset of similar case matching in legal domain.
arXiv 2019, arXiv:1911.08962.
9. Jacob, I.J. Performance evaluation of caps-net based multitask learning architecture for text classification. J. Artif. Intell. 2020, 2,
1–10.
10. Peng, Y.; Chen, Q.; Lu, Z. An empirical study of multi-task learning on BERT for biomedical text mining. arXiv 2020,
arXiv:2005.02799.
11. Ruder, S.; Bingel, J.; Augenstein, I.; Søgaard, A. Latent multi-task architecture learning. In Proceedings of the AAAI Conference
on Artificial Intelligence, Honolulu, HI, USA, 27 January 2019; AAAI: Honolulu, HI, USA, 2019; Volume 33, pp. 4822–4829.
12. Zhang, Y.; Yang, Q. A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 2021, 2, 1–10. [CrossRef]
13. Liu, X.; Li, W.; Fang, Y.; Kim, A.; Duh, K.; Gao, J. Stochastic answer networks for squad 2.0. arXiv 2018, arXiv:1809.09194.
14. Hu, M.; Wei, F.; Peng, Y.; Huang, Z.; Yang, N.; Li, D. Read+ verify: Machine reading comprehension with unanswerable
questions. In Proceedings of the AAAI Conference on Artificial Intelligence, Palo Alto, CA, USA, 2–9 February 2019; Volume 33,
pp. 6529–6537.
15. Back, S.; Chinthakindi, S.C.; Kedia, A.; Lee, H.; Choo, J. NeurQuRI: Neural question requirement inspector for answerability
prediction in machine reading comprehension. In Proceedings of the International Conference on Learning Representations,
New Orleans, LA, USA, 6–9 May 2019.
16. Zhang, Z.; Wu, Y.; Zhou, J.; Duan, S.; Zhao, H.; Wang, R. SG-Net: Syntax-guided machine reading comprehension. In Proceedings
of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 9636–9643.
17. Kadlec, R.; Schmid, M.; Bajgar, O.; Kleindienst, J. Text Understanding with the Attention Sum Reader Network. In Proceedings
of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany,
7–12 August 2016; pp. 908–918.
18. Dhingra, B.; Liu, H.; Yang, Z.; Cohen, W.W.; Salakhutdinov, R. Gated-Attention Readers for Text Comprehension. In Proceedings
of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada,
30 July–4 August 2017; pp. 1832–1846.
19. Park, C.; Song, H.; Lee, C. S3-NET: SRU-based sentence and self-matching networks for machine reading comprehension. ACM
Trans. Asian -Low-Resour. Lang. Inf. Process. (TALLIP) 2020, 19, 1–14. [CrossRef]
20. Cui, Y.; Chen, Z.; Wei, S.; Wang, S.; Liu, T.; Hu, G. Attention-over-Attention Neural Networks for Reading Comprehension. In
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver,
BC, Canada, 30 July–4 August 2017; pp. 593–602.
21. Kim, J.H.; Jun, J.; Zhang, B.T. Bilinear attention networks. arXiv 2018, arXiv:1805.07932.
22. Sarzynska-Wawer, J.; Wawer, A.; Pawlak, A.; Szymanowska, J.; Stefaniak, I.; Jarkiewicz, M.; Okruszek, L. Detecting formal
thought disorder by deep contextualized word representations. Psychiatry Res. 2021, 304, 114135. [CrossRef] [PubMed]
23. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Amodei, D. Language models are few-shot learners. arXiv
2020, arXiv:2005.14165.
24. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding.
arXiv 2018, arXiv:1810.04805.
25. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. Xlnet: Generalized autoregressive pretraining for language
understanding. Adv. Neural Inf. Process. Syst. 2019, 32, 5753–5763.
26. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv
2019, arXiv:1907.11692.
27. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A lite bert for self-supervised learning of language
representations. arXiv 2019, arXiv:1909.11942.
228
Mathematics 2022, 10, 310
28. Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. Electra: Pre-training text encoders as discriminators rather than generators. arXiv
2020, arXiv:2003.10555.
29. Lee, D.H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on
Challenges in Representation Learning; ICML: Atlanta, GA, USA, 2013; Volume 3, p. 896.
30. Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting
of the Association for Computational Linguistics, Cambridge, MA, USA, 26–30 June 1995; pp. 189–196.
31. Zhu, X.; Goldberg, A.B. Introduction to semi-supervised learning. Synth. Lect. Artif. Intell. Mach. Learn. 2009, 3, 1–130. [CrossRef]
32. Zoph, B.; Ghiasi, G.; Lin, T.Y.; Cui, Y.; Liu, H.; Cubuk, E.D.; Le, Q.V. Rethinking pre-training and self-training. arXiv 2020,
arXiv:2006.06882.
33. Zhao, R.; Liu, T.; Xiao, J.; Lun, D.P.; Lam, K.M. Deep multi-task learning for facial expression recognition and synthesis based on
selective feature sharing. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy,
10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 4412–4419.
34. Wang, Y.; Mukherjee, S.; Chu, H.; Tu, Y.; Wu, M.; Gao, J.; Awadallah, A.H. Adaptive self-training for few-shot neural sequence
labeling. arXiv 2020, arXiv:2010.03680.
35. Li, C.; Li, X.; Ouyang, J. Semi-Supervised Text Classification with Balanced Deep Representation Distributions. In Proceedings of
the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural
Language Processing, Bangkok, Thailand, 1–6 August 2021; pp. 5044–5053.
36. He, J.; Gu, J.; Shen, J.; Ranzato, M.A. Revisiting self-training for neural sequence generation. arXiv 2019, arXiv:1909.13788.
37. Jiao, W.; Wang, X.; Tu, Z.; Shi, S.; Lyu, M.R.; King, I. Self-Training Sampling with Monolingual Data Uncertainty for Neural
Machine Translation. arXiv 2021, arXiv:2106.00941.
38. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is all you need. In Advances in
Neural Information Processing Systems; NeurIPS Proceedings: Long Beach, CA, USA, 2017; pp. 5998–6008.
39. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Dean, J. Google’s neural machine translation system: Bridging
the gap between human and machine translation. arXiv 2016, arXiv:1609.08144.
229
mathematics
Article
Evaluating Research Trends from Journal Paper Metadata,
Considering the Research Publication Latency
Christian-Daniel Curiac 1 , Ovidiu Banias 2, * and Mihai Micea 1
1 Computer and Information Technology Department, Politehnica University of Timisoara, V. Parvan 2,
300223 Timisoara, Romania; [email protected] (C.-D.C.); [email protected] (M.M.)
2 Automation and Applied Informatics Department, Politehnica University of Timisoara, V. Parvan 2,
300223 Timisoara, Romania
* Correspondence: [email protected]
Abstract: Investigating the research trends within a scientific domain by analyzing semantic informa-
tion extracted from scientific journals has been a topic of interest in the natural language processing
(NLP) field. A research trend evaluation is generally based on the time evolution of the term occur-
rence or the term topic, but it neglects an important aspect—research publication latency. The average
time lag between the research and its publication may vary from one month to more than one year,
and it is a characteristic that may have significant impact when assessing research trends, mainly
for rapidly evolving scientific areas. To cope with this problem, the present paper is the first work
that explicitly considers research publication latency as a parameter in the trend evaluation process.
Consequently, we provide a new trend detection methodology that mixes auto-ARIMA prediction
with Mann–Kendall trend evaluations. The experimental results in an electronic design automation
case study prove the viability of our approach.
Keywords: Mann–Kendall test; Sen’s slope; auto-ARIMA method; paper metadata; research trend
Moreover, Sharma et al. [7] used the Mann–Kendall test to understand machine learning
research topic trends using metadata from journal papers. In [8], Zou analyzed the journal
titles and abstracts to explore the temporal popularity of 50 drug safety research trends over
time using the MK test, while Neresini et al. [15] used the Hamed and Rao [2] MK variant
to extract trends from correlated time series. All of these papers assess actual research
trends by applying trend evaluation methods directly to key term occurrences in metadata
from published papers without explicitly considering the publication latency.
From our perspective, when evaluating research trends using information extracted
from published journal articles, an important issue has to be considered: there is a time
lag that may range up to one year or even more from the end of the research work until
the journal paper is published. This delay obviously has an important impact mainly
for fast developing domains [16,17], where the trends abruptly change, driven by rapid
theory or technology advancements. To cover the identified gap, we propose a novel
trend-computing methodology backed by a new method: the n-steps-ahead Mann–Kendall
test (nsaMK). This method is based on auto-ARIMA forecasting coupled with the
Yue–Wang variant of the Mann–Kendall (MK) test. To the best of our knowledge, our
method is the first that incorporates the effects of research publication latency on research
trend evaluations.
The main contributions of this paper are summarized below:
• A novel methodology that includes the new nsaMK method to identify term trends
from journal paper metadata, considering the inherent time lag between research
completion and the paper publication date;
• A definition of the research publication latency and an empirical formula to derive the
number of prediction steps considered by the proposed method, in order to counteract the
effect of the journal review and publication process on the research trend evaluation;
• An evaluation of the new nsaMK method in an electronic design automation case
study by comparing it with the classical MK trend test. The superiority of nsaMK is
confirmed by a 45% reduction of the mean square error of Sen's slope evaluations and
by a 66% increase of correct term trend indications.
The rest of the paper is organized as follows. Section 2 describes the two algorithms
that represent the pillars on which our strategy is built: auto-ARIMA prediction and the
Mann–Kendall trend test with Sen’s slope estimator. Section 3 presents the proposed
methodology and the new method for term trend evaluation. Section 4 presents an illus-
trative example of using the nsaMK trend test by comparing it with classical MK, while
Section 5 concludes the paper.
2. Preliminaries
This section briefly presents the two algorithms (i.e., auto-ARIMA, MK with Sen’s
slope) that constitute the foundation upon which the proposed nsaMK method is built.
The autoregressive moving average model ARMA(p, q) describes a time-series Xt as a
combination of its own past values and past white noise terms:
Xt = c + ϕ1 Xt−1 + · · · + ϕp Xt−p + εt + θ1 εt−1 + · · · + θq εt−q, (1)
where p is the autoregressive order, q is the moving average order, ϕi and θj are the model's
coefficients, c is a constant, and εt is a white noise random variable.
The ARMA( p, q) model from (1) can be rewritten in a compact form using the back-
ward shift operator B (BXt = Xt−1 ) as:
ϕ( B) Xt = θ ( B)ε t + c, (2)
ϕ ( B ) = 1 − ϕ1 B − ϕ2 B2 − · · · − ϕ p B p (3)
and
θ ( B ) = 1 + θ1 B + θ2 B2 + · · · + θ q B q . (4)
In practice, the use of ARMA models is restricted to stationary time-series (i.e., time-
series for which the statistical properties do not change over time) [19]. To address this issue,
Box and Jenkins [18] applied a differencing procedure for transforming non-stationary
time-series into stationary ones.
The first difference of a given time-series Xt is a new time-series Yt where each
observation is replaced by the change from the last encountered value:
Yt = Xt − Xt−1 = (1 − B) Xt . (5)
More generally, a differencing of order d can be applied until the series becomes stationary:
Yt = (1 − B)^d Xt. (6)
By applying the generalized differencing process (6) to the ARMA model described by
(2), a general ARIMA(p, d, q) model is obtained in the form [19]:
ϕ(B)(1 − B)^d Xt = θ(B)εt + c, (7)
where d is the differencing order needed to make the original time-series stationary.
For a specified triplet ( p, d, q), the coefficients ϕi and θ j are generally obtained using the
maximum likelihood parameter estimation method [21], while the most suitable values for
the orders p, d, and q can be derived using the auto-ARIMA method [22]. The auto-ARIMA
method varies the p, d, and q parameters in given intervals and evaluates a chosen goodness-
of-fit indicator to select the best fitting ARIMA(p, d, q) model. In our implementation, we
employed the Akaike information criterion (AIC), which can be computed by:
AIC = 2k − 2 ln(L), (8)
where L is the maximum value of the likelihood function for the model, and k is the number
of estimated parameters of the model, in our case k = p + q + 2 if c ≠ 0 and k = p + q + 1
if c = 0 [22]. When AIC has a minimal value, the best trade-off between the model's
goodness of fit and its simplicity is achieved.
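As an illustration of this selection step, the following sketch (our own code, not the authors' implementation; the series values and the helper name auto_arima_forecast are made up) fits a small grid of candidate ARIMA(p, d, q) models with statsmodels, keeps the one with the lowest AIC, and forecasts n_steps ahead:

```python
import itertools
import warnings

import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX


def auto_arima_forecast(series, n_steps, max_p=3, max_d=2, max_q=3):
    """Grid-search ARIMA(p, d, q) orders by AIC and forecast with the best fit
    (a simplified stand-in for a full auto-ARIMA search)."""
    best_aic, best_fit = float("inf"), None
    for p, d, q in itertools.product(range(max_p + 1), range(max_d + 1), range(max_q + 1)):
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                fit = SARIMAX(series, order=(p, d, q), trend="c").fit(disp=False)
        except Exception:
            continue  # skip (p, d, q) combinations that fail to converge
        if fit.aic < best_aic:
            best_aic, best_fit = fit.aic, fit
    return best_fit.forecast(steps=n_steps), best_aic


# Hypothetical yearly occurrence counts of one key term (2010-2019).
counts = pd.Series([12, 15, 14, 18, 21, 19, 25, 28, 27, 31], index=range(2010, 2020))
prediction, aic = auto_arima_forecast(counts, n_steps=1)
```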
The Mann–Kendall (MK) test checks for a monotonic trend in a time-series x1, x2, . . . , xn
by computing the statistic:
S = ∑_{k=1}^{n−1} ∑_{j=k+1}^{n} sgn(xj − xk), (9)
where the sign function is defined as:
sgn(x) = 1 for x > 0, 0 for x = 0, −1 for x < 0. (10)
Under the null hypothesis of no trend, the variance of S is:
V(S) = [n(n − 1)(2n + 5) − ∑_{p=1}^{m} rp(rp − 1)(2rp + 5)]/18, (11)
where m represents the number of tied groups (i.e., successive observations having the
same value) and rp is the rank of the pth tied group.
The standardized form of the MK z-statistic (Z_MK), having a zero mean and unit
variance, is given by:
Z_MK = (S − 1)/√V(S) for S > 0, Z_MK = 0 for S = 0, Z_MK = (S + 1)/√V(S) for S < 0, (12)
where a positive value indicates an upward trend, while a negative one describes a down-
ward trend.
The original version of the MK test provides poor results in the case of correlated time-
series. In order to solve this issue, Yue and Wang [24] proposed the replacement of the
variance V(S) in Equation (12) with a value that considers the effective sample size n*:
V*(S) = (n/n*) V(S) = [1 + 2 ∑_{s=1}^{n−1} (1 − s/n) ρs] V(S), (13)
where ρs represents the lag-s serial correlation coefficient for the xi time-series and can be
computed using the following equation:
ρs = [ (1/(n − s)) ∑_{v=1}^{n−s} (xv − E(xi))(xv+s − E(xi)) ] / [ (1/n) ∑_{v=1}^{n} (xv − E(xi))^2 ]. (14)
In practice, to evaluate if the z-statistic computed by Equation (12) is reliable, the two-
sided p-value is calculated using the exact algorithm given in [25]. This p-value expresses
the plausibility of the null hypothesis H0 (i.e., no trend in the time series) to be true [26].
The smaller the p-value, the higher the significance level of the z-statistic. Thus, if the two-sided
p-value of the MK test is below a given threshold (e.g., p-value < 0.01), then a statistically
significant trend is present in the data series.
Since the z-statistic coupled with the p-value of the MK test can reveal a trend in the
data series, the magnitude of the trend is generally evaluated using Sen's slope β, which is
computed as the median of the slopes corresponding to all lines defined by pairs of time
series observations [27]:
β = median{ (xj − xi)/(j − i) }, j > i, (15)
where β < 0 indicates a downward trend and β > 0 an upward trend in the time series.
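Both the Yue–Wang variant of the test and Sen's slope are available in the pyMannKendall package used later in the experiments; a minimal usage sketch (the series values are made up for illustration) is:

```python
import pymannkendall as mk

# Made-up yearly occurrence counts for a key term.
counts = [12, 15, 14, 18, 21, 19, 25, 28, 27, 31]

# Yue-Wang variant of the Mann-Kendall test (variance corrected for
# serial correlation through the effective sample size).
result = mk.yue_wang_modification_test(counts)

print(result.trend)  # 'increasing', 'decreasing' or 'no trend'
print(result.z)      # standardized MK z-statistic
print(result.p)      # two-sided p-value
print(result.slope)  # Sen's slope estimate
```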
Definition 1. Publication latency (t PL) is the average time lag from the date a manuscript is
submitted to the date when the resulting paper is first published. The publication latency is specific
to the journal where the paper is published.
The publication latency is practically the mean time to review, revise, and first publish
scientific papers in a specified publication. It can be obtained by averaging the time
needed for individual papers to be published. Since this type of information is generally
not included in paper metadata records, it needs to be manually extracted from the final
versions of the papers.
Definition 2. Research publication latency (t RPL) is the average time lag from the moment a
research work is completed to the date when the resulting paper is first published.
In order to compute the research publication latency t RPL for a given research work,
besides the publication latency t PL induced by the journal, we have to consider the mean
time for paper writing t PW:
t RPL = t PL + t PW [years]. (16)
The mean time for paper writing t PW is a positive value that can be taken in the
interval between one and eight weeks depending on the type of publication (e.g., for a
paper published as a short communication t PW has lower values, while for long papers, the
values may be considerably higher). For our evaluations, we considered t PW = 0.1 years.
Our methodology to evaluate research term trends based on information contained in
journal paper metadata consists of three phases:
• Phase I: identify the number of steps N to be predicted. This number depends on the
research publication latency and on the moment in time for which the research trends
are computed and can be obtained using the following formula:
N = ⌊t RPL + τ⌉, (17)
where ⌊·⌉ is the rounding (nearest integer) function, and τ is the time deviation from
the moment in time the last value in the annual time series was recorded to the date
for which the research trends are computed. Since the published papers within a year
are grouped in journal issues that are generally uniformly distributed during that
year, we may consider that the recording day for each year is the middle of that year.
Thus, if we want to calculate the research trends on 1 January 2021, when having the
last time series observation recorded for 2020, τ = 0.5 years, while if we calculate the
research trends for 2 July 2021, τ = 1 year.
• Phase II: form the annual time series for a specified key term by computing the number
of its occurrences in paper metadata (i.e., title, keywords and abstract) during each year.
For this, the following procedure can be used: each paper's metadata are automatically
or manually collected; the titles, keywords, and abstracts are concatenated into a text
document, which is fed into an entity-linking procedure (e.g., TagMe [28], AIDA [29],
Wikipedia Miner [30]) to obtain the list of terms that characterizes the paper; and the
number of papers in which the key term occurs is counted for each year.
• Phase III: apply the proposed n-steps-ahead Mann–Kendall procedure for the annual
time series containing the occurrences of the specified key term.
The proposed nsaMK method is described in the next paragraph.
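A compact sketch of the three phases is given below (our own illustrative code, not the authors' implementation; it assumes annual term counts already extracted in Phase II, uses a fixed ARIMA order instead of the full auto-ARIMA search, and relies on the rounding rule of Equation (17)):

```python
import pymannkendall as mk
from statsmodels.tsa.statespace.sarimax import SARIMAX


def nsa_mk(counts, t_pl, t_pw=0.1, tau=0.5):
    """n-steps-ahead Mann-Kendall sketch: forecast the research activity that is
    not yet visible in publications, then test the extended series for a trend."""
    # Phase I: number of prediction steps (Equation (17)).
    t_rpl = t_pl + t_pw               # research publication latency (Equation (16))
    n_steps = round(t_rpl + tau)      # nearest-integer rounding
    # Phase III: forecast and append the predictions to the annual series
    # (a fixed ARIMA(1, 0, 1) order is used here for brevity).
    fit = SARIMAX(list(counts), order=(1, 0, 1), trend="c").fit(disp=False)
    extended = list(counts) + list(fit.forecast(steps=n_steps))
    return mk.yue_wang_modification_test(extended)


# Phase II is assumed to have produced these (made-up) counts for 2010-2019.
counts = [12, 15, 14, 18, 21, 19, 25, 28, 27, 31]
result = nsa_mk(counts, t_pl=0.55)    # TCAD publication latency reported in Section 4
print(result.trend, result.slope)
```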
4. Experimental Results
We tested our research trend evaluation methodology against the standard Mann–
Kendall test with Sen’s slope method (Yue and Wang variant), using journal paper metadata
from 2010 to 2019, with the observations from 2020 as ground truth. We evaluated the trends
of the main key terms that characterized the highly dynamic research domain of electronic
design automation (EDA) using paper metadata extracted from the IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems (TCAD). For this particular
journal, the publication latency, evaluated for the papers published in the first quarter of
2020, is t PL = 0.55 years. By selecting a time deviation τ = 0.5 years that corresponds to
research trend evaluation for 1 January 2021, the number of steps N to be predicted is equal
to one, according to Equation (17).
ndf(T, C) = df(T, C)/N. (18)
Table 1. Top 24 EDA terms in 2020 and their normalized document frequency score.
In Figures 1 and 2, we present two representative examples, namely for the ’algorithm’
and ’logic gates’ key terms. The light blue observations (i.e., for 2010–2019) are used to
compute the slope marked with black solid line by MK2019, the last nine light blue obser-
vations, together with the pink-marked auto-ARIMA prediction are used by nsaMK2020 to
evaluate the slope presented with red dotted line, and the last nine light blue observations
together with the dark blue observation for 2020 are used by MK2020 to reveal the real
trend depicted with a blue dashed line (ground truth).
Figure 2. Trend evaluation comparison for the key term ’logic gate’.
Table 2. Comparison results of nsaMK and MK methods for the top 24 terms in EDA.
12 ram 4.346 1.3 × 10−5 0.002 4.232 2.3 × 10−5 0.003 5.118 3.0 × 10−7 0.004
13 neural network 1.938 0.05250 0.002 0.907 0.36428 0 2.892 0.00382 0.004
14 low power 0 1 0.000 1.712 0.08684 0.001 1.910 0.05607 0.001
15 hybrid 3.298 0.00097 0.003 2.580 0.00987 0.002 4.213 2.5 × 10−5 0.004
16 system on chip 3.637 0.00027 0.003 6.111 9.8 × 10−10 0.004 3.694 0.00022 0.003
17 mathematical model −5.016 5.2 × 10−7 −0.005 −0.924 0.35519 −0.004 −3.686 0.00022 −0.004
18 power −4.734 2.1 × 10−6 −0.002 −3.678 0.00023 −0.001 −2.532 0.01133 −0.001
19 convolutional neural network 3.842 0.00012 0.001 3.275 0.00105 0.001 3.361 0.00077 0.002
20 logic 0.497 0.61901 0.000 −0.685 0.49291 −0.001 2.419 0.01553 0.001
21 memory management 5.262 1.4 × 10−7 0.003 5.672 1.4 × 10−8 0.003 9.513 0 0.004
22 real time systems 0 1 0 2.307 0.02102 0.002 1.859 0.06289 0.003
23 cmos 7.518 5.5 × 10−14 0.003 7.646 2.0 × 10−14 0.004 6.133 8.6 × 10−10 0.002
24 nonvolatile memory 2.058 0.03958 0.001 2.701 0.00690 0.002 5.340 9.2 × 10−8 0.003
For Figure 1, the best auto-ARIMA prediction for 2020 was obtained when p = 1, d = 0,
and q = 1, with an AIC score of −38.633. We may observe that the trend changes from
“decreasing” when considering 2010–2019 to “increasing” (ground-truth) and our method
nsaMK offers a more appropriate result than MK2019. In the case depicted in Figure 2 the
auto-ARIMA obtained the best results when p = 1, d = 0, and q = 0, where the AIC score
was −35.658. We may observe that the trend provided with our nsaMK method offers a
more suitable result than MK2019.
All methods and experiments were implemented in Python 3.8, based on the Mann–
Kendall trend test function yue_wang_modification_test() from the pyMannKendall 1.4.2
package [32], an ARIMA prediction function derived from tsa.statespace.SARIMAX() included
in the statsmodels 0.13.0 package [33], and CountVectorizer() from the scikit-learn 1.0.1 library.
It is worth mentioning that the efficiency of using the nsaMK method strongly depends
on the prediction accuracy provided by ARIMA models. In future work, we intend to
analyze how the trend evaluation performances change when replacing the ARIMA method
in our methodology with exponential smoothing or neural network forecasting. Other
limitations of our method are induced by the use of the Mann–Kendall trend test, which has
poor results when the time series includes periodicities and tends to provide inconclusive
results for short datasets.
5. Conclusions
This paper introduces research publication latency as a new parameter that needs
to be considered when evaluating research trends from journal paper metadata, mainly
within rapidly evolving scientific fields. The proposed method comprises two steps: (i) a
prediction step performed using the auto-ARIMA method to estimate the most recent
research evolution that is not yet available in publications; and, (ii) a trend evaluation
step using a suitable variant of the Mann–Kendall test with Sen’s slope evaluation. Our
simulations, using paper metadata collected from IEEE Transactions on Computer-Aided
Design of Integrated Circuits and System, provide convincing results.
Author Contributions: Conceptualization, C.-D.C.; methodology, C.-D.C., O.B. and M.M.; software,
C.-D.C., O.B. and M.M.; validation, C.-D.C., O.B. and M.M.; formal analysis, C.-D.C., O.B. and M.M.;
investigation, C.-D.C., O.B. and M.M.; resources, O.B. and M.M.; data curation, C.-D.C., O.B. and
M.M.; writing—original draft preparation, C.-D.C. and O.B.; writing—review and editing, C.-D.C.,
O.B. and M.M.; supervision, O.B. and M.M. All authors have read and agreed to the published version
of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Marrone, M. Application of entity linking to identify research fronts and trends. Scientometrics 2020, 122, 357–379. [CrossRef]
2. Hamed, K.H. Trend detection in hydrologic data: The Mann–Kendall trend test under the scaling hypothesis. J. Hydrol. 2008,
349, 350–363. [CrossRef]
3. Önöz, B.; Bayazit, M. The power of statistical tests for trend detection. Turk. J. Eng. Environ. Sci. 2003, 27, 247–251.
4. Wang, F.; Shao, W.; Yu, H.; Kan, G.; He, X.; Zhang, D.; Ren, M.; Wang, G. Re-evaluation of the power of the mann-kendall test for
detecting monotonic trends in hydrometeorological time series. Front. Earth Sci. 2020, 8, 14. [CrossRef]
5. Hirsch, R.M.; Slack, J.R. A nonparametric trend test for seasonal data with serial dependence. Water Resour. Res. 1984, 20, 727–732.
[CrossRef]
6. Marchini, G.S.; Faria, K.V.; Neto, F.L.; Torricelli, F.C.M.; Danilovic, A.; Vicentini, F.C.; Batagello, C.A.; Srougi, M.; Nahas, W.C.;
Mazzucchi, E. Understanding urologic scientific publication patterns and general public interests on stone disease: Lessons
learned from big data platforms. World J. Urol. 2021, 39, 2767–2773. [CrossRef]
7. Sharma, D.; Kumar, B.; Chand, S. A trend analysis of machine learning research with topic models and mann-kendall test. Int. J.
Intell. Syst. Appl. 2019, 11, 70–82. [CrossRef]
8. Zou, C. Analyzing research trends on drug safety using topic modeling. Expert Opin. Drug Saf. 2018, 17, 629–636. [CrossRef]
9. Merz, A.A.; Gutiérrez-Sacristán, A.; Bartz, D.; Williams, N.E.; Ojo, A.; Schaefer, K.M.; Huang, M.; Li, C.Y.; Sandoval, R.S.; Ye, S.;
et al. Population attitudes toward contraceptive methods over time on a social media platform. Am. J. Obstet. Gynecol. 2021,
224, 597.e1–597.e4. [CrossRef] [PubMed]
10. Chen, X.; Xie, H. A structural topic modeling-based bibliometric study of sentiment analysis literature. Cogn. Comput. 2020,
12, 1097–1129. [CrossRef]
11. Chen, X.; Xie, H.; Cheng, G.; Li, Z. A Decade of Sentic Computing: Topic Modeling and Bibliometric Analysis. Cogn. Comput.
2021, 1–24. [CrossRef]
12. Zhang, T.; Huang, X. Viral marketing: Influencer marketing pivots in tourism–a case study of meme influencer instigated travel
interest surge. Curr. Issues Tour. 2021, 1–8. [CrossRef]
13. Chakravorti, D.; Law, K.; Gemmell, J.; Raicu, D. Detecting and characterizing trends in online mental health discussions. In
Proceedings of the 2018 IEEE International Conference on Data Mining Workshops (ICDMW), Singapore, 17–20 November 2018;
pp. 697–706.
14. Modave, F.; Zhao, Y.; Krieger, J.; He, Z.; Guo, Y.; Huo, J.; Prosperi, M.; Bian, J. Understanding perceptions and attitudes in breast
cancer discussions on twitter. Stud. Health Technol. Inform. 2019, 264, 1293. [PubMed]
15. Neresini, F.; Crabu, S.; Di Buccio, E. Tracking biomedicalization in the media: Public discourses on health and medicine in the UK
and Italy, 1984–2017. Soc. Sci. Med. 2019, 243, 112621. [CrossRef]
16. King, A.L.O.; Mirza, F.N.; Mirza, H.N.; Yumeen, N.; Lee, V.; Yumeen, S. Factors associated with the American Academy of
Dermatology abstract publication: A multivariate analysis. J. Am. Acad. Dermatol. 2021. [CrossRef] [PubMed]
17. Andrew, R.M. Towards near real-time, monthly fossil CO2 emissions estimates for the European Union with current-year
projections. Atmos. Pollut. Res. 2021, 12, 101229. [CrossRef]
18. Box, G.; Jenkins, G.; Reinsel, G. Time-Series Analysis: Forecasting and Control; Holden-Day Inc.: San Francisco, CA, USA, 1970;
pp. 575–577.
19. Chatfield, C. Time-Series Forecasting; CRC Press: Boca Raton, FL, USA, 2000.
20. Whittle, P. Hypothesis Testing in Time-Series Analysis; Almquist and Wiksell: Uppsalla, Sweden, 1951.
21. Cryer, J.; Chan, K. Time-Series Analysis with Applications in R; Springer Science & Business Media: New York, NY, USA, 2008.
22. Hyndman, R.; Athanasopoulos, G. Forecasting: Principles and Practice; OTexts: Melbourne, Australia, 2018.
23. Şen, Z. Innovative Trend Methodologies in Science and Engineering; Springer: New York, NY, USA, 2017.
24. Yue, S.; Wang, C. The Mann–Kendall test modified by effective sample size to detect trend in serially correlated hydrological
series. Water Resour. Manag. 2004, 18, 201–218. [CrossRef]
25. Best, D.; Gipps, P. Algorithm AS 71: The upper tail probabilities of Kendall’s Tau. J. R. Stat. Society. Ser. C (Appl. Stat.) 1974,
23, 98–100. [CrossRef]
26. Helsel, D.; Hirsch, R.; Ryberg, K.; Archfield, S.; Gilroy, E. Statistical Methods in Water Resources; Technical Report; US Geological
Survey Techniques and Methods, Book 4, Chapter A3; Elsevier: Amsterdam, The Netherlands, 2020; 458p.
27. Sen, P.K. Estimates of the regression coefficient based on Kendall’s tau. J. Am. Stat. Assoc. 1968, 63, 1379–1389. [CrossRef]
28. Ferragina, P.; Scaiella, U. TagMe: On-the-fly annotation of short text fragments (by Wikipedia entities). In International Conference
on Information and Knowledge Management; ACM: Toronto, ON, Canada, 2010; pp. 1625–1628.
29. Yosef, M.A.; Hoffart, J.; Bordino, I.; Spaniol, M.; Weikum, G. Aida: An online tool for accurate disambiguation of named entities
in text and tables. Proc. VLDB Endow. 2011, 4, 1450–1453. [CrossRef]
30. Milne, D.; Witten, I.H. Learning to link with wikipedia. In Proceedings of the 17th ACM conference on Information and
knowledge management, Napa Valley, CA, USA, 26–30 October 2008; pp. 509–518.
31. Happel, H.J.; Stojanovic, L. Analyzing organizational information gaps. In Proceedings of the 8th Int. Conference on Knowledge
Management, Graz, Austria, 3–5 September 2008; pp. 28–36.
32. Hussain, M.; Mahmud, I. PyMannKendall: A python package for non parametric Mann–Kendall family of trend tests. J. Open
Source Softw. 2019, 4, 1556. [CrossRef]
33. Seabold, S.; Perktold, J. Statsmodels: Econometric and statistical modeling with python. In Proceedings of the 9th Python in
Science Conference, Austin, TX, USA, 28 June–3 July 2010; pp. 92–96.
mathematics
Article
Identifying the Structure of CSCL Conversations Using
String Kernels
Mihai Masala 1,2, *, Stefan Ruseti 1 , Traian Rebedea 1 , Mihai Dascalu 1,3 , Gabriel Gutu-Robu 1
and Stefan Trausan-Matu 1,3
increased online participation, whose traces can be effectively used to create even more
advanced predictive models.
Education leans on online communication to enhance the learning process by integrat-
ing facilities, such as discussion forums or chat conversations, to stimulate collaboration
among peers and with course tutors [13]. Nevertheless, group size has a major impact
on delivery time, level of satisfaction, and overall quality of the result [14]. Usually,
these conversations occur between more than two participants, which leads to numerous
context changes and development of multiple discussion threads within the same conversa-
tion [15,16]. As the number of participants increases, the conversation may become harder
to follow, as the mix of different discussion threads becomes more frequent. Moreover,
divergences and convergences appear between these threads, which may be compared to
dissonances and consonances among counterpointed voices in polyphonic music, which
have a major role in knowledge construction [15,16].
Focusing on multi-participant chat conversations in particular, ambiguities derived
from the inner structure of a conversation are frequent due to the mixture of topics and of
messages on multiple discussion threads, that may overlap in short time spans. As such,
establishing links between utterances greatly facilitates the understanding of the conver-
sation and improves its readability, while also ensuring coherence per discussion thread.
Applications that allow users to manually annotate the utterances they are referring to,
when writing their reply, have long existed [17], whereas popular conversation applica-
tions (e.g., WhatsApp) successfully integrated such functionalities. Although users are
allowed to explicitly add references to previous utterances when issuing a reply, they do
not always annotate their utterances, as this process feels tedious and interrupts the flow
of the conversation.
Our research objective is to automatically discover semantic links between utterances
from multi-participant chat conversations using a supervised approach that integrates
neural networks and string kernels [18]. In terms of theoretical grounding, we establish an
analogy to the sentence selection task for automated question answering—in a nutshell,
detecting semantic links in chat conversations is similar, but more complex. In question
answering, most approaches [19–22] select the candidate sentence most similar to the
question as the suitable answer. In our approach, the user reply is
semantically compared to the previous utterances in the conversation, and the most similar
contribution is selected while considering a sliding window of previous utterances (i.e.,
a predefined time-frame or using a preset number of prior utterances). It is worth noting
that we simplified the problem of identifying links between two utterances by reducing
the context of the conversation to a window of adjacent utterances. Nevertheless, we
emphasize the huge discrepancy in terms of dataset sizes between the question answering
task and the small collections of conversations currently available for our task.
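To make this framing concrete, the sketch below (our own illustration; similarity is a placeholder for any of the scoring models discussed later) selects the semantic link as the most similar utterance within a window of preceding contributions, with an optional threshold for the practical scenario in which no link should be suggested:

```python
def select_semantic_link(utterances, reply_index, similarity, window=10, threshold=None):
    """Return the index of the most similar preceding utterance within the window,
    or None when no candidate reaches the (optional) threshold."""
    reply = utterances[reply_index]
    candidates = range(max(0, reply_index - window), reply_index)
    if not candidates:
        return None
    best = max(candidates, key=lambda i: similarity(reply, utterances[i]))
    if threshold is not None and similarity(reply, utterances[best]) < threshold:
        return None
    return best
```

The similarity function is where the approaches compared below differ (string kernels, embedding-based measures, or a trained neural scorer).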
We summarize our core contributions as follows:
• Employing a method grounded in string kernels used in conjunction with state of
the art NLP features to detect semantic links in conversations; in contrast to previous
studies [23,24], we also impose a threshold for practical usage scenarios, thus ensuring
the ease of integration of our model within chat environments;
• Providing extensive quantitative and qualitative results to validate that the lexical
information provided by string kernels is highly relevant for detecting semantic links
across multiple datasets and multiple learning frameworks (i.e., classification and
regression tasks);
• Obtaining state of the art results on two different datasets by relying on string kernels
and handcrafted conversation-specific features. Our method surpasses the results of
Gutu et al. [25,26] obtained using statistical semantic similarity models and semantic
distances extracted from the WordNet [27] ontology. In addition, our experimental
results argue that simpler supervised models, fine-tuned on relatively small datasets,
such as those used in Computer-Supported Collaborative Learning (CSCL) research,
may perform better on specific tasks than more complex deep learning approaches
frequently employed on large datasets.
In the following subsections we present state-of-the-art methods for computing text
similarity and for detecting semantic links.
The presence kernel between two strings a and b counts the p-grams of a given size that
occur in both of them:
k^p_{0/1}(a, b) = ∑_{v ∈ Σp} inv(a) · inv(b), (3)
where:
• Σ p = all p-grams of a given size p,
• numv (s) = number of occurrences of string (n-gram) v in document s,
• inv (s) = 1 if string (n-gram) v occurs in document s, 0 otherwise.
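These kernels can be computed directly from p-gram counts; the sketch below (ours, over character p-grams of a single size; blending several p-gram sizes, as done later, is omitted) returns the spectrum, presence, and intersection values for a pair of texts:

```python
from collections import Counter


def p_grams(text, p):
    """Count all character p-grams of size p in a text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def string_kernels(a, b, p):
    """Spectrum, presence and intersection kernels between two strings."""
    num_a, num_b = p_grams(a, p), p_grams(b, p)
    shared = set(num_a) & set(num_b)                      # p-grams occurring in both
    spectrum = sum(num_a[v] * num_b[v] for v in shared)   # sum of num_v(a) * num_v(b)
    presence = len(shared)                                # sum of in_v(a) * in_v(b)
    intersection = sum(min(num_a[v], num_b[v]) for v in shared)
    return spectrum, presence, intersection


print(string_kernels("collaborative learning", "collaborative work", p=3))
```

In the experiments reported below, such values are computed for several p-gram ranges and stacked into a lexical feature vector.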
String kernels can also be used as features for different classifiers to solve tasks such
as native language identification [37], protein fold prediction, or digit recognition [38].
Beck and Cohn [39] use the Gaussian Process framework on string kernels with the goal of
optimizing the weights related to each n-gram size, as well as decay parameters responsible
for gaps and matches. Their results show that such a model outperforms linear baselines
on the task of sentiment analysis. Another important result is that, while string kernels
are better than other linear baselines, non-linear methods outperform string kernels; thus,
non-linearly combining string kernels may further improve their performance. One such
extension was proposed by Masala et al. [18] for the task of question answering. The authors
show that a shallow neural network based on string kernels and word embeddings yielded
good results, comparable to the ones obtained by much more complex neural networks.
The main advantage of the approach is that a small number of parameters needs to be
learned, which allows the model to be also trained and used on small datasets, while
concurrently ensuring a very fast training process. We rely on a similar approach for
detecting semantic links in chat conversations, a task with significantly smaller datasets
than question answering.
Transformer-based architectures [47] have become popular in the NLP domain because
of their state-of-the-art performance on a wide range of tasks [48–53]. The idea behind the
Transformer architecture was to replace the classical models used for processing sequences
(e.g., RNNs or CNNs) with self-attention mechanisms that allow global and complex
interactions between any two words in the input sequence. For example, BERT [48] used
multiple layers of Transformers trained in a semi-supervised manner. The training of BERT
is based on two tasks: Masked LM (MLM)—in which a random token (word) is masked and
the model is asked to predict the correct word—and Next Sentence Prediction (NSP)—in
which the model is given two sentences A and B and is trained to predict whether sentence
B follows sentence A. In our experiments, we consider the NSP pretrained classifier.
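As an illustration of querying this classifier (our sketch with the Hugging Face transformers library; the bert-base-uncased checkpoint and the convention that logit index 0 corresponds to the "is next" class come from that library, not from the paper):

```python
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()


def nsp_probability(utterance_a, utterance_b):
    """Probability (under the NSP head) that utterance_b follows utterance_a."""
    inputs = tokenizer(utterance_a, utterance_b, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Index 0 is the "is next sentence" class in the pretrained NSP head.
    return torch.softmax(logits, dim=-1)[0, 0].item()


print(nsp_probability("Wikis are good for documentation.",
                      "Yes, but they are not a communication tool."))
```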
2. Method
2.1. Datasets
2.1.1. Corpus of CSCL Chat Conversations
Our experiments were performed on a collection of 55 chat conversations (ChatLinks
dataset, available online at https://huggingface.co/datasets/readerbench/ChatLinks, ac-
cessed on 10 November 2021) held by Computer Science undergraduate students [15,25].
Participants had to discuss on software technologies that support collaborative work in
a business environment (e.g., blog, forum, chat, wiki). Each student had to uphold one
preferred technology in the first part of the conversation, introducing benefits or disad-
vantages for each CSCL technology, followed by a joint effort to define a custom solution
most suitable for their virtual company in the second part of the chat. The discussions
were conducted using ConcertChat [17], a software application which allows participants
to annotate the utterance they refer to, when writing a reply. The vision behind these inter-
actions was grounded in Stahl’s vision of group cognition [64] in which difficult problems
can be solved easier by multiple participants using a collaborative learning environment.
Two evaluations were considered. The first one relies on the exact matching between
two utterances, which checks whether the links are identical with the references manually
added by participants while discussing. The second approach considers in-turn matching
which checks whether the detected links belong to the same block of continuous utterances
written by the same participant, as defined in the manually annotated references. The auto-
mated approach computes similarity scores between each given contribution and multiple
previous utterances, within a pre-imposed window size. The highest matching score is
used to establish the semantic link. A conversation excerpt depicting an exact matching
between the reference and semantic link is shown in Table 1. An in-turn matching example
is shown in Table 2. In both cases, the emphasized text shows the utterance which denotes
the semantic link. The Ref ID column shows the explicit manual annotations added by
the participants.
Table 1. Fragments extracted from conversations showing exact matching (Semantic link is high-
lighted in bold).
Table 2. Fragments extracted from conversations showing in-turn matching (Semantic link is high-
lighted in bold).
107 Lucian They do not require any complicate client application, central me-
diation
108 Lucian Actually, all this arguments are pure technical
109 Lucian The single and best reason for which chats are the best way of
communication in this age of technology is that
110 Lucian Chat emulate the natural way in which people interact. By talking,
be argumenting ideas, by shares by natural speech
111 Lucian Hence,chat is the best way to transform this habit in a digital era.
112 Lucian We can start debating now? :D
113 111 Florin I would like to contradict you on some aspects
379 Alina No, curs.cs is an implementation of moodle
380 Alina Moodle is jus a platform
381 Alina You install it on a servere
382 Alina and use it
383 Alina Furthermore, populate it wih information.
384 Andreea and students are envolved too in development of moodle?
385 Alina It has the possibility of wikis, forums, blogs. I’m not sure with the
chat, though.
386 379 Stefan Yes that is right
The manually added links were subsequently used for determining accuracy scores for
different similarity metrics, using the previous strategies (i.e., exact and in-turn matching).
The 55 conversations from the corpus amount to 17,600 utterances, while 4500 reference
links were added by participants (i.e., about 29% of utterances have a corresponding
reference link). Out of the 55 total conversations, 11 of them were set aside and used as a
test set.
A previous study by Gutu et al. [25] showed that 82% of explicit links in the dataset
were covered by a distance window of 5 utterances; 95% of annotations were covered by
enlarging the window to 10 utterances, while a window of 20 utterances covered more
than 98% of annotated links. When considering time-frames, a 1 min window contained
only 61% of annotations, compared to the 2 min window which contains about 77% of
all annotated links. A wider time-frame of 3 min included about 93% of all links, while a
5 min window covered more than 97% of them. As our aim was to keep the majority of
links and to remove outliers, a 95% coverage was considered ideal. Smaller coverages were
included in our comparative experiments. Thus, distances of 5 and 10 utterances were
considered, while time-frames of 1, 2, and 3 min were used in the current experiments.
where:
• ur refers to the utterance for which the link is computed,
• u+ refers to the manually annotated utterance,
• u− is an incorrect utterance contained within the current window,
• sim(ur , u) refers to the semantic similarity score calculated by the MLP between the
two utterances representations,
• M is the desired margin among positive and negative samples.
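These definitions correspond to a margin-based ranking objective; the sketch below (our own, assuming the usual hinge formulation over positive and negative candidates, with an illustrative margin value) shows it in PyTorch:

```python
import torch


def max_margin_loss(sim_positive, sim_negative, margin=0.2):
    """Hinge loss pushing sim(u_r, u+) above sim(u_r, u-) by at least the margin M.

    sim_positive / sim_negative are similarity scores produced by the MLP for the
    annotated link and for an incorrect candidate from the same window.
    """
    return torch.clamp(margin - sim_positive + sim_negative, min=0).mean()


# Toy scores for a batch of three training triplets (made-up values).
loss = max_margin_loss(torch.tensor([0.8, 0.6, 0.7]),
                       torch.tensor([0.5, 0.65, 0.2]))
print(loss)  # tensor(0.0833)
```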
Furthermore, we enhance the lexical information with conversation-specific infor-
mation, including details regarding the chat structure, as well as semantic information.
The conversation-specific features are computed for each candidate contribution (for a link)
as follows: we check whether the contribution contains a question, or the candidate and
the link share the same author, while referring to the chat structure, we use the number
of in-between utterances and the time between any two given utterances. Two methods
for computing semantic information were considered. Given two utterances, the first
method computes the embedding of each utterance as the average over the embeddings
(e.g., pretrained word2vec, FastText and GloVe models) of all words from the given ut-
terance, followed by cosine similarity. The second method relies on BERT, namely the
Next Sentence Prediction classifier, which is used to compute the probability that the two
utterances follow one another.
For the reply detection task on the Linux IRC dataset, we follow Mehri et al. [60] and use the
same model architecture, features, and training methods, the only difference being that we
investigate the usage of string kernels and BERT-based features.
We consider two approaches for capturing semantic information. The first model
considers an RNN trained on the Ubuntu Dialogue Corpus, as proposed by Lowe et al. [67].
The purpose of this network is to model semantic relationships between utterances and
it is trained on a large dataset of chat conversations. We use a siamese Long Short-Term
Memory (LSTM) [59] network to model the probability of a message following a given
context. Both the context and the utterance are processed first using a word embedding
layer with pre-trained GloVe [33] embeddings, which are further fine-tuned. After the
embedding layer, the representations of context and utterance are processed by the LSTM
network. Let c and r be the final hidden representation of the context and of the utterance,
respectively. These representations, alongside a learned matrix M, are used to compute the
probability of a reply (see Equation (5)):
P(reply | c, r) = σ(c^T M r). (5)
The same training procedure introduced by Lowe et al. [67] is applied, namely mini-
mizing the cross-entropy of all labeled (context, contribution) pairs by considering a hidden
layer size of 300, an Adam optimizer with gradients clipped to 10, and a 1:1 ratio between
positive examples and negative examples (that are randomly sampled from the dataset).
In our implementation, a dropout layer was also added after the embedding layer and after
the LSTM layer for fine-tuning the embeddings, with the probability of dropping inputs
set to 0.2.
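A condensed sketch of such a dual encoder follows (ours, for illustration; loading the pretrained GloVe vectors, padding, and batching are omitted; the bilinear score σ(cᵀMr) follows the dual-encoder formulation of Lowe et al. [67]):

```python
import torch
import torch.nn as nn


class DualEncoder(nn.Module):
    """Siamese LSTM scoring P(reply | context) = sigmoid(c^T M r)."""

    def __init__(self, vocab_size, embed_dim=300, hidden=300, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.drop = nn.Dropout(dropout)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.M = nn.Parameter(torch.randn(hidden, hidden) * 0.01)

    def encode(self, token_ids):
        embedded = self.drop(self.embedding(token_ids))
        _, (h_n, _) = self.lstm(embedded)
        return self.drop(h_n[-1])             # final hidden state per sequence

    def forward(self, context_ids, reply_ids):
        c = self.encode(context_ids)          # (batch, hidden)
        r = self.encode(reply_ids)            # (batch, hidden)
        scores = (c @ self.M * r).sum(dim=1)  # c^T M r for each pair
        return torch.sigmoid(scores)


model = DualEncoder(vocab_size=10_000)
prob = model(torch.randint(0, 10_000, (2, 12)), torch.randint(0, 10_000, (2, 8)))
```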
The second method is more straightforward and it is based on BERT [48]. We use
a pretrained BERT-base model and we query the model for whether two utterances are
connected, using the Next Sentence Prediction classifier.
Furthermore, we employ string kernels as a feature extraction method for lexical
information. Given a pair of utterances, we compute their lexical similarity with three
string kernels (spectrum, presence and intersection) at different granularity (n-gram ranges):
1–2, 3–4, 5–6, 7–8 and 9–10. Thus, a lexical feature vector v ∈ R15 is computed for each pair
of messages.
The training methodology of Mehri et al. [60] was used: the Linux IRC reply dataset
was split into a set of positive examples (annotated replies) and a set of negative exam-
ples (the complement of the annotated replies). This leads to a very imbalanced dataset,
with most pairs being non-replies. One way to alleviate this problem is to only consider
pairs of messages that occur within a time frame of 129 s of one another [60]. Different classifiers
relying on the features described in Table 3 were trained using 10-fold cross-validation. Decision
trees, a simple Multi-Layer Perceptron (MLP) with a hidden size of 20, and random
forest [68] were considered in our experiments, each with their benefits and drawbacks.
3. Results
The following subsections provide the results on the two tasks, namely semantic links
and direct reply detection, arguing their strong resemblance in terms of both models and
features. Subsequently, we present a qualitative interpretation of the results.
Table 4. Results for semantic links detection (Exact matching accuracy (%)/In-turn matching accuracy (%)).
Window (Utterances) 5 10
Time (mins) 1 2 3 1 2 3
Path Length [25] 32.44/41.49 32.44/41.49 not reported 31.88/40.78 31.88/40.78 not reported
AP-BiLSTM [21] 32.95/34.53 32.39/35.89 33.97/37.58 33.86/35.10 28.89/31.82 24.49/28.32
Intersection kernel 31.40/34.59 33.87/39.58 33.58/40.01 31.71/34.78 32.24/37.66 29.47/35.24
Presence kernel 31.84/34.94 33.97/39.81 33.58/40.01 31.80/34.89 32.33/37.71 29.67/35.41
Spectrum kernel 31.21/34.34 33.45/39.12 33.17/39.49 31.39/34.46 31.56/36.72 28.75/34.26
BERT 40.07/41.99 43.45/47.07 44.47/48.42 40.41/42.33 41.53/45.15 38.15/38.60
NN using sk 35.21/36.90 35.55/39.39 35.77/39.95 35.55/37.02 34.08/37.47 30.24/33.74
NN using sk + conv 37.92/39.39 45.48/49.66 47.06/51.80 38.14/39.50 46.27/50.79 47.85/52.93
NN using sk+sem 36.45/38.14 36.90/40.47 36.00/40.29 36.68/38.14 35.10/38.26 31.26/34.76
NN using sk + sem + conv 37.02/38.60 46.38/50.00 48.08/52.25 37.24/38.71 47.29/51.46 49.09/53.83
NN using sk + BERT 37.35/38.71 39.16/42.21 39.61/43.00 37.24/38.60 37.13/40.18 33.52/36.90
NN using sk + BERT + conv 40.40/42.21 46.72/49.88 48.08/51.91 40.63/42.43 46.95/50.33 48.08/52.37
Note: sk—string kernels; conv—conversation-specific features, namely window and time (window—# of in-between utterances; time—
elapsed time between utterances); question and author (question—whether the utterance contains a question; author—if the utterance
shares the same author as the utterance containing the link); sem—semantic information. Bolded values represent the best results with and
without semantic information.
To integrate the model into a practical usage scenario, a decision threshold over the
predicted similarity scores was derived from the training folds as threshold = mean p +
meand · std p, where:
• mean p is the mean of the predictions on the training set,
• std p is the standard deviation of the predictions on the training set,
• meand is the mean of the distances between the mean prediction and the best threshold
found for each fold.
The final values are the following: mean p = 0.478, std p = 0.14, meand = 1.32, which
lead to a threshold = 0.66. We evaluate our approach on the test set with the threshold set
to 0.66 and obtain an F1 Score of 0.31 (for the positive class) and a 51.67% accuracy. In a
practical scenario, if the semantic link suggestion is incorrect, the user might just ignore the
predicted link.
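Under the thresholding rule given above (the mean prediction shifted by meand standard deviations), the reported value is reproduced by:

```python
# Reproducing the reported decision threshold from the listed statistics.
mean_p, std_p, mean_d = 0.478, 0.14, 1.32
threshold = mean_p + mean_d * std_p
print(round(threshold, 2))  # 0.66
```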
Table 5. F1 scores (positive class) for reply detection on Linux IRC reply dataset. C denotes Conver-
sation features, S denotes semantic information while L stands for lexical information. More details
about used features can be found in Table 3.
In addition, we identify which features were the most important for the random forest
model by considering Gini feature importance [68]. For the case in which we do not use
lexical information, the semantic information is the most important feature (Gini value of
0.40), followed by time difference (Gini value of 0.27), and space difference (Gini value of
0.14); see Figure 2.
Figure 2. Gini feature importance for random forest without string kernels.
When adding string kernels, the semantic information is still the most important
feature (Gini value of 0.16), followed by time and distance conversation features (Gini
value of 0.15 and 0.12, respectively), and by string kernels on ngrams of size 1–2 (Gini value
of 0.11); see Figure 3.
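These values are the impurity-based importances exposed by scikit-learn's random forest; a minimal sketch is given below (feature names and data are placeholders, not the actual dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder features: semantic score, time/space differences, same-author flag.
feature_names = ["semantic", "time_diff", "space_diff", "same_author"]
X = np.random.rand(200, len(feature_names))   # stand-in for the real feature matrix
y = np.random.randint(0, 2, size=200)         # reply / non-reply labels

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, importance in zip(feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.2f}")        # Gini (impurity) importance
```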
Figure 3. Gini feature importance for random forest with string kernels.
Figure 4. Gini feature importance for random forest with string kernels replacing semantic information.
120 Adrian so tell me why the chat could provide collaborative learning?
121 120 Maria for example if we have to work on a project and everybody has a part to do
we can discuss it in a conference
122 Adrian you can discuss the premises but not the whole project
123 120 Andreea well, you can have a certain amount of info changed
124 122 Maria you can discuss the main ideas and if someone encounters problems he can
ask for help and the team will try to help him
216 Dan are there private wikis?
217 Ana yes...there are wikis that need authentication...
218 216 Dan by private i understand that only certain users can modify their content
Our model also detects distant semantic links (see Table 7) and interleaved semantic
links (see Table 8). Table 9 presents two examples of complex interactions in which our
system correctly detects semantic links.
139 Lucian To CHAT with a teacher and with our colleagues which have knowledge
and share it with us.
140 Florin yes, but I do not agree that chat’s are the best way to do that
141 Lucian For example, we could read a book about NLP but we could learn much
more by CHATTING with a teacher of NLP
142 Claudia but that chat is based on some info that we have previously read
... ... ...
145 139 Sebi yes but the best way to share your knowledge is to make publicly with
other people who wants to learn something. In chats you can do this just
with the people who are online, in forums everybody can share it
62 Alexandru the blog supports itself on something our firm calls a “centeredcommu-
nity”...the owner of the blog is the center...interacting with thecenter
(artist, engineer, etc. ) is something very rewarding
63 Alexandru and i would like to underline once again the artistic side of the blog
64 Alexandru blog is ART
65 Raluca you can also share files in a chat conference, in realtime
66 62 Radu you’re right... blogs have their place in community interaction
173 Ionut as the wiki grow the maintenance becomes very hard, and it consumes a
lot of human resources
174 Bogdan many users manage that information, so the chores are distributed
175 174 Ionut this would be a pro for wikis
176 173 Bogdan yes, but the wiki grows along with the users that manage that wiki
186 Costin I mean can an admin take the rights to another admin even if the firstone
became admin after the second one?
187 Ionut who can also have the right to give admin permissions
188 Tatar wikis are good for documentation but is not a communication tool
189 188 Ionut so “Thumbs up!” for chat here!
190 186 Bogdan I think if some admin does not do their job, he could lose the status
However, the model is not perfect—in about half of the cases, it is unable to detect the
correct semantic link (e.g., the best accuracy is 49.09%). Nevertheless, we must consider
that the detection of correct links is a difficult task even for human readers due to complex
interactions between utterances (see examples in Table 10).
Table 10. First example of wrong semantic link prediction (explicit link/predicted link).
156 Delia It can also make it easier to communicate with larger groups
157 Delia where will be having this conversation if the chat would not exist?:P
158 Cristian it’s the easiest way to solve a problem indeed, but only if the person is available
159 Delia the availability is prerequisitive
160 Delia yup
161 Delia but who is not online ourdays?:)
162 156 Marian yes but what about when you have a problem none of your friends know
how to solve
288 Andreea if kids get used to it at an early age, they would have an easier time later
289 Razvan blog posts, chatting, video conferences are ok
290 Razvan I thinks it’s dirrect enough
291 Andreea maybe the teacher and the students could do something fun together, like a trip :)
292 288 Mihai yes,this is a very important thing, every teacher should promote elearning
In some cases, the model fails to capture semantic relationships due to the simplistic
way of extracting the semantic information. This shortcoming can be observed in the first
example of Table 11, as the model fails to capture the relatedness of the verbs “moderate”
and “administarte”, although in this case it can be attributed to the misspelling of the verb
administrate. Our model also fails to reason: in the example presented in the bottom half
of Table 11, it selects as the semantic link an utterance written by the same participant,
who refers to herself as “you”.
Table 11. Second example of wrong semantic link prediction (explicit link/predicted link).
242 Alex whenyou need to have a discution between multiple people and you need itstored
so other people would be able to read it, you should definetlyuse a forum
243 Luis We need someone to administarte that forum. so it will waste their time
244 Alex what other solution you have?
245 243 Cristi exacly, someone has to moderate the discussions
51 Alexandru and sometimes you do :)
52 Raluca you save a text file when you only chat or save a video file if you really
need it
53 Radu Theidea of community is the fact that most of the members should know
whatis going on. If you do not want others (ousiders) to see the inside
messages, than you can create a private forum and never give it’s IPaddress
or DNS name to anyone :)
54 51 Raluca i agree with you here
In other cases, utterances simply do not provide enough information (see Table 12).
More complex features might help overcome these limitations. Nevertheless, some limita-
tions are also due to the way the problem was formulated, as each utterance is analysed in-
dependently.
Table 12. Third example of wrong semantic link prediction (explicit link/predicted link).
400 Bogdan let us suppose that we use these technics in a serial mode:...so, every method
corresponds to a step in clasification...
401 Mihai yes, we can combine the powers of every techniques and come out with a
very versatile machine learning software
402 Bogdan and each step can get a result...
403 402 Mihai or we can choose a method depending on the nature of the problem, taking
into consideration that each one fits best on a few type of problems
4. Discussion
Answer selection techniques, in conjunction with string kernels, were evaluated as
a method for detecting semantic links in chat conversations. We used two datasets with
related tasks (i.e., detection of semantic links as a classification task or as a regression task)
to validate our approach.
The proposed neural model greatly improves upon the previous state-of-the-art for se-
mantic link detection, boosting the exact match accuracy from 32.44% (window/time frame:
5 utterances/1 min) to 47.85% (window/time frame: 10 utterances/3 min). The neural net-
work model outperforms previous methods for all combinations of considered frames. It is
important to note that the neural network models generalize very well—e.g., performance
is better for larger windows. This was not the case with any other methods presented in
the first part of Table 4, where we included several strong baselines from previous works.
The addition of semantic information to our model increased its performance, but not
by a large margin, especially when compared to the performance gained by adding con-
versation features. This highlights that an answer selection framework imposed upon the
semantic link detection task has some limitations. The largest observed gain for semantic
information was achieved on longer frames (e.g., window/time frame: 10 utterances/3 min,
improvement from 47.85% to 49.09%), which means that semantic information becomes
more helpful when discriminating between a larger number of candidates (larger window).
An interesting observation is that using BERT for extracting semantic information
(last part of Table 4) does not bring significant improvements. We believe this is the case
because the BERT pretrained model is trained on much longer and structurally different
sentences (e.g., Wikipedia texts versus chat messages).
The results on the Linux IRC reply dataset (Table 5) are compelling. The first im-
portant observation is that the usage of string kernels improves the performance for all
combinations of features (see the first two columns of Table 5 with any of the last three
columns of Table 5). Similar to the first set of experiments (on the CSCL chat corpus; see
Table 4), semantic information helps the model better capture direct links, but it is not
critical. Using string kernels to replace semantic information (see last column of Table 5)
proves to be very effective, obtaining better performance than the model without string
kernels, but with semantic information (first two columns of Table 5).
Based on two different datasets, two slightly different tasks (regression and classifi-
cation), and two different models, we observe the same patterns: string kernels are very
effective at the utterance level, while state-of-the-art semantic similarity models underperform
when used for utterance similarity. Besides higher accuracy, string kernels are also a lot
faster and, if used in conjunction with a neural network on top of them, achieve state-of-the-art
results with a small number of parameters. This allows the model to be trained very
fast, even on a CPU.
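To make this concrete, the following minimal Python sketch (not taken from the authors' released code) computes a normalized character p-spectrum kernel between two utterances; the paper's actual kernels and the value of p may differ, so the function below should be read as an illustration of the general technique rather than the exact implementation.

```python
from collections import Counter
import math

def spectrum_kernel(a: str, b: str, p: int = 3) -> float:
    """Normalized character p-spectrum kernel: shared p-gram counts, cosine-normalized."""
    def ngrams(s: str) -> Counter:
        s = s.lower()
        return Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ca, cb = ngrams(a), ngrams(b)
    dot = sum(ca[g] * cb[g] for g in ca if g in cb)
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

# Similarity between a candidate reply and an earlier utterance in the window:
print(spectrum_kernel("we need someone to moderate that forum",
                      "someone has to moderate the discussions"))
```

Because the kernel operates directly on characters, it needs no parsing or embedding lookup, which is why a small network stacked on such features remains cheap to train.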
5. Conclusions
Computer-Supported Collaborative Learning environments have seen increased
usage, especially when it comes to problem-solving tasks, and have been very useful in the con-
text of online activities imposed by the COVID-19 pandemic. For such purposes, chat
environments are usually seen as a suitable technology. In addition, chats can sometimes
incorporate a facility that allows participants to explicitly mark the utterance they are
referring to, when writing a reply. In conversations with a higher number of participants,
several discussion topics emerge and continue in parallel, which makes the conversation
hard to follow.
This paper proposes a supervised approach inspired by answer selection techniques
to solve the problem of detecting semantic links in chat conversations. A supervised
neural model integrating string kernels, semantic and conversation-specific features was
presented as an alternative to complex deep learning models, especially when smaller
datasets are available for training. The neural network learnt to combine the lexical and
semantic features together with other conversation-specific characteristics, like the distance
and time spent between two utterances. Results were compared to findings in the question
answering field and validated the proposed solution as suitable for detecting semantic
links in a small dataset, such as the collection of CSCL conversations. Our best model
achieves 49.09% accuracy for exact match and 53.58% for in-turn metric. String kernels
were successfully combined with semantic and conversation-specific information using
neural networks, as well as other classifiers such as decision trees and random forests.
State-of-the-art results (0.65 F1 score on the reply class) were also obtained using string
kernels for reply detection on the Linux IRC reply dataset.
String kernels provide a fast, versatile, and easy approach for analyzing chat conversa-
tions, and should be considered a central component of any automated analysis method
that involves chat conversations. While the neural network provided relevant results for the
explicit link detection task, the semantic information did not bring important additional
information to the network. The experiments on the reply detection task led to very
similar results. Nonetheless, an inherent limitation of our model stems from the answer
selection framework imposed on the semantic link detection task (i.e., not considering the
flow of conversation and addressing utterances independently).
The proposed method allows participants to more easily follow the flow of a discus-
sion thread within a conversation by disambiguating the conversation threads. As such,
we provide guidance while reading entangled discussions with multiple intertwined
discussion threads. This is of great support for both educational and business-
related tasks. Moreover, chat participants can obtain an overview of their involvement by
having access to the inter-dependencies between their utterances and the corresponding dis-
cussion threads. Another facility might aim at limiting the mixture of too many discussion
topics by providing guidance to focus only on the topics of interest. With accuracy scores
slightly over 50%, the proposed method may still require human confirmation or adjust-
ment, if the detected link is not suitable. Finally, the automated detection of semantic links
can be used to model the flow of the conversation by using a graph-based approach, further
supporting the understanding and analysis of the conversation’s rhetorical structure.
Our method does not take into account the context in which the replies occur, which
might prove to be important for detecting semantic links. Further experiments target
gathering an extended corpus of conversations to further advance the chat understand-
ing subdomain.
Author Contributions: Conceptualization, T.R., M.D. and S.T.-M.; Data curation, M.M. and G.G.-R.;
Formal analysis, M.M.; Funding acquisition, T.R. and M.D.; Investigation, M.M.; Methodology, M.M.,
S.R. and G.G.-R.; Project administration, T.R. and M.D.; Resources, G.G.-R. and S.T.-M.; Software,
M.M.; Supervision, T.R., M.D. and S.T.-M.; Validation, M.M., S.R. and G.G.-R.; Visualization, M.M.;
Writing—original draft, M.M.; Writing—review & editing, S.R., T.R., M.D. and S.T.-M. All authors
have read and agreed to the published version of the manuscript.
Funding: This research was supported by a grant of the Romanian National Authority for Scientific
Research and Innovation, CNCS—UEFISCDI, project number TE 70 PN-III-P1-1.1-TE-2019-2209,
“ATES—Automated Text Evaluation and Simplification” and POC-2015 P39-287 IAVPLN.
Institutional Review Board Statement: The study was conducted according to the guidelines of the
Declaration of Helsinki. Both datasets (ChatLinks and Reply Annotations) used in this work are from
external sources and are available freely online.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the
study.
Data Availability Statement: In this work, we use two public corpora available at: https://huggingface.
co/datasets/readerbench/ChatLinks (accessed on 10 November 2021) and http://shikib.com/td_
annotations (accessed on 10 November 2021).
Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design
of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript,
or in the decision to publish the results.
mathematics
Article
Definition Extraction from Generic and Mathematical Domains
with Deep Ensemble Learning
Natalia Vanetik *,† and Marina Litvak *,†
Software Engineering Department, Shamoon College of Engineering, Bialik 56, Beer Sheva 8434231, Israel
* Correspondence: [email protected] (N.V.); [email protected] (M.L.)
† These authors contributed equally to this work.
Abstract: Definitions are extremely important for efficient learning of new materials. In particular,
mathematical definitions are necessary for understanding mathematics-related areas. Automated
extraction of definitions could be very useful for automatically indexing educational materials, building
taxonomies of relevant concepts, and more. For definitions that are contained within a single sentence,
this problem can be viewed as a binary classification of sentences into definitions and non-definitions.
In this paper, we focus on automatic detection of one-sentence definitions in mathematical and general
texts. We experiment with different classification models arranged in an ensemble and applied to a
sentence representation containing syntactic and semantic information, to classify sentences. Our
ensemble model is applied to the data adjusted with oversampling. Our experiments demonstrate the
superiority of our approach over state-of-the-art methods in both general and mathematical domains.
Definition 2. The magnitude of a number, also called its absolute value, is its distance from zero.
Naturally, we expect to find mathematical definitions in mathematical articles, which
frequently use formulas and notations in both definitions and surrounding text. The
number of words in mathematical text is smaller than in standard text due to the formulas
that are used to express the former. The mere presence of formulas is not a good indicator
of a definition sentence because the surrounding sentences may also use notations and
formulas. As an example of such text, Definition 3, below, contains a definition from
Wolfram MathWorld. Only the first sentence in this text is considered a definition sentence,
even though other sentences also contain mathematical notations.
Definition 3. A finite field is a field with a finite field order (i.e., number of elements), also called
a Galois field. The order of a finite field is always a prime or a power of a prime. For each prime
power, there exists exactly one (with the usual caveat that "exactly one" means "exactly one up to
an isomorphism") finite field GF(p^n), often written as F_{p^n} in current usage.
Definition extraction (DE) is a challenging and popular task today, as shown by a recent
research call at SemEval-2020 (https://competitions.codalab.org/competitions/20900,
accessed on 1 September 2021).
Multiple current methods for automatic DE view it as a binary classification task, where
a sentence is classified as a definition or a non-definition. A supervised learning process is
usually applied for this task, employing feature engineering for sentence representation.
However, all recently published works study generic definitions, without evaluation of
their methods on mathematical texts.
In this paper, we describe a supervised learning method for automatic DE from both
generic and mathematical texts. Our method applies ensemble learning to adjusted-by-
oversampling data, where 12 deep neural network-based models are trained on a dataset
with labeled definitions and then applied on test sentences. The final label of a sentence is
decided by the ensemble voting.
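As an illustration of the class-balancing step mentioned above, the Python sketch below shows plain random oversampling of the minority class; the exact resampling procedure, ratios, and random seed used by the authors are not given in this excerpt, so all of these choices are assumptions.

```python
import random

def oversample(sentences, labels, seed=13):
    """Duplicate random minority-class sentences until the two classes are balanced."""
    random.seed(seed)
    pos = [s for s, y in zip(sentences, labels) if y == 1]
    neg = [s for s, y in zip(sentences, labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    minority_label = 1 if minority is pos else 0
    extra = [random.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = [(s, 1) for s in pos] + [(s, 0) for s in neg] + [(s, minority_label) for s in extra]
    random.shuffle(balanced)
    return balanced

# Toy usage: 2 definitions vs. 5 non-definitions become 5 vs. 5 after oversampling.
data = oversample(["d1", "d2", "n1", "n2", "n3", "n4", "n5"], [1, 1, 0, 0, 0, 0, 0])
print(sum(y for _, y in data), len(data))  # 5 10
```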
Our method is evaluated on four different corpora: three are for generic DE and one is an
annotated corpus of mathematical definitions.
The main contributions of this paper are (1) the introduction of a new corpus of
mathematical texts with annotated sentences, (2) an evaluation of an ensemble learning
model for the DE task, (3) an evaluation of the introduced ensemble learning model on
both the general and the mathematical domains, and (4) cross-domain experiments. We performed
extensive experiments with different ensemble models on four datasets, including the
introduced one. Our experiments demonstrate the superiority of our model on three
out of four datasets, belonging to two different domains. The paper is organized as follows.
Section 2 contains a survey of up-to-date related work. Section 3 describes our approach.
Section 4 provides the evaluation results and their analysis. Finally, Section 5 contains
our conclusions.
2. Related Work
Definition extraction has been a popular topic in NLP research for more than a
decade [2], and it remains a challenging and popular task today, as a recent research call
at SemEval-2020 shows. Prior work in the field of DE can be divided into three main
categories: (1) rule-based methods, (2) machine-learning methods relying on manual
feature engineering, and (3) methods that use deep learning techniques.
Early works about DE from text documents belong to the first category. These works rely
mainly on manually crafted rules based on linguistic parameters. Klavans and Muresan [3]
presented the DEFINDER, a rule-based system that mines consumer-oriented full text
articles to extract definitions and the terms they define; the system is evaluated on
definitions from on-line dictionaries such as the UMLS Metathesaurus [4]. Xu et al. [2]
used various linguistic tools to extract kernel facts for the definitional question-answering
task in Text REtrieval Conference (TREC) 2003. Malaise et al. [5] used semantic relations to
mine defining expressions in domain-specific corpora, thus detecting semantic relations
between the main terms in definitions. This work is evaluated on corpora from fields of
anthropology and dietetics. Saggion and Gaizauskas [6], Saggion [7] employed analysis of
on-line sources to find lists of relevant secondary terms that frequently occur together with
the definiendum in definition-bearing passages. Storrer and Wellinghoff [8] proposed a
system that automatically detects and annotates definitions for technical terms in German
text corpora. Their approach focuses on verbs that typically appear in definitions by
specifying search patterns based on the valency frames of definitor verbs. Borg et al. [9]
extracted definitions from nontechnical texts using genetic programming to learn the typical
linguistic forms of definitions and then using a genetic algorithm to learn the relative
importance of these forms. Most of these methods suffer from both low recall and precision
(below 70%), because definition sentences occur in highly variable and noisy syntactic
structures.
The second category of DE algorithms relies on semi-supervised and supervised
machine learning that use semantic and other features to extract definitions. This approach
generates DE rules automatically but relies on feature engineering to do so. Fahmi and
Bouma [10] presented an approach to learning concept definitions from fully parsed text
with a maximum entropy classifier incorporating various syntactic features; they tested
this approach on a subcorpus of the Dutch version of Wikipedia. In [11], a pattern-based
glossary candidate detector, which is capable of extracting definitions in eight languages,
was presented. Westerhout [12] described a combined approach that first filters corpus
with a definition-oriented grammar, and then applies machine learning to improve the
results obtained with the grammar. The proposed algorithm was evaluated on a collection
of Dutch texts about computing and e-learning. Navigli and Velardi [13] used Word-Class
Lattices (WCLs), a generalization of word lattices, to model textual definitions. The authors
introduced a new dataset called WCL that was used for the experiments. They achieved a
75.23% F1 score on this dataset. Reiplinger et al. [14] compared lexico-syntactic pattern
bootstrapping and deep analysis. The manual rating experiment suggested that the concept
of definition quality in a specialized domain is largely subjective, with a 0.65 agreement
score between raters. The DefMiner system, proposed in [15], used Conditional Random
Fields (CRF) to predict the function of a word and to determine whether this word is a
part of a definition. The system was evaluated on a W00 dataset [15], which is a manually
annotated subset of ACL Anthology Reference Corpus (ACL ARC) ontology. Boella
and Di Caro [16] proposed a technique that only uses syntactic dependencies between
terms extracted with a syntactic parser and then transforms syntactic contexts to abstract
representations to use a Support Vector Machine (SVM). Anke et al. [17] proposed a weakly
supervised bootstrapping approach for identifying textual definitions with higher linguistic
variability. Anke and Saggion [18] presented a supervised approach to DE in which only
syntactic features derived from dependency relations are used.
Algorithms in the third category use Deep Learning (DL) techniques for DE, often
incorporating syntactic features into the network structure. Li et al. [19] used Long
Short-Term Memory (LSTM) and word vectors to identify definitions and then tested this
approach on English and Chinese texts. Their method achieved a 91.2% F-measure on
the WCL dataset. Anke and Schockaert [20] combined Convolutional Neural Network
(CNN) and LSTM, based on syntactic features and word vector representation of sentences.
Their experiments showed the best F1 score (94.2%) on the WCL dataset for CNN and the
best F1 score (57.4%) on the W00 dataset for the CNN and Bidirectional LSTM (BLSTM)
combination, both with syntactically enriched sentence representation. Word embeddings,
when used as the input representation, have been shown to boost performance in
many NLP tasks, due to their ability to encode semantics. We believe that the choice to use
word vectors as the input representation in many DE works was motivated by their success in
NLP-related classification tasks.
We use the approach of [20] as a starting point and as a baseline for our method. We
further extend this work by (1) additional syntactic knowledge in a sentence representation
model, (2) testing additional network architectures, (3) combining 12 configurations (that
were the result of different input representations and architectures) in a joint ensemble
model, and (4) evaluation of the proposed methodology on a new dataset of mathematical
texts. As previously shown in our and others’ works [1,20], dependency parsing can add
valuable features to the sentence representation in the DE task, including mathematical DE.
The same works showed that a standard convolutional layer can be effectively applied to
automatically extract the most significant features from the extended representation model,
which improves accuracy for the classification task. Word embedding matrices enhanced
with dependency information naturally call for CNN due to their size and CNN’s ability
to decrease dimensionality swiftly. On the other hand, sentences are sequences for which
LSTM is naturally suitable. In [1], we explored how the order of the CNN and LSTM layers
and the input representation affect the results. To do that, we evaluated and compared
12 configurations (see Section 3.3). As a result, we concluded that CNN
and its combination with LSTM, applied on a syntactically enriched input representation,
achieved the best performance.
3. Method
This section describes our method, including representation of input sentences,
individual composition models, and their combination through ensemble learning.
2. (ml)—an extension of (m) with one-hot encoding of dependency labels between pairs
of words [20]; formally, it includes the average of word vectors of dependency words,
and dependency label representations as follows:
ml = S_{n×k} ◦ [r_ij^avg ◦ dep_ij]_ij, where r_ij^avg := (w_i + w_j)/2 is the average of the
word vectors of the two words connected by a dependency arc, and dep_ij is the one-hot
encoding of the arc's label;
3. (mld)—a further extension that keeps both word vectors of a dependency pair, the
one-hot dependency label, and the dependency depth:
mld = [w_i ◦ w_j ◦ dep_ij ◦ depth_ij]_ij
Figure 1. Dependency representation for input configurations (m), (ml) and (mld): (m) uses the
averaged vector r_ij^avg := (w_i + w_j)/2 (300 dimensions); (ml) appends the one-hot dependency
label dep_ij (300 + 46 dimensions); (mld) uses w_i ◦ w_j ◦ dep_ij ◦ depth_ij (300 + 300 + 46 + 1 dimensions).
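The sketch below illustrates how an (ml)-style row could be assembled from word vectors and dependency arcs; the dimensions (300-dimensional embeddings, 46 dependency label types) follow Figure 1, while the function and variable names are hypothetical and not taken from the authors' code.

```python
import numpy as np

EMB_DIM, NUM_LABELS = 300, 46   # dimensions taken from Figure 1; 46 label types assumed

def one_hot(label_id: int, size: int = NUM_LABELS) -> np.ndarray:
    v = np.zeros(size)
    v[label_id] = 1.0
    return v

def ml_rows(arcs, word_vectors, label_ids):
    """Build one (ml)-style row per dependency arc (i, j): the average of the two
    word vectors concatenated with a one-hot encoding of the arc's label."""
    rows = []
    for (i, j), label in zip(arcs, label_ids):
        r_avg = 0.5 * (word_vectors[i] + word_vectors[j])
        rows.append(np.concatenate([r_avg, one_hot(label)]))
    return np.stack(rows) if rows else np.zeros((0, EMB_DIM + NUM_LABELS))

# Toy usage with random vectors standing in for fastText embeddings:
vecs = np.random.rand(5, EMB_DIM)                 # 5 words in a sentence
arcs, labels = [(0, 1), (1, 3)], [7, 12]          # hypothetical parse arcs and label ids
print(ml_rows(arcs, vecs, labels).shape)          # (2, 346)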
The intuition behind using mixed models is supported by known benefits of their
composition layers—CNN can automatically extract features, while LSTM is a classification
model that is context-aware. The experiment performed in [1] aimed to examine which
order of layers is beneficial for the DE task: first extracting features from the original input
and then feeding them to the context-aware classifier, or first calculating hidden states with
context-aware LSTM gates and then feeding them into a CNN classifier for feature extraction
before the classification layer.
before the classification layer. The results demonstrated the superiority of models with a
CNN layer, which can be explained by the ability of CNN to learn features and reduce the
number of free parameters in a high-dimensional sentence representation, allowing the
network to be more accurate with fewer parameters. Due to a high-dimensional input in
our task (also in context-aware representation, produced by LSTM gates), this characteristic
of CNN appears to be very helpful.
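A minimal Keras sketch of a CNN followed by an LSTM for binary sentence classification is given below; the filter count, kernel size, and pooling are placeholder values (the paper's exact hyperparameters appear in Table 7, which is not reproduced here), whereas the batch size of 32 and the 10 epochs match the training setup described in Section 4.1.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn_lstm(seq_len: int, feat_dim: int) -> keras.Model:
    """A minimal CNN -> LSTM composition model for binary sentence classification."""
    inputs = keras.Input(shape=(seq_len, feat_dim))
    x = layers.Conv1D(filters=100, kernel_size=3, activation="relu")(inputs)  # feature extraction
    x = layers.MaxPooling1D(pool_size=2)(x)                                   # reduce dimensionality
    x = layers.LSTM(100)(x)                                                   # context-aware sequence layer
    outputs = layers.Dense(1, activation="sigmoid")(x)                        # definition vs. non-definition
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# model = build_cnn_lstm(seq_len=50, feat_dim=346)   # e.g., (ml) rows padded to 50 arcs
# model.fit(X_train, y_train, batch_size=32, epochs=10, validation_data=(X_val, y_val))
```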
To see the advantage of our ensemble approach, we also evaluated every one of the
12 neural models individually on the data after oversampling. The pipeline of individual
model evaluation is depicted in Figure 5.
4. Experiments
Our experiments aim to test our hypothesis about the superiority of the ensemble of
all 12 composition models over single models and SOTA baselines.
Based on the feature analysis described in Section 4.8, we decided to employ all 12
configurations as composition models and, based on the observations from our previous
work [1], we use pretrained fastText (further denoted by FT) word embeddings in all of them.
Tests were performed on a cloud server with 32 GB of RAM, 150 GB of PAGE memory,
an Intel Core I7-7500U 2.70 GHz CPU, and two NVIDIA GK210GL GPUs.
4.1. Tools
The models were implemented with help of the following tools: (1) Stanford CoreNLP
wrapper [27] for Python (tokenization, sentence boundary detection, and dependency
parsing), (2) Keras [28] with Tensorflow [29] as a back-end (NN models), (3) fastText vectors
pretrained on English webcrawl and Wikipedia [26], (4) Scikit-Learn [30] (evaluation with
F1, recall, and precision metrics), (5) fine-tunable BERT python package [24] available at
https://github.com/strongio/keras-elmo (accessed on 1 September 2021), and (6) WEKA
software [31]. All neural models were trained with batch size 32 and 10 epochs.
4.2. Datasets
In our work we use the following four datasets–DEFT, W00, WCL, and WFMALL–that
are described below. The dataset domain, number of sentences for each class, majority vote
values, total number of words, and number of words with non-zero fastText (denoted by
FT) word vectors are given in Table 1.
Table 1. Dataset statistics: domain, number of definition and non-definition sentences, majority
class ratio, total number of words, and number of words covered by fastText (FT) vectors.

Dataset     Domain     Definitions     Non-Definitions     Majority     Words      Covered by FT
WCL         General    1871            2847                0.603        21,297     16,645
DEFT        General    562             1156                0.673        7644       7350
W00         General    731             1454                0.665        8261       7003
WFMALL      Math       1934            4206                0.685        13,138     8238
WCL contains the following annotations: (1) the DEFINIENDUM field (DF), referring
to the word being defined and its modifiers, (2) the DEFINITOR field (VF), referring to the
verb phrase used to introduce a definition, (3) the DEFINIENS field (GF) which includes
the genus phrase, and (4) the REST field (RF), which indicates all additional sentence
parts. According to the dataset description, existence of the first three fields indicates that a
sentence is a definition.
In monocots, petals usually number three or multiples of three; in dicots, the number of
petals is four or five, or multiples of four and five.
(20% of the original data size). In this setting, we have a test set size identical to that of
the baselines, which used a 75% training, 5% validation, and 20% test set split.
For the consistency of the experiment, it was important to us to keep the same training
and validation sets for individual models, whether as stand-alone or as composition
models in an ensemble. Additionally, all evaluated and compared models, including
ensemble, were evaluated on the same test set. Because ensemble models were trained
using traditional machine-learning algorithms, no validation set was needed for them.
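A sketch of this split, assuming scikit-learn's train_test_split and an arbitrary seed, is shown below; the 75/5/20 proportions come from the text, while the seed and variable names are illustrative.

```python
from sklearn.model_selection import train_test_split

examples = [(f"sentence {i}", i % 2) for i in range(1000)]   # placeholder labeled sentences

# 75% train / 5% validation / 20% test; the same seed is reused for every model.
train_val, test = train_test_split(examples, test_size=0.20, random_state=13, shuffle=True)
train, val = train_test_split(train_val, test_size=0.0625, random_state=13)  # 0.0625 * 0.80 = 0.05
print(len(train), len(val), len(test))  # 750 50 200
```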
4.5. Baselines
We compared our results with five baselines:
• DefMiner [15], which uses Conditional Random Fields to detect definition words;
• BERT [23,24], fine-tuned on the training subset of every dataset for the task of
sentence classification;
• CNN_LSTMml, proposed in [20];
• CNN_LSTMm, which is the top-ranked composition model on the adjusted WFMALL
dataset;
• CNN_LSTMmld, which is the top-ranked composition model on the adjusted W00
dataset.
We applied two supervised models for the ensemble voting: Logistic Regression (LR)
and Random Forest (RF). The results reported here are the average over 10 runs, with
random reshuffling applied to the dataset each time (We did not apply a standard 10-fold
cross validation due to a non-standard proportion between a training and a test dataset.
Additionally, 10-fold cross validation was not applied on individual models, as it is not
a standard evaluation technique for deep NNs.). We also applied the majority voting,
denoted as Ensemble majority (Our code will be released once the paper is accepted).
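The sketch below illustrates the three voting schemes on top of the 12 composition models' predictions, assuming a matrix with one column per model; it is shown with scikit-learn for illustration (the authors list both Scikit-Learn and WEKA among their tools), and the hyperparameters are defaults chosen for readability rather than the values used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

def train_voters(P_train: np.ndarray, y_train: np.ndarray):
    """P_train holds one column per composition model (12 columns of 0/1 labels or scores)."""
    lr = LogisticRegression(max_iter=1000).fit(P_train, y_train)
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(P_train, y_train)
    return lr, rf

def majority_vote(P: np.ndarray) -> np.ndarray:
    """Predict 1 when more than half of the individual models predict 1."""
    return (P.sum(axis=1) > P.shape[1] / 2).astype(int)

# Toy usage: 6 sentences scored by 12 models, with dummy gold labels.
P = np.random.randint(0, 2, size=(6, 12))
y = np.array([0, 1, 0, 1, 1, 0])
lr, rf = train_voters(P, y)
print(lr.predict(P), rf.predict(P), majority_vote(P))
```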
For the ensemble evaluation, we trained all 12 models and the ensemble classifier using
the same pipeline as in Section 3.3 on the training set (denoted as Dataset X in Figure 7). Then,
these 12 models produce labels for both 5% of Dataset X (the 20% of its test set that
is used for training the ensemble weights) and 20% of Dataset Y (its entire test set). The
ensemble model is trained on the 5% of Dataset X and applied to the 20% of Dataset Y,
using the labels produced by the 12 trained individual models.
To make this process fully compatible with the in-domain testing procedure (depicted in
Figures 4 and 5), we used the same dataset splits both for evaluation of individual models
(see Figure 6) and for evaluation of ensemble models (see Figure 7).
As we can see from Table 6, all cases where the training set comes from one domain
and the test set is from another domain produce significantly lower accuracy than that
reported in Tables 2–5. This is a testament to the fact that the general definition domain and
the mathematical domain are quite different.
For instance, increasing the number of training epochs above 15 did not improve the scores at
all. The final parameters we used for our neural models appear in Table 7.
Table 8. Information gain rankings for individual models within the ensemble model.
4. Partial definition:
A center X is the triangle centroid of its own pedal triangle iff it is the symmedian point.
This sentence was annotated as non-definition, because it does not define the symme-
dian point.
5. Description:
The cyclide is a quartic surface, and the lines of curvature on a cyclide are all straight
lines or circular arcs.
Most misclassified definitions (false negatives) can be characterized by an atypical grammatical
structure. Examples of such sentences can be seen below:
Once one countable set S is given, any other set which can be put into a one-to-one cor-
respondence with S is also countable.
The word cissoid means “ivy-shaped”.
A bijective map between two metric spaces that preserves distances,
i.e., d(f(x), f(y)) = d(x, y), where f is the map and d(a, b) is the distance function.
We propose to deal with some of the identified error sources as follows. Partial
definitions can probably be discarded by applying part-of-speech tagging and pronoun
detection. Coreference resolution (CR) can be used to identify the referred
mathematical entity in a text. Additionally, the partial definitions problem should be
resolved by reformulating the DE task as multi-sentence DE. Formulations and notations
can probably be discarded by measuring the ratio between mathematical symbolism and
regular text in a sentence, as sketched below. Sentences providing alternative naming for mathematical objects
can be discarded if we are able to detect the true definition and then select it from multiple
candidates. This can also probably be resolved with the help of such types of CR as split
antecedents and coreferring noun phrases.
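As a rough illustration of the proposed symbolism-ratio filter, the sketch below counts the share of digits and operator-like characters in a sentence; the character classes and the 0.4 threshold are assumptions, since the paper only outlines this heuristic as future work.

```python
def math_symbol_ratio(sentence: str) -> float:
    """Fraction of non-space characters that look like mathematical symbolism."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 0.0
    mathish = [c for c in chars if c.isdigit() or c in "+-*/=<>^_(){}[]|"]
    return len(mathish) / len(chars)

# Hypothetical filter: drop sentences dominated by notation before classification.
sentences = ["GF(p^n) = {0, 1, ..., p^n - 1}", "The order of a finite field is a prime power."]
kept = [s for s in sentences if math_symbol_ratio(s) < 0.4]
print(kept)
```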
4.11. Discussion
As can be seen from all four experiments, the ensemble outperforms individual models,
despite the latter being trained on more data. This outcome definitely supports the superiority
of the ensemble approach for both domains.
It is worth noting that BERT did not perform well on our task. We explain this by
the difference between the general domain of its training data and our application domain of
definitions, as well as by the lack of syntactic information in its input representation.
The scores of individual models confirm again that syntactic information embedded
into a sentence representation usually delivers better performance in both domains.
As our cross-domain evaluation results show, the general definition domain and the mathem-
atical domain are quite different and, therefore, cross-domain transfer learning performs
significantly worse than traditional single-domain learning.
5. Conclusions
In this paper, we introduce a new approach for DE, using an ensemble of deep neural
networks. Because it is a supervised approach, we adjust the class distribution of our
datasets with oversampling. We evaluate this approach on datasets from general and
mathematical domains. Our experiments on four datasets demonstrate superiority of
ensemble voting over multiple state-of-the-art methods.
In the future, we intend to adapt our methodology for multi-sentence definition extraction.
Author Contributions: Conceptualization, N.V. and M.L.; methodology, N.V. and M.L.; software,
N.V.; validation, N.V. and M.L.; formal analysis, investigation, resources, N.V. and M.L.; writing—
original draft preparation, N.V. and M.L.; writing—review and editing, N.V. and M.L. All authors
have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: The WFMALL dataset is freely available for download from https:
//github.com/NataliaVanetik1/wfmall, accessed on 1 September 2021.
Acknowledgments: The authors express their deep gratitude to Guy Shilon and Lior Reznik for their
help with server configuration.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript, in the order of their appearance:
DE Definition Extraction
NLP Natural Language Processing
WCL Word-Class Lattice
DL Deep Learning
CRF Conditional Random Fields
CNN Convolutional Neural Network
LSTM Long Short-Term Memory
BLSTM Bidirectional LSTM
SOTA State of the Art
NN Neural Network
FT fastText word vectors
LR Logistic Regression
RF Random Forest
References
1. Vanetik, N.; Litvak, M.; Shevchuk, S.; Reznik, L. Automated discovery of mathematical definitions in text. In Proceedings of the
12th Language Resources and Evaluation Conference, Marseille, France, 13–15 May 2020; pp. 2086–2094.
2. Xu, J.; Licuanan, A.; Weischedel, R.M. TREC 2003 QA at BBN: Answering Definitional Questions; TREC: Gaithersburg, MD, USA,
2003; pp. 98–106.
3. Klavans, J.L.; Muresan, S. Evaluation of the DEFINDER system for fully automatic glossary construction. In Proceedings of the
AMIA Symposium, American Medical Informatics Association, Washington, DC, USA, 3–7 November 2001; p. 324.
4. Schuyler, P.L.; Hole, W.T.; Tuttle, M.S.; Sherertz, D.D. The UMLS Metathesaurus: Representing different views of biomedical
concepts. Bull. Med. Libr. Assoc. 1993, 81, 217. [PubMed]
5. Malaisé, V.; Zweigenbaum, P.; Bachimont, B. Detecting semantic relations between terms in definitions. In Proceedings of
CompuTerm 2004: 3rd International Workshop on Computational Terminology, Geneva, Switzerland, 29 August 2004.
6. Saggion, H.; Gaizauskas, R.J. Mining On-line Sources for Definition Knowledge. In Proceedings of the International FLAIRS
Conference, Miami Beach, FL, USA, 12–14 May 2004; pp. 61–66.
7. Saggion, H. Identifying Definitions in Text Collections for Question Answering. In Proceedings of the International Conference
on Language Resources and Evaluation, LREC, Lisbon, Portugal, 26–28 May 2004.
8. Storrer, A.; Wellinghoff, S. Automated detection and annotation of term definitions in German text corpora. In Proceedings of the
International Conference on Language Resources and Evaluation, LREC, Genoa, Italy, 24–26 May 2006; Volume 2006.
9. Borg, C.; Rosner, M.; Pace, G. Evolutionary algorithms for definition extraction. In Proceedings of the 1st Workshop on Definition
Extraction. Association for Computational Linguistics, Borovets, Bulgaria, 18 September 2009; pp. 26–32.
10. Fahmi, I.; Bouma, G. Learning to identify definitions using syntactic features. In Proceedings of the Workshop on Learning
Structured Information in Natural Language Applications, Trento, Italy, 3 April 2006.
11. Westerhout, E.; Monachesi, P.; Westerhout, E. Combining pattern-based and machine learning methods to detect definitions for
elearning purposes. In Proceedings of the RANLP 2007 Workshop “Natural Language Processing and Knowledge Representation
for eLearning Environments”, Borovets, Bulgaria, 27–29 September 2007.
12. Westerhout, E. Definition extraction using linguistic and structural features. In Proceedings of the 1st Workshop on Definition
Extraction. Association for Computational Linguistics, Borovets, Bulgaria, 18 September 2009; pp. 61–67.
13. Navigli, R.; Velardi, P. Learning word-class lattices for definition and hypernym extraction. In Proceedings of the 48th Annual
Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Uppsala, Sweden, 11–16
July 2010; pp. 1318–1327.
14. Reiplinger, M.; Schäfer, U.; Wolska, M. Extracting glossary sentences from scholarly articles: A comparative evaluation of pattern
bootstrapping and deep analysis. In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries.
Association for Computational Linguistics, Jeju Island, Korea, 10 July 2012; pp. 55–65.
15. Jin, Y.; Kan, M.Y.; Ng, J.P.; He, X. Mining scientific terms and their definitions: A study of the ACL anthology. In Proceedings of
the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 780–790.
16. Boella, G.; Di Caro, L. Extracting definitions and hypernym relations relying on syntactic dependencies and support vector
machines. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers),
Sofia, Bulgaria, 4–9 August 2013; Volume 2, pp. 532–537.
17. Anke, L.E.; Saggion, H.; Ronzano, F. Weakly supervised definition extraction. In Proceedings of the International Conference
Recent Advances in Natural Language Processing, Hissar, Bulgaria, 5–11 September 2015; pp. 176–185.
18. Anke, L.E.; Saggion, H. Applying dependency relations to definition extraction. In Proceedings of the International Conference
on Applications of Natural Language to Data Bases/Information Systems, Montpellier, France, 18–20 June 2014; Springer:
Berlin/Heidelberg, Germany, 2014; pp. 63–74.
19. Li, S.; Xu, B.; Chung, T.L. Definition Extraction with LSTM Recurrent Neural Networks. In Chinese Computational Linguistics and
Natural Language Processing Based on Naturally Annotated Big Data; Springer: Berlin/Heidelberg, Germany, 2016; pp. 177–189.
20. Anke, L.E.; Schockaert, S. Syntactically Aware Neural Architectures for Definition Extraction. In Proceedings of the 2018
Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,
Volume 2 (Short Papers), New Orleans, LA, USA, 2–4 June 2018; Volume 2, pp. 378–385.
21. Avram, A.M.; Cercel, D.C.; Chiru, C. UPB at SemEval-2020 Task 6: Pretrained Language Models for Definition Extraction. In
Proceedings of the Fourteenth Workshop on Semantic Evaluation, Barcelona, Spain, 12–13 December 2020; pp. 737–745.
22. Xie, S.; Ma, J.; Yang, H.; Lianxin, J.; Yang, M.; Shen, J. UNIXLONG at SemEval-2020 Task 6: A Joint Model for Definition Extraction.
In Proceedings of the Fourteenth Workshop on Semantic Evaluation, Barcelona, Spain, 12–13 December 2020; pp. 730–736.
23. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding.
In Proceedings of the NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
24. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations.
In Proceedings of the NAACL, New Orleans, LA, USA, 1–6 June 2018.
25. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their
compositionality. In Proceedings of the Advances in Neural Information Processing Systems, Harrahs and Harveys, Lake Tahoe,
CA, USA, 5–10 December 2013; pp. 3111–3119.
26. Grave, E.; Bojanowski, P.; Gupta, P.; Joulin, A.; Mikolov, T. FastText Word Vectors. Available online: https://fasttext.cc/docs/en/
crawl-vectors.html (accessed on 1 January 2018).
27. Manning, C.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S.; McClosky, D. Python Interface to CoreNLP Using a Bidirectional
Server-Client Interface. Available online: https://github.com/stanfordnlp/python-stanford-corenlp (accessed on 1 January 2019).
28. Chollet, F. Keras. Available online: https://keras.io (accessed on 1 January 2015).
29. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow:
Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: tensorflow.org. (accessed on 1 January 2015).
30. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et
al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
31. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I.H. The WEKA data mining software: An update. ACM
Sigkdd Explor. Newsl. 2009, 11, 10–18. [CrossRef]
32. Navigli, R.; Velardi, P.; Ruiz-Martínez, J.M. WCL Definitions Dataset. Available online: http://lcl.uniroma1.it/wcl/ (accessed on 1
September 2020).
33. Navigli, R.; Velardi, P.; Ruiz-Martínez, J.M. An Annotated Dataset for Extracting Definitions and Hypernyms from the Web. In
Proceedings of the International Conference on Language Resources and Evaluation, LREC, Valetta, Malta, 19–21 May 2010.
34. Spala, S.; Miller, N.A.; Yang, Y.; Dernoncourt, F.; Dockhorn, C. DEFT: A corpus for definition extraction in free-and semi-structured
text. In Proceedings of the 13th Linguistic Annotation Workshop, Florence, Italy, 1 August 2019; pp. 124–131.
35. Jin, Y.; Kan, M.Y.; Ng, J.P.; He, X. W00 Definitions Dataset. Available online: https://bitbucket.org/luisespinosa/neural_de/src/
afedc29cea14241fdc2fa3094b08d0d1b4c71cb5/data/W00_dataset/?at=master (accessed on 1 January 2013).
36. Bird, S.; Dale, R.; Dorr, B.J.; Gibson, B.R.; Joseph, M.T.; Kan, M.; Lee, D.; Powley, B.; Radev, D.R.; Tan, Y.F. The ACL Anthology
Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics. In Proceedings of the
International Conference on Language Resources and Evaluation, LREC, Marrakech, Morocco, 26 May–1 June 2008.
37. Vanetik, N.; Litvak, M.; Shevchuk, S.; Reznik, L. WFM Dataset of Mathematical Definitions. Available online: https:
//github.com/uplink007/FinalProject/tree/master/data/wolfram (accessed on 1 January 2019).
38. Weisstein, E. Wolfram Mathworld. Available online: https://www.wolframalpha.com/ (accessed on 1 January 2019).
39. Manning, C.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S.; McClosky, D. The Stanford CoreNLP natural language processing
toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations,
Baltimore, MD, USA, 23–25 June 2014; pp. 55–60.
40. Honnibal, M.; Johnson, M. An Improved Non-monotonic Transition System for Dependency Parsing. In Proceedings of the 2015
Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics, Lisbon, Portugal,
17–21 September 2015; pp. 1373–1378.
41. Grave, E.; Bojanowski, P.; Gupta, P.; Joulin, A.; Mikolov, T. Learning Word Vectors for 157 Languages. In Proceedings of the
International Conference on Language Resources and Evaluation LREC, Miyazaki, Japan, 7–12 May 2018.
42. Veyseh, A.; Dernoncourt, F.; Dou, D.; Nguyen, T. A joint model for definition extraction with syntactic connection and semantic
consistency. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34,
pp. 9098–9105.
mathematics
Article
To Batch or Not to Batch? Comparing Batching and Curriculum
Learning Strategies across Tasks and Datasets
Laura Burdick *,† , Jonathan K. Kummerfeld and Rada Mihalcea
Computer Science and Engineering, University of Michigan, Ann Arbor, MI 48109, USA;
[email protected] (J.K.K.); [email protected] (R.M.)
* Correspondence: [email protected]
† Current address: 2260 Hayward Street, Ann Arbor, MI 48109, USA.
Abstract: Many natural language processing architectures are greatly affected by seemingly small
design decisions, such as batching and curriculum learning (how the training data are ordered during
training). In order to better understand the impact of these decisions, we present a systematic analysis
of different curriculum learning strategies and different batching strategies. We consider multiple
datasets for three tasks: text classification, sentence and phrase similarity, and part-of-speech tagging.
Our experiments demonstrate that certain curriculum learning and batching decisions do increase
performance substantially for some tasks.
Keywords: natural language processing; word embeddings; batching; word2vec; curriculum learning;
text classification; phrase similarity; part-of-speech tagging
2. Related Work
Our work builds on previous work on word embeddings, batching, and curricu-
lum learning.
2.2. Batching
We use two batching approaches, denoted basic batching and cumulative batching. Basic batching was first introduced by Bengio et al. [7], and cumulative batching by Spitkovsky et al. [8]. These approaches are described in more detail in Section 3.2.
While we use the batching approaches described in these papers, we analyze their
performance on different tasks and datasets. Basic batching was originally proposed for
synthetic vision and word representation learning tasks, while cumulative batching was
applied to unsupervised dependency parsing. We use these batching techniques on NLP
architectures built with word embeddings for the tasks of text classification, sentence and
phrase similarity, and part-of-speech tagging.
In addition to using two batching techniques, we vary the number of batches that we use. This has been studied previously; Smith et al. [9] showed that choosing a good batch size can decrease the number of parameter updates required to train a network.
Figure 1. Experimental setup for text classification, sentence and phrase similarity, and POS tagging.
3.2. Batching
We apply batching to the Wikipedia dataset input to the word embedding algorithm.
We batch words (and their contexts) using two strategies: basic batching and cumulative
batching, visualized in Figure 2. For each batching strategy, we consider different numbers
of batches, ranging exponentially between 2 and 200.
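To make the two strategies concrete, the following Python sketch (an illustration rather than the authors' released code) builds the training schedule for each strategy from a list of sentences. It assumes that basic batching trains on disjoint chunks one after another, while cumulative batching re-includes all earlier chunks at each stage; all names are hypothetical.

def basic_batches(sentences, num_batches):
    # Split the data into (roughly) num_batches disjoint chunks;
    # training proceeds on one chunk after another.
    size = max(1, len(sentences) // num_batches)
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]

def cumulative_batches(sentences, num_batches):
    # Each training stage re-includes every earlier chunk plus the next one,
    # so early examples are seen many times.
    chunks = basic_batches(sentences, num_batches)
    stages, seen = [], []
    for chunk in chunks:
        seen = seen + chunk
        stages.append(list(seen))
    return stages

data = ["sentence %d" % i for i in range(10)]
print([len(b) for b in basic_batches(data, 5)])       # [2, 2, 2, 2, 2]
print([len(b) for b in cumulative_batches(data, 5)])  # [2, 4, 6, 8, 10]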
Figure 2. Basic vs. cumulative batching. Rectangles represent chunks of the training data, with
different colors representing different sections of the data.
(Datasets are tokenized using NLTK’s tokenizer.) These datasets span a wide range of sizes (from 96 to 3.6 million training sentences) and numbers of classes (from 2 to 14).
Table 1. Data statistics (number of training sentences, number of test sentences, number of classes)
for text classification. The first eight datasets are from Zhang et al. [15]. Two datasets have both a
polarity (pol.) version with two classes and a full version with more classes.
Of particular note are three datasets that are at least an order of magnitude smaller than the other datasets. These are the Open Domain Deception Dataset [17] and the Real Life Deception Dataset [16], both of which classify statements as truthful or deceptive, as well as the Personal Email Dataset [18], which classifies e-mail messages as personal or non-personal.
After creating embedding spaces, we use fastText [19] for text classification. (Available online at https://fasttext.cc/, accessed on 7 September 2021.) FastText represents sentences as a bag of words and trains a linear classifier to classify the sentences. Performance is measured using accuracy.
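For concreteness, a minimal sketch of this classification step with the fastText Python bindings is shown below; the file names, the 300-dimensional pre-trained vector file (standing in for the word2vec spaces described above), and the hyperparameter values are illustrative assumptions rather than the authors' actual configuration.

import fasttext

# train.txt / test.txt are assumed to be in fastText's supervised format,
# i.e., one "__label__<class> <sentence>" entry per line (hypothetical files).
# The .vec file supplies pre-trained word vectors; its dimension must match `dim`.
model = fasttext.train_supervised(
    input="train.txt",
    dim=300,
    epoch=5,
    lr=0.1,
    pretrainedVectors="embeddings.vec",
)

# model.test returns (number of examples, precision@1, recall@1);
# with one label per example, precision@1 equals accuracy.
n_examples, precision_at_1, _ = model.test("test.txt")
print(f"Accuracy on {n_examples} test sentences: {precision_at_1:.3f}")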
For each pair of phrases or sentences in our evaluation set, we average the embeddings of the words in each phrase or sentence and take the cosine similarity between the two averaged vectors. We compare these similarities with the ground truth using Spearman’s correlation [23], a measure of rank correlation: the values of the two variables are ranked, and Pearson’s correlation [24] (a measure of linear correlation) is computed between the ranks. Spearman’s correlation thus assesses monotonic relationships (does one value never decrease, or never increase, as the other increases?), whereas Pearson’s correlation assesses linear relationships.
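A minimal sketch of this evaluation, assuming a dictionary-like lookup emb from words to NumPy vectors (the names and the 300-dimensional default are illustrative), is:

import numpy as np
from scipy.stats import spearmanr

def phrase_vector(phrase, emb, dim=300):
    # Average the embeddings of the in-vocabulary words in the phrase.
    vectors = [emb[w] for w in phrase.lower().split() if w in emb]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def evaluate(pairs, gold_scores, emb):
    # Spearman's correlation between predicted cosine similarities and
    # human similarity ratings: both lists are ranked, and Pearson's
    # correlation is computed on the ranks.
    predicted = [cosine(phrase_vector(a, emb), phrase_vector(b, emb))
                 for a, b in pairs]
    return spearmanr(predicted, gold_scores).correlation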
4. Results
We apply the different curriculum learning and batching strategies to each task, and
we consider the results to determine which strategies are most effective.
Figure 3. Accuracy scores on the development set for three text classification datasets. Different lines indicate models
trained with different curriculum and batching strategies (basic, cumulative). Datasets span different ranges of the x-axis
because they are different sizes. Error bars show the standard deviation over ten word2vec embedding spaces, trained
using different random seeds.
On the smallest dataset, Real Life Deception (96 training sentences), we see that above
approximately ten batches, ascending curriculum with cumulative batching outperforms
the other methods. On the test set, we compare our best strategy (ascending curriculum
with cumulative batching) with the baseline setting (default curriculum with basic batch-
ing), both with 100 batches, and we see no significant difference. This is most likely because
the test set is so small (25 sentences).
Figure 4. Spearman’s correlation scores on the train set for sentence and phrase similarity tasks. Different lines indicate
models trained with different curriculum and batching strategies (basic, cumulative). Datasets span different ranges of the
x-axis because they are different sizes. Error bars show the standard deviation over ten word2vec embedding spaces trained
using different random seeds.
First, we note that the relative performance of different strategies remains consistent
across all three datasets and across all six measures of similarity. An ascending curriculum
with cumulative batching performs the worst by a substantial amount, while a descending
curriculum with cumulative batching performs the best by a small amount. As the number
of sentences per batch increases, the margin between the different strategies decreases.
On the test set, we compare our best strategy (descending curriculum with cumulative
batching) with the baseline setting (default curriculum with basic batching), and we see in
Table 4 that the best strategy significantly outperforms the baseline with five batches.
Table 4. Spearman’s correlation on the test set for similarity tasks (all have a standard deviation
of 0.0).
Dataset     Human Activity Sim.   Human Activity Rel.   Human Activity MA   Human Activity PAC   STS    SICK
Baseline          0.36                  0.33                 0.33                 0.22           0.27   0.51
Best              0.43                  0.41                 0.41                 0.29           0.32   0.53
For all six measures, we observe a time vs. performance trade-off: the fewer sentences there are in a batch, the better the performance, but the more computational power and time the training requires.
For text classification, we saw substantial changes in performance only on the smallest dataset. Both of the datasets that we use to evaluate POS tagging are relatively large (>2500 sentences in the training data), which may explain why we do not see significant performance differences here.
5. Conclusions
One strategy does not perform equally well on all tasks. On some tasks, such as
POS tagging, the curriculum and batching strategies that we tried have no effect at all.
Simpler tasks that rely most heavily on word embeddings, such as sentence and phrase
similarity and text classification with very small datasets, benefit the most from fine-tuned
curriculum learning and batching. We have shown that making relatively small changes to
curriculum learning and batching can have an impact on the results; this may be true in
other tasks with small data as well.
In general, cumulative batching outperforms basic batching. This is intuitive, because cumulative batching revisits earlier training examples at every stage and therefore sees the training data more times overall than basic batching does. As the number of sentences per batch increases, the differences between cumulative and basic batching shrink. Even though cumulative batching achieves higher performance, it requires more computational time and power than basic batching, so we again see a trade-off between computational resources and performance.
It remains inconclusive which curriculum is best. For text classification, the ascending curriculum works best, while for sentence and phrase similarity, the descending curriculum works best. One hypothesis for this difference is that for text classification, the individual words are more important than the overall structure of the sentence: the individual words largely determine which class a sentence belongs to. Therefore, the algorithm does better when it looks at shorter sentences first, before building up to longer sentences (an ascending curriculum); with the shorter sentences, the algorithm can focus more on the words and less on the overall structure. For sentence and phrase similarity, it is possible that the overall structure of the sentence is more important than the individual words, because the algorithm is looking for overall similarity between two phrases. Thus, a descending curriculum, in which the algorithm is exposed to longer sentences first, works better for this task.
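As an illustrative sketch of the three curricula compared here, assuming that the curriculum is defined by sentence length (function and variable names are hypothetical):

def order_curriculum(sentences, curriculum="default"):
    # Return the training sentences in the order the embedding algorithm
    # will see them; length is measured in tokens here, although other
    # difficulty measures could be substituted.
    if curriculum == "ascending":      # shortest sentences first
        return sorted(sentences, key=lambda s: len(s.split()))
    if curriculum == "descending":     # longest sentences first
        return sorted(sentences, key=lambda s: len(s.split()), reverse=True)
    return list(sentences)             # default: original corpus order

corpus = ["a short sentence", "a noticeably longer example sentence", "tiny"]
print(order_curriculum(corpus, "ascending"))   # ['tiny', 'a short sentence', ...]
print(order_curriculum(corpus, "descending"))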
We have explored different combinations of curriculum learning and batching strate-
gies across three different downstream tasks. We have shown that for different tasks,
different strategies are appropriate, but that overall, cumulative batching performs better
than basic batching.
Since our experiments demonstrate that certain curriculum learning and batching
decisions do increase the performance substantially for some tasks, for future experiments,
we recommend that practitioners experiment with different strategies, particularly when
the task at hand relies heavily on word embeddings.
6. Future Work
There are many tasks and NLP architectures that we have not explored in this work,
and our direct results are limited to the tasks and datasets presented here. However, our
work implies that curriculum learning and batching may have similar effects on other tasks
and architectures. Future work is needed here.
Additionally, the three curricula that we experimented with in this paper (default,
ascending, and descending) are relatively simple ways to order data; future work is
needed to investigate more complex orderings. Taking into account such properties as
the readability of a sentence, the difficulty level of words, and the frequency of certain
part-of-speech combinations could create a better curriculum that consistently works well
on a large variety of tasks. Moreover, artificially simplifying sentences (e.g., substituting simpler words or removing unnecessary clauses) at the beginning of the curriculum, and then gradually increasing sentence difficulty, could help “teach” the embedding algorithm to recognize increasingly complex sentences.
There are many other word embedding algorithms (e.g., BERT [28], GloVe [29]);
batching and curriculum learning may affect these algorithms differently than they affect
word2vec. Different embedding dimensions and context window sizes may also make a
difference. More work is needed to explore this.
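For instance, with the gensim implementation of word2vec (named here only for illustration; the paper does not state which implementation was used), the embedding dimension, context window size, and random seed are ordinary parameters that such a study could vary:

from gensim.models import Word2Vec

# Toy corpus; in the experiments above, this would be the ordered and
# batched Wikipedia sentences.
sentences = [["curriculum", "learning", "orders", "the", "training", "data"],
             ["batching", "splits", "the", "training", "data", "into", "chunks"]]

# Parameter names follow gensim 4.x; older releases use `size` instead of
# `vector_size`. A single worker keeps training reproducible for a fixed seed.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                 sg=1, seed=42, workers=1, epochs=5)

print(model.wv["batching"].shape)  # (100,)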
Finally, our paper is an empirical study: our observations indicate that the observed batching variation is something researchers should consider, even though we do not explore its theoretical causes.
The code used in the experiments is publicly available at https://lit.eecs.umich.edu/
downloads.html (accessed on 7 September 2021).
Author Contributions: Conceptualization, L.B., J.K.K. and R.M.; methodology, L.B., J.K.K. and
R.M.; software, L.B. and J.K.K.; investigation, L.B. and J.K.K.; writing—original draft preparation,
L.B.; writing—review and editing, L.B., J.K.K. and R.M.; supervision, J.K.K. and R.M.; project
administration, R.M.; funding acquisition, R.M. All authors have read and agreed to the published
version of the manuscript.
Funding: This material is based in part upon work supported by the National Science Foundation
(NSF #1344257), the Defense Advanced Research Projects Agency (DARPA) AIDA program under
grant #FA8750-18-2-0019, and the Michigan Institute for Data Science (MIDAS). Any opinions,
findings, and conclusions or recommendations expressed in this material are those of the authors
and do not necessarily reflect the views of the NSF, DARPA, or MIDAS.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Wikipedia: The Wikipedia dataset used to create the initial embedding
spaces was used in Tsvetkov et al. [10] and is available by contacting the authors of that paper.
Text classification: Amazon Review (pol.), Amazon Review (full), Yahoo! Answers, Yelp Review (full), Yelp Review (pol.), DBPedia, Sogou News, and AG News are available at https://course.fast.ai/datasets#nlp (accessed on 7 September 2021). The Open Domain Deception Dataset is available at https://lit.eecs.umich.edu/downloads.html (accessed on 7 September 2021) under “Open-Domain Deception”. The Real Life Deception Dataset is available at https://lit.eecs.umich.edu/downloads.html (accessed on 7 September 2021) under “Real-life Deception”. The Personal Email Dataset is available at https://lit.eecs.umich.edu/downloads.html (accessed on 7 September 2021) under “Summarization and Keyword Extraction from Emails”. Sentence and phrase similarity:
The Human Activity Dataset is available at https://lit.eecs.umich.edu/downloads.html (accessed
on 7 September 2021) under “Human Activity Phrase Data”. The STS Benchmark is available at
https://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark (accessed on 7 September 2021). The
SICK dataset is available at https://wiki.cimec.unitn.it/tiki-index.php?page=CLIC (accessed on
7 September 2021). Part-of-speech tagging: The English Universal Dependencies Corpus is available at
https://universaldependencies.org/ (accessed on 7 September 2021).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their
compositionality. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10
December 2013; pp. 3111–3119.
2. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013,
arXiv:1301.3781.
3. Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural language processing (almost) from scratch. J.
Mach. Learn. Res. 2011, 12, 2493–2537.
4. Kenter, T.; de Rijke, M. Short Text Similarity with Word Embeddings. In Proceedings of the 24th ACM International on Conference
on Information and Knowledge Management, Melbourne, Australia, 19–23 October 2015; Association for Computing Machinery:
New York, NY, USA, 2015; CIKM ’15; pp. 1411–1420. [CrossRef]
5. Faruqui, M.; Dodge, J.; Jauhar, S.K.; Dyer, C.; Hovy, E.; Smith, N.A. Retrofitting Word Vectors to Semantic Lexicons. In
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Denver, CO, USA, 31 May–5 June 2015; Association for Computational Linguistics: Stroudsburg, PA,
USA, 2019; pp. 1606–1615. [CrossRef]
6. Mikolov, T.; Le, Q.V.; Sutskever, I. Exploiting similarities among languages for machine translation. arXiv 2013, arXiv:1309.4168.
7. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the International Conference on
Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 41–48.
8. Spitkovsky, V.I.; Alshawi, H.; Jurafsky, D. From Baby Steps to Leapfrog: How “Less is More” in Unsupervised Dependency
Parsing. In Proceedings of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of
the Association for Computational Linguistics, Los Angeles, CA, USA, 2–4 June 2010; Association for Computational Linguistics:
Stroudsburg, PA, USA, 2019; pp. 751–759.
9. Smith, S.L.; Kindermans, P.J.; Ying, C.; Le, Q.V. Don’t decay the learning rate, increase the batch size. In Proceedings of the
International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
10. Tsvetkov, Y.; Faruqui, M.; Ling, W.; MacWhinney, B.; Dyer, C. Learning the Curriculum with Bayesian Optimization for Task-
Specific Word Representation Learning. In Proceedings of the 54th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 130–139.
11. Xu, B.; Zhang, L.; Mao, Z.; Wang, Q.; Xie, H.; Zhang, Y. Curriculum Learning for Natural Language Understanding. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational
Linguistics, Online, 5–10 July 2020; pp. 6095–6104. [CrossRef]
12. Zhang, X.; Shapiro, P.; Kumar, G.; McNamee, P.; Carpuat, M.; Duh, K. Curriculum Learning for Domain Adaptation in Neural
Machine Translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–9 June 2019; Association for Computational Linguistics:
Stroudsburg, PA, USA, 2019; pp. 1903–1915. [CrossRef]
13. Antoniak, M.; Mimno, D. Evaluating the Stability of Embedding-based Word Similarities. Trans. Assoc. Comput. Linguist. 2018,
6, 107–119. [CrossRef]
14. Wendlandt, L.; Kummerfeld, J.K.; Mihalcea, R. Factors Influencing the Surprising Instability of Word Embeddings. In Proceedings
of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, New Orleans, LA, USA, 1–6 June 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019;
pp. 2092–2102. [CrossRef]
15. Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. In Proceedings of the Advances in
Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 649–657.
16. Pérez-Rosas, V.; Abouelenien, M.; Mihalcea, R.; Burzo, M. Deception detection using real-life trial data. In Proceedings of the
2015 ACM on International Conference on Multimodal Interaction, Seattle, WA, USA, 9–13 November 2015; ACM: Seattle, WA,
USA, 2015; pp. 59–66.
17. Pérez-Rosas, V.; Mihalcea, R. Experiments in open domain deception detection. In Proceedings of the 2015 Conference on
Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1120–1125.
18. Loza, V.; Lahiri, S.; Mihalcea, R.; Lai, P.H. Building a Dataset for Summarization and Keyword Extraction from Emails. In
Proceedings of the Ninth International Conference on Language Resources and Evaluation, Reykjavik, Iceland, 26–31 May 2014;
European Languages Resources Association: Paris, France, 2014; pp. 2441–2446.
19. Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference
of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017; Association for
Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 427–431.
20. Wilson, S.; Mihalcea, R. Measuring Semantic Relations between Human Activities. In Proceedings of the Eighth International
Joint Conference on Natural Language Processing (Volume 1: Long Papers), Asian Federation of Natural Language Processing,
Taipei, Taiwan, 27 November–1 December 2017; pp. 664–673.
21. Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; Specia, L. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and
Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation, Vancouver, BC,
Canada, 3–4 August 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 1–14. [CrossRef]
22. Bentivogli, L.; Bernardi, R.; Marelli, M.; Menini, S.; Baroni, M.; Zamparelli, R. SICK through the SemEval glasses. Lesson learned
from the evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual
entailment. Lang. Resour. Eval. 2016, 50, 95–124. [CrossRef]
23. Spearman, C. Correlation calculated from faulty data. Br. J. Psychol. 1904–1920 1910, 3, 271–295. [CrossRef]
24. Sedgwick, P. Pearson’s correlation coefficient. BMJ 2012, 345, e4483. [CrossRef]
25. Nivre, J.; de Marneffe, M.C.; Ginter, F.; Goldberg, Y.; Hajič, J.; Manning, C.D.; McDonald, R.; Petrov, S.; Pyysalo, S.; Silveira, N.;
et al. Universal Dependencies v1: A Multilingual Treebank Collection, Language Resources and Evaluation. In Proceedings
of the Tenth International Conference on Language Resources and Evaluation, Portorož, Slovenia, 23–28 May 2016; European
Languages Resources Association: Paris, France, 2016; pp. 1659–1666.
26. Neubig, G.; Dyer, C.; Goldberg, Y.; Matthews, A.; Ammar, W.; Anastasopoulos, A.; Ballesteros, M.; Chiang, D.; Clothiaux, D.;
Cohn, T.; et al. DyNet: The Dynamic Neural Network Toolkit. arXiv 2017, arXiv:1701.03980.
27. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef] [PubMed]
28. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understand-
ing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Minneapolis, MN, USA, 2–9 June 2019; Association for Computational Linguistics: Stroudsburg,
PA, USA, 2019; pp. 4171–4186.
29. Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Doha, Qatar, 25–29 October 2014;
pp. 1532–1543. [CrossRef]