Towards Responsible Machine Translation - Ethical and Legal Considerations in Machine Translation
Helena Moniz
Carla Parra Escartín Editors
Machine Translation: Technologies and
Applications
Volume 4
Editor-in-Chief
Andy Way, ADAPT Centre, Dublin City University, Dublin, Ireland
Editors

Helena Moniz
School of Arts and Humanities
University of Lisbon
Lisbon, Portugal

Carla Parra Escartín
Research & Development
RWS Language Weaver
Dublin, Ireland
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by
similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword
As with the previous volume edited by Michael Carl, I am delighted to write a few
words to be included at the beginning of the fourth book in our series on Machine
Translation: Technologies and Applications. So far, we have seen how solutions to
problems for MT might be found by looking at how the brain works, how MT
quality can be evaluated, and how MT is used in practice. Now that MT is in
widespread use, it is time to look—as is happening in AI more generally—at how
it can be used in a responsible manner.
Volume 4 in the series is entitled Towards Responsible Machine Translation:
Ethical and Legal Considerations in Machine Translation, a collection of 11 chap-
ters edited by two of the nicest people in the field of MT and Translation, Helena
Moniz and Carla Parra Escartín. After an introduction by the editors, the main body
of the book consists of three separate but related parts: (1) Responsible Machine
Translation: Ethical, Philosophical and Legal Aspects; (2) Responsible Machine
Translation from the End-User Perspective; and (3) Responsible Machine Transla-
tion: Societal Impact. Altogether, there are 23 contributors, all of whom are experts
in their respective areas, so there will be plenty of interest in this volume for sure,
whether the reader is interested in looking at the topic from an ethical point of view,
licensing issues, and ownership rights; or from a user perspective, topics surrounding
post-editing of MT output, the impact of MT on the reader, and other issues of
automation; or issues of bias, the impact of large AI models on the planet we live on,
and the very topical subject of privacy.
Each reader of this volume will be more or less at home in different sections of the
book, but none, I contend, will argue against the inclusion of all these topics in a
single coherent volume. Perhaps you have never thought about these topics before,
and will be inspired to follow up the work presented here; at the very least, when you
are writing about issues on responsible AI (and especially MT) in the future—a topic
that can no longer be ignored—you will know where to come for a perspective on all
these issues as well as a thorough set of references.
As such, this volume fills an important gap in the field, providing an encapsula-
tion of the state of the art, and pointers to future work in the area, especially for
newcomers to the topic who are well versed in ethics and legal issues from other
application areas, but also to longstanding researchers in the discipline, like me, who
have perhaps buried their head in the sand for too long. As such, I am delighted that
Helena and Carla agreed to take up this challenge in the first place, and the book they
have delivered is, I think, essential reading for us all.
Acknowledgements

Many stakeholders have been involved in the process of preparing this book for
publication, and we are grateful to all of them for their patience with us during the
whole process and for their constant support.
First of all, we would like to thank Sharon O’Brien for inspiring us to work on
research related to Ethics and MT during the INTERACT project, and Andy Way
for encouraging us to go beyond writing a book chapter and embark on this book
project. Without his encouragement and support in the very early stages of this
project, this book would not have been possible.
We are very grateful to all the authors who contributed to this volume. Editing a
book in the middle of a pandemic brought additional challenges, and your efforts
and contributions have been paramount in making this book a reality. We also
sympathise deeply with those authors who had to decline our invitation or had to
withdraw their contribution due to the harsh conditions of the pandemic.
We are also very grateful to our editors at Springer, Sowmya Thodur and Shina
Harshavardhan, for all the support along this journey, the numerous emails helping
us navigate through the process, and their patience.
Last but certainly not least, we would like to thank our families and close friends
and colleagues for their continuous support during this project. You all understood
the importance of this topic and have given us the much-needed energy to complete
it. Without your support it would still have been possible, but certainly much harder.
Helena Moniz and Carla Parra Escartín
Contents
1 Introduction
Helena Moniz and Carla Parra Escartín
Editors and Contributors

Helena Moniz is the President of the European Association for Machine Transla-
tion (EAMT) and Vice President of the International Association for Machine
Translation (IAMT). She is also the Vice-coordinator of the Human Language
Technologies Lab at INESC-ID. Helena is an Assistant Professor at the School of
Arts and Humanities at the University of Lisbon, where she teaches Computational
Linguistics, Computer-Assisted Translation, and Machine Translation Systems and
Post-editing. She graduated in Modern Languages and Literature at the School of
Arts and Humanities, University of Lisbon (FLUL), in 1998. She received a Master’s
degree in Linguistics at FLUL, in 2007, and a PhD in Linguistics at FLUL in
cooperation with the Technical University of Lisbon (IST), in 2013. She has been
working at INESC-ID/CLUL since 2000, in several national and international pro-
jects involving multidisciplinary teams of linguists and speech processing engineers.
Within these fruitful collaborations, she participated in 19 national and international
projects. Since 2015, she has also been the PI of a bilateral project with INESC-ID/Unbabel,
a translation company combining AI + post-editing, working on scalable Linguistic
Quality Assurance processes for crowdsourcing. She was responsible for the
creation of the Linguistic Quality Assurance processes developed at Unbabel for
Linguistic Annotation and Editors’ Evaluation. She is now working mostly on research
projects involving Linguistics, Translation, and AI.
School of Arts and Humanities, University of Lisbon, Lisbon, Portugal
Carla Parra Escartín is a Research Manager within the R&D department of RWS
Language Weaver. She holds an MA in English Philology from the University of
Zaragoza (Spain), an MA in Translation and Interpreting and an MA in Applied
Linguistics, both from the Pompeu Fabra University (Barcelona, Spain), and a PhD
in Computational Linguistics from the University of Bergen (Norway). She has over
10 years of research experience in linguistic infrastructures, human factors in
machine translation, and multiword expressions (MWEs). Throughout her career
she has worked in various EU-funded projects and actions (LIRICS, CLARIN,
Abbreviations
FB Frame Buffer
FP32 32-bit floating point number
FPO Floating Point Operations
FR Language code for the French language
GB Gigabytes
GC Garbled Circuit
GDACS Global Disaster Alerting Coordination System
GDPR General Data Protection Regulation
GloVe Global Vector
GMM Gaussian Mixture Model
GNMT Google Neural Machine Translation
GNU GNU’s Not Unix! [recursive abbreviation], a free software initiative
and software collection
GPHIN Global Public Health Intelligence Network
GPT Generative Pre-trained Transformer
GPU Graphics Processing Unit
HE Homomorphic Encryption
HLT Human Language Technology
HMM Hidden Markov Model
HT Human Translation
IBM International Business Machines
IE Ireland
ILSP Institute for Language and Speech Processing (Athens, Greece)
INT16 Integer number stored with 16 bits
INT8 Integer number stored with 8 bits
IPR Intellectual Property Rights
ISCA International Speech Communication Association
KL Kullback-Leibler
LoAs Levels of Abstraction
LR Language Resource
LS-BERT Language-Specific BERT
LSP Language Service Provider
LSTM Long Short-Term Memory
mBERT Multilingual BERT
MCO Mars Climate Orbiter
MD5 Message Digest version 5, a function assigning a 128-bit number to
a piece of data
ML Machine Learning
MLaaS Machine Learning as a Service
MOS Mean Opinion Score
MS Word Microsoft Word
MT Machine Translation
MTPE Machine Translation Post-Editing
MUL Multiplication operation
Chapter 1
Introduction

With the prevalence of Artificial Intelligence (AI) technologies in our everyday
lives, new trends in research have emerged, and researchers have started to question
them from an ethical and ecological point of view. Along with this trend, citizens
have also started to question how these systems can be trusted if there is little or no
control over the decisions they make. Should one trust an AI algorithm to be the
main driver of investments in the stock market, for instance? How many of these
algorithms should citizens know about, and who is responsible for potential damages
they cause? What is the ecological impact of training these systems in an era where
we are all concerned about climate change? Are the systems able to detect and
categorise speakers’ attitudes, beliefs, or even biases beyond the semantic meaning
and alert the user to such patterns? All these questions are addressed in this
emerging field of research, and along with it, different terms have been coined to
refer to it: Ethical AI, Fair AI, Responsible AI, and even Green AI.
It is well known that the so-called neural systems are not able to extract cause-
effect relations as human beings do (e.g. Pearl and Mackenzie 2018) and rather work
as a “black box”. In the words of Yoshua Bengio, one of the pioneers of such systems,
“It’s a big thing to integrate [causality] into AI”.1 Several metaphors have been used
to express such concerns (or even fears) that technology is no longer controlled by
the human being, but rather performs in such “magical” ways that it can lead to
1. https://www.wired.com/story/ai-pioneer-algorithms-understand-why/
2. See Kenny (2019) for an overview on the topic.
3. https://ethicsinnlp.org/ethnlp-2017 and https://ethicsinnlp.org/
4. See: https://acl2020.org/committees/program and https://2021.aclweb.org/organization/program/
5. https://2021.aclweb.org/ethics/Ethics-review-questions/
6. https://spsc-symposium2021.de/
platforms, on the one hand; and communication settings based on trust, credibility,
and social equality, on the other.
Finally, Part III covers the societal impact of MT, tackling gender and age bias in
MT systems, ecological implications of Neural MT systems, and the role of speech
as key Personally Identifiable Information. There are multiple topics that could be
encompassed in this last part of the book, as MT is nowadays used to enable
communication in a myriad of possible scenarios and hence multiple perspectives
could be explored. With the ultimate goal of highlighting how Responsible MT
spreads across domains and even has societal implications in our daily lives, the
three chapters included here aim to be a first approximation towards reflecting
further on the societal impact MT has. In Chap. 9, “Gender and Age Bias in
Commercial Machine Translation”, Federico Bianchi, Tommaso Fornaciari, Dirk
Hovy, and Debora Nozza tackle style issues in MT and how they reflect gender and
age. Three commercial MT systems outputs (Bing, DeepL, Google) are analysed and
evidence on demographic bias is provided. The authors further explore whether the
bias found can be used as a feature, by correcting skewed initial samples, and
compute fairness scores for the different demographics.
GPU performance and NMT models have been gaining traction within the topic
of Green AI and NLP, with discussions weighing the implications of higher
performance and faster inference in NMT systems against ecological concerns.
Training on ever bigger datasets often brings little or no significant improvement,
frequently reported with biased metrics such as BLEU (Kocmi et al.
2021). Chapter 10, “The Ecological Footprint of Neural Machine Translation Sys-
tems”, written by Dimitar Shterionov and Eva Vanmassenhove, reports on several
experiments aimed at measuring the carbon footprint of MT. The authors train
models using distinct NMT architectures (Long Short-Term Memory (LSTM) and
Transformer) on different types of GPU (NVidia GTX 1080Ti and NVidia Tesla
P100) and collect power readings from the corresponding GPUs in order to compute
their respective ecological footprint.
Our book ends with the final chapter entitled “Treating Speech as Personally
Identifiable Information—Impact in Machine Translation”. In it, Isabel Trancoso
provides a broad view of the widespread use of speech technologies and the ethical
implications of using such an idiosyncratic biomarker. She illustrates how it is
possible to extract metadata on personality traits, emotions, health status, gender,
age, accent, etc. from very small samples of speech. Sending speech to remote
servers should, therefore, be a very cautious and informed action. Unfortunately,
most citizens are not aware of the potential misuses of their voice. Speech gives more
idiosyncratic information than a fingerprint and consequently it is highly sensitive in
terms of security and privacy. Isabel’s chapter allows us to end the book in a circular
way, as she answers Wessel and Quinn’s question “what is artificial speech?” and its
ethical implications.
The topic of Responsible AI is a very complex one. This book provides what we
consider a first exploration of many venues of work that should be pursued to
accomplish responsible MT systems. Each chapter provides an introduction to
what could be a book of its own. Along this journey, we had a constant feeling
that much more could be covered. We felt as if we were just covering nanoparticles
in an unexplored multiverse, and reading the contributions of all the authors only
confirmed this feeling. Responsible MT plays a key role in every citizen’s life. This book
aims at bringing up this topic for discussion and providing the basis for a much
broader field of research: one that encompasses cross-disciplinary collaborations to
ensure that responsible MT becomes a reality in the (near) future.
AI technologies are here to stay, in health applications, in e-learning, in embodied
agents, in generating writing suggestions, in bias detectors, etc. And along with them
comes a societal impact that needs to be addressed. NLP is surpassing the frontier of
“meaning” and embracing metadata on our attitudes, our mentality, our ways of
thinking, and even our ways of expressing ourselves. We leave a fingerprint of our
inner selves in our speech, in the texts we write, and in the gestures we use when we
communicate. AI models are more than ever able to capture and code all those
nuances. They are coding who we are. And that is why we think we are at a crucial
turning point, a disruptive one, we should state, in which the ethical and legal
implications of such systems need to be raised up for discussion and informed
decisions need to be made, not only on the way we develop such technologies, but
also on the way we use them. It is clear that AI systems—and by extension MT
systems—are having a major impact on our daily lives, but it is still unclear how we
will manage to balance the progress of such technologies and the uniqueness of each
citizen’s voice/text/visual information. Ultimately, as end-users of the systems we
should be part of this discussion. This book is intended to be a first step in that
direction.
References
Alfano M, Hovy D, Mitchell M, Strube M (eds) (2018) Proceedings of the Second ACL Workshop
on Ethics in Natural Language Processing, EthNLP@NAACL-HLT 2018. New Orleans,
Louisiana, USA, June 5, 2018
Batliner A, Hantke S, Schuller BW (2022) Ethics and good practice in computational paralinguis-
tics. IEEE Trans Affect Comput 13:1236. https://doi.org/10.1109/TAFFC.2020.3021015
Belz A, Agarwal S, Shimorina A, Reiter E (2021) A systematic review of reproducibility research in
natural language processing. In: Proceedings of the 16th Conference of the European Chapter of
the Association for Computational Linguistics: Main Volume. EACL 2021
Bender EM, Hovy D, Schofield A (2020) Integrating ethics into the NLP curriculum. In: Pro-
ceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial
Abstracts. ACL 2020
8 H. Moniz and C. P. Escartín
Cowie R (2015) Ethical issues in affective computing. In: Calvo R, D’Mello S, Gratch J, Kappas A
(eds) The Oxford handbook of affective computing. Oxford University Press, Oxford. https://
doi.org/10.1093/oxfordhb/9780199942237.013.006
Ethayarajh K, Jurafsky D (2020) Utility is in the eye of the user: a critique of NLP leaderboard
design. ArXiv:abs/2009.13888
Hovy D, Spruit SL (2016) The social impact of natural language processing. In: Proceedings of
the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short
Papers). ACL 2016
Hovy D, Spruit SL, Mitchell M, Bender EM, Strube M, Wallach HM (2017) Proceedings of the
First ACL Workshop on Ethics in Natural Language Processing, EthNLP@EACL, Valencia,
Spain, April 4, 2017. EthNLP@EACL
IEEE (2016) Ethically aligned design. A vision for prioritizing human well-being with autonomous
and intelligent systems. IEEE, Washington, DC
Kenny D (2019) Machine translation. In: Rawling JP, Wilson P (eds) The Routledge handbook of
translation and philosophy. Routledge, London, pp 428–445
Kenny D, Moorkens J, do Carmo F (2020) Fair MT: Towards ethical, sustainable machine
translation. Transl Space 9:1. https://doi.org/10.1075/ts.00018.int
Kocmi T, Federmann C, Grundkiewicz R, Junczys-Dowmunt M, Matsushita H, Menezes A (2021)
To ship or not to ship: an extensive evaluation of automatic metrics for machine translation.
ArXiv:abs/2107.10821
Larson BN (2017) Gender as a variable in natural-language processing: ethical considerations. In:
Proceedings of the First ACL Workshop on Ethics in Natural Language Processing,
EthNLP@EACL, Valencia, Spain, April 4, 2017. EthNLP@EACL
Marie B, Fujita A, Rubino R (2021) Scientific credibility of machine translation research: a meta-
evaluation of 769 papers. ArXiv:abs/2106.15195
Parra Escartín C, Reijers W, Lynn T, Moorkens J, Way A, Liu C-H (2017) Ethical considerations in
NLP shared tasks. Proceedings of the First ACL Workshop on Ethics in Natural Language
Processing, EthNLP@EACL, Valencia, Spain, April 4, 2017. EthNLP@EACL
Pearl J, Mackenzie D (2018) The book of why: the new science of cause and effect. Penguin Books,
London
Stanovsky G, Smith N, Zettlemoyer L (2019) Evaluating gender bias in machine translation. In:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
ACL 2019
Strubell E, Ganesh A, McCallum A (2019) Energy and policy considerations for deep learning in
NLP. In: Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics. ACL 2019
Vanmassenhove E, Hardmeier C, Way A (2018) Getting gender right in neural machine translation.
ArXiv:abs/1909.05088
Wang C, Sennrich R (2020) On exposure bias, hallucinations and domain shift in neural machine
translation. In: Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics. ACL 2020
Wiegreffe S, Pinter Y (2019) Attention is not not explanation. In: Proceedings of the 2019 Conference
on Empirical Methods in Natural Language Processing and the 9th International Joint Confer-
ence on Natural Language Processing (EMNLP-IJCNLP). ACL
Part I
Responsible Machine Translation: Ethical,
Philosophical and Legal Aspects
Chapter 2
Prolegomenon to Contemporary Ethics
of Machine Translation
W. Reijers (✉)
Robert Schuman Centre, European University Institute, San Domenico di Fiesole, Italy
e-mail: [email protected]; [email protected]
Q. Dupont
School of Business, University College Dublin, Dublin, Ireland
e-mail: [email protected]
2.1 Introduction
What does it mean that a machine translates this information? Does it mean I have to
expect bad sentence structures and weird spelling? Might it mean that I receive
factually wrong information? Or perhaps I will miss out on some important nuances?
MT generates many such practical questions, but it also raises more fundamental
questions about language, writing, meaning, reference, and representation. What
does a word, phrase or sentence mean and hence, when is a given translation
‘correct’ or ‘good’? Such quandaries also provoke questions concerning responsibility.
If an MT is ‘wrong,’ who is to be blamed? What is the ethical choice made by
the translator—in this case a machine—in picking one word rather than another? Is
something ‘lost’ in translation, and why does this matter?
Let us start with some straightforward ethical problems facing MT (Vieira et al.
2020). Consider again the EU portal, which might offer machine-translated recom-
mendations that are rendered ambiguous or incorrect. This might lead to people
being denied visitation or immigration rights. Or, medical information could be
displayed incorrectly. These issues of facticity can cause direct harm. More intricate
issues of interpretation emerge too. For instance, a text translated by Google
Translate2 might be disregarded in a legal court case because of its origins, even
when the facts are correct. Due to algorithmic and training data set bias based on
English text corpora, MT might affect the way we use language at a global scale, for
instance by giving it a more ‘English’ ring (Raley 2003). This might also affect
natural linguistic diversity.
The ethics of MT is still nascent and largely motivated by the MT community
(Kenny 2011), which focuses on fairness in MT (Kenny et al. 2020) and conse-
quentialist issues that emerge when MT is used in crisis situations (Parra Escartín
and Moniz 2019; O’Mathúna and Hunt 2019). Scholars studying the philosophy and
ethics of technology have, despite MT’s obvious societal importance, remained
1. https://reopen.europa.eu/en
2. https://translate.google.com/
largely silent on the matter. We wonder if, to echo Feenberg (2010), philosophers of
technology have failed to see the paradox at the heart of MT. Is the most obvious the
most hidden? In this prolegomenon, we seek to build a bridge between the ethics of
MT and philosophy of technology.
Philosophy of technology brings at least three significant insights to the fore.
First, technologies are not mere instruments; they are not neutral with regards to
human actions and values but instead mediate reality (Verbeek 2005). For example,
eyeglasses do not let the wearer see reality more ‘correctly’. Rather, eyeglasses
mediate vision by amplifying some aspects and reducing or skewing others. Second,
technologies can promote or prohibit human action and delegate responsibility. As
such, technologies have ontological significance, a conceit strikingly captured by
Bruno Latour under the term “translation” (Latour 1994, 32). For Latour, translation
simpliciter is part of technology because it implies an uncertainty of goals. One
technology can be used to attain a multitude of aims. Third, philosophy of technol-
ogy teaches us that while our social and political norms are the basis of technological
design, technologies in return ‘refigure’ these values (Winner 1980). As such,
technology can have particular values that evolve, leaving a lasting effect on values
themselves: the insight behind value-centred design (Friedman et al. 2002).
In this chapter, we start building the bridge between the ethics of MT and
philosophy and ethics of technology. First, we consider translation as such and
argue that it should be seen as an inherently ethical activity. Its technological
mediation should therefore be of paramount interest for philosophers of technology.
Second, we take the role of a philosopher of technology and ask what machines ‘do’
to translation: how do machines mediate the human activity of translation and what
happens to the values associated with this activity? We draw on insights from a range
of thinkers to show that MT is in many ways an apotheosis of over 2000 years of
philosophical thought, but going forward must address three enduring ethical chal-
lenges. Third, we take these philosophical insights and discuss normative implica-
tions, highlighting three questions that could prompt further work on the ethics
of MT.
While the ethics of MT is in its infancy, it leans heavily on the ethics of translation
(Pym 2001), which is as old as philosophy itself, at least insofar as philosophy
considers problems that are treated under the heading of interpretation. Indeed, the
early names for translators were hermeneus in Greek and interpres in Latin (Kearney
2007). In modern philosophy, translation becomes a major theme with the advent of
hermeneutics in the Romantic age. Schleiermacher thought translation was the
object of textual interpretation. With some novelty, he opposed the common view
that translation should accommodate the reader’s understanding. Rather, taking the
ideal of a translation as if it is the original, Schleiermacher observed how translation
happens ‘in between’ (Venuti 1991, 127). That is, the translator finds themselves in
between the author and the reader and can move along an interpretive continuum.
For example, ethical choices in translation have been recognised by religions for
millennia. Scriptural religions, like those of the Abrahamic faith, have had to wrestle
with the perseverance of meaning in translation. Given that scripture contains the
words of God, or at least prophets and followers, religious thinkers have long
worried about how translations mediate ‘original’ meaning and translation
(famously, there are no translations of the Qurʼān in Islam, only interpretations). By
working in between, the translator accommodates the Author and the believer, the
recipient of His words.
The hermeneutic significance of translation offers three clues to its ethical
challenges. We find a first clue in the original meaning of translation, as carried
(from the Latin latus) over (from the Latin trans). It implies a process of moving
something from one place and carrying it over to another place, like transporting
grain from a farm to feed a city. The thing that is carried over is not entirely
possessed by or integral to the agent that carries it but remains, nonetheless, external
or already committed. While in transport, the farmer might spill some grain. Simi-
larly, a translator does not possess the meaning of a text but rather carries it over
from the author to the reader, risking a loss of meaning along the way. Translation
thus shows affinities with deduction (reasoning from general rules to conclusions
about particulars) and induction (reasoning from particulars to conclusions about
general rules). Similarly, Renaissance scholars discussed translation in terms of
transduction (Kearney 2007). Crucially, moving in between, translation involves
sacrifice: an inevitable loss or lack.
We find a second clue in how the concept of translation is used pragmatically. In
the vernacular, we often speak about translation outside of language. We ‘translate’
thoughts into actions, a recipe into a dish, and a football strategy into play on the
field. Similarly, scholars use the verb ‘to translate’ in a variety of ways and contexts:
investigating whether biological indicators translate cancer cells, whether clinical
trial outcomes translate into benefits for patients, and whether organisations translate
climate change into business as usual. Moreover, translation links symptoms to
diseases, descriptions into prescriptions, theory into practice, and use into abuse
(the latter already indicates a move towards norms). What these uses have in
common is that they take two or more different elements and connect them through
a commonality. Like metaphors, translation bridges the gap of ‘foreignness’ between
heterogeneous elements.
We find a third clue when considering translation as a human activity. Often
translation is performed as a professional activity, for example, real-time interpreters
in the European Parliament enable multilingual communication between politicians,
paralegal translators create authorised translations of documents, and technical
writers prepare manuals for products in different languages. What characterises
these efforts is the activity of translation as work, as a profession. As with other
kinds of work, translation comes with standards of excellence—best practices—with
an understanding of professional responsibility. Of course, translation work ought to
be done well, serving the idea of a ‘good’ translation.
Benjamin stipulates that the activity of translation does not copy the same thing, as
though a translation is just another configuration of characters and words, but rather
that translation produces commonality by relating to a ‘meta-language’ that exists
invisibly in between languages (a controversial point we will return to). According to
Benjamin, translation invokes the possibility of an ideal or pure language that existed
before the construction of the Biblical Tower of Babel, which has remained forever
unreachable. Such an exercise has been attempted many times, with little success. From the sixteenth century onwards, Renaissance and Enlightenment scholars (including intellectual giants like Descartes, Bacon, Leibniz, and Kircher) tried to develop or uncover ‘artificial’, ‘perfect’, and ‘philosophical’ languages, often modelled on Hebrew or Chinese.
Translation cannot exist without sacrifice and therefore involves (semantic)
violence. Paul Ricoeur recognises this essential element of translation as a form of
suffering (Kearney 2007, 150). According to Ricoeur, the translator is compelled to
reduce the ‘otherness’ of the text while translating, and therefore cannot help but
inflict violence on the original. The translator cannot help but ‘betray’ the original
meaning of a text when working with it. Translation is hence about dealing respon-
sibly with the inevitability of betrayal.
Second, translation establishes commonality between foreign and heterogeneous
elements. Sigmund Freud, for instance, saw translation as a process that bridges the
foreignness between our subconscious and conscious states, a process that happens
within ourselves. In this same internal sense, even when speaking in a mother tongue
(cf. Kittler 1990), we are always already translating: between public and private
settings, formal and informal settings, work and play, and so on. We use language to
cross over from the familiar to the unfamiliar. As Kearney argues, in translating “we
are called to make our language put on the stranger’s clothes at the same time as we
invite the stranger to step into the fabric of our own speech” (Kearney 2007, 151).
Because of this, translation between languages also involves a translation between
different visions of the world. A text is not only ‘German’ because of its grammar
and syntax, but because it embodies and circumscribes a culture and inhabits a
16 W. Reijers and Q. Dupont
Having surveyed the ethics of translation, we now consider how the activity of
translation is mediated by technology, or—in the words of Verbeek (2005)—we ask
what technology ‘does’ to translation. In short, machines transcribe notation, which
is a specific form of inscription or writing. However, notational writing is distinctive:
neither a form of language nor writing simpliciter. In this section we briefly discuss
what ‘translation’ machines are, how they have developed historically, and why they
use notation. We then discuss two enduring pitfalls of our common sense under-
standing of MT, that translation would capture the ‘presence’ of lived experience
(logocentrism) and would ‘translate’ the spoken word (phonocentrism).
The history of MT is illustrative of its scientific specificity. According to Hutchins (2006), the history of MT began exactly in 1933 (setting aside the much longer pre-history of perfect, universal, and philosophical languages; see, e.g., Eco 1995;
Markley 1993; or Slaughter 1982). Two patents for MT were issued simultaneously
in France and Russia to Georges Artsrouni and Petr Trojanskij, respectively.
Artsrouni’s patent described a general-purpose text machine that could also function
as a mechanical multilingual dictionary. Trojanskij’s patent, also basically a
mechanical dictionary, included coding and grammatical functions using ‘universal’
symbols. Importantly, like many of the earlier designs for ‘mechanical brains’
(Gardner 1958), Artsrouni and Trojanskij’s machines permuted alphabetic text.
But, unaware of these earlier designs, in 1946 the British crystallographer
Andrew Booth met Warren Weaver (director of the Rockefeller Foundation), to
discuss how their experiences of codebreaking during the war might apply to
MT. Over the next few years, the two men came to believe that MT would yield
similar successes if treated as a problem of code breaking, or cryptanalysis. In 1949,
Weaver distributed (and later published) a memorandum that introduced the idea of
MT to the scientific community. In this memorandum, Weaver included his corre-
spondence with Norbert Wiener, describing how one might apply the “powerful new
mechanised methods in cryptography” to MT and concluding that, when faced with
an unknown language, one might respond by presupposing that the text is “really
written in English, but it has been coded in some strange symbols” (Weaver 1955).
Thereafter, the task of MT was tantamount to code breaking.
We need not valorise or belabour the subsequent successes of this technoscientific
program. After several early attempts at ‘rationalist’ MT using grammatical rules,
Weaver’s codebreaking ethos returned. Formulated as a problem of code breaking,
MT thereafter was understood as a process by which computers calculate statistical
inferences across linguistic corpora (DuPont 2018). Unlike rationalist efforts of
translation that try to reconstruct meaning like a human translator, codebreaking
discovers hidden patterns of use through computational permutations of symbols.
The issue facing the rationalist is not whether computers ‘think’ as well as humans. In fact, in many cases computers ‘think’ much better than humans, capable of deep, uncanny insights. Rather, machines ‘think’ differently from humans
because they process notational symbols.
The most familiar notation for computing is the digital bit, first named by John Tukey. The bit is usually represented as the numerals 1 and 0 or ‘on’
and ‘off’ electrical states (and following Russell and Whitehead’s program of
logicism in the early 1900s, ‘true’ and ‘false’). Collections (or sets) of bits create
deterministic routines, or algorithms. According to Knuth (1997), algorithms are
essentially proof by mathematical induction.
While the bit is now our most familiar notation, it is joined by a long history of
notational inscriptions used in other domains of knowledge, such as Western
classical music’s staff notation, mathematical notation, and dance notation (the
most sophisticated of which is Labanotation). The purpose of these special marks
is to create meaning. To be sure, all forms of inscription, including painting, dia-
grams, and natural language create meaning. In the Order of Things, Foucault
describes the rich ‘semantic web’ of resemblance underlying inscription up to the
end of the sixteenth century, which “played a constructive role in the knowledge of
Western culture” (this includes scientia) (2002/1966, 20). The four ‘essential’
modalities of meaning in the human sciences are, according to Foucault,
convenientia, aemulatio, analogy, and sympathy (2002/1966, 20–26).
Thereafter, representing, which includes ordering, mathesis, and taxinomia,
comes to dominate the French ‘classical’ epistemé. The historical accuracy and
precision of Foucault’s account need not concern us here. Rather, we reflect on the
changing epistemé, from resemblance to representation: on how “the written word
and things no longer resemble one another”, and how “words have swallowed up
their own nature as signs” (2002/1966, 53–54). Foucault locates this fissure of the
human sciences in the work of Francis Bacon and René Descartes (two scholars
keenly involved in the longue durée of MT), finding a critique of resemblance “that
concerns, not the relations of order and equality between things, but the types of
mind and the forms of illusions to which they might be subject” (2002/1966, 57). To
put the point bluntly, MT is only possible in a representational epistemé.
Unlike human translation, which arguably works through chains of metaphoric
connotation (Eco 1994, 29), MT requires a much more rigid and stable semiotics. To
do ‘work,’ machines must transcribe notational inscriptions according to narrowly
prescribed rules. But in the infinite semiotic chain of translation, machines build
hidden layers of semiosis using a constitutive set of characters that are disjoint and
finitely differentiated. For each layer of meaning, these sets of characters are
correlated to a field of reference that is unambiguous, semantically disjoint, and
semantically finitely differentiated (see Goodman 1976). Thus, despite
accomplishing the task of translation, machines do not translate ‘words’—however
and variously inscribed in image, sound, video, or viva voce—but rather transcribe
symbolic ‘notations.’
The transformation implied in the activity of translation raises at least two
fundamental philosophical issues: (1) to what extent does MT transcribe the ‘pres-
ence’ of lived experience (logocentrism), and (2) to what extent does it transcribe the
spoken word associated with this lived experience (phonocentrism)? That is, do
words represent some deeper, more essential, reality? Peirce did not think so:
The meaning of a representation can be nothing but a representation. In fact it is nothing but
the representation itself conceived as stripped of irrelevant clothing. But this clothing never
can be completely stripped off: it is only changed for something more diaphanous.
(CP 1.339)
reducing one to the other. Heterogeneous elements are different visions of the world. Arguably, different languages represent different visions of the world, but such visions also differ within languages (O’Mathúna and Hunt 2019). Hospitality suggests that even
when communicating in one language, it is often necessary to ‘translate’ particular
social contexts. Translation offers an opportunity to move between different visions
of collective meaning making. Linguistic hospitality requires such a capacity, but
can machines possess such a thing?
To motivate an answer to this question, we recall Goodman’s concept of ‘ways of
world making’ (1978). Goodman argued that we understand reality through a
pluralist lens of different versions of the world that are irreducible to one another.
These versions of the world are constructed through procedures of ‘world making,’
such as composition, ordering, and deformation (see also Herman 2009, 77). Which
ways of world making are possible depends on the ‘affordances’ and ‘constraints’ of
the medium through which they are conducted. When the medium is a (notational)
translation machine, the procedures of world making are necessarily limited. Most
crucially, a machine, unlike a human translator, has no access to the experiential
context of a particular version of the world. Machines lack ‘experience’ of the world
within (and across) linguistic realms. A machine may transcribe notations from
source to target languages but it lacks the procedures necessary to embed the
constructed meaning in a context of lived experience.
Does this mean that MT is incapable of producing linguistic hospitality? Not
necessarily so. Insofar as the procedural limits of world making allow, machines can
‘translate’ between different worlds by composition and ordering procedures. However, machines will not have access to the full range of practices that linguistic
hospitality entails and moreover, such translations always evince semantic violence.
For instance, a machine may not be able to produce a ‘diplomatic’ tone because
diplomacy requires access to lived experience: tacit understanding of cultural cues
and preferences, body language, time perception, and all the sundry actions of
embodiment. A notationally correct and semantically serviceable translation will, in such cases, fail to cross the gap between the familiar and the foreign.
Third, we ask: to what extent does MT preserve the activity of translation as
virtuous work? Prima facie this seems impossible, for how could machines have
virtues that translators would and should possess? They certainly lack the ‘person-
ality’ usually associated with virtues. However, following MacIntyre (2007) we
argue that virtues lie primarily in practices (i.e., performances of a work). Returning
to the example explored above, a piano does not possess any virtues outside of its
context of use, but it is a necessary part of a performance and hence co-constitutes
relevant virtues, like Gould’s mastery. When practices—or performances—are
hybrids of humans and machines, they are symbiotic, as when, for instance,
human translators prepare texts (pre-editing) or correct translated results (post-
editing). To be sure, such forms of human-machine collaboration still carry many
risks, such as diminishing human creativity and introducing bias. Yet, at least in
principle, such practices could support professional virtues associated with translation. When humans are ‘out of the loop’ entirely, the risks increase further. Consider,
for instance, the challenges of preserving the virtues of translation in fully automated
political contexts, like the European Commission. What would the virtue of fidelity
mean in such a context, beyond the narrow notion of correctness?
It is difficult to respect professional virtues in MT because the practice is transmedial. By passing through a notational medium, we argue, MT risks subjecting the nuance of translation activity to a logic of commodification. As with other modes of production, works of translation have ‘commodity potential’ (Appadurai
2003, 82). Linguistic commodification occurs when meaning is made exchangeable,
which in turn, renders things (cars, apples, office buildings, but also translations)
universally substitutable. The layers of standardisation entailed in MT commodify
language in the service of global capitalism. As we discussed above, notational
writing is one of the oldest, most profound, and most impactful forms of
standardisation and universalisation. Notational writing offers a transmedial foun-
dation for the political economy of commodification par excellence.
However, translation work that generates a common understanding may even
resist commodification. Even though notational writing is subject to standardisation and universalisation, language also opens up the possibility of an infinitude of instantiations with finite means. This gives rise to the possibility that, like a work
of art, translation bears the mark of its maker and is in that sense a particular object
that resists universal exchangeability. Consider, for instance, how translations of the Iliad are sometimes incommensurable. This incommensurability does not lie in the
product itself (the translation) but in the practice of translation as work. In this
practice, the translator is disposed to make certain choices and thereby enact virtues,
most notably fidelity. In literary works, translators often choose to translate proper
names in such a way that they lie closer to the target language than to the source
language, which requires a considerable level of creativity. No two works of translation are the same, because no two translators are the same: each translator brings a particular character and worldview, which imbues the work with a style.
Characteristic styles may also resist commodification: a translation of poetry is distinct from real-time political translation, which is more standardised and rudimentary. Yet, in both cases professional virtues—perhaps unique to the style—are required.
With echoes of Arendt (1958), we find that MT signals the substitution of
mechanical labour—in terms of cyclical process—for work, in terms of durability.
MT does not extend the life of a work but continuously consumes and produces
‘meaning.’ Even when MT is deemed highly accurate, it dissolves the identity of a
work of art through its process—practice and production—that ends up being both
universalised and standardised and therefore eminently exchangeable. Through
this process, MT reifies the practice of translation, as a ‘given’ thing with ‘phantom
objectivity’ (Pitkin 1987, 265). In so doing, MT risks obscuring and covering up the
social relations that its practice gathers and mobilises. This problematic underlies the
political economy of MT and has led to the alienation of human translators and the erosion of their profession (cf. Moorkens 2017).
2.5 Conclusion
This chapter offers a critical introduction to the ethics of MT. First, we discussed the
ethics of translation by exploring three major themes: sacrifice, linguistic hospitality,
and the virtues of translation work as a professional activity. Second, in the spirit of
philosophy of technology, we asked what machines do to the activity of translation.
We argued that machines do not translate words, as human translators (arguably) do,
but rather transcribe notations. We then presented the enduring philosophical chal-
lenges of logocentrism and phonocentrism and identified performance as central to
MT. Third, we discussed how the limitations of MT became ethical questions. We
explicitly asked how MT can accommodate responsibility, to what extent it can
create linguistic hospitality, and how it affects the virtues of the work of translation.
Our approach opened the ethics of MT across different levels of granularity: at the
micro level of individual sensemaking, the meso level of mediation between oneself
and another, and the macro level of the political economy of MT.
Acknowledgments One author (Reijers) was funded by the European Research Council (ERC)
under the European Union’s Horizon 2020 Research and Innovation Programme (Grant Agreement
No. 865856).
References
Appadurai A (2003) Commodities and the politics of value. In: Pearce S (ed) Interpreting objects
and collections. Routledge, London
Arendt H (1958) The human condition, vol 24. University of Chicago Press, Chicago, IL. https://
doi.org/10.2307/2089589
Benjamin W (2012) The translator’s task. TTR: Trad, Terminol, Réd 10(2):151. https://doi.org/10.7202/037302ar
Bottone A (2013) Translation and justice in Paul Ricoeur. In: Foran L (ed) Translation and
philosophy. Peter Lang, Oxford
Cáceres Würsig I (2017) Interpreters in history: a reflection on the question of loyalty. In: Valero-
Garcés C, Tipton R (eds) Ideology, ethics and policy development in public service interpreting
and translation, vol 1. Multilingual Matters, Bristol; Blue Ridge Summit, pp 3–20
Derrida J (1998) Of grammatology (trans: Spivak GC, corrected edn). Johns Hopkins University Press, Baltimore, MD; London
Dewey J, Tufts J (1909) Ethics. Henry Holt and Company, New York, NY
DuPont Q (2018) The cryptological origins of machine translation, from al-Kindi to Weaver. Amodern 8. https://amodern.net/article/cryptological-origins-machine-translation/
Eco U (1994) The limits of interpretation. Indiana University Press, Bloomington, IN
Eco U (1995) The search for the perfect language. Blackwell, Cambridge, MA. Translated by James
Fentress
Feenberg A (2010) Ten paradoxes of technology. Techne 14(1):3–15
Fodor JA (1975) The language of thought. Thomas Y. Crowell, New York, NY
Foucault M (2002/1966) Order of things. Routledge, New York, NY
Friedman B, Kahn P, Borning A (2002) Value sensitive design: theory and methods. Univ
Washington Tech Rep 2(12):1–8
Gardner M (1958) Logic machines and diagrams. McGraw-Hill, New York, NY
Vieira LN, O’Hagan M, O’Sullivan C (2020) Understanding the societal impacts of machine
translation: a critical review of the literature on medical and legal use cases. Inf Commun Soc
24(11):1515–1532. https://doi.org/10.1080/1369118x.2020.1776370
Weaver W (1955) Translation. In: Machine translation of languages. Technology Press of Massa-
chusetts Institute of Technology, Boston, MA
Winner L (1980) Do artifacts have politics? Daedalus 109(1):121–136. https://doi.org/10.2307/20024652
Zielinski S (2006) Deep time of the media: toward an archeology of hearing and seeing by technical
means. MIT Press, Boston, MA
Chapter 3
The Ethics of Machine Translation
Alexandros Nousias
Abstract Language technologies are gradually turning into key modalities of our algorithmic present and future. Real-world texts embed patterns, and patterns hack natural meaning, following the language structure, the word ordering, and the underlying values and perceptions of a given semantic agent. Understanding the logic behind
computational linguistics and meaning formulation provides the necessary theoretical substrate for ethical screening and reasoning. The present paper aims to map the components and functionalities of an arbitrary linguistic system, tracking all stages of semantic rendering throughout the semantic cycle and providing recommendations
for ethical design optimisation. We attempt to articulate the information space as a
transition process of natural and/or non-natural phenomena into subjective informa-
tion models, driven by subjective human experience and worldview: To realise the
information space as a dynamic space, full of arbitrary features, where the informa-
tion maker and the information receiver both affect/influence and are affected/
influenced by the system. This article is not a practical toolkit for applying ethics and broader social considerations to Natural Language Processing at large, or to Machine Translation specifically; it is rather a contribution to shaping the appropriate framework within which Language Technology (LT) stakeholders may identify, analyse, understand and appropriately manage issues of ethical concern. It is a plea for multimodal computational linguistic
management, via interdisciplinary groups of professionals, in order to optimise the
research around the impact of computational linguistics on human rights, language
diversity, self-identity and social behaviour. Language technologies need to articu-
late complex meanings and subtle semantic divergences. New external metrics need
to be added, based on legal, ethical, social, anthropological grounds. This is a
necessary puzzle to solve, in order not to discount future structural rewards.
A. Nousias (✉)
Present Address: National Centre for Scientific Research – Demokritos, Athens, Greece
e-mail: [email protected]
3.1 Introduction
1 Narayanan on Twitter: https://twitter.com/random_walker/status/993866661852864512.
Language functionality derives from symbols. Each term, word, etc., is a symbol fit for purpose, and language functionality is the output of an arbitrary juxtaposition of symbols so as to produce meaningful information. Symbols “denote” things, namely all
natural and non-natural phenomena in language through coherent semantics. In the
theory of meaning, semiotics discerns the signifier from the signified. Aristotle
identifies the problematic of meaning in language as functionality and consistency
(Avgelis 2014 as included in Andriopoulos 2021). Symbols do not cite natural or
non-natural phenomena directly, but through meaning derived from consciousness,
experience, culture, information and the knowledge base of a given epistemic agent.
In other words, symbols, as expressed in different spoken languages, dialects at
different spaces and at different times, do not signify the phenomena per se, rather
views, perspectives, norms and values across these varying parameters. These are semantic nuances that need to be annotated and modelled accordingly so as to
appropriate and contextualise symbols as expressed across languages, groups,
space and time. Values Across Space and Time,2 a project funded under the European Union’s Horizon 2020 research and innovation programme, envisions capturing different semantic states of that sort, focusing on values through advanced modelling, methods, techniques and digital tools that enable the collaborative study, annotation and continuous capture of experience. Just as with value appropriation, we believe that a corresponding semantic appropriation of symbols is imperative.
Back to the task of symbol appropriation, we identify a semantic cluster of three
interconnected nodes: (a) the symbols per se; (b) the expressions of the soul as
shaped by consciousness, experience, culture, information and the knowledge base
which in turn shape how phenomena are expressed; and (c) the phenomena them-
selves which exist prior to or independently from the language. This makes ques-
tionable the adequacy or consistency of the symbol (or word)/thing ratio of
language. Aristotle (in his De Interpretatione) seems to be navigating within such
polarity:
1. Symbols represent things or a type of data, words reflect realities or a type of
information and sentences provide the state of these realities or—ground—
knowledge;
2. Linguistic expressions are meanings specific within an intersubjective
framework;
Evidently, both theses have an ontological character, as they represent a universe of things, their relations, and an implied or expressed set of rules thereof, thus providing room for further inferences. The relation between a word and a thing ceases its linear course as things are replaced by thought, transforming it (the relationship) into a complex and dynamic system in a state of informational modelling: a mental, creative and intersubjective process that extends the is to the could or should, flirting with the idea of naturalism. It is not by coincidence that Plato (in the
Cratylus) understands ‘logos’ (λóγoς) as a mix of language and intellect with the
2 https://www.vast-project.eu/vision/.
‘nous’. Therefore, logos in Greek stands for both intellect and language. Language
turns out to be the reflection of the relations between ideas. Language, reality and
ideas are absolutely discernible components of meaning generation. The essence of
language lies in depicting the dialectic relation between the ideas.
A very straightforward question that arises in this complex system of meaning is
whether language mirrors the world or generates meaning via an ‘intersubjectivity’
reached by ‘convention’. When using the term ‘intersubjectivity by convention’,
Aristotle refers to the contingent ability of varying interlocutors or semantic agents to understand the meaning a hypothetical semantic agent A is conveying. The essence of meaning lies not in the image of a thing, a fact or a phenomenon, but rather in the intersubjective function of language and its underlying semantics.
Linguistics and the communication thereof are about delivering/receiving meaning
or providing a statement about something. It is not about naming a thing. Such
intersubjectivity requires conventional convergence, a consensus on the basics. At this stage, such consensus lies in the variant prospects of language and logos. When
communicating, we aim at sending/receiving meaning, via a given set of linguistic
symbols, the syntax and grammar thereof. Meaning does not stand for a transcending
entity referenced by language; rather it is the output of a constructive process.
Meaning is conveyed through language from the moment we reach consensus in
regard to the infinite informatic capabilities of logos.
A clear distinction between naturalism and convention may prove helpful at this
point. There might be a conflict between language as a reality and language as an
informatic convention. The semantic function of language (as logos) cannot be
articulated under a symbol/thing ratio, as the latter allows variable interpretations,
depending on the approach, the interests, and the perceptions of the observer. How
we choose the set-up of logos is a constructive process that shapes the language used,
which in turn delivers the information communicated, which in turn conveys a set
meaning accordingly. Remixing Hume’s views in regard to natural law theory (Wacks 2006) with the problematic of meaning analysed herein, the meaning conveyed by the selection of a specific set of language, in a set syntax and grammar, is neither equal to what is nor to what ought to be. It is neither a reflection of reality, a mirroring or portraying, nor an output of semantic moral reasoning. As Floridi (2019)
nicely says, “the transformation of the world into information is more like cooking:
the dish does not represent the ingredients, it uses them to make something else out
of them, yet the reality of the dish and its properties hugely depend on the reality and
the properties of the ingredients”. I would add that the transformation of the world
into information is carried out via the logos and generation of meaning. In other
words, it is a shift from representation to interpretation through perception. Alter-
natively, in an analogy to Bentham’s critique of natural law (Commentaries 1765 as
referenced in Wacks 2006), meaning generation through language formulation is or
could be a ‘private opinion in disguise’, something somewhat ‘censorial’ (Bentham
Manuscripts, University College, London Library). After all, the actual rule of law is
language with reason in an already established ethical context.
This process of linguistic creativity calls for ethical assessment and analysis. The implications of the conveyed meaning for the recipient and society at large also claim our attention and ethical reasoning, as they urge the semantic agent (the information sender) to step back in order to reassess the constructive process of the earlier meaning generation. Add multilingualism and the
embedded hindrances, constraints and compromises into the mix and what you get is
a set of subsequent interfaces comprised of variant points of interest through which
humans access reality. Each interface carries its own requirements, design, logic,
ethics and implications thereof. It develops in its own contextual framework serving
the very purpose of its user, the semantic agent, as formulated by their perception.
initial query. In our example, the returned results would include films instead of cities (namely, The Wizard of Oz), directing the semantic output to something totally different and more (or less) relevant.
The LoA qualifies the level at which a system is considered by the selection of its
observables. An interface is a set of LoAs under which a given system is analysed.
Models are the outcome of the analysis of a system developed at some LoAs. The
way a semantic agent models a system is subject to the selection of its observables
and/or the selection of its LoAs, that include different types of observables. Such
selection affects the language model and the semantic output. To put it in context,
let’s reflect on the nurse/doctor example, inspired by Stanovsky et al. (2019) and
assume that agent A is analysing texts related to health services at the LoA of
HUMAN CAPITAL; in these texts, the observables for NURSE consist of values
related to females (e.g. female names) while the corresponding values for DOCTOR
are associated with male persons. Apparently, the idea of LoAs commits us to the existence of some specific types of observables (nurse: female, doctor: male), qualifying the system (that is, health services) accordingly. Add observables regarding, e.g., OUTER LOOKS and you get the ‘pretty’ and ‘handsome’ values directly
(and undesirably) linked with the profession parameters in the model we examine.
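A minimal sketch of this idea, under the assumptions of the running example: a LoA is a set of observables, and the model a semantic agent builds depends on which LoAs are selected. The names and values below (HUMAN CAPITAL, OUTER LOOKS, nurse/doctor) come from the example above; the helper function is purely illustrative.

```python
# Illustrative only: a Level of Abstraction (LoA) modelled as a mapping from
# observables to the (biased) values found in the annotated texts.
HUMAN_CAPITAL = {"NURSE": "female", "DOCTOR": "male"}
OUTER_LOOKS = {"NURSE": "pretty", "DOCTOR": "handsome"}

def build_model(system, *loas):
    """Analysing a system at a selection of LoAs yields a model:
    each observable accumulates the values the chosen LoAs commit us to."""
    model = {"system": system}
    for loa in loas:
        for observable, value in loa.items():
            model.setdefault(observable, set()).add(value)
    return model

# A different selection of LoAs yields a different (more or less biased) model.
narrow = build_model("health services", HUMAN_CAPITAL)
wide = build_model("health services", HUMAN_CAPITAL, OUTER_LOOKS)
```

Here `wide["NURSE"]` contains both ‘female’ and ‘pretty’, mirroring how adding the OUTER LOOKS observables links appearance values to the profession parameters.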
This could be highly problematic for an annotated corpus and the MT model where,
by design, the system will exhibit distortions that ‘corrupt’ the receiver, the end user,
who consumes the semantic output. Google Translate traditionally provided only one translation for a query, even if the translation could have either a feminine or a masculine form (Kuczmarski 2018). Google started tackling this malfunction by
offering the end user gender-specific translations for some gender-neutral words like
the English “nurse”, which can be translated into French both as “infirmier” (mas-
culine) and as “infirmière” (feminine). Alternatively, the user is allowed at times to
choose between gendered translations. However, malfunctions of that sort are
distributed across the NLP domain and the role of information ethics in this context
may provide insights under information modelling via a given set of LoAs based on
the primes of information ethics.
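The remedy described above can be sketched as a lookup that surfaces every gendered alternative instead of silently committing to one form. The table, function name and fallback behaviour below are purely illustrative and do not reflect Google's actual implementation:

```python
# Sketch: surface all gendered alternatives for a gender-neutral source word,
# rather than defaulting to one form (data and names are illustrative).
GENDERED_ALTERNATIVES = {
    ("en", "fr", "nurse"): {"feminine": "infirmière", "masculine": "infirmier"},
    ("en", "fr", "doctor"): {"feminine": "docteure", "masculine": "docteur"},
}

def translate(word, src="en", tgt="fr"):
    """Return a dict of gendered translations, or a single-entry dict
    when no gender ambiguity is recorded for this word pair."""
    alternatives = GENDERED_ALTERNATIVES.get((src, tgt, word))
    if alternatives:
        return alternatives           # let the end user choose the gender
    return {"unspecified": word}      # placeholder fallback behaviour

print(translate("nurse"))  # {'feminine': 'infirmière', 'masculine': 'infirmier'}
```

The point of the sketch is structural: the model's output type changes from a single string to a set of alternatives, so the gender choice is pushed back to the receiver instead of being baked into the observables.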
Similarly, in a different context, training Machine Learning (ML) models on
annotated language data in which dialectal differences have not been considered,
and then deploying them in automatic hate speech detection systems without, again,
putting the appropriate LoAs into perspective, may result in dialectal expressions
being wrongly associated with hate markers, thus potentially amplifying
harm against minority populations. Sap et al. (2019) uncover unexpected
correlations between surface markers of African American English (AAE) and ratings of
toxicity in several widely used hate speech datasets. Then, they show that models
trained on these corpora acquire and propagate these biases, such that AAE tweets
and tweets by self-identified African Americans are up to two times more likely to be
labelled as offensive compared to others. The researchers provide a two-step meth-
odology. They first empirically characterise the racial bias present in several widely
used Twitter corpora annotated for toxic content and quantify the propagation of this
bias through models trained on them. They establish strong associations between
AAE markers (e.g., “n*ggas”, “ass”) and toxicity annotations, and show that models
36 A. Nousias
acquire and replicate this bias: in other corpora, tweets inferred to be in AAE and
tweets from self-identifying African American users are more likely to be classified
as offensive. They then conduct an annotation study, where they introduce a way of
mitigating annotator bias through dialect and race priming. We focus on dialect, as
further explained.
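A minimal sketch of the kind of audit Sap et al. (2019) perform, comparing how often a classifier flags content as offensive per dialect group, might look as follows. The predictions and group labels are toy examples invented here, not their dataset:

```python
from collections import defaultdict

def flag_rate_by_dialect(predictions):
    """predictions: iterable of (dialect, flagged_offensive) pairs.
    Returns the fraction of items flagged as offensive per dialect group."""
    counts = defaultdict(lambda: [0, 0])   # dialect -> [flagged, total]
    for dialect, flagged in predictions:
        counts[dialect][0] += int(flagged)
        counts[dialect][1] += 1
    return {d: flagged / total for d, (flagged, total) in counts.items()}

# Illustrative toy data: the AAE group is flagged twice as often.
preds = ([("aae", True)] * 4 + [("aae", False)] * 6
         + [("other", True)] * 2 + [("other", False)] * 8)
print(flag_rate_by_dialect(preds))  # {'aae': 0.4, 'other': 0.2}
```

A gap between the two rates, at equal underlying offensiveness, is exactly the disparity the annotation study then tries to reduce through dialect priming.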
In this example we could assume that annotators engage in an analysis at a
supposed LoA 'OFFENSIVE LANGUAGE' consisting of observables for
SALUTATIONS with values like "wussup n*gga!" and "what's up bro!", REFERENCES
with values like "I saw him yesterday" and "I saw his ass yesterday", and
ETHNICITY with values like black male/female, white male/female. The annotators annotate
the given digital interactions, on the LoA OFFENSIVE LANGUAGE. This LoA
marks a target dialect in the context of hate speech as highly toxic, thus corrupting
the receivers of the said cognitive output. Switching from the LoA OFFENSIVE
LANGUAGE to the LoA DIALECT, you get an enhanced understanding of the
subjectivity of offensive language, as you extend the analysis from a different standpoint
of variable properties and values (please refer to the Emerald City example
mentioned earlier). What you get, as the study shows, is that, when annotators are
primed with dialect, AAE tweets are significantly less likely to be labelled as
offensive to anyone. What matters here in the analysis of
hate speech is the occurrence and flow of information, that is the Twitter corpora at a
different LoA, namely DIALECT, where the model returns more relevant results.
Different types of analysis lead to different models that may produce more accurate
and more ethically aligned outcomes. The key is the information qualities at hand.
LoAs allow us to navigate among these systemic bugs on a better-oriented course,
as they provide insight into the systems' conceptual interface, as well as the system
maker's perception, like, for example, the annotator in the race bias use case. The
model of Google Translate, on the other hand, replicated at first instance a function
of the available observables, based on modelling specifications like consistency
with the data, background informativeness, the social conception of given stereotypes,
etc. Information Ethics or (for the economics of the present work) a branch of ethics
in LTs could make use of the LoA in order to: (a) identify whether a system is
assessed at the appropriate level; and (b) explore possible complex connections,
clusters or overlaps among various LoAs in a given context. This is clearly
illustrated by the example provided in Floridi (2013, 2019), which additionally highlights
the importance of LoAs in the information ethics modelling process: "In November
1999, NASA lost the $125m Mars Climate Orbiter (MCO) because the Lockheed
Martin engineering team used English (also known as Imperial) units of
measurement, while the agency team used the metric system. As a result, the MCO crashed
into Mars." LoAs are not equal to context: they are applied within contexts to
articulate semantic implications for a purpose, in our case ethical screening, in a
more consistent and straightforward fashion. Ultimately, LoAs may disclose latent
tensions between the maker and the user, the sender and the receiver, between
information construction and information consumption.
3 The Ethics of Machine Translation 37
A complex and dynamic space, like the information space, requires a very delicate
set of skills and well-designed screening tools that may allow its inhabitants to truly
understand its functional design, the logic that lies behind and the ethics thereof.
Having these properties activated, the system interpreter may tackle or, at the least,
mitigate contingent malfunctions like bias, propaganda, hate speech, semantic
misrepresentation, trolling or, at a more esoteric level, confusion, overwhelm and
misconception, issues that we touch upon in the case studies below. But still we, the
end (semantic) users, fail to perceive our presence in the digital space and the
occurring meaning formulation system, as mediated by technology. The same
analogy can be used in the language/meaning ratio, where a message receiver
perceives subjective language as objective reality. In regard to MT applications,
we perceive MT outcomes as lacking “understanding of language nuances” or
“understanding of meaning hidden in linguistic utterances” in the background.
This means that, when we have the impression that a doctor is by default male and a
nurse by default female, or that African American English inclines towards hate
speech regardless of the dialect's specific semantic features, we fail to realise that
this perception is actually model-mediated: the model is trained under a not-so-accurate
annotation schema on a training dataset with unequal language or social group
representations, thus requiring semantic filtering and model repackaging. Below we examine
three complementary case studies, denoting diverse hidden implications of different
MT-related systems, articulating the influence of the input data and parameters set by
the information makers and the implications of each given choice to the information
receivers.
Semantic mismatch is related to the diversity and context of the input data as offered
in different languages. Human Language Technology (HLT) refers to the production
of technologies that seek to understand and reproduce human language (Sveinsdottir
et al. 2020). All these
technologies produce tools that are used in a range of fields, e.g., communication,
health and education, and can significantly improve people's quality of life (Meta-
Net). LTs have been developed mostly for high-resource languages, meaning
languages captured in a large number of digital resources (e.g. datasets, lexica,
ontologies, terminologies, etc.). Such languages are English, German, Spanish,
Arabic and Mandarin Chinese.3 Putting aside about 30 languages with a relatively
satisfactory number of resources, the remaining language capital lies in the digital
language resource margins; in other words, it lacks digital resources
3 For Europe's Languages in the Digital Age, please consult the Meta-Net White Paper series available at: http://www.meta-net.eu/whitepapers/overview.
in quantity and, most likely, also in quality, if it exists at all. Moreover, the language
resource ecosystem looks even poorer when we add dialects into the picture, as their
digital resources are almost non-existent. The features of low-resource languages are:
• Lack of linguistic expertise;
• Lack of a unique writing system or stable orthography;
• Limited presence on the web; and
• Lack of electronic resources for speech and language processing, such as mono-
lingual corpora, bilingual electronic dictionaries, transcribed speech data, pro-
nunciation dictionaries, and vocabulary lists (Krauwer 2003).
The shortage of language training data (text, audio and other media corpora) calls
into question the overall NLP application development framework, as it impedes the
digital presence and well-being of specific groups, with very negative social
implications.
In Sect. 3.1, we identified that meaning derives from a semantic cluster of three
interconnected nodes: (a) the symbols per se, (b) the expressions of the soul as
shaped by consciousness, experience, culture, information and the knowledge base
which in turn shapes how phenomena are expressed, and (c) the phenomena
themselves which exist prior to or independently from the language. In the context
of HLT, point (c) refers among others to data representing human attitudes and
behaviours. Indeed, we concluded that meaning formulation derives from a con-
structive process of combining, interpreting and repurposing phenomena-as-data and
not simply mirroring or portraying them (the phenomena). This is an invariably
necessary component for the evolving information models including those devel-
oped for LTs. What happens, however, when human attitudes, behaviours and
languages are not equally represented in the massive ‘artificialisation’ currently in
the making? Take social media data, for example, which provides access to
behavioural data at an unprecedented scale and granularity. However, using these
data to understand phenomena in a broader population is rather difficult due to their
unrepresentativeness and the bias of statistical inference tools towards dominant
languages and groups (Wang et al. 2019). While demographic attribute inference
could be used to mitigate such bias, current techniques are almost entirely monolin-
gual and fail to work in a global environment (Wang et al. 2019). This provokes
information reduction, altering the state of phenomena in both normative and
semantic sense, with further implications for the receiver and the society at large.
From an ethics perspective, what is important is to understand that such reduction
and consequent normative and semantic alteration is a non-natural construct that
needs to be addressed and assessed as such. In practice this requires (a) creating a
dataset representative of the types of diversity within languages; and (b) explicitly
modelling multilingual and code-switched communication for arbitrary language
pairs (Jurgens et al. 2017). This is not solely a data science task but a broader
interdisciplinary one, in which the establishment and further development of an
interoperable ecosystem of AI and LT platforms (Rehm et al. 2020) that allows
appropriate modelling is imperative. A model design needs to be aware of and subject to the
applicable system parameters that do or may affect language, like general social
and economic factors, speaker background (birthplace, place where they live, edu-
cational level, cultural beliefs), situation in which an utterance is uttered (formal,
informal, setting, other participants), etc. It is the context that matters, not the words;
and context is a complex system.
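The representativeness requirement in point (a) above can be illustrated with a simple audit that flags groups whose share in a corpus falls well below their share in a reference population. The threshold, counts and group labels below are hypothetical:

```python
def underrepresented(corpus_counts, population_shares, tolerance=0.5):
    """Flag groups whose share of the corpus falls below
    `tolerance` times their share in the reference population."""
    total = sum(corpus_counts.values())
    flagged = []
    for group, pop_share in population_shares.items():
        corpus_share = corpus_counts.get(group, 0) / total
        if corpus_share < tolerance * pop_share:
            flagged.append(group)
    return flagged

counts = {"en": 800, "es": 150, "aae": 50}     # tokens per variety (toy data)
shares = {"en": 0.6, "es": 0.25, "aae": 0.15}  # reference population shares (toy)
print(underrepresented(counts, shares))        # ['aae']
```

Such a check is deliberately crude: it says nothing about quality or dialectal depth, but it makes the non-natural character of the reduction visible before a model is trained on the data.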
A second problem with the semantic mismatches is the way in which a semantic
output is perceived. Digital rating illustrates this dimension at large. The number and
quality of user reviews greatly affects consumer purchasing decisions (Hale and
Eleta 2017). While reviews in all languages are increasing, it is still often the case
(especially for non-English speakers) that there are only a few reviews in a person’s
first language. Using an online experiment, Hale (2016) and Hale and Eleta (2017)
examined the value that potential purchasers receive when reading reviews written in
languages other than their native (first) language. For instance, English native
speakers may read reviews of the same product written, for example, in French or
German by users who are native speakers of French or German, respectively. But this
seems not always to be the case. The fundamental question Hale poses is whether
reviews in different languages are analytically similar to each other. It appears that
speakers of different languages focus on different aspects, evaluate products differ-
ently, and/or have consistently different experiences (e.g., different
internationalisation/localisation choices for software or different information avail-
able for in-person activities), thus the reviews from one language may have less
relevance to individuals primarily speaking a different language (Hale 2016; Hale
and Eleta 2017). Hale concluded, for instance, that reviews in German, Norwegian,
and French are more strongly correlated with reviews in other languages than are
reviews in Japanese, Portuguese, or Russian, without however, explicitly defining
the specific language correlations thereof. Thus, the usefulness that users derive from
reviews in other languages likely varies, with languages of the same family offering
more relevant reviews. The correlations between pairs of languages suggest that reviews from some
languages will be closer to the experience of a person speaking a correlated (but still
not explicitly defined) language, than reviews from other languages. This may be
due to underlying elements of culture that are captured by the language(s) of a
person. This is where the language/meaning dichotomy strikes again, we may claim.
Hale’s (2016) and Hale and Eleta’s (2017) findings reveal that consumers (aka
information receivers) most value reviews in their first language. If so, the practice
of creating an average rating from reviews in multiple languages could be unhelpful
or even misleading, Hale concludes.
One way of tackling such outcomes would be “calculating the correlations
between languages and countries”, and thus deploying a MT design towards more
appropriate and adoptable reviews based on both linguistic and conceptual compo-
nents. For example, in terms of a multilingual product/service review, native English
speakers, as the semantic agents with the lowest level of bilingualism, seem to be
less tolerant of foreign comments and reviews and closer to specific Anglo-American
ones, while native speakers of smaller sized languages derive more value from
foreign language reviews as they are more frequent bilinguals. Following this
finding, in the event we address information (the multilingual review in our example)
as a resource, it stands to reason to argue that such a resource may be able to define
behaviour, as behaviour is defined by information. As a matter of fact, such
information-as-a-resource covers a multitude of ethical issues like liability, testi-
mony, advertising, misinformation, deception, censorship etc., thus “literally
re-ontologizing our world" (Floridi 2013). In our example, reviews in languages
that lack correlation with reviews from other languages may drive fewer clicks on
the translation button by native English speakers, and thus less market
research, lower trust levels for our target product/service and possibly (at a highly
hypothetical level) fewer purchases accordingly. At such a macro-ethical level, ethics
screening of paradigms of that sort becomes imperative.
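The suggestion of "calculating the correlations between languages" can be sketched with plain Pearson correlations over per-language rating series, so that a reader is shown ratings from languages correlated with their own rather than one pooled global average. The ratings below are invented for illustration:

```python
def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length rating series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy ratings of the same five products, aggregated per review language.
ratings = {
    "de": [4.5, 3.0, 5.0, 2.0, 4.0],
    "fr": [4.0, 3.5, 5.0, 2.5, 4.0],   # tracks the German series closely
    "ja": [3.0, 4.5, 2.5, 4.0, 3.0],   # diverges from it
}
print(round(pearson(ratings["de"], ratings["fr"]), 2))  # strongly positive
print(round(pearson(ratings["de"], ratings["ja"]), 2))  # negative
```

A rating widget could then weight or select source languages by such correlations instead of averaging everything, which is precisely the design intervention the finding calls for.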
Let us now switch to the medical sector and assume we are seeking evidence for the
safety and effectiveness of a drug or vaccine in order to provide a proof of concept
for the population at large. Due to conspiracy theories, lack of trust, distress and
inadequate data, we look for evidence outside the controlled settings of the clinical
trials and the sponsor’s semantic primes. Sounds disturbingly familiar, doesn’t it?4
Such real-world evidence refers to textual resources, randomly spread in socially
related domains, like social media or patient/doctor forums, where subjective assess-
ments of the target subjects (namely doctors or patients) on the efficiency and/or
added value of the drug or vaccine are harvested, contextualised, further analysed
(on certain LoAs) and probably reshaped or repurposed. Such content is typically
available as unstructured natural language text in multiple languages. Access to
better solutions for extracting evidence from multilingual content would have an
important impact in this case, as it would allow information receivers to derive
crucial insights or deliver a message either fact-checked or semantically derived out
of thin air. In practice, a combination of ML and NLP applications, like MT, needs to
be applied in order to convert this raw pharma-centric text into meaningful
multilingual information. In other words, a range of models generated at a range of LoAs
will be required for ethical screening in context. Such a semantic design requires an
innovative data Extraction-Transformation-Loading process. We believe that this
process should be based on linked data principles and methods given that they can
take advantage of the links between data and annotation schemas, developed out of
the right mix of domains and parameters.
4 In the context of this paper, from now on we refer to the COVID-19 era.
A pilot experiment showcasing the above has been conducted in the context of the
Prêt-à-LLOD project, funded by the European Union's Horizon 2020 research and
innovation programme.5 The Prêt-à-LLOD project focuses on Linguistic Linked
Open Data and caters for the development of multilingual resources intended to
support language transfer in various types of NLP systems and models. The pilot
experiment seeks novel pharmaceutical applications based on real-world evidence.
The pilot operates in a selection of contexts that include ENTITY RECOGNITION,
RELATION EXTRACTION, SENTIMENT ANALYSIS and EMOTION
ANALYSIS. Each of them is subsequently analysed and modelled at a predefined set
of LoAs for a predefined purpose. From an ethical perspective, the interest lies in
three main topics: (1) the qualities of real-world evidence (e.g., accuracy,
representativeness, completeness); (2) identifying and following the semantic
transitions of real-world evidence throughout the semantic cycle (translation phase
included); and (3) the semantic output. Let us focus on the second topic of interest.
Such a use case typically involves one or more of the following steps:
• The manual annotation of a corpus;
• The development of ontologies, lexica and terminologies to be used in this
annotation, the formulation of lexico-syntactic rules for the processing of the
corpus; and
• The training of ML models based on the annotated data.
Annotators and pharma-agnostic language engineers would collaborate in this
scenario.
The pilot, however, aims to formulate a framework of configurable language
transfer pipelines enabled by the capabilities to discover, transform and compose
language resources developed within the Prêt-à-LLOD project, reducing the need for
bespoke engineering of the processing pipeline. The ethical rigour lies ahead: as the
semantic output makes its way through the channel and communication process to
the multilingual receiver, it risks overgeneralisation, confirmation bias, implicit bias
and topic under- and overexposure. As such, the inclusion of social science, law and
ethics in the process is warranted for a number of reasons. Indicatively, when applications
need more data, random human experiences, opinions, beliefs and preferences are
extracted, harvested and rendered into behavioural data in pursuit of further 'emotion
scanning', as, e.g., word embeddings offer fertile ground for sociological analysis,
reaching at times the level of individuals.
Until recently, NLP mostly involved anonymous corpora, with the goal of
enriching linguistic analysis, and was therefore unlikely to raise ethical concerns.
Adda et al. (2014) touch upon these issues under "traceability" (i.e., whether
individuals can be identified): this is undesirable for experimental subjects, but
might be useful in the case of annotators, in the event that the annotation features are
ontologically poor and further documentation is required. The public outcry over the
“emotional contagion” experiment on Facebook (Kramer et al. 2014; Selinger and
5 https://pret-a-llod.github.io/.
Hartzog 2016) further suggests that data sciences now affect human subjects in real
time, and the LT domain needs to seriously reconsider the application of ethical
considerations to the research involved (Puschmann and Bozdag 2014). This brings
us to the need to assess the maker’s view in meaning formulation and linking.
Section 3.5 elaborates on a couple of case studies illustrative of the subjectivism,
the loopholes and the ethics gap at the design stage.
In principle, our perception of the world is either direct, coming from our 'first
hand testimony of the senses' (Floridi 2019), or indirect, that is, 'second hand
perception by proxy': information properly interpreted and assigned a meaning
by a third party, a testimony. Semantic agents transfer information as perception to
each other, blurring the boundaries between 'empirically knowing' and 'merely
being informed'. This brings us to a fundamental distinction of meaning as natural
meaning vs. non-natural or conventional one, a distinction that may allow us to
better understand and ethically assess use cases like ratings and real-world evidence
and how MT and further NLP applications are involved. Add into the mix the
various collaborative design processes of a given system, like Linguistic Linked
Data, and semantics turns into an absolutely non-natural, complex and dynamic
phenomenon where, for proper ethical assessment, different LoAs not visible at first
sight become interpretation modules of added value. What do ratings tell us about product
x, service y, destination z, etc.? How did these ratings occur and what semantics did
they convey when initially processed and further translated? Can we identify,
evaluate and validate all data processing activities from input to output? These
kinds of questions lead to a basic conclusion: the artificial nature of design
intervention. Such intervention is evident as a series of tech-driven processing
activities over perception- or testimony-mediated data, which in turn lead to data and
further semantic hacking, served to end users (the message receivers) as reality.
Below, we examine some use cases and analyse the ethics of the design in informa-
tion formulation from the maker’s (as both designer and messenger) perspective.
The use cases around reviews and the articulated perception of real-world evidence
and testimony, as background modalities of information formulation, derive from
two basic information sources: (a) corpora, which are collections of language data
formulated in a certain context, at a defined LoA, for a set purpose; and (b) lexicons
that focus on the general meaning of words and the structure of semantic concepts in
a certain context, at a defined LoA, for a set purpose. An overall system infrastruc-
ture is complemented by metadata and typological databases that describe features of
individual languages and/or dialects (Cimiano et al. 2020). Typically, however,
corpora and lexicons in given contexts, at defined LoAs, for set purposes, remain
unconnected. Linking dictionaries and corpora at the level of
meaning is a crucial component in order to be able to derive new content (and
meaning as well), either within the same language or across multiple languages, or to
enrich currently available data. Pilot II of the project focuses on devising and
developing technology for interlinking lexical knowledge to other lexical resources
or encyclopaedic resources, such as WordNet or Wikidata (McCrae and Cillessen
2021), in order to facilitate rapid integration of lexicographic resources and allow for
wider application of these types of data for companies in the LT area, such as
multilingual search, cross-lingual document retrieval, domain adaptation, and lexical
translation. In particular, the goal of this pilot is to explore and use state-of-the-art
methods and techniques in computational semantics, data mining and ML for linking
language data at the level of meaning via available multilingual lexical content
interlinking, the creation of new lexical content from current datasets, the enrichment
of corpora etc. (McCrae et al. 2019). Relations between lexical elements can be
found in three levels:
• Lexical relations relate the surface forms of a word, e.g. to represent etymology
and derivation.
• Sense relations relate the meanings of two words, e.g. to express that two senses
are translations, synonyms or antonyms of each other.
• Conceptual relations relate concepts regardless of their lexicalisation. Examples
of such conceptual relations are the hypernymy or meronymy relations.
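The three relation levels can be represented as typed edges in a small lexical graph. The schema and entries below are illustrative only and do not reproduce the actual Prêt-à-LLOD or OntoLex data model:

```python
# A tiny typed lexical graph distinguishing the three relation levels
# described above (entries and schema are illustrative).
LEXICAL, SENSE, CONCEPTUAL = "lexical", "sense", "conceptual"

relations = [
    ("laugh", "laughter", LEXICAL, "derivation"),        # surface forms
    ("laugh", "rire", SENSE, "translation"),             # meanings (en -> fr)
    ("giggle", "laugh", SENSE, "near-synonym"),
    ("laugh", "vocalisation", CONCEPTUAL, "hypernymy"),  # concepts
]

def relations_at_level(rels, level):
    """Filter the graph down to one of the three relation levels."""
    return [(a, b, kind) for a, b, lvl, kind in rels if lvl == level]

print(relations_at_level(relations, SENSE))
# [('laugh', 'rire', 'translation'), ('giggle', 'laugh', 'near-synonym')]
```

Keeping the level explicit on each edge is what allows a linking pipeline to treat, say, a translation (a sense relation) differently from a derivation (a lexical relation) when deriving new content.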
The task consists of: (a) the rendering of language data to lexical definitions and
further extension of these definitions to appropriate translations; and (b) the linking
of the corpus text to the dictionary content. The ethical rigour lies in the selected
requirements. In both cases we have a semantic transition from state A to state B. To
follow the semantic flow of every given semantic state, we need to screen (a) the
intrinsic information qualities, like accuracy of the resources, in order to control
potential data errors and consequent informational errors; (b) the contextual infor-
mation qualities like relevancy, timeliness and completeness; and (c) the represen-
tational information qualities like interpretability, ease of understanding and
consistent representations6 while setting the appropriate parameters (context,
LoAs, purpose). The main objective here is to understand whether the input
information is fit for reuse for a new purpose, to what extent and under what
requirements. In other words, it is about proper domain adaptation. In the event of
required domain adaptation, the maker needs to take into account the fact that the
information quality properties do weigh differently in each state of given parameters,
as we move from semantic state A to semantic state B, etc. This means that
6 Floridi (2019) introduces the concept of Bi-Categorical Information Quality; its main properties are accuracy, objectivity, accessibility, security, relevancy, timeliness, interpretability and understandability.
transferring an MT system initially based on texts from a given domain (e.g. the
financial domain) to another domain (e.g. legal, media, news) requires adaptation
to new lexical and terminological data as well as the handling of different linguistic
structures, as each domain has its own vocabulary, grammar and inclusive
properties.
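One crude way to quantify the lexical side of such domain transfer is the out-of-vocabulary rate of the target domain with respect to the source domain. The toy corpora below are illustrative, not real financial or legal text:

```python
def oov_rate(source_corpus, target_corpus):
    """Fraction of target-domain word types unseen in the source domain.
    A high rate signals that lexical/terminological adaptation is needed."""
    src_vocab = {w.lower() for text in source_corpus for w in text.split()}
    tgt_vocab = {w.lower() for text in target_corpus for w in text.split()}
    unseen = tgt_vocab - src_vocab
    return len(unseen) / len(tgt_vocab)

financial = ["the bond yield rose", "the bank cut the rate"]
legal = ["the plaintiff filed the appeal", "the court cut the sentence"]
print(round(oov_rate(financial, legal), 2))  # 0.71: most legal types are new
```

The grammatical and pragmatic differences between domains are, of course, not captured by such a vocabulary check; it only makes the cheapest part of the adaptation need measurable.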
Constantly reassessing the parameters is of high value as there are many ways of
expressing a state A or a state B and many different dimensions to look at. The next
logical question is how. Indicatively, such reassessment could take the form of a set
of questions that require a single YES/NO answer throughout the semantic cycle.
These questions will be ethically curated and appropriated in a fixed language within
specified context, LoAs and purpose parameters, thus allowing the computational
identification of possible semantic alterations of a given linguistic system as it
transits from semantic state A to semantic state B.7 The good news is that the
semantic distance between state A and state B is traceable and measurable. This is
similar to how word embeddings operate in neural networks; MT architectures, such
as OpenNMT, an open-source toolkit for neural MT (Klein et al. 2017), make use of
word embeddings. Neural networks measure similarities among words and
capture a great amount of real-world information, maybe supererogatory at times,
thus articulating the great value of ethical parameters. For example, Preotiuc-Pietro
et al. (2016) identified in their study isolated stylistic differences by using paraphrase
pairs and clusters from social media text. Paraphrases represent alternative ways to
convey the same information (Barzilay and McKeown 2001), using single words or
phrases (e.g. giggle/laugh or brutal/fierce) linked to user attributes (e.g. male/female
or of high/low occupational status). By studying occurrences within these paraphrase
pairs and clusters, Preotiuc-Pietro et al. (2016) directly presented the difference of
stylistic lexical choice between different user groups, while minimising the confla-
tion of topical differences. These stylistic differences entail predictive power in user
profiling and conformity with human perception. Translating paraphrases leads to a
marked improvement in coverage and translation quality, especially in the case of
unknown words, as paraphrases introduce some amount of ‘generalisation’ into
statistical MT (Callison-Burch et al. 2006). In other words, general knowledge,
external to the translation model is exploitable, while also capturing meaning in
translation. An appropriate set of questions within a fixed set of parameters (namely
context, LoAs, purpose) may illustrate the varying attributes per author and reveal,
in the lexical linking in question, semantic dissimilarities, correctable errors and
information quality shortages. In other words, they may identify a mismatch and
shed the necessary light for an application aimed at semantic mismatch prevention. The
sceptical reader may object, however, that very few of the problems of the world are
binary.
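Setting the binarity objection aside for a moment, the questionnaire idea can be sketched as fixed YES/NO answer vectors for semantic states A and B, with the Hamming distance (cf. the footnoted Borel numbers and Hamming distance in Floridi 2019) counting how many screening answers changed in transit. The questions below are invented for illustration:

```python
def hamming(answers_a, answers_b):
    """Number of ethically curated YES/NO questions whose answer changed
    between semantic state A and semantic state B."""
    if len(answers_a) != len(answers_b):
        raise ValueError("states must answer the same fixed question set")
    return sum(a != b for a, b in zip(answers_a, answers_b))

QUESTIONS = [  # illustrative screening questions, fixed in advance
    "Is the source attributable?",
    "Is the dialect of the speaker preserved?",
    "Is the sentiment label unchanged?",
    "Is the target audience the same?",
]
state_a = [True, True, True, True]    # before translation/repurposing
state_b = [True, False, True, False]  # after
print(hamming(state_a, state_b))      # 2: two semantic properties drifted
```

A distance of zero would indicate that, at this LoA and for this question set, the semantic transition preserved the screened properties; any positive value points to the specific questions worth re-examining.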
7 Based on Floridi (2019) and the concept of Borel numbers (β) and Hamming distance (hd).
8 For more info about the Principle of Information Closure (PIC) see Floridi (2019: p. 149).
3.6 Conclusion
As we automate in scale and in scope, a clear tension is evident between (a) the
opportunities and tools provided by NLP applications, like MT; and (b) the
implications for social behaviour, behaviour engineering and the societal properties
thereof. It becomes clear that applied ethics needs to take the lead in the overall
LT design. Ethics on the field, if we may say, provided the prior framing of its
conceptual grounds and the development of its theoretical reasoning. The objective
is to allow for smooth navigation among the emerging polarities within LTs, rather
than focusing on what Salganik (2017) calls ‘abstract social theory or fancy
machine learning’. Following the target system observance, the input data and
appropriate description (modelling) of the observed system, we need to aim at a
diffuse, distributed, decentralised design process towards a relevant blueprint of
interpretation and engagement with our target system through interdisciplinary
lenses. For language in particular, the core problems of societal behavioural
distortion lie in data exclusion or demographic misrepresentation, modelling
overgeneralisation such as automatic inferences of user attributes, topic
overexposure leading to the psychological effect known as the availability heuristic
(Tversky and Kahneman 1973),9 and dual uses (Hovy and Spruit 2016). The
objective for a consistent blueprint is subject to general moral imperatives like
societal wellness, fairness and beneficence, plus more specific professional
responsibilities transcending strict legal compliance. This is not a tech problem; rather, it is
the free space where ethical elaboration lies: elaboration on 'what' we design, 'how'
we design it, and what the 'impact' of such design is.
References
Adda G, Besacier L, Couillault A, Fort K, Mariani J, de Mazancourt H (2014) Where are the data
coming from? Ethics, crowdsourcing and traceability for Big Data in Human Language
Technology. In: Crowdsourcing and human computation multidisciplinary workshop, CNRS,
September 2014, Paris, France
Andriopoulos DZ (2021) ΑΡΙΣΤΟΤΕΛΗΣ: ΠΕΝΗΝΤΑ ΤΡΕΙΣ ΟΜΟΚΕΝΤΡΕΣ ΜΕΛΕΤΕΣ (7Η ΕΚΔΟΣΗ) [Aristotle: Fifty-three concentric studies, 7th edition]. Private Edition
Barzilay R, McKeown KR (2001) Extracting paraphrases from a parallel corpus. In: Proceedings of
the 39th Annual Meeting of the Association for Computational Linguistics, ACL, Toulouse.
https://doi.org/10.3115/1073012.1073020
Callison-Burch C, Koehn P, Osborne M (2006) Improved statistical machine translation using
paraphrases. In: Proceedings of the Human Language Technology Conference of the NAACL,
Main Conference, New York, Association for Computational Linguistics
9 If people can recall a certain event, or have knowledge about specific things, they infer it must be more important. If research repeatedly found that the language of a certain demographic group was harder to process, it could create a situation where this group was perceived to be difficult, or abnormal, especially in the presence of existing biases.
3 The Ethics of Machine Translation 47
Cimiano P, Chiarcos C, McCrae JP, Gracia J (2020) Linguistic linked data: representation,
generation and applications. Springer International Publishing, Cham. https://doi.org/10.1007/
978-3-030-30225-2
Clark HH, Schober MF (1992) Asking questions and influencing answers. In: Tanur JM
(ed) Questions about questions: inquiries into the cognitive bases of surveys. Russell Sage
Foundation, New York, NY, pp 15–48
Floridi L (2013) The ethics of information. Oxford University Press, Oxford. https://doi.org/10.
1093/acprof:oso/9780199641321.001.0001
Floridi L (2018) Semantic capital: its nature, value, and curation. Phil Tech 31(4):481–497. https://
doi.org/10.1007/s13347-018-0335-1
Floridi L (2019) The logic of information: a theory of philosophy as conceptual design. Oxford
University Press, Oxford. https://doi.org/10.1093/oso/9780198833635.001.0001
Gracia J, Montiel-Ponsoda E, Cimiano P, Gómez-Pérez A, Buitelaar P, McCrae J (2012) Challenges
for the multilingual web of data. J Web Semant 11:63–71. https://doi.org/10.1016/j.websem.
2011.09.001
Hale SA (2016) User reviews and language: how language influences ratings. In: Proceedings of the
2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA
‘16). Association for Computing Machinery, New York, NY, pp 1208–1214. https://doi.org/10.
1145/2851581.2892466
Hale SA, Eleta I (2017) Foreign-language reviews: help or hindrance? In: Proceedings of the 2017
CHI Conference on Human Factors in Computing Systems (CHI ‘17). Association for Com-
puting Machinery, New York, NY, pp 4430–4442. https://doi.org/10.1145/3025453.3025575
Hovy D, Spruit SL (2016) The social impact of natural language processing. In: Proceedings of the
54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short
Papers). Association for Computational Linguistics, pp 591–598. https://doi.org/10.18653/v1/
P16-2096
Jurgens D, Tsvetkov Y, Jurafsky D (2017) Incorporating dialectal variability for socially equitable
language identification. In: Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguis-
tics, pp 51–57. https://doi.org/10.18653/v1/P17-2009
Klein G, Kim Y, Deng Y, Senellart J, Rush AM (2017) OpenNMT: Open-Source Toolkit for neural
machine translation. ArXiv. https://arxiv.org/abs/1701.02810
Kostakis V (2019) How to reap the benefits of the “digital revolution”? Modularity and the
commons. Halduskultuur 20(1):4–19. https://doi.org/10.32994/hk.v20i1.228
Kramer ADI, Guillory JE, Hancock JT (2014) Experimental evidence of massive-scale emotional contagion through social networks. Proc Natl Acad Sci USA 111(24):8788–8790
Krauwer S (2003) The Basic Language Resource Kit (BLARK) as the first milestone for the Language Resources Roadmap. Utrecht Institute for Linguistics. Available at http://www.elsnet.org/dox/krauwer-specom2003.pdf
Kuczmarski J (2018) Reducing gender bias in Google Translate. https://blog.google/products/translate/reducing-gender-bias-google-translate/
McCrae JP, Cillessen D (2021) Towards a linking between WordNet and Wikidata. In: Proceedings
of the 11th Global Wordnet Conference, Global Wordnet Association, South Africa
McCrae JP, Tiberius C, Khan AF, Kernerman I, Declerck T, Krek S, Monachini M, Ahmadi S
(2019) The ELEXIS interface for interoperable lexical resources. In: Electronic lexicography in
the 21st Century. Proceedings of the ELex 2019 Conference, Lexical Computing. Sintra,
Portugal
Narayanan A (2018) Twitter. https://twitter.com/random_walker/status/993866661852864512
Preotiuc-Pietro D, Xu W, Ungar L (2016) Discovering user attribute stylistic differences via
paraphrasing. Proc AAAI Conf Artif Intell 30(1):3030
Puschmann C, Bozdag E (2014) Staking out the unclear ethical terrain of online social experiments.
Intern Pol Rev 3(4):1. https://doi.org/10.14763/2014.4.338
Rehm G, Galanis D, Labropoulou P, Piperidis S, Welß M, Usbeck R, Köhler J, Deligiannis M,
Gkirtzou K, Fischer J, Chiarcos C, Feldhus N, Moreno-Schneider J, Kintzel F, Montiel E,
48 A. Nousias
Rodríguez Doncel V, McCrae JP, Laqua D, Theile IP, Dittmar C, Bontcheva K, Roberts I,
Vasiļjevs A, Lagzdiņš A (2020) Towards an interoperable ecosystem of AI and LT platforms: a
roadmap for the implementation of different levels of interoperability. In: Proceedings of the 1st
International Workshop on Language Technology Platforms. European Language Resources
Association
Salganik MJ (2017) Bit by bit: social research in the digital age. Princeton University Press,
Princeton, NJ
Sap M, Card D, Gabriel S, Choi Y, Smith NA (2019) The risk of racial bias in hate speech
detection. In: Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics. Association for Computational Linguistics
Selinger E, Hartzog W (2016) Facebook’s emotional contagion study and the ethical problem of
co-opted identity in mediated environments where users lack control. Res Ethics 12(1):35–43.
https://doi.org/10.1177/1747016115579531
Stanovsky G, Smith NA, Zettlemoyer L (2019) Evaluating gender bias in machine translation. In:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
Association for Computational Linguistics
Sveinsdottir T, Troullinou P, Aidlinis S, Delipalta A, Finn R, Loukinas P, Muraszkiewicz J,
O’Connor R, Petersen K, Rovatsos M, Santiago N, Sisu D, Taylor M, Wieltschnig P (2020)
The role of data in AI. Zenodo. https://doi.org/10.5281/zenodo.4312907
Tversky A, Kahneman D (1973) Availability: a heuristic for judging frequency and probability.
Cogn Psychol 5(2):207–232. 0010028573900339. https://doi.org/10.1016/0010-0285(73)
90033-9
Wacks R (2006) Philosophy of law: a very short introduction. Oxford University Press, Oxford
Wang Z, Hale S, Adelani DI, Grabowicz P, Hartman T, Flöck F, Jurgens D (2019) Demographic
inference and representative population estimates from multilingual social media data. In: The
World Wide Web Conference (WWW ‘19). Association for Computing Machinery, New York,
NY, pp 2056–2067. https://doi.org/10.1145/3308558.3313684
Zunger Y (2017) So, about this Googler’s manifesto. Medium. https://medium.com/
@yonatanzunger/so-about-this-googlers-manifesto-1e3773ed1788
Chapter 4
Licensing and Usage Rights of Language
Data in Machine Translation
Mikel L. Forcada
M. L. Forcada (✉)
Dept. de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, Sant Vicent del Raspeig,
Spain
Prompsit Language Engineering, Elx, Spain
e-mail: [email protected]
4.1 Introduction
Machine Translation (MT) is special among software in that it relies heavily
on data. On the one hand, rule-based or knowledge-based MT is usually
implemented in such a way that a roughly language-independent engine performs
the translation task by using language resources such as dictionaries and grammar
rules, which may be (a) manually written from scratch, (b) obtained by converting
existing, manually written dictionaries and rules, (c) learned from monolingual text
or from sentence-aligned bilingual text, or (d) a mixture of some or all of these. On the other hand, corpus-based MT (CBMT), such as statistical MT (SMT) and, more recently, neural MT (NMT), relies mainly on monolingual and sentence-aligned bilingual text ((c) above), but may also use rules and dictionaries to
pre-process text for easier learning. Section 4.2 briefly reviews the kinds of data
used in MT. This chapter will not deal with the fact that computer programs using
these data are also works of creation that should also be protected, but will rather
concentrate on the data itself. Human labour, and therefore creative authorship of
works, is present in all forms of MT data: monolingual text has been authored,
parallel text has been translated and aligned, and rules and dictionaries have been
written by experts. Since it was conceived more than three centuries ago, copyright (explicit or implicit, as recognized by the Berne Convention) establishes authorship and regulates (usually by restricting them for a limited time) the usage rights of copies or
data derived from these works using legal instruments such as licences, to make it
possible for authors to make a living from creative work. Section 4.3 gives an
overview of the different sources of data used in MT, defining authorship along
the steps of creating, curating and transforming those data for use with MT,
determining the kinds of implicit and explicit licensing schemes that apply to them
and how they work. On the one hand, the case of dictionaries and grammars as used
in rule-based MT (RBMT) (Sect. 4.3.1) is reasonably clear, as they are purposely
written for one or another language-processing application; a brief mention is made
of the benefits of free/open-source licensing for RBMT data. On the other hand,
however, monolingual and parallel text (discussed in detail in Sect. 4.3.2), as used in
MT, were not created with MT in mind, and this may lead, for example, to the
question whether authors and translators should get additional compensation for this
unintended use of their works (particularly those published on the Internet and
harvested using crawling techniques) to generate new, initially unintended, value,
not only through MT but also through other translation technologies such as
computer-aided translation (CAT); this is discussed in Sect. 4.3.3. The chapter ends with concluding remarks, which are concluding more in the sense that they close the chapter than in the sense of settling a rather complicated matter; instead, they summarise the various ways in which copyright issues are addressed in the real world and give an indication of the open questions ahead.
In the decades from the first attempts at machine translation to the 1990s, the main approach was RBMT. Typically, RBMT starts from the translations of individual words and
ideally builds from them a translation for the whole sentence.
To build an RBMT system, translation experts must create machine-readable
monolingual and bilingual dictionaries for the languages involved and rules to
analyse the source text and perform other actions, such as converting the source
grammatical structure to the equivalent structure of the target language. Keep in
mind that the translator’s intuitive and unformalized knowledge needs to be turned
into rules and dictionaries and coded in a computationally efficient way. This may sometimes require simplifications which, if chosen wisely, are nevertheless adequate in most cases. Computer experts, of course, have to write engines that use the
dictionaries and apply rules to the source text in the specified order, to produce a
raw translation, and any tools needed to manage and convert data produced by
experts to the format required by the engine.
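The word-for-word starting point described above can be sketched in a few lines (a toy illustration with an invented three-entry dictionary and a single hard-coded transfer rule; real RBMT engines use rich morphological dictionaries, disambiguation rules and far more general structural transfer):

```python
# Toy word-for-word RBMT sketch: a tiny bilingual dictionary plus one
# reordering rule. All entries are invented for illustration; agreement
# (gender, number) is simply baked into the dictionary entries.
BIDICT = {"the": "la", "white": "blanca", "house": "casa"}
POS = {"the": "det", "white": "adj", "house": "noun"}

def translate(sentence: str) -> str:
    words = sentence.lower().split()
    tags = [POS.get(w, "unk") for w in words]
    # Structural transfer rule: det-adj-noun becomes det-noun-adj,
    # since adjectives follow the noun in the target language.
    for i in range(len(words) - 2):
        if tags[i:i + 3] == ["det", "adj", "noun"]:
            words[i + 1], words[i + 2] = words[i + 2], words[i + 1]
    # Lexical transfer: look each word up, keeping unknown words unchanged.
    return " ".join(BIDICT.get(w, w) for w in words)

print(translate("the white house"))  # prints "la casa blanca"
```

Even this toy shows why building such resources requires expert labour: every dictionary entry and every rule encodes a translator's decision, made machine-readable.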
In the case of RBMT, the translation data (dictionaries and rule sets) are an example of linguistic resources.
The beginning of the 1990s saw the inception of corpus-based machine translation,
which is currently the main paradigm, at least for major language pairs.
In this case, the MT program is automatically trained, that is, it learns to translate
from a huge corpus of examples, each of which contains a source-language sentence
paired with its translation in the target language, and sometimes also from an
additional corpus of monolingual target-language sentences. These large parallel
corpora, more precisely sentence-aligned parallel corpora (Sect. 4.3.2), may indeed
be seen as large translation memories (TMs) such as the ones used in CAT (Bowker
1 https://translate.google.com/.
2 https://translator.microsoft.com/.
3 It has been shown that it is possible to train CBMT systems in an unsupervised way, that is, using only monolingual text in both languages and very small amounts of parallel text, or even no parallel text at all (Artetxe et al. 2018; Lample et al. 2018), with more modest results.
4 Sometimes the CBMT system is trained on a large general stock corpus and then refined or tuned in some way using the available data for the specific domain (see e.g. Pecina et al. 2012).
Of course, there are also hybrid systems that integrate the two strategies. For
example, one may (a) use morphological rules to analyse the text before translating
it using a system that has also been trained on a corpus of morphologically analysed
texts,5 or (b) use a syntactic parser to pre-reorder the source text so that its syntax is
closer to that of the target text,6 (c) or use a rule-based system to translate a large
target-language monolingual corpus into the source language to generate synthetic
parallel training material for a CBMT system.7 The discussion below about usage
rights and licensing also applies to these hybrid systems.
The previous section has identified two main kinds of data that may be used when
building MT systems: linguistic resources, mainly used in RBMT, and corpora,
mainly used in CBMT.
For RBMT to work, one must provide the engine with specific computer-readable
linguistic resources, such as dictionaries describing the morphology of the source
and target languages, rules to disambiguate homographs and polysemic words,
bilingual dictionaries, and rules transforming the structure of the source sentence
into a valid target-language structure. As mentioned earlier, building these resources
requires linguists and translation experts who are familiar with the formats required
by the MT system. These experts have to create their resources from scratch or obtain them by transforming results which are already available.
Note that linguistic resources are not only useful in RBMT; they can also be used
to automatically transform, annotate, and prepare corpora to make them useful to
train CBMT systems, as described in Sect. 4.2.3.
5 See, for instance, Lee (2004).
6 Both in statistical (Cai et al. 2014) and neural (Du and Way 2017) MT.
7 This would be a kind of back-translation, similar to that described by Sennrich et al. (2016) for NMT.
Isabelle et al. (1993)—but also Simard et al. (1993)—are famously quoted for saying
that “Existing translations contain more solutions to more translation problems than
any other currently available resource”. CBMT does indeed spring from this idea. As
mentioned above, to work, CBMT requires large numbers (hundreds of thousands to
millions) of sentence pairs made up of an original sentence and its translation.
Creating such a corpus requires a great deal of effort. To start, you need to have
enough translated text, ideally professionally translated.13 Then, before training the
system, the translations must be aligned sentence by sentence (if they were not translated in a computer-aided environment and therefore already segmented and aligned).
8 Examples are the End User License Agreements used with various commercial software (https://en.wikipedia.org/wiki/End-user_license_agreement) or the Free Software Foundation’s GNU General Public License (https://en.wikipedia.org/wiki/GNU_General_Public_License).
9 “All rights reserved” licences with wordings in the style of “Copyright © 2020 by Author Name. All rights reserved. This book or any portion thereof may not be reproduced or used in any manner whatsoever without the express written permission of the publisher except for the use of brief quotations in a book review or certain other non-commercial uses permitted by copyright law.” or open licences such as those in the Creative Commons set of copyright licences (https://creativecommons.org/).
10 http://wizard.elra.info.
11 http://www.apertium.org.
12 The GNU General Public Licence, https://www.gnu.org/licenses/gpl-3.0.en.html.
13 When this is not possible, it is not uncommon to make do with corpora where translations are used as the source text, or where documents are translations of documents in a third language (for example, a German–Finnish parallel text may be the result of Finnish → English translation that was subsequently translated into German).
Institutions, organisations and companies that produce (or purchase) and publish
translations may decide to publish also the corresponding TMs or sentence-aligned
parallel corpora.
Some public agencies publish part of their TMs, particularly those related to
documents that they will publish in various languages. For example, since 2007,
the Directorate General for Translation of the European Commission has published
large, curated TMs16 corresponding to the so-called Acquis Communautaire, that is,
the body of common rights and obligations that are binding for all European Union
countries. Such TMs may be very useful in CAT of future documents
produced by the same administration, but also in CAT or to train MT systems for
texts of this genre to be published by related administrations. Another initiative by
the government of the Basque autonomous community, called Open Data Euskadi,
publishes TMs for the Basque language, generated by the government or related
institutions in the territory.17
Many administrations that generate translations for public service information
such as websites have, however, not internally reached the level of translation
workflow maturity (Bel et al. 2016) that would allow them to curate and, if so
desired, publish their translations as TMs.18 Without the TMs, they may end up
14 For instance, due to errors in sentence alignment.
15 http://opus.nlpl.eu.
16 https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory.
17 A search of “TMX” in https://opendata.euskadi.eus/inicio/ reveals many such TMs.
18 In addition to project Paracrawl, which will be discussed in Sect. 4.3.2.2 and which ends up crawling this kind of translation to build its corpora, other Connecting Europe Facility projects collect parallel data from institutions. For instance, the project Principle (https://principleproject.eu/) collects data in Icelandic, Norwegian, Croatian and Irish from early adopter institutions and companies, and the MT4All project (http://ixa2.si.ehu.eus/mt4all/project) focuses on unsupervised MT (Artetxe et al. 2018; Lample et al. 2018) and on the creation of resources for EU and non-EU languages including Kazakh. There are also initiatives like ELRC-Share (https://www.elrc-share.eu/) which collect all sorts of parallel corpora, many based on institutional translations.
paying for the translation of material that had already been translated, at an unnecessary expense of taxpayers’ money. Sometimes, members of the institutions have later taken it upon themselves and worked in collaboration with third parties to create sentence-aligned parallel corpora from their public documents, as in the case of the six-language United Nations corpus (Rafalovitch and Dale 2009; Ziemski et al. 2016).
19 https://www.gnome.org/.
20 https://translations.launchpad.net/.
21 https://translate.apache.org/projects/aoo40/.
22 http://tatoeba.org.
23 http://opus.nlpl.eu/Tatoeba.php.
A very large source of sentence-aligned translations used to train MT systems, perhaps the largest, comes from publicly accessible documents published on the Internet, obtained either by manually scraping and aligning them or by automated crawling and alignment.
Recent advances have made it possible to automatically crawl multilingual
websites to obtain parallel corpora. This is very likely one of the methods used by
commercial systems like Google, Microsoft and DeepL. To do this, documents in the
two languages of interest are downloaded from selected candidate websites. Once
the language of downloaded documents is automatically identified, source-language
and target-language documents are matched by examining, for instance, their length, the internal structure of the text, and their content, using available bilingual
resources such as dictionaries and MT to guide the matching. Then, the source-
language and the target-language texts are split into sentences, and statistical
methods, again assisted by bilingual resources if available, are used to produce as
many sentence pairs (translation units) as possible. Finally, additional statistical
techniques are used to discard translation units which are not likely to be useful in
any application, for instance, when the source and the target have very disparate
lengths, or when they contain more numbers or punctuation than text, or when an
automatic language identifier detects them as not being in the expected language.24
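The filtering stage just described can be sketched as follows (a minimal illustration with arbitrary thresholds, not the actual criteria of any crawling project; a real pipeline would add language identification and many further checks):

```python
# Toy filters for candidate translation units (source-target sentence pairs).
# Thresholds are illustrative assumptions, not values from any real project.

def disparate_length(src: str, tgt: str, max_ratio: float = 2.5) -> bool:
    """True if source and target are too different in length to be translations."""
    a, b = len(src), len(tgt)
    return min(a, b) == 0 or max(a, b) / min(a, b) > max_ratio

def mostly_non_text(s: str, max_frac: float = 0.5) -> bool:
    """True if more than max_frac of the characters are numbers or punctuation."""
    non_alpha = sum(1 for c in s if not (c.isalpha() or c.isspace()))
    return len(s) == 0 or non_alpha / len(s) > max_frac

def keep(src: str, tgt: str) -> bool:
    return not disparate_length(src, tgt) and \
           not mostly_non_text(src) and not mostly_non_text(tgt)

pairs = [
    ("The committee met on Monday.", "El comité se reunió el lunes."),
    ("2021-03-04 12:00", "Page 17 / 23"),   # mostly numbers and punctuation
    ("Yes.", "A long target sentence that cannot possibly be a translation."),
]
filtered = [p for p in pairs if keep(*p)]   # only the first pair survives
```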
For example, the project Paracrawl25 performs this sort of crawling to obtain, publish, and provide the European Commission with sentence-aligned parallel corpora for the official languages of Europe, and develops software to perform the task; many other projects also crawl the web for parallel text.
24 This may be due to the existence of segments in a language different from the rest of the document.
25 http://paracrawl.eu.
26 https://globalvoices.org/.
27 http://casmacat.eu/corpus/global-voices.html.
Many public administrations publish public service multilingual data, but they do
not publish the TMs or sentence-aligned corpora that they could probably have
produced as a side-product of their use of CAT tools.
For example, the European Medicines Agency publishes all sorts of texts related
to the evaluation of medicines for human and veterinary use in the form of PDF files.
The OPUS project29 has extracted text from these PDF files and has aligned them to
produce corpora for 22 official languages of Europe.
Another example: the Catalan government has to publish its legal gazette in
Catalan and Spanish, as both are official languages in Catalonia. Although CAT is used to produce it, the sentence-aligned parallel corpora have not
been made available by the Catalan government, but rather by an independent
Catalan researcher, Antoni Oliver, and published in OPUS.30
28 https://www.jw.org/en/.
29 http://opus.nlpl.eu/EMEA.php.
30 https://opus.nlpl.eu/DOGC.php.
corpora are clearly a derivative of the published translations and are therefore
directly affected by the terms associated with them at the time of publication, which
may sometimes preclude the publication of any derivatives.
It is also important to consider that the compilation of the corpus, as will be
discussed below in Sect. 4.3.3.2, adds new value to the original translations, and
compilers may want to use a licence that protects their work.
If the terms of the original translations allow for the derived corpora to be
published, the licence attached to the corpora has to be compatible with that of the
original material.
This section discusses both scenarios; that is, corpora published by their owners
(Sect. 4.3.3.1) and corpora crawled from the Internet, with emphasis on the latter case
(Sect. 4.3.3.2), and it discusses aspects such as automatic copyright and copyright as
protection, considers a non-literal approach to copyright, the implementation of
exceptions to copyright for the purpose of text mining, a workaround called deferred
crawling, the recognition of translators as authors having the right to copyright their
work, the nature of value added through the compilation of sentence-aligned parallel
corpora and the training of CBMT, and the claims of translators for compensation for
the subsequent unintended use of their work.
31 https://joinup.ec.europa.eu/sites/default/files/custom-page/attachment/eupl_v1.2_en.pdf.
Automatic Copyright
Under the Berne Convention for the Protection of Literary and Artistic Works,34
Berne Convention for short, which has been signed by 197 countries,35 it is under-
stood that the author reserves all rights to reproduction when the work does not
specify the conditions of reuse (“Protection must not be conditional upon compli-
ance with any formality (principle of “automatic” protection)”). As a result, the
content of websites not bearing any licence cannot be reproduced in any way without
the explicit permission of the author.
Regular sentence-aligned parallel corpora contain thousands or millions of
sentences coming from hundreds or thousands of documents. Owners may implicitly
or explicitly reserve all rights; clearly, if one followed copyright rules strictly, one
would have to contact them and explicitly ask for clearance, which would often start
a discussion on the licensing of the corpus.
For most corpora, this is a very hard task, as it would involve contacting the
owners of many sources (for a description in a real scenario, see De Clercq and
Montero Pérez 2010, cited in Macken et al. 2011). Moreover, where owners have
32 https://creativecommons.org/licenses/by-sa/3.0/.
33 “[. . .] unless expressly authorised by The Washington Post in writing, you are prohibited from publishing, reproducing, distributing, publishing, entering into a database, displaying, performing, modifying, creating derivative works, transmitting, or in any way exploiting any part of the Services, except that you may make use of the content for your own personal use as follows: you may make one machine-readable copy and/or print copy that is limited to occasional articles of personal interest only.”
34 https://www.wipo.int/treaties/en/ip/berne/.
35 https://www.wipo.int/treaties/en/ShowResults.jsp?treaty_id=15.
explicitly stated a licence, as said above, this may restrict the choice of licences for
the corpus. In fact, it may be the case that the corpus would have to be partitioned into
sub-corpora, each one with a licence which is compatible with that of the original
sources. Does this mean that sentence-aligned parallel corpora created from these
pages cannot be distributed at all?
Section 4.3.2.2 discusses legal exceptions to the automatic all-rights-
reserved rule.
Copyright as Protection
Let us take a closer look at copyright. Copyright was originally conceived as a way
to protect the right of authors to obtain proper compensation for their work by
guaranteeing that, during a certain period of time, no one else would be able to make
a profit from derivatives. As a result, society as a whole would benefit as authors
would be encouraged to create works useful for it. One of the earliest legal formu-
lations of this principle of copyright for the common good appears in the United
States Constitution (article I, section 8, clause 8) as one of the duties of Congress:
“To promote the Progress of Science and useful Arts, by securing for limited Times
to Authors and Inventors the exclusive Right to their respective Writings and
Discoveries.” This period of time, initially short, was subsequently extended so
that not only authors but also their heirs could benefit after their death.
Now, let us leave aside for a moment the literal interpretation of “all rights reserved”
copyright clauses and think about how the publication of web-crawled sentence-
aligned parallel corpora may threaten the livelihood of authors (and their heirs) or the
viability of enterprises producing web content.
Let us clarify that we are concerned here with textual content which is freely
accessible on the Internet, that is, not behind any paywall or authentication chal-
lenge, so that anyone with a browser can read it, as this is basically what web
crawlers, that is, headless browsers, can obtain if allowed.36 But one may ask, what
is the point of protecting against copies of publicly-available content that anyone
can read?
There are a number of reasons. When authors or copyright holders of this textual
content want to make a profit from it, they can, for example, add advertising37 to
36 Some websites avoid access by headless browsers by challenging them to show human behaviour, but this is quite unusual with textual content.
37 And even secure future advertising by leaving small files, called cookies, on the local computer of the person browsing the site, which may later be picked up by other websites to insert new advertisements.
it, paid for by third parties. If someone else publishes a copy of substantial parts of
that content somewhere else without the advertisement, they may compete with the
original content and reduce the advertising revenue associated with the original
content. The fact that publishers may derive value from human reading is the reason
why some sites expect headless browsers such as text crawlers to respect the
restrictions expressed in suitable files.38 However, when a sentence-aligned parallel
corpus is generated from multilingual web content, it often happens that:
• Textual content is divided into small units such as sentences;
• Formatting, which may actually be used to guide segmentation and alignment, is thrown away;
• Sentence order may not be preserved;
• Sentences that have not been successfully aligned with a translation or were
already found in another document are discarded;
• Material from different documents ends up mixed (for example, as the result of
sorting and de-duplication).
It would be a bit as if a number of documents had been processed by a document
shredder that made a paper strip out of each line, so that strips coming from different
documents had been mixed and shuffled into a mess from which some strips might
have been removed.
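The shredding can be made concrete in a few lines (a toy sketch; the pooling, deduplication and sorting steps are assumed typical post-processing, not any project's exact pipeline):

```python
# Toy illustration: pooling sentence pairs from several documents, removing
# duplicates and sorting them destroys document boundaries and sentence order.
docs = {
    "doc1": [("Hello.", "Hola."), ("Thank you.", "Gracias.")],
    "doc2": [("Thank you.", "Gracias."), ("Good night.", "Buenas noches.")],
}

# After this step it is no longer possible to tell which document a pair
# came from, nor in what order the sentences originally appeared.
corpus = sorted({pair for pairs in docs.values() for pair in pairs})
```

Reconstructing either original document from `corpus` alone is impossible: the duplicated pair survives only once, and the ordering reflects alphabetical sorting rather than the original discourse.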
As a result, it is impossible, or at least very difficult, to reconstruct substantial
parts of the original document from the usual sentence-aligned parallel corpora, and
therefore, to compete with the original content by publishing reconstructed copies of
substantial parts of it elsewhere.39 In fact, Tsiavos et al. (2014) suggest that “[E]nsur[ing] the original material [without any copyright notice] cannot be reconstructed
from the LR” is one way to reduce legal risk. This is because, while the sentence-
aligned corpus could be seen to be literally in violation of the copyright of many
pages of content, there would be very little incentive on the part of their authors to
sue the compilers of the corpora, as it would be very difficult to prove the existence
of damage and even more difficult to put a value on it.
A common legal risk mitigation strategy used by some compilers of sentence-
aligned parallel corpora is to establish a notice-and-take-down procedure.40 To give
a specific example, the wording in the paracrawl.eu website invites whoever con-
siders that their data “contains material that is owned by you and should therefore not
be reproduced here” to contact the project, clearly identify themselves, “clearly
identify the copyrighted work claimed to be infringed” and “the material that is
claimed to be infringing” together with “information reasonably sufficient to allow
us to locate the material”, and commits to “comply to legitimate requests by
38. Websites can have a robots.txt file that regulates access by headless browsers (“robots” or crawlers) to specific parts of the site and should therefore be respected.
39. But not completely impossible. Carlini (2020) shows that, using carefully-crafted queries, neural models can reproduce substantial parts of the data used to train them.
40. https://en.wikipedia.org/wiki/Notice_and_take_down.
4 Licensing and Usage Rights of Language Data in Machine Translation 63
removing the affected sources from the next release of the corpus”. Other corpus
distribution sites, such as that of the Wacky Corpus,41 simply state “If you want your
webpage to be removed from our corpora, please contact us.” In fact, “notice and
take-down” is one of the strategies suggested by European projects QTLaunchpad
(Tsiavos et al. 2014) and Panacea (Arranz et al. 2013) to reduce legal risk.42
One of the main applications of sentence-aligned parallel corpora is training MT
systems, which may then reasonably be considered derivative work. Therefore,
when creating a commercial MT system by reusing the original text and its translations, it would be wise to ask to what extent the copyright of the original text is
respected. Is CBMT a real threat to the rights of the authors and translators of the
texts used to train it? SMT output may indeed contain short subsegments (sequences
of a few or several target-language words) from the translations it was trained on, but
these are spliced together in a wholly new order; as a result, the original works are
even harder to recover than they were from sentence-aligned parallel corpora.
Recoverability is virtually impossible in NMT output, where each word (or even
every sub-word unit43 or every character (Lee et al. 2017)) is produced separately.
Therefore, the claim that translations produced by systems trained on multilingual
works constitute the public reproduction of substantial parts of these works
would seem rather outlandish. In fact, it would be impossible to
trace back copyright to translators, as “in [NMT] training, data is broken down to the
level of words, subwords, or even characters [. . .] so that the input of any individual
translator is unrecognisable and their contribution to a system trained with very large
amounts of data is untraceable.” (Moorkens and Lewis 2019, citing Lee et al. 2017).
In fact, in many other industries, you can buy products where traceability is lost. If
you buy packaged beef sirloin, in some parts of Europe you have traceability
numbers explaining where the animal was raised and slaughtered, and sometimes
the ear tag of the actual individual; but if you buy a can of meatballs, meat from
different provenances and individuals (and even from different species, such as pork)
may be mixed together with no possible traceability. However, this does not necessarily imply that
farmers do not get proper compensation (they may not, but that is beside the point
now). Corpus-based MT would be a bit like the untraceable canned meatballs of
translation.
In fact, Tsiavos et al. (2014) suggest that “risk mitigation strategies” to avoid
“expos[ing] the [language resource] processor to legal risks” may include the choice
to “[p]rovide the service” (for instance, the MT system) “rather than the data” (the
sentence-aligned parallel corpora). Arranz et al. (2013) also suggest that language
technology providers “[do] not offer any content or derivative content as such but
41. https://wacky.sslmit.unibo.it.
42. Tsiavos et al. (2014) describe in particular detail a number of scenarios in which language resources derived from web text are published.
43. Inputs and outputs in NMT are usually sub-word units produced by a trained segmenter based on byte-pair encoding (Sennrich et al. 2016) or the newer SentencePiece approach (Kudo and Richardson 2018).
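The general idea of sub-word segmentation can be illustrated with a greedy longest-match splitter over a toy vocabulary (much simpler than a trained BPE or SentencePiece model; both the vocabulary and the function are invented for illustration):

```python
# Illustrative greedy longest-match segmentation into sub-word units, in the
# spirit of (but far simpler than) BPE or SentencePiece. Toy vocabulary only.
def segment(word, vocab):
    """Split word into the longest vocabulary units, left to right."""
    units, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):    # try the longest piece first
            if word[i:j] in vocab or j == i + 1:
                units.append(word[i:j])       # unknown characters pass through
                i = j
                break
    return units

vocab = {"trans", "lat", "ion", "s", "un", "able"}
print(segment("translations", vocab))  # ['trans', 'lat', 'ion', 's']
```

A trained segmenter learns its vocabulary from corpus statistics instead, but the effect on traceability is the same: words dissolve into units shared across the texts of many different authors and translators.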
only services that do not replicate the material collected but only produce a service of
its processing [sic]”. In fact, while web-based commercial MT systems such as
Google Translate or Microsoft Translator are clearly based on text crawled from
the Internet, they have been basically free of legal challenge, even considering the
colossal commercial value they derive from MT.
44. As one of the anonymous reviewers of this chapter points out, and as an illustration of fair use in the United States, when the Authors Guild sued Google over their service Google Books in 2005, in an attempt to exercise their copyright, the judge (in 2013, after many attempts to reach a settlement) ruled against the authors, as it was considered that the crawling done by Google was a fair use and could help in the advancement of other language technologies (for details, see https://www.authorsguild.org/where-we-stand/authors-guild-v-google/ and https://www.jipitec.eu/issues/jipitec-5-1-2014/3908). Rulings like these fueled lobbying in Europe in favour of EU-level legislation authorising researchers to crawl data, rather than letting individual countries decide on it (at that time, only English-speaking countries in the EU allowed their researchers to freely crawl data under fair-use-like provisions). Current EU legislation (Directive (EU) 2019/790) partially reflects these requests.
or exploiting derivative works that do not fulfil the conditions of the European Union
Directive do occur without appreciable legal challenge, as it would be very difficult
to successfully substantiate the nature of the economic damage sustained by authors.
The recommendations set out by Arranz et al. (2013) and Tsiavos et al. (2014) to
minimise legal risk discussed above predate the Directive but remain relevant
nonetheless.
Translators as Authors
The fact that translators are authors, and should be considered as such, is hard to
dispute. They produce a new text which extends the reach of an otherwise inaccessible text to a new readership, and this requires a variety of cognitive processes:
interpreting the source text and then getting into the shoes of its readers to decide
on the best target-language rendering. The value they add to the original
text is the reason for the existence of translation as a profession. But it is also hard to
dispute that the depth or the intensity of authorship varies widely across text genres,
45. Such as an MD5 code, https://en.wikipedia.org/wiki/MD5.
46. Effectively a stand-off annotation of the World-Wide Web.
47. Project Paracrawl (http://paracrawl.eu) is developing and will release in 2021 a complete deferred crawling software suite as part of Bitextor (http://github.com/bitextor); however, following a suggestion by Heafield (2020), the new software will not specify any indication of position inside the document, but simply the URL of the document and a digest or checksum of the text segment, assuming a standard process to split the document into segments: after segmentation, the segment having the same digest will be recovered (it is possible, but utterly unlikely, for two different segments to have the same digest).
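The digest-based scheme described in this footnote can be sketched as follows (naive line-based segmentation and a placeholder URL; the actual Bitextor suite differs in detail):

```python
# Sketch of deferred crawling: instead of distributing text, one distributes
# the document URL plus a digest of each segment; the segment is recovered by
# re-crawling the document and matching digests. Segmentation here is naively
# line-based; a real suite would use a standard splitter.
import hashlib

def digest(segment):
    return hashlib.md5(segment.encode("utf-8")).hexdigest()

document = "First sentence.\nSecond sentence.\nThird sentence."
segments = document.split("\n")

# What a deferred corpus would store (the URL is a placeholder):
record = ("http://example.com/doc", digest(segments[1]))

# Recovery: re-split the (re-crawled) document and match the digest.
recovered = [s for s in segments if digest(s) == record[1]]
print(recovered)  # ['Second sentence.']
```

Note that the corpus itself never contains the text: only whoever re-crawls the original page can recover the segment, which is precisely what makes the approach legally safer.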
with artistic and literary texts towards one end and repetitive formulaic texts towards
the other end.
According to Article 2 of the Berne Convention, “Translations, adaptations,
arrangements of music and other alterations of a literary or artistic work shall be
protected as original works without prejudice to the copyright in the original work.”
In the same article, literary and artistic works are said to “include every production
in the literary, scientific and artistic domain, whatever may be the mode or form of its
expression.” So, clearly, the Berne Convention recognizes translators as authors and
establishes that they can also benefit from copyright.48
The production of sentence-aligned parallel corpora from texts found on the Internet
is usually an automated process. For example, one may use software such as
Bitextor49 or the ILSP Focused Crawler.50 In addition to the research and development effort put into creating this software, most of which has already been paid for
with European taxpayers’ money, the effort to set up a crawling job (identifying
websites, installing and configuring the software, etc.) and the cost of the computational resources involved are clearly non-negligible. Subsequent cleaning also
requires software that has to be installed,51 configured and run. Finally, the storage
and, optionally, the publication of the resulting sentence-aligned parallel corpora
again require sizable computational resources.
But value is also crucially added when each sentence is aligned with a sentence
in another document that is likely to be its translation. The resulting sentence-
aligned parallel corpora may be used as TMs in CAT environments, with substantial
savings in translation costs; as a result, the use of TMs has radically changed the way
in which professional translators work (limited access to context or TMs, work with
disordered pre-translated segments) and get paid (fuzzy-match discounts); see
Garcia (2007, 2009). Note that this commoditization of how TMs are used has led
companies such as TAUS to derive profit from setting up a “data marketplace”52
where people buy and sell TMs.
With further investment in research, development, installation, configuration and
computational resources, the sentence-aligned parallel corpora may be used to create
CBMT systems which may further cut translation costs through MT post-editing
workflows, and sometimes even drive translators out of areas in which machine-
translated text is good enough to be used ‘as is’ (probably in new applications
where professional translation had little or no penetration such as customer reviews
or other user-generated textual content).
48. https://www.ceatl.eu/translators-rights/legal-status.
49. https://github.com/bitextor.
50. http://nlp.ilsp.gr/redmine/projects/ilsp-fc.
51. Such as Bicleaner (https://github.com/bitextor/bicleaner).
52. https://datamarketplace.taus.net/.
The collection, curation and training efforts discussed above do add value to data
which was published after translators were compensated as agreed, ideally fairly.
When the customer hiring the translator clearly states in the contract that the
translation will be published, it may not be possible to say that the transaction
with the translator was not completed, but even when this was not explicitly stated,
fair compensation for the translator’s work should probably be enough.53 While
Moorkens and Lewis (2019) echo the view that this may leave “many [translators]
disempowered with regard to working conditions and repurposing of translated
work”, it is also true that many workers in the world are paid to produce goods or
services that are repurposed downstream to generate other goods or services of
additional value. Historically, workers have managed to organise in unions, guilds,
etc. protecting their interests and have used instruments such as industrial action
(e.g. a strike) to force negotiation of fair compensation for their work, and, to some
extent, to empower themselves. It is true that the nature of repurposing in the case of
translations is very elaborate and, in the case of MT, mediated by machine learning,
but, at the end of the day, translators are workers that could organise (and do
organise to some extent, usually in translators’ associations but less so in traditional
unions) to attain fair compensation for their work, particularly if one considers that
their work will ultimately be used to create translation technologies that change their
profession in ways that may not be acceptable to them. It is worth noting here that
Moorkens and Lewis (2019) advocate for a more collaborative approach and pro-
pose “a community-owned and managed digital commons [which] would ultimately
benefit the public and translators by making the industry more sustainable than at
present” and state that “there are several reasons for changing the current copyright
and data ownership conditions”.
This chapter has discussed the licensing and usage rights of the language data used to
train MT systems. There are two main kinds of language data: linguistic resources,
on the one hand, used mainly (but not exclusively) in RBMT, and corpora (mainly
sentence-aligned parallel corpora), on the other, used to train CBMT systems. It is argued that,
while the case of licensing and usage rights for linguistic resources such as dictio-
naries or rule sets may be considered settled, the case for sentence-aligned parallel
corpora needs closer examination; in particular, the case is most controversial when
these corpora are derived from text published on the Internet. Sentence-aligned
53. In fact, as one of the anonymous reviewers points out, leaving aside the case of literary works that become best sellers, or, in some jurisdictions, audiovisual content and videogames, most of the work done by translators is not subject to royalties.
References
Agić Ž, Vulić I (2019) JW300: a wide-coverage parallel corpus for low-resource languages. In:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL
2019), pp 3204–3210. https://doi.org/10.17863/CAM.44029
Arranz V, Choukri K, Hamon O, Bel N, Tsiavos P (2013) PANACEA project deliverable 2.4, annex
1: Issues related to data crawling and licensing [project deliverable]. http://cordis.europa.eu/
docs/projects/cnect/4/248064/080/deliverables/001-PANACEAD24annex1.pdf
Artetxe M, Labaka G, Aguirre E, Cho K (2018) Unsupervised neural machine translation. In:
Proceedings of ICLR 2018, the International Conference on Learning Representations. https://
openreview.net/forum?id=Sy2ogebAW
Bel N, Forcada ML, Gómez-Pérez A (2016) A maturity model for public administration as open
translation data providers. ArXiv:1607.01990. https://arxiv.org/pdf/1607.01990.pdf
Bowker L (2002) Computer-aided translation technology: a practical introduction. University of
Ottawa Press, Ottawa, ON
Cai J, Utiyama M, Sumita E, Zhang Y (2014) Dependency-based pre-ordering for Chinese-English
machine translation. In: Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short Papers), pp 155–160. https://www.aclweb.org/
anthology/P14-2026.pdf
Carlini N (2020) Privacy considerations in large language models. Blog post. https://ai.googleblog.
com/2020/12/privacy-considerations-in-large.html
Christodouloupoulos C, Steedman M (2015) A massively parallel corpus: the Bible in 100 lan-
guages. Lang Resour Eval 49(2):375–395. https://link.springer.com/article/10.1007/s10579-
014-9287-y
De Clercq O, Montero Pérez M (2010) Data collection and IPR in multilingual parallel corpora. In:
Proceedings of the Seventh Language Resources and Evaluation Conference (LREC 2010), pp
19–21. http://www.lrec-conf.org/proceedings/lrec2010/pdf/204_Paper.pdf
Du J, Way A (2017) Pre-reordering for neural machine translation: helpful or harmful? Prague Bull
Math Linguist 108:171–182. https://ufal.mff.cuni.cz/pbml/108/art-du-way.pdf
Forcada ML (2017) Making sense of neural machine translation. Transl Space 6(2):291–309
Forcada ML (2020) Building machine translation systems for minor languages: challenges and
effects. Revista de Llengua i Dret 73. http://revistes.eapc.gencat.cat/index.php/rld/article/
download/10.2436-rld.i73.2020.3404/n73-forcada-en.pdf
Forcada ML, Esplà-Gomis M, Pérez-Ortiz JA (2016) Stand-off annotation of web content as a
legally safer alternative to crawling for distribution. Baltic J Mod Comput 4(2):152–164
(proceedings of the 19th Annual Conference of the European Association for Machine
Translation, Riga, Latvia, May 30–June 1, 2016). https://www.bjmc.lu.lv/fileadmin/user_upload/lu_
portal/projekti/bjmc/Contents/4_2_8_Forcada.pdf
Garcia I (2007) Power shifts in web-based translation memory. Mach Transl 21:55–68. https://link.
springer.com/content/pdf/10.1007/s10590-008-9033-6.pdf
Garcia I (2009) Beyond translation memory: computers and the professional translator. J Special
Transl 12:199–214. https://www.jostrans.org/issue12/art_garcia.pdf
Heafield K (2020) Personal communication
Isabelle P, Dymetman M, Foster G, Jutras J-M, Macklovitch E, Perrault F, Ren X, Simard M (1993)
Translation analysis and translation automation. In: Proceedings of the 1993 conference of the
Centre for Advanced Studies on Collaborative research: distributed computing, vol 2, pp
1133–1147. http://www.iro.umontreal.ca/~foster/papers/trans-tmi93.pdf
Koehn P (2009) Statistical machine translation. Cambridge University Press, Cambridge, MA
Kudo T, Richardson J (2018) SentencePiece: a simple and language independent subword tokenizer
and detokenizer for Neural Text Processing. In: Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing: System Demonstrations, pp 66–71.
https://www.aclweb.org/anthology/D18-2012.pdf
Lample G, Conneau A, Denoyer L, Ranzato MA (2018) Unsupervised machine translation using
monolingual corpora only. In: Proceedings of ICLR 2018, the International Conference on
Learning Representations. https://openreview.net/forum?id=rkYTTf-AZ
Lee Y-S (2004) Morphological analysis for statistical machine translation. In: Proceedings of
HLT-NAACL 2004: Short Papers, pp 57–60. https://www.aclweb.org/anthology/N04-4015
Lee J, Cho K, Hoffmann T (2017) Fully character-level neural machine translation without explicit
segmentation. Trans Assoc Comput Linguist 5:365–378. https://www.aclweb.org/anthology/Q1
7-1026.pdf
Macken L, De Clercq O, Paulussen H (2011) Dutch parallel corpus: a balanced copyright-cleared
parallel corpus. Meta 56(2):374–390. https://doi.org/10.7202/1006182ar
Moorkens J, Lewis D (2019) Research questions and a proposal for the future governance of
translation data. J Special Transl 32:2–25. https://www.jostrans.org/issue32/art_moorkens.pdf
Pecina P, Toral A, Van Genabith J (2012) Simple and effective parameter tuning for domain
adaptation of statistical machine translation. In: Proceedings of COLING 2012, pp 2209–2224.
https://www.aclweb.org/anthology/C12-1135.pdf
Rafalovitch A, Dale R (2009) United Nations General Assembly resolutions: a six-language parallel
corpus. In: Proceedings of the MT Summit, vol 12, pp 292–299. http://clt.mq.edu.au/~rdale/
publications/papers/2009/MTS-2009-Rafalovitch.pdf
Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword
units. In: Proceedings of the 54th Annual Meeting of the Association for Computational
Linguistics, pp 1715–1725. https://www.aclweb.org/anthology/P16-1162.pdf
Simard M, Foster GF, Perrault F (1993) Transsearch: a bilingual concordance tool. Technical
report. Centre d’Innovation en Technologies de l’Information, Laval, QC. http://rali.iro.
umontreal.ca/rali/sites/default/files/publis/sfpTS93e.pdf
Tsiavos P, Piperidis S, Gavrilidou M, Labropoulou P, Patrikakos T (2014) QTLaunchpad public
deliverable D4.5.1: Legal framework. http://www.qt21.eu/launchpad/system/files/deliverables/
QTLP-Deliverable-4_5_1_0.pdf
Ziemski M, Junczys-Dowmunt M, Pouliquen B (2016) The united nations parallel corpus. In:
Proceedings of the Language Resources and Evaluation Conference (LREC’16), Portorož,
Slovenia, May 2016. https://conferences.unite.un.org/UNCorpus/Content/Doc/un.pdf
Chapter 5
Authorship and Rights Ownership
in the Machine Translation Era
literary or artistic work shall be protected as original works without prejudice to the
copyright in the original work (. . .)”.
It is, therefore, obvious that, as translation involves such a transformation of the
original text, it is the right of the holder of that text to authorise (or not) its translation.
This is also covered in Article 8 of the same Convention: “Article 8. Right of
Translation. Authors of literary and artistic works protected by this Convention
shall enjoy the exclusive right of making and of authorizing the translation of their
works throughout the term of protection of their rights in the original works.”
National regulations also contain this right to authorise the translation of a literary
work, as well as the consideration of the translation itself as an autonomous work. In
France, translation is regulated by Article L112-3 of the Code de la propriété
intellectuelle, which provides that the authors of translations, adaptations, transformations and modifications of protected works enjoy the protection granted by this Code
without prejudice to the rights of the author of the original work. The British
Copyright, Designs and Patents Act 1988 (CDPA) states in its section 21(3) that
translations are among the adaptations subject to the author's authorization: “In this
Part “adaptation”—(a) in relation to a literary work, other than a computer
program or a database, or in relation to a dramatic work, means—(i) a translation
of the work”. Similarly, U.S. copyright law (17 U.S. Code §101) considers translations to be derivative works.
The Spanish Intellectual Property Law, Royal Legislative Decree 1/1996, of
12 April (Ley de Propiedad intelectual, Real Decreto Legislativo 1/1996, de 12 de
abril), tells us in its Article 11 that translations are derivative works: “Sin perjuicio
de los derechos de autor sobre la obra original, también son objeto de propiedad
intelectual: 1.° Las traducciones y adaptaciones (. . .)”.1 In the aforementioned
Article, translations, as well as revisions, updates and annotations of works, compendiums, abstracts, extracts or musical arrangements are considered by law to be
transformations of literary or scientific works. Another feature is that such transformations may be remunerated not only by means of a percentage, but also by a lump sum.
We see, therefore, how translation is considered a genuine “work”, although
one of a dependent or secondary nature, subject to obtaining the corresponding
authorization from the right holder of the text to be translated. This
need to request permission for translation does not occur when the use of the original
text is permitted by intellectual property laws, for example, when we are dealing
with literary works in the public domain, or if they are legislative texts. But the
translation rights remain, and such translation remains an intellectual property work.
The translation author, the translator, holds the rights over their work. As stated
by Venuti (1995), “a translator can be said to author a translation because translating
originates a new medium of expression, a form for the foreign text in a different
language and literature”. Lee (2020) points out that, on the basis of the earlier rulings
in Byrne v. Statist (1914) and Walter v. Lane (1900), it was concluded that the
1. Without prejudice to the copyright on the original work, the following are also the object of intellectual property: 1. Translations and adaptations (...) [my own translation].
translator is “(. . .) an author as a matter of law in respect of the translated text – not a
joint author, but an author in his/her own right.”
This ownership by the translator occurs despite the inevitable lack of originality
of the derivative work, since it expresses (translates) an existing work. Obviously,
originality and creativity in translation must be located in the operation of
transforming the source text into another language, not in the form or content, which
by definition must be reproduced from the original. The translator’s work consists of a
kind of reinterpretation of the original work, although Lee (2020) believes that this is an
exaggeration of the role of the translator, who merely renders the original
text in another language.
The rights ownership of the translator holds even vis-à-vis the rights holder of
the original work; as Moorkens and Lewis (2020) state, the translation cannot be
used “without authorization from the translation copyright owner (. . .) the original
author may not use the translation as the basis for a further translation without
permission, and that the translation copyright owner retains rights even over translation of a text that is out of copyright”. Despite this, Venuti points out that in most
cases the translation rights have been transmitted to the holder of the rights to the
original work, either by means of a working contract, whereby the translator is an
employee, or by commissioning the translation work.
The universal principle in the field of intellectual property is that the original
ownership of rights to the literary or artistic work belongs to its author: the
authorship of the work carries as a consequence the ownership of rights. The same
applies in principle to the ownership of translation as a derivative work, in which the
rights correspond to the author-translator. However, this original attribution of rights
is a legally articulated mechanism to allow the subsequent transmission of rights
over the created work. Once there is a right—of the kind of ownership—to the work,
a firm starting point is created for the commissioning and distribution of the work in
the artistic, literary or audiovisual market. In the case of translation, authorship
allows the transfer of rights to the beneficiary of the translation, who will make
use of it with the ultimate goal of exploiting the work in its intended
application. Copyright regulation is therefore based on the recognition of an initial
point of rights ownership, which is the author, for an openly commercial purpose. In
the field of translation, this initial point will be the translator.
With the use of intelligent systems for the production of musical, artistic or
literary works, the current problem is that the machine takes on such a leading role
that the authorship of the human being is blurred until it almost disappears.
The solution chosen in such cases is to recognize authorship in the person who
has contributed significantly to the use of the creative intelligent system
(or co-authorship if several people have been involved). This idea
seeks to preserve the ownership of rights in the human being who is most directly
related to the creative result. Such a solution is justified in the legal doctrine
especially on the basis of section 9(3) of the UK Copyright, Designs and Patents Act
(1988): “Authorship of work. In the case of a literary, dramatic, musical or artistic
2. The Next Rembrandt project was led by the agency J. Walter Thompson for the ING bank. It had Microsoft and the Delft University of Technology as technology partners, as well as the Mauritshuis Museum and the Rembrandt House Museum. Taking the analysis of Rembrandt’s work as a starting point, in 2017 it was possible to create a new Rembrandt, the portrait of a man which replicates (rather than copies) the style of the painter.
76 M. L. Lacruz Mantecón
Today, the use of systems based on neural networks and corpus-based MT (CBMT)
yields better translation output. As Wahler (2019) says, since
2016 the Google Neural Machine Translation (GNMT) system “(. . .) . . . uses deep
machine learning to mimic the function of a human brain, and Google claims the tool
is sixty percent more accurate for English, Spanish, and Chinese than Google
Translate, which is phrase-based. . . the GNMT system no longer translates word-
by-word but instead translates entire phrases as units, a feature known as ‘soft
alignment’”. In spite of this, and in order to avoid complaints, Wahler considers
that the intervention of the human translator in post-editing the results remains
essential.
But progress in this field leads us to think that in the near future the quality
of MT may reach (if it has not already done so) that of the human translator.
This leads us to rethink the problem of the authorship of these “automatic translations”, because the supervisory work carried out today by the human translator will
not even take place. In fact, this is already an existing scenario: some
companies today choose to use MT without a human in the loop for
specific purposes. The same problem has arisen in the field of artistic and
literary creation, where genuine robotic authorship is denied by many jurists,
on the understanding that genuine creative activity is always human and is not
replicable by machines. Miernicki and Ng (2020) point out that the interpreters of the Berne
Convention, where there is a lack of a clear definition of the concept of “authorship”,
have repeatedly suggested that the convention refers only to human creators, and that
the minimum standard established by this Agreement applies only to works carried
out by human beings, excluding AI productions. They add that the US Copyright Act
provides protection for “original works” (17 U.S.C. § 102(a)), with the understanding that this refers only to creations made by human beings.
In this sense, Ginsburg and Budiardjo (2019) insist on rejecting even the authorship of AI systems that use neural networks and are trained by the use of deep
learning. They do so because the use of more sophisticated “learning” models does
not change their initial conclusion that machines are not genuinely “creative”.
Lanteri (2020) also warns that the United States Bar Association endorses Ginsburg's position just referred to, and that the International Confederation of Societies of Authors and Composers (CISAC) maintains a similar stance: the protected works are those in which AI is used merely as an additional support to human creativity, and these can be managed within the current copyright framework.
For his part, Calo (2016) reduces cybernetic creativity to the ability of a robot to be considered an "interpreter", which could place it within the scope of copyright protection over derived creative contributions, such as those of musicians or actors.
In short, genuine authorship of artistic works is denied when there is no significant human intervention in their creation. The demand is that sufficient human activity must have taken place to allow the human actor, and not the machine, to be qualified as the author. In other words, the authorship can be attributed to the person,
5 Authorship and Rights Ownership in the Machine Translation Era 77
as Article 9.3 of the British CDPA tells us, “the person by whom the arrangements
necessary for the creation of the work are undertaken”.
However, the fact is that in the current field of art, and predictably also in the
future of translation, intelligent systems allow for the creation of works without any
human activity other than pressing the “on” button. As Dornis (2021) says, in these
cases, it must be recognized that the system occupies “the driver’s seat”, as far as
creativity and technical innovation are concerned, and the results are “works without
an author”—human author, one might add. This is shared by Bridy (2012), who
believes that we must stop looking for a subterfuge and admit that the author of the
work is, without doubt, the AI system that creates such work. Galanter (2020) makes a similar point from the perspective of generative art, and suggests as appropriate an approach that treats the assignment of authorship not as a moral act but simply as a descriptive one. We would be morally obliged to acknowledge the authorship of the machine, not for the machine's own sake but for that of the human being, because otherwise our human social life, in this world of art, would be based on a lie.
The same applies to automatic translations.
We are therefore faced with a clash between factual reality and the legal requirement that the author of any work be a human being. However, I believe that the fact that all or most of the creative work is carried out by a machine is one thing, and that such a machine may hold the rights over the resulting creation is quite another. Moreover, as already discussed above, legal recognition of authorship is only a device to establish a firm basis for the recognition of rights, so that works can enter the art market and its distribution channels.
This is why in Spain Navas Navarro (2018) distinguishes between the "legal author", the natural or legal person who commissioned the work or used the system, and the "material author", which would be the "robot machine". Duque Lizarralde (2020) points out that in works generated by intelligent systems the creator is in fact the system, which is not an entity but an object, and under the aforementioned normative frameworks cannot be considered an author. However, it is also evident from the regulations that authorship corresponds only to the effective creator of the work.3
Ramalho (2017) acknowledges AI as the author in factual terms, but questions
whether it should be the author in legal terms.4 Clearly, no: The only way to proceed
is to transfer ownership to the human being who produces the machine output
results. The solution is to assign the rights ownership to the human being by
means of a “legal authorship” (or fictitious authorship), and not to the machine.
3 "El creador de hecho es el sistema, que no es un ente sino un objeto, y bajo los marcos normativos citados no puede considerarse autor. Sin embargo, también se deduce de la normativa que la autoría corresponde únicamente al creador de hecho de la obra" (The effective creator is the system, which is not an entity but an object, and according to the cited legal frameworks it cannot be considered an author. Nevertheless, it can also be inferred from the legislation that authorship corresponds solely to the factual creator of the work—my own translation).
4 "En otras palabras, la IA es el autor en términos fácticos, pero ¿debería ser el autor en términos legales?" (In other words, AI is the author in factual terms, but should it be the author in legal terms?—my own translation).
78 M. L. Lacruz Mantecón
A different proposal was echoed a few years ago: acknowledging a kind of legal personality for advanced intelligent systems, similar to that of companies and other "moral persons". This idea became particularly well known following the European Parliament resolution of 16 February 2017, which included recommendations to the Commission on civil law rules on robotics, and which proposed to manage the liability of intelligent systems by attributing an electronic personality to them (with self-driving vehicles primarily in mind). In this sense, authorship could be assigned to the MT system itself.
But accepting that a machine has a personality is very difficult, given its lack of consciousness and authentic subjectivity, as I explain in a previous work (Lacruz Mantecón 2020). Moreover, the ownership of rights would necessarily have to be managed by humans, which makes this solution useless. Precisely for these reasons, such an electronic personality is rejected by most legal specialists. Furthermore, the most recent European texts abandon this idea, which only sought to solve the problem of liability. The European Parliament resolution
of 20 October 2020 on intellectual property rights for the development of artificial
intelligence technologies (2020/2015(INI)) expressly states in its Recital 13 that
although the process of automatic generation of artistic content raises problems
related to the ownership of rights, the European Parliament considers “(. . .) that it
would not be appropriate to seek to impart legal personality to AI technologies and
points out the negative impact of such a possibility on incentives for human
creators”.
5.6.1 Approach
The approach defended here is based on the separation between authorship and ownership of rights, so that the rights necessarily correspond to a human being, call them a legal author or simply a rights holder.
Taking as a starting point Article 9.3 of the British CDPA, this human being will
be the one who has made the necessary arrangements to obtain the output of the
machine, even when such operations are often reduced to commissioning the MT system and sending it the texts for translation. In short, we are
trying to find the “human behind the machine”, the human who makes the machine
work or who just takes advantage of its output. Who could be this person that holds
the best relationship with the machine and hence is in a position to claim rights over
its output? Fernández Carballo-Calero (2021) points out that in the specific field of
computer-generated works, the potential candidates are four: (1) the author of the
program; (2) the user of the program; (3) the program; and (4) none.5 We have
already seen that the program—the machine—has no personality and therefore no ability to own rights. Let us now look at the various human candidates.
Attributing the rights over the system's results (output) to the author of the program is the
classic solution. Rogel Vide (1984) notes that as a result of a meeting held in Geneva
in 1979 between WIPO and UNESCO a Working Group was established that drafted
a Report which, inter alia, dealt with "copyright problems arising from the use of computers for the creation of works". One of these problems was the authorship of computer-generated works, a matter in which the principle was affirmed that the copyright owner of such works could not be the computer itself, but the person who triggered the creation. This solution is still maintained today by
many authors, such as Ginsburg and Budiardjo (2019), who believe that in fully
generative systems the ownership of the created works corresponds to the “program-
mer-designer” of the machine.
Likewise, Bridy (2012) follows Judge Holmes's assertion of authorship based on the inherent uniqueness of human personality. She points out that the law cannot confer the copyright in an artificially generated work of art on the work's author, because that author is, in fact, a generative software program with no legal personality. Therefore, the programmer of the generative software is the logical rights holder of the works generated by their software: they would be "the author of the author of the works". As Ramalho (2017) points out, this was also the position taken by the English courts in Nova Productions Ltd v Mazooma Games Ltd and Others (2007), in which the court attributed ownership to the programmer and designer of a video game.
5 "En el ámbito específico de las computer-generated works se ha señalado que los "sospechosos habituales" son cuatro: (1) el autor del programa; (2) el usuario del programa; (3) el programa y (4) ninguno" (In the specific field of computer-generated works, it has been noted that the "usual suspects" are four: (1) the author of the program; (2) the user of the program; (3) the program; and (4) none—my own translation).
This range of potential candidates would include all participants in the design of the MT system, from programmers to data selectors and "trainers" of the intelligent system by means of deep learning. The norm will be plural authorship of the systems and hence subsequent co-authorship. Of course, if the designers act on behalf of a company, the rights owner will be the company they work for.
However, despite its popularity, this attribution of rights to the programmer or designer is now being rejected because, as Carrasco Perera and del Estal Sastre (2017) explain, it implies overprotection, since program creators already have an exclusive right to their creations and obtain their remuneration by licensing their programs. These authors state that the owner (author) of intellectual property rights in a computer program neither requires nor deserves additional protection as, at the same time, the author of the work resulting from the use of the program.6 In our case, such resulting work would be the output translation. The same idea is found in American doctrine. Yu (2017) claims that allowing the programmer to have not only their software but also any result thereof protected by copyright is an excessive reward for their efforts and invites the accumulation of copyright.
Holder et al. (2019) point out that, in the future, when the machine acquires an
autonomy that allows it to generate creations of its own, the only human intervention
will be the initiation of the process and the establishment of the requirements for the
work, something that the authors describe as “interaction” with the machine. As they
point out, in these cases the robot's manufacturer-programmer has no intervention
whatsoever in the creative result. In such cases, the user and owner of the machine would be the only one to hold the rights over the work, because they are the only one who intervenes in its creation.
Yu (2017) also believes that the assignment of rights should be made to the end
user of the machine, because it makes more sense both from a social policy point of
view as well as from an economic point of view. And this is because the end user
ultimately determines whether a machine-created work is produced, and so it is this
end user who must connect with the interests of the general public to introduce the
works into the market. If, through copyright, we want to encourage the production of
more creative works, it seems better to grant copyright protection to the end user of
the system. In addition, this attribution of rights to the user would be effective in encouraging the acquisition of licences and the development of better MT systems. Yu (2017) illustrates this by saying that, in the analogue world, this would be the equivalent of asking whether copyright should be attributed to the pen manufacturer
6 "El titular (autor) de derechos de propiedad intelectual sobre un programa de ordenador no requiere ni merece una protección suplementaria, como autor al mismo tiempo del opus resultante de la aplicación del programa."
or to the writer. He then questions why this ambiguity should prove to be problem-
atic in the digital world and uses Microsoft Word as another example: Microsoft
created the Word software, but obviously does not own all the work done with that
software.7
In the event that the machine produces several alternative results such as multiple
parallel translations, or that a post-editing of the resulting text is required, a new
subject must be added: The person who makes the choice between the various output
results or the person who performs the post-editing of the text. It will normally be the
user themselves who makes such a choice or performs the post-editing, and hence
there will be no discussion as to their rights ownership. But if we are faced with
different people, as will be the case if these tasks are outsourced to a third party, this
ownership will have to be reconsidered. For these cases Ginsburg and Budiardjo
(2019) recover a theory that was applied to acknowledge the photographer’s title as
the author, as opposed to the mere photo taker employed by the photographer and
who just followed their instructions. This is the so-called “theory of adoption”, and it
prescribes that in these cases part of the creative work of the author consists of the
choice of the results, that is, of the photographs that they consider worthy of their
genius, which results in authorship by adoption. Through the “adoption theory”,
authorship is attributed to the photographer-planner when random forces intervene in
the results, because it is the photographer who “adopts” (or rejects) these results.
This idea is perfectly applicable to the translator who chooses from the various
possibilities presented to them by the machine, a specialised human intervention that
also grants them authorship and rights over the final translation.
The ownership of user rights could be articulated through the creation of a sui
generis8 right, as Sanjuán Rodríguez (2020) proposes in Spain. She considers that
the best thing would be the construction of a sui generis right or a related right in line
with the one referred to in article 129.2 of the Spanish Intellectual Property Law.
Seeking investment protection, Ramalho (2017) also advocates for the granting of a
related or sui generis right similar to that of database manufacturers. This would be a
regime similar to the law of the publisher of unpublished works as prescribed by the
Directive 2006/116/EC of the European Parliament and of the Council of
12 December 2006 on the term of protection of copyright and certain related rights,
whose Article 4 grants publishers 25 years of protection.
7 "En el mundo analógico, esto es como preguntarse si el derecho de autor debería atribuirse al fabricante de una pluma o al escritor. Entonces, ¿por qué tendría que ser problemática esta ambigüedad en el mundo digital? Tomemos el caso de Microsoft Word. Microsoft creó el programa informático Word, pero evidentemente no es titular de todos los trabajos realizados con ese software."
8 Sui generis is a Latin expression meaning "of its own kind". In intellectual property law it is used to refer to the rights of those not covered by authorship of an intellectual or artistic work, such as a film producer or the entrepreneur behind the creation of a database.
If the minimum user activity of operating the machine and collecting the results is not considered sufficient to grant the user the rights to the generated translation, the generated work would consequently fall into the public domain. This would be due to the lack of human creative activity, which means that the results of MT are not considered works protected by copyright. Ginsburg and Budiardjo (2019) believe that, since authorship and the subsequent rights ownership are exclusively referable to human beings, a work falling into the public domain is the consequence of a lack of sufficient human intervention in the creative process.
The public domain would be what Fernández Carballo-Calero (2021) denotes as
the "none" owner. As Ríos Ruiz (2001) points out, this is a classic solution, already advocated by WIPO professor Daniel Gervais in the early 1990s. On the basis of the autonomous generation of authorial and artistic productions, Gervais holds that the works thus generated fall into the public domain because they are not intellectual creations made by human beings, and hence international copyright does not protect them.
Mezei (2020) also advocates for the public domain, and notes that although this
solution will lead to the loss of incentives for some, he deems it more appropriate to
follow the proposal by Victor Palace, who asserted that the AI industry is likely to
continue to flourish regardless of copyright, as it has done so far, due to the inherent
incentives in the AI industry itself. He proposes the creation of a special category of
authorless works, which are already “born” in the public domain.
However, this does not mean that the results of the system, the translations, are left without any protection: alternative means, such as unfair competition rules, can be used. Mezei (2020) refers to a Japanese legislative proposal to introduce
an intellectual property regime for works not created by humans and which would
therefore be applied to the results generated by AI. According to this proposal,
instead of expanding the copyright system, the regulatory body will analyse a
framework that handles AI-created works in a manner similar to trademarks,
protecting them from unauthorised use through legislation that prohibits unfair
competition. The proposal goes on to explain that in light of the ability of
AI-based systems to create a huge amount of work in a short time, the plan is to
provide protection only to those works that reach a certain degree of popularity or
otherwise maintain market value. In particular, and as Scheuerer (2021) explains,
unfair competition legislation can intervene effectively in the field of intellectual
property in three dimensions: First, establishing a general regulatory paradigm for an
approach to the protection of intangible assets that is adapted to the data economy.
Second, in a de lege ferenda9 dimension, proposing it as an alternative to the introduction of new intellectual property rights in cases of uncertainty.10
9 In legal texts, de lege ferenda refers to matters that should be regulated by law in the future.
The intelligent system, when translating, returns the translated text as the first result.
But there are other associated results that are less visible.
For starters, when loading the source text, it is possible to extract personal data relating to the subject referred to in the text, or even the biometric data of the individual who inputs his or her speech into a translation system, and even data on attitudes or emotional states extracted through paralinguistic
communication. As Trancoso (2022) says in another chapter in this volume, when
sending a biometric signal such as speech to a remote server for processing, “. . . this
input signal reveals, in addition to the meaning of words, much information about
the user, including his/her preferences, personality traits, mood, health, political
opinions, among other data such as gender, age range, height, accent, etc. Moreover,
the input signal can be also used to extract relevant information about the user
environment, namely background sounds”. That is, through the MT mode called
speech-to-speech MT (S2SMT), this whole series of personal data can be obtained.
However, in the case of personal data, their processing and use must be authorised by the person from whom they have been extracted, so it is this consent that determines the legitimacy of the use of such data. See in this sense Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data, whose Article 6 states: "Lawfulness of processing. 1. Processing shall be
lawful only if and to the extent that at least one of the following applies: (a) the
data subject has given consent to the processing of his or her personal data for one or
more specific purposes. . .”. However, Trancoso (2022) proposes a simpler solution:
Protecting the user by encrypting and anonymizing their communication by means
of spoofing algorithms.
On the other hand, the ethical guidelines that are being imposed in the vast field of
AI require all participants to use personal data appropriately. As Parra Escartín and
Moniz (2019) warn, in the case of translation: “Issues such as who has access to the
data, who is the data curator and manager, how is the data processed and where and
how it is stored are key prior to establishing any translation workflow to ensure that
all parties are protected from potential data and privacy breaches, or even potential
10 Alternative to the introduction of new intellectual property rights in cases of uncertainty (my translation).
11 In legal texts, lege data refers to what is already enforced by law.
The creation and adaptation of linguistic resources (which must be computer-readable) is the work of experts, who see their work protected by copyright. In particular, and by
12 Note that this is what Gow (2007) advocates. In his view, the creation and feeding of a TM over time would not be considered an investment. However, this is a field still to be regulated, and different parties may view this matter from different angles.
13 Article 12 of the Spanish Intellectual Property Law establishes the same.
14 https://tatoeba.org/en/ (Accessed 28 October 2021).
under copyleft or creative commons licenses, on the other hand, would be accessible,
but generally they deny subsequent non-free uses of the data.
Nevertheless, the text processing applied to both source and target texts, particularly their segmentation, makes it virtually impossible to make any claim over the non-consented use of protected data. As Forcada (2023) warns, segmentation is equivalent to feeding the text into a document shredder, which transforms it into a series of paper strips that mix with strips from other shredded documents. As the author points out, the result is that it is extremely difficult to rebuild substantial parts of the original document, which makes it almost impossible to demonstrate that a non-consented exploitation of the text has occurred, let alone that compensation is due. He also adds two possible exceptions to the
need to obtain authorisation from the rights holders. The first is that the use of text fragments could be covered by the "fair use" exception in Anglo-Saxon law. The
second is that the processing of texts can be understood as “data mining”, an
operation permitted by the Directive (EU) 2019/790 of the European Parliament
and of the Council of 17 April 2019 on copyright and related rights in the Digital
Single Market: “Article 3. Text and data mining for the purposes of scientific
research. 1. Member States shall provide for an exception to the rights provided
for in Article 5(a) and Article 7(1) of Directive 96/9/EC, Article 2 of Directive 2001/
29/EC, and Article 15(1) of this Directive for reproductions and extractions made by
research organisations and cultural heritage institutions in order to carry out, for
the purposes of scientific research, text and data mining of works or other subject
matter to which they have lawful access. . .”.
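Forcada's shredder analogy can be made concrete with a small sketch (my own illustration, not taken from his chapter; the `segment` function and the sample texts are invented for this purpose). Once sentences from many documents are pooled into a single translation-memory-style collection and shuffled, the document boundaries and sentence order that would identify any one source text are gone, which is precisely what makes rights claims so hard to substantiate.

```python
import random

def segment(text):
    # Naive sentence segmentation on ". " — a toy stand-in for a real segmenter.
    return [s.strip(" .") for s in text.split(". ") if s.strip(" .")]

doc_a = "The quick brown fox jumps. It lands on the lawn. The dog sleeps on."
doc_b = "Contracts bind the parties. Courts interpret them. Remedies follow breach."

# Pool the segments from both documents, as a corpus or TM build would,
# keeping no record of which document each segment came from.
pool = segment(doc_a) + segment(doc_b)
random.shuffle(pool)

# Individual segments survive, but document boundaries and order do not:
# recovering either original text means guessing among the permutations of
# interleaved "strips", as with shredded paper.
print(len(pool))  # 6
```

Each segment is still searchable in isolation, but demonstrating that a *substantial part* of a particular protected document was exploited requires reconstructing its order from the pool, which the shuffling has destroyed.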
To sum up, the rights over original texts and their translations can only be effectively protected if those texts have not left the possession of their holders. Once such texts become accessible via web pages or repositories, they become susceptible to tracing, location and extraction for segmentation by web crawlers. And once a text has been segmented, any rights claim becomes extremely difficult. We can also conclude that professional translation work today cannot
survive without the use of MT instruments, for reasons of competitiveness. As
early as 2009, Garcia (2009) said that “as soon as 2010, translation will be pushed
into simple MT post-editing, (..). Translators will still be needed, but their working
conditions into the next decade will be quite dissimilar to those of the nineties.” And
of course, and to conclude, the evolution and progress of MT systems will lead to a
reduced need for human translators. This was the case with the arrival of the railway:
the number of stagecoach drivers decreased, but the number of train drivers
increased.
better degree of quality in its results. But this increase in quality means that, on the one hand, the translator is in some cases replaced by the tool and, on the other, that competition is increasingly fierce: productivity has increased, allowing more services to be offered for the same remuneration, which sometimes places translators in difficult situations.
Today the profession is still in this tense situation, but translators are aware that
systems are becoming increasingly better precisely because they themselves are
helping to increase the quality of MT systems. The image that comes to my mind
is that of a person riding on the back of a tiger: They cannot fall from their saddle or
descend from it, because the tiger will devour them. This is nowadays the translator's
dilemma with MT technology.
The solution is, of course, to continue riding the tiger. I know virtually nothing
about neural networks, but I do know that they need to be “trained,” and that they
need continuous readjustments: This may be a new niche for the translator’s work. I
also know that the large text-generating companies will continue to need profes-
sional translation services, and that to get the exact translation of the ambiguities of
any language a brain that understands these ambiguities is required, because only
such a brain will be able to discern the context in which they are generated and the
nuances associated to them.
This is precisely one of the weaknesses of any AI system. Such systems are usually developed to act in a very narrow context, and hence have problems understanding double meanings, contradictions or absurd claims. We are faced with one of the limitations of intelligent systems already warned of by the AI pioneers, who soon realised that self-referential claims (such as those referred to in Gödel's theorem) could not be processed by an intelligent system. Here the human being has an enormous advantage: contradictions pose no problem, since they are accepted as a fairly frequent human trait. Transposed to the field of translation, the human translator therefore has this enormous advantage over the machine.
Finally, with regard to the unauthorised use of translations carried out through the fragmentation of texts, and the lack of remuneration that translators suffer from such exploitation, the solution may come from copyright. In fact, in view of the unremunerated use of protected works that occurred through the copying of literary works, musical performances or films by means of photocopiers, tape recorders or video devices, and later also CD recorders or computer memories, the law reacted by imposing equitable compensation for the authors, editors, producers or performers whose works were copied.
This is the system of “equitable compensation” by private copy, which the
Spanish Intellectual Property Law regulates in its article 25: “1. La reproducción
de obras divulgadas en forma de libros o publicaciones que a estos efectos se
asimilen mediante real decreto, así como de fonogramas, videogramas o de otros
soportes sonoros, visuales o audiovisuales, realizada mediante aparatos o
instrumentos técnicos no tipográficos, exclusivamente para uso privado, no
profesional ni empresarial, sin fines directa ni indirectamente comerciales, de
Acknowledgements The present work has been carried out under the project “Derecho e
inteligencia artificial: nuevos horizontes jurídicos de la personalidad y la responsabilidad
robóticas”, IP. Margarita Castilla Barea, (PID2019-108669RB-100/AEI/10.13039/501100011033).
References
Berne Convention for the Protection of Literary and Artistic Works of September 9, 1886
Bridy A (2012) Coding creativity: copyright and the artificially intelligent author. Stanford Technol
Law Rev 2012:5. http://stlr.stanford.edu/pdf/bridy-coding-creativity.pdf
Calo R (2016) Robots in American Law (February 24, 2016). University of Washington School of
Law Research Paper No. 2016-04. Available https://ssrn.com/abstract=2737598
Carrasco Perera Á, del Estal Sastre R (2017) Art. 5. In: Bercovitz R (ed) Comentarios a la Ley de
Propiedad Intelectual, 4th edn. Tecnos, Madrid
Directive 2006/116/EC of the European Parliament and of the Council of 12 December 2006 on the
term of protection of copyright and certain related rights
15 1. The reproduction of works disseminated in the form of books or publications assimilated to them for these purposes by royal decree, as well as of phonograms, videograms or other sound, visual or audiovisual media, carried out by means of non-typographical technical apparatus or instruments, exclusively for private use, neither professional nor business-related, without direct or indirect commercial purposes, in accordance with Article 31(2) and (3), shall give rise to an equitable compensation . . . (my translation).
16 https://www.cedro.org/.
Dornis TW (2021) Of 'Authorless Works' and 'Inventions without Inventor' – the muddy waters of 'AI autonomy' in intellectual property doctrine. European Intellectual Property Review (E.I.P.R.) 2021
Duque Lizarralde M (2020) Las obras creadas por Inteligencia Artificial, un nuevo reto para la
propiedad intelectual. In Pe. i.: Revista de propiedad intelectual, N° 64
European Parliament resolution of 16 February 2017 with recommendations to the Commission on
Civil Law Rules on Robotics (2015/2103(INL))
European Parliament resolution of 20 October 2020 on intellectual property rights for the devel-
opment of artificial intelligence technologies (2020/2015(INI))
Fernández Carballo-Calero P (2021) La propiedad intelectual de las obras creadas por inteligencia
artificial. Aranzadi Thomson Reuters, Cizur Menor
Forcada ML (2023) Licensing and usage rights of language data in machine translation. In:
Moniz H, Escartín CP (eds) Towards responsible machine translation. Ethical and legal con-
siderations in machine translation. Springer International Publishing, Heidelberg
Galanter P (2020) Towards ethical relationships with machines that make art. In: West B (ed) AI,
arts & design: questioning learning machines. Artnodes, no. 26, 2020. UOC. https://doi.org/10.
7238/a.v0i26.3371
Garcia I (2009) Beyond translation memory: computers and the professional translator. J Spec
Transl 12:199–214
Ginsburg JC, Budiardjo LA (2019) Authors and machines. (August 5, 2018). Columbia public law
research paper No. 14-597. Berkeley Technol Law J 34(2):61–62. https://doi.org/10.2139/ssrn.
3233885
Gow F (2007) You must remember this: the copyright conundrum of “translation memory”
databases. Can J Law Technol 6(3):175–192
Holder C, Khurana V, Hook J, Bacon G, Day R (2019) Robotics and law: key legal and regulatory
implications of the robotics age (Part II of II). Comp Law Secur Rev 32:2016
Lacruz Mantecón ML (2020) Robots y personas. Una aproximación jurídica a la personalidad
cibernética. Editorial Reus, Madrid
Lacruz Mantecón M (2021) La ética de los agentes cibernéticos (una ética de plástico para seres de
plástico). Paper presented at the XXVII Congreso Internacional Derecho y Genoma Humano,
Bilbao
Lanteri P (2020). La problemática de la Inteligencia Artificial y el Derecho de autor llama a la puesta
de la OMPI. Cuadernos jurídicos: Instituto de Derecho de Autor, 15 ° aniversario / Díez
Alfonso (dir.), p 19
Lee TK (2020) Translation and copyright: towards a distributed view of originality and authorship.
Translator. https://doi.org/10.1080/13556509.2020.1836770
Ley de Propiedad intelectual, Real Decreto Legislativo 1/1996, de 12 de abril.
Mezei P (2020) From leonardo to the next rembrandt – the need for AI-Pessimism in the age of
algorithms (July 24, 2020). Arch Med Medienwissensc 2:390–429. https://doi.org/10.5771/
2568-9185-2020-2-390
Miernicki M, Ng I (2020) Artificial intelligence and moral rights. AI Soc. https://doi.org/10.1007/
s00146-020-01027-6
Moorkens J, Lewis D (2020) Copyright and the reuse of translation as data. In: O’Hagan M (ed) The
Routledge handbook of translation and technology. Routledge, London, pp 469–481
Navas Navarro S (2018) Obras generadas por algoritmos. En torno a su posible protección jurídica.
Rev Derecho Civil 5:273–291
Nova Productions Ltd v Mazooma Games Ltd & Others (2007) EWCA Civ 219, Case No:
A3/2006/0205
Parra Escartín C, Moniz H (2019) Chapter 7. Ethical considerations on the use of machine
translation and crowdsourcing in cascading crises. In: O’Brien S, Federici FM (eds) Translation
in cascading crises. Routledge, London
Ramalho A (2017) Will robots rule the (artistic) world? A proposed model for the legal status of
creations by AI systems. SSRN Pap 2017:2987757. https://doi.org/10.2139/ssrn.2987757
92 M. L. Lacruz Mantecón
Regulation EU 2016/679 of the European Parliament and of the Council of 27 April 2016 on the
protection of natural persons with regard to the processing of personal data
Ríos Ruiz WR (2001) Los sistemas de inteligencia artificial y la propiedad intelectual de las obras
creadas, producidas o generadas mediante ordenador. Rev Propiedad Mater 3:5–13
Rogel Vide C (1984) Autores, coautores y propiedad intelectual. Tecnos, Madrid
Sanjuán Rodríguez N (2020) La inteligencia artificial y la creación intelectual: ¿está la propiedad
intelectual preparada para este nuevo reto? La Ley mercantil, N°. 72 (septiembre)
Scheuerer S (2021) Artificial intelligence and unfair competition – unveiling an underestimated
building block of the AI regulation landscape. GRUR Int 2021:8–10. https://doi.org/10.1093/
grurint/ikab021
Trancoso I (2022) Treating speech as personable identifiable information—impact in machine
translation. In: Moniz H, Parra Escartín C (eds) Towards responsible machine translation.
Ethical and legal considerations in machine translation. Springer International Publishing,
Heidelberg
Topping S (2000) Sharing translation database information: considerations for developing an
ethical and viable exchange of data. Multiling Comput Technol 11(5):59–61. Available online:
https://multilingual.com/all-articles/?art_id=1105. Accessed 12 Nov 2018
US Copyright Act (17 U.S.C.) (n.d.)
Venuti L (1995) Translation, authorship, copyright. Translator 1:1–24. https://doi.org/10.1080/
13556509.1995.10798947
Wahler ME (2019) A word is worth a thousand words: legal implications of relying on machine
translation technology. Stetson Law Rev 48:109
Way A (2013) Traditional and emerging use-cases for machine translation. Paper presented at
translating and the computer 35, London
Way A (2018) Quality expectations of machine translation. In: Moorkens J, Castilho S, Gaspari F,
Doherty S (eds) Translation quality assessment: from principles to practice. Springer Interna-
tional Publishing, Heidelberg, pp 159–178. https://doi.org/10.1007/978-3-319-91241-7_8
WTO Agreement on Trade-Related Aspects of Intellectual Property Rights (TRIPS) 1995
Yu R (2017) The machine author: what level of copyright protection is appropriate for fully
independent computer-generated works? Univ Pa Law Rev 1245:1260. Available https://
scholarship.law.upenn.edu/penn_law_review/vol165/iss5/5
Part II
Responsible Machine Translation from
the End-User Perspective
Chapter 6
The Ethics of Machine Translation
Post-editing in the Translation Ecosystem
C. Rico (✉)
Universidad Complutense de Madrid, Madrid, Spain
e-mail: [email protected]
M. del Mar Sánchez Ramos
University of Alcalá, Madrid, Spain
e-mail: [email protected]
and technology. Routledge, London, p 320, 2019), comprising different, yet com-
plementary, tasks and procedures: as a separate service with its own international
standard, a dynamic activity that goes beyond the static cleaning of MT outputs, and
a task associated by default with lower quality expectations. The instability of MTPE
as a concept leads to the discussion of human agency in the MTPE process, and the
exploration of the extent to which translators are able to intervene in the use of MT in
MTPE. Furthermore, the analysis of the different degrees of human control triggers
diverse issues in the ethics of MTPE. This chapter explores such issues in the light of
the translation ecosystem, analysing three specific ethical dilemmas: (a) Dilemma
#1: the post-editor’s status; (b) Dilemma #2: the post-editor’s commitment to
quality; and (c) Dilemma #3: digital ethics and the post-editor’s responsibility.
Rather than offering a set of closed conclusions, the chapter should be read as an
invitation to the reader to think about key ethical elements and the way MTPE is
affecting the translator’s work.
6.1 Introduction
The latest advances in machine translation (MT) testify to the rapid emergence of
automated translation tools. While each advance in MT (e.g., neural MT) represents
progress, each step forward substantially transforms the translator’s tasks. Machine
Translation post-editing (MTPE), for instance, is an activity in which changes are made
to translations produced by an MT system in order to meet previously established quality
standards (Allen 2003). Although its relevance is sometimes underestimated, MTPE
has always been part of MT: “While mostly a matter of language, it is clear from
early records that post-editors—and pre-editors—have a peripheral role in the
MT-based translation process” (Vieira et al. 2019, p. 3). However, similar to MT,
MTPE has evolved from being a minor task to becoming a translation necessity. It is
around the second decade of the twenty-first century that MTPE emerged as a
discipline of its own in professional workflows, with specific discussions in profes-
sional forums, as part of specialised training courses or mentioned in academic
journals. Accordingly, during this period, translation companies began to offer it
as a value-added service, and the first job offers began to appear (Sánchez Ramos
and Rico Pérez 2020). In any case, it is important to highlight that MTPE has been a
field of research for a longer period of time with pioneers in the field like Krings
(2001), O’Brien (2002, 2011) and Guerberof (2012) focusing on how MT was
impacting translators.
The evolution of MTPE is also linked to the constant technology-induced
changes in the daily work of the translator and the translation process. MTPE has
evolved from being a static task (Vieira 2019), regarded as an activity
performed once the machine has produced its output, to becoming an activity
wherein translators interact with adaptive MT systems 1 as the final version of the
text is being generated. However, although MTPE tasks have gained some recogni-
tion within the translation sector, many studies indicate that translators have a
negative perception of MT for various reasons, ranging from the low quality of
MT systems (Vieira et al. 2019) to the effects of MT on the translation market (Vieira
2020). MTPE has also garnered significance following its emergence as an academic
field of inquiry and inclusion in different university syllabi, as further shown by
research studies on the profile of post-editors and their competencies (Rico and
Torrejón 2012; Sánchez Gijón 2016; Guerberof and Moorkens 2019; Konttinen et al.
2021) or on differences between MTPE, revision, and editing (do Carmo and
Moorkens 2021).
Other studies have focused on MTPE efforts (Krings 2001) and the quality of
MTPE, two key concepts, since the possible benefits of MTPE in professional
translation settings depend on them (Vieira 2019). For example, a heavily debated
topic is whether MTPE should be considered a monolingual or bilingual activity,
which in turn depends on whether the original text is available together with the MT
output. Although the consensus is that MTPE is a bilingual task, various studies
point to the benefits of monolingual MTPE, provided that the MT output is
post-edited by a subject-matter expert (Schwartz 2014). However, other
studies (Mitchell et al. 2013; Nitzke 2016) only partly corroborate these findings,
demonstrating that monolingual post-editing improves fluency but not adequacy.
Another key concept related to MTPE is productivity (Plitt and Masselot 2010),
which is also linked to quality (Guerberof 2014). 2 This may even be the main reason
why MTPE tasks are now in such high demand. Although MT followed by post-
editing increases the productivity levels of a company and of an individual transla-
tor, it may adversely affect quality. Furthermore, whereas neural MT minimises
major grammatical errors, the resulting text may not read like natural language.
Several studies have been conducted to identify the differences between MT
followed by post-editing and human translations. Most recently, Sánchez-Gijón
(2020) conducted a study to determine the factors to be considered in addition to
those expected by the industry, such as productivity, when evaluating MTPE aimed
at providing results with quality matching that of human translation. The author
concludes that using MT followed by MTPE jeopardises the naturalness of the target
text, despite the very low error rate. In her words, “reaching human quality means
going beyond the grammatical correction that machine translation systems aspire to
achieve, which is currently provided through MTPE” (Sánchez-Gijón 2020, p. 98).
She adds that "the overestimation of machine translation is probably its most
denounced aspect among professionals” (Sánchez-Gijón 2020, p. 98).
1 Adaptive MT allows an MT system to learn from corrections on the fly, as the post-editor makes them.
2 As we will see later when discussing dilemma #2, the question of which quality is to be delivered in MTPE raises some important concerns in ethical terms.
This brief account of MTPE 3 highlights some of the many ways in which this
field of activity is transforming translation, bringing changes which, in turn,
give rise to ethical considerations. In the present chapter we approach these
from the perspective of the translation ecosystem as described by Krüger (2016a, b).
With a view to adapting Krüger’s model for the purposes of MTPE we will first
review each of the components that make up the translation ecosystem—cooperation
partners, social factors, artefacts and psychological factors—and show how they
relate to the task of MTPE. This will reveal different aspects that intervene in the
construction of MTPE ethics, providing the framework upon which we later con-
struct our argument around three key ethical dilemmas. The second part of the
chapter will concentrate on the discussion of these three key ethical dilemmas:
(a) Dilemma #1: the post-editor’s status; (b) Dilemma #2: the post-editor’s commit-
ment to quality; and (c) Dilemma #3: digital ethics and the post-editor’s
responsibility.
The ecosystem metaphor in cognitive studies is based on the central assumption that
the cognitive ecosystem consists of humans and everything that surrounds them
(Strohner 1995). This ecosystem has clear implications in translation studies and has
given rise to situated translation, one of the most relevant theories in translation
studies (Risku 2004, 2010). From a cognitive perspective, translators are situated at
the centre of the process, surrounded by two other significant elements:
artefacts and people (cooperation partners). Artefacts are the objects used by the
translator for specific tasks (e.g., translation memories or MT systems). On the other
hand, cooperation partners comprise language service providers, project managers,
and reviewers; internal factors such as the physical and psychological conditions
of the translator are also considered. In this model, the translator acquires the status of
the situated agent, who is the creator of textual material in a specific cultural context
(Risku 2004, p. 75).
Inspired by Risku’s (2004) initial proposal, Krüger (2016a, b) creates a situational
model applicable to specialised translation, known as the Cologne model of the
situated LSP translator.4 Krüger’s (2015, 2016a, b) proposal for a translation
ecosystem is based on the models by Serrano Piqueras (2011), Risku (2004,
2010), Holz-Mänttäri (1984), and Schubert (2007). In its representation according
to the Cologne model, Krüger (2016a, b) formulates the translational ecosystem,
covering the entire translation process, which is divided into the creation, transfer,
and organisation phases. Creation is the phase in which the source texts of the LSP
3 Our account is necessarily brief, as we understand that the reader is already familiar with MTPE. For a thorough review of the matter see, for example, Koponen et al. (2021).
4 In the Cologne model, LSP stands for language for specific purposes (Krüger 2016b, p. 118).
Fig. 6.1 MTPE in the translation ecosystem (adapted from Krüger 2016a)
translation process are written; in the second phase, transfer, content is processed
from one language to another; the last phase, organisation, includes all operational
flows (Krüger 2016a, p. 312). The translation process specifically takes place in the
transfer phase, which is subdivided into various work phases, ranging from project
preparation to quality control (Krüger 2016a, p. 311). This model is premised on the
following components: (1) the translator as the central agent; (2) the cooperation
partners or users; (3) the social factors that comprise the professional world of the
translator; (4) different types of artefacts (or resources) that facilitate the translation
process such as computer-assisted translation (CAT); and (5) psychological and
social factors.
For the purposes of analysing the ethics of MTPE we have adapted Krüger’s
model, addressing those features which we understand are most related to MTPE
(Fig. 6.1). Although the author does not specifically refer to MTPE, we use this
figure as the basis to examine how translation ecosystem components contribute to
the construction of the ethics of MTPE. All these elements are located along the
transfer phase, which explains why we have given less prominence to the other two
phases in our adaptation of the model.5
We will first review each component, highlighting the main ethical issues that
spring from them and, later, concentrate on the discussion of these issues along the
three main ethical dilemmas as indicated above.
5 The creation phase can be related to the concept of pre-editing. In this connection see, for instance, Guerberof (2019).
In line with Risku's (2010) proposal, Krüger points to cooperation
partners as an essential part of the translational ecosystem since the translator
interacts with each of them at different phases of the translation process (Krüger
2016a, p. 317). Cooperation partners refer to the following figures: the initiator, the
commissioner, the source text producer, the target text user, the target text receiver,
the co-translator, the proof-reader, and the project manager. In our adaptation of the
model, we identify two main groups of cooperation partners in MTPE. On the one
hand, those partners the post-editor interacts with in the exchange of expertise or for
extra input (co-translators, post-editing team and project manager). On the other, the
group of partners represented by the client and the target text user (TT user), who
determine the final use of the text and, therefore, the way it is to be post-edited. In the
interaction with cooperation partners, several aspects emerge that need to be
explored from an ethical perspective. The first relates to the way the
MTPE project is conceived, the role assigned to post-editors, their status, and the
negative attitudes towards the task that may result (Plaza-Lara 2020, p. 164). As
we will see in the next section, these constitute what we call dilemma #1, related to
the post-editor’s status. A second aspect (dilemma #2) is concerned with the quality
assigned to the final text in view of the requirements from the client and the TT user.
A final consideration (dilemma #3) is the post-editor’s responsibility towards data
governance and ownership (for instance, the exposure of private data when using
online MT).
Krüger (2016a) uses Bourdieusian terminology (Bourdieu 1984) to classify the social
factors that constitute the professional world of the translator into field, capital, and
habitus. The first term refers to the translation field and location of the client who
requests the translation service; the key players here are the actors who provide these
translation services (including language service providers or freelance translators).
Capital determines the economic, cultural, and social relationships between the
translator and the cooperation partners. For example, membership of professional
associations, subject-matter expertise or field-specific training will enable specific
levels of communication with the cooperation partners. Together, these factors will
define the translational habitus, that is, the behaviour of the translator in the field of
action or (translation and location) field.
As we will see below, these three features of Bourdieu’s framework are key in
defining the post-editor’s status in relation to that of the translator, and will serve as
the basis for discussing dilemma #1.
In the translation ecosystem there are two types of artefacts: translation technology
tools and steering instruments. Although Krüger’s (2016a) model does not assign
much prominence to MTPE, we can easily relate it to the artefact group of translation
technology tools or, in the author’s words, ‘translation technology in a narrow
sense’. In a non-exhaustive list, the author assigns the following tools to this
group: translation memory systems (TM systems), terminology management, MT
systems and project management tools (PM tools). MT systems are seen as affecting
the translation process “even more drastically than TM systems since, in this case,
the translator’s task is reduced to pre- and post-editing while the actual translation is
performed by a machine” (Krüger 2016a, p. 321).
The interaction of MT and TM is further explained in the elaboration of the model
(Krüger 2016b), where CAT tools are contextualised in dynamic processes of
interaction between the translator and the working environment. MT systems are
part of the translation process, as a component of TM systems, providing an
alternative translation to exact or fuzzy matches (Krüger 2016b, p. 125), in which
case the translator has to decide whether or not to accept the MT output. Again, this
opens the debate as to whether the post-editor's work should be reduced to the mere
acceptance of a segment translated by an MT system integrated into a CAT tool. In
fact, Krüger (2016a) anticipates here what we currently know as the
CAT-e(nvironment): a CAT tool is much more than a translation memory; it is the
environment that surrounds the
translator and the artefacts that interact with the rest of the ecosystem. However, as
recent studies state, the translator's task extends beyond pre- and post-editing,
and MT can be seen as part of the CAT-e (Vieira 2019).
As we will discuss in the following section, the interaction of the post-editor with
the artefact group of translation technology tools presents two main ethical issues:
(a) the human role—and status—as related to the machine (dilemma #1); (b) the
personal commitment to quality of the final text (dilemma #2).
Together with the artefact group of translation technology tools, MTPE is also
connected to the group of steering instruments, which highlight the importance of
client instructions, style guides, glossaries, databases, and terminological standards,
all influencing the production of the target text by the translator. It is in this group
that we need to include MTPE guidelines and standards, as they are steering
instruments that allow the post-editor to decide when and how a particular segment
has to be post-edited. In this respect, it is interesting to note that using these is not as
straightforward as it may seem. On the one hand, the apparently higher quality
delivered by neural MT systems is blurring the division originally
devised by Allen (2001) between light and full post-editing.6 On the other, the
conceptualization of MTPE in the ISO standard (ISO (International Organi-
zation for Standardization) 2017) implies that this task is easier than translation, as it
6 The aim of light post-editing is to make the text comprehensible by making as few changes as possible, while full post-editing is performed on texts that require higher quality (Allen 2001).
rests on the assumption that it is the machine that has done the translation effort
(do Carmo 2020). As we see, both guidelines and standards have, then, a relevant
role in defining the ethics of MTPE and the post-editor’s commitment to quality
(dilemma #2).
The last factors Krüger (2016a) mentions in his translation ecosystem are psycho-
logical factors, which include external aspects (e.g. time pressure) and internal
aspects (e.g. motivation). These factors have also been pointed out in recent studies
as intrinsically related to translation: “Translation is currently described as a profes-
sion under pressure from automation, falling prices and globalization” (Vieira 2020).
Without a doubt, these psychological factors of the translator also have a direct
influence on our adaptation of Krüger’s model. For instance, time pressure is linked
to the concept of productivity. As Vieira and Alonso (2020) state, the automation of
translation has made clients expect large volumes of texts translated in a short period
of time. However, this ‘speed’ in the process of translation can be related to
low-quality target translations (dilemma #2) if there is a lack of communication
between the client and the post-editor. This, in turn, can have a negative effect on
the post-editor's motivation and status (dilemma #1). The client, as a cooperation
partner and part of the post-editor’s working environment, should place the post-
editor in a relevant position within the MT network; that is, in Krüger's
(2016a, p. 326) words, a proper amount of symbolic capital (or degree of expert
status) should be assigned to the post-editor. As we shall see, these psychological
factors of the translation ecosystem also have an influence on shaping the ethics
of MTPE.
In the translation ecosystem as depicted by Krüger (2016a), we see that the figure of
the translator is central, mastering the use of artefacts, interacting with cooperation
partners, and exhibiting a distinctive professional status. What is not so straightfor-
ward in this model is whether the status of the post-editor remains the same as that of
the translator. As we have seen in our adaptation of Krüger’s model, the post-editor
status is a recurrent issue, present in all interactions with the components of the
ecosystem: (a) the relationship with cooperation partners and the different ways the
MTPE project can be conceived call into question the post-editor's role in the translation
process; (b) social factors determine the position that post-editors occupy in the
translation ecosystem; (c) artefacts (i.e. translation technology tools and steering
instruments) have an effect on the way the MTPE task is performed and the actual
value assigned to it; and (d) psychological factors undoubtedly have an effect on the
post-editor’s motivation.
In order to deal with this first ethical dilemma about the post-editor’s status, we
will explore the nature of both translators and post-editors from a sociological point
of view, and analyse whether they share the same attributes or, alternatively, whether
they can be considered two different actors in the translation ecosystem. By com-
paring one with the other, we will examine whether there is some loss of power/
influence, or even some marginalisation in the process of becoming a post-editor
(Vieira and Alonso 2020).
From a sociological perspective, Sakamoto (2019) uses Bourdieu’s (1984) social
framework to conceptualise the relationship between the post-editor and the trans-
lator (the main candidate to become a future post-editor). In her model, post-editors
are considered a new category of workers, while translators remain reluctant to
embrace MTPE, as they feel that the incorporation of this task sidelines
their professional skills and identities (Sakamoto 2019, p. 201). The position of both
translators and post-editors can, then, be seen as complementary: while the most
experienced translators who work in a traditional environment (translation-edition-
revision) are at the top in terms of cultural capital, the post-editors'
capital is described as high in economic capital and low in cultural capital.7 This is
because the cost-saving character of MTPE is highly valued by end clients who request an
MTPE service and language service providers which are keen to train post-editors.
The intellectual property of MTPE, on the other hand, is assigned a lower cultural
value (Sakamoto 2019, p. 210). This disjunction may give rise to feelings of
restlessness, anxiety and, sometimes, resentment among translators. However,
these positions are not static and may change as a result of factors such as techno-
logical developments and the evolution of the traditional work environment towards
an MT-based model.
In a subsequent study, Sakamoto and Yamada (2020) explore further the nego-
tiations that take place between project managers and translators in their daily
interactions in the translation ecosystem. The authors conducted four focus groups
involving 22 project managers from 19 language service providers, with a view to
eliciting how the translation community has been shaping the practice of MTPE. The
project managers’ accounts of translators’ work revealed three positions towards the
task (Sakamoto and Yamada 2020, p. 87–88). They identified a first group of “proud
professionals who love to write texts and to create texts from scratch in the way they
like”. This first group tends to reject MTPE work. A second group included those
who are willing to accept MTPE and “prefer to correct existing translations as this
7 "Capital is a resource that social agents invest in and exchange to locate themselves in the social spaces and hierarchies. In addition to economic capital [. . .], social agents possess other forms of capital, i.e. cultural capital (this includes upbringing and educational background), social capital (e.g. personal connections with persons of certain social standings) and symbolic capital (which confers legitimacy and prestige to the person in the form of, for example, professional titles)" (Sakamoto 2019, p. 202).
involves less manual and cognitive effort, making the work easy”. The third group is
described by project managers as “fast and cheap translators”, situated “low down in
the translators’ hierarchy but needed to cater to different needs in the market”. A
similar division of perceptions is mentioned in Torres Hostench et al. (2016), who
also identify three attitudes towards MTPE among translators: those who willingly accept
it, those who reject it, and those who accept it with some reluctance. Among the
negative aspects of MTPE, the participants indicate the following: “we do not trust
MT”, “results are not good”, “we use our heads when we translate”, “we do trans-
lations the old artisanal work” (Torres Hostench et al. 2016, p. 20).
On the other hand, and according to Guinovart Cid (2020), the different roles
assigned to translators and post-editors might be a matter of perspective. For some,
MTPE is a new trade or service while, for others, it is an existing activity (MT-aided
translation). The author groups both profiles under a single one, “linguist”, on the
grounds that “future job positions in the translation industry will be more pluri- and
transdisciplinary”, affecting “the very activity of MTPE and the profile of the
professional, who is found in constant synergy of (now fully mixed) boundaries”
(Guinovart Cid 2020, p. 172). This notion of perspective is better seen in the light of
the interaction between the translator/post-editor and the computer, as discussed by
Vieira (2019). He explores agency in MT, involving different degrees of human
control, from MT-centred automatic MTPE to human-centred interactive/adaptive
MT. Agency does not only depend, then, on the nature of the task but also on other
aspects such as client requirements, the nature of the commission and the translation
company, among other factors. From a situated approach, we can adopt a holistic
view and conceive of MTPE not only as an additional service but also as an activity
that goes beyond the simple static cleaning of MT output. In Koponen’s (2016,
p. 133) words, the discussion arises “when the output of MT is considered to be a
first version that overrules that of the translator”. It is at that point that we see
translators/post-editors at risk of being marginalised and disempowered in the
translation workflow.
A complementary aspect that contributes to the uncertain status of the post-editor
is the “terminological instability” of MT (Vieira et al. 2019, p. 4). From a taxonomic
point of view, the integration of MT (and, therefore, of MTPE) in the translation
process has blurred the lines between what belongs to the machine and what
corresponds to the translator. This is the case, for
example, when MTPE takes place in environments where translation memories, MT
and human translation interact. From a conceptual point of view, do Carmo (2020)
challenges the narratives that think of MTPE as the revision of pre-translated content
and analyses how these, in combination with the assumption that MTPE increases
productivity, result in downgrading the value of the service. The very definition of
MTPE, according to ISO 18587:2017, implies that editing and correcting MT output is
easier and takes less time than translation, as the machine has already made the
translation effort (do Carmo 2020, p. 37). This view is reinforced by the usual MTPE
guidelines in the industry, which recommend performing as few edits as possible, with a
focus on time efficiency and productivity. The devaluation of the post-editor is only
a natural consequence when we link MTPE tasks to time, productivity and money.
6 The Ethics of Machine Translation Post-editing in the Translation Ecosystem 105
Perhaps it would be worth acknowledging with do Carmo (2020, p. 41) that the
cognitive load and complexity of performing MTPE tasks is comparable to those of
translation tasks and that the tendency in the industry to reduce per-word fees does
not really take into account this effort.
In our view, the divide in status between translation and MTPE stems from the
conception of these activities as separate or even antagonistic. As
Mitchell-Schuitevoerder (2020, p. 107) aptly remarks, the act of translating
and MTPE overlap, since “the post-editor is not only post-editing but also
translating”. In order to modify a target sentence, post-editors need first to generate a
translation in their minds, a tentative model, so to speak. In this respect, we agree
with Mitchell-Schuitevoerder (2020, p. 107) when she points out that “the
post-editor’s cognitive effort is undervalued (also financially) if we consider the many
thought patterns needed while MTPE”. This is in line with Melby and Hague’s
(2019) argumentation that the numerous advances in technology call for a different
view of the translator’s role, one that places them at the core of the process and turns
them into language advisors, dominating the translation ecosystem and deciding
which tools to use, when and how.
There are a number of principles common to all ethical codes for translation. These
concern translation competence, impartiality, integrity and accuracy, potential con-
flicts of interest, confidentiality and continuous professional development. However,
the same codes fall short of providing adequate support to translators in their daily
practice, as they fail to cover the infinite range of potential situations translators may face
(Lambert 2018). This situation gets more complex when introducing MTPE in the
translation ecosystem since, as we pointed out before, the quality of the final text
depends on factors such as the quality requirements imposed by cooperation partners
(mainly the client and the final user), the position assigned to the machine as an
artefact in relation to the human post-editor, or the conceptualization of MTPE in
guides and standards as steering instruments. We could assume, a priori, that the
end-product of the MTPE process should be a text similar in quality to the one a
translator might deliver. However, the discussion centres precisely on what is meant by
quality when we deal with a translation created by a machine and subsequently
revised by a person. In other words, to what extent can the two results be compared?
What’s more, should they even be compared?
When the post-editor is instructed to “use as much of the raw MT output as
possible” (TAUS 2010), quality might be undermined. In this sense, the debate
arises as to whether quality, in the translation ecosystem, depends on the context in
which the work is carried out, the multiple cooperation partners involved (translator,
client, language service provider) as well as the different interests of each of them,
106 C. Rico and M. del Mar Sánchez Ramos
which can lead to divergence. The artefact group of explicit steering instruments,
such as MTPE guidelines and instructions, is of utmost relevance in this context. As
Moorkens et al. (2018) point out, the variability of translation quality requirements is
expressed “using vague, relatively undefined terms”, in the form of prescriptive
guidelines for light or medium MTPE, error typologies or penalties specifically
designed for a translation client. In the absence of clear rules that can be applied
universally, post-editors may be involved in a complex situation affecting their
commitment to quality.
In translation studies, the concept of quality presupposes a theory of translation:
from the early intuitive conceptualizations of quality as the “natural flow of the
translated text”, dependent on the translator’s artistic competence, to formal models
based on the concept of text equivalence and the requirements of a translation’s end
user. When quality is explored in the light of what MTPE entails, we must also add
MT as a complementary factor, as the developments of the latter over time have
significantly affected the way the former is perceived. At the time of the first
experiments, back in the 1950s, when the aim was to use computers as a replacement
for the translator, the success of MT was measured by the quality obtained in
comparison with human translation. Those were the times of the so-called FAHQMT
(Fully Automatic High Quality Machine Translation), a wish that had researchers
struggling with computers and texts for a long time. As Bowker (2019, p. 453) points
out, quality, in this context, was understood as “the excellence of machine translation
in relation to a translation made by a professional human translator”. The ultimate
test of quality would be for MT to achieve human parity, a concept that is still under
question (Toral 2020). In fact, the development of translation technology over time
also brings an evolving concept of translation quality, variably associated with human
time, human judgements of linguistic quality, and productivity measured in terms
of terminological consistency and usability (Pym 2019, p. 437).
With the new developments in MT technology and its incorporation in industrial
processes, MTPE starts to gain ground and quality is discussed in terms of which
segments should be post-edited, what time should be allocated to the task and the
type of corrections to be made and, most importantly, what level of quality is
expected. This led Allen to define two types of MTPE according to the final
purpose of the text: light and full post-editing (Allen 2001). These two concepts
are currently being superseded with the emergence of neural MT, which calls into
question certain aspects of the way in which MTPE is carried out. For Vieira (2019,
p. 326, 328), the fact that neural systems produce more fluent texts may make it more
difficult to detect (and correct) translation errors. From this point of view, the notion
of post-editing levels may lose relevance and give way to a different concept of MT
revision in which the post-editor focuses on checking the correct use of terminology
and giving approval to the translated content. This type of MTPE is in line with a
more flexible way of understanding quality, the so-called “fit for purpose” (Way
2013). This concept introduces the idea that the final quality of the text can be
negotiated on the basis of a series of variable criteria and taking into account the
technology to be used. From a practical point of view, the product of MTPE no
longer aspires to be comparable to a human translation, but rather to the end use that
will be given to the text. MT evaluation metrics should be sensitive to the intended
use of the system and determine whether a system permits an adequate response to
given needs and constraints (Bowker 2019, p. 454–455).
This diversity of the concept of quality in relation to MTPE is best described in
the framework of a situated model where translation is more than a linguistic
exercise and the translator’s choices depend on a holistic description of “the relevant
factors influencing his/her cognition in real world translation environments” (Krüger
2016a, p. 310). Following Krüger’s model, we see the translator’s cognitive
performance as closely related to the artefact group of “technology in the narrow sense”,
where MT is placed together with TM systems, terminology management, alignment
tools and project management tools. This artefact group is essential to the translation
process (Krüger 2016a, p. 320) and, as such, is an important part of the translator’s
cognition. The post-editor’s commitment to quality is then linked to the specific use
of this artefact group and allows for broader, adaptable quality requirements
which include such factors as human time, human judgements of linguistic quality,
and productivity measured in terms of terminological consistency and usability.
There are many aspects that need to be considered when exploring the post-editor’s
responsibility in terms of digital ethics in the use of MT, and they mostly refer to the
interaction with cooperation partners in the translation ecosystem. Following
Canfora and Ottmann (2020), we identify three key issues: (a) translation errors in
critical domains that pose a risk to the end user of the translation, (b) liability for
clients and post-editors when working with MT, and (c) the potential exposure of
sensitive data in free online MT. The last of these, in turn, introduces a series of
important issues related to intellectual property rights and ownership, confidentiality
and non-disclosure agreements, data sharing and data protection in collaborative
environments and MT databases (Mitchell-Schuitevoerder 2020, p. 113–127). Data
governance is also a major concern for authors such as Moorkens and Lewis (2019),
who consider “translation as a shared knowledge resource” and call for “a move to a
community-owned and managed digital commons” (Moorkens and Lewis 2019,
p. 17). The argumentation on all the different issues at stake when using MT
certainly needs careful consideration not only from a legal and normative point of
view but also from a technical perspective integrating secure translation workflows
in “close-circuit MT engines” (Mitchell-Schuitevoerder 2020, p. 121).
What calls our attention regarding the post-editor’s digital responsibility towards
MT is the acknowledgment of a certain lack of knowledge about what this
technology entails. In this respect, the study conducted by Sakamoto et al. (2017) is
significant. They explored the opinions and perceptions towards technology use in
the UK language service industry, and particularly project managers as the “key
people who have strong influences on all aspects of translation practice in the
industry”. Among their findings, we note the following: (a) project managers were
not clearly informed about actual MT use by translators in their teams, (b) they did
not discuss the use of MT openly, (c) when quality was low they suspected the
translation might be post-edited MT output, and (d) there was no industry-wide
consensus about how much and in what way translators should use MT, and few
language service providers implemented an official policy on it. In this scenario,
where MT use (and MTPE) goes almost unnoticed (deliberately or not), it is critical to delve into
the post-editor’s accountability towards translation as a digital product. From the
perspective of a situated model of translation, the post-editor’s responsibility is
found in the interaction with the different cooperation partners and artefacts, and
along the different phases of the task. That leads us to consider who initiates and
commissions the project (language service provider, project manager, final client),
which technology is to be used (MT system, terminology management tools,
translation memory), which other instruments are used (client instructions, MTPE
guides, previous translations), and which professional status is assumed by
post-editors or even assigned to them. When project managers acknowledge, for instance,
a “willingness and ignorance to know whether translators use MT” (Sakamoto et al.
2017), the MTPE process is deprived of transparency. Similarly, when translators
use free online MT without informing their clients, even when its use has been
excluded specifically in contractual clauses (Canfora and Ottmann 2020, p. 64),
transparency in the process is at risk.
A possible cause for this behaviour might be found in a lack of MT literacy in the
translation community. As Bowker (2019) indicates, “just because machine transla-
tion is easily accessible [. . .] this doesn’t mean that we instinctively know how to
optimise it or even to use it wisely in a given context”. It may be that the
absence of formal training in MT and MTPE hinders the capacity to become
informed and critical users of MT tools. If we explore Bowker and Buitrago
Ciro’s (2019) findings in the framework of scholarly communication, we can easily
relate them to the context of translators and post-editors, advocating for their ability
to go beyond mere technical (and procedural) competence and become critical
and informed users. Ideally, this involves comprehending the basics of how MT
systems process texts, understanding the implications associated with the use of MT,
and evaluating the possibilities of this technology. It is our contention that a
thorough knowledge of what MT is and what it entails in a situated model of
translation provides the adequate background for the post-editor with regards to
digital ethics.
This chapter has presented MTPE in the light of the translation ecosystem,
conceptualising the translation process in a situated model. By adapting Krüger
(2016a) original framework, we have been able to analyse how the different
References
Guerberof A, Moorkens J (2019) Machine translation and post-editing training as part of a master’s
programme. JosTrans 31:217–238. https://www.jostrans.org/issue31/art_guerberof.pdf
Guinovart Cid C (2020) The professional profile of a post-editor according to LSCs and linguists: a
survey-based research. Hermes 60:171–190. https://doi.org/10.7146/hjlcb.v60i0.121318
Holz-Mänttäri J (1984) Translatorisches handeln. Theorie und methode. Suomalainen
Tiedeakatemia, Helsinki
ISO (International Organization for Standardization) (2017) Translation services – post-editing of
machine translation output – requirements. ISO 18587:2017. International Organization for
Standardization, Geneva. https://www.iso.org/standard/62970.html. Accessed 14 March 2020
Konttinen K, Salmi L, Koponen M (2021) Revision and post-editing competences in translator
education. In: Koponen M, Mossop B, Robert IS, Scocchera G (eds) Translation, revision and
post-editing. Industry practices and cognitive processes. Routledge, London, pp 185–201.
https://doi.org/10.4324/9781003096962-15
Koponen M (2016) Is machine translation post-editing worth the effort? A survey of research into
post-editing and effort. J Special Transl 25:131–148. https://www.jostrans.org/issue25/art_
koponen.pdf
Koponen M, Mossop B, Robert IS, Scocchera G (2021) Translation revision and post-editing.
Industry practices and cognitive processes. Routledge, London. https://doi.org/10.4324/
9781003096962
Krings HP (2001) Repairing texts: empirical investigations of machine translation post-editing
processes. Kent State University Press, Kent
Krüger R (2015) The Interface between scientific and technical translation studies and cognitive
linguistics. With particular emphasis on explicitation and implicitation as indicators of transla-
tional text-context interaction. Frank & Timme, Berlin
Krüger R (2016a) Situated LSP translation from a cognitive translational perspective. Lebende
Sprachen 61(2):297–332. https://doi.org/10.1515/les-2016-0014
Krüger R (2016b) Contextualising computer-assisted translation tools and modelling their usability.
Trans-kom 9(1):114–148. http://www.trans-kom.eu/bd09nr01/trans-kom_09_01_08_Krueger_
CAT.20160705.pdf. Accessed 15 Dec 2020
Lambert J (2018) How ethical are codes of ethics? Using illusions of neutrality to sell translations.
JosTrans 30:269–290. https://www.jostrans.org/issue30/art_lambert.pdf
Melby AK, Hague D (2019) A singular(ity) preoccupation. Helping translation students become
language-services advisors in the age of machine translation. In: Sawyer DB, Austermühl F,
Enríquez Raído V (eds) The evolving curriculum in interpreter and translator education:
stakeholder perspectives and voices. John Benjamins, Amsterdam, pp 205–228. https://doi.
org/10.1075/ata.xix.10me.l
Mitchell L, Roturier J, O’Brien S (2013) Community-based post-editing of machine-translated
content: monolingual vs. bilingual. In: O’Brien S, Simard M, Specia L (eds) Proceedings of the
MT summit XIV workshop on post-editing technology and practice. http://doras.dcu.ie/20030/.
Accessed 15 Dec 2020
Mitchell-Schuitevoerder R (2020) A project-based approach to translation technology. Routledge,
London. https://doi.org/10.4324/9780367138851
Moorkens J, Lewis D (2019) Research questions and a proposal for the future governance of
translation data. JosTrans 32:2–25. https://jostrans.org/issue32/art_moorkens.pdf
Moorkens J, Castilho S, Gaspari F, Doherty S (2018) Translation quality assessment: from
principles to practice. Springer, Cham. https://doi.org/10.1007/978-3-319-91241-7
Nitzke J (2016) Monolingual post-editing: an exploratory study on research behaviour and target
text quality. In: Hansen-Schirra S, Grucza S (eds) Eye-tracking and applied linguistics. Lan-
guage Science Press, Berlin, pp 83–109
O’Brien S (2002) Teaching post-editing: a proposal for course content. In: Proceedings of the 6th
EAMT workshop: teaching machine translation. https://www.aclweb.org/anthology/2002.
eamt-1.11.pdf. Accessed 15 May 2021
Vieira LN (2019) Post-editing of machine translation. In: O’Hagan M (ed) The Routledge hand-
book of translation and technology. Routledge, London, pp 319–337. https://doi.org/10.4324/
9781315311258-19
Vieira LN (2020) Automation anxiety and translators. Transl Stud 13(1):1–21. https://doi.org/10.
1080/14781700.2018.1543613
Vieira LN, Alonso E (2020) Translating perceptions and managing expectations: an analysis of
management and production perspectives on machine translation. Perspectives 28(2):163–184.
https://doi.org/10.1080/0907676X.2019.1646776
Vieira LN, Alonso E, Bywood L (2019) Introduction: post-editing in practice – Process, product
and networks. JosTrans 31:2–13. https://jostrans.org/issue31/art_introduction.php. Accessed
15 Dec 2020
Way A (2013) Traditional and emerging use-cases for machine translation. In: Proceedings of
translating and the computer. ASLIB, London. https://www.computing.dcu.ie/~away/
PUBS/2013/Way_ASLIB_2013.pdf. Accessed 15 Dec 2020
Chapter 7
Ethics and Machine Translation: The End
User Perspective
7.1 Introduction
In 2016, ten years after it was launched, the world’s biggest machine translation
(MT) producer, Google Translate, announced that it generated over 143 billion
words per day (Pichai 2016). We can safely assume that this output has subsequently
increased, including many text types translated for a wide variety of users. Why is it
that MT use has become so widespread? There are two primary positions on this:
(a) the technological determinist view that the time had come for this technology,
i.e. that it emerged from the natural evolution of the field at that time, and (b) the social determinist
A. Guerberof-Arenas (✉)
Guildford, UK
University of Groningen, Groningen, Netherlands
e-mail: [email protected]
J. Moorkens
Dublin City University, Dublin, Ireland
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 113
H. Moniz, C. Parra Escartín (eds.), Towards Responsible Machine Translation,
Machine Translation: Technologies and Applications 4,
https://doi.org/10.1007/978-3-031-14689-3_7
view that circumstances (societal, technological, economical) are such that huge
efforts were put into MT development. The former implies that MT is a small part of
inevitable technological progress and it follows that MT should be put into use
where possible without consideration of its sociocultural context. The sociotechnical
counterargument is that the pros, cons, and repercussions of each new technology
should be carefully considered by society before its implementation.
Kranzberg (1986) wrote that technology is “neither good nor bad; nor is it
neutral” (p. 545). The view among ethicists and researchers in science and technol-
ogy studies or STS (as summarised in Olohan 2017) is that science is not linear and
deterministic, but rather that development is rooted in a worldview from which the
decision as to what to develop, its intended audience, and its implementation are
indivisible. This set of factors, in turn, influences the effects of technologies in use, as
they reshape activities and their meaning, engendering new worlds of their own
(Winner 1983). For example, as Larsonneur (2021) noted, the major MT providers
are now big tech companies due to their access to resources and ubiquitous online
offering. The university research groups that at one stage topped the leaderboards in
competitive MT shared task events, particularly those for well-supported languages,
have gradually been replaced by big tech research groups. This means that the
perspective and motivation of big tech companies now drive much of MT
development.1 In other words, large corporations, rather than all players in society, are
determining the use and suitability of MT for assimilation, where MT is served
directly to the end user.
In his 1949 memo proposing MT, for example, Weaver wants to enable
communication and encourage peace between nations. He also sees translation as
a problem and foreign languages as encrypted versions of English or an
as-yet-undiscovered universal language (Raley 2003), an idea that Kenny et al. (2020, p. 1)
call one of the ‘most reductive . . . in translation history’. This is on the basis that
translation as a communicative act is a much more complex process than coding and
decoding language at the superficial level of the written word as opposed to the
world of ideas and of communication between cultures. Nonetheless, this superficial
view has often prevailed: Kenny et al. (2020) note that the notion of ‘foreign as
English’ survived in MT literature well into the 2000s. We can see the evidence of
superficiality and neutralisation in MT output, as discussed in recent literature about
normalisation in MT (Čulo and Nitzke 2016; Toral 2019) and reduced lexical
diversity (Vanmassenhove et al. 2019). And yet, MT also carries the utopian
communicative intent of Weaver’s memo, enabling effective communication for
many people in many scenarios.
In this chapter, we attempt to systematically analyse the ethics of MT as an
end-product, and to examine the world engendered by widely-available MT, a world
that did not really exist before the advent of free, networked, and ubiquitous MT. If
we were to take a stakeholder approach (see Fig. 7.1) in analysing the effects of MT
on groups of people with different levels of involvement with MT (e.g., translators,
1 See also Paullada (2020) on MT and power dynamics.
[Fig. 7.2: translation use cases plotted along a low-risk to high-risk axis and a short to long shelf-life axis, yielding quadrants A–D]
second continuum of risk, whereby low-risk texts afford translation using MT but
high-risk texts (where mistranslation may cause injury or death) require careful
human revision. In Sect. 7.1 of this chapter, we consider ethics and MT as an
end-product for different types of texts (see Fig. 7.2), from those that have a short
shelf-life and are low risk to those that have a long shelf-life and are high risk. Of
course, modelling or presuming a user perspective is not ideal, so we also hear the
voices of real users testing MT in a novel long shelf-life use case in Sect. 7.2. In Sect.
7.3, we discuss the implications of our analyses and some further issues prior to
conclusion.
Translation use cases with a short shelf life that present low risk (A in Fig. 7.2) are
ostensibly ideal for reception of raw, ‘low stakes’ MT. For example, it makes little
sense to hire a professional translator to translate most user-generated content such
as online travel reviews or forum postings, as the reviews are likely to be superseded
by newer ones within hours or days and few readers tend to read beyond the most
recent postings. Similarly, online auctions are time-limited and cease to be useful as
soon as the auction closes. We can only assume that most instances of low-stakes
MT are positive and useful, with comprehension facilitated by MT. End users will
probably get the gist of the review or auction posting, and any mistranslation will
have few, if any, risks or repercussions.
The taxonomy proposed by Canfora and Ottmann (2018), categorising risks
along a continuum of increasing severity, is a useful tool for evaluating the level
of translation risk. These risks range from communication being impaired or impos-
sible, loss of reputation, and financial or legal consequences, to damage to property,
physical injury, and death. Following this taxonomy, there may be a possibility of
communication being impaired and a loss of reputation on the part of the provider of
a low-risk translation, i.e. the hosting site for reviews or sales. An example of this
was the launch of an Amazon portal for Sweden, featuring mistranslations and
vulgar MT errors (Hern 2020). Reputational risk is greater with mistranslation of
with teachers, and working in a garage, probably carry little risk (A and B in
Fig. 7.2), but there is a danger that uncritical use of MT will mean insufficient
discrimination between low- and high-risk uses. Ciribuco (2020) writes of the ‘need’
for translation for ‘survival’, and there may be instances when the availability of
online or mobile MT is crucial as a translator or informal mediator is just not
available, in which case MT fulfils a communication function. This raises a risk of
overreliance on MT, however. The non-English speaking immigrants to the US
interviewed by Liebling et al. (2020) experienced mistranslations that were
inappropriate or dangerously inaccurate, and reported having lost work and struggled
to build relationships due to their use of free online MT on smartphones for almost all
interactions.
Translation is often necessary for survival in high-risk, short shelf-life situations
(D in Fig. 7.2) as addressed in work on crisis translation and on the use of free online
MT engines for health and legal settings. Cadwell et al. (2019) provide examples of
MT being used in response to crises, with common problems of underdeveloped
technology for the language pairs in use, including insufficient data, and available
data coming from an inappropriate domain. These problems exacerbate existing
quality issues with MT, and Federici and O’Brien (2020) suggest preparedness and
the intervention of professional translators and interpreters where possible to miti-
gate risk. In a crisis scenario, not all translation will be public-facing, and a digital
divide may affect access to human or machine translation, whereby socio-economic
factors (as highlighted by Cadwell et al. 2019) or gender inequality (as highlighted
by Vollmer 2020) might limit access to technology generally. Therefore, the use of
MT in these scenarios should be implemented with caution and ideally under
supervision of translators or others with high MT literacy (Parra Escartín and
Moniz 2019).
Example use cases for combined speech recognition and MT tools are often in
medical settings (see Sumita 2017, for example), which appears to be a high-risk
setting, despite the short MT shelf life. A mistranslation could have dire
consequences for an individual. In high-risk, long shelf-life use cases (C in Fig. 7.2), such as
translation of food ingredients, medicines and their accompanying information, or
instructions for machinery, mistranslation could expose individuals to risk at the
high end of Canfora and Ottmann’s (2018) continuum, such as injury or death. While
the argument for use of MT for assimilation in crisis scenarios might be a utilitarian
effort to minimise harm, there can be little argument that the use of MT without
expert human intervention in high-risk scenarios is neither wise nor ethical (Parra
Escartín and Moniz 2019; O’Mathúna et al. 2020).
Aside from the use cases discussed in this section, we may have low risk, long
shelf life (B in Fig. 7.2) use cases for MT, such as in the use of MT for literature or
for user interface translation. The risks will come at the lower end of Canfora and
Ottmann’s (2018) continuum, with mistranslation risking communication being
impaired or a loss of reputation. For literary translation, one could argue that there
is a long-term risk to language, and reduced readability presents a possible risk to
engagement with the other, to empathy. As noted by Bender et al. (2021), societal
views or biases as represented in MT systems are set in aspic from the moment the
training data is harvested, whereas in society these will change over time. If literature
models the way in which society thinks, feels and behaves, the consequences of poor
engagement with the text could be of a high-risk nature that is currently
unforeseeable. The argument inherent in the copyright waiver for developing
countries in the 1971 update to the Berne Convention (see Moorkens and Lewis 2019)
suggests that availability is currently considered to outweigh these risks.
In the following section, we look in detail at user interaction with raw and
post-edited machine translation to bring in the direct voice of users and their perception of
the issues faced.
This section gives voice to users of raw MT in technical and creative environments
who are not language professionals, but rather the ultimate users or readers of
translated texts.
Since 2017, part of our research has explored how using MT engines (both highly
customised statistical, SMT, and neural, NMT) impacts the user or reader experience
when applying different translation modalities. We define translation modality
as a descriptor of the process in which the translation is generated. For example, if
the translation of a product or a story is generated by professional translators without
the aid of MT, this is considered one translation modality that we could call “human
translation”, but if the translation is generated by MT and then post-edited it would
be considered another modality, that we could call “MT post-editing (MTPE)”, and
finally, if raw MT output is used, this is labelled as MT.
The first experiment involved 84 participants, native speakers of Japanese, English,
German and Spanish, using an eye-tracker (Guerberof-Arenas et al. 2021).
The participants were frequent users of word-processing applications but differed
in their Microsoft Word (MS Word) literacy, i.e. their experience using MS Word.
We set up an intra-subject experiment where the users did six tasks, three of them
using the published version of MS Word as localised for their native language
(HT/MTPE as part of the content is post-edited), and, after a brief pause, the
remaining three using a machine-translated version of MS Word (MT). Half of the participants in each language group followed the reverse order of translation modalities to counterbalance the order effect. The engine used for this “experimental
translation” was a customised SMT engine used in production by Microsoft in the
company’s localization process at that time (Quirk et al. 2005), and therefore deemed
of acceptable quality for post-editing for all the languages tested (Schmidtke and
120 A. Guerberof-Arenas and J. Moorkens
Groves 2019). Based on previous experiments (Doherty and O’Brien 2014; Castilho
2016), and in order to analyse the usability of the different translation modalities, we
looked at effectiveness, i.e. the number of completed tasks versus the total number of
tasks; efficiency, i.e. effectiveness in relation to time; satisfaction, i.e. the level of
satisfaction in completing tasks, the time, the instructions given and the language
used in MS Word; and, finally, cognitive effort, i.e. the mental effort employed in
completing (or not) the tasks. For a detailed methodology of this experiment, refer to
Guerberof-Arenas et al. (2019, 2021).
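The usability measures defined above reduce to simple ratios. The following is a minimal, hypothetical sketch (not the authors' analysis code; the function names, task counts and timings are invented for illustration):

```python
# Illustrative sketch of the usability measures described above for one
# hypothetical participant. All data values are invented for demonstration.

def effectiveness(completed_tasks: int, total_tasks: int) -> float:
    """Share of tasks completed out of all tasks attempted."""
    return completed_tasks / total_tasks

def efficiency(completed_tasks: int, total_tasks: int, minutes: float) -> float:
    """Effectiveness in relation to the time taken (per minute)."""
    return effectiveness(completed_tasks, total_tasks) / minutes

# One participant: three tasks per modality (HT/MTPE vs. raw MT).
ht = efficiency(completed_tasks=3, total_tasks=3, minutes=6.0)
mt = efficiency(completed_tasks=2, total_tasks=3, minutes=8.0)

print(f"HT/MTPE efficiency: {ht:.4f}, MT efficiency: {mt:.4f}")
```

Efficiency here is effectiveness divided by time, so a participant who completes fewer tasks or takes longer scores lower, mirroring the lower MT-modality scores reported below.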
The results show that the type of task, and hence the participants’ experience and ability (MS Word literacy), was a factor in their effectiveness, but the translation modality was not a statistically significant factor. However, for the combination of completed tasks and time (efficiency) and for the users’ reported satisfaction, the translation modality was an influencing factor, and MT scored significantly lower than the HT modality in both efficiency and satisfaction. With regard to cognitive load, the results show that the English participants exerted lower cognitive effort when reading the instructions and completing the tasks than participants in the other languages, but there was no difference between the other languages or between the modalities.
After the users had completed the experiment, each participant recorded a semi-structured interview while viewing their own eye-tracking data in a Retrospective Think Aloud (RTA) protocol. The interviews ranged from 10 to 20 minutes and were conducted in English. Let us examine some of the relevant questions guiding the interviews and the participants’ responses to see the effect that MT had on their user experience.
Strikingly, only three participants said that they had noticed that the application had changed after the pause, and only one participant referred to MT: “I just thought like this is a, this is something that was processed by a machine and you cannot rely on whatever you see, you have to search for it”.
However, the majority did not notice the change. Several reasons were given for this: users with previous experience of a task did not look at the language in detail, focusing only on the action because they knew the location of the option; others concentrated so deeply on reading the instructions and completing the tasks that they did not pay attention to the language in the application; others assumed it would be the same application, looked at other cues in the application, or were used to working with another version of MS Word. Having said this, the users did fixate on the words in the application, perhaps only to look for keywords or anchors without necessarily reflecting on the quality.
Most participants were not aware that they were using a different translation
modality after the pause. This came as a surprise to the research group, as we were
expecting that the change would be obvious because we were using a “fake” setup
7 Ethics and Machine Translation: The End User Perspective 121
with raw MT output—no post-editing at all was performed. This could have been
especially problematic for the German and Japanese participants because tradition-
ally these languages are difficult for MT, and because there were obvious errors in
the text displayed within the application in all languages (Guerberof-Arenas et al.
2021).
During the RTA, the participants were asked about the general quality of the language they were working with in each modality. The responses were rather mixed, even from the same participants. However, of the 56 participants who commented on the quality of the MT modality, 23 mentioned, surprisingly, that the quality of the language was “Correct”, “Fine”, “Good” or “Very good”, but at the same time they were puzzled by some of the translations: “The Japanese language, I didn’t recognise unnatural things in this task, but I did recognise some unnatural translation, like too informal language in some other tasks”.
7.3.1.3 How Did You Find the Language in this Menu, Dialog Box, Option?
When the participants were asked about the language in certain menus, dialog boxes
or options, 44 participants reported errors in the MT modality while only 5 reported
errors in the HT modality (which were not actual linguistic errors). These are some
examples of the errors found in MT.
It says, ‘links’ so you can actually adjust the left side, but it didn’t say anything for the right
side, it just said ‘Richting’2 and I was unsure about what it exactly means. (P04DE)
I don’t know if there is the original Spanish application or if there is a translation from the
English. Because in the next task I was looking for the right for the column space in the right.
But it was Correcto3 which is the direct translation from Spanish, so I don’t know if the
Spanish version, I don’t think it has written Correcto instead of right. And it was confusing
for me but the rest I think it is fine. (P07ES)
The users resorted to strategies such as using the context to understand the options or back-translating an option to make sense of it.
Therefore, even if we were initially puzzled that the change of modality was not obvious to the participants after the pause, we realised that some of the MT options were confusing and that the language played an important role in finding
2 Richting was the MT alternative instead of Rechts (right in German).
3 Correcto was the MT alternative instead of Derecha because right can have several translations, one meaning right-hand side and another one meaning correct.
an option, especially for users with little experience, and, hence, in completing the
tasks.
The participants expressed several feelings during the RTA, such as confusion, confidence, concentration, disappointment, frustration, nervousness, unhappiness, and/or happiness. However, the two most frequent words (of five characters or longer) used to describe their feelings were “confused” and “nervousness”. On some occasions the participants were confused because of the tasks; on others they were confused because of the instructions and the difficulty of finding the options, the functionality of the application, or indeed the language: “I didn’t see this. No Izquierdo correcto4”.
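A frequency count like the one behind “confused” and “nervousness” can be sketched as a simple word tally with a length filter. This is an illustrative reconstruction, not the authors' code; the transcript snippets below are invented:

```python
# Counting the most frequent words of five or more characters across
# interview transcripts. The snippets are invented for demonstration.
from collections import Counter
import re

transcripts = [
    "I was confused by the menu and felt some nervousness",
    "confused again, the option was hidden",
    "nervousness when the task timer started",
]

words = []
for text in transcripts:
    # Lowercase, tokenise on letters only, keep words of 5+ characters.
    words.extend(w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= 5)

top_two = Counter(words).most_common(2)
print(top_two)  # [('confused', 2), ('nervousness', 2)]
```

The length filter excludes short function words so that feeling-related content words surface at the top of the count.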
In summary, in this first experiment, most users were not overtly aware when they were working with the MT modality. This could lead us to believe that using raw MT as part of the translation process, in the context of technical texts or even software applications as in this case, does not compromise the user experience, since these texts would be considered (according to our models) low risk and long term (B in Fig. 7.2). However, the participants did experience difficulties with certain words, which led to inefficiency and lower satisfaction when using MT, despite being unaware that they were using MT, and this was more problematic for less experienced users. Of course, there are other aspects to consider (the type of task, the experience, the MT quality for a specific language, and the MS Word setup) that might also influence the user experience, but the results show that the translation modality was indeed a factor.
We see here that the user experience with the MT translation modality is difficult to gauge not only because of multiple confounding variables that make isolating language difficult, such as the task itself, the user’s experience, and the language combination, but also because the user has a pre-set notion of the quality of the application that is related to the status of that application and even to the user’s historical relationship with it. Users might be confused when they look for keywords or at brief messages, and this contributes to a poor experience without their necessarily knowing why. There are insufficient studies that focus on testing the user experience when MT is involved as a translation modality, and we believe that simply asking users whether they found the information translated with MT useful will not reveal the real experience, because this is a much more complex and nuanced phenomenon.
4 The participant is referring here to the indentation, where the MT proposal meant correct in Spanish instead of right. He understands left (Izquierdo) but then sees correct (Correcto) instead of right (Derecho/a).
The participants were debriefed about the translation of the text and then asked if they had realised they were reading a translation; 66% of the participants responded “Yes” to this question. The follow-up question asked how they had realised it was a translation, and here there was a striking difference between the modalities. In the HT modality, 75% of the participants realised that the text was a translation because it was set in the United States, so the names of the characters and places referred to this country; the remaining 25% referred to literal or unnatural expressions.
In the MTPE modality, 60% of the participants referred to the USA setting, with the remaining 40% referring to unnatural words, phrases, spelling or word order (according to their own preferences).
In the MT modality, only 18% of the participants referred to the USA setting, with the remaining 82% referring to nonsensical words, literal translations, strange syntagmatic expressions, grammatical errors, wrong use of articles, lack of coherence, and incorrect word order.
These comments help us understand that this estrangement factor, these odd words and syntactic expressions, prevents the reader from engaging with or enjoying the text. If the reader’s attention is disrupted by these stylistic elements because they encounter errors, they become less focused on the text as a narration and more focused on the unusual words, and hence enjoy the experience less, even if they might be unaware of the translation modality.
In the HT modality, 39% of the participants found that there were paragraphs or
sentences that were difficult to understand because they contained too much infor-
mation (for example, proper names) or the narrative voice changed from third to first
person. Therefore, the difficulties were related to the structure of the ST and the
narrative decisions made by the author. In the MTPE modality, 34% of the partic-
ipants found paragraphs or sentences that were difficult to understand because of the
change of narrative voice and because the syntax was confusing. They gave exam-
ples of sentences that were long and difficult to follow. In the MT modality, 64% of
the participants responded “Yes” to this question because they found odd or non-
sensical words, confusion in the gender of articles, wrong syntactical constructions,
incorrect proper noun gender (la Joan instead of el Joan because in Catalan this is a
name for men), or they simply thought that certain sentences were not properly
translated.
In this second question we can see that in the HT and MTPE modalities the main
issues are related to the original ST while in the case of MT the issues are directly
related to the translation.
In the HT modality, most participants, 61%, found paragraphs or sentences that they
liked. They referred especially to the paragraphs of the story that described the way
one of the main characters dies and the description of his gaze.
In the MTPE modality, 43% of the participants found paragraphs or sentences
that they liked. They also referred to the paragraphs in the story where the author
describes how clinical death comes about and the description of the dying man’s
gaze. However, most of the participants (57%) did not say that they liked a paragraph
or sentences in this modality, in sharp contrast with the experience of those in the HT
modality.
In the MT modality, 36% of the participants found paragraphs or sentences that
they liked, mainly the first paragraph where the author explains what clinical death
is, even though there were translation errors in this paragraph.
In summary, when the users were asked specifically about parts of the texts they liked, a majority liked parts of the text in the HT modality, and participants in all three modalities referred to parts of the text that already had powerful imagery in the ST.
There were other comments from the participants at the end of the survey. One participant who read the HT modality said:
I have done the exercise in half an hour, I have read the text very quickly. The writing has
had an impact on me, I could see the images. For this reason, I wouldn’t read something
similar. The text is well written, and it transmits emotions. But I never read this type of
horrible thing. It is not my genre. But yes, the translation is brilliant, it transmits everything.
(P37)5
For this modality, the main issues reported are the genre, the topic and the narrative
style of the ST. Participant P37 even refers to the translation as “brilliant”.
MTPE
For this modality, again the main issues reported are aspects of the ST. Participant
P68 even wants to read the whole book and P85 also praises the translation.
5 The translations from Catalan into English are provided by the authors.
MT
In this modality, users did notice errors, even though they did not disclose whether they realised that they were reading MT, and this seemed to influence their reading experience.
There are words in the text that do not make any sense in the context, such as “jihad” or
“thone”. (P03)
It is difficult to know the quality of the translation without reading the original text. (P06)
As I advanced in my reading of the text the translation deficiencies have become less
problematic. (P23)
I've been in lock down for about fifty days and maybe this has influenced the fact that I had a
hard time concentrating. (P49)
We see here that once the readers are immersed in the narrative, they might
compensate for the lack of coherence, lexical accuracy and cohesion, so the context
and the narrative help to decipher a low-quality text. We wonder about the additional
cognitive effort of doing this, especially for a longer story. Recent research confirms
that the cognitive effort is higher when reading MT in literary texts (Colman et al.
2021). Finally, something that appears obvious but, given the current world situation, seems even more relevant: the personal circumstances of readers influence their perception of language and their engagement. If participants already have difficulties
reading because of their personal circumstances (such as a pandemic), shouldn’t a
translation facilitate the engagement and enjoyment of the text instead of making it
more cognitively demanding?
In summary, we see that in a creative environment the MT modality has a strong effect on the reader experience. Readers show significantly less engagement and enjoyment, and diminished reception, compared with those who read a version in which a professional translator intervened. We also see a pattern of higher values in the HT modality relative to MTPE. We are aware that at present MT is not used in the publishing
sector, or at least its use is not publicised. However, MT is becoming an intrinsic part
of the translation workflow in the audiovisual sector through platforms that might
simplify structures to obtain better MT output (Mehta et al. 2020). Are viewers then
exposed to the best possible version of their language? There are some studies that
look into the productivity gains and quality of subtitles and conclude that this is a
viable solution (Bywood et al. 2017; Matusov et al. 2019; Koponen et al. 2020), but
we feel that analysing productivity and final quality in a more “traditional” way
leaves out an important aspect: the impact that MT has on the viewers. Based on this
research, it is important that streaming platforms make viewers and readers aware of
the use of MT, but also that translation reception studies are prioritised if technology
is to be used in any creative domain. The implications of these results for society are various: on the one hand, they show that MT diminishes the reader's experience and that using MT as a tool constrains the translators’ creativity; on the other hand, the long-term effects of using the technology might be worrying, such as loss of lexical richness, style simplification, loss of reputation for authors, and, thus, the minimization of the transformative effect that literature and fiction have on society.
In Sect. 7.1 we saw that risk for low-stakes MT is minimised (but not negligible),
rising along with the shelf life of the text. In Sect. 7.2 we looked in detail at two
experimental use cases for raw MT—so-called MT for assimilation—where we can
see that, although the risk is not high, there are nuanced implications for translation
modalities that affect the end user or reader experience. Participants in the studies
described in Sect. 7.2 were not made aware of the translation modality chosen to produce the text that they engaged with, as their preconceptions might otherwise have altered their behaviour or responses.6
It is commonly assumed that MT should help users to communicate and that this
means of communication is improving as the technology improves, e.g. the percep-
tion that NMT is “harmless” in short-term and low-risk scenarios, as opposed to
high-risk long-term scenarios that involve mainly health, crisis and/or legal settings
(Vieira et al. 2020). This stems from the logical perspective that when users are only
trying to gist for content, skimming a website, a document, or a message, misleading
or even inaccurate translations are not as “important” as when users are trying to
understand or carry out an action that involves their health, their legal status, or
indeed their survival. We are aware, as users of public and private MT technology
and researchers, that indeed MT helps users to communicate in a language other than
their own, especially if they have not mastered that second language.
By looking at the data from our research, we see several common patterns when
examining the user experience in the context of raw MT output in several models
considered low-risk: (a) users do not necessarily recognise that they are exposed to MT if they are not explicitly informed, especially as the technology improves;
(b) nevertheless, users might be confused, frustrated or (in the case of biased output)
misled by the information found and are likely to encounter errors, awkward style,
and unintelligible words that will result in lower efficiency, satisfaction, enjoyment
or engagement scores; and (c) end users are affected by the translation modality,
especially if their experience of the translated application or knowledge of the source
6 Our ethics committees agreed that these were low-risk settings for use of MT, but in a high-risk setting this should change. At what level of risk does a study using MT without informing participants become research involving deception?
language is low, and they either fail to achieve what they set out to do or compensate for errors by looking at the context, back-translating, or adjusting at later stages of their interaction with the “product”. We conclude that the user experience is
not a binary issue resolved by asking if this “information was helpful or unhelpful in
your language” or by counting the “number of translation errors in the target
language” or by calculating a similarity score with a gold reference translation
(as happens in automatic MT quality evaluation). User experience research that
considers MT should look at a broader picture that considers experience not only
as a static and isolated event, but as part of a communicative process in the short and
long term. See Table 7.1 for a summary according to levels of risk and length of
time.
All of this indicates that there are implications and risks inherent in the use of unreviewed MT, suggesting that readers or users should be made aware when text has been machine translated, via a note or, keeping in mind the legal implications of mistranslation, a disclaimer. Usability, privacy, and (cyber-)security
do not always integrate well, as evidenced by the ‘accept cookies’ popups that are
mandated by GDPR to appear when visiting many websites from within the
European Union, but a mandatory label or disclaimer for raw MT should not be
distracting or user-unfriendly. We should also clearly indicate what standard of
pre-publication review is necessary to avoid such a disclaimer.
The motivation for low-stakes MT, as described in Sect. 7.1, is for fast or
immediate translation of highly perishable low-risk text. As the shelf-life and risk
level of texts increase, producers or content providers may want to use MT in
translation workflows or even use raw MT to increase productivity and reduce
costs. There is a decision to make in this case, balancing potential risk against
speed and savings. Sometimes, as in Massidda (2015) and many multimedia trans-
lation production networks, consumer demand for simultaneous releases in all
locales—what used to be called simship—pushes turnaround speed. But is it really
the case that consumers cannot wait for the new game, software, or product, even if
this means a compromise on quality and an introduction of risk? Labelling MT published without review would at least make this trade-off visible.
Secondly, there should be a regulation (Ravid and Martens 2019)7 that enforces the advertising of MT usage as part of the translation process, so that users are aware and proceed with caution. This warning cannot be in small print hidden somewhere within the legal documentation of the product or the text; it must be visible in the application, the text, or the leaflet, spelling out the existing issues with the technology and the possible risks the users might encounter while using it.
Thirdly, translators and post-editors are also stakeholders in this ecosystem, as
they are obviously involved in the MT workflow as creators of training data when
post-editing, translating, and even when using MT as an additional tool. They have a
responsibility to acquire enough knowledge about the technology, to be aware of the
types of errors and biases found in the raw output and how best to fix them. More importantly, they also need to be aware of possible copyright infringements and of interferences in producing a high-quality final product. We wonder how long a
translator can be exposed to a simplified version of a language without being
influenced by it. We understand that translators are not always well compensated for their work, but there is a belief in some parts of the translation community that, because they are dealing with MT post-editing, the effort required and the responsibility towards the content, the user and reader, and the final quality are not as high as without MT, and that knowledge of a language is immutable and will not be influenced by processing poor translations and reading low-quality references.
The suggestion from Parasuraman et al. (2000) is that reasonably (but not entirely)
reliable automation can lead to skill degradation and to complacency in trusting the
automated output. In the same way that good journalism does not occur by cutting and pasting news from different sources, good translation practice does not involve just cutting and pasting references from different technologies; that process could result in a serious impoverishment of the language and the profession.
Fourthly, academics need to bring to light the aspects that we have mentioned in this chapter, mainly the use and propagation of curated data, the responsible use of that data, MT literacy for translators and users, and the analysis of user and reader reception of translated work across different translation modalities, languages, and genres. Academia has been better at analysing the use and effect of MT
than that of other tools (such as translation memories), possibly because the impact
of MT has also been greater. However, there is a need to engage more often and more
deeply with the final users of MT and within different branches of knowledge.
Interdisciplinary research cannot be a mere box to be ticked in grant applications,
it must be a real endeavour where academics value all aspects of a field. The more
mathematical or engineering sides of research cannot have precedence over the more
humanistic side of technology: the effect on the people using the technology. For
research on user reception and user experience, interdisciplinary research needs to
7 The authors offered a detailed description of copyright laws, translation and how AI might be infringing these international laws that protect authors and creators, and they suggest including work generated by AI systems.
happen at all levels of the research system, and not as an afterthought regarded as a
lesser science.
And finally, we should also consider the ethical responsibility of the users/
readers. If the ethical dimension is an ecosystem, users also have the responsibility
to buy products that protect language, translators, and those in the text supply chain
in the source and target culture. In the same way that some consumers might not buy
certain brands or in certain shops or from web portals or shopping malls because of
their operational practices or because in doing so they are destroying the local,
social, and commercial fabric of their city, region or country, users should have
enough information about how a text is produced to choose not to buy/see/hear a
given product or, at least, to know the effects this will have on their user experience,
their language, and their culture in the long term. This can only happen if readers and
users are informed by the whole ecosystem and if the ecosystem promotes transpar-
ent information sharing.
References
Bender EM, Gebru T, McMillan-Major A, Shmitchell S (2021) On the dangers of stochastic parrots: can language models be too big? In: FAccT ’21, March 3–10, 2021, Virtual Event, Canada
Bowker L, Ciro JB (2019) Machine translation and global research: towards improved machine
translation literacy in the scholarly community, 1st edn. Emerald Publishing, Bingley
Busselle R, Bilandzic H (2009) Measuring narrative engagement. Media Psychol 12:321–347.
https://doi.org/10.1080/15213260903287259
Bywood L, Georgakopoulou P, Etchegoyhen T (2017) Embracing the threat: machine translation as
a solution for subtitling. Perspectives 25:492–508. https://doi.org/10.1080/0907676X.2017.
1291695
Cadwell P, O’Brien S, DeLuca E (2019) More than tweets: a critical reflection on developing and
testing crisis machine translation technology. Transl Spaces 8:300–333
Canfora C, Ottmann A (2018) Of ostriches, pyramids, and Swiss cheese: risks in safety-critical
translations. Transl Spaces 7:167–201
Canfora C, Ottmann A (2020) Risks in neural machine translation. Transl Spaces 9:58–77
Castilho S (2016) Acceptability of machine translated enterprise content. Ph.D. Thesis, Dublin City
University
Ciribuco A (2020) Translating the village: translation as part of the everyday lives of asylum seekers
in Italy. Transl Spaces 9:179–201
Colman T, Fonteyne M, Daems J, Macken L (2021) It is all in the eyes: an eye-tracking experiment
to assess the readability of machine translated literature. In: The 31st meeting of computational
linguistics in The Netherlands, Ghent
Čulo O, Nitzke J (2016) Patterns of terminological variation in post-editing and of cognate use in
machine translation in contrast to human translation. In: Proceedings of the 19th annual
conference of the European association for machine translation, pp 106–114
Desjardins R (2017) Translation and social media. In: Theory, in training and in professional
practice, 1st edn. Palgrave Macmillan, London
Dixon P, Bortolussi M, Twilley LC, Leung A (1993) Literary processing and interpretation:
towards empirical foundations. Poetics 22:5–33. https://doi.org/10.1016/0304-422X(93)
90018-C
Doc Society v. Pompeo (2019) Doc Society v. Pompeo: a lawsuit challenging the State Department’s social media registration requirement. Knight First Amendment Institute at Columbia University. https://knightcolumbia.org/cases/doc-society-v-pompeo
Doherty S, O’Brien S (2014) Assessing the usability of raw machine translated output: a user-
centered study using eye tracking. Int J Hum Comput Interact 30:40–51. https://doi.org/10.
1080/10447318.2013.802199
Federici FM, O’Brien S (2020) Cascading crises: translation as risk reduction. In: Federici FM,
O’Brien S (eds) Translation in cascading crises. Routledge, Abingdon, pp 1–22
Guerberof-Arenas A, Toral A (2020) The impact of post-editing and machine translation on
creativity and reading experience. Transl Spaces 9:255–282
Guerberof-Arenas A, Moorkens J, O’Brien S (2019) What is the impact of raw MT on Japanese
users of Word: preliminary results of a usability study using eye-tracking. In: Proceedings of
XVII machine translation summit. European Association for Machine Translation (EAMT),
Dublin, pp 67–77
Guerberof-Arenas A, Moorkens J, O’Brien S (2021) The impact of translation modality on user
experience: an eye-tracking study of the Microsoft Word user interface. Mach Transl. https://
doi.org/10.1007/s10590-021-09267-z
Hakemulder J (2004) Foregrounding and its effect on readers’ perception. Discourse Process 38:
193–218. https://doi.org/10.1207/s15326950dp3802_3
Hern A (2020) Amazon hits trouble with Sweden launch over lewd translation. The Guardian
Kenny D, Moorkens J, do Carmo F (2020) Fair MT: towards ethical, sustainable machine translation. Transl Spaces 9:1–11
Koponen M, Sulubacak U, Vitikainen K, Tiedemann J (2020) MT for subtitling: user evaluation of
post-editing productivity. In: Proceedings of the 22nd annual conference of the European
association for machine translation. European Association for Machine Translation, Lisboa,
pp 115–124
Kranzberg M (1986) Technology and history: “Kranzberg’s Laws”. Technol Cult 27:544–560
Larsonneur C (2021) Neural machine translation: from commodity to commons? In: Desjardins R,
Larsonneur C, Lacour P (eds) When translation goes digital: case studies and critical reflections.
Springer, Cham, pp 257–280
Liebling DJ, Lahav M, Evans A et al (2020) Unmet needs and opportunities for mobile translation
AI. In: Proceedings of the 2020 CHI conference on human factors in computing systems. ACM,
Honolulu, pp 1–13
Marking (2020) Thai mistranslation shows risk of auto-translating social media content. Slator
Massidda S (2015) Audiovisual translation in the digital age: The Italian fansubbing phenomenon,
1st edn. Palgrave Macmillan, London
Matusov E, Wilken P, Georgakopoulou Y (2019) Customizing neural machine translation for
subtitling. In: Proceedings of the fourth conference on machine translation, vol 1. Association
for Computational Linguistics, Florence, pp 82–93
Mehta S, Azarnoush B, Chen B, et al (2020) Simplify-then-translate: automatic preprocessing for black-box machine translation. arXiv:2005.11197 [cs]
Moorkens J, Lewis D (2019) Research questions and a proposal for governance of translation data,
p 24
Nurminen M (2018) Machine translation in everyday life: What makes FAUT MT workable? In: TAUS eLearning blogs. https://blog.taus.net/elearning/machine-translation-in-everyday-life-what-makes-faut-mt-workable. Accessed 25 Aug 2020
Nurminen M, Koponen M (2020) Machine translation and fair access to information. Transl Spaces
9:150–169
O’Mathúna DP, Escartín CP, Roche P, Marlowe J (2020) Engaging citizen translators in disasters: virtue ethics in response to ethical challenges. TIS 15:57–79. https://doi.org/10.1075/tis.20003.oma
Olohan M (2017) Intercultural faultlines: research models in translation studies: v. 1: textual and
cognitive aspects. Routledge, London
7 Ethics and Machine Translation: The End User Perspective 133
Parasuraman R, Sheridan TB, Wickens CD (2000) A model for types and levels of human interaction with automation. IEEE Trans Syst Man Cybern 30:286–297. https://doi.org/10.1109/3468.844354
Parra Escartín C, Moniz H (2019) Ethical considerations on the use of machine translation and
crowdsourcing in cascading crises. In: Translation in cascading crises, 1st edn. Routledge,
London
Paullada A (2020) How does machine translation shift power? In: Resistance AI workshop at NeurIPS 2020, Virtual Event, Canada
Pichai S (2016) Google I/O 2016 - keynote
Quirk C, Menezes A, Cherry C (2005) Dependency treelet translation: syntactically informed
phrasal SMT. In: Proceedings of the 43rd annual meeting of the association for computational
linguistics (ACL’05). Association for Computational Linguistics, Ann Arbor, pp 271–279
Raley R (2003) Machine translation and global English. Yale J Crit 16:291–313
Rey B (2014) Your tweet half-life is 1 billion times shorter than Carbon-14’s. In: Wiselytics. https://www.wiselytics.com/blog/tweet-isbillion-time-shorter-than-carbon14/. Accessed 3 May 2021
Schmidtke D, Groves D (2019) Automatic translation for software with safe velocity. In: Pro-
ceedings of machine translation summit XVII volume 2: translator, project and user tracks.
European Association for Machine Translation, Dublin, pp 159–166
Smith R (2018) The google translate world cup. The New York Times
Sumita E (2017) Social innovation based on speech-to-speech translation technology targeting the 2020 Tokyo Olympic/Paralympic Games. Presentation at MT Summit XVI, Nagoya, Japan
Thicke L (2013) Post-editor shortage and MT. Multilingual Magaz 2013:42–44
Toral A (2019) Post-editese: an exacerbated translationese. arXiv:1907.00900 [cs]
Toral A, Way A (2018) What level of quality can neural machine translation attain on literary text? arXiv:1801.04962 [cs]
Vanmassenhove E, Shterionov DS, Way A (2019) Lost in translation: loss and decay of linguistic
richness in machine translation. In: Proceedings of machine translation summit XVII volume 1:
research track. European Association for Machine Translation, Dublin, pp 222–232
Vieira LN, O’Hagan M, O’Sullivan C (2020) Understanding the societal impacts of machine
translation: a critical review of the literature on medical and legal use cases. Inf Commun Soc
1:1–18. https://doi.org/10.1080/1369118X.2020.1776370
Vollmer SM (2020) The digital literacy practices of newly arrived Syrian refugees: a spatio-visual
linguistic ethnography. PhD Thesis, University of Leeds
Wang J, Xu C, Guzman F, et al (2021) Putting words into the system’s mouth: a targeted attack on neural machine translation using monolingual data poisoning. arXiv:2107.05243 [cs]
Way A (2018) Quality expectations of machine translation. In: Moorkens J, Castilho S, Gaspari F,
Doherty S (eds) Translation quality assessment: from principles to practice. Springer, Berlin, pp
159–178
Weaver W (1949) Translation. UNESCO memo. Rockefeller Foundation
Winner L (1983) Technologies as forms of life. In: Cohen RS, Wartofsky MW (eds) Epistemology,
methodology and the social sciences. Reidel, Dordrecht, pp 249–263
Yanisky-Ravid S, Martens C (2019) From the myth of babel to google translate: confronting
malicious use of artificial intelligence – copyright and algorithmic biases in online translation
systems. SSRN J. https://doi.org/10.2139/ssrn.3345716
Chapter 8
Ethics, Automated Processes, Machine Translation, and Crises
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
H. Moniz, C. Parra Escartín (eds.), Towards Responsible Machine Translation,
Machine Translation: Technologies and Applications 4,
https://doi.org/10.1007/978-3-031-14689-3_8
136 F. M. Federici et al.
8.1 Introduction
The United Nations Sustainable Development Goals (UN 2015) are underpinned by the motto “leave no one behind”, which reflects the notion of inclusive development. At the core of the UN’s 2030 Agenda for Sustainable Development, these goals drew attention to multidirectional communication as a way of increasing social cohesion and equity. Timely, efficient, and trustworthy communication is paramount to increasing the resilience of multilingual communities to cascading crises. The world is richly multilingual, and delivering urgent messages across languages, when it matters, quickly and accurately, makes such paramount communication an almost superhuman task. Technologies are therefore invaluable resources to support translation and interpreting in intercultural crisis communication settings.
Automation resources that are crucial to speed up communication come in the form of standalone applications, cloud-based platforms, and apps; they can deal with both spoken and written texts.1 While spoken texts have benefited from technologies such as automated speech recognition (ASR) and machine interpreting, written texts are served by machine translation (MT), computer-aided translation (CAT) tools, and hybrid MT-CAT solutions. Regardless of their nature, it is important to avoid enthusiastically portraying these technologies as self-sufficient solutions to pursue the UN’s ambitious “leave no one behind” motto, even though they clearly play a substantial role and will play an ever bigger one in the future. The questions we ask in this chapter are not intended to challenge technological innovations that are often desperately needed; rather, they are posed to raise some of the ethical issues that must be considered, and some of the challenges that need to be tackled, in order to apply technology in an appropriate and efficient manner.
Deploying technologies in support of translation/interpreting during crises in
multilingual settings poses serious deontological and ethical challenges. Most arise
from ethical concerns around the adoption of technologies that can be only partially
controlled. Often, MT output is taken at face value and MT engines can be part of
automation processes where no quality assurance is available, but where no other
options to communicate exist. Unchecked uses of MT, even if justified by situational
constraints and urgency, raise concerns. Technology-driven automation rests on an
ethically complex spectrum in which the human-machine interaction ought to be
central to inform the final decision-making process for providing translation in
cascading crises. Heralding the benefits of technology to specialists and professional
linguists is different to showcasing them to international relief collaborations in
which some lingua francas have assumed a dominant position.
1 See Nimdzi’s language technology atlas (www.nimdzi.com/language-technology-atlas-2019) for a detailed overview. Accessed 24 February 2022.
cyber- and terrorist attacks (Alexander and Pescaroli 2019; O’Brien and Federici
2019). Often used interchangeably, particularly the first two, these terms indicate
different triggers, from natural hazards via teleological choices (e.g. conflict,
cyberattacks, and terrorism), to technological failures (e.g. failing nuclear reactors,
collapsed hydroelectric power plants, etc.), to which correspond varying scales of
impact on populations, properties, infrastructures, and societies. When crises happen
in multilingual environments, they disrupt the way of life of entire communities
within one country or across a region; they often require resources beyond those
locally available to the most affected communities, and, increasingly, they have
cross-national cascading effects (Alexander and Pescaroli 2019). International crises
entail communication expectations to enable humanitarian operators, rescuers, and disaster managers to coordinate their efforts. They also need to provide
information to and gather information from affected communities, who should be
able to seek information using their own languages. In short, communication must be
a multidirectional exchange of information and must be accessible and inclusive if it
is to reduce risks for all members of the crisis-affected communities (Greenwood
et al. 2017; O’Brien et al. 2018).
In the 21st century, local triggers and hazards easily generate transboundary
cascading crises. On the one hand, the 2017 Sierra Leone mudslide had no repercussions in Europe or Asia, although the country was severely affected in terms of trade, farming, and transportation around the capital, Freetown. On the other hand, certain hazards set off international crises, as the COVID-19 pandemic has demonstrated. Although it can be argued that the hazard posed by the mutation of SARS-CoV-2 enabling animal-to-human transmission was not predictable, such unpredictability does not justify a lack of preparedness in terms of running multilingual information campaigns from the onset. Before the pandemic, WHO (2014) had published and revised information for medical personnel and crisis communicators to explain how to prevent the widespread contagion of severe acute respiratory syndromes (SARS).2 At the time when WHO published these measures in English,
they could have been translated into multiple languages, especially focusing on those
used in deprived and multilingual areas with under-resourced healthcare systems
(see Crouse Quinn 2008). Pioneering projects such as the Canadian Global Public
Health Intelligence Network (GPHIN), launched in 1997, have been using MT
technology to translate important measures for public health in multiple languages
(Mawudeku et al. 2013). Risk reduction measures such as GPHIN have not seen worldwide adoption, nor secured funding to achieve full coverage of the languages of under-resourced and vulnerable regions, though the UN SDGs agenda may increase momentum around these topics. However, GPHIN has influenced the development of the Epidemic Intelligence from Open Sources (EIOS) system, see
2 The authors refuse to use the semantically misleading “social distancing” as it is evidence of a kneejerk reaction, rather than the implementation of bespoke emergency plans: physical and spatial distance, face masks, and PPE were necessary preventive measures to avoid contagion; endless nation-wide lockdowns generating “social distancing” became necessary as a result of failures to act suitably in distancing people and reducing the spread of SARS-CoV-2.
the discussion in Sect. 8.4 of this chapter, which aims to scale up the GPHIN monitoring approach by increasing reliance on machine learning, web crawling, and integrated forms of automation for data collection and sharing (Abdelmalik et al. 2018, p. 268). Together with the Global Disaster Alerting Coordination System (GDACS), the fruit of collaboration between the UN Office for the Coordination of Humanitarian Affairs (OCHA) and the European Commission, which provides live updates and warnings on natural hazards, these systems should be the beacons of global preparedness. GPHIN and EIOS embed MT engines, but GDACS does not do so yet, even though it intends to support “disaster managers worldwide.”3
Risk preparedness is often a question of momentum and visibility, since, as
highlighted by Kelman (2020, p. 1), “[d]isasters are a socially and politically
structured phenomenon, arising from a combination of hazard and vulnerability,
with the associated risks reflected in public policies, infrastructure decisions and
considerations of inclusion”. To be inclusive, communication must “leave no one
behind”, so it needs to be in a language and format that are accessible. Given that
crisis lifecycles go beyond the response phase, MT resources can be used for
different purposes at various stages and operate in conjunction with other technol-
ogies such as CAT tools and ASR systems. Guidance on MT post-editing must be part of compulsory baseline training to maintain communication in myriad languages and avoid disastrous translations. The debate on whether MT solutions for information mining heighten rather than reduce risks remains open (Cadwell et al. 2019). When it comes to fully or partially automated technological solutions, the
multiple functions performed through automation processes embedded in translation
workflows require assessment from an ethical perspective (Kenny 2011; Parra
Escartín and Moniz 2020), acknowledging recent work on ethics and citizenship in
crisis translation (O’Mathúna and Hunt 2019; O’Mathúna et al. 2020). Our first
ethical stance is that automation processes of translation have a central role to play,
alongside human interaction, when it comes to increasing preparedness.
In the case of language challenges, circulation of information from the urban to
the rural areas of Sierra Leone was an issue in 2017, as it had been during the 2014
Western Africa Ebola outbreak, due to the multiple languages in which such
information had to reach affected people and overcome significant cultural and
social barriers. Communication issues have had global consequences during the
recent pandemic. One common denominator remains: people are interconnected, but planning for efficient crisis communication across multilingual communities continues to be missing, as do opportunities for embedding automation processes in human-monitored translation workflows to enhance crisis communication. In both
examples of markedly diverse crises in terms of sizes of affected communities and
geographical impact, circulation of information had a vital role to play, and com-
munication strategies have become “a key plank of responses” to crises (Quinn
2018, p. 1). In this context, technologies that support translation automation, so that
cascading information in affected areas can reach out to affected communities faster,
3 See https://gdacs.org. Accessed 24 February 2022.
are promising solutions. Not only should provision and access to information cut
across language boundaries, but also across social strata and sensory abilities. One
priority, underpinning most ethical issues, would be to identify who can be held
responsible for achieving multilingual communication since very rarely is a single
individual in charge of this objective, which is rather a distributed responsibility
within the international humanitarian sector (Federici et al. 2019a).
All members of affected populations and all language communities ought to be
able to find information relevant to them in a language and format they understand
(WHO 2017), including in this definition members of society affected by sensory
impairments such as deafness and blindness. To date, these communicative
exchanges fail to be symmetrical, often relying on a global or regional lingua franca,
which means that the risk of a top-down distribution of information to affected
populations remains high in international contexts (IFRC 2018). The dissemination
of information among affected communities features prominently when interlingual
translation is considered as part of the equation during crises, in which case an
apparent lack of planning tends to be observed when it comes to employing trans-
lators and interpreters in the response phase of crises. This issue leads to two further
problems: (1) translation capacity relies on both planning, which cannot just happen
when the crisis emerges, and availability of resources, which varies hugely for each
affected locality; (2) disregard of translation and interpreting (T&I) issues means that
the focus of the relief effort continues to be on the ad-hoc response phase of any
crisis rather than on the planning one.
These considerations dictate our second ethical stance: T&I must be acknowl-
edged more broadly in relation to the lifecycle of crises. For disaster risk reduction
experts and government officials, the best solutions are those that have been planned,
for which personnel have been trained, and of which the population has been
informed. To enhance social resilience, crises are better considered in relation to
preparedness (emergency plans, budgeting, assessing hazards, creating and evaluat-
ing resources in relation to them), which can encompass language needs. In this
respect, selecting, funding, and developing automation processes to support T&I to
meet such needs will contribute to designing emergency plans that mitigate the risks
in a multilingual crisis.
Ethical engagement with affected communities implies rectifying this asymmet-
rical communication that pivots around the main language of one country, or the
lingua franca in international settings, by enabling multi-directional communication
through policies and emergency plans (Federici et al. 2019b). Translation can be an
ethically sustainable risk reduction tool (Federici and O’Brien 2020) once, and if,
multilingual communication is planned and part of the equation. Awareness of the cultural and linguistic needs of the affected communities is a sine qua non when engaging multilingual communities and, to be successful in its objectives, such engagement needs to be supported by appropriate translation technologies and workflows. Translated texts, including audiovisual ones, are more likely to lead recipients to consider risk reduction actions if they are trusted. Full automation is a risk as MT specialists and T&I
professionals know the strengths, weaknesses, and differences of usage of current
systems, but non-discerning users may not, and could thus deploy MT engines in
ways that end up eroding trust (see discussion of Wuhan’s crisis response in the next
section). To be ready and able to deploy all available solutions, one needs to know
the vulnerabilities of the affected (members of) society. These ethical principles of engagement with communities for the promotion of symmetrical channels of communication are pivotal when considering the role of T&I and automation in crisis settings.
From an operational perspective, the use of MT solutions ultimately seeks to
provide language support that is critical to the response phase of disaster, crisis, and
emergency management (Hunt et al. 2019). Yet, it can be argued that a technology-centred approach is not enough to establish trust in the message, as any MT output will be used, read, adopted, and/or manipulated by humans at some point in the
communicative workflow. Although an important solution, it could lead to mis-
conceptions and faulty reception down the line, as translated materials have an
important role to play throughout a crisis lifecycle.
A lot of emphasis has been placed on creating MT engines that use, as one of their
main linguistic components, the international response lingua franca, i.e., English, in
an attempt to support efficient communication to aid affected populations and to
support responders. Successful operations such as Mission 4636 (discussed in Sect.
8.3) justify this initial approach. This perspective marries many of the issues
surrounding automation (e.g. MT, automated workflows, pivoting techniques, ren-
dering of multimodal documentation) with notions such as urgency and response. As
MT is perceived to “dramatically increase the speed by which relief can be provided” (Lewis et al. 2011, p. 501), it is obvious why it should be considered first for the response phase of a crisis. Automation processes could achieve more by supporting under-resourced languages and enabling communities to translate preparedness materials
and create their own bilingual resources (translation memories and domain-specific
engines) to enhance their resilience over time to known hazards. From working with
community translators (O’Mathúna et al. 2020), who may use MT output in uncrit-
ical manners, to time-pressed end-users who need translated information, ethical
concerns abound. Early and regular collaboration among technology experts and
low-resource language users could help alleviate some of these concerns. In the response phase, emergency personnel can do with having “a sufficient picture of a
situation that responders will be able to plan and execute human and material
resource deployment activities” (Christianson et al. 2018, p. 8). It is, however, a
one-way communication solution; the question of how affected communities would
be able to direct their questions to those providing aid remains a critical issue.
In particular, by focusing on high-resource languages, natural language
processing (NLP) researchers have prioritised methods that work well only when
large amounts of labelled and unlabelled data are available (Ruder 2020). This very
reality clashes with the unpredictability of crisis settings, in which the languages
spoken by the affected populations may, or may not, have any amount of labelled/
unlabelled data. MT technologies are one of the various resources available to
responders and, to attain any ethical value, they need to be based on machine learning (ML) approaches in which the automated processes do not reflect a worldwide bias in favour of English but provide the support tools necessary to the Global South. Also, low-resourced
language communities must be made aware that MT engines and the ML models on
which they are based have been able to extract only limited information, creating a
resource that has scarce potential for direct use.
Our third ethical stance focuses on the need for MT and automation processes to
be presented clearly for their potential as well as for their limitations, which must be
critically explained to users.
Since 2011, the use of MT in crisis settings has been explored (Lewis et al. 2011;
Munro 2010, 2013; Christianson et al. 2018), together with time-saving automation
processes and crowdsourcing (Sutherlin 2013; Mulder et al. 2016), as well as the technological dependencies of the Third Sector (Rico Pérez 2013, 2020). Cadwell
et al. (2019) highlight the problems of data mining social-media resources using MT
engines. As far as the use of social media in crisis scenarios is concerned, the
technological framework in which relief responders operated in the 2010 Haiti
earthquake stands out as a paradigm that continues to influence the deployment of
MT in twenty-first-century crises. The approach took logistical connectivity into consideration, including the crowdsourcing of the translation of text messages and social media messages between the local language, Haitian Creole, and that of the responders. Mission 4636 (mission4636.org) launched soon after the scale of
destruction, mortality, and morbidity brought about by the earthquake became clear.
A partnership among 50 countries, Mission 4636 provided an online translation and
information processing service, connecting the Haitian people with the international
aid efforts. Devised to support local emergency response, this was a technology silo
in which local crowdsourcing efforts merged with translation—human translation
first, supported by MT—for the purpose of information assimilation (i.e., gaining a
superficial understanding of the meaning of what is being conveyed through the
automated translation output) and interchange.4 Haitians could mainly be reached by phone and text messaging. By setting up the free emergency phone number ‘4636’, to which people could send text messages, relief organisations gained a direct channel to collect and share information. With cell towers restored almost immediately after the disaster and with 83% of men and 67% of women owning a mobile phone, Haiti remained connected. Survivors looking for relatives and friends could find out through text messages and social media about their location and physical situation, and they did so mainly in Haitian Creole, a language that international rescuers arriving in Haiti did not speak or understand (Munro 2010; Lewis et al. 2011).
Cellular connectivity remained present and wide coverage was quickly restored
during the Haiti crisis; in its resilience, the mobile communication infrastructure
4 See Hutchins (1995, 2005) on the use of MT for the purpose of assimilation and interchange.
allowed responders to translate messages sent to the 4636 number. The messages located individuals’ or groups’ requests for help on maps so that emergency logistics could be deployed there; they were translated by about 2000 volunteers who
spoke Creole and/or French. Munro (2013) reports that over 80,000 messages were
processed, producing 45,000 unique reports that were transmitted to emergency
rescue teams on the ground. Increasingly over the first few weeks, this task was
then undertaken by people who were paid to provide translations. In the same period,
MT engines were developed (by Microsoft first, Google next) using the processed messages, thus creating an additional support for the relief response, which complemented the communication chain organised around the text messages
(Munro 2010; Lewis et al. 2011). The crowdsourcing and translation approach in place allowed responders to gain an idea of what the crowdsourced-translated texts conveyed and to act quickly upon the information received. Through its ability
to empower disaster-affected communities and help them define the way in which
they could obtain help, the crowdsourced mapping approach of Mission 4636 has
quite simply revolutionised the way in which crisis response is nowadays perceived
and articulated (Harvard Humanitarian Initiative 2011).
With the combined efforts of crowdsourcing, mapping, and translating (with the
support of MT), Mission 4636 shaped emergency responses to other high-profile
crises such as the 2011 Great East Japan Earthquake, the 2013 Typhoon Haiyan/
Yolanda (Southeast Asia, but it mainly affected the Philippines), the 2014 Western Africa Ebola outbreak, the 2015 Cyclone Pam (South Pacific, but it mainly affected
Vanuatu), and the 2015 Nepal earthquake (IFRC 2013; Meier 2015; Ramalingam
and Sanderson 2015). In the two days following Typhoon Haiyan in the Philippines
in 2013, for instance, nearly 230,000 tweets were collected and processed. Only 800 (0.35%) were relevant, but they provided critical information about the most affected regions that emergency response had to reach, thus saving lives (Moore and Verity 2014). In the evacuation phase, as recalled by Field (2017, p. 341):
a common recollection in interviews of one of the causes of low evacuation rates in the days
preceding the landfall of typhoon Yolanda was the fact that the projected tidal impact on the
exposed coastal regions was referred to as a ‘storm surge’ rather than a tsunami or a
destructive wave. While the two are scientifically different phenomena, it was acknowledged
that had the threat of the storm surge been likened to that of a tsunami (for a coastal
population hit by a wave, the impact would be similar), the coastal regions would have
seen higher evacuation rates.
In the aftermath of the Haiti earthquake relief response and in light of the role
played by machine translation, Lewis et al. (2011, p. 510) proposed a crisis cook-
book, on the grounds that MT, which had proven to “dramatically increase the speed
by which relief can be provided”, should become “a standard tool in the arsenal of
tools needed in crisis events”. The argument is clear: if MT facilitates communica-
tion for assimilation and interchange purposes, provided the translated information is
accurate enough to be used, then it cannot but be the centrepiece of making content
available to a local population affected by a crisis in a language that they can
understand.
consisting of “over 250 college teachers and students, frontline responders, medical
staff, procurement agents, overseas donors, and foundation officers in Wuhan and
across the world” (ibid., p. 520). Volunteers translated the messages posted in the
WeChat group from Chinese into English, French, Japanese, Korean, Portuguese,
Russian, Spanish, Thai, and Vietnamese (ibid.).
Automation, central in these early efforts to disseminate information on the
pandemic, was given priority over CAT tools because, as Wang (2019, p. 98) states,
“it usually takes time to train a translator to use translation memory tools, [so] all
volunteer translators were instead encouraged to use two neural machine translation
tools (Google Translate and Youdao Translator) in this time-constrained context”.
Yet, awareness of potential MT shortcomings meant that the crowdsourcing efforts
continued in terms of volunteers’ post-editing role: “No matter whether they were
students, professionals or other bilinguals, all volunteers were asked to carry out
post-editing of machine-translated output since this situation was too risky to rely on
raw machine translation” (ibid., p. 99). We need to delve further into this type of automation.
Developing ever more accurate automated translation outputs has intensified
existing ethical dilemmas. A first issue relates to the perceived readiness of the translated information. As MT output is increasingly used for assimilation purposes, the perception of its quality has posed, and continues to pose, substantial issues, especially when the output is no longer treated as a gist of the content but, instead, is considered to be a complete, translated text. The ethical implications that derive
from the text output being seen as operative are significant, especially in crises. A
second issue, more widely discussed, relates to the improbable balance of accessi-
bility versus privacy. Although accessibility might very well be the motor for social
inclusion and for promoting intercultural dialogue (Matamala and Ortiz-Boix 2016),
the underlying risk is that such inclusion and dialogue rest on big tech companies
gaining access to and processing personal data. If the advantage of increased
access to translated information is a given when it comes to furthering cross-
organisation collaboration and multilingual communication, then the downside
raises strong reservations in terms of infrastructural access to tools (connectivity
and storage), data privacy (third-party access and usage), financial interests (eco-
nomic exploitation of crowdsourced/free data), and environmental impact (Knight
2020; Joscelyne 2021). Even when initiatives are launched as part of a charity effort,
political and financial interests seem to always be at stake.
With the statistical model of automated translation introduced in the late 1980s,
the management of data took a central role. Neural machine translation (NMT)
equally relies on deep learning processes that use datasets of previously translated
sentences to train a model capable of translating between any two languages
(Lanners 2019). Tests on using a lingua franca as a pivot to improve MT engines for low-resource languages have shown promise but also limitations, as pivot MT engines obtain what can, ethically, only be considered barely acceptable output quality (Silva et al. 2018; Liu et al. 2019). In order for a low-resource language to
become more resourced, so that a larger dataset is available, automated web crawling
is used, typically for in-domain knowledge. Although the internet was conceived as a
8 Ethics, Automated Processes, Machine Translation, and Crises 147
virtual space with great democratic potential, the reality is that the presence of certain
languages on the web does not allow for large-data processing in low-resource
languages (cf. data on languages used on websites, as collated by Statista 2020).5
When it comes to out-of-domain knowledge platforms, machine learning is still in its
infancy for the over 2000 African languages. Projects such as Lanfrica (Emezue and
Dossou 2020) and Masakhane (masakhane.io) aim to strengthen and spur NLP
research in African languages and to raise their visibility and standing among MT
researchers. Against this backdrop, the African experience could be capitalised on as
a model for the operationalisation and refinement of multilingual MT solutions in
crises. Experts on NLP and machine learning for translation purposes, from all over
the world, could also benefit from quality improvements to systems, approaches, and
workflows implemented when working with low-resource languages.
Taking 2010 Haiti and 2020 Wuhan as comparator points, the two crises have one astonishing element in common: how little audiovisual translation options have been studied (and only in part exploited) for their crucial role in crisis communication. Rogl (2017,
p. 244) points out how “[a]nother important mission for translators” during the early
phase of recovery in Haiti included “a series of subtitling assignments for webcasts
and documentaries reporting on the earthquake”. In a technology-driven multimedia
society like the present one, the value of moving images, accompanied by sound
and text, is crucial when it comes to engaging in communication. It would be natural
to expect that they would be ubiquitously used with translations in crises, as
audiovisual formats have been hailed as the quintessential means of communication
in the twenty-first century. Indeed, audiovisual productions are omnipresent in our day and age, and their seductive power and influence live on via our screens. Some of the statistics compiled by Stancheva (2021, online) regarding our consumption of video are mind-boggling, such as the fact that “[v]ideo is the
number 1 source of information for 66% of people” and “92% of people who watch
videos on their mobile phone go on to share the content with other users”. Not only
are audiovisual productions given priority by citizens when it comes to finding
information, but they also seem to be pivotal in the reiteration and dissemination
of such knowledge. And yet, despite their communicative virtues and the increasing
numbers of individuals, companies, institutions, and international organisations that
resort to audiovisual and multimodal material as their preferred mode of communi-
cation, systematic uses of accessible videos, and particularly translated ones, in
5 According to aggregated data published on Statista.com, the most commonly used languages on websites are English 25.9%, Chinese 19.4%, Spanish 7.9%, Arabic 5.2%, Indonesian/Malaysian 4.3%, Portuguese 3.7%, French 3.3%, Japanese 2.6%, Russian 2.5%, German 2%, and all other languages 23.1%. www.statista.com/statistics/262946/share-of-the-most-common-languages-on-the-internet. Accessed 24 February 2022.
148 F. M. Federici et al.
6 See http://en.hubei.gov.cn/special/coronavirus_2019/index.shtml. Accessed 24 February 2022.
7 See https://endangeredlanguagesproject.github.io/COVID-19. Accessed 24 February 2022.
sense, audiovisual messages can help build that trust, as the speakers can be seen on
screen, thus reaching the intended audience in a more impactful manner. However,
these advantages can be somewhat curtailed because the translation of multimedia
products can be perceived as being more complicated than that of written texts,
particularly from a technological perspective. Indeed, the added logistical complexity of having to translate audiovisual productions into a plurality of languages—very frequently minority and/or minoritised ones—contributes to the apparent lack of appetite
for exploiting the production and multilingual dissemination of informative and
instructive videos.
From an academic perspective, the situation is not much better and little research
has so far focused on the role, challenges, and potential of audiovisual communica-
tion and translation in crises. As an emerging field within the wider discipline of
translation studies, the remit of crisis translation should in future include ways of
contextualising multimodal translation (subtitling, dubbing, audio description,
respeaking, etc.) for the significant role it already plays in the crisis lifecycle to
educate on preparedness measures, to inform of risk mitigation measures, and to
reach low-literacy and sensory-impaired audiences. Audiovisual translation studies lag behind in this respect and have so far been somewhat restricted in scope. Yet, as previously mentioned, the fact
remains that TV news broadcasts become crucially important in periods of crisis
as key tools to disseminate and receive information, as happened after the 2011
Great East Japan Earthquake (Sato et al. 2009; Tsuruta 2011; Kaigo 2012), and as
highlighted by Wang (2019) in Wuhan’s response. In these cases, translation might
not be immediately apparent on screen as different stations air their programmes in
their own languages. However, depending on the broadcasters’ awareness and
priorities, as well as on the potential existence of legislation on the topic, the figure
of the sign language interpreter can become a regular actor in these settings so as to
convey the information to the D/deaf community. On occasion, and to enhance
audiovisual accessibility and social inclusiveness, the screen is also shared with
subtitles that are created for D/deaf and hard of hearing people. Less prominent
seems to be the presence of audio description in these environments as a means to
communicate with other societal groups, such as the blind and the partially sighted, and
further research ought to be conducted on the topic to gain a more comprehensive
picture of the current state of affairs.
Having said all this, it is true that audiovisual materials are being increasingly
exploited in crises, particularly during the response phase, when instructional and
informative videos are created to help those affected, as in the example about Wuhan
previously discussed. One of the defining characteristics of audiovisual translation is
its fundamental enmeshment with technology and technical advancements (Díaz-
Cintas 2014, 2018; Baños 2018; Doherty and Kruger 2018; Díaz-Cintas and
Massidda 2019), which opens up the gates to exploration into the potential that
linguistic and technical automation can have when dealing with the translation of
audiovisual materials in crises. In such contexts, not only practices like MT, whose
implementation in the subtitling industry has increased substantially over the past
few years, but also technologies such as (automatic) speech recognition (ASR) and
speech synthesis (Ciobanu and Secară 2019) could be efficiently implemented to
help translating audiovisual content. In this respect, speech recognition has been
used to transcribe the dialogue uttered in videos into (intralingual) subtitles, both
automatically, with the help of ASR, and with the participation of a respeaker, in
which case, “the process is not fully automated as a professional is needed to dictate
or respeak the original speech” (Baños 2018, p. 31). YouTube automatic captioning
offers another instance of the use of ASR in subtitling, whereby Google’s speech
recognition technology is used to automatically convert the speech from a clip into
text, which is then synchronically presented as subtitles on screen. Likewise, text in
the form of existing subtitles or lists of dialogue can be converted to speech through
the use of speech synthesis. The resulting voice track can then be embedded in any
given audiovisual material to make it more accessible, for instance to people who are
blind.
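The ASR-to-subtitles step described above can be sketched in a few lines: given timestamped transcript segments (invented here for illustration, as an ASR system might emit them), one can produce the widely supported SubRip (SRT) subtitle format. This is a minimal sketch under those assumptions, not any particular platform’s implementation.

```python
# Convert hypothetical ASR output (start/end in seconds, text) into
# SubRip (SRT) subtitles, a plain-text format most video players accept.

def to_srt_time(seconds):
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """segments: list of (start_sec, end_sec, text) tuples."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{i}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n")
    return "\n".join(blocks)

# Invented example segments, of the kind a crisis-response broadcast might need:
segments = [
    (0.0, 2.5, "Stay indoors and keep windows closed."),
    (2.5, 6.0, "Updates will follow every thirty minutes."),
]
print(segments_to_srt(segments))
```

The same segment list could equally be fed to a speech-synthesis step to produce the voice track mentioned above; only the output format changes.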
Although the potential of MT in settings of crisis translation should certainly be
investigated as a tool to increase productivity and speed up the transfer process of
audiovisual productions, it is our contention that the most important issues should be
prioritised in this emerging field. Indeed, the designing, testing and implementation
of Audiovisual Translation (AVT) workflows that could be operational during a
crisis eventuality should be given utmost precedence, as well as the exploration of
the users’ likes and dislikes when it comes to accessing this type of material in a
revoiced (e.g., dubbed, voiced-over) or subtitled version. Of course, this would be of
significant consequence in the case of translating languages with little written
tradition. Another area that merits urgent attention is the potential unleashed by
the technical migration to the cloud and the popularity of online collaborative
platforms designed exclusively for dubbing and subtitling purposes. In these new
virtual spaces, activities such as cybersubtitling and cyberdubbing (Díaz-Cintas
2018) present both opportunities and challenges as far as productivity and the
ethics of volunteering are concerned. Our fourth and final ethical stance is the
reiteration that we urgently need more systematic research on the usage of audiovi-
sual translation in crises, as audiovisual materials support most citizens and, thus,
contribute to the inclusivity agenda behind the UN Sustainable Development Goals.
8.6 Conclusions
The realisation of the UN’s “leave no one behind” SDGs is very uncertain and
definitely projected into a distant future. Inequities still remain in access to informa-
tion and to the infrastructures that allow data management and storage. The COVID-
19 pandemic has reminded us that circulation of information—even when commu-
nicating risks in the interest of the whole society—is an extremely complex task.
Lending trustworthiness to the various voices and speakers depends on many factors
but language is certainly quintessential to creating successful communication chan-
nels. Urgency, gaps, and “language indifference” (Polezzi forthcoming) from mono-
lingual, power-holding groups risk justifying rushed and incomplete forms of full
automation through technologies. Such an approach underestimates the nuanced deployment landscapes that the computer scientists who design these technologies envisage for them, and diminishes the very impact of MT engines in supporting humans in dealing with crises.
Treating the translation of information, regardless of the mode used, and its role in addressing social inequality as matters that MT-driven or automated translation workflows can resolve as standalone, self-sufficient solutions risks obscuring the real potential behind translation automation. Automation should be fully implemented, but people should also be trained to work with it and enhance it, with the ultimate goal of strengthening social resilience against adversity. Global early warning systems such
as GDACS could scale up the approach that was being used by GPHIN, combining
automated information mining, MT, and domain-specific language experts (Tanguay
2019). The Epidemic Intelligence from Open Sources (EIOS) initiative seems to be a successful move in this direction, as it aims “to create a unified, all-hazards, One Health approach by
using open-source information for early detection, verification and assessment of
public health risks and threats” (Abdelmalik et al. 2018, p. 268). Its design allows the
system to aggregate data streams from other systems, with a capacity to scale up the
data-harnessing features of GPHIN by relying on the WHO global network of
country-level offices, as well as all the open sources on the internet, and local
scientists’ reports and warnings. As mentioned earlier, EIOS picked up the first
mention of the epidemic in Wuhan. It is unclear how expert linguists are expected to
interact with the system. Of course, the design and development of new technology
and automated platforms bring about a wide range of ethical considerations that we
have not touched upon in this contribution, as our main interest lies in the use that is
made of such technology in the interactions with translation actors. The implications
from a technical point of view have been broached by authors like Parra Escartín and
Moniz (2020).
There are numerous multimodal contexts that currently embed MT engines and
raise ethical issues. In many areas, volunteerism and technology lull users into a false sense of security about the translation quality achieved through automated processes. Projects
such as Masakhane are an example of how the roughly 1500 languages in use in African countries have seen only scattered and unsystematic work to create MT engines and minable resources. They show promise, but also raise concerns about access, in terms of what is available, which has only recently started being optimised, and what is infrastructurally possible. For rare and low-resource languages, the risk is that any
solution for automating translation processes might be applied, and often can only be
adopted, by untrained users and deployed as a final translation product with limited
diagnostic Quality Assurance (QA) that takes into consideration the responses of members of the user group.
Our first ethical consideration has been that applications and uses of automation
processes of translation have a central role to play, together with human interaction,
in increasing preparedness, but more focus is needed on human-computer
interaction and the role of humans in quality assurance processes. Our second ethical
consideration is that T&I automation processes must be part of the lifecycle of crises
and not just of the response phase. Our third consideration is that to reduce ethical
concerns about the application of MT and automation processes to crisis communi-
cation, these must be embedded in global platforms that aim to reduce risks, but their
References
Joscelyne A (2021) How does AI ethics impact translation? TAUS Blog. https://blog.taus.net/
multilingual-morals-how-does-ai-ethics-impact-translation. Accessed 21 July 2021
Kaigo M (2012) Social media usage during disasters and social capital: Twitter and the Great East
Japan Earthquake. Keio Commun Rev 24:19–35
Kelman I (2020) COVID-19: what is the disaster? Soc Anthropol 28(2):296–297. https://doi.org/10.
1111/1469-8676.12890
Kenny D (2011) The ethics of machine translation. In: Ferner S (ed) Reflections on language and
technology: the driving forces in the modern world of translation and interpreting. NZSTI,
Auckland
Knight W (2020) AI can do great things—if it doesn’t burn the planet. Wired. https://tinyurl.com/
W20HBTC. Accessed 21 July 2021
Lanners Q (2019) Neural machine translation. Towards data science. https://towardsdatascience.
com/neural-machine-translation-15ecf6b0b. Accessed 21 July 2021
Lewis WD, Munro R, Vogel S (2011) Crisis MT: developing a cookbook for MT in crisis situations.
Proceedings of the sixth workshop on statistical machine translation, Edinburgh, 30-31 July
Liu C-H, Way A, Silva C, Martins AF (2019) Pivot machine translation in INTERACT project.
Proceedings of machine translation summit XVII volume 2: translator, project and user tracks,
Dublin
Matamala A, Ortiz-Boix C (2016) Accesibilidad y multilingüismo: un estudio exploratorio sobre la
traducción automática de descripciones de audio. Trans 20:11–24. https://doi.org/10.24310/
TRANS.2016.v0i20.2059
Mawudeku A, Blench M, Boily L, John RS, Andraghetti R, Ruben M (2013) The global public
health intelligence network. In: M’ikanatha NM, Lynfield R, Van Beneden CA, de Valk H (eds)
Infectious disease surveillance. Wiley, Hoboken, pp 457–469
McCulloch G (2020) Covid-19 is history’s biggest translation challenge. Wired. https://www.wired.
com/story/covid-language-translation-problem. Accessed 21 July 2021
Meier P (2015) How digital Jedis are springing to action in response to Cyclone Pam. iRevolutions.
https://irevolutions.org/2015/04/07/digital-jedis-cyclone-pam. Accessed 21 July 2021
Moore R, Verity A (2014) Hashtag standards for emergencies: OCHA policy and studies brief.
United Nations Office for the Coordination of Humanitarian Affairs (OCHA). https://www.
unocha.org/publication/policy-briefs-studies/hashtag-standards-emergencies. Accessed
21 July 2021
Mulder F, Ferguson J, Groenewegen P, Boersma K, Wolbers J (2016) Questioning big data:
crowdsourcing crisis data towards an inclusive humanitarian response. Big Data Soc 3(2):
1–13. https://doi.org/10.1177/2053951716662054
Munro R (2010) Crowdsourced translation for emergency response in Haiti: the global collabora-
tion of local knowledge. AMTA workshop on collaborative crowdsourcing for translation,
31 October, Denver, CO
Munro R (2013) Crowdsourcing and the crisis-affected community: lessons learned and looking
forward from Mission 4636. J Inf Retr 16(2):210–266
Nurminen M, Koponen M (2020) Machine translation and fair access to information. Transl Spaces
9(1):150–169
O’Brien S (2019) Translation technology and disaster management. In: O’Hagan M (ed) The
Routledge handbook of translation technologies. Routledge, London, pp 304–318
O’Brien S, Federici FM (2019) Crisis translation: considering language needs in multilingual
disaster settings. Disaster Prev Manag 29(2):129–143
O’Brien S, Federici FM, Cadwell P, Marlowe J, Gerber B (2018) Language translation during
disaster: a comparative analysis of five national approaches. Int J Disaster Risk Reduct 31:627–
636. https://doi.org/10.1016/j.ijdrr.2018.07.006
O’Mathúna DP, Hunt MR (2019) Ethics and crisis translation: insights from the work of Paul
Ricoeur. Disaster Prev Manag 29(2):175–186
O’Mathúna DP, Parra Escartín C, Roche P, Marlowe J (2020) Engaging citizen translators in
disasters: virtue ethics in response to ethical challenges. Transl Interpret Stud 15(1):57–79.
https://doi.org/10.1075/tis.20003.oma
Parra Escartín C, Moniz H (2020) Ethical considerations on the use of machine translation and
crowdsourcing in cascading crises. In: Federici FM, O’Brien S (eds) Translation in cascading
crises. Routledge, London, pp 132–151
Piller I (2016) Linguistic diversity and social justice: an introduction to applied sociolinguistics.
Oxford University Press, Oxford
Piller I (2020) COVID-19 forces us to take linguistic diversity seriously. Perspectives on the
pandemic: international social science thought leaders reflect on COVID-19, 12, 13–17.
https://www.degruyter.com/fileasset/craft/media/doc/DG_12perspectives_socialsciences.pdf
Polezzi L (forthcoming) Language indifference. In: Forsdick C, Kamali L (eds) Translating
cultures: a glossary. Liverpool University Press, Liverpool
Quinn P (2018) Crisis communication in public health emergencies: the limits of ‘legal control’ and
the risks for harmful outcomes in a digital age. Life Sci Soc Policy 14(1):1–40. https://doi.org/
10.1186/s40504-018-0067-0
Ramalingam B, Sanderson D (2015) Nepal earthquake response: lessons for operational agencies.
ALNAP/ODI. https://reliefweb.int/sites/reliefweb.int/files/resources/nepal-earthquake-
response-lessonspaper.pdf. Accessed 21 July 2021
Rico Pérez C (2013) From hacker spirit to collaborative terminology: resourcefulness in humani-
tarian work. Transl Spaces 2(1):19–36
Rico Pérez C (2020) Mapping translation technology and the multilingual needs of NGOs along the
aid chain. In: Federici FM, O’Brien S (eds) Translation in cascading crises. Routledge, London,
pp 112–131
Rogl R (2017) Language-related disaster relief in Haiti: volunteer translator networks and language
technologies in disaster aid. In: Antonini R, Cirillo L, Rossato L, Torresi I (eds)
Non-professional interpreting and translation. State of the art and future of an emerging field
of research. John Benjamins, Amsterdam, pp 231–255. https://doi.org/10.1075/btl.129.12rog
Ruder S (2020) Why you should do NLP beyond English. http://ruder.io/nlp-beyond-english.
Accessed 21 July 2021
Sato K, Okamoto K, Miyao M (2009) Japan, moving towards becoming a multi-cultural society,
and the way of disseminating multilingual disaster information to non-Japanese speakers.
Proceedings of the 2009 International Workshop on Intercultural Collaboration
Silva CC, Liu C-H, Poncelas A, Way A (2018) Extracting in-domain training corpora for neural
machine translation using data selection methods. Proceedings of the Third Conference on
Machine Translation
Stancheva T (2021) 24 noteworthy video consumption statistics. TechJury. https://techjury.net/
blog/video-consumption-statistics. Accessed 21 July 2021
Sutherlin G (2013) A voice in the crowd: Broader implications for crowdsourcing translation during
crisis. J Inf Sci 39(3):397–409
Systran (2021) 12 translation models specialized with corona crisis data. https://www.systransoft.
com/systran/news-and-events/specialized-corona-crisis-corpus-models/#try. Accessed
21 July 2021
Tanguay F (2019) GPHIN. Presentation. World Health Organization. https://www.who.int/docs/
default-source/eios-gtm-2019-presentations/tanguay-phac–-eios-gtm-2019.pdf?sfvrsn=
8c758734_2. Accessed 21 July 2021
TAUS (2021) TAUS corona crisis corpora. https://md.taus.net/corona. Accessed 21 July 2021
Tsuruta C (2011) Broadcast interpreters in Japan: Bringing news to and from the world. Inter-
preters’ Newsl 16:157–173
UN (2015) Sustainable development: the 17 goals. United Nations. https://sdgs.un.org/goals.
Accessed 21 July 2021
Wang P (2019) Translation in the COVID-19 health emergency in Wuhan: a crisis manager’s
perspective. J Int Local 6(2):86–107
WHO (2014) Infection prevention and control of epidemic and pandemic prone acute respiratory
infections in health care. World Health Organization. https://www.who.int/publications/i/item/
infection-prevention-and-control-of-epidemic-and-pandemic-prone-acute-respiratory-infec
tions-in-health-care. Accessed 21 July 2021
WHO (2017) Communicating risk in public health emergencies. A WHO guideline for emergency
risk communication (ERC) policy and practice. World Health Organization
Yourterm (2021) COVID-19 terminology resource centre. https://yourterm.org/covid-19. Accessed
21 July 2021
Zhang J, Wu Y (2020) Providing multilingual logistics communication in COVID-19 disaster relief.
Multilingua 39(5):517–528. https://doi.org/10.1515/multi-2020-0110
Part III
Responsible Machine Translation: Societal
Impact
Chapter 9
Gender and Age Bias in Commercial
Machine Translation
Abstract The main goal of Machine Translation (MT) has been to correctly convey the content of the source text in the target language. Stylistic considerations have
been at best secondary. However, style carries information about the author’s
identity. Mostly overlooking this aspect, the outputs of three commercial MT systems
(Bing, DeepL, Google) make demographically diverse samples from five languages
“sound” older and more male than the original texts. Our findings suggest that
translation models reflect demographic bias in the training data. This bias can
cause misunderstandings about unspoken assumptions and communication goals,
which normally differ for different demographic categories. These results open up
interesting new research avenues in MT to take stylistic considerations into account.
We explore whether this bias can be used as a feature, by correcting skewed initial
samples, and compute fairness scores for the different demographics.
F. Bianchi
Computer Science Department, Stanford University, Stanford, USA
e-mail: [email protected]
T. Fornaciari
Italian National Police, Rome, Italy
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
H. Moniz, C. Parra Escartín (eds.), Towards Responsible Machine Translation, Machine Translation: Technologies and Applications 4, https://doi.org/10.1007/978-3-031-14689-3_9
9.1 Introduction
1 https://www.theguardian.com/technology/2017/oct/24/facebook-palestine-israel-translates-good-morning-attack-them-arrest.
salient and well-studied aspects, but our findings hold more broadly for a wide range
of socio-demographic speaker-attributes.
Simply put, we would like our gender identity and age to be respected when we
communicate, via translation, with someone. Gender identity and age are thus called
Language Invariant Properties (Bianchi et al. 2021b): properties that should not
change when we translate text. However, currently, this does not happen, as we
demonstrate empirically in Sect. 9.4. We test what happens to perceived author
demographics when we translate texts with three different services. We use classi-
fiers based on standard TF-IDF (c.f. Box 9.1) with logistic regression (c.f. Box 9.2)
and based on the recent BERT model (Devlin et al. 2019) (c.f. Box 9.3) to measure
the degree of misrepresentation.
2 n-grams are sequences of n consecutive words in a text. In NLP, they are frequently used linguistic features, and the most common n-grams are bi-grams and tri-grams.
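The n-gram features defined in the footnote, and the TF-IDF weighting used by the classifiers described above, can be sketched as follows. This uses one common smoothed-IDF formulation (conventions vary between implementations), and the toy documents are invented for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list (bi-grams for n=2, etc.)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(docs):
    """Per-document TF-IDF weights with a smoothed IDF: log((1+N)/(1+df))."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log((1 + n_docs) / (1 + df[term]))
            for term, count in tf.items()
        })
    return weights

# Toy documents: unigram features plus bi-grams.
docs = [s.split() for s in ("she said good morning", "he said good night")]
features = [doc + ngrams(doc, 2) for doc in docs]
w = tfidf(features)
# "said" and "good" appear in both documents, so their IDF is
# log(3/3) = 0 and they carry no weight; "she"/"he" remain informative.
```

Vectors of this kind are the input to the logistic regression classifiers; terms shared by all author groups are down-weighted, while group-specific vocabulary dominates the prediction.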
The most recent and prominent neural architecture to date is the Transformer model, introduced for MT (Vaswani et al. 2017), which has subsequently shaped algorithms in NLP (and progressively replaced RNNs).
MT is the NLP problem with the most comprehensive adoption in industry, as
companies need to translate their documents. This global need for translation
underscores the importance of MT. In the following sections, we will consider
three popular translation systems:
• Google Translate: a service provided by Google.3
• Bing Translator: a service provided by Microsoft.4
• DeepL: a service provided by the homonymous company.5 To date, DeepL is often favoured by translators for the quality of its translations.6
Given the essential business aspect of machine-translated texts, all three translation services come with both free and premium versions (where the main difference is the maximum number of translations a user can run, and the availability of some other features for professional use, such as the possibility of being integrated into spreadsheets).
There is now a growing literature on various types of bias present in the NLP field.
There is indeed a lot of interest in exploring the harm that modern systems can
perpetuate when used in real settings (Stafanovičs et al. 2020; Gehman et al. 2020;
Nozza et al. 2021; Nozza 2021). Blodgett et al. (2020) review 146 papers (about
written text only, excluding spoken language) focusing on “bias” in NLP systems.
However, they argue that such growth is not directed, as the studies’ motivations are
“often vague, inconsistent, and lacking in normative reasoning” (page 5455). In
particular, they point out the frequent lack of a clear definition of “bias” and
engagement with the relevant literature outside of NLP. They provide some recom-
mendations on improving the problem of bias definition and, therefore, the proposed
methods’ effectiveness.
Shah et al. (2020) offer another review of the literature with that goal in mind.
They identify four origins of biases in creating NLP systems, formalize them
mathematically, and show countermeasures for each type. With respect to MT, the
most frequent sources of bias are selection biases, semantic biases and
overamplification (the fourth bias they found, label bias, is not applicable in MT).
These biases play a role in selecting the training corpus, in the use of word
3 https://translate.google.com/.
4 https://www.bing.com/translator.
5 https://www.deepl.com/en/translator.
6 https://www.deepl.com/quality.html.
representations, and the models’ optimization, respectively. The authors show how
training corpora and word representations, i.e., word embeddings, can incorporate a
distorted image of society, potentially harmful to some socio-demographic categories,
especially those characterized by the frequent presence of biased and stereotyped
attributes in training corpora. The authors also note that NLP models themselves can
reproduce probability distributions that do not reflect the training data.
Work by Mirkin et al. (2015) and Mirkin and Meunier (2015) has set the stage for considering the impact of demographic variation (Hovy et al. 2015) and its integration in MT. Rescigno et al. (2020) report statistics on the performance of machine translation tools in translating gender-related words. More recent research has suggested
that MT systems reflect cultural and societal biases (Stanovsky et al. 2019; Escudé
Font and Costa-jussà 2019), though primarily focusing on data selection and embed-
dings as sources. Vanmassenhove et al. (2021) found effects of bias amplification in
MT, with “loss of lexical and morphological richness” (page 2203). Bentivogli et al.
(2020) address the gender de-biasing problem in MT, using multi-modal data that
includes the speakers’ voice.
Zhao et al. (2018) show that downstream tasks inherit gender biases from the
contextualized word embeddings used. Manzini et al. (2019) propose methods for
multi-class debiasing of word embeddings. Escudé Font and Costa-jussà (2019)
propose one of the first approaches to debiasing MT. The authors focus on the use of
debiased word embeddings (Bolukbasi et al. 2016) and gender-neutral word embed-
dings (Zhao et al. 2017) to provide better support for neural MT pipelines. However,
subsequent research by Gonen and Goldberg (2019) has called into question whether
these techniques actually address the underlying bias, or rather “put lipstick on a
pig”. That is, they mask some of the symptoms, but the biases resurface when the embeddings are used in downstream applications.
The work by Zmigrod et al. (2019) and Zhao et al. (2018) addresses the problem
of gender biases by training models with balanced data sets. In contrast, given the
difficulty in building such data sets, Saunders and Byrne (2020) reduce the biases in
Neural Machine Translation (NMT) systems by fine-tuning rather than re-training,
treating the debiasing procedure as a domain adaptation task. Similarly, Michel and
Neubig (2018) propose to take into consideration speakers’ attributes in NMT.
Vanmassenhove et al. (2018) propose to reduce gender biases in NMT systems by
integrating gender information as an additional feature. Saunders et al. (2020),
however, point out that such an approach needs improvement. For example, it
does not account for multiple referents in the same sentence, overgeneralizing the
predicted gender to all of them.
Niu et al. (2018) address the broad problem of translating texts from one language
to another, preserving not only the content, but also their stylistic features.
Stafanovičs et al. (2020) underline that one of the problems is that there is often
not enough contextual information to choose a translation. The translation system
may then fall back on the most frequent case, which is often the most stereotypical
one. The authors illustrate this with the example The secretary asked for details,
where there is not enough information to translate the term secretary with the
correct gender.
9 Gender and Age Bias in Commercial Machine Translation 165

9.4 Experiments
In this section, we describe our experiments to demonstrate gender and age bias in
commercial MT. Owing to data availability, we only cover a binary gender distinction
here. This should not be read as a normative statement, but as a limitation of the data.
9.4.1 Method
Suppose we have access to a prediction model for the authors’ demographic aspects
(i.e., gender or age) for texts in all the languages and their respective translations. To
assess the demographic profile of a text, we train two separate aspect classifiers for
each language, namely age and gender. These classifiers allow us to compare the
predicted profiles in the original language with the predicted profiles of the transla-
tion, and compare both to the actual demographics of the test data.
We can then compare the predicted distribution, P, of an aspect with the actual
distribution, Q (following Shah et al. 2020). To compare the distributions, we
use the Kullback-Leibler (KL) divergence, a divergence measure between probability distributions.7
7 Note that the KL is a divergence rather than a distance measure, because it is not symmetric: KL(P|Q) ≠ KL(Q|P). This difference is not important for our objective, but it is important to remember. Moreover, the KL divergence has no upper bound, while its lower bound is 0 (i.e., equal distributions).
166 F. Bianchi et al.
KL(P|Q) = Σᵢ Pᵢ log₂ (Pᵢ / Qᵢ)
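The KL computation above can be sketched in a few lines of Python (a minimal illustration; the function and variable names are ours):

```python
import math

def kl_divergence(p, q):
    """KL(P|Q) = sum_i P_i * log2(P_i / Q_i) for two distributions.

    Here p could be a predicted gender or age distribution and q the
    gold one. Terms with P_i = 0 contribute nothing to the sum.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# E.g., a predicted female:male split of 37:63 against a balanced gold
# split of 50:50 yields a small positive divergence, while identical
# distributions yield exactly 0.
pred, gold = [0.37, 0.63], [0.5, 0.5]
divergence = kl_divergence(pred, gold)
```

As the footnote above points out, the measure is asymmetric (swapping the arguments changes the result) and unbounded above.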
Therefore, we define the notion of bias for this paper as the divergence between
the real and predicted demographic variables, comparing both original and translated
texts. In particular, given the predictions over demographic aspects on the text in the
original language and the text in the translated language, we describe two events.
1. If there is a classifier bias, both the predictions based on the original language and
the predictions based on the translations should be skewed in the same
direction, e.g., both predicting a higher rate of male authors. Note that we
are not interested in the actual predictive performance of the classifiers, but
just in how similar the prediction rates are between the original and the translated
text. As explained above, we can measure the classifier bias by computing the KL
divergence of the predicted distribution from the true sample distribution.
2. If instead there is a translation bias, then the translated predictions should exhibit
a stronger skew than the predictions based on the original language. E.g., the
gender distribution in the original language (which we control) should be closer
to uniform than the gender ratio in the translation. Translation bias is the main
target of investigation of this paper.
To give a high-level view, Fig. 9.1 shows how a translation bias would
manifest: the KL is high, because the two distributions are very different. In the
case of classifier bias, by contrast, the predicted distributions of the two classifiers
should look like those in Fig. 9.2, where the KL is low, because the
two predicted distributions are similar. Note that in practice, both types of bias are
likely to be present in our results.
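The two events above can be operationalized as a simple decision rule. The sketch below is our own illustrative construction (the `tol` threshold in particular is an assumption, not part of the chapter's method):

```python
import math

def kl(p, q):
    """KL(P|Q) in bits, as defined above."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def diagnose_bias(gold, pred_orig, pred_trans, tol=0.005):
    """Label the two events described in the text (tol is our threshold).

    Classifier bias: predictions on the original already diverge from
    gold, and predictions on the translation diverge in the same way
    (their mutual divergence is small).
    Translation bias: predictions on the translation are skewed
    noticeably further from gold than those on the original.
    """
    labels = []
    if kl(pred_orig, gold) > tol and kl(pred_trans, pred_orig) <= tol:
        labels.append("classifier bias")
    if kl(pred_trans, gold) > kl(pred_orig, gold) + tol:
        labels.append("translation bias")
    return labels or ["no marked bias"]
```

For instance, a near-gold prediction on the original paired with a strongly skewed prediction on the translation triggers the "translation bias" label, while two similarly skewed predictions trigger "classifier bias"; in practice, as noted above, both can co-occur.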
By using both translations from and into English, we can further tease apart the
direction of this effect. Figure 9.3 summarizes the idea behind our experiment, here
considering gender as the demographic aspect.
Fig. 9.3 The experimental methodology we adopt in this chapter. For example, we have a French
and an English gender classifier, each trained on monolingual corpora balanced for gender in the
respective language, i.e., French and English. We also have two possible translation pairs: French
original texts paired with their English translations, and English original texts paired with their
French translations. Our two classifiers are not perfect, but, if there is no translation bias, we expect
both predicted gender distributions to be similar to each other. On the other hand, if there is a
translation bias, the predicted distributions of the two classifiers should be different (e.g., the
English classifier predicting more females than males). The divergence between the two distribu-
tions gives a measure of the translation bias, i.e., the difficulty MT systems have in
preserving the authors’ gender identity from one language to the other
Moreover, to see whether the predictions differ statistically significantly from the
original, we use a χ² contingency test and report significance at p ≤ 0.05 and
p ≤ 0.01.
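For a 2×2 contingency table (e.g., gold vs. predicted female/male counts), the test reduces to a χ² statistic with one degree of freedom, whose p-value can be computed with the complementary error function. The stdlib-only helper below is our own sketch, not the implementation used for the chapter's results (in practice one would typically use scipy.stats.chi2_contingency):

```python
import math

def chi2_2x2(table):
    """Pearson chi-squared test for a 2x2 contingency table.

    table = [[a, b], [c, d]], e.g., gold vs. predicted female/male
    counts. With one degree of freedom, p = erfc(sqrt(chi2 / 2)).
    No Yates continuity correction is applied.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, math.erfc(math.sqrt(stat / 2))

# Gold 200:200 vs. a hypothetical predicted 148:252 out of 400 reviews.
stat, p = chi2_2x2([[200, 200], [148, 252]])
```

A perfectly balanced table yields a statistic of 0 and p = 1; the skewed example above is significant well below p ≤ 0.01.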
9.4.2 Data
Our starting point is the Trustpilot data set by Hovy et al. (2015). Trustpilot is a
review website, founded in Denmark in 2007, that provides users a platform to review
companies and services. For example, a satisfied customer can express a positive
opinion on their favorite online shoe seller. Users can also voluntarily report their
age and their (binary) gender identity, something that makes this data particularly
interesting for our use case. We use this self-reported socio-demographic informa-
tion as ground truth.
The data set contains reviews in different languages, but here we focus on
English, German, Italian, French, and Dutch. This selection follows two criteria we
established as requirements for our experiment.
First of all, to provide a fair comparison, we need languages that can be covered by
all translation systems. Secondly, we need to be able to collect reviews that are
demographically representative samples of the language and country (the reviews
are from Germany, Italy, France, and the Netherlands, respectively). For the English
data, we use US reviews, rather than UK reviews, based on the general prevalence of
this variety in translation engines.
For each language, we restrict ourselves to reviews written in the respective lan-
guage (according to langid,8 Lui and Baldwin 2012) that have both age and gender
information. We use the CIA factbook9 data on age pyramids to sample 200 instances
each for male and female users. We use the four age groups given in the factbook,
i.e., 15–24, 25–54, 55–64, and 65+, and sample instances proportionally to the age
pyramid from each group. Based on data sparsity in the Trustpilot data, we do not
include the under-15 age group. This sampling procedure results in five test sets (one
for each language) of about 400 instances each (the exact numbers vary slightly due
to rounding and the exact proportions in the CIA factbook data), balanced for binary
gender. The exception is Italian, where the original data is so heavily skewed
towards male-written reviews that we only achieve a 48:52 gender ratio even with
downsampling.
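The sampling procedure can be sketched as follows. The pyramid shares below are made-up placeholders, not the actual CIA factbook figures, and the function is our own illustration:

```python
import random

# Hypothetical age-pyramid shares for one country (placeholder values,
# not the actual CIA factbook figures), restricted to the four groups
# used in the chapter.
PYRAMID = {"15-24": 0.16, "25-54": 0.52, "55-64": 0.16, "65+": 0.16}

def sample_by_pyramid(reviews_by_group, per_gender=200, seed=0):
    """Sample review ids proportionally to the age pyramid.

    reviews_by_group maps an age group to the list of available review
    ids for one gender; we draw round(per_gender * share) from each
    group, capped by availability (which causes the slight variation
    in test-set sizes mentioned in the text).
    """
    rng = random.Random(seed)
    sample = []
    for group, share in PYRAMID.items():
        pool = reviews_by_group.get(group, [])
        k = min(round(per_gender * share), len(pool))
        sample.extend(rng.sample(pool, k))
    return sample
```

Running this once per gender and per language yields the roughly 400-instance test sets described above.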
We then translate all non-English test sets into English. We will refer to this set of
data as the Into English data. We also translate the English test set into all other
languages, which we will refer to as From English. We use three commercially
available MT tools: Bing, DeepL, and Google Translate.
8 https://github.com/saffsd/langid.py.
9 https://www.cia.gov/library/publications/the-world-factbook/.
We use all review instances that are not part of any test set to create training data for
the respective age and gender classifiers (see Sect. 9.4.3). Since we want to compare
fairly across languages, the training data sets need to be of comparable size. We are
therefore bounded by the size of the smallest available subset (Italian). We sample
about 2500 instances per gender in each language, again according to the respective
age distributions. This sampling results in about 5000 instances per language (again,
the exact number varies slightly based on the availability of samples for each group
and rounding). We again subsample to approximate the actual age and gender
distribution, since, according to Hovy et al. (2015), the data skews strongly male
while otherwise closely matching the official age distributions.
9.4.3 Classifiers
We use simple Logistic Regression (Pedregosa et al. 2011) models with L2 regular-
ization over TF-IDF based 2–6 character-grams, and regularization optimized via
tenfold cross-validation.
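As an illustration of the featurization, here is a stdlib-only sketch of character 2–6-gram extraction with a toy TF-IDF weighting (the chapter's actual models use scikit-learn; this miniature version, with sklearn-style smoothed idf, only shows the idea):

```python
import math
from collections import Counter

def char_ngrams(text, n_min=2, n_max=6):
    """All character n-grams of length n_min..n_max, in order."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

def tfidf(docs):
    """Toy TF-IDF over character n-grams (sklearn-style smoothed idf)."""
    tfs = [Counter(char_ngrams(d)) for d in docs]
    df = Counter(g for tf in tfs for g in tf)
    n_docs = len(docs)
    return [
        {g: c * (math.log((1 + n_docs) / (1 + df[g])) + 1)
         for g, c in tf.items()}
        for tf in tfs
    ]
```

Character n-grams of this kind capture sub-word cues (morphology, spelling habits) that are informative for demographic prediction; n-grams shared by all documents receive the lowest weights.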
Furthermore, we also use the recent BERT-based neural language model archi-
tecture (Devlin et al. 2019; Liu et al. 2019). BERT has obtained convincing results
on many different NLP tasks and across different languages (Nozza et al. 2020).
Essentially, BERT is a transformer-based neural network (Vaswani et al. 2017)
trained on a large amount of text. Its network can be used as a starting point to
create new classifiers through fine-tuning (i.e., adapting the weights of BERT to a
new task). The same authors also released a multilingual version of BERT
(mBERT). mBERT10 supports over 100 languages, including Arabic, Dutch, and
Portuguese. BERT and mBERT have obtained strong results in many
applications that involved monolingual and multilingual tasks (Yang et al. 2019;
Zhu et al. 2020; Bianchi et al. 2021c; Lamprinidis et al. 2021).
Given those initial results on English and multi-lingual data, researchers have
retrained BERT for specific languages, with similar (or better) performance (Nguyen
and Tuan Nguyen 2020; Antoun et al. 2020; Bianchi et al. 2021a; Martin et al. 2020;
de Vries et al. 2019, inter alia). In our experimental configuration we use the
following language-specific BERT (LS-BERT11) models:
• French: CamemBERT (Martin et al. 2020);
• German: GermanBERT;12
10 https://github.com/google-research/bert/blob/master/multilingual.md.
11 We will use LS-BERT to refer to the corresponding language-specific BERT applicable to the language being discussed.
12 https://github.com/dbmdz/berts.
• Italian: GilBERTo;13
• Dutch: BERTje (de Vries et al. 2019);
• English: RoBERTa (Liu et al. 2019).
Our classifiers cover a range of training domains with different levels of specificity:
the logistic regression models are trained and tested on our corpus; the language-specific
BERT models were pretrained on general language-specific corpora and then
fine-tuned on our corpus; finally, we fine-tuned a multilingual BERT model
(pretrained on a large multilingual Wikipedia corpus) on our corpus.
The numbers in Table 9.1 indicate that both age and gender can be inferred
reasonably well across all of the languages with all the classifiers. We use these
classifiers in the following analyses. LR, LS-BERT, and mBERT all achieve
comparable results, suggesting a performance ceiling on this task. Since our corpus
and task are new, there is no previous literature to compare our results against;
however, logistic regression is a strong baseline for most text classification tasks.
For each non-English sample, we predict the age and gender of the author in both
the original language and in each of the three English translations (Google, Bing, and
DeepL). I.e., we use the respective language’s classifier described above (e.g., a
classifier trained on German to predict German test data). We use the English
classifiers described above for the translations from other languages into
English. E.g., we use the age and gender classifier trained on English data to predict
the English translations from the German test set.
For the English data, we first translate the texts into each of the other languages,
using each of the three translation systems. Then we again predict the author
demographics in the original English test set (using the classifier trained on English),
as well as in each of the translated versions (using the classifier trained on the
respective language). E.g., we create a German, French, Italian, and Dutch transla-
tion with each Google, Bing, and DeepL, and classify both the original English and
the translation.
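The overall prediction workflow can be summarized with stub functions standing in for the MT engines and demographic classifiers. Both `classify` and `translate` below are placeholders we invented for illustration, not real APIs:

```python
SYSTEMS = ["Google", "Bing", "DeepL"]

def predict_profiles(test_set, src_lang, classify, translate):
    """Predict author demographics on originals and their translations.

    test_set: list of review texts in src_lang.
    classify(lang, text): a per-language demographic classifier (stub).
    translate(system, src, tgt, text): an MT engine wrapper (stub).
    Returns predictions on the originals plus, per MT system, on the
    English translations (mirroring the Into English setup; the From
    English setup swaps the direction).
    """
    preds = {"orig": [classify(src_lang, t) for t in test_set]}
    for system in SYSTEMS:
        translated = [translate(system, src_lang, "en", t) for t in test_set]
        preds[system] = [classify("en", t) for t in translated]
    return preds
```

Comparing the label distributions of `preds["orig"]` against each `preds[system]` then yields the splits and KL divergences reported in the tables below.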
13 https://github.com/idb-ita/GilBERTo.
Table 9.2 Gender split (%) and KL divergence from gold for each language when translated into
English and classified with LogReg
From   Gold    Org. lang       Google          Bing            DeepL
       F:M     F:M     KL      F:M     KL      F:M     KL      F:M     KL
de     50:50   48:52   0.001   37:63   0.034   35:65   0.045   35:65   0.045
fr     50:50   47:53   0.002   49:51   0.000   48:52   0.001   49:51   0.000
it     48:52   47:53   0.000   37:63   0.026   43:57   0.006   36:64   0.033
nl     50:50   49:51   0.000   47:53   0.001   47:53   0.002   44:56   0.007
avg                    0.000           0.015           0.013           0.021
** Split differs significantly from gold split at p ≤ 0.01
Table 9.3 Gender split (%) and KL divergence from gold for each language when translated into
English and classified with LS-BERT
From   Gold    Org. lang       Google          Bing            DeepL
       F:M     F:M     KL      F:M     KL      F:M     KL      F:M     KL
de     50:50   51:49   0.000   36:64   0.042   39:61   0.022   39:61   0.023
fr     50:50   51:49   0.000   40:60   0.020   44:56   0.007   47:53   0.002
it     48:52   31:69   0.065   35:65   0.035   37:63   0.027   38:62   0.020
nl     50:50   48:52   0.000   46:54   0.003   41:59   0.015   43:57   0.008
avg                    0.016           0.025           0.018           0.013
* Split differs significantly from gold split at p ≤ 0.05
Table 9.4 Gender split (%) and KL divergence from gold for each language when translated into
English and classified with mBERT
From   Gold    Org. lang       Google          Bing            DeepL
       F:M     F:M     KL      F:M     KL      F:M     KL      F:M     KL
de     50:50   51:49   0.000   41:59   0.015   40:60   0.021   41:59   0.016
fr     50:50   51:49   0.000   45:55   0.005   47:53   0.002   47:53   0.002
it     48:52   31:69   0.065   40:60   0.014   44:56   0.005   44:56   0.004
nl     50:50   48:52   0.000   38:62   0.028   39:61   0.023   42:58   0.014
avg                    0.016           0.016           0.013           0.009
* Split differs significantly from gold split at p ≤ 0.05
Tables 9.2, 9.3, and 9.4 show the results when translating into English. For each
language, the tables show the test gender ratio, the predicted ratio from classifiers
trained in the same language, and the predicted ratios for the three translation
engines, each with its KL divergence from the ratio in the test set.
Using Logistic Regression as a classifier (Table 9.2), we can observe that, for most
languages, there already exists a classifier bias in the gender predictions of that
language, skewed towards male. However, the translated English versions create an
even stronger skew. The notable exception is French, which most translation engines
render in a demographically accurate manner. Dutch is slightly worse, followed by
Italian (note, though, that the Italian data was so heavily imbalanced that we could
not sample an even distribution for the test data in the first place). The gender skew is
most substantial for German, swinging by as much as 15 percentage points.
The results obtained with the language-specific BERT classifiers (Table 9.3) are
very similar across languages and translation engines. The only notable
difference is the gender prediction distribution for Italian, which is more skewed
towards the male class compared to Table 9.2. In particular, this imbalanced gender
prediction is present both in the translated English version and the source Italian one.
Table 9.5 Gender split (%) and KL divergence from gold for each language when translated from
English and classified with LogReg
Gold    English         To    Google          Bing            DeepL
F:M     F:M     KL            F:M     KL      F:M     KL      F:M     KL
50:50   49:51   0.000   de    59:41   0.015   58:42   0.013   58:42   0.011
                        fr    49:51   0.000   52:48   0.001   54:46   0.003
                        it    45:55   0.004   44:56   0.007   41:59   0.016
                        nl    40:60   0.020   43:57   0.010   40:60   0.019
                        avg           0.010           0.008           0.012
* Split differs significantly from gold split at p ≤ 0.05
Table 9.6 Gender split (%) and KL divergence from gold for each language when translated from
English and classified with LS-BERT
Gold    English         To    Google          Bing            DeepL
F:M     F:M     KL            F:M     KL      F:M     KL      F:M     KL
50:50   49:51   0.000   de    64:36   0.040   56:44   0.007   62:38   0.031
                        fr    45:55   0.005   40:60   0.021   48:52   0.001
                        it    35:65   0.050   34:66   0.057   39:61   0.026
                        nl    49:51   0.000   40:60   0.022   46:54   0.003
                        avg           0.024           0.027           0.015
* Split differs significantly from gold split at p ≤ 0.05
Table 9.7 Gender split (%) and KL divergence from gold for each language when translated from
English and classified with mBERT
Gold    English         To    Google          Bing            DeepL
F:M     F:M     KL            F:M     KL      F:M     KL      F:M     KL
50:50   49:51   0.000   de    47:53   0.001   48:52   0.001   49:51   0.000
                        fr    42:58   0.014   39:61   0.023   46:54   0.003
                        it    21:79   0.201   20:80   0.218   21:79   0.214
                        nl    35:65   0.045   33:67   0.059   37:63   0.033
                        avg           0.065           0.075           0.062
* Split differs significantly from gold split at p ≤ 0.05
Tables 9.5, 9.6, and 9.7 show the results when translating from English into the
various languages. The format is the same as for Tables 9.2, 9.3, and 9.4.
Again we see large swings, usually shifting the balance further towards men.
However, translating into German with all systems produces estimates that are
perceived as substantially more female than the original data. This result could be
the inverse of the effect we observed above.
We observe little change for French in both tables, though the gender prediction
with logistic regression shows some female bias for two of the MT systems.
Fig. 9.4 Density distribution and KL for age prediction in various languages and different systems,
in the original and when translated into English, classified with LogReg. Solid yellow line = true
distribution. Markers indicate predicted distributions that differ significantly from the gold
distribution at p ≤ 0.05 or p ≤ 0.01
Similar to the previous results, when using BERT models pretrained on Italian
(GilBERTo and mBERT), the gender prediction is significantly skewed towards the male class.
Fig. 9.5 Density distribution and KL for age prediction in various languages and different systems,
in the original and when translated into English, classified with LS-BERT. Solid yellow line = true
distribution. Markers indicate predicted distributions that differ significantly from the gold
distribution at p ≤ 0.05 or p ≤ 0.01
Figures 9.4, 9.5, and 9.6 show the kernel density plots for the four age groups in each
language (rows), both for same-language prediction and for the English translation.
The distributions are reasonably close, but the predictions overestimate the
most prevalent class in all cases. This effect is even clearer when predicting age by decade.
Fig. 9.6 Density distribution and KL for age prediction in various languages and different systems,
in the original and when translated into English, classified with mBERT. Solid yellow line = true
distribution. Markers indicate predicted distributions that differ significantly from the gold
distribution at p ≤ 0.05 or p ≤ 0.01
Fig. 9.7 Density distribution and KL for decade prediction in various languages and different
systems, in the original and when translated into English, classified with LogReg. Solid yellow
line = true distribution. Markers indicate predicted distributions that differ significantly from the
gold distribution at p ≤ 0.05 or p ≤ 0.01
The decade distributions still follow the true demographic distribution, since we
subsample within the larger classes given by the CIA factbook.
However, the results still strongly suggest that the observed mismatch is driven
predominantly by the overprediction of the 50s decade for logistic regression and
70s decade for language-specific and multilingual BERT models. Because these
Fig. 9.8 Density distribution and KL for decade prediction in various languages and different
systems, in the original and when translated into English, classified with LS-BERT. Solid yellow
line = true distribution. Markers indicate predicted distributions that differ significantly from the
gold distribution at p ≤ 0.05 or p ≤ 0.01
decades often contributed strongly to the most frequent age categories (25–54 and
65+), predictions did not differ as much from gold in the previous test.
In essence, English translations of all these languages, irrespective of the MT
system, seem to be produced by authors much older than they actually are
(Fig. 9.9).
Fig. 9.9 Density distribution and KL for decade prediction in various languages and different
systems, in the original and when translated into English, classified with mBERT. Solid yellow
line = true distribution. Markers indicate predicted distributions that differ significantly from the
gold distribution at p ≤ 0.05 or p ≤ 0.01
All three commercial MT systems we tested achieve similar results in our
experiments. However, they also seem to exhibit the same systematic translation
biases, most likely due to biased training data. The fact that translations into
English are perceived as older and more male than translations into other languages
could indicate a more extensive collection of unevenly sampled data for English
than for other languages.
9.5 Discussion
In this chapter, we have demonstrated the existence of gender and age bias in
MT. We have shown that commercial systems make translated text seem as if it
were produced by authors who are more male and older than they actually are.
While similar findings in the literature corroborate these results, we are the first to
provide a quantitative analysis of three different commercial MT tools. We expect
this kind of problem to receive more attention in the future, since such biases can
affect communication quality. Ultimately, our findings contribute to a growing body
of research indicating that language carries more than just information content: it
includes important social aspects as well. By giving those aspects more consideration,
we can push the frontier of MT towards stylistic aspects. By ignoring them, we
perpetuate the status quo, resulting in uneven user experiences and translations that
capture only half of what they should.
References
Antoun W, Baly F, Hajj H (2020) AraBERT: Transformer-based model for Arabic language
understanding. In: Proceedings of the 4th workshop on open-source arabic corpora and
processing tools, with a shared task on offensive language detection, Marseille, France.
European Language Resource Association, pp 9–15
Bentivogli L, Savoldi B, Negri M, Di Gangi MA, Cattoni R, Turchi M (2020) Gender in danger?
Evaluating speech translation technology on the MuST-SHE corpus. In: Proceedings of the 58th
annual meeting of the association for computational linguistics, Online. Association for Com-
putational Linguistics, pp 6923–6933
Bianchi F, Hovy D (2021) On the gap between adoption and understanding in NLP. In: Findings of
the Association for Computational Linguistics: ACL-IJCNLP 2021, Online. Association for
Computational Linguistics, pp 3895–3901
Bianchi F, Nozza D, Hovy D (2021a) FEEL-IT: Emotion and sentiment classification for the Italian
language. In: Proceedings of the eleventh workshop on computational approaches to subjectiv-
ity, sentiment and social media analysis, Online. Association for Computational Linguistics, pp
76–83
Bianchi F, Nozza D, Hovy D (2021b) Language invariant properties in natural language processing.
Preprint. arXiv:2109.13037
Bianchi F, Terragni S, Hovy D, Nozza D, Fersini E (2021c) Cross-lingual contextualized topic
models with zero-shot learning. In: Proceedings of the 16th conference of the European chapter
of the Association for Computational Linguistics: main volume, Online. Association for Com-
putational Linguistics, pp 1676–1683
Blodgett SL, Barocas S, Daumé III H, Wallach H (2020) Language (technology) is power: A critical
survey of “bias” in NLP. In: Proceedings of the 58th annual meeting of the Association for
Computational Linguistics, Online. Association for Computational Linguistics, pp 5454–5476
Bolukbasi T, Chang K, Zou JY, Saligrama V, Kalai AT (2016) Man is to computer programmer as
woman is to homemaker? Debiasing word embeddings. In: Lee DD, Sugiyama M, von
Luxburg U, Guyon I, Garnett R (eds) Advances in neural information processing systems 29:
annual conference on neural information processing systems 2016, December 5–10, 2016,
Barcelona, Spain, pp 4349–4357
Dabre R, Chu C, Kunchukuttan A (2020) A survey of multilingual neural machine translation.
ACM Comput Surv 53(5):1
de Vries W, van Cranenburgh A, Bisazza A, Caselli T, van Noord G, Nissim M (2019) BERTje: A
Dutch BERT model. Preprint. arXiv:1912.09582
Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional
transformers for language understanding. In: Proceedings of the 2019 conference of the North
American chapter of the Association for Computational Linguistics: human language technol-
ogies, volume 1 (Long and Short Papers), Minneapolis, Minnesota. Association for Computa-
tional Linguistics, pp 4171–4186
Escudé Font J, Costa-jussà MR (2019) Equalizing gender bias in neural machine translation with
word embeddings techniques. In: Proceedings of the first workshop on gender bias in natural
language processing, Florence, Italy. Association for Computational Linguistics, pp 147–154
Flek L (2020) Returning the N to NLP: Towards contextually personalized classification models. In:
Proceedings of the 58th annual meeting of the Association for Computational Linguistics,
Online. Association for Computational Linguistics, pp 7828–7838
Gehman S, Gururangan S, Sap M, Choi Y, Smith NA (2020) RealToxicityPrompts: Evaluating
neural toxic degeneration in language models. In: Findings of the Association for Computa-
tional Linguistics: EMNLP 2020, Online. Association for Computational Linguistics, pp
3356–3369
Gonen H, Goldberg Y (2019) Lipstick on a pig: Debiasing methods cover up systematic gender
biases in word embeddings but do not remove them. In: Proceedings of the 2019 conference of
the North American chapter of the Association for Computational Linguistics: human language
technologies, volume 1 (Long and Short Papers), Minneapolis, Minnesota. Association for
Computational Linguistics, pp 609–614
Hovy D, Yang D (2021) The importance of modeling social factors of language: Theory and
practice. In: Proceedings of the 2021 conference of the North American chapter of the Associ-
ation for Computational Linguistics: human language technologies, Online. Association for
Computational Linguistics, pp 588–602
Hovy D, Johannsen A, Søgaard A (2015) User review sites as a resource for large-scale sociolin-
guistic studies. In: Gangemi A, Leonardi S, Panconesi A (eds) Proceedings of the 24th
international conference on world wide web, WWW 2015, Florence, Italy, May 18–22, 2015.
ACM, pp 452–461
Hovy D, Bianchi F, Fornaciari T (2020) “You sound just like your father”: Commercial machine
translation systems include stylistic biases. In: Proceedings of the 58th annual meeting of the
Association for Computational Linguistics, Online. Association for Computational Linguistics,
pp 1686–1690
Johannsen A, Hovy D, Søgaard A (2015) Cross-lingual syntactic variation over age and gender. In:
Proceedings of the nineteenth conference on computational natural language learning, Beijing,
China. Association for Computational Linguistics, pp 103–112
Lamprinidis S, Bianchi F, Hardt D, Hovy D (2021) Universal joy a data set and results for
classifying emotions across languages. In: Proceedings of the eleventh workshop on computa-
tional approaches to subjectivity, sentiment and social media analysis, Online. Association for
Computational Linguistics, pp 62–75
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V
(2019) RoBERTa: A robustly optimized BERT pretraining approach. Preprint. arXiv:1907.11692
Liu X, Duh K, Liu L, Gao J (2020) Very deep transformers for neural machine translation. Preprint.
arXiv:2008.07772
Lui M, Baldwin T (2012) langid.py: An off-the-shelf language identification tool. In: Proceedings
of the ACL 2012 system demonstrations, Jeju Island, Korea. Association for Computational
Linguistics, pp 25–30
Manzini T, Yao Chong L, Black AW, Tsvetkov Y (2019) Black is to criminal as caucasian is to
police: Detecting and removing multiclass bias in word embeddings. In: Proceedings of the
2019 conference of the North American chapter of the Association for Computational Linguis-
tics: human language technologies, volume 1 (Long and Short Papers), Minneapolis, Minnesota.
Association for Computational Linguistics, pp 615–621
Martin L, Muller B, Ortiz Suárez PJ, Dupont Y, Romary L, de la Clergerie É, Seddah D, Sagot B
(2020) CamemBERT: a tasty French language model. In: Proceedings of the 58th annual
meeting of the Association for Computational Linguistics, Online. Association for Computa-
tional Linguistics, pp 7203–7219
Michel P, Neubig G (2018) Extreme adaptation for personalized neural machine translation. In:
Proceedings of the 56th annual meeting of the Association for Computational Linguistics
(Volume 2: Short Papers), Melbourne, Australia. Association for Computational Linguistics,
pp 312–318
Mirkin S, Meunier JL (2015) Personalized machine translation: Predicting translational
preferences. In: Proceedings of the 2015 conference on empirical methods in natural language
processing, Lisbon, Portugal. Association for Computational Linguistics, pp 2019–2025
Mirkin S, Nowson S, Brun C, Perez J (2015) Motivating personality-aware machine translation. In:
Proceedings of the 2015 conference on empirical methods in natural language processing,
Lisbon, Portugal. Association for Computational Linguistics, pp 1102–1108
Nguyen DQ, Tuan Nguyen A (2020) PhoBERT: Pre-trained language models for Vietnamese. In:
Findings of the Association for Computational Linguistics: EMNLP 2020, Online. Association
for Computational Linguistics, pp 1037–1042
Niu X, Rao S, Carpuat M (2018) Multi-task neural models for translating between styles within and
across languages. In: Proceedings of the 27th international conference on computational lin-
guistics, Santa Fe, New Mexico, USA. Association for Computational Linguistics, pp
1008–1021
Nozza D (2021) Exposing the limits of zero-shot cross-lingual hate speech detection. In: Pro-
ceedings of the 59th annual meeting of the Association for Computational Linguistics and the
11th international joint conference on natural language processing (Volume 2: Short Papers),
Online. Association for Computational Linguistics, pp 907–914
Nozza D, Bianchi F, Hovy D (2020) What the [MASK]? Making sense of language-specific BERT
models. Preprint. arXiv:2003.02912
Nozza D, Bianchi F, Hovy D (2021) HONEST: Measuring hurtful sentence completion in language
models. In: Proceedings of the 2021 conference of the North American chapter of the Associ-
ation for Computational Linguistics: human language technologies, Online. Association for
Computational Linguistics, pp 2398–2406
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel, O, Blondel M, Prettenhofer P,
Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay
E (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830
Prabhumoye S, Boldt B, Salakhutdinov R, Black AW (2021) Case study: Deontological ethics in
NLP. In: Proceedings of the 2021 conference of the North American chapter of the Association
for Computational Linguistics: human language technologies, Online. Association for Compu-
tational Linguistics, pp 3784–3798
Prates MO, Avelar PH, Lamb LC (2019) Assessing gender bias in machine translation: a case study
with Google Translate. Neural Comput Appl:1–19
Rabinovich E, Patel RN, Mirkin S, Specia L, Wintner S (2017) Personalized machine translation:
Preserving original author traits. In: Proceedings of the 15th conference of the European chapter
of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain.
Association for Computational Linguistics, pp 1074–1084
9 Gender and Age Bias in Commercial Machine Translation 183
Rescigno AA, Monti J, Way A, Vanmassenhove E (2020) A case study of natural gender
phenomena in translation: A comparison of Google Translate, Bing Microsoft Translator and
DeepL for English to Italian, French and Spanish. In: Workshop on the impact of machine
translation (iMpacT 2020), Virtual. Association for Machine Translation in the Americas, pp
62–90
Saunders D, Byrne B (2020) Reducing gender bias in neural machine translation as a domain
adaptation problem. In: Proceedings of the 58th annual meeting of the Association for Compu-
tational Linguistics, Online. Association for Computational Linguistics, pp 7724–7736
Saunders D, Sallis R, Byrne B (2020) Neural machine translation doesn’t translate gender
coreference right unless you make it. In: Proceedings of the second workshop on gender bias
in natural language processing, Barcelona, Spain (Online). Association for Computational
Linguistics, pp 35–43
Shah DS, Schwartz HA, Hovy D (2020) Predictive biases in natural language processing models: A
conceptual framework and overview. In: Proceedings of the 58th annual meeting of the
Association for Computational Linguistics, Online. Association for Computational Linguistics,
pp 5248–5264
Stafanovičs A, Bergmanis T, Pinnis M (2020) Mitigating gender bias in machine translation with
target gender annotations. In: Proceedings of the fifth conference on machine translation,
Online. Association for Computational Linguistics, pp 629–638
Stanovsky G, Smith NA, Zettlemoyer L (2019) Evaluating gender bias in machine translation. In:
Proceedings of the 57th annual meeting of the Association for Computational Linguistics,
Florence, Italy. Association for Computational Linguistics, pp 1679–1684
Vanmassenhove E, Hardmeier C, Way A (2018) Getting gender right in neural machine
translation. In: Proceedings of the 2018 conference on empirical methods in natural language
processing, Brussels, Belgium. Association for Computational Linguistics, pp 3003–3008
Vanmassenhove E, Shterionov D, Way A (2019) Lost in translation: Loss and decay of linguistic
richness in machine translation. In: Proceedings of machine translation summit XVII volume 1:
research track, Dublin, Ireland. European Association for Machine Translation, pp 222–232
Vanmassenhove E, Shterionov D, Gwilliam M (2021) Machine translationese: Effects of algorith-
mic bias on linguistic complexity in machine translation. In: Proceedings of the 16th conference
of the European chapter of the Association for Computational Linguistics: main volume, Online.
Association for Computational Linguistics, pp 2203–2213
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017)
Attention is all you need. In Guyon I, von Luxburg U, Bengio S, Wallach HM, Fergus R,
Vishwanathan SVN, Garnett R (eds) Advances in neural information processing systems 30:
annual conference on neural information processing systems 2017, December 4–9, 2017, Long
Beach, CA, USA, pp 5998–6008
Wu F, Fan A, Baevski A, Dauphin YN, Auli M (2019) Pay less attention with lightweight and
dynamic convolutions. In: 7th International conference on learning representations, ICLR 2019,
New Orleans, LA, USA, May 6–9, 2019. OpenReview.net
Yang W, Xie Y, Lin A, Li X, Tan L, Xiong K, Li M, Lin J (2019) End-to-end open-domain question
answering with BERTserini. In: Proceedings of the 2019 conference of the North American
chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, Min-
nesota. Association for Computational Linguistics, pp 72–77
Zhao J, Wang T, Yatskar M, Ordonez V, Chang KW (2017) Men also like shopping: Reducing
gender bias amplification using corpus-level constraints. In: Proceedings of the 2017 conference
on empirical methods in natural language processing, Copenhagen, Denmark. Association for
Computational Linguistics, pp 2979–2989
184 F. Bianchi et al.
Zhao J, Wang T, Yatskar M, Ordonez V, Chang KW (2018) Gender bias in coreference resolution:
Evaluation and debiasing methods. In: Proceedings of the 2018 conference of the North
American chapter of the Association for Computational Linguistics: human language technol-
ogies, volume 2 (Short Papers), New Orleans, Louisiana. Association for Computational
Linguistics, pp 15–20
Zhao G, Sun X, Xu J, Zhang Z, Luo L (2019) Muse: Parallel multi-scale attention for sequence to
sequence learning. Preprint. arXiv:1911.09483
Zhu J, Xia Y, Wu L, He D, Qin T, Zhou W, Li H, Liu T (2020) Incorporating BERT into neural
machine translation. In: 8th International conference on learning representations, ICLR 2020,
Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net
Zmigrod R, Mielke SJ, Wallach H, Cotterell R (2019) Counterfactual data augmentation for
mitigating gender stereotypes in languages with rich morphology. In: Proceedings of the 57th
annual meeting of the Association for Computational Linguistics, Florence, Italy. Association
for Computational Linguistics, pp 1651–1661
Chapter 10
The Ecological Footprint of Neural Machine
Translation Systems
Abstract Over the past decade, deep learning (DL) has led to significant advance-
ments in various fields of artificial intelligence, including machine translation (MT).
These advancements would not be possible without the ever-growing volumes of
data and the hardware that allows large DL models to be trained efficiently. Due to
their large number of computing cores and their dedicated memory, graphics
processing units (GPUs) are a more effective hardware solution than central
processing units (CPUs) for training and inference with DL models. However,
GPUs are very power demanding, and their electrical power consumption has
economic as well as ecological implications.
This chapter focuses on the ecological footprint of neural MT systems. It starts
from the power drain during the training of and the inference with neural MT models
and moves towards the environmental impact, in terms of carbon dioxide emissions.
Different architectures (RNN and Transformer) and different GPUs (consumer-
grade NVidia 1080Ti and workstation-grade NVidia P100) are compared. Then,
the overall CO2 emissions are calculated for Ireland and the Netherlands. The NMT
models and their ecological impact are compared to common household appliances
to draw a clearer picture.
The last part of this chapter analyses quantization, a technique for reducing the
size and complexity of models, as a way to reduce power consumption. As quantized
models can run on CPUs, they present a power-efficient inference solution without
depending on a GPU.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 185
H. Moniz, C. Parra Escartín (eds.), Towards Responsible Machine Translation,
Machine Translation: Technologies and Applications 4,
https://doi.org/10.1007/978-3-031-14689-3_10
186 D. Shterionov and E. Vanmassenhove
10.1 Introduction
Over the past decade, Deep Learning (DL) techniques took the world by storm and
their application led to state-of-the-art results in various fields. The same holds for
the field of Machine Translation (MT), where the latest AI boom led the way for the
emergence of a new paradigm: Neural Machine Translation (NMT). Since most of
the foundational techniques used in current applications of AI were developed before
the turn of the century, the main triggers of the boom were innovations in general
purpose GPU computing as well as hardware advancements that facilitate much
more efficient training and inference. In spite of the improvements in terms of
efficiency, graphics processing units (GPUs) are more power demanding than their
central processing unit (CPU) predecessors and as such, they have a considerably
higher environmental impact.
In this chapter, a brief overview of the most recent paradigms is presented along
with a more in-depth introduction to the core processing technology involved in
NMT (GPUs as opposed to CPUs). We elaborate on the exponential growth of
models, the current trends in ‘big data’ and their relation to model performance. The
related work discusses pioneering and recent papers on Green AI for Natural
Language Processing (NLP) as well as tools to quantify the environmental impact
of AI.
As a case study and to outline the realistic dimensions of power consumption and
environmental footprint of NMT, we train 16 NMT models and use them for
translation, while collecting power readings from the corresponding GPUs. Using
the collected measurements we compare different NMT architectures (Long Short-
Term Memory (LSTM) (Sutskever et al. 2014; Cho et al. 2014; Bahdanau et al.
2015) and Transformer (Vaswani et al. 2017)), different GPUs (NVidia GTX 1080Ti
and NVidia Tesla P100), and relate the environmental footprint of training and
translating with these models to that of other commonly used devices.
Together with the research and analytical contributions, this chapter also aims to
motivate researchers to devote time, effort and investment in developing more
ecological solutions for MT.
Machine translation (MT), the task of automatically translating text in one language
into text in another language using a computer system, has become an indispensable
tool for professional translators (to assist them in the translation workflow), for
commercial users, e.g. e-commerce companies (to make their content quickly
available in multiple languages), and for everyday users (to access information
unrestricted by the language in which it is produced). Since its inception in the
late 1950s, MT has undergone many shifts, the latest of which, NMT imposes a
requirement for hardware that can facilitate efficient training and inference with
10 The Ecological Footprint of Neural Machine Translation Systems 187
NMT models. The most commonly used hardware is the GPU, which supports
embarrassingly parallel computations thanks to its large number of processing cores
and dedicated memory. [1]
In the early days of machine translation (MT), rule-based MT (RBMT) systems were
built around dictionaries and human-crafted rules to convert a source sentence into
its equivalent in the target language. Such systems were heavily dependent on the
efforts and skills of linguists. Developing a rule-based system for a new domain or a
new language pair was (and still is) a cumbersome, time-consuming task that
requires extensive linguistic expertise (Arnold et al. 1994). However, from a com-
putational point of view, it is an inexpensive task—using an RBMT system does not
require substantial computational resources.
In the 1980s, researchers attempted to overcome some of the shortcomings of
RBMT when dealing with languages that differed substantially structurally
(e.g. English vs Japanese). Nagao (1984), focusing specifically on collocations,
showed that examples could be used for transfer when rules and trees failed. These hybrid
approaches evolved by the end of the decade into a more example-centred approach
(example-based MT (EBMT), where patterns would be retrieved from existing
corpora and adapted using hand-written rules). [2] The idea of using patterns extracted
from corpora culminated when, in the late 1980s and early 1990s, a group of
researchers at IBM created an MT system relying solely on statistical information
extracted from corpora (Brown et al. 1988, 1990). [3]
This generation of corpus-based MT systems relies on data and statistical models to
derive word-, phrase- or segment-level translations eliminating the need for complex
semantic and/or syntactic rules. As such, the core mechanism of MT systems shifted
from human expertise in linguistics to machine learning techniques. This shift
entailed other important changes for MT related to the development time and the
computational resources required for training. [4] Furthermore, this group of MT
paradigms that learn automatic translation models from large amounts of parallel
and monolingual data reached—for in-domain translations and given enough
[1] There are other hardware and software that are specifically developed for AI-accelerated computing, e.g. tensor processing units (https://cloud.google.com/tpu/docs/tpus). However, as the most commonly used and easily accessible such devices are GPUs, our work focuses on the power considerations and environmental footprint of GPUs.
[2] For an overview of EBMT we refer the interested reader to Carl and Way (2003).
[3] An overview of the pre-neural evolution of MT can be found in Hutchins (2005a).
[4] As a matter of fact, the technical advances in terms of computational resources facilitated developments in the field of corpus-based MT. The paradigm shift furthermore coincided with the late 1990s and early 2000s growth in terms of direct applications for MT and localisation.
available training data—a better overall translation quality than that achieved by
earlier RBMT systems.
Up until about 2016, Phrase-Based Statistical MT (PB-SMT) was the dominant
corpus-based paradigm (especially after Google Translate made the switch from
RBMT to SMT) (Bentivogli et al. 2016). Currently, most state-of-the-art results for
MT are achieved using neural approaches, i.e. models based on artificial neural
networks and most prominently recurrent neural networks (RNNs) (Sutskever et al.
2014; Cho et al. 2014; Bahdanau et al. 2015) and Transformer architectures
(Vaswani et al. 2017). RNNs feed their output back as input, along with the new input.
This enables RNNs to compress sequences (of tokens, e.g. words) of arbitrary length
into a fixed-size representation. This representation can then be used to initiate a
decoding process where one token is generated at a time (again using a recurrent
network) conditioned on the previously generated tokens and the encoded represen-
tation of the input until a certain condition is met. In the context of MT, this
generation process typically continues until the end-of-sentence token is generated.
To mitigate issues related to long-distance dependencies, LSTM units (Zaremba
et al. 2014) or gated recurrent units (Cho et al. 2014) are typically used instead of
simple RNNs. In addition, to further improve the relation between encoder and
decoder, an attention mechanism is added (Luong et al. 2015). The attention
mechanism learns to assign different weights to individual input tokens based on
their importance. It has been shown that attention-based models significantly
outperform those that do not employ attention. In contrast to NMT using RNNs,
Transformer does not employ recurrence; it uses self-attention, i.e. an attention
mechanism that indicates the importance of a token with respect to the other input
tokens. Within a self-attention mechanism the positional information is lost. As
such, Transformer employs a positional encoding that injects the positional informa-
tion. These operations can be performed in parallel for different tokens, which allows
the training process of a Transformer model to be parallelised, unlike RNNs, where
operations are performed sequentially, making Transformer a more efficient
architecture.
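The operations described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration of scaled dot-product self-attention and a sinusoidal positional encoding; the function names and the absence of learned query/key/value projections are our own simplification, not code from any NMT toolkit:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Scaled dot-product self-attention over a sequence X of shape (n, d):
    # each row of the score matrix weighs the importance of every input
    # token with respect to the current one; all rows are computed in
    # parallel, unlike the sequential steps of an RNN.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)          # (n, n) token-to-token scores
    return softmax(scores, axis=-1) @ X    # weighted mix of token vectors

def positional_encoding(n, d):
    # Sinusoidal encoding that injects the positional information lost
    # by the order-agnostic attention operation.
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
```

Because every token attends to every other token in a single matrix product, the whole sequence is processed at once, which is precisely what makes Transformer training parallelisable.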
For a more complete overview and description of the history of Machine Trans-
lation, we refer to the work of Hutchins (2005b), Kenny (2005), Poibeau (2017) and
Koehn (2020).
NMT evolved very quickly, replacing PB-SMT in both academia and industry.
Aside from the paradigm shift, NMT imposed another big change within the field
of MT: that of the core processing technology, in particular the employment of
GPUs instead of CPUs. Training neural networks (NNs) revolves around the manipulation
of large matrices. A general-purpose, high-performance CPU typically contains 16–
32 high-frequency cores that can operate in parallel. CPUs are designed for general
purpose, sequential operations. However, the sizes of matrices involved in training
Fig. 10.1 Visual comparison between a CPU and a GPU in terms of cache, control and processing
units. Source: https://fabiensanglard.net/cuda/
NNs are far beyond the processing and memory capacities of CPUs; processes that
can (theoretically) be parallelised need to be serialised and conducted in sequence,
leading to rather high processing times. To perform all matrix operations efficiently,
a parallel-processing framework is more suitable. GPUs—as the name clearly
indicates—are designed to render and update graphics. They encapsulate thousands
of cores with tens of thousands of threads. With their large dedicated memory GPUs
can host both an NN model and training examples reducing the required memory
transfer. The large number of processing cores that can operate in parallel and the
dedicated memory make GPUs a much more effective option for training NN
models (Raina et al. 2009), including NMT models. See Fig. 10.1 for a visual
comparison between a CPU and a GPU in terms of cache, control and processing
units.
In GPUs, however, many more transistors are dedicated to data processing, rather
than to caching and control flow as in CPUs (Raina et al. 2009). GPUs are in fact
much more power demanding than CPUs, leading to two types of considerations:
(1) the physical power requirements towards a data centre or a workstation
dedicated to training NN models, and (2) the ecological concern related to
the production and consumption of electricity to power up and sustain the training of
NN (including NMT) models. In this chapter we focus on the second consideration. We
will present actual power and thermal indicators measured during the training of
different NMT models and align them with ecological as well as economical markers
in Sect. 10.5. Then, in Sect. 10.6, we will discuss two approaches to reduce the GPU
power consumption: distribution and parallelisation, and quantization.
Since statistical models (both traditional and deep learning) took over in the field of
NLP, datasets and models grew bigger. Especially since 2018 when BERT (Devlin
et al. 2019) and its successors (e.g. GPT-2 (Radford et al. 2019), GPT-3 (Brown et al.
2020) and Turing NLG (Microsoft 2020)) appeared, the size of language models and
the number of parameters grew exponentially. Since the relation between a model’s
performance and its complexity is at best logarithmic (Schwartz et al. 2020),
exponentially larger models are being trained for often small gains in performance.
This exponential growth is illustrated well by the Switch-C, the current largest
language model, introduced in 2021 with a capacity of 1.6 trillion parameters. For
comparison, one of its recent predecessors, GPT-3, currently the third largest
model, [5] introduced in June 2020, had a capacity of ‘only’ 175 billion parameters. [6]
Similarly, a blogpost by OpenAI (Amodei et al. 2018) demonstrated how the compute
used in the largest AI training runs grew by more than 300,000×. This corresponds to
a doubling every 3.4 months. [7]
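The two figures fit together, as a back-of-the-envelope check shows; the roughly 5.2-year span is our own assumption about the period the blogpost covers:

```python
import math

# Back-of-the-envelope check of the figures above: a growth in compute of
# more than 300,000x, doubling every 3.4 months, spans roughly 5.2 years.
growth = 300_000
doublings = math.log2(growth)              # ~18.2 doublings needed
months_per_doubling = (5.2 * 12) / doublings
```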
Many studies have investigated the performance of NMT in terms of adequacy,
fluency, errors, data requirements, impact on the translation workflow and language
service providers, and bias. The efficiency and energy impact of developing new NMT
models, however, have not yet received the necessary attention. In Sect. 10.3 we
present related work in order to properly position our research in the current literature.
The related work is divided into three subsections. In Sect. 10.3.1, we cover the
related research papers. Since there are a few more practical tools that have been
suggested and constructed in order to measure the environmental and financial cost,
we cover the main tools in Sect. 10.3.2. Finally, in Sect. 10.3.3, we mention some
recent initiatives related to sustainable NLP.
10.3.1 Research
Recent work (Strubell et al. 2019; Schwartz et al. 2020) brought to the attention of
the NLP community the environmental (carbon footprint) and financial (hardware
and electricity or cloud compute time) cost of training ‘deep’ NLP models. In
Strubell et al. (2019) the energy consumption (in kilowatt hours) of different state-of-
the-art NLP models is estimated. With this information, the carbon emissions and
electricity costs of the models can be approximated. Their experiments regarding
the cost of training show that training BERT (Devlin et al. 2019) is
comparable to a trans-American flight in terms of carbon emissions. They also
quantified the cost of development by studying logs of a multi-task NLP model
[5] The second largest model is the GShard (Lepikhin et al. 2020), introduced in September 2020, which had a capacity of 600 billion parameters.
[6] https://analyticsindiamag.com/open-ai-gpt-3-language-model/.
[7] For comparison, Moore’s Law forecasted a doubling every 2 years for the number of transistors in a dense integrated circuit (Amodei et al. 2018).
(Strubell et al. 2018) that received the Best Long Paper award at EMNLP 2018.
The estimated development costs revealed that the most problematic aspect in
terms of cost is the tuning process and the full development cycle (due to
hyperparameter grid searches) [8,9] and not the training process of a single model.
They conclude their work with three recommendations for NLP research which
stress the importance of: (1) reporting the time required for (re)training and the
hyperparameters’ sensitivity, (2) the need for equitable access to computational
resources in academia, and (3) the development of efficient hardware and models.
Similar to Strubell et al. (2019), the work by Schwartz et al. (2020) advocates for
‘Green AI’, which is defined as “AI research that is more environmentally friendly
and inclusive” (Schwartz et al. 2020) and is directly opposed to environmentally
unfriendly, expensive and thus exclusive ‘Red AI’. Although the ‘Red AI’ trend has
led to significant improvements for a variety of AI tasks, Schwartz et al. (2020) stress
that there should also be room for other types of contributions that are greener, less
expensive and that allow young researchers and undergraduates to experiment,
research and have the ability to publish high-quality work at top conferences. The
trend of so-called ‘Red AI’, where massive models are trained using huge amounts of
resources, can almost be seen as a type of ‘buying’ stronger results, especially given
that the relation between the complexity of a model and its performance is at best
logarithmic, implying that exponentially larger models are required for linear gains. [10]
Nevertheless, their analysis of the trends in AI based on papers from top conferences
such as ACL [11] and NeurIPS [12] reveals that there is a strong tendency within the field
to focus merely on the accuracy (or performance) of the proposed models, with very
few papers even mentioning other measures such as speed, model size or efficiency.
They propose making the efficiency of models a key criterion alongside (or integrated
with) commonly used metrics. There are multiple ways to measure efficiency, [13]
floating point operations (FPO) being the one advocated for by Schwartz et al.
(2020). FPO is a metric that has occasionally been used to determine the energy
footprint of models (Molchanov et al. 2016; Vaswani et al. 2017; Gordon et al. 2018;
Veniat and Denoyer 2018) as it estimates “the work performed by a computation
process” (Schwartz et al. 2020) based on two operations ‘ADD’ and ‘MUL’. They
furthermore advocate for reporting a baseline that promotes data-efficient
approaches by plotting accuracy as a function of “computational cost and of training
set size” (Schwartz et al. 2020). Aside from the environmental impact, both recent
papers also stress the importance of making research more inclusive and accessible
[8] During tuning the model is trained from an already existing checkpoint, typically using new data.
[9] In the development cycle, different versions of the model are trained or tuned and evaluated. Each of those differs in terms of hyperparameter values.
[10] Such models are most commonly developed by large multinational companies that possess the necessary resources.
[11] https://acl2018.org.
[12] https://nips.cc/Conferences/2018.
[13] For a more detailed overview of measures we refer to the paper itself.
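To make the FPO measure concrete, here is a toy counter of our own (not the counting methodology of any cited paper) for the ADD and MUL operations in the dense matrix multiplications of a feed-forward network:

```python
# Toy FPO (floating point operations) counter: each element of an
# (m x k) @ (k x n) matrix product costs k multiplications ('MUL')
# and k - 1 additions ('ADD').
def fpo_matmul(m, k, n):
    return m * n * (2 * k - 1)

def fpo_mlp(batch, dims):
    # dims, e.g. [512, 2048, 512] for a Transformer feed-forward block
    return sum(fpo_matmul(batch, d_in, d_out)
               for d_in, d_out in zip(dims, dims[1:]))
```

Counting operations this way is hardware-independent, which is exactly why Schwartz et al. (2020) prefer it over wall-clock time or energy readings.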
[14] Among others due to the fact that it has been considered the centrepiece that led to Google firing leading AI ethics researcher Dr. Timnit Gebru, and subsequently Dr. Margaret Mitchell, two of the authors of the paper.
10.3.2 Tools
10.3.3 Workshop
SustaiNLP: [17] The first SustaiNLP workshop was held in November 2020 and
(virtually) co-located with EMNLP 2020. [18] It specifically focused on efficiency, by
encouraging researchers to design solutions that are simpler yet competitive
[15] https://mlco2.github.io/impact.
[16] https://github.com/Breakend/experiment-impact-tracker.
[17] https://sites.google.com/view/sustainlp2020.
[18] https://2020.emnlp.org/.
In the following three sections we try to give a realistic picture of the power
consumption and environmental footprint, in terms of carbon emissions, related to
the usage of GPUs for training and translating with NMT models. To do that, we
train multiple NMT models, using both LSTM and Transformer architectures on
different GPUs. We record the power consumption for each GPU during training as
well as during translation. In Sect. 10.5 we analyse the results and in Sect. 10.6 we
present one possible strategy to reduce power consumption at inference time,
i.e. quantization—the process of approximating a neural network’s parameters by
reducing their precision, thus reducing the size of a model.
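As a preview of that strategy, here is a minimal NumPy sketch of symmetric post-training int8 quantization; the scheme is illustrative only and deliberately simpler than what production toolkits implement:

```python
import numpy as np

# Symmetric int8 post-training quantization: map float32 weights onto
# the integer range [-127, 127] with a single scale factor, shrinking
# storage to a quarter, and dequantize on the fly at inference time.
def quantize(w):
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize(w)
w_hat = dequantize(q, scale)
print(q.nbytes / w.nbytes)   # 0.25: int8 needs a quarter of the bytes
```

The per-weight error is bounded by half the scale factor, which is why well-trained models typically tolerate this reduction in precision with little quality loss.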
To assess the energy consumption during the training and translation processes of
NMT, we trained different NMT models from scratch, i.e. without any pretraining
or reliance on additional models (e.g. BERT), on two different workstations
equipped with different GPUs. The GPUs we had at our disposal are four NVidia
GeForce 1080Ti with 11GB of vRAM and three NVidia Tesla P100 with 16GB of
vRAM. These units differ not only in terms of technical specifications, but also in
their purpose of use: the 1080Ti is a consumer-grade GPU, developed for desktop
machines with active cooling; the P100 is designed for workstations that operate
continuously and does not support active cooling.
In Table 10.1, we compare the two GPU types in terms of their specifications.
The rest of the configurations are as follows: (1) for the 1080Ti desktop
workstation: CPU: Intel(R) Core(TM) i7-7820X @ 3.60GHz, RAM: 64GB
(512MB block); (2) for the P100 workstation: CPU: Intel(R) Xeon(R) Gold 6128
@ 3.40GHz, RAM: 196GB (1GB block).
In this work we focus on assessing the power consumption related to the
utilization of GPUs. As such, in this chapter we will not provide metrics related to
the CPU utilization and the corresponding power consumption. This decision is
motivated by the fact that GPUs are the main processing unit for NMT models.
[19] https://opennmt.net/OpenNMT-py/.
Table 10.3 Vocabulary sizes. For completeness we also present the vocabulary size without BPE,
i.e. the number of unique words in the corpora

Lang. pair     No BPE                With BPE
               EN        FR/ES       EN        FR/ES
EN-FR/FR-EN    113,132   131,104     47,628    48,459
EN-ES/ES-EN    113,692   168,195     47,639    49,283
All NMT systems have learning rate decay enabled and their training is
distributed over four NVidia 1080Ti GPUs. The selected settings for the RNN systems
are optimal according to Britz et al. (2017); for the Transformer we use the settings
suggested by the OpenNMT community [20] as the optimal ones that lead to quality on
par with the original Transformer work (Vaswani et al. 2017).
For training, testing and validation of the systems we used the same data. To build
the vocabularies for the NMT systems we used sub-word units, which allow NMT to
generate word forms not seen as such in the training data and mitigate, to a certain
extent, the out-of-vocabulary problem. To compute the sub-word units we used BPE
with 50,000 merge operations for all our data sets. Separate subword vocabularies
were used for every language. In Table 10.3 we present the vocabulary sizes of the
data used to train our PB-SMT and NMT systems.
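The merge operations at the heart of BPE can be illustrated with a toy example (a hypothetical mini-vocabulary; the systems above run 50,000 merges over the full training data): the most frequent adjacent symbol pair is repeatedly merged into a new subword unit.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    # words: mapping from a word (a tuple of symbols) to its frequency.
    merges = []
    words = dict(words)
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for seq, freq in words.items():
            for pair in zip(seq, seq[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        merged = {}
        for seq, freq in words.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(seq[i] + seq[i + 1])
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            merged[tuple(out)] = freq
        words = merged
    return merges, words
```

For example, with the counts {('l','o','w'): 5, ('l','o','w','e','r'): 2}, two merges first join 'l'+'o' and then 'lo'+'w', yielding the subword 'low' that also covers the unseen word form 'lowest'.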
The quality of our MT systems is evaluated on the test set using standard
evaluation metrics: BLEU (Papineni et al. 2002) (as implemented in SacreBLEU
(Post 2018)) and TER (Snover et al. 2006) (as implemented in MultEval (Clark et al.
2011)). Our evaluation scores are presented in Table 10.4.
We computed pairwise statistical significance using bootstrap resampling (Koehn
2004) and a 95% confidence interval. The results shown in Table 10.4 are all
statistically significant based on 1000 iterations and samples of 100 sentences. All
metrics show the same performance trends for all language pairs: Transformer
(TRANS) outperforms all other systems, followed by PB-SMT and LSTM.
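The significance test can be sketched as follows; the per-sentence scores and the helper name are hypothetical, but the resampling loop mirrors the pairwise bootstrap of Koehn (2004) with the iteration count and sample size used above:

```python
import random

# Pairwise bootstrap resampling over per-sentence scores: repeatedly
# draw a sample of the test set with replacement and count how often
# system A beats system B on the sampled sentences.
def bootstrap_win_rate(scores_a, scores_b, iterations=1000,
                       sample_size=100, seed=0):
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(sample_size)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / iterations   # fraction of samples where A beats B
```

A win rate of at least 0.95 corresponds to significance at the 95% confidence level used above.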
To measure the consumption of power, memory and core utilisation, as well as the
heat generated during the training and use of an NMT model, several tools can be
exploited, e.g. experiment-impact-tracker [21] (Henderson et al. 2020),
“Weights and Biases”, [22] as well as NVidia’s device monitoring command
nvidia-smi dmon. We decided to stick to the mainstream NVIDIA System Management
Interface program (nvidia-smi) [23] which does not require additional installation
[20] http://opennmt.net/OpenNMT-py/FAQ.html.
[21] https://github.com/Breakend/experiment-impact-tracker.
[22] https://wandb.ai.
[23] https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf.
Table 10.4 Quality evaluation scores for our MT systems. TRANS denotes Transformer systems

                   English as source                English as target
                   EN→FR          EN→ES            FR→EN          ES→EN
System             BLEU↑  TER↓    BLEU↑  TER↓      BLEU↑  TER↓    BLEU↑  TER↓
1080Ti  LSTM       34.2   50.9    38.2   45.3      34.6   48.2    38.1   44.7
        TRANS      37.2   48.7    40.9   43.4      37.0   46.4    41.3   41.4
P100    LSTM       34.1   50.7    37.3   47.0      34.9   48.0    38.5   44.4
        TRANS      37.4   48.4    40.9   43.3      37.3   47.0    41.6   42.5
or setup. It is a tool by NVidia for monitoring and management of major lines of their
GPUs. nvidia-smi is cross-platform and supports all standard NVidia
driver-supported Linux distributions as well as 64-bit versions of the Windows
operating system. In this work we used the nvidia-smi dmon command to
monitor all GPUs during the training and inference processes. This command
displays one line of monitoring data per monitoring cycle. The default range of
metrics includes power usage (or power draw—the last measured power draw for the
entire board reported in watts), temperature, SM clocks, memory clocks and utili-
zation values for SM, memory, encoder and decoder. By default we monitor all
GPUs during the training process as training is distributed over all four 1080Ti or
three P100 GPUs. However, at inference time we use only one GPU. As such, we
only monitor and report values for that specific GPU during inference.
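As an illustration (not the exact script used for the experiments), per-second dmon readings could be collected and parsed along the following lines. The column layout assumed here is the default dmon output (gpu, pwr, gtemp, mtemp, sm, mem, enc, dec, mclk, pclk), which may vary across driver versions; the function and variable names are ours.

```python
import subprocess

def parse_dmon_line(line):
    """Parse one `nvidia-smi dmon` data line into a dict.

    Assumes the default dmon columns:
    gpu  pwr  gtemp  mtemp  sm  mem  enc  dec  mclk  pclk
    """
    fields = line.split()
    return {
        "gpu": int(fields[0]),
        "power_w": float(fields[1]),    # last measured board power draw (W)
        "gpu_temp_c": float(fields[2]),
        "sm_util": float(fields[4]),    # SM utilisation (%)
        "mem_util": float(fields[5]),   # memory utilisation (%)
    }

def monitor(seconds):
    """Collect one dmon sample per second for `seconds` seconds."""
    proc = subprocess.Popen(
        ["nvidia-smi", "dmon", "-d", "1", "-c", str(seconds)],
        stdout=subprocess.PIPE, text=True)
    samples = []
    for line in proc.stdout:
        if line.startswith("#"):        # skip the two dmon header lines
            continue
        samples.append(parse_dmon_line(line))
    return samples
```

For example, the dmon line `0 143 62 - 99 54 0 0 5005 1911` would be parsed into a power draw of 143 W at 99% SM utilisation.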
We ought to note that the experiment-impact-tracker, which internally
invokes nvidia-smi, is a much more elaborate tool, designed to ease the collection
and analysis of data. However, it utilizes Intel’s RAPL interface.24 As part of the
Linux kernel, RAPL is read/write protected and the information it generates is
readable by super users only. As such, the experiment-impact-tracker
could not collect any metrics from the CPU in our setup. We found no work-around
other than disabling this functionality, which in turn meant only collecting metrics
from nvidia-smi; we therefore did not employ the tool. Nevertheless, along with
resource monitoring, this tool can compute the CO2 emissions based on the compute
time and the energy consumed.
The CO2 emissions generated by each experiment are computed based on
Eq. 10.1 (Strubell et al. 2019; Henderson et al. 2020):

    CO2 = PUE × kWh × I_CO2                                   (10.1)
where PUE, which stands for Power Usage Effectiveness, defines how efficiently
data centres use energy, i.e. it accounts for the additional energy required to support
the compute infrastructure; kWh is the total energy consumed in kilowatt hours; and
I_CO2 is the carbon intensity. The PUE and I_CO2 values vary greatly and depend on a large set of
factors. In our computations, similar to Henderson et al. (2020) we use averages
reported on a global or national level. In particular, we use the global average PUE
value reported by Ascierto and Lawrence (2020) of 1.59.25
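As a concrete sketch of this computation (the helper names are ours, not the authors' code), plugging in the 1080Ti EN-FR LSTM training energy reported in Table 10.7 (14.07 kWh) and the Irish mean carbon intensity reproduces the reported 5.14 kg of CO2:

```python
PUE = 1.59           # global average PUE (Ascierto and Lawrence 2020)
I_CO2_IE = 229.8718  # mean carbon intensity for Ireland, gCO2/kWh

def co2_kg(kwh, intensity_g_per_kwh, pue=PUE):
    """CO2 emissions in kg: PUE * kWh * carbon intensity (g/kWh), g -> kg."""
    return pue * kwh * intensity_g_per_kwh / 1000.0

def kwh_from_watt_samples(watt_samples):
    """Energy of per-second power readings: watt-seconds / 3,600,000 = kWh."""
    return sum(watt_samples) / 3_600_000.0

# EN-FR LSTM trained on the 1080Ti workstation: 14.07 kWh (Table 10.7).
print(round(co2_kg(14.07, I_CO2_IE), 2))  # → 5.14
```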
The kWh value is computed as the sum of all per-second power readings (in watts)
per GPU, divided by 3,600,000 (the number of watt-seconds in one kilowatt hour).
We ought to note that during training on
24. Intel’s Running Average Power Limit or RAPL (Intel 2009) interface exposes power meters and power limits. It also allows power limits on the CPU and the DRAM to be set.
25. It is worth noting that the average PUE value follows a descending trend until 2018, when the average PUE is 1.58 (as reported in the Uptime Institute global data centre survey for 2018) and used in the experiment-impact-tracker tool. However, in 2019 the global average PUE is 1.67, surpassing 2013; in 2020, the value is still high at 1.59 (see p. 10 of the 2020 survey (Ascierto and Lawrence 2020)).
10 The Ecological Footprint of Neural Machine Translation Systems 199
Table 10.5 Train time in hours, number of steps and average train time for one step

System          1080Ti                               P100
                Elapsed    # steps    time/          Elapsed    # steps    time/
                time (h)   (×1000)    step           time (h)   (×1000)    step
LSTM   EN-FR    25.08      160        0.16           18.83      145        0.13
       EN-ES    28.41      180        0.16           16.66      130        0.13
       FR-EN    23.51      145        0.16           13.95      105        0.13
       ES-EN    24.38      145        0.17           19.21      145        0.13
TRANS  EN-FR    5.22       14.5       0.36           5.06       11         0.46
       EN-ES    6.60       19.5       0.34           6.06       13         0.47
       FR-EN    6.15       17.5       0.35           4.85       11         0.44
       ES-EN    6.36       19         0.33           6.20       13         0.48
the P100 workstation and translation (both quantized and not quantized) on both
workstations, we collected the readings of nvidia-smi every second; for the
training on the 1080Ti workstation, the readings were collected every 5 s. In order to
make the comparison more realistic, for the latter case we interpolated the missing
values instead of simply averaging for every 5-s interval.26
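The interpolation step can be sketched as follows. The chapter used SciPy's interpolate module; numpy.interp, used here for brevity, performs the same linear interpolation as scipy.interpolate.interp1d with kind='linear'. The sample values below are illustrative, not measured data.

```python
import numpy as np

# Power readings taken every 5 s (timestamps in seconds)...
t_sparse = np.array([0, 5, 10, 15])
power_sparse = np.array([140.0, 150.0, 145.0, 155.0])  # watts

# ...linearly interpolated onto a 1 s grid, mirroring the per-second
# sampling used on the P100 workstation.
t_dense = np.arange(0, 16)
power_dense = np.interp(t_dense, t_sparse, power_sparse)

print(power_dense[1])  # → 142.0, one fifth of the way from 140 W to 150 W
```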
The carbon intensity is a measure of how much CO2 emissions are produced per
kilowatt hour of electricity consumed.27 It is a constantly-changing value and as such
we consider an average collected over the first half of 2020. The electricityMap
project28 (Tranberg et al. 2019) collects and distributes elaborate information related
to electricity production and consumption, including carbon intensity per country
and per specific timestamp. Since our research is conducted for academic purposes,
‘electricityMap’ gave us access to historical data, from which we calculated the mean
CO2 intensity and standard deviation for both Ireland (IE) and the Netherlands (NL),
where our workstations are located (the 1080Ti workstation is located in Ireland; the
P100 workstation, in the Netherlands). The values (in gCO2/kWh) are as follows: IE: 229.8718 ±
77.4026; NL: 399.3685 ± 31.9251.
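The mean and standard deviation over such an exported history can be computed directly; the four readings below are illustrative placeholders, not electricityMap data.

```python
import statistics

# Hourly carbon-intensity readings in gCO2/kWh (illustrative values only).
hourly_intensity = [180.0, 220.0, 260.0, 300.0]

mean_i = statistics.mean(hourly_intensity)
std_i = statistics.stdev(hourly_intensity)  # sample standard deviation

print(f"{mean_i:.1f} ± {std_i:.1f}")  # → 240.0 ± 51.6
```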
We first present the run times for each experiment and the training step at which the
training was terminated (using early stopping). The run times are shown in
Table 10.5. Although we are using the exact same data, software, hyperparameter
26. We used SciPy’s interpolate: https://docs.scipy.org/doc/scipy/reference/interpolate.html.
27. https://carbonintensity.org.uk/.
28. https://www.electricitymap.org/.
200 D. Shterionov and E. Vanmassenhove
values29 and random seeds, the training processes on the different hardware deviate
due to differences in the hardware, number of GPUs and the NVidia driver.30
The values in Table 10.5 indicate a larger train time for the LSTM models
compared to the TRANS models, for both the 1080Ti and the more performant
P100. We furthermore observe an overall larger train time for the 1080Ti. However,
we ought to note that the average time per step for the TRANS models is larger for
the P100. We associate these differences with the number of GPUs on which we
trained the models: 4 for the 1080Ti and 3 for the P100.
In Table 10.6 we present the elapsed time during translation of our test set. Recall
that the test set is intentionally large. The translation is conducted on a single GPU
and no file is translated by two models at the same time in order to avoid any I/O
delays.
Contrary to what might be expected, the experiments show that the translation times
on the 1080Ti machine are consistently lower than those on the P100.
29. These are not dependent on the number of GPUs.
30. https://pytorch.org/docs/stable/notes/randomness.html.
Table 10.7 Run-time, power draw and CO2 emissions (kg) at train time

System          1080Ti                                           P100
                Elapsed   Avg.        kWh     CO2                Elapsed   Avg.        kWh    CO2
                time (h)  power (W)           (kg)               time (h)  power (W)          (kg)
LSTM   EN-FR    25.08     142.05      14.07   5.14 ± 1.73        18.83     115.09      6.33   4.02 ± 0.32
       EN-ES    28.41     140.88      15.79   5.77 ± 1.94        16.66     113.99      5.54   3.52 ± 0.28
       FR-EN    23.51     141.85      13.15   4.81 ± 1.62        13.95     113.48      4.63   2.94 ± 0.24
       ES-EN    24.38     139.90      13.44   4.91 ± 1.65        19.21     113.91      6.37   4.04 ± 0.32
TRANS  EN-FR    5.22      176.70      3.64    1.33 ± 0.45        5.06      153.47      2.27   1.44 ± 0.12
       EN-ES    6.60      176.54      4.60    1.68 ± 0.56        6.06      152.08      2.69   1.71 ± 0.14
       FR-EN    6.15      176.64      4.29    1.56 ± 0.53        4.85      151.43      2.15   1.37 ± 0.11
       ES-EN    6.36      179.48      4.50    1.64 ± 0.55        6.20      151.59      2.74   1.74 ± 0.14
Table 10.8 Run-time, average power draw and CO2 emissions (kg) at translation time

System          1080Ti                                           P100
                Elapsed   Avg.        kWh     CO2                Elapsed   Avg.        kWh    CO2
                time (h)  power (W)           (kg)               time (h)  power (W)          (kg)
LSTM   EN-FR    1.52      157.80      0.22    0.08 ± 0.03        1.84      90.50       0.16   0.10 ± 0.01
       EN-ES    1.38      158.51      0.20    0.07 ± 0.02        1.69      89.06       0.15   0.10 ± 0.01
       FR-EN    1.34      153.43      0.19    0.07 ± 0.02        1.79      93.14       0.16   0.10 ± 0.01
       ES-EN    1.48      154.98      0.21    0.08 ± 0.03        1.62      89.35       0.14   0.09 ± 0.01
TRANS  EN-FR    2.63      188.75      0.45    0.16 ± 0.06        3.01      104.52      0.31   0.20 ± 0.02
       EN-ES    2.48      170.02      0.38    0.14 ± 0.05        2.80      102.71      0.28   0.18 ± 0.01
       FR-EN    2.47      193.34      0.47    0.17 ± 0.06        3.18      100.93      0.31   0.20 ± 0.02
       ES-EN    2.45      175.60      0.42    0.15 ± 0.05        2.69      104.35      0.28   0.18 ± 0.01
Due to the excessively larger train time of LSTM models, their overall power consumption
(reported as kWh) is much larger than that of TRANS models. As such, in our
experiments they lead to larger CO2 emissions, as computed by Eq. 10.1 for all
4 1080Ti or 3 P100 GPUs: in the case of the 1080Ti, between 2.99 times (4.91 vs 1.64
kg) for the ES-EN language pair and 3.86 times (5.14 vs 1.33 kg) for EN-FR; in the
case of the P100, between 2.06 times (3.52 vs 1.71 kg) for EN-ES and 2.79 times
(4.02 vs 1.44 kg) for EN-FR. These results indicate that for the same data and optimal
hyperparameters,31 LSTMs have a larger ecological footprint at train time. However,
at translation time the observations are reversed: as suggested by the lower
average power consumption of LSTM models and the larger translation time of
TRANS models, the CO2 emissions of LSTM models are about two times lower
than those of TRANS models (consistently over all language pairs and GPU types). This
31. As recommended in the literature.
would imply that after a certain time of usage, our LSTM models are “greener” than
the TRANS models. In particular, for the 1080Ti GPUs, our LSTMs would become
“greener” after 10 to 40 days, and for the P100, after 9 to 12 days.32
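This break-even point can be reproduced from Tables 10.7 and 10.8 by equating total energy (training plus continuous translation); the sketch below uses the 1080Ti EN-FR figures, and the helper name is ours.

```python
def breakeven_days(train_kwh_a, trans_w_a, train_kwh_b, trans_w_b):
    """Days of continuous translation after which system A (higher train
    energy, lower translation power draw) has consumed the same total
    energy as system B."""
    extra_train_kwh = train_kwh_a - train_kwh_b          # A's training surplus
    saved_kw_per_hour = (trans_w_b - trans_w_a) / 1000.0 # A's translation saving
    return extra_train_kwh / saved_kw_per_hour / 24.0

# EN-FR on the 1080Ti: LSTM trains with 14.07 kWh and translates at ~157.80 W;
# TRANS trains with 3.64 kWh and translates at ~188.75 W (Tables 10.7 and 10.8).
print(round(breakeven_days(14.07, 157.80, 3.64, 188.75), 1))  # → 14.0 days
```

The result, roughly 14 days, falls within the 10-to-40-day range reported for the 1080Ti models.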
When we compare the two types of GPUs, we notice that the P100 is much less
power-demanding than the 1080Ti, with an average power draw between 113.48 W
(FR-EN LSTM) and 115.09 W (EN-FR LSTM) versus between 139.90 W (ES-EN LSTM)
and 142.05 W (EN-FR LSTM), and between 151.43 W (FR-EN TRANS) and 153.47 W
(EN-FR TRANS) versus between 176.54 W (EN-ES TRANS) and 179.48 W (ES-EN
TRANS).33 This is also reflected in the overall power consumption, to which a large
contributing factor is the train time, which is much smaller on the P100 workstation.
That is, at train time, the power consumption of the P100 GPU workstation is almost
three times smaller than that of the 1080Ti machine for LSTM models and around
two times smaller for TRANS models. At translation time, while the P100 still leads to a
lower power consumption, mainly because of the lower average power draw, the
differences are much smaller.
Special attention needs to be paid to the impact of the national carbon intensity.
For Ireland it is much lower than that of the Netherlands. These differences have a
substantial impact on the carbon emissions: the TRANS models trained on the P100
workstation in the Netherlands have a larger footprint than in the 1080Ti case in
three out of the four cases (EN-FR, EN-ES and ES-EN); at translation time for all
models running on the P100 machine, the footprint is larger than in the case of the
1080Ti.
Analysing the data from Tranberg et al. (2019), we see that the power for Ireland
originates primarily from renewable sources while that is not the case for the
Netherlands. Figure 10.2 illustrates this for the period between 01/01/2020 and
01/06/2020 on a monthly basis.
32. This is the time that would be required for the two types of architectures to consume the same power (in watts), including training (on the same data) and translation. Since LSTM consumes less power at translation time per time unit (e.g. second), any further operation of both LSTM and TRANS models would lead to an overall lower energy consumption by LSTM models.
33. At train time.
[Two stacked panels, one for Ireland and one for the Netherlands, showing the percentage of power (0–100%) of fossil vs. renewable origin for each month between January and June 2020.]
Fig. 10.2 Power origin distribution.
34. https://www.caruna.fi.
35. https://www.carbonfootprint.com.
36. These are the same values indicated in Tables 10.7 and 10.8.
37. A modern workstation with 3 x Nvidia Tesla V100 costs approximately €30,000; a workstation with 4 x Nvidia RTX3060 costs approximately €7000.
[Bar chart comparing the CO2 emissions of common household appliances (toaster, microwave, electric mower, hairdryer, electric drill, vacuum cleaner, dehumidifier, plasma TV, fridge/freezer, towel rail) with those of our LSTM and TRANS models on the 1080Ti and P100 workstations, per language pair; y-axis: appliance.]
38. Values are based on data from Carbon Footprint: https://www.carbonfootprint.com.
39. One can also multiply the CO2 emissions at translation time by 4 for the 1080Ti or by 3 for the P100 GPUs and get an indication of how much CO2 emissions would be generated if a workstation (with 4 or 3 GPUs, as in our case) is utilized 100% at translation time. That is, either all GPUs translate using the same model in parallel, or different models are used at the same time for different translation jobs.
40. According to data from https://www.carbonindependent.org.
[Bar chart comparing the CO2 emissions of household appliances (fridge-freezers of A, A+ and A++ spec, electric tumble dryer, electric hob, electric oven, dishwasher at 65°C and at 55°C, kettle, washing machine, microwave oven, primary 34–37 inch LCD TV) per country (Ireland, Netherlands).]
10.6.1 Quantization
DNNs contain a huge number of parameters (biases, weights) that are adapted during
training to reduce the loss. These are typically stored as 32-bit floating point
numbers. In every forward pass, all these parameters are involved in computing the
output of the network. This high precision requires more memory and processing
power than, e.g., integer representations. Quantization is the process of approximating a
neural network’s parameters by reducing their precision. A quantized model executes
some or all of the operations on tensors with integers rather than floating point
values.41 Quantization is a term that encapsulates a broad range of approaches to the
aforementioned process: binary quantization (Courbariaux and Bengio 2016), ternary
(Lin et al. 2016; Li and Liu 2016), uniform (Jacob et al. 2018) and learned
(Zhang et al. 2018), to mention a few. The benefits of quantization are a reduced
model size and the option to use high-performance vectorized operations, as
well as the efficient use of other hardware platforms.42 However, more efficient
quantized models may suffer from worse inference quality. In the field of NMT,
Bhandare et al. (2019) quantize trained Transformer models to a lower precision
(8-bit integers) for inference on Intel® CPUs. They investigate three approaches to
quantizing the weights of a Transformer model and achieve a drop in performance of
only 0.35–0.421 BLEU points (from 27.68 originally). Prato et al. (2020) investigate
uniform quantization for the Transformer, quantizing matrix multiplications and
divisions (if both the numerator and denominator are second or higher rank tensors),
i.e. all operations that could improve the inference speed. Their 6-bit quantized
EN-DE Transformer base model is more than 5 times smaller than the baseline and
achieves higher BLEU scores. For EN-FR, the 8-bit quantized model, which achieves
the highest performance, is almost 4 times smaller than the baseline. With the
exception of the 4-bit fully quantized models and the naive approach, all the rest
show a significant reduction in model size with almost no loss in translation quality
(in terms of BLEU).
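To make the idea concrete, here is a minimal sketch of symmetric uniform INT8 quantization of a weight tensor. This is a toy illustration only; the works cited above use considerably more sophisticated schemes (e.g. learned scales or per-channel quantization).

```python
import numpy as np

def quantize_int8(w):
    """Symmetric uniform quantization of a float32 tensor to int8."""
    scale = np.abs(w).max() / 127.0                     # largest magnitude -> 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float32 values."""
    return q.astype(np.float32) * scale

w = np.array([-0.8, -0.1, 0.0, 0.3, 1.27], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Each weight is recovered to within half a quantization step.
print(bool(np.max(np.abs(w - w_hat)) <= scale / 2))  # → True
```

Each parameter now occupies one byte instead of four, which is where the roughly fourfold model-size reduction of 8-bit schemes comes from.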
The promising results from the aforementioned works motivated us to investigate
the power consumption of quantized versions of our models running on a GPU. We
used the CTranslate2 tool of OpenNMT.43 As noted in the repository,
“CTranslate2 is a fast inference engine for OpenNMT-py and OpenNMT-tf models
supporting both CPU and GPU execution. The goal is to provide comprehensive
inference features and be the most efficient and cost-effective solution to deploy
standard neural machine translation systems such as Transformer models.”
41. https://pytorch.org/docs/stable/quantization.html.
42. For example, the second generation Intel® Xeon® Scalable processors incorporate an INT8 data type acceleration (Intel® DL Boost Vector Neural Network Instructions (VNNI) (Evarist 2018)), specifically designed to accelerate neural network-related computations (Rodriguez et al. 2018).
43. https://github.com/OpenNMT/CTranslate2.
Table 10.9 Evaluation metrics for quantized and baseline Transformer models. FP32 stands for 32-bit floating point; INT16 and INT8 stand for 16-bit and 8-bit integer. FP32 is the default, non-quantised model (see Table 10.4)

GPU     Prec.    EN-FR            EN-ES            FR-EN            ES-EN
                 BLEU↑   TER↓     BLEU↑   TER↓     BLEU↑   TER↓     BLEU↑   TER↓
1080Ti  FP32     37.2    48.7     40.9    43.4     37.0    46.4     41.3    41.4
        INT16    36.4    49.5     40.6    44.1     36.6    47.3     40.9    44.0
        INT8     36.3    49.5     40.5    44.1     36.6    47.3     40.8    44.1
P100    FP32     37.4    48.4     40.9    43.3     37.3    47.0     41.6    42.5
        INT16    36.4    49.1     40.7    44.0     36.5    50.6     40.5    43.8
        INT8     36.4    49.2     40.7    44.0     36.5    50.4     40.5    43.8
We quantized our Transformer models to INT8 and INT16 and translated our test
sets on the 1080Ti and P100 workstations. On each workstation, we quantized the
models that were originally trained on that machine. After translation, we scored
BLEU and TER the same way as with our normal models. The quality results are
summarised in Table 10.9.
We measured the power draw during the translation process with our quantized
models and computed the consumed kWh as well as CO2 emissions per model and
per region. Our results are summarised in Table 10.10.
Comparing these results to those in Table 10.8 (TRANS), we first notice the
increased translation time for the quantized models running on the 1080Ti machine:
from 2.45 to 4.03 (INT8) and 4.45 (INT16) hours for the fastest ES-EN and from
2.63 to 5.17 (INT16) and 4.66 (INT8) for the slowest EN-FR. However, due to the
lower power draw the overall energy consumption as well as the CO2 emissions at
translation time with these models (on the 1080Ti workstation) is still lower than for
the non-quantized models (on the same workstation).
When comparing the performance of these models on the P100 workstation we
notice a much lower translation time over all models, even if the difference between
the EN-FR/INT16 model and the non-quantized EN-FR Transformer model is not so
drastic. At the same time the power draw for all models is lower than for their
non-quantized version, leading to a very low electricity consumption and low carbon
emissions.
We ought to note that the time for quantization, i.e. the process of converting a
non-quantized Transformer model into a quantized one, is very low (between 6 and
12 s on the P100 workstation). Furthermore, quantization is very much suitable for
Table 10.10 Run-time, average power draw and CO2 emissions (kg) at translation time for quantized models

System          1080Ti                                           P100
                Elapsed   Avg.        kWh     CO2                Elapsed   Avg.        kWh    CO2
                time (h)  power (W)           (kg)               time (h)  power (W)          (kg)
INT16  EN-FR    5.17      130.61      0.13    0.05 ± 0.02        0.79      81.54       0.01   0.01 ± 0.00
INT8   EN-FR    4.66      115.99      0.11    0.04 ± 0.01        2.16      49.06       0.02   0.01 ± 0.00
INT16  EN-ES    4.45      158.96      0.14    0.05 ± 0.02        0.99      65.40       0.01   0.01 ± 0.00
INT8   EN-ES    4.15      124.40      0.10    0.04 ± 0.01        1.00      68.01       0.01   0.01 ± 0.00
INT16  FR-EN    4.57      139.38      0.13    0.05 ± 0.02        1.28      67.57       0.02   0.01 ± 0.00
INT8   FR-EN    4.39      107.87      0.09    0.03 ± 0.01        1.29      68.39       0.02   0.01 ± 0.00
INT16  ES-EN    4.45      131.66      0.12    0.04 ± 0.01        1.02      68.33       0.01   0.01 ± 0.00
INT8   ES-EN    4.03      117.48      0.09    0.03 ± 0.01        1.04      67.25       0.01   0.01 ± 0.00
inference on CPU. Based on the results in Tables 10.9 and 10.10, the low quantiza-
tion time and the fact that quantized models can easily be run on CPU, we would
recommend quantized models at translation time for large-scale translation projects
in the pursuit of greener MT.
In the era of deep learning, neural models are continuously pushing the boundaries in
NLP, MT included. The ever-growing volumes of data and the advanced, larger
models keep delivering new state-of-the-art results. A facilitator for these results is
the innovation in general-purpose GPU computing, as well as in the hardware itself,
i.e. GPUs. The embarrassingly parallel processing required for deep learning models
is easily distributed over the thousands of processing cores of a GPU, making
training and inference with such models much more efficient than on CPUs. However,
GPUs are much more power-demanding and as such have a higher environmental
impact. In this chapter we discussed considerations related to the power
consumption and ecological footprint, in terms of carbon emissions, associated with
the training of and inference with MT models.
After briefly presenting the evolution of MT and the shift to GPUs as the core
processing technology of (N)MT, we discussed the related work addressing the
issues of power consumption and environmental footprint of computational models.
We acknowledge that the work of fellow researchers and practitioners, such as
Strubell et al. (2019) and Schwartz et al. (2020), among others, raises awareness
about the environmental footprint of deep learning models, and we would like
to join them in their appeal towards “greener” AI. This could be achieved through
optimizing models through quantization, as discussed in Sect. 10.6, but also through
reusability, smarter data selection, knowledge distillation and other techniques.
To outline the realistic dimensions of power consumption and environmental
footprint of NMT, we analysed a number of NMT models, running (training or
inference) on two types of GPUs: a consumer GPU card (NVidia GTX 1080Ti),
designed to work on a desktop machine and a workstation GPU (NVidia Tesla P100)
developed for heavy loads of graphics or neural computing. We reported results for
training both LSTM and Transformer models for English-French and English-
Spanish (and vice-versa) language pairs on data from the Europarl corpus. These
models were trained on approximately 1.5M parallel sentences and used to translate
large test sets that include approximately 500,000 sentences each. Our results and
analysis show that, while a Transformer model is much faster and as such much more
power efficient than an LSTM at train time, at translation time Transformer models
are lagging behind LSTM models, in terms of power consumption, speed as well as
carbon emissions. We also note that using the more expensive P100 is preferable in
almost every case; the one exception is its slightly higher translation time, which,
however, comes with the benefit of a largely reduced power consumption.
Additionally, we also note the impact of electricity sources on the carbon
emissions by investigating two different countries, each of which has a different
distribution between fossil and renewable energy sources—Ireland with a larger
portion of renewable energy and the Netherlands with a larger portion of fossil
sources.
Together with the aforementioned contributions, we also aim to motivate
researchers to devote time, effort and investment to developing more ecological
solutions. We are already looking into model reusability, data selection and filtering,
multi-objective optimization of hyperparameters and other approaches that reduce
the environmental footprint of NMT.
Carbon Impact Statement
This work contributed 50.77 ± 11.37 kg of CO2eq to the atmosphere and used
111.55 kWh of electricity.
References
Bender EM, Gebru T, McMillan-Major A, Shmitchell S (2021) On the dangers of stochastic parrots:
Can language models be too big? In: Proceedings of the 2021 ACM conference on fairness,
accountability, and transparency. Association for Computing Machinery, New York
Bentivogli L, Bisazza A, Cettolo M, Federico M (2016) Neural versus phrase-based machine
translation quality: a case study. In: Proceedings of the 2016 conference on empirical methods
in natural language processing, Austin, Texas, pp 257–267
Bhandare A, Sripathi V, Karkada D, Menon V, Choi S, Datta K, Saletore V (2019) Efficient 8-bit
quantization of transformer neural machine language translation model. CoRR, abs/1906.00532
Britz D, Goldie A, Luong MT, Le Q (2017) Massive exploration of neural machine translation
architectures. In: Proceedings of the association for computational linguistics (ACL), Vancouver,
Canada, pp 1442–1451
Brown PF, Cocke J, Della Pietra S, Della Pietra VJ, Jelinek F, Mercer RL, Roossin PS (1988) A
statistical approach to language translation. In: Proceedings of the 12th international conference
on computational linguistics, COLING ’88, Budapest, Hungary, August 22–27, 1988. John von
Neumann Society for Computing Sciences, Budapest, pp 71–76
Brown PF, Cocke J, Della Pietra SA, Della Pietra VJ, Jelinek F, Lafferty J, Mercer RL, Roossin PS
(1990) A statistical approach to machine translation. Comput Linguist 16(2):79–85
Brown TB, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P,
Sastry G, Askell A, et al (2020) Language models are few-shot learners. Preprint.
arXiv:2005.14165
Carl M, Way A (2003) Recent advances in example-based machine translation. Springer,
Cambridge, MA
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014)
Learning phrase representations using RNN encoder–decoder for statistical machine
translation. In: Proceedings of the 2014 conference on empirical methods in natural language
processing, Doha, Qatar, pp 1724–1734
Clark JH, Dyer C, Lavie A, Smith NA (2011) Better hypothesis testing for statistical machine
translation: Controlling for optimizer instability. In: Proceedings of the 49th annual meeting of
the Association for Computational Linguistics: human language technologies, Portland, Ore-
gon, USA. Association for Computational Linguistics, pp 176–181
Courbariaux M, Bengio Y (2016) Binarynet: Training deep neural networks with weights and
activations constrained to +1 or -1. CoRR, abs/1602.02830
Devlin J, Chang MW, Lee K, Toutanova K (2019) Bert: Pre-training of deep bidirectional trans-
formers for language understanding. In: Proceedings of the 2019 conference of the North
American chapter of the Association for Computational Linguistics: human language technol-
ogies, (NAACL-HLT 2019), volume 1 (Long and Short Papers), Minneapolis, Minnesota, USA,
pp 4171–4186
Doyle J, Bashroush R (2020) Case studies for achieving a return on investment with a hardware
refresh in organizations with small data centers. IEEE Trans Sustain Comput:1
Ethayarajh K, Jurafsky D (2020) Utility is in the eye of the user: A critique of NLP leaderboards. In:
Proceedings of the 2020 conference on empirical methods in natural language processing,
EMNLP 2020, Online, November 16–20, 2020. Association for Computational Linguistics,
pp 4846–4853
Evarist F (2018) Efficient 8-bit quantization of transformer neural machine language translation
model
Gordon A, Eban E, Nachum O, Chen B, Wu H, Yang TJ, Choi E (2018) Morphnet: Fast & simple
resource-constrained structure learning of deep networks. In: Proceedings of the IEEE confer-
ence on computer vision and pattern recognition, pp 1586–1595
Henderson P, Hu J, Romoff J, Brunskill E, Jurafsky D, Pineau J (2020) Towards the systematic
reporting of the energy and carbon footprints of machine learning. J Mach Learn Res 21(248):
1–43
Hutchins J (2005a) The history of machine translation in a nutshell. Retrieved December 2009
Pulido L (2016) Flint, environmental racism, and racial capitalism. Capitalism Nat Socialism 27(3):
1–16
Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I (2019) Language models are
unsupervised multitask learners. OpenAI Blog 1(8):9
Raina R, Madhavan A, Ng AY (2009) Large-scale deep unsupervised learning using graphics
processors. In: Proceedings of the 26th annual international conference on machine learning,
ICML ’09, pp 873–880
Rodriguez A, Segal E, Meiri E, Fomenko E, Kim YJ, Shen H, Ziv B (2018) Lower numerical
precision deep learning inference and training. Intel White Paper 3:1–19
Schwartz R, Dodge J, Smith NA, Etzioni O (2020) Green AI. Commun ACM 63:54–63
Shterionov D, Do Carmo F, Moorkens J, Paquin E, Schmidtke D, Groves D, Way A (2019) When
less is more in neural quality estimation of machine translation. An industry case study. In:
Proceedings of machine translation summit XVII volume 2: translator, project and user tracks,
Dublin, Ireland. European Association for Machine Translation, pp 228–235
Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with
targeted human annotation. In: Proceedings of the 7th conference of the Association for Machine
Translation of the Americas (AMTA 2006). Visions for the Future of Machine Translation,
Cambridge, Massachusetts, USA, pp 223–231
Strubell E, Verga P, Andor D, Weiss D, McCallum A (2018) Linguistically-informed self-attention
for semantic role labeling. In: Proceedings of the 2018 conference on empirical methods in
natural language processing, Brussels, Belgium. Association for Computational Linguistics, pp
5027–5038
Strubell E, Ganesh A, McCallum A (2019) Energy and policy considerations for deep learning in
NLP. In: Proceedings of the 57th annual meeting of the Association for Computational
Linguistics, Florence, Italy. Association for Computational Linguistics, pp 3645–3650
Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In:
Proceedings of advances in neural information processing systems 27: annual conference on
neural information processing systems, Montreal, Quebec, Canada, pp 3104–3112
Tranberg B, Corradi O, Lajoie B, Gibon T, Staffell I, Andresen GB (2019) Real-time carbon
accounting method for the European electricity markets. Energy Strategy Rev 26:100367
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017)
Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R,
Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30.
Curran Associates, Red Hook, pp 5998–6008
Veniat T, Denoyer L (2018) Learning time/memory-efficient deep architectures with budgeted
super networks. In: Proceedings of the IEEE conference on computer vision and pattern
recognition, pp 3492–3500
Zaremba W, Sutskever I, Vinyals O (2014) Recurrent neural network regularization. CoRR,
abs/1409.2329
Zhang D, Yang J, Ye D, Hua G (2018) Lq-nets: Learned quantization for highly accurate and
compact deep neural networks. In: Computer vision - ECCV 2018 - 15th European conference,
Munich, Germany, September 8–14, 2018, Proceedings, Part VIII, volume 11212 of Lecture
Notes in Computer Science. Springer, pp 373–390
Chapter 11
Treating Speech as Personally Identifiable
Information and Its Impact in Machine
Translation
Isabel Trancoso, Francisco Teixeira, Catarina Botelho, and Alberto Abad
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 215
H. Moniz, C. Parra Escartín (eds.), Towards Responsible Machine Translation,
Machine Translation: Technologies and Applications 4,
https://doi.org/10.1007/978-3-031-14689-3_11
216 I. Trancoso et al.
is being replaced by end-to-end systems that allow the sentences in the target
language to sound like the voice of the speaker in the source language, opening up a
world of possibilities. All these privacy and security issues are becoming more and
more pressing in an era where speech must be legally regarded as PII (Personally
Identifiable Information).
11.1 Introduction
The discussion about ethics in AI is an extremely relevant topic that deserves the
attention of all the AI communities, but it is also extremely broad. Within AI, there
are communities in which this discussion is particularly pertinent. Natural Language
Processing (or NLP) is one of them. The term NLP has been mostly used in the past
to cover techniques dealing with text. Nowadays, however, it is increasingly used to
cover the human language in all its forms: written, spoken, gestural. This chapter
focuses on a single modality: speech. Speech is the most natural and immediate form
of communication. It is ubiquitous. The recent progress in speech technologies, due
mostly to advances in deep learning, huge amounts of training data and growing
computing power, has led to the use of speech as an input/output modality in a
panoply of applications which have been mostly reserved for text until recently.
Although progress is very much dependent on the amount of training data, making
performance sensitive to factors such as age and language/accent, and motivating
huge efforts on training for less-resourced languages, spoken language technologies
are widely accepted nowadays. In fact, spoken language technologies, in
particular through their use in voice assistants, are transforming our society in
terms of digital inclusion, allowing an increasingly seamless use of the digital
tools that surround us: our telephone, our television set, or our household appliances.
Because of their complexity, many language applications run on cloud-based
platforms that provide remote access to powerful models in what is commonly
known as Machine Learning as a Service (MLaaS), enabling the automation of
time-consuming tasks such as document translation, or transcribing speech, and
helping users to perform everyday tasks (e.g. voice-based virtual assistants). When
a biometric signal such as speech is sent to a remote server for processing, however,
this input signal reveals, in addition to the meaning of words, much information
about the user, including his/her preferences, personality traits, mood, health, and
political opinions, among other data such as gender, age range, height, accent, etc.
Moreover, the input signal can also be used to extract relevant information about the
user's environment, namely background sounds.
In fact, most users are unaware of the potential for misuse allowed by this new
generation of speech technology systems. For instance, most users do not know how
many sentences in their own voice are necessary for cloning it, nor have they heard
about spoofing speaker recognition systems. Moreover, most users do not realize
that adversarial techniques now enable the effective injection of hidden commands
into spoken messages, without the commands being audible to humans.
The almost impossible task of reviewing the state of the art in core spoken
language technologies is the topic of the second section of this chapter. In the
third one, we try to raise awareness of their potential misuse.
The fourth section briefly describes efforts towards voice privacy, covering
anonymisation and encryption techniques.
The fifth section focuses on the ethical impact of using speech as an input/output
modality for MT, one of the NLP technologies which traditionally has dealt with text
input and output. However, speech-to-speech machine translation (S2SMT) is no
longer a research-only topic, and one can only anticipate its growing use in our
multilingual world.
The sixth section wraps up, arguing that there is a need for a growing awareness
of the potential for misuse of speech technologies, and simultaneously a need for a
common taxonomy that allows experts in speech technology, cryptography, and law
to clearly define the boundaries of ethical speech processing.
Among several other survey papers that may complement the many topics raised
in this chapter, we strongly recommend the excellent overview of ethics and good
practice in computational paralinguistics in Batliner et al. (2020). For a survey of
approaches to privacy-preserving speech processing, Nautsch et al. (2019b) is
another excellent starting point. For an in-depth study of profiling humans from
their voice, we also recommend Singh (2019).
A colleague once complained about reading too many papers that include the words
“with the advent of deep learning”, but that phrase is in fact the most appropriate
start for this section, which tries to summarize in a couple of pages the recent
progress in speech technologies.
Yet, most users of remote speech technology servers are unaware of the amount
of information that can be mined from their sentences, in particular, about their
health status. In fact, the potential of speech as a biomarker for health has been
demonstrated for diseases affecting respiratory organs, such as the common cold,
Obstructive Sleep Apnea (OSA), or COVID-19; for mood disorders such as depression,
anxiety, and bipolar disorder; and for neurodegenerative diseases such as
Parkinson’s (PD), Alzheimer’s (AD), and Huntington’s disease.
For instance, the most common speech disturbances in Parkinson’s Disease are
excess of tremor, reduced loudness, monotonicity, hoarseness, and imprecise
articulation. A second example is OSA, where most patients show articulatory
anomalies, phonation anomalies, and abnormal coupling of the vocal tract with the
nasal cavity, which is present even in non-nasal sounds. A third example is
depression, where speech is characterized as dull, monotone, monoloud, lifeless
and metallic.
Some of these symptoms are visible in features that may be automatically
extracted from the acoustic signal: prosodic features, voice quality features, and
spectral features (e.g., pitch, energy, resonance frequencies, jitter, harmonicity,
speech rate, pause duration, etc.). Other symptoms may also be visible in the analysis
of the text that is automatically produced by a speech recognition system. For
instance, the analysis of speech of patients with AD may show a decline in content
and fluency, and a higher prevalence of pauses and filler words.
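As a rough illustration of such feature extraction (a minimal sketch under simplifying assumptions, far simpler than clinical-grade tooling; all function and variable names are ours), frame energy, a crude autocorrelation-based pitch estimate, and a pause ratio can be computed in plain Python:

```python
import numpy as np

def basic_speech_features(signal, sr, frame_len=0.03):
    """Illustrative extraction of a few of the features named above:
    mean frame energy, a crude pitch estimate, and a pause ratio.
    Clinical-grade tools use far more robust estimators."""
    hop = int(sr * frame_len)
    frames = [signal[i:i + hop] for i in range(0, len(signal) - hop, hop)]
    energies = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    # Frames below 10% of the peak energy are treated as pauses.
    pause_ratio = float(np.mean(energies < 0.1 * energies.max()))

    # Crude pitch: lag of the autocorrelation peak within a plausible F0 range.
    frame = frames[len(frames) // 2]
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / 400), int(sr / 75)      # search 75-400 Hz
    lag = lo + int(np.argmax(ac[lo:hi]))
    pitch_hz = sr / lag
    return {"mean_energy": float(energies.mean()),
            "pause_ratio": pause_ratio,
            "pitch_hz": pitch_hz}
```

Real pipelines rely on much more robust estimators (e.g. openSMILE or Praat-based feature sets), but the sketch shows how such measurements derive directly from the waveform.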
Building classifiers that detect such diseases entails the collection of datasets of
speech from patients and healthy controls, which until very recently was mostly
done in clinical facilities. The experience with COVID-19 revealed how important
remote diagnosis can be, motivating research with data collected in-the-wild. Other
examples of potentially very useful remote diagnosis and monitoring are machine-
assisted depression and suicide risk assessment (Cummins et al. 2015).
In paralinguistic tasks involving speech affecting diseases, data collection may
involve different types of acoustic signals: read and spontaneous speech, sustained
vowels, consonant-vowel syllables, coughs, simulated snoring, etc. The typically
small size and limited demographic coverage of datasets for paralinguistic research
have been a problem for many years, independently of the physical or psychological
trait that is being detected. In fact, they have limited the use of neural machine
learning approaches in many cases. Hence, results vary significantly for the same
task across different datasets. For instance, Vasquez et al. (2017) report results from 70.3% to
88.5% (unweighted average recall) using the same method for the task of detecting
Parkinson’s disease in datasets of read sentences in three different languages. These
limitations have been the main motivation for establishing joint Computational
Paralinguistics challenges such as ComParE,1 which has taken place yearly at
Interspeech conferences since 2009, covering many different tasks and motivating
many teams worldwide to work on common datasets.

1. http://www.compare.openaudio.eu/.

These current limitations also show that the enormous potential of profiling
humans from their voices is still far from fully explored. This potential is
particularly high for clinical applications, going much beyond the typical
speech and language disorders such as stuttering or sigmatism (defective pronunci-
ation of sibilant sounds). Our vision is that collecting speech samples will one day
become as common as a blood test, so that doctors and therapists will be able to
automatically retrieve the results of a detailed analysis of speech features, as well as
global indicators of the presence of speech affecting diseases, that may be used as a
“second opinion”.
Health is an application domain where MT may play a very relevant role. One can
envisage a panoply of applications of MT and in particular speech-to-speech MT in
clinical facilities, to be used by physicians, caretakers and patients. Transposing
speech analysis models trained with speech in one language to another language
could hamper diagnosis based on speech features, although several paralinguistic
features, namely the ones related to voice quality (e.g. jitter), could be considered
language independent.
On the other hand, some prosodic features (such as pitch contours, for instance)
may be strongly language dependent. Their relevance for the extraction of paralin-
guistic traits such as emotion cannot be overemphasized. Hence, transposing such
traits to the synthetic output of speech-to-speech MT is a very challenging research
topic.
2. https://www.robots.ox.ac.uk/~vgg/data/voxceleb/.
3. https://kaldi-asr.org/.
error rates (EER) close to 3%. This metric owes its name to the fact that it
corresponds to the operating threshold at which the false positive and false negative
error rates are equal.
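To make this definition concrete, here is a minimal sketch (our own illustration, not challenge or toolkit code) that approximates the EER from two sets of verification scores by sweeping candidate thresholds:

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Find the threshold where the false-acceptance rate (FAR) and the
    false-rejection rate (FRR) cross, and return the (approximate) EER."""
    # Candidate thresholds: all observed scores, sorted.
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, eer = 2.0, None
    for t in thresholds:
        far = np.mean(impostor_scores >= t)   # impostor trials wrongly accepted
        frr = np.mean(genuine_scores < t)     # genuine trials wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

In practice the EER is usually read off a detection error trade-off (DET) curve; the sweep above is a discrete equivalent of that reading.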
These weights can be used to initialize models for new datasets and new tasks. In
fact these pre-trained speaker representations have many applications beyond voice
biometrics. They have been successfully applied to numerous speech processing
tasks, including linguistic (e.g. speech recognition and speech synthesis) and para-
linguistic tasks.
The enormous progress of speaker verification systems has allowed their worldwide
deployment in biometric systems with great impact on fraud prevention in areas such
as banking.4
The state of the art in automatic speech recognition (ASR) since the 1980s was
predominantly based on the GMM-HMM paradigm (Gaussian Mixture Models—
Hidden Markov Models). By combining these acoustic models, which were fed with
perceptually meaningful features, with additional knowledge sources provided by
n-gram language models and lexical (or pronunciation) models, one could achieve
word error rates (WER) that made ASR systems usable for certain tasks, namely
those involving read speech in clean recording conditions. However, progress was
slow for more than three decades, and robustness remained a major issue. A giant leap
in error rate reduction was achieved during the last decade with the so-called ‘hybrid
paradigm’, that pairs a deep neural network with HMMs. The Kaldi toolkit includes
a recipe for training a DNN-HMM based system on a corpus of read audiobooks
(Librispeech (Panayotov et al. 2015)) with close to 960 h of speech. This recipe
achieves a WER of 3.8%, an unthinkable result a decade ago. Conversational speech
is more challenging, with error rates almost triple that figure, as are tasks
involving, for instance, non-native accents or distant microphones in a meeting
room. Nowadays, fully end-to-end architectures are proposed to perform the
entire ASR pipeline (Karita et al. 2019), with the exception of feature extraction, but
their performance is significantly worse when training data is scarce. Many machine
learning approaches have been proposed recently for improving the ASR perfor-
mance in challenging tasks, such as audio augmentation (Park et al. 2019; Ko et al.
2015), transfer learning (Abad et al. 2020), multi-task learning (Pironkov et al.
2016), etc.
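Since WER figures are quoted throughout this section, a minimal sketch of the metric itself may help (illustrative code, not taken from any of the toolkits mentioned): WER is the word-level Levenshtein distance between hypothesis and reference, normalized by the reference length.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with word-level Levenshtein distance (dynamic programming)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is one reason conversational and far-field tasks report such high figures.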
Of particular interest are the recent unsupervised approaches that leverage speech
representations to segment unlabeled audio and learn a mapping from these
representations to phonemes via adversarial training. These approaches reach
competitive word error rates on the Librispeech benchmark (5.9%), rivaling systems
trained on labeled data (Baevski et al. 2021).

4. https://www.finextra.com/newsarticle/37989/hsbcs-voice-id-prevents-249-million-of-attempted-fraud.
Due to their high complexity, ASR systems typically run in the cloud. In personal
voice assistants, the task of spotting the wake-up keyword when they are in the
“always listening” mode is done on device, using much less complex approaches.
Despite recent breakthroughs in ASR, spontaneous speech recognition is still a
very challenging task, representing one of the major sources of errors that together
with MT errors hinder the widespread use of speech-to-speech MT systems.
Although for specific domains enough training data can be collected to mitigate
such errors, users should be made aware of the limitations of current systems.
5. A vocoder (short for voice encoder) is a synthesis system which was initially developed to reproduce human speech.
Blizzard challenges6 have been organized every year since 2005 to jointly
evaluate speech synthesizers built on the same datasets. Synthetic speech quality
is now very close to that of human speech, reaching values above 4 on a scale of
1–5, the so-called ‘Mean Opinion Score’ (MOS).
Due to their high complexity, neural TTS systems typically run in the cloud, but
on-device implementation is a current target for some big industrial players, leverag-
ing the internal deep learning framework of the latest generation of mobile devices.
Voice conversion (VC) belongs to the general technical field of speech synthesis, but
instead of converting text to speech it encompasses changing the properties of
speech, for example, voice identity, emotion, and accents. An excellent survey on
VC can be found in Sisman et al. (2021). The impact of deep learning on VC has
been so significant that a number of applications of VC that were considered only
potential until very recently are now much closer to deployment: personalized
speech synthesis, namely for the speech-impaired community, speaker
de-identification, voice mimicry and disguise, computer-assisted pronunciation
training for second language students, and voice dubbing for movies. Unlike the
first above-mentioned applications which are essentially monolingual, voice dub-
bing involves the much harder task of crosslingual VC.
Traditional VC approaches encompassed three stages: analysis and feature
extraction, which decomposes the speech signals of a source speaker into features
that represent supra-segmental and segmental information; mapping, which changes
them towards the target speaker; and reconstruction, which re-synthesizes time-
domain speech signals. For many years, one of the hardest problems with these
traditional VC approaches was the overall muffling effect, most probably linked to
the statistical averaging that characterizes the adopted optimization criteria, and to
the low-resolution features that were used for the mapping.
Recent deep learning approaches avoid this pipeline. The reconstruction is done
with neural vocoders which, being trainable, can be optimized jointly with the
mapping module and even with the analysis module, yielding an end-to-end solution
(Sisman et al. 2021). The possibility of disentangling speaker information and
linguistic contents is crucial. This has been done notably by using variational auto-
encoder schemes, in which the content encoder learns a latent code from the source
speaker speech, and the speaker encoder learns the speaker embedding from the
target speaker speech. At run-time, the latent code and the speaker embedding are
combined to generate speech. This type of approach was used as one of the baselines
for the most recent Voice Conversion Challenge (Yi et al. 2020). Another baseline
was a cascade of ASR and TTS systems.
6. https://www.synsig.org/index.php/Blizzard_Challenge.
The success of this disentanglement of speaker and linguistic contents has led
researchers to new approaches that try to factor in prosody embeddings or style
embeddings as well. This disentanglement can be particularly relevant in the context
of speech-to-speech translation.
VC systems may be evaluated using several subjective and objective metrics.
Among the subjective tests, the MOS (mean opinion score) is one of the most
commonly used measures of speech naturalness. Among the objective tests, one
can use speaker recognition and speech recognition tests.
into outputting a target command. In the past, this type of perturbation was in most
cases perceptible, but the current state of the art in adversarial attacks, namely
using multi-objective loss functions, shows that one can generate highly
imperceptible perturbations that are extremely effective in misleading either speaker
or speech recognition systems.
The possibilities for misuse are in fact endless, and this very brief review barely
scratched the surface, ignoring several other types of attacks that may target
speech-based apps.
Despite having claimed that most users of speech technologies running on remote
servers are not fully aware of the privacy concerns and potential security attacks
these technologies entail, privacy awareness in remote speech processing is
gradually surfacing in the media. The growing awareness that audio sensing can be
anywhere, anytime is illustrated by headlines such as:
• Microsoft workers listen to some translated Skype calls7
• Apple halts practice of contractors listening in to users on Siri8
• Google ordered to halt human review of voice AI recordings over privacy risks9
• Amazon’s Alexa recorded private conversation and sent it to random contact10
• Amazon Echo Dot ad cleared over cat food order11
• LaLiga fined for soccer app’s privacy-violating spy mode12
• An Amazon Echo may be the key to solving a murder case13
• Siri and Alexa could become witnesses against you in court some day14
Privacy concerns not only what a subject says, but also the way the subject says it,
the physical and psychological traits of the subject, and the environment in which the
utterance was produced. On the other hand, the possibility of using synthetic voices
for fraudulent purposes, or the recreation of the voices of celebrities, also generates
great interest, as in:
7. https://www.bbc.com/news/technology-49263260.
8. https://www.theguardian.com/technology/2019/aug/02/apple-halts-practice-of-contractors-listening-in-to-users-on-siri.
9. https://techcrunch.com/2019/08/02/google-ordered-to-halt-human-review-of-voice-ai-recordings-over-privacy-risks/.
10. https://www.theguardian.com/technology/2018/may/24/amazon-alexa-recorded-conversation.
11. https://www.bbc.com/news/business-43044693.
12. https://techcrunch.com/2019/06/12/laliga-fined-280k-for-soccer-apps-privacy-violating-spy-mode.
13. https://techcrunch.com/2016/12/27/an-amazon-echo-may-be-the-key-to-solving-a-murder-case/.
14. https://sdlgbtn.com/news/2016/12/29/siri-and-alexa-could-become-witnesses-against-you-court-some-day.
• Beware: Phone scammers are using this new sci-fi tool to fleece victims15
• An artificial-intelligence first: Voice-mimicking software reportedly used in a
major theft16
• New AI Tech Can Mimic Any Voice17
• The haunting afterlife of Anthony Bourdain18
In fact, social networks went wild over the recent use of TTS techniques to
generate the voice of a dead celebrity reading his own emails in a biopic
documentary. Terms such as AI lapse show that there is a growing awareness that
such speech deepfakes may raise ethical issues.
Much progress on voice privacy has been spurred by joint evaluation challenges
such as the VoicePrivacy challenge in 2020 (Tomashenko et al. 2020). There are
many different approaches targeting voice privacy. This section briefly covers two
classes: via anonymisation and via encryption.
For the sake of space, we leave out deletion methods which target ambient sound
analysis, deleting or obfuscating any overlapping speech, so that no information
about it can be recovered. For recent work on this topic, see for instance Cohen-
Hadria et al. (2019) and Gontier et al. (2020).
We also leave out federated (also called ‘decentralized’ or ‘distributed’) learning
methods, which aim to learn models from distributed data without accessing it
directly (Leroy et al. 2019). This type of method is often combined with differential
privacy.
11.4.1 Anonymisation
15. https://fortune.com/2021/05/04/voice-cloning-fraud-ai-deepfakes-phone-scams/.
16. https://www.washingtonpost.com/technology/2019/09/04/an-artificial-intelligence-first-voice-mimicking-software-reportedly-used-major-theft/.
17. https://www.scientificamerican.com/article/new-ai-tech-can-mimic-any-voice/.
18. https://www.newyorker.com/culture/annals-of-gastronomy/the-haunting-afterlife-of-anthony-bourdain.
set of vectors among the farthest ones in the x-vector space. This means that the
artificial voice may not correspond to any real speaker.
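The selection scheme described above can be sketched roughly as follows (an illustration under our own assumptions, e.g. cosine distance and simple averaging, rather than the actual challenge baseline code):

```python
import numpy as np

def select_anonymisation_xvector(source_xvec, pool, n_farthest=50):
    """Build a pseudo-speaker embedding as the average of the N pool
    x-vectors farthest (in cosine distance) from the source speaker."""
    pool = np.asarray(pool, dtype=float)
    src = source_xvec / np.linalg.norm(source_xvec)
    pool_norm = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    cos_dist = 1.0 - pool_norm @ src          # distance of each pool vector
    farthest = np.argsort(cos_dist)[-n_farthest:]
    return pool[farthest].mean(axis=0)        # average -> artificial voice
```

Averaging several distant x-vectors yields a pseudo-speaker embedding that steers the synthesis away from the original voice, which is why the resulting artificial voice may not correspond to any real speaker.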
The challenge used a number of objective metrics such as equal error rate (EER)
of speaker verification and word error rate of speech recognition, as well as subjec-
tive measures. The primary baseline system achieved good anonymisation results in
terms of EER (comparable to chance level) at the cost of a relative degradation in
terms of WER, which in some datasets surpassed 60%. However, EER results
degraded significantly when the verification system was trained on anonymised data.
Several teams participating in the challenge proposed more elaborate x-vector
modification schemes, showing again the pervasiveness of speaker embedding
approaches.
Most anonymisation solutions aim at suppressing information related to the
identity of the speaker, whilst preserving the information related to the contents of
the spoken message. But anonymisation could also aim at selectively suppressing
certain attributes (e.g. gender/age), a disentanglement problem that may enable a
greater control of privacy levels.
The topic of anonymisation in the context of speech-to-speech MT is still
relatively unexplored, to the best of our knowledge, but it certainly deserves
much attention.
11.4.2 Encryption
Up to very recently, S2SMT systems were based on a cascade of ASR, MT and TTS
modules. Depending crucially on the state of the art for each of these modules, their
usability was mostly demonstrated in narrow domains, which was an obstacle to
their commercial deployment. End-to-end approaches have attracted much attention
recently, due to the potential for avoiding the error propagation that is inherent to
cascade approaches. They may also be advantageous in terms of computational
complexity and latency. Such advantages were also realised for speech-to-text
translation systems where end-to-end approaches have had a profound impact
(Bahar et al. 2019; Gangi et al. 2019). In addition, direct models for S2SMT are
19. https://blog.sdl.com/blog/The-Issue-of-Data-Security-and-Machine%20Translation.html.
information. The possibility of using the same human-in-the-loop paradigm for
speech input/output remains unexplored, although proof-of-concept systems have
already been demonstrated (Bernardo et al. 2019). This possibility raises numerous
challenges not only from a technical point of view, but also from the point of view of
privacy. A human-in-the-loop paradigm for speech-to-speech MT would address the
problem that users are unable to check whether the synthetic output, which could be
spoken in their own voice, contains misrecognition or mistranslation errors.
11.6 Conclusions
This chapter has tried to summarize several privacy and security issues potentially
raised by the use of speech technologies when they are accessed on remote servers.
Users of speech-to-speech MT systems must be made aware of these issues at
different levels. On the one hand, their input speech data in L1 may reveal a great
amount of paralinguistic/extra-linguistic information, besides the linguistic content
of the spoken message, which may contain references to entities that one might
prefer to anonymise. This would allow a malicious server to profile the user for
different purposes, such as recommending products and services. On the other
hand, the input utterances themselves can be used for building text-to-speech
synthesizers in the speaker’s voice, which may be misused for impersonation/
spoofing attacks. One may argue that ideally the spoken utterance in L2 should
preserve all this information, but this may have to be done at the cost of trusting the
remote server. Achieving a balance between privacy and utility in speech technol-
ogies deployed in remote servers is a difficult goal, but it becomes much harder when
such technologies are combined in complex speech-to-speech MT systems.
Privacy engineering is an interdisciplinary field which is slowly emerging within
speech technologies, making developers conscious that performance indicators
alone are no longer sufficient, and privacy by design is crucial.
The discussion of all these issues requires joining forces of different communi-
ties: the speech research community, the cryptography research community, and the
legal community. This is one of the objectives of the recently formed Special Interest
Group (SIG) “Security and Privacy in Speech Communication” (SPSC), within the
International Speech Communication Association (ISCA). Intended as an interdis-
ciplinary platform, the SIG fosters exchange between leading industrial and aca-
demic players with the goal of reaching standards and procedures that protect the
privacy of the individual in speech communication while providing sufficient means
and incentives for industry to exploit towards future innovative services. The SIG
was very active in the recent discussion promoted by the European Data Protection
Board on Virtual Voice Assistants.20 In particular, the SIG criticized the use of the
20. https://edpb.europa.eu/our-work-tools/documents/public-consultations/2021/guidelines-022021-virtual-voice-assistants_en.
term unique identifiability with regard to speech signals. The term does not reflect
the probabilistic nature of speaker identification methods. This probabilistic nature
also raises the need to address uncertainty in decision outcomes and to limit its
impact in decision making.
The differing taxonomies used by the different communities are probably the first
obstacle to overcome in order to clearly define the boundaries of ethical speech
processing (Nautsch et al. 2019a). The GDPR contains few norms that have direct
applicability to inferred data, requiring an effort of extensive interpretation of many
of its norms, with adaptations, to guarantee the effective protection of people’s rights
in an era where speech must be legally regarded as PII (Personally Identifiable
Information).
Acknowledgements This work was supported by national funds through Fundação para a Ciência
e a Tecnologia (FCT) with references UIBD/50021/2020 and CMU/TIC/0069/2019, and by the
P2020 project MAIA (contract 045909). We would like to thank several colleagues for many
interesting discussions on this topic, namely Bhiksha Raj, Helena Moniz, Filipa Calvão, and
Andreas Nautsch.
References
Abad A, Bell P, Carmantini A, Renais S (2020) Cross lingual transfer learning for zero-resource
domain adaptation. In: IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pp 6909–6913
Baevski A, Hsu WN, Conneau A, Auli M (2021) Unsupervised speech recognition. ArXiv preprint,
2105.11084
Bahar P, Bieschke T, Ney H (2019) A comparative study on end-to-end speech to text translation.
In: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp 792–
799
Batliner A, Hantke S, Schuller BW (2020) Ethics and good practice in computational paralinguis-
tics. IEEE Trans Affect Comput (preliminary version)
Ben-Or M, Goldwasser S, Wigderson A (1988) Completeness theorems for non-cryptographic
fault-tolerant distributed computation. In: 20th Annual ACM Symposium on Theory of Com-
puting, pp 1–10
Bernardo L, Giquel M, Quintas S, Dimas P, Moniz H, Trancoso I (2019) Unbabel Talk - human
verified translations for voice instant messaging. In: Interspeech, pp 3691–3692
Black AW, Zen H, Tokuda K (2007) Statistical parametric speech synthesis. In: IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), vol 4, pp IV-1229–IV-1232
Boufounos P, Rane S (2011) Secure binary embeddings for privacy preserving nearest neighbors.
In: IEEE Workshop on Information Forensics and Security (WIFS), pp 1–6
Brasser F, Frassetto T, Riedhammer K, Sadeghi A-R., Schneider T, Weinert C (2018) VoiceGuard:
secure and private speech processing. In: Interspeech, pp 1303–1307
Casanova E, Shulby C, Gölge E, Müller NM, de Oliveira FS, Candido Jr A, da Silva Soares A,
Aluisio SM, Ponti MA (2021) SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech
model. In: Interspeech, pp 3645–3649
Cohen-Hadria A, Cartwright M, McFee B, Bello JP (2019) Voice anonymization in urban sound
recordings. In: IEEE International Workshop on Machine Learning for Signal Processing
(MLSP), pp 1–6
synthesized, converted and replayed speech. IEEE Trans Biometr Behav Identity Sci 3(2):
252–265
Paillier P (1999) Public-key cryptosystems based on composite degree residuosity classes. In:
Advances in cryptology, volume 1592 of Lecture Notes in Computer Science, pp 223–238
Panayotov V, Chen G, Povey D, Khudanpur S (2015) Librispeech: an ASR corpus based on public
domain audio books. In: IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pp 5206–5210
Park DS, Chan W, Zhang Y, Chiu CC, Zoph B, Cubuk ED, Le QV (2019) SpecAugment: a simple
data augmentation method for automatic speech recognition. In: Interspeech, pp 2613–2617
Pathak M, Portelo J, Raj B, Trancoso I (2012) Privacy-preserving speaker authentication. In:
International Conference on Information Security. Springer, pp 1–22
Pennington J, Socher R, Manning CD (2014) GloVe: Global vectors for word representation. In:
Empirical Methods in Natural Language Processing (EMNLP), pp 1532–1543
Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep
contextualized word representations. ArXiv preprint, 1802.05365
Pironkov G, Dupont S, Dutoit T (2016) Multi-task learning for speech recognition: an overview. In:
ESANN – European Symposium on Artificial Neural Networks, Computational Intelligence and
Machine Learning (ESANN), pp 189–194
Portêlo J, Abad A, Raj B, Trancoso I (2013) Secure binary embeddings of front-end factor analysis
for privacy preserving speaker verification. In: Interspeech, pp 2494–2498
Portêlo J, Raj B, Abad A, Trancoso I (2014) Privacy-preserving speaker verification using garbled
GMMs. In: EUSIPCO, pp 2070–2074
Portêlo J, Abad A, Raj B, Trancoso I (2015) Privacy-preserving query-by-example speech search.
In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp
1797–1801
Qian Y, Soong FK, Yan ZJ (2013) A unified trajectory tiling approach to high quality speech
rendering. IEEE Trans Audio Speech Lang Process 21(2):280–290
Shen J, Pang R, Weiss RJ, Schuster M, Jaitly N, Yang Z, Chen Z, Zhang Y, Wang Y, Skerrv-Ryan
R, Saurous RA, Agiomvrgiannakis Y, Wu Y (2018) Natural TTS synthesis by conditioning
WaveNet on mel spectrogram predictions. In: IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp 4779–4783
Singh R (2019) Profiling humans from their voice. Springer
Sisman B, Yamagishi J, King S, Li H (2021) An overview of voice conversion and its challenges:
from statistical modeling to deep learning. IEEE/ACM Trans Audio Speech Lang Process 29:
132–157
Snyder D, Ghahremani P, Povey D, Garcia-Romero D, Carmiel Y, Khudanpur S (2016) Deep
neural network-based speaker embeddings for end-to-end speaker verification. In: IEEE Spoken
Language Technology Workshop (SLT), pp 165–170
Teixeira F, Abad A, Trancoso I (2018) Patient privacy in paralinguistic tasks. In: Interspeech, pp
3428–3432
Teixeira F, Abad A, Trancoso I (2019) Privacy-preserving paralinguistic tasks. In: International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 6575–6579
Tomashenko N, Srivastava BML, Wang X, Vincent E, Nautsch A, Yamagishi J, Evans N, Patino J,
Bonastre JF, Noé PG, Todisco M (2020) Introducing the VoicePrivacy initiative. In:
Interspeech, pp 1693–1697
van den Oord A, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior
AW, Kavukcuoglu K (2016) WaveNet: a generative model for raw audio. ArXiv preprint,
1609.03499
Vasquez J, Orozco JR, Noeth E (2017) Convolutional neural network to model articulation
impairments in patients with Parkinson’s disease. In: Interspeech, pp 314–318
Yao AC (1986) How to generate and exchange secrets. In: 27th Annual Symposium on Foundations
of Computer Science (SFCS), pp 162–167
Yi Z, Huang WC, Tian X, Yamagishi J, Das RK, Kinnunen T, Ling Z, Toda T (2020) Voice
conversion challenge 2020: intralingual semi-parallel and cross-lingual voice conversion. In:
Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge, pp 909–910
Zhang SX, Gong Y, Yu D (2019) Encrypted speech recognition using deep polynomial networks.
In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp
5691–5695