What About the Data? A Mapping Study on Data Engineering for AI Systems
1 INTRODUCTION
AI systems cannot exist without data [16]. To develop an AI system,
data needs to be collected and prepared to train the AI model, see the
DATA cycle in Figure 1. But also when the model is in production,
Figure 2: The data engineering life cycle [36]

CAIN 2024, April 14–15, 2024, Lisbon, Portugal Heck
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0591-5/24/04
https://doi.org/10.1145/3644815.3644954
Authorized licensed use limited to: Universidade Tecnologica Federal do Parana. Downloaded on November 27,2024 at 14:16:41 UTC from IEEE Xplore. Restrictions apply.

1 https://datacentricai.org/

In the projects we do with industry we see that now AI models have matured and are readily available to apply in practice, most organizations struggle with the data infrastructure to do so. This can
be on a project level: How to retrain this pre-trained model with our own small dataset? How to create synthetic data? How to merge different data sources? How to store and version data? How to automate data pre-processing? But also on organization level: How to integrate all available data from different systems into a central “store”? How to protect sensitive data? How to deal with privacy issues? How to clean data? How to design an organization-wide data architecture that is fit for future AI developments? More and more organizations directly ask for data engineers as the main driver for their AI initiatives [27]. They struggle to find these employees, because software engineers have not been trained on "big data" or data architectures specifically and data scientists do not have the required software engineering skills. Both disciplines rather focus on the more attractive model development part of AI engineering.

Jarrahi et al. [21] also stress the importance of data-centric AI (DCAI). They state that “the nature of ‘data work’ itself is not necessarily new. However, over the years, the actual data work in AI projects comes mostly from individual initiatives, and/or from piecemeal and ad hoc efforts. A lack of attention to data excellence and quality of data has resulted in underwhelming outcomes for AI systems, particularly those deployed in high-stake domains such as medical diagnosis. DCAI magnifies the role of data throughout the AI life cycle and stretches its lifespan beyond the so-called ‘preprocessing step’ in model-centric AI.”

One could argue that to account for true DCAI the AI engineering life cycle should also be extended with undercurrents such as “data architecture” and “DataOps”, analogous to Figure 2. These are data-related activities that are not just relevant for one single AI engineering project, but for the entire organization, or for all projects being executed, or for all AI systems being maintained. This is in fact also what we see in the projects we do with industry. Organizations know how to quickly get data for one machine learning experiment, but not how to set up a data architecture for enterprise-scale AI engineering.

This paper sets out to answer the question “How to do data engineering for AI systems?” by means of a mapping study. For this mapping study we formulated the following research questions:

• RQ1: Which data and AI engineering life cycle phases are covered?
• RQ2: Which technical solutions (tools/frameworks/platforms) for AI data engineering are proposed?
• RQ3: Which architectures for AI data engineering are proposed?
• RQ4: What are lessons learned on AI data engineering?

The mapping study identified 25 papers that explain data engineering activities, tools, frameworks or architectures. By categorizing them and summarizing their solutions and lessons learned, the paper creates an overview of the body of knowledge on data engineering for AI. This overview is useful for practitioners to identify solutions as well as for researchers to identify gaps.

The remainder of the paper is organized as follows. Section 2 provides related work on data engineering for AI systems. Section 3 explains the method used for the mapping study and how the 25 resulting papers were selected. Section 4 classifies the 25 papers according to their meta-data, the type of solution they discuss and the scope of the data engineering activities explained. Section 5 analyzes each of the four subquestions (RQ1 to RQ4). Section 6 discusses the overall research question and implications for practitioners and researchers. The paper concludes with main findings and future work.

2 RELATED WORK

This section describes related work on data engineering within AI engineering research.

Amershi et al. [3] is one of the first AI engineering case studies to appear. The paper includes a machine learning workflow including data-oriented steps. It describes data engineering challenges at Microsoft but does not really explain solutions. The paper shows that already in 2019, data management and data discovery on a project level were a challenge for AI engineers. The case study does not describe data engineering challenges at enterprise level.

Serban et al. [41] identified a set of best practices for AI engineering from existing literature, including five best practices on data:

(1) Use sanity checks for all external data sources
(2) Check that input data is complete, balanced and well distributed
(3) Write reusable scripts for data cleaning and merging
(4) Ensure data labelling is performed in a strictly controlled process
(5) Make data sets available on shared infrastructure (private or public)

More information on these best practices (and how to implement them) might be taken from the papers that list them. Since Serban et al. is meta-research, it is not included in our mapping study.

Several authors [7, 13, 19, 28, 30, 32, 39, 42] describe data challenges for AI systems, without offering explicit solutions. Each of those papers does however stress the importance of data engineering for AI systems. Bosch et al. [7] present DataOps as part of the AI engineering research agenda as “a significant opportunity to reduce ... overhead by generating, distributing and storing data smarter in the development process”. Sambasivan et al. [39] state that “Data quality carries an elevated significance in high-stakes AI due to its heightened downstream impact, impacting predictions like cancer detection, wildlife poaching, and loan allocations. Paradoxically, data is the most under-valued and de-glamorised aspect of AI.” This paper complements previous work with an overview of actionable solutions.

3 PAPER SELECTION

To select papers that answer our research questions on data engineering for AI we used the process described by Kitchenham and Charters [22]:

(1) Define inclusion and exclusion criteria
(2) Design query string
(3) Identify databases and other sources to search
(4) Select relevant papers based on title and abstract
(5) Select relevant papers based on full text
(6) Extend result set based on citations
(7) Classify resulting papers
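The two screening steps in the list above, (4) and (5), can be pictured as a simple filter funnel. The sketch below is illustrative only: the `Paper` record, the keyword tuples and the `screen` helper are hypothetical names introduced here, and the terms are not the inclusion or exclusion criteria used in this study. Full-text screening is simplified to an exclusion check.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical inclusion/exclusion terms, for illustration only;
# they are NOT the criteria used in the mapping study.
INCLUDE_TERMS = ("data engineering", "dataops", "data pipeline")
EXCLUDE_TERMS = ("survey of surveys",)


@dataclass(frozen=True)
class Paper:
    title: str
    abstract: str
    full_text: str = ""


def matches(text: str, terms: tuple[str, ...]) -> bool:
    """Return True if any term occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def screen(papers: list[Paper]) -> list[Paper]:
    """Screen on title/abstract first, then on full text."""
    # Step (4): keep papers whose title or abstract mentions an inclusion term.
    stage1 = [p for p in papers if matches(p.title + " " + p.abstract, INCLUDE_TERMS)]
    # Step (5), simplified: drop papers whose full text hits an exclusion term.
    return [p for p in stage1 if not matches(p.full_text, EXCLUDE_TERMS)]
```

In practice both screening steps involve human judgment; a keyword filter like this could at most pre-sort candidates before manual review.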
As a final step, we coded [38] the resulting papers to answer the research questions. The following paragraphs describe each of the steps in detail. The classification step is described in Section 4.

Table 1: Excluded items
# Items
are described; 2) the technical solutions that are described; 3) the architecture pictures it contains; 4) the lessons learned it contains.

5.1 RQ1: Which Data and AI Engineering Life Cycle Phases Are Covered?

To map the 25 selected papers to life cycle phases, we coded them with the AI engineering life cycle phases from Figure 1 (DATA, ML, DEV and OPS) and the data engineering life cycle phases from Figure 2 (Generation, Ingestion, Transformation, Serving). When the paper does not focus on one or more specific life cycle phase(s), we coded it with “All”. Table 4 shows the division of the papers over the life cycle phases.

Table 4: RQ1: Which life cycle phases are covered?

AI Eng.        Data Eng.        Papers
DATA           Generation       DE18, DE20, DE23, DE25
DATA           Transformation   DE10, DE12
DATA           All              DE6, DE21
DATA+ML        Transformation   DE3
DATA+ML+DEV    All              DE17
DEV            All              DE4
DEV+OPS        All              DE9, DE19
OPS            Serving          DE1, DE2, DE8, DE14, DE22
All            All              DE5, DE7, DE11, DE13, DE15, DE16, DE24

Conclusion. Not surprisingly the majority of papers cover at least the DATA phase of the AI engineering life cycle. But eight of the papers focus more on the DEV and/or OPS part of the AI engineering life cycle. Out of them, five papers specifically focus on data validation in production (the Serving phase of the Data Engineering life cycle).

5.2 RQ2: Which Technical Solutions for AI Data Engineering Are Proposed?

As can be seen in Table 3, seven of the papers discuss technical solutions for AI data engineering. In this section we discuss each of the seven proposed solutions in more detail.

[DE1] Breck et al. [8] present a “data validation system that is designed to detect anomalies specifically in data fed into machine learning pipelines. This system is deployed in production as an integral part of TFX – an end-to-end machine learning platform at Google.” They discuss the challenges they faced in developing the system and the techniques they used to address them, including design choices that were made. They also present three case studies at Google to illustrate the benefits of the data validation system in production.

[DE2] Derakshan et al. [11] propose a “platform for continuously training deployed machine learning models and pipelines that adapts to the changes in the incoming data.” Their platform uses techniques such as proactive training, online statistics computation and dynamic materialization to reduce (re)training and deployment
costs. They include evidence from experiments, with two different machine learning pipelines.

[DE3] Frost et al. [15] present “AI Pro, an open-source framework for data processing with Artificial Intelligence (AI) models.” With AI Pro users can generate a data pipeline from a configuration file through a user-friendly web interface. For advanced users and core developers, there is a command line interface for in-depth operations with finer-grained control. They demonstrate AI Pro with two demo scenarios.

[DE5] Abeykoon et al. [1] developed a “high performance Python API with a C++ core to represent data as a table and provide distributed data operations.” Their PyCylon solution bridges ETL pipelines in Python (as mostly used by data scientists) with high performance compute kernels in C++. They conducted experiments to prove the performance of PyCylon.

[DE14] Foidl et al. [14] collected a catalogue of 36 “data smells” in a multi-vocal literature review and implemented tool support to detect these data smells. They applied the tools to 246 Kaggle datasets to evaluate them. As opposed to the data anomalies detected by the system of [DE1] presented above, these data smells are broader, since smells also include “potential data quality issues”.

[DE23] Jariwala et al. [20] demonstrate the use of the IBM Data Quality for AI Toolkit to check training data in a machine learning setting. They include a workflow for calling the IBM API and show the results of several included data quality metrics on open source datasets.

[DE25] Sabet et al. [37] introduce “a scalable Aerial Synthetic Data Augmentation (ASDA) framework tailored to aerial autonomy applications.” They demonstrate the ASDA framework by generating data for landing pad detection in the Seattle simulation scene. Although this is a very specific technical solution, the usage of synthetic datasets is of course not limited to aerial autonomy applications.

Conclusion. The selected papers contain a diverse set of technical solutions, ranging from synthetic data generation (DE25), through data validation tools (DE1, DE14, DE23), through data processing frameworks (DE3, DE5), to deployment platforms (DE2). In that way the solutions presented together cover the complete AI engineering and data engineering life cycles, although most solutions focus on one single life cycle phase (see Table 4). The exception is [DE5], which does not cover one specific life cycle phase and is the only paper that covers the DEV phase of AI engineering.

5.3 RQ3: Which Architectures for AI Data Engineering Are Proposed?

As can be seen in Table 3, nine of the papers discuss architectures for AI data engineering. In this section we discuss each of the nine architectures in more detail.

[DE4] Yokoyama [49] proposes a multi-layer architectural pattern for machine learning systems that separates the business logic from the inference engine and data processing. Furthermore, it separates the user interface from the data collection and the data lake from the database. They demonstrate their architectural pattern by designing a chatbot system.

[DE7] Gröger [16] calls for a data ecosystem for industrial enterprises, see Figure 4. That ecosystem contains a specific role for data engineers and data engineering as part of the data democratization challenge: “making all kinds of data available for AI for all kinds of end users across the entire enterprise”. Gröger suggests addressing this challenge with an enterprise data catalog that provides comprehensive metadata management across all data lakes and other data sources. This would enable self-service use of data.

[DE10] Thirumuruganathan et al. [45] present a reference architecture for automated annotations of data. They describe the key components of this system architecture. Implementing a proof-of-concept remains future work.

[DE17] Paleyes et al. [33] propose “Flow-Based Programming as a paradigm for creating Data Oriented Architecture (DOA) applications.” They compared the flow-based programming (FBP) paradigm to the Service-Oriented Paradigm (SOA) by implementing four data-driven applications in both paradigms and measuring evolution of the codebase through pre-defined metrics.

[DE19] Sen et al. [40] devised a “de-centralized edge-to-cloud architecture” with machine learning pipelines for erroneous data repair and detection of deviations in sensor data. They analyze their proposed architecture in two different industrial case studies.

[DE20] Wang et al. [46] implement a hierarchical dataset with unified annotation rules. They use one example scenario to compare a hierarchical dataset created from three single datasets with the original single-source dataset.

[DE21] Warnett and Zdun [47] list architectural design decisions (ADDs) for the machine learning workflow from a grey literature study. Their replication package contains an ADD model with UML diagrams of all ADDs and their relations.

[DE22] Azimi and Pahl [6] present a layered Data-as-a-Service (DaaS) quality management architecture. Their framework focuses on input data quality and links it to machine learned data service quality. They demonstrate their framework with a traffic management use case.

[DE24] Kreuzberger et al. [23] depict an “end-to-end MLOps architecture and workflow with functional components and roles”. The workflow contains a separate data engineering zone and the data(Ops) engineer is a separate role (apart from e.g., software engineer, data scientist or even ML engineer). According to them a data engineer “builds up and manages data and feature engineering pipelines” and “ensures proper data ingestion to the databases of the feature store system”. This indicates that their architecture/workflow has the scope of one single ML project.

Conclusion. Most architectures presented focus on (parts of a) system architecture (DE4, DE10, DE17, DE20, DE22) or the ML pipeline (DE21, DE24). Only papers DE7 and DE19 contain a diagram for enterprise-wide data architectures, of which DE19 focuses on IoT data only. The data ecosystem from DE7 contains a comprehensive overview of this enterprise data landscape, including IoT data sources, see Figure 4.

5.4 RQ4: What are Lessons Learned on AI Data Engineering?

As can be seen in Table 3, nine of the papers discuss case studies or best practices for AI data engineering. In this section we discuss each of the nine papers in more detail.
Out of all papers, DE12 takes the most narrow view on data engineering. Chattopadhyaya et al. [9] write about data engineering decisions for AI-based applications. They use this rather umbrella term to indicate the decision on how to convert a timestamp dataset into an interval-based dataset: take the mean, the mode, the maximum or the minimum for that interval.

Only 4 out of 25 papers actually define what they mean by data engineering. Paper DE5 by Abeykoon et al. [1] defines data engineering as “The complex process of transforming raw data to a form suitable for analytics”. Paper DE6 by Raj et al. [29] has a rather fuzzy definition of data engineering as a step that “performs two different operations at the high level, which include data collection and data ingestion”. Paper DE7 by Gröger [16] defines data engineering as “modelling, integrating and cleansing of data.” Paper DE13 by Cheng and Long [10] says “the raw data in each entity is extracted, transformed, and prepared for model training”. These definitions are all also much more narrow than the definitions from Andrew Ng and Reis and Housley as given in the introduction.

To get a complete picture of data engineering for AI systems, one should combine the life cycles in Figure 1 and Figure 2. In that way, one has both a picture of how AI engineering connects to data as well as how data engineering connects to AI. We did not find any comprehensive work that already does this and only three papers that take a similar enterprise-level view on data engineering (DE6, DE7 and DE19).

6.3 Implications for Practitioners

The mapping of the selected papers to life cycle phases (RQ1, see section 5.1) and type of solutions they provide (RQ2 to RQ4, see sections 5.2 to 5.4) provides guidance to practitioners on which solutions to select for which project or activity. Running our mapping study, we also had the following observations on AI data engineering that could be useful for practitioners.

Big Data. During snowballing, we excluded a number of papers on data engineering for Big Data that did not have an explicit reference to AI systems. However, those papers might contain valuable solutions for both researchers and practitioners that also hold for AI systems, as these systems are mostly Big Data systems as well. That kind of analysis was out of scope for this paper and might be a topic for a future mapping study: “How to do data engineering for Big Data?”

Data quality. A number of papers relate to data quality or data validation. There might be an interesting body of knowledge (and tools) on those topics that was out of scope for our mapping study. Zhang et al. [50] list several methods for data testing in their survey on machine learning testing. The concept of data smells [14, 43] is also important to consider in this context as these indicate data quality issues that might lead to machine learning problems, different from data errors (see DE14 [14]).

Grey literature. In our mapping study we included only peer-reviewed papers. However, we found a number of other interesting resources: 1) the original post of Figure 1 [12] that already defined the DATA cycle as in fact being a DataOps process, 2) the blogs and whitepapers on DataOps referenced in DE6 [29], 3) the book “Fundamentals of Data Engineering” [36] from which we borrowed Figure 2, 4) other books such as “Data Fabric and Data Mesh Approaches with AI” [18] and “Practical DataOps: Delivering agile data science at scale” [5]. This means that practitioners should definitely consider sources from grey literature on DataOps and modern data architectures (such as data meshes and data fabrics).

Open source tooling. The open source tooling landscape for data is becoming bigger and bigger, see also the online blog post “The State of Data Engineering” (see footnote 3). These kinds of tools are necessary to achieve higher levels of DataOps, see DE6 [29].

Data spaces. Some AI data engineering projects require data sharing between different organizations or entities. To support federated learning and data sovereignty several technical solutions such as Gaia-X, FIWARE and the International Data Space have been built up. Paper DE11 by Altendeitering et al. [2] investigates how to integrate such solutions with AI pipelines, but we recommend also keeping an eye on the evolution of the data space solutions, since they are fairly new.

Domain-specific data engineering. Two domains that stood out in our mapping study are data engineering for Internet-of-Things (IoT) and data engineering for automotive. There might be more literature or guidance on AI data engineering if one dives into a specific domain.

Synthetic data. Paper DE25 by Sabet et al. [37] describes synthetic data generation. That is a topic that might not be relevant for all AI engineering projects, but if it is, we would like to point out that there is a whole body of knowledge (and tools) about synthetic data generation specifically that can be looked into by practitioners.

6.4 Implications for Researchers

The mapping of the selected papers to life cycle phases (RQ1, see section 5.1) and type of solutions they provide (RQ2 to RQ4, see sections 5.2 to 5.4) provides guidance to researchers to see what is already there and what is still missing. Running our mapping study, we also had the following observations on AI data engineering that could be useful for researchers.

Data engineering. Tebernum et al. [44] developed a “data engineering reference model (DERM) which outlines the important building blocks for handling data along the data lifecycle.” They aim to bridge between data engineers and software engineers by providing a common ground for engineering data-intensive applications. They view data engineering as a sub-discipline of data science (“preparing data for data scientists”). The AI engineering research community could benefit from integrating with the data science research community on the data engineering topic. The DERM presented by Tebernum et al. could serve as a common ground also for this purpose. Tebernum et al. also show ample opportunities for future data engineering research.

DataOps. Paper DE6 by Raj et al. [29] points to DataOps as an overall process to automate and orchestrate data life cycle stages. The large number of references to grey literature they used indicates that DataOps is not receiving enough attention in research yet. In line with the evolution from DevOps to MLOps, there is also a

3 https://lakefs.io/blog/the-state-of-data-engineering-2023/
[13] Lukas Fischer, Lisa Ehrlinger, Verena Geist, Rudolf Ramler, Florian Sobiezky, Werner Zellinger, David Brunner, Mohit Kumar, and Bernhard Moser. 2020. AI system engineering—key challenges and lessons learned. Machine Learning and Knowledge Extraction 3, 1 (2020), 56–83.
[14] Harald Foidl, Michael Felderer, and Rudolf Ramler. 2022. Data smells: categories, causes and consequences, and detection of suspicious data in AI-based systems. In Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI. 229–239.
[15] Richie Frost, Debjyoti Paul, and Feifei Li. 2019. AI pro: Data processing framework for AI models. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 1980–1983.
[16] Christoph Gröger. 2021. There is no AI without data. Commun. ACM 64, 11 (2021), 98–108.
[17] Constanze Hasterok and Janina Stompe. 2022. PAISE®–process model for AI systems engineering. at-Automatisierungstechnik 70, 9 (2022), 777–786.
[18] Eberhard Hechler, Maryela Weihrauch, and Yan Wu. 2023. Data Fabric and Data Mesh for the AI Lifecycle. In Data Fabric and Data Mesh Approaches with AI: A Guide to AI-based Data Cataloging, Governance, Integration, Orchestration, and Consumption. Springer, 195–228.
[19] Hans-Martin Heyn, Eric Knauss, and Patrizio Pelliccione. 2023. A compositional approach to creating architecture frameworks with an application to distributed AI systems. Journal of Systems and Software 198 (2023), 111604.
[20] Ankur Jariwala, Aayushi Chaudhari, Chintan Bhatt, and Dac-Nhuong Le. 2022. Data Quality for AI Tool: Exploratory Data Analysis on IBM API. International Journal of Intelligent Systems and Applications 14, 1 (2022), 42.
[21] Mohammad Hossein Jarrahi, Ali Memariani, and Shion Guha. 2023. The Principles of Data-Centric AI. Commun. ACM 66, 8 (jul 2023), 84–92. https://doi.org/10.1145/3571724
[22] B. Kitchenham and S Charters. 2007. Guidelines for performing systematic literature reviews in software engineering. Technical Report. EBSE-2007-01.
[23] Dominik Kreuzberger, Niklas Kühl, and Sebastian Hirschl. 2023. Machine learning operations (mlops): Overview, definition, and architecture. IEEE Access (2023).
[24] Lucy Ellen Lwakatare, Ivica Crnkovic, and Jan Bosch. 2020. DevOps for AI–Challenges in Development of AI-enabled Applications. In 2020 International Conference on Software, Telecommunications and Computer Networks (SoftCOM). IEEE, 1–6.
[25] Lucy Ellen Lwakatare, Ellinor Rånge, Ivica Crnkovic, and Jan Bosch. 2021. On the Experiences of Adopting Automated Data Validation in an Industrial Machine Learning Project. In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). 248–257. https://doi.org/10.1109/ICSE-SEIP52600.2021.00034
[26] Juliette Mattioli, Gabriel Pedroza, Souhaiel Khalfaoui, and Bertrand Leroy. 2022. Combining Data-Driven and Knowledge-Based AI Paradigms for Engineering AI-Based Safety-Critical Systems. In Workshop on Artificial Intelligence Safety (SafeAI).
[27] Marcel Meesters, Petra Heck, and Alexander Serebrenik. 2022. What is an AI engineer? An empirical analysis of job ads in The Netherlands. In Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI. 136–144.
[28] Aiswarya Raj Munappy, Jan Bosch, Helena Holmström Olsson, Anders Arpteg, and Björn Brinne. 2022. Data management for production quality deep learning models: Challenges and solutions. Journal of Systems and Software 191, 111359.
[29] Aiswarya Raj Munappy, David Issa Mattos, Jan Bosch, Helena Holmström Olsson, and Anas Dakkak. 2020. From ad-hoc data analytics to dataops. In Proceedings of the International Conference on Software and System Processes. 165–174.
[30] N. Nahar, H. Zhang, G. Lewis, S. Zhou, and C. Kastner. 2023. A Meta-Summary of Challenges in Building Products with ML Components – Collecting Experiences from 4758+ Practitioners. In 2023 IEEE/ACM 2nd International Conference on AI Engineering – Software Engineering for AI (CAIN). IEEE Computer Society, Los Alamitos, CA, USA, 171–183. https://doi.org/10.1109/CAIN58948.2023.00034
[31] Pia Niemelä, Bilhanan Silverajan, Mikko Nurminen, Jenni Hukkanen, and Hannu-Matti Järvinen. 2022. LAOps: Learning Analytics with Privacy-aware MLOps. In International Conference on Computer Supported Education, CSEDU. Science and Technology Publications (SciTePress), 213–220.
[32] Ipek Ozkaya. 2020. What is really different in engineering AI-enabled systems? IEEE software 37, 4 (2020), 3–6.
[33] Andrei Paleyes, Christian Cabrera, and Neil D Lawrence. 2022. An empirical evaluation of flow based programming in the machine learning deployment context. In Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI. 54–64.
[34] Patrick Petersen, Hanno Stage, Jacob Langner, Lennart Ries, Philipp Rigoll, Carl Philipp Hohl, and Eric Sax. 2022. Towards a Data Engineering Process in Data-Driven Systems Engineering. In 2022 IEEE International Symposium on Systems Engineering (ISSE). IEEE, 1–8.
[35] M Aiswarya Raj, Jan Bosch, Helena Holmström Olsson, and Anders Jansson. 2021. On the Impact of ML use cases on Industrial Data Pipelines. In 2021 28th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 463–472.
[36] Joe Reis and Matt Housley. 2022. Fundamentals of Data Engineering. O’Reilly.
[37] Mehrnaz Sabet, Praveen Palanisamy, and Sakshi Mishra. 2023. Scalable modular synthetic data generation for advancing aerial autonomy. Robotics and Autonomous Systems 166 (2023), 104464.
[38] Johnny Saldaña. 2011. The Coding Manual for Qualitative Researchers (2nd ed.). SAGE Publications Inc. 329 pages.
[39] Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M Aroyo. 2021. “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–15.
[40] Sagar Sen, Erik Johannes Husom, Arda Goknil, Simeon Tverdal, Phu Nguyen, and Iker Mancisidor. 2022. Taming data quality in AI-enabled industrial internet of things. IEEE Software 39, 6 (2022), 35–42.
[41] Alex Serban, Koen van der Blom, Holger Hoos, and Joost Visser. 2020. Adoption and effects of software engineering best practices in machine learning. In Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 1–12.
[42] Karthik Shivashankar and Antonio Martini. 2022. Maintainability Challenges in ML: A Systematic Literature Review. In 2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). IEEE, 60–67.
[43] Arumoy Shome, Luis Cruz, and Arie Van Deursen. 2022. Data smells in public datasets. In Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI. 205–216.
[44] Daniel Tebernum, Marcel Altendeitering, and Falk Howar. 2021. DERM: A Reference Model for Data Engineering.
[45] Saravanan Thirumuruganathan, Mayuresh Kunjir, Mourad Ouzzani, and Sanjay Chawla. 2021. Automated Annotations for AI Data and Model Transparency. J. Data and Information Quality 14, 1, Article 2 (dec 2021), 9 pages. https://doi.org/10.1145/3460000
[46] Yue Wang, Long Lin, He Yang, Cuncun Shi, and Weijiang Lu. 2022. The Construction Techniques of Artificial Intelligence Hierarchical Dataset in Power Industry. In 2022 IEEE 6th Information Technology and Mechatronics Engineering Conference (ITOEC), Vol. 6. IEEE, 320–325.
[47] Stephen John Warnett and Uwe Zdun. 2022. Architectural design decisions for the machine learning workflow. Computer 55, 3 (2022), 40–51.
[48] Claes Wohlin. 2014. Guidelines for Snowballing in Systematic Literature Studies and a Replication in Software Engineering. In Proceedings of the International Conference on Evaluation and Assessment in Software Engineering (EASE). London (UK), Article 38, 10 pages.
[49] Haruki Yokoyama. 2019. Machine learning system architectural pattern for improving operational stability. In 2019 IEEE International Conference on Software Architecture Companion (ICSA-C). IEEE, 267–274.
[50] Jie M Zhang, Mark Harman, Lei Ma, and Yang Liu. 2020. Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering (2020).