ChatGPT's Role in Software Modeling
EXPERT VOICE
Received: 15 March 2023 / Revised: 9 April 2023 / Accepted: 14 April 2023 / Published online: 22 May 2023
© The Author(s) 2023
Abstract
Most experts agree that large language models (LLMs), such as those used by Copilot and ChatGPT, are expected to revolutionize the way in which software is developed. Many papers are currently devoted to analyzing the potential advantages and limitations of these generative AI models for writing code. However, the analysis of the current state of LLMs with respect to software modeling has received little attention. In this paper, we investigate the current capabilities of ChatGPT to perform modeling tasks and to assist modelers, while also trying to identify its main shortcomings. Our findings show that, in contrast to code generation, the performance of the current version of ChatGPT for software modeling is limited, with various syntactic and semantic deficiencies, lack of consistency in responses, and scalability issues. We also outline our views on how we perceive the role that LLMs can play in the software modeling discipline in the short term, and how the modeling community can help to improve the current capabilities of ChatGPT and the coming LLMs for software modeling.

Keywords Large language models · ChatGPT · Software models · Modeling languages · UML
Communicated by Bernhard Rumpe.

Javier Cámara (jcamara@[Link]) · Javier Troya (jtroya@[Link]) · Lola Burgueño (lolaburgueno@[Link]) · Antonio Vallecillo (av@[Link])
ITIS Software, Universidad de Málaga, ETSI Informática, Campus de Teatinos, Bulevar Louis Pasteur 35, 29071 Málaga, Spain

1 Introduction

The emergence of generative AI and large language models (LLMs), such as those used by GitHub's Copilot [9] and OpenAI's ChatGPT [14], is causing quite a stir in the Computer Science community. Most experts foresee a major disruption in the way software is developed, and software engineering education is also expected to drastically change with the advent of these LLMs [12]. These issues are a recurrent topic in many universities and are being covered by most specialized forums and blogs. A plethora of papers are now analyzing the potential advantages, limitations and failures of these models for writing code [3], as well as how programmers interact with them [2, 19]. Most studies seem to agree that LLMs do an excellent job in writing code: despite some minor syntactical errors, what they produce is essentially correct.

However, what about software modeling? What is the situation of LLMs when it comes to performing modeling tasks or assisting modelers to accomplish them? A few months ago we started looking at these issues, trying to investigate the current status of LLMs with respect to conceptual modeling, a topic that does not seem to have attracted much attention so far. Our premise is that LLMs are here to stay. So, instead of ignoring them or rejecting their use, we posit that it would be better to embrace and use them in an effective manner to help us perform modeling tasks.

We are aware that the current LLM situation is very volatile, with new models, versions and tools being released frequently, each one improving over the previous ones. However, our goal is to assess the current situation and to provide a set of experiments that can enable us to identify possible shortcomings of current tools for performing modeling
contrast, the new Bing search engine² allows the non-expert user to set only a few hyperparameters, but not all. For this, Bing has modified how the hyperparameterization of the LLM is done and allows the user to choose the conversation style with three options: "more creative," "more balanced" and "more precise," instead of asking them to select a value (i.e., the so-called temperature value) within a given interval, usually a real number between 0 and 1.

² [Link]

2.2 ChatGPT

ChatGPT is a tool developed by OpenAI, a for-profit research organization co-founded by Elon Musk and Sam Altman, strongly funded by Microsoft. The users interact with ChatGPT in a conversational way via text prompts.

When asked about its modeling knowledge, ChatGPT reports that it knows most UML diagrams, including Class diagrams, Use cases, State machines, Sequence diagrams and Activity diagrams.

Regarding the UML notations ChatGPT can handle, being a language model, it cannot generate models in graphical form. ChatGPT produces models in textual UML notations, including PlantUML, USE (the UML-based Specification Environment), Yuml, Markdown UML, Mermaid and UMLet. It also produces some rudimentary class diagrams using plain characters to draw boxes and lines, but sometimes these are difficult to parse and understand. Figure 1 shows one example of these textual diagrams.

We discovered that ChatGPT can also handle Ecore models. You can ask it to generate models in Ecore and also use them as inputs for prompts. Its treatment of the Ecore language is comparable to that of other modeling languages, with similar mistakes and correct answers.

We also asked ChatGPT about other textual languages that it knows, which are used in UML for representing different aspects of software systems. It mentioned the Object Constraint Language (OCL), the Action Language for Foundational UML (ALF), the UML Profile Definition Language (UML PDL) and the UML Testing Profile (UTP). We checked in depth its skills with OCL, which are excellent, but in contrast, the initial tests with the other notations did not yield satisfactory results.

2.3 Research questions

As mentioned in the introduction, our primary goal was to analyze the use of ChatGPT as an assistant tool for conceptual modeling. In line with this, we address the following Research Questions:

RQ1. Does ChatGPT generate syntactically correct UML models?
RQ2. Does ChatGPT generate semantically correct models, i.e., semantically aligned with the user intents?
RQ3. How sensitive is ChatGPT to the context and to the problem domain?
RQ4. How large are the models that ChatGPT is able to generate or handle?
RQ5. Which modeling concepts and mechanisms is ChatGPT able to effectively use?
RQ6. Does prompt variability impact the correctness/quality of the generated models?
RQ7. Do different use strategies (e.g., prompt partitioning) result in different outcomes?
RQ8. How sensitive is ChatGPT to the UML notation used to represent the output models?

To answer these research questions, we devised a set of experiments, which are detailed in the next section.

3 Experiments

This section describes the experiments we conducted to understand the current capabilities of ChatGPT to perform modeling tasks. We defined two phases. In the first one, we carried out some exploratory experiments to gain a basic understanding of how ChatGPT works with software models, as well as its main features and limitations. The experiments in the second phase were more systematic and aimed to further characterize ChatGPT's modeling capabilities. The results of these experiments are presented and discussed later in Sect. 4.

3.1 First phase: exploration

Objective In this exploratory phase, the four authors of this paper interacted individually with ChatGPT to become acquainted with its modeling capabilities. We also explored some of its general characteristics. Since we are not able to set hyperparameters such as the number of tokens, we explored the size of the models it was able to handle. We also explored its skills with various modeling notations, which depend on the training data.

Method For this phase, we did not use any systematic approach but tried to explore all the ideas that came to mind based on the findings we were making and the results we were obtaining.

Materials We wrote prompts asking ChatGPT to create models of different sizes, as well as to create the target models of some of the assignments that we use in our modeling lectures. The size of these models ranged from 10 to 40 classes and associations. We wrote all our interactions and findings in a shared document used as a logbook [1].

First findings We became aware of several basic capabilities and limitations of ChatGPT. Some of them were not surprising, given how language models work, but they are still worth reporting here.

F1. Problem domain and semantics The problem domain is important for ChatGPT. In general, it works poorly when the names of the entities to be modeled have no meaning, such as X, Y, Z, or A, B, C. The more meaningful and representative entity names are, the better the class model it produces. Similarly, the more ChatGPT "knows" about the domain, the more accurate and complete the UML model it generates. Purchase Orders, Banks or Employees are concepts for which it is able to produce semantically rich models (too rich sometimes, as it completes them with information that was not requested).

F2. Problem domain and syntax The problem domain also seems to influence the structure and contents of the resulting models, as well as their level of abstraction. In some domains, the models generated had a very low level of abstraction, quite close to a software program represented in UML. In others, the level of abstraction was higher, although it heavily depended on the particular conversation. As we know, LLMs have semantic and syntactic capabilities. When mixing these two abilities to produce class models, depending on the concrete domain (and thus the amount of data about that domain in the training dataset), ChatGPT seems to rely on its translation capabilities. Sometimes, given our prompt, ChatGPT's outputs seem to be the UML representation of a possible solution that it found/produced in a different language, i.e., with a different syntax. If this other language is a low-level language such as Java or C++, the abstraction level is lower than if it finds a solution represented as a software model such as a relational schema. In other words, the problem domain influences the result, as the latter depends on the data with which ChatGPT has been trained for that domain.

F3. Publicly available models Related to the previous point, if you ask ChatGPT to build a UML model that is on the Internet (such as the example given in the OCL 2.4 standard), ChatGPT will generate a correct model. OpenAI has not disclosed what data was used to train ChatGPT or how the training process was conducted, but it looks like these publicly available models have served as training models for ChatGPT.

F4. Size of the models to build The current version of ChatGPT does not work well when asked to generate a class model of more than 8–10 classes from scratch. However, it works much better if you ask it to build a small initial model and progressively add information to it. In fact, ChatGPT was unable to cope with any of the exams of our modeling course, because these UML models were too large (more than 20 classes and associations) for its current capabilities or hyperparameterization, and it either did not finish the task (which had to be aborted) or built rather small and incomplete models.

F5. Notations We also experimented with various notations to represent the generated UML model. By default, ChatGPT seems to use a diagrammatic notation that employs
Table 1 Coverage by the selected examples of the main modeling concepts and mechanisms
Concept/Mechanism Students Airlines File system Robots Video club Theaters Amphibious Cars
Enumerations X X X
Classes X X X X X X X X
Attributes X X X X X X X X
Operations X
Generalization X X X X
Association X X X X X X X
Aggregation X X X X
Composition X
Assoc. class X X X
Multiple inheritance X
Abstract classes X X X
OCL constraints X X
Roles (as assoc. ends)
Roles (as inherited classes) X
Roles (as entity types) [5]
Materialization [15] X X
characters to draw boxes and lines on the screen. This notation is too difficult to read and understand when there are more than four or five classes in the model, so we started to explicitly ask ChatGPT to produce models in specific notations, such as PlantUML or USE. Apart from small syntactic errors, the results are generally good; we cannot say the same for the semantics of the generated models, which were full of errors, as we shall later see.

F6. Conversation history Although there is a limit to the amount of information ChatGPT can retain, it is able to "remember" what was said earlier in a conversation. That is, ChatGPT is conversation-aware and results are heavily conversation-dependent.³ Depending on the session, and on our previous interactions, the results may present remarkable variations. In fact, when asked to build a model, ChatGPT takes information from previously developed models within the same conversation, even if they have nothing to do with the model in question. This is why it is important to start a new chat every time we want to develop a new model. One exercise we did was to ask ChatGPT to generate a UML model in three different chats using the same prompt. In two of them, we had been previously creating models from other domains, and the third chat restarted afresh. The results generated in the first two conversations were very similar to the previously generated models, despite the fact that the new model was from a different domain. The results of the same prompt in the new chat were closer to the desired target.

³ OpenAI states that, when replying to a prompt, ChatGPT does not access previous conversations.

F7. Cross-language translation facilities When testing the translation facilities across modeling languages, the results are conversation-dependent. For example, we gave ChatGPT a model in USE with association classes and asked it to represent the model in PlantUML. The result was not correct, because ChatGPT does not seem to know how to handle association classes. Now, given that same PlantUML model, if asked to convert it to USE, depending on whether it is within the same conversation or in a different one, sometimes ChatGPT converts it to the original USE model (even with association classes) or to a different model (this time with syntactic errors in USE). Interestingly, this does not seem to be specific only to modeling, but also to translation between other languages, even natural ones.

F8. Integrity constraints When the description of the model to be represented includes integrity constraints (which we would expect to be specified by means of OCL expressions), what ChatGPT usually does for each constraint is either to create a note or to define an operation that checks the constraint on the class that would correspond to the context of the OCL expression. We soon learned that if what we want to represent are the integrity constraints of a UML class model using OCL, it is better to develop the model without constraints and then explicitly ask ChatGPT to generate the constraints in OCL, one by one. ChatGPT works significantly better with OCL than with UML. We suspect that this is possibly due to the fact that the data sources used for the construction of OCL expressions are usually SQL, Rust and other declarative languages for which there is a much larger corpus than for UML.
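To illustrate Finding F8, the following is a hand-written sketch (not actual ChatGPT output; the class and invariant names are ours) of how a separately requested OCL constraint may end up represented, attached to the class as a PlantUML note:

```plantuml
@startuml
' Hypothetical class from a small rental-domain model
class Movie {
  name : String
  year : Integer
}
' The requested OCL invariant is rendered as a note on its context class
note right of Movie
  context Movie
  inv ValidYear: self.year >= 1888
end note
@enduml
```

Requesting each invariant separately like this, once the structural model is complete, corresponds to the strategy described above.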
Fig. 2 Prompt used to ask ChatGPT to generate a UML class diagram of a video club system, and the resulting model
Fig. 3 Another model generated by ChatGPT in response to exactly the same prompt, but in a different session
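Since the figures are not reproduced in this text-only version, the following hand-written PlantUML sketch gives an idea of the kind of small video-club class model involved in this exercise (the class and attribute names are illustrative, not ChatGPT's actual output):

```plantuml
@startuml
class Member {
  name : String
}
class Movie {
  name : String
}
class Rental {
  startDate : Date
}
' A member makes many rentals; each rental concerns one movie
Member "1" -- "*" Rental
Rental "*" -- "1" Movie
@enduml
```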
3.2 Second phase: focused experiments

Objective In the first phase, we managed to obtain a basic understanding of how ChatGPT works, as well as of its main features and limitations. We also obtained initial responses to some of the research questions, namely those about its sensitivity to context and problem domain (RQ3, addressed by findings F1, F2, F3 and F6), its scalability (RQ4, addressed by finding F4) and partly about its sensitivity to the modeling notation of choice (RQ8, addressed by findings F5, F7 and F8). The goal of this second phase was to address the rest of the research questions, which demanded a more systematic approach.

Method For this phase, we developed a set of models that were intended to cover the most important modeling concepts and mechanisms (see left column of Table 1). Each author independently proposed ten UML models. All of them were small in size (three to six classes) so that ChatGPT could handle them without problems. They represented different user intents, and for each one of them, the exercise consisted in asking ChatGPT to produce the corresponding UML model using one or more prompts.

Figure 2 shows one of these exercises (Video club). The prompt used to generate the UML class diagram is shown on the left, and the ChatGPT response (in PlantUML) is shown on the right. For readability purposes, we have included the graphical representation of the PlantUML model in the central box. On this occasion, ChatGPT managed to generate the intended model after a few interactions, so the exercise was considered successful. However, to illustrate the variability of ChatGPT's responses, Fig. 3 shows another model generated by ChatGPT in response to exactly the same prompt, but from a different conversation. (Both were fresh conversations.) Although there are deterministic language models, most modern LLMs (such as ChatGPT) are designed to be probabilistic, not deterministic. This lack of repeatability of the results represents a major obstacle to the reproducibility of the experiments and is one of the main current challenges of these assistants from our point of view.

Even if prompts were carefully designed, very often ChatGPT did not generate the expected result. To improve the result, we always tried to follow a conversation with the bot by providing multiple successive prompts in which we asked to modify some aspect of the generated result. For example, if ChatGPT generates a class Movie that does not contain an attribute name, we can tell ChatGPT that movies must have a name. The same can be done to add the multiplicities and role names of the associations, remove unwanted methods or fix incorrect details (such as using compositions when there is a multiplicity 1..* in the composite end). As we will mention later, ChatGPT does not always fix or add what we ask for, such as repairing the multiplicity of an association. When it does, ChatGPT sometimes introduces additional errors in other parts of the model.

From the complete set of 40 exercises, we selected two from each author. The resulting eight models covered the concepts and mechanisms listed in Table 1. Their intent models are shown in Fig. 4.

Each author tried to make ChatGPT generate these UML intent models as faithfully as possible, using different strategies to create the prompts. A summary of the results of this experiment is shown in Table 2. The columns list the exercise, the number of authors that could make ChatGPT successfully generate the intended model, the average number of sessions that were used and the average number of prompts that were required per session until the solution was generated or the author gave up. Reasons for restarting a new chat or giving up included: (1) ChatGPT entering an endless loop, e.g., saying "Sure, I will fix it" but repeating the previous response, and (2) class diagrams that accumulated an increasing number of errors despite our indications to fix them, or diagrams that were not worth fixing.

Table 2 Results of the experiment where the four authors tried to make ChatGPT generate the intent models of the selected exercises

Exercise      Successful   Avg. sessions   Prompts/Sess.
Students      4/4          2.5             2.5
Airlines      0/4          3               2.75
File system   4/4          2               2.25
Robots        0/4          3               3.5
Video club    4/4          2               2.3
Theaters      0/4          3               3
Amphibious    4/4          2.2             1.75
Car parts     4/4          2               2.3

Materials The complete set of UML models of the 40 exercises is available from our GitHub repository [1], as well as the reports that each author produced during their interactions with ChatGPT.

Findings The exercises of this phase revealed some very interesting findings, which are summarized below.
could be correctly produced, the total number of interactions with ChatGPT (counting the prompts of all sessions until the model was correct) exceeded the number of model elements.

4 Analysis

After carrying out the experiments and analyzing our experience with ChatGPT, this section is dedicated to answering the research questions identified in Sect. 2.3.

RQ1. Does ChatGPT generate syntactically correct UML models?

The UML models produced by ChatGPT are generally correct, although they may contain small syntactic errors (see finding F5). This also depends on the notation used. Although we did not test it thoroughly, the level of syntactic correctness of the models produced in PlantUML was much higher than those generated in USE, for example.

RQ2. Does ChatGPT generate semantically correct models, i.e., semantically aligned with the user's intent?

This is the weakest point that we observed during our interaction with ChatGPT. Some studies suggest that LLMs are better at syntax than producing semantically correct results [11]. Our findings (e.g., F13) corroborate this fact. This includes errors in both the semantics of the language and the semantics of the domain being modeled. On many occasions, we observed that ChatGPT proposed seemingly random models that made no sense from either a modeling or domain standpoint.

RQ3. How sensitive is ChatGPT to the context and to the problem domain?

Our findings F1, F2, F3 and F6 clearly show that not only the problem domain influences the resulting models, but also the information exchanged during the dialogues with ChatGPT. In addition, the more ChatGPT "knows" about a domain (i.e., the more data about a domain was used during training), the closer-to-correct class models it produces. ChatGPT produces its worst results when it has little or no information about the domain or the entities to be modeled, as it happened when asked to produce software models of entities such as Snarks or Zumbats, for which it did not seem to have any reference or semantic anchor.

RQ4. How large are the models that ChatGPT is able to generate or handle?

As mentioned in Finding F4, ChatGPT currently has strict limitations on the size of the models it can handle. It has serious problems with models larger than 10–12 classes. Even the time and effort required to produce smaller models (Finding F19) are not insignificant.

RQ5. Which modeling concepts and mechanisms is ChatGPT able to effectively use?

The modeling concepts that we analyzed are shown in Table 1. There is a high degree of variability in how ChatGPT handles them. We observed that it is able to manage reasonably well (with some exceptions) associations, aggregations and compositions, simple inheritance and role names of association ends (F9). However, it requires explicit indications for using enumerations (F14), multiple inheritance (F15) and integrity constraints (F16). Finally, we found out that its results are not acceptable when using abstraction (F17), and it cannot handle association classes (F13).

RQ6. Does prompt variability impact the correctness/quality of the generated models?

We observed that there is plenty of variability when ChatGPT generates responses to the same prompt (F10). We learned that it is useful to start a new conversation from scratch when the results were not good, in order to find better solutions for the same intent model (F12).

RQ7. Do different use strategies (e.g., prompt partitioning) result in different outcomes?

First, as noted in finding F4, the size of the models that ChatGPT is capable of handling in a single query forces the modeling task to become an iterative process in which the user starts with a small model and progressively adds details to it (F12). The variability and randomness of ChatGPT responses (F10), or results that start to diverge within a conversation, often force the modeler to repeat conversations to try to obtain better models.

RQ8. How sensitive is ChatGPT to the UML notation used to represent the output models?

ChatGPT is capable of representing models with several notations (F5), although in general it makes fewer syntactic mistakes with PlantUML. It is also much better with OCL than with UML (F8). Finally, we also looked at how accurate ChatGPT was with cross-modeling language translation (F7), realizing that this task works better within the same conversation, but not across conversations.
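As an aside, association classes, the construct that ChatGPT consistently failed to handle (F13), have a direct PlantUML syntax. The following hand-written sketch (with illustrative names, not ChatGPT output) shows the notation involved:

```plantuml
@startuml
class Person
class Company
class Employment {
  salary : Integer
}
Person "*" -- "*" Company
' Attach Employment as an association class on the Person-Company link
(Person, Company) .. Employment
@enduml
```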
LLMs can provide tailored instruction and feedback that meets the individual needs of students.
– Automated grading and assessment: LLMs can provide instant feedback to students on their performance. This can save teachers time and help them provide more effective feedback to students.

5.2 How will the game change?

Overall, the use of large language models has the potential to revolutionize software modeling engineering and education, making it more accessible, personalized and efficient. To get to that point, we will first need to improve the current consistency and reliability of the models produced by LLMs such as ChatGPT. Second, we will need to change the way in which we currently develop software models and teach modeling. These two issues are described in the following.

First of all, modeling assistants will become key components in model development processes. Software modelers will be able to interact with them in natural language in order to build and test their models. For example, modelers may rely on LLMs to explore modeling choices, add new features to a model or change a model to accommodate new or evolving requirements.

Secondly, new software engineering roles will also appear. For example, companies have started incorporating the new role of prompt engineer [13], whose job is to test AI chatbots using natural language instead of code. Their goal is to identify both errors and hidden capabilities so that developers can either fix or exploit them. They are also experts on how best to ask an LLM to perform a particular task so that it is carried out in the most accurate and efficient manner by the chatbot. New opportunities also emerge for experts in configuring the hyperparameters that allow users to customize the LLM predictions in order to improve the quality of the results. As mentioned earlier, an appropriate hyperparameterization for a specific task could be as important as the dataset used for training the LLM [8] or the actual choice of the (deep learning) algorithm. Similarly, LLM trainers can help provide the appropriate datasets to improve the prediction accuracy of an LLM in particular domains, and for specific tasks.

MBSE educators will have to change the way they perform most of their tasks today. Since LLMs will be ubiquitous, professors will not be able to prevent students from using LLMs for their assignments. On the contrary, one of their goals will be to help students use modeling assistants in the best possible way to learn new concepts, develop software models and test them. In addition, they will need to help students to develop critical thinking skills that enable them to distinguish when the information provided by an assistant is useful and correct and when it is not.

Finally, researchers and academics will be able to use LLMs to analyze large amounts of models, identify patterns and insights and generate new ideas from them.

5.3 How do we make this happen?

The prospects are certainly encouraging. The question is whether they are really attainable and, if so, how they can be achieved. It is clear that ChatGPT's abilities to perform modeling tasks are not yet up to the job. In this section, we would like to propose some suggestions that the modeling community could implement to improve the reliability and accuracy of ChatGPT and other generative AI models.

First, we should make more (correct) software models available in public repositories, thus increasing the accessibility of datasets that can be used for training LLMs and other generative AI models. The more UML and software models that are publicly available from different domains, the more accurate and reliable the responses from these AI models will be.

Second, we should start using LLMs/generative AI models in our software modeling tasks to familiarize ourselves with them, explore their possibilities and discover their limitations. We should strive to use them not only for developing software models, but also for testing them, generating instances and test cases, etc. Exploring their use for other MBSE tasks and activities could also be valuable. We are sure that AI models can open new ways to make use of models in software and systems engineering tasks.

Providing feedback on the results of AI models, whenever available, will benefit the whole community. Training them should become a community effort, i.e., a responsibility of each and every one of us.

Developing a body of knowledge that incorporates a set of guidelines about the best strategies to interact with AI-based assistants for various types of modeling tasks, as well as a catalog of capabilities and common limitations, can also contribute to streamlining the assimilation of AI models for modeling tasks.

Finally, let us incorporate LLMs and generative AI models into our teaching practices. Making students acquainted with them and aware of their possibilities and limitations will help them not only to improve their modeling skills, but also their critical thinking. They should learn to discriminate when to use these AI models and when not to, as well as when to trust their answers.

6 Conclusions

Generative AI and large language models are becoming ubiquitous, and their upcoming impact on our disciplines and professions cannot be overlooked. In this paper, we
have investigated their current capabilities and limitations for generating UML class diagrams and for assisting software engineers to perform modeling tasks. Our findings show that, in contrast to code generation and completion, the performance of the current version of ChatGPT for software modeling is still quite limited.

Our intention was not to conduct an exhaustive set of experiments regarding the capabilities of LLMs for assisting in modeling tasks, as they are currently changing very fast. However, we wanted to address the growing need to have a picture of their current state, as accurate as possible. We also did not want to address other issues related to these types of tools, such as their ethical concerns. Although equally important, in this article we have focused mainly on their technical aspects.

In general, we believe that, far from detracting from the use of this type of generative AI-based tools, we should try to help improve them as much as possible. In addition, we should start adapting our model-based engineering practices to these new assistants and the possibilities they offer. Likewise, we should start changing our modeling education methods to incorporate them.

Successfully addressing the challenge of seamlessly integrating these new LLMs and generative AI models into our MBSE methods and practices is crucial. It could significantly increase the impact of MBSE on society and lead to a major step forward for our profession.

Acknowledgements We would like to thank Jörg Kienzle for his comments and very valuable feedback on an earlier draft of this paper. This work was partially funded by the Spanish Government (FEDER/Ministerio de Ciencia e Innovación–Agencia Estatal de Investigación) under projects PID2021-125527NB-I00 and TED2021-

References

3. Borji, A.: A categorical archive of ChatGPT failures. CoRR arXiv:2302.03494 (2023)
4. Burgueño, L., Clarisó, R., Gérard, S., Li, S., Cabot, J.: An NLP-based architecture for the autocompletion of partial domain models. In: Proc. of CAiSE'21, LNCS, vol. 12751, pp. 91–106. Springer (2021). [Link]
5. Cabot, J., Raventós, R.: Roles as entity types: a conceptual modelling pattern. In: Proc. of ER'04, LNCS, vol. 3288, pp. 69–82. Springer (2004). [Link]
6. Capuano, T., Sahraoui, H.A., Frénay, B., Vanderose, B.: Learning from code repositories to recommend model classes. J. Object Technol. 21(3), 1–11 (2022). [Link]
7. Chaaben, M.B., Burgueño, L., Sahraoui, H.: Towards using few-shot prompt learning for automating model completion. In: Proc. of ICSE (NIER)'23. IEEE/ACM (2023)
8. Döderlein, J., Acher, M., Khelladi, D.E., Combemale, B.: Piloting Copilot and Codex: hot temperature, cold prompts, or black magic? CoRR arXiv:2210.14699 (2022)
9. GitHub: Copilot: Your AI pair programmer (2023). [Link]
10. Kim, H., So, B.H., Han, W.S., Lee, H.: Natural language to SQL: Where are we today? Proc. VLDB Endow. 13(10), 1737–1750 (2020). [Link]
11. Marcus, G., Davis, E.: GPT-3, Bloviator: OpenAI's language generator has no idea what it's talking about (2020). [Link]
12. Meyer, B.: What do ChatGPT and AI-based automatic program generation mean for the future of software. Commun. ACM 65(12), 5 (2022). [Link]
13. Mok, A.: 'Prompt engineering' is one of the hottest jobs in generative AI. Here's how it works. Business Insider (2023). [Link]
14. OpenAI: ChatGPT (2023). [Link]
15. Pirotte, A., Zimányi, E., Massart, D., Yakusheva, T.: Materialization: A powerful and ubiquitous abstraction pattern. In: Proc. of
130523B-I00.
VLDB’94, pp. 630–641. Morgan Kaufmann (1994). [Link]
[Link]/conf/1994/[Link]
Funding Funding for open access publishing: Universidad de Málaga/
16. Rocco, J.D., Sipio, C.D., Ruscio, D.D., Nguyen, P.T.: A GNN-
CBUA.
based recommender system to assist the specification of metamod-
els and models. In: Proc. of MODELS’22, pp. 70–81. IEEE (2021).
Open Access This article is licensed under a Creative Commons
[Link]
Attribution 4.0 International License, which permits use, sharing, adap-
17. Saini, R., Mussbacher, G., Guo, J.L.C., Kienzle, J.: Automated,
tation, distribution and reproduction in any medium or format, as
interactive, and traceable domain modeling empowered by artificial
long as you give appropriate credit to the original author(s) and the
intelligence. Softw. Syst. Model. 21(3), 1015–1045 (2022). https://
source, provide a link to the Creative Commons licence, and indi-
[Link]/10.1007/s10270-021-00942-6
cate if changes were made. The images or other third party material
18. Savary-Leblanc, M., Burgueño, L., Cabot, J., Pallec, X.L., Gérard,
in this article are included in the article’s Creative Commons licence,
S.: Software assistants in software engineering: a systematic map-
unless indicated otherwise in a credit line to the material. If material
ping study. Softw. Pract .Exp. 53(3), 856–892 (2023). [Link]
is not included in the article’s Creative Commons licence and your
org/10.1002/spe.3170
intended use is not permitted by statutory regulation or exceeds the
19. Vaithilingam, P., Zhang, T., Glassman, E.L.: Expectation vs. Expe-
permitted use, you will need to obtain permission directly from the copy-
rience: evaluating the usability of code generation tools powered
right holder. To view a copy of this licence, visit [Link]
by large language models. In: Proc. of CHI’22, pp. 332:1–332:7.
[Link]/licenses/by/4.0/.
ACM (2022). [Link]
20. Weyssow, M., Sahraoui, H.A., Syriani, E.: Recommending meta-
References model concepts during modeling activities with pre-trained lan-
guage models. Softw. Syst. Model. 21(3), 1071–1089 (2022).
1. Atenea Research Group: Git repository: chatgpt-uml (2023). [Link]
[Link]
2. Barke, S., James, M.B., Polikarpova, N.: Grounded copilot: How
programmers interact with code-generating models. (2022). CoRR
arXiv:2206.15000 Publisher’s Note Springer Nature remains neutral with regard to juris-
dictional claims in published maps and institutional affiliations.