Software and Systems Modeling (2023) 22:781–793


EXPERT VOICE

On the assessment of generative AI in modeling tasks: an experience report with ChatGPT and UML
Javier Cámara1 · Javier Troya1 · Lola Burgueño1 · Antonio Vallecillo1

Received: 15 March 2023 / Revised: 9 April 2023 / Accepted: 14 April 2023 / Published online: 22 May 2023
© The Author(s) 2023

Abstract
Most experts agree that large language models (LLMs), such as those used by Copilot and ChatGPT, are expected to revolutionize the way in which software is developed. Many papers are currently devoted to analyzing the potential advantages and limitations of these generative AI models for writing code. However, the analysis of the current state of LLMs with respect to software modeling has received little attention. In this paper, we investigate the current capabilities of ChatGPT to perform modeling tasks and to assist modelers, while also trying to identify its main shortcomings. Our findings show that, in contrast to code generation, the performance of the current version of ChatGPT for software modeling is limited, with various syntactic and semantic deficiencies, lack of consistency in responses, and scalability issues. We also outline our views on the role that LLMs can play in the software modeling discipline in the short term, and on how the modeling community can help to improve the current capabilities of ChatGPT and the coming LLMs for software modeling.

Keywords Large language models · ChatGPT · Software models · Modeling languages · UML

Communicated by Bernhard Rumpe.

Javier Cámara (corresponding author): jcamara@[Link]
Javier Troya: jtroya@[Link]
Lola Burgueño: lolaburgueno@[Link]
Antonio Vallecillo: av@[Link]

1 ITIS Software, Universidad de Málaga, ETSI Informática, Campus de Teatinos, Bulevar Louis Pasteur 35, 29071 Málaga, Spain

1 Introduction

The emergence of generative AI and large language models (LLMs), such as those used by GitHub's Copilot [9] and OpenAI's ChatGPT [14], is causing quite a stir in the Computer Science community. Most experts foresee a major disruption in the way software is developed, and software engineering education is also expected to change drastically with the advent of these LLMs [12]. These issues are a recurrent topic in many universities and are being covered by most specialized forums and blogs. A plethora of papers are now analyzing the potential advantages, limitations and failures of these models for writing code [3], as well as how programmers interact with them [2, 19]. Most studies seem to agree that LLMs do an excellent job of writing code: despite some minor syntactic errors, what they produce is essentially correct.

However, what about software modeling? What is the situation of LLMs when it comes to performing modeling tasks or assisting modelers in accomplishing them? A few months ago we started looking at these issues, trying to investigate the current status of LLMs with respect to conceptual modeling, a topic that does not seem to have attracted much attention so far. Our premise is that LLMs are here to stay. So, instead of ignoring them or rejecting their use, we posit that it would be better to embrace them and use them effectively to help us perform modeling tasks.

We are aware that the current LLM situation is very volatile, with new models, versions and tools being released frequently, each one improving on the previous ones. However, our goal is to assess the current situation and to provide a set of experiments that can enable us to identify possible shortcomings of current tools for performing modeling
tasks and assisting modelers, as well as a way to measure the improvement of future versions.

In this paper, we focus on the development of software models and, more specifically, on how to build UML class diagrams enriched with OCL constraints. Of the existing LLMs, we focus on ChatGPT, analyzing its possible use as a modeling assistant. To do so, we investigate several issues, such as: (1) the correctness of the UML and OCL models produced by ChatGPT; (2) the best way to ask ChatGPT to build correct and complete software models—in particular, UML class diagrams; (3) its coverage of different modeling concepts and mechanisms; (4) its expressiveness and cross-modeling-language translation capabilities; and (5) its sensitivity to context and problem domains.

Our findings show that the performance of the current version¹ of ChatGPT for software model development is not as good as for code generation. Our experiments concluded that ChatGPT can only deal with small models, and that it is unable to properly handle some basic modeling concepts, such as association classes or multiple inheritance. The variability and inconsistency of the models produced in response to the same prompts were too high to ensure the repeatability and reproducibility of the results. Some obvious errors (such as associations that had composition symbols at both ends) were more frequent than expected. We also realized that the problem domain had a remarkable impact on the results. For example, in domains for which there is a large code base (e.g., banking), the models produced by ChatGPT had a very low level of abstraction, were very close to the programming level, and were mostly correct. However, the models generated for more abstract domains, such as university courses or theater plays, were fundamentally flawed. In contrast, we found that ChatGPT's performance with OCL expressions and constraints was remarkable. We attribute this to the fact that OCL is very similar to SQL, for which there is an extensive base of programs on which ChatGPT seems to have been trained.

The structure of this paper is as follows. First, Sect. 2 introduces the context of our work and our main objectives. Section 3 describes the experiments we have conducted to understand the current capabilities of ChatGPT for performing modeling tasks. The results of these experiments are presented and analyzed in Sect. 4. Section 5 sets out our views on the present and foreseeable future of generative LLMs for performing software modeling tasks and how modelers can make the best use of them, and outlines some ideas on how the software modeling community can help to improve these tools. Finally, Sect. 6 concludes with some closing remarks.

¹ Stable release February, 2023.

2 Context

This section introduces the context of our work and our main objectives, formulated through a set of research questions.

2.1 AI-based assistant tools

Software assistants and conversational bots have been around for a long time [18]—think, for example, of Microsoft's infamous Clippy. However, they have not received much attention until recently, when their performance has been found to be outstanding and their responses have seriously challenged the Turing test in some instances. From the Arts to the Sciences, LLMs are demonstrating their great potential and value in helping with numerous tasks.

The way to use LLMs and interact with them depends on numerous factors. For example:

– Interaction mode: Interactions with assistants in software development are bimodal [2]: in acceleration mode, the programmer knows what to do next and uses an LLM such as Copilot or ChatGPT to get there faster; in exploration mode, the programmer is unsure about how to proceed and uses the assistant to explore options.

– Type of assistance: We can distinguish between two types of AI-based tools for software modeling, depending on their use. First, there are auto-completion wizards that propose new classes, attributes and relationships while the model is being developed, e.g., [4, 6, 7, 16, 17, 20]. Second, there are tools that can be asked to perform the complete task, after which the user can refine or extend the tool's results based on their correctness, completeness or suitability, if needed. Examples of such tools are Copilot and ChatGPT.

LLMs are deep learning models trained on massive datasets to perform specific tasks. They all incorporate from millions to billions of parameters that, on some occasions, can be fine-tuned to adapt them to problems similar to those for which they have been initially trained. Usually, these models expose a series of hyperparameters that allow users to customize the predictions. The choice of good hyperparameter values has an important impact on the quality of the results. An appropriate hyperparameterization for a specific task could be as important as the dataset used for training—see, e.g., [8] on how the hyperparameterization of LLMs such as Copilot or Codex affects their results. However, it is not clear whether the advantages of choosing the most appropriate hyperparameters for the task at hand outweigh their limitations, in terms of the needed knowledge and skills, complexity, required effort and payoff in the results. For instance, tools such as ChatGPT do not allow users to configure their hyperparameters; these are inferred from the prompt. In
contrast, the new Bing search engine² allows the non-expert user to set a few hyperparameters, but not all. For this, Bing has modified how the hyperparameterization of the LLM is done and allows the user to choose the conversation style among three options: "more creative," "more balanced" and "more precise," instead of asking them to select a value (i.e., the so-called temperature value) within a given interval, usually a real number between 0 and 1.

² [Link]FORM=hpcodx.

2.2 ChatGPT

ChatGPT is a tool developed by OpenAI, a for-profit research organization co-founded by Elon Musk and Sam Altman, and strongly funded by Microsoft. Users interact with ChatGPT in a conversational way via text prompts.

When asked about its modeling knowledge, ChatGPT reports that it knows most UML diagrams, including Class diagrams, Use cases, State machines, Sequence diagrams and Activity diagrams.

Regarding the UML notations ChatGPT can handle, being a language model, it cannot generate models in graphical form. ChatGPT produces models in textual UML notations, including PlantUML, USE (the UML-based Specification Environment), Yuml, Markdown UML, Mermaid and UMLet. It also produces some rudimentary class diagrams using plain characters to draw boxes and lines, but sometimes these are difficult to parse and understand. Figure 1 shows one example of these textual diagrams.

Fig. 1 A textual diagram generated by ChatGPT

We discovered that ChatGPT can also handle Ecore models. You can ask it to generate models in Ecore and also use them as inputs for prompts. Its treatment of the Ecore language is comparable to that of other modeling languages, with similar mistakes and correct answers.

We also asked ChatGPT about other textual languages that it knows which are used in UML for representing different aspects of software systems. It mentioned the Object Constraint Language (OCL), the Action Language for Foundational UML (ALF), the UML Profile Definition Language (UML PDL) and the UML Testing Profile (UTP). We checked its skills with OCL in depth, which are excellent; in contrast, the initial tests with the other notations did not yield satisfactory results.

2.3 Research questions

As mentioned in the introduction, our primary goal was to analyze the use of ChatGPT as an assistant tool for conceptual modeling. In line with this, we address the following Research Questions:

RQ1. Does ChatGPT generate syntactically correct UML models?
RQ2. Does ChatGPT generate semantically correct models, i.e., semantically aligned with the user intents?
RQ3. How sensitive is ChatGPT to the context and to the problem domain?
RQ4. How large are the models that ChatGPT is able to generate or handle?
RQ5. Which modeling concepts and mechanisms is ChatGPT able to effectively use?
RQ6. Does prompt variability impact the correctness/quality of the generated models?
RQ7. Do different use strategies (e.g., prompt partitioning) result in different outcomes?
RQ8. How sensitive is ChatGPT to the UML notation used to represent the output models?

To answer these research questions, we devised a set of experiments, which are detailed in the next section.

3 Experiments

This section describes the experiments we conducted to understand the current capabilities of ChatGPT to perform modeling tasks. We defined two phases. In the first one, we carried out some exploratory experiments to gain a basic understanding of how ChatGPT works with software models, as well as its main features and limitations. The experiments in the second phase were more systematic and aimed to further characterize ChatGPT's modeling capabilities. The results of these experiments are presented and discussed later in Sect. 4.

3.1 First phase: exploration

Objective In this exploratory phase, the four authors of this paper interacted individually with ChatGPT to become acquainted with its modeling capabilities. We also explored some of its general characteristics. Since we are not able to set hyperparameters such as the number of tokens, we explored the size of the models it was able to handle. We also explored its skills with various modeling notations, which depend on the training data.

Method For this phase, we did not use any systematic approach but tried to explore all the ideas that came to mind, based on the findings we were making and the results we were obtaining.

Materials We wrote prompts asking ChatGPT to create models of different sizes, as well as to create the target models of some of the assignments that we use in our modeling lectures. The size of these models ranged from 10 to 40 classes and associations. We wrote all our interactions and findings in a shared document used as a logbook [1].

First findings We became aware of several basic capabilities and limitations of ChatGPT. Some of them were not surprising, given how language models work, but they are still worth reporting here.

F1. Problem domain and semantics The problem domain is important for ChatGPT. In general, it works poorly when the names of the entities to be modeled have no meaning, such as X, Y, Z, or A, B, C. The more meaningful and representative the entity names are, the better the class model it produces. Similarly, the more ChatGPT "knows" about the domain, the more accurate and complete the UML model it generates. Purchase Orders, Banks or Employees are concepts for which it is able to produce semantically rich models (too rich sometimes, as it completes them with information that was not requested).

F2. Problem domain and syntax The problem domain also seems to influence the structure and contents of the resulting models, as well as their level of abstraction. In some domains, the models generated had a very low level of abstraction, quite close to a software program represented in UML. In others, the level of abstraction was higher, although it heavily depended on the particular conversation. As we know, LLMs have semantic and syntactic capabilities. When mixing these two abilities to produce class models, depending on the concrete domain (and thus the amount of data about that domain in the training dataset), ChatGPT seems to rely on its translation capabilities. Sometimes, given our prompt, ChatGPT's outputs seem to be the UML representation of a possible solution that it found or produced in a different language, i.e., with a different syntax. If this other language is a low-level language such as Java or C++, the abstraction level is lower than if it finds a solution represented as a software model such as a relational schema. In other words, the problem domain influences the result, as the latter depends on the data with which ChatGPT has been trained for that domain.

F3. Publicly available models Related to the previous point, if you ask ChatGPT to build a UML model that is on the Internet (such as the example given in the OCL 2.4 standard), ChatGPT will generate a correct model. OpenAI has not disclosed what data was used to train ChatGPT or how the training process was conducted, but it looks like these publicly available models have served as training models for ChatGPT.

F4. Size of the models to build The current version of ChatGPT does not work well when asked to generate a class model of more than 8–10 classes from scratch. However, it works much better if you ask it to build a small initial model and progressively add information to it. In fact, ChatGPT was unable to cope with any of the exams of our modeling course, because these UML models were too large (more than 20 classes and associations) for its current capabilities or hyperparameterization, and it either did not finish the task (which had to be aborted) or built rather small and incomplete models.

F5. Notations We also experimented with various notations to represent the generated UML model. By default, ChatGPT seems to use a diagrammatic notation that employs
Table 1 Coverage by the selected examples of the main modeling concepts and mechanisms
Concept/Mechanism Students Airlines File system Robots Video club Theaters Amphibious Cars

Enumerations X X X
Classes X X X X X X X X
Attributes X X X X X X X X
Operations X
Generalization X X X X
Association X X X X X X X
Aggregation X X X X
Composition X
Assoc. class X X X
Multiple inheritance X
Abstract classes X X X
OCL constraints X X
Roles (as assoc. ends)
Roles (as inherited classes) X
Roles (as entity types) [5]
Materialization [15] X X

characters to draw boxes and lines on the screen. This notation is too difficult to read and understand when there are more than four or five classes in the model, so we started to explicitly ask ChatGPT to produce models in specific notations, such as PlantUML or USE. Apart from small syntactic errors, the results are generally good; we cannot say the same for the semantics of the generated models, which were full of errors, as we shall see later.

F6. Conversation history Although there is a limit to the amount of information ChatGPT can retain, it is able to "remember" what was said earlier in a conversation. That is, ChatGPT is conversation-aware and results are heavily conversation-dependent.³ Depending on the session, and on our previous interactions, the results may present remarkable variations. In fact, when asked to build a model, ChatGPT takes information from previously developed models within the same conversation, even if they have nothing to do with the model in question. This is why it is important to start a new chat every time we want to develop a new model. One exercise we did was to ask ChatGPT to generate a UML model in three different chats using the same prompt. In two of them, we had previously been creating models from other domains, and the third chat started afresh. The results generated in the first two conversations were very similar to the previously generated models, despite the fact that the new model was from a different domain. The results of the same prompt in the new chat were closer to the desired target.

³ OpenAI states that, when replying to a prompt, ChatGPT does not access previous conversations.

F7. Cross-language translation facilities When testing the translation facilities across modeling languages, the results are conversation-dependent. For example, we gave ChatGPT a model in USE with association classes and asked it to represent the model in PlantUML. The result was not correct, because ChatGPT does not seem to know how to handle association classes. Now, given that same PlantUML model, if asked to convert it to USE, depending on whether it is within the same conversation or in a different one, ChatGPT sometimes converts it to the original USE model (even with association classes) or to a different model (this time with syntactic errors in USE). Interestingly, this does not seem to be specific to modeling; it also applies to translation between other languages, even natural ones.

F8. Integrity constraints When the description of the model to be represented includes integrity constraints (which we would expect to be specified by means of OCL expressions), what ChatGPT usually does for each constraint is either to create a note or to define an operation that checks the constraint on the class that would correspond to the context of the OCL expression. We soon learned that if what we want to represent are the integrity constraints of a UML class model using OCL, it is better to develop the model without constraints and then explicitly ask ChatGPT to generate the constraints in OCL, one by one. ChatGPT works significantly better with OCL than with UML. We suspect that this is possibly due to the fact that the data sources used for the construction of OCL expressions are usually SQL, Rust and other declarative languages, for which there is a much larger corpus than for UML.
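To make the workflow described in F8 concrete, the following is a sketch of the kind of OCL invariant one would ask ChatGPT to generate separately, after the class model is in place. The class and attribute names (Member, age, rentals) are hypothetical illustrations, not taken from the authors' actual exercises:

```ocl
-- Hypothetical invariants for a small video-club model
context Member
inv NonNegativeAge:
  self.age >= 0

context Member
inv AtMostFiveRentals:
  -- navigates the association end 'rentals'
  self.rentals->size() <= 5
```

Requesting each constraint in a separate prompt, as the authors suggest, keeps ChatGPT focused on one context class at a time.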

Fig. 2 Prompt used to ask ChatGPT to generate a UML class diagram of a video club system, and the resulting model

Fig. 3 Another model generated by ChatGPT in response to exactly the same prompt, but in a different session
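Since the PlantUML sources in Figs. 2 and 3 may be hard to read in this rendering, here is a minimal sketch of what such ChatGPT output typically looks like. The class and attribute names are illustrative only; this is not the paper's actual Video club model:

```plantuml
@startuml
class Movie {
  +title : String
  +year : Integer
}
class Member {
  +name : String
}
class Rental {
  +pickupDate : Date
  +returnDate : Date
}
Member "1" -- "0..*" Rental
Movie "1" -- "0..*" Rental
@enduml
```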

3.2 Second phase: focused experiments

Objective In the first phase, we managed to obtain a basic understanding of how ChatGPT works, as well as of its main features and limitations. We also obtained initial responses to some of the research questions, namely those about its sensitivity to context and problem domain (RQ3, addressed by findings F1, F2, F3 and F6), its scalability (RQ4, addressed by finding F4) and, partly, about its sensitivity to the modeling notation of choice (RQ8, addressed by findings F5, F7 and F8). The goal of this second phase was to address the rest of the research questions, which demanded a more systematic approach.

Method For this phase, we developed a set of models that were intended to cover the most important modeling concepts and mechanisms (see the left column of Table 1). Each author independently proposed ten UML models. All of them were small in size (three to six classes) so that ChatGPT could handle them without problems. They represented different user intents, and for each one of them the exercise consisted in asking ChatGPT to produce the corresponding UML model using one or more prompts.

Figure 2 shows one of these exercises (Video club). The prompt used to generate the UML class diagram is shown on the left, and the ChatGPT response (in PlantUML) is shown on the right. For readability purposes, we have included the graphical representation of the PlantUML model in the central box. On this occasion, ChatGPT managed to generate the intended model after a few interactions, so the exercise was considered successful. However, to illustrate the variability of ChatGPT's responses, Fig. 3 shows another model generated by ChatGPT in response to exactly the same prompt, but from a different conversation. (Both were fresh conversations.) Although there are deterministic language models, most modern LLMs (such as ChatGPT) are designed to be probabilistic, not deterministic. This lack of repeatability of the results represents a major obstacle to the reproducibility of the experiments and is, from our point of view, one of the main current challenges of these assistants.

Even if prompts were carefully designed, very often ChatGPT did not generate the expected result. To improve the result, we always tried to follow a conversation with the bot, providing multiple successive prompts in which we asked it to modify some aspect of the generated result. For example, if ChatGPT generates a class Movie that does not contain an attribute name, we can tell ChatGPT that movies must have a name. The same can be done to add the multiplicities and role names of the associations, remove unwanted methods, or fix incorrect details (such as using compositions when there is a multiplicity 1..* at the composite end). As we will mention later, ChatGPT does not always fix or add what we ask for, such as repairing the multiplicity of an association. When it does, it sometimes introduces additional errors in other parts of the model.

From the complete set of 40 exercises, we selected two from each author. The resulting eight models covered the concepts and mechanisms listed in Table 1. Their intent models are shown in Fig. 4.

Each author tried to make ChatGPT generate these UML intent models as faithfully as possible, using different strategies to create the prompts. A summary of the results of this experiment is shown in Table 2. The columns list the exercise, the number of authors that could make ChatGPT successfully generate the intended model, the average number of sessions that were used, and the average number of prompts that were required per session until the solution was generated or the author gave up. Reasons for restarting a new chat or giving up included: (1) ChatGPT entered an endless loop, e.g., saying "Sure, I will fix it" but repeating the previous response; and (2) class diagrams that had an increasing number of errors despite our indications to fix them, or diagrams that were not worth fixing.

Table 2 Results of the experiment where the four authors tried to make ChatGPT generate the intent models of the selected exercises

Exercise      Successful   Avg. sessions   Prompts/Sess.
Students      4/4          2.5             2.5
Airlines      0/4          3               2.75
File system   4/4          2               2.25
Robots        0/4          3               3.5
Video club    4/4          2               2.3
Theaters      0/4          3               3
Amphibious    4/4          2.2             1.75
Car parts     4/4          2               2.3

Materials The complete set of UML models of the 40 exercises is available from our GitHub repository [1], as well as the reports that each author produced during their interactions with ChatGPT.

Findings The exercises of this phase revealed some very interesting findings, which are summarized below.

Fig. 4 Intent models of the eight selected exercises
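As a rough indication of what such intent models contain, the following hypothetical PlantUML sketch combines several of the constructs listed in Table 1 (an enumeration, an abstract class, and multiple inheritance). It is illustrative only and is not one of the authors' eight intent models:

```plantuml
@startuml
enum Medium {
  LAND
  WATER
}
abstract class Vehicle {
  +maxSpeed : Integer
}
class Car
class Boat
class Amphibious
Vehicle <|-- Car
Vehicle <|-- Boat
' Multiple inheritance: Amphibious specializes both Car and Boat
Car <|-- Amphibious
Boat <|-- Amphibious
@enduml
```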
788 JJ. Cámara et al.

so poorly that the models became badly flawed and we had


to start from scratch (or from some regenerated model in an
intermediate step of the dialogue).
F13. Constructs of the UML language not handled prop-
erly There are UML constructs that ChatGPT can incorporate
but does not always handle adequately. For example, if we
give it a UML model with an association class, ChatGPT
is able to handle it and even define correct OCL constraints
involving that association class. However, none of the mod-
Fig. 5 Example showing some of the mistakes made by ChatGPT when els that ChatGPT generates include association classes, even
representing associations (real ChatGPT output) when they would be the most natural way to model the
problem. We also tried to give ChatGPT a USE model that
F9. Relationships ChatGPT is able to capture associations contained an association class and asked it to write the model
and inheritance adequately, although not always. The ability in PlantUML. ChatGPT converted it to a model without asso-
seems to depend on the domain being modeled. Modeling of ciation class. We explicitly asked ChatGPT to rewrite the
role names, on the other hand, seems to work well for most model to have association classes, and it said yes, but did
domains. not. In fact, none of the three intent models of the exper-
F10. Determinism Results are rather random, with major iment containing association classes could be created with
differences for the same prompt in different chats, as dis- ChatGPT: Airlines, Robots and Theaters (cf., Table 2).
cussed above and illustrated in Figs. 2 and 3. F14. Enumerations In most cases, enumerations are not
F11. Semantics Syntactically, results are mostly correct. used by ChatGPT unless explicitly requested. It rather uses
However, semantically they are not always correct. Examples either inheritance or strings. Unlike with association classes,
of common mistakes include: when explicitly asked to use enumerations, it does so cor-
rectly.
1. Duplicating aggregations (and sometimes even associ- F15. Multiple inheritance Multiple inheritance is not han-
ations) by defining, in addition to the association, an dled correctly. We needed to explicitly describe the type of
attribute in the containing class with the list of related relationship and what the source and target classes were to
elements, which is equivalent to the association and there- obtain the desired result. Although ChatGPT most times ends
fore redundant. up producing the right result, there is high variability in its
2. Mistakenly modeling relations as directed associations. responses, producing correct and incorrect models seemingly
When asked to convert them into bidirectional associ- at random.
ations, two opposing directed associations are created. F16. OCL constraints Initially, ChatGPT does not include
These cannot be merged later, even if we explicitly ask constraints in the model even when they were stated in the
ChatGPT to do so. prompt. When explicitly asked to include them, ChatGPT
3. Creating compositions or aggregations with two com- first proposes using notes, and then operations. When we
posite ends, as illustrated in the example shown in Fig. 5. asked ChatGPT about whether OCL could be used instead,
Note that the multiplicities of these relations are seman- mostly correct OCL constraints were generated (apart from
tically incorrect, too. minor syntactic mistakes on a few occasions).
F17. Capacity for abstraction ChatGPT (unlike human
F12. Iterative process is required Several iterations with modelers) has no capacity for abstraction. If it is asked to rep-
explicit requests for modification are usually needed to resent the UML model of a car with four wheels, it sometimes
approximate the user intent model (cf. Table 2). Thus, the creates four such attributes, as opposed to a more general
task of developing a model usually consists of a dialogue form of modeling that is capable of using a collection of
with ChatGPT, rather than a single request–response inter- wheels that now has four but at another time might have
action. Normally, we start with an initial prompt and refine more or less. For a small number of elements, this strategy is
the result until we achieve the desired intent model. Given acceptable, but it is suboptimal when the number increases
the large variability of ChatGPT’s responses to exactly the above a certain threshold. Similarly, ChatGPT does not fac-
same prompt, it is even a good strategy to start several con- tor out the common attributes of subclasses and place them
versations and continue with the one whose initial model is in the superclass on its own.
most promising, both regarding its level of abstraction and F18. Effort required by the modeler Finally, the amount of
its contents (classes, attributes and associations). This is also time and effort required to produce the correct intent mod-
important because the iterative process does not always con- els is not negligible, especially considering the small size
verge. Sometimes the requested changes were implemented of these models. For example, in all the intent models that

123
On the assessment of generative... 789

could be correctly produced, the total number of interactions with ChatGPT (counting the prompts of all sessions until the model was correct) exceeded the number of model elements.

4 Analysis

After carrying out the experiments and analyzing our experience with ChatGPT, this section is dedicated to answering the research questions identified in Sect. 2.3.

RQ1. Does ChatGPT generate syntactically correct UML models?

The UML models produced by ChatGPT are generally correct, although they may contain small syntactic errors (see finding F5). This also depends on the notation used. Although we did not test it thoroughly, the level of syntactic correctness of the models produced in PlantUML was much higher than that of those generated in USE, for example.

RQ2. Does ChatGPT generate semantically correct models, i.e., semantically aligned with the user's intent?

This is the weakest point that we observed during our interaction with ChatGPT. Some studies suggest that LLMs are better at syntax than at producing semantically correct results [11]. Our findings (e.g., F13) corroborate this fact. This includes errors in both the semantics of the language and the semantics of the domain being modeled. On many occasions, we observed that ChatGPT proposed seemingly random models that made no sense from either a modeling or a domain standpoint.

RQ3. How sensitive is ChatGPT to the context and to the problem domain?

Our findings F1, F2, F3 and F6 clearly show that not only the problem domain influences the resulting models, but also the information exchanged during the dialogues with ChatGPT. In addition, the more ChatGPT "knows" about a domain (i.e., the more data about the domain was used during training), the closer-to-correct the class models it produces. ChatGPT produces its worst results when it has little or no information about the domain or the entities to be modeled, as happened when it was asked to produce software models of entities such as Snarks or Zumbats, for which it did not seem to have any reference or semantic anchor.

RQ4. How large are the models that ChatGPT is able to generate or handle?

As mentioned in Finding F4, ChatGPT currently has strict limitations on the size of the models it can handle. It has serious problems with models larger than 10–12 classes. Even the time and effort required to produce smaller models (Finding F18) are not insignificant.

RQ5. Which modeling concepts and mechanisms is ChatGPT able to effectively use?

The modeling concepts that we analyzed are shown in Table 1. There is a high degree of variability in how ChatGPT handles them. We observed that it is able to manage reasonably well (with some exceptions) associations, aggregations and compositions, simple inheritance, and role names of association ends (F9). However, it requires explicit indications for using enumerations (F14), multiple inheritance (F15) and integrity constraints (F16). Finally, we found that its results are not acceptable when abstraction is required (F17), and that it cannot handle association classes (F13).

RQ6. Does prompt variability impact the correctness/quality of the generated models?

We observed that there is plenty of variability when ChatGPT generates responses to the same prompt (F10). We learned that it is useful to start a new conversation from scratch when the results were not good, in order to find better solutions for the same intent model (F12).

RQ7. Do different use strategies (e.g., prompt partitioning) result in different outcomes?

First, as noted in finding F4, the size of the models that ChatGPT is capable of handling in a single query forces the modeling task to become an iterative process in which the user starts with a small model and progressively adds details to it (F12). The variability and randomness of ChatGPT's responses (F10), or the divergence of results within a conversation, often force the modeler to repeat conversations to try to obtain better models.

RQ8. How sensitive is ChatGPT to the UML notation used to represent the output models?

ChatGPT is capable of representing models in several notations (F5), although in general it makes fewer syntactic mistakes with PlantUML. It is also much better with OCL than with UML (F8). Finally, we also looked at how accurate ChatGPT was at translating between modeling languages (F7), realizing that this task works better within the same conversation, but not across conversations.
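The abstraction weakness discussed in F17 (and echoed in the answer to RQ5) is easy to visualize in PlantUML. The sketch below is our own illustration of the car-with-four-wheels example, not a verbatim ChatGPT transcript: the first diagram mirrors the literal answer, with one attribute per wheel, while the second shows the more general form that factors the wheels into an association with a fixed multiplicity.

```
@startuml
' Literal form sometimes produced by ChatGPT: one attribute per wheel
class Car {
  wheel1 : Wheel
  wheel2 : Wheel
  wheel3 : Wheel
  wheel4 : Wheel
}
@enduml

@startuml
' More abstract form a human modeler would choose: a collection of wheels
class Car
class Wheel
Car "1" *-- "4" Wheel : wheels
@enduml
```

Note that the second form also evolves gracefully: if the number of wheels later varies, it suffices to change the multiplicity from "4" to "*", whereas the literal form requires adding or removing attributes.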

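The relative strength with OCL reported in F8 (and in the answer to RQ8) can be illustrated with the kind of artifact involved. The snippet below is a hypothetical example of ours (not taken from our experiments): a small class with an OCL invariant attached as a PlantUML note, the style of constraint for which ChatGPT produced mostly correct output.

```
@startuml
class BankAccount {
  balance : Real
  creditLimit : Real
  withdraw(amount : Real)
}
' The invariant is plain OCL, carried in a note attached to the class
note right of BankAccount
  context BankAccount
  inv NonOverdrawn:
    self.balance >= -self.creditLimit
end note
@enduml
```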

5 Discussion

From our study, we conclude that ChatGPT is not yet a reliable tool for performing modeling tasks. Does that mean we should discard it, or at least wait to see how it evolves before taking any action? Our position is that, on the contrary, we should start working now to improve the modeling skills of ChatGPT and of the LLMs to come, and to build a future in which these assistants play a prominent role in modeling.

This section sets out our views about the future of LLMs that we foresee when it comes to performing software modeling tasks, and about how modelers can make the best use of them. It is divided into three parts. First, we describe the Model-Based Software Engineering (MBSE) tasks in which LLMs can be helpful, and how we can use LLMs to accomplish them. Second, we discuss the consequences that the new status quo may have on the way we develop models and teach modeling, including the new possibilities it opens and the new roles that software engineers could play in this new context. Finally, we discuss what we think is needed to realize this vision.

5.1 The role of assistants in MBSE

In our opinion, ChatGPT or any other LLM can be of invaluable help in many areas of MBSE, complementing the current work of software modelers and letting them focus on the tasks for which they really provide value.

Model development LLMs can help develop models in both acceleration and exploration modes [2]. Modelers typically generate models by composing (usually in their heads) model fragments, each of which addresses a concern or implements a feature. These model fragments are reused from existing conceptual patterns or solutions known to the modeler, adapting them to the problem at hand. Assistants could be of great help in this case, identifying these existing patterns or solutions and automatically performing the adaptation. For example, in acceleration mode, the tool can provide solutions to add security aspects to a model, extend an existing model to implement more entities or functionalities, or provide model elements with new features, among other tasks. In exploration mode, an LLM can provide a modeler with a set of options on how to model certain system aspects; for example, whether it would be better to use association ends, inheritance or entity types to model certain roles in the application. We could also ask the LLM how to model certain requirements and ask it to add to our model the option that best suits our needs. In this context, the modeler would identify the features or functionalities to be incorporated into the model, using natural language, and the assistant would be in charge of automatically adding them, until the model is complete.

Another task in which LLMs could be very useful is the generation of object models that conform to a given class diagram. We have tested this functionality in ChatGPT, and the results have been very good, although we found problems similar to those we had during model generation. Namely, ChatGPT is able to produce very diverse instance models quickly and efficiently, although the quality of those models is not optimal. For example, most of them do not respect the integrity constraints of the class diagram. As soon as the quality of ChatGPT improves, e.g., by including grammatical checks such as those available for SQL [10], it could outperform the current instance model generators, thus successfully taking on this tedious and costly task.

Model-based testing In addition to generating sets of instance models from a UML class model that could serve as test inputs, LLMs can also be used to generate test cases for the system. For several simple systems (such as a bank account, a microwave, an online shopping system and a flight reservation system), we gave ChatGPT class diagrams with the specification of their structure and operations, and state machines with the specification of their behaviors, and asked it to generate test cases for them. The results were very accurate and complete, covering all relevant cases. Investigating in depth the quality of the test cases that ChatGPT is able to generate is part of our future work.

MBSE Education The methods for teaching modeling are likely to be among the things that will change the most. A few ways in which LLMs can be used to improve modeling education include:

– Enhanced Learning: LLMs can help students learn modeling languages by providing real-time feedback on syntax, highlighting common errors and offering suggestions for improvement. Additionally, they can provide contextual help, e.g., definitions and examples of modeling concepts.
– Model Completion: LLMs can provide auto-complete functionality when students are developing models, which can save time and improve accuracy.
– Model Generation: LLMs can also generate models based on natural language descriptions. This can be useful for students who are just starting and may not yet be familiar with modeling, or with the syntax of a particular modeling language.

In addition, other tasks where LLMs could be of great help (although they would require more elaborate tool support) are the following.

– Personalized Learning: As with other subjects, LLMs can be used to provide personalized learning in computer science education. If complemented by a tool that analyzes the student's strengths, weaknesses and learning style,


LLMs can provide tailored instruction and feedback that meets the individual needs of students.
– Automated Grading and Assessment: LLMs can provide instant feedback to students on their performance. This can save teachers time and help them provide more effective feedback to students.

5.2 How will the game change?

Overall, the use of large language models has the potential to revolutionize software modeling and modeling education, making them more accessible, personalized and efficient. To get to that point, we will first need to improve the current consistency and reliability of the models produced by LLMs such as ChatGPT. Second, we will need to change the way in which we currently develop software models and teach modeling. These two issues are described in the following.

First of all, modeling assistants will become key components in model development processes. Software modelers will be able to interact with them in natural language in order to build and test their models. For example, modelers may rely on LLMs to explore modeling choices, add new features to a model, or change a model to accommodate new or evolving requirements.

Secondly, new software engineering roles will also appear. For example, companies have started incorporating the new role of prompt engineer [13], whose job is to test AI chatbots using natural language instead of code. Their goal is to identify both errors and hidden capabilities so that developers can either fix or exploit them. They are also experts on how best to ask an LLM to perform a particular task so that it is carried out in the most accurate and efficient manner by the chatbot. New opportunities also emerge for experts in configuring the hyperparameters that allow users to customize the LLM predictions in order to improve the quality of the results. As mentioned earlier, an appropriate hyperparameterization for a specific task could be as important as the dataset used for training the LLM [8] or the actual choice of the (deep learning) algorithm. Similarly, LLM trainers can help provide the appropriate datasets to improve the prediction accuracy of an LLM in particular domains, and for specific tasks.

MBSE educators will have to change the way they perform most of their tasks today. Since LLMs will be ubiquitous, professors will not be able to prevent students from using LLMs for their assignments. On the contrary, one of their goals will be to help students use modeling assistants in the best possible way to learn new concepts, develop software models and test them. In addition, they will need to help students develop critical thinking skills that enable them to distinguish when the information provided by an assistant is useful and correct and when it is not.

Finally, researchers and academics will be able to use LLMs to analyze large numbers of models, identify patterns and insights, and generate new ideas from them.

5.3 How do we make this happen?

The prospects are certainly encouraging. The question is whether they are really attainable and, if so, how they can be achieved. It is clear that ChatGPT's abilities to perform modeling tasks are not yet up to the job. In this section, we would like to propose some suggestions that the modeling community could implement to improve the reliability and accuracy of ChatGPT and other generative AI models.

First, we should make more (correct) software models available in public repositories, thus increasing the accessibility of datasets that can be used for training LLMs and other generative AI models. The more UML and software models that are publicly available from different domains, the more accurate and reliable the responses from these AI models will be.

Second, we should start using LLMs and generative AI models in our software modeling tasks to familiarize ourselves with them, explore their possibilities and discover their limitations. We should strive to use them not only for developing software models, but also for testing them, generating instances and test cases, etc. Exploring their use for other MBSE tasks and activities could also be valuable. We are sure that AI models can open new ways to make use of models in software and systems engineering tasks.

Providing feedback on the results of AI models, whenever this option is available, will benefit the whole community. Training them should become a community effort, i.e., a responsibility of each and every one of us.

Developing a body of knowledge that incorporates a set of guidelines about the best strategies for interacting with AI-based assistants for various types of modeling tasks, as well as a catalog of capabilities and common limitations, can also contribute to streamlining the assimilation of AI models for modeling tasks.

Finally, let us incorporate LLMs and generative AI models into our teaching practices. Making students acquainted with them and aware of their possibilities and limitations will help them improve not only their modeling skills, but also their critical thinking. They should learn to discriminate when to use these AI models and when not to, as well as when to trust their answers.

6 Conclusions

Generative AI and large language models are becoming ubiquitous, and their upcoming impact on our disciplines and professions cannot be overlooked. In this paper, we


have investigated their current capabilities and limitations for generating UML class diagrams and for assisting software engineers in performing modeling tasks. Our findings show that, in contrast to code generation and completion, the performance of the current version of ChatGPT for software modeling is still quite limited.

Our intention was not to conduct an exhaustive set of experiments regarding the capabilities of LLMs for assisting in modeling tasks, as they are currently changing very fast. However, we wanted to address the growing need to have a picture of their current state that is as accurate as possible. We also did not want to address other issues related to these types of tools, such as their ethical concerns. Although equally important, in this article we have focused mainly on their technical aspects.

In general, we believe that, far from detracting from the use of this type of generative AI-based tools, we should try to help improve them as much as possible. In addition, we should start adapting our model-based engineering practices to these new assistants and the possibilities they offer. Likewise, we should start changing our modeling education methods to incorporate them.

Successfully addressing the challenge of seamlessly integrating these new LLMs and generative AI models into our MBSE methods and practices is crucial. It could significantly increase the impact of MBSE on society and lead to a major step forward for our profession.

Acknowledgements We would like to thank Jörg Kienzle for his comments and very valuable feedback on an earlier draft of this paper. This work was partially funded by the Spanish Government (FEDER/Ministerio de Ciencia e Innovación–Agencia Estatal de Investigación) under projects PID2021-125527NB-I00 and TED2021-130523B-I00.

Funding Funding for open access publishing: Universidad de Málaga/CBUA.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit [Link].

References

1. Atenea Research Group: Git repository: chatgpt-uml (2023). [Link]
2. Barke, S., James, M.B., Polikarpova, N.: Grounded Copilot: how programmers interact with code-generating models (2022). CoRR arXiv:2206.15000
3. Borji, A.: A categorical archive of ChatGPT failures (2023). CoRR arXiv:2302.03494
4. Burgueño, L., Clarisó, R., Gérard, S., Li, S., Cabot, J.: An NLP-based architecture for the autocompletion of partial domain models. In: Proc. of CAiSE'21, LNCS, vol. 12751, pp. 91–106. Springer (2021). [Link]
5. Cabot, J., Raventós, R.: Roles as entity types: a conceptual modelling pattern. In: Proc. of ER'04, LNCS, vol. 3288, pp. 69–82. Springer (2004). [Link]
6. Capuano, T., Sahraoui, H.A., Frénay, B., Vanderose, B.: Learning from code repositories to recommend model classes. J. Object Technol. 21(3), 1–11 (2022). [Link]
7. Chaaben, M.B., Burgueño, L., Sahraoui, H.: Towards using few-shot prompt learning for automating model completion. In: Proc. of ICSE (NIER)'23. IEEE/ACM (2023)
8. Döderlein, J., Acher, M., Khelladi, D.E., Combemale, B.: Piloting Copilot and Codex: hot temperature, cold prompts, or black magic? (2022). CoRR arXiv:2210.14699
9. GitHub: Copilot: your AI pair programmer (2023). [Link]
10. Kim, H., So, B.H., Han, W.S., Lee, H.: Natural language to SQL: where are we today? Proc. VLDB Endow. 13(10), 1737–1750 (2020). [Link]
11. Marcus, G., Davis, E.: GPT-3, Bloviator: OpenAI's language generator has no idea what it's talking about. MIT Technology Review (2020). [Link]
12. Meyer, B.: What do ChatGPT and AI-based automatic program generation mean for the future of software. Commun. ACM 65(12), 5 (2022). [Link]
13. Mok, A.: 'Prompt engineering' is one of the hottest jobs in generative AI. Here's how it works. Business Insider (2023). [Link]
14. OpenAI: ChatGPT (2023). [Link]
15. Pirotte, A., Zimányi, E., Massart, D., Yakusheva, T.: Materialization: a powerful and ubiquitous abstraction pattern. In: Proc. of VLDB'94, pp. 630–641. Morgan Kaufmann (1994). [Link]
16. Rocco, J.D., Sipio, C.D., Ruscio, D.D., Nguyen, P.T.: A GNN-based recommender system to assist the specification of metamodels and models. In: Proc. of MODELS'21, pp. 70–81. IEEE (2021). [Link]
17. Saini, R., Mussbacher, G., Guo, J.L.C., Kienzle, J.: Automated, interactive, and traceable domain modeling empowered by artificial intelligence. Softw. Syst. Model. 21(3), 1015–1045 (2022). [Link]
18. Savary-Leblanc, M., Burgueño, L., Cabot, J., Pallec, X.L., Gérard, S.: Software assistants in software engineering: a systematic mapping study. Softw. Pract. Exp. 53(3), 856–892 (2023). [Link]
19. Vaithilingam, P., Zhang, T., Glassman, E.L.: Expectation vs. experience: evaluating the usability of code generation tools powered by large language models. In: Proc. of CHI'22, pp. 332:1–332:7. ACM (2022). [Link]
20. Weyssow, M., Sahraoui, H.A., Syriani, E.: Recommending metamodel concepts during modeling activities with pre-trained language models. Softw. Syst. Model. 21(3), 1071–1089 (2022). [Link]

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Javier Troya is an Associate Professor at the University of Málaga, Spain. Before, he was Assistant Professor at the University of Seville, Spain (2016-2020), and a post-doctoral researcher at TU Wien, Austria (2013-2015). He obtained his International PhD with honors at the University of Málaga, Spain (2013). His current research interests include MDE, Software Testing and Digital Twins. For more information, please visit [Link].

Lola Burgueño is an Associate Professor at the University of Málaga, Spain. Her research interests lie in the fields of software engineering and model-based software engineering. She has made contributions to the application of artificial intelligence techniques to improve software development processes and tools, uncertainty management during the software design phase, and model-based software testing, among others. For more information, please visit [Link].

Javier Cámara is an Associate Professor of Computer Science at the University of Málaga and Honorary Visiting Fellow at the Department of Computer Science, University of York. His current research interests include self-adaptive and autonomous systems, software architecture, formal methods, as well as cyber-physical and AI systems. During 2018-2021, he was a Lecturer in Computer Science at the University of York. Prior to that, he was part of the core faculty of the Institute for Software Research at Carnegie Mellon University (2015-2018). He received his European PhD with honors from the University of Málaga in 2009. He has also been a postdoctoral research associate at INRIA Rhône-Alpes, the Centre for Informatics and Systems of the University of Coimbra, and Carnegie Mellon University. For more information, contact him at jcamara@[Link] or visit [Link].

Antonio Vallecillo is full professor of software engineering at the University of Málaga, Spain. He leads the Atenea research group, which focuses on systems modeling and analysis. His research interests include open distributed processing, model-based software engineering, and software quality. For more information, please visit [Link].
