AI Unreliable Answers: A Case Study on ChatGPT
1 Introduction
Artificial Intelligence (AI) is increasingly pervading our daily lives through the
use of intelligent software, such as chatbots [1]. There exist many definitions
of chatbots; according to [10], a chatbot is a computer program which responds
like a smart entity when conversed with through text or voice and understands
one or more human languages by Natural Language Processing (NLP).
At present (January 2023), the ChatGPT¹ chatbot is fascinating the whole
world and generating much discussion about whether Artificial Intelligence
can replace human beings. Many positive reactions to the tool have been
reported: the New York Times stated that it is “the best artificial intelligence
chatbot ever released to the general public” [17]. An important Italian
newspaper (Corriere della Sera) stated on January 31, 2023 that “ChatGPT answers
¹ https://chat.openai.com/chat
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
H. Degen and S. Ntoa (Eds.): HCII 2023, LNAI 14051, pp. 23–40, 2023.
https://doi.org/10.1007/978-3-031-35894-4_2
24 I. Amaro et al.
questions, solves equations and summarizes text” even if “it provides neither
moral judgment nor intellectual content”. The big tech companies have also
been strongly affected by the ChatGPT technology: Microsoft is investing
10 billion dollars in ChatGPT, planning to integrate it into Bing to challenge
Google's hegemony and to add ChatGPT text summarization and translation
features to Microsoft Teams. Google, in turn, is worried about losing
its dominant position in the search engine field and is going to develop a
similar product [7].
Many newspapers are contributing to a halo of magical intelligence
around ChatGPT, but the first negative comments are arriving at the same time.
Many people are worried about the future of knowledge workers [12], who may
be replaced by Artificial Intelligence algorithms such as ChatGPT. Many
teachers are worried about ChatGPT's ability to pass exams. As an
example, ChatGPT scored C+ on law school exams in four courses, a low but
passing grade [5]. Some researchers used it to generate a literature review on a
specific topic (Digital Twin) and noted that the results were promising, with a
low percentage of detected plagiarism [3]. Others consider the GPT technology
an opportunity for healthcare, to be approached with due caution [11]. The use of
ChatGPT to support researchers has been investigated in [13]. The advantages
are many, but caution is recommended in this case too, owing to several drawbacks,
such as lack of generalizability, dependence on data quality and diversity, lack of
domain expertise, limited ability to understand context, ethical considerations,
and limited ability to generate original insights. Some communities, such as
Stack Overflow, banned for one month users who posted answers produced by
ChatGPT in their discussions [19], because the solutions produced by
the chatbot seem realistic but are often wrong and may confuse an inexperienced user.
Others are skeptical about its real capabilities and have tested its reliability, as we do here.
The goal of this case study is to investigate the capabilities of ChatGPT
and the reliability of its answers. In particular, we aim to answer the
following research questions:
– RQ1: which tasks are better supported by ChatGPT?
– RQ2: what is the behaviour of ChatGPT when it does not know the answer?
– RQ3: how may we use the support offered by ChatGPT?
– RQ4: what is the opinion of expert Computer Science students about the
tool?
To answer these questions we conducted:
– a preliminary investigation on the reliability of the answers provided by Chat-
GPT, also considering related work;
– a user study involving fifteen expert users (Computer Science students) to
measure their satisfaction and collect their opinions on the tool.
The paper is organized as follows: in Sect. 2, we describe the ChatGPT tool
along with some details about generative transformers; in Sect. 3, we define an
exploratory method to analyze the capabilities of ChatGPT for various types of
tasks; in Sect. 4 we describe the user study. Section 5 discusses the results of our
investigation, and Sect. 6 concludes with the final remarks and future work.
2 ChatGPT
In this section we summarize the main steps that OpenAI followed to create
ChatGPT.
ChatGPT is a language model developed by OpenAI that belongs to the
family of Generative Pre-trained Transformer (GPT) models. The GPT models
leverage deep learning approaches to generate text and have attained state-of-the-art
performance on a variety of natural language processing tasks. ChatGPT
is the most recent model in this line, following the success of GPT-1, GPT-2,
and GPT-3.
GPT-1. GPT-1, released in 2018, was the first model in the GPT series devel-
oped by OpenAI [15]. It was a cutting-edge language model that utilized deep
learning techniques to generate text. Relying on a model size of 117 million
parameters, which was relatively large for its time, GPT-1 was trained on a
diverse range of internet text, allowing it to generate coherent and diverse
answers. The architecture of the model was based on the transformer architec-
ture, which has since become a staple of the field of natural language processing.
The self-attention mechanism in GPT-1 allowed it to effectively weigh the
importance of each word in a sentence when generating text, while the fully
connected feed-forward network enabled it to produce a representation of the
input that was more suitable for the task of text generation.
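To make the weighting idea concrete, the following is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration under simplifying assumptions, not GPT-1's actual implementation: the queries, keys, and values are taken to be the input vectors themselves (a real transformer applies learned projections first), no causal mask is applied, and the names `softmax`, `self_attention`, and `tokens` as well as the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention over equal-length vectors.

    Assumption: queries, keys and values are the inputs themselves;
    learned projection matrices and causal masking are omitted.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Each output is the weighted average of all token vectors.
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
    return outputs

# Three toy 2-dimensional "word" vectors (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)
```

Each row of `attended` blends all three input vectors, with larger weights given to the inputs most similar to the token being processed.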
Through its training on a large corpus of text, GPT-1 learned patterns
in the language and became capable of generating semantically meaningful
responses similar to the text it was trained on. This made it a valuable
tool for a wide range of natural language processing applications and established
the GPT series as a benchmark for language models.
GPT-2. OpenAI published GPT-2 in 2019 as the second model of its GPT series
[16]. GPT-2 considerably enlarged the architecture of its predecessor, GPT-1,
scaling the model up to 1.5 billion parameters. This enabled it to generate text
that was even more human-like, as well as to perform a variety of linguistic tasks,
including translation and summarization. It was therefore a major improvement
over GPT-1 and demonstrated the potential of deep learning
approaches to generate human-like text and to perform a variety of linguistic
tasks. Its success paved the way for the following models in the GPT
series.
GPT-3. OpenAI’s GPT-3, introduced in 2020, was the third model in the GPT
series [4]. With 175 billion parameters, it was the largest language model
available at the time of its publication. This enabled it to display
unparalleled performance in a variety of natural language processing tasks, such
as question-answering, summarization, and even coding. The fully connected
feed-forward network enabled the model to generate a representation of the
input that was more suited for the current task.
To answer RQ1, RQ2, and RQ3 we posed several kinds of questions to ChatGPT,
such as:
3.1 Creativity
We decided to propose tasks related to the invention of a story for a child, the
creation of a programming exam assignment with its solution, and the generation
of lyrics and music.
The first question we proposed is reported in Fig. 1, which shows the nice
story about a fox and a bear that ChatGPT generated for a child. The result is
very impressive.
3.4 Logic
4.2 Participants
To assess user satisfaction with ChatGPT we conducted a study involving
fifteen participants.
The administration of the assessment was preceded by obtaining the partici-
pants’ informed consent. Participants had the option to withdraw from the study
at any point if they felt uneasy or lost interest in continuing. This guaranteed
that the participants had complete control over their experience and were free
to make decisions based on their degree of comfort.
The participants were eleven men and four women. Nine participants were
between 24 and 30 years old, and six were between 18 and
23 years old.
All of them were Italian. Eleven participants had a bachelor’s degree in Com-
puter Science and four held a master’s degree in the same field.
In addition to their academic credentials, each participant had prior experi-
ence interacting with ChatGPT. This provided a level of familiarity and comfort
that was beneficial during the delivery of the test.
4.3 Procedure
4.4 Results
The difference between the users' perception of ChatGPT’s reliability
before and after the experience is shown in the boxplots in Fig. 10. Results
revealed that participants’ satisfaction decreased after the experience, but
less than expected given that they had learned of its indeterminate behaviour. The
median went from 6 (before) to 5 (after). As an example, P7 scored seven in
Q1 before the experience and six after. In the answer to the open
question, P7 stated: “Simple and intuitive to use. If you use it with a certain
constancy you can perceive the difference between what has been formulated by
it and what has not, and the structure of his answers. Overall, it is an excellent
trump card to be used in critical situations and beyond”.
P3 stated that: “Very useful for study support, especially in carrying out and
explaining exercises; brainstorming, or, in general, when it is necessary to discuss
or elicit ideas and there is no adequate people; precise questions, for which you
want a short and immediate answer.” According to P4, “Even if sometimes the
answers may be unreliable the tool generates text which is very realistic and polite.
For this reason it is still appealing for me”. P14’s opinion is the following: “It is
a software that helps everyone and facilitates the understanding and research of
the information you need to know. I trust it for code generation.”
Task             Question
Creativity       1.a) Write a children’s story on the fox and the bear for a child
                 1.b) Generate a complex exam track of programming in C language which requires the development of a program on arrays by using the while loop, two arrays, the % operator and provide the solution
                 Generate lyrics in John Lennon’s style.
Problem solving  3) For his aquarium Michele bought 50 fish including neons, guppies, black angels and clown loaches. 46 are not guppies, 33 are not clown loaches and neons are one more than black angels. How many neons are there?
Search activity  2.b.1) Who is Father Christopher? (he is a character of “Promessi Sposi” by Alessandro Manzoni)
                 2.b.2) Give me an example of when Father Christopher does the right thing
                 3.b.1) Do you know that Prosdocimo is a character in a Rossini opera? [20]
                 3.b.2) Did you know that Prosdocimo is a character in “Turco in Italia”? [20]
                 3.c) Please, explain the concept of pointers by using examples in C language
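The aquarium puzzle (question 3 above) has a single answer that can be checked mechanically, which makes it a convenient reference point when judging a chatbot's reply. The following is a small sketch of such a check; the function name `solve_aquarium` and its parameter names are hypothetical. If 46 of the 50 fish are not guppies there are 4 guppies, and if 33 are not clown loaches there are 17 clown loaches, which leaves 29 fish to split between neons and black angels.

```python
def solve_aquarium(total=50, not_guppies=46, not_loaches=33):
    """Return every (neons, guppies, black_angels, clown_loaches) that fits."""
    guppies = total - not_guppies          # fish that are guppies
    loaches = total - not_loaches          # fish that are clown loaches
    remaining = total - guppies - loaches  # neons + black angels
    solutions = []
    for angels in range(remaining + 1):
        neons = remaining - angels
        if neons == angels + 1:  # "neons are one more than black angels"
            solutions.append((neons, guppies, angels, loaches))
    return solutions

print(solve_aquarium())  # [(15, 4, 14, 17)]
```

The search finds exactly one consistent assignment, with 15 neons.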
In this section we try to answer the research questions on the basis of our analysis
and of the users' perceptions. This case study’s primary objective was to identify
ChatGPT’s strengths and limitations and to evaluate the effect of its reliability
on user satisfaction. The findings of the study revealed that, although
the chatbot occasionally produced incorrect responses, most users were still
satisfied with it and enjoyed interacting with it. Figure 10 depicts how the
participants’ satisfaction changed after learning that ChatGPT’s responses may not
be reliable, while Table 2 reports descriptive statistics of perceived satisfaction.
In other words, people may be more tolerant of erroneous replies while searching
for information in fields in which they are experts, as they are able to determine
the reliability of the material themselves. The ability of ChatGPT to serve users
in a range of scenarios is an additional significant finding of the study. Users
may base critical decisions on the chatbot’s responses; therefore, it is essential to
thoroughly evaluate the accuracy of the information supplied.
Fig. 10. Boxplot related to perceived satisfaction with ChatGPT collected Before and
After the study.
RQ2: what is the behaviour of ChatGPT when it does not know the
answer?
Many newspapers, talk shows and users consider ChatGPT an appropriate
general-purpose support tool.
Although participants clearly knew that the answers from ChatGPT
may sometimes be unreliable, they would continue to use it for
general research and trust it for specific tasks. Even though some of the
responses were incorrect, the average user satisfaction score was
4.7 out of 7 (median 5).
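The summary figures quoted above (mean, median, and the quartiles drawn in a boxplot) can be computed with Python's standard `statistics` module. The sketch below is only illustrative: the `before` and `after` lists are hypothetical 7-point ratings invented for the example and are not the data collected in the study, and the helper name `summarize` is likewise an assumption.

```python
import statistics

def summarize(scores):
    """Descriptive statistics of the kind a boxplot displays."""
    q1, q2, q3 = statistics.quantiles(scores, n=4)  # quartile cut points
    return {
        "min": min(scores),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(scores),
        "mean": round(statistics.mean(scores), 1),
    }

# Hypothetical ratings on a 1-7 scale (NOT the data collected in the study).
before = [7, 6, 6, 5, 7, 6, 4, 6, 5, 7]
after = [6, 5, 5, 4, 6, 5, 3, 5, 4, 6]

for label, scores in (("before", before), ("after", after)):
    print(label, summarize(scores))
```

Comparing the two summaries side by side shows at a glance whether the centre and spread of the ratings shifted between the two measurements.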
6 Conclusion
In this paper, we presented a study investigating the reliability of the answers
generated by ChatGPT and how satisfied end users, e.g., Computer Science students,
are with this chatbot, also collecting their opinions. To this aim we posed
problems of different types to ChatGPT. It generated very credible text,
such as stories or lyrics, but failed on some mathematics and logic problems,
while always trying to provide an answer. Fifteen Computer Science students used
the tool under our guidance, which showed them both appropriate and unreliable
answers. Surprisingly, they continued to appreciate the tool. This study concludes
by emphasizing the trade-off between reliability and usability in the design of
chatbots. Users are ready to accept inconsistent responses in exchange for the
convenience and accessibility offered by chatbots; nonetheless, it is essential to
ensure that chatbots are used responsibly and that users are aware of their limits.
This study was conducted on a small number of participants since
they were expert users. For more accurate results we plan to consider a larger
sample, also involving participants without previous knowledge of the chatbot.
References
1. Adamopoulou, E., Moussiades, L.: An overview of chatbot technology. In: Maglo-
giannis, I., Iliadis, L., Pimenidis, E. (eds.) AIAI 2020. IAICT, vol. 584, pp. 373–383.
Springer, Cham (2020). https://doi.org/10.1007/978-3-030-49186-4_31
2. Aydın, N., Erdem, O.A.: A research on the new generation artificial intelligence
technology generative pretraining transformer 3. In: 2022 3rd International Infor-
matics and Software Engineering Conference (IISEC), pp. 1–6. IEEE (2022)
3. Aydın, Ö., Karaarslan, E.: OpenAI ChatGPT generated literature review: digital
twin in healthcare. Available at SSRN 4308687 (2022)
4. Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural
Information Processing Systems 33, pp. 1877–1901 (2020)
5. Choi, J.H., Hickman, K.E., Monahan, A., Schwarcz, D.: ChatGPT goes to law
school. Available at SSRN (2023)
6. Fan, X., et al.: The influence of agent reliability on trust in human-agent collabo-
ration. In: Proceedings of the 15th European Conference on Cognitive Ergonomics:
The Ergonomics of Cool Interaction, pp. 1–8 (2008)
7. Forbes: Microsoft confirms its $10 billion investment into ChatGPT, changing
how Microsoft competes with Google, Apple and other tech giants.
https://www.forbes.com/sites/qai/2023/01/27/microsoft-confirms-its-10-billion-investment-into-chatgpt-changing-how-microsoft-competes-with-google-apple-and-other-tech-giants/?sh=24dd324b3624
8. Glass, A., McGuinness, D.L., Wolverton, M.: Toward establishing trust in adaptive
agents. In: Proceedings of the 13th International Conference on Intelligent User
Interfaces, pp. 227–236 (2008)
9. Katrak, M.: The role of language prediction models in contractual interpretation:
the challenges and future prospects of GPT-3. In: Legal Analytics, pp. 47–62 (2023)
10. Khanna, A., Pandey, B., Vashishta, K., Kalia, K., Pradeepkumar, B., Das, T.: A
study of today’s AI through chatbots and rediscovery of machine intelligence. Int.
J. u- and e-Serv. Sci. Technol. 8(7), 277–284 (2015)
11. Korngiebel, D.M., Mooney, S.D.: Considering the possibilities and pitfalls of gen-
erative pre-trained transformer 3 (GPT-3) in healthcare delivery. NPJ Digit. Med.
4(1), 93 (2021)
12. Krugman, P.: Does ChatGPT mean robots are coming for the skilled jobs?
The New York Times. https://www.nytimes.com/2022/12/06/opinion/chatgpt-ai-skilled-jobs-automation.html
13. Alshater, M.M.: Exploring the role of artificial intelligence in enhancing academic
performance: a case study of ChatGPT. Available at SSRN (2022)
14. Moran, S., et al.: Team reactions to voiced agent instructions in a pervasive game.
In: Proceedings of the 2013 International Conference on Intelligent User Interfaces,
pp. 371–382 (2013)
15. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving lan-
guage understanding by generative pre-training (2018)
16. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language
models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
17. Roose, K.: The brilliance and weirdness of ChatGPT. The New York Times.
https://www.nytimes.com/2022/12/05/technology/chatgpt-ai-twitter.html
18. Sandzer-Bell, E.: ChatGPT music prompts for generating chords and lyrics.
https://www.audiocipher.com/post/chatgpt-music
19. Stack Overflow. https://meta.stackoverflow.com/questions/421831/temporary-policy-chatgpt-is-banned
20. Vetere, G.: Posso chiamarti prosdocimo? perché è bene non fidarsi troppo delle
risposte di ChatGPT. https://centroriformastato.it/posso-chiamarti-prosdocimo/