Can ChatGPT Pass High School Exams on English Language Comprehension?
https://doi.org/10.1007/s40593-023-00372-z
ARTICLE
Joost C. F. de Winter1
Abstract
Launched in late November 2022, ChatGPT, a large language model chatbot, has garnered considerable attention. However, ongoing questions remain regarding its capabilities. In this study, ChatGPT was used to complete national high school exams in the Netherlands on the topic of English reading comprehension. In late December 2022, we submitted the exam questions through the ChatGPT web interface (GPT-3.5). According to official norms, ChatGPT achieved a mean grade of 7.3 on the Dutch scale of 1 to 10, comparable to the mean grade of all students who took the exam in the Netherlands (6.99). However, ChatGPT occasionally required re-prompting to arrive at an explicit answer; without these nudges, the overall grade was 6.5. In March 2023, API access was made available, and a new version of ChatGPT, GPT-4, was released. We submitted the same exams to the API, and GPT-4 achieved a score of 8.3 without a need for re-prompting. Additionally, employing a bootstrapping method that incorporated randomness through ChatGPT's 'temperature' parameter proved effective in self-identifying potentially incorrect answers. Finally, a re-assessment conducted with the GPT-4 model updated as of June 2023 showed no substantial change in the overall score. The present findings highlight significant opportunities but also raise concerns about the impact of ChatGPT and similar large language models on educational assessment.
Joost C. F. de Winter
[email protected]
1 Cognitive Robotics Department, Delft University of Technology, Delft, Netherlands
Introduction
Methods
We applied ChatGPT (version: December 15, 2022) to national exams of the VWO program (Preparatory Scientific Education) in the Netherlands that tested English reading comprehension. These high-stakes exams are administered by the Dutch organization CITO (CITO, 2023), and are mandatory for all VWO students. The VWO program is considered the most academically rigorous high-school education in the Netherlands, designed for students who intend to pursue university-level studies. In the VWO English exam, students are tasked with reading a variety of passages, such as newspaper items, and answering associated questions. Although the specific texts and questions change with each exam, the nature of the questions, such as identifying the main idea, making inferences, and interpreting vocabulary within context, remains largely consistent. In this study, only exams from 2022 were used, because ChatGPT's training data has a knowledge cut-off in 2021.
The three available exams incorporated a total of 43, 40, and 41 questions, respectively. Each exam came with an accompanying text booklet containing 11 textual passages. The three exams contained 31, 32, and 29 multiple-choice questions, respectively. These questions typically had four response options (A, B, C, and D). However, some questions offered three (A–C: 13 questions), five (A–E: 7 questions), or six (A–F: 1 question) response options. Every multiple-choice question was worth one point. Further, the exams incorporated a number of open questions, also worth one point each. These questions either required succinct answers, called for a sequence of statements to be arranged in a specific order, or asked for a response to a maximum of two sub-questions in the form of a 'yes' or 'no'. Examples of such questions were as follows (translated from Dutch to English):
● “8. ‘why should he lose his perk?’ (paragraph 4). What was this ‘perk’? Please respond in Dutch.” (Exam 1).
● “20. The text divides into critical and non-critical segments. In which paragraph does the critical part commence? Please indicate the paragraph number.” (Exam 1).
● “2. ‘European ruling’ (title). Please determine whether Mike Short identifies each of the following points as an issue with the introduction. Write “yes” or “no” alongside each number on the answer sheet.”
The remaining questions (3 to 6 per exam), worth either 2 or 4 points, were multi-part
items. More specifically, these involved scenarios where 4 to 8 statements needed a
‘yes’ or ‘no’ response, or items where multiple themes or locations in the text had to
be identified.
The text (e.g., a news item or another text fragment that was part of the exam) and
the corresponding questions were manually submitted one by one to the ChatGPT
web interface (GPT-3.5), as part of the same chat session for each exam. After the
text and before each question, a prompt was included, e.g., “Based on ‘Text 5’ above,
please choose the correct response option between A, B, C, and D for the question
below (note that the number in front of each paragraph indicates the number of the
paragraph)”.
In 15% of cases, ChatGPT appeared to misconstrue the question, leading to invalid
responses. These included not selecting any options in a multiple-choice question,
generating an entirely new question, or asserting that the correct answer could not
be determined due to insufficient information. When such a scenario occurred, the
researcher would either reiterate the question or provide further clarification.
In March 2023, API access became available, and the same exams were submitted to GPT-4 (model GPT-4-0314) through the API. Each prompt consisted of the text of the passage followed by the marker 'QUESTIONS:', which preceded all the questions for that text. This prompt design was used for all questions. It was adopted to ensure that the API only outputs the answers without additional clarifying text, which speeds up the response by ChatGPT. In all instances, GPT-4 provided concise answers, typically a letter, number,
or keyword as requested. This enabled a straightforward manual evaluation of the
examinations.
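
As an illustration of this prompt structure, a minimal Python sketch of how such a prompt could be assembled is given below. The instruction wording, function name, and example content are assumptions for illustration only; the actual prompt used in the study is shown in Fig. 1.

# Illustrative sketch of assembling one prompt per text: an instruction,
# the raw passage, and the marker 'QUESTIONS:' followed by all questions.
def build_prompt(passage_text, questions):
    instruction = ("Please answer each question below based on the text, "
                   "giving only a letter, number, or keyword as the answer.")
    question_block = "\n".join(questions)
    return instruction + "\n\n" + passage_text + "\n\nQUESTIONS:\n" + question_block

# Placeholder usage; the real passage and questions come from the exam PDFs.
example = build_prompt("(raw passage text as read from the PDF)",
                       ["36 ...", "37 ...", "38 ..."])
print(example)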
An example of a complete prompt (in this case, Text 10 from Exam 2 and three corresponding questions, Questions 36–38) is presented in Fig. 1. It can be seen that the text has not been preprocessed; the prompt represents how the PDF files of the text booklet and the exam were automatically read in, which is accompanied by unnecessary spaces, headers and footers, and information for the candidate about how many points the question is worth. Attempts were also made to submit cleaned-up text to ChatGPT instead of texts in their raw form. However, this did not appear to lead to an improvement in the accuracy of the answers provided by GPT-4. We submitted raw text data, as depicted in Fig. 1, to provide an evaluation of ChatGPT free from human intervention.
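
For illustration, the automatic reading of the PDF files could look like the following Python sketch using the pypdf library. The study itself used its own scripts (see the Supplementary Information), so this is only an assumed equivalent, and the file name is a placeholder.

from pypdf import PdfReader

# Read all pages of a text booklet and keep the raw, unprocessed text,
# including stray spaces, headers, and footers, mirroring the
# 'no preprocessing' approach described above.
reader = PdfReader("text_booklet_exam2.pdf")  # placeholder file name
raw_text = "\n".join(page.extract_text() for page in reader.pages)
print(raw_text[:500])  # inspect the first 500 characters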
More recently, in June 2023, an update of GPT-4 was released. An evaluation
by Chen et al. (2023) reported that this version performed markedly differently on certain tasks, such as answering sensitive questions or generating code, compared
to the March version, potentially having far-reaching consequences for users and
applications. However, apart from the work of Chen et al. (2023), there is currently
limited information in the literature regarding any disparities in output between the
March and June versions. Hence, we conducted a re-analysis using the June version
(GPT-4-0613).
While the web interface of ChatGPT has an element of stochasticity in its responses, the API offers the ability to modulate this randomness via a 'temperature' parameter, which is adjustable on a continuum from 0 to 2. With a temperature setting of 0, the output is highly reproducible. In contrast, a temperature setting of 2 introduces a substantial degree of randomness or creativity in the output. For our study, the temperature parameter was set to 0. The entire analysis, from automatically reading the texts and questions to letting GPT-4 produce the responses, took about 40 s per exam.
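
A minimal sketch of such a call, assuming the 2023-era (pre-1.0) openai Python package, is shown below; the API key and prompt contents are placeholders rather than material from the study.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder
prompt = "..."  # full prompt (raw text plus questions), as shown in Fig. 1

# A temperature of 0 keeps the output highly reproducible across runs.
response = openai.ChatCompletion.create(
    model="gpt-4-0314",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])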
Results
Table 1 presents the results of the exams completed by ChatGPT. In the Netherlands, a mean grade of 5.50 or higher across all courses generally implies that the student passes and obtains the high school diploma. It can be seen that GPT-3.5 would pass each of the English exams, with an overall mean grade across the exams of 7.3, whereas GPT-4-0314 and GPT-4-0613 had overall mean grades of 8.3 and 8.1, respectively.
Upon categorizing the test items into (1) multiple-choice questions, (2) open one-point questions, and (3) open questions valued at more than one point, it became evident that GPT-3.5 faced some challenges with the second category (79%, 68%, and 77% of the points earned). Compared to GPT-3.5, GPT-4 showed improvement in the multiple-choice questions (GPT-4-0314: 92%, 84%, 80%; GPT-4-0613: 92%, 63%, 83%).
It should be acknowledged that GPT-3.5 had an inherent advantage, as we occasionally provided a repeated prompt through the web interface to procure an explicit response (see Methods section for details). Upon limiting our evaluation to only the first instance of its (non-)response, GPT-3.5's mean score for the three exams was 6.5, as shown in Table 1. In contrast, the interaction with GPT-4 was fully automated via the API, without any re-prompting. We also made an attempt to use GPT-3.5 via the API (model GPT-3.5-turbo-0301), but the performance was deemed unsatisfactory: the grades for Exams 1, 2, and 3 were only 5.7, 6.7, and 7.0, respectively. The score for the first exam was particularly low, as GPT-3.5 did not answer 7 of the 43 questions. Considering these outcomes, we chose not to further investigate this approach.
As shown above, GPT-3.5 performed comparably to the average Dutch high school student in their final year of Preparatory Scientific Education, while GPT-4 outperformed the average student. However, GPT-4 was not flawless and made several mistakes on each exam. To further investigate, we explored whether GPT-4 could self-identify the questions it answered incorrectly. Initial attempts using specific prompts, such as 'Please rate the difficulty of the question', did not yield meaningful insights.
Previous research showed that implementing a self-consistency strategy may prove beneficial in directing large language models towards accurate outputs (Wang et al., 2023; Zheng et al., 2023). This entails generating a number of candidate outputs, with the most frequent or internally consistent output being selected. In the present study, we attempted to further this idea by incorporating an element of stochasticity into the output, allowing us to gauge the level of confidence exhibited by GPT-4 with respect to its own outputs. Specifically, we discovered that employing the 'temperature' parameter in conjunction with multiple repetitions yielded valuable insights.
Table 1 Performance of ChatGPT on Dutch national exams on the topic of ‘English’

Number of points:
                              GPT-3.5*    GPT-4-0314   GPT-4-0613   Maximum attainable
Exam 1 (May 2022, Period 1)   41 (36)     46           44           49
Exam 2 (June 2022, Period 2)  35 (30)     40           40           46
Exam 3 (July 2022, Period 3)  33 (30)     39           38           46
Total                         109 (96)    125          122          141

Corresponding grade (1 to 10):
                              GPT-3.5*    GPT-4-0314   GPT-4-0613
Exam 1 (May 2022, Period 1)   7.9 (7.0)   8.9          8.5
Exam 2 (June 2022, Period 2)  7.0 (6.1)   8.0          8.0
Exam 3 (July 2022, Period 3)  7.0 (6.4)   8.1          7.9
Average                       7.3 (6.5)   8.3          8.1

*Values in parentheses refer to GPT-3.5 without re-prompting.
Note. The prompting of GPT-3.5 was done through the web interface, while the prompting of GPT-4 was done using the API with a temperature setting equal to 0.

In our self-consistency analysis, we used a temperature parameter of 1. This decision was informed by presenting a single prompt a large number of times under different temperature settings. The particular prompt used (see Fig. 1) required ChatGPT
to answer three questions, with the correct answers being C, B, and B, respectively.
Figure 2 shows how the correctness of answers varied with different temperature settings, from the minimum value of 0 to the maximum possible value of 2.0, in increments of 0.1. It can be seen that with the temperature set at 0, Questions 36 and 38 were answered correctly, while Question 37 was answered incorrectly. As the temperature increased, Questions 36 and 38 consistently received correct answers, suggesting that ChatGPT had confidence in its answers. Nonetheless, at very high temperature settings, ChatGPT produced incorrect responses. Upon further investigation, we noted that these faulty answers were not inaccurate responses to the multiple-choice questions, but rather 'hallucinations' from ChatGPT, where it generated nonsensical text instead of an answer to the questions. In contrast, Question 37 elicited a level of uncertainty from ChatGPT: as the temperature rose, the correct answer surfaced approximately 20% of the time. Based on these observations, we decided to conduct a bootstrapping analysis for assessing self-consistency using a temperature setting of 1.0. This value ensures a degree of output variation, yet it restricts excessive variation that might cause ChatGPT to generate arbitrary text.
After deciding upon the temperature setting of 1.0, we submitted each of the three exams to GPT-4, 50 times each. The number 50 is a trade-off: too few repetitions carry the risk that GPT-4 coincidentally produces the same output multiple times in a row, appearing consistent when it actually is not. On the other hand, too many repetitions involve unnecessary use of computational resources and can give the false suggestion that GPT-4 is not consistent if it only very rarely produces an alternative output. The 50 repetitions were accomplished by setting the 'n' parameter in the API to 50.
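
A sketch of this repeated sampling, again assuming the pre-1.0 openai Python package, could look as follows; the prompt is a placeholder.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder
prompt = "..."  # full exam prompt (text plus questions)

# One request returns 50 completions for the same prompt (n=50), sampled
# with temperature=1 so that a moderate amount of variation is possible.
response = openai.ChatCompletion.create(
    model="gpt-4-0314",
    messages=[{"role": "user", "content": prompt}],
    temperature=1,
    n=50,
)
answers = [choice["message"]["content"] for choice in response["choices"]]
print(len(answers))  # 50 sampled answer sets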
The performance of GPT-4 for each exam question was then manually scored as above, and classified into three categories (a code sketch of this classification follows the list):
● Questions for which GPT-4 answered correctly in all 50 attempts, indicating high consistency.
● Questions for which GPT-4 provided the same incorrect response in all 50 attempts, indicating high consistency but a wrong answer. In the case of a question worth more than one point, not achieving the full points was also considered an incorrect response for the respective question.
● Questions for which GPT-4 provided at least two different responses over the 50 attempts, indicating inconsistency.
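
The classification can be expressed compactly as in the Python sketch below. The data structures (parsed answers per question and an answer key) are hypothetical, and the sketch simplifies multi-point questions to an exact match, whereas the study counted anything short of full points as incorrect.

# Classify a question given its 50 sampled answers and the correct answer.
def classify(answers, correct):
    n_correct = sum(a == correct for a in answers)
    if n_correct == len(answers):
        return "consistently correct"        # correct in all 50 attempts
    if len(set(answers)) == 1:
        return "consistently incorrect"      # same wrong answer every time
    return "inconsistent"                    # at least two different responses

# Hypothetical example: Question 36 always 'C', Question 37 mixed.
answers_per_question = {"36": ["C"] * 50, "37": ["B"] * 12 + ["D"] * 38}
answer_key = {"36": "C", "37": "B"}
for qid, sampled in answers_per_question.items():
    print(qid, classify(sampled, answer_key[qid]))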
Our analysis showed that out of 124 questions across the three exams combined, GPT-4-0314 displayed inconsistency in 23 cases (19%), while GPT-4-0613 displayed inconsistency in 25 cases (20%) (see Table 2). Among these inconsistent responses, a substantial portion (10, or 43%, for GPT-4-0314; 14, or 56%, for GPT-4-0613) were indeed incorrect. In contrast, of the 101 and 99 consistent responses for GPT-4-0314 and GPT-4-0613, only 4 and 5, respectively, were incorrect ("consistently incorrect").
The final grades on a scale from 1 to 10 were calculated by averaging the results
over 50 repetitions, and then further averaging across three exams. GPT-4-0314
scored an average of 8.11 (Exams 1–3: 8.63, 7.80, 7.89), while GPT-4-0613 obtained
an average score of 8.21 (Exams 1–3: 8.64, 7.97, 8.01). These results are comparable to those presented in Table 1, where the mean score across the three exams was 8.3 and 8.1 for GPT-4-0314 and GPT-4-0613, respectively. In summary, the exam grades based on bootstrapping align with the original analysis from Table 1, where a temperature setting of 0 was used.

Fig. 2 Percentage of correct responses obtained by submitting the prompt shown in Fig. 1 a total of 320 times (model: GPT-4-0314). The procedure was repeated for 21 different temperature settings, from 0 to the maximum value of 2.0

Table 2 Number of exam questions per category based on the consistency of 50 GPT-4 attempts (GPT-4-0314 / GPT-4-0613)

                                                                Exam 1    Exam 2    Exam 3    Total
Consistently correct (50 times correct)                         36 / 33   31 / 31   30 / 30   97 / 94
Consistently incorrect (50 times the same incorrect response)   1 / 2     3 / 1     0 / 2     4 / 5
Inconsistent (1–49 times the correct response)                  6 / 8     6 / 8     11 / 9    23 / 25
  of which GPT-4 answered incorrectly at temperature = 0        2 / 4     2 / 5     6 / 5     10 / 14
  of which GPT-4 answered correctly at temperature = 0          4 / 4     4 / 3     5 / 4     13 / 11
Total                                                           43        40        41        124
Figure 3 illustrates the performance of the two ChatGPT versions on Exam 1 relative to all students who completed this exam. The students' mean number of points was 35.88 (SD = 6.48) out of a maximum of 49, and the mean of their grades was 6.99. This analysis was conducted only for one of the three exams, as the students' results for the other two exams were not publicly available.
Fig. 3 Distribution of student scores on Exam 1 (n = 35,698). The performance of GPT-3.5 (corresponding to the 46th and 76th percentiles), GPT-4 (97th and 91st percentiles), and bootstrapped GPT-4 (mean score of repetitions: 44.58 and 44.54) is depicted
Discussion
The present study's results indicate that GPT-3.5 performs comparably to, while GPT-4 substantially outperforms, the average Dutch student in the domain of English language comprehension. Although students are prohibited from using computers during conventional in-person examinations, our findings suggest that ChatGPT could compromise the integrity of computer-based exams, which have gained popularity in the wake of the COVID-19 pandemic (Kerrigan et al., 2022; Pettit et al., 2021). Educators may presume that online exams with minimal supervision are secure in subjects such as comprehension, where answers are unlikely to be readily accessible online. However, this assumption may no longer hold, given that our study demonstrates the generation of valid answers within minutes. Concurrently, there are concerns that ChatGPT could be exploited for cheating on assessments (Cotton et al., 2023; Mitchell, 2022), necessitating a reevaluation of current methods for assessing student knowledge. Potential solutions include increased proctoring, reduced reliance on essay-based work, and the utilization of alternative assignment formats, such as videos or presentations (Geerling et al., 2023; Graham, 2022; Rudolph et al., 2023; Susnjak, 2022).
On a positive note, ChatGPT holds the potential to foster innovation in the realm of education. Possible applications encompass aiding the development of writing skills, facilitating comprehension through step-by-step explanations, speeding up information delivery via summarization of texts, and enhancing engagement through personalized feedback (Kasneci et al., 2023; Rudolph et al., 2023; Šlapeta, 2023). It is worth considering whether the focus of student assessment should transition towards the effective utilization of ChatGPT and similar language models. For instance, it may be advisable to instruct students on identifying inaccuracies in content generated by ChatGPT.
Acknowledgements Dr. Dimitra Dodou's role in scoring the output of ChatGPT according to the correction instruction is acknowledged.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License,
which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
to the material. If material is not included in the article’s Creative Commons licence and your intended use
is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission
directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
Arora, D., & Singh, H. G. (2023). Have LLMs advanced enough? A challenging problem solving benchmark for large language models. arXiv. https://doi.org/10.48550/arXiv.2305.15074.
Bommarito, M. J., II, & Katz, D. M. (2022). GPT takes the Bar Exam. arXiv. https://arxiv.org/abs/2212.14402.
Bordt, S., & Von Luxburg, U. (2023). ChatGPT participates in a computer science exam. arXiv. https://arxiv.org/abs/2303.09461.
Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., Nori, H., Palangi, H., Ribeiro, M. T., & Zhang, Y. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv. https://arxiv.org/abs/2303.12712.
Chen, L., Zaharia, M., & Zou, J. (2023). How is ChatGPT’s behavior changing over time? arXiv. https://doi.org/10.48550/arXiv.2307.09009.
CITO (2023). CITO: toetsen, examens, volgsystemen, certificeringen en trainingen [CITO: tests, exams, tracking systems, certifications, and trainings]. https://cito.nl.
CITO (2022). Toets en item analyse VWO Engels 2022 tijdvak 1 [Test and item analysis VWO English 2022 period 1]. https://www2.cito.nl/vo/ex2022/VW-1002-a-22-1-TIA.docx.
College voor Toetsen en Examens (2020). Syllabus centraal examen 2022 Arabisch, Duits, Engels, Frans, Russisch, Spaans, Turks [Syllabus central exams 2022 Arabic, German, English, French, Russian, Spanish, Turkish]. https://havovwo.nl/pics/vmvtsyl22.pdf.
College voor Toetsen en Examens (2022). Engels VWO 2022. https://www.examenblad.nl/examen/engels-vwo-2/2022.
Cotton, D. R. E., Cotton, P. A., & Shipway, J. R. (2023). Chatting and cheating: Ensuring academic integrity in the era of ChatGPT. Innovations in Education and Teaching International. https://doi.org/10.1080/14703297.2023.2190148.
Davis, J. C., Lu, Y. H., & Thiruvathukal, G. K. (2023). Conversations with ChatGPT about C programming: An ongoing study. Figshare. https://figshare.com/articles/preprint/Conversations_with_ChatGPT_about_C_Programming_An_Ongoing_Study/22257274.
Frieder, S., Pinchetti, L., Griffiths, R. R., Salvatori, T., Lukasiewicz, T., Petersen, P. C., Chevalier, A., & Berner, J. (2023). Mathematical capabilities of ChatGPT. arXiv. https://doi.org/10.48550/arXiv.2301.13867.
Geerling, W., Mateer, G. D., Wooten, J., & Damodaran, N. (2023). ChatGPT has mastered the principles of economics: Now what? SSRN. https://doi.org/10.2139/ssrn.4356034.
Gilson, A., Safranek, C., Huang, T., Socrates, V., Chi, L., Taylor, R. A., & Chartash, D. (2022). How well does ChatGPT do when taking the medical licensing exams? The implications of large language models for medical education and knowledge assessment. medRxiv. https://doi.org/10.1101/2022.12.23.22283901.
Graham, F. (2022). Daily briefing: Will ChatGPT kill the essay assignment? Nature. https://doi.org/10.1038/d41586-022-04437-2.
Han, Z., Battaglia, F., Udaiyar, A., Fooks, A., & Terlecky, S. R. (2023). An explorative assessment of ChatGPT as an aid in medical education: Use it with caution. medRxiv. https://doi.org/10.1101/2023.02.13.23285879.
Huang, F., Kwak, H., & An, J. (2023). Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech. Companion Proceedings of the ACM Web Conference, Austin, TX, 294–297. https://doi.org/10.1145/3543873.3587368.
Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., Krusche, S., Kutyniok, G., Michaeli, T., Nerdel, C., Pfeffer, J., Poquet, O., Sailer, M., Schmidt, A., Seidel, T., & Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, 102274. https://doi.org/10.1016/j.lindif.2023.102274.
Katz, D. M., Bommarito, M. J., Gao, S., & Arredondo, P. (2023). GPT-4 passes the bar exam. SSRN. https://doi.org/10.2139/ssrn.4389233.
Kerrigan, J., Cochran, G., Tabanli, S., Charnley, M., & Mulvey, S. (2022). Post-COVID changes to assessment practices: A case study of undergraduate STEM recitations. Journal of Educational Technology Systems, 51, 192–201. https://doi.org/10.1177/00472395221118392.
Kim, N., Htut, P. M., Bowman, S. R., & Petty, J. (2022). (QA)2: Question answering with questionable assumptions. arXiv. https://arxiv.org/abs/2212.10003.
King, M. R. (2023). The future of AI in medicine: A perspective from a chatbot. Annals of Biomedical Engineering, 51, 291–295. https://doi.org/10.1007/s10439-022-03121-w.
Kirmani, A. R. (2023). Artificial Intelligence-enabled science poetry. ACS Energy Letters, 8, 574–576. https://doi.org/10.1021/acsenergylett.2c02758.
Kortemeyer, G. (2023). Could an artificial-intelligence agent pass an introductory physics course? Physical Review Physics Education Research, 19, 010132. https://doi.org/10.1103/PhysRevPhysEducRes.19.010132.
Kosinski, M. (2023). Theory of mind may have spontaneously emerged in large language models. arXiv. https://doi.org/10.48550/arXiv.2302.02083.
Kung, T. H., Cheatham, M., Medenilla, A., Sillos, C., De Leon, L., Elepaño, C., Madriaga, M., Aggabao, R., Diaz-Candido, G., Maningo, J., & Tseng, V. (2023). Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health, 2, e0000198. https://doi.org/10.1371/journal.pdig.0000198.
Kuzman, T., Ljubešić, N., & Mozetič, I. (2023). ChatGPT: Beginning of an end of manual annotation? Use case of automatic genre identification. arXiv. https://arxiv.org/abs/2303.03953.
LeCun, Y. (2023). Do large language models need sensory grounding for meaning and understanding? Spoiler: YES! [Presentation]. https://drive.google.com/file/d/1BU5bV3X5w65DwSMapKcsr0ZvrMRU_Nbi/view.
Lovin, B. (2022, December 3). ChatGPT produces made-up nonexistent references. https://brianlovin.com/hn/33841672.
Mitchell, A. (2022, December 26). Professor catches student cheating with ChatGPT: ‘I feel abject terror’. https://nypost.com/2022/12/26/students-using-chatgpt-to-cheat-professor-warns.
Newton, P. M., & Xiromeriti, M. (2023). ChatGPT performance on MCQ-based exams. EdArXiv. https://doi.org/10.35542/osf.io/sytu3.
Office Microsoft Blog (2023). Introducing Microsoft 365 Copilot – your copilot for work. https://blogs.microsoft.com/blog/2023/03/16/introducing-microsoft-365-copilot-your-copilot-for-work.
OpenAI (2023). GPT-4 technical report. https://cdn.openai.com/papers/gpt-4.pdf.
Pettit, M., Shukla, S., Zhang, J., Sunil Kumar, K. H., & Khanduja, V. (2021). Virtual exams: Has COVID-19 provided the impetus to change assessment methods in medicine? Bone & Joint Open, 2, 111–118. https://doi.org/10.1302/2633-1462.22.BJO-2020-0142.R1.
Reiss, M. V. (2023). Testing the reliability of ChatGPT for text annotation and classification: A cautionary remark. arXiv. https://doi.org/10.48550/arXiv.2304.11085.
Rospocher, M., & Eksir, S. (2023). Assessing fine-grained explicitness of song lyrics. Information, 14, 159. https://doi.org/10.3390/info14030159.
Rudolph, J., Tan, S., & Tan, S. (2023). ChatGPT: Bullshit spewer or the end of traditional assessments in higher education? Journal of Applied Learning and Teaching, 6. https://doi.org/10.37074/jalt.2023.6.1.9.
Savelka, J., Agarwal, A., An, M., Bogart, C., & Sakr, M. (2023). Thrilled by your progress! Large language models (GPT-4) no longer struggle to pass assessments in higher education programming courses. arXiv. https://doi.org/10.48550/arXiv.2306.10073.
Šlapeta, J. (2023). Are ChatGPT and other pretrained language models good parasitologists? Trends in Parasitology. https://doi.org/10.1016/j.pt.2023.02.006.
Sobania, D., Briesch, M., Hanna, C., & Petke, J. (2023). An analysis of the automatic bug fixing performance of ChatGPT. arXiv. https://doi.org/10.48550/arXiv.2301.08653.
Susnjak, T. (2022). ChatGPT: The end of online exam integrity? arXiv. https://arxiv.org/abs/2212.09292.
Tabone, W., & De Winter, J. (2023). Using ChatGPT for human–computer interaction research: A primer. Royal Society Open Science, 10, 231053. https://doi.org/10.1098/rsos.231053.
Vincent, J. (2022, December 5). AI-generated answers temporarily banned on coding Q&A site Stack Overflow. https://www.theverge.com/2022/12/5/23493932/chatgpt-ai-generated-answers-temporarily-banned-stack-overflow-llms-dangers.
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2023). Self-consistency improves chain of thought reasoning in language models. Proceedings of the International Conference on Learning Representations, Kigali, Rwanda. https://doi.org/10.48550/arXiv.2203.11171.
Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., & Fedus, W. (2022). Emergent abilities of large language models. arXiv. https://doi.org/10.48550/arXiv.2206.07682.
Whitford, E. (2022, December 9). A computer can now write your college essay — Maybe better than you can. https://www.forbes.com/sites/emmawhitford/2022/12/09/a-computer-can-now-write-your-college-essay---maybe-better-than-you-can/?sh=35deca9ddd39.
Zhai, X. (2022). ChatGPT user experience: Implications for education. ResearchGate. https://www.researchgate.net/publication/366463233_ChatGPT_User_Experience_Implications_for_Education.
Zheng, C., Liu, Z., Xie, E., Li, Z., & Li, Y. (2023). Progressive-hint prompting improves reasoning in large language models. arXiv. https://doi.org/10.48550/arXiv.2304.09797.
Zhong, Q., Ding, L., Liu, J., Du, B., & Tao, D. (2023a). Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. arXiv. https://doi.org/10.48550/arXiv.2302.10198.
Zhong, W., Cui, R., Guo, Y., Liang, Y., Lu, S., Wang, Y., Saied, A., Chen, W., & Duan, N. (2023b). AGIEval: A human-centric benchmark for evaluating foundation models. arXiv. https://doi.org/10.48550/arXiv.2304.06364.
Supplementary Information All inputs (prompts) and outputs of ChatGPT, as well as the MATLAB scripts used to access the API and create the figures, can be found here: https://doi.org/10.4121/545f8ead-235a-4eb6-8f32-aebb030dbbad.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.