1. Introduction

Large Language Models (LLMs), the recent buzz in Artificial Intelligence, have garnered a lot of attention in both academic and industry circles with their remarkable performances in most of the natural language processing (NLP) tasks. These models are essentially deep learning models, specifically transformer-based, pretrained on large volumes of text data and then aligned to human preferences using meta-training. Pretraining provides universal language knowledge to the model (Kalyan et al., 2021), while meta-training aligns the model to act based on the user's intentions. Here, the user's intention includes both explicit intentions, like following instructions, and implicit intentions, like maintaining truthfulness and avoiding bias, toxicity, or any harmful behaviour (Ouyang et al., 2022). Large language models (LLMs) are a special class of pretrained language models obtained by scaling model size, pretraining corpus and computation. For downstream task usage, PLMs leverage the supervised learning paradigm, which involves task-specific fine-tuning and hundreds or thousands of labelled instances (Kalyan et al., 2021, 2022). LLMs leverage in-context learning (ICL), a new learning paradigm which does not require task-specific fine-tuning and a large number of labelled instances (Brown et al., 2020). LLMs treat any NLP task as a conditional text generation problem and generate the desired text output just by conditioning on the input prompt, which includes the task description, test input and, optionally, a few examples. Fig. 1 shows the evolution of artificial intelligence from machine learning to LLMs.

In the beginning, NLP systems were predominantly rule-based. These rule-based models are built on top of domain expert-framed rules. As manual rule framing is a laborious, expensive process that also requires frequent changes, rule-based models were gradually replaced by machine learning models, which learn the rules automatically from the training data and completely avoid manual rule framing (Kalyan et al., 2021). However, machine learning models require human intervention in the form of domain experts for feature engineering. With the evolution of dense text vector representation models like Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and FastText (Bojanowski et al., 2017) and the advancement of computer hardware like GPUs, NLP systems were built using traditional deep learning models like CNN (Kalchbrenner et al., 2014), RNN (Salehinejad et al., 2017), LSTM (Hochreiter and Schmidhuber, 1997), GRU (Chung et al., 2014), Seq2Seq (Sutskever et al., 2014) and attention-based Seq2Seq models (Bahdanau et al.,
2015; Luong et al., 2015). However, the drawbacks of these models, like the inability to (i) capture long-term dependencies and (ii) leverage GPUs fully because of sequential processing (except in the case of CNN), resulted in the evolution of advanced deep learning models like transformers (Vaswani et al., 2017), which are fully attention based without any recurrent and convolution layers.

Inspired by the success of image-pretrained models (Krizhevsky et al., 2012; Simonyan and Zisserman, 2015; Szegedy et al., 2015) built on top of transfer learning and large convolution models, the research community focused on building pretrained language models (PLMs) like BERT (Devlin et al., 2018) and GPT-1 (Radford et al., 2018) with transformers as the backbone, pretrained based on a new learning paradigm called self-supervised learning (Kalyan et al., 2021; Liu et al., 2021c; Gui et al., 2023). Unlike traditional deep learning models and vanilla transformers, which require training from scratch for downstream usage, PLMs can be easily adapted to downstream tasks with fine-tuning. The huge success of the BERT and GPT-1 models triggered the development of other PLMs like RoBERTa, XLNet (Yang et al., 2019), ELECTRA (Clark et al., 2019), ALBERT (Lan et al., 2019), DeBERTa (He et al., 2022a, 2020), GPT-2 (Radford et al., 2019), T5 (Raffel et al., 2020), BART (Lewis et al., 2020) etc.

Although PLMs have many advantages compared to traditional deep learning and vanilla transformer models, they still suffer from drawbacks like the inability to generalize to unseen tasks without task-specific training. So, the research community focused on developing more advanced models like LLMs which can generalize to unseen tasks without any task-specific training. The era of LLMs started with GPT-3 (Brown et al., 2020), and the success of GPT-3 inspired the development of other LLMs like PaLM (Chowdhery et al., 2022), Chinchilla (Hoffmann et al., 2022), GLaM (Du et al., 2022), LaMDA (Thoppilan et al., 2022), Gopher (Rae et al., 2021), Megatron–Turing NLG (Smith et al., 2022; Du and Cardie, 2020), BLOOM (Scao et al., 2022), Galactica (Taylor et al., 2022), OPT (Zhang et al., 2022), LLaMA (Touvron et al., 2023a,b) etc. The popularity of LLMs is increasing exponentially after the recent launch of Open AI's models like ChatGPT and GPT-4 (OpenAI, 2023). For example, ChatGPT has garnered millions of users within a few weeks of its launch. Because of the ability to generalize to unseen tasks based on the task description and a few examples without requiring any task-specific training, just like humans, LLMs can be considered as a baby step towards Artificial General Intelligence (Bubeck et al., 2023). In this survey paper, we mainly focus on Open AI LLMs like GPT-3 models, GPT-3.5 models (InstructGPT, ChatGPT etc.) and GPT-4, which we refer to as GPT-3 family large language models (GLLMs). This survey paper provides a comprehensive review of research works related to GLLMs in multiple dimensions.

Contributions. The key contributions of this survey paper are

• First survey paper to present a comprehensive review of GPT-3 family large language models (GLLMs) in multiple dimensions covering more than 350 recent research papers.
• We discuss various foundation concepts like transformers, transfer learning, self-supervised learning, pretrained language models and large language models.
• We discuss GPT-3 family large language models in detail, starting from GPT-3 to the latest ChatGPT and GPT-4.
• We discuss the performances of GLLMs in various downstream tasks and present a thorough discussion on the data labelling and data augmentation abilities of GLLMs.
• We discuss the robustness and the evaluation abilities of GLLMs.
• We present multiple insightful future research directions which will guide the research community to improve the performances of GLLMs further.

Comparison with existing surveys. The existing survey papers provide a review of LLMs (Zhao et al., 2023d) and the relevant concepts like in-context learning (Dong et al., 2022), evaluation (Chang et al., 2023; Zhuang et al., 2023), alignment with human values (Wang et al., 2023o; Liu et al., 2023k), safety and trustworthiness (Huang et al., 2023c), reasoning (Huang and Chang, 2022), challenges and applications (Kaddour et al., 2023), LLM compression (Zhu et al., 2023a), prompting frameworks (Liu et al., 2023h), security risks (Derner et al., 2023), chain-of-thought prompting (Zhang et al., 2023i), open-source LLMs (Chen et al., 2023c) and multi-modal LLMs (Yin et al., 2023). For example, Zhao et al. (2023d) are the first to provide a comprehensive survey of LLMs. Unlike Zhao et al. (2023d), the other existing survey papers focus on specific concepts of LLMs. For example, the survey papers written by Dong et al. (2022), Chang et al. (2023), Wang et al. (2023o) and Huang and Chang (2022) focus on in-context learning, evaluation of LLMs, alignment of LLMs with human values and the reasoning ability of LLMs, respectively. Similarly, the survey papers written by Yin et al. (2023) and Huang et al. (2023c) provide a review of multi-modal LLMs and the safety and trustworthiness of LLMs, respectively. However, there is no existing survey paper which provides a comprehensive survey of GPT-3 family LLMs. With the ever-rising popularity of GPT-3 family LLMs like GPT-3, InstructGPT, ChatGPT, GPT-4 etc. and a lot of research works using these models, there is a strong need for a survey paper which focuses exclusively on GPT-3 family LLMs.

Papers collection. For this survey paper, we gathered over 350 research papers that appeared online in the period of June 2020 to September 2023. Initially, we selected GLLM papers like the GPT-3, InstructGPT, Codex and GPT-4 papers as seed papers and collected all the citing papers. We also collected papers from popular venues like ACL, EMNLP, COLING, AAAI, ICML, ICLR, NeurIPS etc. and popular databases like
Google Scholar and ScienceDirect using the keywords GPT-3, ChatGPT, GPT-3.5, InstructGPT, Codex and GPT-4. After removing the duplicate papers, we did a manual review to arrive at a final set of over 350 relevant research papers.

Survey paper organization. The survey paper is organized as follows: Section 2 presents a brief overview of various foundation concepts like transformers, transfer learning, self-supervised learning, pretrained language models and large language models. Section 3 presents GPT-3 family LLMs in detail, starting from GPT-3 to the latest ChatGPT and GPT-4. Sections 4, 5, and 6 discuss the performances of GLLMs in various downstream tasks, specific domains and multilingual scenarios, respectively. Section 7 presents the data labelling and data augmentation abilities of GLLMs. Section 8 discusses various research works presenting approaches to detect text generated by GLLMs. Sections 9, 10 and 11 discuss the evaluation, robustness and evaluation abilities of GLLMs, respectively. Section 12 presents multiple insightful future research directions.

2. Foundation concepts

2.1. Transformer

2.1.1. Traditional deep learning models
Before the evolution of the transformer model, most of the research in natural language processing involved deep learning models like the multi-layer perceptron (MLP), convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory (LSTM) network, gated recurrent unit (GRU), sequence-to-sequence and attention-based sequence-to-sequence models (Young et al., 2018). MLP is a feed-forward neural network with three or more layers (input layer, one or more hidden layers, and output layer), and the neurons in these layers are fully connected. MLPs are easy to understand and simple to implement. However, as MLPs ignore the sequence information and struggle to capture the semantic relationships, these models are subsequently replaced by advanced models like CNN and RNN. CNN, originally developed to process images, is also explored for natural language processing tasks by treating text as a one-dimensional image (Kalchbrenner et al., 2014; Kim, 2014). CNNs can learn local features (n-grams) effectively using convolution layers but struggle to capture long-term dependencies. RNNs evolved as a deep learning model exclusively to process sequential data like text, time series, etc. (Salehinejad et al., 2017). RNNs can handle input with varying lengths and process sequential data by maintaining a hidden state to capture the context from previous inputs. However, RNNs suffer from the vanishing gradients problem and struggle to capture long-term dependencies. LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Chung et al., 2014; Cho et al., 2014) evolved as advanced RNN variants to address the issues with the vanilla RNN model. The gating mechanism in these models helps to regulate the flow of information along the sequence and retain the most important information. Compared to LSTM, which includes three gates (input, forget and output gates), GRU is more parameter efficient as it includes only two gates, namely the update and the reset gates.

RNN and its variants like LSTM and GRU expect the input and output sequences to be of the same length. However, in the case of natural language generation tasks like machine translation, text summarization, etc., the input and output sequences can be of different lengths. So, the researchers introduced the sequence-to-sequence (Seq2Seq) model to handle tasks with different input and output sequence lengths (Sutskever et al., 2014). The Seq2Seq model was originally developed for machine translation and later explored for other NLP tasks. The Seq2Seq model consists of an encoder and a decoder based on RNN, LSTM or GRU to process the input sequence and generate the output sequence. The encoder processes the input sequence to generate a fixed-size context vector, based on which the decoder generates the output sequence. However, the fixed-size context vector fails to encode the entire information in the input sequence, especially when the input sequence is long (Bahdanau et al., 2015). The attention mechanism is introduced to address this issue, allowing the decoder to focus on the relevant input tokens at each decoding step (Bahdanau et al., 2015; Luong et al., 2015). However, as the encoder and decoder of the Seq2Seq model are based on RNN and its variants, the Seq2Seq model suffers from vanishing gradients and struggles to capture long-term dependencies.
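As a minimal illustration of the attention idea just described (the array sizes and values are hypothetical, and real Seq2Seq models use learned scoring functions), the following NumPy sketch scores the encoder hidden states against the current decoder state and builds a fresh context vector for that decoding step, instead of forcing the decoder to rely on a single fixed-size vector.

import numpy as np

# Hypothetical sizes: 5 source tokens, hidden dimension 8.
encoder_states = np.random.randn(5, 8)   # one hidden vector per input token
decoder_state = np.random.randn(8)       # decoder hidden state at the current step

scores = encoder_states @ decoder_state            # dot-product relevance score per input token
weights = np.exp(scores) / np.exp(scores).sum()    # softmax over the input tokens
context = weights @ encoder_states                 # weighted sum of encoder states
# The context vector is recomputed at every decoding step, so the decoder can
# focus on different input tokens while generating each output token.
print(weights.round(3), context.shape)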
2.1.2. Drawbacks of traditional deep learning models
The drawbacks of traditional deep learning models are:

• Lack of sequence and semantic understanding - MLPs ignore sequence information, treating all input tokens as independent. Moreover, MLPs can learn statistical patterns but struggle to capture semantic information in the input sequence.
• Computationally expensive - CNNs require a large number of parameters to achieve good results. Although LSTM and GRU address the limitations of vanilla RNNs to some extent, these models include a gating mechanism which significantly increases the number of model parameters. The large number of parameters makes these models computationally expensive to train and use.
• Vanishing gradients - RNNs suffer from the vanishing gradients problem. Although LSTM and GRU address this problem to some extent, these models also suffer from the vanishing gradient problem and have difficulties in capturing long-term dependencies (Qiu et al., 2020; Kalyan et al., 2021).
• Sequential computation - RNN and its variants process the input sequence token by token, i.e., sequentially. This sequential computation is a bottleneck for these models to leverage the parallel computing capability of advanced computing hardware like GPUs and TPUs (Vaswani et al., 2017; Kalyan et al., 2021). This sequential computation also slows down training and inference processes, especially for long sequences.

2.1.3. Transformer description
The transformer model evolved as an effective alternative to traditional deep learning models and addressed most of the associated issues (Vaswani et al., 2017). In no time, the transformer model, with its novel and efficient architecture, gained a lot of popularity and became a de facto choice for building PLMs and LLMs using the self-supervised learning paradigm (Kalyan et al., 2021; Zhao et al., 2023d). The key ingredient behind the massive success of the transformer model is its self-attention mechanism. The self-attention mechanism allows the transformer model to process the input sequence without using recurrent or convolution layers. When compared to convolution and recurrent layers, the self-attention mechanism can better capture long-range dependencies in the input sequence, which makes the transformer model highly effective for natural language understanding and generation tasks. Although the self-attention mechanism is comparatively better, it still has difficulties in handling long input sequences because of its quadratic complexity (both time and memory) (Tay et al., 2022; Li et al., 2023e). This drawback is later addressed in efficient transformer variants like Linformer (Wang et al., 2020a), Performer (Choromanski et al., 2020), Longformer (Beltagy et al., 2020) etc. For a detailed discussion on efficient transformer variants, refer to the survey paper by Tay et al. (2022).

In this paper, we present the description of the vanilla transformer model (Vaswani et al., 2017). The transformer consists of encoder and decoder components. The encoder processes the input text using a stack of encoder layers and then produces rich contextualized vector representations for each token in the input sequence, which are later used by the decoder. Each encoder layer consists of a self-attention mechanism and a feedforward neural network. The self-attention mechanism adds contextual information to the token vectors by allowing each token to attend to all other input tokens, and this helps the model capture long-term dependencies better. After the self-attention
mechanism, the token vectors are passed through a feedforward neural network, which introduces non-linearity and further transforms the representations. In this way, each encoder layer applies the self-attention mechanism and feed-forward network to add more contextual information to the token vector representations.

The decoder receives the output from the last encoder layer and processes it sequentially by applying a stack of layers, with each decoder layer having masked self-attention, encoder–decoder self-attention and a feed-forward neural network. The masked self-attention allows each token to attend to the previously generated tokens only and prevents the model from attending to future tokens. The encoder–decoder self-attention allows the decoder to attend to the encoded input sequence and helps the decoder focus on relevant input sequence tokens to generate the output tokens.

The self-attention mechanism in the transformer uses multiple attention heads, which allow the model to learn different aspects of relationships between tokens and encode more contextual information in the token representations. The encoder and decoder layers also include the embedding layer, residual connections (He et al., 2016) and layer normalization (Ba et al., 2016). The embedding layer transforms input tokens into vector representations, where each vector representation encodes both the meaning and position information. The residual connections and layer normalization are applied after the self-attention mechanism and the feed-forward network. The residual connection (He et al., 2016) avoids vanishing gradients and ensures a smooth flow of gradients, while layer normalization (Ba et al., 2016) is applied to normalize the token representations and stabilize training. Apart from the embedding layer and stack of decoder layers, the decoder also includes an output layer. The output layer is nothing but a softmax layer that assigns probabilities to each token in the vocabulary, indicating the likelihood of each token being the next word in the generated sequence.
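The core computation described above can be sketched compactly. The single-head NumPy example below is a simplified illustration with hypothetical dimensions, not the full multi-head, batched implementation of Vaswani et al. (2017); it shows scaled dot-product self-attention together with the causal mask used by the decoder's masked self-attention.

import numpy as np

def self_attention(X, Wq, Wk, Wv, causal=False):
    # Single-head scaled dot-product self-attention over token vectors X.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # token-to-token similarity scores
    if causal:
        # Masked self-attention: each token may attend only to itself and earlier tokens.
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax per token
    return weights @ V                        # contextualized token representations

X = np.random.randn(4, 16)                   # 4 tokens, model dimension 16 (hypothetical)
Wq, Wk, Wv = (np.random.randn(16, 16) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)                # encoder-style self-attention
print(self_attention(X, Wq, Wk, Wv, causal=True).shape)   # decoder-style masked self-attention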
2.2. Transfer learning

2.2.1. Why transfer learning?
Although machine learning models tasted some success, these models require feature engineering, which is a laborious and expensive process involving human intervention in the form of domain experts (Kalyan et al., 2021). Deep learning models, essentially a subset of machine learning, do not require feature engineering as deep learning models learn features during training. Over the years, deep learning witnessed the evolution of various models like the multi-layer perceptron (MLP), convolution neural networks (CNN), recurrent neural networks (RNN), long short-term memory networks (LSTM), gated recurrent unit networks (GRU), encoder–decoder networks, encoder–decoder with attention networks and, recently, transformers (Young et al., 2018; Otter et al., 2020). Even though deep learning models eliminated the requirement of manual feature engineering and achieved significant progress, the main drawback with these models is the requirement of a large amount of labelled data to achieve good results. Along with developing various deep learning models, the research community also focused on developing high-quality datasets for various tasks (Han et al., 2021). However, manual data annotation is a time-consuming, expensive and laborious process. Additionally, when there is a change in the data distribution, it is essential to re-train deep learning models with new labelled data to maintain good performances (Pan and Yang, 2009). To reduce the costs, the research community focused on how to effectively train deep learning models with limited labelled data. Transfer learning evolved as one of the effective solutions to train deep learning models with limited labelled data (Zhuang et al., 2020; Pan and Yang, 2009).

2.2.2. What is transfer learning?
Transfer learning in the context of artificial intelligence involves transferring existing knowledge from one task (or domain) to another different but related task (or domain) (Zhuang et al., 2020; Pan and Yang, 2009). Transfer learning avoids training a model from scratch and helps improve the model's performance on the target task (or domain) by leveraging already existing knowledge. Transfer learning is largely based on the idea that when two tasks (or domains) are similar, the knowledge from the source task (or domain) with sufficient data can be used to enhance the performance of the target task (or domain) with limited data. For example, consider the task of sentiment analysis of reviews of different products. It is highly expensive to annotate large data separately for each product. In such cases, transfer learning helps to adapt the model trained on one product's reviews to perform well on other product reviews without requiring large labelled data (Blitzer et al., 2007).

Transfer learning draws inspiration from human beings, i.e., human beings can do new tasks without or with few examples just by reusing previously gained knowledge (Han et al., 2021). Fig. 2 illustrates real-life examples of knowledge transfer (transfer learning). For example, a person who can cycle can learn to ride a bike quickly with less effort. This is because riding a cycle and a bike involves a lot of common things like handling the balance, etc. Similarly, a person familiar with the C programming language can learn the Python programming language easily. This is because both C and Python are programming languages and share many common concepts. So, due to the ability to reuse the existing knowledge and train the target models with limited data, transfer learning evolved as a promising learning paradigm and eventually played a crucial role in the evolution of advanced deep learning models like PLMs (Kalyan et al., 2021, 2022) and the recent LLMs. Overall, the advantages of transfer learning are

• Transfer learning helps to reduce the requirement of labelled data. (Data efficiency)
• Transfer learning avoids training models from scratch by providing a good initialization from existing related models. (Faster training and development)
• Transfer learning helps to enhance the performance on the target task (or domain) by reusing existing knowledge. (Enhance target task performance)
• Transfer learning is explored across AI areas like computer vision, natural language processing, and speech processing. (Versatile)

In conclusion, transfer learning is a powerful learning paradigm in artificial intelligence that has benefits regarding data efficiency, speed, performance, adaptability, and real-world practicality.
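As a concrete sketch of this "pretrain on a source, adapt to a target" idea in NLP, the snippet below fine-tunes a pretrained encoder on a small labelled target dataset using the Hugging Face transformers and datasets libraries. The checkpoint, dataset and hyperparameters are illustrative placeholders, not choices prescribed by this survey.

from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

# The pretrained checkpoint supplies the "source" knowledge; only a small labelled
# target dataset is needed for adaptation (checkpoint and dataset names are illustrative).
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

dataset = load_dataset("imdb")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
small_train = dataset["train"].shuffle(seed=42).select(range(2000)).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=16),
    train_dataset=small_train,
)
trainer.train()   # fine-tunes the pretrained weights on the limited labelled target data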
2.2.3. Transfer learning vs. other learning paradigms
Along with transfer learning, the other learning paradigms that evolved to address large labelled data requirements are semi-supervised learning (Van Engelen and Hoos, 2020) and multi-task learning (Zhang and Yang, 2021). Semi-supervised learning is a learning paradigm in artificial intelligence that uses labelled and unlabelled data to train models (Van Engelen and Hoos, 2020). As semi-supervised learning uses labelled and unlabelled data, it lies between the unsupervised and supervised learning paradigms. As semi-supervised learning uses only a small amount of labelled data, it reduces the amount of labelled data required, like transfer learning. However, unlike transfer learning, where the distributions of the source and target tasks can be different, in semi-supervised learning, the distribution of labelled and unlabelled data should be the same (Zhuang et al., 2020). Multi-task learning is a learning paradigm which focuses on enhancing the performance of a group of tasks by leveraging the interconnections between the tasks and learning them simultaneously (Van Engelen and Hoos, 2020). Unlike multi-task learning, which simultaneously learns all the tasks, transfer learning first learns the source task and then transfers the knowledge to the target task. In multi-task learning, the focus is generally on all the tasks, while transfer learning focuses more on the target task (Pan and Yang, 2009).

2.3. Self-supervised learning (SSL)

2.3.1. Why self-supervised learning?
The main drawback with traditional deep learning models like CNN is the requirement of training from scratch. Training from scratch requires a large amount of labelled data. Data labelling is not only expensive but also a time-consuming and laborious process, which eventually makes the model development expensive. To reduce the requirement of labelled data and make the model development process less expensive, the computer vision research community focused on developing models like VGGNet (Simonyan and Zisserman, 2015), AlexNet (Krizhevsky et al., 2012) and GoogleNet (Szegedy et al., 2015) on top of large CNNs, transfer learning and supervised learning. These models are pretrained on a large number of labelled images from the ImageNet dataset (Deng et al., 2009) using supervised learning, and then adapted to downstream tasks. These pretrained models avoid training downstream models from scratch by providing a good initialization. Moreover, downstream models initialized from pretrained models converge faster and achieve good results even with limited labelled data (Han et al., 2021).

Inspired by the huge success of pretrained image models, the NLP research community focused on developing PLMs (Han et al., 2021; Kalyan et al., 2021, 2022). However, the main challenge here is the use of supervised learning at scale to pretrain language models. This is because supervised learning at scale requires huge volumes of labelled data, which is almost impossible to obtain in many cases because of highly expensive annotation costs. Besides high annotation costs, supervised learning also suffers from generalization errors and spurious correlations (Kalyan et al., 2021; Gui et al., 2023). Self-supervised learning, with the ability to automatically generate the labels and make use of unlabelled data, evolved as an effective alternative to supervised learning to pretrain language models at scale (Liu et al., 2021c; Gui et al., 2023; Kalyan et al., 2021).

2.3.2. What is self-supervised learning?
Self-supervised learning, a promising learning paradigm in artificial intelligence, helps models from different modalities like language, speech or image to learn background knowledge from large volumes of unlabelled data (Liu et al., 2021c; Gui et al., 2023). Unlike supervised learning, which relies on large volumes of labelled data, SSL pretrains the models at scale based on the pseudo supervision offered by one or more pretraining tasks. Here, the pseudo supervision stems from the labels, which are automatically generated without human intervention based on the description of the pretraining task. In general, SSL involves one or more pretraining tasks (Kalyan et al., 2021, 2022). Moreover, the efficiency of SSL is heavily influenced by the choice of pretraining task (Kalyan et al., 2021; Clark et al., 2019; He et al., 2022a).

Fig. 3 presents the SSL paradigm. In the pretraining phase, the labels are automatically generated based on the description of the pretraining tasks, and the models learn universal knowledge using the pseudo supervision offered by one or more pretraining tasks. Pretraining helps the models to gain strong background knowledge, which allows the models to provide a good initialization to downstream models. The initialization from pretrained models enhances the downstream models in terms of generalization, performance, and robustness and makes them data efficient. After pretraining, PLMs can be easily adapted to downstream tasks with limited labelled data, and LLMs can be used to solve downstream tasks using in-context learning without any task-specific fine-tuning.
cess less expensive, the computer vision research community focused
on developing models like VGGNet (Simonyan and Zisserman, 2015), 2.3.3. Evolution of self-supervised learning
AlexNet (Krizhevsky et al., 2012) and GoogleNet (Szegedy et al., 2015) Fig. 4 shows the evolution of SSL in natural language processing
on top of large CNNs, transfer learning and supervised learning. These from embedding models to the recent LLMs. The evolution of SSL in
models are pretrained on a large number of labelled images from natural language processing happened in three stages, namely embed-
ImageNet dataset (Deng et al., 2009) using supervised learning, and ding models, PLMs and LLMs. Initially, SSL is explored to develop
then adapted to downstream tasks. These pretrained models avoid non-contextual embedding models (e.g. Word2Vec Mikolov et al., 2013,
training downstream models from scratch by providing a good ini- FastText Bojanowski et al., 2017), followed by sentence embedding
tialization. Moreover, downstream models initialized from pretrained (e.g. Sent2Vec Pagliardini et al., 2018) and contextual embedding
models converge faster and achieve good results even with limited models (e.g. ELMo Peters et al., 2018). The quest to develop pre-
labelled data (Han et al., 2021). trained models motivated NLP researchers to explore SSL to develop
PLMs (Kalyan et al., 2021, 2022; Han et al., 2021). As PLMs cannot generalize to NLP tasks without fine-tuning, the NLP research community focused on developing LLMs using SSL at a large scale (Brown et al., 2020; Touvron et al., 2023a,b; Anil et al., 2023; OpenAI, 2023). To summarize, self-supervised learning is undergoing a rapid evolution and is also treated as a significant element in achieving near human-level intelligence (Gui et al., 2023).

2.3.4. Self-supervised learning vs. other learning paradigms
Self-supervised learning, with its exceptional ability to make use of unlabelled data at scale, evolved as an alternative to supervised learning to pretrain models. However, SSL has similarities and dissimilarities with supervised learning (Kalyan et al., 2021). Both self-supervised and supervised learning provide supervision. However, unlike supervised learning, which offers supervision based on human-labelled data, SSL offers supervision based on automatically generated data. Supervised learning is mostly used to train downstream models with task-specific data, while SSL is used to train pretrained models to offer a good initialization to downstream models. Similarly, SSL has similarities and dissimilarities with unsupervised learning (Kalyan et al., 2021). Both SSL and unsupervised learning make use of unlabelled data without requiring any labelled data. However, unlike SSL, which focuses on learning rich data representations using pseudo supervision, the main focus of unsupervised learning is to identify the hidden patterns in the data without any supervision.

2.4. Pretrained language models (PLMs)

2.4.1. Overview
Deep learning witnessed the evolution of several models, from convolution neural networks to the latest transformers (Young et al., 2018; Otter et al., 2020). The transformer addressed the drawbacks of traditional deep learning models like the convolutional neural network, the recurrent neural network and its variants and achieved significant progress (Vaswani et al., 2017; Lin et al., 2022b). However, transformer and traditional deep learning models suffer from one major drawback: training from scratch, which requires large volumes of labelled data and makes model development expensive. Inspired by the success of pretrained image models like VGGNet (Simonyan and Zisserman, 2015), AlexNet (Krizhevsky et al., 2012) and GoogleNet (Szegedy et al., 2015) in computer vision, NLP researchers focused on developing pretrained models for natural language processing based on transformers and self-supervised learning (Kalyan et al., 2021, 2022; Han et al., 2021; Qiu et al., 2020). Pretrained language models are advanced deep learning models, essentially transformer-based, pretrained on large volumes of text data, and can be adapted to downstream tasks with limited labelled data. Along with the transformer model, self-supervised learning and transfer learning are key concepts which make PLMs possible (Kalyan et al., 2021) (refer Fig. 5). The era of PLMs started with the GPT-1 (Radford et al., 2018) and BERT (Devlin et al., 2018) models. The massive success of the BERT and GPT-1 models triggered the development of other PLMs like RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019), ELECTRA (Clark et al., 2019), ALBERT (Lan et al., 2019), DeBERTa (He et al., 2022a, 2020), GPT-2 (Radford et al., 2019), T5 (Raffel et al., 2020), BART (Lewis et al., 2020), PEGASUS (Zhang et al., 2020) etc.

2.4.2. Evolution of pretrained language models
The evolution of PLMs happened along three dimensions: encoder-based models, decoder-based models and encoder–decoder based models (Kalyan et al., 2021). Encoder-based models consist of an embedding layer and a stack of encoder layers, with each encoder layer having self-attention and feed-forward networks. Encoder-based models are primarily used for natural language understanding tasks like text classification, entity extraction, relation extraction, etc. Some of the popular encoder-based PLMs are BERT, RoBERTa, XLNet, ALBERT, ELECTRA, DeBERTa, etc. Decoder-based models consist of an embedding layer and a stack of decoder layers, with each decoder layer having masked self-attention and feed-forward networks. Decoder-based models are used for both natural language understanding and
generation tasks. Some of the popular decoder-based PLMs are GPT-1, GPT-2 etc. Encoder–decoder based models consist of both encoder and decoder modules. In general, encoder–decoder based models are used for natural language generation tasks like machine translation, text summarization, etc., while some are explored for both natural language understanding and generation tasks. Some of the popular encoder–decoder based models are T5, BART, PEGASUS, M2M100, NLLB, etc.

After the massive success of PLMs in the English language, the research community started to develop multilingual PLMs (Doddapaneni et al., 2021) and PLMs for non-English languages (Kalyan et al., 2021). Some of the popular multilingual PLMs are mBERT (Devlin et al., 2018), mT5 (Xue et al., 2021), mBART (Liu et al., 2020a), IndicBERT (Kakwani et al., 2020), XLM (Conneau and Lample, 2019), XLM-R (Conneau et al., 2020), mDeBERTa (He et al., 2022a) etc. As the performance of general domain PLMs is limited in domain-specific tasks (Kalyan et al., 2021, 2022), the research community focused on developing PLMs for specific domains like social media (Nguyen et al., 2020; Barbieri et al., 2020), finance (Yang et al., 2020a; Araci, 2019; Liu et al., 2021a), legal (Chalkidis et al., 2020; Leivaditi et al., 2020), coding (Feng et al., 2020; Wang et al., 2021d, 2023d), healthcare (Lee et al., 2020; Gu et al., 2020; raj Kanakarajan et al., 2021) etc. As PLMs have millions of parameters, which makes model fine-tuning and deployment expensive, compact PLMs like DistilBERT (Sanh et al., 2019), TinyBERT (Jiao et al., 2020), MobileBERT (Sun et al., 2020), MiniLM (Wang et al., 2020b) etc. are developed. As PLMs have a limited context length, which limits the performance on long sequences, long-sequence PLMs like LongFormer (Beltagy et al., 2020), BigBird (Zaheer et al., 2020) etc. are developed. PLMs encode only the universal language knowledge available in the pretraining corpus and lack the valuable knowledge available in ontologies. So, the research community developed ontology-enriched models like SapBERT (Liu et al., 2021), UmlsBERT (Michalopoulos et al., 2020), etc.
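In practice, the three PLM families described above are exposed through different task heads. The hedged sketch below loads one representative of each family with the Hugging Face Auto classes; the checkpoint names are common public ones used purely as examples.

from transformers import (AutoModelForSequenceClassification,   # encoder-based head
                          AutoModelForCausalLM,                  # decoder-based head
                          AutoModelForSeq2SeqLM)                 # encoder–decoder head

# Encoder-based PLM (e.g. BERT): natural language understanding tasks such as classification.
encoder_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
# Decoder-based PLM (e.g. GPT-2): text generation via next-token prediction.
decoder_model = AutoModelForCausalLM.from_pretrained("gpt2")
# Encoder–decoder PLM (e.g. T5): sequence-to-sequence tasks such as translation or summarization.
seq2seq_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")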
2.5. Large language models (LLMs)

2.5.1. Overview
The pretrained language models, starting from the GPT-1 (Radford et al., 2018) and BERT (Devlin et al., 2018) models to the latest DeBERTa (He et al., 2022a, 2020), achieved significant progress and also reduced the amount of labelled data required to train task-specific models (Kalyan et al., 2021, 2022). Pretrained language models follow the paradigm "pretrain then fine-tune", i.e., the model is pretrained first and then adapted to downstream tasks by fine-tuning. As task-specific fine-tuning is mandatory to adapt the pretrained language model to downstream tasks, PLMs cannot generalize to unseen downstream tasks without task-specific fine-tuning. Moreover, task-specific fine-tuning requires labelled data and creates a separate copy of the pretrained language model for each downstream NLP task, increasing the model development and deployment costs (Kalyan et al., 2021). Pretrained language models are treated as narrow AI systems as they are adapted through fine-tuning and then used for specific downstream tasks. However, the main focus of the research community is to develop artificial general intelligence systems (Goertzel, 2014; Bubeck et al., 2023) which are not narrowly focused on specific tasks but have the ability for general problem-solving and can handle even unseen tasks by utilizing existing knowledge like human beings. The NLP researchers observed that the performance of PLMs can be enhanced further through scaling along three dimensions: pretraining computation, pretraining data and model size (Liu et al., 2019; Radford et al., 2019; Raffel et al., 2020). Large size allows the models to capture more nuanced language patterns, which in turn enhances their ability to understand and generate text, while large pretraining data helps the model to learn from a wider range of text. The promising results from scaling and the quest to build artificial general intelligence systems motivated NLP researchers to build much bigger and bigger models, which eventually resulted in the evolution of GPT-3 and its successor models (Brown et al., 2020; Chowdhery et al., 2022; Hoffmann et al., 2022; Du et al., 2022). Learning paradigms like transfer learning and self-supervised learning make LLMs possible, but scaling makes these models powerful.

The research community coined a new phrase, "large language models", to refer to GPT-3 and its successor large models to differentiate these models from small PLMs (Zhao et al., 2023d). Large language models (LLMs) are a special class of pretrained language models obtained by scaling model size, pretraining corpus and computation, as shown in Fig. 6. LLMs are essentially deep learning models, specifically transformer-based, pretrained on large volumes of text data and aligned to human preferences using meta-training. Pretraining provides universal language knowledge to the model (Kalyan et al., 2021), while meta-training aligns the model to act based on the user's intentions. Here, the user's intention includes explicit intentions, like following instructions, and implicit intentions, like maintaining truthfulness and avoiding bias, toxicity, or harmful behaviour (Ouyang et al., 2022).

Because of their large size and pretraining on large volumes of text data, LLMs exhibit special abilities referred to as emerging abilities (Wei et al., 2022a; Schaeffer et al., 2023), allowing them to achieve remarkable performances without any task-specific training in many natural language processing tasks. For downstream task usage, PLMs leverage the supervised learning paradigm, which involves task-specific fine-tuning and hundreds or thousands of labelled instances (Kalyan et al., 2021, 2022). LLMs leverage in-context learning (ICL), a new learning paradigm that does not require task-specific fine-tuning and many labelled instances (Brown et al., 2020; Dong et al., 2022).
LLMs treat any NLP task as a conditional text generation problem and generate the desired text output by conditioning on the input prompt, including the task description, the test input and, optionally, a few examples.
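A minimal sketch of this prompting setup is shown below: the prompt packs a short task description, a few worked examples and the test input, and the model only continues the text, with no gradient updates. The API call is written in the style of the 2023-era openai Python client, and both the toy task and the exact client syntax should be treated as illustrative assumptions, since client interfaces change across versions.

import openai  # assumes an OpenAI API key is configured in the environment

prompt = (
    "Answer with the capital city of the given country.\n\n"  # task description
    "Country: France\nCapital: Paris\n\n"                     # in-context example 1
    "Country: Japan\nCapital: Tokyo\n\n"                      # in-context example 2
    "Country: Kenya\nCapital:"                                 # test input
)

# In-context learning: the task is specified entirely in the prompt; the model weights are not updated.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message["content"])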
2.5.2. Evolution of large language models
The evolution of LLMs happened along two dimensions: closed-source LLMs and open-source LLMs. The era of LLMs roughly started with GPT-3. Following the success of GPT-3, Open AI developed successor models like InstructGPT (Ouyang et al., 2022), Codex (Chen et al., 2021b), ChatGPT and GPT-4 (OpenAI, 2023). Google introduced models like GLaM (Du et al., 2022), PaLM (Chowdhery et al., 2022), PaLM2 (Anil et al., 2023), LaMDA (Thoppilan et al., 2022) and Bard. DeepMind developed models like Gopher (Rae et al., 2021), Chinchilla (Hoffmann et al., 2022), AlphaCode (Li et al., 2022a) and Sparrow (Glaese et al., 2022). Companies like Baidu, AI21 Labs and Amazon developed the models Ernie 3.0 Titan (Wang et al., 2021c), Jurassic-1 (Lieber et al., 2021) and AlexaTM (Soltan et al., 2022), respectively. Although the performances of closed-source LLMs are impressive, the main drawback with these models is that they are behind paywalls, i.e., their weights are not publicly available, they are accessible only through the APIs offered by the respective companies, and the model usage is charged based on the tokens processed and generated.

To address this issue, the research community focused on developing open-source LLMs with publicly available weights. Some of the popular open-source LLMs are OPT (Zhang et al., 2022), OPT-IML (Iyer et al., 2022), Galactica (Taylor et al., 2022), LLaMA (Touvron et al., 2023a), LLaMA2 (Touvron et al., 2023b) and Falcon. The performances of these open-source LLMs are on par with closed-source LLMs. Moreover, in some cases, open-source LLMs outperform closed-source LLMs. For example, Galactica beats closed-source LLMs like GPT-3, Chinchilla and PaLM. Inspired by the success of open-source LLMs in the English language, the research community focused on developing multilingual and bilingual LLMs. BLOOM (Scao et al., 2022) and BLOOMZ (Muennighoff et al., 2022) are examples of multilingual LLMs, while JAIS (Sengupta et al., 2023) (English and Arabic), GLM (Zeng et al., 2022) (English and Chinese) and FLM-101B (Li et al., 2023j) (English and Chinese) are examples of bilingual LLMs.

The success of closed and open-source LLMs in the general domain triggered the development of domain-specific LLMs like FinGPT (Yang et al., 2023g) and BloombergGPT (Wu et al., 2023a) in the finance domain, MedPaLM (Singhal et al., 2023a) and MedPaLM2 (Singhal et al., 2023b) in the healthcare domain and StarCoder (Li et al., 2023a), CodeLlaMa (Rozière et al., 2023), CodeGen (Nijkamp et al., 2022) and CodeGen2 (Nijkamp et al., 2023) in the coding domain. For example, Bloomberg developed BloombergGPT, an exclusive LLM for the finance domain. Similarly, Google developed the MedPaLM and MedPaLM2 LLMs exclusively for the healthcare domain based on the PaLM and PaLM2 models, respectively. Similarly, HuggingFace developed StarCoder, MetaAI developed Code LlaMA, and SalesForce developed the CodeGen and CodeGen2 LLMs exclusively for coding tasks.

3. GPT-3 family large language models

3.1. Overview
Open AI, an AI company established in 2015, focused on building generative models. The Open AI researchers initially explored RNNs for developing generative language models (Radford et al., 2017). Inspired by the huge success of the transformer model and its ability to capture long-term dependencies, Open AI researchers leveraged the transformer decoder to build GPT-1 (117M parameters), the first-ever transformer-based pretrained language model (Radford et al., 2018). GPT-1 introduced a new paradigm, "pretrain and fine-tune", to develop downstream task models effectively. Originally, the "pretrain and fine-tune" paradigm was introduced by Dai and Le (2015) and then explored by Howard and Ruder (2018) to build language models for text classification. However, unlike the work of Radford et al. (2018), these research works build language models based on LSTM, which lacks parallelization ability and has difficulties in capturing long-term dependencies. Radford et al. (2018) used causal language modelling as a pretraining task to pretrain the GPT-1 model. The causal language modelling pretraining task involves generating the next token based on the previous tokens. GPT-1 achieved SOTA results in 9 out of 12 NLP tasks (Radford et al., 2018).
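The causal language modelling objective mentioned above can be sketched in a few lines: every prefix of an unlabelled token sequence becomes a training context and the following token becomes its target, so the supervision is again generated automatically. The toy sequence below is invented for illustration; real models operate on subword ids and score their predictions with a cross-entropy loss over the vocabulary.

# Causal language modelling: each prefix of a raw sequence predicts the next token.
tokens = ["GPT", "models", "are", "pretrained", "on", "web", "text"]
training_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in training_pairs:
    print(" ".join(context), "->", target)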
Inspired by the success of GPT-1, Open AI researchers introduced the GPT-2 model to push the results further (Radford et al., 2019). The GPT-2 model is pretrained on the WebText corpus (40 GB of text), which is much larger than the Books corpus used to pretrain the GPT-1 model. The authors developed four versions of the GPT-2 model with varying parameters: 117M, 345M, 762M and 1.5B. The authors observed that the perplexity decreases with an increase in the model's size, and even for the largest version of 1.5B, the decrease in perplexity did not exhibit saturation. This revealed that GPT-2 underfitted the pretraining dataset, and extending the training duration could have further reduced perplexity. This observation triggered the insight that "developing even larger language models will decrease the perplexity further and enhance natural language understanding and generation capabilities". The insights gained from the GPT-1 and GPT-2 models laid a strong foundation for the evolution of the GPT-3 family LLMs, including the latest models like ChatGPT and GPT-4. Fig. 7 shows the journey of Open AI starting from GPT-1 to the latest GPT-4, and Fig. 8 shows the GPT-3 family LLMs starting from the GPT-3 series to the latest GPT-4.

Fig. 8. GPT-3 family large language models (GLLMs) starting from the GPT-3 series to the latest GPT-4. Here, SFT stands for supervised fine-tuning, and RLHF stands for reinforcement learning from human feedback. Here, raw represents that the model is just pretrained and is not aligned using SFT or RLHF. Here, RLHF-Chat represents that the model is aligned using RLHF and optimized for chat.

3.2. GPT-3 models
The experiment results of GPT-2 showed that increasing the model size further reduces the perplexity, and the model with more parameters achieves better results than the models with fewer parameters. This observation motivated Open AI researchers to train much bigger GPT models, which eventually resulted in the introduction of the GPT-3 model (Brown et al., 2020). The GPT-3 model contains 175B parameters and is 100 times bigger than its predecessor model, GPT-2. Moreover, the GPT-3 model is trained over a corpus with text from multiple sources like webpages, Wikipedia and books, unlike the GPT-1 and GPT-2
models, which are pretrained over corpora with text from books and webpages, respectively. Scaling in three dimensions (pretraining data, model size and pretraining computation) allows the GPT-3 model to learn more from large volumes of texts from different sources, which eventually empowers the model to handle unseen tasks without any task-specific training. Unlike the GPT-1 and GPT-2 models, which leverage supervised learning to do downstream tasks, GPT-3 leverages training-free in-context learning. In-context learning is a new learning paradigm that is training-free and solves the downstream tasks by using the knowledge encoded in the model parameters (Dong et al., 2022). In-context learning accepts prompts as input, where the input prompt consists of task descriptions, optionally a few examples, and other instructions.

3.3. GPT-3.5 models
Two main drawbacks of the GPT-3 model are (i) GPT-3 is not trained over code data, and hence, it lacks complex reasoning abilities like solving math problems (Zhao et al., 2023d), and (ii) the GPT-3 model struggles to follow user instructions and sometimes generates harmful text (Ouyang et al., 2022). These two drawbacks are addressed by GPT-3.5 models. Brown et al. (2020) observed that GPT-3 can generate simple programs, although it is not specifically trained for generating code. The Open AI researchers, triggered by this observation, introduced Codex (Chen et al., 2021b), an exclusive GLLM for coding tasks. Codex is developed by fine-tuning a GPT model with 12B parameters over publicly available GitHub code. Moreover, it is observed that GPT models explicitly trained over code data exhibit better reasoning capabilities.

During pretraining, the GPT-3 model is optimized based on the causal language modelling objective, which involves predicting the next word based on the previous words. In-context learning during inference can be viewed as conditional text generation, where the model generates the output by conditioning on the given prompt. The model performs text generation during both pretraining and inference, but it does vanilla text generation during pretraining and conditional text generation during inference. During pretraining, the model conditions on the previous words and generates the next word, i.e., vanilla text generation. However, during in-context learning, the model conditions on the prompt and generates the answer rather than generating the next words, i.e., conditional text generation. So, there is a gap between pretraining and in-context learning at inference. Due to this, in many cases during inference, the GPT-3 model fails to understand the given prompt and tends to simply generate the next words.

The pretraining corpus of the GPT-3 model includes some amount of text with undesired qualities like misinformation, abuse, hate, sexism, etc., due to which the model sometimes generates harmful text. To enhance complex reasoning ability and the instruction-following ability and to reduce harmful text generation, GPT-3.5 models are developed by fine-tuning GPT-3 models over code data and then aligned using supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022). For example, the text-davinci-002 model is developed by fine-tuning the GPT-3 model (text-davinci) over code data to get code-davinci-002, which is further aligned using SFT.

3.4. ChatGPT and GPT-4
GPT-3 models are capable of understanding and generating natural language, while GPT-3.5 models are capable of understanding and generating both natural language and code. However, both GPT-3 and GPT-3.5 models are not chat optimized. This drawback is addressed by the ChatGPT (GPT-3.5-turbo) and GPT-4 (OpenAI, 2023) models. Open AI introduced ChatGPT in November 2022. With extraordinary conversational abilities, ChatGPT has garnered millions of users within a few weeks of its launch. Following ChatGPT, Open AI released the GPT-4 model in March 2023, which can handle both text and image inputs. Apart from generating text with human-like fluency, these models further pushed the results in many natural language processing tasks. The performance of these models in downstream tasks and specific domains is discussed in detail in Sections 4 and 5.

4. Performance of GLLMs in downstream tasks

4.1. Text classification

Overview. Text classification is one of the fundamental tasks in natural language processing (Li et al., 2022c). It involves assigning label(s) from a predefined set of labels to a given piece of text. Here, the piece of text can be a phrase, sentence, paragraph or even a document. Many natural language processing problems, like offensive language identification, stance detection, sentiment analysis, hate speech detection, etc., are approached as text classification. Text classification can be binary, multi-class or multi-label.

In the case of text classification, the large language model is prompted with a task description, a predefined set of labels, examples (optional) and the test input. Here, the task description, the predefined set of labels and the examples constitute the context. The model understands what the task actually is from the context and then assigns the most appropriate label(s) to the given test input. The additional inputs, like examples in the context, enrich the prompt with more information, which allows the model to understand the task better and then perform better.
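A minimal template for this prompting setup might look as follows; the task wording, label set and examples are invented placeholders, and published studies differ in phrasing and in how the model's free-text answer is mapped back to a label.

def build_classification_prompt(task_description, labels, examples, test_input):
    # Assemble a zero- or few-shot text classification prompt for a GLLM.
    lines = [task_description, f"Possible labels: {', '.join(labels)}.", ""]
    for text, label in examples:                  # leave examples empty for zero-shot prompting
        lines += [f"Text: {text}", f"Label: {label}", ""]
    lines += [f"Text: {test_input}", "Label:"]
    return "\n".join(lines)

prompt = build_classification_prompt(
    task_description="Classify the stance of the tweet towards remote work.",
    labels=["favor", "against", "neutral"],
    examples=[("Working from home saved me two hours a day.", "favor")],
    test_input="Offices exist for a reason, collaboration dies over video calls.",
)
print(prompt)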
Table 1
Summary of research works exploring GLLMs for various text classification problems. Here ZS represents zero-shot, and FS represents few-shot.

Paper | Task(s) | GLLMs explored | Prompt settings | Domain(s) | Language(s) | SOTA results
Zhang et al. (2023c) | Stance detection | ChatGPT | ZS, FS | Social media | English | No
Lamichhane (2023) | Stress detection, depression detection, suicidal detection | ChatGPT | ZS | Social media | English | No
Yang et al. (2023c) | Mental health analysis tasks | ChatGPT | ZS | Social media | English | No
Wang et al. (2023l) | Sentiment analysis | ChatGPT | ZS, FS | Social media | English, Chinese | No
Lopez-Lira and Tang (2023) | Stock prediction based on sentiment analysis | ChatGPT | ZS | Finance | English | No
Ziems et al. (2023a) | Computational social science tasks | GPT-3, ChatGPT | ZS | Social media | English | No
Kuzman et al. (2023) | Genre identification | ChatGPT | ZS | General | English, Slovenian | No
Bang et al. (2023) | Sentiment analysis, misinformation detection | ChatGPT | ZS | Social media | English, Indonesian, Javanese, Buginese | No
Kocoń et al. (2023) | Nine NLU tasks including sentiment analysis and natural language inference | ChatGPT | ZS | General, social media | English | No
Zhong et al. (2023) | Paraphrase detection, sentiment analysis, natural language inference | ChatGPT | ZS, FS | General | English | No
Ye et al. (2023) | Sentiment analysis, natural language inference | GPT-3, GPT-3.5, ChatGPT | ZS, FS | General, social media | English | No
Li et al. (2023k) | Financial news classification, sentiment analysis | ChatGPT, GPT-4 | ZS | Finance | English | No
Wu et al. (2023c) | Natural language inference | ChatGPT, GPT-4 | ZS, FS | Healthcare | English | No
Wang et al. (2023n) | Natural language inference, document classification | GPT-3.5, GPT-4, Bard | ZS, FS | Healthcare | English | No
Chiu et al. (2021) | Hate speech detection | GPT-3 | ZS, FS | Social media | English | No
Huang et al. (2023a) | Implicit hate speech detection | ChatGPT | ZS | Social media | English | No
Chen et al. (2023e) | Clinical text classification | GPT-3, ChatGPT, GPT-4 | ZS, FS | Healthcare | English | No
Amin et al. (2023) | Sentiment analysis, suicide tendency detection, personality prediction | ChatGPT | ZS | Social media | English | No
Parikh et al. (2023) | Intent classification | GPT-3 | ZS | Social media | English | No
Sun et al. (2023b) | News classification, sentiment analysis | InstructGPT | ZS, FS | General, social media | English | Yes
which allows the model to understand the task better and then perform better.

Research works exploring GLLMs for text classification. The recent works explored GLLMs like GPT-3, GPT-3.5, ChatGPT and GPT-4 for various text classification problems like sentiment analysis (Wang et al., 2023l; Lopez-Lira and Tang, 2023; Bang et al., 2023; Zhong et al., 2023; Li et al., 2023k; Amin et al., 2023; Sun et al., 2023b), stance detection (Zhang et al., 2023c), intent classification (Parikh et al., 2023), mental health analysis (Lamichhane, 2023; Yang et al., 2023c), hate speech detection (Chiu et al., 2021; Huang et al., 2023a), misinformation detection (Bang et al., 2023), paraphrase detection (Zhong et al., 2023), news classification (Li et al., 2023k), natural language inference (Zhong et al., 2023; Wu et al., 2023c; Wang et al., 2023n), etc. The evaluation is done in zero and few-shot settings using different prompting strategies like chain-of-thought (CoT) (Zhang et al., 2023c; Yang et al., 2023c; Zhong et al., 2023; Wu et al., 2023c; Wang et al., 2023n; Chen et al., 2023e; Sun et al., 2023b), self-question prompting (SQP) (Wang et al., 2023n), clue and reasoning prompting (CARP) (Sun et al., 2023b), etc. Most of the research works focused on English datasets, except a few research works focused on other languages like Chinese (Wang et al., 2023l), Slovenian (Kuzman et al., 2023), Indonesian (Bang et al., 2023), Javanese (Bang et al., 2023), and Buginese (Bang et al., 2023). A brief summary of research works exploring GLLMs for various text classification problems is presented in Table 1.

Most of the research works showed that, compared to direct prompting, advanced prompting strategies help the model to achieve better results. This is because advanced prompting involves generating intermediate outputs, which in turn guide the model in generating the correct final output. Zhang et al. (2023c) explored the ChatGPT model with direct and chain-of-thought prompting for stance detection in tweets in zero and few-shot settings. Experiment results on three datasets showed that one-shot chain-of-thought prompting outperforms zero-shot direct prompting and also achieves near state-of-the-art results. Yang et al. (2023c) designed emotion-enhanced CoT prompting to combine emotion information with the power of CoT prompting for mental health analysis tasks. Experiments on five different mental health analysis tasks showed that ChatGPT with emotion-enhanced CoT outperforms other prompting strategies. Overall, ChatGPT outperforms traditional deep learning models like CNN and RNN but still lags behind task-specific fine-tuned models.
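To make the contrast between direct and chain-of-thought style prompting concrete, the minimal sketch below builds both prompt variants for tweet stance detection. The wording and the example tweet are illustrative only and are not the exact templates used by Zhang et al. (2023c).

```python
# Illustrative direct vs. chain-of-thought prompts for stance detection.
# The templates are paraphrased from the survey's description, not the originals.

def direct_prompt(tweet: str, target: str) -> str:
    """Zero-shot direct prompting: ask for the label with no intermediate output."""
    return (
        f"Tweet: {tweet}\n"
        f"What is the stance of the tweet towards '{target}'? "
        "Answer with exactly one of: favor, against, none."
    )

def cot_prompt(tweet: str, target: str) -> str:
    """Chain-of-thought prompting: ask the model to reason step by step
    (the intermediate output) before committing to a label."""
    return (
        f"Tweet: {tweet}\n"
        f"Question: What is the stance of the tweet towards '{target}'?\n"
        "Let's think step by step about the wording and the implied opinion, "
        "then conclude with 'Stance: favor', 'Stance: against' or 'Stance: none'."
    )

if __name__ == "__main__":
    tweet = "Another heatwave and they still deny the obvious..."  # hypothetical tweet
    print(direct_prompt(tweet, "climate change is real"))
    print()
    print(cot_prompt(tweet, "climate change is real"))
```

The second template is what makes the intermediate reasoning explicit; the final label is then conditioned on that generated reasoning rather than produced directly.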
Wu et al. (2023c) explored models like GPT-4 and ChatGPT for the radiology natural language inference task. The authors reported that GPT-4 with the IRSA (Instruction Response Semantic Alignment) prompting strategy outperforms ChatGPT in both zero and few-shot settings. The IRSA prompting strategy is almost the same as direct prompting, except that the model is instructed to give the labels ‘‘contain’’ and ‘‘not contain’’ instead of ‘‘entailment’’ and ‘‘not entailment’’, just to reduce the complexity. Wang et al. (2023n) evaluated the performances of the latest LLMs like GPT-3.5, GPT-4, and Bard on text classification tasks like natural language inference and document classification in the healthcare domain. The GPT-4 model with the newly designed self-question prompting (SQP) outperforms other models in both zero and few-shot settings. The SQP strategy involves identifying the key elements of the input, generating questions and answers related to the key elements, and then using them to generate the final output. Parikh et al. (2023) showed that the performance of the GPT-3 model for intent classification in zero-shot settings can be enhanced by including intent class descriptions in the prompt.

Some of the research works demonstrated that GPT-3 family LLMs can outperform task-specific fine-tuned models (Kuzman et al., 2023; Zhong et al., 2023) and domain-specific LLMs (Li et al., 2023k). Kuzman et al. (2023) showed that ChatGPT outperforms the fine-tuned XLM-R model in the task of automatic genre identification in the English language. Zhong et al. (2023) compared the performances of ChatGPT and fine-tuned models based on the base and large versions of BERT and RoBERTa on tasks like natural language inference, sentiment analysis and paraphrase identification. The results showed that ChatGPT outperforms both base and large fine-tuned models by a large margin in the case of the natural language inference task. Li et al. (2023k) evaluated the performances of general LLMs like ChatGPT and GPT-4 and domain-specific LLMs like BloombergGPT on tasks like finance news classification and sentiment analysis. In the case of finance news classification, GPT-4 outperforms all other LLMs, including the domain-specific BloombergGPT model.

In all the above discussed research works, the performance of GLLMs is impressive but still lags behind SOTA results. Sun et al. (2023b) showed that it is possible to achieve SOTA results in text classification tasks with the newly designed clue and reasoning prompting (CARP) strategy. CARP involves a progressive reasoning approach for handling complex linguistic phenomena, and it involves three steps: finding clues based on the input, generating reasoning steps based on the input and the generated clues, and then arriving at the final output based on the input, generated clues and reasoning steps. The results are impressive, as InstructGPT with the CARP prompting strategy using just 16 examples achieves SOTA results on four text classification datasets.

4.2. Information extraction

Overview. Information Extraction (IE) in natural language processing involves extracting structured data like entities, relationships and events from unstructured text data (Lu et al., 2022b). Transforming unstructured text data into structured data enables efficient data processing, knowledge discovery and decision making, and enhances information retrieval and search. Information extraction involves a number of tasks like entity typing, entity extraction, relation classification, relation extraction, event detection, event argument extraction and event extraction (Li et al., 2023d). Entity typing (ET) involves classifying identified named entity mentions into one of the predefined entity types (Chen et al., 2022). Named Entity Recognition (NER), or entity extraction, involves identifying entity mentions and then assigning them to appropriate entity types (Das et al., 2022). Relation classification (RC) involves identifying the semantic relationship between the given two target entities in a sentence (Wu and He, 2019). Relation Extraction (RE) involves extracting the entities and then classifying the semantic relationship between the two target entities, i.e., it involves entity extraction followed by relation classification (Ye et al., 2022). Event Detection (ED) aims to identify and categorize words or phrases that trigger events (Zhao et al., 2022a). Event Argument Extraction (EAE) involves identifying event arguments, i.e., entities involved in the event, and then classifying their roles (Ma et al., 2022). Event Extraction (EE) aims to extract both the events and the involved entities, i.e., it involves event detection followed by event argument extraction (Du and Cardie, 2020).

Research works exploring GLLMs for information extraction tasks. The recent works explored GPT-3 family LLMs for various information extraction tasks like entity typing (Li et al., 2023d), entity extraction (González-Gallardo et al., 2023; Hu et al., 2023a; Wei et al., 2023; Gutiérrez et al., 2022; Li et al., 2023d; Ma et al., 2023a; Wang et al., 2023i,n; Stammbach et al., 2022; Li et al., 2023k,g), relation classification (Gutiérrez et al., 2022; Li et al., 2023d; Chan et al., 2023; Xu et al., 2023e; Wan et al., 2023; Wang et al., 2023n; Zhang et al., 2023e), relation extraction (Wei et al., 2023; Rehana et al., 2023; Yuan et al., 2023b; Li et al., 2023d; Ma et al., 2023a; Wadhwa et al., 2023; Li et al., 2023g), event classification (Li et al., 2023d), event argument extraction (Li et al., 2023d) and event extraction (Wei et al., 2023; Gao et al., 2023e; Li et al., 2023d; Ma et al., 2023a). The evaluation is done in zero and few-shot settings using different prompting strategies like chain-of-thought (CoT) (Yuan et al., 2023b; Wan et al., 2023; Wang et al., 2023n; Wadhwa et al., 2023), self-verification (Wang et al., 2023i), self-question prompting (SQP) (Wang et al., 2023n), event ranking (ER) (Yuan et al., 2023b), etc. Most of the research works focused on English datasets, except a few research works focused on other languages like Chinese (Wei et al., 2023). A brief summary of research works exploring GLLMs for various information extraction tasks is presented in Table 2.

Hu et al. (2023a) demonstrated that the performance of ChatGPT in extracting clinical entities like problem, treatment, and test can be enhanced by including additional information about the entity types, like synonyms and subtypes, in the prompt. Wei et al. (2023) proposed ChatIE, a two-stage framework for information extraction, with each stage implemented as multi-turn question answering. This two-stage framework helps the model break complex IE tasks into sub-tasks, which allows the model to perform better. Results showed that ChatGPT used with the ChatIE framework outperforms vanilla ChatGPT by a large margin of more than 18 points. Gutiérrez et al. (2022) enhanced the performance of the GPT-3 model for entity extraction and relation classification by using techniques like contextual calibration (Zhao et al., 2021) to reduce bias and kNN-based demonstration selection. Gao et al. (2023e) examined the performance of ChatGPT for event extraction in few-shot settings. The model is prompted with task descriptions, definitions of event types, positive and negative examples, and the test input. The authors reported that including negative examples decreases the performance of the model, which is in line with other existing works (Wang et al., 2022). The possible reason for this is that the model misunderstands negative examples as positive examples. Rehana et al. (2023) explored GPT-3 family models like GPT-3, ChatGPT and GPT-4 for protein–protein interaction extraction. It is reported that including normalized protein names in the prompt enhances the performance of the model. However, the fine-tuned PubMedBERT model outperforms the GPT-4 model with an F1-score of 86.47.
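The two-stage idea behind ChatIE can be sketched as follows: a first turn asks which of the candidate types occur in the sentence at all, and a second round of turns extracts instances only for the detected types. The prompts and the ask_llm helper below are illustrative placeholders rather than the original ChatIE templates.

```python
# A minimal sketch of two-stage, multi-turn extraction in the spirit of ChatIE
# (Wei et al., 2023). The schema, prompts and ask_llm helper are assumptions.

ENTITY_TYPES = ["person", "organization", "location"]  # hypothetical schema

def ask_llm(prompt: str) -> str:
    """Stand-in for a call to ChatGPT or another GLLM; plug in a real client here."""
    raise NotImplementedError

def chatie_style_ner(sentence: str) -> dict:
    # Stage 1: decide which entity types are present in the sentence.
    stage1 = (
        f"Sentence: {sentence}\n"
        f"Which of the following entity types appear in the sentence? {ENTITY_TYPES}\n"
        "Answer with a comma-separated list of types, or 'none'."
    )
    detected = [t.strip() for t in ask_llm(stage1).split(",") if t.strip() in ENTITY_TYPES]

    # Stage 2: one follow-up question per detected type (multi-turn decomposition).
    results = {}
    for etype in detected:
        stage2 = (
            f"Sentence: {sentence}\n"
            f"List all '{etype}' mentions in the sentence, separated by semicolons. "
            "If there are none, answer 'none'."
        )
        answer = ask_llm(stage2)
        results[etype] = [m.strip() for m in answer.split(";") if m.strip().lower() != "none"]
    return results
```

Decomposing the task this way keeps each individual prompt simple, which is the property the ChatIE results above attribute the large improvement to.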
Table 2
Summary of research works exploring GLLMs for information extraction tasks. Here ZS represents zero-shot, and FS represents few-shot.
Paper Task(s) GLLMs explored Prompt settings Domain(s) Language(s) SOTA results
González- Entity extraction ChatGPT ZS General English No
Gallardo
et al. (2023)
Hu et al. Entity extraction GPT-3, ChatGPT ZS Healthcare English No
(2023a)
Wei et al. Entity extraction, event extraction, ChatGPT ZS General English, Chinese No
(2023) relation extraction
Gutiérrez Entity extraction, relation classification GPT-3 FS Healthcare English No
et al. (2022)
Gao et al. Event extraction ChatGPT FS General English No
(2023e)
Rehana et al. Protein–protein interaction extraction GPT-3, ChatGPT ZS Healthcare English No
(2023) and GPT-4
Yuan et al. Temporal relation extraction ChatGPT ZS General English No
(2023b)
Li et al. Entity typing, entity extraction, relation ChatGPT ZS General English No
(2023d) classification, relation extraction, event
detection, event argument extraction,
event extraction
Chan et al. Temporal relation classification, causal ChatGPT ZS, FS General English No
(2023) relation classification, discourse relation
classification
Xu et al. Relation classification GPT-3.5 FS General, scientific English Yes
(2023e) literature
Wan et al. Relation classification GPT-3.5 FS General, scientific English Yes
(2023) literature
Qin et al. Entity extraction GPT-3.5, ChatGPT ZS General English No
(2023)
Ye et al. Entity extraction, relation extraction GPT-3, GPT-3.5, ZS, FS General, social media English No
(2023) ChatGPT
Ma et al. Entity extraction, relation extraction and InstructGPT FS General English Yes
(2023a) event detection
Wang et al. Entity extraction GPT-3 FS General English No
(2023i)
Wang et al. Entity extraction, relation classification GPT-3.5, GPT-4 ZS, FS Healthcare English No
(2023n)
Stammbach Entity extraction GPT-3 ZS General English No
et al. (2022)
Wadhwa Relation extraction GPT-3 FS General, healthcare English No
et al. (2023)
Li et al. Entity extraction ChatGPT, GPT-4 FS Finance English No
(2023k)
Li et al. Entity extraction, relation extraction GPT-3, Codex FS General, scientific English No
(2023g) literature
Zhang et al. Relation classification GPT-3.5, ChatGPT ZS General English No
(2023e)
Yuan et al. (2023b) demonstrated that advanced prompting strategies like event ranking and chain-of-thought improve the performance of ChatGPT compared to vanilla prompting in temporal relation extraction. However, ChatGPT lags behind traditional neural networks like LSTM and fine-tuned pre-trained language models, which indicates the toughness of the temporal relation extraction task. Wang et al. (2023n) evaluated the performances of the latest LLMs like GPT-3.5, GPT-4, and Bard on entity extraction and relation classification in the clinical domain. Experiment results showed that GPT-4 with self-question prompting outperforms other LLMs on most of the datasets. Li et al. (2023g) compared the performances of natural language and code LLMs like GPT-3 and Codex using natural language and code-style prompts. Experiment results showed that (i) Codex outperforms the GPT-3 model and moderately sized fine-tuned models, (ii) the Codex model with either a natural language or a code-style prompt outperforms the GPT-3 model, and (iii) code-style prompts achieve better results for both the Codex and GPT-3 models. The possible explanation for this is that Codex, which is pretrained over large volumes of code, encodes structured code information, which is useful for IE tasks as they involve structured outputs. Zhang et al. (2023e) proposed the QA4RE framework, which frames relation extraction as a question-answering problem. In the QA4RE framework, the sentence serves as the context, and the relation types serve as options from which the LLM chooses. Experiment results showed that the proposed approach improves the performance of the ChatGPT and GPT-3.5 models by a good margin in relation extraction.

Some of the research works (Xu et al., 2023e; Wan et al., 2023; Ma et al., 2023a) demonstrated that GPT-3 family models can achieve SOTA results in information extraction tasks. Wan et al. (2023) achieved SOTA results in relation extraction with the GPT-RE framework. The GPT-RE framework overcomes the drawbacks of existing works using entity-aware demonstration retrieval based on fine-tuned model representations and gold label-induced reasoning. The use of representations from a fine-tuned relation model for demonstration selection is more effective as they naturally include entity and relation information. Ma et al. (2023a) proposed a ‘‘filter then rerank’’ approach to use both fine-tuned models and LLMs to take advantage of the strengths of both models for few-shot information extraction. Here the fine-tuned model acts as a filter while the LLM acts as a re-ranker. The proposed approach achieves SOTA results with an average improvement of over 2 points in the F1 score.
4.3. Question answering

Overview. Question Answering (QA) is an important natural language processing task which deals with the development of algorithms to understand and interpret user queries in natural language and then deliver accurate responses (Zaib et al., 2022; Chali et al., 2011). The main aim of question answering systems is to enhance human–computer interaction, i.e., QA systems avoid the use of complex commands and allow the user to interact with machines in a more natural way through natural language queries. For example, popular AI assistants like Amazon Alexa,2 Google Assistant3 and Apple Siri4 rely on QA to provide accurate answers to user queries. The option of interaction through natural language queries enhances the reach of technology to a broader audience. QA can be treated as a fine-grained version of information retrieval (Torfi et al., 2020), and the demand for QA systems is increasing day by day because of their ability to generate answers which are accurate, relevant and short.

2 https://alexa.amazon.com
3 https://assistant.google.com
4 https://www.apple.com/in/siri/

Research works exploring GLLMs for question answering tasks. The NLP research community explored GLLMs for question answering in various domains like education (Nunes et al., 2023; Joshi et al., 2023), news (Srivastava et al., 2022), healthcare (Samaan et al., 2023; Holmes et al., 2023a; Nori et al., 2023a; Hamidi and Roberts, 2023; Gupta et al., 2023; Tanaka et al., 2023b; Wang et al., 2023n; Weng et al., 2023; Kasai et al., 2023), social media (Ye et al., 2023), coding (Savelka et al., 2023), legal (Bommarito and Katz, 2022; Lin et al., 2022a), finance (Li et al., 2023k) and scientific literature (Pereira et al., 2023). Most of the research works focused on the English language, except a few research works focusing on languages like Portuguese (Nunes et al., 2023), Japanese (Tanaka et al., 2023b; Kasai et al., 2023) and Chinese (Weng et al., 2023). As advanced prompting methods allow GLLMs to perform well, some of the research works investigated the effectiveness of advanced prompting strategies like chain-of-thought (Nunes et al., 2023; Tan et al., 2023; Holmes et al., 2023a; Pereira et al., 2023; Wang et al., 2023n; Kasai et al., 2023), self-question prompting (Wang et al., 2023n; Weng et al., 2023) and holistically thought (Weng et al., 2023) for question answering. Table 3 presents a summary of research works exploring GLLMs for question answering across various domains and languages.

Zheng et al. (2023b) studied the shortcomings of ChatGPT in answering complex open-domain questions and found errors related to understanding, factual accuracy, specificity, and logical reasoning. They also analysed the importance of knowledge memorization, recall, and reasoning abilities in addressing these failures. The authors demonstrated that providing the model with external knowledge, cues for knowledge recall, and guidance for logical reasoning can enhance its ability to provide more accurate answers. Samaan et al. (2023) examined the accuracy of ChatGPT in answering questions related to bariatric surgery. The authors reported that ChatGPT correctly answered 131 out of 151 questions, i.e., ChatGPT achieves an accuracy of 86.8%. The impressive performance of ChatGPT shows that it can serve as an additional information resource alongside healthcare professionals and reduce their burden in answering patient questions. Holmes et al. (2023a) compared the performances of GLLMs like ChatGPT and GPT-4 with other LLMs like Bard and BLOOMZ and with medical physicists in answering questions related to Radiation Oncology Physics. The performance of GPT-4 is very impressive as the model outperforms the medical physicists and other LLMs like ChatGPT, Bard and BLOOMZ. The performance of GPT-4 is further enhanced using CoT prompting, i.e., the model is prompted to arrive at the answer after step-by-step reasoning. Nori et al. (2023a) performed a comprehensive evaluation of the GPT-4 model on medical question answering in zero and few-shot settings. For evaluation, the authors used six datasets: two related to the United States Medical Licensing Examination (USMLE) and four from the MultiMedQA benchmark (Singhal et al., 2023a). The performance of GPT-4 is very impressive as it outperforms not only a general LLM like GPT-3.5 but also a medical domain-specific LLM like Med-PaLM (Singhal et al., 2023a). Moreover, on the USMLE exam datasets, the GPT-4 model score is 20 points more than the passing score. Hamidi and Roberts (2023) evaluated ChatGPT and Claude in answering patient-specific medical questions from MIMIC-III clinical notes. Experiment results demonstrated that the performances of both models are promising as these models display significant levels of coherence, accuracy, coverage and relevance in their answers. Li et al. (2023k) demonstrated that GPT-4 achieves the best results for question answering in the finance domain and outperforms ChatGPT, domain-specific models like BloombergGPT and FinQANet, and general LLMs like OPT (66B) and BLOOM (176B). Although the performance of GLLMs is impressive in zero and few-shot settings in multiple choice question answering, these models still lag behind SOTA results. The main reason for this is the use of cloze prompts. In cloze prompts, the model is prompted with only the question, without answer options, so the model generates the answer just by conditioning on the question. Robinson and Wingate (2022) proposed a new prompting strategy called the multiple choice prompt, which prompts the model with the question and the answer options so that the model generates the answer by conditioning on both the question and the answer options. Evaluation on 20 datasets showed that the multiple-choice prompt helps GLLMs to achieve near SOTA results.

Some of the research works explored the effectiveness of GLLMs in answering exam questions from various domains. Nunes et al. (2023) investigated the performances of GLLMs like GPT-3.5, ChatGPT and GPT-4 in answering questions from the Brazilian university admission exam. Here all the questions are in the Brazilian Portuguese language. The authors explored different prompting strategies like vanilla (zero-shot and few-shot) and CoT (few-shot). The authors observed that GPT-4 outperforms all other models by a large margin of over 11 points and achieves the best results with CoT prompting in few-shot settings. Joshi et al. (2023) evaluated ChatGPT in answering undergraduate-level computer science exam questions. For the evaluation, the authors gathered (i) questions from various computer science subjects like data structures, operating systems, machine learning and database management systems, (ii) questions from the GATE exam and (iii) programming questions from the Leetcode website. The results showed that ChatGPT is inconsistent in answering the questions, so students are not advised to rely on ChatGPT completely for their assignments and exams. Bommarito and Katz (2022) examined the ability of OpenAI's text-davinci-003 (GPT-3.5) model in answering multiple choice questions from the Bar Exam. Interestingly, human participants with extensive education and specialized training achieved a 68% accuracy rate, while the GPT-3.5 model achieved a lower accuracy rate of 50.3%. Gupta et al. (2023) evaluated how effective ChatGPT is in answering questions from the plastic surgery in-service training examination. The authors reported that ChatGPT achieves an accuracy of 54.96% by correctly answering 242 questions. Tanaka et al. (2023b) evaluated the performances of GLLMs like GPT-3.5 and GPT-4 in answering questions from the Japanese National Medical Licensing Examination (NMLE). Here the input includes sample examples, instructions to translate the question into English, and then summarizing the question before answering. The authors reported that GPT-4 achieves a score better than the minimum passing score, and further analysis showed that the incorrect answers are due to insufficient medical knowledge and insufficient information about the Japanese-specific medical system. Kasai et al. (2023) reported that GPT-4 outperforms other models and passes the Japanese national medical licensing exams from the last six years. Moreover, ChatGPT with English-translated prompts achieves better results than ChatGPT with Japanese prompts. This is because ChatGPT is predominantly trained over English text corpora.

Some of the research works explored GLLMs for more challenging question answering tasks like tabular question answering (Srivastava et al., 2022), knowledge-based complex question answering (Tan et al., 2023), multiple choice code question answering (Savelka et al., 2023), multi-document question answering (Pereira et al., 2023) and conversational question answering (Weng et al., 2023).
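The difference between cloze prompts and the multiple choice prompt of Robinson and Wingate (2022), discussed above, can be made concrete with the following sketch. The templates and the example question are illustrative, not the originals.

```python
# Cloze vs. multiple-choice prompting for multiple choice question answering.
# Templates are illustrative; the example item is hypothetical.

def cloze_prompt(question: str) -> str:
    # The model sees only the question and must generate the answer text itself.
    return f"Question: {question}\nAnswer:"

def multiple_choice_prompt(question: str, options: list[str]) -> str:
    # The model sees the candidate answers and only has to pick a letter,
    # conditioning on both the question and the options.
    letters = "ABCDEFGH"
    lines = [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    return f"Question: {question}\n" + "\n".join(lines) + "\nAnswer with the option letter:"

if __name__ == "__main__":
    q = "Which organ is primarily responsible for filtering blood?"
    print(cloze_prompt(q))
    print()
    print(multiple_choice_prompt(q, ["Liver", "Kidney", "Heart", "Lung"]))
```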
Table 3
Summary of research works exploring GLLMs for question answering tasks. Here ZS represents zero-shot, and FS represents few-shot.
Paper Task(s) GLLMs explored Prompt settings Domain(s) Language(s) SOTA results
Nunes et al. Admission exam question answering GPT-3.5, ChatGPT, ZS, FS Education Brazilian Portuguese No
(2023) GPT-4
Tan et al. Knowledge-based complex question GPT-3, GPT-3.5, ZS General Multiple languages No
(2023) answering ChatGPT
Yang et al. Knowledge-based visual question GPT-3 ZS General English Yes
(2022) answering
Srivastava Tabular question answering GPT-3 ZS, FS News English No
et al. (2022)
Zheng et al. Open domain question answering ChatGPT ZS General English No
(2023b)
Samaan et al. Bariatric surgery question answering ChatGPT ZS Healthcare English No
(2023)
Holmes et al. Radiation oncology physics question ChatGPT, GPT-4 ZS Healthcare English No
(2023a) answering
Joshi et al. Computer science question answering ChatGPT ZS Education English No
(2023)
Nori et al. Medical question answering GPT-3.5, GPT-4 ZS, FS Healthcare English No
(2023a)
Hamidi and Patient-specific question answering ChatGPT ZS Healthcare English No
Roberts
(2023)
Bang et al. Question answering ChatGPT ZS General English Yes
(2023)
Qin et al. Boolean question answering ChatGPT ZS General English No
(2023)
Kocoń et al. Multiple choice question answering ChatGPT ZS General, social English No
(2023) media
Ye et al. Question answering GPT-3, GPT-3.5, ZS, FS General English No
(2023) ChatGPT
Savelka et al. Multiple choice code question answering GPT-3.5 ZS Coding English No
(2023)
Bommarito Bar exam question answering GPT-3.5 ZS Legal English No
and Katz
(2022)
Pereira et al. Multi-document question answering GPT-3.5 FS General, scientific English No
(2023) literature
Gupta et al. Plastic surgery exam question answering ChatGPT ZS Healthcare English No
(2023)
Tanaka et al. Japanese medical exam question GPT-3.5, GPT-4 FS Healthcare Japanese No
(2023b) answering
Li et al. Financial question answering ChatGPT, GPT-4 ZS Finance English No
(2023k)
Wang et al. Medical question answering GPT-3.5, GPT-4 ZS, FS Healthcare English No
(2023n)
Robinson and Multiple choice question answering GPT-3, Codex, ZS General English No
Wingate InstructGPT
(2022)
Weng et al. Medical conversational question GPT-3, ZS Healthcare English, Chinese No
(2023) answering InstructGPT
Lin et al. Question answering GPT-3 ZS Multiple domains English No
(2022a) including Legal
and Health
Kasai et al. Japanese medical exam question GPT-3, ChatGPT, FS Healthcare Japanese No
(2023) answering GPT-4
Srivastava et al. (2022) evaluated the effectiveness of GPT-3 for question answering on tabular data in zero and few-shot settings. Here the model is prompted with unstructured passage text, tabular data in JSON format, examples (in the case of few-shot) and the question. The authors reported that GPT-3 displayed the ability to successfully locate the table, comprehend its structure, and accurately access the relevant cells or passages of text in order to provide answers to the given questions. Savelka et al. (2023) evaluated the effectiveness of GPT-3.5 models in answering multiple-choice questions (MCQs), particularly those involving code snippets, from programming courses. Experiment results showed that MCQs with code snippets have lower success rates compared to those without code, indicating a challenge in answering multiple-choice questions with code snippets. Pereira et al. (2023) presented Visconde, a novel framework based on the GPT-3.5 model to tackle multi-document question answering. Visconde follows a three-step process involving decomposition, retrieval, and aggregation. The decomposition phase uses the GPT-3.5 model in few-shot settings for question simplification, the retrieval stage uses a SOTA model to select the relevant text chunks, and the final aggregation phase uses GPT-3.5 with few-shot CoT prompting to get the answer. The authors observed that CoT prompting, i.e., generating reasoning steps before generating the final answer, enhances the performance.
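The prompt layout described above for tabular question answering (Srivastava et al., 2022) can be sketched as follows: passage text, the table serialized as JSON, optional in-context examples, and the question. The field names and the example content are hypothetical.

```python
# Sketch of a tabular QA prompt combining passage text, a JSON-serialized table,
# optional few-shot examples and the question. Content shown is hypothetical.

import json
from typing import Optional

def tabular_qa_prompt(passage: str, table: dict, question: str,
                      examples: Optional[list[tuple[str, str]]] = None) -> str:
    parts = []
    for ex_q, ex_a in (examples or []):          # few-shot demonstrations, if any
        parts.append(f"Question: {ex_q}\nAnswer: {ex_a}")
    parts.append(f"Passage: {passage}")
    parts.append("Table (JSON): " + json.dumps(table))
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

if __name__ == "__main__":
    table = {"columns": ["year", "revenue"], "rows": [[2021, 10.2], [2022, 12.5]]}
    print(tabular_qa_prompt("The company reported steady growth.", table,
                            "In which year was revenue higher?"))
```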
Table 4
Summary of research works exploring GLLMs for machine translation. Here ZS represents zero-shot, and FS represents few-shot.
Paper GLLMs Prompt settings Domain(s) Language(s) Granularity Outperforms
explored Commercial
Systems
Gu (2023) ChatGPT ZS General Japanese, Chinese Sentence No
Peng et al. ChatGPT ZS General, news, healthcare English, Chinese, German, Sentence No
(2023a) Romanian
Jiao et al. ChatGPT, ZS General, healthcare, social English, Chinese, German, Sentence Yes
(2023) GPT-4 media Romanian
Hendy InstructGPT, ZS, FS News, social media, English, German, Chinese Sentence, Document Yes
et al. ChatGPT, E-Commerce, dialogue
(2023) GPT-4
Gao et al. ChatGPT ZS, FS General, news, social media, English, French, Spanish Sentence Yes
(2023d) dialogue, E-Commerce
Wang et al. ChatGPT, ZS General, social media, news, English, German, Russian Document Yes
(2023h) GPT-4 dialogue
Zhu et al. ChatGPT ZS, FS General 102 languages in 202 directions Sentence No
(2023b)
Lyu et al. ChatGPT ZS General English, Chinese, French Paragraph No
(2023b)
Bang et al. ChatGPT ZS General Twelve languages, including four Sentence No
(2023) low-resource languages
Karpinska GPT-3.5 ZS General 18 language pairs, including Sentence, Paragraph Yes
and Iyyer Japanese, English and Polish
(2023)
Moslem GPT-3.5 ZS, FS General English, Arabic, Chinese, German, Sentence Yes
et al. Spanish
(2023)
He et al. GPT-3.5 ZS, FS General English, Chinese, Japanese, Sentence No
(2023a) German, French
Raunak GPT-3.5, GPT-4 ZS General English, German, Chinese Sentence Yes
et al.
(2023b)
Raunak GPT-3.5 ZS General English, German, Russian Sentence Yes
et al.
(2023a)
Weng et al. (2023) enhanced the performance of GLLMs in answering medical conversational questions in English and Chinese using a novel prompt strategy called Holistically Thought (HoT). The HoT prompting strategy involves diffused thinking and focused thinking strategies to generate high-quality responses. Diffused thinking helps to generate various responses through diversified decoding, focused thinking generates a concise medical summary based on the dialogues, and the final response is generated based on the dialogues and the outputs of diffused and focused thinking.

Unlike all the above discussed research works, where the performances of GLLMs are just satisfactory but not SOTA, some of the research works (Yang et al., 2022; Bang et al., 2023) demonstrated that it is possible to achieve SOTA results for the question answering task using GLLMs. For example, Yang et al. (2022) explored the GPT-3 model for knowledge-based visual question answering, which involves answering questions that require information not available in the input images. The authors propose a novel approach which uses GPT-3 as an implicit and unstructured knowledge source. Experiment results showed that the proposed approach achieves new SOTA results by outperforming existing approaches with a large margin of over 8 points.

4.4. Machine translation

Overview. Machine Translation (MT), an important task of natural language processing, deals with the development of models which can translate input text from the source language to the target language (Stahlberg, 2020; Yang et al., 2020b; Tan et al., 2020). MT models receive the input text in the source language, understand the syntax and semantics of the input text and then generate the translation in the target language. So, a good machine translation model should possess strong natural language understanding and generation skills to generate quality translations. The main objective of MT systems is to enhance cross-lingual communication by reducing the gap between individuals from different linguistic communities. The evolution of MT systems started with rule-based models, followed by statistical and neural models (Tan et al., 2020). Rule-based MT systems are built on top of manually crafted syntactic and grammatical rules. As manually framing rules is heavily laborious and expensive, these systems are later replaced by statistical MT systems, which use statistical models trained on bilingual data. With the evolution of deep learning models, the research community started to build neural machine translation (NMT) systems with the help of neural models (Sutskever et al., 2014; Bahdanau et al., 2014; Luong et al., 2015). These neural models are essentially based on the encoder–decoder architecture, where the encoder understands the input sequence and encodes it into a vector, and the decoder, based on the encoder output, generates the output sequence auto-regressively. Some of the recent neural models used for translation are mBART-50 (Tang et al., 2020), M2M100 (Fan et al., 2020), NLLB200 (Costa-jussà et al., 2022), etc.

Research works exploring GLLMs for machine translation. In recent times, GLLMs like ChatGPT and GPT-4 demonstrated remarkable performances in both natural language understanding and generation tasks, and a good machine translation system requires exactly these strong natural language understanding and generation skills.
Table 5
Summary of research works exploring GLLMs for keyphrase generation task. Here ZS represents zero-shot, and FS represents few-shot.
Paper GLLMs explored Prompt settings Domain(s) Language(s) SOTA results
Martínez-Cruz et al. (2023) ChatGPT ZS News, scientific literature English Yes
Song et al. (2023) ChatGPT ZS Scientific literature English No
As ChatGPT and GPT-4 possess strong natural language understanding and generation skills, the research community investigated the effectiveness of these models for machine translation across various domains like news (Peng et al., 2023a; Hendy et al., 2023; Gao et al., 2023d; Wang et al., 2023h), healthcare (Peng et al., 2023a; Jiao et al., 2023), social media (Jiao et al., 2023; Hendy et al., 2023; Gao et al., 2023d; Wang et al., 2023h), dialogue (Hendy et al., 2023; Wang et al., 2023h; Gao et al., 2023d) and e-commerce (Hendy et al., 2023; Gao et al., 2023d). Most of the research works focused on sentence-level machine translation (Gu, 2023; Peng et al., 2023a; Jiao et al., 2023; Hendy et al., 2023; Gao et al., 2023d; Zhu et al., 2023b; Bang et al., 2023; Karpinska and Iyyer, 2023; Moslem et al., 2023; He et al., 2023a; Raunak et al., 2023b,a), except a few research works focused on paragraph-level machine translation (Lyu et al., 2023b; Karpinska and Iyyer, 2023) and document-level machine translation (Hendy et al., 2023; Wang et al., 2023h). As advanced prompting methods allow GLLMs to perform well, some of the research works investigated the effectiveness of advanced prompting strategies like pivot prompting (Jiao et al., 2023), chain-of-thought (Raunak et al., 2023b) and multi-aspect prompting and selection (He et al., 2023a). Table 4 presents a summary of research works exploring GLLMs for machine translation across various domains and languages.

Gu (2023) proposed a novel approach based on ChatGPT to enhance the quality of translation from Japanese to Chinese by effectively handling attribute clauses using a pre-edit scheme. The proposed approach, which integrates the pre-edit scheme with a novel two-step prompting strategy, enhances the translation quality by more than 35%. Peng et al. (2023a) explored the impact of temperature, task and domain information on the translation performance of ChatGPT. The authors showed that (i) ChatGPT performance degrades with an increase in temperature, and hence it is recommended to use a lower temperature (a value of 0 is recommended), and (ii) including task and domain information in the prompt enhances the performance of ChatGPT consistently for both high- and low-resource language translations. Zhu et al. (2023b) evaluated the performance of ChatGPT and other LLMs like OPT, BLOOM and XGLM on 102 languages in 202 translation directions. The authors reported that ChatGPT comprehensively outperforms the other LLMs but still lags behind neural machine translation models like NLLB in the majority of the translation directions. Further analysis showed three error types, namely hallucination, monotonic translation and off-target translation. Lyu et al. (2023b) presented some interesting research directions with respect to using LLMs for machine translation, including stylized machine translation, interactive machine translation and translation memory-based machine translation. Neural machine translation systems just focus on source–target text mapping, which results in a lot of errors. Unlike neural machine translation systems, the human translation process involves intermediate steps to ensure high translation quality. Inspired by the human translation process, He et al. (2023a) proposed MAPS, which involves three steps, knowledge mining, knowledge integration and knowledge selection, to generate quality translations. Extensive evaluation on the WMT22 test set shows that MAPS improves the performance of models like GPT-3.5 and Alpaca and also addresses the hallucination issue by resolving 59% of hallucination errors.

In all the above discussed research works, the performances of GLLMs are just satisfactory, but not on par with or beyond the performances of commercial machine translation systems. Some of the research works (Jiao et al., 2023; Hendy et al., 2023; Gao et al., 2023d; Wang et al., 2023h; Karpinska and Iyyer, 2023; Moslem et al., 2023; Raunak et al., 2023b,a) showed that it is possible to outperform commercial machine translation systems using GLLMs. For example, Jiao et al. (2023) investigated the translation capabilities of GLLMs like ChatGPT and GPT-4 and compared the performance with commercial systems like Google Translate, DeepL Translate and Tencent TranSmart. Extensive evaluation on multiple datasets showed that (i) the performance of GLLMs is on par with commercial systems in the case of high-resource languages only, and (ii) the translation quality for low-resource languages can be enhanced using a novel pivot prompting strategy, which involves translating into a high-resource language before translating into the target low-resource language. Naive prompts are unable to fully elicit the translation ability of ChatGPT. So, Gao et al. (2023d) focused on developing advanced prompting strategies by including additional information like task information, domain information and syntactic information like PoS (part-of-speech) tags. The authors showed that ChatGPT, with the proposed advanced prompting strategy, achieves promising results and even outperforms commercial systems like Google Translate and DeepL Translate. Wang et al. (2023h) examined the performances of ChatGPT and GPT-4 for document-level machine translation and also compared the results with commercial systems from Google, DeepL and Tencent. The authors reported that GLLMs do well when the sentences in the document are combined and given at once to the model. Moreover, with this prompting strategy, both GLLMs exhibit better performances than commercial machine translation systems according to human evaluation and also outperform most document-level neural machine translation methods in terms of d-BLEU scores. Karpinska and Iyyer (2023) explored the GPT-3.5 model for paragraph-level machine translation. The authors experimented with three different prompting strategies, namely translating sentence by sentence in isolation, translating sentence by sentence in the presence of the rest of the paragraph, and translating the entire paragraph at once. After extensive evaluation of 18 language pairs, including English and Japanese, the authors report that translating the entire paragraph at once outperforms the other strategies and commercial systems like Google Translate. Raunak et al. (2023a) examined the differences between the translations generated by GLLMs like GPT-3.5 and NMT systems like Microsoft Translator. The authors reported that GLLM-generated translations are less literal, with better scores.

4.5. Keyphrase generation

Overview. Keyphrase generation (KPG) involves generating a set of phrases that capture the main ideas of a document (Meng et al., 2021). The primary advantage of KPG over keyphrase extraction is the ability to generate both extractive and abstractive keyphrases. Keyphrase generation is approached as a sequence-to-sequence generation task (Sutskever et al., 2014; Yuan et al., 2020; Kulkarni et al., 2022) in the existing works. The current state-of-the-art model for keyphrase generation is KeyBART (Kulkarni et al., 2022), which is based on BART and trained using the text-to-text generation paradigm. Table 5 presents a summary of research works exploring GLLMs for keyphrase generation.

Research works exploring GLLMs for keyphrase generation. Martínez-Cruz et al. (2023) performed a comprehensive evaluation of ChatGPT as a keyphrase generator by evaluating its performance on six datasets using six candidate prompts. The authors reported that the results are promising, but ChatGPT struggles in the case of generating absent keyphrases. Song et al. (2023) evaluated ChatGPT on multiple datasets from the news and scientific literature domains having both short and long documents. Experiment results showed that ChatGPT outperforms KeyBART (Kulkarni et al., 2022), the SOTA model, on all the datasets.
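As an illustration of how keyphrase generation can be posed to a GLLM in zero-shot settings, the sketch below builds a prompt and parses the comma-separated response. The template is one of many plausible candidates and is not one of the specific prompts evaluated by Martínez-Cruz et al. (2023) or Song et al. (2023).

```python
# A minimal keyphrase-generation prompt and response parser. The template and
# example content are illustrative assumptions.

def keyphrase_prompt(document: str, k: int = 10) -> str:
    return (
        f"Document:\n{document}\n\n"
        f"Generate up to {k} keyphrases that capture the main ideas of the document. "
        "Include abstractive keyphrases that do not appear verbatim in the text. "
        "Return them as a comma-separated list."
    )

def parse_keyphrases(model_output: str) -> list[str]:
    return [p.strip().lower() for p in model_output.split(",") if p.strip()]

if __name__ == "__main__":
    print(keyphrase_prompt("Large language models solve many NLP tasks via prompting..."))
    print(parse_keyphrases("large language models, prompting, zero-shot learning"))
```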
Table 6
Summary of research works exploring GLLMs for various dialogue tasks. Here ZS represents zero-shot, and FS represents few-shot.
Paper Task(s) GLLMs explored Prompt settings Domain(s) Language(s)
Pan et al. (2023) Spoken language understanding GPT-3.5, ChatGPT ZS General English
and dialogue state tracking
Zhao et al. (2023b) Emotion dialogue understanding ChatGPT ZS, FS General English
and generation tasks
Chintagunta et al. (2021) Dialogue summarization GPT-3 ZS Healthcare English
Bang et al. (2023) Dialogue generation ChatGPT ZS General English
Qin et al. (2023) Dialogue summarization ChatGPT ZS General English
Prodan and Pelican (2022) Dialogue summarization GPT-3 FS General English
Huynh et al. (2023) Dialogue evaluation GPT-3 FS General English
Fan and Jiang (2023) Dialogue discourse analysis ChatGPT ZS, FS General English, Chinese
Wang et al. (2023k) Dialogue question answering ChatGPT ZS, FS General English, Chinese
Table 7
Summary of research works exploring GLLMs for information retrieval tasks. Here ZS represents zero-shot, and FS represents few-shot.
Paper Task(s) GLLMs explored Prompt settings Domain(s) Language(s) SOTA results
Sun et al. (2023c) Passage re-ranking GPT-3, GPT-3.5, ZS, FS General, news, healthcare, English, ten low Yes
ChatGPT, GPT-4 scientific literature resource languages
Ziems et al. (2023b) Document retrieval GPT-3.5 ZS, FS General English Yes
4.6. Dialogue tasks

Overview. Dialogue tasks in natural language processing (NLP) deal with understanding and generating human-like conversations between machines and users (Serban et al., 2018). The main objective of these tasks is to enable machines to have conversations with humans in a natural way. Dialogue tasks are essential components of building effective conversational agents, which have a wide range of applications, including customer support (Serban et al., 2018; Larson and Leach, 2022).

Research works exploring GLLMs for dialogue tasks. The research community explored GLLMs like GPT-3, GPT-3.5 and ChatGPT for various dialogue tasks like dialogue summarization (Chintagunta et al., 2021; Qin et al., 2023; Prodan and Pelican, 2022), dialogue question answering (Wang et al., 2023k), emotion dialogue understanding and generation (Zhao et al., 2023b), dialogue state tracking (Pan et al., 2023), dialogue generation (Bang et al., 2023), and dialogue discourse analysis (Fan and Jiang, 2023). Some of the research works explored LLMs for the evaluation of dialogue tasks (Huynh et al., 2023). Most of the research works focused on general domain and English language datasets, except a few research works which focused on the medical domain (Chintagunta et al., 2021) and languages like Chinese (Fan and Jiang, 2023; Wang et al., 2023k). Table 6 presents a summary of research works exploring GLLMs for various dialogue tasks.

Pan et al. (2023) reported that ChatGPT exhibits better performance in dialogue state tracking compared to spoken language understanding. Further, the authors showed that the performance of ChatGPT can be enhanced by (i) using a multi-turn interactive prompt for dialogue state tracking and (ii) providing additional details like slot names, examples and descriptions for slot filling in spoken language understanding. Zhao et al. (2023b) explored the emotion dialogue capabilities of ChatGPT by evaluating the model on five different tasks, namely emotion recognition, emotion cause recognition, dialogue act classification (emotion dialogue understanding), empathetic response generation and emotion support generation. It is reported that ChatGPT exhibits better performances in emotion dialogue generation compared to emotion dialogue understanding. Chintagunta et al. (2021) showed that an in-house model trained on GPT-3 generated summaries achieves performances comparable to one trained on human-generated summaries. Further, the in-house model trained on mixed summaries (human-generated and GPT-3 generated) achieves better performances than those trained on either one of the summary types alone.

Prodan and Pelican (2022) proposed a scoring system to choose the best examples for dialogue summarization using few-shot GPT-3. The proposed scoring system enhances the quality of generated summaries with an 11% reduction in failures. Huynh et al. (2023) studied the impact of various aspects influencing the performance of LLMs as dialogue evaluators. The authors reported that the performance as a dialogue evaluator largely depends on the diversity and relevance of the datasets used for instruction tuning. Fan and Jiang (2023) investigated the effectiveness of ChatGPT for dialogue discourse analysis by evaluating its performance on three tasks, namely topic segmentation, discourse parsing and discourse relation recognition. ChatGPT's performance is promising in the case of topic segmentation, and CoT prompting enhances the performance. Wang et al. (2023k) proposed a novel approach based on explicit CoT prompting and demonstration selection to answer dialogue questions in few-shot settings.

4.7. Information retrieval

Information retrieval (IR) involves accessing and retrieving relevant information from large volumes of data. Here, the main objective is to provide users with the most relevant information by matching their queries to the content of documents and ranking them based on relevance (Anand et al., 2022). The process includes indexing, query formulation, search and retrieval, ranking, and presentation. Information retrieval is utilized in a wide range of fields, such as web search engines, digital libraries, e-commerce, healthcare, and scientific research (Anand et al., 2022). It plays a vital role in facilitating efficient and effective access to information in the modern digital era. Table 7 presents a summary of research works exploring GLLMs for information retrieval tasks.

Sun et al. (2023c) explored the effectiveness of GPT-3 family models like GPT-3, GPT-3.5, ChatGPT and GPT-4 for passage re-ranking in information retrieval. The results are promising as GPT-4 outperforms SOTA models like monoT5-3B (Nogueira et al., 2020) on multiple benchmarks. Moreover, a compact model trained on ChatGPT-generated data demonstrates superior performance compared to the monoT5-3B model when evaluated on the MS MARCO dataset in the BEIR (Thakur et al., 2021) benchmark. The existing approaches for document retrieval employ dual dense encoders, which encode the query and the document independently, resulting in shallow interaction between query and document (Zhao et al., 2022b). To overcome this drawback, Ziems et al. (2023b) proposed a novel approach which involves generating URLs using LLMs for document retrieval. The authors reported that document retrieval by generating URLs outperforms existing approaches.
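A listwise re-ranking prompt of the kind studied for GLLM-based passage re-ranking (Sun et al., 2023c) can be sketched as follows. The instruction wording, identifier format and example passages are illustrative assumptions, not the exact templates from that work.

```python
# Sketch of a listwise passage re-ranking prompt and a parser for the model's
# ranking string. Wording and examples are assumptions.

def rerank_prompt(query: str, passages: list[str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        f"Query: {query}\n"
        f"Passages:\n{numbered}\n"
        "Rank the passages from most to least relevant to the query. "
        "Answer with the passage numbers only, e.g. 2 > 1 > 3."
    )

def parse_ranking(model_output: str, n: int) -> list[int]:
    order = [int(tok) for tok in model_output.replace(">", " ").split() if tok.isdigit()]
    return [i for i in order if 1 <= i <= n]

if __name__ == "__main__":
    print(rerank_prompt("side effects of aspirin",
                        ["Aspirin is used to reduce fever.",
                         "Common side effects include stomach upset.",
                         "The Eiffel Tower is in Paris."]))
    print(parse_ranking("2 > 1 > 3", 3))
```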
Table 8
Summary of research works exploring GLLMs for recommendation systems. Here ZS represents zero-shot, and FS represents few-shot.
Paper GLLMs explored Prompt settings Domain(s) Language(s) SOTA results
Wang and Lim (2023) GPT-3.5 ZS Movies English No
Dai et al. (2023c) GPT-3.5, ChatGPT ZS, FS News, books, movies, music English No
Gao et al. (2023c) GPT-3.5, ChatGPT ZS Movies English No
Mysore et al. (2023) InstructGPT FS Social media English No
Kang et al. (2023b) GPT-3.5, ChatGPT ZS, FS Movies, books English No
Zhang et al. (2023a) ChatGPT ZS Music, Movies English No
Liu et al. (2023e) ChatGPT ZS, FS Beauty English Yes
Hou et al. (2023a) ChatGPT ZS Movies, games English No
Zhiyuli et al. (2023) ChatGPT ZS, FS Books English No
4.8. Recommendation systems

Overview. Recommendation systems aim to reduce information overload and enhance the user experience by making relevant recommendations related to products or content based on user preferences and behaviour (Adomavicius and Tuzhilin, 2005). In recent times, recommendation systems have gained immense popularity and are extensively utilized across a range of fields, such as entertainment, e-commerce, social media, etc. For example, popular platforms like YouTube and Netflix use recommendation systems to suggest relevant videos, and platforms like Amazon use recommendation systems to suggest relevant products to the user (Peng, 2022). The commonly used approaches for recommendation systems are collaborative filtering (Rezaimehr and Dadkhah, 2021), content-based (Xie et al., 2023a) and knowledge-based (Dong et al., 2020). The performance of traditional recommendation systems is limited by a number of issues like the cold-start problem, poor generalization across domains and lack of explainability (Gao et al., 2023c; Zhu et al., 2021).

To overcome these drawbacks of traditional recommendation systems, recent works explored GPT-3 family LLMs for various recommendation tasks like next item prediction (Wang and Lim, 2023), rating prediction (Gao et al., 2023c; Zhiyuli et al., 2023), top-k predictions (Gao et al., 2023c), direct recommendation (Liu et al., 2023e), sequence recommendation (Liu et al., 2023e) and generating explanations (Liu et al., 2023e). The evaluation is done in a variety of domains like movies (Wang and Lim, 2023; Dai et al., 2023c; Gao et al., 2023c; Kang et al., 2023b; Zhang et al., 2023a; Hou et al., 2023a), news (Dai et al., 2023c), books (Dai et al., 2023c; Kang et al., 2023b; Zhiyuli et al., 2023), music (Dai et al., 2023c; Zhang et al., 2023a), social media (Mysore et al., 2023), beauty (Liu et al., 2023e), and games (Hou et al., 2023a). Table 8 presents a summary of research works exploring GLLMs for recommendation systems.

Research works exploring GLLMs for recommendation systems. Wang and Lim (2023) proposed a novel prompting strategy called ‘‘Next-Item Recommendation (NIR)’’ to recommend movies using GLLMs. The proposed prompting strategy involves a three-step process to capture the user's preferences, choose representative movies they have watched in the past, and provide a ranked list of ten recommended movies. Dai et al. (2023c) reported that ChatGPT outperforms other GLLMs and is more effective with pair-wise and list-wise ranking compared to point-wise ranking. When it comes to balancing cost and performance, ChatGPT with list-wise ranking outperforms both the point-wise and pair-wise ranking approaches. ChatGPT demonstrates the potential for providing explanations for recommendations and addressing the challenges of the cold-start problem. Gao et al. (2023c) proposed Chat-REC, which leverages GLLMs to build conversational recommendation systems. The authors reported that Chat-REC performs well in tasks like top-k recommendations and zero-shot rating prediction. Moreover, Chat-REC enhances conversational recommendation systems by making them more interactive and providing clear explanations.

Mysore et al. (2023) explored GLLMs like InstructGPT to generate synthetic data, and the experiment results showed that narrative-driven recommendation models trained on the augmented datasets outperform LLM baselines and other approaches. Kang et al. (2023b) evaluated GLLMs like GPT-3.5 and ChatGPT on user rating prediction in zero and few-shot settings. Based on the experimental findings on datasets from the movie and book domains, the authors reported that traditional models that have access to user interaction data perform better than GLLMs. Zhang et al. (2023a) introduced FaiRLLM, a new benchmark having eight sensitive attributes from domains like movies and music, to investigate the fairness of GLLM recommendations. The authors reported that GLLM-based recommendation systems are not fair to certain sensitive attributes.

Liu et al. (2023e) evaluated the performance of ChatGPT on five recommendation tasks, which include predicting ratings, direct recommendation, sequence recommendation, generating explanations, and summarizing reviews. Based on the evaluation on Amazon beauty datasets, the authors reported that (i) ChatGPT is much better at rating prediction compared to other tasks like direct and sequence recommendation, and (ii) ChatGPT achieves new SOTA results in generating explanations based on human evaluation. Hou et al. (2023a) demonstrated that GLLMs possess strong potential for zero-shot ranking tasks, showcasing performance that is comparable to or even superior to traditional recommendation models. Here, the authors designed the prompts in a way that important information like candidate items, sequential interaction history and the ranking instruction is included. Zhiyuli et al. (2023) proposed BookGPT, a novel framework which leverages GLLMs like ChatGPT for book recommendation. Specifically, the performance of BookGPT is evaluated on three sub-tasks, namely the book rating task, the book summary recommendation task and the user rating recommendation task. The performance of BookGPT is promising in all three sub-tasks, and the performance increases with an increase in prompt examples.

4.9. Coding tasks

Overview. Software engineering is a discipline which deals with designing, developing, testing, and maintaining software systems (Hou et al., 2023b). To create software systems, software engineers use a variety of programming languages, development tools, and technologies. To aid software engineers and enhance their productivity, the research community focused on automating a number of coding tasks like code generation from natural language descriptions, code repair, code explanation generation, code hint generation, code completion, code documentation generation, test case generation, code vulnerability detection, code refactoring, etc. The evolution of pre-trained source code models has paved the way for achieving cutting-edge results across coding tasks (Shi et al., 2023). Some of the popular pretrained source code models are CodeBERT (Feng et al., 2020), CodeGPT (Lu et al., 2021), CoTexT (Phan et al., 2021), GraphCodeBERT (Guo et al., 2020), CodeT5 (Wang et al., 2021d), CodeT5+ (Wang et al., 2023d), PLBART (Ahmad et al., 2021), PyCodeGPT (Zan et al., 2022), etc. Inspired by the success of GLLMs in NLP tasks, the research community focused on assessing the performances of these models on coding tasks as well.
Table 9
Summary of research works exploring GLLMs for various coding tasks. Here ZS represents zero-shot, and FS represents few-shot.
Paper GLLMs explored Task(s) Prompt settings Language(s) SOTA results
Xia and Zhang (2023) ChatGPT Code repair ZS, FS Java Yes
Cheshkov et al. (2023) GPT-3, ChatGPT Code vulnerability detection ZS Java No
Yetiştiren et al. (2023) ChatGPT Code generation ZS Python No
Li et al. (2023l) ChatGPT Finding failure-inducing test cases ZS Python Yes
Liu et al. (2023a) ChatGPT Code generation ZS Java, C# No
Poldrack et al. (2023) GPT-4 Code generation, code refactoring, test ZS Python No
case generation
Liu et al. (2023j) ChatGPT, GPT-4 Code generation ZS Python No
Chen et al. (2023b) ChatGPT Code explanation generation ZS Python No
Nascimento et al. (2023) ChatGPT Code generation ZS C++ No
Khan and Uddin (2022) Codex Code documentation generation ZS, FS Java, Python, PHP, Yes
GO, Ruby, JS
Leinonen et al. (2023) GPT-3 Code explanation generation ZS C No
Li et al. (2023i) ChatGPT Code generation ZS Python Yes
Prenner and Robbes (2021) Codex Automatic code repair ZS, FS Python, Java No
Siddiq et al. (2023) Codex, ChatGPT Unit test generation ZS Java No
Tian et al. (2023) ChatGPT Code generation, APR, Code explanation ZS Python No
generation
Geng et al. (2023) Codex Code documentation generation ZS, FS Java Yes
Kang et al. (2023a) Codex, ChatGPT Automate program repair ZS Python, Java No
Kashefi and Mukerji (2023) ChatGPT Code generation ZS C, C++, Python, Julia, MATLAB No
Destefanis et al. (2023) GPT-3.5 Code generation ZS Java No
Yuan et al. (2023a) ChatGPT Unit test generation ZS Java No
Phung et al. (2023) ChatGPT, GPT-4 Code repair, code completion, code explanation generation, coding hints generation ZS Python No
and Uddin, 2022; Li et al., 2023i; Prenner and Robbes, 2021; Tian et al., 2023; Kang et al., 2023a; Kashefi and Mukerji, 2023; Phung et al., 2023), PHP (Khan and Uddin, 2022), GO (Khan and Uddin, 2022), Ruby (Khan and Uddin, 2022), JavaScript (Khan and Uddin, 2022), C (Leinonen et al., 2023; Kashefi and Mukerji, 2023), C++ (Nascimento et al., 2023; Kashefi and Mukerji, 2023), Julia (Kashefi and Mukerji, 2023), and MATLAB (Kashefi and Mukerji, 2023). Most of the research works focused on the Python and Java languages, while a few research works focused on other languages like PHP, GO, Ruby, JavaScript, C, C++, Julia and MATLAB. The assessment is done in zero and few-shot settings using mostly direct prompts. Table 9 presents a summary of research works exploring GLLMs for various coding tasks.

Some of the research works (Yetiştiren et al., 2023; Liu et al., 2023j; Nascimento et al., 2023; Kashefi and Mukerji, 2023; Destefanis et al., 2023) explored GLLMs for the code generation task. Yetiştiren et al. (2023) compared various AI-assisted code generation tools like ChatGPT, Amazon's CodeWhisperer and GitHub's Copilot on the HumanEval (Chen et al., 2021b) dataset. ChatGPT outperforms the other tools by generating correct code 65.2% of the time, while the other tools generate correct code at most 46.3% of the time. The test cases in existing datasets for code generation evaluation are limited in terms of quality and quantity. So, Liu et al. (2023j) proposed EvalPlus, a new framework for automatic test case generation using ChatGPT and the traditional mutation approach. The authors use EvalPlus to develop HumanEvalPlus on top of the HumanEval (Chen et al., 2021b) dataset. The authors reported that HumanEvalPlus can detect a lot of incorrectly generated code that was previously undetected. Nascimento et al. (2023) compared the quality of code generated by ChatGPT and software developers for competitive coding problems on the LeetCode platform using various evaluation metrics. The authors reported that ChatGPT exhibits better performance compared to novice programmers but is outperformed by experienced programmers. Kashefi and Mukerji (2023) explored how effective ChatGPT is at generating code for numerical methods in five different programming languages: C, C++, Python, MATLAB and Julia. The authors observed that the results are promising but have some limitations which require further investigation. Destefanis et al. (2023) assessed the code generation ability of LLMs like Bard and GPT-3.5 by evaluating their performance in generating Java code from natural language descriptions. The authors observed that GPT-3.5 outperforms the Bard model by a large margin of more than 37%.

Some of the research works (Prenner and Robbes, 2021; Tian et al., 2023; Kang et al., 2023a; Phung et al., 2023) explored GLLMs for the code repair task. Prenner and Robbes (2021) explored the Codex model for automatic program repair in the Python and Java programming languages. The authors observed that the performance of Codex is comparable to state-of-the-art methods. Moreover, the Codex model is slightly better at fixing errors in Python than in Java. Kang et al. (2023a) developed AutoSD, a novel framework for automatic program repair using GLLMs. The authors reported that the evaluation on three standard datasets showed that the proposed framework is on par with the baselines.

Unit tests generated using traditional approaches suffer from low readability (Yuan et al., 2023a). To address this drawback, some of the research works (Siddiq et al., 2023; Yuan et al., 2023a) explored GLLMs for test case generation. Siddiq et al. (2023) evaluated models like Codex and ChatGPT for unit test generation for Java code. Experiment results showed that Codex performs better, with 80% coverage on the HumanEval dataset. However, both models perform poorly in the case of the SF110 benchmark, with less than 2% coverage. Yuan et al. (2023a) designed a ChatGPT-based unit test generation framework called ‘‘Chat-Tester’’. The iterative test refiner helps Chat-Tester to generate better unit tests compared to vanilla ChatGPT.
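The iterative refinement idea behind frameworks such as Chat-Tester can be summarised as a generate-validate-refine loop: generate a test, try to compile and run it, and feed any error log back to the model. The snippet below is only an illustration of this loop, not the authors' implementation; `call_llm` and `compile_and_run` are hypothetical placeholders for a GLLM API wrapper and a build-and-execute step.

```python
# Illustrative generate-validate-refine loop for LLM-based unit test generation
# (in the spirit of Chat-Tester). All helpers and prompts are hypothetical.

def call_llm(prompt: str) -> str:
    """Placeholder for a ChatGPT/GPT-4 style API call returning generated text."""
    raise NotImplementedError

def compile_and_run(test_code: str) -> tuple[bool, str]:
    """Placeholder: compile and execute the generated test, return (passed, error_log)."""
    raise NotImplementedError

def generate_unit_test(focal_method: str, class_context: str, max_rounds: int = 3) -> str:
    # Initial generation, conditioned on the method under test and its class context.
    prompt = (
        "Write a JUnit test for the following Java method.\n"
        f"Class context:\n{class_context}\n\nFocal method:\n{focal_method}"
    )
    test_code = call_llm(prompt)
    # Iterative test refinement driven by compilation/execution feedback.
    for _ in range(max_rounds):
        passed, error_log = compile_and_run(test_code)
        if passed:
            break
        test_code = call_llm(
            "The following JUnit test fails to compile or run.\n"
            f"Test:\n{test_code}\n\nError log:\n{error_log}\n\n"
            "Return a corrected version of the test."
        )
    return test_code
```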
In all the above discussed research works, the performance of GLLMs in various coding tasks is promising but still lags behind SOTA results. However, some of the research works (Xia and Zhang, 2023; Li et al., 2023l; Khan and Uddin, 2022; Li et al., 2023i; Geng et al., 2023) demonstrated that GLLMs can achieve SOTA results in coding tasks. Xia and Zhang (2023) proposed ChatRepair, an automatic program repair tool based on ChatGPT. ChatRepair achieves remarkable performance, surpassing all the existing methods. It successfully resolves 114 and 48 bugs on Defects4j 1.2 and 2.0 (Just et al., 2014), respectively, outperforming the previous best by 15 and 17 bugs, respectively. Khan and Uddin (2022) explored Codex, a GPT-3 family model pretrained on natural and programming languages, to automate code documentation generation. The evaluation results on six programming languages showed that Codex, with just one example, outperforms existing approaches by a large margin of 11.2%. Geng et al. (2023) explored Codex for code documentation generation and demonstrated that few-shot in-context learning with systematic demonstration selection helps the model to achieve new SOTA results on two standard datasets related to the Java language.
Some of the research works (Li et al., 2023l; Liu et al., 2023a; Li et al., 2023i) explored advanced prompting strategies like CoT, brainstorming, differential prompting, etc., for coding tasks. Liu et al. (2023a) evaluated the code generation capabilities of ChatGPT by evaluating its performance on text-to-code and code-to-code generation tasks on CodeXGLUE (Lu et al., 2021) datasets. The authors observed that advanced prompting strategies like CoT enhance the code generation capabilities of models like ChatGPT. Li et al. (2023i) proposed Brainstorm, a new framework for code generation. Brainstorm involves three steps: brainstorming to generate diverse thoughts, thought selection to select the best thought using a ranking model, and code writing to generate the code based on the problem statement and the best thought. The authors reported that the proposed framework helps ChatGPT to increase its performance by more than 50% and achieve new SOTA results on the CodeContests (Li et al., 2022a) benchmark. Li et al. (2023l) showed that directly using ChatGPT to find failure-inducing test cases results in poor performance. So, the authors proposed a new prompting strategy called ‘‘Differential Prompting’’, which enables ChatGPT to achieve new SOTA results on the QuixBugs dataset (Lin et al., 2017). Differential Prompting involves program intention inference followed by two more steps: program generation and differential testing.
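The three-step pipeline described above maps naturally onto a small wrapper around an LLM API. The sketch below is only illustrative: `call_llm` and `rank_thoughts` are hypothetical placeholders (Brainstorm uses a trained ranking model for the selection step), and the prompts are not taken from the original paper.

```python
# Illustrative brainstorm -> select -> write-code pipeline (in the spirit of
# Brainstorm). Helper names and prompt wording are hypothetical.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a ChatGPT-style API call

def rank_thoughts(problem: str, thoughts: list[str]) -> str:
    """Placeholder for the ranking step; Brainstorm trains a dedicated ranker,
    but any scoring heuristic could be plugged in here."""
    raise NotImplementedError

def solve(problem: str, num_thoughts: int = 5) -> str:
    # Step 1: brainstorming - sample several diverse high-level ideas.
    thoughts = [
        call_llm(f"Problem:\n{problem}\n\nSuggest one high-level idea for solving it.")
        for _ in range(num_thoughts)
    ]
    # Step 2: thought selection - keep the most promising idea.
    best_thought = rank_thoughts(problem, thoughts)
    # Step 3: code writing - condition generation on the problem and the chosen idea.
    return call_llm(
        f"Problem:\n{problem}\n\nApproach:\n{best_thought}\n\nWrite a Python solution."
    )
```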
4.10. Multimodal AI tasks

Overview. Traditional AI systems are designed to handle data from a single modality, such as text, image, audio or video. As real-world data is often multi-modal, researchers focused on developing multi-modal AI systems which can leverage input data from multiple modalities to generate more accurate results. Multi-modal AI systems leverage techniques from different areas of AI, like natural language processing, computer vision, speech processing, etc., to process multi-modal input data effectively (Sundar and Heck, 2022; Xu et al., 2023d). Multi-modal AI systems can perform a variety of understanding and generation tasks like visual question answering (Shao et al., 2023; Lin et al., 2022c; Yang et al., 2022; Gui et al., 2022), text-to-image generation (Lu et al., 2023b; Zhu et al., 2023c; Zhang et al., 2023k), text-to-video generation (Hong et al., 2023), text-to-speech synthesis (Huang et al., 2023b), speech-to-text conversion (Huang et al., 2023b), image captioning (Ranjit et al., 2023), etc.
Research works exploring GLLMs for Multimodal AI tasks. After the huge success of LLMs in natural language generation and understanding tasks, the research community recently explored GPT-3 family models in multi-modal understanding and generation tasks in various combinations like image+language (Kalakonda et al., 2022; Wu et al., 2023b; Shao et al., 2023; Yang et al., 2023e; Ranjit et al., 2023; Lin et al., 2022c; Lu et al., 2023b; Zhu et al., 2023c; Li et al., 2023f; Hakimov and Schlangen, 2023; Yang et al., 2022; Feng et al., 2023; Zhang et al., 2023k; Fan et al., 2023; Li et al., 2023h; Gui et al., 2022), video+language (Bhattacharya et al., 2023; Hong et al., 2023) and audio+language (Mei et al., 2023; Zhang et al., 2023h). Most of the research works focused on general domain datasets, while some of the research works focused on specific domains like healthcare (Ranjit et al., 2023; Li et al., 2023h). Table 10 presents a brief summary of research works exploring GLLMs for various multimodal AI tasks.

Some of the research works developed multi-modal AI systems for a specific task like action generation (Kalakonda et al., 2022), knowledge-based visual question answering (Shao et al., 2023; Lin et al., 2022c; Yang et al., 2022; Gui et al., 2022), X-ray report generation (Ranjit et al., 2023), named entity recognition (Li et al., 2023f), text-to-video generation (Hong et al., 2023), layout generation (Feng et al., 2023) and text-to-image generation (Zhang et al., 2023k). Kalakonda et al. (2022) proposed a GPT-3 based plug-and-play framework called Action-GPT for text-based action generation. Here, the authors generated multiple detailed body movement descriptions from the action phrases and then used them to generate actions. Shao et al. (2023) proposed Prophet, which avoids using an external knowledge base by using GPT-3 as an implicit knowledge base and includes a vanilla visual question answering model to provide answer heuristics to GPT-3. The answer heuristics, along with caption and question information, provide rich task-specific information to the GPT-3 model, which results in much better performance. Ranjit et al. (2023) proposed automatic X-ray report generation based on a contrastively pretrained vision-language encoder and GPT-3 family models like GPT-3.5, ChatGPT and GPT-4. The contrastively pretrained encoder is used to encode the input X-ray image into an image vector embedding, based on which the most similar sentences from the radiology report corpus are retrieved. The retrieved similar sentences form the context and allow the LLM to generate a quality X-ray report. Li et al. (2023f) proposed PGIM, a two-stage approach which utilizes ChatGPT as an implicit knowledge base for the multi-modal NER task. In the first stage, ChatGPT, when prompted with text descriptions of the image, generates the auxiliary knowledge. In the second stage, the downstream model receives the raw text and the ChatGPT-generated auxiliary knowledge as input. The authors reported that the proposed approach outperforms existing SOTA approaches based on text-text and text-image paradigms.

Hong et al. (2023) proposed DirecT2V for text-to-video generation, which leverages the GPT-4 model as a frame-level director. Here, the GPT-4 model generates descriptions for each frame based on a single prompt, and then a text-to-image model is used to generate frames based on these descriptions. Feng et al. (2023) developed LayoutGPT, which leverages LLM and layout-to-image models to generate 2D and 3D planning layouts from text descriptions. Zhang et al. (2023k) proposed ‘‘Control-GPT’’ based on LLMs and diffusion models for controllable text-to-image generation. Here, GPT-4 generates sketches in the form of TikZ code based on the text instructions, and then a diffusion model generates realistic images with the generated sketches and the text instructions as input. Here, the generated sketches help diffusion models to get a better idea about spatial relationships.

Some of the research works focused on developing multi-modal AI systems which can handle multiple tasks (Wu et al., 2023b; Yang et al., 2023e; Bhattacharya et al., 2023; Hakimov and Schlangen, 2023; Zhao et al., 2023a; Huang et al., 2023b). As ChatGPT is trained on one data modality, i.e., text data, ChatGPT can only handle text inputs, and training models from scratch for vision-language tasks is not a feasible option as it involves huge computation. So, Wu et al. (2023b) developed Visual ChatGPT based on ChatGPT and various visual foundation models to handle 22 vision-language tasks. Bhattacharya et al. (2023) proposed a novel three-stage approach to handle five video understanding tasks. The proposed approach involves transforming video into text stories and then using this text content for the video understanding tasks. Hakimov and Schlangen (2023) explored the GPT-3 model for five vision-language tasks, including four classification tasks and one question answering task. Here, the model is prompted with a text description of the input image along with other elements like the task instruction and similar examples. Huang et al. (2023b) proposed AudioGPT, which allows ChatGPT to handle multiple audio understanding and generation tasks with the help of audio foundation models.
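As a concrete illustration of the answer-heuristics idea used by Prophet (discussed above), the sketch below assembles a prompt from an image caption, the question and the candidate answers produced by a vanilla VQA model. The wording, the `call_llm` wrapper and the example values are hypothetical and are not taken from the original implementation.

```python
# Illustrative prompt assembly for knowledge-based VQA with answer heuristics
# (in the spirit of Prophet). All components and values are placeholders.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a GPT-3 style completion call

def build_vqa_prompt(caption: str, question: str,
                     answer_heuristics: list[tuple[str, float]]) -> str:
    # Candidate answers (with confidences) come from a vanilla VQA model and
    # act as task-specific hints for the LLM.
    candidates = "\n".join(f"- {ans} (confidence {conf:.2f})"
                           for ans, conf in answer_heuristics)
    return (
        "Answer the question about the image using the caption and the candidates.\n"
        f"Image caption: {caption}\n"
        f"Question: {question}\n"
        f"Candidate answers:\n{candidates}\n"
        "Answer:"
    )

if __name__ == "__main__":
    prompt = build_vqa_prompt(
        caption="A man riding a wave on a surfboard.",
        question="What sport is being performed?",
        answer_heuristics=[("surfing", 0.92), ("swimming", 0.05)],
    )
    print(prompt)  # this prompt would then be sent to the GLLM via call_llm
```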
Table 10
Summary of research works exploring GLLMs for various multimodal AI tasks. Here ZS represents zero-shot, and FS represents few-shot.
Paper GLLMs explored Task(s) Prompt Multimodality Domain
settings
Kalakonda et al. (2022) GPT-3 Text-based action generation ZS Image + Language General
Wu et al. (2023b) ChatGPT Twenty two vision language tasks ZS Image + Language General
Shao et al. (2023) GPT-3 Knowledge-based visual question answering FS Image + Language General
Mei et al. (2023) ChatGPT Audio labelling ZS Audio + Language General
Yang et al. (2023e) ChatGPT Multi-image reasoning, multi-hop document ZS Image + Language General
understanding, open-world concept
understanding, video summarization
Ranjit et al. (2023) GPT-3.5, Chest X-ray report generation ZS Image + Language Healthcare
ChatGPT, GPT-4
Lin et al. (2022c) GPT-3 Knowledge-based visual question answering FS Image + Language General
Bhattacharya et al. (2023) GPT-3.5 Five video understanding tasks ZS Video + Language General
Zhang et al. (2023h) GPT-4 Generate instructions ZS Audio + Language General
Lu et al. (2023b) GPT-3.5, GPT-4 Evaluator for text-to-image generation ZS Image + Language General
Zhu et al. (2023c) GPT-3, GPT-3.5 Editing in text-to-image generation FS Image + Language General
Li et al. (2023f) ChatGPT Multimodal named entity recognition FS Image + Language General
Hakimov and Schlangen (2023) GPT-3 Five vision language tasks (four FS Image + Language General
classification tasks and one question
answering task)
Hong et al. (2023) GPT-4 Text-to-video generation ZS Video + Language General
Yang et al. (2022) GPT-3 Knowledge-based visual question answering FS Image + Language General
Feng et al. (2023) GPT-3.5, Layout generation FS Image + Language General
ChatGPT, GPT-4
Zhao et al. (2023a) ChatGPT, GPT-4 Multimodal tasks covering text, video, audio ZS Multimodal covering text, General
and images video, audio and images
Zhang et al. (2023k) GPT-3.5, Controlled text-to-image generation ZS Image + Language General
ChatGPT, GPT-4
Fan et al. (2023) ChatGPT Paraphrasing ZS Image + Language General
Huang et al. (2023b) ChatGPT Audio understanding and generation tasks ZS Multimodal covering text, General
audio and images
Li et al. (2023h) GPT-4 Generate instruction tuning dataset FS Image + Language Healthcare
Gui et al. (2022) GPT-3 Knowledge-based visual question answering FS Image + Language General
Some of the research works explored GPT-3 family models for other tasks like data labelling (Mei et al., 2023), generating instructions (Zhang et al., 2023h), data generation (Fan et al., 2023), prompt editing (Zhu et al., 2023c) and evaluation (Lu et al., 2023b) while developing multimodal AI systems. Mei et al. (2023) used ChatGPT to rewrite noisy audio captions and developed WavCaps, an audio captions dataset of 400k instances. The authors reported that models trained on the WavCaps dataset achieve new SOTA results. Zhang et al. (2023h) developed SpeechGPT and then performed cross-modal instruction tuning to enhance its multi-modal instruction-following ability. Here, the authors use GPT-4 to generate the instructions for diverse tasks. Fan et al. (2023) proposed LaCLIP (Language augmented Contrastive Language-Image Pretraining), an extended version of CLIP which applies data augmentation to both text and image data to ensure that the model gets exposed to diversified texts during training. Here, the data augmentation is performed using the open-source LLaMA model in few-shot settings, and the examples for LLaMA ICL are generated using ChatGPT. Zhu et al. (2023c) explored GPT-3 and GPT-3.5 models for prompt editing in text-to-image generation. The authors observed a potential reduction of 20%–30% in the remaining edits required by implementing the prompt edits suggested by GPT-3 family models. Lu et al. (2023b) proposed LLMScore, a new metric which can effectively capture both image-level and object-level compositionality for text-to-image generation evaluation.

4.11. Machine learning tasks

Overview. Machine learning (ML) is an area of artificial intelligence (AI) that deals with the development of algorithms that can learn from data and make decisions (Zhang et al., 2023j). Even though machine learning algorithms are successfully used in various real-world applications, creating an effective ML solution for a new task can be difficult due to the numerous design choices involved. In recent times, AutoML has evolved as a solution to reduce the human effort involved in designing ML solutions (Hutter et al., 2019). However, AutoML algorithms suffer from various drawbacks (Zhang et al., 2023j), like (i) the requirement of multiple rounds of trial-and-error, resulting in significant time consumption, (ii) starting the search for a new task from scratch, ignoring past experience gained from previous tasks, and (iii) a lack of interpretability in many AutoML methods because of their black-box nature.

Research works exploring GLLMs to automate machine learning tasks. Inspired by the success of GLLMs in other tasks, the research community explored GLLMs as an alternative to AutoML to automate machine learning tasks (Zheng et al., 2023c; Shen et al., 2023b; Zhang et al., 2023j,d). Table 11 presents a summary of research works exploring GLLMs to automate machine learning tasks. Zheng et al. (2023c) explored how effective GPT-4 is for neural architecture search, i.e., designing optimal neural network configurations. The proposed approach involves two steps: (i) GPT-4 generates a candidate neural architecture based on the given problem statement, and (ii) the generated configuration is evaluated, and for further refinement, the evaluation results along with the problem statement are passed back to the model. This two-step process is repeated for a certain number of iterations to achieve the optimal configuration.
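The propose-evaluate-refine loop described above can be sketched in a few lines. This is only an illustration of the idea, not the implementation of Zheng et al. (2023c): `call_llm` and `train_and_evaluate` are hypothetical placeholders, and the prompt wording is invented.

```python
# Illustrative LLM-in-the-loop neural architecture search: propose a
# configuration, evaluate it, and feed the score back for refinement.
# All helpers and prompts are hypothetical placeholders.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a GPT-4 call returning a config description

def train_and_evaluate(config: str) -> float:
    raise NotImplementedError  # placeholder: train the described model, return validation accuracy

def architecture_search(problem_statement: str, iterations: int = 5) -> tuple[str, float]:
    history: list[tuple[str, float]] = []   # (configuration, score) pairs shown back to the model
    best = ("", 0.0)
    for _ in range(iterations):
        feedback = "\n".join(f"Config: {c}\nValidation accuracy: {s:.3f}" for c, s in history)
        prompt = (
            f"Task: {problem_statement}\n"
            f"Previously tried configurations and their scores:\n{feedback or 'none yet'}\n"
            "Propose one improved neural network configuration."
        )
        config = call_llm(prompt)
        score = train_and_evaluate(config)
        history.append((config, score))
        if score > best[1]:
            best = (config, score)
    return best
```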
Shen et al. (2023b) proposed HuggingGPT to solve AI tasks with the help of GLLMs like ChatGPT and models in AI communities like Hugging Face. HuggingGPT involves four steps, namely task planning, model selection, task execution and
Table 11
Summary of research works exploring GLLMs to automate machine learning tasks. Here ZS represents zero-shot, and FS represents few-shot.
Paper Task(s) GLLMs explored Prompt settings Language(s)
Zheng et al. (2023c) Neural architecture search GPT-4 ZS English
Shen et al. (2023b) Multiple AI tasks in language, speech and vision areas GPT-3.5, ChatGPT, GPT-4 FS English
Zhang et al. (2023j) Machine learning tasks GPT-3.5 FS English
Zhang et al. (2023d) Machine learning tasks GPT-4 FS English
Table 12
Summary of research works exploring GLLMs for planning. Here ZS represents zero-shot, and FS represents few-shot.
Paper Task(s) GLLMs explored Prompt settings Language(s) SOTA results
Olmo et al. (2021) Plan extraction GPT-3 FS English Yes
Zhang and Soh (2023) Planning in human–robot interaction GPT-3.5 ZS English No
Xie et al. (2023b) Plan extraction GPT-3.5 FS English No
Hu et al. (2023b) Planning InstructGPT, ChatGPT FS English No
response generation. The authors reported that HuggingGPT achieves promising results in solving AI tasks in language, vision and speech.
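The four-step flow (task planning, model selection, task execution and response generation) can be caricatured with a small controller loop. The sketch below is a simplified illustration under several assumptions: `call_llm` is a hypothetical GLLM wrapper, the planner is assumed to return JSON, and `MODEL_REGISTRY` stands in for the pool of expert models that HuggingGPT selects from Hugging Face.

```python
# Illustrative HuggingGPT-style controller: the LLM plans sub-tasks, expert
# models are selected and executed, and the LLM composes the final response.
# All helpers, the registry and the prompts are hypothetical placeholders.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a ChatGPT/GPT-4 call

MODEL_REGISTRY = {  # assumed mapping from sub-task type to an expert model callable
    "image-captioning": lambda inputs: "a generated caption",
    "text-to-speech": lambda inputs: b"audio bytes",
}

def solve_request(user_request: str) -> str:
    # Step 1: task planning - decompose the request into structured sub-tasks.
    plan = json.loads(call_llm(
        "Decompose the request into a JSON list of sub-tasks, each with "
        f"'task' and 'inputs' fields: {user_request}"
    ))
    # Steps 2 and 3: model selection and task execution for each sub-task.
    results = []
    for step in plan:
        expert = MODEL_REGISTRY[step["task"]]
        results.append({"task": step["task"], "output": str(expert(step["inputs"]))})
    # Step 4: response generation - summarise the intermediate outputs.
    return call_llm(
        f"User request: {user_request}\n"
        f"Intermediate results: {json.dumps(results)}\n"
        "Compose the final answer for the user."
    )
```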
Zhang et al. (2023j) proposed MLCopilot, which leverages the power of GLLMs to solve machine learning tasks. MLCopilot works in two stages, namely offline and online. The offline stage involves creating an experience pool from which the GLLM is used to retrieve relevant knowledge. The online stage involves retrieving relevant examples from the experience pool, and then the GLLM generates results based on the task description, relevant examples and knowledge. Zhang et al. (2023d) proposed AutoML-GPT, which leverages the advanced GPT-4 model to automate machine learning tasks and reduce the human effort involved in building machine learning models. AutoML-GPT involves two stages. The first stage involves composing a prompt paragraph based on the model and data cards. The second stage involves performing the four crucial steps from data processing to training log prediction.

4.12. Planning

Overview. Many important industries, like finance and banking, often involve repetitive sequential tasks. These workflows, despite their significance, are typically not fully automated or formally defined. Recently, due to their strong reasoning capabilities, the research community explored GLLMs for planning. Some of the research works (Zhang and Soh, 2023; Hu et al., 2023b) directly used LLMs for planning, while some of them (Olmo et al., 2021; Xie et al., 2023b) explored LLMs for plan extraction, which can then be used by automated systems.

Research works exploring GLLMs for planning. Table 12 presents a summary of research works exploring GLLMs for planning. Human models are crucial in facilitating human–robot interaction (HRI), as they empower robots to plan their behaviour based on the impact of their actions on individuals. As it is difficult to craft good human labels, Zhang and Soh (2023) used the GPT-3.5 model (i) as a zero-shot human model and (ii) for planning in trust-related scenarios. Hu et al. (2023b) proposed a novel prompting strategy called ‘‘Chain of Symbol’’ (CoS) prompting to better elicit the planning abilities of LLMs like InstructGPT and ChatGPT. Unlike CoT prompting, which uses natural language descriptions to represent complex environments, CoS prompting uses condensed symbols to represent them in intermediate reasoning steps. The authors reported that CoS prompting outperforms CoT prompting in both performance and efficiency.
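The difference between the two prompting styles can be illustrated with a toy block-stacking step; the notation and wording below are invented for illustration and are not taken from Hu et al. (2023b).

```python
# Toy contrast between Chain-of-Thought (CoT) and Chain-of-Symbol (CoS)
# demonstrations for a spatial planning step; the notation is illustrative only.

cot_demo = (
    "The blue block is on the red block, and the red block is on the table. "
    "To move the red block, first move the blue block onto the table, "
    "then move the red block."
)

# CoS replaces the verbose environment description with condensed symbols,
# which shortens both the prompt and the intermediate reasoning steps.
cos_demo = "table/red/blue -> move(blue, table); move(red, target)"

prompt = (
    "Plan the moves for the block-stacking task using this reasoning format:\n"
    f"{cos_demo}\n\n"
    "Task: place the green block on the red block.\nReasoning:"
)
print(prompt)  # the prompt would then be sent to the GLLM
```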
There are usually natural language documents that describe the procedures for a company's employees. Plan extraction methods offer the opportunity to extract structured plans from these natural language descriptions of workflows (Araci, 2019; Chalkidis et al., 2020). These extracted plans can then be used by automated systems. Olmo et al. (2021) explored the GPT-3 model for plan extraction in few-shot settings from the natural language descriptions of workflows and showed that the GPT-3 model outperforms existing SOTA models in some cases. Xie et al. (2023b) explored GPT-3.5 models to extract plans from natural language descriptions. The authors reported that the models are poor planners on their own, which is in line with the existing works (Valmeekam et al., 2022; Collins et al., 2022; Mahowald et al., 2023), and are better at extracting plans from natural language. However, these models are sensitive to prompts and also struggle in the case of tasks involving spatial or numerical reasoning.

5. Performance of GLLMs in specific domains

Apart from the general domain, natural language processing is also explored in specific domains like healthcare, finance, legal, social media, etc. Analysing domain-specific texts is more challenging because of domain-specific terminology and abbreviations, complex language structures, etc. In domains like healthcare, finance and legal, domain experts use many words and abbreviations that are specific to the domain and not commonly found in general domain texts. In domains like social media, the texts are mostly authored by the general public using informal language and slang words. Moreover, social media texts are noisy, with many misspelt words, emojis, irregular grammar and abbreviations (Kalyan and Sangeetha, 2020a,b).

Inspired by the success of PLMs like BERT, RoBERTa, ELECTRA, DeBERTa and T5 in the general domain, these models are also explored for domain-specific NLP tasks (Kalyan et al., 2021). However, the performance of general domain models is limited as these models are pretrained on general domain texts (Yang et al., 2020a; Lee et al., 2020), and fine-tuning alone cannot provide enough domain knowledge (Kalyan et al., 2021). So, the research community focused on developing domain-specific PLMs either by continual pretraining or pretraining from scratch (Kalyan et al., 2021, 2022). Currently, domain-specific PLMs achieve state-of-the-art results in most tasks in specific domains like healthcare, finance, legal, social media, etc.

GPT-3 family large language models achieve impressive performances in most NLP tasks in zero and few-shot settings in the general domain. Surprisingly, these models outperform fine-tuned PLMs in some tasks and achieve state-of-the-art results (Sun et al., 2023b; Xu et al., 2023e; Wan et al., 2023; Ma et al., 2023a). Inspired by the massive success of GLLMs in the general domain, the research community explored GLLMs in specific domains to assess how good these models are in domain-specific NLP tasks. Moreover, an extensive evaluation of these models in domain-specific tasks helps to arrive at valuable insights that will guide the research community to improve the performance further and increase the usage of these models in domain-specific NLP tasks.

5.1. Healthcare domain

The recent works explored GLLMs for a variety of clinical NLP tasks like question answering (Holmes et al., 2023b; Nori et al., 2023b; Tanaka et al., 2023a; Liu et al., 2023n; Kasai et al., 2023; Moradi et al., 2021; Singhal et al., 2023b; Wang et al., 2023a; Hernandez et al., 2023;
Table 13
Summary of research works exploring GLLMs for various NLP tasks in the healthcare domain. Here ZS represents zero-shot, and FS represents few-shot. Here ‘–’ represents there
is no comparison between GLLMs and domain-specific PLMs in the paper.
Paper GLLMs explored Task(s) Prompt Language(s) Outperforms
settings domain-specific
models
Holmes et al. ChatGPT, GPT-4 Question answering ZS English –
(2023b)
Liu et al. (2023l) ChatGPT, GPT-4 Text de-identification ZS English Yes
Giorgi et al. GPT-4 Dialogue summarization FS English Yes
(2023)
Nori et al. GPT-3.5, ChatGPT, GPT-4 Question answering ZS, FS English Yes
(2023b)
Chen et al. GPT-3.5, GPT-4 Named entity recognition, relation extraction, ZS, FS English Yes
(2023a) document classification and semantic similarity
Tanaka et al. GPT-3.5, ChatGPT Question answering ZS Japanese –
(2023a)
Liu et al. GPT-3.5, GPT-4 Question answering, reasoning ZS Chinese Yes
(2023n)
Yang et al. GPT-3 Text simplification FS English –
(2023b)
Gutiérrez et al. GPT-3 Entity extraction, relation classification FS English No
(2022)
Wu et al. ChatGPT, GPT-4 Natural language inference ZS, FS English –
(2023c)
Ma et al. ChatGPT Text summarization FS English Yes
(2023b)
Wang et al. GPT3.5, GPT4 Natural language inference, document ZS, FS English –
(2023n) classification
Kasai et al. GPT-3, ChatGPT, GPT-4 Question answering FS Japanese –
(2023)
Moradi et al. GPT-3 Natural language inference, relation FS English No
(2021) classification, semantic similarity, question
answering, text classification
Jeblick et al. ChatGPT Text simplification ZS English –
(2022)
Tang et al. GPT-3, GPT-4 Dialogue summarization FS English –
(2023c)
Agrawal et al. GPT-3 Clinical sense disambiguation, biomedical ZS, FS English –
(2022) evidence extraction, coreference resolution,
medication status extraction, medication
attribute extraction
Nair et al. GPT-3 Dialogue summarization ZS, FS English –
(2023)
Shaib et al. GPT-3 Text summarization ZS, FS English –
(2023)
Xu et al. (2023a) ChatGPT Multi-turn medical dialogue ZS Chinese No
Singhal et al. GPT-4 Question answering FS English No
(2023b)
Wang et al. ChatGPT Question answering ZS Chinese –
(2023a)
Carpenter and GPT-3 Synonym generation ZS English –
Altman (2023)
Hernandez et al. GPT-3 Natural language inference, question answering, ZS English No
(2023) text classification
Rao et al. (2023) ChatGPT Clinical decision support ZS English –
Kung et al. ChatGPT Question answering ZS English –
(2023)
Hulman et al. ChatGPT Question answering ZS English –
(2023)
Hirosawa et al. ChatGPT Diagnosis lists generation ZS English –
(2023)
Liu et al. (2023i) ChatGPT Clinical decision support ZS English –
Gilson et al. GPT-3, GPT-3.5, ChatGPT Question answering ZS English –
(2023)
Antaki et al. ChatGPT Question answering ZS English –
(2023)
Lyu et al. (2023a) ChatGPT, GPT-4 Text simplification ZS English –
Kung et al., 2023; Hulman et al., 2023; Gilson et al., 2023; Antaki et al., 2023), text de-identification (Liu et al., 2023l), dialogue summarization (Giorgi et al., 2023; Tang et al., 2023c; Nair et al., 2023), named entity recognition (Chen et al., 2023a; Gutiérrez et al., 2022), relation extraction (Chen et al., 2023a), text classification (Chen et al., 2023a; Wang et al., 2023n; Moradi et al., 2021; Hernandez et al., 2023), semantic similarity (Chen et al., 2023a; Moradi et al., 2021), text simplification (Yang et al., 2023b; Jeblick et al., 2022; Lyu et al., 2023a), relation classification (Gutiérrez et al., 2022; Moradi et al., 2021), text summarization (Ma et al., 2023b; Shaib et al., 2023), natural language inference (Wu et al., 2023c; Wang et al., 2023n; Moradi et al., 2021; Hernandez et al., 2023), word sense disambiguation (Agrawal et al., 2022), biomedical evidence extraction (Agrawal et al., 2022), coreference resolution (Agrawal et al., 2022), medication status extraction (Agrawal et al., 2022), medication attribute extraction (Agrawal et al., 2022), synonym generation (Carpenter and Altman, 2023), clinical decision support (Rao et al., 2023; Liu et al., 2023i) and diagnosis lists generation (Hirosawa et al., 2023). Most of the research focused on English datasets, except a few works focused on other languages like Japanese (Tanaka et al., 2023a; Kasai et al., 2023) and Chinese (Liu et al., 2023n; Xu et al., 2023a; Wang et al., 2023a). Table 13 presents a summary of research works exploring GLLMs for various NLP tasks in the healthcare domain.

Lyu et al. (2023a) investigated the performance of ChatGPT and GPT-4 models in the healthcare domain, specifically the radiology area, by evaluating their ability to simplify the content in radiology reports. Experiment results showed that (i) GPT-4 performs better than ChatGPT, and (ii) an optimized prompt with detailed instructions improves the performance of both models by a good margin. Antaki et al. (2023) evaluated the effectiveness of ChatGPT in answering Ophthalmology questions. The test set consists of both easy and moderate-level questions. Experiment results showed that ChatGPT achieves an average accuracy of 49.25%. Specifically, ChatGPT is able to answer general medicine questions with good accuracy. However, its performance in specific sub-areas of Ophthalmology is much worse. Gilson et al. (2023) evaluated GLLMs like GPT-3, GPT-3.5 and ChatGPT in answering medical questions from the Step 1 and Step 2 exams of the USMLE. Experiment results showed that ChatGPT outperforms the other two models by a good margin. Rao et al. (2023) demonstrated that ChatGPT performs better in the final diagnosis than in the initial diagnosis. This is because ChatGPT has access to more clinical data during the final diagnosis than the initial one.

Carpenter and Altman (2023) demonstrated that GPT-3 can be used for synonym generation for drugs of abuse. The authors query GPT-3 repeatedly for each drug to generate multiple synonyms, which are later filtered. The generated synonyms are then used to build a lexicon that is helpful for pharmacovigilance on social media platforms.

Inspired by the success of the GPT-3 model for text summarization in the general domain, Shaib et al. (2023) explored the GPT-3 model for summarizing biomedical documents. Experiment results revealed that (i) GPT-3 performance is promising in the case of single document summarization and (ii) GPT-3 struggles to summarize the content from multiple biomedical documents. Nair et al. (2023) proposed a novel approach called ‘‘MEDSUM-ENT’’, a multi-stage framework for clinical dialogue summarization. The proposed method leverages the GPT-3 model through multiple intermediate calls to extract medical entities from the conversations. In the final step of summarization, the extracted entities, task instructions and in-context examples help the GPT-3 model to generate high-quality summaries. Based on the evaluation of radiology reports simplified by ChatGPT, Jeblick et al. (2022) reported that ChatGPT-generated simplified radiology reports are generally factually correct, complete and not potentially harmful. However, further analysis reveals that some simplified reports contain factually incorrect sentences, potentially harmful paragraphs and a lack of essential medical findings.

Hirosawa et al. (2023) investigated the effectiveness of ChatGPT for clinical diagnosis by evaluating its ability to generate accurate diagnosis lists for clinical vignettes with common chief complaints. Experimental results showed that ChatGPT can generate diagnosis lists with good accuracy. However, the accuracy rate of ChatGPT is still less than the accuracy rate of physicians. Wang et al. (2023a) evaluated the performance of the ChatGPT model in answering medical questions in the Chinese language. Here, ChatGPT is prompted with questions in both English and Chinese to avoid language barriers. Experimental results show that the performance of ChatGPT is much lower than the average performance of medical students. For example, ChatGPT correctly answers 45.8% of questions, while the average answering rate of medical students was 67.9% in 2021.

Some of the research works demonstrated that domain-specific PLMs outperform GLLMs. Hernandez et al. (2023) compared the performance of the GPT-3 model with the performances of general and domain-specific PLMs on three healthcare NLP tasks: natural language inference, question answering and text classification. Experiment results showed that domain-specific PLMs achieve better results even though they are much smaller than GPT-3. Xu et al. (2023a) introduced MedGPTEval, a benchmark to assess LLMs in the healthcare domain. An extensive evaluation showed that a domain-specific Chinese LLM outperforms general-purpose models like ChatGPT and ERNIE Bot. Singhal et al. (2023b) introduced MedPaLM2, a healthcare domain-specific LLM obtained by domain-specific finetuning of the PaLM2 (Anil et al., 2023) model. Experiment results showed that MedPaLM2 outperforms few-shot GPT-4 and achieves new state-of-the-art results on the MultiMedQA benchmark. Moradi et al. (2021) investigated the performances of BioBERT and GPT-3 in few-shot settings on five biomedical NLP tasks: text classification, natural language inference, question answering, relation extraction and semantic similarity. The authors observed that BioBERT and GPT-3 models underperform the model fine-tuned using full training data. Moreover, the BioBERT model outperforms GPT-3 in few-shot settings even though the BioBERT model is 514 times smaller than GPT-3.

Some research works showed that GLLMs can outperform domain-specific PLMs. Ma et al. (2023b) proposed ImpressionGPT, a novel approach for summarizing radiology reports using ChatGPT. The proposed method involves dynamic prompt construction and iterative optimization to enhance the performance of ChatGPT further. Evaluation on two standard datasets showed that the proposed framework achieves new SOTA results, outperforming fine-tuned models like ChestXrayBERT (Cai et al., 2021). Liu et al. (2023n) introduced CMExam, a dataset with 60k+ multiple-choice medical questions in the Chinese language, and evaluated GLLMs like GPT-3.5 and GPT-4 on answer prediction and answer reasoning tasks. The authors observed that GPT-4 achieves the best results for both tasks, outperforming GPT-3.5 and medical domain-specific Chinese LLMs like Huatuo (Antaki et al., 2023) and DoctorGLM (Xiong et al., 2023). Chen et al. (2023a) explored GLLMs like GPT-3.5 and GPT-4 on eight datasets spanning four tasks in zero and few-shot settings. The authors observed that fine-tuned PubMedBERT outperforms both GLLMs in all the biomedical tasks except question answering. In the case of biomedical question answering, GPT-4 outperforms the fine-tuned PubMedBERT model by a large margin of 17%.

Giorgi et al. (2023) explored models like Longformer Encoder–Decoder (LED) (Beltagy et al., 2020) based on supervised fine-tuning and GLLMs like GPT-4 based on few-shot ICL for clinical dialogue summarization as a part of the MEDIQA-Chat 2023 (Abacha et al., 2023) shared task. Here, the authors used Instructor (Su et al., 2022) to select the most similar examples for few-shot ICL. Experiment results based on automatic metrics like BERTScore and ROUGE demonstrated that GPT-4 not only outperforms the LED model but also achieves first rank in the shared task. For medical text de-identification, Liu et al. (2023l) proposed a novel approach called ‘‘DeID-GPT’’, a two-step approach based on GLLMs. In the first step, the HIPAA identifiers are included in the prompt. In the second step, the GLLM receives the prompt and the medical record, based on which it generates the de-identified medical record with the personal information masked. The authors observed that GPT-4 outperforms not only ChatGPT but also fine-tuned models based on BERT, RoBERTa and ClinicalBERT.
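The two steps described above can be read as building a single masking prompt: the HIPAA identifier categories are enumerated first, and the medical record is then appended for the model to rewrite. A minimal sketch is shown below, assuming a hypothetical `call_llm` wrapper and an illustrative (incomplete) category list; this is not the DeID-GPT implementation.

```python
# Minimal sketch of a DeID-GPT-style de-identification prompt: the HIPAA
# identifier categories are listed, then the model is asked to mask them in the
# medical record. The helper and the category list are illustrative placeholders.

HIPAA_IDENTIFIERS = [
    "names", "dates", "phone numbers", "addresses",
    "medical record numbers", "social security numbers",
]  # a subset, for illustration only

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a ChatGPT/GPT-4 call

def deidentify(medical_record: str) -> str:
    categories = ", ".join(HIPAA_IDENTIFIERS)
    prompt = (
        "Step 1: the following categories of protected health information must be removed: "
        f"{categories}.\n"
        "Step 2: rewrite the medical record below, replacing every such item with [REDACTED] "
        "and keeping all other content unchanged.\n\n"
        f"Medical record:\n{medical_record}"
    )
    return call_llm(prompt)
```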
Table 14
Summary of research works exploring GLLMs for various NLP tasks in the legal domain. Here ZS represents zero-shot, and FS represents few-shot.
Here ‘–’ represents there is no comparison between GLLMs and domain-specific PLMs in the paper.
Paper GLLMs explored Task(s) Prompt settings Language(s) Outperforms
domain-specific models
Yu et al. (2022) GPT-3 Natural language inference ZS, FS English –
Bommarito and Katz (2022) GPT-3.5 Question answering ZS English –
Nguyen (2023) GPT-3 Question answering, text generation ZS English –
Chalkidis (2023) ChatGPT Text classification ZS, FS English No
Choi et al. (2023) ChatGPT Question answering, text generation ZS English –
Table 15
Summary of research works exploring GLLMs for various NLP tasks in the finance domain. Here ZS represents zero-shot, and FS represents few-shot. Here ‘–’ represents there is
no comparison between GLLMs and domain-specific PLMs in the paper.
Paper GLLMs explored Task(s) Prompt Language(s) Outperforms
settings domain-specific models
Li et al. (2023k) ChatGPT, GPT-4 News headlines classification, financial ZS English Yes
sentiment analysis, named entity recognition,
question answering
Fatouros et al. (2023) ChatGPT Sentiment analysis ZS English Yes
Leippold (2023) GPT-3 Sentiment analysis ZS English No
Wiriyathammabhum (2022) GPT-3.5 Pairwise ranking FS Chinese –
Shah and Chava (2023) ChatGPT Sentiment analysis, claim detection, named ZS English No
entity recognition
Zhang et al. (2023b) ChatGPT, GPT-4 Question answering ZS, FS Chinese –
Rajpoot and Parikh (2023) ChatGPT, GPT-4 Relation extraction FS English –
Lan et al. (2023) ChatGPT Sentiment analysis ZS Chinese –
Loukas et al. (2023) GPT-3.5, GPT-4 Text classification ZS, FS English –
5.2. Legal domain

The recent works explored GLLMs for a variety of legal NLP tasks like natural language inference (Yu et al., 2022), question answering (Bommarito and Katz, 2022; Lan et al., 2023; Choi et al., 2023), text generation (Nguyen, 2023; Choi et al., 2023) and text classification (Chalkidis, 2023). Table 14 presents a summary of research works exploring GLLMs for various NLP tasks in the legal domain. Bommarito and Katz (2022) evaluated the performance of the GPT-3.5 model in the legal domain by evaluating its ability to answer bar exam questions. The model answers the questions correctly at a rate of 50%, which is 25% more than the random guess baseline. However, the model performance is almost 18% less than the human performance, and the overall model performance is below the passing threshold. Nguyen (2023) presented LawGPT 1.0, the first-ever chatbot model based on GPT-3 for the legal domain. The GPT-3 model is pretrained on a mostly generic corpus, so it lacks domain-specific knowledge. To add domain-specific knowledge, LawGPT is developed by fine-tuning the GPT-3 model on a law corpus. Experimental results showed that LawGPT 1.0 performs on par with existing legal assistants.

Chalkidis (2023) investigated how effective ChatGPT is for legal text classification by evaluating the model performance on the LexGLUE (Chalkidis et al., 2022) benchmark, which consists of seven legal text classification datasets. The evaluation is performed in both zero and few-shot settings. Experiment results showed that ChatGPT performs poorly on legal text classification datasets. Choi et al. (2023) demonstrated that the performance of ChatGPT is just above the passing threshold, i.e., equivalent to a C+ grade student. The authors found that advanced prompts like CoT (Wei et al., 2022b) and ranking prompts performed worse than or the same as simple prompts for multiple-choice questions. For essay writing, the authors used carefully crafted simple prompts by including specific instructions at the end of the prompt.

5.3. Finance domain

The recent works explored GLLMs for a variety of finance NLP tasks like text classification (Li et al., 2023k; Loukas et al., 2023), sentiment analysis (Li et al., 2023k; Leippold, 2023; Shah and Chava, 2023; Lan et al., 2023), named entity recognition (Li et al., 2023k; Shah and Chava, 2023), question answering (Li et al., 2023k; Zhang et al., 2023b), pairwise ranking (Wiriyathammabhum, 2022), claim detection (Shah and Chava, 2023) and relation extraction (Rajpoot and Parikh, 2023). Table 15 presents a summary of research works exploring GLLMs for various NLP tasks in the finance domain.

Li et al. (2023k) compared the performances of general LLMs like ChatGPT and GPT-4 in the finance domain with domain-specific models like BloombergGPT (Wu et al., 2023a) and small fine-tuned models like FinBERT (Araci, 2019) and FinQANet (Chen et al., 2021a). The evaluation is done on five different datasets related to four financial NLP tasks: news headlines classification, sentiment analysis, entity extraction and question answering. The ChatGPT and GPT-4 models do well in the question-answering task but lag behind in tasks requiring domain-specific knowledge like entity extraction and sentiment analysis. Fatouros et al. (2023) evaluated the effectiveness of ChatGPT for financial sentiment analysis by assessing its performance on a forex-related news headlines dataset. Experiment results showed that ChatGPT outperforms the domain-specific FinBERT (Liu et al., 2021a) model by a large margin of 35% and also exhibits a high correlation with market returns.

Leippold (2023) explored GPT-3 for financial sentiment analysis and to generate adversarial attacks. Experiment results showed that FinBERT outperforms keyword-based approaches and the few-shot GPT-3 model in financial sentiment analysis. To study the robustness of FinBERT-based and keyword-based approaches, the authors explored GPT-3 to generate adversarial attacks. The main advantage of GPT-3 over existing adversarial attack-generating methods is that the model makes more subtle changes to the instances such that they are not noticeable to humans but can still fool the models. Wiriyathammabhum (2022) explored instruction fine-tuned T5 and GPT-3.5 models to evaluate investment-related social media posts in Chinese. The task involves two subtasks, namely pairwise ranking and unsupervised ranking. Experiment results showed that the few-shot prompted GPT-3.5 model outperforms the instruction fine-tuned T5 model and the few-shot prompted GPT-3.5 model with English-translated social media posts.
Table 16
Summary of research works exploring GLLMs for NLP tasks in multilingual settings. Here, ZS represents zero-shot, and FS represents few-shot.
Paper GLLMs Task(s) Prompt Language(s) Domain(s)
explored settings
Lai et al. (2023a) ChatGPT PoS tagging, entity extraction, relation ZS 37 Languages General
extraction, natural language inference, question
answering, text summarization, common sense
reasoning
Fang et al. (2023b) ChatGPT Grammar error correction ZS, FS English, German, Chinese General
Armengol-Estapé et al. (2022) GPT-3 Question answering, natural language ZS German, Spanish, Russian, General
generation, text summarization Turkish, Catalan
Ahuja et al. (2023) GPT-3.5, Natural language inference, paraphrase ZS 70 languages General
ChatGPT, identification, commonsense reasoning,
GPT-4 question answering, parts of speech tagging,
sentiment analysis, text summarization
Bang et al. (2023) ChatGPT Sentiment analysis, language identification, ZS Multiple language General
machine translation including low resource
languages like Sudanese,
Javanese etc.
Kuzman et al. (2023) ChatGPT Genre identification ZS English, Slovenian General
Zhang et al. (2023g) ChatGPT Question answering, Reasoning ZS Six languages including General
Chinese, German and
French
Das et al. (2023) ChatGPT Hate speech detection ZS Eleven languages including Social media
Hindi, Arabic and Italian
Hada et al. (2023) GPT-4 Three text generation tasks ZS Ten languages including General
Chinese and Japanese.
Leong et al. (2023) ChatGPT, Question answering, sentiment analysis, text ZS, FS Indonesian, Vietnamese, General,
GPT-4 summarization, named entity recognition, Thai, Tamil Social Media,
toxicity detection, machine translation, natural News
language inference, casual reasoning
Shah and Chava (2023) compared the performance of ChatGPT with the performance of fine-tuned PLMs for three different financial NLP tasks: claim detection, sentiment analysis and named entity recognition. The authors observed that fine-tuned models outperform ChatGPT, but ChatGPT performs much better than some open-source LLMs. Zhang et al. (2023b) introduced FinEval, a new benchmark to evaluate the financial domain knowledge of LLMs in the Chinese language. FinEval includes 4661 multiple-choice questions in the Chinese language from four different categories spanning 34 academic subjects. Experiment results showed that GPT-4 achieves around 70% accuracy and outperforms all other LLMs, including ChatGPT and Chinese LLMs.

Rajpoot and Parikh (2023) assessed the effectiveness of ChatGPT and GPT-4 for financial relation extraction in few-shot settings. As the choice of examples is crucial in few-shot ICL, the authors explored learning-free and learning-based retrievers for example selection. The authors observed that GPT-4 outperforms ChatGPT by a decent margin, and the learning-based retriever performs better than the learning-free retriever.
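A learning-free retriever of the kind contrasted above typically selects demonstrations purely by embedding similarity to the test input. The sketch below illustrates that idea for relation extraction prompts; `embed` is a placeholder for any off-the-shelf sentence encoder, and the prompt format is invented rather than taken from Rajpoot and Parikh (2023).

```python
# Illustrative learning-free demonstration retriever for few-shot ICL: pick the
# k labelled examples most similar to the test sentence and build the prompt.
# `embed` is a placeholder for any sentence encoder; the format is invented.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError  # placeholder: returns a fixed-size embedding vector

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_demonstrations(test_sentence: str,
                          pool: list[tuple[str, str]],   # (sentence, gold relations)
                          k: int = 4) -> list[tuple[str, str]]:
    query = embed(test_sentence)
    scored = sorted(pool, key=lambda ex: cosine(query, embed(ex[0])), reverse=True)
    return scored[:k]

def build_prompt(test_sentence: str, demos: list[tuple[str, str]]) -> str:
    shots = "\n\n".join(f"Sentence: {s}\nRelations: {r}" for s, r in demos)
    return f"{shots}\n\nSentence: {test_sentence}\nRelations:"
```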
6. Multilingual performance of GLLMs

Overview. GLLMs are pretrained over large volumes of text data from multiple languages. For example, the corpus used to pretrain the GPT-3 model includes text from around 90 languages, and the percentage of English text is more than 90% (Brown et al., 2020; Ahuja et al., 2023). In the beginning, most of the research focused on assessing the performance of GLLMs on English datasets only. However, it is essential to evaluate these models on datasets from non-English languages, especially low-resource languages, to know how effective GLLMs are for non-English languages, and the insights gained from such a comprehensive evaluation help to further improve these models for non-English languages.

Research works exploring GLLMs in multilingual settings. Recently, some of the research works focused on evaluating GLLMs across various non-English languages. The evaluation is done on various tasks like parts of speech tagging (Lai et al., 2023a; Ahuja et al., 2023), named entity recognition (Lai et al., 2023a; Leong et al., 2023), relation extraction (Lai et al., 2023a), natural language inference (Lai et al., 2023a; Ahuja et al., 2023; Leong et al., 2023), question answering (Lai et al., 2023a; Armengol-Estapé et al., 2022; Ahuja et al., 2023; Zhang et al., 2023g; Leong et al., 2023), text summarization (Lai et al., 2023a; Armengol-Estapé et al., 2022; Ahuja et al., 2023; Leong et al., 2023), commonsense reasoning (Lai et al., 2023a; Ahuja et al., 2023), grammar error correction (Fang et al., 2023b), text generation (Armengol-Estapé et al., 2022; Hada et al., 2023), paraphrase identification (Ahuja et al., 2023), sentiment analysis (Ahuja et al., 2023; Bang et al., 2023; Leong et al., 2023), language identification (Bang et al., 2023), machine translation (Bang et al., 2023; Leong et al., 2023), genre identification (Kuzman et al., 2023), hate speech detection (Das et al., 2023) and toxicity detection (Leong et al., 2023). Most of the research focused on general domain datasets, except a few works focused on other domains like social media (Das et al., 2023; Leong et al., 2023) and news (Leong et al., 2023). Table 16 presents a summary of research works exploring GLLMs for NLP tasks in multilingual settings.

Bang et al. (2023) presented an extensive multilingual evaluation of ChatGPT across three tasks: sentiment analysis, language identification and machine translation. When compared to English, the performance of ChatGPT degrades in the case of low-resource languages, particularly in the case of languages with non-Latin scripts. Das et al. (2023) assessed the effectiveness of ChatGPT for emoji-based hate speech detection in multilingual settings. The authors reported that ChatGPT exhibits good performance but tends to misclassify abusive content as hate speech for non-English languages in the case of non-protected groups. Moreover, Armengol-Estapé et al. (2022) reported that the performance of GPT-3 can be improved in the case of low-resource languages with optimized tokenization.

The focus of existing benchmarks like HELM (Bommasani et al., 2023) and BIG-Bench (Srivastava et al., 2023) is on the English language. So, some of the research works focused on introducing new benchmarks to facilitate a systematic and comprehensive evaluation of the multilingual performance of GLLMs (Ahuja et al., 2023; Leong
Table 17
Summary of research works exploring GLLMs for data labelling. Here, ‘–’ represents that the paper does not include a comparison between GLLMs and human annotators.
Paper GLLMs explored Task(s) Prompt Domain(s) Language(s) Outperforms
settings human
annotators
Gilardi et al. (2023) ChatGPT Stance, relevance, frame and ZS Social media, English Yes
topics detection news
He et al. (2023b) GPT-3.5 Three binary text classification ZS, FS General English Yes
tasks
Törnberg (2023) GPT-4 Political tweets classification ZS Social media English Yes
Zhu et al. (2023e) ChatGPT Stance detection, sentiment ZS Social media English No
analysis, hate speech detection,
bot detection
Li et al. (2023c) ChatGPT Detection of hateful, toxic and ZS Social media English No
offensive comments
Gu et al. (2023) GPT-3.5, GPT-4 Adverse drug reaction extraction ZS, FS Healthcare English –
Wang et al. (2021b) GPT-3 Text entailment, topic ZS General English –
classification, sentiment analysis,
answer type classification,
question generation, text
generation
Ding et al. (2022) GPT-3 Sentiment analysis, relation FS General English –
extraction, named entity
recognition
Meoni et al. (2023) GPT-3.5 Named entity recognition ZS Healthcare English, French, –
Spanish, Italian,
Basque
Xu et al. (2023c) GPT-3.5 Text summarization ZS, FS General English –
Alizadeh et al. (2023) ChatGPT Detection of stance, topics, ZS, FS Social media, English Yes
relevance, general frame and news
policy frame
Yang et al. (2023b) GPT-3 Radiology text simplification FS Healthcare English –
et al., 2023). For example, Ahuja et al. (2023) presented MEGA, a comprehensive evaluation benchmark with 16 datasets covering 70 languages. Based on the evaluation of GLLMs like GPT-3.5, ChatGPT and GPT-4, the authors reported that GLLMs perform well in the case of languages with Latin scripts, and the performance is worst in the case of low-resource languages with non-Latin scripts across tasks. One of the possible reasons for this is the quality of tokenization. Similarly, Leong et al. (2023) introduced BHASA, a benchmark to evaluate the performance of LLMs in four Southeast Asian languages. The benchmark consists of 20 datasets covering eight NLP tasks. The authors reported that (i) GPT-4 achieves better results compared to ChatGPT, and (ii) overall, the performance on some of the tasks is promising, with a lot of room for improvement on other tasks.

Some of the existing works demonstrated that using prompts in English improves the performance of GLLMs in the case of non-English languages (Lai et al., 2023a; Kuzman et al., 2023). For example, Lai et al. (2023a) performed a comprehensive evaluation of the multilingual abilities of ChatGPT on seven tasks covering more than 30 languages ranging from high-resource to extremely low-resource languages. The experiment results confirmed the bias of ChatGPT towards the English language, i.e., the performance is better for English compared to other languages, and prompts in the English language can enhance the performance for non-English languages. The possible reason for the bias of GLLMs towards the English language is that GLLMs are trained mostly on English text corpora; hence, these models can better understand the prompt if it is in English (Kuzman et al., 2023).
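In practice, this simply means keeping the task instruction (and, if desired, the label space) in English while the input text stays in the source language. A minimal sketch with invented wording; `call_llm` is a hypothetical placeholder.

```python
# Minimal sketch of cross-lingual prompting: the instruction and label space are
# in English, while the test input remains in the target language.
# `call_llm` and the wording are hypothetical placeholders.

def call_llm(prompt: str) -> str:
    raise NotImplementedError

def classify_sentiment(text: str, language: str) -> str:
    prompt = (
        f"The following text is written in {language}. "
        "Classify its sentiment as positive, negative or neutral, "
        "and answer with a single English word.\n\n"
        f"Text: {text}\nSentiment:"
    )
    return call_llm(prompt)
```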
Some of the research works investigated how GLLMs exhibit multilingual capabilities (Zhang et al., 2023g) and how effective GLLM-based evaluators are in scaling up evaluation in multilingual settings (Hada et al., 2023). Zhang et al. (2023g) proposed a novel back-translation prompting approach to systematically study how ChatGPT exhibits multilingual capabilities, although these models are largely pretrained on English text corpora. The authors demonstrated that ChatGPT performs translation in multilingual settings. Moreover, the multilingual performance of GLLMs is good only in the case of tasks which can be translated. Hada et al. (2023) assessed the effectiveness of GPT-4 as an evaluator for natural language generation tasks in multilingual settings. The authors reported that GPT-4 tends to favour high scores and should be used carefully.

7. Data labelling and data augmentation abilities of GLLMs

7.1. Data labelling

Overview. Large language models, specifically GLLMs, have achieved impressive performances in most of the NLP tasks, highlighting the huge potential of these models. However, large model size, high latency, high inference costs, proprietary access (in the case of GLLMs) and confidentiality concerns in the case of sensitive domains like medicine (Meoni et al., 2023) have become bottlenecks for the practical use of these models. Because of these bottlenecks, in environments with constrained resources or confidentiality constraints, PLMs are preferred over GLLMs as these models are much smaller in size and also more efficient compared to GLLMs (Thapa et al., 2023). For example, BERT base and large models contain just 110M and 340M parameters, while the GPT-3 model contains 175B parameters. Moreover, it is reported that GLLMs trail the SOTA models, with 4% to 70% lower performance when evaluated across a set of 25 diverse natural language processing tasks (Kocoń et al., 2023).

The performance of fine-tuned PLMs is largely determined by the quality as well as the quantity of labelled data. Human-annotated data is considered the gold standard (Murthy et al., 2019; Van Atteveldt et al., 2021), and there are two strategies for obtaining it (Gilardi et al., 2023; Törnberg, 2023). The first one is using trained expert coders like students and research assistants, and the second one is using crowd workers from online platforms like Amazon Mechanical Turk. Although human-labelled data is considered the gold standard, the human annotation process is expensive, laborious and time-consuming. The second
The second strategy, i.e., using crowd workers, is comparatively less expensive, but there is a growing concern regarding the degrading annotation quality of crowd workers (Chmielewski and Kucker, 2020). Moreover, the annotation quality varies with annotators, and hence it is inconsistent. To address the challenges associated with the human annotation process, there is a growing interest in the NLP research community to leverage the extraordinary generative abilities of GLLMs to make the data annotation process less expensive, faster and consistent. Similar to the human annotation process, GLLMs are provided with detailed instructions along with some labelled examples to label the data.

Research exploring GLLMs for data labelling. The research community explored GLLMs for data labelling in a variety of NLP tasks like stance detection (Gilardi et al., 2023; Zhu et al., 2023e), political tweets classification (Törnberg, 2023), sentiment analysis (Zhu et al., 2023e; Wang et al., 2021b; Ding et al., 2022), hate speech detection (Zhu et al., 2023e; Li et al., 2023c), bot detection (Zhu et al., 2023e), toxic comments detection (Li et al., 2023c), offensive comments detection (Li et al., 2023c), adverse drug reaction extraction (Gu et al., 2023), text entailment (Wang et al., 2021b), topic classification (Wang et al., 2021b), text generation (Wang et al., 2021b), answer type classification (Wang et al., 2021b), question generation (Wang et al., 2021b), relation extraction (Ding et al., 2022), named entity recognition (Ding et al., 2022; Meoni et al., 2023), text summarization (Xu et al., 2023c), radiology text simplification (Yang et al., 2023b), etc. Most of the research works focused on English datasets, except a few research works which focused on other languages like French (Meoni et al., 2023), Spanish (Meoni et al., 2023), Italian (Meoni et al., 2023) and Basque (Meoni et al., 2023). Table 17 presents a summary of research works exploring GLLMs for data labelling.

Gu et al. (2023) labelled sentences from PubMed abstracts using the GPT-3.5 model and then fine-tuned the PubMedBERT model for adverse drug reaction extraction. Experiment results showed that (i) PubMedBERT achieves results comparable to the SOTA model and (ii) PubMedBERT outperforms the GPT-3.5 and GPT-4 models by large margins of 6 and 5 points in F1 score, respectively. Based on the evaluation of multiple NLU and NLG tasks, Wang et al. (2021b) demonstrated that GPT-3 labelled data can result in a 50 to 96% reduction in labelling expenses. Moreover, PLMs fine-tuned on GPT-3 labelled data outperform the few-shot GPT-3 model in both NLU and NLG tasks. Further, the authors proposed an approach based on active learning to make use of both human and GPT-3 labels, which further enhances the performance of the fine-tuned models. Meoni et al. (2023) investigated the effectiveness of GPT-3.5 labelled data and dictionary-based labelled data in fine-tuning PLMs to extract clinical entities in multiple languages like English, Spanish, Basque, Italian and French. The authors reported that (i) the performance of GPT-3.5 labelled data is on par with dictionary-based labelled data, and (ii) combining annotations from both approaches further enhances the results. Xu et al. (2023c) proposed InheritSumm, a novel approach for training small text summarization models like ZCode++ (He et al., 2022c) using GPT-3.5 generated summaries. The authors showed that the ZCode++ model with just 390M parameters trained using GPT-3.5 generated summaries performs on par with GPT-3.5 in zero and few-shot settings.

Zhu et al. (2023e) investigated how effective ChatGPT is for labelling data for social computing tasks. Based on the evaluation of five datasets spanning tasks like stance detection, hate speech detection, bot detection and sentiment analysis, the authors reported that ChatGPT achieves an average accuracy of 60.9. Li et al. (2023c) investigated the ability of ChatGPT to label hateful, offensive and toxic comments and compared the performances with MTurk annotations. The authors observed that ChatGPT performance is promising as it is able to label 80% of comments correctly. Moreover, the performance of ChatGPT is more consistent for non-harmful comments than harmful comments.

Some of the research works (Gilardi et al., 2023; He et al., 2023b; Törnberg, 2023; Alizadeh et al., 2023) showed that GLLMs as data annotators can outperform human annotators. Gilardi et al. (2023) investigated the effectiveness of ChatGPT as an annotator in zero-shot settings for four text classification tasks involving tweets and news articles. The authors reported that ChatGPT is more effective than MTurk crowd-workers as (i) ChatGPT achieves 25 points more than crowd-workers in terms of accuracy, (ii) ChatGPT is approximately 30 times cheaper, and (iii) the intercoder agreement of ChatGPT is higher than that of crowd-workers. He et al. (2023b) proposed a novel approach called "explain then annotate" to enhance the performance of GLLMs as text data annotators. The proposed approach involves two steps: (i) the GLLM generates explanations for the demonstrations and then (ii) annotates the data by leveraging annotation guidelines, demonstrations and explanations through CoT prompting. Evaluation on three binary text classification tasks revealed that GPT-3.5 outperforms crowd-workers on one task and matches the performance of crowd-workers on the other two tasks. Törnberg (2023) demonstrated that zero-shot GPT-4 outperforms human annotators in labelling political English tweets. Further analysis demonstrated that GPT-4 possesses the ability to accurately label tweets that involve logical reasoning from contextual information. Alizadeh et al. (2023) compared the performances of GLLMs like ChatGPT, open-source LLMs like FLAN (Chung et al., 2022) and MTurk annotators in labelling data (tweets and news articles) for five text classification tasks. The authors reported that ChatGPT achieves the best results, outperforming both open-source LLMs and MTurk annotators. One promising observation here is that open-source LLMs outperform MTurk annotators, and their performance is comparable to ChatGPT.

7.2. Data augmentation

Overview. The performance of downstream task-specific models is determined by the quality as well as the quantity of labelled data. Fine-tuning PLMs on a small amount of labelled data will result in overfitting (Kalyan et al., 2021) and, subsequently, poor performances. However, it is not feasible all the time to label a large number of instances as the annotation process is expensive. So, the research community focused on alternative approaches like data augmentation to increase the size of training sets in a relatively inexpensive way (Shorten and Khoshgoftaar, 2019; Li et al., 2022b; Liu et al., 2020b; Feng et al., 2021; Bayer et al., 2022). The data augmentation approaches focus on generating additional training instances either by making small changes to the existing instances or by creating new instances with a distribution similar to the existing instances.

Data augmentation was initially explored in the area of computer vision (Shorten and Khoshgoftaar, 2019) and then explored in natural language processing (Li et al., 2022b; Liu et al., 2020b; Feng et al., 2021; Bayer et al., 2022). When compared to computer vision, text data augmentation is more challenging because of the discrete nature of text. Data augmentation can be done at character, word and sentence levels. Character-level data augmentation approaches involve random deletion, addition, exchange or insertion of characters (Belinkov and Bisk, 2018; Coulombe, 2018). For example, in the case of keyboard augmentation, a random character is replaced with its neighbour based on the QWERTY layout (Belinkov and Bisk, 2018). Similar to character-level data augmentation, word-level data augmentation approaches involve deletion, replacement, exchange or insertion of words at random positions (Wei and Zou, 2019; Wang and Yang, 2015). Sentence-level approaches like back translation and paraphrasing generate augmented instances by rewriting the sentence (Sennrich et al., 2016; Mallikarjuna and Sivanesan, 2022). Overall, the main drawbacks of existing data augmentation approaches are that they (i) lack sufficient diversity in the augmented instances and (ii) often struggle to guarantee the accurate labelling of the augmented data (Dai et al., 2023b). To address these drawbacks, the research community focused on leveraging the exceptional generative abilities of GLLMs for data augmentation to ensure sufficient diversity and correct labelling in the augmented data.
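Most of the works summarised in Table 17 and in Tables 18 and 19 below share the same basic recipe: a natural-language instruction, optionally a few demonstrations, and the instance to be labelled or rewritten are packed into a single prompt. The following minimal sketch illustrates this recipe for zero-shot labelling (Section 7.1) and paraphrasing-based augmentation (Section 7.2); the openai client calls, the model name, the label set and the prompt wording are illustrative assumptions rather than the exact setup of any specific paper.

```python
# Illustrative sketch of GLLM-based data labelling and paraphrasing-based
# augmentation. Model name, label set and prompt wording are assumptions;
# published works differ in their exact instructions and demonstrations.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-3.5-turbo"  # placeholder GLLM

def ask(prompt: str) -> str:
    """Send a single-turn prompt to the GLLM and return the text response."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output is preferable for labelling
    )
    return response.choices[0].message.content.strip()

def label_instance(text: str, labels=("positive", "negative", "neutral")) -> str:
    """Zero-shot annotation: instruction plus unlabelled instance (Section 7.1)."""
    prompt = (
        "Classify the sentiment of the following text. "
        f"Answer with exactly one of: {', '.join(labels)}.\n\nText: {text}"
    )
    return ask(prompt)

def paraphrase_instance(text: str, n: int = 3) -> list[str]:
    """Paraphrasing-based augmentation: new instances keep the original label (Section 7.2)."""
    prompt = (
        f"Rewrite the following text in {n} different ways while preserving "
        f"its meaning. Return one paraphrase per line.\n\nText: {text}"
    )
    return [line.strip() for line in ask(prompt).splitlines() if line.strip()]

if __name__ == "__main__":
    example = "The delivery was quick but the packaging was damaged."
    print(label_instance(example))
    print(paraphrase_instance(example))
```

In practice, the instruction, the number of demonstrations and the post-processing of the model output are the main design choices, and they vary considerably across the works discussed above.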
Table 18
Summary of research works exploring GLLMs for paraphrasing-based data augmentation.
Paper GLLMs explored Task(s) Prompt settings Domain(s) Language(s)
Cegin et al. (2023) ChatGPT Intent classification ZS General English
Oh et al. (2023) ChatGPT Machine translation ZS General Korean, German
Sharma et al. (2022) GPT-3 Named entity recognition ZS News, social media, English
general, healthcare
Guo et al. (2023b) ChatGPT, Question answering ZS Healthcare English
GPT-4
Abaskohi et al. (2023) GPT-3 Text classification FS General English
Sarker et al. (2023) ChatGPT Medical event classification, ZS Healthcare English
medication identification
Parikh et al. (2023) GPT-3 Intent classification ZS Social media English
Dai et al. (2023b) ChatGPT Text classification ZS General, Healthcare English
Fang et al. (2023a) ChatGPT Open intent detection ZS General English
Table 19
Summary of research works exploring GLLMs for data generation-based data augmentation. Here ZS represents zero-shot and FS represents few-shot.
Paper GLLMs explored Task(s) Prompt settings Domain(s) Language(s)
Zhan et al. (2023b) ChatGPT Text classification ZS Social media Chinese
Wang et al. (2023m) ChatGPT Note2Dialogue generation ZS Healthcare English
Gunasekar et al. (2023) GPT-3.5 Training Phi-1 LLM ZS Programming English
Whitehouse et al. (2023) ChatGPT, Cross-lingual common sense FS General Multiple
GPT-4 reasoning languages
Hartvigsen et al. (2022) GPT-3 Hate speech detection FS Social media English
Markov et al. (2023) GPT-3 Undesired context detection ZS, FS Social media English
Guo et al. (2023c) ChatGPT, Question answering ZS Healthcare English
GPT-4
Parikh et al. (2023) GPT-3 Intent classification ZS General English
Eldan and Li (2023) GPT-3.5, Training smaller LLMs ZS General English
GPT-4
Xu et al. (2023e) GPT-3.5 Relation extraction FS General, scientific English
literature
Liu et al. (2023g) GPT-4 CoT instruction tuning FS General English
Peng et al. (2023c) GPT-4 Instruction tuning ZS General English, Chinese
Malkiel et al. (2023) GPT-3 Call segmentation, topic ZS Dialogue English
extraction
Wahle et al. (2022) GPT-3 Paraphrase detection ZS General, scientific English
literature
Michail et al. (2023) ChatGPT Tweet intimacy prediction FS Social media Multiple
languages
Tang et al. (2023a) ChatGPT Named entity recognition, ZS Healthcare English
Relation classification
Yu et al. (2023d) ChatGPT Topic classification ZS News, social media English
Yang and Nicolai (2023) ChatGPT Neural machine translation ZS General Multiple
languages
Zhao et al. (2023c) GPT-3, Codex Table question answering ZS General English
Xu et al. (2023b) GPT-4 Text generation evaluation ZS General Multiple
languages
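Many of the entries in Table 19 generate training instances from scratch rather than rewriting existing ones, and diversity is typically encouraged by varying attributes such as topic or style in the prompt, in the spirit of the attributed-prompt idea discussed below. The following hypothetical sketch shows this attribute-conditioned generation recipe; the label set, style attributes, prompt wording and model name are assumptions, not the configuration of any specific paper.

```python
# Hypothetical attribute-conditioned data generation for text classification
# augmentation: vary topic and style attributes in the prompt to increase diversity.
import itertools
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

LABELS = ["sports", "politics", "technology"]                 # assumed label set
STYLES = ["formal news report", "casual social media post"]   # assumed attributes

def generate_instances(label: str, style: str, n: int = 2) -> list[tuple[str, str]]:
    """Ask the GLLM for n new instances of the given class, written in the given style."""
    prompt = (
        f"Write {n} short, diverse texts about {label}, each in the style of a "
        f"{style}. Return one text per line, without numbering."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder GLLM
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,  # higher temperature encourages diverse generations
    )
    lines = response.choices[0].message.content.splitlines()
    # The requested label is attached directly to each generated instance.
    return [(line.strip(), label) for line in lines if line.strip()]

synthetic_data = []
for label, style in itertools.product(LABELS, STYLES):
    synthetic_data.extend(generate_instances(label, style))
```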
Yu et al. (2023d) proposed a novel approach that leverages attributed prompts for data generation to increase the diversity in the generated data. Based on the evaluation on four topic classification datasets, the authors observed that the proposed approach (i) enhances the model performance and (ii) reduces the querying cost of ChatGPT by a large margin.

Some of the research works explored GLLMs for data generation-based data augmentation in various information extraction tasks like relation extraction (Xu et al., 2023e), relation classification (Tang et al., 2023a) and named entity recognition (Tang et al., 2023a). Xu et al. (2023e) evaluated how effective the GPT-3.5 model is for relation classification. To address the data scarcity problem in few-shot settings, the authors used the GPT-3.5 model to generate additional data. The prompt used for data generation consists of instance descriptions along with some example instances. Tang et al. (2023a) used ChatGPT in zero-shot settings to generate synthetic data for tasks like named entity recognition and relation classification in the healthcare domain. The authors showed that the model fine-tuned on this synthetic data outperforms zero-shot ChatGPT by a large margin in both tasks.

Some of the research works explored GLLMs for data generation in LLM development stages, like LLM pretraining (Gunasekar et al., 2023; Eldan and Li, 2023) and instruction tuning (Liu et al., 2023g; Peng et al., 2023c). Gunasekar et al. (2023) trained Phi-1, a code LLM, using GPT-3.5 generated synthetic textbook and code data. Here, the training corpus includes 1B tokens of GPT-3.5 generated Python textbook and code data along with 6B tokens of code data from the web. Eldan and Li (2023) explored GLLMs like GPT-3.5 and GPT-4 to generate TinyStories, a synthetic dataset of stories with only the words understood by typical 3 to 4-year-old kids. The authors demonstrated that the GLLM generated dataset can be used to train smaller LLMs, which can generate coherent and consistent stories with near-perfect grammar. Instruction tuning requires large human-annotated datasets, which are often difficult to obtain. Stanford Alpaca (https://crfm.stanford.edu/2023/03/13/alpaca.html) and Vicuna (https://lmsys.org/blog/2023-03-30-vicuna/) showed the effectiveness of synthetic instruction tuning datasets generated using GPT-3.5 and ChatGPT, respectively. Inspired by the success of these models, Peng et al. (2023c) explored advanced models like GPT-4 to generate instruction-tuning datasets in the English and Chinese languages. The experiment results showed that GPT-4 generated instruction tuning datasets further enhance the zero-shot performance of LLaMA models. Liu et al. (2023g) used GPT-4 to generate LogiCoT, a synthetic dataset of CoT rationales. This dataset can be used for instruction tuning LLMs to enhance their logical reasoning abilities.

8. Detecting GLLM generated text

Overview. GLLMs demonstrated extraordinary human-like capabilities to understand user queries, follow the instructions and then answer the user queries with high-quality content. Apart from responding to user queries, these models can also generate news articles, research papers, code and essays with human-like fluency. With the ability to generate text with human-like fluency, these models are widely adopted in a variety of real-world applications like writing assistants, coding assistants, chatbots, etc. (Mireshghallah et al., 2023). Although there is a lot of excitement about GLLMs and their applications in recent times, there are also growing concerns regarding the potential misuse of these models for illegal activities (Guo et al., 2023d), such as
fake news on social media platforms (Hacker et al., 2023; De Angelis et al., 2023), fake reviews on e-commerce websites (Mitrović et al., 2023), fake research papers (Gao et al., 2023a), academic fraud (Cotton et al., 2023), etc. For example, these models can be easily used by malicious users to create fake news (Hacker et al., 2023; De Angelis et al., 2023) and propagate it on social platforms at a large scale to exaggerate or manipulate the facts to gain an undue advantage, especially during political campaigns. Similarly, students can use these models to write their assignments or generate code for their projects (Cotton et al., 2023), and GLLM generated fake research papers (Gao et al., 2023a) can have a serious impact on the scientific community as these papers are written without conducting any experiments.

There is a strong need for the development of approaches to detect GLLM generated text, as there are growing concerns regarding the misuse of GLLMs. Such approaches help to distinguish GLLM generated text from human-generated text and verify the source as well as the authenticity of the information. However, detecting GLLM generated text is challenging as models like ChatGPT and GPT-4 can generate content with human-like fluency.

Research exploring the detection of GLLM generated text. To avoid misuse and ensure the safe use of these models, the research community focused on developing approaches to identify GLLM generated text accurately. The recent research works explored the detection of GLLM generated text in multiple domains like scientific literature (Theocharopoulos et al., 2023; Zaitsu and Jin, 2023; Yu et al., 2023a; Yang et al., 2023a), academic (Liu et al., 2023m; Orenstrakh et al., 2023), healthcare (Liao et al., 2023; Zhan et al., 2023a; Yang et al., 2023a), news (Clark et al., 2021), legal (Zhan et al., 2023a; Guo et al., 2023d), social media (Yang et al., 2023a; Mitrović et al., 2023), finance (Guo et al., 2023d), etc. Most of the research works focused on the English language, while a few research works focused on other languages like Japanese (Zaitsu and Jin, 2023), German (Yang et al., 2023a) and Spanish (Orenstrakh et al., 2023). Table 20 presents a summary of research works exploring the detection of GLLM generated text.

Some of the research works focused on assessing the effectiveness of the existing machine-generated text detection tools to detect GLLM generated text. A number of online tools are available, ranging from simple classifiers based on logistic regression to advanced classifiers based on PLMs, to detect ChatGPT-generated text. To assess the effectiveness of these tools, Pegoraro et al. (2023) introduced a dataset having ChatGPT-generated responses for questions from various domains like finance, medicine, etc., and user-generated responses from social media platforms. The comprehensive evaluation showed that the maximum success rate of these tools is less than 50%, which leaves a lot of room for improvement. Orenstrakh et al. (2023) evaluated the effectiveness of eight popular detectors using three metrics, namely resilience, false positives and accuracy. The authors observed that CopyLeaks, GPTKit and GLTR achieve the best results for the metrics accuracy, false positives and resilience, respectively. However, all these detectors struggle with non-English languages and paraphrased LLM-generated text. There is a lack of a comprehensive evaluation benchmark for detecting machine-generated text, as the existing approaches use different models, datasets and settings. To address this, He et al. (2023c) proposed MGTBench, the first machine-generated text detection benchmark. Evaluation on this benchmark showed that, except for the ChatGPT detector (Guo et al., 2023d) and the LM detector (Ippolito et al., 2020), the performance of other detectors is not satisfactory. Guo et al. (2023d) introduced the HC3 dataset, having human-authored and ChatGPT-generated responses to questions from multiple domains like legal, healthcare, finance, psychology, etc. The performance of existing detection approaches on the HC3 dataset is just satisfactory, and linguistic analysis showed that human-authored answers are short in length but use a larger vocabulary compared to ChatGPT-generated answers.

Some of the research works focused on developing approaches based on trained classifier models to detect GLLM generated text. Theocharopoulos et al. (2023) evaluated the effectiveness of classifiers based on models like logistic regression, support vector machine, LSTM, and BERT to identify GPT-3 generated scientific abstracts. The LSTM-based classifier with word2vec embeddings achieves an accuracy of more than 98% and outperforms the other classifiers. Zaitsu and Jin (2023) observed that LLM-generated texts differ significantly from human-written texts in terms of stylometric features. The authors demonstrated that a random forest trained with different stylometric features can identify LLM-generated Japanese text with 100% accuracy. Liu et al. (2023m) reported that a fine-tuned RoBERTa model achieves an accuracy of more than 90% on the ArguGPT dataset of human-written and GLLM generated argumentative essays. Moreover, linguistic analysis revealed that GLLM generated texts tend to be more complex syntactically, while human-generated texts are lexically more complex. To facilitate the development of a ChatGPT-written abstract detector, Yu et al. (2023a) introduced CHEAT, a large dataset of ChatGPT-generated and human-written abstracts. Based on the evaluation of multiple existing approaches like ZeroGPT, the OpenAI detector, ChatGPT-detector-roberta (Guo et al., 2023d) and ChatGPT-qa-detector-roberta (Guo et al., 2023d), the authors reported that the performance is far from satisfactory and that human involvement further increases the detection difficulty. Zhan et al. (2023a) treated the detection of LLM generated text as a binary classification problem and proposed a novel approach based on a fine-tuned RoBERTa model. The authors reported that the proposed approach exhibits good performance and also has the ability to detect text generated using a detection evasion technique. Mitrović et al. (2023) proposed a novel approach based on DistilBERT (Sanh et al., 2019) and SHAP (Lundberg and Lee, 2017) to detect machine-generated text and explain the reasoning. The proposed approach achieves an accuracy of 79%, and based on the explanations, the authors observed that ChatGPT-generated text maintains a polite tone, lacks specific details and generally refrains from expressing emotions.

Chen et al. (2023d) introduced OpenGPTText, which includes ChatGPT-generated paraphrased text. The authors reported that fine-tuned classifiers based on models like RoBERTa and T5 can achieve impressive results in detecting ChatGPT-generated text, with an accuracy of more than 97%. Yu et al. (2023b) introduced GPT-Pat, a novel approach based on ChatGPT, a Siamese network and a binary classifier, to detect machine-generated text effectively. The proposed approach enhances the SOTA accuracy by more than 12% and also exhibits better robustness to attacks like re-translation and text polishing. Yang et al. (2023d) focused on detecting GLLM-polished text, which is more challenging and useful in real-world applications. The proposed approach involves training a classification model to identify the machine-generated text and a polish ratio (regression) model to explain the ChatGPT involvement. A polish ratio of 0.2 indicates ChatGPT involvement, and a value of more than 0.6 represents text that is entirely ChatGPT generated.

Training-based approaches to detect LLM-generated text have limited flexibility, especially when used for new domains (Yang et al., 2023a). To overcome this drawback, some of the research works focused on developing training-free approaches to detect GLLM generated text. Yang et al. (2023a) proposed DNA-GPT, a training-free approach based on divergent n-gram analysis. With the proposed approach, the authors achieved SOTA results on both English and German datasets. Wang et al. (2023g) proposed a novel framework called FLAIR to detect LLM-based bots with a single question in an effective way. The results showed that the proposed approach is effective and a good alternative to existing CAPTCHA-based approaches. Mireshghallah et al. (2023) investigated whether models other than the generator can be used to identify machine-generated text. In general, smaller models serve as more effective universal text detectors. These models exhibit better accuracy in identifying text produced by both small and larger models. For example, OPT-125M achieves better results compared to the GPT-J 6B model in detecting ChatGPT-generated text.

Some of the research works focused on assessing the robustness of machine-generated text detectors towards different attacks.
Table 20
Summary of research works exploring the detection of GLLM generated text.
Paper Detect Approach Satisfactory performance Training free Domain(s) Language(s)
Pegoraro et al. (2023) ChatGPT generated text Evaluate multiple online tools No – Multiple domains English
Theocharopoulos et al. (2023) GPT-3 generated text Classifiers based on machine Yes No Scientific literature English
learning models like LR, SVM and
deep learning models like LSTM
and BERT
Zaitsu and Jin (2023) ChatGPT and GPT-4 Classifier based on random forest Yes No Scientific literature Japanese
generated text and stylometric features
Liu et al. (2023m) GPT-3 and ChatGPT Classifier based on models like Yes No Academic English
generated text SVM and RoBERTa
Yu et al. (2023a) ChatGPT generated text Classifier based on models like No No Scientific literature English
RoBERTa
Liao et al. (2023) ChatGPT generated text Classifier based on models like Yes No Healthcare English
BERT
Orenstrakh et al. (2023) ChatGPT generated text Evaluate multiple online tools Yes – Academic English,
Spanish
Clark et al. (2021) GPT-3 generated text Evaluate human evaluators No – Stories, news, English
recipes
Zhan et al. (2023a) ChatGPT and GPT-4 Classifier based on models like Yes No Law, medical, English
generated text BERT and RoBERTa dialogue, general
Yang et al. (2023a) GPT-3.5, ChatGPT and Training free divergent N-gram Yes Yes Healthcare, social English,
GPT-4 generated text Analysis media, scientific German
literature
Shi et al. (2023) ChatGPT generated text Evaluate the robustness of No – General English
existing detectors
Khalil and Er (2023) ChatGPT generated text Evaluate existing plagiarism tools No – General English
He et al. (2023c) ChatGPT generated text Propose benchmark and evaluate Yes – General English
existing detectors
Mitrovi’c et al. (2023) ChatGPT generated text Propose novel approach based on Yes No Social media English
DistilBERT and SHAP to detect
and explain
Guo et al. (2023d) ChatGPT generated text Introduce new dataset and Yes – General, finance, English
evaluate multiple existing healthcare, legal,
detection models psychology
Wang et al. (2023g) GPT-3 and ChatGPT-based Propose FLAIR to detect online Yes Yes General English
bots GPT-3 and ChatGPT-based bots
Chen et al. (2023d) ChatGPT generated text Classifiers based on models like Yes No General English
RoBERTa and T5
Mireshghallah et al. (2023) ChatGPT generated text Propose a zero-shot approach Yes Yes General English
based on local optimality
Yu et al. (2023b) ChatGPT generated text Propose an approach based on Yes No General English
Siamese Network and binary
classifier
Yang et al. (2023d) ChatGPT polished text Trains classifier and polish ratio Yes No General English
models to detect and explain
Krishna et al. (2023) GPT-3.5 generated text Evaluate robustness using No – General English
paraphrase attacks
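Most of the trained detectors in Table 20 follow the same supervised recipe: fine-tune a pretrained encoder on paired human-written and GLLM generated texts for binary classification. The following sketch outlines this recipe with the Hugging Face transformers and datasets libraries; the toy in-memory examples, the choice of encoder and the hyperparameters are illustrative assumptions rather than the setup of any particular detector.

```python
# Illustrative sketch: fine-tune a small pretrained encoder as a binary
# "human vs. GLLM generated" text detector. Data, model and hyperparameters
# are placeholders, not the configuration of any specific published detector.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "distilbert-base-uncased"  # assumed encoder; RoBERTa is also common

# Toy labelled corpus: label 1 = GLLM generated, label 0 = human written.
examples = {
    "text": [
        "The results demonstrate a robust and consistent improvement overall.",
        "honestly the ending felt rushed, i wanted more from the last chapter",
    ],
    "label": [1, 0],
}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def tokenize(batch):
    # Convert raw text into fixed-length token id sequences for the encoder.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = Dataset.from_dict(examples).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="detector", num_train_epochs=1,
                           per_device_train_batch_size=2, logging_steps=1),
    train_dataset=dataset,
)
trainer.train()  # in practice, thousands of paired examples are needed
```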
Shi et al. (2023) evaluated the robustness of existing detectors using attacks like synonym word replacement and writing style modification. The authors implemented both attacks using LLMs. The results showed that the existing detectors are not robust to these attacks, which emphasizes the need for more robust and reliable detectors to detect and avoid the misuse of LLMs. Krishna et al. (2023) showed that existing detectors like the OpenAI detector, GPTZero and DetectGPT (Mitchell et al., 2023) are not robust to paraphrase attacks. For example, paraphrase attacks result in a drop of more than 65% accuracy in the case of DetectGPT.

Some of the research works focused on assessing the effectiveness of humans in identifying GLLM generated text. For example, Clark et al. (2021) observed that non-expert evaluators are unable to differentiate GPT-3 generated text from human-authored text in three different domains, namely news, recipes and stories. The reason for this is that the evaluators arrived at their decisions based on surface-level features without considering the advanced text generation capabilities of the GPT-3 model.

9. Evaluation of GLLMs

GLLMs, with their remarkable performances across a variety of tasks, gained a lot of attention in both industry and academia, which eventually led to their use in many real-world applications. GLLMs are double-edged swords, i.e., apart from remarkable performances, GLLMs are also associated with a lot of potential risks (Guo et al., 2023a). For example, GLLMs sometimes generate factually incorrect text, biased and harmful text, and also tend to leak private data. So, it is highly recommended to have a thorough evaluation of GLLMs to understand their limitations, which not only helps researchers to further improve them but also ensures their safe and reliable use (Guo et al., 2023a; Chang et al., 2023).

In recent times, several benchmarks have been proposed to assess the performance as well as understand the limitations of GLLMs across tasks and domains (Zhuang et al., 2023). A benchmark serves as a standardized method for evaluating a model's ability to generalize
Table 21
Summary of research works exploring GLLMs robustness to out-of-distribution instances, adversarial prompts and adversarial inputs. Here ZS represents zero-shot, and FS represents
few-shot.
Paper GLLMs explored Task(s) Prompt settings Robustness Domain(s) Language(s)
Chen et al. (2023g) GPT-3, GPT-3.5 Nine NLU tasks ZS, FS Adversarial input General English
Wang et al. (2023b) GPT-3.5, Four NLU tasks, machine ZS Out of distribution General, English
ChatGPT translation medical
Zhuo et al. (2023) Codex Semantic parsing ZS, FS Adversarial input Programming English
Zhu et al. (2023d) ChatGPT Eight tasks including four ZS, FS Adversarial prompt General English
NLU tasks
Shirafuji et al. (2023) Codex, Code generation ZS Adversarial prompt Programming English
InstructGPT,
ChatGPT
Zhao et al. (2023c) GPT-3, Codex Table question answering FS Adversarial input General English
Han et al. (2023) ChatGPT Fourteen IE tasks ZS, FS Adversarial prompt General English
Liu et al. (2023f) ChatGPT, GPT-4 Question answering ZS, FS Out-of-distribution General English
Liu et al. (2023c) ChatGPT Text-to-SQL generation ZS Adversarial input General English
across different tasks (Kalyan et al., 2021). Typically, it includes a collection of diverse and challenging datasets, an online leaderboard for model comparison and ranking, and a designated metric for assessing overall performance across tasks (Wang et al., 2018). The use of a benchmark is essential to have a consistent evaluation framework, enabling the tracking of progress in the development of LLMs. Without a benchmark, evaluating models lacks a standardized approach and is challenging. Tables 22 and 23 present a summary of various benchmarks which assess the abilities of GLLMs across different tasks and domains.

10. Robustness of GLLMs

Overview. GPT-3 family LLMs achieve impressive performances in zero and few-shot settings in many NLP tasks. In some tasks, like text classification (Sun et al., 2023b) and relation extraction (Wan et al., 2023), GLLMs without any explicit fine-tuning outperform state-of-the-art fine-tuned models. For example, Sun et al. (2023b) demonstrated that InstructGPT, with an advanced prompting strategy, achieves SOTA results using just 16 examples on four text classification datasets. Similarly, Wan et al. (2023) achieved SOTA results in relation extraction with the GPT-RE framework. However, to increase the reliability of these models in real-world applications, especially in critical domains like medicine, it is essential to systematically study the robustness of these models in various scenarios. Adversarial robustness refers to the model's ability to maintain good performance even in the case of deliberately crafted instances (Goyal et al., 2022; Qiu et al., 2022). These instances are called adversarial instances and are carefully designed by making subtle changes in the original inputs to deceive the model. Out-of-distribution (OOD) instances refer to examples that differ significantly from the data distribution used to train the model (Shen et al., 2021). These instances fall outside the range of the model's training data and present challenges to the model's performance and generalization ability. Some of the recent research works focused on evaluating the robustness of GLLMs to out-of-distribution instances (Wang et al., 2023b; Liu et al., 2023f), adversarial prompts (Zhu et al., 2023d; Shirafuji et al., 2023; Han et al., 2023) and adversarial inputs (Chen et al., 2023g; Zhuo et al., 2023; Zhao et al., 2023c; Liu et al., 2023c) in one or more natural language processing tasks. Table 21 presents a summary of research works assessing GLLMs' robustness to out-of-distribution instances, adversarial prompts and adversarial inputs.

Research works exploring GLLMs' robustness. Some of the research works evaluated the robustness of GLLMs in specific tasks like semantic parsing (Zhuo et al., 2023), code generation (Shirafuji et al., 2023), table question answering (Zhao et al., 2023c), multi-choice question answering (Liu et al., 2023f) and text-to-SQL generation (Liu et al., 2023c). Zhuo et al. (2023) reported that Codex-based semantic parsers are not robust to adversarial examples, and the robustness can be enhanced using few-shot in-context learning. Shirafuji et al. (2023) studied the robustness of GPT-3 family models like Codex, InstructGPT, and ChatGPT to adversarial prompts in the code generation task. The authors observed that InstructGPT and ChatGPT exhibit better robustness compared to Codex. However, there is much room for improvement, indicating that quality code generation requires well-designed prompts. Zhao et al. (2023c) proposed RobuT, a benchmark to systematically study the robustness of LLMs to adversarial inputs in table question answering. The authors reported that GLLMs like GPT-3 and Codex exhibit better robustness than fine-tuned models. Moreover, the authors demonstrated that GLLM generated adversarial inputs can enhance the adversarial robustness of fine-tuned models. Liu et al. (2023f) reported that ChatGPT and GPT-4 perform well in multiple choice question answering but struggle to answer out-of-distribution questions. Liu et al. (2023c) showed that ChatGPT exhibits impressive zero-shot performance in text-to-SQL generation. Moreover, ChatGPT demonstrates better robustness to adversarial inputs than SOTA models in text-to-SQL generation.

Some of the research works evaluated GLLM robustness in multiple natural language understanding and generation tasks (Chen et al., 2023g; Wang et al., 2023b; Zhu et al., 2023d; Han et al., 2023). Chen et al. (2023g) assessed the robustness of GPT-3 and GPT-3.5 models on 21 datasets covering nine natural language understanding tasks. Here, the authors used adversarial text transformations from TextFlint (Wang et al., 2021a). The authors observed that the models are robust in tasks like machine reading comprehension and exhibit performance degradation of more than 35% in tasks like sentiment analysis and natural language inference. Wang et al. (2023b) evaluated the robustness of GPT-3.5 and ChatGPT models on adversarial and out-of-distribution (OOD) samples on nine datasets covering four NLU tasks and machine translation. The authors observed that ChatGPT exhibits good performances on adversarial and OOD samples, but still, there is much room for improvement.

Zhu et al. (2023d) developed PromptBench, a benchmark with more than 4k adversarial prompts to evaluate the robustness of LLMs to adversarial prompts. The benchmark covers 13 datasets spanning eight tasks, including four NLU tasks. The authors observed that GLLMs are not robust to adversarial prompts. Moreover, word-level attacks are the most effective, resulting in a performance drop of more than 30%. Based on the evaluation of ChatGPT on fourteen information extraction sub-tasks, Han et al. (2023) showed that ChatGPT is vulnerable to adversarial prompts, i.e., the performance is greatly affected by including irrelevant context in the prompt.
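The robustness evaluations described above generally share a simple harness: perturb either the input or the prompt with character- or word-level edits and measure the drop in task performance relative to the clean setting. Below is a minimal sketch of such a harness; the evaluate_model() scoring function is a hypothetical stand-in for any GLLM call plus answer checking, and the two perturbation operators are illustrative assumptions rather than the attacks used in any specific benchmark.

```python
# Minimal robustness harness: apply simple character- and word-level
# perturbations to inputs and compare clean vs. perturbed accuracy.
# evaluate_model() is a hypothetical stand-in for calling a GLLM on the
# inputs and scoring its predictions against the labels.
import random
from typing import Callable, Sequence

random.seed(0)

def char_swap(text: str) -> str:
    """Swap two adjacent characters at a random position (character-level edit)."""
    if len(text) < 2:
        return text
    i = random.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def word_drop(text: str) -> str:
    """Delete one random word (word-level edit)."""
    words = text.split()
    if len(words) < 2:
        return text
    del words[random.randrange(len(words))]
    return " ".join(words)

def robustness_report(inputs: Sequence[str],
                      labels: Sequence[str],
                      evaluate_model: Callable[[Sequence[str], Sequence[str]], float]) -> dict:
    """Return clean and perturbed accuracy for each perturbation operator."""
    report = {"clean": evaluate_model(inputs, labels)}
    for name, attack in [("char_swap", char_swap), ("word_drop", word_drop)]:
        perturbed = [attack(x) for x in inputs]
        report[name] = evaluate_model(perturbed, labels)
    return report
```

The gap between the clean score and each perturbed score is the quantity that the works summarised in Table 21 report, typically as an accuracy or F1 drop.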
Table 22
Summary of benchmarks assessing the abilities of GLLMs across various tasks and domains.
Benchmark Evaluates Domain Language(s) Description
KoLA (Yu et al., 2023c) World knowledge General English KoLA stands for Knowledge-oriented LLM Assessment
benchmark covering 19 tasks and assesses the world
knowledge of GLLMs.
SciBench (Wang et al., 2023c) College-level scientific Education English SciBench stands for Scientific problem-solving
problem solving Benchmark and includes two datasets of scientific
problems at the college level.
FinEval (Zhang et al., 2023b) Chinese finance domain Finance Chinese FinEval includes over a thousand multiple-choice
knowledge questions covering more than 30 academic subjects
from the Finance domain.
LegalBench (Guha et al., 2023) Legal reasoning Legal English LegalBench is a legal reasoning benchmark created
through collaborative efforts, featuring 162 tasks that
encompass six distinct categories of legal reasoning.
SciEval (Sun et al., 2023a) Scientific research ability Education English SciEval benchmark includes both objective and
subjective questions from science subjects like biology,
physics and chemistry.
LongBench (Bai et al., 2023a) Long context Multiple domains English, LongBench consists of 21 datasets spanning 6 task
understanding Chinese categories, available in both English and Chinese.
LawBench (Fei et al., 2023) Legal knowledge Legal Chinese LawBench evaluates LLMs in three dimensions namely
legal knowledge memorization, understanding and
applying. This benchmark covers 20 tasks spanning
over 5 task types.
BHASA (Leong et al., 2023) Language understanding, Multiple domains Indonesian, BHASA evaluates LLMs in South East Asian languages
generation and reasoning Thai, Tamil, like Tamil, Thai, Vietnamese and Indonesian. This
Vietnamese benchmark includes eight tasks spanning over natural
language reasoning, generation and understanding.
L2CEval (Ni et al., 2023) Language to code Programming English L2CEval benchmark systematically evaluates the
generation language-to-code generation capabilities of LLMs
across seven different tasks.
XSafety (Wang et al., 2023j) LLM safety Multiple domains Ten XSafety is an LLM safety benchmark which includes
languages fourteen types of frequently encountered safety
concerns, spanning ten languages that belong to
diverse language families.
TRAM (Wang and Zhao, 2023) Temporal reasoning General English TRAM, a benchmark for temporal reasoning includes
ten datasets that cover a range of events related to
temporal aspects like duration, frequency, arithmetic
and order.
FELM (Chen et al., 2023i) Factuality Multiple domains English FELM is a factuality evaluating LLM benchmark
focusing diverse domains including math, reasoning
and world knowledge.
LAiW (Dai et al., 2023a) Legal knowledge Legal Chinese LAiW is the first benchmark for Legal LLMs in the
Chinese language and it evaluates three levels of legal
abilities.
LLMBar (Zeng et al., 2023) Instruction following General English LLMBar is a meta-evaluation benchmark assessing the
ability instruction following ability of LLMs and consists of
419 instances.
BLESS (Kew et al., 2023) Text simplification ability Multiple domains English BLESS benchmark evaluates the text simplification
ability of LLMs and includes instances from three
different domains.
11. GLLMs as evaluators

Overview. Natural language processing tasks can be broadly classified into natural language understanding (NLU) and natural language generation (NLG). NLU involves the interpretation of text, while NLG involves generating human-like text. The evaluation of NLU outputs is relatively straightforward, while the evaluation of NLG outputs is challenging because of the diversity and inherent complexity of the text (Chen et al., 2023f). Moreover, NLG evaluation involves assessing the generated text outputs in multiple dimensions, such as coherence, fluency, naturalness and semantic consistency. Human evaluation and automatic evaluation are the two existing approaches for NLG evaluation. Human evaluation depends on competent annotators for an accurate and reliable assessment (Sai et al., 2022).

Human Evaluation vs. Automatic Evaluation. Human evaluation is treated as the gold standard, but it is time-consuming, expensive, difficult to scale, inconsistent, and not reproducible (Chen et al., 2023f; Wang et al., 2023e). To address the issues with human evaluation, automatic evaluation metrics are developed, which fall broadly into two categories: n-gram-based and embedding-based. N-gram-based metrics assess the quality based on the lexical overlap between the generated and reference texts. Some of the commonly used n-gram-based metrics are BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and METEOR (Banerjee and Lavie, 2005). However, these metrics have a poor correlation with human scores because of their inability to capture semantic meaning (Kocmi et al., 2021). Later, with the evolution of transformers and PLMs, researchers developed embedding-based metrics like BERTScore (Zhang et al., 2019), MoverScore (Zhao et al., 2019), BARTScore (Yuan et al., 2021), CodeBERTScore (Zhou et al., 2023), etc. These metrics leverage PLMs and assess the quality based on the semantic similarity between the generated and reference text. The main drawback of the existing automatic evaluation metrics is the requirement for references, which are difficult to obtain, especially in low-resource domains. Moreover, with just a few references, it is not possible to get an accurate and reliable assessment as few references
Table 23
Summary of benchmarks assessing the abilities of GLLMs across various tasks and domains.
Benchmark Evaluates Domain Language(s) Description
MedEval (He et al., 2023d) Medical knowledge Medical English MedEval benchmark is designed to evaluate LLMs in
medical domain and covers diverse tasks.
XLingEval (Choudhury et al., 2023) Multilingual medical Medical English, Spanish, XLingEval is a multilingual benchmark introduced the
capabilities Chinese, Hindi assess the effectiveness of LLMs in medical domain.
M4LE (Kwan et al., 2023) Long context Multiple English, Chinese M4LE benchmark is designed to evaluate long context
understanding domains understanding of LLMs and includes 36 datasets from
twelve domains covering 11 task types.
BizBench (Koncel-Kedziorski et al., 2023) Quantitative reasoning Finance English BizBench benchmark is introduced to assess the
quantitative reasoning abilities of LLMs and includes
eight reasoning tasks.
CodeScope (Yan et al., 2023) Code understanding Programming 43 programming CodeScope benchmark evaluates code generation and
languages understanding abilities of LLM and covers eight
coding tasks from forty-three programming languages.
FollowEval (Jing et al., 2023) Instruction following General English, Chinese FollowEval benchmark is introduced to evaluate LLMs
based on their performance in five essential aspects of
instruction following.
FinanceBench (Islam et al., 2023) Question answering Finance English FinanceBench is designed to assess the capabilities of
LLMs in the context of open-book financial question
answering (QA) and it includes 10,231 questions
related to publicly traded companies, each
accompanied by evidence strings and relevant
answers.
ARB (Sawada et al., 2023) Advanced reasoning Multiple English ARB is an LLM benchmark that consists of complex
domains reasoning problems spanning various disciplines like
mathematics, physics, biology, chemistry, and law.
TimeBench (Chu et al., 2023) Temporal reasoning General English The TimeBench benchmark is designed to evaluate the
temporal reasoning abilities of LLMs and consists of
10 tasks that address a wide range of temporal
reasoning phenomena.
TaskEval (Shen et al., 2023c) Task automation ability Programming Python TaskEval aims to evaluate the proficiency of LLMs in
automating tasks through a comprehensive and
quantitative evaluation of three different aspects.
PromptBench (Zhu et al., 2023d) Robustness to adversarial General English PromptBench is introduced to assess the robustness of
prompts LLMs to adversarial prompts and this benchmark
includes more than 4500 adversarial prompts.
cannot account for all the semantic variations (Chen et al., 2023f). So, there is a strong need for automatic evaluation metrics which are reference-free.

GLLM-based Evaluation. Recently, with the huge success of GLLMs in most of the NLP tasks, the research community focused on developing automatic evaluation metrics based on these models. These models possess the ability of in-context learning, while instruction tuning enables these models to align themselves with human evaluation (Ouyang et al., 2022). These two abilities enable these models to imitate the behaviour of human evaluators, who typically evaluate natural language generation task outputs by understanding instructions and the given examples. The GLLM-based evaluation metrics demonstrate a strong correlation with human scores even in the absence of reference outputs (Liu et al., 2023d; Fu et al., 2023). Table 24 presents a summary of research works exploring GLLM-based evaluation for various natural language generation tasks.

Research works exploring GLLM-based evaluation. The NLP researchers proposed various GLLM-based evaluation frameworks to evaluate the outputs of various NLG tasks like code generation (Zhuo, 2023), text style transfer (Lai et al., 2023b), text summarization (Liu et al., 2023d; Chen et al., 2023f; Luo et al., 2023; Shen et al., 2023a; Fu et al., 2023; Liu et al., 2023b; Gao et al., 2023b; Tang et al., 2023b; Jain et al., 2023; Wang et al., 2023f), dialogue generation (Liu et al., 2023d; Chen et al., 2023f; Fu et al., 2023), machine translation (Kocmi and Federmann, 2023; Lu et al., 2023a; Xu et al., 2023b; Fu et al., 2023; Tang et al., 2023b; Yang et al., 2023f), story generation (Chen et al., 2023f; Wang et al., 2023f), paraphrase generation (Chen et al., 2023f), text-to-image synthesis (Lu et al., 2023b), data-to-text generation (Fu et al., 2023; Wang et al., 2023f), image captioning (Tang et al., 2023b), text generation (Wang et al., 2023e) and open-ended question answering (Bai et al., 2023b; Zheng et al., 2023a). Most of the research works proposed evaluation frameworks using direct prompting, while some of the research works introduced evaluation frameworks based on advanced prompting strategies like chain-of-thought (Zhuo, 2023; Liu et al., 2023d) and error analysis prompting (Lu et al., 2023a). Some of the proposed evaluation frameworks work both with and without references (Zhuo, 2023; Kocmi and Federmann, 2023; Wang et al., 2023f), some of them require references (Lai et al., 2023b; Lu et al., 2023a; Xu et al., 2023b; Tang et al., 2023b; Yang et al., 2023f), and some do not require any references (Liu et al., 2023d; Chen et al., 2023f; Luo et al., 2023; Shen et al., 2023a; Fu et al., 2023; Liu et al., 2023b; Gao et al., 2023b; Wang et al., 2023e; Jain et al., 2023; Bai et al., 2023b; Zheng et al., 2023a).

Lai et al. (2023b) investigated how effective ChatGPT is at evaluating the text style transfer task along three dimensions: fluency, content and style. The model achieves good correlations with human judgements, and the best results are obtained by using a separate prompt for each dimension. Kocmi and Federmann (2023) proposed GEMBA, a GPT-based metric to assess translation output quality, with references being optional. The authors reported that only GPT-3.5 and higher models are useful for the assessment, and GPT-4 achieves the best results. Based on the evaluation of four natural language generation tasks, paraphrase generation, text summarization, story generation and dialogue response generation, Chen et al. (2023f) showed that explicit scoring with a greedy decoding strategy is the best way to assess NLG outputs using GLLMs like ChatGPT. Luo et al. (2023) evaluated ChatGPT's ability as a factual inconsistency evaluator for the text summarization task. Experiment results showed that ChatGPT outperforms existing metrics on most of the datasets.
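Most of the direct-prompting evaluators summarised in Table 24 reduce to a scoring prompt of the kind sketched below: the instruction names the task and the evaluation dimension, the candidate output (and optionally a reference) is inserted, and the model is asked for a bounded numeric score. The prompt wording, scale and model name here are illustrative assumptions, not the exact setup of GEMBA, G-EVAL or any other specific metric.

```python
# Illustrative reference-free "GLLM as evaluator" sketch: ask the model to
# rate a generated summary on a 1-5 scale for one dimension at a time.
# Prompt wording, scale and model are assumptions, not a published metric.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def gllm_score(source: str, candidate: str, dimension: str = "coherence") -> int:
    prompt = (
        f"You will rate the {dimension} of a summary on a scale from 1 (very poor) "
        "to 5 (excellent). Respond with the number only.\n\n"
        f"Source document:\n{source}\n\nSummary:\n{candidate}\n\nScore:"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder evaluator model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    match = re.search(r"[1-5]", response.choices[0].message.content)
    return int(match.group()) if match else 0  # 0 signals an unparsable reply
```

Averaging such scores over a test set and correlating them with human judgements (e.g., using Spearman correlation) is the usual way these GLLM-based metrics are validated.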
Table 24
Summary of research works exploring GLLM-based evaluation for natural language generation tasks. Here ZS represents zero-shot, and FS represents few-shot.
Paper GLLMs explored Task(s) Prompt settings References required Domain(s) Language(s)
Zhuo (2023) ChatGPT Code generation ZS Optional Programming Five
programming
languages
Lai et al. (2023b) ChatGPT Text style transfer ZS Yes General English
Liu et al. (2023d) ChatGPT, GPT-4 Text summarization, dialogue ZS No General English
generation
Kocmi and Federmann (2023) GPT, GPT-3.5, Machine translation ZS Optional General English, German,
ChatGPT, GPT-4 Chinese, Russian
Chen et al. (2023f) GPT-3.5, ChatGPT Text summarization, dialogue ZS No General English
generation, story generation,
paraphrase generation
Lu et al. (2023a) GPT-3.5, ChatGPT Machine translation ZS, FS Yes General English, Chinese,
German
Luo et al. (2023) ChatGPT Text summarization ZS No General English
Shen et al. (2023a) ChatGPT Text summarization ZS No General English
Lu et al. (2023b) GPT-4 Text-to-image synthesis ZS N/A General English
Xu et al. (2023b) GPT-4 Machine translation ZS Yes General English, German,
Russian
Fu et al. (2023) GPT-3, GPT-3.5 Dialogue generation, machine ZS, FS No General English, Chinese
translation, text summarization,
data-to-text generation
Liu et al. (2023b) GPT-3, ChatGPT Text summarization ZS No General English
Gao et al. (2023b) ChatGPT Text summarization ZS No General English
Tang et al. (2023b) GPT-3.5 Machine translation, text ZS Yes General English
summarization, image caption
Wang et al. (2023e) ChatGPT, GPT-4 Text generation ZS No General English
Jain et al. (2023) GPT-3.5 Text summarization ZS No General English
Wang et al. (2023f) ChatGPT Text summarization, story ZS Optional General English
generation, data-to-text generation
Bai et al. (2023b) GPT-4 Open-ended question answering ZS No General English
Yang et al. (2023f) GPT-4 Machine translation ZS Yes General Multiple
languages
Zheng et al. (2023a) GPT-4 Open-ended question answering ZS No General English
Shen et al. (2023a) explored how effective ChatGPT can be as a zero-shot evaluator for abstractive summarization systems using different evaluation methods like Likert scaling (He et al., 2022b) and head-to-head comparisons (Shen et al., 2022). Extensive analysis showed that Likert scaling implemented as a multiple-choice question gives the best and most stable results. Liu et al. (2023b) designed a novel approach which uses BRIO (Liu et al., 2022), a contrastive learning-based method, to train smaller models like BART for text summarization, and metrics like GPTScore (Fu et al., 2023) or GPTRank for evaluation. The contrastive learning training method helps the model to effectively utilize the supervision signal offered by the reference LLMs. The evaluation showed that the proposed approach helps the smaller model to outperform LLMs like GPT-3 and ChatGPT.

Gao et al. (2023b) evaluated ChatGPT for text summarization using various human evaluation methods and reported that (i) ChatGPT-based evaluation is both cost-effective and reproducible, unlike human evaluation, (ii) the performance of ChatGPT-based evaluation is highly dependent on the prompt design, and (iii) ChatGPT generated explanations correlate with its scores. Jain et al. (2023) explored the effectiveness of the GPT-3.5 model as a multi-dimensional evaluator of text summarization. The authors reported that, using in-context learning, GPT-3.5-based evaluation achieves SOTA performances on the factual consistency and relevance dimensions. Based on the evaluation of five datasets covering text summarization, story generation and data-to-text generation, Wang et al. (2023f) reported that ChatGPT as an evaluator (i) exhibits good correlations with human scores, especially in the case of the story generation task, and (ii) is prompt sensitive. Bai et al. (2023b) introduced a novel evaluation framework called Language-Model-as-an-Examiner to evaluate open-ended questions. In this framework, the GLLM acts as a knowledgeable examiner, generates questions using its own knowledge and then performs reference-free evaluation. Yang et al. (2023f) developed the BigTrans model (based on the LLaMA-13B model) with a multilingual translation capacity of more than 100 languages. GPT-4 based assessment showed that BigTrans performance is on par with ChatGPT and Google Translate. Zheng et al. (2023a) explored GPT-4 as a judge to evaluate open-ended question answering using two newly introduced benchmarks, MT-Bench and Chatbot Arena. The experiment results showed that GPT-4 achieves more than 80% agreement with human judgements.

Unlike the above-discussed research works, which used direct prompting, some of the works explored advanced prompting to offer better guidance and context for the GLLM evaluator. Zhuo (2023) developed a code generation evaluation framework based on ChatGPT and demonstrated that the proposed framework outperforms CodeBERTScore (Zhou et al., 2023) consistently across multiple programming languages. Moreover, the performance of the evaluation framework can be enhanced using references and zero-shot CoT prompting. Liu et al. (2023d) proposed G-EVAL, a novel framework based on GPT-4 for the assessment of natural language generation tasks. The proposed framework uses CoT prompting and a form-filling paradigm. Here, CoT prompting enhances the performance of G-EVAL by offering more guidance and context. The performance of ChatGPT-based evaluation in segment-level machine translation is poor.
Some of the research works explored GLLMs for the evaluation of multi-modal AI tasks (Lu et al., 2023b), fine-tuning open-source LLM evaluators (Xu et al., 2023b), and paraphrasing references to enhance existing metrics based on PLMs (Tang et al., 2023b). For example, Lu et al. (2023b) introduced LLMScore (based on GPT-4), a new metric which can effectively capture both image and object-level compositionality for text-to-image synthesis evaluation. Some of the research works explored these models to fine-tune open-source LLMs so that they can be used as evaluators, which makes the evaluation less expensive. For example, Xu et al. (2023b) introduced InstructScore, a novel and explainable metric based on a fine-tuned LLaMA model for text generation evaluation. Here, the authors use GPT-4 generated synthetic data to fine-tune the LLaMA model. InstructScore can generate an error diagnostic report having error details along with an explanation. Natural language generation evaluation using few references results in poor correlation with human judgements. To overcome this drawback, Tang et al. (2023b) introduced Para-Ref, which leverages LLMs to increase the number of references by paraphrasing. The evaluation on three NLG tasks, namely text summarization, machine translation and image captioning, showed that the proposed approach enhances the correlation of sixteen automatic evaluation metrics with human judgements by a good margin.
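A minimal sketch of the Para-Ref idea is given below: each gold reference is paraphrased by an LLM to enlarge the reference set, and the candidate output is scored against every reference, keeping the best match. The paraphrasing call and the use of sentence-level BLEU are illustrative stand-ins for the sixteen metrics studied by Tang et al. (2023b), not their exact setup.

```python
from nltk.translate.bleu_score import sentence_bleu

def call_gllm(prompt: str) -> str:
    """Placeholder for a ChatGPT/GPT-4 paraphrasing call; replace with a real client."""
    raise NotImplementedError

def expand_references(references, n_paraphrases=3):
    """Enlarge the reference set by asking the LLM for paraphrases of each gold reference."""
    expanded = list(references)
    for ref in references:
        for _ in range(n_paraphrases):
            expanded.append(call_gllm(f"Paraphrase the following sentence, keeping its meaning:\n{ref}"))
    return expanded

def best_reference_bleu(hypothesis, references):
    """Score the hypothesis against every original or paraphrased reference and keep the maximum."""
    expanded = expand_references(references)
    return max(sentence_bleu([ref.split()], hypothesis.split()) for ref in expanded)
```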
Some of the research works focused on addressing the limitations of using GLLMs as evaluators. For example, Wang et al. (2023e) demonstrated positional bias in GLLM-based evaluation, i.e., the order of candidate responses can significantly influence the results. The authors demonstrated that the two proposed strategies, namely multiple evidence calibration and balanced position calibration, can reduce the bias and enhance the correlation with human judgements.
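The balanced position calibration strategy can be sketched as follows: the evaluator is queried with both orderings of the two candidate responses and the per-candidate scores are averaged, so a preference caused purely by position cancels out. The prompt template and score parsing below are assumptions for illustration, not the exact implementation of Wang et al. (2023e).

```python
import re

def call_gllm(prompt: str) -> str:
    """Placeholder for a ChatGPT/GPT-4 call; replace with a real client."""
    raise NotImplementedError

PAIRWISE_PROMPT = """Question: {question}

Response A:
{resp_a}

Response B:
{resp_b}

Score each response from 1 to 10.
Score of Response A:
Score of Response B:"""

def _scores(question, resp_a, resp_b):
    out = call_gllm(PAIRWISE_PROMPT.format(question=question, resp_a=resp_a, resp_b=resp_b))
    nums = re.findall(r"Score of Response [AB]:\s*(\d+)", out)
    if len(nums) < 2:
        raise ValueError("Could not parse two scores from the evaluator output")
    return float(nums[0]), float(nums[1])

def calibrated_scores(question, resp_1, resp_2):
    """Average the scores obtained under both candidate orders to reduce positional bias."""
    s1_first, s2_second = _scores(question, resp_1, resp_2)   # resp_1 shown in position A
    s2_first, s1_second = _scores(question, resp_2, resp_1)   # resp_2 shown in position A
    return (s1_first + s1_second) / 2, (s2_first + s2_second) / 2
```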
12. Future research directions

12.1. Enhance robustness of GLLMs

GLLMs achieved promising results in zero and few-shot settings across various NLP tasks. In some of the tasks like data labelling (Gilardi et al., 2023; He et al., 2023b; Törnberg, 2023; Alizadeh et al., 2023), text classification (Sun et al., 2023b), relation extraction (Wan et al., 2023), question answering (Yang et al., 2022; Bang et al., 2023), keyphrase generation (Song et al., 2023), etc., these models achieved even SOTA results. However, some of the recent research works exposed the brittleness of these models towards out-of-distribution inputs (Wang et al., 2023b; Liu et al., 2023f), adversarial prompts (Zhu et al., 2023d; Shirafuji et al., 2023; Han et al., 2023) and adversarial inputs (Chen et al., 2023g; Zhuo et al., 2023; Zhao et al., 2023c; Liu et al., 2023c). For example, Liu et al. (2023f) reported that ChatGPT and GPT-4 perform well in multiple choice question answering but struggle to answer out-of-distribution questions. Similarly, Chen et al. (2023g) observed more than 35% performance degradation for GPT-3 and GPT-3.5 models in tasks like sentiment analysis and natural language inference for adversarial inputs. The brittleness towards out-of-distribution and adversarial inputs makes these models unreliable and limits their practical utility, especially in sensitive domains. So, it is necessary for the research community to focus more on this research direction to make GLLMs more robust and enhance their reliability and usage.
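Brittleness of this kind is usually quantified by running the same prompt on clean and perturbed versions of the test inputs and reporting the accuracy drop. The sketch below illustrates this with simple character-swap noise on a sentiment classification task; it is a generic robustness check, not the protocol of any of the cited works, and call_gllm is a placeholder for the GLLM client.

```python
import random

def call_gllm(prompt: str) -> str:
    """Placeholder for a GLLM API call that returns a label string."""
    raise NotImplementedError

def perturb(text: str, n_swaps: int = 3) -> str:
    """Introduce a few adjacent-character swaps as a crude noise/typo perturbation."""
    chars = list(text)
    if len(chars) < 2:
        return text
    for _ in range(n_swaps):
        i = random.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def accuracy_drop(examples):
    """examples: list of (text, gold_label) pairs with lowercase labels.
    Returns accuracy on clean inputs and on perturbed inputs."""
    template = "Classify the sentiment of the following review as positive or negative:\n{t}\nLabel:"
    clean = sum(call_gllm(template.format(t=t)).strip().lower() == y for t, y in examples)
    noisy = sum(call_gllm(template.format(t=perturb(t))).strip().lower() == y for t, y in examples)
    return clean / len(examples), noisy / len(examples)
```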
12.2. Red teaming

Red teaming involves an assessment to expose undesirable model behaviours like generating harmful text (Bhardwaj and Poria, 2023; Ganguli et al., 2022; Mehrabi et al., 2023; Perez et al., 2022). GLLMs trained over large volumes of text data with a simple next-word prediction objective are surprisingly good at generating text with human-like fluency. However, the other side is that these models sometimes generate harmful text. For example, Bhardwaj and Poria (2023) observed that GLLMs like ChatGPT and GPT-4 generate answers to more than 60% of harmful queries. One of the possible reasons for this undesirable behaviour of GLLMs is that data used for pretraining these models includes toxic, biased and noisy text to some extent (Bhardwaj and Poria, 2023). This unwanted behaviour of generating harmful text raises concerns and limits the scalable deployment of these models for public use. We can expect more research in future to expose such undesirable behaviour in various scenarios and eventually enhance the safety alignment as well as the safe use of GLLMs.
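Red-teaming results such as the 60% figure above are typically computed as the fraction of probe prompts that the model answers instead of refusing. The sketch below shows one crude way to estimate this rate with a keyword-based refusal heuristic; the marker list and the call_gllm placeholder are assumptions for illustration only.

```python
def call_gllm(prompt: str) -> str:
    """Placeholder for a ChatGPT/GPT-4 call; replace with a real client."""
    raise NotImplementedError

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am sorry", "i won't")

def answered(response: str) -> bool:
    """Crude heuristic: a response counts as 'answered' if it contains no refusal marker."""
    lowered = response.lower()
    return not any(marker in lowered for marker in REFUSAL_MARKERS)

def harmful_answer_rate(red_team_prompts):
    """Fraction of red-teaming prompts the model answers instead of refusing."""
    results = [answered(call_gllm(p)) for p in red_team_prompts]
    return sum(results) / len(results)
```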
12.3. State-of-the-art results across NLP tasks

In the beginning, GLLMs like GPT-3 achieved impressive performances in zero and few-shot settings across NLP tasks. Advanced GLLMs like ChatGPT and GPT-4 further pushed the results but still lag behind SOTA results achieved by PLMs fine-tuned based on supervised learning. Later, with the evolution of advanced prompting strategies and novel approaches, GLLMs are able to achieve SOTA results in some of the NLP tasks. For example, InstructGPT with the CARP prompting strategy using just 16 examples achieves SOTA results on four text classification datasets (Sun et al., 2023b). Similarly, Wan et al. (2023) achieved SOTA results in relation extraction with the novel GPT-RE framework. Yang et al. (2022) proposed a novel approach which uses GPT-3 as an implicit knowledge source and achieves SOTA results in knowledge-based visual question answering. In future, we can expect more focus from the research community to achieve SOTA results using GLLMs in as many NLP tasks as possible, which will be treated as a further push towards artificial general intelligence. Moreover, this eliminates the painful process of labelling large amounts of data and then fine-tuning PLMs separately for each downstream task.
https://neoteric.eu/blog/how-much-does-it-cost-to-use-gpt-models-gpt-
tion objective are surprisingly good at generating text with human-like 3-pricing-explained.
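The paraphrasing attack mentioned above can be checked with a small evaluation loop: a detector is run on GLLM-generated texts before and after an LLM paraphrases them, and the drop in detection rate indicates how fragile the detector is. Both detector_score and call_gllm below are placeholders; this is a generic evaluation sketch, not a specific published method.

```python
def detector_score(text: str) -> float:
    """Placeholder for any AI-text detector (e.g., DetectGPT or a fine-tuned RoBERTa classifier);
    should return the estimated probability that `text` is machine-generated."""
    raise NotImplementedError

def call_gllm(prompt: str) -> str:
    """Placeholder for an LLM call, used here to paraphrase text."""
    raise NotImplementedError

def paraphrase_attack_drop(machine_texts, threshold: float = 0.5):
    """Detection rate on GLLM-generated texts before and after a paraphrase attack."""
    before = sum(detector_score(t) >= threshold for t in machine_texts)
    after = sum(
        detector_score(call_gllm(f"Paraphrase the following text:\n{t}")) >= threshold
        for t in machine_texts
    )
    n = len(machine_texts)
    return before / n, after / n
```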
12.5. Reduce inference costs

GLLMs achieve impressive performances across NLP tasks, with SOTA results in some tasks. However, the downside of using GLLMs is the high inference costs (Chen et al., 2023h; Cheng et al., 2023). For example, a small business is required to spend more than $21,000 monthly to use GPT-4 for better customer support.7 Such high inference costs have become a burden to small and medium-sized companies.

7 https://neoteric.eu/blog/how-much-does-it-cost-to-use-gpt-models-gpt-3-pricing-explained.

Recently, Chen et al. (2023h) proposed FrugalGPT, a novel framework involving multiple strategies like prompt adaptation and LLM approximation to reduce the inference costs of GLLMs. The inference costs of GLLMs increase with the prompt size as the inference cost is computed based on the number of tokens processed. Prompt adaptation focuses on reducing the size of the prompt by using fewer but effective examples or querying the GLLMs as a batch. LLM approximation uses a cache to avoid querying the GLLM for similar queries, which eventually reduces overall inference costs. Similarly, Cheng et al. (2023) proposed batch prompting, which involves GLLM inference in batches rather than processing one sample individually. The authors demonstrated that the proposed prompting strategy reduces Codex model inference cost across ten datasets with little or no degradation in the performance. Future research in this direction will result in much better approaches which will further reduce the GLLM inference costs and make GLLM usage more affordable for companies.
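The two cost-saving ideas described above, caching repeated queries (LLM approximation) and packing several samples into one call (batch prompting), can be sketched as follows. The prompt format and the assumption that the model answers each numbered input on its own line are illustrative simplifications of Chen et al. (2023h) and Cheng et al. (2023).

```python
import hashlib

_cache = {}

def call_gllm(prompt: str) -> str:
    """Placeholder for a GLLM API call; replace with a real client."""
    raise NotImplementedError

def cached_call(prompt: str) -> str:
    """LLM-approximation-style cache: identical prompts are answered from the cache,
    avoiding repeated paid API calls."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_gllm(prompt)
    return _cache[key]

def batch_prompt(task_instruction: str, samples, batch_size: int = 8):
    """Batch prompting: pack several samples into one prompt instead of one call per sample."""
    answers = []
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(batch))
        prompt = (f"{task_instruction}\n"
                  "Answer each numbered input on its own line, prefixed with its number.\n"
                  f"{numbered}")
        reply = cached_call(prompt)
        # Naive parsing: assumes the model follows the requested one-line-per-input format.
        answers.extend(line.strip() for line in reply.splitlines() if line.strip())
    return answers
```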
12.6. Enhance performance in domain-specific NLP tasks

Inspired by the success of GLLMs in general domain NLP tasks, the research community explored GLLMs for NLP tasks in specific domains like healthcare, legal, finance, etc. However, the performances of GLLMs in domain-specific NLP tasks are not as impressive as those achieved in general domain NLP tasks (Moradi et al., 2021; Hernandez et al., 2023; Chalkidis, 2023; Choi et al., 2023; Li et al., 2023k; Shah and Chava, 2023). For example, Moradi et al. (2021) reported that the BioBERT model outperforms GPT-3 in few-shot settings even though the BioBERT model is 514 times smaller than GPT-3. Chalkidis (2023) evaluated ChatGPT on the LexGLUE benchmark and reported that ChatGPT performs poorly on legal text classification datasets. Analysing domain-specific texts is more challenging because of domain-specific terminology and abbreviations, complex language structures, etc. In domains like healthcare, finance and legal, domain experts use many words and abbreviations that are specific to the domain and not commonly found in general domain texts. There is a lot of scope to improve the performance of GLLMs in domain-specific NLP tasks, which reduces the bottleneck for the widespread adoption of these models in specific domains.
12.7. Handle limited context length

One of the major drawbacks of GLLMs is their limited context length (Li, 2023; Kaddour et al., 2023; Arefeen et al., 2023). The maximum context length of GLLMs lies in the range of 2049 tokens to 32,768 tokens.8 This limited context length poses a challenge and becomes a bottleneck for GLLMs to handle long documents or maintain long conversations in which the number of tokens falls beyond the maximum context length. Recently, Li (2023) proposed selective context, a novel approach to effectively utilize the limited context length by filtering out the less useful content in the input text. The authors demonstrated the effectiveness of the proposed approach using the ChatGPT model for question-answering and text summarization tasks across datasets having lengthy input instances. Future research in this direction will help in the evolution of more efficient approaches which will effectively utilize the limited context length and eliminate the bottlenecks for the application of GLLMs in tasks that require processing long inputs.

8 https://platform.openai.com/docs/models/overview.
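A much-simplified stand-in for selective context is sketched below: sentences are ranked by a cheap relevance score (lexical overlap with the question) and kept in document order until a token budget is met. Li (2023) instead filters content by self-information computed with a language model, so this is only an illustration of the filtering idea.

```python
def select_context(question: str, document: str, max_tokens: int = 3000) -> str:
    """Keep the sentences most related to the question until the token budget is met."""
    q_terms = set(question.lower().split())
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    # Rank sentences by lexical overlap with the question (higher overlap first).
    ranked = sorted(enumerate(sentences),
                    key=lambda item: -len(q_terms & set(item[1].lower().split())))
    kept, budget = [], max_tokens
    for idx, sent in ranked:
        cost = len(sent.split())  # crude whitespace token count
        if cost <= budget:
            kept.append((idx, sent))
            budget -= cost
    kept.sort()  # restore the original document order
    return ". ".join(sent for _, sent in kept)
```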
12.8. Ensure fair evaluation of GLLMs

GLLMs achieved impressive performances across NLP tasks and have received much attention recently. However, one concern regarding the evaluation of GLLMs is data contamination, which refers to the presence of test data instances of downstream tasks in the training corpus of GLLMs (Chang et al., 2023; Golchin and Surdeanu, 2023; Aiyappa et al., 2023). The problem of data contamination is more relevant in the case of GLLMs because of their proprietary nature and non-disclosure of training corpus details. Recent research works have reported the problem of data contamination in GLLMs like ChatGPT (Aiyappa et al., 2023) and GPT-4 (Golchin and Surdeanu, 2023). For example, Golchin and Surdeanu (2023) demonstrated that GPT-4 is contaminated with instances from text classification, natural language inference and text summarization datasets like WNLI (Wang et al., 2018), AG News (Zhang et al., 2015) and XSUM (Narayan et al., 2018). Recently, Golchin and Surdeanu (2023) proposed a novel approach to detect data contamination for LLMs. Future research must focus on developing simple and effective approaches to identify data contamination and ensure fair evaluation, enhancing the reliability of impressive performances of GLLMs.
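A simple first check for contamination, when parts of a pretraining corpus are available, is n-gram overlap between test instances and the corpus: a high overlap ratio suggests the instance may have been seen during pretraining. This heuristic is only illustrative and is not the detection approach of Golchin and Surdeanu (2023), which works without corpus access.

```python
def ngrams(text: str, n: int = 8):
    """Set of word n-grams of a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(test_instance: str, corpus_chunks, n: int = 8) -> float:
    """Fraction of the test instance's n-grams that also occur in the available corpus chunks."""
    test_grams = ngrams(test_instance, n)
    if not test_grams:
        return 0.0
    corpus_grams = set()
    for chunk in corpus_chunks:
        corpus_grams |= ngrams(chunk, n)
    return len(test_grams & corpus_grams) / len(test_grams)
```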
12.9. Reduce hallucinations

Despite the remarkable performances of GLLMs, there is a growing concern regarding their tendency to generate factually incorrect information (Zhang et al., 2023f; Rawte et al., 2023). This tendency to generate text that does not align with existing world knowledge, deviates from the user's input or contradicts the context generated earlier is referred to as hallucination (Zhang et al., 2023f). Hallucination is a serious problem yet to be addressed fully (Dhuliawala et al., 2023), and it reduces the reliability of GLLMs, which becomes a bottleneck for the adoption of GLLMs, especially in sensitive domains like healthcare (Umapathi et al., 2023). Recently, some of the research works focused on evaluating hallucination in GLLMs (Umapathi et al., 2023), assessing the ability of GLLMs to identify hallucinations (Li et al., 2023b) and developing approaches to reduce hallucinations (Peng et al., 2023b). For example, Li et al. (2023b) proposed HaluEval, a novel benchmark to assess the ability of GLLMs to identify hallucinations. Peng et al. (2023b) introduced LLM-AUGMENTER, a novel approach that reduces hallucinations in ChatGPT without impacting the quality of generated responses. Considering the seriousness of the hallucination problem, we can expect more future research to identify and reduce hallucinations in GLLMs, which will enhance their reliability and adoption across domains, including sensitive domains like healthcare.
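The grounding idea behind approaches like LLM-AUGMENTER can be sketched as a simple generate-verify-revise loop: the answer is checked against retrieved evidence and regenerated with that evidence in the prompt when the verifier flags unsupported content. The retrieval function, prompts and verdict parsing below are assumptions for illustration, not the actual LLM-AUGMENTER implementation.

```python
def call_gllm(prompt: str) -> str:
    """Placeholder for a ChatGPT/GPT-4 call; replace with a real client."""
    raise NotImplementedError

def retrieve_evidence(question: str) -> str:
    """Placeholder for external knowledge retrieval (search engine, database, etc.)."""
    raise NotImplementedError

def grounded_answer(question: str, max_retries: int = 2) -> str:
    """Generate an answer, verify it against retrieved evidence, and retry with the
    evidence in the prompt whenever the verifier flags unsupported content."""
    evidence = retrieve_evidence(question)
    answer = call_gllm(question)
    for _ in range(max_retries):
        verdict = call_gllm(
            f"Evidence:\n{evidence}\n\nAnswer:\n{answer}\n\n"
            "Is every claim in the answer supported by the evidence? Reply SUPPORTED or UNSUPPORTED."
        )
        if "UNSUPPORTED" not in verdict.upper():
            break
        answer = call_gllm(
            f"Using only the evidence below, answer the question.\nEvidence:\n{evidence}\nQuestion: {question}"
        )
    return answer
```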
12.10. Enhance the performance of GLLMs for non-English languages

The performance of GLLMs is not impressive in the case of non-English languages, especially in the case of languages with non-Latin scripts (Ahuja et al., 2023; Bang et al., 2023; Lai et al., 2023a; Kuzman et al., 2023). This is because GLLMs are mostly pretrained on English text. For example, more than 90% of the text in the pretraining corpus of the GPT-3 model is from the English language (Brown et al., 2020; Ahuja et al., 2023). Some of the possible options to enhance the performance of GLLMs for non-English languages are the use of English prompts (Lai et al., 2023a; Kuzman et al., 2023) and optimized tokenization (Armengol-Estapé et al., 2022). There is a great need for better approaches to greatly enhance the performance of GLLMs for non-English languages, which will increase their adoption across the globe and benefit users from non-English communities.

13. Conclusion

In this survey paper, we provide a comprehensive review of GPT-3 family LLMs in multiple dimensions covering more than 350 recent research papers. Here, we present foundation concepts, GPT-3 family LLMs and discuss the performances of these models in various downstream tasks, specific domains and multiple languages. We also discuss data labelling, data augmentation and data generation abilities of GLLMs, the robustness of GLLMs, the effectiveness of GLLMs as evaluators, and finally, conclude with multiple insightful future research directions. Overall, this comprehensive survey paper on GPT-3 family LLMs will serve as a good resource for both academic and industry people to stay updated with the latest research.
CRediT authorship contribution statement

Katikapalli Subramanyam Kalyan: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

The author would like to thank Ajit Rajasekharan for his encouragement and support.

References

Banerjee, S., Lavie, A., 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. pp. 65–72.
Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z., Yu, T., Chung, W., et al., 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023.
Barbieri, F., Camacho-Collados, J., Anke, L.E., Neves, L., 2020. TweetEval: Unified benchmark and comparative evaluation for tweet classification. In: Findings of the Association for Computational Linguistics: EMNLP 2020. pp. 1644–1650.
Bayer, M., Kaufhold, M.-A., Reuter, C., 2022. A survey on data augmentation for text classification. ACM Comput. Surv. 55 (7), 1–39.
Belinkov, Y., Bisk, Y., 2018. Synthetic and natural noise both break neural machine translation. In: International Conference on Learning Representations.
Beltagy, I., Peters, M.E., Cohan, A., 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
Bhardwaj, R., Poria, S., 2023. Red-teaming large language models using chain of utterances for safety-alignment. arXiv preprint arXiv:2308.09662.
Bhattacharya, A., Singla, Y.K., Krishnamurthy, B., Shah, R.R., Chen, C., 2023. A video is worth 4096 tokens: Verbalize story videos to understand them in zero shot. arXiv preprint arXiv:2305.09758.
Blitzer, J., Dredze, M., Pereira, F., 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics. pp. 440–447.
Abacha, A.B., Yim, W.-w., Adams, G., Snider, N., Yetisgen-Yildiz, M., 2023. Overview Bojanowski, P., Grave, E., Joulin, A., Mikolov, T., 2017. Enriching word vectors with
of the MEDIQA-chat 2023 shared tasks on the summarization & generation of subword information. Trans. Assoc. Comput. Linguist. 5, 135–146.
doctor-patient conversations. In: Proceedings of the 5th Clinical Natural Language Bommarito, II, M., Katz, D.M., 2022. GPT takes the bar exam. arXiv preprint arXiv:
Processing Workshop. pp. 503–513. 2212.14402.
Abaskohi, A., Rothe, S., Yaghoobzadeh, Y., 2023. LM-CPPF: Paraphrasing-guided data Bommasani, R., Liang, P., Lee, T., 2023. Holistic evaluation of language models. Ann.
augmentation for contrastive prompt-based few-shot fine-tuning. arXiv preprint New York Acad. Sci..
arXiv:2305.18169. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakan-
Adomavicius, G., Tuzhilin, A., 2005. Toward the next generation of recommender tan, A., Shyam, P., Sastry, G., Askell, A., et al., 2020. Language models are few-shot
systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901.
Data Eng. 17 (6), 734–749. Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P.,
Agrawal, M., Hegselmann, S., Lang, H., Kim, Y., Sontag, D., 2022. Large language Lee, Y.T., Li, Y., Lundberg, S., et al., 2023. Sparks of artificial general intelligence:
models are few-shot clinical information extractors. In: Proceedings of the 2022 Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
Conference on Empirical Methods in Natural Language Processing. pp. 1998–2022. Cai, X., Liu, S., Han, J., Yang, L., Liu, Z., Liu, T., 2021. Chestxraybert: A pretrained
Ahmad, W., Chakraborty, S., Ray, B., Chang, K.-W., 2021. Unified pre-training for language model for chest radiology report summarization. IEEE Trans. Multimed..
program understanding and generation. In: Proceedings of the 2021 Conference Carpenter, K.A., Altman, R.B., 2023. Using GPT-3 to build a lexicon of drugs of abuse
of the North American Chapter of the Association for Computational Linguistics: synonyms for social media pharmacovigilance. Biomolecules 13 (2), 387.
Human Language Technologies. pp. 2655–2668. Cegin, J., Simko, J., Brusilovsky, P., 2023. ChatGPT to replace crowdsourcing of para-
Ahuja, K., Hada, R., Ochieng, M., Jain, P., Diddee, H., Maina, S., Ganu, T., Segal, S., phrases for intent classification: Higher diversity and comparable model robustness.
Axmed, M., Bali, K., et al., 2023. Mega: Multilingual evaluation of generative ai. arXiv preprint arXiv:2305.12947.
arXiv preprint arXiv:2303.12528. Chali, Y., Hasan, S.A., Joty, S.R., 2011. Improving graph-based random walks for
Aiyappa, R., An, J., Kwak, H., Ahn, Y.-Y., 2023. Can we trust the evaluation on complex question answering using syntactic, shallow semantic and extended string
ChatGPT? arXiv preprint arXiv:2303.12767. subsequence kernels. Inf. Process. Manage. 47 (6), 843–855.
Alizadeh, M., Kubli, M., Samei, Z., Dehghani, S., Bermeo, J.D., Korobeynikova, M., Chalkidis, I., 2023. ChatGPT may pass the bar exam soon, but has a long way to go
Gilardi, F., 2023. Open-source large language models outperform crowd workers for the LexGLUE benchmark. arXiv preprint arXiv:2304.12202.
and approach ChatGPT in text-annotation tasks. arXiv preprint arXiv:2307.02179. Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., Androutsopoulos, I., 2020.
Amin, M.M., Cambria, E., Schuller, B.W., 2023. Will affective computing emerge from LEGAL-BERT: The muppets straight out of law school. In: Findings of the
foundation models and general AI? A first evaluation on ChatGPT. IEEE Intell. Syst. Association for Computational Linguistics: EMNLP 2020. pp. 2898–2904.
38 (2). Chalkidis, I., Jana, A., Hartung, D., Bommarito, M., Androutsopoulos, I., Katz, D.,
Anand, A., Lyu, L., Idahl, M., Wang, Y., Wallat, J., Zhang, Z., 2022. Explainable Aletras, N., 2022. LexGLUE: A benchmark dataset for legal language understanding
information retrieval: A survey. arXiv preprint arXiv:2211.02405. in English. In: Proceedings of the 60th Annual Meeting of the Association for
Anil, R., Dai, A.M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Computational Linguistics (Volume 1: Long Papers). pp. 4310–4330.
Taropa, E., Bailey, P., Chen, Z., et al., 2023. Palm 2 technical report. arXiv preprint Chan, C., Cheng, J., Wang, W., Jiang, Y., Fang, T., Liu, X., Song, Y., 2023. Chatgpt
arXiv:2305.10403. evaluation on sentence level relations: A focus on temporal, causal, and discourse
Antaki, F., Touma, S., Milad, D., El-Khoury, J., Duval, R., 2023. Evaluating the perfor- relations. arXiv preprint arXiv:2304.14827.
mance of chatgpt in ophthalmology: An analysis of its successes and shortcomings. Chang, Y., Wang, X., Wang, J., Wu, Y., Zhu, K., Chen, H., Yang, L., Yi, X., Wang, C.,
Ophthalmol. Sci. 100324. Wang, Y., et al., 2023. A survey on evaluation of large language models. arXiv
Araci, D., 2019. Finbert: Financial sentiment analysis with pre-trained language models. preprint arXiv:2307.03109.
arXiv preprint arXiv:1908.10063. Chen, Z., Chen, W., Smiley, C., Shah, S., Borova, I., Langdon, D., Moussa, R., Beane, M.,
Arefeen, M.A., Debnath, B., Chakradhar, S., 2023. LeanContext: Cost-efficient Huang, T.-H., Routledge, B.R., et al., 2021a. FinQA: A dataset of numerical
domain-specific question answering using LLMs. arXiv preprint arXiv:2309.00841. reasoning over financial data. In: Proceedings of the 2021 Conference on Empirical
Armengol-Estapé, J., de Gibert Bonet, O., Melero, M., 2022. On the multilingual Methods in Natural Language Processing. pp. 3697–3711.
capabilities of very large-scale english language models. In: Proceedings of the Chen, Y., Cheng, J., Jiang, H., Liu, L., Zhang, H., Shi, S., Xu, R., 2022. Learning
Thirteenth Language Resources and Evaluation Conference. pp. 3056–3068. from sibling mentions with scalable graph inference in fine-grained entity typing.
Ba, J.L., Kiros, J.R., Hinton, G.E., 2016. Layer normalization. arXiv preprint arXiv: In: Proceedings of the 60th Annual Meeting of the Association for Computational
1607.06450. Linguistics (Volume 1: Long Papers). pp. 2076–2087.
Bahdanau, D., Cho, K., Bengio, Y., 2014. Neural machine translation by jointly learning Chen, Q., Du, J., Hu, Y., Keloth, V.K., Peng, X., Raja, K., Zhang, R., Lu, Z.,
to align and translate. CoRR abs/1409.0473. Xu, H., 2023a. Large language models in biomedical natural language processing:
Bahdanau, D., Cho, K.H., Bengio, Y., 2015. Neural machine translation by jointly benchmarks, baselines, and recommendations. arXiv preprint arXiv:2305.16326.
learning to align and translate. In: 3rd International Conference on Learning Chen, E., Huang, R., Chen, H.-S., Tseng, Y.-H., Li, L.-Y., 2023b. GPTutor: a ChatGPT-
Representations, ICLR 2015. powered programming tool for code explanation. arXiv preprint arXiv:2305.
Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., 01863.
Hou, L., et al., 2023a. LongBench: A bilingual, multitask benchmark for long Chen, H., Jiao, F., Li, X., Qin, C., Ravaut, M., Zhao, R., Xiong, C., Joty, S., 2023c.
context understanding. arXiv preprint arXiv:2308.14508. ChatGPT’s one-year anniversary: Are open-source large language models catching
Bai, Y., Ying, J., Cao, Y., Lv, X., He, Y., Wang, X., Yu, J., Zeng, K., Xiao, Y., Lyu, H., et up? arXiv preprint arXiv:2311.16989.
al., 2023b. Benchmarking foundation models with language-model-as-an-examiner. Chen, Y., Kang, H., Zhai, V., Li, L., Singh, R., Ramakrishnan, B., 2023d. GPT-sentinel:
arXiv preprint arXiv:2306.04181. Distinguishing human and ChatGPT generated content. ArXiv, abs/2305.07969.
Chen, S., Li, Y., Lu, S., Van, H., Aerts, H.J., Savova, G.K., Bitterman, D.S., 2023e. Dai, A.M., Le, Q.V., 2015. Semi-supervised sequence learning. Adv. Neural Inf. Process.
Evaluation of ChatGPT family of models for biomedical reasoning and classification. Syst. 28.
arXiv preprint arXiv:2304.02496. Dai, H., Liu, Z., Liao, W., Huang, X., Cao, Y., Wu, Z., Zhao, L., Xu, S., Liu, W., Liu, N., et
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.d.O., Kaplan, J., Edwards, H., al., 2023b. AugGPT: Leveraging ChatGPT for text data augmentation. arXiv preprint
Burda, Y., Joseph, N., Brockman, G., et al., 2021b. Evaluating large language arXiv:2302.13007.
models trained on code. arXiv preprint arXiv:2107.03374. Dai, S., Shao, N., Zhao, H., Yu, W., Si, Z., Xu, C., Sun, Z., Zhang, X., Xu, J.,
Chen, Y., Wang, R., Jiang, H., Shi, S., Xu, R., 2023f. Exploring the use of large language 2023c. Uncovering ChatGPT’s capabilities in recommender systems. arXiv preprint
models for reference-free text quality evaluation: A preliminary empirical study. arXiv:2305.02182.
arXiv preprint arXiv:2304.00723. Das, S.S.S., Katiyar, A., Passonneau, R.J., Zhang, R., 2022. Container: Few-shot named
Chen, X., Ye, J., Zu, C., Xu, N., Zheng, R., Peng, M., Zhou, J., Gui, T., Zhang, Q., entity recognition via contrastive learning. In: Proceedings of the 60th Annual
Huang, X., 2023g. How robust is GPT-3.5 to predecessors? A comprehensive study Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
on language understanding tasks. arXiv preprint arXiv:2303.00293. pp. 6338–6353.
Chen, L., Zaharia, M., Zou, J., 2023h. FrugalGPT: How to use large language models Das, M., Pandey, S.K., Mukherjee, A., 2023. Evaluating ChatGPT’s performance for
while reducing cost and improving performance. arXiv preprint arXiv:2305.05176. multilingual and emoji-based hate speech detection. arXiv preprint arXiv:2305.
Chen, S., Zhao, Y., Zhang, J., Chern, I., Gao, S., Liu, P., He, J., et al., 2023i. 13276.
Felm: Benchmarking factuality evaluation of large language models. arXiv preprint De Angelis, L., Baglivo, F., Arzilli, G., Privitera, G.P., Ferragina, P., Tozzi, A.E., Rizzo, C.,
arXiv:2310.00741. 2023. ChatGPT and the rise of large language models: the new AI-driven infodemic
Cheng, Z., Kasai, J., Yu, T., 2023. Batch prompting: Efficient inference with large threat in public health. Front. Public Health 11, 1166120.
language model apis. arXiv preprint arXiv:2301.08721. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L., 2009. Imagenet: A large-
Cheshkov, A., Zadorozhny, P., Levichev, R., 2023. Evaluation of ChatGPT model for scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision
vulnerability detection. arXiv preprint arXiv:2304.07232. and Pattern Recognition. IEEE, pp. 248–255.
Chintagunta, B., Katariya, N., Amatriain, X., Kannan, A., 2021. Medically aware GPT-3 Derner, E., Batistič, K., Zahálka, J., Babuška, R., 2023. A security risk taxonomy for
as a data generator for medical dialogue summarization. In: Machine Learning for large language models. arXiv preprint arXiv:2311.11415.
Healthcare Conference. PMLR, pp. 354–372. Destefanis, G., Bartolucci, S., Ortu, M., 2023. A preliminary analysis on the code
Chiu, K.-L., Collins, A., Alexander, R., 2021. Detecting hate speech with gpt-3. arXiv generation capabilities of GPT-3.5 and bard AI models for java functions. arXiv
preprint arXiv:2103.12407. preprint arXiv:2305.09402.
Chmielewski, M., Kucker, S.C., 2020. An MTurk crisis? Shifts in data quality and the Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2018. Bert: Pre-training of deep
impact on study results. Soc. Psychol. Pers. Sci. 11 (4), 464–473. bidirectional transformers for language understanding. arXiv preprint arXiv:1810.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., 04805.
Bengio, Y., 2014. Learning phrase representations using RNN encoder–decoder for
Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X., Celikyilmaz, A., Weston, J.,
statistical machine translation. In: Proceedings of the 2014 Conference on Empirical
2023. Chain-of-verification reduces hallucination in large language models.
Methods in Natural Language Processing (EMNLP). Association for Computational
Ding, B., Qin, C., Liu, L., Bing, L., Joty, S., Li, B., 2022. Is gpt-3 a good data
Linguistics, p. 1724.
annotator? arXiv preprint arXiv:2212.10450.
Choi, J.H., Hickman, K.E., Monahan, A., Schwarcz, D., 2023. Chatgpt goes to law
Doddapaneni, S., Ramesh, G., Khapra, M.M., Kunchukuttan, A., Kumar, P., 2021. A
school. Available at SSRN.
primer on pretrained multilingual language models. arXiv preprint arXiv:2107.
Choromanski, K.M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sar-
00676.
los, T., Hawkins, P., Davis, J.Q., Mohiuddin, A., Kaiser, L., et al., 2020.
Dong, Q., Li, L., Dai, D., Zheng, C., Wu, Z., Chang, B., Sun, X., Xu, J., Sui, Z., 2022.
Rethinking attention with performers. In: International Conference on Learning
A survey for in-context learning. arXiv preprint arXiv:2301.00234.
Representations.
Dong, M., Zeng, X., Koehl, L., Zhang, J., 2020. An interactive knowledge-based
Choudhury, D., et al., 2023. Ask me in english instead: Cross-lingual evaluation of large
recommender system for fashion product design in the big data environment.
language models for healthcare queries. arXiv preprint arXiv:2310.13132.
Inform. Sci. 540, 469–488.
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P.,
Du, X., Cardie, C., 2020. Event extraction by answering (almost) natural questions. In:
Chung, H.W., Sutton, C., Gehrmann, S., et al., 2022. Palm: Scaling language
Proceedings of the 2020 Conference on Empirical Methods in Natural Language
modeling with pathways. arXiv preprint arXiv:2204.02311.
Processing (EMNLP). pp. 671–683.
Chu, Z., Chen, J., Chen, Q., Yu, W., Wang, H., Liu, M., Qin, B., 2023. TimeBench: A
Du, N., Huang, Y., Dai, A.M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y.,
comprehensive evaluation of temporal reasoning abilities in large language models.
Yu, A.W., Firat, O., et al., 2022. Glam: Efficient scaling of language models with
arXiv preprint arXiv:2311.17667.
mixture-of-experts. In: International Conference on Machine Learning. PMLR, pp.
Chung, J., Gulcehre, C., Cho, K., Bengio, Y., 2014. Empirical evaluation of gated
5547–5569.
recurrent neural networks on sequence modeling. In: NIPS 2014 Workshop on Deep
Learning, December 2014. Eldan, R., Li, Y., 2023. TinyStories: How small can language models be and still speak
Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., coherent english? arXiv preprint arXiv:2305.07759.
Dehghani, M., Brahma, S., et al., 2022. Scaling instruction-finetuned language Fan, A., Bhosale, S., Schwenk, H., Ma, Z., El-Kishky, A., Goyal, S., Baines, M., Çelebi, O.,
models. arXiv preprint arXiv:2210.11416. Wenzek, G., Chaudhary, V., Goyal, N., Birch, T., Liptchinsky, V., Edunov, S.,
Clark, E., August, T., Serrano, S., Haduong, N., Gururangan, S., Smith, N.A., 2021. Grave, E., Auli, M., Joulin, A., 2020. Beyond english-centric multilingual machine
All that’s ‘human’is not gold: Evaluating human evaluation of generated text. In: translation. arXiv abs/2010.11125.
Proceedings of the 59th Annual Meeting of the Association for Computational Fan, Y., Jiang, F., 2023. Uncovering the potential of ChatGPT for discourse analysis in
Linguistics and the 11th International Joint Conference on Natural Language dialogue: An empirical study. arXiv preprint arXiv:2305.08391.
Processing (Volume 1: Long Papers). pp. 7282–7296. Fan, L., Krishnan, D., Isola, P., Katabi, D., Tian, Y., 2023. Improving CLIP training with
Clark, K., Luong, M.-T., Le, Q.V., Manning, C.D., 2019. ELECTRA: Pre-training text language rewrites. arXiv preprint arXiv:2305.20088.
encoders as discriminators rather than generators. In: International Conference on Fang, Y., Li, X., Thomas, S.W., Zhu, X., 2023a. ChatGPT as data augmentation for
Learning Representations. compositional generalization: A case study in open intent detection. arXiv preprint
Collins, K.M., Wong, C., Feng, J., Wei, M., Tenenbaum, J.B., 2022. Structured, flexible, arXiv:2308.13517.
and robust: benchmarking and improving large language models towards more Fang, T., Yang, S., Lan, K., Wong, D.F., Hu, J., Chao, L.S., Zhang, Y., 2023b. Is chatgpt
human-like behavior in out-of-distribution reasoning tasks. arXiv preprint arXiv: a highly fluent grammatical error correction system? a comprehensive evaluation.
2205.05718. arXiv preprint arXiv:2304.01746.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Fatouros, G., Soldatos, J., Kouroumali, K., Makridis, G., Kyriazis, D., 2023. Transforming
Grave, É., Ott, M., Zettlemoyer, L., Stoyanov, V., 2020. Unsupervised cross-lingual sentiment analysis in the financial domain with ChatGPT. arXiv preprint arXiv:
representation learning at scale. In: Proceedings of the 58th Annual Meeting of the 2308.07935.
Association for Computational Linguistics. pp. 8440–8451. Fei, Z., Shen, X., Zhu, D., Zhou, F., Han, Z., Zhang, S., Chen, K., Shen, Z., Ge, J.,
Conneau, A., Lample, G., 2019. Cross-lingual language model pretraining. Adv. Neural 2023. LawBench: Benchmarking legal knowledge of large language models. arXiv
Inf. Process. Syst. 32. preprint arXiv:2309.16289.
Costa-jussà, M.R., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., Heffernan, K., Feng, S.Y., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., Hovy, E., 2021.
Kalbassi, E., Lam, J., Licht, D., Maillard, J., et al., 2022. No language left behind: A survey of data augmentation approaches for NLP. In: Findings of the Association
Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672. for Computational Linguistics: ACL-IJCNLP 2021. pp. 968–988.
Cotton, D.R., Cotton, P.A., Shipway, J.R., 2023. Chatting and cheating: Ensuring Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T.,
academic integrity in the era of ChatGPT. Innov. Educ. Teach. Int. 1–12. Jiang, D., et al., 2020. CodeBERT: A pre-trained model for programming and
Coulombe, C., 2018. Text data augmentation made simple by leveraging nlp cloud apis. natural languages. In: Findings of the Association for Computational Linguistics:
arXiv preprint arXiv:1812.04718. EMNLP 2020. pp. 1536–1547.
Dai, Y., Feng, D., Huang, J., Jia, H., Xie, Q., Zhang, Y., Han, W., Tian, W., Wang, H., Feng, W., Zhu, W., Fu, T.-j., Jampani, V., Akula, A., He, X., Basu, S., Wang, X.E.,
2023a. LAiW: A Chinese legal large language models benchmark (a technical Wang, W.Y., 2023. LayoutGPT: Compositional visual planning and generation with
report). arXiv preprint arXiv:2310.05620. large language models. arXiv preprint arXiv:2305.15393.
Fu, J., Ng, S.-K., Jiang, Z., Liu, P., 2023. Gptscore: Evaluate as you desire. arXiv preprint Guo, B., Zhang, X., Wang, Z., Jiang, M., Nie, J., Ding, Y., Yue, J., Wu, Y., 2023d. How
arXiv:2302.04166. close is ChatGPT to human experts? Comparison corpus, evaluation, and detection.
Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., ArXiv, abs/2301.07597.
Perez, E., Schiefer, N., Ndousse, K., et al., 2022. Red teaming language models Gupta, R., Herzog, I., Park, J.B., Weisberger, J., Firouzbakht, P., Ocon, V., Chao, J.,
to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint Lee, E.S., Mailey, B.A., 2023. Performance of ChatGPT on the plastic surgery
arXiv:2209.07858. inservice training examination. Aesthetic Surg. J. sjad128.
Gao, C.A., Howard, F.M., Markov, N.S., Dyer, E.C., Ramesh, S., Luo, Y., Pearson, A.T., Gutiérrez, B.J., McNeal, N., Washington, C., Chen, Y., Li, L., Sun, H., Su, Y., 2022.
2023a. Comparing scientific abstracts generated by ChatGPT to real abstracts with Thinking about GPT-3 in-context learning for biomedical ie? Think again. In:
detectors and blinded human reviewers. NPJ Digit. Med. 6 (1), 75. Findings of the Association for Computational Linguistics: EMNLP 2022. pp.
Gao, M., Ruan, J., Sun, R., Yin, X., Yang, S., Wan, X., 2023b. Human-like summarization 4497–4512.
evaluation with chatgpt. arXiv preprint arXiv:2304.02554. Hacker, P., Engel, A., Mauer, M., 2023. Regulating ChatGPT and other large gen-
Gao, Y., Sheng, T., Xiang, Y., Xiong, Y., Wang, H., Zhang, J., 2023c. Chat-rec: Towards erative AI models. In: Proceedings of the 2023 ACM Conference on Fairness,
interactive and explainable llms-augmented recommender system. arXiv preprint Accountability, and Transparency. pp. 1112–1123.
arXiv:2303.14524. Hada, R., Gumma, V., de Wynter, A., Diddee, H., Ahmed, M., Choudhury, M., Bali, K.,
Gao, Y., Wang, R., Hou, F., 2023d. How to design translation prompts for ChatGPT: Sitaram, S., 2023. Are large language model-based evaluators the solution to scaling
An empirical study. arXiv e-prints, arXiv–2304. up multilingual evaluation? arXiv preprint arXiv:2309.07462.
Gao, J., Zhao, H., Yu, C., Xu, R., 2023e. Exploring the feasibility of chatgpt for event Hakimov, S., Schlangen, D., 2023. Images in language space: Exploring the suitability
extraction. arXiv preprint arXiv:2303.03836. of large language models for vision & language tasks. arXiv preprint arXiv:2305.
Geng, M., Wang, S., Dong, D., Wang, H., Li, G., Jin, Z., Mao, X., Liao, X., 2023. 13782.
An empirical study on using large language models for multi-intent comment Hamidi, A., Roberts, K., 2023. Evaluation of AI chatbots for patient-specific EHR
generation. arXiv abs/2304.11384. questions. arXiv preprint arXiv:2306.02549.
Gilardi, F., Alizadeh, M., Kubli, M., 2023. Chatgpt outperforms crowd-workers for Han, R., Peng, T., Yang, C., Wang, B., Liu, L., Wan, X., 2023. Is information extraction
text-annotation tasks. arXiv preprint arXiv:2303.15056. solved by ChatGPT? An analysis of performance, evaluation criteria, robustness and
Gilson, A., Safranek, C.W., Huang, T., Socrates, V., Chi, L., Taylor, R.A., Chartash, D., errors. arXiv preprint arXiv:2305.14450.
et al., 2023. How does ChatGPT perform on the United States medical licensing Han, X., Zhang, Z., Ding, N., Gu, Y., Liu, X., Huo, Y., Qiu, J., Yao, Y., Zhang, A.,
examination? The implications of large language models for medical education and Zhang, L., et al., 2021. Pre-trained models: Past, present and future. AI Open 2,
knowledge assessment. JMIR Med. Educ. 9 (1), e45312. 225–250.
Giorgi, J., Toma, A., Xie, R., Chen, S., An, K., Zheng, G., Wang, B., 2023. WangLab Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D., Kamar, E., 2022. ToxiGen:
at MEDIQA-chat 2023: Clinical note generation from doctor-patient conversations A large-scale machine-generated dataset for adversarial and implicit hate speech
using large language models. In: Proceedings of the 5th Clinical Natural Language detection. In: Proceedings of the 60th Annual Meeting of the Association for
Processing Workshop. pp. 323–334. Computational Linguistics (Volume 1: Long Papers). pp. 3309–3326.
Glaese, A., McAleese, N., Trebacz, M., Aslanides, J., Firoiu, V., Ewalds, T., Rauh, M., He, P., Gao, J., Chen, W., 2022a. DeBERTaV3: Improving DeBERTa using ELECTRA-
Weidinger, L., Chadwick, M., Thacker, P., et al., 2022. Improving alignment of style pre-training with gradient-disentangled embedding sharing. In: The Eleventh
dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375. International Conference on Learning Representations.
Goertzel, B., 2014. Artificial general intelligence: concept, state of the art, and future He, J., Kryściński, W., McCann, B., Rajani, N., Xiong, C., 2022b. CTRLsum: Towards
prospects. J. Artif. Gener. Intell. 5 (1), 1. generic controllable text summarization. In: Proceedings of the 2022 Conference
Golchin, S., Surdeanu, M., 2023. Time travel in LLMs: Tracing data contamination in on Empirical Methods in Natural Language Processing. pp. 5879–5915.
large language models. arXiv preprint arXiv:2308.08493. He, Z., Liang, T., Jiao, W., Zhang, Z., Yang, Y., Wang, R., Tu, Z., Shi, S., Wang, X.,
González-Gallardo, C.-E., Boros, E., Girdhar, N., Hamdi, A., Moreno, J.G., Doucet, A., 2023a. Exploring human-like translation strategy with large language models. arXiv
2023. Yes but.. Can ChatGPT identify entities in historical documents? arXiv preprint arXiv:2305.04118.
preprint arXiv:2303.17322. He, X., Lin, Z., Gong, Y., Jin, A., Zhang, H., Lin, C., Jiao, J., Yiu, S.M., Duan, N.,
Goyal, S., Doddapaneni, S., Khapra, M.M., Ravindran, B., 2022. A survey of adversarial Chen, W., et al., 2023b. Annollm: Making large language models to be better
defences and robustness in nlp. ACM Comput. Surv.. crowdsourced annotators. arXiv preprint arXiv:2303.16854.
Gu, W., 2023. Linguistically informed ChatGPT prompts to enhance Japanese–Chinese He, P., Liu, X., Gao, J., Chen, W., 2020. DEBERTA: Decoding-enhanced bert with
machine translation: A case study on attributive clauses. arXiv preprint arXiv: disentangled attention. In: International Conference on Learning Representations.
2303.15587. He, P., Peng, B., Lu, L., Wang, S., Mei, J., Liu, Y., Xu, R., Awadalla, H.H., Shi, Y.,
Gu, Y., Tinn, R., Cheng, H., Lucas, M., Usuyama, N., Liu, X., Naumann, T., Gao, J., Zhu, C., et al., 2022c. Z-code++: A pre-trained language model optimized for
Poon, H., 2020. Domain-specific language model pretraining for biomedical natural abstractive summarization. arXiv preprint arXiv:2208.09770.
language processing. arXiv preprint arXiv:2007.15779. He, X., Shen, X., Chen, Z., Backes, M., Zhang, Y., 2023c. Mgtbench: Benchmarking
Gu, Y., Zhang, S., Usuyama, N., Woldesenbet, Y., Wong, C., Sanapathi, P., Wei, M., machine-generated text detection. arXiv preprint arXiv:2303.14822.
Valluri, N., Strandberg, E., Naumann, T., et al., 2023. Distilling large language He, Z., Wang, Y., Yan, A., Liu, Y., Chang, E.Y., Gentili, A., McAuley, J., Hsu, C.-N.,
models for biomedical knowledge extraction: A case study on adverse drug events. 2023d. MedEval: A multi-level, multi-task, and multi-domain medical benchmark
arXiv preprint arXiv:2307.06439. for language model evaluation. arXiv preprint arXiv:2310.14088.
Guha, N., Nyarko, J., Ho, D.E., Re, C., Chilton, A., Narayana, A., Chohlas-Wood, A., He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recog-
Peters, A., Waldon, B., Rockmore, D., et al., 2023. LegalBench: A collaboratively nition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
built benchmark for measuring legal reasoning in large language models. In: Recognition. pp. 770–778.
Thirty-Seventh Conference on Neural Information Processing Systems Datasets and Hendy, A., Abdelrehim, M., Sharaf, A., Raunak, V., Gabr, M., Matsushita, H., Kim, Y.J.,
Benchmarks Track. Afify, M., Awadalla, H.H., 2023. How good are gpt models at machine translation?
Gui, J., Chen, T., Cao, Q., Sun, Z., Luo, H., Tao, D., 2023. A survey of self-supervised a comprehensive evaluation. arXiv preprint arXiv:2302.09210.
learning from multiple perspectives: Algorithms, theory, applications and future Hernandez, E., Mahajan, D., Wulff, J., Smith, M.J., Ziegler, Z., Nadler, D., Szolovits, P.,
trends. arXiv preprint arXiv:2301.05712. Johnson, A., Alsentzer, E., et al., 2023. Do we still need clinical language
Gui, L., Wang, B., Huang, Q., Hauptmann, A.G., Bisk, Y., Gao, J., 2022. KAT: A models? In: Conference on Health, Inference, and Learning. PMLR, pp. 578–597.
knowledge augmented transformer for vision-and-language. In: Proceedings of Hirosawa, T., Harada, Y., Yokose, M., Sakamoto, T., Kawamura, R., Shimizu, T., 2023.
the 2022 Conference of the North American Chapter of the Association for Diagnostic accuracy of differential-diagnosis lists generated by generative pre-
Computational Linguistics: Human Language Technologies. pp. 956–968. trained transformer 3 chatbot for clinical vignettes with common chief complaints:
Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C.C.T., Giorno, A.D., Gopi, S., Java- A pilot study. Int. J. Environ. Res. Public Health 20 (4), 3378.
heripi, M., Kauffmann, P.C., de Rosa, G., Saarikivi, O., Salim, A., Shah, S., Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9 (8),
Behl, H.S., Wang, X., Bubeck, S., Eldan, R., Kalai, A.T., Lee, Y.T., Li, Y.-F., 2023. 1735–1780.
Textbooks are all you need. ArXiv, abs/2306.11644. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E.,
Guo, Z., Jin, R., Liu, C., Huang, Y., Shi, D., Yu, L., Liu, Y., Li, J., Xiong, B., Xiong, D., Casas, D.d.L., Hendricks, L.A., Welbl, J., Clark, A., et al., 2022. Training
et al., 2023a. Evaluating large language models: A comprehensive survey. arXiv compute-optimal large language models. arXiv preprint arXiv:2203.15556.
preprint arXiv:2310.19736. Holmes, J., Liu, Z., Zhang, L., Ding, Y., Sio, T.T., McGee, L.A., Ashman, J.B.,
Guo, D., Ren, S., Lu, S., Feng, Z., Tang, D., Shujie, L., Zhou, L., Duan, N., Li, X., Liu, T., Shen, J., et al., 2023a. Evaluating large language models on a
Svyatkovskiy, A., Fu, S., et al., 2020. GraphCodeBERT: Pre-training code highly-specialized topic, radiation oncology physics. Front. Oncol. 13, 1219326.
representations with data flow. In: International Conference on Learning Holmes, J., Liu, Z., Zhang, L., Ding, Y., Sio, T.T., McGee, L.A., Ashman, J.B.,
Representations. Li, X., Liu, T., Shen, J., et al., 2023b. Evaluating large language models on a
Guo, Z., Wang, P., Wang, Y., Yu, S., 2023b. Dr. LLaMA: Improving small language highly-specialized topic, radiation oncology physics. arXiv preprint arXiv:2304.
models in domain-specific QA via generative data augmentation. arXiv preprint 01938.
arXiv:2305.07804. Hong, S., Seo, J., Hong, S., Shin, H., Kim, S., 2023. Large language models are
Guo, Z., Wang, P., Wang, Y., Yu, S., 2023c. Dr. LLaMA: Improving small language frame-level directors for zero-shot text-to-video generation. arXiv preprint arXiv:
models on PubMedQA via generative data augmentation. ArXiv, abs/2305.07804. 2305.14330.
Hou, Y., Zhang, J., Lin, Z., Lu, H., Xie, R., McAuley, J., Zhao, W.X., 2023a. Large Kalyan, K.S., Rajasekharan, A., Sangeetha, S., 2022. AMMU: a survey of
language models are zero-shot rankers for recommender systems. arXiv preprint transformer-based biomedical pretrained language models. J. Biomed. Inform. 126,
arXiv:2305.08845. 103982.
Hou, X., Zhao, Y., Liu, Y., Yang, Z., Wang, K., Li, L., Luo, X., Lo, D., Grundy, J., Kalyan, K.S., Sangeetha, S., 2020a. Medical concept normalization in user-generated
Wang, H., 2023b. Large language models for software engineering: A systematic texts by learning target concept embeddings. In: Proceedings of the 11th In-
literature review. arXiv preprint arXiv:2308.10620. ternational Workshop on Health Text Mining and Information Analysis. pp.
Howard, J., Ruder, S., 2018. Universal language model fine-tuning for text clas- 18–23.
sification. In: Proceedings of the 56th Annual Meeting of the Association for Kalyan, K.S., Sangeetha, S., 2020b. Target concept guided medical concept normaliza-
Computational Linguistics (Volume 1: Long Papers). pp. 328–339. tion in noisy user-generated texts. In: Proceedings of Deep Learning Inside Out
Hu, Y., Ameer, I., Zuo, X., Peng, X., Zhou, Y., Li, Z., Li, Y., Li, J., Jiang, X., (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep
Xu, H., 2023a. Zero-shot clinical entity recognition using chatgpt. arXiv preprint Learning Architectures. pp. 64–73.
arXiv:2303.16416. raj Kanakarajan, K., Kundumani, B., Sankarasubbu, M., 2021. Bioelectra: pretrained
Hu, H., Lu, H., Zhang, H., Lam, W., Zhang, Y., 2023b. Chain-of-symbol prompting elicits biomedical text encoder using discriminators. In: Proceedings of the 20th Workshop
planning in large langauge models. arXiv preprint arXiv:2305.10276. on Biomedical Language Processing. pp. 143–154.
Huang, J., Chang, K.C.-C., 2022. Towards reasoning in large language models: A survey. Kang, S., Chen, B., Yoo, S., Lou, J.-G., 2023a. Explainable automated debugging via
arXiv preprint arXiv:2212.10403. large language model-driven scientific debugging. arXiv preprint arXiv:2304.02195.
Huang, F., Kwak, H., An, J., 2023a. Is chatgpt better than human annotators? potential Kang, W.-C., Ni, J., Mehta, N., Sathiamoorthy, M., Hong, L., Chi, E., Cheng, D.Z., 2023b.
and limitations of chatgpt in explaining implicit hate speech. arXiv preprint arXiv: Do LLMs understand user preferences? Evaluating LLMs on user rating prediction.
2302.07736. arXiv preprint arXiv:2305.06474.
Huang, R., Li, M., Yang, D., Shi, J., Chang, X., Ye, Z., Wu, Y., Hong, Z., Huang, J., Karpinska, M., Iyyer, M., 2023. Large language models effectively leverage document-
Liu, J., et al., 2023b. Audiogpt: Understanding and generating speech, music, sound, level context for literary translation, but critical errors persist. arXiv preprint
and talking head. arXiv preprint arXiv:2304.12995. arXiv:2304.03245.
Huang, X., Ruan, W., Huang, W., Jin, G., Dong, Y., Wu, C., Bensalem, S., Mu, R., Kasai, J., Kasai, Y., Sakaguchi, K., Yamada, Y., Radev, D., 2023. Evaluating gpt-4
Qi, Y., Zhao, X., et al., 2023c. A survey of safety and trustworthiness of large and chatgpt on japanese medical licensing examinations. arXiv preprint arXiv:
language models through the lens of verification and validation. arXiv preprint 2303.18027.
arXiv:2305.11391. Kashefi, A., Mukerji, T., 2023. ChatGPT for programming numerical methods. arXiv
Hulman, A., Dollerup, O.L., Mortensen, J.F., Fenech, M., Norman, K., Stoevring, H., abs/2303.12093.
Hansen, T.K., 2023. ChatGPT-versus human-generated answers to frequently asked Kew, T., Chi, A., Vásquez-Rodríguez, L., Agrawal, S., Aumiller, D., Alva-Manchego, F.,
questions about diabetes: a turing test-inspired survey among employees of a Danish Shardlow, M., 2023. BLESS: Benchmarking large language models on sentence
diabetes center. medRxiv, pp. 2023-2002. simplification. arXiv preprint arXiv:2310.15773.
Hutter, F., Kotthoff, L., Vanschoren, J., 2019. Automated Machine Learning: Methods,
Khalil, M., Er, E., 2023. Will ChatGPT get you caught? Rethinking of plagiarism
Systems, Challenges. Springer Nature.
detection. arXiv preprint arXiv:2302.04335.
Huynh, J., Jiao, C., Gupta, P., Mehri, S., Bajaj, P., Chaudhary, V., Eskenazi, M., 2023.
Khan, J.Y., Uddin, G., 2022. Automatic code documentation generation using gpt-3.
Understanding the effectiveness of very large language models on dialog evaluation.
In: Proceedings of the 37th IEEE/ACM International Conference on Automated
arXiv preprint arXiv:2301.12004.
Software Engineering. pp. 1–6.
Ippolito, D., Duckworth, D., Callison-Burch, C., Eck, D., 2020. Automatic detection of
Kim, Y., 2014. Convolutional neural networks for sentence classification. In: Proceed-
generated text is easiest when humans are fooled. In: Proceedings of the 58th
ings of the 2014 Conference on Empirical Methods in Natural Language Processing
Annual Meeting of the Association for Computational Linguistics. pp. 1808–1822.
(EMNLP). Association for Computational Linguistics.
Islam, P., Kannappan, A., Kiela, D., Qian, R., Scherrer, N., Vidgen, B., 2023. Fi-
Kocmi, T., Federmann, C., 2023. Large language models are state-of-the-art evaluators
nanceBench: A new benchmark for financial question answering. arXiv preprint
of translation quality. arXiv preprint arXiv:2302.14520.
arXiv:2311.11944.
Kocmi, T., Federmann, C., Grundkiewicz, R., Junczys-Dowmunt, M., Matsushita, H.,
Iyer, S., Lin, X.V., Pasunuru, R., Mihaylov, T., Simig, D., Yu, P., Shuster, K., Wang, T.,
Menezes, A., 2021. To ship or not to ship: An extensive evaluation of automatic
Liu, Q., Koura, P.S., et al., 2022. Opt-iml: Scaling language model instruction meta
metrics for machine translation. In: Proceedings of the Sixth Conference on Machine
learning through the lens of generalization. arXiv preprint arXiv:2212.12017.
Translation. pp. 478–494.
Jain, S., Keshava, V., Sathyendra, S.M., Fernandes, P., Liu, P., Neubig, G., Zhou, C.,
Kocoń, J., Cichecki, I., Kaszyca, O., Kochanek, M., Szydło, D., Baran, J., Bielaniewicz, J.,
2023. Multi-dimensional evaluation of text summarization with in-context learning.
Gruza, M., Janz, A., Kanclerz, K., et al., 2023. Chatgpt: Jack of all trades, master
arXiv preprint arXiv:2306.01200.
of none. arXiv preprint arXiv:2302.10724.
Jeblick, K., Schachtner, B., Dexl, J., Mittermeier, A., Stüber, A.T., Topalis, J., Weber, T.,
Zan, D., Chen, B., Yang, D., Lin, Z., Kim, M., Guan, B., Wang, Y., Chen, W., Lou, J.-G., 2022. CERT: Continual pre-training on sketches for library-oriented code generation. arXiv preprint arXiv:2206.06888.
Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et al., 2022. GLM-130B: An open bilingual pre-trained model. In: The Eleventh International Conference on Learning Representations.
Zeng, Z., Yu, J., Gao, T., Meng, Y., Goyal, T., Chen, D., 2023. Evaluating large language models at evaluating instruction following. arXiv preprint arXiv:2310.07641.
Zhan, H., He, X., Xu, Q., Wu, Y., Stenetorp, P., 2023a. G3Detector: General GPT-generated text detector. arXiv preprint arXiv:2305.12680.
Zhan, H., Li, Z., Wang, Y., Luo, L., Feng, T., Kang, X., Hua, Y., Qu, L., Soon, L.-K., Sharma, S., et al., 2023b. SocialDial: A benchmark for socially-aware dialogue systems. arXiv preprint arXiv:2304.12026.
Zhang, J., Bao, K., Zhang, Y., Wang, W., Feng, F., He, X., 2023a. Is ChatGPT fair for recommendation? Evaluating fairness in large language model recommendation. arXiv preprint arXiv:2305.07609.
Zhang, L., Cai, W., Liu, Z., Yang, Z., Dai, W., Liao, Y., Qin, Q., Li, Y., Liu, X., Liu, Z., et al., 2023b. FinEval: A Chinese financial domain knowledge evaluation benchmark for large language models. arXiv preprint arXiv:2308.09975.
Zhang, B., Fu, X., Ding, D., Huang, H., Li, Y., Jing, L., 2023c. Investigating chain-of-thought with ChatGPT for stance detection on social media. arXiv preprint arXiv:2304.03087.
Zhang, S., Gong, C., Wu, L., Liu, X., Zhou, M., 2023d. AutoML-GPT: Automatic machine learning with GPT. arXiv preprint arXiv:2305.02499.
Zhang, K., Gutiérrez, B.J., Su, Y., 2023e. Aligning instruction tasks unlocks large language models as zero-shot relation extractors. arXiv preprint arXiv:2305.11159.
Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y., 2019. BERTScore: Evaluating text generation with BERT. In: International Conference on Learning Representations.
Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., et al., 2023f. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.
Zhang, X., Li, S., Hauer, B., Shi, N., Kondrak, G., 2023g. Don't trust GPT when your question is not in English. arXiv preprint arXiv:2305.16339.
Zhang, D., Li, S., Zhang, X., Zhan, J., Wang, P., Zhou, Y., Qiu, X., 2023h. SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. arXiv preprint arXiv:2305.11000.
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X.V., et al., 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
Zhang, B., Soh, H., 2023. Large language models as zero-shot human models for human-robot interaction. arXiv preprint arXiv:2303.03548.
Zhang, Y., Yang, Q., 2021. A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 34 (12), 5586–5609.
Zhang, Z., Yao, Y., Zhang, A., Tang, X., Ma, X., He, Z., Wang, Y., Gerstein, M., Wang, R., Liu, G., et al., 2023i. Igniting language intelligence: The hitchhiker's guide from chain-of-thought reasoning to language agents. arXiv preprint arXiv:2311.11797.
Zhang, L., Zhang, Y., Ren, K., Li, D., Yang, Y., 2023j. MLCopilot: Unleashing the power of large language models in solving machine learning tasks. arXiv preprint arXiv:2304.14979.
Zhang, T., Zhang, Y., Vineet, V., Joshi, N., Wang, X., 2023k. Controllable text-to-image generation with GPT-4. arXiv preprint arXiv:2305.18583.
Zhang, X., Zhao, J., LeCun, Y., 2015. Character-level convolutional networks for text classification. Adv. Neural Inf. Process. Syst. 28.
Zhang, J., Zhao, Y., Saleh, M., Liu, P., 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In: International Conference on Machine Learning. PMLR, pp. 11328–11339.
Zhao, Z., Guo, L., Yue, T., Chen, S., Shao, S., Zhu, X., Yuan, Z., Liu, J., 2023a. ChatBridge: Bridging modalities with large language model as a language catalyst. arXiv preprint arXiv:2305.16103.
Zhao, K., Jin, X., Bai, L., Guo, J., Cheng, X., 2022a. Knowledge-enhanced self-supervised prototypical network for few-shot event detection. In: Findings of the Association for Computational Linguistics: EMNLP 2022. pp. 6266–6275.
Zhao, W.X., Liu, J., Ren, R., Wen, J.-R., 2022b. Dense text retrieval based on pretrained language models: A survey. arXiv preprint arXiv:2211.14876.
Zhao, W., Peyrard, M., Liu, F., Gao, Y., Meyer, C.M., Eger, S., 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 563–578.
Zhao, Z., Wallace, E., Feng, S., Klein, D., Singh, S., 2021. Calibrate before use: Improving few-shot performance of language models. In: International Conference on Machine Learning. PMLR, pp. 12697–12706.
Zhao, W., Zhao, Y., Lu, X., Wang, S., Tong, Y., Qin, B., 2023b. Is ChatGPT equipped with emotional dialogue capabilities? arXiv preprint arXiv:2304.09582.
Zhao, Y., Zhao, C., Nan, L., Qi, Z., Zhang, W., Tang, X., Mi, B., Radev, D., 2023c. RobuT: A systematic study of table QA robustness against human-annotated adversarial perturbations. arXiv preprint arXiv:2306.14321.
Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al., 2023d. A survey of large language models. arXiv preprint arXiv:2303.18223.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al., 2023a. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.
Zheng, S., Huang, J., Chang, K.C.-C., 2023b. Why does ChatGPT fall short in answering questions faithfully? arXiv preprint arXiv:2304.10513.
Zheng, M., Su, X., You, S., Wang, F., Qian, C., Xu, C., Albanie, S., 2023c. Can GPT-4 perform neural architecture search? arXiv preprint arXiv:2304.10970.
Zhiyuli, A., Chen, Y., Zhang, X., Liang, X., 2023. BookGPT: A general framework for book recommendation empowered by large language model. arXiv preprint arXiv:2305.15673.
Zhong, Q., Ding, L., Liu, J., Du, B., Tao, D., 2023. Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. arXiv preprint arXiv:2302.10198.
Zhou, S., Alon, U., Agarwal, S., Neubig, G., 2023. CodeBERTScore: Evaluating code generation with pretrained models of code. arXiv preprint arXiv:2302.05527.
Zhu, X., Li, J., Liu, Y., Ma, C., Wang, W., 2023a. A survey on model compression for large language models. arXiv preprint arXiv:2308.07633.
Zhu, W., Liu, H., Dong, Q., Xu, J., Kong, L., Chen, J., Li, L., Huang, S., 2023b. Multilingual machine translation with large language models: Empirical results and analysis. arXiv preprint arXiv:2304.04675.
Zhu, F., Wang, Y., Chen, C., Zhou, J., Li, L., Liu, G., 2021. Cross-domain recommendation: Challenges, progress, and prospects. arXiv preprint arXiv:2103.01696.
Zhu, W., Wang, X., Lu, Y., Fu, T.-J., Wang, X.E., Eckstein, M., Wang, W.Y., 2023c. Collaborative generative AI: Integrating GPT-k for efficient editing in text-to-image generation. arXiv preprint arXiv:2305.11317.
Zhu, K., Wang, J., Zhou, J., Wang, Z., Chen, H., Wang, Y., Yang, L., Ye, W., Gong, N.Z., Zhang, Y., et al., 2023d. PromptBench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528.
Zhu, Y., Zhang, P., Haq, E.-U., Hui, P., Tyson, G., 2023e. Can ChatGPT reproduce human-generated labels? A study of social computing tasks. arXiv preprint arXiv:2304.10145.
Zhuang, Z., Chen, Q., Ma, L., Li, M., Han, Y., Qian, Y., Bai, H., Feng, Z., Zhang, W., Liu, T., 2023. Through the lens of core competency: Survey on evaluation of large language models. arXiv preprint arXiv:2308.07902.
Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., He, Q., 2020. A comprehensive survey on transfer learning. Proc. IEEE 109 (1), 43–76.
Zhuo, T.Y., 2023. Large language models are state-of-the-art evaluators of code generation. arXiv preprint arXiv:2304.14317.
Zhuo, T.Y., Li, Z., Huang, Y., Li, Y.-F., Wang, W., Haffari, G., Shiri, F., 2023. On robustness of prompt-based semantic parsing with large pre-trained language model: An empirical study on Codex. arXiv preprint arXiv:2301.12868.
Ziems, C., Held, W., Shaikh, O., Chen, J., Zhang, Z., Yang, D., 2023a. Can large language models transform computational social science? arXiv preprint arXiv:2305.03514.
Ziems, N., Yu, W., Zhang, Z., Jiang, M., 2023b. Large language models are built-in autoregressive search engines. arXiv preprint arXiv:2305.09612.