Large Language Model (LLM)
and Performance Evaluation
Metrics in NLP
Lecture # 12
Today's Agenda
Large Language Models
Performance Evaluation Metrics in NLP
Quiz # 3
What Are Large Language Models?
A large language model is an advanced type of language model that is trained
using deep learning techniques on massive amounts of text data. These models
are capable of generating human-like text and performing various natural
language processing tasks.
A language model, at its core, assigns probabilities to sequences of words,
based on the analysis of text corpora. Language models vary in complexity, from
simple n-gram models to more sophisticated neural network models. However, the
term “large language
model” usually refers to models that use deep learning techniques and have a
large number of parameters, which can range from millions to billions. These
models can capture complex patterns in language and produce text that is often
indistinguishable from that written by humans.
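To make the probability idea concrete, here is a minimal sketch of a bigram language model in Python; the toy corpus and all numbers are illustrative, not from the lecture.

from collections import defaultdict

# Toy corpus; in practice a language model is estimated from a large text corpus.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = defaultdict(int)
context_counts = defaultdict(int)
for w1, w2 in zip(corpus, corpus[1:]):
    bigram_counts[(w1, w2)] += 1   # count of w1 followed by w2
    context_counts[w1] += 1        # count of w1 appearing as a context word

def bigram_prob(w1, w2):
    # Maximum-likelihood estimate of P(w2 | w1).
    return bigram_counts[(w1, w2)] / context_counts[w1] if context_counts[w1] else 0.0

def sequence_prob(words):
    # P(w1 ... wn) approximated as a product of bigram probabilities.
    prob = 1.0
    for w1, w2 in zip(words, words[1:]):
        prob *= bigram_prob(w1, w2)
    return prob

print(sequence_prob("the cat sat".split()))   # 0.25 — a plausible word order
print(sequence_prob("cat the sat".split()))   # 0.0  — an unseen word order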
Large Language Models
Modeling human language at scale is a highly complex and resource-intensive
endeavor. The path to reaching the current capabilities of language models and
large language models has spanned several decades.
As models are built bigger and bigger, their complexity and efficacy increase.
Early language models could predict the probability of a single word; modern
large language models can predict the probability of sentences, paragraphs, or
even entire documents.
The size and capability of language models have exploded over the last few years
as computer memory, dataset size, and processing power have increased, and more
effective techniques for modeling longer text sequences have been developed.
How Is a Large Language Model Built?
A large-scale transformer model known as a “large language model” is typically too
massive to run on a single computer and is, therefore, provided as a service over an
API or web interface. These models are trained on vast amounts of text data from
sources such as books, articles, websites, and numerous other forms of written
content. By analyzing the statistical relationships between words, phrases, and
sentences through this training process, the models can generate coherent and
contextually relevant responses to prompts or queries.
OpenAI’s GPT-3, the model family on which ChatGPT was originally built, was
trained on massive amounts of internet text data, giving it the ability to
understand various languages and draw on knowledge of diverse topics. As a
result, it can produce text in multiple styles. While its capabilities, including
translation, text summarization, and question answering, may seem impressive,
they are not surprising: each of these functions amounts to matching learned
patterns of language against the prompt.
General Architecture
The architecture of Large Language Models primarily consists of multiple layers
of neural networks, like recurrent layers, feedforward layers, embedding layers,
and attention layers. These layers work together to process the input text and
generate output predictions.
The embedding layer converts each word in the input text into a high-dimensional
vector representation. These embeddings capture semantic and
syntactic information about the words and help the model to understand the
context.
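A minimal sketch of this lookup, assuming PyTorch is available; the vocabulary size, embedding dimension, and token IDs below are illustrative.

import torch
import torch.nn as nn

vocab_size, embed_dim = 10000, 512               # illustrative sizes
embedding = nn.Embedding(vocab_size, embed_dim)  # one learnable vector per token ID

# Token IDs for a short input sentence (IDs are made up for illustration).
token_ids = torch.tensor([[12, 457, 9, 2031]])   # shape: (batch=1, seq_len=4)
vectors = embedding(token_ids)                   # shape: (1, 4, 512)
print(vectors.shape)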
The feedforward component of Large Language Models consists of multiple fully
connected layers that apply nonlinear transformations to the input embeddings.
These layers help the model learn higher-level abstractions from the input text.
General Architecture
The recurrent layers of LLMs are designed to interpret information from the
input text in sequence. These layers maintain a hidden state that is updated at
each time step, allowing the model to capture the dependencies between words in
a sentence.
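The hidden-state update these layers perform can be sketched in a few lines of Python; the weights below are random and untrained, purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W_x = 0.1 * rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
W_h = 0.1 * rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights

sequence = rng.normal(size=(5, input_dim))  # five word vectors, one per time step
h = np.zeros(hidden_dim)                    # hidden state starts at zero

for x_t in sequence:
    # h_t = tanh(W_x x_t + W_h h_{t-1}): the new state mixes the current
    # word with a summary of everything seen so far.
    h = np.tanh(W_x @ x_t + W_h @ h)
print(h.shape)  # (16,)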
The attention mechanism is another important part of LLMs, which allows the
model to focus selectively on different parts of the input text. This mechanism
helps the model attend to the input text’s most relevant parts and generate more
accurate predictions.
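A minimal sketch of scaled dot-product attention, the core computation behind the attention mechanism described above; the token vectors below are random and purely illustrative.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Each output row is a weighted average of the rows of V, with weights
    # given by a softmax over query-key similarities.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # four token vectors of dimension 8
out = scaled_dot_product_attention(x, x, x)   # self-attention: tokens attend to each other
print(out.shape)                              # (4, 8)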
Examples of LLMs
Let’s take a look at some popular large language models:
GPT-3 (Generative Pre-trained Transformer 3) – This is one of the largest Large
Language Models developed by OpenAI. It has 175 billion parameters and can
perform many tasks, including text generation, translation, and summarization.
T5 (Text-to-Text Transfer Transformer) – T5, developed by Google, is trained on
a variety of language tasks and can perform text-to-text transformations, like
translating text to another language, creating a summary, and question answering.
RoBERTa (Robustly Optimized BERT Pretraining Approach) – Developed by
Facebook AI Research, RoBERTa is an improved BERT version that performs
better on several language tasks.
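As a rough sketch of trying one of these models, the snippet below uses the Hugging Face transformers library (assumed to be installed), with the small public t5-small checkpoint standing in for the full T5.

from transformers import pipeline

# t5-small is a scaled-down public checkpoint, used here only for illustration.
summarizer = pipeline("summarization", model="t5-small")
text = ("Large language models are trained on massive text corpora and can "
        "perform tasks such as translation, summarization, and question answering.")
print(summarizer(text, max_length=20, min_length=5)[0]["summary_text"])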
How do large language models work?
A large language model is based on a transformer model and works by receiving an
input, encoding it, and then decoding it to produce an output prediction. But before a
large language model can receive text input and generate an output prediction, it requires
training, so that it can fulfill general functions, and fine-tuning, which enables it to
perform specific tasks.
Training: Large language models are pre-trained using large textual datasets from sites
like Wikipedia, GitHub, or others. These datasets consist of trillions of words, and their
quality will affect the language model's performance. At this stage, the large language
model engages in unsupervised learning, meaning it processes the datasets fed to it
without specific instructions. During this process, the LLM's AI algorithm can learn the
meaning of words and the relationships between them. It also learns to
distinguish words based on context. For example, it would learn to understand
whether “right” means “correct” or the opposite of “left.”
How do large language models work?
Fine-tuning: In order for a large language model to perform a specific task, such as translation,
it must be fine-tuned to that particular activity. Fine-tuning optimizes the model's performance
on specific tasks.
Prompt-tuning fulfills a similar function to fine-tuning, in that it trains a model to perform a
specific task through few-shot prompting, or zero-shot prompting. A prompt is an instruction
given to an LLM. Few-shot prompting teaches the model to predict outputs through the use of
examples. For instance, in this sentiment analysis exercise, a few-shot prompt would look like
this:
Customer review: This plant is so beautiful!
Customer sentiment: positive
Customer review: This plant is so hideous!
Customer sentiment: negative
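A minimal sketch of assembling this few-shot prompt programmatically; send_to_llm is a hypothetical stand-in for whatever model API is actually used.

examples = [
    ("This plant is so beautiful!", "positive"),
    ("This plant is so hideous!", "negative"),
]
new_review = "This plant arrived wilted and brown."  # illustrative input

prompt = ""
for review, sentiment in examples:
    prompt += f"Customer review: {review}\nCustomer sentiment: {sentiment}\n\n"
prompt += f"Customer review: {new_review}\nCustomer sentiment:"

print(prompt)
# response = send_to_llm(prompt)  # hypothetical call; the model completes the label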
Large language models use cases
Information retrieval: Think of Bing or Google. Whenever you use their search feature, you are
relying on a large language model to produce information in response to a query. It's able to retrieve
information, then summarize and communicate the answer in a conversational style.
Sentiment analysis: As applications of natural language processing, large language models enable
companies to analyze the sentiment of textual data.
Text generation: Large language models are behind generative AI, like ChatGPT, and can generate
text based on inputs. They can produce an example of text when prompted. For example: “Write me a
poem about palm trees in the style of Emily Dickinson.”
Code generation: Like text generation, code generation is an application of generative AI. LLMs
understand patterns, which enables them to generate code.
Chatbots and conversational AI: Large language models enable customer service chatbots or
conversational AI to engage with customers, interpret the meaning of their queries or responses, and
offer responses in turn.
Applications
Tech: Large language models are used anywhere from enabling search engines to respond to queries, to
assisting developers with writing code.
Healthcare and Science: Large language models have the ability to understand proteins, molecules,
DNA, and RNA. This capability allows LLMs to assist in the development of vaccines, the search for
cures for illnesses, and the improvement of preventative care. LLMs are also used as medical
chatbots to perform patient intake or basic diagnoses.
Customer Service: LLMs are used across industries for customer service purposes such as chatbots or
conversational AI.
Marketing: Marketing teams can use LLMs to perform sentiment analysis, quickly generate campaign
ideas, draft example pitch text, and much more.
Legal: From searching through massive textual datasets to generating legalese, large language models can
assist lawyers, paralegals, and legal staff.
Banking: LLMs can support credit card companies in detecting fraud.
Popular large language models
PaLM: Google's Pathways Language Model (PaLM) is a transformer language
model capable of common-sense and arithmetic reasoning, joke explanation,
code generation, and translation.
XLNet: A permutation language model, XLNet generates output predictions in a
random order, which distinguishes it from BERT. It assesses the pattern of encoded
tokens and then predicts tokens in a random order, rather than sequentially.
Performance Evaluation Metrics for
Classification
Evaluating a model is a major part of building an effective machine learning or
NLP model. The most frequently used classification evaluation metric is
'Accuracy'. You might believe that a model is good when its accuracy is 99%!
However, that is not always true: accuracy can be misleading in some situations.
I'm going to explain the aspects shown below:
The Confusion Matrix for a 2-class classification problem
The key classification metrics: Accuracy, Recall, Precision, and F1-Score
Receiver Operating Characteristic (ROC) curve
Confusion Matrix
Evaluation of the performance of a classification model is based on
the counts of test records correctly and incorrectly predicted by the
model.
The confusion matrix provides a more insightful picture: it shows not only the
overall performance of a predictive model, but also which classes are being
predicted correctly or incorrectly, and what types of errors are being made. To
illustrate, the table below shows how the four classification counts (TP, FP,
FN, TN) are obtained by comparing each predicted value against the actual value.
Example 1

               Predicted YES    Predicted No
Actual Yes     95 (TP)          5 (FN)
Actual No      5 (FP)           45 (TN)
Calculation

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (95 + 45) / 150 ≈ 93.3%
Recall (the true positive rate, also known as “Sensitivity”) = TP / (TP + FN) = 95 / 100 = 95%
Precision = TP / (TP + FP) = 95 / 100 = 95%
F1 score = 2 × (Precision × Recall) / (Precision + Recall) = 95%
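These numbers can be reproduced with scikit-learn (assumed to be installed); the label lists below are reconstructed from the Example 1 confusion matrix.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1] * 100 + [0] * 50                      # 100 actual YES, 50 actual No
y_pred = [1] * 95 + [0] * 5 + [1] * 5 + [0] * 45   # 95 TP, 5 FN, 5 FP, 45 TN

print(confusion_matrix(y_true, y_pred))            # [[45  5]
                                                   #  [ 5 95]]
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")    # 0.933
print(f"Precision: {precision_score(y_true, y_pred):.3f}")   # 0.950
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")      # 0.950
print(f"F1 score:  {f1_score(y_true, y_pred):.3f}")          # 0.950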
Quiz # 3
Compare the architecture and applications of Logistic Regression and N-Gram
models in NLP.