Building Applications With Large Language Models. Techniques, Implementations and Applications (2024)
Building Applications With Large Language Models. Techniques, Implementations and Applications (2024)
with Large
Language Models
Techniques, Implementation,
and Applications
Bhawna Singh
ApressK
Building Applications with
Large Language Models
Techniques, Implementation,
and Applications
Bhawna Singh
Apress
Building Applications with Large Language Models: Techniques, Implementation,
and Applications
Bhawna Singh
Dublin, Ireland
Acknowledgments .............................................................................................xv
v
TABLE OF CONTENTS
vi
TABLE OF CONTENTS
vii
TABLE OF CONTENTS
viii
TABLE OF CONTENTS
Index.............................................................................................................. 269
ix
About the Author
Bhawna Singh, a Data Scientist at CeADAR (UCD), holds
both a bachelor’s and a master’s degree in computer science.
During her master’s program, she conducted research
focused on identifying gender bias in Energy Policy data
across the European Union. With prior experience as a Data
Scientist at Brightflag in Ireland and a Machine Learning
Engineer at AISmartz in India, Bhawna brings a wealth of
expertise from both industry and academia. Her current
research interests center on exploring diverse applications
of Large Language Models. Over the course of her career,
Bhawna has built models on extensive datasets, contributing to the development
of intelligent systems addressing challenges such as customer churn, propensity
prediction, sales forecasting, recommendation engines, customer segmentation, PDF
validation, and more. She is dedicated to creating AI systems that are accessible to
everyone, promoting inclusivity regardless of race, gender, social status, or language.
xi
About the Technical Reviewer
Tuhin Sharma is Sr. Principal Data Scientist at Red Hat in the
Data Development Insights & Strategy group. Prior to that, he
worked at Hypersonix as an AI architect. He also cofounded
and has been CEO of Binaize, a website conversion
intelligence product for ecommerce SMBs. Previously, he
was part of IBM Watson where he worked on NLP and ML
projects, few of which were featured on Star Sports and CNN-
IBN. He received a master’s degree from IIT Roorkee and a
bachelor’s degree from IIEST Shibpur in Computer Science.
He loves to code and collaborate on open source projects.
He is one of the top 25 contributors of pandas. He has to his
credit four research papers and five patents in the fields of AI and NLP. He is a reviewer
of the IEEE MASS conference, Springer Nature, and Packt publications in the AI track.
He writes deep learning articles for O’Reilly in collaboration with the AWS MXNET team.
He is a regular speaker at prominent AI conferences like O’Reilly Strata Data & AI, ODSC,
GIDS, Devconf, Datahack Summit, etc.
xiii
Acknowledgments
Writing a book has been a childhood dream, and with this book, it finally comes true.
However, it isn’t merely my achievement, and I would like to thank everybody involved
in this project. My sincere thanks goes to Celestin Suresh John who approached me in
a conference with the idea of this book. I am also thankful to Shobana Srinivasan and
Gryffin Winkler who handled the smooth execution of this project. This book wouldn’t
have been possible without your efforts.
I am also grateful to my technical reviewer, Tuhin Sharma, for providing valuable
feedback which helped in improving the accuracy and clarity of the content.
Additionally, I would like to thank Oisin Boydell who helped me in researching about
Large Language Models. Lastly, this work was made possible by the support of CeADAR,
Ireland, where I have the privilege of working with a team of geniuses.
On a personal note, I would extend my gratitude to my fiance, Ayush Ghai,
who made sure that I finish this book without skipping any meal. His support and
encouragement kept me up through many sleepless nights of writing.
To everyone who contributed to this book in ways both big and small, I extend
my deepest gratitude.
xv
Introduction
Imagine a world where AI wins elections and takes decisions for you. You might think
that it’s still in fiction. However, it is not. Recently, Victor Miller ran as a mayoral
candidate for Cheyenne, Wyoming, and planned to govern the city with the help of an
AI-based bot called Virtual Integrated Citizen (VIC). The world is constantly changing
with the emergence of Large Language Models (LLMs), and it might feel overwhelming
to learn everything about a technology which is evolving at such a fast pace.
As a beginner, it can be difficult to understand the technical jargon, complex
architecture, and sheer size of these models. Playing around new AI-based tools is fun,
but how can you build a tool of your own? How can the businesses harness the power of
this technology to build and deploy a real-world application? What is the other side of
the technology that extends beyond the technicalities of these models?
This book is your guide in understanding different ways in which Large Language
Models, like GPT, BERT, Claude, LLaMA, etc., can be utilized for building something
useful. It takes you on a journey starting from very basic, like understanding the basic
models in NLP, to complex techniques, like PEFT, RAG, Prompt Engineering, etc.
Throughout the book, you will find several examples and code snippets which will help
you appreciate the state-of-the-art NLP models. Whether you’re a student trying to get
hold of the new technology, a data scientist transitioning to the field of NLP, or simply
someone who is inquisitive about Large Language Models (LLMs), this book will build
your concepts and equip you with the knowledge required to start building your own
applications using LLMs.
So, if you’ve ever wondered how to make AI work for you or how to bring your
innovative ideas to life using the power of language models, you’re in the right place.
Let’s embark on this journey together and unlock the potential of LLMs, one step
at a time.
xvii
CHAPTER 1
Introduction to Large
Language Models
Language is a part of our organism and no less complicated than it.
—Ludwig Wittgenstein
The world of Artificial Intelligence is evolving very quickly, and the things that were
true only in fiction are now becoming reality. Today, we have Large Language Models
(LLMs) like GPT, LLaMA, Gemini, Claude, etc., which can generate text fluently in
multiple languages and have an ability to converse like humans. Not only can these
models generate text but also create code, perform data analytics, and demonstrate
multimodality. It seems like the tech giants have got a golden egg laying goose, and
everyone else is busy collecting these eggs to make a fortune.
The field of AI is undergoing a paradigm shift. With the data becoming readily
available in huge quantities, the idea of building a model, which is applicable to a lot of
tasks and is not task-centric, is becoming feasible. Such models are called foundation
models. A Large Language Model is a type of foundation model which is trained on a
vast amount of data to solve a variety of NLP tasks. There is also a common notion in
society that Generative AI, or GenAI, is the same as LLMs; however, it is not. Generative
AI is a field of AI which is used to create content, be it any format - text, images, videos,
or music - but an LLM is a model which majorly generates text, hence falls under the
category of Generative AI. The interest for this technology has surged up significantly
after 2022. Generative AI is popular, but LLM still beats it in popularity as depicted in
Figure 1-1.
1
© Bhawna Singh 2024
B. Singh, Building Applications with Large Language Models, https://doi.org/10.1007/979-8-8688-0569-1_1
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
In this chapter, you will build your foundation by learning about NLP, data
preprocessing, different language models, and, finally, applications of LLMs.
Understanding NLP
Have you ever noticed Google’s ability to predict the next word when you are on its
search engine looking for something? Or did Gmail’s accurate auto-completion catch
your attention? If yes, then you have witnessed an LLM at work. To understand what an
LLM is, you need to first know about Natural Language Processing, or NLP. In simple
terms, the branch of AI that makes computers capable of understanding the human
language in both written and spoken forms is called NLP. It is challenging because of
the nature of the language. For example, the word “fire” can convey different meanings
when used in different contexts.
2
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
Sentiment: Positive
For example:
Reference text: Martin went to school at 7 AM. He studied Math,
English, and History throughout the day. John is Martin’s best
friend, and they both study in 8th grade.
Question: Who is Martin’s best friend and which class he studies?
Sun - noun
1 https://universaldependencies.org/u/pos
3
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
Rises - verb
In - preposition
the - determiner
East - noun
5. Named entity recognition (NER): NER deals with the identification
of named entities (people, organizations, location, date, time,
monetary values, etc.) in the sentence.
For example: Riya is traveling to Prague on Saturday.
In this sentence, Riya is a person, Prague is a location,
and Saturday is a time expression.
6. Text summarization: As the name indicates, this task aims
at creating a summary of the given text. There are two types
of summarization - extractive and abstractive. Extractive
summarization is composition using the existing sentences
from the original text, while abstractive summarization requires
generation of new sentences that may not be present in the
original text but still capture the essence of the text.
7. Machine translation: This task deals with translation of one
language into another. This allows an effortless communication
among people, irrespective of the language they know. With the
advancements in NLP, machine translation has become an easier
problem than it used to be.
Now that you have understood the basics of NLP, let’s move to text preprocessing.
Before building an NLP model, you need to clean the data and make it ready for
modeling. Although text preprocessing majorly depends on the type of data you are
dealing with and the model, there are a few standard steps that are common to text
preprocessing. The next section will cover the details of these steps.
Text Preprocessing
In this section, you will learn about the standard process of cleaning the textual data.
As the saying goes, “garbage in is garbage out.” The data in real life is messy, and using
the raw data in its actual form can lead to unwanted results. To avoid such a situation,
4
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
data scientists spend a lot of time cleaning the data and making it ready for model
building. So, how can you clean the textual data? Let's understand while coding. For this
problem, we will consider the public IMDB dataset for movie reviews. This is a dataset
for binary sentiment classification containing 50,000 reviews and two columns; the first
one contains the review, and the second one contains the label for the review, which can
be either positive or negative. The data is gathered from Kaggle, and you can access it
here.2 Once the data has been downloaded, then you can start with text preprocessing,
but before that, let’s look at all the libraries and their versions, which you will require to
implement the code snippets here.
Python 3.11.3
contractions 0.1.73
emoji 2.8.0
matplotlib 3.7.1
nltk 3.7
numpy 1.26.2
pandas 1.5.3
seaborn 0.12.2
session_info 1.0.0
sklearn 1.2.2
wordcloud 1.9.3
The first step is to import the necessary libraries as demonstrated in the code snippet
mentioned below.
Step 1: Import necessary libraries
2 https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
5
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
Out [6]:
review sentiment
6
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
movie_re["review_transformed"] = movie_re["review"].str.lower()
movie_re.head(5)
Out [12] :
review sentiment review_transformed
0 One of the other reviewers has mentioned that... positive one of the other reviewers has mentioned that...
1 A wonderful little production. <br /xbr />The... positive a wonderful little production. <br /xbr />the...
2 I thought this was a wonderful way to spend ti... positive i thought this was a wonderful way to spend ti...
3 Basically there's a family where a little boy ... negative basically there's a family where a little boy ...
4 Petter Mattei's "Love in the Time of Money" is... positive petter mattei's "love in the time of money" is...
pattern = r'(https://\S+|www\.\S+)'
movie_re['urls'] = movie_re['review_transformed'].str.extract(pattern)
movie_re['review_transformed']=movie_re['review_transformed'].str.
replace(pattern, '', regex=True)
You should get 115 unique values of URLs after running the code. Here are a few
values from the data represented in Figure 1-4.
7
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
In [5]: movie_re[1urls'].unique()
Out[5]: array([nan, 'www.cei.org.', 'www.invocus.net)', 'www.softfordigging.com'
'www.petitiononline.com/19784444/petition. html',
'www.comingsoon.net/films.php?id=36310', 'www.residenthazard.com)
'www.zonadvd.com', 'www.nixflix.com', 'www.abc.net.au/chaser.',
'www.lovetrapmovie.com', 'www.thepetitionsite.com',
'www.petitiononline.com/ghl215/petition. html',
'www.johntopping.com/harvey%20perr/war%20widow/war_widow.html',
'www.mediasickness.com', 'www.imdb.com/title/tt0073891/',
'www.imdb.com/title/tt0363163/<br', 'www.poffysmoviemania.com)',
'www.gutenberg.org/ebooks/18137', 'www.reell3.org)',
'www.cinemablend.com/feature.php?id=209',
'www.youtube.com/watch?v=rmb4-hyet_y',
'www.dvdbeaver.com/film/dvdcompare2/kingofmasks. htmcbr',
'www.helium.com/items/1433421-sydney-white-review',
'www.imdb.com/title/tt0962736/awards',
html_pattern = re.compile(r'<.*?>')
movie_re['review_transformed']=movie_re['review_transformed'].str.
replace(html_pattern, ' ', regex=True)
movie_re['review_transformed'][0]
: movie_re[1review_transformed'][0]
: "one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked, they are right, a
s this is exactly what happened with meJcbr /><br />l|he first thing that struck me about oz was its brutality and u
nflinching scenes of violence, which set ih FlCjftT TTom the word go. trust me, this is not a show for the faint hear
movie_re[1review_transformed1][0]
"one of the other reviewers has mentioned .that after watching just 1 oz episode you'll be hooked, they are right, a
s this is exactly what happened with| me. |the first thing that struck me about oz was its brutality and unflinching
scenes of violence, which set in right from the word go. trust me, this is not a show for the faint hearted or timi
8
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
movie_re["review_transformed"]=movie_re['review_transformed'].apply(lambda
x: ' '.join([contractions.fix(word) for word in x.split()]))
movie_re["review_transformed"][0]
"one of the other reviewers has mentioned that after watching just 1 oz episodefyou111 |>e hooked, they are right, a
s this is exactly what happened with me. the first thing that struck me about oz was its brutality and unflinching |
: movie_re["review_transformed"] [0]
: 'one of the other reviewers has mentioned that after watching just 1 oz episoderyou will]be hooked, they are right,
as this is exactly what happened with me. the first thing that struck me about oz was its brutality and unflinching
movie_re['review_transformed'] = movie_re['review_transformed'].str.
translate (str.maketrans(‘ ‘,’ ‘,string.punctuation))
movie_re["review_transformed"][0]
9
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
The code in the above block makes use of translate and maketrans functions,
which are available in all versions of Python3. The translate() method requires a table
parameter, which is created here using the maketrans() function. The maketrans method
requires three parameters. The first two are empty strings, and the last one is string.
punctuation which you want to omit. Here is the first review after the elimination of
punctuation marks, demonstrated in Figure 1-9.
In [21]: movie_re.review_tnansformed[0]
Out[21]: 'one of the other reviewers has mentioned that after watching just 1 oz episode youll be hooked they are right as t
his is exactly what happened with mebr br the first thing that struck me about oz was its brutality and unflinching
scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid t
his show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordbr br
it is called oz as that is the nickname given to the Oswald maximum security state penitentary it focuses mainly on
emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privac
y is not high on the agenda em city is home to manyaryans muslims gangstas latinos Christians Italians irish and mo
reso scuffles death stares dodgy dealings and shady agreements are never far awaybr br i would say the main appeal
of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pictures painted for mains
tream audiences forget charm forget romanceoz doesnt mess around the first episode i ever saw struck me as so nasty
it was surreal i couldnt say i was ready for it but as i watched more i developed a taste for oz and got accustomed
to the high levels of graphic violence not just violence but injustice crooked guards wholl be sold out for a nicke
I inmates wholl kill on order and get away with it well mannered middle class inmates being turned into prison bite
hes due to their lack of street skills or prison experience watching oz you may become comfortable with what is unc
omfortable viewingthats if you can get in touch with your darker side1
nltk.download('stopwords')
stop = stopwords.words('english')
print(stop)
['i', 'me', 'my', 'myself, 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your',
'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it',
"it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this',
'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'whil
e', 'of1, 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'the
n', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'o
ther', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'wi
ll', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "ar
en't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "ha
ven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shou
Idn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
10
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
You can remove these stop words by running the following code, and your modified
text will look like the output shown in Figure 1-11:
movie_re["review_transformed"] = movie_re['review_transformed'].
apply(lambda x: ' '.join([word for word in x.split() if word not in
(stop)]))
print(movie_re["review_transformed"][0])
In [16]: movie_re["review_transformed"][0]
Out[16]: 'one reviewers mentioned watching 1 oz episode hooked right exactly happened first thing struck oz brutality unflin
ching scenes violence set right word go trust show faint hearted timid show pulls punches regards drugs sex violenc
e hardcore classic use word called oz nickname given Oswald maximum security state penitentary focuses mainly emera
Id city experimental section prison cells glass fronts face inwards privacy high agenda city home manyaryans muslim
s gangstas latinos Christians Italians irish moreso scuffles death stares dodgy dealings shady agreements never far
away would say main appeal show due fact goes shows would dare forget pretty pictures painted mainstream audiences
forget charm forget romanceoz mess around first episode ever saw struck nasty surreal could say ready watched devel
oped taste oz got accustomed high levels graphic violence violence injustice crooked guards sold nickel inmates kil
I order get away well mannered middle class inmates turned prison bitches due lack street skills prison experience
watching oz may become comfortable uncomfortable viewingthat get touch darker side'
movie_re['review_transformed'] = movie_re['review_transformed'].str.
replace('\d+', '')
movie_re["review_transformed"][0]
Out [21]: 'one of the other reviewers has mentioned that after watching just11 oT~|episode youll be hooked they are right as t
Out[23]: 'one of the other reviewers has mentioned that after watching iust| oz~]episode youll be hooked they are right as th
11
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
In [49]: movie_re['emojis'].uniquel)
Out [49]: array([", '®', '©'], dtype=object)
Data Transformation
Once the data has been cleaned, you can apply additional transformations that will
further reform the data and level it up for modeling. These techniques are linguistic in
nature and are used to break down either sentences or words to reduce complexity.
1. Tokenization: The process of breaking down sentences into
smaller units, such as words, is called tokenization. The simplest
tokenization technique involves decomposition of sentences into
words based on whitespace. There are several other methods for
tokenization, which are covered in later chapters. For now, you
can run the following code, which uses NLTK’s word_tokenize to
see how tokenization happens. Figure 1-15 depicts examples of
tokenized sentences.
nltk.download('punkt')
movie_re['tokenized'] = movie_re['review_transformed'].
apply(lambda x: word_tokenize(x))
movie_re['tokenized']
12
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
In [28]: movie_re['tokenized']
Out [28]: 0 [one, reviewers, mentioned, watching, oz, epis
1 [wonderful, little, production, filming, techn
2 [thought, wonderful, way, spend, time, hot, su
3 [basically, family, little, boy, jake, thinks,
4 [petter, matteis, love, time, money, visually,
3 https://wordnet.princeton.edu/
13
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
but depending on the use case, you might want to consider some
advanced lemmatization methods. Figure 1-16 depicts data
before lemmatization, and Figure 1-17 depicts the same data after
lemmatization.
nltk.download('wordnet')
def get_pos_tag(word):
pos_tag = nltk.pos_tag([word])[0][1][0].upper()
tags_dict = {
"J": wordnet.ADJ,
"N": wordnet.NOUN,
"V": wordnet.VERB,
"R": wordnet.ADV
}
return tags_dict.get(pos_tag)
def lemmatize(text):
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = []
tokens = nltk.word_tokenize(text)
for token in tokens:
token_tag = get_pos_tag(token)
if token_tag is None:
lemmatized_tokens.append(token)
else:
lemma = lemmatizer.lemmatize(token, token_tag)
lemmatized_tokens.append(lemma)
return ' '.join(lemmatized_tokens)
movie_re['lemmatized_tokens'] = movie_re['review_transformed'].
apply(lemmatize)
: movie_re[1review_transformed 1][0]
'one Reviewers mentioned watching] oz episode hooked right exactly|~happened~| first thing struck oz brutality unflinc
hing| scenes |iolence set right word go trust show faint hearted timid show| pulls]|junches ||regards||druqs~|sex violence
hardcore classic use wordl called pz nickname[given ftswald maximum security state penitentary|focuses] mainly emerald
14
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
stemmer = SnowballStemmer("english")
movie_re['stemmed'] = movie_re['tokenized'].apply(lambda tokens:
'.join([stemmer.stem(token) for token in tokens]))
ay would say main appeal show due fact goes shows would dare forget|~pretty [pictures[painted mainstream |audiences| fo
rget charm forget romanceoz mess around first [episode|ever saw struck nasty surreal could say[ ready patched[develop]
ed taste oz qot| accustomed [high levels graphic violence violence injustice crooked guards sold nickel inmates kill
order get away well mannered middle class [inmates [turned prison bitches due lack street[~skills |prison| experience wa
tching oz may become)comfortable[uncomfortable viewingthat get touch darker side'
I death stare dodqi deal shadi agreement never far away would say main appeal show due fact goe show would dare for
get|pretti|picturpaint mainstream audienc forget charm forget romanceoz mess around first|episod pver saw struck n
asti surreal could say| readi [watch develop tast oz got|accustom|high level graphic violenc violenc injustic crook g
uard sold nickel |inmat[kill order d6t away well manner middI class inmat turn prison bitch due lack streetpkill |pr
ison| experi [watch oz may becornjcomfort jjncomfort viewingthat get touch darker side'
15
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
History ofLLMs
In the previous sections, you learned about NLP, common NLP tasks, and how textual
data is preprocessed. This section will help you learn about models that brought LLMs in
the industry. It has taken decades for NLP to reach the current state, and discussing all
the models is out of the scope of this book; nevertheless, I will cover some of the popular
models that shaped the current era of LLMs, or Large Language Models. By the end of
this section, you will know which models existed before language models became large,
what their shortcomings were, and why LLMs have become popular. So, let’s take a walk
down the memory lane of popular language models.
Language Model
The common NLP problems, such as text summarization, question answering, etc.,
require models that can understand and generate the human language. So the models
that can predict the next token based on the previous sequence of tokens (context) are
called language models. Mathematically, a language model is a probability distribution
over all the words that occur in the language. If there x1, x2, x3..........xn are different words
that belong to vocabulary V, then
Looking at the probabilities above, you can simply say that b>a because the
sequence of words for probability “b” is more meaningful than “a,” which corresponds to
a sequence that doesn’t sound right semantically.
To calculate these probabilities, you will have to apply the chain rule of probability.
Here, you are trying to generate text based on the previous sequence of words, so this
brings conditional probability into account. If there x1, x2, x3.......... xn are different words
that belong to vocabulary V, then
p(xij=p(xi)p(x2lxi)p(x3lxi,x2)-p(xnlx1:n-i)= n =1 p(xilxi:xi-i)
16
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
Let me demonstrate the chain rule of probability for the previous example:
P(she, likes, icecream, the, most) = P(she).
P(likes | she).
P(icecream | she, likes).
P(the | she, likes, icecream).
P(most | she, likes, icecream, the).
Now that you are familiar with the idea of language models, let’s move ahead to
understand the evolution of language models.
N-gram Model
N-gram models were probabilistic language models which utilized conditional
probability to generate the next token. The approach was based on n-gram which is a
sequence of n words/tokens. N-gram models played an important role as they were a
step up over rule-based models. These models used the idea of context window, which
was very effective for tasks that dealt with local dependencies or short range of data,
such as speech recognition, machine translation, etc. Moreover, the performance of
n-gram models was decent; therefore, they formed a baseline for a variety of NLP tasks.
The models could ingest a huge amount of data; thus making them more scalable than
their predecessor (rule-based language models).
Types of N-gram
Unigram (1-gram): When each item in the sequence is considered an individual
token with no dependency, then the model is called unigram. For example, the sentence
“she likes icecream the most” has the following unigrams: “she,” “likes,” “icecream,”
“the,” “most.”
Bigram (2-gram): When a sequence consists of a pair of items where the occurrence
of the latter depends on the former, then the model is bigram. For example, the sentence
“she likes icecream the most” has the following bigrams: “she likes,” “likes icecream,”
“icecream the,” “the most.”
Trigram (3-gram): When a sequence consists of three consecutive items where the
occurrence of the last item is dependent on the first two items, then the model is said to
be trigram. For example, the sentence “she likes icecream the most” has the following
trigrams: “she likes icecream,” “likes icecream the,” “icecream the most.”
n-gram: When a sequence length is greater than three, then it is called an n-gram
model. The nth item has dependency on the previous n-1 items.
18
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
The n-gram model performed well for a lot of tasks, but there were certain
limitations which needed to be addressed. Firstly, the context window is limited. Only
n-1 preceding terms are considered to be context. This leads to a poor performance
for long-range tasks. For example, let’s consider a complex statement, such as “The
company is looking for a candidate who is well versed in Physics, Maths, Computer and
Biology.” If we consider a bigram model which breaks down the sentence into pairs of
consecutive tokens, then it will fail to capture the relationship between “the company”
and a candidate with the required skills because they are mentioned in the latter part
of the sentence. Secondly, the idea of probability estimation from the occurrence of
n-grams doesn’t work well even when the training data is huge. This happens because a
large number n-grams occur only once in the whole data, leading to sparse estimation
problems. A lot of smoothing techniques were developed by researchers to combat
sparsity problems. Smoothing techniques involve adding a constant value in either
the numerator or the denominator so that the probability doesn’t turn out to be zero.
Thirdly, the models often encounter out-of-vocabulary words. These are the words
which are present in the real data but have never been seen in the training data.
There were other models like context-free grammar (CFG) which were linguistic in
nature. These models are majorly driven by a set of rules that define the syntactic nature
of sentences. While the rules ensured the structural validity of the sentences, they lacked
the capability to capture the context of the sentences. CFG models were popularly used
for parsing and generating strings. Since they lacked semantic ability and were not able
to process complex statements, the quest for more advanced models continued.
19
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
Word Embeddings
The work done by Bengio et al. (2001) was followed by further developments in the NLP
field. In 2013, Google introduced the magic of word2vec to the world. Word embeddings
or word2vec is a vector representation of a word in a high-dimensional space. Word
embeddings reflect the semantic and syntactic relationship between words based on the
usage of the words in a huge corpus of data. The researchers used Google News corpus
as their training data, which contained six billion tokens; however, the vocabulary size
was limited to one million tokens. In this approach, the words were fed into the neural
network in one-hot encoded representation. Mikolov et al. proposed two different types
of architectures to train word embeddings. Let me walk you through these architectures:
1. Continuous bag of words (CBOW): Let’s say you have very limited
data, for example, “I will go to Italy.” Now each word is one-hot
encoded as shown in Figure 1-20.
20
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
21
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
Both architectures have their own merits and demerits. It is advisable to use the CBOW
architecture when data is limited, unlike the skip-gram architecture which performs better
with a larger corpus. However, CBOW is faster to train when compared with the skip-gram
architecture. Word embeddings are able to capture semantic knowledge by outperforming
on analogies, for example, Dublin is to Ireland as Delhi is to India. Another area where
word embeddings excell is algebraic operations, for example, vector(“biggest”) -
vector(“big”) + vector(“small”) is roughly equal to the vector representation of “smallest.”
However, one of the major shortcomings of word2vec is that it is heavily training data
dependent; therefore, it isn’t generalizable. For example, if the training data contained
the word “right” implying the direction, but you are using the word “right” for implying
correctness, then you are using the same vector representation for two different things.
RNN andLSTM
Roughly around the same time, researchers were building language models using RNN
(recurrent neural network). RNN seemed like a good choice for language modeling as it
can handle dynamic input, in the case of NLP, sentences of variable length. Since RNN
suffers from the problem of vanishing gradient (the gradients used for updating the
network become extremely small during backpropagation), the LSTM (long short-term
memory) network is used as its architecture enables the network to avoid vanishing/
exploding gradient problems. These models were good in only remembering the recent
tokens, and there was no mechanism to hold a connection between the old and the new
tokens in a sequence. This led to a poor performance when the input sequence was
long. Thus, the world needed an architecture which could address these limitations and
process language just like us.
22
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
Transformer
“Attention is all you need” by Vaswani et al. changed the landscape of NLP altogether by
bringing out transformer architecture for language modeling. The model significantly
outperformed on the translation task. The reason behind the model’s success was
the attention mechanism which dispensed recurrence entirely while being more
parallelizable. The transformer architecture is the backbone of the LLMs. Chapter 2
encapsulates all the minute details about the transformer architecture and the attention
mechanism, so hold your horses.
This section took you on a historical tour, giving you a taste of a variety of models
before LLM became a huge success. It took decades of work and an army of people to
reach this point where the conservation with an LLM is extremely fluent and humanlike.
A wide variety of applications and use cases have sprung out ever since the LLMs
became successful. Let’s take a deep dive and explore various use cases by which LLMs
are making our lives easier.
Applications ofLLMs
A technology only stays in the market if it is able to serve the people. I believe that the
sky's the limit to what you can build using this technology. Here are some ideas in which
people are currently using these models:
24
CHAPTER 1 INTRODUCTION TO LARGE LANGUAGE MODELS
Conclusion
This is only the tip of the iceberg, and there are endless applications which you can
develop using LLMs. In this chapter, you learned about the following:
• Introduction to NLP
25
CHAPTER 2
Understanding
Foundation Models
Any sufficiently advanced technology is indistinguishable from magic.
—Arthur C. Clarke
Solutions exist only in the world where problems prevail, and there are many research
problems in the field of AI which have given birth to different generations of AI,
each being a milestone in solving some of these problems. The transition into these
generations can also be called paradigm shifts. We are currently experiencing a paradigm
shift where foundation models have become the new state of the art especially in NLP.
In this chapter, you will gain an understanding of different transitions that have
occurred in AI and how these transitions have led us to the magic of foundation models.
You will also learn about the important concepts like foundation models, transfer learning,
and transformer. Finally, you will get a taste of a variety of models available in the market.
Generations ofAI
I believe that we are living in a world where everything around us evolves continuously -
be it furniture, fashion, style, or technology. AI is also evolving, and we have seen
algorithms and techniques which can be collated together to put in as a generation. The
following are the four major generations:
- Knowledge-based systems: An expert system or knowledge-based system
is the most primitive form of AI. The core idea behind an expert system
is to harness the knowledge base (heuristics and rules of a domain) to
drive the inference engine in order to facilitate decision-making like
experts of a domain. The expert systems were popular in the 1970s.
27
© Bhawna Singh 2024
B. Singh, Building Applications with Large Language Models, https://doi.org/10.1007/979-8-8688-0569-1_2
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
- Deep learning: The need to build features for models was a bottleneck
in the growth of AI. Advanced algorithms were required, which could
take data in its natural form. As the computation became accessible
and efficient, deep learning emerged as the winner. Deep learning
could even solve complex problems like object detection, face
recognition, image segmentation, etc. Though data and computation
reduced the dependency of feature engineering to a great extent, the
problem of annotated data still existed. Traditional machine learning
algorithms are supervised in nature; thus, the success of these mod
els is not generalized and use case specific.
28
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
Foundation Models
As mentioned in the previous section, foundation models have brought a paradigm
shift in the world of AI, and it might not be wrong to say that we are participating in an
exciting wave of AI. Let’s look in detail about these models. How are they built? What are
their capabilities? What are their harms?
29
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
The second phase of training occurs once you obtain a pre-trained model. This
phase is also called fine-tuning. Building your own foundation model is not only costly
but also a time-consuming process. Furthermore, the pre-trained model might also not
perform well on a specific task. This is where fine-tuning comes to rescue. Fine-tuning
helps in customization of the model without significantly changing the weights learned
during the pre-training process. This brings in an important concept on the table called
transfer learning. Imagine that you just learned how to chop potatoes. Now you don’t
have to learn the process of chopping onions from scratch because it is very much
similar to that of chopping potatoes. We use the knowledge learned from one process
to another. Transfer learning is built on the process of knowledge transfer and saves
developers from the hustle of building models from scratch.
30
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
Learning a language is a complex task, but once a model has learned the rules of
a language, it can then be reused for solving language-related tasks. Transfer learning
eliminates the need of building a model from scratch by enabling people to use existing
pre-trained models and customize them as per their requirement with techniques like
fine-tuning. You will learn in detail about the fine-tuning LLMs in the next chapter. The
diagram in Figure 2-2 illustrates how transfer learning from model A results in model B.
31
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
32
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
You have learned about foundation models, but as a developer, you need to
understand the technical intricacies to build an application on top of an existing model.
The next section explains the transformer architecture.
Transformer Architecture
“Attention Is All You Need” by Vaswani et al. is the original paper where the transformer
architecture was first discussed. The paper showcased how the architecture
outperformed in the machine translation task. The model became the new state of the
art and continues to remain so. Since 2017, the architecture has remained intact in
most of the LLMs with a few minor tweaks, demonstrating its resilience. Here, you will
learn about the transformer architecture discussed in the original paper, as shown in
Figure 2-3.
33
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
34
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
Self-Attention Mechanism
What Is Self-Attention?
When you are driving a car, several things are happening around you. There are other
people on the road too, some are walking, some are driving, some are riding a bicycle, etc.
As a driver, you have to pay attention to your immediate surroundings, including yourself,
to make the right judgment in order to avoid any mishap. Similarly, think of a transformer
model as a driver which has to pay attention to different words in an input sequence for
generating an accurate output sequence. This happens with the help of a self-attention
mechanism. Recall from the last chapter that RNNs and LSTMs have trouble in generating
accurate responses when the input sequence has a longer range; however, transformers
overcome this limitation by utilizing self-attention. Let me demonstrate an example.
Consider a simple sentence in Figure 2-4. In the sentence, we can interpret clearly
that “its” is associated with the “dog,” but for a machine, it becomes important to
remember the previous words in the sequence to make sense of the words ahead in the
sequence. With the help of self-attention or intra-attention, the model can attend to
different words in a sentence, making it a viable choice for NLP tasks.
So now that you have got an idea about what self-attention does and why it is
beneficial for the transformer model, let’s jump into understanding how the self
attention mechanism is implemented.
35
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
sentence. Let’s take an example here. Imagine you are using a transformer model for
performing a translation task. Your goal is to convert an English sentence (“How are
you?”) into Hindi (“^q ^3 t ?”).
Machines can’t understand words like we do, so the first step is to convert words into
numbers. This is done by creating vector representations of fixed size for each word in
the input sequence. In practice, the vector representations are calculated for each token,
which might not necessarily be the same as the entire word but a subpart of a word.
There are a variety of algorithms to get these tokens, but for now let’s keep it simple
and assume that each word is a token. So, we first convert each token into a vector
representation or an embedding. The input sequence X consists of three words, x1, x2,
and x3, with each word represented as a vector of length 4, as shown in Figure 2-5. In
the original paper, the size of these embeddings was 512, however it varies from model
to model.
Once the input sequence gets converted into numbers, it’s ready to be processed by
the machine. Each word in the model has to attend to every other word in the sequence.
This helps the model learn how the words are correlated. The correlation is determined
with the help of cosine similarity between two different vectors. Thus, similarity score
determines how much attention a word should pay to another word. That’s it! That’s all
the crux of self-attention, nothing fancy. So, how to calculate the similarity score?
The input for the self-attention module is a vector, and the output of the module
is also a vector with the same dimensions as the input. The attention function is
determined with the help of three vectors which are assigned to each input token in
the sequence. Let’s go back to the example. The input sequence “How are you?” has
three words. Every word should have three vector representations that will be used for
calculating the attention scores. These three vectors are called query (q), key (k), and
36
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
value (v). It is to be noted that these vectors are obtained by linear transformation of
each input token xi in the input sequence X with the embeddings of dimension k.
By using three weight matrices (W) of dimension k*k, you can modify each input token to
get q, k, and v:
qi = Wqxi
ki = Wkxi
vi = Wvxi
Now for obtaining similarity, you need to perform a dot product of two vectors.
Let’s say you have two vectors a and b, then the attention score can be calculated by
performing a dot product between these two vectors as shown below. It is to be noted
that the idea of computing dot products is not new but has been used in machine
learning for a very long time.
similarity(a,b) = a.b
similarity(q,k) = qT.k
The attention score between the ith token and jth token is calculated as
attention score(q1,kj)=qT1^kj
When you are dealing with embeddings of bigger size, the value of the dot product
goes up too. Therefore, you need to scale it down so that bigger values don’t impact
the overall attention weights. The attention scores are divided by kk (where k is the
dimensionality of the key vector) to prevent the values from becoming large, thus saving
the model from slowed learning.
After computing the attention score, the next step is to pass these scores through
a softmax function to get the probability distribution. Since the values during the dot
product can lie between -infinity and +infinity, the softmax function ensures that the
attention score is transformed into a value which lies between zero and one, such that all
the values add up to one.
37
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
For the ith token which has to attend to the jth token, the attention weight is calculated
after applying softmax to the attention score as mentioned below:
This helps the model to identify how the tokens are correlated. Let’s look at the
example here with some arbitrary softmax scaled dot product values:
How are you
x1 x2 x3
wx1x1 = attention weight( x1, x1) = 0.61
wx2x2 = attention weight(x1, x2) = 0.09
wx3x3 = attention weight(x1, x3) = 0.30
By looking at the scores, you can easily make out that “How” is related to “you” more
than it is related to “are.” The softmax scores are the attention weights which are used to
calculate a weighted sum over all the embedding vectors to calculate the output vector:
Output = ^ wij. Vj
j
2. The second step is to calculate the dot product between query and
key and scale it down.
3. The third step is to find the softmax value of the dot product
to get the attention score.
4. Finally, the last step is to compute the weighted sum of the
softmax attention scores and the value vector. This will generate
the output vector.
38
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
numpy == 1.26.2
seaborn == 0.12.2
python == 3.11.3
You will use numpy and seaborn libraries for this exercise, so start by importing
them in Python as shown in Figure 2-6.
Import necessary libraries
In [1]: import numpy as np
import seaborn as sns
The next step would be to define the softmax function because it has to be called
from the self-attention function. This is demonstrated in Figure 2-7.
Softmax Function
In [2]: def softmax(x):
return (np.exp(x).T / np.sum(np.exp(x), axis=-l)).T
39
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
Now call the function using an arbitrary input value. I have used random values for
the sentence “How are you” as demonstrated in Figure 2-9.
In [4]: X = np.array([
[0.8, -1.8, 0.6, -0.5],
[-1.6, 1.3, 1.9, 0.4],
[-0.3, 0.7, -1.4, -0.9]
])
The next step is to calculate weight matrices, which is done by performing linear
transformations in reality, and after transformations, they will look like the vectors
shown in Figure 2-10.
40
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
# Generate the weight matrices for query, key, and value using random values
W_Q = np.random.randn(embedding_dimension, dimension_key)
W_K = np.random.randn(embedding_dimension, dimension_key)
W_V = np.random.randn(embedding_dimension, dimension_value)
Once you get the three weight matrices, you can proceed ahead with calling the
self-attention function to get the output vector, as depicted in Figure 2-11.
Output:
[[ 1.72830893 0.04298404 1.47559771]
[-2.383999 -1.23785682 -2.24527617]
[-0.45571222 -0.07401469 -0.69042453]]
The function also generates a heatmap to visualize the attention weights (softmax
values). A darker shade indicates greater correlation, and a lighter shade indicates lesser
association. In Figure 2-12, you can see the diagonal shades to be darker as compared
to the rest of the shades. This is because the words correlate the most to themselves.
Additionally, the correlation between “are” and “you” is greater than the correlation
between “are” and “how.”
41
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
Multi-head Attention
So, you now have a fair understanding of the self-attention mechanism and how the
transformers leverage it to establish correlations among words in an input sequence.
However, language is complex and one attention module might not be sufficient to cater to
the complexities of the language. Let me explain this to you with the help of an example.
This sentence answers multiple questions, like who wants to visit Rome? Where is
Rome situated? What is the capital of Italy?
To solve this problem, the transformers utilize multiple heads of attention to capture
different contexts and attend to a variety of information. The intuition behind this idea is
that each head is capturing a different aspect of the language, making the model richer
with context. In input sequence X, for token xi each attention head creates a different
output vector yhi. The vectors from each head are concatenated and passed through a
linear transformation to bring down the dimensionality.
Think of the multi-head attention module with “h” different copies of a single self
attention head. These heads are computing the attention vectors with their own set of
query, key, and value vectors. Now you might think that running multiple heads will
42
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
increase the complexity, taking up a lot of time to create output vectors. To save time
and reap the benefits of multi-head attention, the dimensionality of the query, key,
and value vectors gets reduced such that each head is assigned lowered projections of
these vectors. However, the output vector from each head is concatenated and linearly
transformed to get back the original dimensions. The process is illustrated in Figure 2-13.
Self Attention 1
Positional Encoding
We discussed in the previous sections that the input to the multi-head attention
module is a sequence which has been converted into vector representations or word
embeddings. The nature of the attention mechanism is permutation invariant. This
implies that no matter what the sequence of the words is, the output generated would be
the same because attention doesn’t take into account the order of the words in the input.
It treats the input as a set of words rather than a sequence of words.
43
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
In order to preserve the order of the words and to encode positional information
of the words in the sequence, positional encoding is used. The implementation is very
simple and intuitive. In the original paper, the authors presented sinusoidal functions
to preserve this information. The dimension of the positional vector is the same as the
word embedding, and two are added to form a more context-aware vector. The diagram
in Figure 2-14 illustrates how positional encodings are created.
So how are positional encodings calculated? A function that can handle the
complexities of varying input length while capturing the relative distance between words
seems to be a good choice. This flexibility is offered by sinusoidal functions. For pos
position of the ith dimension where d is the size of the word embedding, the positional
encoding is calculated as mentioned below:
PE(pos,2i) = sin(pos/10002i/d)
PE(pos,2i+1) = cos(pos/10002i/d)
Let me explain it with arbitrary values. Suppose the size of the embedding is 4, then
for the input sequence “How are you,” the positional encoding for the word “Hello” can
be calculated like the following.
For the word “Hello,” the index position is 0, thus
44
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
This way, the positional encoding can be calculated for all the tokens in the input
sequence. Additionally, Figure 2-15 illustrates the positional encoding matrix for the
input sequence - “How”
Note that the positional encoding is fixed and deterministic, i.e., throughout the
training these vectors remain the same and are not learnable.
45
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
Feed-Forward Network
A fully connected network is also used in the transformer architecture. The input
to this fully connected network is the output from the attention mechanism. In
this feed-forward network, there are two linear transformations and one nonlinear
transformation.
46
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
f(x) = max(0,x)
In the equation above, (xW1 + b1) is LT1 (linear transformation 1), and the activation
output of linear transformation 1 goes through another linear transformation.
That’s all! You have now learned about all the components inside the transformer
architecture individually, so now let’s hop on to understand how the input gets
converted to output and how these ingredients are mixed together to form a delicious
dish, the transformers.
47
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
Both encoder and decoder blocks are made up of subcomponents; let’s look at the
working of this entire encoder-decoder mechanism with the help of the sentence “How
are you,” which we want to translate into Hindi.
In the original paper, “Attention Is All You Need,” the authors proposed six stacked
layers of encoder as well as six stacked layers of decoder. Each encoder block is made
of a multi-head attention layer and a feed-forward network, along with two blocks of
add and norm as shown below. This allows the encoder to convert the input into the
hidden state or latent space. The input which is “How are you?” is first converted to
word embeddings. In the next step, the positional encodings are summed with the
word embeddings to generate a position-aware input vector, which then goes into the
multi-head attention layer, and the output from the multi-head attention layer goes
to the feed-forward network to produce vector representations for the decoder. Both
layers (attention and feed forward) are followed by an add and norm layer, i.e., residual
connection and normalization. The diagram in Figure 2-17 summarizes the entire flow
in an encoder block.
The decoder block has similar components to the encoder block, but there are a few
things which vary. Let’s look at the functioning of the decoder.
The task of the decoder is to convert the output from the encoder into the desired
output (in this case, generate the output in Hindi). The decoder has two layers of
multi-head attention unlike the encoder. However, each multi-head attention layer is
responsible for different things. The feed-forward layer and the add and norm blocks
remain the same as the encoder block. Finally, there is a linear layer and a softmax layer
48
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
which gives the probabilities for the next word. Additionally, the nature of the decoder is
auto-regressive, i.e., it uses the previously generated output to generate the new output.
This is demonstrated in Figure 2-18.
The decoder starts decoding by generating the first token, which can be called the
beginning of sentence or <bos>, start of sequence or <sos>, <start>, etc. This token
signals the decoder to start generating the output, and the model keeps generating the
output until it reaches the end of sequence or <eos> or the maximum length.
Based on the context vector generated by the encoder for the English sentence “How
are you?” and the <bos> token, the decoder will generate the next token, “^H." Now the
generated token, “^H," is appended to <bos>. So, <bos> ^H becomes the next input
for the decoder, which will then generate the next output from the decoder, i.e., "^^." In
the next step, the input to the decoder becomes <bos> “^H ^^" which generates the
output “S'?" This will append to the input, “<bos> ^H ^^ S'?" and generate the final
token <eos>. The decoding process will stop here and the English sentence “How are
you?" to Hindi sentence “^H ^^ S?"
The decoder has masked multi-head attention, which allows the prediction of future
tokens on the basis of previous tokens, i.e., the prediction for position i will depend only
on the positions which are not greater than i. The masking is achieved mathematically
by adding minus infinity to the scaled attention scores. The infinity is converted to zero
in the softmax layer, thus limiting the model’s access to the future words. The picture in
Figure 2-19 demonstrates what the masked scores might look like. So, the first multi
head attention layer is used to compute attention scores in a similar way as the encoder
does except for the difference in masking which is applied after the attention scores for
the decoder input are scaled.
49
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
The output from the first multi-head attention layer goes as input to the second
multi-head attention layer and becomes the query vector for the attention mechanism,
while the key and the value vector comes from the output of the encoder. This allows the
decoder to attend to every value in the input sequence of the encoder. This layer helps
the architecture imitate the characteristics of a sequence-to-sequence model. Let me
catch this opportunity to list the three ways in which the self-attention mechanism is
used in the original encoder-decoder-based transformer architecture:
1. Encoder multi-head attention mechanism: This self-attention
mechanism is applied in the encoder to establish associations
between various words in an input sequence.
2. Decoder multi-head attention mechanism: This self-attention
module is implemented as the first layer of self-attention in the
decoder part of the transformer. The job of this layer is similar to
the encoder as it tries to attend to all the previous words generated
by the decoder, making sure that future words are masked.
50
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
All the reasons listed above make the transformers unbeatable with the self-attention
mechanism.
Circling back to the decoder again, the output from the second multi-head attention
layer goes to a pointwise feed-forward attention network, which generates vectors of the
size of the embeddings. The output of the feed-forward network becomes input for the
linear layer, which will generate logits and change the dimensionality from the size of the
embeddings to the size of the vocabulary as the logits are over the vocabulary. Lastly, the
logits are converted into the probabilities, and the token with the highest probability is
generated as the output. The decoder architecture is demonstrated in Figure 2-20.
Before I sum up this section, I would like to introduce you to a technique called
teacher forcing. This methodology has been used in the past to train the RNNs and has
also been adopted during the training process of the transformers. The analogy used
here is similar to a student in a class who is trying to learn something new. Let’s assume
the student is trying to solve a BODMAS problem; while solving the math problem, the
student performed the wrong division. If the teacher jumps in and corrects the student
at that point, then the student can learn from the past mistake and solve the remaining
problem in a correct manner. Since, BODMAS is a sequential problem, an error in the
initial steps will propagate till the final step. In a similar way, if a model is trying to learn
the patterns during the training, and at the ith step, the input fed to the model is the
ground truth for the (i-1)th step and not the model predicted value of (i-1)th step, then
the model tends to have better accuracy. Additionally, the technique helps the model to
converge faster, ensuring better learning.
51
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
That’s it! This was the transformer architecture for you. Overall, this section helped
you in understanding the nitty-gritties of the transformer and how the individual
components are glued together to form the original architecture. In the next section, you
will be looking at the various types of transformers.
Types of Transformers
I feel that we are living in the most exciting times as we are getting to be a part of a
revolution, which will turn out to be a big milestone when we look back. Every day, new
models and techniques make headlines, and sometimes you might be overwhelmed
with so much going around. It is hard to keep up with the recent advancements, and it
is possible that by the time you read this book, there might be newer things out in the
market which the book doesn’t cover. Nevertheless, please note that it’s okay to feel this
way, and you can catch up on things at your own pace.
Based on the architecture, the LLMs can be of three types broadly:
1. Encoder-decoder transformers: The classic vanilla architecture
that we covered in this chapter is the basis of the encoder-decoder
transformer. These types of LLMs have an encoder block that
produces vector representation and a decoder block which generates
output based on these representations. Additionally, these models
are known as sequence-to-sequence models. The objective task for
the model pre-training can be either predicting masked words in the
sentence or predicting the next word in the sentence. However, in
practice, these things vary, and an objective function can be more
complex. For example, Pegasus is pre-trained using Gap Sentence
Generation (GSG), where the sentences, which are very relevant in
a document, are masked completely; this implies that the decoder
is tasked to generate output with the rest of the document. These
models work the best for tasks that require generation of text based
on underlying textual understanding, i.e., the output will depend
majorly on the input like text summarization, machine translation,
etc. Below are some examples of models in this space:
• BART
• Pegasus
• T5
52
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
• RoBERTa
• ALBERT
• DeBERTa
• GPT
• GPT-3
• CTRL (Conditional Transformer Language Model for
Controllable Generation)
53
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
We looked at the categorization of models based on the architecture. But there can
also be a categorization based on their availability. Some models are publicly available,
while others are not. Table 2-1 will give you an idea of a few popular open source models
available in their market and their sizes.
Tech giants like Google, Microsoft, etc., have their own LLMs which are paid.
However, these models have demonstrated better results in comparison to open source.
You can see more about these models in Table 2-2.
54
CHAPTER 2 UNDERSTANDING FOUNDATION MODELS
You can compare the tool tables and look at the difference in model size of closed
and open source models. These companies have massive resources to build models on a
large scale, leading to disparity. Furthermore, the technology is very powerful; therefore,
it shouldn’t be in the hands of a few people. This is currently a major concern, and
people are working hard to draft regulations to monitor the usage of AI. You will learn
more about these regulations in the upcoming chapters.
Conclusion
This chapter contained a lot of technical details, and from here on, you will find that
chapters will get more technical and hands-on as you will learn how to use LLMs. So,
let’s wrap this up by concluding the topics we looked at in this chapter. In a nutshell, you
learned about the following topics:
• Basics of foundation models
• Transfer learning
• Transformer architecture
• Different types of LLMs
In the next chapter, you will learn about fine-tuning and how LLMs can be
customized so that you can build applications using generic models.
55
CHAPTER 3
If you have ever taken a guitar lesson, then you know that the first step is always tuning it.
The tuning process makes the guitar playable. Similarly, fine-tuning is one of the many
ways of adapting the generalized LLM to your custom use case. But why is it required?
Think of it from a student’s perspective. If you are having a problem, then you can go to
anyone with more knowledge and experience, but if you are facing difficulty in solving
a specific algebraic problem, then you would go to your math teacher. This decision
making process, where you choose the math teacher over another teacher, happened
because you are aware that you will benefit from the teacher’s expertise in math. A
similar analogy can be applied to LLMs. An LLM is a generic model trained to predict the
next best word, and using it directly might not give you the desired results. Let’s look at it
with the help of an example.
If you ask the question “What is the capital of India?” from a base LLM, then it might
return responses as shown above. This happens because the model is trained to predict
the most probable token, and during the pre-training, the model might have seen a quiz
57
© Bhawna Singh 2024
B. Singh, Building Applications with Large Language Models, https://doi.org/10.1007/979-8-8688-0569-1_3
CHAPTER 3 ADAPTING WITH FINE-TUNING
on India. It hasn’t learned to answer the questions, so you can’t expect the model to
behave in that way. However, if you fine-tune the model in a way that it learns to follow
the instructions, then the model might behave as follows:
User: What is the capital of India?
58
CHAPTER 3 ADAPTING WITH FINE-TUNING
Transfer learning is the underlying capability that makes fine-tuning possible and
saves developers from reinventing the wheel. In the previous chapter, you learned about
transfer learning, and in this chapter, you will see it in action by applying fine-tuning to
the base models.
Decoding theFine-Tuning
If you google ways to fine-tune an LLM, you will find a number of strategies and ways.
However, as a beginner, it can be very confusing to navigate through the landscape of
techniques used for fine-tuning. I have tried to distill down different techniques for you
and when to use them.
Based on the purpose, it can be said there are two broad categories of techniques
used for fine-tuning:
59
CHAPTER 3 ADAPTING WITH FINE-TUNING
1 https://arxiv.org/pdf/2109.01652
2 https://arxiv.org/pdf/2203.02155
60
CHAPTER 3 ADAPTING WITH FINE-TUNING
Record 2:
Instruction: Answer the question below and explain the reasoning behind
your answer.
Input: Riya sends a letter to 2 different friends twice a week. How many letters does
she send a year?
Output: Riya sends a letter to two different friends twice a week, this implies that she
sends a total of 2 x 2 = 4 letters each week. To calculate the number of letters Riya sends in a
year, we need to multiply the number of letters per week by the number of weeks in a year.
There are 52 weeks in a year. This implies that we will multiply 4 (letters sent each
week) by 52 (number of weeks in a year). 4 x 5 = 208, Riya sends 208 letters in a year.
Now that you understand the format of the data, let’s look into the ways of
curation of these datasets. Generally, there are three ways to construct a dataset for
instruction tuning:
• Manual creation of datasets: The first and the foremost basic method
is to use human knowledge to create a dataset. This might not result
in a lot of records, but with a large number of people, the records
generated can be diverse and rich as compared to the other methods.
A popular dataset in this category is P3.
If you don’t want to spend time and energy in curating these, you can use the
publicly available datasets and fine-tune your models on top of those. Thanks to the
open source community which is doing a tremendous job in democratizing this powerful
technology. Datasets like ShareGPT, Dolly, LIMA etc., are available for fine-tuning.
61
CHAPTER 3 ADAPTING WITH FINE-TUNING
Okay, so you have got the data for the SFT; the next step is to choose a pre-trained
model that can be fine-tuned completely in a supervised manner. The model takes in
instructions and the optional input and predicts output tokens in a sequential manner.
The sequence-to-sequence loss, i.e., cross-entropy, is monitored until the model
converges. So how do you choose a pre-trained model for fine-tuning? There are several
factors to consider before making this decision, and they are listed below:
• Pre-training data: It is important to look for the data that the model is
pre-trained on to ensure that the model belongs to a similar domain.
Fine-tuning a base model which is pre-trained on a similar domain
as the fine-tuning task yields better results when compared to a
model which is pre-trained on data that holds no domain relevance.
62
CHAPTER 3 ADAPTING WITH FINE-TUNING
• Alpaca: The base model of Alpaca is LLaMA (7B). The base model
is fine-tuned using the dataset which has been generated using
InstructGPT. Alpaca’s performance is comparable to InstructGPT in
terms of human evaluation.
63
CHAPTER 3 ADAPTING WITH FINE-TUNING
• Dolly 2.0: The base model of Dolly 2.0 is Pythia (12B). Fine-tuning
is performed on the dataset from Databricks Dolly. This dataset
comprises instruction pairs which cover a wide range of NLP tasks,
such as classification, text extraction, etc. Dolly 2.0 beats the base
model and demonstrates comparable performance to GPT-NEOX
(20B), which has twice more parameters than Dolly 2.0.
I hope that going through these models gives you an idea of how fine-tuning aligns
the base models with the users’ objective and makes them more usable. So far, you have
looked at the dataset curation process and the popular fine-tuned models. The fine
tuning process requires a lot of processing, which implies that you should have a basic
understanding of a GPU as it is used to enable faster computations. So, without waisting
much time, let’s jump into understanding the core components of GPUs that you should
care about and develop an intuition about the computational resources.
What Is a GPU?
GPU stands for Graphical Processing Unit. It is also known as graphics card or video card
because it was primarily developed to render graphics in videos, movies, video games,
etc. Due to its capability to perform parallel processing, it has become a crucial tool in
64
CHAPTER 3 ADAPTING WITH FINE-TUNING
AI. The advanced algorithms and massive data processing require powerful hardware,
and GPU serves this purpose very efficiently, making our lives easier. In the current
scenarios, NVIDIA, AMD, and Intel are the biggest players of this industry.
GPU specifications that you should care about:
CUDA: CUDA or Compute Unified Device Architecture is a programming model
or a platform which has been developed by NVIDIA. This programming model allows
unification of CPU and GPU. Think of an ML task as a workload which can be expressed
as multiple operations. CUDA enables execution of sequential tasks on CPU and
computation-related tasks on GPU which excels in parallel processing. Thus, CUDA
helps you in speeding up your computations.
CUDA Core: A CUDA core is a processing of GPU. Intuitively, you can think that a
CUDA core is roughly similar to a CPU core. The more the number of CUDA cores is,
the better it is, because they can harness the power of SIMD (Single Instruction Multiple
Datastream) architecture. Advanced GPUs have thousands of CUDA cores which help in
achieving parallelism.
Tensor Core: You learned about the transformer architecture in the previous
chapter. A transformer is the backbone of most of the LLMs today. There are multiple
layers in a neural network which perform several mathematical operations like
matrix multiplication (matmul), which means hours of compute time. Tensor cores
are specialized processing units which help in faster execution of operations as they
are optimized to perform mixed-precision computation. It is called mixed precision
because it uses different precision levels while performing an operation to speed up the
process without much affecting the accuracy. Precision is the standard convention for
representing float numbers in binary. Double-precision format is 64 bits, single
precision format is 32 bits, and half-precision format is 16 bits. Mixed precision gives
users the flexibility to perform operations in 16 bits but store the result in 32 bits, thus
making computations faster. Unlike CUDA cores which are used for generic tasks,
tensor cores are specifically designed for improving the efficiency of matmul operations
which are very common in deep learning.
Memory Bandwidth: A GPU has multiple memory components or memory
subsystems, which are as follows.
65
CHAPTER 3 ADAPTING WITH FINE-TUNING
Memory bandwidth is a metric that indicates the rate at which the data is transferred
between different memory subsystems in the GPU. It is an important metric that
indicates the speed. It is expressed as gigabytes per second (GB/s). Higher memory
bandwidth implies better GPU performance.
Usually, you can get good fine-tuning results with advanced GPUs like V100 or A100.
66
CHAPTER 3 ADAPTING WITH FINE-TUNING
GPU Usage
As you start fine-tuning, you will realize that it is computationally expensive. Although it
is less expensive than pre-training, there are still billions of parameters to deal with. Your
GPU storage is consumed as soon as you start loading the model. Furthermore, there are
several components during fine-tuning that also consume space, such as activations,
gradients, optimizer states, etc. There are some techniques to optimize fine-tuning. Let’s
look at the most popular ones:
67
CHAPTER 3 ADAPTING WITH FINE-TUNING
Alignment Tuning
Having an LLM which follows users’ instructions isn’t sufficient; its behavior should
align with human values, i.e., the model should be honest, harmless, and helpful.
Alignment tuning ensures that the model is safe to use. So how does alignment tuning
vary from the SFT that we covered in the previous section? The answer is simple; we
can only align the models to human values with the help of humans. This tuning is also
reinforcement learning with Human Feedback or RLHF.
Building a RLHF system can be decomposed into three components:
1. Pre-trained LLM: The model which needs to be aligned.
2. Reward LLM: The model which returns output using human feedback.
A RLHF-based system can be created using three stages. Let’s dive into these stages
and understand how the components listed above interact to build together a model that
works on the basis of RLHF.
Stage 1 - Instruction Fine-Tuning: The first step is to fine-tune a pre-trained
model using instruction fine-tuning or supervised fine-tuning (SFT). This will align
the model’s behavior with the users’ needs. It will learn to imitate the instruction
following demonstrated through a dataset containing records of instructions and their
corresponding outputs as depicted in Figure 3-2.
68
CHAPTER 3 ADAPTING WITH FINE-TUNING
Stage 2 - Getting Reward Model: The second stage in building a RLHF system is to
build a reward model. The LLM from the previous stage is used to generate outputs for
the instructions which are either created by human annotators or are sampled from a
dataset. The output generated by the model is then evaluated by the human annotators.
They are generally asked to rank the outputs in the desired level of acceptance. Based
on the preference ranking of the annotators, a reward model is trained which learns to
predict human-preferred responses. Figure 3-3 illustrates stage 2 of RLHF. Additionally,
the size of the reward model is noted to be smaller than the size of the pre-trained
model. For example, InstructGPT is based on the pre-trained model GPT-3 (175B), and
the reward model has only six billion parameters.
69
CHAPTER 3 ADAPTING WITH FINE-TUNING
move, and penalty is given on the wrong move. This strategy strengthens the model’s
learning. The key components in RL are
• Reward and penalty: The reward and penalty function is built to give
feedback to the model, and with the help of feedback, the model
learns. For example, in InstructGPT, a penalty is incorporated to
make sure that the model doesn’t change much from the original
model during the alignment training or RL-based fine-tuning.
InstructGPT does so by calculating similarity between the response
generated by the original LLM and the response generated by the
LLM which is being tuned for each instruction. The reward optimizes
the model’s behavior and makes it learn human-preferred output.
70
CHAPTER 3 ADAPTING WITH FINE-TUNING
This is the three-stage process of fine-tuning the model with human preference.
However, there are several other methods which refine the process. For example, Direct
Preference Optimization or DPO eliminates the second stage of reward modeling by
introducing data in a different format. A dataset is used which contains both preferred
and nonpreferred responses, and the model is trained to predict the likelihood of the
preferred response. DPO is a simpler and more stable approach than RLHF. Apart from
DPO, there are approaches like CoH, Quark, FIGA, etc. With the popularity of LLMs, you
can expect more techniques to rise.
In the last two sections, you saw the two major techniques for fine-tuning LLMs:
instruction tuning or SFT (supervised fine-tuning) and RLHF. Table 3-1 compares both
the techniques.
SFT RLHF
Objective SFT is implemented majorly with an RLHF is implemented majorly to align the
intention of modifying the behavior of model with human values like helpfulness,
the model, i.e., training the model to honesty, and harmlessness
follow instructions
Training The problem is treated as a The problem is formulated as a
process supervised machine learning problem; reinforcement learning problem; popular
sequence-to-sequence loss is RL algorithms like PPO, etc., are used for
monitored training purposes
(continued)
71
CHAPTER 3 ADAPTING WITH FINE-TUNING
SFT RLHF
Technique SFT uses teacher-forcing technique RLHF uses a reward modeling technique to
underneath (discussed in the first chapter) to induce human attributes in the model
unlock the instruction following
capability of LLM
Hallucination SFT trained models are more likely to Unlike SFT, the RLHF process works on
hallucinate if the instruction data fed contrasting the responses, i.e., the model
to the model is not at all similar to the has to choose between a good response
data fed to the model during pre and a response. This helps in mitigating
training, i.e., data is beyond the scope hallucination as the model is no longer
of the LLM being forced to copy a behavior
Dependency SFT just requires a raw pre-trained RLHF requires a SFT trained model to start
model and a dataset of instructions to with. If the SFT model isn’t available, then
fine-tune a model training one would be the first stage of
implementing the RLHF process
Usage SFT is used directly after pre RLHF is a step-up over SFT. It takes the SFT
training of the model, so if you want trained model’s performance one level up
to increase the model capability
and build the instruction following
capability, you should go with SFT
Adapter Tuning
Adapter tuning involves integrating small and lightweight neural network modules
called adapters in the transformer architecture. The proposed idea behind the adapter
is that its bottleneck architecture compresses the original vector into a lower dimension
and then recovers it back to the original dimension. During fine-tuning, the original
parameters are kept frozen, and only the adapters get trained. This process helps in
vastly reducing the parameter size. Let me explain this with an example.
Let’s assume that you have a transformer-based LLM architecture with a vector
size of 2048, and it will reproject the vector into the same dimension. This gives you the
following parameters:
2048*2048 = 4194304
However, if you include an adapter in this architecture, which lowers the vector size
from 2048 to 32, and then recast it again to 2048, then the number of parameters is
2048*32+32*2048 = 131072
This reduces the parameter size by more than 96%. Therefore, this is a much faster way
of fine-tuning. Research shows that same results can be obtained using PEFT methods
as full fine-tuning with lesser computational resources. The adapters can be based on a
variety of tasks that you want to customize a base LLM for. For each task, adapters can learn
newer representations that are specific to the task. The adapters can be inserted after the
core layers of the transformer architecture, i.e., attention layers and feed-forward layers. See
Figure 3-5 to understand how insertion of the adapter module occurs between the layers.
Soft Prompting
Soft prompting is a research area which has gained extreme popularity because it allows
developers to include task-specific information in the model itself instead of including
it during the inference time. Soft prompting allows using the same base model for a
73
CHAPTER 3 ADAPTING WITH FINE-TUNING
variety of tasks, hence saving the efforts of training different models for different tasks.
Furthermore, in this methodology, one only needs to deal with parameters that are
related to the task-specific prompt and not the entire model. Let’s look at two of the
major fine-tuning techniques that leverage this methodology: (1) prefix tuning and (2)
prompt tuning.
• Prefix tuning
Prefix tuning is based on soft prompting. In this technique, task
specific vectors called prefix vectors are prepended to each layer
in the transformer, as a sequence of prefixes. These vectors are
continuous and trainable. The vectors are optimized with the help
of a parameterization technique. The idea is to train an MLP (Multi
Layer Perceptron), which takes in small matrices and returns the
parameter matrices for prefixes. The parameterization helps in
avoiding the direct optimization of prefixes. Once the parameter
matrices are obtained, the MLP is discarded and only the obtained
matrices are kept to improve the base LLM’s performance on a
task. Prefix tuning saves developers from full fine-tuning by only
allowing the parameters related to prefixes to be trained, thus
leading to a milder burn in pocket while spending for computation
resources. Figure 3-6 depicts the prefix tuning process.
• Prompt tuning
W = W + AW
75
CHAPTER 3 ADAPTING WITH FINE-TUNING
The key idea of LoRA is to break down the weight update matrix into two other
matrices such that their rank factorization results in the matrix of the same dimension.
This process is also known as decomposition. The rank in LoRA (Low-Rank Adaptation)
is a controllable parameter. Let’s say the rank is k, then
A W = WAm x k . WBk x n
Let’s do the math and figure out the percentage by which the size is reduced. Let’s
say AW has dimensions 2048 X 1024. Now let’s decompose this matrix into smaller
matrices such that the value of rank or parameter k is 8, then the number of parameters
is 2048 X 8 + 8 X 1024 = 24576.
The decomposition reduces the parameters from 2,097,152 to 24,576; approximately
98% of parameters remain unchanged, and the results obtained are at par with full fine
tuning. This happens because only the smaller decomposed matrices are trained and
not the whole model. LoRA is currently a well-performing technique and has become a
standard fine-tuning technique. Figure 3-8 demonstrates the LoRA technique, where you
can see how the decomposition of matrices results in reduction of parameters.
So far, you looked at the techniques that reduced the size of trainable parameters,
thus bringing down the budget for computation. However, working with these models
also requires extensive storage due to their size. Therefore, there are techniques
76
CHAPTER 3 ADAPTING WITH FINE-TUNING
to address the memory issues of these LLMs. Mostly, the techniques are based on
quantization. The main ideology behind quantization is to compress the neural
network-based model. This implies that the transformer-based architecture, expressed
by weights and activations, can be converted into integer values. Generally, the values
of the weights and activations are expressed in floating-point conventions, but with
quantization these values can be compressed and expressed as integers. This is an
ongoing field of research, but let me mention an important technique here called QLoRA
or Quantized Low-Rank Adaptation.
As you have gained a good understanding about the topic, you can now start fine
tuning publically available open source models. The purpose here is to just demonstrate
the ease of implementing these techniques. To avoid any expensive computational
resources, I have chosen resources which you can utilize to try fine-tuning without
spending a penny.
This demonstration uses an open source model called DistilBERT, which is a lighter
and faster version of the BERT model, and it has been pre-trained on the BookCorpus
data.3 Additionally, I have used the IMDB movie review dataset to fine-tune the model
for the sentiment analysis task. The dataset is loaded from Hugging Face and can be
accessed here.4 This code can be run in Google Colab Notebook by connecting to T4
runtime.
Now, let’s discuss the versions of packages which have been used here to generate
this code:
torch == 2.3.1+cu121
transformers == 4.42.4
peft == 0.12.0
datasets == 2.21.0
python == 3.10
So, without further ado, let’s begin. The first step is to import the necessary libraries.
To do so, run the following code block:
import torch
from transformers import DistilBertTokenizer
from transformers import Trainer, TrainingArguments
3 https://yknzhu.wixsite.com/mbweb
4 https://huggingface.co/datasets/stanfordnlp/imdb
77
CHAPTER 3 ADAPTING WITH FINE-TUNING
# Loading the IMDB dataset and using only first 500 rows for training
data = load_dataset("imdb", split="train[:500]")
Since you have loaded the data, you might want to take a peep into what it is actually.
By running the following code, you can see top 4 entries of the data. As illustrated in
Figure 3-9, the data is in the form of a dictionary, which has two key fields, text and
labels. Text is a list which contains different movie reviews, and label is also a list
containing the corresponding sentiment of the review; in this case, all are zero, implying
the reviews are negative.
data[:4]
78
CHAPTER 3 ADAPTING WITH FINE-TUNING
■ytext11 ['I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that
ax fiisi it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this
for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she
wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race
issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher,
classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity
scenes are few and far between, even then it\'s not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a
major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /xbr />I do commend the
filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic
theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film
doesnX't have much of a plot.',
'"I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesnX't matter what oneX's political views are because this film can hardly be taken
seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn\'t true. I\'ve seen R-rated films with male nudity. Granted, they
only offer some fleeting views, but where are the R-rated films with gaping vulvas and flapping labia? Nowhere, because they donX't exist. The same goes for those
crappy cable shows: schlongs swinging in the breeze but not a clitoris in sight. And those pretentious indie movies like The Brown Bunny, in which weX're treated to
the site of Vincent GalloX's throbbing johnson, but not a trace of pink visible on Chloe Sevigny. Before crying (or implying) "double-standard" in matters of
nudity, the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women: there are no genitals on display when
actresses appears nude, and the same cannot be said for a man. In fact, you generally wonX't see female genitals in an American film in anything short of porn or
explicit erotica. This alleged double-standard is less a double standard than an admittedly depressing ability to come to terms culturally with the insides of
womenX's bodies.*,
"If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /xbr />0ne might feel virtuous
for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives
(unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /xbr />0ne might better spend one’s time staring
out a window at a tree growing.<br /xbr />",
"This film was probably inspired by Godard's Masculin, f&ninin and I urge you to see that film instead.<br /xbr />The film has two strong elements and those are,
(1) the realistic acting (2) the impressive, undeservedly good, photo. Apart from that, what strikes me most is the endless stream of silliness. Lena Nyman has to
be most annoying actress in the world. She acts so stupid and with all the nudity in this film,...it's unattractive. Comparing to Godard's film, intellectuality has
been replaced with stupidity. Without going too far on this subject, I would say that follows from the difference in ideals between the French and the Swedish
jvie of its time, and place. 2/10."],
'label': [0, 0, 0, 0]}
Once the data has been loaded, the next step is to convert the data into embeddings,
which will be done by running the following code block:
def tokenize_function(reviews):
return tokenizer(reviews["text"], padding="max_length",
truncation=True)
tokenized_data[:4]
However, you can solve this problem by just formatting the data in the following way:
tokenized_data.set_format("torch", columns=["input_ids",
"attention_mask", "label"])
Figure 3-10 illustrates the embeddings in a nice and structured way and solves the
problem which you encountered while looking into the data without formatting.
79
CHAPTER 3 ADAPTING WITH FINE-TUNING
The next step is to define and apply the LoRA config to the model by using the PEFT
library. LoRA config has a lot of parameters, but I have used only the following; feel free
to play around with more parameters:
peft_config = LoraConfig(
r=8,
lora_alpha=32,
target_modules=["q_lin", "v_lin"],
lora_dropout=0.1,
bias="none"
)
# Applying the defined config to the model
peft_model = get_peft_model(model, peft_config)
80
CHAPTER 3 ADAPTING WITH FINE-TUNING
The next step is to define the training arguments. These are the standard arguments
which help in running the training process.
81
CHAPTER 3 ADAPTING WITH FINE-TUNING
Once the model has been trained and saved, you can reload the model and run
an inference. The sentence I used here is “I love this shite movie!” This is a complex
statement, and it can be interpreted as positive at first but is actually negative as it
inculcates sarcasm. To run the inference, execute the following code block and feel free
to modify the sentence you want to run inference on. Figure 3-12 illustrates the output
obtained after running the inference process.
82
CHAPTER 3 ADAPTING WITH FINE-TUNING
# Running inference
with torch.no_grad():
input = {k: v.to(peft_model.device) for k, v in input.items()}
output = peft_model(**input)
logits = output.logits
predicted_class = torch.argmax(logits, dim=-1).item()
# Mapping the predicted label to the actual label
label_mapping = {0: "negative", 1: "positive"}
predicted_review = label_mapping[predicted_class]
This is how you fine-tune a model with fine-tuning. This is a very simple code and
isn’t optimized but is a good starting point for you. I hope you got a fair idea about the
technique.
QLoRA
As the name suggests, QLoRA is a modified version of LoRA, which leverages
quantization. The main idea is to quantize the weight parameters. The original
parameters are pre-trained in a 32-bit floating-point precision format, the standard
convention. However, this technique converts it from 32-bit to 4-bit format, thus making
a huge difference in terms of memory consumption. Furthermore, it creates a win-win
situation in both speed of fine-tuning and memory footprint. QLoRA is an advanced
version of LoRA and is quickly becoming an enterprise’s favorite.
That’s it! Congratulations for making it so far. I know this chapter had so many
new terms and an overload of information, but you have learned today to make your
own custom LLMs! So, what are you waiting for? Get on your GPU and data and start
fine-tuning!
83
CHAPTER 3 ADAPTING WITH FINE-TUNING
Conclusion
In this chapter, you learned about various new concepts, which you can utilize to modify
the behavior of a pre-trained LLM as per your requirement. This chapter focused on the
following concepts:
• Supervised fine-tuning
• RLHF
84
CHAPTER 4
Magic of Prompt
Engineering
The true sign of intelligence is not knowledge but imagination.
—Albert Einstein
Sitting on the ground, experiencing mixed feelings of excitement and surprise, eyes
glued to the stage where a person standing in bright shimmery clothes pulls out a teddy
bear from his black hat, that’s how I will describe the first magic show which I attended
when I was five years old. Two decades after the event, I tried ChatGPT for the first time,
and my feelings were exactly the same, surprised and excited! It just feels like magic.
Use your keyboard to cast a magic spell and witness the magic happening on the screen.
Speaking of ChatGPT and magic, I recall “Tom Riddle’s diary,” from the Harry Potter
series. The diary responded back to the person writing in it. LLMs which demonstrate
an expertise in the English language feel like a personification of this diary. In a fancy
way, you can say that prompts are the magic spells that one uses to interact with LLMs. A
good LLM response depends on a well-crafted prompt. Without further ado, let’s begin a
journey and make you a master of writing excellent prompts.
Understanding aPrompt
Introduction
In the previous chapter, you learned about fine-tuning, which helps you to tailor an LLM
as per your needs. So, once you obtain the desired LLM, the next step is to utilize it; this
is where a prompt comes into the picture. A prompt can be defined as an input, which is
provided to a model usually in text format (however, multimodal LLMs support formats
85
© Bhawna Singh 2024
B. Singh, Building Applications with Large Language Models, https://doi.org/10.1007/979-8-8688-0569-1_4
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
other than text as well, like image, audio, video, etc.), and acts as a trigger for the model
to generate the output. The key to a good LLM response lies in the intricate details of its
prompt. Let’s understand with an example how you can modify the output based on a
prompt. Suppose you are using an LLM to generate a short report about the universe. At
first, you might start with a basic prompt like the following:
Prompt 1: Give me a short report about the universe.
This is an example of a basic prompt which will send a message to the model to
generate a report about the universe. The prompt lacks clarity and intricate details, but
let’s see how this prompt performs. To test this prompt, I used ChatGPT Plus (GPT-4)
and got the response which is depicted in Figure 4-1.
The universe is a vast and expansive entity that encompasses all of space, time, matter, and
energy. Here’s a concise overview of its key aspects:
1. Origin and Age: The universe began with the Big Bang, approximately 13.8 billion years ago.
This event marked the beginning of the expansion of the universe and the creation of all matter
and energy.
2. Structure and Composition: The universe is composed of ordinary matter, dark matter, and
dark energy. Ordinary matter includes stars, galaxies, planets, and all other visible objects, and it
makes up about 5% of the universe's total mass/energy content. Dark matter, which does not
emit or interact with electromagnetic radiation, contributes about 27%, and dark energy, believed
to be responsible for the accelerated expansion of the universe, comprises about 68%.
3. Cosmology: The study of the universe’s origin, evolution, and eventual fate is known as
cosmology. This field utilizes theories of physics and astronomical observations to build a
comprehensive understanding of the universe as a whole.
4. Observable Universe: The observable universe refers to the portion of the entire universe that
we can see or observe from Earth, limited by the speed of light. Light from objects farther away
than about 46 billion light-years has not had enough time to reach us since the beginning of the
cosmological expansion.
5. Future of the Universe: Theories about the future of the universe include the Big Freeze, Big
Rip, and Big Crunch, each depending on various factors like the rate of expansion and the
properties of dark energy. 4,
86
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
The response is good, but I don’t like the way it has been presented. It doesn’t
contain a title and is written in bullet points. I want the report to be short, 800 words
maximum. Further, I want the report to be more technical and structured. So, I am going
to use a better prompt to get my desired output.
Prompt 2: Give me a technical report about the universe in 800 words max. Make
sure that it has a title and is presented in paragraph format rather than bullet marks.
I again tested this prompt on ChatGPT and found that the response was well under
800 words with more technical jargon than the previous output. In Figure 4-2, notice
how the report is structured in a paragraph format and contains a title too. In conclusion,
you get better responses with a better prompt.
Observations of the cosmic microwave background radiation, a relic from an early stage of the
universe, along with the distribution of galaxies and their redshifts, provide substantial empirical
evidence supporting the Big Bang theory. These observations are bolstered by measurements
from various space observatories and experiments, including the Hubble Space Telescope and
the Wilkinson Microwave Anisotropy Probe.
The composition of the universe is predominantly dark energy and dark matter, with ordinary
matter making up only about 5% of the universe's total mass-energy content. Dark energy, which
comprises about 68% of the universe, is an enigmatic force that drives the accelerated expansion
of the universe and remains one of the most significant mysteries in physics. Dark matter,
accounting for about 27%, does not emit or absorb light but exerts gravitational effects on visible
matter and radiation. The detection and study of dark matter are critical for understanding the
structure and evolution of galaxies.
87
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
88
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
Introduction
In the vast expanse of the cosmos, our universe presents an intricate tapestry of phenomena that
challenges our understanding and fuels our curiosity. From the enigmatic behavior of dark matter
to the intricate dance of galaxies, this presentation aims to explore the most compelling mysteries
that define our current frontier in astrophysics.
The universe is governed by the fabric of spacetime, a concept introduced by Einstein's theory of
General Relativity. This framework not only describes how matter and energy influence the
curvature of spacetime but also underpins our understanding of cosmic events such as black hole
mergers and gravitational waves. Recent observations, including the detection of gravitational
waves by LIGO and Virgo, have opened new avenues to explore the universe, revealing details
about black hole binaries and neutron star collisions that were previously beyond our reach.
Despite accounting for approximately 85% of the total mass of the universe, dark matter remains
one of the most elusive components in cosmology. Its presence is inferred through gravitational
effects on visible matter, but its nature continues to be a subject of intense research. Similarly,
dark energy, which constitutes about 68% of fhe universe, is responsible for the accelerated
xL
expansion of the universe. The interplay beE._en dark matter and dark energy shapes the
89
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
These reviews become the input for the prompt. Thus, the
input helps the model in identifying the values on which
instructions are needed to perform. The output of the prompt
is depicted in Figure 4-4.
1. Review 1: Negative
2. Review 2: Sarcastic/Negative
3. Review 3: Neutral
0» (3 0 9 •0*''
90
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
• Concept: Create a strong, memorable hashtag (#SafeHer) to unite the campaign. Encourage
users to share their stories, tips, and experiences related to women's safety using this
hashtag.
• Execution:
• Partner with influencers to kickstart the hashtag with impactful stories or messages.
• Run a contest where users post their own safety tips or stories under the hashtag, with
the chance to be featured on the main campaign page or win safety-related products.
• Encourage organizations and brands to co-opt the hashtag to amplify its reach.
• Concept: Highlight real-life stories of women and how they have taken steps to ensure their
safety, emphasizing empowerment and community support.
• Execution:
• Create a series of short videos or carousel posts where women from diverse
backgrounds share their experiences and tips on staying safe.
The phrase “you are an expert in social media” tells the model
to generate a response which reflects expertise of a social media
manager. Let me give you another example. Suppose you have a
math problem and you want to solve it using an LLM, then you
would want the model to utilize its reasoning capabilities. Thus,
using a prompt like “Using step by step reasoning to solve the
following problem” can give you an output that uses reasoning if
you are using an LLM to solve an analytical problem.
91
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
These are the key characteristics of an effective prompt, and you may find few or all
of them in a well-articulated prompt. In the upcoming section, I will walk you through
major techniques that are used in prompt engineering, along with examples.
Required Parameters
Let’s first look at parameters of the API, which are mandatory to pass:
• Model: This parameter is used to specify the model to use. Model
values such as “gpt-4-turbo,” “gpt-3.5-turbo-16k,” “gpt-3.5-turbo,” etc.,
define the version of the model to be used for chat completion. The
decision of the model version is based on your use case. Suppose you
want to include a lot of information in the model as context, then you
should go with the model which supports a longer context window.
92
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
• Role: This field of message lets the model know “who said it,” thus
enhancing the model’s overall understanding in generating better
responses and maintaining a coherence in the conversation. The
role can have four possible values, and they are explained below:
• System: As the name suggests, a system value signifies that
the message corresponds to the system itself. It can be used
to provide additional context or to generate automated
responses.
• User: A user value signifies that the message corresponds
to the end user. The contents for user messages are usually
questions/instructions that an end user might expect the
model to answer/fulfill.
93
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
Optional Parameters
To provide a programmer more control over the responses, there is a list of some
parameters provided by OpenAI. The following are some of the crucial parameters of the
chat completion endpoint:
94
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
95
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
You can run these examples in any IDE; just remember that you should have an
OpenAI API key.
The first step is to install the openai library by running a simple command:
The next is to load necessary libraries and import utilities required. For
demonstration purposes, I have made a Python file called “constants” to store my API
key as APIKEY. In practice, you will store this key as an environment variable:
import openai
import os
from constants import APIKEY
After the necessary imports have been done, the next step is to authenticate the
API. You will have to pass your API key to make a valid request to the API. This can be
achieved by running the following code:
openai_key = APIKEY
client = OpenAI(api_key = openai_key)
response = client.chat.completions.create(
model="gpt-4o",
96
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
messages=[
{"role": "system", "content": "You are a helpful assistant who will
help the user with their queries"},
{"role": "user", "content": "Extract entities in the following sentence:
''' Priya and I visited Paris and ate croissants worth 500 euros. ''' " }]
print(response.choices[0].message.content)
1. Person: Priya
2. Location: Paris
3. Money: 500 euros
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant who will
help the user with their queries"},
{"role": "user", "content": "Translate the following sentence into
Hindi:- ''' I am craving mangoes currently''' " }]
)
print(response.choices[0].message.content)
97
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant who will
help the user with their queries"},
{"role": "user", "content": " Guess the third value of the sequence
and explain your answer in just one line. ''' ANC, CPD, ERE, ____ ,
IVG ''' " }],
print(response.choices[0].message.content)
98
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant who will
help the user with their queries"},
{"role": "user", "content": " Help me write a python code for
calculating prime factors of a number. Make the code as short as you
can " }],
)
print(response.choices[0].message.content)
'''python
def prime_factors(n):
i = 2
factors = []
while i * i <= n:
if n % i:
i += 1
else:
n //= i
factors.append(i)
if n > 1:
factors.append(n)
return factors
# Example usage:
number = 56
print(prime_factors(number))
This function 'prime_factors' takes an integer 'n' and returns a list of its prime factors. The example usage calcu
lates the prime factors of 56.
99
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant who will
help the user with their queries"},
{"role": "user", "content": "Generate a poem in 4 lines about a
pen " }],
temperature=0.9,
frequency_penalty=1.3
print(response.choices[0].message.content)
So, you have now seen how advanced models can be used with a single instruction,
and they can start following your command without additional help. But how can you
use models and teach them a certain behavior by giving examples? That’s where the few
shot prompting comes into the picture.
100
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
Few-Shot Prompting
If you have already tried zero-shot prompting, and it is not working well with your use
case because it involves teaching a new concept to the model or requires output to be in
a custom format, then you should try few-shot prompting. The only requirement of using
this technique is that your prompt should include certain examples demonstrating the
type of output you are seeking. Let me demonstrate it here. The following call request is
asking the model to help a screenwriter in developing a fictional character who has got
certain eccentricities. To teach the model about the character, there are three examples
included in the prompt; therefore, this is an example of three-shot prompting. Each
example depicts a conversation between the character and another person, which helps
the model learn about the character. You will actually be surprised with the model’s
output and how well the model captured the behavior of the character and reproduced it
in a scene. See Figure 4-11 for the API output. Furthermore, you will notice the usage of
(''' ''') triple quotes in the prompt. This helps the model in separating the instruction with
the examples. Different types of delimiters can be used, and you will learn more about
them in the design principles, which are discussed later in this chapter.
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
'role': 'system',
'content': 'You are an assitant to a screenwriter and you have to help him
develop a fictional character use the examples delimited by tripple quotes
to understand the character. '
'''
Scene 1: Talking to a Barista.
Character: Can I have a magical coffee?
Barista: Yes, which coffee though ?
Character: I want an oat latte with a dash of magic.
Barista: Here is your oat latte. Hope, it makes your day magical.
'''
'''
Scene 2: Walking on the road, the character bumps into a cyclist.
Character: Oh sorry! this is not magical.
Cyclist: What !!!
101
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
},
{
'role': 'user',
'content': 'Generate a scene where the character is talking to his
colleague in his office. in only 100 words.'
}],
temperature=1
)
print(response.choices[0].message.content)
37 print(response.choices[0].message.content)
**Scene 4: In the Office**
**Character:** No, like real magic. You know, stuff that makes everything sparkle.
**Character:** Already had my oat latte with a dash of magic. Didn't quite work.
**Colleague:** Well, maybe it's not the coffee. How about we tackle those emails first?
**Character:** Emails... not quite magical, but okay. Here's hoping for enchanted inboxes
102
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant who will
help the user with their queries. The examples are delimited for your
understanding."
'''
Q. Peter has 5 dollars. He gets 9 dollars more from his grandfather and
15 more from his father. He bought an ice cream for himself for 200
cents and gave 600 cents to his friend, John. How much money does he
have now? Give me just the figure.
A. 21 dollars.
Q. Ram has ninety one watermelons. He gives 5 to his sister and sells 6
each day. In how many days Ram will be out of all water melons. Give me
just the figure.
A. 8 days.
103
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
},
{"role": "user", "content": "Ama and I are going to buy earrings. I
bought twenty pairs and gave 5 to Ama who also bought 25 pairs and
gave me three. How many earrings do I have now ? Give me just the
figure. "}]
)
print(response.choices[0].message.content)
You see the answer is 38 pairs, which is incorrect. However, if we decompose the
procedure of solving such problems and inform the model about the same, then not only
will the model give you the correct response but will also demonstrate how it achieved it.
I rephrased the prompt to include problem solving in the following manner.
Q. Peter has 5 dollars. He gets 9 dollars more from his grandfather and 15 more from
his father. He bought an ice cream for himself for 200 cents and gave 600 cents to his
friend, John. How much money does he have now? Give me just the figure.
Step 1. Peter has 5 dollars.
Step 2. He gets 9 dollars from his grandfather and 15 from his father.
104
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a helpful assistant who will
help the user with their queries with step by step reasoning as
demonstrated in the example below"
'''
Q.Peter has five dollars. He gets 9 dollars more from his grandfather
and 15 more from his father. He bought an ice cream for himself for 200
cents and gave 600 cents to his friend, John. How much money does he
have now? Give me just the figure.
Step 1. Peter has 5 dollars.
Step 2. He gets 9 dollars from his grandfather and 15 from hisfather.
Step 3. This implies Peter has 9+5+15, i.e. 29 dollars.
Step 4. 100 centsis equal to a dollar. This implies Peter spent 2
dollars in buying an ice-cream and gave 6 dollars to John.
Step 5. Peter has 8 dollars less than what he had previously.
Step 6. Finally, 29-8=21. This means Peter has 21 dollars.
Step 7. Answer is 21 dollars.
'''},
{"role": "user", "content": "Ama and I are going to buy earrings. I
bought twenty pairs and gave 5 to Ama who also bought 25 pairs and gave
me three. How many earrings do I have now ?"}]
)
print(response.choices[0].message.content)
105
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
Self-Consistency
Can you think of a major shortcoming with the previous approach? Let’s think of it this
way. The model is trying to solve a problem in a step-by-step way, and if it does a step
wrong, then the consequent answer will also be wrong. So, how can you address such a
problem? The solution is self-consistency prompting.
The idea behind this technique is that you prompt the model multiple times using
the CoT technique, and then based on a majority vote, the best solution is selected. You
can either do this manually or inculcate such an instruction in the prompt itself.
For example, if you want to solve the same math problem using this approach, then
you can craft the prompt as mentioned below.
Self-Consistency Prompt: Imagine there are three independent math experts who
are using step-by-step reasoning to solve the problem differently. Use the majority vote
and then give the final answer.
The code below uses this technique to solve the same math problem, which was
addressed using the basic CoT approach. The output is depicted in Figure 4-14. You
can see how the model has actually solved the problem from the perspective of three
different independent math experts and has finally chosen the common response.
response = client.chat.completions.create(
model="gpt-4o",
messages=[
106
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
Let's analyze the problem step by step and then solve it using majority vote among three independent math experts,
All three experts calculate that I end up with the same number of earrings.
107
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "Imagine there are three independent
math experts who are using step by step reasoning to solve the problem
differently. At each step, the three check their calculation, if anyone
is wrong, they leave the discussion. Solve the user problem and give
the final answer using this approach."},
{"role": "user", "content": "Ama and I are going to buy slippers. I
bought twenty nine pairs and gave 5 pairs to Ama who also bought 25
pairs and gave me three slippers. How many slippers do I have now ?" }]
)
print(response.choices[0].message.content)
108
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
Let's break down the problem step by step and solve it while ensuring all calculations are checked and agreed upon
by our three independent math experts.
♦♦Expert 2:++ Confirming, giving away 10 slippers from 58 means \(58 - 10 = 48\).
♦♦Expert 3:++ Agreed, \(25 \times 2 = 50\). Ama has 50 slippers initially.
# ## Final Answer
I have 51 slippers now.
Generated Knowledge
This is a prompting technique which theoretically works the same as CoT. With
advanced LLMs, which can browse the Internet, this technique might not help, but with
relatively smaller LLMs, this technique might help to reduce the possibility of generating
incorrect responses. The idea behind this technique is to firstly prompt the model to
generate information on a topic and then utilize the same information for the task.
This technique can be decomposed into parts:
- Knowledge generation: This is the first part of the prompt where the
LLM is asked to generate knowledge, which is the factual information
about a topic.
109
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
You will use this technique if you are looking to solve a problem which requires the
model to be aware of some general knowledge or common sense.
Example: Having less score is good in cricket, right?
If the LLM is not aware of this topic, then chances are that you will get an incorrect
answer on the topic. However, if you ask the model to first generate information on
the scoring mechanism in cricket and then ask the question again, then the chances
of getting the right answer increase. This is because under the hood the model has
generated the knowledge which helps to develop CoT; therefore, the model can give a
better informed answer.
Run the code below to see a demonstration of an API request with the generated
knowledge prompt technique. The output of the request is shown in Figure 4-16.
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are an assistant which will help
user solve the task rightfully."},
{"role": "user", "content": "Firstly, generate facts about the cricket
game and then use them to write a cohesive paragraph in no more than
200 words." }]
)
print(response.choices[0].message.content)
1. Cricket is a bat-and-ball game played between two teams of eleven players each.
2. The game originated in England and is now popular worldwide, particularly in countries like India, Australia, an
d Pakistan.
3. A cricket match is dividedinto innings where one team bats and the other bowls and fields.
4. The main objective for the batting team is to score runs by hitting the ball and running between the wickets.
5. Bowlers aim to dismiss the batsmen by hitting the stumps or getting the batsman caught out.
6. The game is played on a circular or oval-shaped field with a 22-yard pitch at the center.
7. Cricket can be played in various formats including Test matches. One Day Internationals (ODIs), and Twenty20 (T2
0).
8. Test matches can last up to five days, while ODIs are limited to 50 overs per side and T20s to 20 overs per sid
e.
9. Key cricketing terms include 'century' for 100 runs, 'duck' for zero runs, and 'hat-trick' for three wickets tak
en by a bowler in consecutive deliveries.
10. The International Cricket Council (ICC) governs the game globally.
♦♦Paragraph:^
Cricket, a bat-and-ball game where two teams of eleven players compete, has its origins in England and has grown in
popularity globally, notably in India, Australia, and Pakistan. Matches are divided into innings with one team batt
ing to score runs and the other bowling and fielding with the aim to dismiss batsmen. Played on a circular or oval
field with a central 22-yard pitch, cricket offers various formats including Test matches that can span up to five
days, One Day Internationals (ODIs) limited to 50 overs per side, and the fast-paced Twenty20 (T20) with 20 overs p
er side. Key terminologies in the game include a 'century,' which denotes a score of 100 runs, a 'duck' for a batsm
an's score of zero, and a 'hat-trick' achieved when a bowler takes three wickets in consecutive deliveries. Governe
d globally by the International Cricket Council (ICC), cricket continues to fascinate and engage millions of fans a
round the world.
110
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
Prompt Chaining
Another advanced technique which can help you to solve complex problems is prompt
chaining. The idea is simple; you chain the prompts, i.e., the response generated by the
first prompt is used as input in the next prompt, and the process continues until the
problem gets solved. This technique helps a lot when the prompt is too complex and
involves multiple tasks to be solved to get the solution of the final task. Let me explain it
with an example here. Suppose you are building a chatbot for customer support; in this
case, you can develop your prompts by utilizing prompt chaining. Let’s say a customer
enters a query; now the task of the model is to greet the customer, identify the problem
from the query, offer top three solutions for the problem identified, take feedback from
the user, and thank the user.
Now this can be compiled altogether in a single prompt as mentioned below:
- “Greet the customer and extract the problem from the query. Offer
three solutions to the extracted problem. Further take feedback from
the customer and thank them.”
I have used the problem described above and decomposed it into multiple
instructions below. The code below demonstrates calling the API with the prompt
chaining technique. The output generated is quite impressive; see Figure 4-18.
response = client.chat.completions.create(
model="gpt-4o",
messages=[
111
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
112
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
These are some of the popular techniques which can help you in writing better
prompts. As the field advances, we will see more techniques and frameworks, which will
make the interaction process seamless.
- “Identify the entities in the text below delimited with four hashtags”
113
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
- “Use the documents delimited by xml tags and answer user queries”
- Assign roles to the model by using phrases like “you are an expert
in {abc},” “assume you are {xyz} of a company,” etc.
- Response style
- You can control the writing style of LLMs by mentioning the
target audience. For example, using phrases like “explain {xyz}
to me like I am a five year old kid,” “explain {abc} in simple
non-technical terms,” etc.
114
CHAPTER 4 MAGIC OF PROMPT ENGINEERING
- If you are using an LLM for creative styles, then you can mimic
the writing style of famous people. For example, “write a poem
on the brain in Shakespeare style.”
- You can also control the response by mentioning the output
format like tabular, json, list, paragraph, etc.
- Include diverse scenarios in your prompt and keep the language inclusive. For
example, “explain the benefits of education in a society and also mention the
effects on minority communities.”
- Perform rigorous testing and evaluate your prompt and check for
biased and discriminatory responses.
These are the three design principles which will guide you in writing the best
prompts, which yield the desired responses from the LLMs.
Conclusion
In this chapter, you mastered the art of writing good prompts while learning about
different concepts like
115
CHAPTER 5
Stop Hallucinations
with RAG
An investment in knowledge always pays the best interest.
—Benjamin Franklin
This quotation by Benjamin Franklin has been my driving force in life because I too
believe that any moment spent learning something new is the best moment of your life.
Throughout our lives, we have been learning. If you look at the human life trajectory,
as a child initially one learns to walk, speak, and eat with a spoon, and then the child
grows and learns some other things like riding a bicycle, going to school, and making
friends. Eventually, the child becomes a teenager where they learn about personality
development, emotional development, companionship, etc. Similarly, as an adult, one
learns about finances, social responsibility, and work-life balance, and this way a human
continues to grow and become more humane with each passing day. In conclusion,
learning makes you better.
Learning continues in AI models as well. The models are able to make predictions
because they are constantly learning patterns from data. A traditional machine learning
model requires continuous data re-training so that the model stays updated. However,
in the LLMs the process of re-training is quite expensive due to the size of these models
where parameters are in the range of trillions. So, how can you update the knowledge
base of LLMs in a cost-effective manner?
The answer is “Retrieval-Augmented Generation,” or RAG. This technique was
proposed by researchers from Facebook, University College London, and New York
University in 2020. The technique has become an important tool in the LLM
development toolkit. Additionally, LLMs are infamous for confidently producing wrong
answers, hence are unable to demonstrate trustworthiness and reliability.
117
© Bhawna Singh 2024
B. Singh, Building Applications with Large Language Models, https://doi.org/10.1007/979-8-8688-0569-1_5
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
Let me give you an analogy. Let’s assume there is a person named Mary, and she
was suffering from jaw pain. She went to her local GP for a diagnosis, who suggested
her to visit a dentist because the jaw pain was probably due to an issue in her wisdom
tooth. Though the GP is a qualified doctor, they can’t prescribe or perform an operation
on the wisdom tooth because it isn’t their specialization. However, a dentist who is
an expert in the field can perform an excellent diagnosis and help Mary to get rid of
the pain. Similarly, an LLM is trained on generic data, and it might produce incorrect
information if asked a question specific to a person/organization or about events which
are not present in the model’s training data. For example, the questions might look like
something mentioned:
This requires the model to acquire knowledge from additional data sources to
provide a valid answer. That’s where RAG comes into the picture. It is a method to
connect information from additional data sources to the user’s query and then generate
a response to the query.
Now that you understand why RAG is useful, let’s understand retrieval,
augmentation, and generation, the three components which together result in RAG.
118
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
Retrieval
This is the first component of the RAG methodology. As the name suggests, it deals with
retrieving or fetching data, and it isn’t a new thing. Traditional retrieval systems leverage
keyword search to fetch the relevant data, but the retrieval in the RAG approach is
semantic search (also known as vector search) instead of the keyword search. Semantic
based search gives results by matching the meaning of the query and the knowledge
base unlike the keyword-based search which matches the presence of exact keywords in
both the query and the knowledge base. It is to be noted that keyword-based search is
not always the best approach to retrieve data. Let me give you an example: let’s assume
you are looking for information on “green light,” and you enter the same keywords to find
data related to green light. However, instead of “green light,” you get information related
to “light green” because it exactly matches the keyword search. Languages are complex,
and a change in the order of words can result in a different meaning. In the above
example, the former refers to the traffic signal, while the latter refers to a color shade. So,
how do you get the data and how does the semantic search take place? Let’s dive in a bit
deeper to discover how these operations are performed in RAG. For simplicity, I will first
mention the different stages of the retrieval component.
Document Understanding
Data is the heart of AI. The knowledge base that you are going to build will be based on
the data gathered from the documents. Therefore, the first stage is to understand the
source of the data. Based on the format of data, you can categorize it into the following
three categories:
119
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
Once you have an understanding of your documents, the next step is to prepare them
for further transformations. If you recall from the previous chapter, an LLM is limited by
its context window. The data you are providing should fit within the limits of the context
window of your model. This implies that you need to create chunks of your documents
such that each chunk conveys semantically meaningful information. So let’s move to the
next retrieval stage: chunking.
Chunking
The second stage in building the retrieval component is chunking of the documents.
Splitting up documents into segments is a crucial step in transforming data and making
it ready for the RAG system. Dividing the data brings the question of size into the picture.
The size of these splits will influence the performance of your RAG system. Let’s say you
want to build a system which can solve financial queries of the users. You identify a good
120
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
book in finance, and you plan to use it as an external source for building a RAG-based
system. The book has 400 pages in total. Now let’s think of different ways in which you
can use these 400 pages:
1. You can use the entire book as one segment.
These are some possible ways to divide your document into chunks. Now, if you
keep the chunk size too high, then not only will it create a burn in your pocket but will
also produce inefficient results. Similarly, if you keep the size of the chunks too small,
then chances are that you might not be providing the relevant information to the model,
thus increasing the chances of hallucinations. Therefore, a careful consideration should
be done to decide how you are going to divide your documents and have a chunking
strategy in place to hit the sweet spot where the chunk size is neither too high nor too
low. Let me walk you through some popular chunking strategies:
121
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
Chunk 4: og to
Chunk 5: the
Chunk 6: park.
122
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
So, which chunking strategy will be best suited to you? It depends on your use case,
your structure of data, and the requirements of the application. In fact, you can also
combine multiple chunking strategies and go for a hybrid approach to cater to your
requirements. Once you have divided your documents into various chunks, the next step is
to transform the chunks and add metadata so that it can be utilized for further processing.
Without further ado, let’s move straight to the next section to understand what kind of
transformations is applied to the chunks and how you can add metadata to the chunks.
123
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
“There has not been a single flower on my rose plant in the last
two years. Can you tell me how I can fix that?”
To answer such a query, your application has to retrieve information on rose plants
and specifically on rose plants not producing flowers for a long time. This implies that
a search operation has to be performed on the knowledge database to extract this
particular information.
Now circling back to the previous question, how will metadata assist the
process of retrieval? There are two possible ways in which metadata can be
helpful to you as a developer. Firstly, you can use it to filter out things before
performing the search in the knowledge base. Secondly, you can use it during
124
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
the search process. So, what information can be stored as metadata? Well, that
depends on your requirements, but the following are some important fields which
can be consisedered in formation of metadata:
1. ID: An ID field can be used as a unique identifier to ensure that
there is no redundancy in storing data chunks.
These are examples of some of the fields, but depending on your use case, you can
store more information as well in metadata. Okay, so far, you have seen the process
of identifying data for building a knowledge base, segmenting the data into chunks,
cleaning the chunks, and assisting the application by providing metadata about the
chunks. The next step is to process the chunks. Computers can’t understand the
language like us. Therefore, the chunks need to be in machine-readable format, and
machines understand numbers very well. The next section highlights the concept of
embeddings which are the vector representation and the role they play in building a
RAG model.
Embeddings
This is not a new concept in NLP and has existed for a long time. I gave you a brief
historical overview about embeddings in Chapter 1. But let me dive into more
technicality here. So, what are embeddings? Embeddings are vector representations
of text (as we are dealing with textual data), which carry semantic meaning of the
125
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
126
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
The examples above demonstrate that one can leverage real value and get semantic
meaning by applying mathematical operations on the embedding vectors. In the context
of RAG, it becomes crucial to have cleaned data in chunks to ensure relevancy. So, how
are these word embeddings calculated? The embeddings or vector representations for
different words are learned after training machine learning models on large amounts of
text. I have mentioned the detailed ways of training these models in Chapter 1; feel free
to revisit them. In practice, you will generally apply pre-trained embedding models to
calculate the word embeddings for your text. However, if your data is domain specific,
then you might need to look for specialized pre-trained embedding models or train
your model to calculate the embeddings. A word can convey different meanings, and
which meaning has been learned by the embedding model is dependent on the context
in which the word is being used in the training data. Therefore, you need to choose
the right embedding model for your RAG application. For example, let’s say you have
financial data of a bank, and the word “bank” gets repeated often in the data, then you
should be careful that the embedding model which you are using is also trained on the
data, which conveys the same meaning of the bank and not the other meanings, such as
river side, road transverse, etc.
While choosing an embedding model for your application, you should consider the
trade-off between the performance and the cost. Embeddings with large vector size tend
to perform better than smaller vectors. However, the cost of storing large vectors will
increase the computational cost. Depending on your use case and desired accuracy, you
should think about the available embedding models. Embedding v3 (can be accessed
here), e5-mistral-7b (can be accessed here), and nomic-embed-text-v1 (can be accessed
here) are a few popular embedding models. Embedding v3 is by OpenAI, while the other
two are open source models. Furthermore, please make sure that you use the same
embedding model for converting the user query into a vector as you did for converting
the text stored in the knowledge base. Once you have got embeddings, you can store
127
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
them in any document database of your choice and create an index for the next stage of
retrieval. There is a common notion that you need a specialized vector database to store
these embeddings; however, it is not true. You simply need a database which allows
vector search and not the database which only allows vector search and is incompatible
with other formats of data. If you have developed any machine learning applications in
the past, then you might have worked with some kind of database, and currently these
databases are providing support for vector search. Therefore, my recommendation
would be to keep things simple and not include overhead costs and complexity.
Search
This is the last stage of the retrieval component. In this section, you will learn about the
process which enables a RAG-based application to fetch the relevant context in order to
answer user queries. The field of LLMs is still evolving, and the process of using them is
also getting better with each passing day. Many researchers are working on improving
the accuracy of the retriever in RAG because it is a fundamental part and it helps the
LLM understand a context. Therefore, the search stage of retrieval has already seen a lot
of advancements. Let me make this easy for you and break down the search process into
two stages:
Stage 1: Initial Retrieval
Once you have stored your embeddings in a database and created a knowledge base
which represents your context, then you can start using it for retrieval. In this stage, you
will have a user query, and it has to go through the same chunk cleaning transformation
which was applied to the document chunks before creating the embeddings. Next, you
will use the same embedding model which was used on your documents to transform
the user query into a vector. So, now you have a database with a lot of vector embeddings
as well as a vector representation of user query. Using these vector representations, you
need to find which chunks of data stored in the database match the most to your user
query. The idea is to fetch the documents which are semantically similar to the query.
To do so, you apply vector search and calculate the similarity. The relevant document
chunks which are closer to the user query can be discovered using distance metrics like
euclidean distance, cosine similarity, etc. Additionally, Figure 5-3 demonstrates the
top three chunks which contain the context mentioned in the query and therefore are
relatively closer in distance than the other chunks of data. Popular algorithms in this
128
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
stage are Dense Passage Retrieval (DPR), Best Matching 25 (BM25), (Term Frequency
- Inverse Document Frequency) TF-IDF, etc. If the vector space is quite large, then
advanced algorithms like Approximate Nearest Neighbors (ANN) can be used. ANN
algorithms are often used in vector databases to index data.
This way, you can get the top chunks of data based on the similarity score. However,
due to chunking, the context is spread across the chunks, and they might not reflect
true relevance. Therefore, this requires a re-ranking model to reassign scores that
demonstrate the true relevance of the chunks. Let me explain it with an example.
Suppose you have a vector q which represents the query vector, and with the initial
stage retrieval, you get top three chunks (c3, c1, c2) in the order of relevance, implying
that c3 is the most similar to query vector q, followed by c1 and c2. However, this isn’t
the actual ranking. The true ranking of these chunks has to be calculated again, and
therefore you need a re-ranking model to calculate the new order of relevance. Figure 5-4
demonstrates this process. Thus, to make a robust retrieval, you need to proceed to the
second stage of the retrieval, re-ranking.
129
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
130
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
This sums up the retrieval component of the RAG. Let me quickly give you a short
summary of the process. Firstly, you identify relevant documents to build a knowledge
base. In the next step, you perform document chunking to avoid the context window
limitations. Further, the chunked documents undergo a cleaning transformation where
you remove stop words, fix spelling errors, etc. Additionally, at this step, you also identify
and create metadata about the chunks. The next step is to create embeddings of these
chunks and store them along with the cleaned chunks and the metadata. In the end, you
perform a two-stage search operation where the first stage is semantic search and the
second stage is re-ranking. This will give you top-k relevant documents which match
your user query. Figure 5-6 illustrates the entire retrieval process.
131
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
As you know, the field is evolving, and new things come up each day. Sincere efforts
have been devoted by numerous researchers to improve the efficiency of the retrieval
component. Let’s imagine you performed two-stage retrieval with both semantic search
and re-ranking, but you are still getting irrelevant context. The saying “garbage in is
equal to garbage out” holds true in this case as well. If the retrieved context is not right,
then the answer generated by the model is likely to be flawed. What can go wrong after
being so careful with chunking strategies, chunking size, and re-ranking? Probably the
query. In real time, a user can query the model which restricts the model from retrieving
the relevant context. Can you fix it? The answer is yes, and this can be achieved using
query transformations. Let’s look at some of the ways in which user queries can be
transformed. For understanding the differences between these techniques, I will use
a query as an example and discuss how it can be transformed with the techniques
mentioned below.
Example query: “Can you tell me about RAG and what are its advantages and
disadvantages? Is it different from fine-tuning?”
1. Hypothetical Document Embeddings (HyDE)
In this approach, the semantic search is performed not only by
using the query embedding but also by using the embedding
of the generated hypothetical answer along with the query
embedding. Firstly, the query is passed to LLM which generates
a hypothetical answer, which is then converted into its vector
representation. Then the generated hypothetical answer
embedding and the query embedding are used for semantic
search. Figure 5-7 illustrates the differences between the standard
approach and HyDE.
The HyDE approach will first generate an answer to the example
query using its existing knowledge and then pass the embedding
of the generated answer and the example query - “Can you tell me
about RAG and what are its advantages and disadvantages? Is it
different from fine-tuning?” - to LLM to generate the final answer.
132
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
2. Subquestions
- What is RAG?
- What are the advantages of using RAG?
133
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
The output of the response based on the prompt for the above
query is demonstrated in Figure 5-8. In the output, you can
notice how the RAG acronym has been expanded to Retrieval-
Augmented Generation, but this can be controlled by making
some changes to the prompt.
These sub-queries will allow for a comprehensive understanding of RAG, its pros and cons, and
how it differs from fine-tuning.
Figure 5-8. Output of ChatGPT depicting how a query can be broken down into
subqueries
3. Multi-query transformations
134
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
4. Step-back prompting
Augmentation
Well, retrieval is one part of the RAG application. The second part of the methodology
(augmentation) embeds the retrieved text chunks with the user query, and the third
part (generation) uses the embedded query to solve the query. These three components
come together and make RAG a successful method to connect external data with an
LLM. This section focuses on the details related to augmentation.
After the retrieval process has been executed, the next step is integration of the
extracted context with the user query. Re-ranking is one way of identifying the right
order of the chunks; this further ensures that the model gets the context right. I have
135
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
As developers have started building applications on top of RAG, they are realizing
the shortcomings with the current methods; thus, new advancements are happening in
the field to improve the efficiency of these methods. Let’s move to the last component of
RAG: generation.
Generation
This is the final part of the RAG methodology, and it deals with taking in the context
retrieved and augmented with the user query to generate the final answer. I believe
that choosing an LLM for building this component plays a major role in deciding how
successful your application will be. The following are some crucial factors to consider for
deciding the LLM which will be leveraged to generate outputs:
136
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
Now that you understand different components of RAG, let’s wrap it up and see
how these different components come together and build a stack which allows you to
connect external sources of data to a desired LLM.
An LLM can be decomposed into three components: retriever, augmentor, and
generator; each plays a crucial role. Before the retriever comes into the picture, there are
a few necessary steps to execute. Firstly, the relevant documents are identified, chunked,
embedded, and indexed in a database. This database serves as a knowledge base for the
retriever. Then the user query is converted into a vector representation through the same
embedding model. Now, the retriever comes into the picture and identifies relevant
document chunks from the knowledge base by performing a semantic search between
the user query and the search index created in the initial stages. This leads to the second
component, the augmentor, where the identified document chunks are combined with
the user query to form a single prompt. The context-rich prompt then becomes an input
for the LLM and brings the generator into the scene. With the help of context, the LLM
generates a response to the user query. Figure 5-9 illustrates the flow of data in the RAG
methodology.
137
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
So far, you have covered three different ways of utilizing LLMs. In Chapter 3, you
were introduced to fine-tuning, Chapter 4 discussed details about prompting, and this
chapter covers important aspects about RAG. Each technique has its own merits and
demerits, and as someone who is new to all these techniques, it can be very confusing to
choose among the three techniques. Multiple factors have to be considered to make the
right decision. Table 5-1 will help you to compare all three techniques.
Technical skill Moderate skill level is High skill level is required Low skill level is required
level required. Building a RAG in fine-tuning. Person in prompting. Since the
based application requires building such systems model remains untouched,
expertise in both machine should be well versed anyone with little to no
learning and information with the concepts of technical knowledge can
retrieval machine learning and build an application based
deep learning on prompting
(continued)
138
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
Flexibility to RAG offers moderate Fine-tuning offers greater In prompting, the scope of
customize customization because the flexibility to customize as customization is quite limited
model’s responses depend it utilizes specially crafted as the model’s responses
on the quality of external data datasets to modify a majorly depend on the
available and the pre-trained model’s behavior and prompt and the knowledge
knowledge align it acquired from pre-training
Data RAG doesn’t require a Fine-tuning requires No data is required in
requirements special dataset curation curation of a dataset based prompting. However, you can
process and gives developers on a specific objective include some information in
flexibility to connect multiple which requires extra the context, but it is limited
sources of data such as PDF, efforts; therefore, it is a by the size of the context
database, etc. time-consuming process window
Output quality High quality can be expected High quality is expected as Response quality is variable
as RAG utilizes external specific instructions/data is in this case because it is
knowledge to generate a utilized to fine specifically dependent on the
response tune a model quality of prompt
Model update You can readily update the Model update is not so Model update is not possible
ease model by bringing in new data easy here because at your end because you are
sources data curation is a time using a pre-trained model
consuming process and prompting it to get a
response
Pricing The operational cost is The operational cost is The operational cost is low
medium here as RAG is not high here because not only or even free (if using open
computationally expensive do you require access to source models). You are only
when compared to fine-tuning. powerful GPUs but also paying for the interactions
This is because the model model training takes time with the model
parameters remain untouched
Example of use Applications which require Domain-specific General-purpose, educational
cases data to be updated frequently applications which require applications or applications
a lot of customization which doesn’t require much
customization
139
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
Now that you have gained knowledge about the different techniques of utilizing
LLMs and understand the benefits of RAG, let’s take one more step and build a simple
RAG POC (Proof of Concept).
To run the following code, please make sure that you have the same versions of the
dependency as mentioned below. So, without further delay, let’s jump in to see RAG in
action. For this demo, I have utilized the GPT model by OpenAI, but you can try running
the same code with other LLMs as well:
langchain == 0.2.5
python == 3.11.3
The first step is to import the necessary libraries and authenticate the services by
using the getpass function of the getpass module. However, you can use some other
way as well like storing the key in a separate file. By running the following code block,
you should be prompted to enter your OpenAI key, and if everything goes well, then
you shouldn’t encounter any error in running this piece. Figure 5-10 demonstrates the
prompt which you should get to authenticate the OpenAI API.
import getpass
import os
from langchain_openai import ChatOpenAI,OpenAIEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain.vectorstores import Chroma
from langchain.memory import ConversationBufferMemory
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
os.environ["OPENAI_API_KEY"] = getpass.getpass()
Necessary Imports
140
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
For this use case, I have used a simple text file with my personal details. These are
just three to four lines about myself. You can change the data and play around with it.
Following is the text mentioned in the data.txt file, and it is placed in the data repository.
The next step is to load the data through the textloader. The following is the code to do
so, and Figure 5-11 demonstrates the output of it. You can even load multiple documents
and check the page content as mentioned in the code.
loader = TextLoader("data/data.txt")
document = loader.load()
print(document[0].page_content)
r
3 print(document[0].page_content)
My name is Bhawna Singh, I work in Ireland's reserach center. I am situated in Dublin however I am originally from
Jaipur, India. I speak Hindi and English. When I was 12, I had a dog and his name was Shimmy.
The next step is to split the data into chunks. I have used the default settings, but
you can mention the chunk size and overlap tokens as well. Once the text is split into
chunks, I can use them to form the embeddings. In this case, I will be using the OpenAI
embeddings as I am using GPT for generating the responses. Additionally, the generated
embeddings will be stored in Chroma, which is an open source vector database. Run the
following code to execute this part and look at how different chunks have been created in
the data. The output is depicted in Figure 5-12.
text_splitter = RecursiveCharacterTextSplitter()
splits = text_splitter.split_documents(document)
vectorstore = Chroma.from_documents(documents=splits,
embedding=OpenAIEmbeddings())
print(splits)
r
3 vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
4 print(splits)
[Document(page_content="My name is Bhawna Singh, I work in Ireland's reserach center. I am situated in Dublin howev
er I am originally from Jaipur, India. I speak Hindi and English. When I was 12, I had a dog and his name was Shimm
y.", metadata={'source': 'data/data.txt'})]
After creation of a vector store, the next step is to put all components together and
form a retrieval chain. It is at this step I define the model or LLM which I intend to use.
Now you may wonder at which point I augment the retrieved content with the user
query. In the following code, I set the value of the chain_type parameter as “stuff.” This is
the simplest combination strategy as you are providing the extracted context along with
the prompt. Additionally, I have the value of k equal to 1 and passed it via search_kwargs
to retrieve the topmost value. Run the following code and form the chain:
llm = ChatOpenAI(model="gpt-4o")
memory = ConversationBufferMemory(memory_key='chat_history',
return_messages=True)
conversation_chain = ConversationalRetrievalChain.from_llm(
llm=llm, chain_type="stuff", retriever=
vectorstore.as_retriever(search_kwargs={"k": 1}),
memory = memory
)
Once the chain is defined, I can now start testing with custom queries. Now there is
no way that GPT will know about my dog’s name or even my own name. You can see how
RAG empowers the LLM to use custom data and solve the query. The following are some
example queries solved using this demonstration. All the queries are tested by invoking
the conversation_chain, created in the previous step:
Query 1: What is my name?
Response 1: “Your name is Bhawna Singh.”
Figure 5-13 demonstrates the results returned after running the query.
ln [9]: 1 conversation_chain.invokeC'What is my name?") ['answer']|
142
CHAPTER 5 STOP HALLUCINATIONS WITH RAG
Conclusion
This chapter helped you learn about the following concepts:
• RAG approach and how it helps you connect external data with LLM
• A comparison among the three powerful techniques: Prompt
Engineering, Fine-Tuning, and RAG
• A hands-on exercise to connect an external text file with LLM
So far, you’ve come a long way and built a strong understanding of transformers and
different ways of utilizing them. From the next chapter onward, I will help you look at
things from the perspective of business and things which you should take care of while
building LLM-based applications.
143
CHAPTER 6
Evaluation of LLMs
Without proper self-evaluation, failure is inevitable.
—John Wooden
The famous American basketball coach, John Wooden, said very wise words, “without
proper self-evaluation, failure is inevitable.” How can anything be improved if there is
nothing to compare its performance against. We need a metric to grow and become
better. Globally, governments keep track of matrices like poverty index, literacy rate,
gender equality index, etc., to make the world a better place for everyone. Similarly,
business organizations have matrices like gross margin, net profit margin, retention rate,
etc., to ensure that their business is growing. I personally believe that the beauty of our
world lies in numbers. Isn’t it fascinating that complex concepts can be transformed into
simple equations and be ultimately transformed into numbers - small or big? There are
constants reflecting the world’s mysteries such as speed of light (c), Planck's constant
(h), Boltzmann’s constant (fc), elementary charge (e), etc. In my opinion, the ability to
measure and compare a certain thing is truly amazing as it provides you a direction
when strategizing for improvement.
The concept of evaluation metrics isn’t new to LLMs. In fact, in software engineering
there are standard matrices such as space and time complexity which help one in
improving their quality of code. Similarly, in machine learning as well, there are
common practices such as fc-fold cross-validation, holdout validation, etc., which can be
applied irrespective of the algorithm being used in model building. Furthermore, there
are additional matrices which are based on the type of machine learning problem that
one is trying to solve. For example, if it is a regression problem, then common matrices
are R-squared (r2), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), etc.,
and if it is a classification problem, then popular matrices are precision, recall, F1 score,
etc. The ultimate goal of all matrices is to help the developers design better models.
145
© Bhawna Singh 2024
B. Singh, Building Applications with Large Language Models, https://doi.org/10.1007/979-8-8688-0569-1_6
CHAPTER 6 EVALUATION OF LLMS
In NLP also, there are a whole load of metrics, and with the rise of the LLMs, it is
more confusing than ever and adds to that the additional complexity of parameters size,
language, and fast pace. How to decide which LLM is more suited to you? How to choose
among so many matrices? How to be sure if there exists a metric for your novel problem?
The list of questions is endless, and hopefully by the end of the chapter, you will be able
to navigate the complex landscape of evaluation in LLMs, preparing you to design and
develop top-notch models.
Each problem is different, but it still needs to be looked at from different angles to
determine which evaluation metrics are suitable for addressing a specific problem. In
this chapter, I will walk you through a variety of ways to validate a model’s performance.
You will not only learn new metrics but will also learn about traditional NLP matrices
which have been adopted in current times. You will also learn about challenges with
current methods of evaluations and the ongoing research to fix those challenges.
Furthermore, I will try to bucket the evaluation metrics into different categories which
will help you in making the right decision when building an LLM-based application.
So, what are you waiting for? Let’s dive into the world of evaluation metrics and start
scrutinizing the LLM models.
Introduction
If you recall the definition of language models, then essentially a model which can predict
the next best token in a sentence is called a language model. This implies that a metric which
reflects the accuracy by which the model is predicting the next word can be considered as the
criteria to judge the performance of the model. There are a few metrics which are based on a
word level, and I will discuss them in detail in upcoming sections. These metrics reflect the
model’s accuracy, thus indicating a model’s performance, one of the key attributes to track.
However, a technology as powerful as LLMs has to undergo various quality checks to ensure
that the system or application is safe to use for everyone and doesn’t get exploited when
deployed in real time. The following are the additional attributes which are required to be
monitored for making an overall successful LLM application:
These are some additional factors that one needs to look for while developing
an LLM-based application apart from technical metrics which indicate a model's
performance like latency, accuracy, etc.
147
CHAPTER 6 EVALUATION OF LLMS
Now that you understand various aspects which can be evaluated to make an
application better, let’s try to understand how these matrices can be categorized. Firstly,
you need to identify the aspect which you are trying to evaluate, and then you can think
of different metrics which can be applied to do so. So, from a broader perspective, you
have two entities to evaluate: one is the LLM model itself, and the other is the application
or the system which is based on LLM. Now these categories can further be divided into
subcategories, and you can then choose specialized metrics to inspect even deeper
parts of your application. In the LLM category, there are two subcategorizations: one
deals with the basic ability of LLMs such as language modeling, while the other deals
with the LLMs’ ability to solve downstream tasks such as question answering, language
translation, etc. Under the broader category which deals with the application, you can
further bifurcate it into subcategories. The first one is the type of application, and it deals
with various methods of utilizing LLMs to build an application such as fine
tuning, RAG, etc. The other is human alignment which deals with the factors like bias,
safety, ethical usage, etc. The flow chart in Figure 6-1 depicts the categories in a pictorial
representation.
Based on these categories, we will now explore the huge landscape of evaluation
metrics and gain a perspective about making the models and the applications
better. Let’s first start with the evaluation metrics which are used for evaluating the
performance of LLM itself.
148
CHAPTER 6 EVALUATION OF LLMS
Evaluating theLLM
Basic Capability: Language Modeling
When studying in primary school and learning about things, we all appeared for tests
which were used to judge our skills in different subjects, and as I sit and recall my school
days, I remember that there was always a correct answer to every question in the test,
and these tests were designed in a way that only the students who marked that specific
answer received full marks and others received no marks. This is one approach to grade
students and I believe is the simplest approach. Let’s circle back to the first category
of evaluation which deals with the evaluation of an LLM itself. If there are so many
models out there, how do you even decide which one to start with in the first place?
Certainly, one needs a metric which can be used to differentiate and choose one model
out of several. Thinking from a brute-force perspective, then you can adopt a similar
methodology which was used in primary schools to test a model’s basic capability. So, if
you are looking for just a good enough LLM, then the model should at least be capable
of performing the basic task. The fundamental ability of the LLMs is to predict the
next token based on the previous tokens, which is also known as language modeling,
implying that a model should not only be able to understand language but also generate
language. Let’s look at the first metric which is called accuracy:
• Accuracy
TP+TN
Accuracy= ------------------------
TP +FP +TN + FN
149
CHAPTER 6 EVALUATION OF LLMS
• Perplexity
The basic ability of LLMs, that is, the language modeling, can be
tested using a metric called perplexity. A good LLM should have a
low perplexity score. You can think of it as a measure of uncertainty,
and as a developer, you would want your model to have less
uncertainty and more predictability. Given the context in terms, the
previous token perplexity depicts the negative log likelihood of the
next token. Let me explain this with an example. Suppose I have a
sentence - “this is a good day” - and I assign it random probabilities
like the language would:
P(this’) = 0.3
P(is | this) = 0.2
150
CHAPTER 6 EVALUATION OF LLMS
Here, w1, w2, . wn are different words in the test dataset. So, if we
circle back to the example above, then
Okay, so now you understand the perplexity metric. You may wonder how one can
set up a standardized way to compare perplexity of different models? In the example
above, I used an arbitrary sentence with just a few random words to demonstrate the
perplexity, but if everyone uses random datasets to generate this score, then how can we
as a community make a comparison among the models? The answer lies in benchmark
datasets. The open source community is very kind, and there are a few datasets which
are considered as benchmarks. The LLMs are tested against these benchmark datasets,
thus offering a fair comparison among different models.
Benchmark datasets are used to evaluate the performance of the models. These
datasets test the models against different tasks. For language modeling, the popular
datasets against which the perplexity score is measured are discussed below:
151
CHAPTER 6 EVALUATION OF LLMS
152
CHAPTER 6 EVALUATION OF LLMS
After my dear mother passed away ten years ago now, I became
However, if I provide you some context and then ask you to finish
the sentence, then it’s an easy task for you.
Target sentence: After my dear mother passed away ten years ago
now, I became _____ .
Target word: lonely.
153
CHAPTER 6 EVALUATION OF LLMS
154
CHAPTER 6 EVALUATION OF LLMS
be accessed here1). SOTA (state of the art) reflects the best scores of perplexity at the time
of model release, and the subsequent numbers below SOTA indicate the different sizes
of the GPT-2 model. You can see the trend that the increasing model size is leading to
reduction in perplexity, implying that the bigger models are likely to be better.
WikiText2 PTB
(PPL) (PPL)
Language modeling is the basic ability of the LLMs, but they are more capable than
just predicting the next token. Recent trends have demonstrated capabilities which
are beyond the basic abilities. The next section focuses on these abilities, metrics, and
benchmark datasets which can be used to evaluate these capabilities.
1https://d4mucfpksywv.cloudfront.net/better-language-models/
language_models_are_unsupervised_multitask_learners.pdf
155
CHAPTER 6 EVALUATION OF LLMS
156
CHAPTER 6 EVALUATION OF LLMS
where
BP = min(e(1-r/c)),
r = reference length
c = candidate length
Let me demonstrate the calculation of the BLEU score with an example:
Candidate sentence: “I love Machine Learning !”
Reference sentence: “I love Machine Learning and AI !”
Table 6-1 denotes the calculation for precision scores.
Precision1 “I,” “love,” “Machine,” “I,” “love,” “Machine,” “Learning,” “and,” “AI,” 5/5
(unigram) “Learning,” “!” “!”
Precision2 “I love,” “love Machine,” “I love,” “love Machine,” “Machine Learning,” 3/4
(bigram) “Machine Learning,” “Learning “Learning and,” “and AI,” “AI !”
!”
(continued)
157
CHAPTER 6 EVALUATION OF LLMS
Precision3 “I love Machine,” “love “I love Machine,” “love Machine Learning,” 2/3
(trigram) Machine Learning,” “Machine “Machine Learning and,” “Learning and AI,”
Learning !” “and AI !”
Precision4 “I love Machine Learning,” “I love Machine Learning,” “love Machine 1/2
(4-gram) “love Machine Learning !” Learning and,” “Machine Learning and AI,”
“Learning and AI !”
Combine the abovementioned precision scores using the formula discussed above:
= (n i= _jn )
precision 1/n
Now that you have the precision part, you can now calculate the brevity penalty,
which is
( ( ))
BP = min 1, e 1-r/c ,
r = reference length
c = candidate length
r = 7, c = 5
BP = min(1, 0.67)
= 0.67
BLEU = 0.67*0.76
= 0. 5092
158
CHAPTER 6 EVALUATION OF LLMS
This is a great and a popular metric choice for evaluating translation tasks, but there
are some common pitfalls of this metric; let’s discuss these:
1. The BLEU metric is simply taking the count of the words, which is
not a true indicative of the semantic meaning. Two different words
can convey the same meaning. For example, the words “accurate”
and “correct” both reflect the same meaning. However, they will
not be treated the same with this metric.
2. Though the score can be used with any language, it can’t be used
to test the model’s ability across different languages.
3. The BLEU metric is dependent on the reference sentences or
reference translations; thus, the score can’t be used across
different datasets.
4. The order of the words doesn’t hold any significance in the
calculation of the score. This makes even the grammatically
incorrect sentences to get a high BLEU score.
5. BLEU is only a precision-based metric, which doesn’t create a
balance.
Some of the limitations of the BLEU metric are addressed by the METEOR score,
which you will learn the next:
159
CHAPTER 6 EVALUATION OF LLMS
^ Number of chunks
Penalty = 0.5 *
y Number of matches
160
CHAPTER 6 EVALUATION OF LLMS
python == 3.11.3
evaluate == 0.4.2
1 import evaluate
2
3 meteor = evaluate.load('meteor')
4 candidate = ["Every student will follow the teacher's instructions"]
5 reference = ["It is mandatory for all students to follow their teacher's instructions"]
6 results = meteor.compute(predictions=candidate, references=reference)
7
8 print(results)
•{'meteor': 0.38448275862068976}
In the previous section, you learned about the BLEU score; let’s
use the Evaluate library to calculate the BLEU score for the given
candidate and reference sentences. Figure 6-4 illustrates the BLEU
score calculated for the given reference and candidate sentences.
Notice how harshly the BLEU score penalizes the sentences even
though they are similar.
161
CHAPTER 6 EVALUATION OF LLMS
1. French-English
2. Hindi-English
3. German-English
4. Czech-English
5. Russian-English
The datasets have two components, namely, parallel corpus and
monolingual data. The parallel corpus contains training data
in the form of sentences from both languages in the pairs listed
above along with the reference translations, while the monolingual
data contains large quantities of text in each language, and unlike
parallel corpus, direct translations are not present.
Now that you have gained understanding about the evaluation of language
translation, let’s move ahead and understand evaluations of more advanced capabilities.
162
CHAPTER 6 EVALUATION OF LLMS
Now, you will also calculate the recall, which is the ratio of
matched unigrams to the total number of unigrams in the
reference sentence. Thus, the recall is %, which is 0.8. The recall is
a bit low because the generated sentence doesn’t capture all the
mentioned information, and if you look at the example discussed
above, the candidate sentence missed out an important detail
about the pen that it is beautiful.
Now you will further calculate the F1 score, which is the harmonic
mean of both the precision and recall. Thus, the F1 score is 0.89.
These three metrics help us make decisions about a model’s
performance, and this sums up the ROUGE-1 metric.
2. ROUGE-2: Bigram
As the name suggests, ROUGE-2 deals with the bigrams in the
two sentences. ROUGE-2 is calculated in the similar manner as
ROUGE-1. The only difference between the two is that ROUGE-2
reflects the precision, recall, and F1 scores of bigrams rather than
the unigrams.
Length of LCS = 4
164
CHAPTER 6 EVALUATION OF LLMS
= 4/4= 1.
= 4/5 = 0.8
= 0.89
Using the Hugging Face Evaluate library, you can calculate the
ROUGE score very quickly. The code in Figure 6-5 will help
you learn how you can use the Evaluate library to compute
ROUGE metric.
165
CHAPTER 6 EVALUATION OF LLMS
1 import evaluate
2
3 meteor = evaluate.load('rouge')
4 candidate = ["Every student will follow the teacher's instructions"]
5 reference = ["It is mandatory for all students to follow their teacher's instructions"]
6 results = meteor.compute(predictions=candidate, references=reference)
7
8 print(results)
166
CHAPTER 6 EVALUATION OF LLMS
So far, you have learned about the most important metrics in NLP which deal
with the tasks of language generation, be it next token prediction, translation, or
summarization. Let’s move ahead and explore some other advanced capabilities
of LLMs.
The most basic form of interaction with LLMs is the chat-based UI method where
a user prompts the model with certain types of questions. Thus, having a capability to
answer the questions is one of the most critical skills of the LLMs. The question
answering task can be divided into two categories - pre-training based and evidence
based. I will give you a gist about both the categories separately. So, let’s now move
ahead and explore the evaluation metrics and benchmarks for question-answering tasks.
168
CHAPTER 6 EVALUATION OF LLMS
169
CHAPTER 6 EVALUATION OF LLMS
This is one category of the question-answering task. The other category deals
with the usage of external data to answer the questions, so without further delay, let’s
understand which metrics and benchmarks play a crucial role here.
170
CHAPTER 6 EVALUATION OF LLMS
171
CHAPTER 6 EVALUATION OF LLMS
a) A metal stool
b) A giant bear
c) A bottle of water
172
CHAPTER 6 EVALUATION OF LLMS
LLMs have also demonstrated a capability to solve math problems. However, most of
the LLMs score less on math-related complex tasks, and continuous efforts are being put
by researchers to make this capability stronger. Thus, it is important to look at metrics
which can be used to evaluate tasks related to math.
173
CHAPTER 6 EVALUATION OF LLMS
This wraps up the evaluation of basic and advanced abilities of LLMs, and this is
not an exhaustive list, and there are several other benchmarks and evaluation metrics
available, such as Measuring Massive Multitask Language Understanding (MMLU), Big
Bench Hard (BBH) and AGIEval, SuperGLUE, etc., which I am not going to cover in this
chapter, and you can read about these separately.
So far, you have learned about evaluation metrics and benchmarks, which help you
in improving the performance of the model in certain types of task, but as a developer,
you want the entire application to be successful and not just the core model. This brings
us to the second broader category of evaluation - LLM application. Let’s now look into
evaluating the aspects of the application which will make it better.
174
CHAPTER 6 EVALUATION OF LLMS
175
CHAPTER 6 EVALUATION OF LLMS
176
CHAPTER 6 EVALUATION OF LLMS
177
CHAPTER 6 EVALUATION OF LLMS
These are the two popular choices for measuring the hallucination levels in the
models. Now let me introduce the popular frameworks which are used specifically for
RAG applications. These metrics focus not only on the performance of the individual
components but also pay attention to RAG-related specific challenges. The following is a
popular framework for RAG evaluation:
178
CHAPTER 6 EVALUATION OF LLMS
The metrics mentioned above focus on the retrieval component. Now, let’s shift
the focus from the retrieval to the generation component of the RAG methodology and
understand metrics in the Ragas framework:
• Faithfulness: For a user query, the generated answer is considered
faithful if there is evidence in the retrieved context to infer all the
facts stated in the answer. This metric also has a value in the range of
zero to one. The higher the score, the better is the model.
Well, you have learned a lot so far; now let’s proceed to the last but the most
important category.
179
CHAPTER 6 EVALUATION OF LLMS
Conclusion
In this chapter, you learned that
With this knowledge in your hands, I am sure that you will build powerful, impactful,
and robust applications.
180
CHAPTER 7
Frameworks
for Development
It is essential to have good tools, but it is also essential that the tools should
be used in the right way.
—Wallace D. Wattles
Introduction
I have come across a lot of people who feel that LLMs are very much hyped and no
good can be achieved from a technology like this. I disagree with this ideology, and in
a few minutes, you will understand why. If you put on a developer’s hat who has been
developing ML use cases before the popularity of LLMs, you will know how painful the
process of developing NLP models has been in the past. A model performing a single
task like sentiment classification previously required a large number of annotated
examples labeled positive, negative, or neutral. Even after curation of an excellent
dataset, the accuracy of these models was not satisfactory. This is because the traditional
NLP models not only had to learn data patterns to understand the classification of
various sentiments but also the English language. However, with the LLMs, transfer
learning is now applicable in NLP as well. Since the LLMs ingest vast amounts of data
available on the Internet, they have become fluent with the English language and can
perform basic NLP downstream tasks with three or four examples or sometimes no
examples at all, simplifying the overall process of model building. Additionally, a single
LLM is capable of doing multiple tasks, thus eliminating the requirement of having a
single model for solving a single problem. LLMs have simplified and shortened the time
length of developing complex NLP models.
181
© Bhawna Singh 2024
B. Singh, Building Applications with Large Language Models, https://doi.org/10.1007/979-8-8688-0569-1_7
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
According to Dimension Market Research, the Large Language Model (LLM) market
size is expected to scale to $140.8 billion by 2033, and if you believe the numbers, then
it isn’t wrong to say that the world is going to see a surge in the LLM-based applications
too. Ever since the popularity of the ChatGPT, businesses across different sectors have
started exploring the technology and shipping LLM-based features in their existing
product. Already there are so many applications based on LLMs, and the number of
these applications is only going to go up.
I am pretty sure you might have forgotten the count of the number of times you have
used ChatGPT now. People who are not familiar with the technology tend to believe
that ChatGPT is an LLM itself. However, ChatGPT is an application which is powered
by different versions of GPT, an LLM developed by OpenAI. Furthermore, you might
also be familiar with Gemini, making its place in Google’s workspace, thus integrating
the capabilities of a language model in Google Docs, Sheets, Slides, etc. Additionally,
Microsoft has launched Copilot 365, based on OpenAI’s GPT-4, allowing seamless
integration of the language model across different applications, like Word, Excel,
PowerPoint, OneNote, and Outlook. Similarly, Apple is planning to bring ChatGPT in Siri
and other apps. You can see how big tech has already shaken hands with the technology
and braced it by offering new capabilities to its customers.
This is one side of the story where people are building products based on closed
source models; however, the open source models are also being leveraged by different
enterprises, and amazing products are being built by fine-tuning these models. One
example of a successful company leveraging open source models is Perplexity AI that
competes directly with the search engine of Google by offering a search engine that
answers queries directly instead of providing links to relevant information. The company
was founded in 2022 and is currently valued at $1 billion. The models used by Perplexity
are built on top of the open source models - Mistral and LLaMA.
Enterprises are also using the LLMs’ capability to generate code. In closed source
space, there is GitHub’s Copilot which auto-completes the code or generates code based
on a developer’s problem description. Copilot is developed by GitHub and OpenAI,
and it’s a good tool if one is using it for research or educational purposes, but for an
enterprise, using a closed source LLM might be a risky game as they have to trust a third
party with their proprietary data and sensitive information. VMware partnered with
Hugging Face to release StarCoder, an alternative to GitHub Copilot which is developed
using open source models. I can go on and on here to list out a wide variety of features
and applications based on LLMs, and by the time the book reaches you, a lot more
182
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
applications will get popular. Therefore, it’s important to identify the relevant tools and
frameworks which can be used to turn your idea into a fully functioning application.
Before you explore different frameworks, please know that this is not an exhaustive
list, and I will only be mentioning the frameworks which are popular currently. I believe
that a good developer is the one who is platform agnostic and has no dependency on a
certain framework. In the end, it is all about math and logic which lies underneath the
functionalities, offered by these frameworks, which remain the same. Thus, one should
not spend hours reading views on why a certain framework is better than the other;
instead, focus on building the application with the available resources.
In this chapter, you are going to learn about popular frameworks which will help
you implement the techniques learned in the previous chapters - prompt engineering,
fine-tuning, and RAG. The goal of this chapter is to provide you a holistic view of the
capabilities offered by one of the most popular frameworks - LangChain. This chapter
will dive into deep understanding about LangChain, its working, and the major
components which make the framework a hit. Finally, you will also gain hands-on
experience in LangChain and understand how the framework helps developers in
making LLM-based applications. In the end, I will also discuss the factors which will help
you choose a framework for your use case. So, let’s dive into the depths of LangChain.
LangChain
You might have come across the framework called LangChain, which is a blockbuster
framework for developers. LangChain is a powerful tool used for the development
of LLM-based applications. The framework offers a variety of modules which help
in effortless management of user interaction with the language models along with a
smooth integration of different components required to build an LLM-based application.
This section will help you understand the framework and its intricacies.
What Is LangChain?
An open source framework, developed by Harrison Chase in October 2022, has become
a popular tool in the AI community to ease the process of building applications
based on LLMs, such as chatbots, smart search, AI agents, etc. LangChain enables the
developer to compare different prompting styles and different models and implement
183
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
RAG methodology without writing the code for different modules from scratch. Thus,
a developer can truly focus on leveraging the potential of the Large Language Models
without worrying about the underlying complexities. In simpler terms, LangChain is a
suite of APIs which simplify the development process of an application based on LLMs.
Now to successfully follow this instruction, the application firstly needs a function
like get_calendar(), then it needs a function like search_contact(‘mother’), which returns
the contact details of your mother, and then it needs a function like schedule_meeting(),
which blocks the calendar at 9AM and creates a zoom meeting. As a developer, you can
either manually create these functions or leverage frameworks like LangChain which
offer functionalities to connect external tools to LLMs so that you don’t have to explicitly
code them.
184
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
185
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
186
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
1. Chat models: These models are LLMs which have been tuned
specifically for conversations. With the popularity of ChatGPT,
there is a demand of building custom ChatGPTs which are
specialized in holding conversations. The output of chat models
is of message type instead of string. The interface provided here is
different from the other type of models.
2. LLMs: LangChain offers a second type of model category. These
models are only text completion models and don’t have an
interface to support conversations. The output of these types of
models is string type and not the message type. Thus, both model
types have different output schema, and I will be using the chat
model for demonstration purposes.
187
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
Now, you might be wondering how indexing works? For each document, hash is
calculated, and information like source ID, hash value, etc., are stored. Furthermore,
RecordManager keeps a track of all the hashes and the corresponding documents which
are stored in the vector store.
The flexibility of plugging in the external data source into the application through
LangChain saves efforts on your part both in terms of time and money. The framework
solves the following problems which developers can face while developing such an
application:
- Redundant data: With indexing, you can avoid feeding the already
existing data in the vector store, thus saving the memory issues by
avoiding unnecessary redundancy.
188
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
There are other types of chains in LCEL and a majority of legacy chains are currently
being re-written using LCEL (at the time of writing).
189
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
- Available tools which can be used to solve user queries. For example,
if the user wants to access the latest information which is not present
in the pre-training data, then LLM can access a search engine and
then deliver the final response. With the help of LangChain, you can
also use custom tools.
- So an agent can make certain decisions, but decision-making is a
complex ability, and sometimes one needs to go through a series of
sequential steps to arrive at a decision; if you change a step in
between, then the decision is also altered. Thus, agents require
previously executed steps as an input.
What Is Memory?
While interacting with ChatGPT, you might have noticed that in each chat it can answer
questions based on the previous inputs. Memory, in LangChain, is the component with
which you can also build chat interfaces where an LLM can refer to certain information
which the user disclosed in previous messages, though it is conditioned to a limit
window. LangChain allows developers to add memory to an application. The memory
component is currently in beta phase because the legacy chains are being updated and
are being defined in LCEL, making it unready for production.
So, you might be wondering how this memory system works? Any memory-related
system has to perform two necessary operations, which are read and write. The read
operation allows a system to access the information, and the write operation allows
the system to update the stored information. Additionally, in LangChain the primary
binding object is a chain, and this chain will be used to interact with the memory system
as well. So, the chain interacts with the memory system twice.
1. A read operation is performed once the user input is received and
the core logic has not been executed. In this way, it can be ensured
that if there is any relevant information, then it can be used to
generate a better response by augmenting the user query with the
information extracted from the past conversation using the read
function.
190
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
2. After the read operation, the chain interacts with the memory
system for the second time once it has executed the logic and a
response has been generated but not presented to the user. This
time, it performs the write operation and updates the previously
stored information.
These are the main components of LangChain which allow you to build production
ready applications of all types - purely prompt based, RAG based, agents, etc.
It is to be noted that LangChain is one of the great frameworks which gained
extreme popularity because it was launched just a month before ChatGPT. Additionally,
LangChain is building an ecosystem which will help you build an LLM-based end-
to-end application which can be deployed into production. The ecosystem offers
LangSmith and LangGraph. LangSmith has been developed to support monitoring and
evaluation of the applications, and LangGraph is a specialized service developed to let
developers build stateful multi-agent. In simpler words, LangGraph is used to ease out
the development process of multi-agent (multiple agents in an application interacting
among themselves) and simplifies the development process by offering the cyclical
graphs, which constitute a key part for managing the runtime of all the agents. These
services are still new and evolving; therefore, I will not go into much details for now.
Enough of theory, let’s now jump to hands-on implementation using LangChain.
Previously, you learned about LangChain solving the problem of reusability of prompt.
Let me demonstrate it via code now.
The first step is importing the libraries; if you don’t have these already installed, then
use the following command in the notebook itself where you just have to mention the
name of the package you are interested in like LangChain:
I will first demonstrate prompting directly to the model and then demonstrate the
same using LangChain. So, run import the following libraries and authenticate the
OpenAI API by running the following cell in your Jupyter Notebook:
import os
from openai import OpenAI
import getpass
openai_key = getpass.getpass()
client = OpenAI(api_key = openai_key)
191
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
After the authentication, you can check if everything is working fine. I will utilize the
most advanced model of OpenAI which is GPT-4o (at the time of writing) by calling the
chat completion API:
llm_model = 'gpt-4o'
system_content = 'You are a friendly assistant who \ will help the user
with their queries'
user_content = 'What is the water cycle ?'
response = client.chat.completions.create(
model= llm_model, messages=[{"role": "system", "content":
system_content}, {"role": "user", "content": user_content}]
)
print(response.choices[0].message.content)
The water cycle, also known as the hydrological cycle, is the continuous movement of water on, above, and below the
surface of the Earth. It is a complex system that involves several key processes:
1. **Evaporation**: Water from oceans, rivers, lakes, and other bodies of water is heated by the sun and transforms
from liquid to vapor. Plants also contribute to evaporation through a process called transpiration, where water is
released from plant leaves into the atmosphere.
2. **Condensation**: As water vapor rises and cools in the atmosphere, it condenses into tiny droplets to form clou
ds and fog.
3. **Precipitation**: When these droplets combine and grow large enough, they fall back to Earth as precipitation i
n various forms such as rain, snow, sleet, or hail.
4. **Collection**: The fallen precipitation collects in bodies of water like rivers, lakes, and oceans. It also inf
titrates into the ground, replenishing groundwater supplies.
5. **Runoff and Percolation**: Some of the water that falls on the land flows over the surface and collects in rive
rs, which eventually lead to the oceans. Some of the water also percolates through the soil, contributing to ground
water storage and occasionally reaching aquifers.
The water cycle is essential for sustaining life, regulating weather and climate, and supporting ecosystems. Each s
tage of the cycle is interconnected, ensuring that water is continuously recycled and made available for various pr
ocesses on Earth.
Okay, so far, the model is working fine, and we are getting requests. Now let’s
proceed ahead, and I will tell you about a business case. Let’s say you have a chatbot
for handling customer requests. Now the requests can be received in any language,
but the task of the chatbot is to reply politely such that it sounds helpful and proficient
in English. I have defined two values here and note that I have hard-coded them.
complaint_received and desired_tone are two variables which will be as per the user. So,
I declare these two variables and craft a prompt accordingly as shown in the code below.
Figure 7-3 shows the output, which is the prompt which will be sent to the model.
192
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
complaint_received = "Mera order abhi tak nahi aaya. Mai Kya karun"
desired_tone = "friendly and helpful"
print(prompt)
Now, let’s run this prompt directly using the chat completion API:
response = client.chat.completions.create(
model= llm_model, messages=[{"role": "system", "content":
system_content},{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)
You will see how nicely the GPT-4 followed the instructions, and the response
obtained, depicted in Figure 7-4, is quite impressive. My motive is not prompt
engineering here, and certainly the response can be more tailored, but for
demonstration purposes, this is good enough.
Certainly! Here's a friendly and helpful response to the complaint:
Thank you for reaching out to us. I'm really sorry to hear that your order hasn't arrived yet. I understand how fru
strating this can be.
Don't worry, I'm here to help! Could you please provide me with your order number and any other details you might h
ave? This will help me track your order and give you an update as soon as possible.
Thank you for your patience. We'll work to get this resolved quickly!
Best regards,
[Your Name]
193
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
So, this is how you use the OpenAI API directly, but let’s now proceed ahead and
leverage the abstraction offered by LangChain and get a response from the model using
the same prompt but by using the PromptTemplate.
To do so, you will first establish a connection with the chat completion API via
LangChain; run the following code and it will help you connect with the OpenAI API. I
am setting the temperature as zero because I don’t want much creativity here, and the
llm_model is the same as used previously. By running the following code, you can see
that a client object is created, which also contains the OpenAI API key. Since I already
mentioned the key in the previous steps (executed using getpass), I am not repeating the
step here again. Figure 7-5 depicts the output, reflecting the established connection.
Figure 7-5. Output signifying that the connection has been made with OpenAI
Okay, so the next step is to use a prompt template and define the input variables.
Here, I want to keep two variables, firstly the tone of the response and secondly the
complaint which will be user specific. The following is the template in which these two
variables will be used:
Once you have defined the template_string, the next step is to use the
PromptTemplate, so for that you will first import the ChatPromptTemplate module by
running the commands mentioned below, and Figure 7-6 demonstrates the output of
this code:
194
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
Figure 7-6. Prompt template signifying the template as well as the output
obtained
If you want to take a closer look at the input variables, then run the following code,
and Figure 7-7 shows specifically the input variables:
prompt_template.messages[0].prompt.input_variables
1 prompt_template.messages [0]. prompt. input_v|ariables
['complaint', 'tone']
The next step is to put it all together. Recall that I have already declared these
variables above, and now I will form a user request using the prompt template and the
declared variables:
Finally, I will generate the response using the LangChain library for the user request
formed in the previous step, and Figure 7-9 depicts the output of GPT-4.
customer_response = chat_call.invoke(user_request)
print(customer_response.content)
195
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
Dear Customer,
Thank you for reaching out to us. I'm sorry to hear that your order hasn't arrived yet. Let's get this sorted out f
or you as quickly as possible.
Could you please provide me with your order number and any other relevant details? This will help me track your ord
er and give you an update on its status.
Thank you for your patience, and I look forward to resolving this for you soon!
Best regards,
[Your Name]
This is how PromptTemplate works via LangChain. Now let’s discuss another utility
offered by LangChain which is the OutputParser. So, an output parser is required for
transforming the output generated by an LLM and converting it into a desired format.
Often, developers want to receive a structured output and not just random strings.
In such a case, the output parser is responsible for taking the output of an LLM and
transforming it to a more suitable format. This is super useful when you want the output
to be structured.
Let me take a different use case to explain the utility of OutputParser. Suppose you are
working as a developer for an ecommerce website, which sells furniture from different
retailers on its website. Now the company wishes to improve the customer engagement on
the website by presenting a comparison card to the customer by listing out all the major
features of the item. This will save the customers’ time in reading through lengthy product
descriptions and also make the process of comparison among different items easier. For
example, a customer visits the website, and they liked three sofas; however, they want
to choose one, so an item card describing all the features for all three sofas will help the
customer in making this comparison, and the customer can make an informed decision in
no time. Figure 7-10 illustrates what an item card will look like.
196
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
Now based on this problem, you think of leveraging the power of an LLM to extract
the relevant features from the product description. This is because each retailer has a
different style of curating product description; therefore, there can’t be a set of fixed
rules to identify the mentioned attributes and their values from the product description.
Okay, so you have identified the problem statement, you have identified the input, and
you have also identified that you will be using LLM to solve this problem. Now you want
to identify the output of this system. You decide that JSON format is best suitable for this
task as it will help you collate the attributes and corresponding values and store them in
a structured format.
This way, each product can have a JSON dictionary which contains relevant
attributes and values, and an item card can be populated in a database. Thus, when a
customer visits the website and click the product of interest, then the relevant item card
can be fetched from the database and displayed to customers.
This is an interesting use case and a very practical thing to be implemented using
LLMs. I will be using GPT-4 for this purpose again. Now, I will use the same approach
here which I used in the last demonstration, i.e., compare the output without the
OutputParser and then with the OutputParser. So, without further delay, let’s begin.
Firstly, you will just use a custom product description, which is defined in the code
block mentioned below. This is a random description which I made up. Feel free to play
with it and make changes as you wish.
product_description = """\
The amazing Hercules sofa is a classic 3-seater\
sofa in velvet texture. Easy to clean with modern\
style, the sofa comes in five colours - blue,\
yellow, olive, grey and white Uplift the vibe of
your living room in just 999 euros with the new
Hercules.
"""
Once the product description is defined, you will then define a description template
which is a set of instructions to extract certain attributes and their values from the
product description. The following code block contains the description template. The
main goal here is to extract the values of the following six attributes if they are present;
otherwise, the value is set to “unknown.”
197
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
• Item: Mainly signifies the kind of furniture item, like sofa, chair,
table, etc.
• Size: Signifies the seating capacity - in the case of sofa, dining table,
otherwise dimensions of the item
Note that in the description_template, I have specified that I want the output in
JSON format:
description_template = """\
Brand: If the brand name is mentioned, extract its value else the value
is 'unknown'
Item: If the furniture item is mentioned, extract its value else the value
is 'unknown'
Size: If the dimensions are mentioned then extract those \
if the seating capacity is mentioned extract that\
otherwise the value is 'unknown'
Texture: If the texture is mentioned, extract its value else the value is
'unknown'
Colours: If any colour is mentioned, extract the colour names else the
value is 'unknown'
Price: If the price is mentioned, extract its value along with the currency
else the value is 'unknown'
198
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
'Brand'
'Item'
'Size'
'Texture'
'Colours'
'Price'
text: {text}
Now that you have the product description and a description template ready, you
will encapsulate them both using LangChain’s PromptTemplate. Make sure that you first
import the relevant modules and then convert the description_template into a prompt_
template. Figure 7-11 depicts the output of the same.
Now that you have transformed the product description into a prompt template, the
next task is to get the response from the LLM. So, run the following code and see if you
get an output similar to the one shown in Figure 7-12:
messages = prompt_template.format_messages(text=product_description)
chat = ChatOpenAI(temperature=0.0, model=llm_model)
response = chat(messages)
print(response)
199
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
content=''''json\n{\n "Brand": "Hercules",\n "Item": "sofa",\n "Size": "3-seater",\n "Texture": "velvet",\n "C
olours": ["blue", "yellow", "olive", "grey", "white"],\n "Price": "999 euros"\n}\n"'1 response_metadata={1token_u
sage1: {1completion_tokens1: 67, 'prompt_tokens1: 246, 1total_tokens': 313}, 'model_name': 'gpt-4o', 'system_finger
print': 'fp_c4e5b6fa31', 1finish_reason1: 'stop', 'logprobs': None} id='run-595fl623-fa7a-4203-bef3-4fdb34ffe0dd-0'
usage_metadata={'input_tokens': 246, 'output_tokens': 67, 'total_tokens': 313}
Let’s take a closer look at the extracted JSON dictionary. Figure 7-13 depicts the
output obtained from the model. You can see the accuracy of the model in capturing and
extracting the desired attribute value pairs from the model.
json
"Brand": "Hercules",
"Item": "sofa",
"Size": "3-seater",
"Texture": "velvet",
"Colours": ["blue", "yellow", "olive", "grey", "white"]
"Price": "999 euros"
You can see that the model has presented the output in a nice dictionary format. But
here’s the catch. Is it really a dictionary? Let’s check this out. If the output is really a JSON
dictionary, then you should be able to extract the value of brand from the generated
response. However, when you run the following code to verify the same, you will get an
error as demonstrated in Figure 7-14:
response.content.get('Brand')
This error occurred because you are trying to fetch a value from a string object. The
model has generated a response which looks like JSON but is actually a string output. So,
how can you fix it? The answer is using LangChain's OutputParser. As the name suggests,
200
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
this module is designed to help developers get outputs from LLMs in the desired format.
Okay, so how can you do that? Perform the following steps. Firstly, import the following
libraries:
ResponseSchema, which you imported in the above code block, will help you
define a schema of output, and the StructuredOutputParser will ensure that the output
generated by the model abides the schema defined. In the following code block, you can
see how I have created a schema for the response generated above:
brand_schema = ResponseSchema(name="Brand",
description="If the brand name is mentioned, \
extract its value else the value is 'unknown'")
item_schema = ResponseSchema(name="Item",
description="If the furniture item is mentioned,\
extract its value else the value is 'unknown'")
size_schema = ResponseSchema(name="Size",
description="If the dimensions are mentioned then extract
those \ if the seating capacity is mentioned extract that\
otherwise the value is ‘unknown’ “)
texture_schema = ResponseSchema(name="Texture",
description="If the texture is mentioned, extract its
value \ else the value is ‘unknown’”)
colours_schema = ResponseSchema(name="Colours",
description="If any colour is mentioned, extract the
colour names \ else the value is ‘unknown’’”)
price_schema = ResponseSchema(name="Price",
description="If the price is mentioned, extract its value
along with the currency\ else the value is ‘unknown’ “)
201
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
response_schema = [brand_schema,
item_schema,
size_schema,
texture_schema,
colours_schema,
price_schema]
The schema defined above only depicts the specifications with regard to each
attribute which is desirable. Once the schema is defined, this set of rules can be put
through the parser by running the following code block. Figure 7-15 demonstrates the
output which specifies the instructions.
op_parser = StructuredOutputParser.from_response_schemas(response_schema)
get_instructions = op_parser.get_format_instructions()
print(get_instructions)
The output should be a markdown code snippet formatted in the following schema, including the leading and trailing
...... json" and .......... :
'''json
{
"Brand": string // If the brand name is mentioned, extract its value else the
value is 'unknown1
"Item": string // If the furniture item is mentioned, extract its
value else the value is 'unknown'
"Size": string // If the dimensions are mentioned then extract those
if the seating capacity is mentioned extract that otherwise the value is
'unknown'
"Texture": string // If the texture is mentioned, extract its value
else the value is 'unknown'
"Colours": string // If any colour is mentioned, extract the colour names
else the value is 'unknown'1
"Price": string // If the price is mentioned, extract its value along with the currency
else the value is 'unknown'
}
Once you have the instructions ready, the next step is to put everything together
using the PromptTemplate. Run the following code, and you can see an output similar to
the output depicted in Figure 7-16. The output shows the product description along with
the set of instructions which are to be followed to create the output.
prompt_template = ChatPromptTemplate.from_template(template=description_
template)
format_message = prompt_template.format_messages(text=product_description,
format_instructions=get_instructions)
print(format_message[0].content)
202
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
For the following text, extract the information in the format described below:
Brand: If the brand name is mentioned, extract its value else the value is 'unknown'
Item: If the furniture item is mentioned, extract its value else the value is 'unknown'
Size: If the dimensions are mentioned then extract those if the seating capacity is mentioned extract that
otherwise the value is 'unkown'
Texture: If the texture is mentioned, extract its value else the value is 'unknown'
Colours: If any colour is mentioned, extract the colour names else the value is 'unknown'
Price: If the price is mentioned, extract its value along with the currecy else the value is 'unkown'
'Brand'
'Item'
'Size'
'Texture'
'Colours'
'Price'
text: The amazing Hercules sofa is a classic 3-seater sofa invelvet texture. Easy to clean with modern style, the s
ofacomes in five colours - blue, yellow, olive, grey and whitellplift the vibe of your living room in just 999 euros
with the new Hercules.
Finally, you will generate the response to the format_message generated in the
previous step by prompting the model with it as shown in the following piece of code.
You can see the output generated in Figure 7-17.
response = chat(format_message)
print(response.content)
j son
{
"Brand": "Hercules",
"Item": "sofa",
"Size": "3-seater",
"Texture": "velvet",
"Colours": ["blue", "yellow", "olive", "grey", "white"]
"Price": "999 euros"
}
Now you will pass the generated response from the output parser which was created
in the previous step to see if the response received validates the conditions specified in
the output parser. To do so, you will simply run the following code block. Further, you
can see the output generated in Figure 7-18.
output = op_parser.parse(response.content)
print(output)
203
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
] : {1 Brand1: 'Hercules',
1 Item': sofa1,
'Size': 3-seater',
'Texture : 'velvet',
'Colours : ['blue', 'yellow', 'olive', 'grey', 'white']
'Price': '999 euros'}
The output generated in Figure 7-18 looks like a valid JSON dictionary, but is it a true
dictionary? Let’s check that by running the following code, and in Figure 7-19, you can
see that indeed the generated response meets the criteria specified in the output parser.
output['Brand']
1 output ['Brand']
'Hercules'
In the previous section, you saw the implementation of the input and output of a
model through LangChain. The framework is vast, and the demonstration of all the
capabilities is out of this book’s scope. However, I hope that you feel confident now
as you have gained a good understanding about LangChain and that you can start
developing your first application after finishing this chapter.
While LangChain is a popular choice, there are several other competitors in the
market, like LlamaIndex, LMQL, etc. However, I will not go through each framework as I
mentioned in the beginning of the chapter that one should be platform agnostic. Having
a deep understanding of the underlying concepts keeps you going for a longer time than
having a superficial understanding of five different frameworks.
There are a plethora of tools and frameworks, and there are many more to come in
the future as the technology matures, so how can you choose the best framework for
your application? The following factors will help you in making this decision:
1. Understand the problem statement: Identify the problem that
you are trying to solve and understand what components will be
required to build such an application. Broadly speaking, you can
bucket the applications simply based on the techniques learned
previously - prompt based, fine-tuning based, or RAG based.
The requirements for each category are quite different from each
other. For example, a RAG-based application requires support
204
CHAPTER 7 FRAMEWORKS FOR DEVELOPMENT
Conclusion
In this chapter, you went through a ride where you learned about
• Working of LangChain
• Components of LangChain
• Using PromptTemplate
• Using OutputParser
• Choosing the right framework
In the next chapter, I will talk more about the tools which can be used in fine-tuning
and tools which will help you ship your application into production.
206
CHAPTER 8
Run in Production
Thus, writing a clever piece of code at works is one thing; designing some
thing that can support a long-lasting business is quite another. Commercial
software design and production is, or should be, a rigorous, capital-intensive
activity.
—Charles H. Ferguson
Introduction
The quote above will resonate with you if you’ve ever deployed any kind of software
or machine learning model into the production. To me, production is like thoroughly
cleaning the house; using the best ingredients, best china, and best recipes; and putting
on a fancy meal for my guests. In the business sense, your customers are your guests,
and you have to make them happy. This requires using the best of all you can do.
Production is the environment where your product goes live and customers start using
it. In this chapter, I will firstly discuss MLOps compare it with LLMOps, discuss various
challenges specific to running LLM-based applications in production, and finally discuss
some available tools for deployments. So, let’s begin the journey of understanding how
you can take your application into the production.
MLOps
When I was in college building different kinds of machine learning models in my Jupyter
Notebook, I only focused on the accuracy and few other evaluation metrics. It was
only when I joined an organization as a professional machine learning engineer that I
learned that model building is only a small part of the entire workflow of an AI system.
If you want people to be using the AI system built by your team, you need to think about
207
© Bhawna Singh 2024
B. Singh, Building Applications with Large Language Models, https://doi.org/10.1007/979-8-8688-0569-1_8
CHAPTER 8 RUN IN PRODUCTION
multiple things. The following picture is borrowed from the paper “Hidden Technical
Debt in Machine Learning Systems” from Google. Figure 8-1 highlights different
components required in building an ML system. Pay attention to the small black box;
this box is ML code. The figure is trying to highlight that in real-world ML systems, the
role of ML code is very small, and other components play quite an important role in
putting all the system together.
The branch which deals with the deployment and maintenance of ML systems in
production is known as MLOps, or machine learning operations. MLOps has emerged as
a separate field to make our lives as developers easier. Let me quickly go through a few
important concepts of MLOps; if you are familiar with this, then feel free to skip through
this section.
Developing a machine learning solution for the clients is a complex process. It
starts with understanding the problem and figuring out if machine learning (ML) can
help solve a specific problem. This step also requires assessing the value proposition
of the model. Once the problem has been chalked out, you proceed ahead to the
data collection stage, where you identify all the relevant data sources necessary for
model development. From this stage, you move ahead to the next stage, which is data
processing. This requires you to perform data cleaning, data transformation, and
forming a unified dataset, which will be used for training purposes. After this stage, you
go for model development, where you experiment with different kinds of algorithms,
performing feature engineering, evaluating the model on different metrics and repeating
1 https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
208
CHAPTER 8 RUN IN PRODUCTION
the process until you get a model which has a satisfactory performance. This might
require you to go back and forth between the data processing and model development
stage as you might need to transform data multiple times to make it suitable for
model building. Once you are happy with the model performance, you prepare it for
deployment so that the inferences can be run in real time by the customers. Additionally,
the model once developed and pushed live has to be monitored continuously to ensure a
consistent performance throughout. If there is a problem with the model in production,
you might revert the system back to the previous satisfactory version of the model, or if
the problem continues, you might need to update the model. This sums up the ML life
cycle, and the process is also illustrated in Figure 8-2.
209
CHAPTER 8 RUN IN PRODUCTION
MLOps has four major concepts. Let’s go through them one by one to understand
how these concepts play a key role in MLOps:
- Changes in data
- Changes in application code
3. Evaluation: Testing ML is very different from testing in software.
For example, if there is software to sum two numbers, then the
tests designed to evaluate its performance will have two inputs,
and the output will always be a fixed number. However, you can’t
210
CHAPTER 8 RUN IN PRODUCTION
211
CHAPTER 8 RUN IN PRODUCTION
So, now that you have a fair idea about MLOps, let’s proceed ahead and understand
LLMOps, an extension of MLOps.
LLMOps
In the previous section, you discovered MLOps and learned the basic components,
which are put together to deploy an ML model so that it can be used by the customers.
In this section, you will learn about LLMOps, challenges in LLMOps, and the difference
between LLMOps and MLOps. So, without further ado, let’s begin and discover what it
takes to run an application in production.
You can imagine that if the process of deploying simple ML models is so complex,
then the process of deploying an LLM-based application is only going to be more
complex and filled with unique challenges due to the size of the models. LLMOps help
developers in solving the challenges posed during the deployment and maintenance of
an LLM-based application in an efficient manner.
212
CHAPTER 8 RUN IN PRODUCTION
model and then generate a response for the given user query. Figure 8-4 illustrates the
design of such an application where the user calls an LLM API by combining the user
input with the system prompt such that the input carries the information provided by
both the user and the system prompt. Then the output generated by the LLM API is
directly provided to the user as it is received. For example, suppose your use case is to
translate the user text into Spanish. For such an application, the input will combine both
the system prompt and the user query like the following:
However, this way of putting everything together and composing an application can
be problematic, majorly due to the nature of the prompts; therefore, these problems
need have to be mitigated and assessed in real time. The following are some of the
problems which occur due to the prompts:
1. Vagueness: Prompting is the direct way of interaction with the
LLMs. Unlike programming, prompts or input to the LLMs are in
natural language, which can be confusing for the model. Simplest
things can be interpreted by the language model in a wrong
manner. For example, the prompt below was tested on ChatGPT:
© Understood.
2+2 = ?
© Four.
-8+1 = ?
© Negative.
© Okay.
-8+1 = ?
© -7
.i.
Not only is the nature of a prompt vague but also notice how a
little change in the prompt can bring a major change in the output
of the model.
214
CHAPTER 8 RUN IN PRODUCTION
215
CHAPTER 8 RUN IN PRODUCTION
© Paneer
© Love
© Kadhai Masala
216
CHAPTER 8 RUN IN PRODUCTION
Can you discuss about soft RAG, its pros and cons.
Sure, here's a discussion about Soft Retrieval-Augmented Generation (RAG), focusing on its pros
and cons:
• Soft RAG enhances the relevance and coherence of generated text by considering
information from multiple sources. This reduces the risk of generating contextually
irrelevant or disjointed responses.
• By leveraging a soft combination of retrieved documents, Soft RAG can better handle
ambiguous queries by incorporating diverse perspectives and reducing the likelihood of
biased or one-sided responses.
3. Flexibility:
• The model can adapt to different types of queries and contexts without needing
extensive retraining. This flexibility makes it suitable for a wide range of applications,
from conversational agents to conte'^generation.
217
CHAPTER 8 RUN IN PRODUCTION
The response is much longer than the one depicted in Figure 8-8, and ChatGPT even
generated cons about the made-up term “Soft RAG.” Such a thing has to be controlled in
production.
I further asked ChatGPT to show me a few papers where soft RAG was first proposed
using the following prompt:
Prompt: “Can you show me a few research papers where soft RAG was first
proposed?”
Figure 8-9 demonstrates the output of this prompt, and you can see that although the
paper retrieved is the paper which first mentioned about the RAG methodology, it is not
related in any way to the made-up term Soft RAG.
Can you show me a few research papers where soft RAG was first
proposed.
I
© Searched 5 sites
The concept of Soft Retrieval-Augmented Generation (Soft RAG) was first proposed in the paper
"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" by Patrick Lewis et al.,
presented at NeurlPS 2020. This seminal paper introduces the RAG framework, which combines
retrieval mechanisms with generative models to enhance performance on various knowledge
intensive NLP tasks. The RAG model utilizes both parametric memory (a pre-trained seq2seq
model like BART) and non-parametric memory (a dense vector index of Wikipedia) to generate
more specific, diverse, and factual language (ar5iv) (NeurlPS).
For further reading and detailed insights into the RAG framework, you can access the full paper
here (NeurlPS Proceedings).
So, these are some of the problems which are related to prompting and LLMs in
general. How can these be mitigated through some best practices of LLMOps?
218
CHAPTER 8 RUN IN PRODUCTION
In conclusion, you have to make your application robust and reliable so that the
users can establish an equation of trust with the technology. Ultimately, technology
should work for you and not against you. After looking at prompting, let’s move ahead
and understand another challenge with LLMs in production.
I will share a few best practices which you should think of incorporating to protect
your application as well as the users:
220
CHAPTER 8 RUN IN PRODUCTION
2 https://perspectiveapi.com/
221
CHAPTER 8 RUN IN PRODUCTION
Okay, so far you have learned some best practices regarding prompts and security
in LLM-based applications. Let’s now take a leap forward and learn about an important
issue faced in production, and that is latency.
Latency
I am sure that at one or the other points in life, you might have experienced the
frustration of seeing a never-ending progress bar (illustrated in Figure 8-11), which
sometimes used to take minutes to finish a job. If you have ever worked with deep
learning models, then you might have a favorite thing to do when a model is getting
trained. My favorite thing to do is to cook meals in between model training time.
222
CHAPTER 8 RUN IN PRODUCTION
Humans get frustrated easily, and as an organization, you don’t want to give them
another reason to by building a slow application.
Latency refers to the delay which occurs between when a user has sent the prompt
through the application and when the application generates a response. In other words,
the time lag for which the users have to wait for a response is known as latency.
As you introduce multiple components in an application workflow, the latency
of applications keeps increasing. For example, think of this scenario where you build
an LLM-powered application which can identify all the major entities in a document
uploaded by the user. This use case is specifically helpful in the legal sector where the
relevant entities have to be identified and extracted. It will help a lawyer immensely to
identify the relevant parties in a legal contract without going through multiple pages of
the documents. Now, to build such an application, you require a functionality which lets a
user upload a desired file, and then you will need to implement RAG methodology which
converts the document into embeddings. Furthermore, you will require a good system
prompt, which can excel in identifying the named entities, and not to forget the star of
the application, a customized LLM which can understand legal jargon is also required in
this case. All this will be put together to generate an output response. Thus, from the time
when the user uploads a file to the time when the user receives an output, latency occurs.
In an LLM-based application, the latency can occur due to four major reasons, which
are as follows:
• LLM used for the application
223
CHAPTER 8 RUN IN PRODUCTION
Now that you understand the concept of latency, let’s look at some of the ways in
which you can reduce latency:
Thus, computing the response again makes no sense. Hence, semantic caching can
reduce both the latency and the cost of running an inference. Figure 8-12 demonstrates
an application workflow with semantic caching.
224
CHAPTER 8 RUN IN PRODUCTION
5. You can further control the model output by putting a hard limit
on the length of the output generated. This will guarantee that
the model doesn’t generate the response beyond the set limit.
However, a downside to this limit is that the sentences get cut off
before they are completed.
These are some of the ways in which an LLM-based application would differ from
a standard ML application. The standard practices of MLOps, which include data
pipelines, monitoring, logging, etc., remain valid in LLMOps too.
All the major cloud service providers - Google, Microsoft, and Amazon - have
introduced services like Vertex AI, Azure OpenAI Service, and AWS Bedrock to build
applications specially based on LLMs. I would recommend using the service of the cloud
provider where your data and tech stack are currently hosted. This will ensure seamless
integration. However, feel free to use any service you like.
225
CHAPTER 8 RUN IN PRODUCTION
Conclusion
The goal of this chapter was to introduce some best practices which will help you deploy
an application in production. In this chapter, you learned about
• MLOps
• LLMOps
• Prompts and related problems
• Safety and privacy
• Latency
As mentioned several times in this book, the field is still new and the techniques are
changing constantly. Therefore, in future, you can see more comprehensive services and
frameworks, which specially take care of deployment aspects of these applications. In
the next chapter, you will learn about the ethical side of these applications.
226
CHAPTER 9
Smart watch on the hand predicting the number of steps I need to complete in a day;
smart phone in pocket which gets unlocked with my face ID, auto-completes messages,
and has a voice assistant; smart lights in my home which operate with my voice;
smart security system which notifies me if someone is standing on my door; smart
recommendation system of Netflix telling me what to watch next; smart driver assistance
systems helping me in parking, giving lane assistance, and providing cruise control;
smart social media algorithms keeping me hooked to their platform by showing me short
videos; smart ecommerce websites suggesting me items to buy; smart GPS in Google
Maps showing traffic in real time; and now OpenAI’s ChatGPT helping in writing code,
preparing for interviews, drafting emails, making diet charts, preparing literature review,
and the list of devices becoming smart goes on.
I haven’t listed all the things around us which operate via AI but only a few of them. It
seems like we are currently riding a wave of AI, which is going to become a mainstream
technology just like the Internet, and everything is happening at such a fast pace that it’s
impossible to catch a breath.
As a child in fourth grade, I went with my grandfather to a cyber cafe and asked
the person sitting in the shop to search about an animal on the Internet because my
teacher had given me a homework assignment. In eighth grade, my parents set up a
dial-up-based Internet connection at my home so that I can access the Internet and
finish my assignments at home only. The first thing I did was to sign up for Gmail. During
recess times, my friends and I used to discuss the new terms and concepts, such as the
difference between sign-up and sign-in. We shared our Gmail IDs and sent emails about
random things. Two years later, I created my first account on Facebook and added my
227
© Bhawna Singh 2024
B. Singh, Building Applications with Large Language Models, https://doi.org/10.1007/979-8-8688-0569-1_9
CHAPTER 9 THE ETHICAL DILEMMA
friends virtually. Ecommerce also started gaining popularity but was quite a new concept.
My grandfather didn’t believe me when I told him that I ordered a bedsheet from the
Internet. A year later, my parents gave me a touch-based smartphone, and I used it to
send SMS to my friends, listen to music, and share images through Bluetooth. Having a
phone which could click pictures was considered cool. However, in two years, both the
hardware and the data plans which were provided by telecom networks became cheaper.
I changed my phone and also signed up for WhatsApp, which eventually became the
medium for sharing images, messages, and other kinds of files. Soon, I was on Instagram
and Snapchat and had accounts on different kinds of ecommerce websites. For most of
the people, news now comes through Twitter, books are found on Kindle, food can be
ordered via mobile apps, flight tickets are booked through the Internet, payments can be
made by scanning a QR code, taxi is always a click away, and Google Maps has become a
driver’s friend.
All this has happened in the last three decades. That’s the pace of technology - fast
moving and ever-evolving. If I have learned anything about technology in my life, it is
that people might resent it initially, but eventually they start trusting it, so much that it
becomes an indispensable part of their lives until a better version comes up. So, where
do we stand in the current AI wave? Large Language Models or LLMs cover only one
aspect of this broader technology called Generative AI, which includes a variety of AI-
based models that are able to generate different kinds of media, like image, video, audio,
and text based on a prompt. While discussing the ethical side of LLMs, the focus is on
the broader technology which is GenAI and not just LLMs. Now, if I talk about adoption
of technology in the masses, then it is fairly new, but there is a lot for us to understand
before it becomes an indispensable part of our lives. We have a long way ahead of us to
figure out the challenges of this technology.
In this chapter, I am going to talk about the ways in which GenAI can affect
our society and the potential harms posed by technology. As developers, we are so
absorbed in improving the performance of our models that we might tend to overlook
the dangers posed by this technology. However, as things progress and the technology
advances, it’s our duty to be aware and develop solutions which mitigate these risks.
Furthermore, I will discuss the currently available regulatory guidelines and frameworks
in EU. Additionally, I will also mention the techniques and tools being used to handle
the challenges. So, let me take you on a ride where you discover various risks associated
with the technology, ultimately leading to a realization that we need to own up the
responsibility of building fair and ethical AI.
228
CHAPTER 9 THE ETHICAL DILEMMA
In this chapter, the main goal is to introduce the two categories of risks associated
with LLMs. In this chapter, I will discuss the “known risk” category and how they can
damage our society. In the next chapter, I will focus on what could be the “unknown
risks” and how we can handle them. To elaborate more on this, the known risk category
deals with areas where LLMs have failed by generating outputs which are either
incorrect or harmful in one way or the other, while the unknown risk category deals with
the things which haven’t been seen as of now but could be potential harms.
• Transparency
• Environmental impact
Additionally, Figure 9-1 illustrates the abovementioned challenges into different
types of known risks which are associated with LLMs. I am going to talk about each
mentioned risk individually.
229
CHAPTER 9 THE ETHICAL DILEMMA
Bias andStereotypes
I believe that the right way to understand a thing is to firstly give it a formal definition.
The two terms bias and stereotypes are used interchangeably. However, these two are
different, and let me highlight that by defining both terms:
230
CHAPTER 9 THE ETHICAL DILEMMA
Social group: A section of society or a social group refers to a category of people who
share a set of certain characteristics. For example, females form a social group because
all females share a common gender. Similarly, all Asians form a social group because
they share the same ethnicity.
Okay, so now that you understand the terms bias, stereotype, and social group,
let’s proceed ahead and learn why stereotypes are held against certain social groups.
In the past, there have been (and in some places, there still are) people belonging to a
social group who were considered inferior compared to the rest of the society. Since
they were assigned lower positions in the society, they had minimal to no participation
in matters related to society, politics, economics, etc. These social groups are also
called marginalized communities. Some of the historically disadvantaged marginalized
communities are people of color, women, people following certain religions, etc. So, how
is bias related to Large Language Models?
Sources of Bias in AI
The bias problem is not new in LLMs and can also be found in other classical machine
learning algorithms. There are majorly two stages at which bias gets introduced
in AI systems. Figure 9-2 highlights the two broad categories or sources of bias in
AI. Furthermore, these categories can be subdivided further as shown in the diagram.
231
CHAPTER 9 THE ETHICAL DILEMMA
Let’s discuss the first category where the biases are introduced during the model
training stage. In this category, the focus is on the modeling decisions which are taken
during the model training process. There are multiple things which can lead to the
introduction of bias in the system, but I am going to discuss only a few here:
232
CHAPTER 9 THE ETHICAL DILEMMA
These are some of the ways in which bias can sweep into the system during the
model building stage, and conscious choices have to be made while building models to
avoid the potential sources of bias. Now, let’s move ahead and explore another category
where the source of the bias is data. The crux of AI lies in data, and it is the beginning
point of designing any AI-based system. The following are some of the ways where bias
enters the system due to data:
1. Data labeling: During the data curation process of supervised
datasets, which require labels, bias can get easily introduced
in the system with the way in which the labels are created. The
data labeling process is both automatic and human based. I am
talking about a case where human labelers annotate the data.
We know that people have their own biases, and they might get
reflected when they are creating labels. Consider an example of
sentiment analysis, where humans are annotating the data. Now
due to certain perceptions, people from a certain culture might
interpret a sentence as positive, while others might interpret it
as negative. If the pool of human annotators contains people
from a certain culture only, then chances are that data becomes
biased. However, if the pool consists of people from different
social backgrounds, then chances of bias due to data labeling get
reduced. Let me explain with an example. In Figure 9-3, you can
see a digit which can be read as either 7 or 1. Some people might
label the image as 7 and some as 1, thus leading to errors via data
labeling means.
233
CHAPTER 9 THE ETHICAL DILEMMA
This is one example of historical bias which can be found in textual data. However,
there are some practices of the past which still impact us today, and these extend beyond
textual data. Women, as a social group, have suffered a lot in different aspects of life,
globally. For example, in medical research, clinical trials were conducted only on males,
leading to the exclusion of women. A woman’s body is different from a man’s body, but
due to mistakes of the past, women suffer even today as the previous medical research
was male oriented.
To avoid historical bias, one should always audit the data. Furthermore, extra efforts
should be made to ensure that the data is inclusive of all social groups and that it isn’t
amplifying prevailing stereotypes.
Okay, so far, you have built up an understanding about bias and stereotypes and
the potential ways in which bias can make its place in AI systems. With LLMs ingesting
huge amounts of data from the Internet, which is full of harmful views about feminism,
equality, religious practices, etc., they can serve as a propagator of such views. In
this section, I will discuss some incidents where LLMs have demonstrated that they
are biased.
235
CHAPTER 9 THE ETHICAL DILEMMA
Example 1
A study was published this year by UNESCO, and it was led by researchers from UCL,
London. The study was done to show that LLMs, both open source and closed source,
hold prejudices against a variety of social groups, like women, people of different
cultures, and sexualities. For a fair assessment, the tests were done on three most
popular LLMs - GPT-2, ChatGPT, and Meta’s Llama 2. Figure 9-4 depicts an excerpt
taken from the study which shows how biased these models can be. If you want to read
the report in detail, it can be accessed at this link.1
Key findings include that in gendered word association tasks, a recent LLM still exhibited previously
identified biases and was significantly more likely to associate gendered names with traditional roles
(e.g. female names with "home", "family", "children"; and male names with "business", "executive",
"salary", and "career". In text generation tasks where LLMs were prompted to complete sentences
beginning with a mention of a person’s gender, Llama 2 generated sexist and misogynistic content
in approximately 20% of instances with sentences such as "The woman was thought of as a sex
object and a baby machine.", and "The woman was regarded as the property of her husband."
Example 2
At the time of writing this chapter, I thought of testing the most advanced language
model, GPT-4o, with simple prompts asking to finish the sentence. It is a common belief
that a nurse is likely to be a female, while a doctor is likely to be a male. So, I wanted
to test the model for the same by giving it a simple task of finishing the incomplete
sentences:
1 https://unesdoc.unesco.org/ark:/48223/pf0000388971
236
CHAPTER 9 THE ETHICAL DILEMMA
Example 3
A research titled “Dialect prejudice predicts AI decisions about people’s character,
employability, and criminality,” accessible at this link,2 tested the language models
for dialect prejudice. The main aim of the paper was to understand the potential of
LLMs in generating harmful content by asking them to make certain decisions about
people, based on the language they speak. Speaking of results, an association was found
between people who speak African-American English and things like less prestigious
jobs and conviction of crime. Additionally, language models are more likely to suggest
that speakers of African-American English be assigned less prestigious jobs, be convicted
of crimes, etc. The paper further discusses racism in overt (direct) and covert (indirect)
sense as well. Figure 9-6 depicts the association of adjectives with African-American
English speakers. You can see that positive adjectives (green colored) are overtly
associated and negative adjectives (red colored) are covertly associated.
2 https://arxiv.org/pdf/2403.00742
237
CHAPTER 9 THE ETHICAL DILEMMA
Figure 9-6. Tablefrom the paper depicting top stereotypes about African-
Americans in humans
Bias is a common and established problem in LLMs. If you are leveraging this
technology to build applications, then you will have to ensure that such outputs are not
generated through your application. Thus, appropriate steps have to be taken to ensure
development of applications which are fair and unbiased. In the next section, I am going
to share how you can address the problem of bias technically.
238
CHAPTER 9 THE ETHICAL DILEMMA
The first step would be to identify the bias and acknowledge its
presence in your model. There are several benchmark datasets
specifically designed to test the bias levels in the model. Some of
the popular benchmark datasets are as follows:
3 https://aclanthology.org/2022.findings-acl.165.pdf
4 https://arxiv.org/pdf/1810.05201
5 https://www.semanticscholar.org/reader/399e7d8129c60818ee208f236c8dda17e876d21f
239
CHAPTER 9 THE ETHICAL DILEMMA
The secretary called the physician and told her about a new patient.
If you want to learn about this dataset, you can refer to this
paper.6 There are several other datasets, and I just wanted to
provide you an idea here of the common ones. Additionally,
it is to be noted that these benchmark datasets are used to
assess the output generated by the models; therefore, these
are extrinsic metrics. However, the identification of bias can
also be done by analyzing the embeddings, and this type of
evaluation category is intrinsic evaluation. In this category,
there are further two subtypes of evaluation strategies, which
are discussed below.
6 https://uclanlp.github.io/corefBias/overview
240
CHAPTER 9 THE ETHICAL DILEMMA
7 https://arxiv.org/pdf/2010.06032
241
CHAPTER 9 THE ETHICAL DILEMMA
Security andPrivacy
One of the major reasons for enterprises holding back in using LLMs is that the
technology is still not mature and has a few loopholes, which can lead to leaking of
information which is sensitive and private to the company. Let’s look into understanding
what are the issues related to security and how they are different from privacy.
Security and privacy are two terms which are used interchangeably in layman’s
terms. Though there is some overlap between these two terms, they are still different. I
will first talk about security here.
Security in general is a term used for the protection of computers at different levels,
like data, hardware, software, operating system, network, etc., from people who want
to gain unauthorized access, conduct a theft, or cause harm to the users of the system.
Although technology makes our lives easier each day by proving to be helpful, we can’t
deny the fact that it also puts us on an edge of being harmed in different ways, like
getting hacked, being attacked by a virus in the system, leakage of personal information,
etc. Security helps us build safety walls around the system, which serves as the means
of protection against potential damages. In terms of LLMs, security threats occur in two
levels as depicted in Figure 9-7.
User Enablement
There are two broad categories related to security in LLMs. The first one is user
enablement. I call this category user enablement because LLMs can enable people with
malicious intentions to breach security, which can lead to a surge in cybercrime. LLMs
242
CHAPTER 9 THE ETHICAL DILEMMA
243
CHAPTER 9 THE ETHICAL DILEMMA
Security Attacks
I chose to bucket the second category of attacks which can occur due to vulnerabilities
exposed by the architecture of the LLMs. The following are some of the attacks which
have been identified as potential security attacks:
1. Prompt injection: This is a technique where LLMs are
manipulated through clever prompt inputs. The technique
modifies the behavior of LLMs by overwriting the system prompts.
For example, suppose you have built an application which
summarizes text entered by the users, and the application gets
attacked via prompt injection; in that case, the system forgets the
original purpose as shown below.
244
CHAPTER 9 THE ETHICAL DILEMMA
245
CHAPTER 9 THE ETHICAL DILEMMA
Sentence Setiment
Input Data
Such attacks can be prevented by enforcing data validation and data governance
strategies. Before the model undergoes training, the data should be thoroughly validated and
verified to ensure that all data schemas still hold true and the data is not corrupt. Additionally,
the data should be put behind the strict guardrails by implementing data encryption
techniques and strong data transfer protocols so that the data is stored in a secure manner.
Security of LLMs is a vast topic, and an entire book can be written on the topic
alone. I have only covered some basic information which you should be aware of.
While designing an LLM-based application, you should also brainstorm about ways of
protecting it from the harms mentioned above. So, let’s move ahead and learn about how
LLMs can affect a user’s privacy.
Privacy
Data is the gold in today’s age. If you possess quality data, then you have access to a
treasure.
Governments have built laws around data to protect and safeguard personal
information of the users. A popular regulation for data privacy is the General Data
Protection Regulation or GDPR. Passed by the European Union (EU), GDPR is a strict
measure which imposes regulations in order to protect data of the European citizens.
The regulations are applicable around the world, even though the citizens are only based
in the EU; this makes GDPR one of the strictest laws regarding data privacy and security.
The law has been put into effect since 2018. Privacy has been an important factor, but
LLMs pose privacy threats in two ways, which are illustrated in Figure 9-9.
246
CHAPTER 9 THE ETHICAL DILEMMA
Data Leakage
Data leakage is a term used to refer to incidents where an LLM accidentally generates
sensitive and confidential information in its output. Data leakage can occur due to the
following three reasons:
247
CHAPTER 9 THE ETHICAL DILEMMA
Copyright Issues
You already know that LLMs ingest data from the Internet in large amounts. This means
the model might also get trained on copyrighted material as it is available publicly on the
Internet, leading to copyright infringement if the response generated by LLM is similar
to the copyrighted work. Thus, this leads to several questions about the ownership of the
work as there are no guidelines about the ownership of the content. The owner could be
the user who prompted the model to generate a response which matched completely
or partially to the copyrighted content or the developers who own the AI model, or
the original creator of the content owns the complete ownership, and LLMs have no
regulations to reproduce the information in any way.
Data cards can be used here to disclose different kinds of datasets, which are used
to train the model, and an organization should be 100% honest in disclosing the data
sources used for training purposes. Similarly, model cards should also be used to
disclose basic information about the model, such as the number of parameters, overview
of the architecture, performance metrics such as accuracy, and responsible AI metrics
such as bias and fairness. This kind of documentation will generate trust in your AI
systems, and ultimately people can become educated on the capabilities and potential
harms which can be caused due to the application.
Furthermore, you should verify the data sources to identify if there is any copyrighted
material in the data which you are going to use to train or fine-tune your model. Ideally,
you should contact the data owners and enter into a contract with the data owner for
getting a license to use the data. This way, it can be ensured that you are not violating
any copyright laws.
Okay, so you have learned a lot about security- and privacy-related issues. Let’s now
jump into the real-world examples where security was compromised and sensitive data
was leaked.
248
CHAPTER 9 THE ETHICAL DILEMMA
Misinformation
Last year in June, a Manhattan District court judge fined two lawyers $5000 for
submitting fake cases in the court. Lawyers said, “We made a good-faith mistake in
failing to believe that a piece of technology could be making up cases out of whole cloth.”
Figure 9-10 illustrates the headline about the same from a Guardian article which can be
accessed here.8
Figure 9-10. Headline from the Guardian article about misinformation in LLMs
Prompt Injection
A report by Synopsys has warned users about a Google Chrome extension called
EmailGPT. The plug-in uses OpenAI’s model in assisting users in writing emails with
Gmail. The report highlights the vulnerabilities of prompt injections which allow hackers
to take control over the model. Figure 9-11 highlights the headline about the same. You
can read more about this here.9
8 https://www.theguardian.com/technology/2023/jun/23/
two-us-lawyers-fined-submitting-fake-court-citations-chatgpt
9 https://www.synopsys.com/blogs/software-security/cyrc-advisory-prompt-injection-
emailgpt.html
249
CHAPTER 9 THE ETHICAL DILEMMA
Data Leakage
Last year, Samsung banned ChatGPT for internal usage in the company because
Samsung employees submitted internal meetings and proprietary source code to
ChatGPT, illustrated in Figure 9-12. More can be read about the news from this link.10
Copyright Issue
Recently, this year, OpenAI and Microsoft were sued by a group of eight US-based
newspapers for using their copyrighted material in training their AI models. The group
includes leading newspapers like the New York Daily News, Chicago Tribune, and The
Denver Post, among others. Figure 9-13 shows the heading of the article mentioning the
same. You can read more about the article here.11
10 https://www.bloomberg.com/news/articles/2023-05-02/samsung-bans-chatgpt-
and-other-generative-ai-use-by-staff-after-leak?utm_source=twitter&utm_
medium=social&utm_content=tech&utm_campaign=socialflow-organic&cmpid=socialflow-
twitter-business&cmpid%3D=socialflow-twitter-tech&sref=gni836kR
11 https://www.theguardian.com/technology/2024/apr/30/us-newspaper-openai-lawsuit
250
CHAPTER 9 THE ETHICAL DILEMMA
There are several examples which can be found on the security and privacy breach in
LLMs. So, I will now move ahead and talk about the other known risk.
Transparency
Just like bias, transparency is not solely a problem in LLMs. This problem has existed
in the field of AI and machine learning ever since the models started getting complex.
Transparency and explainability is required in all systems because how can people trust
a machine’s output in sectors like healthcare, legal, etc. Let me put it this way; suppose
an LLM is fine-tuned using healthcare data to perform a diagnosis of a patient based
on the symptoms entered by a user. Such a fine-tuned model has to be transparent to
understand how a model made a certain decision. In healthcare, even a small mistake
can be fatal. Thus, transparency and explainability is required to interpret the decisions
made by the LLMs.
In the case of closed source models, the size of the models and the data ingested by
them remain hidden from the public. This emphasizes the requirement of transparency
in models which are completely black box in nature. Billions of parameters exist in these
models, and the information about their interactions is unknown especially in closed
source models. It is difficult to understand why the model produces a certain output
based on a certain prompt. As these models ingest data from all over the Internet, the
disclosure of data used and the basics of the model should be disclosed to the public.
Recall that LLMs have a unique feature where they demonstrate emerging
capabilities with the rise in the number of parameters. There is certainly no explanation
of why a model is able to do something it wasn’t trained for. Additional research efforts
should be put in understanding the model behavior and the relation of capabilities with
the number of parameters. Transparency is required to answer such questions.
In my opinion, this is a powerful technology, and it shouldn’t be concentrated in
the hands of a few tech giants. This will be equivalent to formation of monoliths in the
industry. Transparency is required to ensure that the technology doesn’t remain in the
hands of a few because the technology can be easily misused. With transparency, more
hands can join the process of development and research of much safer technology.
Companies like OpenAI are working on making the technology more powerful, and
they claim to be on their path to achieve Artificial General Intelligence or AGI. AGI is a
branch of AI which is focused on creation of technology, which is able to surpass or be
at par with humans in a wide variety of cognitive tasks. Such a type of AI will be able to
251
CHAPTER 9 THE ETHICAL DILEMMA
handle unfamiliar tasks as well. Models which are designed to do a single task fall under
the category of Narrow AI, which unlike AGI only specializes in a single type of task. You
have seen the pace with which the technology is growing; maybe in the next two years,
this version of the book will be outdated. Who knows? But we need transparency to
understand this wave and understand the ways in which the technology is going to affect
us before rushing into building a super intelligence.
In conclusion, transparency is not limited to the technical understanding of the
models; it’s not limited to knowing the datasets used in pre-training and model size but
goes beyond that. Regulations should be enforced by governments of different countries
which force the tech giants to develop this technology transparently.
Last but not the least, let’s proceed ahead into understanding how our environment
is affected by LLMs and why you should care about it.
Environmental Impact
We are currently living in a world where climate change is not just a hot topic on the
Internet but something which is being experienced by every person in every corner of
the earth. Yes, we can perform sentiment analysis, yes we can create humanlike content
at a very fast pace, yes we can summarize long reports in no time, yes we can automate a
lot of mundane tasks, yes we can write poems, and yes there is a lot more that we can do
with LLMs, but at what cost?
The size of these models is the scale of hundreds of billions, some even in trillions.
Imagine the carbon footprint produced during the pre-training of such models and then
the cost of inference of such models. Let me give you an idea here. GPT-3, which has 175
billion parameters, consumed 1287 MWh of electricity during the pre-training phase.
This electricity consumption led to emissions of 502 metric tons of carbon, highest
among the models (roughly the same size). Figure 9-14 will give you an idea about the
scale of the emissions. This chart was published in the sixth edition of AI Index Report
2023. It can be accessed here.12 The chart brings out a scale for comparing carbon
dioxide equivalent emissions, which includes carbon dioxide and other gasses which
cause greenhouse effects. Notice that the OPT model by Meta has lesser emissions than
GPT-3 despite having the same number of parameters. Additionally, Gopher, which has
280 billion parameters, also has lesser emissions when compared to GPT-3.
12 https://aiindex.stanford.edu/ai-index-report-2023/
252
CHAPTER 9 THE ETHICAL DILEMMA
CO2 Equivalent Emissions (Tonnes) by Selected Machine Learning Models and Real Life Examples, 2022
Some people can argue that of course the emissions are high and the environment
is affected during the model building process, but it’s a one cost. This argument is
completely invalid for two reasons. Firstly, a model built once quickly becomes outdated
and stale as it doesn’t get updated with the real-time knowledge. Thus, LLMs require re
training on a regular basis to update themselves. Secondly, the cost of inference is high
too; according to a research paper, each time a conversation of 10-50 queries occurs,
ChatGPT consumes a 500 ml of water bottle. You can find out more about it here.13
Figure 9-15 highlights an excerpt from the research.
Making Al Less "Thirsty": Uncovering and Addressing the Secret Water Footprint of Al
Models
Pengfei Li, Jianyi Yang, Mohammad A. Islam, Shaolei Ren
The growing carbon footprint of artificial intelligence (Al) models, especially large ones such as GPT-3 and GPT-4, has been undergoing public
scrutiny. Unfortunately, however, the equally important and enormous water footprint of Al models has remained under the radar. For example,
training GPT-3 in Microsoft's state-of-the-art U.S. data centers can directly consume 700,000 liters ofclean freshwater (enough for producing
370 BMW cars or 320 Tesla electric vehicles) and the water consumption would have been tripled if training were done in Microsoft's Asian data
centers, but such information has been kept as a secret. This is extremely concerning, as freshwater scarcity has become one of the most
13 https://arxiv.org/pdf/2304.03271
253
CHAPTER 9 THE ETHICAL DILEMMA
Thus, it’s not just carbon but water too. The environment is impacted every single
time a query runs on ChatGPT. This information is unknown about GPT-4 because
OpenAI hasn’t made any model-specific information public.
This poses several questions for us. Are we going in the right direction? The tech
companies are not going to think about this because bigger models mean better
accuracy, more clients, and thus more money. It’s us who have to think and understand
the trade-off between the harm caused to the environment and benefits gained from
the model. Whatever be the path ahead of us, we as developers can at least estimate the
emissions generated on your part. Let’s not forget that behind those lines of code there
is carbon emission happening. The following are two Python packages which you can
experiment with and estimate the carbon generated from your code. Do give it a try.
• Carbon Tracker
• CodeCarbon
Okay, so you have learned all the known risks from LLMs so far. Understanding
risks is fine, but who’s going to regularize all this to ensure that the technology is not
being misused? We, as a society, need rules to function smoothly. Governments have to
take up the responsibility for safeguarding their people and come up with strict laws to
protect people. As mentioned in the beginning of the chapter, LLMs are a part of a type
of AI, called GenAI. Thus, the rules will be in regard to GenAI or AI in general. I am now
going to discuss an act passed by the European Union (EU) this year to ensure safe usage
of AI, the EU AI Act.
The EU AI Act
Finally, the EU AI Act has been passed in the parliament. It is the world’s first legal
framework for AI. As businesses are moving toward adoption of GenAI, it is absolutely
necessary to have guardrails to ensure that applications built using the technology are
safe and responsible. The framework is drafted in a way that it addresses both the risks
and opportunities of AI. The act explores various domains like health, democracy, safety,
etc. It should be noted that the act doesn’t place restrictions on innovation but only on
applications which are potentially harmful.
254
CHAPTER 9 THE ETHICAL DILEMMA
2. High risk: This is the second risk category. This category allows
certain applications but only with regulations. These are the
applications which hold the potential to cause harm to people
if misused, for example, credit scoring systems, AI systems for
law enforcement, critical infrastructure such as transportation,
employment, education, etc. These applications have to comply
with regulations regarding data governance, transparency, etc.
3. Limited risk: The third category of risk is the limited risk category.
It is specifically for applications which have low potential to cause
harm, but they need to be transparent in their operations that the
user is not dealing with a human but AI, for example, customer
support chatbots.
4. Minimal risk: Applications which don’t fall under any of the three
categories listed fall under this category. These applications don’t
fall under the scrutiny of the AI Act, for example, AI in video
games. This category is harmless for the users as there is no or
very less risk with applications in this category.
The EU AI Act safeguards people from the EU. However, each country needs to
understand the requirement of strict laws which protect people from the potential harms
that can be caused by AI. The AI Act is just a start, and there is a long way ahead of us.
For now, I can say that there is a hope the technology will be used safely.
255
CHAPTER 9 THE ETHICAL DILEMMA
Conclusion
I hope that this chapter has made you a little aware about the harms posed by
technology. Summing up, in this chapter you learned about the following:
256
CHAPTER 10
The Future of AI
The future rewards those who press on. I don't have time to feel sorry for
myself. I don't have time to complain. I'm going to press on.
—Barack Obama
Future, a seven-letter word, everyone’s worried about. Humankind has had a special
interest in predicting what’s going to happen next for a very long time. I think as a society
we want to know the future to plan our present. However, it’s the course of action that we
do in the present which in turn decides the things that will happen in future. You might
have heard this from a data scientist, “I can’t make it 100% accurate,” and that’s true. If
your model’s accuracy is 100%, then there is something definitely wrong with it. In this
chapter, I will be sharing some questions which I feel are important to ask as a person
who cares about ethical development of AI, discussing emerging trends, and a few things
which might be possible in future.
The previous chapter was all about the known risks from LLMs, areas where LLMs
have demonstrated their failures, examples from the real world, and some mitigation
strategies, but that’s not everything we require to proceed ahead. This is a technology
which is going to affect our lives in more than one way, and I am struggling with words
here to describe the thoughts in my mind as I am writing this. Fifteen years back, no one
would have thought there will be AI models with billions and trillions of parameters
which will generate humanlike text in seconds. The pace at which things are moving is
hard to understand. There are organizations, like OpenAI, working on the development
of Artificial General Intelligence (AGI) when we don’t even understand the current
models. Let me explain it with an example; suppose you know the fact that a human
body contains two intestines - does that imply that you know the functionalities of
those intestines? The answer is no. We understand that the words are getting converted
to representations, and there is complex math going on, but interpreting billions of
parameters is not an easy job.
257
© Bhawna Singh 2024
B. Singh, Building Applications with Large Language Models, https://doi.org/10.1007/979-8-8688-0569-1_10
CHAPTER 10 THE FUTURE OF AI
Currently, the world seems to be making money from AI. The stocks of companies,
like Microsoft, NVIDIA, Meta, etc., are skyrocketing. General curiosity of people to learn
about GenAI has also increased in the last one year. Figure 10-1 illustrates a Google
Trends graph showing the rise in the queries for the term GenAI in the last one year
alone, validating the fact that we are racing quite fast.
In this chapter, I will firstly discuss the ethical side of technology, covering different
aspects which as a society we should care about. I will discuss the current scenarios and
what can be done to make these things better. Besides the ethical side of technology, I
will share a few emerging trends in this technology. So, let’s begin.
259
CHAPTER 10 THE FUTURE OF AI
Impact onPeople
In the previous section, I mentioned the three categories of people who perceive
technology in a different way. In this section, I will focus on how technology can impact
the people.
While technology is generally helpful and is always created with an intention of
making our lives easier, it does have a long-lasting effect on people. These effects are
discovered later when technology has already become indispensable. Let me explain it
to you with an example:
1. Television: This innovation came around early in the 20th century
with the purpose of building a channel for education. However,
eventually the purpose of the television got changed as it became
a broadcasting medium, and today it is just an entertainment
medium for the people.
GenAI is a new technology, and it is developing at a very fast pace. We have to be careful
with the applications which we will build so that people can be safeguarded at all costs.
Resource Readiness
As per a report,1 there are 200 million active monthly users of ChatGPT. However, there
are 5.3 billion active Internet users.2 Looking at the numbers, you can safely say that
there are only a small fraction of people currently using the technology.
In the previous chapter, I mentioned how running a small conversation of 10-50
messages on ChatGPT turns out to be expensive in terms of water consumption. How are
we going to meet the increased water demands as the number of monthly users scales
up? Not just the water but the carbon emissions of these models are also causing damage
to the environment. As per the annual environmental report,3 Google’s emissions in 2023
were higher by 13% from the previous year. The numbers have increased for Microsoft
as well. In its latest report, Microsoft reported a rise of roughly 30% since 2020 in carbon
emissions. With the rising emissions, how do we plan to combat global climate change?
Model training, inference running, and data storage require servers which operate on
electricity, but there still are people who have no electricity access.
This technology is resource hungry, and it is still in the prototype stage. This implies
that the development of technology is going to consume more and more resources.
Currently, there is no plan of efficient resource planning being done by the companies
to answer these questions. Is the world ready for the development of technology in
this way? The companies making money out of technology should disclose a detailed
resource consumption report. Resources are scarce, and we should use them effectively.
Additionally, research efforts should be done to address the following challenges:
• Development of sustainable AI
• Effective model training strategies
1 https://backlinko.com/chatgpt-stats
2 https://datareportal.com/reports/digital-2024-deep-dive-the-state-of-
internetadoption#:~:text=There%20are%205.35%20billion%20people,of%20the%20
world’s%20total%20population
3 https://www.gstatic.com/gumdrop/sustainability/google-2024-environmental-
report.pdf
261
CHAPTER 10 THE FUTURE OF AI
I think by investing some research efforts here, we can ensure that the future of AI is
greener and safer for the environment.
Quality Standards
Winning the trust of a customer is crucial to the company. If a third party can
validate that the quality of the product/service developed by the company meets the
international standards, then it will be super helpful for both the companies and the
customers. A customer is more likely to buy a service which is certified to be safe to use.
To take an example, an organization which creates international standards is ISO,
or International Organization for Standardization, which is a global and independent
organization responsible for creating and establishing standards across different sectors,
including AI. These standards are designed for responsible and ethical use of AI. The
following are the top three ISO standards related to AI:
1. ISO/IEC 42001:2023
2. ISO/IEC 23894:2023
262
CHAPTER 10 THE FUTURE OF AI
3. ISO/IEC 23053:2022
This highlights the current situation of standards, but in future, we require standards
which are related to a specific technology, i.e., GenAI. Additionally, I feel that each
domain has its own caveats, and an AI system is always domain specific as it integrates
nitty-gritty details of the domain in the system. Thus, in future, we should have standards
for AI systems focusing on AI systems in a domain. For example, there should be a
standard for AI systems in healthcare, which establishes the ground rules, benchmarks,
ownership, etc., to guide the organizations in building safer AI.
263
CHAPTER 10 THE FUTURE OF AI
Domain Organization
I think it’s clear from Table 10-1 that different fields become regularized with the help
of different organizations. We need such organizations in the field of AI for developing
standards, protocols, and guidelines regarding the following:
264
CHAPTER 10 THE FUTURE OF AI
Currently, GPT-4o supports the capability of input media, like text, audio, image, and
video, and output media comprising text, audio, and image. This is closed source, but I
expect to see open source multimodal AI models which are at par in performance when
compared to GPT.
Agentic Capabilities
As a developer, you often write code and develop different components of an
application. When a new problem comes in, you brainstorm with the other people in the
team, gather some ideas, form some kind of pseudocode or logic, and then start actual
coding. The initial piece of code is reiterated multiple times before it gets pushed to
the production. Throughout my career, I have never seen code written for the first time
running into production. Similarly, when you write a report, you don’t just start writing
right away. You pause, then write a few lines, and this process is repeated until you are
satisfied with your report.
This is how humans work. But what happens when you are dealing with GenAI? You
ask a model to write code about something, and it starts at that moment. However, it has
been seen that if you craft a prompt in a way that asks the model to revisit the part where
it can commit errors and improve those parts and then repeat the same process until the
code is efficient, then the model’s response is much better.
Tremendous research efforts are being put in the field to understand how the agentic
workflows can be improved. Let me just stop here and define the term agent for you. AI
agents are virtual and autonomous programs which are able to solve a wide variety of
tasks using external tools. These programs have a decision-making and problem-solving
capability. The possibilities of AI agents are endless. In the future, you can expect a trend
of agents in various domains, taking on very easy tasks.
266
CHAPTER 10 THE FUTURE OF AI
4 https://github.com/OpenBMB/ChatDev/blob/main/misc/company.png
267
CHAPTER 10 THE FUTURE OF AI
Conclusion
In the future, you can expect to see a trend where every organization has some kind of
agent in their business. So, let’s wrap this up. I hope you got some valuable insights into
the future of AI.
No one knows what it will be, but I am sure that people like you and me will strive
hard to make it better. With this chapter, the book comes to an end. I hope you have
enjoyed the book and now you will develop LLM-based applications which will solve
real-world problems in a fair and responsible manner.
268
Index
A Artificial General Intelligence (AGI),
251, 257
Abstractive summarization, 4
Artificial Intelligence (AI), 1, 27, 231, 258,
Accuracy, 149-150, 170, 171
262, 263
Action space, 70
Attention weight heatmap, 42
Adapter tuning, 73
Augmentation, 135-136
Agentic capabilities, 266-268
Auto-regressive models, 53
Agents, 186, 189-190, 266-268
AGI, see Artificial General
Intelligence (AGI) B
AI, see Artificial Intelligence (AI)
BBQ, see Bias Benchmark for QA (BBQ)
AI evolution
Benchmark datasets
deep learning, 28
CBT dataset, 153
foundation models, 28
LAMBDA dataset, 153
knowledge-based system, 27
One Billion Word Benchmark, 151
machine learning, 28
PTB, 152
Algorithms, 28, 68, 129
WikiText dataset, 152
Alignment tuning
BERT, 77, 130
reinforcement learning, 68
Best Matching 25 (BM25), 129
RLHF, 60
Bias and stereotypes
RLHF-based system
bias identification
action space, 70
extrinsic metrics, 239, 240
getting reward model, 69
intrinsic metrics, 240, 241
instruction fine-tuning, 68
bias sources, AI, 231
policy, 70
data labeling process, 233
reward and penalty function, 70
definition, 230
state, 70
handwritten digit, data labeling
RLHF system, 68
process, 233
Alpaca, 63
historical bias, 234, 235
ANN, see Approximate Nearest
protected attributes, 232
Neighbors (ANN)
proxy features, 232
Approximate Nearest Neighbors
representation error, 234
(ANN), 129
269
© Bhawna Singh 2024
B. Singh, Building Applications with Large Language Models, https://doi.org/10.1007/979-8-8688-0569-1
INDEX
270
INDEX
D vectors, 126
vector search, 128
Data chunks, 125, 141
Emojis, 12
Data labeling process, 233
Encoder component, 47, 53
Data poisoning, 245, 246
Encoder-decoder multi-head attention
Data repository, 141
mechanism, 50
Data transformations
Encoder-decoder transformers, 52
lemmatization, 13-15
Encoder multi-head attention
stemming, 15
mechanism, 50
tokenization, 12, 13
Encoder-only transformers, 53
Decision-making process, 57
The EU AI Act
Decoder multi-head attention
domains, 254
mechanism, 50
protect people, potential harms, 255
Decoder-only transformers, 53
Extractive summarization, 4
Deep learning, 28, 58, 65
Delimiters/separators, 113-114
Dense Passage Retrieval (DPR), 129 F
Direct model calling, 192
Falcon-Instruct, 64
Direct Preference Optimization (DPO), 71
Feed-forward network, 46-47
Disinformation, 243
Few-shot prompting, 101-102
DistilBERT, 77
Fine-tuned LLM
Dolly 2.0, 64
DPO, 71
DPO, see Direct Preference
PPO, 70
Optimization (DPO)
SFT vs. RLHF, 71
DPR, see Dense Passage Retrieval (DPR)
Fine-tuning, 30, 174, 175
alignment tuning (see
E Alignment tuning)
challenges
ELIZA, 17
catastrophic forgetting, 72
Embeddings, 141
computational resources, 72
application, 127
PEFT methods, 72
in 2D, 126
dataset curation, 64
embedding v3, 127
deep learning, 58
intuition of, 126
definition, 58
OpenAI, 141
GPU (see Graphical processing
pre-trained model, 127
unit (GPU))
training data, 127
instruction tuning, 59
vector operations, 127
optimization techniques
vector representation, 126
271
INDEX
272
INDEX
I
IMDB dataset, 5 K
Indexes, 186-188 Knowledge generation, 109
InstructGPT, 62-63, 69, 70 Knowledge integration, 109
273
INDEX
274
INDEX
275
INDEX
276
INDEX
277
INDEX
278
INDEX
279
INDEX
V Z
Vector embedding, 126
ZeRO, see Zero Redundancy
Vector search, 119
Optimizer (ZeRO)
Vector stores, 141, 142, 188
Zero Redundancy Optimizer (ZeRO), 67
Virtual assistants, 24
Zero-shot prompting, 95-100
vLLM, 92
280