Introduction to NLP Course Overview

The document outlines an introductory lecture on Natural Language Processing (NLP) and its relevance to deep learning. It covers course information, the definition of NLP, its applications, challenges like language ambiguity, and various NLP paradigms. Additionally, it discusses text processing basics, including tokenization methods and the significance of subword segmentation in modern NLP tasks.

Lecture 01 : Introduction to the Course

• Course Information
• What is NLP?
• Why Deep Learning for NLP?
• Course Content
Course Information

• My Contact
• Email: pawang@[Link]
• Webpage: [Link]
• Course Page: [Link]
• Teaching Assistants (Inaugural Course)
• Subhendu Khatuya
• Pretam Ray
Natural Language Processing

Natural Languages: Languages that evolved naturally through human use

Source: [Link]
Natural Language Processing
What is NLP?

● Making computers understand what we write (or speak)
● Making computers write (and speak)

The field of NLP attempts to design, implement, and test systems that process natural languages for practical applications
NLP Applications: NLP is everywhere!

[Figures: examples of everyday NLP applications]

Source: [Link]
Domain Specific Applications

Why is NLP Hard? Language Ambiguity

Source: [Link]
Why is NLP Hard? Language Ambiguity
Let’s try to decipher this weird conversation!

Example: Courtesy Dr. Monojit Choudhury
NLP: Levels of Linguistic Structure

Source: [Link]
NLP Paradigms

We generally try to map problems to various (ML) paradigms
● Sentiment Analysis, news article groupings, etc. → Text Classification
● Named entity recognition, code-mixing, etc. → Sequence Labeling
● Machine Translation, summarization, chatbots, etc. → Text Generation
Timeline illustrating the progression of NLP from the 1950s
Source: Kamath, Uday, et al. "Large Language Models: A Deep Dive." (2024).
Why Deep Learning?
Sparse vs. dense feature representations. Two encodings of the information:
current word is “dog;” previous word is “the;” previous POS tag is “DET.”
Source: Yoav Goldberg, Graeme Hirst. Neural Network Methods in Natural Language Processing, Morgan & Claypool Publishers (2017).
These dense feature representations are used with various deep-learning architectures
Source: [Link] .
A timeline of the recent developments

Source: Alammar, J., & Grootendorst, M. (2024). Hands-On Large Language Models. O'Reilly.
Change of NLP paradigms: Just use generation!

Sanh, Victor, et al. "Multitask Prompted Training Enables Zero-Shot Task Generalization." ICLR 2022
Course Content (Weeks 1-6)

Background
• Introduction to NLP
• Introduction to Deep Learning and Representation Learning
• Word Representation: Word2Vec, GloVe, FastText, Multilingual
Models and Architectures
• Recurrent Neural Networks: RNNs, LSTMs, Sequence to Sequence
• Attention Mechanism and Transformers: Attention in RNNs, Self-Attention in Transformers
Methods
• Pretraining: Self-supervised Learning objectives for Pretraining, ELMo, BERT, GPT, T5, BART, Fine-tuning
Course Content (Weeks 7-12)

Tasks
• Question Answering, Text Summarization, Dialogs
• Domain and language-specific applications and challenges
Methods (LLMs)
• Towards building LLMs as chat assistants: Instruction Fine-tuning, Reinforcement Learning from Human Feedback, Alignment techniques
• In-context learning, chain-of-thought prompting, Various LLMs
• Parameter Efficient Fine-tuning (PEFT), LoRA, QLoRA
• Handling Long Context, Retrieval Augmented Generation (RAG)
Conclusion
• Analysis and Interpretability, ethical considerations
Daniel Jurafsky and James H. Martin. 2024. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models, 3rd edition. Online manuscript released August 20, 2024. https://web.stanford.edu/~jurafsky/slp3.

Alammar, J., & Grootendorst, M. (2024). Hands-On Large Language Models. O'Reilly.

Yoav Goldberg, Graeme Hirst. Neural Network Methods in Natural Language Processing, Morgan & Claypool Publishers (2017).

Kamath, Uday, et al. "Large Language Models: A Deep Dive." (2024).
Lecture 02 : Text Processing Basics, Tokenization
• Processing Text Input
• Whitespace Tokenizer
• Byte-Pair Encoding
Processing Text Input

For any NLP application, the input text needs to be processed first.

The first step in processing text is tokenization.
Source: Alammar, J., & Grootendorst, M. (2024). Hands-On Large Language Models. O'Reilly.
Tokenization: How many words in a sentence?

they lay back on the San Francisco grass and looked at the stars and their

Type: an element of the vocabulary.
Token: an instance of that type in running text.

How many?
◦ 15 tokens
◦ 13 types (“the” and “and” each occur twice)
Source: Speech and Language Processing, 3rd Ed.
How many words in a corpus?

Source: Speech and Language Processing, 3rd Ed.
Corpora: Where do the words come from?
Words don't appear out of nowhere!

A text is produced by
• a specific writer(s),
• at a specific time,
• in a specific variety,
• of a specific language,
• for a specific function.
Source: Speech and Language Processing, 3rd Ed.
Corpora vary along dimensions like
• Language: 7097 languages in the world

• Variety, like African American Language varieties.
Twitter posts might include forms like "iont" (I don't)
• Code switching, e.g., Spanish/English, Hindi/English:
S/E: Por primera vez veo a @username actually being hateful! It was beautiful:)
[For the first time I get to see @username actually being hateful! It was beautiful:)]
H/E: dost tha or rahega ... dont worry
[“He was and will remain a friend ... don’t worry”]
• Genre: newswire, fiction, scientific articles, Wikipedia
• Author Demographics: writer's age, gender, ethnicity
Source: Speech and Language Processing, 3rd Ed.
Whitespace tokenization

Tokens are implied to be words
Example:

Whitespace tokenizer issues
● contractions: isn’t ⇒ is, n’t
● hyphenated phrases: prize-winning ⇒ prize, -, winning
● punctuation: great movie! ⇒ great, movie, !
(Word tokenizers require lots of specialized rules about how to handle specific inputs; a small sketch follows below)
Source: [Link] -[Link]/Natural -Language-Processing-bd1a2ca290fc44f69556908ad8d25c70
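A minimal sketch (not from the slides) contrasting a plain whitespace split with a couple of such specialized rules; the regex rules shown are illustrative, not a complete tokenizer:

```python
import re

def whitespace_tokenize(text: str) -> list[str]:
    # Naive tokenizer: split on runs of whitespace only.
    return text.split()

def rule_based_tokenize(text: str) -> list[str]:
    # Slightly smarter: also split off the n't contraction and punctuation.
    text = re.sub(r"n't", " n't", text)          # isn't -> is n't
    text = re.sub(r"([!?.,;:])", r" \1 ", text)  # detach punctuation
    return text.split()

print(whitespace_tokenize("great movie! isn't it?"))
# ['great', 'movie!', "isn't", 'it?']
print(rule_based_tokenize("great movie! isn't it?"))
# ['great', 'movie', '!', 'is', "n't", 'it', '?']
```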
What if a new (or infrequent) word appears?

Out-of-vocabulary (OOV): Words that were seen very rarely during training or not at all
Closed-vocabulary models: Unable to produce word forms unseen in training data
<UNK> tokens:
● Historically, rare word types were replaced with a new word type UNK (unknown) at training time (see the sketch below)
● At test time, any token that was not part of the model’s vocabulary could then be replaced by UNK
● But you should not generate UNK when generating text
● UNKs don’t give features for novel words that may be useful anchors of meaning
● In languages other than English, in particular those with more productive morphology, removing rare words is infeasible
Source: [Link] -[Link]/Natural -Language-Processing-bd1a2ca290fc44f69556908ad8d25c70
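A minimal sketch of the historical UNK-replacement scheme described above, assuming a simple frequency threshold decides which word types stay in the vocabulary:

```python
from collections import Counter

def build_vocab(train_tokens: list[str], min_count: int = 2) -> set[str]:
    # Keep word types seen at least min_count times; the rest map to <UNK>.
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c >= min_count}

def unk_replace(tokens: list[str], vocab: set[str]) -> list[str]:
    return [t if t in vocab else "<UNK>" for t in tokens]

train = "the chapel the old chapel hen gapel lligwy".split()
vocab = build_vocab(train, min_count=2)      # {'the', 'chapel'}
print(unk_replace("the old hen gapel".split(), vocab))
# ['the', '<UNK>', '<UNK>', '<UNK>']
```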
Limitations of <UNK>

We lose lots of information about texts with a lot of rare words / entities

The chapel is sometimes referred to as "Hen Gapel Lligwy" ("hen" being the Welsh word for "old" and "capel" meaning "chapel").

The chapel is sometimes referred to as " Hen <unk> <unk> " (" hen " being the Welsh word for " old " and " <unk> " meaning " chapel ").
Source: [Link]
Maximal Decomposition into Characters

Challenges due to sandhi phenomena for Sanskrit Word Segmentation
Source: [Link] -[Link]/Natural -Language-Processing-bd1a2ca290fc44f69556908ad8d25c70
Preprocessing / Text normalization
● Lemmatization: determining that two words have the same root, despite their surface differences
○ sang, sung, and sings are forms of sing
● Stemming: strip suffixes from the end of the word
● Sentence segmentation: Breaking up a text into individual sentences
● Stopword removal: Remove commonly used words in a language
○ a, the, is, are
● Casing: Lowercase all words or not
With pretrained language models, besides casing, we do none of the other steps
After text normalization, most tokenizers are irreversible:
we cannot recover the raw text definitively from the tokenized output
Source: [Link] -[Link]/Natural -Language-Processing-bd1a2ca290fc44f69556908ad8d25c70
A redefinition of the notion of tokenization
Due to:

● Scientific results: The impact of sub-word segmentation on machine translation performance in 2016
● Technical requirements: A fixed-size vocabulary for neural language models

…in current NLP, the notion of token and tokenization changed

“Tokenization” is now the task of segmenting a sentence into non-typographically (and non-linguistically) motivated units, which are often smaller than classical tokens, and therefore often called sub-words
Typographic units (the “old” tokens) are now often called “pre-tokens”, and what used to be called “tokenization” is therefore called “pre-tokenization”
● [Link]

Source: [Link] -[Link]/Natural -Language-Processing-bd1a2ca290fc44f69556908ad8d25c70


Subwords are expected to be meaningful units
Subwords can be arbitrary substrings…

…but subwords can be meaning-bearing units like the morphemes -est or -er

● A morpheme is the smallest meaning-bearing unit of a language
○ “unlikeliest” has the morphemes {un-, likely, -est}
● Morphology is the study of the way words are built up from morphemes
● Word forms are the variations of a word that express different grammatical categories (tense, case, number, gender, etc.) and thus help convey the specific meaning and function of the word in a sentence

An unseen word like lower can thus be represented by some sequence of known subword units, such as {low, er}
Source: [Link] -[Link]/Natural -Language-Processing-bd1a2ca290fc44f69556908ad8d25c70
Byte-Pair-Encoding (BPE)
Main idea: Use data to automatically tell what the tokens should be

Token learner
Raw train corpus ⇒ Vocabulary (a set of tokens)

Token segmenter
Raw sentences ⇒ Tokens in the vocabulary
Source: [Link] -[Link]/Natural -Language-Processing-bd1a2ca290fc44f69556908ad8d25c70
[coined by Gage (1994); adapted to the task of word segmentation by Sennrich et al. (2016); see Gallé (2019) for more]
Byte-Pair-Encoding (BPE) – Token learner
Raw train corpus ⇒ Vocabulary (a set of tokens)

● Pre-tokenize the corpus into words & append a special end-of-word symbol _ to each word
● Initialize the vocabulary with the set of all individual characters
● Choose the 2 tokens that are most frequently adjacent (“A”, “B”)
○ Respect word boundaries
● Add a new merged symbol (“AB”) to the vocabulary
● Replace the occurrences of the 2 selected tokens with the new merged token in the corpus
● Continue doing this until k merges are done
All k new symbols and the initial characters are the final vocabulary (a sketch of the learner follows below)

What’s k? An open research question

Source: [Link] -[Link]/Natural -Language-Processing-bd1a2ca290fc44f69556908ad8d25c70
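A compact sketch of the learner loop above; this is a toy implementation for illustration (the corpus and the choice k=8 are made up), not an efficient production tokenizer:

```python
from collections import Counter

def learn_bpe(corpus_words: list[str], k: int):
    """Toy BPE token learner: returns the ordered merge list and the final vocabulary."""
    # Pre-tokenize into words and append the end-of-word symbol "_".
    words = Counter(tuple(w) + ("_",) for w in corpus_words)
    vocab = {ch for word in words for ch in word}
    merges = []
    for _ in range(k):
        # Count adjacent token pairs; pairs never span two words,
        # because every word is a separate symbol sequence.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        vocab.add(a + b)
        # Rewrite the corpus, replacing each occurrence of the pair with "AB".
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges, vocab

corpus = "low low low low low lowest lowest newer newer newer wider wider new".split()
merges, vocab = learn_bpe(corpus, k=8)
print(merges)
```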


Byte-Pair-Encoding (BPE) – Example

[Figures: step-by-step BPE merges on a toy corpus]

Source: [Link] -[Link]/Natural-Language-Processing-bd1a2ca290fc44f69556908ad8d25c70
Byte-Pair-Encoding (BPE) – Token segmenter
Just runs on the test data the merges we have learned from the training data, greedily, in the order we learned them

First we segment each test-sentence word into characters

Then we apply the first merge rule
● E.g., replace every instance of “e”, “r” in the test corpus with “er”

Then the second merge rule
● E.g., replace every instance of “er”, “_” in the test corpus with “er_”

And so on (a sketch follows below)
Source: [Link] -[Link]/Natural -Language-Processing-bd1a2ca290fc44f69556908ad8d25c70
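Continuing the toy sketch above, the segmenter just replays the learned merge list on each pre-tokenized word, in order:

```python
def bpe_segment(word: str, merges: list[tuple[str, str]]) -> list[str]:
    # Replay the learned merges greedily, in the order they were learned.
    tokens = list(word) + ["_"]
    for a, b in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# With the merges learned by the toy learner above:
print(bpe_segment("lower", merges))
```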
Byte-Pair-Encoding (BPE) Vocabulary
Model | Tokenizer | Vocabulary Size
BERT base (uncased) [2018] | WordPiece | 30,522
BERT base (cased) [2018] | WordPiece | 28,996
GPT-2 [2019] | BPE | 50,257
Flan-T5 [2022] | SentencePiece | 32,100
GPT-4 [2023] | BPE | > 100,000
StarCoder2 [2024] | BPE | 49,152
Llama2 [2023] | BPE | 32,000
You can play with different tokenizers here: [Link]
Subwords - Example

Source: Alammar, J., & Grootendorst, M. (2024). Hands-On Large Language Models. O'Reilly.
Byte-Pair-Encoding (BPE) Implications [Hofmann et al., 2021]

BERT thinks the sentiment of "superbizarre" is positive because its tokenization contains the token "superb"
Source: [Link] -[Link]/Natural -Language-Processing-bd1a2ca290fc44f69556908ad8d25c70
Byte-Pair-Encoding (BPE) Implications – Do all languages cost the same? [Ahia et al., 2023]

Proprietary models, such as GPT-4, are accessible only through paid APIs
API cost is measured by the number of tokens processed or generated

Subword tokenizers lead to disproportionate fragmentation rates for different languages and writing scripts
Source: [Link] -[Link]/Natural -Language-Processing-bd1a2ca290fc44f69556908ad8d25c70
Other subword encoding schemes

WordPiece (Schuster et al., ICASSP 2012): merge by likelihood as measured by a language model, not by frequency

SentencePiece (Kudo et al., 2018): can do subword tokenization without pretokenization (good for languages that don’t always separate words with spaces), although pretokenization usually improves performance
Source: [Link] -[Link]/Natural -Language-Processing-bd1a2ca290fc44f69556908ad8d25c70
Daniel Jurafsky and James H. Martin. 2024. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models, 3rd edition. Online manuscript released August 20, 2024. https://web.stanford.edu/~jurafsky/slp3. [Chapter 2]
Lecture 03 : N-gram Language Models: Part 1

• What is Language Modeling? (LM in LLMs!!)
• N-gram Language Models
• Some Practical Issues
Predicting words

• The water of Walden Pond is beautifully ...

blue
*refrigerator
green
*that
clear
Source: [Link] .
Language Models
Systems that can predict upcoming words

• Can assign a probability to each potential next word
• Can assign a probability to a whole sentence
Source: [Link] .
Why word prediction?
It's a helpful part of language tasks

• Grammar or spell checking
Their are two midterms → There are two midterms
Everything has improve → Everything has improved

• Speech recognition
I will be back soonish (not: I will be bassoon dish)

Source: [Link] .
Why word prediction?
It's how large language models (LLMs) work!

LLMs are trained to predict words
• Left-to-right (autoregressive) LMs learn to predict the next word
LLMs generate text by predicting words
• By predicting the next word over and over again
Source: [Link] .
Pretrain-then-finetune paradigm

Source: [Link]
Pretrain-then-Prompt paradigm

Source: [Link]
Language modeling forms the core of most self-supervised NLP approaches

Source: [Link]
Language Modeling: More Formally

Goal: compute the probability of a whole sequence of words, P(W) = P(w1, w2, …, wn), or of an upcoming word, P(wn|w1, …, wn-1). A model that computes either of these is a language model.
How to compute P(W) or P(wn|w1, …wn-1)
• How to compute the joint probability P(W):

P(The, water, of, Walden, Pond, is, so, beautifully, blue)

• Intuition: let’s rely on the Chain Rule of Probability
Source: [Link] .
Reminder: The Chain Rule
• Recall the definition of conditional probabilities

P(B|A) = P(A,B)/P(A)    Rewriting: P(A,B) = P(A) P(B|A)

• More variables:
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)

• The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1) P(x2|x1) P(x3|x1,x2) … P(xn|x1,…,xn-1)
Source: [Link] .
The Chain Rule applied to compute the joint probability of words in a sentence

P(“The water of Walden Pond is so beautifully blue”) =
P(The) × P(water|The) × P(of|The water) × P(Walden|The water of) × P(Pond|The water of Walden) × …
Source: [Link] .
How to estimate these probabilities
• Could we just count and divide?

P(blue | The water of Walden Pond is so beautifully) =
Count(The water of Walden Pond is so beautifully blue) / Count(The water of Walden Pond is so beautifully)

• We’ll never see enough data for estimating these!!
Source: [Link] .
Markov Assumption

• Simplifying assumption: the probability of the next word depends only on a short recent context, e.g.
P(blue | The water of Walden Pond is so beautifully) ≈ P(blue | beautifully)

Andrei Markov

Source: [Link]. Picture: Wikimedia Commons
Bigram Markov Assumption

Instead of
P(wn|w1, …, wn-1)
we use
P(wn|wn-1)

More generally, we approximate each component in the product:
P(w1, w2, …, wn) ≈ ∏k P(wk|wk-1)
Source: [Link] .
Simplest case: Unigram model

Some automatically generated sentences from two different unigram models

To him swallowed confess hear both . Which . Of save on trail for are ay device and rote life have

Hill he late speaks ; or ! a more to leg less first you enter

Months the my and issue of year foreign new exchange’s September

were recession exchange new endorsed a acquire to six executives
Source: [Link] .
Bigram model

Some automatically generated sentences from two different bigram models

Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.

What means, sir. I confess she? then all sorts, he is trim, captain.

Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one gram point five percent of U. S. E. has already old M. X. corporation of living

Source: [Link] .
Approximating Shakespeare

Source: [Link] .
Problems with N-gram models
• N-grams can't handle long-distance dependencies:

“The soups that I made from that new cookbook I bought yesterday were amazingly delicious.”
• N-grams don't do well at modeling new sequences with similar meanings
The solution: Large language models
• can handle much longer contexts
(because of using embedding spaces)
• can model synonymy better
Source: [Link] .
Why N-gram models?
A nice clear paradigm that lets us introduce many of the important issues for large language models
• training and test sets
• the perplexity metric
• sampling to generate sentences
• ideas like interpolation and backoff
Source: [Link] .
Estimating n-gram probabilities

The Maximum Likelihood Estimate (MLE): estimate bigram probabilities from corpus counts,
P(wn|wn-1) = Count(wn-1, wn) / Count(wn-1)
Estimating n-gram probabilities: an Example

Given a corpus C, the bigram probability of “paper | question” is 0.3 and the count of occurrences of the word “question” is 600. What will be the frequency of the pair (question, paper) in the corpus C?

P(paper | question) = freq(question, paper) / freq(question)

freq(question, paper) = 0.3 × 600 = 180 (a code sketch follows below)
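A minimal sketch of MLE bigram estimation, assuming whitespace-tokenized text with <s>/</s> sentence markers (the toy corpus is made up):

```python
from collections import Counter

def bigram_mle(tokens: list[str]):
    # MLE: P(w2|w1) = Count(w1, w2) / Count(w1); unseen bigrams get 0.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1]

tokens = "<s> I am Sam </s> <s> Sam I am </s> <s> I do not like green eggs and ham </s>".split()
p = bigram_mle(tokens)
print(p("<s>", "I"))  # 2/3: "<s> I" occurs twice, "<s>" three times
```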
An Example

Computing Sentence Probabilities

Practical Issues
• We do everything in log space: summing log probabilities avoids the numerical underflow that comes from multiplying many small probabilities (a sketch follows below)
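As a hedged illustration of the log-space trick, reusing the toy bigram model p sketched above:

```python
import math

def sentence_logprob(sentence: str, p) -> float:
    # Sum log probabilities instead of multiplying raw probabilities,
    # so long sentences don't underflow to 0.0.
    toks = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log(p(w1, w2)) for w1, w2 in zip(toks, toks[1:]))

print(sentence_logprob("I am Sam", p))  # log P(I|<s>) + log P(am|I) + ...
```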
Daniel Jurafsky and James H. Martin. 2024. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models, 3rd edition. Online manuscript released August 20, 2024. [Chapter 3]

Lecture 04 : N-gram Language Models: Part 2
• Smoothing in n-gram LMs
• Evaluation: Perplexity
• Sampling from the distribution for generation
• Larger n-grams
Some Important Points with N-gram LMs
▪ For unigram counts, P(w) is always non-zero

▪ if our dictionary is derived from the document collection
▪ This won’t be true of P(wk|wk−1). Let’s take an example below.

If P(offer | denied the) = 0, the test sentence will be assigned a probability of 0!
Smoothing

Add-1 Smoothing

Add 1 to every bigram count before normalizing:
P_add-1(wn|wn-1) = (Count(wn-1, wn) + 1) / (Count(wn-1) + V), where V is the vocabulary size
Add-1 Smoothing: Example
Given a corpus C, the bigram probability of “paper | question” is 0.3 and the count of occurrences of the word “question” is 600. What will be the frequency of the pair (question, paper) in the corpus C? → We got this as 180
Now, suppose the vocabulary size is 1210. What will be the probability of “paper | question” after add-1 smoothing?

P_add-1(paper | question) = (freq(question, paper) + 1) / (freq(question) + V)
= (180 + 1) / (600 + 1210)
= 181/1810 = 0.1 (a sketch follows below)
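A minimal sketch of add-1 smoothing on bigram counts (toy corpus made up); note that every unseen bigram now gets a small non-zero probability:

```python
from collections import Counter

def add1_bigram(tokens: list[str]):
    # Laplace smoothing: P(w2|w1) = (Count(w1, w2) + 1) / (Count(w1) + V)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    V = len(unigrams)
    return lambda w1, w2: (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

p_smooth = add1_bigram("<s> I am Sam </s> <s> Sam I am </s>".split())
print(p_smooth("am", "Sam"))  # seen once: (1+1)/(2+5)
print(p_smooth("Sam", "am"))  # unseen, but non-zero: (0+1)/(2+5)
```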
More General Formulations

Interpolate with unigram probabilities instead
Backoff and Interpolation
• Sometimes it helps to use less context

Condition on less context for contexts you know less about
• Backoff:
• use trigram if you have good evidence,
• otherwise bigram, otherwise unigram
• Interpolation:
• mix unigram, bigram, trigram (see the sketch below)

• Interpolation works better
Source: [Link] .
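A minimal sketch of linear interpolation, assuming component estimators p_uni, p_bi, p_tri already exist; the λ values here are placeholders, normally tuned on held-out data:

```python
def interpolated_trigram(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    # P-hat(w3|w1,w2) = l1*P(w3) + l2*P(w3|w2) + l3*P(w3|w1,w2), with the
    # lambdas summing to 1 so the mixture is still a probability.
    l1, l2, l3 = lambdas
    return lambda w1, w2, w3: l1 * p_uni(w3) + l2 * p_bi(w2, w3) + l3 * p_tri(w1, w2, w3)
```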
How to evaluate N-gram models
• "Extrinsic (in-vivo) Evaluation"

To compare models A and B
1. Put each model in a real task
• Machine Translation, speech recognition, etc.
2. Run the task, get a score for A and for B
• How many words translated correctly
• How many words transcribed correctly
3. Compare accuracy for A and B
Source: [Link] .
Intrinsic (in-vitro) evaluation
• Extrinsic evaluation not always possible
• Expensive, time-consuming

• Doesn't always generalize to other applications
• Intrinsic evaluation: perplexity
• Directly measures language model performance at predicting words.
• Doesn't necessarily correspond with real application performance
• But gives us a single general metric for language models
• Useful for large language models (LLMs) as well as n-grams
Source: [Link] .
Training sets and test sets
We train parameters of our model on a training set.

We test the model’s performance on data we haven’t seen.
• A test set is an unseen dataset, different from the training set.
• Intuition: we want to measure generalization to unseen data
• An evaluation metric (like perplexity) tells us how well our model does on the test set.
Source: [Link] .
Choosing training and test sets
• If we're building an LM for a specific task

• The test set should reflect the task language we want to use the model for
• If we're building a general-purpose model
• We'll need lots of different kinds of training data
• We don't want the training set or the test set to be just from one domain or author or language.
Source: [Link] .
Training on the test set
We can’t allow test sentences into the training set
• Or else the LM will assign that sentence an artificially high probability when we see it in the test set
• And hence assign the whole test set a falsely high probability.
• Making the LM look better than it really is
This is called “Training on the test set”
Bad science!
Source: [Link] .
Dev sets
• If we test on the test set many times we might implicitly tune to its characteristics
• Noticing which changes make the model better.
• So we run on the test set only once, or a few times
• That means we need a third dataset:
• A development test set, or devset.
• We test our LM on the devset until the very end
• And then test our LM on the test set once
Source: [Link] .
Intuition of perplexity as evaluation metric:
How good is our language model?

Intuition: A good LM prefers "real" sentences
• Assigns higher probability to “real” or “frequently observed” sentences
• Assigns lower probability to “word salad” or “rarely observed” sentences
Source: [Link] .
Intuition of perplexity 2:
Predicting upcoming words

The Shannon Game: How well can we predict the next word?
• Once upon a ____
• That is a picture of a ____
• For breakfast I ate my usual ____

Example distribution for “Once upon a ____”: time 0.9, dream 0.03, midnight 0.02, …, “and” 1e-100

Unigrams are terrible at this game (Why?)

A good LM is one that assigns a higher probability to the next word that actually occurs

Claude Shannon. Picture credit: Historiska bildsamlingen
Source: [Link] . [Link]
Intuition of perplexity 3: The best language model is
one that best predicts the entire unseen test set
• We said: a good LM is one that assigns a higher probability to the next word that actually occurs.
• Let's generalize to all the words!
• The best LM assigns high probability to the entire test set.
• When comparing two LMs, A and B
• We compute PA(test set) and PB(test set)
• The better LM will give a higher probability to (= be less surprised by) the test set than the other LM.
Source: [Link] .
Intuition of perplexity 4: Use perplexity instead
of raw probability
• Probability depends on size of test set

• Probability gets smaller the longer the text
• Better: a metric that is per-word, normalized by length
• Perplexity is the inverse probability of the test set, normalized by the number of words
Source: [Link] .
Intuition of perplexity 5: the inverse
Perplexity is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w1 w2 … wN)^(−1/N)

(The inverse comes from the original definition of perplexity from cross-entropy rate in information theory)
Probability range is [0,1], perplexity range is [1,∞]
Minimizing perplexity is the same as maximizing probability
Source: [Link] .
Intuition of perplexity 6: N-grams

Chain rule: PP(W) = ( ∏i 1/P(wi|w1…wi−1) )^(1/N)

Bigrams: PP(W) = ( ∏i 1/P(wi|wi−1) )^(1/N)
Source: [Link] .
Intuition of perplexity 7:
Weighted average branching factor

Perplexity is also the weighted average branching factor of a language.
Branching factor: the number of possible next words that can follow any word
Example: Deterministic language L = {red, blue, green}
Branching factor = 3 (any word can be followed by red, blue, green)
Now assume LM A where each word follows any other word with equal probability ⅓

Given a test set T = "red red red red blue"
PerplexityA(T) = PA(red red red red blue)^(−1/5) = ((⅓)^5)^(−1/5) = 3
• But now suppose red was very likely in the training set, such that for LM B:
• P(red) = .8, P(green) = .1, P(blue) = .1
• We would expect the probability to be higher, and hence the perplexity to be smaller:
PerplexityB(T) = PB(red red red red blue)^(−1/5)
= (.8 × .8 × .8 × .8 × .1)^(−1/5) = .04096^(−1/5) = .527^(−1) ≈ 1.89 (checked in the sketch below)
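A tiny sketch checking the branching-factor arithmetic above:

```python
def perplexity(word_probs: list[float]) -> float:
    # PP = (p1 * p2 * ... * pN) ** (-1/N)
    prod = 1.0
    for p in word_probs:
        prod *= p
    return prod ** (-1 / len(word_probs))

# Test set "red red red red blue" under the two LMs above:
print(perplexity([1/3] * 5))                  # LM A -> 3.0
print(perplexity([0.8, 0.8, 0.8, 0.8, 0.1]))  # LM B -> ~1.89
```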


Holding test set constant:
Lower perplexity = better language model

• Training 38 million words, test 1.5 million words, WSJ

N-gram Order | Unigram | Bigram | Trigram
Perplexity   | 962     | 170    | 109
Heavily abbreviated history of LMs

Source: [Link]
The Shannon (1948) Visualization Method
Sample words from an LM

• Unigram:
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.

• Bigram:
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
Source: [Link] .
How Shannon sampled those words in 1948

"Open a book at random and select a letter at random on the page. This letter is
recorded. The book is then opened to another page and one reads until this letter is
encountered. The succeeding letter is then recorded. Turning to another page this
second letter is searched for and the succeeding letter recorded, etc."
Source: [Link] .
Sampling a word from a distribution

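The figure is lost here, but the idea is standard: draw a random point in [0,1) and find which word's probability interval it falls into. A minimal sketch (the distribution values are made up, echoing the Shannon-game example above):

```python
import random

def sample_word(dist: dict[str, float]) -> str:
    # Walk the cumulative distribution until it passes a uniform random draw.
    r, cum = random.random(), 0.0
    for word, p in dist.items():
        cum += p
        if r < cum:
            return word
    return word  # guard against floating-point rounding at the tail

dist = {"time": 0.9, "dream": 0.03, "midnight": 0.02, "other": 0.05}
print(sample_word(dist))  # "time" about 90% of the time
```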
The Shannon Visualization Method

Note: there are other sampling methods
Used for neural language models

Many of them avoid generating words from the very unlikely tail of the distribution
We'll discuss these when we get to neural LM decoding:
• Temperature sampling
• Top-k sampling
• Top-p sampling
Larger ngrams
• 4-grams, 5-grams

• Large datasets of large n-grams have been released
• N-grams from the Corpus of Contemporary American English (COCA), 1 billion words (Davies 2020)
• Google Web 5-grams (Franz and Brants 2006), 1 trillion words
• Efficiency: quantize probabilities to 4-8 bits instead of an 8-byte float
Newest model: infini-grams (∞-grams) (Liu et al. 2024)
• No precomputing! Instead, store 5 trillion words of web text in suffix arrays.
• Can compute n-gram probabilities with any n!
∞-grams

Source: [Link]
∞-grams as a variant of back-off

Source: [Link]
N-gram LM Toolkits
• SRILM

• [Link]
• KenLM
• [Link]
Daniel Jurafsky and James H. Martin. 2024. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models, 3rd edition. Online manuscript released August 20, 2024. [Chapter 3]

Lecture 05 : NLP Tasks and Paradigms

• Paradigms in NLP
• Text Classification, Sequence Labeling, Text Generation, Structured Prediction
• Some NLP Tasks
NLP Paradigms
We generally try to map NLP problems to various (ML) paradigms

● Sentiment Analysis, news article groupings, etc. → Text Classification
● Named entity recognition, code-mixing, etc. → Sequence Labeling
● Machine Translation, summarization, chatbots, etc. → Text Generation

Other popular paradigm: Structured Prediction

Examples of NLP tasks
Word / Span Level: Word Sense Disambiguation, Entity Linking
Sentence Level: Sentence Similarity, Natural Language Inference
Paragraph / Document Level: Question Answering
NLP Paradigms: Classification
Classification : Positive or negative review?

+ ...zany characters and richly applied satire, and some great plot twists

− It was pathetic. The worst part about it was the boxing scenes...

+ ...awesome caramel sauce and sweet toasty almonds. I love this place!

− ...awful pizza and ridiculously overpriced...
[Link]
Why sentiment analysis?
Movie: is this review positive or negative?

Products: what do people think about the new iPhone?
Politics: what do people think about this candidate or issue?
Prediction: predict election outcomes or market trends from sentiment
[Link]
Text Classification: formal definition

Input:
– a document d
– a fixed set of classes C = {c1, c2, …, cJ}

Output: a predicted class c ∈ C
[Link]
Classification Methods: Machine Learning
Input:
– a document d

EL
– a fixed set of classes C = {c1, c2,…, cJ}
– A training set of m hand-labeled documents (d1,c1),....,(dm,cm)
Output:
– a learned classifier γ:d → c

Any kind of classifier







Naïve Bayes

Neural networks
PT
Support Vector Machines

k-Nearest Neighbors

N
[Link]
Evaluation for Text Classification

Let's consider just binary text classification tasks
Imagine you're the CEO of Delicious Pie Company
You want to know what people are saying about your pies
So you build a "Delicious Pie" tweet detector
– Positive class: tweets about Delicious Pie Co
– Negative class: all other tweets
[Link]
The 2-by-2 confusion matrix

                | gold positive       | gold negative
system positive | true positive (TP)  | false positive (FP)
system negative | false negative (FN) | true negative (TN)

Precision = TP / (TP + FP)    Recall = TP / (TP + FN)
[Link]
Evaluation: Accuracy
Why don't we use accuracy as our metric?

Imagine we saw 1 million tweets
– 100 of them talked about Delicious Pie Co.
– 999,900 talked about something else
We could build a dumb classifier that just labels every tweet "not about pie"
– It would get 99.99% accuracy!!! Wow!!!!
– But useless! It doesn't return the comments we are looking for!
– That's why we use precision and recall instead
[Link]
Why Precision and recall
Our dumb pie-classifier
– Just label nothing as "about pie"

Accuracy = 99.99%
but
Recall = 0
– (it doesn't get any of the 100 Pie tweets)

Precision and recall, unlike accuracy, emphasize true positives:
finding the things that we are supposed to be looking for.
[Link]
A combined measure: F

F measure: a single number that combines P and R (their weighted harmonic mean):

Fβ = (β² + 1) · P · R / (β²·P + R)

We almost always use balanced F1 (i.e., β = 1): F1 = 2PR / (P + R) (a sketch follows below)
[Link]
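A small sketch of these metrics (the confusion counts are made up):

```python
def precision_recall_f(tp: int, fp: int, fn: int, beta: float = 1.0):
    # Precision: of everything we flagged, how much was right?
    # Recall: of everything we should have flagged, how much did we find?
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = (beta**2 + 1) * p * r / (beta**2 * p + r)
    return p, r, f

print(precision_recall_f(tp=80, fp=40, fn=20))
# (0.666..., 0.8, 0.727...): F1 sits between P and R, closer to the smaller one
```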
Confusion Matrix for 3-class classification

[Link]
How to combine P/R from 3 classes to get one metric

Macro-averaging:
– compute the performance for each class, and then average over classes
Micro-averaging:
– collect decisions for all classes into one confusion matrix
– compute precision and recall from that table (see the sketch below)
[Link]
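A minimal sketch (per-class counts made up) contrasting the two averages for precision; note how micro-averaging is dominated by the largest class:

```python
def macro_micro_precision(per_class: list[tuple[int, int]]):
    # per_class holds (tp, fp) for each class.
    macro = sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)
    total_tp = sum(tp for tp, _ in per_class)
    total_fp = sum(fp for _, fp in per_class)
    micro = total_tp / (total_tp + total_fp)
    return macro, micro

print(macro_micro_precision([(10, 10), (50, 10), (400, 30)]))
# macro ~ 0.75 (each class counts equally), micro ~ 0.90 (big class dominates)
```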
Macro-averaging and Micro-averaging

[Link]
NLP Paradigms: Sequence Labeling
Parts-of-Speech Tagging
Sequence Labeling: Parts of Speech
From the earliest linguistic traditions (Yaska and Panini 5th C. BCE, Aristotle 4th C. BCE), the idea that words can be classified into grammatical categories
• part of speech, word classes, POS, POS tags
8 parts of speech attributed to Dionysius Thrax of Alexandria (c. 1st C. BCE):
noun, verb, pronoun, preposition, adverb, conjunction, participle, article
[Link]
Open vs. Closed Class

[Link]
Part-of-Speech Tagging
Assigning a part-of-speech to each word in a text.

Words often have more than one POS.
book:
• VERB: (Book that flight)
• NOUN: (Hand me that book)
[Link]
Popular tag-set: Penn Treebank

[Link]
Methods and Evaluation
Methods:

Hidden Markov Models
Maximum Entropy Markov Models
Conditional Random Fields
RNNs, Transformers

Evaluation:
Accuracy
Macro-F1 (giving equal importance to each tag)
[Link]
NLP Paradigms: Text Generation
Dialogs
Example: Dialogs

[Link]
Two kinds of conversational agents
1. Chatbots
- mimic informal human chatting
- for fun, or even for therapy
2. (Task-based) Dialogue Agents
- interfaces to personal assistants
- cars, robots, appliances
- booking flights or restaurants
[Link]
Chatbot Architectures
Rule-based

Pattern-action rules (ELIZA)
+ A mental model (PARRY):
The first system to pass the Turing Test!

Corpus-based
Information Retrieval (XiaoIce)
Neural encoder-decoder (BlenderBot)
[Link]
Response by generation
Think of response production as an encoder-decoder task

Generate each token rt of the response by conditioning on the encoding of the entire query q and the response so far r1...rt−1, i.e., P(rt | q, r1...rt−1) ⇒ a conditional LM

Evaluation is tricky
[Link]
NLP Paradigms: Structured Prediction
Dependency Parsing
Dependency Parsing

Other NLP Tasks: Some Examples
Word Sense Disambiguation (WSD)

Entity Linking

Sentence Similarity

[Link]
Question Answering

More Example Tasks, Benchmarks
[Link]
Daniel Jurafsky and James H. Martin. 2024. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models, 3rd edition. Online manuscript released August 20, 2024.