Language Modeling Techniques Overview

The document discusses language modeling, which involves predicting the probability distribution of words in text sequences, and outlines approaches such as n-gram, neural network, and transformer models. It covers applications of language models in spell checking, grammar checking, machine translation, and more, while explaining concepts like perplexity and the chain rule of probability for estimating word probabilities. It also emphasizes the importance of training and test sets in evaluating language models and the need for proper evaluation metrics.

Multimedia Application

By Minhaz Uddin Ahmed, PhD


Department of Computer Engineering
Inha University Tashkent.
Email: [Link]@[Link]
Content
 Language Models
 N-Grams
 3.2 Evaluating Language Models: Training and Test Sets
 3.3 Evaluating Language Models: Perplexity
 3.4 Sampling sentences from a language model
 3.5 Generalization and Zeros
 3.6 Smoothing
 3.8 Advanced: Kneser-Ney Smoothing
Language Modeling

 Language modeling involves predicting the probability distribution of words or tokens in a sequence of text. The goal of language modeling is to capture the underlying structure and patterns of natural language, allowing computers to generate coherent and grammatically correct text.

 There are several approaches to language modeling, including:

i) N-gram Models
ii) Neural Network Models
iii) Transformer Models
Language Modeling

 Tashkent is the capital of ---------------?

i) India
ii) China
iii) Uzbekistan
Language model applications

 Spell checking
 Grammar checking
 Machine translation
 Summarization
 Question answering
 Speech recognition
Probabilistic Language Models

 Assign a probability to a sentence

Application:
 Machine Translation:
P(high winds tonite) > P(large winds tonite)
 Spell Correction
 The office is about fifteen minuets from my house
 P(about fifteen minutes from) > P(about fifteen minuets from)

 Speech Recognition
 P(I saw a van) >> P(eyes awe of an)


+ Summarization, question answering, etc.
Probability of sentence

 Grammar correction
 I go to school
 I going to school

 Probability score: I go to school > I going to school

 Correct: go to school, Wrong: going to school


Probability of sentence or words

 Compute the probability of a sentence or sequence of words:

=> P(W) = P(w1, w2, w3, w4, w5, …, wn)

 Probability of an upcoming word:

=> P(w5 | w1, w2, w3, w4)
 P(Uzbekistan | Tashkent, is, the, capital, of)

 A model that computes either of these, P(W) or P(wn | w1, w2, …, wn-1), is called a language model.
How to compute P(W)
 How to compute this joint probability:
 P(its, water, is, so, transparent, that)
 Intuition: let’s rely on the Chain Rule of Probability

P(A,B) = P(A|B) P(B)


We can extend this for three variables:
P(A,B,C) = P(A| B,C) P(B,C) = P(A|B,C) P(B|C) P(C)
and in general to n variables:
P(A1, A2, ..., An) = P(A1 | A2, ..., An) P(A2 | A3, ..., An) … P(An-1 | An) P(An)
In general we refer to this as the chain rule:

the joint probability of all the random variables can be calculated by multiplying the probability of each variable conditioned on all the previous variables.
Chain Rule of Probability

 Conditional probabilities
=> P(B|A) = P(A,B) / P(A)
Rewriting : P(A,B) = P(A)P(B|A)

More variables: P(A,B,C,D) = P(A) P(B|A) P (C|A, B) P(D|A,B,C)

The chain rule in general


=> P(x1, x2, x3, …, xn) = P(x1) P(x2|x1) P(x3|x1, x2) … P(xn|x1, …, xn-1)
Chain Rule of Probability

 Chain rule : P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)

 Example
= P(Tashkent is the capital of Uzbekistan)
= P(Tashkent) × P(is | Tashkent) × P(the | Tashkent, is) × P(capital | Tashkent, is, the) × P(of | Tashkent, is, the, capital) × P(Uzbekistan | Tashkent, is, the, capital, of)
Chain Rule of Probability

 Example
= P(Tashkent is the capital of Uzbekistan)
= P(Tashkent) × P(is | Tashkent) × P(the | Tashkent, is) × P(capital | Tashkent, is, the) × P(of | Tashkent, is, the, capital) × P(Uzbekistan | Tashkent, is, the, capital, of)

Calculation
= P(Uzbekistan | Tashkent, is, the, capital, of)
= count(Tashkent is the capital of Uzbekistan) / count(Tashkent is the capital of)
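The count-based estimate above can be sketched in a few lines of Python. The tiny corpus below is invented for illustration, and plain substring counting stands in for proper tokenized counts:

```python
# Hypothetical toy corpus (made up for this sketch, not real data)
corpus = ("Tashkent is the capital of Uzbekistan . "
          "Tashkent is the capital of a republic . "
          "Tashkent is the capital of Uzbekistan .")

# Count-based estimate: count(prefix + word) / count(prefix)
numerator = corpus.count("Tashkent is the capital of Uzbekistan")
denominator = corpus.count("Tashkent is the capital of")
p = numerator / denominator
print(p)  # 2 of the 3 prefix occurrences continue with "Uzbekistan"
```

In practice these counts would come from a large corpus, and for long histories most counts are zero, which is what motivates the Markov assumption below.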
The Chain Rule applied to compute
joint probability of words in
sentence

P(“its water is so transparent”) =


P(its) × P(water|its) × P(is|its water)
× P(so|its water is) × P(transparent|its water is so)
How to estimate these probabilities

 Could we just count and divide?

 No! There are too many possible sentences.

 We'll never see enough data to estimate these probabilities.
Markov Assumption

 Simplifying assumption
= P(Uzbekistan | Tashkent, is, the, capital, of)
≈ P(Uzbekistan | of)            (bigram)
≈ P(Uzbekistan | capital, of)   (trigram)

 The assumption that the probability of a word depends only on the previous word (or a few previous words) is called the Markov assumption.
Simplest case: Unigram model

Some automatically generated sentences from a unigram model:

fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass

thrift, did, eighty, said, hard, 'm, july, bullish

that, or, limited, the
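Sampling sentences like those above from a unigram model can be sketched as follows; the tiny vocabulary and its probabilities are invented for illustration:

```python
import random

# Hypothetical unigram distribution (probabilities sum to 1)
unigram = {"the": 0.4, "of": 0.2, "a": 0.2, "inflation": 0.1, "dollars": 0.1}

def sample_unigram_sentence(length=8, seed=0):
    """Draw each word independently from the unigram distribution."""
    rng = random.Random(seed)
    words = rng.choices(list(unigram), weights=list(unigram.values()), k=length)
    return " ".join(words)

print(sample_unigram_sentence())
```

Because each word is drawn independently, the output has unigram statistics but no grammar, which is exactly why the generated text above reads as word salad.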


Bigram model

 Condition on the previous word:

 Please bring me a glass of water.
   History: "Please bring me a glass of"; word prediction: "water"
Estimating bigram probabilities

 The Maximum Likelihood Estimate:

=> P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
Bigram model

<s> I am Sam </s>


<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
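The MLE bigram probabilities for this mini-corpus can be computed by counting, as in this sketch:

```python
from collections import Counter

corpus = ["<s> I am Sam </s>",
          "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]

# Count unigrams and bigrams over the whole corpus
bigram_counts, unigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p(word, prev):
    """MLE: count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p("I", "<s>"))   # 2/3: "I" starts 2 of the 3 sentences
print(p("Sam", "am"))  # 1/2: "am" is followed by "Sam" once out of 2
```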
Estimated bigram probabilities

 P(<s> I want English food </s>)
= P(I|<s>) × P(want|I) × P(English|want) × P(food|English) × P(</s>|food)
= 0.000031

 Given that
P(I|<s>) = 0.25
P(want|I) = 0.33
P(English|want) = 0.0011
P(food|English) = 0.5
P(</s>|food) = 0.68
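Multiplying the five given factors reproduces the quoted result:

```python
# Product of the conditional probabilities listed above
p = 0.25 * 0.33 * 0.0011 * 0.5 * 0.68
print(f"{p:.6f}")  # 0.000031 after rounding
```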
N-gram models

 We can extend to trigrams, 4-grams, 5-grams.
 In general this is an insufficient model of language, because language has long-distance dependencies:

“The computer which I had just put into the machine room on the fifth floor crashed.”

 But we can often get away with N-gram models.
N-gram models

 An n-gram is a sequence of n successive items in a text document, which may include words, numbers, symbols, and punctuation. N-gram models are useful in many text analytics applications where sequences of words are relevant, such as sentiment analysis, text classification, and text generation.

 In deep learning, language models are trained with far longer contexts than classical low-order n-gram models.
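Extracting the n-grams of a token sequence takes only a short function; a sketch:

```python
def ngrams(tokens, n):
    """All windows of n successive tokens in the sequence."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "Tashkent is the capital of Uzbekistan".split()
print(ngrams(tokens, 2))  # bigrams: ('Tashkent', 'is'), ('is', 'the'), ...
print(ngrams(tokens, 3))  # trigrams
```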
N-gram models

Google Ngram Viewer displays user-selected words or phrases (ngrams) in a graph that shows how often those phrases have occurred in a corpus. Google Ngram Viewer's corpus is made up of the scanned books available in Google Books.

Once the language model is built, it can be used with machine learning algorithms to build predictive models for text analytics applications.
Google N-Gram Release, August
2006


Evaluating Language Models:
Training and Test Sets
 Extrinsic (in-vivo) Evaluation
To compare models A and B
1. Put each model in a real task
• Machine Translation, speech recognition, etc.
2. Run the task, get a score for A and for B
• How many words translated correctly
• How many words transcribed correctly
3. Compare accuracy for A and B
Intrinsic (in-vitro) evaluation

 Extrinsic evaluation not always possible


• Expensive, time-consuming
• Doesn't always generalize to other applications
 Intrinsic evaluation: perplexity
• Directly measures language model performance at predicting words.
• Doesn't necessarily correspond with real application performance

• But gives us a single general metric for language models


• Useful for large language models (LLMs) as well as n-grams
Training sets and test sets

We train the parameters of our model on a training set.
We test the model's performance on data we haven't seen.
 A test set is an unseen dataset, different from the training set.
 Intuition: we want to measure generalization to unseen data.
 An evaluation metric (like perplexity) tells us how well our model does on the test set.
Perplexity

 Perplexity is the standard metric for measuring the quality of a language model.
 It is the inverse probability of the test set, normalized by the number of words:

=> PP(W) = P(w1, w2, …, wN)^(-1/N)

Chain rule:
=> PP(W) = ( Π_{i=1..N} 1 / P(wi | w1, …, wi-1) )^(1/N)

Bigrams:
=> PP(W) = ( Π_{i=1..N} 1 / P(wi | wi-1) )^(1/N)

Minimizing perplexity is the same as maximizing probability.

Perplexity

 Calculate the perplexity of a sentence, for the task of recognizing digits in English:

=> A sentence consists of random digits
=> Each digit has probability p = 1/10
=> PP(W) = ( (1/10)^N )^(-1/N) = 10

Minimizing perplexity is the same as maximizing probability.
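The digit example can be checked numerically (assuming a sentence of N independent digits, each with probability 1/10):

```python
N = 7  # any sentence length gives the same perplexity
sentence_prob = (1 / 10) ** N
perplexity = sentence_prob ** (-1 / N)
print(perplexity)  # ~10.0, regardless of N
```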


Choosing training and test sets

• If we're building an LM for a specific task


• The test set should reflect the task language we want
to use the model for
• If we're building a general-purpose model
• We'll need lots of different kinds of training data
• We don't want the training set or the test set to be
just from one domain or author or language.
Training on the test set

We can’t allow test sentences into the training set


• Or else the LM will assign that sentence an artificially high probability
when we see it in the test set
• And hence assign the whole test set a falsely high probability.
• Making the LM look better than it really is
This is called “Training on the test set”
Dev sets

• If we test on the test set many times we might implicitly tune to its characteristics

• Noticing which changes make the model better.


• So we run on the test set only once, or a few times
• That means we need a third dataset:
• A development test set, or devset.
• We test our LM on the devset until the very end
• And then test our LM on the test set once
Reference

Chapter 3
Question
Thank you
