Language Models
Session-19
Dr. Subhra Rani Patra
SCOPE, VIT Chennai
Context Sensitive Spelling Correction
Probabilistic Language Modeling
Applications
• In OCR (Optical Character Recognition)
Can predict a word that is not easily readable in the given image
• Correcting a sentence
If we write “Deer Sir” instead of “Dear Sir”
• Speech recognition
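The "Deer Sir" correction above can be sketched with a bigram language model: score each candidate sentence and keep the one the model prefers. The tiny corpus below is an illustrative assumption, not real data.

```python
# Hedged sketch: context-sensitive spelling correction by comparing
# bigram probabilities of candidate phrases under a toy corpus.
# For simplicity, sentence boundaries are ignored when counting.
from collections import Counter

corpus = [
    "dear sir i am writing to you",
    "dear madam thank you",
    "a deer ran across the road",
]
tokens = " ".join(corpus).split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(w1, w2):
    # MLE estimate P(w2 | w1) = C(w1 w2) / C(w1); 0 if unseen
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

# "dear sir" should outscore "deer sir" given this corpus
print(bigram_prob("dear", "sir"), bigram_prob("deer", "sir"))
```

Here the model prefers "dear sir" because that bigram was observed, while "deer sir" never was.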
Language Model
• Language Model (LM)
• A language model is a probability distribution over entire sentences or texts
• N-gram: unigrams, bigrams, trigrams,…
• In a simple n-gram language model, the probability of a word is conditioned
on some fixed number of previous words
• The items modeled can be:
• Phonemes
• Syllables
• Letters
• Words
• Anything else depending on the application.
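Whatever the item type, an n-gram is just a sliding window of n consecutive items. A minimal sketch over word tokens, assuming plain whitespace tokenization (real systems use proper tokenizers):

```python
# Minimal sketch: extracting word n-grams from a token sequence.
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I want to eat Chinese food".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```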
How do we train these models?
Very large corpora
• Corpora are online collections of text and speech
• Brown Corpus
• Wall Street Journal
• AP newswire
• Hansards
• TIMIT
• DARPA/NIST text/speech corpora (Call Home, Call Friend, ATIS, Switchboard,
Broadcast News, Broadcast Conversation, TDT, Communicator)
• TRAINS, Boston Radio News Corpus
Computing P(W)
The Chain Rule
Probability of words in sentences
Estimating these probability values
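The chain rule decomposes a sentence probability into a product of conditional probabilities, P(w1..wn) = ∏ P(wi | w1..wi-1). A sketch with made-up illustrative conditionals (not estimates from any corpus):

```python
# Chain rule sketch: P(w1..wn) = prod over i of P(wi | w1..wi-1).
# Each dictionary entry maps a prefix to the probability of its last
# word given the preceding words; the numbers are assumptions.
probs = {
    ("its",): 0.25,                # P(its)
    ("its", "water"): 0.10,        # P(water | its)
    ("its", "water", "is"): 0.50,  # P(is | its water)
}
sentence = ["its", "water", "is"]
p = 1.0
for i in range(len(sentence)):
    p *= probs[tuple(sentence[:i + 1])]
print(p)  # 0.25 * 0.10 * 0.50
```

In practice these full-history conditionals cannot be estimated reliably, which motivates the Markov assumption below.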
Markov Assumption
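The Markov assumption replaces the full history with only its last k words: a bigram model uses k = 1, a trigram model k = 2. A minimal sketch of the context truncation:

```python
# Markov assumption sketch: instead of conditioning on the full
# history w1..wi-1, condition only on the last k words.
def markov_context(history, k=1):
    """Truncate the history to its last k words (the Markov assumption)."""
    return tuple(history[-k:])

history = ["I", "want", "to", "eat"]
print(markov_context(history, 1))  # bigram context: ("eat",)
print(markov_context(history, 2))  # trigram context: ("to", "eat")
```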
N-gram Models
Estimating N-gram Probabilities
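Bigram probabilities are estimated by maximum likelihood from counts: P(w2 | w1) = C(w1 w2) / C(w1). A sketch over a toy corpus (an assumption for illustration):

```python
# MLE estimation of bigram probabilities from raw counts:
# P(w2 | w1) = C(w1 w2) / C(w1). The two sentences are toy data.
from collections import Counter

sentences = [["<s>", "i", "want", "food", "</s>"],
             ["<s>", "i", "want", "tea", "</s>"]]
unigram = Counter(w for s in sentences for w in s)
bigram = Counter(b for s in sentences for b in zip(s, s[1:]))

def p_mle(w1, w2):
    """Maximum-likelihood bigram probability."""
    return bigram[(w1, w2)] / unigram[w1]

print(p_mle("i", "want"))     # C(i want) / C(i) = 2/2 = 1.0
print(p_mle("want", "food"))  # C(want food) / C(want) = 1/2 = 0.5
```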
An Example
Bigram Counts for 9222 Restaurant Sentences
Computing Bigram Probabilities
Computing Sentence Probabilities
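Under a bigram model, a sentence probability is the product of P(wi | wi-1) over the sentence, with boundary markers `<s>` and `</s>` added. The probability table below is illustrative (in the style of restaurant-corpus estimates), not actual counts:

```python
# Sentence probability under a bigram model: multiply bigram
# probabilities across the padded sentence. The table entries
# are illustrative assumptions, not real corpus estimates.
bigram_p = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "chinese"): 0.0065,
    ("chinese", "food"): 0.52,
    ("food", "</s>"): 0.68,
}

def sentence_prob(words):
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for w1, w2 in zip(padded, padded[1:]):
        p *= bigram_p.get((w1, w2), 0.0)  # unseen bigram -> probability 0
    return p

print(sentence_prob(["i", "want", "chinese", "food"]))
```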
What Knowledge Does an N-gram Represent?
Practical Issues
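One key practical issue: multiplying many small probabilities underflows floating point, so implementations sum log probabilities instead. A sketch (the probabilities are the illustrative values used above):

```python
# Practical issue: products of many small probabilities underflow,
# so work in log space and sum instead of multiplying.
import math

probs = [0.25, 0.33, 0.0065, 0.52, 0.68]  # illustrative bigram probs
log_p = sum(math.log(p) for p in probs)
print(log_p)            # sentence log-probability
print(math.exp(log_p))  # equals the direct product
```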
Google N-grams
Example from the 4-gram data
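The released Google n-gram data is distributed as plain text, one n-gram per line with a tab-separated count. A hedged sketch of parsing lines in that format (the sample lines and counts are from the commonly quoted 4-gram excerpt and are assumptions here):

```python
# Sketch: parsing n-gram data in the "w1 w2 ... wn<TAB>count" format
# used by the Google n-gram releases. Sample lines are illustrative.
lines = [
    "serve as the incoming\t92",
    "serve as the index\t223",
]
ngram_counts = {}
for line in lines:
    ngram, count = line.rsplit("\t", 1)
    ngram_counts[tuple(ngram.split())] = int(count)
print(ngram_counts[("serve", "as", "the", "incoming")])
```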