
Natural Language Processing

Question Bank & Study Guide

Comprehensive Collection of Theoretical, Practical, and Coding Exercises


For Advanced Undergraduate and Graduate Students

CMP 459: Natural Language Processing


Fall 2025

Department of Computer Science


Pokhara University, NCIT

Compiled by:

Binayak Bartaula

Keywords:

Natural Language Processing (NLP) • Text Preprocessing • Text Representation • Language Models • Transformers & BERT • Machine Learning in NLP • TF-IDF • Naive Bayes • NLTK Python Library • Computational Linguistics • Text Classification • Tokenization & Parsing • Statistical NLP • Deep Learning for NLP • Academic Study Material • Exam Preparation & Assignments • Mathematical Derivations in NLP • Python Code Snippets for NLP • AI & Linguistics
July 2025

Chapter 1: Introduction to NLP


1.1 What are the two main subfields of NLP, and how do they differ?
1.2 List at least four real-world applications of NLP mentioned in the course.
1.3 Explain the relationship between the Turing Test, NLU, and NLG.
1.4 Provide two examples each of lexical, syntactic, semantic, and narrative ambiguity.
1.5 Why is context dependence considered a major challenge in NLP?
1.6 Define data sparsity and explain its impact on low-resource languages.

Chapter 2: Data and Pre-processing


2.1 Identify the three broad categories of data used in NLP.
2.2 What is the primary goal of text pre-processing?
2.3 Provide Python (NLTK) commands to:
(a) Convert a string to lowercase,
(b) Remove punctuation,
(c) Remove numbers,
(d) Strip extra whitespace.
2.4 Differentiate between stemming and lemmatization, providing one example of each.
2.5 What are stop words, and why are they often removed during preprocessing?
2.6 Describe the steps that NLTK follows to lemmatize a passage of text.

Chapter 3: Text Representation and Modeling


3.1 Encoding Schemes
3.1.1 Compare label encoding and one-hot encoding in terms of dimensionality and interpretability.
3.1.2 Write the mathematical form of a one-hot vector for the word “apple” in the vocabulary
{apple, banana, cherry}.
3.1.3 List three limitations of one-hot encoding.


3.2 Bag-of-Words (BoW)


3.2.1 Explain why the sentences “dog bites man” and “man bites dog” have identical BoW vectors.

3.2.2 Given two documents, outline the two-step procedure to construct a BoW matrix.

3.2.3 What type of information is lost when using the BoW representation?

3.3 TF-IDF
3.3.1 Write the formula for term frequency (TF) using logarithmic scaling.

3.3.2 Derive the inverse document frequency (IDF) formula and explain why it assigns higher weight
to rare terms.

3.3.3 Compute the TF-IDF value for the word “cat” across three toy documents (each with 6 tokens).

3.3.4 Discuss two practical issues associated with the raw TF component in TF-IDF.

3.4 BM25
3.4.1 Write the BM25 scoring formula and define each parameter.

3.4.2 Describe the role of the k and b parameters in the BM25 algorithm.

3.4.3 Under what conditions does BM25 reduce to standard TF-IDF?

3.5 Vector Space Model


3.5.1 Why is Euclidean distance a poor similarity metric for TF-IDF vectors?

3.5.2 Provide the cosine similarity formula between a query vector q and a document vector d.

3.5.3 Explain the three-step ranked retrieval process used in the vector space model.

3.6 Word2Vec
3.6.1 State the distributional hypothesis underlying the Word2Vec algorithm.

3.6.2 Describe the CBOW architecture, including input, hidden layer operation, and output.

3.6.3 Describe the Skip-Gram architecture, including input, hidden layer operation, and output.

3.6.4 Present the famous vector arithmetic example: “king – man + woman”.

Chapter 4: Text Classification


4.1 Write the Naive Bayes decision rule for classifying a document d into a class c.

4.2 What problem arises when a word appears in the test set but not in the training set for a specific
class?

4.3 Apply Laplace smoothing (ϵ = 1) to re-estimate P (w|c) when nwc = 0.

4.4 Perform the full Naive Bayes calculation for a toy example using vocabulary {a, b, c}.

4.5 Why is a class prior P (c) = 0 considered meaningful and not subject to smoothing?


Chapter 5: Language Models


5.1 Write the chain-rule expansion for the probability of a sentence P (w1 , w2 , . . . , wn ).

5.2 State the Markov assumption used in bigram models.

5.3 Compute the bigram probability P (wn |wn−1 ) using maximum likelihood estimation.

5.4 Show how Laplace smoothing adjusts bigram counts to avoid zero probabilities.

5.5 Compare unigram, bigram, and trigram models with respect to context length and data sparsity.

Chapter 6: Transformer and BERT


6.1 Transformer Architecture
6.1.1 Identify the two major components that Transformers do not rely on, unlike RNNs and CNNs.

6.1.2 Describe the purpose and function of multi-head self-attention.

6.1.3 Write the equation for scaled dot-product attention.

6.1.4 Explain the role of positional encodings in the Transformer architecture.

6.1.5 Describe the Add & Norm sub-layer used in both encoder and decoder stacks.

6.2 BERT
6.2.1 Which components of the original Transformer architecture are reused in BERT?

6.2.2 Describe the Masked Language Modeling (MLM) task used during BERT’s pre-training.

6.2.3 Explain the Next Sentence Prediction (NSP) task used in BERT.

6.2.4 How many encoder layers and hidden units are present in BERT-Base?

6.2.5 Using the word “bank” as an example, explain how BERT generates contextualized word embed-
dings.


Additional Practice Questions


P.1 (a) Explain Natural Language Processing (NLP) and its key components. (8)
(b) Discuss the major challenges faced in NLP. (7)

P.2 (a) Describe the steps involved in data preprocessing, and illustrate each step with a relevant
example. (8)
(b) Explain TF-IDF and the major challenges faced in TF-IDF. (7)

P.3 (a) Determine the TF-IDF scores for each term in each document using raw TF: (8)
• D1: car insurance auto insurance
• D2: car auto insurance auto
• D3: car car auto insurance car
(b) Why is cosine similarity preferred over Euclidean distance in the vector space model for text
analysis? (7)

P.4 (a) What is text classification? Explain CBOW of Word2Vec in detail. (8)
(b) Using the Naive Bayes algorithm with Laplace smoothing, predict the class of a given test
document based on the training documents provided. (7)
Training Set:
• Chinese Beijing Chinese → Yes
• Chinese Chinese Shanghai → Yes
• Chinese Macao → Yes
• Tokyo Japan Chinese → No
Testing:
• Chinese Chinese Chinese Tokyo Japan → ?

P.5 (a) What is a language model? What are the bigram probabilities with Laplace smoothing of the
sentence "(s) Sam like eggs (/s)" based on: (3+10)
• (s) I am Sam (/s)
• (s) Sam I am (/s)
• (s) I do not like green eggs and ham (/s)
(b) Explain language model and n-gram model. (7)

P.6 (a) Explain attention mechanism used in Transformer. (8)


(b) Explain BERT embeddings. (7)

P.7 (a) What is data? Describe the steps involved in data preprocessing, and illustrate each step with
examples: (2+10)
• "Great product!! Highly recommend."
• "Terrible experience... never buying again."
• "Okay, but delivery was late."
• "Loved it!!! (Positive sentiment)"
(b) Explain one-hot encoding with example and the major challenges faced in one-hot encoding.
(8)

P.8 (a) What is Zipf’s law? Determine the TF-IDF scores for each term in each document after
removing stopwords (use raw TF): (5+10)
• D1: "The cat sat on the mat."


• D2: "The dog played in the park."


• D3: "Cats and dogs are great pets."
(b) Explain Transformer and its architecture. (8)

P.9 (a) Explain Word2Vec and describe the architectures used in its implementation. (8)
(b) Using the Naive Bayes algorithm, predict the class of a given test document based on the
training documents provided. (7)
Training Set:
• abac → A
• baabaaa → A
• bbaabbab → B
• abbb → B
• abbaa → A
• bbbaab → B
Testing:
• aabc → ?

P.10 Write short notes on any two of the following: (2×5=10)

• Next Sentence Prediction


• Masked Language Modeling
• Human language and intelligence

"He who performs his duty without attachment, surrendering the results to the Supreme, is not touched by
sin, just as a lotus leaf is untouched by water."
— Isha Upanishad, Verse 3

"Let noble thoughts come to us from every side." — Rigveda 1.89.1


Expertly curated by Binayak B.
Chapter 1: Introduction to NLP - Solutions
1.1 What are the two main subfields of NLP, and how do they differ?
The two main subfields of NLP are:
Natural Language Understanding (NLU): This subfield focuses on enabling
machines to comprehend and interpret human language. NLU involves extracting
meaning from text or speech, understanding context, disambiguating words, and
converting natural language into structured representations that computers can pro-
cess. Examples include sentiment analysis, named entity recognition, and question
answering.
Natural Language Generation (NLG): This subfield deals with producing co-
herent and contextually appropriate human language from structured data or in-
ternal representations. NLG systems transform computational data into natural
language text or speech. Examples include machine translation output generation,
chatbot responses, and automated report writing.
The key difference lies in their direction: NLU processes natural language input
to extract computational meaning, while NLG generates natural language output
from computational representations.

1.2 List at least four real-world applications of NLP mentioned in the course.
Four major real-world applications of NLP include:

• Machine Translation: Systems like Google Translate that convert text from
one language to another
• Virtual Assistants: Voice-activated systems like Siri, Alexa, and Google
Assistant that understand and respond to spoken queries
• Search Engines: Information retrieval systems that process natural language
queries and return relevant results
• Sentiment Analysis: Applications that analyze customer reviews, social
media posts, or feedback to determine emotional tone and opinions
• Chatbots and Customer Service: Automated systems that handle cus-
tomer inquiries and provide support through natural language interaction
• Text Summarization: Tools that automatically generate concise summaries
of longer documents or articles

1.3 Explain the relationship between the Turing Test, NLU, and NLG.
The Turing Test serves as a benchmark for machine intelligence, specifically testing
whether a machine can engage in conversations indistinguishable from those of
humans. The relationship with NLP subfields is as follows:
NLU Component: For a machine to pass the Turing Test, it must demonstrate
sophisticated Natural Language Understanding. The system needs to comprehend
questions, interpret context, understand implicit meanings, and recognize nuances
in human communication.

NLG Component: Equally important, the machine must exhibit advanced Nat-
ural Language Generation capabilities to produce human-like responses that are
contextually appropriate, grammatically correct, and conversationally natural.
Integration: The Turing Test essentially requires seamless integration of both
NLU and NLG. The machine must understand the human interlocutor’s input
(NLU) and generate convincing human-like responses (NLG). Success in the Turing
Test implies that both subfields have reached human-level performance in conver-
sational contexts.

1.4 Provide two examples each of lexical, syntactic, semantic, and narrative
ambiguity.
Lexical Ambiguity (word-level):

• “I went to the bank ” (financial institution vs. river bank)


• “The bark was loud” (dog’s sound vs. tree covering)

Syntactic Ambiguity (sentence structure):

• “I saw the man with the telescope” (who has the telescope?)
• “Flying planes can be dangerous” (planes that fly vs. the act of flying planes)

Semantic Ambiguity (meaning-level):

• “Every student read a book” (same book for all vs. different books)
• “The chicken is ready to eat” (chicken will eat vs. chicken is cooked)

Narrative Ambiguity (discourse-level):

• “John told Mike he was wrong” (who was wrong - John or Mike?)
• “The trophy doesn’t fit in the suitcase because it is too big” (what is too big
- trophy or suitcase?)

1.5 Why is context dependence considered a major challenge in NLP?


Context dependence is a major challenge in NLP because the meaning and inter-
pretation of natural language heavily rely on surrounding context, which can span
multiple levels:
Word-level Context: The same word can have different meanings depending on
surrounding words. For example, “apple” in “apple pie” versus “Apple iPhone”
requires contextual understanding.
Sentence-level Context: Pronouns, ellipsis, and implicit references require un-
derstanding of previous sentences. “John went to the store. He bought milk”
requires linking “He” to “John.”
Document-level Context: Topics, themes, and discourse structure influence in-
terpretation throughout entire documents. Technical terms may have specific mean-
ings within particular domains.
Situational Context: Real-world knowledge, cultural background, and situational
awareness affect meaning. “It’s cold in here” might be a request to close a window
rather than just an observation.

Temporal Context: The timing of communication affects interpretation, includ-
ing references to “now,” “yesterday,” or current events.
The challenge lies in computationally modeling and maintaining these multiple
layers of context simultaneously.

1.6 Define data sparsity and explain its impact on low-resource languages.
Definition: Data sparsity refers to the insufficient availability of training data
for machine learning models, particularly in NLP where large corpora of text are
essential for training effective language models and NLP systems.
Impact on Low-Resource Languages:
Limited Training Data: Low-resource languages have significantly fewer digital
texts, corpora, and annotated datasets compared to high-resource languages like
English. This scarcity makes it difficult to train robust NLP models.
Poor Model Performance: With insufficient training data, models for low-
resource languages typically exhibit lower accuracy, higher error rates, and poor
generalization capabilities across different domains and contexts.
Vocabulary Coverage Issues: Limited data means many words, phrases, and
linguistic constructions remain unseen during training, leading to out-of-vocabulary
problems and inability to handle diverse linguistic expressions.
Reduced Commercial Viability: The combination of technical challenges and
smaller user bases makes it economically less attractive for companies to develop
NLP tools for low-resource languages, creating a cycle of continued underrepresen-
tation.
Cultural and Knowledge Gaps: Sparse data often fails to capture cultural
nuances, idiomatic expressions, and domain-specific knowledge crucial for effective
language processing in these communities.

Chapter 2: Data and Pre-processing
Questions
2.1 Identify the three broad categories of data used in NLP.
2.2 What is the primary goal of text pre-processing?
2.3 Provide Python (NLTK) commands to:
(a) Convert a string to lowercase,
(b) Remove punctuation,
(c) Remove numbers,
(d) Strip extra whitespace.
2.4 Differentiate between stemming and lemmatization, providing one example of each.
2.5 What are stop words, and why are they often removed during preprocessing?
2.6 Describe the steps that NLTK follows to lemmatize a passage of text.

Solutions
2.1 Three Broad Categories of Data Used in NLP
The three broad categories of data used in Natural Language Processing are:
1. Structured Data: Data that is organized in a predefined format with clear schema,
such as databases, XML files, or JSON documents. Examples include customer
records, product catalogs, and metadata.
2. Semi-structured Data: Data that contains some organizational properties but
lacks a rigid structure. Examples include HTML documents, emails with headers,
and social media posts with tags and timestamps.
3. Unstructured Data: Raw text data without any predefined format or organiza-
tion. This includes free-form text such as news articles, books, social media content,
reviews, and conversational text.

2.2 Primary Goal of Text Pre-processing


The primary goal of text pre-processing is to clean and standardize raw text data
to make it suitable for computational analysis and machine learning algorithms. Pre-
processing aims to:
• Remove noise and irrelevant information
• Normalize text format and representation
• Reduce dimensionality while preserving meaningful content
• Convert text into a consistent format that algorithms can process effectively
• Improve the quality and performance of downstream NLP tasks

2.3 Python (NLTK) Commands for Text Pre-processing
(a) Convert a string to lowercase:
import nltk
text = "Hello World!"
lowercase_text = text.lower()
# Result: "hello world!"

(b) Remove punctuation:


import string
from nltk.tokenize import word_tokenize

text = "Hello, world! How are you?"
no_punct = text.translate(str.maketrans('', '', string.punctuation))
# Result: "Hello world How are you"

# Alternative using NLTK tokenization:
tokens = word_tokenize(text)
no_punct_tokens = [word for word in tokens if word.isalnum()]

(c) Remove numbers:


import re

text = "I have 5 apples and 10 oranges in 2023"
no_numbers = re.sub(r'\d+', '', text)
# Result: "I have  apples and  oranges in " (leftover spaces can be stripped afterwards)

# Alternative method:
no_numbers = ''.join([char for char in text if not char.isdigit()])

(d) Strip extra whitespace:


import re

text = "  Hello    world   with   extra   spaces  "
clean_text = re.sub(r'\s+', ' ', text).strip()
# Result: "Hello world with extra spaces"

# Alternative using split and join:
clean_text = ' '.join(text.split())

2.4 Stemming vs Lemmatization


Stemming is the process of reducing words to their root or base form by removing
suffixes, often using crude heuristic rules. It may not always produce valid words.
Lemmatization is the process of reducing words to their canonical or dictionary form
(lemma) using vocabulary and morphological analysis, always producing valid words.

Example:

• Stemming:

– ”running” → ”run”
– ”studies” → ”studi” (not a valid word)
– ”better” → ”better”

• Lemmatization:

– ”running” → ”run”
– ”studies” → ”study”
– ”better” → ”good” (when used as comparative adjective)

Key Differences:

• Stemming is faster but less accurate

• Lemmatization is slower but more linguistically accurate

• Lemmatization considers part-of-speech tags and context
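
For illustration, a minimal NLTK sketch contrasting the two (assumes the wordnet corpus has been downloaded via nltk.download('wordnet')):

from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming: rule-based suffix stripping, may produce non-words
print(stemmer.stem("running"))    # "run"
print(stemmer.stem("studies"))    # "studi"

# Lemmatization: dictionary lookup guided by a POS tag
print(lemmatizer.lemmatize("studies", pos=wordnet.NOUN))   # "study"
print(lemmatizer.lemmatize("better", pos=wordnet.ADJ))     # "good"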

2.5 Stop Words and Their Removal


Stop words are common words in a language that carry little semantic meaning and
appear frequently across documents. Examples include ”the,” ”is,” ”at,” ”which,” ”on,”
”a,” ”an,” etc.
Reasons for removing stop words during preprocessing:

1. Noise Reduction: They add noise to text analysis without contributing meaning-
ful information

2. Dimensionality Reduction: Removing them reduces the feature space, making


processing more efficient

3. Focus on Content: Helps algorithms focus on content-bearing words that are


more discriminative

4. Storage Efficiency: Reduces memory requirements and computational complexity

5. Improved Performance: Can improve the performance of text mining and clas-
sification tasks

Note: In some NLP tasks like sentiment analysis or question answering, stop words
might be important and should be retained.

2.6 NLTK Lemmatization Process
NLTK follows these steps to lemmatize a passage of text:

1. Tokenization: Break the text into individual words or tokens using word tokenizers

2. Part-of-Speech (POS) Tagging: Assign grammatical tags to each token to de-


termine their syntactic role (noun, verb, adjective, etc.)

3. POS Tag Conversion: Convert NLTK POS tags to WordNet POS tags, as Word-
Net lemmatizer requires specific tag formats

4. Lemmatization: Apply the WordNet lemmatizer using the appropriate POS tag
for each word

5. Reconstruction: Combine the lemmatized tokens back into the processed text

Example implementation:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    """Convert an NLTK (Penn Treebank) POS tag to a WordNet POS tag."""
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN,
                "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(treebank_tag[0].upper(), wordnet.NOUN)

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)              # Step 1: tokenization
    pos_tags = pos_tag(tokens)                # Step 2: POS tagging
    lemmatized = [lemmatizer.lemmatize(word, get_wordnet_pos(tag))   # Steps 3-4
                  for word, tag in pos_tags]
    return lemmatized                         # Step 5: lemmatized tokens

Chapter 3: Text Representation and Modeling - Solutions
3.1 Encoding Schemes
3.1.1 Compare label encoding and one-hot encoding in terms of dimensionality
and interpretability.
Solution:
Label Encoding:

• Dimensionality: Single dimension - each category is mapped to a single


integer value (0, 1, 2, ..., n-1)
• Interpretability: Poor interpretability because it introduces artificial ordinal
relationships between categories that may not exist in reality

One-Hot Encoding:

• Dimensionality: High dimensionality - creates n dimensions for n categories,


resulting in sparse vectors
• Interpretability: Excellent interpretability as each dimension explicitly rep-
resents one category with no artificial ordering

The key trade-off is that label encoding is memory-efficient but can mislead machine
learning algorithms into assuming non-existent ordinal relationships, while one-hot
encoding preserves categorical independence at the cost of increased dimensionality.

3.1.2 Write the mathematical form of a one-hot vector for the word “apple”
in the vocabulary {apple, banana, cherry}.
Solution:
Given vocabulary V = {apple, banana, cherry} with indices:

apple → index 0,  banana → index 1,  cherry → index 2

The one-hot vector for “apple” is:

v_{apple} = [1, 0, 0]^T

where the first position corresponds to “apple”, second to “banana”, and third to
“cherry”.
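
A short plain-Python sketch of the same encoding (the vocabulary order above is assumed):

vocab = ["apple", "banana", "cherry"]

def one_hot(word, vocab):
    # All zeros except a single 1 at the word's index
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("apple", vocab))   # [1, 0, 0]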

3.1.3 List three limitations of one-hot encoding.


Solution:

(a) High Dimensionality: Creates extremely sparse vectors with dimension
equal to vocabulary size, leading to the curse of dimensionality
(b) No Semantic Relationships: All words are equidistant from each other
(orthogonal vectors), failing to capture semantic similarities between related
words
(c) Memory Inefficiency: Requires significant storage space, especially for large
vocabularies, as most vector elements are zeros

3.2 Bag-of-Words (BoW)


3.2.1 Explain why the sentences “dog bites man” and “man bites dog” have
identical BoW vectors.
Solution:
The Bag-of-Words model treats documents as unordered collections of words, com-
pletely ignoring word order and syntactic structure.
For vocabulary V = {dog, bites, man}:
Sentence 1: “dog bites man”

v_1 = [1, 1, 1]^T   (dog: 1, bites: 1, man: 1)

Sentence 2: “man bites dog”

v_2 = [1, 1, 1]^T   (dog: 1, bites: 1, man: 1)

Both sentences contain exactly the same words with the same frequencies, resulting
in identical BoW representations despite having completely different meanings.

3.2.2 Given two documents, outline the two-step procedure to construct a


BoW matrix.
Solution:
Step 1: Vocabulary Construction

• Extract all unique words from both documents


• Create a unified vocabulary V = {w1 , w2 , ..., wn } containing all distinct terms
• Assign each word a unique index position

Step 2: Vector Construction

• For each document, create a vector of length |V |


• Count the frequency of each vocabulary word in the document
• Fill the vector with these frequency counts at corresponding positions
• Stack the document vectors to form the final BoW matrix

The resulting matrix has dimensions m × n where m is the number of documents
and n is the vocabulary size.
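
As an illustrative sketch (not part of the original material), the same two steps can be carried out with a recent scikit-learn's CountVectorizer; the example documents below are assumed:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["dog bites man", "man bites dog"]

# Step 1 (vocabulary construction) and Step 2 (frequency counting) in one call
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # ['bites' 'dog' 'man']
print(bow_matrix.toarray())                 # [[1 1 1]
                                            #  [1 1 1]]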

3.2.3 What type of information is lost when using the BoW representation?
Solution:
The BoW representation loses several crucial types of information:

(a) Word Order: Sequential arrangement of words is completely ignored


(b) Syntactic Structure: Grammatical relationships and sentence structure are
lost
(c) Semantic Context: Local context that determines word meaning is dis-
carded
(d) Discourse Structure: Paragraph and document-level organization is not
preserved

This information loss significantly impacts the model’s ability to understand nu-
anced meaning, sarcasm, negation, and complex linguistic phenomena.

3.3 TF-IDF
3.3.1 Write the formula for term frequency (TF) using logarithmic scaling.
Solution:
The logarithmic term frequency formula is:

TF(t, d) = 1 + log10(f_{t,d})   if f_{t,d} > 0
TF(t, d) = 0                    if f_{t,d} = 0

where f_{t,d} is the raw frequency of term t in document d.


The logarithmic scaling prevents very frequent terms from dominating the repre-
sentation while still giving higher weight to terms that appear multiple times.

3.3.2 Derive the inverse document frequency (IDF) formula and explain why
it assigns higher weight to rare terms.
Solution:
The IDF formula is derived as follows:

IDF(t) = log10( N / df_t )

where:

• N = total number of documents in the collection
• df_t = number of documents containing term t

Explanation: IDF assigns higher weights to rare terms because:

• When df_t is small (rare term), N/df_t is large, making IDF(t) large
• When df_t approaches N (common term), N/df_t approaches 1, making IDF(t) approach 0
• Rare terms are more discriminative and informative for distinguishing between documents
3.3.3 Compute the TF-IDF value for the word “cat” across three toy docu-
ments (each with 6 tokens).
Solution:
Given documents:
• Doc 1: “cat sits on mat with dog” (“cat” appears 1 time)
• Doc 2: “dog runs fast in park today” (“cat” appears 0 times)
• Doc 3: “cat cat plays with yarn ball” (“cat” appears 2 times)
TF Calculations:
TF(cat, Doc1) = 1 + log10(1) = 1 + 0 = 1
TF(cat, Doc2) = 0   (cat does not appear)
TF(cat, Doc3) = 1 + log10(2) = 1 + 0.301 = 1.301

IDF Calculation:

IDF(cat) = log10(3/2) = log10(1.5) = 0.176

TF-IDF Values:

TF-IDF(cat, Doc1) = 1 × 0.176 = 0.176
TF-IDF(cat, Doc2) = 0 × 0.176 = 0
TF-IDF(cat, Doc3) = 1.301 × 0.176 = 0.229
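
A small Python sketch that reproduces these numbers (log base 10, following the formulas above):

import math

docs = {
    "Doc1": "cat sits on mat with dog".split(),
    "Doc2": "dog runs fast in park today".split(),
    "Doc3": "cat cat plays with yarn ball".split(),
}

N = len(docs)
df = sum(1 for tokens in docs.values() if "cat" in tokens)   # df(cat) = 2
idf = math.log10(N / df)                                     # ~0.176

for name, tokens in docs.items():
    f = tokens.count("cat")
    tf = 1 + math.log10(f) if f > 0 else 0                   # logarithmic TF
    print(name, round(tf * idf, 3))   # Doc1 0.176, Doc2 0.0, Doc3 0.229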

3.3.4 Discuss two practical issues associated with the raw TF component in
TF-IDF.
Solution:
Issue 1: Linear Growth Problem
• Raw TF grows linearly with term frequency, causing documents with high
term repetition to be unfairly favored
• A document with 100 occurrences of a term gets 10 times more weight than
one with 10 occurrences, which may not reflect 10 times more relevance
Issue 2: Document Length Bias
• Longer documents naturally have higher raw term frequencies, creating bias
toward lengthy documents
• Short, concise documents may be unfairly penalized despite being highly rel-
evant
• This necessitates normalization techniques like cosine normalization or length
normalization

3.4 BM25
3.4.1 Write the BM25 scoring formula and define each parameter.
Solution:
The BM25 scoring formula is:
BM25(q, d) = Σ_{t∈q} IDF(t) · [ f_{t,d} · (k1 + 1) ] / [ f_{t,d} + k1 · (1 − b + b · |d| / avgdl) ]

Parameter Definitions:

• q = query containing terms


• d = document being scored
• ft,d = frequency of term t in document d
• |d| = length of document d (number of terms)
• avgdl = average document length in the collection
• k1 = term frequency saturation parameter (typically 1.2-2.0)
• b = length normalization parameter (typically 0.75)
• IDF (t) = inverse document frequency of term t
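
A minimal sketch of a scoring function that follows this formula directly; the idf dictionary is assumed to be precomputed elsewhere:

def bm25_score(query_terms, doc_terms, idf, avgdl, k1=1.2, b=0.75):
    """Score one document (list of terms) against a query; idf maps term -> IDF(t)."""
    score = 0.0
    dl = len(doc_terms)
    for t in query_terms:
        f = doc_terms.count(t)                       # f_{t,d}
        if f == 0:
            continue
        norm = f + k1 * (1 - b + b * dl / avgdl)     # length-normalized denominator
        score += idf.get(t, 0.0) * f * (k1 + 1) / norm
    return score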

3.4.2 Describe the role of the k and b parameters in the BM25 algorithm.
Solution:
Parameter k1 (Term Frequency Saturation):

• Controls how quickly the term frequency component saturates


• Higher k1 values allow term frequency to contribute more before reaching sat-
uration
• Lower k1 values cause faster saturation, reducing the impact of very high term
frequencies
• Typical range: 1.2 to 2.0

Parameter b (Length Normalization):

• Controls the degree of document length normalization


• b = 0: No length normalization (favors longer documents)
• b = 1: Full length normalization (heavily penalizes longer documents)
• b = 0.75: Balanced approach (standard setting)
• Helps ensure that document length doesn’t unfairly bias the scoring

3.4.3 Under what conditions does BM25 reduce to standard TF-IDF?


Solution:
BM25 reduces to standard TF-IDF under the following limiting conditions:
Condition 1: k1 → ∞ (no term frequency saturation) Condition 2: b = 0 (no
length normalization)

Under these conditions:
lim_{k1→∞, b=0} BM25(q, d) = Σ_{t∈q} IDF(t) · f_{t,d}

This becomes equivalent to the standard TF-IDF formulation where:

• Term frequency grows linearly without saturation


• Document length doesn’t affect the scoring
• The formula reduces to the sum of IDF × T F for each query term

3.5 Vector Space Model


3.5.1 Why is Euclidean distance a poor similarity metric for TF-IDF vectors?
Solution:
Euclidean distance is inappropriate for TF-IDF vectors due to several fundamental
issues:
Document Length Bias:

• Longer documents have higher TF-IDF values, resulting in larger vector mag-
nitudes
• Euclidean distance is sensitive to vector magnitude, unfairly penalizing longer
documents
• Two semantically similar documents of different lengths will appear dissimilar

High Dimensionality Problems:

• TF-IDF vectors are high-dimensional and sparse


• In high dimensions, Euclidean distances become less discriminative (curse of
dimensionality)
• Most distances tend to become similar, reducing the metric’s effectiveness

Scale Sensitivity:

• TF-IDF values can vary significantly across different terms


• Euclidean distance treats all dimensions equally, ignoring the relative impor-
tance of terms

3.5.2 Provide the cosine similarity formula between a query vector q and a
document vector d.
Solution:
The cosine similarity formula is:

cos(q, d) = (q · d) / (|q| × |d|) = [ Σ_{i=1}^{n} q_i · d_i ] / [ √(Σ_{i=1}^{n} q_i²) × √(Σ_{i=1}^{n} d_i²) ]

where:

• q · d = dot product of query and document vectors


• |q| = magnitude (L2 norm) of query vector
• |d| = magnitude (L2 norm) of document vector
• n = dimensionality of the vectors (vocabulary size)

The result ranges from -1 to 1, where 1 indicates perfect similarity and 0 indicates
orthogonality.
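
A short NumPy sketch of this formula, for illustration:

import numpy as np

def cosine_similarity(q, d):
    q, d = np.asarray(q, dtype=float), np.asarray(d, dtype=float)
    return np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))

print(cosine_similarity([1, 2, 0], [2, 4, 0]))   # ~1.0 (same direction)
print(cosine_similarity([1, 0, 0], [0, 1, 0]))   # 0.0 (orthogonal)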

3.5.3 Explain the three-step ranked retrieval process used in the vector space
model.
Solution:
Step 1: Document Representation

• Convert all documents in the collection into TF-IDF vectors


• Each document becomes a point in the high-dimensional term space
• Precompute and store document vector magnitudes for efficiency

Step 2: Query Processing

• Convert the user query into a TF-IDF vector using the same vocabulary
• Apply the same preprocessing steps used for documents (tokenization, stem-
ming, etc.)
• Compute the query vector magnitude

Step 3: Similarity Computation and Ranking

• Calculate cosine similarity between the query vector and each document vector
• Rank all documents in descending order of similarity scores
• Return the top-k most similar documents as the retrieval result

3.6 Word2Vec
3.6.1 State the distributional hypothesis underlying the Word2Vec algorithm.
Solution:
The distributional hypothesis, originally formulated by Zellig Harris and refined by
Firth, states:

“Words that occur in similar contexts tend to have similar meanings.”

In the context of Word2Vec, this translates to:

• Words appearing in similar contexts should have similar vector representations


• The meaning of a word can be inferred from the company it keeps
• Semantic similarity is reflected through distributional similarity in large text
corpora

This hypothesis forms the foundation for learning dense word embeddings by pre-
dicting context words or target words based on their surrounding textual environ-
ment.

3.6.2 Describe the CBOW architecture, including input, hidden layer opera-
tion, and output.
Solution:
CBOW (Continuous Bag of Words) Architecture:
Input Layer:

• Takes context words within a fixed window around the target word
• Each context word is represented as a one-hot vector
• Window size typically ranges from 2 to 10 words on each side

Hidden Layer Operation:

• Maps each context word through an embedding matrix W of size V × N
• Averages all context word embeddings to create a single representation
• Mathematical operation: h = (1/C) Σ_{c=1}^{C} Wᵀ x_c
• Where C is the number of context words and x_c are the one-hot context vectors

Output Layer:

• Uses the averaged hidden representation to predict the target word


• Applies softmax over the entire vocabulary to get probability distribution
• Objective: maximize the probability of the correct target word
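
A minimal Gensim (4.x) sketch of training a CBOW model, where sg=0 selects the CBOW architecture; the toy corpus and hyperparameters are assumptions for illustration:

from gensim.models import Word2Vec

sentences = [["the", "cat", "sits", "on", "the", "mat"],
             ["the", "dog", "plays", "in", "the", "park"]]

# sg=0 -> CBOW: averaged context embeddings predict the target word
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["cat"].shape)          # (50,)
print(model.wv.most_similar("cat"))   # nearest neighbours in the toy vector space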

3.6.3 Describe the Skip-Gram architecture, including input, hidden layer op-
eration, and output.
Solution:
Skip-Gram Architecture:
Input Layer:

• Takes a single target word as input


• Represented as a one-hot vector of size |V | (vocabulary size)
• Goal is to predict surrounding context words

Hidden Layer Operation:

• Maps the target word through the embedding matrix W of size V × N
• Simply extracts the corresponding word embedding (no averaging)
• Mathematical operation: h = Wᵀ x
• Where x is the one-hot vector of the target word

Output Layer:

• Uses the target word embedding to predict each context word independently
• Multiple output nodes, one for each context position
• Each output applies softmax over vocabulary to predict context words
• Objective: maximize probability of all context words given the target word

3.6.4 Present the famous vector arithmetic example: “king – man + woman”.
Solution:
The famous Word2Vec analogy example demonstrates semantic relationships through
vector arithmetic:

v_king − v_man + v_woman ≈ v_queen

Interpretation:

• vking − vman captures the concept of “royalty” while removing “maleness”


• Adding vwoman introduces “femaleness” to the royalty concept
• The result should be closest to the vector for “queen”

General Pattern: This represents the analogy relationship “king is to man as queen is to woman”:

king / man = queen / woman  ⇒  king − man + woman = queen

This example showcases Word2Vec’s ability to capture complex semantic and syn-
tactic relationships in the learned vector space, enabling analogical reasoning through
simple vector operations.
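
With pretrained vectors the analogy can be checked directly; a sketch using Gensim's downloader API (the word2vec-google-news-300 model is a large one-time download):

import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # pretrained Word2Vec vectors

# king - man + woman -> expected nearest neighbour: "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', ~0.71)]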

Chapter 4: Text Classification - Solutions
4.1 Write the Naive Bayes decision rule for classifying a document d into a
class c.
Solution:
The Naive Bayes decision rule classifies a document d into the class c that maxi-
mizes the posterior probability. Using Bayes’ theorem and the naive independence
assumption:
ĉ = argmax_{c∈C} P(c|d) = argmax_{c∈C} P(c) · Π_{w∈d} P(w|c)

Where:

• ĉ is the predicted class


• C is the set of all possible classes
• P (c) is the class prior probability
• P (w|c) is the likelihood of word w given class c
• The product is taken over all words w in document d

In practice, we often use the log form to avoid numerical underflow:

ĉ = argmax_{c∈C} [ log P(c) + Σ_{w∈d} log P(w|c) ]

4.2 What problem arises when a word appears in the test set but not in the
training set for a specific class?
Solution:
The problem that arises is the zero probability problem (also called the sparse
data problem). When a word w appears in the test document but was never ob-
served in the training data for class c, we have:

P(w|c) = n_{wc} / N_c = 0 / N_c = 0

Where n_{wc} is the count of word w in class c, and N_c is the total number of words
in class c.
This causes the entire product Π_{w∈d} P(w|c) to become zero, making P(c|d) = 0.
Consequently:

• The classifier will assign zero probability to that class regardless of other evi-
dence
• This can lead to poor classification decisions
• The model becomes overly sensitive to unseen words

4.3 Apply Laplace smoothing (ϵ = 1) to re-estimate P (w|c) when nwc = 0.
Solution:
Laplace smoothing (add-one smoothing) addresses the zero probability problem by
adding a small constant ϵ to all word counts. The smoothed probability estimate
becomes:

P(w|c) = (n_{wc} + ϵ) / (N_c + ϵ|V|)
Where:
• nwc is the count of word w in class c
• Nc is the total number of words in class c
• |V | is the vocabulary size (number of unique words)
• ϵ = 1 for Laplace smoothing
When n_{wc} = 0 and ϵ = 1:

P(w|c) = (0 + 1) / (N_c + |V|) = 1 / (N_c + |V|)

This ensures that:


• No word has zero probability
• All unseen words get the same small probability
• The probabilities still sum to 1 across the vocabulary
4.4 Perform the full Naive Bayes calculation for a toy example using vocab-
ulary {a, b, c}.
Solution:
Let’s consider a binary classification problem with classes C1 and C2 .
Training Data:
• Class C1 : documents “a b”, “a c”, “b b”
• Class C2 : documents “b c”, “c c”
Step 1: Calculate Class Priors

P(C1) = 3/5 = 0.6
P(C2) = 2/5 = 0.4
Step 2: Count Words in Each Class

Word Count in C1 Count in C2 Total


a 2 0 2
b 3 1 4
c 1 3 4
Total 6 4 10

Step 3: Calculate Likelihoods with Laplace Smoothing
With vocabulary size |V| = 3 and ϵ = 1:
For class C1:

P(a|C1) = (2 + 1) / (6 + 3) = 3/9 = 1/3
P(b|C1) = (3 + 1) / (6 + 3) = 4/9
P(c|C1) = (1 + 1) / (6 + 3) = 2/9

For class C2:

P(a|C2) = (0 + 1) / (4 + 3) = 1/7
P(b|C2) = (1 + 1) / (4 + 3) = 2/7
P(c|C2) = (3 + 1) / (4 + 3) = 4/7

Step 4: Classify Test Document “a c”

P(C1 | “a c”) ∝ P(C1) · P(a|C1) · P(c|C1) = 0.6 × (1/3) × (2/9) = 0.6 × 2/27 = 1.2/27 = 4/90 ≈ 0.044

P(C2 | “a c”) ∝ P(C2) · P(a|C2) · P(c|C2) = 0.4 × (1/7) × (4/7) = 0.4 × 4/49 = 1.6/49 = 16/490 ≈ 0.033

Since 4/90 = 196/4410 > 16/490 = 144/4410, the document “a c” is classified as C1.
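
These arithmetic steps can be checked with a few lines of Python (a sketch that mirrors the counts above):

from collections import Counter

train = {"C1": "a b a c b b".split(), "C2": "b c c c".split()}   # pooled class tokens
priors = {"C1": 3 / 5, "C2": 2 / 5}
V = 3   # vocabulary {a, b, c}

def posterior(c, test_doc):
    counts, Nc = Counter(train[c]), len(train[c])
    score = priors[c]
    for w in test_doc:
        score *= (counts[w] + 1) / (Nc + V)   # Laplace-smoothed likelihood
    return score

for c in ("C1", "C2"):
    print(c, round(posterior(c, ["a", "c"]), 4))   # C1 0.0444 > C2 0.0327 -> choose C1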
4.5 Why is a class prior P (c) = 0 considered meaningful and not subject to
smoothing?
Solution:
A class prior P (c) = 0 is considered meaningful and should not be smoothed because
it represents a true absence of that class in the training data, which has important
implications:

(a) Logical Consistency: If class c never appears in the training data, then
P (c) = 0 accurately reflects our training experience. The classifier should not
predict a class it has never seen.
(b) Model Interpretation: Zero class priors indicate that certain classes are not
represented in our training set. This is valuable information that should be
preserved rather than artificially inflated.
(c) Practical Implications: If P (c) = 0, then P (c|d) = 0 for any document d,
meaning the classifier will never predict class c. This is the correct behavior
when we have no training examples for that class.

(d) Different from Word Smoothing: Unlike word probabilities P (w|c) where
zero counts might be due to limited sampling, zero class priors represent a
definitive absence of training data for that class.
(e) Training Set Completeness: Smoothing class priors would imply we can
classify into classes we’ve never trained on, which violates the supervised learn-
ing paradigm.

In contrast, word probabilities are smoothed because a word’s absence from a


class in training data doesn’t mean it’s impossible for that word to appear in that
class—it might just be due to limited training data.

Chapter 5: Language Models - Solutions
5.1 Write the chain-rule expansion for the probability of a sentence P (w1 , w2 , . . . , wn ).
Solution:
Using the chain rule of probability, we can decompose the joint probability of a
sequence of words as:

P(w_1, w_2, …, w_n) = P(w_1) · P(w_2 | w_1) · P(w_3 | w_1, w_2) ⋯ P(w_n | w_1, w_2, …, w_{n−1})

This can be written more compactly as:

P(w_1, w_2, …, w_n) = Π_{i=1}^{n} P(w_i | w_1, w_2, …, w_{i−1})

where P (w1 |w0 ) = P (w1 ) by convention. This decomposition expresses the prob-
ability of a sentence as the product of conditional probabilities, where each word
depends on all preceding words in the sequence.

5.2 State the Markov assumption used in bigram models.


Solution:
The Markov assumption in bigram models states that the probability of a word
depends only on the immediately preceding word, not on the entire history of pre-
ceding words. Formally:

P (wi |w1 , w2 , . . . , wi−1 ) ≈ P (wi |wi−1 ) (3)

This is a first-order Markov assumption, which significantly simplifies the com-


putation by reducing the conditioning context from the entire prefix to just one
word. This assumption makes the model tractable but may lose some important
long-range dependencies in natural language.

5.3 Compute the bigram probability P (wn |wn−1 ) using maximum likelihood
estimation.
Solution:
Using maximum likelihood estimation (MLE), the bigram probability is computed
by counting occurrences in the training corpus:

P(w_n | w_{n−1}) = C(w_{n−1}, w_n) / C(w_{n−1})
where:

• C(wn−1 , wn ) is the count of how many times the bigram (wn−1 , wn ) appears in
the training corpus
• C(wn−1 ) is the count of how many times the word wn−1 appears in the training
corpus

This formula represents the relative frequency of seeing word wn after word wn−1
in the training data. The MLE approach chooses parameters that maximize the
likelihood of the observed training data.

5.4 Show how Laplace smoothing adjusts bigram counts to avoid zero prob-
abilities.
Solution:
Laplace smoothing (add-one smoothing) addresses the zero probability problem by
adding 1 to all bigram counts. The smoothed bigram probability becomes:

P_Laplace(w_n | w_{n−1}) = (C(w_{n−1}, w_n) + 1) / (C(w_{n−1}) + V)

where V is the vocabulary size (number of unique words).


Justification:

• The numerator C(wn−1 , wn )+1 ensures that even unseen bigrams have a count
of 1
• The denominator C(wn−1 ) + V is adjusted by adding V because we’ve effec-
tively added 1 to each of the V possible words that could follow wn−1
• This maintains the property that probabilities sum to 1: Σ_w P_Laplace(w | w_{n−1}) = 1

While Laplace smoothing eliminates zero probabilities, it can over-smooth by assigning too much probability mass to unseen events.
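
A short sketch of the smoothed estimate over a toy corpus (the corpus, sentence markers, and counts are illustrative only):

from collections import Counter

tokens = "<s> I am Sam </s> <s> Sam I am </s>".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))   # note: also counts the </s> <s> boundary pair
V = len(set(tokens))                         # vocabulary size, markers included (5)

def p_laplace(w_prev, w):
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(p_laplace("I", "am"))     # seen bigram:   (2 + 1) / (2 + 5)
print(p_laplace("Sam", "am"))   # unseen bigram: (0 + 1) / (2 + 5)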

5.5 Compare unigram, bigram, and trigram models with respect to context
length and data sparsity.
Solution:

Model Context Length Data Sparsity Trade-offs


Unigram 0 (no context) Low sparsity Simple but ignores word order
Bigram 1 word Moderate sparsity Captures local dependencies
Trigram 2 words High sparsity Better context, more parameters

Context Length:

• Unigram: P (wi ) - No conditioning context


• Bigram: P (wi |wi−1 ) - Conditions on 1 previous word
• Trigram: P (wi |wi−2 , wi−1 ) - Conditions on 2 previous words

Data Sparsity: As context length increases, the number of possible n-grams grows
exponentially (V n for vocabulary size V ), leading to:

• More parameters to estimate


• Higher likelihood of zero counts for unseen n-grams

• Greater need for smoothing techniques

Performance Trade-off: Longer context generally improves language modeling


quality but requires more training data and sophisticated smoothing to handle
sparsity effectively.

Chapter 6: Transformer and BERT - Solutions
6.1 Transformer Architecture
6.1.1 Question: Identify the two major components that Transformers do not rely on,
unlike RNNs and CNNs.
Answer: The two major components that Transformers do not rely on are:

• Recurrence: Unlike RNNs, Transformers do not process sequences sequen-


tially or maintain hidden states that depend on previous time steps.
• Convolution: Unlike CNNs, Transformers do not use convolutional opera-
tions with local receptive fields to extract features.

Instead, Transformers rely entirely on attention mechanisms to capture dependen-


cies between all positions in the sequence simultaneously, enabling parallel process-
ing and better handling of long-range dependencies.

6.1.2 Question: Describe the purpose and function of multi-head self-attention.


Answer: Multi-head self-attention serves to capture different types of relationships
and dependencies within a sequence by attending to information from different
representation subspaces at different positions simultaneously.
Purpose:

• Allow the model to jointly attend to information from different representation


subspaces
• Capture various types of relationships (syntactic, semantic, positional) in par-
allel
• Provide richer representations than single-head attention

Function: The mechanism splits the input into h different ”heads,” where each
head learns different attention patterns. Each head computes its own query (Q),
key (K), and value (V ) matrices, performs scaled dot-product attention, and the
results are concatenated and linearly transformed to produce the final output.

6.1.3 Question: Write the equation for scaled dot-product attention.


Answer: The scaled dot-product attention is computed as:

Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V
Where:

• Q is the query matrix of dimension n × dk


• K is the key matrix of dimension m × dk
• V is the value matrix of dimension m × dv
• d_k is the dimension of the key vectors (used for scaling)
• Scaling by √d_k prevents the dot products from becoming too large, which would push the softmax into regions with small gradients
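
A NumPy sketch of this equation (matrix shapes are illustrative):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # (n, m) scaled similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # row-wise softmax
    return weights @ V                                        # weighted sum of values

Q = np.random.randn(3, 4)   # n = 3 queries,  d_k = 4
K = np.random.randn(5, 4)   # m = 5 keys,     d_k = 4
V = np.random.randn(5, 8)   # m = 5 values,   d_v = 8
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 8)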

6.1.4 Question: Explain the role of positional encodings in the Transformer architecture.
Answer: Positional encodings provide information about the position of tokens in
the sequence, which is crucial because the Transformer architecture lacks inherent
sequential processing capabilities.
Role and Necessity:

• Since Transformers process all positions in parallel through attention, they


have no built-in notion of token order
• Without positional information, the model would treat sequences as unordered
sets
• Positional encodings are added to input embeddings to inject positional infor-
mation

Implementation: The original Transformer uses sinusoidal positional encodings:

PE(pos, 2i)   = sin( pos / 10000^{2i/d_model} )
PE(pos, 2i+1) = cos( pos / 10000^{2i/d_model} )
where pos is the position and i is the dimension index.
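
A NumPy sketch of these two formulas (assumes an even d_model):

import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angle = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)   # odd dimensions: cosine
    return pe

print(positional_encoding(max_len=50, d_model=512).shape)   # (50, 512)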

6.1.5 Question: Describe the Add & Norm sub-layer used in both encoder and decoder
stacks.
Answer: The Add & Norm sub-layer implements residual connections followed
by layer normalization, appearing after each main component in both encoder and
decoder blocks.
Structure:
LayerNorm(x + Sublayer(x)) (4)

Components:

• Residual Connection (Add): Adds the input x to the output of the sub-
layer, helping with gradient flow and enabling training of deeper networks
• Layer Normalization (Norm): Normalizes the summed output across the
feature dimension, stabilizing training and improving convergence

Benefits: This combination facilitates gradient flow, reduces internal covariate


shift, and enables stable training of deep Transformer networks with many layers.

6.2 BERT
6.2.1 Question: Which components of the original Transformer architecture are reused
in BERT?
Answer: BERT reuses the encoder components of the original Transformer ar-
chitecture exclusively:
Reused Components:

• Multi-head self-attention layers: For capturing bidirectional dependencies

• Position-wise feed-forward networks: Applied to each position separately
• Add & Norm sub-layers: Residual connections with layer normalization
• Positional encodings: To provide position information (though BERT uses
learned positional embeddings)

Not Used: BERT does not use the decoder stack, masked self-attention, or
encoder-decoder attention mechanisms, as it is designed for bidirectional encod-
ing rather than autoregressive generation.

6.2.2 Question: Describe the Masked Language Modeling (MLM) task used during
BERT’s pre-training.
Answer: Masked Language Modeling is a pre-training objective where BERT
learns to predict masked tokens in a sentence using bidirectional context.
Process:

• Randomly mask 15% of input tokens


• Of these masked positions: 80% replaced with [MASK], 10% replaced with
random tokens, 10% kept unchanged
• Model predicts the original token at masked positions using context from both
directions

Example:

Input: “The cat [MASK] on the mat”


Target: “sat”
BERT uses both left context (“The cat”) and right context (“on the mat”) to
predict “sat”

Purpose: This enables BERT to learn deep bidirectional representations, unlike


traditional left-to-right language models.
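
The MLM objective can be probed with the Hugging Face transformers library; this is a sketch that assumes the bert-base-uncased checkpoint (downloaded on first use):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses both the left and right context to fill the masked position
for pred in fill_mask("The cat [MASK] on the mat.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
# Expected top candidates include words like "sat" and "sits"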

6.2.3 Question: Explain the Next Sentence Prediction (NSP) task used in BERT.
Answer: Next Sentence Prediction is a binary classification pre-training task where
BERT learns to determine whether two sentences appear consecutively in the orig-
inal text.
Task Structure:

• Input: Two sentences separated by [SEP] token, preceded by [CLS] token


• 50% of training examples: sentences are actual consecutive pairs (label: Is-
Next)
• 50% of training examples: sentences are randomly paired (label: NotNext)
• Model predicts binary classification using [CLS] token representation

Example:

Input: “[CLS] The weather is nice. [SEP] Let’s go for a walk. [SEP]”
Label: IsNext (if consecutive) or NotNext (if random)

Purpose: Helps BERT understand sentence-level relationships, beneficial for down-
stream tasks like question answering and natural language inference.

6.2.4 Question: How many encoder layers and hidden units are present in BERT-Base?
Answer: BERT-Base architecture specifications:

• Encoder Layers: 12 transformer encoder layers


• Hidden Units: 768 hidden units (dimension of hidden states)
• Attention Heads: 12 attention heads per layer
• Parameters: Approximately 110 million parameters

Note: BERT-Large has 24 layers, 1024 hidden units, 16 attention heads, and 340
million parameters, making it significantly larger than BERT-Base.

6.2.5 Question: Using the word “bank” as an example, explain how BERT generates
contextualized word embeddings.
Answer: BERT generates contextualized embeddings by considering the surround-
ing context to disambiguate word meanings, unlike static word embeddings that
assign fixed vectors to words.
Example with “bank”:
Sentence 1: “I deposited money at the bank today.” Sentence 2: “We sat by
the river bank watching boats.”
Process:

• BERT processes the entire sentence bidirectionally using self-attention


• For sentence 1: attention mechanism focuses on financial context (“deposited,”
“money”)
• For sentence 2: attention mechanism focuses on geographical context (“river,”
“boats”)
• Different attention patterns produce different contextualized representations
for “bank”

Result: The same word “bank” receives different vector representations based on
context, enabling BERT to handle polysemy and provide more accurate semantic
understanding for downstream tasks.
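
A sketch of extracting the two contextual vectors for "bank" with the transformers library (the checkpoint name and the single-wordpiece assumption for "bank" are illustrative):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    idx = inputs["input_ids"][0].tolist().index(bank_id)
    return hidden[idx]

v1 = bank_vector("I deposited money at the bank today.")
v2 = bank_vector("We sat by the river bank watching boats.")
print(torch.cosine_similarity(v1, v2, dim=0))   # noticeably below 1: the two vectors differ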

Solutions to Additional Practice Questions
P.1 (a) Explain Natural Language Processing (NLP) and its key components.
(8)
Natural Language Processing (NLP) is a branch of artificial intelligence that fo-
cuses on the interaction between computers and human language. It enables
machines to understand, interpret, and generate human language in a meaningful
way, bridging the gap between human communication and computer understand-
ing. NLP combines computational linguistics with machine learning and deep
learning techniques to process and analyze large amounts of natural language
data.
Key Components of NLP:
• Tokenization: Breaks text into individual words, phrases, or meaningful
units called tokens. This fundamental step handles challenges like punctua-
tion, whitespace, and word boundaries.
• Syntax Analysis: Analyzes grammatical structure and relationships be-
tween words, including parsing dependency relationships and part-of-speech
tagging.
• Semantic Analysis: Determines the meaning of words and sentences in
context, resolving ambiguities and understanding implicit meanings.
• Pragmatic Analysis: Interprets meaning using context and real-world knowl-
edge, considering speaker intent and situational factors.
• Morphological Analysis: Studies word formation, roots, affixes, and struc-
ture to understand how words are constructed and related.
• Discourse Analysis: Examines the structure and coherence of longer texts
beyond individual sentences, tracking themes and maintaining context across
paragraphs.
• Phonological Analysis: (In speech-based NLP) Deals with sound patterns
and pronunciation in spoken language, essential for speech recognition sys-
tems.
Core Components of NLP: Natural Language Generation (NLG) focuses
on producing human-like text from structured data or internal representa-
tions, while Natural Language Understanding (NLU) involves comprehending
and interpreting text to extract meaning and intent.
(b) Discuss the major challenges faced in NLP. (7)
Major Challenges in NLP:
• Ambiguity: Words and sentences can have multiple meanings depending on
context (lexical, syntactic, and semantic ambiguity)
• Context Dependency: Meaning changes based on surrounding text and
situation, requiring sophisticated context modeling
• Sarcasm and Irony: Difficult to detect when intended meaning differs from
literal meaning, often requiring cultural and social understanding

• Language Variations: Dialects, slang, informal language, and code-switching
create processing difficulties across different communities
• Data Sparsity: Limited training data for many languages and domains,
particularly for low-resource languages
• Cultural and Domain Knowledge: Understanding requires background
knowledge not present in text, including implicit cultural references
• Computational Complexity: Processing large volumes of text requires
significant computational resources and efficient algorithms

P.2 (a) Describe the steps involved in data preprocessing, and illustrate each
step with a relevant example. (8)
Data Preprocessing Steps:
Data preprocessing is crucial for preparing raw text data for NLP tasks, as it
standardizes and cleans the input to improve model performance.
1. Text Cleaning: Remove unwanted characters, HTML tags, special symbols,
and noise
Input: "Hello!!! @World #NLP <html>"
Output: "Hello World NLP"
2. Case Normalization: Convert text to lowercase for consistency, reducing
vocabulary size
Input: "Natural Language Processing"
Output: "natural language processing"
3. Tokenization: Split text into individual words or tokens, handling punctua-
tion appropriately
Input: "I love NLP"
Output: ["I", "love", "NLP"]
4. Stop Word Removal: Remove common words that don’t carry significant
meaning for most tasks
Input: ["I", "love", "natural", "language", "processing"]
Output: ["love", "natural", "language", "processing"]
5. Stemming/Lemmatization: Reduce words to their root form to group
related words together
Stemming: "running" → "run", "better" → "better"
Lemmatization: "running" → "run", "better" → "good"
Note: Lemmatization is generally preferred over stemming as it produces actual
dictionary words and considers morphological analysis.
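The pipeline above can be sketched in Python with NLTK (a minimal illustration; it assumes the punkt, stopwords, and wordnet resources have already been downloaded with nltk.download):

import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)                               # 1. strip HTML tags / noise
    text = text.translate(str.maketrans("", "", string.punctuation))   # 1. remove punctuation
    text = text.lower()                                                # 2. case normalization
    tokens = word_tokenize(text)                                       # 3. tokenization
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]                     # 4. stop word removal
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]                   # 5. lemmatization

print(preprocess("I love Natural Language Processing!!!"))
# ['love', 'natural', 'language', 'processing']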

(b) Explain TF-IDF and the major challenges faced in TF-IDF. (7)
TF-IDF (Term Frequency-Inverse Document Frequency):
TF-IDF is a numerical statistic that reflects how important a word is to a doc-
ument in a collection of documents. It balances the frequency of a term in a
document with its rarity across the entire corpus, helping identify distinctive and
meaningful terms.

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)    (1)

Where:

TF(t, d) = (frequency of term t in document d) / (total terms in document d)    (2)

IDF(t, D) = log(total number of documents / number of documents containing term t)    (3)
The logarithm in IDF dampens the effect of very rare terms and provides a
smoother weighting scheme.
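A minimal Python sketch of equations (1)-(3); tfidf() here is a hypothetical helper written for illustration, not a library function:

import math
from collections import Counter

def tfidf(term, doc_tokens, corpus):
    tf = Counter(doc_tokens)[term] / len(doc_tokens)       # equation (2)
    df = sum(1 for doc in corpus if term in doc)           # documents containing the term
    idf = math.log(len(corpus) / df)                       # equation (3); assumes df > 0
    return tf * idf                                        # equation (1)

corpus = [["car", "insurance", "auto", "insurance"],
          ["car", "auto", "insurance", "auto"],
          ["dog", "insurance", "policy"]]
print(round(tfidf("dog", corpus[2], corpus), 3))         # 0.366 - rare term gets weight
print(round(tfidf("insurance", corpus[0], corpus), 3))   # 0.0   - appears in every document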
Challenges in TF-IDF:
• Semantic Meaning: Cannot capture semantic relationships between words
or understand context-dependent meanings
• Word Order: Ignores sequence and context of words, treating documents
as bags of words
• Synonyms: Treats synonyms as different terms, missing semantic equiva-
lences
• Sparse Representation: Creates high-dimensional sparse vectors that can
be computationally inefficient
• Out-of-Vocabulary: Cannot handle new words not seen during training,
limiting adaptability

P.3 (a) Determine the TF-IDF scores for each term in each document using
raw TF: (8)
Given documents:
• D1: car insurance auto insurance
• D2: car auto insurance auto
• D3: car car auto insurance car
Step 1: Calculate Term Frequencies (TF)
Term D1 D2 D3
car 1 1 3
insurance 2 1 1
auto 1 2 1

Step 2: Calculate Inverse Document Frequency (IDF) Since all terms
appear in all three documents, the document frequency for each term is 3:
 
IDF(car) = log(3/3) = log(1) = 0    (4)
IDF(insurance) = log(3/3) = log(1) = 0    (5)
IDF(auto) = log(3/3) = log(1) = 0    (6)

Step 3: Calculate TF-IDF scores


Term D1 TF-IDF D2 TF-IDF D3 TF-IDF
car 1×0=0 1×0=0 3×0=0
insurance 2×0=0 1×0=0 1×0=0
auto 1×0=0 2×0=0 1×0=0
Since all terms appear in all documents, IDF = 0 for all terms,
making TF-IDF = 0 for all terms in all documents. This demonstrates a limitation
of TF-IDF when the vocabulary is shared across all documents in the corpus.
(b) Why is cosine similarity preferred over Euclidean distance in the vector
space model for text analysis? (7)
Cosine Similarity vs Euclidean Distance:
• Length Independence: Cosine similarity measures the angle between vec-
tors, not their magnitude. Documents of different lengths can still be similar
in content, which is crucial for text analysis where document length varies
significantly.
• Normalization: Cosine similarity automatically normalizes for document
length, while Euclidean distance is affected by vector magnitude, potentially
biasing toward longer documents.
• High-Dimensional Sparse Data: Text vectors are typically high-dimensional
and sparse. Cosine similarity works better in such spaces by focusing on the
direction rather than magnitude.
• Interpretability: Cosine similarity ranges from -1 to 1, providing intuitive
interpretation of similarity (1 = identical direction, 0 = orthogonal, -1 =
opposite direction).
• TF-IDF Compatibility: Works well with TF-IDF representations where
term frequencies vary significantly across documents of different lengths.
• Curse of Dimensionality: In high-dimensional spaces, Euclidean distance
becomes less discriminative as all points appear equidistant, while cosine
similarity remains meaningful.

Cosine Similarity = (A · B) / (||A|| × ||B||)    (7)
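A small NumPy sketch (illustrative vectors only) showing why cosine similarity ignores document length while Euclidean distance does not:

import numpy as np

a = np.array([1.0, 2.0, 0.0, 1.0])   # term-frequency vector of a short document
b = 3 * a                            # same content repeated three times (longer document)

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)

print(round(cosine, 4))      # 1.0 -> identical direction, judged identical in content
print(round(euclidean, 4))   # 4.899 -> large distance, penalizes the longer document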

P.4 (a) What is text classification? Explain CBOW of Word2Vec in detail. (8)
Text Classification: Text classification is the task of assigning predefined cat-
egories or labels to text documents based on their content. It’s a supervised
learning problem where the model learns from labeled training data. Examples
include spam detection, sentiment analysis, topic categorization, and language
identification. Modern approaches use deep learning models that can capture
complex patterns and relationships in text data.
CBOW (Continuous Bag of Words) in Word2Vec:
CBOW predicts a target word based on its surrounding context words. It uses
a neural network architecture that learns distributed representations of words by
maximizing the probability of predicting the center word given its context.
Architecture:
• Input Layer: Context words represented as one-hot vectors of vocabulary
size
• Projection Layer: Average of context word embeddings, creating a dense
representation
• Output Layer: Softmax layer predicting target word probability over entire
vocabulary
Training Process:
i. Select a window of context words around the target word (typically 2-5 words
on each side).
ii. Feed context words to the input layer as one-hot vectors.
iii. Compute the average of context embeddings in the projection layer.
iv. Use softmax to predict the target word probability distribution.
v. Update weights using backpropagation and gradient descent to minimize pre-
diction error.
Objective Function:
J = −(1/T) Σ_{t=1}^{T} log p(wt | wt−c, ..., wt−1, wt+1, ..., wt+c)    (8)

CBOW is faster to train than Skip-gram and works well for frequent words, mak-
ing it suitable for large corpora.
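A minimal training sketch with the gensim library (gensim 4.x API; the toy corpus is purely illustrative). Setting sg=0 selects CBOW, while sg=1 would select Skip-gram:

from gensim.models import Word2Vec

sentences = [["i", "love", "natural", "language", "processing"],
             ["word", "embeddings", "capture", "semantic", "similarity"],
             ["cbow", "predicts", "a", "word", "from", "its", "context"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)
print(model.wv["language"][:5])                    # first 5 dimensions of the learned embedding
print(model.wv.most_similar("language", topn=3))   # neighbours are arbitrary on a corpus this small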
(b) Using the Naive Bayes algorithm with Laplace smoothing, predict the
class of the test document. (7)
Training Set:
• Chinese Beijing Chinese → Yes
• Chinese Chinese Shanghai → Yes
• Chinese Macao → Yes
• Tokyo Japan Chinese → No

Test Document: Chinese Chinese Chinese Tokyo Japan
Step 1: Calculate Prior Probabilities
P(Yes) = 3/4 = 0.75    (9)
P(No) = 1/4 = 0.25    (10)
Step 2: Calculate Likelihood with Laplace Smoothing Laplace smoothing
adds 1 to each count to handle zero probabilities:

P(w | c) = (count(w in class c) + 1) / (total words in class c + |V|)

Vocabulary: {Chinese, Beijing, Shanghai, Macao, Tokyo, Japan}, |V | = 6

For class “Yes”:


• Total words: 8; unique words: 4
• P(Chinese | Yes) = (5 + 1) / (8 + 6) = 6/14
• P(Tokyo | Yes) = (0 + 1) / (8 + 6) = 1/14
• P(Japan | Yes) = (0 + 1) / (8 + 6) = 1/14

For class “No”:

• Total words: 3; unique words: 3
• P(Chinese | No) = (1 + 1) / (3 + 6) = 2/9
• P(Tokyo | No) = (1 + 1) / (3 + 6) = 2/9
• P(Japan | No) = (1 + 1) / (3 + 6) = 2/9
Step 3: Calculate Posterior Probabilities Using the naive independence
assumption:

P(Yes | test) ∝ 0.75 × (6/14)³ × (1/14) × (1/14)    (11)
P(No | test) ∝ 0.25 × (2/9)³ × (2/9) × (2/9)    (12)

Evaluating these expressions gives P(Yes | test) ≈ 0.0003 and P(No | test) ≈ 0.0001,
so P(Yes | test) > P(No | test) and the predicted class is Yes.
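A quick numerical check of the two posteriors (values rounded):

p_yes = 0.75 * (6/14)**3 * (1/14) * (1/14)   # class "Yes"
p_no  = 0.25 * (2/9)**3 * (2/9) * (2/9)      # class "No"
print(round(p_yes, 5), round(p_no, 5))       # 0.0003 0.00014 -> "Yes" wins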

P.5 (a) What is a language model? Calculate bigram probabilities with Laplace
smoothing for "(s) Sam like eggs (/s)". (8)
Language Model: A language model is a probability distribution over sequences
of words. It assigns probabilities to word sequences, helping determine how likely
a given sequence is in a language. Language models are fundamental to many NLP
applications including machine translation, speech recognition, text generation,
and auto-completion systems.
Given Corpus:
• (s) I am Sam (/s)
• (s) Sam I am (/s)
• (s) I do not like green eggs and ham (/s)
Bigram Counts: A bigram model predicts each word based on the immediately
preceding word using the Markov assumption.
Bigram Count
(s)→I 2
I→am 2
am→Sam 1
Sam→(/s) 1
(s)→Sam 1
Sam→I 1
am→(/s) 1
I→do 1
do→not 1
not→like 1
like→green 1
green→eggs 1
eggs→and 1
and→ham 1
ham→(/s) 1
Target Sentence Bigrams:
(s) → Sam, Sam → like, like → eggs, eggs → (/s)
Counts for Relevant Bigrams:
• (s) → Sam: 1
• Sam → like: 0 (does not appear in corpus)
• like → eggs: 0 (only like → green appears)
• eggs → (/s): 0 (only eggs → and appears)
Laplace Smoothing Formula:
P(wi | wi−1) = (C(wi−1, wi) + 1) / (C(wi−1) + |V|)    (13)

Where |V | = 12 (vocabulary size including start/end tokens)

Probability Calculations:

P(Sam | (s)) = (C((s), Sam) + 1) / (C((s)) + |V|) = (1 + 1) / (3 + 12) = 2/15    (14)
P(like | Sam) = (C(Sam, like) + 1) / (C(Sam) + |V|) = (0 + 1) / (2 + 12) = 1/14    (15)
P(eggs | like) = (C(like, eggs) + 1) / (C(like) + |V|) = (0 + 1) / (1 + 12) = 1/13    (16)
P((/s) | eggs) = (C(eggs, (/s)) + 1) / (C(eggs) + |V|) = (0 + 1) / (1 + 12) = 1/13    (17)
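These probabilities can be reproduced with a short Python sketch (a minimal illustration; <s> and </s> stand in for the (s) and (/s) markers):

from collections import Counter

corpus = [["<s>", "I", "am", "Sam", "</s>"],
          ["<s>", "Sam", "I", "am", "</s>"],
          ["<s>", "I", "do", "not", "like", "green", "eggs", "and", "ham", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))
V = len(unigrams)   # 12, including the start and end tokens

def p_laplace(prev, word):
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

for prev, word in [("<s>", "Sam"), ("Sam", "like"), ("like", "eggs"), ("eggs", "</s>")]:
    print(prev, "->", word, round(p_laplace(prev, word), 4))   # 2/15, 1/14, 1/13, 1/13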

(b) Explain language model and n-gram model. (7)


Language Model: Statistical models that capture patterns in language by as-
signing probabilities to word sequences. They help in tasks like speech recognition,
machine translation, and text generation by providing a measure of how natural
or likely a sequence of words is. Language models form the foundation of many
modern NLP systems and can be trained on large corpora to learn linguistic
patterns.
N-gram Model: A type of language model that predicts the next word based
on the previous (n − 1) words. N-gram models make the Markov assumption that
the probability of a word depends only on a limited history of previous words.
Types of N-gram Models:
• Unigram: P (wi ) - Independent word probabilities, ignoring context entirely
• Bigram: P (wi |wi−1 ) - Depends on previous word, capturing local dependen-
cies
• Trigram: P (wi |wi−2 , wi−1 ) - Depends on two previous words, providing more
context
Markov Assumption: The probability of a word depends only on a fixed num-
ber of previous words, not the entire history. Higher-order n-grams capture more
context but require exponentially more parameters.
Challenges: Data sparsity increases with n, out-of-vocabulary words, and the
need for smoothing techniques like Laplace or Kneser-Ney smoothing to handle
unseen n-grams.
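A quick illustration of extracting n-grams with NLTK:

from nltk.util import ngrams

tokens = "I do not like green eggs and ham".split()
print(list(ngrams(tokens, 2))[:3])   # [('I', 'do'), ('do', 'not'), ('not', 'like')]
print(list(ngrams(tokens, 3))[:2])   # [('I', 'do', 'not'), ('do', 'not', 'like')]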

P.6 (a) Explain attention mechanism used in Transformer. (8)


Attention Mechanism in Transformer:
The attention mechanism allows the model to focus on different parts of the input
sequence when processing each element, enabling better understanding of context
and relationships. Unlike recurrent models, attention provides direct connections
between all positions in the sequence, allowing for parallel computation and better
handling of long-range dependencies.

Self-Attention Formula:
Attention(Q, K, V) = softmax(QK^T / √dk) V    (18)

Where:
• Q = Query matrix (what we’re looking for)
• K = Key matrix (what we’re looking in)
• V = Value matrix (what we actually use)
• dk = Dimension of key vectors (for scaling)
Multi-Head Attention:

MultiHead(Q, K, V) = Concat(head1, ..., headh) W^O    (19)

Multiple attention heads allow the model to attend to different representation
subspaces simultaneously, capturing various types of relationships.
Process:
i. Linear transformations are applied to generate the query (Q), key (K), and
value (V) matrices from input embeddings.
ii. Compute attention scores between all pairs of positions using dot products
of Q and K.
iii. Apply the softmax function to the attention scores to obtain attention weights
that sum to 1.
iv. Compute the weighted sum of the value vectors (V) using the attention
weights to produce the output.
v. Use multiple attention heads to capture different types of relationships and
patterns in the data.
vi. Apply layer normalization and residual connections for training stability.
Benefits: Parallelizable computation, captures long-range dependencies efficiently,
provides interpretable attention weights, and enables better gradient flow com-
pared to recurrent architectures.
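A minimal NumPy sketch of scaled dot-product attention as in equation (18); the shapes and random values are purely illustrative:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # each row of weights sums to 1
    return weights @ V                   # weighted sum of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 positions, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(attention(Q, K, V).shape)   # (4, 8)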
(b) Explain BERT embeddings. (7)
BERT (Bidirectional Encoder Representations from Transformers):
BERT creates contextualized word embeddings by using bidirectional context,
unlike traditional models that process text left-to-right or right-to-left. This bidi-
rectional approach allows BERT to understand words in their full context, leading
to more accurate representations.
Key Features:
• Bidirectional Context: Considers both left and right context simultane-
ously using masked language modeling
• Transformer Architecture: Uses self-attention mechanism for parallel pro-
cessing and long-range dependencies

• Pre-training Tasks: Masked Language Modeling (MLM) and Next Sen-
tence Prediction (NSP) for comprehensive understanding
• Contextual Embeddings: Same word gets different embeddings in different
contexts, capturing polysemy and context-dependent meanings
Architecture:
• Input: Token embeddings + Segment embeddings + Position embeddings
for complete sequence representation
• Encoder: Multiple Transformer encoder layers with self-attention and feed-
forward networks
• Output: Contextualized representations for each token that can be fine-
tuned for downstream tasks
Applications: Fine-tuning for various NLP tasks like classification, question an-
swering, named entity recognition, and natural language inference, often achieving
state-of-the-art performance.
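A minimal sketch of obtaining contextual BERT embeddings with the Hugging Face transformers library (assumes transformers and torch are installed and the bert-base-uncased weights can be downloaded):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)   # (1, number of tokens, 768) contextual vectors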

P.7 (a) What is data? Describe data preprocessing steps with examples. (2+10)
Data: Information collected and stored for analysis, processing, and decision-
making. In NLP context, data typically consists of text documents, sentences,
or words that can be structured (databases, XML) or unstructured (social media
posts, documents). The quality and preparation of data significantly impacts the
performance of NLP models.
Data Preprocessing Steps:
Data preprocessing transforms raw text into a clean, standardized format suitable
for machine learning algorithms. Proper preprocessing can significantly improve
model performance by reducing noise and inconsistencies.
Given Examples:
• ”Great product!! Highly recommend.”
• ”Terrible experience... never buying again.”
• ”Okay, but delivery was late.”
• ”Loved it!!! (Positive sentiment)”
Step 1: Text Cleaning Remove noise, special characters, HTML tags, and
irrelevant information:
Before: "Great product!! Highly recommend."
After: "Great product Highly recommend"
Step 2: Case Normalization Convert to consistent case to reduce vocabulary
size and improve matching:
Before: "Great product Highly recommend"
After: "great product highly recommend"

Step 3: Tokenization Split text into individual tokens, handling punctuation
and word boundaries:
Before: "great product highly recommend"
After: ["great", "product", "highly", "recommend"]
Step 4: Stop Word Removal Remove common words that typically don’t
contribute to meaning:
Before: ["okay", "but", "delivery", "was", "late"]
After: ["okay", "delivery", "late"]
Step 5: Stemming/Lemmatization Reduce words to their base or root form
to group related words:
Before: ["buying", "loved", "recommend"]
After: ["buy", "love", "recommend"]
(b) Explain one-hot encoding with example and challenges. (8)
One-Hot Encoding: A representation method where each word in the vocab-
ulary is represented as a binary vector with only one element set to 1 and all
others set to 0. The position of the 1 corresponds to the word’s index in the
vocabulary. This creates a sparse, high-dimensional representation where each
dimension represents a unique word.
Example: Vocabulary: [”good”, ”bad”, ”movie”, ”great”]

Word One-Hot Vector


good [1, 0, 0, 0]
bad [0, 1, 0, 0]
movie [0, 0, 1, 0]
great [0, 0, 0, 1]

The sentence ”good movie” would be represented as the sum: [1, 0, 1, 0].
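A minimal Python sketch of this encoding (the one_hot() helper is written for illustration only):

vocab = ["good", "bad", "movie", "great"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot("movie"))                       # [0, 0, 1, 0]
sentence = [one_hot(w) for w in ["good", "movie"]]
print([sum(col) for col in zip(*sentence)])   # bag-of-words sum: [1, 0, 1, 0]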
Challenges in One-Hot Encoding:
• High Dimensionality: Vector size equals vocabulary size, creating sparse
representations that can be memory-intensive
• No Semantic Similarity: Cannot capture relationships between similar
words - ”good” and ”great” appear as orthogonal
• Memory Inefficiency: Requires significant storage for large vocabularies,
with most elements being zero
• Curse of Dimensionality: Performance degrades in high-dimensional spaces
due to sparsity and distance metrics becoming less meaningful
• Out-of-Vocabulary Problem: Cannot handle new words not in training
vocabulary, limiting model adaptability
• Loss of Word Order: When combining vectors (e.g., bag-of-words), se-
quential information is lost

P.8 (a) What is Zipf's law? Calculate TF-IDF after removing stopwords. (5+10)
Zipf's Law is an empirical linguistic observation that states:
In a given corpus of natural language, the frequency of any word is in-
versely proportional to its rank in the frequency table.
This fundamental principle describes the distribution of word frequencies in nat-
ural languages and has implications for NLP system design, vocabulary selection,
and understanding language structure.
Mathematically:
f(r) ∝ 1/r
Where:
• f (r) = frequency of the word at rank r
• r = rank of the word (1 for the most frequent word, 2 for the second most
frequent, etc.)
Generalized Form (with exponent s):

f(r) = C / r^s
Where:
• C = a constant (often the frequency of the most common word)
• s ≈ 1 for natural language (empirically observed)
This law explains why a small number of words account for most of text content,
influencing stopword identification and vocabulary pruning strategies.
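A small sketch for checking Zipf's law empirically on NLTK's Gutenberg corpus (assumes the gutenberg corpus has been downloaded with nltk.download):

from collections import Counter
from nltk.corpus import gutenberg

words = [w.lower() for w in gutenberg.words("austen-emma.txt") if w.isalpha()]
ranked = Counter(words).most_common()

for rank, (word, freq) in enumerate(ranked[:5], start=1):
    print(rank, word, freq, freq * rank)   # freq x rank stays roughly constant if Zipf's law holds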
Given Documents:
• D1: ”The cat sat on the mat.”
• D2: ”The dog played in the park.”
• D3: ”Cats and dogs are great pets.”
After removing stopwords (the, on, in, and, are):
• D1: ”cat sat mat”
• D2: ”dog played park”
• D3: ”cats dogs great pets”

Term Frequencies:
Term D1 D2 D3
cat 1 0 0
sat 1 0 0
mat 1 0 0
dog 0 1 0
played 0 1 0
park 0 1 0
cats 0 0 1
dogs 0 0 1
great 0 0 1
pets 0 0 1
IDF Calculations: Each term appears in exactly one document, so:
 
IDF = log(3/1) = log(3) ≈ 1.099    (20)

TF-IDF Scores: Each term has TF-IDF = 1 × 1.099 = 1.099 in its respective
document and 0 in others. This uniform distribution occurs because stopword
removal created documents with completely disjoint vocabularies.
(b) Explain Transformer and its architecture. (8)
Transformer: A neural network architecture that relies entirely on attention
mechanisms, eliminating recurrence and convolutions for sequence-to-sequence
tasks. Introduced in ”Attention Is All You Need” (2017), it revolutionized NLP
by enabling parallel processing and better handling of long-range dependencies.
Key Components:
1. Encoder Stack:
• 6 identical layers stacked vertically
• Each layer: Multi-head self-attention + Position-wise feed-forward network
• Residual connections and layer normalization around each sub-layer
• Enables parallel processing of input sequences
2. Decoder Stack:
• 6 identical layers with three sub-layers each
• Each layer: Masked self-attention + Encoder-decoder attention + Feed-forward
network
• Residual connections and layer normalization
• Generates output sequences autoregressively
3. Attention Mechanism:
Attention(Q, K, V) = softmax(QK^T / √dk) V    (21)


The scaling factor √dk prevents the softmax from saturating for large dimensions.
4. Positional Encoding: Adds position information since the model has no
recurrence, using sinusoidal functions to encode absolute and relative positions.
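A minimal NumPy sketch of the standard sinusoidal positional encoding (a sketch of the published scheme, not code from any particular library):

import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]       # positions 0 .. max_len-1
    i = np.arange(d_model)[None, :]         # embedding dimensions
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions use cosine
    return pe

print(positional_encoding(50, 16).shape)   # (50, 16), added to the token embeddings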
Advantages: Parallelizable training, captures long-range dependencies effec-
tively, achieves state-of-the-art results in many NLP tasks, and forms the foun-
dation for modern language models like BERT and GPT.

P.9 (a) Explain Word2Vec and its architectures. (8)


Word2Vec: A neural network-based method for learning word embeddings that
represent words as dense vectors in a continuous vector space, capturing seman-
tic relationships. Developed by Mikolov et al., it revolutionized word represen-
tation by moving from sparse, high-dimensional one-hot vectors to dense, low-
dimensional embeddings that capture semantic meaning.
Two Main Architectures:
1. CBOW (Continuous Bag of Words):
• Predicts target word from context words within a sliding window
• Fast training, good for frequent words due to averaging effect
• Architecture: Input (context) → Projection (average) → Output (target)
• Smooth representations for frequent words by averaging multiple contexts
2. Skip-gram:
• Predicts context words from target word, maximizing context prediction
• Better for infrequent words, captures more semantic relationships and analo-
gies
• Architecture: Input (target) → Projection → Output (context)
• More computationally expensive but produces higher quality embeddings
Training Techniques:
• Hierarchical Softmax: Uses binary tree to reduce computational complex-
ity from O(V) to O(log V)
• Negative Sampling: Samples negative examples instead of full softmax,
making training more efficient
Benefits: Captures semantic similarity through vector proximity, enables vector
arithmetic for analogical reasoning, and provides dense representations suitable
for downstream tasks. Famous example:

king − man + woman ≈ queen
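The analogy can be reproduced with pretrained vectors through gensim's downloader (a sketch that assumes internet access; glove-wiki-gigaword-100 is one of the models available through gensim-data, and the top result is typically "queen"):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # downloads pretrained GloVe vectors
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))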

(b) Naive Bayes classification prediction. (7)


Training Set:
• abac → A
• baabaaa → A
• bbaabbab → B

• abbb → B
• abbaa → A
• bbbaab → B
Test: aabc
Naive Bayes assumes conditional independence between features (characters) given
the class label, simplifying probability calculations.

Step 1: Prior Probabilities


• Class A: abac, baabaaa, abbaa (3 documents)
• Class B: bbaabbab, abbb, bbbaab (3 documents)

P(A) = 3/6 = 0.5    (22)
P(B) = 3/6 = 0.5    (23)
Step 2: Character Count Analysis Character Counts:
Character Class A Count Class B Count
a 10 6
b 5 12
c 1 0
Total characters: Class A = 16, Class B = 18
Step 3: Likelihood with Laplace Smoothing Vocabulary: {a, b, c}, |V | = 3
For class A:
P(a|A) = (10 + 1) / (16 + 3) = 11/19    (24)
P(b|A) = (5 + 1) / (16 + 3) = 6/19    (25)
P(c|A) = (1 + 1) / (16 + 3) = 2/19    (26)
For class B:
P(a|B) = (6 + 1) / (18 + 3) = 7/21 = 1/3    (27)
P(b|B) = (12 + 1) / (18 + 3) = 13/21    (28)
P(c|B) = (0 + 1) / (18 + 3) = 1/21    (29)
Step 4: Posterior for "aabc" Using the naive independence assumption:

P(A|aabc) ∝ 0.5 × (11/19)² × (6/19) × (2/19)    (30)
P(B|aabc) ∝ 0.5 × (7/21)² × (13/21) × (1/21)    (31)

Evaluating these expressions gives P(A|aabc) ≈ 0.0056 and P(B|aabc) ≈ 0.0016, so
P(A|aabc) > P(B|aabc) and the predicted class is A.

P.10 Write short notes on any two of the following: (2×5=10)


a) Next Sentence Prediction (NSP):
NSP is one of the pre-training tasks used in BERT to help the model understand
relationships between sentences and develop discourse-level understanding. The model
is given pairs of sentences and must predict whether the second sentence follows the
first in the original document.
Training Process:

• 50% of pairs are consecutive sentences from the same document (labeled IsNext)
• 50% are random sentences from different documents (labeled NotNext)
• Model learns to distinguish coherent sentence pairs from random combinations
• Uses the [CLS] token representation for binary classification

Applications: Helps in downstream tasks requiring sentence-level understanding like
question answering, natural language inference, and document-level comprehension
tasks.
b) Masked Language Modeling (MLM):
MLM is BERT’s primary pre-training objective where random tokens in the input are
masked and the model must predict the original tokens based on bidirectional context.
This enables the model to learn deep bidirectional representations.
Masking Strategy:

• 15% of tokens are selected for prediction


• Of these: 80% replaced with [MASK], 10% replaced with random tokens, 10%
unchanged
• Model learns to predict masked tokens using both left and right context
• The varied masking strategy prevents the model from only learning to predict
[MASK] tokens

Benefits: Enables bidirectional learning unlike traditional left-to-right language mod-


els, captures deep contextual relationships, and creates robust word representations
that understand both left and right context simultaneously.
c) Human Language and Intelligence:
Human language is a complex cognitive system that enables communication through
structured symbols and rules, representing one of the most sophisticated achievements
of human intelligence. It demonstrates human cognitive capabilities through several
key aspects:

• Creativity and infinite expressivity using finite elements (words, grammar rules)
• Context-dependent meaning generation and pragmatic understanding
• Ability to convey abstract concepts, emotions, and complex ideas
• Recursive structure allowing infinite sentence generation
• Cultural transmission and evolution of linguistic knowledge

Language serves as both a tool for communication and a medium for thought, re-
flecting the sophisticated nature of human cognition and providing insights into the
organization of human knowledge and reasoning processes.
