
ANI vs AGI: Key Differences Explained

The document discusses the differences between Artificial Narrow Intelligence (ANI) and Artificial General Intelligence (AGI), highlighting that ANI is task-specific while AGI aims for human-like versatility. It also explains how Generative AI contributes to AGI development by enhancing creativity, learning, and interaction. Additionally, it covers the role of representational learning and various applications of Generative AI in computer vision and data synthesis.



Artificial Narrow Intelligence (ANI) vs Artificial General Intelligence (AGI)
Module 4 – Generative AI (Prof. Naveen Kumar Bhansali)

1. Artificial Narrow Intelligence (ANI)


Also known as Weak AI.

Definition

AI systems designed and trained to perform one specific task or a narrow set of tasks.
They operate strictly within predefined constraints and cannot generalize beyond their
scope.

Examples

 Siri, Alexa → Voice assistants performing specific actions (set reminders, play
music, answer factual queries).
 Recommendation Systems → Netflix/Amazon algorithms suggesting movies or
products based on user history.
 Autonomous Vehicles (Self-driving cars) → Navigate roads, detect traffic signals,
avoid obstacles using sensors + specialized algorithms.

Key Characteristics

 Domain-specific
 Limited flexibility
 No ability to transfer learning across domains

2. Artificial General Intelligence (AGI)


Also called Strong AI or Human-Level AI.

Definition
AI systems capable of performing any intellectual task that a human can, with the ability
to:

 Understand
 Learn
 Reason
 Plan
 Adapt and solve problems across any domain

Current Status

AGI is theoretical.
No existing system fully matches human-level flexible intelligence.

Levels of AGI

1. Basic AGI
o Matches human capability
o Performs any human task but not necessarily faster or better
2. Advanced AGI
o Exceeds human abilities in speed, accuracy, efficiency, insights
3. Superintelligence (Speculative)
o Surpasses human intelligence in all dimensions
o Could cause extremely rapid societal and technological change

How Generative AI Drives the Development of AGI
Generative AI models (e.g., GPT-4) create text, images, audio, and other content from simple
prompts. They contribute to AGI development in the following ways:

1. Enhancing Creativity & Problem-Solving


 Can produce human-like text
 Useful for brainstorming, drafting, ideation
 Shows potential for cognitive flexibility, a key component of AGI

2. Improving Learning & Adaptation


 Train on massive datasets
 Continuously improve
 Develop generalization abilities across different domains
 Mirrors how AGI must learn and adapt

3. Facilitating Natural Interaction


 NLP and text generation allow smoother human–AI communication
 Reduces barrier between machine understanding & human language
 Enables AI to integrate into everyday activities more naturally
 Essential for AGI-level interaction

4. Democratizing Access to Advanced AI


 Generative AI tools are available to non-experts
 Encourages experimentation, broad usage, and feedback
 This widespread use accelerates refinement → pushing AI closer to AGI

5. Creating Multimodal Capabilities


 Not limited to text
 Also generates images, audio, video
 AGI requires understanding and reasoning across multiple modalities simultaneously
 Multimodal generative models are stepping stones to AGI

Summary
 ANI → Specialized, task-focused, limited intelligence
 AGI → Human-like versatile intelligence capable of reasoning and solving problems
in any domain
 Generative AI accelerates progress toward AGI by improving:
o Creativity
o Adaptability
o Natural communication
o Accessibility
o Multimodal understanding

AI vs ML vs DL vs Generative AI
Module 4 – Generative AI (Prof. Naveen Kumar Bhansali)

1. Artificial Intelligence (AI)


Definition:
AI is the broadest field concerned with creating systems that can perform tasks requiring
human-like intelligence, such as:

 Learning
 Reasoning
 Problem-solving
 Perception
 Language understanding

AI is the umbrella term under which ML, DL, and Generative AI fall.

2. Machine Learning (ML)


Subset of AI

Definition:

ML focuses on developing algorithms that enable machines to learn patterns from data and
make predictions or decisions without being explicitly programmed for each task.

Key Points:

 Systems improve automatically with experience


 ML models learn from data patterns
 Used for tasks like classification, regression, clustering, etc.
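The "learning from data patterns instead of explicit rules" idea can be sketched with a toy 1-nearest-neighbour classifier. The fruit measurements and labels below are invented purely for illustration:

```python
import math

# Toy training data: (weight in g, diameter in cm) -> label.
# No rule like "if weight > 140 then apple" is written anywhere;
# the prediction comes entirely from the stored examples.
training_data = [
    ((150, 7.0), "apple"),
    ((170, 7.5), "apple"),
    ((120, 6.0), "orange"),
    ((130, 6.2), "orange"),
]

def predict(features):
    """Classify by copying the label of the closest training example."""
    nearest = min(training_data, key=lambda ex: math.dist(ex[0], features))
    return nearest[1]

print(predict((160, 7.2)))  # falls nearest to the apple examples
```

Adding more labeled examples improves the classifier without changing a single line of logic, which is exactly the "improves with experience" point above.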
3. Deep Learning (DL)
Subset of Machine Learning

Definition:

DL uses artificial neural networks with many layers (deep networks) to automatically
learn complex representations from data.

Characteristics:

 Learns hierarchical features automatically


 Works extremely well for image recognition, speech recognition, text processing,
etc.
 Requires large datasets + high computation

DL forms the backbone of modern generative models.

4. Generative AI (GenAI)
Sits under AI → ML → DL
From an implementation standpoint, generative AI is a subset of deep learning.

Definition:

Generative AI uses ML/DL techniques to create new content such as:

 Text
 Images
 Music
 Code
 Audio
 Videos

How Generative AI fits into the hierarchy:

1. Under AI:
o It is part of the broader effort of creating intelligent systems capable of
creative reasoning and output generation.
2. Uses Machine Learning:
o GenAI models learn patterns, styles, and structures from training data and
generate new, similar outputs.
3. Based on Deep Learning:
o Most generative models (GPT-4, DALL·E, Stable Diffusion, etc.) use deep
neural networks to understand and generate content.
o Especially uses architectures like Transformers, GANs, VAEs, Diffusion
Models.

Summary of Relationships
AI → Broadest field: all intelligent systems
ML (Subset of AI) → Learning from data; no explicit rules required
DL (Subset of ML) → Neural networks with multiple layers; learns complex representations
Generative AI (Subset of DL) → Uses AI + ML + DL techniques to generate entirely new content

One-line Summary
AI ⟶ ML ⟶ Deep Learning ⟶ Generative AI
(Generative AI sits at the deepest end of the hierarchy and uses deep neural networks to
create new original content.)

Core Principle of Generative AI –
Representational Learning
Module 4 – Prof. Naveen Kumar Bhansali

1. What is Representational Learning?


Representational learning (also called feature learning) refers to techniques where
algorithms:

 Automatically discover useful representations/features from raw data


 Eliminate the need for manual feature engineering
 Learn the most informative structure of the data directly

This makes learning more efficient, scalable, and adaptable across domains.

2. Relationship Between Representational Learning and Generative AI
Generative AI depends fundamentally on representational learning because generative
models must understand the underlying structure of data before generating new content.

This relationship is expressed through the following aspects:

A. Feature Extraction

Generative models need to produce realistic outputs.


For this, they must first understand key features of the training data.

Representational learning helps generative AI to:

 Extract meaningful patterns


 Identify relevant characteristics
 Generate new instances preserving the original data’s structure

B. Learning Data Distributions

Generative models aim to learn the probability distribution of the dataset.


Representational learning supports this by:

 Providing compact latent representations


 Capturing essential characteristics in a lower-dimensional space
 Making it easier for the model to learn and sample from the underlying distribution

Example: Autoencoders compress data into a latent space that preserves important features.

C. Cross-Domain Generation (Multimodality)

Generative AI often works with multiple data modalities, such as:

 Text
 Images
 Audio

Representational learning creates a unified encoding framework, enabling:

 Translation of features between modalities


 Multimodal generation (e.g., generating images from text)

D. Improving Model Performance

The performance of generative models depends strongly on how well features are learned.

Better representations → Better generation.

Advances in representational learning (new architectures, improved training) directly improve:

 Realism
 Diversity
 Accuracy
 Coherence

of generative outputs.

Summary (Conceptual)
Representational learning is the foundation of generative AI.
It allows models to:

 Extract and encode meaningful features


 Understand raw data structures
 Recreate or generate new data effectively

The synergy between them leads to more accurate, realistic, and versatile generative
capabilities.

Encoder–Decoder Architecture for Representational Learning
Representational learning is often achieved through encoder–decoder architectures,
commonly used in generative tasks.

1. Encoder
Role: Compress raw data into a meaningful latent representation

 Takes input such as text, images, audio, or sequences


 Identifies the most relevant features
 Converts data into a low-dimensional latent vector
 Removes noise and redundancy
 Captures semantic structure

2. Decoder
Role: Reconstruct or generate outputs from the latent representation

 Takes encoded representations


 Expands them back into the original format or a new output
 Reconstructs data that resembles the input
 Or generates new samples following the same distribution

3. Architectures that use the Encoder–Decoder Framework
 Autoencoders (AE) – learn compressed representations and reconstruct input
 Variational Autoencoders (VAE) – learn probabilistic latent spaces
 Transformers – use encoder and decoder blocks for tasks like translation
 Seq2Seq models – for text generation and machine translation
These architectures enable effective representational learning, which is critical for high-quality generative tasks.
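As a rough sketch of the encoder–decoder round trip, the toy linear "autoencoder" below compresses a 2-D point to a single latent number and expands it back. Real autoencoders learn the projection from data during training; the fixed direction here is chosen by hand just to show the compress-then-reconstruct flow:

```python
import math

# Latent axis: a fixed unit direction in 2-D space.
# (A real autoencoder would learn this from the dataset.)
u = (1 / math.sqrt(2), 1 / math.sqrt(2))

def encode(x):
    """Encoder: 2-D input -> 1-D latent code (projection onto u)."""
    return x[0] * u[0] + x[1] * u[1]

def decode(z):
    """Decoder: 1-D latent code -> reconstructed 2-D output."""
    return (z * u[0], z * u[1])

x = (3.0, 3.0)       # a point lying on the latent axis
z = encode(x)        # compact latent representation: one number instead of two
x_hat = decode(z)    # reconstruction from the latent code
print(z, x_hat)
```

Points that follow the data's underlying structure (here, the diagonal) survive the round trip almost exactly, which is the sense in which a good latent space "captures essential characteristics in a lower-dimensional space."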

Final Summary
 The core principle of Generative AI is representational learning.
 Generative AI models rely on effective representation learning to extract essential
features and learn data distributions.
 Encoder–decoder architectures play a vital role in capturing and reconstructing these
learned representations.
 Strong representation learning → better generative performance across modalities
(text, images, audio).


Module 4 – Generative AI
Applications / Case Studies in Computer Vision

(Prof. Naveen Kumar Bhansali)

Generative AI has transformed computer vision by enabling tasks that go far beyond classical
image processing. Using models like GANs, VAEs, and diffusion models, it can generate,
enhance, modify, and reconstruct visual content with high accuracy and realism.

Below are the key applications, rewritten as detailed notes.

1. Image Synthesis
Generative AI models such as:
 GANs (Generative Adversarial Networks)
 VAEs (Variational Autoencoders)
 Diffusion Models

can generate high-resolution, realistic images from scratch.

Capabilities:

 Learn to produce new images resembling the training dataset


 Create synthetic datasets for training other models
 Artwork generation
 Filling missing image parts (basic inpainting)

2. Image Translation & Style Transfer


Generative models can transfer the style of one image onto another while preserving
content.

Applications:

 Convert real photos into styles of famous painters (e.g., Monet, Van Gogh)
 Convert scenes from one domain to another, such as:
o Day → Night
o Summer → Winter
o Sketch → Photo-like image

Used in art, entertainment, domain adaptation, etc.

3. Super-Resolution Imaging
Generative models (especially GAN variants like SRGAN):

 Improve resolution of low-quality images


 Restore fine details
 Produce sharp, high-clarity images

Use Cases:

 Medical imaging
 Satellite/remote sensing images
 Enhancing old or low-quality digital photographs
4. Video Synthesis & Prediction
Generative AI can:

 Generate new video sequences from input frames


 Predict future frames in a given video

Applications:

 Video editing
 Film special effects
 Surveillance (predicting future activity/frames)

5. Image Inpainting
Generative models can intelligently fill missing or damaged regions of an image using
surrounding context.

Useful for:

 Restoring old or damaged photos


 Removing unwanted objects from images
 Completing missing data in medical imagery

6. 3D Object Generation & Reconstruction


Generative models + 3D rendering techniques enable:

 Generating 3D objects from 2D images


 Reconstructing 3D shapes/structures from multiple viewpoints

Applications:

 Virtual Reality (VR)


 Augmented Reality (AR)
 Gaming
 Digital content creation

7. Image-to-Image Translation
Generative AI can convert an image from one domain to another.

Examples:

 Sketch → Realistic Image


 Low-light image → Enhanced image
 Aerial view → Street-level map

Use Cases:

 Urban planning
 Remote sensing
 Navigation
 Entertainment and creative industries

8. Data Augmentation
Generative models can create synthetic, realistic data to improve the training of deep
learning systems.

Benefits:

 Solves data scarcity issues


 Increases dataset diversity
 Helps models generalize better
 Reduces overfitting

Particularly important in medical imaging, autonomous driving datasets, and rare event
detection.

Conclusion
Generative AI is reshaping computer vision by providing:

 High realism
 Enhanced creativity
 Smarter reconstruction
 Better data availability

As models continue to evolve, the impact of generative AI is expected to increase significantly across fields like healthcare, entertainment, surveillance, design, urban planning, and AR/VR.

Generative AI – Data Synthesis


Module 4 – Prof. Naveen Kumar Bhansali

Data synthesis refers to the creation of artificial data that closely resembles real-world data.
Generative AI models (trained on large datasets) play a crucial role in synthesizing such data
for:

 Training machine learning models


 Protecting privacy
 Testing systems
 Handling data scarcity

1. Synthetic Data Generation


Models: GANs (Generative Adversarial Networks) and VAEs (Variational
Autoencoders)

These models can:

 Generate synthetic data that mirrors original datasets


 Produce realistic samples where real data is scarce or sensitive

Example:
GANs generate synthetic medical images to train diagnostic models without exposing
patient identities.

2. Data Augmentation
Machine learning often needs large labeled datasets, which are expensive and time-consuming to collect.
Generative AI helps by:

 Adding diverse, realistic examples to existing datasets


 Expanding training data automatically

Used heavily in:

 Computer vision → synthesizing new images


 Speech recognition → generating variations of audio recordings

This improves algorithm robustness and generalization.

3. Anonymization
Privacy protection is essential when dealing with sensitive data.

Generative AI supports anonymization by:

 Generating synthetic datasets with realistic patterns


 Ensuring no direct link to any real individual
 Reducing privacy risks while preserving data utility
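A deliberately naive sketch of the anonymization idea: build per-field distributions from (made-up) real records, then sample new rows from those distributions so no synthetic row is tied to any individual. Note that sampling each column independently, as done here, discards cross-column correlations; trained generative models (GANs, VAEs) are used precisely because they also capture those correlations:

```python
import random

random.seed(0)

# Invented "real" records, for illustration only.
real_records = [
    {"age_band": "30-39", "city": "Pune",   "plan": "basic"},
    {"age_band": "40-49", "city": "Mumbai", "plan": "premium"},
    {"age_band": "30-39", "city": "Delhi",  "plan": "basic"},
    {"age_band": "20-29", "city": "Pune",   "plan": "premium"},
]

def synthesize(n):
    """Sample each field independently from its observed values."""
    columns = {k: [r[k] for r in real_records] for k in real_records[0]}
    return [{k: random.choice(v) for k, v in columns.items()} for _ in range(n)]

print(synthesize(3))  # realistic-looking rows, none copied from a real person
```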

4. Handling Imbalanced Datasets


Many real datasets have class imbalance, where some categories have very few examples.

Generative AI solves this by:

 Synthesizing new data for minority classes


 Balancing the dataset
 Improving fairness and prediction accuracy of ML models

5. Simulation and Scenario Analysis


Generative AI creates synthetic data for environments that are:

 Hard
 Expensive
 Dangerous
 Or impossible to capture in real life
Examples:

Autonomous Vehicles

 Simulate diverse driving conditions


 Train and test self-driving car algorithms

Case Study: Wayve (London)

 Developed GAIA-1, a generative AI model for autonomy


 Generates realistic driving videos using video, text, and action inputs
 Allows fine control of ego vehicle behavior and scene features
 Useful for research, simulation, and training

Finance

 Generates synthetic financial behavioral patterns


 Helps personalize financial advice
 Identifies hidden patterns and relationships in spending/investment data

6. Feature Creation for ML


Generative AI can create new, meaningful features by learning underlying data
distributions.

 Captures complex relationships that humans may miss


 Enhances the performance of machine learning models
 Used in finance, healthcare, behavioral modeling, etc.

7. NLP Data Synthesis


Language models like GPT can generate:

 Synthetic text for training datasets


 Evaluation data for language systems
 Chatbot training conversations

This supports scalable NLP dataset creation.

8. Cost-Effective Dataset Creation


Generative AI drastically reduces the cost of producing training datasets.

Example: Stanford Alpaca Project

 Generated 52,000 training instructions


 Total cost: ~$500 using OpenAI models

This democratizes access to large, high-quality datasets.

9. Conclusion and Future Outlook


Generative AI transforms data synthesis by:

 Overcoming data scarcity


 Preserving privacy
 Enabling large-scale, realistic training environments
 Reducing the cost of data creation

Prediction:
By 2024, 60% of the data used in AI and analytics projects will be synthetically generated (a Gartner forecast).

Ethical Considerations

Essential to ensure:

 Synthetic data does not reproduce biases from real datasets


 Compliance with ethical standards and regulations
 Special caution in sensitive domains (healthcare, finance, governance)

📘 Generative AI – Personalization
(Module 4 – Prof. Naveen Kumar Bhansali)

Generative AI has significantly advanced personalization, enabling systems to deliver tailored content, recommendations, and experiences based on individual preferences, behavior, and context. By generating synthetic or customized data, GenAI adapts much more closely to user needs than traditional rule-based systems.

1. Personalized Content Creation


Generative AI creates customized content based on:

 User interests
 Past interactions
 Purchase history
 Browsing behavior

Applications

 Digital marketing:
GenAI creates personalized email campaigns with:
o Tailored subject lines
o Product recommendations
o Body text reflecting prior purchases

→ Leads to higher engagement, open rates, and conversion rates.

 E-commerce:
If a user prefers eco-friendly products, GenAI:
o Generates product descriptions highlighting sustainability
o Creates personalized ads aligned with their values

2. Tailored Recommendations
Recommendation systems powered by GenAI analyze:

 User behavior
 Past selections
 Engagement patterns

They then generate highly personalized suggestions.


Examples

 Netflix, Spotify:
GenAI analyzes viewing and listening habits to:
o Recommend new movies or songs
o Generate synthetic profiles that explore patterns to suggest content users may
not discover themselves
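A tiny content-based recommender captures the core mechanism: score unseen titles by how many genre tags they share with what the user already watched. Titles and tags below are invented; production systems at the scale of Netflix or Spotify use learned embeddings rather than hand-written tags:

```python
# Invented catalog: title -> set of genre tags.
catalog = {
    "Space Saga":    {"sci-fi", "adventure"},
    "Quiet Lakes":   {"documentary", "nature"},
    "Star Rebels":   {"sci-fi", "action"},
    "City Kitchens": {"documentary", "food"},
}

def recommend(watched, top_n=2):
    """Rank unseen titles by tag overlap with the user's watch history."""
    liked_tags = set().union(*(catalog[t] for t in watched))
    unseen = [t for t in catalog if t not in watched]
    return sorted(unseen, key=lambda t: len(catalog[t] & liked_tags), reverse=True)[:top_n]

print(recommend(["Space Saga"]))  # "Star Rebels" ranks first (shares "sci-fi")
```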

3. Adaptive User Interfaces


Generative AI creates interfaces that adapt in real-time based on how users interact.

Example

 A news app dynamically rearranges its homepage based on:


o What categories a user reads most
o Time spent on certain types of articles

If the user frequently reads technology, the app prioritizes tech news at the top.

→ Results in a more intuitive, relevant, and user-friendly interface.

4. Customized Educational Content


Generative AI enables personalized learning experiences by producing:

 Customized learning paths


 Personalized quizzes
 Targeted reading materials
 Remedial content

Example

 An online learning platform uses GenAI to:


o Analyze student performance
o Generate study plans tailored to strengths and weaknesses
o Provide specific exercises to improve low-performing areas

5. Individualized Health & Wellness Plans


Healthcare and wellness applications use GenAI to create:
 Personalized diet plans
 Customized workout routines
 Tailored medical or lifestyle suggestions

Example

 A fitness app uses GenAI to generate:


o Workouts based on user’s health metrics
o Diet plans supporting goals such as fat loss, strength building, or
cardiovascular improvement

Conclusion
Generative AI has transformed personalization by enabling:

 More relevant user experiences


 Tailored content generation
 Better recommendations
 Adaptive and intelligent interfaces

As the technology evolves, GenAI will deliver even richer personalization while ensuring:

 Ethical use
 Privacy protection
 Transparency in recommendations


📘 Gen AI – Widening of the Gap Between Experts and Novices
(Module 4 – Prof. Naveen Kumar Bhansali)

Generative AI provides powerful capabilities, but the effectiveness of these tools heavily
depends on domain expertise. Experts can use GenAI more efficiently and strategically
because they understand the context, constraints, and exactly how to frame precise prompts.
This leads to a widening gap between experts and novices.

1. Why Experts Benefit More from GenAI


 Experts possess deep domain knowledge.
 They frame accurate and detailed prompts.
 They can interpret AI outputs correctly.
 They refine and improve AI-generated results better than novices.

Thus, generative AI amplifies expert abilities, making them even more productive and
innovative.

2. Examples Showing How GenAI Favors Experts

A. Scientific Research

Experts in technical fields can use GenAI to model and analyze complex structures
accurately.

Example:
A molecular biologist uses GenAI to:

 Generate protein structure models


 Provide specific scientific parameters
 Interpret results correctly

→ Enables discoveries that novices cannot achieve due to lack of foundational knowledge.

B. Creative & Professional Design Fields

Experts in design use GenAI tools to automate routine tasks and push creativity further.

Example:
A professional graphic designer using Adobe Sensei can:

 Automate repetitive components


 Focus on advanced creative decisions
 Refine AI-generated drafts into high-quality designs

→ Novices may not know how to evaluate or correct AI suggestions.


C. Programming & Software Development

Experienced developers know how to frame programming problems and integrate AI-generated code effectively.

Example:
A senior programmer using GitHub Copilot can:

 Generate sophisticated code snippets


 Integrate them into large systems
 Detect and fix logical flaws in AI-generated code

→ Novices may copy code blindly without understanding context, leading to errors.

D. Healthcare Diagnostics

Domain knowledge is crucial for interpreting medical outputs from GenAI systems.

Example:
A radiologist using AI diagnostic tools can:

 Identify anomalies in medical scans


 Connect findings with patient history
 Make accurate treatment decisions

→ Novices may misinterpret AI suggestions, risking patient safety.

E. Business & Financial Analysis

Experts use GenAI for deep market insights and strategic decision-making.

Example:
An experienced financial analyst can:

 Generate market forecasts


 Interpret AI-generated investment reports
 Make informed investment decisions

→ Novices may misread AI outputs due to lack of financial knowledge.

3. Why the Gap Widens


 GenAI increases productivity only when combined with expertise.
 Experts become even more capable and efficient.
 Novices often lack:
o Context to frame prompts correctly
o Skills to evaluate AI outputs
o Ability to refine AI-generated content

Therefore, Generative AI magnifies existing skill differences instead of reducing them.

4. Conclusion
Generative AI offers immense value, but domain expertise determines its true impact.
Experts leverage AI to:

 Work faster
 Work more creatively
 Produce more accurate and insightful results

Novices, lacking the knowledge to guide or interpret AI, may struggle, causing the expert–novice gap to widen.

This highlights the need for:

 Strong foundational learning


 Skill development
 Understanding domain principles

to fully harness the power of generative AI.


📘 Module 5 – Transformer Architecture


Generative AI – Prof. Naveen Kumar Bhansali

Transformers form the foundational architecture behind modern Large Language Models
(LLMs) used in Generative AI (such as GPT, PaLM, LLaMA, etc.). This architecture
revolutionized Natural Language Processing (NLP) because it can efficiently model long-range dependencies and capture context far better than previous architectures like RNNs and LSTMs.

1. Introduction to Transformer Architecture


Large Language Models (LLMs) are built on Transformer architecture, which enables
high-level language understanding and generation.
The key components you must understand are:

1. Tokenization
2. Embeddings
3. Attention Mechanism (Self-Attention)

These components together make it possible for LLMs to process text, understand context,
and generate meaningful responses.

2. Tokenization
Tokenization = breaking text into smaller units called tokens.

✔ Types of Tokens:

 Words
 Subwords (most common in LLMs)
 Characters
 Special symbols (punctuation, whitespace markers, etc.)

✔ Example

Sentence:
"Hello, how are you?"
Possible tokens:

 “Hello”
 “,”
 “how”
 “are”
 “you”
 “?”

✔ Why Tokenization Is Important

 Enables the model to process text piece by piece


 Handles rare words through subword tokenization
 Reduces vocabulary size while retaining semantic meaning

Tokenization is the first step in transforming raw text into a form the model can process.
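A minimal word-level tokenizer can be written with a regular expression, matching the "Hello, how are you?" example above. Keep in mind that real LLM tokenizers use learned subword vocabularies (e.g. byte-pair encoding), not a fixed regex:

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches runs of word characters; [^\w\s] matches one
    # punctuation character (anything that is not a word char or space).
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, how are you?"))
# ['Hello', ',', 'how', 'are', 'you', '?']
```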

3. Embeddings
Once tokens are extracted, each token is converted into a vector representation known as an
embedding.

✔ Purpose of Embeddings

Embeddings encode:

 Semantic meaning
 Syntactic role
 Contextual relationships between tokens

This means the embedding of “doctor” is closer to “nurse” than to “banana.”

✔ Key Point

Embeddings are learned during training, so they automatically capture complex relationships between words/subwords.
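The "doctor is closer to nurse than to banana" idea can be illustrated with cosine similarity over hand-made 3-D vectors. Real embeddings have hundreds or thousands of dimensions and are learned during training, not written by hand as below:

```python
import math

# Invented toy embeddings, chosen so that medical words point in a
# similar direction and "banana" points elsewhere.
embeddings = {
    "doctor": (0.9, 0.8, 0.1),
    "nurse":  (0.85, 0.75, 0.15),
    "banana": (0.1, 0.05, 0.9),
}

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(cosine(embeddings["doctor"], embeddings["nurse"]))   # high similarity
print(cosine(embeddings["doctor"], embeddings["banana"]))  # low similarity
```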

4. Attention Mechanism (Self-Attention)


The attention layer is the core innovation of Transformers.

✔ What Attention Does

It allows the model to focus on the most relevant tokens in a sentence while processing the
input.

The model learns:

 Which words matter more for meaning


 How words relate to each other
 How to maintain context across long sentences

✔ Example:
In the sentence:
“The cat, which was hungry, ate its food.”

To understand “its,” the model must look back to “cat.”


Self-attention enables this long-distance link.

✔ Why Attention Is Powerful

 Handles long-range dependencies


 Understands context deeply
 Learns relationships dynamically for each input
 Replaces sequential processing (unlike RNNs)

Hence, Transformers are parallelizable, making training dramatically faster.
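The mechanism above can be sketched as bare-bones scaled dot-product attention over toy 2-D vectors: each token's output is a weighted mix of all tokens' values, with the weights computed from query-key dot products. This omits the learned projection matrices and multiple heads of a real Transformer layer:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: out_i = sum_j softmax(q_i·k_j/√d) v_j."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # how strongly this token attends to each token
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three toy tokens; every query mixes information from all three values,
# and each query's scores are independent of the others (parallelizable).
q = k = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
v = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
print(attention(q, k, v))
```

Because every row of scores is computed independently, the whole loop over queries can run as one matrix multiplication on a GPU, which is where the parallelism claim comes from.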

5. How These Components Work Together


1. Tokens
→ Break text into manageable units.
2. Embeddings
→ Convert tokens to meaningful numerical vectors.
3. Self-Attention
→ Helps the model focus on relevant parts of the input and understand context.

This pipeline allows Transformers to perform language modeling with extremely high
accuracy.

6. Conclusion
Transformer Architecture is the backbone of all modern generative AI systems.
By combining:

 Tokenization
 Embeddings
 Self-Attention / Attention Layers

Transformers can understand, model, and generate human language with unmatched
capability.

As the architecture continues to evolve, it opens new possibilities across industries, from
healthcare and finance to creative design and automation.


📘 Why Transformer Models Are Trending


(Module 5 – Generative AI)

Transformer models have rapidly become the foundation of modern AI due to several
powerful advantages in learning, processing efficiency, scalability, and adaptability.

1. Enhanced Learning Through Self-Supervision


Transformers excel at self-supervised learning, where they learn patterns from unlabeled
data during pre-training.

How This Works:

 Models are trained on large text corpora.


 They predict:
o Missing words
o Masked words
o Mixed or corrupted tokens
 This helps them learn:
o Deep linguistic structures
o Semantic relationships
o Contextual meanings

Why This Matters:

 No need for fully labeled datasets for each downstream task.


 Reduces manual annotation effort.
 Improves generalization across:
o Multiple tasks
o Domains
o Languages
2. Efficient Parallel Processing
One of the biggest innovations of transformers is their ability to parallelize computation.

Comparison with RNNs:

 RNNs → process input sequentially, slowing down training.


 Transformers → process all tokens in parallel because of the attention
mechanism.

Benefits:

 Faster training and inference.


 Highly efficient on modern hardware:
o GPUs
o TPUs
 Can handle extremely large datasets.
 More scalable than earlier deep learning models.

3. Scalability for Complex Tasks


Transformers are designed to scale:

Scales in Two Ways:

1. Data scale – can learn from massive datasets.


2. Model scale – supports billions of parameters.

Outcome:

 Captures intricate linguistic patterns.


 Handles complex AI tasks.
 Enables state-of-the-art performance in real-world applications.

Scalability is key to pushing the boundaries of AI research and improving model capabilities
across languages and domains.

4. Adaptable Integration Across Applications


Transformer models are extremely flexible and easy to integrate into diverse applications.

Why This Is Possible:


 Pre-trained models (e.g., BERT, GPT) can be fine-tuned with small amounts of task-specific data.
 Fine-tuning adapts the model to:
o New tasks
o New domains
o New datasets

Applications Include:

 Natural language understanding


 Machine translation
 Sentiment analysis
 Text classification
 Question answering
 Many others

This adaptability makes transformers suitable for a wide range of industries and real-world
use cases.

5. Conclusion
Transformers are trending because they offer:

 Powerful self-supervised learning → strong representations


 Efficient parallel processing → fast and scalable training
 High scalability → supports massive models and datasets
 Flexible integration → easy fine-tuning for new tasks

These strengths have made transformers a dominating force in advancing modern AI and
revolutionizing all areas of natural language processing and understanding.


📘 Module 5 – List of Foundation Models


(Generative AI – Prof. Naveen Kumar Bhansali)

What Are Foundation Models?


 Term coined/popularized by Stanford HAI – Center for Research on Foundation
Models (CRFM).
 Also called Large AI Models.
 Trained on massive datasets to enable wide-range applicability across tasks and
domains.
 Development requires huge resources (hundreds of millions of dollars for compute +
data).
 But fine-tuning or using pre-trained models is far more cost-effective.

📌 Timeline & Summary of Major Foundation Models

Below is a structured list of major foundation models, in chronological order of release.

1. GPT-1 (OpenAI, June 2018)


 Architecture: Decoder-only Transformer
 Parameters: 117 million
 Training data: 1 billion tokens
 Training time: 30 days, using 8× NVIDIA P600 GPUs
 Significance: Marked the beginning of the GPT series; foundation for modern NLP
models.

2. BERT (Google, October 2018)


 Architecture: Encoder-only Transformer
 Parameters: 340 million
 Training data: 3.3 billion words
 Key feature: Bidirectional context → major improvement in language understanding.
 Applications: Set a benchmark for contextual NLP tasks.

3. GPT-2 (OpenAI, February 2019)


 Parameters: 1.5 billion
 Training data: 40 GB (~10 billion tokens)
 Significance: Demonstrated strong generative abilities across domains.
4. T5 – Text-to-Text Transfer Transformer (Google, 2019)
 Parameters: 11 billion
 Training data: 34 billion tokens
 Capabilities:
o Text generation
o Translation
o Image-related tasks
 Positioned as a general-purpose foundation model for many Google projects.

5. GPT-3 (OpenAI, May 2020)


 Parameters: 175 billion
 Training data: 300 billion tokens
 Significance:
o Huge performance leap in natural language generation.
o Set new benchmarks across NLP tasks.

GPT-3.5 (2022)

 A fine-tuned variant of GPT-3.


 Delivered to the public via ChatGPT.

6. Claude (Anthropic, December 2021)


 Parameters: 52 billion
 Training data: 400 billion tokens
 Features:
o Strong conversational abilities
o Emphasis on ethical AI and safety
 Became influential in responsible AI discussions.

7. BLOOM (Hugging Face, July 2022)


 Parameters: 175 billion
 Architecture: Similar to GPT-3
 Training data: 350 billion tokens, multilingual
o 30% English
o Excludes programming languages
 Focus: Open, responsible, multilingual AI.
8. LLaMA (Meta AI, February 2023)
 Parameters: 65 billion
 Training data: 1.4 trillion tokens
 Supports 20 languages
 Designed specifically for research and efficiency.

9. BloombergGPT (Bloomberg, March 2023)


 Parameters: 50 billion
 Training data: 363 billion tokens, financial-domain focused
 Specialization: Financial NLP (analysis, insights, domain tasks)
 Claim: Outperforms general LLMs of similar size on finance tasks.

10. GPT-4 (OpenAI, March 2023)


 Details on parameters/training: Undisclosed
 Available via ChatGPT Plus
 Represents the next stage in OpenAI’s innovation.

11. Claude 2 (Anthropic, July 2023)


 Successor to Claude.
 Enhanced conversational abilities, ethics, and human-aligned interaction.

12. LLaMA 2 (Meta AI, July 2023)


 Parameters: 70 billion
 Training data: 2 trillion tokens
 Emphasizes:
o Scalability
o Multilingual performance
o Research usability

13. Mistral 7B (Mistral AI, September 2023)


 Parameters: 7.3 billion
 Extremely efficient for its size.
 General-purpose NLP tasks at high performance.

14. Grok 1 (xAI by Elon Musk, November 2023)


 Parameters: 314 billion
 Context window: 8192 tokens
 Special access to real-time X/Twitter data
 Optimized for dynamic, real-time social media insights.

15. Gemini 1.5 (Google DeepMind, February 2024)


 Architecture: Mixture of Experts (MoE)
 Parameters: Undisclosed
 Context window: ~1 million tokens
 Focus: High scalability + multimodal capabilities.

16. Phi-3 Family (Microsoft, April 2024)


 “Small Language Models”
 Variants: Mini, Small, Medium
 Parameters: 3.8B – 14B
 Designed for efficient deployment with strong performance.

17. LLaMA 3 (Meta AI, April 2024)


 Parameters: 70 billion
 Training data: 15 trillion tokens
 Designed for large-scale research and improved multilingual tasks.

18. Claude 3 Family (Anthropic, 2024)


Includes:

 Claude 3 Opus – flagship, highest reasoning performance


 Claude 3 Sonnet – fast + versatile
 Claude 3 Haiku – cheapest + fastest for text tasks

Features:

 Multimodal vision capabilities


 Improved reasoning, math, coding
 Better multilingual fluency
 Strong emphasis on safety + responsible scaling

📌 Licensing Considerations
Foundation model licenses differ significantly:

Open-source models (e.g., Apache 2.0):

 Allow modification + redistribution


 Some conditions apply

Proprietary models:

 Strict usage rights


 May require permission for commercial use

LLaMA (Meta AI) License:

 Non-commercial research focus


 Requires compliance with specific usage guidelines

Key Advice:

 Always check the latest documentation.


 Licensing terms evolve over time.
 Seek legal clarity when deploying models commercially.

📌 Final Note
The field of foundation models is evolving rapidly.
New models continue to emerge with groundbreaking capabilities in:

 Multimodal learning
 Larger context windows
 More efficient architectures
 Safer, aligned AI
The landscape of NLP and AI is constantly being reshaped as innovation accelerates.



Module 5 – Top-k Sampling vs Top-p Sampling

Generative AI – Prof. Naveen Kumar Bhansali

Overview
Sampling methods determine how a language model selects the next token during text
generation.
This section explains:

 Greedy approach
 Random weighted sampling
 Top-k sampling
 Top-p (nucleus) sampling

Each method balances coherence, diversity, and creativity differently.

1. Greedy Approach
 The model always chooses the highest-probability token at each step.
 Produces coherent but repetitive and predictable text.
 Lacks creativity and variation.
 Example: a story generated this way may follow common, repetitive patterns.

2. Random Weighted Sampling


 The model samples tokens according to their probability distribution.
 High-probability tokens are more likely but not guaranteed to be chosen.
 Adds randomness → more creativity, more variation.
 Useful in:
o Creative writing
o Dialogue generation
o Less predictable interactions

3. Top-k Sampling
 A refinement of random weighted sampling.
 Model considers only the top k most probable tokens.
 The next token is randomly sampled from this subset.
 Balances quality + diversity.
 Example:
o If k = 10, the model picks the next token from the 10 most likely options.
 Helps maintain relevance while keeping text varied.

4. Top-p Sampling (Nucleus Sampling)


 Instead of selecting a fixed number of tokens, we select tokens whose cumulative
probability ≥ threshold p.
 The size of the pool changes dynamically based on context.
 More flexible than top-k.
 Produces fluent, coherent, contextually appropriate text.
 Particularly effective in conversational AI.

Top-p Example (From Slides)


Step 1: Initial context

Sentence: “The cat sat on the”

Step 2: Model provides probabilities

Model outputs probability distribution for possible next words.

Step 3: Set Top-p threshold

 Choose p = 0.85
 Want the smallest set of tokens whose cumulative probability ≥ 0.85
Step 4: Sort tokens

Sort all candidate tokens in descending probability order.

Step 5: Accumulate probabilities

Keep adding probabilities until sum ≥ 0.85.

Step 6: Form the candidate pool

Tokens included in the top-p pool:

 mat
 roof
 table
 grass

Step 7: Randomly select one token

Example chosen from pool: “roof”


Final output:
➡ “The cat sat on the roof.”

Summary
Each method offers a different trade-off between coherence, diversity, and creativity:

| Method | How It Works | Strength | Weakness |
|---|---|---|---|
| Greedy | Always picks highest-probability token | Coherent | Predictable, repetitive |
| Random Weighted Sampling | Samples proportionally to probability | Creative, diverse | May become incoherent |
| Top-k Sampling | Sample from top k tokens | Balanced, relevant | Fixed k may miss context changes |
| Top-p Sampling | Sample from smallest set with cumulative prob ≥ p | Adaptive, fluent, context-aware | Slightly more complex |
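Greedy, top-k, and top-p decoding differ only in how the candidate pool is cut down before sampling, so one small function can sketch all of them. This is a generic illustration of the technique, not any particular library's decoding code; the logits below are made-up scores:

```python
import math
import random

def top_k_top_p_sample(logits, k=0, p=1.0, rng=random):
    """Sample a token index from logits, optionally restricted by top-k / top-p.

    k = 0 disables top-k; p = 1.0 disables top-p.
    Greedy decoding is simply the k = 1 special case.
    """
    # Convert logits to probabilities with a numerically stable softmax.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Sort token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if k > 0:
        order = order[:k]              # top-k: keep only the k best tokens
    if p < 1.0:
        kept, cum = [], 0.0
        for i in order:                # top-p: smallest set whose cumulative
            kept.append(i)             # probability reaches the threshold p
            cum += probs[i]
            if cum >= p:
                break
        order = kept

    # Renormalize over the surviving pool and sample from it.
    pool_total = sum(probs[i] for i in order)
    r = rng.random() * pool_total
    for i in order:
        r -= probs[i]
        if r <= 0:
            return i
    return order[-1]

# Made-up scores, e.g. for "mat", "roof", "table", "grass".
logits = [2.0, 1.0, 0.5, -1.0]
greedy = top_k_top_p_sample(logits, k=1)   # always the most likely token
```

With `k=1` the pool collapses to the single most likely token (greedy); with a small `p` the pool shrinks to however few tokens reach the threshold, growing automatically when the distribution is flatter.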


Module 6 – Retrieval Augmented Generation (RAG)

Generative AI – Prof. Naveen Kumar Bhansali

1. Why RAG? Understanding the Need


The performance of a generative AI model depends on two major factors:

A. Training Phase – Quality of Training Data

 Model learns patterns from the data it is trained on.


 High-quality, diverse, accurate, representative datasets → better generalization.
 Poor or incomplete data → reduced performance.

B. Inference Phase – Quality of Context

 Even a well-trained model produces weak results if the prompt/context is vague.


 Clear, complete prompts are necessary for accurate responses.

Both good training data AND rich inference context are essential.

2. Challenges With Standard Large Language Models

LLMs face two key limitations:

1. Lack of access to specific or updated data

 LLMs are trained on large public datasets.


 After training, they become static → cannot access new or external data.
 This leads to:
o Outdated answers
o Hallucinations
o Incorrect responses for information not in their training set

2. AI applications need custom / organization-specific data

 Real-world applications require company-specific knowledge.


 Examples:
o Customer support bots must answer using company data
o Internal HR bots must answer using HR policies
 Retraining LLMs is expensive, slow, and impractical.

Therefore, we need a way to give the model external, domain-specific, up-to-date data
without retraining.

3. What is Retrieval Augmented Generation (RAG)?

RAG is an architectural technique that combines:

Retrieval + Generation

 Retrieval system → fetches relevant documents from an external knowledge base


 Generative model → uses the retrieved content to produce informed responses

RAG gives LLMs access to custom, updated, and precise data during inference.

Benefits

 Reduces hallucinations
 Produces contextually accurate answers
 Useful for chatbots, Q&A systems, knowledge assistants, domain-specific tools

4. How RAG Works – Step-by-Step Procedure

Step 1: Data Preparation


 Gather documents + metadata
 Preprocess them (cleaning / removing PII / redacting sensitive fields)
 Split documents into chunks (manageable segments)
o Chunk size depends on embedding model and LLM needs
 Goal: Prepare clean, chunked data ready for embedding.

Step 2: Indexing the Data


 Create embeddings → numerical vectors representing semantic meaning
 Store embeddings in a vector database / vector search index
 Vector index enables semantic similarity search—not keyword matching
 This allows fast and accurate retrieval of relevant chunks.

Step 3: Retrieval During Querying


When the user submits a query:

1. Query is converted into an embedding


2. System searches the vector index
3. Retrieves the most relevant chunks
4. Retrieved chunks are added to the LLM prompt

This enriched prompt gives the model accurate context → better responses.

Analogy: Google Search

 Google crawls → processes → indexes → retrieves → ranks


 RAG retrieval works similarly, but with semantic vectors.

Step 4: Build the LLM Application


 Combine:
o Prompt augmentation (query + retrieved text)
o LLM response generation
 Wrap it in a REST API or endpoint
 Use it in applications like:
o Chatbots
o Q&A systems
o Internal knowledge assistants

With enriched context, applications give precise, relevant, updated answers.


5. Summary of the RAG Architecture
RAG follows this pipeline:

1. Data Preparation
o Collect documents → clean → preprocess → chunk
2. Embedding + Indexing
o Convert chunks to embeddings
o Store in vector database
3. Retrieval at Inference
o Matching chunks retrieved
o Added to prompt
4. Augmented Generation
o LLM uses enhanced context
o Produces accurate, domain-specific responses
5. Deployment
o Packaged into an endpoint for easy integration into apps
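The pipeline above can be sketched end-to-end in a few lines. Everything here is an illustrative stand-in: the bag-of-words `embed()` replaces a real embedding model, the plain list replaces a vector database, and the returned prompt would be sent to an LLM:

```python
def embed(text):
    # Stand-in for a real embedding model: a lowercase bag-of-words set.
    return set(text.lower().split())

def similarity(a, b):
    # Jaccard overlap between two bag-of-words "embeddings".
    return len(a & b) / len(a | b) if a | b else 0.0

# Steps 1-2: prepare and "index" chunks (a real system stores dense vectors
# in a vector database; a plain list is enough to show the flow).
chunks = [
    "Refunds are processed within 5 business days.",
    "Support is available Monday to Friday, 9am-5pm.",
    "Premium plans include priority support.",
]
index = [(embed(c), c) for c in chunks]

def retrieve(query, top_n=2):
    # Step 3: embed the query and fetch the most similar chunks.
    q = embed(query)
    ranked = sorted(index, key=lambda item: similarity(q, item[0]), reverse=True)
    return [text for _, text in ranked[:top_n]]

def build_prompt(query):
    # Step 4: augment the prompt with retrieved context before the LLM call.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt("How long do refunds take?")
```

Swapping `embed()` for a real embedding model and the list for a vector database turns this sketch into the production architecture described above, without changing the shape of the flow.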

RAG ensures responses are:

 Accurate
 Updated
 Domain-specific
 Grounded in actual data, not hallucination


Module 6 – RAG: Vector Databases


Generative AI – Prof. Naveen Kumar Bhansali

1. What Are Vector Databases?


Vector databases are specialized systems designed to:
 Store
 Manage
 Search

data represented as vectors (lists of numbers).


These vectors can represent:

 Text
 Images
 Audio
 Any high-dimensional data

Vector representations allow operations such as:

 Similarity search
 Clustering
 Nearest neighbor search

The main goal:


👉 Enable extremely fast and efficient similarity search in high-dimensional spaces.

2. Why Do We Need Vector Indexing?


Without specialized indexing:

 Searching through millions of vectors becomes computationally expensive.

Common indexing techniques used:

1. Approximate Nearest Neighbors (ANN)

Used for fast retrieval of approximate but highly relevant nearest neighbors.

2. HNSW (Hierarchical Navigable Small World Graphs)

 Graph-based indexing
 Supports fast and accurate similarity search

3. FAISS (Facebook AI Similarity Search)

 Developed by Meta
 Highly optimized library for vector search and clustering
 Supports GPU acceleration
3. Vector Search Index vs. Vector Database
Vector Search Index

 A component used to speed up similarity search


 Typically sits inside search engines or recommendation systems
 Focuses only on indexing and fast retrieval

Vector Database

A full-fledged database that includes:

 Persistent storage
 Indexing
 Querying
 Security
 Scalability
 Consistency
 Integration with other systems

Goal: Manage vectorized data end-to-end, not just search.

4. Key Components of a Vector Database

A. Data Storage
Includes:

1. Vector Storage

 Efficiently stores large volumes of high-dimensional vectors


 Uses optimized formats and compressed structures

2. Metadata Storage

Stores additional information like:

 IDs
 Timestamps
 Labels
 Categories

Metadata enables:
 Filtering
 Complex queries
 Hybrid searches (vector + metadata)

B. Indexing
1. Vector Indexing Techniques

 HNSW: Graph-based structure for fast nearest neighbor search


 IVF (Inverted File Index):
o Divides vector space into clusters
o Searches only relevant clusters
 Product Quantization:
o Compresses vectors
o Speeds up distance calculations

2. Dynamic Indexing

 Allows adding/removing vectors


 Does not require rebuilding the entire index
 Important for real-time applications

C. Query Processing
1. Similarity Search

Uses distance metrics:

 Cosine similarity
 Euclidean distance
 Dot product

Goal: find vectors closest to the query vector.
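A brute-force cosine-similarity search is only a few lines — this is exactly the full scan that index structures like HNSW and IVF exist to avoid. A minimal sketch with made-up 2-d vectors:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest(query, vectors, top_n=2):
    # Brute-force similarity search: score every stored vector, keep the best.
    scored = sorted(enumerate(vectors),
                    key=lambda iv: cosine_similarity(query, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:top_n]]

# Tiny toy "database" of 2-d vectors; real embeddings have hundreds of dims.
store = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ids = nearest([1.0, 0.05], store)
```

This runs in O(n) per query; ANN indexes trade a little accuracy for sub-linear search over millions of vectors.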

2. Filtering & Re-ranking

Supports:

 Metadata-based filtering
 Re-ranking results using similarity scores or other criteria

3. Batch Queries

 Multiple vectors queried at once → increases efficiency


D. Scaling and Distribution
1. Sharding

 Data split across multiple servers


 Enables handling very large datasets

2. Load Balancing

 Distributes queries across nodes


 Reduces response time

3. Replication

 Copies data across nodes


 Ensures:
o High availability
o Fault tolerance

E. Integration with ML Pipelines


1. Data Ingestion

 ML models generate embeddings


 Database ingests vectors + metadata automatically

2. Model Updates

 When ML models change, embeddings may change


 Vector database must update stored vectors and indices

3. Real-time Inference

Used in real-time applications such as:

 Recommendation systems
 Personalization engines
 Fraud detection
 Semantic search

New data can be ingested instantly, and results retrieved immediately.


5. Examples of Vector Databases
Common vector database systems include:

 Pinecone
 Milvus
 ChromaDB
 Weaviate

These provide:

 Storage
 Indexing
 APIs
 Scalability
 Integration with AI tools

6. Use Cases of Vector Databases


1. Recommendation Systems

 Suggest similar products


 Identify user preference patterns
 Used in e-commerce, music, movies, etc.

2. Image & Video Search

 Retrieve visually similar images/videos using embeddings

3. Natural Language Processing

 Semantic similarity between documents/sentences


 Used in chatbots and Q&A systems

4. Anomaly Detection

 Identify vectors that are significantly different from the norm


 Useful for fraud detection and cybersecurity

7. Summary
A vector search index improves search speed.
A vector database provides:

 Storage
 Indexing
 Query processing
 Distribution
 ML integration

Vector databases have become essential in AI and machine learning due to the rise of:

 Embeddings
 High-dimensional data
 Real-time applications
 Semantic search

They are a critical component in RAG architecture, enabling fast retrieval of relevant
knowledge chunks.


📘 Module 6 — LangChain (Detailed Notes)


Generative AI – Prof. Naveen Kumar Bhansali

1. What is LangChain?
LangChain is a framework that simplifies building applications using Large Language
Models (LLMs) such as GPT.
It provides tools, abstractions, and integrations to help developers build context-aware,
data-driven, and multi-step LLM applications.

LangChain makes it easier to build systems that require:


 advanced prompting
 memory
 sequential reasoning
 interactions with external tools
 integration with other systems (APIs, databases)

2. Key Features of LangChain


a) Prompt Templates

 Allows creation of reusable, structured prompts.


 Ensures consistency in how prompts are written for different tasks.
 Useful for standardizing complex prompt patterns.

b) Memory

 LangChain supports memory management, allowing LLMs to retain context across:


o multiple interactions
o sessions
o turns in a conversation
 Essential for chatbots and multi-step applications.

c) Agents

 Agents use LLMs to reason, decide, and act.


 Works in a loop:
1. Model interprets the situation
2. Decides a next action
3. Executes relevant tool
4. Produces an output
 Enables dynamic behavior instead of fixed sequences.

d) Tools

 LangChain can integrate with external:


o APIs
o Databases
o Calculators
o Search engines
 Allows LLMs to fetch data, perform operations, or interact with the environment.
3. How LangChain Works
Step 1: Building Blocks

LangChain provides fundamental components such as:

 Prompt Templates
 Memory classes
 Chains (multi-step sequences)
Developers use these blocks to build complex applications.

Step 2: Combining Components

 Components are combined to form chains.


 Chains represent a sequence of operations the LLM executes.
 Example:
1. Take user input
2. Process with a prompt template
3. Query an API
4. Summarize using the LLM

Step 3: Executing Chains

 When a user input arrives:


o The chain runs step-by-step
o Uses memory when needed
o Uses tools or APIs
o Produces structured output

Step 4: Interaction Loop

 Used in applications requiring multiple steps or continuous communication.


 Example: chatbots.
 LangChain handles:
o context retention
o conversation flow
o multi-step reasoning and execution
 Ensures smooth, coherent multi-turn interactions.
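The four steps above reduce to a simple idea: a chain is an ordered pipeline of callables, each feeding its output to the next. A minimal sketch — `make_chain` and the step functions are illustrative names, not LangChain's actual API:

```python
def make_chain(*steps):
    # A chain applies each step in order, feeding outputs forward.
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run

# Illustrative steps: a prompt template, a placeholder "LLM", a post-processor.
def fill_template(user_input):
    return f"Summarize in one line: {user_input}"

def fake_llm(prompt):
    # Stand-in for a real model call.
    return prompt.replace("Summarize in one line: ", "Summary: ")

def postprocess(text):
    return text.strip()

chain = make_chain(fill_template, fake_llm, postprocess)
result = chain("LangChain composes LLM steps.")
```

Memory and tools slot into the same pattern as extra steps that read or write shared state before and after the model call.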
4. Why LangChain is Powerful
LangChain provides:

 abstraction over complex LLM logic


 structured pipelines for reasoning
 integration with external systems
 persistent and flexible memory
 easier development of advanced AI applications
 dynamic agent-based decision-making

5. Summary
LangChain is a framework that:
✔ simplifies building LLM-based applications
✔ provides prompt templates, memory, agents, tools
✔ supports multi-step workflows using chains
✔ manages interaction loops in conversational or task-based systems
✔ integrates external APIs, databases, and utilities
✔ enables sophisticated AI applications with minimal overhead


📘 Module 6 — LangChain: Chunking Strategy (Detailed Notes)

Generative AI – Prof. Naveen Kumar Bhansali

1. What is Chunking?
Chunking refers to the process of splitting large text into smaller, manageable pieces called
chunks.

Why it is needed:

 LLMs have token/character limits.


 Large documents cannot be processed as a whole.
 Well-designed chunks ensure the model retains context and understands the content.

Goal:
✔ break text into pieces that are small enough for processing
✔ but large and meaningful enough to preserve context

2. Key Parameters in Chunking


a) Chunk Size

 Maximum number of characters or tokens allowed per chunk.


 Example: If chunk size = 100 characters → every chunk ≤ 100 characters.

Determines the length of each chunk.

b) Chunk Overlap

 Number of characters/tokens repeated between consecutive chunks.


 Ensures continuity and avoids losing context at boundaries.

Example:
If overlap = 20 characters:

 Last 20 characters of chunk 1


→ repeated at the beginning of chunk 2.

This avoids cutting sentences in unnatural places.
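Chunk size and overlap can be demonstrated with a minimal fixed-size splitter — a sketch of the two parameters, not LangChain's splitter:

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into chunks of at most `chunk_size` characters, where
    consecutive chunks share `overlap` characters at their boundary."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap          # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break                        # last chunk reached the end
    return chunks

parts = chunk_text("The quick brown fox jumps over the lazy dog", 20, 5)
```

With size 20 and overlap 5, the last 5 characters of each chunk reappear at the start of the next, so no boundary context is lost.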

3. Character Text Splitter


A method to split text based on a specific character or separator.

Example separators:
 \n (newline) → splits by lines or paragraphs
 (space) → splits by words
 custom characters (comma, period, symbols, etc.)

Example text:

 Line 1: The quick brown fox


 Line 2: Jumps over the lazy dog
 Line 3: And runs away swiftly

If we use \n as separator → the text splits into 3 chunks (one per line).

4. Recursive Character Text Splitting (RCT)
A hierarchical chunking method that splits text step-by-step using multiple separators
arranged by importance.

Separators usually used in this order:

1. \n\n → paragraph breaks


2. \n → line breaks
3. (space) → words
4. characters → smallest units

Process

1. Split by double newline \n\n


o Each paragraph becomes a chunk.
2. If a paragraph-chunk is still too large → split using \n (single newline)
o Divides paragraphs into individual lines.
3. If still too large → split using spaces
o Breaks down lines into words or short phrases.
4. If still too large → split into characters
o Last resort to ensure all chunks fit within size limits.

5. Why Recursive Splitting Works


✔ Maintains maximum context by using largest meaningful separators first
✔ Avoids breaking important sentences or phrases prematurely
✔ Only uses smaller separators when absolutely necessary
✔ Produces chunks that are:
 coherent
 contextual
 within size constraints
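The recursive procedure can be sketched in plain Python: try the largest separator first, recurse with smaller ones only for oversized pieces, then greedily merge neighbours back up to the size limit. This is a simplified sketch of the idea; LangChain's RecursiveCharacterTextSplitter additionally supports overlap and custom length functions:

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ", "")):
    """Recursively split `text` so every chunk is <= max_len characters,
    trying the largest separator first and falling back to smaller ones."""
    if len(text) <= max_len:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard character split.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    pieces = []
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) <= max_len:
            pieces.append(piece)
        else:
            # Piece still too large: recurse with the next-smaller separator.
            pieces.extend(recursive_split(piece, max_len, rest))
    # Greedily merge neighbouring pieces back together while they still fit.
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph here.\n\nSecond paragraph is quite a bit longer than the limit."
chunks = recursive_split(doc, 30)
```

The short first paragraph survives intact, while the long one is broken only as far down the separator hierarchy as necessary.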

6. Summary
Chunking Strategy in LangChain ensures that large text is divided into context-preserving,
size-compatible, and LLM-friendly chunks.
It uses techniques like:

 chunk size
 chunk overlap
 character text splitting
 recursive character text splitting

This guarantees efficient and meaningful text processing in RAG and LLM applications.


📘 Module 6 — LangChain: Memory & Retrieval Strategy (Detailed Notes)

Generative AI – Prof. Naveen Kumar Bhansali

1. What is Memory in LangChain?


Memory in LangChain allows applications—especially conversational agents—to retain
context across multiple interactions.

Why it matters:
 Conversational agents need to remember past queries, decisions, and responses
 Produces coherent, context-aware output
 Enables multi-turn interactions

LangChain memory supports two core actions:

a) Reading (Retrieving Information)

 Retrieves relevant past interaction data.

b) Writing (Storing Information)

 Stores new information for future interactions.

Both actions happen within the chain execution pipeline, ensuring every new input is
influenced by past context.

2. Types of Memory in LangChain

2.1 Conversational Buffer Memory


 Stores entire conversation history.
 Includes all user inputs + system responses.
 History is stored in a variable accessible during processing.

Use cases:

✔ Customer support systems


✔ Long-running conversations requiring full historical context

2.2 Conversation Buffer Window Memory


 Stores only the last k interactions.
 Works like a sliding window:
o New messages added
o Oldest ones removed
 Still stores both user and system messages.

Use cases:
✔ Casual chatbots
✔ When only recent context matters
✔ Memory-efficient long sessions
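The sliding-window behaviour maps naturally onto a fixed-length deque. A minimal sketch of the idea — not LangChain's actual class:

```python
from collections import deque

class BufferWindowMemory:
    """Sliding-window conversation memory: keeps only the last k exchanges."""

    def __init__(self, k):
        self.turns = deque(maxlen=k)   # oldest exchanges fall out automatically

    def save(self, user_msg, ai_msg):
        # Write: store a completed user/AI exchange.
        self.turns.append((user_msg, ai_msg))

    def load(self):
        # Read: render the remembered turns as prompt context.
        return "\n".join(f"User: {u}\nAI: {a}" for u, a in self.turns)

memory = BufferWindowMemory(k=2)
memory.save("Hi", "Hello!")
memory.save("What is RAG?", "Retrieval Augmented Generation.")
memory.save("And top-p?", "Nucleus sampling.")   # pushes out the "Hi" turn
context = memory.load()
```

With `k = 2`, the third `save` silently evicts the oldest exchange, so `load()` returns only the two most recent turns.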

2.3 Conversation Token Buffer Memory


 Controls memory based on token count, not number of messages.
 Stores messages until a token limit is exceeded.
 Once exceeded → oldest messages discarded.

Use cases:

✔ When LLM has strict token limits


✔ Long inputs with varying token lengths
✔ Fine-grained control over memory usage

2.4 Conversation Summary Memory


 Creates and maintains a summary instead of storing each message.
 After every interaction, the summary is updated to reflect new information.

How it works:

1. System generates summary based on past interactions


2. Updates summary after each turn
3. Stores only essential information

Use cases:

✔ Very long conversations


✔ Applications requiring high-level continuity
✔ Decision-based workflows or narrative systems

3. Introduction to RAG (Retrieval Augmented Generation)

RAG combines:

 Retrieval-based methods → fetch relevant information


 Generation-based models → produce final output
Retrieval strategy determines what information is fed to the LLM.

4. Retrieval Strategy in RAG


Step 1: Query Formation

 Transform user input into a search query.

Step 2: Document Retrieval

 Fetch relevant documents from a pre-indexed corpus.


 Methods:
o Similarity Search
o Maximal Marginal Relevance (MMR)

Step 3: Document Selection

 Choose most relevant or diverse documents based on:


o relevance
o diversity
o ranking scores

Step 4: Augmentation

 Provide selected documents as context to the LLM.


 LLM generates the final enhanced response.

5. Similarity Search
Technique where retrieved documents are those most similar to the query.

 Uses similarity metrics such as cosine similarity


 Compares embeddings of query vs. documents

Example:

Query: “What are the benefits of renewable energy?”


Retrieved:

 Document on solar energy benefits


 Document on wind power
 Document on environmental impact
Result:

 Highly relevant documents (but may be redundant)

6. Maximal Marginal Relevance (MMR)


Balances relevance + diversity.

Goal:

 Avoid redundancy
 Retrieve documents that cover different aspects of the query

Example:

Query: “What are the benefits of renewable energy?”

MMR retrieves:

 One document on environmental benefits


 One document on economic advantages
 One on sustainability impact

Result:

 Broader, more informative context


 Less repetition
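The MMR rule — score each candidate by λ·relevance minus (1−λ)·redundancy against the documents already picked — can be sketched directly. The λ value and toy similarity scores below are made up for illustration:

```python
def mmr_select(query_sim, doc_sims, n, lam=0.7):
    """Maximal Marginal Relevance: iteratively pick the document maximizing
    lam * relevance(query, doc) - (1 - lam) * max_similarity(doc, selected).

    query_sim: relevance score of each document to the query
    doc_sims:  doc_sims[i][j] = similarity between documents i and j
    """
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < n:
        def mmr_score(i):
            # Redundancy = similarity to the closest already-selected doc.
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates (e.g. both on environmental benefits);
# doc 2 covers economics. MMR skips the duplicate in favour of diversity.
query_sim = [0.90, 0.88, 0.80]
doc_sims = [[1.00, 0.95, 0.20],
            [0.95, 1.00, 0.25],
            [0.20, 0.25, 1.00]]
picked = mmr_select(query_sim, doc_sims, n=2)
```

Plain similarity search would return documents 0 and 1; MMR picks 0 and 2, trading a little relevance for broader coverage.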

7. Summary Comparison
| Feature | Similarity Search | MMR (Maximal Marginal Relevance) |
|---|---|---|
| Focus | Relevance | Relevance + Diversity |
| Output | May be repetitive | Covers multiple perspectives |
| Best for | Focused answers | Broad coverage, diverse insights |

8. Next Step
Now that chunking, memory types, and retrieval methods are understood, you can integrate
them to build:
✔ A question-answering system
✔ A full chatbot using LangChain + RAG


📘 Module 7 — Instruction-Tuned Models

1. Background: Large Language Models (LLMs)

Examples: GPT-3, GPT-4, etc.


Trained on massive datasets containing diverse internet text.

LLM Training Objective

 Predict the next word in a sentence.


 This enables the model to learn:
o Grammar
o World knowledge
o Reasoning patterns
o Semantic relationships

Strengths of LLMs

 Generate human-like text


 Answer questions
 Complete prompts
 Handle open-ended tasks

Limitations of LLMs

LLMs may struggle when:


 Instructions are complex
 Tasks are specific
 Prompts require precision
 Instructions need step-by-step execution

Reason:
LLMs are primarily trained using unsupervised learning, without explicit guidance for
specific tasks.

2. What Are Instruction-Tuned Models?


Instruction-tuned models are LLMs fine-tuned on datasets containing instruction–response
pairs.

Why Instruction Tuning?

It helps the model:

 Follow user instructions more accurately


 Understand user intent better
 Give more task-oriented, relevant outputs
 Reduce hallucinations
 Respond more consistently

How Instruction Tuning Works

After the LLM is pre-trained:

1. A supervised fine-tuning dataset is created.


Each entry has:
o Instruction
o Correct response
2. The model is trained to produce the right output explicitly for the given instruction.
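As a sketch, one instruction–response pair might be flattened into a single training string like this; the `### Instruction` / `### Response` delimiters are an illustrative convention used by some open instruction datasets, not a fixed standard:

```python
def format_example(instruction, response):
    # Join one instruction-response pair into a single supervised training string.
    return (
        "### Instruction:\n" + instruction + "\n\n"
        "### Response:\n" + response
    )

sample = format_example(
    "Translate to French: Good morning.",
    "Bonjour.",
)
```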

Where Are Instruction-Tuned Models Useful?

Tasks requiring precise instruction following, such as:

 Summarization
 Translation
 Code generation
 Question answering
 Data extraction
3. LLM vs Instruction-Tuned Model — Difference

| LLMs | Instruction-Tuned Models |
|---|---|
| General-purpose | Task-oriented |
| Good for open-ended generation | Good for precise instructions |
| Unsupervised training | Supervised fine-tuning |
| May misinterpret vague instructions | Understands instructions better |
| Needs careful prompting | Works reliably even with simple instructions |

4. Instruction Tuning Research: Key Paper


"Finetuned Language Models Are Zero-Shot Learners"

Important points from the paper:

 Instruction tuning greatly improves zero-shot performance on unseen tasks.


 Instead of building datasets from scratch, the authors:
o Took 62 public NLP datasets (from TensorFlow Datasets)
o Converted them into instruction format
 For each dataset:
o 10 different templates were manually created
o Templates use natural language to explain the task
o Up to 3 templates reversed the task
Example: instead of classifying sentiment → generate a review

Example: Sentiment Classification Template

 Instead of: "Label this review as positive/negative"


 They also added:
“Write a movie review with negative sentiment.”

This increases robustness and generalization during training.

5. Why Instruction Tuning Requires Less Compute


 Pre-training LLMs = extremely expensive
 Instruction tuning = done on small, supervised datasets
 This phase is called the “metaphorical dessert” compared to the heavy pre-training
“main course.”

6. Examples of Instruction-Tuned Models


✔ FLAN-T5

 Based on Google’s T5 model


 Further fine-tuned using FLAN instruction datasets
 Follows user instructions far more reliably

✔ PaLM → FLAN-PaLM

 PaLM is the base LLM


 FLAN-PaLM is the instruction-tuned version
 Better at structured tasks, prompting, chain-of-thought

✔ BLOOM → BLOOMZ

 BLOOM = multilingual foundational model


 BLOOMZ = instruction-tuned for multilingual tasks
 Strong at following instructions in many languages

7. Module Summary (Perfect for Exams)


 LLMs are powerful but not always instruction-following by default.
 Instruction tuning improves their ability to understand and execute explicit
instructions.
 This is done by fine-tuning models on instruction–response datasets.
 The tuning process is supervised and cheaper than pre-training.
 Instruction-tuned models significantly improve zero-shot performance.
 Popular examples: FLAN-T5, FLAN-PaLM, BLOOMZ.
 Instruction tuning increases consistency, task accuracy, and user-friendliness.



📘 Module 7 — Instruction Tuned Models


Full Fine-Tuning (Detailed Notes)

1. What is Full Fine-Tuning?


 Full fine-tuning means updating all parameters of a pre-trained large language
model (LLM) using a new, smaller, task-specific dataset.
 Purpose: Adapt the entire model so it performs extremely well on a specific task.

2. Process of Full Fine-Tuning


a) Pre-training (Initial step)

 Model is trained on a massive, diverse dataset.


 Learns:
o Grammar
o World knowledge
o Reasoning
o General language patterns

b) Full Fine-tuning (Second step)

 The entire model (all weights + all layers) is trained further on a target dataset.
 This changes the full parameter set to adapt fully to the new domain/task.

3. Advantages of Full Fine-Tuning


✔ Highly Specialized Model

 Achieves state-of-the-art performance on the specific task.


 Captures task-specific nuances extremely well.

✔ Maximum Adaptability

 Since all weights update, model fully absorbs knowledge from the new dataset.

✔ Works best when task requires deep domain understanding

 e.g., legal document classification


 e.g., medical text summarization
4. Disadvantages of Full Fine-Tuning
❌ Computationally Expensive

 Requires:
o High-end GPUs/TPUs
o Large memory
o Significant time
 Costly in both money and energy consumption (environmental impact).

❌ Low Scalability

 For every new task:


o A separate full fine-tuned model must be created.
 Increases complexity in:
o Deployment
o Updating
o Storage
o Maintenance

❌ Lack of Flexibility

 A model fine-tuned for one task may lose performance on general tasks.

5. Catastrophic Forgetting
What is it?

 When fine-tuned fully on a narrow dataset, the model:


o Forgets general knowledge learned during pre-training.
o Performance on original tasks drops significantly.

Example

 Pre-trained general LLM → fully fine-tuned on medical corpus


→ becomes excellent at medical tasks
→ BUT poor at general language tasks (e.g., storytelling, casual chat).

Why catastrophic forgetting happens

 Full fine-tuning overwrites weights that were important for general tasks.

6. Why Full Fine-Tuning Is a Problem for Large LLMs


 Large LLMs are expected to be multi-task, but:
o Full fine-tuning makes them task-specific.
o They may lose broad capability.

7. Solution to These Problems


⭐ Parameter Efficient Fine Tuning (PEFT)

 Instead of updating all parameters, PEFT updates only a small number of additional
or selected parameters.
 Prevents catastrophic forgetting.
 Reduces compute cost.

(You will likely study: LoRA, LoRA+, Prefix Tuning, P-Tuning, Adapter Layers, etc.)

Summary Table

| Topic | Key Points |
|---|---|
| Full Fine-Tuning | Update all parameters of the LLM |
| Pros | Highest performance for specific tasks |
| Cons | Expensive, time-consuming, risk of catastrophic forgetting |
| Issue | Not scalable across multiple tasks |
| Solution | Use PEFT methods |



📘 Module 7 — Parameter-Efficient Fine-


Tuning (PEFT)
(Complete Detailed Notes)
1. What is PEFT?
 Parameter-Efficient Fine-Tuning (PEFT) refers to a set of techniques that allow
fine-tuning a large language model by updating only a very small subset of its
parameters.
 Goal:
o Adapt the model to new tasks
o While keeping most parameters frozen
o Thus reducing compute, memory, and storage cost

2. Why PEFT Is Needed (Motivation)


Traditional full fine-tuning updates all the model parameters and faces problems like:

 High computational cost


 High memory usage
 Need to store separate model copies for each task
 Risk of catastrophic forgetting

PEFT solves these issues by updating < 1–5% of parameters.

3. Adapters — The Core PEFT Technique


Adapters are one of the most practical and widely-used PEFT techniques.

What are adapters?

 Small neural modules added inside a pre-trained model (usually inside each
Transformer layer).
 When fine-tuning:
o Only the adapter parameters are updated
o All original model parameters remain frozen

4. How Adapters Work (Functioning)


1. Start with a pre-trained model (e.g., BERT, GPT, T5).
2. Insert small adapter layers at certain points in each Transformer layer.
3. During fine-tuning:
o The input flows through the main (frozen) model as usual.
o Wherever an adapter is placed, the data additionally goes through the adapter
layer.
o Only the adapter’s weights are trained.
4. The pre-trained model remains unchanged → prevents catastrophic forgetting.
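The data flow above can be sketched as a bottleneck adapter in NumPy; the layer sizes, the ReLU nonlinearity, and the zero-initialized up-projection (which makes the adapter an identity function at the start of training) are common design choices, not details from the lecture:

```python
import numpy as np

def adapter_forward(h, W_down, W_up):
    # Bottleneck adapter: down-project, nonlinearity, up-project, residual add.
    # h is the frozen sub-layer's output; only W_down and W_up are trained.
    z = np.maximum(0.0, h @ W_down)   # (batch, hidden) -> (batch, bottleneck)
    return h + z @ W_up               # residual connection back to (batch, hidden)

hidden, bottleneck = 8, 2
rng = np.random.default_rng(0)
h = rng.standard_normal((4, hidden))
W_down = rng.standard_normal((hidden, bottleneck)) * 0.1
W_up = np.zeros((bottleneck, hidden))   # zero init: adapter starts as identity
out = adapter_forward(h, W_down, W_up)
```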

5. Why Adapters Are Efficient


✔ Only a small number of parameters are updated

 Training is:
o Faster
o Cheaper
o Requires far less memory

✔ Multiple tasks can be supported

 For each new task:


o Add a new adapter module
o Train only that adapter
 No need to duplicate the entire model → huge storage savings.

✔ Easy task switching

 Just load the appropriate adapter for the task.


 The base model stays the same for all tasks.

6. Advantages of Using Adapters

| Advantage | Explanation |
|---|---|
| Low compute cost | Only small modules are trained |
| Low memory usage | Base LLM remains frozen |
| Storage efficient | You store only tiny adapter weights, not full models |
| Avoids catastrophic forgetting | Pre-trained weights stay intact |
| Supports multi-task systems | Swap adapters for different tasks |
| Highly scalable | Ideal for organizations needing many task-specific models |

7. Summary (One-Liner Revision)


PEFT = Fine-tuning only small, additional modules (like adapters) instead of the whole
model → cheaper, faster, and avoids catastrophic forgetting while enabling multi-task
support.


📘 Parameter-Efficient Fine-Tuning (PEFT)


(Complete Notes – Module 7, Prof. Naveen Kumar Bhansali)

1. What is PEFT?
 Parameter-Efficient Fine-Tuning (PEFT) refers to techniques that allow fine-tuning
a large language model by making minimal changes to its parameters.
 Instead of updating the full model, PEFT updates only a small subset of parameters.
 Purpose:
o Adapt the model to new tasks
o While using very little compute and very little storage

2. Why PEFT?
 Large pre-trained LLMs have billions of parameters.
 Fully updating them for every task is:
o Expensive
o Slow
o Memory-heavy
o Hard to store and deploy
 PEFT aims to overcome this by modifying only a small number of parameters.

3. Adapters — A Key PEFT Technique


 Adapters are small additional layers added inside a pre-trained model.
 They are a practical and widely used PEFT method.
Purpose of adapters

 Enable task-specific adaptation


 Make fine-tuning:
o More computationally efficient
o More storage efficient
 Allow the model to generalize across tasks while keeping the main model intact

4. Where Are Adapters Added?


 They are inserted within each layer of a Transformer model, typically after the
feed-forward or attention blocks.
 The adapter modules are small compared to the full layer.

5. How Adapters Work During Fine-Tuning


1. The pre-trained model remains frozen.
None of its original parameters change.
2. Only adapter parameters are trained.
These are tiny compared to the full model.
3. Data flow:
o Input goes through the regular pre-trained layers.
o At locations where adapters are inserted:
 The data also passes through the adapter module.
 The adapter learns the task-specific transformation.
4. This allows the model to adapt to the new task without changing the main pre-
trained weights.

6. Efficiency Benefits of Adapters


 Only a very small number of parameters are updated →
Less computational power is needed.
 Lower memory usage during training.
 Much cheaper than full fine-tuning.
 Original model stays intact → helps avoid catastrophic forgetting.

7. Multi-Task Support Using Adapters


 Each task can have its own adapter module.
 You only store the adapters, not separate full models.
 Switching between tasks is easy:
o Just swap the adapter for the corresponding task.
 Enables efficient multi-task systems using one base model.
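To make the storage argument concrete, here is a rough parameter count for one hypothetical bottleneck adapter (the sizes are illustrative, not from the lecture):

```python
def adapter_param_count(hidden, bottleneck):
    # Two projection matrices plus their biases per adapter module.
    down = hidden * bottleneck + bottleneck
    up = bottleneck * hidden + hidden
    return down + up

hidden = 768                      # hypothetical transformer hidden size
full_layer = hidden * hidden      # one dense hidden-to-hidden weight matrix
per_adapter = adapter_param_count(hidden, 64)
# per_adapter is a small fraction of even a single dense layer, so storing
# one adapter per task is far cheaper than one fine-tuned model per task.
```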

8. Summary (One Sentence)


PEFT allows efficient fine-tuning by adding small adapter modules and training only
those, while keeping the main model frozen—making it cheaper, faster, and easier to
handle multiple tasks.



📘 PEFT – LoRA (Low-Rank Adaptation)


Module 7 – Prof. Naveen Kumar Bhansali

1. What is LoRA?
 LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique for
large language models.
 It works by updating only low-rank matrices instead of updating the full large
weight matrices.
 Main goals:
o Reduce number of learnable parameters
o Make fine-tuning efficient in computation and memory
o Maintain or even improve task performance

2. Why LoRA?
 Transformer models contain very large weight matrices, especially in:
o Self-attention: Query, Key, Value (Q, K, V) matrices
o Feed-forward layers
 Full fine-tuning updates billions of parameters → expensive and slow.
 LoRA solves this by updating only a small low-rank decomposition, not the entire
matrix.

3. Weight Matrices in Transformers


 A typical transformer layer has weight matrices denoted as W.
 These matrices often have dimensions:
D × D, where D = hidden size.

4. The Key Idea of LoRA: Low-Rank Decomposition


Instead of updating the full matrix W, LoRA decomposes the update into two much smaller
matrices:

Matrix A

 Shape: r × D
 Learned during fine-tuning

Matrix B

 Shape: D × r
 Also learned during fine-tuning

Where r is the rank and r << D (very small); with these shapes, B × A has the same D × D shape as W.

5. Re-parameterization
The original weight matrix is not modified.

We compute an updated matrix:

W′ = W + ΔW

Where
ΔW = B × A
 W = original pretrained weight matrix → kept frozen
 A and B = newly learned low-rank parameters

ΔW has the same dimension as W (D × D), even though it is derived from small matrices.

This addition of B×A is called re-parameterization.
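A minimal NumPy sketch of this re-parameterization (toy sizes; following the LoRA paper's convention where A is r × D and B is D × r, so that B·A is D × D, with B initialized to zero so training starts from ΔW = 0):

```python
import numpy as np

D, r = 64, 4
rng = np.random.default_rng(0)

W = rng.standard_normal((D, D))          # frozen pretrained weight matrix
A = rng.standard_normal((r, D)) * 0.01   # low-rank factor, shape r x D
B = np.zeros((D, r))                     # low-rank factor, shape D x r
                                         # (zero init: delta_W = 0 at start)

delta_W = B @ A          # D x D update built from two small matrices
W_prime = W + delta_W    # re-parameterized weight used at inference

trainable = A.size + B.size   # 2*r*D values, far fewer than D*D in W
```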

6. Where LoRA Is Applied


LoRA can be added to any transformer weight matrix, but primarily used in:

 Q (Query) projections
 K (Key) projections
 V (Value) projections
 Feed-forward layers

These matrices are large → ideal for low-rank adaptation.

7. After Fine-Tuning
 The final matrix used by the model is:
W′ = W + B×A
 This merged matrix replaces W during inference.
 Architecture remains unchanged.

8. Why It’s Still Efficient


Even though we introduce new A and B matrices:

 Only A and B are trained


 W remains frozen
 W′ has same dimension as W → model expressiveness is preserved.

9. Advantages of LoRA
✔ 1. Huge Reduction in Parameters

Only the low-rank A and B matrices are learned → drastically fewer parameters.
✔ 2. Much Lower Compute Requirements

Smaller gradients
Less GPU memory
Faster fine-tuning

✔ 3. No Changes to the Original Model

Base model remains intact and frozen.

✔ 4. Multi-Task Capability

 For each task, only store different A and B matrices.


 Same base model can support multiple tasks by swapping LoRA modules.

✔ 5. Maintains Model Expressiveness

Because we add ΔW rather than replace W, the full capacity of the pretrained model is
retained.

10. One-Line Summary


LoRA fine-tunes transformer models by learning small low-rank matrices (A and B)
and adding them to frozen pretrained weights, achieving highly efficient fine-tuning
with minimal computational overhead.



📘 Notes: Word Embeddings — Dense


Representations & Latent Factors
Why Dense Embeddings?

Earlier methods like one-hot encoding were:

 Sparse (mostly zeros)


 High-dimensional
 Did not capture context or semantic meaning
 Could not express relationships between words

To overcome this, embeddings represent each word as a dense vector of learned numbers.
These numbers encode:

 Context
 Semantic meaning
 Relationships between words

📌 Understanding the Example Embedding


Matrix
The matrix shown has:

 Rows → latent factors (hidden dimensions)


 Columns → word vectors across all factors

The matrix is dense (not dominated by zeros), indicating meaningful learned weights.

Interpreting the rows (latent factors)

Although the factors are not explicitly defined in real models, the example helps build
intuition:

1. Row 1: Vehicle Factor


o Car, Bike, Mercedes-Benz, Harley-Davidson → high values
o Orange, Mango → low values
2. Row 2: Luxury Factor
o Only Mercedes-Benz and Harley-Davidson show high values
o Car/Bike aren’t luxury → low
3. Row 3: Fruit Factor
o Orange, Mango → high
o Vehicles → low
4. Row 4: Company Factor
o Mango and Orange are also company names, so they score here.

Key Idea
We never manually choose these factors — the model discovers them automatically by
training on large text corpora.

📌 Similarity Between Words


Because each word is represented as a vector (column), we can analyze relationships using
distance metrics.

If two words are similar:

 Their vector values across all rows/dimensions will be similar.


 They will appear close in the vector space.

Examples

 Car and Bike → similar values → close in embedding space


 Mercedes-Benz and Harley Davidson → luxury vehicle brands → similar
representation
 Car vs. Orange → very different values → far apart

Analogy Relationships

Using distances (Euclidean, Manhattan, etc.), embeddings can capture analogies like:

Car : Bike = Mercedes-Benz : Harley-Davidson

This is because:

 Car ↔ Bike share the “vehicle type” relation


 Mercedes ↔ Harley share the “luxury” relation
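The analogy can be checked numerically with toy vectors; the four latent-factor values below are invented for illustration:

```python
import math

# Toy 4-dimensional embeddings over hypothetical latent factors:
# [vehicle, luxury, fruit, company]
vecs = {
    "car":      [0.90, 0.20, 0.0, 0.0],
    "bike":     [0.85, 0.10, 0.0, 0.0],
    "mercedes": [0.90, 0.90, 0.0, 0.3],
    "harley":   [0.85, 0.80, 0.0, 0.3],
    "orange":   [0.00, 0.00, 0.9, 0.6],
}

def nearest(target, exclude):
    # Find the stored word whose vector is closest (Euclidean) to `target`.
    return min((w for w in vecs if w not in exclude),
               key=lambda w: math.dist(target, vecs[w]))

# car : bike = mercedes : ?   ->   mercedes - car + bike
query = [m - c + b for m, c, b in zip(vecs["mercedes"], vecs["car"], vecs["bike"])]
answer = nearest(query, exclude={"mercedes", "car", "bike"})
```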

📌 Number of Dimensions (Hyperparameter)


The number of rows = number of embedding dimensions.

Common choices:

 50
 100
 300
 500

More dimensions =
 better at capturing complex relationships
 but increases computation

📌 Latent Factors Are Not Always


Interpretable
In real-world embeddings:

 Each row does not represent a clean concept.


 Many factors overlap.
 Difficult to assign human labels to each dimension.

Still, the model effectively captures:

 Semantic similarity (Car close to Bike, Orange close to Mango)


 Contextual similarity

Plotting in 2D (after dimensionality reduction like PCA or t-SNE):

 Car, Bike, Mercedes-Benz, Harley-Davidson cluster together


 Orange and Mango cluster separately

📌 The Big Question — How Are These


Values Learned?
This leads to the methods of learning embeddings such as:

 Word2Vec (Skip-gram / CBOW)


 GloVe
 FastText
 Transformer-based embeddings

The slide sets up motivation before explaining how embeddings are trained.


📘 Notes: Learning Word Embeddings Using
Neural Language Models (Slide
Explanation)
1. Why Context Matters for Word Embeddings
To learn useful embeddings, the training data must contain:

 Many occurrences of each word


 In many different contexts

This allows embeddings to capture:

 Contextual meaning
 Semantics
 Relationships between words

→ Large corpus is necessary.

But learning a word representation directly from documents in an unsupervised way is


difficult.

So the problem is reframed as a supervised learning task.

2. Converting Embedding Learning into a


Supervised Problem
Idea: Use Neural Networks for Language Modeling

Use previous words ( W_1, W_2, \dots, W_t ) to predict the next word ( W_{t+1} ).

Example:
She is a great tennis player

To predict “player”, use the previous 5 words.

3. How the Neural Network Is Structured


Input Layer

Each input word is:

 Represented using one-hot encoding


 Size = vocabulary size (e.g., 1000)

So “She” = 1000-dimensional one-hot vector

The same applies to all 5 input words.

Hidden Layer

 Number of neurons = embedding size


 Example: 500 neurons → embedding dimension = 500

Output Layer

 1 neuron per vocabulary word → 1000 neurons


 Softmax applied
 Produces probability distribution over vocabulary
 Highest probability → predicted next word

4. Handling Variable Sentence Length


Sentences can be long or short, so the model uses a fixed context window size.

Example:
If window size = 3, model predicts next word using the last 3 words.

For “She is a great tennis player”


→ Use “a great tennis” to predict “player”

The window size is a hyperparameter.
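The fixed-window setup above can be sketched as a small pair-extraction function:

```python
def context_target_pairs(tokens, window=3):
    # Sliding window: the previous `window` words predict the next word.
    pairs = []
    for i in range(window, len(tokens)):
        pairs.append((tokens[i - window:i], tokens[i]))
    return pairs

tokens = "she is a great tennis player".split()
pairs = context_target_pairs(tokens)
# With window=3, the last pair uses "a great tennis" to predict "player".
```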

5. Using Bidirectional Contexts for


Embeddings
Language modeling predicts next word using previous context.

But for learning embeddings, we can use:

 Words before the target word (left context)


 Words after the target word (right context)

Example sentence:
She is a great tennis player and has won many awards.

If context window = 3:

 Left context of “player”: a great tennis


 Right context of “player”: and has won

This is used to learn better embeddings because the target word is in the middle.

This modeling approach is common in embedding training.
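A sketch of extracting both contexts for a target word:

```python
def both_contexts(tokens, i, window=3):
    # Left and right context of the target word tokens[i], capped by the window.
    return tokens[max(0, i - window):i], tokens[i + 1:i + 1 + window]

tokens = "she is a great tennis player and has won many awards".split()
left, right = both_contexts(tokens, tokens.index("player"))
```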

6. Simplest Model: Predict Next Word


Using Only Previous Word
Architecture:

 Input (one-hot) → hidden layer → output layer


 Weights connecting input → hidden or hidden → output become the word
embeddings

These weights represent the core learned vector for each word.

7. Word2Vec
Word2Vec popularized:

 Using neural networks without deep layers


 Using context windows
 Using target prediction tasks to learn embeddings

Word2Vec uses two main architectures:

 CBOW (Continuous Bag of Words) → predict word from context


 Skip-gram → predict context from word

The slide ends by introducing Word2Vec, which will be explained next.



📘 Word2Vec Model – Detailed Notes


(CBOW, Skip-Gram & Negative Sampling)
Word2Vec has two variants:

1. Continuous Bag-of-Words (CBOW)


Architecture

 Input: Context words (surrounding the target word)


 Context includes words to the left and right of the target word.
 Output: The target (middle) word.

Concept

CBOW predicts:

context → target

Example:
Sentence: She is a great tennis player
Predict “player” using the context: a great tennis

2. Skip-Gram
Architecture

 Input: Target (middle) word


 Output: Context words (words to the left and right)
Skip-Gram predicts:

target → context

Why is it called Skip-Gram?

Because not all context words are used.


Some words in the window are skipped randomly.

3. Window Size + "Number of Skips"


Parameter
For Skip-Gram, two hyperparameters are used:

1. Window Size

Example: window size = 3


→ 3 words left + 3 words right
Context = 6 words

2. Number of Skips

Defines how many words to randomly pick from the window.

Example:

 Window size = 3 → context = 6 words


 Number of skips = 2 → pick only 2 random context words

Sentence example:
Context words = a, great, tennis, and, has, won

Possible Skip-Gram training samples:

 (player → has)
 (player → tennis)
or any other random pair.
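A minimal sketch of generating Skip-Gram samples with a window and a number-of-skips parameter:

```python
import random

def skipgram_samples(tokens, i, window=3, num_skips=2, seed=42):
    # Build (target, context) pairs by sampling `num_skips` words
    # from the window around the target word tokens[i].
    rng = random.Random(seed)
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    return [(tokens[i], c) for c in rng.sample(context, num_skips)]

tokens = "she is a great tennis player and has won many awards".split()
pairs = skipgram_samples(tokens, tokens.index("player"))
```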

4. Word Embeddings in Word2Vec


The weights between hidden layer and output layer = word embeddings.
Example size:

 Hidden layer = 500 neurons


 Vocabulary size = 1000 words
→ Weights = 500 × 1000 = 0.5 million parameters

But real vocabularies can be 10,000 or more:

 500 × 10,000 = 5 million weights

5. Computational Problem
To learn embeddings:

 Need huge training corpus


 Need to compute softmax over entire vocabulary for each training step
This becomes:
 Very expensive
 Sometimes infeasible

6. Solution: Negative Sampling (Mikolov et


al.)
Introduced in the paper:
“Distributed Representation of Words and Phrases and their Compositionality”

Key Idea

Update only a small number of weights instead of updating all vocabulary weights.

Example

Target word = player


Context word = tennis

Without negative sampling:

 Output is 1 for tennis


 Output is 0 for all other words
→ All weights would be updated (millions!)

With Negative Sampling


Choose:

 1 positive sample → (player, tennis) labeled as 1


 k negative samples → random words labeled as 0

Negative words:

 Randomly selected from outside the context


 Sampled with probability proportional to their frequency raised to the power 3/4

Example (k = 3)

Positive pair:

 (tennis, player → 1)

Negative pairs:

 (tennis, hello → 0)
 (tennis, piece → 0)
 (tennis, few → 0)

Parameter reduction

Instead of updating all 5 million weights:


→ Update only weights corresponding to
1 positive + k negative words
→ = (k + 1) × embedding_dimension
→ For k = 3, embedding dimension = 500
→ Updates needed = 4 × 500 = 2000 weights

Huge computational savings.
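The frequency-to-the-3/4 sampling distribution can be sketched directly (the word counts are made up):

```python
def negative_sampling_probs(freqs, power=0.75):
    # Word2Vec negative-sampling distribution:
    # each word's frequency is raised to the power 3/4, then normalized.
    weights = {w: f ** power for w, f in freqs.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

freqs = {"the": 1000, "tennis": 50, "harley": 1}   # made-up counts
probs = negative_sampling_probs(freqs)
# The 3/4 power flattens the distribution: rare words are sampled
# more often than their raw frequency share would suggest.
```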

7. Recommended Values for k


 Small datasets: k = 5–20
 Large datasets: k = 2–5

8. Summary
Word2Vec:

 CBOW: context → target


 Skip-Gram: target → context (with skips)
 Uses single-layer neural network
 Negative sampling makes training feasible
 Only a few weights updated per step
 Learns dense vectors with rich semantic properties



📘 Word Embedding Matrix – Detailed


Notes
1. Embedding Matrix Structure
 Word embeddings are stored in a matrix.
 Rows = vocabulary size
(one row per word)
 Columns = embedding dimension
(e.g., 50, 100, 300 values per word)

Example:
Vocabulary size = 10,000
Embedding dimension = 300
→ Embedding matrix size = 10,000 × 300

2. Using the Embedding Matrix


Input Representation (One-Hot Encoding)

 An input word is represented as a 10,000 × 1 one-hot vector.


 All values = 0
 Except one index = 1 → the position of that word in the vocabulary
Example:
Word “a” is vocabulary index 1 → one-hot vector has a 1 in the first position.

Word “great” is index 524 → one-hot vector has a 1 in position 524.

3. How the Embedding Vector is Obtained


Matrix multiplication

Embedding vector = Embedding Matrixᵀ × One-Hot Vector (a 300 × 10,000 matrix times a 10,000 × 1 vector)

Since all values in input are 0 except one:

 Only one row contributes to the output


 All other multiplications give 0

So:

 For “a” → output = 1st row of embedding matrix


 For “great” → output = 524th row

Output size = 300 × 1


(embedding dimension)

4. Why Direct Multiplication is Wasteful


 We multiply the (transposed) large 10,000 × 300 embedding matrix with a 10,000 × 1 one-hot vector.
 But 9,999 of the 10,000 input entries are zero, so nearly every multiplication contributes nothing.
 Completely unnecessary computation.

5. Practical Implementation: Lookup Table


Instead of multiplication, frameworks simply look up the row:

 Input index → fetch corresponding row from embedding matrix


 No real matrix multiplication
 Much faster and highly efficient

Examples:

 Word “a” → fetch row 1


 Word “great” → fetch row 524

This works because:

The indices in the one-hot input and the rows in the embedding matrix are aligned.
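A small NumPy check that the lookup and the multiplication agree (toy sizes):

```python
import numpy as np

vocab_size, dim = 10, 4
E = np.arange(vocab_size * dim, dtype=float).reshape(vocab_size, dim)

idx = 3                       # vocabulary index of the input word
one_hot = np.zeros(vocab_size)
one_hot[idx] = 1.0

via_matmul = one_hot @ E      # conceptual: one-hot vector times the matrix
via_lookup = E[idx]           # practical: fetch the row directly
# Both produce the same dim-dimensional embedding vector,
# but the lookup skips all the multiplications by zero.
```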

6. Embedding Layer in Keras


 Keras's Embedding Layer directly maps indices to embedding vectors.
 It performs lookup, not multiplication.
 Commonly used as the first trainable layer in NLP tasks such as:
o Text Classification
o Sentiment Analysis
o Machine Translation
o Named Entity Recognition (NER)
o Text Summarization
o Many other NLP tasks

7. Transfer Learning with Pre-trained


Embeddings
Analogy to Computer Vision

 In CV, we use pretrained CNN weights (e.g., ResNet, VGG).

Similarly in NLP:

We reuse pretrained embedding weights from:

 Word2Vec
 GloVe
 FastText
 Other large-scale models

Benefits:

 Better performance
 Faster convergence
 Requires less data
 Captures richer semantic relationships

Used for:
 Text classification
 Sequence models
 Neural machine translation
 Any deep learning-based NLP system

✔️Summary
 Embedding matrix = vocab_size × embedding_dim
 One-hot input → lookup the corresponding row
 Multiplication is conceptually taught but not used in practice
 Keras Embedding layer performs direct index-to-vector mapping
 Pretrained embeddings can be plugged into your model for transfer learning



Gated Recurrent Unit (GRU) — Detailed


Notes
1. Background
 GRU was proposed by Cho et al. (2014) in the same paper introducing the RNN
Encoder–Decoder architecture.
 GRU is a simplified variant of LSTM.
 Designed to reduce complexity while maintaining long-term dependency handling.

2. Key Difference Between LSTM and GRU


LSTM
 Has two states:
o Short-term state (hidden state): ( h_t )
o Long-term state (cell state): ( C_t )
 Has three gates + one main network:
o Input gate
o Forget gate
o Output gate
o Candidate network

GRU

 Merges long-term and short-term memory → only one state:


o ( C_t )
 Has only two gates + one main network:
o Update gate
o Reset gate
o Candidate activation network
 No output gate

GRU is therefore:

 Smaller
 Less complex
 Slightly faster to train

But performance-wise, neither LSTM nor GRU is consistently superior—both are widely
used.

3. GRU Architecture Overview


GRU uses three neural components:

1. Update Gate ( z_t )


2. Reset Gate ( r_t )
3. Main Network (candidate activation) ( g_t )

GRU output = GRU state


(While the output and the state can in principle differ, they are usually treated as the same.)

4. Update Gate — Controls “Forget” +


“Input” Gates Together
Unlike LSTM which uses two separate gates, GRU uses one gate (update gate) to control
both:

 Input gate portion = ( z_t )


 Forget gate portion = ( 1 - z_t )

Interpretation:

 If ( z_t = 1 ): keep new information (input gate open), forget old state
 If ( z_t = 0 ): keep old information (forget gate open), ignore new state

Formula:

[
z_t = \sigma(W_{xz} x_t + W_{cz}C_{t-1} + b_z)
]

Where:

 ( W_{xz} ): weight matrix for input


 ( W_{cz} ): weight matrix for previous state
 ( b_z ): bias

5. Reset Gate — Controls How Much Past


State Is Used
This determines how much of the previous state ( C_{t-1} ) should influence the candidate
activation.

Formula:

[
r_t = \sigma(W_{xr}x_t + W_{cr}C_{t-1} + b_r)
]

If:

 ( r_t = 0 ): previous state is “reset” → GRU runs like it’s seeing a new sequence
 ( r_t = 1 ): full previous state is used

6. Main Neural Network (Candidate


Activation)
This computes the new candidate state using current input and reset-controlled previous
state.

Formula:

[
g_t = \tanh(W_{xg}x_t + W_{cg}(r_t \cdot C_{t-1}) + b_g)
]

Where:

 ( r_t \cdot C_{t-1} ) controls how much previous memory contributes


 Very similar to the candidate creation in LSTM but simpler

7. Final State Update (No Output Gate)


GRU directly combines:

 Old memory
 New candidate memory

Using update gate ( z_t ):

State update:

[
C_t = z_t \cdot g_t + (1 - z_t) \cdot C_{t-1}
]

Meaning:

 If ( z_t ) is large → new information dominates


 If ( 1 - z_t ) is large → old information dominates

8. Output of GRU
GRU has no separate output gate.

The output can be:

 ( y_t = C_t ), OR
 A softmax applied externally (for prediction tasks)
9. Summary

| Component | LSTM | GRU |
|---|---|---|
| Memory state | 2 states: h_t, C_t | 1 state: C_t |
| Gates | 3 (input, forget, output) | 2 (update, reset) |
| Complexity | Higher | Lower |
| Speed | Slower | Faster |
| Output gate | Yes | No |
| Performance | Comparable | Comparable |

GRU is simpler, faster, and still capable of learning long-term dependencies, which is why it
became very popular.
