Artificial Narrow Intelligence (ANI) vs
Artificial General Intelligence (AGI)
Module 4 – Generative AI (Prof. Naveen Kumar Bhansali)
1. Artificial Narrow Intelligence (ANI)
Also known as Weak AI.
Definition
AI systems designed and trained to perform one specific task or a narrow set of tasks.
They operate strictly within predefined constraints and cannot generalize beyond their
scope.
Examples
Siri, Alexa → Voice assistants performing specific actions (set reminders, play
music, answer factual queries).
Recommendation Systems → Netflix/Amazon algorithms suggesting movies or
products based on user history.
Autonomous Vehicles (Self-driving cars) → Navigate roads, detect traffic signals,
avoid obstacles using sensors + specialized algorithms.
Key Characteristics
Domain-specific
Limited flexibility
No ability to transfer learning across domains
2. Artificial General Intelligence (AGI)
Also called Strong AI or Human-Level AI.
Definition
AI systems capable of performing any intellectual task that a human can, with the ability
to:
Understand
Learn
Reason
Plan
Solve problems adaptively across any domain
Current Status
AGI is theoretical.
No existing system fully matches human-level flexible intelligence.
Levels of AGI
1. Basic AGI
o Matches human capability
o Performs any human task but not necessarily faster or better
2. Advanced AGI
o Exceeds human abilities in speed, accuracy, efficiency, insights
3. Superintelligence (Speculative)
o Surpasses human intelligence in all dimensions
o Could cause extremely rapid societal and technological change
How Generative AI Drives the Development
of AGI
Generative AI models (e.g., GPT-4) create text, images, audio, and other content from simple
prompts. They contribute to AGI development in the following ways:
1. Enhancing Creativity & Problem-Solving
Can produce human-like text
Useful for brainstorming, drafting, ideation
Shows potential for cognitive flexibility, a key component of AGI
2. Improving Learning & Adaptation
Train on massive datasets
Continuously improve
Develop generalization abilities across different domains
Mirrors how AGI must learn and adapt
3. Facilitating Natural Interaction
NLP and text generation allow smoother human–AI communication
Reduces barrier between machine understanding & human language
Enables AI to integrate into everyday activities more naturally
Essential for AGI-level interaction
4. Democratizing Access to Advanced AI
Generative AI tools are available to non-experts
Encourages experimentation, broad usage, and feedback
This widespread use accelerates refinement → pushing AI closer to AGI
5. Creating Multimodal Capabilities
Not limited to text
Also generates images, audio, video
AGI requires understanding and reasoning across multiple modalities simultaneously
Multimodal generative models are stepping stones to AGI
Summary
ANI → Specialized, task-focused, limited intelligence
AGI → Human-like versatile intelligence capable of reasoning and solving problems
in any domain
Generative AI accelerates progress toward AGI by improving:
o Creativity
o Adaptability
o Natural communication
o Accessibility
o Multimodal understanding
AI vs ML vs DL vs Generative AI
Module 4 – Generative AI (Prof. Naveen Kumar Bhansali)
1. Artificial Intelligence (AI)
Definition:
AI is the broadest field concerned with creating systems that can perform tasks requiring
human-like intelligence, such as:
Learning
Reasoning
Problem-solving
Perception
Language understanding
AI is the umbrella term under which ML, DL, and Generative AI fall.
2. Machine Learning (ML)
Subset of AI
Definition:
ML focuses on developing algorithms that enable machines to learn patterns from data and
make predictions or decisions without being explicitly programmed for each task.
Key Points:
Systems improve automatically with experience
ML models learn from data patterns
Used for tasks like classification, regression, clustering, etc.
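To make "learning patterns from data without explicit programming" concrete, here is a minimal sketch of a 1-nearest-neighbour classifier in plain Python (the data points and labels are invented for illustration):

```python
import math

def nearest_neighbor_predict(train, query):
    """Classify `query` with the label of its closest training point (1-NN)."""
    return min(train, key=lambda item: math.dist(item[0], query))[1]

# Toy dataset of (features, label) pairs. No classification rules are
# written anywhere -- the "model" is simply the examples it has seen.
train = [((1.0, 1.0), "small"), ((1.2, 0.9), "small"),
         ((8.0, 9.0), "large"), ((9.1, 8.5), "large")]

print(nearest_neighbor_predict(train, (1.1, 1.0)))  # prints "small"
```

Real ML models (regression, clustering, neural networks) generalize this idea: the behaviour comes from the data, not from hand-coded rules.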
3. Deep Learning (DL)
Subset of Machine Learning
Definition:
DL uses artificial neural networks with many layers (deep networks) to automatically
learn complex representations from data.
Characteristics:
Learns hierarchical features automatically
Works extremely well for image recognition, speech recognition, text processing,
etc.
Requires large datasets + high computation
DL forms the backbone of modern generative models.
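As a rough illustration of "many layers", the sketch below runs a forward pass through a two-layer network in plain Python. The weights here are hand-set and hypothetical; a real deep network learns millions of such weights from data:

```python
def relu(x):
    """Standard non-linearity: negative values become zero."""
    return [max(0.0, v) for v in x]

def linear(weights, bias, x):
    """One fully connected layer: y = Wx + b."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

# Hypothetical weights for a tiny 3 -> 2 -> 1 network.
W1, b1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]
W2, b2 = [[1.0, -1.0]], [0.0]

x = [1.0, 2.0, 3.0]
h = relu(linear(W1, b1, x))   # layer 1: extracts low-level features
y = linear(W2, b2, h)         # layer 2: combines them into a higher-level output
print(y)                      # approximately -0.1
```

Stacking more such layers is what lets deep networks build hierarchical representations, from raw inputs up to abstract features.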
4. Generative AI (GenAI)
Sits under AI → ML → DL
From an implementation standpoint, generative AI is a subset of deep learning.
Definition:
Generative AI uses ML/DL techniques to create new content such as:
Text
Images
Music
Code
Audio
Videos
How Generative AI fits into the hierarchy:
1. Under AI:
o It is part of the broader effort of creating intelligent systems capable of
creative reasoning and output generation.
2. Uses Machine Learning:
o GenAI models learn patterns, styles, and structures from training data and
generate new, similar outputs.
3. Based on Deep Learning:
o Most generative models (GPT-4, DALL·E, Stable Diffusion, etc.) use deep
neural networks to understand and generate content.
o Especially uses architectures like Transformers, GANs, VAEs, Diffusion
Models.
Summary of Relationships
AI
→ Broadest field: All intelligent systems
ML (Subset of AI)
→ Learning from data
→ No explicit rules required
DL (Subset of ML)
→ Neural networks with multiple layers
→ Learns complex representations
Generative AI (Subset of DL)
→ Uses AI + ML + DL to generate entirely new content
One-line Summary
AI ⟶ ML ⟶ Deep Learning ⟶ Generative AI
(Generative AI sits at the deepest end of the hierarchy and uses deep neural networks to
create new original content.)
Core Principle of Generative AI –
Representational Learning
Module 4 – Prof. Naveen Kumar Bhansali
1. What is Representational Learning?
Representational learning (also called feature learning) refers to techniques where
algorithms:
Automatically discover useful representations/features from raw data
Eliminate the need for manual feature engineering
Learn the most informative structure of the data directly
This makes learning more efficient, scalable, and adaptable across domains.
2. Relationship Between Representational Learning and
Generative AI
Generative AI depends fundamentally on representational learning because generative
models must understand the underlying structure of data before generating new content.
This relationship is expressed through the following aspects:
A. Feature Extraction
Generative models need to produce realistic outputs.
For this, they must first understand key features of the training data.
Representational learning helps generative AI to:
Extract meaningful patterns
Identify relevant characteristics
Generate new instances preserving the original data’s structure
B. Learning Data Distributions
Generative models aim to learn the probability distribution of the dataset.
Representational learning supports this by:
Providing compact latent representations
Capturing essential characteristics in a lower-dimensional space
Making it easier for the model to learn and sample from the underlying distribution
Example: Autoencoders compress data into a latent space that preserves important features.
C. Cross-Domain Generation (Multimodality)
Generative AI often works with multiple data modalities, such as:
Text
Images
Audio
Representational learning creates a unified encoding framework, enabling:
Translation of features between modalities
Multimodal generation (e.g., generating images from text)
D. Improving Model Performance
The performance of generative models depends strongly on how well features are learned.
Better representations → Better generation.
Advances in representational learning (new architectures, improved training) directly
improve the following qualities of generative outputs:
Realism
Diversity
Accuracy
Coherence
Summary (Conceptual)
Representational learning is the foundation of generative AI.
It allows models to:
Extract and encode meaningful features
Understand raw data structures
Recreate or generate new data effectively
The synergy between them leads to more accurate, realistic, and versatile generative
capabilities.
Encoder–Decoder Architecture for
Representational Learning
Representational learning is often achieved through encoder–decoder architectures,
commonly used in generative tasks.
1. Encoder
Role: Compress raw data into a meaningful latent representation
Takes input such as text, images, audio, or sequences
Identifies the most relevant features
Converts data into a low-dimensional latent vector
Removes noise and redundancy
Captures semantic structure
2. Decoder
Role: Reconstruct or generate outputs from the latent representation
Takes encoded representations
Expands them back into the original format or a new output
Reconstructs data that resembles the input
Or generates new samples following the same distribution
3. Architectures that use the Encoder–Decoder
Framework
Autoencoders (AE) – learn compressed representations and reconstruct input
Variational Autoencoders (VAE) – learn probabilistic latent spaces
Transformers – use encoder and decoder blocks for tasks like translation
Seq2Seq models – for text generation and machine translation
These architectures enable effective representational learning, which is critical for high-
quality generative tasks.
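A deliberately simplified sketch of the encoder-decoder idea: for toy 2-D data lying on the line y = 2x, a single latent number suffices. The compression here is derived by hand (a projection); an autoencoder would learn an equivalent mapping from data:

```python
# Toy data on the line y = 2x: two coordinates, but one degree of freedom.
data = [(-1.0, -2.0), (-0.25, -0.5), (0.5, 1.0), (0.75, 1.5)]

def encode(point):
    """Encoder: compress a 2-D point into a 1-D latent code
    (least-squares projection onto the direction (1, 2))."""
    x, y = point
    return (x + 2 * y) / 5

def decode(z):
    """Decoder: expand the latent code back into a 2-D point."""
    return (z, 2 * z)

reconstructed = [decode(encode(p)) for p in data]
print(reconstructed)  # lossless here, because the data is truly 1-D
```

In a real autoencoder (AE/VAE), both mappings are neural networks trained so that decode(encode(x)) ≈ x, and the latent space captures the data's essential structure.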
Final Summary
The core principle of Generative AI is representational learning.
Generative AI models rely on effective representation learning to extract essential
features and learn data distributions.
Encoder–decoder architectures play a vital role in capturing and reconstructing these
learned representations.
Strong representation learning → better generative performance across modalities
(text, images, audio).
Module 4 – Generative AI
Applications / Case Studies in Computer Vision
(Prof. Naveen Kumar Bhansali)
Generative AI has transformed computer vision by enabling tasks that go far beyond classical
image processing. Using models like GANs, VAEs, and diffusion models, it can generate,
enhance, modify, and reconstruct visual content with high accuracy and realism.
The key applications are detailed below.
1. Image Synthesis
Generative AI models such as:
GANs (Generative Adversarial Networks)
VAEs (Variational Autoencoders)
Diffusion Models
can generate high-resolution, realistic images from scratch.
Capabilities:
Learn to produce new images resembling the training dataset
Create synthetic datasets for training other models
Artwork generation
Filling missing image parts (basic inpainting)
2. Image Translation & Style Transfer
Generative models can transfer the style of one image onto another while preserving
content.
Applications:
Convert real photos into styles of famous painters (e.g., Monet, Van Gogh)
Convert scenes from one domain to another, such as:
o Day → Night
o Summer → Winter
o Sketch → Photo-like image
Used in art, entertainment, domain adaptation, etc.
3. Super-Resolution Imaging
Generative models (especially GAN variants like SRGAN):
Improve resolution of low-quality images
Restore fine details
Produce sharp, high-clarity images
Use Cases:
Medical imaging
Satellite/remote sensing images
Enhancing old or low-quality digital photographs
4. Video Synthesis & Prediction
Generative AI can:
Generate new video sequences from input frames
Predict future frames in a given video
Applications:
Video editing
Film special effects
Surveillance (predicting future activity/frames)
5. Image Inpainting
Generative models can intelligently fill missing or damaged regions of an image using
surrounding context.
Useful for:
Restoring old or damaged photos
Removing unwanted objects from images
Completing missing data in medical imagery
6. 3D Object Generation & Reconstruction
Generative models + 3D rendering techniques enable:
Generating 3D objects from 2D images
Reconstructing 3D shapes/structures from multiple viewpoints
Applications:
Virtual Reality (VR)
Augmented Reality (AR)
Gaming
Digital content creation
7. Image-to-Image Translation
Generative AI can convert an image from one domain to another.
Examples:
Sketch → Realistic Image
Low-light image → Enhanced image
Aerial view → Street-level map
Use Cases:
Urban planning
Remote sensing
Navigation
Entertainment and creative industries
8. Data Augmentation
Generative models can create synthetic, realistic data to improve the training of deep
learning systems.
Benefits:
Solves data scarcity issues
Increases dataset diversity
Helps models generalize better
Reduces overfitting
Particularly important in medical imaging, autonomous driving datasets, and rare event
detection.
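For intuition, classical augmentation can be sketched in a few lines (flips and brightness shifts on a grayscale image stored as a list of rows). Generative models go further by synthesizing entirely new, realistic samples rather than transforming existing ones:

```python
def hflip(img):
    """Mirror a grayscale image (list of pixel rows) left-to-right."""
    return [row[::-1] for row in img]

def brighten(img, delta):
    """Shift pixel intensities, clamped to the valid 0-255 range."""
    return [[min(255, max(0, p + delta)) for p in row] for row in img]

img = [[10, 20, 30],
       [40, 50, 60]]

# One original image becomes several distinct training examples.
augmented = [img, hflip(img), brighten(img, 25), brighten(hflip(img), -5)]
print(len(augmented))  # 4 variants from a single image
```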
Conclusion
Generative AI is reshaping computer vision by providing:
High realism
Enhanced creativity
Smarter reconstruction
Better data availability
As models continue to evolve, the impact of generative AI is expected to increase
significantly across fields like healthcare, entertainment, surveillance, design, urban planning,
and AR/VR.
Generative AI – Data Synthesis
Module 4 – Prof. Naveen Kumar Bhansali
Data synthesis refers to the creation of artificial data that closely resembles real-world data.
Generative AI models (trained on large datasets) play a crucial role in synthesizing such data
for:
Training machine learning models
Protecting privacy
Testing systems
Handling data scarcity
1. Synthetic Data Generation
Models: GANs (Generative Adversarial Networks) and VAEs (Variational
Autoencoders)
These models can:
Generate synthetic data that mirrors original datasets
Produce realistic samples where real data is scarce or sensitive
Example:
GANs generate synthetic medical images to train diagnostic models without exposing
patient identities.
2. Data Augmentation
Machine learning often needs large labeled datasets, which are expensive and time-
consuming to collect.
Generative AI helps by:
Adding diverse, realistic examples to existing datasets
Expanding training data automatically
Used heavily in:
Computer vision → synthesizing new images
Speech recognition → generating variations of audio recordings
This improves algorithm robustness and generalization.
3. Anonymization
Privacy protection is essential when dealing with sensitive data.
Generative AI supports anonymization by:
Generating synthetic datasets with realistic patterns
Ensuring no direct link to any real individual
Reducing privacy risks while preserving data utility
4. Handling Imbalanced Datasets
Many real datasets have class imbalance, where some categories have very few examples.
Generative AI solves this by:
Synthesizing new data for minority classes
Balancing the dataset
Improving fairness and prediction accuracy of ML models
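A minimal sketch of rebalancing by random oversampling (plain Python). Note that generative approaches such as GANs synthesize genuinely new minority-class samples, whereas this baseline merely resamples existing ones:

```python
import random
from collections import Counter

def oversample(rows, label_of):
    """Resample minority-class rows (with replacement) until every
    class is as frequent as the largest one."""
    counts = Counter(label_of(r) for r in rows)
    target = max(counts.values())
    out = list(rows)
    for label, n in counts.items():
        pool = [r for r in rows if label_of(r) == label]
        out += random.choices(pool, k=target - n)
    return out

random.seed(1)
data = [("a", 0)] * 95 + [("b", 1)] * 5   # 95:5 class imbalance
balanced = oversample(data, label_of=lambda r: r[1])
print(Counter(l for _, l in balanced))     # both classes now have 95 rows
```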
5. Simulation and Scenario Analysis
Generative AI creates synthetic data for environments that are:
Hard
Expensive
Dangerous
Or impossible to capture in real life
Examples:
Autonomous Vehicles
Simulate diverse driving conditions
Train and test self-driving car algorithms
Case Study: Wayve (London)
Developed GAIA-1, a generative AI model for autonomy
Generates realistic driving videos using video, text, and action inputs
Allows fine control of ego vehicle behavior and scene features
Useful for research, simulation, and training
Finance
Generates synthetic financial behavioral patterns
Helps personalize financial advice
Identifies hidden patterns and relationships in spending/investment data
6. Feature Creation for ML
Generative AI can create new, meaningful features by learning underlying data
distributions.
Captures complex relationships that humans may miss
Enhances the performance of machine learning models
Used in finance, healthcare, behavioral modeling, etc.
7. NLP Data Synthesis
Language models like GPT can generate:
Synthetic text for training datasets
Evaluation data for language systems
Chatbot training conversations
This supports scalable NLP dataset creation.
8. Cost-Effective Dataset Creation
Generative AI drastically reduces the cost of producing training datasets.
Example: Stanford Alpaca Project
Generated 52,000 training instructions
Total cost: ~$500 using OpenAI models
This democratizes access to large, high-quality datasets.
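Using the figures above, the cost per generated instruction works out to well under a cent:

```python
total_cost = 500          # approximate API cost in dollars (from the notes)
instructions = 52_000
print(round(total_cost / instructions * 100, 2))  # ~0.96 cents per instruction
```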
9. Conclusion and Future Outlook
Generative AI transforms data synthesis by:
Overcoming data scarcity
Preserving privacy
Enabling large-scale, realistic training environments
Reducing the cost of data creation
Prediction (attributed to Gartner):
By 2024, 60% of the data used in AI and analytics projects will be synthetically generated.
Ethical Considerations
Essential to ensure:
Synthetic data does not reproduce biases from real datasets
Compliance with ethical standards and regulations
Special caution in sensitive domains (healthcare, finance, governance)
📘 Generative AI – Personalization
(Module 4 – Prof. Naveen Kumar Bhansali)
Generative AI has significantly advanced personalization, enabling systems to deliver
tailored content, recommendations, and experiences based on individual preferences,
behavior, and context. By generating synthetic or customized data, GenAI adapts much more
closely to user needs than traditional rule-based systems.
1. Personalized Content Creation
Generative AI creates customized content based on:
User interests
Past interactions
Purchase history
Browsing behavior
Applications
Digital marketing:
GenAI creates personalized email campaigns with:
o Tailored subject lines
o Product recommendations
o Body text reflecting prior purchases
→ Leads to higher engagement, open rates, and conversion rates.
E-commerce:
If a user prefers eco-friendly products, GenAI:
o Generates product descriptions highlighting sustainability
o Creates personalized ads aligned with their values
2. Tailored Recommendations
Recommendation systems powered by GenAI analyze:
User behavior
Past selections
Engagement patterns
They then generate highly personalized suggestions.
Examples
Netflix, Spotify:
GenAI analyzes viewing and listening habits to:
o Recommend new movies or songs
o Generate synthetic profiles that explore patterns to suggest content users may
not discover themselves
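A toy sketch of content-based recommendation using tag overlap (the catalog and tags are invented for illustration; production systems such as Netflix's use learned embeddings and far richer behavioural signals):

```python
def recommend(user_history, catalog, top_n=2):
    """Score unseen items by tag overlap with items the user already liked."""
    liked_tags = set()
    for item in user_history:
        liked_tags |= set(catalog[item])
    scores = {item: len(liked_tags & set(tags))
              for item, tags in catalog.items() if item not in user_history}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical catalog: item -> genre tags.
catalog = {
    "Movie A": ["sci-fi", "thriller"],
    "Movie B": ["sci-fi", "drama"],
    "Movie C": ["romance"],
    "Movie D": ["thriller", "drama"],
}
print(recommend(["Movie A"], catalog))  # suggests the two most similar movies
```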
3. Adaptive User Interfaces
Generative AI creates interfaces that adapt in real-time based on how users interact.
Example
A news app dynamically rearranges its homepage based on:
o What categories a user reads most
o Time spent on certain types of articles
If the user frequently reads technology, the app prioritizes tech news at the top.
→ Results in a more intuitive, relevant, and user-friendly interface.
4. Customized Educational Content
Generative AI enables personalized learning experiences by producing:
Customized learning paths
Personalized quizzes
Targeted reading materials
Remedial content
Example
An online learning platform uses GenAI to:
o Analyze student performance
o Generate study plans tailored to strengths and weaknesses
o Provide specific exercises to improve low-performing areas
5. Individualized Health & Wellness Plans
Healthcare and wellness applications use GenAI to create:
Personalized diet plans
Customized workout routines
Tailored medical or lifestyle suggestions
Example
A fitness app uses GenAI to generate:
o Workouts based on user’s health metrics
o Diet plans supporting goals such as fat loss, strength building, or
cardiovascular improvement
Conclusion
Generative AI has transformed personalization by enabling:
More relevant user experiences
Tailored content generation
Better recommendations
Adaptive and intelligent interfaces
As the technology evolves, GenAI will deliver even richer personalization while ensuring:
Ethical use
Privacy protection
Transparency in recommendations
📘 Gen AI – Widening of Gap Between
Experts and Novices
(Module 4 – Prof. Naveen Kumar Bhansali)
Generative AI provides powerful capabilities, but the effectiveness of these tools heavily
depends on domain expertise. Experts can use GenAI more efficiently and strategically
because they understand the context, constraints, and exactly how to frame precise prompts.
This leads to a widening gap between experts and novices.
1. Why Experts Benefit More from GenAI
Experts possess deep domain knowledge.
They frame accurate and detailed prompts.
They can interpret AI outputs correctly.
They refine and improve AI-generated results better than novices.
Thus, generative AI amplifies expert abilities, making them even more productive and
innovative.
2. Examples Showing How GenAI Favors Experts
A. Scientific Research
Experts in technical fields can use GenAI to model and analyze complex structures
accurately.
Example:
A molecular biologist uses GenAI to:
Generate protein structure models
Provide specific scientific parameters
Interpret results correctly
→ Enables discoveries that novices cannot achieve due to lack of foundational knowledge.
B. Creative & Professional Design Fields
Experts in design use GenAI tools to automate routine tasks and push creativity further.
Example:
A professional graphic designer using Adobe Sensei can:
Automate repetitive components
Focus on advanced creative decisions
Refine AI-generated drafts into high-quality designs
→ Novices may not know how to evaluate or correct AI suggestions.
C. Programming & Software Development
Experienced developers know how to frame programming problems and integrate AI-
generated code effectively.
Example:
A senior programmer using GitHub Copilot can:
Generate sophisticated code snippets
Integrate them into large systems
Detect and fix logical flaws in AI-generated code
→ Novices may copy code blindly without understanding context, leading to errors.
D. Healthcare Diagnostics
Domain knowledge is crucial for interpreting medical outputs from GenAI systems.
Example:
A radiologist using AI diagnostic tools can:
Identify anomalies in medical scans
Connect findings with patient history
Make accurate treatment decisions
→ Novices may misinterpret AI suggestions, risking patient safety.
E. Business & Financial Analysis
Experts use GenAI for deep market insights and strategic decision-making.
Example:
An experienced financial analyst can:
Generate market forecasts
Interpret AI-generated investment reports
Make informed investment decisions
→ Novices may misread AI outputs due to lack of financial knowledge.
3. Why the Gap Widens
GenAI increases productivity only when combined with expertise.
Experts become even more capable and efficient.
Novices often lack:
o Context to frame prompts correctly
o Skills to evaluate AI outputs
o Ability to refine AI-generated content
Therefore, Generative AI magnifies existing skill differences instead of reducing them.
4. Conclusion
Generative AI offers immense value, but domain expertise determines its true impact.
Experts leverage AI to:
Work faster
Work more creatively
Produce more accurate and insightful results
Novices, lacking the knowledge to guide or interpret AI, may struggle, causing the expert–
novice gap to widen.
This highlights the need for:
Strong foundational learning
Skill development
Understanding domain principles
to fully harness the power of generative AI.
📘 Module 5 – Transformer Architecture
Generative AI – Prof. Naveen Kumar Bhansali
Transformers form the foundational architecture behind modern Large Language Models
(LLMs) used in Generative AI (such as GPT, PaLM, LLaMA, etc.). This architecture
revolutionized Natural Language Processing (NLP) because it can efficiently model long-
range dependencies and capture context far better than previous architectures like RNNs and
LSTMs.
1. Introduction to Transformer Architecture
Large Language Models (LLMs) are built on Transformer architecture, which enables
high-level language understanding and generation.
The key components you must understand are:
1. Tokenization
2. Embeddings
3. Attention Mechanism (Self-Attention)
These components together make it possible for LLMs to process text, understand context,
and generate meaningful responses.
2. Tokenization
Tokenization = breaking text into smaller units called tokens.
✔ Types of Tokens:
Words
Subwords (most common in LLMs)
Characters
Special symbols (punctuation, whitespace markers, etc.)
✔ Example
Sentence:
"Hello, how are you?"
Possible tokens:
“Hello”
“,”
“how”
“are”
“you”
“?”
✔ Why Tokenization Is Important
Enables the model to process text piece by piece
Handles rare words through subword tokenization
Reduces vocabulary size while retaining semantic meaning
Tokenization is the first step in transforming raw text into units the model can
process.
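The example above can be reproduced with a simple word-level tokenizer (a sketch; production LLMs use learned subword tokenizers such as BPE, which split rare words into smaller pieces):

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens.
    A simple word-level scheme, not a subword tokenizer."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, how are you?"))
# ['Hello', ',', 'how', 'are', 'you', '?']
```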
3. Embeddings
Once tokens are extracted, each token is converted into a vector representation known as an
embedding.
✔ Purpose of Embeddings
Embeddings encode:
Semantic meaning
Syntactic role
Contextual relationships between tokens
This means the embedding of “doctor” is closer to “nurse” than to “banana.”
✔ Key Point
Embeddings are learned during training, so they automatically capture complex
relationships between words/subwords.
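The "doctor is closer to nurse than to banana" claim can be checked with cosine similarity. The 3-D vectors below are hand-made stand-ins; real embeddings have hundreds of learned dimensions:

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hand-made toy "embeddings" (real models learn these during training).
emb = {
    "doctor": [0.9, 0.8, 0.1],
    "nurse":  [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}
print(cosine(emb["doctor"], emb["nurse"]) > cosine(emb["doctor"], emb["banana"]))
```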
4. Attention Mechanism (Self-Attention)
The attention layer is the core innovation of Transformers.
✔ What Attention Does
It allows the model to focus on the most relevant tokens in a sentence while processing the
input.
The model learns:
Which words matter more for meaning
How words relate to each other
How to maintain context across long sentences
✔ Example:
In the sentence:
“The cat, which was hungry, ate its food.”
To understand “its,” the model must look back to “cat.”
Self-attention enables this long-distance link.
✔ Why Attention Is Powerful
Handles long-range dependencies
Understands context deeply
Learns relationships dynamically for each input
Replaces sequential processing (unlike RNNs)
Hence, Transformers are parallelizable, making training dramatically faster.
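A minimal sketch of scaled dot-product attention in plain Python (single head, no learned projection matrices, toy 2-D vectors): each query scores every key, the scores become weights via softmax, and the output is the weighted mix of the values.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes the values,
    weighted by how well it matches every key."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# Toy vectors for three tokens; the query aligns with the first key.
q = [[2.0, 0.0]]
k = [[2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
v = [[10.0, 0.0], [0.0, 10.0], [0.0, 10.0]]
print(attention(q, k, v))  # output leans strongly toward the first value vector
```

Because every query attends to every key independently, all of these dot products can be computed in parallel, which is exactly why Transformers train so fast on GPUs.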
5. How These Components Work Together
1. Tokens
→ Break text into manageable units.
2. Embeddings
→ Convert tokens to meaningful numerical vectors.
3. Self-Attention
→ Helps the model focus on relevant parts of the input and understand context.
This pipeline allows Transformers to perform language modeling with extremely high
accuracy.
6. Conclusion
Transformer Architecture is the backbone of all modern generative AI systems.
By combining:
Tokenization
Embeddings
Self-Attention / Attention Layers
Transformers can understand, model, and generate human language with unmatched
capability.
As the architecture continues to evolve, it opens new possibilities across industries, from
healthcare and finance to creative design and automation.
📘 Why Transformer Models Are Trending
(Module 5 – Generative AI)
Transformer models have rapidly become the foundation of modern AI due to several
powerful advantages in learning, processing efficiency, scalability, and adaptability.
1. Enhanced Learning Through Self-Supervision
Transformers excel at self-supervised learning, where they learn patterns from unlabeled
data during pre-training.
How This Works:
Models are trained on large text corpora.
They predict:
o Missing words
o Masked words
o Mixed or corrupted tokens
This helps them learn:
o Deep linguistic structures
o Semantic relationships
o Contextual meanings
Why This Matters:
No need for fully labeled datasets for each downstream task.
Reduces manual annotation effort.
Improves generalization across:
o Multiple tasks
o Domains
o Languages
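The masking step can be sketched as follows: training pairs are manufactured from raw text alone by hiding tokens, so no human labels are needed (the mask rate and seed here are arbitrary choices for illustration):

```python
import random

def make_masked_examples(tokens, mask_rate=0.3, seed=1):
    """Turn raw text into (input, target) pairs by hiding some tokens --
    the labels come for free from the text itself (self-supervision)."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked[i] = "[MASK]"
            targets[i] = tok
    return masked, targets

tokens = "the cat sat on the mat".split()
masked, targets = make_masked_examples(tokens)
print(masked)    # some positions replaced by [MASK]
print(targets)   # the model must learn to predict exactly these hidden tokens
```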
2. Efficient Parallel Processing
One of the biggest innovations of transformers is their ability to parallelize computation.
Comparison with RNNs:
RNNs → process input sequentially, slowing down training.
Transformers → process all tokens in parallel because of the attention
mechanism.
Benefits:
Faster training and inference.
Highly efficient on modern hardware:
o GPUs
o TPUs
Can handle extremely large datasets.
More scalable than earlier deep learning models.
3. Scalability for Complex Tasks
Transformers are designed to scale:
Scales in Two Ways:
1. Data scale – can learn from massive datasets.
2. Model scale – supports billions of parameters.
Outcome:
Captures intricate linguistic patterns.
Handles complex AI tasks.
Enables state-of-the-art performance in real-world applications.
Scalability is key to pushing the boundaries of AI research and improving model capabilities
across languages and domains.
4. Adaptable Integration Across Applications
Transformer models are extremely flexible and easy to integrate into diverse applications.
Why This Is Possible:
Pre-trained models (e.g., BERT, GPT) can be fine-tuned with small amounts of task-
specific data.
Fine-tuning adapts the model to:
o New tasks
o New domains
o New datasets
Applications Include:
Natural language understanding
Machine translation
Sentiment analysis
Text classification
Question answering
Many others
This adaptability makes transformers suitable for a wide range of industries and real-world
use cases.
5. Conclusion
Transformers are trending because they offer:
Powerful self-supervised learning → strong representations
Efficient parallel processing → fast and scalable training
High scalability → supports massive models and datasets
Flexible integration → easy fine-tuning for new tasks
These strengths have made transformers a dominating force in advancing modern AI and
revolutionizing all areas of natural language processing and understanding.
📘 Module 5 – List of Foundation Models
(Generative AI – Prof. Naveen Kumar Bhansali)
What Are Foundation Models?
Term coined/popularized by Stanford HAI – Center for Research on Foundation
Models (CRFM).
Also called Large AI Models.
Trained on massive datasets to enable wide-range applicability across tasks and
domains.
Development requires huge resources (hundreds of millions of dollars for compute +
data).
But fine-tuning or using pre-trained models is far more cost-effective.
📌 Timeline & Summary of Major
Foundation Models
1. GPT-1 (OpenAI, June 2018)
Architecture: Decoder-only Transformer
Parameters: 117 million
Training data: 1 billion tokens
Training time: 30 days, using 8× NVIDIA P600 GPUs
Significance: Marked the beginning of the GPT series; foundation for modern NLP
models.
2. BERT (Google, October 2018)
Architecture: Encoder-only Transformer
Parameters: 340 million
Training data: 3.3 billion words
Key feature: Bidirectional context → major improvement in language understanding.
Applications: Set a benchmark for contextual NLP tasks.
3. GPT-2 (OpenAI, February 2019)
Parameters: 1.5 billion
Training data: 40 GB (~10 billion tokens)
Significance: Demonstrated strong generative abilities across domains.
4. T5 – Text-to-Text Transfer Transformer (Google, 2019)
Parameters: 11 billion
Training data: 34 billion tokens
Capabilities:
o Text generation
o Translation
o Summarization and other text-to-text tasks
Positioned as a general-purpose foundation model for many Google projects.
5. GPT-3 (OpenAI, May 2020)
Parameters: 175 billion
Training data: 300 billion tokens
Significance:
o Huge performance leap in natural language generation.
o Set new benchmarks across NLP tasks.
GPT-3.5 (2022)
A fine-tuned variant of GPT-3.
Delivered to the public via ChatGPT.
6. Claude (Anthropic, December 2021)
Parameters: 52 billion
Training data: 400 billion tokens
Features:
o Strong conversational abilities
o Emphasis on ethical AI and safety
Became influential in responsible AI discussions.
7. BLOOM (Hugging Face, July 2022)
Parameters: 175 billion
Architecture: Similar to GPT-3
Training data: 350 billion tokens, multilingual
o 30% English
o Includes 13 programming languages
Focus: Open, responsible, multilingual AI.
8. LLaMA (Meta AI, February 2023)
Parameters: 65 billion
Training data: 1.4 trillion tokens
Supports 20 languages
Designed specifically for research and efficiency.
9. BloombergGPT (Bloomberg, March 2023)
Parameters: 50 billion
Training data: 363 billion tokens, financial-domain focused
Specialization: Financial NLP (analysis, insights, domain tasks)
Claim: Outperforms general LLMs of similar size on finance tasks.
10. GPT-4 (OpenAI, March 2023)
Details on parameters/training: Undisclosed
Available via ChatGPT Plus
Represents the next stage in OpenAI’s innovation.
11. Claude 2 (Anthropic, July 2023)
Successor to Claude.
Enhanced conversational abilities, ethics, and human-aligned interaction.
12. LLaMA 2 (Meta AI, July 2023)
Parameters: 70 billion
Training data: 2 trillion tokens
Emphasizes:
o Scalability
o Multilingual performance
o Research usability
13. Mistral 7B (Mistral AI, September 2023)
Parameters: 7.3 billion
Extremely efficient for its size.
General-purpose NLP tasks at high performance.
14. Grok 1 (xAI by Elon Musk, November 2023)
Parameters: 314 billion
Context window: 8192 tokens
Special access to real-time X/Twitter data
Optimized for dynamic, real-time social media insights.
15. Gemini 1.5 (Google DeepMind, February 2024)
Architecture: Mixture of Experts (MoE)
Parameters: Undisclosed
Context window: ~1 million tokens
Focus: High scalability + multimodal capabilities.
16. Phi-3 Family (Microsoft, April 2024)
“Small Language Models”
Variants: Mini, Small, Medium
Parameters: 3.8B – 14B
Designed for efficient deployment with strong performance.
17. LLaMA 3 (Meta AI, April 2024)
Parameters: 70 billion
Training data: 15 trillion tokens
Designed for large-scale research and improved multilingual tasks.
18. Claude 3 Family (Anthropic, 2024)
Includes:
Claude 3 Opus – flagship, highest reasoning performance
Claude 3 Sonnet – fast + versatile
Claude 3 Haiku – cheapest + fastest for text tasks
Features:
Multimodal vision capabilities
Improved reasoning, math, coding
Better multilingual fluency
Strong emphasis on safety + responsible scaling
📌 Licensing Considerations
Foundation model licenses differ significantly:
Open-source models (e.g., Apache 2.0):
Allow modification + redistribution
Some conditions apply
Proprietary models:
Strict usage rights
May require permission for commercial use
LLaMA (Meta AI) License:
Non-commercial research focus
Requires compliance with specific usage guidelines
Key Advice:
Always check the latest documentation.
Licensing terms evolve over time.
Seek legal clarity when deploying models commercially.
📌 Final Note
The field of foundation models is evolving rapidly.
New models continue to emerge with groundbreaking capabilities in:
Multimodal learning
Larger context windows
More efficient architectures
Safer, aligned AI
The landscape of NLP and AI is constantly being reshaped as innovation accelerates.
Module – 5: Top-k Sampling vs Top-p
Sampling (Clean Notes)
Generative AI – Prof. Naveen Kumar Bhansali
Overview
Sampling methods determine how a language model selects the next token during text
generation.
This section explains:
Greedy approach
Random weighted sampling
Top-k sampling
Top-p (nucleus) sampling
Each method balances coherence, diversity, and creativity differently.
1. Greedy Approach
The model always chooses the highest-probability token at each step.
Produces coherent but repetitive and predictable text.
Lacks creativity and variation.
Example: a story generated this way may follow common, repetitive patterns.
2. Random Weighted Sampling
The model samples tokens according to their probability distribution.
High-probability tokens are more likely but not guaranteed to be chosen.
Adds randomness → more creativity, more variation.
Useful in:
o Creative writing
o Dialogue generation
o Less predictable interactions
3. Top-k Sampling
A refinement of random weighted sampling.
Model considers only the top k most probable tokens.
The next token is randomly sampled from this subset.
Balances quality + diversity.
Example:
o If k = 10, the model picks the next token from the 10 most likely options.
Helps maintain relevance while keeping text varied.
4. Top-p Sampling (Nucleus Sampling)
Instead of selecting a fixed number of tokens, we select tokens whose cumulative
probability ≥ threshold p.
The size of the pool changes dynamically based on context.
More flexible than top-k.
Produces fluent, coherent, contextually appropriate text.
Particularly effective in conversational AI.
Top-p Example (From Slides)
Step 1: Initial context
Sentence: “The cat sat on the”
Step 2: Model provides probabilities
Model outputs probability distribution for possible next words.
Step 3: Set Top-p threshold
Choose p = 0.85
Want the smallest set of tokens whose cumulative probability ≥ 0.85
Step 4: Sort tokens
Sort all candidate tokens in descending probability order.
Step 5: Accumulate probabilities
Keep adding probabilities until sum ≥ 0.85.
Step 6: Form the candidate pool
Tokens included in the top-p pool:
mat
roof
table
grass
Step 7: Randomly select one token
Example chosen from pool: “roof”
Final output:
➡ “The cat sat on the roof.”
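As a rough illustration, the greedy, top-k, and top-p strategies can be sketched in plain Python. The vocabulary and probabilities below are hypothetical, chosen so the top-p pool matches the slide example:

```python
import random

def top_k_filter(probs, k):
    """Keep only the k most probable tokens, then renormalize."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    pool, cum = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        pool[tok] = prob
        cum += prob
        if cum >= p:
            break
    total = sum(pool.values())
    return {tok: prob / total for tok, prob in pool.items()}

def sample(probs):
    """Random weighted sampling from a token -> probability dict."""
    toks = list(probs)
    return random.choices(toks, weights=[probs[t] for t in toks])[0]

# Hypothetical next-token distribution for "The cat sat on the"
probs = {"mat": 0.40, "roof": 0.25, "table": 0.12, "grass": 0.08,
         "moon": 0.06, "sofa": 0.05, "car": 0.04}

greedy = max(probs, key=probs.get)          # always "mat"
topk = sample(top_k_filter(probs, k=3))     # one of mat / roof / table
topp = sample(top_p_filter(probs, p=0.85))  # one of mat / roof / table / grass
```

With p = 0.85, the cumulative probabilities 0.40 → 0.65 → 0.77 → 0.85 mean the pool is exactly {mat, roof, table, grass}, as in the example above.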
Summary
Each method offers a different trade-off between coherence, diversity, and creativity:
| Method | How It Works | Strength | Weakness |
|---|---|---|---|
| Greedy | Always picks highest-probability token | Coherent | Predictable, repetitive |
| Random Weighted Sampling | Samples proportionally to probability | Creative, diverse | May become incoherent |
| Top-k Sampling | Sample from top k tokens | Balanced, relevant | Fixed k may miss context changes |
| Top-p Sampling | Sample from smallest set with cumulative prob ≥ p | Adaptive, fluent, context-aware | Slightly more complex |
Module – 6: Retrieval Augmented
Generation (RAG)
Generative AI – Prof. Naveen Kumar Bhansali
1. Why RAG? Understanding the Need
The performance of a generative AI model depends on two major factors:
A. Training Phase – Quality of Training Data
Model learns patterns from the data it is trained on.
High-quality, diverse, accurate, representative datasets → better generalization.
Poor or incomplete data → reduced performance.
B. Inference Phase – Quality of Context
Even a well-trained model produces weak results if the prompt/context is vague.
Clear, complete prompts are necessary for accurate responses.
Both good training data AND rich inference context are essential.
2. Challenges With Standard Large
Language Models
LLMs face two key limitations:
1. Lack of access to specific or updated data
LLMs are trained on large public datasets.
After training, they become static → cannot access new or external data.
This leads to:
o Outdated answers
o Hallucinations
o Incorrect responses for information not in their training set
2. AI applications need custom / organization-specific data
Real-world applications require company-specific knowledge.
Examples:
o Customer support bots must answer using company data
o Internal HR bots must answer using HR policies
Retraining LLMs is expensive, slow, and impractical.
Therefore, we need a way to give the model external, domain-specific, up-to-date data
without retraining.
3. What is Retrieval Augmented Generation
(RAG)?
RAG is an architectural technique that combines:
Retrieval + Generation
Retrieval system → fetches relevant documents from an external knowledge base
Generative model → uses the retrieved content to produce informed responses
RAG gives LLMs access to custom, updated, and precise data during inference.
Benefits
Reduces hallucinations
Produces contextually accurate answers
Useful for chatbots, Q&A systems, knowledge assistants, domain-specific tools
4. How RAG Works – Step-by-Step
Procedure
Step 1: Data Preparation
Gather documents + metadata
Preprocess them (cleaning / removing PII / redacting sensitive fields)
Split documents into chunks (manageable segments)
o Chunk size depends on embedding model and LLM needs
Goal: Prepare clean, chunked data ready for embedding.
Step 2: Indexing the Data
Create embeddings → numerical vectors representing semantic meaning
Store embeddings in a vector database / vector search index
Vector index enables semantic similarity search—not keyword matching
This allows fast and accurate retrieval of relevant chunks.
Step 3: Retrieval During Querying
When the user submits a query:
1. Query is converted into an embedding
2. System searches the vector index
3. Retrieves the most relevant chunks
4. Retrieved chunks are added to the LLM prompt
This enriched prompt gives the model accurate context → better responses.
Analogy: Google Search
Google crawls → processes → indexes → retrieves → ranks
RAG retrieval works similarly, but with semantic vectors.
Step 4: Build the LLM Application
Combine:
o Prompt augmentation (query + retrieved text)
o LLM response generation
Wrap it in a REST API or endpoint
Use it in applications like:
o Chatbots
o Q&A systems
o Internal knowledge assistants
With enriched context, applications give precise, relevant, updated answers.
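The four steps above can be sketched end to end in plain Python. This is a toy illustration: the bag-of-words `embed` stands in for a real embedding model, the in-memory list stands in for a vector database, and the final LLM call is omitted. All documents and names are hypothetical:

```python
import math
from collections import Counter

# Toy embedding: bag-of-words counts. Real RAG systems use a learned
# embedding model and a vector database instead.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-2: prepare and "index" the document chunks
chunks = [
    "Employees get 20 days of paid leave per year.",
    "The office cafeteria is open from 9am to 5pm.",
    "Paid leave requests must be approved by a manager.",
]
index = [(c, embed(c)) for c in chunks]

# Step 3: retrieve the most relevant chunks for a query
def retrieve(query, top_n=2):
    q = embed(query)
    ranked = sorted(index, key=lambda ce: cosine(q, ce[1]), reverse=True)
    return [c for c, _ in ranked[:top_n]]

# Step 4: augment the prompt (the LLM call itself is omitted here)
query = "How many days of paid leave do employees get?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The retrieved leave-policy chunks land in the prompt, so the model answers from the organization's data rather than from its static training set.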
5. Summary of the RAG Architecture
RAG follows this pipeline:
1. Data Preparation
o Collect documents → clean → preprocess → chunk
2. Embedding + Indexing
o Convert chunks to embeddings
o Store in vector database
3. Retrieval at Inference
o Matching chunks retrieved
o Added to prompt
4. Augmented Generation
o LLM uses enhanced context
o Produces accurate, domain-specific responses
5. Deployment
o Packaged into an endpoint for easy integration into apps
RAG ensures responses are:
Accurate
Updated
Domain-specific
Grounded in actual data, not hallucination
Module 6 – RAG: Vector Databases
Generative AI – Prof. Naveen Kumar Bhansali
1. What Are Vector Databases?
Vector databases are specialized systems designed to:
Store
Manage
Search
data represented as vectors (lists of numbers).
These vectors can represent:
Text
Images
Audio
Any high-dimensional data
Vector representations allow operations such as:
Similarity search
Clustering
Nearest neighbor search
The main goal:
👉 Enable extremely fast and efficient similarity search in high-dimensional spaces.
2. Why Do We Need Vector Indexing?
Without specialized indexing:
Searching through millions of vectors becomes computationally expensive.
Common indexing techniques used:
1. Approximate Nearest Neighbors (ANN)
Used for fast retrieval of approximate but highly relevant nearest neighbors.
2. HNSW (Hierarchical Navigable Small World Graphs)
Graph-based indexing
Supports fast and accurate similarity search
3. FAISS (Facebook AI Similarity Search)
Developed by Meta
Highly optimized library for vector search and clustering
Supports GPU acceleration
3. Vector Search Index vs. Vector Database
Vector Search Index
A component used to speed up similarity search
Typically sits inside search engines or recommendation systems
Focuses only on indexing and fast retrieval
Vector Database
A full-fledged database that includes:
Persistent storage
Indexing
Querying
Security
Scalability
Consistency
Integration with other systems
Goal: Manage vectorized data end-to-end, not just search.
4. Key Components of a Vector Database
A. Data Storage
Includes:
1. Vector Storage
Efficiently stores large volumes of high-dimensional vectors
Uses optimized formats and compressed structures
2. Metadata Storage
Stores additional information like:
IDs
Timestamps
Labels
Categories
Metadata enables:
Filtering
Complex queries
Hybrid searches (vector + metadata)
B. Indexing
1. Vector Indexing Techniques
HNSW: Graph-based structure for fast nearest neighbor search
IVF (Inverted File Index):
o Divides vector space into clusters
o Searches only relevant clusters
Product Quantization:
o Compresses vectors
o Speeds up distance calculations
2. Dynamic Indexing
Allows adding/removing vectors
Does not require rebuilding the entire index
Important for real-time applications
C. Query Processing
1. Similarity Search
Uses distance metrics:
Cosine similarity
Euclidean distance
Dot product
Goal: find vectors closest to the query vector.
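A minimal sketch of the three metrics on hypothetical vectors. In production these comparisons run inside the index structure (e.g. FAISS), not in Python loops:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 2.0, 0.0]
docs = {"d1": [1.0, 2.1, 0.1], "d2": [-1.0, 0.5, 3.0]}

# Nearest neighbor under each metric
# (higher similarity wins; lower distance wins)
best_cos = max(docs, key=lambda d: cosine_similarity(query, docs[d]))
best_euc = min(docs, key=lambda d: euclidean_distance(query, docs[d]))
best_dot = max(docs, key=lambda d: dot(query, docs[d]))
```

Here all three metrics agree that d1 is closest; on normalized embeddings cosine similarity and dot product rank identically, while Euclidean distance also accounts for vector magnitude.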
2. Filtering & Re-ranking
Supports:
Metadata-based filtering
Re-ranking results using similarity scores or other criteria
3. Batch Queries
Multiple vectors queried at once → increases efficiency
D. Scaling and Distribution
1. Sharding
Data split across multiple servers
Enables handling very large datasets
2. Load Balancing
Distributes queries across nodes
Reduces response time
3. Replication
Copies data across nodes
Ensures:
o High availability
o Fault tolerance
E. Integration with ML Pipelines
1. Data Ingestion
ML models generate embeddings
Database ingests vectors + metadata automatically
2. Model Updates
When ML models change, embeddings may change
Vector database must update stored vectors and indices
3. Real-time Inference
Used in real-time applications such as:
Recommendation systems
Personalization engines
Fraud detection
Semantic search
New data can be ingested instantly, and results retrieved immediately.
5. Examples of Vector Databases
Common vector database systems include:
Pinecone
Milvus
ChromaDB
Weaviate
These provide:
Storage
Indexing
APIs
Scalability
Integration with AI tools
6. Use Cases of Vector Databases
1. Recommendation Systems
Suggest similar products
Identify user preference patterns
Used in e-commerce, music, movies, etc.
2. Image & Video Search
Retrieve visually similar images/videos using embeddings
3. Natural Language Processing
Semantic similarity between documents/sentences
Used in chatbots and Q&A systems
4. Anomaly Detection
Identify vectors that are significantly different from the norm
Useful for fraud detection and cybersecurity
7. Summary
A vector search index improves search speed.
A vector database provides:
Storage
Indexing
Query processing
Distribution
ML integration
Vector databases have become essential in AI and machine learning due to the rise of:
Embeddings
High-dimensional data
Real-time applications
Semantic search
They are a critical component in RAG architecture, enabling fast retrieval of relevant
knowledge chunks.
📘 Module 6 — LangChain (Detailed Notes)
Generative AI – Prof. Naveen Kumar Bhansali
1. What is LangChain?
LangChain is a framework that simplifies building applications using Large Language
Models (LLMs) such as GPT.
It provides tools, abstractions, and integrations to help developers build context-aware,
data-driven, and multi-step LLM applications.
LangChain makes it easier to build systems that require:
advanced prompting
memory
sequential reasoning
interactions with external tools
integration with other systems (APIs, databases)
2. Key Features of LangChain
a) Prompt Templates
Allows creation of reusable, structured prompts.
Ensures consistency in how prompts are written for different tasks.
Useful for standardizing complex prompt patterns.
b) Memory
LangChain supports memory management, allowing LLMs to retain context across:
o multiple interactions
o sessions
o turns in a conversation
Essential for chatbots and multi-step applications.
c) Agents
Agents use LLMs to reason, decide, and act.
Works in a loop:
1. Model interprets the situation
2. Decides a next action
3. Executes relevant tool
4. Produces an output
Enables dynamic behavior instead of fixed sequences.
d) Tools
LangChain can integrate with external:
o APIs
o Databases
o Calculators
o Search engines
Allows LLMs to fetch data, perform operations, or interact with the environment.
3. How LangChain Works
Step 1: Building Blocks
LangChain provides fundamental components such as:
Prompt Templates
Memory classes
Chains (multi-step sequences)
Developers use these blocks to build complex applications.
Step 2: Combining Components
Components are combined to form chains.
Chains represent a sequence of operations the LLM executes.
Example:
1. Take user input
2. Process with a prompt template
3. Query an API
4. Summarize using the LLM
Step 3: Executing Chains
When a user input arrives:
o The chain runs step-by-step
o Uses memory when needed
o Uses tools or APIs
o Produces structured output
Step 4: Interaction Loop
Used in applications requiring multiple steps or continuous communication.
Example: chatbots.
LangChain handles:
o context retention
o conversation flow
o multi-step reasoning and execution
Ensures smooth, coherent multi-turn interactions.
4. Why LangChain is Powerful
LangChain provides:
abstraction over complex LLM logic
structured pipelines for reasoning
integration with external systems
persistent and flexible memory
easier development of advanced AI applications
dynamic agent-based decision-making
5. Summary
LangChain is a framework that:
✔ simplifies building LLM-based applications
✔ provides prompt templates, memory, agents, tools
✔ supports multi-step workflows using chains
✔ manages interaction loops in conversational or task-based systems
✔ integrates external APIs, databases, and utilities
✔ enables sophisticated AI applications with minimal overhead
📘 Module 6 — LangChain: Chunking
Strategy (Detailed Notes)
Generative AI – Prof. Naveen Kumar Bhansali
1. What is Chunking?
Chunking refers to the process of splitting large text into smaller, manageable pieces called
chunks.
Why it is needed:
LLMs have token/character limits.
Large documents cannot be processed as a whole.
Well-designed chunks ensure the model retains context and understands the content.
Goal:
✔ break text into pieces that are small enough for processing
✔ but large and meaningful enough to preserve context
2. Key Parameters in Chunking
a) Chunk Size
Maximum number of characters or tokens allowed per chunk.
Example: If chunk size = 100 characters → every chunk ≤ 100 characters.
Determines the length of each chunk.
b) Chunk Overlap
Number of characters/tokens repeated between consecutive chunks.
Ensures continuity and avoids losing context at boundaries.
Example:
If overlap = 20 characters:
Last 20 characters of chunk 1
→ repeated at the beginning of chunk 2.
This avoids cutting sentences in unnatural places.
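A minimal plain-Python sketch of how chunk size and chunk overlap interact (this is an illustration of the idea, not LangChain's actual splitter):

```python
def chunk_text(text, chunk_size=100, chunk_overlap=20):
    """Fixed-size character chunks; each chunk repeats the last
    `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# 250 characters of varied sample text
text = "".join(chr(97 + i % 26) for i in range(250))
chunks = chunk_text(text, chunk_size=100, chunk_overlap=20)
# Chunks start at positions 0, 80, 160; each is at most 100 characters,
# and each shares its first 20 characters with the end of the previous chunk.
```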
3. Character Text Splitter
A method to split text based on a specific character or separator.
Example separators:
\n (newline) → splits by lines or paragraphs
(space) → splits by words
custom characters (comma, period, symbols, etc.)
Example text:
Line 1: The quick brown fox
Line 2: Jumps over the lazy dog
Line 3: And runs away swiftly
If we use \n as separator → the text splits into 3 chunks (one per line).
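The example above can be reproduced directly with Python's built-in `split`:

```python
text = (
    "The quick brown fox\n"
    "Jumps over the lazy dog\n"
    "And runs away swiftly"
)

# Splitting on the newline separator yields one chunk per line
chunks = text.split("\n")
# → ["The quick brown fox", "Jumps over the lazy dog", "And runs away swiftly"]
```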
4. Recursive Character Text Splitting
(RCT)
A hierarchical chunking method that splits text step-by-step using multiple separators
arranged by importance.
Separators usually used in this order:
1. \n\n → paragraph breaks
2. \n → line breaks
3. (space) → words
4. characters → smallest units
Process
1. Split by double newline \n\n
o Each paragraph becomes a chunk.
2. If a paragraph-chunk is still too large → split using \n (single newline)
o Divides paragraphs into individual lines.
3. If still too large → split using spaces
o Breaks down lines into words or short phrases.
4. If still too large → split into characters
o Last resort to ensure all chunks fit within size limits.
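The recursive process can be sketched as follows. This is a simplified illustration of the idea, not LangChain's implementation; for brevity, separators are dropped from the output:

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ", "")):
    """Split `text` into pieces of at most `max_len` characters,
    trying the largest separator first and recursing on a smaller
    separator only when a piece is still too long."""
    if len(text) <= max_len:
        return [text]
    sep, *rest = separators
    if sep == "":
        # Last resort: hard character split
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    chunks = []
    for part in text.split(sep):
        chunks.extend(recursive_split(part, max_len, tuple(rest) or ("",)))
    return [c for c in chunks if c]

doc = ("First paragraph here.\n\n"
       "Second paragraph is quite a bit longer.\n"
       "It has two lines.")
chunks = recursive_split(doc, max_len=40)
# The first paragraph fits as a whole; the second exceeds 40 characters,
# so only that paragraph is split further, at line breaks.
```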
5. Why Recursive Splitting Works
✔ Maintains maximum context by using largest meaningful separators first
✔ Avoids breaking important sentences or phrases prematurely
✔ Only uses smaller separators when absolutely necessary
✔ Produces chunks that are:
coherent
contextual
within size constraints
6. Summary
Chunking Strategy in LangChain ensures that large text is divided into context-preserving,
size-compatible, and LLM-friendly chunks.
It uses techniques like:
chunk size
chunk overlap
character text splitting
recursive character text splitting
This guarantees efficient and meaningful text processing in RAG and LLM applications.
📘 Module 6 — LangChain: Memory &
Retrieval Strategy (Detailed Notes)
Generative AI – Prof. Naveen Kumar Bhansali
1. What is Memory in LangChain?
Memory in LangChain allows applications—especially conversational agents—to retain
context across multiple interactions.
Why it matters:
Conversational agents need to remember past queries, decisions, and responses
Produces coherent, context-aware output
Enables multi-turn interactions
LangChain memory supports two core actions:
a) Reading (Retrieving Information)
Retrieves relevant past interaction data.
b) Writing (Storing Information)
Stores new information for future interactions.
Both actions happen within the chain execution pipeline, ensuring every new input is
influenced by past context.
2. Types of Memory in LangChain
2.1 Conversational Buffer Memory
Stores entire conversation history.
Includes all user inputs + system responses.
History is stored in a variable accessible during processing.
Use cases:
✔ Customer support systems
✔ Long-running conversations requiring full historical context
2.2 Conversation Buffer Window Memory
Stores only the last k interactions.
Works like a sliding window:
o New messages added
o Oldest ones removed
Still stores both user and system messages.
Use cases:
✔ Casual chatbots
✔ When only recent context matters
✔ Memory-efficient long sessions
2.3 Conversation Token Buffer Memory
Controls memory based on token count, not number of messages.
Stores messages until a token limit is exceeded.
Once exceeded → oldest messages discarded.
Use cases:
✔ When LLM has strict token limits
✔ Long inputs with varying token lengths
✔ Fine-grained control over memory usage
2.4 Conversation Summary Memory
Creates and maintains a summary instead of storing each message.
After every interaction, the summary is updated to reflect new information.
How it works:
1. System generates summary based on past interactions
2. Updates summary after each turn
3. Stores only essential information
Use cases:
✔ Very long conversations
✔ Applications requiring high-level continuity
✔ Decision-based workflows or narrative systems
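A sliding-window memory like type 2.2 can be sketched in a few lines (a simplified illustration of the concept, not LangChain's ConversationBufferWindowMemory class):

```python
from collections import deque

class BufferWindowMemory:
    """Keeps only the last k exchanges; older messages are discarded
    automatically as new ones arrive (sliding window)."""
    def __init__(self, k):
        self.turns = deque(maxlen=2 * k)  # k user + k system messages

    def write(self, role, message):
        self.turns.append((role, message))

    def read(self):
        return "\n".join(f"{role}: {msg}" for role, msg in self.turns)

mem = BufferWindowMemory(k=2)
mem.write("user", "Hi")
mem.write("ai", "Hello!")
mem.write("user", "What's RAG?")
mem.write("ai", "Retrieval Augmented Generation.")
mem.write("user", "Thanks")  # oldest message ("user: Hi") is dropped
```

Buffer memory (2.1) would be the same class without `maxlen`; token buffer memory (2.3) would evict based on a running token count instead of a message count.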
3. Introduction to RAG (Retrieval
Augmented Generation)
RAG combines:
Retrieval-based methods → fetch relevant information
Generation-based models → produce final output
Retrieval strategy determines what information is fed to the LLM.
4. Retrieval Strategy in RAG
Step 1: Query Formation
Transform user input into a search query.
Step 2: Document Retrieval
Fetch relevant documents from a pre-indexed corpus.
Methods:
o Similarity Search
o Maximal Marginal Relevance (MMR)
Step 3: Document Selection
Choose most relevant or diverse documents based on:
o relevance
o diversity
o ranking scores
Step 4: Augmentation
Provide selected documents as context to the LLM.
LLM generates the final enhanced response.
5. Similarity Search
Technique where retrieved documents are those most similar to the query.
Uses similarity metrics such as cosine similarity
Compares embeddings of query vs. documents
Example:
Query: “What are the benefits of renewable energy?”
Retrieved:
Document on solar energy benefits
Document on wind power
Document on environmental impact
Result:
Highly relevant documents (but may be redundant)
6. Maximal Marginal Relevance (MMR)
Balances relevance + diversity.
Goal:
Avoid redundancy
Retrieve documents that cover different aspects of the query
Example:
Query: “What are the benefits of renewable energy?”
MMR retrieves:
One document on environmental benefits
One document on economic advantages
One on sustainability impact
Result:
Broader, more informative context
Less repetition
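Greedy MMR selection can be sketched as follows. The embeddings are hypothetical, with d1 and d2 chosen as near-duplicates so the diversity term visibly changes the result:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, n, lam=0.5):
    """Greedy MMR: trade off similarity to the query (relevance)
    against similarity to already-selected docs (redundancy)."""
    selected, remaining = [], list(doc_vecs)
    while remaining and len(selected) < n:
        def score(d):
            relevance = cosine(query_vec, doc_vecs[d])
            redundancy = max((cosine(doc_vecs[d], doc_vecs[s])
                              for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical embeddings: d1 and d2 are near-duplicates, d3 differs
docs = {"d1": [1.0, 0.1], "d2": [1.0, 0.12], "d3": [0.2, 1.0]}
query = [1.0, 0.3]

picked = mmr(query, docs, n=2)
# Plain similarity search would return the duplicates d1 and d2;
# MMR picks one of them plus the dissimilar d3.
```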
7. Summary Comparison
| Feature | Similarity Search | MMR (Maximal Marginal Relevance) |
|---|---|---|
| Focus | Relevance | Relevance + Diversity |
| Output | May be repetitive | Covers multiple perspectives |
| Best for | Focused answers | Broad coverage, diverse insights |
8. Next Step
Now that chunking, memory types, and retrieval methods are understood, you can integrate
them to build:
✔ A question-answering system
✔ A full chatbot using LangChain + RAG
📘 Module 7 — Instruction Tuned Models
(Clean Notes)
1. Background: Large Language Models (LLMs)
➡ Large Language Models (LLMs)
Examples: GPT-3, GPT-4, etc.
Trained on massive datasets containing diverse internet text.
LLM Training Objective
Predict the next word in a sentence.
This enables the model to learn:
o Grammar
o World knowledge
o Reasoning patterns
o Semantic relationships
Strengths of LLMs
Generate human-like text
Answer questions
Complete prompts
Handle open-ended tasks
Limitations of LLMs
LLMs may struggle when:
Instructions are complex
Tasks are specific
Prompts require precision
Instructions need step-by-step execution
Reason:
LLMs are primarily trained using unsupervised learning, without explicit guidance for
specific tasks.
2. What Are Instruction-Tuned Models?
Instruction-tuned models are LLMs fine-tuned on datasets containing instruction–response
pairs.
Why Instruction Tuning?
It helps the model:
Follow user instructions more accurately
Understand user intent better
Give more task-oriented, relevant outputs
Reduce hallucinations
Respond more consistently
How Instruction Tuning Works
After the LLM is pre-trained:
1. A supervised fine-tuning dataset is created.
Each entry has:
o Instruction
o Correct response
2. The model is trained to produce the right output explicitly for the given instruction.
Where Are Instruction-Tuned Models Useful?
Tasks requiring precise instruction following, such as:
Summarization
Translation
Code generation
Question answering
Data extraction
3. LLM vs Instruction-Tuned Model — Difference
| LLMs | Instruction-Tuned Models |
|---|---|
| General-purpose | Task-oriented |
| Good for open-ended generation | Good for precise instructions |
| Unsupervised training | Supervised fine-tuning |
| May misinterpret vague instructions | Understands instructions better |
| Needs careful prompting | Works reliably even with simple instructions |
4. Instruction Tuning Research: Key Paper
"Finetuned Language Models Are Zero-Shot Learners"
Important points from the paper:
Instruction tuning greatly improves zero-shot performance on unseen tasks.
Instead of building datasets from scratch, the authors:
o Took 62 public NLP datasets (from TensorFlow Datasets)
o Converted them into instruction format
For each dataset:
o 10 different templates were manually created
o Templates use natural language to explain the task
o Up to 3 templates reversed the task
Example: Sentiment Classification Template
o Forward template: “Label this review as positive or negative.”
o Reversed template: “Write a movie review with negative sentiment.”
The reversed template turns classification (classify a review's sentiment) into generation (write a review with a given sentiment).
This increases robustness and generalization during training.
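A forward and a reversed template for the sentiment example might look like this (the record and field names are hypothetical, for illustration only):

```python
# Hypothetical record from a sentiment classification dataset
record = {"review": "A dull, lifeless film.", "label": "negative"}

# Forward template: classify the given review
forward = {
    "instruction": ("Is the sentiment of this movie review positive "
                    f"or negative?\n\n{record['review']}"),
    "response": record["label"],
}

# Reversed template: generate a review with the given sentiment
reversed_template = {
    "instruction": f"Write a movie review with {record['label']} sentiment.",
    "response": record["review"],
}
```

Both instruction-response pairs go into the fine-tuning set, so the model learns the task in multiple phrasings and directions.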
5. Why Instruction Tuning Requires Less Compute
Pre-training LLMs = extremely expensive
Instruction tuning = done on small, supervised datasets
This phase is called the “metaphorical dessert” compared to the heavy pre-training
“main course.”
6. Examples of Instruction-Tuned Models
✔ FLAN-T5
Based on Google’s T5 model
Further fine-tuned using FLAN instruction datasets
Follows user instructions far more reliably
✔ PaLM → FLAN-PaLM
PaLM is the base LLM
FLAN-PaLM is the instruction-tuned version
Better at structured tasks, prompting, chain-of-thought
✔ BLOOM → BLOOMZ
BLOOM = multilingual foundation model
BLOOMZ = instruction-tuned for multilingual tasks
Strong at following instructions in many languages
7. Module Summary (Perfect for Exams)
LLMs are powerful but not always instruction-following by default.
Instruction tuning improves their ability to understand and execute explicit
instructions.
This is done by fine-tuning models on instruction–response datasets.
The tuning process is supervised and cheaper than pre-training.
Instruction-tuned models significantly improve zero-shot performance.
Popular examples: FLAN-T5, FLAN-PaLM, BLOOMZ.
Instruction tuning increases consistency, task accuracy, and user-friendliness.
📘 Module 7 — Instruction Tuned Models
Full Fine-Tuning (Detailed Notes)
1. What is Full Fine-Tuning?
Full fine-tuning means updating all parameters of a pre-trained large language
model (LLM) using a new, smaller, task-specific dataset.
Purpose: Adapt the entire model so it performs extremely well on a specific task.
2. Process of Full Fine-Tuning
a) Pre-training (Initial step)
Model is trained on a massive, diverse dataset.
Learns:
o Grammar
o World knowledge
o Reasoning
o General language patterns
b) Full Fine-tuning (Second step)
The entire model (all weights + all layers) is trained further on a target dataset.
This changes the full parameter set to adapt fully to the new domain/task.
3. Advantages of Full Fine-Tuning
✔ Highly Specialized Model
Achieves state-of-the-art performance on the specific task.
Captures task-specific nuances extremely well.
✔ Maximum Adaptability
Since all weights update, model fully absorbs knowledge from the new dataset.
✔ Works best when task requires deep domain understanding
e.g., legal document classification
e.g., medical text summarization
4. Disadvantages of Full Fine-Tuning
❌ Computationally Expensive
Requires:
o High-end GPUs/TPUs
o Large memory
o Significant time
Costly in both money and energy consumption (environmental impact).
❌ Low Scalability
For every new task:
o A separate full fine-tuned model must be created.
Increases complexity in:
o Deployment
o Updating
o Storage
o Maintenance
❌ Lack of Flexibility
A model fine-tuned for one task may lose performance on general tasks.
5. Catastrophic Forgetting
What is it?
When fine-tuned fully on a narrow dataset, the model:
o Forgets general knowledge learned during pre-training.
o Performance on original tasks drops significantly.
Example
Pre-trained general LLM → fully fine-tuned on medical corpus
→ becomes excellent at medical tasks
→ BUT poor at general language tasks (e.g., storytelling, casual chat).
Why catastrophic forgetting happens
Full fine-tuning overwrites weights that were important for general tasks.
6. Why Full Fine-Tuning Is a Problem for Large LLMs
Large LLMs are expected to be multi-task, but:
o Full fine-tuning makes them task-specific.
o They may lose broad capability.
7. Solution to These Problems
⭐ Parameter Efficient Fine Tuning (PEFT)
Instead of updating all parameters, PEFT updates only a small number of additional
or selected parameters.
Prevents catastrophic forgetting.
Reduces compute cost.
(You will likely study: LoRA, LoRA+, Prefix Tuning, P-Tuning, Adapter Layers, etc.)
Summary Table
| Topic | Key Points |
| --- | --- |
| Full Fine-Tuning | Update all parameters of the LLM |
| Pros | Highest performance for specific tasks |
| Cons | Expensive, time-consuming, risk of catastrophic forgetting |
| Issue | Not scalable for multiple tasks |
| Solution | Use PEFT methods |
📘 Module 7 — Parameter-Efficient Fine-
Tuning (PEFT)
(Complete Detailed Notes)
1. What is PEFT?
Parameter-Efficient Fine-Tuning (PEFT) refers to a set of techniques that allow
fine-tuning a large language model by updating only a very small subset of its
parameters.
Goal:
o Adapt the model to new tasks
o While keeping most parameters frozen
o Thus reducing compute, memory, and storage cost
2. Why PEFT Is Needed (Motivation)
Traditional full fine-tuning updates all the model parameters and faces problems like:
High computational cost
High memory usage
Need to store separate model copies for each task
Risk of catastrophic forgetting
PEFT solves these issues by updating only a small fraction of parameters (typically under 1–5%).
3. Adapters — The Core PEFT Technique
Adapters are one of the most practical and widely-used PEFT techniques.
What are adapters?
Small neural modules added inside a pre-trained model (usually inside each
Transformer layer).
When fine-tuning:
o Only the adapter parameters are updated
o All original model parameters remain frozen
4. How Adapters Work (Functioning)
1. Start with a pre-trained model (e.g., BERT, GPT, T5).
2. Insert small adapter layers at certain points in each Transformer layer.
3. During fine-tuning:
o The input flows through the main (frozen) model as usual.
o Wherever an adapter is placed, the data additionally goes through the adapter
layer.
o Only the adapter’s weights are trained.
4. The pre-trained model remains unchanged → prevents catastrophic forgetting.
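The data flow above can be sketched as a bottleneck adapter with a residual connection. This is a minimal numpy sketch with assumed toy dimensions, not a specific library's implementation:

```python
import numpy as np

# Minimal adapter sketch: a down-projection to a small bottleneck, a
# nonlinearity, an up-projection back to the model width, and a residual
# connection so the frozen layer's output passes through almost unchanged
# when the adapter weights are small. Shapes are toy values.
rng = np.random.default_rng(0)
d_model, d_bottleneck = 8, 2            # adapter is tiny vs. the layer

W_down = rng.normal(scale=0.01, size=(d_model, d_bottleneck))  # trainable
W_up = rng.normal(scale=0.01, size=(d_bottleneck, d_model))    # trainable

def adapter(h):
    # h: output of a frozen transformer sublayer, shape (batch, d_model).
    # Only W_down and W_up would receive gradients during fine-tuning.
    return h + np.maximum(h @ W_down, 0) @ W_up  # residual + ReLU bottleneck

h = rng.normal(size=(4, d_model))
out = adapter(h)
print(out.shape)   # same shape as the input, so the layer slots in-place
```

The adapter here has 2 × 8 × 2 = 32 parameters, versus 64 for even a single 8 × 8 weight matrix in the frozen layer; at realistic model widths the ratio is far more dramatic.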
5. Why Adapters Are Efficient
✔ Only a small number of parameters are updated
Training is:
o Faster
o Cheaper
o Requires far less memory
✔ Multiple tasks can be supported
For each new task:
o Add a new adapter module
o Train only that adapter
No need to duplicate the entire model → huge storage savings.
✔ Easy task switching
Just load the appropriate adapter for the task.
The base model stays the same for all tasks.
6. Advantages of Using Adapters
| Advantage | Explanation |
| --- | --- |
| Low compute cost | Only small modules are trained |
| Low memory usage | Base LLM remains frozen |
| Storage efficient | You store only tiny adapter weights, not full models |
| Avoids catastrophic forgetting | Pre-trained weights stay intact |
| Supports multi-task systems | Swap adapters for different tasks |
| Highly scalable | Ideal for organizations needing many task-specific models |
7. Summary (One-Liner Revision)
PEFT = Fine-tuning only small, additional modules (like adapters) instead of the whole
model → cheaper, faster, and avoids catastrophic forgetting while enabling multi-task
support.
📘 Parameter-Efficient Fine-Tuning (PEFT)
(Complete Notes – Module 7, Prof. Naveen Kumar Bhansali)
1. What is PEFT?
Parameter-Efficient Fine-Tuning (PEFT) refers to techniques that allow fine-tuning
a large language model by making minimal changes to its parameters.
Instead of updating the full model, PEFT updates only a small subset of parameters.
Purpose:
o Adapt the model to new tasks
o While using very little compute and very little storage
2. Why PEFT?
Large pre-trained LLMs have billions of parameters.
Fully updating them for every task is:
o Expensive
o Slow
o Memory-heavy
o Hard to store and deploy
PEFT aims to overcome this by modifying only a small number of parameters.
3. Adapters — A Key PEFT Technique
Adapters are small additional layers added inside a pre-trained model.
They are a practical and widely used PEFT method.
Purpose of adapters
Enable task-specific adaptation
Make fine-tuning:
o More computationally efficient
o More storage efficient
Allow the model to generalize across tasks while keeping the main model intact
4. Where Are Adapters Added?
They are inserted within each layer of a Transformer model, typically after the
feed-forward or attention blocks.
The adapter modules are small compared to the full layer.
5. How Adapters Work During Fine-Tuning
1. The pre-trained model remains frozen.
None of its original parameters change.
2. Only adapter parameters are trained.
These are tiny compared to the full model.
3. Data flow:
o Input goes through the regular pre-trained layers.
o At locations where adapters are inserted:
The data also passes through the adapter module.
The adapter learns the task-specific transformation.
4. This allows the model to adapt to the new task without changing the main pre-
trained weights.
6. Efficiency Benefits of Adapters
Only a very small number of parameters are updated →
Less computational power is needed.
Lower memory usage during training.
Much cheaper than full fine-tuning.
Original model stays intact → helps avoid catastrophic forgetting.
7. Multi-Task Support Using Adapters
Each task can have its own adapter module.
You only store the adapters, not separate full models.
Switching between tasks is easy:
o Just swap the adapter for the corresponding task.
Enables efficient multi-task systems using one base model.
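Task switching with one frozen base model can be sketched as keeping a dictionary of per-task adapters. The names and string outputs below are purely illustrative stand-ins, not a real framework API:

```python
# Multi-task serving sketch: one frozen base model, many small adapters.
# Swapping tasks means swapping which adapter post-processes the base output.
class BaseModel:
    def forward(self, text, adapter=None):
        h = f"base({text})"                  # stand-in for the frozen computation
        return adapter(h) if adapter else h

# One tiny adapter per task; in practice these would be small weight tensors
# loaded from disk, not lambdas.
adapters = {
    "sentiment": lambda h: f"sentiment_adapter({h})",
    "ner":       lambda h: f"ner_adapter({h})",
}

model = BaseModel()
print(model.forward("great match", adapter=adapters["sentiment"]))
print(model.forward("great match", adapter=adapters["ner"]))
```

Only the adapter changes between calls; the base model (and its memory footprint) is shared across all tasks.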
8. Summary (One Sentence)
PEFT allows efficient fine-tuning by adding small adapter modules and training only
those, while keeping the main model frozen—making it cheaper, faster, and easier to
handle multiple tasks.
📘 PEFT – LoRA (Low-Rank Adaptation)
Module 7 – Prof. Naveen Kumar Bhansali
1. What is LoRA?
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique for
large language models.
It works by updating only low-rank matrices instead of updating the full large
weight matrices.
Main goals:
o Reduce number of learnable parameters
o Make fine-tuning efficient in computation and memory
o Maintain or even improve task performance
2. Why LoRA?
Transformer models contain very large weight matrices, especially in:
o Self-attention: Query, Key, Value (Q, K, V) matrices
o Feed-forward layers
Full fine-tuning updates billions of parameters → expensive and slow.
LoRA solves this by updating only a small low-rank decomposition, not the entire
matrix.
3. Weight Matrices in Transformers
A typical transformer layer has weight matrices denoted as W.
These matrices often have dimensions:
D × D, where D = hidden size.
4. The Key Idea of LoRA: Low-Rank Decomposition
Instead of updating the full matrix W, LoRA expresses the update as the product of two
much smaller matrices:
Matrix A
Shape: r × D
Learned during fine-tuning
Matrix B
Shape: D × r
Also learned during fine-tuning
Where r is the rank and r << D (very small).
5. Re-parameterization
The original weight matrix is not modified.
We compute an updated matrix:
W′ = W + ΔW
Where
ΔW = B × A
W = original pretrained weight matrix → kept frozen
A and B = newly learned low-rank parameters
ΔW has the same dimensions as W: B × A is (D × r) × (r × D) = D × D, even though it is
built from small matrices.
This addition of B × A is called re-parameterization.
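A minimal numpy sketch of this re-parameterization, using the standard LoRA convention (B: D × r, A: r × D, so that B × A is D × D) and toy sizes chosen for illustration:

```python
import numpy as np

# LoRA re-parameterization sketch. W stays frozen; only A and B would be
# trained. D and r are toy values, with r << D.
rng = np.random.default_rng(0)
D, r = 512, 8

W = rng.normal(size=(D, D))      # frozen pretrained weight, D x D
B = rng.normal(size=(D, r))      # trainable, D x r
A = rng.normal(size=(r, D))      # trainable, r x D

delta_W = B @ A                  # D x D update built from two small matrices
W_prime = W + delta_W            # merged weight used at inference

full_params = W.size             # what full fine-tuning would train
lora_params = A.size + B.size    # what LoRA actually trains
print(W_prime.shape, lora_params, full_params)
```

With D = 512 and r = 8 this trains 8,192 parameters instead of 262,144 for the full matrix, roughly a 32× reduction, and the savings grow with D.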
6. Where LoRA Is Applied
LoRA can be added to any transformer weight matrix, but primarily used in:
Q (Query) projections
K (Key) projections
V (Value) projections
Feed-forward layers
These matrices are large → ideal for low-rank adaptation.
7. After Fine-Tuning
The final matrix used by the model is:
W′ = W + B×A
This merged matrix replaces W during inference.
Architecture remains unchanged.
8. Why It’s Still Efficient
Even though we introduce new A and B matrices:
Only A and B are trained
W remains frozen
W′ has same dimension as W → model expressiveness is preserved.
9. Advantages of LoRA
✔ 1. Huge Reduction in Parameters
Only the low-rank A and B matrices are learned → drastically fewer parameters.
✔ 2. Much Lower Compute Requirements
Smaller gradients
Less GPU memory
Faster fine-tuning
✔ 3. No Changes to the Original Model
Base model remains intact and frozen.
✔ 4. Multi-Task Capability
For each task, only store different A and B matrices.
Same base model can support multiple tasks by swapping LoRA modules.
✔ 5. Maintains Model Expressiveness
Because we add ΔW rather than replace W, the full capacity of the pretrained model is
retained.
10. One-Line Summary
LoRA fine-tunes transformer models by learning small low-rank matrices (A and B)
and adding them to frozen pretrained weights, achieving highly efficient fine-tuning
with minimal computational overhead.
📘 Notes: Word Embeddings — Dense
Representations & Latent Factors
Why Dense Embeddings?
Earlier methods like one-hot encoding were:
Sparse (mostly zeros)
High-dimensional
Did not capture context or semantic meaning
Could not express relationships between words
To overcome this, embeddings represent each word as a dense vector of learned numbers.
These numbers encode:
Context
Semantic meaning
Relationships between words
📌 Understanding the Example Embedding
Matrix
The matrix shown has:
Rows → latent factors (hidden dimensions)
Columns → words (each column is one word's vector across all factors)
The matrix is dense (no zero-dominance), indicating meaningful learned weights.
Interpreting the rows (latent factors)
Although the factors are not explicitly defined in real models, the example helps build
intuition:
1. Row 1: Vehicle Factor
o Car, Bike, Mercedes-Benz, Harley-Davidson → high values
o Orange, Mango → low values
2. Row 2: Luxury Factor
o Only Mercedes-Benz and Harley-Davidson show high values
o Car/Bike aren’t luxury → low
3. Row 3: Fruit Factor
o Orange, Mango → high
o Vehicles → low
4. Row 4: Company Factor
o Mango and Orange are also company names, so they score high on this factor.
Key Idea
We never manually choose these factors — the model discovers them automatically by
training on large text corpora.
📌 Similarity Between Words
Because each word is represented as a vector (column), we can analyze relationships using
distance metrics.
If two words are similar:
Their vector values across all rows/dimensions will be similar.
They will appear close in the vector space.
Examples
Car and Bike → similar values → close in embedding space
Mercedes-Benz and Harley-Davidson → luxury vehicle brands → similar
representation
Car vs. Orange → very different values → far apart
Analogy Relationships
Using distances (Euclidean, Manhattan, etc.), embeddings can capture analogies like:
Car : Bike = Mercedes-Benz : Harley-Davidson
This is because:
Car ↔ Bike share the “vehicle type” relation
Mercedes ↔ Harley share the “luxury” relation
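The similarity claims above can be checked with cosine similarity on toy vectors that follow the slide's four latent factors (vehicle, luxury, fruit, company). The values are illustrative, not learned:

```python
import math

# Toy 4-dimensional embeddings: dimensions stand for (vehicle, luxury,
# fruit, company). Hand-picked values for intuition only.
emb = {
    "car":    [0.9, 0.2, 0.0, 0.1],
    "bike":   [0.8, 0.1, 0.0, 0.1],
    "orange": [0.0, 0.0, 0.9, 0.6],
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(emb["car"], emb["bike"]))    # high: vectors nearly parallel
print(cosine(emb["car"], emb["orange"]))  # low: vectors nearly orthogonal
```

Real embeddings behave the same way, just in hundreds of dimensions learned from data rather than four hand-labeled ones.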
📌 Number of Dimensions (Hyperparameter)
The number of rows = number of embedding dimensions.
Common choices:
50
100
300
500
More dimensions =
better at capturing complex relationships
but increases computation
📌 Latent Factors Are Not Always
Interpretable
In real-world embeddings:
Each row does not represent a clean concept.
Many factors overlap.
Difficult to assign human labels to each dimension.
Still, the model effectively captures:
Semantic similarity (Car close to Bike, Orange close to Mango)
Contextual similarity
Plotting in 2D (after dimensionality reduction like PCA or t-SNE):
Car, Bike, Mercedes-Benz, Harley-Davidson cluster together
Orange and Mango cluster separately
📌 The Big Question — How Are These
Values Learned?
This leads to the methods of learning embeddings such as:
Word2Vec (Skip-gram / CBOW)
GloVe
FastText
Transformer-based embeddings
The slide sets up motivation before explaining how embeddings are trained.
📘 Notes: Learning Word Embeddings Using
Neural Language Models (Slide
Explanation)
1. Why Context Matters for Word Embeddings
To learn useful embeddings, the training data must contain:
Many occurrences of each word
In many different contexts
This allows embeddings to capture:
Contextual meaning
Semantics
Relationships between words
→ Large corpus is necessary.
But learning a word representation directly from documents in an unsupervised way is
difficult.
So the problem is reframed as a supervised learning task.
2. Converting Embedding Learning into a
Supervised Problem
Idea: Use Neural Networks for Language Modeling
Use previous words ( W_1, W_2, \dots, W_t ) to predict the next word ( W_{t+1} ).
Example:
She is a great tennis player
To predict “player”, use the previous 5 words.
3. How the Neural Network Is Structured
Input Layer
Each input word is:
Represented using one-hot encoding
Size = vocabulary size (e.g., 1000)
So “She” = 1000-dimensional one-hot vector
This repeats for all 5 input words.
Hidden Layer
Number of neurons = embedding size
Example: 500 neurons → embedding dimension = 500
Output Layer
1 neuron per vocabulary word → 1000 neurons
Softmax applied
Produces probability distribution over vocabulary
Highest probability → predicted next word
4. Handling Variable Sentence Length
Sentences can be long or short, so the model uses a fixed context window size.
Example:
If window size = 3, model predicts next word using the last 3 words.
For “She is a great tennis player”
→ Use “a great tennis” to predict “player”
The window size is a hyperparameter.
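The fixed-window idea can be sketched as generating (context, target) training pairs from a sentence:

```python
# Build (context, target) pairs for next-word prediction using a fixed
# context window, as in the slide's example (window size = 3).
def window_pairs(tokens, window=3):
    # Each target word is predicted from the `window` words before it.
    return [(tokens[i - window:i], tokens[i]) for i in range(window, len(tokens))]

tokens = "She is a great tennis player".split()
for context, target in window_pairs(tokens):
    print(context, "->", target)
# Final pair: ['a', 'great', 'tennis'] -> 'player'
```

Every position past the first `window` words yields one supervised example, which is how unlabeled text is turned into a supervised prediction task.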
5. Using Bidirectional Contexts for
Embeddings
Language modeling predicts next word using previous context.
But for learning embeddings, we can use:
Words before the target word (left context)
Words after the target word (right context)
Example sentence:
She is a great tennis player and has won many awards.
If context window = 3:
Left context of “player”: a great tennis
Right context of “player”: and has won
This is used to learn better embeddings because the target word is in the middle.
This modeling approach is common in embedding training.
6. Simplest Model: Predict Next Word
Using Only Previous Word
Architecture:
Input (one-hot) → hidden layer → output layer
Weights connecting input → hidden or hidden → output become the word
embeddings
These weights represent the core learned vector for each word.
7. Word2Vec
Word2Vec popularized:
Using neural networks without deep layers
Using context windows
Using target prediction tasks to learn embeddings
Word2Vec uses two main architectures:
CBOW (Continuous Bag of Words) → predict word from context
Skip-gram → predict context from word
The slide ends by introducing Word2Vec, which will be explained next.
📘 Word2Vec Model – Detailed Notes
(CBOW, Skip-Gram & Negative Sampling)
Word2Vec has two variants:
1. Continuous Bag-of-Words (CBOW)
Architecture
Input: Context words (surrounding the target word)
Context includes words to the left and right of the target word.
Output: The target (middle) word.
Concept
CBOW predicts:
context → target
Example:
Sentence: She is a great tennis player
Predict “player” using the context: a great tennis
2. Skip-Gram
Architecture
Input: Target (middle) word
Output: Context words (words to the left and right)
Skip-Gram predicts:
target → context
Why is it called Skip-Gram?
Because not all context words are used.
Some words in the window are skipped randomly.
3. Window Size + "Number of Skips"
Parameter
For Skip-Gram, two hyperparameters are used:
1. Window Size
Example: window size = 3
→ 3 words left + 3 words right
Context = 6 words
2. Number of Skips
Defines how many words to randomly pick from the window.
Example:
Window size = 3 → context = 6 words
Number of skips = 2 → pick only 2 random context words
Sentence example:
Context words = a, great, tennis, and, has, won
Possible Skip-Gram training samples:
(player → has)
(player → tennis)
or any other random pair.
4. Word Embeddings in Word2Vec
The weights between hidden layer and output layer = word embeddings.
Example size:
Hidden layer = 500 neurons
Vocabulary size = 1000 words
→ Weights = 500 × 1000 = 0.5 million parameters
But real vocabularies can be 10,000 or more:
500 × 10,000 = 5 million weights
5. Computational Problem
To learn embeddings:
Need huge training corpus
Need to compute softmax over entire vocabulary for each training step
This becomes:
Very expensive
Sometimes infeasible
6. Solution: Negative Sampling (Mikolov et
al.)
Introduced in the paper:
“Distributed Representations of Words and Phrases and their Compositionality”
Key Idea
Update only a small number of weights instead of updating all vocabulary weights.
Example
Target word = player
Context word = tennis
Without negative sampling:
Output is 1 for tennis
Output is 0 for all other words
→ All weights would be updated (millions!)
With Negative Sampling
Choose:
1 positive sample → (player, tennis) labeled as 1
k negative samples → random words labeled as 0
Negative words:
Randomly selected from outside the context
Sampled with probability proportional to their frequency raised to ( \frac{3}{4} )
Example (k = 3)
Positive pair:
(tennis, player → 1)
Negative pairs:
(tennis, hello → 0)
(tennis, piece → 0)
(tennis, few → 0)
Parameter reduction
Instead of updating all 5 million weights:
→ Update only weights corresponding to
1 positive + k negative words
→ = (k + 1) × embedding_dimension
→ For k = 3, embedding dimension = 500
→ Updates needed = 4 × 500 = 2000 weights
Huge computational savings.
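The sampling step and the parameter arithmetic above can be sketched as follows. For simplicity this draws negatives uniformly, whereas the slide notes real Word2Vec samples them proportional to frequency raised to 3/4:

```python
import random

# Negative-sampling sketch: one positive (context, target) pair labeled 1,
# plus k random negative words labeled 0. Only (k + 1) * embedding_dim
# weights are touched per step instead of the whole output layer.
# Simplification: negatives drawn uniformly, not by frequency^(3/4).
vocab = ["player", "tennis", "hello", "piece", "few", "great", "she"]
embedding_dim = 500
k = 3

target, context = "player", "tennis"
candidates = [w for w in vocab if w not in (target, context)]
negatives = random.sample(candidates, k)

samples = [(context, target, 1)] + [(context, w, 0) for w in negatives]
weights_updated = (k + 1) * embedding_dim
print(samples)
print(weights_updated)   # 2000 weights, versus millions for a full softmax
```

The training step then becomes k + 1 independent binary classifications (is this a true context pair?) rather than one softmax over the entire vocabulary.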
7. Recommended Values for k
Small datasets: k = 5–20
Large datasets: k = 2–5
8. Summary
Word2Vec:
CBOW: context → target
Skip-Gram: target → context (with skips)
Uses single-layer neural network
Negative sampling makes training feasible
Only a few weights updated per step
Learns dense vectors with rich semantic properties
📘 Word Embedding Matrix – Detailed
Notes
1. Embedding Matrix Structure
Word embeddings are stored in a matrix.
Rows = vocabulary size
(one row per word)
Columns = embedding dimension
(e.g., 50, 100, 300 values per word)
Example:
Vocabulary size = 10,000
Embedding dimension = 300
→ Embedding matrix size = 10,000 × 300
2. Using the Embedding Matrix
Input Representation (One-Hot Encoding)
An input word is represented as a 10,000 × 1 one-hot vector.
All values = 0
Except one index = 1 → the position of that word in the vocabulary
Example:
Word “a” is vocabulary index 1 → one-hot vector has a 1 in the first position.
Word “great” is index 524 → one-hot vector has a 1 in position 524.
3. How the Embedding Vector is Obtained
Matrix multiplication
Embedding vector = Embedding Matrixᵀ × One-Hot Vector
(the one-hot vector selects exactly one row of the embedding matrix)
Since all values in input are 0 except one:
Only one row contributes to the output
All other multiplications give 0
So:
For “a” → output = 1st row of embedding matrix
For “great” → output = 524th row
Output size = 300 × 1
(embedding dimension)
4. Why Direct Multiplication is Wasteful
We multiply a large 10,000 × 300 matrix with a 10,000 × 1 vector.
But 9,999 multiplications are with zero.
Completely unnecessary computation.
5. Practical Implementation: Lookup Table
Instead of multiplication, frameworks simply look up the row:
Input index → fetch corresponding row from embedding matrix
No real matrix multiplication
Much faster and highly efficient
Examples:
Word “a” → fetch row 1
Word “great” → fetch row 524
This works because:
The indices in the one-hot input and the rows in the embedding matrix are aligned.
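The equivalence of one-hot multiplication and a direct row lookup can be verified in a few lines (toy sizes; the product is written as `one_hot @ E` so the shapes line up):

```python
import numpy as np

# Compare the wasteful one-hot multiplication with the row lookup that
# embedding layers actually perform.
rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4
E = rng.normal(size=(vocab_size, embed_dim))   # embedding matrix

idx = 5                                        # the word's vocabulary index
one_hot = np.zeros(vocab_size)
one_hot[idx] = 1.0

via_matmul = one_hot @ E    # vocab_size multiplications, almost all by zero
via_lookup = E[idx]         # what frameworks (e.g., Keras Embedding) do

print(np.allclose(via_matmul, via_lookup))     # identical results
```

Both paths return row `idx` of the embedding matrix; the lookup simply skips the 9 rows that the zeros in the one-hot vector would have wiped out anyway.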
6. Embedding Layer in Keras
Keras's Embedding Layer directly maps indices to embedding vectors.
It performs lookup, not multiplication.
Commonly used as the first trainable layer in NLP tasks such as:
o Text Classification
o Sentiment Analysis
o Machine Translation
o Named Entity Recognition (NER)
o Text Summarization
o Many other NLP tasks
7. Transfer Learning with Pre-trained
Embeddings
Analogy to Computer Vision
In CV, we use pretrained CNN weights (e.g., ResNet, VGG).
Similarly in NLP:
We reuse pretrained embedding weights from:
Word2Vec
GloVe
FastText
Other large-scale models
Benefits:
Better performance
Faster convergence
Requires less data
Captures richer semantic relationships
Used for:
Text classification
Sequence models
Neural machine translation
Any deep learning-based NLP system
✔️Summary
Embedding matrix = vocab_size × embedding_dim
One-hot input → lookup the corresponding row
Multiplication is conceptually taught but not used in practice
Keras Embedding layer performs direct index-to-vector mapping
Pretrained embeddings can be plugged into your model for transfer learning
Gated Recurrent Unit (GRU) — Detailed
Notes
1. Background
GRU was proposed by Cho et al. (2014) in the same paper introducing the RNN
Encoder–Decoder architecture.
GRU is a simplified variant of LSTM.
Designed to reduce complexity while maintaining long-term dependency handling.
2. Key Difference Between LSTM and GRU
LSTM
Has two states:
o Short-term state (hidden state): ( h_t )
o Long-term state (cell state): ( C_t )
Has three gates + one main network:
o Input gate
o Forget gate
o Output gate
o Candidate network
GRU
Merges long-term and short-term memory → only one state:
o ( C_t )
Has only two gates + one main network:
o Update gate
o Reset gate
o Candidate activation network
No output gate
GRU is therefore:
Smaller
Less complex
Slightly faster to train
But performance-wise, neither LSTM nor GRU is consistently superior—both are widely
used.
3. GRU Architecture Overview
GRU uses three neural components:
1. Update Gate ( z_t )
2. Reset Gate ( r_t )
3. Main Network (candidate activation) ( g_t )
GRU output = GRU state
(While the output and state can differ, usually we treat them as the same.)
4. Update Gate — Controls “Forget” +
“Input” Gates Together
Unlike LSTM which uses two separate gates, GRU uses one gate (update gate) to control
both:
Input gate portion = ( z_t )
Forget gate portion = ( 1 - z_t )
Interpretation:
If ( z_t = 1 ): keep new information (input gate open), forget old state
If ( z_t = 0 ): keep old information (forget gate open), ignore new state
Formula:
[
z_t = \sigma(W_{xz} x_t + W_{cz}C_{t-1} + b_z)
]
Where:
( W_{xz} ): weight matrix for input
( W_{cz} ): weight matrix for previous state
( b_z ): bias
5. Reset Gate — Controls How Much Past
State Is Used
This determines how much of the previous state ( C_{t-1} ) should influence the candidate
activation.
Formula:
[
r_t = \sigma(W_{xr}x_t + W_{cr}C_{t-1} + b_r)
]
If:
( r_t = 0 ): previous state is “reset” → GRU runs like it’s seeing a new sequence
( r_t = 1 ): full previous state is used
6. Main Neural Network (Candidate
Activation)
This computes the new candidate state using current input and reset-controlled previous
state.
Formula:
[
g_t = \tanh(W_{xg}x_t + W_{cg}(r_t \cdot C_{t-1}) + b_g)
]
Where:
( r_t \cdot C_{t-1} ) controls how much previous memory contributes
Very similar to the candidate creation in LSTM but simpler
7. Final State Update (No Output Gate)
GRU directly combines:
Old memory
New candidate memory
Using update gate ( z_t ):
State update:
[
C_t = z_t \cdot g_t + (1 - z_t) \cdot C_{t-1}
]
Meaning:
If ( z_t ) is large → new information dominates
If ( 1 - z_t ) is large → old information dominates
8. Output of GRU
GRU has no separate output gate.
The output can be:
( y_t = C_t ), OR
A softmax applied externally (for prediction tasks)
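The four equations above can be traced with a single-unit (scalar) GRU step. The weights below are arbitrary toy values, not trained parameters:

```python
import math

# Scalar GRU step following the slide's equations:
#   z_t = sigmoid(W_xz x_t + W_cz C_{t-1} + b_z)        update gate
#   r_t = sigmoid(W_xr x_t + W_cr C_{t-1} + b_r)        reset gate
#   g_t = tanh(W_xg x_t + W_cg (r_t * C_{t-1}) + b_g)   candidate
#   C_t = z_t * g_t + (1 - z_t) * C_{t-1}               final blend
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, C_prev,
             W_xz=0.5, W_cz=0.5, b_z=0.0,
             W_xr=0.5, W_cr=0.5, b_r=0.0,
             W_xg=1.0, W_cg=1.0, b_g=0.0):
    z_t = sigmoid(W_xz * x_t + W_cz * C_prev + b_z)
    r_t = sigmoid(W_xr * x_t + W_cr * C_prev + b_r)
    g_t = math.tanh(W_xg * x_t + W_cg * (r_t * C_prev) + b_g)
    return z_t * g_t + (1 - z_t) * C_prev

C = 0.0
for x in [1.0, -0.5, 0.2]:
    C = gru_step(x, C)
print(C)   # stays in (-1, 1): C_t is a convex blend of tanh-bounded values
```

Note how `z_t` alone plays the role of both LSTM gates: `z_t` weights the new candidate and `1 - z_t` weights the old state, so the two contributions always sum to one.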
9. Summary
| Component | LSTM | GRU |
| --- | --- | --- |
| Memory state | 2 states: h_t, C_t | 1 state: C_t |
| Gates | 3 (input, forget, output) | 2 (update, reset) |
| Complexity | Higher | Lower |
| Speed | Slower | Faster |
| Output gate | Yes | No |
| Performance | Comparable | Comparable |
GRU is simpler, faster, and still capable of learning long-term dependencies, which is why it
became very popular.