
Model Pretraining

What is a generative configuration?

How do you decide which model to choose for your work?

Difference between autoregressive models and autoencoding models

Autoencoding Models: Encoder-Only Models
Autoencoding models, also known as encoder-only models, are pre-trained
using masked language modeling. In this approach, tokens in the input
sequence are randomly masked, and the model’s objective is to predict the
masked tokens to reconstruct the original sentence.
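
As a minimal sketch of this objective in practice, the snippet below uses the Hugging Face transformers fill-mask pipeline; the model (bert-base-uncased) and sentence are illustrative choices, not something prescribed by this text.

```python
# Masked language modeling with an encoder-only (autoencoding) model.
# bert-base-uncased is only an illustrative choice of autoencoding model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# One token is hidden with [MASK]; the model predicts it from the full
# bidirectional context to reconstruct the original sentence.
for prediction in fill_mask("The teacher [MASK] the student a question."):
    print(prediction["token_str"], round(prediction["score"], 3))
```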

Autoregressive Models: Decoder-Only Models
Autoregressive models, or decoder-only models, are pre-trained using causal
language modeling. The objective is to predict the next token based on the
previous sequence of tokens. These models mask future tokens so that, at each
position, they can only see the input tokens leading up to the token in question.
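
A minimal sketch of causal language modeling in use, here with the transformers text-generation pipeline and gpt2 as an illustrative decoder-only model:

```python
# Causal language modeling with a decoder-only (autoregressive) model.
# gpt2 is an illustrative choice; any decoder-only checkpoint behaves the same way.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Each new token is predicted from the tokens to its left only; future tokens
# are hidden by the causal attention mask.
result = generator("The transformer architecture is", max_new_tokens=20)
print(result[0]["generated_text"])
```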

Sequence-to-Sequence Models: Encoder-Decoder Models

Sequence-to-sequence models utilize both the encoder and decoder
components of the original transformer architecture. The pre-training objective
for these models varies depending on the specific model.

Sequence-to-sequence models are commonly used for translation, summarization, and question-answering tasks. T5 and BART are well-known encoder-decoder models.
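
A small sketch of an encoder-decoder model in use, assuming the facebook/bart-large-cnn summarization checkpoint (an illustrative choice):

```python
# Sequence-to-sequence (encoder-decoder) example: summarization with BART.
# facebook/bart-large-cnn is an illustrative checkpoint choice.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = (
    "Sequence-to-sequence models use both an encoder and a decoder. "
    "They are commonly applied to translation, summarization, and "
    "question answering, where both the input and the output are text."
)
print(summarizer(text, max_length=30, min_length=10)[0]["summary_text"])
```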

What Matters the Most:

Model size, training dataset size, and the compute budget all matter. There are three ways to improve a model:

A. Give your model more horsepower: increase the compute power and the time you are going to train the model.

B. Give your model more muscles: increase the model size, i.e. the number of parameters.

C. Give your model more training material: increase the size of the dataset.

The paper Scaling Laws for Neural Language Models shows that increasing model size, dataset size, and training compute each independently enhances performance.

Examples include GPT-3 (175B parameters), Jurassic-1 (178B parameters), and the massive Megatron-Turing NLG (530B parameters).

B. Increase the model size (the model parameters):

In the DeepMind paper Training Compute-Optimal Large Language Models (also known as the Chinchilla paper), the authors suggest that current big models might be over-parameterised and under-trained in terms of dataset size.

Cost issues:

Inference cost: the cost of calling an LLM to generate a response

Tuning cost: the cost of tuning an LLM to drive tailored pre-trained model responses

Pre-training cost: the cost of training a new LLM from scratch

Hosting cost: the cost of deploying and maintaining a model behind an API, supporting inference or tuning

These choices are influenced by the available compute budget, encompassing factors such as hardware limitations, training time constraints, and financial considerations.

C. Increase the size of the dataset:

The authors show that the optimal training dataset size for a given model is about 20 times the number of parameters of the model, i.e. roughly 20 training tokens per parameter.

The Chinchilla model trained by DeepMind was trained compute-optimally: its dataset was about 1.4T tokens for 70B parameters. LLaMA-65B follows a similar pattern, in contrast to the GPT-3 and BLOOM models highlighted in red in the figure below.
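
A quick back-of-the-envelope check of that rule of thumb (a sketch, using the roughly 20-tokens-per-parameter figure quoted above):

```python
# Chinchilla rule of thumb: compute-optimal training tokens ~ 20 x parameters.
def optimal_tokens(num_parameters: float, tokens_per_parameter: float = 20.0) -> float:
    return tokens_per_parameter * num_parameters

for name, params in [("Chinchilla-70B", 70e9), ("LLaMA-65B", 65e9)]:
    print(f"{name}: ~{optimal_tokens(params) / 1e12:.1f}T training tokens")
# Chinchilla-70B: ~1.4T training tokens, matching the dataset size mentioned above.
# LLaMA-65B: ~1.3T training tokens.
```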

[Figure: Optimal Models]

Why choose a Small Language Model?

Benefits and shortcomings of Small Language Models

Benefits

Efficient: SLMs are more nimble and require less computational power, which makes them more efficient to deploy in production.

Cost-effective: fewer parameters means you need fewer resources to train, maintain and run an SLM compared to an LLM.

Specialised: SLMs can be trained on high-quality datasets for specific domain tasks. This often leads to better performance within that niche.

Explainable: because SLMs are less complex and use more targeted data, they can offer more transparency into their outputs. Explainability is valued in most enterprises, especially in sensitive applications.

Shortcomings

Task-limited: due to their specialised nature, SLMs might struggle to perform as well on tasks outside their training domain. They lack the breadth of knowledge that LLMs possess.

Performance-limited: SLMs have a lower capacity for learning and understanding complex language patterns compared to larger models. This can lead to limitations in the types of tasks they can handle effectively.

Dataset-dependent: smaller and less curated datasets can lead to less robust models, as the performance of SLMs relies heavily on the quality and relevance of the data they are trained on.

Why Customize Language Models for Specialized Domains?

Understanding Domain Adaptation
Certain domains, such as law, medicine, finance, and science, possess their
own vocabulary and unique language structures. Common terms and phrases
in these domains may be unfamiliar outside their respective fields.

BloombergGPT
A Finance-Focused LLM: BloombergGPT serves as a prime example of a
specialized LLM in the financial domain. Developed by Bloomberg researchers,
this model combines finance-specific data with general-purpose text during
pretraining. By maintaining a balance between finance and public data (51%
financial data and 49% public data), BloombergGPT achieves superior results
on financial benchmarks while still demonstrating competitive performance on
general-purpose LLM benchmarks.

What is Quantization?

Quantization involves reducing the memory required to store model weights by decreasing their precision. Instead of the default 32-bit floating-point numbers (FP32) used to represent parameters, quantization employs 16-bit floating-point numbers (FP16) or even 8-bit integers (INT8). This reduction in precision helps optimize the memory footprint of the models.

Quantization Process and Memory Savings
Quantization statistically projects the original 32-bit floating-point numbers into lower-precision spaces, using scaling factors derived from the range of the original values. For instance, a model with one billion parameters needs about 4 GB of GPU RAM just to store its weights at full 32-bit precision (4 bytes per parameter), and roughly 80 GB of GPU RAM to train once gradients, optimizer states, and activations are included, so quantization can yield significant memory savings.
By employing 16-bit half precision (FP16), the weight storage is halved to about 2 GB. Representing the model parameters as 8-bit integers (INT8) halves it again to roughly 1 GB, a 75% reduction in weight memory compared to full 32-bit precision.
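
The idea of projecting values with a scaling factor derived from their range can be sketched in a few lines. This is a simplified symmetric INT8 scheme written with NumPy for illustration, not the exact algorithm any particular framework uses:

```python
import numpy as np

def quantize_int8(weights_fp32: np.ndarray):
    """Symmetric quantization: map the FP32 value range onto INT8 [-127, 127]."""
    scale = np.abs(weights_fp32).max() / 127.0            # scaling factor from the range
    q = np.round(weights_fp32 / scale).astype(np.int8)    # 1 byte per weight instead of 4
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale                   # approximate reconstruction

weights = np.random.randn(1_000_000).astype(np.float32)   # pretend these are model weights
q, scale = quantize_int8(weights)
print(f"{weights.nbytes / 1e6:.1f} MB -> {q.nbytes / 1e6:.1f} MB")  # 4.0 MB -> 1.0 MB
```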

BFLOAT16 (BF16) has emerged as a widely adopted precision format in deep learning. Developed by Google Brain, BF16 serves as a hybrid between FP16 and FP32, capturing the full dynamic range of FP32 with just 16 bits.

1. Reduced memory required to store and train models

2. Lower-precision spaces

3. QAT: quantization-aware training

4. BFLOAT16

THINK:

500B parameters at 32-bit precision?

How much GPU RAM is needed to train a 500B-parameter model?
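
A rough back-of-the-envelope answer, assuming 4 bytes per FP32 parameter and the roughly 80 GB of training memory per billion parameters quoted in the quantization section (a sketch, not a precise sizing):

```python
# Rough GPU RAM estimate for a 500B-parameter model at 32-bit precision.
PARAMS = 500e9
BYTES_PER_PARAM_WEIGHTS = 4    # FP32 weights only
BYTES_PER_PARAM_TRAINING = 80  # rule of thumb: weights + gradients + optimizer states + activations

weights_tb = PARAMS * BYTES_PER_PARAM_WEIGHTS / 1e12
training_tb = PARAMS * BYTES_PER_PARAM_TRAINING / 1e12
print(f"Weights alone: ~{weights_tb:.0f} TB")   # ~2 TB just to store the FP32 weights
print(f"Training:      ~{training_tb:.0f} TB")  # ~40 TB, i.e. hundreds of 80 GB GPUs
```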

What are the scaling techniques for model training with multiple GPUs?
Improving efficiency and performance

Two popular techniques:

1. Distributed Data Parallel (DDP)
2. Fully Sharded Data Parallel (FSDP)

DDP is a widely used model replication technique that distributes large datasets
across multiple GPUs, enabling parallel processing of batches of data. With
DDP, each GPU receives a copy of the model and processes data
independently. Afterward, a synchronization step combines the results,
updating the identical model on each GPU. DDP is suitable when the model
and its additional parameters fit onto a single GPU, resulting in faster
training.
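
A minimal PyTorch DDP sketch under the usual assumptions (one process per GPU, launched with torchrun; the tiny Linear model is just a stand-in):

```python
# Distributed Data Parallel: every GPU holds a full model copy and processes
# its own shard of each batch; gradients are synchronized during backward().
# Launch with e.g. `torchrun --nproc_per_node=4 train_ddp.py` (illustrative).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                       # one process per GPU
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()            # stand-in for a real model
model = DDP(model, device_ids=[local_rank])           # replicate model, sync gradients

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
batch = torch.randn(32, 1024, device="cuda")          # this rank's shard of the data
loss = model(batch).pow(2).mean()                     # toy loss
loss.backward()                                       # gradients are all-reduced here
optimizer.step()                                      # identical update on every GPU
```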

When the model is too large to fit in the memory of a single GPU, a sharding approach is needed. Sharding involves splitting and distributing one logical data set across multiple databases that share nothing and can be deployed across multiple servers.

Fully Sharded Data Parallel (FSDP):

FSDP, inspired by the ZeRO technique, provides a solution when the model is
too large to fit in the memory of a single GPU. ZeRO (Zero Redundancy
Optimizer) aims to optimize memory usage by distributing or sharding model
parameters, gradients, and optimizer states across GPUs. FSDP applies
sharding strategies specified in ZeRO to distribute these components across
GPU nodes. This enables working with models that would otherwise exceed the
capacity of a single chip.
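
A matching FSDP sketch, again assuming a torchrun launch; the sharding_strategy argument maps roughly onto the ZeRO stages described in the next section (FULL_SHARD is closest to stage 3, SHARD_GRAD_OP to stage 2):

```python
# Fully Sharded Data Parallel: parameters, gradients, and optimizer states are
# sharded across GPUs instead of replicated, so larger models fit in memory.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(                           # stand-in for a large model
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda()

# FULL_SHARD shards params, gradients, and optimizer states (ZeRO stage 3 style).
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(torch.randn(8, 4096, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
```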

Memory Optimization with ZeRO:
ZeRO offers three optimization stages:

Stage 1 shards only the optimizer states, reducing memory usage by up to a factor of four.

Stage 2 also shards the gradients, reducing memory usage by up to eight times when combined with Stage 1.

Stage 3 shards all components, including the model parameters, with memory reduction scaling linearly with the number of GPUs.
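
The per-GPU memory arithmetic behind these stages can be sketched from the ZeRO paper's figures for mixed-precision Adam training (about 2 bytes of FP16 weights, 2 bytes of FP16 gradients, and 12 bytes of optimizer state per parameter); treat the numbers as illustrative estimates:

```python
def zero_memory_per_gpu_gb(params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GB) for mixed-precision Adam, per the ZeRO paper."""
    p, g, o = 2.0, 2.0, 12.0          # bytes/param: FP16 weights, FP16 grads, optimizer states
    if stage == 0:                    # no sharding (plain data parallelism)
        per_param = p + g + o
    elif stage == 1:                  # shard optimizer states only
        per_param = p + g + o / num_gpus
    elif stage == 2:                  # shard optimizer states + gradients
        per_param = p + (g + o) / num_gpus
    else:                             # stage 3: shard everything
        per_param = (p + g + o) / num_gpus
    return params * per_param / 1e9

for stage in range(4):
    print(f"7B params, 8 GPUs, ZeRO stage {stage}: "
          f"~{zero_memory_per_gpu_gb(7e9, 8, stage):.0f} GB per GPU")
# Roughly 112, 38, 26, and 14 GB per GPU; the savings approach the quoted 4x and
# 8x factors as the number of GPUs grows, and scale linearly at stage 3.
```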
