
Model Pretraining

What is a generative configuration?

How do you decide which model to choose for your work?

Difference between autoregressive models and autoencoding models

Autoencoding Models: Encoder-Only Models
Autoencoding models, also known as encoder-only models, are pre-trained
using masked language modeling. In this approach, tokens in the input
sequence are randomly masked, and the model’s objective is to predict the
masked tokens to reconstruct the original sentence.
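
As a minimal sketch of this objective in practice, the snippet below uses the Hugging Face transformers fill-mask pipeline; the model (bert-base-uncased) and sentence are illustrative choices, not something prescribed by this text.

```python
# Masked language modeling with an encoder-only (autoencoding) model.
# bert-base-uncased is only an illustrative choice of autoencoding model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# One token is hidden with [MASK]; the model predicts it from the full
# bidirectional context to reconstruct the original sentence.
for prediction in fill_mask("The teacher [MASK] the student a question."):
    print(prediction["token_str"], round(prediction["score"], 3))
```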

Autoregressive Models: Decoder-Only Models
Autoregressive models, or decoder-only models, are pre-trained using causal
language modeling. The objective is to predict the next token based on the
previous sequence of tokens. These models mask future tokens so that, at each
position, they can only see the input tokens leading up to the token in question.
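
A minimal sketch of causal language modeling in use, here with the transformers text-generation pipeline and gpt2 as an illustrative decoder-only model:

```python
# Causal language modeling with a decoder-only (autoregressive) model.
# gpt2 is an illustrative choice; any decoder-only checkpoint behaves the same way.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Each new token is predicted from the tokens to its left only; future tokens
# are hidden by the causal attention mask.
result = generator("The transformer architecture is", max_new_tokens=20)
print(result[0]["generated_text"])
```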

Sequence-to-Sequence Models: Encoder-Decoder Models

Sequence-to-sequence models utilize both the encoder and decoder
components of the original transformer architecture. The pre-training objective
for these models varies depending on the specific model.

Sequence-to-sequence models are commonly used for translation, summarization, and question-answering tasks. T5 and BART are well-known encoder-decoder models.
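
A small sketch of an encoder-decoder model in use, assuming the facebook/bart-large-cnn summarization checkpoint (an illustrative choice):

```python
# Sequence-to-sequence (encoder-decoder) example: summarization with BART.
# facebook/bart-large-cnn is an illustrative checkpoint choice.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = (
    "Sequence-to-sequence models use both an encoder and a decoder. "
    "They are commonly applied to translation, summarization, and "
    "question answering, where both the input and the output are text."
)
print(summarizer(text, max_length=30, min_length=10)[0]["summary_text"])
```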

What Matters the Most:

Model size, training dataset size, and the compute budget all matter. There are three ways to improve a model:

A. Give your model more horsepower: increase the compute power and the time you are going to train the model.

B. Give your model more muscles: increase the model size, i.e. the number of parameters.

C. Give your model more training material: increase the size of the dataset.

The paper Scaling Laws for Neural Language Models shows that increasing model size, dataset size, and training compute each independently enhances performance.

Examples include GPT-3 (175B parameters), Jurassic-1 (178B parameters), and the massive Megatron-Turing NLG (530B parameters).

B. Increase the model size (the model parameters):

In the DeepMind paper Training Compute-Optimal Large Language Models (also known as the Chinchilla paper), the authors suggest that current big models might be over-parameterised and under-trained in terms of dataset size.

Cost issues:

Inference cost: the cost of calling an LLM to generate a response

Tuning cost: the cost of tuning an LLM to drive tailored pre-trained model responses

Pre-training cost: the cost of training a new LLM from scratch

Hosting cost: the cost of deploying and maintaining a model behind an API, supporting inference or tuning

These choices are influenced by the available compute budget, encompassing factors such as hardware limitations, training time constraints, and financial considerations.

C. Increase the size of the dataset:

The authors show that the optimal training dataset size for a given model is about 20 times the number of parameters of the model, i.e. roughly 20 training tokens per parameter.

The Chinchilla model trained by DeepMind was trained compute-optimally: its dataset was about 1.4T tokens for 70B parameters. LLaMA-65B follows a similar pattern, in contrast to the GPT-3 and BLOOM models highlighted in red in the figure below.
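
A quick back-of-the-envelope check of that rule of thumb (a sketch, using the roughly 20-tokens-per-parameter figure quoted above):

```python
# Chinchilla rule of thumb: compute-optimal training tokens ~ 20 x parameters.
def optimal_tokens(num_parameters: float, tokens_per_parameter: float = 20.0) -> float:
    return tokens_per_parameter * num_parameters

for name, params in [("Chinchilla-70B", 70e9), ("LLaMA-65B", 65e9)]:
    print(f"{name}: ~{optimal_tokens(params) / 1e12:.1f}T training tokens")
# Chinchilla-70B: ~1.4T training tokens, matching the dataset size mentioned above.
# LLaMA-65B: ~1.3T training tokens.
```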

[Figure: Optimal Models]

Why choose a Small Language Model?

Benefits and shortcomings of Small Language Models

Benefits

Efficient: SLMs are more nimble and require less computational power, which makes them more efficient to deploy in production.

Cost-effective: fewer parameters means you need fewer resources to train, maintain and run an SLM compared to an LLM.

Specialised: SLMs can be trained on high-quality datasets for specific domain tasks. This often leads to better performance within that niche.

Explainable: because SLMs are less complex and use more targeted data, they can offer more transparency into their outputs. Explainability is valued in most enterprises, especially in sensitive applications.

Shortcomings

Task-limited: due to their specialised nature, SLMs might struggle to perform as well on tasks outside their training domain. They lack the breadth of knowledge that LLMs possess.

Performance-limited: SLMs have a lower capacity for learning and understanding complex language patterns compared to larger models. This can lead to limitations in the types of tasks they can handle effectively.

Dataset-dependent: smaller and less curated datasets can lead to less robust models, as the performance of SLMs relies heavily on the quality and relevance of the data they are trained on.

Why Customize Language Models for Specialized Domains?

Understanding Domain Adaptation
Certain domains, such as law, medicine, finance, and science, possess their
own vocabulary and unique language structures. Common terms and phrases
in these domains may be unfamiliar outside their respective fields.

BloombergGPT
A Finance-Focused LLM: BloombergGPT serves as a prime example of a
specialized LLM in the financial domain. Developed by Bloomberg researchers,
this model combines finance-specific data with general-purpose text during
pretraining. By maintaining a balance between finance and public data (51%
financial data and 49% public data), BloombergGPT achieves superior results
on financial benchmarks while still demonstrating competitive performance on
general-purpose LLM benchmarks.

What is Quantization?

Quantization involves reducing the memory required to store model weights by decreasing their precision. Instead of the default 32-bit floating-point numbers (FP32) used to represent parameters, quantization employs 16-bit floating-point numbers (FP16) or even 8-bit integers (INT8). This reduction in precision helps optimize the memory footprint of the models.

Quantization Process and Memory Savings
Quantization statistically projects the original 32-bit floating-point numbers into lower-precision spaces, using scaling factors derived from the range of the original values. For instance, a model with one billion parameters needs about 4 GB of GPU RAM just to store its weights at full 32-bit precision (4 bytes per parameter), and roughly 80 GB of GPU RAM to train once gradients, optimizer states, and activations are included, so quantization can yield significant memory savings.
By employing 16-bit half precision (FP16), the weight storage is halved to about 2 GB. Representing the model parameters as 8-bit integers (INT8) halves it again to roughly 1 GB, a 75% reduction in weight memory compared to full 32-bit precision.
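
The idea of projecting values with a scaling factor derived from their range can be sketched in a few lines. This is a simplified symmetric INT8 scheme written with NumPy for illustration, not the exact algorithm any particular framework uses:

```python
import numpy as np

def quantize_int8(weights_fp32: np.ndarray):
    """Symmetric quantization: map the FP32 value range onto INT8 [-127, 127]."""
    scale = np.abs(weights_fp32).max() / 127.0            # scaling factor from the range
    q = np.round(weights_fp32 / scale).astype(np.int8)    # 1 byte per weight instead of 4
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale                   # approximate reconstruction

weights = np.random.randn(1_000_000).astype(np.float32)   # pretend these are model weights
q, scale = quantize_int8(weights)
print(f"{weights.nbytes / 1e6:.1f} MB -> {q.nbytes / 1e6:.1f} MB")  # 4.0 MB -> 1.0 MB
```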

BFLOAT16 (BF16) has emerged as a widely adopted precision format in deep learning. Developed by Google Brain, BF16 serves as a hybrid between FP16 and FP32, capturing the full dynamic range of FP32 with just 16 bits.

1. Reduced memory required to store and train models

2. Lower-precision spaces

3. QAT: quantization-aware training

4. BFLOAT16

THINK:

500B parameters at 32-bit precision?

How much GPU RAM is needed to train a 500B-parameter model?
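
A rough back-of-the-envelope answer, assuming 4 bytes per FP32 parameter and the roughly 80 GB of training memory per billion parameters quoted in the quantization section (a sketch, not a precise sizing):

```python
# Rough GPU RAM estimate for a 500B-parameter model at 32-bit precision.
PARAMS = 500e9
BYTES_PER_PARAM_WEIGHTS = 4    # FP32 weights only
BYTES_PER_PARAM_TRAINING = 80  # rule of thumb: weights + gradients + optimizer states + activations

weights_tb = PARAMS * BYTES_PER_PARAM_WEIGHTS / 1e12
training_tb = PARAMS * BYTES_PER_PARAM_TRAINING / 1e12
print(f"Weights alone: ~{weights_tb:.0f} TB")   # ~2 TB just to store the FP32 weights
print(f"Training:      ~{training_tb:.0f} TB")  # ~40 TB, i.e. hundreds of 80 GB GPUs
```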

What are the scaling techniques for model training with multiple GPUs?
Improving efficiency and performance

Two popular techniques:

1. Distributed Data Parallel (DDP)
2. Fully Sharded Data Parallel (FSDP)

DDP is a widely used model replication technique that distributes large datasets
across multiple GPUs, enabling parallel processing of batches of data. With
DDP, each GPU receives a copy of the model and processes data
independently. Afterward, a synchronization step combines the results,
updating the identical model on each GPU. DDP is suitable when the model
and its additional parameters fit onto a single GPU, resulting in faster
training.
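
A minimal PyTorch DDP sketch under the usual assumptions (one process per GPU, launched with torchrun; the tiny Linear model is just a stand-in):

```python
# Distributed Data Parallel: every GPU holds a full model copy and processes
# its own shard of each batch; gradients are synchronized during backward().
# Launch with e.g. `torchrun --nproc_per_node=4 train_ddp.py` (illustrative).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                       # one process per GPU
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()            # stand-in for a real model
model = DDP(model, device_ids=[local_rank])           # replicate model, sync gradients

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
batch = torch.randn(32, 1024, device="cuda")          # this rank's shard of the data
loss = model(batch).pow(2).mean()                     # toy loss
loss.backward()                                       # gradients are all-reduced here
optimizer.step()                                      # identical update on every GPU
```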

When the model is too large to fit in the memory of a single GPU, a sharding approach is needed. Sharding involves splitting and distributing one logical data set across multiple databases that share nothing and can be deployed across multiple servers.

Fully Sharded Data Parallel (FSDP):

FSDP, inspired by the ZeRO technique, provides a solution when the model is
too large to fit in the memory of a single GPU. ZeRO (Zero Redundancy
Optimizer) aims to optimize memory usage by distributing or sharding model
parameters, gradients, and optimizer states across GPUs. FSDP applies
sharding strategies specified in ZeRO to distribute these components across
GPU nodes. This enables working with models that would otherwise exceed the
capacity of a single chip.
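
A matching FSDP sketch, again assuming a torchrun launch; the sharding_strategy argument maps roughly onto the ZeRO stages described in the next section (FULL_SHARD is closest to stage 3, SHARD_GRAD_OP to stage 2):

```python
# Fully Sharded Data Parallel: parameters, gradients, and optimizer states are
# sharded across GPUs instead of replicated, so larger models fit in memory.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(                           # stand-in for a large model
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda()

# FULL_SHARD shards params, gradients, and optimizer states (ZeRO stage 3 style).
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(torch.randn(8, 4096, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
```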

Memory Optimization with ZeRO:
ZeRO offers three optimization stages:

Stage 1 shards only the optimizer states, reducing memory usage by up to a factor of four.

Stage 2 also shards the gradients, reducing memory usage by up to eight times when combined with Stage 1.

Stage 3 shards all components, including the model parameters, with memory reduction scaling linearly with the number of GPUs.
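
The per-GPU memory arithmetic behind these stages can be sketched from the ZeRO paper's figures for mixed-precision Adam training (about 2 bytes of FP16 weights, 2 bytes of FP16 gradients, and 12 bytes of optimizer state per parameter); treat the numbers as illustrative estimates:

```python
def zero_memory_per_gpu_gb(params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GB) for mixed-precision Adam, per the ZeRO paper."""
    p, g, o = 2.0, 2.0, 12.0          # bytes/param: FP16 weights, FP16 grads, optimizer states
    if stage == 0:                    # no sharding (plain data parallelism)
        per_param = p + g + o
    elif stage == 1:                  # shard optimizer states only
        per_param = p + g + o / num_gpus
    elif stage == 2:                  # shard optimizer states + gradients
        per_param = p + (g + o) / num_gpus
    else:                             # stage 3: shard everything
        per_param = (p + g + o) / num_gpus
    return params * per_param / 1e9

for stage in range(4):
    print(f"7B params, 8 GPUs, ZeRO stage {stage}: "
          f"~{zero_memory_per_gpu_gb(7e9, 8, stage):.0f} GB per GPU")
# Roughly 112, 38, 26, and 14 GB per GPU; the savings approach the quoted 4x and
# 8x factors as the number of GPUs grows, and scale linearly at stage 3.
```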
