Model Pretraining
Autoencoding Models: Encoder-Only Models
Autoencoding models, also known as encoder-only models, are pre-trained
using masked language modeling. In this approach, tokens in the input
sequence are randomly masked, and the model’s objective is to predict the
masked tokens to reconstruct the original sentence.
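For example, a pretrained encoder-only model can be asked to fill a masked position directly. Below is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (both are illustrative choices, not named above):

```python
# A minimal sketch of masked language modeling at inference time, using the
# Hugging Face `transformers` fill-mask pipeline with a BERT checkpoint
# (illustrative choices; any encoder-only MLM checkpoint would do).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT's mask token is [MASK]; the model ranks the most likely fillers.
for prediction in fill_mask("The teacher [MASK] the student."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```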
Sequence-to-Sequence Models: Encoder-Decoder Models
Sequence-to-sequence models utilize both the encoder and decoder
components of the original transformer architecture. The pre-training objective
for these models varies depending on the specific model.
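For instance, T5 uses a span-corruption objective: contiguous spans of input tokens are replaced by sentinel tokens, and the decoder learns to generate the dropped spans. Here is a minimal sketch of that input/target construction; the helper function and the hard-coded spans are illustrative, not taken from any particular library:

```python
# A minimal sketch of T5-style span corruption: contiguous input spans are
# replaced by sentinel tokens (<extra_id_N>), and the target asks the decoder
# to reproduce the dropped spans. Spans here are hard-coded for illustration.
def span_corrupt(words, spans):
    """words: list of tokens; spans: list of (start, end) index pairs to mask."""
    corrupted, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted += words[cursor:start] + [sentinel]
        target += [sentinel] + words[start:end]
        cursor = end
    corrupted += words[cursor:]
    return " ".join(corrupted), " ".join(target)

src, tgt = span_corrupt("Thank you for inviting me to your party last week".split(),
                        spans=[(2, 4), (8, 9)])
print(src)  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(tgt)  # <extra_id_0> for inviting <extra_id_1> last
```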
Scaling Laws
Another way to improve model performance is to increase the compute power and the amount of time you spend training the model. The paper Scaling Laws for Neural Language Models shows that increasing model size, dataset size, and training compute each independently improve performance.
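The paper summarizes these trends as power laws in the number of parameters N, the dataset size D, and the training compute C. The form below is a sketch, with exponent values approximately as reported in the paper (Kaplan et al., 2020):

```latex
% Approximate power-law fits from "Scaling Laws for Neural Language Models":
% L is test loss; N = parameters, D = dataset tokens, C = training compute.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C},
\qquad \alpha_N \approx 0.076,\;\; \alpha_D \approx 0.095,\;\; \alpha_C \approx 0.05
```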
One option is to increase the model size, that is, the number of model parameters. Scaling up this way, however, brings cost issues:
Tuning cost — the cost of tuning an LLM to drive tailored responses from the pre-trained model
The authors of the Chinchilla paper show that the compute-optimal training dataset size for a given model is about 20 times larger than the number of model parameters (roughly 20 training tokens per parameter).
The Chinchilla model trained by DeepMind was trained compute-optimally: its dataset is 1.4T tokens and it has 70B parameters. LLaMA-65B follows a similar pattern, in contrast to the GPT-3 and BLOOM models highlighted in red in the figure below.
[Figure: Optimal Models, comparing compute-optimally trained models such as Chinchilla and LLaMA-65B with GPT-3 and BLOOM]
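As a quick sanity check of the roughly 20-tokens-per-parameter rule of thumb, here is a small sketch; the model sizes are public figures, and the token counts it prints are only what the rule suggests, not the actual training budgets:

```python
# A quick sketch of the Chinchilla rule of thumb: compute-optimal training
# uses roughly 20 tokens per model parameter (an approximation).
TOKENS_PER_PARAM = 20

for name, params in [("Chinchilla-70B", 70e9), ("LLaMA-65B", 65e9), ("GPT-3 175B", 175e9)]:
    optimal_tokens = params * TOKENS_PER_PARAM
    print(f"{name}: ~{optimal_tokens / 1e12:.1f}T tokens for compute-optimal training")

# GPT-3 was actually trained on roughly 0.3T tokens, far below the ~3.5T the
# rule suggests, which is why it is considered under-trained by this measure.
```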
Why choose a Small Language Model?
Benefits
Efficient — SLMs are more nimble and require less computational power, which makes them more efficient to deploy in production
Explainable — Because these models are less complex and are trained on more targeted data, they can offer more transparency into their outputs. Explainability is valued in most enterprises, especially in sensitive applications
Shortcomings
Task-limited — Due to their specialized nature, SLMs may struggle to perform as well on tasks outside their training domain; they lack the breadth of knowledge that LLMs possess
Understanding Domain Adaptation
Certain domains, such as law, medicine, finance, and science, possess their
own vocabulary and unique language structures. Common terms and phrases
in these domains may be unfamiliar outside their respective fields.
BloombergGPT: A Finance-Focused LLM
BloombergGPT serves as a prime example of a specialized LLM in the financial domain. Developed by Bloomberg researchers, the model combines finance-specific data with general-purpose text during pretraining. By maintaining a balance between finance and public data (51% financial data and 49% public data), BloombergGPT achieves superior results on financial benchmarks while still demonstrating competitive performance on general-purpose LLM benchmarks.
What is Quantization?
Quantization Process and Memory Savings
Quantization statistically projects the original 32-bit floating-point numbers into a lower-precision space, using scaling factors derived from the range of the original values. Consider a model with one billion parameters. At full 32-bit precision, each parameter takes 4 bytes, so the weights alone occupy about 4 GB, and training the model requires roughly 80 gigabytes of GPU RAM once gradients, optimizer states, and activations are included.
By employing 16-bit half precision (FP16), the memory needed to store the parameters is cut by 50%, to about 2 GB. Representing the parameters as 8-bit integers (INT8) reduces the weight memory further to just one gigabyte, a 75% reduction compared to full 32-bit precision, with corresponding savings in the overall training footprint.
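As an illustration of the scaling-factor idea, here is a minimal sketch of symmetric INT8 quantization using NumPy; the weight values are made up for the example:

```python
# A minimal sketch of symmetric INT8 quantization: project FP32 values into
# the INT8 range using a scaling factor derived from the largest absolute
# value, then dequantize to inspect the rounding error.
import numpy as np

weights_fp32 = np.array([0.317, -1.92, 0.0042, 2.75, -0.66], dtype=np.float32)

scale = np.abs(weights_fp32).max() / 127.0          # map [-max, max] onto [-127, 127]
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)
weights_dequant = weights_int8.astype(np.float32) * scale

print("scale:", scale)
print("int8 :", weights_int8)
print("error:", np.abs(weights_fp32 - weights_dequant))
```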
BFLOAT16
BFLOAT16 (brain floating point, BF16) is another 16-bit option: it keeps FP32's 8-bit exponent, so it preserves FP32's dynamic range while halving the memory per parameter, and it has become a popular choice for pretraining large models.
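A small sketch, assuming PyTorch, of why BF16 is attractive: it keeps FP32's dynamic range at half the memory, whereas FP16 overflows for very large values:

```python
# Compare FP32, BF16, and FP16: BF16 keeps FP32's exponent range, so very
# large values survive the cast, while FP16 overflows to inf.
import torch

x = torch.tensor([3.0e38, 1.5, -2.0e-3], dtype=torch.float32)

print(x.to(torch.bfloat16))   # large value preserved (with coarser mantissa)
print(x.to(torch.float16))    # 3e38 overflows to inf in FP16
print(torch.finfo(torch.bfloat16).max, torch.finfo(torch.float16).max)
```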
THINK: What are the scaling techniques for model training with multiple GPUs?
Improving Efficiency and Performance
DDP is a widely used model replication technique that distributes large datasets
across multiple GPUs, enabling parallel processing of batches of data. With
DDP, each GPU receives a copy of the model and processes data
independently. Afterward, a synchronization step combines the results,
updating the identical model on each GPU. DDP is suitable when the model weights, together with the gradients and optimizer states needed for training, fit onto a single GPU, and it results in faster training.
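A minimal DDP sketch, assuming PyTorch and a torchrun launch (e.g. `torchrun --nproc_per_node=2 train_ddp.py`); the model, data, and hyperparameters are toy placeholders:

```python
# Each process drives one GPU, holds a full model replica, and sees its own
# batches; gradients are all-reduced during backward so every replica applies
# the same update.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(128, 2).cuda(rank)       # full model copy on each GPU
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

    for _ in range(10):
        x = torch.randn(32, 128, device=f"cuda:{rank}")   # each rank gets its own batch
        loss = ddp_model(x).sum()
        optimizer.zero_grad()
        loss.backward()                              # gradient all-reduce happens here
        optimizer.step()                             # identical update on every replica

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```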
Sharding involves splitting one logical data set and distributing it across multiple shared-nothing workers; the term comes from databases, where a data set is split across multiple servers. In distributed training, the data being sharded is the model's parameters, gradients, and optimizer states.
FSDP, inspired by the ZeRO technique, provides a solution when the model is
too large to fit in the memory of a single GPU. ZeRO (Zero Redundancy
Optimizer) aims to optimize memory usage by distributing or sharding model
parameters, gradients, and optimizer states across GPUs. FSDP applies
sharding strategies specified in ZeRO to distribute these components across
GPU nodes. This enables working with models that would otherwise exceed the
capacity of a single chip.
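A minimal FSDP sketch under the same assumptions (PyTorch with a torchrun launch; the model is a toy placeholder):

```python
# Unlike DDP, FSDP shards parameters, gradients, and optimizer states across
# ranks instead of fully replicating them, gathering shards only when needed.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).cuda(rank)

fsdp_model = FSDP(model)                              # shards parameters across ranks
optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device=f"cuda:{rank}")
loss = fsdp_model(x).sum()
loss.backward()                                       # gathers and re-shards as needed
optimizer.step()

dist.destroy_process_group()
```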
Memory Optimization with ZeRO
ZeRO offers three optimization stages:
Stage 1 shards only the optimizer states across GPUs.
Stage 2 additionally shards the gradients.
Stage 3 also shards the model parameters, so all three components are distributed.
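As a sketch of how a stage is selected in practice, here is a DeepSpeed-style configuration fragment expressed as a Python dict (the surrounding launch code is omitted; the key names follow DeepSpeed's config schema):

```python
# The `zero_optimization.stage` field picks the ZeRO stage described above.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {
        "stage": 2,   # 1: shard optimizer states, 2: + gradients, 3: + parameters
    },
}

# Typically passed to deepspeed.initialize(...) as the config, or saved as a
# JSON file and referenced via --deepspeed_config on the command line.
```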