LLM Challenges
LARGE LANGUAGE MODEL CHOICE

Generative AI Project Lifecycle:
Use case definition & scoping → Model Selection → Adapt (prompt engineering, fine tuning), augment, and evaluate model → App integration (model optimization, deployment)
Two options for model selection:
• Use a pre-trained LLM.
• Train your own LLM from scratch.

But, in general… develop your application using a pre-trained LLM, except if you work with extremely specific data (e.g., medical, legal).
Hubs: Where you can browse existing models.

Model Cards: List of best use cases, training details, and limitations of models.

The model choice will depend on the details of the task to carry out.

Model pre-training:
Model weights are adjusted in order to minimize the loss of the training objective. This requires significant computational resources (i.e., GPUs), due to the high computational load. For instance, a single NVIDIA A100 GPU supports up to 80GB of RAM.

Number of parameters of popular models: PaLM (540B), GPT-3 (175B), YaLM (100B), GPT-2 (1.5B), BERT (110M).
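As a concrete illustration (an assumption, not from the original text), a single pre-training step in PyTorch looks roughly like the sketch below, with a tiny placeholder model standing in for the LLM: the gradients of the training loss drive the weight update.

```python
import torch
import torch.nn as nn

# Toy stand-in for an LLM: pre-training repeats this step over the corpus,
# adjusting the weights so the training-objective loss decreases.
model = nn.Linear(10, 10)                                  # placeholder "model"
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(8, 10)                                # dummy batch
targets = torch.randint(0, 10, (8,))                       # dummy target token ids

logits = model(inputs)
loss = loss_fn(logits, targets)
loss.backward()        # gradients of the loss w.r.t. the weights
optimizer.step()       # weight update that lowers the loss
optimizer.zero_grad()
```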
COMPUTATIONAL CHALLENGES

LLMs are massive and require plenty of memory for training and inference.

To load the model into GPU RAM, weights are typically stored in the FP32 space (representable values roughly from -3 x 10^38 to 3 x 10^38):
• 1 parameter (32-bit precision) = 4 bytes needed
• 1B parameters = 4 x 10^9 bytes = 4GB of GPU RAM
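These numbers are easy to sanity-check; the short Python sketch below (using NumPy only to display the FP32 range) reproduces the 4GB figure:

```python
import numpy as np

# FP32 representable range (roughly +/- 3.4e38).
fp32 = np.finfo(np.float32)
print(f"FP32 range: {fp32.min:.2e} to {fp32.max:.2e}")  # -3.40e+38 to 3.40e+38

# Memory needed just to load the weights into GPU RAM.
n_params = 1_000_000_000           # 1B-parameter model
bytes_per_param_fp32 = 4           # 32-bit precision = 4 bytes
load_gb = n_params * bytes_per_param_fp32 / 1e9
print(f"Weights only (FP32): {load_gb:.0f} GB")          # 4 GB
```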
Pre-training requires storing additional components beyond the model's parameters:
• Optimizer states (e.g., 2 for Adam)
• Gradients
• Forward activations
• Temporary variables

This could result in an additional 12-20 bytes of memory needed per model parameter. It would therefore require 16 GB to 24 GB of GPU memory to train a 1-billion-parameter LLM, around 4-6x the GPU RAM needed just for storing the model weights.

Hence, the memory needed for LLM training is:
• Excessive for consumer hardware
• Demanding even for data center hardware (for single-processor training)
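A rough estimate of this training footprint, assuming the 4-byte FP32 weights plus the quoted 12-20 extra bytes per parameter (the helper function below is illustrative, not a precise accounting):

```python
def training_memory_gb(n_params: int, extra_bytes_per_param: int) -> float:
    """Rough training-memory estimate: FP32 weights (4 bytes/param) plus
    optimizer states, gradients, activations, and temporary variables."""
    bytes_total = n_params * (4 + extra_bytes_per_param)
    return bytes_total / 1e9

n_params = 1_000_000_000   # 1B parameters
for extra in (12, 20):     # the 12-20 bytes/param range quoted above
    print(f"+{extra} bytes/param -> {training_memory_gb(n_params, extra):.0f} GB")
# +12 bytes/param -> 16 GB
# +20 bytes/param -> 24 GB
```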
Quantization reduces the memory required to store the weights of the model by converting the precision from 32-bit floating point to 16-bit floating point or 8-bit integers:

FP16 | BFLOAT16 | INT8 | INT4

Quantization maps the FP32 numbers to a lower precision space by employing scaling factors determined from the range of the FP32 numbers. In most cases, quantization strongly reduces memory requirements with a limited loss in prediction quality.
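As an illustration, here is a minimal sketch of symmetric INT8 quantization, where the scaling factor is derived from the range of the FP32 values as described above; production libraries use more refined schemes:

```python
import numpy as np

def quantize_int8(x_fp32: np.ndarray):
    """Symmetric INT8 quantization: map FP32 values to [-127, 127]
    using a scale factor derived from the tensor's range."""
    scale = np.abs(x_fp32).max() / 127.0
    q = np.clip(np.round(x_fp32 / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(5).astype(np.float32)
q, scale = quantize_int8(weights)
print(weights)
print(dequantize(q, scale))  # close to the original, at 1/4 of the memory
```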
BFLOAT16 is a popular alternative to FP16:
• Developed by Google Brain
• Balances memory efficiency and accuracy
• Wider dynamic range
• Optimized for storage and speed in ML tasks
e.g., FLAN-T5 was pre-trained using BFLOAT16.

Benefits of quantization:
• Less memory
• Potentially better model performance
• Higher calculation speed
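The wider dynamic range is easy to see in practice. The snippet below assumes PyTorch is installed and is purely illustrative:

```python
import torch

# BFLOAT16 keeps FP32's 8 exponent bits -> much wider dynamic range than FP16.
print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38, close to FP32

# A value that overflows FP16 but not BFLOAT16.
x = torch.tensor(1e10)
print(x.to(torch.float16))   # inf
print(x.to(torch.bfloat16))  # ~1e10, coarsely rounded

# Both use 2 bytes per parameter, i.e. half of FP32 storage.
print(torch.zeros(1, dtype=torch.bfloat16).element_size())  # 2
```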
SCALING LAWS

Researchers explored trade-offs between the dataset size, the model size, and the compute budget:
• Constraint: compute budget
• Scaling choices: dataset size (# of tokens) and model size (# of parameters)
• Objective: model performance

Increasing compute may seem ideal for better performance, but practical constraints like hardware, time, and budget limit its feasibility.

It has been empirically shown that, as the compute budget remains fixed:
• Fixed model size: increasing the training dataset size improves model performance.
• Fixed dataset size: larger models demonstrate lower test loss, indicating enhanced performance.
What's the optimal balance?
Once scaling laws have been estimated, we can use the Chinchilla approach, i.e., we can choose the dataset size and the model size to train a compute-optimal model, which maximizes performance for a given compute budget. The compute-optimal training dataset size is ~20x the number of parameters.
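Applying the ~20 tokens-per-parameter rule of thumb (a simplification of the full Chinchilla analysis; the model sizes used below are only examples):

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Rule-of-thumb compute-optimal training dataset size (~20x the parameter count)."""
    return n_params * tokens_per_param

for n_params in (70e9, 175e9, 540e9):  # Chinchilla-, GPT-3- and PaLM-sized models
    tokens = chinchilla_optimal_tokens(n_params)
    print(f"{n_params/1e9:.0f}B params -> ~{tokens/1e12:.1f}T training tokens")
# 70B params -> ~1.4T training tokens
# 175B params -> ~3.5T training tokens
# 540B params -> ~10.8T training tokens
```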