Here are modern implementations of LLM architecture components, sharding strategies, and kernel optimizations.
- Softmax in PyTorch with autograd and in NumPy (a minimal sketch follows this list)
- Linear projection in PyTorch with autograd
- Multi-Layer Perceptron (MLP) in PyTorch with autograd
- Disaggregated Serving with KV Cache in PyTorch
- Multi-Head Attention in PyTorch with autograd
- RMSNorm in PyTorch with autograd and in NumPy
- Transformer in PyTorch with autograd
- MLP Data Parallelism (DP) in PyTorch and in JAX
- MLP Tensor Parallelism (TP) in PyTorch and in JAX
- MLP Fully Sharded Data Parallelism (FSDP) in PyTorch and in JAX
- MLP Pipelining in PyTorch
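
For illustration, here is a minimal sketch of the softmax entry as a custom autograd op, cross-checked against a NumPy reference. The `SoftmaxFn` and `softmax_np` names are my own and the repo's actual implementation may differ; this only shows the `torch.autograd.Function` pattern the list items rely on.

```python
import numpy as np
import torch


class SoftmaxFn(torch.autograd.Function):
    """Sketch: numerically stable softmax over the last dim with a hand-written backward."""

    @staticmethod
    def forward(ctx, x):
        y = torch.exp(x - x.max(dim=-1, keepdim=True).values)  # subtract max for stability
        y = y / y.sum(dim=-1, keepdim=True)
        ctx.save_for_backward(y)  # backward needs only the output
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (y,) = ctx.saved_tensors
        # dL/dx = y * (g - sum(g * y)) is the softmax Jacobian-vector product
        return y * (grad_out - (grad_out * y).sum(dim=-1, keepdim=True))


def softmax_np(x, axis=-1):
    """NumPy reference used to cross-check the forward pass."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


if __name__ == "__main__":
    x = torch.randn(4, 8, dtype=torch.float64, requires_grad=True)
    y = SoftmaxFn.apply(x)
    assert np.allclose(y.detach().numpy(), softmax_np(x.detach().numpy()))
    # gradcheck compares the hand-written backward against finite differences
    assert torch.autograd.gradcheck(SoftmaxFn.apply, (x,))
```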
The following are roofline analyses for different architectures; the operations are non-fused (a back-of-the-envelope arithmetic-intensity sketch follows this list).
- Masking in NumPy
- `torch.distributed` API: don't use the old primitives; instead use the in-place, tensor-based ones like `dist.all_gather_into_tensor` and `dist.reduce_scatter_tensor`, which aggregate along the primary (first) dimension (a minimal usage sketch follows below).
- Custom classes for training require `torch.autograd.Function`, `@staticmethod` forward/backward methods, and `ctx.save_for_backward` (see the softmax sketch above).
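
To make the non-fused roofline point concrete, here is a back-of-the-envelope arithmetic-intensity calculation for an unfused attention-mask pass. The tensor shape, dtype byte counts, and hardware peaks are illustrative assumptions, not numbers taken from the analyses themselves.

```python
# Back-of-the-envelope roofline numbers for a non-fused attention-mask kernel.
# Shapes, dtype sizes, and hardware peaks below are illustrative assumptions.
B, H, S = 8, 32, 4096              # batch, heads, sequence length
elems = B * H * S * S              # attention-score elements [B, H, S, S]

flops = elems                      # ~1 op per element (select / add -inf)
bytes_moved = elems * (2 + 1 + 2)  # read fp16 scores + bool mask, write fp16 result

intensity = flops / bytes_moved    # FLOP per byte moved to/from HBM

peak_flops = 312e12                # e.g. ~312 TFLOP/s fp16 (A100-class, assumed)
peak_bw = 2.0e12                   # e.g. ~2 TB/s HBM bandwidth (assumed)
ridge = peak_flops / peak_bw       # intensity needed to become compute-bound

print(f"arithmetic intensity: {intensity:.2f} FLOP/B, ridge point: {ridge:.0f} FLOP/B")
```

Because the intensity sits far below the ridge point, the unfused mask is bandwidth-bound, which is the usual motivation for fusing such elementwise steps into the surrounding kernel.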
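
And here is a minimal sketch of the tensor-based collectives mentioned above. The two-process `torchrun` launch, the NCCL backend, the filename, and the shard shapes are illustrative assumptions rather than this repo's configuration.

```python
"""Minimal sketch of the tensor-based collectives.
Launch with e.g.: torchrun --nproc_per_node=2 collectives_sketch.py
(the filename and the 2-GPU NCCL setup are assumptions for illustration)."""
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group("nccl")  # rank/world size come from torchrun's env vars
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    world = dist.get_world_size()

    shard = torch.full((4,), float(dist.get_rank()), device="cuda")  # this rank's shard

    # Tensor-based collective: every rank's shard lands in one preallocated buffer,
    # concatenated along the first (primary) dimension -- no Python list of
    # per-rank tensors as with the legacy dist.all_gather.
    full = torch.empty(world * shard.numel(), device="cuda")
    dist.all_gather_into_tensor(full, shard)

    # Counterpart in the reduce direction: sum-reduce `full`, then scatter equal
    # chunks of its first dimension back to the ranks, into a preallocated output.
    out = torch.empty_like(shard)
    dist.reduce_scatter_tensor(out, full)

    # dist.all_reduce already mutates its argument in place.
    dist.all_reduce(shard, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```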




