Stars
CUDA Templates and Python DSLs for High-Performance Linear Algebra
Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels
A high-throughput and memory-efficient inference and serving engine for LLMs
PyTorch native quantization and sparsity for training and inference
A unified library of SOTA model optimization techniques like quantization, distillation, pruning, neural architecture search, speculative decoding, etc. It compresses deep learning models for downs…
Development repository for the Triton language and compiler