🚀 Meet oLLM: Enabling 100K-Token Contexts on Just an 8 GB GPU

Amid all the buzz about quantization, pruning, and multi-GPU setups, here’s a refreshing design pivot: oLLM, a lightweight Python library, pushes large-context LLM inference into the “affordable GPU” realm using SSD offload — no quantization required.

🎯 Why oLLM Is a Game Changer
Ultra-long context without a VRAM blowup: it streams layer weights and attention key/value caches from SSD to GPU, keeping VRAM within 8–10 GB even for 100K-token contexts.
No quantization compromises: oLLM retains FP16/BF16 precision, trading speed for accessibility rather than accuracy.
Supports big models on modest hardware: examples include Llama-3 (1B / 3B / 8B), GPT-OSS 20B, and Qwen3-Next-80B (sparse MoE).
Transparent trade-offs = control: you pay in SSD I/O and latency. Throughput is modest (e.g. ~0.5 tokens/sec on Qwen-80B with a 50K context), but for offline, large-document tasks (summarization, log analysis, compliance) it’s extremely compelling.

🛠️ How It Works (in a Nutshell)
Memory offload to SSD: weights and the KV cache bypass host RAM and are stored on fast NVMe SSD, streamed or paged in as needed.
FlashAttention-2 & chunked MLPs: these speedups reduce peak memory demands so the full attention matrix never fully materializes.
DiskCache with efficient I/O: uses GPUDirect Storage and fast I/O frameworks (e.g. KvikIO / cuFile) to reduce I/O latency overhead.

✅ Use Cases Where oLLM Shines
📄 Large-document summarization & compliance review: process books, legal documents, logs — whole context, not split windows.
🔎 Knowledge retrieval / RAG over huge corpora: full passage context rather than batched windows.
📊 Offline analytics / batch inference: where throughput is less critical than context fidelity.
💡 ML research & experimentation: for anyone wanting to push context limits without server-grade hardware.

🔍 Key Limitations & Trade-Offs (Be Realistic)
Latency & throughput are not production-grade: heavy I/O overhead means it’s not ideal for real-time services.
Storage load is heavy: long contexts demand massive SSD space (e.g., 100K tokens → tens to hundreds of GB of KV cache).
Hardware dependencies: you’ll need fast NVMe SSDs and GPUs with good I/O paths.
Not a drop-in replacement: for ultra-high-throughput or low-latency production use, conventional multi-GPU pipelines still dominate.

📌 TL;DR (for the scroll-haters)
🚀 oLLM lets you run massive-context LLMs offline on a single 8 GB GPU by offloading memory to SSD — no quantization needed. Great for deep dives, long-content tasks, and pushing what’s possible on modest hardware. If you’re tinkering with long-context applications and don’t have access to server farms, this is one of the most interesting libraries to explore right now.
oLLM: A Lightweight Library for Large-Context LLM Inference
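To make the SSD-offload idea concrete without claiming to show oLLM's own API, here is a minimal, hedged sketch using the Hugging Face transformers + accelerate disk-offload path. It only offloads weights (not the KV cache, which oLLM also handles), and the model name, folder, and parameters are illustrative placeholders.

```python
# Minimal sketch of weight offload to disk, assuming the Hugging Face
# transformers + accelerate stack (NOT oLLM's own API). Model name, folder,
# and generation settings are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate place layers across GPU, CPU, and disk;
# layers that do not fit are paged in from the offload folder on demand.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,        # keep BF16 precision, no quantization
    device_map="auto",
    offload_folder="./offload",        # NVMe-backed directory for spilled weights
)

prompt = "Summarize the following document:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The trade-off is the same one the post describes: the model stays at full precision, but every layer that lives on disk costs I/O time on each forward pass.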
More Relevant Posts
For engineers who want to optimize LLMs for CPU or Edge, learn about GGUF.

On HuggingFace, optimized LLMs stored as GGUF checkpoints have boomed in popularity.

Initially (4-5 years ago), we had .bin or .pt sharded checkpoints for LLM models:
`model-0001-of-0005.bin`
`model-0002-of-0005.bin`

Then, HF Safetensors fixed a critical flaw of PyTorch serialized models: executing arbitrary code during unpickling. Safetensors are the standard today, but those models are in BF16 precision, which still requires a lot of compute.

✅ Recently, GGUF models, which are quantized models, have become popular.
> GGUF is a binary format optimized for quick loading and saving of models, making it highly efficient for inference purposes and compatible with GGML.

When you see a GGUF model like Q4, Q5_K, or IQ2_K_S, that suffix tells you how it was quantized. We can generally split the quants into 3 distinct groups:
𝟭/ 𝗟𝗲𝗴𝗮𝗰𝘆 𝗤𝘂𝗮𝗻𝘁𝘀
𝟮/ 𝗞-𝗤𝘂𝗮𝗻𝘁𝘀
𝟯/ 𝗜-𝗤𝘂𝗮𝗻𝘁𝘀

1. Legacy Quants (Q4_0, Q4_1, Q8_0)
> Block-based quantization: weights are split into fixed-size blocks
> Uses 1 or 2 extra constants per block to scale weights back
> Best: fast and simple, but accuracy can't be tuned much.

2. K-Quants (Q3_K_S, Q5_K_M)
> Block-wise quantization with per-block scaling
> Mixed quantization: for example, some weights in 4 bits, others kept at higher precision
> Popular with large models (8B+)
> Best: some layers can be compressed more to give extra bits to critical layers.

3. I-Quants (IQ2_XXS, IQ3_S)
> Build on K-Quants
> Introduce an importance matrix to identify critical weights
> Best: the most customizable quant type.

✅ Most GGUF models today use K-Quants or I-Quants, especially for larger LLMs. Understanding which quantization type to use improves inference speed while cutting down on memory usage. That's key for resource-constrained environments!

---
➕ Follow for more expert AI/ML insights!
➕ Join 6500+ engineers, learning production-ready AI. https://lnkd.in/ed6FRFCH
Cheers!
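For a concrete feel of how these quants are consumed, here is a minimal, hedged sketch using llama-cpp-python; the file name and parameters are illustrative assumptions, not something from the post.

```python
# Minimal sketch of running a quantized GGUF checkpoint with llama-cpp-python.
# The model path and generation parameters are placeholders; pick the quant
# suffix (Q4_K_M, Q5_K_M, IQ3_S, ...) that fits your memory budget.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder K-quant file
    n_ctx=4096,        # context window
    n_gpu_layers=0,    # 0 = pure CPU; raise to offload some layers to a GPU
    n_threads=8,       # CPU threads used for inference
)

out = llm(
    "Explain the difference between K-quants and I-quants in one sentence.",
    max_tokens=128,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```

The quant suffix in the filename is the only thing you change to trade memory for accuracy; the loading code stays identical.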
Want to learn the AI stack for LLMs on Edge? Check out this Live Coding Session unpacking:
> llama.cpp - architecture, workflow, components
> GGML - the ML tensor library built in C++
> GGUF - the binary model format for storing highly quantized LLMs

Find it here: https://lnkd.in/dzAg8RA3
I've been running several GGUF models on my laptop. As someone who uses GGUF models, I can confirm it's a game-changing technology for local AI. This post provides a fantastic, clear insight into how it actually works. If you want to briefly understand the tech that lets you run LLMs on your own machine, this is it. #AI #EdgeAI #LLM #Developer #GGUF
7️⃣ What are the different multi-GPU training paradigms, and what are their respective advantages and disadvantages❓

Multi-GPU training paradigms can be categorized into two groups:
🔱 Data parallelism: dividing data for parallel processing across multiple GPUs.
🔱 Model parallelism: dividing the model across GPUs to handle memory constraints when the model size exceeds a single GPU’s capacity.

🔰 Model Parallelism
Also known as inter-op parallelism, this technique places different sections of a large model on different GPUs. Computation proceeds sequentially, with intermediate results passed between devices.
🌝 Advantage: Useful when the full model cannot fit into a single GPU’s memory.
🌚 Disadvantage: GPUs must wait for each other’s outputs, limiting parallel efficiency.

🔰 Data Parallelism
A minibatch is split into smaller microbatches. Each GPU processes a microbatch independently, computing loss and gradients. These gradients are then aggregated to update the model weights.
🌝 Advantage: GPUs operate in parallel, improving throughput.
🌚 Disadvantage: Each GPU must hold a full copy of the model, which may be infeasible for large models.

🔰 Tensor Parallelism
Also called intra-op parallelism, this is a more granular form of model parallelism. Instead of distributing entire layers, it splits weight and activation matrices across GPUs (e.g., row-wise or column-wise) so that individual matrix multiplications are parallelized.
🌝 Advantage: Combines memory efficiency with parallel execution.
🌚 Disadvantage: Can incur high communication overhead due to frequent inter-GPU data exchange.

🔰 Pipeline Parallelism
The model is split across devices, with different layers assigned to different GPUs. During training:
⏩ Forward pass: Activations flow through the pipeline.
⏪ Backward pass: Gradients propagate in reverse.
To reduce idle time, input batches are divided into microbatches that move through the pipeline in staggered fashion.
🌝 Advantage: Improves device utilization.
🌚 Disadvantage: Still suffers from stage-wise waiting; designing pipeline stages and managing communication is complex.

🔰 Sequence Parallelism
Designed for transformer-based LLMs, where self-attention scales quadratically with sequence length. Sequence parallelism splits long input sequences into smaller chunks distributed across GPUs.
🌝 Advantage: Reduces memory and compute bottlenecks for long sequences.
🌚 Disadvantage: Limited to sequential data and may require careful synchronization.

💠 Summary:
🔹 Sequence parallelism: Optimizes long-sequence processing.
🔹 Tensor parallelism: Splits internal model operations.
🔹 Data parallelism: Divides training data across GPUs.
Each paradigm addresses different bottlenecks (memory, compute, or communication) and can be combined in large-scale training setups.

#MLQandAI #machinelearning #artificialintelligence #deeplearning
Day 7 of #30dayChallenge
https://lnkd.in/dPnXwx82
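To ground the data-parallel case, here is a minimal PyTorch DistributedDataParallel sketch. The model and data are toy placeholders, and it assumes a `torchrun` launch with one process per GPU.

```python
# Minimal data-parallel training sketch with PyTorch DDP.
# Assumes launch via `torchrun --nproc_per_node=<num_gpus> train.py`;
# the model and data here are toy placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])      # gradients are all-reduced automatically
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(100):
        # each rank trains on its own microbatch; DDP averages gradients across ranks
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Note the trade-off from the post: every rank holds the full model, and only the gradients travel between GPUs.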
Exercise 7️⃣ - 🥇
Suppose we are implementing our own version of tensor parallelism, which works great when we train our model with a standard stochastic gradient descent optimizer. However, when we try the Adam optimizer by Diederik P. Kingma and Jimmy Ba, we encounter an out-of-memory error. What problem might explain this issue❓

💡 Adam uses extra memory to store moving averages of the gradients and of the squared gradients. In tensor parallelism, these extra optimizer states are duplicated across GPUs unless they are carefully sharded. That can cause out-of-memory errors, especially with large models.

Exercise 7️⃣ - 🥈
Suppose we don’t have access to a GPU and are considering using data parallelism on the CPU. Is this a good idea❓

💡 No, it's usually not a good idea. CPUs have much slower parallel processing and memory bandwidth than GPUs, so data parallelism on CPU often leads to high overhead and poor speedup.

#MLQandAI #machinelearning #artificialintelligence #deeplearning
Day 7 of #30dayChallenge
https://lnkd.in/dPnXwx82
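A quick back-of-the-envelope check of why Adam blows the memory budget: a small sketch, assuming FP16 weights/gradients with FP32 optimizer states (a common mixed-precision setup); the parameter count is a placeholder.

```python
# Rough per-GPU memory estimate for training with Adam, assuming FP16
# weights/gradients and FP32 optimizer states (a common mixed-precision setup).
def adam_training_memory_gb(num_params: float) -> dict:
    bytes_fp16, bytes_fp32 = 2, 4
    weights   = num_params * bytes_fp16
    gradients = num_params * bytes_fp16
    # Adam keeps two extra FP32 tensors (first and second moment) per parameter,
    # often plus an FP32 master copy of the weights.
    adam_states    = num_params * bytes_fp32 * 2
    master_weights = num_params * bytes_fp32
    gib = 1024 ** 3
    return {
        "weights_gb": weights / gib,
        "gradients_gb": gradients / gib,
        "adam_states_gb": adam_states / gib,
        "master_weights_gb": master_weights / gib,
        "total_gb": (weights + gradients + adam_states + master_weights) / gib,
    }

# Example: for a 7B-parameter model, Adam's two FP32 moments alone add ~52 GiB,
# more than the FP16 weights + gradients combined (~26 GiB).
print(adam_training_memory_gb(7e9))
```

That extra ~2x-plus overhead is exactly the state that needs to be sharded (e.g., ZeRO-style) rather than replicated on every GPU.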
#Day12 of #50DaysOfMastery: Your Model is Slow. Stop Guessing Why.

Your observability dashboard is flashing red: latency is spiking. The "Check Engine" light is on.
The junior engineer's response is, "We need a bigger GPU."
The senior engineer's response is, "Show me the profiler."

Performance optimization is a science, not an art. And the first rule of this science is: don't guess, measure. A model's performance problem is rarely one single thing. It's almost always a shifting bottleneck that changes depending on the model's lifecycle. A production model's life has different stages, each with a different bottleneck.

1. The Startup: The I/O Bottleneck
You just deployed a new pod. For the first 30 seconds, it's not responding.
The Symptom: High "Time to First Byte" (TTFB).
The Culprit: The model is I/O bound. It's frantically trying to read a 20GB model file from a slow disk. A faster GPU is useless here.
The Fix: Pre-loading the model into the container image (using a "warm" image) or using a persistent volume to keep the model data instantly accessible.

2. The Idle State: The Memory Bottleneck
The model is loaded and waiting for a request.
The Symptom: High, constant memory usage.
The Culprit: The model is memory bound. Those massive weights are just sitting there, occupying all the available RAM, pressuring the system.
The Fix: This is where you apply model-level optimizations:
Quantization: Converting weights from 32-bit floats to 8-bit integers (INT8). It's a 4x size reduction with minimal accuracy loss.
Pruning: Removing unnecessary, near-zero weights from the network.
Distillation: Training a smaller "student" model to mimic the behavior of your giant "teacher" model.

3. The Inference: The Compute/Code Bottleneck
A request comes in. The CPU spikes to 100%.
The Symptom: High inference time after the request is received.
The Culprit: The model is finally compute bound. The actual matrix operations are the bottleneck. Now you can consider a GPU.
The Fix:
Hardware: Add a GPU or TPU.
Code: Use an efficient serving framework (like Triton) or batch small requests together (e.g., process 16 images at once, not 16 single-image requests).
Architecture: Use async I/O (like FastAPI) so your server isn't blocked and can handle thousands of other requests while waiting for the model to compute.

Day's Lesson: Performance is not a single problem. It's a chain of bottlenecks. The key to optimization is to identify the current, biggest bottleneck, fix it, and then measure again to find the next one.

#50DaysOfMastery #MLOps #Performance #Optimization #Kubernetes #DevOps #SRE
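As a concrete illustration of the async-I/O fix, here is a minimal sketch assuming FastAPI, with the blocking model call pushed off the event loop. The `run_model` function and the payload shape are hypothetical placeholders.

```python
# Minimal sketch: keep the event loop free while a blocking model call runs.
# `run_model` is a hypothetical stand-in for your real inference call.
import time
from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

def run_model(text: str) -> str:
    time.sleep(0.5)  # pretend this is a heavy, blocking forward pass
    return f"prediction for: {text[:32]}"

@app.post("/predict")
async def predict(req: PredictRequest):
    # The blocking call runs in a worker thread, so the server keeps
    # accepting other requests instead of stalling on this one.
    result = await run_in_threadpool(run_model, req.text)
    return {"result": result}

# Run with: uvicorn app:app
```

The point is not that threads make the model faster; they keep the server responsive while you fix the real compute bottleneck.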
LLM inference infra (https://lnkd.in/gNDaQrzg) is evolving fast. What used to be “just scale GPUs” has become a game of routing, caching, and memory transfers at cluster scale.

1/ Routing
A Radix Tree (compact prefix tree) is a Trie variant that merges single-child paths, reducing memory usage significantly compared to standard Tries. (Good explainer here: https://lnkd.in/ggxeGhBA). Radix Trees are now used to compute prefix-based KV cache hit rates. This enables KV-aware routing: sending a request to the worker most likely to already hold the right cache blocks, while still balancing load to avoid hotspots. Not every query needs a dedicated prefill step—if the decoder has a high prefix hit probability, the prompt is short, or prefill workers are saturated, it can be more efficient to prefill locally inside the decode worker.

2/ Independent Scaling
Scaling is no longer just about QPS. A surge in long input sequences stresses prefills more than decoders, so the system must scale each tier independently. That only works if both prefill and decode workers are stateless with respect to KV ownership. Each decoder, on startup, allocates its KV blocks (a reserved chunk of GPU/CPU memory that the worker sets aside exclusively for storing those KV tensors) and publishes their descriptors into a distributed store (e.g., etcd). Prefill workers lazily fetch these descriptors when they first interact with a decoder, cache them locally, and thereafter exchange only compact block IDs during handoff. This design makes both tiers elastic: you can add or remove prefill/decode workers at runtime without disrupting active sessions.

3/ Caching
Inference systems now look like CPUs in miniature, juggling a memory hierarchy: GPU caches (fast but small), CPU memory (larger but slower), and cross-device links (PCIe, NVLink, RDMA). Think of this as L1/L2/L3 caching logic, but applied to distributed GPU memory pools.

4/ Pub/Sub and Direct Reads
Optimizations focus on reducing redundant transfers and speeding up prefill→decode KV cache handoffs. Communication between prefills and decoders increasingly relies on in-memory pub/sub for request coordination. But the heavy data movement is handled via RDMA (Remote Direct Memory Access). With RDMA, a prefill worker can directly read KV blocks from a decode worker’s GPU memory (or vice versa) without involving the remote CPU. That turns what would have been an RPC + copy into a zero-copy memory operation over the network fabric, cutting latency and freeing CPUs to do… nothing at all.

The next generation of infra is as much about clever systems design as it is about FLOPs!!!
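A toy sketch of the KV-aware routing idea from 1/: it uses a plain token-level trie rather than a true radix tree, and the worker names are hypothetical. The router simply picks the worker with the longest cached prefix of the incoming prompt.

```python
# Toy sketch of prefix-aware routing: send a request to the worker whose
# cached prefixes overlap the most with the incoming prompt tokens.
# A production router would use a radix tree and real KV block metadata;
# worker names here are hypothetical.
from collections import defaultdict

class PrefixIndex:
    def __init__(self):
        self.root = {}

    def insert(self, tokens, worker):
        """Record that `worker` holds KV blocks for this token prefix."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {"_workers": set()})
            node["_workers"].add(worker)

    def best_worker(self, tokens):
        """Return (worker, matched_prefix_len) with the longest cached prefix."""
        node, best, depth = self.root, defaultdict(int), 0
        for t in tokens:
            if t not in node:
                break
            node = node[t]
            depth += 1
            for w in node["_workers"]:
                best[w] = depth
        if not best:
            return None, 0
        worker = max(best, key=best.get)
        return worker, best[worker]

index = PrefixIndex()
index.insert([1, 2, 3, 4, 5], worker="decode-0")   # e.g., a cached system prompt
index.insert([1, 2, 9],       worker="decode-1")

print(index.best_worker([1, 2, 3, 4, 7]))  # -> ('decode-0', 4): reuse 4 cached tokens
```

A real scheduler would also weigh current load against the hit length, so a long cached prefix on a saturated worker does not create a hotspot.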
Pay More. Buy Less. See Most.

I’ll be speaking at the NVIDIA booth at AWS re:Invent 2025, representing Capgemini. Stop by if you’d like to see what happens when enterprises stop scaling sideways and start processing at GPU speed.

This isn’t another “AI use-case” session. It’s a look at the physics of modern data — the part nobody writes whitepapers about because it’s messy, fast, and slightly insulting to CPUs.
⸻
1️⃣ Ingestion — the CPU is no longer on the guest list
The classic route — NVMe → CPU DRAM → GPU — is a Rube Goldberg machine powered by latency. GPUDirect Storage (GDS) and RDMA over InfiniBand cut it out entirely. Data moves straight from NVMe or FSx for Lustre into HBM. The CPU still waves the green flag, but the race is already over.
⸻
2️⃣ ETL at memory bandwidth
Your Spark cluster still thinks “shuffle” is a feature. RAPIDS, Ray, and Polars GPU push transforms directly into HBM, processing at terabytes per second. No shuffle, no serialization, no orphaned executors — just compute at the speed of physics.
⸻
3️⃣ HBM overflow — elasticity without melodrama
HBM is fast but finite; that’s why modern GPUs let it overflow across NVLink/NVSwitch peers or out to NVMe through GDS. You trade nanoseconds for microseconds, not milliseconds. Elastic memory finally lives up to its name.
⸻
4️⃣ BI that explains instead of entertains
Most BI tools are animated slide decks. cuTENSOR and cuBLAS turn BI back into mathematics — tensor decomposition across customer × product × channel × time. It doesn’t just tell you what changed; it shows why it did. KPIs become equations, not colors.
⸻
5️⃣ GH, GB, and H series — the unified fabric
With NVLink-C2C, NVSwitch, and InfiniBand, compute and data share one coherent memory space. ETL, analytics, and BI aren’t tiers anymore — they’re just kernels sharing the same address space. Less architecture, more throughput.
⸻
6️⃣ The economics of clarity
You pay more per GPU, but you also get done with them faster. You delete orchestration scripts last touched during the Obama administration. And you finally see your business as a tensor evolving through time, not a table trying to stay relevant.
⸻
Come by the NVIDIA booth at AWS re:Invent if you want to talk about what really happens when you build for bandwidth, not patience.

Pay More. Buy Less. See Most.

Rebecca (Smith) Gentile Phil Lee Kevin Levitt Tony Santiago Sangeeta Ron MBA, MTech, CAMS Arindam Choudhury Deepak Juneja Nilesh Vaidya Chirag Thakral Jack-Ryan Ashby Vik Patel Amita Patel Amazon Web Services (AWS) #tPower #ml #ai #GenAI4FS #inc81starch
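As a small, hedged illustration of the GPU-resident ETL pattern in point 2️⃣, here is a minimal RAPIDS cuDF sketch; the file path and column names are made-up placeholders.

```python
# Minimal sketch of GPU-side ETL with RAPIDS cuDF (path and column names are
# hypothetical). The dataframe lives in GPU memory, so the filter and
# groupby/aggregation run on the GPU without a CPU round trip.
import cudf

df = cudf.read_parquet("transactions.parquet")        # placeholder input file
daily = (
    df[df["amount"] > 0]                              # filter on the GPU
    .groupby(["channel", "product_id"])
    .agg({"amount": "sum", "latency_ms": "mean"})     # aggregate on the GPU
    .reset_index()
)
print(daily.head())
```

The API mirrors pandas, which is the point: the same transform logic, but executed against HBM instead of host DRAM.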
MIG, MPS, and the Real Story of GPU Isolation

✳️ MIG (Multi-Instance GPU) is hard partitioning of one physical NVIDIA GPU into multiple independent mini-GPUs. Each MIG instance gets its own SMs (whole units), L2 slices, HBM address range, and copy engines, exposed to software as a separate CUDA device.
✳️ MPS (Multi-Process Service): a scheduler that lets multiple processes share one GPU context concurrently. Great for utilization, but it does not physically partition cache or DRAM bandwidth.

Resource partitioning: what’s really isolated?

🔹 SMs / Front-end
🔺 MIG: carves the GPU at GPC granularity (a GPC = a cluster of SMs + front-end). Each instance owns whole SMs (not fractions) with independent warp schedulers, register files, and instruction caches inside that slice. One tenant can't starve the SMs assigned to another.
🔺 MPS: kernels from different processes time-share SMs. You can cap residency (e.g., via active-thread percentage), but it's cooperative scheduling, not hard fencing. A heavy, latency-sensitive kernel can still get jitter from others.

🔹 L2 cache
🔺 MIG: L2 is physically sliced. An instance maps to a disjoint set of L2 slices and tags. Result: tenants don't evict each other's cache lines and don't contend on the same slice pipelines.
🔺 MPS: everyone shares the full L2. Classic cache thrash and back-pressure show up under mixed workloads (e.g., small-working-set inference next to bandwidth-streaming training).

🔹 HBM capacity & channels
🔺 MIG: each instance gets its own HBM address range and a bound subset of memory controller channels (a fixed fraction of peak). Capacity and a corresponding chunk of sustained bandwidth are reserved.
🔺 MPS: all processes see the whole memory and all controllers. The heaviest streamer wins more bandwidth unless you engineer app-level throttles.

🔹 Copy / DMA engines & accelerators
🔺 MIG: instances are assigned dedicated copy engines (and often per-slice JPEG/Video/NVDEC/NVENC where present). Data movement for tenant A won't queue behind tenant B on the same engine.
🔺 MPS: engines are shared. Large async memcpy/IPC bursts from one client can elongate latency for others.

Still shared under MIG (important!)
🔹 Power/thermal budget & global clocks: a hot neighbour can trigger frequency drops visible to all slices.
🔹 External I/O: PCIe/NVLink/NVSwitch links are shared pipes. MIG doesn't guarantee per-slice link bandwidth.

Choosing the right tool
✅ Strict SLOs / multi-tenant clouds / mixed inference → MIG
You want predictable p99s, cache isolation, fixed memory capacity, and blast-radius containment.
✅ Throughput maximization within one team / cooperative workloads → MPS
Great for packing small jobs, fast context switches, and keeping SMs busy, if you can tolerate latency variance.
✅ Hybrid
Use MIG to define 2–4 strong fences, then run MPS inside each slice to keep utilization high while bounding cross-team interference.

#NVIDIA #GPU #MIG #AI #ML
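A small, hedged sketch of how a workload pins itself to one MIG slice from Python; the MIG UUID is a made-up placeholder (list the real ones with `nvidia-smi -L`).

```python
# Minimal sketch: pin a PyTorch process to a single MIG slice.
# The MIG UUID below is a made-up placeholder; `nvidia-smi -L` prints real ones.
import os

# Must be set before CUDA is initialized (i.e., before the first torch.cuda call).
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-a1b2c3d4-e5f6-7890-abcd-ef1234567890"

import torch

assert torch.cuda.is_available()
print(torch.cuda.device_count())        # 1 -> the process sees only its MIG slice
print(torch.cuda.get_device_name(0))    # device name as seen by this process
x = torch.randn(4096, 4096, device="cuda")
y = x @ x                               # runs on the SMs/L2/HBM carved out for this slice
print(y.shape)
```

From the application's point of view the slice is just "the GPU"; the isolation guarantees (and the shared power/PCIe caveats above) are enforced underneath.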