How LLM inference infra is evolving beyond GPU scaling

LLM inference infra (https://lnkd.in/gNDaQrzg) is evolving fast. What used to be “just scale GPUs” has become a game of routing, caching, and memory transfers at cluster scale.

1/ Routing
A Radix Tree (compact prefix tree) is a Trie variant that merges single-child paths, reducing memory usage significantly compared to standard Tries (good explainer here: https://lnkd.in/ggxeGhBA). Radix Trees are now used to compute prefix-based KV cache hit rates. This enables KV-aware routing: sending a request to the worker most likely to already hold the right cache blocks, while still balancing load to avoid hotspots. Not every query needs a dedicated prefill step: if the decoder has a high prefix-hit probability, the prompt is short, or the prefill workers are saturated, it can be more efficient to prefill locally inside the decode worker (sketch 1 below).

2/ Independent Scaling
Scaling is no longer just about QPS. A surge in long input sequences stresses prefill workers more than decoders, so the system must scale each tier independently. That only works if both prefill and decode workers are stateless with respect to KV ownership. Each decoder, on startup, allocates its KV blocks (reserved chunks of GPU/CPU memory set aside exclusively for storing KV tensors) and publishes their descriptors into a distributed store (e.g., etcd). Prefill workers lazily fetch these descriptors when they first interact with a decoder, cache them locally, and thereafter exchange only compact block IDs during handoff. This design makes both tiers elastic: you can add or remove prefill/decode workers at runtime without disrupting active sessions (sketch 2 below).

3/ Caching
Inference systems now look like CPUs in miniature, juggling a memory hierarchy: GPU caches (fast but small), CPU memory (larger but slower), and cross-device links (PCIe, NVLink, RDMA). Think of this as L1/L2/L3 caching logic, but applied to distributed GPU memory pools (sketch 3 below).

4/ Pub/Sub and Direct Reads
Optimizations focus on reducing redundant transfers and speeding up prefill→decode KV cache handoffs. Communication between prefill and decode workers increasingly relies on in-memory pub/sub for request coordination, but the heavy data movement is handled via RDMA (Remote Direct Memory Access). With RDMA, a prefill worker can directly read KV blocks from a decode worker’s GPU memory (or vice versa) without involving the remote CPU. That turns what would have been an RPC + copy into a zero-copy memory operation over the network fabric, cutting latency and freeing CPUs to do… nothing at all (sketch 4 below).

The next generation of infra is as much about clever systems design as it is about FLOPs!!!
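Sketch 1 (routing): a minimal radix tree over token IDs plus a router that scores each worker by prefix hit length minus a load penalty, and decides when to skip a dedicated prefill. This is an illustrative sketch, not any specific framework's router; the names and thresholds (RadixNode, KvAwareRouter, LOAD_PENALTY, LOCAL_PREFILL_HIT, SHORT_PROMPT) are assumptions.

```python
class RadixNode:
    """Compact prefix tree: each edge stores a run of tokens (single-child paths merged)."""

    def __init__(self):
        self.edges = {}  # first token of the run -> (token_run, child_node)

    def insert(self, tokens):
        if not tokens:
            return
        key = tokens[0]
        if key not in self.edges:
            # New edge holding the whole remaining run.
            self.edges[key] = (tuple(tokens), RadixNode())
            return
        run, child = self.edges[key]
        # Length of the shared prefix between the edge label and the input.
        i = 0
        while i < len(run) and i < len(tokens) and run[i] == tokens[i]:
            i += 1
        if i < len(run):
            # Split the edge: keep the shared part, push the rest down.
            mid = RadixNode()
            mid.edges[run[i]] = (run[i:], child)
            self.edges[key] = (run[:i], mid)
            child = mid
        child.insert(tokens[i:])

    def match_len(self, tokens):
        """Longest prefix of `tokens` already present in the tree."""
        if not tokens or tokens[0] not in self.edges:
            return 0
        run, child = self.edges[tokens[0]]
        i = 0
        while i < len(run) and i < len(tokens) and run[i] == tokens[i]:
            i += 1
        if i < len(run):
            return i
        return i + child.match_len(tokens[i:])


LOAD_PENALTY = 64        # tokens of "credit" each queued request costs (made-up value)
LOCAL_PREFILL_HIT = 0.8  # prefix-hit ratio above which a dedicated prefill is skipped
SHORT_PROMPT = 256       # prompts shorter than this are prefetched in the decode worker


class KvAwareRouter:
    def __init__(self, workers):
        self.index = {w: RadixNode() for w in workers}   # worker -> cached prefixes
        self.queue_len = {w: 0 for w in workers}         # worker -> in-flight requests

    def route(self, prompt_tokens):
        # Score = expected cache hit minus a load penalty, so one hot prefix
        # doesn't turn a single decoder into a hotspot.
        def score(w):
            hit = self.index[w].match_len(prompt_tokens)
            return hit - LOAD_PENALTY * self.queue_len[w]

        best = max(self.index, key=score)
        hit_ratio = self.index[best].match_len(prompt_tokens) / max(len(prompt_tokens), 1)
        # A real router would also check prefill-worker saturation here and
        # decrement queue_len when requests complete.
        local_prefill = (hit_ratio >= LOCAL_PREFILL_HIT
                         or len(prompt_tokens) < SHORT_PROMPT)
        self.index[best].insert(prompt_tokens)  # this prefix will now be cached there
        self.queue_len[best] += 1
        return best, local_prefill


router = KvAwareRouter(["decode-0", "decode-1"])
print(router.route(list(range(500))))          # cold cache: scores tie, plain load balancing
print(router.route(list(range(500)) + [777]))  # warm: prefix hit outweighs the load penalty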
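Sketch 2 (independent scaling): decode workers reserve KV blocks at startup and publish descriptors; prefill workers fetch them lazily and cache them. A plain dict stands in for the distributed store (etcd in the post), and the descriptor fields and addresses are illustrative assumptions.

```python
import json
from dataclasses import dataclass, asdict

DISCOVERY_STORE = {}  # stand-in for etcd: key -> JSON value


@dataclass
class KvBlockDescriptor:
    worker_id: str
    device: str        # e.g. "gpu:0"
    base_addr: int     # registered memory region (what an RDMA NIC would target)
    block_bytes: int
    num_blocks: int


class DecodeWorker:
    def __init__(self, worker_id, num_blocks=1024, block_bytes=2 << 20):
        self.worker_id = worker_id
        # On startup: reserve KV blocks and publish their descriptor once.
        desc = KvBlockDescriptor(worker_id, "gpu:0", base_addr=0xDEAD0000,
                                 block_bytes=block_bytes, num_blocks=num_blocks)
        DISCOVERY_STORE[f"/kv/decoders/{worker_id}"] = json.dumps(asdict(desc))

    def shutdown(self):
        # Removing the key is all it takes to drain this worker from routing.
        DISCOVERY_STORE.pop(f"/kv/decoders/{self.worker_id}", None)


class PrefillWorker:
    def __init__(self):
        self._desc_cache = {}  # decoder id -> descriptor, fetched lazily

    def descriptor_for(self, decoder_id):
        if decoder_id not in self._desc_cache:
            raw = DISCOVERY_STORE[f"/kv/decoders/{decoder_id}"]
            self._desc_cache[decoder_id] = KvBlockDescriptor(**json.loads(raw))
        return self._desc_cache[decoder_id]

    def hand_off(self, decoder_id, block_ids):
        desc = self.descriptor_for(decoder_id)
        # After the first lookup, only compact block IDs cross the wire.
        return {"decoder": decoder_id, "blocks": block_ids,
                "region": hex(desc.base_addr)}


decoder = DecodeWorker("decode-0")
prefill = PrefillWorker()
print(prefill.hand_off("decode-0", block_ids=[3, 17, 42]))
decoder.shutdown()  # elastic: no prefill-side state to migrate
```

Because the prefill side holds only a cache of descriptors, not ownership of any KV memory, either tier can be scaled up or down without touching the other.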
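Sketch 3 (caching): a two-tier KV cache where blocks evicted from the GPU tier are demoted to CPU memory instead of being dropped, roughly the L1/L2 analogy from the post. Capacities, the LRU policy, and the block granularity are illustrative assumptions; real systems track bytes and asynchronous transfers.

```python
from collections import OrderedDict


class TieredKvCache:
    def __init__(self, gpu_blocks=4, cpu_blocks=16):
        self.gpu = OrderedDict()  # hot tier: fast but small
        self.cpu = OrderedDict()  # warm tier: larger, slower to reach
        self.gpu_blocks = gpu_blocks
        self.cpu_blocks = cpu_blocks

    def put(self, block_id, tensor):
        self.gpu[block_id] = tensor
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_blocks:
            victim, data = self.gpu.popitem(last=False)  # evict LRU block from GPU
            self.cpu[victim] = data                      # demote, don't drop
            while len(self.cpu) > self.cpu_blocks:
                self.cpu.popitem(last=False)             # only now is it truly lost

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id], "gpu-hit"
        if block_id in self.cpu:
            # Promote back over PCIe/NVLink: slower than a GPU hit but far
            # cheaper than recomputing the prefill.
            data = self.cpu.pop(block_id)
            self.put(block_id, data)
            return data, "cpu-hit"
        return None, "miss"


cache = TieredKvCache(gpu_blocks=2)
for i in range(4):
    cache.put(i, f"kv-block-{i}")
print(cache.get(3)[1])  # gpu-hit
print(cache.get(0)[1])  # cpu-hit (block was demoted earlier)
print(cache.get(9)[1])  # miss -> recompute or fetch from a peer
```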
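Sketch 4 (pub/sub + direct reads): a tiny control message announces which KV blocks are ready, and the bulk bytes are then pulled with a one-sided read. The in-process queue and fake_rdma_read are stand-ins I've made up for a real broker and a real RDMA verb; no NIC or GPU memory is involved here, the point is only the control/data split.

```python
import queue

events = queue.Queue()  # stand-in for an in-memory pub/sub channel

# Stand-in for the prefill worker's registered memory region: in reality GPU
# memory exposed to the NIC, here just a local bytearray.
PREFILL_KV_REGION = bytearray(16 * 1024)
BLOCK_BYTES = 1024


def fake_rdma_read(region, block_id, nbytes):
    # Mimics a one-sided read: no handler runs on the remote CPU and no
    # intermediate staging copy is made on this side (memoryview, not bytes()).
    offset = block_id * nbytes
    return memoryview(region)[offset:offset + nbytes]


# --- prefill side -----------------------------------------------------------
def prefill_finished(request_id, block_ids):
    # Publish a tiny control message: which blocks are ready, nothing more.
    events.put({"event": "kv_ready", "request": request_id, "blocks": block_ids})


# --- decode side ------------------------------------------------------------
def decode_step():
    msg = events.get()
    if msg["event"] == "kv_ready":
        views = [fake_rdma_read(PREFILL_KV_REGION, b, BLOCK_BYTES)
                 for b in msg["blocks"]]
        total = sum(len(v) for v in views)
        print(f"request {msg['request']}: pulled {total} bytes, no remote RPC")


prefill_finished("req-42", block_ids=[0, 3, 7])
decode_step()
```

The coordination path stays tiny (a few block IDs per message) while the fat path goes straight between memory regions, which is what keeps the remote CPU out of the transfer.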
