
2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)

Splitwise: Efficient Generative LLM Inference Using Phase Splitting

979-8-3503-2658-1/24/$31.00 ©2024 IEEE | DOI: 10.1109/ISCA59077.2024.00019

Pratyush Patel¹, Esha Choukse², Chaojie Zhang², Aashaka Shah², Íñigo Goiri², Saeed Maleki², Ricardo Bianchini²
¹University of Washington  ²Microsoft

Abstract—Generative large language model (LLM) applications are growing rapidly, leading to large-scale deployments of expensive and power-hungry GPUs. Our characterization of LLM inference shows that each inference request undergoes two phases: a compute-intensive prompt computation phase and a memory-intensive token generation phase, each with distinct latency, throughput, memory, and power characteristics. Despite state-of-the-art batching and scheduling, the token generation phase underutilizes compute resources. Unlike prompt computation, token generation does not need the compute capability of the latest GPUs and can be run with lower power and cost.

Based on these insights, we propose Splitwise, a model deployment and scheduling technique that splits the two phases of LLM inference requests onto separate machines. Splitwise enables phase-specific resource management using hardware that is well suited for each phase. Request state is transferred efficiently between machines using optimized network libraries on the fast back-plane interconnects available in today's GPU clusters. Using Splitwise, we design homogeneous and heterogeneous LLM inference clusters optimized for throughput, cost, and power. Compared to current designs, Splitwise clusters achieve up to 1.4× higher throughput at 20% lower cost. Alternatively, they can deliver 2.35× more throughput under the same power and cost budgets.

                         A100       H100       Ratio
  TFLOPs                 19.5       66.9       3.43×
  HBM capacity           80GB       80GB       1.00×
  HBM bandwidth          2039GBps   3352GBps   1.64×
  Power                  400W       700W       1.75×
  NVLink                 50Gbps     100Gbps    2.00×
  InfiniBand             200GBps    400GBps    2.00×
  Cost per machine [5]   $17.6/hr   $38/hr     2.16×

TABLE I: NVIDIA A100 vs. H100 specifications.

I. INTRODUCTION

Recent advancements in generative large language models (LLMs) have significantly improved their response quality and accuracy [18], [71]. These trends have led to the widespread adoption of LLMs across various domains [6], [21]. Most modern LLMs are built using the transformer architecture [77], [78] and exhibit similar characteristics [63]. Transformer model sizes have grown steadily, from the early BERT models [36] having 340 million parameters, to GPT-3 [28] with a staggering 175 billion parameters, and GPT-4 rumored to have even more.

LLMs typically run on expensive and power-hungry GPUs [16]. The sudden and large-scale deployment of LLMs has led to a worldwide GPU capacity crunch [14]. The computational demand for LLM inference far exceeds that of training due to the vast number of applications leveraging LLMs. Furthermore, since training LLMs requires expensive and dedicated supercomputers [56], [60], a large number of inferences are necessary to amortize the high training costs. LLM inference jobs, although orders of magnitude smaller than training, are still expensive given the compute involved.

Generative LLM inference for a single request consists of several forward passes through the model, since the output tokens are generated one by one. This inherently has two contrasting phases of computation. First, the prompt computation phase, in which all the input prompt tokens run through the forward pass of the model in parallel to generate the first output token. This phase tends to be computationally intensive and requires the high FLOPS (floating point operations per second) of the latest GPUs today. Second, the token generation phase, in which subsequent output tokens are generated sequentially based on the forward pass of the last token and all the cached context from previous tokens in the sequence. Given the lack of compute parallelism, this phase tends to be more memory bandwidth and capacity bound, despite state-of-the-art batching.

Running both phases on the same machine often leads to inconsistent end-to-end latencies due to the arbitrary batching of prompt and token phases. Due to these challenges, services need to over-provision expensive GPUs to meet tight inference service level objectives (SLOs) for interactive applications. At the same time, cloud service providers (CSPs) are having to build many new datacenters to meet the GPU demand, and are running into a power wall [19].

The industry continues to release new computationally powerful GPUs, each much more power-hungry and expensive than the last. However, as shown in Table I, the high-bandwidth memory (HBM) capacity and bandwidth on these GPUs have not scaled at the same rate recently. The latest NVIDIA H100 GPUs have 3.43× more compute and 1.75× more power compared to their predecessor A100 GPUs. However, their memory bandwidth only grew by 1.6×, with no increase in memory capacity.

¹Work partly done as an intern at Microsoft.
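The claim that token generation is memory-bandwidth bound can be sanity-checked with a back-of-envelope calculation (our own sketch, not from the paper): each decode step must stream the full model weights from HBM, so aggregate HBM bandwidth caps the sequential decode rate. The function name and the FP16/8-GPU assumptions below are ours; bandwidths come from Table I.

```python
# Back-of-envelope check of why token generation is memory-bandwidth bound:
# each decode step reads all model weights from HBM, so HBM bandwidth
# upper-bounds the sequential decode steps/sec for a single request.

def max_decode_rate(model_params_b, bytes_per_param, num_gpus, hbm_gbps):
    """Upper bound on decode steps/sec, assuming weights are read once per
    forward pass and perfectly sharded (ignores KV-cache and activations)."""
    weight_bytes = model_params_b * 1e9 * bytes_per_param
    aggregate_bw = num_gpus * hbm_gbps * 1e9  # bytes/sec
    return aggregate_bw / weight_bytes

# Llama2-70B in FP16 on 8 GPUs, using the HBM bandwidths from Table I.
a100 = max_decode_rate(70, 2, 8, 2039)
h100 = max_decode_rate(70, 2, 8, 3352)
print(f"A100: <= {a100:.0f} steps/s, H100: <= {h100:.0f} steps/s")
```

Note how the bound improves only by the 1.64× bandwidth ratio, not the 3.43× compute ratio, mirroring the scaling gap described above.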
Our work. Given the distinct properties of the prompt computation and token generation phases, we propose splitting the inference request and running the two phases on separate machines. Doing so allows us to separately manage hardware resources for each phase, thereby increasing the GPU utilization and the overall efficiency of the system. It also enables using different, better-suited hardware for each phase. To realize such a setup, the cached context from the prompt computation needs to be communicated over from the prompt processing machine to the token generation machine at low latency. We implement these transfers in an optimized manner over the back-end InfiniBand interconnects available in datacenters today, allowing us to increase efficiency without any perceived performance loss.

With Splitwise, we design clusters optimized for cost, throughput, and power, using production traces of LLM inference requests [4]. Given the diverging memory and compute scaling rates across GPU generations, we also evaluate different GPUs and power caps for the different inference phases. This allows us to target better performance per dollar (Perf/$) for users, and better performance per watt (Perf/W) for CSPs. Additionally, users can target older GPUs, which are likely more readily available to them.

We show that Splitwise-based LLM inference clusters can achieve 1.4× higher throughput at 20% lower cost than existing clusters. Alternatively, they can deliver 2.35× more throughput with the same cost and power budgets.

Summary. We make the following contributions:
1) An extensive characterization of the differences in the execution and utilization patterns of the prompt and token generation phases in LLM inference on the NVIDIA A100 and H100 GPUs using production traces.
2) Splitwise, our technique for optimized utilization of available hardware, which splits the prompt computation and token generation phases onto separate machines.
3) A design exploration of homogeneous and heterogeneous cluster deployments with Splitwise to optimize the overall cost, request throughput, and provisioned power.
4) An evaluation of the systems designed with Splitwise using production traces.

II. BACKGROUND

A. Large Language Models

Modern LLMs are based on transformers. Transformer models use attention [77] and multi-layer-perceptron layers to understand the inputs and generate an output, respectively. Transformer-based LLMs include encoder-only [36], [54], decoder-only [67], [69], [71], and encoder-decoder [70] models. Generative LLMs, the focus of this paper, are usually either decoder-only or encoder-decoder models.

B. Generative LLM inference phases

Figure 1 shows an example of generative LLM inference. Once the prompt query is received, all the input tokens are computed in parallel, within a single iteration, to generate the first token. We call this the prompt processing phase. The context generated from the attention layers during the prompt computation is saved in the key-value (KV) cache, since it is needed for all the future token generation iterations. After the first token is generated, the following tokens only use the last generated token and the KV-cache as inputs to the forward pass of the model. This makes the subsequent token generation more memory bandwidth and capacity intensive than the computationally heavy prompt phase.

Fig. 1: An LLM inference example.

C. Performance metrics for LLMs

Prior work has proposed three main metrics for LLM inference: end-to-end (E2E) latency, time to first token (TTFT), and throughput. We add another latency metric, time between tokens (TBT), to track the online streaming throughput of the tokens as they are generated serially. Table II summarizes the key performance metrics that we consider in this work.

  Metric                       Importance to user
  End-to-end (E2E) latency     Total query time that the user sees
  Time to first token (TTFT)   How quickly user sees initial response
  Time between tokens (TBT)    Average token streaming latency
  Throughput                   Requests per second

TABLE II: Performance metrics for LLMs.

Generative LLMs may be used for a variety of tasks with different kinds of SLOs. For batch tasks (e.g., summarization), the TTFT or TBT latency metrics are less important than throughput. On the other hand, for latency-sensitive tasks (e.g., conversational APIs), TTFT and TBT are the more important metrics with tighter SLOs.

D. Batching of requests

Inference requests can be batched together for higher throughput. Several prior works have explored batching [23], [81]. Figure 2 shows the timelines for inference with three common batching mechanisms. The default mechanism only batches at the request level (Figure 2(a)). In this case, ready requests are batched together, but all the forward passes for these requests are completed before any other requests are run. Since requests can have long token generation phases, this can lead to long wait times for requests arriving in between, causing high TTFT and high E2E latencies. An optimization is continuous batching [81] (Figure 2(b)). In this case, scheduling decisions are made before each forward pass of the model. However, any given batch comprises either only requests in their prompt phase or only requests in their token phase. The prompt phase is considered more important since it impacts TTFT. Hence, a waiting prompt can preempt a token phase.
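The continuous-batching policy described above (per-forward-pass scheduling, with waiting prompts preempting token batches) can be illustrated with a toy simulator. This is our own sketch, not the paper's or vLLM's implementation; class and function names are illustrative.

```python
# Illustrative sketch of continuous batching: before every forward pass,
# the scheduler re-forms the batch, and a waiting prompt preempts
# requests that are in their token-generation phase.
from collections import deque

class Request:
    def __init__(self, rid, n_out):
        self.rid, self.remaining = rid, n_out
        self.prompt_done = False

def continuous_batching(requests, max_batch=4):
    waiting = deque(requests)   # prompts not yet processed
    running = []                # requests in their token phase
    timeline = []
    while waiting or running:
        if waiting:             # a prompt batch preempts token batches
            batch = [waiting.popleft() for _ in range(min(max_batch, len(waiting)))]
            for r in batch:
                r.prompt_done = True
            running.extend(batch)
            timeline.append(("prompt", [r.rid for r in batch]))
        else:                   # token-only batch (one token per request)
            batch = running[:max_batch]
            for r in batch:
                r.remaining -= 1
            running = [r for r in running if r.remaining > 0]
            timeline.append(("token", [r.rid for r in batch]))
    return timeline

tl = continuous_batching([Request(1, 2), Request(2, 1)])
```

Running this on two requests shows the key property: every scheduling decision happens at an iteration boundary, and prompt and token phases never share a batch.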
Although this leads to shorter TTFT, it can substantially increase the tail for TBT, and therefore the E2E latency. Finally, there is mixed batching (Figure 2(c)) [23]. With this batching, the scheduling decisions are made at each forward pass, and the prompt and token phases can run together. This reduces the impact on TBT, but does not eliminate it, since token phases scheduled with prompt phases will experience a longer runtime. In the rest of the paper, we use mixed batching unless stated otherwise.

Fig. 2: Batching mechanisms and their latency impact on the prompt and token phases: (a) request-level, (b) continuous, (c) mixed.

E. Model parallelism

Model parallelism can be used to divide a model onto multiple GPUs, and even multiple machines, for higher efficiency and memory capacity. LLM inference typically uses pipeline and tensor parallelism. Pipeline parallelism (PP) divides the layers of the model among the GPUs, while keeping all the operators and tensors within a layer on the same GPU. Tensor parallelism (TP) divides the tensors across the GPUs, while replicating all the layers on each GPU. Pipeline parallelism requires less communication across the participating GPUs, while tensor parallelism requires high-bandwidth communication for each layer. In general, tensor parallelism performs better for GPUs within the same machine, connected with high-bandwidth interconnects like NVLink [15]. In the rest of the paper, we use tensor parallelism across 8 GPUs for the best latency.

F. GPU clusters and interconnects

With the recent rise of LLM use cases, several cloud service providers have expanded their GPU-based offerings, leading to large GPU cluster deployments [5], [56], [57]. Each machine in these AI clusters generally comprises 8 flagship NVIDIA GPUs (A100 or H100). Each GPU is connected to all the other GPUs in the cluster with a high-bandwidth Mellanox InfiniBand interconnect [10], [13], forming a high-bandwidth data plane network. The InfiniBand bandwidth offered in the cloud today ranges from 25 to 50GBps per GPU pair [7], [10].

III. CHARACTERIZATION

In this section, we explore the performance and utilization characteristics of LLM inference and draw key insights to guide the design of Splitwise.

Production traces. We use production traces taken from two Azure LLM inference services on November 11th, 2023. Our traces represent the most common scenarios in LLM inference today: coding and conversation. We have released a subset of our traces at https://github.com/Azure/AzurePublicDataset [4]. The traces we use for characterization are 20 minutes long and include the arrival time, input size (number of prompt tokens), and output size (number of output tokens). Due to customer privacy requirements (e.g., GDPR), we do not have visibility into the content of the prompts. We instead use the production traces to guide the input and output sizes, where we send an input prompt with the required number of tokens, and force the model to generate the corresponding number of output tokens for each request. Note that the text of the input prompts does not impact the performance metrics that we benchmark, since they depend only on the input and output sizes. For this characterization, we do not reuse the KV-cache between requests, to emulate a cloud service with security guarantees.

Models. Table III shows the models that we evaluate. Both BLOOM [69] and Llama2 [71] are state-of-the-art open-source LLMs. Both are decoder-only, transformer-based models. We use the version of each model with the most parameters, since these versions are the most representative for production-class accuracy. Unless stated otherwise, we run BLOOM-176B and Llama2-70B on vLLM [51] on a machine with 8 H100 [16] GPUs.

  Model        #Layers   Hidden size   #Heads
  Llama2-70B   80        8192          32
  BLOOM-176B   70        14336         112

TABLE III: Models we evaluate and their parameters.

A. Number of prompt and generated tokens

To better understand our traces, we examine the distribution of the number of prompt input and generated output tokens. Figure 3a shows the distribution of the number of prompt tokens. Since the coding LLM inference service is generally used to generate completions as the user is writing code, its input prompt can include large chunks of the code written so far. Thus, it has a large median prompt size of 1500 tokens. On the other hand, the conversation service has a wider range of input prompt tokens, since it depends on the user. The median number of prompt tokens for this trace is 1020 tokens.

Figure 3b shows the distribution of the number of generated tokens. Since the coding service typically only generates the next few words in the program as the user types, its median number of output tokens is 13. On the other hand, the conversation service has an almost bimodal distribution, with a median of 129 tokens generated.

Fig. 3: Distribution for prompt and generated tokens: (a) prompt input tokens, (b) generated output tokens.

Insight I: Different inference services may have widely different prompt and token distributions.
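The per-service medians above come from straightforward trace analysis. A hypothetical sketch of that computation on records shaped like the released traces (the field layout and toy values below are ours, chosen to match the medians reported in the text):

```python
# Hypothetical sketch of the trace analysis: given (service, prompt
# tokens, output tokens) records, compute per-service medians.
from statistics import median

trace = [  # toy records, not the real Azure trace
    ("coding", 1400, 10), ("coding", 1600, 13), ("coding", 1500, 16),
    ("conversation", 900, 120), ("conversation", 1020, 129),
    ("conversation", 1150, 140),
]

def medians(trace, service):
    prompts = [p for s, p, _ in trace if s == service]
    outputs = [o for s, _, o in trace if s == service]
    return median(prompts), median(outputs)

print(medians(trace, "coding"))        # toy data -> (1500, 13)
print(medians(trace, "conversation"))  # toy data -> (1020, 129)
```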
Fig. 4: Cumulative distribution of time spent with various active batched tokens.

Fig. 5: TTFT, TBT, and E2E for BLOOM-176B and Llama-70B on DGX-H100: (a) TTFT by prompt size, (b) TBT by batch size, (c) latencies on production traces (no batching).

B. Batch utilization

To understand how much these requests can be batched, we measure how often machines run at a given batch size. We use mixed continuous batching as shown in Figure 2. To fit into a single machine, we run a scaled-down version of the coding and conversation traces with 2 requests per second.

Figure 4 shows the distribution of the time spent by the machine running various numbers of active tokens in a batch. Note that if a prompt of 100 tokens is running in its prompt phase, we count the active tokens as 100. However, once the request is in the token phase, we count it as one active token, since the tokens are generated one at a time (assuming a beam search size of one [51]). We find that most of the time (60–70%) for conversation is spent running only 20 tokens or fewer. Since the coding service has very few output tokens, it experiences even worse batching in the token phase and runs with a single token for more than 20% of the time. Both LLMs show very similar trends.

Insight II: Mixed continuous batching spends most of the time with very few active tokens batched.

C. Latency

TTFT. Figure 5a shows the impact of the number of prompt tokens on TTFT. The range of sizes was chosen based on the coding and conversation traces. We find that TTFT for both models grows almost linearly as the prompt size increases. This behavior is due to the prompt phase having high GPU utilization and being computationally bound.

TBT. Figure 5b shows the impact of forcefully batching the output tokens of different requests together on the TBT. We observe very little impact on TBT as the batch size grows. With a batch size of 64, there is only a 2× impact on TBT.

E2E. Figure 5c shows various percentiles of E2E latency for both models, with no batching. The variability between the request input and output sizes is apparent. Furthermore, we see that most of the E2E time is spent running the token phase. This holds true even for the coding trace, where prompt sizes are large and generated tokens few. In fact, we find that for BLOOM-176B, a prompt phase with 1500 input tokens takes the same time as a token phase with only 6 output tokens.

Insight III: For most requests, the majority of the E2E time is spent in the token generation phase.

D. Throughput

Figure 6 shows the impact of batching on the throughput (measured as tokens per second). For the prompt phase, we define the throughput as the number of prompt input tokens that are processed per second. We see that the throughput decreases after 2048 prompt tokens, which corresponds to a batch size of less than 2 for the median prompt sizes from the traces. On the other hand, Figure 6b shows that the throughput in the token phase keeps increasing with batching until a batch size of 64, at which point the machine runs out of memory.

Fig. 6: Impact of batching on the throughput for the two LLMs: (a) prompt phase, (b) token generation phase.

Insight IV: The prompt phase batch size should be limited to ensure good performance. In contrast, batching the token generation phase yields high throughput without any downside.

E. Memory utilization

During an LLM inference, the GPU memory is used to host the model weights and activations, as well as the KV caches (Section II-B). As the number of tokens in a batch increases, the memory capacity required for the KV cache also increases. Figure 7 shows the memory capacity utilization during each phase as the number of tokens in the batch increases.

Fig. 7: Required memory with batching in prompt/token phases.
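The KV-cache growth with batched tokens can be roughed out from the model parameters in Table III. This is an assumption-laden sketch of ours, not a formula from the paper: it assumes one key and one value vector of the full hidden size per layer per token in FP16, and ignores grouped-query attention (which shrinks the cache considerably on Llama2-70B).

```python
# Rough KV-cache sizing: each token stores a key and a value vector per
# layer, so bytes/token ~= 2 * n_layers * hidden_size * bytes_per_element.
# Assumes full multi-head attention in FP16; GQA would reduce this.

def kv_bytes_per_token(n_layers, hidden, bytes_per_elem=2):  # FP16
    return 2 * n_layers * hidden * bytes_per_elem

llama = kv_bytes_per_token(80, 8192)      # Table III parameters
bloom = kv_bytes_per_token(70, 14336)
print(f"Llama2-70B: {llama / 2**20:.1f} MiB/token")
print(f"BLOOM-176B: {bloom / 2**20:.1f} MiB/token")
```

At megabytes per token, a few thousand batched tokens plausibly consume tens of gigabytes of HBM on top of the model weights, which is consistent with the capacity limits discussed above.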
During the prompt phase, the input prompt tokens generate the KV cache. During the output token phase, each active generated token that is being processed accesses the KV cache of its entire context so far.

Insight V: Batching during the prompt phase is compute-bound, whereas the token phase is limited by memory capacity.

F. Power utilization

When hosting machines, cloud providers need to consider the peak power draw, which has a direct impact on the datacenter cost [26]. This is especially important when building GPU clusters, since GPUs consume much higher power than regular compute machines [63], [64]. Figure 8 shows the GPU power draw normalized to the thermal design power (TDP) when running the prompt and token generation phases. Since the prompt phase is compute intensive, its power draw increases with batch size. On the other hand, the token phase is memory bound and its power draw does not vary when increasing the number of tokens to process.

Fig. 8: Maximum and mean power utilization varying the batching size: (a) prompt phase, (b) token generation phase.

Providers can cap the power usage of the machines to reduce the peak power. Figure 9 shows the impact on latency of lowering the power cap for both the prompt and token phases. The prompt phase is highly sensitive to the power cap, and its latency increases substantially. On the other hand, the token generation phase incurs almost no latency impact even when power capping by over 50% (i.e., 700W to 350W).

Fig. 9: Impact of power cap on the prompt and token generation latency with the maximum batch size possible.

Insight VI: While the prompt phase utilizes the power budget of the GPU efficiently, the token phase does not.

G. GPU hardware variations

Given the different characteristics of the prompt and token generation phases, we measure the performance impact of running the two on different hardware. Table I shows the specifications for DGX-A100 [15] and DGX-H100 [16]. The memory-to-compute ratio favors A100 over H100. Table IV shows our findings. We see a lower performance impact on the token generation phase (TBT) as compared to the prompt phase (TTFT). Since coding requests are dominated by the prompt phase, having very few generated tokens, the E2E latency impact from A100 is worse on coding than on conversation. Furthermore, we see that A100 has better or equal inference cost and energy overall compared to H100.

                   Coding                      Conversation
             A100      H100      Ratio    A100      H100      Ratio
  TTFT       185 ms    95 ms     0.51×    155 ms    84 ms     0.54×
  TBT        52 ms     31 ms     0.70×    40 ms     28 ms     0.70×
  E2E        856 ms    493 ms    0.58×    4957 ms   3387 ms   0.68×
  Cost [5]   $0.42     $0.52     1.24×    $2.4      $3.6      1.5×
  Energy     1.37 Whr  1.37 Whr  1×       7.9 Whr   9.4 Whr   1.2×

TABLE IV: P50 request metrics on A100 vs. H100 without batching on Llama-70B.

Insight VII: Token generation can be run on less compute-capable hardware for better Perf/W and Perf/$ efficiencies.

IV. SPLITWISE

Based on our characterization insights, we propose Splitwise, a technique to split the prompt and token generation phases of LLM inference onto separate machines.

Figure 10 shows the high-level overview of Splitwise. We maintain two separate pools of machines for prompt and token processing. A third machine pool, the mixed pool, expands and contracts as needed by the workload. All machines are pre-loaded with the model of choice. When a new inference request arrives, the scheduler allocates it to a pair of machines (i.e., prompt and token). The prompt machines are responsible for generating the first token for an input query, by processing all the input prompt tokens in the prompt phase and generating the KV-cache. The prompt machine also sends the KV-cache over to the token machine, which continues the token generation until the response is complete. We use continuous batching at the token machines to maximize their utilization. Machines in the mixed pool use mixed continuous batching.

At a lower request rate, we target better latency in Splitwise, while, at a higher request rate, we target avoiding any performance or throughput reduction due to the fragmentation between the prompt and token machine pools.

Splitwise uses hierarchical two-level scheduling, as shown in Figure 10. The cluster-level scheduler (CLS) 1 is responsible for machine pool management and for routing incoming inference requests. The machine-level scheduler (MLS) 2 maintains the pending queue and manages batching of requests at each machine.

A. Cluster-level scheduling

Machine pool management. The CLS maintains the prompt, token, and mixed machine pools 3 .
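The structure above, a cluster-level scheduler that assigns each request a (prompt, token) machine pair using the Join-the-Shortest-Queue routing detailed in Section IV-A, can be sketched minimally. Class and attribute names here are our own invention, not the paper's code; queue length is measured in pending tokens, as in the paper.

```python
# Illustrative skeleton of Splitwise's cluster-level scheduling: route
# each request to a (prompt, token) machine pair via Join-the-Shortest-
# Queue over pending-token counts, assigning both machines up front so
# the KV-cache transfer can later overlap with prompt computation.

class Machine:
    def __init__(self, name):
        self.name = name
        self.pending_tokens = 0   # queue-length metric reported to the CLS

class ClusterScheduler:
    def __init__(self, prompt_pool, token_pool):
        self.prompt_pool = prompt_pool
        self.token_pool = token_pool

    def route(self, prompt_tokens):
        p = min(self.prompt_pool, key=lambda m: m.pending_tokens)
        t = min(self.token_pool, key=lambda m: m.pending_tokens)
        p.pending_tokens += prompt_tokens
        t.pending_tokens += 1     # token phase adds one token at a time
        return p, t

cls = ClusterScheduler([Machine("p0"), Machine("p1")],
                       [Machine("t0"), Machine("t1")])
p, t = cls.route(1500)
```

A real deployment would also model the mixed pool and the load-driven pool transitions described next; this sketch only captures the pair assignment.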
Fig. 10: High-level system diagram of Splitwise.

Fig. 11: Optimizing KV-cache transfer in Splitwise: (a) serialized KV-cache transfer, (b) optimized per-layer KV-cache transfer during the prompt phase.

Splitwise initially assigns machines to the prompt or token pool depending on the expected request load and input/output token distributions. Machines from the prompt or token pools may be dynamically moved into and out of the mixed pool to reduce fragmentation and meet SLOs at higher loads. A machine in the mixed pool retains its identity as a prompt or token machine and goes back to its original pool once there are no tasks of the opposite kind in its pending queue. Switching pools does not incur any noticeable latency. If the load distribution deviates considerably from initial assumptions, Splitwise employs a coarse-grained re-purposing of machines and moves machines between the prompt and token pools. Re-purposing of machines is done infrequently, typically only if they stay in the mixed pool for a considerable amount of time.

Request routing. The CLS uses Join-the-Shortest-Queue (JSQ) scheduling [39], [85] to assign a prompt and a token machine to each request. Queue lengths are defined by the number of pending tokens. Each machine regularly communicates to the CLS changes in its memory capacity or pending queue. Note that this does not happen at every iteration boundary. We simultaneously assign both the prompt and token machine when scheduling requests, since we can then overlap KV-cache transfers with prompt computation to reduce transfer overheads (Section IV-C).

When routing requests, if the pending queue is bigger than a certain threshold, the CLS looks for target machines in the mixed pool. If the mixed pool is also full, it proceeds to look in the opposite pool (i.e., a token machine to run prompts and vice versa) and moves the machine into the mixed pool. Machines in the mixed pool operate exactly as a non-Splitwise machine would, with mixed batching. Once the queue of mixed requests is drained, the CLS transitions the machine back to its original pool. For example, when the queue is too long, we can move a prompt machine to the mixed pool to run tokens; once the machine is done running tokens, we transition the machine back into the prompt pool.

B. Machine-level scheduling

The MLS runs on each machine and is responsible for tracking the GPU memory utilization, maintaining the pending queue 4 , deciding the batch for each iteration, and reporting the relevant status to the CLS.

Prompt machines. The MLS simply uses first-come-first-serve (FCFS) to schedule prompts. The results in Figure 6a show that after 2048 prompt tokens, the throughput degrades. For this reason, the MLS restricts the batching of multiple prompts together to 2048 tokens in total. This is a configurable value, and can change for a different model or hardware.

Token machines. The MLS uses FCFS to schedule tokens and batches as much as possible. Figure 6b shows that the token generation throughput keeps scaling up with the batch size until the machine runs out of memory. For this reason, the MLS tracks the memory and starts queueing tokens once the machine is close to running out of memory.

Mixed machines. To meet the TTFT SLO, the MLS must prioritize running prompts and schedule any new prompts in the pending queue immediately. If the machine is running token phases and has no additional capacity to run the prompt phase, the MLS will preempt tokens. To avoid starvation of the token phase due to preemption, we increase the priority of tokens with age and limit the number of preemptions that each request can have.

C. KV-cache transfer

As discussed in Section II, the KV-cache is generated during the prompt phase of the request, and it continuously grows during the token generation phase. In Splitwise, we need to transfer the KV-cache from the prompt machine to the token machine 5 (shown in Figure 10) to complete the inference. This transfer delay is the main overhead associated with Splitwise. In this section, we discuss the impact of the KV-cache transfer and how we optimize it.

Figure 11a shows the Gantt chart for the prompt phase, the KV-cache transfer, and the token generation phase for a single batch of requests when naively transferring the KV-cache in a serialized way. The KV-cache transfer starts only after the prompt phase has finished and the first token is generated.
can be generated in the token generation phase. This directly
impacts the maximum TBT and end-to-end latency of inference.

The time required for the transfer depends on the size of
the KV-cache (which is directly proportional to the number
of prompt tokens) and on the bandwidth of the interconnect
between the prompt and the token machines. Even when using
fast InfiniBand links, the transfer overhead for large prompt
sizes can become a significant fraction of the TBT.

In Splitwise, we optimize the KV-cache transfer by overlapping
it with the computation in the prompt phase. As each layer
in the LLM gets calculated in the prompt machine, the
KV-cache corresponding to that layer is also generated. At
the end of each layer, we trigger an asynchronous transfer of
the KV-cache for that layer while the prompt computation
continues to the next layer. Figure 11b shows this asynchronous
transfer, which reduces the transfer overheads. Layer-wise
transfer also enables other optimizations, such as an earlier
start of the token phase in the token machines, as well as
earlier release of KV-cache memory on the prompt machines.

Fig. 12: Design space for provisioning a Splitwise-HH cluster.
Cluster configurations target a peak throughput of 70 RPS.
The cost-optimal Splitwise-HH configuration is marked with ⋆
(27 prompt and 3 token machines).

Layer-wise KV-cache transfer happens in parallel with the
prompt computation for the next layer. This requires
fine-grained synchronization per layer for correctness. Thus,
it is possible to incur performance interference and increase
the TTFT, especially for smaller prompts. However, for small
prompts, the total KV-cache size is small and does not need
the layer-wise transfer to hide the latency. Since the number of
tokens in a batch is already known at the start of computation,
Splitwise picks the best technique for KV-cache transfer: it
uses serialized KV-cache transfer for smaller prompts and
layer-wise transfer for larger prompts. We show that the
overall transfer and interference overheads are relatively small
in Section VI-A.

D. Provisioning with Splitwise

We leverage Splitwise to optimize LLM inference cluster
deployments for power, cost, and throughput.

Type of machines. We propose four main variants of
Splitwise-based systems: Splitwise-AA, Splitwise-HH,
Splitwise-HA, and Splitwise-HHcap. The nomenclature is simply
drawn from the first letter representing the prompt machine type
and the second letter representing the token machine type. "A"
represents a DGX-A100 machine, "H" represents a DGX-H100
machine, and "Hcap" represents a power-capped DGX-H100
machine. Table V shows a summary of the cost, power, and
hardware in each of our evaluated systems.

Splitwise-AA uses DGX-A100 for both prompt and token
pools, while Splitwise-HH uses DGX-H100 for both. These two
variants represent the setups commonly available at providers,
where machines are homogeneous and interchangeable.

Splitwise-HA uses DGX-H100 for the prompt pool and
DGX-A100 for the token pool. We choose this configuration
based on Table IV and Insight VII (i.e., A100s can be
more cost- and power-efficient for the token phase).

Splitwise-HHcap uses DGX-H100 machines for both prompt
and token pools. However, we power cap the token machines
down to 70% of their rated power, with each GPU capped at
50% of its power. We propose this design based on Figure 9
and Insight VII (i.e., the prompt phase is impacted by power
caps, while the token phase sees no performance impact with
a 50% lower power cap per GPU).

Number of machines. The LLM inference cluster deployment
must be sized with the appropriate number of prompt and token
machines. Our methodology involves searching the design space
using our event-driven cluster simulator, which is described in
detail in Section V. We need to provide as input: (1) the target
cluster design (e.g., Splitwise-HA or Splitwise-HHcap), (2) an
LLM-specific performance model that can estimate the TTFT
and TBT at various input, output, and batch sizes, (3) a short
trace derived from the target prompt and token size distributions
for the service (e.g., Figure 3), (4) the SLOs (e.g., Table VI), (5)
the constraints (e.g., throughput), and (6) the optimization goal
(e.g., minimize cost). Using this information, our provisioning
framework searches the space for the desired optimal point.
For example, searching with a throughput constraint and a
cost-minimization goal gives us iso-throughput cost-optimized
clusters across different designs.

Search space. Figure 12 shows an example of the
two-dimensional search space for the number of prompt and token
machines under Splitwise-HH for the coding workload (using
a 2-minute trace). The simulator outputs the various percentiles
for TTFT, TBT, and E2E latencies. Then, we select the clusters
that meet the SLOs for each of these metrics and optimize
our target function. For example, Figure 12 shows a ⋆ for the
setup with 27 prompt and 3 token machines with the lowest
cost that achieves 70 RPS. We call this setup iso-throughput
cost-optimized.

Optimization. We can use three optimization goals: throughput,
cost, and power. Throughput optimization is important for both,

                 Prompt Machine              Token Machine               Prompt-Token
                 Type       Cost    Power    Type       Cost    Power    Interconnect Bandwidth
Splitwise-AA     DGX-A100   1×      1×       DGX-A100   1×      1×       1×
Splitwise-HH     DGX-H100   2.35×   1.75×    DGX-H100   2.5×    1.75×    2×
Splitwise-HHcap  DGX-H100   2.35×   1.75×    DGX-H100   2.5×    1.23×    2×
Splitwise-HA     DGX-H100   2.35×   1.75×    DGX-A100   1×      1×       1×

TABLE V: Evaluated Splitwise designs, all normalized to DGX-A100.
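The provisioning search described in Section IV-D can be sketched as a brute-force sweep over prompt/token machine counts that keeps SLO-feasible configurations and picks the cheapest one. A minimal sketch in Python: `simulate` is a stand-in for the event-driven cluster simulator, and the per-machine capacities are illustrative assumptions (chosen only so the toy sweep has a solution), not measured values.

```python
# Sketch of the iso-throughput cost-optimized provisioning search:
# sweep (prompt, token) machine counts, keep configurations that
# sustain the target load, and return the cheapest one.

def simulate(n_prompt, n_token, load_rps):
    """Toy stand-in for the simulator: True if the config sustains
    the load. Per-machine capacities below are assumptions, not
    profiled numbers."""
    PROMPT_RPS, TOKEN_RPS = 2.6, 24.0  # hypothetical per-machine limits
    return (n_prompt * PROMPT_RPS >= load_rps and
            n_token * TOKEN_RPS >= load_rps)

def provision(load_rps, cost_prompt, cost_token, max_machines=40):
    """Return (cost, n_prompt, n_token) for the cheapest feasible config."""
    best = None
    for n_prompt in range(1, max_machines + 1):
        for n_token in range(1, max_machines + 1):
            if not simulate(n_prompt, n_token, load_rps):
                continue
            cost = n_prompt * cost_prompt + n_token * cost_token
            if best is None or cost < best[0]:
                best = (cost, n_prompt, n_token)
    return best

# Splitwise-HH at 70 RPS, with per-machine cost factors from Table V.
print(provision(load_rps=70, cost_prompt=2.35, cost_token=2.5))
```

With these toy capacities the sweep happens to land on a (27P, 3T) configuration like the ⋆ in Figure 12; the real search instead replays a trace through the simulator and checks all nine SLO percentiles for each candidate.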

the cloud service provider (CSP) and the user. Cost optimization
has different importance levels to the CSP and the user. For
the CSP, a higher cost for the same throughput might be
acceptable if there are gains in power and space requirements
for the cluster. However, for the end-user, a higher cost at
the same throughput is generally unacceptable. Finally, power
optimization is attractive for a CSP, since it enables more GPUs
to be deployed in the same datacenter [62], [63], but it may not
be as important to the user. We only consider the provisioned
power, and not the dynamic power utilization, in our study.

E. Practical Considerations

Accuracy impact. Splitwise does not impact accuracy, since
it uses lossless KV-cache transfer and does not add any
randomization. It executes inference with the same parameters
and state as on a single machine.

Scalability. Since LLM requests are much longer than typical
ML requests [37], [38], they incur lower scheduling overhead
for similar cluster sizes. However, the CLS may become a
scalability bottleneck for large clusters. Insights from prior
work on partitioned or replicated scheduling could help improve
scalability [27], [61], [72] and are orthogonal to Splitwise.

Reliability and fault tolerance. If the prompt or the token
machine fails, Splitwise simply restarts requests from scratch,
similar to today's LLM serving systems [44], [51]. Alternatively,
Splitwise could checkpoint the KV-cache generated after
prompt computation into an in-memory database. To recover,
Splitwise can use this cache to skip prompt recomputation and
start right away with the token phase. The KV-cache could also
be checkpointed periodically during the token phase. Designing
safe and efficient failure recovery is out of scope for our paper.

V. METHODOLOGY

A. Experimental setup

To evaluate our proposal on real hardware, we implement
Splitwise's KV-cache transfer mechanism on top of vLLM [51].
Our implementation is open source [1]. We run this modified
vLLM on two DGX-A100 and two DGX-H100 virtual machines
(VMs) on Microsoft Azure with specifications from Table I.
These are the VMs used to collect the characterization data in
Section III. These machines are connected with InfiniBand, and
the DGX-H100s have double the bandwidth (i.e., 400 Gbps).
Since vanilla vLLM only supports continuous batching with
token preemption, which can lead to much higher TBT, we
implement state-of-the-art mixed continuous batching [81] as
discussed earlier in Figure 2(c).

Fig. 13: Overview of the design of the Splitwise simulator.

Our implementation of the Splitwise technique assigns
machines either a prompt role or a token role. As the
prompt machine generates the first token, it transfers the
KV-cache to the token machine using the technique described
in Section IV-C. We use MSCCL++ [11], an optimized
GPU-driven communication library, to implement the naive and
layer-wise KV-cache transfers.

In our implementation, the prompt machine uses the zero-copy
one-sided put primitive of MSCCL++ to send KV-cache
data over InfiniBand as soon as it is ready, without requiring
the token machine to issue any receive instructions. Once
we have issued a put for all layers, the prompt machine
signals a semaphore that the token machine waits on. The
synchronization done with the help of semaphores uses the
same InfiniBand connection used to send KV-cache data.
When processing a batch of prompts, each request is assigned
a different semaphore, since it may be routed to different
token machines. We ship the KV-caches block-by-block in
vLLM. To minimize the number of transfers, we also consider
the contiguity of KV blocks as long as they use the same
semaphore.

B. Simulator setup

We build a simulator to explore cluster designs and evaluate
Splitwise at scale. The simulator code is open source [20].
Figure 13 shows the design of our simulator. The simulator is
event-driven and faithfully models the Splitwise machine pools,
schedulers, machine-level memory and queues, and KV-cache
transfer. We first profile the LLM on the target hardware with
various input/output sizes (1). Based on the characterization
profiles, we build a performance model. The simulator takes
as input the request traces, SLOs, the performance model,
and the configurations for cluster and scheduler (2). For our

      P50     P90    P99
TTFT  2×      3×     6×
TBT   1.25×   1.5×   5×
E2E   1.25×   1.5×   5×

TABLE VI: SLO expressed as slowdown compared to a request
running on DGX-A100 under no contention.

Fig. 14: Overhead of the KV-cache transfer as the prompt size
increases on A100s and H100s.

Fig. 15: Overhead of KV-cache transfer on TTFT and E2E latency
for the coding trace on A100 and H100.

evaluation, we use the prompt and token size distributions from
the production traces in Section III. We tune the Poisson arrival
rate to increase and decrease the load (requests per second)
for cluster sizing. The simulator provides the achieved metrics
per request (TTFT, TBT, E2E), and the machine utilization
levels (3). We cross-validated the performance model with
hardware experiments to ensure accuracy; we also validated
the simulator end-to-end using production load with over 50K
iterations to ensure fidelity (4).

Performance model. We build a piece-wise linear performance
model using performance profiles at various batch sizes, input
sizes, and output sizes, in the required parallelism configuration,
on A100 and H100 machines from Section III. We validate
that our performance model has high accuracy; it incurs a
mean absolute percentage error (MAPE) of less than 3% when
evaluated with an 80:20 train:test dataset split.

Communication model. In our evaluation, KV-cache transfers
cause inter-machine communication, whereas tensor parallelism
only causes intra-machine communication. We model
inter-machine communication overheads by benchmarking our
KV-cache transfer implementation over InfiniBand in Section VI-A.

SLOs. To determine the maximum throughput that can be
supported by a given cluster design, we use P50, P90, and
P99 SLOs for the TTFT, TBT, and E2E latency metrics. Table VI
shows our SLO definition using DGX-A100 as a reference. We
require all nine SLOs to be met. SLOs on TTFT are slightly
looser, since it has a much smaller impact on the E2E latency.

Baselines. We compare our Splitwise designs against
Baseline-A100 and Baseline-H100. The clusters in these baselines
consist of just DGX-A100s and DGX-H100s, respectively. Both
baselines use the same mixed continuous batching that Splitwise
uses for mixed-pool machines (described in Section IV-A).

VI. EVALUATION

A. Experimental results

KV-cache transfer latency. We first measure the latency to
transfer the KV-cache as the prompt size grows. Figure 14
shows the visible transfer latency on both A100 and H100
setups with the naive and optimized transfer designs as
discussed in Figure 11. Compared to the prompt computation
time, the overhead is minimal (< 7%). The time for serialized
transfers linearly increases with the prompt size, since the size
of the KV-cache also increases. The optimized per-layer transfer,
on the other hand, hides much of the latency. For these transfers,
we see a constant non-overlapped transfer time of around 8 ms
for the A100 setup and around 5 ms for the H100 setup. The
H100 setup has double the bandwidth of the A100 setup (i.e.,
400 vs. 200 Gbps), and the impact of this can be clearly seen,
with transfers in the H100 setup happening about twice as fast
as those in the A100 setup.

As discussed in Section IV-C, for small prompt sizes (< 512
on H100), Splitwise uses the serialized KV-cache transfer, and
for larger prompts, it uses per-layer transfers.

End-to-end impact. Next, we run the coding trace on the
2-machine Splitwise setups without batching, and compare the
observed latency metrics to a 1-machine baseline setup with
no batching. Figure 15 shows our results. The latency impact
of serially transferring the KV-cache grows up to 3% of the
E2E with large prompts. However, Splitwise only incurs 0.8%
of E2E. In a user-facing inference, the only visible impact
of KV-cache transfer overhead is the latency for the second
token. Splitwise adds a 16.5% latency to the second token,
as compared to the 64% overhead from a serialized transfer.
Overall, the transfer impact in Splitwise is hardly perceivable
even in a user-facing inference.

B. Iso-power throughput-optimized clusters

Cluster provisioning. We provision clusters using the
methodology described in Section IV-D. We target a specific
workload (e.g., conversation) at a peak load with the same power
(i.e., iso-power) for each cluster design. For the baseline, we use
the power for 40 DGX-H100 machines as our target peak power.
For the A100 baseline, we can fit 70 DGX-A100 machines
under the same power budget. We denote these two designs
as 40P/T and 70P/T respectively, since they both use mixed
batching in all machines.
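The iso-power sizing above follows directly from the normalized power factors in Table V: a DGX-H100 draws about 1.75× the power of a DGX-A100, so the power budget of 40 DGX-H100s accommodates 70 DGX-A100s. A quick check:

```python
# Iso-power sizing: how many DGX-A100s fit in the power budget of
# 40 DGX-H100s, using the normalized power factors from Table V.
H100_POWER = 1.75  # DGX-H100 power, normalized to a DGX-A100
A100_POWER = 1.0

budget = 40 * H100_POWER                # peak power, in DGX-A100 units
n_a100 = int(budget // A100_POWER)      # whole machines under the budget
print(n_a100)  # → 70
```

The same arithmetic, pool by pool, is what keeps each Splitwise design under the shared power budget when the machine counts in the prompt and token pools differ.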

For Splitwise cluster designs under the coding trace,
Splitwise-AA provisions 55 prompt machines and 15 for the
token pool, denoted as (55P, 15T). Note that like Baseline-A100,
Splitwise-AA also provisions 75% more machines than
Baseline-H100. The legends in Figure 16 show the different
provisioning choices under the coding and conversation
workloads. Request size distributions are reflected in the machine
pool sizing. For example, we provision more prompt machines
under Splitwise-HH (35P, 5T) for the coding trace, while we
provision more token machines (25P, 15T) for the conversation
trace.

Fig. 16: Latency metrics across input loads for iso-power
throughput-optimized clusters. Dashed red lines indicate SLO.
(a) Coding trace. (b) Conversation trace.

Latency and throughput. Figure 16 shows a deep dive
into all the latency metrics at different input loads for each
cluster design with the same power (i.e., iso-power). For the
coding trace (Figure 16a), Splitwise-HH, Splitwise-HHcap, and
Splitwise-AA all perform better than Baseline-H100. As the
load increases, Baseline-H100 suffers from high TBT due to
mixed batching with large prompt sizes. Although Splitwise-AA
can support higher throughput, its TTFT is consistently
higher than most designs. Splitwise-HA clearly bridges the
gap by providing low TTFT and E2E at high throughput. The
mixed machine pool in Splitwise becomes useful at higher
loads to use all the available hardware without fragmentation.
This benefit can be seen clearly in the P50 TBT chart for
Splitwise-HA, where after 90 RPS, H100 machines jump into
the mixed machine pool and help reduce TBT.

For the conversation trace (Figure 16b), Splitwise-HHcap
clearly does better on all fronts, including latency. This is
because its token generation phases typically run for much
longer than in the coding trace, which is beneficial for the
token machines.

Impact on batched tokens. Figure 17 shows the cumulative
distribution of time spent processing a varying number of
batched active tokens in an iso-power throughput-optimized
cluster. The distributions are collected by running the
conversation trace at low (70 RPS) and high (130 RPS) loads.

At low load, all 40 Baseline-H100 machines spend 70%
of the time running ≤15 tokens, and the rest running mixed
batches with large prompts, which affects TBT and E2E. The
35 Splitwise-HH prompt machines are mostly idle, and when
active, run much larger batches of tokens. The 15 Splitwise-HH
token machines also do a better job at batching. Overall,
Splitwise machines have better batching and latency at 70 RPS.
At high load, since the mixed pool is utilized more, the batch
sizes start looking similar across prompt and token machines.

Summary plot. Figure 18a summarizes the results across all
cluster metrics for iso-power throughput-optimized designs for
the conversation trace. We use Baseline-A100 as the baseline.
Compared to Baseline-A100, Splitwise-AA delivers 2.15× more
throughput at the same power and cost. Splitwise-HA delivers
1.18× more throughput at 10% lower cost and the same power.

C. Other cluster optimizations

We have described iso-power throughput-optimized clusters
in detail. For the rest of the cluster optimization evaluation,
we only discuss the summary plots.

Iso-cost throughput-optimized. Figure 18b shows the summary
plot for iso-cost clusters, with their space, throughput,
and power requirements. We find that Splitwise-AA provides
the best throughput for the same cost, namely 1.4× more
throughput than Baseline-H100, running at 25% more power
and 2× the space. This is an interesting operational point for
most customers, who may not care about power and space,
instead preferring the 40% higher throughput using older, more
easily available GPUs. In contrast, the preferable choice for
the CSP is less clear.

Iso-throughput power-optimized. Figure 19a shows cluster
designs that yield the same throughput at the least power.
Splitwise-HHcap can achieve the same throughput as
Baseline-H100 at 25% lower power at the same cost and space.
This can be a clear win for the CSPs.

Iso-throughput cost-optimized. Figure 19b shows the
cost-optimized versions of the iso-throughput design. Note that
there are no changes to any of the homogeneous designs between
Figures 19a and 19b. This is because the prompt and token
machines have the same cost and power. However, Splitwise-

HA and Splitwise-HHcap arrive at slightly different results
with the cost and power optimizations. Figure 19b shows that
with Splitwise-AA, customers can achieve the same throughput
as Baseline-H100 at 25% lower cost.

Fig. 17: Cumulative distribution of time spent at various batched
token sizes for the iso-power throughput-optimized design.
(a) Low load (70 RPS). (b) High load (130 RPS).

Fig. 18: Summary of throughput-optimized cluster designs.
(a) Iso-power. (b) Iso-cost.

Fig. 19: Summary of iso-throughput cluster designs.
(a) Power-optimized. (b) Cost-optimized.

Fig. 20: Latency impact of running a workload on a cluster
designed for another workload. Dashed red lines indicate SLO.
(a) Conversation trace running on a cluster designed for coding.
(b) Llama-70B, on a cluster designed for BLOOM-176B.
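The iso-power, iso-cost, and iso-throughput comparisons summarized in Figures 18 and 19 reduce to rolling up the per-machine cost and power factors of Table V over each pool. A simplified sketch, using the iso-power conversation-trace machine counts from the legend of Figure 16b; the helper and the factor-table names are ours, not from the paper's code:

```python
# Roll up normalized cluster cost and power from per-machine factors.
# Factors follow Table V (normalized to a DGX-A100); machine counts
# are the iso-power conversation-trace configurations (Figure 16b).
FACTORS = {                    # machine type -> (cost, power)
    "A100":     (1.0, 1.0),
    "H100":     (2.35, 1.75),  # DGX-H100 prompt machine (Table V)
    "H100cap":  (2.5, 1.23),   # power-capped DGX-H100 token machine
}

def rollup(pools):
    """pools: list of (machine_type, count) -> (total cost, total power)."""
    cost = sum(FACTORS[t][0] * n for t, n in pools)
    power = sum(FACTORS[t][1] * n for t, n in pools)
    return cost, power

baseline_a100 = rollup([("A100", 70)])                     # 70P/T
splitwise_hhcap = rollup([("H100", 25), ("H100cap", 21)])  # (25P, 21T)

print(baseline_a100)  # → (70.0, 70.0)
print(splitwise_hhcap)
```

Both designs land near the same 70-unit power budget (the capped design comes out at roughly 69.6 due to whole-machine rounding), which is exactly the iso-power constraint under which Figure 16b compares them.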

D. Impact of workload changes

So far, we have tested a trace and a model on clusters
optimized for a specific workload pattern and model. To test
Splitwise's robustness, we now run the conversation trace on a
cluster meant for the coding service, and Llama-70B on a cluster
meant for BLOOM-176B. Figure 20 shows these results for
iso-power throughput-optimized clusters.

Changing workload trace. Compared to Figure 16b, we find
that in Figure 20a, the Baseline clusters are similarly sized
and incur no throughput or latency impact. Splitwise-AA and
Splitwise-HH with the mixed pool morph well to meet the
requirements of the new workload, and they see no throughput
or latency impact. Since Splitwise-HA and Splitwise-HHcap
have different types of machines in the prompt and token pools,
they experience a throughput setback of 7% from the respective
cluster-optimized designs for the conversation trace. Note that
all the Splitwise designs still perform much better than any of
the Baseline designs.

Changing model. Figure 20b shows that Llama-70B can
support much higher throughput in the same cluster design
than BLOOM-176B, given its fewer parameters (Table III). All
the Splitwise designs out-perform both the Baseline designs at
higher load. Furthermore, Splitwise-HH and Splitwise-HHcap
consistently achieve the best latency, even as the load increases.

Summary. Based on these two experiments, we conclude
that Splitwise can morph according to the requirements of the
workload using its smart scheduling, and it is robust to changes
in the LLMs, request load, and token distributions.

E. Cluster design for batch jobs

We design various clusters with Splitwise under strict latency
SLOs, even when we are optimizing for throughput. This is
unnecessary for batch jobs, which can be stressed to high
load for a high token generation throughput. We find that upon
stressing our iso-power throughput-optimized clusters,
Baseline-A100 and Splitwise-AA have the best throughput per
cost, at 0.89 RPS/$. At high load, Splitwise devolves into the
iso-count Baseline, since it starts mixed batching with all the
machines in the mixed pool. The same holds true for
Splitwise-HH and Baseline-H100, which achieve 0.75 RPS/$.

VII. DISCUSSION

Extensibility to new models. Despite the plethora of model
sizes, from 2B parameters [47], [84] to 176B parameters [69]
or more [18], all modern transformer-based generative LLMs
have the distinct prompt processing and token generation
phases. Similarly, even modifications and flavors like
Mixture-of-Experts (MoEs) have these phases. Since Splitwise
is built solely by exploiting these phases, it is applicable to all of the

current and upcoming LLMs, as long as the auto-regressive online monitoring with metrics like request length or hardware
nature of the workload requires these two phases. Note that performance counters to identify workload phases and allocate
as shown in Section VI-D, clusters provisioned with Splitwise them appropriately on heterogeneous processors. However, they
for one model can also efficiently serve other models. do not consider the complexities with batching. Distributed
Alternative compute hardware. In this work, we use NVIDIA dataflow systems orchestrate large-scale computational graphs
H100 and A100 GPUs since they are commonly used for and aim to provide general-purpose programmability [34], [46],
LLM inference in datacenters today [17]. Smaller datacenter [75], [82]. LLM inference under Splitwise can be viewed as
GPUs like NVIDIA T4 lack enough memory to run modern a static computational graph with two stages, so it could be
LLMs efficiently. In general, our methodology is applicable implemented using distributed frameworks that provide efficient
to any hardware (including CPUs, FPGAs, ASICs [33]) that GPU abstractions [59]. Splitwise differs from these works since
aligns with the computational requirements of prompt and it uses a specialized two-phase design for generative LLM
token phases. Our characterization suggests that prompt phases inference and leverages phase-aware resource management
need high compute capability and memory bandwidth with low with efficient batching.
memory capacity, whereas token phases need moderate compute Model serving systems. LLM inference serving is a rapidly
capability with high memory capacity and bandwidth. Thus, developing field, with several recent works optimizing batch-
GPUs like AMD MI-250 [2] and CPUs like Intel Sapphire- ing [23], [25], [51], [53], [81], scheduling [22], [42], [51], [66],
Rapids (with HBM) [9] could be effective token machines. [73], [79], and memory usage [32], [35], [51], [74]. Prior work
Since we do not have access to such hardware and/or optimized has also proposed using CPUs and lower compute capability
LLM implementations, we leave this to future work. devices for LLM serving [8], [12]. These approaches use the
Interconnect between prompt and token machines. In this same machine for both prompt and token phase. With Splitwise,
work, we assume Infiniband connection between the prompt they could improve throughput and latency by splitting phases.
and token machines in all the designs (albeit, lower bandwidth Prior work on video and ML serving focuses on scheduling
when A100s were involved). Although this is common for all model chains with data dependencies under latency con-
homogenous machines, Splitwise-HA is not be readily available straints [24], [31], [43], [49], [68]. Such schedulers rely on
with an Infiniband connection between H100s and A100s, even model profiling to make efficient allocation decisions and
though technically feasible. The alternative could be HPC manage requests across machines. Recommendation system
clouds, with Infiniband connections through the CPU [3], or inference exhibits compute/memory heterogeneity both within
Ethernet, using RoCE [58]. Given our optimized KV-cache and across models. Prior work exploits such heterogeneity
transfer that helps reduce critical latency, an interconnect with to selectively schedule requests between CPUs and accelera-
10× lower bandwidth would likely still be beneficial. To further reduce our bandwidth utilization, we could also compress the KV-cache before transferring it across the network [55].

Heterogeneous prompt/token machines. Although Splitwise is robust to varied models and input traces, we recognize that fragmenting a data center with different types of GPUs (e.g., Splitwise-HA) may bring its own challenges for the CSP.

Conversation back and forth. Chat APIs for LLMs today require the user to send the complete context of the conversation so far [18]. However, in the future, services may have enough GPU capacity to cache the context and avoid recomputation. This could sway the memory utilization pattern of the prompt phase away from our characterization. Furthermore, it may require transferring the KV-cache back to a prompt machine so that it is ready for the next conversation request.

VIII. RELATED WORK

Heterogeneous scheduling and dataflow systems. Prior work has studied heterogeneous scheduling for a variety of interactive services [65], [68], [83]. These works exploit hardware heterogeneity to strike a balance between different objectives such as cost, energy, and performance. However, they run the entire workload on the same machine. Research on heterogeneous multiprocessor CPU scheduling attempts to match workload heterogeneity to hardware heterogeneity [29], [40], [41], [50], [76], [80]. These works use profiling or […]tors [38], [52], colocate models with complementary memory usage [30], and partition compute/memory on heterogeneous hardware resources [45], [48]. Similarly, Splitwise exploits the heterogeneity within LLM inference requests. However, it uses different optimizations due to the differences in LLM workload characteristics and requirements.

IX. CONCLUSION

We extensively characterized the prompt computation and token generation phases of LLM inference to draw out differences in their system utilization patterns. Based on our insights, we designed Splitwise to separate these phases onto different machines and enable phase-specific resource management. Using Splitwise, we explored cluster designs optimized for throughput, cost, and power, and showed that they perform well even as workloads change. Under performance SLOs, Splitwise clusters achieve 1.76× better throughput with 15% lower power at the same cost, or 2.35× better throughput at the same cost and power, compared to existing designs.

ACKNOWLEDGEMENTS

We thank the reviewers for their helpful feedback. We thank Chetan Bansal, Srikant Bhardwaj, Suriya Kalivardhan, Ankur Mallick, Deepak Narayanan, and Amar Phanishayee for insightful discussions. Pratyush Patel was partially supported by NSF CNS-2104548 and a research grant from VMware.
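The KV-cache compression direction raised in the discussion above (compressing the cache before transferring it across the network [55]) can be illustrated with a simple quantization sketch. This is a generic per-row int8 scheme shown for illustration only; it is neither the method of [55] nor part of the Splitwise prototype, and the tensor shapes are toy values.

```python
import numpy as np

# Illustrative sketch: per-row int8 quantization of a KV-cache block
# before sending it over the network, with dequantization on the receiver.
np.random.seed(0)

def quantize_kv(kv):
    """Quantize a float32 KV-cache tensor to int8 with per-row scales."""
    scale = np.abs(kv).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale).astype(np.float32)
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale

# A toy KV block: (num_tokens, hidden_dim), float32 for simplicity.
kv = np.random.randn(128, 64).astype(np.float32)
q, scale = quantize_kv(kv)

bytes_before = kv.nbytes               # 128 * 64 * 4 = 32768
bytes_after = q.nbytes + scale.nbytes  # 8192 + 512 = 8704 (~3.8x smaller)
max_err = float(np.abs(dequantize_kv(q, scale) - kv).max())
print(bytes_before, bytes_after, max_err)
```

The ~3.8× size reduction directly shrinks the per-request transfer over the interconnect, at the cost of a small, bounded dequantization error on the token machine.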
Authorized licensed use limited to: Argonne National Laboratory. Downloaded on May 30,2025 at 20:59:06 UTC from IEEE Xplore. Restrictions apply.
REFERENCES

[1] Add Splitwise Implementation to vLLM. GitHub. [Online]. Available: https://github.com/vllm-project/vllm/pull/2809
[2] AMD Instinct™ MI250 Accelerator. [Online]. Available: https://www.amd.com/en/products/server-accelerators/instinct-mi250
[3] Azure InfiniBand HPC VMs. [Online]. Available: https://learn.microsoft.com/en-us/azure/virtual-machines/overview-hb-hc
[4] Azure Public Dataset: Azure LLM Inference Trace 2023. GitHub. [Online]. Available: https://github.com/Azure/AzurePublicDataset/blob/master/AzureLLMInferenceDataset2023.md
[5] CoreWeave - Specialized Cloud Provider. [Online]. Available: https://www.coreweave.com
[6] Google Assistant with Bard. [Online]. Available: https://blog.google/products/assistant/google-assistant-bard-generative-ai/
[7] HPC Interconnect on CoreWeave Cloud. [Online]. Available: https://docs.coreweave.com/networking/hpc-interconnect
[8] Intel BigDL-LLM. [Online]. Available: https://github.com/intel-analytics/BigDL
[9] Intel Sapphire Rapids with HBM. [Online]. Available: https://www.anandtech.com/show/17422/intel-showcases-sapphire-rapids-plus-hbm-xeon-performance-isc-2022
[10] Microsoft Azure ND A100 v4-series. [Online]. Available: https://learn.microsoft.com/en-us/azure/virtual-machines/nda100-v4-series
[11] MSCCL++: A GPU-driven communication stack for scalable AI applications. [Online]. Available: https://github.com/microsoft/mscclpp
[12] Numenta Inference on CPUs. [Online]. Available: https://www.servethehome.com/numenta-has-the-secret-to-ai-inference-on-cpus-like-the-intel-xeon-max/
[13] NVIDIA Accelerated InfiniBand Solutions. [Online]. Available: https://www.nvidia.com/en-us/networking/products/infiniband/
[14] NVIDIA Chip Shortage. [Online]. Available: https://www.wired.com/story/nvidia-chip-shortages-leave-ai-startups-scrambling-for-computing-power/
[15] NVIDIA DGX A100: Universal System for AI Infrastructure. [Online]. Available: https://resources.nvidia.com/en-us-dgx-systems/dgx-ai
[16] NVIDIA DGX H100. [Online]. Available: https://www.nvidia.com/en-us/data-center/dgx-h100/
[17] NVIDIA Hopper GPUs Expand Reach as Demand for AI Grows. [Online]. Available: https://nvidianews.nvidia.com/news/nvidia-hopper-gpus-expand-reach-as-demand-for-ai-grows
[18] OpenAI ChatGPT APIs. [Online]. Available: https://openai.com/blog/introducing-chatgpt-and-whisper-apis
[19] Power Availability Stymies Datacenter Growth. [Online]. Available: https://www.networkworld.com/article/972483/power-availability-stymies-data-center-growth
[20] SplitwiseSim: LLM Serving Cluster Simulator. GitHub. [Online]. Available: https://github.com/Mutinifni/splitwise-sim
[21] The New Bing. [Online]. Available: https://www.microsoft.com/en-us/edge/features/the-new-bing?form=MT00D8
[22] TurboMind Inference Server. [Online]. Available: https://github.com/InternLM/lmdeploy
[23] A. Agrawal, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, and R. Ramjee, “SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills,” arXiv preprint arXiv:2308.16369, 2023.
[24] H. Albahar, S. Dongare, Y. Du, N. Zhao, A. K. Paul, and A. R. Butt, “SchedTune: A Heterogeneity-aware GPU Scheduler for Deep Learning,” in CCGrid, 2022.
[25] R. Y. Aminabadi, S. Rajbhandari, A. A. Awan, C. Li, D. Li, E. Zheng, O. Ruwase, S. Smith, M. Zhang, J. Rasley, and Y. He, “DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale,” in SC, 2022.
[26] L. A. Barroso, U. Hölzle, and P. Ranganathan, “The Datacenter as a Computer: Designing Warehouse-Scale Machines,” Synthesis Lectures on Computer Architecture, 2018.
[27] E. Boutin, J. Ekanayake, W. Lin, B. Shi, J. Zhou, Z. Qian, M. Wu, and L. Zhou, “Apollo: Scalable and Coordinated Scheduling for Cloud-scale Computing,” in OSDI, 2014.
[28] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language Models are Few-Shot Learners,” arXiv preprint arXiv:2005.14165, 2020.
[29] J. Chen and L. K. John, “Efficient Program Scheduling for Heterogeneous Multi-core Processors,” in DAC, 2009.
[30] Y. Choi, J. Kim, and M. Rhu, “Hera: A Heterogeneity-Aware Multi-Tenant Inference Server for Personalized Recommendations,” arXiv preprint arXiv:2302.11750, 2023.
[31] D. Crankshaw, G.-E. Sela, X. Mo, C. Zumar, I. Stoica, J. Gonzalez, and A. Tumanov, “InferLine: Latency-aware Provisioning and Scaling for Prediction Serving Pipelines,” in SoCC, 2020.
[32] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “FlashAttention: Fast and Memory-efficient Exact Attention with IO-Awareness,” in NeurIPS, 2022.
[33] D. Patterson. Domain Specific Architectures for Deep Neural Networks: Three Generations of Tensor Processing Units (TPUs). Allen School Distinguished Lecture. [Online]. Available: https://www.youtube.com/watch?v=VCScWh966u4
[34] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Communications of the ACM, 2008.
[35] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale,” arXiv preprint arXiv:2208.07339, 2022.
[36] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in NAACL, 2019.
[37] A. Gujarati, R. Karimi, S. Alzayat, W. Hao, A. Kaufmann, Y. Vigfusson, and J. Mace, “Serving DNNs like Clockwork: Performance Predictability from the Bottom Up,” in OSDI, 2020.
[38] U. Gupta, S. Hsia, V. Saraph, X. Wang, B. Reagen, G.-Y. Wei, H.-H. S. Lee, D. Brooks, and C.-J. Wu, “DeepRecSys: A System for Optimizing End-to-end At-scale Neural Recommendation Inference,” in ISCA, 2020.
[39] V. Gupta, M. Harchol Balter, K. Sigman, and W. Whitt, “Analysis of Join-the-Shortest-Queue Routing for Web Server Farms,” Performance Evaluation, 2007.
[40] M. E. Haque, Y. H. Eom, Y. He, S. Elnikety, R. Bianchini, and K. S. McKinley, “Few-to-Many: Incremental Parallelism for Reducing Tail Latency in Interactive Services,” ACM SIGPLAN Notices, 2015.
[41] M. E. Haque, Y. He, S. Elnikety, T. D. Nguyen, R. Bianchini, and K. S. McKinley, “Exploiting Heterogeneity for Tail Latency and Energy Efficiency,” in MICRO, 2017.
[42] K. Hong, G. Dai, J. Xu, Q. Mao, X. Li, J. Liu, K. Chen, H. Dong, and Y. Wang, “FlashDecoding++: Faster Large Language Model Inference on GPUs,” arXiv preprint arXiv:2311.01282, 2023.
[43] Y. Hu, R. Ghosh, and R. Govindan, “Scrooge: A Cost-effective Deep Learning Inference System,” in SoCC, 2021.
[44] Huggingface. Text Generation Inference (TGI). [Online]. Available: https://github.com/huggingface/text-generation-inference
[45] R. Hwang, T. Kim, Y. Kwon, and M. Rhu, “Centaur: A Chiplet-based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations,” in ISCA, 2020.
[46] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks,” in EuroSys, 2007.
[47] M. Javaheripi and S. Bubeck, “Phi-2: The Surprising Power of Small Language Models,” Microsoft Research Blog, 2023.
[48] W. Jiang, Z. He, S. Zhang, K. Zeng, L. Feng, J. Zhang, T. Liu, Y. Li, J. Zhou, C. Zhang et al., “FleetRec: Large-scale Recommendation Inference on Hybrid GPU-FPGA Clusters,” in KDD, 2021.
[49] R. S. Kannan, L. Subramanian, A. Raju, J. Ahn, J. Mars, and L. Tang, “GrandSLAm: Guaranteeing SLAs for Jobs in Microservices Execution Frameworks,” in EuroSys, 2019.
[50] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen, “Single-ISA Heterogeneous Multi-core Architectures: The Potential for Processor Power Reduction,” in MICRO, 2003.
[51] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, “Efficient Memory Management for Large Language Model Serving with PagedAttention,” in SOSP, 2023.
[52] Y. Kwon, Y. Lee, and M. Rhu, “TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning,” in MICRO, 2019.
[53] Z. Li, L. Zheng, Y. Zhong, V. Liu, Y. Sheng, X. Jin, Y. Huang, Z. Chen, H. Zhang, J. E. Gonzalez, and I. Stoica, “AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving,” in OSDI, 2023.
[54] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” arXiv preprint arXiv:1907.11692, 2019.
[55] Z. Liu, J. Wang, T. Dao, T. Zhou, B. Yuan, Z. Song, A. Shrivastava, C. Zhang, Y. Tian, C. Re, and B. Chen, “Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time,” in ICML, 2023.
[56] Meta. Introducing the AI Research SuperCluster — Meta’s Cutting-Edge AI Supercomputer for AI Research. [Online]. Available: https://ai.facebook.com/blog/ai-rsc/
[57] “Azure OpenAI Service,” Microsoft Azure, 2022. [Online]. Available: https://azure.microsoft.com/en-us/products/ai-services/openai-service
[58] R. Mittal, A. Shpiner, A. Panda, E. Zahavi, A. Krishnamurthy, S. Ratnasamy, and S. Shenker, “Revisiting Network Support for RDMA,” arXiv preprint arXiv:1806.08159, 2018.
[59] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan et al., “Ray: A Distributed Framework for Emerging AI Applications,” in OSDI, 2018.
[60] OpenAI. Scaling Kubernetes to 7,500 Nodes. [Online]. Available: https://openai.com/research/scaling-kubernetes-to-7500-nodes
[61] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica, “Sparrow: Distributed, Low Latency Scheduling,” in SOSP, 2013.
[62] P. Patel, E. Choukse, C. Zhang, Í. Goiri, B. Warrier, N. Mahalingam, and R. Bianchini, “POLCA: Power Oversubscription in LLM Cloud Providers,” arXiv preprint arXiv:2308.12908, 2023.
[63] P. Patel, E. Choukse, C. Zhang, Í. Goiri, B. Warrier, N. Mahalingam, and R. Bianchini, “Characterizing Power Management Opportunities for LLMs in the Cloud,” in ASPLOS, 2024.
[64] P. Patel, Z. Gong, S. Rizvi, E. Choukse, P. Misra, T. Anderson, and A. Sriraman, “Towards Improved Power Management in Cloud GPUs,” in IEEE CAL, 2023.
[65] P. Patel, K. Lim, K. Jhunjhunwalla, A. Martinez, M. Demoulin, J. Nelson, I. Zhang, and T. Anderson, “Hybrid Computing for Interactive Datacenter Applications,” arXiv preprint arXiv:2304.04488, 2023.
[66] R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean, “Efficiently Scaling Transformer Inference,” in MLSys, 2023.
[67] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” OpenAI blog, 2019.
[68] F. Romero, Q. Li, N. J. Yadwadkar, and C. Kozyrakis, “INFaaS: Automated Model-less Inference Serving,” in USENIX ATC, 2021.
[69] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, J. Tow, A. M. Rush, S. Biderman, A. Webson, P. S. Ammanamanchi, T. Wang, B. Sagot, N. Muennighoff, A. V. del Moral, O. Ruwase, R. Bawden, S. Bekman, A. McMillan-Major, I. Beltagy, H. Nguyen, L. Saulnier, S. Tan, P. O. Suarez, V. Sanh, H. Laurençon, Y. Jernite, J. Launay, M. Mitchell, and C. Raffel, “BLOOM: A 176B-Parameter Open-access Multilingual Language Model,” arXiv preprint arXiv:2211.05100, 2022.
[70] P. Schmid. Fine-tune FLAN-T5 XL/XXL using DeepSpeed & Hugging Face Transformers. [Online]. Available: https://www.philschmid.de/fine-tune-flan-t5-deepspeed
[71] P. Schmid, O. Sanseviero, P. Cuenca, and L. Tunstall. Llama 2 is here - Get it on Hugging Face. [Online]. Available: https://huggingface.co/blog/llama2
[72] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes, “Omega: Flexible, Scalable Schedulers for Large Compute Clusters,” in EuroSys, 2013.
[73] Y. Sheng, S. Cao, D. Li, B. Zhu, Z. Li, D. Zhuo, J. E. Gonzalez, and I. Stoica, “Fairness in Serving Large Language Models,” arXiv preprint arXiv:2401.00588, 2023.
[74] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang, C. Ré, I. Stoica, and C. Zhang, “FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU,” in ICML, 2023.
[75] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop Distributed File System,” in MSST, 2010.
[76] K. Van Craeynest, A. Jaleel, L. Eeckhout, P. Narvaez, and J. Emer, “Scheduling Heterogeneous Multi-cores Through Performance Impact Estimation (PIE),” ACM SIGARCH Computer Architecture News, 2012.
[77] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All You Need,” in NeurIPS, 2017.
[78] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. v. Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, “Transformers: State-of-the-art Natural Language Processing,” in EMNLP, 2020.
[79] B. Wu, Y. Zhong, Z. Zhang, G. Huang, X. Liu, and X. Jin, “Fast Distributed Inference Serving for Large Language Models,” arXiv preprint arXiv:2305.05920, 2023.
[80] H. Yang, Q. Chen, M. Riaz, Z. Luan, L. Tang, and J. Mars, “PowerChief: Intelligent Power Allocation for Multi-stage Applications to Improve Responsiveness on Power Constrained CMP,” in ISCA, 2017.
[81] G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca: A Distributed Serving System for Transformer-Based Generative Models,” in OSDI, 2022.
[82] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” in NSDI, 2012.
[83] C. Zhang, M. Yu, W. Wang, and F. Yan, “MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving,” in USENIX ATC, 2019.
[84] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin et al., “OPT: Open Pre-trained Transformer Language Models,” arXiv preprint arXiv:2205.01068, 2022.
[85] W. Zhu, “Analysis of JSQ Policy on Soft Real-time Scheduling in Cluster,” in HPCAsia, 2000.

APPENDIX

A. Abstract

We open source critical components needed to evaluate Splitwise; these could be repurposed to also evaluate future LLM inference serving systems. Our artifact includes:
● Production traces from two LLM inference services at Microsoft Azure.
● A prototype implementation of Splitwise’s KV-cache transfer mechanism in vLLM [51].
● SplitwiseSim, a discrete event simulator to evaluate model serving in LLM inference clusters.

Artifact functionality was only tested for the traces and SplitwiseSim due to limited hardware availability.

B. Artifact check-list (meta-information)

● Data set: Production traces available as a part of the artifact.
● Run-time environment: Linux / Ubuntu.
● Hardware: Two machines connected over GPU Infiniband for the vLLM prototype (e.g., NVIDIA DGX-A100, NVIDIA DGX-H100). x86-64 CPU machine for SplitwiseSim.
● Publicly available?: Yes.
● Code licenses (if publicly available)?: MIT.
● Data licenses (if publicly available)?: CC-BY.
● Archived (provide DOI)?: 10.5281/zenodo.11003049.

C. Description

How to access. The entire artifact is available as an archive on Zenodo: https://doi.org/10.5281/zenodo.11003049. Individual components are also available online as follows:
● The production traces can be downloaded from the Azure Public Dataset GitHub repository [4].
● The KV-cache transfer prototype can be downloaded from the vLLM GitHub repository, currently available as a pull request [1].
● SplitwiseSim, and the associated experiment and plotting scripts, can be downloaded from a separate GitHub repository [20].

Hardware dependencies. The KV-cache transfer prototype requires two GPU machines connected over Infiniband, such as NVIDIA DGX-A100s or NVIDIA DGX-H100s. SplitwiseSim requires a standard x86-64 CPU machine; multiple machines may be used to parallelize simulation runs.

Software dependencies. The KV-cache transfer prototype is built on top of vLLM [51] and MSCCL++ [11]. SplitwiseSim depends on a small set of publicly available Python packages, which can be installed via the included requirements.txt.

Data sets. Coding and conversation traces from Microsoft Azure are available online as a part of the artifact release [4].

D. Installation and Experiment Workflow

Please refer to the README files within the artifact for installation and usage instructions.

E. Methodology

Submission, reviewing and badging methodology:
● https://www.acm.org/publications/policies/artifact-review-and-badging-current
● http://cTuning.org/ae/submission-20201122.html
● http://cTuning.org/ae/reviewing-20201122.html
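To make the released traces concrete, the sketch below shows one way a downstream consumer might inspect them. The rows and column names here are invented stand-ins, not the real schema; the Azure Public Dataset repository [4] documents the actual trace format.

```python
import csv
import io
import statistics

# Hypothetical rows in the spirit of the released LLM inference traces:
# an arrival timestamp plus prompt ("context") and generated token counts.
# Column names are assumptions; consult the dataset README for the schema.
trace_csv = """TIMESTAMP,ContextTokens,GeneratedTokens
2023-11-16 18:01:03.100,1500,120
2023-11-16 18:01:03.450,300,600
2023-11-16 18:01:04.020,4000,15
"""

rows = list(csv.DictReader(io.StringIO(trace_csv)))

# Prompt-heavy requests stress Splitwise's prompt pool, while
# generation-heavy requests stress the token pool.
prompt_heavy = [int(r["ContextTokens"]) > int(r["GeneratedTokens"]) for r in rows]
median_prompt = statistics.median(int(r["ContextTokens"]) for r in rows)
print(prompt_heavy, median_prompt)  # -> [True, False, True] 1500
```

Summary statistics like these are what a simulator such as SplitwiseSim would replay to size the prompt and token machine pools.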