
KNN-SSD: Enabling Dynamic Self-Speculative Decoding via

Nearest Neighbor Layer Set Optimization


Mingbo Song¹, Heming Xia², Jun Zhang³, Chak Tou Leong², Qiancheng Xu², Wenjie Li², Sujian Li¹

¹ National Key Laboratory for Multimedia Information Processing, Peking University
² Department of Computing, The Hong Kong Polytechnic University
³ College of Computer Science and Technology, Zhejiang University
songmingbo@[Link]; [Link]@[Link]

Abstract

Speculative Decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs) without compromising generation quality. It works by efficiently drafting multiple tokens using a compact model and then verifying them in parallel using the target LLM. Notably, Self-Speculative Decoding proposes skipping certain layers to construct the draft model, which eliminates the need for additional parameters or training. Despite its strengths, we observe in this work that drafting with layer skipping exhibits significant sensitivity to domain shifts, leading to a substantial drop in acceleration performance. To enhance the domain generalizability of this paradigm, we introduce KNN-SSD, an algorithm that leverages K-Nearest Neighbor (KNN) search to match different skipped layers with various domain inputs. We evaluated our algorithm on various models and multiple tasks, observing that its application leads to a 1.3×∼1.6× speedup in LLM inference.¹

¹ Code in [Link]

Figure 1: Average speedup results under task-by-task sample streams. The dashed line represents the average speedup ratio achieved by KNN-SSD. Results indicate that our KNN-SSD can achieve a stable speedup while Self-SD methods' speedups decline, as they are sensitive to domain shifts.

1 Introduction

Large language models (LLMs) have proven highly capable in handling various downstream tasks (Touvron et al., 2023; OpenAI et al., 2024; Yang et al., 2025). However, the token-by-token generation in autoregressive decoding results in quadratic computational complexity, which presents significant efficiency challenges, particularly as model size increases. To address this challenge, speculative decoding (SD) has been proposed as a promising solution for lossless acceleration of LLM inference (Xia et al., 2023; Leviathan et al., 2023; Chen et al., 2023). At each decoding step, SD uses a lightweight draft model to efficiently predict multiple tokens, which are then verified in parallel by the target LLM to preserve the original output distribution. The effectiveness of SD hinges on the trade-off between drafting latency and speculation accuracy (Xia et al., 2024; Hu et al., 2025). During inference, SD aims to both minimize latency and maximize accuracy to improve efficiency while maintaining output quality.

Recent advancements in SD have significantly expanded the boundaries of the latency-accuracy trade-off by employing diverse techniques, such as integrating lightweight draft models into LLMs (Ankner et al., 2024; Zhang et al., 2025) or aligning a small model with a larger one (Kim et al., 2023; Bachmann et al., 2025) for speculative generation. However, these approaches inevitably require additional models, which increase the total number of parameters and introduce additional training complexity. Addressing this concern, Self-SD (Zhang et al., 2024) has been proposed to selectively skip certain layers within the large model itself to construct a compact draft model.

In this work, we find that the selection of skipped layers is not universal. Instead, one skip-layer configuration could be sensitive to domain shifts.
For example, when applying a configuration derived from the summarization task to other tasks, as shown in Figure 1, we observe a significant reduction in speedup from 1.35× to less than 1.10×, highlighting the need for domain-specific adaptation. To tackle this issue, we propose KNN-SSD, a method for dynamically adjusting skip-layer configurations based on domain representations. The key goal of KNN-SSD is to optimize skipped layers specific to each domain, simulate realistic input scenarios, and accurately identify the domain of each sample. To achieve this goal, KNN-SSD integrates three main features: (1) a skipped layer set optimization process for the specific domain of samples, (2) an input sample stream designed to better simulate real-life user inputs, and (3) a KNN model that uses the LLM's last hidden representations to identify the domain of input samples.

Experiments are conducted using the LLaMA-2 series (Touvron et al., 2023) and the Qwen-2.5 series (Yang et al., 2025) across various tasks, including summarization, reasoning, translation, storytelling, and text-to-SQL. KNN-SSD achieves a 1.3×∼1.6× speedup compared to autoregressive decoding. This approach maintains over 80% token acceptance rate across the LLaMA-2 series and over 99% token acceptance rate across the Qwen-2.5 series, indicating high alignment potential between the draft model and the target LLM. Further analysis validated the effectiveness of KNN-SSD on out-of-domain sample inputs and on a dataset that contains various types of samples.

To summarize, our key contributions are:

1. We introduce KNN-SSD, a self-speculative decoding algorithm with fine-grained skipped layer set selection, which adopts k-nearest neighbor search to retrieve a suitable skipped layer set for each input sample;

2. To evaluate our method, we design a dynamic input data stream that contains samples from diverse domains, and KNN-SSD achieves a 1.3×∼1.6× speedup across different models without changing the generated tokens' distribution.

2 Related Work

Speculative Decoding (SD). Speculative Decoding (SD) aims to accelerate autoregressive text generation in LLMs without compromising output quality (Xia et al., 2023; Leviathan et al., 2023). It reduces decoding latency by predicting multiple future tokens using a draft model or internal mechanisms, followed by verification and correction by the target LLM. Existing strategies include aligning small draft models with large models (Xia et al., 2023; Kim et al., 2023; Bachmann et al., 2025) or predicting k tokens in parallel (Cai et al., 2024; Wen et al., 2024). In another line of work, plug-and-play methods have been examined, with examples including appending pseudo tokens (Fu et al., 2024) and skipping layers dynamically (Metel et al., 2024; Xia et al., 2025) during inference. Despite efficiency improvements, these methods often rely on auxiliary models or sub-optimal choices, hindering scalability and effectiveness. The most related methods to our work are Self-SD (Zhang et al., 2024) and LayerSkip (Elhoushi et al., 2024), which also construct draft models by skipping intermediate LLM layers. However, both approaches are trained on a single data type and struggle with diverse data streams. Our work aims to tackle this problem by integrating samples from various domains.

Sparsity and Model Compression. Sparsity and model compression are essential for enhancing the efficiency of LLMs by reducing active parameters or computations during inference (Hu et al., 2022). Common approaches include parameter pruning (Frantar and Alistarh, 2023; Ashkboos et al., 2024; Sun et al., 2024), knowledge distillation (Huang et al., 2022; Gu et al., 2024; Wu et al., 2024), and quantization (Yao et al., 2022; Liu et al., 2023; Park et al., 2024), which compress models while preserving performance. Structured sparsity methods, such as layer skipping (Liu et al., 2024; Bhendawade et al., 2024; Xia et al., 2025) and dynamic sparsification, further enhance efficiency by adapting computation to input characteristics. While these works aim to optimize computational workloads, they may sacrifice performance by using sub-optimal choices because of insufficient search in the layer space. In contrast, our KNN-SSD method can always find optimal choices to accelerate LLM inference losslessly.

3 Background

3.1 Self-Speculative Decoding

Unlike traditional SD methods that require an auxiliary draft model, Self-Speculative Decoding (Self-SD) leverages the LLM's internal structure to draft tokens by selectively skipping certain layers (Zhang et al., 2024).
Figure 2: Different tasks have different optimal skip layer sets. "Sum SL" denotes the skip layer set optimized for the Summarization task.

Given data x_1, ..., x_n and the target LLM M with L layers including both attention and MLP layers, Self-SD aims to find an optimal z ∈ {0, 1}^L, where z^(i) = 1 indicates that the i-th layer needs to be skipped and vice versa. A black-box function f(·) is used to assess the average inference time per verified token:

    z* = arg min_z f(M(z) | x_1, ..., x_n).    (1)

Self-SD applies Bayesian optimization (Jones et al., 1998) to identify an optimal skip layer set by iteratively selecting new z based on a Gaussian process and evaluating with Eq. (1). After a specified number of iterations, the best z is considered an approximation of z* and is fixed for inference. During decoding, the selected layers are skipped to efficiently generate draft tokens, which are then validated in parallel by the full-parameter LLM to ensure the output distribution remains unchanged.
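To make the search in Eq. (1) concrete, the sketch below drives it with an off-the-shelf Bayesian optimization library (bayes_opt). It is a minimal illustration rather than the authors' released code: avg_time_per_verified_token is a hypothetical stand-in for the actual latency benchmark over the samples x_1, ..., x_n, and relaxing the binary mask to continuous scores thresholded at 0.5 is one common way to apply continuous BO to a discrete layer-mask space.

```python
# Illustrative sketch of Eq. (1): Bayesian optimization over a relaxed skip-layer mask.
# avg_time_per_verified_token() is a toy stand-in for benchmarking the layer-skipped
# draft model; in practice it would run Self-SD on the sample set and time it.
import numpy as np
from bayes_opt import BayesianOptimization

L = 80  # number of attention + MLP sub-layers (e.g., 40 + 40 for LLaMA-2-13B)

def avg_time_per_verified_token(mask: np.ndarray) -> float:
    skipped = mask.sum()
    draft_cost = 1.0 - 0.5 * skipped / L              # drafting gets cheaper ...
    acceptance = max(0.05, 1.0 - (skipped / L) ** 2)  # ... but acceptance degrades
    return draft_cost / acceptance

def objective(**scores) -> float:
    mask = np.array([scores[f"x{i}"] > 0.5 for i in range(L)], dtype=int)
    return -avg_time_per_verified_token(mask)         # bayes_opt maximizes

optimizer = BayesianOptimization(
    f=objective,
    pbounds={f"x{i}": (0.0, 1.0) for i in range(L)},
    random_state=0,
    verbose=0,
)
optimizer.maximize(init_points=20, n_iter=100)        # the paper runs 1,000 iterations

best = optimizer.max["params"]
z_star = np.array([best[f"x{i}"] > 0.5 for i in range(L)], dtype=int)
print("skipped layers:", np.flatnonzero(z_star))
```

The fixed mask z* found this way is then reused for every subsequent query, which is exactly the property Section 3.2 shows to be brittle under domain shift.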
3.2 Preliminary Study

While Self-SD improves inference efficiency, the optimal layers to skip vary significantly across different tasks. To demonstrate this, we analyze the performance of SD across multiple representative tasks, including summarization, reasoning, storytelling, translation, and text-to-SQL. As shown in Figure 2, an optimized skip-layer configuration for one task does not generalize well to others. For example, a configuration that accelerates summarization degrades performance in reasoning tasks. These results show that a static skip-layer configuration is suboptimal. This limits its effectiveness, particularly in real-world scenarios where query types are unpredictable. To achieve both high inference efficiency and minimal performance degradation, task-specific configurations are essential. This motivates the development of KNN-SSD, which dynamically selects the most suitable skip-layer configuration based on task characteristics, ensuring robust and efficient speculative decoding across diverse tasks.

4 Methodology

We introduce KNN-SSD, a generalizable Self-SD method designed to improve inference efficiency while maintaining adaptability across diverse tasks. Figure 3 shows our method of accelerating inference. It first generates enough last hidden vectors for each task during the pre-inference process. Then, a fixed number of vectors are selected as representative anchors to fit a KNN model. For each task, its optimal skip layer set is searched using a Bayesian optimization process. In the inference process, a new input sample finds its cluster using the previously fitted KNN model, and the corresponding skip layer set is used for the target LLM. Finally, we perform the standard Self-SD process, which contains two stages of drafting and verification to accelerate inference. By integrating these two processes, KNN-SSD provides a flexible and effective solution to accelerate LLM inference in real-world applications.
Figure 3: Layer skipping and KNN process in KNN-SSD. Before LLM-related generation, KNN-SSD first performs (a) layer set search optimization: for each task, KNN-SSD generates a task-specific skip layer set and stores it in a configuration file; and (b) anchor representative generation: KNN-SSD produces last hidden vectors for each task to fit a KNN model. When a new sample is input, KNN-SSD first uses its last hidden vector as the input representative and queries the KNN model. Based on the retrieved result, it selects the corresponding skip layer set to perform decoding, thereby achieving acceleration.

4.1 Pre-Inference

Given a set of domains D_1, ..., D_n, we first randomly sample multiple instances from each domain, denoted as d_i1, ..., d_im for domain D_i. Each sampled instance d_ij is then passed through a pre-trained LLM M to obtain its last hidden vector representation v_ij. These samples are then aggregated and clustered into n groups µ_1, ..., µ_n using the K-means algorithm, where the number of clusters is set to match the number of domains. For each cluster µ_i, we identify k representative anchors based on their distance to the cluster centroid. The collection of selected anchors for cluster µ_i is denoted as A_i = {a_i1, ..., a_ik}, which will be used to fit a KNN model. The construction of the anchor set A_i is formally defined as follows:

    A_i = arg min_{S ⊆ D_i, |S| = k} Σ_{v_ij ∈ S} ||v_ij − µ_i||.    (2)

Subsequently, for each domain D_i, we utilize the anchor set {a_i1, ..., a_ik} to determine a domain-specific skip layer set z_i ∈ {0, 1}^L, where L denotes the total number of layers in the language model M. Each element z_i^(j) indicates whether the j-th layer should be skipped (z_i^(j) = 1) or retained (z_i^(j) = 0) during inference. To identify the optimal configuration z_i, we employ Bayesian Optimization (Jones et al., 1998) over the space of binary layer masks, aiming to minimize an objective black-box function f(·) that measures the average inference time per verified token:

    z_i* = arg min_{z_i} f(M(z_i) | a_i1, ..., a_ik).    (3)

All z_1*, ..., z_n* will be stored for future use and will not be changed.
fication stage, the full LLM verifies the draft tokens
For a newly arrived sample s, we first extract its last in a single forward pass. This step validates the cor-
hidden vector v from the model. We then perform rectness of the generated tokens and either accepts
a KNN search based on cosine similarity between them or triggers a re-drafting if discrepancies are
the hidden vector of s and all representative an- found.
chors. This process yields a corresponding domain To better simulate real-world task streams, we
label, effectively classifying the sample into one introduce the mix ratio r, which denotes the prob-
of the known domains i∗ . Based on the identified ability that the next input sample belongs to a dif-
domain, we apply its associated optimal skip-layer ferent task than the current one. A mix ratio of 0
configuration zi∗∗ to M to accelerate inference: corresponds to a task-by-task input stream, where

4
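Eqs. (4)-(6) reduce to a single nearest-neighbor lookup under cosine similarity. The sketch below assumes the anchors, their domain labels, and the per-domain skip sets were produced during pre-inference; it illustrates the routing step only and is not the paper's implementation.

```python
# Sketch of the inference-time routing (Eqs. 4-6): match the query's last hidden
# vector to the most similar anchor and return that domain's skip-layer mask.
from typing import Dict, Tuple
import numpy as np

def route_to_skip_set(v: np.ndarray,
                      anchors: np.ndarray,
                      anchor_domains: np.ndarray,
                      skip_sets: Dict[int, np.ndarray]) -> Tuple[int, np.ndarray]:
    sims = anchors @ v / (np.linalg.norm(anchors, axis=1) * np.linalg.norm(v) + 1e-12)
    nearest = int(np.argmax(sims))           # arg max over all anchors a_ij  (Eq. 4)
    domain = int(anchor_domains[nearest])    # Domain(s) = i*                 (Eq. 5)
    return domain, skip_sets[domain]         # M <- z*_{i*}                   (Eq. 6)
```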
We then perform the standard Self-SD process (Zhang et al., 2024), which involves two stages: drafting and verification. During the drafting stage, the LLM uses the previously selected skip-layer configuration z_i as a draft model M(z_i) to generate a sequence of draft tokens:

    y' = arg max_y log P(y | x, y; M(z_i)),    (7)

where x and y denote the input and the output generated by the LLM, respectively, and y' represents the token produced by the autoregressive process. In the verification stage, the full LLM verifies the draft tokens in a single forward pass. This step validates the correctness of the generated tokens and either accepts them or triggers a re-drafting if discrepancies are found.
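For intuition about this drafting-and-verification loop, the toy sketch below shows the control flow of greedy draft-then-verify: the layer-skipped model proposes a short block of tokens, the full model checks them in one pass, and the longest prefix matching the full model's greedy predictions is accepted. Both models are abstracted behind callables, and KV-cache handling is omitted; this is a simplified illustration under the assumption of greedy decoding, not the released implementation.

```python
# Toy sketch of greedy draft-then-verify. draft_next(tokens) returns the skip-layer
# model's greedy next token; target_greedy(context, draft) returns gamma + 1 greedy
# tokens of the full model: its prediction after context, after context + draft[:1],
# ..., after context + draft (all obtainable from one forward pass over context + draft).
from typing import Callable, List

def speculative_generate(prefix: List[int],
                         draft_next: Callable[[List[int]], int],
                         target_greedy: Callable[[List[int], List[int]], List[int]],
                         max_new_tokens: int = 64,
                         gamma: int = 4) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        draft = []
        for _ in range(gamma):                       # drafting stage
            draft.append(draft_next(out + draft))
        preds = target_greedy(out, draft)            # verification stage (parallel)
        n_accept = 0
        while n_accept < gamma and draft[n_accept] == preds[n_accept]:
            n_accept += 1
        out += draft[:n_accept] + [preds[n_accept]]  # accepted prefix + target's token
    return out[:len(prefix) + max_new_tokens]
```

Because every emitted token is either confirmed or produced by the full model, the accepted output matches ordinary greedy decoding, which is why the acceleration is lossless.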

To better simulate real-world task streams, we introduce the mix ratio r, which denotes the probability that the next input sample belongs to a different task than the current one. A mix ratio of 0 corresponds to a task-by-task input stream, where all consecutive samples come from the same task. In contrast, a mix ratio of 1 indicates maximum task mixing, where every two consecutive samples are from different tasks. As the mix ratio grows, the frequency of domain shift increases:

    P(s_{i+1} ∈ D_j | s_i ∈ D_k) = r / (N − 1)  if j ≠ k,
                                    1 − r        if j = k.    (8)
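Eq. (8) describes a simple Markov process over task identities; one way to materialize such a stream is sketched below. The dataset container and function names are illustrative.

```python
# Sketch of a sample stream following Eq. (8): with probability r the next sample
# comes from a different, uniformly chosen domain; otherwise the domain is kept.
import random

def mixed_stream(datasets: dict, num_samples: int, r: float, seed: int = 0):
    rng = random.Random(seed)
    domains = list(datasets)
    current = rng.choice(domains)
    for _ in range(num_samples):
        yield current, rng.choice(datasets[current])
        if rng.random() < r:                          # domain shift with probability r
            current = rng.choice([d for d in domains if d != current])

# r = 0 reproduces the task-by-task stream; r = 1 forces a shift at every step.
```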
5 Experiments

5.1 Experimental Setup

Implementation Details. We mainly evaluate KNN-SSD on the LLaMA-2 series (Touvron et al., 2023) and Qwen-2.5 series (Yang et al., 2025) across various tasks, including summarization, mathematical reasoning, storytelling, translation, and text-to-SQL. The evaluation datasets include CNN/Daily Mail (CNN/DM) (Nallapati et al., 2016), GSM8K (Cobbe et al., 2021), TinyStories (Eldan and Li, 2023), Wmt16 DE-EN (Wmt16) (Bojar et al., 2016), and Spider2 (Lei et al., 2025).

For each dataset, we used Bayesian optimization² (BO) to perform 1,000 iterations on 8 representative samples in search of the optimal skip-layer configuration. The representative samples are selected via the K-means algorithm from all last hidden vectors generated by the LLM on the corresponding dataset, ensuring good coverage of the feature space. The maximum generation lengths on CNN/DM, GSM8K, Wmt16, Spider2, and TinyStories are set to 64, 64, 64, 64, and 128, respectively. We conduct 1-shot evaluation for CNN/DM and TinyStories, 3-shot evaluation for Spider2, and 5-shot evaluation for GSM8K and Wmt16. For each dataset, we extracted the most representative k = 10 hidden vectors from the last hidden layer across all data samples using cosine similarity to serve as anchor points for the KNN model, following the same approach as introduced earlier for the BO framework. For each new input sample, we also compute the cosine similarity between its last hidden vector and the anchors, and assign it to the task of its nearest neighbor.

² [Link] BayesianOptimization

Models                  Methods        R=0.0   R=0.3   R=0.7   R=1.0   Speed (token/s)   Overall E(Spd.)
LLaMA-2-13B             Vanilla        1.00×   1.00×   1.00×   1.00×   13.62             1.00×
                        Self-SD(Fix)   1.24×   1.21×   1.19×   1.17×   16.34             1.20×
                        Self-SD(Mix)   1.23×   1.27×   1.24×   1.23×   16.88             1.24×
                        KNN-SSD        1.42×   1.45×   1.43×   1.45×   19.61             1.44×
LLaMA-2-13B-Chat        Vanilla        1.00×   1.00×   1.00×   1.00×   13.22             1.00×
                        Self-SD(Fix)   1.13×   1.14×   1.08×   1.10×   14.67             1.11×
                        Self-SD(Mix)   1.13×   1.17×   1.17×   1.16×   15.33             1.16×
                        KNN-SSD        1.33×   1.36×   1.36×   1.37×   17.85             1.35×
Qwen-2.5-14B            Vanilla        1.00×   1.00×   1.00×   1.00×   11.16             1.00×
                        Self-SD(Fix)   1.25×   1.23×   1.27×   1.28×   14.06             1.26×
                        Self-SD(Mix)   1.40×   1.36×   1.39×   1.38×   15.40             1.38×
                        KNN-SSD        1.60×   1.64×   1.63×   1.61×   18.08             1.62×
Qwen-2.5-14B-Instruct   Vanilla        1.00×   1.00×   1.00×   1.00×   10.79             1.00×
                        Self-SD(Fix)   1.18×   1.20×   1.20×   1.17×   12.84             1.19×
                        Self-SD(Mix)   1.26×   1.24×   1.27×   1.25×   13.49             1.25×
                        KNN-SSD        1.52×   1.49×   1.50×   1.52×   16.30             1.51×

Table 1: Comparison between KNN-SSD and two Self-SD methods. R indicates the mix ratio of sample streams. The R columns report the expected speedup ratio E(Spd.) under the corresponding mix ratio; we also report the average decoding speed (token/s) under greedy decoding and the average speedup ratio across mix ratios. More details are provided in Appendix C.3.
Baselines. In our primary experiments, we compared KNN-SSD with the Self-SD approach (Zhang et al., 2024) to assess their effectiveness. For the Self-SD method, we primarily simulated two scenarios. In the first scenario, a fixed skip-layer configuration was determined based on the first sample in the task stream and remained unchanged throughout the process, which is denoted as Self-SD(Fix). In the second scenario, the skip-layer configuration was adjusted by re-performing BO according to the task distribution within the stream, and the newly searched configuration was subsequently applied for inference and also remained unchanged, which is denoted as Self-SD(Mix).

Evaluation Metrics. We evaluate KNN-SSD using two standard metrics commonly adopted in evaluation: the mean generated length M (Stern et al., 2018) and the token acceptance rate α (Leviathan et al., 2023). Beyond these, we also report the expected decoding throughput in tokens per second, along with the expected wall-time speedup ratio compared to standard autoregressive decoding. Given M and α, the expected speedup can be derived by the formula given by Leviathan et al. (2023):

    E(Spd.) = Mα / ((M − 1)(1 − r) + α),    (9)

where r denotes the ratio of skipped layers.

5.2 Main Result

Table 1 presents the comparison between KNN-SSD and two Self-SD methods on generation tasks. In our experiments, we evaluate KNN-SSD under four settings: mix ratio = 0, 0.3, 0.7, and 1 separately, with 40 samples from each of the five datasets, 200 samples in total. The experimental results demonstrate the following findings: (1) KNN-SSD shows superior efficiency over prior methods, achieving consistent speedups of 1.35×∼1.62× over vanilla autoregressive decoding across various models. (2) The mix ratio of sample streams does not affect the speedup of KNN-SSD. The speedup remains stable, which indicates that KNN-SSD can handle various samples in a more realistic scenario.

We present the mean accepted tokens, acceptance rate, as well as the actual speedup of the LLaMA-2-13B series in Table 2 and Figure 4, which further validates the superiority of KNN-SSD over Self-SD.

Models              Methods        M      α      Speedup
LLaMA-2-13B         Vanilla        1.00   -      1.00×
                    Self-SD(Fix)   2.17   0.62   1.10×
                    Self-SD(Mix)   2.53   0.68   1.14×
                    KNN-SSD        3.12   0.88   1.34×
LLaMA-2-13B-Chat    Vanilla        1.00   -      1.00×
                    Self-SD(Fix)   1.97   0.57   1.04×
                    Self-SD(Mix)   2.14   0.59   1.09×
                    KNN-SSD        2.87   0.85   1.28×

Table 2: The results demonstrate the mean accepted tokens, token acceptance rate, and actual speedup ratio obtained from our tests on the LLaMA-2 series, showing that KNN-SSD outperforms the two Self-SD methods in every metric.

Figure 4: The mean accepted tokens and mean acceptance rate under task-by-task sample streams. The dashed lines represent the average length and rate achieved by KNN-SSD across all five datasets.

5.3 Analysis

Inter & Intra. We use the MATH (Hendrycks et al., 2021) dataset to assess the capabilities of KNN-SSD on a single dataset with multiple domains. In the MATH dataset, math questions are categorized into seven types. Thus, using one specific skip layer set for this dataset is insufficient, and we introduce a fine-grained clustering to handle this mixed domain. Figure 6 shows that each type of math question can be clustered into a single group. Table 3 indicates the speedup result for each method, where we can clearly see that KNN-SSD outperforms the Self-SD methods and achieves a speedup of 1.23× and a mean generated length M of 2.37.
Figure 5: Visualization of last hidden vectors from five domains of samples. Results show these vectors can be clearly divided into five clusters. From each cluster, we selected ten vectors as representative anchors for our KNN model.

Figure 6: Visualization of 700 last hidden vectors from the MATH dataset using the t-SNE method. It is clear to see that all vectors can be categorized into 7 groups, which aligns with the fact that the MATH dataset has 7 different kinds of problems.

Methods        M      α      Speedup
Vanilla        1.00   -      1.00×
Self-SD(Fix)   1.54   0.51   0.97×
Self-SD(Mix)   1.82   0.59   1.02×
KNN-SSD        2.37   0.81   1.23×

Table 3: Results on the MATH dataset using LLaMA-2-13B.

Figure 7: Speedup results for task-by-task sample streams on the MATH dataset under three methods. While KNN-SSD maintains a speedup of around 1.25×, the two Self-SD methods decline as the number of domains grows.

Figure 7 visualizes speedups in task-by-task settings. Self-SD(Fix) achieves high performance only on the first subtask and declines on the rest of the subtasks, while KNN-SSD has a better speedup than the other two Self-SD methods.

Out-of-Domain Generalization. We adopt the XSUM (Narayan et al., 2018) dataset, the MATH (Hendrycks et al., 2021) dataset, and the Alpaca (Taori et al., 2023) dataset as out-of-domain tasks to assess KNN-SSD's generalizability. The XSUM and CNN/DM datasets belong to summarization tasks, whereas the MATH and GSM8K datasets involve reasoning-based tasks. Therefore, although we did not search for respective optimal skip layer sets for XSUM and MATH in our experiments, it is reasonable that KNN-SSD would assign XSUM samples to CNN/DM and thus adopt CNN/DM's optimal skip layer set, and the same applies to MATH samples. Compared to these two datasets, the Alpaca dataset contains more diverse instruction-answer pairs across summarization, reasoning, grammar, and many other tasks. Results indicate that although some of these domains are not covered by our five datasets in the main experiments, our method can still assign an unknown sample to its most similar domain and thus achieve inference acceleration. As shown in Table 4, our experimental results demonstrate that the model achieves an approximately 1.15×∼1.25× speedup under the KNN-SSD method, even without prior search.

Number of Clusters. Table 5 shows the influence of the number of clusters. We conducted experiments on the Alpaca dataset, as it covers a variety of domains, using K-means clustering with varying numbers of clusters. As shown in the results, the speedup improves as the number of clusters increases, eventually surpassing the speedup ratio observed in the out-of-domain experiments (Table 4). However, when the cluster count exceeds 5 (e.g., up to 7), the speedup plateaus, indicating that partitioning Alpaca into five clusters is sufficient; further subdivision yields no additional gains.
Datasets   Methods    M      α      Speedup
XSUM       Vanilla    1.00   -      1.00×
           Self-SD    1.42   0.56   0.99×
           KNN-SSD    2.51   0.84   1.24×
MATH       Vanilla    1.00   -      1.00×
           Self-SD    1.34   0.48   0.93×
           KNN-SSD    2.13   0.76   1.17×
Alpaca     Vanilla    1.00   -      1.00×
           Self-SD    1.26   0.43   0.92×
           KNN-SSD    1.95   0.67   1.15×

Table 4: Results on out-of-domain datasets using LLaMA-2-13B-Chat. No representative anchor of these three domains is generated.

Num.   M      α      Speedup
1      1.86   0.65   1.05×
3      2.20   0.75   1.17×
5      2.52   0.80   1.23×
7      2.55   0.81   1.23×

Table 5: Results on the Alpaca dataset with different numbers of clusters using LLaMA-2-13B-Chat. Num. denotes the number of clusters.

Case Study. To better illustrate how our method works, we provide a case study that presents a typical sample stream. In Figure 8, a sample stream contains three common types of queries a user might ask: summarization, reasoning, and translation. For each input query, KNN-SSD first computes its last hidden vector and then uses a KNN model to find its optimal skipped layer set. The typical speculative decoding is then conducted with drafting and verification steps, where blue and red tokens indicate that they are generated in the drafting and verification steps, respectively. By constantly changing skipped layer sets, KNN-SSD achieves a stable speedup compared to other methods that use a static strategy, which is insufficient for diverse inputs.

Figure 8: Case study of how KNN-SSD works. Blue tokens indicate that they are generated during the drafting step and verified by the model, while red tokens indicate they are generated by prediction from the verification step. Squares in red and blue indicate skipped attention layers and MLP layers, respectively.

6 Conclusion

In this work, we introduce KNN-SSD, an algorithm that leverages K-Nearest Neighbor search to match suitable skipped layers with various domain inputs. KNN-SSD is designed to find an optimal skipped layer set for each domain of data, which accelerates the LLM's inference losslessly. To assess its ability, we define the mix ratio of a sample stream, indicating how frequently the domain changes. We conducted extensive experiments with various LLMs and mix ratios and found that KNN-SSD can achieve a speedup of around 1.3×∼1.6× without changing the original distribution of the generated tokens. Our in-depth analysis indicates that a single dataset may also contain mixed domains. Furthermore, KNN-SSD can achieve a 1.2× speedup on out-of-domain datasets, showing its great potential in handling various data streams in real-life scenarios.
Limitations

A few limitations need to be considered, while our KNN-SSD achieves a notable speedup on various models. First, we did not incorporate draft tree verification, which has been shown to improve the token acceptance rate (Xia et al., 2025). Second, our current evaluation is limited to models of moderate scale. Due to practical considerations related to computational resources, we have not yet extended our method to larger-scale models. We leave these directions for future work.

Ethics Statement

The datasets used in our experiment are publicly released and labeled through interaction with humans in English. In this process, user privacy is protected, and no personal information is contained in the dataset. The scientific artifacts that we used are available for research with permissive licenses. And the use of these artifacts in this paper is consistent with their intended use. Therefore, we believe that our research work meets the ethics of ACL.

References

Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, and William Brandon. 2024. Hydra: Sequentially-dependent draft heads for Medusa decoding. Preprint, arXiv:2402.05109.

Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. 2024. SliceGPT: Compress large language models by deleting rows and columns. Preprint, arXiv:2401.15024.

Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Schönfeld, Ali Thabet, and Jonas Kohler. 2025. Judge decoding: Faster speculative sampling requires going beyond model alignment. Preprint, arXiv:2501.19309.

Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, and Mahyar Najibi. 2024. Speculative streaming: Fast LLM inference without auxiliary models. Preprint, arXiv:2402.11131.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation, pages 131–198, Berlin, Germany. Association for Computational Linguistics.

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. 2024. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. Preprint, arXiv:2401.10774.

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. Accelerating large language model decoding with speculative sampling. Preprint, arXiv:2302.01318.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. CoRR, abs/2110.14168.

Ronen Eldan and Yuanzhi Li. 2023. TinyStories: How small can language models be and still speak coherent English? Preprint, arXiv:2305.07759.

Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed Aly, Beidi Chen, and Carole-Jean Wu. 2024. LayerSkip: Enabling early exit inference and self-speculative decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12622–12642, Bangkok, Thailand. Association for Computational Linguistics.

Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive language models can be accurately pruned in one-shot. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 10323–10337. PMLR.

Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. 2024. Break the sequential dependency of LLM inference using lookahead decoding. In Forty-first International Conference on Machine Learning.

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. 2024. MiniLLM: Knowledge distillation of large language models. Preprint, arXiv:2306.08543.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the MATH dataset. Preprint, arXiv:2103.03874.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
Yunhai Hu, Zining Liu, Zhenyuan Dong, Tianfan Peng, Bradley McDanel, and Sai Qian Zhang. 2025. Speculative decoding and beyond: An in-depth survey of techniques. Preprint, arXiv:2502.19732.

Yukun Huang, Yanda Chen, Zhou Yu, and Kathleen McKeown. 2022. In-context learning distillation: Transferring few-shot learning ability of pre-trained language models. Preprint, arXiv:2212.10670.

Donald R Jones, Matthias Schonlau, and William J Welch. 1998. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492.

Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W Mahoney, Amir Gholami, and Kurt Keutzer. 2023. Speculative decoding with big little decoder. In Advances in Neural Information Processing Systems, volume 36, pages 39236–39256. Curran Associates, Inc.

Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, and Tao Yu. 2025. Spider 2.0: Evaluating language models on real-world enterprise text-to-SQL workflows. Preprint, arXiv:2411.07763.

Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 19274–19286. PMLR.

Yijin Liu, Fandong Meng, and Jie Zhou. 2024. Accelerating inference in large language models with a unified layer skipping strategy. Preprint, arXiv:2404.06954.

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. 2023. LLM-QAT: Data-free quantization aware training for large language models. Preprint, arXiv:2305.17888.

Michael R. Metel, Peng Lu, Boxing Chen, Mehdi Rezagholizadeh, and Ivan Kobyzev. 2024. Draft on the fly: Adaptive self-speculative decoding using cosine similarity. Preprint, arXiv:2410.01028.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, et al. 2024. GPT-4 technical report. Preprint, arXiv:2303.08774.

Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, and Dongsoo Lee. 2024. LUT-GEMM: Quantized matrix multiplication based on LUTs for efficient inference in large-scale generative language models. Preprint, arXiv:2206.09557.

Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. 2018. Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.

Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. 2024. A simple and effective pruning approach for large language models. Preprint, arXiv:2306.11695.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://[Link]/tatsu-lab/stanford_alpaca.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.

Zhuofan Wen, Shangtong Gui, and Yang Feng. 2024. Speculative decoding with CTC-based draft model for LLM inference acceleration. In Advances in Neural Information Processing Systems, volume 37, pages 92082–92100. Curran Associates, Inc.

Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, and Alham Fikri Aji. 2024. LaMini-LM: A diverse herd of distilled models from large-scale instructions. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 944–964, St. Julian's, Malta. Association for Computational Linguistics.

Heming Xia, Tao Ge, Peiyi Wang, Si-Qing Chen, Furu Wei, and Zhifang Sui. 2023. Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3909–3925, Singapore. Association for Computational Linguistics.

Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, and Wenjie Li. 2025. SWIFT: On-the-fly self-speculative decoding for LLM inference acceleration. Preprint, arXiv:2410.06916.
Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. 2024. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7655–7671, Bangkok, Thailand. Association for Computational Linguistics.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, et al. 2025. Qwen2.5 technical report. Preprint, arXiv:2412.15115.

Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. 2022. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. In Advances in Neural Information Processing Systems, volume 35, pages 27168–27183. Curran Associates, Inc.

Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, and Sharad Mehrotra. 2024. Draft & verify: Lossless large language model acceleration via self-speculative decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11263–11282, Bangkok, Thailand. Association for Computational Linguistics.

Lefan Zhang, Xiaodan Wang, Yanhua Huang, and Ruiwen Xu. 2025. Learning harmonized representations for speculative sampling. Preprint, arXiv:2408.15766.
A Preliminary Details

We visualize the optimal skipped layer sets we searched across five tasks on two series of models in Figure 9 and Figure 10.

B Datasets

We mainly evaluate KNN-SSD on the LLaMA-2 (Touvron et al., 2023) series and the Qwen-2.5 (Yang et al., 2025) series across diverse tasks. We select five different datasets, covering summarization, mathematical reasoning, translation, storytelling, and text-to-SQL, which are the CNN/Daily Mail (CNN/DM) (Nallapati et al., 2016), GSM8K (Cobbe et al., 2021), TinyStories (Eldan and Li, 2023), Wmt16 DE-EN (Wmt16) (Bojar et al., 2016), and Spider2 (Lei et al., 2025) datasets, respectively. The maximum generation lengths on CNN/DM, GSM8K, Wmt16, Spider2, and TinyStories are set to 64, 64, 64, 64, and 128, respectively. We conduct 1-shot evaluation for CNN/DM and TinyStories, 3-shot evaluation for Spider2, and 5-shot evaluation for GSM8K and Wmt16. For further analysis, we also use the XSUM (Narayan et al., 2018) dataset, the MATH (Hendrycks et al., 2021) dataset, and the Alpaca (Taori et al., 2023) dataset for the summarization, mathematical reasoning, and instruction following tasks, respectively.

CNN/DM. The CNN/Daily Mail dataset is a large-scale benchmark for abstractive text summarization. It consists of long news articles paired with short summaries, derived from the CNN and Daily Mail websites. The dataset is used to evaluate performance on long-form input and coherent summary generation.

GSM8K. GSM8K is a high-quality benchmark dataset for arithmetic reasoning, consisting of grade school math word problems and their detailed step-by-step solutions. It is used to evaluate reasoning and problem-solving capabilities in mathematical contexts.

TinyStories. TinyStories is a dataset of short, synthetically generated children's stories designed to support research on language modeling and narrative understanding. The stories are simple in structure and vocabulary, making the dataset suitable for studying controlled text generation.

Wmt16. The WMT16 De-En dataset is a standard benchmark for machine translation, consisting of parallel German-English sentence pairs collected from various sources. It is used to evaluate the translation quality of models.

Spider2. Spider 2.0 is a complex and cross-domain text-to-SQL benchmark designed to evaluate the ability of models to generate executable SQL queries from natural language questions. It includes diverse databases and query types, requiring models to generalize to unseen schemas and handle intricate reasoning.

XSUM. XSUM is an abstractive summarization dataset consisting of BBC news articles paired with single-sentence summaries, in contrast to CNN/DM, which provides longer, multi-sentence summaries for news articles. It emphasizes concise and information-rich summaries, testing the models' ability to extract key information.

MATH. The MATH dataset is a benchmark for mathematical problem solving, comprising high school-level competition problems with detailed step-by-step solutions. It covers a wide range of topics, including algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, and precalculus, and is designed to evaluate the advanced reasoning and symbolic manipulation abilities of language models.

Alpaca. The Alpaca dataset is a collection of instruction-following demonstrations generated using the self-instruct method, based on the outputs of a strong language model. It covers a wide range of tasks, making it suitable for testing the generalizability of KNN-SSD.

C Experimental Details

C.1 Setups

During the pre-inference stage, we set the maximum number of Bayesian Optimization iterations to 1,000 and the number of samples to 8. For each dataset, we first randomly choose 1,000 last hidden vectors, then we use the K-means algorithm to find 10 representatives as anchors for the KNN model.

In the inference process, experiments were conducted on 8×NVIDIA RTX 3090 GPUs (24GB) and 4×NVIDIA RTX A6000 GPUs (40GB) with CUDA 12.0, and an Intel(R) Xeon(R) Gold 5117 CPU with 14 cores. PyTorch and the Huggingface transformers package are used to run both the baselines and our method.
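As a small illustration of how the last hidden vectors used throughout the pipeline can be obtained with the Huggingface transformers package mentioned above, the sketch below pools the final position of the last hidden layer. The model name and the pooling choice are illustrative assumptions, not details taken from the paper.

```python
# Sketch: extract a sample's last hidden vector with Huggingface transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"              # illustrative; any causal LM works
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()

@torch.no_grad()
def last_hidden_vector(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt").to(device)
    out = model(**inputs, output_hidden_states=True)
    # hidden_states[-1] has shape (batch, seq_len, hidden_dim); keep the final position.
    return out.hidden_states[-1][0, -1, :].float().cpu()
```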
Figure 9: Visualization of the skipped layer set configuration of LLaMA-2-13B optimized by Self-SD (Zhang et al., 2024) on different task domains: (a) Summarization - CNN/DM, (b) Reasoning - GSM8K, (c) Translation - WMT16, (d) Storytelling - TinyStories, (e) Text-to-SQL - Spider2. Gray squares indicate retained layers, red squares denote skipped attention layers, and blue squares signify skipped MLP layers.

C.2 Evaluation Metrics

We further describe the two main metrics used in the main experiments. The mean accepted length M denotes the average number of output tokens produced by the target LLM during each forward pass. The token acceptance rate α refers to the ratio of tokens that are accepted by the target LLM to the total number of draft steps, which reflects the expectation of whether the target LLM accepts a token generated by the draft model. Given M and α, the expected wall-time speedup can be derived as follows:

    E(Speedup) = Mα / ((M − 1)c + α),    (10)

where c is defined as the cost coefficient in Leviathan et al. (2023). It represents the ratio of the draft model's required time to the target model's during a single forward pass. In the Self-SD method, we define c = 1 − r, where r represents the proportion of skipped layers to total layers, as the draft model only needs to process the retained layers.
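Eq. (10) can be transcribed directly; the snippet below computes the expected speedup from M, α, and the proportion of skipped layers r (so c = 1 − r). The example values of M and α are taken from Table 2, while the skip ratio of 0.5 is an arbitrary illustrative assumption.

```python
# Direct transcription of Eq. (10): expected wall-time speedup with c = 1 - r.
def expected_speedup(mean_accepted: float, acceptance_rate: float, skip_ratio: float) -> float:
    M, alpha, c = mean_accepted, acceptance_rate, 1.0 - skip_ratio
    return (M * alpha) / ((M - 1) * c + alpha)

# E.g., the LLaMA-2-13B KNN-SSD row of Table 2 (M = 3.12, alpha = 0.88) with an
# assumed skip ratio of 0.5 gives roughly a 1.4x expected speedup.
print(round(expected_speedup(3.12, 0.88, 0.5), 2))
```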

C.3 Details of Main Results

More details are provided in Table 6. The results show that KNN-SSD outperforms the two Self-SD methods on both metrics, indicating that our method can handle a more diverse input stream with stable inference acceleration.
Figure 10: Visualization of the skipped layer set configuration of Qwen-2.5-14B optimized by Self-SD (Zhang et al., 2024) on different task domains: (a) Summarization - CNN/DM, (b) Reasoning - GSM8K, (c) Translation - WMT16, (d) Storytelling - TinyStories, (e) Text-to-SQL - Spider2.

                                       R=0.0         R=0.3         R=0.7         R=1.0         Overall
Models                  Methods        M      α      M      α      M      α      M      α      M      α
LLaMA-2-13B             Vanilla        1.00   -      1.00   -      1.00   -      1.00   -      1.00   -
                        Self-SD(Fix)   2.22   0.65   2.19   0.63   2.14   0.59   2.12   0.61   2.17   0.62
                        Self-SD(Mix)   2.50   0.64   2.58   0.70   2.53   0.69   2.52   0.68   2.53   0.68
                        KNN-SSD        3.10   0.86   3.14   0.88   3.11   0.89   3.12   0.88   3.12   0.88
LLaMA-2-13B-Chat        Vanilla        1.00   -      1.00   -      1.00   -      1.00   -      1.00   -
                        Self-SD(Fix)   2.03   0.60   1.99   0.56   1.92   0.55   1.97   0.57   1.97   0.57
                        Self-SD(Mix)   2.10   0.56   2.14   0.61   2.18   0.59   2.15   0.58   2.14   0.59
                        KNN-SSD        2.84   0.84   2.85   0.86   2.90   0.85   2.90   0.86   2.87   0.85
Qwen-2.5-14B            Vanilla        1.00   -      1.00   -      1.00   -      1.00   -      1.00   -
                        Self-SD(Fix)   2.41   0.82   2.40   0.82   2.44   0.84   2.48   0.85   2.43   0.83
                        Self-SD(Mix)   3.02   0.89   2.94   0.89   2.99   0.90   2.97   0.90   2.98   0.90
                        KNN-SSD        4.35   0.99   4.42   1.00   4.40   1.00   4.38   0.99   4.37   1.00
Qwen-2.5-14B-Instruct   Vanilla        1.00   -      1.00   -      1.00   -      1.00   -      1.00   -
                        Self-SD(Fix)   2.12   0.80   2.16   0.80   2.16   0.80   2.10   0.79   2.13   0.80
                        Self-SD(Mix)   2.32   0.83   2.25   0.84   2.35   0.87   2.34   0.87   2.32   0.85
                        KNN-SSD        3.78   1.00   3.69   0.99   3.71   0.99   3.75   1.00   3.73   1.00

Table 6: Comparison between KNN-SSD and two Self-SD methods. R indicates the mix ratio of sample streams. We report the mean accepted length and token acceptance rate, denoted as M and α, respectively.
