KNN-SSD: Enabling Dynamic Self-Speculative Decoding via Nearest Neighbor Layer Set Optimization
Abstract
Speculative Decoding (SD) has emerged as a promising technique to accelerate the inference of large language models (LLMs) without compromising generation quality. It works by efficiently drafting multiple tokens using a compact model and then verifying them in parallel using the target LLM. Notably, Self-Speculative Decoding proposes skipping certain layers to construct the draft model, which eliminates the need for additional parameters or training. Despite its strengths, we observe in this work that drafting with layer skipping exhibits significant sensitivity to domain shifts, leading to a substantial drop in acceleration performance. To enhance the domain generalizability of this paradigm, we introduce KNN-SSD, an algorithm that leverages K-Nearest Neighbor (KNN) search to match different skipped layers with various domain inputs. We evaluated our algorithm on various models and multiple tasks, observing that its application leads to a 1.3×∼1.6× speedup in LLM inference.¹

¹ Code available at [Link].

Figure 1: Average speedup results under task-by-task sample streams. The dashed line represents the average speedup ratio achieved by KNN-SSD. Results indicate that our KNN-SSD can achieve a stable speedup while Self-SD methods' speedups decline, as they are sensitive to domain shifts.

1 Introduction

Large language models (LLMs) have proven highly capable in handling various downstream tasks (Touvron et al., 2023; OpenAI et al., 2024; Yang et al., 2025). However, the token-by-token generation in autoregressive decoding results in quadratic computational complexity, which presents significant efficiency challenges, particularly as model size increases. To address this challenge, speculative decoding (SD) has been proposed as a promising solution for lossless acceleration of LLM inference (Xia et al., 2023; Leviathan et al., 2023; Chen et al., 2023). At each decoding step, SD uses a lightweight draft model to efficiently predict multiple tokens, which are then verified in parallel by the target LLM to preserve the original output distribution. The effectiveness of SD hinges on the trade-off between drafting latency and speculation accuracy (Xia et al., 2024; Hu et al., 2025). During inference, SD aims to both minimize latency and maximize accuracy to improve efficiency while maintaining output quality.

Recent advancements in SD have significantly expanded the boundaries of the latency-accuracy trade-off by employing diverse techniques, such as integrating lightweight draft models into LLMs (Ankner et al., 2024; Zhang et al., 2025) or aligning a small model with a larger one (Kim et al., 2023; Bachmann et al., 2025) for speculative generation. However, these approaches inevitably require additional models, which increase the total number of parameters and introduce additional training complexity. Addressing this concern, Self-SD (Zhang et al., 2024) has been proposed to selectively skip certain layers within the large model itself to construct a compact draft model.

In this work, we find that the selection of skipped layers is not universal. Instead, one
skip-layer configuration could be sensitive to domain shifts. For example, when applying a configuration derived from the summarization task to other tasks, as shown in Figure 1, we observe a significant reduction in speedup from 1.35× to less than 1.10×, highlighting the need for domain-specific adaptation. To tackle this issue, we propose KNN-SSD, a method for dynamically adjusting skip-layer configurations based on domain representations. The key goal of KNN-SSD is to optimize skipped layers specific to each domain, simulate realistic input scenarios, and accurately identify the domain of each sample. To achieve this goal, KNN-SSD integrates three main features: (1) a skipped layer set optimization process for the specific domain of samples, (2) an input sample stream designed to better simulate real-life user inputs, and (3) a KNN model that uses the LLM's last hidden representations to identify the domain of input samples.

Experiments are conducted using the LLaMA-2 series (Touvron et al., 2023) and the Qwen-2.5 series (Yang et al., 2025) across various tasks, including summarization, reasoning, translation, storytelling, and text-to-SQL. KNN-SSD achieves a 1.3×∼1.6× speedup compared to autoregressive decoding. This approach maintains an over 80% token acceptance rate across the LLaMA-2 series and an over 99% token acceptance rate across the Qwen-2.5 series, indicating high alignment potential between the draft model and the target LLM. Further analysis validates the effectiveness of KNN-SSD on out-of-domain sample inputs and on a dataset that contains various types of samples.

To summarize, our key contributions are:

1. We introduce KNN-SSD, a self-speculative decoding algorithm with fine-grained skipped layer set selection, which adopts k-nearest neighbor search to retrieve a suitable skipped layer set for each input sample;

2. To evaluate our method, we design a dynamic input data stream that contains samples from diverse domains, and KNN-SSD achieves a 1.3×∼1.6× speedup across different models without changing the generated tokens' distribution.

2 Related Work

Speculative Decoding (SD). Speculative Decoding (SD) aims to accelerate autoregressive text generation in LLMs without compromising output quality (Xia et al., 2023; Leviathan et al., 2023). It reduces decoding latency by predicting multiple future tokens using a draft model or internal mechanisms, followed by verification and correction by the target LLM. Existing strategies include aligning small draft models with large models (Xia et al., 2023; Kim et al., 2023; Bachmann et al., 2025) or predicting k tokens in parallel (Cai et al., 2024; Wen et al., 2024). In another line of work, plug-and-play methods have been examined, with examples including appending pseudo tokens (Fu et al., 2024) and skipping layers dynamically (Metel et al., 2024; Xia et al., 2025) during inference. Despite efficiency improvements, these methods often rely on auxiliary models or sub-optimal choices, hindering scalability and effectiveness. The most related methods to our work are Self-SD (Zhang et al., 2024) and LayerSkip (Elhoushi et al., 2024), which also construct draft models by skipping intermediate LLM layers. However, both approaches are trained on a single data type and struggle with diverse data streams. Our work aims to tackle this problem by integrating samples from various domains.

Sparsity and Model Compression. Sparsity and model compression are essential for enhancing the efficiency of LLMs by reducing active parameters or computations during inference (Hu et al., 2022). Common approaches include parameter pruning (Frantar and Alistarh, 2023; Ashkboos et al., 2024; Sun et al., 2024), knowledge distillation (Huang et al., 2022; Gu et al., 2024; Wu et al., 2024), and quantization (Yao et al., 2022; Liu et al., 2023; Park et al., 2024), which compress models while preserving performance. Structured sparsity methods, such as layer skipping (Liu et al., 2024; Bhendawade et al., 2024; Xia et al., 2025) and dynamic sparsification, further enhance efficiency by adapting computation to input characteristics. While these works aim to optimize computational workloads, they may sacrifice performance by using sub-optimal choices because of insufficient search in the layer space. In contrast, our KNN-SSD method can always find optimal choices to accelerate LLM inference losslessly.

3 Background

3.1 Self-Speculative Decoding

Unlike traditional SD methods that require an auxiliary draft model, Self-Speculative Decoding (Self-SD) leverages the LLM's internal structure
(Figure: speedup achieved by each task-specific skipped layer set, labeled Sum SL, Re SL, Tran SL, St SL, and SQL SL.)

which dynamically selects the most suitable skip-layer configuration based on task characteristics, ensuring robust and efficient speculative decoding.
Figure 3: Layer skipping and KNN process in KNN-SSD. Before LLM-related generation, KNN-SSD first performs (a) Layer Set Searching Optimization: for each task, KNN-SSD generates a task-specific skip layer set and stores it in a configuration file; and (b) Anchor Representative Generation: KNN-SSD then produces last hidden vectors for each task to fit a KNN model. When a new sample is input, KNN-SSD first uses its last hidden vector as the input representative and queries the KNN model. Based on the retrieved result, it selects the corresponding skip layer set to perform decoding, thereby achieving acceleration.
the j-th layer should be skipped (z_i^{(j)} = 1) or retained (z_i^{(j)} = 0) during inference. To identify the optimal configuration z_i, we employ Bayesian Optimization (Jones et al., 1998) over the space of binary layer masks, aiming to minimize an objective black-box function f(·) that measures the average inference time per verified token:

    z_i^* = argmin_{z_i} f(M(z_i) | a_{i1}, ..., a_{ik}),    (3)

where a_{i1}, ..., a_{ik} denote the k representative samples of domain i. All z_1^*, ..., z_n^* will be stored for future use and will not be changed.
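The skip-layer search in Eq. (3) can be sketched with the BayesianOptimization package referenced in Section 5.1. The sketch below is illustrative only: the black-box objective measure_latency_per_token is a hypothetical placeholder to be implemented around the Self-SD drafting and verification loop, the 80 sub-layer count mirrors the ATT/MLP rows shown in Figure 9 for LLaMA-2-13B, and thresholding the continuous search variables at 0.5 to obtain a binary mask is our own simplification.

```python
# Hedged sketch of Eq. (3): search a binary skip mask z_i with Bayesian
# Optimization (bayes_opt). All names below are illustrative assumptions.
from bayes_opt import BayesianOptimization

NUM_SUBLAYERS = 80  # e.g., 40 attention + 40 MLP sub-layers (cf. Figure 9)


def measure_latency_per_token(mask, samples):
    """Hypothetical black box f(M(z_i) | a_i1..a_ik): average inference
    time per verified token when drafting with `mask`; lower is better."""
    raise NotImplementedError("hook this up to the Self-SD draft/verify loop")


def make_objective(samples):
    def objective(**z):
        # bayes_opt searches a continuous box, so threshold to a binary mask.
        mask = [int(z[f"z{j}"] > 0.5) for j in range(NUM_SUBLAYERS)]
        # bayes_opt maximizes, so return negative latency to minimize it.
        return -measure_latency_per_token(mask, samples)
    return objective


def search_skip_layer_set(samples, n_iter=1000, init_points=10, seed=0):
    pbounds = {f"z{j}": (0.0, 1.0) for j in range(NUM_SUBLAYERS)}
    opt = BayesianOptimization(f=make_objective(samples),
                               pbounds=pbounds, random_state=seed)
    opt.maximize(init_points=init_points, n_iter=n_iter)
    best = opt.max["params"]
    return [int(best[f"z{j}"] > 0.5) for j in range(NUM_SUBLAYERS)]
```

The resulting per-domain masks z_1^*, ..., z_n^* can then be written to a configuration file, as described in the Figure 3 caption.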
4.2 Inference

For a newly arrived sample s, we first extract its last hidden vector v from the model. We then perform a KNN search based on cosine similarity between the hidden vector of s and all representative anchors. This process yields a corresponding domain label, effectively classifying the sample into one of the known domains i*. Based on the identified domain, we apply its associated optimal skip-layer configuration z_{i*}^* to M to accelerate inference:

    (i*, j*) = argmax_{i,j} (v · a_{ij}) / (‖v‖ ‖a_{ij}‖),    (4)
    Domain(s) = i*,    (5)
    M ← z_{i*}^*.    (6)

We then perform the standard Self-SD process (Zhang et al., 2024), which involves two stages: drafting and verification. During the drafting stage, the LLM uses the previously selected skip-layer configuration z_{i*}^* as a draft model M(z_{i*}^*) to generate a sequence of draft tokens:

    y' = argmax_y log P(y | x, y; M(z_{i*}^*)),    (7)

where x and y denote the input and the output generated by the LLM, respectively, and y' represents the token produced by the autoregressive process. In the verification stage, the full LLM verifies the draft tokens in a single forward pass. This step validates the correctness of the generated tokens and either accepts them or triggers re-drafting if discrepancies are found.
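A minimal NumPy sketch of the routing step in Eqs. (4)-(6) is given below. The data layout is assumed: anchors[i] holds the k representative last hidden vectors of domain i, and skip_sets[i] holds the mask z_i^* found offline via Eq. (3); the drafting and verification steps themselves are not shown.

```python
# Hedged sketch of Eqs. (4)-(6): pick the domain whose anchor is most
# cosine-similar to the query vector, then load that domain's skip mask.
import numpy as np


def route_to_domain(v, anchors):
    """Return i* = argmax_{i,j} cos(v, a_ij) over all anchors a_ij."""
    v = v / np.linalg.norm(v)
    best_domain, best_sim = -1, -np.inf
    for i, domain_anchors in enumerate(anchors):   # domain_anchors: (k, d)
        a = domain_anchors / np.linalg.norm(domain_anchors, axis=1, keepdims=True)
        sim = float(np.max(a @ v))                  # best of the k anchors
        if sim > best_sim:
            best_domain, best_sim = i, sim
    return best_domain


def select_skip_set(v, anchors, skip_sets):
    """Eq. (6): M <- z*_{i*}, i.e., return the mask of the matched domain."""
    return skip_sets[route_to_domain(v, anchors)]
```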
To better simulate real-world task streams, we introduce the mix ratio r, which denotes the probability that the next input sample belongs to a different task than the current one. A mix ratio of 0 corresponds to a task-by-task input stream, where all consecutive samples come from the same task. In contrast, a mix ratio of 1 indicates maximum task mixing, where every two consecutive samples are from different tasks. As the mix ratio grows, the frequency of domain shift increases:

    P(s_{i+1} ∈ D_j | s_i ∈ D_k) = r / (N − 1)  if j ≠ k,  and  1 − r  if j = k,    (8)

where N denotes the number of task domains.
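The sampling process behind Eq. (8) can be sketched as follows; the task pools and the stream length are placeholders, not the paper's exact data loader.

```python
# Hedged sketch of the mixed sample stream in Eq. (8): with probability r the
# next sample comes from a uniformly chosen *different* task, otherwise the
# task stays the same. `pools` maps a task name to a list of its samples.
import random


def mixed_stream(pools, r, length, seed=0):
    rng = random.Random(seed)
    tasks = list(pools)
    current = rng.choice(tasks)
    for _ in range(length):
        yield current, rng.choice(pools[current])
        if rng.random() < r:   # switch to one of the other N - 1 tasks
            current = rng.choice([t for t in tasks if t != current])


# Example (hypothetical pools): a 200-sample stream with mix ratio 0.3.
# stream = list(mixed_stream({"sum": sum_data, "math": math_data}, 0.3, 200))
```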
Models                  Methods        R=0.0   R=0.3   R=0.7   R=1.0   Speed (token/s)   Overall E(Spd.)
LLaMA-2-13B             Vanilla        1.00×   1.00×   1.00×   1.00×   13.62             1.00×
                        Self-SD(Fix)   1.24×   1.21×   1.19×   1.17×   16.34             1.20×
                        Self-SD(Mix)   1.23×   1.27×   1.24×   1.23×   16.88             1.24×
                        KNN-SSD        1.42×   1.45×   1.43×   1.45×   19.61             1.44×
LLaMA-2-13B-Chat        Vanilla        1.00×   1.00×   1.00×   1.00×   13.22             1.00×
                        Self-SD(Fix)   1.13×   1.14×   1.08×   1.10×   14.67             1.11×
                        Self-SD(Mix)   1.13×   1.17×   1.17×   1.16×   15.33             1.16×
                        KNN-SSD        1.33×   1.36×   1.36×   1.37×   17.85             1.35×
Qwen-2.5-14B            Vanilla        1.00×   1.00×   1.00×   1.00×   11.16             1.00×
                        Self-SD(Fix)   1.25×   1.23×   1.27×   1.28×   14.06             1.26×
                        Self-SD(Mix)   1.40×   1.36×   1.39×   1.38×   15.40             1.38×
                        KNN-SSD        1.60×   1.64×   1.63×   1.61×   18.08             1.62×
Qwen-2.5-14B-Instruct   Vanilla        1.00×   1.00×   1.00×   1.00×   10.79             1.00×
                        Self-SD(Fix)   1.18×   1.20×   1.20×   1.17×   12.84             1.19×
                        Self-SD(Mix)   1.26×   1.24×   1.27×   1.25×   13.49             1.25×
                        KNN-SSD        1.52×   1.49×   1.50×   1.52×   16.30             1.51×

Table 1: Comparison between KNN-SSD and two Self-SD methods. R indicates the mix ratio of sample streams. We report the expected speedup ratio E(Spd.) under different mix ratios, the average decoding speed (token/s) under greedy decoding, and the average speedup ratio across mix ratios. More details are provided in Appendix C.3.
5 Experiments

5.1 Experimental Setup

Implementation Details. We mainly evaluate KNN-SSD on the LLaMA-2 series (Touvron et al., 2023) and Qwen-2.5 series (Yang et al., 2025) across various tasks, including summarization, mathematical reasoning, storytelling, translation, and text-to-SQL. The evaluation datasets include CNN/Daily Mail (CNN/DM) (Nallapati et al., 2016), GSM8K (Cobbe et al., 2021), TinyStories (Eldan and Li, 2023), Wmt16 DE-EN (Wmt16) (Bojar et al., 2016), and Spider2 (Lei et al., 2025). For each dataset, we used Bayesian optimization² (BO) to perform 1,000 iterations on 8 representative samples in search of the optimal skip-layer configuration. The representative samples are selected via the K-means algorithm from all last hidden vectors generated by the LLM on the corresponding dataset, ensuring optimal coverage of the feature space. The maximum generation lengths on CNN/DM, GSM8K, Wmt16, Spider2, and TinyStories are set to 64, 64, 64, 64, and 128, respectively. We conduct 1-shot evaluation for CNN/DM and TinyStories, 3-shot evaluation for Spider2, and 5-shot evaluation for GSM8K and Wmt16. For each dataset, we extracted the most representative k = 10 hidden vectors from the last hidden layer across all data samples using cosine similarity to serve as anchor points for the KNN model, following the same approach as introduced earlier in the BO framework. For each new input sample, we also compute the cosine similarity between its last hidden vector and the anchors, and assign it to the task of its nearest neighbor.

² [Link] BayesianOptimization
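For reference, the last hidden vector used both for the anchors and for routing new samples can be obtained as sketched below. The checkpoint name and the choice of the final token's last-layer state are assumptions consistent with the description above, not an excerpt from the released code.

```python
# Hedged sketch: extract a sample's last hidden vector with Hugging Face
# Transformers (last layer, final token). Names and choices are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def last_hidden_vector(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[-1] has shape (1, seq_len, hidden); keep the final token.
    return out.hidden_states[-1][0, -1].float().cpu().numpy()


# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-2-13b-hf", torch_dtype=torch.float16, device_map="auto")
# v = last_hidden_vector(model, tok, "Summarize the following article: ...")
```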
Baselines. In our primary experiments, we compared KNN-SSD and the Self-SD approach (Zhang et al., 2024) to assess their effectiveness. For the Self-SD method, we primarily simulated two scenarios. In the first scenario, a fixed skip-layer configuration was determined based on the first sample in the task stream and remained unchanged throughout the process, which is denoted as Self-SD(Fix). In the second scenario, the skip-layer configuration was adjusted by re-performing BO according to the task distribution within the stream, and the newly searched configuration was subsequently applied for inference and also remained unchanged, which is denoted as Self-SD(Mix).

Evaluation Metrics. We evaluate KNN-SSD using two standard metrics commonly adopted in evaluation: the mean generated length M (Stern et al., 2018) and the token acceptance rate α (Leviathan et al., 2023). Beyond these, we also report the expected decoding throughput in tokens per second, along with the expected wall-time speedup ratio compared to standard autoregressive decoding. Given M and α, the expected speedup can be derived by the formula given by Leviathan et al. (2023):

    E(Spd.) = Mα / ((M − 1)(1 − r) + α),    (9)

where r denotes the ratio of skipped layers.
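As a quick numerical check of Eq. (9), the helper below evaluates the formula; the skip ratio used in the example is an assumed value chosen only to illustrate the arithmetic, paired with the M and α of the KNN-SSD row in Table 3.

```python
# Eq. (9): expected wall-time speedup from mean accepted length M, token
# acceptance rate alpha, and skipped-layer ratio r.
def expected_speedup(M, alpha, r):
    return M * alpha / ((M - 1) * (1 - r) + alpha)


# Illustration only: M = 2.37, alpha = 0.81 (Table 3, KNN-SSD row) with an
# assumed r = 0.45 gives roughly the reported 1.23x speedup.
print(round(expected_speedup(2.37, 0.81, 0.45), 2))  # 1.23
```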
Inter & Intra. We use the MATH (Hendrycks
5.2 Main Result et al., 2021) dataset to assess the capabilities of
Table 1 presents the comparison between KNN-SSD KNN-SSD in a single dataset with multiple domains.
and two Self-SD methods on generation tasks. In In the MATH dataset, math questions are catego-
our experiments, we evaluate KNN-SSD under four rized into seven types. Thus, using one specific skip
settings: mix ratio = 0, 0.3, 0.7, and 1 separately, layer set for this dataset is insufficient, and we intro-
with 40 samples from five datasets each, 200 sam- duce a fine-grained clustering to handle this mixed
ples in total. The experimental results demonstrate domain. Figure 6 shows that each type of math
the following findings: (1) KNN-SSD shows superior question can be clustered into a single group. Ta-
efficiency over prior methods, achieving consistent ble 3 indicates the speedup result for each method,
speedups of 1.35×∼1.62× over vanilla autoregres- where we can clearly see that KNN-SSD outperforms
sive decoding across various models. (2) The mix Self-SD methods and achieves a speedup of 1.23×
6
Figure 5: Visualization of last hidden vectors from five domains of samples. Results show these vectors can be clearly divided into five clusters. From each cluster, we selected ten vectors as representative anchors for our KNN model.

Figure 6: Visualization of 700 last hidden vectors from the MATH dataset using the t-SNE method. It is clear to see that all vectors can be categorized into 7 groups, which aligns with the fact that the MATH dataset has 7 different kinds of problems.
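Projections like those in Figures 5 and 6 can be reproduced with a sketch of the following form; X (the matrix of last hidden vectors) and labels are assumed inputs, and the t-SNE hyperparameters are illustrative defaults rather than the paper's settings.

```python
# Hedged sketch of a 2-D t-SNE projection of last hidden vectors, colored by
# task or MATH problem type. Hyperparameters are illustrative assumptions.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_hidden_vectors(X, labels, out_path="tsne.png"):
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
    plt.figure(figsize=(5, 4))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=8, cmap="tab10")
    plt.axis("off")
    plt.tight_layout()
    plt.savefig(out_path, dpi=200)
```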
Methods        M      α      Speedup
Vanilla        1.00   -      1.00×
Self-SD(Fix)   1.54   0.51   0.97×
Self-SD(Mix)   1.82   0.59   1.02×
KNN-SSD        2.37   0.81   1.23×

Table 3: Results on the MATH dataset using LLaMA-2-13B.

(Figure: per-category speedup on MATH (Algebra, Counting and Probability, Geometry, Intermediate Algebra, Number Theory, Prealgebra, Precalculus) for Baseline, Self-SD(Fix), Self-SD(Mix), and KNN-SSD.)
Datasets   Methods   M      α      Speedup
XSUM       Vanilla   1.00   -      1.00×
           Self-SD   1.42   0.56   0.99×
           KNN-SSD   2.51   0.84   1.24×
MATH       Vanilla   1.00   -      1.00×
           Self-SD   1.34   0.48   0.93×
           KNN-SSD   2.13   0.76   1.17×
Alpaca     Vanilla   1.00   -      1.00×
           Self-SD   1.26   0.43   0.92×
           KNN-SSD   1.95   0.67   1.15×

Table 4: Results on out-of-domain datasets using LLaMA-2-13B-Chat. No representative anchor of these three domains is generated.

When the number of clusters exceeds 5 (e.g., up to 7), the speedup plateaus, indicating that partitioning Alpaca into five clusters is sufficient; further subdivision yields no additional gains.

(Case-study figure: example inputs, the nearest-skip-layer-set search, and the resulting outputs.)
Summarization. Article: "Carlos Tevez talks about being as free as a bird, back to the days when he was banging in the goals in Argentina for Boca Juniors. A big hit on the famous steeped terracing at La Bombonera as he followed in the footsteps of Diego Maradona ......" Summary (after searching for the nearest skip layer set): "Juventus forward Carlos Tevez discusses his enjoyment of playing for the club and his form this season, with six Champions League goals so far. The team is looking to beat Monaco in the quarter-finals and ......"
Reasoning. Question: "Maddison has 5 boxes with 50 marbles in each box. Then she gets 20 marbles from her friend. How many marbles does she have now?" Answer (after searching for the nearest skip layer set): "Here's the solution step-by-step:\n\n1. Maddison has 5 boxes, each with 50 marbles. \n\nSo, in total, Maddison has 5 x 50 = 250 ......"
Translation. Translate German to English: "Es ist für mich wirklich", sagte Spielberg, "unstrittig der größte Zeitreise-Film, der jemals gedreht wurde."
Limitations

A few limitations need to be considered, although our KNN-SSD achieves a notable speedup on various models. First, we did not incorporate draft tree verification, which has been shown to improve the token acceptance rate (Xia et al., 2025). Second, our current evaluation is limited to models of moderate scale. Due to practical considerations related to computational resources, we have not yet extended our method to larger-scale models. We leave these directions for future work.

References

Ondřej Bojar, et al. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation, pages 131–198, Berlin, Germany. Association for Computational Linguistics.

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. 2024. Medusa: Simple llm inference acceleration framework with multiple decoding heads. Preprint, arXiv:2401.10774.

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. Accelerating large language model decoding with speculative sampling. Preprint, arXiv:2302.01318.
Yunhai Hu, Zining Liu, Zhenyuan Dong, Tianfan Peng, Bradley McDanel, and Sai Qian Zhang. 2025. Speculative decoding and beyond: An in-depth survey of techniques. Preprint, arXiv:2502.19732.

Yukun Huang, Yanda Chen, Zhou Yu, and Kathleen McKeown. 2022. In-context learning distillation: Transferring few-shot learning ability of pre-trained language models. Preprint, arXiv:2212.10670.

Donald R Jones, Matthias Schonlau, and William J Welch. 1998. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492.

Sehoon Kim, Karttikeya Mangalam, Suhong Moon, Jitendra Malik, Michael W Mahoney, Amir Gholami, and Kurt Keutzer. 2023. Speculative decoding with big little decoder. In Advances in Neural Information Processing Systems, volume 36, pages 39236–39256. Curran Associates, Inc.

Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, and Tao Yu. 2025. Spider 2.0: Evaluating language models on real-world enterprise text-to-sql workflows. Preprint, arXiv:2411.07763.

Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 19274–19286. PMLR.

Yijin Liu, Fandong Meng, and Jie Zhou. 2024. Accelerating inference in large language models with a unified layer skipping strategy. Preprint, arXiv:2404.06954.

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. 2023. Llm-qat: Data-free quantization aware training for large language models. Preprint, arXiv:2305.17888.

Michael R. Metel, Peng Lu, Boxing Chen, Mehdi Rezagholizadeh, and Ivan Kobyzev. 2024. Draft on the fly: Adaptive self-speculative decoding using cosine similarity. Preprint, arXiv:2410.01028.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, et al. 2024. Gpt-4 technical report. Preprint, arXiv:2303.08774.

Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, and Dongsoo Lee. 2024. Lut-gemm: Quantized matrix multiplication based on luts for efficient inference in large-scale generative language models. Preprint, arXiv:2206.09557.

Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. 2018. Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.

Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. 2024. A simple and effective pruning approach for large language models. Preprint, arXiv:2306.11695.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://[Link]/tatsu-lab/stanford_alpaca.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.

Zhuofan Wen, Shangtong Gui, and Yang Feng. 2024. Speculative decoding with ctc-based draft model for llm inference acceleration. In Advances in Neural Information Processing Systems, volume 37, pages 92082–92100. Curran Associates, Inc.

Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, and Alham Fikri Aji. 2024. LaMini-LM: A diverse herd of distilled models from large-scale instructions. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 944–964, St. Julian's, Malta. Association for Computational Linguistics.

Heming Xia, Tao Ge, Peiyi Wang, Si-Qing Chen, Furu Wei, and Zhifang Sui. 2023. Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3909–3925, Singapore. Association for Computational Linguistics.

Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, and Wenjie Li. 2025. Swift: On-the-fly self-speculative decoding for llm inference acceleration. Preprint, arXiv:2410.06916.
Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. 2024. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7655–7671, Bangkok, Thailand. Association for Computational Linguistics.

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, et al. 2025. Qwen2.5 technical report. Preprint, arXiv:2412.15115.

Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. 2022. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. In Advances in Neural Information Processing Systems, volume 35, pages 27168–27183. Curran Associates, Inc.
A Preliminary Details

We visualize the optimal skipped layer sets we searched across five tasks on two series of models in Figure 9 and Figure 10.

B Datasets

We mainly evaluate KNN-SSD on the LLaMA-2 (Touvron et al., 2023) series and the Qwen-2.5 (Yang et al., 2025) series across diverse tasks. We select five different datasets, covering summarization, mathematical reasoning, translation, storytelling, and text-to-SQL, which are CNN/Daily Mail (CNN/DM) (Nallapati et al., 2016), GSM8K (Cobbe et al., 2021), TinyStories (Eldan and Li, 2023), Wmt16 DE-EN (Wmt16) (Bojar et al., 2016), and Spider2 (Lei et al., 2025), respectively. The maximum generation lengths on CNN/DM, GSM8K, Wmt16, Spider2, and TinyStories are set to 64, 64, 64, 64, and 128, respectively. We conduct 1-shot evaluation for CNN/DM and TinyStories, 3-shot evaluation for Spider2, and 5-shot evaluation for GSM8K and Wmt16. For further analysis, we also use the XSUM (Narayan et al., 2018) dataset, the MATH (Hendrycks et al., 2021) dataset, and the Alpaca (Taori et al., 2023) dataset for the summarization, mathematical reasoning, and instruction-following tasks, respectively.

CNN/DM  The CNN/Daily Mail dataset is a large-scale benchmark for abstractive text summarization. It consists of long news articles paired with short summaries, derived from the CNN and Daily Mail websites. The dataset is used to evaluate performance on long-form input and coherent summary generation.

GSM8K  GSM8K is a high-quality benchmark dataset for arithmetic reasoning, consisting of grade school math word problems and their detailed step-by-step solutions. It is used to evaluate reasoning and problem-solving capabilities in mathematical contexts.

TinyStories  TinyStories is a dataset of short, synthetically generated children's stories designed to support research on language modeling and narrative understanding. The stories are simple in structure and vocabulary, making the dataset suitable for studying controlled text generation.

Wmt16  The WMT16 De-En dataset is a standard benchmark for machine translation, consisting of parallel German-English sentence pairs collected from various sources. It is used to evaluate the translation quality of models.

Spider2  Spider 2.0 is a complex and cross-domain text-to-SQL benchmark designed to evaluate the ability of models to generate executable SQL queries from natural language questions. It includes diverse databases and query types, requiring models to generalize to unseen schemas and handle intricate reasoning.

XSUM  XSUM is an abstractive summarization dataset consisting of BBC news articles paired with single-sentence summaries, in contrast to CNN/DM, which provides longer, multi-sentence summaries for news articles. It emphasizes concise and information-rich summaries, testing the models' ability to extract key information.

MATH  The MATH dataset is a benchmark for mathematical problem solving, comprising high-school-level competition problems with detailed step-by-step solutions. It covers a wide range of topics, including algebra, counting and probability, geometry, intermediate algebra, number theory, prealgebra, and precalculus, and is designed to evaluate the advanced reasoning and symbolic manipulation abilities of language models.

Alpaca  The Alpaca dataset is a collection of instruction-following demonstrations generated using the self-instruct method, based on the outputs of a strong language model. It covers a wide range of tasks, making it suitable for testing the generalizability of KNN-SSD.

C Experimental Details

C.1 Setups

During the pre-inference stage, we set the maximum number of Bayesian Optimization iterations to 1,000 and the number of samples to 8. For each dataset, we first randomly choose 1,000 last hidden vectors, and then use the K-means algorithm to find 10 representatives as anchors for the KNN model.

In the inference process, experiments were conducted on 8×NVIDIA RTX 3090 GPUs (24GB) and 4×NVIDIA RTX A6000 GPUs (40GB) with CUDA 12.0, and an Intel(R) Xeon(R) Gold 5117 CPU with 14 cores. The PyTorch and Hugging Face Transformers packages are used to implement both the baselines and our method.
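The anchor-selection step in C.1 can be sketched as follows; clustering the hidden vectors with K-means is stated above, while taking the real vector closest to each centroid (by cosine similarity) as the stored representative is our assumption about how a cluster is turned into an anchor.

```python
# Hedged sketch of C.1 anchor selection: cluster ~1,000 last hidden vectors
# and keep one representative per cluster (nearest real vector to each
# centroid by cosine similarity, an assumed materialization rule).
import numpy as np
from sklearn.cluster import KMeans


def select_anchors(X, n_anchors=10, seed=0):
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=seed).fit(X)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    anchors = []
    for c in km.cluster_centers_:
        c = c / np.linalg.norm(c)
        anchors.append(X[int(np.argmax(Xn @ c))])   # closest real vector
    return np.stack(anchors)                         # (n_anchors, hidden_dim)
```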
Figure 9: Visualization of the skipped layer set configurations of LLaMA-2-13B optimized by Self-SD (Zhang et al., 2024) on different task domains. Gray squares indicate retained layers, red squares denote skipped attention layers, and blue squares signify skipped MLP layers.
C.2 Evaluation Metrics

We further describe the two main metrics used in the main experiments. The mean accepted length M denotes the average number of output tokens produced by the target LLM during each forward pass. The token acceptance rate α refers to the ratio of tokens that are accepted by the target LLM to the total number of draft steps, which captures how likely the target LLM is to accept a token generated by the draft model. Given M and α, the expected wall-time speedup can be derived as follows:

    E(Speedup) = Mα / ((M − 1)c + α),    (10)

where c corresponds to the (1 − r) term in Eq. (9), i.e., the relative cost of a draft forward pass.
Figure 10: Visualization of the skipped layer set configurations of Qwen-2.5-14B optimized by Self-SD (Zhang et al., 2024) on different task domains.

Table 6: Comparison between KNN-SSD and two Self-SD methods. R indicates the mix ratio of sample streams. We report the mean accepted length and token acceptance rate, which are denoted as M and α, respectively.

These results further indicate that KNN-SSD can handle a more diverse input stream with stable inference acceleration.