
Long-context modeling is crucial for next-generation language models, yet standard attention mechanisms pose significant computational challenges due to their high computational cost. Sparse attention offers a promising direction for improving efficiency while preserving model capabilities.
We introduce **NSA (Native Sparse Attention)**, a natively trainable sparse attention mechanism. Through hardware-aligned optimizations and algorithmic innovations, NSA enables efficient long-context modeling. Its core design comprises:
A dynamic hierarchical sparse strategy: combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision (a sketch follows below).
Two key innovations:
(1) Substantial speedups through arithmetic-intensity-balanced algorithm design, with implementations optimized for modern hardware.
(2) End-to-end trainability, reducing pretraining computation without sacrificing model performance.
Experimental results show that models pretrained with NSA:
Maintain or exceed the performance of Full Attention models on general benchmarks, long-context tasks, and instruction-based reasoning.
Achieve substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating NSA's efficiency throughout the model lifecycle.
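To make the hierarchical sparse strategy above concrete, here is a minimal, illustrative PyTorch sketch of a single decoding query step that combines a compressed (coarse) branch, a block-selected (fine) branch, and a local sliding window. The block size, top-k, window length, mean-pooling compression, and the single shared softmax are simplifying assumptions for illustration only, not NSA's actual design or configuration.

```python
# Illustrative sketch of hierarchical sparse attention for ONE query position.
# Block size, top-k, and window length are hypothetical; compression is a plain
# mean over blocks, and the three branches share one softmax for brevity
# (NSA instead gates separately computed branch outputs).
import torch
import torch.nn.functional as F

def hierarchical_sparse_attention(q, K, V, block=32, topk=4, window=64):
    """q: (d,) query; K, V: (t, d) keys/values of all t preceding tokens."""
    t, d = K.shape
    scale = d ** -0.5

    # 1) Coarse branch: compress each block of keys/values into a single token.
    n_blocks = t // block
    Kc = K[: n_blocks * block].view(n_blocks, block, d).mean(dim=1)
    Vc = V[: n_blocks * block].view(n_blocks, block, d).mean(dim=1)

    # 2) Fine branch: rank blocks by their compressed-key scores, keep the top-k,
    #    then attend over the original tokens inside the selected blocks.
    block_scores = Kc @ q * scale
    sel = torch.topk(block_scores, k=min(topk, n_blocks)).indices
    idx = (sel[:, None] * block + torch.arange(block)).reshape(-1)
    Ks, Vs = K[idx], V[idx]

    # 3) Local branch: a sliding window over the most recent tokens.
    Kw, Vw = K[-window:], V[-window:]

    # Combine the three branches and attend.
    K_all = torch.cat([Kc, Ks, Kw], dim=0)
    V_all = torch.cat([Vc, Vs, Vw], dim=0)
    attn = F.softmax(K_all @ q * scale, dim=0)
    return attn @ V_all

q, K, V = torch.randn(64), torch.randn(1024, 64), torch.randn(1024, 64)
print(hierarchical_sparse_attention(q, K, V).shape)  # torch.Size([64])
```

Even in this toy form, the query touches only the compressed blocks, a few selected blocks, and a short local window rather than all 1024 keys, which is the source of the efficiency gains claimed above.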
during autoregressive decoding while requiring computationally intensive pre-processing (e.g., attention map calculation, index building) during prefilling. In contrast, approaches like MInference (Jiang et al., 2024) focus solely on prefilling sparsity. These methods fail to achieve acceleration across all inference stages, as at least one phase retains computational costs comparable to Full Attention. This phase specialization limits the speedup these methods can deliver in prefilling-dominated workloads such as book summarization and code completion, or in decoding-dominated workloads such as long chain-of-thought (Wei et al., 2022) reasoning.
Incompatibility with Advanced Attention Architecture. Some sparse attention methods fail to adapt to modern decoding-efficient architectures such as Multi-Query Attention (MQA) (Shazeer, 2019) and Grouped-Query Attention (GQA) (Ainslie et al., 2023), which significantly reduce the memory access bottleneck during decoding by sharing KV across multiple query heads. For instance, in approaches like Quest (Tang et al., 2024), each attention head independently selects its own KV-cache subset. Although this yields consistent computation sparsity and memory access sparsity in Multi-Head Attention (MHA) models, the picture changes in GQA-based architectures, where the KV-cache memory access volume corresponds to the union of the selections made by all query heads within the same GQA group. This architectural characteristic means that while such methods reduce compute operations, the required KV-cache memory access remains relatively high. The result is a fundamental conflict: even though some sparse attention methods reduce computation, their scattered memory access pattern clashes with the efficient memory access design of advanced architectures.
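As a concrete illustration of the point above, the small sketch below (group size, block count, and top-k are hypothetical numbers, not taken from Quest or this paper) shows how independent per-head top-k block selection within one GQA group turns into a much larger union of KV blocks that must actually be loaded during decoding.

```python
# Sketch of the GQA memory-access issue: each query head in a group independently
# picks its own top-k KV blocks, but the group must load the UNION of all picks.
import torch

torch.manual_seed(0)
heads_per_group = 8   # query heads sharing one KV head (hypothetical GQA group size)
n_blocks = 256        # KV cache partitioned into blocks
topk = 16             # blocks kept per head -> 6.25% compute per head

# Independent per-head block choices (random stand-ins for importance-based picks).
per_head_sel = [torch.randperm(n_blocks)[:topk] for _ in range(heads_per_group)]
union = torch.unique(torch.cat(per_head_sel))

print(f"compute per head : {topk}/{n_blocks} blocks ({topk / n_blocks:.1%})")
print(f"memory for group : {len(union)}/{n_blocks} blocks ({len(union) / n_blocks:.1%})")
# With nearly disjoint selections, the union approaches heads_per_group * topk blocks,
# so memory-access sparsity is far weaker than compute sparsity. Since decoding under
# GQA/MQA is memory-bandwidth bound, little wall-clock latency is actually saved.
```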
These limitations arise because many existing sparse attention methods focus on KV-cache reduction or theoretical computation reduction, but struggle to achieve significant latency reduction in advanced frameworks or backends. This motivates us to develop algorithms that combine compatibility with advanced architectures and hardware-efficient implementation, fully leveraging sparsity to improve model efficiency.
2.2. The Myth of Trainable Sparsity
Our pursuit of native trainable sparse attention is motivated by two key insights from analyzing
inference-only approaches: (1) Performance Degradation: Applying sparsity post-hoc forces
models to deviate from their pretrained optimization trajectory. As demonstrated by Chen et al.
(2024), the top 20% of attention can only cover 70% of the total attention scores, rendering structures
like retrieval heads in pretrained models vulnerable to pruning during inference. (2) Training
Efficiency Demands: Efficient handling of long-sequence training is crucial for modern LLM
development. This includes both pretraining on longer documents to enhance model capacity,
and subsequent adaptation phases such as long-context fine-tuning and reinforcement learning.
However, existing sparse attention methods primarily target inference, leaving the computa-
tional challenges in training largely unaddressed. This limitation hinders the development
of more capable long-context models through efficient training. Additionally, efforts to adapt
existing sparse attention for training also expose challenges:
Non-Trainable Components. Discrete operations in methods such as ClusterKV (Liu et al., 2024) (which involves k-means clustering) and MagicPIG (Chen et al., 2024) (which involves SimHash-based selection) create discontinuities in the computational graph. These non-trainable components prevent gradients from flowing through the token selection process, limiting the model's ability to learn optimal sparse patterns.
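The following minimal PyTorch example illustrates this gradient-flow issue: a hard top-k over importance scores (standing in here for the discrete clustering or hashing steps mentioned above) returns integer indices and cuts the computational graph, whereas a soft weighting keeps the scores trainable. The shapes and the softmax alternative are illustrative assumptions, not the mechanism of any cited method.

```python
# Hard, discrete selection blocks gradients to the selection scores;
# a soft (differentiable) weighting does not. Shapes are arbitrary.
import torch

scores = torch.randn(16, requires_grad=True)  # learnable per-block importance scores
kv = torch.randn(16, 8)                       # per-block key/value summaries

# Hard selection: .indices is an integer tensor, so autograd cannot trace through it.
idx = torch.topk(scores, k=4).indices
hard_out = kv[idx].sum()
print(hard_out.requires_grad)                 # False: no path back to `scores`

# Soft weighting keeps the graph connected, so the scores receive gradients.
weights = torch.softmax(scores, dim=0)
soft_out = (weights[:, None] * kv).sum()
soft_out.backward()
print(scores.grad.abs().sum() > 0)            # tensor(True): gradients reach `scores`
```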
Inefficient Back-propagation. Some theoretically trainable sparse attention methods suffer
from practical training inefficiencies. The token-granular selection strategy used in approaches