
Long-context modeling is crucial for next-generation language models, yet standard attention mechanisms pose significant computational challenges due to their high computational cost. Sparse attention offers a promising direction for improving efficiency while preserving model capabilities.
We introduce **NSA (Native Sparse Attention)**, a natively trainable sparse attention mechanism. Through hardware-aligned optimizations and algorithmic innovations, NSA enables efficient long-context modeling. Its core design comprises:
A dynamic hierarchical sparse strategy: combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision (a sketch follows below).
Two key innovations:
(1) Substantial speedups through arithmetic-intensity-balanced algorithm design, with implementations optimized for modern hardware.
(2) End-to-end trainability, reducing pretraining computation without sacrificing model performance.
Experimental results show that models pretrained with NSA:
Maintain or exceed the performance of Full Attention models on general benchmarks, long-context tasks, and instruction-based reasoning.
Achieve substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating NSA's efficiency throughout the model lifecycle.
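To make the hierarchical sparse strategy above concrete, here is a minimal, illustrative PyTorch sketch of a single decoding query step that combines a compressed (coarse) branch, a block-selected (fine) branch, and a local sliding window. The block size, top-k, window length, mean-pooling compression, and the single shared softmax are simplifying assumptions for illustration only, not NSA's actual design or configuration.

```python
# Illustrative sketch of hierarchical sparse attention for ONE query position.
# Block size, top-k, and window length are hypothetical; compression is a plain
# mean over blocks, and the three branches share one softmax for brevity
# (NSA instead gates separately computed branch outputs).
import torch
import torch.nn.functional as F

def hierarchical_sparse_attention(q, K, V, block=32, topk=4, window=64):
    """q: (d,) query; K, V: (t, d) keys/values of all t preceding tokens."""
    t, d = K.shape
    scale = d ** -0.5

    # 1) Coarse branch: compress each block of keys/values into a single token.
    n_blocks = t // block
    Kc = K[: n_blocks * block].view(n_blocks, block, d).mean(dim=1)
    Vc = V[: n_blocks * block].view(n_blocks, block, d).mean(dim=1)

    # 2) Fine branch: rank blocks by their compressed-key scores, keep the top-k,
    #    then attend over the original tokens inside the selected blocks.
    block_scores = Kc @ q * scale
    sel = torch.topk(block_scores, k=min(topk, n_blocks)).indices
    idx = (sel[:, None] * block + torch.arange(block)).reshape(-1)
    Ks, Vs = K[idx], V[idx]

    # 3) Local branch: a sliding window over the most recent tokens.
    Kw, Vw = K[-window:], V[-window:]

    # Combine the three branches and attend.
    K_all = torch.cat([Kc, Ks, Kw], dim=0)
    V_all = torch.cat([Vc, Vs, Vw], dim=0)
    attn = F.softmax(K_all @ q * scale, dim=0)
    return attn @ V_all

q, K, V = torch.randn(64), torch.randn(1024, 64), torch.randn(1024, 64)
print(hierarchical_sparse_attention(q, K, V).shape)  # torch.Size([64])
```

Even in this toy form, the query touches only the compressed blocks, a few selected blocks, and a short local window rather than all 1024 keys, which is the source of the efficiency gains claimed above.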
during autoregressive decoding while requiring computationally intensive pre-processing (e.g., attention map calculation, index building) during prefilling. In contrast, approaches like MInference (Jiang et al., 2024) focus solely on prefilling sparsity. These methods fail to achieve acceleration across all inference stages, as at least one phase retains computational costs comparable to Full Attention. This phase specialization limits the speedup these methods can deliver in prefilling-dominated workloads such as book summarization and code completion, or in decoding-dominated workloads such as long chain-of-thought (Wei et al., 2022) reasoning.
Incompatibility with Advanced Attention Architecture. Some sparse attention methods fail to adapt to modern decoding-efficient architectures such as Multi-Query Attention (MQA) (Shazeer, 2019) and Grouped-Query Attention (GQA) (Ainslie et al., 2023), which significantly reduce the memory access bottleneck during decoding by sharing KV across multiple query heads. For instance, in approaches like Quest (Tang et al., 2024), each attention head independently selects its own KV-cache subset. Although this yields consistent computation sparsity and memory access sparsity in Multi-Head Attention (MHA) models, the picture changes in GQA-based architectures, where the KV-cache memory access volume corresponds to the union of the selections made by all query heads within the same GQA group. This architectural characteristic means that while such methods reduce compute operations, the required KV-cache memory access remains relatively high. The result is a fundamental conflict: even though some sparse attention methods reduce computation, their scattered memory access pattern clashes with the efficient memory access design of advanced architectures.
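As a concrete illustration of the point above, the small sketch below (group size, block count, and top-k are hypothetical numbers, not taken from Quest or this paper) shows how independent per-head top-k block selection within one GQA group turns into a much larger union of KV blocks that must actually be loaded during decoding.

```python
# Sketch of the GQA memory-access issue: each query head in a group independently
# picks its own top-k KV blocks, but the group must load the UNION of all picks.
import torch

torch.manual_seed(0)
heads_per_group = 8   # query heads sharing one KV head (hypothetical GQA group size)
n_blocks = 256        # KV cache partitioned into blocks
topk = 16             # blocks kept per head -> 6.25% compute per head

# Independent per-head block choices (random stand-ins for importance-based picks).
per_head_sel = [torch.randperm(n_blocks)[:topk] for _ in range(heads_per_group)]
union = torch.unique(torch.cat(per_head_sel))

print(f"compute per head : {topk}/{n_blocks} blocks ({topk / n_blocks:.1%})")
print(f"memory for group : {len(union)}/{n_blocks} blocks ({len(union) / n_blocks:.1%})")
# With nearly disjoint selections, the union approaches heads_per_group * topk blocks,
# so memory-access sparsity is far weaker than compute sparsity. Since decoding under
# GQA/MQA is memory-bandwidth bound, little wall-clock latency is actually saved.
```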
These limitations arise because many existing sparse attention methods focus on KV-cache reduction or theoretical computation reduction, but struggle to achieve significant latency reduction in advanced frameworks or backends. This motivates us to develop algorithms that combine compatibility with advanced architectures and hardware-efficient implementation, fully leveraging sparsity to improve model efficiency.
2.2. The Myth of Trainable Sparsity
Our pursuit of native trainable sparse attention is motivated by two key insights from analyzing
inference-only approaches: (1) Performance Degradation: Applying sparsity post-hoc forces
models to deviate from their pretrained optimization trajectory. As demonstrated by Chen et al.
(2024), the top 20% of attention can only cover 70% of the total attention scores, rendering structures
like retrieval heads in pretrained models vulnerable to pruning during inference. (2) Training
Efficiency Demands: Efficient handling of long-sequence training is crucial for modern LLM
development. This includes both pretraining on longer documents to enhance model capacity,
and subsequent adaptation phases such as long-context fine-tuning and reinforcement learning.
However, existing sparse attention methods primarily target inference, leaving the computa-
tional challenges in training largely unaddressed. This limitation hinders the development
of more capable long-context models through efficient training. Additionally, efforts to adapt
existing sparse attention for training also expose challenges:
Non-Trainable Components. Discrete operations in methods such as ClusterKV (Liu et al., 2024) (which involves k-means clustering) and MagicPIG (Chen et al., 2024) (which involves SimHash-based selection) create discontinuities in the computational graph. These non-trainable components prevent gradients from flowing through the token selection process, limiting the model's ability to learn optimal sparse patterns.
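The following minimal PyTorch example illustrates this gradient-flow issue: a hard top-k over importance scores (standing in here for the discrete clustering or hashing steps mentioned above) returns integer indices and cuts the computational graph, whereas a soft weighting keeps the scores trainable. The shapes and the softmax alternative are illustrative assumptions, not the mechanism of any cited method.

```python
# Hard, discrete selection blocks gradients to the selection scores;
# a soft (differentiable) weighting does not. Shapes are arbitrary.
import torch

scores = torch.randn(16, requires_grad=True)  # learnable per-block importance scores
kv = torch.randn(16, 8)                       # per-block key/value summaries

# Hard selection: .indices is an integer tensor, so autograd cannot trace through it.
idx = torch.topk(scores, k=4).indices
hard_out = kv[idx].sum()
print(hard_out.requires_grad)                 # False: no path back to `scores`

# Soft weighting keeps the graph connected, so the scores receive gradients.
weights = torch.softmax(scores, dim=0)
soft_out = (weights[:, None] * kv).sum()
soft_out.backward()
print(scores.grad.abs().sum() > 0)            # tensor(True): gradients reach `scores`
```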
Inefficient Back-propagation. Some theoretically trainable sparse attention methods suffer
from practical training inefficiencies. The token-granular selection strategy used in approaches