UPSTREAM PR #17579: Add PagedAttention support (experimental, CUDA only) #352
base: main
Conversation
Explore the complete analysis inside Version Insights.

Performance Analysis Summary: PR #352

Analysis Context: Comparing version

Overview

PR #352 introduces experimental PagedAttention support for memory-efficient KV cache management. The feature is disabled by default and requires the `--pagedattention` flag.

Key Findings

Performance-Critical Areas Impact

Inference Functions:

Tokens Per Second Impact: None. Since inference-critical functions show no response-time or throughput changes, token-generation throughput remains unaffected. The reference metric (7% TPS reduction per 2 ms increase in llama_decode) does not apply, as llama_decode shows a 0 ns change.

Affected Functions:

These functions execute during CLI initialization only, not during inference. The absolute overhead (2-10 microseconds per argument) is negligible for application startup time.

Root Cause: the addition of a new command-line flag (the PR adds `--pagedattention`).

Power Consumption Analysis

Binary-Level Impact:

The power-consumption changes are minimal across all binaries. The +1.80% increase in libllama.so represents 3468 nJ in total, which is negligible in absolute terms. This increase is not attributable to PagedAttention code paths, since the feature is disabled by default.

Interpretation: The power-consumption variations fall within normal measurement variance for binaries of this size. No actionable power-efficiency concerns were identified.
Force-pushed from 06254d1 to 1745418
Performance Analysis Summary: PR #352 - PagedAttention Implementation

Project: llama.cpp | PR: #352 | Scope: 19 files, 1939 additions, 3 deletions

Overview

This PR introduces PagedAttention, an experimental CUDA-only KV cache implementation using block-based memory management. The changes add new memory-allocation patterns, CUDA kernels, and integration points across the codebase. Analysis reveals performance impacts primarily in graph construction and memory management, with no direct impact on core inference functions.

Key Findings

Impact on Inference Performance (Tokens per Second)

Core Inference Functions Analysis: The analysis examined llama_decode, llama_encode, and llama_tokenize for response-time and throughput changes, based on the available performance data.

Tokens per Second Impact: Using the reference model (ollama://smollm:135m on a 12th Gen Intel Core i7-1255U, Ubuntu 24.04.3 LTS, x86_64), where a 2 ms slower llama_decode results in a 7% tokens-per-second reduction:

Conclusion: This PR does not directly impact tokens per second for CPU-based inference. CUDA performance impact cannot be assessed from the available static-analysis data.

Most-Impacted Functions in Performance-Critical Areas

Memory Management Module:

Model Processing Module:

STL Container Operations (Indirect Impact):

Power Consumption Analysis

Binary-Level Impact:

Summary: The core inference library shows a 3468 nJ increase in power consumption, representing 1.80% higher energy usage per execution cycle. This is driven by increased CPU time in graph construction (map iterations, tensor creation) and memory-management operations (block allocation, reference counting). The impact is concentrated in libllama.so, with minimal effects on other binaries.
Force-pushed from 5efc8b7 to f077805
Implement the PagedAttention algorithm for memory-efficient KV cache management. This feature reduces memory fragmentation by storing the KV cache in fixed-size blocks (similar to virtual memory paging) and enables efficient memory sharing between sequences through copy-on-write semantics. The implementation is experimental and disabled by default. Enable it with the --pagedattention flag.

Signed-off-by: Eric Curtin <[email protected]>
Force-pushed from 1745418 to f0b133d
Performance Analysis Summary: PR #352 - PagedAttention Implementation

Overview

PR #352 introduces an experimental PagedAttention feature for CUDA-based KV cache management, adding 1979 lines across 19 files. The implementation remains disabled by default and requires explicit activation via the `--pagedattention` flag.

Key Findings

Performance Metrics Status

No performance data is available for version 29c716bb-4b1b-4a55-998e-62d24f7fdb79. All binaries show a -100% power-consumption change, reflecting zero throughput measurements in the target version. This prevents quantitative assessment of response-time or throughput changes for critical functions.

Code Implementation Analysis

Core Attention Path Modifications:

Memory Management:

CUDA Kernel Implementation:

Inference Impact:

Power Consumption:

Implementation Completeness

The CUDA kernels contain placeholder logic with TODO comments for vectorized operations. Only the FP16 data type with head_size=128 and block_size=16 is implemented, representing 1 of 9 documented supported configurations.
Force-pushed from e4a4e1d to d0b408b
Mirrored from ggml-org/llama.cpp#17579
Implement the PagedAttention algorithm for memory-efficient KV cache management. This feature reduces memory fragmentation by storing the KV cache in fixed-size blocks (similar to virtual memory paging) and enables efficient memory sharing between sequences through copy-on-write semantics.
The implementation is experimental and disabled by default. Enable it with the --pagedattention flag.