Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17543

Add support for two important RoPE variants: partial rotation (rope_dims < ne0) and Vision mode rotation.

  1. Support for partial RoPE (rope_dims < ne0):

    • Split tensor into head (first rope_dims dimensions) and tail portions
    • Apply rotation only to head portion using RotaryPositionEmbedding operator
    • Copy unrotated tail portion directly from source to destination
    • Handle both contiguous and non-contiguous tensor layouts
  2. Support for Vision mode (GGML_ROPE_TYPE_VISION):

    • Set rope_dims = ne0 for Vision mode to rotate entire tensor
    • Vision mode pairs dimension i with dimension i+n_dims (where n_dims = ne0/2)
    • No tail handling needed since entire tensor is rotated
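The head/tail split above can be sketched as a CPU reference for a single row. This is illustrative only: the CANN backend dispatches the rotation to the RotaryPositionEmbedding operator rather than looping, and the NeoX-style pairing (i with i + rope_dims/2) and theta schedule below are assumptions for the sketch, not the backend's exact kernels.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Reference for one row of length ne0: rotate the first rope_dims
// elements in pairs (i, i + rope_dims/2), keep the tail unrotated.
std::vector<float> rope_row_partial(const std::vector<float> &src,
                                    int rope_dims, int pos,
                                    float theta_base = 10000.0f) {
    const int half = rope_dims / 2;
    std::vector<float> dst(src); // tail [rope_dims, ne0) stays as-is
    for (int i = 0; i < half; ++i) {
        const float theta = pos * std::pow(theta_base, -2.0f * i / rope_dims);
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = src[i], x1 = src[i + half];
        dst[i]        = x0 * c - x1 * s;
        dst[i + half] = x0 * s + x1 * c;
    }
    return dst;
}
```

At position 0 the rotation is the identity, and elements past rope_dims are never touched, which is exactly the tail-copy behavior described above.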

Implementation details:

  • Use a has_tail flag to choose the execution path: head/tail splitting when rope_dims < ne0, full-tensor rotation when rope_dims == ne0
  • Support both F32 and F16 data types, with intermediate F32 conversion for F16
  • Copy non-contiguous tensors into contiguous buffers before calling the RotaryPositionEmbedding operator, for compatibility
  • Improve the cache invalidation logic to include the rope_dims and indep_sects parameters
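The cache-invalidation point can be sketched as a key comparison. The field and type names below are assumptions for illustration (the real state lives in ggml_cann_rope_cache in common.h); the point is simply that rope_dims and indep_sects now participate in the validity check, so changing either forces the sin/cos tables to be rebuilt.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cache key; field names are illustrative, not the
// actual members of ggml_cann_rope_cache.
struct rope_cache_key {
    int64_t rope_dims;
    bool    indep_sects;
    float   freq_base;
    float   freq_scale;
};

// The cached sin/cos tables are reusable only if every parameter
// that shaped them matches the current call.
inline bool rope_cache_valid(const rope_cache_key &cached,
                             const rope_cache_key &now) {
    return cached.rope_dims   == now.rope_dims   &&
           cached.indep_sects == now.indep_sects &&
           cached.freq_base   == now.freq_base   &&
           cached.freq_scale  == now.freq_scale;
}
```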

These enhancements enable the CANN backend to handle the various RoPE configurations used in modern vision-language models and in models with partial rotation.

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #344

Analysis: This PR implements partial RoPE and vision mode support for the CANN backend across 3 files with 222 additions and 70 deletions. The changes modify the ggml_cann_rope function and related cache initialization logic in aclnn_ops.cpp, extend the ggml_cann_rope_cache structure in common.h, and update backend support logic in ggml-cann.cpp.

Performance Impact: No measurable performance changes detected. Power consumption analysis shows less than 0.001% variation across all binaries, with maximum absolute delta of 0.66 nJ in libllama.so. No functions show measurable changes in response time or throughput time between versions.

Inference Impact: No impact on tokens per second. The core inference functions (llama_decode, llama_encode, llama_tokenize) show no response time or throughput changes. The modifications are isolated to CANN backend RoPE operations, which do not affect CPU-based tokenization or inference paths.

Code Changes: The implementation adds conditional logic for partial rotation (when rope_dims < ne0) by splitting tensors into head and tail portions. For F32 tensors, the head undergoes rotation via RotaryPositionEmbedding while the tail is copied directly. F16 tensors follow the same pattern with intermediate F32 conversion. Vision mode sets rope_dims = ne0 for full tensor rotation. The changes enable support for vision-language models without affecting existing full-rotation models, which bypass the new code path when has_tail == false.
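The F16 path (cast up to F32, rotate, cast back) can be illustrated with toy host-side casts. These are simplified stand-ins for the backend's device-side cast operators: truncating, no rounding, subnormals flushed to zero.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Toy F32 -> F16 cast (normals only, truncates the mantissa).
static uint16_t f32_to_f16(float f) {
    uint32_t x; std::memcpy(&x, &f, sizeof x);
    const uint16_t sign = (x >> 16) & 0x8000;
    const int32_t  exp  = (int32_t)((x >> 23) & 0xff) - 127 + 15;
    if (exp <= 0)  return sign;          // underflow -> signed zero
    if (exp >= 31) return sign | 0x7c00; // overflow  -> infinity
    return sign | (uint16_t)(exp << 10) | (uint16_t)((x >> 13) & 0x3ff);
}

// Toy F16 -> F32 cast (exact for normals).
static float f16_to_f32(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    const uint32_t exp  = (h >> 10) & 0x1f;
    const uint32_t mant = h & 0x3ff;
    uint32_t x;
    if (exp == 0)       x = sign;                              // zero
    else if (exp == 31) x = sign | 0x7f800000u | (mant << 13); // inf/nan
    else                x = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    float f; std::memcpy(&f, &x, sizeof f);
    return f;
}
```

Because every F16 value is exactly representable in F32, the up-cast/rotate/down-cast round trip loses no precision beyond the final F16 truncation.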

@loci-dev force-pushed the main branch 8 times, most recently from 9a74048 to af6127b on November 28, 2025 at 20:09
@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Review Summary: PR #344 - CANN Backend Partial RoPE Support

Overview

PR #344 implements partial Rotary Position Embedding and Vision mode support in the CANN backend (ggml-cann library). The changes modify aclnn_ops.cpp (153 additions, 61 deletions) and ggml-cann.cpp (6 additions, 8 deletions) to enable head/tail tensor splitting for models where rope_dims < ne0.

Key Findings

Performance-Critical Function Impact

The modified ggml_cann_rope() function in aclnn_ops.cpp introduces conditional execution paths:

  • Full RoPE path (rope_dims == ne00): Execution remains unchanged with no performance delta
  • Partial RoPE path (rope_dims < ne00): Adds head buffer allocation, head-only rotation, head copy-back operation, and tail copy operation

For partial RoPE cases with typical attention dimensions (rope_dims=64, ne00=128, ne01=32, ne02=2048), the additional operations introduce roughly 160,000 ns of overhead per call from memory copy operations alone.
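That figure is consistent with a simple bandwidth estimate. The 200 GB/s effective copy bandwidth below is an assumed round number chosen for illustration, not a measured property of any Ascend device.

```cpp
#include <cassert>
#include <cstdint>

struct copy_estimate { int64_t bytes; double ns; };

// Back-of-envelope cost of the extra copies on the partial RoPE path:
// the rotated head is copied back and the unrotated tail is copied once.
copy_estimate partial_rope_copy_cost(int64_t rope_dims, int64_t ne00,
                                     int64_t ne01, int64_t ne02,
                                     double bandwidth_Bps) {
    const int64_t head  = rope_dims          * ne01 * ne02;
    const int64_t tail  = (ne00 - rope_dims) * ne01 * ne02;
    const int64_t bytes = (head + tail) * 4; // F32 elements
    return { bytes, bytes / bandwidth_Bps * 1e9 };
}
```

With the quoted shape this moves 32 MiB per call; at the assumed 200 GB/s that is about 168 µs, in line with the ~160,000 ns estimate.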

The aclnn_rope_cache_init() function signature change adds a rope_dims parameter, enabling correct cache sizing for partial rotation. Cache invalidation logic now also includes a theta_scale_updated flag, improving cache correctness.

Inference Impact

Token Generation Rate: The changes affect only the CANN backend RoPE implementation within the GGML computation graph layer. The core inference functions llama_decode(), llama_encode(), and llama_tokenize() in the llama.cpp API layer are not modified. Token generation rate impact depends on:

  • Model architecture: Only models using partial RoPE on CANN backend are affected
  • Backend selection: CPU and other GPU backends remain unchanged
  • RoPE frequency: Impact scales with number of RoPE operations per token

For models using full RoPE or running on non-CANN backends, tokens per second remains unchanged.

Power Consumption

Power consumption analysis applies to binaries containing the modified CANN backend code. The additional copy operations in partial RoPE path increase cumulative execution time, resulting in higher power draw proportional to the throughput time increase. Binaries using full RoPE or non-CANN backends show no power consumption change.

@loci-dev force-pushed the main branch 8 times, most recently from e4a4e1d to d0b408b on November 30, 2025 at 02:46