
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17548

alt #17248

  • Force token embeddings to be at the start of the graph
  • Fix LFM2 output norm tensor
  • Fix LLM_TENSOR_TOKEN_EMBD_NORM tensor info

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #347

Overview

This PR corrects LFM2 model tensor classifications and graph construction order. The changes span three files and 21 modified lines, primarily fixing a tensor type misclassification and ensuring token embeddings appear at the start of the compute graph.

Key Findings

Performance-Critical Areas Impact

Inference Pipeline Functions:
No direct modifications to core inference functions (llama_decode, llama_encode, llama_tokenize). The changes affect LFM2-specific model loading and graph construction but do not alter the execution path of primary inference functions.

Tokens Per Second Impact:
No measurable impact on tokens per second is expected. The PR modifies LFM2-specific tensor mapping and graph ordering without changing the computational logic of the tokenization or decoding functions; the response time and throughput of llama_decode remain unchanged.

Modified Components:

  • src/llama-arch.cpp: Tensor type mapping correction (LLM_TENSOR_OUTPUT_NORM vs LLM_TENSOR_TOKEN_EMBD_NORM)
  • src/llama-model.cpp: Model structure update (tok_norm → output_norm)
  • src/models/lfm2.cpp: Graph construction ordering and tensor reference updates

Absolute Performance Changes:
The changes force token embeddings to the start of the graph via ggml_build_forward_expand. This affects memory-allocation timing but not computational cost. Correcting the operation type for token-embedding normalization from GGML_OP_GET_ROWS to GGML_OP_MUL aligns the tensor-info entry with how the tensor is actually used.

Power Consumption:
The analysis operates at the binary level. The changes affect LFM2-specific code paths within the main binary. No additional computational operations are introduced; the structural corrections keep the operation count equivalent.

Scope:
Changes are isolated to the LFM2 and LFM2MOE architectures. Other model families (GPT, LLAMA, etc.) are unaffected, and the core inference loop is unchanged.

@loci-dev force-pushed the main branch 16 times, most recently from 1854a53 to 1b177fe on November 30, 2025 at 15:08.

3 participants