Conversation

@cacaview

Make sure to read the contributing guidelines before submitting a PR
This is the current work progress:
#16930 (comment)

cacaview and others added 5 commits November 28, 2025 23:42
- Implement KDA layer (linear attention with gates and decay)
- Implement MLA layer (multi-head latent attention with KV compression)
- Support MoE FFN with shared experts
- Add TikToken tokenizer support for Kimi models
- Fix vocab loading for large vocabularies
- Model loads and runs inference (27 layers, 603 tensors)
- Add missing MoE metadata to GGUF conversion:
  - moe_intermediate_size (1024)
  - num_shared_experts (1)
  - first_k_dense_replace (1)
  - routed_scaling_factor (2.446)
  - expert_gating_func (sigmoid)
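
Taken together, the gating metadata above implies a routing scheme along these lines. This is an illustrative sketch only, not the PR's actual code; the expert count and top_k below are made up, and only the sigmoid gating, renormalization, and 2.446 scaling come from the metadata:

```python
import math

# Sketch of sigmoid-gated MoE routing with renormalization and a routed
# scaling factor. Expert logits, expert count, and top_k are hypothetical.
def route(logits: list[float], top_k: int, scale: float = 2.446) -> tuple[list[int], list[float]]:
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]      # sigmoid gating, not softmax
    idx = sorted(range(len(probs)), key=lambda i: -probs[i])[:top_k]
    w = [probs[i] for i in idx]
    total = sum(w)
    w = [x / total * scale for x in w]  # renormalize (norm_w=true), then routed_scaling_factor
    return idx, w
```

The shared expert (num_shared_experts = 1) would sit outside this routing entirely: its output is always added, while the routed experts contribute through the scaled weights above.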

- Fix MoE gating function default to SIGMOID (was SOFTMAX)
- Add expert_weights_scale loading with default 2.446
- Enable moe_renormalize (norm_w=true) in build_moe_ffn
- Add fallback for exp_probs_b tensor suffix compatibility
- Add KDA (Kimi Delta Attention) CUDA kernel (kda-scan.cu)
- Fix recurrence order: decay first, then retrieval
- Verify CPU/CUDA implementation consistency
- Support head_dim=128, L2 normalization for Q/K
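
As a rough illustration of the "decay first, then retrieval" ordering noted above, one recurrence step of a gated delta-rule state update can be sketched as follows. The state shape, gate semantics, and function names here are assumptions for illustration, not the kernel's actual implementation:

```python
import math

def l2norm(x: list[float]) -> list[float]:
    n = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / n for v in x]

# One hypothetical recurrence step; S is a d x d state matrix.
def kda_step(S, q, k, v, gate, beta):
    d = len(q)
    q, k = l2norm(q), l2norm(k)  # L2 normalization for Q/K
    # 1) decay: apply the per-channel gate to the state first ...
    S = [[S[i][j] * gate[j] for j in range(d)] for i in range(d)]
    # 2) delta-rule write: move the state toward the (k, v) association
    pred = [sum(S[i][j] * k[j] for j in range(d)) for i in range(d)]
    S = [[S[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(d)] for i in range(d)]
    # 3) ... then retrieval: read the output with the query
    o = [sum(S[i][j] * q[j] for j in range(d)) for i in range(d)]
    return S, o
```

With an empty state, unit gate, and q equal to k, the step writes the (k, v) association and reads it straight back out, which is a quick way to sanity-check the ordering on CPU against a kernel implementation.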
@github-actions github-actions bot added labels: model (Model specific), Nvidia GPU (Issues specific to Nvidia GPUs), python (python script changes), ggml (changes relating to the ggml tensor library for machine learning) on Nov 29, 2025
Comment on lines +2729 to 2733
# KimiLinearModel is defined later in this file (line ~5140) as a TextModel subclass
# This old definition has been removed to avoid conflicts


@ModelBase.register(
Collaborator

Suggested change:
- # KimiLinearModel is defined later in this file (line ~5140) as a TextModel subclass
- # This old definition has been removed to avoid conflicts
- @ModelBase.register(
+ @ModelBase.register(

(self.format_tensor_name(gguf.MODEL_TENSOR.ATTN_Q, bid), q),
(self.format_tensor_name(gguf.MODEL_TENSOR.ATTN_K, bid), k),
(self.format_tensor_name(gguf.MODEL_TENSOR.ATTN_V, bid), v),
]
Collaborator

Suggested change:
- ]
+ ]
+ else:
+     return [(self.map_tensor_name(name), data_torch)]

@ModelBase.register("KimiLinearModel", "KimiLinearForCausalLM")
class KimiLinearModel(TextModel):
"""Kimi-Linear model with hybrid MLA+KDA architecture"""
model_arch = gguf.MODEL_ARCH.KIMI
Collaborator

Suggested change:
- model_arch = gguf.MODEL_ARCH.KIMI
+ model_arch = gguf.MODEL_ARCH.KIMI_LINEAR

_experts: list[dict[str, Tensor]] | None = None

def set_gguf_parameters(self):
self.gguf_writer.add_vocab_size(self.hparams["vocab_size"])
Collaborator

Suggested change:
- self.gguf_writer.add_vocab_size(self.hparams["vocab_size"])
+ super().set_gguf_parameters()
+ self.gguf_writer.add_vocab_size(self.hparams["vocab_size"])

Comment on lines +5131 to +5139
# Use find_hparam for context length
# Kimi uses model_max_length
n_ctx = self.find_hparam(["max_position_embeddings", "model_max_length", "n_ctx", "n_positions"], optional=True)
if n_ctx is not None:
    self.gguf_writer.add_context_length(n_ctx)
else:
    # Default to 4096 if not found
    logger.warning("No context length found in config, defaulting to 4096")
    self.gguf_writer.add_context_length(4096)
Collaborator

Add model_max_length to TextModel.set_gguf_parameters instead, the fallback is not necessary.
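
For context, the find_hparam helper used in the diff above returns the first matching key from a list of candidates; a hypothetical minimal sketch of that lookup pattern:

```python
# Hypothetical sketch of a first-match hyperparameter lookup, mirroring
# how find_hparam is used in the conversion script above. Not the real helper.
def find_hparam(hparams: dict, keys: list[str], optional: bool = False):
    for key in keys:
        if key in hparams:
            return hparams[key]
    if optional:
        return None
    raise KeyError(f"none of {keys} found in hparams")
```

Adding "model_max_length" to the candidate list in the base class, as the reviewer suggests, makes the per-model fallback unnecessary.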

case LLM_ARCH_ARCTIC:
case LLM_ARCH_DEEPSEEK:
case LLM_ARCH_DEEPSEEK2:
case LLM_ARCH_KIMI:
Collaborator

Suggested change:
- case LLM_ARCH_KIMI:
+ case LLM_ARCH_KIMI_LINEAR:

if (qs.n_attention_wv != 0 && !is_clip_model)
// Skip this check for Kimi models which have hybrid KDA+MLA architecture
// (only MLA layers have attn_kv_b weights, KDA layers don't)
if (qs.n_attention_wv != 0 && !is_clip_model && model.arch != LLM_ARCH_KIMI)
Collaborator

Suggested change:
- if (qs.n_attention_wv != 0 && !is_clip_model && model.arch != LLM_ARCH_KIMI)
+ if (qs.n_attention_wv != 0 && !is_clip_model && model.arch != LLM_ARCH_KIMI_LINEAR)

Collaborator

Rename to kimi-linear.cpp

@@ -0,0 +1,429 @@
#include "models.h"

llm_build_kimi::llm_build_kimi(const llama_model & model, const llm_graph_params & params) : llm_graph_context_mamba(params), model(model) {
Collaborator

Suggested change (the constructor's qualified name must match the renamed class):
- llm_build_kimi::llm_build_kimi(const llama_model & model, const llm_graph_params & params) : llm_graph_context_mamba(params), model(model) {
+ llm_build_kimi_linear::llm_build_kimi_linear(const llama_model & model, const llm_graph_params & params) : llm_graph_context_mamba(params), model(model) {

Comment on lines 286 to 287
struct llm_build_kimi : public llm_graph_context_mamba {
llm_build_kimi(const llama_model & model, const llm_graph_params & params);
Collaborator

Suggested change:
- struct llm_build_kimi : public llm_graph_context_mamba {
-     llm_build_kimi(const llama_model & model, const llm_graph_params & params);
+ struct llm_build_kimi_linear : public llm_graph_context_mamba {
+     llm_build_kimi_linear(const llama_model & model, const llm_graph_params & params);

@cacaview
Author

I have fixed these errors in the commit at cacaview@780dd78
