Feature/kimi linear support #17592
base: master
Conversation
- Implement KDA layer (linear attention with gates and decay)
- Implement MLA layer (multi-head latent attention with KV compression)
- Support MoE FFN with shared experts
- Add TikToken tokenizer support for Kimi models
- Fix vocab loading for large vocabularies
- Model loads and runs inference (27 layers, 603 tensors)
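For readers unfamiliar with MLA, here is a toy numpy sketch of the KV-compression idea: instead of caching full per-head K/V, a single small latent per token is cached and K/V are reconstructed from it. All dimensions below are made up for illustration, not the actual Kimi-Linear shapes or llama.cpp code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, NOT the real Kimi-Linear shapes
d_model, d_latent, n_head, d_head, T = 64, 16, 4, 8, 5

# One shared down-projection compresses each token into a small latent;
# only this latent has to live in the KV cache.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
# Up-projections reconstruct per-head K and V from the cached latent.
W_up_k = rng.standard_normal((d_latent, n_head * d_head)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, n_head * d_head)) / np.sqrt(d_latent)

x = rng.standard_normal((T, d_model))
latent = x @ W_down                                # (T, 16): what gets cached
k = (latent @ W_up_k).reshape(T, n_head, d_head)   # (T, 4, 8)
v = (latent @ W_up_v).reshape(T, n_head, d_head)   # (T, 4, 8)

# Cache cost per token: d_latent floats instead of 2 * n_head * d_head
print(latent.shape, k.shape, v.shape)
```

The point of the compression is the cache size: here 16 floats per token are stored instead of 64 for uncompressed K plus V.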
- Add missing MoE metadata to GGUF conversion:
  - moe_intermediate_size (1024)
  - num_shared_experts (1)
  - first_k_dense_replace (1)
  - routed_scaling_factor (2.446)
  - expert_gating_func (sigmoid)
- Fix MoE gating function default to SIGMOID (was SOFTMAX)
- Add expert_weights_scale loading with default 2.446
- Enable moe_renormalize (norm_w=true) in build_moe_ffn
- Add fallback for exp_probs_b tensor suffix compatibility
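A minimal sketch of what sigmoid gating with renormalization and a routed scaling factor means, assuming the common top-k routing scheme (this is a toy, not the build_moe_ffn implementation; the function name is made up):

```python
import numpy as np

def moe_gate_sigmoid(logits, top_k=2, scale=2.446, renormalize=True):
    """Toy sigmoid-gated top-k expert routing (hedged sketch, not
    llama.cpp code). `scale` plays the role of routed_scaling_factor /
    expert_weights_scale; `renormalize` mirrors norm_w=true."""
    probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid instead of softmax
    idx = np.argsort(probs)[::-1][:top_k]   # pick the top-k experts
    w = probs[idx]
    if renormalize:
        w = w / w.sum()                     # renormalize selected weights
    return idx, w * scale                   # apply the routed scaling factor

idx, w = moe_gate_sigmoid(np.array([0.2, -1.0, 1.5, 0.1]))
print(idx, w)
```

With renormalization on, the selected weights sum to 1 before scaling, so the final weights always sum to `scale` regardless of how confident the router was.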
- Add KDA (Kimi Delta Attention) CUDA kernel (kda-scan.cu)
- Fix recurrence order: decay first, then retrieval
- Verify CPU/CUDA implementation consistency
- Support head_dim=128, L2 normalization for Q/K
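The recurrence-order fix can be illustrated with a single-head reference scan for a gated delta rule. This is a hedged sketch of the recurrence described above, not the kda-scan.cu kernel, and the exact gating shape is an assumption (per key channel here):

```python
import numpy as np

def l2norm(x, eps=1e-6):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def kda_scan_ref(q, k, v, g, beta):
    """Toy single-head gated delta-rule scan (not the real kernel).
    The detail the fix is about: the state is decayed FIRST, and only
    then is the old value retrieved for the delta-rule correction."""
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))            # recurrent state: key -> value map
    out = np.zeros_like(v)
    q, k = l2norm(q), l2norm(k)         # L2-normalize Q/K as in the PR
    for t in range(T):
        S = S * np.exp(g[t])[None, :]   # 1) decay (g <= 0, per key channel)
        pred = S @ k[t]                 # 2) retrieve the state's prediction
        S = S + beta[t] * np.outer(v[t] - pred, k[t])  # 3) delta update
        out[t] = S @ q[t]
    return out

T, d = 6, 4
rng = np.random.default_rng(1)
o = kda_scan_ref(rng.standard_normal((T, d)), rng.standard_normal((T, d)),
                 rng.standard_normal((T, d)), -rng.uniform(0, 1, (T, d)),
                 rng.uniform(0, 1, T))
print(o.shape)
```

A CPU loop like this is also the natural oracle for the CPU/CUDA consistency check mentioned above: run both on random inputs and compare outputs within a small tolerance.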
    # KimiLinearModel is defined later in this file (line ~5140) as a TextModel subclass
    # This old definition has been removed to avoid conflicts
    @ModelBase.register(

Suggested change:
    - # KimiLinearModel is defined later in this file (line ~5140) as a TextModel subclass
    - # This old definition has been removed to avoid conflicts
      @ModelBase.register(
    (self.format_tensor_name(gguf.MODEL_TENSOR.ATTN_Q, bid), q),
    (self.format_tensor_name(gguf.MODEL_TENSOR.ATTN_K, bid), k),
    (self.format_tensor_name(gguf.MODEL_TENSOR.ATTN_V, bid), v),
    ]

Suggested change:
      ]
    + else:
    +     return [(self.map_tensor_name(name), data_torch)]
convert_hf_to_gguf.py
Outdated

    @ModelBase.register("KimiLinearModel", "KimiLinearForCausalLM")
    class KimiLinearModel(TextModel):
        """Kimi-Linear model with hybrid MLA+KDA architecture"""
        model_arch = gguf.MODEL_ARCH.KIMI

Suggested change:
    - model_arch = gguf.MODEL_ARCH.KIMI
    + model_arch = gguf.MODEL_ARCH.KIMI_LINEAR
    _experts: list[dict[str, Tensor]] | None = None

    def set_gguf_parameters(self):
        self.gguf_writer.add_vocab_size(self.hparams["vocab_size"])

Suggested change:
    - self.gguf_writer.add_vocab_size(self.hparams["vocab_size"])
    + super().set_gguf_parameters()
    + self.gguf_writer.add_vocab_size(self.hparams["vocab_size"])
    # Use find_hparam for context length
    # Kimi uses model_max_length
    n_ctx = self.find_hparam(["max_position_embeddings", "model_max_length", "n_ctx", "n_positions"], optional=True)
    if n_ctx is not None:
        self.gguf_writer.add_context_length(n_ctx)
    else:
        # Default to 4096 if not found
        logger.warning("No context length found in config, defaulting to 4096")
        self.gguf_writer.add_context_length(4096)

Add model_max_length to TextModel.set_gguf_parameters instead, the fallback is not necessary.
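The reviewer's point can be illustrated with a minimal stand-in for find_hparam (a hypothetical simplification, not the real TextModel method): once model_max_length is in the base class's key-priority list, any model that stores its context window under that name is covered and the per-model fallback code disappears. The config value below is only an example.

```python
def find_hparam(hparams: dict, keys: list, optional: bool = False):
    """Minimal stand-in for TextModel.find_hparam (hedged sketch):
    return the value of the first listed key present in the config."""
    for key in keys:
        if key in hparams:
            return hparams[key]
    if optional:
        return None
    raise KeyError(f"could not find any of: {keys}")

# If the Kimi config stores its context window as model_max_length,
# putting that key in the base-class priority list handles it with no
# model-specific fallback.
cfg = {"model_max_length": 131072}  # example value, not the real config
print(find_hparam(cfg, ["max_position_embeddings", "model_max_length", "n_ctx"]))
```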
src/llama-model.cpp
Outdated

    case LLM_ARCH_ARCTIC:
    case LLM_ARCH_DEEPSEEK:
    case LLM_ARCH_DEEPSEEK2:
    case LLM_ARCH_KIMI:

Suggested change:
    - case LLM_ARCH_KIMI:
    + case LLM_ARCH_KIMI_LINEAR:
src/llama-quant.cpp
Outdated

    - if (qs.n_attention_wv != 0 && !is_clip_model)
    + // Skip this check for Kimi models which have hybrid KDA+MLA architecture
    + // (only MLA layers have attn_kv_b weights, KDA layers don't)
    + if (qs.n_attention_wv != 0 && !is_clip_model && model.arch != LLM_ARCH_KIMI)

Suggested change:
    - if (qs.n_attention_wv != 0 && !is_clip_model && model.arch != LLM_ARCH_KIMI)
    + if (qs.n_attention_wv != 0 && !is_clip_model && model.arch != LLM_ARCH_KIMI_LINEAR)
Rename to kimi-linear.cpp
src/models/kimi.cpp
Outdated

    @@ -0,0 +1,429 @@
    #include "models.h"

    llm_build_kimi::llm_build_kimi(const llama_model & model, const llm_graph_params & params) : llm_graph_context_mamba(params), model(model) {

Suggested change:
    - llm_build_kimi::llm_build_kimi(const llama_model & model, const llm_graph_params & params) : llm_graph_context_mamba(params), model(model) {
    + llm_build_kimi_linear::llm_build_kimi_linear(const llama_model & model, const llm_graph_params & params) : llm_graph_context_mamba(params), model(model) {
src/models/models.h
Outdated

    struct llm_build_kimi : public llm_graph_context_mamba {
        llm_build_kimi(const llama_model & model, const llm_graph_params & params);

Suggested change:
    - struct llm_build_kimi : public llm_graph_context_mamba {
    -     llm_build_kimi(const llama_model & model, const llm_graph_params & params);
    + struct llm_build_kimi_linear : public llm_graph_context_mamba {
    +     llm_build_kimi_linear(const llama_model & model, const llm_graph_params & params);
|
I have fixed these errors in commit cacaview@780dd78
Make sure to read the contributing guidelines before submitting a PR
This is the current work progress:
#16930 (comment)