Haojie Duanmu<sup>1,2</sup> | Xiuhong Li<sup>3</sup> | Zhihang Yuan<sup>3</sup> | Size Zheng<sup>4</sup> | Jiangfei Duan<sup>5</sup>
Xingcheng Zhang<sup>2</sup> | Dahua Lin<sup>2,5</sup>

<sup>1</sup>SJTU, <sup>2</sup>Shanghai AI Laboratory, <sup>3</sup>PKU, <sup>4</sup>Bytedance Seed, <sup>5</sup>CUHK.
Can we design a quantization scheme specifically tailored to MoE models that effectively balances model accuracy and computational efficiency?
Insight:
- Linear-block level: Linear blocks exhibit varying quantization sensitivity
- Expert level: Imbalanced activation frequencies $\Rightarrow$ heterogeneous computational characteristics (e.g. some experts are compute-bound while others are memory-bound).
Approach:
We explore the automated design of mixed-precision quantization schemes for MoE models:
- Assign bitwidths at the linear-block level.
- Optimize the bitwidth allocation, formulated as an integer linear program (ILP), taking both model accuracy (quantization loss estimation) and computational efficiency (a performance model) into consideration; an illustrative sketch follows this list.
- Generate mixed-precision GroupGEMM operators through template-based kernel generation.
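For intuition, the bitwidth allocation can be written as a small ILP over per-block config choices. The sketch below uses PuLP with made-up loss/latency numbers and a hypothetical latency budget; it only illustrates the shape of the problem, not the actual formulation inside `mxmoe.quant.bits_solver`:

```python
# Toy ILP for bitwidth allocation (illustrative; all numbers are made up).
import pulp

blocks = ["expert0.gate_proj", "expert0.up_proj", "expert0.down_proj"]
configs = ["w4a4_g-1_sym", "w8a8_g-1_sym"]

# Assumed calibration output: estimated quantization loss per (block, config).
quant_loss = {
    ("expert0.gate_proj", "w4a4_g-1_sym"): 0.8, ("expert0.gate_proj", "w8a8_g-1_sym"): 0.1,
    ("expert0.up_proj",   "w4a4_g-1_sym"): 0.5, ("expert0.up_proj",   "w8a8_g-1_sym"): 0.1,
    ("expert0.down_proj", "w4a4_g-1_sym"): 1.2, ("expert0.down_proj", "w8a8_g-1_sym"): 0.2,
}
# Assumed performance model: profiled kernel latency (us) per (block, config).
latency = {(b, c): 10.0 if "w4a4" in c else 18.0 for b in blocks for c in configs}
latency_budget = 40.0  # hypothetical budget

prob = pulp.LpProblem("bitwidth_allocation", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (blocks, configs), cat="Binary")

# Minimize total estimated quantization loss ...
prob += pulp.lpSum(quant_loss[b, c] * x[b][c] for b in blocks for c in configs)
# ... picking exactly one quantization config per linear block ...
for b in blocks:
    prob += pulp.lpSum(x[b][c] for c in configs) == 1
# ... while the modeled total latency stays within the budget.
prob += pulp.lpSum(latency[b, c] * x[b][c] for b in blocks for c in configs) <= latency_budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for b in blocks:
    print(b, "->", next(c for c in configs if pulp.value(x[b][c]) > 0.5))
```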
1. Pre-requirement

   ```bash
   # 0. virtual env
   conda create -n mxmoe python=3.12 -y
   # 1. source code dependencies
   git submodule update --init --recursive
   # 2. python package dependencies
   pip install -r requirements.txt
   cd mxmoe/3rdparty/fast-hadamard-transform && pip install . && cd -
   ```
2. View the activation statistics of MoE models:

   ```bash
   # e.g. sample data from humaneval-x to observe qwen2_moe (in fact qwen1.5moe)
   CUDA_VISIBLE_DEVICES=0 python -m mxmoe.quant.moe_tracer --model qwen2_moe --trace_gate --dataset humaneval-x
   ```
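   To see the imbalance mentioned in the Insight above, one can count expert selections directly from router logits. Below is a minimal PyTorch sketch; the shapes and the top-4-of-60 routing are illustrative assumptions, not the tracer's actual interface:

   ```python
   # Count how often each expert is selected by a top-k router (illustrative;
   # mxmoe.quant.moe_tracer instead hooks the gates of a real model).
   import torch

   num_experts, top_k = 60, 4   # assumed routing setup: 4 of 60 experts per token
   tokens = 8192

   router_logits = torch.randn(tokens, num_experts)      # stand-in for real gate outputs
   topk_idx = router_logits.topk(top_k, dim=-1).indices  # (tokens, top_k) selected experts
   freq = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
   freq /= freq.sum()

   print("most used expert :", freq.argmax().item(), f"({freq.max().item():.2%} of slots)")
   print("least used expert:", freq.argmin().item(), f"({freq.min().item():.2%} of slots)")
   ```

   Frequently-hit experts process many tokens per batch (compute-bound), while rarely-hit experts are dominated by weight loads (memory-bound).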
3. Calibration

   - Get the quant loss of each linear block of `<MOE_MODEL>` (e.g. `qwen2_moe`) under a certain `<QUANTIZATION_CONFIG>` (e.g. `w4a4_g-1_sym`: 4-bit weights and activations, group size -1 i.e. per-channel, symmetric). A simplified sketch of this measurement follows this list:

     ```bash
     CUDA_VISIBLE_DEVICES=0 python -m mxmoe.quant.quant calib --model qwen2_moe --method rtn --metric layer_out_norm --qcfg w4a4_g-1_sym
     ```
   - Solve the ILP based on the quant loss and the kernel profiles. The resulting quantization scheme will be saved in `qconfigs`:

     ```bash
     # e.g. reproduce mxmoe w5a5
     python -m mxmoe.quant.bits_solver --model qwen2_moe --qtype gptq-had --wbits 5.0 --solve_mode layer --batch 8192 --filter_list w4a4_g-1_sym w8a8_g-1_sym
     ```
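   As a reference for what the first sub-step measures, here is a simplified round-to-nearest (RTN) weight quantizer with a `layer_out_norm`-style loss. It is a sketch under assumed shapes and only quantizes weights; the real calibration in `mxmoe.quant.quant` also handles activation quantization and the Hadamard variants:

   ```python
   # Simplified RTN w4 (g-1, symmetric) quantization + output-norm loss (illustrative).
   import torch

   def rtn_sym_quant(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
       """Symmetric round-to-nearest per output channel (group size -1)."""
       qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit symmetric
       scale = w.abs().amax(dim=1, keepdim=True) / qmax
       return (w / scale).round().clamp(-qmax - 1, qmax) * scale

   x = torch.randn(512, 2048)    # hypothetical calibration activations
   w = torch.randn(5632, 2048)   # hypothetical weight of one linear block

   w_q = rtn_sym_quant(w, bits=4)
   # layer_out_norm-style metric: distance between original and quantized outputs.
   loss = (x @ w.T - x @ w_q.T).norm() / (x @ w.T).norm()
   print(f"relative output error: {loss.item():.4f}")
   ```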
4. Accuracy Eval. You can reproduce the experiments in the paper by setting the corresponding tasks and quantization configs:

   ```bash
   # e.g. evaluate the performance of qwen1.5_moe under the RTN w4a4_g-1_sym quantization config
   CUDA_VISIBLE_DEVICES=2 python -m mxmoe.quant.quant eval --model qwen2_moe --method rtn-had --qstr w4a4_g-1_sym --tasks ppl
   ```
5. Performance (Computational Efficiency) Eval. After obtaining the mixed-precision scheme (Step 3), we can automatically generate the corresponding GroupGEMM kernels; please refer to `run_mxmoe_gg.py`. A pure-PyTorch reference of what this operator computes follows the list:

   ```bash
   # e.g. test groupgemm in layer-11 of the qwen2_moe model (FP16)
   python run_mxmoe_gg.py --model qwen2_moe --bs 8192 --layer 11
   # e.g. test groupgemm in layer-11 of the qwen2_moe model under mixed-precision
   python run_mxmoe_gg.py --model qwen2_moe --bs 8192 --layer 11 --qconfig <QCONFIG> --tile_config <TCONFIG>
   ```
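As a functional reference: a GroupGEMM runs one independent GEMM per expert over the tokens routed to it, and the mixed-precision variant lets each group use weights quantized at its own bitwidth. The loop below (names and shapes are illustrative) is a semantic stand-in, not the generated CUDA kernel:

```python
# Pure-PyTorch reference of a GroupGEMM: one GEMM per expert group. In the
# mixed-precision case, each expert's weight would be dequantized from its own
# bitwidth; here we simply use random full-precision weights.
import torch

def grouped_gemm(x, expert_ids, weights):
    """x: (tokens, hidden); expert_ids: (tokens,); weights[e]: (hidden, out)."""
    out = x.new_empty(x.shape[0], weights[0].shape[1])
    for e, w in enumerate(weights):
        mask = expert_ids == e            # tokens routed to expert e
        if mask.any():
            out[mask] = x[mask] @ w       # one independent GEMM per group
    return out

tokens, hidden, ffn, n_exp = 8192, 2048, 5632, 4    # hypothetical sizes
x = torch.randn(tokens, hidden)
expert_ids = torch.randint(0, n_exp, (tokens,))     # router assignment per token
weights = [torch.randn(hidden, ffn) for _ in range(n_exp)]
y = grouped_gemm(x, expert_ids, weights)            # (tokens, ffn)
```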
If you find MxMoE useful or relevant to your project or research, please cite our paper:
```bibtex
@article{duanmu2025mxmoe,
  title={MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design},
  author={Duanmu, Haojie and Li, Xiuhong and Yuan, Zhihang and Zheng, Size and Duan, Jiangfei and Zhang, Xingcheng and Lin, Dahua},
  journal={arXiv preprint arXiv:2505.05799},
  year={2025}
}
```