Haojie Duanmu<sup>1,2</sup> | Xiuhong Li<sup>3</sup> | Zhihang Yuan<sup>3</sup> | Size Zheng<sup>4</sup> | Jiangfei Duan<sup>5</sup>
Xingcheng Zhang<sup>2</sup> | Dahua Lin<sup>2,5</sup>

<sup>1</sup>SJTU, <sup>2</sup>Shanghai AI Laboratory, <sup>3</sup>PKU, <sup>4</sup>Bytedance Seed, <sup>5</sup>CUHK.
Can we design a quantization scheme specifically tailored to MoE models that effectively balances model accuracy and computational efficiency?
Insight:
- Linear-block level: Linear blocks exhibit varying quantization sensitivity
- Expert level: Imbalanced activation frequencies $\Rightarrow$ heterogeneous computational characteristics (e.g. some experts are compute-bound while others are memory-bound).
Approach:
We explore the automated design of mixed-precision quantization schemes for MoE models:
- Assign bitwidths at the linear-block level.
- Optimize the bitwidth allocation, formulated as an integer linear program (ILP), taking both model accuracy (quantization loss estimation) and computational efficiency (a performance model) into consideration; an illustrative sketch follows this list.
- Generate mixed-precision GroupGEMM operators through template-based kernel generation.
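For intuition, the bitwidth allocation can be written as a small ILP over per-block config choices. The sketch below uses PuLP with made-up loss/latency numbers and a hypothetical latency budget; it only illustrates the shape of the problem, not the actual formulation inside `mxmoe.quant.bits_solver`:

```python
# Toy ILP for bitwidth allocation (illustrative; all numbers are made up).
import pulp

blocks = ["expert0.gate_proj", "expert0.up_proj", "expert0.down_proj"]
configs = ["w4a4_g-1_sym", "w8a8_g-1_sym"]

# Assumed calibration output: estimated quantization loss per (block, config).
quant_loss = {
    ("expert0.gate_proj", "w4a4_g-1_sym"): 0.8, ("expert0.gate_proj", "w8a8_g-1_sym"): 0.1,
    ("expert0.up_proj",   "w4a4_g-1_sym"): 0.5, ("expert0.up_proj",   "w8a8_g-1_sym"): 0.1,
    ("expert0.down_proj", "w4a4_g-1_sym"): 1.2, ("expert0.down_proj", "w8a8_g-1_sym"): 0.2,
}
# Assumed performance model: profiled kernel latency (us) per (block, config).
latency = {(b, c): 10.0 if "w4a4" in c else 18.0 for b in blocks for c in configs}
latency_budget = 40.0  # hypothetical budget

prob = pulp.LpProblem("bitwidth_allocation", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (blocks, configs), cat="Binary")

# Minimize total estimated quantization loss ...
prob += pulp.lpSum(quant_loss[b, c] * x[b][c] for b in blocks for c in configs)
# ... picking exactly one quantization config per linear block ...
for b in blocks:
    prob += pulp.lpSum(x[b][c] for c in configs) == 1
# ... while the modeled total latency stays within the budget.
prob += pulp.lpSum(latency[b, c] * x[b][c] for b in blocks for c in configs) <= latency_budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for b in blocks:
    print(b, "->", next(c for c in configs if pulp.value(x[b][c]) > 0.5))
```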
1. Pre-requirement

   ```bash
   # 0. virtual env
   conda create -n mxmoe python=3.12 -y
   # 1. source code dependencies
   git submodule update --init --recursive
   # 2. python package dependencies
   pip install -r requirements.txt
   cd mxmoe/3rdparty/fast-hadamard-transform && pip install . && cd -
   ```
2. View the activation statistics of MoE models:

   ```bash
   # e.g. sample data from humaneval-x to observe qwen2_moe (in fact qwen1.5moe)
   CUDA_VISIBLE_DEVICES=0 python -m mxmoe.quant.moe_tracer --model qwen2_moe --trace_gate --dataset humaneval-x
   ```
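   To see the imbalance mentioned in the Insight above, one can count expert selections directly from router logits. Below is a minimal PyTorch sketch; the shapes and the top-4-of-60 routing are illustrative assumptions, not the tracer's actual interface:

   ```python
   # Count how often each expert is selected by a top-k router (illustrative;
   # mxmoe.quant.moe_tracer instead hooks the gates of a real model).
   import torch

   num_experts, top_k = 60, 4   # assumed routing setup: 4 of 60 experts per token
   tokens = 8192

   router_logits = torch.randn(tokens, num_experts)      # stand-in for real gate outputs
   topk_idx = router_logits.topk(top_k, dim=-1).indices  # (tokens, top_k) selected experts
   freq = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
   freq /= freq.sum()

   print("most used expert :", freq.argmax().item(), f"({freq.max().item():.2%} of slots)")
   print("least used expert:", freq.argmin().item(), f"({freq.min().item():.2%} of slots)")
   ```

   Frequently-hit experts process many tokens per batch (compute-bound), while rarely-hit experts are dominated by weight loads (memory-bound).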
3. Calibration

   - Get the quant loss of each linear block of `<MOE_MODEL>` (e.g. `qwen2_moe`) under a certain `<QUANTIZATION_CONFIG>` (e.g. `w4a4_g-1_sym`: 4-bit weights and activations, group size -1 i.e. per-channel, symmetric). A simplified sketch of this measurement follows this list:

     ```bash
     CUDA_VISIBLE_DEVICES=0 python -m mxmoe.quant.quant calib --model qwen2_moe --method rtn --metric layer_out_norm --qcfg w4a4_g-1_sym
     ```
   - Solve the ILP based on the quant loss and the kernel profiles. The resulting quantization scheme will be saved in `qconfigs`:

     ```bash
     # e.g. reproduce mxmoe w5a5
     python -m mxmoe.quant.bits_solver --model qwen2_moe --qtype gptq-had --wbits 5.0 --solve_mode layer --batch 8192 --filter_list w4a4_g-1_sym w8a8_g-1_sym
     ```
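   As a reference for what the first sub-step measures, here is a simplified round-to-nearest (RTN) weight quantizer with a `layer_out_norm`-style loss. It is a sketch under assumed shapes and only quantizes weights; the real calibration in `mxmoe.quant.quant` also handles activation quantization and the Hadamard variants:

   ```python
   # Simplified RTN w4 (g-1, symmetric) quantization + output-norm loss (illustrative).
   import torch

   def rtn_sym_quant(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
       """Symmetric round-to-nearest per output channel (group size -1)."""
       qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit symmetric
       scale = w.abs().amax(dim=1, keepdim=True) / qmax
       return (w / scale).round().clamp(-qmax - 1, qmax) * scale

   x = torch.randn(512, 2048)    # hypothetical calibration activations
   w = torch.randn(5632, 2048)   # hypothetical weight of one linear block

   w_q = rtn_sym_quant(w, bits=4)
   # layer_out_norm-style metric: distance between original and quantized outputs.
   loss = (x @ w.T - x @ w_q.T).norm() / (x @ w.T).norm()
   print(f"relative output error: {loss.item():.4f}")
   ```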
4. Accuracy Eval. You can reproduce the experiments in the paper by setting the corresponding tasks and quantization configs:

   ```bash
   # e.g. evaluate the performance of qwen1.5_moe under the RTN w4a4_g-1_sym quantization config
   CUDA_VISIBLE_DEVICES=2 python -m mxmoe.quant.quant eval --model qwen2_moe --method rtn-had --qstr w4a4_g-1_sym --tasks ppl
   ```
5. Performance (Computational Efficiency) Eval. After obtaining the mixed-precision scheme (Step 3), we can automatically generate the corresponding GroupGEMM kernels; please refer to `run_mxmoe_gg.py`. A pure-PyTorch reference of what this operator computes follows the list:

   ```bash
   # e.g. test groupgemm in layer-11 of the qwen2_moe model (FP16)
   python run_mxmoe_gg.py --model qwen2_moe --bs 8192 --layer 11
   # e.g. test groupgemm in layer-11 of the qwen2_moe model under mixed-precision
   python run_mxmoe_gg.py --model qwen2_moe --bs 8192 --layer 11 --qconfig <QCONFIG> --tile_config <TCONFIG>
   ```
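As a functional reference: a GroupGEMM runs one independent GEMM per expert over the tokens routed to it, and the mixed-precision variant lets each group use weights quantized at its own bitwidth. The loop below (names and shapes are illustrative) is a semantic stand-in, not the generated CUDA kernel:

```python
# Pure-PyTorch reference of a GroupGEMM: one GEMM per expert group. In the
# mixed-precision case, each expert's weight would be dequantized from its own
# bitwidth; here we simply use random full-precision weights.
import torch

def grouped_gemm(x, expert_ids, weights):
    """x: (tokens, hidden); expert_ids: (tokens,); weights[e]: (hidden, out)."""
    out = x.new_empty(x.shape[0], weights[0].shape[1])
    for e, w in enumerate(weights):
        mask = expert_ids == e            # tokens routed to expert e
        if mask.any():
            out[mask] = x[mask] @ w       # one independent GEMM per group
    return out

tokens, hidden, ffn, n_exp = 8192, 2048, 5632, 4    # hypothetical sizes
x = torch.randn(tokens, hidden)
expert_ids = torch.randint(0, n_exp, (tokens,))     # router assignment per token
weights = [torch.randn(hidden, ffn) for _ in range(n_exp)]
y = grouped_gemm(x, expert_ids, weights)            # (tokens, ffn)
```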
If you find MxMoE useful or relevant to your project or research, please cite our paper:
```bibtex
@article{duanmu2025mxmoe,
  title={MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design},
  author={Duanmu, Haojie and Li, Xiuhong and Yuan, Zhihang and Zheng, Size and Duan, Jiangfei and Zhang, Xingcheng and Lin, Dahua},
  journal={arXiv preprint arXiv:2505.05799},
  year={2025}
}
```