Hetu is a high-performance distributed deep learning system targeting trillion-parameter DL model training, developed by the DAIR Lab at Peking University. It takes into account both the high availability required in industry and the innovation pursued in academia.
This is a preview of Hetu 2.0, which is still under rapid development. Please raise an issue if you need any help.
We welcome everyone interested in machine learning or graph computing to contribute code, create issues, or open pull requests. Please refer to the Contribution Guide for more details.
- Clone the repository.
- Prepare the environment. We use Anaconda to manage packages. The following command creates the conda environment to be used: conda env create -f environment.yml. Please prepare the CUDA toolkit, cuDNN, and gRPC in advance.
- We use CMake to compile Hetu. Please copy the example configuration for compilation by cp cmake/config.example.cmake cmake/config.cmake. Users can modify the configuration file to enable/disable the compilation of each module. For advanced users (not using the provided conda environment), the prerequisites for the different modules of Hetu are listed in the appendix.
# modify paths and configurations in cmake/config.cmake
# generate Makefile
mkdir build && cd build && cmake ..
# compile
# make hetu, version is specified in cmake/config.cmake
make -j 32
- Prepare the environment for running. Edit the hetu.exp file and set the environment path for Python and the path to the mpirun executable if necessary (for advanced users not using the provided conda environment). Then execute the command source hetu.exp.
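For convenience, the steps above can be chained into a single shell session. The sketch below is only illustrative: the repository URL is a placeholder, and the conda environment name and the final import check of a Python package named hetu are assumptions, so adjust them to your setup.

```bash
# End-to-end build sketch (placeholder repo URL; env name and package name are assumptions).
git clone <hetu-repo-url> hetu && cd hetu

# Create and activate the conda environment shipped with the repository.
conda env create -f environment.yml
conda activate hetu                      # environment name assumed; check environment.yml

# Copy the example build configuration and edit paths/modules as needed.
cp cmake/config.example.cmake cmake/config.cmake

# Generate the Makefile and compile.
mkdir -p build && cd build && cmake .. && make -j 32 && cd ..

# Set the runtime environment, then verify that the Python package can be imported.
source hetu.exp
python -c "import hetu; print('Hetu imported successfully')"
```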
- Email: [email protected], [email protected]
- Click here to join our Slack community.
- Hetu homepage: https://hetu-doc.readthedocs.io
- Committers & Contributors
- Contributing to Hetu
- Development plan
If you are an enterprise user and find Hetu useful in your work, please let us know, and we would be glad to add your company logo here.
- Tencent Inc.
- Alibaba Cloud.
- Kuaishou Tech.
The entire codebase is under license
We have proposed numerous innovative optimization techniques around the Hetu system and published several papers covering a variety of model workloads and hardware environments.
- Haoyang Li, Fangcheng Fu, Hao Ge, Sheng Lin, Xuanyu Wang, Jiawen Niu, Xupeng Miao, Bin Cui. Hetu v2: A General and Scalable Deep Learning System with Hierarchical and Heterogeneous Single Program Multiple Data Annotations. arXiv 2025 [code]
- Xupeng Miao, Xiaonan Nie, Hailin Zhang, Tong Zhao and Bin Cui. Hetu: A highly efficient automatic parallel distributed deep learning system. SCIS 2022 [code]
- Haoyang Li, Fangcheng Fu, Sheng Lin, Hao Ge, Xuanyu Wang, Jiawen Niu, Jinbao Xue, Yangyu Tao, Di Wang, Jie Jiang, Bin Cui. Hydraulis: Balancing Large Transformer Model Training via Co-designing Parallel Strategies and Data Assignment. SIGMOD 2026 [code]
- Sheng Lin, Fangcheng Fu, Haoyang Li, Hao Ge, Xuanyu Wang, Jiawen Niu, Yaofeng Tu, Bin Cui. LobRA: Multi-tenant Fine-tuning over Heterogeneous Data. VLDB 2025 [code]
- Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, Bin Cui. FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism. ASPLOS 2025 [code]
- Yujie Wang, Shenhan Zhu, Fangcheng Fu, Xupeng Miao, Jie Zhang, Juan Zhu, Fan Hong, Yong Li, Bin Cui. Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront Scheduling. ASPLOS 2025
- Hao Ge, Junda Feng, Qi Huang, Fangcheng Fu, Xiaonan Nie, Lei Zuo, Haibin Lin, Bin Cui, Xin Liu. ByteScale: Communication-Efficient Scaling of LLM Training with a 2048K Context Length on 16384 GPUs. SIGCOMM 2025
- Hao Ge, Fangcheng Fu, Haoyang Li, Xuanyu Wang, Sheng Lin, Yujie Wang, Xiaonan Nie, Hailin Zhang, Xupeng Miao, Bin Cui. Enabling Parallelism Hot Switching for Efficient Training of Large Language Models. SOSP 2024 [code]
- Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Xiaonan Nie, Bin Cui. Improving Automatic Parallel Training via Balanced Memory Workload Optimization. TKDE 2024 [code]
- Xupeng Miao, Shenhan Zhu, Fangcheng Fu, Ziyu Guo, Zhi Yang, Yaofeng Tu, Zhihao Jia, Bin Cui. Reviving Efficient Attention for Long Context Language Modeling: A Survey. IJCAI 2024 [code]
- Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, Bin Cui. Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism. VLDB 2023 [code]
- Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie and Bin Cui. OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning. IJCAI 2023 [code]
- Xinyi Liu, Yujie Wang, Fangcheng Fu, Xupeng Miao, Shenhan Zhu, Xiaonan Nie, Bin Cui. NetMoE: Accelerating MoE Training through Dynamic Sample Placement. ICLR 2025
- Xiaonan Nie, Qibin Liu, Fangcheng Fu, Shenhan Zhu, Xupeng Miao, Xiaoyang Li, Yang Zhang, Shouda Liu, Bin Cui. LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing. NeurIPS 2024
- Xiaonan Nie, Xupeng Miao, Zilong Wang, Jilong Xue and Lingxiao Ma, Zichao Yang, Gang Cao, Bin Cui. FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement. SIGMOD 2023
- Xiaonan Nie, Pinxue Zhao, Xupeng Miao, Tong Zhao, Bin Cui. HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System. arXiv 2022 [code]
- Xiaonan Nie, Shijie Cao, Xupeng Miao, Lingxiao Ma, Jilong Xue, Youshan Miao, Zichao Yang, Zhi Yang, Bin Cui. EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate. arXiv 2021 [code]
- Hailin Zhang, Penghao Zhao, Xupeng Miao, Yingxia Shao, Zirui Liu, Tong Yang, Bin Cui. Experimental Analysis of Large-scale Learnable Vector Storage Compression. VLDB 2024 [code]
- Hailin Zhang, Zirui Liu, Boxuan Chen, Yikai Zhao, Tong Zhao, Tong Yang, Bin Cui. CAFE: Towards Compact, Adaptive, and Fast Embedding for Large-scale Recommendation Models. SIGMOD 2024 [code]
- Xupeng Miao, Hailin Zhang, Yining Shi, Xiaonan Nie, Zhi Yang, Yangyu Tao, Bin Cui. HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework. VLDB 2022 (Best Scalable Data Science Paper) [code]
- Xupeng Miao, Yining Shi, Hailin Zhang, Xin Zhang, Xiaonan Nie, Zhi Yang, Bin Cui. HET-GMP: a Graph-based System Approach to Scaling Large Embedding Model Training. SIGMOD 2022 [code]
- Sicong Dong, Xupeng Miao, Pengkai Liu, Xin Wang, Bin Cui, Jianxin Li. HET-KG: Communication-Efficient Knowledge Graph Embedding Training via Hotness-Aware Cache. ICDE 2022
- Xuanyu Wang, Fangcheng Fu, Haoyang Li, Hao Ge, Sheng Lin, Jiawen Niu, Bin Cui. Elastor: Elastic and Efficient Model Partitioning and Checkpointing for Fault-tolerant Distributed Training. PPoPP 2026 [code]
- Haoyang Li, Fangcheng Fu, Hao Ge, Sheng Lin, Xuanyu Wang, Jiawen Niu, Yujie Wang, Hailin Zhang, Xiaonan Nie, Bin Cui. Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization. SIGMOD 2025 [code]
- Xupeng Miao, Yining Shi, Zhi Yang, Bin Cui, Zhihao Jia. SDPipe: A Semi-Decentralized Framework for Heterogeneity-aware Pipeline-parallel Training. VLDB 2023 [code]
- Xupeng Miao, Xiaonan Nie, Yingxia Shao, Zhi Yang, Jiawei Jiang, Lingxiao Ma, Bin Cui. Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce. SIGMOD 2021
- Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui. PQCache: Product Quantization-based KVCache for Long Context LLM Inference. SIGMOD 2025 [code]
- Yifei Xia, Suhan Ling, Fangcheng Fu, Yujie Wang, Huixia Li, Xuefeng Xiao, Bin Cui. Training-free and Adaptive Sparse Attention for Efficient Long Video Generation. ICCV 2025
- Zihao Yu, Haoyang Li, Fangcheng Fu, Xupeng Miao, Bin Cui. Accelerating Text-to-image Editing via Cache-enabled Sparse Diffusion Inference. AAAI 2024 [code]
- Xupeng Miao, Lingxiao Ma, Zhi Yang, Yingxia Shao, Bin Cui, Lele Yu, Jiawei Jiang. CuWide: Towards Efficient Flow-based Training for Sparse Wide Models on GPUs. TKDE 2021, ICDE 2021 [code]
- Pinxue Zhao, Hailin Zhang, Fangcheng Fu, Xiaonan Nie, Qibin Liu, Fang Yang, Yuanbo Peng, Dian Jiao, Shuaipeng Li, Jinbao Xue, Yangyu Tao, Bin Cui. Memo: Fine-grained Tensor Management For Ultra-long Context LLM Training. SIGMOD 2025 [code]
- Xiaonan Nie, Yi Liu, Fangcheng Fu, Jinbao Xue, Dian Jiao, Xupeng Miao, Yangyu Tao, Bin Cui. Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent. VLDB 2023
- Xiaonan Nie, Xupeng Miao, Zhi Yang, Bin Cui. TSplit: Fine-grained GPU Memory Management for Efficient DNN Training via Tensor Splitting. ICDE 2022 [code]
- Xupeng Miao, Yujie Wang, Jia Shen, Yingxia Shao, Bin Cui. Graph Neural Network Training Acceleration over Multi-GPUs. Journal of Software (Chinese)
If you use Hetu in a scientific publication, we would appreciate citations to the following papers:
@article{DBLP:journals/corr/abs-2504-20490,
author = {Haoyang Li and Fangcheng Fu and Hao Ge and Sheng Lin and Xuanyu Wang and Jiawen Niu and Xupeng Miao and Bin Cui},
title = {Hetu v2: A General and Scalable Deep Learning System with Hierarchical and Heterogeneous Single Program Multiple Data Annotations},
journal = {CoRR},
volume = {abs/2504.20490},
year = {2025}
}
@article{DBLP:journals/chinaf/MiaoXP22,
author = {Miao, Xupeng and Nie, Xiaonan and Zhang, Hailin and Zhao, Tong and Cui, Bin},
title = {Hetu: A highly efficient automatic parallel distributed deep learning system},
journal = {Sci. China Inf. Sci.},
url = {http://engine.scichina.com/doi/10.1007/s11432-022-3581-9},
doi = {10.1007/s11432-022-3581-9},
year = {2022}
}
@inproceedings{DBLP:conf/sosp/GeFLWLWN0M024,
author = {Hao Ge and Fangcheng Fu and Haoyang Li and Xuanyu Wang and Sheng Lin and Yujie Wang and Xiaonan Nie and Hailin Zhang and Xupeng Miao and Bin Cui},
title = {Enabling Parallelism Hot Switching for Efficient Training of Large Language Models},
booktitle = {Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles},
pages = {178--194},
year = {2024},
url = {https://doi.org/10.1145/3694715.3695969},
doi = {10.1145/3694715.3695969}
}
@article{miao2021het,
title = {HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework},
author = {Miao, Xupeng and Zhang, Hailin and Shi, Yining and Nie, Xiaonan and Yang, Zhi and Tao, Yangyu and Cui, Bin},
journal = {Proc. {VLDB} Endow.},
volume = {15},
number = {2},
pages = {312--320},
year = {2022},
publisher = {VLDB Endowment}
}
We learned and borrowed insights from a few open source projects including TinyFlow, autodist, tf.distribute, FlexFlow and Angel.
The prerequisites for the different modules in Hetu are listed as follows (a quick version-check sketch follows the list):
- OpenMP (*)
- CMake >= 3.24 (*)
- gRPC >= 1.6.3 (*)
- CUDA >= 11.8 (*)
- CUDNN >= 8.2 (*)
- MPI >= 4.1 (*)
- NCCL >= 2.19 (*)
- Pybind11 >= 2.6.2 (*)
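As a quick sanity check, most of these prerequisites expose their versions on the command line. The sketch below is informal: the cuDNN and NCCL header locations are common defaults rather than guaranteed paths, and gRPC is omitted because the way to query its version depends on how it was installed.

```bash
# Informal version checks for the build prerequisites (paths are assumptions, not guarantees).
cmake --version | head -n 1                                   # CMake >= 3.24
nvcc --version | grep release                                 # CUDA >= 11.8
mpirun --version | head -n 1                                  # MPI >= 4.1
python -c "import pybind11; print(pybind11.__version__)"      # Pybind11 >= 2.6.2
# cuDNN and NCCL record their versions in headers; these locations are common defaults.
grep -E "define CUDNN_(MAJOR|MINOR|PATCHLEVEL)" /usr/include/cudnn_version.h 2>/dev/null
grep -E "define NCCL_(MAJOR|MINOR|PATCH)" /usr/include/nccl.h 2>/dev/null
# Check that the C preprocessor advertises OpenMP support.
echo | cpp -fopenmp -dM 2>/dev/null | grep -i _OPENMP
```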




