Hecate: Unlocking Efficient Sparse Model Training via Fully Sharded Sparse Data Parallelism

Qing, Yuhao; Zhu, Guichao; Li, Fanxin; Lei, Lintian; Sun, Zekai; Guan, Xiuxian; Zhao, Shixiong; Chen, Xusheng; Huang, Dong; Wang, Sen; Cui, Heming

arXiv:2502.02581v1 (cs)

[Submitted on 4 Feb 2025]

Abstract: Mixture-of-Experts (MoE) has emerged as a promising sparse paradigm for scaling up pre-trained models (PTMs) with remarkable cost-effectiveness. However, the dynamic nature of MoE leads to rapid fluctuations and imbalances in expert loads during training, resulting in significant straggler effects that hinder training performance when using expert parallelism (EP). Existing MoE training systems attempt to mitigate these effects through expert rearrangement strategies, but they face challenges in memory efficiency and in the timeliness of rearrangement. This paper proposes Fully Sharded Sparse Data Parallelism (FSSDP), an innovative approach that tackles the parallelization of MoE layers and the potential straggler effects caused by imbalanced expert loads from a new perspective. FSSDP fully shards the parameters and optimizer states of MoE layers across devices and sparsely materializes MoE parameters from scratch in each iteration with two sparse collectives, SparseAllGather and SparseReduceScatter. We build Hecate, a high-performance MoE training system that incorporates FSSDP to fully unlock its potential. Hecate introduces heterogeneous sharding, sparse materialization, and re-materialization techniques to construct flexible and efficient expert placements with low memory and communication overhead. Our evaluation reveals that Hecate achieves up to a 3.54x speedup over state-of-the-art MoE training systems and consistently demonstrates improvements across model architectures and hardware environments.
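
The abstract names the two sparse collectives but does not define their semantics. As a rough illustration only, the single-process Python sketch below models one plausible reading: each expert has exactly one owner device holding its shard, SparseAllGather copies an expert only to the devices that a per-iteration placement asks for, and SparseReduceScatter sums the gradients of all replicas back to the owner. The function names, the round-robin ownership, and the hand-picked placement are assumptions made for this sketch; they are not Hecate's API or its placement algorithm.

```python
from typing import Dict, List, Set

Params = List[float]


def shard_experts(num_experts: int, num_devices: int) -> Dict[int, int]:
    """Assumed ownership rule: expert e is owned by device e % num_devices."""
    return {e: e % num_devices for e in range(num_experts)}


def sparse_all_gather(
    shards: Dict[int, Params], placement: Dict[int, Set[int]]
) -> Dict[int, Dict[int, Params]]:
    """Give each device a copy of only the experts listed in its placement.

    A dense all-gather would copy every expert to every device instead.
    """
    return {
        dev: {e: list(shards[e]) for e in sorted(experts)}
        for dev, experts in placement.items()
    }


def sparse_reduce_scatter(
    local_grads: Dict[int, Dict[int, Params]], owner: Dict[int, int]
) -> Dict[int, Dict[int, Params]]:
    """Sum each expert's gradients over its replicas and hand the sum to its owner."""
    per_owner: Dict[int, Dict[int, Params]] = {}
    for grads in local_grads.values():
        for e, g in grads.items():
            acc = per_owner.setdefault(owner[e], {}).setdefault(e, [0.0] * len(g))
            for i, v in enumerate(g):
                acc[i] += v
    return per_owner


if __name__ == "__main__":
    owner = shard_experts(num_experts=4, num_devices=2)
    shards = {e: [float(e), float(e)] for e in owner}  # toy 2-element "parameters"
    # Hypothetical placement: expert 0 is "hot", so it is replicated on both devices.
    placement = {0: {0, 2}, 1: {0, 1, 3}}
    replicas = sparse_all_gather(shards, placement)
    # Pretend every replica produced a gradient of ones.
    grads = {dev: {e: [1.0, 1.0] for e in held} for dev, held in replicas.items()}
    print(sparse_reduce_scatter(grads, owner))
```

In this toy run, expert 0 has two replicas, so its owner (device 0) receives the summed gradient `[2.0, 2.0]`, while experts with a single replica receive `[1.0, 1.0]`. Expert 2 is never copied to device 1, which is the sense in which the materialization is sparse.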

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2502.02581 [cs.DC]
	(or arXiv:2502.02581v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2502.02581

Submission history

From: Yuhao Qing
[v1] Tue, 4 Feb 2025 18:56:00 UTC (803 KB)
