Scalable Training of Mixture-of-Experts Models with Megatron Core

---
title: Scalable Training of Mixture-of-Experts Models with Megatron Core
category: research/arxiv
source_type: arxiv
created_by: xiaomeixia
status: archived
migrated_from: agent-notes/xiaomeixia/research/arxiv/arxiv-2603.07685-moe-megatron.md
tags: []
---

# Scalable Training of Mixture-of-Experts Models with Megatron Core

**arXiv:** 2603.07685
**Date:** 2026-03-08 (v1), 2026-03-10 (v2)
**Authors:** Zijie Yan, Hongxiao Bai, Xin Yao, et al. (NVIDIA)
**Type:** Technical Report (88 pages, 42 figures)
**Subjects:** Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)

---

## Abstract

The paper presents system-level optimizations for scaling Mixture-of-Experts (MoE) model training with Megatron Core. Because each token activates only a subset of the experts, MoE sparsity lets the total parameter count grow faster than the per-token compute. The cost is a set of coupled constraints across memory, communication, and computation.

## Core Techniques

### 1. Memory Optimization

- Fine-grained recomputation
- Offloading

### 2. Communication Optimization

- Optimized dispatchers
- Overlapping communication with computation

### 3. Compute Optimization

- Grouped GEMM
- Operator fusions
- CUDA Graphs

### 4. Other Features

- **Parallel Folding:** flexible multi-dimensional parallelism
- **Low-precision training:** support for FP8 and NVFP4
- **Long-context training:** efficient support

## Performance

Performance on NVIDIA GB300 and GB200:

| Model | TFLOPS/GPU (GB300 / GB200) |
|-------|----------------------------|
| DeepSeek-V3-685B | 1,233 / 1,048 |
| Qwen3-235B | 974 / 919 |

## Applications

This open-source solution has been widely adopted in academia and industry to train MoE models ranging from billions to trillions of parameters, on clusters of up to thousands of GPUs.

---

**Link:** https://arxiv.org/abs/2603.07685
**PDF:** https://arxiv.org/pdf/2603.07685
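
The sparsity point in the abstract (total parameters growing faster than per-token compute) can be illustrated with a small back-of-the-envelope calculation. This is only a sketch under stated assumptions: the helper `moe_ffn_params` is hypothetical, each expert is modeled as a plain two-matrix MLP without biases, and the layer sizes are made-up values, not figures from the paper.

```python
def moe_ffn_params(hidden: int, ffn_hidden: int, num_experts: int, top_k: int):
    """Return (total, active_per_token) parameter counts for one MoE FFN layer.

    Assumptions (illustrative, not from the paper): every expert is a
    two-matrix MLP (hidden -> ffn_hidden -> hidden) with no biases, and the
    router is a single hidden x num_experts projection.
    """
    per_expert = 2 * hidden * ffn_hidden
    router = hidden * num_experts
    total = num_experts * per_expert + router   # parameters stored in memory
    active = top_k * per_expert + router        # parameters touched per token
    return total, active


# Hypothetical sizes: grow the expert count while keeping top_k fixed.
for num_experts in (8, 64, 256):
    total, active = moe_ffn_params(
        hidden=4096, ffn_hidden=2048, num_experts=num_experts, top_k=8
    )
    print(
        f"experts={num_experts:4d}  total={total:,}  "
        f"active/token={active:,}  ratio={total / active:.1f}x"
    )
```

With `top_k` fixed, adding experts raises the stored parameter count roughly linearly while the per-token active count barely changes. That gap is why memory and communication, rather than per-token FLOPs, become the binding constraints that the techniques listed above target.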