🤖 AI & Machine Learning

MXFP8 MoE Training: 1.3x Speedup, But Skepticism Lingers

MXFP8 just sped up MoE training by roughly 30% on massive GB200 clusters. Equivalent quality to BF16? So the loss curves say. But let's poke holes in the hype.

Figure: training loss curves comparing MXFP8 and BF16 for Llama4 Scout on a GB200 cluster.

⚡ Key Takeaways

  • MXFP8 yields 30.2% faster Llama4 Scout training on 256 GB200 GPUs while matching BF16 loss curves.
  • Equivalent convergence has been demonstrated at small batch sizes; behavior at full production scale remains untested.
  • TorchAO primitives make reproduction straightforward via TorchTitan, but the layers that must be excluded from MXFP8 hint at precision pitfalls (see the sketch after this list).
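To ground the skepticism, here is a minimal sketch of what MXFP8 does numerically, in plain PyTorch: per the OCP MX spec, blocks of 32 elements share one power-of-two (E8M0) scale, and each element is stored as FP8 E4M3. This is an illustration of the format only, not the TorchAO/TorchTitan code path (which runs fused kernels on GB200 hardware); the function name `mxfp8_quantize_dequantize` is hypothetical.

```python
import torch

BLOCK = 32            # OCP MX spec: 32 elements share one scale
FP8_E4M3_MAX = 448.0  # largest magnitude representable in float8_e4m3fn

def mxfp8_quantize_dequantize(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical round-trip: quantize to MXFP8-style blocks, dequantize back.

    Real kernels keep the FP8 payloads and E8M0 scales packed and never
    materialize the dequantized tensor; this only illustrates the numerics.
    """
    blocks = x.reshape(-1, BLOCK).float()

    # Shared per-block scale, restricted to powers of two (E8M0). Rounded up
    # here so the scaled block never exceeds the FP8 range; the MX spec uses
    # a floor-based rule and relies on saturation instead.
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    scale = torch.exp2(torch.ceil(torch.log2(amax / FP8_E4M3_MAX)))

    # Quantize the payload to FP8 E4M3, then dequantize with the shared scale.
    q = (blocks / scale).to(torch.float8_e4m3fn)
    return (q.float() * scale).reshape(x.shape).to(x.dtype)

# Round-trip error is small for well-scaled weights, which is the property
# behind the "matching BF16 loss curves" claim.
w = torch.randn(4096, 4096, dtype=torch.bfloat16)
print((w.float() - mxfp8_quantize_dequantize(w).float()).abs().mean())
```

One block, one scale: any outlier inside a 32-element block drags the shared scale up and crushes its neighbors' precision, which is one plausible reason certain layers get excluded from MXFP8 in practice.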
Published by theAIcatchup · Community-driven. Code-first.


Originally reported by PyTorch Blog
