AI & Machine Learning
MXFP8 MoE Training: 1.3x Speedup, But Skepticism Lingers
MXFP8 just sped up MoE training by 30% on a 256-GPU GB200 cluster. Quality on par with BF16? So the loss curves say. Let's poke some holes in the hype anyway.
theAIcatchup
Apr 07, 2026
4 min read
The 60-Second TL;DR
- MXFP8 yields 30.2% faster Llama 4 Scout training on 256 GB200 GPUs while matching BF16 loss curves (see the quantization sketch after this list).
- Equivalent convergence was shown at small batch sizes; training at real-world scale remains untested.
- TorchAO primitives make reproduction via TorchTitan straightforward, but the layers excluded from quantization hint at precision pitfalls (see the layer-exclusion sketch below).
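
What MXFP8 actually is: per the OCP Microscaling spec, every block of 32 values shares one power-of-two (E8M0) scale, and the scaled values are stored in FP8 E4M3. The snippet below is a minimal conceptual sketch of that quantize/dequantize round trip in plain PyTorch; it is not the TorchAO kernels, and the scale rule follows my reading of the spec.

```python
import torch

BLOCK = 32          # MX block size from the OCP Microscaling spec
E4M3_MAX = 448.0    # largest finite value of torch.float8_e4m3fn

def mxfp8_quantize(x: torch.Tensor):
    """Quantize a 1-D tensor (length divisible by 32) into MXFP8-style blocks."""
    blocks = x.float().reshape(-1, BLOCK)
    amax = blocks.abs().amax(dim=-1, keepdim=True)
    # Shared per-block scale is a power of two (E8M0): 2^(floor(log2(amax)) - 8),
    # where 8 is the exponent of the largest normal E4M3 value (448 = 1.75 * 2^8).
    scale = torch.exp2(torch.floor(torch.log2(amax.clamp(min=1e-38))) - 8)
    q = (blocks / scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scale

def mxfp8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct an approximation of the original values."""
    return (q.to(torch.float32) * scale).reshape(-1)

x = torch.randn(4096)
q, s = mxfp8_quantize(x)
print("max abs error:", (x - mxfp8_dequantize(q, s)).abs().max().item())
```

The per-block power-of-two scale is what keeps the format hardware-friendly on GB200: scaling is a pure exponent shift, so the 30% speedup comes almost entirely from the narrower FP8 matmuls.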
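The third bullet is the tricky part: not every layer tolerates MXFP8. In MoE stacks it is usually the expert feed-forward Linears that get quantized, while precision-sensitive pieces such as the router/gate, embeddings, and output head stay in BF16. Here is a hedged sketch of that selection pattern; the name keywords and the `to_mxfp8` callback are placeholders for illustration, not the actual TorchTitan configuration.

```python
import torch.nn as nn

# Assumed name patterns for precision-sensitive modules kept in BF16;
# the real Llama 4 Scout module names may differ.
EXCLUDE_KEYWORDS = ("router", "gate", "embed", "lm_head")

def should_quantize(name: str, module: nn.Module) -> bool:
    """Target only Linear layers whose qualified name avoids the excluded patterns."""
    return isinstance(module, nn.Linear) and not any(k in name for k in EXCLUDE_KEYWORDS)

def apply_mxfp8(model: nn.Module, to_mxfp8) -> nn.Module:
    """Walk the model and hand only eligible Linear layers to a conversion callback."""
    for name, module in model.named_modules():
        if should_quantize(name, module):
            to_mxfp8(module)  # hypothetical hook: swap weights/compute to MXFP8
    return model
```

This filter-style selection is how per-layer exclusions are typically expressed in quantization APIs, and the fact that the exclusion list exists at all is the precision pitfall the takeaway alludes to.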