DeepSeek-V3: MXFP8, DeepEP Boost B200 Pre-training 41%

When it comes to training monstrous AI models, every second counts. Now, a clever combination of new precision formats and communication optimizations is pushing the boundaries of what's possible.

Key Takeaways

  • MXFP8 and DeepEP achieved a 41% pre-training throughput gain for DeepSeek-V3 671B on NVIDIA B200 GPUs.
  • MXFP8 targets computational bottlenecks in MoE models by accelerating GEMMs, while DeepEP optimizes inter-GPU communication.
  • The combined optimizations demonstrate significant synergistic effects, leading to cumulative speedups.
  • The experiment highlights the growing importance of specialized software and numerical formats for efficient large-scale AI training.

Is the quest for faster AI model training actually just a sophisticated game of numbers and wires? We ask this because the latest achievement out of PyTorch and Nebius isn’t just about shaving milliseconds off training times; it’s a deep dive into how architectural shifts in numerical precision and inter-GPU communication can fundamentally alter the economics of large-scale AI development. The headline figure? A staggering 41% boost in pre-training throughput for the massive DeepSeek-V3 Mixture-of-Experts (MoE) models on NVIDIA’s cutting-edge B200 GPUs. That’s not just ‘faster training’; that’s a potential paradigm shift.

Let’s cut through the corporate gloss. What’s actually happening here is a symphony of low-level engineering. The PyTorch and Nebius engineers threw two orthogonal but complementary optimization techniques at the 16B and 671B parameter DeepSeek-V3 MoE behemoths, running on a 256-GPU cluster. The first is MXFP8, a specialized flavor of FP8 (8-bit floating-point) arithmetic that uses the tensor cores in NVIDIA’s B200. The second is DeepEP, a custom communication library designed to untangle the knotty problem of how MoE models shuffle massive amounts of data between GPUs.

The Two-Pronged Attack: MXFP8 and DeepEP

The core challenge with training MoE models at scale is twofold. First, the sheer computational load. These models dynamically route data to specialized ‘expert’ networks, leading to a deluge of matrix multiplications (GEMMs) that can overwhelm even the most powerful hardware. NVIDIA’s Blackwell architecture, with its 5th-generation tensor cores, natively supports MXFP8. Unlike standard FP8 recipes with coarser scaling (for example, one scale for a whole tensor), this format gives every block of 32 elements its own shared scale, preserving numerical fidelity while keeping these critical GEMMs on the fast FP8 tensor-core path. The goal here is simple: make the math happen faster, without sacrificing accuracy. Experiments confirm that MXFP8, particularly for the grouped GEMMs that dominate MoE expert layers, offers a significant speedup.
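To make the scaling granularity concrete, here is a minimal pure-Python sketch rather than the TorchAO implementation. It assumes the block size of 32 given in the FAQ, a power-of-two shared scale, and a simplified FP8 E4M3 rounding grid. The helper names (`round_to_e4m3`, `pow2_scale`, `fake_quantize`) and the sample data are hypothetical, and the scale rule is a simplification chosen to avoid clipping, not the exact rule any library uses.

```python
import math

E4M3_MAX = 448.0  # largest finite FP8 E4M3 magnitude


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on a simplified E4M3 grid (saturating)."""
    a = min(abs(x), E4M3_MAX)
    _, ex = math.frexp(a)      # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, -6)        # clamp to the smallest normal exponent
    step = 2.0 ** (e - 3)      # 3 mantissa bits: 8 steps per power of two
    return math.copysign(round(a / step) * step, x)


def pow2_scale(amax: float) -> float:
    """Smallest power-of-two scale that maps amax into the E4M3 range."""
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / E4M3_MAX))


def fake_quantize(values, block_size):
    """Quantize then dequantize, using one shared scale per block."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = pow2_scale(max(abs(v) for v in block))
        out.extend(round_to_e4m3(v / scale) * scale for v in block)
    return out


def mean_rel_error(original, approx):
    """Average of |approx - original| / |original| (original must be nonzero)."""
    return sum(abs(a - o) / abs(o) for o, a in zip(original, approx)) / len(original)


# Hypothetical data: 32 small values, then 32 larger ones ending in an outlier.
small = [(i + 1) * 1e-4 for i in range(32)]
large = [float(i + 1) for i in range(31)] + [1000.0]
values = small + large

per_tensor_err = mean_rel_error(values, fake_quantize(values, len(values)))
per_block_err = mean_rel_error(values, fake_quantize(values, 32))
print(f"one scale for the whole tensor: {per_tensor_err:.3f}")
print(f"one scale per 32 elements:      {per_block_err:.3f}")
```

With a single scale, the outlier sets the range for everything and the small values underflow to zero. With one scale per 32 elements, the small block gets its own range and keeps its precision.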

The second bottleneck, and often the more insidious one for MoE, is communication. Every layer in an MoE model requires two ‘all-to-all’ communication steps to send tokens to their assigned experts and then gather the results. Because the token routing is dynamic—determined at runtime by the model itself—standard communication primitives designed for predictable data flows struggle. This creates a massive bottleneck, especially as the model and cluster size grow. DeepEP steps in here, ditching generic collective communications for highly optimized NVLink and RDMA kernels. Crucially, it minimizes CPU involvement, allowing GPUs to communicate more directly, which is vital for these variable, high-volume transfers.
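The dynamic routing problem can be illustrated with a small pure-Python sketch. This is not DeepEP's API. It only counts how many token copies each GPU would have to send to every other GPU for a given routing decision, assuming experts are sharded evenly across ranks. The function names and the routing table are hypothetical.

```python
def dispatch_counts(routes_per_rank, num_experts, world_size):
    """Return send[src][dst]: token copies rank src must send to rank dst.

    routes_per_rank[src] holds one tuple of expert ids per local token.
    Experts are assumed to be sharded evenly: expert e lives on rank
    e // (num_experts // world_size).
    """
    experts_per_rank = num_experts // world_size
    send = [[0] * world_size for _ in range(world_size)]
    for src, token_routes in enumerate(routes_per_rank):
        for expert_ids in token_routes:
            for expert in expert_ids:
                send[src][expert // experts_per_rank] += 1
    return send


def receive_counts(send):
    """Transpose the send matrix: recv[dst][src] copies arrive at dst from src."""
    return [list(column) for column in zip(*send)]


# Hypothetical routing: 2 ranks, 4 experts (0 and 1 on rank 0; 2 and 3 on rank 1),
# each token routed to its top-2 experts by the model at runtime.
routes = [
    [(0, 2), (2, 3), (3, 2)],  # three tokens that start on rank 0
    [(0, 1), (1, 3)],          # two tokens that start on rank 1
]
send = dispatch_counts(routes, num_experts=4, world_size=2)
recv = receive_counts(send)
print("send:", send)
print("recv:", recv)
```

The counts depend entirely on the routing, so they change every step and every layer, and a rank cannot know how much data it will receive until the routing is known. The second all-to-all, which returns expert outputs to the tokens' home ranks, moves the same volumes in the opposite direction.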

“DeepEP replaces the standard all-to-all backend with purpose built NVLink and RDMA kernels that reduce CPU involvement by allowing GPUs to directly send weights, reducing latency.”

What’s truly compelling is how the two combine. MXFP8 targets the compute side of the equation, while DeepEP tackles the communication. Because they attack different bottlenecks, the gains stack instead of overlapping. The reported 41% gain for the 671B model builds on DeepEP’s 32% boost, with MXFP8 adding further speedup on top. It’s a cumulative effect across the whole pipeline.
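As a rough sanity check on how the two figures relate, the sketch below assumes that both percentages are throughput gains over the same baseline and that speedups compose multiplicatively. The article does not state the standalone MXFP8 figure for the 671B model, so the result is an implied value under those assumptions, not a reported measurement. The function name is hypothetical.

```python
def implied_incremental_gain(combined_gain: float, first_gain: float) -> float:
    """Gain the second optimization must add on top of the first, assuming
    speedup factors multiply: (1 + first) * (1 + second) = 1 + combined."""
    return (1.0 + combined_gain) / (1.0 + first_gain) - 1.0


# Figures from the article: 32% with DeepEP, 41% with DeepEP and MXFP8 together.
mxfp8_on_top = implied_incremental_gain(combined_gain=0.41, first_gain=0.32)
print(f"implied extra gain from MXFP8 on top of DeepEP: {mxfp8_on_top:.1%}")
```

Under these assumptions MXFP8 contributes a little under 7% on top of DeepEP, which is smaller than the 9 points a simple subtraction of the two percentages suggests.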

Beyond the Benchmarks: The Real-World Impact

This isn’t just a tech demo for a specialized GPU cluster. The implications are broad. For organizations building and training massive AI models (think foundation models, large language models, or complex generative systems), every percentage point of efficiency translates directly into reduced compute costs and faster iteration cycles. Training a 671B parameter model is an astronomical undertaking. A 41% throughput gain means the same run finishes in roughly 29% less time (1/1.41 ≈ 0.71 of the original), with the associated energy consumption and cloud bills shrinking to match. That makes the impossible slightly more feasible.
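Throughput gains and time savings are easy to conflate, so here is a small helper for converting one into the other. It assumes the total amount of work (tokens to train on) is fixed and that the throughput gain holds for the entire run. The function name is hypothetical.

```python
def time_reduction(throughput_gain: float) -> float:
    """Fraction of wall-clock time saved when throughput rises by throughput_gain
    and the total work is unchanged: new_time = old_time / (1 + gain)."""
    return 1.0 - 1.0 / (1.0 + throughput_gain)


# The two throughput gains quoted in the article.
for gain in (0.32, 0.41):
    print(f"+{gain:.0%} throughput -> {time_reduction(gain):.1%} less training time")
```

The saving is always smaller than the throughput percentage, because the gain divides the original time, not subtracts from it.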

It also signals a maturing ecosystem around cutting-edge hardware like the B200. It’s not enough to have the raw horsepower; the software stack needs to be equally sophisticated. PyTorch-native tooling like TorchAO (for MXFP8), paired with DeepEP, demonstrates that the open-source community and cloud providers like Nebius are actively building out the infrastructure to unlock the full potential of these powerful chips. The fact that all these experiments are fully reproducible adds another layer of credibility.

Is MXFP8 the Future of MoE Training?

While the results are undeniably impressive, a healthy dose of skepticism is warranted. MXFP8, while showing no degradation in convergence for the smaller 16B model, is still a mixed-precision format. The devil is always in the details of numerical stability and convergence guarantees over extremely long training runs. We’ve seen plenty of examples where aggressive precision reduction leads to subtle but significant issues down the line. However, the specific architecture of MXFP8—Microscaling FP8—and its careful integration via TorchAO seem designed to mitigate these risks, especially for the types of GEMMs prevalent in MoE. The real test will be scaling this to even larger models and longer training durations.

This experiment also highlights a critical architectural decision for MoE: how to balance computation and communication. As models grow, the communication overhead tends to dominate. DeepEP’s success suggests that specialized, hardware-aware communication kernels are not just a nice-to-have but a necessity for unlocking the next generation of AI capabilities. It’s a reminder that optimizing the silicon is only half the battle; the software and system design must evolve in lockstep.

This isn’t the end of the story. It’s the beginning of a new chapter where bespoke optimizations for specific model architectures, like MoE, become paramount. The era of simply throwing more GPUs at a problem might be giving way to a more nuanced approach, where clever software and specialized numerical formats unlock performance that brute force alone cannot achieve. The race for more efficient AI training is on, and tools like MXFP8 and DeepEP are setting the pace.


Frequently Asked Questions

What is MXFP8? MXFP8 (Microscaling FP8) is a low-precision numerical format that uses a shared exponent for smaller blocks of 32 elements, offering finer-grained scaling to preserve numerical fidelity while leveraging FP8 hardware for faster computations.

How does DeepEP improve communication? DeepEP replaces standard collective communication with optimized NVLink and RDMA kernels that allow GPUs to communicate more directly, reducing CPU involvement and latency, which is crucial for the dynamic all-to-all communication patterns in MoE models.

Will these optimizations work for all AI models? These specific optimizations, MXFP8 and DeepEP, are particularly beneficial for Mixture-of-Experts (MoE) models due to their heavy reliance on grouped GEMMs and dynamic all-to-all communication patterns. While the underlying principles of mixed precision and optimized communication are broadly applicable, the exact implementation details and gains may vary for different model architectures.

Written by
Open Source Beat Editorial Team


Originally reported by PyTorch Blog
