Explainers

NVFP4 Speeds Diffusion 1.68x on Blackwell GPUs

Blackwell's NVFP4 format just turned diffusion models into speed demons—1.68x faster on Flux.1-Dev. But is this the quantization silver bullet, or just NVIDIA's latest flex?


Key Takeaways

  • NVFP4 delivers 1.68x inference speedups on Flux.1-Dev with Diffusers/TorchAO.
  • MXFP8 hits 1.26x, best for low-batch; NVFP4 owns high-batch workloads.
  • Requires Blackwell B200; code fully reproducible via GitHub repo.

Pipeline loads. Prompt fires: “A cat holding a sign that says hello world.” Output spits out in seconds, not minutes. NVIDIA’s Blackwell architecture—specifically the B200—makes it happen with NVFP4 and MXFP8 quantization, juicing diffusion models like Flux.1-Dev up to 1.68x faster.

Zoom out. Diffusion models rule image and video gen, pumping out hyper-real visuals. Problem? They guzzle memory and compute like a Hummer in a fuel crisis. Quantization swoops in as the fix, and NVIDIA’s microscaling formats—native to Blackwell—promise the holy grail: speed without the quality nosedive.

What Are MXFP8 and NVFP4 Really Doing?

MXFP8, the OCP-standard 8-bit champ (E4M3/E5M2), groups tensors into 32-element blocks with 8-bit scales. NVFP4? NVIDIA’s 4-bit beast (E2M1), block size 16, FP8 scales, Tensor Core supercharged. Theory screams throughput: NVFP4 shrinks memory 3.5x versus BF16. Reality? End-to-end benchmarks hit 1.26x for MXFP8, 1.68x for NVFP4 on Flux.1-Dev, QwenImage, LTX-2.
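The block-scaling idea is easy to sketch in NumPy. This is an illustrative toy, not torchao's kernel: a real implementation stores FP8 scales and packed 4-bit codes, but the math is the same — each 16-element block shares one scale that maps it onto the signed E2M1 grid (±{0, 0.5, 1, 1.5, 2, 3, 4, 6}).

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (2 exponent bits, 1 mantissa bit).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_blockwise(x, block_size=16):
    """Toy NVFP4-style block quantization: one scale per block,
    values snapped to the nearest signed E2M1 grid point."""
    x = x.reshape(-1, block_size)
    # Per-block scale maps the block's max magnitude onto E2M1's max (6.0).
    scales = np.abs(x).max(axis=1, keepdims=True) / 6.0
    scales[scales == 0] = 1.0
    scaled = x / scales
    # Snap each scaled value to the nearest representable E2M1 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1).argmin(axis=-1)
    deq = np.sign(scaled) * E2M1[idx] * scales
    return deq.reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
err = np.abs(w - quantize_blockwise(w)).mean()
print(f"mean abs quantization error: {err:.4f}")
```

Sixteen 4-bit codes plus one shared scale per block is where the roughly 3.5x memory shrink versus BF16 comes from.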

Here’s the kicker—selective quantization, CUDA Graphs, LPIPS metrics. They iterated like pros, balancing perf and fidelity. Dry humor alert: LPIPS stayed low, meaning your cat pic doesn’t suddenly look like a Picasso reject.

“We demonstrate reproducible end-to-end inference speedups of up to 1.26x with MXFP8 and 1.68x with NVFP4 with diffusers and torchao on the Flux.1-Dev, QwenImage, and LTX-2 models on NVIDIA B200.”

That quote from the NVIDIA post? Gold. Reproducible. Specific models. No vaporware.

But wait. CUDA 13.0 minimum. B200 required for NVFP4. If you’re nursing an A100, tough luck—this party’s invite-only.

conda create -n nvfp4 python=3.11 -y
conda activate nvfp4
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu130
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu130
pip install --pre mslk --index-url https://download.pytorch.org/whl/nightly/cu130
pip install diffusers transformers accelerate sentencepiece protobuf av imageio-ffmpeg

Nightlies from March 2026—PyTorch 2.12.0.dev, TorchAO 0.17.0.dev. HF login for models. Standard drill.

How to Quantize Your Diffusion Pipeline

TorchAO slots right into Diffusers. Check this:

import torch

from diffusers import DiffusionPipeline, TorchAoConfig
from diffusers.quantizers import PipelineQuantizationConfig
from torchao.prototype.mx_formats.inference_workflow import (
    NVFP4DynamicActivationNVFP4WeightConfig,
)

# NVFP4 for both weights and activations; dynamic per-tensor scales and
# Triton kernels enable the fast path on Blackwell Tensor Cores.
config = NVFP4DynamicActivationNVFP4WeightConfig(
    use_dynamic_per_tensor_scale=True, use_triton_kernel=True,
)
# Quantize only the transformer; text encoders and VAE stay in BF16.
pipe_quant_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": TorchAoConfig(config)}
)
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    quantization_config=pipe_quant_config,
).to("cuda")
# Regional compilation: compile one repeated transformer block and reuse it
# across the stack, cutting compile time versus whole-model compilation.
pipe.transformer.compile_repeated_blocks(fullgraph=True)

pipe_call_kwargs = {
    "prompt": "A cat holding a sign that says hello world",
    "height": 1024,
    "width": 1024,
    "guidance_scale": 3.5,
    "num_inference_steps": 28,
    "max_sequence_length": 512,
    "num_images_per_prompt": 1,
    "generator": torch.manual_seed(0),
}
result = pipe(**pipe_call_kwargs)
image = result.images[0]
image.save("my_image.png")

Every Linear layer quantized. Regional compilation with fullgraph=True—cuts compile time, near-full perf. MXFP8 swap? Tweak the config subclass. Triton kernels for the win.

Benchmarks on Flux.1-Dev used exactly those kwargs: 1024x1024, 28 steps, guidance 3.5. Results? NVFP4 crushes high-batch runs; MXFP8 owns low-batch latency.
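Reproducing that comparison needs a consistent timing loop. A minimal wall-clock harness sketch (the `bf16_latency` baseline in the commented usage is an assumed variable, not from the post; for CUDA work, the timed callable should end with torch.cuda.synchronize(), since kernel launches return before the GPU finishes):

```python
import time

def _timed(fn):
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

def median_latency(fn, warmup=3, iters=10):
    """Median wall-clock seconds per call; warmup runs absorb
    one-time compile and caching cost before measurement."""
    for _ in range(warmup):
        fn()
    samples = sorted(_timed(fn) for _ in range(iters))
    return samples[len(samples) // 2]

# Hypothetical usage with the pipeline above:
# nvfp4_latency = median_latency(lambda: pipe(**pipe_call_kwargs))
# print(f"speedup vs BF16: {bf16_latency / nvfp4_latency:.2f}x")
```

Median beats mean here: a single straggler run (page faults, clock ramp-up) won't skew the headline number.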

NVIDIA’s spin? Polished. But here’s the acerbic insight they skip: this echoes the FP16 Tensor Core debut on Volta in 2017—hype machine revved, adoption lagged until software caught up. Bold prediction: NVFP4 locks Blackwell into diffusion dominance, but expect 6-12 months of TorchAO growing pains before prime time.

Why Does This Matter for Diffusion Developers?

Speedups aren’t abstract. 1.68x means serving more users, lower costs—real dollars. QwenImage, LTX-2 join Flux; cross-model wins. Selective quant dodges accuracy cliffs in sensitive blocks. CUDA Graphs slash overhead. LPIPS validates no visual garbage.
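To make "real dollars" concrete, a back-of-envelope calculation. The latency and $/hour figures below are invented placeholders; only the 1.68x factor comes from the benchmarks.

```python
speedup = 1.68               # NVFP4 end-to-end speedup (from the post)
sec_per_image_bf16 = 10.0    # assumed baseline latency (placeholder)
gpu_cost_per_hour = 5.0      # assumed B200 $/hour (placeholder)

images_per_hour_bf16 = 3600 / sec_per_image_bf16
images_per_hour_nvfp4 = images_per_hour_bf16 * speedup
cost_per_1k_bf16 = gpu_cost_per_hour / images_per_hour_bf16 * 1000
cost_per_1k_nvfp4 = gpu_cost_per_hour / images_per_hour_nvfp4 * 1000

print(f"BF16:  {images_per_hour_bf16:.0f} img/hr, ${cost_per_1k_bf16:.2f}/1k")
print(f"NVFP4: {images_per_hour_nvfp4:.0f} img/hr, ${cost_per_1k_nvfp4:.2f}/1k")
```

Whatever your actual baseline, cost per image falls by the same 1/1.68 factor as long as the GPU stays saturated.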

Skepticism check: Benchmarks on B200 DGX. Your mileage? Varies with batch, model, prompt length. High-batch NVFP4 shines; small-batch, MXFP8 edges it. Corporate PR glosses that—always does.

Unique angle: Remember the AWQ/GPTQ quantization wars? This microscaling leapfrogs them on Blackwell, but ports poorly to Hopper/Ampere. NVIDIA’s ecosystem lock-in, dialed to 11.

MXFP8 as industry standard? OCP buy-in helps, but NVFP4’s Blackwell exclusivity screams moat. Developers: Fork the repo, repro the numbers. Don’t drink the Kool-Aid untested.

Will NVFP4 Replace BF16 for All Workloads?

No. Compute-bound, high-batch? Yes. Memory-pinched video gen? Slam dunk. But dynamic ranges vary—per-tensor scales mitigate, not miracle. LPIPS dips prove fidelity holds, yet outliers lurk.
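The outlier point can be made concrete with a two-line model of block resolution. Toy numbers, not torchao internals: one large value in a 16-element block inflates the shared scale, and the smallest representable step grows past the magnitude of every other value in the block.

```python
import numpy as np

E2M1_MAX = 6.0  # largest FP4 E2M1 magnitude

def smallest_step(block):
    """Smallest nonzero magnitude after per-block scaling: the shared
    scale times E2M1's smallest nonzero grid point (0.5)."""
    scale = np.abs(block).max() / E2M1_MAX
    return scale * 0.5

calm  = np.full(16, 0.1)                   # uniform block
spiky = np.append(np.full(15, 0.1), 8.0)   # same values plus one outlier

print(smallest_step(calm))    # fine-grained: far below the 0.1 values
print(smallest_step(spiky))   # coarser than the 0.1 values themselves
```

This is exactly why selective quantization leaves outlier-prone blocks in higher precision.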

TorchAO’s integration? Smooth now, nightlies be damned. Stable release incoming, bet on it.

Blackwell owners rejoice. Rest? Save for 2027 upgrades. NVIDIA just raised the bar—again.

Frequently Asked Questions

What is NVFP4 quantization? NVIDIA’s 4-bit float format (E2M1) with 16-element blocks and FP8 scales, native to Blackwell Tensor Cores for max throughput.

How much faster is MXFP8 on diffusion models? Up to 1.26x end-to-end on Flux.1-Dev via Diffusers/TorchAO, with near-zero quality loss per LPIPS.

Do I need a B200 GPU for these speedups? Yes for NVFP4; CUDA 13.0 min. Benchmarks are B200-specific.

Written by
Open Source Beat Editorial Team

Curated insights, explainers, and analysis from the editorial team.



Originally reported by PyTorch Blog
