🔧 AI Hardware

Torch.compile Hits SOTA Normalization Speeds on H100 and B200

What if your PyTorch models ran as fast as hand-written custom kernels? Recent torch.compile improvements deliver state-of-the-art LayerNorm and RMSNorm performance on NVIDIA H100 and B200 GPUs, closing the gap with hyper-optimized kernel libraries like Quack.

[Figure: benchmark graphs showing torch.compile matching Quack on LayerNorm forwards on H100 and B200]
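
If you want to reproduce the flavor of these benchmarks yourself, here is a minimal sketch timing an eager RMSNorm against its torch.compile'd version. The shapes, dtype, and the "max-autotune-no-cudagraphs" mode are illustrative assumptions, not the PyTorch team's actual harness.

```python
import torch
from torch.utils.benchmark import Timer

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Reduce in fp32 for numerical stability, then cast back to the input dtype.
    x32 = x.float()
    inv_rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x32 * inv_rms).to(x.dtype) * weight

# "max-autotune-no-cudagraphs" asks Inductor to autotune the generated Triton
# kernels; this mode choice is an assumption, not the blog's stated setup.
compiled_rms_norm = torch.compile(rms_norm, mode="max-autotune-no-cudagraphs")

x = torch.randn(32768, 4096, device="cuda", dtype=torch.bfloat16)
w = torch.randn(4096, device="cuda", dtype=torch.bfloat16)

for _ in range(3):  # warm up so compile/autotune time is excluded
    compiled_rms_norm(x, w)

for name, fn in [("eager", rms_norm), ("compiled", compiled_rms_norm)]:
    t = Timer(stmt="fn(x, w)", globals={"fn": fn, "x": x, "w": w})
    print(name, t.timeit(100))
```

Running the script with TORCH_LOGS="output_code" dumps the Triton kernels Inductor generates, which is the easiest way to inspect block sizes and warp settings first-hand.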

⚡ Key Takeaways

  • torch.compile now reaches SOTA LayerNorm/RMSNorm speeds on H100 and B200 through improved autotuning and tuning heuristics (see the benchmark sketch above).
  • Key fixes: larger RBLOCK sizes, adjusted warp counts for full vectorization, and persistent reductions (illustrated in the Triton sketch after this list).
  • Automatic fusion promises end-to-end training speedups, shrinking the need for hand-written custom kernels.
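
To make the second bullet concrete, here is a hand-written Triton RMSNorm forward in the "persistent reduction" style: RBLOCK is sized to cover the whole row, so the sum-of-squares reduction and the normalized write happen in a single pass with no reduction loop. This is a sketch of the concepts, not the code Inductor actually emits; the num_warps=8 choice and the shapes are assumptions.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rms_norm_kernel(X, W, Y, stride, N, eps, RBLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, RBLOCK)
    mask = cols < N
    # Persistent reduction: RBLOCK >= N, so the entire row is loaded once and
    # both the reduction and the store happen in one pass over the data.
    x = tl.load(X + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    inv_rms = 1.0 / tl.sqrt(tl.sum(x * x, axis=0) / N + eps)
    w = tl.load(W + cols, mask=mask, other=0.0).to(tl.float32)
    y = x * inv_rms * w
    tl.store(Y + row * stride + cols, y.to(Y.dtype.element_ty), mask=mask)

def rms_norm_triton(x: torch.Tensor, w: torch.Tensor, eps: float = 1e-6):
    M, N = x.shape
    y = torch.empty_like(x)
    # A "bigger RBLOCK": round the row width up to a power of two so one
    # program covers the full reduction dimension.
    RBLOCK = triton.next_power_of_2(N)
    rms_norm_kernel[(M,)](x, w, y, x.stride(0), N, eps,
                          RBLOCK=RBLOCK, num_warps=8)  # warp count: assumption
    return y
```

rms_norm_triton(x, w) should match the eager RMSNorm above to within bf16 tolerance; the RBLOCK and num_warps knobs shown here are exactly the tuning space that Inductor's autotuner and heuristics now navigate automatically.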
Published by theAIcatchup (community-driven, code-first).

Originally reported by PyTorch Blog
