🔧 AI Hardware

Torch.compile Hits SOTA Normalization Speeds on H100 and B200

What if your PyTorch models ran as fast as hand-written custom kernels? Recent torch.compile improvements deliver state-of-the-art LayerNorm and RMSNorm performance on NVIDIA H100 and B200 GPUs, closing the gap with hyper-optimized kernel libraries like Quack.

[Figure: benchmark graphs showing torch.compile matching Quack on LayerNorm forwards on H100 and B200]
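
If you want to reproduce the flavor of these benchmarks yourself, here is a minimal sketch timing an eager RMSNorm against its torch.compile'd version. The shapes, dtype, and the "max-autotune-no-cudagraphs" mode are illustrative assumptions, not the PyTorch team's actual harness.

```python
import torch
from torch.utils.benchmark import Timer

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Reduce in fp32 for numerical stability, then cast back to the input dtype.
    x32 = x.float()
    inv_rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x32 * inv_rms).to(x.dtype) * weight

# "max-autotune-no-cudagraphs" asks Inductor to autotune the generated Triton
# kernels; this mode choice is an assumption, not the blog's stated setup.
compiled_rms_norm = torch.compile(rms_norm, mode="max-autotune-no-cudagraphs")

x = torch.randn(32768, 4096, device="cuda", dtype=torch.bfloat16)
w = torch.randn(4096, device="cuda", dtype=torch.bfloat16)

for _ in range(3):  # warm up so compile/autotune time is excluded
    compiled_rms_norm(x, w)

for name, fn in [("eager", rms_norm), ("compiled", compiled_rms_norm)]:
    t = Timer(stmt="fn(x, w)", globals={"fn": fn, "x": x, "w": w})
    print(name, t.timeit(100))
```

Running the script with TORCH_LOGS="output_code" dumps the Triton kernels Inductor generates, which is the easiest way to inspect block sizes and warp settings first-hand.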

⚡ Key Takeaways

  • torch.compile now reaches SOTA LayerNorm/RMSNorm speeds on H100 and B200 through improved autotuning and tuning heuristics (see the benchmark sketch above).
  • Key fixes: larger RBLOCK sizes, adjusted warp counts for full vectorization, and persistent reductions (illustrated in the Triton sketch after this list).
  • Automatic fusion promises end-to-end training speedups, shrinking the need for hand-written custom kernels.
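
To make the second bullet concrete, here is a hand-written Triton RMSNorm forward in the "persistent reduction" style: RBLOCK is sized to cover the whole row, so the sum-of-squares reduction and the normalized write happen in a single pass with no reduction loop. This is a sketch of the concepts, not the code Inductor actually emits; the num_warps=8 choice and the shapes are assumptions.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rms_norm_kernel(X, W, Y, stride, N, eps, RBLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, RBLOCK)
    mask = cols < N
    # Persistent reduction: RBLOCK >= N, so the entire row is loaded once and
    # both the reduction and the store happen in one pass over the data.
    x = tl.load(X + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    inv_rms = 1.0 / tl.sqrt(tl.sum(x * x, axis=0) / N + eps)
    w = tl.load(W + cols, mask=mask, other=0.0).to(tl.float32)
    y = x * inv_rms * w
    tl.store(Y + row * stride + cols, y.to(Y.dtype.element_ty), mask=mask)

def rms_norm_triton(x: torch.Tensor, w: torch.Tensor, eps: float = 1e-6):
    M, N = x.shape
    y = torch.empty_like(x)
    # A "bigger RBLOCK": round the row width up to a power of two so one
    # program covers the full reduction dimension.
    RBLOCK = triton.next_power_of_2(N)
    rms_norm_kernel[(M,)](x, w, y, x.stride(0), N, eps,
                          RBLOCK=RBLOCK, num_warps=8)  # warp count: assumption
    return y
```

rms_norm_triton(x, w) should match the eager RMSNorm above to within bf16 tolerance; the RBLOCK and num_warps knobs shown here are exactly the tuning space that Inductor's autotuner and heuristics now navigate automatically.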
Published by theAIcatchup (community-driven, code-first).

Originally reported by PyTorch Blog
