GPU Utilization: Counter vs. Cause

That comforting 97% GPU utilization number? It's a lie. Your GPU might be busy, but it's likely not doing anything useful.

[Hero image: a timeline showing sustained high GPU utilization (green) alongside a significant dip in token throughput (red gaps), illustrating the discrepancy.]

Key Takeaways

  • NVIDIA's GPU utilization metric (e.g., from nvidia-smi) measures time *any* kernel was running, not useful work, leading to misleading performance diagnostics.
  • Common failure modes like imbalanced prefill/decode, distributed training waits, I/O stalls, CPU contention, and memory bandwidth saturation all present as high GPU utilization with low throughput.
  • Accurate bottleneck detection requires correlating lower-level data: CUDA API calls, driver events, kernel execution traces, and host-side CPU activity, not just aggregate utilization.

A vLLM server screams 97% utilization on nvidia-smi for a solid eight minutes. Simultaneously, token throughput craters. Both statements, absurdly, are true. And therein lies the digital snake oil. NVIDIA’s ubiquitous GPU utilization metric isn’t a measure of productive work. Nope. It’s a simple duty-cycle counter. It tells you when something was running on the GPU, not if that something was worth running.

We bumped into this absurdity while running an internal repro of a vLLM latency spike. The hardware? A TensorDock RTX 4090. The software? vLLM 0.18.0, with Qwen2.5-0.5B-Instruct chugging along. For eight minutes, the dashboard was a picture of health. nvidia-smi hovered between 92% and 99%, averaging a steady 97%. Fans hummed, memory was stable, power draw held at 320W. All systems go, right?

Wrong.

The culprit was an unassuming request: n_completions=8 paired with logprobs=20. This beast expanded each decode step into eight separate sequences, each demanding a full-vocabulary softmax, roughly 150,000 logits in Qwen2.5’s case, per sequence at every step. Each of these behemoths effectively held every other co-scheduled request hostage for 9-11 seconds. The GPU stayed busy, yes, but it was busy churning through user-invisible garbage. The throughput? Collapsed.
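
For the record, the shape of that request looks roughly like the sketch below, aimed at a local vLLM OpenAI-compatible server. The endpoint URL, prompt, and max_tokens are placeholders; the load-bearing parts are n=8 and logprobs=20 (the completions API’s names for the parameters described above), and depending on the vLLM version the server may need its max-logprobs limit raised to accept it.

    import requests

    # Hypothetical local vLLM OpenAI-compatible endpoint; prompt and max_tokens
    # are placeholders. n=8 fans each decode step out into eight sequences, and
    # logprobs=20 forces the full-vocabulary logprob path for every one of them.
    resp = requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": "Qwen/Qwen2.5-0.5B-Instruct",
            "prompt": "Summarize the history of GPU virtualization.",
            "max_tokens": 256,
            "n": 8,
            "logprobs": 20,
        },
        timeout=300,
    )
    resp.raise_for_status()
    print(len(resp.json()["choices"]))  # eight completions for a single prompt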

This isn’t some fringe scenario. This is the predictable outcome when your only diagnostic tool is a glorified stopwatch.

NVIDIA’s own documentation helpfully defines GPU-Util as: “percent of time over the past sample period during which one or more kernels was executing on the GPU.” Duty cycle. That’s it. It offers zero insight into whether the kernel is efficient, whether it’s bottlenecked, or if it’s actively hindering other operations. It’s like bragging about how many hours you spent in the gym, without mentioning if you were lifting weights or just staring at the ceiling.
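
If you want to see exactly what that counter is, you can poll it yourself through NVML. Here is a minimal sketch using the pynvml bindings (assumes an NVIDIA driver and the nvidia-ml-py package); utilization.gpu is the very same duty-cycle percentage nvidia-smi prints.

    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # utilization.gpu: percent of the last sample window during which at least
    # one kernel was resident on the device. It says nothing about how
    # efficient that kernel was.
    for _ in range(10):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"gpu={util.gpu}%  mem={util.memory}%")
        time.sleep(1)

    pynvml.nvmlShutdown()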

DCGM, NVIDIA’s more advanced toolkit, offers finer granularity with counters like SM_ACTIVE and MEM_COPY_UTIL. These help, slightly. But a kernel running at a pathetic 5% of its peak potential for 100 milliseconds still registers 100% SM_ACTIVE for that interval. The dashboard remains oblivious.
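
You can reproduce the blind spot in a few lines. The sketch below, assuming a CUDA-capable machine with PyTorch installed, launches trivially small matmuls back to back; watch nvidia-smi in another terminal and the duty-cycle counter climbs toward 100%, even though each kernel touches a tiny fraction of the card’s compute and bandwidth.

    import torch

    assert torch.cuda.is_available()
    x = torch.randn(256, 256, device="cuda")  # far too small to fill a modern GPU

    # The tiny kernels queue back to back, so "one or more kernels is executing"
    # nearly all the time: utilization reads ~100% while achieved FLOPs are a
    # rounding error next to the card's peak.
    for _ in range(200_000):
        y = x @ x  # result discarded; only the duty cycle matters here

    torch.cuda.synchronize()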

We’ve dissected this pattern across various workloads. High utilization, plummeting throughput, and a dashboard that might as well be a Magic 8-Ball. The common thread? The root cause lives deeper.

The Usual Suspects: Why Your GPU Thinks It’s Busy

  1. Prefill/Decode Tango: Frameworks like vLLM, SGLang, and TGI batch prefill (input processing) and decode (output generation) on the same hardware. Prefill demands orders of magnitude more compute per step than decode, especially with long contexts, so a single long-context request becomes a traffic jam for all shorter requests (the back-of-the-envelope sketch after this list puts rough numbers on the imbalance). The GPU stays at 100% SM_ACTIVE because the prefill kernels are hogging the streaming multiprocessors. Meanwhile, decode latency for the waiting requests stretches toward infinity.

  2. Distributed Training Gridlock: Imagine a 4-GPU all-reduce. If one GPU is a straggler, the other three wait, and they report 100% utilization while doing so, because the collective kernel that spin-waits for the slow rank is itself a running kernel. Overall throughput is dictated by the slowest rank, not by the efficient ones.

  3. Dataloader Deadlock: PyTorch’s DataLoader, when index permutation and batch collation run on the main process, can become a single-threaded bottleneck. The GPU dutifully chews through the forward kernels already queued, while the launch of the next batch sits blocked behind a cudaStreamSynchronize. The utilization counter screams, but the next job is stuck in the driveway.

  4. CPU Core Chaos: vLLM’s engine loop is single-threaded. An OS context switch (kernel work on a neighboring core, a pesky interrupt, a poorly managed cgroup) can stall the cudaLaunchKernel call on the host. We’ve seen p99 cudaLaunchKernel times stretch to 13.1ms, a gargantuan leap from a typical 16.7us p50, all due to scheduler hiccups. The GPU keeps running whatever kernel was active before the stall, so utilization appears normal.

  5. Memory Bandwidth Meltdown: A kernel that requests data faster than DRAM can deliver it still reports 100% SM_ACTIVE; its warps are resident on the SMs, they’re just stalled waiting on loads. Utilization is a red herring here. The real constraint is memory throughput.
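
To put rough numbers on the first failure mode, here is the promised back-of-the-envelope sketch. It uses the standard estimate of roughly 2 x parameters FLOPs per token for a dense transformer forward pass (the attention term is ignored, which undercounts long prefills); the 7B model size and 8K prompt length are illustrative, not taken from the repro above.

    # Rough compute imbalance between one long prefill and one decode step.
    params = 7e9              # a 7B-parameter dense model (illustrative)
    prompt_tokens = 8_192     # one long-context prompt (illustrative)
    flops_per_token = 2 * params

    prefill_flops = prompt_tokens * flops_per_token  # whole prompt in one pass
    decode_flops = flops_per_token                   # one new token per step

    print(f"prefill      ~ {prefill_flops:.2e} FLOPs")
    print(f"decode step  ~ {decode_flops:.2e} FLOPs")
    print(f"one prefill  ~ {prefill_flops / decode_flops:,.0f} decode steps of compute")

While that one prefill owns the SMs, every co-batched decode request simply waits, and the utilization counter cannot tell the difference.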

In every one of these scenarios, the symptom is depressingly familiar: high utilization, low throughput. The cause, however, hides in the layers beneath.

Finding the Real Bottleneck

So, how do you peel back the layers? Forget the aggregate utilization. Ask the hard question: “What was the GPU actually waiting on, second by second?”

Answering this demands correlating data from multiple sources on the same host, synchronized by timestamp (a lightweight way to get a first correlated view is sketched after this list):

  • CUDA Runtime API Calls: Monitor events like cudaLaunchKernel, cudaMemcpyAsync, cudaStreamSynchronize, and cudaDeviceSynchronize via uprobes on libcudart.so.
  • CUDA Driver API Calls: Track cuLaunchKernel and related driver-level operations using uprobes on libcuda.so.
  • Kernel Execution Traces: Dive into the actual kernels being run. Tools like CUPTI or NVIDIA Nsight can provide detailed profiles of kernel duration, occupancy, and resource utilization within the kernel itself.
  • Host-Side Activity: Don’t ignore the CPU. Monitor CPU thread activity, context switches, and system calls related to GPU driver interaction.
  • Memory Bandwidth: Directly measure DRAM bandwidth usage. This is often exposed via DCGM or specific profiling tools.
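
Bolting together uprobes, CUPTI, and bandwidth counters is the full-fidelity version. For PyTorch-based stacks, a lighter starting point is torch.profiler, which rides on CUPTI/Kineto and already places host-side CUDA API calls and the kernels they launch on one synchronized timeline. A minimal sketch, where the Linear layer and loop stand in for your real workload:

    import torch
    from torch.profiler import profile, ProfilerActivity

    model = torch.nn.Linear(4096, 4096).cuda()
    x = torch.randn(64, 4096, device="cuda")

    # Record CPU-side CUDA runtime calls (cudaLaunchKernel, cudaMemcpyAsync, ...)
    # alongside the GPU kernels they map to, on a shared timeline.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(50):
            y = model(x)
        torch.cuda.synchronize()

    # In the Chrome trace, a wide gap between a launch on the host thread and
    # its kernel on the GPU stream points at a host-side stall, not a slow kernel.
    prof.export_chrome_trace("trace.json")
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))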

By weaving these threads together, you can finally see the difference between a GPU that’s churning through productive computation and one that’s merely spinning its wheels—a distinction that 97% utilization conveniently obscures.

This isn’t just a theoretical problem; it’s a persistent, frustrating reality in high-performance computing. And as AI workloads grow more complex, the ability to see beyond the simplistic utilization counter will become not just beneficial, but absolutely essential.


Written by
Open Source Beat Editorial Team

Curated insights, explainers, and analysis from the editorial team.

Originally reported by Dev.to
