In Part 6, we exceeded reference performance on the RTX 3090, hitting 101.5% through FP instruction fusion and auto-tuning. But when we ran the same kernel on the A100, performance dropped to just 80.3% - a swing of more than 20 percentage points from success to significant underperformance.
What causes such a large performance gap between these two Ampere-generation GPUs?
Profiling Kernel 7 on the A100
Let's compare the Nsight Compute profiling metrics of the reference kernel with the top configuration of kernel 7. We'll examine three key areas: compute pipeline utilization, instruction counts, and how the A100's throughput characteristics amplify our instruction overhead.
These profiles were done for seq_len=4096, d_head=128.
Compute Pipeline Utilization
Let's start by examining pipeline utilization1 across the tensor, FMA (primarily floating point), and ALU (primarily integer) pipelines. The numbers reveal an immediate problem: our tensor pipeline is active for only 60.5% of the kernel's execution compared to 75.4% for the reference. Meanwhile, our scalar pipelines are significantly more active:
- FMA: 29.2% vs 20.5%
- ALU: 21.2% vs 13.1%
- 50% higher in aggregate (50.4% vs 33.6%)
Reference kernel (B_r=128, B_c=64), Kernel 7 (B_r=64, B_c=64).
The cycle counts tell an even starker story. While both kernels execute roughly the same number of tensor pipeline cycles, each of kernel 7's scalar pipelines (FMA and ALU) is busy for about as many cycles as both pipelines combined in the reference kernel. We're doing the same amount of matrix multiplication work but spending far more time on scalar operations.
Reference kernel (B_r=128, B_c=64), Kernel 7 (B_r=64, B_c=64).
RTX 3090: A Different Story
What about the RTX 3090? The picture looks completely different. Kernel 7 actually has slightly higher tensor utilization than the reference (48.0% vs 47.2%). We see similar patterns to the A100 in scalar pipeline activity - 53% higher in aggregate (14.2% vs 9.27%) - but the impact is much less severe.
Reference kernel (B_r=128, B_c=32), Kernel 7 (B_r=64, B_c=64).
Consumer CUDA GPU Tensor Throughput
The throughput for matrix multiplications with BF16/FP16 inputs and FP32 outputs2 is artificially bottlenecked at half the speed on consumer GPUs compared to prosumer GPUs (Quadro/Pro lineup). The "effective" pipeline utilization values may be closer to double the reported metrics - 95.9% for kernel 7 vs 94.5% for reference.
With these effective values, the tensor pipeline is near its throughput ceiling. This raises a critical question: why is tensor utilization for our kernel significantly lower on the A100 but higher on the RTX 3090, both compared to their respective reference kernels?
Scalar Instruction Overhead
The tensor pipeline utilization points us toward the problem, but we need to dig deeper. Why are the FMA and ALU pipelines so active? Let's compare the actual instruction counts between our kernel and the reference to quantify the overhead.
Kernel 7 and Reference (B_r=128, B_c=32)
Block Configuration
While our top-performing configuration uses B_r=64, B_c=64, we show the B_r=128, B_c=32 configuration here for a fairer instruction count comparison with the reference kernel. The instruction overhead patterns remain similar across different configurations.
Our kernel executes significantly more of these instructions:
- `IMAD`: integer multiply-add
- `LOP3.LUT`: bitwise logic operation with 3 operands
- `MOV`: register copy
- `SHF`: bit shift
Not pictured is CS2R, which copies special registers like the zero register RZ. The reference kernel doesn't execute any of these, while ours does - another source of overhead we'll eliminate in Part 8.
Instruction Modifiers
These raw instruction counts tell only part of the story. Instructions like IMAD serve multiple purposes via modifiers. For example, IMAD variants include:
- `IMAD.MOV.U32` (register copy)
- `IMAD.SHL.U32` (left shift)
- `IMAD.IADD` (integer addition)
Similarly, LOP3.LUT implements any 3-input boolean operation via lookup table (docs).
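To make the lookup-table idea concrete, here is a minimal CUDA C++ sketch of the semantics (my own illustration, not how the hardware implements it): the 8-bit immediate is a truth table, indexed per bit position by the corresponding bits of the three operands.

```cpp
#include <cstdint>

// Emulates LOP3.LUT semantics: for each bit position, the bits of a, b, c
// form a 3-bit index into the 8-bit truth table `lut`.
__host__ __device__ inline uint32_t lop3_emulated(uint32_t a, uint32_t b,
                                                  uint32_t c, uint32_t lut) {
    uint32_t result = 0;
    for (int bit = 0; bit < 32; ++bit) {
        uint32_t idx = (((a >> bit) & 1u) << 2)   // operand a -> index bit 2
                     | (((b >> bit) & 1u) << 1)   // operand b -> index bit 1
                     |  ((c >> bit) & 1u);        // operand c -> index bit 0
        result |= ((lut >> idx) & 1u) << bit;     // output bit from the table
    }
    return result;
}

// Example: lut = 0xEA encodes (a & b) | c, so an AND and an OR fuse into
// a single LOP3 instead of two separate logic instructions.
```

The example immediate 0xEA follows the PTX convention of evaluating the expression with a = 0xF0, b = 0xCC, c = 0xAA.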
Profiling Instructions With Modifiers
Nsight Compute displays instruction counts for base opcodes (e.g., `IMAD`) without modifiers by default. A high `IMAD` count might mostly be `IMAD.MOV.U32` operations rather than actual multiplications - a completely different performance picture.

Profile instructions with modifiers using the `sass__inst_executed_per_opcode_with_modifier_all` metric for a granular view of where the GPU spends its time. The default metric `sass__inst_executed_per_opcode` aggregates all variations of each base opcode.
Looking at the modifier breakdown reveals the real issue. The majority of IMAD instructions are actually register copies (IMAD.MOV.U32) or left bit shifts (IMAD.SHL.U32) rather than fused integer multiply-adds. These operations add overhead without contributing to the actual GEMM computation.
Compared to the reference kernel, we're executing:
- 11.6x as many register copies (`IMAD.MOV.U32` + `MOV`)
- 216x the logic operations (`LOP3.LUT`)
- 23.8x the left shift instructions (`IMAD.SHL.U32` + `SHF.L.U32`)
Combined, this amounts to roughly 2x the scalar instructions and 38.7% more instructions overall.
Data Dependency Instructions
The sheer number of extra instructions is concerning, but it's not the whole story. The critical problem is that many of these scalar instructions create data dependencies that block the operations that actually matter: cp.async, ldmatrix, and mma.
To quantify this, we'll count the unique scalar instructions that compute values used by cp.async, ldmatrix, and mma in the main loop. These are read-after-write (RAW) hazards - situations where one instruction must wait for another to complete before it can read the required register value.
Here's an example of such a dependency in SASS:
```
SHF.L.U32 R176, R176, 0x1, RZ ;
...
LDSM.16.M88.4 R32, [R176+0x4000] ;  // ldmatrix
```

The `ldmatrix` instruction can't execute until the `SHF.L.U32` completes, since it needs the value in `R176`.
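For context, here is a hedged CUDA sketch (not kernel 7's actual code; the tile layout, strides, and function name are made up for illustration, targeting sm_80) of the source-level pattern behind this kind of dependency: the shared-memory address handed to `ldmatrix` is produced by integer shifts and adds, so the load can't issue until that scalar math retires.

```cpp
#include <cuda_bf16.h>
#include <cstdint>

// Hypothetical fragment loader. The address arithmetic below is what compiles
// into the SHF.L.U32 / IMAD instructions that LDSM then has to wait on.
__device__ void load_k_fragment(const __nv_bfloat16* smem_tile, int lane,
                                uint32_t frag[4]) {
    // Integer math on the critical path: row/column offsets derived from the
    // lane id, scaled by a (hypothetical) row stride of 64 elements.
    const __nv_bfloat16* src = smem_tile + (lane % 16) * 64 + (lane / 16) * 8;
    uint32_t addr = static_cast<uint32_t>(__cvta_generic_to_shared(src));

    // ldmatrix can only issue once `addr` is available (the RAW hazard).
    asm volatile(
        "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
        : "=r"(frag[0]), "=r"(frag[1]), "=r"(frag[2]), "=r"(frag[3])
        : "r"(addr));
}
```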
The numbers reveal the severity of the problem:
| Configuration | Kernel 7 | Optimized (K16) | Multiplier |
|---|---|---|---|
|  | 197 | 29 | 6.8x |
|  | 238 | 21 | 11.3x |
Each of these dependency instructions must complete before its dependent memory or tensor operation can begin. They add latency to the critical path and consume register resources, creating stalls that prevent us from keeping the tensor cores fully utilized. This is why our tensor cores sit idle for so much of the kernel's execution.
Why the A100 Suffers: Throughput Ratios
Let's synthesize what we've discovered:
- On the A100, our tensor pipeline utilization is much lower than the reference, while on the RTX 3090 it's slightly higher
- On both devices, scalar pipeline utilization (FMA+ALU) is 50% higher than the reference
- We're executing roughly 2x the scalar instructions with 6.8-11.3x more dependency-creating instructions
Why does this instruction overhead cripple performance on the A100 but not the RTX 3090? The answer lies in drastically different throughput characteristics:
| Device | `mma` TFLOPs/s (16b input, 32b accum) | FP32 TFLOPs/s | `mma` / FP32 ratio |
|---|---|---|---|
| A100 | 311.84 | 19.5 | 16x |
| RTX 3090 | 71 | 35.6 | 2x |
The A100 delivers 4.5x the tensor throughput of the RTX 3090, but only half the FP32 throughput. This creates vastly different tensor-to-FP32 ratios: 16x on the A100 versus 2x on the RTX 3090.
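For reference, the ratios follow directly from the peak numbers in the table:

$$
\frac{311.84}{19.5} \approx 16 \;\;\text{(A100)}, \qquad \frac{71}{35.6} \approx 2 \;\;\text{(RTX 3090)}
$$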
This lopsided ratio amplifies the impact of our scalar dependency problem. Neither factor alone explains the performance gap - it's their interaction:
A100 (16x ratio): The tensor cores can process work 16x faster than scalar units can prepare it. When scalar dependencies block tensor operations, the high-throughput tensor cores sit idle waiting. The massive throughput imbalance means we can't hide this latency - there's simply not enough scalar compute happening in parallel to overlap with tensor operations.
RTX 3090 (2x ratio): The tensor and scalar throughputs are more balanced. Even with the same scalar dependencies, the lower tensor throughput provides more time to resolve them. The scalar overhead can partially hide behind tensor operation latencies, and the higher FP32 throughput means more useful work happens while tensor cores wait.
The result: our scalar-heavy instruction mix creates a severe bottleneck on the A100, where tensor cores spend most of their time starved for data. The RTX 3090's more balanced architecture tolerates the same overhead much better.
For a deeper look at how this all works in the hardware, you can check out Appendix A - Ampere Microarchitecture.
Block Size Limitations
The throughput imbalance explains much of the performance gap, but there's a second critical factor: kernel 7 cannot efficiently use the block configurations that perform best on the A100.
The A100 reference kernel uses B_r=128, B_c=64, which is also the best-performing block configuration for the optimized kernel (kernel 16):

| Kernel Iteration | B_r | B_c | Relative TFLOPs to B_r=128, B_c=64 |
|---|---|---|---|
| 16 | 128 | 64 | 100.0% |
| 16 | 128 | 32 | 89.3% |
| 16 | 64 | 64 | 84.1% |
| 16 | 64 | 32 | 83.4% |
Even with all optimizations in place, the B_r=64, B_c=64 configuration reaches only 84.1% of the performance of B_r=128, B_c=64.
Why Can't Kernel 7 Use Larger Blocks?
Kernel 7 actually performs worse at B_r=128, B_c=64 than at its best configuration, B_r=64, B_c=64:

| Kernel Iteration | B_r | B_c | Relative TFLOPs to B_r=64, B_c=64 |
|---|---|---|---|
| 7 | 64 | 64 | 100.0% |
| 7 | 128 | 32 | 95.4% |
| 7 | 64 | 32 | 90.3% |
| 7 | 128 | 64 | 86.9% |
The culprit: register spills. At B_r=128, B_c=64, kernel 7 hits the 255-registers-per-thread ceiling and spills 272 bytes to local memory:
| Configuration (B_r, B_c, warps) | Registers per Thread | SMEM per CTA | Warps per SM |
|---|---|---|---|
| (64, 64, 4) | 229 | 48KiB | 8 |
| (64, 32, 4) | 168 | 32KiB | 12 |
| (128, 32, 4) | 255 (0B spilled) | 48KiB | 8 |
| (128, 64, 4) | 255 (272B spilled) | 64KiB | 4 (RTX 3090) / 8 (A100) |
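As a sanity check on the last row, here is a quick back-of-the-envelope calculation (my own, assuming the Q, K, and V tiles are held in shared memory as bf16 and that both GPUs have a 64K-register file per SM):

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    const int d_head = 128, warps_per_cta = 4, regs_per_thread = 255;
    const int B_r = 128, B_c = 64;

    // Q tile (B_r x d_head) plus K and V tiles (B_c x d_head each), 2 bytes per bf16.
    int smem_per_cta = (B_r + 2 * B_c) * d_head * 2;            // 65,536 B = 64 KiB
    int regs_per_cta = regs_per_thread * warps_per_cta * 32;    // 32,640 registers

    const int smem_per_sm[2] = {100 * 1024, 164 * 1024};        // RTX 3090, A100
    const char* name[2] = {"RTX 3090", "A100"};
    for (int i = 0; i < 2; ++i) {
        int ctas = std::min(smem_per_sm[i] / smem_per_cta,      // SMEM-limited CTAs
                            65536 / regs_per_cta);              // register-limited CTAs
        printf("%s: SMEM/CTA = %d KiB, warps/SM = %d\n",
               name[i], smem_per_cta / 1024, ctas * warps_per_cta);
    }
}
// RTX 3090: 1 CTA -> 4 warps/SM; A100: 2 CTAs -> 8 warps/SM,
// matching the (128, 64, 4) row above.
```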
Why Doesn't the RTX 3090 Show This Issue?
Two hardware constraints mask these problems on the RTX 3090:
- SMEM limitations: The RTX 3090's 100KB of SMEM per SM can't run the larger block configurations with sufficient occupancy anyway
- Already tensor-bound: With tensor pipeline utilization near its peak (effective ~96%), larger blocks wouldn't help much
The A100's 164KB SMEM enables larger, more efficient configurations - but kernel 7's excessive register usage prevents taking advantage of them.
I'll have a section in the appendix soon that dives deeper into how block sizes affect instruction patterns and performance. Be on the lookout for that if you're interested!
Summary
Our analysis reveals why kernel 7 achieves 101.5% of reference performance on the RTX 3090 but only 80.3% on the A100.
The Core Problems
Scalar instruction overhead: Kernel 7 executes significantly more instructions than necessary:
- 2x the scalar instructions compared to reference
- 6.8-11.3x more dependency-creating instructions that block `cp.async`, `ldmatrix`, and `mma` operations
- 38.7% more total instructions
These dependency instructions add latency to the critical path, creating pipeline stalls that prevent the tensor cores from maintaining high utilization.
Register pressure limits block size: The A100's optimal block configuration, B_r=128, B_c=64, pushes kernel 7 to the 255-register ceiling and causes spills, so we can't run the configuration that performs best on this device.
Why the RTX 3090 Hides These Issues
Balanced throughput: The devices have drastically different throughput characteristics:
- A100: 16x `mma`-to-FP32 ratio - tensor cores starve while scalar dependencies resolve
- RTX 3090: 2x ratio - scalar overhead partially hides behind tensor operation latencies
Hardware constraints: The RTX 3090's limitations mask the problems:
- 100KB SMEM per SM can't run larger block sizes with sufficient occupancy anyway
- Tensor pipeline utilization already near peak (~96% effective), so larger blocks wouldn't help much
The A100's strengths - 4.5x higher tensor throughput and support for larger blocks - expose our kernel's inefficiencies: we can't keep the tensor cores fed due to scalar bottlenecks, and we can't leverage larger blocks due to register pressure.
Up Next
In Part 8, we'll hunt down the sources of these extra register copies, shifts, and logic operations through SASS-level analysis. By eliminating instruction overhead and reducing register pressure, we'll narrow the performance gap significantly - reaching 95.2% of reference performance on the A100, a nearly 15-percentage-point improvement from our current 80.3%.