In Part 6, we exceeded reference performance on the RTX 3090, hitting 101.5% through FP instruction fusion and auto-tuning. But when we ran the same kernel on the A100, performance dropped to just 80.3% - a swing of more than 20 percentage points from clear success to significant underperformance.

What causes such a large performance gap between these two Ampere-generation GPUs?

Profiling Kernel 7 on the A100

Let's compare the Nsight Compute profiling metrics of the reference kernel with the top configuration of kernel 7. We'll examine three key areas: compute pipeline utilization, instruction counts, and how the A100's throughput characteristics amplify our instruction overhead.

These profiles were done for seq_len=4096, d_head=128.

Compute Pipeline Utilization

Let's start by examining pipeline utilization [1] across the tensor, FMA (primarily floating point), and ALU (primarily integer) pipelines. The numbers reveal an immediate problem: our tensor pipeline is active for only 60.5% of the kernel's execution compared to 75.4% for the reference. Meanwhile, our scalar pipelines are significantly more active:

  • FMA: 29.2% vs 20.5%
  • ALU: 21.2% vs 13.1%
  • Combined: 50.4% vs 33.6% (roughly 50% higher in aggregate)

Pipeline utilization chart.
Reference kernel (B_r=128, B_c=64), Kernel 7 (B_r=64, B_c=64).
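
If you want to pull these per-pipe numbers yourself, Nsight Compute exposes them as metrics. Something along these lines should work - the metric names here are quoted from memory and the kernel/binary names are placeholders, so verify them with ncu --query-metrics before relying on them:

ncu -k regex:attention \
    --metrics sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed,sm__pipe_fma_cycles_active.avg.pct_of_peak_sustained_elapsed,sm__pipe_alu_cycles_active.avg.pct_of_peak_sustained_elapsed \
    ./attention_bench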

The cycle counts tell an even starker story. While both kernels execute roughly the same number of tensor pipeline cycles, kernel 7's FMA and ALU pipelines each run for roughly as many cycles as the two do combined in the reference kernel. We're doing the same amount of matrix multiplication work but spending far more time on scalar operations.

Pipeline cycles stacked.
Reference kernel (B_r=128, B_c=64), Kernel 7 (B_r=64, B_c=64).

RTX 3090: A Different Story

What about the RTX 3090? The picture looks completely different. Kernel 7 actually has slightly higher tensor utilization than the reference (48.0% vs 47.2%). We see similar patterns to the A100 in scalar pipeline activity - 53% higher in aggregate (14.2% vs 9.27%) - but the impact is much less severe.

Pipeline utilization chart.
Reference kernel (B_r=128, B_c=32), Kernel 7 (B_r=64, B_c=64).

Consumer CUDA GPU Tensor Throughput

The throughput for matrix multiplications with BF16/FP16 inputs and FP32 outputs [2] is artificially bottlenecked at half the speed on consumer GPUs compared to prosumer GPUs (Quadro/Pro lineup). The "effective" pipeline utilization values may be closer to double the reported metrics - 95.9% for kernel 7 vs 94.5% for reference.

With these effective values, the tensor pipeline is near its throughput ceiling. This raises a critical question: why is tensor utilization for our kernel significantly lower on the A100 but slightly higher on the RTX 3090, both compared to their respective reference kernels?

Scalar Instruction Overhead

The tensor pipeline utilization points us toward the problem, but we need to dig deeper. Why are the FMA and ALU pipelines so active? Let's compare the actual instruction counts between our kernel and the reference to quantify the overhead.

Count of key executed instructions on ALU and FMA pipelines,
Kernel 7 and Reference (B_r=128, B_c=32)

Block Configuration

While our top-performing configuration uses (B_r=64, B_c=64), we show the (B_r=128, B_c=32) configuration here for a fairer instruction count comparison with the reference kernel. The instruction overhead patterns remain similar across different configurations.

Our kernel executes significantly more of these instructions:

  • IMAD: integer multiply-add
  • LOP3.LUT: bitwise logic operation with 3 operands
  • MOV: register copy
  • SHF: bit shift

Not pictured is CS2R, which copies a special register (such as the zero register) into a general register. The reference kernel doesn't execute any CS2R instructions, while ours does - another source of overhead we'll eliminate in Part 8.
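
In SASS it typically looks something like this (register number illustrative, not taken from our kernel):

CS2R R180, SRZ ;                       // zero a register by reading the zero special register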

Instruction Modifiers

These raw instruction counts tell only part of the story. Instructions like IMAD serve multiple purposes via modifiers. For example, IMAD variants include:

  • IMAD.MOV.U32 (register copy)
  • IMAD.SHL.U32 (left shift)
  • IMAD.IADD (integer addition)
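
To make these concrete, here is roughly how the variants appear in a SASS dump (operand choices are illustrative, not taken from our kernel):

IMAD.MOV.U32 R4, RZ, RZ, R5 ;          // R4 = RZ*RZ + R5  -> register copy
IMAD.SHL.U32 R4, R5, 0x4, RZ ;         // R4 = R5*4        -> left shift by 2
IMAD.IADD    R4, R5, 0x1, R6 ;         // R4 = R5*1 + R6   -> integer addition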

Similarly, LOP3.LUT implements any 3-input boolean operation via lookup table (docs).
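
The LUT immediate is just the truth table of the desired function: following the PTX ISA convention, evaluate the function with a=0xF0, b=0xCC, c=0xAA and use the 8-bit result. For example, for d = (a & b) | c the immediate is (0xF0 & 0xCC) | 0xAA = 0xEA, which would show up as something like (registers illustrative):

LOP3.LUT R10, R2, R3, R4, 0xea, !PT ;  // R10 = (R2 & R3) | R4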

Profiling Instructions With Modifiers

Nsight Compute displays instruction counts for base opcodes (e.g., IMAD) without modifiers by default. A high IMAD count might mostly be IMAD.MOV.U32 operations rather than actual multiplications - a completely different performance picture.

Profile instructions with modifiers using the sass__inst_executed_per_opcode_with_modifier_all metric for a granular view of where the GPU spends its time. The default metric sass__inst_executed_per_opcode aggregates all variations of each base opcode.
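
From the command line, that looks something like this (the binary name is a placeholder):

ncu --metrics sass__inst_executed_per_opcode_with_modifier_all ./attention_bench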

Looking at the modifier breakdown reveals the real issue. The majority of IMAD instructions are actually register copies (IMAD.MOV.U32) or left bit shifts (IMAD.SHL.U32) rather than fused integer multiply-adds. These operations add overhead without contributing to the actual GEMM computation.

Compared to the reference kernel, we're executing:

  • 11.6x as many register copies (IMAD.MOV.U32 + MOV)
  • 216x the logic operations (LOP3.LUT)
  • 23.8x the left shift instructions (IMAD.SHL.U32 + SHF.L.U32)

Combined, this amounts to roughly 2x the scalar instructions and 38.7% more instructions overall.

Data Dependency Instructions

The sheer number of extra instructions is concerning, but it's not the whole story. The critical problem is that many of these scalar instructions create data dependencies that block the operations that actually matter: cp.async, ldmatrix, and mma.

To quantify this, we'll count the unique scalar instructions that compute values used by cp.async, ldmatrix, and mma in the main loop. These are read-after-write (RAW) hazards - situations where one instruction must wait for another to complete before it can read the required register value.

Here's an example of such a dependency in SASS:

RAW example
SHF.L.U32 R176, R176, 0x1, RZ ;        // R176 = R176 << 1 (address computation)
...
LDSM.16.M88.4 R32, [R176+0x4000] ;     // ldmatrix

The ldmatrix instruction can't execute until the SHF.L.U32 completes, since it needs the value in R176.

The numbers reveal the severity of the problem:

| Configuration | Kernel 7 | Optimized (K16) | Multiplier |
| --- | --- | --- | --- |
|  | 197 | 29 | 6.8x |
|  | 238 | 21 | 11.3x |

Each of these dependency instructions must complete before its dependent memory or tensor operation can begin. They add latency to the critical path and consume register resources, creating stalls that prevent us from keeping the tensor cores fully utilized. This is why our tensor cores sit idle for so much of the kernel's execution.
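
To build some intuition for what removing these dependencies looks like at the source level (and to foreshadow Part 8), here is a minimal, self-contained CUDA sketch. It is not the kernel from this series - swizzle, TILE_WORDS, and N_TILES are made-up stand-ins - and the compiler may already reorder some of this on its own; it only illustrates the shape of the problem: recomputing an offset before every load puts scalar work on each load's critical path, while hoisting the computation pays that cost once up front.

Dependency-hoisting sketch
#include <cstdint>

constexpr int N_TILES = 4;      // tiles loaded per outer-loop iteration (made up)
constexpr int TILE_WORDS = 32;  // 32-bit words per tile (made up)

// Toy stand-in for real shared-memory address math (swizzling, bank-conflict avoidance).
__device__ __forceinline__ int swizzle(int row, int k) {
    return ((row ^ k) & 7) + k * TILE_WORDS;
}

// Version A: the offset is recomputed before every load, so the IMAD/SHF/LOP3 chain
// producing `off` gates each load (a RAW hazard per iteration).
__device__ void load_recompute(const uint32_t* smem, uint32_t* frag, int row) {
    for (int k = 0; k < N_TILES; ++k) {
        int off = swizzle(row, k);   // scalar address math
        frag[k] = smem[off];         // cannot issue until `off` is ready
    }
}

// Version B: offsets are precomputed once, up front; in the unrolled copy loop the
// loads no longer wait on freshly computed scalar results.
__device__ void load_hoisted(const uint32_t* smem, uint32_t* frag, int row) {
    int off[N_TILES];
#pragma unroll
    for (int k = 0; k < N_TILES; ++k) off[k] = swizzle(row, k);
#pragma unroll
    for (int k = 0; k < N_TILES; ++k) frag[k] = smem[off[k]];   // loads issue back to back
}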

Why the A100 Suffers: Throughput Ratios

Let's synthesize what we've discovered:

  • On the A100, our tensor pipeline utilization is much lower than the reference, while on the RTX 3090 it's slightly higher
  • On both devices, scalar pipeline utilization (FMA+ALU) is 50% higher than the reference
  • We're executing roughly 2x the scalar instructions with 6.8-11.3x more dependency-creating instructions

Why does this instruction overhead cripple performance on the A100 but not the RTX 3090? The answer lies in drastically different throughput characteristics:

| Device | mma TFLOPs/s (16-bit input, 32-bit accumulate) | FP32 TFLOPs/s | mma / FP32 ratio |
| --- | --- | --- | --- |
| A100 | 311.84 | 19.5 | 16x |
| RTX 3090 | 71 | 35.6 | 2x |

The A100 delivers 4.5x the tensor throughput of the RTX 3090, but only half the FP32 throughput. This creates vastly different tensor-to-FP32 ratios: 16x on the A100 versus 2x on the RTX 3090.

This lopsided ratio amplifies the impact of our scalar dependency problem. Neither factor alone explains the performance gap - it's their interaction:

A100 (16x ratio): The tensor cores can process work 16x faster than scalar units can prepare it. When scalar dependencies block tensor operations, the high-throughput tensor cores sit idle waiting. The massive throughput imbalance means we can't hide this latency - there's simply not enough scalar compute happening in parallel to overlap with tensor operations.

RTX 3090 (2x ratio): The tensor and scalar throughputs are more balanced. Even with the same scalar dependencies, the lower tensor throughput provides more time to resolve them. The scalar overhead can partially hide behind tensor operation latencies, and the higher FP32 throughput means more useful work happens while tensor cores wait.

The result: our scalar-heavy instruction mix creates a severe bottleneck on the A100, where tensor cores spend most of their time starved for data. The RTX 3090's more balanced architecture tolerates the same overhead much better.

For a deeper look at how this all works in the hardware, you can check out Appendix A - Ampere Microarchitecture.

Block Size Limitations

The throughput imbalance explains much of the performance gap, but there's a second critical factor: kernel 7 cannot efficiently use the block configurations that perform best on the A100.

The A100 reference kernel uses (B_r=128, B_c=64). To understand why this matters, let's look ahead at our final optimized kernel (kernel 16) with different block sizes:

| Kernel Iteration | B_r | B_c | Relative TFLOPs to (B_r=128, B_c=64) |
| --- | --- | --- | --- |
| 16 | 128 | 64 | 100.0% |
| 16 | 128 | 32 | 89.3% |
| 16 | 64 | 64 | 84.1% |
| 16 | 64 | 32 | 83.4% |

Even with all optimizations in place, (B_r=64, B_c=64) - kernel 7's top configuration - achieves only 84.1% of the performance of (B_r=128, B_c=64). Kernel 7's performance ceiling is fundamentally limited by its inability to use larger blocks efficiently.

Why Can't Kernel 7 Use Larger Blocks?

Kernel 7 actually performs worse at (B_r=128, B_c=64):

| Kernel Iteration | B_r | B_c | Relative TFLOPs to (B_r=64, B_c=64) |
| --- | --- | --- | --- |
| 7 | 64 | 64 | 100.0% |
| 7 | 128 | 32 | 95.4% |
| 7 | 64 | 32 | 90.3% |
| 7 | 128 | 64 | 86.9% |

The culprit: register spills. At (B_r=128, B_c=64), kernel 7 spills 272 bytes of registers per thread. A significant portion of execution time gets consumed by expensive local memory traffic.

| Configuration | Registers per Thread | SMEM per CTA | Warps per SM |
| --- | --- | --- | --- |
| (64, 64, 4) | 229 | 48KiB | 8 |
| (64, 32, 4) | 168 | 32KiB | 12 |
| (128, 32, 4) | 255 (0B spilled) | 48KiB | 8 |
| (128, 64, 4) | 255 (272B spilled) | 64KiB | 4 (RTX 3090), 8 (A100) |
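
To see where the warp counts in the last row come from, here is the occupancy arithmetic for (128, 64, 4), assuming the trailing 4 means four warps (128 threads) per CTA and using the 64K 32-bit registers that both GPUs provide per SM:

Occupancy arithmetic, (128, 64, 4)
Registers:  255 regs/thread x 128 threads ≈ 32.6K per CTA  -> 2 CTAs fit in the 64K register file
SMEM, A100 (164KB):  2 x 64KiB = 128KiB fits                -> 2 CTAs x 4 warps = 8 warps per SM
SMEM, RTX 3090 (100KB):  only 1 x 64KiB CTA fits            -> 4 warps per SM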

Why Doesn't the RTX 3090 Show This Issue?

Two hardware constraints mask these problems on the RTX 3090:

  1. SMEM limitations: The RTX 3090's 100KB of SMEM per SM can't run the larger block configurations with sufficient occupancy anyway
  2. Already tensor-bound: With tensor pipeline utilization near its peak (effective ~96%), larger blocks wouldn't help much

The A100's 164KB SMEM enables larger, more efficient configurations - but kernel 7's excessive register usage prevents taking advantage of them.

I'll have a section in the appendix soon that dives deeper into how block sizes affect instruction patterns and performance. Be on the lookout for that if you're interested!

Summary

Our analysis reveals why kernel 7 achieves 101.5% of reference performance on the RTX 3090 but only 80.3% on the A100.

The Core Problems

Scalar instruction overhead: Kernel 7 executes significantly more instructions than necessary:

  • 2x the scalar instructions compared to reference
  • 6.8-11.3x more dependency-creating instructions that block cp.async, ldmatrix, and mma operations
  • 38.7% more total instructions

These dependency instructions add latency to the critical path, creating pipeline stalls that prevent the tensor cores from maintaining high utilization.

Register pressure limits block size: The A100's optimal block configuration causes kernel 7 to spill 272 bytes of registers per thread. These spills require expensive memory accesses that negate the benefits of larger blocks, forcing us to use smaller, less efficient configurations.

Why the RTX 3090 Hides These Issues

Balanced throughput: The devices have drastically different throughput characteristics:

  • A100: 16x mma-to-FP32 ratio - tensor cores starve while scalar dependencies resolve
  • RTX 3090: 2x ratio - scalar overhead partially hides behind tensor operation latencies

Hardware constraints: The RTX 3090's limitations mask the problems:

  • 100KB SMEM per SM can't run larger block sizes with sufficient occupancy anyway
  • Tensor pipeline utilization already near peak (~96% effective), so larger blocks wouldn't help much

The A100's strengths - 4.5x higher tensor throughput and support for larger blocks - expose our kernel's inefficiencies: we can't keep the tensor cores fed due to scalar bottlenecks, and we can't leverage larger blocks due to register pressure.

Up Next

In Part 8, we'll hunt down the sources of these extra register copies, shifts, and logic operations through SASS-level analysis. By eliminating instruction overhead and reducing register pressure, we'll narrow the performance gap significantly - reaching 95.2% of reference performance on the A100, a nearly 15-percentage-point improvement from our current 80.3%.

Footnotes

  1. The metric depicted in the graph is equal to . The values can sum up to over 100%, because different pipelines can run in parallel.

  2. Matrix multiplications with FP16 inputs and FP16 outputs are not affected.