In this 9-part series, we're going to implement Flash Attention 2 from scratch on Ampere GPUs. We'll build an initial implementation and optimize it over 16 kernel iterations, all without importing any external libraries. By the final kernel, we'll reach 99.2% of the official implementation's performance on the A100 and 102.9% on the RTX 3090 (at sequence length 4096).
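For reference, the operation every kernel in this series computes is standard scaled dot-product attention, as defined in the Flash Attention papers:

$$
O = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V
$$

where $Q, K, V \in \mathbb{R}^{N \times d}$ for sequence length $N$ and head dimension $d$. Flash Attention's key idea is to compute this without ever materializing the full $N \times N$ score matrix in global memory, instead tiling the computation through on-chip SRAM.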

Continued in Part 1...

Table of Contents