In this 9-part series, we're going to implement Flash Attention 2 from scratch on Ampere GPUs. We'll build an initial implementation and optimize it over 16 kernel iterations, all without importing any external libraries. By the final kernel, we'll reach 99.2% of the performance of the official implementation on the A100 and 102.9% on the RTX 3090 (at sequence length 4096).
Table of Contents
- Part 6: FP Instruction Fusion & Auto-Tuning (Coming Soon)
- Part 7: Instruction Reduction (Coming Soon)
- Part 8: Final Optimizations (Coming Soon)
- Part 9: Kernel Analysis (Coming Soon)