In this 9-part series, we're going to implement Flash Attention 2 from scratch on Ampere GPUs. We'll build an initial implementation and optimize it over 16 kernel iterations, all without importing any external libraries. By the final kernel, we'll reach 99.2% of the performance of the official implementation on the A100 and 102.9% on the RTX 3090 (at sequence length 4096).
Table of Contents
- Part 6: FP Instruction Fusion & Auto-Tuning (Coming Soon)
- Part 7: Instruction Reduction (Coming Soon)
- Part 8: Final Optimizations (Coming Soon)
- Part 9: Kernel Analysis (Coming Soon)