Maximum Achievable Matmul FLOPS
Explore matmul performance across shapes, dtypes, and GPUs — with shareable URLs and CSV export.
What is MAMF?
MAMF (Maximum Achievable Matmul FLOPS) is a practical upper bound on matrix-multiplication throughput for a given GPU and software stack (PyTorch/CUDA/cuBLAS, etc.). The benchmark sweeps many (M, N, K) shapes and records the best achieved TFLOPS for each, giving you a map of where the hardware is fast and where the performance cliffs are.
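The core measurement is simple: time a matmul, count its floating-point operations (2·M·N·K for an (M, K) @ (K, N) product), and divide. A minimal CPU/NumPy sketch of the idea follows; the real benchmark runs `torch.mm` on the GPU with warmup iterations and CUDA synchronization, which this illustration omits:

```python
import time
import numpy as np

def achieved_tflops(M: int, N: int, K: int, iters: int = 5) -> float:
    """Time an (M, K) @ (K, N) matmul and report the best observed TFLOPS.

    CPU/NumPy stand-in for illustration only; the actual benchmark times
    torch.mm on the GPU with proper warmup and torch.cuda.synchronize().
    """
    A = np.random.rand(M, K).astype(np.float32)
    B = np.random.rand(K, N).astype(np.float32)
    best = float("inf")
    for _ in range(iters):
        t0 = time.perf_counter()
        A @ B
        best = min(best, time.perf_counter() - t0)
    # A dense matmul performs 2*M*N*K FLOPs (one multiply + one add per term).
    return 2 * M * N * K / best / 1e12

print(f"{achieved_tflops(1024, 1024, 1024):.2f} TFLOPS")
```

The benchmark repeats this over hundreds of thousands of (M, N, K) combinations and keeps the best result per shape.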
Note: All Modal runs used a Debian-slim-based image with Python 3.12. Some runs (L40S, A40, and V100) were performed on other hardware but used the same PyTorch/Python versions.
Typical uses:
- Capacity planning: estimate how close your workloads can get to peak throughput for the shapes you actually run.
- Hardware comparisons: compare GPUs on the same shapes/dtype (not just a single cherry-picked benchmark).
- Performance debugging: find regimes where you are bandwidth/launch-bound or hitting kernel selection issues.
- Regression tracking: compare results across PyTorch/CUDA versions to spot wins and losses.
This project uses the `mamf_finder.py` script originally published by Stas Bekman in the ml-engineering repository.
Coverage
| Hardware | Torch | Dtype | Shapes tested | Best TFLOPS | % of spec peak | Best shape (M×N×K) |
|---|---|---|---|---|---|---|
| NVIDIA B200 | 2.9.0+cu128 | bfloat16 | 493,039 | 1718.9 | 76.4% | 18944x2560x11776 |
| NVIDIA B200 | 2.9.0+cu128 | float8_e4m3fn | 493,039 | 3343.4 | 74.3% | 4352x3328x7424 |
| NVIDIA H200 | 2.9.0+cu128 | bfloat16 | 493,039 | 796.4 | 80.5% | 1536x2816x7936 |
| NVIDIA H200 | 2.9.0+cu128 | float8_e4m3fn | 493,039 | 1450.9 | 73.3% | 5632x768x12800 |
| NVIDIA H100 80GB HBM3 | 2.9.0+cu128 | bfloat16 | 493,039 | 825.4 | 83.5% | 5632x1536x8448 |
| NVIDIA H100 80GB HBM3 | 2.9.0+cu128 | float8_e4m3fn | 493,039 | 1487.1 | 75.1% | 2560x2048x15616 |
| NVIDIA H100 80GB HBM3 | 2.8.0+cu128 | bfloat16 | 493,039 | 816.8 | 82.6% | 5632x1536x14848 |
| NVIDIA H100 80GB HBM3 | 2.7.1+cu126 | bfloat16 | 493,039 | 800.4 | 80.9% | 16128x16896x15616 |
| NVIDIA A100-SXM4-40GB | 2.9.0+cu128 | bfloat16 | 493,039 | 284.0 | 91.0% | 1536x2304x9984 |
| NVIDIA A100-SXM4-80GB | 2.9.0+cu128 | bfloat16 | 493,039 | 288.5 | 92.5% | 17664x4608x18432 |
| NVIDIA L40S | 2.9.0+cu128 | bfloat16 | 493,039 | 259.5 | 71.7% | 3328x9728x4096 |
| NVIDIA L40S | 2.9.0+cu128 | float8_e4m3fn | 493,039 | 426.6 | 58.2% | 5376x8192x5376 |
| NVIDIA A40 | 2.9.0+cu128 | bfloat16 | 493,039 | 135.7 | 90.6% | 1536x1792x12544 |
| Tesla V100-PCIE-32GB | 2.9.0+cu128 | float16 | 493,039 | 94.9 | 84.7% | 2560x1024x5632 |
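The percent-of-peak column is the best achieved TFLOPS divided by the GPU's datasheet dense (non-sparse) matmul peak. A minimal sketch; the ~2250 TFLOPS dense bfloat16 figure for B200 is an assumption inferred from NVIDIA's published specs and the table above:

```python
def percent_of_peak(best_tflops: float, spec_peak_tflops: float) -> float:
    """Share of the datasheet dense-matmul peak actually achieved."""
    return 100.0 * best_tflops / spec_peak_tflops

# Assumed dense (non-sparse) spec peak for B200 bfloat16: ~2250 TFLOPS.
print(f"{percent_of_peak(1718.9, 2250):.1f}%")  # reproduces the 76.4% B200 row
```

Note that vendor datasheets often headline the 2:4-sparsity figure, which is twice the dense peak; the dense number is the right baseline for these matmuls.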