MAMF Explorer
Maximum Achievable Matmul FLOPS

What is MAMF?

MAMF (Maximum Achievable Matmul FLOPS) is a practical upper bound on matrix-multiplication throughput for a given GPU + software stack (PyTorch/CUDA/cuBLAS, etc.). The benchmark sweeps many (M, N, K) shapes and records the best achieved TFLOPS for each shape, giving you a map of where the hardware is fast and where the performance cliffs are.

Note: The reported results use a Debian-slim-based image with Python 3.12 for all Modal runs. Some runs (L40S, A40, and V100) were performed on other hardware but use the same PyTorch/Python versions.

Why should you care?
  • Capacity planning: estimate how close your workloads can get to peak throughput for the shapes you actually run.
  • Hardware comparisons: compare GPUs on the same shapes/dtype (not just a single cherry-picked benchmark).
  • Performance debugging: find regimes where you are bandwidth/launch-bound or hitting kernel selection issues.
  • Regression tracking: compare results across PyTorch/CUDA versions to spot wins and losses.

How is MAMF measured?

This project uses the mamf_finder.py script originally published by Stas Bekman in the ml-engineering repository.
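The structure of such a sweep can be sketched as follows. This is a simplified illustration of the sweep-and-time approach, not the actual mamf_finder.py code: the real script runs PyTorch matmuls on the GPU with proper synchronization, while this NumPy/CPU sketch uses tiny, illustrative shape grids and iteration counts.

```python
import time
import numpy as np

def bench_shape(M, N, K, iters=5, warmup=2):
    """Time C = A @ B and return the best achieved GFLOPS over `iters` runs."""
    A = np.random.rand(M, K).astype(np.float32)
    B = np.random.rand(K, N).astype(np.float32)
    best = 0.0
    for i in range(warmup + iters):
        t0 = time.perf_counter()
        A @ B
        dt = time.perf_counter() - t0
        if i >= warmup:
            # A matmul does 2*M*N*K floating-point operations.
            best = max(best, 2 * M * N * K / dt / 1e9)
    return best

# Sweep a (tiny, illustrative) grid of shapes and record the best per shape.
results = {(m, n, k): bench_shape(m, n, k)
           for m in (256, 512) for n in (256, 512) for k in (256, 512)}
peak_shape = max(results, key=results.get)
```

The real benchmark sweeps hundreds of thousands of shapes (493,039 in the table below), so the interesting output is not a single number but the full shape-to-TFLOPS map.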

Tip: open Lookup and select All GPUs to compare the same shape across everything.

Coverage

peak = max(max_tflops) per config (hardware × Torch version × dtype)
Hardware | Torch | Dtype | Shapes | Peak TFLOPS | % of peak | Peak shape
NVIDIA B200 | 2.9.0+cu128 | bfloat16 | 493,039 | 1718.9 | 76.4% | 18944x2560x11776
NVIDIA B200 | 2.9.0+cu128 | float8_e4m3fn | 493,039 | 3343.4 | 74.3% | 4352x3328x7424
NVIDIA H200 | 2.9.0+cu128 | bfloat16 | 493,039 | 796.4 | 80.5% | 1536x2816x7936
NVIDIA H200 | 2.9.0+cu128 | float8_e4m3fn | 493,039 | 1450.9 | 73.3% | 5632x768x12800
NVIDIA H100 80GB HBM3 | 2.9.0+cu128 | bfloat16 | 493,039 | 825.4 | 83.5% | 5632x1536x8448
NVIDIA H100 80GB HBM3 | 2.9.0+cu128 | float8_e4m3fn | 493,039 | 1487.1 | 75.1% | 2560x2048x15616
NVIDIA H100 80GB HBM3 | 2.8.0+cu128 | bfloat16 | 493,039 | 816.8 | 82.6% | 5632x1536x14848
NVIDIA H100 80GB HBM3 | 2.7.1+cu126 | bfloat16 | 493,039 | 800.4 | 80.9% | 16128x16896x15616
NVIDIA A100-SXM4-40GB | 2.9.0+cu128 | bfloat16 | 493,039 | 284.0 | 91.0% | 1536x2304x9984
NVIDIA A100-SXM4-80GB | 2.9.0+cu128 | bfloat16 | 493,039 | 288.5 | 92.5% | 17664x4608x18432
NVIDIA L40S | 2.9.0+cu128 | bfloat16 | 493,039 | 259.5 | 71.7% | 3328x9728x4096
NVIDIA L40S | 2.9.0+cu128 | float8_e4m3fn | 493,039 | 426.6 | 58.2% | 5376x8192x5376
NVIDIA A40 | 2.9.0+cu128 | bfloat16 | 493,039 | 135.7 | 90.6% | 1536x1792x12544
Tesla V100-PCIE-32GB | 2.9.0+cu128 | float16 | 493,039 | 94.9 | 84.7% | 2560x1024x5632
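The "% of peak" column is the achieved peak divided by the GPU's theoretical dense matmul peak for that dtype (from vendor datasheets, without sparsity). As a sanity check, assuming the H100's dense bf16 peak of 989 TFLOPS, the 825.4 TFLOPS entry works out to the 83.5% shown in the table:

```python
def pct_of_peak(achieved_tflops, theoretical_tflops):
    """Achieved throughput as a percentage of the theoretical dense peak."""
    return 100.0 * achieved_tflops / theoretical_tflops

# H100 bf16: 825.4 achieved vs. an assumed 989 TFLOPS dense (non-sparse) peak.
h100_bf16 = pct_of_peak(825.4, 989.0)  # ≈ 83.5, matching the table row
```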