Maximum Achievable Matmul FLOPS
Explore matmul performance across shapes, dtypes, and GPUs — with shareable URLs and CSV export.
What is MAMF?
MAMF (Maximum Achievable Matmul FLOPS) is a practical upper bound on matrix-multiplication throughput for a given GPU and software stack (PyTorch/CUDA/cuBLAS, etc.). The benchmark sweeps many (M, N, K) shapes and records the best achieved TFLOPS for each, giving you a map of where the hardware is fast and where the performance cliffs are.
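The core measurement is simple: time a matmul, count its floating-point operations (2·M·N·K for an (M, K) @ (K, N) product), and divide. A minimal CPU/NumPy sketch of the idea follows; the real benchmark runs `torch.mm` on the GPU with warmup iterations and CUDA synchronization, which this illustration omits:

```python
import time
import numpy as np

def achieved_tflops(M: int, N: int, K: int, iters: int = 5) -> float:
    """Time an (M, K) @ (K, N) matmul and report the best observed TFLOPS.

    CPU/NumPy stand-in for illustration only; the actual benchmark times
    torch.mm on the GPU with proper warmup and torch.cuda.synchronize().
    """
    A = np.random.rand(M, K).astype(np.float32)
    B = np.random.rand(K, N).astype(np.float32)
    best = float("inf")
    for _ in range(iters):
        t0 = time.perf_counter()
        A @ B
        best = min(best, time.perf_counter() - t0)
    # A dense matmul performs 2*M*N*K FLOPs (one multiply + one add per term).
    return 2 * M * N * K / best / 1e12

print(f"{achieved_tflops(1024, 1024, 1024):.2f} TFLOPS")
```

The benchmark repeats this over hundreds of thousands of (M, N, K) combinations and keeps the best result per shape.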
Note: All Modal runs used a Debian-slim-based image with Python 3.12. Some runs (L40S, A40, and V100) were performed on other hardware but used the same PyTorch/Python versions.
Typical uses:
- Capacity planning: estimate how close your workloads can get to peak throughput for the shapes you actually run.
- Hardware comparisons: compare GPUs on the same shapes/dtype (not just a single cherry-picked benchmark).
- Performance debugging: find regimes where you are bandwidth/launch-bound or hitting kernel selection issues.
- Regression tracking: compare results across PyTorch/CUDA versions to spot wins and losses.
This project uses the `mamf_finder.py` script originally published by Stas Bekman in the ml-engineering repository.
Coverage
| Hardware | Torch | Dtype | Shapes tested | Best TFLOPS | % of spec peak | Best shape (M×N×K) |
|---|---|---|---|---|---|---|
| NVIDIA B200 | 2.9.0+cu128 | bfloat16 | 493,039 | 1718.9 | 76.4% | 18944x2560x11776 |
| NVIDIA B200 | 2.9.0+cu128 | float8_e4m3fn | 493,039 | 3343.4 | 74.3% | 4352x3328x7424 |
| NVIDIA H200 | 2.9.0+cu128 | bfloat16 | 493,039 | 796.4 | 80.5% | 1536x2816x7936 |
| NVIDIA H200 | 2.9.0+cu128 | float8_e4m3fn | 493,039 | 1450.9 | 73.3% | 5632x768x12800 |
| NVIDIA H100 80GB HBM3 | 2.9.0+cu128 | bfloat16 | 493,039 | 825.4 | 83.5% | 5632x1536x8448 |
| NVIDIA H100 80GB HBM3 | 2.9.0+cu128 | float8_e4m3fn | 493,039 | 1487.1 | 75.1% | 2560x2048x15616 |
| NVIDIA H100 80GB HBM3 | 2.8.0+cu128 | bfloat16 | 493,039 | 816.8 | 82.6% | 5632x1536x14848 |
| NVIDIA H100 80GB HBM3 | 2.7.1+cu126 | bfloat16 | 493,039 | 800.4 | 80.9% | 16128x16896x15616 |
| NVIDIA A100-SXM4-40GB | 2.9.0+cu128 | bfloat16 | 493,039 | 284.0 | 91.0% | 1536x2304x9984 |
| NVIDIA A100-SXM4-80GB | 2.9.0+cu128 | bfloat16 | 493,039 | 288.5 | 92.5% | 17664x4608x18432 |
| NVIDIA L40S | 2.9.0+cu128 | bfloat16 | 493,039 | 259.5 | 71.7% | 3328x9728x4096 |
| NVIDIA L40S | 2.9.0+cu128 | float8_e4m3fn | 493,039 | 426.6 | 58.2% | 5376x8192x5376 |
| NVIDIA A40 | 2.9.0+cu128 | bfloat16 | 493,039 | 135.7 | 90.6% | 1536x1792x12544 |
| Tesla V100-PCIE-32GB | 2.9.0+cu128 | float16 | 493,039 | 94.9 | 84.7% | 2560x1024x5632 |
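The percent-of-peak column is the best achieved TFLOPS divided by the GPU's datasheet dense (non-sparse) matmul peak. A minimal sketch; the ~2250 TFLOPS dense bfloat16 figure for B200 is an assumption inferred from NVIDIA's published specs and the table above:

```python
def percent_of_peak(best_tflops: float, spec_peak_tflops: float) -> float:
    """Share of the datasheet dense-matmul peak actually achieved."""
    return 100.0 * best_tflops / spec_peak_tflops

# Assumed dense (non-sparse) spec peak for B200 bfloat16: ~2250 TFLOPS.
print(f"{percent_of_peak(1718.9, 2250):.1f}%")  # reproduces the 76.4% B200 row
```

Note that vendor datasheets often headline the 2:4-sparsity figure, which is twice the dense peak; the dense number is the right baseline for these matmuls.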