While I fully agree with you on the absence of good benchmarks and the growing LLM slop ...
"running a model trained in FP8 in 16bit, something noone would do, etc"
I did that because on the RTX 3090 - which can offer good bang for the buck for inference - FP8 support is nerfed at the driver level. So a kernel that upcasts FP8 to FP16 inside SRAM, does the matmul there, then casts the result back down to FP8 can bring massive performance benefits on those consumer cards.
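The idea can be sketched in numpy. This is a minimal CPU illustration, not the actual kernel: a real implementation does the decode per-tile in SRAM and runs the FP16 matmul on Tensor Cores, and the function names here (`fp8_e4m3_decode`, `fp8_e4m3_encode`, `w8a16_matmul`) are mine, not from any library.

```python
import numpy as np

def fp8_e4m3_decode(codes):
    """Decode uint8 E4M3 codes (1 sign, 4 exp, 3 mantissa, bias 7) to float32."""
    codes = codes.astype(np.uint32)
    s = (codes >> 7) & 1
    e = (codes >> 3) & 0xF
    m = codes & 0x7
    normal = e > 0
    mag = np.where(normal,
                   np.exp2(e.astype(np.float32) - 7) * (1 + m / 8.0),   # normals
                   np.exp2(-6.0) * (m / 8.0))                           # subnormals
    val = np.where(s == 1, -mag, mag)
    # E4M3 has no inf; the pattern S.1111.111 is NaN
    return np.where((e == 0xF) & (m == 0x7), np.nan, val).astype(np.float32)

# lookup table of all 256 representable values, for brute-force quantization
_LUT = fp8_e4m3_decode(np.arange(256, dtype=np.uint8))

def fp8_e4m3_encode(x):
    """Quantize float32 to the nearest E4M3 code by brute-force nearest lookup."""
    lut = np.where(np.isnan(_LUT), np.inf, _LUT)  # never pick the NaN codes
    return np.argmin(np.abs(x[..., None] - lut), axis=-1).astype(np.uint8)

def w8a16_matmul(a_fp16, w_fp8_codes):
    """What the kernel does per tile: upcast FP8 weights to FP16, matmul in FP16."""
    w = fp8_e4m3_decode(w_fp8_codes).astype(np.float16)
    return a_fp16 @ w
```

The payoff on Ampere is that the weights stay at 1 byte each in VRAM (halving memory traffic, which dominates at low batch sizes) while the arithmetic runs on the fast FP16 path the hardware actually exposes.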
BTW, you can run a good DeepSeek-V3 quant on a single H200.
Thanks! I was looking at Blackwell RTX PRO 6000s, 8x 96GB, for running the full FP8 model (as FP8 is natively supported there and presumably fast).
I know an AWQ quant should run and be pretty snappy and efficient with the new MLA support added, but I wanted to check whether FP8 fits as well, because from simple napkin math it seems pretty tight (it might only work at batch size 1 with ctx_len < 8k, which would probably not be suitable for coding tasks).