While I fully agree with you on the absence of good benchmarks and the growing LLM slop ...

"running a model trained in FP8 in 16bit, something noone would do, etc"

I did that because on the RTX 3090 - which can be good bang for the buck for inference - the FP8 support is nerfed at the driver level. So a kernel that upcasts FP8 to FP16 inside SRAM, does the matmul there, then casts the result back down to FP8 can bring massive performance benefits on those consumer cards.
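In PyTorch terms the pattern is just this (a toy sketch with made-up shapes, needs a CUDA box and a PyTorch recent enough to have float8 dtypes; the real win comes from fusing the upcast into the matmul kernel so the FP16 copy only ever exists in SRAM instead of being materialized in VRAM):

    import torch

    # Store the weight in FP8 to save memory/bandwidth, compute in FP16.
    w_fp8 = torch.randn(4096, 4096, device="cuda",
                        dtype=torch.float16).to(torch.float8_e4m3fn)
    x = torch.randn(1, 4096, device="cuda", dtype=torch.float16)

    y = x @ w_fp8.to(torch.float16).T    # upcast, then FP16 matmul
    y_fp8 = y.to(torch.float8_e4m3fn)    # cast the result back down to FP8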

BTW, you can run a good DeepSeek-V3 quant on a single H200.
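For reference, the weights-only napkin math (671B total parameters vs the H200's 141 GB of HBM):

    params = 671e9   # DeepSeek-V3 total parameter count
    hbm = 141e9      # H200 HBM in bytes

    for bits in (8, 4, 2, 1.58):
        gb = params * bits / 8 / 1e9
        print(f"{bits} bits/weight -> ~{gb:.0f} GB of weights "
              f"({'fits' if gb < 141 else 'too big'})")
    # weights alone fit only below ~1.7 bits/weight on average;
    # anything larger needs more GPUs or CPU offload.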




Thanks! I was looking at Blackwell RTX PRO 6000s, 8x 96 GB, for running the full FP8 model (since FP8 is natively supported on Blackwell and presumably fast).

I know AWQ should run, and be pretty snappy and efficient with the new MLA support added, but I wanted to check whether FP8 fits as well, because from simple napkin math it looks pretty tight (it might only work at bs=1 with ctx_len < 8k, which would probably not suit coding tasks).
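Roughly, assuming ~1 byte per weight at FP8 and ignoring scale factors and anything kept in higher precision:

    params = 671e9            # DeepSeek-V3 total parameters
    vram = 8 * 96e9           # 8x 96 GB, in bytes

    weights = params * 1.0    # ~1 byte/param at FP8
    headroom = vram - weights
    print(f"~{headroom / 1e9:.0f} GB left in total, "
          f"~{headroom / 8e9:.0f} GB per GPU")
    # ~97 GB total / ~12 GB per GPU has to cover activations, scale
    # factors, framework overhead and the KV cache, hence "pretty tight".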



