fp16 models run inference just fine in fp32, though I was half joking in my original comment: it could easily take weeks for this to process a single input. You're better off trying to get something like Hugging Face Accelerate working (like the comment above suggests), which swaps layers of the model on and off disk as needed.
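
In case it helps, here's a rough sketch of what that looks like through transformers (which uses Accelerate under the hood when you pass `device_map="auto"`). The model ID and offload folder are placeholders, and treat this as a starting point rather than a tested recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom"  # placeholder; substitute whatever checkpoint you're loading

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets Accelerate spread layers across GPU and CPU RAM,
# and anything that doesn't fit gets written to offload_folder on disk and
# streamed back in layer by layer during the forward pass.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    offload_folder="offload",   # where spilled layers are stored on disk
    torch_dtype=torch.float16,  # keep the fp16 weights instead of upcasting to fp32
)

inputs = tokenizer("Hello, world", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0]))
```

It'll run, but expect it to be slow: every token pays the cost of reading the offloaded layers back from disk.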