Finetuning is easy and worthwhile, especially with LoRAs, as these Unsloth demos show. The bottleneck then becomes how to self-host the finetuned model in a way that's cost-effective and scalable.
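As a rough sketch of what those demos boil down to, assuming Unsloth and TRL are installed (the base model name, dataset, and hyperparameters below are placeholders, and the exact trainer arguments shift a bit between versions):

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load a 4-bit quantized base model (placeholder name) to keep VRAM low.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these small low-rank matrices get trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Placeholder dataset: one pre-formatted prompt/response string per row in "text".
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        logging_steps=10,
    ),
)
trainer.train()

# The artifact is just the adapter weights, a tiny fraction of the full model.
model.save_pretrained("lora_adapter")
tokenizer.save_pretrained("lora_adapter")
```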
There are inference providers such as Together AI that will serve your LoRA adapters at no extra cost above the base model price. Then there's basically no difference between using your fine-tuned model and an off-the-shelf API model (except for the benefits you get from fine-tuning).
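Since Together exposes an OpenAI-compatible endpoint, calling the adapter looks the same as calling any hosted model; the adapter id below is a made-up placeholder for whatever name your uploaded LoRA gets:

```python
from openai import OpenAI

# Together's API is OpenAI-compatible, so the stock client works;
# only the base_url and the model string change.
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="...",  # your Together API key
)

resp = client.chat.completions.create(
    model="my-org/my-lora-adapter",  # hypothetical adapter id
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
)
print(resp.choices[0].message.content)
```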
> The bottleneck then becomes how to self-host the finetuned model in a way that's cost-effective and scalable
It's not actually that expensive or hard. For narrow use cases, you can produce 4-bit quantized fine-tunes that perform as well as the full-precision model, and hosting the 4-bit quantized version is relatively cheap: an A40 or RTX 3090 on Runpod runs ~$300/month.
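One way to do that is to load an AWQ- or GPTQ-quantized checkpoint with vLLM on that single card; this is a minimal sketch with placeholder paths, not the only setup that works:

```python
from vllm import LLM, SamplingParams

# Load a 4-bit quantized merged checkpoint on one 24-48 GB GPU (A40 / 3090).
# Model path and quantization method are placeholders.
llm = LLM(
    model="./my-finetune-awq",
    quantization="awq",
    gpu_memory_utilization=0.90,
    max_model_len=4096,
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize this support ticket: ..."], params)
print(outputs[0].outputs[0].text)
```

For an actual endpoint you'd normally run vLLM's OpenAI-compatible server instead of the offline API, but the GPU footprint is the same.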
For self-hosting I've been using https://tuns.sh, which is an SSH-based tunneling solution. It works great for prototyping, and I've been using it to host open-webui.
> If you have the resources to fine tune, you have the resources to run inference on fine tuned model.
I don't think that's true.
I can fine-tune a model by renting a few A100s for a few hours, at a total cost in the double-digit dollars. It's a one-time cost.
Running inference with the resulting model for a production application could cost single-digit dollars per hour, which adds up to hundreds or even thousands of dollars a month on an ongoing basis.
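To put rough, purely illustrative numbers on it: a $2/hour GPU running around the clock is about $2 × 24 × 30 ≈ $1,440 a month, against a one-off finetuning bill in the tens of dollars.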
In practice, prompt engineering and few-shot prompting with modern LLMs, given their strong and only-getting-better prompt adherence, tend to be more pragmatic.
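For comparison, the few-shot route is just a handful of worked examples in the prompt (OpenAI client shown here; the classification task and examples are made up):

```python
from openai import OpenAI

client = OpenAI()

# Few-shot prompting: demonstrate the desired behavior inline instead of
# baking it into the weights via finetuning. Task and examples are invented.
messages = [
    {"role": "system", "content": "Classify support tickets as 'billing', 'bug', or 'other'."},
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": "The export button does nothing when I click it."},
    {"role": "assistant", "content": "bug"},
    {"role": "user", "content": "Can I get an invoice for last quarter?"},
]

resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(resp.choices[0].message.content)
```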