Finetuning is easy and worthwhile, especially with LoRAs, as these Unsloth demos show. The bottleneck then becomes how to self-host the finetuned model in a way that's cost-effective and scalable.
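As a rough sketch of what those demos boil down to, assuming Unsloth and TRL are installed (the base model name, dataset, and hyperparameters below are placeholders, and the exact trainer arguments shift a bit between versions):

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load a 4-bit quantized base model (placeholder name) to keep VRAM low.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these small low-rank matrices get trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Placeholder dataset: one pre-formatted prompt/response string per row in "text".
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        logging_steps=10,
    ),
)
trainer.train()

# The artifact is just the adapter weights, a tiny fraction of the full model.
model.save_pretrained("lora_adapter")
tokenizer.save_pretrained("lora_adapter")
```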
There are inference providers such as Together AI that will serve your LoRA adapters at no extra cost above the base model price. Then there's basically no difference between using your fine-tuned model and an off-the-shelf API model (except for the benefits you get from fine-tuning).
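Since Together exposes an OpenAI-compatible endpoint, calling the adapter looks the same as calling any hosted model; the adapter id below is a made-up placeholder for whatever name your uploaded LoRA gets:

```python
from openai import OpenAI

# Together's API is OpenAI-compatible, so the stock client works;
# only the base_url and the model string change.
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="...",  # your Together API key
)

resp = client.chat.completions.create(
    model="my-org/my-lora-adapter",  # hypothetical adapter id
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
)
print(resp.choices[0].message.content)
```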
> The bottleneck then becomes how to self-host the finetuned model in a way that's cost-effective and scalable
It's not actually that expensive or hard. For narrow use cases, you can produce 4-bit quantized fine-tunes that perform as well as the full-precision model, and hosting the 4-bit quantized version is relatively cheap: an A40 or RTX 3090 on Runpod runs ~$300/month.
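One way to do that is to load an AWQ- or GPTQ-quantized checkpoint with vLLM on that single card; this is a minimal sketch with placeholder paths, not the only setup that works:

```python
from vllm import LLM, SamplingParams

# Load a 4-bit quantized merged checkpoint on one 24-48 GB GPU (A40 / 3090).
# Model path and quantization method are placeholders.
llm = LLM(
    model="./my-finetune-awq",
    quantization="awq",
    gpu_memory_utilization=0.90,
    max_model_len=4096,
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize this support ticket: ..."], params)
print(outputs[0].outputs[0].text)
```

For an actual endpoint you'd normally run vLLM's OpenAI-compatible server instead of the offline API, but the GPU footprint is the same.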
For self-hosting I've been using https://tuns.sh, which is an SSH-based tunneling solution. It works great for prototyping, and I've been using it to host open-webui.
> If you have the resources to fine tune, you have the resources to run inference on fine tuned model.
I don't think that's true.
I can fine-tune a model by renting a few A100s for a few hours, at a total cost in the double-digit dollars. It's a one-time cost.
Running inference with the resulting model for a production application could cost single-digit dollars per hour, which adds up to hundreds or even thousands of dollars a month on an ongoing basis.
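To put rough, purely illustrative numbers on it: a $2/hour GPU running around the clock is about $2 × 24 × 30 ≈ $1,440 a month, against a one-off finetuning bill in the tens of dollars.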
In practice, prompt engineering and few-shot prompting with modern LLMs, given their strong and only-getting-better prompt adherence, tend to be more pragmatic.
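For comparison, the few-shot route is just a handful of worked examples in the prompt (OpenAI client shown here; the classification task and examples are made up):

```python
from openai import OpenAI

client = OpenAI()

# Few-shot prompting: demonstrate the desired behavior inline instead of
# baking it into the weights via finetuning. Task and examples are invented.
messages = [
    {"role": "system", "content": "Classify support tickets as 'billing', 'bug', or 'other'."},
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": "The export button does nothing when I click it."},
    {"role": "assistant", "content": "bug"},
    {"role": "user", "content": "Can I get an invoice for last quarter?"},
]

resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(resp.choices[0].message.content)
```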