
Finetuning is easy and worthwhile, especially with LoRAs as these Unsloth demos show. The bottleneck then becomes how to self-host the finetuned model in a way that's cost-effective and scalable.
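The LoRA part itself really is only a few lines with Unsloth; a rough sketch, assuming Unsloth's FastLanguageModel API (the model name and hyperparameters here are placeholders, not taken from the demos):

    from unsloth import FastLanguageModel

    # Load a 4-bit base model (placeholder name) and attach LoRA adapters.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/llama-3-8b-bnb-4bit",  # hypothetical choice
        max_seq_length=2048,
        load_in_4bit=True,
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,                # LoRA rank
        lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
    # Train with e.g. trl's SFTTrainer on your own dataset, then export just
    # the adapter weights -- which is exactly where the hosting question starts.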

In practice, prompt engineering and few-shot prompting with modern LLMs tend to be more pragmatic, thanks to their strong (and only getting better) prompt adherence.
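Few-shot prompting in particular covers a surprising amount of ground; a minimal sketch (the task, labels, and model name are made up):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder off-the-shelf model
        messages=[
            {"role": "system", "content": "Classify support tickets as 'billing' or 'bug'."},
            {"role": "user", "content": "I was charged twice this month."},
            {"role": "assistant", "content": "billing"},
            {"role": "user", "content": "The export button crashes the app."},
            {"role": "assistant", "content": "bug"},
            {"role": "user", "content": "My invoice shows the wrong plan."},
        ],
    )
    print(resp.choices[0].message.content)  # expected: "billing"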




There are inference providers such as Together AI that will serve your LoRA adapters at no extra cost above the model price. Then there’s basically no difference between using your fine-tuned model and an off-the-shelf API model (except for the benefits you get from fine-tuning).
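In that setup the call site looks identical to hitting the base model; a sketch assuming an OpenAI-compatible endpoint (the base_url and adapter name are placeholders, and the adapter-upload step is provider-specific and not shown):

    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.together.xyz/v1",  # or whichever provider hosts the adapter
        api_key="...",
    )
    resp = client.chat.completions.create(
        model="my-org/my-base-model--my-lora-adapter",  # hypothetical adapter id
        messages=[{"role": "user", "content": "Classify this ticket: ..."}],
    )
    print(resp.choices[0].message.content)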


This (Serverless LoRA providers) is what most people want even if they don't know it.


Yeah this big time. I haven’t found a solution that makes sense. Larger models are already good enough and so convenient.

When it’s more feasible to do inference on the client (browser or desktop), I can see SLMs showing up more commonly in production.


> The bottleneck then becomes how to self-host the finetuned model in a way that's cost-effective and scalable

It's not actually that expensive or hard. For narrow use cases, you can produce 4-bit quantized fine-tunes that perform as well as the full-precision model. Hosting the 4-bit quantized version can be done at relatively low cost: an A40 or RTX 3090 on RunPod runs about $300/month.
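As a rough sketch of what hosting the 4-bit version means in practice (assuming transformers + bitsandbytes; the model path is a placeholder):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # NF4 4-bit quantization keeps a 7-8B fine-tune well inside a 24 GB card.
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained("my-org/my-finetune")  # placeholder
    model = AutoModelForCausalLM.from_pretrained(
        "my-org/my-finetune",
        quantization_config=bnb,
        device_map="auto",
    )
    # Put a thin HTTP server (or vLLM) in front of this and the A40/3090
    # price point above is the whole ongoing bill.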


For self-hosting I've been using https://tuns.sh, an SSH-based tunneling solution. It works great for prototyping, and I've been using it to host open-webui.


If you have the resources to fine-tune, you have the resources to run inference on the fine-tuned model.

If you want to scale up and down on demand, you can just fine-tune on OpenAI or Google Cloud instead.
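The managed route really is just a couple of API calls; a sketch assuming the current OpenAI Python SDK (the file name and base model are placeholders):

    from openai import OpenAI

    client = OpenAI()
    f = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
    job = client.fine_tuning.jobs.create(
        training_file=f.id,
        model="gpt-4o-mini-2024-07-18",  # example base model
    )
    # When the job finishes, the resulting model id is served on demand like
    # any other API model, so there's nothing of your own to keep warm.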


> If you have the resources to fine-tune, you have the resources to run inference on the fine-tuned model.

I don't think that's true.

I can fine-tune a model by renting a few A100s for a few hours, with the total cost in the double-digit dollars. It's a one-time cost.

Running inference with the resulting model for a production application could cost single digit dollars per hour, which adds up to hundreds or even thousands of dollars a month on an ongoing basis.
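To put rough numbers on that shape (the rates here are made up, just to show the math):

    # One-time training vs. always-on serving, hypothetical rates:
    finetune_cost = 4 * 3 * 2.00      # 4 A100s x 3 hours x ~$2/GPU-hr  ~= $24, once
    inference_cost = 1.50 * 24 * 30   # ~$1.50/hr for one GPU, 24/7     ~= $1,080/month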


This assumes that inference is needed 24/7.

That may or may not be true for use cases that require asynchronous, bulk inference _and_ some task-specific post-training.

FWIW, my approach towards tasks like the above is to

1. start with using an off-the-shelf LM API until

2. one figures out (using evals that capture product intent) what the failure modes are (there always are some) and then

3. post-train against those (using the evals)
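A minimal version of step 2, just to make it concrete (the model, prompt, and labels are placeholders):

    from openai import OpenAI

    client = OpenAI()
    cases = [
        {"input": "I was charged twice.", "expected": "refund"},
        {"input": "How do I export my data?", "expected": "not_refund"},
    ]

    def predict(text: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # the off-the-shelf model from step 1
            messages=[{"role": "user", "content": f"Label as refund/not_refund: {text}"}],
        )
        return resp.choices[0].message.content.strip()

    failures = [c for c in cases if predict(c["input"]) != c["expected"]]
    print(f"{len(failures)}/{len(cases)} failed")  # these become the post-training set for step 3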



