I have a few questions. 1. I'm assuming from the pricing that it's "serverless" inference; what's the cold-start time like? 2. Any idea on inference costs?
Also, just to reiterate what others have said: the option of exporting weights would definitely make this more appealing (although it sounds like that's on the roadmap).
> I'm assuming from the pricing that it's "serverless" inference; what's the cold-start time like?
Yeah, you could probably call it serverless inference. However, because all fine-tuned models are trained on the same base model(s), we can apply some interesting optimizations beyond standard "serverless" model deployment. The biggest is that we keep the base model loaded in VRAM and only swap the trained weight deltas per request. That gives us sub-second cold starts for inference in the average case.
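To make the idea concrete, here's a rough sketch (not our actual serving stack) of what that pattern looks like with LoRA-style deltas and Hugging Face transformers/peft; the model and adapter names are placeholders:

    # Sketch: keep the shared base model resident in VRAM and hot-swap
    # small per-tenant LoRA adapters per request. Names are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    BASE = "meta-llama/Llama-3.1-8B-Instruct"  # loaded once, stays in VRAM
    base_model = AutoModelForCausalLM.from_pretrained(
        BASE, torch_dtype=torch.bfloat16, device_map="cuda"
    )
    tokenizer = AutoTokenizer.from_pretrained(BASE)

    # Attach a first adapter; later requests only load/switch adapters,
    # which is megabytes of weights instead of the full model.
    model = PeftModel.from_pretrained(
        base_model, "adapters/customer-a", adapter_name="customer-a"
    )

    def serve(adapter: str, prompt: str) -> str:
        # Load the adapter if it hasn't been seen yet, then activate it.
        if adapter not in model.peft_config:
            model.load_adapter(f"adapters/{adapter}", adapter_name=adapter)
        model.set_adapter(adapter)
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=256)
        return tokenizer.decode(out[0], skip_special_tokens=True)

Since only the adapter weights move per request, the "cold start" for a fine-tune is an adapter load and a pointer switch rather than a full model load.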
> Any idea on inference costs?
Right now, we're pricing inference at $0.50/M input tokens and $2.50/M output tokens. That's in a similar range to, but a bit below, gpt-4o/Claude 3.5, which we see as the main models we're "competing" with. Since our long-term goal is to democratize access to models/agents, we hope to drop inference prices further; some other optimizations we're currently planning should make that possible.
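For a rough sense of scale (my own back-of-the-envelope arithmetic, with an illustrative request size, not a quote):

    # Cost at $0.50/M input and $2.50/M output tokens
    input_tokens, output_tokens = 10_000, 1_000
    cost = input_tokens / 1e6 * 0.50 + output_tokens / 1e6 * 2.50
    print(f"${cost:.4f} per request")  # -> $0.0075 per request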