That may or may not be true for use cases that require asynchronous, bulk inference _and_ some task-specific post-training.
FWIW, my approach to tasks like the above is to
1. start with an off-the-shelf LM API,
2. figure out (using evals that capture product intent) what the failure modes are (there always are some), and then
3. post-train against those (using the same evals)
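The eval step above can be sketched roughly like this: cases encode product intent as checks, and tallying which checks fail tells you what to post-train against. Everything here is hypothetical scaffolding (`call_model` stands in for whatever off-the-shelf LM API you start with, and the cases are toy examples), not a real harness.

```python
from collections import Counter

def call_model(prompt: str) -> str:
    # Stand-in for an off-the-shelf LM API call (hypothetical);
    # here it just upper-cases the prompt so the sketch is runnable.
    return prompt.upper()

def run_evals(cases):
    """Each case is (prompt, check, failure_mode_label).

    Returns a Counter of failure-mode labels, i.e. which
    product-intent checks the model is currently failing.
    """
    failures = Counter()
    for prompt, check, label in cases:
        output = call_model(prompt)
        if not check(output):
            failures[label] += 1
    return failures

# Toy cases encoding product intent as executable checks.
cases = [
    ("hello", lambda out: out == "HELLO", "casing"),
    ("bye", lambda out: out.endswith("!"), "missing punctuation"),
]

failures = run_evals(cases)
# The failure-mode counts then drive what to post-train against,
# and the same cases double as the regression suite afterwards.
```

The point of keeping checks executable is that the same eval set serves both roles: it surfaces the failure modes in step 2 and measures whether step 3's post-training actually fixed them.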