Toolformer already seems to have been productized as ChatGPT Plugins, fwiw
> 1.) To what degree can we train substantially smaller LLMs for specific tasks that could be run in-house? 2.) It seems like these new breakthroughs may need a different mode of evaluation compared to what we have used since the 80s in the field, and I am not sure what that would look like (maybe along the lines of HELM [2]?)
So you are proposing a set of benchmarks for domain-specific tasks? By definition those won't be shared benchmarks...