I've been wrestling with these same challenges while building RA.Aid—trying to make tools that speak LLM. We have good tool integrations, but many of the tools were originally designed for human consumption. The LLMs seem to have their own idea of how they want to do something, which is why prompt optimization matters so much.
> The LLMs seem to have their own idea of how they want to do something
exactly! what I'm seeing is that prompt engineering has its limits and brings its own consistency issues...
by designing tools from scratch, tailored to LLMs, we can make the interface match their "own idea" of how to do a particular task, which is more reliable and scalable than patching over the mismatch with prompts
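to make that concrete, here's a minimal sketch of the difference (the tool name, schema, and validator are hypothetical illustrations, not RA.Aid's actual API): instead of exposing a human-oriented CLI like `grep -rn --include="*.py"`, where the model has to remember how to combine flags, an LLM-tailored tool spells out each parameter as an explicit, constrained field in a JSON-schema-style definition, so malformed calls can be caught mechanically:

```python
# Hypothetical LLM-tailored tool definition in the common function-calling style.
# Each option is an explicit, typed, described field rather than a CLI flag.
search_tool = {
    "name": "search_code",
    "description": "Search the project for a text pattern and return matching lines.",
    "parameters": {
        "type": "object",
        "properties": {
            "pattern": {"type": "string", "description": "Literal text to find (not a regex)."},
            "file_glob": {"type": "string", "description": "Limit to files matching this glob, e.g. *.py."},
            "max_results": {"type": "integer", "minimum": 1, "maximum": 100, "default": 20},
        },
        "required": ["pattern"],
    },
}

def validate_call(tool: dict, args: dict) -> bool:
    """Minimal check that a model's proposed call fits the tool's schema."""
    params = tool["parameters"]
    missing = [k for k in params["required"] if k not in args]
    unknown = [k for k in args if k not in params["properties"]]
    return not missing and not unknown

print(validate_call(search_tool, {"pattern": "TODO", "file_glob": "*.py"}))  # True
print(validate_call(search_tool, {"query": "TODO"}))                         # False
```

the point is that the schema itself carries the "how to use me" knowledge, so less of it has to live in a fragile prompt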