I tested Gemini 2.5 and it failed miserably at calling the tools we had defined for it. It also got stuck in an infinite loop a number of times, spitting out the exact same line of text over and over until we hard-killed the process. I really don't know how others are getting these amazing results. We had no problems using Claude or OpenAI models in the same scenario. Even DeepSeek R1 works just fine.