Here are my own anecdotes from using o3-pro recently. My primary use cases where...

JamesBarney · 2025-06-17T19:12:45 1750187565

> With that said, I am seeing Claude Opus 4 (w/extended thinking) be distinctly worse at missing problems which o3-pro and Gemini find. It seems fairly consistent that Opus will be the worst out of the three (despite sometimes noticing things the others do not).

I've found the same thing: Claude is more likely to miss a bug than o3 or Gemini, but also more likely to catch something o3 and Gemini both missed. If I had to pick one model I'd pick o3 or Gemini, but if I had to pick a second model I'd pick Opus.

It also seems to have a much higher false positive rate, whereas Gemini seems to have the lowest.

Basically o3 and Gemini are better, but their findings are also more correlated with each other, which gives Opus a lot of value as a second reviewer.
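To make the correlation point concrete, here's a toy sketch. The bug IDs and which model catches what are invented purely for illustration, not measurements; `union_coverage` is a hypothetical helper. The point is just that a weaker reviewer with different blind spots can add more than a second strong reviewer with the same blind spots.

```python
def union_coverage(first: set, second: set, all_bugs: set) -> float:
    """Fraction of known bugs caught by at least one of two reviewers."""
    return len((first | second) & all_bugs) / len(all_bugs)


# Invented toy data: ten bugs, and the subset each reviewer happens to catch.
all_bugs = set(range(1, 11))
o3 = {1, 2, 3, 4, 5, 6, 7}      # strong alone (7/10)
gemini = {1, 2, 3, 4, 5, 6, 8}  # strong alone (7/10), overlaps heavily with o3
opus = {1, 2, 3, 9, 10}         # weaker alone (5/10), but different misses

print(union_coverage(o3, gemini, all_bugs))  # 0.8
print(union_coverage(o3, opus, all_bugs))    # 0.9
```

In this made-up case the o3 + Opus pair covers more than o3 + Gemini, even though Opus is the weakest of the three on its own.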

throwdbaaway · 2025-06-17T19:18:56 1750187936

For the code review use case, you could try creating the diff with something like `git diff -U99999` (the large `-U` value pulls in effectively the whole file as context around each change) and then sending only the diff.
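A rough sketch of how that could be scripted, assuming `git` is on your PATH and you run it inside a repository. The function names and the default `HEAD` base are my own choices for illustration, and the step of actually sending the text to a model is left out.

```python
import subprocess


def build_diff_command(base: str = "HEAD", context_lines: int = 99999) -> list[str]:
    """Build a `git diff` command with a very large context window.

    -U<n> sets the number of context lines around each hunk, so a huge
    value effectively includes the whole file alongside the changes.
    """
    return ["git", "diff", f"-U{context_lines}", base]


def full_context_diff(base: str = "HEAD", repo_dir: str = ".") -> str:
    """Run the diff in `repo_dir` and return its output as text."""
    result = subprocess.run(
        build_diff_command(base),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if git exits non-zero
    )
    return result.stdout
```

Something like `review_input = full_context_diff("main")` would then give you a single string to paste or send, with the changed lines marked by `+`/`-` and the rest of each file as context.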