This is very interesting, because we (at cheatlayer.com) can publish results showing the exact opposite, and we test a lot of code generation live with thousands of actual customers.

It's entirely possible the examples are cherry-picked or could be explained by fine-tuning differences, but the paper doesn't "prove" this in the mathematical sense, since you can get the opposite results depending on the test cases.

The frozen version, GPT-4-0314, is not capable of supporting our new autonomous sales agents, for example, and many automations just don't work at all in the older GPT-4.