I'm not entirely convinced that's all there is to it. I had it write some code and associated unit tests, and it then came up with both passing and failing examples. I also prompted it for function results on arbitrary input, and it would perform the calculations.
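
To give a rough idea of the kind of probe I mean, here is a minimal sketch. The function `collatz_steps` and the helper `check_prediction` are hypothetical stand-ins I made up for illustration, not the code from my actual session: you hand the model a function, ask what it returns for some arbitrary input, and then compare its answer with what the code really produces.

```python
def collatz_steps(n: int) -> int:
    """Count the Collatz steps needed to reach 1 from n (hypothetical probe function)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    steps = 0
    while n != 1:
        # Halve even numbers; map odd numbers to 3n + 1.
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


def check_prediction(n: int, predicted: int) -> bool:
    """Compare the model's claimed result with what the code actually returns."""
    return collatz_steps(n) == predicted
```

Something with a loop like this is a reasonable test because the answer can't just be read off the source; you have to actually step through the iterations to get it right.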

IMO it has some emergent ability to evaluate code. I do believe this ability has been drastically reduced over the last several months, though: it no longer executes complex code as reliably as it once did.