Indeed, I'd expect more capable models to be less amenable to prompt engineering, but even if that is true, it's quite possible our best models are not past the "prompt engineering efficiency peak" yet. It is also fairly hard to quantify (I was thinking about some naive approaches to do that during our work on BIG-Bench [1], but I couldn't come up with anything robust enough), so I don't think we will even be able to say we are past this peak until much later.

[1] https://github.com/google/BIG-bench/issues/801