The TL;DR seems to be that much of GPT-4's improvement on the more complex benchmarks that GPT-3 previously struggled with comes from either outright consuming the benchmark answers during training and pattern-matching against them, or from RLHF effectively handing the LLM the answer. Simply obfuscating the wording of the same question actually made GPT-4 do worse than GPT-3.