The TL;DR seems to be that much of GPT-4's improvement on the more complex benchmarks that GPT-3 previously struggled with comes from either outright consuming the benchmark answers during training and pattern-matching against them, or from RLHF effectively handing the LLM the answer. Simply obfuscating the wording of the same question actually made GPT-4 do worse than GPT-3.