Since you are here: when I was reading the paper, I wondered what the "zero-shot solve rates" actually mean. Are they basically running the same experiment code, just without the prompts that call `few_shot_response` (i.e. still trying each question with every expert prefix and every critique)? It wasn't clear to me at a glance.