There's no need to be quite so adversarial in this case though. The methodology is explained in the report:
> A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C.
I hadn't seen the technical report: it's good they made an attempt to remove them, but they only use a 50-character substring match to find duplicates. Forum discussions after an exam are usually people's fuzzier recollections of the question wording (it would be impressive if the model can convert back from those fuzzy memories at test time, but it is still potentially taking an exam from the past where it has had access to the questions, especially the hard ones, which get discussed the most).
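To make the limitation concrete, here is a minimal sketch of what an exact 50-character substring check might look like (the report's actual implementation is not public, so the function and normalization here are assumptions). A paraphrased forum recollection of a question shares no 50-character run with the original, so it slips past the check:

```python
def contaminated(question: str, training_corpus: list[str], n: int = 50) -> bool:
    """Flag a question if any length-n substring of it appears
    verbatim in a training document (hypothetical check)."""
    q = " ".join(question.split())  # normalize whitespace
    for i in range(len(q) - n + 1):
        if any(q[i:i + n] in doc for doc in training_corpus):
            return True
    return False

corpus = ["Prove that the sum of the first n odd numbers equals n squared."]

# Verbatim copy: caught by the exact substring match.
exact = "Prove that the sum of the first n odd numbers equals n squared."
# Fuzzy paraphrase, as in post-exam forum threads: not caught.
fuzzy = "Show that the first n odd integers add up to n squared when summed."
```

Here `contaminated(exact, corpus)` is `True` while `contaminated(fuzzy, corpus)` is `False`, even though both convey the same question.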
From the results before and after removing some of the duplicates, the contamination doesn't seem to have hurt its performance badly. Sometimes the score even increases, so the substring approach may be helping it: it excludes question variants the model memorized via a matching substring, but where the real test question differed somewhere outside the sampled substrings and had a different answer (or it's random chance that the extrapolated score increased with some questions removed).