There's no need to be quite so adversarial in this case though. The methodology is explained in the report:
> A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C.
I hadn't seen the technical report: it's good they made an attempt to remove them, but they only use a 50-character substring match to find duplicates. Forum discussions after an exam are usually people's fuzzier recollections of the question wording (it would be impressive if the model can convert back from those fuzzy memories at test time, but it is still potentially taking an exam from the past where it has had access to the questions, especially the hard ones, which get discussed the most).
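To make the limitation concrete, here is a minimal sketch of what an exact 50-character substring check might look like (the report's actual implementation is not public, so the function and normalization here are assumptions). A paraphrased forum recollection of a question shares no 50-character run with the original, so it slips past the check:

```python
def contaminated(question: str, training_corpus: list[str], n: int = 50) -> bool:
    """Flag a question if any length-n substring of it appears
    verbatim in a training document (hypothetical check)."""
    q = " ".join(question.split())  # normalize whitespace
    for i in range(len(q) - n + 1):
        if any(q[i:i + n] in doc for doc in training_corpus):
            return True
    return False

corpus = ["Prove that the sum of the first n odd numbers equals n squared."]

# Verbatim copy: caught by the exact substring match.
exact = "Prove that the sum of the first n odd numbers equals n squared."
# Fuzzy paraphrase, as in post-exam forum threads: not caught.
fuzzy = "Show that the first n odd integers add up to n squared when summed."
```

Here `contaminated(exact, corpus)` is `True` while `contaminated(fuzzy, corpus)` is `False`, even though both convey the same question.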
From the results before and after removing some of the duplicates, the contamination doesn't seem to have hurt its performance badly. Sometimes the score even increases, so the substring approach may be helping it: it excludes question variants the model memorized via a matching substring, but where the real test question differed somewhere outside the sampled substrings and had a different answer (or it's random chance that the extrapolated score increased with some questions removed).