Can someone explain these Aider benchmarks to me? They pass the same 113 tests through the LLM every time. Why do they then extrapolate from the ability to pass these 113 basic Python challenges to a general ability to produce and edit code? To me it sounds like saying this or that model is 70% accurate at solving the same hundred Python training tasks, but why would that mean it's also good at other languages and at arbitrary, private tasks? Has anyone ever tried changing the test cases or wiggling the conditions a bit to see whether it still hits 70%?
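
To make the last question concrete, here's a minimal sketch of the kind of "wiggle the conditions" experiment I mean: rename the target function consistently across an exercise's prompt, stub, and tests, then compare pass rates on the original vs. the perturbed set. The `Exercise` structure and the `run_model_on()` hook are hypothetical placeholders, not part of Aider's actual benchmark harness.

```python
import re
import random
from dataclasses import dataclass

@dataclass
class Exercise:
    prompt: str   # natural-language instructions shown to the model
    stub: str     # starter code the model has to fill in
    tests: str    # unit tests used to score the attempt

def perturb(ex: Exercise, old_name: str, rng: random.Random) -> Exercise:
    """Return a copy of the exercise with the target function renamed,
    so a model that memorized the original task can't just pattern-match it."""
    new_name = f"{old_name}_{rng.randrange(10_000)}"
    rename = lambda text: re.sub(rf"\b{re.escape(old_name)}\b", new_name, text)
    return Exercise(rename(ex.prompt), rename(ex.stub), rename(ex.tests))

# Usage sketch: build perturbed copies and score both sets with your own harness.
# def run_model_on(ex: Exercise) -> bool: ...   # hypothetical pass/fail hook
# rng = random.Random(0)
# perturbed = [perturb(ex, name, rng) for ex, name in exercises]
```

If the score drops a lot on the renamed copies, that would suggest memorization rather than general editing ability; if it holds up, the benchmark is at least robust to surface changes.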