They're testing it on 2019 data. From the paper (https://www.nature.com/articles/s41586-024-08252-9):

>> We use 2019 as our test period, and, following the protocol in ref. 2, we initialize ML models using ERA5 at 06 UTC and 18 UTC, as these benefit from only 3 h of look-ahead (with the exception of sea surface temperature, which in ERA5 is updated once per 24 h). This ensures ML models are not afforded an unfair advantage by initializing from states with longer look-ahead windows.

See the Baselines section of the paper, which explains the methodology in more depth. They basically initialize the competing models from the state of the atmosphere at a given time in 2019 (ERA5, per the quote above) and have each one predict the weather for a later time window. Then they compare the prediction with the ground truth for that window.
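
To make the setup concrete, here is a rough sketch of that kind of hindcast evaluation. It is not the paper's code: `truth` is a made-up synthetic temperature series standing in for the held-out 2019 record, `persistence_forecast` is a toy model that just repeats the starting value, and every function name here is hypothetical. Only the 06 UTC / 18 UTC start times and the 2019 test year come from the quote.

```python
from datetime import datetime, timedelta, timezone
import math


def init_times(year=2019, hours=(6, 18)):
    """Yield every 06 UTC and 18 UTC timestamp in the test year."""
    day = datetime(year, 1, 1, tzinfo=timezone.utc)
    while day.year == year:
        for h in hours:
            yield day.replace(hour=h)
        day += timedelta(days=1)


def truth(t):
    """Hypothetical stand-in for the held-out record: a toy temperature in deg C."""
    seasonal = 8.0 * math.sin(2 * math.pi * t.timetuple().tm_yday / 365)
    diurnal = 3.0 * math.sin(2 * math.pi * t.hour / 24)
    return 10.0 + seasonal + diurnal


def persistence_forecast(t0, lead):
    """Toy model: predict that the value at t0 + lead equals the value at t0."""
    return truth(t0)


def hindcast_rmse(forecast, lead):
    """Run the forecast from every start time and score it against the truth."""
    squared_errors = []
    for t0 in init_times():
        predicted = forecast(t0, lead)   # the model only sees the state at t0
        observed = truth(t0 + lead)      # what actually happened later
        squared_errors.append((predicted - observed) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


starts = list(init_times())
rmse_1d = hindcast_rmse(persistence_forecast, timedelta(days=1))
print(f"{len(starts)} start times, 1-day RMSE = {rmse_1d:.3f}")
```

The point is that the answer is already known for every forecast, so any number of models can be scored on exactly the same dates.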

Plot twist: they measure accuracy in predicting the weather 5 years in the past.