Still, it's a 0.03% difference, which is 3 images out of the 10k test images in CIFAR-10. Just 3 images.
Re-training the SotA model with a different random seed might shift its score by 0.03%. Or a single wrong calculation somewhere in those 17,810 TPU core-hours, caused by faulty hardware or a cosmic ray hit, could change the final model's score by 0.03%.
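To put a rough number on that, here is a toy simulation (all figures are illustrative assumptions, not measured results): it pretends each retraining run independently misclassifies each of the 10k test images with a fixed 0.5% probability, which real runs don't do exactly, and looks at how much the measured accuracy moves around from run to run.

    import numpy as np

    # Toy simulation, not a real experiment: assume each retraining run
    # independently misclassifies each test image with the same underlying
    # probability (a simplification -- real runs are correlated), and see how
    # much the measured accuracy on a 10,000-image test set moves around.
    # The 0.5% "true" error rate below is an illustrative assumption.
    rng = np.random.default_rng(0)
    n_test = 10_000        # CIFAR-10 test set size
    true_error = 0.005     # assumed underlying per-image error probability
    n_runs = 1_000         # number of simulated retraining runs

    errors = rng.binomial(n_test, true_error, size=n_runs)  # misclassified images per run
    accuracies = 1 - errors / n_test

    print(f"mean accuracy:     {accuracies.mean():.4%}")
    print(f"std across runs:   {accuracies.std():.4%}")    # ~0.07% under these assumptions
    print(f"max - min spread:  {accuracies.max() - accuracies.min():.4%}")

Even under that crude model the run-to-run spread comes out several times larger than 0.03%, which is the point being made here.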
The problem with this sort of argument against caring about SOTA scores is that there is only so much luck to go around. Any individual 5% reduction in error rate could theoretically be heavily influenced by luck, but if you have a chain of small reductions such that the difference between the first and the last is more like a factor of 2, then you know that somewhere in the middle, even if any individual improvement is suspect, there must have been real, gradual improvement.
It isn't that important on CIFAR-10 any more, which is pretty much a solved benchmark, but CIFAR was only solved because of such incremental progress, and papers focusing on moving the state of the art use newer, much harder benchmarks.
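A back-of-the-envelope version of that argument, with purely hypothetical numbers (a drop from 4% to 2% error across an assumed chain of 50 incremental results, each with up to 0.03% of luck in it):

    import math

    # Back-of-the-envelope check of the "only so much luck to go around" point.
    # Every number here is an illustrative assumption, not a measured figure:
    # suppose error fell from 4% to 2% across a chain of incremental results,
    # and each individual reported gain could be off by +/-0.03% from noise.
    start_error, end_error = 0.04, 0.02
    per_step_noise = 0.0003   # 0.03% of possible luck per reported result
    n_steps = 50              # assumed length of the chain of papers

    total_improvement = start_error - end_error

    # Independent noise accumulates roughly as sqrt(N), not N.
    plausible_luck = math.sqrt(n_steps) * per_step_noise

    # For luck alone to explain the whole drop, every result would have to be
    # maximally lucky in the same direction, for this many papers in a row:
    pure_luck_steps_needed = total_improvement / per_step_noise

    print(f"total improvement:           {total_improvement:.2%}")
    print(f"luck from independent noise: {plausible_luck:.2%}")
    print(f"max-luck results needed:     {pure_luck_steps_needed:.0f}")

Under these assumptions, independent noise across the chain only accounts for a couple of tenths of a percent; explaining the whole drop with luck would require dozens of consecutive maximally lucky results.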
> Re-training the SotA model with a different random seed might shift its score by 0.03%. Or a single wrong calculation somewhere in those 17,810 TPU core-hours, caused by faulty hardware or a cosmic ray hit, could change the final model's score by 0.03%.
Isn’t it the job of science to determine if this is the case?