It's not sorting per se that was improved here, but sorting (particularly of short sequences) on modern CPUs, where the real difficulty is predicting what will actually run fast on those CPUs.
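To make that concrete, here's a toy example of my own (not code from the article): sorting a handful of elements with a sorting network does a fixed sequence of compare-exchanges regardless of the input, so there are no data-dependent branches for the CPU to mispredict. Whether that actually beats, say, insertion sort on a given chip is exactly the kind of thing that's hard to predict on paper.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Illustrative sketch only: a branchless compare-exchange. Compilers usually
// lower integer std::min/std::max to conditional-move or SIMD min/max, so the
// cost is the same for every input order -- nothing for the branch predictor
// to guess wrong on.
inline void compare_exchange(int32_t& a, int32_t& b) {
    int32_t lo = std::min(a, b);
    int32_t hi = std::max(a, b);
    a = lo;
    b = hi;
}

// A standard 5-comparator sorting network for 4 elements: always the same
// five compare-exchanges, in the same order, for every input permutation.
void sort4(std::array<int32_t, 4>& v) {
    compare_exchange(v[0], v[1]);
    compare_exchange(v[2], v[3]);
    compare_exchange(v[0], v[2]);
    compare_exchange(v[1], v[3]);
    compare_exchange(v[1], v[2]);
}
```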
Doing an empirical algorithm search to find which algorithms fit well on modern CPUs/memory systems is pretty common; see e.g. FFTW, ATLAS, and https://halide-lang.org/
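Here's a toy sketch of that kind of empirical search (mine, not how FFTW/ATLAS are actually built, and the candidate set, input size, and trial count are made-up parameters): benchmark every candidate on the machine you're running on and keep whichever wins there, rather than trusting a model of the hardware.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

using SortFn = std::function<void(std::vector<int>&)>;

// Time each candidate sort on random data of the target size and return the
// one that was fastest on *this* CPU/memory system.
SortFn pick_fastest(const std::vector<SortFn>& candidates, std::size_t n, int trials) {
    std::mt19937 rng(42);
    SortFn best;
    double best_seconds = 1e300;
    for (const SortFn& fn : candidates) {
        double total = 0.0;
        for (int t = 0; t < trials; ++t) {
            std::vector<int> data(n);
            for (int& x : data) x = static_cast<int>(rng());
            auto start = std::chrono::steady_clock::now();
            fn(data);
            total += std::chrono::duration<double>(
                         std::chrono::steady_clock::now() - start).count();
        }
        if (total < best_seconds) {
            best_seconds = total;
            best = fn;
        }
    }
    return best;  // the empirical winner, not the theoretical one
}
```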
What's well studied is the theory of sorting algorithms.
For practical sorting on a particular CPU architecture, there is still plenty of low-hanging fruit:
> Today we're sharing open source code that can sort arrays of numbers about ten times as fast as the C++ std::sort, and outperforms state of the art architecture-specific algorithms, while being portable across all modern CPU architectures
Part of it is that hardware properties are always changing: the instructions available, the relative speed of the CPU versus memory and the various caches, how big, numerous, and fast those caches are, and so on.
I'm curious why things can't just keep improving on a base that doesn't change, until the base itself changes because the improvements it enables are just that much better...
Or is that why hardware properties change so much?
For a time that is what happened. Every year you would get a new CPU that was faster at pretty much everything than the previous year's. But then we hit the clock-speed wall, and the only way to make faster CPUs was to add complexity to their internals. So branch prediction, microcode and pipelining tricks, larger caches, SIMD instructions, more CPU cores, etc. were the result.
So nowadays a new CPU might not be better at everything than the previous version, but it will most likely have more cache and some internal improvements to pipelining/concurrency.
Given that, for newer CPUs it can be worth adapting code to take advantage of the extra pipelining, or switching to a different instruction that happens to be faster now.
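A small example of "a different instruction that happens to be faster now" (my own example, not from this thread): population count. The same operation went from a software loop to, on newer CPUs, a single hardware instruction, so which source version performs best can change from one CPU generation to the next.

```cpp
#include <bit>
#include <cstdint>

// Classic software version: clears one set bit per iteration.
uint32_t popcount_loop(uint64_t x) {
    uint32_t count = 0;
    while (x) {
        x &= x - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}

// C++20 version: on CPUs/targets with a hardware popcount instruction this
// typically compiles down to that single instruction.
uint32_t popcount_builtin(uint64_t x) {
    return static_cast<uint32_t>(std::popcount(x));
}
```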