This, 1000%. Especially floating-point arithmetic. Compilers aren't that great at vectorization (or, more accurately, people aren't that great at writing algorithms that *can* be vectorized), and scalar floating-point operations on x86_64 have the same latency as their vectorized counterparts. When you account for how smart the pipelining/out-of-order engines on x86_64 CPUs are, you can achieve >4x throughput for the same algorithm, even when the restructuring adds some redundant arithmetic.
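To make the pipelining point concrete, here's a minimal sketch (names are mine, not from the thread): a naive running sum has a loop-carried dependency, so each add must wait for the previous one; splitting it into four independent accumulators lets the CPU's pipelined FP units overlap the adds, and also makes the loop trivially vectorizable. The final combine is the "extra" arithmetic, and it's cheap.

```c
#include <assert.h>

// Serial sum: each iteration depends on the previous one, so the adds
// cannot overlap in the pipeline.
float sum_serial(const float *a, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += a[i];
    return acc;
}

// Four independent accumulators: no dependency between lanes, so the
// adds pipeline (and the compiler can map the lanes onto SIMD).
// Assumes n is a multiple of 4 for brevity.
float sum_unrolled4(const float *a, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);  // the small amount of redundant work
}
```

Note the two versions sum in a different order, so for general floats the results can differ in the last bits; that reassociation is exactly why compilers won't do this for you without something like `-ffast-math`.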
Audio is one of the big areas where we can see huge gains, and I think philosophies about optimization are changing. That said, there are myths out there like "recursive filters can't be vectorized" that need to be dispelled.
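On the "recursive filters can't be vectorized" myth: a hedged sketch of one standard trick (my own illustration, not from the thread). For a one-pole IIR, y[n] = x[n] + a*y[n-1], you can unroll the recurrence algebraically so that all four outputs in a block depend only on that block's inputs and the single carried state s. Those four expressions are then independent and can be evaluated in parallel, by hand-written SIMD or an auto-vectorizer; the serial dependency survives only as one state update per block.

```c
#include <assert.h>

// Block-unrolled one-pole IIR: y[n] = x[n] + a*y[n-1].
// Within each block of 4, the outputs are expressed directly in terms
// of the block inputs x0..x3 and the incoming state s, so there is no
// sample-to-sample dependency inside the block.
// Assumes n is a multiple of 4 for brevity; s is the initial y[-1].
void one_pole_block4(const float *x, float *y, int n, float a, float s) {
    const float a2 = a * a, a3 = a2 * a, a4 = a2 * a2;
    for (int i = 0; i < n; i += 4) {
        float x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
        y[i]     = x0 + a * s;
        y[i + 1] = x1 + a * x0 + a2 * s;
        y[i + 2] = x2 + a * x1 + a2 * x0 + a3 * s;
        y[i + 3] = x3 + a * x2 + a2 * x1 + a3 * x0 + a4 * s;
        s = y[i + 3];  // only one serial dependency per block
    }
}
```

The extra multiplies are the redundant arithmetic mentioned above; the payoff is that the four output lanes map straight onto SIMD lanes. (Floating-point results can differ from the sample-by-sample version in the last bits because of the reassociation.)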
It has gotten very tricky on x86, since using the really wide SIMD instructions can significantly reduce the clock speed for ALL instructions running on that particular core.
IIRC that's only on some processors, and mainly with AVX-512 (and heavy AVX2) instructions, which are still faster than scalar (depending on what you're doing around the math).
Things being tricky with AVX and cache-weirdness doesn't change the fact that if you're not vectorizing your arithmetic you're losing performance.
There's also an argument to be made that if you're writing performance-critical code you shouldn't optimize it for mid/bottom-end hardware, but not everyone agrees.