Not really; it's an apples-to-oranges comparison. If you ran the same distributed algorithm on a single core you wouldn't see the same speed improvements. These chips are ridiculously efficient because they need far fewer gates to accomplish the same purpose, provided the code is written specifically for them. Just like you couldn't simulate a GPU on a CPU within the same power budget. The loss of generality is more than made up for by the increase in efficiency. This is very similar in that respect, but with the caveat that a GPU is even more specialized and therefore even more efficient.
Imagine what kind of performance you could get out of hardware that is task specific. That's why, for instance, crypto mining went through a very rapid set of iterations, CPU -> GPU -> ASIC, in a matter of a few years, with an extremely brief blip of programmable hardware somewhere in there as well (FPGA-based miners, around 2013).
Any loss of generality can be traded for a gain in efficiency and vice versa. The question is whether or not it is economically feasible; there are different points along that line that have resulted in marketable (and profitable) products, but there are also plenty of wrecks.