What the authors have completely missed in their analysis is the rapidly growing trend of specialising standard general-purpose CPUs -- that is, adding special-purpose instructions to general-purpose CPUs instead of adding special-purpose processors.
This process started a couple of decades ago, with instructions added to accelerate hash calculation and encryption, and relatively narrow SIMD instructions added to speed up multimedia processing.
Now virtually all high-volume general-purpose processors have such instructions.
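For instance, on x86 a full AES round is a single instruction, exposed in C as a compiler intrinsic. A minimal sketch (the state and round key values here are arbitrary demo data, not a real key schedule; compile with e.g. gcc -maes):

    #include <wmmintrin.h>   /* AES-NI intrinsics (x86, needs -maes) */
    #include <stdio.h>

    int main(void) {
        /* Arbitrary demo values -- a real cipher derives round keys
           from a key schedule. */
        __m128i state = _mm_set_epi32(0x03020100, 0x07060504,
                                      0x0b0a0908, 0x0f0e0d0c);
        __m128i round_key = _mm_set1_epi32(0x2b7e1516);

        /* One full AES round (SubBytes, ShiftRows, MixColumns,
           AddRoundKey) in a single instruction, instead of dozens
           of table lookups. */
        state = _mm_aesenc_si128(state, round_key);

        unsigned char out[16];
        _mm_storeu_si128((__m128i *)out, state);
        for (int i = 0; i < 16; i++) printf("%02x", out[i]);
        printf("\n");
        return 0;
    }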
If you count simple floating point add, subtract, multiply, and divide as special purpose then this process started 50+ years ago.
The advantage of doing this is less added area and power consumption, and the ability to mix special-purpose and general-purpose operations at a much finer-grained level. Sometimes it's worth copying a few MB of data across to a GPU's local RAM, downloading a special program to it, and then copying the results back -- but often the transfer and launch overhead swamps any speedup, and an instruction sitting right in the CPU pipeline wins.
The number of potential special-purpose operations that are useful to someone is probably unbounded, but each one might be useful to only a small number of people. It's not feasible to just keep adding everything someone thinks of to the volume-leading mass-market processors.
Three related things are happening to help with this:
1) adding reconfigurable hardware to a general-purpose processor, or embedding the processor into reconfigurable hardware. Here we have Xilinx "Zynq" and Microchip "PolarFire SoC" with ARM and RISC-V (respectively) hard CPU cores inside an FPGA. We also have Cypress PSoC, which I believe is more like adding a small amount of reconfigurable hardware on the side of a conventional CPU core. If the performance needs of the general-purpose part of the processing are relatively low then you can use a "soft core" CPU built from the FPGA resources themselves. Each FPGA vendor has had its own custom instruction set and CPU core, but people are now moving more and more to instruction sets and cores they can use on any FPGA -- chiefly RISC-V.
2) making custom chips with a standard CPU core augmented with a few custom instructions / execution units (see the first sketch after this list). Again, much of this activity is centred around RISC-V, though ARM has announced support for this with one or two of their standard CPU cores, initially the Cortex-A35. Going into full production of a chip like this has costs in the low millions of dollars, with incremental unit costs as low as $1 to $10. Small numbers of custom chips (100+) can be made for $5 to $500 each depending on the size of the chip and the process node -- bigger, slower nodes are cheaper.
3) adding special-purpose instructions that can be applied more flexibly to a larger range of problems. The main contender here is support for "Cray-style" processing of (possibly) long vectors of flexible length. If appropriately designed, the same program can run at peak efficiency (for that chip) on CPUs with vastly different vector register lengths -- see the saxpy sketch after this list. This is in contrast to traditional SIMD, where the program has to be rewritten every time a CPU is made with longer vector registers, and where it is very inconvenient to deal with data set sizes that are not a multiple of the vector length.
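To give a flavour of (2): once you've extended a RISC-V core's decoder, the new instruction can be emitted from C via the GNU assembler's .insn directive, no compiler changes needed. A minimal sketch, assuming a hypothetical R-type instruction in the "custom-0" opcode space; the funct3/funct7 values are illustrative, not any real product's encoding:

    #include <stdint.h>
    #include <stdio.h>

    /* Invoke a hypothetical custom R-type instruction in the RISC-V
       "custom-0" opcode space (major opcode 0x0b).  funct3 and funct7
       are whatever the modified core's decoder was taught; 0 and 0
       here are placeholders. */
    static inline uint64_t my_custom_op(uint64_t a, uint64_t b) {
        uint64_t result;
        asm volatile(".insn r 0x0b, 0, 0, %0, %1, %2"
                     : "=r"(result)
                     : "r"(a), "r"(b));
        return result;
    }

    int main(void) {
        /* Only runs on a core implementing the instruction; on a
           stock core this traps as an illegal instruction. */
        printf("%llu\n", (unsigned long long)my_custom_op(6, 7));
        return 0;
    }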
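And a sketch of (3): the classic strip-mined saxpy loop, written with the RISC-V V extension's C intrinsics (assuming a toolchain with the RVV 1.0 intrinsics, e.g. a recent GCC or Clang with -march=rv64gcv). The same binary runs at full width on a core with 128-bit vector registers and on one with 4096-bit registers, and the vsetvl dance handles any n, multiple of the hardware vector length or not:

    #include <riscv_vector.h>   /* RVV 1.0 C intrinsics */
    #include <stddef.h>

    /* y[i] += a * x[i], vector-length-agnostic. */
    void saxpy(size_t n, float a, const float *x, float *y) {
        for (size_t vl; n > 0; n -= vl, x += vl, y += vl) {
            /* Ask the hardware how many elements it will process
               this trip: up to this core's VLMAX, or the n remaining,
               whichever is smaller.  No scalar clean-up loop needed. */
            vl = __riscv_vsetvl_e32m8(n);
            vfloat32m8_t vx = __riscv_vle32_v_f32m8(x, vl);
            vfloat32m8_t vy = __riscv_vle32_v_f32m8(y, vl);
            vy = __riscv_vfmacc_vf_f32m8(vy, a, vx, vl); /* vy += a*vx */
            __riscv_vse32_v_f32m8(y, vy, vl);
        }
    }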
If suitable primitives are included for predication of vector elements and for divergent and convergent calculations, then such a vector processor can run the same algorithms as GPUs (e.g. by compiling CUDA or OpenCL directly to it). CPUs with sufficiently long vector registers can then compete directly with GPUs on GPU-style code, all while staying tightly integrated with general-purpose computation.
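To make the predication point concrete, here is what a vector unit (or a GPU warp) effectively does with divergent control flow -- both branches execute, each under a mask, and a select merges the results -- written out as a scalar C model:

    #include <stddef.h>

    /* Scalar model of predicated vector execution: the divergent
       "if (x[i] < 0) y[i] = -x[i]; else y[i] = x[i] * 2;" becomes
       straight-line code in which every element takes both paths,
       with a per-element mask choosing which result is kept.  On
       real hardware the loop body is one pass of masked vector ops. */
    void abs_or_double(size_t n, const float *x, float *y) {
        for (size_t i = 0; i < n; i++) {
            int mask = (x[i] < 0.0f);        /* compare -> predicate  */
            float taken = -x[i];             /* "then" side, mask     */
            float not_taken = x[i] * 2.0f;   /* "else" side, ~mask    */
            y[i] = mask ? taken : not_taken; /* merge on the mask     */
        }
    }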
ARM SVE and the RISC-V V extension are the main examples of this, with the RISC-V version being, I think, the more flexible and forward-looking.