The number one issue is hardware, particularly memory transfer latency. Unless you have a lot of compute that excels on a GPU (like a gorillion matrix multiplications) or a lot of data to process per transfer, execution time is dominated by moving memory to and from the device rather than by the computation itself.
Recently I had a task where I just wanted to compute cosine similarities between two vectors. For a couple hundred thousand floats, my code spent something like ~1ms on the CPU and ~25ms on the GPU. The GPU didn't start winning until I got into the millions of floats. For my use case, a better solution was just taking advantage of a SIMD library.
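For a rough idea of what the SIMD route looks like, here's a minimal sketch using AVX2/FMA intrinsics directly instead of a wrapper library. The assumptions are mine, not from the original measurement: the CPU supports AVX2+FMA, and the vector length is a multiple of 8 (the leftover-tail handling is left out to keep it short). It accumulates the dot product and both squared norms 8 floats at a time:

    /* Sketch only: assumes AVX2+FMA and n divisible by 8.
       Compile with e.g. -O2 -mavx2 -mfma */
    #include <immintrin.h>
    #include <math.h>
    #include <stddef.h>

    /* Horizontal sum of the 8 lanes of a __m256. */
    static float hsum256(__m256 v) {
        __m128 lo = _mm256_castps256_ps128(v);
        __m128 hi = _mm256_extractf128_ps(v, 1);
        lo = _mm_add_ps(lo, hi);      /* 8 lanes -> 4 */
        lo = _mm_hadd_ps(lo, lo);     /* 4 lanes -> 2 */
        lo = _mm_hadd_ps(lo, lo);     /* 2 lanes -> 1 */
        return _mm_cvtss_f32(lo);
    }

    float cosine_similarity(const float *a, const float *b, size_t n) {
        __m256 dot = _mm256_setzero_ps();
        __m256 na  = _mm256_setzero_ps();
        __m256 nb  = _mm256_setzero_ps();
        for (size_t i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            dot = _mm256_fmadd_ps(va, vb, dot);  /* dot += a[i]*b[i] */
            na  = _mm256_fmadd_ps(va, va, na);   /* na  += a[i]*a[i] */
            nb  = _mm256_fmadd_ps(vb, vb, nb);   /* nb  += b[i]*b[i] */
        }
        return hsum256(dot) / (sqrtf(hsum256(na)) * sqrtf(hsum256(nb)));
    }

The point isn't the exact intrinsics, it's that everything stays in the CPU's caches and registers, so there's no device transfer to amortize.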