They presented an approach that wasn't cache optimized and still showed significant gains. And of course they used a lot of cores, but they're comparing against a V100, which is a $7k GPU (maybe $5k if you're lucky).
It's 44 threads (22 cores). They also compare against TensorFlow on the CPU, compiled with SIMD instruction sets, on the same hardware.
What other optimizations would you like to see? I would expect the TensorFlow team to already pay pretty close attention to performance in their CPU and GPU implementations, not to mention cuDNN and such...
Is there anything preventing the same or a similar algorithmic optimization from being implemented on the GPU though? IIUC, the new algorithm (on the CPU) was compared to an existing algorithm (both on the CPU and GPU).
https://github.com/keroro824/HashingDeepLearning
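For anyone wondering what the algorithmic optimization actually is: as I understand the linked repo (SLIDE), each layer hashes its neurons' weight vectors into LSH tables, hashes the incoming activation with the same hash functions, and then computes only the neurons that collide with it, instead of the full dense matmul. Here's a minimal, hypothetical Python sketch of that idea using signed random projections; none of these class or function names come from the repo, and the real implementation is multi-threaded C++ with different hash families and asynchronous updates.

```python
# Illustrative sketch only (NOT the authors' implementation) of LSH-based
# "active neuron" selection, using simple signed-random-projection tables.
import numpy as np

rng = np.random.default_rng(0)

class LSHLayer:
    def __init__(self, in_dim, out_dim, n_tables=8, n_bits=6):
        self.W = rng.standard_normal((out_dim, in_dim)) * 0.01  # neuron weight rows
        self.b = np.zeros(out_dim)
        # One set of random hyperplanes per table; each neuron is bucketed by
        # the sign pattern of its weight vector under those hyperplanes.
        self.planes = rng.standard_normal((n_tables, n_bits, in_dim))
        self.tables = []
        for t in range(n_tables):
            buckets = {}
            codes = (self.W @ self.planes[t].T) > 0          # (out_dim, n_bits)
            for neuron, bits in enumerate(codes):
                buckets.setdefault(bits.tobytes(), []).append(neuron)
            self.tables.append(buckets)

    def forward_sparse(self, x):
        # Hash the input with the same hyperplanes and collect the neurons
        # that land in the same bucket in any table -> a small "active set".
        active = set()
        for t, buckets in enumerate(self.tables):
            code = ((self.planes[t] @ x) > 0).tobytes()
            active.update(buckets.get(code, []))
        # Fall back to neuron 0 if nothing collided (purely for the demo).
        active = np.fromiter(active, dtype=int) if active else np.arange(1)
        # Compute only the active dot products instead of the full matmul.
        out = np.zeros(self.W.shape[0])
        out[active] = np.maximum(self.W[active] @ x + self.b[active], 0.0)
        return out, active

layer = LSHLayer(in_dim=128, out_dim=10_000)
y, active = layer.forward_sparse(rng.standard_normal(128))
print(f"computed {active.size} of {layer.W.shape[0]} neurons")
```

The point of the sketch is that the active set is small, input-dependent, and different for every sample, which is an irregular gather/scatter workload that spreads easily across CPU threads but doesn't map as cleanly onto a dense-matmul-shaped GPU kernel. Whether a clever GPU port could close that gap is exactly the open question here.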