They presented an approach that wasn't cache optimized and still showed significant gains. And of course they used a lot of cores, but they're comparing against a V100, which is a $7k GPU (maybe $5k if you're lucky).
It's 44 threads (22 cores). They also compare against TensorFlow on the CPU, compiled with SIMD instruction sets, on the same hardware.
What other optimizations would you like to see? I would expect the TensorFlow team to already pay pretty close attention to performance in their CPU and GPU implementations, not to mention cuDNN and such...
Is there anything preventing the same or a similar algorithmic optimization from being implemented on the GPU though? IIUC, the new algorithm (on the CPU) was compared to an existing algorithm (both on the CPU and GPU).
https://github.com/keroro824/HashingDeepLearning
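For anyone wondering what the algorithmic optimization actually is: as I understand the linked repo (SLIDE), each layer hashes its neurons' weight vectors into LSH tables, hashes the incoming activation with the same hash functions, and then computes only the neurons that collide with it, instead of the full dense matmul. Here's a minimal, hypothetical Python sketch of that idea using signed random projections; none of these class or function names come from the repo, and the real implementation is multi-threaded C++ with different hash families and asynchronous updates.

```python
# Illustrative sketch only (NOT the authors' implementation) of LSH-based
# "active neuron" selection, using simple signed-random-projection tables.
import numpy as np

rng = np.random.default_rng(0)

class LSHLayer:
    def __init__(self, in_dim, out_dim, n_tables=8, n_bits=6):
        self.W = rng.standard_normal((out_dim, in_dim)) * 0.01  # neuron weight rows
        self.b = np.zeros(out_dim)
        # One set of random hyperplanes per table; each neuron is bucketed by
        # the sign pattern of its weight vector under those hyperplanes.
        self.planes = rng.standard_normal((n_tables, n_bits, in_dim))
        self.tables = []
        for t in range(n_tables):
            buckets = {}
            codes = (self.W @ self.planes[t].T) > 0          # (out_dim, n_bits)
            for neuron, bits in enumerate(codes):
                buckets.setdefault(bits.tobytes(), []).append(neuron)
            self.tables.append(buckets)

    def forward_sparse(self, x):
        # Hash the input with the same hyperplanes and collect the neurons
        # that land in the same bucket in any table -> a small "active set".
        active = set()
        for t, buckets in enumerate(self.tables):
            code = ((self.planes[t] @ x) > 0).tobytes()
            active.update(buckets.get(code, []))
        # Fall back to neuron 0 if nothing collided (purely for the demo).
        active = np.fromiter(active, dtype=int) if active else np.arange(1)
        # Compute only the active dot products instead of the full matmul.
        out = np.zeros(self.W.shape[0])
        out[active] = np.maximum(self.W[active] @ x + self.b[active], 0.0)
        return out, active

layer = LSHLayer(in_dim=128, out_dim=10_000)
y, active = layer.forward_sparse(rng.standard_normal(128))
print(f"computed {active.size} of {layer.W.shape[0]} neurons")
```

The point of the sketch is that the active set is small, input-dependent, and different for every sample, which is an irregular gather/scatter workload that spreads easily across CPU threads but doesn't map as cleanly onto a dense-matmul-shaped GPU kernel. Whether a clever GPU port could close that gap is exactly the open question here.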