Hacker News



44 cores, and careful cache management. I'm wondering whether the GPU implementation has been similarly optimized.


They also presented a version that was not cache-optimized, and it still showed significant gains. And of course they used a lot of cores; they're comparing against a V100, which is a $7k GPU (maybe $5k if you're lucky).


Our GPU implementations are a lot more optimized. This is pretty far behind MKL etc. for matrix multiplication.
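To see why "far behind MKL" is a high bar, here is a small sketch (mine, not from the thread) comparing a naive triple-loop matrix product against NumPy's `@`, which dispatches to an optimized BLAS (MKL or OpenBLAS depending on the install). The speedup of a tuned, cache-blocked, SIMD-vectorized GEMM over naive code is typically several orders of magnitude:

```python
# Sketch: naive matmul vs. BLAS-backed matmul (np.dot / @).
# Assumes NumPy is linked against an optimized BLAS, which is
# the default for common distributions.
import time
import numpy as np

def naive_matmul(a, b):
    n, k = a.shape
    _, m = b.shape
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i, p] * b[p, j]
            out[i, j] = s
    return out

n = 128
rng = np.random.default_rng(0)
a = rng.random((n, n))
b = rng.random((n, n))

t0 = time.perf_counter()
c_naive = naive_matmul(a, b)
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
c_blas = a @ b
t_blas = time.perf_counter() - t0

assert np.allclose(c_naive, c_blas)
print(f"naive: {t_naive:.3f}s  BLAS: {t_blas:.6f}s")
```

The gap comes from cache blocking, SIMD, and multithreading inside the BLAS kernel, which is exactly the kind of low-level optimization the comment is alluding to.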


It’s 44 threads (22 cores). They also compare with TensorFlow CPU compiled with SIMD instruction sets on the same hardware.

What other optimizations would you like to see? I would expect the TensorFlow team to already pay pretty close attention to performance in CPU and GPU implementations, not to mention cuDNN and such...


Considering the GPU implementation is TensorFlow, I think it's very safe to assume it's the far more optimized one.


Is there anything preventing the same or a similar algorithmic optimization from being implemented on the GPU, though? IIUC, the new algorithm (on the CPU) was compared to an existing algorithm (both on the CPU and the GPU).
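For context on what "algorithmic optimization" means here: the paper's approach uses locality-sensitive hashing to select a small subset of neurons per input and compute only those, instead of a dense layer. A toy sketch of that idea (mine, assumption-laden, using SimHash with random hyperplanes; not the paper's code or parameters):

```python
# Toy sketch of LSH-based sparse activation: hash the input with
# random hyperplanes (SimHash) and compute only the neurons whose
# weight vectors landed in the same bucket.
import numpy as np

rng = np.random.default_rng(0)
d, n_neurons, n_bits = 64, 1000, 6

W = rng.standard_normal((n_neurons, d))    # layer weights, one row per neuron
planes = rng.standard_normal((n_bits, d))  # SimHash hyperplanes

def simhash(v):
    # Sign pattern against the hyperplanes -> bucket key.
    return tuple((planes @ v > 0).astype(int))

# Build the hash table once: bucket -> neuron ids.
table = {}
for i in range(n_neurons):
    table.setdefault(simhash(W[i]), []).append(i)

x = rng.standard_normal(d)
active = table.get(simhash(x), [])

# Sparse "forward pass": only the retrieved neurons are computed.
out = {i: float(W[i] @ x) for i in active}
print(f"computed {len(active)} of {n_neurons} neurons")
```

Nothing in this idea is CPU-specific per se; the thread's open question is whether the irregular, data-dependent memory access of hash-table lookups maps well onto GPU execution.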


Interesting: there's a Go version: https://github.com/nlpodyssey/goslide


I stand corrected!




