
Going from 30 min on CPU to 0.4 sec on GPU can mean only one thing: you had really crappy CPU code.

Well-written C code should run about 20-50 times slower than well-written CUDA code on the latest GPUs, and that's for a single CPU core. This was true with my hand-written implementations, as well as when using libraries (Numpy vs CuDNN).




Going further, power for power on the current generation, for SGEMM, I'd expect GPUs to beat CPUs by maybe 5x (a 6700K has a theoretical 512 GFLOPS at 91 W, while Xeons can come a bit shy of 1.5 TFLOPS at 145 W).
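Just restating the arithmetic behind those quoted figures (a back-of-the-envelope check, not a benchmark):

  # GFLOPS per watt implied by the numbers above; a ~5x gap would put
  # the GPU somewhere around 25-50 GFLOPS/W depending on the baseline.
  cpu_6700k = 512 / 91       # ~5.6 GFLOPS/W
  xeon      = 1500 / 145     # ~10.3 GFLOPS/W
  print(cpu_6700k, xeon)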

Which is still pretty major, but nowhere near the wild claims that used to be common in academic papers. It's outdated, but [1] is still pretty good reading.

[1] http://sbel.wisc.edu/Courses/ME964/Literature/LeeDebunkGPU20...


We implemented the LeNet-5 convolutional neural network, and yes, when run sequentially it really does take that long. The sequential code that took that long was written by professors who worked at NVIDIA; it was written as well as it could be sequentially. The project was a competition to see how fast we could make it go in parallel, so we used a ton of parallelization techniques and tricks like streaming to get it down to 0.4 sec. Think again before you call someone's code "crappy".


I just did a quick test for LeNet-5, using Theano + CuDNN 5, Xeon E5-1620v2 3.7 GHz, Maxwell Titan X GPU:

Two conv layers (6 and 12 feature maps, 5x5 filters), one fully connected layer (120 neurons), activation=Tanh, pooltype: average (excluding padding), cost=Negative Log Likelihood.

Learning rate=0.20, minibatch size=100, dropout = 0.0, L2 lambda=0.0, momentum=0.0, initialization: normal
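For reference, here is a rough Theano sketch of that configuration (an illustrative reconstruction, not the script the timings below came from; biases, data loading, and the epoch loop are omitted):

  import numpy as np
  import theano
  import theano.tensor as T
  from theano.tensor.nnet import conv2d
  from theano.tensor.signal.pool import pool_2d

  rng = np.random.RandomState(0)

  def shared_w(shape):
      # normal initialization, as in the setup above
      return theano.shared(rng.normal(0, 0.1, size=shape).astype(theano.config.floatX))

  x = T.tensor4('x')   # (batch, 1, 32, 32) images
  y = T.ivector('y')   # class labels

  w1 = shared_w((6, 1, 5, 5))       # conv1: 6 feature maps, 5x5 filters
  w2 = shared_w((12, 6, 5, 5))      # conv2: 12 feature maps, 5x5 filters
  w3 = shared_w((12 * 5 * 5, 120))  # fully connected layer, 120 neurons
  w4 = shared_w((120, 10))          # output layer

  # tanh activations, average pooling (excluding padding)
  h1 = pool_2d(T.tanh(conv2d(x, w1)), (2, 2), ignore_border=True, mode='average_exc_pad')
  h2 = pool_2d(T.tanh(conv2d(h1, w2)), (2, 2), ignore_border=True, mode='average_exc_pad')
  h3 = T.tanh(T.dot(h2.flatten(2), w3))
  p_y = T.nnet.softmax(T.dot(h3, w4))

  # negative log likelihood cost, plain SGD at learning rate 0.20
  nll = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])
  params = [w1, w2, w3, w4]
  updates = [(p, p - 0.20 * g) for p, g in zip(params, T.grad(nll, params))]

  train_step = theano.function([x, y], nll, updates=updates)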

Training for 10 epochs:

CPU: 172 sec, GPU: 14 sec.

When decreasing batch size to 20 images, the numbers are: CPU: 318 sec, GPU: 38 sec.

So yeah, the CPU code your professors wrote was really crappy.


I'll add my vote to the side saying "poorly optimized CPU code". This doesn't mean the code is "crappy", but if you are more than 1000x faster on the GPU than the CPU, there is almost certainly potential for improved performance on the CPU. Optimization is hard, and depending on their area of focus, professors who worked at NVIDIA may not be in the best position to get top performance out of modern Intel CPUs.

I'd bet that someone with low-level optimization expertise (comparable to what you appear to have applied to the GPU) could get at least another 10x out of the CPU version. You are completely right that GPUs have the potential for great increases over the current normal, but there's also (typically) room for large improvement on modern CPUs as well.


It's actually about 75x faster, not 1000x. There's a difference between parallelizing basic code and something like LeNet-5. The sequential code was pretty standard. I'm sure the CPU code could've been optimized more; I'm not trying to argue, I just thought it was interesting to see how fast you can make some CUDA code go.


Wait, did you make a typo, and it was 30 seconds for the CPU run time, not 30 minutes? If so, then yes, that's more realistic.


Oh wow. Now that's embarrassing. My bad for stirring things up with that.


You can actually get some gain in CPU performance by writing in OpenCL. That's because OpenCL code is meant to be easily parallelisable and consumed by wide SIMD units, so Intel's compiler can do a lot more autovectorising than for C code.
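To make that concrete, here is a minimal sketch using PyOpenCL (my choice for illustration, not something from the parent comment): each work-item handles one element, and a CPU OpenCL runtime is free to pack adjacent work-items into SIMD lanes.

  import numpy as np
  import pyopencl as cl

  # One work-item per element; the CPU runtime can vectorize across work-items.
  KERNEL = """
  __kernel void saxpy(const float a,
                      __global const float *x,
                      __global float *y)
  {
      int i = get_global_id(0);
      y[i] = a * x[i] + y[i];
  }
  """

  ctx = cl.create_some_context()   # select a CPU OpenCL device here
  queue = cl.CommandQueue(ctx)
  prg = cl.Program(ctx, KERNEL).build()

  n = 1 << 20
  x = np.random.rand(n).astype(np.float32)
  y = np.random.rand(n).astype(np.float32)

  mf = cl.mem_flags
  x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
  y_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=y)

  prg.saxpy(queue, (n,), None, np.float32(2.0), x_buf, y_buf)
  cl.enqueue_copy(queue, y, y_buf)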


> Think again before you call someones code "crappy"

Let's see it then. Most CPU programs have enormous amounts of fat left in them from cache misses and poor or no use of SIMD.


I'll second p1esk in calling the CPU version barely optimized, based solely on the numbers you're providing. Unless you're comparing a D410 against a Titan X?


It was run on a Tesla K80 cluster, so pretty high end. But the reason we were able to get such acceleration is that we wrote the low-level CUDA code ourselves, using as much register and shared memory as possible, while using good streaming techniques (which really accelerate it), as well as matrix multiplication techniques faster than SGEMM. The batch size the code was run on was also huge. Since we wrote the low-level CUDA code, we were also able to prevent control divergence as much as possible by using knowledge of warps and how DRAM bursts occur. We weren't using any API like OpenACC or anything for help; we wrote CUDA code with some pretty good optimizations. The numbers are real, and a lot of the optimizations come from understanding how the nested loops in the LeNet-5 CNN work together. All I'm saying is this shows how well GPUs can speed things up when you write efficient GPU code.
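None of the competition code is posted here, but as a generic illustration of one of the techniques mentioned (staging tiles in shared memory so global-memory reads coalesce into full DRAM bursts), here is a standard tiled matrix multiply written with Numba's CUDA support; streams and the register-level tricks described above are left out.

  import numpy as np
  from numba import cuda, float32

  TILE = 16  # each thread block computes a TILE x TILE patch of C

  @cuda.jit
  def matmul_tiled(A, B, C):
      # Tiles of A and B staged in fast on-chip shared memory and reused
      # by every thread in the block, cutting global-memory traffic.
      sA = cuda.shared.array(shape=(TILE, TILE), dtype=float32)
      sB = cuda.shared.array(shape=(TILE, TILE), dtype=float32)

      tx = cuda.threadIdx.x
      ty = cuda.threadIdx.y
      row = cuda.blockIdx.y * TILE + ty
      col = cuda.blockIdx.x * TILE + tx

      acc = float32(0.0)
      for t in range((A.shape[1] + TILE - 1) // TILE):
          # Threads with consecutive tx read consecutive addresses, so these
          # loads coalesce into a small number of DRAM bursts.
          if row < A.shape[0] and t * TILE + tx < A.shape[1]:
              sA[ty, tx] = A[row, t * TILE + tx]
          else:
              sA[ty, tx] = 0.0
          if t * TILE + ty < B.shape[0] and col < B.shape[1]:
              sB[ty, tx] = B[t * TILE + ty, col]
          else:
              sB[ty, tx] = 0.0
          cuda.syncthreads()
          for k in range(TILE):
              acc += sA[ty, k] * sB[k, tx]
          cuda.syncthreads()

      if row < C.shape[0] and col < C.shape[1]:
          C[row, col] = acc

  # Launch: one 16x16 thread block per output tile.
  n = 1024
  A = np.random.rand(n, n).astype(np.float32)
  B = np.random.rand(n, n).astype(np.float32)
  C = np.zeros((n, n), dtype=np.float32)
  blocks = ((n + TILE - 1) // TILE, (n + TILE - 1) // TILE)
  matmul_tiled[blocks, (TILE, TILE)](A, B, C)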





