I thought I'd do something smart: inline all the matrix multiplications into the einsums of the vectorized multi-head attention implementation from the article and set optimize="optimal", so the optimal matrix chain multiplication algorithm (https://en.wikipedia.org/wiki/Matrix_chain_multiplication) could give a nice performance boost.
This is indeed twice as fast as the vectorized implementation, but, disappointingly, the naive implementation with loops is even faster. Here is the code if anyone wants to figure out why the performance turns out this way: https://pastebin.com/raw/peptFyCw
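For context, here is a minimal sketch (not the pastebin code) of what "inlining the projections into the einsums" can look like. The shapes and weight names (W_q, W_k, W_v, W_o) are assumptions for illustration:

```python
import numpy as np

def mha_inlined(x, W_q, W_k, W_v, W_o):
    # x:             (batch, seq, d_model)
    # W_q, W_k, W_v: (heads, d_model, d_head)
    # W_o:           (heads * d_head, d_model)
    d_head = W_q.shape[-1]

    # Q @ K^T with both projections folded into a single contraction;
    # optimize="optimal" lets NumPy pick the cheapest contraction order.
    scores = np.einsum("bsd,hde,btf,hfe->bhst", x, W_q, x, W_k,
                       optimize="optimal") / np.sqrt(d_head)

    # Softmax over the key axis.
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)

    # Value projection folded into the second contraction.
    context = np.einsum("bhst,btf,hfe->bshe", attn, x, W_v,
                        optimize="optimal")
    return context.reshape(*context.shape[:2], -1) @ W_o

# Tiny smoke test with random weights.
rng = np.random.default_rng(0)
b, s, d_model, h, d_head = 2, 16, 64, 4, 16
x = rng.standard_normal((b, s, d_model))
W_q, W_k, W_v = (rng.standard_normal((h, d_model, d_head)) for _ in range(3))
W_o = rng.standard_normal((h * d_head, d_model))
print(mha_inlined(x, W_q, W_k, W_v, W_o).shape)  # (2, 16, 64)
```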
My guess is that einsum could do a better job of accounting for cache locality when evaluating the contraction.
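One way to check part of this is np.einsum_path, which reports the planned contraction order and a FLOP estimate but says nothing about memory layout, consistent with the guess above. The shapes below are placeholders:

```python
import numpy as np

x   = np.random.rand(8, 128, 64)   # (batch, seq, d_model)
W_q = np.random.rand(4, 64, 16)    # (heads, d_model, d_head)
W_k = np.random.rand(4, 64, 16)

path, report = np.einsum_path("bsd,hde,btf,hfe->bhst", x, W_q, x, W_k,
                              optimize="optimal")
print(report)  # pairwise contraction order plus theoretical FLOP count
```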
To be fair, you could replace `import numpy as np` with `import cupy as np` and it would run on the GPU without further changes. It still isn't great, though: PyTorch is roughly 12 times faster.
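A hedged sketch of that drop-in swap, falling back to NumPy when CuPy isn't installed; the rest of the code stays identical because CuPy mirrors the NumPy einsum API:

```python
try:
    import cupy as np   # same einsum code, but executed on the GPU
except ImportError:
    import numpy as np  # CPU fallback

a = np.random.rand(256, 256)
b = np.random.rand(256, 256)
c = np.einsum("ij,jk->ik", a, b, optimize="optimal")
print(type(c))  # cupy.ndarray on GPU, numpy.ndarray on CPU
```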