When you compile the C example with the compiler option "-Wall", you'll get a warning that the variable 'r' is set but not used. Since the result is never read, the compiler is free to simply skip the for loops as dead code. In fact, if you compile with clang instead of gcc, the compiler will do just that and you'll get almost zero computation time.
It would be better to do something with the computed result so the compiler does not remove the computation, e.g. print a randomly selected value.
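Something like the sketch below would do it. The matrix size, iteration count, and naive triple loop are my assumptions (the article's actual code isn't reproduced here); the final printf is the part that matters, since it makes the result observable:

/* Sketch: keep the result live so the compiler can't delete the loops.
   Size and iteration count assumed to match the benchmarks below. */
#include <stdio.h>
#include <stdlib.h>

#define N 100

static double a[N][N], r[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (double)rand() / RAND_MAX;

    for (int iter = 0; iter < 1000; iter++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += a[i][k] * a[k][j];
                r[i][j] = s;
            }

    /* Print one randomly selected element: the computation now has
       an observable effect and cannot be optimized away. */
    printf("%f\n", r[rand() % N][rand() % N]);
    return 0;
}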
I also benchmarked the fixed C code against Python, and "Python" (really NumPy, which uses OpenBLAS under the hood in my case) was 10 times faster:
import time
import numpy as np

a = np.random.rand(100, 100)  # 100x100 matrix of uniform random floats

t = time.perf_counter()
for _ in range(1000):
    a @ a  # dispatches to BLAS; CPython won't eliminate the unused result
print(time.perf_counter() - t, "seconds")
Implementation matters a lot for matrix multiplication.
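To make that concrete, here is a minimal sketch of driving the same library directly from C through the standard CBLAS interface (size and iteration count carried over from the Python snippet; link with -lopenblas or your platform's BLAS):

/* Sketch: the same 100x100 product via cblas_dgemm, which is
   roughly what NumPy calls under the hood. */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

#define N 100

int main(void) {
    double *a = malloc(N * N * sizeof *a);
    double *r = malloc(N * N * sizeof *r);
    for (int i = 0; i < N * N; i++)
        a[i] = (double)rand() / RAND_MAX;

    for (int iter = 0; iter < 1000; iter++)
        /* r = 1.0 * a * a + 0.0 * r, row-major, no transposes */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, a, N, a, N, 0.0, r, N);

    printf("%f\n", r[rand() % (N * N)]);  /* keep the result live */
    free(a);
    free(r);
    return 0;
}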
Yes, definitely a quick and dirty benchmark (I did test after I posted to see if initializing a does anything; it didn't). Timings for J below, since I think it's the most focused on linear algebra. The remarkable thing about the K code from the article is that it's all made from totally general-purpose pieces that have nothing to do with matrix products, and K interpreters don't have any clue what a matrix product is. In J the matrix product is written +/ .* with the generalized dot product operator . (which does need the preceding space, oof) and is handled by specialized code. Given that, I found this measurement a little disappointing: only about as fast as my C code in the 100x100 case and slightly faster in the 200x200 case.
a =: ?100 100$0                   NB. 100x100 matrix of uniform random floats
<.1e6 * 1000 (6!:2) '+/ .*~ a'    NB. average over 1000 runs, floored to microseconds
269
a =: ?200 200$0
<.1e6 * 1000 (6!:2) '+/ .*~ a'
1796