Hacker News

One classic example is loop fusion: avoiding writes to memory and reducing the number of loops (although not the FLOP count), e.g.

    for j in range(10):
      c[j] = a[j] + b[j]
    for j in range(10):
      d[j] = c[j] * 2
becomes

    for j in range(10):
      d[j] = (a[j] + b[j]) * 2

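A runnable version of the fusion above (the array contents are placeholders of my own; the point is that the fused loop never materializes the intermediate c):

```python
n = 10
a = [float(j) for j in range(n)]  # hypothetical input data
b = [float(j) for j in range(n)]

# unfused: two loops, with the intermediate c written out between them
c = [0.0] * n
d_unfused = [0.0] * n
for j in range(n):
    c[j] = a[j] + b[j]
for j in range(n):
    d_unfused[j] = c[j] * 2

# fused: one loop; the sum a[j] + b[j] lives only in a temporary
d_fused = [0.0] * n
for j in range(n):
    d_fused[j] = (a[j] + b[j]) * 2

assert d_fused == d_unfused
```

Same FLOPs either way; the win is one loop pass and no stores/loads of c.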


Or, better, identifying that the machine has a primitive that is better than doing each op individually. For example, a multiply-accumulate instruction vs a multiply and a separate accumulate. The source code still says "a*b + c"; the compiler is just expected to infer the MAC instruction.
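A sketch of the shape the compiler pattern-matches on (plain Python here, standing in for the compiled inner loop; `dot` is a name I made up):

```python
def dot(a, b):
    """Dot product whose inner step has the a*b + c shape that a
    compiler can lower to one multiply-accumulate (MAC/FMA) instruction."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = x * y + acc  # one MAC per iteration on hardware that has it
    return acc

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

Besides saving an instruction, a hardware FMA rounds once instead of twice, so results can differ slightly from the separate multiply-then-add.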


Yep! This is an assumed optimization in modern linear algebra compilers. New primitives go way beyond FMAs: full matrix multiplies on NVIDIA/Intel and outer-product accumulates on Apple silicon. It's also expected that these are used nearly optimally (or you've got a bug).
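A NumPy sketch (my own illustration, not any vendor API) of the outer-product-accumulate formulation: a matmul is a sum of rank-1 updates, which is exactly the primitive the Apple-silicon-style hardware accumulates directly:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))
B = rng.standard_normal((4, 8))

# C = A @ B expressed as K rank-1 outer-product accumulates,
# one per inner dimension k
C = np.zeros((8, 8))
for k in range(A.shape[1]):
    C += np.outer(A[:, k], B[k, :])

assert np.allclose(C, A @ B)
```

Tensor-core-style hardware instead consumes whole tiles of A and B per instruction; either way, the compiler is expected to tile the loop nest onto the primitive rather than emit scalar FMAs.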


I am extremely familiar with how far these primitives go, ha. I develop kernels professionally for AWS ML accelerators.



