That's cool — am I right in assuming you want to automate the production of efficient GPU (or other accelerator) code based on these low-level primitives? You would still need a piece of sorcery that can produce high-performance OpenCL code, right? And that code could be different for every device, so you would need some trial-and-error, benchmark-based compilation at the very least. Or would the OpenCL code be written by hand for each device?
We're working on parameterizing a search space that includes more than just the local work-group size. The end dream is ML-guided search to optimize the kernels :)
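To make the idea concrete, here's a minimal sketch of benchmark-based tuning over a multi-dimensional search space. Everything here is hypothetical: the knob names (`local_size`, `vector_width`, `tile_size`) and the `benchmark` cost surface are made up for illustration — a real tuner would compile each OpenCL kernel variant and take the median of several timed runs on the actual device, and ML-guided search would replace the exhaustive loop with a model that proposes promising configs.

```python
import itertools
import random

# Hypothetical tuning knobs beyond just the local work-group size.
SEARCH_SPACE = {
    "local_size": [16, 32, 64, 128, 256],
    "vector_width": [1, 2, 4, 8],
    "tile_size": [1, 2, 4],
}

def benchmark(config):
    """Stand-in for compiling a kernel with `config` and timing it on the
    device. Here we fake a smooth cost surface plus a little noise, purely
    so the example runs without a GPU."""
    return (abs(config["local_size"] - 64) / 64
            + abs(config["vector_width"] - 4) / 4
            + abs(config["tile_size"] - 2) / 2
            + random.random() * 0.01)

def exhaustive_tune(space, bench):
    """Trial-and-error search: benchmark every configuration, keep the
    fastest. This is the baseline an ML-guided searcher would improve on
    by pruning the space instead of enumerating it."""
    keys = list(space)
    best_cfg, best_time = None, float("inf")
    for values in itertools.product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        t = bench(cfg)
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time

random.seed(0)
best, t = exhaustive_tune(SEARCH_SPACE, benchmark)
print(best)
```

Even this toy space has 5 × 4 × 3 = 60 points; real spaces (unrolling factors, memory layouts, per-device variants) explode combinatorially, which is exactly why guided search becomes attractive.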
OK, generally I think you're doing exactly what ML is lacking right now. Another huge opportunity: instead of taking the average neural network and designing accelerators for it, design hardware-friendly networks that run well on a sane accelerator built to work with only these specialised networks (one that doesn't need 80% of its chip area for on-chip memory, for example). These might end up being completely different networks from what researchers use today. I work in this area, and I think it's also possible to use the loss function to optimise the network for a specific piece of hardware.
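One way to read "use the loss function to optimise the network for a specific HW" is to add a hardware penalty to the training objective. The sketch below is an assumption about what that could look like, not anyone's actual method: the on-chip memory size, the overflow-based cost proxy, and the weighting `lam` are all illustrative numbers.

```python
def hw_cost(params_bytes, on_chip_bytes):
    """Crude proxy for hardware friendliness: parameters that spill past
    on-chip memory cost off-chip bandwidth, so penalize the overflow.
    The byte figures are illustrative, not from any real accelerator."""
    overflow = max(0, params_bytes - on_chip_bytes)
    return overflow / on_chip_bytes

def hw_aware_loss(task_loss, params_bytes,
                  on_chip_bytes=2 * 1024 * 1024, lam=0.1):
    """Combined objective: the usual task loss plus a weighted hardware
    penalty, so architecture search or plain model selection trades
    accuracy against fitting the target chip."""
    return task_loss + lam * hw_cost(params_bytes, on_chip_bytes)

# A model that fits on-chip pays no penalty...
small = hw_aware_loss(0.30, params_bytes=1 * 1024 * 1024)
# ...while a slightly more accurate one that spills off-chip can end up
# worse under the combined objective.
big = hw_aware_loss(0.25, params_bytes=8 * 1024 * 1024)
print(small, big)
```

With a differentiable cost proxy (e.g. expected bit-width or sparsity instead of a hard byte count), the same idea plugs into gradient-based training rather than after-the-fact model selection.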