Yeah, performance at low buffer sizes is a big challenge. I generally recommend 512 or higher, which I know is not great, but right now it's the most practical option. The issue is that all of the computation is done on the GPU, and there's a round-trip latency to the GPU that has to be amortized across each buffer. One day I'd like to convince Apple to work on the kernel scheduling latency...
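
To put rough numbers on the amortization point, here's a back-of-the-envelope sketch. It assumes the buffer size is counted in audio samples, a 48 kHz sample rate, and a 1 ms round trip. That last figure is a made-up placeholder for illustration, not something I've measured, and the function names are just ones I picked for this example.

```python
def buffer_budget_ms(buffer_size: int, sample_rate: float) -> float:
    """Time available to render one buffer, in milliseconds."""
    return buffer_size / sample_rate * 1000.0


def overhead_fraction(buffer_size: int, sample_rate: float, round_trip_ms: float) -> float:
    """Share of the per-buffer time budget eaten by a per-buffer round-trip cost."""
    return round_trip_ms / buffer_budget_ms(buffer_size, sample_rate)


SAMPLE_RATE = 48_000.0  # assumed sample rate, in Hz
ROUND_TRIP_MS = 1.0     # hypothetical placeholder, NOT a measured value

for size in (64, 128, 256, 512, 1024):
    budget = buffer_budget_ms(size, SAMPLE_RATE)
    share = overhead_fraction(size, SAMPLE_RATE, ROUND_TRIP_MS)
    print(f"{size:5d} samples: {budget:6.2f} ms budget, {share:6.1%} spent on round trip")
```

With those assumed numbers, the round trip alone would take 75% of the time budget at 64 samples but only about 9% at 512, which is the general shape of why larger buffers are more forgiving. The real figures depend on the hardware and the actual latency.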