I think this is available now. The waves/wavefronts on a GPU run independently of each other. Communication between them isn't great, so keeping them independent works better.
Given that chips from a couple of years ago have ~64 compute units, each running ~32 wavefronts (so roughly 2,000 wavefronts in flight at once), your target of 256 looks fine. The memory is one contiguous block, but using it as 256 separate blocks would work great.
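To make the "256 separate blocks" idea concrete, here is a minimal host-side sketch in plain Python rather than real GPU code. The names `NUM_BLOCKS`, `BYTES_PER_BLOCK`, `block_bounds` and `run_block` are made up for illustration, and the 1024-byte block size is an assumption. Each simulated program only ever touches its own slice of the single buffer, so no communication between them is needed.

```python
NUM_BLOCKS = 256          # one block per independent program (from the question)
BYTES_PER_BLOCK = 1024    # assumed size, purely illustrative


def block_bounds(block_id, bytes_per_block=BYTES_PER_BLOCK):
    """Return the [start, end) byte range owned by one block."""
    start = block_id * bytes_per_block
    return start, start + bytes_per_block


def run_block(memory, block_id):
    """Stand-in for one wavefront's program: writes only inside its own slice."""
    start, end = block_bounds(block_id)
    view = memoryview(memory)[start:end]  # writable window, no copy
    for i in range(len(view)):
        view[i] = block_id  # block ids 0..255 each fit in one byte


# One contiguous allocation, logically split into 256 private regions.
memory = bytearray(NUM_BLOCKS * BYTES_PER_BLOCK)

# On a GPU these would run concurrently; the order does not matter here
# because no block reads or writes another block's range.
for block_id in range(NUM_BLOCKS):
    run_block(memory, block_id)
```

On real hardware the same offset arithmetic would be done per wavefront from its own id, which is what lets the programs run without talking to each other.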
I don't know of a ready-made language that targets the GPU like that.