I agree, not a well designed benchmark. The mmap side did not use any of the mec...

I agree, not a well designed benchmark. The mmap side did not use any of the mechanisms for paging in data before faulting each page, which are table stakes for this usecase.

The point of DDIO is that the data is frequently processed fast enough that it can get pulled from L3 to L1 and be finished with before it needs to be flushed to main memory. For a system with a coherent L3 or appropriate NUMA this isn’t really non-temporal, it’s more like the L3 cache is shared with the device.