Async hazard: MMAP is blocking IO (huonw.github.io)
106 points by mmastrac 8 months ago | 115 comments



I like this point - it's no secret that mmap can make memory access cost the same as an IO (swap can too) - but the interaction with async schedulers isn't immediately obvious. The cost can, sometimes, be even higher than this post says, because of write back behavior in Linux.

Mmap is an interesting tool for system builders. It's super powerful, and super useful. But it's also kind of dangerous because the gap between happy case and worst case performance is so large. That makes benchmarking hard, adds to the risk of stability bugs, and complicates taming tail latency. Its behavior also varies a lot between OSs.

It's also nice to see all the data in this post. Too many systems design conversations are just dueling assertions.


I'm surprised this is seen as a liability of mmap rather than a cooperative scheduler that isn't using native kernel threads. This is the deal you make with the devil when you use cooperative scheduling without involving the kernel, so I'm surprised it is news to people working with cooperative schedulers. These faults can happen even if you never explicitly memory map files (particularly since executables and shared libraries are often memory mapped into processes), so page faults are a blocking hazard for cooperative schedulers even without mmap.

The MMU in the hardware is aggressively parallel, and the only thread being blocked on the page fault is the one touching the page that needs to be swapped in. In reality, you can get heavily parallelized IO using mmap (indeed, it works quite well when you have a ton of IO you'd like to execute in parallel).


I would describe it more as a limitation of mmap than a liability.

Modern async models have their origin in addressing serious shortcomings with the traditional POSIX APIs, particularly with respect to mmap and kernel schedulers. You can’t mix the models easily; many parts of POSIX don’t play nicely with async architectures. Traditional async I/O engines use direct I/O on locked memory, and if you use async you need to be cognizant of why this is. Half the point of async is to have explicit control and knowledge of when page faults are scheduled. Async is a power tool, not something that should be used lightly for casual tasks nor accidentally pulled in as a dependency.

The issue here appears to be that it is far too easy for someone to inadvertently mix models, not async per se. Async has substantial advantages versus native kernel threads, so getting rid of async is not a realistic solution. No one is going to give up a several-fold performance increase versus native kernel threads because some developers can’t figure out how to not mix models or the ecosystem doesn’t protect developers against inadvertently mixing models.

Async is used heavily in some C/C++ domains but it doesn’t seem to cause many issues there, perhaps because dependencies are much more explicit and intentional. Async has also been idiomatic for certain domains in C/C++ for decades so there is an element of maturity around working with it.


> I would describe it more as a limitation of mmap than a liability.

Except it's a limitation that shows up even if you never make an mmap call. It's just a reality of living with virtual memory (and arguably, with preemption-based kernel scheduling in general, as the kernel can decide to context switch away from a thread).

> Traditional async I/O engines use direct I/O on locked memory, and if you use async you need to be cognizant of why this is. Half the point of async is to have explicit control and knowledge of when page faults are scheduled. Async is a power tool, not something that should be used lightly for casual tasks nor accidentally pulled in as a dependency.

Cooperative multitasking, to avoid the stalling described in this article, also needs locked memory/explicit control/knowledge of when page faults are scheduled.

> Async is used heavily in some C/C++ domains but it doesn’t seem to cause many issues there, perhaps because dependencies are much more explicit and intentional. Async has also been idiomatic for certain domains in C/C++ for decades so there is an element of maturity around working with it.

It's more that the people in those domains have an understanding of cooperative multitasking's trade-offs, and it is an explicit design choice to employ it.
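
(For anyone wondering what "locked memory" means concretely here: latency-sensitive cooperative schedulers typically pin their pages up front so a fault can never turn into disk IO. A minimal sketch, assuming the libc crate; the function name is made up.)

    // Sketch: pin all current and future pages of the process so that page
    // faults can't turn into disk IO underneath a cooperative scheduler.
    // Needs CAP_IPC_LOCK or a sufficiently large RLIMIT_MEMLOCK.
    fn lock_all_memory() -> std::io::Result<()> {
        let rc = unsafe { libc::mlockall(libc::MCL_CURRENT | libc::MCL_FUTURE) };
        if rc != 0 {
            return Err(std::io::Error::last_os_error());
        }
        Ok(())
    }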


I think your point here can be generalized further. Why should someone expect reading memory to benefit from async code?

The fact that the memory in this case has an access layer with exploitable latency is where the chatter about this stems from, but it misses the fundamental issue at hand.

If this was a valid concept we’d have async memcpy interfaces.


It is not exactly async memory, but at the turn of the millennium a few unices experimented with scheduler activations: the kernel would upcall back into the application whenever a thread would block for any reason, allowing rescheduling of the user space thread.

In the end, the complexity wasn't worth it at the time, but it is possible that something like that could be brought back in the future.


> I'm surprised this is seen as a liability of mmap rather than a cooperative scheduler that isn't using native kernel threads

Indeed. In practice, though, it's easier to write high performance servers and storage systems with async runtimes (like tokio) than with native threads, at least with the current state of the ecosystem. That's not for some fundamental reason - it's possible to get great threaded performance - just the current reality.

So, whoever's fault this is, it's useful to have good evidence of this downside of async runtimes (and worth thinking about ways that OSs could let runtimes know when they were about to block on IO).


> That's not for some fundamental reason - it's possible to get great threaded performance - just the current reality.

I would argue it is for a fundamental reason. Cooperative multitasking in user-space requires far less overhead than anything a kernel might do. It's just an explicit part of the trade-off: you get more efficient context switches and control when context can change, and in exchange you leave something on the table whenever the kernel is involved.

> So, whoever's fault this is, it's useful to have good evidence of this downside of async runtimes (and worth thinking about ways that OSs could let runtimes know when they were about to block on IO).

But it isn't specific to async runtimes (in fact, a kernel-based preemptively scheduled async runtime wouldn't have this problem). It's a problem specific to cooperative multitasking.


> cooperative scheduler that isn't using native kernel threads

Can anyone point me towards cooperative thread schedulers that use native kernel threads? Would this effectively mean implementing a cooperative model on top of pthreads?


The term to search for prior art is user-mode scheduler / scheduling. Basically you add additional kernel features that allow making some scheduling decisions in the application, it's not something you'd just build on a vanilla pthreads implementation.

Examples:

Windows 7 UMS: https://learn.microsoft.com/en-us/windows/win32/procthread/u...

google3 fibers / switchto: https://www.youtube.com/watch?v=KXuZi9aeGTw


changes in the page table block the whole process, doesn't matter what combination of concurrency models you're using. we could do with a sub-process mapping API from the OS, but it's not something any major OS offers today, and requires designing for at a very fundamental level due to interaction with the hardware, and associated hardware constraints.


> changes in the page table block the whole process, doesn't matter what combination of concurrency models you're using.

I don't think that's necessarily true --- adding a mapping doesn't need to stop other threads that share a page table unless they're also modifying the page table. I don't think the TLB would cache an unmapped entry, but even if it did, the page fault handler will check, see that it's fine and resume execution.

For unmapping, it's different, in that you have to do IPI TLB shootdowns, but that interrupts, not blocks.


And even if other threads are contending for the page table lock, the kernel doesn’t hold that lock for the entire duration of the I/O. Only for the tiny fraction of that duration where the kernel is spending CPU time doing bookkeeping. For the rest of the time, during which the system is just waiting for the disk, the thread that triggered the page-in is still blocked, but other threads can do whatever they want, including page-table modifications.

From what I’ve read on LWN, contention on the page table lock (mmap_sem / mmap_lock) has been a real and perennial issue for Linux, especially on servers with huge numbers of CPUs; but it’s a far smaller effect than what this post is talking about.


> From what I’ve read on LWN, contention on the page table lock (mmap_sem / mmap_lock) has been a real and perennial issue for Linux, especially on servers with huge numbers of CPUs; but it’s a far smaller effect than what this post is talking about.

...and either way, that's kernel lock contention, not blocking IO.


fair pushback, though s/unless they're also modifying the page table/unless they're also accessing the page table/, as it too needs to be synchronized. so yes, sometimes it has no effect, but given how often programs end up loading, crosstalk is super common


Crosstalk is absolutely common for a number of operations, but that crosstalk is NOT the same as blocking until the page is loaded into RAM. That operation is blocking on IO.


Page faults block the thread, not the process, because the thread is trying to access memory that isn't available to it. Other threads run just fine so long as they too don't trigger page faults. The article specifically mentions this, and I've built entire architectures based around this reality. They work great.

...and of course there are also multi-process concurrency models, where even if a process were blocked, the other processes would not. So no, it does absolutely matter what combination of concurrency models you are using.
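
(A rough sketch of the "blocks the thread, not the process" point, assuming the memmap2 crate and a cold file mapping; names are made up. The counting thread keeps making progress while the touching thread sits in page faults.)

    use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
    use std::sync::Arc;

    // Sketch: one thread faults a (cold) mapping in page by page while another
    // just counts; the counter keeps advancing during the faults.
    fn demo(map: memmap2::Mmap) {
        let done = Arc::new(AtomicBool::new(false));
        let ticks = Arc::new(AtomicU64::new(0));

        let (d, t) = (Arc::clone(&done), Arc::clone(&ticks));
        let counter = std::thread::spawn(move || {
            while !d.load(Ordering::Relaxed) {
                t.fetch_add(1, Ordering::Relaxed); // unaffected by the other thread's faults
            }
        });

        let mut sum = 0u64;
        for i in (0..map.len()).step_by(4096) {
            sum += map[i] as u64; // each first touch may block *this* thread on disk IO
        }
        done.store(true, Ordering::Relaxed);
        counter.join().unwrap();
        println!("checksum {sum}, counter reached {}", ticks.load(Ordering::Relaxed));
    }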


it outsources buffer management and user thread i/o scheduling to the kernel. for some use cases it's a great way to simplify implementation or boost performance. for others it may not perform as well.

the blog post points (in my mind) at some more general advice when programming which is not to mix and match paradigms unless you really know what you're doing. if you want to do user space async io, cool. if using kernel features tickles your fancy, also cool.

mixing both without a deep understanding of what's going on under the hood will probably give you trouble.


Making it work asynchronously would require the compiler to split the memory access into two parts, a non-blocking IO dispatch and a blocking access to the mapped address. The OS would need to support that, however, and the language would need to keep track of what is a materialised array and what’s not.


as i understand, mmap is only efficient because it can leverage hardware support for trapping into the kernel when a page needs to be loaded to satisfy an access attempt.

i think adding software indirection to every access in the mapped region would be really slow.

i think a better answer would be to impose more structure on the planned memory access, then maybe given some constraints (like say, "this loop is embarrassingly parallel") the system could be smarter about working on the stuff in ram first while the rest is loaded in.


> every access in the mapped region would be really slow.

Would certainly be slower. The compiler would need to be aware we want this behaviour and split the access in two parts, one to trigger the page read and yield to the app’s async loop, and another to resolve the read when the page has loaded. This would only need to happen for explicitly marked asynchronous memory reads (doing that without hardware support for all memory reads would be painful).


Normal syscalls also "leverage hardware support for trapping into the kernel". Mmap is usually used because it is a simple way to do 0-copy disk I/O.


i can't think of any other syscall that makes use of tlb caches and page fault machinery to enter the kernel as needed in response to ordinary user space memory access.


I think you could make do with some kind of async memory-touch system call, i.e. "page in this range of memory, notify me when finished". The application would have to call this on blocks of the mmap prior to actually reading it.

This of course means you lose some of the benefits of mmap (few system calls, automatic paging), but would maybe still be beneficial from a performance perspective.
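
(Something close to this can be approximated in user space today by pre-faulting on a blocking thread pool. A rough sketch, assuming tokio and the memmap2 crate; the function name, the fixed 4 KiB page size and the missing bounds/error handling are all simplifications.)

    use std::sync::Arc;

    // Sketch: pre-fault a range of a mapping off the async executor, then read
    // it from the async task once the pages are (very likely) resident.
    async fn read_chunk(map: Arc<memmap2::Mmap>, offset: usize, len: usize) -> Vec<u8> {
        let prefetch = Arc::clone(&map);
        tokio::task::spawn_blocking(move || {
            let bytes: &[u8] = &prefetch; // Arc<Mmap> -> &[u8] via deref
            let mut i = offset;
            while i < offset + len {
                std::hint::black_box(bytes[i]); // fault the page in on a worker thread
                i += 4096; // assume 4 KiB pages
            }
        })
        .await
        .expect("prefetch task panicked");

        let bytes: &[u8] = &map;
        bytes[offset..offset + len].to_vec() // should now hit RAM, not disk
    }

It's still only best-effort, though: the pages can be evicted again between the prefetch and the copy.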


It would allow a memory read to yield to the async loop, but overall performance of the read itself would always be lower.

It’s the kind of thing that would be better implemented as a special “async buffer” where reads are guarded by a page fault handler that returns as soon as the read is scheduled and a read that yields on an unresolved page load.


io_uring + madvise is probably the closest solution.

Although if you are using uring, there are other options for async disk IO.


IMO this is a strong argument for proper threads over async: you can try and guess what will and won't block as an async framework dev, but you'll never fully match reality, and you end up wasting resources when an executor blocks when you weren't expecting it.


I don’t find this argument super strong, fwiw. It could just mean ‘be wary of doing blocking operations with async, and note mmap makes reading memory blocking (paging in) and writing memory blocking (CoW pages)’

I think there are reasons to be wary but to me, debugging comes first (this goes two ways though: if you have a single ‘actual’ thread then many races can’t happen) because debuggers/traces/… work better on non-async code.

Performance comes second but it’s complicated. The big cost with threads is heavy context switches and per-thread memory. The big cost with async is losing cpu locality (because many syscalls on Linux won’t lead to your thread yielding, and the core your thread is on will likely have more of the relevant information and lots of cache to take advantage of when the syscall returns[1]) and spending more on coordination. Without io_uring, you end up sending out your syscall work (nonblocking fd ops excepted) to some thread pool to eventually pick up (likely via some futex), load into cache, send to the os on some random core, and then send back to you in a way that you will notice such that the next step can be (internally) scheduled. It can be hard to keep a handle on the latency added by all that indirection.

The third reason I have to be wary of async is that it can be harder to track resource usage when you have a big bag of async stuff going on at once. With threads there is some sense in which you can limit per-thread cost and then limit the number of threads. I find this third reason quite weak.

All that said, it seems pretty clear that async provides a lot of value, especially for ‘single-threaded’ (I use this phrase in a loose sense) contexts like JavaScript or Python where you can reduce some multithreading pain. And I remain excited for io_uring based async to pick up steam.

[1] there’s this thing people say about the context switching in and out of kernel space for a syscall being very expensive. See for example the first graph here: https://www.usenix.org/legacy/events/osdi10/tech/full_papers... . But I think it isn’t really very true these days (maybe spectre & co mitigations changed that?) at least on Linux.


It isn't.

Async is for tasks dominated by waiting, e.g. http serving, not computations. This means it's extremely rare to run into mmap blocking related issues if you don't do something strange.

Furthermore async doesn't exclude multi threading:

- having a pool of worker threads in addition to the CPU threads is pretty normal

- having multiple async threads, potentially with cross-core work stealing, is also the norm

I.e. if you just follow basic advice, the huge majority of tasks interacting in any potentially performance-problematic way will not be run in async tasks, even if you write an async web server.

> but you'll never fully match reality and you end up wasting resources when an executor blocks when you weren't expecting

and you always waste tons of resources with non-async IO, even without doing something unusual, _iff_ it's about waiting-dominated tasks, as you have way more management overhead

furthermore, in more realistic cases it's quite common that some unplanned blocking mainly causes latency issues (which in the worst case could cause timeouts) but, because async engines still use multi-threading, it doesn't lead to relevant utilization issues. That is, if it's just some unplanned blocking. If you do obviously wrong things like processing large files in async tasks, things can be different.

An argument against async is that, depending on what you use, it can add complexity, and that a lot of use-cases don't benefit from it enough to make it a reasonable choice. Though that is also a bit language dependent. E.g. JS is already cooperative in your program anyway, and using async makes things simpler there (as the alternative is callbacks). Or in Python with the GIL, the perf gains of async are much higher compared to the gains in, say, C++.


This kind of issue exists only in async executor implementations that cannot detect blocked workers and inject new ones to compensate for the starvation. I'm not aware if Rust has anything like this today (both Tokio and async-std are not like that) or in development for tomorrow, but there are implementations that demonstrate resilience to this in other language(s).


Do you have info about current (production) implementations that increase the number of workers?

In https://tokio.rs/blog/2020-04-preemption#a-note-on-blocking (2020), there's reference to .NET doing this, and an explicit suggestion that Go, Erlang and Java do not, as well as discussion of why Tokio did not.


Yes, it is .NET as Tokio blog post references.

Unfortunately, it does not appear to look into .NET's implementation with sufficient detail and as a result gets its details somewhat wrong.

Starting with .NET 6, there are two mechanisms that determine the ThreadPool's active thread count: a hill-climbing algorithm and blocking detection.

Hill-climbing is the mechanism that both the Tokio blog post and the articles it references mention. I hope the blog's contents do not indicate the depth of research performed by the Tokio developers because the coverage has a few obvious issues: it references an article written in 2006 covering .NET Framework that talks about the heavier and more problematic use-cases. As you can expect, the implementation received numerous changes since then and 14 years later likely shared little with the original code. In general, as you can expect, the performance of then-available .NET Core 3.1 was incomparably better, to put it mildly, which includes tiered compilation in the JIT that reduced the impact of such startup-like cases that used to be more problematic. Thus, I don't think the observations made in the Tokio post are conclusive regarding the current implementation.

In fact, my interpretation of how various C# codebases evolved throughout the years is that hill-climbing worked a little too well, enabling ungodly heaps of exceedingly bad code that completely disregarded expected async/await usage and abused the threadpool to oblivion, with the most egregious cases handled by enterprise applications overriding the minimum thread count to a hundred or two and/or increasing the thread injection rate. Luckily, those days are long gone. The community is now in an over-adjustment phase where people would rather unnecessarily contort the code with async than block here and there and let the threadpool work its magic.

There are also other mistakes in the article regarding task granularity, execution time and behavior there but it's out of scope of this comment.

Anyway, the second mechanism is active blocking detection. This is something that was introduced in .NET 6 with the rewrite of the threadpool impl. to C#. The way it works is it exposes a new API on the threadpool that lets all kinds of internal routines notify it that a worker is blocked or about to get blocked. This allows it to immediately inject a new thread to avoid starvation without a wind-up period. This works very well for the most problematic scenarios of abuse (or just unavoidable sync and async interaction around the edges) and further ensures the "jitter" discussed in the articles does not happen. Later on, the threadpool will reclaim idle threads after a delay when it sees they do not perform useful work, via hill-climbing or otherwise.

I've been meaning to put up a small demonstration of hill-climbing in light of un-cooperative blocking for a while so your question was a good opportunity:

https://github.com/neon-sunset/InteropResilienceDemo there are additional notes in the readme to explain the output and its interpretation.

You can also observe almost-instant mitigation of cooperative (aka through managed means) blocking by running the code from here instead: https://devblogs.microsoft.com/dotnet/performance-improvemen... (second snippet in the section).


Thanks for the up-to-date info.

> .NET 6

(I’m under the impression that this was released in 2021, whereas the linked Tokio post is from 2020. Hopefully that frames the Tokio post’s claims more accurately.)


UPD: Ouch, messed up the Rust lib import path on Unix systems in the demo. Now fixed.


> How do other mmap/madvise options influence this (for instance, MADV_SEQUENTIAL, MADV_WILLNEED, MADV_POPULATE, MADV_POPULATE_READ, mlock)? (Hypothesis: these options will make it more likely that data is pre-cached and thus fall into fast path more often, but without a guarantee.)

That probably should have been the first thing to try. Too bad the memmap2 crate does not expose this.

Also, looking at the memmap2 crate, it chooses some rather opinionated defaults depending on which function you actually call, and it makes accessing things like hugepage maps somewhat difficult... and for whatever reason includes the MAP_STACK flag when you call through this path.

I feel like a lot of rust authors put faith in crates that, upon inspection, are generally poorly designed and do not expose the underlying interface properly. It's a bad crutch for the language.
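
(If the wrapper doesn't expose the hints you want, you can always apply them to the mapping yourself. A rough sketch, assuming the memmap2 and libc crates; the function name is made up and error handling is minimal.)

    use std::fs::File;

    // Sketch: map a file read-only, then hand the kernel readahead hints and
    // optionally pin the pages. mlock needs a big enough RLIMIT_MEMLOCK.
    // On 5.14+ kernels, the MADV_POPULATE_READ hint from the quote would
    // instead block here until everything is resident.
    fn hint_mapping(file: &File) -> std::io::Result<memmap2::Mmap> {
        let map = unsafe { memmap2::Mmap::map(file)? };
        let (ptr, len) = (map.as_ptr() as *mut libc::c_void, map.len());
        unsafe {
            libc::madvise(ptr, len, libc::MADV_SEQUENTIAL); // readahead pattern hint
            libc::madvise(ptr, len, libc::MADV_WILLNEED);   // start paging it in now
            if libc::mlock(ptr as *const libc::c_void, len) != 0 {
                return Err(std::io::Error::last_os_error());
            }
        }
        Ok(map)
    }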


> That probably should have been the first thing to try.

The point of the post is not "how to make mmap work best with async/await" or "how to optimise mmap with async/await", but exploring the consequence of incorrect code (and thus explaining why one might need potential remedies like those). Sorry if that didn't come across!


I think it's harder for you to write "correct code" here because the crate is hiding most of the actual detail from you. I put that in quotes because there's absolutely nothing incorrect about the code, it's really just suboptimal, and most probably because it can't even use the full syscall interface.

Seriously, I hate to be a curmudgeon, but that crate looks like a particularly bad and naive wrapper around mmap. It works very hard to provide you things you don't need when the basic interface is much more flexible. Aside from having to put `unsafe` around the call and re-import the kernel header constants, there's almost no reason to even have this in a crate.


I have a feeling we’re talking at cross purposes here: I was actively trying to write incorrect code. This post isn’t about memmap2 crate specifically at all, it just happens to be a convenient way to get the exact (“incorrect”) syscall I wanted from Rust.

I see where you’re coming from but… it feels like you’re trying to convince me of something about the post? If you feel like convincing a larger audience of the limitations of the memmap2 crate specifically, I suggest writing your own blog posts and/or getting involved with it. :)


I looked into it (as part of optimizing LLM memory throughput) and hugepage file-backed maps aren't supported under linux. Only anonymous mappings.


With mmap you have to be prepared to handle unexpected page fault errors due to a corrupted volume: unlike standard read/write, where one can handle the issue, it can now happen anywhere the mapped memory is accessed - your code, a third-party library, etc.

It gets unwieldy: now you have to add additional tracking of where access is to be expected. Blindly hand the mmap'd area to any code path that does not have such handling, and you will have to deal with these failures.

Maybe that's not the case on Linux/OSX/BSD, but definitely is on Windows where you would have it. Also in C/C++ land you have to handle this using SEH - e.g. `__try/__except` - standard C++ handling won't cut it (I guess in other systems these would be through some signals (?)).

In any case, it might seem like an easy path to glory, yet it is riddled with complications.


On Linux, if you get a SIGBUS from poking a memory map that generally means you'd have certainly gotten -ENOMEM or -EIO during an equivalent sequence of syscalls (or been oom-killed, if you overcommit). Those are treated as fatal in the vast majority of programs, so dying to SIGBUS isn't meaningfully different for most usecases.

By your logic, passing a file descriptor to a library is also "unwieldy", because the library might not handle -EIO.

You can use MAP_POPULATE|MAP_LOCKED to ensure you get an error from mmap() instead of getting killed in the ENOMEM case, if you don't overcommit (if you do, you can still be oom-killed). You still get SIGBUS beyond EOF, but that's the behavior you want: it's equivalent to overrunning a buffer.

The behavior when file size isn't a multiple of PAGE_SIZE is legitimately weird (writes to the final page beyond EOF are visible to the entire system in memory but never written back to the file), but it's intuitive if you understand how the page cache works at a high level, and you can avoid it by making the size page aligned.

For more complex usecases where you really do want to handle these sorts of errors, userfaultfd() gives you all the tools you need: https://www.man7.org/linux/man-pages/man2/userfaultfd.2.html

EDIT: Initially described MAP_POPULATE wrong.
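
(For concreteness, that flag combination looks something like this via raw mmap; a sketch assuming the libc crate, with the caller responsible for the length, bounds checking and eventual munmap.)

    use std::fs::File;
    use std::os::unix::io::AsRawFd;

    // Sketch: MAP_POPULATE pre-faults the whole mapping and MAP_LOCKED pins it,
    // so (per the parent's caveats about overcommit) an out-of-memory failure
    // shows up as an error from mmap() here rather than a kill later.
    fn map_populated(file: &File, len: usize) -> std::io::Result<*const u8> {
        let ptr = unsafe {
            libc::mmap(
                std::ptr::null_mut(),
                len,
                libc::PROT_READ,
                libc::MAP_SHARED | libc::MAP_POPULATE | libc::MAP_LOCKED,
                file.as_raw_fd(),
                0,
            )
        };
        if ptr == libc::MAP_FAILED {
            return Err(std::io::Error::last_os_error());
        }
        Ok(ptr as *const u8)
    }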


Passing a file handle is different - a well-written library would check and handle errors - but a pointer to a buffer is not: I don't expect a library to "try" and "catch" (even if it were possible) such cases. Would strlen do this? No.


See below: this is wrong, the IO errors from the syscalls aren't synchronous. You might successfully call write() and then get -EIO from a sync() call with no indication of what actually failed. At least with fsync() you would know which file it was, but you have no way of knowing what portion of the data you wrote was successfully written back.

The failure is fatal either way, most of the time. It's very rare you actually want to handle -EIO, you just want to tell the user to replace their broken disk and call abort(). SIGBUS is fine, you can easily write a handler to add debugging context to your failure message.


Are you saying that I can somehow get an error during "strlen" after "read()" - is that what you're saying? Because what I'm saying is that with "mmap" you can get that.


Absolutely: if the user page you read into experiences an uncorrectable ECC error in strlen(), you get SIGBUS.

Friends don't let friends write code to attempt recovery from hardware failures.


ECC errors have an orders-of-magnitude lower chance of happening than a broken file system


Depends on who you buy hardware from :)

How about swap then: the user page you read() into might get swapped out, and an I/O error might occur when strlen() faults it back in. I can't remember if you get SEGV or BUS in that case... but it's certainly fatal.


> On Linux, if you get a SIGBUS from poking a memory map that generally means you'd have certainly gotten -ENOMEM or -EIO during an equivalent sequence of syscalls (or been oom-killed, if you overcommit). Those are treated as fatal in the vast majority of programs, so dying to SIGBUS isn't meaningfully different for most usecases.

No. EIO is not instantly fatal in the same way that SIGBUS is. It allows for printing an error message, associating the error with some context of what failed, and either recovering in a degraded state or exiting cleanly. Doing any of this in a SIGBUS handler ranges from unwieldy to impossible.


> It allows for printing an error message, associating the error with some context of what failed.

Nope: -EIO is not synchronous unless you're using O_SYNC. More commonly, you're going to get -EIO from sync() or close(), not write(). All write() does a lot of the time is memcpy() into the pagecache and mark the page as dirty.

Once you've written enough to trigger writeback, write() will return -EIO. But the error is still asynchronous, probably having occurred during the writeback of data you passed to a successful write() call in the past.

A SIGBUS handler that inspects /proc/self/maps and tells you what file the failure occurred in before calling abort() is trivial. That's sufficient for 95% of usecases.
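
(A sketch of such a handler, assuming the libc crate; names are made up. Strictly speaking, std's formatting and file IO aren't async-signal-safe, which is usually tolerated in a crash path that ends in abort().)

    use std::io::Write;

    // Sketch: on SIGBUS, dump the faulting address plus /proc/self/maps so the
    // failing mapping can be identified, then abort.
    extern "C" fn on_sigbus(_sig: libc::c_int, info: *mut libc::siginfo_t, _ctx: *mut libc::c_void) {
        let addr = unsafe { (*info).si_addr() } as usize;
        let mut err = std::io::stderr().lock();
        let _ = writeln!(err, "SIGBUS at {addr:#x}; mappings:");
        if let Ok(maps) = std::fs::read_to_string("/proc/self/maps") {
            let _ = err.write_all(maps.as_bytes());
        }
        unsafe { libc::abort() }
    }

    fn install_sigbus_handler() {
        unsafe {
            let mut sa: libc::sigaction = std::mem::zeroed();
            let handler: extern "C" fn(libc::c_int, *mut libc::siginfo_t, *mut libc::c_void) = on_sigbus;
            sa.sa_sigaction = handler as usize;
            sa.sa_flags = libc::SA_SIGINFO;
            libc::sigaction(libc::SIGBUS, &sa, std::ptr::null_mut());
        }
    }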


You're assuming here that mmap is only used for writing, where TFA is actually describing a read-only scenario, in which case EIO is synchronous as the read can't be completed.

As for the triviality of writing a SIGBUS handler correctly, that is an oversimplification at best. I/O errors aren't always fatal, even in the write case, and handling SIGBUS in the way you describe wouldn't work when e.g. you're also out of file descriptors, or when the cause of SIGBUS isn't due to an I/O error. So what works for 95% of your usecases doesn't necessarily apply to the 95% of other people's usecases.


The point is the same for reads: the vast majority of usecases just immediately abort() when a read fails. Writing byzantine fault logic to deal with broken storage media is like trying to recover from SIGSEGV, it's almost never a good idea.

> I/O errors aren't always fatal, even in the write case

Linux will not return -EIO unless the disk is in an unrecoverable state. Generally the assumption is that userspace will treat -EIO as fatal, so the kernel won't return it unless it's truly hosed. Sometimes the error is specific to a file, but that's the far less common case in practice.

> e.g. you're also out of file descriptors,

ENFILE is easy to deal with in a fatal path, by closing stdin so fd #0 can be reused (you're about to call abort(), you don't need it anymore). Try again :)

> or when the cause of SIGBUS isn't due to an I/O error.

It's either -EIO, or it's I/O beyond EOF. The second thing is a bug equivalent to a buffer overrun. That's synchronous, you can handle it just like you handle SIGSEGV if you want to emit more debugging or even write byzantine recovery logic.


> Generally the assumption is that userspace will treat -EIO as fatal

A single bad disk doesn't make the situation fatal (unless that's the only disk in your system, in which case you're not even guaranteed to have your signal handler code in memory).

> ENFILE is easy to deal with in a fatal path, by closing stdin so fd #0 can be reused

That's assuming you have stdin open, which again, may work for 95% of your usecases, but isn't universal.

> It's either -EIO, or it's writing beyond EOF

That's an unfounded statement. A quick search of the kernel code will show that there are other reasons for getting a SIGBUS, which are unrelated to mmap (non-disk hardware failures, certain CPU exceptions, to name a few). So yeah, if you know that apart from the disk (or filesystem, at any rate) your hardware is in order, and that the only reason for SIGBUS could be a failed I/O through a memory mapped file, and you know that all of the code in your process is well behaved, writing a SIGBUS handler that terminates the process with a message indicating an mmap I/O error might be reasonable, but that's not the reality for every process, and likely not even 95% of processes.

Regardless, my main point wasn't that lack of file descriptors makes your suggestion problematic, but that your description of it as trivial is an oversimplification at best. mmap has its uses (as does writing a SIGBUS handler to deal with errors), but that doesn't mean that it doesn't have issues. Highlighting them doesn't mean that plain read/write are perfect and free from issues either, and certainly code that isn't ready to deal with EIO will have a bad time when a VFS operation fails. But there are cases where making I/O explicit is better, and I'm not sure why you seem to be making blanket statements that trivialise the issues with mmap.


> A single bad disk doesn't make the situation fatal (unless that's the only disk in your system, in which case you're not even guaranteed to have your signal handler code in memory).

Yes it does. Your point about signal handlers is why I'm right, that's beyond the point where you can expect the machine to function in a sane way. Trying to recover is often actively harmful.

> That's assuming you have stdin open, which again, may work for 95% of your usecases, but isn't universal.

If you've hit EMFILE, you absolutely have some FD which you can sacrifice to collect debug info, is my point. If you don't you can reserve one a priori, this isn't that hard to deal with.

> writing a SIGBUS handler that terminates the process with a message indicating an mmap I/O error might be reasonable, but that's not the reality for every process, and likely not even 95% of processes.

You're completely wrong here: you've invented an ambiguity that does not exist. Take a look at the manpage for sigaction(), and you'll see that all the non-I/O cases you mention are independently identifiable via members of the siginfo_t struct passed to your SIGBUS handler (just like the I/O cases).

> but that your description of it as trivial is an oversimplification at best.

I'm not oversimplifying: you're spewing unfounded FUD about the mmap() interface, and I'm telling you that none of these details matter for 95% of usecases.


> Yes it does. Your point about signal handlers is why I'm right

If it's not a single disk system, not necessarily. To give you a concrete example: a process that writes logs to a disk dedicated for log collection can simply ignore an EIO/ENOSPC if logging isn't its main task. It can't easily recover from a SIGBUS in that scenario though.

> If you've hit EMFILE, you absolutely have some FD which you can sacrifice to collect debug info, is my point. If you don't you can reserve one a priori, this isn't that hard to deal with.

I'm not sure why you keep sticking to this example, when I already said that it was just an example of another detail that you need to take into account when implementing a SIGBUS handler. Sure, you can open /proc/self/maps a-priori and side-step the issue, but that's another detail that you need to take into account (and that you didn't mention until I brought it up). I never said that it was hard, only that writing a proper handler that deals with the edge cases isn't as trivial as you claim.

> you've invented an ambiguity that does not exist [...] you'll see that all the non-I/O cases you mention are independently identifiable via members of the siginfo_t struct

I'm not sure what's the ambiguity that you're claiming that I've invented. Yes, some of the specific examples that I gave (specifically CPU exceptions) are identifiable if you already know the details, but not all of them: non-disk faults can still result in SIGBUS with BUS_ADRERR, so that alone isn't enough to identify EIO errors or EOF coming from memory-mapped files, and I know that from personal experience debugging SIGBUS crashes.

> you're spewing unfounded FUD about the mmap() interface

I don't know where this is coming from. I never said that using mmap is bad or that it's impossible to write a SIGBUS handler to output debug info before crashing. I merely pointed out that it's not necessarily trivial, as there are details that should be taken care of, and that it may not in fact be suitable for 95% of usecases as you claimed.

You have a mental model of an ideal system which either can't recover from I/O errors, or doesn't get SIGBUS for reasons other than EIO or reading beyond EOF. I'm trying to tell you that not every system is like that, and that while mmap is useful, there are cases where explicit I/O is better suited for the task, and that your 95% might not be everyone's 95%. If you see FUD in simple facts, then I'm sorry, but I see no point in continuing this discussion.


> If it's not a single disk system, not necessarily.

Again, you miss the point. 95%+ of Linux systems are single disk. That's the expected case.

>> If you've hit EMFILE

> I'm not sure why you keep sticking to this example

You brought this up initially, saying it was difficult to handle. I'm demonstrating that you're wrong, it's actually quite trivial to handle. Handwaving about "edge cases" is FUD; if you have some specific point to make, then make it.

> I'm not sure what's the ambiguity that you're claiming that I've invented.

You claimed it wasn't possible to be sure SIGBUS is from an I/O error. That's wrong.

> non-disk faults can still result in SIGBUS with BUS_ADRERR, so that alone isn't enough to identify EIO errors or EOF coming from memory-mapped files

Wrong. You can resolve that ambiguity from the cited address and si_errno etc. Try it next time.

> I'm trying to tell you that not every system is like that.

The fact you think I need to be told that is amusing. You're completely missing the point.

Let me try one more time:

Don't make things hard when they don't have to be. 95% of the time, they don't have to be. Saying "no, this is actually really hard, and you need to care about these normally irrelevant things" without first acknowledging the simple case is FUD in my book.


> Again, you miss the point. 95%+ of Linux systems are single disk. That's the expected case.

I specifically added ENOSPC as an example that's relevant on single disk systems as well.

Regardless, I thought we were talking about 95% of usecases in relation to implementations, not runtime systems, but even if we're talking about runtime systems, I'm not sure where you're pulling that 95% number from (or why you felt the need to add a plus sign this time around). That may be true for personal computers, but most Linux systems are servers, which generally aren't deployed in a single disk configuration.

> You brought this up initially, saying it was difficult to handle

I didn't say anything about difficulty. I only said that it wasn't trivial as you made it out to be, which isn't the same thing. Also, when I initially brought it up all I said was that in the FD exhaustion case it wouldn't work in the way you described in the comment that I responded to.

> You claimed it wasn't possible to be sure SIGBUS is from an I/O error. That's wrong.

I didn't. All I said in response to your claim of "It's either -EIO, or it's writing beyond EOF" was that there are other reasons for getting a SIGBUS. Moreover, I actually said (in the same paragraph), that if you know that a SIGBUS is caused by an I/O error, and that all of the code in your process is well-behaved (and by that I meant that terminating it with an abort() wouldn't cause side-effects due to e.g. atexit() handlers not running), using mmap with a SIGBUS handler might be reasonable.

> Wrong. You can resolve that ambiguity from the cited address. Try it next time.

First you claimed that I invented an ambiguity that doesn't exist, and that SIGBUS causes can be identifiable if I just read the sigaction(7) manpage. Now you say that there is an ambiguity, but that it can be resolved using the address, so which is it? [0]

I never said that using mmap is impossible, or even hard (and definitely not "this is actually really hard"). I actually agreed that in some cases it might be reasonable to do it with a SIGBUS handler. All I did say was that it isn't trivial to deal with errors, and that the 95% figure might be true for your usecases, but that it doesn't necessarily apply to other people's usecases.

The only one who said that something was "hard" during this discussion was you.

I get it, it's easier to attack the strawman rather than respond to my comments. I'm just not sure why you think it has anything to do with what I said.

[0] EDIT: I now see that you edited the sentence I quoted to say "from the cited address and si_errno etc.". It might surprise you to learn that si_errno is almost never set in Linux (the manpage is actually explicit about it with "si_errno is generally unused in Linux"), and definitely not in mmap-related SIGBUS coming from memory mapped files. I have no idea why you added this remark telling me that I should try it, when you clearly didn't.


> I have no idea why you added this remark telling me that I should try it, when you clearly didn't.

You are hilariously hostile here, I don't get it. si_errno is the second field in the struct after si_signo, saying "si_errno etc." is obviously in reference to the rest of the fields in the structure...


> You are hilariously hostile here, I don't get it.

I apologise if it came out hostile. That was not my intention. I was in a bit of a hurry when I made the edit, and I was just trying to expand my comment in response to your edit, and explain that non-I/O and non-disk SIGBUS errors sometimes look exactly like disk and filesystem errors that return EIO (not just signum being SIGBUS, but also si_code being set to BUS_ADRERR, etc.), so looking at the siginfo_t fields alone wouldn't be enough to disambiguate.

Then there's the address field, which can probably be used in combination with parsing /proc/self/maps, but my point in that comment was that the information on the manpage alone wouldn't have helped people trying to implement a handler correctly.

In any case, I already described a scenario where crashing would be the wrong thing to do IMO, which you seemed to ignore. Even in scenarios where crashing is reasonable, I'm sure there's a solution for every edge case that I would bring up, but I never said that it was impossible, so I'm not sure why asking me to list every possible edge case is relevant when my point was just that there are edge cases, and that you'd need to consider them (and they would be different for different apps), thus making an implementation not trivial. That doesn't mean that it's necessarily difficult, just that it might be a more complex solution when compared to dealing with a failing VFS operation.

As it seems that we've reached an impasse, I'll just say that simplicity depends on the context and is sometimes a matter of personal taste. I don't have anything against mmap, and I was only trying to argue that there's a trade-off, but you are of course free to disagree and use mmap everywhere if that works for you.

I don't think I have anything more to add to what I already said, and I'm sorry again if you felt personally attacked, or that I had something against mmap and trying to spread FUD.


> but most Linux systems are servers, which generally aren't deployed in a single disk configuration.

You are incorrect about that: most Linux servers in the world have one disk. Most servers are not storage servers.

> I didn't say anything about difficulty. I only said that it wasn't trivial as you made it out to be

...and I demonstrated by counterexample that you're wrong, it is trivial. If you think I'm missing some detail, you are free to explain it. You're just handwaving.

> First you claimed that I invented an ambiguity that doesn't exist, and that SIGBUS causes can be identifiable if I just read the sigaction(7) manpage. Now you say that there is an ambiguity, but that it can be resolved using the address, so which is it?

Both, obviously? If you only look at signo there's an "ambiguity", but with the rest of siginfo_t the "ambiguity" ceases to exist. There is no case where you cannot unambiguously handle -EIO in a mmap via SIGBUS.

You claimed that you could only use SIGBUS with mmap if you were sure there were no other sources of SIGBUS. Quoting you directly:

> So yeah, if you know that apart from the disk (or filesystem, at any rate) your hardware is in order, and that the only reason for SIGBUS could be a failed I/O through a memory mapped file, and you know that all of the code in your process is well behaved, writing a SIGBUS handler that terminates the process with a message indicating an mmap I/O error might be reasonable

That statement is completely wrong: you can always tell whether it came from the mmap or something else, by looking at the siginfo_t fields.

> and by that I meant that terminating it with an abort() wouldn't cause side-effects due to e.g. atexit() handlers not running

Any system that breaks if atexit() handlers don't run is fundamentally broken by design. There are a dozen reasons the process can die without running those.

> All I did say was that it isn't trivial to deal with errors

Yes, and that statement is wrong. Most of the time it is trivial, because you just call abort(). There is no possibly simpler error handling than printing a message and calling abort(). For 95% of the workloads running across the world on Linux, that is entirely sufficient.

It is very unusual to try to recover from I/O error, and most programmers who try are really shooting themselves in the foot without realizing it.

You're free to disagree obviously, but I'm directly refuting the points you're making. Calling it a "strawman" makes you look really, really silly.


First, you're free to not use buffered IO. Second, EIO on fsync or close for buffered IO is still adjacent to the relevant file descriptor.


> adjacent to the relevant file descriptor.

So is SIGBUS: you get the address in the handler. You probably have a data structure associating the two things somewhere anyway, and if you don't you can look it up in /proc.


Yes, on POSIX systems you'd get a SIGBUS if the I/O fails or if there's no available physical memory to back the mapping.


It is also the case on Linux/BSD.


While the general point the article is making is correct there are some issues.

- (minor issue) the async example is artificially limited to 1 thread (the article states that). The issue is that comparing 8 OS threads with no async to 1 async thread is fundamentally not very useful as long as you didn't pin all threads to the same physical core. So in general you should compare something async with num_cpus threads vs. num_cpus*X OS threads. Though this wouldn't have been that useful in this example without pinning the tokio async threads to CPUs to forcefully highlight the page issue, and doing that is bothersome so I wouldn't have done so either.

- (bigger issue) The single-threaded async "traditional IO" example is NOT single threaded. Async _file_ IO is anything between not a thing and very bad in most OSes, hence most async engines, including tokio, do file IO in worker threads. This means the "single threaded" conventional IO async example is running 8 threads for reading IO and one to "touch the buffer" (i.e. do hardly anything).

To be clear, the single-threaded example not actually being single threaded isn't discrediting the article - the benchmarks still show the problem - it's just that the 8-threaded conventional and 1-threaded async conventional cases are accidentally basically both 8-threaded.


(Author here)

> The single-threaded async "traditional IO" example is NOT single threaded

The threads backing the single-threaded IO are an implementation detail of fulfilling the `.read().await` calls. The key is that there's a single coordinator thread that's issuing all the work, with the user-space runtime multiplexing tasks on that thread. I thought the fact that the "start a request and come back when it is finished" behaviour happens to be implemented via user-space threads rather than kernel-level epoll (or similar) is unlikely to affect behaviour.

I considered scaling up the number of files and using a multi-threaded runtime, but I felt that'd make everything more complicated without fundamentally changing behaviour.

However, maybe my theory is incorrect, in which case someone else can do their own experiments to provide more concrete information.

(This is referenced in a footnote: https://huonw.github.io/blog/2024/08/async-hazard-mmap/#fn:t... )

> The issue is comparing 8 OS threads no async to 1 thread async is fundamentally not very useful as long as you didn't pin all threads to the same physical core

The point is not to benchmark async vs. non-async, but provide a general reference point for "mmap working well" for comparison. As you suggest you agree with the "minor" issue tag, I don't think the parallelism vs. concurrency distinction matters much here... but again, definitely happy to see some concrete data that suggests otherwise!


yes, none of the issues affect the outcome of the blog post about memory mapped files

> The threads backing the single-threaded IO are an implementation detail of fulfilling the `.read().await` calls.

IMHO it's not just an implementation detail, it's a very relevant design aspect for anything related to blocking and benchmarks. Though yes, it doesn't matter too much for this blog post.


> This is thus a worst case, the impact on real code is unlikely to be quite this severe!

I think the actual worst-case would be to read the pages in a (pseudo-)random order.


(Author here.)

Good point. Do you think the difference will be observable? Will it be observable on an SSD (vs. HDD)?


I think so, yes - only one way to find out, though!
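
(A sketch of the shuffled variant, assuming the memmap2 crate as in the post; the prime-stride walk is just a cheap stand-in for a real shuffle, and the function name is made up.)

    // Sketch: touch pages in a scrambled order instead of sequentially, to
    // defeat kernel readahead. A prime stride is coprime with the page count
    // (for fewer than 1_000_003 pages, i.e. files under ~4 GiB), so every
    // page is still visited exactly once.
    fn touch_shuffled(map: &memmap2::Mmap) -> u64 {
        const PAGE: usize = 4096;
        let pages = map.len() / PAGE;
        let stride = 1_000_003 % pages.max(1);
        let (mut sum, mut idx) = (0u64, 0usize);
        for _ in 0..pages {
            sum += map[idx * PAGE] as u64; // fault the page in
            idx = (idx + stride) % pages;
        }
        sum
    }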


I always thought that one of the use cases of memory mapping was to improve multiprocessing workloads, where a group of processes don't have to duplicate the same region of a working set. In that sense, maybe it's not surprising that single-threaded concurrency can't leverage all of the benefits of memory mapping.


> One possible implementation might be to literally have the operating system allocate a chunk of physical memory and load the file into it, byte by byte, right when mmap is called… but this is slow, and defeats half the magic of memory mapped IO: manipulating files without having to pull them into memory

This doesn't defeat the purpose necessarily. How about for example, implementing a text editor: I want the best performance by loading the existing file initially (say it is <1MB), and the convenience and robustness of any writes to this memory being efficiently written to disk.


I used to really like mmap for a wide range of uses (having noticed its performance in the BLAST DNA/protein search command) but over time I've come to consider it a true expert tool with deep subtlety, like a palantir.


While the author said that C's mmap suffers the same issue, I would argue C's mmap is fine, because C doesn't have async. The issue arises from the mmap crate not having an async read and the confusion around how async works.


I feel like the author used a lot of words to say "mmapped reads are IO (obvious) but critically, they are usually not awaitable like IO (bad)"


No, that misses the point. Async doesn't require an async keyword. Evented programming (which has the same issues) has been common in C for decades.


Function calls are also blocking IO then because executables and libraries are mmapped.


No secret. Reading from memory is synchronous and always has been, at least in a normal computer. (Sometimes I think of how you could fit a fancy memory controller in a transport triggered architecture but that’s something different)


By blocking they mean that it can take ballpark non-volatile storage times instead of ballpark RAM times


I believe they mean that it bypasses the (Tokio) scheduler, so if you use it in async code you lose the main benefit of async code (namely, that the scheduler is able to switch to some other task while waiting for IO to complete). Basically the same behavior you'd get if you called a blocking syscall directly.
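
(Concretely, the hazard looks like this; a sketch assuming tokio and the memmap2 crate, with a made-up function name.)

    // Sketch: every byte read below can page-fault and do disk IO while holding
    // the executor thread, and there's no .await for the runtime to switch at.
    async fn checksum(map: &memmap2::Mmap) -> u64 {
        let mut sum = 0u64;
        for byte in map.iter() {
            sum = sum.wrapping_add(*byte as u64); // may block this worker on disk
        }
        sum
    }

The usual mitigation is to do the touching inside tokio::task::spawn_blocking, or to keep mmap out of async tasks entirely.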


GP is aware. mmap makes files act like memory. Memory is always synchronous, thus blocking, so mmaped files are always blocking. I'm surprised OP even found this surprising. It should be completely obvious.


The term "blocking" has diverged between various communities and it is important to recognize those differences or you'll have dozens of people talking past each other for hundreds of messages as they all say "blocking" and think they mean the same thing, and then get very confused and angry at all the other people who are so obviously wrong (and in their context, they are) but just can't see it.

It is obvious that a given "execution context", which is my generalized term for a thread and an async job and anything else of a similar nature, when it reaches for a value from an mmap'd file will be blocked until it is available. However, different communities have different ideas of an "execution context".

Threaded language users tend to see them as threads, so while a given thread may be blocked the rest of the program can generally proceed. (Although historically the full story around file operations and what other threads can proceed past has been quite complicated.)

Async users on the other hand are surprised here because the operation is blocking their entire executor, even though in principle it ought to be able to proceed with some other context. Because it's invisible to the executor, it isn't able to context-switch.

In this case, the threaded world view is reasonably "obvious" but it can be non-obvious that a given async environment may not be able to task switch and it may freeze an entire executor, and since "one executor" is still a fairly common scenario, the entire OS process.

(I am expressing no opinion about whether it must block an executor. System calls come with a lot of flags nowadays and for all I know there's some way an async executor could "catch" a mapped access and have an opportunity to switch. I am taking the original article's implicit claim that there isn't one happening in their particular environment at face value.)

As long as you do not distinguish how various communities use the term "blocking", you will get very, very deeply nested threads full of arguments about something that, if you just are careful with your terminology, isn't complicated for anyone from any subculture to understand.


At first I thought the title meant the mmap() call itself blocks, which I figured could be slightly surprising. But it seems they're referring to I/O on the mapped file? I'm also baffled, how could it possibly not block?


Well, the OP could probably get their benchmarks to run faster if they passed MAP_POPULATE, which would make the mmap call block for longer.

In a pedantic sense, the mmap call is already blocking, because any system call takes longer than e.g. stuffing an sqe onto a queue and then making 0 syscalls, and it could take a variable amount of time depending on factors beyond that one process's control. I don't think anyone actually needs to offload their non-MAP_POPULATE mmaps to a separate thread or whatever though.


Yes, it couldn't not block, it's obvious... but I've encountered non-trivial amounts of "magical thinking" around both async/await ("go-fast juice") and mmap ("go-fast juice") separately, so the intersection surely has a bunch of magical thinking too, where people haven't taken the time to properly think through what's going on.

Hence, my investigation to try to make the "obvious" conclusion obvious to more people.

(Author here)


One trick is to read the file into memory at application startup. All data is paged in, so it's hot and ready to go: no page faults. In the early 2000's, I worked on a near real time system that used memory mapped I/O. At app startup, several gigabytes were read into memory. It never blocked under normal circumstances (in production) since the systems were provisioned with enough memory.


But that requires knowing your RAM is big enough to fit the file. It can't work in general.


> how could it possibly not block?

When the bytes being read are already in the cache. Hence the later part of the article where the author shows that reading mapped memory can be significantly faster.


It still blocks. It just completes orders of magnitudes faster.


No, if the memory-mapped page you're accessing is in RAM, then you're just reading the RAM; there is no page fault and no syscall and nothing blocks.

You could say that any non-register memory access "blocks" but I feel that's needlessly confusing. Normal async code doesn't "block" in any relevant sense when it accesses the heap.


When dealing with async, I think it is very relevant to think about exactly the points where control can be switched.

As such, a regular memory read is blocking, in that control will not switch while you're doing the read (i.e. you're not doing anything else while it's copying). This is unlike issuing an async read, which is exactly a point where control can switch.

edit: As an example, consider synchronous memory copy vs asynchronous DMA-based memory copy. From the point of view of your thread, the synchronous copying blocks, while with the DMA-based copying the thread can do other stuff while the copying progresses.
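
To make that concrete, a rough illustration (assuming tokio; the file name is a stand-in): the .await points are the only places the executor may switch tasks, while the plain memory copy in between never yields.

    use tokio::io::AsyncReadExt;

    async fn example() -> std::io::Result<()> {
        // Each .await is a potential switch point for the executor.
        let mut file = tokio::fs::File::open("data.bin").await?;
        let mut buf = vec![0u8; 4096];
        file.read_exact(&mut buf).await?;

        // A plain, synchronous memory copy: the task never yields here,
        // no matter how long it takes.
        let mut copy = vec![0u8; 4096];
        copy.copy_from_slice(&buf);
        Ok(())
    }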


So what is the definition of "blocking" here? That it takes more than 1 µs?


As the author, I don't think there's a clear definition of "blocking" in this space, other than some vibes about an async task not switching back to the executor for too long, for some context-dependent definition of "too long".

It's all fuzzy, and my understanding is that what one use-case considers being blocked for too long might be fine for another. For instance, a web server trying to juggle many requests might use async/await for performance and find 0.1ms of blocking too much, whereas a local app that uses async/await for its programming model might be fine with 10ms of "blocking"!

https://tokio.rs/blog/2020-04-preemption#a-note-on-blocking discusses this in more detail.


That the process/thread enters kernel mode and is then suspended waiting for IO or for some other event. As long as the thread is running your code (or is schedulable), it's not blocked. And then the async implementation can ensure your code cooperatively gives up the CPU for other code.


If your memory is paged out and you then access it, using your definition, it would block.

So, in the context of async code, there's no difference from the application's perspective between reading mmap'ed data and reading "regular" data (i.e. memory from the regular paged pool), as both could incur blocking IO.

If you're lucky and the mmap'ed data is in the system cache, then reading that data will not block and is fast. If you're unlucky and your process has been swapped out, then doing a regular memory read will block and is slow.


Do you consider reading from a normal array (one not backed by a memory mapped file) to also be blocking?


If the memory has been paged to disk, I guess so?


In the languages and platforms I use, absolutely yes. Do you have some examples where a normal memory read is async?


Your definition of blocking is a bit different from my own. Synchronous is not always blocking. If the data is there, ready to go, there is no "blocking."

If you consider all memory reads to be "blocking", then everything must be "blocking". The executable code must, after all, be read by the processor. In an extreme case, the entire executable could be paged out to disk! This interpretation is not what most people mean by "blocking."


Fair point. I guess I conflate the two, because what's interesting to me, most of the time, is where the control flow can switch.

I never rely on synchronous IO being non-blocking when writing regular (i.e. non-embedded) code. So whether a read is served from cache (non-blocking) or from disk (blocking) doesn't matter that much: it's synchronous, and that's all I need in order to reason about how it behaves.

If I need it to be non-blocking, e.g. when playing audio from a file, then I need to ensure that via other means (pre-loading the buffer in a background thread, etc.).

edit: And if I really need it not to block, the buffer needs to reside in the non-paged pool. Otherwise it can get swapped to disk.
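
On Linux the rough equivalent is pinning the buffer with mlock; a sketch, assuming the libc crate:

    fn pin_buffer(buf: &[u8]) -> std::io::Result<()> {
        // Ask the kernel to keep these pages resident so they can't be
        // swapped out mid-playback (subject to RLIMIT_MEMLOCK).
        let ret = unsafe { libc::mlock(buf.as_ptr() as *const libc::c_void, buf.len()) };
        if ret == 0 {
            Ok(())
        } else {
            Err(std::io::Error::last_os_error())
        }
    }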


You don't yield to your cooperative-multitasking runtime while reading from it, which is obviously what everyone in this thread means by "blocking", and it's not helpful to start telling them "you're using the word blocking wrong" apropos of nothing.


Why would it yield when reading from local memory? Are there any cooperative environments that do that? Seems like an unusual expectation.


> Do you have some examples where a normal memory read is async?

This hints at a way to make it work, but it would need the compiler (or explicit syntax) to make it clear you want to be able to switch to another task when the page fault triggers the disk read, and then return to a blocking access that resolves the read from memory after the IO part has concluded.

It could look like a memory read but would include a preparation step.
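
In the absence of such syntax, one can approximate that "preparation step" by hand. A hypothetical sketch (assuming the tokio and memmap2 crates): fault the page in on a thread that is allowed to block, then do the actual read on the async task.

    use std::sync::Arc;
    use memmap2::Mmap;

    async fn read_byte(map: Arc<Mmap>, index: usize) -> u8 {
        // Preparation step: touch the page on a blocking-friendly thread so
        // any disk I/O stalls a worker thread, not the async executor.
        let warm = Arc::clone(&map);
        tokio::task::spawn_blocking(move || {
            std::hint::black_box(warm[index]);
        })
        .await
        .expect("warm-up task panicked");

        // Best-effort: the page is very likely resident now, although the
        // kernel could still evict it before this read.
        map[index]
    }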


Ha. Exactly.


Yeah, of course a synchronous call that might block the thread is blocking, I agree... but, if I didn't have the context of "we're in a comment thread about a blog post about mmap", I'm pretty sure I wouldn't flag `x[i]` on `x: &[u8]` (or any other access) as a synchronous call that is worth worrying about.

Hence the discussion of subtlety in https://huonw.github.io/blog/2024/08/async-hazard-mmap/#at-a...

It's obvious when pointed out, but I don't think it's obvious without the context of "there's an mmap nearby".

(Author here, this behaviour wasn't surprising to me, but felt subtle enough to be worth investigating, and get some concrete data.)
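
For what it's worth, a minimal sketch of why it doesn't jump out (names are illustrative): nothing here looks like I/O, but if the slice comes from a memory-mapped file, each new page touched can hide a disk read.

    async fn checksum(data: &[u8]) -> u64 {
        let mut sum = 0u64;
        for &byte in data {
            // If `data` is backed by an mmap'd file, this innocent-looking
            // access can page-fault and stall the whole executor thread.
            sum = sum.wrapping_add(byte as u64);
        }
        sum
    }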


Same thing would happen if your memory has been paged to disk too.


On Linux, you might be able to use userfaultfd to make it async...


I don't see how that would work. The memory access causing the page fault still blocks, but now another thread handles paging in the requested data. So without coordination between those two, nothing really changes. Sounds easier to just use nonblocking reads directly.

Thanks for the pointer to userfaultfd. Didn't know that existed.


Yeah. Part of the problem is that userfaultfd isn't itself quite flexible enough. What you might want is to release the faulted thread to do some other work, letting it know that it can come back later when the data is available, but there's no mechanism to make that happen. Instead it's going to be entirely blocked until the fault can be resolved.


See also the classic "Are You Sure You Want to Use MMAP in Your Database Management System?" [0], which mentions this common pitfall of mmap, and others, in the context of DBMSs.

[0] https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf


- register

- shadow register

- L1

- L2

- L3

- RAM

- GPU/SPU/RSP

- SSD

- Network

- HDD

The line is drawn depending on what you are doing and how.

Edit: moved Network above HDD. :-)


I would put networks above HDDs (depending on how many miles you need to send your emails)


Modern data center networks offer RTTs about 100x lower than hard drive latency, and comparable to local SSD. It depends, of course, on how far over the network you're going, and how fast the other side responds, but <100 µs is very achievable.


water is wet


Nobody automatically knows everything, and we have limited energy for drawing inferences based on what we do know. So the material covered in this post isn't obvious to everyone in its target audience, especially since Rust has had some success in making systems programming more approachable to inexperienced programmers, which is a good thing.


(Author here)

Thank you, you've expressed one of my goals with doing this sort of investigation/getting this data far better than I have. :)

It's something that feels obvious once the dots are connected, but I was pretty sure many people wouldn't connect these dots automatically.



