The only reason I mentioned that spawning threads/creating an async future is bl...

Dylan16807 · on Jan 3, 2022

> And I agree, it would and therefore the definition is potentially meaningless. But pedantically it is blocking (but the functions called within it aren't to the current thread).

If I was going for maximally pedantic but still useful definitions, I'd say that a "[non-]blocking syscall" is a different concept from how you'd describe running functions synchronously or asynchronously. And to elaborate, something like: Code that runs asynchronously is non-blocking, code that runs synchronously can be either blocking or non-blocking, and a syscall always has at least some synchronous code.

I like the idea of saying a syscall is non-blocking if the spec says it returns instantly. But I would add on to that, and say that if "this is not a real-time-OS" is the only reason the spec doesn't say it returns instantly, then we should call that non-blocking too. Or "non-blocking*" with a footnote that mentions RTOS issues.

You ask about getpid() taking a second. I'd say that within the model of "put those RTOS issues aside", that doesn't happen and can't happen. Just like we usually exclude unplugging the computer from our execution model, so too we exclude "linux isn't RTOS" from our execution model. getpid can't get stuck waiting on any resources, and does only trivial computation, so it will return immediately.

_dh54 · on Jan 3, 2022

> I like the idea of saying a syscall is non-blocking if the spec says it returns instantly.

“Instantly” is not a strong enough guarantee to call the syscall non-blocking. The caller needs to know exactly how the callee will perform in terms of run time. Most high-level RTOSes specify this by saying the call will take a constant amount of time, allowing you to measure the call once during testing and use that to estimate future runs.

Words like “fast”, “slow”, and “instantly” are not useful in the domain of building real-time systems at all. It’s about specifying a predictable run time.

Without a spec on the run time of a system call, the only robust assumption is that it may block indefinitely. Assuming a run-time spec for a call where none is given (e.g. close()) will inevitably result in unexpected behavior. Using calls that take unbounded time in a process that has strict time requirements is a recipe for failure. The domain of real-time interactive systems is not the same as the domain of batch processing.

> You ask about getpid() taking a second. I'd say that within the model of "put those RTOS issues aside", that doesn't happen and can't happen. Just like we usually exclude unplugging the computer from our execution model, so too we exclude "linux isn't RTOS" from our execution model. getpid can't get stuck waiting on any resources, and does only trivial computation, so it will return immediately.

This further shows that there is a fundamental misunderstanding in how POSIX systems operate. It’s very possible for getpid() to take longer than one second during normal operation because it’s stuck on a resource, and POSIX allows for that on purpose. Every entry into a system call invokes a litany of bookkeeping tasks by the kernel before returning to user space, with the exception of vDSO calls like gettimeofday(). Please see exit_to_user_mode_loop(), which gets called before every syscall returns to user space, to see all the potential sources of additional latency a call like getpid() may incur: https://github.com/torvalds/linux/blob/c9e6606c7fe92b50a02ce...

Again this is not by accident, this is on purpose. You’ll find a similar loop in all POSIX kernel system call entry/exit code.
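For anyone who wants to look at this on their own machine, here is a rough sketch of how the latency of getpid() could be sampled. The helper names `sample_latencies_ns` and `summarize` are made up for illustration, and going through Python means interpreter overhead dominates the numbers, so treat the output as a distribution of observed timings, not as a bound. An observed maximum only says what happened during this run; it says nothing about what the kernel is allowed to do.

```python
import os
import statistics
import time


def sample_latencies_ns(call, samples=10_000):
    """Time `call` repeatedly and return the per-call latencies in nanoseconds."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        call()
        latencies.append(time.perf_counter_ns() - start)
    return latencies


def summarize(latencies):
    """Return (min, median, max); the max is what a deadline has to survive."""
    return min(latencies), statistics.median(latencies), max(latencies)


if __name__ == "__main__":
    lo, mid, hi = summarize(sample_latencies_ns(os.getpid))
    print(f"getpid(): min={lo} ns  median={mid} ns  max={hi} ns")
```

The gap between the median and the maximum is usually the interesting part: the median reflects the trivial work the call does, while the maximum picks up whatever else happened to the thread during that call.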

Dylan16807 · on Jan 3, 2022

Pretend I said 10 microseconds everywhere I said instantly, then. Same argument, more or less.

Anything that could make getpid take too long is outside the scope of what linux could guarantee.

But inside that scope, it's still worthwhile to distinguish between "blocking" and "nonblocking with very specific exceptions"

> It’s very possible for getpid() to take longer than one second during normal operation being stuck on a resource

What resource? I did my best to look at the implementation, but the source code is complicated and scattered. I can't really process your link by itself. How often are these things causing delays?

"Being rescheduled" is already part of the model of any process, anyway. If a system call doesn't make it any more likely that my process stops compared to the baseline, then I think "nonblocking" is a reasonable term to want to use.
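One possible way to check the "no more likely than the baseline" idea is to count voluntary context switches, i.e. the times the process gave up the CPU because it had to wait. This is only a sketch: `voluntary_switches` is a hypothetical helper, it relies on the `ru_nvcsw` counter from `getrusage`, and that counter covers the whole process, so other threads add noise.

```python
import os
import resource
import time


def voluntary_switches(call, repeats=200):
    """Count voluntary context switches the process makes while running `call`."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_nvcsw
    for _ in range(repeats):
        call()
    after = resource.getrusage(resource.RUSAGE_SELF).ru_nvcsw
    return after - before


if __name__ == "__main__":
    # Baseline: no syscall at all, just the loop overhead.
    print("no-op :", voluntary_switches(lambda: None))
    # A syscall that does not wait on any resource.
    print("getpid:", voluntary_switches(os.getpid))
    # A call whose whole purpose is to wait on a timer.
    print("sleep :", voluntary_switches(lambda: time.sleep(0.001)))
```

Involuntary switches (preemption) are deliberately left out here, since those are the part that applies to every line of code and not just to syscalls.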

_dh54 · on Jan 3, 2022

> What resource? I did my best to look at the implementation, but the source code is complicated and scattered. I can't really process your link by itself. How often are these things causing delays?

A signal handler may need to be invoked, and that could cause paging to disk. The point is that the kernel is allowed to do a non-predictable amount of work on most system calls and therefore you cannot assume getpid() completes in any amount of time. If you’re building a real-time interactive system, then this matters. If you’re building a system that’s allowed to be non-responsive (for running batch processes, network servers) then it doesn’t.

Dylan16807 · on Jan 4, 2022

People are going to keep using non-realtime systems to run soft realtime UIs.

We can't make them stop, so it's still important to distinguish between "this syscall might hit a signal or an interrupt, just like every single line of code in the program" and "this syscall might hit a signal or an interrupt, but also it might get stuck waiting on a resource in a way that couldn't have otherwise happened".

If you want to suggest different terms from "nonblocking" and "blocking" I'm open to change. But in the absence of better terms, I'm going to keep using those, with an asterisk that says I'm inside linux and literally anything could technically block.
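To make the second category concrete, here is a small sketch of the conventional "non-blocking" meaning: a read on an empty pipe that reports "nothing ready" and returns, where the default mode would wait for a writer. `try_read` is a hypothetical helper name; `os.set_blocking` sets O_NONBLOCK on the descriptor. The call can still be preempted or interrupted like any other code, which is the asterisk.

```python
import os


def try_read(fd, size=4096):
    """Read from fd without waiting; return None if no data is ready yet."""
    try:
        return os.read(fd, size)
    except BlockingIOError:
        # EAGAIN/EWOULDBLOCK: the kernel refused to wait on our behalf.
        return None


if __name__ == "__main__":
    read_fd, write_fd = os.pipe()
    os.set_blocking(read_fd, False)  # set O_NONBLOCK on the read end
    print(try_read(read_fd))         # empty pipe: returns instead of waiting
    os.write(write_fd, b"ping")
    print(try_read(read_fd))         # data is available now
    os.close(read_fd)
    os.close(write_fd)
```

Without the `os.set_blocking` line, the first read would wait until someone wrote to the pipe, which is the "stuck waiting on a resource" case.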

_dh54 · on Jan 4, 2022

> People are going to keep using non-realtime systems to run soft realtime UIs.

Very true, and if they want their applications to work well they should write their applications correctly!

Dylan16807 · on Jan 4, 2022

The best way to help them write applications correctly is not to say "all syscalls are blocking, none are nonblocking, no other categories".

_dh54 · on Jan 4, 2022

There are categories, some system calls block on timers, some block on disk io, some block on network io. But they all block, except for gettimeofday() and friends.

Dylan16807 · on Jan 4, 2022

I mean, I wouldn't say gettimeofday is significantly better than getpid, because your thread might switch out anyway. But sure, five categories is fine; I just dislike lumping almost everything together.