What the proposed patch does is defer a specific slow operation (freeing some memory) to an asynchronous context so that close() doesn’t block on it.

The proposed patch isn’t a comprehensive fix; it admits there are still other sources of relatively high close() latency.

So that got me thinking: there is no way to fix this “bug,” because there is no specification for how long close() should take to complete. As far as we are promised in user-land, close() is not an instantaneous operation. close() is a blocking operation! Even worse, it’s an IO operation.

So now I think the bug is in the application. If you want to avoid the latency of close() you should do it asynchronously in another thread. This is similar to the rule that you should not do blocking IO on the main thread in an event-loop based application.
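For concreteness, a minimal sketch of that approach (pthreads; hypothetical helper names; any error from the deferred close() is simply dropped):

    /* compile with -pthread */
    #include <pthread.h>
    #include <stdint.h>
    #include <unistd.h>

    static void *close_worker(void *arg)
    {
        close((int)(intptr_t)arg);   /* result deliberately ignored */
        return NULL;
    }

    static void close_async(int fd)
    {
        pthread_t t;
        if (pthread_create(&t, NULL, close_worker, (void *)(intptr_t)fd) == 0)
            pthread_detach(t);
        else
            close(fd);               /* fall back to a synchronous close */
    }

A real application would more likely hand fds to one long-lived worker via a queue rather than spawn a thread per close, but the effect is the same: the latency is moved off the time-sensitive thread.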




Patch author here.

It is important not to conflate POSIX requirements with expected behavior, especially for device files, which require very specific knowledge of their implementation to use (DRM ioctls and resources, anyone?).

You might think that, since a well-behaved game should not be opening/closing evdev fds during gameplay at all, this is clearly just an application bug. However, games are not the main user of evdev devices, your display server is! This bug causes input device closure during session switching (e.g. VT switching) to take abnormally long - on the machine I discovered the bug on, it ends up adding over a second to the session switch time, significantly impacting responsiveness.

This is absolutely a kernel bug. I did not push the patch further as I had other priorities, and testing this kind of patch is quite time-consuming when it only reproduces in a measurable way on a single physical machine. Other machines end up with a much shorter synchronize_rcu wait and often have many fewer input devices, explaining why the issue was not discovered/fixed earlier.

call_rcu is intended to be used wherever you do not want the writer to block, while alternative fixes involve synchronize_rcu_expedited (very fast but expensive), identifying if the long synchronize_rcu wait is itself a bug that could be fixed (might be correct), or possibly refactoring evdev (which is quite a simple device file).
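For readers unfamiliar with the two APIs, a generic kernel-style sketch of the difference (illustrative names only, not the actual evdev code or patch):

    #include <linux/rculist.h>
    #include <linux/slab.h>

    struct client { struct list_head node; struct rcu_head rcu; };

    /* Blocking form: the closer waits out a full RCU grace period. */
    static void detach_client_sync(struct client *c)
    {
        list_del_rcu(&c->node);
        synchronize_rcu();                   /* may sleep for a long time */
        kfree(c);
    }

    /* Deferred form: the free runs later from an RCU callback. */
    static void client_free_rcu(struct rcu_head *head)
    {
        kfree(container_of(head, struct client, rcu));
    }

    static void detach_client_async(struct client *c)
    {
        list_del_rcu(&c->node);
        call_rcu(&c->rcu, client_free_rcu);  /* returns immediately */
    }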

As for putting things in threads, I would consider it a huge hack to move open/close. Threads are not and will never be mandatory to have great responsiveness.


> As for putting things in threads, I would consider it a huge hack to move open/close. Threads are not and will never be mandatory to have great responsiveness.

The POSIX interface was invented for batch processing. Long running non-interactive jobs. This is why it lacks timing requirements. All well-designed interactive GUI applications do not interact with the file system on their main thread. This is especially true for game display loops. The fundamental problem here is that they are doing unbounded work on a thread that has specific timing requirements (usually 16.6ms per loop). As I’ve said elsewhere, this bug will still manifest itself no matter how fast you make close(); it just depends on how many device files are present on that particular system. It’s a poor design. Well-designed games account for every line of code run in their drawing loop.

> This is absolutely a kernel bug.

I don’t think that is proven unless the original author can chime in. It’s your best guess and opinion that the author intended not to block on synchronize_rcu, but it’s perfectly possible they did indeed intend the code as written. synchronize_rcu is used in plenty of other critical system call paths in similar ways; not every one of those uses is a bug. I would guess you might be suffering from a bit of tunnel vision here, given how the behavior was discovered.

If it is indeed the case that synchronize_rcu is taking up to 50ms I would suspect there is a deeper issue at play on this machine. By search/replacing the call with call_rcu or similar you may just be masking the problem. RCU updates should not be taking that long.


> All well-designed interactive GUI applications do not interact with the file system on their main thread

I strongly disagree. A well-designed interactive GUI application can absolutely interact with the filesystem on its main thread without any impact to responsiveness whatsoever. You only need threads once you need more CPU time.

The POSIX interfaces provide sufficient non-blocking functionality for this to be true, and the (as per the documentation, "brief") blocking allowed by things like open/close is not an issue.

(io_uring is still a nice improvement though.)

> I don’t think that is proven unless the original author can chime in.

This argument is nonsense. Whether or not code is buggy does not depend on whether or not the author comments on the matter. This is especially true for a project as vast as the Linux kernel with its massive number of ever-changing authors.

> If it is indeed the case that synchronize_rcu is taking up to 50ms I would suspect there is a deeper issue at play on this machine. By search/replacing the call with call_rcu or similar you may just be masking the problem. RCU updates should not be taking that long.

synchronize_rcu is designed to block for a significant amount of time, but I did not push the patch further exactly because I would like to dig deeper into the issue rather than making a text-book RCU fix.


> A well-designed interactive GUI application can absolutely interact with the filesystem on its main thread without any impact to responsiveness whatsoever. You only need threads once you need more CPU time.

The "well-designed" argument here is a bit No True Scotsman, and absolutely not true. Consider a lagging NFS mount. Or old hard drives; a disk seek could take milliseconds!

Real time computing isn't about what is normal or average, it's about the worst case. Filesystem IO can block, therefore you must assume it will.


> The "well-designed" argument here is a bit No True Scotsman, and absolutely not true.

This counter-argument can be interpreted as a mere No True Scotsman of "responsiveness", so this is not a very productive line of argument.

Should one be interested in having a discussion like this again, I would suggest strictly establishing what "responsive" means (which is a subjective experience), including defining when a "responsive" application may be "unresponsive" (swapping to disk, no CPU/GPU time, the cat ate the RAM), and evading terms like "well-designed" (I included it in protest of its use in the comment I responded to).

For example, failing to process input or skipping frames in gameplay would be bad, but no one would see a skipped frame in a config menu, and frames cannot even be skipped if there are no frames to be rendered.


> Should one be interested in having a discussion like this again, I would suggest strictly establishing what "responsive" means (which is a subjective experience)

This has been established for years. This is the basis of building real time systems. For example, flight control systems absolutely must be responsive, no exceptions. What does that mean? That the system is guaranteed to respond to an input within a maximum time limit. POSIX applications may generally give the appearance of being responsive but absolutely are not unless specially configured. There is no upper bound on how long any operation will take to complete. This will be apparent the minute your entire system starts to choke because of a misbehaving application. Responsive systems have a hard bound on worst case behavior.


> A well-designed interactive GUI application can absolutely interact with the filesystem on its main thread without any impact to responsiveness whatsoever. You only need threads once you need more CPU time.

Hmm. If you call open()/read()/close() on the main thread and it causes a high latency network operation because that user happens to have their home directory on a network file system like NFS or SMB, your application will appear to hang. When you design applications you can’t just assume your users have the same setup as you.

> The POSIX interfaces provide sufficient non-blocking functionality for this to be true

POSIX file system IO is always blocking, even with O_NONBLOCK. You can use something like io_uring to do non-blocking file system IO, but that would no longer be POSIX.
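For example, a hedged sketch of an asynchronous close using liburing (outside POSIX; Linux >= 5.6; link with -luring; error handling trimmed):

    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        int fd;

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;

        fd = open("/dev/null", O_RDONLY);    /* stand-in for a device fd */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_close(sqe, fd);
        io_uring_submit(&ring);              /* does not wait for completion */

        /* Reap the result whenever convenient, e.g. from the event loop. */
        io_uring_wait_cqe(&ring, &cqe);
        printf("async close returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return 0;
    }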

> Whether or not code is buggy does not depend on whether or not the author comments on the matter.

That would depend on if you knew more about how the code is intended to work than the original author of the code. Do you presume to know more about how this code is intended to work than the original author?


> That would depend on if you knew more about how the code is intended to work than the original author of the code. Do you presume to know more about how this code is intended to work than the original author?

I am not sure if you are suggesting that only the author can know how code is supposed to work, that finding bugs requires an understanding of the code strictly superior to the author's, or that the author is infallible and intended every behavior of the current operation.

Either way, this attitude would not have made for a healthy open source contribution environment.


> that finding bugs requires an understanding of the code strictly superior to the author's,

Evaluating whether or not something is a bug in a specific part of a system absolutely requires an understanding of the code's intent equal to the author's. You have found undesirable application-level behavior and have attributed the cause to a specific line of code in the kernel, but it’s possible you are missing the bigger picture of how everything is intended to work. Just because the latency has been tracked down to that line of code does not mean the root source of that latency is that line of code. Symptoms vs. root causes.


close() is typically a blocking operation. But when it happens in devfs, procfs, tmpfs, or some other ram only filesystem I expect it to be fast unless documented otherwise.


Especially when you are in devfs you should not assume anything at all! Close in devfs is just a function pointer which is overridden by each of the myriad device drivers that expose files in /dev. Your close() could be the final one which lets the driver perform some cleanup. It might decide to borrow your thread to do it. Maybe some device was about to be ejected/disabled but could not previously because you were holding an FD to it.

The same goes for /proc and /sys which are very similar to /dev in that they represent various entry points into the kernel.
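To illustrate that point, a hedged kernel-side sketch of a generic character driver (made-up names, not evdev's actual code): the final userspace close() lands in the driver's .release hook, which runs whatever cleanup the driver wants on the closing thread.

    #include <linux/fs.h>
    #include <linux/module.h>

    static int mydev_release(struct inode *inode, struct file *file)
    {
        /* driver-specific teardown runs here, on the thread that closed */
        return 0;
    }

    static const struct file_operations mydev_fops = {
        .owner   = THIS_MODULE,
        .release = mydev_release,
        /* .open, .read, .unlocked_ioctl, ... */
    };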


It can be slow every time if your AV software hooks close() to do its expensive scan operation, like Windows Defender does.


> I expect it to be fast unless documented otherwise.

Logically you should expect it to block indefinitely unless documented otherwise. The exception would be completing within a time bound, the rule is blocking indefinitely.


> Logically you should expect it to block indefinitely

Frankly, that’s completely insane. It should block if and only if there is actual io in flight which could produce a failure return that an application needs. Syscalls should be fast unless there is a very good reason not to be.


> It should block if and only if there is actual io in flight which could produce a failure return that an application needs.

Blocking simply means that the specification does not guarantee an upper bound on the completion time. There is no other meaningful definition. POSIX is not an RTOS therefore nearly all system calls block. The alternative is that the specification guarantees an upper bound on completion time. In that case what is an acceptable upper bound for close() to complete in? 1ms? 10ms? 100ms? Any answer diminishes the versatility of the POSIX VFS.

> Syscalls should be fast unless there is a very good reason not to be.

I think this is an instance of confusing what should be with what is. We’ve been through this before with O_PONIES. The reality is that system calls aren’t “fast” and they can’t portably or dynamically be guaranteed to be fast. So far the only exception to this is gettimeofday() and friends.

Robust systems aren’t built on undocumented assumptions. Again, POSIX is not an RTOS. Anything you build that assumes a deterministic upper bound to a blocking system call execution time will inevitably break, evidenced by OP.


> The reality is that system calls aren’t “fast” and they can’t portably or dynamically be guaranteed to be fast.

Perhaps, but the reality is also that the vast majority of games and other interactive applications routinely make blocking system calls in a tight main loop and expect these calls to take an unspecified but reasonable amount of time.

“It’s a blocking syscall so if it takes 1s to close a file, that’s technically not a bug” is correct, but is any player of “Papers, Please” going to be sympathetic to that explanation? Probably not; they’ll think “Linux is slow,” “Linux is buggy,” “why can’t Linux run basic applications correctly that I have no problem running on Windows or OS X?,” etc.

“Syscalls should be fast unless there is a very good reason not to be” strikes me as a wise operating principle, which weights usability and usefulness of the operating system alongside being technically correct.


> “It’s a blocking syscall so if it takes 1s to close a file, that’s technically not a bug” is correct, but is any player of “Papers, Please” going to be sympathetic to that explanation? Probably not; they’ll think “Linux is slow,” “Linux is buggy,” “why can’t Linux run basic applications correctly that I have no problem running on Windows or OS X?,” etc.

I don’t agree with this logic. Windows and macOS system calls also block. The issue of people considering Linux to be slow is not relevant to the fact that its system calls block. The poorer quality of Linux games, and commercial Linux software in general, is more likely due to the smaller market size / profit opportunity and the consequent lack of effort / investment in the Linux desktop/gaming ecosystem.

Now, if your argument is that we should work around buggy applications and distribute hacked patches when the developers have abandoned them, for the sake of improving user experience, then I agree with that.

> “Syscalls should be fast unless there is a very good reason not to be” strikes me as a wise operating principle, which weights usability and usefulness of the operating system alongside being technically correct.

Linux already operates by this principle. We are examining a situation where best effort was not good enough to hide poor application design.


> Linux already operates by this principle. We are examining a situation where best effort was not good enough to hide poor application design.

Linux has this principle as a goal, but it's probably not checked often.

I would say this code fails the principle, independent of particular application problems.


> I would say this code fails the principle, independent of particular application problems.

For every system call you determine satisfies that principle, I could come up with an application-level algorithm that is broken because of it. The principle is aspirational; Linux makes a best effort, as all Unix systems do, not because Linux is buggy but because the principle can never be 100% met given the spec. The core issue here was not close() taking 100ms or whatever it took; the core issue was doing unbounded work on the main drawing thread, which has strict timing requirements.


They're both problems.

This slowness is approaching the point where even checking for joysticks on a dedicated thread would start having delay problems. And spawning a thread per file would be ridiculous and would get even more scorn if it was slow: "why are you spawning so many threads, of course that's not efficient".


> This slowness is approaching the point where even checking for joysticks on a dedicated thread would start having delay problems.

Poorly designed code will perform poorly. Well designed code won’t have delay problems.

> And spawning a thread per file would be ridiculous and would get even more scorn if it was slow,

Where in this entire thread was it suggested to spawn a thread per file? Threads are able to perform more than a single unit of work.


> Poorly designed code will perform poorly. Well designed code won’t have delay problems.

If I need to open and close 20 files every few seconds, and they all might have unpredictable latencies, even the best designed code in the world could have delay problems.

> Where in this entire thread was it suggested to spawn a thread per file?

You just implied that checking all the files on a dedicated thread is still 'poorly designed code', didn't you?

So if a dedicated thread for the whole group of files isn't enough, sounds like you need to move to a thread per file. Unless it's wrong to use close() at all, or something? You can only blame the code so much.


> Blocking simply means that the specification does not guarantee an upper bound on the completion time.

I don't think that's a commonly-accepted (or useful) definition of "blocking." By that definition, getpid(2) is blocking.

> I think this is an instance of confusing what should be with what is.

Who is doing the confusing? I said "should be." Are you saying they're fast now but should be slow? Why?

> The reality is that system calls aren’t “fast” and they can’t portably or dynamically be guaranteed to be fast.

This isn't a portable program; it's a Linux program. The problem isn't that close can't be portably guaranteed to complete in some time bound; it's that Linux is adding what is essentially an extra usleep(100000), with very high probability, for the devfs synthetic filesystem in Linux.

This is entirely an own-goal; Linux has historically explicitly aimed to complete system calls quickly, when that does not break other functionality. It is a bug that can be fixed, e.g., with the proposed patch(es).

POSIX does not mandate that close blocks on anything other than removing the index from the fd table -- it's even allowed to leave associated IO in-flight and silently ignore errors. It makes little sense for a synthetic filesystem without real IO to block close so grossly.


CyberRabbi's definition of blocking is correct and what I've always seen commonly accepted.

Blocking means you don't know how long it'll take, and you want to wait for it to finish. The only safe assumption is that you cannot guarantee how long it'll take.

getpid is accurately therefore a blocking call. You don't know how long it'll take. You can profile and make best guesses, but you can never assuredly say how long it'll take.


Every operation in a non-RTOS is blocking by this definition, even local function calls that don’t enter the kernel, because the kernel may switch to another thread at any time. It’s utterly useless as a definition. Much more common is to divide system calls into ones that depend on some external actor and those that don’t. Eg, recv() on a socket, blocking on a futex held by some other process, or waiting on IO to some disk controller. Getpid() is synchronous but does not block.


Blocking in that sense is usually used in relation to some event. E.g. sleep() blocks on a timer, read() blocks on IO, etc.

In the general sense, it means that the call has an indefinite run time. E.g. “this call blocks” = “this call could take an arbitrarily long amount of time”

getpid() is blocking but it likely does not block on IO (though it could as that is allowed by the spec).


If you call getpid, or even local functions, can the rest of your code (in a single thread) continue before getpid returns?

E.g. if you do this inside a function (useless code):

    int pid = getpid();
    std::cout << pid + 2 << std::endl;

Will the output print even if the hypothetical call to getpid takes a second?

If the answer is the print will wait, then it's a blocking call.

If it was an async call, then it could happen concurrently or in parallel, and unless you waited, it would continue on in a non blocking fashion.

Waiting for a return == blocking. It may be quick but unless the spec specifies that it must be synchronous+non-blocking, the distinction between the two is moot.


But with such an extreme definition, can you even show me what an async non-blocking syscall would look like?

Because I'm going to point at the assembly instructions that pass the parameters, and say "an interrupt happens here, delaying it for 1 second".

Any definition of blocking that includes "int fifty() {return 50;}" strikes me as having problems.

More specifically, I'd say there's some amount of "kernel does a thing" that needs to be excusable when you're talking about whether a syscall is blocking or not, otherwise everything is blocking.

Unless we want to say that 'nonblocking' is fake on non-RTOS systems, and not even try to define the term in that context.


There are two points that I've made a couple times that are perhaps getting lost:

1. It's about blocking your logic flow, not about how the system is actually executing it or what the machine code resolves to. If a subsequent call is blocked on a previous one, then it's blocking. Spawning an async function or creating a new thread etc can be blocking, whereas what runs on it isn't (for your current thread).

2. Being blocking or not is independent of performance. A blocking function call can be near instant, it may get inlined, it may take a year to run. Similarly an async or non blocking call can also have the same time complexity. The issue is that if the spec doesn't say it returns instantly, or you don't know for sure that it does, you can't guarantee that the blocking time will be short enough to be acceptable. So while getpid or close will almost always return instantly, it's still blocking. And if the spec doesn't say it's guaranteed, then the performance acceptability in the hot path can change.

End of the day it's all just (often pedantic) semantics to let people describe the execution nature of things so devs can make the best decisions for their performance needs.


I think you replied before I added 'Unless we want to say that 'nonblocking' is fake on non-RTOS systems, and not even try to define the term in that context. "

Sure, the spec doesn't give a guarantee. But let's say it's impossible to give a guarantee on Linux. Is it really the best option to give up on defining 'nonblocking' entirely? Maybe we should formulate guarantees with an escape hatch for non-RTOS hazards. If we can do that, then getpid deserves one of those conditional guarantees.

And since I'm pretty sure the intent of mentioning getpid was to talk about the code, not the documentation, I think that would make it nonblocking.

> End of the day it's all just (often pedantic) semantics to let people describe the execution nature of things so devs can make the best decisions for their performance needs.

Which is why you don't want to label everything blocking. Nobody can have a useful discussion then.

And also why it's useful to talk about the execution nature of code, even when no spec exists. You don't want to get stuck on implementation details but you shouldn't ignore implementation either.

Edit:

> Spawning an async function or creating a new thread etc can be blocking, whereas what runs on it isn't (for your current thread).

There's some value in talking about functions that way, but for a syscall in particular you need a nonblocking spawn for the syscall to be nonblocking. If that's definitionally impossible, then something bad has happened to the definitions being used.


The only reason I mentioned that spawning threads/creating an async future is blocking is because you had mentioned that async would generate blocking assembly by my definition.

And I agree, it would and therefore the definition is potentially meaningless. But pedantically it is blocking (but the functions called within it aren't to the current thread).

In a colloquial every day sense, I'd not be this pedantic. but this is a thread specifically about that pedantry.

End of the day, if I were talking colloquially, I'd only talk about expensive blocking calls as being blocking, regardless of IO when responsiveness is important. Otherwise it doesn't matter unless it's parallelizable and there are performance gains to be had.


> And I agree, it would and therefore the definition is potentially meaningless. But pedantically it is blocking (but the functions called within it aren't to the current thread).

If I was going for maximally pedantic but still useful definitions, I'd say that a "[non-]blocking syscall" is a different concept from how you'd describe running functions synchronously or asynchronously. And to elaborate, something like: Code that runs asynchronously is non-blocking, code that runs synchronously can be either blocking or non-blocking, and a syscall always has at least some synchronous code.

I like the idea of saying a syscall is non-blocking if the spec says it returns instantly. But I would add on to that, and say that if "this is not a real-time-OS" is the only reason the spec doesn't say it returns instantly, then we should call that non-blocking too. Or "non-blocking*" with a footnote that mentions RTOS issues.

You ask about getpid() taking a second. I'd say that within the model of "put those RTOS issues aside", that doesn't happen and can't happen. Just like we usually exclude unplugging the computer from our execution model, so too we exclude "linux isn't RTOS" from our execution model. getpid can't get stuck waiting on any resources, and does only trivial computation, so it will return immediately.


> I like the idea of saying a syscall is non-blocking if the spec says it returns instantly.

”instantly” is not a strong enough guarantee to call the syscall non-blocking. The caller needs to know exactly how the callee will perform in terms of run time. Most high level RTOSes spec this as saying the call will take a constant amount of time, allowing you to measure the call once during your testing and using that to estimate future runs.

Words like “fast” “slow” “instantly” are not useful in the domain of building real time systems at all. It’s about specifying a predictable run time.

Without providing any spec on the runtime of a system call, the only robust assumption is to assume it blocks indefinitely. When you assume a run time spec for a call where one is not spec’d (e.g. close()) that will inevitably result in unexpected behavior. Using calls that take unbounded time in a process that has strict time requirements is a recipe for failure. The domain of real-time interactive systems is not the same as the domain of batch processing.

> You ask about getpid() taking a second. I'd say that within the model of "put those RTOS issues aside", that doesn't happen and can't happen. Just like we usually exclude unplugging the computer from our execution model, so too we exclude "linux isn't RTOS" from our execution model. getpid can't get stuck waiting on any resources, and does only trivial computation, so it will return immediately.

This further shows that there is a fundamental misunderstanding in how POSIX systems operate. It’s very possible for getpid() to take longer than one second during normal operation because it’s stuck on a resource, and POSIX allows for that on purpose. Every entry into a system call invokes a litany of bookkeeping tasks by the kernel before returning to user space, with the exception of VDSO calls like gettimeofday(). Please see exit_to_user_mode_loop(), which gets called before every syscall returns to user space, to see all the potential sources of additional latency a call like getpid() may incur: https://github.com/torvalds/linux/blob/c9e6606c7fe92b50a02ce...

Again this is not by accident, this is on purpose. You’ll find a similar loop in all POSIX kernel system call entry/exit code.


Pretend I said 10 microseconds everywhere I said instantly, then. Same argument, more or less.

Anything that could make getpid take too long is outside the scope of what linux could guarantee.

But inside that scope, it's still worthwhile to distinguish between "blocking" and "nonblocking with very specific exceptions"

> It’s very possible for getpid() to take longer than one second during normal operation being stuck on a resource

What resource? I did my best to look at the implementation, but the source code is complicated and scattered. I can't really process your link by itself. How often are these things causing delays?

"Being rescheduled" is already part of the model of any process, anyway. If a system call doesn't make it any more likely that my process stops compared to the baseline, then I think "nonblocking" is a reasonable term to want to use.


> What resource? I did my best to look at the implementation, but the source code is complicated and scattered. I can't really process your link by itself. How often are these things causing delays?

A signal may need to be invoked and that could cause paging to disk. The point is that the kernel is allowed to do a non-predictable amount of work on most system calls and therefore you cannot assume getpid() completes in any amount of time. If you’re building a real time interactive system, then this matters. If you’re building a system that’s allowed to be non-responsive (for running batch processes, network servers) then it doesn’t.


People are going to keep using non-realtime systems to run soft realtime UIs.

We can't make them stop, so it's still important to distinguish between "this syscall might hit a signal or an interrupt, just like every single line of code in the program" and "this syscall might hit a signal or an interrupt, but also it might get stuck waiting on a resource in a way that couldn't have otherwise happened".

If you want to suggest different terms from "nonblocking" and "blocking" I'm open to change. But in the absence of better terms, I'm going to keep using those, with an asterisk that says I'm inside linux and literally anything could technically block.


> People are going to keep using non-realtime systems to run soft realtime UIs.

Very true, and if they want their applications to work well they should write their applications correctly!


The best way to help them write applications correctly is not to say "all syscalls are blocking, none are nonblocking, no other categories".


There are categories: some system calls block on timers, some block on disk IO, some block on network IO. But they all block, except for gettimeofday() and friends.


I mean I wouldn't say gettimeofday is significantly better than getpid because your thread might switch out anyway. But sure five categories is fine, I just dislike lumping almost everything together.


I'd say that the commonly accepted definition for a blocking call is one that may depend on I/O to complete, releasing control of the CPU core while waiting.

By that definition, getpid() is definitely nonblocking, though it doesn't have an upper bound in execution time. POSIX does not offer hard realtime guarantees.

close() in general would probably be blocking (as a filesystem may need to do I/O), but I'd expect it to behave nonblocking in most cases, especially when operating on virtual files opened read-only. Unfortunately, I don't think those kinds of behavioral details are documented.


A function that sleeps for 5 seconds is blocking. No IO involved.

Blocking just means that you're blocking your current code till you return out of the called function.

Anything else regarding a function call is an assumption unless you know the exact implementation.


> I don't think that's a commonly-accepted (or useful) definition of "blocking." By that definition, getpid(2) is blocking.

When it comes to expecting a specific duration, getpid() is blocking. If you run getpid() in a tight loop and then have performance issues you can’t reasonably blame the system.

> This isn't a portable program; it's a Linux program

But the interface is a portable interface

> POSIX does not mandate that close blocks on anything other than removing the index from the fd table

And what if the fd-table is a very large hash table with high collision rate? How do you then specify how quickly close() should complete? 1ms/open fd? 10ms/open fd? Etc.

It should be clear that the problem here is that the author of the code had a faulty understanding of the system in which their code runs. Today the issue was close() just happened to be too “slow.” If the amount of input devices were higher, let’s say 2x more, then the same issue would have manifested even if close() were 2x “faster.” No matter how fast you make close() there is a situation in which this issue would manifest itself. I.e. the application has a design flaw.


> Today the issue was close() just happened to be too “slow.” If the amount of input devices were higher, let’s say 2x more, then the same issue would have manifested even if close() were 2x “faster.” No matter how fast you make close() there is a situation in which this issue would manifest itself.

Close, on an fd for which no asynchronous IO has occurred, should be 10000x faster, or more. It’s unlikely a user will have even 100 real input devices. I agree the algorithm leaves something to be desired, but the only reason it is user-visible is the performance bug in Linux.

I’ve worked on performance in both userspace and the kernel and I think you’re fundamentally way off-base in a way we’ll never reconcile.


> I agree the algorithm leaves something to be desired, but the only reason it is user-visible is the performance bug in Linux.

The only reason it wasn’t user-visible was luck. Robust applications don’t depend on luck.

Something tells me you’ll think twice before calling close() in a time-sensitive context in your future performance engineering endeavors. That’s because both you and I now know that no implementation of POSIX makes any guarantee on the runtime of close() nor will likely do so in the future. That’s just reality kicking in. Welcome to the club :)


There's no guarantee for the runtime of any function. It's perfectly valid for the OS to swap your program instructions to disk, and then take seconds or even minutes to load it back.

It's effectively impossible to avoid depending on what you call "luck". The OS does not provide nearly enough guarantees to build useful interactive applications without also depending on other reasonable performance expectations.


> It's perfectly valid for the OS to swap your program instructions to disk, and then take seconds or even minutes to load it back.

It’s not valid to swap your program instructions to disk if you call mlock() on your executable pages. Indeed, performance sensitive applications do just that. https://man7.org/linux/man-pages/man2/mlock.2.html
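A minimal sketch of that approach (mlockall(2) pins current and future mappings, including the text segment; it needs CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK):

    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");    /* e.g. ENOMEM if the rlimit is too low */
            return 1;
        }
        /* ... latency-sensitive work, with no page-ins of locked pages ... */
        return 0;
    }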

> It's effectively impossible to avoid depending on what you call "luck". The OS does not provide nearly enough guarantees to build useful interactive applications without also depending on other reasonable performance expectations.

This is all self-evidently false. You likely wrote your comment on a POSIX-based interactive application. It just takes knowledge of how the system works and what the specifications are. Well-designed programs are hard to come by but they do exist.


Does mlock itself have a guaranteed maximum execution time? Is it guaranteed to return success under the relevant conditions? While that is an excellent way to address the problem I mentioned, you still have to depend on more than just the guaranteed behaviour of the OS.

> You likely wrote your comment on a POSIX-based interactive application. It just takes knowledge of how the system works and what the specifications are.

I wrote my comment on an interactive POSIX application, yes, but I believe my browser depends on "reasonable performance" of OS-provided functions in order to be usable.

It would be a fun exercise to evaluate such a program that supposedly did not. For any given program, I suspect I could patch the Linux kernel in such a way that the kernel still fulfilled all guaranteed behaviour while still making the program unusable.


I agree the application should not have done this. On the other hand, I also agree that an indefinite block time, while correct in theory, is not a useful definition; perhaps a more pragmatic one would be some percentile per time/compute unit, so that a consistent 100ms close() call which is proven to be a bug won't get lost in definitional debate.


The machine is not running POSIX, it's running Linux which is POSIX-ey, and an RTOS does not guarantee that system calls do not block. The insistence on only referring to POSIX was what caused the O_PONIES debate in the first place.

If one assumes that "there is no upper bound on the completion time", then that also means assuming that a poll/read/write may never return within the lifetime of the machine, as it could block for that long (maybe you're using this computer: https://www.youtube.com/watch?v=nm0POwEtiqE), and so it would be impossible to implement a functioning, responsive application, much less a game.

In the real-world you need to make slightly more reasonable assumptions. And, again, when interacting with device files you must refer to the kernel documentation rather than POSIX, as POSIX does not describe how these files work in any meaningful way or form.


> poll/read/write

The “non-blocking” nature of those calls was invented for network servers, not for video games. Not only is jitter tolerable there, but high latency is allowed from the lowest layers of the stack. It’s not uncommon to simply get no response from a network request.

A video game should never ever do arbitrary system calls on its main drawing thread unless those system calls are specifically intended for that use case. Jitter is not tolerable in this use case since the timing requirements are so strict. The code must produce a frame every 16.6ms, no exceptions. The interface must never become unresponsive.

> RTOS does not guarantee that system calls do not block

RTOSes do indeed provide upper bounds for all calls.

> And, again, when interacting with device files you must refer to the kernel documentation rather than POSIX

Yes that would be a relevant point if it were the case that the kernel documentation for these devices specified that close() should complete within some time bound.


Very similar to people using node.getenv in hot sections of code and then not understanding what's happening.

https://github.com/nodejs/node/issues/3104

When you call out to the system or libc, things are going to happen, and you should try to be aware of what those are.


Sorry... what? Why the hell was an application using env() to carry application state?!

The environment list is created at init; it's literally placed right behind the C argument list as an array -- AUXV, if you want to go read the ABI Specification for it.

Therefore, anything you grab using getenv() can be considered static (barring use of setenv), so the proper and correct thing to do is shove the things you need into a variable at init. Unless you yourself are editing it -- but even then you should use a variable, because variables are typed and getenv is not (think of storing port information, where you have to format it into a string to get it into the environment and then parse it back out again). For things like $HOME, those only ever change once, and you should really have a list of those that you check, because you will also want to check XDG_HOME_DIR and a few other areas. So you will want those in a list anyway; might as well build it at creation time when the data is fresh.
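A minimal sketch of that grab-it-at-init approach (hypothetical struct, variable names, and defaults):

    #include <stdlib.h>

    struct app_config {
        const char *home;   /* kept as a string */
        int port;           /* parsed once, stored typed */
    };

    static struct app_config cfg;

    static void config_init(void)
    {
        const char *home = getenv("HOME");
        const char *port = getenv("PORT");
        cfg.home = home ? home : "/";
        cfg.port = port ? atoi(port) : 8080;
    }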

Anything you set with setenv() only alters your own environment state, and that will carry down to newly created children at creation time. So the only reason I can think of why anyone would do this would be to communicate data to child processes. Except there are so, so many better and non-stringly-typed ways to do this, including global variables. Child processes inherit copies(?) of their parent's state, so you can just use that, so there is literally NO reason ever to do this.


… unless you intend to exec after forking


Sure, but just use execve with an explicit envp and it's a damn sight safer, because then you know exactly the state of your child's environment. You can see this in the CERT C coding guidelines: https://wiki.sei.cmu.edu/confluence/display/c/ENV03-C.+Sanit...

ENV02-C comes into effect as well, if your program is invoked with

    SOME_INTERNAL_VARIABLE=1 PORT=2000 ./prog
then you try to invoke your child with:

    setenv("SOME_INTERNAL_VARIABLE", "2", 1);
    (fork blah blah)


u/CyberRabbi is absolutely correct. It's true that for _some_ kinds of devices you could expect fast close(2) IF the device documents that. But as you can see, implementing this can be hard even for devices where you'd think close(2) has to be fast. Even a tmpfs might have trouble making close(2) fast due to concurrency issues.

The correct thing to do when you don't care about the result of close(2) is to call it in a worker thread. Ideally there would be async system calls for everything including closing open resources. Ideally there would be only async system calls except for a "wait for next event on this(these) handle(s)" system call.


Or, io_uring the thing. One could probably wrap close() with LD_PRELOAD and not touch the binary...


While tempting, you can’t generally fix this by simply patching close() with some function that converts it to an unchecked asynchronous operation. If that were the case, you could just do that in the kernel. close() is expected to complete synchronously. This matters because POSIX guarantees that open()/pipe() etc. will return the lowest file descriptor not in use[1]. I.e. this should work:

    close(0);
    fd = open("/foo/bar", ...);
    // fd is guaranteed to be 0
If you made close() just dispatch an asynchronous operation and not wait on the result, then the code above would break. Any code that uses dup() likely has code that expects close() to behave that way.

The other issue is that close() can return errors. Most applications ignore close errors but to be a robust solution you’d need to ensure the target application ignores those errors as well.

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/V...


I did not know this, and for some reason it really annoys me. Why are our process contexts littered with useless little synchronous properties? How many other tedious and slow bookkeeping tasks does the OS have to do just to meet some outdated spec that was probably just an ossified implementation detail in the first place? I feel compelled to make it so that new fds are explicitly randomized just so you can't do this, like how Go randomizes map iteration order.


This argument does not make sense - the kernel already needs to track per-process file descriptors. It just looks for the first hole instead of giving the "next" value.

Go's random map iteration does not apply here. Not only is this not an iterable map, the kernel has no problem providing this insertion guarantee so adding additional costly randomization has no benefit and just burns additional cycles.

Go would also be better off without, but they are catering to a different audience and different degree of specification, and apparently need to actively deter developers from ignoring documentation.


The correct term for this is not "developers ignoring documentation" it's "ossification" or Hyrum's Law:

    With a sufficient number of users of an API,
    it does not matter what you promise in the contract:
    all observable behaviors of your system
    will be depended on by somebody.

I guess that we got this "lowest available" rule because that's what the first implementation happened to do (it's the obvious thing to do if you have a single core), then someone 'clever' noticed that they could save 3 cycles by hard coding and reusing the fd in their IO-bound loop, and anyone that tried to implement fd allocation differently was instantly met by "your OS breaks my app", and thus the first implementation was permanently ossified in stone. To be clear I'm not making any historical claims and this is pure speculation.

"Stupid developers should have rtfm humph" is not a useful position because it ignores this behavior ossification.

The Go map example is actually very relevant, it's an "anti-ossification" feature that makes the behavior match the spec. If the spec says iteration order is not guaranteed, but in practice people can rely on it being the same in some specific situation (say, in a unit test on a particular version of Go) then the spec is ignored and it breaks people's programs when the situation changes (e.g. Go version updates). This actually happened. Instead of giving in and ossifying the first implementation's details into the spec, Go chose the only other approach: Make the behavior match the spec: "iteration order is not guaranteed" == "iteration order is explicitly randomized". (They do it pretty efficiently actually.)


As mentioned elsewhere, the file descriptor table is an array and a bitmask - finding the next fd is a matter of finding the first unset bit, which is extremely efficient. And that's before we ignore that the file descriptor table is read-heavy, not write-heavy.
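To make the "first unset bit" point concrete, a toy user-space sketch of lowest-free-slot allocation over a bitmap (uses the GCC/Clang __builtin_ctzll builtin; illustrative only, not the kernel's actual implementation):

    #include <stdint.h>
    #include <stdio.h>

    #define MAX_SLOTS 256
    static uint64_t used[MAX_SLOTS / 64];              /* bit set == slot in use */

    static int alloc_lowest_slot(void)
    {
        for (int w = 0; w < MAX_SLOTS / 64; w++) {
            if (~used[w]) {                            /* some bit is still clear */
                int bit = __builtin_ctzll(~used[w]);   /* first clear bit */
                used[w] |= 1ULL << bit;
                return w * 64 + bit;
            }
        }
        return -1;                                     /* table full */
    }

    static void free_slot(int slot)
    {
        used[slot / 64] &= ~(1ULL << (slot % 64));
    }

    int main(void)
    {
        printf("%d %d %d\n", alloc_lowest_slot(), alloc_lowest_slot(),
               alloc_lowest_slot());                   /* prints: 0 1 2 */
        free_slot(1);
        printf("%d\n", alloc_lowest_slot());           /* prints: 1, the lowest hole */
        return 0;
    }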

Should you want to have per-process file descriptor tables, you can do just that: Just create a process without CLONE_FILES. You can still maintain other thread-like behaviors if you want. I doubt you'll ever sit with a profile that shows fd allocation as main culprit however.

> If the spec says iteration order is not guaranteed, but in practice people can rely on it being the same in some specific situation ... This actually happened.

If Hyrum's law held, the API would already be "ossified" at this point.

Instead, the Go developers decided to make a statement: "The language spec rather than implementation is authoritative". They broke this misuse permanently by making the API actively hostile, not by making it "match the spec" as it already did.

While one could interpret the current implementation as "anti-ossification", I interpret the action as anti-Hyrum's Law by choosing to break existing users in the name of the contract.


There you can cue the 'workflow' xkcd https://xkcd.com/1172 and while the joke is funny I wish everyone would stop breaking my workflow.

Maybe I'm getting old, or maybe I just find the constant useless change tiring. I'm looking at GNOME, Android, Windows in particular.


If we ignore POSIX for a moment, the kernel could avoid contending on the one-per-process fd map by sharding the integers into distinct allocation ranges per thread. This would eliminate a source of contention between threads.

In addition to violating POSIX’ lowest hole rule, it would break select(2) (more than it’s already broken).


This sounds like premature optimization. FD availability is tracked in a bitmask, and finding the next available slot is a matter of scanning for the first unset bit under a spinlock. This is going to be extremely fast.

While you could shard the file descriptor tables for CLONE_FILES processes such as threads, you would likely complicate file descriptor table management and harm the much more important read performance (which is currently just a plain array index and pretty hard to beat).

You could also just create your processes (or threads) without CLONE_FILES so that they get their own file descriptor table.

The fdtable can be seen here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin..., and alloc_fd and __fget can be found here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin....


Are you sure such code exists? Doesn't the standard tell you to always treat the fd type as opaque anyway?

Referring to exactly the point you cite, the standard seems to be making no strong statement at all. It says to allocate from the lowest fd but that calls which may return multiple fds do not need to guarantee they are adjacent. I always took this to mean the values should pack downward and should not be e.g. allocated randomly, though it never seemed clear to me why, as the standard seems to be planning for multithreaded code.

So you are interpreting it one way, but the same statement seems to imply that fds are not meant to be introspected and should always be taken at face value from a call that generates a valid fd.


> I always took this to mean the values should pack downward and should not be e.g. allocated randomly, though it never seemed clear to me why,

The reason for this requirement is that early versions of Unix did not have dup2(), only dup(). It has nothing to do with multi-threading, as this predates pthreads by more than two decades. The shell (sh) makes use of the lowest-numbered-fd property to redirect standard in/out/error when setting up pipelines:

    int pipes[2];
    /* ignore errors */
    (void) pipe(pipes);
    if (fork() == 0) {
        close(0);
        /* guaranteed to return 0 */
        (void) dup(pipes[0]);
        close(pipes[0]);
        close(pipes[1]);
        exec_child();
    } else {
        close(1);
        /* guaranteed to return 1, we know 0 is taken */
        (void) dup(pipes[1]);
        close(pipes[0]);
        close(pipes[1]);
        exec_parent();
    }
   
Code like this exists in literally every POSIX shell. Anyone saying code like this isn’t common has no idea what they’re talking about.


> says to allocate from the lowest fd but that calls which may return multiple fds do not need to guarantee they are adjacent.

If the program has fd 0-3 and 5 open, socketpair should return 4 and 6, which are not adjacent. If socketpair is called again, while close(N) (N < 7) is being called in a separate thread, you could get {7, 8}, {N, 7}, or {7, N}, depending on kernel and timing details. All of those returns fit the requirement that the fds be allocated lowest first, but may or may not be adjacent or in absolute order.


>This matters because POSIX guarantees that open()/pipe() etc. will return the lowest file descriptor not in use[1]. I.e. this should work: close(0); fd = open("/foo/bar", ...); // fd is guaranteed to be 0

On a multi threaded system that isn't guaranteed is it? Meaning, another thread could call open in-between your close & open.


It is guaranteed whether multi-threaded or not. It’s a process level guarantee. If your application is designed such that you don’t know what your other threads are doing then POSIX cannot help you.


What you’re getting at is that an individual thread cannot really use this property without some form of synchronization with other threads in the process. Eg, to use this property, other threads either do not allocate fds, or you take some central lock around all fd allocations. Most well-written programs do not rely on it.


Oooh yes, thanks, you're right. That would make for tricky shadow fd accounting... Ugh.


You should use io_uring to open and close files asynchronously, instead of open/close.


You would not use io_uring for things like that. Not only will you still use regular file operations on device files for various reasons, but should you choose to use io_uring, you would want it to run your entire event loop and all your I/O rather than single operations here and there. Otherwise it just adds complexity with no benefit.


I don't see the big issue. There is no other way in Linux or POSIX to open a file asynchronously (not sure about closing). Dan Bernstein complained about that 20 years ago(?) and io_uring finally fixes it. Before that, runtimes with lightweight processes/threads (Erlang, GHC) used a POSIX threadpool to open files in the background. That seems just as messy as using io_uring, which at least keeps everything in the same thread.

http://cr.yp.to/unix/asyncdisk.html
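For completeness, a hedged sketch of an asynchronous open using liburing (Linux >= 5.6; link with -luring; the path and error handling are illustrative):

    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_openat(sqe, AT_FDCWD, "/etc/hostname", O_RDONLY, 0);
        io_uring_submit(&ring);              /* the open proceeds in the background */

        io_uring_wait_cqe(&ring, &cqe);      /* harvest later, e.g. from the event loop */
        printf("openat completed: fd or -errno = %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return 0;
    }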



