
In Linux, the file descriptor table (fdtable) of a process starts with a minimum of 256 slots. Two threads creating 256 sockets each use 512 fds on top of the three already present (stdin, stdout and stderr), so the fdtable has to be expanded about halfway through, when the capacity is doubled from 256 to 512, and again near the end, when it is resized from 512 to 1024.
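
For reference, a minimal sketch of that scenario (the structure and names here are mine, not the actual test program; compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <time.h>

    static void *make_sockets(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 256; i++)
            socket(AF_INET, SOCK_STREAM, 0);  /* each call takes the next free fd */
        return NULL;
    }

    int main(void)
    {
        struct timespec t0, t1;
        pthread_t a, b;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&a, NULL, make_sockets, NULL);
        pthread_create(&b, NULL, make_sockets, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%.3f ms\n", (t1.tv_sec - t0.tv_sec) * 1e3 +
                            (t1.tv_nsec - t0.tv_nsec) / 1e6);
        return 0;
    }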

This is done by expand_fdtable() in the kernel. It contains the following code:

    if (atomic_read(&files->count) > 1)
        synchronize_rcu();

The field files->count is a reference counter. Since the two threads share a set of open files, its value is 2, so synchronize_rcu() is called here during fdtable expansion. This blocks until a full RCU grace period has elapsed, delaying the acquisition of a new fd for the socket currently being created.

If the fdtable is expanded before creating a new thread, as the test program will optionally do by calling dup2(0, 666) when given a command line argument, the synchronize_rcu() call is avoided because at that point files->count == 1. Therefore there is no delay later on when creating all the sockets, as the fdtable already has sufficient capacity.
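
A sketch of that trick in isolation (dup2(0, 666) is from the test program; the wrapper function and the close() are mine):

    #include <unistd.h>

    /* Call while the process is still single-threaded (files->count == 1).
     * dup2(0, 666) forces the kernel to grow the fdtable to at least 667
     * slots; the duplicated fd itself isn't needed afterwards, and Linux
     * never shrinks the table, so the capacity sticks around. */
    static void pregrow_fdtable(void)
    {
        if (dup2(0, 666) == 666)
            close(666);
    }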

By contrast, the OpenBSD kernel doesn't have anything like RCU and simply takes an rwlock when the file descriptor table of a process is being modified, avoiding the long delay during expansion that can be observed on Linux.



RCU is super interesting; here's (I think I've got the right link) a good talk on how it works and why it works that way:

https://www.youtube.com/watch?v=9rNVyyPjoC4


Thanks for the explanation. I confirmed the performance timing difference by enabling the dup2 call.

I guess my question is why synchronize_rcu would take many milliseconds (20+) to run. I would expect that to be in the very low milliseconds or less.


> allocating kernel objects from proper virtual memory makes this easier. Linux currently just allocates kernel objects straight out of the linear mapping of all physical memory

I found this to be a key takeaway from reading the full thread: this is, in part, a benchmark of kernel memory allocation approaches, one that surfaces an unforeseen difference in fd performance at a mere 256 x 2 allocs. Presumably we’re seeing a test case distilled down from a real-world scenario where this slowdown was traced for some reason?


That’s how they’re designed; grace periods are intended to complete eventually, not quickly. There’s an “expedited RCU” (synchronize_rcu_expedited()) which, to my understanding, tries to get every CPU past the barrier as fast as possible by yelling at them, but I don’t know if that would be appropriate here.
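
Roughly, as a kernel-side sketch (illustrative only; expand_fdtable() uses just the first of these, the expedited variant is shown for contrast):

    synchronize_rcu();            /* normal: waits out a full grace period,
                                     which can take tens of milliseconds */
    synchronize_rcu_expedited();  /* expedited: prods each CPU (via IPIs) to
                                     reach a quiescent state quickly, at the
                                     cost of disturbing the whole machine */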


> This is done by expand_fdtable() in the kernel. It contains the following code:

Context: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...


Reading this code for the first time, this seems to be a consequence of the separation between allocating an fd and "installing" a pointer to a file there. Allocating the fd already needs to acquire a lock, so if the install happened together with the allocation, there would be no need for synchronize_rcu to kick other threads out; the lock would do that.
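
The pattern in question, as a kernel-side sketch (simplified from the real call sites; error handling is omitted, and file stands in for a struct file created in between):

    int fd = get_unused_fd_flags(O_CLOEXEC); /* step 1: reserve a slot, possibly
                                                expanding the fdtable */
    /* ... allocate and initialise a struct file ... */
    fd_install(fd, file);                    /* step 2: publish the pointer;
                                                lockless readers rely on RCU */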



