We did this with FreeBSD several years ago at Netflix. We used to run heavily modified -stable kernels. Our tree was ahead of FreeBSD-current in a lot of ways (TCP, ktls, VM) but way behind in other ways. Every year or two we'd rebase to a new -stable version, and that would be months of work. Often, there would be some bug introduced upstream many months prior that took weeks to track down because there was just so much to bisect through. Because our tree was based on -stable, pushing changes upstream often involved porting them to a substantially different codebase in -current, and being unable to test the ported code at scale in production.
Since we've moved to tracking -current, we rebase every few weeks. We catch any newly introduced bugs from upstream almost immediately, since there are only a few weeks of changes, not a few years, to sort through. It's far easier to contribute code upstream. In fact, we do most development upstream and wait until the next rebase to pick our changes up (or cherry-pick them if it is something urgent).
All in all, life is much, much, much better on the bleeding edge.
> All in all, life is much, much, much better on the bleeding edge.
Same feeling here, but on a personal level (meaning, using a desktop computer for professional work). Started using Ubuntu many years ago, ran updates maybe once every month. Every now and then something would break, but I was always unsure of what/how/when, so digging through everything that had changed always took a long time.
Started using Arch Linux a few years ago and also started doing updates at least once per day. Now it's way easier to discover which particular change broke something, as the releases are way smaller.
I attribute this more to how often I do the updates now vs. before, and less to which OS I use. The same can be achieved with Debian Testing et al.
Do you run all services on the new kernel, or is there a rollout process where some services are canaries?
I have one service that is low traffic that I use to test out new base images and core api changes. It’s easy to generate synthetic traffic to see how it behaves, as a lot of its traffic comes from the build process of more complex services.
All machines (can) run the same services, and we roll an entirely new self contained OS image for every update. That goes through multiple expanding levels of canaries prior to a fleet wide rollout. Bugs that escape the canaries are quite rare. One of the last ones I recall was a release that contained a firmware update for a NIC that had a super rare bug that would lock up the NIC, and we only started to see it when a sizable percent of the fleet was running that firmware.
Rebasing a Linux Kernel Branch that’s Two Years Old and 9k Commits Deep
…but you would have had thousands of us running to the bathroom to revisit our last meal.
What a Herculean effort by these two engineers. When your role is to track down fellow employees from completely different teams and crack the, er, carrot to get them to update their patch from two years ago, you can get very tired very quickly.
I’m sure the hardware driver modules are fairly self contained but things like the userspace QoS patch could be pretty hairy. The kind of thing where, for upstream, you’d spend 90% of your time landing a framework just to even be able to implement your feature with the final 10% patch.
Don't forget the part where the author of the patch left the company a year ago, or even worse, the entire team got dissolved/reorganized, which tends to happen quite regularly in a bigco. No amount of carrots will help there, I'm afraid.
> The kind of thing where, for upstream, you’d spend 90% of your time landing a framework just to even be able to implement your feature with the final 10% patch.
Is this part of the Google monorepo? Not that it matters, but I wondered how far the monorepo went, inside the tent.
I hate to be a pessimist, and I may be simply ignorant of some realities, but it FEELS like this is setting an impossibly high goal. I think there will turn out to be something significant in the 9,000 variances which cannot simply be uplifted into mainline, and which turns out to be hard to maintain as a continual rebase onto the evolving mainline.
I agree that 40 sounds low, but it's probably a good indication that it can work. 1% is within view if they maintain the pace, so even at 1%, if the rebase cost can be handled, they can converge. But that assumes there isn't a showstopper somewhere in the 9,000.
google1 (I can't remember the actual directory name, maybe //google/) used Makefiles and Makefile generators. IIRC there is a small amount of still-live code here.
google2 failed fairly quickly.
google3 is blaze/bazel and is where almost all development is today.
All three trees are actually in the same source control "depot".
If you read enough of their papers this is actually documented, but it's spread across different papers.
I really hope they try to tackle OOM upstream. Every solution I've seen is "run this daemon that tries to beat the kernel OOM to the punch". Seems inelegant.
Nodejs uses this and I hate it. Tune it wrong and you get no logs, or logs only in a spot nobody looks at. Also you end up restarting sooner than you need to, because you have to plan for the worst case where all processes flirt with the high-water mark all the time, so you restart at 1.1/n even though 1.2/n might be okay.
>"run this daemon that tries to beat the kernel OOM to the punch"
Considering that the kernel OOM killer tends to be way too late in doing its thing, I don't see how this is inelegant; maybe there's a reason you can't just have the kernel kill processes earlier in the face of memory pressure.
The kernel OOM is just plain broken. I can't understand how it can be possible that, on Windows, whenever I run out of RAM the OS kills whatever process is consuming too much and the computer keeps running flawlessly. However, on Linux, my computer just... freezes. It freezes and stops responding. Not even the mouse moves. Having to use a userspace OOM is the most inelegant thing I've seen. So I need to have 2 OOMs so that the good one can beat the bad one? It's so redundant, it's plain stupid. How come NOBODY is doing anything? If I knew C++ I would for sure send a patch.
The reason the OOM killer never kicks in is because you actually never run into an OOM, or almost never.
What usually happens is that in near-OOM conditions, the kernel starts reclaiming file-backed memory pages (which leads to what is sometimes called "thrashing"). This operation manages to keep some extra memory available, but it makes the system almost unresponsive because it's constantly copying memory back and forth from the disk. It may take anywhere between minutes and hours before the system finally OOMs and the OOM killer is invoked.
This problem has been there forever but has been made worse by the improved speeds of modern storage technologies: with slower disk I/O, the OOM condition was reached sooner.
There are several solutions:
- Buy more RAM: if your system routinely goes nearly OOM something is not right.
- Add a (small) swap. It doesn't have to be a partition: nowadays most filesystems support swap files. Just create an empty file and mark it as swap (see the sketch below).
- Limit the amount of thrashing or protect some pages from being reclaimed. This has been proposed by Google first and several other people since then, but AFAIK it has never been implemented in the mainline kernel.
Regarding the latter solution, there is a patchset called le9-patch[1] that is included in some alternative Linux kernels and it should be relatively safe to use.
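For the swap-file option above, a minimal sketch of the setup (the path and the 4G size are arbitrary, and some filesystems, e.g. btrfs, need extra steps for swap files):

```sh
# Create a 4 GiB file, lock down permissions, format and enable it as swap.
fallocate -l 4G /swapfile      # or: dd if=/dev/zero of=/swapfile bs=1M count=4096
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Make it persistent across reboots:
echo '/swapfile none swap defaults 0 0' >> /etc/fstab
```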
I’m not sure if you misread the previous comment, but their point is quite commonly experienced by people who use Linux vs MacOS/Windows.
With all hardware being the same (RAM, SSD, CPU), as OOM is reached, Linux will freeze, whereas Windows continues to run smoothly. All OSes try to reclaim memory pages; it's just that Linux seems to hang user space while doing so.
As someone who has dual-booted Windows and Linux for a decade, I can 100% attest to this glaring problem.
I am sure that this distinction between a near-OOM condition and an actual OOM condition matters to someone familiar with the current kernel implementation. You seem confident describing what happens when the memory gets closer to full, so I believe you. However, the user experiences the PC freeze during certain conditions, however you choose to name them, and it is during that freeze that the user needs a program to be killed to free some memory. I would take one crashed program over power cycling the entire PC any day of the week.
> I am sure that this distinction between a near-OOM condition and an actual OOM condition matters to someone familiar with the current kernel implementation. You seem confident describing what happens when the memory gets closer to full, so I believe you.
I'm not a kernel developer or anything like that, I've just spent some time investigating why this issue happens and has been happening for more than 10 years now.
> the user experiences the PC freeze during certain conditions, however you choose to name them
I'm not trying to defend the Linux kernel, I just described how it works. In particular, it's not true that the OOM killer "takes too long" or doesn't work: it's just not invoked at all. If you invoke it manually (enable the magic SysRq with `sysctl kernel.sysrq=1` and press `alt-sysrq-f`), it does its job and resolves the OOM instantly.
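In case it's useful, a minimal sketch of wiring that up (the value 1 enables all SysRq functions; a narrower bitmask also works, and the file name under sysctl.d is just a convention):

```sh
# Enable the magic SysRq key for the running system...
sysctl kernel.sysrq=1
# ...and persistently across reboots:
echo 'kernel.sysrq = 1' > /etc/sysctl.d/90-sysrq.conf

# Alt+SysRq+F invokes the OOM killer from the keyboard; the same thing can be
# triggered from a still-responsive shell:
echo f > /proc/sysrq-trigger
```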
So, if you don't want to deal with lockups and don't like a userspace OOM daemon (I don't), those are the possible solutions.
> I would take one crashed program over power cycling the entire PC any day of the week.
On a laptop or desktop PC, you don't need to power cycle in a near-OOM: use the magic SysRq key.
>On a laptop or desktop PC, you don't need to power cycle in a near-OOM: use the magic SysRq key.
Thanks for the tip! If my Linux ever starts locking up regularly, I will apply it.
But right now (so I don't have to give up the use to which I currently put my SysRq key), I would prefer some method for determining, after I forcefully power down the computer and then power it up again, whether the lockup or slowdown that motivated the forced power-down was caused by a near-OOM condition.
I don't think so, sorry. The kernel emits a few messages when an OOM is detected, including the tasks killed to free memory, but in a near-OOM probably nothing: the system is technically still working normally, though very slowly.
> Add a (small) swap. It doesn't have to be a partition: nowadays most filesystems support swap files. Just create an empty file and mark it as swap.
My experiences with swap on Linux have been similarly bad. If even brief memory pressure forces the kernel to move things to swap, the only way to revert that in any reasonable timeframe is to unmount the swap partition or to restart the machine.
Meanwhile using Windows with a swap file of twice the size of physical RAM runs smooth as butter. I have a 200GB swap file right now and no problems.
>I can't understand how can it be possible that, on Windows, whenever I run out of RAM the OS kills whatever process is consuming too much and the computer keeps running flawlessly.
I've long wondered this too. How does Windows handle memory pressure differently?
I avoid swap since it needs to be encrypted to protect sensitive data written out from memory to disk. Instead I reserve more memory for the kernel via vm.min_free_kbytes based on the installed RAM and, following some Red Hat suggestions, reserve more memory in vm.admin_reserve_kbytes and vm.user_reserve_kbytes, adjust vm.vfs_cache_pressure based on server role, and finally set vm.overcommit_ratio to 0. This worked well on over 50k bare-metal servers with no swap. OOM was extremely rare outside of dev; it basically only happened when automation had human-induced bugs that deployed too many Java instances to a server. All of the servers had anywhere from 512GB to 3TB of RAM and nearly all the memory was in use at all times.
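As a rough sketch of that kind of tuning (the values below are illustrative only, not the ones used on that fleet; see Documentation/admin-guide/sysctl/vm.rst for exact semantics):

```sh
# /etc/sysctl.d/90-memory.conf: illustrative values, tune per installed RAM and role
vm.min_free_kbytes = 1048576      # keep ~1 GiB free for the kernel's own allocations
vm.admin_reserve_kbytes = 262144  # reserve memory so root can still log in and recover
vm.user_reserve_kbytes = 524288   # extra headroom considered in strict overcommit mode
vm.vfs_cache_pressure = 200       # reclaim dentry/inode caches more aggressively
vm.overcommit_ratio = 0           # only consulted when vm.overcommit_memory = 2
```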
The kernel OOM killer is only concerned about kernel survival. It isn't designed to care about user perception of system responsiveness.
That's what resource control via cgroups is about. Fedora desktop folks (both GNOME and KDE) are working on ensuring minimum resources are available for the desktop experience, via cgroups, which then applies CPU, memory, and IO isolation when needed to achieve that. Also, systemd-oomd is enabled by default. The resource control picture isn't completely in place yet, but things are much improved.
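As a rough sketch of what that mechanism looks like (the unit, drop-in name, and numbers are made up for illustration, not what Fedora actually ships):

```sh
# Give user sessions a soft memory guarantee and a higher CPU/IO weight
# than background services:
mkdir -p /etc/systemd/system/user@.service.d
cat > /etc/systemd/system/user@.service.d/90-desktop.conf <<'EOF'
[Service]
MemoryLow=2G
CPUWeight=150
IOWeight=150
EOF
systemctl daemon-reload

# systemd-oomd watches cgroup-level memory pressure instead of waiting for
# a global kernel OOM:
oomctl    # inspect what systemd-oomd is currently monitoring
```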
cgroups often make the situation worse, not better, by insisting that a small memcg drop caches because that control group is full while the system overall has plenty of resources. This can lead to a system severely swapping for no apparent reason.
Putting desktop apps into individual cgroups is one of the more counter-productive ideas that has cropped up lately.
Huh? I have never seen desktop Windows killing a process due to out-of-memory -- does it even do that?
It does thrash much more gracefully than Linux, though. In fact the "your computer is low on memory" prompt actually can show up even when severely thrashing, something all but impossible in Linux (even starting something like zenity may take hours...).
You can already disable the Linux memory overcommit feature if you want Linux to never allow more memory to be allocated than exists. However, you may run into problems with programs which rely on the ability to allocate more memory than they need, or if your computer has a low amount of memory.
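A minimal sketch of what that looks like (mode 2 means strict accounting; the ratio value is illustrative):

```sh
# Refuse allocations beyond swap + (overcommit_ratio % of RAM):
sysctl vm.overcommit_memory=2
sysctl vm.overcommit_ratio=80   # kernel default is 50

# Check the resulting commit limit and what is currently committed:
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
```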
The reason is that Windows doesn’t have fork(), and therefore doesn’t have to promise huge multiples of the available memory only to be left holding the bag when that fiction fails. Look up “overcommit” if you’re interested.
Because originally they were (way) ahead of the mainline. The article doesn't actually say this, fwiw. Nowadays mainline has mostly caught up, but it's hard to rebase (the article does say this).
Because you can’t add a feature with a knob. And not all changes can be made in a module (eg core kernel). And even if they could, kernel internals are not stable and you would still need to rebase your out of tree modules.
Not everything can be written as modules. The module API is fairly limited, and it doesn't let you arbitrarily customize the behavior of existing parts of the kernel. Examples from the article include OOM and scheduling. One I ran into myself recently is that, despite the name, Linux Security Modules (LSMs) are not loadable modules, and the LSM initialization code is unloaded after kernel boot, so even if I wanted to play tricks with unpublished APIs, the code is just not there.
A little more philosophically, if all these customization points were available as modules, the process of updating modules to work with new versions of the kernel would be exactly as much of a mess.
For OOM you do have a lot of flexibility with containers/control groups nowadays. What kind of problems were they solving with the scheduler? Is anything known about that?
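For example, with cgroup v2 you can cap a job's memory and have the whole group killed together when it blows past the cap (a sketch; the group name and workload are made up, and the memory controller has to be enabled on the parent):

```sh
# Create a group, cap it at 1 GiB, and make OOM kills apply to the whole group:
mkdir /sys/fs/cgroup/mybatch
echo $((1024*1024*1024)) > /sys/fs/cgroup/mybatch/memory.max
echo 1 > /sys/fs/cgroup/mybatch/memory.oom.group

# Move the current shell into the group and start the workload:
echo $$ > /sys/fs/cgroup/mybatch/cgroup.procs
./memory-hungry-job   # hypothetical workload
```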
For background context, a reminder that control groups were originally developed in Google’s kernel fork, and later mainlined.
As for the scheduler stuff, the main change is a SwitchTo set of syscalls that allow threads to bypass the kernel’s scheduler and just continue execution as a different thread.
https://lkml.org/lkml/2020/7/22/1202
Thanks for the link, it is explained in the linked video:
https://www.youtube.com/watch?v=KXuZi9aeGTw They explain it around 15:01 - Google added its own syscall, switchto_thread, that puts the current thread to sleep and switches to the argument thread id (and some other calls too). That one helps with cutting down latency in inter-thread calls for m:n threading. The real effort is to make latency for individual application requests predictable, while keeping it low.
Some things just don't want to be a knob. Weird stuff that no sane person would consider, until they need to, like raising the limit on how long a command line can be.
It is not necessarily true that features touch a lot of places. But most of the changes are improvements to existing functionality (e.g. KVM, or the scheduler) rather than something self contained.
But drivers for custom hw are easy to rebase. They are quite self contained.
It is when you get into kernel internals that rebasing continuously becomes a challenge. Some parts of the core Linux kernel don't change often, but I imagine many other parts see significant churn.
>Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."
Google borrowed for proprietary reasons; meanwhile the people who maintain and improve the kernel and give away the fruits for free have produced something that Google wants, so they'll merge, again for FAANG proprietary reasons. I'm not losing sleep over it; hell, I paid for it already with my privacy.
Let's not pretend that kernel developers don't get paid for their kernel work. The majority of them are employed by corporations that use Linux or are sponsored by other means.
Maybe it was so in the past, but I doubt that any significant contributions nowadays come as free (as in beer) work.
I'm not complaining. I'm saying that Google is complaining, or at least everybody here is treating Google like "oh boo hoo, look how much work they have", and they shouldn't: they made the choices they made, it's just technical debt.
You do realize that almost every big company pays money to the Linux foundation and that Linus Torvalds is paid 10 million dollars a year thanks to that? They still get it very cheaply, but that is because the Linux team is very efficient and not because they aren't paid.
There’s a whole nasty class of bugs introduced by bad merge resolution.
Mark was a senior front end Dev who was bad at complex logic. He sat near me and I took pity on him, helped him out to help us out.
Steve was smarter and more experienced but had attention to detail problems and did not understand git. He should have been one of my lieutenants but I could not trust him and often had to help him do branch surgery.
Steve did not like this arrangement but it couldn’t be helped. He got cranky about it and went after Mark. I can’t prove he was coming at me sideways, but it was curious.
One day Steve claims Mark broke our shit. Here’s this line with his name on it. It even looked like a Markism. But the thing was, I reviewed that code. I remember being happy Mark got it right on the first try. This was not the code I merged but git says it is. Da fuq?
So I start bisecting and looking at branches and sure enough, Steve screwed up a three-way merge, again, and blame treated the change as if Mark wrote it.
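For anyone who hits the same thing: blame charges the merge's recorded content to the original author, so the trick is to look at what the merge commit itself changed relative to its parents. A sketch (the path and sha are placeholders):

```sh
# Combined diffs only show content that differs from both parents, i.e.
# content introduced or altered by the merge resolution itself:
git log --merges --cc -- src/app/feature.js

# Newer git can also diff a merge against an automatic re-merge, which makes
# hand-edited (or botched) conflict resolutions stand out:
git show --remerge-diff <merge-sha>
```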
The amount of time and effort expended on this must be immense. Would it have been feasible at the time to sponsor having these changes pushed into mainline?
Attempts have been made to mainline many if not most of these patches. Either they were not considered of acceptable quality, considered too similar to other features (ex: Binder in Android), or rejected for other reasons (ex: Google's Fibers and SwitchTo).
If you have a patch that saves significant resources, improves important performance metrics, or unblocks hardware that does those things, you don't wait for upstream to decide whether they will take the patches. You simply start using the patches ASAP and reap the benefits, then you continue trying to upstream them to reduce the maintenance burden.
I don't know if this is what they were driving at but there are tons of useful things that you can't do in the sockets API like selectively up- or down-grading DSCP for a subset of TCP ACKs. Linux also does not come out of the box with DSCP reflection for the SYN+ACK, or at least it didn't until 5.10. Google prodkernel had this feature for at least a decade prior.
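For what it's worth, a sketch of what is reachable from user space today; the sysctl name is from memory and may be off, and the port/DSCP values are arbitrary:

```sh
# SYN+ACK TOS/DSCP reflection landed around 5.10 (sysctl name from memory):
sysctl net.ipv4.tcp_reflect_tos=1

# Coarse per-flow re-marking is possible outside the sockets API via netfilter:
iptables -t mangle -A OUTPUT -p tcp --sport 8443 -j DSCP --set-dscp 18   # 18 = AF21

# What you still cannot express with setsockopt(IP_TOS): a different DSCP for
# only a subset of one connection's ACKs.
```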
The compiled kernel isn't distributed so there is no need to distribute source. Note that prodkernel is different than the Android kernel (where they do, and are required to share the source by GPL).
To be honest I don't think it is a major loss. IIUC most features are either only interesting internally, or are submitted upstream to reduce maintenance cost. Google's "secret sauce" isn't how great their kernel is, so I am not aware of anything that is held back for competitive advantage other than maybe some drivers for custom hardware. So while it would be nice to see all of these patches collected into a "live" snapshot of their kernel it probably wouldn't be that much more helpful (if at all) to a third-party.
It seems like they could work with upstream to figure out a way to restructure source files and add commented markup to make the process a lot easier. Not that other people should do Google's work, but such changes could be beneficial to multiple people and organizations that need to build custom kernels, like realtime Linux, Xen, and hardened.
> Most of the world manages to run their binaries on a mainline kernel
Most of the world doesn’t need to worry about their kernel - this is a good thing. But at FAANG scale, you inevitably need to make changes and optimizations.
I thought the kernel was pretty stable. On my first team at Twitter I think we ran into 2 or 3 kernel bugs that year. I couldn't believe it, but for some problems blaming the kernel is a real debugging step (and even the BIOS, lol). I think I had two perf problems turn into upstream patches thanks to the kernel team.
It's not only about bugs: if you know your workload and your systems, you can do some tricks a general-purpose kernel can't, tweaking it for your needs to get a benefit here or there for your code.
I remember reading an article about new system calls proposed by Google. It was a scheduler yield system call that allowed the current task to yield its remaining processing time to another task. It would presumably be used to make goroutines more performant.
Not sure if the system call was accepted or if it still exists as Google specific kernel code. I can't seem to find the original article either...
Cgroups is a great example of the problems faced here.
The cgroups code/API that made it into the upstream kernel was the product of a couple of years of internal experimentation at Google into ways to do kernel-level resource isolation, and was pretty different from the approach that was used internally in production on a rather older kernel version. (A bit like Borg vs Kubernetes). And it still took a couple more years for cgroups to replace the original internal mechanisms, since the internal way worked OK and upgrading the kernel across so many machines was risky.
It is sometimes hard to imagine the scale of Google. If you can implement a task-switching mechanism with lower overhead and save 0.1% of CPU cycles across the fleet, you can save many millions of dollars a year. So they have their own thread-switching mechanism which their binaries can't run without. If you can improve process isolation in a way that allows you to pack 10% more tasks (by RAM or CPU) onto a single machine without affecting performance significantly, you have just saved billions. So they have different isolation interfaces that Borg can't run without.
Both of these are completely feasible savings and make custom binaries and a custom kernel completely worth it.
Does Linux need explicit support for this? Command line arguments and environment variables are copied to the stack of the new process. I would expect the stack's size to be the limiting factor.
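A quick way to see the limits on a given box (on Linux the total argv+environ budget is tied to the stack rlimit, roughly a quarter of it, plus a per-string cap):

```sh
# POSIX-visible limit on the combined size of argv + environ:
getconf ARG_MAX

# GNU xargs prints the effective limits it computed for this system:
xargs --show-limits < /dev/null

# Raising the stack rlimit raises the usable command-line size:
ulimit -s          # current soft stack limit, in KiB
ulimit -s 65536    # example: raise it to 64 MiB for this shell and its children
```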
In addition to hardware support, the article also mentioned adjustments to those perennial Linux pains in the ass that are the scheduler and the OOM killer.
These won't make binaries "not run" as such but they are necessary for the correct and efficient operation of large-scale distributed systems.
QUIC, Snap/Pony, and user-space thread scheduling are strictly superior to what happens inside the kernel and their development makes the delta between prodkernel and upstream kernel less relevant over time. They also, collectively, make Linux itself less relevant.
The Linux kernel is the most successful free-beer UNIX clone, but it is by no means the only one, and if the trends in IoT and cloud infrastructure are any measure, in a couple of decades we might still live under a POSIX monoculture to some extent, but Linux most likely won't be the kernel underneath.
Not CPUs but a lot of other components are completely custom, developed in-house. A well known example: TPUs. Google is one of the world's largest manufacturer of servers and other DC equipment, actually, it's just that none of that is for sale (though a lot is for rent :) ).
Unless they've been building any factories I don't know about, their contractors are the manufacturers. I'm talking about companies based in Taiwan like Quanta, Compal, Clevo, and others. If you've ever been to a Computex Taipei you have seen the booths of the actual manufacturers of Google's hardware.
This is meaningless nitpicking. It is like nitpicking that Apple is not a manufacturer of hardware since the actual assembly happens under some contractor. For all intents and purposes, and even technically, Google is the largest manufacturer of data-center-sized supercomputers.
> Those patches implement various internal APIs (e.g. for Google Fibers), provide hardware support, add performance optimizations, and contain other "tweaks that are needed to run binaries that we use at Google".
I'd assume if you're making allowances for custom-made processors at the kernel level you wouldn't also want that leaking into your user land binaries as well.
Why not? They could have kernel accommodation for one feature, and user-land leakage for another.
Also, for some features you need both kernel and userland cooperation. Just think of, e.g., FUSE or io_uring or mmap, which are in the public kernel. You can surely imagine that Google might brew up similar features.
(I used to work at Google. But I didn't have any special insider knowledge about their kernel stuff. Which is good, so I can't violate any lingering NDAs here with my speculation..)
Kind of, I've seen some non-public Intel SKUs in the fleet but afaik they just had tweaked core counts and clocks for better perf/watt. They were also exploring non-intel architectures at some point, ppc especially but no significant prod deployment yet. And then I'm sure you've heard of special purpose hardware like TPUs