We did this with FreeBSD several years ago at Netflix. We used to run heavily modified -stable kernels. Our tree was ahead of FreeBSD-current in a lot of ways (TCP, ktls, VM) but way behind in other ways. Every year or two we'd rebase to a new -stable version, and that would be months of work. Often, there would be some bug introduced upstream many months prior that took weeks to track down because there was just so much to bisect through. Because our tree was based on -stable, pushing changes upstream often involved porting them to a substantially different codebase in -current, and being unable to test the ported code at scale in production.
Since we've moved to tracking -current, we rebase every few weeks. We catch any newly introduced bugs from upstream almost immediately, since there are only a few weeks of changes, not a few years, to sort through. It's far easier to contribute code upstream. In fact, we do most development upstream and wait until the next rebase to pick our changes up (or cherry-pick them if it is something urgent).
All in all, life is much, much, much better on the bleeding edge.
> All in all, life is much, much, much better on the bleeding edge.
Same feeling here, but on a personal level (meaning, using a desktop computer for professional work). Started using Ubuntu many years ago, ran updates maybe once every month. Every now and then something would break, but I was always unsure of what/how/when, so digging through everything that had changed always took a long time.
Started using Arch Linux a few years ago and also started doing updates at least once per day. Now it's way easier to discover which particular change broke something, as the releases are way smaller.
I attribute this more to how often I do the updates now vs. before, and less to which OS I use. The same can be achieved with Debian Testing et al.
Do you run all services on the new kernel, or is there a rollout process where some services are canaries?
I have one service that is low traffic that I use to test out new base images and core api changes. It’s easy to generate synthetic traffic to see how it behaves, as a lot of its traffic comes from the build process of more complex services.
All machines (can) run the same services, and we roll an entirely new self contained OS image for every update. That goes through multiple expanding levels of canaries prior to a fleet wide rollout. Bugs that escape the canaries are quite rare. One of the last ones I recall was a release that contained a firmware update for a NIC that had a super rare bug that would lock up the NIC, and we only started to see it when a sizable percent of the fleet was running that firmware.
Rebasing a Linux Kernel Branch that’s Two Years Old and 9k Commits Deep
…but you would have had thousands of us running to the bathroom to revisit our last meal.
What a Herculean effort by these two engineers. When your role is to track down fellow employees from completely different teams and crack the, er, carrot to get them to update their patch from two years ago, you can get very tired very quickly.
I’m sure the hardware driver modules are fairly self contained but things like the userspace QoS patch could be pretty hairy. The kind of thing where, for upstream, you’d spend 90% of your time landing a framework just to even be able to implement your feature with the final 10% patch.
Don't forget the part where the author of the patch left the company a year ago, or even worse, the entire team got dissolved/reorganized, which tends to happen quite regularly in a bigco. No amount of carrots will help there, I'm afraid.
> The kind of thing where, for upstream, you’d spend 90% of your time landing a framework just to even be able to implement your feature with the final 10% patch.
Is this part of the Google monorepo? Not that it matters, but I wondered how far the monorepo went, inside the tent.
I hate to be a pessimist, and I may be simply ignorant of some realities, but it FEELS like this is setting an impossibly high goal. I think there will turn out to be something significant in the 9,000 variances which cannot simply be uplifted into mainline, and which turns out to be hard to maintain as a continual rebase onto the evolving mainline.
I agree that 40 sounds low, but it's probably a good indication that it can work. 1% is within view if they maintain the pace, so even at 1%, if the rebase cost can be handled, they can converge. But that assumes there isn't a showstopper somewhere in the 9,000.
google1 (I can't remember the actual directory name, maybe //google/) used Makefiles and Makefile generators. IIRC there is a small amount of still-live code here.
google2 failed fairly quickly.
google3 is blaze/bazel and is where almost all development is today.
All three trees are actually in the same source control "depot".
If you read enough of their papers this is actually documented, but it's spread across different papers.
I really hope they try to tackle OOM upstream. Every solution I've seen is "run this daemon that tries to beat the kernel OOM to the punch". Seems inelegant.
Nodejs uses this and I hate it. Tune it wrong and you get no logs, or logs only in a spot nobody looks at. Also you end up restarting sooner than you need to, because you have to plan for the worst case where all processes flirt with the high-water mark all the time, so you restart at 1.1/n even though 1.2/n might be okay.
>"run this daemon that tries to beat the kernel OOM to the punch"
Considering that the kernel OOM killer tends to be way too late in doing its thing, I don't see how this is inelegant; maybe there's a reason you can't just have the kernel kill processes earlier in the face of memory pressure.
The kernel OOM is just plain broken. I can't understand how it can be possible that, on Windows, whenever I run out of RAM the OS kills whatever process is consuming too much and the computer keeps running flawlessly. However, on Linux, my computer just... freezes. It freezes and stops responding. Not even the mouse moves. Having to use a userspace OOM is the most inelegant thing I've seen. So I need to have 2 OOMs so that the good one can beat the bad one? It's so redundant, it's plain stupid. How come NOBODY is doing anything? If I knew C++ I would for sure send a patch.
The reason the OOM killer never kicks in is because you actually never run into an OOM, or almost never.
What usually happens is that in near-OOM conditions, the kernel starts reclaiming file-backed memory pages (which leads to what is sometimes called "thrashing"). This operation manages to keep some extra memory available, but it makes the system almost unresponsive because it's constantly copying memory back and forth from the disk. It may take anywhere between minutes and hours before the system finally OOMs and the OOM killer is invoked.
This problem has been there forever but has been made worse by the improved speeds of modern storage technologies: with slower disk I/O, the OOM condition was reached sooner.
There are several solutions:
- Buy more RAM: if your system routinely goes nearly OOM something is not right.
- Add a (small) swap. It doesn't have to be a partition: nowadays most filesystems support swap files. Just create an empty file and mark it as swap (see the sketch below).
- Limit the amount of thrashing or protect some pages from being reclaimed. This has been proposed by Google first and several other people since then, but AFAIK it has never been implemented in the mainline kernel.
Regarding the latter solution, there is a patchset called le9-patch[1] that is included in some alternative Linux kernels and it should be relatively safe to use.
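For the swap-file option above, a minimal sketch of the setup (the path and the 4G size are arbitrary, and some filesystems, e.g. btrfs, need extra steps for swap files):

```sh
# Create a 4 GiB file, lock down permissions, format and enable it as swap.
fallocate -l 4G /swapfile      # or: dd if=/dev/zero of=/swapfile bs=1M count=4096
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Make it persistent across reboots:
echo '/swapfile none swap defaults 0 0' >> /etc/fstab
```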
I’m not sure if you misread the previous comment, but their point is quite commonly experienced by people who use Linux vs MacOS/Windows.
With all hardware being the same (RAM, SSD, CPU), as OOM is reached, Linux will freeze, whereas Windows continues to run smoothly. All OSes try to reclaim memory pages; it's just that Linux seems to hang user space while doing so.
As someone who has dual-booted Windows and Linux for a decade, I can 100% attest to this glaring problem.
I am sure that this distinction between a near-OOM condition and an actual OOM condition matters to someone familiar with the current kernel implementation. You seem confident describing what happens when the memory gets closer to full, so I believe you. However, the user experiences the PC freeze during certain conditions, however you choose to name them, and it is during that freeze that the user needs a program to be killed to free some memory. I would take one crashed program over power cycling the entire PC any day of the week.
> I am sure that this distinction between a near-OOM condition and an actual OOM condition matters to someone familiar with the current kernel implementation. You seem confident describing what happens when the memory gets closer to full, so I believe you.
I'm not a kernel developer or anything like that, I've just spent some time investigating why this issue happens and has been happening for more than 10 years now.
> the user experiences the PC freeze during certain conditions, however you choose to name them
I'm not trying to defend the Linux kernel, I just described how it works. In particular, it's not true that the OOM killer "takes too long" or doesn't work: it's just not invoked at all. If you invoke it manually (enable the magic SysRq with `sysctl kernel.sysrq=1` and press `alt-sysrq-f`), it does its job and resolves the OOM instantly.
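In case it's useful, a minimal sketch of wiring that up (the value 1 enables all SysRq functions; a narrower bitmask also works, and the file name under sysctl.d is just a convention):

```sh
# Enable the magic SysRq key for the running system...
sysctl kernel.sysrq=1
# ...and persistently across reboots:
echo 'kernel.sysrq = 1' > /etc/sysctl.d/90-sysrq.conf

# Alt+SysRq+F invokes the OOM killer from the keyboard; the same thing can be
# triggered from a still-responsive shell:
echo f > /proc/sysrq-trigger
```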
So, if you don't want to deal with lockups and don't like a userspace OOM daemon (I don't), those are the possible solutions.
> I would take one crashed program over power cycling the entire PC any day of the week.
On a laptop or desktop PC, you don't need to power cycle in a near-OOM: use the magic SysRq key.
>On a laptop or desktop PC, you don't need to power cycle in a near-OOM: use the magic SysRq key.
Thanks for the tip! If my Linux ever starts locking up regularly, I will apply it.
But right now (so I don't have to give up the use to which I currently put my SysRq key), I would prefer some method for determining, after I forcefully power down the computer and then power it up again, whether the lockup or slowdown that motivated the forced power-down was caused by a near-OOM condition.
I don't think so, sorry. The kernel emits a few messages when an OOM is detected, including the tasks killed to free memory, but in a near-OOM probably nothing: the system is technically still working normally, though very slowly.
> Add a (small) swap. It doesn't have to be a partition: nowadays most filesystems support swap files. Just create an empty file and mark it as swap.
My experiences with swap on Linux have been similarly bad. If even brief memory pressure forces the kernel to move things to swap, the only way to revert that in any reasonable timeframe is to unmount the swap partition or to restart the machine.
Meanwhile using Windows with a swap file of twice the size of physical RAM runs smooth as butter. I have a 200GB swap file right now and no problems.
>I can't understand how can it be possible that, on Windows, whenever I run out of RAM the OS kills whatever process is consuming too much and the computer keeps running flawlessly.
I've long wondered this too. How does Windows handle memory pressure differently?
I avoid swap since it needs to be encrypted to protect sensitive data written out from memory to disk. Instead I reserve more memory for the kernel via vm.min_free_kbytes based on the installed RAM and, following some Red Hat suggestions, reserve more memory in vm.admin_reserve_kbytes and vm.user_reserve_kbytes, adjust vm.vfs_cache_pressure based on server role, and finally set vm.overcommit_ratio to 0. This worked well on over 50k bare-metal servers with no swap. OOM was extremely rare outside of dev; it basically only happened when automation had human-induced bugs that deployed too many Java instances to a server. All of the servers had anywhere from 512GB to 3TB of RAM and nearly all the memory was in use at all times.
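As a rough sketch of that kind of tuning (the values below are illustrative only, not the ones used on that fleet; see Documentation/admin-guide/sysctl/vm.rst for exact semantics):

```sh
# /etc/sysctl.d/90-memory.conf: illustrative values, tune per installed RAM and role
vm.min_free_kbytes = 1048576      # keep ~1 GiB free for the kernel's own allocations
vm.admin_reserve_kbytes = 262144  # reserve memory so root can still log in and recover
vm.user_reserve_kbytes = 524288   # extra headroom considered in strict overcommit mode
vm.vfs_cache_pressure = 200       # reclaim dentry/inode caches more aggressively
vm.overcommit_ratio = 0           # only consulted when vm.overcommit_memory = 2
```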
The kernel OOM killer is only concerned about kernel survival. It isn't designed to care about user perception of system responsiveness.
That's what resource control via cgroups is about. Fedora desktop folks (both GNOME and KDE) are working on ensuring minimum resources are available for the desktop experience, via cgroups, which then applies CPU, memory, and IO isolation when needed to achieve that. Also, systemd-oomd is enabled by default. The resource control picture isn't completely in place yet, but things are much improved.
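As a rough sketch of what that mechanism looks like (the unit, drop-in name, and numbers are made up for illustration, not what Fedora actually ships):

```sh
# Give user sessions a soft memory guarantee and a higher CPU/IO weight
# than background services:
mkdir -p /etc/systemd/system/user@.service.d
cat > /etc/systemd/system/user@.service.d/90-desktop.conf <<'EOF'
[Service]
MemoryLow=2G
CPUWeight=150
IOWeight=150
EOF
systemctl daemon-reload

# systemd-oomd watches cgroup-level memory pressure instead of waiting for
# a global kernel OOM:
oomctl    # inspect what systemd-oomd is currently monitoring
```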
cgroups often make the situation worse, not better, by insisting that a small memcg drop caches because that control group is full while the system overall has plenty of resources. This can lead to a system severely swapping for no apparent reason.
Putting desktop apps into individual cgroups is one of the more counter-productive ideas that has cropped up lately.
Huh? I have never seen desktop Windows killing a process due to out-of-memory -- does it even do that?
It does thrash much more gracefully than Linux, though. In fact the "your computer is low on memory" prompt actually can show up even when severely thrashing, something all but impossible in Linux (even starting something like zenity may take hours...).
You can already disable the Linux memory overcommit feature if you want Linux to never allow more memory to be allocated than exists. However, you may run into problems with programs which rely on the ability to allocate more memory than they need, or if your computer has a low amount of memory.
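A minimal sketch of what that looks like (mode 2 means strict accounting; the ratio value is illustrative):

```sh
# Refuse allocations beyond swap + (overcommit_ratio % of RAM):
sysctl vm.overcommit_memory=2
sysctl vm.overcommit_ratio=80   # kernel default is 50

# Check the resulting commit limit and what is currently committed:
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
```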
The reason is that Windows doesn’t have fork(), and therefore doesn’t have to promise huge multiples of the available memory only to be left holding the bag when that fiction fails. Look up “overcommit” if you’re interested.
Because originally they were (way) ahead of the mainline. The article doesn't actually say this, fwiw. Nowadays mainline has mostly caught up, but it's hard to rebase (the article does say this).
Because you can’t add a feature with a knob. And not all changes can be made in a module (eg core kernel). And even if they could, kernel internals are not stable and you would still need to rebase your out of tree modules.
Not everything can be written as modules. The module API is fairly limited, and it doesn't let you arbitrarily customize the behavior of existing parts of the kernel. Examples from the article include OOM and scheduling. One I ran into myself recently is that, despite the name, Linux Security Modules (LSMs) are not loadable modules, and the LSM initialization code is unloaded after kernel boot, so even if I wanted to play tricks with unpublished APIs, the code is just not there.
A little more philosophically, if all these customization points were available as modules, the process of updating modules to work with new versions of the kernel would be exactly as much of a mess.
For OOM you do have a lot of flexibility with containers/control groups nowadays. What kind of problems were they solving with the scheduler? Is anything known about that?
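For example, with cgroup v2 you can cap a job's memory and have the whole group killed together when it blows past the cap (a sketch; the group name and workload are made up, and the memory controller has to be enabled on the parent):

```sh
# Create a group, cap it at 1 GiB, and make OOM kills apply to the whole group:
mkdir /sys/fs/cgroup/mybatch
echo $((1024*1024*1024)) > /sys/fs/cgroup/mybatch/memory.max
echo 1 > /sys/fs/cgroup/mybatch/memory.oom.group

# Move the current shell into the group and start the workload:
echo $$ > /sys/fs/cgroup/mybatch/cgroup.procs
./memory-hungry-job   # hypothetical workload
```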
For background context, a reminder that control groups were originally developed in Google’s kernel fork, and later mainlined.
As for the scheduler stuff, the main change is a SwitchTo set of syscalls that allow threads to bypass the kernel’s scheduler and just continue execution as a different thread.
https://lkml.org/lkml/2020/7/22/1202
Thanks for the link, it is explained in the linked video:
https://www.youtube.com/watch?v=KXuZi9aeGTw They explain it around 15:01 - Google added its own syscall, switchto_thread, that puts the current thread to sleep and switches to the argument thread id (and some other calls too). That one helps with cutting down latency in inter-thread calls for m:n threading. The real effort is to make latency for individual application requests predictable, while keeping it low.
Some things just don't want to be a knob. Weird stuff that no sane person would consider, until they need to, like raising the limit on how long a command line can be.
It is not necessarily true that features touch a lot of places. But most of the changes are improvements to existing functionality (e.g. KVM, or the scheduler) rather than something self contained.
But drivers for custom hw are easy to rebase. They are quite self contained.
It is when you get into kernel internals that rebasing continuously becomes a challenge. Some parts of the core Linux kernel don't change often, but I imagine many other parts see significant churn.
>Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."
Google borrowed for proprietary reasons; meanwhile the people who maintain and improve the kernel and give away the fruits for free have produced something that Google wants, so they'll merge, again for FAANG proprietary reasons. I'm not losing sleep over it; hell, I paid for it already with my privacy.
Let's not pretend that kernel developers don't get paid for their kernel work. The majority of them are employed by corporations that use Linux or are sponsored by other means.
Maybe it was so in the past, but I doubt that any significant contributions nowadays come as free (as in beer) work.
I'm not complaining. I'm saying that Google is complaining, or at least everybody here is treating Google like "oh boo hoo, look how much work they have", and they shouldn't: they made the choices they made, it's just technical debt.
You do realize that almost every big company pays money to the Linux foundation and that Linus Torvalds is paid 10 million dollars a year thanks to that? They still get it very cheaply, but that is because the Linux team is very efficient and not because they aren't paid.
There’s a whole nasty class of bugs introduced by bad merge resolution.
Mark was a senior front end Dev who was bad at complex logic. He sat near me and I took pity on him, helped him out to help us out.
Steve was smarter and more experienced but had attention to detail problems and did not understand git. He should have been one of my lieutenants but I could not trust him and often had to help him do branch surgery.
Steve did not like this arrangement but it couldn’t be helped. He got cranky about it and went after Mark. I can’t prove he was coming at me sideways, but it was curious.
One day Steve claims Mark broke our shit. Here’s this line with his name on it. It even looked like a Markism. But the thing was, I reviewed that code. I remember being happy Mark got it right on the first try. This was not the code I merged but git says it is. Da fuq?
So I start bisecting and looking at branches and sure enough, Steve screwed up a three-way merge, again, and blame treated the change as if Mark wrote it.
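For anyone who hits the same thing: blame charges the merge's recorded content to the original author, so the trick is to look at what the merge commit itself changed relative to its parents. A sketch (the path and sha are placeholders):

```sh
# Combined diffs only show content that differs from both parents, i.e.
# content introduced or altered by the merge resolution itself:
git log --merges --cc -- src/app/feature.js

# Newer git can also diff a merge against an automatic re-merge, which makes
# hand-edited (or botched) conflict resolutions stand out:
git show --remerge-diff <merge-sha>
```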
The amount of time and effort expended on this must be immense. Would it have been feasible at the time to sponsor having these changes pushed into mainline?
Attempts have been made to mainline many if not most of these patches. Either they were not considered of acceptable quality, considered too similar to other features (ex: Binder in Android), or rejected for other reasons (ex: Google's Fibers and SwitchTo).
If you have a patch that saves significant resources, improves important performance metrics, or unblocks hardware that does those things, you don't wait for upstream to decide whether they will take the patches. You simply start using the patches ASAP and reap the benefits, then you continue trying to upstream them to reduce the maintenance burden.
I don't know if this is what they were driving at but there are tons of useful things that you can't do in the sockets API like selectively up- or down-grading DSCP for a subset of TCP ACKs. Linux also does not come out of the box with DSCP reflection for the SYN+ACK, or at least it didn't until 5.10. Google prodkernel had this feature for at least a decade prior.
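For what it's worth, a sketch of what is reachable from user space today; the sysctl name is from memory and may be off, and the port/DSCP values are arbitrary:

```sh
# SYN+ACK TOS/DSCP reflection landed around 5.10 (sysctl name from memory):
sysctl net.ipv4.tcp_reflect_tos=1

# Coarse per-flow re-marking is possible outside the sockets API via netfilter:
iptables -t mangle -A OUTPUT -p tcp --sport 8443 -j DSCP --set-dscp 18   # 18 = AF21

# What you still cannot express with setsockopt(IP_TOS): a different DSCP for
# only a subset of one connection's ACKs.
```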
The compiled kernel isn't distributed so there is no need to distribute source. Note that prodkernel is different than the Android kernel (where they do, and are required to share the source by GPL).
To be honest I don't think it is a major loss. IIUC most features are either only interesting internally, or are submitted upstream to reduce maintenance cost. Google's "secret sauce" isn't how great their kernel is, so I am not aware of anything that is held back for competitive advantage other than maybe some drivers for custom hardware. So while it would be nice to see all of these patches collected into a "live" snapshot of their kernel it probably wouldn't be that much more helpful (if at all) to a third-party.
It seems like they could work with upstream to figure out a way to restructure source files and add commented markup to make the process a lot easier. Not that other people should do Google's work, but such changes could be beneficial to multiple people and organizations that need to build custom kernels, like realtime Linux, Xen, and hardened.
> Most of the world manages to run their binaries on a mainline kernel
Most of the world doesn’t need to worry about their kernel - this is a good thing. But at FAANG scale, you inevitably need to make changes and optimizations.
I thought the kernel was pretty stable. On my first team at Twitter I think we ran into 2 or 3 kernel bugs that year. I couldn't believe it, but for some problems blaming the kernel is a real debugging step (and even the BIOS, lol). I think I had two perf problems turn into upstream patches thanks to the kernel team.
It's not only about bugs: if you know your workload and your systems, you can do some tricks a general-purpose kernel can't, tweaking it for your needs to get a benefit here or there for your code.
I remember reading an article about new system calls proposed by Google. It was a scheduler yield system call that allowed the current task to yield its remaining processing time to another task. It would presumably be used to make goroutines more performant.
Not sure if the system call was accepted or if it still exists as Google specific kernel code. I can't seem to find the original article either...
Cgroups is a great example of the problems faced here.
The cgroups code/API that made it into the upstream kernel was the product of a couple of years of internal experimentation at Google into ways to do kernel-level resource isolation, and was pretty different from the approach that was used internally in production on a rather older kernel version. (A bit like Borg vs Kubernetes). And it still took a couple more years for cgroups to replace the original internal mechanisms, since the internal way worked OK and upgrading the kernel across so many machines was risky.
It is sometimes hard to imagine the scale of Google. If you can implement a task-switching mechanism with lower overhead and save 0.1% of CPU cycles across the fleet, you can save many millions of dollars a year. So they have their own thread-switching mechanism which their binaries can't run without. If you can improve process isolation in a way that allows you to pack 10% more tasks (by RAM or CPU) onto a single machine without affecting performance significantly, you have just saved billions. So they have different isolation interfaces that Borg can't run without.
Both of these are completely feasible savings and make custom binaries and a custom kernel completely worth it.
Does Linux need explicit support for this? Command line arguments and environment variables are copied to the stack of the new process. I would expect the stack's size to be the limiting factor.
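A quick way to see the limits on a given box (on Linux the total argv+environ budget is tied to the stack rlimit, roughly a quarter of it, plus a per-string cap):

```sh
# POSIX-visible limit on the combined size of argv + environ:
getconf ARG_MAX

# GNU xargs prints the effective limits it computed for this system:
xargs --show-limits < /dev/null

# Raising the stack rlimit raises the usable command-line size:
ulimit -s          # current soft stack limit, in KiB
ulimit -s 65536    # example: raise it to 64 MiB for this shell and its children
```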
In addition to hardware support, the article also mentioned adjustments to those perennial Linux pains in the ass that are the scheduler and the OOM killer.
These won't make binaries "not run" as such but they are necessary for the correct and efficient operation of large-scale distributed systems.
QUIC, Snap/Pony, and user-space thread scheduling are strictly superior to what happens inside the kernel and their development makes the delta between prodkernel and upstream kernel less relevant over time. They also, collectively, make Linux itself less relevant.
The Linux kernel is the most successful free-beer UNIX clone, but it is by no means the only one, and if the trends in IoT and cloud infrastructure are any measure, in a couple of decades we might still live under a POSIX monoculture to some extent, but Linux most likely won't be the kernel underneath.
Not CPUs but a lot of other components are completely custom, developed in-house. A well known example: TPUs. Google is one of the world's largest manufacturer of servers and other DC equipment, actually, it's just that none of that is for sale (though a lot is for rent :) ).
Unless they've been building any factories I don't know about, their contractors are the manufacturers. I'm talking about companies based in Taiwan like Quanta, Compal, Clevo, and others. If you've ever been to a Computex Taipei you have seen the booths of the actual manufacturers of Google's hardware.
This is meaningless nitpicking. It is like nitpicking that Apple is not a manufacturer of hardware since the actual assembly happens under some contractor. For all intents and purposes, and even technically, Google is the largest manufacturer of data-center-sized supercomputers.
> Those patches implement various internal APIs (e.g. for Google Fibers), provide hardware support, add performance optimizations, and contain other "tweaks that are needed to run binaries that we use at Google".
I'd assume if you're making allowances for custom-made processors at the kernel level you wouldn't also want that leaking into your user land binaries as well.
Why not? They could have kernel accommodation for one feature, and user-land leakage for another.
Also, for some features you need both kernel and userland cooperation. Just think of, e.g., FUSE or io_uring or mmap, which are in the public kernel. You can surely imagine that Google might brew up similar features.
(I used to work at Google. But I didn't have any special insider knowledge about their kernel stuff. Which is good, so I can't violate any lingering NDAs here with my speculation..)
Kind of, I've seen some non-public Intel SKUs in the fleet but afaik they just had tweaked core counts and clocks for better perf/watt. They were also exploring non-intel architectures at some point, ppc especially but no significant prod deployment yet. And then I'm sure you've heard of special purpose hardware like TPUs