
This is something that always bothered me while I was working at Google too: we had an amazing compute and storage infrastructure that kept getting crazier and crazier over the years (in terms of performance, scalability and redundancy) but everything in operations felt slow because of the massive size of binaries. Running a command line binary? Slow. Building a binary for deployment? Slow. Deploying a binary? Slow.

The answer to an ever-increasing size of binaries was always "let's make the infrastructure scale up!" instead of "let's... not do this crazy thing maybe?". By the time I left, there were some new initiatives towards the latter and the feeling that "maybe we should have put limits much earlier" but retrofitting limits into the existing bloat was going to be exceedingly difficult.


There's a lot of tooling built on static binaries:

- Google-Wide Profiling: the core C++ team can collect data on what % of fleet CPU is spent in absl::flat_hash_map re-bucketing (you can find papers on this publicly)

- crashdump telemetry

- Dapper stack trace -> codesearch

Borg literally had to pin the bash version because letting the bash version float caused bugs. I can't imagine how much harder debugging L7 proxy issues would be if I had to follow a .so rabbit hole.

I can believe shrinking binary size would solve a lot of problems, and I can imagine ways to solve the .so versioning problem, but for every problem you mention I can name multiple other probable causes (e.g. was startup time really execvp time, or was it networked deps like FFs?).


We are missing tooling to partition a huge binary into a few larger shared objects.

As my post https://maskray.me/blog/2023-05-14-relocation-overflow-and-c... explains (linked by the author, thanks! But I maintain lld/ELF rather than "wrote" it; it's the engineering work of many folks).

Quoting the relevant paragraphs below:

## Static linking

In this section, we will deviate slightly from the main topic to discuss static linking. By including all dependencies within the executable itself, it can run without relying on external shared objects. This eliminates the potential risks associated with updating dependencies separately.

Certain users prefer static linking or mostly static linking for the sake of deployment convenience and performance aspects:

* Link-time optimization is more effective when all dependencies are known. Providing shared object information during executable optimization is possible, but it may not be a worthwhile engineering effort.

* Profiling techniques are more efficient dealing with one single executable.

* The traditional ELF dynamic linking approach incurs overhead to support [symbol interposition](https://maskray.me/blog/2021-05-16-elf-interposition-and-bsy...).

* Dynamic linking involves PLT and GOT, which can introduce additional overhead. Static linking eliminates the overhead.

* Loading libraries in the dynamic loader has a time complexity `O(|libs|^2*|libname|)`. The existing implementations are designed to handle tens of shared objects, rather than a thousand or more.

Furthermore, the current lack of techniques to partition an executable into a few larger shared objects, as opposed to numerous smaller shared objects, exacerbates the overhead issue.

In scenarios where the distributed program contains a significant amount of code (related: software bloat), employing full or mostly static linking can result in very large executable files. Consequently, certain relocations may be close to the distance limit, and even a minor disruption (e.g. add a function or introduce a dependency) can trigger relocation overflow linker errors.


> We are missing tooling to partition a huge binary into a few larger shared objects

Those who do not understand dynamic linking are doomed to reinvent it.


There’s no way my proxy binary actually requires 25GB of code, or even the 3GB it is. Sounds to me like the answer is a tree shaker.

Google implemented the C++ equivalent of a tree shaker in their build system around 2009.

The front-end services that are supposed to be "fast" AFAIK probably include nearly all the services you need in order to avoid hops -- so you can't really shake that much away.

Maybe I am missing something, but why didn't they just leverage dynamic libraries?

When I was at Google, on an SRE team, here is the explanation that I was given.

Early on Google used dynamic libraries. But weird things happen at Google scale. For example Google has a dataset known, for fairly obvious reasons, as "the web". Basically any interesting computation with it takes years. Enough to be a multiple of the expected lifespan of a random computer. Therefore during that computation, you have to expect every random thing that tends to go wrong, to go wrong. Up to and including machines dying.

One of the weird things that become common at Google scale is cosmic bit flips. With static binaries, you can figure out that something went wrong, kill the instance, launch a new one, and you're fine. That machine will later launch something else and also be fine.

But what happens if there was a cosmic bit flip in a dynamic library? Everything launched on that machine will be wrong. This has to get detected, then the processes killed and relaunched. Since this keeps happening, that machine is always there lightly loaded, ready for new stuff to launch. New stuff that...winds up broken for the same reason! Often the killed process will relaunch on the bad machine, failing again! This will continue until someone reboots the machine.

Static binaries are wasteful. But they aren't as problematic for the infrastructure as detecting and fixing this particular condition. And, according to SRE lore circa 2010, this was the actual reason for the switch to static binaries. And then they realized all sorts of other benefits. Like having a good upgrade path for what would normally be shared libraries.


> But what happens if there was a cosmic bit flip in a dynamic library?

I think there were more basic reasons we didn't ship shared libraries to production.

1. They wouldn't have been "shared", because every program was built from its own snapshot of the monorepo, and would naturally have slightly different library versions. Nobody worried about ABI compatibility when evolving C++ interfaces, so (in general) it wasn't possible to reuse a .so built at another time. Thus, it wouldn't actually save any disk space or memory to use dynamic linking.

2. When I arrived in 2005, the build system was embedding absolute paths to shared libraries into the final executable. So it wasn't possible to take a dynamically linked program, copy it to a different machine, and execute it there, unless you used a chroot or container. (And at that time we didn't even use mount namespaces on prod machines.) This was one of the things we had to fix to make it possible to run tests on Forge.

3. We did use shared libraries for tests, and this revealed that ld.so's algorithm for symbol resolution was quadratic in the number of shared objects. Andrew Chatham fixed some of this (https://sourceware.org/legacy-ml/libc-alpha/2006-01/msg00018...), and I got the rest of it eventually; but there was a time before GRTE, when we didn't have a straightforward way to patch the glibc in prod.

That said, I did hear a similar story from an SRE about fear of bitflips being the reason they wouldn't put the gws command line into a flagfile. So I can imagine it being a rationale for not even trying to fix the above problems in order to enable dynamic linking.

> Since this keeps happening, that machine is always there lightly loaded, ready for new stuff to launch. New stuff that...wind up broken for the same reason!

I did see this failure mode occur for similar reasons, such as corruption of the symlinks in /lib. (google3 executables were typically not totally static, but still linked libc itself dynamically.) But it always seemed to me that we had way more problems attributable to kernel, firmware, and CPU bugs than to SEUs (single-event upsets).


Thanks. It is nice to hear another perspective on this.

But here is a question: how much of SEUs not being a problem was because they genuinely weren't a problem, versus because there were mitigations in place that limited their potential severity? (The other problems that you name are harder to mitigate.)


Memory and disk corruption definitely were a problem in the early days. See https://news.ycombinator.com/item?id=14206811 for example. I also recall an anecdote about how the search index basically became unbuildable beyond a certain size due to the probability of corruption, which was what inspired RecordIO. I think ECC RAM and transport checksums largely fixed those problems.

It's pretty challenging for software to defend against SEUs corrupting memory, especially when retrofitting an existing design like Linux. While operating Forge, we saw plenty of machines miscompute stuff, and we definitely worried about garbage getting into our caches. But my recollection is that the main cause was individual bad CPUs. We would reuse files in tmpfs for days without reverifying their checksums, and while we considered adding a scrubber, we never saw evidence that it would have caught much.

Maybe the CPU failures were actually due to radiation damage, but as they tended to be fairly sticky, my guess is something more like electromigration.


As a developer depending on the infrastructure and systems you guys make reliable every day inside Google, Bless You. Truly.

When Forge has a problem, I might as well go on a nature hike.


In Azure - which I think is at Google scale - everything is dynamically linked. Actually a lot of Azure is built on C# which does not even support static linking...

Static linking being necessary for scaling does not pass the smell test for me.


I never worked for Google, but I have seen some strange things like bit flips at more modest scales. From the parent description, it looks like defaulting to static binaries helps speed up troubleshooting by removing the "this should never happen, but statistically will happen every so often" class of bugs.

As I see it, the issue isn’t requiring static compiling to scale. It’s requiring it to make troubleshooting or measuring performance at scale easier. Not required, per se, but very helpful.


Exactly. SRE is about monitoring and troubleshooting at scale.

Google runs on a microservices architecture. It's done that since before that was cool. You have to do a lot to make a microservices architecture work. Google did not advertise a lot of that. Today we have things like Datadog that give you some of the basics. But for a long time, people who left Google faced a world of pain because of how far behind the rest of the world was.


Azure's devops record is not nearly as good as Google's was.

The biggest datasets that ChatGPT is aware of being processed in complex analytics jobs on Azure are roughly a thousand times smaller than an estimate of Google's regularly processed snapshot of the web. There is a reason why most of the fundamental advancements in how to parallelize data and computations - such as MapReduce and Bigtable - all came from Google. Nobody else worked at their scale before they did. (Then Google published it, and people began to implement it. And then they failed to understand what was operationally important to making it actually work at scale...)

So, despite how big it is, I don't think that Azure operates at Google scale.

For the record, back when I worked at Google, the public internet was only the third largest network that I knew of. Larger still was the network that Google uses for internal API calls. (Do you have any idea how many API calls it takes to serve a Google search page?) And larger still was the network that kept data synchronized between data centers. (So, for example, you don't lose your mail if a data center goes down.)


Perhaps that's why Azure has such a bad reputation in the devops crowd.

Does AWS have a good reputation in devops? Because large chunks of AWS are built on Java - which also does not offer static linking (bundling a bunch of *.jar files into one exe does not count as static linking). Still does not pass the smell test.

In AWS, only the very core Infra-as-a-Service that they dogfood can be considered "good". Everything else that's more Platform-as-a-Service can be considered a half-baked leaky abstraction. Anything they release as "GA", especially around re:Invent, should be avoided for a minimum of 6-12 months, since it's more like a public beta with some guaranteed bugs.

In AWS, only the very core Infra-as-a-Service that they dogfood can be considered "good" - large chunks of which are, by the way, written in Java. I think you are proving my point...

Which just means Java isn't affected? Or your definition, which doesn't count bundled (rather than shared) jars as static linking, is wrong, since they achieve the same effect.

> But what happens if there was a cosmic bit flip in a dynamic library?

You'd need multiple of those, because you have ECC. Not impossible, but getting all those dice rolled the same way requires even bigger scale than Google's.


Sounds like Google should put their computers at Homestake

One reason is that using static binaries greatly simplifies the problem of establishing Binary Provenance, upon which security claims and many other important things rely. In environments like Google’s it's important to know that what you have deployed to production is exactly what you think it is.

See for more: https://google.github.io/building-secure-and-reliable-system...


> One reason is that using static binaries greatly simplifies the problem of establishing Binary Provenance, upon which security claims and many other important things rely.

It depends.

If it is a vulnerability stemming from libc, then every single binary has to be re-linked and redeployed, which can lead to a situation where something has been accidentally left out due to an unaccounted-for artefact.

One solution could be bundling the binary (or a set of related binaries) with the operating system image, but that would incur a multidimensional overhead that would be unacceptable for most people, and then we would be talking about «an application binary statically linked into the operating system», so to speak.


> If it is a vulnerability stemming from libc, then every single binary has to be re-linked and redeployed, which can lead to a situation where something has been accidentally left out due to an unaccounted-for artefact.

The whole point of Binary Provenance is that there are no unaccounted-for artifacts: Every build should produce binary provenance describing exactly how a given binary artifact was built: the inputs, the transformation, and the entity that performed the build. So, to use your example, you'll always know which artefacts were linked against that bad version of libc.

See https://google.github.io/building-secure-and-reliable-system...


I am well aware of and understand that.

However,

> […] which artefacts were linked against that bad version of libc.

There is one libc for the entire system (a physical server, a virtual one, etc.), including the application(s) that have been deployed into an operating environment.

In the case of the entire operating environment (the OS + applications) being statically linked against a libc, the entire operating environment has to be re-linked and redeployed as a single concerted effort.

In dynamically linked operating environments, only the libc needs to be updated.

The former is a substantially more laborious and inherently riskier effort unless the organisation has achieved a sufficiently large scale where such deployment artefacts are fully disposable and the deployment process is fully automated. Not many organisations practically operate at that level of maturity and scale, with FAANG or similar organisations being a notable exception. It is often cited as an aspiration, yet the road to that level of maturity is winding and, in real life, fraught with shortcuts that result in binary provenance being ignored or rendered irrelevant. The expected aftermath is, of course, a security incident.


What is the point you're trying to make?

I claimed that Binary Provenance was important to organizations such as Google where it is important to know exactly what has gone into the artefacts that have been deployed into production. You then replied "it depends" but, when pressed, defended your claim by saying, in effect, that binary provenance doesn't work in organizations that have immature engineering practices where they don't actually follow the practice of enforcing Binary Provenance.

But I feel like we already knew that practices don't work unless organizations actually follow them.

So what was your point?


My point is that static linking, alone and by itself, does not meaningfully improve binary provenance; from a provenance standpoint it is mostly expensive security theatre, because a statically linked binary is more opaque from a component-attribution perspective, unless an inseparable SBOM (cryptographically tied to the binary) plus signed build attestations are present.

Static linking actually destroys the boundaries that a provenance consumer would normally want: global code optimisation, (sometimes heavy) inlining, LTO, dead-code elimination and the like erase dependency identities, rendering them irrecoverable from the binary in a trustworthy way. It is harder to reason about and audit a single opaque blob than a set of separately versioned shared libraries.

Static linking, however, is very good at avoiding «shared/dynamic library dependency hell» which is a reliability and operability win. From a binary provenance standpoint, it is largely orthogonal.

Static linking can improve one narrow provenance-adjacent property: fewer moving parts at deploy and run time.

The «it depends» part of the comment concerned the FAANG-scale level of infrastructure and operational maturity where the organisation can reliably enforce hermetic builds and dependency pinning across teams, produce and retain attestations and SBOMs bound to release artefacts, rebuild the world quickly on demand and roll out safely with strong observability and rollback. Many organisations choose dynamic linking plus image sealing because it gives them similar provenance and incident-response properties with less rebuild pressure at a substantially smaller cost.

So static linking mainly changes operational risk and deployment ergonomics, not the evidentiary quality of where the code came from and how it was produced, whereas dynamic linking may yield better provenance properties when the shared libraries themselves have strong identity and distribution provenance.

NB: please do note that the diatribe is not directed at you in any way; it is an off-hand remark aimed at people who ascribe purported benefits to static linking because «Google does it», without taking into account the overall context, maturity and scale of the operating environment Google et al. operate at.


Sounds like Google could really use Nix

I think Google of all companies could build a good autostripper, reducing binaries by adding partial-load assembly on misses. It can't be much slower than shovelling a full monorepo assembly plus symbols into RAM.

The low-hanging fruit is just not shipping the debuginfo, of course.

Is compressed debug info a thing? It seems likely to compress well, and if it's rarely used then it might be a worthwhile thing to do?

It is: https://maskray.me/blog/2022-01-23-compressed-debug-sections

But the compression ratio isn't magical (approx. 1:0.25, for both zlib and zstd in the examples given). You'd probably still want to set aside debuginfo in separate files.


Small brained primate comment.

With embedded firmware you only flash the .text and other loadable sections to the device. But you can still debug using the .elf file. In my case, if I get a bus fault I'll pull the offending address off the stack and use binutils and the .elf to show me who was naughty. I think if you have a crash dump you should be able to make sense of things as long as you keep the unstripped .elf file around.
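
To make "pull the offending address off the stack" concrete, here is a rough sketch of the kind of handler I mean (my own illustration, assuming the standard Cortex-M exception frame; the function name and logging are hypothetical, and a tiny assembly shim normally picks MSP vs. PSP and passes the frame pointer in):

    #include <stdint.h>

    /* Called from an assembly shim with a pointer to the stacked
     * exception frame: r0, r1, r2, r3, r12, lr, pc, xpsr. */
    void hard_fault_handler_c(const uint32_t *frame) {
        uint32_t faulting_pc = frame[6];  /* address to resolve against the .elf */
        uint32_t faulting_lr = frame[5];  /* caller, if the PC alone is ambiguous */

        /* Stash these somewhere that survives a reset (noinit RAM, a UART...),
         * then look them up on the host with the unstripped .elf, e.g. using
         * binutils' addr2line or objdump. */
        (void)faulting_pc;
        (void)faulting_lr;
        for (;;) { /* halt so a debugger can attach */ }
    }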


This is explained in the original article linked in the first sentence: https://blogsystem5.substack.com/p/ssh-agent-forwarding-and-...


The image? Yes. The text? Not at all.


Thanks!

> I moved away from FreeBSD to Debian for hosting my things because the process/daemon management was too tricky.

It indeed is tricky. To be honest, I wasn't "put off" by it because I've been using BSDs and old-style Linux startup systems for almost 30 years now... but the lack of abstraction shows, and I don't think it's great.

The daemon(8) wrapper is neat for integrating pre-existing servers into rc.d, but I do not fancy having to deal with that "by hand", nor creating a shell script to manage my own service (related post from a few years ago: https://jmmv.dev/2020/08/rcd-libexec-etc.html), nor having something entirely separate to manage log rotation.

As much hate as systemd gets, I do think being declarative (and doing so in a DSL that's not a programming language) and having a true process "supervisor" is a better model. BUT, as I mentioned in this article, I also like the "no churn" of the BSDs: what I learned and refined over ~30 years still applies to this day, and I won't be bitten by surprises.
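
For contrast, this is roughly what the declarative model buys you (a hypothetical unit for a made-up service; the names are invented but the fields are standard systemd ones):

    [Unit]
    Description=Example HTTP service
    After=network.target

    [Service]
    ExecStart=/usr/local/bin/example-httpd --config /etc/example-httpd.conf
    Restart=on-failure
    User=examplehttpd
    # stdout/stderr go to the journal, so no separate log-rotation script

    [Install]
    WantedBy=multi-user.target

The supervisor handles restarts and log capture itself; there is no PID file and no hand-written shell involved.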


Not GP, but I do prefer the very direct control you get with rcctl (OpenBSD), openrc (Alpine),... systemd often feels like autoconf: it's needed when you really want the capabilities, but otherwise the opaqueness and complexity feel cumbersome when you're dealing with a simple service.

I do like the Unix way of having different components handling different tasks instead of having different things which are entangled with each other. It encourages transparency.


Even with daemon(8), PID files and the lack of process supervision might be my least favorite aspect of FreeBSD, an OS I like overall. Not long ago, I wanted to avoid running a custom service that way on a fresh FreeBSD server. After researching my options, I found an adequate solution in the daemontools family. I'd heard of daemontools but hadn't paid much attention to it.

My service has been managed by runit and, most recently, nitro (https://github.com/leahneukirchen/nitro). Both have run as the service's user. They supervise the process and handle logging. I have found the design of daemontools and its derivatives runit and nitro elegant; it lives up to the reputation.


I have been using daemontools for managing my services on FreeBSD servers that have run 24/7 for almost a quarter of a century, with downtimes of an hour or so only at intervals of a few years of continuous running, whenever I made a hardware upgrade or a complete OS replacement (moving to another major version of FreeBSD).

Now there are several daemontools derivatives that bring it more up to date, but even the ancient original version did most of what one would need for reliable service management.


> As much hate as systemd gets, I do think being declarative (and doing so in a DSL that's not a programming language) and having a true process "supervisor" is a better model.

I've been playing with dinit for a bit now; it combines a lot of the nice advantages of systemd with a finite scope and being portable across OSs.


> It gets better though! Since this is a very common operation, x86 CPUs spot this “zeroing idiom” early in the pipeline and can specifically optimise around it: the out-of-order tracking systems knows that the value of “eax” (or whichever register is being zeroed) does not depend on the previous value of eax, so it can allocate a fresh, dependency-free zero register renamer slot.

While this is probably true ("probably" because I haven't checked it myself, but it makes sense), the CPU could do the exact same thing for "mov eax, 0", couldn't it? (Does it?)


Most Intel/AMD CPUs do the same thing for a few alternative instructions, e.g. "sub rax, rax".

I do not think that anyone bothers to do this for a "mov eax, 0", because neither assembly programmers nor compilers use such an instruction. Both "xor reg,reg" and "sub reg,reg" have been the recommended instructions for clearing registers since 1978, i.e. since the launch of the Intel 8086, because the 8086 lacked a "clear" instruction like the one found on competing CPUs from DEC or Motorola.

One should remember that what is improperly named "exclusive or" in computer jargon is actually simultaneously addition modulo 2 and subtraction modulo 2 (because these 2 operations are identical; the different methods of carry and borrow generation distinguish addition from subtraction only for moduli greater than 2).

The subtraction of a thing from itself is null, which is why clearing a register is done by subtracting it from itself, either with word subtraction or with bitwise modulo-2 subtraction, a.k.a. XOR.

(The true "exclusive or" operation is a logical operation distinct from the addition/subtraction modulo 2. These 2 distinct operations are equivalent only for 2 operands. For 3 or more operands they are different, but programmers still use incorrectly the term XOR when they mean the addition modulo 2 of 3 or more operands. The true "exclusive" or is the function that is true only when exactly one of its operands is true, unlike "inclusive" or, which is true when at least one of its operands is true. To these 2 logical "or" functions correspond the 2 logical quantifiers "There exists a unique ..." and "There exists a ...".)
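
A quick illustration of that last parenthetical (my own sketch): take three operands that are all true, and the two notions disagree.

    #include <assert.h>
    #include <stdbool.h>

    int main(void) {
        bool a = true, b = true, c = true;

        /* "XOR" as used in computing: addition modulo 2, chained pairwise. */
        bool sum_mod_2 = a ^ b ^ c;           /* 1 + 1 + 1 = 3; 3 mod 2 = 1 -> true */

        /* The literal "exclusive or": exactly one operand is true. */
        bool exactly_one = (a + b + c) == 1;  /* three operands are true -> false */

        assert(sum_mod_2 && !exactly_one);
        return 0;
    }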


> couldn't it? (Does it?)

It could of course. It can do pretty much any pattern matching it likes. But I doubt very much it would because that pattern is way less common.

As the article points out, the XOR saves 3 bytes of instructions for a really, really common pattern (to zero a register, particularly the return register).

So there's very good reason to perform the XOR preferentially and hence good reason to optimise that very common idiom.

Other approaches, e.g. adding a new "zero <reg>" instruction, are basically worse: they're not backward compatible and don't really improve anything other than making the assembly a tiny bit more human-readable.


Sure, lots of longer instructions have this effect. "xor eax,eax" is interesting because it's short. That zero immediate in "mov eax,0" is bigger than the entire "xor eax,eax" instruction.


I believe it does in some newer CPUs. It takes extra silicon to recognize the pattern though, and compilers emit the xor because the instruction is smaller, so I doubt there is much speed up in real workloads.


> the CPU could do the exact same thing for "mov eax, 0", couldn't it?

Yes, it could, but mov eax, 0 is still going to be five or six bytes of instruction (depending on the encoding) in cache, and fetched, and decoded, so optimizing on the shorter version is marginally better.


Yes, "mov r, imm" also breaks dependencies -- but the immediate needs to be encoded, so the instruction is longer.


Yep, I thought so too! It’s a very interesting and fun topic.

So… I cannot resist redirecting you to a set of articles I wrote about two years ago on DOS and memory management, which I think covers some of the basic parts. The first one is https://open.substack.com/pub/blogsystem5/p/from-0-to-1-mb-i...


Donated! I should have done this months ago when I started using NetBSD for an embedded project idea (that has gone nowhere).

But I feel this link illustrates a big problem with NetBSD's "no hype" approach: I clicked the link you shared and found an email. The email has the donation link at the very end, and it's not clickable. When I go to the donation page, there is a ton of text before I even get to see a tiny, ugly PayPal button or a tiny form to donate via Stripe.

It’s too hard to notice and too hard to do. The project’s homepage does a better job though. But I think it should be made even more prominent if this is critical for the project’s health!


Thanks a lot for the donation :)


That may be technically true but…

Linux (the kernel) may have been ported to more machines and architectures than NetBSD’s kernel, yes. But is all the code present in the same source tree or do you have to go find patch sets or unofficial branches?

More importantly: is there a modern distribution that builds an installable system for that platform?

The special thing about NetBSD is that you get the portability out of a single and modern tree for many more platforms than any single Linux distribution offers.


The Linux ecosystem is removing support for lots of stuff: the distros especially, but also the kernel.


You said the same thing I did with extra steps.


That’s because… I misread your comment.

In any case, NetBSD is not well known and “why bother because Linux also runs everywhere too” so I thought it was worth explaining.


Sadly the BSDs are not well known.

I asked a major employer why they're using Linux + Apache for an RP (reverse proxy) when OpenBSD + HAProxy + CARP is a significantly better option. Crickets.

I want a good laptop for OpenBSD (or FreeBSD, at the least) that isn't 10 years old and doesn't weigh 5+ lbs.


> Ok, this post is mostly about text-based IDEs, but I think the point mostly stands as well for IDEs in general. I'm thinking about Visual Basic or Delphi.

Exactly. I recently recorded a video of me creating a toy app with VB3 on Windows 3.11 and the corresponding tweet went “viral” for similar reasons as this article.

It’s not really about the TUI: it’s about the integrated experience as you say!


Hey, thanks for sharing this again! FYI, previous discussion from 2 years ago now (wow, time flies...): https://news.ycombinator.com/item?id=38792446


Thanks! Macroexpanded:

IDEs we had 30 years ago - https://news.ycombinator.com/item?id=38792446 - Dec 2023 (603 comments)

