Very interesting writeup.
Distributed systems problem solving is always a very interesting process. It very frequently uncovers areas ripe for instrumentation improvement.
The Erlang ecosystem seemed very mature and well iterated. It almost seemed like the "Rails of distributed systems" with things like Mnesia.
The one downside to that seemed to be that, while I was working on grokking the system, the limits and observability of some of these built-in solutions were not so clear. What happens when a mailbox exceeds its limit? Does the data get dropped? Or how do you recover from a network segmentation? These proved somewhat challenging to reproduce and troubleshoot (as distributed problems can be).
There are answers for all of these interesting scenarios, but in some cases it would almost have been simpler to use an external technology (Redis, etc.) with established scalability and observability.
I say this knowing there was plenty I did not get time to learn about the ecosystem in the depth I desired, but I was curious how more experienced Erlang engineers viewed the problem.
The Erlang Manual does a pretty good job of describing the system limits and how to configure them, as well as the failure modes of various components.
By default, a mailbox will continue to fill up until the process reaches its configured max heap size (which by default is unlimited, i.e. the process heap will grow until the system runs out of memory, eventually crashing the node it's running on). However, you can configure this on a process-by-process basis by specifying a max heap size and what to do when that limit is reached. This is described in the docs, but as you mentioned, it's not necessarily apparent to newcomers.
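For example, a minimal sketch of that per-process configuration (module name, limit, and option values are mine, not from the docs or the post; max_heap_size is available from OTP 19 on):

    %% bounded_worker.erl -- illustrative only: bound a worker's heap so an
    %% unbounded mailbox crashes this one process instead of the whole node.
    -module(bounded_worker).
    -export([start/0, loop/0]).

    start() ->
        erlang:spawn_opt(fun ?MODULE:loop/0,
                         [{max_heap_size,
                           #{size => 1000000,        %% limit in words, not bytes
                             kill => true,           %% kill the process when exceeded
                             error_logger => true}}, %% emit an error report when killed
                          {message_queue_data, on_heap}]).

    loop() ->
        receive
            Msg ->
                io:format("got ~p~n", [Msg]),
                loop()
        end.

With kill => true the runaway process is terminated (and its supervisor, if any, can restart it) instead of the whole node going down.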
But aside from that scenario, I think a lot of the interesting failure scenarios are really sensitive to what the system is doing. For example, network partitioning can either be a non-issue or critical, depending on how you are managing state in the system. As a result, I don't think there is too much that really digs deep into those problems, because it turns out to be really hard to document that kind of knowledge in a generic fashion - or at least that's how it feels to me. Everyone I've worked with has built up a toolbox of techniques they use for the task at hand, and does their best to share them when they can. It's unfortunate there isn't really a one-stop shop of such information out there though.
I think it's probably also good advice for newcomers to remember that you don't have to use something just because it's there (like Mnesia) instead of something you are already running or are more familiar with that solves the same problem (e.g. Redis).
Agreed on all parts.
The Erlang and OTP manuals were very nicely written, and I was able to reason about most aspects of the system pretty well from reading them.
I did a bit more research after writing up my comment (my mind got a bit too focused on it to let it go) and found this great resource about handling various system load scenarios: https://ferd.ca/handling-overload.html
I'll +1 your pragmatic comment on not adopting tools just because they're there.
Again, I no longer work in Erlang, but I find the systems, architecture, and problem solving particularly interest-piquing.
Now I'm off to look up production use-cases where Mnesia was the most pragmatic solution.
Fred's blog and Learn You Some Erlang for Great Good are invaluable, but on the topic of production systems, his ebook Erlang in Anger (https://www.erlang-in-anger.com/) is excellent as well - honestly it's hard to overstate just how much good he's done for the community in terms of documenting and philosophizing about Erlang, architecture, and operating production systems. He's solid gold!
The ability to format Mnesia tables in such a way that exporting them over SNMP is trivial was an absolute joy to work with for me about 12 years ago!
I was able to stand up a quick management solution for a rather complex system as a one-person team using a combination of Erlang, Mnesia, and port drivers to various back-ends written in Python, C, C++ and Haskell. It was the most productive I think I've ever been on any project in my entire career so far.
And I'd love to get back to that feeling of just kicking ass every day.
1) The lack of a useful backpressure mechanism on mailbox size is a long-standing issue in Erlang. It tends not to be too big a problem in practice, since overflows come from simple bugs that stop your system immediately and are found during development, rather than subtle conflicts that lurk around.
2) Erlang clustering isn't really supposed to withstand splits. Erlang was built for phone switches before the internet, and while it is "distributed", the clustering is between a few boards or boxes that are wired together in the same rack, maybe over a hardwired LAN, not through the vicissitudes of a routed network like the internet connecting remote cities. So the cluster has to keep running if a node is out, but the idea is that the node has crashed (or, maybe equivalently, the network wire has been cut), not that the connection between nodes has somehow become flaky in a way that can be fixed with retries. Of course, ordinary network connections are normally supervised under OTP and they do get restarted.
I actually really like the "mailboxes are on the receiver's heap" thing. One of the big reasons why Go's channels bother me is that they're bounded. This makes things far more complicated, because they're not bound to a goroutine, and an unresponsive goroutine can't be terminated.
> What happens when a mailbox exceeds its limit? Does the data get dropped?
I thought there was some movement towards limits on mailboxes, but I can't find any documentation now, so I'm not sure if that happened? If not (or if you haven't configured it anyway), there is no explicit limit; your mailboxes can grow until you run out of memory, either by hitting a ulimit, or malloc failing, or maybe your OS just killing processes (probably the BEAM process, because it's the biggest). In the first two cases you'll get a nice crash dump from BEAM, but in all cases all messages are dropped, as BEAM is dead.

Edit: I see there's a process_flag(max_heap_size, MaxHeapSize) to set the maximum size of the heap, and if process_flag(message_queue_data, on_heap), the default, is set, messages will eventually end up on the heap. However, the maximum heap size is only checked during garbage collection, and IIRC GC can't be triggered when a message is added to the mailbox, only while the process is running, or if explicitly requested for the process (with erlang:garbage_collect/0 or /1); so if your process ends up blocked for a long time (or possibly forever), it could still accumulate a large mailbox without being killed by the heap size limit.
You can (and should!) regularly call process_info(Pid, message_queue_len) to observe the message queue length of all processes, and alert on large queues. You can then inspect the messages themselves and consider an appropriate response.
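A rough sketch of such a scan, pasteable into a shell (the threshold and names are just placeholders):

    %% List processes whose mailbox is longer than Threshold.
    %% process_info/2 returns undefined for already-dead processes;
    %% the generator pattern quietly skips those.
    QueuesOver = fun(Threshold) ->
        [{Pid, Len} || Pid <- erlang:processes(),
                       {message_queue_len, Len} <-
                           [erlang:process_info(Pid, message_queue_len)],
                       Len > Threshold]
    end.
    %% e.g. QueuesOver(100000).

If I remember its API right, recon:proc_count(message_queue_len, 10) from Fred's recon library gives you the top offenders with less ceremony.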
> Or how do you recover from a network segmentation? These proved somewhat challenging to reproduce and troubleshoot (as distributed problems can be).
Recovering from network segmentation is application dependent, and can often be tricky. Some applications can just reconnect and call it a day. Other applications may have accepted writes on both sides of the segmentation and need some sort of reconciliation process. Mnesia has hooks for this, but I don't remember seeing any examples, and the default logic is to just continue segmented even after the segmentation is over; this is usually not what you want, but at least it's consistent? I think it should be fairly easy to simulate and trigger network segmentation: just drop packets between selected hosts, although you'll need more work if you want to simulate stuff like congestion between hosts or congestion on only some paths between hosts (LACP is very nice, but debugging congestion on only some paths isn't as nice).
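For reference, Mnesia's partition-detection hook is the inconsistent_database system event. A very rough sketch (the "declare the remote side master and reload" recovery below is purely illustrative; the right choice depends entirely on the application):

    %% Illustrative only: get notified when Mnesia detects a partition and
    %% crudely recover by reloading tables from the remote node.
    {ok, _} = mnesia:subscribe(system),
    receive
        {mnesia_system_event, {inconsistent_database, _Context, Node}} ->
            error_logger:warning_msg("Mnesia partition detected vs ~p~n", [Node]),
            ok = mnesia:set_master_nodes([Node]),   %% pick a side to win
            stopped = mnesia:stop(),
            ok = mnesia:start()                     %% reload from the master
    end.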
On this particular issue, where I worked, we had a policy of flushing mailboxes that were too big (usually 1 million messages, which isn't the Erlang way, and wasn't in public OTP, but keeps a node running at least), and we wouldn't have tried to log all of the messages in a mailbox, because 1 million messages or whatever is way too many to log. Pretty printing with no limits is dangerous, even if it doesn't include a ton of references to the same big thing. We also didn't tend to use anonymous functions/closures, but that's just a happy accident: we were using Erlang before crash dumps had line numbers, and anonymous functions are hard to track down, so it's easier to give them a real name and use that instead. Of course, there's some places where closures are way more convenient than explicitly passing Terms to Funs, so it's not that we never used them, just they were rare, and unlikely to show up many times in a single logging statement, like in this case.
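Just to illustrate the idea (a plain-Erlang sketch, not the mechanism we actually ran, which wasn't in public OTP): a flush of the calling process's own mailbox can be as simple as a drain loop.

    %% Drop up to Max messages from the calling process's mailbox, stopping
    %% early once the mailbox is empty.
    Flush = fun Loop(0) -> ok;
                Loop(N) ->
                    receive
                        _Dropped -> Loop(N - 1)
                    after 0 -> ok
                    end
            end.
    %% e.g. Flush(1000000).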
This is a great write-up. I love reading stuff like this, and Erlang/OTP/Kafka is definitely on my list of tech to investigate.
Slightly tangential, but what's the market like for Erlang developers? I know that it was originally developed for telecoms and phone switches, and WhatsApp uses (used?) it in their back-end. Are there particular business sectors that tend to use it now? Geographical spread, perm/contract, salaries, etc?
Erlang developers specifically seem to be slightly Euro-centric. WhatsApp is still built on Erlang. I'm not sure of the job market for it specifically, but if you're looking for any BEAM job, there's growing demand for Elixir developers, and those jobs pay quite well internationally.
I still hear Elixir/Phoenix getting tossed around at some startups in the U.S. Locally, CoverMyMeds has (or at least used to have) a decent amount of Erlang. I think there are a few big OSS projects too, RabbitMQ being the first that comes to mind.
That was really fun to read! Nice work digging into the root cause.
The issue where boxing State#state.partition copies the entire state object is very counter-intuitive and would have got me as well. I would expect it to only store the partition value.
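For anyone who hasn't hit this yet, a small made-up demo of the retention effect (not code from the article): the fun's environment holds the whole State variable, not just the field you read from it.

    %% capture_demo.erl -- illustrative only.
    -module(capture_demo).
    -export([demo/0]).
    -record(state, {partition, big_blob}).

    demo() ->
        State = #state{partition = 1, big_blob = lists:seq(1, 100000)},
        %% F1 closes over State, so it keeps the whole record (blob included) alive.
        F1 = fun() -> State#state.partition end,
        %% Binding the field first means F2 only holds the small value.
        Partition = State#state.partition,
        F2 = fun() -> Partition end,
        %% erts_debug:size/1 counts reachable heap words: F1 is huge, F2 tiny.
        {erts_debug:size(F1), erts_debug:size(F2)}.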
This reminds me of how Rust 2021 switched to disjoint captures in closures (https://doc.rust-lang.org/edition-guide/rust-2021/disjoint-c...). Of course in Rust, there are behavioral and compiler error differences due to mutability and the borrow/lifetime checker, beyond merely the log output of equivalent referentially-transparent objects. Honestly I still prefer the C++ way of allowing explicit lambda captures.
In C++, if you capture an object by reference, it doesn't create a copy (but risks use-after-free; Rust's borrow checker is definitely a strength over C++). If you capture by value, it copies (which is an unwanted hidden allocation, but in the case of Rc/Arc, I wish Rust made incref as easy as C++).
The real benefit of C++ explicit captures is making it dead simple to capture exactly the state you need, and make it explicit to the reader: [bar=obj.foo.bar]() { ... }. Rust 2021 adds selective closure capture, the downside being that the exact fields captured (which affect what the lambda borrows) are implicit. To explicitly mark which fields you capture, you'd need to `let bar = obj.foo.bar; || {}` either in the scope of the lambda, or hide the lambda in a statement expression (creating a second level of indentation, risking rightward drift).
Additionally adding explicit lambda captures to Rust would alleviate its lack of copy semantics for smart pointers, allowing you to copy pointers into lambdas without polluting the parent scope or introducing a level of nesting: [ptr=Rc::clone(ptr)] || {...}. This would eliminate the need for gtk-rs's clone! macro for strong (not weak) references.
> In C++, if you capture an object by reference, it doesn't create a copy (but risks use-after-free; Rust's borrow checker is definitely a strength over C++). If you capture by value, it copies (which is an unwanted hidden allocation, but in the case of Rc/Arc, I wish Rust made incref as easy as C++).
Which has nothing to do with the question.
> The real benefit of C++ explicit captures is making it dead simple to capture exactly the state you need, and make it explicit to the reader
You could just have answered that it's the latter.
> To explicitly mark which fields you capture, you'd need to `let bar = obj.foo.bar; || {}` either in the scope of the lambda, or hide the lambda in a statement expression (creating a second level of indentation, risking rightward drift).
Oh no. Anyway.
> Additionally adding explicit lambda captures to Rust would alleviate its lack of copy semantics for smart pointers, allowing you to copy pointers into lambdas without polluting the parent scope or introducing a level of nesting: [ptr=Rc::clone(ptr)] || {...}.
That's literally just a repetition of your previous sentence.
> This would eliminate the need for gtk-rs's clone! macro for strong (not weak) references.
If it would, then you already do not need it, because you can use the precise capture pattern (a block expression around a `move` closure) in order to create your clone and the clone macro is just a convenience.
> That doesn’t solve overly attached partial captures in any way though?
> Which has nothing to do with the question.
Update: now that I think about it, Rust's borrow checker (not C++'s) is a good warning system for undesired captures, since if you capture an object rather than a field, Rust will likely throw an error when you try to mutate the captured object or return the lambda past the lifetime of the object reference. And Rust 2021 moves to capturing struct fields by default to avoid undesired captures, but I prefer C++'s more flexible capture system (though not its absence of a borrow checker), which allows concisely and ergonomically picking exactly which variables/fields/expressions to capture.
I still feel Rust's way of adding explicit lambda captures (polluting the outer scope or introducing two layers of nesting) is deeply unsatisfactory, and its approach to turning off disjoint captures (a dummy `let _ = &x; x.field`, see https://doc.rust-lang.org/edition-guide/rust-2021/disjoint-c...) is also unsatisfactory. If you actually feel that having to type extra and clutter the code is "good enough" for Rust and doesn't punish the use, then analogously C++'s std::variant and std::get_if<>() and std::visit() (deeply flawed sad excuses) ought to be "good enough" to take the place of `match` and `if let`, and so would making `var` the default and `const var` take extra typing and clutter the code.
Languages encourage the behavior they make easy and discourage the behavior they make difficult. Rust discourages making lambda captures explicit, because you need to pollute the outer scope or define a nested scope, and even with a nested scope you can still access variables you didn't pseudo-capture (meaning you can't enforce explicit captures). Additionally, I do not appreciate your flippant trolling and dismissal of my arguments.
> If it would, then you already do not need it, because you can use the precise capture pattern (a block expression around a `move` closure) in order to create your clone and the clone macro is just a convenience.
gtk-rs requires shared mutability. Rust discourages shared mutability. I find pre-cloning shared pointers for every GUI method I want to bind sufficiently painful (as opposed to merely lacking a convenience) that I would avoid using gtk-rs if not for the clone! macro (and even with it, subclassing was confusing enough and GTK4 so buggy on X11 that I quit anyway).
> That's literally just a repetition of your previous sentence.
No, my previous sentence described (or was intended to describe) capturing portions of structs (usually by value or trivial copy, always without mutating the original data), whereas this one describes invoking copy constructors of refcounted pointers within the capture block (mutating data visible through the original pointer).
I wish there was a "flag" feature for HN to flag articles as non-tech, and then a section that only showed tech articles. I miss the old tech-blog style front page with less politics (I wouldn't mind going to the new homepage every now and then, I just want tech focus to be the default).
I felt like a missing conclusion was "Kafka is a critical dependency". They'd started out with the assumption that Kafka is a soft dependency and found this library bug that made it a hard dependency (which they then patched).
But isn't going metrics-blind whenever Kafka goes down bad enough that you should push more effort into keeping Kafka alive?
Kafka being a soft dependency is not an assumption, it's a design goal. So the conclusion should be to eliminate any hard dependencies on Kafka, if anything.
However, when it comes to metrics in particular, I'd say Kafka was still a soft dependency. The reason we lost our metrics during the incident was a scheduler lock-up blocking the collection of VM-level metrics. It's just coincidence that the scheduler lock was caused by brod in this case. Otherwise metrics would just flow to Graphite directly, never interacting with Kafka.
When it comes to the complete monitoring solution around Kred, there was one bit of it depending on Kafka: System monitor (https://github.com/klarna-incubator/system_monitor). This tool exports data that doesn't fit well into Graphite or Splunk, so we store it in Postgres instead. Due to pure laziness, at the time of the incident it was not writing directly to Postgres, but was just pushing the data to Kafka. (The reason is that we already had a service available to push data from Kafka to Postgres.) After the incident we eliminated Kafka from that path. I didn't mention this work in the post because it was only marginally related.
> So our initial 1 GB binary data pretty printed as a string will take about 1 GB × 3.57 characters/byte × 2 words/character × 8 bytes/word = 57.12 GB memory.
Yeah, I saw that one in an Erlang system too. It was pretty ugly.
IMHO, the obvious takeaway for me here is that pretty printing debug data should not be part of the originator system. It's a-okay to dump a 1GB binary log. The pretty-printing logic should be located in whatever dedicated application is used to inspect those dumps.
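One way that split could look (just a sketch; note term_to_binary/1 also flattens shared subterms, so some size sanity check before dumping is still wise):

    %% On the production node: dump the raw term, no pretty printing.
    DumpTerm = fun(Path, Term) ->
        ok = file:write_file(Path, term_to_binary(Term, [compressed]))
    end.

    %% Later, in a separate shell/tool, do the expensive pretty printing:
    %% {ok, Bin} = file:read_file(Path),
    %% io:format("~p~n", [binary_to_term(Bin)]).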
Ok, I've looked at this article and it is pretty good. It sounds like there were various Erlang antipatterns in the program, but the actual bug was a user-level memory leak in an Erlang process that locked the scheduler, which isn't good. Also, the memory leak was amplified because it involved serializing to memory an object that contained a lot of repeated references to other objects. So the object itself, while fairly large, still used only a manageable amount of memory, but the serialized version's size (because of the repeated content) grew exponentially with the recursion depth. That in turn was due to an Erlang "optimization" that didn't try to indicate the repeated references in the object during serialization. Also of interest was using gdb on the Erlang node to debug this, since the usual Erlang interactive shell was hosed.
From what I heard, they tried to replace everything with a new Java system (the old "let's rewrite everything from scratch" trap) and didn't succeed, or only partially did (it may be used in some other region?). Too much legacy and too many features supported in the old system to be able to move away from it.
In the beginning (read: 2005) there was only one system, Kred. It was written in Erlang and did everything, so became a huge monolith over time.
About 10 years later Klarna started to change its system architecture (which makes total sense, since by that time it had grown into a successful unicorn, with a way more complex business than what it had at launch time).
First, instead of shovelling every new feature into the monolith, it introduced new (mostly micro-)services around it. This model is successful, and most of these new services are written in Java, but there are also Erlang ones and others, even some "exotic" ones like Haskell. (Also, through a series of acquisitions, Klarna got a bunch of tech from other companies, which, as you can imagine, was written in whatever language those other companies used.)
There was also an attempt to replace Kred, the old Erlang system, with a new one (which happened to be written in Java), and this project indeed was a failure in the sense that Kred is still around and nobody wants to decommission it nowadays. But the intended replacement system still proved useful, so it's not really a failure from that system's developers' point of view.