
I have no idea where this limit came from. I worked at WhatsApp[1], and while we did split nodes into separate clusters, I think our big cluster had around 2000 nodes when I was working on it.

Everything was pretty ok, except for pg2, which needed a few tweaks (the new pg module in Erlang 23 I believe comes from work at WhatsApp).

The big issue with pg2 on large clusters is locking of the groups when lots of processes try to join simultaneously. global:set_lock is very slow under heavy contention: each locker sends lock requests to every node, and if some nodes see A's request before B's while others see B's before A's, neither holds the lock everywhere, so both release and retry later; you only make progress when someone acquires the full lock. Applying the boss-node algorithm from global:set_lock_known makes progress much faster (assuming the dist mesh is, or becomes, stable). The new pg, I believe, doesn't take these locks anymore.
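Roughly the shape of that join path, as a sketch (made-up module name, not the real OTP code; the real pg2 lives in the kernel application): every simultaneous join of a group funnels through one cluster-wide lock.

    -module(pg2_join_sketch).
    -export([join/2]).

    %% Sketch only: every joiner serializes on a single cluster-wide lock,
    %% so simultaneous joins of the same group contend on the same lock id.
    %% global:set_lock (which global:trans wraps) only succeeds once one
    %% requester holds the lock on every node in the list; otherwise it
    %% backs off and retries.
    join(Group, Pid) when is_pid(Pid) ->
        LockId = {{pg2, Group}, self()},   %% {ResourceId, LockRequesterId}
        Nodes  = [node() | nodes()],
        global:trans(LockId,
                     fun() ->
                             %% the real pg2 tells every node's pg2 server
                             %% about the new member while holding the lock
                             broadcast_join(Nodes, Group, Pid)
                     end,
                     Nodes),
        ok.

    %% placeholder for the membership broadcast; irrelevant to the locking
    broadcast_join(_Nodes, _Group, _Pid) ->
        ok.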

The other problem with pg2 is a broadcast on node/process death that exists for backwards compatibility with something like Erlang R13 [2]. These messages are ignored when received, but in a large cluster that experiences a big network event, the number of sends can be enormous, which causes its own problems.
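A sketch of that pattern (invented names; the real code is at the link in [2]): one send per dead member per node, all of them thrown away on arrival, so a big network event means roughly dead members x nodes worth of useless sends.

    -module(pg2_compat_sketch).
    -export([compat_broadcast/1, handle_compat/1]).

    %% Sketch only: one compatibility message per dead member per node.
    %% [noconnect] just avoids setting up new dist connections for the send.
    compat_broadcast(DeadMembers) ->
        [erlang:send({pg2_sketch, Node}, {member_died, Pid}, [noconnect])
         || Node <- nodes(), Pid <- DeadMembers],
        ok.

    %% receiving side: kept only for compatibility with old releases,
    %% so the message is simply discarded
    handle_compat({member_died, _Pid}) ->
        ignored.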

Other than those issues, a large number of nodes was never a problem. I would still recommend building with fewer, larger nodes rather than a large number of smaller nodes; BEAM scales pretty well with lots of cores and lots of RAM, so it's nicer to run 10 twenty-core nodes than 100 dual-core nodes.
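As a rough illustration of why bigger boxes just get used (nothing WhatsApp-specific here): BEAM starts one scheduler thread per logical core by default, which you can check in any shell. The values below assume a hypothetical 20-core box.

    %% illustrative erl shell session; output depends on the machine
    1> erlang:system_info(schedulers_online).
    20
    2> erlang:system_info(logical_processors_available).
    20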

[1] I no longer work for WhatsApp or Facebook. My opinions are my own, and don't represent either company. Etc.

[2] https://github.com/erlang/otp/blob/5f1ef352f971b2efad3ceb403...



>I think our big cluster had around 2000 nodes when I was working on it.

Is this fairly recent? I thought WhatsApp was on FreeBSD with powerful nodes instead of lots of little nodes?

>BEAM scales pretty well with lots of cores and lots of ram, so it's nicer to run 10 twenty core nodes instead of 100 dual core nodes.

Something I was thinking about when reading about POWER10 [1]: what systems and languages could make use of a maximum of 15 cores x 16 sockets x SMT-8 in a single machine. That is 1920 threads!

[1] https://www.anandtech.com/show/15985/hot-chips-2020-live-blo...


Lots of powerful nodes. That cluster was all dual Xeon 2690v4. My in-depth knowledge of the clusters ends when they moved from FreeBSD at SoftLayer to Linux at Facebook. I didn't care for the environment and it made a nice boundary for me --- once I ran out of FreeBSD systems, I was free to go, and I didn't have to train people to do my job.

We did some trials of quad-socket x86, but didn't see good results. I didn't run the tests, but my guess from later reading is that we were probably running into NUMA issues and didn't know how to measure or address them. I have also seen that two dual-socket machines are often much less expensive than a quad-socket with the same total number of cores and equivalent speeds; with Epyc's core counts, single socket looks pretty good too. Keeping node count down is good, but it's a balance between operating costs, capital costs, and lead time for replacements.
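For what it's worth, the VM does expose enough to start looking at this: it reports the CPU topology it detected and whether scheduler threads are bound to cores. Binding is off by default and the right policy is workload-dependent, so treat this as a starting point, not a recipe.

    %% in an erl shell on the node in question
    erlang:system_info(cpu_topology).        %% what the VM thinks the cores/NUMA nodes look like
    erlang:system_info(scheduler_bind_type). %% requested binding policy (often 'unbound')
    erlang:system_info(scheduler_bindings).  %% per-scheduler logical cpu, or 'unbound' entries
    %% binding can be requested at start-up, e.g.: erl +sbt db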

The BEAM ecosystem is fairly small too, so you might be the only one running a 16-socket POWER10 beast, and you'll need to debug it. It might be a lot simpler to run 16 single-socket nodes. Distribution scales well for most problems too.



