Hacker News

> significantly cheaper per-process/task cost

Can you elaborate? BEAM processes are very lightweight - what does CLR/.NET do to make them even cheaper?



I think a big difference is that C# uses stackless coroutines while anything on the BEAM uses stackful ones. How much cheaper they are is a good question, however.


.NET's Task<T> + the state machine box for it emitted by Roslyn start at about 100B of heap-allocated memory, as of .NET 8/9.

The state machines generated by F# don't seem to be far behind either (I tested this before replying; F#'s asynchronous computations, aka async { }, appear to be much less efficient, however, so the guidance is to avoid them in favor of task { } and .NET's regular tasks).
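For illustration, the kind of method this refers to is just a regular async method; Roslyn lowers it into a state machine that is only boxed onto the heap when the method actually suspends. A minimal sketch (names are mine):

```csharp
using System;
using System.Threading.Tasks;

class Program
{
    // Roslyn lowers this method into a compiler-generated state machine.
    // If the awaited operation is already complete, the machine can stay
    // on the stack; only on a real suspension is it boxed onto the heap -
    // that box is the ~100 B figure mentioned above.
    static async Task<int> AddAsync(int x)
    {
        await Task.Yield(); // forces a real suspension point
        return x + 1;
    }

    static async Task Main()
    {
        Console.WriteLine(await AddAsync(41)); // prints 42
    }
}
```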

Notably, BEAM's processes each come with their own per-process GC, which adds cost every time a new process is spawned. In a similar vein, Go's goroutines pre-allocate memory for their growable virtual stacks (2 KiB initially in current Go).

.NET's tasks, as the sibling comment mentions, are stackless coroutines[0], so their memory usage is going to be much lower. They come with a different set of tradeoffs, but overall their cost is significantly lower because bytecode is JIT/AOT-compiled to native code, because the GC precisely tracks object liveness, and because .NET does not perform BEAM-style preemptive userspace scheduling.

Instead, .NET employs a work-stealing thread pool with hill-climbing thread-count scaling to achieve optimal throughput. When the workers cannot advance all submitted work items in time, additional threads are injected, which are then preempted by the kernel thread scheduler. This means that even if other workers are busy, work items will not wait in the queues indefinitely. That is a pathological case, though; usually the thread count stays between 1x and 2x the physical core count.

The downside is potentially worse scheduling fairness, and independent tasks that allocate can and do affect each other w.r.t. GC pause impact. I believe this to be a non-issue because it is more than compensated for by spending up to 10x less CPU time than BEAM to compute the same result, and significantly less memory too (I don't have hard numbers, but .NET is quite well behaved in terms of allocation traffic). At the end of the day, Task<T> is designed for a much finer granularity of concurrency and parallelism, so it would be quite unusable if it had a greater cost.

If you're curious, I made an un-scientific and likely incorrect but maybe interesting comparison some time ago (it's in Ukrainian but the table is readable enough I hope):

https://gist.github.com/neon-sunset/8fcc31d6853ebcde3b45dc7a...

This calculates the CPU time and max MEM RSS usage required to spawn 1M tasks/coroutines/processes/futures that sleep for 5s and await their completion.
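In .NET terms, the measured shape is roughly the following (my reconstruction, not the gist's exact code):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class Bench
{
    static async Task Main()
    {
        // 1M tasks that each "sleep" for 5s. Task.Delay does not block a
        // thread; each pending task is a small heap object plus a timer
        // registration, so the whole run finishes in roughly 5s of wall time.
        var tasks = Enumerable
            .Range(0, 1_000_000)
            .Select(_ => Task.Delay(TimeSpan.FromSeconds(5)));

        await Task.WhenAll(tasks);
        Console.WriteLine("done");
    }
}
```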

[0]: This might stop being true in the pure sense of the word in .NET 10, because task handling is going to change completely: instead of state machines generated by a language that targets .NET, methods will be specially annotated and the runtime itself will implicitly emit the state machines, allowing the cost to be paid only at "true" suspension points. Reference: https://github.com/dotnet/runtimelab/blob/feature/async2-exp...


This is interesting! I learned quite a bit, thank you :)

One comment, though: it's not an apples-to-apples comparison! (I'm talking about your gist.) Specifically, in Elixir, you should create the state machine yourself (very easy in Elixir, thanks to pattern-matching definitions of functions) and schedule it across a few processes manually. You'd need ~100loc for this, and you'd get results much closer to C# and Rust.

What your comparison highlights is that the primitives behind the same async/await API can vary widely in their specifics. Goroutines and BEAM processes are much more lightweight than OS-level threads, but they are much more complex than just a compiled state machine, so they are heavier than coroutines. On the other hand, BEAM processes do the "preemptive userspace scheduling," which means they can be used for scheduling (an implementation of) coroutines. An interesting note about F# `async { }` - I thought this one also builds coroutines through CPS transform; it should be on par with Task regarding overhead.

It would be nice to expand the comparison: Java got virtual threads lately, I'm curious how that would fare - closer to C# or BEAM? Throw in Kotlin's coroutines (probably closer to C#) and Python's asyncio (or Lua native coroutines, but then you need to write the scheduler) for a good measure... We could see how concurrency primitives differ in overhead across more languages.


> One comment, though: it's not an apples-to-apples comparison! (I'm talking about your gist.) Specifically, in Elixir, you should create the state machine yourself (very easy in Elixir, thanks to pattern-matching definitions of functions) and schedule it across a few processes manually. You'd need ~100loc for this, and you'd get results much closer to C# and Rust.

Hmm, I can't see myself agreeing with this. Each state machine would have to be hand-rolled manually, or another abstraction for it would need to be introduced. That becomes a lot of manual work very quickly, or it becomes callback hell. Whichever you prefer, both are among the reasons .NET pioneered async/await and other languages later adopted it. I am unsure about the scheduling cost too - significant amounts of application logic implemented in pure Erlang or Elixir will be subject to interpreter-tier performance, which is unlikely to match JIT/AOT-compiled languages. BEAM has excellent allocation throughput thanks to per-process GCs, but not the raw speed of code execution.

The comparison was linked as an addendum to the discussion; I didn't intend to make it the main focus :) But if you're interested in the context - the initial idea behind the comparison was that I got tired of hearing a colleague give unjustified praise to Go, especially when it comes to highly granular interleaving of concurrent operations (for lack of a better term).

And while, yes, BEAM processes and goroutines belong to the category of stackful coroutines with preemptive userspace scheduling, which has a very different cost model from the way task continuations are handled by .NET's thread pool and task scheduler implementations, one of the most common ways of using `go func(...` in Go is as if it were a task/future: one or more goroutines are fired once to yield one or more results collected via a channel/slice/some other collection plus a WaitGroup. In Go, users have no alternative to this for highly granular ad-hoc concurrency, save for re-implementing an async/await- or fork/join-like API themselves on top of a custom pool of goroutines and yield/join primitives.

So for all intents and purposes, the numbers indicate the user experience when someone wants to spawn a lot of independently run operations and then wait for them all to complete. The overhead is real, albeit pushing the task/goroutine/process count to 1 million is going to be very uncommon.

It's probably on the opposite end from what idiomatic Erlang or Elixir code would look like, but below is a popular pattern that relies on this:

  using var http = new HttpClient {
    BaseAddress = new("https://news.ycombinator.com")
  };

  // Tasks are hot-started, making the requests parallel
  var page1 = http.GetStringAsync("?p=1");
  var page2 = http.GetStringAsync("?p=2");

  Console.WriteLine(await page1 + await page2);
This way, you can also map e.g. an array of elements into a sequence of tasks, which is then fed into Task.WhenAll, which awaits all of them and returns an array of results. This is what the linked comparison measures. You can easily spawn thousands of concurrently awaited requests and the runtime will scale very well (also because Socket has efficient epoll/kqueue integration underneath).
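A self-contained sketch of that mapping pattern (SquareAsync is my stand-in for real async work such as an HTTP call):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class Example
{
    static async Task<int> SquareAsync(int n)
    {
        await Task.Delay(10); // stand-in for real async work (e.g. an HTTP request)
        return n * n;
    }

    static async Task Main()
    {
        // Map elements to hot-started tasks, then await them all at once;
        // Task.WhenAll preserves input order in the resulting array.
        int[] squares = await Task.WhenAll(
            Enumerable.Range(1, 5).Select(SquareAsync));

        Console.WriteLine(string.Join(",", squares)); // prints 1,4,9,16,25
    }
}
```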

> On the other hand, BEAM processes do the "preemptive userspace scheduling,"

Perhaps you meant "processes are scheduled preemptively"? My understanding is that processes are the main scheduling unit of BEAM, much like goroutines are in Go (though Go has an explicit mechanism for suspension mid-execution).

> It would be nice to expand the comparison: Java got virtual threads lately, I'm curious how that would fare - closer to C# or BEAM? Throw in Kotlin's coroutines (probably closer to C#) and Python's asyncio (or Lua native coroutines, but then you need to write the scheduler) for a good measure... We could see how concurrency primitives differ in overhead across more languages.

The Java way of solving this will be its Structured Concurrency, which is currently in an incubator. I did not add Python because a Python example attempted by a colleague turned out impossibly slow, and there are multiple ways of going about it. If you care about performance, especially in multi-tasking, then using Python at all is a terrible idea. I mostly focused on the languages that put concurrency and parallelism at their forefront. Without making it a proper automated comparison in a controlled environment, I don't think the numbers would be very useful; it's also more tedious than it looks, unfortunately.

> An interesting note about F# `async { }` - I thought this one also builds coroutines through CPS transform; it should be on par with Task regarding overhead.

F#'s Async (and the async { } blocks that express it) is where async/await originally comes from. It is implemented via F#'s "Computation Expressions"[0] as a form of do-notation; only afterwards did C# get its own async/await implementation. For all intents and purposes, both achieve the same goal and both implement a stackless coroutine. F#, however, keeps its core netstandard2.0-compatible and overall runtime-agnostic, which prevents it from using APIs that are only available to newer targets.

As measured, it appears to be much preferable to use task { } blocks instead, which have significantly lower heap size impact, and are similar in their performance to C#'s async methods. It's important to note that async { } and task { } blocks have different semantics w.r.t. hot vs cold start, multiple awaits/result caching and explicit vs implicit cancellation propagation.

[0]: https://learn.microsoft.com/en-us/dotnet/fsharp/language-ref...


Would the async experiment also imply better interop between .NET languages?

One issue I’ve heard from F# devs was the lack of ability to comfortably call C# asynchronous methods


> One issue I’ve heard from F# devs was the lack of ability to comfortably call C# asynchronous methods

This is no longer the case starting with F# 6: https://learn.microsoft.com/en-us/dotnet/fsharp/whats-new/fs... - C# and F# can now transparently interoperate using the same main Task<T> type (in the past, this was done via community libraries for C#->F#, and F#->C# was always possible with |> Async.AwaitTask).

For advanced usage scenarios, there also exist https://github.com/fsprojects/FSharp.Control.TaskSeq and https://github.com/TheAngryByrd/IcedTasks

> Would the async experiment also imply better interop between .NET languages?

This is a good question. If a language already uses Task<T> and ValueTask<T>, interop is already a solved problem (as with F#). However, this could allow completely removing, or at least significantly simplifying, the machinery for emitting state machines / resumable code / coroutines for asynchronous methods in any language that targets .NET 10+ (it will most likely get into 10, but from what I've heard that is not guaranteed yet).

For example, a guest language could simply emit calls to async methods with a special modreq(?) attribute, and the runtime would handle suspension, creating the closure for the state captured by the asynchronous continuation, and the subsequent resumption, without any additional work on the guest language's part to be async-compatible. You could likely even write that by hand in IL with ILAsm. This is a significant improvement for .NET as a hosting platform, and arguably makes it better than the JVM for some high-level scenarios (for low-level ones it already supports almost the entirety of what is expressible in C, plus struct generics, and for FP it supports the tail. prefix for call* opcodes, giving the mandatory tail calls recursive functions require).

And for those in the camp of "function coloring considered harmful" (which I'm not a fan of but to each their own), you could even have a guest language that completely hides the fact that it performs such async calls underneath.



