There is a spectrum of solutions for dealing with slow I/O in programming languages (well, all I/O is best assumed to be slow I/O). At the two ends of the spectrum are:
- continuation-passing style (CPS), hand-coded or compiler-generated
- preemptive threading / processes
In between lie various solutions, like async/await (closer to CPS), and green threads (closer to preemptive threading).
The key difference between the two ends of this spectrum is memory footprint. With CPS you manually compress state into tuned structures. With threads you store state all over the stack in very inefficient ways because all those stack frames take extra space.
Any solution towards the thread side of the spectrum will yield significantly larger memory footprints than solutions towards the CPS side of the spectrum.
On the other hand, solutions towards the CPS side of the spectrum can require writing application-specific schedulers if fairness issues arise.
On the whole, solutions on the CPS side of the spectrum are better, IMO.
Java is kinda stuck with threads, so green threads make some sense. Of course, you can get pathological issues in M:N threading, so be careful about that.
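To make the two ends of the spectrum concrete, here's a toy sketch (mine, not from the thread) of the same event-summing task written thread-style, where state lives in stack locals behind a blocking call, and CPS-style, where state is hand-compressed into one small object plus a continuation:

```python
import queue
import threading

# Thread style: state (total) lives implicitly in stack locals,
# and jobs.get() is the stand-in for blocking I/O.
def handle_blocking(jobs: queue.Queue, results: list) -> None:
    total = 0
    while True:
        item = jobs.get()          # blocks this whole thread
        if item is None:
            break
        total += item
    results.append(total)

# CPS style: state is compressed by hand into one tuned object;
# progress is driven by feeding it events, and `done` is the continuation.
class SumState:
    def __init__(self, done):
        self.total = 0
        self.done = done

    def on_event(self, item) -> None:
        if item is None:
            self.done(self.total)  # invoke the continuation; no stack retained
        else:
            self.total += item

jobs, results = queue.Queue(), []
t = threading.Thread(target=handle_blocking, args=(jobs, results))
t.start()
for x in (1, 2, 3, None):
    jobs.put(x)
t.join()

cps_results = []
st = SumState(done=cps_results.append)
for x in (1, 2, 3, None):
    st.on_event(x)
```

Both compute the same sum, but the thread version pays for an entire stack to hold one integer of state, which is the footprint gap described above.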
This is partially right. But under the hood the Java implementation is still a CPS transformation, so it only allocates the memory that is required for the "virtual stack". By comparison, Go, which has similar semantics, allocates a more classical stack for each goroutine, which can however grow over time.
The CPS approach is certainly more space-efficient, but I'm not sure how much of a difference it really makes in the end. Go seems to be doing well too, after some attempts with linked stack segments and then moving to copying stacks while growing them.
What is interesting is that some of the CPS-like implementations have other performance drawbacks. E.g. since it makes the virtual stack more distributed over memory, the cache efficiency of such an approach might be lower. In Rust one limitation of the CPS approach is that the coroutine state is first stack-allocated before being moved onto the heap, and this operation has shown itself to be costly for some applications. So right now I'm not sure if there is any implementation which is superior in all possible benchmarks. But the Java one definitely seems to make a lot of sense for what they want to offer!
Normally a POSIX thread gets a very large stack (1MB is typically the default), and obviously these "virtual stacks" will only be as large as they grow (splayed on the heap?). But the memory for those POSIX thread stacks isn't necessarily allocated up front! The OS grows the stack when the thread faults on the guard page, so it's not really a 1MB stack.
Now, if you need to serve 1e6 clients with threads, and those are 1MB stack threads, then you'll be using 1TB of your VM space, which... is almost certainly going to have some performance issues (MMU table size issues at the very least). If you splay your stacks on the heap as linked lists of stack chunks then you might get away with having a very large (and fragmented) heap with large page table entries, which might be a win.
I think approaches on the CPS side of the spectrum will be generally better than this. No, I don't study this and I don't have numbers. Yes, CPS in general means allocating closures on the heap so that some state does live splayed all over rather than compressed, but it doesn't have to be so. But often you'll have only a handful of such closures, and the language could understand that they are one-time use closures (hello Rust) so that no GC is needed.
I've written a small (proprietary) HTTP server that is hand-coded CPS -- specifically it supports hanging-GETs with Range: bytes=0- of regular files as a form of tail -f over HTTP, which is great for log files. That implementation has a single object per GET that has all the state needed, and the only other place state lives is in epoll event registrations (which essentially are the closures, and they are very small, and only one per-connection). Granted, this is a very simple application, and it would be a lot more complicated if, for example, it had to do async I/O directly on a block device to implement a filesystem in the same process -- that would require more care to keep the state compressed.
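The shape of that design can be sketched with Python's selectors module standing in for epoll. This is an illustrative toy, not the proprietary server; GetState and its fields are invented:

```python
import selectors
import socket

# One small state object per connection, registered as the selector's
# callback data. GetState and its fields are made up for illustration.
class GetState:
    def __init__(self, conn: socket.socket):
        self.conn = conn
        self.offset = 0                 # all per-request state lives here

    def on_readable(self, sel: selectors.BaseSelector) -> None:
        data = self.conn.recv(4096)
        if not data:
            sel.unregister(self.conn)
            self.conn.close()
        else:
            self.offset += len(data)    # e.g. advance a Range: bytes=<offset>- cursor

sel = selectors.DefaultSelector()
client, server = socket.socketpair()
state = GetState(server)
sel.register(server, selectors.EVENT_READ, data=state)

client.sendall(b"GET /log HTTP/1.1\r\n")
for key, _ in sel.select(timeout=1):
    key.data.on_readable(sel)           # the state object *is* the continuation

sel.close()
client.close()
server.close()
```

The only long-lived state is the one object per connection plus the selector registration, which is the compression the comment describes.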
So in general I'm for CPS. But it's generally true that CPS solutions cost more dev time, and that can be prohibitive. The memory footprint cost difference will be a linear factor, which does not trivially justify the additional dev cost. Then again, if you'll be running lots and lots of instances with lots and lots of clients, the run-time savings can then easily be gargantuan compared to the dev costs -- but no one measures this, and by the time you wish you'd used CPS it will be too late and reimplementation costs prohibitive. Then again, async/await might fit the bill well enough most of the time.
> Any solution towards the thread side of the spectrum will yield significantly larger memory footprints than solutions towards the CPS side of the spectrum.
The community-at-large decided that hand-tuned memory management was too finicky and not worth it, even though garbage collection obviously 'costs' memory.
I'm frankly at a loss as to why so, so, so many blogposts and tech experts are all-in on the CPS-side of this argument; it seems quite obvious to me that in the vast majority of cases, the considerably simpler* model of (green) threads means you're making the exact same trade-off: Simpler to write and debug code at the cost of needing more memory when running the app you write.
*) For sequential/imperative-style languages, that is. If you're writing in a language that is definitely and clearly intended to be written in a functional style, I can see how the gap between CPS-style and threading-style is far narrower. However, java, python, javascript - these are languages where the significant majority of lines of code written are sequential and imperative in nature.
Also note that in e.g. Java you can actually configure stack sizes as you create threads. Thus your claim of a "'significantly' larger memory footprint" is debatable.
A GC does not obviously cost memory. It might, or it might not. Both a GC and a traditional memory allocator have hidden costs. A GC with movable objects can sometimes do better, because it can manage fragmentation.
I prefer a message-passing style; on which side of the spectrum would this fall?
My experience with green threads is libraries that make I/O operations look like a regular function call. This is similar to RPC, where a remote call and a local call can look the same, even though the remote call is much slower. This can result in surprising performance characteristics. Even worse, a remote call can time out or take indefinitely long to complete; the same is not true of local calls. Message passing is more onerous but makes surprises more obvious.
I've often fancied writing for an architecture where main memory is treated as fast remote storage, accessible with message passing. I know such architectures exist but I've never had the opportunity to write for one. I wonder if the change in style would have a positive or negative effect on performance.
In CPS the continuations (closures) have to be allocated on the heap, but generally they are one-time use only, which means no GC is needed for them. Hello Rust.
> it seems quite obvious to me that in the vast majority of cases, the considerably simpler* model of (green) threads means you're making the exact same trade-off: Simpler to write and debug code at the cost of needing more memory when running the app you write.
I don't find it's quite that simple.
My experience is that the complexity of CPS tends to scale linearly with use, whereas threads scale exponentially. For small uses threads are easier, but CPS quickly catches up.
CPS forces you to actually declare a dependency tree for your data. Things depend on other things, and that exists in your code. It's very easy for threads to end up a mess, where it's not clear how data is passing through the code, which causes bugs like deadlocks and race conditions.
It's deceptively easy to write code where thread A tries to lock mutexes X and Y, and thread C tries to lock mutexes Y and X, and it deadlocks because neither thread can get both locks.
It would be much harder and more arcane to do that in Javascript or in Python's async. I'm not saying it's impossible, but I don't think I've ever accidentally created a race condition or deadlock in their CPS engines.
TL;DR if your functions are only marked async so you can await something, threading probably is simpler. If you're actually passing promises around, things become much more favorable to CPS.
> Java is kinda stuck with threads, so green threads make some sense.
This isn't really what's happening here. Firstly, you can implement CPS on the JVM no problem; Kotlin Coroutines do exactly that. Loom's design is a very, very explicit design choice. Ron Pressler, the lead and designer of Loom, has talked about this extensively in many videos. He has argued persuasively that the way Loom works is not only a good way but the best possible way, and that it is not a consequence of being stuck with Java's previous design choices but rather is only possible because of them.
Those talks are highly recommended. I'll try to summarize the basic argument.
The ideal, from a developer's perspective, is to have the programming model of threads with the efficiency of hand-coded CPS or state machines. Why? Because threads naturally provide useful debugging and profiling information via their stacks, they provide backpressure, there are tons of libraries that already work with and use them, and most critically, they avoid the "colored function" problem which splits your ecosystem.
Why do most languages not provide that ideal? Mostly for implementation reasons. It's not due to theoretical disagreements or anything. Providing what Loom does is very difficult and is possible largely only because so much of Java and the Java ecosystem is written in Java itself. One reason native threads are relatively heavy is because the kernel can't assume anything about how the code in a process was compiled or what it will do. The JVM on the other hand is compiling code on the fly, and knows much more about the stack. In particular it knows about the (absence of) interior pointers, it knows it has a garbage collector, it controls the synchronization and mutex mechanisms, it controls debugging and profiling engines.
This allows it to very efficiently move data back and forth between a native thread stack and compressed encodings on a GCd heap. It's also why Loom has some weaknesses around calling into native code. Once you're outside the JVM controlled world it can no longer make these assumptions anymore and must revert to a much more conservative approach (this is "pinning" the "carrier thread"). Note, though, that this situation is not worse than async/await colored functions, which have exactly the same issue.
Maybe? It depends how you use them... any asynchronous operation will require memory, and that memory has to go somewhere. In CPS it goes on the heap... in the threaded style you have a stack you can put it on. This probably will waste some memory, though not as much as you might think. On the other hand, you can also reclaim the stack memory as soon as the operation finishes. The CPS technique creates garbage that accumulates until you do a GC cycle, but the whole principle of fast concurrent GC is to let garbage accumulate! I'm really not sure what will end up consuming more memory, particularly if you design your threads to primarily use stack allocation, which is admittedly hard in a language like Java.
Green threads by definition cannot use the OS stack and must allocate their stack memory on the heap. Although this memory can be reused, Go's experience shows that, to avoid performance bottlenecks (at least for Go code), it is better to allocate the stack as a single contiguous block and copy it to a bigger block when the thread's stack outgrows its current size. But then the whole stack space is pinned to the thread and cannot be reused.
For Java it may still be possible not to allocate the whole stack as a single chunk and instead have smaller chunks, like one per few frames. But I really doubt that it can reduce memory pressure compared with CPS in real applications, especially given how good GC has become in Java.
So in Java we know a few things about the stack that are not true for other languages. We know nothing on the stack is a pointer into a Java stack frame, and nothing on the heap points into a Java stack frame. These facts allow us to mount virtual threads onto carrier threads by copying portions of the stack to and from the heap. This is normally less memory than you’d expect because although you might have a pretty deep stack most of the work will happen in just a few stack frames, so the rest can be left on the heap until the stack unwinds to that point.
The big advantage of this over CPS is that you can take existing blocking code and run it on a virtual thread and get all the advantages; there is no function colouring limiting what you can call (give or take a couple of restrictions related to calling native code).
I like CPS precisely because it requires you to color-annotate the code, so it is known what can and what cannot do IO! Surely it decreases flexibility, but it makes reasoning about and maintaining the code easier.
Thread stacks are not OS-level objects; at least on Linux you just malloc or anon-mmap some memory and pass that to clone() or your own green thread implementation.
The question is: can the unused portion of the stack be used for anything else? With native threads the answer is no, and the same goes for Go's green threads. Time will tell if Java can pull off the trick of sharing unused stack space, but I am sceptical.
With POSIX threads the stack size defaults to something like 1MB or 2MB depending on the platform, but it's not allocated up front -- the stack grows as needed up to that maximum.
The main difference then between allocating stack chunks on the heap as needed, and stacks grown by the virtual memory subsystem, has to do with virtual memory management matters. If you can use huge pages for your heap, then allocating stack chunks on the heap will be cheaper than traditional stacks.
In CPS the state of the program is captured in the continuation, which is a closure, which is allocated on the heap, and in any ancillary data structures pointed to by it.
However, there's generally only a very small number of such closures -- typically only one -- and they are generally one-time use only. That means they can be freed as soon as they tail-call out. Hello Rust.
I've written hand-coded CPS in C, but I take your point. The issue is that while every function exits via tail-calls, so the stack footprint is small, every continuation is a closure that has to be allocated somewhere, and that somewhere is generally going to have to be the heap.
My biggest gripe with continuations, futures and callbacks is the propensity for stalls; if the programmer makes a mistake it's quite likely that the program will just stall, with very little insight available as to why that happened.
With threads, green or not, you have a much clearer failure model and it's easier to debug.
But doesn't that depend entirely on the synchronization mechanisms and how safe they are? For example, contrary to what you suggest I'd say that Go's stateless green threads are hard to debug (though there are good tools) and easily lead to deadlocks. That's because the available synchronization mechanisms (finitely buffered channels, non-re-entrant mutexes, and a few atomic operations) are hard to use correctly. Higher-level constructs like actors or transactional memory are in my opinion less error-prone.
The point is that e.g. futures are just some threads with a global synchronization mechanism for obtaining the result. Whatever makes the future stall will also make a low-level thread + your own synchronization stall. Or do you mean some more advanced failure-tolerant threading like in Erlang as compared to less advanced threading primitives like futures?
I definitely prefer CPS, especially in functional languages (where the noise I complain about below often fades away entirely).
On the other hand, CPS usually is a bit noisier from the developer's perspective; either your continuations are callbacks (whence callback hell) or your continuations are, as you say, manually compressed, tuned structures, which requires a fair amount of manual labor.
I believe Rust uses a CPS transform (well, more of a continuation-returning style, no?), but it automatically generates the tuned structures ("futures"). The cognitive overhead isn't all gone, but it definitely helps.
What about async/await? You can basically write linear, imperative code, and don't have to deal with continuations or these "tuned structures" manually.
For me it is the cleanest style of writing concurrent code. And more and more I find I can also replace state machines with it, which makes sense because the compiler usually generates state machines under the hood.
You know, the kind of code where you have to communicate with some outside device and it is easy to do blockingly but devolves to state machine madness if you need to do other things concurrently. For example it would be really nice if I could use async/await in C on a microcontroller to read from a serial port...
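With asyncio, that serial-port scenario reads as plain sequential code while other work interleaves. A hedged sketch, with the "port" simulated by a queue and all names invented:

```python
import asyncio

# Sequential read loop: no hand-written state machine, just code that
# suspends at `await` while other tasks run.
async def read_packet(port: asyncio.Queue) -> bytes:
    buf = bytearray()
    while True:
        byte = await port.get()        # suspends; other tasks run here
        if byte == 0x0A:               # newline terminates a packet
            return bytes(buf)
        buf.append(byte)

# The "other things" running concurrently with the read loop.
async def blink(counter: list) -> None:
    for _ in range(3):
        counter[0] += 1
        await asyncio.sleep(0)

async def main():
    port = asyncio.Queue()
    for b in b"ok\n":                  # simulate bytes arriving on the port
        port.put_nowait(b)
    counter = [0]
    packet, _ = await asyncio.gather(read_packet(port), blink(counter))
    return packet, counter[0]

packet, blinks = asyncio.run(main())
```

The compiler/runtime builds the suspension points for you, which is exactly the state machine you'd otherwise write by hand in C.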
Another drawback: poor interaction with debuggers, profilers, and other tools that expect to be working with normal stacks.
Loom solves this because it lets you work with normal threads, but suddenly you can have millions of them in a process without blowing out your memory or other related problems.
> You know, the kind of code where you have to communicate with some outside device and it is easy to do blockingly but devolves to state machine madness if you need to do other things concurrently.
Speaking personally, I've found Lua's coroutines to have the nicest experience for modeling flows like that. The big issue with async/await is the function color problem [0] -- writing async functions is perfectly fine, but mixing them with non-async functions can be extremely frustrating. Especially if you're doing anything with higher-order functions.
I used to struggle with "function color" until I realized that the functions just have a different type. Async functions return a future `Task<Thing>`, while normal function return a plain `Thing`. Of course they are incompatible.
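In Python terms, that typing view looks like this toy illustration: calling an async function yields a coroutine (the `Task<Thing>`), not a `Thing`, and awaiting is the conversion between the two:

```python
import asyncio
import inspect

def plain() -> int:
    return 42                # returns a plain Thing

async def eventual() -> int:
    await asyncio.sleep(0)
    return 42                # returns a Thing, eventually

v = plain()
c = eventual()               # calling it produced a value-to-be, not a value
is_coro = inspect.iscoroutine(c)
c.close()                    # never awaited here; close to avoid a warning

result = asyncio.run(eventual())   # awaiting converts the "Task<Thing>" to Thing
```

Seen that way, "incompatible colors" is just the type system doing its job.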
A different way of looking at it is that in async functions you should only do things that have negligible runtime (compared to the response time of your GUI or network service). If your task needs more time, you mark the call site and the called function "async" and the task will suspend somewhere "down in the call stack". (Without looking into it too much, I think something similar actually happens with these virtual threads. They modified IO functions to do cooperative multitasking under the hood?)
As to async functions being contagious, I found it helps to split "imperative" procedures and "pure" functions; the async color mostly applies to the former.
The problem with function coloring is that it divides the language for no good reason. Should there really be two names for the same sleep function, just because one is blocking and the other is not? As for the return type, that's just a leaky abstraction imo (especially for voids: is a blocking call returning nothing really different from an async call returning a Future<Void>?)
As for Loom: since everything runs inside a runtime, a blocking 'read' call for example is not actually a blocking system call (everything uses non-blocking APIs at that level), so the runtime is free to suspend execution at such a blocking site and continue useful work elsewhere until it finishes. So for some "async" functionality you can just fire up a new virtual task with easy-to-understand blocking calls and that's it; it will do the right thing automagically, it will throw exceptions where they actually make sense, you will be able to debug it line by line, no callback hell, etc.
Loom will also provide something called structured concurrency, where you can fire up semantically related threads and easily wait for them to finish at one place.
As for pureness, I don’t think it maps that cleanly to async/blocking. What about doing the same function on each pixel of a picture in memory where you subdivide it into n smaller chunks and run it in parallel?
Personally, in JavaScript I like that you can mix and match imperative and asynchronous code using Promise instances. It lets you handle asynchronous control flow in a purely synchronous function.
However, in other languages, having functions be of a different 'color' is far more painful. In Python for example, a synchronous function has to set up an event loop manually before it can run an asynchronous function. The call works, but nothing is 'running' without the event loop. Additionally, the asynchronous function may have been written to work with a particular event loop (e.g. trio vs curio), and thus you have to use that type.
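A minimal sketch of that Python friction: the synchronous caller has to stand up an event loop (here via asyncio.run) before the async function actually does anything (fetch_value is a made-up name):

```python
import asyncio

async def fetch_value() -> int:
    await asyncio.sleep(0)   # stands in for real async I/O
    return 7

# The synchronous caller can't just call fetch_value() and use the result;
# nothing runs until an event loop drives the coroutine.
def sync_caller() -> int:
    return asyncio.run(fetch_value())   # sets up a loop, runs it, tears it down

value = sync_caller()
```

And asyncio.run only works with asyncio-flavored coroutines; a trio-flavored one needs trio's own runner, which is the library-coupling complaint above.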
If non-blocking code has a standardized control-flow type like JavaScript's Promise, I think it's better to be explicit about async vs sync.
I also think of "color" fundamentally as different types; it's just painful to have two kinds of functions that you can't combine. I, personally, really feel that async/await is just adding a second kind of continuation (promises/futures) to a language that already has a perfectly good one (call stacks), and the language ergonomics suffers for it.
The reason I say functional languages don't get bit by this as bad is because functional languages rely far less on the specific notion of a call stack, and it's usually much easier to work with continuations (either via primitives like shift/reset or via syntax like do-notation).
I struggle with this idea of "separate colors"... I see Promise-returning functions as a superset of immediately returning functions... that is, any function that can return immediately could also return a Promise (which resolves immediately), so really, the immediately returning function is just an optimization to apply when it's helpful. I'm curious why a language suffers from this explicit separation of code which "returns immediately" versus code which "returns eventually"?
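That superset view can be sketched in asyncio terms (names invented): a function that "resolves" immediately still presents the same awaitable interface, so callers can't tell, and needn't care, which one suspends:

```python
import asyncio

async def eventual(x: int) -> int:
    await asyncio.sleep(0)   # genuinely suspends
    return x

async def resolved(x: int) -> int:
    return x                 # resolves immediately, but same interface

async def main() -> list:
    # The caller awaits both the same way; "returns immediately"
    # is just an optimization hidden behind the shared interface.
    return [await eventual(1), await resolved(2)]

out = asyncio.run(main())
```

The cost of that uniformity is that every caller in the chain must itself be async, which is where the "coloring" complaint comes back in.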
> With CPS you manually compress state into tuned structures. With threads you store state all over the stack in very inefficient ways because all those stack frames take extra space.
On the other hand, these stack frames can be thought of as a large arena allocator for what would otherwise be lots of smaller objects allocated on the heap.