
> Java's Thread.interrupt is also just a sleep loop

You probably mean that Java's socket reading under the hood may start a non-blocking IO operation on the socket and then run a loop, which can react to Thread.interrupt() (which, in turn, basically just sets a flag).

But that's an implementation detail, and it does not need to be implemented that way.

It can be implemented the same way as async/await. When a thread calls socket reading, the runtime system will take the current thread's continuation off the CPU and use the CPU to execute the next task in the queue. (That's how Java's new virtual threads are implemented.)

Threads and async/await are basically the same thing.

So why not drop this special word `async`?




> So why not drop this special word `async`?

You can drop the special word in Rust; it's just sugar for 'returns a poll-able function with state'. However, threads and async/await are not the same.
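A loose sketch of that desugaring, with hypothetical names and a simplified `poll` signature (the real compiler output implements std's `Future` trait and threads a `Waker` through; this only illustrates "a state machine holding exactly the locals that must survive the await point"):

```rust
// Stand-in for: async fn double_after_tick(x: i32) -> i32 { tick().await; x * 2 }
// Each `poll` advances the machine; `Pending` means "not done yet".
enum DoubleAfterTick {
    WaitingForTick { x: i32 }, // `x` lives in the state, not on a stack
    Done,
}

enum Step {
    Pending,
    Ready(i32),
}

impl DoubleAfterTick {
    // The real Future::poll takes a Context with a Waker; here the
    // "event arrived" signal is faked with a plain bool for clarity.
    fn poll(&mut self, tick_arrived: bool) -> Step {
        match *self {
            DoubleAfterTick::WaitingForTick { x } if tick_arrived => {
                *self = DoubleAfterTick::Done;
                Step::Ready(x * 2)
            }
            DoubleAfterTick::WaitingForTick { .. } => Step::Pending,
            DoubleAfterTick::Done => panic!("polled after completion"),
        }
    }
}
```

Polling it before the tick yields `Pending`; polling after the tick yields `Ready(42)` for `x = 21`.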

You can implement concurrency any way you like - you can run it in separate processes or on separate nodes if you are willing to put in the work - but that does not mean they are equivalent for most purposes.

Threads are almost always implemented preemptively while async is typically cooperative. Threads are heavy/costly in time and memory, while async is almost zero-cost. Threads are handed over to the kernel scheduler, while async is entirely controlled by the program('s executor).

Purely from a merit perspective, threads are simply a different trade-off. Just like multi-processing and the distributed actor model are.


> Threads are almost always implemented preemptively while async is typically cooperative. Threads are heavy/costly in time and memory, while async is almost zero-cost. Threads are handed over to the kernel scheduler, while async is entirely controlled by the program('s executor).

Keyword here being almost. See Project Loom.


@f_devd, cooperative vs preemptive is a good point.

(That threads are heavy or should be scheduled by OS is not required by the nature of the threads).

But preemptive is strictly better (safer at least) than cooperative, right? Otherwise, one accidental endless loop, and this code occupies the executor, depriving all other futures from execution.

@gpderetta, I think Project Loom will need to become preemptive, otherwise the virtual threads can not be used as a drop-in replacement for native threads - we will have deadlocks in virtual threads where they don't happen in native threads.


Preemptive is safer for liveness since it avoids 'starvation' (one task's poll taking too long); however, in practice it is almost always more expensive in memory and time due to the implicit state.

In async, only the values required to do a poll need to be held (often only references), while for threads the entire stack & registers need to be stored at all times, since at any moment the thread could be interrupted and will need to know where to continue from. And since all registers need to be saved/overwritten at each context switch (+ scheduler/kernel handling), it takes more time overall.
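One way to see the "only what's needed" point in Rust: the compiler-generated future stores only the locals that are live across an `.await` point, so its size tracks those locals rather than a whole stack. (A sketch; the exact byte counts are compiler- and version-dependent.)

```rust
// A future with nothing live across its await point stays tiny.
async fn small() {
    std::future::ready(()).await;
}

// A future that must carry a 1 KiB buffer across an await point
// has to store that buffer in its state machine.
async fn large() {
    let buf = [0u8; 1024]; // must survive across the await below
    std::future::ready(()).await;
    std::hint::black_box(&buf); // keep buf live after the await
}

fn main() {
    // size_of_val inspects the state machine without ever polling it
    println!("small: {} bytes", std::mem::size_of_val(&small()));
    println!("large: {} bytes", std::mem::size_of_val(&large()));
}
```

On current compilers `small()`'s future is a handful of bytes, while `large()`'s is at least 1024.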

In general threads are a good option if you can afford the overhead, but assuming threads as a default can significantly hinder performance (or make it nearly impossible to even run) in the places where Rust needs to.


@f_devd, I think you are mistaken.

Not that I want to discourage anyone from using async/await. I am glad async/await solves people's problems, especially when they do not have a ready-to-use alternative such as my ideal of perfect threads.

But just to reduce the number of people who are mistaken on the Internet :)

I think the only real problem that makes threads really expensive for embedded systems is the statically allocated large stack. If stack size is managed dynamically, it can be small, thus allowing many threads. The other expenses should be tolerable. Embedded systems don't require high computational throughput, I think.

All implementation approaches used for async/await can be used for threads, and vice versa, because they are basically the same thing.

> In async, only the values required to do a poll need to be held (often only references), while for threads the entire stack & registers needs to be stored at all times, since at any moment it could be interrupted and it will need to know where to continue from.

Well, it seems the opposite - the approach you attribute to threads can be more efficient here. If an async function, when blocked, holds in its Future state record only the part of the local vars and parameters needed to continue execution, it needs to copy them from the stack. And that's redundant copying and memory allocation for Future state records. Note that this happens at every element of the function call chain, so the Future state records act as stack frames. And this stack copying is most likely done as individual assignments, var by var.

And I am afraid this allocation and copying can happen every time the async function blocks. Reusing Future state records may be non-trivial, given that the next time, the top-level async function we are awaiting may block in some other internal branch.

Compared to saving the stack which is just saving two registers: stack base and stack pointer.

> And since it needs to save/overwrite all registers at each context switch (+ scheduler/kernel handling), it takes more time overall.

Saving registers is cheap. Also, there is no magic: when the next async function is activated by the async scheduler, it uses the registers as it wants, so the register values of the previously blocked async function need to be saved somehow - this happens when the most nested function copies its local vars to the Future state record.

Speaking of preemption requiring the kernel - not necessarily. It can be done in user space. A thread can yield control to the scheduler when it invokes a blocking function (as Java virtual threads currently do). In addition to that, other preemption points can be used - function calls, allocations, maybe loop boundaries. This approach lies between cooperative threading and full preemption.
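In Rust async terms, such a voluntary preemption point is just a future that returns Pending once and immediately re-arms itself (tokio and futures-rs ship a `yield_now` like this). A std-only sketch, with a do-nothing Waker purely so the demo can be polled by hand:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// A user-space "preemption point": returns Pending exactly once,
/// handing control back to the scheduler, and re-arms itself via the
/// waker. A compiler/runtime could insert such yields at calls,
/// allocations, or loop back-edges.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            cx.waker().wake_by_ref(); // ask to be scheduled again right away
            Poll::Pending
        }
    }
}

// A no-op Waker, just so we can poll by hand in this demo.
fn vt_clone(_: *const ()) -> RawWaker {
    RawWaker::new(std::ptr::null(), &NOOP_VTABLE)
}
fn vt_nop(_: *const ()) {}
static NOOP_VTABLE: RawWakerVTable =
    RawWakerVTable::new(vt_clone, vt_nop, vt_nop, vt_nop);

fn poll_once(fut: &mut YieldNow) -> Poll<()> {
    let waker = unsafe { Waker::from_raw(vt_clone(std::ptr::null())) };
    let mut cx = Context::from_waker(&waker);
    Pin::new(fut).poll(&mut cx)
}
```

The first `poll_once` returns `Pending` (the yield); the second returns `Ready(())`.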

If we consider preemption by timer interrupts: first, it only happens if the thread hasn't yet yielded control by calling a blocking function. Second, if preemption by timer does happen, the kernel can pass control to the user-space scheduler in the application runtime instead of applying the kernel's heavyweight scheduler (is the kernel scheduler really more heavyweight?).

Moreover, I've just searched for user-space interrupts, and it looks like new processors provide such a feature. The first link in search currently is https://lwn.net/Articles/871113/. Green-thread scheduling is mentioned as one of the use cases.

So, in short, I don't see why threads would be inherently less performant than async/await.


I think you might be confusing Runtime, OS and bare-metal primitives. Java virtual threads are possible because there is always the runtime which code will return to, and since it's already executing in a VM, the cost of stack/heap stores and loads doesn't really matter for performance.

> Compared to saving the stack which is just saving two registers: stack base and stack pointer.

In embedded you might not have a stack base, just a stack pointer; this means that in order to switch to a different stack you need to copy 2 stacks. (I might be wrong here; I know some processors have linear stacks, but this might be more uncommon.)

On bare metal this dynamic changes significantly, in order to "switch contexts" with preemption the following steps are needed (omitting the kernel switch ops):

- Receive interrupt

- Mask interrupts

- Store registers to heap

- Store stack to heap

- Calculate next preemption time (scheduler)

- Set interrupt for next preemption time

- Load stack from heap

- Load registers from heap

- Unmask interrupts

- Continue execution using program counter

While for async/await everything is already in place on the stack/heap, so a context switch is:

- Call Future.poll function

- If Poll::Ready, make the parent task the new Future and (if it exists) call it

- If Poll::Pending, go to next Future in Waker queue

Async/await (in Rust) is without a runtime, and without copies or register stores/loads; it can be implemented on any CPU. On embedded, tasks can also decide how they want to be woken, so if you want low-power operation you can make an interrupt that calls `wake(future)`, and it will only poll that task after the interrupt has hit, meaning any time the Waker queue is empty it knows it can sleep with interrupts enabled.

> so register values of previously blocked async function need to be saved somehow

The difference is that we know exactly which values are needed instead of not knowing what we need from the stack/registers.

User-space interrupts would make it easier to do preemption in user-space but this is yet another feature you can't make assumptions about (especially since there has been only a single gen of processors which support it).


Yes, of course a non-cooperative switch is more expensive than a cooperative one. But the thread model does not require preemption or even time-slice scheduling.

But with async/await cooperative switch is the only option.


I'm unfamiliar with a bare-metal thread model that doesn't do preemption outside of a runtime. I imagine you'd need to effectively inject code to do a cooperative switch, as there aren't many ways for a CPU to exit its current 'task' outside of an interrupt (preemption) or a defer call (coroutines/async). For runtimes it usually also means you effectively have a cooperative switch, but it's hidden away in runtime code.

Do you have an example?


@f_devd, I realized that my main objection to async/await does not apply to Rust.

Thank you for staying in the discussion long enough for me to realize that completely.

I dislike async/await in JavaScript because async functions cannot be called synchronously from normal functions. The calling function and all its callers and all their callers need to be turned async.

In Rust, since we can simply do `executor::block_on(some_async_function())`, my objection goes away - all primitives remain fully composable. Async functions can call usual functions and vice versa.
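For reference, `block_on` needs nothing beyond std; here is a minimal single-future sketch in the spirit of the `pollster`/`futures` crates (not production code - real executors juggle many tasks and a waker queue):

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

// Waking simply unparks the OS thread blocked inside block_on.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

// Drive a single future to completion on the current thread,
// sleeping whenever it is Pending and nothing has woken us.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => thread::park(), // wait until wake() unparks us
        }
    }
}
```

With this in hand, plain synchronous code can run an async block: `let n = block_on(async { 40 + 2 });`.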

So my first comment was to some extent a "knee-jerk reaction".

As we started to discuss thread preemption cost, I will provide some responses below. In short, I believe it can be on par with async/await.

=================================================

> I think you might be confusing Runtime, OS and bare-metal primitives.

I am not confusing, but I consider all those cases down to what happens at CPU level.

> Java virtual threads are possible because there is always the runtime which code will return to, and since it's already executing in a VM the concept of Stack/Heap Store/Loads don't really matter for performance.

They remain applicable, as at the lowest level the VM / Runtime is executed by a CPU.

> Async/await (in rust) is without a runtime,

Rust Executor is a kind of runtime, IMHO.

> and without copies or register stores/loads;

The CPU register values are still saved to memory when an async function returns Poll::Pending, so that intermediate computation results are not lost and, when polled again, the function continues its execution correctly. (At the level of Rust source code, the register saving corresponds to the assignment of the most nested async function's local variables to the fields of the generated anonymous future.)

==============================================

> In embedded you might not have a stack base, just a stack pointer, this means in order to switch to a different stack you need to copy 2 stacks. (I might be wrong here; I know some processors have linear stacks, but this might be more uncommon).

If the CPU does not have a stack base (stack segment register), saving of the stack pointer is enough to switch to another stack.

In practice, I think, even on CPUs with a stack segment register, a context switch most often only needs to save the stack pointer - all stacks of the process can live in the same segment, and even for different processes the OS can arrange the segments to have the same segment selector. I know that switching to kernel mode usually involves changing the stack segment register in addition to the stack pointer (as the kernel stack segment has a different protection level).

==============================================

> On bare metal this dynamic changes significantly, in order to "switch contexts" with preemption the following steps are needed (omitting the kernel switch ops): [...] While for async/await everything already in place on the stack/heap so a context switch is: [..]

The operations you listed for bare metal are very cheap; some items in the list are just a single CPU instruction. (Also, I think timer interrupts are configured once for a periodic interval and don't need to be recalculated and re-set on every context switch.)

If one expands the "go to next Future in Waker queue" item you listed for async/await to the same level of detail as you did for bare metal, the resulting list may be even longer than the bare-metal list.

==============================================

The majority of the context-switch cost at the CPU level comes from switching to a different process: a new virtual memory mapping table needs to be loaded into the CPU (and correspondingly, the cached mappings in the TLB need to be flushed and new ones computed during execution in the new context), and different descriptor tables need to be loaded.

Nothing of that applies to in-process green thread context switches.


Java can afford that. M:N threads come with a heavy runtime. Java already has a heavy runtime, so what is a smidgen more flab?

Source: https://github.com/rust-lang/rfcs/blob/master/text/0230-remo...


So it seems that the biggest issue was having a single IO interface, forcing overhead on both green and native threads and forcing runtime dispatching.

It seems to me that the best option would have been to let the two libraries evolve separately and capture the common subset in a trait (possibly using dynamic impls where type erasure is tolerable), so that you can write generic code that works with both, or specialized code that takes advantage of specific features.

As it stands now, sync and async are effectively separated anyway, and it is currently impossible to write generic code that handles both.



