
I've seen this over and over. One of the main issues pointed out by TFA is that there are too many small tasks allocated for parallel execution. Rayon is not going to magically distribute your work perfectly, though it very often does a decent job.

If your algorithm is the equivalent of a couple of nested iterations, you have essentially three options: parallelize the outer loop, the inner loop, or both. In the vast majority of the cases I've run into, you want thread/task-level parallelism on the outer loop (only), and if required, data/SIMD parallelism on the inner loop(s).

It's a rule of thumb, but it biases towards batches of work assigned to CPUs for a decent amount of time, allowing cache locality and pipelining to kick in. That's even before SIMD.
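To make that concrete, here's a rough sketch of "threads on the outer loop, tight sequential inner loop". It uses plain `std::thread::scope` rather than Rayon so it has no dependencies; `row_sums` and its parameters are names I made up for illustration, and the row-major matrix layout is an assumption. With Rayon you'd get a similar shape from something like `par_chunks` over the rows.

```rust
use std::thread;

/// Sum each row of a row-major matrix (hypothetical example function).
/// Outer loop (rows) is split into one contiguous batch per worker thread;
/// the inner loop (elements of a row) stays sequential inside each thread.
fn row_sums(data: &[u32], cols: usize, workers: usize) -> Vec<u64> {
    assert!(cols > 0 && data.len() % cols == 0);
    let rows = data.len() / cols;
    let mut out = vec![0u64; rows];
    if rows == 0 {
        return out;
    }
    // Ceiling division: each worker gets a contiguous block of rows.
    let w = workers.max(1);
    let rows_per_worker = (rows + w - 1) / w;

    thread::scope(|s| {
        for (out_chunk, in_chunk) in out
            .chunks_mut(rows_per_worker)
            .zip(data.chunks(rows_per_worker * cols))
        {
            s.spawn(move || {
                for (dst, row) in out_chunk.iter_mut().zip(in_chunk.chunks_exact(cols)) {
                    // Inner loop: a tight pass over contiguous memory,
                    // which is where SIMD would go if it were needed.
                    *dst = row.iter().map(|&x| x as u64).sum();
                }
            });
        }
    });
    out
}
```

Each thread touches one contiguous block of input and output, which is the "batch of work assigned to a CPU for a decent amount of time" part.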



The rule of thumb also keeps you from doing a lot of task switching. It isn't free to enqueue and dequeue tasks. If you have a million things to do, it is better to group them into a smaller set of tasks, especially if the runtime of those tasks is somewhat uniform.
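A toy illustration of the enqueue/dequeue cost, assuming a shared atomic counter stands in for the task queue (this is not how Rayon's deques work, just the simplest thing that shows the effect). `sum_batched` and `grain` are hypothetical names. Every `fetch_add` that returns a valid start index counts as one dequeue, so the second return value tells you how many queue operations a given grain size costs.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Sum f(i) for i in 0..n, handing out work in batches of `grain` indices.
/// Returns (total, number of batches that were dequeued).
fn sum_batched(
    n: usize,
    grain: usize,
    workers: usize,
    f: impl Fn(usize) -> u64 + Sync,
) -> (u64, usize) {
    let grain = grain.max(1);
    let next = AtomicUsize::new(0); // stand-in for the shared task queue
    let dequeues = AtomicUsize::new(0);

    let total = thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..workers.max(1) {
            handles.push(s.spawn(|| {
                let mut local = 0u64;
                loop {
                    // One "dequeue": claim the next batch of indices.
                    let start = next.fetch_add(grain, Ordering::Relaxed);
                    if start >= n {
                        break;
                    }
                    dequeues.fetch_add(1, Ordering::Relaxed);
                    let end = (start + grain).min(n);
                    for i in start..end {
                        local += f(i);
                    }
                }
                local
            }));
        }
        handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>()
    });
    (total, dequeues.into_inner())
}
```

With a million items, `grain = 1` means a million queue operations on a contended counter, while `grain = 10_000` means a hundred. The answer is the same either way; only the overhead changes.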


For sure. Context switching tasks is certainly a lot cheaper than context switching threads, but it isn't free.


> One of the main issues pointed out by TFA is that there's too many small tasks allocated for parallel execution

Valid concern, but I don't think that was the OP's case?

From my understanding, the author gained the most by dumbing down the generic rayon implementation to one of the same kind (thread pool with task queues) but with a different work-stealing algorithm.

> Rayon is not going to magically distribute your work perfectly, though it very often does a decent job.

Work-stealing by definition kinda makes distributing the work "correctly" a difficult task, doesn't it?


Well, sure, in practice work stealing makes correct distribution difficult, but in theory, work stealing is to repair an incorrect work distribution, right?

If every CPU is 100% utilized without needing a context switch (and is running the right number of worker threads without switching those), then work stealing is not required.

But my comment is solidly "rule of thumb". I claim no theoretical basis other than "Giving fewer, longer tasks to fewer threads (still >= number of worker threads) is better than giving more, shorter ones".
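If you want the rule of thumb as arithmetic, a minimal sketch might look like this. `grain_size` and `tasks_per_worker` are hypothetical names, and the "a few tasks per worker" target is my assumption, not something from TFA. The idea is just to keep the task count at or above the worker count while making each task as long as possible.

```rust
/// Hypothetical helper: pick a batch size so that `n` items become at most
/// `workers * tasks_per_worker` tasks, with at least one item per task.
fn grain_size(n: usize, workers: usize, tasks_per_worker: usize) -> usize {
    let target_tasks = workers.max(1) * tasks_per_worker.max(1);
    // Ceiling division, so the resulting task count never exceeds the target.
    ((n + target_tasks - 1) / target_tasks).max(1)
}
```

For a million items on 8 workers with 4 tasks each, that gives batches of 31,250 items, i.e. 32 tasks. If I remember right, Rayon exposes a similar knob as `with_min_len` on indexed parallel iterators.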


Larger grain size is better.



