The ability to compute the stack monoid on slices of the input, then stitch them together later, is essential for processing this in parallel.
So what isn't possible is having two threads push/pop to the same conventional stack. So I assume the author is doing something different.
So what I guess they're doing is descending the tree and splitting off all the operations that can be done in parallel, so that you get a bunch of kernels feeding their results into each other with the structure of the tree. It seems like some synchronization is needed so one thread waits for the results of another. A monoid, as I vaguely understand them, acts as a kind of stored operation, so this seems logical.
And if you ran out of threads, the monoid would just act as a conventional stack on some of the remaining threads.
Of course, none of this says anything about the mechanism used to implement this.
> So what isn't possible is having two threads push/pop to the same conventional stack. So I assume the author is doing something different.
Indeed; what they're doing is building up a command object that represents a sequence of modifications to a stack. And that's something that you can parallelise, map-reduce style: they've come up with a representation for that command that makes it easy to combine two commands into a bigger command, and a technique for doing that combine in-place without needing extra scratch space. This doesn't need any synchronization between parallel threads - of course the final result is only formed when all the threads finish. (To materialize the final value of the stack you'd have to "run" the final command, but it turns out that's trivial for this command representation - just confirm that the number of pops is zero, and then the push sequence is the final value of the stack).
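To make that concrete, here is a minimal sketch of such a command representation (plain Python, not the author's in-place GPU version): a command is a pair `(pops, pushes)` meaning "pop this many items, then push these values". Combining two commands is associative, which is what lets slices of the input be processed independently and stitched together in any grouping.

```python
def combine(a, b):
    # Combine two stack commands: apply a, then b.
    pops_a, pushes_a = a
    pops_b, pushes_b = b
    if pops_b <= len(pushes_a):
        # b's pops cancel against the tail of a's pushes.
        return (pops_a, pushes_a[:len(pushes_a) - pops_b] + pushes_b)
    # b pops more than a pushed; the excess pops pass through to the left.
    return (pops_a + pops_b - len(pushes_a), pushes_b)

IDENTITY = (0, [])  # the "do nothing" command

def from_op(op, value=None):
    # Lift a single push or pop into a command.
    return (0, [value]) if op == "push" else (1, [])

# Example: "push 1, push 2" followed by "pop, push 3" combine to "push 1, push 3".
assert combine((0, [1, 2]), (1, [3])) == (0, [1, 3])
```

Note this sketch allocates new lists on every combine; the point of the article's representation is to do the equivalent merge in place without scratch space, which this does not attempt to show.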
> This doesn't need any synchronization between parallel threads - of course the final result is only formed when all the threads finish
I'm still confused. If one calculates (x+y)*(z+w), you can divide the additions between threads, but the multiplication has to happen once those two threads finish.
If you're talking about a "map-reduce" approach, then as far as I understand it, the reduce has to happen once the mapping is done, which would also seem to require synchronization.
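For what it's worth, here is a tiny Python sketch of the dependency being described (the names and use of thread-pool futures are illustrative, not from the article): the two additions can run in parallel, but the multiply cannot start until both results exist.

```python
from concurrent.futures import ThreadPoolExecutor

x, y, z, w = 1, 2, 3, 4
with ThreadPoolExecutor() as pool:
    left = pool.submit(lambda: x + y)          # one task
    right = pool.submit(lambda: z + w)         # a second task, in parallel
    product = left.result() * right.result()   # blocks until both inputs exist
print(product)  # 21
```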
I think this is a semantic disagreement rather than a real one. A task in a map-reduce style model doesn't need to synchronize with any other task executing in parallel; of course a task cannot start until its inputs are available. If you want to implement that on top of a threads-and-shared-memory model then you'd need to use synchronization in your task scheduler, but from the point of view of code written inside the model there isn't and can't be any synchronization, which makes it much easier to write correct code.
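As an illustration of that distinction, here is a sketch (assuming the hypothetical `combine()`/`from_op()` helpers from the earlier snippet, not the article's GPU code): each chunk of the input is reduced independently with no shared state or locks, and the only "waiting" is the scheduler collecting per-chunk results before the final combine.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def reduce_chunk(ops):
    # Runs independently per chunk: no shared state, no synchronization.
    return reduce(lambda acc, op: combine(acc, from_op(*op)), ops, (0, []))

ops = [("push", 1), ("push", 2), ("pop", None), ("push", 3),
       ("pop", None), ("push", 4), ("push", 5), ("pop", None)]
chunks = [ops[:4], ops[4:]]

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(reduce_chunk, chunks))   # "map" phase

final = reduce(combine, partials, (0, []))            # "reduce" phase
print(final)  # (0, [1, 4]): zero pops, so [1, 4] is the final stack
```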