The stack monoid revisited (raphlinus.github.io)
216 points by harperlee on May 16, 2021 | hide | past | favorite | 28 comments



From Wikipedia:

> In abstract algebra, a branch of mathematics, a monoid is a set equipped with an associative binary operation and an identity element.

So M = (S, op, id)

My go-to example of a monoid is (the set of strings, string concatenation, the empty string). It's my go-to because there's e.g. no `x` such that `concat("blah", x) == ""`. Another example is (integers, +, 0), but that's not my go-to because it is also more than a monoid (it has inverses, so it's a group).

I did not find a clear explanation of what the stack monoid is here. Can someone state it, please?


An earlier post (https://raphlinus.github.io/gpu/2020/09/05/stack-monoid.html) explains it. The elements of the monoid are stack operations represented by pairs `(n, vs)`, where `n` is a number of values to pop off the stack and `vs` is a list of values to push on afterwards. So e.g. `(3, [])` is “pop 3 values”, `(1, ['foo', 'bar'])` is “pop 1 value then push 'foo' and 'bar'”, etc. Two elements are composed by combining their pop counts and push lists to reflect what happens to the stack if you apply the operations sequentially, e.g. the previous two examples compose to `(4, ['foo', 'bar'])`. The identity is `(0, [])`.
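As a minimal sketch (in Python, with invented names; not the actual code from the post), that composition rule looks like:

```python
def compose(a, b):
    """Combine two stack ops (n_pops, pushes): apply a, then b."""
    n1, vs1 = a
    n2, vs2 = b
    if n2 <= len(vs1):
        # b's pops consume only values that a pushed
        return (n1, vs1[:len(vs1) - n2] + vs2)
    # b pops past a's pushes, so the extra pops accumulate
    return (n1 + n2 - len(vs1), vs2)

IDENTITY = (0, [])
```

With this, `compose((3, []), (1, ['foo', 'bar']))` gives `(4, ['foo', 'bar'])`, matching the example above.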


So this isn't really the stack monoid, this is mutations as a monoid (which is a construction that exists for anything mutable), or the command monoid. The insight is that these mutations have a nice representation as first-class values and you can combine two "stack mutation commands" efficiently.


> So this isn't really the stack monoid, this is mutations as a monoid (which is a construction that exists for anything mutable)

I can see why you'd balk at the name "stack monoid", but the specific commands / actions being modeled (push/pop) are exactly the ones modeling a stack. It's not a purely syntactic construction because a push followed by a pop is evaluated away. The representation as `(pops, pushes)` is the natural conclusion of this line of semantic analysis.

> this is mutations as a monoid

In general though, yeah, this is the secret sauce.

The history of commands on a mutable cell is sufficient to recover the value of the cell at any point in time, and the history of values held by the cell is sufficient to derive a (not necessarily unique) history of commands going between each snapshot.

Typically, though, we only remember the "latest" value of the cell, so we lose a lot of information. In a parallel algorithm, multiple "points in time" need to be accessible simultaneously, so you get a lot by moving to a representation where you're not already missing information.


> My go-to example of a monoid is (the set of strings, string concatenation, the empty string).

There's an interesting reason why strings are a good example - when you think of `string` as `List<char>`, that `List` turns out to be a "free monoid". That is, the minimal amount of structure needed to encode monoid operations in such a way that you can later interpret them into something else. One possible interpretation function is what's commonly called `reduce` - you can think of `[1, 2, 3].reduce((x, y) => x + y, 0)` as an interpretation of this monoidal structure into the monoid of integers under addition.


Consider first the bicyclic monoid, which may be presented as < a, b | ba = e >. That is, the elements are lists of the symbols a, b, such that you never see a b followed by an a, so elements have the form a^n b^m, and indeed the numbers n, m are sufficient to describe an element of the monoid.

One way to look at it is as parenthesis balancing: replace a, b with ), ( respectively, and then the monoid is strings of parentheses, except any matched pairs () are removed.

Now change ( for lots of different elements you might want - say, integers for their place in the string. You get a presentation like < p, x \in X | \forall x \in X: xp = e >, that is, if you see an element of your data set immediately followed by a p (for pop), you can delete both. This is the stack monoid, and elements may be described as p^n s, where s is a list of elements of X. Such an element may be read as "the stack operations of popping n times and then pushing s," or "the tree traversal of going up n nodes and then going down the path s."
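A tiny Python model of the bicyclic case (invented names; the pair (n, m) stands for the normal form a^n b^m):

```python
def bicyclic_mul(x, y):
    """Multiply a^n b^m by a^p b^q under the relation ba = e."""
    n, m = x
    p, q = y
    k = min(m, p)  # each b that meets an a cancels
    return (n + p - k, m + q - k)

E = (0, 0)  # the empty word, i.e. the identity
```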


> My go-to example of a monoid is (the set of strings, string concatenation, the empty string)

My favorite is the monoid of endomorphisms! "Endomorphism" is a fancy word, but it just means "if you give me a `T`, I'll return a `T` like it but with some changes."

Programs that operate on a stack usually look something like this (if we ignore everything that isn't touching the stack):

    stack.push('a');
    stack.push('b');
    stack.pop();
It's assumed that `stack` refers to some mutable cell, here. We can push mutation to the boundary if we change `push` and `pop` to return the new stack, and change `stack` itself to reference a get/set register containing a stack:

    stack.set(stack.get().push('a').push('b').pop());
Now, what is the type signature of `push`? In Haskell-ish, it's `Char -> (Stack -> Stack)`. With `pop`, it's `Stack -> Stack`. (I'm assuming that `this` is passed last, which is amusingly the opposite of what we're used to.) In other words, each operation takes zero or more arguments and produces a transformation that turns one Stack into another.

The thing about operations like `Stack -> Stack` is that you can combine them together without even having a `Stack` handy. You just produce a function that does one, passes its result to the other, and returns the final result. And of course, I could write an identity function that returns the given stack unchanged. That makes `Stack -> Stack` (and really any `T -> T`) a monoid!
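That combining can be sketched in Python (a toy, with invented names; tuples stand in for immutable stacks):

```python
def compose(f, g):
    """Monoid operation on endomorphisms: apply f, then g."""
    return lambda x: g(f(x))

def identity(x):
    """Monoid identity: returns the stack unchanged."""
    return x

# Stack operations as endomorphisms on immutable stacks (tuples).
def push(c):
    return lambda s: s + (c,)

def pop(s):
    return s[:-1]

# Combine the operations before any stack exists, then apply once.
program = compose(compose(push('a'), push('b')), pop)
```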

In other words, instead of `(stack.push('a')).pop()`, where we morally associate the parentheses to the left, we have something like `stack.(push('a').pop())`, except obviously OOP dot notation doesn't work that way. But that's the intuition -- we're combining the operations before we feed them a stack to operate on.

Ultimately, that's what Raph's stack monoid is based on, except instead of functions `T -> T` they describe `push` and `pop` with data. This is called "defunctionalization", and it basically works because we can write a single function (an interpreter) that performs the cumulative actions described by that data.

As a sibling comment explains, lists are the free monoid for a set of actions (where "free" means "assuming the least necessary"), so we could get by with `[Push 'a', Push 'b', Pop]` as a representation. But this contains more information than we actually need. We know that if a push is followed by a pop, the contents of the push don't matter to us. We can therefore remove any push-then-pop pair of actions. When we simplify in this way, we find that the only sequences of actions we have are N pops followed by M pushes. So Raph's representation coalesces the adjacent pushes and pops into `(Int, [Char])`.
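A sketch of that simplification (hypothetical action encoding, not Raph's actual code):

```python
def normalize(actions):
    """Collapse a list of ('push', c) / ('pop',) actions into (n_pops, pushes)."""
    pops = 0
    pushes = []
    for act in actions:
        if act[0] == 'push':
            pushes.append(act[1])
        elif pushes:
            pushes.pop()   # a pop cancels the latest pending push
        else:
            pops += 1      # nothing pending: the pop reaches deeper
    return (pops, pushes)
```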


> With `pop`, it's `Stack -> Stack`.

A signature like that would simply discard the value of the element on the stack, which makes the data structure write-only and superfluous. The signature should be `Stack -> (Char, Stack)`.


`Stack -> (Char, Stack)` is equivalent to `(Stack -> Char, Stack -> Stack)`, which amounts to adding a `peek` method to the scheme above. The inclusion of `peek` changes nothing about the description of stack mutations above, and its clear separation better maps to what Raph is modeling in the first place.

To be more explicit: Raph is modeling a scene graph, and a traversal of the scene graph consists of a series of pushes and pops on a stack, together with operations on nodes that act on a particular state of the stack. To parallelize these operations we need to be able to access the state of the stack at multiple points along the traversal.

Each operation will use `peek` (and likely `pop`, too, but notice that it doesn't affect other operations here; it's more like moving a cursor down the stack) to drive its own behavior, but those read operations are independent of the traversal itself. It is the traversal we are modeling.


Is a reduce method a monoid then?


It's definitely related, yes! You can reduce any list given a monoid on the type of elements of the list. That is, if you have a type `T` with combining operator `combine: (T, T) -> T` and unit `e: T`, you can reduce a list of `[T]` starting with `e` and combining each element into it.

If I have a list of numbers `xs`, there are two nice monoids available: (+, 0) and (*, 1). In a JS-like notation, I can reduce using either monoid with `xs.reduce((acc, x) => acc + x, 0)` or `xs.reduce((acc, x) => acc * x, 1)`, respectively.
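The same two reductions, runnable in Python via `functools.reduce`:

```python
from functools import reduce

xs = [1, 2, 3, 4]
total = reduce(lambda acc, x: acc + x, xs, 0)    # monoid (+, 0)
product = reduce(lambda acc, x: acc * x, xs, 1)  # monoid (*, 1)
```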

The existence of `reduce` is actually exactly what makes lists the "free monoid"!


Interesting in-depth discussion with the parallel parentheses matching solution.

Back to the clipping problem though: in each intermediate step (when we are still incorporating prefix results), do we compute clipping for each element progressively (the result array in the 1st pseudo-code snippet), or do we just compute the clipping boxes (the stack array snapshot in the 1st pseudo-code snippet) and then compute the final clipping for each element in parallel once the clipping boxes (the position of each push and their bounding boxes) are ready?

Sorry about my inaccurate terminology in advance as I didn't work closely with 2D graphics before.


The ability to compute the stack monoid on slices of the input, then stitch them together later, is essential for processing this in parallel.

So what isn't possible is having two threads push/pop to the same conventional stack. So I assume the author is doing something different.

So what I guess they're doing is descending down the tree, and splitting off all the operations that can be done in parallel, so that you get a bunch of kernels feeding their results into each other with the structure of the tree. It seems like some synchronization is needed so one thread waits for the results of another. A monoid, as I vaguely understand them, acts as a kind of stored operation, so this seems logical.

And if you ran out of threads, the monoid would just act as a conventional stack on some of the remaining threads.

Of course, none of this says how the mechanism would be implemented.


> So what isn't possible is having two threads push/pop to the same conventional stack. So I assume the author is doing something different.

Indeed; what they're doing is building up a command object that represents a sequence of modifications to a stack. And that's something that you can parallelise, map-reduce style: they've come up with a representation for that command that makes it easy to combine two commands into a bigger command, and a technique for doing that combine in-place without needing extra scratch space. This doesn't need any synchronization between parallel threads - of course the final result is only formed when all the threads finish. (To materialize the final value of the stack you'd have to "run" the final command, but it turns out that's trivial for this command representation - just confirm that the number of pops is zero, and then the push sequence is the final value of the stack).
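A sketch of that map-reduce idea, using the (pops, pushes) command representation described upthread (Python, with invented names):

```python
from functools import reduce

def compose(a, b):
    """Combine two (n_pops, pushes) stack commands, a applied first."""
    n1, vs1 = a
    n2, vs2 = b
    if n2 <= len(vs1):
        return (n1, vs1[:len(vs1) - n2] + vs2)
    return (n1 + n2 - len(vs1), vs2)

IDENT = (0, [])
cmds = [(0, ['a']), (0, ['b']), (1, []), (0, ['c']), (2, ['d'])]

# "Map-reduce": reduce each slice independently, then stitch the partials.
left = reduce(compose, cmds[:2], IDENT)
right = reduce(compose, cmds[2:], IDENT)
combined = compose(left, right)

sequential = reduce(compose, cmds, IDENT)
assert combined == sequential  # associativity makes the split safe
# Zero pops remain, so the push list IS the final stack: (0, ['d'])
```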


*This doesn't need any synchronization between parallel threads - of course the final result is only formed when all the threads finish*

I'm still confused. If one calculates (x+y)*(z+w), you can divide the addition between threads, but the multiplication has to happen once those two threads end.

If you're talking about a "map-reduce" approach, as far as I understand it, the reduce has to happen once the mapping is done. Which would seem to require synchronization also.


I think this is a semantic disagreement rather than a real one. A task in a map-reduce style model doesn't need to synchronize with any other task executing in parallel; of course a task cannot start until its inputs are available. If you want to implement that on top of a threads-and-shared-memory model then you'd need to use synchronization in your task scheduler, but from the point of view of code written inside the model there isn't and can't be any synchronization, which makes it much easier to write correct code.


Monoids are cool. It took me a long time to actually discover monoids during coding. It finally clicked when I realized the QuadBitboard I wrote for a chess move generator is actually a monoid. Empty board is the neutral element, and xor of all 4 bitboards is the operation.

https://github.com/mlang/chessIO/blob/master/src/Game/Chess/...


What does xoring mean for combining two boards that have pieces in the same position? Is there a use for such an operation?


Would someone like to ELI5/summarize what this is and what it is for?


The author is working on piet-gpu, a library for efficiently rendering 2D graphics on the GPU, for UIs and font editors and things like that. This is very nontrivial (and in fact most 2D graphics you see today are in large part rendered on the CPU).

The goal of piet-gpu is that you just specify the scene graph (blue circle here, red line there, etc) and send it over to the GPU, and the GPU takes care of all the rendering.

GPUs are very good at processing data in the shape of arrays. For example, a pixel shader (or "fragment shader") processes a 2D array of pixels on the screen. piet-gpu needs to sometimes process data that has a tree structure, rather than an array structure, to support clipping (for example to show a shape that represents the intersection between a circle and a square). The stack monoid the article talks about is a way of doing tree-like operations in parallel on the GPU, and the author thinks it will end up being useful for being able to process clipping operations entirely on the GPU.


Thanks for the summary!

Quick question: wouldn't executing partially-ordered sets of operations in a semi-parallel or fully-parallel way already be an existing research topic in Computer Science? Obviously this project is attempting to find a solution for a singular application and not attempting to find a general algorithm, but still there must be some historical precedent.

I would imagine that companies like NVIDIA would have even already contributed to this space?

Are there any existing resources on the topic that perhaps you or the OP knows of?


There are actually some in the linked article "some theory" section[0].

Seems to be some old work based on a parallel memory model that used to be used (PRAM), some stuff from Ken Iverson's APL book, and some more APL material from someone's thesis.

[0] https://raphlinus.github.io/gpu/2021/05/13/stack-monoid-revi...


> and in fact most 2D graphics you see today are in large part rendered on the CPU

This is pretty misleading. Most arbitrary 2D paths are rendered on the CPU, but there's a lot more to 2D graphics than that, and much of it already runs on the GPU just fine.


You could start a business doing ELI5s of complex topics! Thank you for a wonderful summary!


If you test on random parens, won't the nesting depth be fairly shallow?

I'm still reading, and I am trying to understand what this will do if you have long streaks of pushes with rare pops (but maybe still occasional enough to cause large shuffles in the in-place reduction). What do you do once the stack size is larger than the number of parallel threads?


Just testing on random parens gives you an expected stack depth of O(sqrt(N)). I've also (just now) tried biasing the sequence generation with more pushes than pops, and it doesn't substantially affect running time.

The case where the stack depth is greater than the partition size is tricky for sure, and this is what I call the "backlink" mechanism. The tl;dr is that it follows the deepest link (furthest from top of stack) that it can, and continues the look-back process, until the slice is complete. You can search for "backlink" in the code - it's actually not a huge amount of logic, though it certainly was subtle to get right.


Any guess as to what the distribution will look like in practice? I suppose you started with a list of iid equally likely parens, but I would guess that in practice the distribution D might look more like:

  () moderately likely
  (X) where X ~ (roughly) D quite likely
  X Y where X,Y ~ (roughly) D slightly likely. 
My prior is that in user interfaces one will typically see a somewhat deep nesting of clips (ie components) with a few primitives to draw at each level. But maybe this is just wrong or most of these clips are axis-aligned rectangles and can be dealt with as the command list is built. Or maybe you do see nesting in practice but not super deep nesting (say window > panel > panel > set of buttons > button > text with each corresponding to several layers of nested clips)


Is the distribution normal around that expected stack depth, or does it skew to many shallow stacks but the rare deep stacks are really deep?



