Erlang handles this with buckets: to user code it looks like a single queue, but behind the scenes it behaves more like multiple queues. It scales extremely well. It's certainly not "just moving a pointer to a piece of shared memory", though...
https://www.erlang.org/blog/parallel-signal-sending-optimiza...
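
To make the bucket idea concrete, here is a rough Python sketch. It is not how the Erlang VM implements it; the class name, the bucket count, and the hash-by-sender rule are all assumptions made up for illustration. Senders only contend on their own bucket's lock, and the receiver still sees what looks like a single queue.

```python
import threading
from collections import deque


class BucketedMailbox:
    """Hypothetical mailbox: one queue to the receiver, several buckets to senders."""

    def __init__(self, num_buckets=8):  # bucket count is an arbitrary assumption
        self._buckets = [deque() for _ in range(num_buckets)]
        self._locks = [threading.Lock() for _ in range(num_buckets)]
        self._inbox = deque()  # only ever touched by the single receiver

    def send(self, sender_id, message):
        # A given sender always lands in the same bucket, so its own
        # messages keep their relative order.
        i = hash(sender_id) % len(self._buckets)
        with self._locks[i]:
            self._buckets[i].append(message)

    def receive(self):
        # Drain the buckets only when the private inbox runs dry.
        if not self._inbox:
            self._drain()
        return self._inbox.popleft() if self._inbox else None

    def _drain(self):
        # Move everything from each bucket into the receiver's inbox,
        # holding one bucket lock at a time.
        for lock, bucket in zip(self._locks, self._buckets):
            with lock:
                self._inbox.extend(bucket)
                bucket.clear()
```

Messages from different senders can interleave in any order here, but messages from the same sender stay in order, which is the property the single-queue illusion depends on in this sketch.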