Are there similar relatively simple methods or simd tricks for multi-way merges?

dragontamer · on July 13, 2021

I dunno about "multiway merge". But your standard 2-way merge is SIMD-optimized by having a binary-search across the "diagonals" of the so-called mergepath.

Leading to O(lg(p * sqrt(2))) time for the longest comparison diagonal, where "p" is the number of processors. Including thread divergence, that's O(N/p * lg(p * sqrt(2))) total work done, or assuming constant-p, O(N) of work per 2-way SIMD merge (with a large "p" providing an arbitrarily large speedup: but in practice is probably limited to 1024 for modern GPU architectures. Since modern GPUs only really "gang up" into 1024-CUDA thread blocks / workgroups efficiently)

https://web.cs.ucdavis.edu/~amenta/f15/GPUmp.pdf

This provides a natural way at allowing ~1024 CUDA threads in a singular block to sort a list at O(n * log(n)) efficiently.