I dunno about "multiway merge". But your standard 2-way merge is SIMD-optimized by having a binary-search across the "diagonals" of the so-called mergepath.
That leads to O(lg(p * sqrt(2))) time for the longest comparison diagonal, where "p" is the number of processors. Including thread divergence, that's O(N/p * lg(p * sqrt(2))) total work done, or, assuming constant p, O(N) of work per 2-way SIMD merge. A large "p" gives an arbitrarily large speedup in theory, but in practice it's probably limited to 1024 on modern GPU architectures, since modern GPUs only really "gang up" efficiently into 1024-thread CUDA blocks / workgroups.
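For anyone who hasn't seen the diagonal search written out, here's a rough sketch in plain Python, with an ordinary loop standing in for the p SIMD lanes / GPU threads. The names `merge_path_split` and `merge_path_merge` are mine, not from any library, and I'm assuming both inputs are already sorted. Each "thread" binary-searches its own cross-diagonal to find where its chunk of the output starts, then does a normal sequential merge of about N/p elements. This isn't how a tuned CUDA kernel would look, just the idea.

```python
def merge_path_split(a, b, diag):
    """Find where cross-diagonal `diag` intersects the merge path.

    Returns (i, j) with i + j == diag, such that the first `diag` outputs
    of the stable merge are exactly a[:i] and b[:j] merged together.
    """
    lo = max(0, diag - len(b))   # fewest elements we can take from a
    hi = min(diag, len(a))       # most elements we can take from a
    while lo < hi:
        i = (lo + hi) // 2
        j = diag - i
        # If a[i] <= b[j-1], then a[i] must come before b[j-1] in the
        # output, so the split needs more elements from a.
        if a[i] <= b[j - 1]:
            lo = i + 1
        else:
            hi = i
    return lo, diag - lo


def merge_path_merge(a, b, p):
    """Merge sorted lists a and b using p independent 'threads'.

    The outer loop is sequential here; on a GPU each iteration would be
    one thread, since no iteration reads another iteration's results.
    """
    n = len(a) + len(b)
    out = [None] * n
    for t in range(p):
        d0 = t * n // p          # first output index owned by thread t
        d1 = (t + 1) * n // p    # one past the last output index
        i, j = merge_path_split(a, b, d0)
        for k in range(d0, d1):
            # Take from a on ties so the merge stays stable.
            if j >= len(b) or (i < len(a) and a[i] <= b[j]):
                out[k] = a[i]
                i += 1
            else:
                out[k] = b[j]
                j += 1
    return out


print(merge_path_merge([1, 3, 5, 7], [2, 4, 6, 8], 4))
# prints [1, 2, 3, 4, 5, 6, 7, 8]
```

Note that the binary search in this sketch runs over a range bounded by the shorter input, and the only shared state is the output array, which each "thread" writes to in a disjoint slice. That disjointness is what makes the approach map onto a thread block without any synchronization during the merge itself.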