Another approach you might try is to start with a rotator, then mask off the leading or trailing bits for right or left shifts. (Rotate right by X is the same as rotate left by N-X.)
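A rough software model of that rotate-then-mask idea, as a sketch only: the function names and the 16-bit default width are made up for illustration, and it models the logic, not any particular circuit. Shift amounts are assumed to be in the range 0 to N-1.

```python
def rotl(x, k, n=16):
    """Rotate the n-bit value x left by k positions (hypothetical helper)."""
    mask = (1 << n) - 1
    x &= mask
    k %= n
    # Bits pushed past the top wrap around to the bottom.
    return ((x << k) | (x >> (n - k))) & mask


def shift_left(x, k, n=16):
    """Logical left shift: rotate left, then clear the k low bits that wrapped."""
    mask = (1 << n) - 1
    return rotl(x, k, n) & (mask << k) & mask


def shift_right(x, k, n=16):
    """Logical right shift: rotate right by k (= rotate left by n-k),
    then clear the k high bits that wrapped."""
    mask = (1 << n) - 1
    return rotl(x, n - k, n) & (mask >> k)
```

The only difference between the three operations is the mask applied after the rotator, which is what makes this an alternative to muxing the shifted-in bits at every stage.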
I chose the mux-based approach because it has the better area*delay product, but it might be interesting to implement the mask-based design to see how it works out on my FPGA.
The design I used should have a delay of one 74'350 and one 74'257 per two bits of shift/rotate index, but that's admittedly still one extra 2:1 mux level for every two index bits.
The version I breadboarded has 8 '350s and 5 '257s for a 16-bit SRU[0]; I'm not sure how to compare that area-wise to a 32-bit circuit without '350s, but you'd at least avoid needing logic to do or not do the bit-reverse.