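That thread-local MXCSR register is particularly entertaining in a thread pool environment such as OpenMP. The OS carefully preserves that piece of thread state across context switches, so a rounding mode or FTZ/DAZ bit set by one task stays in effect for whatever the pool runs next on the same worker thread.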
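If some task really must change MXCSR, one defensive pattern is to save it before the task body and restore it afterwards, so nothing leaks to the next task on that worker. Below is a minimal sketch assuming x86 with SSE and GCC/Clang/ICC-style intrinsics. The task signature and both function names are hypothetical, not part of OpenMP or any pool API.

```c
#include <immintrin.h>  /* _mm_getcsr / _mm_setcsr, _MM_SET_ROUNDING_MODE; x86 only */

/* Hypothetical task signature for a thread-pool work item. */
typedef void (*task_fn)(void *arg);

/* Run one task, then restore this worker thread's MXCSR, so a rounding
   mode or FTZ/DAZ bit the task sets does not leak into the next task the
   pool schedules on the same thread. Note that sticky exception flags
   raised by the task are discarded as well. */
static void run_task_preserving_mxcsr(task_fn fn, void *arg)
{
    unsigned int saved = _mm_getcsr();
    fn(arg);
    _mm_setcsr(saved);
}

/* Example of a careless task: changes the rounding mode and never undoes it. */
static void sloppy_task(void *arg)
{
    (void)arg;
    _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);
}
```

Restoring the whole register, rather than only the rounding bits, is the simplest choice; a caller that needs the task's exception flags would have to read them before the restore.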
I tend to avoid touching MXCSR at all, even when that costs extra instructions: roundpd with the rounding mode given in its immediate operand, or shuffles that keep division by 0 out of the unused lanes.
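A minimal sketch of both workarounds, assuming x86-64 with SSE4.1 and GCC or Clang (the `target` attribute lets those compilers use SSE4.1 intrinsics without a global `-msse4.1`). The function names are hypothetical. `_mm_round_pd` compiles to roundpd, whose immediate selects the rounding mode directly, so MXCSR is neither read nor written. `div3_ps` pads the unused fourth lane with a 1.0f divisor; it uses a plain set rather than a shuffle, for brevity.

```c
#include <immintrin.h>  /* SSE / SSE4.1 intrinsics; x86 only */

/* Round two doubles toward -infinity without touching MXCSR: the rounding
   mode is encoded in the ROUNDPD immediate, and _MM_FROUND_NO_EXC
   suppresses the precision (inexact) exception. */
__attribute__((target("sse4.1")))
static __m128d floor_pd_no_mxcsr(__m128d x)
{
    return _mm_round_pd(x, _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
}

/* Divide three floats using a 4-lane vector. The unused fourth lane gets
   0.0f / 1.0f, so that lane can never set the divide-by-zero (ZE) status
   flag in MXCSR. Real lanes with b[i] == 0 still would, as expected. */
static void div3_ps(const float *a, const float *b, float *out)
{
    __m128 num = _mm_setr_ps(a[0], a[1], a[2], 0.0f);
    __m128 den = _mm_setr_ps(b[0], b[1], b[2], 1.0f);
    float tmp[4];
    _mm_storeu_ps(tmp, _mm_div_ps(num, den));
    out[0] = tmp[0];
    out[1] = tmp[1];
    out[2] = tmp[2];
}
```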