It's really something the stdlib should do, and in the implementations I've seen, that's usually where it ends up.
The compiler should just provide good inlining support, so that if, e.g., you include the short-string optimization in your stdlib, the compiler can optimize the common case down to a couple of bit operations, a test, and a word copy. If the test fails because your string is longer than 7 bytes, it's perfectly fine to call a function: the call overhead is usually dwarfed by the copy loop for large strings. And if new hardware comes out and you want to vectorize the copy differently, you can get away with replacing that one function in the stdlib (assuming it's dynamically linked) instead of recompiling every single program in existence.
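To make the inline-fast-path / out-of-line-slow-path split concrete, here's a rough sketch in C. Everything in it is made up for illustration (the `sstr` type, the tag-bit layout, the function names); it isn't how any particular stdlib does it. It assumes pointers fit in 64 bits and that `malloc` returns addresses with the low bit clear, so that bit can serve as the "inline" tag.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical 8-byte string handle (illustration only).
 * Low bit set:   inline string. Bits 1..3 hold the length (0..7),
 *                and character i lives in bits 8*(i+1) .. 8*(i+1)+7.
 * Low bit clear: the word is a pointer to a heap block.
 * Assumes 64-bit-or-smaller pointers and malloc alignment >= 2. */
typedef struct { uint64_t word; } sstr;

typedef struct { size_t len; char data[]; } sstr_heap;

/* Build a string; anything longer than 7 bytes goes to the heap. */
sstr sstr_from(const char *src, size_t len) {
    sstr s;
    if (len <= 7) {
        s.word = 1u | ((uint64_t)len << 1);
        for (size_t i = 0; i < len; i++)
            s.word |= (uint64_t)(unsigned char)src[i] << (8 * (i + 1));
        return s;
    }
    sstr_heap *h = malloc(sizeof *h + len);
    if (!h) abort();
    h->len = len;
    memcpy(h->data, src, len);
    s.word = (uint64_t)(uintptr_t)h;
    return s;
}

static inline int sstr_is_inline(sstr s) { return (int)(s.word & 1u); }

static inline size_t sstr_len(sstr s) {
    if (sstr_is_inline(s)) return (size_t)((s.word >> 1) & 7u);
    return ((const sstr_heap *)(uintptr_t)s.word)->len;
}

/* Slow path: would live in the library, so it can be swapped out
 * (e.g. for a differently vectorized copy) without touching callers. */
sstr sstr_copy_slow(sstr s);

/* Fast path: meant to be inlined at every call site.
 * For short strings this is one bit test and one word copy. */
static inline sstr sstr_copy(sstr s) {
    if (sstr_is_inline(s)) return s;
    return sstr_copy_slow(s);
}

sstr sstr_copy_slow(sstr s) {
    const sstr_heap *h = (const sstr_heap *)(uintptr_t)s.word;
    return sstr_from(h->data, h->len);  /* len > 7, so this allocates */
}

static inline void sstr_free(sstr s) {
    if (!sstr_is_inline(s)) free((void *)(uintptr_t)s.word);
}
```

In a real library `sstr_copy` would sit in the header and `sstr_copy_slow` in the shared object, so only the latter needs to change when the copy strategy does.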