
The glibc implementation[0] uses Enhanced REP MOVSB when the array is long enough. It takes a few cycles to start up the ERMS feature, so it's only used on longer arrays.
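The dispatch described above can be sketched like this: a `rep movsb` path behind a size check. This is a minimal illustration, not glibc's actual code; the threshold value and the small-copy fallback are made up for the example (glibc tunes the crossover per-CPU and uses SIMD loops for small sizes).

```c
#include <stddef.h>

/* Hypothetical crossover; glibc derives the real one from CPU tunables. */
#define ERMS_THRESHOLD 2048

/* Copy n bytes with REP MOVSB (x86-64 inline asm).
 * RDI = destination, RSI = source, RCX = count. */
static void *movsb_memcpy(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
}

void *my_memcpy(void *dst, const void *src, size_t n)
{
    if (n >= ERMS_THRESHOLD)
        return movsb_memcpy(dst, src, n); /* amortizes ERMS startup cost */

    /* Small copies: byte loop as a stand-in for glibc's SIMD paths. */
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}
```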

Edit: Wait a minute... if this is true, then how can AVX be responsible for the speedup? Is it related to the size of the buffers being copied internally?

[0] Line 48 here: http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_...



> The glibc implementation[0] uses Enhanced REP MOVSB when the array is long enough. It takes a few cycles to start up the ERMS feature, so it's only used on longer arrays.

That isn't true anymore either, on sufficiently recent processors with "Fast Short REP MOVSB (FSRM)". If the FSRM bit is set (which it is on Ice Lake and newer), you can just always use REP MOVSB.
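For reference, both feature bits live in CPUID leaf 7, subleaf 0: ERMS is EBX bit 9 and FSRM is EDX bit 4 (per the Intel SDM). A minimal check using GCC/Clang's `<cpuid.h>` helper might look like this:

```c
#include <cpuid.h>

/* CPUID.(EAX=7, ECX=0) feature bits, per the Intel SDM. */
#define ERMS_BIT (1u << 9) /* EBX bit 9: Enhanced REP MOVSB/STOSB */
#define FSRM_BIT (1u << 4) /* EDX bit 4: Fast Short REP MOVSB */

int has_erms(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx & ERMS_BIT) != 0;
}

int has_fsrm(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (edx & FSRM_BIT) != 0;
}
```

With FSRM set, a memcpy implementation can skip the size check entirely and use `rep movsb` for all lengths.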


Still waiting for the "Yes, This Time We Really Mean It Fast REP MOVSB" (YTTWRMIFRM) bit.

More seriously, if REP MOVSB can be counted on to always be the fastest method, that's fantastic. One thing that easily gets forgotten in microbenchmarking is instruction-cache (I$) pollution by those fancy unrolled SIMD loops with 147 special cases.


Ever since reading the advice to avoid these CISC-y instructions (IIRC right back to the first Pentium), I've been wondering why.

Like what made them not implement the best possible microcode for that?

I mean, sure, for very short loops I guess I can see unrolling being faster than invoking microcode, but yeah.



