
The glibc implementation[0] uses Enhanced REP MOVSB when the array is long enough. It takes a few cycles to start up the ERMS feature, so it's only used on longer arrays.
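The dispatch described above can be sketched like this: a `rep movsb` path behind a size check. This is a minimal illustration, not glibc's actual code; the threshold value and the small-copy fallback are made up for the example (glibc tunes the crossover per-CPU and uses SIMD loops for small sizes).

```c
#include <stddef.h>

/* Hypothetical crossover; glibc derives the real one from CPU tunables. */
#define ERMS_THRESHOLD 2048

/* Copy n bytes with REP MOVSB (x86-64 inline asm).
 * RDI = destination, RSI = source, RCX = count. */
static void *movsb_memcpy(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
}

void *my_memcpy(void *dst, const void *src, size_t n)
{
    if (n >= ERMS_THRESHOLD)
        return movsb_memcpy(dst, src, n); /* amortizes ERMS startup cost */

    /* Small copies: byte loop as a stand-in for glibc's SIMD paths. */
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}
```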

Edit: Wait a minute... if this is true, then how can AVX be responsible for the speedup? Is it related to the size of the buffers being copied internally?

[0] Line 48 here: http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_...



> The glibc implementation[0] uses Enhanced REP MOVSB when the array is long enough. It takes a few cycles to start up the ERMS feature, so it's only used on longer arrays.

That isn't true anymore either, on sufficiently recent processors with "Fast Short REP MOVSB (FSRM)". If the FSRM bit is set (which it is on Ice Lake and newer), you can just always use REP MOVSB.
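For reference, both feature bits live in CPUID leaf 7, subleaf 0: ERMS is EBX bit 9 and FSRM is EDX bit 4 (per the Intel SDM). A minimal check using GCC/Clang's `<cpuid.h>` helper might look like this:

```c
#include <cpuid.h>

/* CPUID.(EAX=7, ECX=0) feature bits, per the Intel SDM. */
#define ERMS_BIT (1u << 9) /* EBX bit 9: Enhanced REP MOVSB/STOSB */
#define FSRM_BIT (1u << 4) /* EDX bit 4: Fast Short REP MOVSB */

int has_erms(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx & ERMS_BIT) != 0;
}

int has_fsrm(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (edx & FSRM_BIT) != 0;
}
```

With FSRM set, a memcpy implementation can skip the size check entirely and use `rep movsb` for all lengths.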


Still waiting for the "Yes, This Time We Really Mean It Fast REP MOVSB" (YTTWRMIFRM) bit.

More seriously, if REP MOVSB can be counted on to always be the fastest method, that's fantastic. One thing that easily gets forgotten in microbenchmarking is instruction-cache (I$) pollution by those fancy unrolled SIMD loops with 147 special cases.


Ever since reading the advice to avoid these CISC-y instructions (IIRC right back to the first Pentium), I've been wondering why.

Like what made them not implement the best possible microcode for that?

I mean, sure, for very short loops I guess I can see unrolling being faster than invoking microcode, but yeah.



