I think Carmack credits someone else as the origin, possibly a magazine article.
These days I think the hardware reciprocal square root intrinsic is the fastest option where precision is not that important.
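For anyone who wants to compare the two, here is a rough sketch in C. It assumes an x86 target with SSE for the intrinsic path (`_mm_rsqrt_ss` from `<xmmintrin.h>`) and falls back to the classic bit hack elsewhere. The function names `rsqrt_bithack` and `rsqrt_approx` are my own, not from any library.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE_RSQRT 1
#endif

/* Classic bit-level estimate followed by one Newton-Raphson step.
   memcpy is used instead of pointer casts to avoid aliasing problems. */
static float rsqrt_bithack(float x)
{
    uint32_t i;
    float y;
    memcpy(&i, &x, sizeof i);
    i = 0x5f3759dfu - (i >> 1);            /* initial guess from the bit pattern */
    memcpy(&y, &i, sizeof y);
    return y * (1.5f - 0.5f * x * y * y);  /* one refinement step */
}

/* Hardware estimate where SSE is available (assumption: x86 target),
   otherwise the bit hack above. Both are approximations only. */
static float rsqrt_approx(float x)
{
#ifdef HAVE_SSE_RSQRT
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
#else
    return rsqrt_bithack(x);
#endif
}
```

Both give only an approximation of 1/sqrt(x), so neither is a drop-in replacement where full float precision matters. Which one is actually faster depends on the CPU and compiler, so it is worth measuring.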
I think there was a bit-twiddling hack for population count (popcount) which was consistently faster than the equivalent CPU intrinsic due to some weird pipelining effect, so sometimes it is possible to beat the compilers and intrinsics with clever hacks.
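I don't remember which exact variant that was, but the well-known parallel (SWAR) bit count looks roughly like this in C. Treat it as a sketch of the general technique; `popcount_swar` is just a name I picked.

```c
#include <stdint.h>

/* Counts set bits in a 32-bit word by summing adjacent bit fields
   in parallel: pairs, then nibbles, then bytes. */
static unsigned popcount_swar(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                 /* 2-bit sums */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); /* 4-bit sums */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums */
    return (v * 0x01010101u) >> 24;                   /* add the four bytes */
}
```

Whether this beats the hardware instruction depends heavily on the CPU and on the code the compiler generates around it, so benchmark on the actual target before swapping anything out.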
If anything it's a lesson that the definition of brilliance is being in the wrong place at the wrong time... ;-)