> Intel's comparable 11 bit precision vectorized log function performs about 6 times faster at lengths of 64k or longer.
The issue is probably the division in the approximation. Agner Fog says even Sandy Bridge doesn't pipeline FP (or integer) divisions, which take from 10 to 14 cycles each (less than the latency of two dependent add/mul pairs). If the goal is to minimise latency for a small number of values, a lower-order rational approximation might be preferable to a higher-order polynomial; however, when we want throughput over long vectors, adds and multiplies pipeline better, so the polynomial should win.
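To make the trade-off concrete, here is a rough sketch in C of the two evaluation styles. It is not Intel's method, nor the code being discussed: the function names are made up, and the coefficients are simply the degree-4 Taylor polynomial and the [2/2] Padé approximant of log(1+x), chosen because they are easy to check by hand (a real implementation would use minimax coefficients after range reduction). The point is only the instruction mix: the polynomial is a chain of multiply/add steps, while the rational form has shorter chains but ends in a division.

```c
#include <stddef.h>

/* Illustrative only: degree-4 Taylor polynomial for log(1+x) in Horner
 * form. Nothing but adds and multiplies, so independent elements can
 * overlap in the pipeline. */
static double log1p_poly(double x)
{
    return x * (1.0 + x * (-1.0 / 2.0 + x * (1.0 / 3.0 + x * (-1.0 / 4.0))));
}

/* Illustrative only: [2/2] Pade approximant for log(1+x),
 *   x * (6 + 3x) / (6 + 6x + x^2).
 * Shorter add/mul chains, but each result needs one division. */
static double log1p_rational(double x)
{
    double num = x * (6.0 + 3.0 * x);
    double den = 6.0 + x * (6.0 + x);
    return num / den;
}

/* Throughput case: every element is independent, so the add/mul-only
 * kernel is not held up by a divider. */
void log1p_poly_array(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = log1p_poly(in[i]);
}

/* Same loop with the rational kernel: one division per element. */
void log1p_rational_array(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = log1p_rational(in[i]);
}
```

Timing the two array loops at lengths of 64k or more would be the way to check whether the division really is the bottleneck on a given machine; I haven't measured this sketch.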