Using Horner's method to evaluate the polynomials is probably suboptimal because it creates a serial dependency chain and eliminates instruction-level parallelism. It might be worth trying Estrin's method instead (if the compiler does not already rewrite them).
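For a cubic, the difference looks roughly like this (a minimal C sketch with made-up function names; the article's actual polynomials differ):

    /* Horner: every FMA depends on the previous one, a serial chain. */
    float horner3(float x, float c0, float c1, float c2, float c3) {
        return ((c3 * x + c2) * x + c1) * x + c0;
    }

    /* Estrin: x*x and the two inner FMAs are mutually independent, so a
       superscalar core can issue them in parallel; only the final
       combine is serial. */
    float estrin3(float x, float c0, float c1, float c2, float c3) {
        float x2 = x * x;
        return (c1 * x + c0) + (c3 * x + c2) * x2;
    }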
Actually, there is quite a bit of available ILP in the forms that use division - as many as four instructions in flight. The numerator and denominator can be evaluated in parallel, and the coefficient loads can be interleaved with the FMAs to hide some of their latency.
That said, with such low variation in the timings, I wonder whether some other overhead was dominating the cost. For example, floating-point division is typically much slower than either addition or multiplication. According to Agner Fog, div has 2-4x the latency of add or mul while completely hogging the pipeline.
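Concretely, the rational core looks something like this sketch (coefficients lifted from the SBCL port further down the thread; the helper name is made up). The numerator and denominator chains share no intermediates, so only the final divide sits on the critical path:

    /* Illustrative sketch, t = x - 1. */
    float rational_core(float t) {
        float num = (0.338953f * t + 2.198599f) * t;  /* independent chain */
        float den = t + 1.523692f;                    /* independent chain */
        return num / den;  /* the divide dominates the latency */
    }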
I have tried the code posted at the end of the article, and it gives wildly incorrect results for input values in the range [0.5, 1.5). Also, for outputs starting at multiples of 4, the values suddenly diverge and then converge back toward the correct value until the next multiple.
Did I forget to do something with the result or is there a bug?
No mention of Padé approximants? I'd just use a Chebyshev approximation transformed into a Padé approximation, then perturb the coefficients to minimize the maximum error.
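For concreteness, matching the series log(1+t) = t - t^2/2 + t^3/3 - ... gives the [2/1] Padé approximant t(t+6)/(4t+6); scaling by 1/ln 2 converts it to base 2. A hand-derived sketch (illustration only; the Chebyshev-based construction above would yield different coefficients):

    /* [2/1] Pade approximant of log2(1 + t) about t = 0; agrees with
       the series through the t^3 term. Illustrative sketch. */
    float log2_pade21(float t) {
        return t * (t + 6.0f) / ((4.0f * t + 6.0f) * 0.69314718f);
    }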
For values far from 1, log_2 will be dominated by the integer portion, which the IEEE FP representation gives you for free, so it's not surprising that outputs eventually look right. I think the problem is that the approximation works with (x - 1), but only the "significand >= 1.5" branch subtracts that offset. Really, if you need speed, I'd probably start with an approximation over [1, 2)... (a C sketch of that reduction follows the transcript below).
Re-implementation in SBCL:
CL-USER> (defun approx (x)
           (let ((o (1- x)))
             (values (/ (+ (* o o 0.338953) (* o 2.198599))
                        (+ o 1.523692))
                     (log x 2.0))))
CL-USER> (loop for x from 0.70 upto 1.51 by 0.1
               do (format t "~,3F: ~{~F ~F~}~%" x
                          (multiple-value-list (approx x))))
0.700: -0.51407874 -0.5145732
0.800: -0.32194924 -0.32192805
0.900: -0.15204856 -0.15200303
1.000: 0.0 0.0
1.100: 0.13749498 0.13750356
1.200: 0.26296926 0.26303446
1.300: 0.3784003 0.37851173
1.400: 0.48535433 0.48542693
1.500: 0.585088 0.5849626
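And a sketch of the [1, 2) reduction suggested above, in C (reusing the thread's coefficients purely for illustration; a real implementation would refit them for this interval, and all names here are made up):

    #include <math.h>

    /* Illustrative sketch: split x into m * 2^e, approximate over the
       reduced range, then add the exponent back. */
    float log2_reduced(float x) {
        int e;
        float m = frexpf(x, &e);  /* x = m * 2^e, with m in [0.5, 1) */
        m *= 2.0f;                /* rescale so m is in [1, 2) */
        e -= 1;
        float t = m - 1.0f;
        float approx = (0.338953f * t + 2.198599f) * t / (t + 1.523692f);
        return approx + (float)e; /* the integer part comes back for free */
    }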
In [part I](http://www.ebaytechblog.com/2015/05/01/fast-approximate-loga...) the author argues against using [1, 2) because it causes large relative error near x = 1 (since log x = 0 there). I would disagree (why would you ever care about relative error in a logarithm? IMHO absolute error makes much more sense, unless for some crazy reason you're dividing by log x near x = 1), but that's his argument.
I copy-pasted the code and ran a unit test, which it failed. The inputs are positive floats, the outputs are compared to the results of log2f, and then the average error is calculated, which was way off.
Did you try comparing the results for input values between 0.5 and 1.5 to the library log2f function?
Just for fun, I benchmarked the function found at the end of the post, disregarding accuracy for the moment.
If you call this function on a single value at a time, the MSFT log function will take twice as long to produce a full-accuracy, properly handled result.
If you call this function over an array of data instead, iterating over an array of floats and taking the log of each value (the typical pattern, since it lets you take advantage of SIMD CPU instructions), the picture changes: with the standard log function, Visual Studio will actually replace your for loop with a vectorized version of the log function once the array is around 16 floats long. That gives you a full-accuracy log at about half the speed of this function at array lengths of 16, scaling to upwards of 4.5 times faster at lengths of 64k (see the loop sketched after this comment). Intel's comparable 11 bit precision vectorized log function performs about 6 times faster at lengths of 64k or longer.
Simply reorganizing your data could lead to incredible improvements.
*Tested on an Intel 4770 CPU, forcing single-threaded execution, using CPU clock cycles as a timer, averaged over ~1 minute of execution time. The Intel library's cost is a couple of hours of developer time. Input was uniformly distributed over the range 1.0e-6 to 1.0e6.
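For reference, the array-at-a-time call site is just a plain counted loop over contiguous floats, which is the shape the auto-vectorizer needs before it will substitute a SIMD log (a sketch; names made up, and the substitution depends on optimization and fast-math-style flags):

    #include <stddef.h>
    #include <math.h>

    void log_array(const float *in, float *out, size_t n) {
        for (size_t i = 0; i < n; ++i)
            out[i] = logf(in[i]);  /* candidate calls for a vectorized log */
    }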
> Intel's comparable 11 bit precision vectorized log function performs about 6 times faster at lengths of 64k or longer.
The issue is probably the division in the approximation. Agner Fog says even Sandy Bridge doesn't pipeline FP (or integer) divisions, which take 10 to 14 cycles (less than two add/mul pairs). If the goal is to minimise latency for a small number of values, a lower-order rational approximation might be preferable to a higher-order polynomial; when we want throughput, however, adds and muls pipeline better.
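To illustrate the throughput side: a division-free polynomial is a pure add/mul (FMA) chain, so independent array elements pipeline back to back, while a non-pipelined divider serializes them. A sketch using the Taylor coefficients of log2(1 + t), purely for illustration (a minimax fit over the reduced range would be more accurate):

    /* Degree-4 polynomial for log2(1 + t): no divide anywhere. */
    float log2_poly(float t) {
        return t * (1.4426950f + t * (-0.7213475f
                 + t * (0.4808983f - t * 0.3606738f)));
    }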
I've always wondered what the remaining opportunities for techniques like these are on modern processors. I'd guess if you took a long look at the x86-64 instruction set you might be able to figure some nifty things out.
The reverse question is whether all the silicon needed to implement these advanced instructions is holding back clock speeds significantly; if so, we might be better off moving to something RISC and pushing core count and clock speed.
Log functions can be called heavily in the machine learning algorithms used for making recommendations and automatically categorizing items. I'd presume this is eBay's use case.