Using Horner's method to evaluate the polynomials is probably suboptimal because it creates a serial dependency chain and eliminates instruction-level parallelism. It might be worth trying Estrin's method instead (if the compiler does not already rewrite them).
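For a cubic, the difference looks roughly like this (a minimal C sketch with made-up function names; the article's actual polynomials differ):

    /* Horner: every FMA depends on the previous one, a serial chain. */
    float horner3(float x, float c0, float c1, float c2, float c3) {
        return ((c3 * x + c2) * x + c1) * x + c0;
    }

    /* Estrin: x*x and the two inner FMAs are mutually independent, so a
       superscalar core can issue them in parallel; only the final
       combine is serial. */
    float estrin3(float x, float c0, float c1, float c2, float c3) {
        float x2 = x * x;
        return (c1 * x + c0) + (c3 * x + c2) * x2;
    }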
Actually, there is quite a bit of available ILP in the forms that use division - as many as four instructions in flight. The numerator and denominator can be evaluated in parallel, and the coefficient loads can be interleaved with the FMAs to hide some of their latency.
That said, with such low variation in the timings, I wonder whether some other overhead was dominating the cost. For example, floating-point division is typically much slower than either addition or multiplication. According to Agner Fog, div has 2-4x the latency of add or mul while completely hogging the pipeline.
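Concretely, the rational core looks something like this sketch (coefficients lifted from the SBCL port further down the thread; the helper name is made up). The numerator and denominator chains share no intermediates, so only the final divide sits on the critical path:

    /* Illustrative sketch, t = x - 1. */
    float rational_core(float t) {
        float num = (0.338953f * t + 2.198599f) * t;  /* independent chain */
        float den = t + 1.523692f;                    /* independent chain */
        return num / den;  /* the divide dominates the latency */
    }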
I have tried the code posted at the end of the article, and it gives wildly incorrect results for input values in the range [0.5, 1.5). Also, for outputs starting at multiples of 4, the values suddenly diverge and then converge back toward the correct value until the next multiple.
Did I forget to do something with the result or is there a bug?
No mention of Padé approximants? I'd just use a Chebyshev approximation transformed into a Padé approximation, then perturb the coefficients to minimize the maximum error.
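For concreteness, matching the series log(1+t) = t - t^2/2 + t^3/3 - ... gives the [2/1] Padé approximant t(t+6)/(4t+6); scaling by 1/ln 2 converts it to base 2. A hand-derived sketch (illustration only; the Chebyshev-based construction above would yield different coefficients):

    /* [2/1] Pade approximant of log2(1 + t) about t = 0; agrees with
       the series through the t^3 term. Illustrative sketch. */
    float log2_pade21(float t) {
        return t * (t + 6.0f) / ((4.0f * t + 6.0f) * 0.69314718f);
    }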
For values far from 1, log_2 will be dominated by the integer portion, which the IEEE FP representation gives you for free, so it's not surprising that outputs eventually look right. I think the problem is that the approximation works with (x - 1), but only the "significand >= 1.5" branch subtracts that offset. Really, if you need speed, I'd probably start with an approximation over [1, 2)... (a C sketch of that reduction follows the transcript below).
Re-implementation in SBCL:
CL-USER> (defun approx (x)
           (let ((o (1- x)))
             (values (/ (+ (* o o 0.338953) (* o 2.198599))
                        (+ o 1.523692))
                     (log x 2.0))))
CL-USER> (loop for x from 0.70 upto 1.51 by 0.1
               do (format t "~,3F: ~{~F ~F~}~%" x
                          (multiple-value-list (approx x))))
0.700: -0.51407874 -0.5145732
0.800: -0.32194924 -0.32192805
0.900: -0.15204856 -0.15200303
1.000: 0.0 0.0
1.100: 0.13749498 0.13750356
1.200: 0.26296926 0.26303446
1.300: 0.3784003 0.37851173
1.400: 0.48535433 0.48542693
1.500: 0.585088 0.5849626
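And a sketch of the [1, 2) reduction suggested above, in C (reusing the thread's coefficients purely for illustration; a real implementation would refit them for this interval, and all names here are made up):

    #include <math.h>

    /* Illustrative sketch: split x into m * 2^e, approximate over the
       reduced range, then add the exponent back. */
    float log2_reduced(float x) {
        int e;
        float m = frexpf(x, &e);  /* x = m * 2^e, with m in [0.5, 1) */
        m *= 2.0f;                /* rescale so m is in [1, 2) */
        e -= 1;
        float t = m - 1.0f;
        float approx = (0.338953f * t + 2.198599f) * t / (t + 1.523692f);
        return approx + (float)e; /* the integer part comes back for free */
    }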
In [part I](http://www.ebaytechblog.com/2015/05/01/fast-approximate-loga...) the author argues against using [1, 2) because it causes large relative error near x = 1 (since log x = 0 there). I would disagree (why would you ever care about relative error in a logarithm? IMHO absolute error makes much more sense, unless for some crazy reason you're dividing by log x near x = 1), but that's his argument.
I copy-pasted the code and ran a unit test, which it failed. The inputs are positive floats, the outputs are compared to the results of log2f, and then the average error is calculated, which was way off.
Did you try comparing the results for input values between 0.5 and 1.5 to the library log2f function?
Just for fun, I benchmarked the function found at the end of the post, disregarding accuracy for the moment.
If you call this function on a single value at a time, the MSFT log function will take twice as long to produce a full-accuracy, properly handled result.
If you call this function over an array of data instead, iterating over an array of floats and taking the log of each value (the typical pattern, since it lets you take advantage of SIMD CPU instructions), the picture changes: with the standard log function, Visual Studio will actually replace your for loop with a vectorized version of the log function once the array is around 16 floats long. That gives you a full-accuracy log at about half the speed of this function at array lengths of 16, scaling to upwards of 4.5 times faster at lengths of 64k (see the loop sketched after this comment). Intel's comparable 11 bit precision vectorized log function performs about 6 times faster at lengths of 64k or longer.
Simply reorganizing your data could lead to incredible improvements.
*Tested on an Intel 4770 CPU, forcing single-threaded execution, using CPU clock cycles as a timer, averaged over ~1 minute of execution time. The Intel library's cost is a couple of hours of developer time. Input was uniformly distributed over the range 1.0e-6 to 1.0e6.
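For reference, the array-at-a-time call site is just a plain counted loop over contiguous floats, which is the shape the auto-vectorizer needs before it will substitute a SIMD log (a sketch; names made up, and the substitution depends on optimization and fast-math-style flags):

    #include <stddef.h>
    #include <math.h>

    void log_array(const float *in, float *out, size_t n) {
        for (size_t i = 0; i < n; ++i)
            out[i] = logf(in[i]);  /* candidate calls for a vectorized log */
    }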
> Intel's comparable 11 bit precision vectorized log function performs about 6 times faster at lengths of 64k or longer.
The issue is probably the division in the approximation. Agner Fog says even Sandy Bridge doesn't pipeline FP (or integer) divisions, which take 10 to 14 cycles (less than two add/mul pairs). If the goal is to minimise latency for a small number of values, a lower-order rational approximation might be preferable to a higher-order polynomial; when we want throughput, however, adds and muls pipeline better.
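To illustrate the throughput side: a division-free polynomial is a pure add/mul (FMA) chain, so independent array elements pipeline back to back, while a non-pipelined divider serializes them. A sketch using the Taylor coefficients of log2(1 + t), purely for illustration (a minimax fit over the reduced range would be more accurate):

    /* Degree-4 polynomial for log2(1 + t): no divide anywhere. */
    float log2_poly(float t) {
        return t * (1.4426950f + t * (-0.7213475f
                 + t * (0.4808983f - t * 0.3606738f)));
    }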
I've always wondered what the remaining opportunities for techniques like these are on modern processors. I'd guess if you took a long look at the x86-64 instruction set you might be able to figure some nifty things out.
The reverse question is whether all the silicon needed to implement these advanced instructions is holding back clock speeds significantly; if so, we might be better off moving to something RISC and pushing core count and clock speed.
Log functions can be called heavily in the machine learning algorithms used for making recommendations and automatically categorizing items. I'd presume this is eBay's use case.