>almost everything on a modern CPU uses "microcode" of some kind, although the term gets kind of hazy

µOps are different from the kind of microcode described here. Older x86 CPUs basically had a "bytecode interpreter" in microcode ROM: every instruction (except for some trivial set/clear flag operations) would go to a specific entry point, and even something as simple as addition would take at least two µ-instrs.

The 80486 was the first generation that could decode some opcodes directly into one-cycle µOps.

edit

The term "interpreter" is of course a simplified description. The decoding itself is done outside of microcode, and there is logic to select different registers or ALU operations etc. But conceptually it's similar in that almost every opcode transfers control to some sequence of microinstructions ending in "RNI", which acts like a jump back to the main interpreter loop.

The 8086 is actually the closest to the "RISC-like microcode" meme, in that even address computation is done by a series of µ-instrs.
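
To make the "interpreter" framing concrete, here is a toy sketch in C. The micro-op names, the ROM contents, and the dispatch scheme are all invented for illustration; a real microcode ROM encodes register/ALU selection in wide control words rather than anything this tidy.

    #include <stdio.h>

    /* Toy model of "microcode as interpreter". Everything here (micro-op
     * names, ROM contents, dispatch scheme) is invented for illustration
     * and does not describe any real microcode ROM. */

    typedef enum { UOP_LOAD_EA, UOP_ALU_ADD, UOP_WRITEBACK, UOP_RNI } micro_op;

    /* A hypothetical routine for "ADD r/m, r": the decoder jumps to an entry
     * point and micro-instructions run until RNI ("run next instruction"). */
    static const micro_op ucode_rom[] = {
        UOP_LOAD_EA,   /* micro-steps to form the effective address */
        UOP_ALU_ADD,   /* the single ALU micro-instruction for the add */
        UOP_WRITEBACK, /* store the result */
        UOP_RNI,       /* hand control back to the main interpreter loop */
    };

    static void run_macro_instruction(int entry_point) {
        for (int upc = entry_point; ucode_rom[upc] != UOP_RNI; upc++)
            printf("  issue micro-op %d\n", (int)ucode_rom[upc]);
    }

    int main(void) {
        printf("macro instruction: ADD\n");
        run_macro_instruction(0);  /* entry point 0 = our toy ADD routine */
        return 0;
    }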



Edit: just saw your edit. That's something I'd never really thought about before: the 8086 is the "purest" microcoded processor in the x86 series, in that every instruction runs through an actual interpreter rather than some form of fixed-function instruction issue unit!

In the case of integer division, I think that it's also the "true" kind of microcoded instruction on many modern CPUs. That is to say, the instruction goes through the actual microcode interpreter to issue the uOps, rather than the fixed-function decoder. That said, it's been a while since I had to worry about microcode switches, and it looks like maybe this isn't true anymore in the very newest microarchitectures?

I think this is for two reasons: so that the microcode can switch between "fast" and "slow" division and issue a different uOp program for the fast case, and because most division sequences are longer than the fixed-function decoder width (I think on Intel it used to be anything longer than 4 uOps?).
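
For a concrete picture of what the "slow" program has to cover, here's a plain restoring shift-and-subtract divider in C, together with a purely hypothetical operand-width check standing in for the fast/slow switch. This is textbook material, not a description of any specific Intel microcode.

    #include <stdint.h>
    #include <stdio.h>

    /* Textbook restoring division: one quotient bit per iteration. Shown only
     * to illustrate how much per-bit work a 64-bit divide implies; it is not
     * the algorithm of any particular CPU. */
    static uint64_t restoring_div_u64(uint64_t n, uint64_t d, uint64_t *rem) {
        uint64_t q = 0, r = 0;
        for (int i = 63; i >= 0; i--) {      /* 64 iterations for 64-bit operands */
            r = (r << 1) | ((n >> i) & 1);   /* shift in the next dividend bit */
            if (r >= d) {                    /* trial subtract */
                r -= d;
                q |= 1ULL << i;
            }
        }
        *rem = r;
        return q;
    }

    int main(void) {
        uint64_t a = (1ULL << 40) + 12345, b = 97, r;
        /* Hypothetical stand-in for the "fast vs slow" switch: narrow operands
         * could get a shorter uOp program, wide ones the full routine. */
        const char *path = ((a | b) >> 32) ? "slow/64-bit" : "fast/32-bit";
        uint64_t q = restoring_div_u64(a, b, &r);
        printf("%s path: %llu / %llu = %llu rem %llu\n", path,
               (unsigned long long)a, (unsigned long long)b,
               (unsigned long long)q, (unsigned long long)r);
        return 0;
    }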

Anyway, I figured this was probably a bit more detail than what OP needed for their question about division algorithms, so here are my takeaways:

* Yes, integer division is implemented as multiple operations on many modern CPUs, although it is increasingly moving towards hardware (fewer uOps).

* Sometimes the micro-instruction programs for division come from the fixed-function decoder, and sometimes they are generated by microcode itself.

* But, at the same time, no: division is not implemented algorithmically using purely non-division operations; there is usually some fixed-function division logic of various kinds.

* This is a hot topic so change is rapid.


While division may still decode to multiple uOps, I seriously doubt that there's a loop in microcode on modern processors. The pipeline latency makes that infeasible.

The looping logic is almost certainly a bit of fixed function hardware in the execution unit.


Hmm. This gets into the fuzzy definition of "loop in microcode", depending on how you look at the system. I don't think the actual looping happens in microcode; that is, it's not like the ucode unit jumps back to earlier ucode, which wouldn't make sense architecturally for a variety of reasons.

However, in the case of 64-bit integer division on middle-aged Intel processors (Kaby Lake, for example), I do think that division is both iterative and microcoded (versus fixed-function logic), but that the ucode emits an _unrolled_ loop into the scheduler.

IDIV with 64-bit operands on Kaby Lake takes 56/57 uOps (!) vs the still-huge 11 uOps for 32-bit IDIV. (for comparison, we're down to 5/4 uOps for 64-bit division on Alder Lake).
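
To show what "emits an unrolled loop" means structurally, here is an 8-bit toy divider written as straight-line steps with no backward branch; a 64-bit version would have 64 such steps plus setup and fixup, which is at least in the same ballpark as the uOp counts above. The shape is illustrative only, not Kaby Lake's actual sequence.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative only: a fully unrolled 8-bit restoring divider. The point
     * is the shape: a fixed straight-line sequence of steps that a microcode
     * sequencer could stream into the scheduler without any looping. */
    static uint8_t unrolled_div_u8(uint8_t n, uint8_t d, uint8_t *rem) {
        uint8_t q = 0, r = 0;
    #define DIV_STEP(i) \
        do { r = (uint8_t)((r << 1) | ((n >> (i)) & 1)); \
             if (r >= d) { r = (uint8_t)(r - d); q |= (uint8_t)(1u << (i)); } } while (0)
        DIV_STEP(7); DIV_STEP(6); DIV_STEP(5); DIV_STEP(4);
        DIV_STEP(3); DIV_STEP(2); DIV_STEP(1); DIV_STEP(0);
    #undef DIV_STEP
        *rem = r;
        return q;
    }

    int main(void) {
        uint8_t r, q = unrolled_div_u8(200, 7, &r);
        printf("200 / 7 = %u rem %u\n", q, r);  /* 28 rem 4 */
        return 0;
    }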


Take a look at Agner Fog's instruction tables: https://www.agner.org/optimize/instruction_tables.pdf

For example, Zen4 64-bit DIV is listed as: 2 uOps, 10-18 cycles latency, 7-12 cycles inverse throughput.

This suggests uOps with variable execution lengths, i.e. iteration happening in the execution unit and not just a fixed unrolled loop streamed by the microcode part of the frontend.
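
One way to read the variable latency, sketched below: a single iterative divider in the execution unit that skips iterations when the operands are narrow (here keyed off the dividend's leading zeros). The skip rule is a guess made up for illustration, not AMD's actual design.

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch only: a divider whose step count depends on the operands, which
     * is one way a single iterative execution unit can show variable latency
     * (e.g. 10-18 cycles). The skip heuristic is invented for illustration. */

    static int clz64(uint64_t x) {              /* portable leading-zero count */
        int n = 0;
        if (x == 0) return 64;
        while (!(x >> 63)) { x <<= 1; n++; }
        return n;
    }

    static uint64_t iterative_div_u64(uint64_t n, uint64_t d, int *steps) {
        uint64_t q = 0, r = 0;
        int top = 63 - clz64(n);  /* skip leading zero bits of the dividend:
                                     they can never set a quotient bit */
        *steps = 0;
        for (int i = top; i >= 0; i--, (*steps)++) {
            r = (r << 1) | ((n >> i) & 1);
            if (r >= d) { r -= d; q |= 1ULL << i; }
        }
        return q;
    }

    int main(void) {
        int small_steps, big_steps;
        iterative_div_u64(1000, 7, &small_steps);                /* short quotient */
        iterative_div_u64(0x0123456789ABCDEFull, 7, &big_steps); /* long quotient */
        printf("iterations: %d vs %d\n", small_steps, big_steps);
        return 0;
    }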

You may be right that there were some CPUs that did the fixed unrolling, but it doesn't seem that common.


My understanding is that there can be both: on some implementations the execution pipes themselves have a 'nanocode' for stuff like CORDICs and maybe division, whose execution streams are kicked off by the one or two high-level uOps that the instruction decoder emits.



