Are there any physical indications of the order in which instructions were added? I mean to say: Is there any indication that (for example) they added DIV last, perhaps because it's complicated and they didn't know if they could fit that in.
Andrew Jenner pointed out that the microcode for the string (block move) operations doesn't exactly match the flowchart in the documentation, and the micro-instructions are kind of scattered around. So it's likely that the string operations were modified late in the process. As far as the other instructions, I haven't seen any particular indication of the order.
Do you think the 8086/8088 were well-optimised/efficient in general? In other words, had they spent some more effort, could they have reduced the transistor count and reduced the number of cycles the instructions take without changing the functionality?
The 8086/8088 look pretty well-optimized in many ways: the design of the instruction set, the architecture, the microcode, and the silicon layout. Studying it closely, I've seen multiple things that are more clever than I expected, and haven't seen anything that makes me shake my head in dismay.
Just a note that, just as AAD is a multiplication (and add) in disguise, AAM is an 8-bit unsigned division in disguise, and it also uses the CORD subroutine.
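For anyone who hasn't looked at them in a while, this is roughly what the two instructions compute (flag updates ignored; the helper names and `base` are just illustrative, with `base` being the immediate byte, 10 in the documented BCD case):

```c
#include <stdint.h>

/* Sketch of the architectural effect only, not the microcode itself. */
static void aam(uint8_t *al, uint8_t *ah, uint8_t base) {
    *ah = *al / base;   /* quotient  -> AH: the hidden unsigned division */
    *al = *al % base;   /* remainder -> AL */
}

static void aad(uint8_t *al, uint8_t *ah, uint8_t base) {
    *al = (uint8_t)(*ah * base + *al);   /* the hidden multiply-and-add */
    *ah = 0;
}
```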
The footnote that says "most ARM processors still don't have integer division" seems a little out of date to me -- all 64-bit Arm cores do, of course, and on the 32-bit side anything since v7VE (Cortex-A15 was the first of these, released 2011) does, and all the M-profile microcontroller cores have it. It's a bit unfortunate it didn't make it into v7, since that's a pretty popular 32-bit compilation target, but the runtime libs should use it even if the compiler can't opencode a division insn.
Good point, it's not in v6M, though it is in v8M baseline, which is the v8 equivalent. So M0, M0+ and M1 don't have it but I think everything else M-profile should.
Given that Ken said most ARM processors, the nerd sniping becomes: what percentage of all ARM cores are M0/M0+/M1. I'd guess a lot (maybe not half though). Don't all micro SD cards have a processor inside them? They're probably almost all ARM...
> Signed multiplication and division use an internal flag called F1 to keep track of the sign. The F1 flag is toggled by microcode through the CF1 (Complement F1) micro-instruction. The F1 flag is implemented with a flip-flop, along with a multiplexer to select the value. It is cleared when a new instruction starts, set by a REP prefix, and toggled by the CF1 micro-instruction.
Does this mean that one can run 'REP DIV' and get a negated quotient?
The 80186 already implemented the loop for multiplication and (unsigned) division in dedicated logic that could do 1 bit per cycle.
The microcode only had to set up the registers, test for overflow, and undo the final subtraction if it underflowed (since it now used a non-restoring algorithm).
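Purely as an illustration of that scheme, here's a 1-bit-per-step non-restoring divider in C (16/16 for brevity; the function and variable names are mine, and the real hardware also has to deal with operand sizes, signs, and the overflow trap). The correction after the loop is the "undo the final subtraction" step:

```c
#include <stdint.h>

/* Non-restoring division, one quotient bit per iteration; divisor must be nonzero. */
static void divmod_nonrestoring(uint16_t dividend, uint16_t divisor,
                                uint16_t *quot, uint16_t *rem_out) {
    int32_t rem = 0;
    uint16_t qbits = 0;

    for (int i = 15; i >= 0; i--) {
        rem = rem * 2 + ((dividend >> i) & 1);   /* bring in the next dividend bit */
        if (rem >= 0)
            rem -= divisor;                      /* try a subtraction...            */
        else
            rem += divisor;                      /* ...or compensate the previous one */
        qbits = (uint16_t)((qbits << 1) | (rem >= 0));  /* 1 if the subtract "stuck" */
    }
    if (rem < 0)              /* the fix-up: undo the final subtraction */
        rem += divisor;

    *quot = qbits;
    *rem_out = (uint16_t)rem;
}
```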
Starting with the 486 there is likely no longer any microcode involved, and newer chips can handle multiple bits per step. See the end of the article: the first algorithm to do this dates from 1957!
It's way more complex than this on modern CPUs, so it's harder to explain.
> is microcode still used for division?
Yes, almost everything on a modern CPU uses "microcode" of some kind, although the term gets kind of hazy, since everything is out-of-order and a lot of instructions are issued in parallel. In a typical modern CPU, the "frontend" will decompose instructions into uOps, which then get pushed into a "reservation station" / "scheduler." The scheduler queues and reorders various uOps in various surprising and complicated ways to try to account for interdependencies and memory latency. Eventually, a uOp is issued to an "execution port," which is connected to a fixed-function piece of logic that actually performs part or all of an operation (for example, an Arithmetic Logic Unit, or ALU).
But, while microcode will still be _involved_, most modern CPUs have fixed-function hardware for the meaty parts of the division instruction - generally speaking, they won't implement division _purely_ in microcode like the algorithm documented in the article.
> are there better and faster alternatives?
They're not "alternatives" per se, but there are a _lot_ of ways to implement division algorithmically, and a _lot_ of ways to trade size for speed.
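To make the size-for-speed tradeoff concrete, here's a toy radix-4 version of the schoolbook algorithm: by comparing against 1x, 2x, and 3x the divisor each step (more adders and comparators), it retires two quotient bits per iteration instead of one. Real SRT and multiplicative (Newton-Raphson / Goldschmidt) dividers are far more sophisticated, but the basic trade is the same:

```c
#include <stdint.h>

/* Toy radix-4 restoring division: 2 quotient bits per step; d must be nonzero. */
static void divmod_radix4(uint32_t n, uint32_t d, uint32_t *q, uint32_t *r) {
    uint64_t rem = 0;
    uint32_t quo = 0;
    for (int i = 15; i >= 0; i--) {                   /* 16 steps for 32 bits */
        rem = (rem << 2) | ((n >> (2 * i)) & 3);      /* bring in two dividend bits */
        uint32_t digit = 0;
        if      (rem >= (uint64_t)d * 3) { digit = 3; rem -= (uint64_t)d * 3; }
        else if (rem >= (uint64_t)d * 2) { digit = 2; rem -= (uint64_t)d * 2; }
        else if (rem >= d)               { digit = 1; rem -= d; }
        quo = (quo << 2) | digit;                     /* append the base-4 digit */
    }
    *q = quo;
    *r = (uint32_t)rem;
}
```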
Improving integer division performance has been a fairly big focus in newer CPU microarchitectures, with major improvements arriving in the latest Intel, AMD, and Apple Silicon architectures.
https://uops.info/table.html will show how many uOps a given x86 instruction decomposes to, what ports it uses, and rough latency estimates for the instruction's execution.
Here's some reading I found, with a lot of references:
>almost everything on a modern CPU uses "microcode" of some kind, although the term gets kind of hazy
µOps are different from the kind of microcode described here. Older x86 CPUs basically had a "bytecode interpreter" in microcode ROM: every instruction (except for some trivial set/clear flag operations) would go to a specific entry point, and even something as simple as addition would take at least two µ-instrs.
The 80486 was the first generation that could decode some opcodes directly into one-cycle µOps.
Edit:
The term "interpreter" is of course a simplified description. The decoding itself is done outside of microcode, and there is logic to select different registers or ALU operations etc. But conceptually it's similar in that almost every opcode transfers control to some sequence of microinstructions ending in "RNI", which acts like a jump back to the main interpreter loop.
The 8086 is actually the closest to the "RISC-like microcode" meme, in that even address computation is done by a series of µ-instrs.
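A deliberately loose C sketch of that interpreter view, with every name invented for illustration (the real decode is hardwired logic and the ROM holds wide micro-instruction words, not an enum): each opcode selects a microcode entry point, the micro-routine runs, and RNI hands control back to the fetch/dispatch loop.

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { UOP_ALU, UOP_REG_MOVE, UOP_BUS_CYCLE, UOP_RNI } micro_op;

/* Two made-up "opcodes": a register ALU op and one with a memory operand. */
static const micro_op ucode_rom[2][4] = {
    { UOP_ALU, UOP_RNI },
    { UOP_BUS_CYCLE, UOP_ALU, UOP_BUS_CYCLE, UOP_RNI }
};

static void run_micro_op(micro_op op) { printf("  micro-op %d\n", (int)op); }

int main(void) {
    const uint8_t program[] = { 0, 1, 0 };            /* pretend opcode stream */
    for (size_t pc = 0; pc < sizeof program; pc++) {
        printf("opcode %d:\n", program[pc]);
        const micro_op *uip = ucode_rom[program[pc]]; /* decode = pick an entry point */
        while (*uip != UOP_RNI)                       /* run until RNI ("run next     */
            run_micro_op(*uip++);                     /*  instruction") ends the routine */
    }
    return 0;
}
```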
Edit: just saw your edit, that's something I'd never really thought about before - 8086 is the "purest" microcoded processor from the x86 series, in that every instruction runs through an actual interpreter rather than some form of fixed-function instruction issue unit!
In the case of integer division, I think that it's also the "true" kind of microcoded instruction on many modern CPUs. That is to say, the instruction goes through the actual microcode interpreter to issue the uOps, rather than the fixed-function decoder. Although, it's been a while since I had to worry about microcode switches, and it looks like maybe this isn't true anymore in the very newest microarchitectures?
I think this is for two reasons: so that the microcode can switch between "fast" and "slow" division and issue a different uOp program for fast division, and because most division is longer than the fixed-function decoder width (I think on Intel it used to be anything longer than 4 uOps?).
Anyway, I figured this was probably a bit more detail than what OP needed for their question about division algorithms, so here are my takeaways:
* Yes, integer division is implemented as multiple operations on many modern CPUs, although it is increasingly moving towards hardware (fewer uOps).
* Sometimes the micro-instruction programs for division are encoded in a fixed operation decoder and sometimes they are themselves generated by microcode.
* But, at the same time, no: division is not implemented algorithmically using purely non-division functions; there is usually some fixed-function division logic of various types.
While division may still decode to multiple uOps, I seriously doubt that there's a loop in microcode on modern processors. The pipeline latency makes that infeasible.
The looping logic is almost certainly a bit of fixed function hardware in the execution unit.
Hmm. This gets into the fuzzy definition of "loop in microcode" depending on how you look at the system. I don't think the actual looping happens in microcode, that is, it's not like the ucode unit jumps to earlier ucode - this wouldn't make sense architecturally for a variety of reasons.
However, in the case of 64-bit integer division on mid-aged Intel processors (for example, Kaby Lake), I do think that division is both iterative and microcoded (versus fixed-function logic), but that the ucode emits an _unrolled_ loop into the scheduler.
IDIV with 64-bit operands on Kaby Lake takes 56/57 uOps (!) vs. the still-huge 11 uOps for 32-bit IDIV. (For comparison, we're down to 5/4 uOps for 64-bit division on Alder Lake.)
For example, Zen4 64-bit DIV is listed as: 2 uOps, 10-18 cycles latency, 7-12 cycles inverse throughput.
This suggests uOps with variable execution lengths, i.e. iteration happening in the execution unit and not just a fixed unrolled loop streamed by the microcode part of the frontend.
You may be right that there were some CPUs that did the fixed unrolling, but it doesn't seem that common.
My understanding is that there can be both: on some implementations the execution pipes themselves have a 'nanocode' for things like CORDIC and maybe division, whose execution streams are kicked off by the one or two high-level uOps that the instruction decoder emits.
Sort of. uOps don't have to come out of ucode. There's fixed function hardware that cracks most instructions into uOps and only falls back to ucode for particularly complex operations.
For comparison, a modern CPU architecture such as the Zen 4 can perform a 64-bit integer division in 10-18 clock cycles, which means it is processing roughly 3.6 to 6.4 bits per clock.