It depends how many numbers you add. If you only compute A+B it will be slower, because the carries still have to be handled after the addition. But if you compute A+B+C+D+E you do four fast additions and only deal with carry bits once. TFA allows up to about 8000 numbers to be summed for a single pass of carry propagation.
I can't tell from your comments whether you've realized it yet, but the trick is only useful if you add several 256-bit numbers. You can add up to 2^13 numbers (each 256 bits) before you actually have to do the four carry shift-outs once. In effect, you do reduce the number of carry operations (to four, albeit slightly more involved ones), and that is what ends up saving time.
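To make the grouping concrete, here's a rough C sketch of the idea (my own illustration, not code from TFA; the acc256/acc_add/acc_carry names are made up, and the top-limb width needed for a full 256 bits is glossed over): each value sits in five 64-bit words holding 51-bit limbs, the per-input adds are plain carry-free adds, and the four shift-outs happen once at the end.

    #include <stddef.h>
    #include <stdint.h>

    /* Radix-2^51 layout as described in TFA: a 256-bit value held as
       five 64-bit words, each a limb below 2^51, leaving 13 spare bits
       per word to absorb deferred carries.  (The top limb really needs
       52 bits for a full 256-bit value; glossed over here.) */
    typedef struct { uint64_t limb[5]; } acc256;

    /* Accumulate one value: five independent adds, no carry chain. */
    static void acc_add(acc256 *acc, const acc256 *x)
    {
        for (int i = 0; i < 5; i++)
            acc->limb[i] += x->limb[i];
    }

    /* After up to 2^13 accumulations, do the carry pass once:
       four shift-outs push the overflow bits up into the next limb. */
    static void acc_carry(acc256 *acc)
    {
        const uint64_t MASK51 = (1ULL << 51) - 1;
        for (int i = 0; i < 4; i++) {
            acc->limb[i + 1] += acc->limb[i] >> 51;
            acc->limb[i] &= MASK51;
        }
    }

    /* Example: sum n inputs (n up to 1 << 13) with a single carry pass. */
    static acc256 acc_sum(const acc256 *in, size_t n)
    {
        acc256 sum = {{0}};
        for (size_t i = 0; i < n; i++)
            acc_add(&sum, &in[i]);
        acc_carry(&sum);
        return sum;
    }

With limbs below 2^51, even 2^13 accumulations keep every word under 2^64, which is exactly why the carry pass can be deferred that long.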
There is no performance gain if you only add two numbers, even though that's the setup the article discusses for 90% of its length, which seems to be causing some confusion.
Yeah, I re-read the whole thing again, and I see the mistake I made. I took a wrong turn when the OP went into base 37 numbers, which made me consider every bit-level carry, not just the carry at the word boundary.
You're grouping carries. If you have 13 spare bits, you can accumulate 2^13 additions before you have to take care of carries. Then you have to add some overhead for bit manipulation, but the article's point is that the parallelism gains more than make up for the bit manipulation.
You are reducing the number of carries. The article explains it clearly. If you're not convinced by the article, I won't try to convince you. (Edit: the last thing I'll add: you are reducing the number of carries for 64-bit additions. When adding 51-bit limbs held in 64-bit words, the overflow into the spare bits doesn't count as a carry instruction-wise, as it's still add, not adc.)
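For contrast, here's roughly what the conventional path looks like: a plain 256-bit add over four 64-bit words where every step consumes the carry out of the previous one, i.e. the add/adc chain the limb representation avoids. Again just an illustration under my own naming (add256), not the article's code.

    #include <stdint.h>

    /* Conventional 256-bit addition: each iteration depends on the
       carry produced by the previous one, so the adds serialize into
       an add/adc chain instead of running independently. */
    static void add256(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
    {
        uint64_t carry = 0;
        for (int i = 0; i < 4; i++) {
            uint64_t t = a[i] + carry;
            carry = (t < carry);          /* wrapped while adding the carry in */
            r[i] = t + b[i];
            carry += (r[i] < t);          /* wrapped while adding b[i] */
        }
    }

In the 51-bit layout the per-limb adds have no such dependency on each other, which is what lets them pipeline (or vectorize) well.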
I see what I was doing wrong: I was counting every bit-level carry during the addition (i.e. 1+1 = 10 is a one-bit carry), not the carry operation done after overflowing the word boundary.
Next time, I'll keep from reading tech articles while running on too little sleep.