It depends how many numbers you add. If you only compute A+B it will be slower, because the carries still have to be handled after the addition. But if you compute A+B+C+D+E you do four fast additions and only deal with carry bits once. TFA allows up to about 8000 numbers to be summed for a single pass of carry propagation.
I can't tell from your comments whether you've realized it yet, but the trick is only useful if you add several 256-bit numbers. You can add up to 2^13 numbers (each 256 bits) before you actually have to do the four carry shift-outs once. In effect, you do reduce the number of carry operations (to four, albeit slightly more involved ones), and that is what ends up saving time.
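To make the grouping concrete, here's a rough C sketch of the idea (my own illustration, not code from TFA; the acc256/acc_add/acc_carry names are made up, and the top-limb width needed for a full 256 bits is glossed over): each value sits in five 64-bit words holding 51-bit limbs, the per-input adds are plain carry-free adds, and the four shift-outs happen once at the end.

    #include <stddef.h>
    #include <stdint.h>

    /* Radix-2^51 layout as described in TFA: a 256-bit value held as
       five 64-bit words, each a limb below 2^51, leaving 13 spare bits
       per word to absorb deferred carries.  (The top limb really needs
       52 bits for a full 256-bit value; glossed over here.) */
    typedef struct { uint64_t limb[5]; } acc256;

    /* Accumulate one value: five independent adds, no carry chain. */
    static void acc_add(acc256 *acc, const acc256 *x)
    {
        for (int i = 0; i < 5; i++)
            acc->limb[i] += x->limb[i];
    }

    /* After up to 2^13 accumulations, do the carry pass once:
       four shift-outs push the overflow bits up into the next limb. */
    static void acc_carry(acc256 *acc)
    {
        const uint64_t MASK51 = (1ULL << 51) - 1;
        for (int i = 0; i < 4; i++) {
            acc->limb[i + 1] += acc->limb[i] >> 51;
            acc->limb[i] &= MASK51;
        }
    }

    /* Example: sum n inputs (n up to 1 << 13) with a single carry pass. */
    static acc256 acc_sum(const acc256 *in, size_t n)
    {
        acc256 sum = {{0}};
        for (size_t i = 0; i < n; i++)
            acc_add(&sum, &in[i]);
        acc_carry(&sum);
        return sum;
    }

With limbs below 2^51, even 2^13 accumulations keep every word under 2^64, which is exactly why the carry pass can be deferred that long.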
There is no performance gain if you only add two numbers, even though that's the setup the article discusses for 90% of its length, which seems to be causing some confusion.
Yeah, I re-read the whole thing again, and I see the mistake I made. I took a wrong turn when the OP went into base 37 numbers, which made me consider every bit-level carry, not just the carry at the word boundary.
You're grouping carries. If you have 13 spare bits, you can accumulate 2^13 additions before you have to take care of carries. Then you have to add some overhead for bit manipulation, but the article's point is that the parallelism gains more than make up for the bit manipulation.
You are reducing the number of carries. The article explains it clearly. If you're not convinced by the article, I won't try to convince you. (Edit: the last thing I'll add: you are reducing the number of carries for 64-bit additions. When adding 51-bit limbs held in 64-bit words, the overflow into the spare bits doesn't count as a carry instruction-wise, as it's still add, not adc.)
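For contrast, here's roughly what the conventional path looks like: a plain 256-bit add over four 64-bit words where every step consumes the carry out of the previous one, i.e. the add/adc chain the limb representation avoids. Again just an illustration under my own naming (add256), not the article's code.

    #include <stdint.h>

    /* Conventional 256-bit addition: each iteration depends on the
       carry produced by the previous one, so the adds serialize into
       an add/adc chain instead of running independently. */
    static void add256(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
    {
        uint64_t carry = 0;
        for (int i = 0; i < 4; i++) {
            uint64_t t = a[i] + carry;
            carry = (t < carry);          /* wrapped while adding the carry in */
            r[i] = t + b[i];
            carry += (r[i] < t);          /* wrapped while adding b[i] */
        }
    }

In the 51-bit layout the per-limb adds have no such dependency on each other, which is what lets them pipeline (or vectorize) well.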
I see what I was doing wrong: I was counting every bit-level carry during the addition (i.e. 1+1 = 10 is a one-bit carry), not the carry operation done after overflowing the word boundary.
Next time, I'll keep from reading tech articles while running on too little sleep.