Hardware-accelerated text processing in SSE 4.2 (reverberate.org)
12 points by mbrubeck on July 20, 2009 | 13 comments


Hennessy, Séquin and Patterson must be rolling in their office chairs or retirement hammocks right about now.

Whoever needed this could have gotten it in custom hardware; Intel's "the instruction set is an API" approach is a really bad idea.

Can someone convince me why we need this in the processor?


One thing you'll often find is that processor vendors like to release instructions that expose already-existing capability within the chip, but which previously wasn't directly accessible.

I doubt this takes a remotely significant amount of silicon.

Plus, it's all part of the XML hype: vendors build XML-related technologies into software, and now even hardware (marketed as, among other things, XML-parsing accelerators), solely for the sake of being "Enterprise-grade".

Here's a fun fact for you: a very very major network hardware vendor sells extraordinarily expensive boards with "hardware XML parsing" (that's how they can charge so much). In reality, it turned out to be a nightmare to implement, so they just did it in software--and didn't tell their customers.


It's the exposure that's bad, not the existence of the capability in silicon. Think of all the programmer-hours that will go into retrofitting compilers, libraries and runtimes with "hardware-accelerated text processing" just because some bonehead PHB heard this news and suddenly wants it.

They could have done the right thing and started stripping x86-32 cruft from x86-64 as it gained more traction. As time moved on, we would have had a clean architecture with features being deprecated instead of being added.


Think of all the programmer-hours that will go into retrofitting compilers, libraries and runtimes with "hardware-accelerated text processing"

Don't worry, odds are almost nobody will actually use it.

They could have done the right thing and started stripping x86-32 cruft from x86-64 as it gained more traction.

There isn't really any "cruft" that matters; the old useless instructions do nothing but waste instruction-encoding space, and that's not a big issue. The really significant improvement would come from redoing x86 as a three-operand architecture (and making other similar improvements that would be impossible to do bit by bit). Another potential improvement would be redoing the instruction encoding to make it faster to parse; instruction decoding is already becoming a significant bottleneck on x86. If Intel were going to do an overhaul like that, though, they'd do it all at once.

AVX is actually beginning to go that way: we're going to have three-operand encoding for SIMD, even though we won't have it for regular instructions.
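
To make that concrete, here's a minimal C sketch (the function name is mine) of why the two-operand encoding costs an extra instruction whenever both sources stay live:

    #include <immintrin.h>

    /* c = a + b, with a still needed afterwards. Under two-operand SSE,
       ADDPS overwrites its first source, so the compiler must emit a copy:
           movaps xmm2, xmm0        ; c = a
           addps  xmm2, xmm1        ; c += b
       Under AVX's three-operand VEX encoding, the copy disappears:
           vaddps xmm2, xmm0, xmm1  ; c = a + b, a and b untouched */
    __m128 add_then_scale(__m128 a, __m128 b) {
        __m128 c = _mm_add_ps(a, b);
        return _mm_mul_ps(c, a);    /* a is still live here */
    }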


What's the difference between "re-doing x86 as a three-operand architecture" and switching to a different 64-bit RISC architecture?


The former could probably be done without a complete retooling of the chip designs.


I don't think it will happen, though; the ratio of people who care about the elegance of their processor's assembly language to people who buy computers is just too low.

Not in the form of a clean cut, at least: x86-64 brought us eight additional registers and removed the silly BCD instructions.


Are you sure "really did it in software" doesn't mean "managed to get a hardware configuration that admitted a fast C implementation"? There are several boutique network hardware vendors that have played games with the MIPS architecture --- multicore with a very customized memory bus --- to get themselves to a point where they can set speed records on text processing with C code that will only run on their boards.


No, I have a friend who works for an FPGA vendor; the Major Network Hardware Vendor bought FPGAs from them to use for XML parsing, but eventually found that a software implementation was faster and easier--but they kept the FPGAs on the boards so that they could charge $30k a pop for them.


Faster strlen(), strcmp() etc. are not such exotic things to wish for.
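
For instance, here's a minimal sketch of a strlen() built on the PCMPISTRI intrinsic (the function name is mine; compile with -msse4.2, and note that real code has to be careful about reading past a page boundary):

    #include <nmmintrin.h>   /* SSE 4.2 intrinsics */
    #include <stddef.h>

    size_t sse42_strlen(const char *s) {
        const __m128i zero = _mm_setzero_si128();
        size_t i = 0;
        /* Caveat: reads 16 bytes at a time past the terminator, so a
           real implementation must avoid crossing into an unmapped page. */
        for (;;) {
            __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
            /* PCMPISTRI treats both operands as zero-terminated strings;
               compared against an empty string, EQUAL_EACH matches exactly
               at the positions past chunk's terminator, so this returns
               the index of the first '\0' in chunk, or 16 if none. */
            int idx = _mm_cmpistri(zero, chunk,
                                   _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
            if (idx < 16)
                return i + idx;
            i += 16;
        }
    }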


Can someone convince me why we need this in the processor?

Lock-in.


I doubt anyone is going to use these instructions directly inside application code; they'll be in a library that falls back to the normal way if the instructions aren't available, much like video decoders do now (or even 3D rendering that falls back to software rendering).
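
Something along these lines, using CPUID to probe for support at runtime (a minimal sketch; sse42_strlen is a hypothetical accelerated path):

    #include <cpuid.h>     /* GCC's __get_cpuid() and bit_SSE4_2 */
    #include <string.h>

    size_t sse42_strlen(const char *s);   /* accelerated path, elsewhere */

    static int have_sse42(void) {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 0;
        return (ecx & bit_SSE4_2) != 0;   /* CPUID.1:ECX, bit 20 */
    }

    size_t my_strlen(const char *s) {
        static int use_sse42 = -1;
        if (use_sse42 < 0)
            use_sse42 = have_sse42();     /* probe once, cache the result */
        return use_sse42 ? sse42_strlen(s) : strlen(s);
    }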

If anyone is locked in, it will be because this significantly speeds up their app; in which case it's not really lock-in, it's just a good solution to the problem.

In conclusion, I disagree.


Remember, the #1 rule of new instruction sets released by Intel (at least in the past ~5 years) is that they are always close to useless in the first architecture that supports them. For example, PHADDW took 6 cycles on the Core 2 Conroe, making it almost useless in real code. But the Penryn doubled its speed, making it potentially useful.

The string operations are, last I saw, something on the order of 9 cycles of latency, making them rather unfortunately slow in practice... fitting perfectly with the trend mentioned above.



