Really interested in the nuts and bolts - are you optimizing specifically for one quality setting (in which case I'm guessing you could probably do the quantization as part of the dct and throw away some calculations)?
I played with a realtime jpeg compression implementation back in college on transputers (yes I'm that old). Fun stuff, nice to see there are still places where going right down to the metal can make a real impact on a product...
While SnappyCam has been the most difficult, complex, piece of software I've written since I started coding in my early teens, it's also been one of the most satisfying technically.
I'd love to disclose the many, many optimizations baked in, but as this is a commercial app I must keep much of it as a trade secret.
I will say though that a lot of precomputation was involved, both for the encoder and decoder. Jumped at the chance to avoid computation, memory reads, etc., as much as possible. :-)