That doesn't use the GPU cores. The GPU just happens to package dedicated video hardware (a fixed-function decode block) alongside them; it's closer to a CPU with a specialized DSP attached.
No kind of decompression is fully parallelizable. If you've found opportunities to parallelize, that means the compression wasn't as efficient as it theoretically could have been. Most codecs are merciful and e.g. restart the entropy coder at frame boundaries, which is why ffmpeg's multithreaded decoding works at all.
(But ffmpeg also comes with a lossless video codec, FFV1, that doesn't allow this.)
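A minimal sketch of what restarting the coder at frame boundaries buys you, using zlib per "frame" as a stand-in entropy coder (the framing and frame sizes are invented for illustration, not how any real codec packages things):

```python
# Toy illustration (not a real codec): each frame is its own zlib stream,
# i.e. the "entropy coder" restarts at every frame boundary, so frames can
# be handed to worker threads independently.
import zlib
from concurrent.futures import ThreadPoolExecutor

def encode_frames(frames):
    # One independent compressed chunk per frame: slightly worse ratio
    # than one continuous stream, but decodable in parallel.
    return [zlib.compress(f) for f in frames]

def decode_frames_parallel(chunks, workers=4):
    # zlib releases the GIL, so a thread pool gives real parallelism here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.decompress, chunks))

if __name__ == "__main__":
    frames = [bytes([i % 251]) * 100_000 for i in range(16)]  # fake raw frames
    chunks = encode_frames(frames)
    assert decode_frames_parallel(chunks) == frames
```

If the whole thing were one continuous zlib stream instead, no chunk could be decoded before the coder state from the previous chunk was known and the thread pool would buy nothing; that's the efficiency-vs-parallelism trade in miniature.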
Yeah, I wrote my CS dissertation on this. It started as a GPGPU video codec (for a simplified H.264) and turned into an explanation of why that wouldn't work. I did get somewhere with a hybrid approach (use the GPU for a first pass without intra-frame knowledge, followed by a CPU SIMD pass to refine), but it wasn't much better than a pure CPU SIMD implementation and used a lot more power.
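Roughly the shape of that split, with everything made up for illustration (plain NumPy stands in for both the GPU pass and the SIMD pass; the block size, search radii and SAD cost are placeholders, not what the dissertation used):

```python
# Two-pass motion estimation sketch: a coarse, dependency-free search over all
# blocks (the part that maps well to a GPU), then a small local refinement
# around each coarse vector (the part the hybrid approach left to the CPU).
import numpy as np

BLOCK = 16

def sad(a, b):
    # Sum of absolute differences as a simple block-matching cost.
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def search(cur_block, ref, cy, cx, radius, step):
    h, w = ref.shape
    best, best_cost = (0, 0), None
    for dy in range(-radius, radius + 1, step):
        for dx in range(-radius, radius + 1, step):
            y, x = cy + dy, cx + dx
            if 0 <= y <= h - BLOCK and 0 <= x <= w - BLOCK:
                cost = sad(cur_block, ref[y:y+BLOCK, x:x+BLOCK])
                if best_cost is None or cost < best_cost:
                    best_cost, best = cost, (dy, dx)
    return best

def estimate_motion(cur, ref):
    vectors = {}
    for by in range(0, cur.shape[0] - BLOCK + 1, BLOCK):
        for bx in range(0, cur.shape[1] - BLOCK + 1, BLOCK):
            blk = cur[by:by+BLOCK, bx:bx+BLOCK]
            # Pass 1: coarse grid search; every block is independent of every other.
            cdy, cdx = search(blk, ref, by, bx, radius=16, step=4)
            # Pass 2: fine search in a small window around the coarse result.
            fdy, fdx = search(blk, ref, by + cdy, bx + cdx, radius=2, step=1)
            vectors[(by, bx)] = (cdy + fdy, cdx + fdx)
    return vectors
```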
x264 actually gets a little use out of GPGPU - it has a "lookahead" pass which does a rough estimate of encoding over the whole video, to see how complex each scene is and how likely parts of the picture are to be reused later. That can be offloaded to the GPU (via OpenCL in x264's case), but IIRC it has to run something like 100 frames ahead before the speed increase wins over the CPU<->GPU communication overhead.
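A back-of-the-envelope version of that trade-off (all the numbers are invented placeholders, not measurements of x264):

```python
# Toy break-even model for offloading a lookahead pass: a fixed per-batch
# transfer/launch overhead only pays off once the batch is deep enough.
TRANSFER_OVERHEAD_MS = 20.0   # assumed fixed cost per CPU<->GPU round trip
CPU_COST_PER_FRAME_MS = 1.0   # assumed CPU lookahead cost per frame
GPU_COST_PER_FRAME_MS = 0.3   # assumed GPU lookahead cost per frame

def breakeven_depth():
    depth = 1
    while depth * GPU_COST_PER_FRAME_MS + TRANSFER_OVERHEAD_MS >= depth * CPU_COST_PER_FRAME_MS:
        depth += 1
    return depth

print(breakeven_depth())  # 29 frames with these made-up numbers
```

The exact break-even point depends entirely on the real per-frame costs and transfer latency, but the shape is the same: below some lookahead depth the fixed overhead eats the GPU's advantage.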