It's also about resource constraints. If your parallel request workload (take S3 as an example) is already using all the available resources, then parallelizing a single request in the hope of making it run faster will just make things slower, because of the added overhead (moving data around, synchronization, etc.), at least at the higher end of request concurrency. Of course, if a machine happens to be processing only a single request, parallelizing can be a win, but that means you're not utilizing your hardware. The way to make those heavily parallel request-serving systems go faster is simply to do less work, i.e. optimize them, not parallelize them. The exception is if you want some "quality of service" controls within this setup, but that is fairly unusual in the types of systems I'm thinking about.

I've seen this happen in practice and it's a common anti-pattern. An engineer tries to make requests go faster and parallelizes some of the pieces of a single request, only to find the system can now handle fewer requests/s. The reason is simple: the system is now doing more work with the same amount of resources.
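A rough back-of-the-envelope model of that effect, as a sketch only. The function names, the 32-core machine, the 100 ms of work per request, and the 5 ms of coordination overhead per extra piece are all made-up assumptions for illustration, not measurements from any real system. It also assumes the pieces are equal in size and that the overhead is paid serially.

```python
def cpu_seconds_per_request(work_s, pieces, overhead_per_piece_s):
    """Total CPU time one request consumes, summed over every core it touches."""
    # Splitting does not shrink the work; each extra piece adds coordination cost.
    return work_s + (pieces - 1) * overhead_per_piece_s


def saturated_throughput(cores, work_s, pieces, overhead_per_piece_s):
    """Requests/s when every core is already busy serving requests."""
    return cores / cpu_seconds_per_request(work_s, pieces, overhead_per_piece_s)


def idle_latency(work_s, pieces, overhead_per_piece_s):
    """Latency of one request on an otherwise idle machine (pieces <= cores)."""
    return work_s / pieces + (pieces - 1) * overhead_per_piece_s


if __name__ == "__main__":
    cores, work_s, overhead_s = 32, 0.100, 0.005  # hypothetical numbers
    for pieces in (1, 4):
        rps = saturated_throughput(cores, work_s, pieces, overhead_s)
        lat = idle_latency(work_s, pieces, overhead_s)
        print(f"pieces={pieces}  saturated req/s={rps:.1f}  idle latency={lat:.3f}s")
```

Under these assumptions, splitting a request into four pieces lowers its latency on an idle machine but also lowers the requests/s the saturated machine can sustain, because each request now costs more CPU time in total.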

This is very different from a single task running on a multi-core machine with one core doing all the work and the other 31 sitting idle.

I think your statement mostly applies in the second case: there you want to chop things up into more or less equal-sized pieces. Otherwise you'll bottleneck on the piece that takes the longest.
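To put numbers on the equal-sized-pieces point, here is a minimal sketch. It assumes each piece gets its own idle core and ignores overhead entirely; the function names and the piece durations are hypothetical.

```python
def parallel_time(piece_times):
    """Wall-clock time when every piece runs on its own core: the slowest piece."""
    return max(piece_times)


def speedup(piece_times):
    """Serial time (sum of all pieces) divided by parallel time (longest piece)."""
    return sum(piece_times) / parallel_time(piece_times)


if __name__ == "__main__":
    even = [25, 25, 25, 25]    # hypothetical: 100 units of work, split evenly
    uneven = [70, 10, 10, 10]  # same 100 units, one piece dominates
    for name, pieces in (("even", even), ("uneven", uneven)):
        print(f"{name}: parallel time={parallel_time(pieces)}  speedup={speedup(pieces):.2f}x")
```

Both splits use four cores for the same total work, but the uneven split finishes only when the 70-unit piece does, so three of the cores spend most of that time idle.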