> If it was just XML overhead, compression should mitigate most of it.
Strong enough compression should mitigate most of it, but DEFLATE (and consequently zip and gzip) is not a strong enough algorithm.
For example, let's imagine that a particular format is available both in JSON and in a binary format, and is entirely composed of objects, arrays and ASCII strings, so the binary version doesn't benefit much from a compact encoding. Now consider a JSON document `[{"field1":...,"field2":...,...},...]` with lots of duplicated `fieldN` strings. DEFLATE will figure out that the `","fieldN":"` fragments occur frequently and can be shortened into backreferences, but each backreference still takes at least two bits, and normally many more (because different `","fieldN":"` fragments have to be distinguished from each other), so they translate to pure overhead compared to the compressed binary.
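To make that concrete, here's a rough sketch of the comparison. The record shape and counts are made up for illustration, and the "binary" format is just a crude stand-in that stores values without field names:

```python
import json
import struct
import zlib

# Synthetic records: many objects sharing the same field names,
# with values that are plain ASCII strings / small integers.
records = [{"field1": f"value{i}", "field2": i % 10, "field3": "constant"}
           for i in range(10_000)]

# JSON encoding: the field names are repeated in every object.
json_bytes = json.dumps(records, separators=(",", ":")).encode("ascii")

# Crude binary stand-in: field names live in an implied schema,
# so only the values get serialized per record.
binary_bytes = b"".join(
    f"value{i}".encode("ascii") + struct.pack("<B", i % 10) + b"constant"
    for i in range(10_000)
)

for label, data in [("json", json_bytes), ("binary", binary_bytes)]:
    compressed = zlib.compress(data, 9)  # DEFLATE, same algorithm as gzip/zip
    print(f"{label:>6}: raw={len(data):>8}  deflate={len(compressed):>8}")
```

You'd expect the compressed JSON to stay noticeably larger than the compressed binary, with the gap being roughly the cost of all those backreferences.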
Modern compression algorithms mainly deal with this pattern in two ways, possibly combined. The backreference can be encoded in a fractional number of bits, which implies arithmetic/range/ANS coding (cf. Zstandard). Or the compressor can recognize that different contexts have different token distributions (cf. Brotli). Both come with non-negligible computational overhead, though, and only became practical recently with newer techniques.
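For a rough feel of the difference, the same JSON payload can be run through DEFLATE, Zstandard and Brotli side by side. This sketch assumes the third-party `zstandard` and `brotli` Python bindings are available, and the exact numbers obviously depend on the data:

```python
import json
import zlib

# Same synthetic payload as the previous sketch.
records = [{"field1": f"value{i}", "field2": i % 10, "field3": "constant"}
           for i in range(10_000)]
data = json.dumps(records, separators=(",", ":")).encode("ascii")

results = {"deflate (zlib, level 9)": len(zlib.compress(data, 9))}

# Third-party packages (pip install zstandard brotli); skip if missing.
try:
    import zstandard
    results["zstd (level 19)"] = len(zstandard.ZstdCompressor(level=19).compress(data))
except ImportError:
    pass
try:
    import brotli
    results["brotli (quality 11)"] = len(brotli.compress(data, quality=11))
except ImportError:
    pass

print(f"raw: {len(data)}")
for name, size in results.items():
    print(f"{name}: {size}")
```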