
Your feelings are spot on.

In most modern distributed tracing, "observability", or similar systems, the write amplification is typically 100:1 because of these overheads.

For example, in Azure, every log entry includes a bunch of highly repetitive fields in full, such as the resource ID, "Azure" as the source system, the log entry type, the tenant, etc...

A single "line" is typically over a kilobyte, but often the interesting part is maybe 4 to 20 bytes of actual payload data. Sending this involves HTTP overheads as well such as the headers, authentication, etc...

Most vendors in this space charge by the gigabyte, so as you can imagine they have zero incentive to improve on this.

Even for efficient binary logs such as Windows performance counters, I noticed that from one second to the next they're highly redundant.

I once experimented with a metric monitor that could collect 10,000-15,000 metrics per server per second and use only about 100MB of storage per host... per year.

The trick was to simply binary-diff the collected metrics with some light "alignment" so that groups of related metrics would be at the same offsets. Almost all numbers become zero, and compress very well.
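
Roughly, a toy sketch of the idea in Python (the metric names, layout, and numbers are invented for illustration; the real collector was more elaborate):

    import random
    import struct
    import zlib

    LAYOUT = ["cpu.user", "cpu.system", "disk.reads", "disk.writes"]  # fixed order = "alignment"

    def pack(values):
        # The same metric always lands at the same byte offset.
        return struct.pack(f"<{len(LAYOUT)}q", *values)

    def delta(prev, curr):
        # Field-by-field subtraction; slowly moving counters become 0 or tiny ints.
        return [c - p for p, c in zip(prev, curr)]

    # Simulate ten minutes of counters that mostly creep upward.
    snapshots = []
    state = [1_000_000, 200_000, 5_000_000, 4_200_000]
    for _ in range(600):
        state = [v + random.randint(0, 3) for v in state]
        snapshots.append(list(state))

    raw = b"".join(pack(s) for s in snapshots)
    diffed = pack(snapshots[0]) + b"".join(
        pack(delta(p, c)) for p, c in zip(snapshots, snapshots[1:])
    )
    print(len(zlib.compress(raw)), len(zlib.compress(diffed)))  # deltas compress far better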



You never send a single individual log event per HTTP request, you always batch them up. Assuming some reasonable batch size per request (minimum ~1MiB or so) there is rarely any meaningful difference in payload size between gzipped/zstd/whatever JSON bytes, and any particular binary encoding format you might prefer.
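
To make that concrete, a rough sketch (field names invented, not any vendor's actual payload) of how much of a batched JSON payload is repetition that gzip removes:

    import gzip
    import json

    # 10,000 events where only "value" really changes; everything else is boilerplate.
    events = [
        {"resourceId": "/subs/123/vm-7", "source": "Azure", "type": "Heartbeat",
         "tenant": "contoso", "value": i % 3}
        for i in range(10_000)
    ]
    batch = "\n".join(json.dumps(e) for e in events).encode()
    print(len(batch), len(gzip.compress(batch)))  # raw batch vs. gzipped batch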


Most log collection systems do not compress logs as they send them, because again, why would they? This would instantly turn their firehose of revenue cash down to a trickle. Any engineer suggesting such a feature would be disciplined at best, fired at worst. Even if their boss is naive to the business realities and approves the idea, it turns out that it's weirdly difficult in HTTP to send compressed requests. See: https://medium.com/@abhinav.ittekot/why-http-request-compres...
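
To be fair, the client side is only a couple of lines (sketch below with a placeholder endpoint, assuming the third-party requests library); the difficulty the article describes is that HTTP has no negotiation for request bodies, so the server has to opt in to Content-Encoding on uploads out-of-band:

    import gzip
    import json

    import requests  # third-party; assumed here

    payload = json.dumps({"events": [{"msg": "disk full", "host": "vm-7"}]}).encode()
    resp = requests.post(
        "https://ingest.example.com/v1/logs",  # placeholder endpoint
        data=gzip.compress(payload),
        headers={"Content-Encoding": "gzip", "Content-Type": "application/json"},
    )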

HTTP/2 would also improve efficiency because of its built-in header compression feature, but again, I've not seen this used much.

The ideal would be to have some sort of "session" cookie associated with a bag of constants, slowly changing values, and the schema for the source tables. Send this once a day or so, and then send only the cookie followed by columnar data compressed with RLE and then zstd. Ideally in a format where the server doesn't have to apply any processing to store the data apart from some light verification and appending onto existing blobs. I.e.: make the whole thing compatible with Parquet, Avro, or something similar, rather than just sending uncompressed JSON like a savage.
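
A toy sketch of what that could look like (every field name, the cookie value, and the RLE scheme are invented for illustration, and zlib stands in for zstd to keep it stdlib-only):

    import json
    import zlib
    from itertools import groupby

    # Sent rarely: constants plus a schema, which the server caches under a cookie.
    session = {
        "cookie": "a1b2c3",
        "constants": {"resourceId": "/subs/123/vm-7", "source": "Azure", "tenant": "contoso"},
        "schema": ["timestamp_delta", "metric_id", "value"],
    }

    def rle(column):
        # Run-length encode one column: [(value, run_length), ...]
        return [(value, len(list(run))) for value, run in groupby(column)]

    # Sent often: just the cookie plus columnar, RLE'd data.
    rows = [(1, 0, 42), (1, 0, 42), (1, 0, 43), (1, 1, 7)]
    columns = list(zip(*rows))
    frame = {"cookie": session["cookie"], "columns": [rle(c) for c in columns]}
    wire = zlib.compress(json.dumps(frame).encode())  # zstd in practice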


Most systems _do_ compress request payloads on the wire, because the cost-per-byte in transit over those wires is almost always frictional and externalized.

Weird perspective, yours.


They will compress over the wire, but then decompress on ingest and bill for the uncompressed size. After that, an interesting thing happens: they compress the data again, along with other interesting techniques, to minimize its size on their premises. Can't blame them... they're just trying to cut costs, but the fact that they charge so much for something that is so easily compressible is just... not fair.


Part of the problem is that the ingest stream isn't vector compressed, so they're charging you for the CPU overhead of this data rearrangement.

It would cut costs a lot if the source agents did this (pre)processing locally before sending it down the wire.


We should distinguish between compression in transit and compression at rest. Compressing a larger corpus should yield better results than compressing smaller chunks, because dictionaries can be reused (zstd, for example).
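
A minimal sketch of the dictionary point, assuming the third-party zstandard package (the sample log lines are made up): train once on representative lines, then reuse the dictionary for every small chunk that has to be compressed on its own.

    import zstandard as zstd  # third-party python-zstandard package

    # Made-up log lines that share a lot of structure.
    samples = [
        b'{"resourceId":"/subs/123/vm-7","source":"Azure","type":"Heartbeat","value":%d}' % i
        for i in range(5000)
    ]
    dictionary = zstd.train_dictionary(8 * 1024, samples)

    with_dict = zstd.ZstdCompressor(dict_data=dictionary)
    without = zstd.ZstdCompressor()

    chunk = samples[0]
    print(len(without.compress(chunk)), len(with_dict.compress(chunk)))  # dictionary wins on small chunks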


This is why metrics rule: logging in production only needs to be turned on to debug specific problems, and even then it should have a short TTL.


You got... entirely the wrong message.

The answer to "this thing is horrendously inefficient because of misaligned incentives" isn't to be frugal with the thing, but to make it efficient, ideally by aligning incentives.

Open source monitoring software will eventually blow the proprietary products out of the water because when you're running something yourself, the cost per gigabyte is now just your own cost and not a profit centre line item for someone else.


Unless you start attaching tags to metrics and allow engineers to explode the cardinality. Then your pockets need to be deep.
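
Back-of-the-envelope (tag names and counts invented) of how one extra tag multiplies the number of distinct series you pay to store:

    hosts, endpoints, status_codes, customer_ids = 200, 50, 6, 10_000

    series = hosts * endpoints * status_codes    # 60,000 series for one metric
    with_customer_tag = series * customer_ids    # 600,000,000 series
    print(series, with_customer_tag)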



