
I think there's an alternate universe out there where:

- we collectively realized that logs, events, traces, metrics, and errors are actually all just logs

- we agreed on a single format that encapsulated all that information in a structured manner

- we built firehose/stream processing tooling to provide modern o11y creature comforts

I can't tell if that universe is better than this one, or worse.




Traces are just distributed "logs" (in the data-structure sense: records ordered only by when they were appended) where you also pass around the tiniest bit of correlation context between apps. Traces are structured, timestamped, and can be indexed into much more debug-friendly structures like a call tree. But you could just as easily ignore the correlation data and print them out in streaming sorted order.
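
As a toy illustration (the record shapes and ids here are made up), the same flat records can either be streamed in timestamp order or folded into a call tree using nothing but the span/parent ids:

    # Made-up log records: ordinary structured lines plus a bit of correlation context.
    from collections import defaultdict

    records = [
        {"ts": 3, "msg": "query db",      "span_id": "c", "parent_id": "b"},
        {"ts": 1, "msg": "handle /users", "span_id": "a", "parent_id": None},
        {"ts": 2, "msg": "auth check",    "span_id": "b", "parent_id": "a"},
    ]

    # Option 1: ignore the correlation context and just print in timestamp order.
    for r in sorted(records, key=lambda r: r["ts"]):
        print(r["ts"], r["msg"])

    # Option 2: use the correlation context to rebuild a call tree.
    children = defaultdict(list)
    for r in records:
        children[r["parent_id"]].append(r)

    def show(parent=None, depth=0):
        for rec in sorted(children[parent], key=lambda rec: rec["ts"]):
            print("  " * depth + rec["msg"])
            show(rec["span_id"], depth + 1)

    show()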

Honestly it sounds like you're pitching OpenTelemetry/OTLP, but where you only emit traces and leave all the other bits for later to your OpenTelemetry Collector, which can turn traces into metrics or into logs.


So this is kind of what I was talking about, but it's more than that -- if your default is structured logs (the simplest example being JSON lines), then all you have to do is put the data you care about into the log.

So I'm imagining something more like:

   {"level":"info", "otlp": { "trace": { ... }}}

   {"level":"info", "otlp": { "error": { ... }}}

   {"level":"info", "otlp": { "log": { ... }}}

   {"level":"info", "otlp": { "metric": { ... }}}
(standardizing this format would be non-trivial of course, but I could imagine a really minimal standard)
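
A minimal sketch of what emitting that envelope could look like from inside an app -- the emit helper and field names are illustrative only, not a proposed standard:

    # Hypothetical emit() helper: wrap every signal type in one structured-log envelope.
    import json, sys, time

    def emit(kind, payload, level="info"):
        sys.stdout.write(json.dumps({"level": level, "otlp": {kind: payload}}) + "\n")

    emit("log",    {"ts": time.time(), "msg": "request received"})
    emit("trace",  {"ts": time.time(), "trace_id": "t1", "span_id": "a", "name": "GET /users"})
    emit("metric", {"ts": time.time(), "name": "requests_total", "value": 1})
    emit("error",  {"ts": time.time(), "msg": "upstream timeout"}, level="error")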

Your downstream collector only needs one API endpoint/ingestion mechanism -- unpacking the actual type of telemetry that came in (and persisting it where necessary) can be left to other systems.

Basically I think the systems could have been massively simpler in most UNIX-y environments -- just hook up STDOUT (or scrape it, or use syslog, or whatever) and you're done: no opening outbound ports for jaeger, no dealing with complicated buffering, etc. Just log and forget.
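
A rough sketch of that single ingestion mechanism, assuming the envelope format above (the handler is a placeholder, not a real sink):

    # Sketch of a collector: read envelope lines from stdin, fan out by signal type.
    import json, sys

    def handle(kind, payload):
        # Placeholder: a real collector would forward to per-signal processors/storage.
        print(f"[{kind}] {payload}")

    for raw in sys.stdin:
        if not raw.strip():
            continue
        envelope = json.loads(raw).get("otlp", {})
        for kind, payload in envelope.items():  # "trace", "error", "log", "metric"
            handle(kind, payload)

Wiring it up then really is just a pipe (e.g. some_app | collector.py), or a syslog/STDOUT scraper feeding the same script.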


That's more or less the model Honeycomb uses. Every signal type is just a structured event. Reality is a bit messier, though. In particular, metrics are the oddball in this world and required a lot of work to make economical.


Ah thanks for noting this -- that's exactly the insight I mean here.

Yeah, I think in the worst case you basically just exfiltrate metrics out to other subsystems (honestly, you could exfiltrate all of this), but the default is to pipe heavily compressed data to short- and long-term storage, plus some processors for the real-time path, and so on.

Obviously Honeycomb is actually doing the thing and it's not as easy as it sounds, but it feels like if we had all thought like this earlier, we might have skipped making a few protocols (zipkin, jaeger, etc.) and focused on just data layout (JSON vs protobuf vs GELF, etc.) and figuring out what shapes to expect across tools.


Is that really an alternate universe? That's the universe that Splunk and friends are selling: everything's a log. It's really expensive.


Splunk does have margins and I think they're quite high. Same with Datadog (see: all the HN startups that are trying to grab some of that space).

There's a big gap between what it takes for the engineering to work and what all these companies charge.

My point is really more about the engineering time wasted on different protocols and formats when we could have stuffed everything into minimally structured log lines (and figured out the rest of the insight machinery later). Concretely, the zipkin/jaeger/prometheus protocols may not have needed to exist.


Once you have logs, you can index them in a variety of ways to turn them into metrics, traces, etc., but having logs as the fundamental primitive is powerful.
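
As a toy sketch (record shapes are made up), the same stream of log records can be rolled up into a per-route counter or regrouped into traces after the fact:

    # Toy example: fold the same log records into a metric and back into traces.
    from collections import Counter, defaultdict

    records = [
        {"route": "/users", "trace_id": "t1", "duration_ms": 12},
        {"route": "/users", "trace_id": "t2", "duration_ms": 30},
        {"route": "/login", "trace_id": "t3", "duration_ms": 8},
    ]

    print(Counter(r["route"] for r in records))  # a "metrics" view: requests per route
    by_trace = defaultdict(list)
    for r in records:
        by_trace[r["trace_id"]].append(r)
    print(dict(by_trace))  # a "traces" view: the same records grouped by trace id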



