Tracing: Structured logging, but better

zoogeny · on Sept 20, 2023

One thing about logging and tracing is the inevitable cost (in real money).

I love observability probably more than most. And my initial reaction to this article is the obvious: why not both?

In fact, I tend to think more in terms of "events" when writing both logs and tracing code. How that event is notified, stored, transmitted, etc. is in some ways divorced from the activity. I don't care if it is going to stdout, or over udp to an aggregator, or turning into trace statements, or ending up in Kafka, etc.

But inevitably I bump up against cost. For even medium sized systems, the amount of data I would like to track gets quite expensive. For example, many tracing services charge for the tags you add to traces. So doing `trace.String("key", value)` becomes something I think about from a cost perspective. I worked at a place that had a $250k/year New Relic bill and we were avoiding any kind of custom attributes. Just getting APM metrics for servers and databases was enough to get to that cost.

Logs are cheap, easy, reliable and don't lock me in to an expensive service to start. I mean, maybe you end up integrating splunk or perhaps self-hosting kibana, but you can get 90% of the benefits just by dumping the logs into Cloudwatch or even S3 for a much cheaper price.

phillipcarter · on Sept 20, 2023

FWIW part of the reason you're seeing that is, at least traditionally, APM companies rebranding as Observability companies stuffed trace data into metrics data stores, which becomes prohibitively expensive to query with custom tags/attributes/fields. Newer tools/companies have a different approach that makes cost far more predictable and generally lower.

Luckily, some of the larger incumbents are also moving away from this model, especially as OpenTelemetry is making tracing more widespread as a baseline of sorts for data. And you can definitely bet they're hearing about it from their customers right now, and they want to keep their customers.

Cost is still a concern but it's getting addressed as well. Right now every vendor has different approaches (e.g., the one I work for has a robust sampling proxy you can use), but that too is going the way of standardization. OTel is defining how to propagate sampling metadata in signals so that downstream tools can use the metadata about population representativeness to show accurate counts for things and so on.

_moog · on Sept 20, 2023

> Newer tools/companies have a different approach that makes cost far more predictable and generally lower.

What newer tools/companies are in this category? Any that you recommend?

hooverd · on Sept 21, 2023

I haven't used anything else, but I'll gladly shill for https://honeycomb.io.

mikeshi42 · on Sept 21, 2023

I think we fit in that bucket [1] - open source, self-hostable, based on OpenTelemetry and backed by Clickhouse DB (columnar, not time-series).

Clickhouse gives users much greater flexibility in tradeoffs than either a time-series or inverted-index based store could offer (along with S3 support). There's nothing like a system that can balance high performance AND (usable) high cardinality.

[1] https://github.com/hyperdxio/hyperdx

disclaimer (in case anyone just skimmed): I'm one of the authors of HyperDX

makeavish · on Sept 21, 2023

Companies like https://signoz.io/ are Opentelemetry native and have very transparent approach to predictable pricing. You can self host easily as well.

alexisread · on Sept 20, 2023

Any of the Clickhouse-based Otel stores can dump the traces to s3 for long-term storage, and can be self-hosted. I know the following use CH: https://uptrace.dev/ https://signoz.io/ https://github.com/hyperdxio/hyperdx

hosh · on Sept 20, 2023

I have made use of tracing, metrics, and logging all together and find each of them have its own place, as well as synergies of being able to work with all three together.

Cost is a real issue, and not just in terms of how much the vendor costs you. When tracing becomes a noticeable fraction of CPU or memory usage relative to the application, it's time to rethink doing 100% sampling. In practice, if you are sampling thousands of requests per second, you're very unlikely to actually look through each one of those thousands (thousands of req/s may not be a lot for some sites, but it is already exceeding human-scale without tooling). In order to keep accurate, useful statistics with sampling, you end up using metrics to store trace metrics prior to sampling.

thangalin · on Sept 20, 2023

> In fact, I tend to think more in terms of "events" when writing both logs and tracing code.

They are events[1]. For my text editor, KeenWrite, events can be logged either to the console when run from the command-line or displayed in a dialog when running in GUI mode. By changing "logger.log()" statements to "event.publish()" statements, a number of practical benefits are realized, including:

* Decoupled logging implementation from the system (swap one line of code to change loggers).

* Publish events on a message bus (e.g., D-Bus) to allow extending system functionality without modifying the existing code base.

* Standard logging format, which can be machine parsed, to help trace in-field production problems.

* Ability to assign unique identifiers to each event, allowing for publication of problem/solution documentation based on those IDs (possibly even seeding LLMs these days).

[1]: https://dave.autonoma.ca/blog/2022/01/08/logging-code-smell/

jameshart · on Sept 20, 2023

But events that another system relies upon are now an API. Be careful not to lock together things that are only superficially similar, as it affects your ability to change them independently.

thangalin · on Sept 20, 2023

Architecturally, the decoupling works as follows:

    Event -> Bus -> UI Subscriber -> Dialog (table)
    Event -> Bus -> Log Subscriber -> Console (text)
    Event -> Bus -> D-Bus Subscriber -> Relay -> D-Bus -> Publish (TCP/IP)

With D-Bus, published messages are versioned, allowing for API changes without breaking third-party consumers. The D-Bus Subscriber provides a layer of isolation between the application and the published messages so that the two can vary independently.

jameshart · on Sept 20, 2023

Observability costs feel high when everything’s working fine. When something snaps and everything is down and you need to know why in a hurry… those observability premiums you’ve been paying all along can pay off fast.

pondidum · on Sept 21, 2023

As other posters have mentioned, the incument companies rebranding to Observability definitely are expensive, because they are charging in the same way as they do for logs and/or metrics: per entry and per unique dimension (metrics especially).

Honeycomb at least charges per event, which in this case means per span - however they don't charge per span attribute, and each span can be pretty large (100kb / 2000 attributes).

I run all my personal services in their free tier, which has plenty of capacity, and that's before I do any sampling.

csomar · on Sept 21, 2023

How does one break into the industry though? I worked in a project tangentially related and the problem is that sales were done by corporate sales man rather than on technicality. The companies buying the product didn't care because the people involved were making "deals". The company selling the product didn't care about making the product better because it was selling and having high AWS bills sounds like they were doing something (even though they were burning money).

BiteCode_dev · on Sept 21, 2023

You don't have to keep traces for long though.

Log for long term, traces for short debut and analisys is a fine compromise.

thinkharderdev · on Sept 20, 2023

> I mean, maybe you end up integrating splunk or perhaps self-hosting kibana

I think this is the issue. Both Splunk and OpenSearch (even self-hosted OpenSearch) get really pricy as well especially with large volumes of log data. Cloudwatch can also get ludicrously expensive. They charge something like $0.50 per GB (!) and another $0.03 per GB to store. I've seen situations at a previous employer where someone accidentally deployed a lambda function with debug logging and ran up a few thousand $$ in Cloudwatch bills overnight.

You should look at Coralogix (disclaimer: I work there). We've built a platform that allows you to store your observability data in S3 and query it through our infrastructure. It can be dramatically more cost-effective than other providers in this space.

hahn-kev · on Sept 21, 2023

Why would you ever have lockin if you're using open telemetry?

layer8 · on Sept 20, 2023

> Log Levels are meaningless. Is a log line debug, info, warning, error, fatal, or some other shade in between?

I partly agree and disagree. In terms of severity, there are only three levels:

– info: not a problem

– warning: potential problem

– error: actual problem (operational failure)

Other levels like “debug” are not about severity, but about level of detail.

In addition, something that is an error in a subcomponent may only be a warning or even just an info on the level of the superordinate component. Thus the severity has to be interpreted relative to the source component.

The latter can be an issue if the severity is only interpreted globally. Either it will be wrong for the global level, or subcomponents have to know the global context they are running in to use the severity appropriate for that context. The latter causes undesirable dependencies on a global context. Meaning, the developer of a lower-level subcomponent would have to know the exact context in which that component is used, in order to chose the appropriate log level. And what if the component is used in different contexts entailing different severities?

So one might conclude that the severity indication is useless after all, but IMO one should rather conclude that severity needs to be interpreted relative to the component. This also means that a lower-level error may have to be logged again in the higher-level context if it’s still an error there, so that it doesn’t get ignored if e.g. monitoring only looks at errors on the higher-level context.

Differences between “fatal” and “error” are really nesting differences between components/contexts. An error is always fatal on the level where it originates.

Hermitian909 · on Sept 20, 2023

The OP is wrong, log levels are very valuable if you leverage them.

Here's a classic problem as an illustration: The storage cost of your logs is really prohibitive. You would like to cut out some of your logs from storage but cannot lower retention below some threshold (say 2 weeks maybe). For this example, assume that tracing is also enabled and every log has a traceId

A good answer is to run a compaction job that inspects each trace. If it contains an error preserve it. Remove X% of all other traces.

Log levels make the ergonomics for this excellent and it can save millions of dollars a year at sufficient scale.

abraae · on Sept 20, 2023

> In addition, something that is an error in a subcomponent may only be a warning or even just an info on the level of the superordinate component.

Or, keep it simple.

- error means someone is alerted urgently to look at the problem

- warning means someone should be looking into it eventually, with a view to reclassifying as info/debug or resolving it.

IMO many people don't care much about their logs, until the shit hits the fan. Only then, in production, do they realise just how much harder their overly verbose (or inadequate) logging is making things.

The simple filter of "all errors send an alert" can go a long way to encouraging a bit of ownership and correctness on logging.

layer8 · on Sept 20, 2023

> - error means someone is alerted urgently to look at the problem

The issue is that the code that encounters the problem may not have the knowledge/context to decide whether it warrants alerting. The code higher up that does have the knowledge, on the other hand, often doesn’t have the lower-level information that is useful to have in the log for analyzing the failure. So how do you link the two? When you write modular code that minimizes assumptions about its context, that situation is a common occurrence.

chii · on Sept 21, 2023

> When you write modular code that minimizes assumptions about its context, that situation is a common occurrence.

so your code isn't modular after all, because the code is _doing_ logging as a side-effect of the actual functionality.

The modularity of your code should mean that the outcome of the functionality is packaged into a bundle of data, and this bundle includes information about errors (or warnings) - aka, a status result.

The caller of this module will inspect this data, and they themselves will decide to log (or, if they are a module of their own, pass the data up again). This goes on, until the data goes into a logging layer - solely responsible for logging perhaps.

TeMPOraL · on Sept 21, 2023

Yes, except the problem here is that if the app crashes, you'll lose all the messages in the bundle. That's why people tend to use side-effect logging that persists messages immediately. That, and because it keeps timestamps correct.

I suppose this approach would make most sense in event-driven apps where no particular processing takes any meaningful amount of time, so you're constantly revisiting the top-level loop, where the "logging layer" could live. However, most software isn't written this way.

Too · on Sept 22, 2023

App segfaulting before having chance to log is mostly a thing in the past, unless you are writing c++. Any other language will instead have a top level exception handler.

If you were to take hard crashes into account, you would even have to log before each operation instead of after, basically reverting to printf-debugging.

TeMPOraL · on Sept 22, 2023

> unless you are writing c++

Guilty as charged.

> If you were to take hard crashes into account, you would even have to log before each operation instead of after

Yes, that's exactly what I see done and do for large enough operations (substeps of those operations only log when they're done).

> basically reverting to printf-debugging

That's what logging is, fundamentally. printf debugging, but with your own printf that has a few more knobs.

abraae · on Sept 20, 2023

If the code detecting the error is a library/subordinate service then the same rule can be followed - should this be immediately brought to a human's attention?

The answer for a library will often be no, since the library doesn't "have the knowledge/context to decide whether it warrants alerting".

So in that case the library can log as info, and leave it to the caller to log as error if warranted (after learning about the error from return code/http status etc.).

When investigating the error, the human has access to the info details from the subordinate service.

SkyPuncher · on Sept 20, 2023

I agree with your premise, but do consider debug to be a fourth level.

Info is things like “processing X”

Debug is things like “variable is Y” or “made it to this point”

BillinghamJ · on Sept 20, 2023

I tend to think of "warning" as - "something unexpected happened, but it was handled safely"

And then "error" as - "things are not okay, a developer is going to need to intervene"

And errors then split roughly between "must be fixed sometime", and "must be fixed now/ASAP"

layer8 · on Sept 20, 2023

> I tend to think of "warning" as - "something unexpected happened, but it was handled safely"

It was handled safely at the level where it occurred, but because it was unusual/unexpected, the underlying cause may cause issues later on or higher up.

If one were sure it would 100% not indicate any issue, one wouldn’t need to warn about it.

BillinghamJ · on Sept 25, 2023

That would indicate an issue - i.e. something we don't want. Just that it's not something where an engineer needs to go and mop up, and in theory would continue to operate correctly indefinitely. I guess correct as in - safe but not necessarily the most desirable behavior

fnordpiglet · on Sept 20, 2023

Tracing is poor at both very long lived traces, at stream processing, and most tracing implementations are too heavy to run in computationally bound tasks beyond at a very coarse level. Logging is nice in that it has no context, no overhead, is generally very cheap to compose and emit, and with including transaction id and done in a structured way gives you most of what tracing does without all the other baggage.

That said for the spaces where tracing works well, it works unreasonably well.

riv991 · on Sept 20, 2023

I think Open Telemetry has solved the stream processing problem issue with span links[1]. Treating each unit of work as an individual trace but being able to combine them and see a causal relationship. Slack published a blog about it pretty recently [2]

[1] https://opentelemetry.io/docs/concepts/signals/traces/#span-...

[2] https://slack.engineering/tracing-notifications/

cschneid · on Sept 20, 2023

When I worked at ScoutAPM, that list is basically the exact areas where we had issues supporting. We didn't do full-on tracing in the OpenTracing kind of way, but the agent was pretty similar, with spans (mostly automatically inserted), and annotations on those spans with timing, parentage, and extra info (like the sql query this represented in Active record).

The really hard things, which we had reasonable answers for, but never quite perfect: * Rails websockets (actioncable) * very long running background jobs (we stopped collecting at some limit, to prevent unbounded memory) * trying to profile code, we used a modified version of Stackprof to do sampling instead of exact profiling. That worked surprisingly well at finding hotspots, with low overhead.

All sorts of other tricks came along too. I should go look at that codebase again to remind me. That'd be good for my resume.... :)

https://github.com/scoutapp/scout_apm_ruby

phillipcarter · on Sept 20, 2023

Hmmm, for long-lived processes and stream processing we use tracing just fine. What we do is make a cutoff of 60 seconds, which each chunk is its own trace. But our backend queries trace data directly, so we can still analyze the aggregate, long-term behavior and then dig into a particular 60 second chunk if it's problematic.

fnordpiglet · on Sept 21, 2023

So, here are a few examples -

Suppose you have a long data pipeline that you want to trace jobs across. There are not an enormous number of jobs but each one takes 12 hours across many phases. In theory tracing works great here, but in practice most tracing platforms can’t handle this. This is especially true with tailed based tracing as traces can be unbounded and it has to assume at some point their time out. You can certainly build your own, but most of the value of tracing solutions is the user experience; which is also the hardest part.

On stream processing I’ve generally found it too expensive to instrument stream processors with tracing. Also there’s generally not enough variability to make it interesting. Context stitching and span management as well as sweeping and shipping of traces can be expensive in a lot of implementations and stream processing is often cpu bound.

A simple transaction id annotated log makes a lot more sense in both, queried in a log analytic platform.

alkonaut · on Sept 20, 2023

I like a log to read like a book if it’s the result of a task taking a finite time, such as for example an installation, a compilation, a loading of a browser page or similar. Users are going to look into it for clues about what happened and they a) aren’t always related to those who wrote the tools b) don’t have access to the source code or any special log analytics/querying tools.

That’s when you want a log and that’s what the big traditional log frameworks were designed to handle.

A web backend/service is basically the opposite. End users don’t have access to the log, those who analyze it can cross reference with system internals like source code or db state and the log is basically infinite. In that situation a structured log and querying obviously wins.

It’s honestly not even clear that these systems are that closely related.

WatchDog · on Sept 20, 2023

It’s a good distinction to make, logging for client based systems, is essentially UI design.

For a web app, serving lots of concurrent users, they are essentially unreadable without tools, so you may as well optimise the logs for tool based consumption.

mrkeen · on Sept 20, 2023

> If you’re writing log statements, you’re doing it wrong.

I too use this bait statement.

Then I follow it up with (the short version):

1) Rewrite your log statements so that they're machine readable

2) Prove they're machine-readable by having the down-stream services read them instead of the REST call you would have otherwise sent.

3) Switch out log4j for Kafka, which will handle the persistence & multiplexing for you.

Voila, you got yourself a reactive, event-driven system with accurate "logs".

If you're like me and you read the article thinking "I like the result but I hate polluting my business code with all that tracing code", well now you can create an independent reader of your kafka events which just focuses on turning events into traces.

rewmie · on Sept 20, 2023

> 3) Switch out log4j for Kafka, which will handle the persistence & multiplexing for you.

I don't think this is a reasonable statement. There are already a few logging agents that support structured logging without dragging in heavyweight dependencies such as Kafka. Bringing up Kafka sounds like a case of a solution looking for a problem.

lmm · on Sept 21, 2023

> I don't think this is a reasonable statement. There are already a few logging agents that support structured logging without dragging in heavyweight dependencies such as Kafka. Bringing up Kafka sounds like a case of a solution looking for a problem.

If it's data you care about then you put it in Kafka, unless you're big enough to use something like Cassandra or rich enough to pay a cloud provider to make redundant data storage their problem. Logs are something that you need to write durably and reliably when shit is hitting the fan and your networks are flaking and machines are crashing - so ephemeral disks are out, NFS is out, ad-hoc log collector gossip protocols are out, and anything that relies on single master -> read replica and "promoting" that replica is definitely out.

Kafka is about as lightweight as it gets for anything that can't be single-machine/SPOF. It's a lot simpler and more consistent than any RDBMS. What else would you use? HDFS (or maybe OpenAFS if your ops team is really good) is the only half-reasonable alternative I can think of.

jimbokun · on Sept 22, 2023

OK, but then how do you perform ad hoc queries on everything you logged to Kafka when it's time to debug an issue?

There are plenty of well known, battle tested solutions for solving that problem with old school logging.

lmm · on Sept 24, 2023

> OK, but then how do you perform ad hoc queries on everything you logged to Kafka when it's time to debug an issue?

Again I'd say treat it like data you care about. Use your best guess at a primary identifier as the record key, depending on your data volume do some indexing/pre-aggregation around other facets that you think you might want to query on (which might include materialising everything in ksqldb, or even in some other datastore), and accept that occasionally you're going to have to do a slow full scan.

> There are plenty of well known, battle tested solutions for solving that problem with old school logging.

Splunk was just bought for $28B because none of those "well known, battle tested solutions" are any good. (Splunk also sucks! It just sucks a little less than the other options).

mrkeen · on Sept 22, 2023

Do you want to debug what happened to your business entities, or do you want to debug what happened in your logs? Because if they're different things, those are different questions.

> There are plenty of well known, battle tested solutions for solving that problem with old school logging.

And you can run them in parallel (and without interference) by having them ingest from Kafka.

mrkeen · on Sept 20, 2023

> There are already a few logging agents that support structured logging without dragging in heavyweight dependencies such as Kafka.

What are they? Because admittedly I've lost a little love for the operational side of Kafka, and I wish the client-side were a little "dumber", so I could match it better to my uses cases.

ahoka · on Sept 20, 2023

I think OP meant event sourcing.

rewmie · on Sept 20, 2023

> I think OP meant event sourcing.

That is really besides the point. Logging and tracing have always been fundamentally event sourcing, but that never forced anyone ever at all to onboard onto freaking Kafka of all event streaming/messaging platforms.

This blend of suggestion sounds an awful lot like resume driven development instead of actually putting together a logging service.

Spivak · on Sept 21, 2023

Hard disagree, Kafka is one of the simplest lowest maintenance tools for this with excellent language support and would probably be the first choice for anyone not paying $cloud_vendor for a managed durable queue.

The first step in building a reliable logging system is setting up a high write throughout highly available FIFOish durable storage. Once you have that everything else gets a lot easier.

* Once the log is committed to the durable queue that's it the application can move on secure the log isn't going to get lost.

* Multiple consumer groups can process the logs for different purposes, the usuals are one group for persisting the logs to a searchable index and one group for real time altering.

* Everything downstream from Kafka can be far less reliable because it's just a queue backup.

* You can fake more throughout then you actually have in your downstream processors because it just manifests as a lagging offset.

rewmie · on Sept 21, 2023

> Hard disagree, Kafka is one of the simplest lowest maintenance tools for this (..)

You sound like you've been using an entirely different project named Kafka, because the Kafka everyone uses is renowned among message brokers for its complexity and operational overhead.

Spivak · on Sept 21, 2023

I might be, it's one of the lowest touch services we run. But we aren't doing the "Kafka all the things" model where every single little app is hooked into it for generic message passing but simply logs go in, logs go out, nothing else.

The business logic message passing goes through Rabbit because we wanted out of order processing, priority routing, retry queues, blah blah.

bowsamic · on Sept 20, 2023

How to get me to leave your company 101

mrkeen · on Sept 20, 2023

I did write a pretty glib description of what to do ;)

That said, I've had conflicts with a previous team-mate about this. He couldn't wrap his head around Kafka being a source of truth. But when I asked him whether he'd trust our Kafka or our Postgres if they disagreed, he conceded that he'd believe Kafka's side of things.

crabbone · on Sept 20, 2023

> The second problem with writing logs to stdout

Who on Earth does that? Logs are almost always written to stderr... In part to prevent other problems author is talking about (eg. mixing with the output generated by the application).

I don't understand why this has to be either or... If you store the trace output somewhere you get a log... (let's call it "un-annotated" log, since trace won't have the human-readable message part). Trace is great when examining the application interactively, but if you use the same exact tool and save the results for later you get logs, with all the same problems the author ascribes to logs.

OJFord · on Sept 20, 2023

Loads of people, it drives me around the twist too (especially when there's inevitably custom parsing to separate the log messages from the output) but it happens, probably well correlated with people that use more GUI tools, not that there's anything wrong with that, just I think the more you use a CLI the more you're probably aware of this being an issue, or other lesser best practices that might make life easier like newline and tab separation.

FridgeSeal · on Sept 20, 2023

I do, as does everyone at my work? Along with basically everyone I’ve ever worked with, ever?

Like, I develop cli apps, so like, what else would go to stdout that you suppose will interfere?

crabbone · on Sept 21, 2023

Nothing will go to stdout! Nothing is the best thing you can have when it comes to program output. Easiest validation! This is also how all Unix commands work -- they don't write to stdout unless you tell them to. But, if there's nothing extraordinary happening during the program execution -- nothing is written.

But why would you write your own logs instead of using something built into your language's library? I believe Python's logging module writes to stderr by default. Go's log package always goes to stderr.

But... today I've learned that console.log() in NodeJS writes to stdout... well, I've lots another tiny bit of faith in humanity.

FridgeSeal · on Sept 22, 2023

> This is also how all Unix commands work -- they don't write to stdout unless you tell them to

Ok? But as per my other comment, I’m not writing CLI apps, it’s mostly services and I have supporting services which harvest the logs from each containers stdout.

> But why would you write your own logs instead of using something built into your language's library?

I’m not writing my own logging setup? I am using the provided tools?? Every language logging library I’ve ever used writes to stdout?

Structlog in Python, nodejs obvs, all the Rust logging libraries I’ve ever used, I know you can configure Java/scala 3 million different ways (hello yes log4j lol), but all the Spark stuff I’ve written has logged to stdout.

FridgeSeal · on Sept 21, 2023

Can’t edit this now, but this is supposed to say “I don’t develop cli apps”

dalyons · on Sept 20, 2023

Being doing it for decade+, ever since the 12 factor app concept became popular. It’s way more common imho for web apps than stderr logging.

benreesman · on Sept 20, 2023

As a historical critic of Rust-mania (and if I’m honest, kind of an asshole about it too many times, fail), I’ve recently bumped into stuff like tokio-tracing, eyre, tokio-console, and some others.

And while my historical gripes are largely still the status quo: stack traces in multi-threaded, evented/async code that actually show real line numbers? Span-based tracing that makes concurrent introspection possible by default?

I’m in. I apologize for everything bad I ever said and don’t care whatever other annoying thing.

That’s the whole show. Unless it deletes my hard drive I don’t really care about anything else by comparison.

hardwaresofton · on Sept 20, 2023

I think there's an alternate universe out there where:

- we collectively realized that logs, events, traces, metrics, and errors are actually all just logs

- we agreed on a single format that encapsulated all that information in a structured manner

- we built firehose/stream processing tooling to provide modern o11y creature comforts

I can't tell if that universe is better than this one, or worse.

andrewstuart2 · on Sept 20, 2023

Traces are just distributed "logs" (in the data structure sense; data ordered only by its appearance in something) where you also pass around the tiniest bit of correlation context between apps. Traces are structured, timestamped, and can be indexed into much more debug-friendly structures like a call tree. But you could just as easily ignore all the data and print them out in streaming sorted order without any correlation.

Honestly it sounds like you're pitching opentelemetry/otlp but where you only trace and leave all the other bits for later inside your opentelemetry collector, which can turn traces into metrics or traces into logs.

hardwaresofton · on Sept 21, 2023

So this is kind of what I was talking about but it's more than that -- if your default is structured logs (simplest example is JSON) then all you have to do is put the data you care about into the log.

So I'm imagining something more like:

   {"level":"info", "otlp": { "trace": { ... }}}

   {"level":"info", "otlp": { "error": { ... }}}

   {"level":"info", "otlp": { "log": { ... }}}

   {"level":"info", "otlp": { "metric": { ... }}}

(standardizing this format would be non-trivial of course, but I could imagine a really minimal standard)

Your downstream collector only needs one API endpoint/ingestion mechanism -- unpacking the actual type of telemetry that came in (and persisting where necessary) can be left to other systems.

Basically I think the systems could have been massively simpler in most UNIX-y environments -- just hook up STDOUT (or scrape it, or syslog or whatever), and you're done -- no allowing ports out for jaeger, dealing with complicated buffering, etc -- just log and forget.

phillipcarter · on Sept 20, 2023

That's more or less the model Honeycomb uses. Every signal type is just a structured event. Reality is a bit messier, though. In particular, metrics are the oddball in this world and required a lot of work to make economical.

hardwaresofton · on Sept 21, 2023

Ah thanks for noting this, I that's exactly the insight I mean here.

Yeah I think the worst case you basically just exfiltrate metrics out to other subsystems (honestly, you could kind of exfiltrate all of this), but the default is pipe heavily compressed stuff to short and long term storage, and some processors for real time... blah blah blah.

Obviously Honeycomb is actually doing the thing and it's not as easy as it sounds, but it feels like if we had all thought like this earlier we might have skipped making a few protocols (zipkin, jaeger, etc), and focused on just data layout (JSON vs protobuf vs GELF, etc) and figuring out what shapes to expect across tools.

dalyons · on Sept 20, 2023

Is that really an alternate universe? That’s the universe that splunk and friends are selling, everything’s a log. It’s really expensive.

hardwaresofton · on Sept 21, 2023

Splunk does have margins and I think they're quite high. Same with Datadog (see: all the HN startups that are trying to grab some of that space).

There's a big gap between what it takes for the engineering to work and what all these companies charge.

My point is really more about the engineering time wasted on different protocols and stuff when we could have stuffed everything into minimally structured log lines (and figured out the rest of the insight machinery later). Concretely, that zipkin/jaeger/prometheus protocols and stuff may not have needed to exist, etc.

ec109685 · on Sept 21, 2023

Once you have logs, you can index them in a variety of ways to turn them into metrics, traces, etc., but having logs as the fundamental primitive is powerful.

jeffbee · on Sept 20, 2023

This is a great article because everyone should understand the similarity between logging and tracing. One thing worth pondering though is the differences in cost. If I am not planning to centrally collect and index informational logs, free-form text logging is extremely cheap. Even a complex log line with formatted strings and numbers can be emitted in < 1µs on modern machines. If you are handling something like 100s or 1000s of requests per second per core, which is pretty respectable, putting a handful of informational log statements in the critical path won't hurt anyone.

Off-the-shelf tracing libraries on the other hand are pretty expensive. You have one additional mandatory read of the system clock, to establish the span duration, plus you are still paying for a clock read on every span event, if you use span events. Every span has a PRNG call, too. Distributed tracing is worthless if you don't send the spans somewhere, so you have to budget for encoding your span into json, msgpack, protobuf, or whatever. It's a completely different ball game in terms of efficiency.

xyzzy_plugh · on Sept 20, 2023

I will agree that conceptually logging can be much cheaper than tracing ever can, but in practice any semi-serious attempt at structured logging ends up looking very, very close to tracing. In fact I'd go so far as to say that the two are effectively interchangeable at a point. What you do with that information, whether you index it or build a graph, is up to you -- and that is where the cost creeps in.

Adding timestamps and UUIDs and an encoding is par for the course in logging these days, I don't think that is the right angle to criticize efficiency.

Tracing can be very cheap if you "simply" (and I'm glossing over a lot here) search for all messages in a liberal window matching each "span start" message and index the result sets. Offering a way to view results as a tree is just a bonus.

Of course, in practice this ends up meaning something completely different, and far costlier. Why that is I cannot fathom.

nithril · on Sept 20, 2023

It is actually simpler to conceptualize the difference, one is stateless, the other one is stateful.

Actually structured logging exists since years like in Java https://github.com/logfellow/logstash-logback-encoder

hyperpape · on Sept 20, 2023

I don't generally disagree, but using json for structured logs is a growing thing as well.

perpil · on Sept 20, 2023

I was recently musing about the 2 different types of logs:

1. application logs, emitted multiple times per request and serve as breadcrumbs

2. request logs emitted once per request and include latencies, counters and metadata about the request and response

The application logs were useless to me except during development. However the request logs I could run aggregations on which made them far more useful for answering questions. What the author explains very well is that the problem with application logs is they aren't very human-readable which is where visualizing a request with tracing shines. If you don't have tracing, creating request logs will get you most of the way there, it's certainly better than application logs. https://speedrun.nobackspacecrew.com/blog/2023/09/08/logging...

ec109685 · on Sept 21, 2023

Stripe is big believer in request logs: https://stripe.com/blog/canonical-log-lines

ducharmdev · on Sept 20, 2023

Minor nitpick, but I wish this post started with defining what we mean by logging vs tracing, since some people use these interchangeably. The reader instead has to infer this from the criticisms of logging.

ryanklee · on Sept 20, 2023

I've never encountered this confusion anywhere, so I wouldn't ever think to dispel it. Which isn't to say that I disagree with the more general point that defining your terms is good thing.

In any case, the post itself (which is not long) illustrates and marks out many of the differences.

ducharmdev · on Sept 21, 2023

I would guess that you're either not around junior engineers, or people are very good at hiding their confusion.

ryanklee · on Sept 21, 2023

I wouldn't assert that the confusion is non-existent. But I think the audience for a post comparing technical differences between logging and tracing is unlikely a junior one.

But again, I do think the (brief) post marks out the differences throughout, so regardless, it still doesn't strike me as a problem here.

jlokier · on Sept 20, 2023

I agree. I'm working with code that uses 'verbose "message"' for level 1 verbosity logs and 'trace "message"' for level 2 verbosity. Makes sense in its world, but it's not the same meaning as how cloud-devops-observability culture uses those words.

waffletower · on Sept 20, 2023

There are logging libraries that include syntactically scoped timers, such as mulog (https://github.com/BrunoBonacci/mulog). While a great library, we preferred timbre (https://github.com/taoensso/timbre) and rolled our own logging timer macro that interoperates with it. More convenient to have such niceties in a Lisp of course. Since we also have OpenTelemetry available, it would also be easy to wrap traces around code form boundaries as well. Thanks OP for the idea!

goalieca · on Sept 20, 2023

Logging is essential for security. I think tracing is wonderful and so are metrics. I see these as more of a triad for observability.

waffletower · on Sept 20, 2023

Indeed, the three legs (metrics, logs, traces) of OpenTelemetry's telescope. https://opentelemetry.io

candiddevmike · on Sept 20, 2023

Something missing from OTel IMO is a standard way of linking all three together. It seems like an exercise left to the reader, but I feel like there should be standard metadata for showing a relationship between traces, metrics, and logs. Right now each of these functions is on an island (same with the tooling and storage of the data, but that's another rant).

discodachshund · on Sept 20, 2023

Isn't that the trace ID? For metrics, it's in the form of exemplars, and for logs it is the log context

candiddevmike · on Sept 20, 2023

That might be dependent on the library then, there isn't an official OTel Go logging library yet. Seems you have to add the trace ID exemplars manually too

phillipcarter · on Sept 20, 2023

Go is behind several of the languages in OTel right now. Just a consequence of a very difficult implementation and its load-bearing nature as being the language (and library) of choice for CNCF infrastructure. If you use Java or .NET, for example, it's quite fleshed out.

jen20 · on Sept 20, 2023

One would hope that there will not _be_ an Open Telemetry logging library for Go. Unlike last time there was a thread about this, there is now a standard - `slog` in the stdlib.

gazpacho · on Sept 21, 2023

One big failing of OpenTelemetry's traces in particular is that attaching structured data to them is difficult. Most structured logs can be JSON which for all of it's faults most things can be serialized to JSON. OpenTelemetry's attributes on traces are much more limited, they don't even support a null/None value! I wish they just accepted JSON-like data, it'd make it much easier to always use traces.

h1fra · on Sept 20, 2023

Tracing is much more actionnable but barely usable without a platform. Which makes local programming dependent on third party. Also it requires passing context or have a way to get back the context in every function that requires it, which can be daunting.

On my side I have opted to mixed structured/text, a generic message that can be easily understood while glancing over logs, and a data object attached for more details.

hinkley · on Sept 20, 2023

Someone got me excited about tracing and I started tweaking our stats API to optionally add tracing. Retrofitted it into a mature app, then immediately discovered that all of the data was being dropped because AWS only likes very tiny traces. Depth or fanout or both break it rather quickly.

And OpenTelemetry has a very questionable implementation. For a nested trace, events fire when the trace closes, meaning that a parent ID is reported before it is seen in the stream. That can’t be good for processing. Would be better to have a leading edge event (also helps with errors throwing and the parent never being reported).

Kind of a bummer. Needs work.

pcthrowaway · on Sept 20, 2023

> OpenTelemetry has a very questionable implementation

The nice thing about OpenTelemetry is that it's a standard. The questionable implementation you're referencing isn't a source of truth. There isn't some canonical "questionable" implementation.

There are many, slightly different, questionable implementations.

hinkley · on Sept 20, 2023

If the wire protocol has a bug, that’s not something an implementation can fix.

I’m saying the wire protocol is wrong.

candiddevmike · on Sept 20, 2023

You can add Jaeger to your local dev containers and run it in memory, it's really lightweight and easy to use.

vkoskiv · on Sept 20, 2023

Nit to the author: 'rapala' seems like a mistranslation. It is a brand name of a company that makes fishing lures, as far as I can tell. It is not the Finnish word for "to bait", and is therefore only used to refer to a that particular brand. I'm not sure what the purpose of the text in parenthesis is here, but 'houkutella' would be the most apt translation in this case.

pondidum · on Sept 21, 2023

Thanks, I have fixed the definition in the post! Turns out its just company slang for bating, rather than Finnish slang!

jauntywundrkind · on Sept 20, 2023

What's most incredible to me is how close tracing feels in spirit to me to event-sourcing.

Here's this log of every frame of compute going on, plus data or metadata about the frame.... but afaik we have yet to start using the same stream of computation for business processes as we do for it's excellent observability.

alexisread · on Sept 20, 2023

Any of the Clickhouse-based Otel stores can do event sourcing - just set up materialised views on the trace tables. I know the following use CH: https://uptrace.dev/ https://signoz.io/ https://github.com/hyperdxio/hyperdx

juliogreff · on Sept 20, 2023

As a matter of fact, at a previous job we used traces as a data source for event sourcing. One use case: we tracked usage of certain features in API calls in traces, and some batch job ran at whatever frequency aggregated which users were using which features. While it was far from real time because of the sheer amount of data, it was so simple to implement that we had dozens of use cases implemented like that.

koliber · on Sept 21, 2023

Does this naive approach work for anyone to allow a log to be read like a trace:

1. At the start of a request, generate a globally unique traceId

2. Pass this traceId through the whole call stack.

3. Whenever logging, log the traceId as a parameter

Now you have a log with many of the plusses of a trace. The only additional cost to the log is the storage of the traceId on every message.

If you want to read a trace, search through your logs for "traceId: xyz123". If you use plain text storage you can grep. If you use some indexed storage, search for the key-value pair.

This way, you can retrieve something that looks like a trace from a log.

This does not solve all the issues named in the article. However, it is a decent tradeoff that I've used successfully in the past. Call it "poor man's tracing".

pondidum · on Sept 21, 2023

Yes, but going to this effort, why not move to tracing instead?

A migration path I could see might be:

- replace current logging lib with otel logging (sending to same output) - setup tracing - replace logging with tracing over time (I prefer moving the most painful areas of code first)

koliber · on Sept 22, 2023

One benefit is that you only need to send one string value (traceId) through the whole call stack, instead of passing around a trace object that gets built up. It seems lighter and simpler to add to an existing codebase.

skybrian · on Sept 20, 2023

How would a hobbyist programmer get started with tracing for a simple web app? Where do the traces end up and how do I query it? Can tracing be used in a development environment?

Context: the last thing I wrote used Deno and Deno Deploy.

curioussavage · on Sept 20, 2023

Just install opentelemetry libs. I found this example with a quick search: https://dev.to/grunet/leveraging-opentelemetry-in-deno-45bj

opentelemetry has a service you can run that will collect the telemetry data and you can export it to something like prometheus which can store it and let you query it. Example here https://github.com/open-telemetry/opentelemetry-collector-co...

Typically in dev environments trace spans are just emitted to stdout just like logs. I sometimes turn that off too though because it gets noisy.

eep_social · on Sept 20, 2023

For local dev Jaegar provides a very nice UI to do trace inspection, searching etc.

spullara · on Sept 20, 2023

It drives me insane that the standardized tracing libraries have you only report closed spans. What if it crashes? What if it stalls? Why should I keep open spans in memory when I can just write an end span event?

andersrs · on Sept 20, 2023

I have a side project that I run in Kubernetes with a postgres database and a few Go/Nodejs apps. Recommend me a lightweight otel backend that isn't going to blow out my cloud costs.

hosh · on Sept 20, 2023

That’s weird. I use both logging and tracing where I can. And metrics.

While there are better tools for alerting, metrics, or aggregations, it helps a lot in debugging and troubleshooting.

aero142 · on Sept 20, 2023

I think the author's point is that tracing is a better implementation of both logs and metrics, and I think it's a valid point. * metrics are pre-aggregated into timeseries data, which makes cardinality expensive. You could also aggregate a value from a trace statement. * Logs are hand crafted and unique, and are usually improved by adding structured attributes. Structured attributes are better as traces because you can have execution context and well defined attributes that provide better detail.

Traces can be aggregated or sampled to provide all of the information available from logs, but in a more flexible way. * Certain traces can be retained at 100%. This is equivalent to logs. * Certain trace attributes can be converted to timeseries data. This is equivalent to metrics. * Certain traces can be sampled and/or queried with streaming infrastructure. This is a way to observe data with high cardinality without hitting the high cost.

hosh · on Sept 20, 2023

There are things you can do with metrics and logging that you cannot do with traces. These usually fall outside of debugging application performance and bottlenecks. So I think what the author says is true if you are only thinking about application, and not for gaining a holistic understanding of the entire system, including infrastructure.

Probably the biggest tradeoff with traces is that, in practice, you are not retaining 100% of all traces. In order to keep accurate statistics, it generally gets ingested as metrics before sampling. The other is that traces are not stored in such a way where you are looking at what is happening at a point-in-time -- which is what logging does well. If I want to ensure I have execution context for logging, I make the effort to add trace and span ids so that traces and logging can be correlated.

To be fair, I live in the devops world more often than not, and my colleagues on the dev teams rarely have to venture outside of traces.

I don't mind the points this author is making. My main criticism is that it is scoped to the world of applications -- which is fine -- but then taken as universal for all of software engineering.

marcus_holmes · on Sept 21, 2023

I'm fundamentally uncomfortable with sending all my data to a third party.

The cool thing about logs is that they're just a text file and don't need to be sent over the internet to someone else. But yes, I've encountered some problems just using text logs and I'd like to solve them.

Is there an OpenTelemetry solution that is capable of being self-hosted (and preferably OS) that anyone recommends?

Too · on Sept 22, 2023

Grafana Tempo or Jaeger all-in-one are both OpenTelemetry compatible and easy host yourself on small scale.

malinens · on Sept 21, 2023

open source tracing tools can be easily locally hosted

jasonjmcghee · on Sept 20, 2023

I really enjoyed the content- it's a great article.

Note to author: all but the last code block have a very odd mixture of rather large font sizes (at least on mobile) which vary line to line that make them pretty difficult to read.

Also the link to "Observability Driven Development." was a blank slide deck AFAICT

pondidum · on Sept 21, 2023

They all look fine in "mobile view" in firefox, and on firefox in android.

It's all statically rendered html, and I don't see anything weird in the html either.

Do you have a screenshot and some device info so I can look a bit more? Thanks

jasonjmcghee · on Sept 23, 2023

https://pasteboard.co/xNmrHz0YNmgg.png

Happened in safari and brave.

iOS 16, iPhone 13 Pro

amelius · on Sept 20, 2023

This is stuff that a debugger is supposed to do for you, for free.

This should not require code at the application level, but it should be implemented at the tooling level.

Too · on Sept 22, 2023

Are you saying every single variable and function call should be logged every time the code runs? In a dream world sure. While we are at it, let's make it possible to freeze all the state from the production system and let me add breakpoints to rewind in time. In the real world, someone has to make a decision what is noise and what is content.

Unless you are talking about profilers, that measure execution time and memory only, but traces are a lot more than only that.

Annotating the code with logs and traces is a UX activity, not for the end users, but for the ops-team. They don't have knowledge of the internals of the code. Logs should be written in the context of levers that ops have control over.

Take the example from the OP: nr of cache hits. It's something ops can control by configuring the cache size, it is something ops can observe and correlate with request-time and network bandwidth. It would require an immensely sophisticated debugger to make all these correlations automatically.

conradludgate · on Sept 20, 2023

Are you running a debugger on a web service in production?

lmm · on Sept 21, 2023

I wouldn't call it a "debugger", but plenty of people run an instrumentation agent like New Relic or AppDynamics that records tracing information on their production web services with little or even zero modification to their application code.

conradludgate · on Sept 21, 2023

> zero modifications

Highly depends on the language or framework. I used to use NewRelic APM with Go and it required additional code to instrument.

All languages I have used in production require manual instrumentation. For that tracing and logging both work well in my experience.

lmm · on Sept 21, 2023

> I used to use NewRelic APM with Go and it required additional code to instrument.

I'm not surprised, Go's runtime is pretty limited. But for e.g. Java if you're using a well-known/"standard" framework you can just flick the switch and it'll give you a lot of good useful information - additional manual instrumentation usually helps, but the level of instrumentation without it is useful.

eep_social · on Sept 20, 2023

Tracing would be on every web service in production! At the same time! And saving all the output!

lambda_garden · on Sept 20, 2023

Couldn't this be injected into the runtime so that no code changes are required?

Perhaps really performance critical stuff could have a "notrace" annotation.

imiric · on Sept 20, 2023

There are several projects that leverage eBPF for automatic instrumentation[1].

How accurate and useful these are vs. doing this manually will depend on the use case, but I reckon the automatic approach gets you most of the way there, and you can add the missing traces yourself, so if nothing else it saves a lot of work.

[1]: https://ebpf.io/applications/

austinsharp · on Sept 20, 2023

Yes, OTel has autoinstrumentation libraries for some language that can pick up a fair amount by default. Though it's unlikely that that would ever be sufficient, it's a nice start.

For Java: https://opentelemetry.io/docs/instrumentation/java/automatic...

thinkharderdev · on Sept 20, 2023

Sure, and a lot of tools will do this in one way or another. Either instrument code directly or provide annotations/macros to trace a specific method (something like tokio-tracing in the Rust ecosystem).

However, tracing literally every method call would probably be prohibitively expensive so typically you have either:

1. Instrumentation with "understands" common frameworks/libraries and knows what to instrument (eg request handlers in web frameworks)

2. Full opt-in. They make it easy to add a trace for a method invocation with a simple annotation but nothing gets instrumented by default

pondidum · on Sept 21, 2023

Yes, and itel has instrumentation libraries which do this.

However, no automatic instrumentation can do everything for you; it can't know what are all the interesting properties or things to add as attributes. But adding tracing automatically to SQL clients, web frameworks etc is very valuable

killbot5000 · on Sept 21, 2023

Logs should go to stderr. I will die on this hill.

hello1234567 · on Sept 20, 2023

person writing this came to know some thing that he din't know earlier and decided to convert his light bulb moment into a blog post. not bad bad but failed to understand that logs are the generalisation of very thing they are talking about.

thegrizzlyking · on Sept 20, 2023

Logs are mostly "Hi I reached this line of code, here is some metadata"