Is it time to version observability? (charity.wtf)
84 points by RyeCombinator on Aug 9, 2024 | 74 comments


They do not appear to understand the fundamental difference between logs, traces, and metrics. Sure, if you can log every event you want to record, then everything is just events (I will ignore the fact that they are still stuck on formatted text strings as an event format). The difference is what you do when you can not record everything you want to, either at build time or at runtime.

Logs are independent. When you can not store every event, you can drop them randomly. You lose a perfect view of every logged event, but you still retain a statistical view. As we have already assumed you can not log everything, this is the best you can do anyways.

Traces are for correlated events where you want every correlated event (a trace) or none of them (or possibly the first N in a trace). Losing events within a trace makes the entire trace (or at least the latter portions) useless. When you can not store every event, you want to drop randomly at the whole trace level.

Metrics are for situations where you know you can not log everything. You aggregate your data at log time, so instead of getting a statistically random sample you instead get aggregates that incorporate all of your data at the cost of precision.
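
To make that concrete, here is a rough sketch of the three degradation modes (purely illustrative Go, not from the article; the names and rates are made up):

    // Sketch only: three ways telemetry can degrade when you cannot keep everything.
    package sketch

    import "math/rand"

    // Logs: independent events, so drop them at random and keep a statistical view.
    func keepLog(rate float64) bool {
        return rand.Float64() < rate // e.g. rate = 0.01 keeps ~1% of events
    }

    // Traces: correlated events, so make one decision per trace ID and keep or
    // drop the whole trace together.
    func keepTrace(traceID uint64, rate float64) bool {
        return float64(traceID%10000)/10000.0 < rate // deterministic per trace
    }

    // Metrics: aggregate at record time; every event is counted, precision is lost.
    type counter struct{ count, sum int64 }

    func (c *counter) observe(v int64) { c.count++; c.sum += v }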

Note that for the purposes of this post, I have ignored the reason why you can not store every event. That is an orthogonal discussion and techniques that relieve that bottleneck allow more opportunities to stay on the happy path of "just events with post-processed analysis" that the author is advocating for.


> what do you do when you can not record everything you want to

The thesis of the article is that you should use events as the source of truth, and derive logs, metrics, and traces from them. I see nothing logically or spiritually wrong with that fundamental approach and it's been Honeycomb's approach since Day 1.

I do feel like Charity is ignoring the elephant in the room of transit and storage cost: "Logs are infinitely more powerful, useful and cost-effective than metrics." Weeeelllllll....

Honeycomb has been around a while. If transit and storage were free, using Honeycomb and similar solutions would be a no-brainer. Storage is cheap, but probably not as cheap as it ought to be in the cloud.[1] And certain kinds of transit are still pretty pricey in the cloud. Even if you get transit for free by keeping it local, using your primary network interface for shipping events reduces the amount of bandwidth remaining for the primary purpose of doing real work (i.e., handling requests).

Plus, I think people are aware--even if they don't specifically say so--that data that is processed, stored, and never used again is waste. Since we can't have perfect prior knowledge of whether some data will be valuable later, the logical course of action is to retain everything. But since doing so has a cost, people will naturally tend towards trying to capture as little as they can get away with while still being able to do their job.

[1] I work for AWS, but any opinions stated herein are mine and not necessarily those of my colleagues or the company I work for.


> The thesis of the article is that you should use events as the source of truth, and derive logs, metrics, and traces from them.

That was the main point? Man, the article goes on and on about definitions and semantic versioning and how we should call observability "olly", but glad to hear there's actually a coherent thought behind it.

Also, I don't think the "it's too expensive to log everything" argument should be true. Storage is insanely cheap and you can fit a whole lot of information inside a single MB (even more so when using compression).


Of course everything is events. The distinction is how they do and should degrade when you can not record everything (and secondarily how we should interpret and correlate them, though that is more in the structured/schema aspect).

I personally believe you should deploy systems with robust, comprehensive, automatic instrumentation that you analyze offline, which appears to match what the author is advocating for (except maybe the automatic part). But, that is orthogonal to the author’s claim that it is some sort of groundbreaking revolution in observability. It is not, it is just a scheme that assumes you only need to tread the happy path.

However, if you are able to stay on the happy path, then you should absolutely prefer a scheme in the same general vicinity as what the article proposes. Personally, the things the author links to lead me to question the efficiency of the implementations (compute, bandwidth, and storage bottlenecks become worse with inefficient implementations), which makes it harder to stay on the happy path, but the core idea is very effective on the happy path.


Yes, I'm quite sure the CTO of a leading observability platform is simply confused about terminology.


It is not impossible that this is the case (at least in GP's view). Companies in the space argue over whether logs and events are structured or unstructured data and over how much to exploit that structure. Unstructured is the simple way, and appears to be the approach that TFA prefers, while deep exploitation of structured event collection actually appears to be better for many technical reasons but is more complex.

From what I can tell, Honeycomb is staffed up with operators (SRE types). The GP is thinking about logs, traces, and metrics like a mathematician, and I am not sure that anyone at Honeycomb actually thinks that way.


I think Veserv was too nice in their comments - the CTO is not "simply confused", she is specifically trying to introduce confusion to separate Honeycomb from competitors.

See, under the common definition of observability, Honeycomb is just another platform with upsides and downsides. But if you believe the original post, Honeycomb is "2.0" while everything else is "1.0". So why are you even bothering with feature comparisons with other platforms? Are you _really_ going to go with observability 1.0 (a.k.a. most of Honeycomb's competitors) when 2.0 is out? Just sign up already, preferably for the most expensive plan so you can record all those events you'll be generating.


jesus christ ok


To be fair, the first heading and section of TFA says no one knows what the terminology means anymore.

Which, to GP’s point, is kind of BS.


When in doubt, call it a data lake.


Sounds to me like that's just a database but with extra steps


Database, with schema left as an exercise for the reader


> Logs are independent. When you can not store every event, you can drop them randomly. You lose a perfect view of every logged event, but you still retain a statistical view. As we have already assumed you can not log everything, this is the best you can do anyways.

Usually, every log message occurs as part of a process (no matter how short lived), and so I'm not sure it's ever the case that a given log is truly independent from others. "Sampling log events" is a smell that indicates that you will have incomplete views on any given txn/process.

What I find works better is to always correlate logs to traces, and then just drop the majority of non-error traces (keep 100% of actionable error traces).

> When you can not store every event, you want to drop randomly at the whole trace level.

Yes, this.
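
As a sketch of that whole-trace decision (nothing vendor-specific; the choice runs once the trace has completed):

    // Sketch: decide at trace completion whether to keep all of its spans and logs.
    func keepWholeTrace(hasError bool, traceID uint64, nonErrorRate float64) bool {
        if hasError {
            return true // keep 100% of actionable error traces
        }
        // keep a deterministic fraction of non-error traces, so every event
        // in a kept trace is kept together
        return float64(traceID%1000)/1000.0 < nonErrorRate
    }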


Yes, you can probably combine traces and logs into a single category, and really what are traces but a collection of related logs with information about the relationship between them.

But metrics are an entirely different beast.

Sure, some metrics can be derived from "events". But what about things like CPU and RAM usage? Do you want to record an event for every packet so you can measure network performance? And as mentioned above, you generally don't want to sample the things you have metrics of.


If I were being pedantic, I'd say that the metric of "CPU usage" comes from the event "measuring the CPU usage" :p


> you generally don't want to sample the things you have metrics of.

Metrics are a sampling of a counter/gauge.


I wonder if anyone has utilized trace capabilities like head and tail based sampling _for logging_. You could potentially apply the same logic to determine that logs do not need to be generated because there are no transaction logs ahead of or before them.

These are useful capabilities for tracing today that could translate well to logs too. So long as everything is stitched together, which is what these tracing libraries are responsible for.


I'm not aware of tail-based (i.e., gathering a batch of logs and sampling the group coherently), but some of the sampling techniques mentioned in the post are absolutely used for structured logging systems.

At least with Honeycomb, for logs unassociated with traces, you can use dynamic sampling[0] to significantly cut down on total event volume and bake in good representativeness. And when you have logs correlated with traces, like how OTel does it, you can sample all logs correlated to a trace when that trace is also sampled[1]. We've also been doing some thinking about what batching up and coherently sampling "likely related but not correlated by ID" groups of logs would look like. It's definitely in the realm of doable and TBH I find it a little strange that people haven't sought these kinds of solutions at large yet. I think most folks still just assume that the only way you can reduce volume is by giving up some notion of representativeness when that's just not true.

[0]: https://github.com/honeycombio/dynsampler-go

[1]: https://github.com/honeycombio/refinery
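
To gesture at what the dynamic sampling idea looks like (a hand-rolled sketch, not the dynsampler-go API): track how often each key has been seen in the current window and sample common keys harder, so rare keys like error paths stay at or near 1-in-1.

    // Sketch: per-key dynamic sampling. Counts should be reset each window.
    type dynSampler struct {
        counts     map[string]int
        goalPerKey int // roughly how many events per key to keep per window
    }

    func (s *dynSampler) sampleRate(key string) int {
        s.counts[key]++
        rate := s.counts[key] / s.goalPerKey
        if rate < 1 {
            rate = 1 // rare keys: keep everything
        }
        return rate // keep 1 out of every `rate` events for this key
    }

Each kept event then records the rate it was sampled at, so queries can weight it back up.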


> Logs are independent. When you can not store every event, you can drop them randomly. You lose a perfect view of every logged event, but you still retain a statistical view. As we have already assumed you can not log everything, this is the best you can do anyways.

I think one of the Google SRE books mentioned why you don't want to simply randomly sample events. If you're doing something like storing http events and logging 1 event per 1000 and you have a small burst of errors from some service failing, you'd potentially miss all or most of them. For certain types of events you want a random sample per [endpoint/status code/whatever], potentially at different rates: 0.1% of 200 responses, but 1% of 500 errors.
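
And the detail that keeps the statistics honest when you mix rates like that: each kept event carries its sample rate, and aggregation weights by it (sketch):

    // Sketch: estimate true counts from events kept at different sample rates.
    // A kept event sampled at 1-in-1000 stands in for 1000 real events.
    func estimatedTotal(keptSampleRates []int) int {
        total := 0
        for _, rate := range keptSampleRates {
            total += rate
        }
        return total
    }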


You are correct. I was being a little lazy with my terminology.

You can choose however you want to sample, random just being the simplest and most easily automated mechanism. The key is that you degrade from a perfect record to a statistical record.

Traces should degrade from a perfect record to a statistical record of correlated events (whole traces).

Metrics are “inherently degraded” from a perfect record to an aggregate record for situations where you know you can not have a perfect record, but you want information (even if low precision) on every event.


Yeah, I used to run a service that got a pretty constant 20,000 qps and traces pretty much never showed me anything about the really weird issues. It is nice when you can delay the sampling decision to much later.

(I use zap for logs now and like the sampling algorithm; only duplicate log lines are suppressed, based on the message text but not the structured field values. So if you log "http request finished ok" and "http request errored", then you get a good sampling of both event types. Not distributed, of course, and no guarantee that the same x-request-id will be sampled in other systems.)
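
Roughly, that zap sampling is configured like this (a sketch from memory, so double-check the field names against the zap docs). The sampler keys on the log message, so each distinct message gets its own budget:

    package main

    import "go.uber.org/zap"

    func main() {
        cfg := zap.NewProductionConfig()
        cfg.Sampling = &zap.SamplingConfig{
            Initial:    100, // per second, keep the first 100 entries per distinct message
            Thereafter: 100, // after that, keep every 100th entry with that message
        }
        logger, _ := cfg.Build()
        defer logger.Sync()

        // distinct messages are sampled independently of each other
        logger.Info("http request finished ok", zap.Int("status", 200))
        logger.Warn("http request errored", zap.Int("status", 502))
    }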


> That is an orthogonal discussion and techniques that relieve that bottleneck allow more opportunities to stay on the happy path of "just events with post-processed analysis" that the author is advocating for.

Perhaps Charity's post didn't elaborate on this, but Honeycomb's storage and query engine is designed precisely to relieve the bottleneck you describe and allow for more opportunities along the "it's all just events with post-processed analysis" world described. There are other telemetry backends that allow for this kind of analysis, such as some of Datadog's products (mostly their logging product, to an extent), and several tools that use Clickhouse on the backend. But there's a lot of devils packed into those details, because a column-based database doesn't inherently mean maximal analysis flexibility, even if it is foundational to achieve these things.

More specifically, we decompose traces into events, and re-assemble them on-demand whenever you want to look at a specific trace. Metrics are similarly just events that indeed, when generated, are an aggregation of some measures exported into an event at some regular interval. On the backend you can trivially combine event data whether it is sourced from a trace, log, or metric. The other thing the post talks about that is entirely backend-specific and has no real bearing on how you treat the data is that every field in your data is effectively indexed, and the cost to process an event with 10 fields on it compared to an event with 500 fields on it is marginal. It's tremendously freeing to treat Observability as a real-time analytics problem space in this way. And when you combine this with intelligent sampling techniques (i.e., not just taking a flat 0.01% of all data or whatever), your costs are very much under control for high volumes of data.
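
The re-assembly itself is conceptually trivial - a trace is just the set of events sharing a trace ID, ordered by start time (toy sketch with a hypothetical Event type, nothing Honeycomb-specific; the hard part is doing this quickly over enormous volumes):

    package sketch

    import (
        "sort"
        "time"
    )

    // Event is a hypothetical wide event; spans, logs, and metric reports all fit.
    type Event struct {
        TraceID string
        SpanID  string
        Start   time.Time
        Fields  map[string]any
    }

    // assembleTrace pulls the events for one trace back out of the event store.
    func assembleTrace(events []Event, traceID string) []Event {
        var trace []Event
        for _, e := range events {
            if e.TraceID == traceID {
                trace = append(trace, e)
            }
        }
        sort.Slice(trace, func(i, j int) bool { return trace[i].Start.Before(trace[j].Start) })
        return trace
    }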

That said, I don't think the post did the greatest job clarifying that "Observability 2.0" as-described is much more about analysis (no pre-indexing, no ingest-only limitations on combining data, no limitations on cardinality) and less about the underlying data (events as the base decomposition of data). The focus on "everything is an event!", event this, event that, etc. is something I've seen lose developers because the abstractions you build atop them (traces, metrics, browser events, etc.) are often some of the most helpful ways to reason about the underlying data, and it's what developers are very often concerned with generating in the first place.


I am no expert but my take was that in many cases, logs, metrics, and traces are stored in individually queryable tables/stores.

Their proposal is to have a single "Event" table/store with fields for spans, traces, and metrics that can be sparsely populated similar to a dynamodb row.

Again, I might have missed the point, though.
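
For what it's worth, a row in such a store might look roughly like this (a made-up example; the field names are purely illustrative): span-ish fields, log-ish fields, and metric-ish measurements all sparsely populated on one record.

    {
      "timestamp": "2024-08-09T12:34:56Z",
      "service.name": "checkout",
      "trace.trace_id": "453ca13b",
      "trace.span_id": "a1b2c3d4",
      "duration_ms": 87,
      "http.status_code": 500,
      "error": "upstream timeout",
      "db.rows_read": 42,
      "user.id": "1234"
    }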


My take on the underlying point is that there's a good reason not to use events for every system everywhere. That reason is cost: not cost as in "we can save a bit of money", but cost as in "collecting all our events would double our infrastructure budget". Traces (correlated bundles of events) and metrics (aggregated events, and also other stuff that has nothing to do with events, but let's ignore that) are an attempt to solve the problems caused when you've exhausted the limits of shoving events into a table.

I think GPs point is that sure, you can shove events into a table, but this is just structured logging. It doesn't need to be called observability 2.0 because it's really observability 0.5: the baseline you start with before you need traces or metrics derived from events. All the observability 2.0 hype and the love hearts are a sales pitch from a CTO explaining why you shouldn't shoot yourself in the foot with an unnecessarily heavyweight observability implementation when your log volume is small.


Yep, my post was just saying this is observability 0.5, not 2.0.

However, I actually do agree that you should prefer observability 0.5 in most cases if you have an appropriate implementation.

I am familiar with time travel debugging, where you use automatic event instrumentation to stream GB/s per core to fully and perfectly reconstruct all program states. The problems and solutions I see being proposed in the distributed tracing space seem so grotesquely inefficient, yet so anemic in actionable data by comparison, that I am pretty sure most of the limits preventing “just events” are due to inadequate implementations.

However, if you really are running into limits, then it is important to have these known fallback mechanisms instead of just ignoring their existence and calling it a 2.0.


it's weird that you think it's a sales pitch, when i ended it by pleading for other people to share writeups of their similar solutions. i know they exist, and i know people who are desperate to use them.

if it came across as a sales pitch, i def missed the target somehow, apologies.


Apologies if that sounded too cynical. I didn't mean that the general argument sounded "sales-ey", just that a CTO writing about how their company's implementation is better is not taking a thousand-foot objective view of the upsides and downsides.

In this case, I do think 1.0 and 2.0 are flawed names that imply one supersedes the other, when really it's a choice between scalability to levels most companies don't need at the cost of complexity most companies don't want.


I'm gonna touch a nerve and say most orgs overengineer observability. There's the whole topology of otel tools, Prometheus tools, and a bunch of long-term storage / querying solutions. Very complicated tracing setups. All these are fine if you have a team for maintaining observability only. But your avg product development org can sacrifice most of it and make do with proper logging with a request context, plus some important service level metrics + grafana + alarms.

The problem with all these tools is that they all seem like essential features to have, but once you have the whole topology of 50 half-baked CNCF containers set up in "production", shit starts to break in very mysterious ways, and these observability products also tend to cost a lot.


The ratio of 'metadata' to data is often hundreds or thousands to one, which translates to cost, especially if you're using a licensed service. I've been at companies where the analytics and observability costs are 20x the actual cost of the application for cloud hosting. Datadog seems to have switched to revenue extraction in a way that would make Oracle proud.


Is that 20x cost... actually bad though? (I mean, I know Datadog is bad. I used to use it and I hated its cost structure.)

But maybe it's worth it. Or at least, the good ones would be worth it. I can imagine great metadata (and platforms to query and explore it) saves more engineering time than it costs in server time. So to me this ratio isn't that material, even though it looks a little weird.


The trouble is that the o11y costs developer time too. I've seen both traps:

Trap 1: "We MUST have PERFECT information about EVERY request and how it was serviced, in REALTIME!"

This is bad because it ends up being hella expensive, both in engineering time and in actual server (or vendor) bills. Yes, this is what we'd want if cost were no object, but it sometimes actually is an object, even for very important or profitable systems.

Trap 2: "We can give customer support our pager number so they can call us if somebody complains."

This is bad because you're letting your users suffer errors that you could have easily caught and fixed for relatively cheap.

There are diminishing returns with this stuff, and a lot of the calculus depends on the nature of your application, your relationship with consumers of it, your business model, and a million other factors.


Family in pharma had a good counter-question to rationally scope this:

"What are we going to do with this, if we store it?"

A surprising amount of the time, no one has a plausible answer to that.

Sure, sometimes you throw away something that would have been useful, but that posture also saves you from storing 10x things that should never have been stored, because they never would have been used.

And for the things you wish you'd stored... you can re-enable that after you start looking closely at a specific subsystem.


I agree that this is the way, but the problem with this math is that you can't, like, prove that that one thing in ten that you could have saved but didn't wouldn't have been 100x as valuable as the 9 that you didn't end up needing. So what if you saved $1000/yr in storage if you also had to throw out a million dollar feature that you didn't have the data for? There is no way to go about calculating this stuff, so ultimately you have to go by feel, and if the people writing the checks have a different feel, they will get their way.


I would be curious to know, what's the ratio of AWS bill to programmer salary in J random grocery delivery startup.


For what it's worth, I found it almost trivial to set up OpenTelemetry and point it at Honeycomb. It took me an afternoon about a month ago for a medium-sized Python web-app. I've found that I can replace a lot of tooling and manual work needed in the past. At previous startups it's usually like

1. Set up basic logging (now I just use otel events)

2. Make it structured logging (Get that for free with otel events)

3. Add request contexts that's sent along with each log (Also free with otel)

4. Manually set up tracing ids in my codebase and configure it in my tooling (all free with otel spans)

Really, I was expecting to wind up having to get really into the new observability philosophy to get value out of it, but I found myself really loving this setup with minimal work and minimal kool-aid-drinking. I'll probably do something like this over "logs, request context, metrics, and alarms" at future startups.
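
For anyone curious, the core wiring is tiny. A sketch of the equivalent in Go (the app above was Python, but the shape is the same; the exporter endpoint and auth headers normally come from the OTEL_EXPORTER_OTLP_* environment variables):

    package main

    import (
        "context"
        "log"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    func main() {
        ctx := context.Background()

        // exporter: endpoint/headers are picked up from OTEL_EXPORTER_OTLP_* env vars
        exp, err := otlptracegrpc.New(ctx)
        if err != nil {
            log.Fatal(err)
        }
        tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
        defer tp.Shutdown(ctx)
        otel.SetTracerProvider(tp)

        // one span per unit of work, with the "request context" as attributes
        tracer := otel.Tracer("example")
        _, span := tracer.Start(ctx, "handle-request")
        span.SetAttributes(attribute.String("user.id", "1234"))
        span.AddEvent("cache miss") // span events are roughly the otel events mentioned above
        span.End()
    }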


I've currently done this, and I'm seriously considering undoing it in favor of some other logging solution. My biggest reason: OpenTelemetry fundamentally doesn't handle events that aren't part of a span, and doesn't handle spans that don't close. So, if you crash, you don't get telemetry to help you debug the crash.

I wish "span start" and "span end" were just independent events, and OTel tools handled and presented unfinished spans or events that don't appear within a span.


Isn’t the problem here that your code is crashing and you’re relying on the wrong tool to help you solve that?


Logging solves this problem. If OTel and observability is attempting to position itself as a better alternative to logging, it needs to solve the problems that logging already solves. I'm not going to use completely separate tools for logging and observability.

Also, "crash" here doesn't necessarily mean "segfault" or equivalent. It can also mean "hang and not finish (and thus not end the span)", or "have a network issue that breaks the ability to submit observability data" (but after an event occurred, which could have been submitted if OTel didn't wait for spans to end first). There are any number of reasons why a span might start but not finish, most of which are bugs, and OTel and tools built upon it provide zero help when debugging those.


OTel logs are just your existing logs, though. If you have a way to say "whoopsie it hung" then this doesn't need to be tied to a trace at all. The only tying to a trace that occurs is when there's active span/trace in context, at which point the SDK or agent you use will wrap the log body in that span/trace ID. Export of logs is independent of trace export and will be in separate batches.

Edit: I see you're a major Rust user! That perhaps changes things. Most users of OTel are in Java, .NET, Node, Python, and Go. OTel is nowhere near as developed in Rust as it is for these languages. So I don't doubt you've run into issues with OTel for your purposes.


Can you give an example of an event that's not part of a span / trace?


Unhandled exceptions are a pretty normal one. You get kicked out to your app's topmost level and you lose your span. My wishlist to solve this (and I actually wrote an implementation in Python which leans heavily on reflection) is to be able to attach arbitrary data to stack frames and exceptions when they occur, merge all the data top-down, and send it up to your handler.

Signal handlers are another one and are a whole other beast simply because they're completely devoid of context.


Two good examples - thank you.

They're icky (as language design / practices) to me precisely because you end up executing context-free code. But I'd probably also just start a new trace in my signal handler / exception handler tagged with "shrug"...


Can't you close the span on an exception?


See https://news.ycombinator.com/item?id=41205665 for more details.

And even in the case of an actual crash, that doesn't necessarily mean the application is in a state to successfully submit additional OTel data.


How would you underengineer it? What would be a barebones setup for observability at the scale of one person with a few servers running at most a dozen different scripts?

I would like to make sure that a few recurrent jobs run fine, but by pushing a status instead of polling it.


I just find logs to be infuriatingly inconsistent & poorly done. What gets logged is arbitrary as hell & has such poor odds of showing what happened.

Whereas tracing instrumentation is fantastically good at showing where response time is being spent and what's getting hit. And it comes with no developer cost; the automatic instrumentation runs & does it all.

Ideally you also throw in some additional tags onto the root-entry-span or current span. That takes some effort. But then it should be consistently & widely available & visible.

Tracing is very hard to get orgs culturally on board with. And there are some operational challenges, but personally I think you are way over-selling how hard it is... There is just a colossal collection of software that serves as really good export destinations, and your team can probably already operate one or two quite well.

It does get a lot more complex if you want longer term storage. Personally I'm pro mixing systems observability with product performance tracking stuff, so yes, you do need to keep some data indefinitely. And that can be hugely problematic: either trying to juggle storage and querying for infinitely growing data, or building systems to aggregate & persist the data & derived metrics you need while getting rid of the base data.

But I just can't emphasize enough how bad most orgs are at logs and how not worth anyone's time it is to invest in something manual like that when it offers so much less than the alternative (traces).


HUGE +1 to mixing systems observability with product data. this is an oft-missed aspect of observability 2.0 that is increasingly critical. all of the interesting questions in software are some combination and conjunction of systems, app, and business data.

also big agree that most places are so, so, so messy and bad about doing logs. :( for years, i refused to even use the term "logs" because all the assumptions i wanted people to make were the opposite of the assumptions people bring to logs: unstructured, messy, spray and pray, etc.


I feel like the focus on trace/log/metrics terminology is overshadowing Charity's comments on the presentation and navigation tier, which is really where the focus should be in my experience. Her point about making the curious more effective than the tenured is quite powerful.

Observability databases are quickly adopting columnar database technologies. This is well aligned with wide, sparse columns suitable to wide, structured logs. These systems map well to the query workloads, support the high speed ingest rate, can tolerate some amount of buffering on the ingest path for efficiency, store a ton of data highly compressed, and now readily tier from local to cloud storage. Consolidating more of the fact table into this format makes a lot of sense - a lot more sense than running two or three separate database technologies specialized to metrics, logs, and traces. You can now end the cardinality miseries of legacy observability TSDBs.

But the magic sauce in observability platforms is making the rows in the fact table linkable and navigable - getting from a log message to a relevant trace; navigating from an error message in a span to a count of those errors filtered by region or deployment id... This is the complexity in building highly ergonomic observability platforms - all of the transformation, enrichment, and metadata management (and the UX to make it usable).


they've had nice things in BI land for YEARS. it's very cobbler's children have no shoes that we're still over here in software land doling out little drips of cardinality, guessing, eyeballing and jumping to conclusions. nice tools with nice data make alllll the difference.

software development should be a creative, curious, collaborative job... and it can be, with the right tools


This is quite frustrating to read. The whole set of assumed behaviours is wrong. I'm happy doing exactly what's described as 2.0 processes while using Datadog.

Charity's talk about costs is annoying too. Honeycomb is the most expensive solution I've seen so far. Until they put a "we'll match your logging+metrics contract cost for the same volume and features" guarantee on the pricing page, it's just empty talk.

Don't get me wrong, I love the Honeycomb service and what they're doing. I would love to use it. But this is just telling me "you're doing things wrong, you should do (things I'm already doing) using our system and save money (even though pricing page disagrees)".


my eyes popped at the "most expensive solution i've seen so far". compared to what?!? we don't like to promise "it's always cheaper", but .. it's always cheaper, lol.

with datadog, you have to arm wrestle them for every drop of cardinality, and on honeycomb, you can throw in as much as you want, any time you want. it smells to me like you aren't used to instrumenting your code with rich data?


i would love to hear how you are doing all the 2.0 stuff i described on datadog. you can't zoom in, zoom out, identify outliers and correlations.. the data doesn't exist! at best, you can predefine a few connective points between your logs and metrics and traces.

which is fine.. if your systems aren't that complicated and rarely fail in unpredictable ways. if that's the case -- i'm glad you've found something that works for you.


> Y’all, Datadog and Prometheus are the last, best metrics-backed tools that will ever be built. You can’t catch up to them or beat them at that; no one can. Do something different. Build for the next generation of software problems, not the last generation.

Heard a very similar thing from the Plenty Of Fish creator in 2012, and I unfortunately believed him; "the dating space was solved". Turns out it never was, and like every space, solutions will keep on changing.


IMO dating apps didn't evolve because they got better at matchmaking based on stated preferences in a profile (something Plenty of Fish nailed quite well!), they shifted the paradigm towards swiping on pictures and an evolving matchmaking algorithm based on user interactions within the app.

This is sort of what the article is getting at. For the purposes of gathering, aggregating, sending, and analyzing a bunch of metrics, you'll be hard-pressed to beat Datadog at this game. They're extremely good at this and, by virtue of having many teams with tons of smart people on them, have figured out many of the best ways to squeeze as much analysis value as you can with this kind of data. The post is arguing that better observability demands a paradigm shift away from metrics as the source of truth for things, and with that, many more possibilities open up.


I hear good (but expensive) things about Datadog, and Prometheus is _useful_, but I would never call it “the peak”.

Configuring it is awful, driving it is awful, the query language is part good, and part broken glass, relabelling is “not actively broken” but it’s far from “sensible, well designed and thoughtful”. Grafana’s whole stack is massively overwrought if you’re self hosting, and rapidly expensive for managed services. The devs often ignore and react aggressively to issues. Improvements to UX or correctness are ignored, denigrated or just outright denied. There’s some really weird design choices around distributed stuff that makes them annoying in my opinion, and there seems to be no intention of ever making that better. Prometheus and, worse, Mimir have been some of the most annoying and fragile things I’ve had the displeasure of operating. Prometheus might have been a lot better than what we had before, but I really think we can do a lot, a lot better than Prometheus, and I see “improved in every way” solutions like Victoria Metrics as direct evidence of that.


i just think that metrics are the right tool for the job when the job is summarizing vast quantities of data.

not when the job is understanding complex systems. in order to do that, you need a ton of context and cardinality, etc. i know so many observability engineering teams that spend an outright majority of their time trying to skate the line between "enough cardinality to understand what's happening" but not so much that it bankrupts them. it's the wrong tool for the job. we need something much more like BI for technical data.


I'll put InfluxDB right up there as well.


... is that a good example? I think people who use the dating apps today mostly hate them, and find that they have misaligned incentives and/or encourage poor behavior. There have been generations of other services that shift in popularity (and network effects mean that lots of people shift), but I'm not convinced that this has ever involved delivering a better solution.


Indeed. I hate to say this, but most people hate the dating apps because they're ugly. The top 10% are getting all the dates on these apps and the rest are left with endless swiping and only likes from scammers, bots, and pig butcherers. Trust me, I know because I'm ugly.


People mostly hate observability tooling too.

The point isn't whether people like or dislike it - it's the fact that a system someone in the industry tells you isn't worth even trying to compete with might be replaced a handful of years later.


The post author didn't claim that no one else would make money or attract customers in metrics-backed tools after Datadog and Prometheus -- but that they were the last and best. The "at that" in "You can’t catch up to them or beat them at that" seems pretty clearly about "best", i.e. quality of the solution.

I claim that in the intervening decade, dating apps have changed but not gotten better, which suggests to me that the Plenty of Fish person may have been right, and this example is not convincingly making the point that flockonus wants to make.


A few questions:

a) You're dismissing OTel, but if you _do_ want to do flame graphs, you need traces and spans, and standards (W3C Trace-Context, etc.) to propagate them.

b) What's the difference between an "Event" and a "Wide Log with Trace/Span attached"? Is it that you don't have to think of it only in the context of traces?

c) Periodically emitting wide events for metrics, once you had more than a few, would almost inevitably result in creating a common API for doing it, which would end up looking almost just like OTel metrics, no?

d) If you're clever, metrics histogram sketches can be combined usefully, unlike adding averages

e) Aren't you just talking about storing a hell of a lot of data? Sure, it's easy not to worry, and just throw anything into the Wide Log, as long as you don't have to care about the storage. But that's exactly what happens with every logging system I've used. Is sampling the answer? Like, you still have to send all the data, even from very high QPS systems, so you can tail-sample later after the 24 microservice graph calls all complete?

Don't get me wrong, my years-long inability to adequately and clearly settle the simple theoretical question of "What's the difference between a normal old-school log, and a log attached to a trace/span, and which should I prefer?" has me biased towards your argument :-)


i'm not dismissing otel at all! under the hood, actually, everything is an event in otel ;)


So the core idea is to move to arbitrarily wide logs?

Seems good in theory, except in practice it just defers the pain to later, like schema on read document databases.


It took me a bit to really understand the versioning angle and I think I understand.

The blog discusses the idea of evolving observability practices, suggesting a move from traditional methods (metrics, logs, traces) to a new approach where structured log events serve as a central, unified source of truth. The argument is that this shift represents a significant enough change to be considered a new version of observability, similar to how software is versioned when it undergoes major updates. This evolution would enable more precise and insightful software development and operations.

Unlike separate metrics, logs, and traces, structured log events combine these data types into a single, comprehensive source, simplifying analysis and troubleshooting.

Structured events capture more detailed context, making it easier to understand the "why" behind system behavior, not just the "what."


hey, thanks! i would love to hear your feedback on how i could have made this simpler and easier to understand, if you have any. :)


Did I miss an elephant in the room?

Wide structured logging to log EVERYTHING? Isn't that just massively huge? I don't see how that would be cheaper.

Related Steven Wright joke: “I have a map of the United States... Actual size. It says, 'Scale: 1 mile = 1 mile.' I spent last summer folding it. I hardly ever unroll it. People ask me where I live, and I say, 'E6.'”


> I don't see how that would be cheaper.

It's cheaper for several tools that bill by number of events rather than total volume of data in GB. The way this works with very high volumes of data is to employ smarter sampling to make sure you get as good a ratio of good vs. useless events as possible within a given budget.

Observability in this fashion is much more like real-time analytics (with an appropriate backend, i.e., not a timeseries database), where the cost of querying an event that has 2 fields compared to 200 fields is marginal. And so in a world like this, you're encouraged to pack more information into each log/event/span. There's a lot of details underlying that, like some backends still requiring you to define a subset you'd like to always be able to group by, whereas other backends have no such limitations, but this is largely the category of system that's being talked about.


I was excited by the title and thought that this was going to be about versioning the observability contracts of services, dashboards, alerts, etc., which are typically exceptionally brittle. Boy am I disappointed.

I get what Charity is shouting. And Honeycomb is incredible. But I think this framing overly simplifies things.

Let's step back and imagine everything emitted JSON only. No other form of telemetry is allowed. This is functionally equivalent to wide events albeit inherently flawed and problematic as I'll demonstrate.

Every time something happens somewhere you emit an Event object. You slurp these to a central place, and now you can count them, connect them as a graph, index and search, compress, transpose, etc. etc.

I agree, this works! Let's assume we build it and all the necessary query and aggregation tools, storage, dashboards, whatever. Hurray! But sooner or later you will have this problem: a developer comes to you and says "my service is falling over" and you'll look and see that for every 1 MiB of traffic it receives, it also sends roughly 1 MiB of traffic, but it produces 10 MiB of JSON Event objects. Possibly more. Look, this is a very complex service, or so they tell you.

You smile and tell them "not a problem! We'll simply pre-aggregate some of these events in the service and emit a periodic summary." Done and done.
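
And that pre-aggregation step is precisely where metrics sneak back in: keep counters in the service and flush one summary event per interval instead of one event per occurrence. Something like (a sketch):

    package sketch

    import (
        "sync"
        "time"
    )

    // Sketch: pre-aggregate per-endpoint counters and emit a single summary
    // event per interval instead of an event per request.
    type summary struct {
        mu       sync.Mutex
        requests map[string]int
        bytes    map[string]int64
    }

    func (s *summary) record(endpoint string, n int64) {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.requests[endpoint]++
        s.bytes[endpoint] += n
    }

    func (s *summary) flushEvery(d time.Duration, emit func(map[string]int, map[string]int64)) {
        for range time.Tick(d) {
            s.mu.Lock()
            emit(s.requests, s.bytes) // one wide summary event per interval
            s.requests = map[string]int{}
            s.bytes = map[string]int64{}
            s.mu.Unlock()
        }
    }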

Then you find out there's a certain request that causes problems, so you add more Events, but this also causes an unacceptable amount of Event traffic. Not to worry, we can add a special flag to only emit extra logs for certain requests, or we'll randomly add extra logging ~5% of the time. That should do it.

Great! It all works. That's the end of this story, but the result is that you've re-invented metrics and traces. Sure, logs -- or "wide events" that are for the sake of this example the same thing -- work well enough for almost everything, except of course for all the places they don't. And now where they don't, you have to reinvent all this stuff.

Metrics and traces solve these problems upfront in a way that's designed to accommodate scaling problems before you suffer an outage, without necessarily making your life significantly harder along the way. At least that's the intention, regardless of whether or not that's true in practice -- certainly not addressed by TFA.

What's more is that in practice metrics and traces today are in fact wide events. They're metrics events, or tracing events. It doesn't really matter if a metric ends up scraped by a Prometheus metrics page or emitted as a JSON log line. That's beside the point. The point is they are fit for purpose.

Observability 2.0 doesn't fix this, it just shifts the problem around. Remind me, how did we do things before Observability 1.0? Because as far as I can tell it's strikingly similar in appearance to Observability 2.0.

So forgive me if my interpretation of all of this is lipstick on the pig that is Observability 0.1

And finally, I get you can make it work. Google certainly gets that. But then they built Monarch anyways. Why? It's worth understanding if you ask me. Perhaps we should start by educating the general audience on this matter, but then I'm guessing that would perhaps not aid in the sale of a solution that eschews those very learnings.


I had to scroll almost to the bottom to find it, but this is the right answer.

We already had logs. We added "mapped diagnostic contexts" and standardized the emission format to make them into easy-to-analyze structured logs. Those were very useful.

Then we had so many structured logs that storing, filtering and aggregating them became a bigger problem than just keeping the application running. So we split the structured logs into smaller structured logs, some of them with pre-filtering and pre-aggregation to keep the firehose of structured log spam manageable.

Someone branded this new form of logging "observability" and here we are. Planning for the next big paradigm shift. Which apparently is... a stream of structured logs. Okay then.


> My other hope is that people will stop building new observability startups built on metrics.

I mean, can you blame them?

Metrics alone are valuable and useful; the prom text format and remote write protocol are widely used and straightforward to implement, and they’re a much, much, much smaller slice than “the entirety of the OpenTelemetry spec”. Have you read those documents? Massive, sprawling, terminology for days, and it’s confusingly written in places IMO. I know it’s trying to cover a lot of bases all at once (logs, traces AND metrics) and design accordingly to handle all of them properly, so it’s probably fine to deal with if you have a large enough team, but that’s not everyone.

To say nothing of the full adoption of opentelemetry data. Prometheus is far from my favourite bit of tech, but setting up scraping and a grafana dashboard is way less shenanigans than setting up open telemetry collection, and validating it’s all correct and present in my experience.

If someone prefers to tackle a slice like metrics only and do it better than the whole hog, more power to them IMO.


We came up with a buzzword to market our product. The industry made this buzzword meaningless. Now we’re coming up with a new one. We’re sure the same thing won’t happen again.


I like the wide log model. At work, we write software that customers run for themselves. When it breaks, we can't exactly ssh in and mutate stuff until it works again, so we need some sort of information that they can upload to us. Logs are the easiest way to do that, and because logs are a key part of our product (batch job runner for k8s), we already have infrastructure to store and retrieve logs. (What's built into k8s is sadly inadequate. The logs die when the pod dies.)

Anyway, from this we can get metrics and traces. For traces, we log the start and end of requests, and generate a unique ID at the start. Server logging contexts have the request's ID. Everything that happens for that request gets logged along with the request ID, so you can watch the request transit the system with "rg 453ca13b-aa96-4204-91df-316923f5f9ae" or whatever on an unpacked debug dump, which is rather efficient at moderate scale. For metrics, we just log stats when we know them; if we have some io.Writer that we're writing to, it can log "just wrote 1234 bytes", and then you can post-process that into useful statistics at whatever level of granularity you want ("how fast is the system as a whole sending data on the network?", "how fast is node X sending data on the network?", "how fast is request 453ca13b-aa96-4204-91df-316923f5f9ae sending data to the network?"). This doesn't scale quite as well, as a busy system with small writes is going to write a lot of logs. Our metrics package has per-context.Context aggregation, which cleans this up without requiring any locking across requests like Prometheus does. https://github.com/pachyderm/pachyderm/blob/master/src/inter...
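
A minimal version of that request-ID threading looks something like this (a sketch, not the actual Pachyderm code):

    package sketch

    import (
        "context"
        "log"
    )

    type requestIDKey struct{}

    // withRequestID stores the request's ID in the context at the start of a request.
    func withRequestID(ctx context.Context, id string) context.Context {
        return context.WithValue(ctx, requestIDKey{}, id)
    }

    // logf prefixes every line with the request ID, so `rg <id>` over a debug
    // dump shows the whole request.
    func logf(ctx context.Context, format string, args ...any) {
        id, _ := ctx.Value(requestIDKey{}).(string)
        log.Printf("x-request-id=%s "+format, append([]any{id}, args...)...)
    }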

Finally, when I get tired of having 43 terminal windows open with a bunch of "less" sessions over the logs, I hacked something together to do a light JSON parse on each line and send the logs to Postgres: https://github.com/pachyderm/pachyderm/blob/master/src/inter.... It is slow to load a big dump, but the queries are surprisingly fast. My favorite thing to do is the "select * from logs where json->'x-request-id' = '453ca13b-aa96-4204-91df-316923f5f9ae' order by time asc" or whatever. Then I don't have 5 different log files open to watch a single request, it's just all there in my psql window.

As many people will say, this analysis method doesn't scale in the same way as something like Jaeger (which scales by deleting 99% of your data) or Prometheus (which scales by throwing away per-request information), but it does let you drill down as deep as necessary, which is important when you have one customer that had one bad request and you absolutely positively have to fix it.

My TL;DR is that if you're a 3 person team writing some software from scratch this afternoon, "print" is a pretty good observability stack. You can add complexity later. Just capture what you need to debug today, and this will last you a very long time. (I wrote the monitoring system for Google Fiber CPE devices... they just sent us their logs every minute and we did some very simple analysis to feed an alerting system; for everything else, a quick MapReduce or dremel invocation over the raw log lines was more than adequate for anything we needed to figure out.)


I can't even run valgrind on many libraries and Python modules because they weren't designed with valgrind in mind. Let's work on observability before we version it.



