1. Juice up your Traces with every attribute possible
2. Use a telemetry backend that relies on cheap object storage so that your costs don't explode.
3. ...profit?
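To make step 1 concrete, here's a minimal sketch using the OpenTelemetry Python API. The attribute names, the `request` shape, and the release constant are illustrative assumptions, not any particular service's conventions:

```python
import platform
from opentelemetry import trace

RELEASE_VERSION = "2024.06.1"  # however your build pipeline stamps releases

tracer = trace.get_tracer("example-service")

def do_work(request):
    # Stand-in for the real business logic.
    return {"status": 200}

def handle_request(request):
    # One wide span per request: record every attribute you could
    # conceivably want to filter or group by later, not just the
    # handful today's dashboards need.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("http.method", request["method"])
        span.set_attribute("http.route", request["path"])
        span.set_attribute("app.release_version", RELEASE_VERSION)
        span.set_attribute("host.kernel_version", platform.release())
        span.set_attribute("user.plan", request["plan"])
        # ...plus cache hit/miss, shard, feature flags, retry count, etc.
        response = do_work(request)
        span.set_attribute("http.status_code", response["status"])
        return response
```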
Ok, but now we are exporting and storing everything about every request just so we can derive previously cheap metrics like server CPU consumption? I guess for most applications the overhead of buffering, formatting, and sending all of this telemetry data just doesn't matter to folks?
Yes, it is absurdly expensive no matter what the marketing says. It’s only “cheap” if you’re setting VC cash on fire.
The benefit is that you can retroactively extract reports filtered with very complex predicates.
Sure, aggregated metrics are cheap and efficient, but trivial metrics like CPU usage just tell you that there is a problem, not what the problem is. If you need to “deep dive”, you can’t, not without a time machine to go back and configure a filtered metric looking for the specific info you need.
Most sysadmins at this point would just configure a new filtered metric and start collecting data… for a month. While the system is broken. Wrong needle? Start looking through the haystack again with another new custom metric for another month.
As a random example, many systems will track 5xx errors per minute. Great, but are those timeouts or instant failures? I want to group those 5xx errors by duration bucket! Are they correlated with app release version? With free memory on the server? With instance? Kernel version? Etc…
Wide events let you do all those and more, trivially and quickly: seconds instead of months.
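For instance, if each request lands as one wide event in Parquet on object storage, the “timeouts vs. instant failures” question is a single ad-hoc query. A minimal sketch with DuckDB (the schema, column names, thresholds, and path are illustrative assumptions, not anyone's actual setup):

```python
import duckdb

# Assumed schema: one wide event (row) per request, with columns such as
# status_code, duration_ms, release_version, kernel_version.
# In practice the glob would point at object storage (s3://...) via the
# httpfs extension; a local path keeps the sketch self-contained.
con = duckdb.connect()
report = con.sql("""
    SELECT
        release_version,
        kernel_version,
        CASE
            WHEN duration_ms >= 30000 THEN 'timeout (>=30s)'
            WHEN duration_ms <  100   THEN 'instant failure (<100ms)'
            ELSE                           'slow failure'
        END AS failure_shape,
        COUNT(*) AS errors
    FROM read_parquet('events/*.parquet')
    WHERE status_code BETWEEN 500 AND 599
    GROUP BY ALL
    ORDER BY errors DESC
""")
print(report)
```

Wrong needle? Change the `CASE` expression or the `GROUP BY` and re-run in seconds; the raw events are already there.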
> Most sysadmins at this point would just configure a new filtered metric and start collecting data… for a month. While the system is broken. Wrong needle? Start looking through the haystack again with another new custom metric for another month.
In this example I feel like it treats metrics as the only telemetry signal operators have access to. Once the metrics indicate an issue, we can pull existing logs, traces, and profiles to dig into it, and eventually capture dumps.
I'm totally on board with the idea of rich trace metadata, but it seems more evolutionary than revolutionary.