I felt like a missing conclusion was "Kafka is a critical dependency". They'd started out with the assumption that Kafka is a soft dependency and found this library bug that made it a hard dependency (which they then patched).
But isn't going metrics-blind whenever Kafka goes down bad enough that you should push more effort into keeping Kafka alive?
Kafka being a soft dependency is not an assumption, it's a design goal. So if anything, the conclusion should be to eliminate any hard dependencies on Kafka.
However, when it comes to metrics in particular, I'd say Kafka was still a soft dependency. The reason we lost our metrics during the incident was a scheduler lock-up blocking the collection of VM-level metrics. It's just a coincidence that the scheduler lock-up was caused by brod in this case. Under normal circumstances the metrics just flow directly to Graphite, never interacting with Kafka.
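To make that concrete, here's a rough sketch of what that VM-level path looks like: plain Erlang pushing a handful of VM metrics over Graphite's plaintext protocol on port 2003. The module name, hostname and metric names are made up for illustration, this is not our actual collector. The point is just that nothing on this path touches Kafka; all it needs is scheduler time, which is exactly what the lock-up took away.

```erlang
%% Hypothetical sketch of a VM-level metrics pusher (not the real Kred code).
%% Sends "metric.path value timestamp\n" lines to Graphite's plaintext port.
-module(vm_metrics_sketch).
-export([start/0]).

-define(GRAPHITE_HOST, "graphite.example.com").  %% invented hostname
-define(GRAPHITE_PORT, 2003).
-define(INTERVAL_MS, 10000).

start() ->
    spawn(fun loop/0).

loop() ->
    {ok, Sock} = gen_tcp:connect(?GRAPHITE_HOST, ?GRAPHITE_PORT,
                                 [binary, {packet, 0}]),
    Ts = os:system_time(second),
    %% A few VM-level metrics read straight from the runtime system.
    Metrics = [{"vm.memory.total",  erlang:memory(total)},
               {"vm.run_queue",     erlang:statistics(run_queue)},
               {"vm.process_count", erlang:system_info(process_count)}],
    Lines = [io_lib:format("~s ~p ~p~n", [Name, Value, Ts])
             || {Name, Value} <- Metrics],
    ok = gen_tcp:send(Sock, Lines),
    ok = gen_tcp:close(Sock),
    timer:sleep(?INTERVAL_MS),
    loop().
```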
When it comes to the complete monitoring solution around Kred, there was one piece of it that depended on Kafka: System monitor (https://github.com/klarna-incubator/system_monitor). This tool exports data that doesn't fit well into Graphite or Splunk, so we store it in Postgres instead. Due to pure laziness, though, at the time of the incident it wasn't writing directly to Postgres but was just pushing the data to Kafka. (The reason is that we already had a service available to push data from Kafka to Postgres.) After the incident we eliminated Kafka from that path. I didn't mention this work in the post because it was only marginally related.
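For completeness, the "write directly to Postgres" path is conceptually just something like the following. This is a minimal sketch assuming the epgsql client; the table layout, connection details and data shape are invented for illustration and are not system_monitor's actual schema.

```erlang
%% Hypothetical sketch of writing monitoring snapshots straight to Postgres,
%% bypassing Kafka entirely. Uses the epgsql client; schema is made up.
-module(direct_pg_sketch).
-export([store_snapshot/1]).

%% Snapshot is expected as a list of {Node, Key, Value} binaries -- an
%% invented shape, not what system_monitor actually produces.
store_snapshot(Snapshot) ->
    {ok, C} = epgsql:connect("pg.example.com", "sysmon", "secret",
                             [{database, "sysmon_db"}]),
    lists:foreach(
      fun({Node, Key, Value}) ->
              {ok, _} = epgsql:equery(
                          C,
                          "insert into vm_snapshots (node, key, value, ts)"
                          " values ($1, $2, $3, now())",
                          [Node, Key, Value])
      end, Snapshot),
    ok = epgsql:close(C).
```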