At Square, I architected and built our entire log-ingestion pipeline for intrusion detection on top of it.
I built a small Ruby wrapper around the C API. Then I used that to slurp all the logs, periodically writing the current log ID to disk. Those logs went out onto a pubsub queue, where they were ingested into both BigQuery for long-term storage / querying, and into our alerting pipeline for real-time detection.
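For the curious, here's roughly what that looks like: a minimal sketch using the ffi gem against libsystemd's sd-journal API. The module name, cursor path, and publish helper are invented for illustration; the sd_journal_* function names and the LOCAL_ONLY flag are the real C API.

```ruby
require "ffi"

module SdJournal
  extend FFI::Library
  ffi_lib "systemd"

  SD_JOURNAL_LOCAL_ONLY = 1

  attach_function :sd_journal_open,           [:pointer, :int], :int
  attach_function :sd_journal_next,           [:pointer], :int
  attach_function :sd_journal_restart_data,   [:pointer], :void
  attach_function :sd_journal_enumerate_data, [:pointer, :pointer, :pointer], :int
  attach_function :sd_journal_get_cursor,     [:pointer, :pointer], :int
  attach_function :sd_journal_seek_cursor,    [:pointer, :string], :int
end

CURSOR_FILE = "/var/lib/log-shipper/cursor" # hypothetical location

# Open the local journal and resume from the last saved cursor, if we have one.
handle = FFI::MemoryPointer.new(:pointer)
SdJournal.sd_journal_open(handle, SdJournal::SD_JOURNAL_LOCAL_ONLY)
journal = handle.read_pointer
SdJournal.sd_journal_seek_cursor(journal, File.read(CURSOR_FILE).strip) if File.exist?(CURSOR_FILE)

while SdJournal.sd_journal_next(journal) > 0
  # Each entry is a bag of FIELD=value pairs; enumerate them all.
  entry = {}
  SdJournal.sd_journal_restart_data(journal)
  data = FFI::MemoryPointer.new(:pointer)
  len  = FFI::MemoryPointer.new(:size_t)
  while SdJournal.sd_journal_enumerate_data(journal, data, len) > 0
    field, value = data.read_pointer.read_bytes(len.read_ulong).split("=", 2)
    entry[field] = value
  end

  publish(entry) # stand-in for "put it on the pubsub queue"

  # Persist the cursor so a restart picks up where we left off
  # (the real daemon only did this periodically, not per entry).
  cursor = FFI::MemoryPointer.new(:pointer)
  if SdJournal.sd_journal_get_cursor(journal, cursor) >= 0
    File.write(CURSOR_FILE, cursor.read_pointer.read_string)
  end
end
```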
Thanks to journald, all the logs were structured, and we were able to keep the trusted metadata like timestamp, PID, UID, the binary responsible, etc. (basically anything with an underscore prefix) separate from the log message all the way to BigQuery. No parsing, and you get isolation for free: the trusted metadata never gets intermingled with user-controlled attributes.
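That split is trivial to express. Something like this (illustrative only; the field names are standard journald ones):

```ruby
# Separate journald's trusted, underscore-prefixed fields from everything the
# logging process supplied itself, so the two never mix downstream.
def split_entry(entry)
  trusted, supplied = entry.partition { |field, _| field.start_with?("_") }.map(&:to_h)
  {
    message:  supplied.delete("MESSAGE"),
    metadata: trusted,  # e.g. _PID, _UID, _EXE, _SYSTEMD_UNIT
    extra:    supplied  # whatever other fields the sender chose to attach
  }
end
```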
Compared to trying to durably tail a bunch of syslog files, or having a few SPOF syslog servers that everyone forwarded to, or implementing syslog plugins, this was basically the Promised Land for us. I think we went from idea to execution in maybe a month or two (I say “we” but really this was “me”) and rolled it out as a local daemon to the entire fleet of thousands of hosts. It has received, I think, one patch release in its six-plus-year lifetime, and it still sits there quietly collecting everything to be shipped off-host.
The only issue we ever really ran into that I never figured out: a handful of times per year (across a fleet of thousands), the journald database would get corrupted and we couldn’t resume collecting from the saved message ID. But we were also on an absolutely ancient version of RHEL, and I suspect anything newer has probably fixed that bug. We just caught the error and restarted from an earlier timestamp. We built the whole thing around at-least-once delivery, so duplicates entering the pipeline didn’t really matter.
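The fallback was barely any code. Sketching it against the hypothetical wrapper above (sd_journal_seek_realtime_usec is the real libsystemd call; the five-minute rewind is arbitrary):

```ruby
module SdJournal
  attach_function :sd_journal_seek_realtime_usec, [:pointer, :uint64], :int
end

# If seeking to the saved cursor fails, rewind to a timestamp a bit before the
# last message we know we shipped and carry on. Anything re-read is a duplicate,
# which the at-least-once pipeline happily tolerates.
def resume(journal, cursor, last_shipped_at)
  return if SdJournal.sd_journal_seek_cursor(journal, cursor) >= 0
  usec = ((last_shipped_at - 300).to_f * 1_000_000).to_i # five minutes earlier, in microseconds
  SdJournal.sd_journal_seek_realtime_usec(journal, usec)
end
```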
Damn, honestly at this point I’m wishing I’d pushed to open source it.
Ironically, I did also write a syslog server that forwarded into this pipeline, since we had network hardware we couldn’t install custom services onto but could point at syslog. I wrote this in Ruby too, using the then-new Fibers (“real” concurrency) feature. The main thread fired up four background threads for listening (UDP, UDP/DTLS, TCP, TCP/TLS), and each of those would hand off clients to a dedicated per-connection worker thread for message parsing. Once parsed, messages went onto one more background thread that collected them and sent them to PubSub. Even in Ruby it could handle gazillions of messages without breaking a sweat. Fun times.
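Stripped way down, the shape was something like this (plain stdlib threads and a single TCP listener here just to show the layout; parse_syslog and publish_batch stand in for the real parsing and publishing code):

```ruby
require "socket"

outbox = Queue.new

# One background thread per listener; just plain TCP on 514 shown here.
listener = Thread.new do
  server = TCPServer.new("0.0.0.0", 514)
  loop do
    client = server.accept
    # Each connection gets its own worker thread for parsing.
    Thread.new(client) do |conn|
      while (line = conn.gets)
        outbox << parse_syslog(line)
      end
      conn.close
    end
  end
end

# One more background thread batches parsed messages and ships them to PubSub.
sender = Thread.new do
  loop do
    batch = [outbox.pop]
    batch << outbox.pop while !outbox.empty? && batch.size < 500
    publish_batch(batch)
  end
end

[listener, sender].each(&:join)
```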
Since I’m rambling, we also made cool use of zstd’s pre-trained dictionary feature. Log messages are small and very uniform so they were perfect for it. By pre-sharing a dictionary optimized for our specific data with both ends of the pubsub queue, we got something like 90%–95% compression rates. Given the many terabytes of logs we were schlepping from our datacenters to GCP, this was a pretty nice bit of savings.
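The dictionary workflow itself is simple. Roughly, sketched against the zstd CLI (the --train and -D flags are real; the paths, helper names, and batch handling here are made up):

```ruby
SAMPLES = Dir["log_samples/*"] # a corpus of representative small log messages
DICT    = "logs.dict"          # distributed ahead of time to both ends of the queue

# Train the dictionary once, offline, from many small samples.
system("zstd", "--train", *SAMPLES, "-o", DICT) or raise "dictionary training failed"

# Producer side: compress a batch with the pre-shared dictionary.
def compress_batch(path, dict: DICT)
  system("zstd", "-q", "-f", "-D", dict, path, "-o", "#{path}.zst") or raise "compress failed"
  File.binread("#{path}.zst")
end

# Consumer side: decompress with the same dictionary.
def decompress_batch(path, dict: DICT)
  out = path.sub(/\.zst\z/, "")
  system("zstd", "-q", "-f", "-d", "-D", dict, path, "-o", out) or raise "decompress failed"
  File.binread(out)
end
```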