Wide events seem like they would require more memory and CPU to combine, and more bandwidth due to their size.
I've implemented services with loggers that gather data and statistics and write out just one combined log line at the end. It's certainly more economical in terms of dev time; I'm not sure how "one large" compares to "many small" resource-wise in practice.
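Rough sketch of that pattern, in case it helps (names are made up, not from any particular library): accumulate fields on a per-request object and emit a single wide event at the end instead of many small log lines.

    import json, time

    class WideEvent:
        """Accumulates fields for one request and emits a single log line."""
        def __init__(self, **initial):
            self._fields = dict(initial)
            self._start = time.monotonic()

        def add(self, **fields):
            self._fields.update(fields)

        def emit(self):
            # One line, one event per request
            self._fields["duration_ms"] = round((time.monotonic() - self._start) * 1000, 2)
            print(json.dumps(self._fields))

    # Usage inside a request handler:
    event = WideEvent(route="/checkout", user_id=42)
    event.add(db_queries=3, cache_hit=True)
    event.add(status=200)
    event.emit()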
So they didn't want to pay for AWS CloudWatch [1], decided to roll their own in-house network flow log collection, and had to re-implement attribution?
I wonder how many hundreds of thousands of dollars network flow logs were costing them; obviously at some point it becomes cheaper to re-implement monitoring in-house.
Because the vanilla flow logs you get from VPC/TGW are nearly useless outside the most basic use cases. All you get is how many bytes and which TCP flags were seen per connection per 10-minute window. Then you have to attribute IP addresses to actual resources yourself, which isn't simple when you have containers or k8s service networking.
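For reference, here's roughly what a default (version 2) flow log record carries; extras like tcp-flags only show up if you define a custom format. Quick illustrative parse, not production code, and the sample record is made up:

    # Default (version 2) VPC flow log fields, space-separated
    FIELDS = [
        "version", "account_id", "interface_id", "srcaddr", "dstaddr",
        "srcport", "dstport", "protocol", "packets", "bytes",
        "start", "end", "action", "log_status",
    ]

    def parse_flow_log(line: str) -> dict:
        return dict(zip(FIELDS, line.split()))

    record = parse_flow_log(
        "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.9 49152 443 6 25 6000 "
        "1620000000 1620000600 ACCEPT OK"
    )
    # Nothing here says which pod, container, or process owned 10.0.1.5;
    # that attribution has to be stitched on from other data sources.
    print(record["srcaddr"], "->", record["dstaddr"], record["bytes"], "bytes")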
Doing it with eBPF on end hosts, you get the same data, but you can attribute it directly because you know which container it originates from, you can snoop DNS, and you can collect extremely useful metrics like per-TCP-connection ACK delay and retransmissions.
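To give a flavour of the retransmission part, here's a toy BCC script (not Netflix's actual agent, just the general shape of the idea): hook tcp_retransmit_skb and count retransmits per connection. Assumes the bcc Python bindings are installed and you run as root; a real agent would also map each socket back to its owning container/cgroup.

    from bcc import BPF
    import socket, struct
    from time import sleep

    prog = r"""
    #include <uapi/linux/ptrace.h>
    #include <net/sock.h>

    struct conn_t {
        u32 saddr;
        u32 daddr;
        u16 dport;
    };
    BPF_HASH(retransmits, struct conn_t, u64);

    int trace_retransmit(struct pt_regs *ctx, struct sock *sk) {
        struct conn_t c = {};
        c.saddr = sk->__sk_common.skc_rcv_saddr;
        c.daddr = sk->__sk_common.skc_daddr;
        c.dport = sk->__sk_common.skc_dport;   // network byte order
        retransmits.increment(c);
        return 0;
    }
    """

    b = BPF(text=prog)
    b.attach_kprobe(event="tcp_retransmit_skb", fn_name="trace_retransmit")

    print("Counting TCP retransmits for 10s...")
    sleep(10)

    for k, v in b["retransmits"].items():
        src = socket.inet_ntoa(struct.pack("I", k.saddr))
        dst = socket.inet_ntoa(struct.pack("I", k.daddr))
        print(f"{src} -> {dst}:{socket.ntohs(k.dport)} retransmits={v.value}")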
AWS recently released CloudWatch Network Monitoring, which also uses an eBPF agent, but it's almost a children's toy compared to something like Datadog NPM. I was working on a solution similar to Netflix's when NPM was released; there was no point continuing after that.
I recall a time when we managed network flows by manually parsing /proc/net/tcp and correlating PIDs with netstat output. eBPF? Sounds like a fancy way to avoid good old-fashioned elbow grease.
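For old times' sake, roughly what the elbow-grease version looked like (Linux-only, illustrative sketch): read /proc/net/tcp and map socket inodes back to PIDs via /proc/<pid>/fd.

    import os, re, socket, struct

    def hex_to_addr(hex_addr: str) -> str:
        ip_hex, port_hex = hex_addr.split(":")
        # /proc/net/tcp stores IPv4 addresses as little-endian hex
        ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
        return f"{ip}:{int(port_hex, 16)}"

    def socket_inode_to_pid() -> dict:
        mapping = {}
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                for fd in os.listdir(f"/proc/{pid}/fd"):
                    m = re.match(r"socket:\[(\d+)\]", os.readlink(f"/proc/{pid}/fd/{fd}"))
                    if m:
                        mapping[m.group(1)] = int(pid)
            except OSError:
                continue  # process exited or permission denied
        return mapping

    inode_to_pid = socket_inode_to_pid()
    with open("/proc/net/tcp") as f:
        next(f)  # skip header row
        for line in f:
            parts = line.split()
            local, remote, inode = parts[1], parts[2], parts[9]
            pid = inode_to_pid.get(inode, "?")
            print(f"{hex_to_addr(local)} -> {hex_to_addr(remote)} pid={pid}")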