I work at Netflix and use Atlas every day. It's our go-to performance monitoring tool, and has solved countless performance and reliability issues. It's exciting to have it open source!
There are several different data sources for Atlas:
* There is a poller cluster that gathers SNMP and HTTP healthcheck metrics and forwards them to the Atlas backend.
* There are on-instance log parsers written in Perl and Python that count events in Apache HTTPd and Tomcat logs and send data to the Atlas backend.
* The Servo library [0] is used to instrument Java code with counters, timers and gauges. There is a separate client implementation that handles forwarding metrics to the Atlas backend. The client also polls and reports JMX metrics from the JVM that it runs inside. Spectator [1] is a new library that provides cleaner abstractions of Servo concepts (see the sketch after this list).
* The Prana sidecar [2] was extended to provide REST endpoints for Servo and the client, so that metrics can be delivered from non-Java code.
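To make the Servo/Spectator bullet concrete, here is a rough sketch of what code-level instrumentation looks like with Spectator (the metric names and tag values are made up, and the API details are from memory, so treat this as an approximation and check the Spectator docs):

    import com.netflix.spectator.api.Counter;
    import com.netflix.spectator.api.DefaultRegistry;
    import com.netflix.spectator.api.Registry;
    import com.netflix.spectator.api.Timer;
    import java.util.concurrent.TimeUnit;

    public class RequestHandler {
      // In a real service the registry would be injected; DefaultRegistry is just for the sketch.
      private final Registry registry = new DefaultRegistry();

      // The extra name/value pairs become tags, i.e. queryable dimensions in Atlas.
      private final Counter requests =
          registry.counter("server.requestCount", "country", "US", "device", "xbox");
      private final Timer latency = registry.timer("server.requestLatency");

      void handle() {
        long start = System.nanoTime();
        requests.increment();  // count this request
        // ... do the actual work ...
        latency.record(System.nanoTime() - start, TimeUnit.NANOSECONDS);  // record how long it took
      }
    }

The client library then takes care of batching these values up and shipping them to the Atlas backend once a minute.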
What kind of ratio of metadata traffic (telemetry) to total traffic did you see? How does this divide between "system level" and "application level"?
My client is looking at these telemetry problems now; is there any possibility of commercial high-level consultancy coming out of Netflix or colleagues? Ping me via the details in my profile if you can help.
Telemetry traffic is a small fraction of the total traffic running through a region, partially due to the use of the Smile data format (binary JSON) for delivering metrics from the client to the Atlas backend.
When you give developers tools for creating and aggregating highly dimensional metrics, they tend to create lots of metrics so that they can answer interesting business questions about the use of their applications. We have some developers who have written code that produces up to 150,000 metrics per instance, and the vast majority of these metrics are application-level. We typically see that only 3-5% of the metrics delivered from an instance are system-level performance metrics.
And this, of course, doesn't account for cases where a minor developer error results in code that, say, creates a new metric for every source IP address from which we see a request. Dynamic metric names FTW.
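For the curious, that failure mode is as simple as putting an unbounded value into a tag. A hypothetical sketch in the same Spectator style as above (geoLookup is a made-up helper; this is my own example, not actual Netflix code):

    // BAD: every distinct client IP mints a brand-new time series.
    registry.counter("server.requestCount", "sourceIp", request.getRemoteAddr()).increment();

    // BETTER: a bounded tag set keeps the cardinality sane.
    registry.counter("server.requestCount", "country", geoLookup(request)).increment();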
Suro collects arbitrary data, and Atlas is for numbers (time series numbers to be more specific). So metrics generally go through Atlas and logs go through Suro.
Brendangregg, do you know if anyone has tried to use this platform to predict failures or DDoS attacks ahead of time? Or is that not feasible with this API? Thanks.
We've used it to predict failures as well as a data source to predict scale-up and scale-down events. It hasn't been used for DDoS prediction, but I see no reason why it couldn't be.
Not Brendan, but ... we do a bunch of outlier and anomaly detection using Atlas to notice slow degradation in cluster performance based on outlier nodes and auto-execute them.
OT, but I'm using your app on a stock Nexus 7 (at 4.4.4), and I find a couple of controls very difficult to trigger, while others respond just fine at the first touch.
The "back up 30 seconds" button can be very fussy and difficult to trigger. Sometimes it takes many touches before it will trigger.
Also, when an episode's end titles are displaying within a series, it can be quite difficult to get a touch to register so that the titles play out and the app does not skip forward to the next episode before they do. (And even then, another countdown starts that will auto-launch the next episode. I wish, as I expressed to your call-in support, that you would add a user account setting to disable this.)
Sorry for the OT, but I would think you'd want the app more responsive in this regard. As I said, the other controls/buttons do not exhibit this difficulty; they are immediately responsive.
One of the most useful applications of Atlas (while working on Netflix Reliability) was creating alert conditions that would scale dynamically with changes in volume via the use of double-exponential smoothing (DES). It was very easy to create alerts that compared and combined multiple signals using the features in Atlas Stack Language. I am so excited to see it finally open-sourced!
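For anyone curious what that looks like, a DES-based alert expression in the stack language is roughly along these lines (operator names and parameters are from memory, so double-check them against the stack language reference; requestsPerSecond is a stand-in metric name):

    name,requestsPerSecond,:eq,:sum,:dup,10,0.1,0.02,:des,0.85,:mul,:lt

The :dup keeps the raw signal on the stack, :des smooths a copy of it (training window, alpha, beta), and the final :lt fires whenever the actual value drops below 85% of the DES prediction -- a threshold that scales with traffic instead of being a fixed line.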
Sounds like a great platform. I wonder how many instances of everything they need for peak time. I have a system that is rated for 5M time series data points per second and it takes about 150 physical servers, so I am curious what a Netflix-sized deployment (20M time series data points per second) would look like.
As the announcement notes, we have multiple tiers holding different data horizons. The most active, and large, tier is the one holding the last six hours of data. That tier, being the most critical one (we try to train our engineers to only need 6 hours of data to understand how their system is working in the worst case), is mirrored. Right now, each of those mirrors is about[0] 756 r3.2xl instances.
[0] For a very exact definition of "about," though that exact number could change in the next 5 minutes, or 5 hours, or 5 days, or not until we see another metrics increase.
I just learned about Akka recently (by way of Akka.net), and I'm encouraged by these frameworks that look like legitimate alternatives to Erlang. Erlang has always seemed like an extreme measure for implementing a distributed concurrent system, so it's good to know there are Java/Scala/C#/F# options.
The issue with naming conflicts arises from the fact that these things are generally created and named internally long before they're released as open source.
Everyone wants to name their product after a figure in Greco-Roman mythology, but there are only so many, and the cool ones are already taken. Same thing happened in investment funds, then they started naming things after trees, rocks, and fortifications.
Very interesting... I'm guessing you can do less than a minute of granularity? That's a lifetime to me.
I'm working on my own little system based on Riak for storage and their new search functionality to index the streams (what I call a unique metric). I have a goal of Graphite API compatibility though, so it can be a drop-in replacement for me. Worked out a schema for mapping Graphite metric names to the stream dimensions, wrote a Graphite function parser/lexer, etc. Atlas should definitely be worth a look for some fresh ideas :)
We CAN do <1m, but very rarely do (at least in terms of persisting and showing it). We have a feature called 'Critical Metrics', a separate publishing pipeline into Atlas that is shorter, simpler, hardier, and supports 10s granularity, though only for a tiny minority of metrics. Our current limit is on the order of 400K metrics per cluster, IIRC, which means that if you've got, say, 1,000 nodes, we're going to limit you to 400 metrics per node flowing through the Critical Metrics pipeline.
(We haven't open-sourced much of the pipeline ... yet)
Sounds similar to FnordMetric (http://fnordmetric.io/chartsql) which also supports dimensional timeseries data. Major differences between Atlas and FnordMetric on first sight:
- SQL based query and charting frontend (ChartSQL), so you don't have to learn yet another DSL
- ships with a wire-compatible StatsD API
- supports pluggable backends
- renders charts to SVG
- will probably not scale to petabytes of data out of the box
- single C++ binary; deploy it in 5 minutes using Docker
Interesting - I had not heard of FnordMetric before. There are not many monitoring systems that implement dimensionality for metrics tagging. It looks like FnordMetric goes about it slightly differently, but it seems to achieve the same end goal of arbitrary grouping by like characteristics.
In the Atlas eco-system, Servo and the Atlas client eliminate the need for StatsD. The combination of these products allows for code-level instrumentation and delivery of metrics. The Prana sidecar is available for non-Java applications to deliver metrics to the Atlas backend as formatted JSON objects.
Atlas supports multiple backend storage systems, although this is not easily pluggable just yet. Earlier iterations of Atlas had support for MongoDB and Cassandra as storage backends, but there were issues obtaining enough IOPS to satisfy the read and write performance requirements at scale, so storage was switched to in-memory.
Atlas can return data in JSON format (?format=json) which is suitable for JS or SVG based rendering systems such as Highcharts. There is also a streaming API that trades response time for increased data payloads.
It takes some time to learn the Atlas Stack Language, but the fact that it is URL-based means that the browser is the interactive query editor. Using tools like Chrome's Edit URL to help handle long strings, you can make small changes to queries iteratively and see the results in less than 2 seconds in many cases. Average PNG render time is typically less than 10 seconds; slow rendering can take around 30 seconds.
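To give a flavor of that workflow, a graph request is just a URL along these lines (the metric name is a stand-in and the parameters are from memory, so check the graph API docs for the real list):

    /api/v1/graph?q=name,requestsPerSecond,:eq,:sum,(,nf.cluster,),:by&s=e-3h&format=json

Tweak the query in the address bar, hit enter, and you get fresh results back (a rendered PNG by default, or JSON with format=json as mentioned above); that tight loop is what makes the stack language bearable to learn.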
20 million different time series. I mean, that is a lot. If you have, say, 20,000 servers running, that is still 1,000 different time series per server. Memory, CPU, logins, logouts, customer selections... I struggle to get to those numbers.
It's actually closer to 1.2 billion different time series -- we report most metrics at one-minute granularity, but they're not all reporting at the same second (thank God), so on average we're seeing around 20M data points per second.
But this of course just makes the question more reasonable -- 1.2B different time series? Really?
Yup. We get a bunch of system telemetry, and a bunch of default application telemetry, without any traffic even hitting the box, but that's a relatively small percentage of the overall volume. Developers LOVE metrics.
So imagine you want to measure requests to our API, and these are some tags you want to keep track of:
request type: 5 different types
result: 2 possible values (success, failure)
originating country: 50 countries
originating device type: 200 devices
And let's say you've got 1,000 instances reporting this data.
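Multiply those out and that single measurement is 5 x 2 x 50 x 200 = 100,000 distinct tag combinations. With 1,000 instances each reporting their own copy, that's up to 100 million time series for one request counter in the worst case where every combination shows up on every instance. A handful of measurements like that and a billion-plus time series stops sounding crazy.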
1,000 metrics per server is quite reasonable. I work for a performance management company and we handle thousands of time series metrics per monitored server at one-second resolution.
Above, copperlight quotes about 3-5% of metrics being system-level. A pretty small number would be process-level, I'm guessing, with the vast majority being app-level.
At Netflix-sized, the answer to pretty much any question is "a lot." :)
"Anomaly detection" is one of those vague terms that can mean anything from "it's gone above the pre-set limit, and that's anomalous" to "the system has studied the signal to learn what the accepted limits should be, and it's exceeded these limits." We mostly mean the latter for anomaly detection.
The Insight Engineering team at Netflix is largely composed of four kinds of engineers: platform/back-end engineers, UI engineers, Site Reliability Engineers, and Real-Time Analytics (RTA) engineers; it's the latter group who are looking into ways to quickly (and efficiently) detect anomalies in a truly absurd amount of data.
The RTA group is about six months old; I have high hopes that we'll see some public presentations from them soon that will be helpful to people outside Netflix.
I summarized it in a talk recently at Surge 2014, where I showed its role for a performance investigation, and how it is central to everything:
http://youtu.be/H-E0MQTID0g?t=22m http://www.slideshare.net/brendangregg/netflix-from-clouds-t...
There are more features to Atlas I didn't mention; check out the Overview on GitHub linked in the blog post.