I work at Netflix and use Atlas every day. It's our go-to performance monitoring tool, and has solved countless performance and reliability issues. It's exciting to have it open source!
There are several different data sources for Atlas:
* There is a poller cluster that gathers SNMP and HTTP healthcheck metrics and forwards them to the Atlas backend.
* There are on-instance log parsers written in Perl and Python that count events in Apache HTTPd and Tomcat logs and send data to the Atlas backend.
* The Servo library [0] is used to instrument Java code with counters, timers and gauges. There is a separate client implementation that handles forwarding metrics to the Atlas backend. The client also polls and reports JMX metrics from the JVM that it runs inside. Spectator [1] is a new library that provides cleaner abstractions of Servo concepts (see the sketch after this list).
* The Prana sidecar [2] was extended to provide REST endpoints for Servo and the client, so that metrics can be delivered from non-Java code.
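To make the Servo/Spectator bullet concrete, here is a rough sketch of what code-level instrumentation looks like with Spectator (the metric names and tag values are made up, and the API details are from memory, so treat this as an approximation and check the Spectator docs):

    import com.netflix.spectator.api.Counter;
    import com.netflix.spectator.api.DefaultRegistry;
    import com.netflix.spectator.api.Registry;
    import com.netflix.spectator.api.Timer;
    import java.util.concurrent.TimeUnit;

    public class RequestHandler {
      // In a real service the registry would be injected; DefaultRegistry is just for the sketch.
      private final Registry registry = new DefaultRegistry();

      // The extra name/value pairs become tags, i.e. queryable dimensions in Atlas.
      private final Counter requests =
          registry.counter("server.requestCount", "country", "US", "device", "xbox");
      private final Timer latency = registry.timer("server.requestLatency");

      void handle() {
        long start = System.nanoTime();
        requests.increment();  // count this request
        // ... do the actual work ...
        latency.record(System.nanoTime() - start, TimeUnit.NANOSECONDS);  // record how long it took
      }
    }

The client library then takes care of batching these values up and shipping them to the Atlas backend once a minute.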
What kind of ratio of metadata traffic (telemetry) to total traffic did you see? How does this divide between "system level" and "application level"?
My client is looking at these telemetry problems now; is there any possibility of commercial high-level consultancy coming out of Netflix or colleagues? Ping me via the details in my profile if you can help.
Telemetry traffic is a small fraction of the total traffic running through a region, partially due to the use of the Smile data format (binary JSON) for delivering metrics from the client to the Atlas backend.
When you give developers tools for creating and aggregating highly dimensional metrics, they tend to create lots of metrics so that they can answer interesting business questions about the use of their applications. We have some developers who have written code that produces up to 150,000 metrics per instance, and the vast majority of these metrics are application-level. We typically see that only 3-5% of the metrics delivered from an instance are system-level performance metrics.
And this, of course, doesn't account for cases where a minor developer error results in code that, say, creates a new metric for every source IP address from which we see a request. Dynamic metric names FTW.
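For the curious, that failure mode is as simple as putting an unbounded value into a tag. A hypothetical sketch in the same Spectator style as above (geoLookup is a made-up helper; this is my own example, not actual Netflix code):

    // BAD: every distinct client IP mints a brand-new time series.
    registry.counter("server.requestCount", "sourceIp", request.getRemoteAddr()).increment();

    // BETTER: a bounded tag set keeps the cardinality sane.
    registry.counter("server.requestCount", "country", geoLookup(request)).increment();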
Suro collects arbitrary data, and Atlas is for numbers (time series numbers to be more specific). So metrics generally go through Atlas and logs go through Suro.
Brendangregg, do you know if anyone has tried to use this platform to predict failures or DDoS attacks ahead of time? Or is that not feasible with this API? Thanks.
We've used it to predict failures as well as a data source to predict scale-up and scale-down events. It hasn't been used for DDoS prediction, but I see no reason why it couldn't be.
Not Brendan, but ... we do a bunch of outlier and anomaly detection using Atlas to notice slow degradation in cluster performance based on outlier nodes and auto-execute them.
OT, but I'm using your app on a stock Nexus 7 (at 4.4.4), and I find a couple of controls very difficult to trigger, while others respond just fine at the first touch.
The "back up 30 seconds" button can be very fussy and difficult to trigger. Sometimes it takes many touches before it will trigger.
Also, when an episode's end titles are displaying within a series, it can be quite difficult to get a touch to register so that the titles play out and the app does not skip forward to the next episode before they do. (And even then, another countdown starts that will auto-launch the next episode. I wish, as I expressed to your call-in support, that you would add a user account setting to disable this.)
Sorry for the OT, but I would think you'd want the app more responsive in this regard. As I said, the other controls/buttons do not exhibit this difficulty; they are immediately responsive.
One of the most useful applications of Atlas (while working on Netflix Reliability) was creating alert conditions that would scale dynamically with changes in volume via the use of double-exponential smoothing (DES). It was very easy to create alerts that compared and combined multiple signals using the features in Atlas Stack Language. I am so excited to see it finally open-sourced!
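For anyone curious what that looks like, a DES-based alert expression in the stack language is roughly along these lines (operator names and parameters are from memory, so double-check them against the stack language reference; requestsPerSecond is a stand-in metric name):

    name,requestsPerSecond,:eq,:sum,:dup,10,0.1,0.02,:des,0.85,:mul,:lt

The :dup keeps the raw signal on the stack, :des smooths a copy of it (training window, alpha, beta), and the final :lt fires whenever the actual value drops below 85% of the DES prediction -- a threshold that scales with traffic instead of being a fixed line.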
Sounds like a great platform. I wonder how many instances of everything they need for peak time. I have a system that is rated for 5M time series data points per second and it takes about 150 physical servers, so I am curious what a Netflix-sized deployment (20M time series data points per second) would look like.
As the announcement notes, we have multiple tiers holding different data horizons. The most active, and large, tier is the one holding the last six hours of data. That tier, being the most critical one (we try to train our engineers to only need 6 hours of data to understand how their system is working in the worst case), is mirrored. Right now, each of those mirrors is about[0] 756 r3.2xl instances.
[0] For a very exact definition of "about," though that exact number could change in the next 5 minutes, or 5 hours, or 5 days, or not until we see another metrics increase.
I just learned about Akka recently (by way of Akka.net), and I'm encouraged by these frameworks that look like legitimate alternatives to Erlang. Erlang has always seemed like an extreme measure for implementing a distributed concurrent system, so it's good to know there are Java/Scala/C#/F# options.
The issue with naming conflicts arises from the fact that these things are generally created and named internally long before they're released as open source.
Everyone wants to name their product after a figure in Greco-Roman mythology, but there are only so many, and the cool ones are already taken. Same thing happened in investment funds, then they started naming things after trees, rocks, and fortifications.
Very interesting... I'm guessing you can do less than a minute of granularity? That's a lifetime to me.
I'm working on my own little system based on Riak for storage and their new search functionality to index the streams (what I call a unique metric). I have a goal of Graphite API compatibility though, so it can be a drop-in replacement for me. Worked out a schema for mapping Graphite metric names to the stream dimensions, wrote a Graphite function parser/lexer, etc. Atlas should definitely be worth a look for some fresh ideas :)
We CAN do <1m, but very rarely do (at least in terms of persisting and showing it). We have a feature called 'Critical Metrics', a separate publishing pipeline into Atlas that is shorter, simpler, hardier, and supports 10s granularity, though only for a tiny minority of metrics. Our current limit is on the order of 400K metrics per cluster, IIRC, which means that if you've got, say, 1,000 nodes, we're going to limit you to 400 metrics per node flowing through the Critical Metrics pipeline.
(We haven't open-sourced much of the pipeline ... yet)
Sounds similar to FnordMetric (http://fnordmetric.io/chartsql) which also supports dimensional timeseries data. Major differences between Atlas and FnordMetric on first sight:
- SQL based query and charting frontend (ChartSQL), so you don't have to learn yet another DSL
- ships with a wire-compatible StatsD API
- supports pluggable backends
- renders charts to SVG
- will probably not scale to petabytes of data out of the box
- single C++ binary; deploy it in 5 minutes using Docker
Interesting - I had not heard of FnordMetric before. There are not many monitoring systems that implement dimensionality for metrics tagging. It looks like FnordMetric goes about it slightly differently, but it seems to achieve the same end goal of arbitrary grouping by like characteristics.
In the Atlas eco-system, Servo and the Atlas client eliminate the need for StatsD. The combination of these products allows for code-level instrumentation and delivery of metrics. The Prana sidecar is available for non-Java applications to deliver metrics to the Atlas backend as formatted JSON objects.
Atlas supports multiple backend storage systems, although this is not easily pluggable just yet. Earlier iterations of Atlas had support for MongoDB and Cassandra as storage backends, but there were issues obtaining enough IOPS to satisfy the read and write performance requirements at scale, so storage was switched to in-memory.
Atlas can return data in JSON format (?format=json) which is suitable for JS or SVG based rendering systems such as Highcharts. There is also a streaming API that trades response time for increased data payloads.
It takes some time to learn the Atlas Stack Language, but the fact that it is URL-based means that the browser is the interactive query editor. Using tools like Chrome's Edit URL to help handle long strings, you can make small changes to queries iteratively and see the results in less than 2 seconds in many cases. Average PNG render time is typically less than 10 seconds; slow rendering can take around 30 seconds.
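To give a flavor of that workflow, a graph request is just a URL along these lines (the metric name is a stand-in and the parameters are from memory, so check the graph API docs for the real list):

    /api/v1/graph?q=name,requestsPerSecond,:eq,:sum,(,nf.cluster,),:by&s=e-3h&format=json

Tweak the query in the address bar, hit enter, and you get fresh results back (a rendered PNG by default, or JSON with format=json as mentioned above); that tight loop is what makes the stack language bearable to learn.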
20 million different time series. I mean, that is a lot. If you have, say, 20,000 servers running, that is still 1,000 different time series per server. Memory, CPU, logins, logouts, customer selections... I struggle to get to those numbers.
It's actually closer to 1.2 billion different time series -- we report most metrics at one-minute granularity, but they're not all reporting at the same second (thank God), so on average we're seeing around 20M data points per second.
But this of course just makes the question more reasonable -- 1.2B different time series? Really?
Yup. We get a bunch of system telemetry, and a bunch of default application telemetry, without any traffic even hitting the box, but that's a relatively small percentage of the overall volume. Developers LOVE metrics.
So imagine you want to measure requests to our API, and these are some tags you want to keep track of:
request type: 5 different types
result: 2 possible values (success, failure)
originating country: 50 countries
originating device type: 200 devices
And let's say you've got 1,000 instances reporting this data.
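Multiply those out and that single measurement is 5 x 2 x 50 x 200 = 100,000 distinct tag combinations. With 1,000 instances each reporting their own copy, that's up to 100 million time series for one request counter in the worst case where every combination shows up on every instance. A handful of measurements like that and a billion-plus time series stops sounding crazy.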
1,000 metrics per server is quite reasonable. I work for a performance management company and we handle thousands of time series metrics per monitored server at one-second resolution.
Above, copperlight quotes about 3-5% of metrics being system-level. A pretty small number would be process-level, I'm guessing, with the vast majority being app-level.
At Netflix-sized, the answer to pretty much any question is "a lot." :)
"Anomaly detection" is one of those vague terms that can mean anything from "it's gone above the pre-set limit, and that's anomalous" to "the system has studied the signal to learn what the accepted limits should be, and it's exceeded these limits." We mostly mean the latter for anomaly detection.
The Insight Engineering team at Netflix is largely composed of four kinds of engineers: platform/back-end engineers, UI engineers, Site Reliability Engineers, and Real-Time Analytics (RTA) engineers; it's the latter group who are looking into ways to quickly (and efficiently) detect anomalies in a truly absurd amount of data.
The RTA group is about six months old; I have high hopes that we'll see some public presentations from them soon that will be helpful to people outside Netflix.
I summarized it in a talk recently at Surge 2014, where I showed its role for a performance investigation, and how it is central to everything:
http://youtu.be/H-E0MQTID0g?t=22m http://www.slideshare.net/brendangregg/netflix-from-clouds-t...
There are more features to Atlas I didn't mention; check out the Overview on GitHub linked in the blog post.