Probabilistic Programming for Anomaly Detection (fastforwardlabs.com)
125 points by n-s-f on May 3, 2016 | 17 comments



It kind of seems like the example in the IPython notebook is a red herring. Wouldn't a multivariate Gaussian model perform just as well as probabilistic programming in this case?


Unless I am missing something, it is using a multivariate Gaussian. What the probabilistic programming is doing is inferring the mean and variance from the data:

    cons[i] ~ normal(mu_cons, sigma_cons) T[0.0,];

    // latency is linearly related to cons
    latency[i] ~ normal(beta * cons[i], sigma_latency) T[0.0,];

These are both normal distributions.
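
If it helps to see how such a model gets used, here is a rough sketch via PyStan (the surrounding data and parameter blocks are my guesses from the two sampling statements above, not the notebook's actual model):

    import numpy as np
    import pystan  # PyStan 2.x

    # Stand-in data; in the notebook these would be observed server metrics.
    rng = np.random.default_rng(0)
    cons = np.abs(rng.normal(50, 10, 200))
    latency = 0.3 * cons + np.abs(rng.normal(0, 1, 200))

    model_code = """
    data {
      int<lower=1> N;
      vector<lower=0>[N] cons;
      vector<lower=0>[N] latency;
    }
    parameters {
      real<lower=0> mu_cons;
      real<lower=0> sigma_cons;
      real<lower=0> beta;
      real<lower=0> sigma_latency;
    }
    model {
      for (i in 1:N) {
        cons[i] ~ normal(mu_cons, sigma_cons) T[0.0,];
        latency[i] ~ normal(beta * cons[i], sigma_latency) T[0.0,];
      }
    }
    """

    fit = pystan.StanModel(model_code=model_code).sampling(
        data={"N": len(cons), "cons": cons, "latency": latency})
    print(fit)  # posterior summaries for mu_cons, beta, etc.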


Related: "Anomaly detection from server log data" (http://www.vtt.fi/inf/pdf/tiedotteet/2009/T2480.pdf)


How does this scale to high-dimensional data? The example uses two dimensions, connections and latency, but real-world (performance) data will often have many more associated variables. I don't know a lot about Bayesian methods, but it seems like calculating joint posteriors would become slow.

But definitely a very interesting approach, and the Stan code is surprisingly readable!


Stan uses Markov chain Monte Carlo for inference. MCMC algorithms can do very well at approximating the high-dimensional integrals required to calculate the posterior; estimating 1000-dimensional distributions with MCMC methods is not unusual.
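
For intuition, here is a toy random-walk Metropolis sampler in Python. (This is not what Stan does internally; Stan uses the more sophisticated HMC/NUTS family, which mixes far better in high dimensions.)

    import numpy as np

    def log_target(x):
        # Stand-in log-density: a standard Gaussian in d dimensions.
        return -0.5 * np.dot(x, x)

    def metropolis(d=1000, steps=5000, scale=0.02, seed=0):
        rng = np.random.default_rng(seed)
        x = np.zeros(d)
        samples = []
        for _ in range(steps):
            proposal = x + scale * rng.normal(size=d)
            # Accept with probability min(1, target(proposal) / target(x)).
            if np.log(rng.random()) < log_target(proposal) - log_target(x):
                x = proposal
            samples.append(x)
        return np.array(samples)

    draws = metropolis()
    print(draws.mean(axis=0)[:3])  # each coordinate mean should be near 0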


Not unusual, but the sampling can take a long time and scales poorly with the amount of data you have.


Could someone clarify what PPL (probabilistic programming language) they're using? Presumably something based on Python, but beyond that? I wasn't aware of PPLs and had hoped to learn something from the blog post, but there wasn't much information. Some experience sharing with PPLs would be really awesome!


From the IPython notebook that appears to be associated with the blog post [1], they appear to be using Stan [2]. Stan is a relatively mature PPL that was designed with specific priorities in mind, such as handling continuous variables well. Overall, it's quite cool to see a startup putting a PPL to work in the wild!

[1] https://github.com/fastforwardlabs/anomaly_detection/blob/ma...

[2] http://mc-stan.org



I recently developed a probabilistic programming framework in C#. There's a fun example [1] in the tutorial that should give you a pretty good idea of what a PPL "feels like" to use.

The example is originally from the Anglican website (another PPL), but the general idea is transferable; PPLs make composing entire probabilistic models as easy as adding integers. We can perform complex inference on our models, and even compose our inference methods to form more powerful ones.

[1] https://github.com/joashc/csharp-probability-monad#modelling...
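
To give the flavor of "composing models as easily as adding integers", here is a loose Python paraphrase of the idea (not the C# library's actual API):

    import random

    # A "model" is just something you can sample from, so composing two
    # models is as easy as adding their samples.
    def die():
        return random.randint(1, 6)

    def two_dice():
        return die() + die()

    # Crude inference by rejection sampling: condition on the sum being 8
    # and look at the posterior distribution of the first die.
    pairs = ((die(), die()) for _ in range(100_000))
    hits = [a for a, b in pairs if a + b == 8]
    print({k: round(hits.count(k) / len(hits), 3) for k in range(1, 7)})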


The notebook linked to by the blog post uses Stan (and PyStan).


Some of the leading challenges in anomaly detection are:

(1) Goal. What is the goal? One goal is near real-time monitoring for health and wellness, with detection as soon as possible. This goal can be appropriate when monitoring on-line systems, say, server farms, digital communications networks, or process plants. For much of medicine, there is more time for detection.

(2) Old Data. What is assumed about the input data, that is, the old or historical data? One leading assumption is that the data is a second-order stationary stochastic process. Another leading assumption is that the data is independent and identically distributed.

(3) Anomalies. There are essentially two important cases. First, we have seen the anomalies before, have some good data characterizing them, and are now just looking for those old anomalies in new data. An example here is looking for signatures of malware.

Second, we have not seen the anomalies before and are trying to detect them for the first time, that is, the zero-day case.

(4) Real-Time Assumption. If doing real-time monitoring, we likely want to assume that the real-time data is probabilistically the same as the old, historical data.

(5) False Alarm Rate. Once the context of the work as in (1)-(4) is clear, likely the most important goal is to be able to select a desired false alarm rate and get that rate in practice (a short sketch of one way to do this appears after this list).

(6) Detection Rate. The next major challenge is, for whatever false alarm rate we are willing to tolerate, how to get the highest possible detection rate. Of course, we might like to achieve the highest possible detection rate in the sense of the classic Neyman-Pearson result, e.g., as in

https://news.ycombinator.com/item?id=11440459

(7) Nominal versus Interval Data. Another challenge is how to handle data that is nominal (i.e., with just a few values known in advance) or interval, i.e., essentially any real numbers.

(8) Multi-Dimensional Data. Another challenge is how to get a higher detection rate by jointly exploiting data on several variables.

(9) Distribution-Free. Another challenge is how to handle situations where there is little or no chance of knowing the joint probability distribution of data on several variables. E.g., it is easy to get input data on 20 variables, but getting enough data on those 20 variables to have an accurate description of their joint probability distribution stands to be challenging, say, 1000^20 data points or some such. So, we may want techniques that are distribution-free, that is, that make no assumptions about probability distributions.

Anomaly detection is often close to hypothesis testing in statistics, and that is a large and mature field with a lot of good work.
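
On (5) and (9) together, the simplest distribution-free move is an empirical-quantile threshold. A sketch in Python (the anomaly score and data here are made up; with i.i.d. healthy data, the false alarm rate on new healthy data is roughly alpha):

    import numpy as np

    rng = np.random.default_rng(0)
    # Stand-in for a real anomaly score; higher means more anomalous.
    historical_scores = rng.gamma(2.0, 1.0, size=10_000)

    alpha = 0.01  # desired false alarm rate
    # Flag anything above the empirical (1 - alpha) quantile of the
    # healthy historical scores; no distributional assumptions needed.
    threshold = np.quantile(historical_scores, 1 - alpha)

    new_scores = rng.gamma(2.0, 1.0, size=1_000)
    alarms = new_scores > threshold
    print(f"threshold={threshold:.3f}, observed false alarm rate={alarms.mean():.3%}")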


What do you think of approaching anomaly detection as a semi-supervised classification problem?


I suggest approaching anomaly detection in ways to get good answers to the nine challenges in my post and any additional important challenges.

What you are suggesting may be related to what I suggested in my post

https://news.ycombinator.com/item?id=11440459

There, for the data, both historical and new, we had only some discrete cases, and only a few thousand or so of those. And in that example there was data on both the healthy and the sick. So, we could go fairly directly to the classic Neyman-Pearson result, as in the derivations in that post. But that example was quite special.

Classification usually seems to be for only some modest number of discrete cases, "modest" enough that we can get, store, and manipulate data on all the cases in advance (supervised), before starting to detect anomalies. For data on several variables, the number of cases stands to grow exponentially in the number of variables, say, as in the 1000^20 for 20 variables in my post above.

Another approach is to do some work with nearest neighbors; for that, with some commonly justified assumptions, there is a way to set and get a desired false alarm rate, along with some good properties for the detection rate.
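
A rough sketch of the nearest-neighbor idea (the details here are my guesses, using scikit-learn for convenience):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    healthy = rng.normal(size=(5000, 6))  # historical healthy-only data
    train, held_out = healthy[:4000], healthy[4000:]

    # Score a point by the distance to its k-th nearest healthy neighbor;
    # set the alarm threshold at an empirical quantile of held-out healthy
    # scores so the false alarm rate is roughly the chosen alpha.
    k, alpha = 10, 0.01
    nn = NearestNeighbors(n_neighbors=k).fit(train)

    def score(points):
        distances, _ = nn.kneighbors(points)
        return distances[:, -1]  # distance to the k-th neighbor

    threshold = np.quantile(score(held_out), 1 - alpha)
    alarms = score(rng.normal(size=(200, 6))) > threshold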

Again, I see little choice but to see much of anomaly detection as part of statistical hypothesis testing.


It seems like a semi-supervised classification framework can address each of those nine issues. I agree, btw, that each is important, but I would emphasize the importance of understanding collinearity/correlation in multidimensional data. Maybe also consider online updating (adding new data to the old data).

I think I phrased my previous comment poorly; in the OP's example, and in a PDF linked elsewhere in this thread (https://news.ycombinator.com/item?id=11623647), the anomaly detection problem is about server data. The motivation for finding anomalies is then to prevent server outages/delays, and it's that problem I was considering (rather than anomalies in general, which can be good or bad or just special).

My thinking is that under a semi-supervised framework we would take some historical data (where each point in time is associated with n readings of a system), mark known anomalous periods (e.g., if there's an outage detected by some other system), train a classification model on this previous data, and then attempt to classify new points in time as anomalous or not. I don't make any claims about what kind of classification model; nearest neighbors or logistic regression would surely be fine.
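
For concreteness, that approach might look something like this (a sketch with made-up data and labels, using logistic regression since you mentioned it):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)

    # Historical data: m time points, n readings of the system each.
    X_hist = rng.normal(size=(5000, 8))
    # Labels from some other system: 1 = known anomalous period, 0 = normal.
    # (Synthetic here; in practice these come from marked outages.)
    y_hist = (X_hist[:, 0] + X_hist[:, 1] > 2.5).astype(int)

    clf = LogisticRegression().fit(X_hist, y_hist)

    # Classify new time points; moving the probability cutoff trades
    # false alarm rate against detection rate.
    X_new = rng.normal(size=(100, 8))
    p_anomaly = clf.predict_proba(X_new)[:, 1]
    alarms = p_anomaly > 0.9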

With that understanding, I don't follow what you mean by 'modest number of discrete cases' for classification. I follow most of your linked approach, which I'm understanding as similar to a Naive Bayes method (which, incidentally, doesn't appear at face value to deal with your point #7 about continuous data) - it's quite interesting how that eventually ends up intimately connected with the hypo test math. But if we frame the question this way, of having a desired result ('normal' data) and having an undesired result ('anomalous' data), then I don't see how a classification approach wouldn't work.


Part of the problem in anomaly detection is that sometimes anomalies are of a "novel" type. It can be impossible to collect labelled data on all possible anomalies. So learning a model of normal behaviour tends to work out better.

But if we're talking about a specific application, where there's only one (or a few) types of bad behaviour, then it's probably worthwhile to manually label some data and use binary classification as you suggested.
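
As a sketch of the "learn normal behaviour" route (the method and numbers are my own choices, not from the thread; scikit-learn's one-class SVM is just one option):

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(2)
    X_normal = rng.normal(size=(2000, 4))  # healthy-only training data

    # nu roughly caps the fraction of training points treated as outliers,
    # which gives one crude handle on the false alarm rate.
    model = OneClassSVM(nu=0.01, gamma="scale").fit(X_normal)

    # Mix of normal-looking points and a few obvious oddballs.
    X_new = np.vstack([rng.normal(size=(50, 4)), rng.normal(5, 1, size=(5, 4))])
    flags = model.predict(X_new)  # +1 = looks normal, -1 = flagged anomalous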


Yes, for your first point, you seem to be considering behavioral monitoring, where we attempt to characterize the behavior of a healthy or normal system and declare everything else a sick system or an anomaly.

Yes, for your second, an example seems to be detecting malware in communications data streams or hard disk files by looking for specific, characterizing patterns of bits, that is, signatures extracted from earlier detections of such malware.

In behavioral monitoring, we might have challenges with false alarm rate.


> It seems like a semi-supervised classification framework can address each of those 9 issues.

Well, in the terminology of machine learning, semi-supervised learning needs to start with supervised training and labeled data, and those are not possible for the zero-day challenge I mentioned.

Also, in that "framework", without more detail, it's not very clear how to select and obtain a false alarm rate.

There is a lot on classification. Maybe what you mean is close to L. Breiman's book Classification and Regression Trees (CART and related software) and/or his more recent random forests.

> My thinking ... nearest neighbors or logistic regression would surely be fine.

So, it sounds like you have some historical or training data that, for the set of real numbers R and positive integers m and n, consists of m points in R^n. Then you want to assume that some regions in R^n (I omit an attempt at a careful definition of a region) are for a sick system and the other regions are for a healthy system (more generally, in some regions the conditional probability of a sick system is relatively high, and in other regions, low). Well, if the number of variables n, and hence the dimension of R^n, is much over, say, 10, which is easy to have for data from a server farm, then we could need the number of historical points m to be huge, or we could end up with some quite sparse data.

It may be that logistic regression trees could fit such data, that is, the regions, with good accuracy. Then, for the false alarm rate, apply the trees (the detector) to the old data, or some reserved old data, and assume that new data is appropriately similar. That application could give estimates of both the false alarm rate and the detection rate. As a first cut, I would suggest using regression trees (with or without the logistic version), and not just one regression expression.
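
Estimating both rates from reserved data is mechanical once you have 0/1 labels and predictions; a tiny sketch (the arrays are placeholders):

    import numpy as np

    # On reserved historical data: 0 = healthy, 1 = sick.
    y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
    y_pred = np.array([0, 1, 0, 0, 1, 0, 0, 1])

    # P(flag | healthy) and P(flag | sick), respectively.
    false_alarm_rate = ((y_pred == 1) & (y_true == 0)).mean() / (y_true == 0).mean()
    detection_rate = ((y_pred == 1) & (y_true == 1)).mean() / (y_true == 1).mean()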

So, as a first cut, how do you build a regression tree? Do ordinary regression, see where the fit is poor, split off that data, and fit it separately. Continue splitting and get a tree. For more, read Breiman! Breiman was a darned good mathematician and statistician.
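
A toy, one-split version of that recipe (nothing like production CART, just the "fit, find the poor fit, split, refit" idea in Python):

    import numpy as np

    def line_fit_sse(x, y):
        # Residual sum of squares from an ordinary least-squares line.
        coeffs = np.polyfit(x, y, 1)
        return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

    def best_split(x, y, min_leaf=10):
        # Try each cut point; keep the one where two separate line fits
        # leave the lowest combined residual error. Recursing on each
        # side would grow a full tree.
        order = np.argsort(x)
        x, y = x[order], y[order]
        best_cut, best_sse = None, np.inf
        for i in range(min_leaf, len(x) - min_leaf):
            sse = line_fit_sse(x[:i], y[:i]) + line_fit_sse(x[i:], y[i:])
            if sse < best_sse:
                best_cut, best_sse = x[i], sse
        return best_cut

    # Piecewise-linear data whose slope changes at x = 5.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 300)
    y = np.where(x < 5, x, 5 + 3 * (x - 5)) + rng.normal(0, 0.3, 300)
    print(best_split(x, y))  # should land near 5.0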

So there the role of classification is that the detector, built from the logistic regression trees, would return 0 for a healthy system and 1 for a sick one. If you get a low rate of false alarms and a high rate of detections (that is, a nearly perfect detector) on, say, your reserved historical data, and you are confident you didn't get hurt by overfitting and are okay on the other needed assumptions, then just smile on the way to the bank.

> With that understanding, I don't follow what you mean by 'modest number of discrete cases' for classification.

For your problem, as I understand it, you don't have "a modest number of discrete cases" but the post of mine I linked to for the thread on fraud detection did.

> it's quite interesting how that eventually ends up intimately connected with the hypo test math.

Again, my view is that in anomaly detection we can't be far from the framework and/or content of classic statistical hypothesis testing. What I gave in that post was a cheap, quick, and dirty derivation of the classic 1933 Neyman-Pearson result on best possible detection. So, in part, for best possible we have to solve a maximization problem. My proof for the general case makes use of Lagrange multipliers and also the Hahn decomposition, as used in proving the Radon-Nikodym theorem.
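
For readers who haven't seen it, the Neyman-Pearson setup in symbols (the standard statement, not anything specific to that post):

    maximize    P(flag | anomaly)            -- the detection rate
    subject to  P(flag | healthy) <= alpha   -- the false alarm rate

The optimal detector flags an observation x exactly when the likelihood ratio p_anomaly(x) / p_healthy(x) exceeds a threshold t, with t (plus, in discrete cases, some randomization at t) chosen so the false alarm constraint holds with equality.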

> doesn't appear at face value to deal with your point #7 about continuous data)

Right, that fraud detection problem in my link was for a quite special case of anomaly detection.

For those nine challenges I listed, I was not suggesting that we could attack them all with just one approach to anomaly detection. Instead, I was just saying that those nine challenges get encountered in some cases of anomaly detection. I was assuming implicitly that we would be taking different approaches for different cases of anomaly detection and addressing those of the nine challenges that applied in each of the cases.

As a response to the OP here, I was suggesting that there is a lot to anomaly detection, and via the nine challenges I tried to describe some of how much there is to consider. In particular, I never used Bayes and never mentioned programming languages or software, right, because I believe that there are challenges that need solutions that don't have much to do with Bayes or programming languages. E.g., in describing the role of the historical (training) data, I would consider conditional probabilities and conditional expectation.

> but would emphasize the importance of understanding collinearity/correlation with multidimensional data.

When I was considering graduate school, one day on my job at FedEx I was at GE in Boston, so I took some time off and drove down to the Division of Applied Mathematics at Brown and talked with several of the professors, including Ulf Grenander, one of the world's best statisticians. At the time he had a lot of log data from an IBM mainframe and was analyzing it. I noticed that he was surprised at, say, the complexity of the data. He said it was nothing like the data he was used to in statistics from, say, biostatistics. He was correct. And that complexity remains the case.

In particular, the usual experience with probability distributions goes into the trash; e.g., in any very direct sense, the Gaussian usually becomes nearly irrelevant.

Sure, for the case I mentioned of data from a second-order stationary stochastic process, covariance is crucial. But suppose the data is from transactions, say, one data point from each transaction. Then, point by point in time, we can't expect anything important from covariance. An example of such a transaction now? Sure: the data from one visitor to a Web site.

Sure, maybe in the past on a computer system we would establish some thresholds and use those to detect anomalies, that is, a sick system. Then, have some rules that may do better in some more complicated cases. Or maybe, on a multi-user computer, watch how busy the CPU is and the rate of page faults, and try to detect virtual memory thrashing that goes on for, say, several seconds, and then try to diagnose that and correct it. Okay. But the world of monitoring server farms and networks has gotten more challenging now! E.g., in some cases we now want to be able to detect problems we have never seen before: zero-day detection!
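
For concreteness, that old thrashing rule might look something like this (toy numbers and names of my own, not from any particular monitor):

    # Classic thrashing signature: the CPU sits mostly idle waiting on
    # the disk while the page-fault rate stays high for several seconds.
    def looks_like_thrashing(samples, cpu_max=0.2, faults_min=500, seconds=5):
        # samples: one (cpu_utilization, page_faults_per_sec) pair per second
        recent = samples[-seconds:]
        return len(recent) == seconds and all(
            cpu <= cpu_max and faults >= faults_min for cpu, faults in recent)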



