Computer systems, which are deterministic, have long since passed out of the realm of being understandable.
They're chaotic systems[1]. Parameters which are below the threshold of detection can drastically affect their trajectory in phase space. Accidental complexity has zoomed off to live a life all its own, and that takeoff has been continuous for decades [2].
Everything we build around software that's not directly bearing on the problem domain is chipping away at this or that aspect of the ungoverned bits. As we corral each level of new unpredictability, we merely set the scene of the next level of complexity.
There is no universal solution. There is no final boss which, once defeated, leads to the scrolling names and 8-bit themesong. It never stops, ever, because the value of systems that frequently fail unpredictably is lower than the cost of flawless systems and higher than the value of not building such systems.
Edit: sounds very unhelpful, even defeatist. I know. I'm deeply ambivalent about what can be done versus what we should assume is possible. I'm a fan of serious software engineering and tool support and all that jazz ... yet thoroughly pessimistic about the limits of the possible. Deep sighs for every occasion!
They're not even deterministic. Even if your input is totally ordered, entropy from the memory subsystem will bubble up to communicating threads if you have more than one. Sometimes one thread will read the newly written value, sometimes the old. That said, there is some clever research out of UW into how to make such systems deterministic while still recovering almost all parallelism.
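A toy illustration of that run-to-run variation (plain Python, unrelated to the UW work): two threads doing unsynchronized read-modify-write on a shared counter can interleave differently on every run, so even with identical input the result varies.

    import threading

    counter = 0

    def bump(n):
        global counter
        for _ in range(n):
            counter += 1  # read-modify-write: not atomic

    threads = [threading.Thread(target=bump, args=(100000,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The result depends on how the interpreter happened to interleave the
    # threads; it can come up short of 200000 when increments are lost, and
    # how often that happens varies with interpreter version and timing.
    print(counter)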
I think the more important point jacques_chester was making is that even in the case where computer systems are 100% deterministic it doesn't matter because they are so complex they become chaotic systems. And then hypothetical determinism won't save you because there are too many variables in play simultaneously, and even the slightest variation in any one of them could change the behavior completely.
Sorry guys, but I don't buy your defeatist arguments for one second. Sure, using traditional tools, you may be right. However, all we have to do is look at the brain with its 100 trillion connections (synapses) or a CPU with its billions of transistors to see that your logic is flawed. If it were not possible to design fault-tolerant systems at large scale, neither of these two examples could exist.
I don't claim to have the answers, but a few years ago I had to help some EEs write some testing software in LabVIEW. The first thing that shocked me was how immune the system was to bad data; the second was how elegantly it took advantage of our multiprocessing and multithreading hardware.
What I wish systems researchers would work on is concurrent, signal-based, synchronous languages. This is a winnable game, but we must swallow the red pill and change the rules of the game with new tools and fresh ideas.
Just reading your second reference, the bit about expert systems is pretty interesting . . . what's your stance on this? It seems that, given sufficient instrumentation information, one could probably construct a fairly decent expert system that might be able to shed some light onto these sorts of problems. (Keep in mind the fact that I'm still very much a junior software developer!)
The example of FDs from the original article seems like a good one; a properly developed expert system might have been able to notice a lack of changes in the current software environment and suggest looking elsewhere. (Problem is, if I got that response from my automated problem-solver I'd just ignore it and keep looking . . .)
So, any insightful points to crush my youthful idealism? :)
Expert systems for monitoring and management of large server farms and networks? Been there. Done that. Got the T-shirt. Wrote papers. Gave one at an AAAI IAAI conference at Stanford. Shipped two commercial products.
First problem: you need some 'experts', so you only know about the problems the experts know about. Their knowledge is supposed to be mostly empirical, from their 'expert experience'. Even if they have 'deep knowledge', that is, know how the systems work internally (e.g., the engine connects to the torque converter connects to the transmission connects to the rear differential connects to the rear wheels), the specific problems they know about are usually just the ones they have encountered. So such a system can essentially only address problems that have been seen before and are well understood. But the problems in the OP were being seen for the first time. Bummer. Actually, with irony, if the problems have been seen before and are well understood, then why the heck have they not already been solved?
Second, with expert systems, we're working just intuitively, just from experience. So we have little idea whether what we are doing is the best possible in any reasonable sense, or even at all good. In particular, in monitoring, the main, first operational problem on the 'bridge' or in the 'NOC' (network operations center) is that the false alarm rate is too high with no good way to lower it (except just to ignore some alarms, but we should be able to do much better).
Net, for a serious attack on system monitoring and management, I can't recommend taking expert systems very seriously.
This is a great topic, the idea that contemporary computer systems are not atomically understandable or model-able. And it's a topic that's almost entirely absent from computer science discourse.
We passed the point quite a while back where "applied CS" -- what it's like to build and maintain scalable systems in the real world -- looks a lot more like anthropology than like mathematics.
We have tools like sampling profilers, loggers that do stateful packet inspection on a small percentage of traffic, and sophisticated A/B testing frameworks that reason about user populations rather than individuals. But I think we could probably build a whole 'nother generation of these tools by thinking about this problem of understanding complex systems from some new perspectives. I nominate anthropology as a discipline that has many years under its belt wrestling with the impossibility of complete description. (And wrestling with other tough issues, sometimes not well, but that's another conversation.)
I'm not talking about the kind of "anthropology" work that companies like Intel have done for a long time, which are firmly in the social sciences tradition and are really user studies. (Though that work is valuable.) I'm talking about porting ideas such as "thick description" to our technical toolkit.
No one's asking for a solution to complexity; I agree there's no such thing. What we need are better ways to manage complexity -- and that is absolutely within our grasp.
Speaking as a student studying systems (particularly security and OS development) . . . the second point sounds really interesting.
Recently I've been working on a simple microkernel (called Sydi) that should actually make this possible. It's a distributed system straight down to the kernel: when two instances of the kernel detect each other running on a network, they freely swap processes and resources between each other as required to balance the load. This is a nightmare from a security perspective, of course, and I haven't figured out the authentication bits yet -- too busy working on an AML interpreter because I don't like ACPICA -- but that's not the issue here. (It's far from finished, but I'm pretty confident that I've worked out the first 85% of the details, with just the remaining 115% left, plus the next 95% of the programming . . .)
Due to some aspects of its design (asynchronous system calls via message-passing, transparent message routing between nodes in the network, etc) I feel that it will be entirely possible to take a snapshot of an entire cluster of machines running the kernel. It would be expensive -- requiring a freeze of all running processes while the snapshot is taking place to maintain any level of precision -- but I'm confident I could code that in during the space of a week or two . . .
I haven't thought much about what kind of analysis one could do on the instrumentation/snapshot results, though. I'm sadly too inexperienced with `real-world' systems stuff to be able to say. Anyone have any suggestions for possible analysis avenues to explore?
This sounds interesting! Have you published anything more substantial on it? I yearn for the day when I can have a Plan 9-ish resource sharing and file system connecting my personal computer to my cloud computers.
> It would be expensive -- requiring a freeze of all running processes while the snapshot is taking place to maintain any level of precision.
Have you thought about distributed algorithms like the Chandy-Lamport Snapshot algorithm (http://en.wikipedia.org/wiki/Snapshot_algorithm) that take a consistent snapshot and do not require a system freeze?
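For anyone unfamiliar with it, here is a toy, single-machine sketch of the marker rules (class and method names are made up, not from Sydi or any real kernel): a process records its own state when it first sees or initiates a marker, then records messages arriving on each incoming channel until that channel delivers its own marker, so no global freeze is required.

    class Process:
        def __init__(self, pid, peers):
            self.pid = pid
            self.peers = peers            # pid -> Process (shared dict)
            self.state = 0                # application state, e.g. a balance
            self.recorded_state = None    # local part of the snapshot
            self.channel_state = {}       # src pid -> messages caught in flight
            self.open_channels = set()    # incoming channels still recording

        def send(self, dst, msg):
            self.peers[dst].receive(self.pid, msg)

        def receive(self, src, msg):
            if msg == "MARKER":
                if self.recorded_state is None:
                    self._snapshot(first_marker_from=src)
                else:
                    self.open_channels.discard(src)   # that channel is now final
            else:
                self.state += msg
                if src in self.open_channels:         # message was in flight
                    self.channel_state[src].append(msg)

        def initiate(self):
            self._snapshot(first_marker_from=None)

        def _snapshot(self, first_marker_from):
            self.recorded_state = self.state
            others = [p for p in self.peers if p != self.pid]
            # the channel the first marker arrived on is recorded as empty;
            # keep recording the rest until their markers arrive
            self.open_channels = {p for p in others if p != first_marker_from}
            self.channel_state = {p: [] for p in others}
            for dst in others:
                self.send(dst, "MARKER")

In this synchronous toy everything delivers instantly, so the recorded channel states stay empty; the point is only the marker bookkeeping that replaces the freeze.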
I want to register a complaint. Everyone talks about wanting ground-breaking research and innovation, but in my own personal experience I see little-to-no support for it.
It reminds me of the scene from the film Election, where a character intones, after you've seen him ski down a double-black-diamond slope and crash, "Last winter when I broke my leg, I was so mad at God."
We're spending money to squeeze the last drops of gold out of the research done in the 60's and 70's, while research starves, and then complaining that there's nothing new. I don't see how we can expect anything different.
I originally opened the link expecting to conclude the same thing; however, if you read the article again, I think the guy is (a) highly qualified to complain (after all, he chaired HotOS's PC, no small recognition of his stature in the community) and (b) raising good points.
Systems are about interactions, and he's identified the challenges in analyzing systems.
This isn't so much about work from the 60s and 70s, but about modern architectures and the new things they enable. We're still working with models from the 60s and 70s, ignoring various aspects of new system architectures, and wondering why we still suck.
It's a great comment, graycat, and I couldn't agree more. Where we might differ is that I think the future of computing is in Computer Science -- but one where it will lead to an extension of mathematics (specifically Set Theory). But we're essentially saying the same thing.
For where computing will be based, currently there are big cultural issues.
With one paper I published, in a computer science journal, for a problem in computer science but basically some applied math, I had to discover that, really, it was tough for the computer science community to review the paper. I sent an 'informal' submission to a nice list of the top journals appropriate for the paper. From one Editor in Chief of one of the best journals, a chaired professor of computer science at one of the best universities, I got back "Neither I nor anyone on my board of editors has the mathematical background to review your paper." Then I got a similar statement from another such person. For one at MIT, I wrote him background tutorials for two weeks before he gave up. Finally I found a journal that really wanted the paper, but apparently in the end only the editor in chief reviewed the paper, and he did so likely by walking it around his campus to the math department to check the math and then to the computer science department to check the relevance for computer science.
I had to conclude: for the good future of computer science via math, the current crop of CS profs just didn't take the right courses in college and graduate school.
That's one side of the cultural divide. For the other side, (1) the math departments always want to be as pure as possible and f'get about applied math, and (2) the people in applied math really don't like doing computer science. Some obscure math of relativity theory, maybe, but mostly nothing as practical as computer science.
Of course, the big 'cross cutting' exception is the problem P versus NP, apparently first discovered in operations research (integer linear programming, yes, is NP-complete), later in computer science with SAT, and now at times taken seriously in math, e.g., at Clay Math.
Here's a 'reason' for the math: when we write software, we need something prior to the software as, say, a 'specification', that is, something saying what the software is to do.
Now, where is computer science going to get that specification? A big source has been just to program what we already know how to do, at least in principle, manually. After that, computer science starts to lose it and drifts into 'expert systems' (program, with 'rules' and Forgy's RETE algorithm, what an expert says), intuitive heuristics, and various kinds of brute-force fitting -- machine learning, neural networks -- where basically we fit to the 'training' data, test the fit with the rest of the data, and stop when we get a good fit. So we throw fitting methods at the data until something appears to stick.
We need more powerful means of getting that prior specification. The advantage of math is that it can start with a real problem, formulate it as a math problem, solve the math problem, and then let the math solution, and what it says we need to do in manipulating the data, be the specification for the core software. E.g., if we want to design an airplane on a computer, then start with the applied math of structural engineering, mechanical engineering, and aeronautical engineering, program that applied math, and then design the plane. For software to navigate a spacecraft to the outer planets, start with Newton's second law and law of gravity, get the differential equations of motion, get some numerical means of solution, and then program what the numerical analysis says to do.
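As a toy version of that spacecraft example (a single body orbiting Earth rather than an interplanetary trajectory, with made-up step sizes), the math supplies the acceleration and a fixed-step integrator stands in for the 'numerical means of solution':

    import math

    GM = 3.986e14                     # Earth's gravitational parameter, m^3/s^2

    def accel(x, y):
        # Newton's law of gravity in the plane: a = -GM * r / |r|^3
        r3 = (x * x + y * y) ** 1.5
        return -GM * x / r3, -GM * y / r3

    def propagate(x, y, vx, vy, dt, steps):
        ax, ay = accel(x, y)
        for _ in range(steps):
            # leapfrog (kick-drift-kick); chosen here because it keeps the
            # orbital energy far more stable than a naive Euler step
            vx += 0.5 * dt * ax; vy += 0.5 * dt * ay
            x += dt * vx; y += dt * vy
            ax, ay = accel(x, y)
            vx += 0.5 * dt * ax; vy += 0.5 * dt * ay
        return x, y, vx, vy

    # circular orbit at 7000 km radius: speed is sqrt(GM / r)
    r0 = 7.0e6
    print(propagate(r0, 0.0, 0.0, math.sqrt(GM / r0), dt=1.0, steps=5400))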
We need things solid and prior to the software to know what software to write, and basically that means we need a math solution first.
For computing itself, as in monitoring, we can call that a problem in statistics -- more applied math. A lot in computer load balancing is some serious applied math. Actually, optimal job scheduling is awash in NP-complete optimization problems.
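As one small, concrete instance (a generic textbook heuristic, not anyone's production scheduler): minimizing makespan on identical machines is NP-hard, so in practice something like Longest-Processing-Time-first gets used instead.

    import heapq

    def lpt_schedule(job_times, n_machines):
        """Greedy LPT: assign jobs longest-first, each to the currently
        least-loaded machine; worst case is within 4/3 of optimal."""
        loads = [(0.0, m) for m in range(n_machines)]   # (load, machine id)
        heapq.heapify(loads)
        assignment = {m: [] for m in range(n_machines)}
        for t in sorted(job_times, reverse=True):
            load, m = heapq.heappop(loads)
            assignment[m].append(t)
            heapq.heappush(loads, (load + t, m))
        makespan = max(load for load, _ in loads)
        return assignment, makespan

    print(lpt_schedule([7, 5, 4, 3, 3, 2], n_machines=2))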
Or, we used to have 'metaphysics'. Then physics became mathematical and made real progress. Basically, the solid logical chain of correctness given by math theorems and proofs is just too darned hard to compete with or, thus, ignore.
I find it "interesting" that all the complaints are about usability from a "userland" developer POV.
Generally what comes from that are patched-up systems to "run stuff in the kernel" because "it's safe and fast".
Which of course is not true.
There are, however, various attempts to make a "proper", "modern" OS, but all of these have failed due to the lack of commercial backing. There is apparently no financial interest in having a secure, scalable, fast, deterministic, simple OS right now (which, oh, requires rewriting all your apps, obviously).
Even Microsoft tried. Yes, Microsoft did an excellent such OS.
The Midori project is still secret, so who knows what they are doing there, while some of the team is nowadays playing with the concept of pico-hypervisors and running the full OS as a library instead.
Note that where I mentioned "kernel" and "safe" I used quotes because it's sarcastic.
The modern OSes do have kernels that are much safer than what we use day to day, however. They're indeed not in C. In general the core is typed ASM, and a managed language runs right on top.
The kernel is generally written with that and uses various concepts that separate privileges as much as possible, basically for security, reliability and predictability.
That includes kernel components, and obviously drivers.
Messages between components are also generally declared ("contracted") with well-defined types.
That would require building consensus around such a language. I mean, yes, this is one of my own hobby projects, but Rust is the most fully-developed attempt made so far and it still doesn't have the kind of developer backing that C did in the BSD days.
I am old enough to remember when C only had backing from people working on UNIX systems, while other systems were either using Assembly or another high level language e.g. PL/I, Algol or Pascal dialect.
This could catch some programming bugs, sure, but it wouldn't make the OS much safer. You can still have attacks on integrity, privacy and availability.
Configuration, performance, package management, reliability/robustness, security, monitoring, virtualization, and distributed computing. Those are the big problems and missing pieces that people doing systems level engineering are facing today. And, no coincidence, these are the things where you see the most repeated instances of tools that address these issues.
Consider package management. You have an OS level package manager, you have OS level application management (on android, for example), you have node package management, you have ruby gems, you have perl and python packages, you have wordpress themes and plugins, etc. Obviously it doesn't make sense to have a single global package manager, but what about standards for package management? Competing package management infrastructures which serve as foundations for all these lower level systems? A mechanism for aggregating information and control for all of the package management sub-systems?
As the article points out a lot of systems folks and OS folks are still retreading the same paths. They want to pave those paths, then build moving walkways, and so forth. That's all fine and good, but there's a lot of wilderness out there, and a lot of people are out in that wilderness hacking away with machetes and pickaxes building settlements, but for some reason most OS folks don't even imagine that roads can go there too.
As far as the ruby/python/perl package problem goes, I think we're heading towards a standard with virtualenv-style tools and potted environments. The simplest possible standard there is a <foo>/bin/activate script to set the right environment variables; you can build pretty much anything else you want on top of that.
I think all of these areas are making great progress.
> An escape from configuration hell
debconf. chef, puppet. git. augeas. gsettings. In the Windows world, the registry.
There's tons of work being done in this area. You may say that more can be done, or that it isn't good enough, but I think you're wrong to claim that nobody is working on it.
> Understanding interactions in a large, production system
How about the Google Prediction API? Send the data in, and look for changes when you have a problem that you're trying to fix, such as in your example.
> Pushing the envelope on new computing platforms
This is incredibly vague. I find it difficult to understand your definition of "new computing platforms". Based on my interpretation of what you mean, I think a modern smartphone might qualify: a new form factor, more ubiquitous than before, novel UX (capacitive touchscreen), and an array of interesting sensors (accelerometers, compasses, GPS).
I think your post highlights the lamentation of AI researchers: we don't have a good, static measure of success. Once we've achieved something, our definition changes to exclude it.
None of those configuration management projects really advance the state of the art beyond cfengine which has been around for 20 years. They're still just pushing text files around.
I think people replying here aren't quite appreciating the complexity of configuration. One might think that configuration doesn't get harder with scale, but it does. Suggesting that the solution is simply using a general-purpose language for configuration, or storing configs in a key-value DB or in a version control system, is almost adorable. It's basically completely missing where the pain points are.
And you can be sure that a company like Google will have done configuration using flat files, key-value stores, DSLs, general purpose languages that generate config files, general purpose languages that actually execute stuff, and probably many other ways :-) If there were simple solutions like that, they'd already be in use.
Here's a small list of desirable properties:
- Configurations must be reusable, composable and parametrizable. When running a system with multiple components, you need to be able to take base configurations of those two components and merge them into a useful combination, rather than writing the configurations from scratch. You must also be able to parametrize those base configurations in useful ways -- which external services to connect to, which data center and on how many machines to run on, etc. And note -- this can't be a one time thing. If the base configurations change, there needs to be some way of eventually propagating those changes to all users.
- Configurations must be reproducible. If the configuration is a program, it shouldn't depend on the environment where it's run on (nor should it be able to have side-effects on the environment). Why? Because when somebody else needs to fix your system at 3am, the last thing you want them to do is need to worry about exactly replicating the previous person's setup.
- Tied to the previous point, configurations also need to be static or snap-shottable. If a server crashes and you need to restart a task on another machine, it'd be absolutely inexcusable for it to end up using a different configuration than the original task due to e.g. time dependencies in programmatic configuration.
- It must be possible to update the configuration of running services without restarting them, and it must be possible to roll out config changes gradually. During such a gradual rollout you need to have options for manual and automatic rollback if the new configuration is problematic on the canaries.
- Configurations need to be debuggable. Once your systems grow to having tens of thousands of lines of configuration, or hundreds of thousands of programmatically generated key-value pairs, the mere act of figuring out "why does this key have this value" can be a challenge.
- It'd be good if configurations can be validated in advance, rather than only when actually starting the program reading the configuration. At a minimum, this would probably mean that people writing configurations that other people use as a base be able to manually add assertions or constraints on whatever is parametrized. But you might also want constraints and validation against the actual runtime environment. "Don't allow stopping this service unless that service is also down", or something.
Some of these desirable properties are of course conflicting. And some are going to be things that aren't universally agreed on. For example there is a real tension between not being able to run arbitrary code from a configuration (important for several of these properties) vs. reuse of data across different services.
My thinking on this is already a few years out of date, it's probable that configuration has again gotten an order of magnitude more complex since the last time I thought about this stuff seriously, and there are already new and annoying properties to think about. The main point is just that this really is a bit more complex than your .emacs file is.
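To make a couple of those properties concrete, here is a minimal sketch (field names made up, nowhere near a full configuration system): treating configuration as frozen plain data merged by pure functions buys reuse, parametrization and reproducibility almost for free, while the harder items above -- propagation of base changes, gradual rollout, debugging at scale -- are exactly what it does not solve.

    from dataclasses import dataclass, replace

    @dataclass(frozen=True)
    class ServiceConfig:
        datacenter: str
        replicas: int
        backend: str
        flags: tuple = ()

    BASE = ServiceConfig(datacenter="unset", replicas=3, backend="storage-prod")

    def for_datacenter(base, dc, **overrides):
        # a pure function of its inputs: no environment reads, no side
        # effects, so the result is reproducible and snapshot-friendly
        return replace(base, datacenter=dc, **overrides)

    canary = for_datacenter(BASE, "dc-east", replicas=1, flags=("canary",))
    prod = for_datacenter(BASE, "dc-east", replicas=50)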
The properties that you listed reminded me of the Nix package manager http://nixos.org/nix/ and NixOS that uses it. I came across it originally while looking into package managers that would allow rollbacks or at least make it easy for multiple installed versions of a package.
I would add that the configured program should spit its final configuration back out as it starts up, either as a set of command-line parameters into a log file or, if that gets out of hand, as a file in /tmp with all externally configurable values explicitly written out. This simplifies the 3am fix problem somewhat, since otherwise some values will always be missing.
Dealing with configuration files is a problem that extends well beyond distributed systems, or even systems programming in general. It's certainly a very important issue.
One of the fundamental problems with configuration files is that they are written in ad-hoc external DSLs. This means that the config files behave nothing like actual software--instead, they have their own special rules to follow and their own semantics to understand. These languages also tend to have very little abstractive power, which leads to all sort of incidental complexity and redundancy. Good software engineering and programming language practices are simply thrown away in config files.
Some of these issues are mitigated in part by using a well-understood format like XML or even JSON. This is great, but it does not go far enough: config files still stand alone and still have significant problems with abstraction and redundancy.
I think a particularly illuminating example is CSS. When you get down to it, CSS is just a configuration file for how your website looks. It certainly isn't a programming language in the normal sense of the word. And look at all the problems CSS has: rules are easy to mess up, you end up copying and pasting the same snippets all over the place, and CSS files quickly become tangled and messy.
These problems are addressed to a first degree by preprocessors like Sass and Less. But they wouldn't have existed in the first place if CSS was an embedded DSL instead of a standalone language.
At the very least, being an embedded language would give it access to many of the features preprocessors provide, for free. You would be able to pack rules into variables, return them from functions and organize your code in more natural ways. Moreover, you could also take advantage of your language's type system to ensure each rule only had acceptable values. Instead of magically not working at runtime, your mistakes would be caught at compile time.
This is particularly underscored by how woefully complex CSS3 is getting. A whole bunch of the new features look like function calls or normal code; however, since CSS is not a programming language, it has to get brand new syntax for this. This is both confusing and unnecessary: if CSS was just embedded in another language, we could just use functions in the host language.
I think this approach is promising for configuration files in general. Instead of arbitrary custom languages, we could have our config files be DSL shallowly embedded in whatever host language we're using. This would simultaneously give us more flexibility and make the configuration files neater. If done correctly, it would also make it very easy to verify and process these configuration files programmatically.
There has already been some software which has taken this approach. Emacs has all of its configuration in elisp, and it's probably the most customized software in existence. My .emacs file is gigantic, and yet I find it much easier to manage than configuration for a whole bunch of other programs like Apache.
Another great example is XMonad. This is really more along the lines of what I really want because it lets you take advantage of Haskell's type system when writing or editing your configuration file. It catches mistakes before you can run them. Since your config file is just a normal .hs file, this also makes it very easy to add new features: for example, it would be trivial to support "user profiles" controlled by an environment variable. You would just read the environment variable inside your xmonad.hs file and load the configuration as appropriate.
Specifying configuration information inside a general-purpose language--probably as an embedded DSL--is quite a step forward from the ad-hoc config files in use today. It certainly won't solve all the problems with configuration in all fields, but I think it would make for a nice improvement across the board.
> This means that the config files behave nothing like actual software--instead, they have their own special rules to follow and their own semantics to understand. These languages also tend to have very little abstractive power, which leads to all sort of incidental complexity and redundancy. Good software engineering and programming language practices are simply thrown away in config files.
I have to disagree that this is the critical problem with system configuration. The typical config has orders of magnitude less complexity than the typical software project. The problem with config files is not the complexity of the configuration itself (except perhaps in Java), but the complexity of the global effects they produce. It doesn't matter how clean and DRY you make your config, and how well-tested, if the software supports too many options that can interact in unpredictable ways, or worse, produce subtle outward facing changes that ripple out to systems that interact with yours.
It's certainly not the only problem with systems configuration. However, it is a broad problem: it applies to more than just systems. And, as you may have guessed from my examples, I don't really do systems stuff all that much. But I do deal with a bunch of other configuration files and formats!
Well I agree the proliferation of formats is a problem, but I think it's an intractable one. General languages solve the problem of "I can't use config language X because it's missing feature Y", but for every project you bring on board you lose one because they don't want that much power in their config files.
Those are the things that would actually get easier: your configuration would be a data structure in the host language rather than a text file. So you wouldn't have to parse it at all. Other programs would have to either be in the same language or have some way of communicating with the host language, but that isn't too bad a restriction.
Similarly, this would make modifying configuration programmatically better. Instead of dealing with serializing and deserializing to random text formats, you would just provide a Config -> Config function. This would also make tools modifying configuration parameters more composable because they wouldn't be overwriting each other's changes unnecessarily.
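A minimal sketch of that idea (keys and tool names invented for illustration): each tool is a pure Config -> Config function over an in-memory structure, and tools compose by function composition instead of rewriting a shared text file.

    from functools import reduce

    def set_log_level(level):
        return lambda cfg: {**cfg, "log_level": level}

    def add_backend(name):
        return lambda cfg: {**cfg, "backends": cfg.get("backends", ()) + (name,)}

    def apply_edits(cfg, *edits):
        # apply each Config -> Config edit in order
        return reduce(lambda c, f: f(c), edits, cfg)

    base = {"log_level": "info", "backends": ("primary",)}
    print(apply_edits(base, set_log_level("debug"), add_backend("replica-2")))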
While reading your comment, I thought of the problems with configuration files I was having in my own project... when I realized, "why not just use python"?
The program (written in go) could simply launch a new process on startup, e.g.
python -c "import imp; c = imp.load_source('c', 'abc.conf'); print '\n'.join('%s = %s' % (n, repr(getattr(c, n))) for n in ['option1', 'option2', 'option3'])"
and read its output (which could also be another well-known format, like JSON) and use it for configuration. Simple, elegant, and it enables very powerful abstractions.
Using a standardized, real programming language (TCL) as a standard for configuration files was suggested something like 15 years ago. RMS shot it down, saying that GNU would never use it and that they would build their own format that all the GNU tools would use (which IIRC never materialized). The problem isn't creating a good language, it's persuading the Linux community to standardize on something.
Well, TCL is not a good language for configuration files imho. I'd favor a declarative approach, because you can analyse (hence debug, check, etc) it better. Inspired by Prolog, but change the syntax. A configuration language should also avoid features like "eval", though Turing-completeness is probably desirable.
Nevertheless I agree that standardization is a problem. Not for the "linux community", but in general. Some people do not require Turing-completeness and (reasonably) do not want it. Others already have a scripting language (Lua,TCL,Python,etc) embedded, so reusing it is a good idea.
All three points in the "wishlist" are real-world problems, something which academic conferences will shun straight away. All work in areas like configuration management must be either (a) evolutionary work based off existing tools, which unfortunately doesn't qualify as good-enough research, or (b) a radically new approach, which is almost useless unless a robust enough solution is developed, deployed, and tested. Researchers prefer multiple papers over perfecting the engineering of software tools.
Hence there is no incentive for academic systems researchers to focus on these (it's no fun having paper after paper shot down).
> The "bug" here was not a software bug, or even a bad configuration: it was the unexpected interaction between two very different (and independently-maintained) software systems leading to a new mode of resource exhaustion.
can be viewed as pointing to the solution:
Broadly in computing, we have 'resource limits', from the largest integer that can be stored in 2's complement in 32 bits to disk partition size, maximum single file size, number of IP ports, etc. Mostly we don't pay really careful attention to all of these resource limits.
Well, for 'systems' such as are being considered in the OP, do consider the resource limits. So, for each such limit, have some software track its usage, report that usage, and especially report when usage of that resource is close to being exhausted. And when the resource is exhausted, then report that the problem was the resource being exhausted and who the users of that resource were.
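A rough sketch of that tracking for one such limit, file descriptors (Linux-specific; the threshold and the reporting are made up):

    import os
    import resource

    def fd_usage():
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        in_use = len(os.listdir("/proc/self/fd"))   # Linux only
        return in_use, soft_limit

    def check_fds(warn_fraction=0.8):
        # report usage, and especially report when close to exhaustion
        in_use, limit = fd_usage()
        if in_use >= warn_fraction * limit:
            print("WARNING: %d of %d file descriptors in use" % (in_use, limit))
        return in_use, limit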
For a little more, we can make some progress by doing simulations of 'networks of queues' with whatever stochastic processes we want to assume for 'work arrival times' and 'work service times'.
For more, do some real-time system monitoring. For problems that are well understood and/or have been seen before, e.g., running out of a critical resource, just monitor for the well-understood aspects of the problem.
Then there are problems never seen before and not yet well understood. Here monitoring is essentially forced to be a continually applied 'statistical hypothesis test' for the 'health and wellness' of the system, complete with rates of 'false positives' (false alarms) and rates of 'false negatives' (missed detections of real problems).
For the operational definition of 'healthy and well', collect some operational data, let it 'age' until other evidence indicates that the system was 'healthy and well' during that time, and then use that data as the definition. Then decree and declare that any data 'too far' from the healthy and well data indicates 'sick'.
Then, as usual in statistical hypothesis testing, we want to know the false alarm rate in advance and want to be able to adjust it. And if we get a 'detection', that is, reject the 'null hypothesis' that the system is healthy and well, then we want to know the lowest false alarm rate at which the particular data observed in real time would still be a 'detection', and use that false alarm rate as a measure of the 'seriousness' of the detection.
If we make some relatively weak probabilistic assumptions about the data, then, in calculating the false alarm rate, we can get some probabilistically 'exchangeable' data. Then we get to apply some group theory (from abstract algebra) and borrow a little from classic ergodic theory to calculate the false alarm rate. And we can use a classic result in measure theory by S. Ulam, which the French probabilist Le Cam nicely called 'tightness', to show that the hypothesis test is not 'trivial'. See Billingsley, 'Convergence of Probability Measures'.
Of course, the 'most powerful' test for each given false alarm rate is from the Neyman-Pearson lemma (the elementary proofs are not much fun, but there is a nice proof starting with the Hahn decomposition, a fast result from the Radon-Nikodym theorem), but for problems never seen before we do not have enough data to use that result.
For the statistical hypothesis test, we need two special properties: (1) Especially for monitoring systems, especially complex or distributed systems, we should have statistical hypothesis tests that are multi-dimensional (multivariate); that is, if we are monitoring once a second, then each second we should get data on each of a certain n variables for some positive integer n. So we are working with n variables. (2) As U. Grenander at Brown U. observed long ago, operational data from computer systems is probabilistically, certainly in distribution, quite different from the data in most of the history of statistics. He was correct. Especially since we are interested in data on several variables, we have no real hope of knowing the distribution of the data even when the system is 'healthy and well', and still less hope when it is sick with a problem we have never seen before. So we need hypothesis tests that are 'distribution-free', that is, that make no assumptions about probability distributions. So we are faced with calculating false alarm rates for multi-dimensional data where we know nothing about the probability distribution.
There is a long history of hypothesis tests for a single variable, including many tests that are distribution-free. See old books by S. Siegel, 'Nonparametric Statistics for the Behavioral Sciences', or E. Lehmann, 'Nonparametrics'. For multi-dimensional hypothesis tests, there's not much, and still less when also distribution-free.
Why multi-dimensional? Because for the two interacting systems in the example in the OP, we guess that to detect the problem we would want data from both systems. More generally, in a large server farm or network we are awash in variables on which we can collect data, at rates from thousands of points per second down to a point every few seconds. So, for n dimensions, n can easily be dozens, hundreds, ....
For such tests, look for "anomaly detector" in 'Information Sciences' in, as I recall, 1999. If you want to implement such a test, you might want to read about k-d trees in, say, Sedgewick, and then think about backtracking in the tree, depth-first, and accumulating some cutting planes.
For monitoring, there was long some work by Profs. Patterson and Fox at Stanford and Berkeley, in their RAD Lab, funded by Google, Microsoft, and Sun. The paper in 'Information Sciences' seems to be ahead.
More can be done. E.g., consider the Rocky Mountains and assume that they are porous to water. Let the mountains be the probability distribution of two-variable data when the system is healthy and well. Now pour in water until, say, the lowest 1% of the probability mass has been covered. Now the test is: observe a point in the water and raise an alarm. Now the false alarm rate is 1%. And the area of false alarms is the largest we can make it for being 1% (a rough surrogate for the best detection rate -- with a fast use of Fubini's theorem in measure theory, we can say more here).
As we know, lakes can have fractal boundaries, and the techniques in the paper will, indeed, approximate those. Also, the techniques do not require that all the lakes be connected. And the dry land need not all be connected either and might be in islands. So it is quite general. But we may want to assume that the region of healthy and well performance is a convex set and try again. If this assumption is correct, then for a given false alarm rate we will get a higher detection rate, that is, a more 'powerful' test.
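As a rough stand-in for that construction (not the method from the 'Information Sciences' paper, just a generic distribution-free detector with the same flavor): score each observation by its distance to its k-th nearest 'healthy and well' point, and put the alarm threshold at the quantile of the healthy scores that gives the false alarm rate chosen in advance.

    import numpy as np

    def fit_detector(healthy, k=10, false_alarm_rate=0.01):
        # healthy: array of shape (n_samples, n_dims) of 'healthy and well' data
        healthy = np.asarray(healthy, dtype=float)
        d = np.linalg.norm(healthy[:, None, :] - healthy[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)                  # ignore self-distances
        scores = np.sort(d, axis=1)[:, k - 1]        # k-th nearest-neighbor distance
        # threshold at the (1 - rate) quantile: on exchangeable data the
        # false alarm rate is set in advance, as described above
        threshold = np.quantile(scores, 1.0 - false_alarm_rate)
        return healthy, k, threshold

    def is_anomalous(model, x):
        healthy, k, threshold = model
        d = np.sort(np.linalg.norm(healthy - np.asarray(x, dtype=float), axis=1))
        return d[k - 1] > threshold                  # "a point in the water"

A k-d tree, as suggested above, would replace the brute-force distance computations at scale.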
Still more can be done. But, really, the work is essentially some applied math with, at times, some moderately advanced prerequisites from pure and applied math. Theme: the crucial content of the most powerful future of computing is in math, not computer science. Sorry 'bout that!
All of this is done at Google - there's extensive monitoring for all production systems, with alerts firing once parameters move outside of their statistical "normal" range. In practice the range tends to get set more by "How tight can we make it before we get really annoyed by these pagers?" than any rigorous statistical method, but the run-time operation carefully measures average & extreme values and determines when a spike is just a momentary anomaly vs. when it's worth paging someone.
The problem is that inevitably the next problem is exhaustion of some resource that nobody even thought of as a resource. In my time at Google, some of the resource limits I've faced have included (besides the obvious RAM/disk/CPU): binary size, size of linker inputs, average screen size of websearch users, time required for an AJAX request, byte-size of the search results page, visual space on the monitoring UI, size of a Javascript alert box, maximum length of a URL in Internet Explorer, maximum length of a request in Google's load-balancers, time until someone submits a CL touching the same config file as you, and time before a cosmic ray causes a single-bit error in memory.
File descriptors are a comparatively obvious resource, but nobody thought of them because in previous normal operation they never even came close to running out. Should someone set an alert on "wind speed running through the Columbia River Gorge" (which would've been responsible for an actual outage had there not been significant redundancy in the system, BTW)?
> All of this is done at Google - there's extensive monitoring for all production systems, with alerts firing once parameters move outside of their statistical "normal" range. In practice the range tends to get set more by "How tight can we make it before we get really annoyed by these pagers?" than any rigorous statistical method, but the run-time operation carefully measures average & extreme values and determines when a spike is just a momentary anomaly vs. when it's worth paging someone.
Really, "All"?
The first question is 'parameters'. Usually, and likely at Google, they are monitoring just one parameter at a time. This is a really big bummer. E.g., even if we do this for several parameters, we are forced into assuming that the 'healthy and well' geometry of the parameters is just a rectangle. Bummer that hurts the combination of false alarm rate and detection rate.
Next, the "normal range" is a bummer because it assumes that what is healthy and well is just a 'range', that is, an interval, and this is not always the case. The result, again, is a poor combination of false alarm rate and detection rate.
Again, yet again, please read again, just for you, as I wrote very clearly: to do well, this just must be multi-dimensional. I doubt that there is so much as a single large server farm or network in the world that is doing anything at all serious working with multidimensional data for monitoring. Not one.
Next, your remark about false alarm rate points to a problem I point out is solved: the method, with meager assumptions, permits knowing the false alarm rate in advance and setting it, actually setting it exactly.
For how many and which variables to monitor, yes, that is a question that can need some overview of the server farm or network and some judgment, but there is some analytical work that should help.
As for "rigorous" statistics, the point is not 'rigor' but useful power. Being multidimensional, knowing the false alarm rate, being able to adjust it, etc., are powerful.
[1] http://chester.id.au/2012/06/27/a-not-sobrief-aside-on-reign... (warning, self linkage)
[2] http://www.cs.nott.ac.uk/~cah/G51ISS/Documents/NoSilverBulle...