What I wish systems researchers would work on (matt-welsh.blogspot.com)
93 points by ivanist on May 16, 2013 | 57 comments



Computer systems, which are deterministic, have long since passed out of the realm of being understandable.

They're chaotic systems[1]. Parameters which are below the threshold of detection can drastically affect their trajectory in phase space. Accidental complexity has zoomed off to live a life all its own, and that takeoff has been continuous for decades [2].

Everything we build around software that's not directly bearing on the problem domain is chipping away at this or that aspect of the ungoverned bits. As we corral each new level of unpredictability, we merely set the scene for the next level of complexity.

There is no universal solution. There is no final boss which, once defeated, leads to the scrolling names and 8-bit theme song. It never stops, ever, because the value of systems that frequently fail unpredictably is lower than the cost of flawless systems, yet higher than the value of not building such systems at all.

Edit: sounds very unhelpful, even defeatist. I know. I'm deeply ambivalent about what can be done versus what we should assume is possible. I'm a fan of serious software engineering and tool support and all that jazz ... yet thoroughly pessimistic about the limits of the possible. Deep sighs for every occasion!

[1] http://chester.id.au/2012/06/27/a-not-sobrief-aside-on-reign... (warning, self linkage)

[2] http://www.cs.nott.ac.uk/~cah/G51ISS/Documents/NoSilverBulle...


They're not even deterministic. Even if your input is totally ordered, entropy from the memory subsystem will bubble up to communicating threads if you have more than one. Sometimes one thread will read the newly written value, sometimes the old. That said, there is some clever research out of UW into how to make such systems deterministic while still recovering almost all parallelism.
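
To make that concrete, here is a toy illustration of the nondeterminism (in Python, so it demonstrates scheduling rather than the memory subsystem itself; the script is my own sketch, not from the UW work): two threads sharing a variable observe different values from run to run even though the input never changes.

  import threading

  shared = 0

  def writer():
      global shared
      for i in range(1, 1_000_001):
          shared = i              # keep publishing new values

  def reader(samples):
      for _ in range(10):
          samples.append(shared)  # old or new value, depending on timing

  samples = []
  t1 = threading.Thread(target=writer)
  t2 = threading.Thread(target=reader, args=(samples,))
  t1.start(); t2.start()
  t1.join(); t2.join()
  print(samples)                  # differs between runs with identical input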


I think the more important point jacques_chester was making is that even if computer systems were 100% deterministic it wouldn't matter, because they are so complex that they become chaotic systems. And then hypothetical determinism won't save you, because there are too many variables in play simultaneously, and even the slightest variation in any one of them could change the behavior completely.


Sorry guys, but I don't buy your defeatist arguments for one second. Sure, using traditional tools, you may be right. However, all we have to do is look at the brain with its 100 trillion connections (synapses) or a CPU with its billions of transistors to see that your logic is flawed. If it were not possible to design fault-tolerant systems at large scale, neither of these two examples could exist.

I don't claim to have the answers, but a few years ago I had to help some EEs write testing software in LabVIEW. The first thing that struck me was how immune the system was to bad data; the second was how elegantly it took advantage of our multiprocessing and multithreading hardware.

What I wish systems researchers would work on is concurrent, signal-based, synchronous languages. This is a winnable game, but we must swallow the red pill and change the rules of the game with new tools and fresh ideas.


Just reading your second reference, the bit about expert systems is pretty interesting . . . what's your stance on this? It seems that, given sufficient instrumentation information, one could probably construct a fairly decent expert system that might be able to shed some light onto these sorts of problems. (Keep in mind the fact that I'm still very much a junior software developer!)

The example of FDs from the original article seems like a good one; a properly developed expert system might have been able to notice a lack of changes in the current software environment and suggest looking elsewhere. (Problem is, if I got that response from my automated problem-solver I'd just ignore it and keep looking . . .)

So, any insightful points to crush my youthful idealism? :)


Expert systems for monitoring and management of large server farms and networks? Been there. Done that. Got the T-shirt. Wrote papers. Gave one at an AAAI IAAI conference at Stanford. Shipped two commercial products.

First problem: you need some 'experts'. So, the system only knows about the problems the experts do. Their knowledge is supposed to be mostly just empirical, from their 'expert experience'. Even if they have 'deep knowledge', that is, of how the systems work internally (e.g., the engine connects to the torque converter connects to the transmission connects to the rear differential connects to the rear wheels), the specific problems they know about are usually just the ones they have encountered. So, essentially such a system can only address problems that have been seen before and are well understood.

But the problems in the OP were being seen for the first time. Bummer. Ironically, if the problems had been seen before and were well understood, then why the heck had they not already been solved?

Second, with expert systems, we're working just intuitively from experience. So, we have little idea whether what we are doing is the best possible in any reasonable sense, or even at all good. In particular, in monitoring, the main, first operational problem on the 'bridge' or in the 'NOC' (network operations center) is a false alarm rate that is too high with no good way to lower it (except just to ignore some alarms, but we should be able to do much better).

Net, for a serious attack on system monitoring and management, I can't recommend taking expert systems very seriously.


Would love to get an email (address in profile) from you to chat about some tangential stuff (re: intelligence in monitoring systems).


This is a great topic, the idea that contemporary computer systems are not atomically understandable or model-able. And it's a topic that's almost entirely absent from computer science discourse.

We passed the point quite a while back where "applied CS" -- what it's like to build and maintain scalable systems in the real world -- looks a lot more like anthropology than like mathematics.

We have tools like sampling profilers, loggers that do stateful packet inspection on a small percentage of traffic, and sophisticated A/B testing frameworks that reason about user populations rather than individuals. But I think we could probably build a whole 'nother generation of these tools by thinking about this problem of understanding complex systems from some new perspectives. I nominate anthropology as a discipline that has many years under its belt wrestling with the impossibility of complete description. (And wrestling with other tough issues, sometimes not well, but that's another conversation.)

I'm not talking about the kind of "anthropology" work that companies like Intel have done for a long time, which are firmly in the social sciences tradition and are really user studies. (Though that work is valuable.) I'm talking about porting ideas such as "thick description" to our technical toolkit.

  http://en.wikipedia.org/wiki/Thick_description


No one's asking for a solution to complexity; I agree there's no such thing. What we need are better ways to manage complexity -- and that is absolutely within our grasp.


Speaking as a student studying systems (particularly security and OS development) . . . the second point sounds really interesting.

Recently I've been working on a simple microkernel (called Sydi) that should actually make this possible. It's a distributed system straight down to the kernel: when two instances of the kernel detect each other running on a network, they freely swap processes and resources between each other as required to balance the load. This is a nightmare from a security perspective, of course, and I haven't figured out the authentication bits yet -- too busy working on an AML interpreter because I don't like ACPICA -- but that's not the issue here. (It's far from finished, but I'm pretty confident that I've worked out the first 85% of the details, with just the remaining 115% left, plus the next 95% of the programming . . .)

Due to some aspects of its design (asynchronous system calls via message-passing, transparent message routing between nodes in the network, etc) I feel that it will be entirely possible to take a snapshot of an entire cluster of machines running the kernel. It would be expensive -- requiring a freeze of all running processes while the snapshot is taking place to maintain any level of precision -- but I'm confident I could code that in during the space of a week or two . . .

I haven't thought much about what kind of analysis one could do on the instrumentation/snapshot results, though. I'm sadly too inexperienced with `real-world' systems stuff to be able to say. Anyone have any suggestions for possible analysis avenues to explore?


> This is a nightmare from a security perspective, of course

Well, only because you're trusting your hardware. Most people take that on faith anyway.

If you're willing to trust the hardware, I'm fairly sure software-based isolation and security is relatively doable.


This sounds interesting! Have you published anything more substantial on it? I yearn for the day when I can have a Plan 9-ish resource sharing and file system connecting my personal computer to my cloud computers.


> It would be expensive -- requiring a freeze of all running processes while the snapshot is taking place to maintain any level of precision.

Have you thought about distributed algorithms like the Chandy-Lamport Snapshot algorithm (http://en.wikipedia.org/wiki/Snapshot_algorithm) that take a consistent snapshot and do not require a system freeze?
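
For reference, here is a rough per-process sketch of the Chandy-Lamport marker logic, assuming reliable FIFO channels and a runtime-provided send(dest, msg); the class and names are illustrative, not from any particular implementation.

  import copy

  MARKER = "MARKER"   # special marker message

  class SnapshotProcess:
      def __init__(self, pid, in_channels, out_pids, send):
          self.pid = pid
          self.in_channels = set(in_channels)  # channels we receive on
          self.out_pids = out_pids             # processes we send markers to
          self.send = send                     # send(dest_pid, msg), provided by the runtime
          self.state = {}                      # local application state
          self.recorded_state = None           # local part of the snapshot
          self.channel_state = {}              # channel -> messages in flight
          self.open_channels = set()           # channels still being recorded

      def start_snapshot(self):
          # The initiator records its state and floods markers; no global freeze.
          self._record_local()

      def on_receive(self, channel, msg):
          if msg == MARKER:
              if self.recorded_state is None:
                  self._record_local()
              self.channel_state.setdefault(channel, [])   # empty if nothing buffered
              self.open_channels.discard(channel)          # stop recording this channel
              return
          if self.recorded_state is not None and channel in self.open_channels:
              # message was in flight when the snapshot started; record it
              self.channel_state.setdefault(channel, []).append(msg)
          # ... normal application handling of msg goes here ...

      def snapshot_complete(self):
          return self.recorded_state is not None and not self.open_channels

      def _record_local(self):
          self.recorded_state = copy.deepcopy(self.state)
          self.open_channels = set(self.in_channels)
          for dest in self.out_pids:
              self.send(dest, MARKER)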


I want to register a complaint. Everyone talks about wanting ground-breaking research and innovation, but in my own personal experience I see little-to-no support for it.

It reminds me of the scene from the film Election where a character intones, after you've seen him ski down a double-black-diamond slope and crash, "Last winter when I broke my leg, I was so mad at God."

We're spending money to squeeze the last drops of gold out of the research done in the 60's and 70's, while research starves, and then complaining that there's nothing new. I don't see how we can expect anything different.


I originally opened the link expecting to conclude the same thing; however, if you read the article again, I think the guy is a) highly qualified to complain (after all, he chaired HotOS's PC, no small recognition of his stature in the community) and b) raising good points.

Systems are about interactions, and he has identified the challenges in analyzing systems.

This isn't so much about work from the 60s and 70s, but about modern architectures and the new things they enable. We're still working with models from the 60s and 70s, ignoring various aspects of new system architectures, and wondering why we still suck.


Thanks, I think you're right that I need to read it again. I have a personal issue with this and it leads to knee-jerk reactions. Thanks for that.



It's a great comment, graycat, and I couldn't agree more. Where we might differ is that I think the future of computing is in Computer Science -- but one that will lead to an extension of mathematics (specifically set theory). But we're essentially saying the same thing.


For where computing will be based, currently there are big cultural issues:

With one paper I published, in a computer science journal for a problem in computer science but basically some applied math, I had to discover that, really, it was tough for the computer science community to review the paper. I sent an 'informal' submission to a nice list of the top journals appropriate for the paper. From one Editor in Chief of one of the best journals and a chaired professor of computer science at one of the best universities I got back "Neither I nor anyone on my board of editors has the mathematical background to review your paper." Then I got a similar statement from another such person. For one at MIT, I wrote him background tutorials for two weeks before he gave up. Finally I found a journal that really wanted the paper, but apparently in the end only the editor in chief reviewed the paper and did so likely by walking it around his campus to the math department to check the math and then to the computer science department to check the relevance for computer science.

I had to conclude: For the good future of computer science via math, the current crop of CS profs just didn't take the right courses in college and graduate school.

That's one side of the cultural divide. For the other side, (1) the math departments always want to be as pure as possible and f'get about applied math, and (2) the applied math people really don't like doing computer science. Some obscure math of relativity theory, maybe, but mostly nothing as practical as computer science.

Of course, the big 'cross-cutting' exception is the problem P versus NP, apparently first encountered in operations research (integer linear programming is NP-complete), later in computer science with SAT, and now at times taken seriously in math, e.g., at Clay Math.

Here's a 'reason' for the math: When we write software, we need something prior to the software, say a 'specification', that says what the software is to do. Now, where is computer science going to get that specification? A big source has been just to program what we already know how to do, at least in principle, manually. After that, computer science starts to lose it and drifts into 'expert systems' (program, with 'rules' and Forgy's RETE algorithm, what an expert says), intuitive heuristics, and various kinds of brute-force fitting (machine learning, neural networks), where basically we fit to the 'training' data, test the fit with the rest of the data, and stop when we get a good fit. So we throw fitting methods at the data until something appears to stick.

We need more powerful means of getting that prior specification. The advantage of math is that it can start with a real problem, formulate it as a math problem, solve the math problem, and then let the math solution, and what it says we need to do in manipulating the data, be the specification for the core software. E.g., if we want to design an airplane on a computer, then we start with the applied math of structural engineering, mechanical engineering, and aeronautical engineering, program that applied math, and then design the plane. For software to navigate a spacecraft to the outer planets, we start with Newton's second law and law of gravity, get the differential equations of motion, get some numerical means of solution, and then program what the numerical analysis says to do.
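
As a toy version of that spacecraft example (my own sketch, with illustrative constants and a simple leapfrog integrator standing in for "what the numerical analysis says to do"):

  # Newton's second law plus gravity: r'' = -GM * r / |r|^3.
  # The numerical scheme then dictates exactly what the program computes.
  G_M = 1.327e20                  # gravitational parameter of the Sun, m^3/s^2

  def accel(x, y):
      r3 = (x * x + y * y) ** 1.5
      return -G_M * x / r3, -G_M * y / r3

  def integrate(x, y, vx, vy, dt, steps):
      ax, ay = accel(x, y)
      for _ in range(steps):
          # leapfrog (kick-drift-kick) step
          vx += 0.5 * dt * ax; vy += 0.5 * dt * ay
          x += dt * vx; y += dt * vy
          ax, ay = accel(x, y)
          vx += 0.5 * dt * ax; vy += 0.5 * dt * ay
      return x, y, vx, vy

  # Earth-like start: 1 AU from the Sun, ~29.8 km/s tangential, one-hour steps.
  print(integrate(1.496e11, 0.0, 0.0, 2.978e4, dt=3600.0, steps=24 * 365))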

We need things solid and prior to the software to know what software to write, and basically that means we need a math solution first.

For computing itself, as in monitoring, we can call that a problem in statistics -- more applied math.

A lot in computer load balancing is some serious applied math. Actually optimal job scheduling is awash in NP-complete optimization problems.

Or, we used to have 'metaphysics'. Then physics became mathematical and made real progress. Basically the solid logical chain of correctness given by math theorems and proofs is just too darned hard to compete with or, thus, ignore.


Wow... That's a tremendous response. Thank you.

Could you please send me links to everything you have ever written? Please. My email is in my profile.


All praise is welcome!

For now I'm anonymous at Hacker News.

Sorry.


i find it "interesting" that all the complains are about useability from a "userland" developer pov. Generally what comes from that are patched up systems to "run stuff in kernel" because "its safe and fast"

Which of course is not true. There are, however, various attempts to make a "proper" "modern" OS, but all of these have failed due to the lack of commercial backing. There is apparently no financial interest in having a secure, scalable, fast, deterministic, simple OS right now (which, oh, requires rewriting all your apps, obviously).

Even Microsoft tried. Yes, Microsoft built an excellent such OS.


They haven't given up yet.

The Midori project is still secret, so who knows what they are doing there, while some of the team is nowadays playing with the concept of pico-hypervisors and running the full OS as a library instead.

http://research.microsoft.com/en-us/projects/drawbridge/


I somehow hadn't noticed it, even though it's not new. Thanks :)


Well, there is a lot of interest in a secure, scalable, fast, deterministic, simple OS which is also backwards compatible. ;)

What do you mean by "safe", though? How is kernel-space safer than user-space?


Note that the part where I mentioned kernel and safe used quotes because it's sarcastic.

These modern OSes do have kernels that are much safer than what we use day to day, however. They're indeed not in C. In general the core is typed ASM, and a managed language runs right on top. The kernel is generally written in that and uses various concepts that separate privileges as much as possible, basically for security, reliability and predictability. That includes kernel components, and obviously drivers. Messages between components are also generally declared ("contracted") with well-defined types.


By being written in systems programming languages safer than C.


That would require building consensus around such a language. I mean, yes, this is one of my own hobby projects, but Rust is the most fully-developed attempt made so far and it still doesn't have the kind of developer backing that C did in the BSD days.


I am old enough to remember when C only had backing from people working on UNIX systems, while other systems were either using Assembly or another high level language e.g. PL/I, Algol or Pascal dialect.


This could catch some programming bugs, sure, but it wouldn't make the OS much safer. You can still have attacks on integrity, privacy and availability.


Agreed, but the attack window would be smaller, that is what counts.

If you really want to attack a system, nothing beats social engineering anyway.


CS research has been pathological for a long time.

Anybody near the field has their own list of "taboo" research topics that are very interesting but just don't get funding.

Then you go to the library and realize they have 120 shelf-feet of conference proceedings about Internet QoS that have gone precisely nowhere.


Configuration, performance, package management, reliability/robustness, security, monitoring, virtualization, and distributed computing. Those are the big problems and missing pieces that people doing systems-level engineering are facing today. And, not coincidentally, these are the areas where you see the most repeated instances of tools trying to address them.

Consider package management. You have an OS level package manager, you have OS level application management (on android, for example), you have node package management, you have ruby gems, you have perl and python packages, you have wordpress themes and plugins, etc. Obviously it doesn't make sense to have a single global package manager, but what about standards for package management? Competing package management infrastructures which serve as foundations for all these lower level systems? A mechanism for aggregating information and control for all of the package management sub-systems?

As the article points out a lot of systems folks and OS folks are still retreading the same paths. They want to pave those paths, then build moving walkways, and so forth. That's all fine and good, but there's a lot of wilderness out there, and a lot of people are out in that wilderness hacking away with machetes and pickaxes building settlements, but for some reason most OS folks don't even imagine that roads can go there too.


As far as the ruby/python/perl package problem goes, I think we're heading towards a standard with virtualenv-style tools and potted environments. The simplest possible standard there is a <foo>/bin/activate script to set the right environment variables; you can build pretty much anything else you want on top of that.


I think all of these areas are making great progress.

> An escape from configuration hell

debconf. chef, puppet. git. augeas. gsettings. In the Windows world, the registry.

There's tons of work being done in this area. You may say that more can be done, or that it isn't good enough, but I think you're wrong to claim that nobody is working on it.

> Understanding interactions in a large, production system

How about the Google Prediction API? Send the data in, and look for changes when you have a problem that you're trying to fix, such as in your example.

> Pushing the envelope on new computing platforms

This is incredibly vague. I find it difficult to understand your definition of "new computing platforms". Based on my interpretation of what you mean, I think a modern smartphone might qualify: a new form factor, more ubiquitous than before, novel UX (capacitive touchscreen), and an array of interesting sensors (accelerometers, compasses, GPS).

I think your post highlights the lamentation of AI researchers: we don't have a good, static measure of success. Once we've achieved something, our definition changes to exclude it.


None of those configuration management projects really advance the state of the art beyond cfengine which has been around for 20 years. They're still just pushing text files around.


Could you expand on your advice about the Google Prediction API? It seems like you're saying it can be used as a fault detector.


I think people replying here aren't quite appreciating the complexity of configuration. One might think that configuration doesn't get harder with scale, but it does. Suggesting that the solution is simply using a general-purpose language for configuration, or storing configs in a key-value DB or in a version control system, is almost adorable. It's basically completely missing where the pain points are.

And you can be sure that a company like Google will have done configuration using flat files, key-value stores, DSLs, general purpose languages that generate config files, general purpose languages that actually execute stuff, and probably many other ways :-) If there were simple solutions like that, they'd already be in use.

Here's a small list of desirable properties:

- Configurations must be reusable, composable and parametrizable. When running a system with multiple components, you need to be able to take the base configurations of those components and merge them into a useful combination, rather than writing the configuration from scratch. You must also be able to parametrize those base configurations in useful ways -- which external services to connect to, which data center to run in and on how many machines, etc. And note -- this can't be a one-time thing. If the base configurations change, there needs to be some way of eventually propagating those changes to all users.

- Configurations must be reproducible. If the configuration is a program, it shouldn't depend on the environment it's run in (nor should it be able to have side effects on the environment). Why? Because when somebody else needs to fix your system at 3am, the last thing you want is for them to have to worry about exactly replicating the previous person's setup.

- Tied to the previous point, configurations also need to be static or snap-shottable. If a server crashes and you need to restart a task on another machine, it'd be absolutely inexcusable for it to end up using a different configuration than the original task due to e.g. time dependencies in programmatic configuration.

- It must be possible to update the configuration of running services without restarting them, and it must be possible to roll out config changes gradually. During such a gradual rollout you need to have options for manual and automatic rollback if the new configuration is problematic on the canaries.

- Configurations need to be debuggable. Once your systems grow to having tens of thousands of lines of configuration, or hundreds of thousands of programmatically generated key-value pairs, the mere act of figuring out "why does this key have this value" can be a challenge.

- It'd be good if configurations could be validated in advance, rather than only when actually starting the program that reads the configuration. At a minimum, this would probably mean that people writing configurations that other people use as a base should be able to manually add assertions or constraints on whatever is parametrized. But you might also want constraints and validation against the actual runtime environment. "Don't allow stopping this service unless that service is also down", or something.

Some of these desirable properties are of course conflicting. And some are going to be things that aren't universally agreed on. For example there is a real tension between not being able to run arbitrary code from a configuration (important for several of these properties) vs. reuse of data across different services.

My thinking on this is already a few years out of date; it's probable that configuration has again gotten an order of magnitude more complex since the last time I thought about this stuff seriously, and there are already new and annoying properties to think about. The main point is just that this really is a bit more complex than your .emacs file.
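
To make the first two bullets concrete, here's a toy sketch of configuration as pure data plus pure functions: reusable and composable via merging, reproducible because the result depends only on the parameters passed in. All the names are made up for illustration.

  def merge(base, override):
      """Recursively merge override into base, returning a new dict."""
      out = dict(base)
      for key, value in override.items():
          if isinstance(value, dict) and isinstance(out.get(key), dict):
              out[key] = merge(out[key], value)
          else:
              out[key] = value
      return out

  BASE_SERVER = {
      "threads": 8,
      "log_level": "info",
      "storage": {"backend": "bigtable", "replicas": 3},
  }

  def server_config(datacenter, num_machines, storage_overrides=None):
      """Parametrized and reproducible: same arguments -> same configuration."""
      cfg = merge(BASE_SERVER, {"datacenter": datacenter, "machines": num_machines})
      if storage_overrides:
          cfg = merge(cfg, {"storage": storage_overrides})
      return cfg

  # Two services sharing one base, parametrized differently:
  frontend = server_config("dc-east", 200)
  batch = server_config("dc-west", 50, {"replicas": 1})

Of course, this says nothing about gradual rollout, canarying or debugging, which is exactly where the hard parts in the later bullets start.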


The properties that you listed reminded me of the Nix package manager (http://nixos.org/nix/) and NixOS, which uses it. I came across it originally while looking into package managers that would allow rollbacks, or at least make it easy to have multiple versions of a package installed.


I had the same thought. (I haven't used Nix, only read about it.)


I would add that the configured program should spit its final configuration back out as it starts up, either as a set of command-line parameters into a log file or, if that gets out of hand, as a file in /tmp with all externally configurable values explicitly written out. This simplifies the 3am fix problem somewhat, since otherwise some values will always be missing.


Having an option to generate a fully populated configuration file with all of the defaults in place can be really helpful as well.
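
A minimal sketch of both suggestions (merge the defaults with whatever was supplied, then write the fully resolved result somewhere a 3am responder can find it; the names and paths are illustrative):

  import json, os, tempfile

  DEFAULTS = {"port": 8080, "threads": 4, "log_level": "info"}

  def resolve_config(overrides):
      cfg = dict(DEFAULTS)        # start from the defaults...
      cfg.update(overrides)       # ...then apply whatever was actually supplied
      return cfg

  def dump_effective_config(cfg):
      path = os.path.join(tempfile.gettempdir(), "myservice.effective.json")
      with open(path, "w") as f:
          json.dump(cfg, f, indent=2, sort_keys=True)
      return path

  if __name__ == "__main__":
      cfg = resolve_config({"port": 9000})   # e.g. parsed from flags or a config file
      print("effective config written to", dump_effective_config(cfg))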


Dealing with configuration files is a problem that extends well beyond distributed systems, or even systems programming in general. It's certainly a very important issue.

One of the fundamental problems with configuration files is that they are written in ad-hoc external DSLs. This means that the config files behave nothing like actual software -- instead, they have their own special rules to follow and their own semantics to understand. These languages also tend to have very little abstractive power, which leads to all sorts of incidental complexity and redundancy. Good software engineering and programming language practices are simply thrown away in config files.

Some of these issues are mitigated in part by using a well-understood format like XML or even JSON. This is great, but it does not go far enough: config files still stand alone and still have significant problems with abstraction and redundancy.

I think a particularly illuminating example is CSS. When you get down to it, CSS is just a configuration file for how your website looks. It certainly isn't a programming language in the normal sense of the word. And look at all the problems CSS has: rules are easy to mess up, you end up copying and pasting the same snippets all over the place, and CSS files quickly become tangled and messy.

These problems are addressed to a large degree by preprocessors like Sass and Less. But they wouldn't have needed to exist in the first place if CSS were an embedded DSL instead of a standalone language.

At the very least, being an embedded language would give it access to many of the features preprocessors provide for free. You would be able to pack rules into variables, return them from functions and organize your code in more natural ways. Moreover, you could also take advantage of your language's type system to ensure each rule only had acceptable values. Instead of magically not working at runtime, your mistakes would be caught at compile time.

This is particularly underscored by how woefully complex CSS3 is getting. A whole bunch of the new features look like function calls or normal code; however, since CSS is not a programming language, it has to get brand new syntax for this. This is both confusing and unnecessary: if CSS was just embedded in another language, we could just use functions in the host language.

I think this approach is promising for configuration files in general. Instead of arbitrary custom languages, we could have our config files be DSLs shallowly embedded in whatever host language we're using. This would simultaneously give us more flexibility and make the configuration files neater. If done correctly, it would also make it very easy to verify and process these configuration files programmatically.

There has already been some software which has taken this approach. Emacs has all of its configuration in elisp, and it's probably the most customized software in existence. My .emacs file is gigantic, and yet I find it much easier to manage than configuration for a whole bunch of other programs like Apache.

Another great example is XMonad. This is really more along the lines of what I really want because it lets you take advantage of Haskell's type system when writing or editing your configuration file. It catches mistakes before you can run them. Since your config file is just a normal .hs file, this also makes it very easy to add new features: for example, it would be trivial to support "user profiles" controlled by an environment variable. You would just read the environment variable inside your xmonad.hs file and load the configuration as appropriate.

Specifying configuration information inside a general-purpose language--probably as an embedded DSL--is quite a step forward from the ad-hoc config files in use today. It certainly won't solve all the problems with configuration in all fields, but I think it would make for a nice improvement across the board.
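
A small sketch of what that can look like, with Python dataclasses standing in for the richer type system you'd get from something like Haskell (the rule names and fields are invented for illustration):

  from dataclasses import dataclass

  @dataclass
  class Rule:
      selector: str
      color: str
      padding_px: int = 0

      def __post_init__(self):
          # mistakes are caught when the config is built, not when it's used
          if self.padding_px < 0:
              raise ValueError("padding must be non-negative")

  def themed(selector, theme):
      # config values can come from ordinary functions in the host language
      return Rule(selector, color=theme["fg"], padding_px=theme["pad"])

  DARK = {"fg": "#dddddd", "pad": 4}

  CONFIG = [
      Rule("body", color="#202020"),
      themed("h1", DARK),
      themed("pre", DARK),
  ]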


> This means that the config files behave nothing like actual software--instead, they have their own special rules to follow and their own semantics to understand. These languages also tend to have very little abstractive power, which leads to all sort of incidental complexity and redundancy. Good software engineering and programming language practices are simply thrown away in config files.

I have to disagree that this is the critical problem with system configuration. The typical config has orders of magnitude less complexity than the typical software project. The problem with config files is not the complexity of the configuration itself (except perhaps in Java), but the complexity of the global effects they produce. It doesn't matter how clean and DRY you make your config, and how well-tested, if the software supports too many options that can interact in unpredictable ways, or worse, produce subtle outward facing changes that ripple out to systems that interact with yours.


It's certainly not the only problem with systems configuration. However, it is a broad problem: it applies to more than just systems. And, as you may have guessed from my examples, I don't really do systems stuff all that much. But I do deal with a bunch of other configuration files and formats!

It's more a comment on config formats in general.


Well I agree the proliferation of formats is a problem, but I think it's an intractable one. General languages solve the problem of "I can't use config language X because it's missing feature Y", but for every project you bring on board you lose one because they don't want that much power in their config files.


You really don't want to use a Turing-complete language for configuration. It makes all sorts of things impossible such as:

- Automatically scanning for insecure configurations (eg. OpenSCAP)

- Parsing the configuration in other programs.

- Modifying the configuration programmatically (cf. Puppet et al)

Also, http://augeas.net/


Those are the things that would actually get easier: your configuration would be a data structure in the host language rather than a text file. So you wouldn't have to parse it at all. Other programs would have to either be in the same language or have some way of communicating with the host language, but that isn't too bad a restriction.

Similarly, this would make modifying configuration programmatically better. Instead of dealing with serializing and deserializing to random text formats, you would just provide a Config -> Config function. This would also make tools modifying configuration parameters more composable because they wouldn't be overwriting each other's changes unnecessarily.
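
A hedged sketch of that Config -> Config idea (field names invented): each tool contributes a function over the configuration value, and applying the functions in order composes their changes instead of overwriting a shared text file.

  def enable_tls(cfg):
      return {**cfg, "tls": True, "port": 443}

  def add_log_sink(sink):
      def update(cfg):
          return {**cfg, "log_sinks": cfg.get("log_sinks", []) + [sink]}
      return update

  def apply_all(cfg, *updates):
      for update in updates:
          cfg = update(cfg)
      return cfg

  base = {"port": 80, "log_sinks": ["stderr"]}
  print(apply_all(base, enable_tls, add_log_sink("syslog")))
  # -> {'port': 443, 'log_sinks': ['stderr', 'syslog'], 'tls': True}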


On the off chance you see this later, I think you want a Turing-complete language, but one without an interesting standard library.

> - Automatically scanning for insecure configurations (eg. OpenSCAP)

Since the language can't access the outside world, the worst it can do is use unbounded space or time. Just verify that it halts in a couple ms.

> - Parsing the configuration in other programs.

You don't parse, you embed an interpreter and execute.

> - Modifying the configuration programmatically (cf. Puppet et al)

Code generation isn't that hard.


While reading your comment, I thought of the problems with configuration files I was having in my own project... when I realized, "why not just use python"?

The program (written in Go) could simply launch a new process on startup, e.g.

  python -c "import imp; c = imp.load_source('c', 'abc.conf'); print '\n'.join('%s = %s' % (n, repr(getattr(c, n))) for n in ['option1', 'option2', 'option3'])"
and read its output (which could also be another well-known format, like JSON) and use it for configuration. Simple, elegant, and enables very powerful abstractions.


Using a standardized, real programming language (Tcl) as a standard for configuration files was suggested something like 15 years ago. RMS shot it down, saying that GNU would never use it and that they would build their own format that all the GNU tools would use (which IIRC never materialized). The problem isn't creating a good language, it's persuading the Linux community to standardize on something.


Well, Tcl is not a good language for configuration files IMHO. I'd favor a declarative approach, because you can analyse it (hence debug, check, etc.) better. Inspired by Prolog, but with a different syntax. A configuration language should also avoid features like "eval", though Turing-completeness is probably desirable.

Nevertheless I agree that standardization is a problem. Not just for the "Linux community", but in general. Some people do not require Turing-completeness and (reasonably) do not want it. Others already have a scripting language (Lua, Tcl, Python, etc.) embedded, so reusing it is a good idea.


Turing completeness is maybe not desirable... look at what happened when YAML was allowed to create objects of any class it liked.

I like your declarative language idea very much, though. As long as it cannot execute arbitrary code.


All three points in the "wishlist" are real-world problems, something academic conferences will shun straight away. All work in areas like configuration management must be either a) evolutionary work based on existing tools, which unfortunately doesn't qualify as good-enough research, or b) a radically new approach, which is almost useless unless a robust enough solution is developed, deployed, and tested. Researchers prefer multiple papers over perfecting the engineering of software tools.

Hence there is no incentive for academic systems researchers to focus on these (it's no fun having paper after paper shot down).


In the OP, the

> The "bug" here was not a software bug, or even a bad configuration: it was the unexpected interaction between two very different (and independently-maintained) software systems leading to a new mode of resource exhaustion.

can be viewed as pointing to the solution:

Broadly in computing, we have 'resource limits', from the largest integer that can be stored in 2's complement in 32 bits to disk partition size, maximum single file size, number of IP ports, etc. Mostly we don't pay really careful attention to all of these resource limits.

Well, for 'systems' such as are being considered in the OP, do consider the resource limits. So, for each such limit, have some software track its usage and report that usage, especially when the resource is close to being exhausted. And when the resource is exhausted, report that the problem was that the resource was exhausted and who the users of that resource were.
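
A minimal sketch of that kind of resource-limit tracking (the limits and thresholds here are illustrative):

  class ResourceTracker:
      def __init__(self, warn_fraction=0.9):
          self.limits = {}                 # name -> [capacity, current usage]
          self.warn_fraction = warn_fraction

      def register(self, name, capacity):
          self.limits[name] = [capacity, 0]

      def record(self, name, usage):
          self.limits[name][1] = usage

      def check(self):
          alerts = []
          for name, (capacity, usage) in self.limits.items():
              if usage >= capacity:
                  alerts.append(f"{name}: EXHAUSTED ({usage}/{capacity})")
              elif usage >= self.warn_fraction * capacity:
                  alerts.append(f"{name}: near limit ({usage}/{capacity})")
          return alerts

  tracker = ResourceTracker()
  tracker.register("file_descriptors", 1024)
  tracker.register("disk_bytes", 500 * 10**9)
  tracker.record("file_descriptors", 1000)
  tracker.record("disk_bytes", 100 * 10**9)
  print(tracker.check())   # ['file_descriptors: near limit (1000/1024)']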

For a little more, we can make some progress by doing simulations of 'networks of queues' with whatever stochastic processes we want to assume for 'work arrival times' and 'work service times'.

For more, do some real time system monitoring. For problems that are well understood and/or have been seen before, e.g., running out of a critical resource, just monitor for the well understood aspects of the problem.

Then there are problems never seen before and not yet well understood. Here monitoring is essentially forced to be a continually applied 'statistical hypothesis test' for the 'health and wellness' of the system complete with rates of 'false positives' (false alarms) and rates of 'false negatives' (missed detections of real problems).

For the operational definition of 'healthy and well', collect some operational data, let it 'age' until other evidence indicates that the system was 'healthy and well' during that time, and, then, use that data as the definition. Then decree and declare that any data 'too far' from the healthy and well data indicates 'sick'.

Then, as usual in statistical hypothesis testing, we want to know the false alarm rate in advance and want to be able to adjust it. And if we get a 'detection', that is, reject the 'null hypothesis' that the system is healthy and well, then we want to know the lowest false alarm rate at which the particular data observed in real time would still be a 'detection' and use that false alarm rate as a measure of the 'seriousness' of the detection.

If we make some relatively weak probabilistic assumptions about the data, then, in calculating the false alarm rate, we can get some probabilistically 'exchangeable' data. Then we get to apply some group theory (from abstract algebra) and borrow a little from classic ergodic theory to calculate the false alarm rate. And we can use a classic result in measure theory by S. Ulam, which the French probabilist Le Cam nicely called 'tightness', to show that the hypothesis test is not 'trivial'. See Billingsley, 'Convergence of Probability Measures'.

Of course, the 'most powerful' test for each given false alarm rate is from the Neyman-Pearson lemma (the elementary proofs are not much fun, but there is a nice proof starting with the Hahn decomposition, a fast result from the Radon-Nikodym theorem), but for problems never seen before we do not have enough data to use that result.

For the statistical hypothesis test, we need two special properties:

(1) Especially for monitoring systems, especially complex or distributed systems, we should have statistical hypothesis tests that are multi-dimensional (multi-variate); that is, if we are monitoring once a second, then each second we should get data on each of a certain n variables for some positive integer n. So, we are working with n variables.

(2) As U. Grenander at Brown U. observed long ago, operational data from computer systems is probabilistically, certainly in distribution, quite different from data from most of the history of statistics. He was correct. Especially since we are interested in data on several variables, we have no real hope of knowing the distribution of the data even when the system is 'healthy and well' and still less hope when it is sick with a problem we have never seen before. So, we need hypothesis tests that are 'distribution-free', that is, make no assumptions about probability distributions. So, we are faced with calculating false alarm rates for multi-dimensional data where we know nothing about the probability distribution.

There is a long history of hypothesis tests for a single variable, including many tests that are distribution-free. See old books by S. Siegel, 'Nonparametric Statistics for the Behavioral Sciences', or E. Lehmann, 'Nonparametrics'.

For multi-dimensional hypothesis tests, there's not much, and still less that is also distribution-free.

Why multi-dimensional? Because for the two interacting systems in the example in the OP, we guess that to detect the problem we would want data from both systems. More generally, in a large server farm or network, we are awash in variables on which we can collect data at rates from thousands of points per second down to a point every few seconds. So, for n dimensions, n can easily be dozens, hundreds, ....

For such tests, look at "anomaly detector" in 'Information Sciences' in, as I recall, 1999.

If you want to implement such a test, you might want to read about k-d trees in, say, Sedgewick. Then think about backtracking in the tree, depth-first, and accumulating some cutting planes.

For monitoring, there was long some work by Profs. Patterson and Fox at Stanford and Berkeley, in their RAD Lab, funded by Google, Microsoft, and Sun. The paper in 'Information Sciences' seems to be ahead.

More can be done. E.g., consider the Rocky Mountains and assume that they are porous to water. Let the mountains be the probability distribution of two-variable data when the system is healthy and well. Now pour in water until, say, the lowest 1% of the probability mass has been covered. Now the test is: observe a point in the water and raise an alarm. Now the false alarm rate is 1%. And the area of false alarms is the largest we can make it for being 1% (a rough surrogate for the best detection rate -- with a fast use of Fubini's theorem in measure theory, we can say more here). As we know, lakes can have fractal boundaries, and the techniques in the paper will, indeed, approximate those. Also the techniques do not require that all the lakes be connected. And the dry land need not all be connected either and might be in islands. So, it is quite general.
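
Here is a rough, distribution-free sketch of that 'cover the lowest 1% of the probability mass' idea, using the distance to the k-th nearest healthy point as an inverse-density proxy and calibrating the cutoff so that roughly 1% of the healthy data would itself alarm. This is only an illustration of the idea, not the method in the 'Information Sciences' paper.

  import math, random

  def kth_nn_dist(point, data, k=10):
      dists = sorted(math.dist(point, q) for q in data if q is not point)
      return dists[k - 1]

  def build_detector(healthy, false_alarm_rate=0.01, k=10):
      # score every healthy point, then set the cutoff at the (1 - rate) quantile
      scores = sorted(kth_nn_dist(p, healthy, k) for p in healthy)
      cutoff = scores[int((1 - false_alarm_rate) * len(scores))]
      def is_anomalous(point):
          return kth_nn_dist(point, healthy, k) > cutoff
      return is_anomalous

  random.seed(0)
  healthy = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(500)]
  alarm = build_detector(healthy)
  print(alarm((0.1, -0.2)))   # False: looks like the healthy data
  print(alarm((8.0, 8.0)))    # True: far from anything seen while healthy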

But we may want to assume that the region of healthy and well performance is a convex set and try again. If this assumption is correct, then for a given false alarm rate we will get a higher detection rate, that is, a more 'powerful' test.

Still more can be done. But, really, the work is essentially some applied math with, at times, some moderately advanced prerequisites from pure and applied math. Theme: The crucial content of the most powerful future of computing is in math, not computer science. Sorry 'bout that!


All of this is done at Google - there's extensive monitoring for all production systems, with alerts firing once parameters move outside of their statistical "normal" range. In practice the range tends to get set more by "How tight can we make it before we get really annoyed by these pagers?" than any rigorous statistical method, but the run-time operation carefully measures average & extreme values and determines when a spike is just a momentary anomaly vs. when it's worth paging someone.

The problem is that inevitably the next problem is exhaustion of some resource that nobody even thought of as a resource. In my time at Google, some of the resource limits I've faced have included (besides the obvious RAM/disk/CPU): binary size, size of linker inputs, average screen size of websearch users, time required for an AJAX request, byte-size of the search results page, visual space on the monitoring UI, size of a Javascript alert box, maximum length of a URL in Internet Explorer, maximum length of a request in Google's load-balancers, time until someone submits a CL touching the same config file as you, and time before a cosmic ray causes a single-bit error in memory.

File descriptors are a comparatively obvious resource, but nobody thought of them because in previous normal operation they never even came close to running out. Should someone set an alert on "wind speed running through the Columbia River Gorge" (which would've been responsible for an actual outage had there not been significant redundancy in the system, BTW)?


> All of this is done at Google - there's extensive monitoring for all production systems, with alerts firing once parameters move outside of their statistical "normal" range. In practice the range tends to get set more by "How tight can we make it before we get really annoyed by these pagers?" than any rigorous statistical method, but the run-time operation carefully measures average & extreme values and determines when a spike is just a momentary anomaly vs. when it's worth paging someone.

Really, "All"?

The first question is about 'parameters'. Usually, and likely at Google, they are monitoring just one parameter at a time. This is a really big bummer. E.g., even if you do this for several parameters, you are forced into assuming that the 'healthy and well' geometry of the parameters is just a rectangle, a bummer that hurts the combination of false alarm rate and detection rate.

Next, the "normal range" is a bummer because it assumes that what is healthy and well is just a 'range', that is, an interval, and this is not always the case. The result, again, is a poor combination of false alarm rate and detection rate.

Again, yet again, please read again, just for you, as I wrote very clearly: to do well, this just must be multi-dimensional. I doubt that there is so much as a single large server farm or network in the world that is doing anything at all serious with multidimensional data for monitoring. Not one.

Next, your remark about the false alarm rate points to a problem I pointed out is solved: the method, with meager assumptions, permits knowing the false alarm rate in advance and setting it, actually setting it exactly.

For how many and what variables to monitor, yes, that is a question that can need some overview of the server farm or network and some judgment, but there is some analytical work that should help.

For "rigorous" statistics, the point is not 'rigor' but useful power. Being multidimensional, knowing false alarm rate and being able to adjust it, etc. are powerful.



