
> But every time I need to dig down into the source code, I quickly realize why Docker falls over all the time with weird bugs, and why Vault's high-availability backend for etcd falls over once per week, and why Kubernetes is capable of inspiring such epic love-hate relationships.

Vault's documentation clearly states not to use etcd as the HA backend.

Docker's problems stem from their bone-headed decision to start with a client/server architecture, which they've been unwinding ever since. They should have started with small, composable tools that integrated well with existing infrastructure (init systems, file distribution, and security tooling). Instead, they started with an architecture where everything goes over an HTTP socket through a daemon, with a high probability of contention, deadlock, and failure. None of that has anything to do with Go; they could have built that abomination in any language.
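
To make the coupling concrete, here's roughly what a command like `docker ps` amounts to under the hood (a minimal sketch assuming the default socket path; `/containers/json` is the daemon's real list endpoint):

    package main

    import (
        "context"
        "fmt"
        "io"
        "net"
        "net/http"
        "os"
    )

    func main() {
        // Every Docker CLI command is, under the hood, an HTTP request to
        // a long-lived daemon over a Unix socket; if the daemon wedges,
        // every client wedges with it.
        client := &http.Client{
            Transport: &http.Transport{
                DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
                    return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
                },
            },
        }
        resp, err := client.Get("http://unix/containers/json") // roughly `docker ps`
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            return
        }
        defer resp.Body.Close()
        io.Copy(os.Stdout, resp.Body) // JSON array of running containers
    }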

I can't speak for Kubernetes. I have not used it in anger yet.



I'm not sure what that has to do with things like race conditions that randomly cause "kill and delete" to error out and leave the container in an invisible, dead state, with a totally useless error message. A system like Docker has a lot of resources and esoteric error conditions to keep track of, and Go gives you pretty poor tools for doing it correctly.
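
A contrived sketch (not Docker's actual code) of the check-then-act pattern behind this class of bug: the state check and the state transition happen under separate lock acquisitions, so a concurrent operation can slip in between them.

    package main

    import (
        "fmt"
        "sync"
    )

    type container struct {
        mu    sync.Mutex
        state string // "running", "dead", "removing"
    }

    func (c *container) kill() error {
        c.mu.Lock()
        running := c.state == "running"
        c.mu.Unlock()
        if !running {
            return fmt.Errorf("container is not running") // the useless error
        }
        // ...window: a concurrent remove() can run right here...
        c.mu.Lock()
        c.state = "dead"
        c.mu.Unlock()
        return nil
    }

    func (c *container) remove() error {
        c.mu.Lock()
        defer c.mu.Unlock()
        if c.state != "dead" {
            return fmt.Errorf("cannot remove container in state %q", c.state)
        }
        c.state = "removing"
        return nil
    }

    func main() {
        c := &container{state: "running"}
        var wg sync.WaitGroup
        wg.Add(2)
        go func() { defer wg.Done(); fmt.Println("kill:  ", c.kill()) }()
        go func() { defer wg.Done(); fmt.Println("remove:", c.remove()) }()
        wg.Wait()
        // Depending on timing, the container can end up "dead" with the
        // remove already failed: killed, never removed, error useless.
        fmt.Println("final state:", c.state)
    }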


> Vault's documentation clearly states not to use etcd as the HA backend.

Yes, when I reported bugs against the etcd HA backend they said they were going to add more prominent warnings to the docs, if I remember correctly. :-) I think the underlying bug comes from overly optimistic multithreaded code using CSP, but I stared at the code for several hours and couldn't see how to make it unambiguously correct.
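
I no longer have the code in front of me, so this is only the generic shape of the bug class, not Vault's actual backend: leadership is granted on one channel and revoked on another, so between a revocation being sent and being received there is always a window in which the worker still believes it holds the lock.

    package main

    import (
        "fmt"
        "time"
    )

    func worker(acquired, lost <-chan struct{}) {
        <-acquired // we are the leader... as of some moment in the past
        for {
            select {
            case <-lost:
                return // may arrive after leader-only work has already run
            default:
                doLeaderOnlyWork() // can execute after the lock is really gone
            }
        }
    }

    func doLeaderOnlyWork() { time.Sleep(10 * time.Millisecond) } // placeholder

    func main() {
        acquired, lost := make(chan struct{}), make(chan struct{})
        go worker(acquired, lost)
        close(acquired) // grant leadership
        time.Sleep(50 * time.Millisecond)
        close(lost) // revoke it; the worker only notices eventually
        time.Sleep(50 * time.Millisecond)
        fmt.Println("the grant and the revocation are never atomic with the work")
    }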

We now have our own in-house Redis "HA" backend that I wrote which appears to be rock solid in production, but which temporarily disables Vault during Redis restarts. So it's really "fake HA" but it works nicely for our use case and never requires manual intervention or embarrasses us during business hours. I refuse to set up and administer an entire Consul cluster for a single leader-election lock. Managing Consul has its own complications: https://www.consul.io/docs/guides/outage.html
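
The core of such a backend is tiny. A minimal sketch (not our production code) using the go-redis client, with a made-up key name and TTL: SET NX with a TTL means the lock self-expires if the holder dies, and while Redis is restarting nobody can acquire it, hence "fake HA": Vault pauses instead of failing over.

    package main

    import (
        "context"
        "fmt"
        "time"

        "github.com/go-redis/redis/v8"
    )

    // Single leader-election lock in Redis.
    func tryAcquireLeadership(ctx context.Context, rdb *redis.Client, nodeID string) (bool, error) {
        return rdb.SetNX(ctx, "vault-ha-lock", nodeID, 15*time.Second).Result()
    }

    func renewLeadership(ctx context.Context, rdb *redis.Client, nodeID string) bool {
        // Check-then-extend is not atomic; production code should use a
        // Lua script so the owner check and the TTL refresh can't race.
        owner, err := rdb.Get(ctx, "vault-ha-lock").Result()
        if err != nil || owner != nodeID {
            return false // lost the lock (or Redis is down): step down
        }
        ok, err := rdb.Expire(ctx, "vault-ha-lock", 15*time.Second).Result()
        return err == nil && ok
    }

    func main() {
        ctx := context.Background()
        rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
        won, err := tryAcquireLeadership(ctx, rdb, "node-a")
        fmt.Println("leader:", won, "err:", err)
    }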

> Instead they started with an architecture where everything goes over an HTTP socket through a daemon with high probability for contention, deadlock, and failure. None of that has anything to do with Go.

The Docker daemon's wire protocol suffers from a kind of sloppy thinking about data types, in my opinion. Consider this line in a Rust Docker client: https://github.com/faradayio/boondock/blob/bf29afb0c78f8d5d6...

    pub Ports: Option<HashMap<String, Option<Vec<PortMapping>>>>,
Here, the `Option` types say, "This might be a JSON `null` value on the wire." So the `Ports` member might be `null`, or it might contain an empty hashmap, which in turn contains values that might be `null` or might be a `PortMapping` value. The corresponding Go code is extremely unclear (at least in most similar situations) about which values can be `null` on the wire.
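
For contrast, here's a plausible Go-side mirror of that same structure (hypothetical; not Docker's actual source). Nothing in the type distinguishes "Ports was absent", "Ports was null", and "Ports was {}": the first two decode to a nil map, the third to an empty one, and each inner slice may independently be nil. Every caller has to guess which cases are possible.

    package main

    import (
        "encoding/json"
        "fmt"
    )

    type PortMapping struct {
        HostIP   string `json:"HostIp"`
        HostPort string `json:"HostPort"`
    }

    type NetworkSettings struct {
        Ports map[string][]PortMapping `json:"Ports"`
    }

    func main() {
        for _, raw := range []string{`{}`, `{"Ports": null}`, `{"Ports": {}}`} {
            var ns NetworkSettings
            json.Unmarshal([]byte(raw), &ns) // errors impossible for these literals
            fmt.Printf("%-18s -> Ports == nil? %v\n", raw, ns.Ports == nil)
        }
        // Prints true, true, false -- but the type signature promises nothing.
    }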

There's a lot of really foggy specification going on in the network layer of the Docker server. It's not bad in the way that certain old Unix servers like `csvd` were bad, but I feel like Docker could be better if Go demanded more clarity about data types.

On the other hand, Docker is clearly a raging commercial success, and it actually works quite nicely (at least if you use Amazon's ECS to manage your clusters). So maybe I'm just being overly fastidious. Clearly all this new Go-based devops infrastructure is hugely successful, even if many of us get frustrated with it now and then.

Still, I really wish that more programming languages forced you to be clear about which values can be `null` and which can't. Hoare called it his "billion dollar mistake": http://lambda-the-ultimate.org/node/3186


I'm still unconvinced that Docker's choice of Go is the primary culprit for their problems. If Docker had started bottom-up, they would have been forced to design tighter contracts between the components from the beginning (regardless of language).

Ultimately it's the architectural decision to couple everything that allowed them to be loosey-goosey with internal-only interfaces. Why bother specifying them thoroughly if you can update all components simultaneously and have no dependencies outside of your control? I'm sure a tighter language would have helped in the small, but their problems are categorically big problems that I believe transcend smaller and more localized failings.

My experience using Consul for the Vault backend has been pretty good. Yes, you have to manually go in and clean up old nodes after a restart, but that isn't too bad. So far (knock on wood) we haven't run into a situation where a replacement node has refused to join the cluster or the cluster has outright failed.
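
For what it's worth, that cleanup can be scripted against Consul's Go API (a sketch; the `statusFailed` constant and the policy of force-leaving every failed member are assumptions about your cluster, not official guidance):

    package main

    import (
        "fmt"
        "log"

        consul "github.com/hashicorp/consul/api"
    )

    // Force-leave any member stuck in serf's "failed" state, which is
    // what dead nodes left over from a restart look like.
    func main() {
        client, err := consul.NewClient(consul.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }
        members, err := client.Agent().Members(false)
        if err != nil {
            log.Fatal(err)
        }
        const statusFailed = 4 // serf.StatusFailed at the time of writing
        for _, m := range members {
            if m.Status == statusFailed {
                fmt.Println("force-leaving dead node:", m.Name)
                if err := client.Agent().ForceLeave(m.Name); err != nil {
                    log.Println("force-leave failed:", err)
                }
            }
        }
    }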

That said, I do wish they would put in the effort to integrate properly with etcd. We've gone through the motions of setting up and managing a highly available etcd cluster for quite some time now, and we're more confident running it than Consul. I'd rather standardize on etcd myself, but I'm not inclined to swim upstream. :)



