In the meantime, your PHP app had 40 major security flaws, no meaningful monitoring, a DB that wasn't backed up and major data consistency problems. Also, when your box's hdd crashed, you lost all those minor changes you had vim'ed over the years. But for the rest, yeah, all is fine.
> In the meantime, your PHP app had 40 major security flaws, no meaningful monitoring,
How do Kubernetes, microservices and front-end frameworks fix that?
They don't. In fact, as someone who works at the more modern end of these things, I'd suspect that "no meaningful monitoring" probably correlates quite well with cloud era microservices thingies.
But we can't blame the tools, that space is just younger and by nature contains more immature stuff.
I don't think the argument here is for keeping things out of version control and skipping monitoring. It's that you can actually leave stuff alone for several years, come back, and find it working just as before, when the layers underneath don't constantly change. That can be liberating.
* k8s helps because you keep your application online during rolling updates. You keep changes relatively small and can test them in isolation.
* microservices help security because each service you write exposes a minimal surface area (as opposed to a monolithic application, where every method has access to the full set of dependencies).
* microservices also help monitoring, because failures can be contained and one application does not (or at least should not) impact another. This means your monitoring focus is on application performance, and less on the underlying system architecture.
* monitoring can be a cross-cutting concern. For example, you can monitor all HTTP requests by instrumenting your ingress layer (sketched below). Similarly, you can support a unified authorisation scheme by pulling it into the ingress layer. What's more, with microservices you can independently promote services to production (sort of like feature toggles) or do blue/green promotion.
* security of your microservices is improved, because you can independently upgrade your dependencies per service. Many older monoliths are 'stuck' in a certain dependency configuration and cannot be upgraded without a 'big bang' of expensive dependency resolution and testing.
I could give many more examples, but I hope this captures the spirit a little.
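To make the cross-cutting monitoring bullet above a bit more concrete, here is a minimal sketch of the "instrument every request in one place" idea. I'm using an Elixir Plug only because that stack comes up elsewhere in the thread; the module name and the telemetry event are invented, and the same pattern applies equally at a k8s ingress or an API gateway. It assumes the :telemetry library is available, as it is in a stock Phoenix app:

```elixir
defmodule RequestMetrics do
  # Hypothetical cross-cutting instrumentation: every request that passes
  # through this plug gets its latency and status recorded in one place,
  # so the services behind it need no bespoke monitoring code of their own.
  @behaviour Plug
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    start = System.monotonic_time(:microsecond)

    register_before_send(conn, fn conn ->
      elapsed_us = System.monotonic_time(:microsecond) - start

      # Ship this to whatever backend is actually in use (StatsD, Prometheus, ...).
      :telemetry.execute(
        [:http, :request],
        %{duration_us: elapsed_us},
        %{status: conn.status, path: conn.request_path}
      )

      conn
    end)
  end
end
```

Plugged in once at the edge (`plug RequestMetrics` in the endpoint), every service behind it gets request-level metrics without any changes to its own code.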
As a "not a cloud guy" who only recently got heavily exposed to it.. I found the state of the art / CloudOps attitudes towards monitoring astonishingly Stone Age.
Yeah, observability is great, but no one I've seen in the cloud world manages to launch with it. So in place of an SRE org with a full observability stack, they launch with effectively nothing.
Strong attitudes of "we don't measure hardware metrics like cpu/memory/disk, we measure customer-facing stuff like throughput/error rates etc". Sounds good, but what do you think happens to those if you run out of disk, bro? Pretty sure running out of disk happens first. Imagine if you could like.. catch it?
CPU/Memory: scale horizontally when needed. Monitor cost.
Disk: essentially limitless. If a VM's disk runs out, the node will crash and a new node will start. The service should keep running on other nodes in the meantime. If restarts happen too often, throughput/error rates will suffer or cost will rise.
Modern mentality is "let it crash" [0]. Software has bugs. Design so it can crash and scale depending on need.
Letting an Erlang process crash means letting go of a process that holds a small collection of resources, maybe a single TCP connection and several kilobytes of local state. That does not necessarily scale beyond that level. Pretty much by definition, if you've got something that can run out of disk, then when it runs out of disk and you nuke it, you're taking out a lot more than a single connection and a few kilobytes of state.
And "let it crash" is hardly "My service is invincible!" I'm sure any serious Erlang deployment has seen what I've seen as well, which is a sufficiently buggy process that crashes often enough that the system never gets into a stable state of handling requests. "Let it crash" is hardly license to write crap code and let the supervisor pick up the pieces. It's a great backstop, it's a shitty foundation.
I built a distributed system once on two principles: 1. let it crash, 2. make everything deterministic. Obviously, this resulted in crashes being either invisible and transient (good) or an infinite crash loop (bad).
I haven't used Erlang, but my impression is that it's probably the same experience there?
The way it builds on immutability means it naturally leans in that direction, but the fact that it tends to be used heavily for networking undoes that, because network communication is by its nature not deterministic in the sense you mean.
In my case, IIRC, it was something to the effect of: a lot of our old clients out in the field connected with a field that crashed the handler. Enough of these were connecting that the supervisor was flipping out and restarting things (even working things) because it thought there was a systemic issue. (I mean, it was right, though I had configured it to be a bit too aggressive in its response.) The fact that I could let things crash didn't rescue the fact that my system, even if I fixed that config, would strain its resources constantly doing TLS negotiations and then immediately crashing out what were supposed to be long-term connections.
Obviously the real problem was in the test suite; we were able to roll back and address it, and ultimately this was a blip, not a catastrophe. I just cite it as a case where "let it crash" didn't save me, because the thing that was crashing was just too broken. You still have to write good code in Erlang. It may survive not having "great" code, but it's not a license to not care.
Using Taleb's nomenclature, let it crash is not anti-fragile at run time. Erlang does not get progressively better at holding your code together the longer it crashes. It is only resilient and robust. Which is ahead of a lot of environments, but that's all.
Many software development processes, considered as a whole, are anti-fragile... mine certainly is, which is a great deal of why I love automated testing so much (I often phrase it as "it gives me monotonic forward progress", but "it gives me anti-fragility" is a reasonably close spin on it too). But that's not unique to Erlang, nor does Erlang have anything special to help with it, since everything Erlang has for robustness is focused on run time. You can have anti-fragile development processes with any language. (The fact that I successfully left Erlang for another language entirely is also testament to that: I had to replace Erlang's robustness, but I didn't have to replace its anti-fragility, since it didn't particularly have any.)
Anti-fragility is just a fancy name for 'ability to learn'. Erlang's error-handling philosophy enables learning by keeping things simple and transparent: it's easy to see that some component keeps failing, it doesn't bring your whole app down, and you can look into it and improve it. Adding tonnes of third-party machinery may be robust or even resilient, but if it makes things more opaque or demands bigger teams of deeper specialists, it precludes learning, and thus is not 'anti-fragile'. You can keep your ability to learn healthy without Erlang, and you can use Erlang without learning much over time.
This is an adequate philosophy for like.. a CRUD app, some freemium SaaS, social media, etc. Stuff with millions of users and billions of sessions, etc.
However, there are industries applying these lessons in HPC / data analytics / things that touch money live.. operating on scales of tens to maybe hundreds of users. So stuff where downtime is far more costly, both in dollars and in reputation.
I'm also intrigued by the constant cloud refrain of "stuff crashes all the time so just expect it to", coming from a background where I have apps that run without a crash for six months at a time, or essentially until the next release.
I'm all for scaling, recovery, etc.. I just fail to understand why it is desirable for this to be an OR rather than an AND.
What if stuff was highly recoverable and scalable but also.. we just didn't run out of disk needlessly?
> I'm also intrigued by the constant cloud refrain of "stuff crashes all the time so just expect it to", coming from a background where I have apps that run without a crash for six months at a time, or essentially until the next release.
IMHO, those aren't mutually exclusive. Your app code should be robust enough to run 6+ months at a time, and the "stuff crashes all the time so just expect it to" attitude should be reserved for stuff outside your control, like hardware failures.
Right, which is why I think brushing aside actually monitoring basic hardware stats that are leading indicators of error rates / API issues / etc makes no sense.
How is that better than a simple monitor/alert for low disk space? That low disk space is likely caused by an application storing too much cumulative data in log files or temporary caches etc., and is often easy enough to fix. And many applications out there simply don't need the level of scalability and extra robustness where you can still expect decent levels of service in the immediate aftermath of having one node go down. Certainly from my experience it's less work (and cost) to put measures in place that minimise the chances of a fatal crash than it is to ensure the whole environment functions smoothly even if parts of it crash regularly. I'd also note we can be grateful that the developers of OSes, web servers, VMs and database servers don't subscribe to "let it crash"!
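For what it's worth, the "simple monitor/alert for low disk space" being argued for here really is small. A rough sketch in Elixir (to match the rest of the thread), assuming OTP's bundled :os_mon application is started so that :disksup is available; the module name and threshold are made up:

```elixir
defmodule DiskAlert do
  # A deliberately boring low-disk monitor: poll usage once a minute and
  # page (here: just log) when any mount point crosses a threshold.
  use GenServer
  require Logger

  @threshold_percent 85
  @interval_ms 60_000

  def start_link(_opts), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(nil) do
    schedule_check()
    {:ok, nil}
  end

  @impl true
  def handle_info(:check, state) do
    # :disksup.get_disk_data/0 returns {mount_point, total_kb, used_percent} tuples.
    for {mount, _total_kb, used_percent} <- :disksup.get_disk_data(),
        used_percent >= @threshold_percent do
      Logger.warning("Disk #{mount} is #{used_percent}% full")
    end

    schedule_check()
    {:noreply, state}
  end

  defp schedule_check, do: Process.send_after(self(), :check, @interval_ms)
end
```

Point the warning at a pager instead of a log and you have exactly the kind of leading indicator being asked for upthread.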
It looks like you misunderstood the article. "Let it crash" in the BEAM VM world pertains to a single green thread / fiber (confusingly called "process" in Erlang).
It pertains to, e.g., a single database connection, a single HTTP request, etc. If something crashes there, your APM reports it and Erlang's unique runtime continues unfazed. It's a property of the BEAM VM that no other runtime possesses.
"Let it crash" is in fact "break your app's runtime logic to many small pieces each of which is independent and an error in any single one does not impact the others".
Scaling an Erlang node is very rarely the solution unless you literally run out of system resources.
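A tiny, self-contained illustration of that granularity (Elixir; the supervisor name Demo.TaskSup is made up): each piece of work runs in its own BEAM process, and the one that raises is reported and dies without touching its siblings or the rest of the runtime.

```elixir
# One process per unit of work, supervised by a Task.Supervisor.
{:ok, _sup} = Task.Supervisor.start_link(name: Demo.TaskSup)

handle = fn
  :bad -> raise "boom"                  # this "request" hits a bug and crashes
  n -> IO.puts("handled request #{n}")  # the others carry on unaffected
end

for req <- [1, 2, :bad, 3, 4] do
  Task.Supervisor.start_child(Demo.TaskSup, fn -> handle.(req) end)
end

# Give the tasks a moment: an error report is logged for :bad,
# while requests 1-4 are handled normally.
Process.sleep(100)
```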
I understood the article just fine. The "Let It Crash" philosophy is scale invariant.
Please read the last three paragraphs in [0]: "a well-designed application is a hierarchy of supervisor and worker processes" and "It handles what makes sense, and allows itself to terminate with an error ("crash") in other cases."
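For readers who haven't seen it, that "hierarchy of supervisor and worker processes" looks roughly like this in Elixir. A hand-wavy sketch rather than anyone's production code: MyApp and the child layout are invented, and max_restarts/max_seconds are the kind of intensity knobs an earlier commenter mentioned having tuned a bit too aggressively:

```elixir
defmodule MyApp.Application do
  # Toy supervision hierarchy: a long-lived stateful worker next to a
  # Task.Supervisor that owns one short-lived process per request/connection.
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # Stateful worker (an Agent standing in for a cache or DB pool):
      # survives individual request crashes.
      %{id: MyApp.Cache, start: {Agent, :start_link, [fn -> %{} end, [name: MyApp.Cache]]}},
      # Disposable workers: one per request, crash without taking anything else down.
      {Task.Supervisor, name: MyApp.RequestSup}
    ]

    # The restart-intensity knobs: more than 3 crashes within 5 seconds and
    # this supervisor gives up and crashes too, escalating a systemic problem
    # upward instead of endlessly masking it.
    Supervisor.start_link(children,
      strategy: :one_for_one,
      name: MyApp.Supervisor,
      max_restarts: 3,
      max_seconds: 5
    )
  end
end
```

Workers that handle one request die cheaply; the supervisor only escalates when crashes become systemic, which is the "great backstop, shitty foundation" point made upthread.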
I've personally designed and co-implemented mission critical real-time logistics systems which dealt with tens of thousands of events per second, with hundreds of different microservices deployed on a cluster of 14 heavy nodes. Highly complex logic. At first we were baby-sitting the system, until it became resilient by itself. Stuff crashed all the time. Functional logic was still intact. Then we had true silence for months on our pager alerts.
I call it anti-fragile and Taleb is right. You can't make a system resilient if you don't allow it to fail.
Is that so different to how Java+Tomcat or .NET+IIS work?
A crash handling one request generally can't/doesn't affect the ability to handle other requests. Unfortunately it does often mean you have limited control over how the end-user perceives that one failed request.
It only looks the same if you observe the visible results and your APM, and nowhere else. The stacks you mention -- and many others -- use far more system resources per request than the BEAM VM. I have personally achieved 5000 req/s on a 160 EUR refurbished laptop with a pretty anemic Celeron J CPU, on my local network, by bombarding an Elixir/Phoenix web app (Elixir runs on top of Erlang, if you did not know) -- and that's without even trying to use a cache.
RE: error handling, Elixir apps I coded and maintained never had a problem. Everything was 100% transparent which is another huge plus.
In general CGI and PHP had the right idea from the start but the actual implementations were (maybe still are? no idea) subpar.
Erlang's runtime is of course nowhere near what you will get with Rust and very careful programming on the tokio async runtime, but it's the best out there in the land of dynamic languages. I've tried 10+ languages, which is of course not technically exhaustive, but I was finally satisfied with making super parallel _and_ resilient workloads when I tried Elixir.
For a lot of tasks I can just code in the Elixir REPL and crank out a solution in an hour, including copy-pasting from the REPL into an actual program with tests. The same task in Rust took me a week, just saying (though I am still not a senior in Rust; that's absolutely a factor as well), and about 3/4 of a day in Golang.
The only other small-friction no-BS language and ecosystem that came kinda close for me is Golang. I like that one a lot but you have to be very careful when you do true parallelism or you have to reach for a good number of libraries to avoid several sadly very common footguns.
And I would add -
All the GOOD cloud people I know are GOOD linux people. It is a prerequisite.
However, there's lots of cloud people who don't know linux.
Therein lies the challenge.
Too many of these cloud devs forget that whether it's "serverless" or not, there's a.. server, somewhere.
All the layers of abstraction work when they work, and when they don't .. best of luck figuring out what's going on in a timely fashion. Hope you aren't running any money on top of it.
If you forget about a PHP container for a few years it will /also/ have 40 new vulnerabilities. Actually, containers are worse, because OS updates of core shared libraries do nothing for what's inside them. You have to rebuild every damn container.
Setting up monitoring for your docker containers is also a whole thing. :)
I think you’re taking my example a little too literally. My point is not that docker/k8s/whatever is bad; just that the ‘new’ adds features at the cost of simplicity.
I can say with great certainty: Almost no one rebuilds their damn container anywhere near as often as the gray-beard in the basement updates the Debian packages on the server that runs the container.
I had this debate with a client: we did monthly security updates, unless something horrible happened. The client was rather upset that we didn't patch more frequently, like weekly or daily. My argument is that it doesn't really matter whether the Linux kernel or bash is patched when the only thing running is a container with a beta version of Tomcat that hasn't had security updates applied in three years.
Even worse are the people who just pull things from Docker Hub, with no plan as to how and when they'll pull newer versions. But fine, let's just keep running KeyCloak from 2017, and that old Postgresql image which the developer never configured to do backups, I'm sure it's fine.
> I can say with great certainty: Almost no one rebuilds their damn container anywhere near as often as the gray-beard in the basement updates the Debian packages on the server that runs the container.
Most probably the gray-beard simply enabled "unattended-upgrades".
You can do something similar with a container (track the security fixes for the packages used, force rebuild and deploy when needed), but it is a bit of work and I don't know of any ready solution.
I think what people are missing is that sure, code "rots", at the very least because of security patches. But since this all happens to everything simultaneously, the more distinct layers and support tools you have in your stack, the more often you have to deal with something breaking in a nontrivial way.
In short: the more moving parts you have, the less time you have between major malfunctions.
(Manufacturing and hardware world understands it well, which is a big part in why they like integrating things so much.)
So e.g. over in the backend-land where I live, it used to be that I had to occasionally update the compiler or one of the few third-party dependencies that I used. Today, I have many more libraries (to the point there's something to update for security reasons roughly once a month, on average), and on top of that, I have CI/CD introducing its own mess, Conan updates which occasionally get messed up, or make some existing recipes incompatible, CMake updates which are done unexpectedly and break stuff, now also Docker is adding more of its own problems, etc. So I have to deal with some kind of tooling breakage every other week now.
And always, always, when I think it's all finally OK and I can get on with my actual job, some forgotten or hidden component craps itself out of the blue. Like that time our git precommit hooks broke for me, because someone changed them in a way that doesn't work with my setup. And then me wasting a day on trying to fix it, eventually giving up and degrading my setup to unblock myself. Or another day where, for no apparent reason, some automation that made automated commits to some git repos started losing Change-ID headers in commit messages, making Gerrit very sad, leading to several people wasting a total of several person-days trying to fix it. Etc.
There's always something breaking, the frequency of such breakages seems to be increasing, and it's a major source of frustration for me in this job. Which is why I too am often thinking back to "good old days", and am increasingly in favor of keeping the amount of dependencies - both libraries and tooling - to a minimum.
Right, and it's easy to think about all the wins when things go right - it's harder to think about the increased frequency and cost of failure due to the increase in complexity and the number of independently moving parts.
The biggest thing with containers is that these 40 vulnerabilities won't matter as much if they're about erasing your directories or killing your machine. The container will get rebooted in a pristine state, and an attacker would need to stick around killing it at every reboot to have a lasting effect.
Which is also the point of monitoring a container, which is fairly reliable nowadays compared to managing your own health-check service to ping your box and check if it's alive.
All in all, I think k8s is way too complicated for what it does for most people, but administrating servers was also a complex task to begin with, and the things sysadmins were dealing with looked nightmarish to me. Heck, there was a time when companies would run their own SMTP server in house...
Layering more abstractions on top of what was already complicated doesn’t fix the complications below, it just hides them. You then have to hope that there’s not a bug somewhere in between those layers. Additionally, each new layer has a cost in performance. More layers means more compute, and eventually more money. If you’re Netflix, it’s worth it; if you’re not… what admins dealt with before wasn’t so bad. Now that that job has largely been killed and renamed 20 times, it’s far worse. Admins don’t just have to deal with a server, they have to deal with a server and 60 containers, which are just little baby servers.
> won't matter as much if they're about erasing your directories or killing your machine
Those sort of attacks have tapered off though, right? Unless you're engaged in a specific feud, or have caught the ire of a social justice warrior, these days we're mostly looking at data exfiltration as the primary goal. That being said, there are a lot of ongoing feuds and active social justice warriors.
If we're talking about common, automated attacks, I think the game plan for the past few years has been to probe for systems vulnerable to RCE, install a crypto miner, join the server to a botnet and see how long it takes its owner to notice.
Now read those flaws, compare them to the worst of 2004, and tell me they're the same. They're not, because the internet was insecure as fuck in 2004, and these days security researchers' motivation to receive bounties results in considerably more situational (and significantly less severe) security issues.
Shiny new app has no monitoring, 40 security flaws, and an ultra-modern cloud database that promotes data consistency issues. But all the devs you can hire act like they might know something about this mess. They really don't.