I love that somewhere out there is a developer who doesn't even work at Fastly but just innocently pushed a change to their Fastly config and basically broke the entire internet. I'm actually jealous. If it was me, I'd put that on my resume.
A trivial example would be a bug that replaces the configuration for all customers with the last one uploaded. Then when the next customer uploads a new (valid!) config, you have a problem.
Obviously it wasn't that trivial, but the point is: it wasn't the customer's configuration change that was the problem, but some code that managed the config change.
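To make that toy scenario concrete, here's a tiny made-up Python sketch of that class of bug, where a store that's supposed to be keyed per customer gets clobbered by the most recent upload (none of this reflects Fastly's actual internals):

    # Hypothetical config store: intended to be keyed per customer.
    configs = {
        "shop-a": {"backend": "origin-a.example"},
        "news-b": {"backend": "origin-b.example"},
    }

    def upload_config(customer_id, new_config):
        # Intended behaviour: configs[customer_id] = new_config
        # Buggy behaviour: the last uploaded config replaces everyone's.
        for cid in configs:
            configs[cid] = new_config
        configs[customer_id] = new_config

    # The next customer uploads a perfectly valid config...
    upload_config("blog-c", {"backend": "origin-c.example"})
    print(configs)  # ...and every customer now points at origin-c.example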
It's more common than we imagine; that's how many historical network incidents have started. The important part, as usual, is to make sure the remediations for such incidents focus on how to limit the blast radius of small changes, and how to accomplish that without imposing artificial gatekeeping and bureaucracy on the change process.
A web filled with DDOS attacks and scraping is a web that needs cloudflare and fastly. I’m not sure how to avoid this sorry state of things.
I2P doesn’t seem like an immediate solution -- maybe it can resist DDOS, but at the cost of losing fast, easily-accessible, easily-searchable public websites, no? Could Starbucks host their website on I2P, to pick a random example? Seems like a bunch more infrastructure would be needed first.
I2P is simply an alternative web. Yes, it's slower, but everything else works in the same way as in the clearnet. You have readable names of the websites, you have search engines there. It looks like the web from the 90s.
> Could Starbucks host their website on I2P, to pick a random example?
Yes, they definitely can. What additional infrastructure is needed? I have hosted websites there myself and didn't see any problems.
This attitude is why we have only 2½ search engines on the entire Internet. Only Google, Bing, and Yandex run crawlers. Everybody else is just a reseller for them.
Web crawlers are a feature not a bug. If your site shouldn't be crawled, it doesn't belong on the Internet.
If you cannot generate revenue from your internet content, you probably can't make a living generating content for the internet.
The consequence, IMHO, is that the internet wouldn't have this amount of content and usefulness.
Newspapers? No. You can't make a living from internet news if anyone can copy a reporter's work, post it on their own site, and dilute the traffic.
Online selling? Doesn't look like a viable business model, as anyone can copy the photos you paid a photographer for, the descriptions you paid someone to write, and the reviews your customers wrote. True reviews are priceless, you know? Even more so now that an AI can detect computer-generated reviews.
Obviously an open and totally money-free internet is nice, but it wouldn't be an internet people can make a living from.
Test “it”? The change in question wasn’t by fastly but a customer of theirs making a config change. It’s possible that this customer did validate their change somehow.
Fastly obviously didn’t test their code (with the bug) enough, but testing of course can never prove the absence of bugs. Testing for a global deployment like a massive CDN happens to a large extent in prod because you don’t have another globe. You can test on a smaller scale but eventually you run into a problem that only shows itself at full scale.
> We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a valid customer configuration change.
Their change was bad, and that was May 12. Since it seemed OK on May 13, 14, … there wasn't much indicating that change would blow up weeks later. For example, even if they rolled it out gradually, they would have reached 100% rollout with all lights green.
The customer change was a valid configuration. That was yesterday.
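To illustrate how that plays out (a purely invented example, not Fastly's code): the code shipped on May 12 handles every config that existed at the time, so a gradual rollout looks green; only a later, still-valid config exercises the broken branch.

    # Invented example of a latent bug: fine for every config that existed
    # at rollout time, broken only for a later (still valid) config shape.
    def build_routing_rule(config):
        origins = config["origins"]
        if len(origins) == 1:
            return {"route_to": origins[0]}
        # Nobody had multi-origin rules on "May 12", so this path was never hit.
        raise RuntimeError("unhandled config shape")  # -> errors for the node

    # Every config existing at rollout time: all lights green for weeks.
    for cfg in ({"origins": ["a.example"]}, {"origins": ["b.example"]}):
        build_routing_rule(cfg)

    # Weeks later, a customer pushes a valid two-origin config:
    build_routing_rule({"origins": ["a.example", "b.example"]})  # boom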
Sometimes when I access a webpage I think to myself "what if I do X and the whole website crashes?" Or if the webpage crashes I think "heh, what if I brought down the whole server?"
It looks like this guy did just that. And for Fastly. Wow.
"So, did I just hear three distinct light switch clicks?"
Absolutely no details about the bug, why a single customer configuration affected global state on the server, or why this wasn't caught by configuration change safety mechanisms/smoke tests/gradual rollout.
Also, what is up with their partitioning? Do they seriously have one customer that gets served from 85% of their servers? Is it a whale?
Good on them for getting a statement out right away (although they basically had to) but seems to be lacking any useful details. Wonder if they were scrubbed by PR/legal in hopes of reducing the number of customers coming to ask for gibs.
It's an update on the situation until they confirm the bug fix is completely rolled out. I certainly wouldn't expect them to tell their customer base how to exploit a bug to bring down part of their business.
With respect to partitioning - we don't know how or why an invalid configuration could poison so many nodes; if the config was physically present on them or if there was a cascade of healing/balancing issues stemming from it.
I would leave speculation on many of your points at the doorstep until we see a full report.
At a guess, perhaps each server can serve every customer, and the system-wide config file that is shipped to each server (to handle each customer) became corrupted as part of a customer update and was rolled out to every node on the assumption that it was correct.
A defense against this could be to ensure the system that applies the change validates that some health checks continue to pass after the new file is in place (or automatically rolls back to the previous configuration).
I can see how this would happen, assuming that's what happened.
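Something like that defense could be as simple as the sketch below (write_config, reload_proxy and health_probe are made-up stand-ins for whatever real plumbing pushes a file to a node and probes it):

    # Sketch: apply the new config, smoke-test the node, roll back on failure.
    def apply_with_rollback(node, new_config, old_config,
                            write_config, reload_proxy, health_probe):
        write_config(node, new_config)
        reload_proxy(node)
        if health_probe(node):              # node still serves a known-good request?
            return True                     # keep the new config
        write_config(node, old_config)      # otherwise restore the previous one
        reload_proxy(node)
        return False

    # Toy usage: the probe fails whenever the active config is "poison".
    state = {}
    ok = apply_with_rollback(
        "cache-node-1", {"poison": True}, {"poison": False},
        write_config=lambda node, cfg: state.update({node: cfg}),
        reload_proxy=lambda node: None,
        health_probe=lambda node: not state[node]["poison"],
    )
    print(ok, state)  # False, and cache-node-1 is back on the old config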
Yes, it sounds sort of like that. But this can be remediated by two things I was asking about: smoke tests and partitioning.
When making a config change I'd assume they don't push it to all servers at once and instead roll it out gradually. If this caused a server to instantly start 503'ing for all customers, presumably it would have been caught; perhaps the failure was more delayed though (a resource leak, etc.), and obviously that is somewhat more difficult to catch.
If they're properly partitioning customers, ideally they wouldn't even ship the configs to all servers (slightly less good, but still pretty good, would be shipping them there but not parsing/loading them). It sounds like at the least this customer's config change affected 85% of servers, which seems absurd to me.
So yes, I can see how it happened, but for Fastly, which runs one of the biggest CDNs, these don't seem like very reasonable mistakes.
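For what it's worth, the partitioning idea can be sketched very simply (this is a toy, not Fastly's actual architecture): hash each customer onto a fixed slice of the fleet and only ship their config there, so a poison config takes out that slice rather than 85% of the network.

    import hashlib

    # Toy partitioning: each customer's config only ships to a fixed slice.
    NODES = [f"cache-{i:03d}" for i in range(100)]
    SLICE_SIZE = 10  # one customer touches 10% of the fleet, not 85%

    def nodes_for_customer(customer_id):
        # A stable hash picks a deterministic window of nodes for this customer.
        h = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16)
        start = h % len(NODES)
        return [NODES[(start + i) % len(NODES)] for i in range(SLICE_SIZE)]

    # A poison config from one customer can only reach its own slice.
    print(len(nodes_for_customer("problem-customer")), "of", len(NODES), "nodes")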
One of the big competitive advantages Fastly has compared to, say, Akamai is that configuration changes roll out extremely fast. I could see them skimping on smoke tests to keep that advantage and not thinking that this could ever happen.
At its core Fastly uses heavily modified Varnish 2, with customers having full access to VCL, providing an unprecedented level of control over request processing and cache behavior. It is extremely difficult, if not impossible, to completely remove foot-guns while retaining this ability. They do an amazing job checking the validity of the code and aborting broken requests, but it is still VCL, and at its core Varnish 2 does not have multi-tenancy segmentation.
Thank you for that. That last clause seems like the real key here. With that much power in customers' hands and no hard multi-tenant isolation, unintentional DoS becomes almost inevitable. In effect, it puts every customer at the mercy of every other's diligence (or lack thereof). Even with diligence, that seems a bit fragile.
> one customer that gets served from 85% of their servers
There could be a feedback loop that is the opposite of a smoke test (a toy simulation is sketched after the list):
1. Validate customer configuration; if it passes, assume it can roll out
2. Roll out customer configuration to a node
3. Node goes down
4. Migrate all customers on that node to new nodes
5. The node that the problematic customer was migrated to goes down
6. Rinse and repeat as the problematic customer migrates to every node and takes out every last one.
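A toy simulation of that loop, under the made-up assumption that loading the poison config kills a node outright and its customers get re-homed onto a surviving node:

    import random

    # 20 nodes, 5 customers each; node-0 also hosts the poison customer.
    nodes = {f"node-{i}": {f"cust-{i}-{j}" for j in range(5)} for i in range(20)}
    nodes["node-0"].add("poison-customer")

    killed = 0
    while len(nodes) > 1:
        dead = next((n for n, custs in nodes.items() if "poison-customer" in custs), None)
        if dead is None:
            break
        evacuees = nodes.pop(dead)           # node goes down
        killed += 1
        target = random.choice(list(nodes))  # migrate its customers elsewhere...
        nodes[target] |= evacuees            # ...poison included
    print(f"{killed} nodes taken out; {len(nodes)} left (and the last one is next)")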
Smoke tests or gradual rollout won't help with the change by Fastly on May 12 when they deployed the buggy change. It's likely that this was gradually rolled out and looked OK against all the customer configs that existed on May 12. Obviously there should be other safeguards in place, but gradual rollout in itself wouldn't have helped, since it would have looked green at 100% rollout weeks ago.
You’d think that individual customer configuration changes should only ever affect that customer, and that gradual vs instant rollout would be an option the customer handles when changing their configuration!
TikTok is my guess. ByteDance is valued at 250 billion. Plus, the change was pushed in the middle of the night, which would be daytime in Asia. Certainly there are other development teams in Asia, but considering the scale of the change it likely comes from HQ, and Fastly's whale in Asia would be them.
edit: They may have lost TikTok at the end of last year, either partially or completely [1]. Anyone know what they use now? Akamai, or maybe they stealthily switched back to Fastly?
Are you implying that no senior devs or main offices are located in Europe?
Also, just because the rollout at Fastly happened in the EU morning doesn't mean that's when the change was made. If there's a deployment pipeline, it could have been made 2-3 hours earlier, or even the day before.
I'm saying a big network change is probably going to be executed while HQ is awake and in the loop. In my experience the most skillful devs aren't the most senior.
We don't know much about the specific client configuration change that triggered this condition yet. It doesn't necessarily have to be a big company wide infrastructure change.
To me it sounds plausible that an SRE team in an alternate location made a change scoped to their permission level, following company-directed playbooks, which eventually triggered the faulty condition at Fastly.
Many big tech companies (and individual products within the company) rotate oncall/release/SRE responsibilities between NA and EU. HQ doesn't need to be awake in these cases.
Why would reddit push such a change at 4am SF time? It's also unlikely that a high load at that time would trigger anything. Reddit's peak activity times are US daylight hours.
I don't expect fastly to name and shame a customer who made a valid change, nor do I expect fastly to give us a detailed explanation of what the bug is.
I'm still a little annoyed at their status page [0]. It says:
> We're currently investigating potential impact to performance with our CDN services.
yet in the blog post we're talking about here it says:
> Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.
85% of your network returning errors is _not_ a potential performance impact.
This is a great write-up. When describing my own bugs and issues to stakeholders I often find the following difficult to communicate:
* Bug was introduced on date X but only caused problems on date Y ("if the bug was introduced on date X then we would have seen it on date X, so you’re wrong")
* Doing X led to the outage but X wasn’t the fault, X was a valid thing to do, the code should have been able to handle X, the fact the code couldn't handle it was the actual problem which needs to be fixed ("look, you said X caused the problem, so the solution is just not to do X right?")
This article conveys both these points clearly and effortlessly. I might borrow some terminology from this in the future.
Let's hope that this is just the first statement we see about the outage. This is not even 24 hours after the outage, so I'll give them the benefit of the doubt that more details will be forthcoming.
But I can’t help but be bothered that a single customer’s configuration change would have such a wide ranging impact across so many sites. I’m looking forward to finding out how that happens…
The "valid customer configuration change" seems to cover the angle that the user input was validated and valid but the backend implementation of said configuration was the buggy part. Look forward to actual details from them.
This is annoyingly vague. What was the software bug, and what was the valid customer configuration change?
It's perhaps a bit premature to demand it at this point, but I'm hoping a full post-mortem will outline precisely how this change was not picked up in pre-prod. Surely all valid customer configurations must be tested prior to rollout.
It's not just annoyingly vague. It's insultingly vague.
If my data centre provider suffered a complete outage, then I demand to get a detailed post-mortem of what happened (in due time). If they just tell me bullshit PR speak about "We value our customers", I'll be looking at switching providers.
As a Fastly customer whose site went down, I'm entitled to know exactly what happened. If they don't tell me, I'm switching CDNs as a matter of priority.
> As a Fastly customer whose site went down, I'm entitled to know exactly what happened.
Does your contract say you're entitled to an RCA?
As others have said, this is more of an update, not a complete RCA on the entire situation. They have short term tasks that they've described in this summary post and I would expect that they will give a more complete analysis later.
It does say they haven't finished rolling out the permanent fix, (i.e. such a customer configuration(/exploit) could still bring down some servers) and will/are conducting a full post-mortem. So hopefully a juicier post to come.
> As a Fastly customer whose site went down, I'm entitled to know exactly what happened. If they don't tell me, I'm switching CDNs as a matter of priority.
If you are a hardcore user of their VCL on the edge, I'm very curious where you would go. The last time I looked (a year ago) there was no one that came even close to giving customers that level of control over request processing. Most of them fail to do complicated stuff with CORS without doing arabesques while balancing on a medicine ball (looking at you, Lambda@Edge), not to mention the ability to massage the response.
Someone answered this yesterday. CloudFront is good for video and large download assets (plus very low margins) but not for images and smaller stuff which Fastly is much faster at:
My completely uneducated guess is a networking rule that wasn't properly validated or tested, and black-holed all traffic destined to 0.0.0.0/0 or similar.
Does this sort of incident cause secondary disruption, any kind of ripple effects? I felt that large chunks of the Web were flaky for most of the day, not the relatively short space of time mentioned here.
Using DNS to failover to another CDN is pretty much the only solution if the solution is not "build out your own global edge infrastructure".
I was thinking more about this though and it has its own problems. You want a short TTL so failover is fast, but this increases the number of DNS lookups people have to do (and DNS lookups can be very slow!).
Additionally, a short TTL means you're more vulnerable to problems like the Dyn DNS attack [1] from 2016: names with longer TTLs stayed up for longer, since resolvers kept the correct records cached for longer.
But if you have a long TTL, even if you fail over, you'll still be down for at least as long as the DNS TTL pointing to the bad CDN.
Maybe you could do DNS round-robin against multiple CDN providers at once. Say you used 4; then if one went down, only 25% of requests would fail, and you could just remove the failing entry. This seems very expensive!
Honestly, the cost of these solutions is probably not worth it. The product I work on went partially down during the fastly outage. Then it came back up and everything is back to normal. It really won't impact us much at all. Shrug.
Use a short TTL on the CDN subdomain you use. Then set up an alternative CDN provider in advance, so that you can switch from one to the other in a matter of minutes.
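A rough sketch of that setup (the hostnames and the set_cname callback are made up; in practice that callback would hit your DNS provider's API): probe the primary and flip a low-TTL CNAME to the standby when it fails. Worst-case client impact is roughly your detection interval plus the TTL.

    import urllib.request

    PRIMARY = "cdn-primary.example.net"   # made-up hostnames
    STANDBY = "cdn-standby.example.net"
    TTL_SECONDS = 60                      # short TTL so the switch propagates fast

    def healthy(host, path="/health"):
        try:
            with urllib.request.urlopen(f"https://{host}{path}", timeout=5) as resp:
                return resp.status == 200
        except Exception:
            return False

    def check_and_failover(current, set_cname):
        # Worst-case client-visible outage ~= check interval + TTL_SECONDS.
        if current == PRIMARY and not healthy(PRIMARY):
            set_cname("www.example.com", STANDBY, ttl=TTL_SECONDS)
            return STANDBY
        return current

    # Run this every minute from somewhere outside both CDNs.
    current = check_and_failover(PRIMARY, lambda name, target, ttl: print(name, "->", target))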
So, a valid customer configuration change triggered a bug. One thing I don't see in this write-up is a commitment to ensure that customer configurations cannot break the whole system. Cloudflare does seem to make this promise with their zero trust architecture.
> commitment to ensure that customer configurations cannot break the whole system
You can't just ensure a config change won't break things in large distributed systems; it's too complex, with too many factors, and there will always be risk. To mitigate that risk, you'd want to design your system to do progressive, regional rollouts, with canaries to attempt to detect and isolate problems before a widespread outage occurs. Even if you have all of this set up, there is still a risk that your regions and systems are not fully isolated and outages could cascade anyway.
There will always be risk; there will always be errors. This is why SLAs and SLOs exist: they define and codify an agreement about what an outage is and what compensation is required if the agreement isn't met.
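A bare-bones illustration of that progressive/canary rollout idea (deploy_to, rollback and error_rate are hypothetical hooks into whatever deployment and telemetry systems you actually have):

    import time

    STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of the fleet per stage
    ERROR_BUDGET = 0.02                # abort if the 5xx ratio exceeds this

    def progressive_rollout(change, deploy_to, rollback, error_rate, soak_seconds=300):
        for fraction in STAGES:
            deploy_to(change, fraction)
            time.sleep(soak_seconds)               # let the canary soak
            if error_rate(fraction) > ERROR_BUDGET:
                rollback(change)                   # stop before it goes wide
                return False
        return True

    # Toy run with fake hooks and no soak time.
    ok = progressive_rollout(
        "config-v2",
        deploy_to=lambda c, f: print(f"deploy {c} to {f:.0%} of fleet"),
        rollback=lambda c: print(f"rolling back {c}"),
        error_rate=lambda f: 0.001,                # pretend telemetry looks fine
        soak_seconds=0,
    )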
Pushing out new versions of your site. You can't have the new assets on only half the nodes that are serving your site, otherwise your site goes down while things slowly propagate.
> Why is this the case? I don't have too much knowledge of CDN architecture so I am curious
Fastly is not really a regular CDN. It is a fully programmable edge cache, with cache-control logic decided and controlled by the customer and running at the edge. You can think of Fastly configuration as a part of your code base, where it is up to you to decide whether to perform an action at the edge on a per-request basis rather than at the origin on a per-cached-request basis.
That in turn means that if you deploy to your API/web 50 times a day, you are likely to deploy your Fastly configuration about the same number of times.
Zero trust seems to be very unrelated to this issue. The issue seems to have been a poison config breaking fastly stack. Zero trust is about verifying authentication of devices/users. Unrelated things really.
One of Cloudflare's top engineers previously wrote in this forum,
> This incident emphasizes the importance of the Zero Trust model that Cloudflare follows and provides to customers, which ensures that if any one system or vendor is compromised, it does not compromise the entire organization. [1]
Authentication is a part of a zero-trust model, not the whole thing.
> No single specific technology is associated with zero trust architecture; it is a holistic approach to network security that incorporates several different principles and technologies. [2]
They were referring to a completely different incident, involving compromised authentication to a camera system. I’d love to hear an explanation of how a zero-trust model would apply to this situation with Fastly. Seems like it would have to apply to a lot of multi-tenant resource exhaustion issues since we know so little about the specifics on the Fastly incident.
A blog post concerning how customer configurations cannot bring down other customers' sites would be great to see from Cloudflare. Fastly does not seem in a position to say that about its own stack and I don't expect another company to know their stack that well.
Why do so many big companies use Fastly when Cloudflare (from the outside, as someone who doesn't know much about the space) looks to be so much cleaner and more technically sophisticated? Am I being brainwashed by their blog posts?
Truly different capabilities under the hood. Yes, they are both CDN providers, but fastly offers a remarkable amount of customization that cloudflare does not.
For 99% of customers, one can argue that cloudflare is more than sufficient. For 1% of customers, fastly is arguably the correct choice just based on feature set alone.
So, in summary, you can certainly compare the two, but for certain customers cloudflare lacks the feature set they may choose to use on fastly.
Yes, to the extent that Fastly lets you upload your own VCL configuration files. This was the source of the problem here, but is incredibly powerful for complex use cases.
I don't know about you, but I find the prospect of a Cloudflare monoculture pretty worrying, especially since they've already demonstrated a willingness to kick off users they don't like. (I also think the HTTPS veneer that they offer is misleading to end users and bad for everyone on the internet, though not everyone will agree with that.)
I too think Cloudflare's "reverse HTTPS proxy" approach (where they have a CA-signed certificate for every domain that gets pointed at them) is bad for the Internet.
But how does Fastly avoid this problem? It's really more a symptom of the "web pki" trainwreck than anything else.
I tried looking on Fastly's website for technical details, but like every other corporate website it was an impenetrable mass of marketing bling and partner logos.
As much as that's not ideal, if the proxy is actually using HTTPS and verifying the upstream certificate then I don't think it breaks the user's security expectations too badly. But CloudFlare also offer a mode where they will serve HTTPS to the user but connect to your upstream via unencrypted HTTP over the public internet, which I think is just shockingly awful compared to what a user expects a site that uses HTTPS to do.
Here's a guess: VCL [1] looks able to include time-based conditions. So a customer deploys a new, valid config with a time-based condition. It rolls out globally with no issue. Then the time rolls around that activates the bit which tickles this bug. Boom.
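A toy Python rendering of that guess (Fastly configs are VCL, not Python; this just shows the shape): the rule validates and serves fine when uploaded, and only once the clock passes the condition does the dormant branch run and hit the bad code path.

    import datetime

    # Made-up time-gated rule: valid at upload, dormant for weeks.
    RULE = {"activate_after": datetime.datetime(2021, 6, 8, 9, 47),
            "action": "rewrite-headers"}

    def handle_request(rule, now, apply_action):
        if now < rule["activate_after"]:
            return "served normally"            # every test and canary saw this
        return apply_action(rule["action"])     # the path nobody exercised

    def buggy_apply(action):
        # Stand-in for the latent bug in the serving code.
        raise RuntimeError(f"unhandled action {action!r} -> 503")

    print(handle_request(RULE, datetime.datetime(2021, 5, 13, 12, 0), buggy_apply))
    handle_request(RULE, datetime.datetime(2021, 6, 8, 9, 48), buggy_apply)   # boom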
You took down my site and a good swath of the whole internet. I am entitled to know, in detail, what happened, so I can be more informed and assess any actions I might need to take.
I don't want any more of your PR speak or "we value our customers". That's crap and insults my intelligence. STOP getting PR to write your comms; just speak to engineers like engineers. I'd rather get no response than this post.
I hope there are actual details as they complete their investigation. If there isn't a public post-mortem, I am switching away from Fastly.
But you're really not entitled to the details, no matter how you feel about it. You're entitled to whatever compensation is defined in your contract if an SLA was breached. Beyond that, Fastly is going to provide the level of detail they feel is necessary to reassure their major customers that they've addressed the issue and will do their best to keep it from happening again. Unfortunately one of the downsides of using the services of other companies is knowing that something is going to happen at some point, and there's not going to be anything you can do about it except hope that your contingency plans are adequate or wait it out.
Obviously on this site we tend to be rather technical people, so we want to know as much detail as possible, but that's something we desire, not something we are entitled to.