AWS Is Hosting Half the Internet; It's a Problem

4 minutes to read

If you’re anything like me (and by that I mean you’re online every once in a while), you almost certainly hit at least one error screen on Monday, 20 October. For me, it mostly showed up as colleagues who could no longer log in through our SSO provider to Confluence, Jira, and other Atlassian tools. I was lucky: I’d logged in early in the morning, so my login token stayed valid and I wasn’t affected, at least not at work.

What we’ve learnt since then is that there was a DNS-related[1] outage of an Amazon service. Apparently misconfiguring DNS in a single region can lead to widespread outages all around the world. The scale of the disruption was remarkable: services that millions of people rely on daily simply stopped working because of a configuration error in one AWS region.

And again, if you’re anything like me, you were wondering: why does so much depend on a single service? Why does so much depend on a single AWS region? Even other cloud providers were affected. Every so often it really boggles my mind how a decentralised system such as the internet can end up so centralised in a single AWS region. The irony of a globally distributed network being brought down by a single point of failure isn’t lost on anyone paying attention.

Everything Is Affected

It looks like almost everything[2] was impacted. Even Signal, which doesn’t use AWS infrastructure for everything, had some problems. The outage demonstrated just how deeply AWS has embedded itself into the fabric of the modern internet.

PSA: we're aware that Signal is down for some people. This appears to be related to a major AWS outage. Stand by.

Now, with the ongoing digitisation of everything, when a large part of the internet goes offline, it unfortunately affects nearly everyone at least a little. And while it’s funny when a smart bed goes offline, it’s much more serious when it’s your bank. I, personally, don’t really carry cash anymore, so if I can’t pay with my plastic money, I really can’t pay at all. The shift to a cashless society has made us vulnerable in ways we only begin to understand when these outages occur.

And whilst Reddit and Snapchat going offline is probably not much of an issue, if your doorbell is offline, you should probably question your dependency on big tech. There’s something deeply unsettling about not being able to answer your own door because a server farm thousands of kilometres away has a problem.

Learning?

Look, I don’t think tech or other big companies will learn from this. Frankly, it doesn’t hurt them enough to change their practices. The financial impact of a few hours of downtime is negligible compared to the cost of building truly resilient, multi-provider infrastructure. And of course, hiding in the crowd and saying, “Well, everyone was offline” will work quite well as a defence. But then, I do think we need to differentiate between services according to their importance.

Some things need to be available as much as possible. Some websites, such as the UK’s official government website, are rather essential. When critical government services go down because of a third-party infrastructure failure, we need to ask serious questions about digital sovereignty and resilience.

And frankly, if you’re a bank, you need to learn to build a more resilient system than one that depends on a single cloud provider based in the US. Banks have regulatory requirements around uptime and availability, yet many seem content to outsource their entire infrastructure to a single vendor. And because banks are not going to act of their own volition, I guess we need regulations to force them to. It’s not only a question of resilience but also of caring for your customers and protecting their interests and privacy. Perhaps it’s time for financial regulators to mandate multi-cloud strategies or require proof of failover capabilities that don’t rely on a single provider’s infrastructure.
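
To make "failover that doesn't rely on a single provider" a bit more concrete, here is a minimal sketch in Python of the basic idea: try one provider, and if it fails, move on to the next. Everything in it is hypothetical and only for illustration. `call_with_failover`, `primary_cloud` and `secondary_cloud` are names I made up, the two functions merely stand in for requests to two independent hosting providers, and a real setup would also need replicated data, health checks and DNS that doesn't itself hang off one vendor.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the list has failed."""


def call_with_failover(
    providers: Sequence[tuple[str, Callable[[], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order and return the first success.

    Returns (provider_name, result). Raises AllProvidersFailed if none work.
    """
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:  # any failure moves on to the next provider
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for requests to two independent providers.
def primary_cloud() -> str:
    # Simulates the primary provider being unreachable.
    raise ConnectionError("DNS resolution failed")


def secondary_cloud() -> str:
    return "payment accepted"


used, result = call_with_failover([
    ("primary", primary_cloud),
    ("secondary", secondary_cloud),
])
print(used, result)
```

The point of the sketch is only that the fallback has to live somewhere that doesn't share the primary's failure mode. If both entries in the list end up on the same provider, or even in the same region, the loop buys you nothing.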

The centralisation of the internet onto a handful of cloud providers—with AWS leading the pack—represents a systemic risk that we’re only beginning to grapple with. Until the pain of outages exceeds the cost of building redundancy, I suspect we’ll keep seeing these incidents. And each time, we’ll collectively shrug and move on, until the next one.


  1. It’s always DNS, isn’t it? The running joke in tech circles is that when something breaks, it’s probably DNS. This time, the joke wasn’t funny—it was just accurate. ↩︎

  2. Of course, as the Mastodon people were quick to point out, the decentralised nature of Mastodon meant it was running quite fine.

    Signal down due to AWS issues. Maybe centralization really isn’t such a great idea. Who would have thought.

    (Except for the few instances hosted on AWS, naturally.)

    Meanwhile, because Mastodon server admins are all cheapskates and we're not going to pay the sort of prices for hosting that AWS charge, the Fediverse keeps on chugging away happily while the corporate internet falls apart.

     ↩︎

Tags: Cloud, Political, Technical