Swaths of websites went down on Tuesday morning after an outage at the cloud computing services provider Fastly. Internet users were unable to access major news outlets, e-commerce platforms, and even government websites. Everyone from Amazon to the New York Times to the White House was affected, all thanks to one customer trying to change their settings.
At around 6:30 am ET, Fastly said it applied a “fix” to the issue, and many of the websites that went down seemed to be working again as of 9 am ET. Still, the outage highlights how centralized, interdependent, and vulnerable to failure the infrastructure supporting the internet actually is, especially the cloud computing providers that the average user doesn’t directly interact with. This is at least the third time in less than a year that a problem at a large cloud computing provider has led to countless websites and apps going dark.
Fastly is a content delivery network (CDN), which maintains a network of servers that transfer content quickly from websites to users. The company, which counts Shopify, Stripe, and many media outlets as customers, promises “lightning fast delivery” and “advanced security.” The nature of such a network also means that problems can quickly spread and affect many of those customers at once. In the case of Tuesday’s incident, Fastly says it “identified a service configuration that triggered disruptions” around the globe. It took about two hours from the time the problem was identified until a fix was implemented.
At the moment, there’s no reason to suspect the outage was the result of a cyberattack. On Tuesday night, Fastly said the issue was the result of a bug in its software, which a single customer apparently triggered. Still, the outage comes amid a slew of recent cyberincidents that have impacted everything from the global meat supply to a major oil pipeline in the United States.
It’s nevertheless clear that the outage caused momentary mayhem. The site Downdetector, which tracks complaints about website failures, shows that a slew of sites saw an uptick in complaints on Tuesday morning, not only media outlets like the New York Times and CNN but also Reddit, Spotify, and Walt Disney World. Outages at payment systems like Stripe and e-commerce platforms like Shopify also suggest money could have been lost in transactions that didn’t go through, though it’s so far unclear if that’s the case.
All Vox Media websites, including this one, were offline for a half-hour. The Verge, which is owned by Vox Media, transitioned to offering its content on Google Docs before internet users swarmed the doc and started editing (editors accidentally left the page unrestricted). Kentik, an internet observability company, reported that the outage was responsible for a 75 percent drop in traffic from Fastly’s servers.
The scale of Tuesday’s outage, and the frequency of large outages like it, is what’s really worrisome. Last July, connection issues between two of the data centers operated by Cloudflare ultimately took many sites, including Politico, League of Legends, and Discord, briefly offline. Then, a data-processing problem at Amazon Web Services last November caused problems for sites like the Chicago Tribune, the security camera company Ring, and Glassdoor. The Fastly outage shows the trend continuing, especially as most of the web grows increasingly dependent on cloud providers.
While the issue seems to be fixed for now, it will take some time to measure the damage caused by even a couple of hours of downtime at a major cloud computing provider. And that leaves the world anxiously awaiting the next time this happens.
Why these outages feel like they’re getting worse
One of the reasons the Fastly outage seems so wide in scale is that cloud computing service companies like Fastly are consolidating, leaving websites dependent on a shrinking number of providers. Even if there aren’t that many total outages, the fact that so many everyday sites rely on fewer cloud providers makes each individual outage feel pretty significant to an average internet user who just wanted to buy some stuff on Amazon and read the New York Times early Tuesday morning.
There are benefits to consolidation, explains Doug Madory, the head of internet analysis at the network monitoring company Kentik. For instance, a smaller number of cloud providers means it’s much easier to get those providers to deploy a particular security change. “The flip side is the liability [of] having a few megacompanies, whether they’re CDNs [content delivery networks] or other types of internet firms, responsible for a lot of our internet activities,” Madory told Recode.
In other words, when one of these megacompanies updates its systems and inadvertently causes an outage, the damage radius could be quite wide. This is what happened in 2011 when one of Amazon’s cloud computing systems, Elastic Block Store (EBS), crashed and brought Reddit, Quora, and Foursquare offline. After the incident, Amazon explained that engineers inadvertently caused technical problems that trickled down through its systems and caused the outage.
“You end up with these cascading failures,” explained Christopher Meiklejohn, a PhD student at Carnegie Mellon’s Institute for Software Research. “They’re difficult to debug. They’re stressful and difficult to resolve. And they can be very difficult to detect early on when you’re thinking about making that change, because the systems are so complex and they involve so many moving parts.”
In the case of Fastly’s Tuesday outage, the issue appeared to come from a bug that was introduced back in May when the company deployed some new software. But the bug went undiscovered until Tuesday, when one customer’s routine change to its own settings triggered it and inadvertently brought down much of the internet, according to a summary released by Nick Rockwell, the company’s SVP of engineering and infrastructure.
Central to the challenge of systems like Fastly’s, Meiklejohn said, is the fact that these cloud computing systems can involve tens of thousands of servers deployed across the world. It’s very difficult for developers working on new changes to anticipate all the characteristics of the larger system, a scenario that makes it more likely for an error to occur when updates are finally implemented. Companies don’t always have the tools to detect these problems before they happen, though there’s growing research and effort into better solutions.
The Fastly outage also happened amid growing concerns about cybersecurity. Now, many are anxious for more details about how its systems went down from Fastly, which markets itself as a dependable and speedy service. The outage serves as a reminder that the internet is built on increasingly complicated, global infrastructure, where a single failure can potentially affect the sites and services of countless companies. That means little mistakes can have massive consequences.
Update, June 9, 2021, 3:40 pm ET: This piece has been updated with new information about the cause of the outage.