A sprawling Amazon Web Services cloud outage that began early Monday morning illustrated the fragile interdependencies of the internet as major communication, financial, health care, education, and government platforms around the world suffered disruptions. As the day wore on, AWS diagnosed and began working to correct the issue, which stemmed from the company’s critical US-EAST-1 region based in northern Virginia. But the cascade of impacts took time to fully resolve.
Researchers reflecting on the incident particularly highlighted the length of the outage, which started around 3 am ET on Monday, October 20. AWS said in status updates that by 6:01 pm ET on Monday “all AWS services returned to normal operations.” The outage directly stemmed from Amazon’s DynamoDB database application programming interfaces and, according to the company, “impacted” 141 other AWS services. Multiple network engineers and infrastructure specialists emphasized to WIRED that errors are understandable and inevitable for so-called “hyperscalers” like AWS, Microsoft Azure, and Google Cloud Platform, given their complexity and sheer size. But they noted, too, that this reality shouldn’t simply absolve cloud providers when they have prolonged downtime.
“The word hindsight is key. It’s easy to find out what went wrong after the fact, but the overall reliability of AWS shows how difficult it is to prevent every failure,” says Ira Winkler, chief information security officer of the reliability and cybersecurity firm CYE. “Ideally, this will be a lesson learned, and Amazon will implement more redundancies that would prevent a disaster like this from happening in the future—or at least prevent them staying down as long as they did.”
AWS did not respond to questions from WIRED about the long tail of the recovery for customers. An AWS spokesperson says the company plans to publish one of its “post-event summaries” about the incident.
“I don’t think this was just a ‘stuff happens’ outage. I would have expected a full remediation much faster,” says Jake Williams, vice president of research and development at Hunter Strategy. “To give them their due, cascading failures aren’t something that they get a lot of experience working with because they don’t have outages very often. So that’s to their credit. But it’s really easy to get into the mindset of giving these companies a pass, and we shouldn’t forget that they create this situation by actively trying to attract ever more customers to their infrastructure. Clients don’t control whether they are overextending themselves or what they may have going on financially.”
The incident was caused by a familiar culprit in web outages: “domain name system” resolution issues. DNS is essentially the internet’s phonebook, the mechanism that translates domain names into the numeric addresses that direct web browsers to the right servers. Because nearly every request begins with that lookup, DNS issues are a common source of outages: When a name fails to resolve, requests fail and content doesn’t load.
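To make the phonebook analogy concrete, here is a minimal Python sketch of a client that must look up a name before it can reach a server. It is an illustration only and says nothing about how AWS or DynamoDB actually handle DNS. The function names `resolve` and `describe_request` and the swappable `resolver` parameter are hypothetical, introduced here for demonstration; `socket.getaddrinfo` is the standard-library lookup call.

```python
import socket


def resolve(hostname, resolver=socket.getaddrinfo):
    """Return the unique IP addresses for hostname, or [] if the lookup fails.

    `resolver` is a hypothetical injection point so the lookup can be
    swapped out (for example, to simulate a DNS failure without a network).
    """
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        # The "phonebook" has no usable entry: the name did not resolve.
        return []
    # Each result is a 5-tuple; the last item is the socket address,
    # whose first element is the IP address string.
    return sorted({info[4][0] for info in results})


def describe_request(hostname, resolver=socket.getaddrinfo):
    """Show why a DNS failure stops a request before any server is contacted."""
    addresses = resolve(hostname, resolver)
    if not addresses:
        return f"request to {hostname} failed: name did not resolve"
    return f"request to {hostname} can proceed via {addresses[0]}"
```

In this sketch the server itself could be perfectly healthy; if the lookup returns nothing, the client never learns where to send its request, which is roughly why resolution problems can ripple outward to services that depend on the affected name.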