AWS blames a typo for Tuesday’s outage

Human error behind hours-long cloud outage that hit websites and apps

Amazon Web Services said today that its outage earlier this week, which affected major websites and apps, was caused by human error.

Sites including Netflix, Reddit and the Associated Press struggled for hours on Tuesday -- all because of a simple typo.

"While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses," the company wrote in an online message. "We will do everything we can to learn from this event and use it to improve our availability even further."

On Tuesday morning, AWS reported on its Service Health Dashboard that it was having problems with S3, its Simple Storage Service, in its data centers in Northern Virginia.

The issue, which even affected the AWS dashboard, was not cleared up until about 5 p.m. ET that day.

Now, AWS is offering an explanation of what happened.

"The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected," the company noted. "At [12:37 p.m. ET], an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process.

"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," the message added.

Zeus Kerravala, an analyst with ZK Research, said it's not surprising that such a major issue was caused by human error.

"My research shows that 37% of IT outages are from human error," he said. "It's scary and shows that despite so many advancements in technology, we are still largely beholden to manual processes. This is an example of where better automation and machine learning could help."

AWS noted in its online message today that its engineers have learned from Tuesday's outage and are making changes to try to keep it from happening again.

"While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly," the company explained. "We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."

That, according to AWS, should prevent an incorrect input from triggering another outage.
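AWS has not described the tool's internals, so the following is only a rough sketch of the kind of safeguard it describes; the safe_remove function, the minimum-capacity table, and the batch and pause values are illustrative assumptions, not AWS's actual implementation:

```python
# Hypothetical sketch of the described safeguards: remove capacity slowly, in
# small batches, and refuse any removal that would drop a subsystem below its
# minimum required capacity. All names and numbers are illustrative only.
import time

MIN_REQUIRED = {"subsys-a": 450}   # assumed minimum healthy server count
BATCH_SIZE = 2                     # remove only a few servers per step
PAUSE_SECONDS = 1                  # slow removal down between batches

def safe_remove(fleet, subsystem, to_remove):
    if len(fleet) - len(to_remove) < MIN_REQUIRED[subsystem]:
        raise RuntimeError(
            f"refusing removal: {subsystem} would fall below "
            f"{MIN_REQUIRED[subsystem]} servers"
        )
    for i in range(0, len(to_remove), BATCH_SIZE):
        for server in to_remove[i:i + BATCH_SIZE]:
            fleet.remove(server)
        time.sleep(PAUSE_SECONDS)   # give monitoring time to catch a mistake
    return fleet
```

Under a guard like this, the mistyped input in the earlier sketch would be rejected outright, since removing 100 of 500 servers would take the subsystem below its assumed 450-server floor.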

The company also noted that engineers are auditing other operational tools to ensure they have similar safety checks.

"We will also make changes to improve the recovery time of key S3 subsystems," AWS noted. "We employ multiple techniques to allow our services to recover from any failure quickly."

Patrick Moorhead, an analyst with Moor Insights & Strategy, said he thinks this incident will give AWS a black eye in the short term.

"It's incredible to think that one mistake by one person on one command can take down millions of users," he said. "People should expect more from AWS... This incident will make enterprises think twice about moving certain workloads and apps to the public cloud and motivate them to look closely at the private cloud."

For his part, Kerravala said he expects cloud rivals Google and Microsoft to hop on this AWS incident and try to drive any lost business their way.

