Depending on where you live and which websites you visit, you may have noticed the internet acting up on Tuesday (Feb 28, 2017). Countless websites were disrupted, and millions of people were affected. And now, thanks to Amazon admitting the truth, we know how it happened…
According to Amazon’s pithily titled “Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region”, a single typo was responsible for taking down a sizable chunk of the internet. And no, that’s not a typo. One small mistake by an unfortunate engineer disrupted countless websites for several hours.
Have You Tried Turning It Off and On Again?
As Amazon explains, at 9:37am PST, an Amazon engineer “executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process”. So far, so good, as this team member was doing exactly what they were paid to do.
“Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended”. These servers “supported two other S3 subsystems,” one of which “manages the metadata and location information of all S3 objects in the region”. With that capacity gone, S3 could no longer serve requests in the region, which created a serious problem.
Amazon then tried turning it off and on again. Unfortunately, “S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected”. As a result, it took until 1:54pm PST for everything to start functioning properly again.
To prevent a repeat performance, Amazon is “making several changes as a result of this operational event”. These include introducing “safeguards to prevent capacity from being removed,” “auditing our other operational tools to ensure we have similar safety checks,” and making “changes to improve the recovery time of key S3 subsystems”.
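For the curious, here is a rough idea of what a safeguard of the first kind might look like. This is only a sketch and certainly not Amazon's actual tooling: the names `Subsystem`, `remove_servers`, `min_required`, and `max_per_call` are all made up for illustration. It assumes two simple rules: a single command may only remove a handful of servers, and no removal may drop a subsystem below the capacity it needs to keep running.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal request would break a safety limit."""


@dataclass
class Subsystem:
    # All fields are hypothetical; real tooling would track far more state.
    name: str
    active_servers: int
    min_required: int  # assumed floor the subsystem needs to keep serving requests


def remove_servers(subsystem: Subsystem, count: int, max_per_call: int = 5) -> int:
    """Remove `count` servers and return how many remain.

    Refuses requests that look like a slip of the finger instead of
    carrying them out.
    """
    if count <= 0:
        raise ValueError("count must be a positive integer")

    # Rule 1 (assumed): cap how much capacity one command can remove.
    if count > max_per_call:
        raise CapacityError(
            f"refusing to remove {count} servers at once (limit is {max_per_call})"
        )

    # Rule 2 (assumed): never drop below the subsystem's minimum capacity.
    remaining = subsystem.active_servers - count
    if remaining < subsystem.min_required:
        raise CapacityError(
            f"{subsystem.name} would drop to {remaining} servers, "
            f"below its minimum of {subsystem.min_required}"
        )

    subsystem.active_servers = remaining
    return remaining
```

With a guard like this, a mistyped `remove_servers(index, 50)` raises a `CapacityError` and leaves the subsystem untouched, while the intended `remove_servers(index, 5)` goes through as normal.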
We Told You Typos Always Matter
I once opined that typos always matter, even online and in text messages. This unfortunate incident surely bears that out, as it demonstrates the devastating impact a solitary typo can have. OK, so we’re not all Amazon engineers charged with keeping the internet up and running, but still, there’s no excuse!
Were you affected by the Amazon outage on Tuesday? Which websites did you notice acting up? How do you feel about the fact that a single typo took down so much of the internet? What should Amazon do to avoid a repeat performance? Please let us know in the comments below!
Image Credit: Marco Verch via Flickr