Ghosts In The Machine – What Really Happened When The Lights Went Out At Amazon Web Services



The author addresses the Amazon Web Services (AWS) outage that occurred on June 29, 2012, which affected big consumer properties like Netflix and Instagram. Following the outage, Amazon posted a lengthy message describing the incident and what it plans to do to mitigate future outages.


Ultimately, a power spike caused the backup generators to take over, but one of the data center generators failed, which led to the downtime for Netflix and others. According to AWS, about 7% of the instances in the "US-East-1 region" were affected.


Seems straightforward, but a bug AWS had never seen before in its elastic load balancers caused a flood of requests, which created a backlog. At the same time, customers launched EC2 instances to replace what they had lost when the storm knocked out power. That in turn triggered instances to be added to existing load balancers in other zones that were not affected. The load, though, began to pile up in what amounts to a traffic jam with only one lane open out of the mess. Requests took increasingly long to complete, which led to the issues at Netflix and the rest.
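The traffic-jam dynamic described above can be sketched with a toy queue simulation (illustrative only; the arrival and service rates below are made-up numbers, not AWS figures):

```python
from collections import deque

def simulate(arrival_rate, service_rate, ticks):
    """Toy queue: when requests arrive faster than they can be served,
    the backlog -- and therefore each request's wait -- grows every tick."""
    queue = deque()
    backlog_history = []
    for t in range(ticks):
        for _ in range(arrival_rate):   # new requests arriving this tick
            queue.append(t)
        for _ in range(service_rate):   # limited "one lane" capacity
            if queue:
                queue.popleft()
        backlog_history.append(len(queue))
    return backlog_history

# With demand (10/tick) exceeding capacity (6/tick), the backlog
# grows by 4 requests every tick -- the "traffic jam".
history = simulate(arrival_rate=10, service_rate=6, ticks=20)
print(history[0], history[-1])
```

The point of the sketch is simply that once demand exceeds capacity, the queue never drains; it only grows, which is why request latency climbed until the backlog was cleared.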


The author points out that the AWS network is really a giant computer that is constantly tuned to account for usage shifts. Since it is primarily network-based, it relies heavily on machine-to-machine communication and highly efficient machines.


Without AWS, Netflix could not economically offer its on-demand service, and Instagram would have to rely on a traditional hosting environment, which can't come close to what can be done with AWS.


Lastly, the author asks customers to be patient, given that AWS is a very complex, efficient, and economical environment.


Curator Comments


While it is true that without services like AWS, companies like Instagram and Netflix could not provide the economical services that they do, the bottom line is that we are all dealing with an immature, complex infrastructure that is primarily driven by customer demand shifts. In other words, the cloud environment is subject to downtime.


A thunderstorm caused power surges that triggered a fully tested backup facility, which uncovered a previously unknown "bug" that ultimately caused customer outages and degraded services.


Over 20 years ago, enterprise IT shops proudly displayed minimal mainframe outages, an "IT-centric" metric, as an indication of effectiveness. They forgot to include end-to-end customer measures, so the "true" availability fell far short of 99.999%.
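The gap between an availability claim and lived experience is easy to quantify; each additional "nine" translates directly into allowed downtime per year, as this small sketch shows:

```python
def downtime_minutes_per_year(availability_pct):
    """Minutes of allowed downtime per year at a given availability percentage."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes in a non-leap year
    return minutes_per_year * (1 - availability_pct / 100)

# 99.999% ("five nines") allows only about 5.26 minutes of downtime a year;
# 99.9% allows roughly 8.8 hours.
print(round(downtime_minutes_per_year(99.999), 2))
print(round(downtime_minutes_per_year(99.9), 1))
```

A multi-hour incident like the June 29 outage, by itself, puts a provider well outside five-nines territory for the year, which is why end-to-end measurement matters.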


One conclusion that can be reached is that cloud service providers (especially large, mainframe-like ones) should learn from these traditional mistakes of measurement and marketing before customers become disenchanted, as they did 20 years ago.


Call To Action


Ultimately, any IT executive who relies on cloud services is still responsible for IT support – the cloud is only the delivery mechanism of choice. When measuring effectiveness and setting expectations, it is incumbent on those executives to drill down into service-provider claims and reach a full understanding of potential risks and outages.




Original Article


Related References

Summary of the AWS Service Event in the US East Region