Earlier this week (February 28, 2017), AWS S3 suffered an outage of approximately three hours in the US-EAST-1 region. Many companies whose online services depend on AWS were affected as a result and experienced disruptions during this period.
To their credit, AWS has always been open about issues that occur on their systems. They performed an internal analysis and have published a summary of the disruption as well. You can check the summary here:
RCA from AWS
From the AWS summary of the analysis: at around 9:40 AM PST, S3 availability dropped to 0%. Packet loss spiked, availability of the system plummeted, and this crashed state persisted for the entire outage period.
According to the AWS RCA, the root cause was human error. The S3 team had been debugging an issue with the S3 billing system. As part of the debugging process, they needed to remove a small number of servers from an S3 subsystem used by the billing process.
Unfortunately, one of the inputs to the command was entered incorrectly, which removed a much larger set of servers than intended. The removed servers also hosted units supporting two other S3 subsystems, one of which was responsible for managing metadata and location information for all S3 objects in the region. The team realized this and started bringing the servers back up, but restoring these subsystems to a functional state took time.
You can read more about the exact problem, and the corrective measures AWS is taking to prevent it from happening again, in the summary linked above.
Impact on users:
S3 being a core system, many other AWS services that depend internally on S3 also failed. Services like Elastic Load Balancing (ELB) and the Redshift data warehouse dropped to limited or no availability during the outage. Many companies and services, including National Geographic, Coursera, and Slack, were reported to have been affected.
Impact on AWS:
AWS provides a strong S3 SLA. A three-hour outage means S3 fell below the SLA's 99.9% monthly availability threshold, so AWS might owe customers a 10% service credit within the US-EAST-1 region. The outage may also have triggered a domino effect on the SLAs of dependent services. An exact amount is difficult to arrive at, but it is probably in the millions of dollars.
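To see why three hours of downtime breaches a 99.9% monthly SLA, a quick back-of-the-envelope calculation helps. The credit tiers below reflect my reading of the S3 SLA at the time (10% credit below 99.9%, 25% below 99.0%) and are illustrative, not authoritative:

```python
# Back-of-the-envelope check: a 3-hour outage vs. a 99.9% monthly SLA.
# Credit tiers are illustrative of the S3 SLA at the time; check the
# actual SLA document for exact terms.

HOURS_IN_MONTH = 30 * 24  # ~720 hours in a 30-day month

def monthly_availability(downtime_hours: float) -> float:
    """Percentage of the month the service was up."""
    return 100.0 * (1 - downtime_hours / HOURS_IN_MONTH)

def service_credit_percent(availability: float) -> int:
    """Illustrative credit tiers for a 99.9% SLA."""
    if availability < 99.0:
        return 25
    if availability < 99.9:
        return 10
    return 0

availability = monthly_availability(3)  # ~99.58%, below the 99.9% threshold
credit = service_credit_percent(availability)
print(f"availability = {availability:.2f}%, service credit = {credit}%")
```

A 99.9% SLA only allows about 43 minutes of downtime per month, so three hours blows well past it.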
Did companies rely too heavily on the S3 SLA?:
One of the important takeaways from the whole episode of the S3 outage is the question of how much a solution should trust or depend on a vendor service. One of the most robust solutions currently running on the cloud is Netflix, and it is robust because of the core principle on which the entire architecture was built.
Build for failure
Netflix services were built for failure, meaning they assume 90% of the systems in the infrastructure are prone to failure. Solutions turn out to be highly available when they are architected with this end goal in perspective. One important principle AWS suggests its architects keep in mind is to build an application-level high-availability layer rather than depending on the vendor's availability guarantee alone.
The AWS S3 system itself was built to be highly available and redundant, automatically replicating stored objects and files across data centers. For third-party solutions depending on AWS services, an additional layer of redundancy would require leveraging additional AWS regions or even alternative cloud providers as a fallback.
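A minimal sketch of what such an application-level fallback layer might look like. The fetch functions here are hypothetical stand-ins for real storage-client calls (e.g. reads against a primary bucket and a replica in another region), not an actual client implementation:

```python
# Sketch of an application-level read fallback across storage backends.
# The "primary" and "secondary" fetchers below are hypothetical stand-ins
# for real client calls against different regions or providers.

from typing import Callable, Sequence

def read_with_fallback(fetchers: Sequence[Callable[[str], bytes]], key: str) -> bytes:
    """Try each backend in order; return the first successful read."""
    errors = []
    for fetch in fetchers:
        try:
            return fetch(key)
        except Exception as exc:  # real code would catch the client's error types
            errors.append(exc)
    raise RuntimeError(f"all backends failed for {key!r}: {errors}")

# Stand-in backends: the primary region is "down", the replica serves the read.
def primary(key: str) -> bytes:
    raise ConnectionError("primary region unavailable")

def secondary(key: str) -> bytes:
    return b"object-bytes-for-" + key.encode()

data = read_with_fallback([primary, secondary], "reports/2017-02.csv")
print(data)
```

The hard part, as noted below, is not the fallback logic itself but keeping the replicas consistent and synchronized.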
This approach adds more application complexity and cost overhead, since the user becomes responsible for maintaining data consistency and synchronization. You can check out the open-source tool suite that Netflix uses internally to test the stability of its production systems. It's called Simian Army. One of the more popular tools in the suite is Chaos Monkey, which randomly terminates VM instances. Even so, their tool suite did not consider a base-component failure like that of S3.
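The core idea behind Chaos Monkey can be illustrated with a toy sketch: periodically pick a random running instance from the fleet and terminate it, forcing services to be built to survive instance loss. This is not Chaos Monkey's actual code, and the names here are hypothetical:

```python
# Toy illustration of the chaos-testing idea behind Chaos Monkey:
# randomly select a running instance to terminate. A real tool would
# then call the cloud provider's API to kill it; here we just print.

import random

def pick_victim(instances, rng=random):
    """Choose one running instance at random, or None if none are running."""
    running = [i for i in instances if i["state"] == "running"]
    if not running:
        return None
    return rng.choice(running)

fleet = [
    {"id": "i-001", "state": "running"},
    {"id": "i-002", "state": "stopped"},
    {"id": "i-003", "state": "running"},
]

victim = pick_victim(fleet)
print("terminating", victim["id"])
```

Note that random instance termination exercises resilience to individual host failures; it says nothing about a whole regional subsystem like S3 going down, which is exactly the gap the outage exposed.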
Most organizations and solutions do not opt for such high redundancy and availability, due to the various concerns discussed above. The usual redundancy approach is to keep data backups. Though backups provide a layer of disaster-recovery protection, they do not help much in the short term: they cannot guarantee 24x7 uptime, since restoring from backup takes time.
What did we learn from the outage?
The recent rise of and move to the cloud has delivered huge improvements in time to market, stability, resilience, and availability compared to traditional infrastructure. However, the cloud brings its own complexities, such as the dependency of solutions on third-party services over which we no longer have control. Cloud developers, designers, and operations teams need to review such dependencies and put monitoring and recovery strategies in place, with the aim of improving the redundancy and availability of solution-critical services.