Lights out at Amazon

Tuesday, 28 February, saw “high error rates” in multiple AWS services in the US eastern region. These began in the S3 service provided by the US-East-1 North Virginia site before spreading to other services hosted by US-East-1, including CloudWatch, EC2, Storage Gateway, and WAF (web application firewall). Disruptions to the S3 service, which allows clients to store and retrieve data from AWS servers, left numerous websites devoid of product images and company logos (1).

Approximately 148,000 websites rely on the Amazon S3 service (2). Based on the location of Amazon’s headquarters, the Cambridge Centre for Risk Studies (CCRS) estimates that nearly one quarter of AWS clients use services provided by the US-East-1 North Virginia site.

The US-East-1 North Virginia site is the oldest of Amazon’s cloud service regions. Nonetheless, it remains a bustling hub that hosts 84 of AWS’ 86 services, more than any other region in Amazon’s global infrastructure. As the outage progressed, disruptions occurred in various other services offered by the site (3). These included outages in its “Service Health Dashboard”.

The dashboard was unable to update for the first two hours of the outage, which extended the impact of the outage beyond the US eastern region: users in all regions lost access to updates on the Health Dashboard. AWS tweeted “If you want to check on status, you’ll have to do so directly from Amazon” because the dashboard was down.

Amazon S3 reports that it is able to “automatically replicate data across multiple data centres and is designed to deliver 99.999999999% durability” with “geo-redundancy” (4). However, this does not seem to be the case for the company’s Health Dashboard. CCRS found that the outage of Amazon’s Service Health Dashboard occurred because the service lists US-East-1 as its sole endpoint, with no geo-redundancy. To limit the impact of this outage, it may have been advisable for Amazon to replicate the service and store its S3 data in additional independent centres alongside the US-East-1 site.
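The difference between a sole endpoint and a geo-redundant one can be illustrated with a minimal sketch. It is not a description of how Amazon's dashboard is built: the region list, the function names and the simulated failure are all assumptions made for illustration, and the network call is replaced by a stand-in so the example runs offline.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical region names used purely for illustration; they are not
# taken from any real dashboard configuration.
REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]


def first_healthy_status(
    regions: Iterable[str],
    fetch: Callable[[str], str],
) -> Optional[Tuple[str, str]]:
    """Return (region, status) from the first region whose fetch succeeds.

    `fetch` is any callable that returns a status string for a region and
    raises an exception when that region is unreachable.
    """
    for region in regions:
        try:
            return region, fetch(region)
        except Exception:
            # This replica is unavailable; try the next one.
            continue
    return None  # every replica failed


def simulated_fetch(region: str) -> str:
    """Stand-in for a network call: pretends us-east-1 is down."""
    if region == "us-east-1":
        raise ConnectionError("region unreachable")
    return "status page served from " + region


if __name__ == "__main__":
    # With a single endpoint, the status page is lost along with the region.
    print(first_healthy_status(["us-east-1"], simulated_fetch))
    # With replicas in other regions, the page stays reachable.
    print(first_healthy_status(REGIONS, simulated_fetch))
```

In this sketch the single-endpoint call returns nothing while the replicated call falls through to the next region, which is the behaviour the Health Dashboard appears to have lacked.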

Diversification of essential data and services across cloud providers is not only recommended but is becoming increasingly common. A market has even opened up for “Cloud Storage Managers”. This is typically a costly insurance policy, as diversification comes at a premium for cloud providers. Nonetheless, as user dependency on cloud services grows, so does the need for data redundancy and region diversification.

  1. Companies affected: Adobe’s services, Amazon’s Twitch, Atlassian’s Bitbucket and HipChat, Autodesk Live and Cloud Rendering, Buffer, Business Insider, Carto, Chef, Citrix, Clarifai, Codecademy, Coindesk, Convo, Coursera, Cracked, Docker, Elastic, Expedia, Expensify, FanDuel, FiftyThree, Flipboard, Flippa, Giphy, GitHub, GitLab, Google-owned Fabric, Greenhouse, Heroku, Home Chef, iFixit, IFTTT, Imgur, Ionic, isitdownrightnow.com, Jamf, JSTOR, Kickstarter, Lonely Planet, Mailchimp, Mapbox, Medium, Microsoft’s HockeyApp, the MIT Technology Review, MuckRock, New Relic, News Corp, OrderAhead, PagerDuty, Pantheon, Quora, Razer, Signal, Slack, Sprout Social, StatusPage (which Atlassian recently acquired), Travis CI, Trello, Twilio, Unbounce, the U.S. Securities and Exchange Commission (SEC), The Verge, Vermont Public Radio, VSCO, Wix, Xero, and Zendesk, among others. Airbnb, Down Detector, Freshdesk, Pinterest, SendGrid, Snapchat’s Bitmoji, and Time Inc. are currently working slowly. Apple’s website also reported issues with its App Store and music-streaming service, which suggests that more of Apple may remain on AWS than has shifted to Microsoft.
  2. Source: http://www.ibtimes.co.uk/amazon-s3-cloud-service-outage-takes-down-big-part-internet-1609117
  3. Later reported problems with: Athena, CloudWatch, EC2, Elastic File System, Elastic Load Balancing (ELB), Kinesis Analytics, Redshift, Relational Database Service (RDS), Simple Email Service (SES), Simple Workflow Service, WorkDocs, WorkMail, CodeBuild, CodeCommit, CodeDeploy, Elastic Beanstalk, Key Management Service (KMS), Lambda, OpsWorks, Storage Gateway, WAF (web application firewall), AppStream, Elastic MapReduce (EMR), Kinesis Firehose, WorkSpaces, CloudFormation, CodePipeline, API Gateway, CloudSearch, Cognito, the EC2 Container Registry, ElastiCache, the Elasticsearch Service, Glacier cold storage, Lightsail, Mobile Analytics, Pinpoint, Certificate Manager, CloudTrail, Config, Data Pipeline, Mobile Hub, and QuickSight.
  4. Source: https://aws.amazon.com/backup-recovery/
Jennifer Daffron

Dr Jennifer Daffron's research interests include defining and exposing cyber threat vulnerabilities on organisational and human behavioural platforms. Jennifer holds a PhD in Experimental Psychology from the University of Cambridge. Prior to joining the Risk Centre, Jennifer completed postdoctoral research at the University of Cambridge's Department of Psychology and has published several papers on attentional templates in visual search.
