Our research has found that while most enterprises use multiple public clouds or some form of hybrid cloud implementation, they don’t do so for resilience purposes. The reasons are many, but to summarize: building applications to use multiple clouds is hard. Managing multiple versions of the same application across more than one cloud can bring massive inefficiencies, complexities, and redundant tasks. Still, as the current AWS outage shows, single-cloud strategies can introduce new and unfamiliar points of failure.
Yes, the cloud is more resilient than your datacenter – but…
One incredibly attractive selling point of the cloud is inherent workload and data resilience. With availability metrics in the five-nines range and data durability promised at 11 nines, migrating to the cloud can seem like a panacea for technology resiliency. However, every technology platform has its own risks. In 2021, every one of the major cloud providers has experienced outages, some lasting a few minutes, others hours. And as AWS CTO Werner Vogels reminded us at his recent re:Invent keynote, many AWS services incorporate other underlying services, which means that AWS users may have dependencies on services that they don’t even know they are consuming.
Many of those outages have been related to human errors compounded by automation; others have been due to cyber-attacks, such as massive DDoS attacks against key cloud infrastructure. To build the massively redundant infrastructure that the hyperscalers provide, they rely on a complex ecosystem of automation and interrelated services that can trigger systemic outages in the case of attack or error.
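To put the availability figures above in perspective, here is a minimal Python sketch of the arithmetic behind "nines" and behind chained dependencies. The function names are illustrative, and the model assumes that failures are independent, which the systemic outages described above show is often not the case. Treat the results as optimistic bounds, not as a statement of any provider's actual SLA.

```python
from math import prod
from typing import Iterable

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Expected annual downtime for an availability given as a fraction (e.g. 0.99999)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a service that needs ALL of its dependencies to be up.

    Assumption: dependencies fail independently of one another.
    """
    return prod(availabilities)


def parallel_availability(availabilities: Iterable[float]) -> float:
    """Availability when ANY ONE of several redundant options is enough.

    Assumption: the options fail independently (e.g. they share no
    underlying service), which is exactly what hidden dependencies undermine.
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    # Five nines allows only a few minutes of downtime per year.
    print(f"five nines: {downtime_minutes_per_year(0.99999):.2f} min/year")

    # A service that depends on three four-nines services is less available
    # than any one of them.
    chained = serial_availability([0.9999, 0.9999, 0.9999])
    print(f"three chained 99.99% services: {chained:.6f} "
          f"({downtime_minutes_per_year(chained):.1f} min/year)")
```

The serial case is the one that matters for hidden dependencies: every additional underlying service a workload relies on multiplies in another factor below 1, so the composite availability is always lower than that of the weakest link.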
Multi- and hybrid-cloud strategies allow you to mitigate single-cloud risks
An essential part of resilience planning is understanding the unique risks to your business. Technology leaders need to consider those risks for the major components of their infrastructure, including cloud providers. If you are hosting a key business application with a particular cloud provider, you are sharing that provider’s risk profile. If that cloud provider has an outage, your business can suffer revenue loss, loss of employee productivity, operational inefficiencies, and possible reputational damage.
As your business continues its cloud journey, you can follow a basic strategy of Diversify, Mitigate, and Inquire to help build your cloud resilience strategy.
Diversify your risk by building applications and services that can be shifted automatically between multiple cloud providers or private infrastructure when a service fails. You don’t have to replicate your entire stack in an alternate provider to mitigate some of your cloud vendor concentration risk. For example, you can use a secondary cloud provider for disaster recovery as a service (DRaaS), data backup, and basic office applications in the event of an extended outage or some other event that impacts the reliability of your primary cloud.
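As a rough illustration of the "shift automatically when a service fails" idea, the sketch below tries an operation against an ordered list of providers and falls back to the next one on failure. Everything here is hypothetical: `call_with_failover`, `fetch_from_primary`, and `fetch_from_secondary` are stand-ins for whatever client code talks to each cloud, and a real implementation would also need timeouts, health checks, and a plan for keeping data in sync between providers.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def call_with_failover(
    providers: Sequence[Tuple[str, Callable[[], T]]],
) -> Tuple[str, T]:
    """Try each (name, operation) pair in order; return the first success.

    Returns the name of the provider that answered along with its result,
    so callers can log or alert when they are running on a fallback.
    """
    errors: List[str] = []
    for name, operation in providers:
        try:
            return name, operation()
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical provider operations, for illustration only.
def fetch_from_primary() -> str:
    raise ConnectionError("primary cloud unavailable")


def fetch_from_secondary() -> str:
    return "backup copy"


if __name__ == "__main__":
    used, result = call_with_failover(
        [("primary", fetch_from_primary), ("secondary", fetch_from_secondary)]
    )
    print(f"served by {used}: {result}")
```

The ordered list reflects the point made above: the secondary provider does not need to mirror the whole stack, only the operations you have decided must survive an outage of the primary.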
In cases where you can’t diversify your technology risk, mitigate the impact by defining aggressive XLA-based metrics for core business applications hosted with a single cloud vendor. (If another company is hosting a significant part of your enterprise’s value stream, then that provider is by default a partner in your business delivery, and your relationship with them should share the risks to your business that their risk profile introduces. Remediation should be more than just per-minute credits for unavailable services.)
For best-of-breed SaaS applications, options are limited, but it is your business’s responsibility to inquire about the vendor’s risk mitigation plan for its services. Customer needs and demands are a powerful tool in shaping SaaS vendors’ offerings, especially if similar resilience needs are expressed across their client base.
*Special thanks to my colleagues Lee Sustar and Tracy Woo for their help in the creation of this post.