Q1. Anticipate failure and engineer systems to handle it gracefully

Dr. Werner Vogels, AWS CTO, has been quoted as saying "Everything fails, all the time." How literally this holds varies from one cloud workload to another, but it is good practice to act as if it were true and plan accordingly. As stated elsewhere, application development patterns should avoid reliance on specific instances of infrastructure. This practice goes further: service owners should not only refrain from relying on specific instances of infrastructure, but should explicitly expect and plan for the failure of those instances.

Planning for failure has implications for both application design and operations. On the design side, service owners should engineer redundancy into their workloads, taking advantage of platform services to do so. On the operations side, they should automate the replacement of failed infrastructure components wherever possible. In any case, they should detect failure quickly and establish processes to minimize downtime.
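
The operational half of this practice (detect failure quickly, then replace rather than repair) can be illustrated with a minimal sketch. Everything named here is hypothetical and exists only for illustration: `Instance` stands in for any infrastructure component, `reconcile` for the replacement step, and `launch_replacement` for whatever provisioning call a platform would offer. In a real workload a platform service would normally play this role; the sketch only shows the shape of the logic.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable, List


@dataclass
class Instance:
    """Hypothetical stand-in for one infrastructure component."""
    name: str
    healthy: bool = True


def reconcile(
    pool: List[Instance],
    is_healthy: Callable[[Instance], bool],
    launch: Callable[[], Instance],
) -> List[Instance]:
    """Return a pool of the same size with every failed instance replaced."""
    # Failed instances are discarded, never repaired in place.
    return [inst if is_healthy(inst) else launch() for inst in pool]


# Simulated provisioning call (an assumption, not a real cloud API).
_ids = count(1)


def launch_replacement() -> Instance:
    return Instance(name=f"replacement-{next(_ids)}")


# Three redundant instances, one of which has failed.
pool = [Instance("a"), Instance("b", healthy=False), Instance("c")]
pool = reconcile(pool, lambda inst: inst.healthy, launch_replacement)
print([inst.name for inst in pool])  # ['a', 'replacement-1', 'c']
```

Because `reconcile` treats instances as interchangeable, nothing in the workload depends on instance "b" in particular, and the pool keeps its redundancy after the failure. Running such a check on a short, regular interval is one way to keep detection time low.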