AWS (US-EAST-1) Region outage on the morning of February 28th 2017
https://aws.amazon.com/message/41926/
Image credit: Danilo Pianini - https://danysk.github.io/
Major outages are every CTO's and Operations Director's worst nightmare. I have been here three times in my 25-year career, albeit with national mobile networks rather than global cloud services. The principles are the same, however. Incidents like this are huge and painful learning experiences, and I am grateful to have learnt from some great bosses (Ciaran Quigley), to have worked with some of the best engineers to resolve the issues, and on one occasion to have had a CEO and chairman who appreciated the thousands of things done well* and understood that, as a professional, I would be my own worst critic. I have learnt to be decisive, to think quickly, to direct, to support, to take measured risks and, when necessary, to protect the engineers involved. The painful amount of time spent ensuring strong processes and procedures are in place to prevent such issues in the first place is essential.
While there is never room for complacency, there should be room for some understanding of the facts. Clear processes, procedures and maintenance windows are key. In networks under my watch, any critical maintenance, or any work on a critical part of the network, no matter how minor, was done very late on a Saturday night or early on a Sunday morning. The logic was that if something did go wrong, it happened in the lowest-traffic period and we had the longest opportunity to recover with minimum impact on our customers. This policy, which I never compromised on, didn't make me popular with the engineers, but it saved our bacon on a number of occasions. It was a golden rule, never to be compromised.
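To make the window policy concrete, here is a minimal sketch of how a change-control script might enforce it; the Europe/London timezone and the 23:00 Saturday to 06:00 Sunday boundaries are illustrative assumptions, not the exact schedule described above.

```python
# Illustrative only: a change-control gate that refuses critical work outside
# an agreed low-traffic window. The timezone and window are assumptions.
from datetime import datetime
from zoneinfo import ZoneInfo

OPERATING_TZ = ZoneInfo("Europe/London")  # assumed operating timezone


def in_critical_maintenance_window(now: datetime) -> bool:
    """Allow critical changes only late Saturday night through early Sunday morning."""
    local = now.astimezone(OPERATING_TZ)
    late_saturday = local.weekday() == 5 and local.hour >= 23   # Sat 23:00 onwards
    early_sunday = local.weekday() == 6 and local.hour < 6      # until Sun 06:00
    return late_saturday or early_sunday


if __name__ == "__main__":
    if not in_critical_maintenance_window(datetime.now(OPERATING_TZ)):
        raise SystemExit("Outside the agreed window: critical change refused.")
    print("Within the window: proceed under the documented change process.")
```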
Things sometimes do go wrong, especially in situations where something has never been done before. That was not the case with AWS, and blaming that poor AWS engineer is plain wrong. Blaming the playbook and process would have been the right and honourable thing to do. Why was critical work being done at 9:37 on a Tuesday morning? Why were such powerful commands being used without some protection? On only one occasion did I support disciplinary action against an engineer. The reason was not his human error; it was that the outage he caused and prolonged was the result of his not following a clearly documented process, and while he had never caused an outage before, he had ignored process before. I am sure the AWS team will learn from this and their platform will be the better for it. More than anything, I hope they learn not to blame individual engineers in such a public way ever again. Good practice, processes and procedures will never eliminate human error, but they do minimise its impact.
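As one illustration of the kind of protection meant by that second question, below is a minimal sketch of a guard-rail around a capacity-removal command, in the spirit of the safeguards AWS describes adding in its post-mortem; the function name, thresholds and batch limit are assumptions for the example, not details of AWS's actual tooling.

```python
# Illustrative only: a guard that stops a fat-fingered capacity removal from
# taking out far more of a fleet than intended. Thresholds are assumptions.
MAX_REMOVAL_FRACTION = 0.05   # never remove more than 5% of a fleet in one step
MIN_REMAINING_HOSTS = 3       # never drop a subsystem below its minimum capacity


def remove_capacity(fleet_size: int, requested: int, confirmed: bool = False) -> int:
    """Return how many hosts may actually be removed, or raise instead of obeying blindly."""
    if requested <= 0:
        raise ValueError("Nothing to remove.")
    allowed = min(requested, int(fleet_size * MAX_REMOVAL_FRACTION))
    if fleet_size - allowed < MIN_REMAINING_HOSTS:
        raise ValueError(
            f"Refusing: removal would leave {fleet_size - allowed} hosts, "
            f"below the minimum of {MIN_REMAINING_HOSTS}."
        )
    if allowed < requested and not confirmed:
        raise ValueError(
            f"Requested {requested} hosts but only {allowed} are permitted per step; "
            "re-run with explicit confirmation to proceed in smaller batches."
        )
    return allowed


if __name__ == "__main__":
    # A mistyped "remove 100" against a 120-host fleet is stopped cold.
    try:
        remove_capacity(fleet_size=120, requested=100)
    except ValueError as err:
        print(f"Blocked: {err}")
```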