Here are some slides I've presented at a number of places. People have asked me to post them.
Lastly, I'd like to add that "Self Correcting Systems" are vital to the success of SRE. Of course, we all hear about auto-remediation and self-healing technologies. While those are self-evident, I personally recommend that you also think about your people and processes. Think about the motivations, rewards, and expected human behaviors. If you focus on a target of reducing false monitoring alarms, someone MIGHT decide to just disable the alarms instead of fixing them. If you focus on auto-healing too much, you may miss the fact that most things that can or should be fixed by auto-healing are really design flaws. Unfortunately, we tend to ask how many fires we put out, not how many fires we prevented, because "fires put out" is easier to count. We have to educate our stakeholders and leadership that an ounce of prevention is worth a pound of cure!