SRE and BDD: The Ultimate Power Pair

Reading Time: Approximately 7 minutes.

The responsibilities of a Reliability Engineer are well understood: maintain a high degree of service availability so that customers can have a consistently enjoyable and predictable experience. How these goals are accomplished (establishing SLOs with customers, enforcing them by monitoring SLIs, and exercising the platform against failure through Game Days) is also well understood. Much of the existing SRE literature covers these concepts in great depth, and for good reason: failing to establish a contract with the customer on availability expectations for the service they are paying for is a great way for its engineers to spend their entire careers fire-fighting. … »

SRE Communities vs SRE Centers of Excellence

Reading Time: Approximately 7 minutes.

I read Google’s Site Reliability Engineering Workbook on a flight to New York the other day. I read their original book when it came out two years ago and was curious to see how much of it mirrored my own (brief) experience as a Google SRE. Given that it’s been a while since I did pure SRE work, I wanted to keep my skills current, and the Workbook seemed like a more accurate reference to follow. … »

Good Tools Are Important. Ignore At Your Own Peril

Reading Time: Approximately 7 minutes.

I’ve been consulting for some of the world’s largest companies for the last three years and have observed three themes that worry me: Agile is a really controversial word, despite the manifesto being quite clear on the matter; somewhere within every company, there are many, many engineers who have been waiting weeks for test environments; and engineers have the heaviest, plasticky-iest, and most unpleasant machines in the entire organization. This (hopefully) brief post is about that third point. … »