How to SRE-ify your React app with Prometheus

Reading Time: Approximately 3 minutes.

I am not a JavaScript developer. However, I was given a task at work recently that forced me to enter the abyss and get good at keeping my Promises. I was asked to create a webinar on helping developers become better SREs through observability and instrumentation. The objective was to take a broken web app and add enough monitoring and logging to it to make troubleshooting its brokenness easier. (I’ll update this post with a link when we broadcast it on April 22nd! … »

SRE and BDD: The Ultimate Power Pair

Reading Time: Approximately 7 minutes.

The responsibilities of a Reliability Engineer are well understood: maintain a high degree of service availability so that customers can have a consistently enjoyable and predictable experience. How these goals are accomplished (establishing SLOs with customers, enforcing them by monitoring SLIs, and exercising the platform against failure through Game Days) is also well understood. Much of the existing literature on SRE covers these concepts in great depth, and for good reason: failing to establish a contract with customers on availability expectations for the service they are paying for is a great way for its engineers to spend their entire careers fire-fighting. … »

SRE Communities vs SRE Centers of Excellence

Reading Time: Approximately 7 minutes.

I read Google’s Site Reliability Engineering Workbook on a flight to New York the other day. I had read their original book when it came out two years ago and was curious to see how much of it mirrored my own (brief) experience as a Google SRE. Given that it’s been a while since I did pure SRE work, I wanted to keep my skills current, and the Workbook seemed like a more accurate reference to follow. … »