The responsibilities of a Reliability Engineer are well understood: maintain a high degree of service availability so that customers can have a consistently enjoyable and predictable experience. How these goals are accomplished (establishing SLOs with customers, enforcing them by monitoring SLIs, and exercising the platform against failure through Game Days) is also well understood. Much of the SRE literature covers these concepts in great depth, and for good reason: failing to establish a contract with customers on availability expectations for the service they are paying for is a great way for that service’s engineers to spend their entire careers fire-fighting.

However, there are times when the definition of availability is not as clear-cut. If a web service responds correctly within its availability SLO guidelines (say, 99.95%), but the content that’s actually served by that service is incorrect 30% of the time, then your engineers will likely still spend a large portion of their time fire-fighting despite their Reliability dashboards looking good.
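
To make that gap concrete, here is a minimal sketch that computes two indicators from the same set of requests: one for availability and one for correctness. The `Probe` struct and the sample counts are hypothetical, chosen only to mirror the figures above.

```ruby
# Hypothetical record of one request: did it get a successful response,
# and was the content it returned actually correct?
Probe = Struct.new(:responded, :correct)

# Fraction of requests that received a successful response.
def availability(probes)
  probes.count(&:responded).fdiv(probes.size)
end

# Fraction of requests whose content was correct. An unanswered request
# cannot be correct, so it counts against this number too.
def correctness(probes)
  probes.count { |p| p.responded && p.correct }.fdiv(probes.size)
end

# Illustrative sample: 10,000 requests, 5 failed outright, and 3,000 of the
# successful ones served wrong content.
probes = Array.new(5)    { Probe.new(false, false) } +
         Array.new(3000) { Probe.new(true, false) } +
         Array.new(6995) { Probe.new(true, true) }

puts format('availability: %.2f%%', availability(probes) * 100)
puts format('correctness:  %.2f%%', correctness(probes) * 100)
```

With this sample, availability sits right at the 99.95% target while correctness is just under 70%, which is exactly the situation a liveness-only dashboard would miss.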

There are various ways of capturing these details through black-box monitoring, such as the Prometheus blackbox_exporter or synthetic testing services from Sauce Labs or New Relic. (My personal favorite is Cachet with [Cachet Monitor](https://github.com/castawaylabs/cachet-monitor) running alongside it.) The Google Customer Reliability team mentions a great example of a prober they added to an example Shakespeare search service to measure malformed queries. However, one simpler and more transparent method that I don’t often see discussed is leveraging acceptance tests and behavior-driven development. That’s what I’ll discuss in this post.

BDD and SRE: An Unexpected Power Pair

Behavior-Driven Development, or BDD, helps provide a continuous interface through which product teams and engineering can collaborate and iterate on feature development. On healthy product teams, feature development through BDD looks something like this:

Product teams begin the conversation for a new feature with an acceptance test: a file written in English that describes what the feature is and how it should behave.
Engineering writes a failing implementation for that acceptance test by way of step definitions, then writes code that, ultimately, makes those step definitions pass.
Once the acceptance test for that feature passes, the code for that feature enters the release process through to production via continuous integration.

An Example of BDD in Action

Here’s a simple example of this in action. Your company maintains a sharp-looking to-do list product. Customer feedback collected from surveys has demonstrated a clear need for integrating your login workflow with third-party OAuth providers, namely Google and Facebook. In preparation for your bi-weekly story grooming session, a product owner might author an acceptance test with Cucumber that looks like this:

# features/login/third_party_auth.feature
Feature: Logging in with Third-Party Providers

  While many of our customers are happy with our login flow,
  surveys are showing a clear need for authenticating via third-parties like Google
  and Facebook.

  Scenario: Logging in with Google
    Given an instance of our to-do app
    And a valid Google Account
    When I navigate to the login page at "/login"
    Then I see a button that lets me log in with Google
    And I enter the Google authentication flow once it is clicked
    And I can successfully log into our to-do app with our account

Ideally, these acceptance tests would live in a separate repository since they are closer to integration tests than service-level tests. It also makes continuous acceptance testing easier to accomplish since the pipeline running the tests only needs to operate against a single repository instead of potentially many repositories. However, keeping all acceptance tests in one repository can complicate pull requests for service repositories, since running the entire suite of acceptance tests for a single PR is expensive and probably unnecessary. This can be engineered around, but it requires a bit of work.
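
One way to engineer around this, sketched under the assumption that features are tagged per service, is to have each service’s pull-request pipeline run only the features tagged for that service. The service names and tags below are hypothetical; only Cucumber’s `--tags` option is real.

```ruby
# Hypothetical mapping from a service repository to the Cucumber tag that
# marks the features exercising it.
SERVICE_TAGS = {
  'login-service' => '@login',
  'todo-api'      => '@todos'
}.freeze

# Build the Cucumber command line for a pull request against `service`.
# Services without a known tag fall back to running the full suite.
def cucumber_args_for(service)
  tag = SERVICE_TAGS[service]
  tag ? ['cucumber', '--tags', tag] : ['cucumber']
end

puts cucumber_args_for('login-service').join(' ')
```

The mapping could just as easily live in a config file in the acceptance-test repository; the point is only that a PR does not need to pay for the whole suite.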

After Product and Engineering agree on the scope of this feature and its timing in the backlog, an Engineer might author a failing series of step definitions for this feature, one of which might look something like this:

# features/step_definitions/third_party_auth.rb
require 'todo-app'
require 'vault'

Given("an instance of our to-do app") do
  @todo_app = TodoApp::Client.new
end

Given("a valid Google Account") do
  @google_account = {
    username: 'test@gmail.com',
    password: Vault::Client.get_value_for(key: 'test@gmail.com',
                                          path: '/todo/testing/accounts',
                                          token: ENV['VAULT_TOKEN'])
  }
end

When("I navigate to the login page at {string}") do |url|
  @todo_app.visit url
end

Then("I see a button that lets me log in with Google") do
  expect(page).to have_xpath("//button[@id='google_login']")
end

Once the engineer playing this story is able to make this series of step definitions pass, Engineering and Product can play the acceptance test end-to-end to confirm that the feature implemented is in the ballpark of what they were looking for. (Yay for automating QA!) Once this is agreed upon, the feature gets released into Production through their CI/CD pipelines.

An Example of BDD for Site Reliability in Action

We can employ the same tactics outlined above to define availability constraints. However, in this instance, the Reliability team would be submitting these acceptance tests instead of Product.

Let’s say that data collected from user session tracking shows that out of the 100,000 users that use our to-do app in any given month, 85% of those who wait more than five seconds for the login page leave our app, presumably for a competitor like Todoist. Because our company is backed by venture capital, growth is our company’s primary metric. Obtaining growth at any cost helps with future funding rounds that will help the company explore more expensive market plays and fund a potential IPO in the future. Thus, capturing as many of that fleeting 85% as possible is pretty critical.

To that end, the Reliability team can write an acceptance test that looks like this:

@reliability
Feature: Timely logins

  Prevent users from bouncing early by ensuring that we can hit the login page in a timely manner.

  Scenario: Login page within five seconds
    Given an instance of the to-do app
    When I navigate to the login page
    Then the login page loads in five seconds or less at least ten times in succession.
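
The timing check behind that last step could be sketched as a small helper like the one below. The name `loads_within?` is hypothetical, and the step definition shown in the comment assumes the `@todo_app` client from the earlier example.

```ruby
# Runs the given block `attempts` times in a row and returns true only if
# every run finishes within `limit_seconds`. Stops at the first slow run.
def loads_within?(limit_seconds, attempts: 10)
  attempts.times.all? do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - started <= limit_seconds
  end
end

# A step definition might then call it like this (sketch only):
#
#   Then("the login page loads in five seconds or less at least ten times in succession.") do
#     expect(loads_within?(5, attempts: 10) { @todo_app.visit '/login' }).to be true
#   end

# Stand-in for a page load so the helper can be tried without the app.
puts loads_within?(0.5, attempts: 3) { sleep 0.01 }
```

A monotonic clock is used so that the measurement is not thrown off if the system clock is adjusted while the test runs.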

Notice the @reliability tag at the top of this acceptance test. This tag is important, as it allows us to run our series of acceptance tests with a specific focus on reliability. Since these tests are intended to be quick, we can run them on a schedule several times per hour. If the failure rate for these tests is too high (as this rate would be a metric captured by your observability stack), then Reliability can decide to roll back or fail forward. Additionally, developers can run these tests during their local testing to gain greater confidence in releasing a reliable product and a better sense of what “reliability” actually means.
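
As a rough sketch of the failure-rate check, assume each scheduled run (for example, `cucumber --tags @reliability`) is recorded as a simple pass or fail. The 20% threshold and the sample history below are hypothetical; the right numbers depend on your own tolerance.

```ruby
# Fraction of runs in the window that failed. `runs` is an array of
# booleans, one per scheduled run, true when the run passed.
def failure_rate(runs)
  return 0.0 if runs.empty?

  runs.count(false).fdiv(runs.size)
end

# Hypothetical threshold: flag the release when more than 20% of the
# recent runs failed.
def too_many_failures?(runs, threshold: 0.2)
  failure_rate(runs) > threshold
end

# Illustrative history of the last eight scheduled runs.
recent_runs = [true, true, false, true, false, false, true, true]
puts format('failure rate: %.1f%%', failure_rate(recent_runs) * 100)
puts 'consider rolling back or failing forward' if too_many_failures?(recent_runs)
```

In practice the rate would be emitted as a metric and alerted on by the observability stack rather than printed, but the arithmetic is the same.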

Reliability Tests Don’t Replace Observability!

Feature testing tools like Cucumber are often used well beyond their initial scope, largely due to how flexible they are. That said, I am not arguing for removing observability tools! Quite the contrary, in fact: I think that reliability tests complement more granular and data-driven monitoring techniques quite nicely.

Going back to our /login example, setting a service-level objective around liveness (whether /login returns HTTP 200/OK or not) still helps a lot in giving customers a general expectation of how available this service will be during a given period. Using feature tests to drive that will be complicated and slow, and slow metrics make it much harder for teams to hit their SLO targets. Using near-realtime monitoring against the /login service and providing a dashboard that shows this service’s uptime and remaining error budget, along with a widget showing the rate at which this service’s reliability tests are passing, tells a fuller story of its health.
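
For reference, the remaining error budget on such a dashboard is simple arithmetic. The sketch below reuses the 99.95% figure from earlier; the 30-day window and the six minutes of observed downtime are assumptions for illustration.

```ruby
# Minutes of downtime that an availability SLO allows over a period.
# `slo_percent` is passed as a string (for example '99.95') so that
# Rational can parse it exactly instead of inheriting float rounding.
def error_budget_minutes(slo_percent, period_days)
  total_minutes = period_days * 24 * 60
  total_minutes * (100 - Rational(slo_percent)) / 100
end

# Budget left once the downtime already observed is subtracted.
def remaining_budget_minutes(slo_percent, period_days, downtime_minutes)
  error_budget_minutes(slo_percent, period_days) - downtime_minutes
end

budget    = error_budget_minutes('99.95', 30)
remaining = remaining_budget_minutes('99.95', 30, 6)
puts format('budget: %.1f min, remaining: %.1f min', budget.to_f, remaining.to_f)
```

A 99.95% target over 30 days leaves 21.6 minutes of allowable downtime, so six minutes of downtime would leave 15.6 minutes for the rest of the window.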

Wrapping Up

Setting SLOs and chasing SLIs are tenets most Reliability Engineers understand well. However, these metrics alone may not paint a complete picture of what it means for a service to be “up.” Additionally, these metrics are pretty opaque: developers, product, or anyone else outside of the Reliability team who wants to know how things work so well all of the time might have a dashboard or two as their only recourse.

Reliability tests use behavior-driven development and acceptance testing principles to bridge this gap. Authoring reliability tests gives non-Reliability engineers a better understanding of availability expectations, and it shifts some of the onus of making sure that the code is reliable onto the developer. Additionally, because they are written in plain English, everyone can understand them, which means that everyone can talk about and iterate on them.

Give it a try!
