SITE RELIABILITY ENGINEERING

Hope is not a strategy

Ever since Google successfully ran its complex platforms using the SRE model, most of the big firms across the globe have tried to adopt the model in some form or another. Before we dive a little deeper into the subject, let’s understand what SRE stands for: it’s an abbreviation for Site Reliability Engineering. In layman’s terms, SRE is what happens when you ask a software engineer to design an operations team.

With the traditional silo-based development and ops model, there is a risk of a blame game when it comes to the stability of production systems. The development team will push for new features, while the operations team will remain cautious and try to ensure new functionality doesn’t break anything in production. Google recognised this as early as 2003 and introduced the SRE model, in which production services are looked after by Site Reliability Engineers.

Principles of Google SRE:

Principle 1: Focus on Engineering

SRE team members should spend no more than 50% of their capacity on manual, repetitive work (though this is not a hard line). The remaining 50% of their time has to be spent on automation, development, or self-learning. If ops work takes more than 50% of their time, the overflow has to be passed on to the product development team. It is then up to the development team to bring efficiency into the process and help the SRE team by automating or improving existing product features.
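To make the guideline concrete, here is a minimal sketch of the 50% rule. The toil and capacity figures are hypothetical; real numbers would come from your ticketing or time-tracking system.

    def toil_ratio(toil_hours: float, total_hours: float) -> float:
        """Fraction of team capacity spent on manual, repetitive work."""
        return toil_hours / total_hours

    def overflow_to_dev(toil_hours: float, total_hours: float, cap: float = 0.5) -> float:
        """Ops hours to hand back to product development once the cap is exceeded."""
        return max(0.0, toil_hours - cap * total_hours)

    # Example: a team with 200 hours of weekly capacity logging 120 toil hours.
    print(f"toil ratio: {toil_ratio(120, 200):.0%}")            # 60%
    print(f"hand back:  {overflow_to_dev(120, 200):.0f} hours") # 20 hours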

Principle 2: Changes governed by SLOs and Error Budgets

Development and ops teams are traditionally measured against different metrics (development on the number and frequency of changes shipped, operations on the number of incident-free days), so at some point the two teams will inevitably push back against each other. What is the best way forward, then? Do we have an accepted mechanism to resolve this deadlock?

The error budget is the answer. Before discussing error budgets, let’s talk about SLIs and SLOs (Service Level Indicators and Objectives).

An SLI is a service level indicator: a carefully defined quantitative measure of some aspect of the level of service provided to end users. For example, for an externally hosted website, some of the key SLIs can be availability, latency, and application throughput. We choose SLIs based on what actually impacts and matters to end users. Typically, we measure SLIs using a monitoring tool like Splunk, AppDynamics, Prometheus, etc.
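As a minimal sketch, here is how two such SLIs (availability and 99th-percentile latency) could be computed from a request log. The log format and data are made up for illustration; in practice these figures would come from one of the monitoring tools mentioned above.

    # Each entry: (HTTP status code, latency in milliseconds); illustrative data.
    requests = [(200, 45), (200, 80), (500, 30), (200, 120), (200, 60)]

    # Availability SLI: fraction of requests served without a server error.
    good = sum(1 for status, _ in requests if status < 500)
    availability_sli = good / len(requests)

    # Latency SLI: nearest-rank 99th-percentile latency.
    latencies = sorted(latency for _, latency in requests)
    p99_latency = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]

    print(f"availability SLI: {availability_sli:.1%}")  # 80.0%
    print(f"p99 latency SLI:  {p99_latency} ms")        # 120 ms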

An SLO is a service level objective: a target value or range of values for a service level, as measured by an SLI. For example: the average search request latency should be less than 100 milliseconds. Once defined, SLOs are shared with the end users.

Some common examples of good SLOs (with a short evaluation sketch after the list):

  • Website will be available to serve end user requests for 99.9% of the time in Q1
  • 99% of the user login requests will be served in < 100 milliseconds in Q1
  • Website will be able to handle 100 concurrent users and still meet latency and availability SLOs
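Here is a minimal sketch of checking measured SLIs against SLO targets like the ones above; the SLO names, targets, and measured values are all illustrative:

    # Illustrative SLO definitions: a target plus the direction of "good".
    slos = {
        "availability":  {"target": 0.999, "higher_is_better": True},
        "login_p99_ms":  {"target": 100.0, "higher_is_better": False},
    }

    # Hypothetical measurements for the quarter.
    measured = {"availability": 0.9993, "login_p99_ms": 87.0}

    for name, slo in slos.items():
        value = measured[name]
        met = value >= slo["target"] if slo["higher_is_better"] else value <= slo["target"]
        print(f"{name}: measured={value}, target={slo['target']} -> {'MET' if met else 'MISSED'}")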

Error Budget – Google defines an error budget as a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.

For example, imagine that a service’s availability SLO is to successfully serve 99% of all requests per quarter. This means that the service’s error budget is a failure rate of 1% for a given quarter. To put that in time terms: if our service is expected to run 24*7, then out of the roughly 90 days in a quarter it should be up and running for about 89.1 days, leaving an error budget of about 0.9 days (roughly 21.6 hours).
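The arithmetic is simple enough to sketch, assuming a 90-day quarter and the 99% availability SLO above:

    slo = 0.99          # availability target for the quarter
    quarter_days = 90   # assumed length of the quarter

    required_uptime_days = slo * quarter_days        # 89.1 days
    error_budget_days = (1 - slo) * quarter_days     # 0.9 days
    error_budget_hours = error_budget_days * 24      # 21.6 hours

    print(f"required uptime: {required_uptime_days:.1f} days")
    print(f"error budget:    {error_budget_days:.1f} days ({error_budget_hours:.1f} hours)")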

Error Budget and change decisions – As explained above, the higher the availability target, the less error budget we have to play with. That’s why we should never aim for 100% system availability. Change decisions have to be based on the available error budget. Just imagine that during a certain quarter, recurring production outages consume all of our available error budget: the development team will then have no option but to cease all feature development work and start working on production stabilisation.
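A minimal sketch of such a release gate might look like this; the budget and downtime figures are hypothetical, and real values would come from monitoring and incident records:

    def release_allowed(budget_minutes: float, downtime_minutes: float) -> bool:
        """Allow feature releases only while error budget remains."""
        return downtime_minutes < budget_minutes

    # 99% SLO over a 90-day quarter -> 1296 minutes of allowed downtime.
    quarterly_budget = (1 - 0.99) * 90 * 24 * 60
    burned = 1400  # hypothetical minutes of downtime so far this quarter

    if release_allowed(quarterly_budget, burned):
        print("Budget remaining: releases may proceed.")
    else:
        print("Error budget exhausted: freeze releases and stabilise production.")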

Availability Table (Google SRE book) – allowed downtime per availability level:

    Availability   Per year       Per quarter     Per 30 days
    99%            3.65 days      21.6 hours      7.2 hours
    99.9%          8.76 hours     2.16 hours      43.2 minutes
    99.99%         52.6 minutes   12.96 minutes   4.32 minutes

Principle 3: Shared Responsibility & management buy-in

Running a production service is a shared responsibility, no longer the sole responsibility of the operations team. Unorthodox approaches to service management require strong management support.

For example, the decision to stop releases for the remainder of the quarter once an error budget is depleted might not be embraced by a product development team unless mandated by their senior management or leadership team. Senior leadership support is also required to arrange training sessions for upskilling SRE team members.

SRE vs Development team: Roles and Responsibilities

Conclusion:

Finally, as I said in the beginning, there is no single rule of thumb for an organisation that wants to transform its ops function into an SRE one. SRE comes in different flavours and can be tailored to one’s needs. The most important thing is to have an engineering/growth mindset; without that, any SRE adoption is bound to fail.

I am telling you all this from my own personal experience: we have just finished the first phase of our SRE adoption, and the initial results are encouraging. At the outset, we set out to achieve the following objectives:

  • 10% reduction in manual ops work in Q1 2020
    1. Improve documentation/SOPs
    2. Automate manual scripts and processes
    3. Introduce self-service for end users so that they don’t raise a support ticket for trivial issues
  • Tool-specific training for SRE team members
  • Analyse all BAU tickets from the last two quarters and define actions for Q2 2020

For Further Reading:

https://landing.google.com/sre/books/