SITE RELIABILITY ENGINEERING


Hope is not a strategy

Ever since Google managed to successfully run its complex platforms through the SRE model, most of the big firms across the globe have tried to adopt the model in some way or another. Before we dive a little deeper into the subject, let’s understand what SRE stands for: Site Reliability Engineering. In layman’s terms, SRE is what happens when you ask a software engineer to design an operations team.

With the traditional silo-based development and ops model, there is a chance of a blame game when it comes to the stability of production systems. The development team will push for new features, while the operations team will be sceptical and want to ensure new functionality doesn’t break anything in production. Google understood this as early as 2003 and introduced the SRE model, in which production services are looked after by Site Reliability Engineers.

Dev vs Ops

Principles of Google SRE:

Principle 1: Focus on Engineering

SRE team members should spend at most 50% of their capacity on manual, repetitive work (not a hard line, though). The remaining 50% of their time has to be spent on automation, development or self-learning. If ops work takes more than 50% of their time, the overflow has to be passed on to the product development team. It is then up to the development team to bring efficiency to the process and help the SRE team by automating or improving existing product features.

Principle 2: Changes governed by SLOs and Error Budgets

As development and ops teams are traditionally measured against different metrics (development on the number and frequency of changes shipped, ops on the number of incident-free days), at some point the two teams will inevitably push back against each other. What is the best way forward then? Do we have an accepted mechanism to resolve this deadlock?

Error Budget is the answer. Before discussing Error Budget, let’s talk about SLI/SLOs (Service Level Indicators/Objectives).

An SLI is a service level indicator: a carefully defined quantitative measure of some aspect of the level of service provided to end-users. For example, for an externally hosted website, key SLIs could be availability, latency and application throughput. We choose SLIs based on what actually matters to end-users. Typically, SLIs are measured using a monitoring tool such as Splunk, AppDynamics or Prometheus.
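As an illustration (not tied to any particular monitoring tool), here is a minimal Python sketch that computes two such SLIs, availability and latency, from a hypothetical list of request records; the Request structure and the 100 ms threshold are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    """One served request, roughly as a monitoring tool might record it."""
    succeeded: bool    # True if the request was served without a server error
    latency_ms: float  # Time taken to serve the request, in milliseconds

def availability_sli(requests: List[Request]) -> float:
    """Fraction of requests served successfully (0.0 to 1.0)."""
    if not requests:
        return 1.0
    return sum(r.succeeded for r in requests) / len(requests)

def latency_sli(requests: List[Request], threshold_ms: float = 100.0) -> float:
    """Fraction of requests served faster than the latency threshold."""
    if not requests:
        return 1.0
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)

# Example: three requests, one of which was slow and failed.
sample = [Request(True, 42.0), Request(True, 80.5), Request(False, 230.0)]
print(f"availability SLI: {availability_sli(sample):.2%}")    # 66.67%
print(f"latency SLI (<100 ms): {latency_sli(sample):.2%}")    # 66.67%
```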

An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. For example, average search request latency should be less than 100 milliseconds. Once defined, SLOs are then shared with the end-users.

Some common examples of good SLOs are listed below (a small sketch after the list shows how such targets might be checked against measured SLIs):

  • The website will be available to serve end-user requests 99.9% of the time in Q1
  • 99% of the user login requests will be served in < 100 milliseconds in Q1
  • The website will be able to handle 100 concurrent users and still meet latency and availability SLOs
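The following hypothetical Python sketch encodes targets like the ones above and checks measured SLI values against them; the names and numbers are illustrative only, not a real monitoring configuration.

```python
# Hypothetical SLO targets, expressed as "the SLI must be at least this value".
slos = {
    "availability": 0.999,    # 99.9% of requests served successfully in Q1
    "login_latency": 0.99,    # 99% of login requests served in < 100 ms in Q1
}

# Measured SLI values for the same period (would normally come from monitoring).
measured = {
    "availability": 0.9993,
    "login_latency": 0.985,
}

for name, target in slos.items():
    status = "MET" if measured[name] >= target else "MISSED"
    print(f"{name}: target {target:.2%}, measured {measured[name]:.2%} -> {status}")
```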

Error Budget – Google defines an error budget as a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.

For example, imagine that a service’s availability SLO is to successfully serve 99% of all requests per quarter. This means that the service’s error budget is a failure rate of 1% for a given quarter. To simplify further, if our service is expected to be available 24×7, then out of the roughly 90 days in a quarter our service should be up and running for about 89.1 days. Hence our error budget is about 0.9 days (roughly 21.6 hours).
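The arithmetic behind that number is simple; this small sketch shows the calculation under the assumption of a 90-day quarter:

```python
def error_budget_days(availability_slo: float, period_days: float = 90.0) -> float:
    """Allowed downtime, in days, for a given availability SLO over the period."""
    return (1.0 - availability_slo) * period_days

# A 99% availability SLO over a ~90-day quarter leaves about 0.9 days of budget.
print(f"{error_budget_days(0.99):.2f} days")    # 0.90
print(f"{error_budget_days(0.999):.2f} days")   # 0.09 (roughly 2.2 hours)
```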

Error Budget and change decisions – As explained above, the higher the availability target, the smaller the error budget we have to play with. That’s why we should never aim for 100% system availability. Change decisions have to be based on the available error budget. Imagine that during a certain quarter recurring production outages consume all of our available error budget; the development team then has no option but to cease all feature development work and start working on stabilising production.
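As a hypothetical illustration of that policy (the threshold and function name are assumptions, not a prescribed mechanism), the sketch below gates new releases on how much of the quarterly error budget has already been burned:

```python
def can_release(budget_minutes: float, downtime_so_far_minutes: float) -> bool:
    """Allow a new release only while some error budget remains for the quarter."""
    return downtime_so_far_minutes < budget_minutes

# A 99.9% availability SLO over a 90-day quarter is about 129.6 minutes of budget.
quarterly_budget = (1 - 0.999) * 90 * 24 * 60

print(can_release(quarterly_budget, downtime_so_far_minutes=45.0))   # True: keep shipping
print(can_release(quarterly_budget, downtime_so_far_minutes=180.0))  # False: freeze features and stabilise
```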

Availability Table

Principle 3: Shared Responsibility & management buy-in

Running a production service is a shared responsibility and no longer the sole responsibility of the operations team. Unorthodox approaches to service management require strong management support.

For example, the decision to stop releases for the remainder of the quarter once an error budget is depleted might not be embraced by a product development team unless mandated by senior management or the leadership team. Senior leadership support is also required for arranging training sessions to uplift the skills of the SRE team members.

SRE vs Development team: Roles and Responsibilities

Conclusion:

Finally, as I said at the beginning, there is no single rule of thumb for an organisation that wants to transform its ops function into an SRE one. SRE comes in different flavours and can be tailored to one’s needs. The most important thing to have is an engineering/growth mindset; without it, any SRE adoption is bound to fail.

I am telling you all this from my own personal experience: we have just finished the first phase of our SRE adoption and the initial results are encouraging. At the outset, we set out to achieve the following objectives:

  • 10% reduction in manual ops work in Q1 2020
    1. Improve documentation/SOPs
    2. Automate manual scripts and processes
    3. Introduce self-service for end users so that they don’t raise a support ticket for trivial issues
  • Tool-specific training for the SRE team members
  • Analyse all BAU tickets from the last two quarters and define actions for Q2 2020

Further Reading:

https://landing.google.com/sre/books/
