SITE RELIABILITY ENGINEERING


Hope is not a strategy

Ever since Google managed to successfully run its complex platforms through the SRE model, most of the big firms across the globe have tried to adopt the model in some way or another. Before we dive a little deeper into the subject, let’s understand what SRE stands for: Site Reliability Engineering. In layman’s terms, SRE is what happens when you ask a software engineer to design an operations team.

With the traditional silo-based development and ops model, there is a chance of a blame game when it comes to the stability of production systems. The development team will push for new features, while the operations team will be sceptical and want to ensure new functionality doesn’t break anything in production. Google understood this as early as 2003 and introduced the SRE model, in which production services are looked after by Site Reliability Engineers.

Dev vs Ops

Principles of Google SRE:

Principle 1: Focus on Engineering

SRE team members should spend at most 50% of their capacity on manual, repetitive work (not a hard line, though). The remaining 50% of their time has to be spent on automation, development or self-learning. If ops work takes more than 50% of their time, the overflow has to be passed on to the product development team. It is then up to the development team to bring efficiency to the process and help the SRE team by automating or improving existing product features.

Principle 2: Changes governed by SLOs and Error Budgets

As development and ops teams are traditionally measured against different metrics (development on the number and frequency of changes shipped, ops on the number of incident-free days), at some point the two teams will inevitably push back against each other. What is the best way forward then? Do we have an accepted mechanism to resolve this deadlock?

Error Budget is the answer. Before discussing Error Budget, let’s talk about SLI/SLOs (Service Level Indicators/Objectives).

An SLI is a service level indicator: a carefully defined quantitative measure of some aspect of the level of service provided to end-users. For example, for an externally hosted website, key SLIs could be availability, latency and application throughput. We choose SLIs based on what actually matters to end-users. Typically, SLIs are measured using a monitoring tool such as Splunk, AppDynamics or Prometheus.
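As an illustration (not tied to any particular monitoring tool), here is a minimal Python sketch that computes two such SLIs, availability and latency, from a hypothetical list of request records; the Request structure and the 100 ms threshold are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    """One served request, roughly as a monitoring tool might record it."""
    succeeded: bool    # True if the request was served without a server error
    latency_ms: float  # Time taken to serve the request, in milliseconds

def availability_sli(requests: List[Request]) -> float:
    """Fraction of requests served successfully (0.0 to 1.0)."""
    if not requests:
        return 1.0
    return sum(r.succeeded for r in requests) / len(requests)

def latency_sli(requests: List[Request], threshold_ms: float = 100.0) -> float:
    """Fraction of requests served faster than the latency threshold."""
    if not requests:
        return 1.0
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)

# Example: three requests, one of which was slow and failed.
sample = [Request(True, 42.0), Request(True, 80.5), Request(False, 230.0)]
print(f"availability SLI: {availability_sli(sample):.2%}")    # 66.67%
print(f"latency SLI (<100 ms): {latency_sli(sample):.2%}")    # 66.67%
```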

An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. For example, average search request latency should be less than 100 milliseconds. Once defined, SLOs are then shared with the end-users.

Some common examples of good SLOs are listed below (a small sketch after the list shows how such targets might be checked against measured SLIs):

  • The website will be available to serve end-user requests 99.9% of the time in Q1
  • 99% of the user login requests will be served in < 100 milliseconds in Q1
  • The website will be able to handle 100 concurrent users and still meet latency and availability SLOs
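The following hypothetical Python sketch encodes targets like the ones above and checks measured SLI values against them; the names and numbers are illustrative only, not a real monitoring configuration.

```python
# Hypothetical SLO targets, expressed as "the SLI must be at least this value".
slos = {
    "availability": 0.999,    # 99.9% of requests served successfully in Q1
    "login_latency": 0.99,    # 99% of login requests served in < 100 ms in Q1
}

# Measured SLI values for the same period (would normally come from monitoring).
measured = {
    "availability": 0.9993,
    "login_latency": 0.985,
}

for name, target in slos.items():
    status = "MET" if measured[name] >= target else "MISSED"
    print(f"{name}: target {target:.2%}, measured {measured[name]:.2%} -> {status}")
```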

Error Budget – Google defines an error budget as a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.

For example, imagine that a service’s availability SLO is to successfully serve 99% of all requests per quarter. This means that the service’s error budget is a failure rate of 1% for a given quarter. To simplify further, if our service is expected to be available 24×7, then out of the roughly 90 days in a quarter our service should be up and running for about 89.1 days. Hence our error budget is about 0.9 days (roughly 21.6 hours).
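The arithmetic behind that number is simple; this small sketch shows the calculation under the assumption of a 90-day quarter:

```python
def error_budget_days(availability_slo: float, period_days: float = 90.0) -> float:
    """Allowed downtime, in days, for a given availability SLO over the period."""
    return (1.0 - availability_slo) * period_days

# A 99% availability SLO over a ~90-day quarter leaves about 0.9 days of budget.
print(f"{error_budget_days(0.99):.2f} days")    # 0.90
print(f"{error_budget_days(0.999):.2f} days")   # 0.09 (roughly 2.2 hours)
```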

Error Budget and change decisions – As explained above, the higher the availability target, the smaller the error budget we have to play with. That’s why we should never aim for 100% system availability. Change decisions have to be based on the available error budget. Imagine that during a certain quarter recurring production outages consume all of our available error budget; the development team then has no option but to cease all feature development work and start working on stabilising production.
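As a hypothetical illustration of that policy (the threshold and function name are assumptions, not a prescribed mechanism), the sketch below gates new releases on how much of the quarterly error budget has already been burned:

```python
def can_release(budget_minutes: float, downtime_so_far_minutes: float) -> bool:
    """Allow a new release only while some error budget remains for the quarter."""
    return downtime_so_far_minutes < budget_minutes

# A 99.9% availability SLO over a 90-day quarter is about 129.6 minutes of budget.
quarterly_budget = (1 - 0.999) * 90 * 24 * 60

print(can_release(quarterly_budget, downtime_so_far_minutes=45.0))   # True: keep shipping
print(can_release(quarterly_budget, downtime_so_far_minutes=180.0))  # False: freeze features and stabilise
```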

Availability Table

Principle 3: Shared Responsibility & management buy-in

Running a production service is a shared responsibility and no longer the sole responsibility of the operations team. Unorthodox approaches to service management require strong management support.

For example, the decision to stop releases for the remainder of the quarter once an error budget is depleted might not be embraced by a product development team unless mandated by senior management or the leadership team. Senior leadership support is also required for arranging training sessions to uplift the skills of the SRE team members.

SRE vs Development team: Roles and Responsibilities

Conclusion:

Finally, as I said at the beginning, there is no single rule of thumb for an organisation that wants to transform its ops function into an SRE one. SRE comes in different flavours and can be tailored to one’s needs. The most important thing to have is an engineering/growth mindset; without it, any SRE adoption is bound to fail.

I am telling you all this from my own personal experience: we have just finished the first phase of our SRE adoption and the initial results are encouraging. At the outset, we set out to achieve the following objectives:

  • 10% reduction in manual ops work in Q1 2020
    1. Improve documentation/SOPs
    2. Automate manual scripts and processes
    3. Introduce self-service for end users so that they don’t raise a support ticket for trivial issues
  • Tool-specific training for the SRE team members
  • Analyse all BAU tickets from the last two quarters and define actions for Q2 2020

Further Reading:

https://landing.google.com/sre/books/
