Site reliability engineering

Ben Treynor Sloss at Google coined the term Site Reliability Engineering (SRE). SRE helps an organization reach an appropriate, well-understood level of reliability for its systems, services, and applications.

Just imagine that Hashnode (the platform where you are currently reading this) stops working. A lot of people like you and me can’t read or write on it, and it’s bad news for the people who built the platform too. It does no good to anyone: money, reputation, and morale all take a hit, and if the company is listed on a stock exchange, the outage could even damage its share price. This is why reliability is so important.

SRE and DevOps are two different ways of addressing almost the same challenges. SRE is very objective, built on concrete measurements and practices, whereas DevOps is intentionally less so; it’s more about culture. The two have a lot in common, such as monitoring and automation.

Key SRE Principles

The principles of SRE revolve around a continuous feedback loop.

Service Level Indicator: How do you decide whether a service is working well? A Service Level Indicator (SLI) is a measurement of success versus failure. For example, an SLI for a service could be the percentage of requests that succeed, how long it takes to respond, or how much throughput it handles. Once you know how the service is doing, you can decide what level of reliability to expect from it.
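To make the idea concrete, here is a minimal sketch of computing two SLIs from a batch of request records. The Request type, its fields, and the rule that any status below 500 counts as a success are assumptions made for this illustration, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code (hypothetical field)
    latency_ms: float  # time taken to respond, in milliseconds (hypothetical field)


def availability_sli(requests):
    """Percentage of requests that succeeded (assumed here: status below 500)."""
    if not requests:
        raise ValueError("no requests to measure")
    ok = sum(1 for r in requests if r.status < 500)
    return 100.0 * ok / len(requests)


def latency_sli(requests, threshold_ms):
    """Percentage of requests answered within threshold_ms."""
    if not requests:
        raise ValueError("no requests to measure")
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return 100.0 * fast / len(requests)


# Made-up sample data, purely for illustration.
sample = [
    Request(200, 120.0),
    Request(200, 80.0),
    Request(500, 300.0),
    Request(200, 450.0),
]

print(availability_sli(sample))    # 75.0
print(latency_sli(sample, 200.0))  # 50.0
```

In a real setup these numbers would normally come out of your monitoring system rather than a hand-built list.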

Service Level Objective: A Service Level Objective (SLO) is the target set for an SLI, such as the expected availability of the application, agreed with the service’s developers. SLOs help us decide whether the application is working or not, and they are needed to configure monitoring systems. If the monitoring data meets the SLOs, the service is working; otherwise it is not.
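As a rough sketch, checking monitoring data against SLOs can be as simple as comparing each measured value with its target. The SLO names and target numbers below are hypothetical, and treating a missing measurement as "not met" is a design choice for this example.

```python
# Hypothetical SLO targets, in percent.
SLOS = {
    "availability": 99.0,
    "latency_under_300ms": 95.0,
}


def evaluate_slos(measured, slos):
    """Return {slo_name: True/False}; a missing measurement counts as not met."""
    return {
        name: measured.get(name, float("-inf")) >= target
        for name, target in slos.items()
    }


# Made-up measurements, as they might come from a monitoring system.
measured = {"availability": 99.4, "latency_under_300ms": 93.0}

print(evaluate_slos(measured, SLOS))
# {'availability': True, 'latency_under_300ms': False}
```

Here the service meets its availability objective but misses the latency one, so by the definition above it is not "working" yet.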

Error Budgets: An error budget is the agreed amount of downtime a service is allowed, i.e. whatever is left over once the SLO is met. For example, if the SLO says the service is available 90% of the time, the remaining 10% of the time is the error budget. This time can be spent on work that improves the service but risks downtime: shipping new releases, doing maintenance, upgrading the underlying resources. It’s a kind of downtime window you get.
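The arithmetic behind the 90% example can be sketched as follows. The 30-day window is an assumption chosen for illustration; the article does not say which period the SLO is measured over.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed downtime in the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Minutes of error budget left after the downtime already incurred."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


# A 90% SLO over 30 days leaves 10% of 43,200 minutes as budget.
print(error_budget_minutes(90.0))     # 4320.0
# After 320 minutes of outages, 4,000 minutes remain.
print(budget_remaining(90.0, 320.0))  # 4000.0
```

When the remaining budget approaches zero, that is the signal to slow down risky changes until the window rolls over.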

Blameless Postmortems: After a major incident, it is good to hold a retrospective on what went wrong and how it can be avoided in the future. The focus should be on the failure of the process or technology, not on the actions of the people involved. The idea of a blameless postmortem is to improve the process and technology, not to punish anyone. If an organization embraces learning from outages, improvements in its processes and technology are bound to follow, and this is a core principle of SRE.

Toil: In SRE, toil means work that is done manually. It is normally repetitive and boring, and it could be automated, but until it is, someone has to do it by hand. One of the objectives of SRE is to eliminate toil wherever and whenever appropriate.

Project work vs Reactive Ops: To automate or fix toil for good, SREs need time set aside for doing so. An SRE’s time should not always go into putting out fires. There are times when many days in a row go into firefighting, but that alone won’t lead to a reliable service. The suggested breakdown of an SRE’s work is 50% project work and 50% reactive operations.
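One simple way to keep an eye on that 50/50 split is to track hours per category and flag the periods where reactive work takes more than half. This is only a sketch: the function names and the idea of counting hours are assumptions for this example.

```python
def reactive_share(project_hours, reactive_hours):
    """Fraction of total working time spent on reactive operations."""
    total = project_hours + reactive_hours
    if total <= 0:
        raise ValueError("no hours recorded")
    return reactive_hours / total


def over_reactive_cap(project_hours, reactive_hours, cap=0.5):
    """True when reactive work exceeds the cap (50% by default)."""
    return reactive_share(project_hours, reactive_hours) > cap


print(over_reactive_cap(20, 20))  # False: exactly at the 50/50 split
print(over_reactive_cap(10, 30))  # True: 75% of the time went to firefighting
```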

Getting started with SRE: Start by getting monitoring in place in your organization, so that you can analyze the reliability of your services. Then work with the developers of each service to agree on its SLIs and SLOs.

Further reading: Site Reliability Engineering (Google); Seeking SRE
