Reliability Engineering Concepts!!!

What Is Site Reliability Engineering?


Site reliability engineering (SRE) is a software engineering approach to IT operations.

SRE is a concept that combines software engineering experience with system administration and/or networking skills. SREs are dedicated to increasing the performance and reliability of products. SRE is a valuable practice when creating scalable and highly reliable software systems. It helps you manage large systems through code, which is more scalable and sustainable for sysadmins managing thousands or hundreds of thousands of machines. Standardization and automation are two important components of the SRE model. Site reliability engineers should always be looking for ways to enhance and automate operations tasks.

The Site Reliability Engineering (SRE) concept was introduced by Google more than a decade ago and is credited to Ben Treynor Sloss. Development teams can often scale up quickly, but the hand-off to operations is a problem: the operations or support team is not ready to handle a high frequency of change in the application landscape, and once the development team has delivered a change and it is deployed to production, the development team has little or no accountability for it. This creates the need for a cross-functional team that is always looking for ways to enhance and automate operations tasks. The SRE model is based on two main components: standardization and automation. Site reliability engineers (SREs) aim to spend as much time as possible on development, using tools and solutions to proactively monitor, predict and prevent issues in operations, and automating most manual tasks.


[Figure: Key Focus Areas of SRE (Source: www.wipro.com)]



What does a Site Reliability Engineer do?


A Site Reliability Engineer is responsible for the availability, performance, monitoring and incident response, among other things, of the platforms and services that the company runs and owns. A Site Reliability Engineer (SRE) will spend up to 50% of their time on "ops"-related work such as issues, on-call and manual intervention. Since the software system that an SRE oversees is expected to be highly automated and self-healing, the SRE should spend the other 50% of their time on development tasks such as new features, scaling or automation. The ideal site reliability engineer candidate is either a software engineer with a good administration background or a highly skilled system administrator with knowledge of coding and automation.
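
The 50% split described above comes down to simple arithmetic on how time is spent. Here is a minimal sketch in Python, assuming work is logged as hours per category. The category names, the `ops_fraction` and `exceeds_ops_cap` helpers and the sample week are hypothetical and only there for illustration.

```python
# Hypothetical "ops" categories, based on the examples named in the text.
OPS_CATEGORIES = {"issues", "on-call", "manual intervention"}


def ops_fraction(hours_by_category):
    """Return the share of logged hours spent on ops-related work."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    ops = sum(h for cat, h in hours_by_category.items() if cat in OPS_CATEGORIES)
    return ops / total


def exceeds_ops_cap(hours_by_category, cap=0.5):
    """True when ops work takes more than the cap (50% by default)."""
    return ops_fraction(hours_by_category) > cap


# Illustrative data only: one engineer's logged hours for a week.
week = {"on-call": 10, "issues": 6, "automation": 14, "new features": 10}
print(f"ops share: {ops_fraction(week):.0%}")
print("over the 50% cap:", exceeds_ops_cap(week))
```

When the ops share stays above the cap for a sustained period, that is a signal to invest more in automation or to hand some operational load back to the development team.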


Comparing DevOps vs SRE:


DevOps and SRE seem like two sides of the same coin. Both aim to bridge the gap between development and operations teams, with a shared goal of improving the release cycle without compromises.

A widely quoted definition of DevOps is “a software engineering culture and practice, that aims at unifying software development and software operation.” The term was first coined in 2008, and nowadays just about every company, from enterprises to startups, is hiring DevOps engineers.

The concept of Site Reliability Engineering (SRE) has been around since 2003, making it even older than DevOps. Just like DevOps, SRE is also about bringing development and operations teams together, helping each see the other side of the process, while introducing visibility into the complete application life-cycle.

Google itself mentioned that SRE and DevOps are not so different from one another: “they’re not two competing methods for software development and operations, but rather close friends designed to break down organizational barriers to deliver better software faster.”

According to Google, this is where the main difference between DevOps and SRE lies. While DevOps is about “What” needs to be done, SRE is about “How” it can be done. SRE turns the theoretical part into an efficient workflow, with the right work methods, tools and so on. It is also about sharing responsibility between everyone, and getting everyone in sync with the same goal and vision.


Please refer to this article released by Google for more insight.


Seth and Liz, two Google employees, presented the similarities and differences between the two through the five pillars of DevOps, explaining what each one means for SRE:


DevOps defines 5 key pillars of success:

  1. Reduce organizational silos

  2. Accept failure as normal

  3. Implement gradual changes

  4. Leverage tooling and automation

  5. Measure everything

SRE satisfies the DevOps pillars as follows:

  1. Reduce organizational silos

  • SRE shares ownership with developers to create shared responsibility

  • SREs use the same tools that developers use, and vice versa.

  2. Accept failure as normal

  • SREs embrace risk

  • SRE quantifies failure and availability in a prescriptive manner using Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

SLIs are quantitative measures of how a service behaves, such as request latency, throughput in requests per second, or failures per request, measured over time. SLOs are derived from them: an SLO is a target threshold, percentage or number that an SLI should meet over a certain period of time.



  3. Implement gradual changes

  • SRE encourages developers and product owners to move quickly by reducing the cost of failure.


  4. Leverage tooling and automation

  • Both titles encourage adding as much automation and tools as possible, as long as they provide value to developers and operations by removing manual tasks.


  5. Measure everything

  • SRE defines prescriptive ways to measure values.

  • SRE fundamentally treats systems operation as a software problem, which leads it to define prescriptive ways of measuring availability, uptime, outages, toil, etc.

SREs also ensure that every stakeholder agrees on how to measure reliability, and on what to do when availability falls out of specification. This includes contributors at every level of the organisation.
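
The SLI and SLO ideas from pillar 2 can be illustrated with a small calculation. Here is a minimal sketch in Python, assuming each request is recorded with a success flag and a latency in milliseconds. The `Request` type, the function names, the sample data and the 99.9%, 99% and 300 ms targets are all illustrative assumptions, not values recommended by Google.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool           # did the request succeed?
    latency_ms: float  # how long the request took


def availability_sli(requests):
    """SLI: fraction of requests that succeeded in the window."""
    if not requests:
        return 1.0
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests, threshold_ms):
    """SLI: fraction of requests served within threshold_ms."""
    if not requests:
        return 1.0
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


def meets_slo(sli, target):
    """An SLO is a target that the SLI should reach over the window."""
    return sli >= target


# Illustrative window: 997 fast successes and 3 slow failures.
window = [Request(True, 120.0)] * 997 + [Request(False, 900.0)] * 3

avail = availability_sli(window)
fast = latency_sli(window, threshold_ms=300.0)

print("availability SLI:", avail, "meets 99.9% SLO:", meets_slo(avail, 0.999))
print("latency SLI:", fast, "meets 99% SLO:", meets_slo(fast, 0.99))
```

With this sample data the availability SLI falls short of the assumed 99.9% target, which is the kind of "out of specification" situation stakeholders should have agreed in advance how to handle.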


Reviewing the Tools for Site Reliability:




Final Thought:


Google’s Site Reliability Workbook (https://landing.google.com/sre/workbook/toc/) is a great place to start exploring SRE. Google provides an overview of SRE implementation, covering the guiding principles that led to the organization-wide adoption of SRE. For site reliability engineering, the word “mindset” is key, and the SRE role requires a mix of development and operations skills. No matter how you implement SRE in your company, the role and the best practices it brings should have a significant effect.


There are many good books available on SRE. Please explore https://landing.google.com/sre/books/ for more details.


Please share your feedback if you like this article :).

