Context

Many articles about Site Reliability Engineering (SRE) focus on concepts such as Service Level Indicators (SLIs), Service Level Objectives (SLOs), Service Level Agreements (SLAs), error budgets, and so on. However, I often found myself lost and unable to find practical information for real-world contexts. With this post, my first about SRE, I aim to provide a clearer picture of SRE in practice. I hope it helps guide those who, like my past self, are seeking direction in their SRE or engineering careers.

What exactly does SRE do?

In short, SRE is engineering, and the title states the responsibility an engineer in this role has: the reliability of the system (the "Site" in the title).

However, when you're talking about reliability, you're talking about everything in engineering, such as availability, resilience, scalability, security, compliance, and even cost - at least in the context of software engineering, which is my area of specific expertise. Why? The answer might be very simple - customers can forgive a bug in software (or its low quality), but they rarely forgive a product that lacks reliability. For example, let's imagine you want to use Google. It's acceptable if it sometimes returns some wrong search results, but what if the site crashed every time you visited, or in a milder scenario, it took too long to search for something on Google? Then what are the consequences? You'd give up and switch to another search engine, or maybe an alternative solution like ChatGPT.

SRE Responsibility

To keep users satisfied with your product, and even interested in trying new paid features in the future, your current system and new features need to have adequate reliability for the users. The SRE team is responsible for ensuring this reliability.

From my point of view, there are three critical things that every SRE team should have:

A good understanding of the real user experience of the product. When the team understands how users experience the product, it can understand their goals and what they expect from it in production.
A good monitoring framework built for the product to ensure its reliability (in more detail, SRE needs good SLIs, SLOs, and SLAs). With a good monitoring framework, the SRE team can observe issues in the production environment more clearly. The sooner the team detects an issue, the fewer customers are impacted.
Good practice in handling system failures, degraded performance, and even outages. When SREs put themselves in the position of a dissatisfied customer, they can prepare proper solutions to avoid these unexpected experiences. The more scenarios the team practices and successfully remediates, the lower the risk of customer-facing failures.
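To make the SLI, SLO, and error budget terms from the second point more concrete, here is a minimal sketch in Python. It assumes a simple request-based availability SLI (good requests divided by total requests) over a fixed window. The function name, field names, and numbers are hypothetical and chosen only for illustration; a real monitoring framework would pull these counts from its metrics backend. An SLA is not computed here, since it is a contractual commitment to customers that is typically set looser than the internal SLO.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of successful requests
    slo_target: float        # target fraction, e.g. 0.999 for "three nines"
    allowed_failures: float  # error budget: failures the SLO tolerates in this window
    budget_consumed: float   # fraction of the error budget already used
    slo_met: bool            # True when the measured SLI meets the target


def evaluate_availability(total_requests: int,
                          failed_requests: int,
                          slo_target: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    # SLI: the fraction of requests that succeeded in the window.
    sli = (total_requests - failed_requests) / total_requests

    # Error budget: how many failures the SLO allows for this traffic volume.
    allowed_failures = total_requests * (1 - slo_target)
    budget_consumed = failed_requests / allowed_failures

    return SloReport(
        sli=sli,
        slo_target=slo_target,
        allowed_failures=allowed_failures,
        budget_consumed=budget_consumed,
        slo_met=sli >= slo_target,
    )


if __name__ == "__main__":
    # Hypothetical window: one million requests, 400 failures, 99.9% target.
    report = evaluate_availability(1_000_000, 400, 0.999)
    print(f"SLI: {report.sli:.4%}, SLO met: {report.slo_met}")
    print(f"Error budget consumed: {report.budget_consumed:.0%}")
```

A team could run a check like this per service and per window, and alert when the consumed share of the budget grows faster than expected, which is one way to detect issues before many customers are affected.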

Top 6 misunderstandings about SRE

These are the top 6 misunderstandings that I frequently encounter from my colleagues, which I believe I should write down in the hopes of improving understanding around SRE for myself and others.

SRE is mostly about the operations of a product

SRE only partially covers operational tasks such as infrastructure maintenance and deployment, where its primary role is to upgrade infrastructure with minimal impact on production. However, as an engineering team, SRE also develops monitoring frameworks, follows processes such as escalation or Agile, and continuously improves these frameworks. At my current company, we apply Agile best practices to manage the SRE team so that we can improve the monitoring framework continuously. I will dive deeper into this topic in the next article.

SRE is all about production operations

As I said earlier, the SRE team is also an engineering team that develops and continuously improves the monitoring framework, so it needs a non-production environment in which to do that work. Depending on the team and product structure, the development environment of the monitoring framework can be decoupled from the main application environment, or it can be the same environment. At my current company, we use a mixed mode: for larger products we decouple the environments, but for smaller ones we keep it simple.

SRE and DevOps have the same responsibility

Communication between SRE and DevOps teams is crucial for understanding how system deployment and operational tasks like patching can affect customer experiences. However, the responsibilities of each team can vary significantly.

From my observations, the roles of SRE and DevOps are often combined into one, particularly in companies with smaller products. This allows one or two DevOps engineers to manage the production environment. However, as the product expands, relying on a single role to handle multiple tasks concurrently may not be the most effective approach.