The Vital Role of an SRE and How it's Evolving

August 5, 2021

Have you heard about SRE? It is an evolving role that is becoming more and more vital within IT, including on the mainframe. In this blog, I focus specifically on SRE from an operations perspective. Stay tuned for further blogs on how SRE applies to DevOps.

Beginning with the basics, SRE stands for Site Reliability Engineering, a discipline that started at Google more than a decade ago. It applies a software engineering mindset to system administration topics. The mission of SREs is to protect, provide for and progress the software and systems behind all the services. Availability, latency, performance and capacity are metrics that are always on an SRE’s radar. SREs treat operations as a software problem. There are other key concepts we will talk about too, including system knowledge, monitoring and automation.

SREs cover a broad range of topics. They get granular when they need to, drilling down into problems to the bits and bytes, while also staying high-level and seeing the bigger picture on matters such as capacity and architecture.

SRE Principles

SREs follow some key principles:

  • Automate, which helps reduce MTTR and risk.
  • Eliminate toil, reducing manual and repetitive activities.
  • Cap toil: SREs aim to spend at most 50% of their time on toil and other operational work, leaving at least 50% for engineering improvements.
  • Define and monitor SLAs and SLOs, assuring target levels for service reliability.
  • Define an error budget, the tool an SRE uses to balance service reliability with the pace of innovation.
  • Hold blameless postmortems, encouraging teams to learn from failures, with the focus on solving the problem and improving the service, not on pointing fingers.
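
To make the error budget idea more concrete, here is a minimal Python sketch of the arithmetic behind it. The 30-day window, the 99.9% availability target used in the example and the function names are all illustrative assumptions of mine, not values from any particular service.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO such as 0.999."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    budget = error_budget_minutes(slo, window_days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget to spend")
    return 1.0 - downtime_minutes / budget
```

With these assumptions, a 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime. As long as some of that budget remains, there is room to take on the risk of a new release; once it is spent, the focus would shift back to reliability.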

Specifically talking about reliability, organizations are always looking for reliable environments, but how does that balance with innovation, new features and new capabilities? At some point, we need to embrace some risk for the sake of innovation. At the end of the day, however, organizations are looking for uninterrupted quality of service, fewer fire drills, prioritization and more efficient problem solving.

When the mainframe is part of the IT infrastructure, we are talking about the most reliable platform in the world. So how do we answer the previous question? Does it mean I need to trade off some of this reliability for innovation, modernization and new capabilities? Well, my answer is no. To start, the mainframe is already modern, and innovative new capabilities are continuously being released that enable organizations to run their mainframe efficiently and securely for a more reliable hybrid IT environment.

What is toil?

Let’s explore some of the different SRE principles from a mainframe perspective. One of the SRE concepts I enjoy, and one that applies to the mainframe, is “eliminating toil”.

This is the definition of toil from the SRE Book: “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”

When we are talking about the mainframe, for me the three key categories of toil are:

  1. Manual. A clear example is when an SME needs to take manual action to resolve a problem. A mainframe scenario would be a started task that is hanging due to a memory problem, which you then recycle manually.
  2. Repetitive. When you need to execute the same task repeatedly, like checklists. For example, checking for WTORs (write-to-operator-with-reply requests), contention, SMF, online started tasks, availability, etc.
  3. Automatable. When there is a workflow of activities, often procedurally oriented, that could be automated. For example, you may have a script that is triggered manually to resolve a problem. That’s great, but it means the process is only partially automated. Here, we are aiming for a fully automated process with no human intervention.

Toil is also often tactical and reactive, such as when you’re distracted by a problem requiring your immediate attention. Toil typically lacks enduring value, meaning the activity may provide a temporary quick fix but does not provide real improvement to the process or service. And when operational work grows as fast as the size of your underlying infrastructure and you cannot escape from basic work, there are probably opportunities to eliminate toil.
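
As a rough illustration of moving from a manually triggered script to a fully automated one, here is a minimal Python sketch. The `is_healthy` and `recycle` callables are hypothetical placeholders for whatever health check and restart mechanism your environment provides; they are not real mainframe APIs, and the retry limit is an arbitrary assumption.

```python
from typing import Callable


def auto_recycle(
    is_healthy: Callable[[], bool],
    recycle: Callable[[], None],
    max_attempts: int = 3,
) -> bool:
    """Recycle a task until it reports healthy.

    Returns True once the task is healthy, or False if it is still
    unhealthy after max_attempts recycles and a human should be paged.
    """
    for _ in range(max_attempts):
        if is_healthy():
            return True
        recycle()  # hypothetical restart action supplied by the caller
    return is_healthy()
```

In practice, a scheduler or a monitoring alert would call something like this instead of a person, and a `False` result would be the only case that reaches the on-call SME.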

What are the basic skills of an SRE?

You may be starting to get interested in the benefits this role can bring to your organization, but what are the base skills of an SRE?  

  • I like to begin by talking about topology, which is all about having an understanding of the environment. SREs should have a holistic view of their environment topology, know how resources are connected and have knowledge of all the key applications. Having senior expertise is valuable, but having a comprehensive understanding of the overall environment is crucial. 
  • Next is monitoring. Critical alert monitoring is key and being able to correlate events is even better. You can leverage machine learning capabilities to help with this task. Historical insights are equally important: you need to be able to analyze trends and use the data for continuous improvement.
  • And of course automation. It’s important not only to automate repetitive tasks and reduce toil, but also to support standardization, enable fast repair of problems and save time for in-house teams.
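
To give a feel for what correlating events can mean at its simplest, here is a small Python sketch that groups alerts arriving close together in time into one candidate incident. This is a plain time-window heuristic, much simpler than the machine learning approaches mentioned above, and the `Alert` fields and the 60-second window are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since an arbitrary reference point
    source: str       # hypothetical field, e.g. the subsystem raising the alert
    message: str


def correlate(alerts: list[Alert], window_seconds: float = 60.0) -> list[list[Alert]]:
    """Group alerts so each one is within window_seconds of the previous alert in its group."""
    groups: list[list[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(alert)  # close in time: likely the same incident
        else:
            groups.append([alert])    # gap too large: start a new group
    return groups
```

A real setup would likely also use topology knowledge, for example which resources are connected, to decide whether alerts belong together rather than relying on timing alone.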

SRE is a large subject, with varying opinions across the IT industry. We could spend hours talking about it, but knowing what’s right and wrong when applying SRE matters less than how you translate it to your reality.

You might finish reading this blog and conclude that you already practice most of this, so you may ask yourself “Am I an SRE already?” You might interpret it as a next path in your career and/or a potential way to expand your skills. 

I don’t think the SRE role will be deployed equally across all organizations, but the concept needs to be preserved. You need to balance the SRE benefits with how it would fit into your organization and then determine how to adapt to your reality.

And finally, is applying the SRE role to the mainframe any different? The answer is no, not at all. The mainframe is as modern as any other platform, and the most reliable one, so the SRE role is definitely a perfect fit for it. Want to learn more? Check out this blog and watch the video above.

Like what you read? Join our Mainframe Insights group to collaborate and ideate with us as we grow our Mainframe ecosystem together: https://www.linkedin.com/groups/9053158/

Tag(s): AIOps, Mainframe