Have you heard about SRE? It is an evolving role that is becoming increasingly vital within IT, including on the mainframe. In this blog, I focus specifically on SRE from an operations perspective. Stay tuned for further blogs on how SRE applies to DevOps.
Beginning with the basics, SRE stands for Site Reliability Engineering and started at Google more than a decade ago. It focuses on applying a software engineering mindset to system administration topics. The mission of SREs is to protect, provide for and progress the software and systems behind all the services. Availability, latency, performance and capacity are metrics that are always on an SRE’s radar. SREs treat operations as a software problem. There are other key concepts we will talk about as well, including system knowledge, monitoring and automation.
SREs cover a broad range of topics. They are granular when they need to be, drilling down into problems to the bits and bytes, while also staying high-level and seeing the bigger picture when it comes to things like capacity and architecture.
SRE Principles
SREs follow some key principles:
Speaking specifically about reliability, organizations are always looking for reliable environments, but how does that balance with innovation, new features and new capabilities? At some point, we need to embrace some risk for the sake of innovation. At the end of the day, however, organizations are looking for uninterrupted quality of service, fewer fire drills, prioritization and more efficient problem solving.
When the mainframe is part of the IT infrastructure, we are talking about the most reliable platform in the world. So how do we answer the previous question? Does it mean I need to trade off some of this reliability for innovation, modernization and new capabilities? Well, my answer is NO. To start, the mainframe is already modern, and new innovative capabilities are continuously being released that enable organizations to run their mainframe efficiently and securely for a more reliable hybrid IT environment.
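One way to make the balance between reliability and risk concrete is to turn an availability target into the amount of downtime it permits, which the SRE Book calls an error budget. The term does not appear earlier in this blog, and the sketch below is only an illustration: the 30-day period, the function names and the example target are all assumptions, not values taken from any particular environment.

```python
def allowed_downtime_minutes(availability_target: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a period.

    availability_target is a fraction, e.g. 0.999 for "three nines".
    period_days defaults to an assumed 30-day window.
    """
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_target)


def remaining_budget_minutes(
    availability_target: float,
    downtime_so_far_minutes: float,
    period_days: int = 30,
) -> float:
    """Downtime still available in the period; negative means the target was missed."""
    budget = allowed_downtime_minutes(availability_target, period_days)
    return budget - downtime_so_far_minutes


if __name__ == "__main__":
    # Hypothetical example: a 99.9% target with 10 minutes of downtime so far.
    target = 0.999
    print(f"Budget:    {allowed_downtime_minutes(target):.1f} minutes")
    print(f"Remaining: {remaining_budget_minutes(target, 10.0):.1f} minutes")
```

The point is not the arithmetic itself but the conversation it enables: while budget remains, there is room to roll out new capabilities, and when it runs low, the focus shifts back to stability.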
What is toil?
Let’s explore some of the different SRE principles from a mainframe perspective. One SRE concept I enjoy, and one that applies well to the mainframe, is “eliminating toil”.
This is the definition of toil from the SRE Book: “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
When we are talking about the mainframe, for me the 3 key categories of toil are:
Toil is also often tactical and reactive, such as when you’re distracted by a problem requiring your immediate attention. Toil typically lacks enduring value, meaning the activity may provide a temporary quick fix but does not provide real improvement to the process or service. And when operational work grows as fast as the size of your underlying infrastructure and you cannot escape from basic work, there are probably opportunities to eliminate toil.
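The attributes in the SRE Book definition can be used as a simple checklist for spotting toil. The sketch below is a minimal illustration, assuming you keep your own inventory of recurring operational tasks. The scoring rule, the 0.5 threshold and the example task names are hypothetical and are not part of the SRE Book or of any product.

```python
from dataclasses import dataclass

# The six attributes from the SRE Book definition of toil quoted above.
TOIL_ATTRIBUTES = (
    "manual",
    "repetitive",
    "automatable",
    "tactical",
    "no_enduring_value",
    "scales_linearly",
)


@dataclass
class Task:
    name: str
    hours_per_week: float
    attributes: frozenset  # subset of TOIL_ATTRIBUTES that applies to the task


def toil_score(task: Task) -> float:
    """Fraction of the six toil attributes the task exhibits (0.0 to 1.0)."""
    unknown = task.attributes - set(TOIL_ATTRIBUTES)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    return len(task.attributes) / len(TOIL_ATTRIBUTES)


def rank_automation_candidates(tasks, threshold=0.5):
    """Tasks scoring at or above the (assumed) threshold, most weekly hours first."""
    candidates = [t for t in tasks if toil_score(t) >= threshold]
    return sorted(candidates, key=lambda t: t.hours_per_week, reverse=True)


if __name__ == "__main__":
    # Hypothetical inventory; names and hours are made up for illustration.
    inventory = [
        Task("restart a hung job by hand", 4.0,
             frozenset({"manual", "repetitive", "automatable", "tactical"})),
        Task("capacity planning review", 3.0, frozenset({"manual"})),
        Task("copy the same report every morning", 6.0,
             frozenset(TOIL_ATTRIBUTES)),
    ]
    for task in rank_automation_candidates(inventory):
        print(f"{task.name}: score {toil_score(task):.2f}, "
              f"{task.hours_per_week} h/week")
```

Sorting by weekly hours is a deliberate simplification: it puts the tasks that cost the most time at the top, which is usually where eliminating toil pays back first.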
What are the basic skills of an SRE?
You may be starting to get interested in the benefits this role can bring to your organization, but what are the base skills of an SRE?
SRE is a large subject, with varying opinions across the IT industry. We could spend hours talking about it, but more important than knowing what’s right and wrong when applying SRE is how you translate it to your reality.
You might finish reading this blog and conclude that you already practice most of this, so you may ask yourself “Am I an SRE already?” You might see it as the next step in your career and/or a potential way to expand your skills.
I don’t think the SRE role will be implemented the same way across all organizations, but the concept needs to be preserved. You need to weigh the benefits of SRE against how it would fit into your organization and then determine how to adapt it to your reality.
And finally, is applying the SRE role to the mainframe any different? The answer is no, not at all. The mainframe is as modern as any other platform, and the most reliable one, so the SRE role is definitely a perfect fit for it. Want to learn more? Check out this blog and watch the video above.
Like what you read? Join our Mainframe Insights group to collaborate and ideate with us as we grow our Mainframe ecosystem together: https://www.linkedin.com/groups/9053158/