I confess – I’m partially to blame for the flood of information that is challenging mainframe Level 1 support staff today. But I’m now making amends by helping my team at Broadcom launch an innovative, open, integrated observability solution that will improve business performance by streamlining the identification and resolution of high-priority alerts.
But first, some context is in order. When I began my software engineering career writing mainframe operating system code (for what is today called z/OS), ensuring OS reliability, availability, and serviceability (RAS) was a core part of our training. Part of that responsibility meant ensuring that my code spun off sufficient diagnostic information in case something went wrong. The Write to Operator (WTO) assembler macro quickly became a good friend of mine! But an unintended consequence of my RAS diligence was a steady stream of information pushed to operator consoles – most of it just meaningless (“Hey, that task you started has just been completed”).
When I started my career 40 years ago (yikes, did I just reveal that?), this wasn’t a big deal. Transaction rates were more manageable, and operators tended to be highly skilled on the mainframe (as an aside, I can still remember a colleague who could look at a hex dump and instantly spot a problem). Today, however, the mainframe has become such an indispensable behind-the-scenes player in so many aspects of daily life that the rate of unactionable telemetry being pushed to Level 1 has increased astronomically. And many of today’s Level 1 operators did not begin their careers on the mainframe, making it increasingly difficult for them to pinpoint the root cause of a problem, or spot an anomaly that may become a problem in the future.
Today’s software vendors – Broadcom included – are focused on AIOps, particularly the use of machine learning (ML) to study performance patterns and identify anomalies. AIOps has become a table-stakes offering, but is it sufficient to solve the true issues plaguing IT operations?
Consider today’s Level 1 operator sitting in front of a workstation, trying to pick out real problems from a never-ending stream of mainly unactionable information. Think about being asked to identify a specific snowflake in a blizzard; that will give you a good idea of the challenge. AIOps pushes an alert into the stream that a potential problem is brewing – but how does the operator find the root cause? How do they know what to fix? Are they even able to pick out that alert in the first place?
With this in mind, mainframe observability requires a holistic approach that doesn’t just use ML to identify anomalies, but also takes the entire task of Level 1 into account. First, high-priority alerts that require attention must be clearly identified. Those alerts must then be augmented with pertinent context, such as related documentation and a view of the mainframe components that may be impacted by the alert, so that the right subject matter experts can be quickly identified. Full integration into the company’s ticketing system is required so that the alert and all of its related context can be rapidly brought to the attention of specialists. And finally, coming back to those ML insights, they need to be generated in real-time and incorporated within the context of the specific problem to help guide those experts to the right corrective action.
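To make that workflow a little more concrete, here is a minimal Python sketch of the four steps: filter for high-priority alerts, attach context, add an ML insight, and build a ticket payload. It is purely illustrative and not a description of any Broadcom product. The `Alert` class, the 1-to-5 severity scale, the threshold, the lookup tables, and the ticket fields are all assumptions invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical alert record; the field names are illustrative only."""
    source: str      # subsystem that raised the alert
    message: str
    severity: int    # assumed scale: 1 (informational) to 5 (critical)
    context: dict = field(default_factory=dict)


# Made-up lookup tables standing in for documentation and topology data.
RUNBOOKS = {"CICS": "runbooks/cics-response-time"}
DEPENDENTS = {"CICS": ["Db2", "MQ"]}


def triage(alerts, threshold=4):
    """Keep only alerts at or above the assumed high-priority threshold."""
    return [a for a in alerts if a.severity >= threshold]


def enrich(alert, ml_insight=None):
    """Attach related documentation, impacted components, and any ML insight."""
    alert.context["runbook"] = RUNBOOKS.get(alert.source)
    alert.context["impacted"] = DEPENDENTS.get(alert.source, [])
    if ml_insight is not None:
        alert.context["ml_insight"] = ml_insight
    return alert


def to_ticket(alert):
    """Build a payload that a ticketing integration could submit."""
    return {
        "title": f"[{alert.source}] {alert.message}",
        "priority": alert.severity,
        "context": alert.context,
    }


# A tiny stream: one routine completion message and one real problem.
stream = [
    Alert("JES2", "Job completed", 1),
    Alert("CICS", "Transaction response time above baseline", 5),
]

tickets = [
    to_ticket(enrich(a, ml_insight="response time deviates from learned pattern"))
    for a in triage(stream)
]
```

In this sketch only the CICS alert survives triage, and it reaches the specialist with its runbook, impacted components, and ML insight already attached, which is the snowflake pulled out of the blizzard.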
Today’s mainframe observability solutions must enable operations teams to rationalize and coordinate workflows more efficiently – but that’s only the first step. They must also expand mainframe observability to multi-platform, enterprise-wide tools so that site reliability engineers (SREs) can zero in on and resolve the alerts that are most critical to ensuring the continued smooth operation of the business.
At Broadcom, we fully understand the needs of IT operations and are taking our offerings to the next level to meet these needs. Unfortunately, that’s about all I can share today! BUT, if this is intriguing to you, please pay attention to this space. In particular – will you be at SHARE Orlando in March? I will be there, and at that time we’ll be able to have this discussion in depth. If I can’t meet you in Orlando, I’ll be back here on LinkedIn after the conference with the information that you need to get started on your observability journey.