The mainframe in financial services is heavily utilized... and even that is a massive understatement. At my company, the mainframe serves as our core trading platform and is where we conduct all our 401(k) management and custodial activities. An outage in the systems or applications that depend on the mainframe is to be avoided at all costs. This is no simple task, but with AIOps, we are able to get ahead of the curve - predicting, analyzing, and resolving potential issues before they have a business impact.
Building a mainframe AIOps powerhouse at my company has been a continuous journey. Consider this single example: we used to get 150K alerts per market day. There was no way of identifying which alerts were truly actionable and which were the result of factors as innocuous as Elon Musk tweeting about a certain stock. The signal-to-noise ratio was far too low. Today, however, we average fewer than five or six alerts per market day on that system, and those alerts are associated with issues that could truly impact our customers and therefore our business. We can readily research and analyze this handful of alerts to determine if any remediation steps are necessary. We are no longer buried (and stressed out) by an avalanche of alerts. We can zero in on alerts that really matter, keeping the business running smoothly and delivering a great customer experience.
How did we move from an overwhelming 150K alerts per day to a very manageable half dozen? In partnership with Broadcom, we have been evolving our mainframe AIOps capabilities at my company. This evolution has involved five key actions implemented incrementally over the course of the past year:
- DECREASING ANOMALY SENSITIVITY
Academically, an anomaly is anything out of the ordinary. Practically, we only care about anomalies that could have a business or customer impact. We explained our needs to the Broadcom team and they made significant code changes to the AIOps machine learning algorithms in their software to decrease sensitivity to anomalies. Now, our thresholds are set at levels that have true meaning for us.
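To make the idea of sensitivity concrete, here is a minimal sketch. It is not Broadcom's algorithm, which is not described here; it assumes a simple z-score test against a historical baseline, and the function name, sample values, and threshold values are all hypothetical. It only shows how raising a threshold shrinks the set of readings that count as anomalies.

```python
from statistics import mean, stdev

def flag_anomalies(baseline, observations, threshold):
    """Return the observations whose z-score against the baseline exceeds threshold.

    Assumption: a plain z-score test stands in for the real (undisclosed)
    machine learning model. A higher threshold means lower sensitivity.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return []  # a flat baseline gives no scale to measure deviation against
    return [v for v in observations if abs(v - mu) / sigma > threshold]

# Hypothetical transactions-per-second readings.
baseline = [100, 102, 98, 101, 99, 103, 97, 100]
observations = [101, 105, 94, 130]

sensitive = flag_anomalies(baseline, observations, threshold=2.0)
relaxed = flag_anomalies(baseline, observations, threshold=4.0)
print(len(sensitive), "alerts at the sensitive setting")
print(len(relaxed), "alerts at the relaxed setting")
```

With the sensitive setting, small wobbles around the baseline raise alerts; with the relaxed setting, only the large spike does. That is the trade-off described above: the threshold should sit where a deviation starts to mean something to the business.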
- FINE-TUNING ADVANCED MACHINE LEARNING PARAMETERS
The high-level code changes Broadcom made to decrease sensitivity to anomalies were deployed not only to us, but for the benefit of all their customers. However, we wanted something more: the capability to fine-tune advanced machine learning parameters on our own to align with our unique needs. Broadcom delivered, and we can now modify key parameters and thresholds on demand. This is AIOps at its best, because it combines machine learning and domain expertise based on our specific environment and business needs.
- TURNING OFF ALERTS THAT DON'T MATTER
Candidly, there are some alerts that we just don't care about. We don't want to see them - ever. Using an alert suppression utility, we specify the metrics that we do not want to be alerted on. For instance, I don't particularly care that a CICS server is processing fewer transactions per second than it normally does (because we might not be getting as much traffic), but I really care whenever it starts processing far more transactions per second than normal, because that could cause an issue further down the line.
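The CICS example can be sketched as a direction-aware suppression list. This is an illustration only, not the suppression utility's actual interface or configuration format: the `Alert` class, the metric name, and the "high"/"low" labels are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Alert:
    metric: str     # hypothetical metric identifier
    direction: str  # "high" or "low" relative to the normal level

# Assumption: suppression is keyed on (metric, direction), so a drop in
# CICS throughput is silenced while a surge still gets through.
SUPPRESSED = {("cics.transactions_per_second", "low")}

def filter_alerts(alerts):
    """Drop every alert whose (metric, direction) pair is on the suppression list."""
    return [a for a in alerts if (a.metric, a.direction) not in SUPPRESSED]

incoming = [
    Alert("cics.transactions_per_second", "low"),   # quiet market: ignore
    Alert("cics.transactions_per_second", "high"),  # surge: keep
]
for alert in filter_alerts(incoming):
    print(alert.metric, alert.direction)
```

Keying the list on direction as well as metric is the design choice that matters here: suppressing the metric outright would also hide the surge we do care about.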
- CLUSTERING ALERTS
Given that some situations are only worthy of attention when they are coupled with other situations, the clustering of alerts is also key in helping to reduce alert noise. For example, instead of seeing thirty individual alerts from MQ, CICS, and Db2 and not realizing immediately that they all relate to a single issue, we get one alert that shows the thirty different red flags that relate to it. This provides us with a clear view of how alerts are interconnected and what actions need to be taken. In another form of clustering, we use a multi-metric rules utility to combine several metrics together so that we are only alerted if all the conditions for the rule are true.
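Both forms of clustering can be sketched in a few lines. How the product actually correlates alerts is not described here, so this sketch assumes the simplest possible approach, grouping alerts that arrive close together in time. The tuple layout, metric names, window size, and rule conditions are hypothetical.

```python
def cluster_alerts(alerts, window_seconds=60):
    """Group (timestamp, subsystem, message) alerts into clusters.

    Assumption: an alert joins the current cluster when it arrives within
    window_seconds of the previous alert; otherwise it starts a new cluster.
    """
    clusters = []
    for alert in sorted(alerts, key=lambda a: a[0]):
        if clusters and alert[0] - clusters[-1][-1][0] <= window_seconds:
            clusters[-1].append(alert)
        else:
            clusters.append([alert])
    return clusters

def rule_fires(metrics, conditions):
    """Multi-metric rule: fire only if every condition is true.

    A metric missing from the reading counts as a failed condition.
    """
    return all(name in metrics and check(metrics[name])
               for name, check in conditions.items())

alerts = [
    (0, "MQ", "queue depth high"),
    (10, "CICS", "response time high"),
    (25, "Db2", "lock waits high"),
    (600, "CICS", "response time high"),
]
clusters = cluster_alerts(alerts)
print(len(alerts), "raw alerts grouped into", len(clusters), "clusters")

# Hypothetical rule: alert only when throughput AND lock waits are both elevated.
conditions = {
    "cics_tps": lambda v: v > 500,
    "db2_lock_waits": lambda v: v > 20,
}
print(rule_fires({"cics_tps": 650, "db2_lock_waits": 5}, conditions))
print(rule_fires({"cics_tps": 650, "db2_lock_waits": 40}, conditions))
```

The first three alerts land in one cluster because each arrives within a minute of the one before it, while the alert ten minutes later stands alone. The rule stays silent when only one of its two conditions holds.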
- AUTO-DISCOVERING MAINFRAME TOPOLOGY
Topology tools that require manual effort are widely used; where I work, however, we are going beyond traditional solutions by partnering with Broadcom to develop the ability to auto-discover mainframe topology. The goal is that by using a new dynamic mainframe topology that can automatically update itself, our team members from multiple domains would be able to instantly focus on the same real-time view of our environment and rapidly collaborate to speed issue resolution. The auto-discovery capability will enable both new team members and seasoned experts to gain an in-depth understanding of our mainframe environment while providing visibility that reflects our real-time environment.
The actions we have taken to evolve our AIOps capabilities allow us to become truly predictive and proactive in how we operate. We can pinpoint areas that need small tune-ups, take the necessary steps, and eliminate issues before they are noticed or become problematic. The application developers who rely on me and my team are thrilled, because they can do their work without encountering any hitches or hang-ups along the way.
Ultimately, AIOps is not just about technology - it is about partnership. Each of the above actions took place in the context of our team collaborating with the Broadcom team to share ideas for innovations and enhancements. We regularly explore and refine new features and capabilities together. We plan implementations carefully to optimize outcomes. Through our partnership with Broadcom, we have made huge strides in alerting and predictive forecasting that help us attain our business goals and objectives. We are truly ahead of the curve - and that's where we plan to stay.