Resiliency is defined as the ability to recover from or adjust easily to adversity or change. It doesn’t mean that you’re not going to have problems – it means that you can bounce right back when a problem happens.
Successful businesses maintain comprehensive resiliency plans. Unfortunately, as companies began offering digital services, not all of them carried over resilience into their operational IT plans. IT shops have traditionally been tasked to be as efficient as possible, but efficiency and resiliency can be opposing forces. Resiliency requires duplication, redundancy, and preparation for outlying conditions that may never occur – which has been viewed as an inefficient use of resources at a time when most IT failures did not materially impact business results.
But as we have seen in recent years, the exponential growth of data and transactions generated by accelerated digital transformation has made business resiliency highly dependent on IT resiliency – operational resiliency is no longer optional. Business and IT resiliency are now permanently linked.
At Broadcom, we constantly work with our customers to develop and refine a set of best practices for better aligning IT and business resiliency. Here are a few brief points on areas you may want to focus on to improve your resiliency plans. These best practices are based on an acceptance that problems are going to occur – automatic recovery is the preferred response, backed by an ability to rapidly diagnose and adjust when automation is not possible.
Keep in mind that it’s ok to establish different resiliency plans for different types of service. A digital service that would result in substantial business issues if unavailable will require a predictive and preventative level of resiliency with automatic remediation, but a service that’s invoked, say, quarterly to support internal executive reviews could tolerate a longer MTTR. In other words – take an application-centric view when establishing your resiliency plans.
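As an illustration only (the tier names, MTTR targets, and remediation levels below are hypothetical examples, not Broadcom guidance), an application-centric plan can be as simple as a mapping from each service to the resiliency tier that matches its business impact, which your operations tooling and runbooks can then consult:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResiliencyTier:
    """Illustrative resiliency targets for one class of services."""
    name: str
    mttr_target_minutes: int   # maximum tolerable time to restore service
    remediation: str           # "automatic", "assisted", or "manual"

# Hypothetical tiers; every shop will define its own.
TIERS = {
    "critical-digital": ResiliencyTier("critical-digital", 5, "automatic"),
    "internal-reporting": ResiliencyTier("internal-reporting", 24 * 60, "manual"),
}

# Map each application to the tier that matches its business impact.
SERVICE_TIER = {
    "payments-api": TIERS["critical-digital"],
    "quarterly-exec-review": TIERS["internal-reporting"],
}

def required_response(service: str) -> str:
    """Describe the recovery expectation the service's tier calls for."""
    tier = SERVICE_TIER[service]
    return (f"{service}: restore within {tier.mttr_target_minutes} minutes "
            f"via {tier.remediation} remediation")

if __name__ == "__main__":
    for svc in SERVICE_TIER:
        print(required_response(svc))
```

The point is not the code itself but the discipline: every service has an explicit, agreed tier, so nobody has to guess how aggressively to respond when it fails.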
Not every incident can be resolved through automated recovery processes. What is your incident management plan for dealing with issues that require manual resolution?
Setting up a “war room” (physical or virtual) is a traditional approach – gather all your cross-functional stakeholders, infrastructure team, and SMEs, and figure it out. The challenge is that in war room situations it’s often not clear who’s calling the shots and who’s responsible for making decisions. This can create a chaotic, toxic environment ripe for heated arguments and finger-pointing. Such dynamics not only derail the actual debugging of the issue but also inhibit some people from sharing ideas that might resolve it, because the environment makes them feel unsafe or undervalued.
Let’s face it – war rooms are unlikely to go away anytime soon. But high-performing teams have shown that establishing a “no blame” culture – where team members are willing to openly communicate risks and opportunities – leads to fresh ideas and uncovers new possibilities in recovering from an issue. It can take a lot of work to establish such a culture, but it’s worth it – I outlined some of those ideas in this blog post.
Business owners tend to discourage change in the digital systems that support their applications – often because changes have caused impacts in the past. In some shops, business pressure has limited IT to applying fixes in ever-narrowing maintenance windows. The challenge is that when you have fewer windows in which to make changes, and the number of changes is accelerating, it becomes a self-fulfilling prophecy that you’re going to hit an issue. Why?
Consider the extremes. If you apply one fix and something breaks, you immediately understand the cause and can back that fix out. If you apply a thousand fixes simultaneously and something breaks, it’s almost impossible to identify which fix was responsible. In some cases, the root cause of a problem is obscured by other services in the same update, masking the underlying issue. And of course, delaying the rollout of preventative service leaves known defects in production systems longer – increasing the odds of hitting a problem that already has a fix available.
Frequently applying preventative maintenance minimizes the impact of outages. Having fewer changes in a service update makes it easier to diagnose what was changed and what caused the outage, and to revert any of those changes – and if you can’t identify which one affected you, backing out just a handful is far less risky than backing out hundreds.
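To put rough numbers on that argument, here is a minimal sketch (hypothetical, not a real tool) of the best case for isolating a single bad fix by repeatedly backing out half of the batch and retesting. Even a perfectly disciplined bisection costs roughly ten back-out-and-retest cycles for a thousand-change window, while a one-change window requires no searching at all:

```python
import math

def isolation_cycles(batch_size: int) -> int:
    """Best-case back-out-and-retest cycles to isolate one bad fix by halving the batch."""
    return math.ceil(math.log2(batch_size)) if batch_size > 1 else 0

for n in (1, 10, 100, 1000):
    print(f"{n:>5} changes in the window -> ~{isolation_cycles(n)} retest cycles to find the culprit")
```

And each of those cycles is, in effect, another change window that has to be scheduled while the business is already impacted.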
Because of those historical impacts, businesses that advocate continuous delivery of new functionality do not always extend that agile way of working to IT operations. If your business application changes are frequent and granular, but your operational changes are applied infrequently in large batches, you’re facing a conflict that should be resolved.
There are many angles to the change management conversation. My best advice is to work closely with the business to help them understand that staying current with maintenance is a good thing. Examine your change management process and get a clear view of what the end objectives are; then you can go back and identify the minimum set of artifacts you must have to make a change.
In conjunction with that, automate as much as possible!
Some institutions segregate application change windows from infrastructure windows based on a perception that it’s difficult to debug both at the same time. The challenge here is that if you test application changes and infrastructure changes together, but only roll the infrastructure changes to production, your production systems become disconnected from what you just tested. This only increases risk, because what you tested no longer matches what is running in production – you’re now operating an untested combination.
Imagine a scenario where thousands of application changes have been applied, perhaps in an agile process, but changes required within the underlying operating system have not been applied for a relatively long period. Once those infrastructure changes are applied and business applications begin to fail, how do you diagnose the problem? Both the application changes and infrastructure changes were thoroughly tested – but they were not applied together.
Consider moving towards a continuous testing model that uses self-verifying, automated test cases, and a change process in which application teams confirm they have retested whenever a new set of infrastructure packages arrives – and be sure to deploy the application and infrastructure updates together.
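As one possible shape for that process (a sketch only; the manifest file, environment variable, and smoke-test body are hypothetical placeholders), the pipeline can refuse to promote either side until the application has been verified against the exact infrastructure level that will ship with it, and then promote the two as a pair:

```python
# Minimal sketch: only promote the application build and the staged infrastructure
# level together, and only after a joint, self-verifying smoke test has passed.
import json
import os

STAGED_MANIFEST = os.environ.get("STAGED_INFRA_MANIFEST", "staged_infra.json")

def staged_infra_level() -> str:
    """Infrastructure package level the pipeline has staged for this deployment."""
    with open(STAGED_MANIFEST) as f:
        return json.load(f)["level"]

def smoke_test_passed(level: str) -> bool:
    """Stand-in for a self-verifying, end-to-end business transaction test."""
    # A real pipeline would drive the application against the staged level here.
    return True

def promote_together() -> None:
    """Block promotion unless app and infrastructure were tested as a pair."""
    level = staged_infra_level()
    if not smoke_test_passed(level):
        raise SystemExit(f"Blocked: application not verified on infrastructure level {level}")
    print(f"Promoting application build and infrastructure level {level} together")

if __name__ == "__main__":
    promote_together()
```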
There was a time when good systems programmers knew the z/OS control blocks, structures, and states, and could easily work their way through a system dump. These days, due to retirements and changes in the way new IT professionals are educated, such deep core z/OS technical skills are hard to find, and many operators lack an understanding of the core operating system they're managing. Manual problem diagnosis is no longer the norm; it’s a luxury.
The key best practice to compensate for this shift is to use tooling that can generate intelligent insights into a system-crippling situation and, even better, deliver an automated response.
Which leads to observability. IT tools can’t derive insights or recommend automation without first being able to measure a system’s current state by examining its outputs – the definition of observability. The best practice here is to work with products that are committed to making data available via APIs – preferably open APIs – so that operations tools can help operations staff use that data to begin shifting their focus from reactive recovery to proactive avoidance.
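To make that concrete, here is a minimal sketch (the endpoint, metric names, and thresholds are hypothetical, not a specific Broadcom API) of an operations script that reads current system state from an open metrics API and flags drift from a known-good baseline before it becomes an incident:

```python
# Illustrative only: poll a hypothetical open metrics API and flag drift from a
# known-good baseline so operators can act before the condition becomes an outage.
import json
import urllib.request

METRICS_URL = "https://ops.example.com/api/v1/metrics"    # hypothetical endpoint
BASELINE = {"cpu_busy_pct": 70.0, "io_queue_depth": 4.0}  # known-good operating levels
DRIFT_TOLERANCE = 0.25                                     # flag anything >25% above baseline

def current_metrics() -> dict:
    """Fetch the system's current state as exposed by the metrics API."""
    with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
        return json.load(resp)

def drift_report(current: dict) -> list[str]:
    """Compare observed values against the baseline and report significant drift."""
    findings = []
    for name, baseline in BASELINE.items():
        observed = current.get(name)
        if observed is not None and observed > baseline * (1 + DRIFT_TOLERANCE):
            findings.append(f"{name}: {observed} exceeds baseline {baseline} by more than {DRIFT_TOLERANCE:.0%}")
    return findings

if __name__ == "__main__":
    for finding in drift_report(current_metrics()):
        print("PROACTIVE ALERT:", finding)
```

The same data, exposed through open APIs, is what allows tooling to move from answering “what just broke?” to answering “what is about to break?”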
Resilience is not just about you making changes – it’s a partnership between you and your IT vendors. At Broadcom, we take this partnership seriously, and we are committed to delivering programs that can help you on your resiliency journey. A few brief highlights follow – and as a licensee of Broadcom mainframe software, you already have access to all these programs and more.
All-Access Pass to No-Cost Education: Whether you’re new to the mainframe or building your skills, you’ll find tremendous value in our comprehensive library of training, tutorials, product guides, and documentation – all refreshed regularly. And if you can’t find exactly what you need, use our online communities to collaborate with your peers and industry experts.
Expert Change Planning: Work together with Broadcom SMEs to proactively review your plans in collaborative on-site workshops, tailored to your needs.
Assess Your Mainframe Environment: Customized Mainframe Resource Intelligence (MRI) engagements analyze your mainframe health and environment to give you clear actions you can take right now to save money and allocate resources most effectively.
Mitigate Risk: Reduce the risk and cost of software conversions by taking advantage of a proven methodology and resources before you implement your next software release.
There are always opportunities to improve. We are continuously reviewing our documentation, publications, and knowledge documents to ensure that you can quickly find and implement best practices. And we are always looking at ideas to refine our tooling so that it’s easier for you to know whether your systems are in or out of compliance with best practices, and to eliminate – or at least reduce – the possibility of configuration drift over time.