In today's hybrid computing landscape, observability is crucial for maintaining and troubleshooting complex applications. While modern microservices architectures benefit from advanced tracing capabilities, mainframe systems often remain a blind spot. We explored OpenTelemetry's potential for mainframe insights in a previous blog post. Let's expand on that overview, using a practical example to explore how OpenTelemetry can eliminate that blind spot and provide end-to-end visibility for digital business services.
The Happy Path: Tracing a Web Application
Let's consider an e-commerce store that uses web applications, microservices, and databases hosted on the mainframe to process customer orders:
These applications and microservices may be deployed in the cloud or on on-premises servers. Once a customer places an order, how do you determine which services and infrastructure are used by the transaction the order generates? If a slowdown or errors occur, how do you quickly pinpoint the cause? One option is for the Site Reliability Engineering (SRE) team to enable observability using OpenTelemetry auto-instrumentation, collecting traces from the applications and microservices to see the transaction flow. These traces are sent to a central repository or monitoring tool such as Jaeger, Datadog, AppDynamics, or DXOI.
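As a concrete illustration of enabling auto-instrumentation, here is what this setup can look like for a Python service using the standard OpenTelemetry tooling. The service name, collector endpoint, and `app.py` entry point are hypothetical; this is a sketch, not the exact setup used here:

```shell
# Install the OpenTelemetry distro and the OTLP exporter, then let the
# bootstrap tool detect the app's libraries and install matching
# instrumentations automatically.
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run the (hypothetical) service with zero code changes; spans are
# exported over OTLP to a collector or Jaeger endpoint.
OTEL_SERVICE_NAME=paymentBackend \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
opentelemetry-instrument python app.py
```

Because auto-instrumentation hooks into common frameworks (HTTP clients and servers, database drivers, and so on), each service reports its spans without any application code changes.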
Here is how a trace looks in Jaeger:
In this trace, it looks like there are two services involved in our transaction:
Imagine that a 'Trace' is like a meal order in a restaurant. The trace consists of 'Spans', one for each step involved in fulfilling the request (taking the order, sending it to the kitchen, serving the meal, and so on). To uniquely identify the meal order, the restaurant uses a table number or order ID. Similarly, a trace context ID is propagated through every service that handles the transaction.
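That propagated context is standardized by the W3C Trace Context specification: each outbound request carries a `traceparent` header containing the trace ID. A minimal sketch of building and parsing that header in plain Python (no OpenTelemetry SDK required; the example IDs are illustrative):

```python
# Build and parse a W3C Trace Context 'traceparent' header, the mechanism
# OpenTelemetry uses to propagate the trace ID across service boundaries.
# Format: version "00" - 16-byte trace ID - 8-byte parent span ID - flags.
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    version, trace_id, span_id, flags = header.split("-")
    assert version == "00" and len(trace_id) == 32 and len(span_id) == 16
    return {"trace_id": trace_id, "span_id": span_id, "sampled": flags == "01"}

header = make_traceparent(trace_id="4bf92f3577b34da6a3ce929d0e0e4736")
ctx = parse_traceparent(header)
print(ctx["trace_id"])  # the same trace ID appears in every span of the trace
```

Every service that receives this header creates its spans under the same trace ID, which is how a monitoring tool stitches the individual spans into one end-to-end trace.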
With traces, developers and SREs can visualize this entire flow, including the duration of each step and the relationships between services.
Now, let's consider a scenario where customers are unable to order merchandise because the payment services are slow and reporting errors.
Based on this trace, it looks like paymentBackend made a Db2 call on the mainframe, which failed. IBM Db2 documentation for the return code ‘00C90088’ states:
“The resource identified by NAME in message DSNT501I is involved in a deadlock condition.”
Well, it looks like our web application failed due to a deadlock on the Db2 database. But the SRE does not know why the deadlock occurred, because no spans are coming from the mainframe Db2 services. There is no choice but to contact a mainframe SME to understand the error so that order services can be restored. Can we avoid this blind spot?
With Broadcom’s WatchTower Platform™, the answer is yes! Mainframes emit well-defined, detailed instrumentation data in the form of SMF (System Management Facility) records. With WatchTower real-time streaming (z/IRIS), whenever a business application initiates a mainframe transaction (via z/OS Connect, CICS Transaction Gateway, JDBC, etc.), we can convert SMF data into standard OpenTelemetry spans and, using the transaction trace ID noted in SMF, propagate these spans as part of the end-to-end trace context:
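To make the idea concrete, here is a simplified sketch of that conversion. Real SMF records are binary and far richer, and the field names below are hypothetical; the essential point is that the span reuses the trace ID the distributed application propagated into the mainframe transaction, so the Db2 work joins the existing end-to-end trace:

```python
# Sketch: turn a (hypothetical, already-decoded) SMF record into an
# OpenTelemetry-style span. Because the span carries the caller's trace ID,
# backends like Jaeger render it inside the existing distributed trace.
def smf_record_to_span(rec):
    return {
        "name": f"Db2 {rec['operation']}",
        "trace_id": rec["otel_trace_id"],       # propagated by e.g. JDBC
        "parent_span_id": rec["otel_span_id"],  # the calling service's span
        "start_ns": rec["start_ns"],
        "end_ns": rec["end_ns"],
        "status": "ERROR" if rec.get("reason_code") else "OK",
        "attributes": {
            "db.system": "db2",
            "db2.reason_code": rec.get("reason_code", ""),
        },
    }

rec = {  # hypothetical decoded SMF data for the failing payment call
    "operation": "UPDATE ORDERS",
    "otel_trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "otel_span_id": "00f067aa0ba902b7",
    "start_ns": 1_700_000_000_000_000_000,
    "end_ns": 1_700_000_003_000_000_000,
    "reason_code": "00C90088",  # Db2 deadlock reason code
}
span = smf_record_to_span(rec)
print(span["status"])  # prints ERROR
```

With spans like this emitted in real time, the Db2 deadlock is no longer invisible; it appears as an error span directly under the `paymentBackend` call that triggered it.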
Notice that you can now see spans detailing the deadlock, the affected Db2 resources, and information about the 'waiter' and 'holder' of the deadlock. WatchTower provides visibility into applications and services on the mainframe, serving both SREs and mainframe SMEs looking to pinpoint the source of a problem. The real power lies in the context WatchTower provides to understand the true business impact of errors like deadlocks.
Using the OpenTelemetry ‘Span link’, we connect otherwise independent transactions that blocked each other in the same deadlock.
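In OpenTelemetry, a span link is simply a reference to another span's context (its trace ID and span ID), attached with optional attributes. A minimal stdlib-only sketch of the idea, with illustrative IDs and a hypothetical `db2.deadlock.role` attribute name:

```python
# Sketch: model how a span link ties the deadlock 'victim' span in the shop
# trace to the 'holder' span from an otherwise unrelated trace.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str

@dataclass
class Span:
    name: str
    context: SpanContext
    links: list = field(default_factory=list)  # (SpanContext, attributes) pairs

# The shop-trace span that was rolled back in the deadlock (IDs illustrative).
victim = Span(
    "paymentBackend Db2 UPDATE",
    SpanContext("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"),
)

# The span from the unrelated AIFraudDetect trace that held the lock.
holder_ctx = SpanContext("a1b2c3d4e5f60718293a4b5c6d7e8f90", "1234567890abcdef")

# Link the two: tracing UIs can then jump from one trace to the other.
victim.links.append((holder_ctx, {"db2.deadlock.role": "holder"}))

print(victim.links[0][0].trace_id)  # trace ID of the linked AIFraudDetect trace
```

Unlike a parent-child relationship, a link does not merge the two transactions into one trace; it records that two independent traces interacted, which is exactly what a deadlock between unrelated applications is.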
Now we can see that another business application, 'AIFraudDetect', was involved in this deadlock. Although this application is not related to our shop services in any way, it happens to operate on the same Db2 tables as the shop application. In this case, Db2 allowed AIFraudDetect to proceed, but only after a long delay; the shop payment service was chosen as the victim process and was forced to roll back by Db2's deadlock detector.
The SRE has also identified that the deadlock substantially increased the latency of the AIFraudDetect service, and can now determine the total business impact of the deadlock error.
OpenTelemetry is a vendor-agnostic framework, so you can use your existing monitoring tools to visualize traces. Here is how the same end-to-end shop trace looks in Datadog including the Db2 deadlock spans delivered by WatchTower:
Now that we have a comprehensive, end-to-end view of the application, let's revisit our problematic scenario:
WatchTower's z/IRIS capability powers this comprehensive view, providing the missing link for mainframe visibility, enabling faster problem identification and resolution, and significantly reducing mean time to detect (MTTD).
Interested in learning more about WatchTower z/IRIS? Please contact Machhindra.Nale@broadcom.com and Angelika.Heinrich@broadcom.com.