As artificial intelligence transitions from a productivity experiment into the backbone of production operations, a critical question is emerging across financial services: how do you make AI-driven infrastructure not just fast and efficient, but safe, auditable, and trustworthy enough for the most regulated environments on earth? Credit unions, digital banks, and financial SaaS platforms serve customers who cannot afford for their systems to fail — and the engineers operating those systems are under pressure like never before.
Ajay Devineni is a Senior Site Reliability and DevSecOps Engineer based in Atlanta, Georgia, and an AWS Certified Solutions Architect with over a decade of experience building and operating cloud-native financial infrastructure. Over the past several years, he has developed and deployed a series of AI-powered reliability systems — from machine learning-based alert intelligence to causal inference root cause analysis engines — across a major digital banking SaaS platform serving credit union clients on AWS, Azure, and Google Cloud. He has also published multiple peer-reviewed research papers formalizing these production-proven methodologies for the broader SRE and AIOps communities. In this exclusive TechBullion interview, Devineni explains what AI-driven reliability actually looks like in a SOC 2 regulated banking environment, why alert fatigue is destroying on-call effectiveness, and why self-healing infrastructure is no longer a futuristic concept — it is a production necessity.
Q1. Tell us about yourself and the problems you focus on solving in financial services infrastructure.
I am a Senior Site Reliability and DevSecOps Engineer with more than ten years of experience running cloud-native banking infrastructure. My focus has been on the intersection of AI, automation, and operational reliability — specifically in financial services environments where the stakes of failure are measured not in user experience points but in real customer transactions that cannot complete.
The core problem I have spent my career solving is this: financial infrastructure is simultaneously among the most complex and the most failure-intolerant in the world. Credit union members depend on their banking applications around the clock. The regulatory frameworks — SOC 2, PCI DSS, and others — require documented accountability for every change made to production systems. And the teams operating these systems are often small relative to the scope of what they manage.
AI-driven reliability engineering is the answer to that tension. Not AI as a buzzword — AI as a systematic engineering approach to replacing manual, error-prone processes with autonomous, auditable, and continuously improving systems. That is the work I have been doing in production banking environments, and formalizing in peer-reviewed research so other teams can apply the same methods without repeating the same learning curve.
Q2. Alert fatigue is a well-known problem in cloud operations. What makes it particularly dangerous in financial services, and what have you done about it?
Alert fatigue is not a minor inconvenience — it is a reliability crisis. When on-call engineers are buried in non-actionable pages night after night, they become desensitized. Response times slow. Genuine incidents get missed. Engineers burn out and leave. And the monitoring systems that were supposed to protect production reliability become the thing that undermines it.
In financial services, this plays out against a backdrop of genuine consequences. A missed alert in a credit union banking application is not a degraded user experience — it is a balance transfer that fails, a loan payment that does not process, a real person who cannot access their money.
Over eight years of production on-call across credit union banking and enterprise telecom environments, I have experienced this directly, and I have built the engineering response to it. The most significant project was an ML-driven alert severity scoring system I designed and deployed, trained on over 50,000 historical PagerDuty and Dynatrace alert events. The model uses 17 engineered features, including service dependency depth, deployment recency, historical false positive rates, correlated anomaly signals, and time-of-day patterns. It achieved 89.3% severity classification accuracy with a 2.1% P1 miss rate, meaning it substantially reduced noise while keeping to a minimum the false negatives that make noise reduction dangerous.
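To make the two headline numbers concrete, here is a minimal Python sketch of how severity classification accuracy and a P1 miss rate can be computed from labeled alerts. It assumes a "P1 miss" means a true P1 alert that the model scored at any lower severity; the interview does not give the exact definition. The function name `severity_metrics` and the toy labels are hypothetical.

```python
def severity_metrics(actual, predicted):
    """Return (accuracy, p1_miss_rate) for parallel lists of severity labels.

    Assumption: a P1 miss is a true P1 alert predicted as any other severity.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal length")
    # Overall accuracy: share of alerts whose predicted severity matches the label.
    correct = sum(a == p for a, p in zip(actual, predicted))
    # Predictions for the alerts that really were P1.
    p1_predictions = [p for a, p in zip(actual, predicted) if a == "P1"]
    missed_p1 = sum(p != "P1" for p in p1_predictions)
    accuracy = correct / len(actual)
    p1_miss_rate = missed_p1 / len(p1_predictions) if p1_predictions else 0.0
    return accuracy, p1_miss_rate


# Toy labels invented for illustration; they are not data from the system.
actual = ["P1", "P1", "P2", "P3", "P3", "P2", "P1", "P3"]
predicted = ["P1", "P2", "P2", "P3", "P2", "P2", "P1", "P3"]
accuracy, p1_miss_rate = severity_metrics(actual, predicted)
```

Reporting the P1 miss rate separately from accuracy matters because a model can score well overall while still downgrading the rare alerts that are most costly to miss.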
The results were measurable: 34% fewer actionable pages per engineer per month, 28% faster mean time to acknowledge for genuine P1 incidents, and 41% reduction in after-hours pages. Those are not just technical metrics. I also measured engineer wellbeing through post-shift surveys, and scores improved from 5.8 to 7.4 out of 10 following the system’s deployment. In a field where burnout is endemic, that is a reliability outcome too.
Q3. You developed a framework called TraceCausalNet for automating root cause analysis. What problem does it solve, and how does it work?
The shift to microservice architectures has created a root cause analysis crisis in production operations. When a distributed banking system fails, understanding which service caused the failure versus which services are merely exhibiting downstream symptoms is extremely difficult. At the start of this work, our production baseline for mean time to root cause was 47 minutes per incident. That is 47 minutes of active investigation, with engineers manually correlating logs, traces, and dashboards while a production system is degraded for customers.
TraceCausalNet is a causal inference framework that addresses this directly. It constructs dynamic service dependency graphs from distributed trace data collected via Dynatrace. When Dynatrace Davis AI detects an anomaly and triggers a PagerDuty alert, causal analysis begins immediately — before the on-call engineer even opens their laptop. The framework applies Granger causality analysis to identify which services are causally driving failures downstream, rather than just which services happen to be exhibiting anomalies simultaneously. Root cause candidates are ranked by interventional impact score, giving engineers a prioritized shortlist rather than a raw event dump.
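The Granger causality step can be illustrated with a small, self-contained Python sketch. This is not TraceCausalNet's code: it is a simplified pairwise test with a single lag, written with the standard library only, and the series names `db_latency` and `api_latency` are hypothetical. It compares a model that predicts a service's metric from its own past against one that also sees the other service's past, and reports the F-statistic for the improvement.

```python
import random


def _solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def _rss(rows, targets):
    """Residual sum of squares of an ordinary least squares fit."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    beta = _solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))


def granger_f(cause, effect):
    """F-statistic for 'past values of cause help predict effect' (lag 1)."""
    targets = effect[1:]
    # Restricted model: intercept plus the effect's own previous value.
    restricted = [[1.0, effect[t - 1]] for t in range(1, len(effect))]
    # Unrestricted model: additionally uses the candidate cause's previous value.
    unrestricted = [[1.0, effect[t - 1], cause[t - 1]] for t in range(1, len(effect))]
    rss_r = _rss(restricted, targets)
    rss_u = _rss(unrestricted, targets)
    dof = len(targets) - 3  # observations minus unrestricted parameters
    return (rss_r - rss_u) / (rss_u / dof)


# Synthetic data: the API metric follows the database metric one step later.
rng = random.Random(7)
db_latency = [rng.gauss(0, 1) for _ in range(200)]
api_latency = [0.0] + [0.8 * db_latency[t - 1] + rng.gauss(0, 0.1)
                       for t in range(1, 200)]
f_forward = granger_f(db_latency, api_latency)  # database -> API
f_reverse = granger_f(api_latency, db_latency)  # API -> database
```

On this synthetic data the forward statistic is far larger than the reverse one, which is the asymmetry that lets a causal analysis separate a driving service from one that is only showing downstream symptoms. A production system would presumably use multiple lags, significance testing, and the trace-derived dependency graph to limit which pairs are tested.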
Evaluated across hundreds of production incidents spanning six credit union banking applications over four years, the framework reduced mean time to root cause from 47 minutes to 8 minutes — an 83% improvement — with 91% top-three root cause accuracy. I published this work in the International Journal of Emerging Trends in Computer Science and Information Technology so that other SRE teams can replicate the methodology in their own environments. The paper documents the architecture, evaluation methodology, and the specific considerations for deploying causal inference RCA in SOC 2 regulated environments.
Q4. You have published research on the risks of AI-generated code in production banking environments. Why is this a problem that does not get enough attention?
It is perhaps the fastest-growing reliability and compliance risk in financial services infrastructure right now, and the industry is behind on it.
The productivity gains from AI coding assistants are real and significant. Engineers can generate Terraform configurations, shell scripts, and deployment parameter files much faster than before. But AI-generated infrastructure artifacts introduce a novel category of risk that conventional code review processes were not designed to catch: code that is syntactically valid but semantically incorrect in its specific production context. Hallucinated API calls. Configurations that create working infrastructure with unintended security or blast radius implications.
Published security research has found that approximately 40% of the programs generated by an AI coding assistant in security-relevant scenarios contained exploitable vulnerabilities. In a banking environment where every production infrastructure change requires documented change control and SOC 2 audit evidence, deploying AI-generated scripts without a structured validation framework creates compliance gaps that outweigh the productivity benefit.
The framework I developed and published — a Remediation Guardrail Framework — addresses this through a four-tier risk classification system based on blast radius analysis, reversibility assessment, and scope validation against existing Terraform state. It integrates with change control workflows through Model Context Protocol tooling, ensuring every AI-generated artifact is associated with an approved change record before it is applied to production. Evaluated against a corpus of 108 AI-generated artifacts, the automated tier classifier achieved 91% accuracy against expert human review, and caught seven artifacts containing production errors that visual plan review alone would not have detected. In a banking environment, any one of those seven could have been a critical incident.
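A rough Python sketch of the idea follows. The interview names the three inputs (blast radius, reversibility, scope validation) and the change-record requirement, but not the tier rules, so the mapping in `classify_tier`, the threshold of five resources, and every field name here are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Artifact:
    resources_affected: int   # blast radius proxy: resources the plan would touch
    reversible: bool          # whether the change can be rolled back cleanly
    in_declared_scope: bool   # whether it touches only resources named in the change
    change_record_id: Optional[str] = None


def classify_tier(artifact: Artifact) -> int:
    """Hypothetical mapping to tiers 1 (lowest risk) through 4 (highest risk)."""
    if not artifact.in_declared_scope:
        return 4  # touches state outside the declared scope
    if not artifact.reversible:
        return 3  # in scope, but cannot be undone
    if artifact.resources_affected > 5:  # invented threshold
        return 2  # reversible, but wide blast radius
    return 1


def may_apply(artifact: Artifact, approved_records: set) -> bool:
    """Gate: an artifact is applied only if tied to an approved change record."""
    return artifact.change_record_id in approved_records


approved = {"CHG-1001"}  # hypothetical change record identifier
safe = Artifact(2, True, True, "CHG-1001")
unscoped = Artifact(1, True, False)
```

Keeping the change-record gate separate from the tier classifier means a low-risk tier never bypasses change control by itself.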
Q5. You have also built a self-healing certificate management system. How does that work, and why does it matter?
Certificate expiry is one of the most preventable causes of production outages — and one of the most common. It is preventable because the failure is perfectly predictable: a certificate expires on a known date, creating a known failure at a known time. Yet organizations consistently fail to manage this because at scale, the operational complexity outpaces manual tracking capacity.
Across six production banking applications, I built an autonomous certificate lifecycle management pipeline covering 847 production certificates. The system runs daily discovery scans across Kubernetes secrets, Ingress resources, and AWS load balancer configurations to build and maintain a live certificate-to-service dependency graph. Each certificate is evaluated using a Certificate Risk Score model that incorporates days to expiry, service criticality, and dependency chain depth. The system responds with graduated automation: a Jira ticket is created automatically at a CRS of 40, ACM renewal triggers automatically at 70, and emergency autonomous renewal plus service restart orchestration activates at 85. Post-renewal, Kubernetes rolling restarts execute with automatic rollback if health checks fail.
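The graduated response can be sketched in a few lines of Python. The thresholds of 40, 70, and 85 come from the description above; the scoring formula, its weights, the 90-day horizon, and the function names are assumptions, since the interview lists the inputs to the Certificate Risk Score but not how they are combined.

```python
def certificate_risk_score(days_to_expiry, criticality, dependency_depth):
    """Hypothetical CRS on a 0-100 scale. Weights are invented for illustration.

    criticality: 0.0 (low) to 1.0 (critical service)
    dependency_depth: number of downstream hops that rely on the certificate
    """
    # Urgency rises linearly over an assumed 90-day window before expiry.
    urgency = max(0.0, min(1.0, (90 - days_to_expiry) / 90))
    depth = min(dependency_depth, 5) / 5
    return round(100 * (0.6 * urgency + 0.25 * criticality + 0.15 * depth), 1)


def action_for(crs):
    """Map a score to the graduated response, highest threshold first."""
    if crs >= 85:
        return "emergency_renewal_and_restart"
    if crs >= 70:
        return "acm_renewal"
    if crs >= 40:
        return "jira_ticket"
    return "monitor"


# Example: a critical service's certificate with 10 days left and 3 dependents.
score = certificate_risk_score(days_to_expiry=10, criticality=1.0, dependency_depth=3)
action = action_for(score)
```

Checking the highest threshold first ensures a certificate that crosses several thresholds at once receives only the most urgent response.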
The result over 18 months: zero certificate-related production incidents, compared to three incidents in the prior two-year period. That is self-healing in the truest sense — the system identifies a future failure before it occurs and resolves it without human scheduling or memory.
Q6. You have pioneered the use of large language models to extract knowledge from incident documentation. What is the insight behind that work?
The insight is that post-mortem documentation is one of the most underutilized assets in site reliability engineering. After every significant incident, teams write thorough root cause analyses — documenting what failed, why it failed, what contributed to it, and how it was resolved. That knowledge is then filed in a documentation system and almost never retrieved again. The next time a similar incident occurs, an engineer spends 47 minutes rediscovering a root cause that was documented two years ago.
The system I built applies large language models to extract structured knowledge from incident documentation and organize it into a searchable reliability knowledge graph stored in Amazon Neptune. LLM-based entity extraction identifies services, failure modes, contributing factors, timeline events, and remediation actions from unstructured post-mortem text. Semantic similarity search then enables on-call engineers to retrieve historically analogous incidents and their resolution pathways in real time during active investigations — within seconds of a new alert firing, not hours into an investigation.
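The retrieval step can be approximated with a small Python sketch. It is a stand-in under stated assumptions: bag-of-words cosine similarity replaces learned embeddings, an in-memory dictionary replaces the knowledge graph, and the incident identifiers and texts are invented.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts; a simple stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar(alert_text, postmortems, top_k=3):
    """Return up to top_k incident ids ranked by similarity to the alert text."""
    query = vectorize(alert_text)
    scored = [(cosine(query, vectorize(body)), incident_id)
              for incident_id, body in postmortems.items()]
    scored.sort(reverse=True)
    return [incident_id for score, incident_id in scored[:top_k] if score > 0]


# Invented post-mortem summaries, for illustration only.
postmortems = {
    "INC-101": "TLS certificate expired on ingress causing login failures",
    "INC-102": "Database connection pool exhausted after deployment",
    "INC-103": "Payment queue backlog from slow downstream processor",
}
matches = most_similar(
    "connection pool exhausted on database after deployment rollout",
    postmortems,
    top_k=1,
)
```

Word overlap misses paraphrases that a semantic embedding would catch, which is one reason an LLM-based approach is attractive for post-mortem text written by many different authors.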
Validated against over 300 post-mortem documents accumulated across six credit union banking applications, the system identified 14 systemic anti-patterns recurring across the incident corpus. These were patterns that no individual engineer would have noticed, but that became visible at the corpus level. Proactive infrastructure improvements derived from those patterns are estimated to prevent approximately six high-severity incidents annually. I published this work in the International Journal of Artificial Intelligence, Data Science, and Machine Learning, because the methodology, which treats post-mortem documentation as a continuously compounding engineering asset rather than a compliance artifact, applies well beyond the specific environment where I built it.
Q7. How do you think about the boundary between what AI should automate autonomously and what must still involve a human?
This is the most important design question in any self-healing or agentic AI system, and it is especially consequential in regulated financial environments.
My framework uses three factors to determine the automation boundary. First, reversibility: can the automated action be undone if it turns out to be wrong? Second, blast radius: what is the worst-case impact of an incorrect automated action on production systems and customer-facing services? Third, novelty: how similar is this situation to the training distribution the system was built on?
Actions that are reversible, bounded in blast radius, and similar to historical patterns are safe to automate. A certificate renewal with automatic rollback on failed health checks meets all three criteria — it executes autonomously. An account-level change in a production financial system that is irreversible, has large downstream impact, or represents a pattern the system has not seen before should always route to human review. The AI gathers the evidence, presents the case, and the human decides.
This is not a reluctance to trust AI. It is an engineering discipline. Self-healing systems must know the boundaries of their own competence. When the system encounters an incident pattern that falls significantly outside its training distribution, it defaults to human escalation rather than attempting autonomous remediation. That design choice — building in explicit novelty detection and conservative fallback behavior — is what makes it possible to gain organizational approval for automation in regulated banking environments. Trust is earned through evidence and through the system demonstrating it knows when not to act.
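The three-factor boundary can be expressed as a short Python sketch. The factors come from the answer above; the numeric limits, the way blast radius and novelty are measured, and all names are assumptions chosen only to make the rule executable.

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    name: str
    reversible: bool
    blast_radius: int   # hypothetical: customer-facing services at risk
    novelty: float      # hypothetical: 0.0 (seen often) to 1.0 (never seen)


MAX_BLAST_RADIUS = 1    # invented limit
MAX_NOVELTY = 0.3       # invented limit


def route(action: ProposedAction) -> str:
    """Automate only when all three conditions hold; otherwise escalate."""
    if (action.reversible
            and action.blast_radius <= MAX_BLAST_RADIUS
            and action.novelty <= MAX_NOVELTY):
        return "automate"
    # Conservative fallback: any failed condition routes to a human.
    return "human_review"


cert_renewal = ProposedAction("renew certificate", True, 1, 0.05)
account_change = ProposedAction("account-level change", False, 4, 0.2)
unfamiliar = ProposedAction("restart after unseen pattern", True, 1, 0.9)
```

Because the default branch is human review, adding a new action type without scoring it carefully errs toward escalation rather than toward autonomous execution.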
Q8. What are the biggest mistakes you see organizations make when they try to adopt AI reliability engineering?
Three patterns show up consistently.
The first is starting with models instead of data. Every AI reliability system I have built that underperformed did so because of data quality problems, not model complexity limitations. Inconsistent alert severity labels, incomplete RCA documentation, missing timestamps in incident records — these data problems limit accuracy more than any choice of algorithm. The first investment for any organization considering AI-SRE should be a data quality and instrumentation audit, followed by documentation standardization, followed only then by model selection.
The second is underestimating the organizational change required relative to the technical change. In my experience, the ratio is approximately three to one: three parts organizational work for every one part technical work. The ML alert severity scorer took weeks to build and months to gain approval for live activation in a regulated environment. An agentic AI co-pilot took days to configure and weeks to establish the review workflows that made it safe to use consistently. Organizations that plan for this imbalance succeed in reaching production impact. Those that treat AI reliability as a technology deployment project and skip the change management component stall at proof-of-concept.
The third is not measuring engineer wellbeing alongside technical reliability metrics. Alert fatigue, on-call burnout, and decision fatigue are leading indicators of reliability degradation. Engineers experiencing severe alert fatigue make more diagnostic errors, take longer to resolve incidents, and are more likely to leave the team — all of which directly impact the reliability numbers organizations care about. Measuring and improving engineer wellbeing is reliability engineering, not a soft benefit on top of it.
Q9. What do you see as the most important trend shaping cloud reliability engineering over the next two to three years?
The rise of agentic AI as standard infrastructure for SRE operations. We are moving from AI as an analytical layer — classifying alerts, diagnosing failures — to AI as an operational agent that can read infrastructure code, generate remediation artifacts, execute validated actions, and document its own work.
The skills that made exceptional SREs five years ago — deep tool knowledge, ability to manually correlate multi-source telemetry, memory of historical incident patterns — are becoming less differentiating as AI handles more of that cognitive work. The skills that will define exceptional SREs in two or three years are AI collaboration skills: how to prompt an agentic AI effectively for infrastructure tasks, how to validate AI-generated artifacts efficiently, how to design human-AI handoff boundaries for specific risk categories, and how to build the organizational trust frameworks that allow AI systems to operate with appropriate autonomy in regulated environments.
The engineers and teams that build those skills now will have a substantial advantage. I expect agentic SRE tools to become standard infrastructure in production operations within two years. The teams adopting them now are not just improving their current operations — they are building the workflow expertise that will be the competitive differentiator in that future.
Ajay Devineni is an AWS Certified Solutions Architect and Senior SRE / DevSecOps Engineer based in Atlanta, Georgia, with over a decade of experience in cloud-native financial infrastructure. His peer-reviewed research on AI reliability engineering, alert intelligence, causal inference RCA, and AI-generated code safety has been published in The American Journal of Engineering and Technology, the International Journal of Emerging Trends in Computer Science and Information Technology, the International Journal of AI, BigData, Computational and Management Studies, and the International Journal of Artificial Intelligence, Data Science, and Machine Learning. His work is indexed on Google Scholar.
Last updated: June 3, 2026