Maryland, USA – Selva Kumar Ranganathan, an AWS Cloud Architect working on the MDTHINK platform at the Maryland Department of Human Services, has authored a detailed research paper exploring how artificial intelligence can enhance root cause analysis (RCA) within DevOps environments. The study, titled “Intelligent Incident Management: Leveraging AI for Real-Time Root Cause Analysis in DevOps Pipelines,” has been published in the Journal of Engineering Technology and Applications.
The paper addresses growing concerns around reliability in software deployment pipelines and introduces a practical AI-driven approach for identifying and resolving failures in real time. The methodology centers on minimizing downtime, improving system performance, and enabling engineering teams to act quickly on operational incidents.
The Challenge of Root Cause Analysis in Modern DevOps
With the growing complexity of microservices, container orchestration, and CI/CD automation, pinpointing the root cause of failures has become more difficult than ever. A single deployment may involve dozens of interdependent components across hybrid environments. When something goes wrong, traditional methods such as log inspection or manual monitoring often fall short in both speed and accuracy.
Ranganathan identifies this gap in modern DevOps workflows and argues for a shift toward intelligent, data-driven diagnosis. In the paper, he emphasizes the need for automated tools that can learn from previous incidents, correlate system behavior across time and services, and provide engineers with fast, actionable insights.
AI Techniques for Real-Time Incident Detection and Diagnosis
At the core of the research is an AI-based system designed to perform root cause analysis by ingesting live telemetry data from the CI/CD pipeline. The proposed model uses techniques such as anomaly detection, pattern recognition, and supervised learning to identify the origin of issues as they occur.
The framework includes:
- Historical failure pattern mining: Models are trained on historical incidents to learn common causes and signatures of failure.
- Real-time anomaly detection: Monitoring tools are enhanced with AI algorithms that flag irregularities in build times, test failure rates, or resource usage.
- Correlation across systems: Events from different services are cross-referenced to understand whether an incident is isolated or systemic.
- Confidence scoring: The model assigns probabilities to potential root causes, helping engineers prioritize their investigation.
Used together, these techniques are intended to enable near-real-time diagnosis, replacing hours of manual analysis with automated alerts and visualizations.
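As a rough illustration of the real-time anomaly detection component, the sketch below flags build durations that deviate sharply from a rolling baseline. It is not taken from the paper: the rolling z-score technique, the function name `flag_anomalies`, the window size, the threshold, and the sample data are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def flag_anomalies(durations, window=5, threshold=3.0):
    """Return indices of values that deviate from the mean of the
    preceding `window` values by more than `threshold` standard deviations.

    Hypothetical helper for illustration; a production detector would
    likely use a more robust baseline than a plain rolling mean.
    """
    flagged = []
    for i in range(window, len(durations)):
        history = durations[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale for a z-score; skip it.
            continue
        z = (durations[i] - mu) / sigma
        if abs(z) > threshold:
            flagged.append(i)
    return flagged

# Made-up build durations in seconds; the seventh build is unusually slow.
build_seconds = [120, 118, 123, 121, 119, 122, 310, 120, 121]
print(flag_anomalies(build_seconds))  # prints [6]
```

The same pattern could in principle be applied to test failure rates or resource usage, with the window and threshold tuned per metric.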
Implementation Within Public Sector Platforms
The research is deeply informed by Ranganathan’s work on MDTHINK, a large-scale, cloud-native platform that delivers critical human services across the state of Maryland. MDTHINK supports programs such as Medicaid, SNAP, and child welfare, and processes high volumes of sensitive, time-critical data.
In such systems, even a short disruption can affect thousands of users and essential services. By applying AI-based RCA, MDTHINK and similar platforms can maintain high availability and meet strict performance expectations.
Ranganathan’s research presents this not just as a technical innovation, but as a necessary adaptation for public service infrastructure, where resiliency directly affects real-world outcomes.
Practical Recommendations for DevOps Teams
In addition to the conceptual model, the study provides a set of recommendations for DevOps and Site Reliability Engineering (SRE) teams looking to implement similar capabilities in their environments. These include:
- Data collection and labeling: Build a repository of past incident logs, metrics, and outcomes to train AI models.
- Toolchain integration: Embed the AI models into existing monitoring systems like Prometheus, Grafana, or Splunk.
- Feedback loops: Use incident reports to continuously refine model accuracy and reduce false positives.
- Human oversight: Ensure that AI-driven insights are validated by engineers to maintain operational trust.
These steps allow organizations to introduce AI into their workflows gradually, without overhauling their infrastructure.
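The feedback-loop and human-oversight recommendations can be pictured as a small review log that engineers fill in after each alert. The sketch below is one possible shape for it, not the paper's design: the `ReviewedAlert` record, its field names, and the two summary metrics are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReviewedAlert:
    """One AI-generated alert after an engineer has reviewed it (hypothetical schema)."""
    alert_id: str
    predicted_cause: str
    # Cause confirmed by the engineer; None means the alert was a false alarm.
    confirmed_cause: Optional[str]

def summarize(reviews):
    """Compute the false positive rate and, among real incidents,
    the share where the predicted cause matched the confirmed one."""
    total = len(reviews)
    false_positives = sum(1 for r in reviews if r.confirmed_cause is None)
    real_incidents = total - false_positives
    correct = sum(1 for r in reviews if r.confirmed_cause == r.predicted_cause)
    return {
        "false_positive_rate": false_positives / total if total else 0.0,
        "cause_accuracy": correct / real_incidents if real_incidents else 0.0,
    }

# Made-up review data for illustration.
reviews = [
    ReviewedAlert("a1", "db_timeout", "db_timeout"),
    ReviewedAlert("a2", "bad_config", None),
    ReviewedAlert("a3", "oom", "bad_config"),
    ReviewedAlert("a4", "oom", "oom"),
]
stats = summarize(reviews)
print(stats)
```

Tracking these two numbers over time gives a simple signal of whether retraining on reviewed incidents is actually reducing false positives.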
Future Opportunities for Research and Development
Ranganathan concludes the paper by pointing toward potential directions for future research. These include:
- Graph-based analysis: Using dependency graphs to visualize and trace fault propagation through systems.
- Reinforcement learning: Training systems to recommend or even initiate recovery actions based on previous outcomes.
- Collaborative AI: Designing tools that complement, rather than replace, human engineers by surfacing the most relevant diagnostic information during high-severity events.
By advancing these areas, incident management could become increasingly autonomous, predictive, and responsive to the ever-growing complexity of enterprise systems.
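To make the graph-based direction more concrete, the sketch below traces fault propagation over a small service dependency graph. It is an assumption-laden illustration rather than anything described in the paper: the service names, the `depends_on` mapping, and both helper functions are hypothetical, and the heuristic (an unhealthy service with no unhealthy dependencies is a candidate root cause) is only one simple option.

```python
from collections import deque

def candidate_root_causes(depends_on, unhealthy):
    """Return unhealthy services none of whose direct dependencies are unhealthy."""
    roots = set()
    for svc in unhealthy:
        deps = depends_on.get(svc, [])
        if not any(dep in unhealthy for dep in deps):
            roots.add(svc)
    return roots

def impacted_by(depends_on, root):
    """Return every service that depends on `root`, directly or transitively."""
    # Invert the graph: dependency -> set of services that rely on it.
    dependents = {}
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(svc)
    seen = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(root)  # only relevant if the graph contains a cycle through root
    return seen

# Hypothetical topology: each service maps to the services it calls.
depends_on = {
    "web": ["api"],
    "api": ["auth", "db"],
    "auth": ["db"],
    "db": [],
    "reports": ["db"],
}
unhealthy = {"web", "api", "db"}

print(candidate_root_causes(depends_on, unhealthy))  # prints {'db'}
print(sorted(impacted_by(depends_on, "db")))  # prints ['api', 'auth', 'reports', 'web']
```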
Contributing to Smarter, More Resilient Software Delivery
Selva Kumar Ranganathan’s research makes a meaningful contribution to the evolving field of DevOps reliability engineering. It combines academic rigor with practical insights drawn from real-world public infrastructure, offering a valuable roadmap for organizations facing similar challenges.
As both public and private sector platforms continue to scale and interconnect, the ability to resolve incidents quickly and intelligently will become essential. This work supports that future by showing how artificial intelligence can be thoughtfully applied to a critical aspect of software operations.
The full article is available at the following link:
https://espjeta.org/jeta-v3i3p117
