When a large enterprise runs finance, procurement, and production planning through one backbone system, uptime stops being an IT metric. It becomes the difference between trucks leaving on time and phones lighting up across operations. The hard part is not buying redundancy. The hard part is proving, under pressure, that recovery is predictable when real dependencies start failing in sequence.
Vamshi Krishna Jeksani, serving as a Senior Cloud Solutions Architect at a leading global cloud services provider, builds for that reality. As an editorial board member for the SARC Journal of Technology Perception, he brings an evaluator’s mindset to resilience work: targets are only real when they translate into tiering, failover order, and drills that someone can execute without improvising.
From Targets To Tiers: Turning RTO And RPO Into A Real Design
The quickest way to spot a paper recovery plan is to ask one question: what does the business actually lose when the system is down? That is why recovery design starts with targets, not diagrams; the economics force teams to be honest. In a major 2024 global data center survey, 54% of respondents reported that their most recent significant outage had serious or severe consequences, and 20% said the cost exceeded $1 million. When the stakes look like that, low RTO and low RPO are not branding. They are the inputs that determine which components must fail over together, which can lag, and which should never be rebuilt during an incident.
In a leading North American automotive manufacturer’s public cloud migration, Jeksani translated business recovery objectives into a tiered design that treated the ERP core, the reporting layer, and the integration services as different recovery classes with a defined order of operations. He designed high availability across multiple availability zones and paired it with a disaster recovery posture that could actually be exercised, not just described. The result was a measured shift from an environment where recovery could take about 24 hours to one with RTO reduced to roughly 3.5 to 4 hours and RPO tightened to 15 to 20 minutes. That change made the plan usable, because every tier now had a target the team could defend, test, and repeat.
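The tiering idea can be sketched as a small model. The class names, tiers, and thresholds below are illustrative assumptions, not Jeksani's actual design; the point is that once each recovery class carries explicit RTO and RPO targets, the failover order falls out of the data instead of being improvised:

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    HOT = 1    # active across multiple availability zones, fails over first
    WARM = 2   # standby replicas that may lag behind the core
    COLD = 3   # rebuilt from code and backups after the incident

@dataclass
class RecoveryClass:
    name: str
    tier: Tier
    rto_minutes: int   # maximum tolerable downtime
    rpo_minutes: int   # maximum tolerable data loss

# Hypothetical classes mirroring the article's three layers.
classes = [
    RecoveryClass("erp-core", Tier.HOT, rto_minutes=240, rpo_minutes=20),
    RecoveryClass("integration-services", Tier.WARM, rto_minutes=240, rpo_minutes=20),
    RecoveryClass("reporting-layer", Tier.COLD, rto_minutes=1440, rpo_minutes=60),
]

# Hot tiers recover first; ties break on the tighter RTO.
failover_order = sorted(classes, key=lambda c: (c.tier.value, c.rto_minutes))
for c in failover_order:
    print(f"{c.name}: tier={c.tier.name}, RTO<={c.rto_minutes}m, RPO<={c.rpo_minutes}m")
```

The useful property is that every tier assignment is now a defensible claim: a target someone can test in a drill rather than a label on a diagram.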
“Recovery targets are not wishful numbers you paste into a document. They decide what must be hot, what can be warm, and what can wait. If the order is unclear, your RTO will slip the moment you are under stress,” notes Jeksani.
Failover Order Only Matters If You Rehearse It
Once tiers are defined, the next question is whether the sequence actually works when something breaks in the middle. That is where most teams get surprised, because dependencies do not fail politely. Real incidents look like partial service loss, stale name resolution, and a queue that never drains. It is also why drills matter more than assurances, especially in a world where recovery is often demanded under hostile conditions. In a 2024 ransomware survey focused on critical infrastructure sectors, only 20% of organizations hit by ransomware recovered within a week, while 55% took more than a month. Those timelines are a warning. If you have not rehearsed the recovery path, you are learning it live.
Jeksani treated testing as part of the build, not a late checklist item. During a leading North American automotive manufacturer’s migration, he led multiple high availability and disaster recovery drills and validated over 20 failure scenarios across application and database layers. One drill, he recalls, looked clean on paper until the team watched a dependency refuse to come back in the expected order because name resolution was behaving differently than the runbook assumed. The room got quiet fast. They tightened the sequence, clarified ownership for each step, and reran the scenario until recovery behavior matched the design. The value was not the drill itself. It was the removal of ambiguity before an actual outage forced it.
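The kind of ordering problem that drill exposed can be made explicit: if the dependency graph is written down, a safe recovery sequence can be computed and checked rather than assumed. A minimal sketch using Python's standard-library topological sorter, with hypothetical service names:

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical dependency graph: each service lists what must be
# healthy before it can come back (the app tier needs DNS and the
# database, not the other way around).
depends_on = {
    "dns": [],
    "database": ["dns"],
    "message-queue": ["dns"],
    "app-tier": ["database", "message-queue"],
    "reporting": ["app-tier"],
}

try:
    # static_order() yields services so that every dependency
    # recovers before its dependents.
    recovery_order = list(TopologicalSorter(depends_on).static_order())
    print(recovery_order)
except CycleError as err:
    # A cycle means the runbook cannot be executed as written;
    # better to discover that in a drill than in an outage.
    print(f"circular dependency, fix the runbook: {err.args[1]}")
```

A drill then becomes a comparison between this computed order and what the runbook says, which is exactly where mismatches like the name-resolution surprise show up while the stakes are still controlled.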
“Failover order is a story you tell yourself until you test it. Drills make the system argue back. That argument is exactly what you want, while the stakes are still controlled,” says Jeksani.
Automation Is What Keeps a Low RTO From Turning Into Heroics
After drills expose the weak links, the next step is making recovery repeatable without relying on the same two people every time. Nobody wants that pager at 2 a.m. This is where automation stops being a productivity goal and becomes a resilience requirement, because recovery work is often rebuilding work. Yet maturity is uneven. A 2024 cloud maturity study found that only 8% of organizations qualify as highly mature, which helps explain why basic repeatability still breaks down under pressure. At the same time, teams are converging on standardized delivery mechanics, with an end user survey reporting that nearly 60% of Kubernetes clusters rely on Argo CD. The direction is clear: rebuilds have to be scripted, consistent, and fast enough that recovery does not stall on manual setup.
In a leading North American automotive manufacturer’s environment, Jeksani built that repeatability into the platform itself. He developed infrastructure-as-code with Terraform and implemented CI/CD pipelines using Jenkins to automate provisioning and operating system configuration. Instead of waiting five days for manual provisioning cycles and hoping configuration stayed consistent, the team could stand up environments in about 90 minutes with the same patterns each time. That speed did not just help delivery. It reduced the number of moving parts during recovery, because rebuilding became a controlled process rather than a memory test. He has also served as a peer reviewer for multiple SARC research papers, and that evaluator’s mindset shows up in how he insists each automation step be verifiable before it is treated as reliable.
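The "verifiable before reliable" discipline can be sketched as a rebuild pipeline in which every step carries its own check and the process halts the moment a check fails. This is an illustrative pattern, not Jeksani's actual tooling; in practice each `run` would shell out to Terraform or trigger a Jenkins job:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[], None]       # e.g. invoke terraform apply or a Jenkins job
    verify: Callable[[], bool]    # cheap check that the step actually worked

def rebuild(steps: list[Step]) -> None:
    """Run each step and refuse to continue until it verifies."""
    for step in steps:
        step.run()
        if not step.verify():
            raise RuntimeError(f"step '{step.name}' did not verify; halting rebuild")
        print(f"{step.name}: verified")

# Illustrative two-step pipeline using an in-memory stand-in for
# real infrastructure state.
state = {"network": False, "instances": False}

steps = [
    Step("provision-network",
         run=lambda: state.__setitem__("network", True),
         verify=lambda: state["network"]),
    Step("provision-instances",
         run=lambda: state.__setitem__("instances", True),
         verify=lambda: state["network"] and state["instances"]),
]

rebuild(steps)
```

The design choice that matters is the hard stop: a rebuild that continues past an unverified step is exactly the kind of memory test automation is supposed to eliminate.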
“As soon as you need a hero, your recovery time becomes a rumor. Automation turns recovery into a routine. The goal is not flashy speed, it is repeatable speed,” observes Jeksani.
Readiness Criteria That Prevent Paper Recovery
With automation in place, the remaining risk is operational. Teams can still fail recovery if roles are unclear, runbooks are stale, or the handoffs break down under stress. The cost of disruption keeps rising, and executives are increasingly measuring the aftermath, not just the moment of impact. The global average cost of a data breach reached $4.88 million in 2024, and 70% of breached organizations reported the incident caused significant or very significant disruption. The details differ across incidents, but the lesson carries: recovery is judged on business disruption, and that is shaped as much by readiness criteria as by technical design.
For a leading North American automotive manufacturer, Jeksani treated readiness as a deliverable with its own artifacts. He created a detailed disaster recovery playbook and ran structured operational readiness workshops with the manufacturer's teams and partner teams so escalation paths, execution steps, and validation checks were explicit. The platform's controls are also aligned with operational compliance expectations common in enterprise environments, including SOX and IT general controls, high availability and disaster recovery standards, and security controls aligned with SOC and ISO 27001 practices. That alignment mattered because it anchored recovery work in documented behavior, not tribal knowledge. The playbook became the way the program defended its readiness, in reviews and in real drills.
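Making escalation paths, execution steps, and validation checks explicit is something a playbook can be audited for mechanically. A minimal sketch, with a hypothetical playbook structure and field names assumed for illustration:

```python
# Hypothetical playbook format: every step must name an owner, an
# explicit action, and a validation check. Any missing field is
# precisely the ambiguity an incident will expose.
REQUIRED_FIELDS = ("owner", "action", "validate")

playbook = [
    {"owner": "dba-oncall", "action": "promote standby database",
     "validate": "replication lag reads zero and the app reconnects"},
    {"owner": "platform-team", "action": "repoint DNS to DR region"},  # no validate
]

def audit(playbook: list[dict]) -> list[tuple[int, str]]:
    """Return (step_index, missing_field) pairs for every gap."""
    return [(i, field) for i, step in enumerate(playbook)
            for field in REQUIRED_FIELDS if field not in step]

gaps = audit(playbook)
for i, field in gaps:
    print(f"step {i + 1} is missing '{field}'")
```

Run against the sample above, the audit flags the second step for its missing validation check. The same idea scales: a playbook that passes an audit like this before a drill is a promise the team has a chance of keeping under stress.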
“A working runbook is a promise you can keep under stress. If the steps are vague, the incident will rewrite them for you. Readiness is not paperwork, it is the difference between recovery and confusion,” states Jeksani.
Looking Ahead: Recovery Design Will Be A Default Expectation
Once enterprises move their core systems to public cloud, expectations change quickly. Global public cloud end user spending is projected to reach $723.4 billion in 2025, and cloud revenues are expected to rise to $2 trillion by 2030. As more mission-critical workloads land on shared infrastructure, recovery will stop being a specialized exercise and become part of the baseline for running the platform. The teams that win will be the ones who can show tiering discipline, prove failover order under drills, and keep rebuild steps automated enough that the outcome is predictable.
Jeksani’s approach fits that trajectory. As an editorial board member for the SARC Journal of Innovative Science, he evaluates technical work through the lens of repeatability and evidence, and that same lens shows up in how he designs recovery for systems that cannot afford surprises.
“Cloud scale does not excuse vague recovery. It raises the bar. The best teams treat recovery as a normal operation, then prove it until the proof is boring,” says Jeksani.