“Your Cloud Isn’t Breaking Because of AWS, It’s Your Architecture”— Matvii Horskyi’s View

By James Andrew

Posted on April 28, 2026

The AWS outage exposed the fragility of cloud dependence. Software Engineer Matvii Horskyi worked on architecture that helped reduce AWS infrastructure costs by around 50%, proving that speed and stability can reinforce one another in modern cloud infrastructure.

In October 2025, an AWS outage in the US-EAST-1 region disrupted over 4 million users and more than 1,000 companies worldwide. The incident showed how much the modern internet depends on a few large cloud providers.

We spoke with Matvii Horskyi, a software engineer whose career reflects the evolution of cloud infrastructure, from working on large-scale Elasticsearch cluster management systems at Qbox (for clients like DoorDash, Yahoo, and CBRE) to building fault-tolerant Kubernetes operators at NetApp that were designed to control infrastructure worth tens of millions. In 2025, he served as a judge at the NextGen Hackathon in France, reviewing projects from international teams alongside 20 technology experts. Through these experiences, Horskyi has seen how the industry’s view of technical skill and system design has changed, and how good architecture can mean the difference between a reliable system and a costly failure.

Matvii, cloud cost overruns and reliability failures are among the most common challenges facing modern engineering teams. A documented Qbox case study reported a 50% reduction in AWS costs, saving hundreds of thousands of dollars, alongside major reliability improvements. You later worked on the platform’s next-generation architecture, building automation-driven, fault-tolerant systems for large-scale production. What were the key design decisions behind that stability and cost-efficiency?

Your cloud isn’t breaking because of AWS; it’s breaking because of your architecture. The real problem is that most teams still treat the cloud like a fancier version of servers. They lift their old monolith into containers, drop it on Kubernetes, and expect it to behave, then act surprised when it breaks in totally new ways. I’ve seen it firsthand.

At Qbox, the platform supported thousands of Elasticsearch clusters for customers such as Yahoo and DoorDash. Early on, we trusted AWS too much, calling APIs directly like they’d never fail. But cloud outages are constant, and halfway-done operations would leave clusters in messy, broken states. We spent a lot of time firefighting. The big turnaround came when we started thinking like distributed-systems engineers. We rebuilt everything around ideas like idempotency, eventual consistency, and graceful degradation, managing clusters through state machines that could pick up right where they left off. Once we accepted that failure was normal, everything became more predictable and a lot less stressful.
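To make the idea of a state machine that can "pick up right where it left off" concrete, here is a minimal Python sketch. It is an illustration only, not Qbox's code: the step names, the `ClusterRecord` class, and the `run_workflow` function are all hypothetical, and the in-memory record stands in for whatever durable store a real system would use.

```python
from dataclasses import dataclass, field

# Hypothetical provisioning steps, executed in this order.
STEPS = ["create_volume", "launch_nodes", "configure_cluster"]


@dataclass
class ClusterRecord:
    """State for one cluster (an in-memory stand-in for a durable record)."""
    name: str
    completed: list = field(default_factory=list)


def run_workflow(record, actions):
    """Run the remaining steps in order, skipping those already checkpointed.

    `actions` maps each step name to a callable that takes the record.
    If a step raises, nothing is checkpointed for it, so calling
    run_workflow again resumes at that same step.
    """
    for step in STEPS:
        if step in record.completed:
            continue  # finished in an earlier attempt
        actions[step](record)          # may raise on a cloud API failure
        record.completed.append(step)  # checkpoint only after success
    return record
```

Because a failed step is re-run from its beginning on the next attempt, each step still has to be safe to repeat. The checkpoint only prevents finished steps from running again.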

At NetApp, you designed Kubernetes operators built to manage multi-million-dollar infrastructure across major databases. When automation handles such scale, which architectural patterns ensure safety, and which mistakes risk cascading failures in enterprise cloud environments?

The operators I built at NetApp were designed around three non-negotiable principles. The first is true idempotency: every operation must be safe to retry from any point. If provisioning a Kafka cluster fails halfway, resuming shouldn’t create duplicate resources or corrupt state. This requires careful state tracking and conditional logic, not just wrapping operations in retry loops.
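One common way to get this kind of retry safety is a check-before-create ("ensure") pattern with deterministic resource names. The Python sketch below assumes that approach; it is not taken from NetApp's operators. `FakeCloud`, `ensure_broker`, and `provision_cluster` are hypothetical names, and the fake client only mimics a provider API.

```python
class FakeCloud:
    """In-memory stand-in for a cloud provider API (hypothetical)."""

    def __init__(self):
        self.resources = {}
        self.create_calls = 0

    def get(self, name):
        return self.resources.get(name)

    def create(self, name, spec):
        self.create_calls += 1
        self.resources[name] = dict(spec)
        return self.resources[name]


def ensure_broker(cloud, cluster, index, spec):
    """Create broker `index` only if it does not already exist."""
    # A deterministic name lets a retry find what an earlier attempt created.
    name = f"{cluster}-broker-{index}"
    existing = cloud.get(name)
    if existing is not None:
        return existing
    return cloud.create(name, spec)


def provision_cluster(cloud, cluster, brokers, spec):
    """Safe to call repeatedly: converges on exactly `brokers` resources."""
    return [ensure_broker(cloud, cluster, i, spec) for i in range(brokers)]
```

Calling `provision_cluster` a second time after a partial failure creates only the missing brokers, which is the difference between conditional logic and a plain retry loop.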

The second is concurrency control. When managing hundreds of clusters across regions, you can’t prevent simultaneous operations, so you need proper locking mechanisms and conflict resolution. A cluster shouldn’t end up in an undefined state because two scaling operations raced.
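The interview does not say which mechanism was used, so the Python sketch below assumes one well-known option: optimistic concurrency, where every write must name the version it was based on (Kubernetes applies the same idea through an object's `resourceVersion`). `ClusterStore`, `ConflictError`, and `scale` are hypothetical names for illustration.

```python
import threading


class ConflictError(Exception):
    """Raised when a write is based on a stale version."""


class ClusterStore:
    """Holds a node count plus a version that increases on every write."""

    def __init__(self, nodes):
        self._lock = threading.Lock()
        self._nodes = nodes
        self._version = 0

    def read(self):
        with self._lock:
            return self._nodes, self._version

    def write(self, nodes, expected_version):
        # Compare and swap: reject the write if someone else got there first.
        with self._lock:
            if expected_version != self._version:
                raise ConflictError("cluster changed since it was read")
            self._nodes = nodes
            self._version += 1


def scale(store, delta, max_attempts=5):
    """Apply a relative scaling change, re-reading and retrying on conflict."""
    for _ in range(max_attempts):
        nodes, version = store.read()
        try:
            store.write(nodes + delta, version)
            return nodes + delta
        except ConflictError:
            continue  # another operation won the race; start from fresh state
    raise ConflictError("gave up after repeated conflicts")
```

With this scheme two racing scale operations cannot silently overwrite each other: the loser sees a conflict, re-reads the current state, and applies its change on top of it.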

The third is observable failure modes. When something goes wrong, and it will, your system must provide clear signals about what failed, what state it’s in, and what recovery actions are safe. Operators that fail silently or leave clusters in ambiguous states are worse than no automation at all.
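As a rough illustration of a failure report that answers those three questions (what failed, what state the system is in, and what is safe to do next), consider the following Python sketch. The `OperationStatus` fields, the phase names, and the `SAFE_ACTIONS` table are assumptions made for this example, not a real operator's schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical table of recovery actions known to be safe per step.
SAFE_ACTIONS = {
    "launch_nodes": ["retry", "rollback"],
    "migrate_data": ["retry"],
}


@dataclass
class OperationStatus:
    phase: str                          # current state, e.g. "Healthy"
    failed_step: Optional[str] = None   # what failed, if anything
    reason: Optional[str] = None        # why it failed
    safe_actions: List[str] = field(default_factory=list)


def run_step(name, fn):
    """Run one step and always return an explicit status, never fail silently."""
    try:
        fn()
    except Exception as exc:
        return OperationStatus(
            phase="Degraded",
            failed_step=name,
            reason=f"{type(exc).__name__}: {exc}",
            safe_actions=list(SAFE_ACTIONS.get(name, [])),
        )
    return OperationStatus(phase="Healthy")
```

An empty `safe_actions` list is itself a signal: it tells the person on call that no automated recovery is known to be safe for that step.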

FerretDB, which you contributed to at Instaclustr, now exceeds 4,000 instances and received Microsoft’s 2025 endorsement. What does this open-source MongoDB alternative signal about the future of enterprise infrastructure and database strategy?

Microsoft’s endorsement of FerretDB marks a turning point in enterprise infrastructure strategy. It signals that even major cloud providers now view vendor lock-in as a liability rather than an advantage. Built on PostgreSQL, FerretDB delivers MongoDB compatibility with a focus on correctness and concurrency, enabling production workloads to run safely. With over 10,800 GitHub stars and 200+ contributors, the project’s growth underscores demand for open alternatives.

Microsoft’s support reflects a wider industry reality: enterprises are rejecting closed ecosystems and seeking infrastructure they can migrate, audit, and control without punitive licensing. This shift extends beyond databases to how organizations think about resilience itself: openness brings flexibility and reduces risk. For cloud providers like Microsoft, backing open-source platforms is strategic, not charitable. The future of enterprise infrastructure lies in open, portable systems that foster trust and longevity rather than dependence.

As a judge at NextGen Hackathon 2025 in France alongside 20 technology experts, you evaluated projects from international teams. When reviewing modern cloud-native architectures, what fundamental capability separates projects that could survive production from those that could not?

Most systems are designed for the happy path, where APIs respond, networks cooperate, and nothing fails. Real production isn’t like that. Databases time out, regions degrade, and users act unpredictably. At Qbox, managing thousands of Elasticsearch clusters taught me that reliability comes from assuming failure is normal and embedding recovery into the design, not treating it as an afterthought.

When reviewing architecture, I look for graceful handling of partial failures, explicit state management, and observable failure modes. Kubernetes helps because it enforces distributed-systems thinking, but deploying a monolith in containers doesn’t make it cloud-native. True resilience isn’t about tools; it’s about mindset. Great engineering anticipates failure, isolates it, and recovers automatically. That philosophy separates production-ready systems from ones that shine in demos but crumble under real-world load.
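Graceful handling of partial failure can be as simple as returning what succeeded together with an explicit record of what did not. The Python sketch below shows that idea under assumed names: `collect_health` and the `probe` callable are hypothetical and not drawn from any project mentioned here.

```python
def collect_health(regions, probe):
    """Probe every region; one failing region must not sink the whole report.

    `probe` is any callable that takes a region name and either returns a
    result or raises an exception.
    """
    healthy, failed = {}, {}
    for region in regions:
        try:
            healthy[region] = probe(region)
        except Exception as exc:
            # Isolate the failure and record it instead of aborting.
            failed[region] = str(exc)
    return {"healthy": healthy, "failed": failed, "degraded": bool(failed)}
```

The caller gets a usable answer during a regional outage, and the `degraded` flag keeps the failure visible rather than hidden.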

Considering the broader evolution of the software industry, especially the rapid progress in AI and infrastructure automation, what new responsibilities fall to engineering leaders as these systems grow more powerful and deeply integrated into everyday life?

We’re at a turning point where the systems we build make important decisions without people directly involved. Kubernetes operators can create infrastructure worth hundreds of thousands of dollars. AI systems choose who gets hired, who receives credit, and what content appears online. As these tools grow in scale, engineering leaders must think not only about correctness but also about social impact. Automation must be transparent; every action should be clear and auditable. Hidden black-box systems aren’t acceptable. We also need to challenge exploitative business models. Proprietary clouds that trap users through closed APIs and data gravity harm trust and fairness. That’s why I believe open, transparent software isn’t just better engineering; it’s a question of ethics.