In May 2025, Meta opened its @Scale conference series with an unmistakable message: Big Tech is on top. Now it needs to bear the weight of its crown.
The Systems & Reliability track brought together engineers from Netflix, AMD, Google, Meta, Microsoft, NVIDIA, and Pinterest to grapple with the infrastructure demands of a world running on AI and personalization. With gen AI workloads growing each day and global traffic climbing, even the most mature platforms are being pushed to rethink their approach to reliability.
Against that backdrop, Netflix’s talk on Mosaic left a particularly strong impression. While many sessions focused on tuning AI models or GPU scheduling, Senior Software Engineers Karthik Puthraya and Saurabh Jaluka focused on the critical layer underneath: how to test and maintain the infrastructure that delivers personalized content to hundreds of millions of subscribers, every minute of every day, without collapsing under its own weight.
Behind the Scenes of Your Netflix Homepage
If you’ve used Netflix recently, you’ve seen Mosaic in action. It’s the system that orchestrates the homepage: the rows of content carousels and the recommendation previews that help guide what hundreds of millions of users watch and experience. Every element is personalized, server-driven, and carefully tuned by artificial intelligence, then assembled through a sprawling, globally distributed network of microservices.
The platform supports over 300 million households in 190 countries, offering more than 18,000 titles and handling hundreds of billions of viewing hours per year. At that scale, cracks don’t stay small for long.
That’s why Puthraya and his team begin with the assumption that things will break. Netflix was early to embrace chaos engineering, a discipline the company helped popularize with tools like Chaos Monkey, which randomly shuts down production servers to test how systems recover. Mosaic builds on that legacy, but with a sharper focus on testing and reinforcing reliability earlier, and more frequently, in the software development lifecycle.
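To make the idea concrete, here is a minimal sketch of the Chaos Monkey pattern: on a schedule, pick a random instance from a fleet and terminate it, forcing the surrounding system to prove it can recover. The Instance class and terminate() call are hypothetical stand-ins, not Netflix’s actual tooling.

```python
import random

class Instance:
    """Hypothetical stand-in for a running server instance."""

    def __init__(self, instance_id: str):
        self.instance_id = instance_id

    def terminate(self) -> None:
        # A real tool would call the cloud provider's API here.
        print(f"chaos: terminating {self.instance_id}")

def unleash_monkey(fleet: list[Instance], kill_probability: float = 0.1) -> None:
    """On each scheduled run, terminate one random instance with some probability."""
    if fleet and random.random() < kill_probability:
        random.choice(fleet).terminate()

# Typically run on a schedule, during business hours, so engineers can respond.
fleet = [Instance(f"i-{n:04d}") for n in range(20)]
unleash_monkey(fleet, kill_probability=1.0)  # force a kill for demonstration
```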
Solving Through Testing
Puthraya and the team had to reframe testing itself. Traditional QA practices (where you write a test, run it, and verify the output) break down in environments as complicated as the streaming giant’s. There’s too much variation and too many dependencies. Instead of relying on post-deployment stress testing or synthetic QA, the team turned to replay testing. This technique pulls real traffic samples from production and replays them in controlled environments, catching regressions and edge cases that might never show up in internal test scenarios. Most importantly, it allows the team to verify behavior under conditions that can’t be neatly categorized.
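As a rough illustration of the replay pattern: sampled production requests are read from a log and sent to both a baseline build and a candidate build, and any response mismatches are collected for triage. The endpoints, log format, and exact-match diffing here are simplifying assumptions, not Netflix’s implementation.

```python
import json
import urllib.request

# Hypothetical internal endpoints: the current production build (baseline)
# and the new build under test (candidate).
BASELINE = "https://baseline.internal.example/homepage"
CANDIDATE = "https://candidate.internal.example/homepage"

def fetch(base_url: str, request: dict) -> dict:
    """Replay one captured request against an environment and parse the reply."""
    req = urllib.request.Request(
        base_url,
        data=json.dumps(request).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def replay(sample_path: str) -> list[dict]:
    """Replay sampled production traffic (newline-delimited JSON) and diff responses."""
    mismatches = []
    with open(sample_path) as f:
        for line in f:
            request = json.loads(line)
            baseline = fetch(BASELINE, request)
            candidate = fetch(CANDIDATE, request)
            if baseline != candidate:  # production systems diff with tolerance rules
                mismatches.append({"request": request,
                                   "baseline": baseline,
                                   "candidate": candidate})
    return mismatches
```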
In practice, this means failure becomes a primary design concern: services are loosely coupled and fault-tolerant by default. If a server instance misbehaves or an entire region goes dark, others pick up the load. This thinking paid off in 2011 when an AWS outage took down a large segment of cloud infrastructure. Netflix, famously, stayed online. As Puthraya explains, it’s the kind of performance that consumers now take for granted, but which engineers know is anything but guaranteed.
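The same mindset shows up at the request level. Below is a minimal sketch, assuming a hypothetical recommendation call and generic fallback rows: retry the personalized path briefly, then degrade to a safe default rather than failing the whole page.

```python
import time

# Generic, non-personalized rows to serve when personalization is down.
# Illustrative placeholder only, not Netflix's actual fallback content.
FALLBACK_ROWS = [{"row": "Trending Now", "titles": ["..."]}]

def personalized_homepage(user_id: str) -> list[dict]:
    """Hypothetical call to the recommendation service; outage simulated here."""
    raise TimeoutError("recommendation service unavailable")

def homepage(user_id: str, retries: int = 2) -> list[dict]:
    """Return the personalized page if possible; otherwise a degraded default."""
    for attempt in range(retries):
        try:
            return personalized_homepage(user_id)
        except TimeoutError:
            time.sleep(0.1 * (attempt + 1))  # brief backoff before retrying
    return FALLBACK_ROWS  # degraded but still available

print(homepage("user-123"))  # serves fallback rows during the simulated outage
```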
The Industry Hits Its Limits
The Mosaic story is part of a broader industry reckoning. For much of the last decade, platform engineering focused on speed, with effort going toward deploying and iterating faster. That might have worked while traffic was predictable and services were simpler. But today, with AI intermediating user interactions and services interdependent across global stacks, consistency and uptime have become the new bottlenecks. According to one report, half of the organizations surveyed agreed that “slow is the new down.” In use cases ranging from streaming and gaming to finance and enterprise SaaS, reliability is increasingly the differentiator.
That was the throughline of the Systems & Reliability track at @Scale. Talks ranged from Meta’s GPU provisioning pipeline, which cut deployment times for AI workloads, to AMD’s exploration of hardware-aware ML scaling. Microsoft and Pinterest shared their own playbooks for failure isolation and latency management. While each talk focused on different layers of the stack, one theme was clear: you can’t build for today’s traffic with yesterday’s testing strategies.
“The reliability challenge is a cultural change as much as a technical one,” explains Puthraya. “What happens when you’re serving the entire world? We’re finding out together.”
Where Things Are Headed
Reliability may never be as flashy as a new model architecture or a clever algorithm, but it’s fast becoming the deciding factor in whether the latest tech breakthroughs can deliver on their promises. A recommendation system is only as useful as its uptime; a generative model only creates value if the interface is fast and stable. And with tech giants investing over $180 billion in data center infrastructure last year alone, these products and platforms are expected to perform under relentless, global demand.
The conference reinforced that maintaining reliable, scalable systems is everyone’s challenge in the new digital economy. As computational demands rise and user expectations harden, the margin for error shrinks. The cost of downtime and latency, whether measured in dollars or users, is simply too high. The good news is that the tech community is responding in force.
The question, it seems, is shifting from “Can we build it?” to “Will it hold up?”
