The race to build larger AI models has quietly turned into a race for hardware. Training runs that once fit on a handful of chips now span thousands of GPUs and TPUs, and the accelerators themselves have become the scarcest input in the entire pipeline. The constraint is no longer whether a team can design a better model. It is whether the compute exists to train it, and whether that compute is being used well. Across the industry, peak GPU utilization sits below 70% at roughly two-thirds of organizations, which leaves a large share of the most expensive hardware in the data center idle or underused. The gap between what is bought and what is actually put to work has become one of the defining inefficiencies of the AI era.
Ankit Sinha is a Senior Software Engineer who has spent the last several years building large-scale machine learning infrastructure at one of the world’s largest technology companies, with earlier work in high-performance systems and autonomous vehicle software. He holds a master’s degree in electrical and computer engineering and specializes in the layer of the stack where software orchestration meets hardware efficiency: deciding how a finite fleet of accelerators gets divided among the teams competing for it. He presented peer-reviewed research on fleet-scale compiler optimization at the 2026 Conference on Machine Learning and Systems, one of the field’s leading venues for systems-level machine learning work. His focus is the question those utilization numbers expose, which is how to make scarce, expensive compute do more without buying more of it.
The Allocation Problem Behind the Compute Shortage
For all the attention paid to chip shortages and data center buildouts, the more persistent problem inside large AI organizations is distribution, not supply. The hardware exists. Deciding who gets it, and for how long, is where things break down. In most environments, accelerator capacity is still handed out through manual, ad hoc processes: spreadsheets, standing reservations, and direct negotiation between teams that each believe their workload is the priority. Those decisions are made infrequently, held for long stretches, and rarely revisited as demand shifts, so capacity stays committed on paper while real utilization drifts far below it.
Sinha led the design and rollout of a system built to replace that manual process entirely. Beginning in 2023, he served as technical lead for an effort to allocate and autoscale a global pool of accelerators based on real-time demand rather than fixed, human-set reservations. The system continuously measures what each workload actually needs and adjusts capacity accordingly, pulling back compute that would otherwise sit reserved but unused and redirecting it to teams that can use it immediately. Before this work, no comparable system existed at that scale; allocation had been handled by people, not policy. He owned the technical roadmap, the system design, and the execution across infrastructure, product, and hardware groups.
“Allocation looks like a scheduling problem until you see it at scale, and then it becomes an economics problem,” Sinha says. “Every reserved accelerator that isn’t doing work is capital sitting still. The job is to make the system give that capacity back the moment it stops being needed, without a human in the loop.”
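The reclaim-and-redirect loop described above can be illustrated with a minimal sketch. It is not Sinha's system; every name here (`Workload`, `rebalance`, `headroom_pct`) is hypothetical, and the policy is an assumption chosen for illustration: each workload keeps its measured usage plus a small headroom, and whatever that frees goes to the workloads with the largest unmet demand.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    reserved: int  # accelerators currently held by the workload
    in_use: int    # accelerators measured as actually busy
    demand: int    # accelerators the workload could use right now


def rebalance(workloads, headroom_pct=10):
    """Return (allocation, leftover) for one rebalancing pass.

    Hypothetical policy: keep measured usage plus headroom_pct percent
    (rounded up), never more than the current reservation, then grant
    the reclaimed capacity to the largest shortfalls first.
    """
    total = sum(w.reserved for w in workloads)

    # Step 1: pull back capacity that is reserved but not being used.
    alloc = {}
    for w in workloads:
        headroom = (w.in_use * headroom_pct + 99) // 100  # integer ceiling
        alloc[w.name] = min(w.reserved, w.in_use + headroom)
    free = total - sum(alloc.values())

    # Step 2: redirect the reclaimed capacity to unmet demand.
    by_shortfall = sorted(
        workloads, key=lambda w: w.demand - alloc[w.name], reverse=True
    )
    for w in by_shortfall:
        grant = min(max(0, w.demand - alloc[w.name]), free)
        alloc[w.name] += grant
        free -= grant

    return alloc, free


fleet = [
    Workload("train-a", reserved=100, in_use=40, demand=40),
    Workload("train-b", reserved=50, in_use=50, demand=120),
]
allocation, leftover = rebalance(fleet)
print(allocation, leftover)
```

In this toy run, the workload holding far more than it uses gives most of its reservation back, and the workload with unmet demand absorbs it. A production allocator would also have to weigh priorities, preemption cost, and how quickly usage can rebound, none of which is modeled here.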
Replacing Human Judgment With Continuous Optimization
Automating allocation is harder than it sounds, because the thing being automated is judgment. A human capacity planner weighs how urgent a launch is, how much headroom a team needs, and how much risk the organization can absorb if a job stalls. Encoding that into a system means turning soft priorities into explicit policy, then letting software apply that policy thousands of times a day without supervision. A further complication is that the underlying hardware is heterogeneous: GPUs and TPUs differ in performance profile, availability, and cost, and a workload that runs well on one may run poorly on the other.
The system Sinha built is hardware-agnostic by design, managing both GPU and TPU fleets under a single framework rather than treating them as separate pools with separate rules. That unification matters because it lets the allocator place a workload wherever capacity is cheapest and most available at that moment, instead of stranding demand against one kind of chip while another sits open. Policy-driven autoscaling replaced the standing reservations that had defined the old model, so capacity now follows demand on a continuous basis. The effect is to pull real work out of a fixed fleet that would otherwise have run well below its potential.
“You can’t buy your way out of an allocation problem,” Sinha explains. “Adding more accelerators to a fleet that’s already running at half its potential just makes the waste more expensive. The advantage is in the layer that decides where each workload runs and when, because that is where most of the value is either captured or lost.”
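To make the idea of a single placement decision across GPU and TPU pools concrete, here is a small sketch under stated assumptions. The names (`Pool`, `Job`, `place`), the per-chip hourly costs, and the rule "pick the cheapest pool that can fit the job" are all hypothetical; the article does not describe the actual placement policy.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    kind: str             # "gpu" or "tpu"
    free: int             # accelerators currently available
    cost_per_hour: float  # hypothetical cost per accelerator-hour


@dataclass
class Job:
    name: str
    # Chips needed on each hardware kind to reach comparable throughput.
    # A kind missing from the dict means the job cannot run there.
    chips_needed: dict


def place(job, pools):
    """Place job on the cheapest pool that fits it; return the kind or None."""
    best, best_cost = None, None
    for pool in pools:
        need = job.chips_needed.get(pool.kind)
        if need is None or need > pool.free:
            continue  # unsupported hardware or not enough free capacity
        cost = need * pool.cost_per_hour
        if best_cost is None or cost < best_cost:
            best, best_cost = pool, cost
    if best is None:
        return None
    best.free -= job.chips_needed[best.kind]  # commit the capacity
    return best.kind


pools = [Pool("gpu", free=64, cost_per_hour=2.0),
         Pool("tpu", free=256, cost_per_hour=1.5)]
print(place(Job("ranker", {"gpu": 32, "tpu": 48}), pools))
print(place(Job("llm", {"gpu": 48, "tpu": 64}), pools))
```

Because both pools sit behind one function, a job that no longer fits on one kind of chip falls through to the other instead of waiting, which is the stranding problem a unified framework is meant to avoid.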
Why Scarcity Turns Efficiency Into a Competitive Edge
The scarcity behind all of this is not abstract. Lead times for the most sought-after AI accelerators have stretched to 36 to 52 weeks, the product of fully booked manufacturing capacity and multibillion-dollar advance orders from the largest buyers. When new hardware can take the better part of a year to arrive, the fleet an organization already owns becomes the only lever it can pull in the near term. Efficiency stops being a cost-optimization exercise and becomes the difference between shipping a product on schedule and waiting on a supply chain.
This is the environment Sinha’s work is built for, one where reclaiming idle capacity is worth as much as buying new hardware and is far faster to realize. His allocation system treats the existing fleet as the resource to optimize, identifying compute that has been reserved but left unused and returning it to the pool for teams that can put it to work right away. Rather than chasing a single peak-efficiency number, it holds the whole fleet closer to productive use across shifting demand. He is also a program committee member for the International Conference on Machine Learning and Applications, one of the established venues for applied machine learning research.
“When hardware is constrained, the winners aren’t the teams with the most chips,” Sinha notes. “They’re the teams that waste the least. A fleet that runs efficiently behaves like a much larger fleet, and that advantage compounds every quarter the shortage lasts.”
The Hard Part Is Coordination, Not Code
The engineering challenge in a system like this is real, but it is rarely the part that decides whether the project succeeds. The harder problem is organizational. Allocation touches every team that depends on compute, which means changing how it works requires those teams to give up control they are used to having. Infrastructure groups, product teams racing toward launches, and hardware planners all hold legitimate and competing claims on the same finite pool. A system that optimizes globally can look, from any single team’s seat, like it is taking something away.
Much of Sinha’s role was managing that tension, not coding around it. He drove alignment across infrastructure, product, and hardware groups whose priorities did not naturally agree, often while those same teams were under pressure from fast-moving launches in a competitive market. The system had to earn trust before it could take over decisions that people had been making by hand, which meant proving that automated allocation would not strand a critical workload at the wrong moment. Building the technology was the precondition; getting an organization to hand it the keys was the actual work.
“The technical design was the easy half,” Sinha observes. “The hard half was convincing a dozen teams that a system would make better capacity decisions than they would, and then being right often enough that they stopped second-guessing it.”
The Allocation Layer Becomes Infrastructure
The stakes here are set to grow rather than ease. The market for AI accelerator chips, valued at roughly $45.8 billion in 2025, is projected to reach about $746 billion by 2035. Every dollar of that spend represents hardware that has to be allocated, scheduled, and kept busy to justify its cost. As fleets expand and the capital tied up in them climbs, the systems that decide how that hardware is used stop being an operational detail and become infrastructure in their own right.
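For a sense of scale, the two market figures above imply a compound annual growth rate that can be worked out directly. This is a back-of-the-envelope calculation, assuming ten annual compounding periods between 2025 and 2035; the rate itself is not a figure from the cited projection.

```latex
\text{CAGR} = \left(\frac{V_{2035}}{V_{2025}}\right)^{1/10} - 1
            = \left(\frac{746}{45.8}\right)^{1/10} - 1
            \approx 16.29^{\,0.1} - 1
            \approx 0.32
```

That is roughly 32% growth per year, sustained for a decade.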
What began as a way to reduce waste inside one organization points at a broader shift in how the industry thinks about compute. For years the default answer to capacity pressure was to buy more. As accelerators grow scarcer and more expensive, the more durable answer is to allocate what already exists with far more precision, and to do it automatically. The allocation layer Sinha helped pioneer is an early version of something the rest of the industry will need: a way to treat a fixed, costly fleet as a continuously optimized resource rather than a set of static reservations. The teams that adopt that mindset will get more out of every chip they own.
“The instinct in this industry is to solve scarcity by adding capacity,” Sinha reflects. “But the fleet you already have is the one you can change today. If you can make it give back everything it isn’t using, you have effectively built more compute without pouring a single new slab of concrete. That is the kind of efficiency that decides what actually gets built.”