Megha Aggarwal: “When you suddenly have 100 containers requesting training data simultaneously, traditional storage systems create a massive bottleneck.”

Megha Aggarwal

An Amazon AWS Software Development Engineer shares proven strategies for optimizing storage solutions across the entire AI pipeline.

Spending on AI-powered storage reached $22.90 billion in 2023 and is projected to climb to $110.68 billion by 2030, a 25.2% compound annual growth rate, according to Fortune Business Insights. At the same time, data center power densities have more than doubled in just two years, jumping from 8 kW to 17 kW per rack, with projections reaching 30 kW by 2027 as AI workloads intensify. In this environment, where GPUs do the parallel processing that accelerates AI workloads, the choice of storage solution becomes critical for keeping these expensive and scarce resources fully utilized.

Megha Aggarwal, a Software Development Engineer on the Amazon AWS EKS team and an expert in containerized AI workloads, has directly shaped strategic storage decisions across the entire AI pipeline. Working within the AWS Elastic Block Storage organization, she has developed deep expertise in the considerations essential for choosing storage offerings for data-intensive AI applications. As a member of the judging panel for the prestigious Globee Awards for Technology 2025, Megha evaluates groundbreaking innovations in cloud computing, artificial intelligence, and enterprise technology solutions. Her experience spans the entire lifecycle, from data preparation through training, tuning, and inference, with a particular focus on Kubernetes-based autoscaling techniques for containerized environments.

In our interview, Megha reveals the three critical stages where storage decisions can make or break AI model performance, explains why GPU utilization drops substantially with poor storage choices, and shares a systematic approach to selecting storage solutions that organizations can implement regardless of their scale or infrastructure complexity.

Megha, working at AWS Elastic Block Storage, you’ve seen how storage decisions directly impact GPU efficiency. Can you share a specific case where poor storage architecture was costing a company significant performance?

I worked with a financial services company that was spending $2 million annually on GPU infrastructure for fraud detection models but getting terrible training times. Their models took 18 hours to train when they should have completed in 6 hours. The issue was that they were using traditional NAS storage designed for file sharing rather than for high-throughput AI workloads. We migrated them to NVMe-backed storage with optimized data pipelines, and their GPU utilization immediately jumped from 35% to 87%. The same training jobs now complete in under 7 hours, which means they can iterate three times faster and deploy updated fraud models much more frequently.

Your expertise covers both traditional storage and Kubernetes autoscaling. When a Kubernetes cluster suddenly scales from 10 to 100 GPU nodes, what happens to the storage that most people don’t anticipate?

The storage system often becomes the limiting factor, not the compute resources. Most organizations plan for compute scaling but treat storage as static infrastructure. When you suddenly have 100 containers requesting training data simultaneously, traditional storage systems create a massive bottleneck. I’ve developed techniques that enable storage performance to scale linearly with the number of containers. The key is implementing storage policies that understand Kubernetes’ dynamic nature; you need storage that can automatically provision bandwidth and handle sudden spikes in concurrent access without requiring manual intervention.
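To make that concrete, the following is a minimal sketch, using the Python Kubernetes client, of the kind of storage policy Megha describes: a StorageClass backed by the AWS EBS CSI driver that provisions a dedicated gp3 volume with explicit IOPS and throughput for each pod and binds it only when the pod is scheduled, so aggregate bandwidth grows with the number of containers. The class name and performance figures are illustrative assumptions, not a prescription.

```python
# Illustrative sketch: a StorageClass for bursty, container-driven AI workloads.
# Assumes a cluster running the AWS EBS CSI driver; names and numbers are examples.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

storage_class = client.V1StorageClass(
    metadata=client.V1ObjectMeta(name="ai-training-gp3"),
    provisioner="ebs.csi.aws.com",
    parameters={
        "type": "gp3",
        "iops": "16000",       # provisioned IOPS per volume
        "throughput": "1000",  # MiB/s per volume
    },
    # Delay binding until a pod is scheduled, so each new container gets
    # its own freshly provisioned volume instead of contending for one.
    volume_binding_mode="WaitForFirstConsumer",
    allow_volume_expansion=True,
)

client.StorageV1Api().create_storage_class(body=storage_class)
```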

You work specifically with AWS Elastic Block Storage. What’s a common misconception organizations have about storage performance in AI workloads?

Many believe that raw IOPS numbers tell the whole story, but AI workloads have unique access patterns that make traditional storage metrics misleading. For instance, during training, you may need sustained sequential reads for feeding data to GPUs, as well as rapid random writes for checkpointing. A storage system optimized for database transactions might show excellent IOPS but perform poorly for AI training. The misconception is that expensive storage automatically equals better AI performance; it’s really about matching storage characteristics to your specific AI pipeline requirements.
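A rough back-of-the-envelope calculation shows why an impressive IOPS rating can still starve GPUs; all of the figures below are illustrative assumptions rather than numbers from the interview.

```python
# Illustrative arithmetic only: a volume rated for small random I/O can still
# fall far short of the sequential throughput needed to keep GPUs fed.
iops = 20_000                 # advertised 4 KiB random IOPS (assumed)
block_kib = 4
random_read_mib_s = iops * block_kib / 1024            # ~78 MiB/s at 4 KiB

gpus = 8
per_gpu_feed_mib_s = 500                               # assumed data-loading rate per GPU
required_sequential_mib_s = gpus * per_gpu_feed_mib_s  # 4,000 MiB/s of sequential reads

print(f"4 KiB random-read ceiling: {random_read_mib_s:.0f} MiB/s")
print(f"needed to feed {gpus} GPUs: {required_sequential_mib_s} MiB/s")
```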

Training deep learning models requires frequent checkpointing to protect against failures. How should organizations balance checkpoint frequency with storage performance to optimize both?

This is where I see organizations waste the most money. They either checkpoint too frequently, which kills GPU performance, or too rarely, risking the loss of days of training progress. The optimal strategy depends on your storage’s write performance characteristics. With high-performance NVMe storage, you can checkpoint every few epochs without significant impact. But with slower storage, you need to be more strategic. I recommend implementing adaptive checkpointing that adjusts frequency based on real-time storage performance metrics and training progress. The goal is to protect your training investment without creating performance bottlenecks.
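One way to sketch that adaptive idea in Python is below: the checkpoint interval widens when writes blow past an I/O budget and narrows again when storage keeps up. The class, the 5% budget, and the save_fn hook are illustrative assumptions, not a specific AWS feature.

```python
import time

class AdaptiveCheckpointer:
    """Widen or narrow the checkpoint interval so checkpoint writes stay within
    a target fraction of training wall time. Illustrative sketch only."""

    def __init__(self, save_fn, io_budget=0.05, min_interval=1, max_interval=16):
        self.save_fn = save_fn            # e.g. lambda epoch: torch.save(state, path)
        self.io_budget = io_budget        # allowed fraction of wall time spent checkpointing
        self.min_interval = min_interval  # lower bound on epochs between checkpoints
        self.max_interval = max_interval  # upper bound, caps potential lost work
        self.interval = min_interval

    def maybe_checkpoint(self, epoch, epoch_seconds):
        """Call once per epoch with that epoch's wall-clock duration in seconds."""
        if epoch % self.interval != 0:
            return
        start = time.monotonic()
        self.save_fn(epoch)
        write_seconds = time.monotonic() - start

        # Compare the measured write cost against the budget for the epochs it
        # covers: slow storage -> checkpoint less often; fast storage -> more often.
        budget_seconds = self.io_budget * self.interval * epoch_seconds
        if write_seconds > budget_seconds:
            self.interval = min(self.interval * 2, self.max_interval)
        else:
            self.interval = max(self.interval // 2, self.min_interval)
```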

For inference workloads that might spin up hundreds of containers simultaneously, what storage architecture prevents performance degradation?

Inference creates entirely different challenges than training because you’re dealing with massive concurrent access to relatively small model files rather than streaming large datasets. The key is implementing read-optimized storage with intelligent caching layers. I recommend architectures that can quickly replicate model artifacts across multiple storage endpoints and use container-aware caching that understands the Kubernetes pod lifecycle. When containers start up, they should find model data already cached locally rather than competing for network bandwidth to download the same models repeatedly.
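The sketch below illustrates the container-aware caching pattern she describes, assuming pods on a node share a local cache directory (for example, a hostPath mount). The MODEL_CACHE_DIR variable, bucket, and object key are hypothetical; the file lock is what stops hundreds of cold-starting containers from pulling the same artifact over the network at once.

```python
import os
from pathlib import Path

import boto3                    # AWS SDK for Python
from filelock import FileLock   # third-party: pip install filelock

# Assumed node-local cache directory shared by all pods on the node.
CACHE_DIR = Path(os.environ.get("MODEL_CACHE_DIR", "/model-cache"))

def fetch_model(bucket: str, key: str) -> Path:
    """Return a local path to the model artifact, downloading it from S3 only
    if no container on this node has cached it yet."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    local_path = CACHE_DIR / key.replace("/", "_")
    # Only one container per node performs the download; the others block
    # briefly on the lock and then find the file already cached.
    with FileLock(str(local_path) + ".lock"):
        if not local_path.exists():
            boto3.client("s3").download_file(bucket, key, str(local_path))
    return local_path

# At container startup (names are hypothetical):
# model_path = fetch_model("my-model-artifacts", "fraud-detector/v3/model.pt")
```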

What’s the biggest mistake you see organizations make when selecting storage for their first serious AI project?

Choosing storage based on cost per gigabyte without considering the total cost of ownership in an AI context. I’ve seen companies select the cheapest storage option and then discover their training jobs take three times longer to complete, making the entire project much more expensive. They optimize for storage cost but overlook the fact that their GPU resources, which cost far more than storage, are sitting idle, waiting for data. The smart approach is to calculate the total cost, including GPU time, developer productivity, and time-to-market, not just storage unit costs.
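A simple back-of-the-envelope comparison makes the point; every figure below (GPU rate, fleet size, job volume, storage bills) is an illustrative assumption, with the 18-hour and 7-hour training times borrowed from the fraud-detection example earlier in the interview.

```python
def total_monthly_cost(storage_per_month, gpu_per_hour, gpus, hours_per_job, jobs_per_month):
    """Total cost of ownership: the storage bill plus the GPU hours burned
    while training jobs run. All inputs are illustrative."""
    return storage_per_month + gpu_per_hour * gpus * hours_per_job * jobs_per_month

# "Cheap" storage: low bill, but jobs crawl along for 18 hours.
cheap = total_monthly_cost(storage_per_month=2_000, gpu_per_hour=32,
                           gpus=8, hours_per_job=18, jobs_per_month=20)

# Faster NVMe-backed storage: higher bill, but the same jobs finish in 7 hours.
fast = total_monthly_cost(storage_per_month=10_000, gpu_per_hour=32,
                          gpus=8, hours_per_job=7, jobs_per_month=20)

print(f"cheap storage: ${cheap:,.0f}/month, fast storage: ${fast:,.0f}/month")
# The "cheap" option ends up costing roughly twice as much once idle GPU time is counted.
```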

Looking ahead, how do you expect storage requirements to evolve as models grow from gigabytes to terabytes in size?

We’re already seeing models that challenge current storage architectures in fundamental ways. When a single model checkpoint is larger than most organizations’ entire storage systems were designed to handle, you need completely different approaches. The future requires storage systems designed specifically for AI workloads, which I refer to as “AI-native storage.” This type of storage understands model architectures, optimizes data placement dynamically, and integrates seamlessly with AI development workflows. Edge AI deployment will also demand storage solutions that can synchronize massive models across distributed environments while maintaining performance consistency.
