Abhishek Vajarekar’s journey in software engineering reflects his passion for building scalable, resilient systems. A B.E. (Hons.) Computer Science graduate from India’s prestigious BITS Pilani in 2015, he has worked with global tech leaders like Walmart Labs and Amazon. Along the way, Abhishek honed his expertise, transitioning from engineer to Software Development Manager, now leading critical projects at Prime Video.
In this interview, he delves into the challenges of designing distributed systems, covering fault tolerance, scalability, observability, chaos testing, and real-time monitoring. Abhishek also shares how he mentors teams to balance scalability and resilience while exploring emerging technologies like serverless computing, edge computing, and AI’s role in high-performance systems.
Join us as Abhishek shares his insights into distributed systems, fostering engineering excellence, and shaping the future of scalable software architectures.
In your experience, what are the key challenges in designing resilient distributed systems, and how do you approach overcoming them?
In my experience with managing distributed systems, the key challenges often revolve around fault tolerance, scalability, bottlenecks, time-to-market, and observability.
- Fault Tolerance: Failures are inevitable in distributed systems. In addition to implementing appropriate testing at each step of the development cycle, I also prepare for scenarios when failures do occur in production. For instance, I use circuit breakers to stop the propagation of failures and ensure that if one service fails, it doesn’t cascade into others. Another key strategy is failover mechanisms, where redundant systems or backup instances are pre-configured to take over automatically in the event of a failure. I also make use of automatic retries, fallback defaults, and rate limiting. If a dependent service becomes unresponsive, the system can fall back to default responses, retry failed calls, or slow down request rates to prevent overload (a minimal sketch of this pattern follows after this list).
- Scalability: Scaling a system to handle rapid growth or traffic spikes is one of the biggest design challenges I face. To keep up with demand, I prioritize horizontal scalability. This allows us to spin up additional instances of stateless microservices as traffic increases. I also use elastic auto-scaling to automatically adjust system capacity as traffic fluctuates, so we’re not paying for unused resources during off-peak times (a scaling-policy sketch follows after this list).
- Bottlenecks: Bottlenecks are one of those “invisible” problems that seem fine until, suddenly, they’re not. They can show up in databases, message queues, or compute-heavy services, and they often go unnoticed until they impact performance. That’s why I focus on continuous performance monitoring. I analyze metrics like request latency, queue depths, and throughput to pinpoint which part of the request flow is consuming the most time.
- Reducing Time-to-Market: In modern distributed systems, speed to market is everything. The faster you can get a new feature into users’ hands, the stronger your competitive edge. To make that happen, I prefer a combination of automation, CI/CD, and Domain-Driven Design (DDD). With CI/CD pipelines, we can push out small, incremental updates with minimal risk. This is crucial because smaller changes are easier to test, debug, and roll back if something goes wrong. With DDD, microservices are structured around independent domains, which eliminates coordination overhead across teams.
- Observability: You can’t fix what you can’t see. That’s why I prioritize observability from the very beginning. Distributed systems consist of multiple independent services, each generating its own logs, traces, and metrics. Without proper observability, troubleshooting becomes guesswork. To avoid this, I focus on structured logging for faster debugging, distributed tracing for visibility across services, and centralized dashboards of key metrics to detect and analyze issues in real time.
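To make the fault-tolerance ideas above concrete, here is a minimal Python sketch of a circuit breaker combined with retries and a fallback default. It is an illustrative example rather than a production implementation; the `fetch_recommendations` dependency, thresholds, and backoff values are hypothetical.

```python
import time
import random

class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures,
    then allows a trial call once the reset timeout has elapsed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call after the cool-down period.
        return time.time() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failure_count = 0
        self.opened_at = None

    def record_failure(self):
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.opened_at = time.time()


def call_with_resilience(breaker, remote_call, fallback, retries=2, backoff=0.2):
    """Retry a flaky dependency, trip the breaker on repeated failures,
    and fall back to a default response instead of cascading the error."""
    if not breaker.allow_request():
        return fallback  # Circuit is open: fail fast with the default.
    for attempt in range(retries + 1):
        try:
            result = remote_call()
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
            time.sleep(backoff * (2 ** attempt))  # Exponential backoff between retries.
    return fallback


# Hypothetical downstream call that fails intermittently.
def fetch_recommendations():
    if random.random() < 0.5:
        raise ConnectionError("dependency timed out")
    return ["title-1", "title-2"]


breaker = CircuitBreaker()
print(call_with_resilience(breaker, fetch_recommendations, fallback=["default-title"]))
```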
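For elastic auto-scaling, the sketch below shows one way to attach a target-tracking policy to a stateless service using AWS Application Auto Scaling via boto3. The cluster and service names, capacity limits, and CPU target are placeholder values, not a description of any specific production setup.

```python
import boto3

# One way to set up elastic auto-scaling for a stateless ECS service using
# AWS Application Auto Scaling; the cluster and service names are placeholders.
autoscaling = boto3.client("application-autoscaling")

resource_id = "service/example-cluster/example-service"  # hypothetical ECS service

# Register the service's desired task count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=4,
    MaxCapacity=100,
)

# Target-tracking policy: add or remove tasks to hold average CPU near 60%.
autoscaling.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```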
How have you applied chaos testing principles to uncover vulnerabilities in a distributed system, and what metrics do you prioritize to measure its effectiveness?
The goal of applying chaos testing principles to distributed systems is not just to identify failures but to understand how the system behaves under stress and ensure it can recover gracefully. My approach to chaos testing can be summarized as controlled experimentation, fault injection, and continuous learning.
To begin with, I identify possible failure scenarios and rank them by likelihood. Examples include unavailability of an external dependency, unexpected traffic spikes, or resource exhaustion like CPU or memory constraints. Then, instead of waiting for these issues to occur naturally, I deliberately inject failures into a non-production environment that resembles production as closely as possible. For example, I simulate CPU exhaustion, a network partition, or a service failure to observe how other services react and what customers experience.
One of the tools I use for this is AWS Fault Injection Simulator (FIS), which allows developers to simulate real-world failures. We monitor how services respond, whether retries are effective, and how fast the system recovers.
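As a rough illustration of how such an experiment might be triggered programmatically, the sketch below starts a pre-authored FIS experiment template with boto3 and checks its status. The template ID and tags are placeholders; real templates define their own targets, actions, and stop conditions.

```python
import boto3

# Kick off a pre-defined FIS experiment template (e.g., one that stresses CPU
# on a target group of instances) and poll its status.
fis = boto3.client("fis")

response = fis.start_experiment(
    experimentTemplateId="EXT1234567890abcdef",  # hypothetical template ID
    tags={"scenario": "cpu-exhaustion", "environment": "pre-prod"},
)
experiment_id = response["experiment"]["id"]

# Check on the experiment; in practice we watch dashboards and alarms while it runs.
status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
print(f"Experiment {experiment_id} is {status}")
```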
The key metrics I observe to evaluate the effectiveness of chaos testing include:
- Availability (Uptime Percentage): Was the system still available to end users during the failure?
- Error Rate: Did user-facing errors increase during the failure scenario?
- Service Degradation: Did any services experience slowdowns or partial failures during the test?
- Alert Accuracy: Did our monitoring and alerting tools detect the failure quickly, and did alerts trigger for the right issues? (A sample alarm definition follows after this answer.)
- Mean Time to Recovery (MTTR): How quickly did the system detect and recover from the failure?
Based on what we learn from the system’s response, I also make sure the testing is actionable. If the test reveals weaknesses in failover mechanisms, retries, or fallback defaults, I work with the team to prioritize fixes. Chaos testing is not a one-time effort; it’s a continuous improvement loop, and each iteration makes our services more resilient.
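For reference, the kind of alert whose accuracy we verify during chaos experiments might look something like this sketch, which defines a CloudWatch alarm on a service’s 5xx error count with boto3. The namespace, metric, dimension, and threshold values are hypothetical and would be tuned to the service under test.

```python
import boto3

# A sample CloudWatch alarm on a service's 5xx error count. Namespace, metric,
# and dimension names are placeholders and would match your service's metrics.
cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="example-service-high-error-rate",
    Namespace="ExampleService",             # hypothetical custom namespace
    MetricName="5xxErrors",
    Dimensions=[{"Name": "Service", "Value": "access-evaluator"}],
    Statistic="Sum",
    Period=60,                              # evaluate one-minute windows
    EvaluationPeriods=3,                    # require 3 consecutive breaches
    Threshold=50,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmDescription="Fires when 5xx errors exceed 50/min for 3 minutes.",
)
```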
What are some critical metrics you consider when designing systems to handle massive workloads while maintaining high availability, and can you share a practical example?
When designing systems for massive workloads with high availability, I focus on metrics that provide visibility into system performance and user experience. These metrics can be broadly categorized into performance, reliability, and resource utilization.
- Performance Metrics:
- Latency: Measures the time it takes to process a request. For high-workload systems, latency should remain consistent, with low variance, during traffic spikes (a short percentile calculation follows after this list).
- Throughput: Tracks the number of requests or transactions processed per second. This helps establish a baseline for scalability and per-host capacity.
- Reliability Metrics:
- Error Rate: Monitors the percentage of failed requests. A low error rate indicates reliability, even under high load.
- Availability (Uptime Percentage): Ensures services remain accessible to users.
- Mean Time to Recovery (MTTR): Measures the time to identify, respond to, and recover from failures.
- Resource Utilization Metrics:
- CPU and Memory Utilization: Monitors stress levels on servers.
- Disk I/O and Network Bandwidth: Verifies the system is scaled to handle both incoming and outgoing traffic.
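As a simple illustration of how these numbers are derived, the sketch below computes p50/p99 latency, error rate, and throughput from a synthetic batch of request records; production systems compute the same quantities from their metrics pipelines.

```python
import random
from statistics import quantiles

# Toy computation of the performance and reliability metrics above from a
# synthetic batch of request records.
requests = [
    {"latency_ms": random.lognormvariate(3.0, 0.6), "ok": random.random() > 0.002}
    for _ in range(100_000)
]

latencies = sorted(r["latency_ms"] for r in requests)
p50 = quantiles(latencies, n=100)[49]   # 50th percentile
p99 = quantiles(latencies, n=100)[98]   # 99th percentile

error_rate = 100.0 * sum(not r["ok"] for r in requests) / len(requests)
window_seconds = 60                      # assume the batch covers one minute
throughput = len(requests) / window_seconds

print(f"p50={p50:.1f} ms  p99={p99:.1f} ms  error_rate={error_rate:.3f}%  tps={throughput:.0f}")
```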
A practical example from my experience involved scaling a streaming service to handle spikes in traffic during the launch of new titles. The challenge was ensuring the system could process 2–3 million requests per second without degrading the user experience.
We began by conducting load tests to understand the system’s limits, simulating traffic beyond peak predictions. These tests revealed high query latency in the database during heavy load, which affected availability. To address this:
- Caching: Introduced in-memory caching (e.g., Redis) to reduce database load for frequently accessed data (a cache-aside sketch follows after this example).
- Read Replicas: Distributed workloads across multiple database nodes to avoid bottlenecks.
- Horizontal Scaling: Made microservices stateless to enable automatic scaling during peak traffic.
By closely monitoring latency, throughput, error rates, and resource utilization throughout the process, we ensured the system could handle traffic spikes of three times the predicted peak while maintaining p99 latency within 10%.
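The cache-aside pattern referenced above can be sketched roughly as follows; the Redis connection details, key layout, TTL, and the read-replica query are placeholders for illustration, not the platform’s actual implementation.

```python
import json
import redis  # assumes the redis-py client is installed

# Minimal cache-aside read path: check Redis first, fall back to a
# (hypothetical) read replica on a miss, then populate the cache with a TTL
# so hot titles stay in memory during a launch spike.
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

CACHE_TTL_SECONDS = 300  # five minutes; tuned per data type in practice


def load_title_metadata_from_replica(title_id):
    """Placeholder for a query against a database read replica."""
    return {"title_id": title_id, "name": "Example Title", "rating": "PG-13"}


def get_title_metadata(title_id):
    key = f"title:{title_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit: no database work
    value = load_title_metadata_from_replica(title_id)
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(value))  # populate with TTL
    return value


print(get_title_metadata("tt0111161"))
```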
How do you leverage real-time monitoring strategies to maintain the performance and health of distributed systems, and can you share an example of handling unexpected system degradation?
Real-time monitoring is essential for maintaining distributed systems’ performance and health. I rely on strategies such as structured logging, distributed tracing, and real-time metrics dashboards to proactively address anomalies.
- Structured Logging: Ensures every service logs meaningful data that can be queried for debugging. This simplifies report generation and troubleshooting (a minimal example follows after this list).
- Distributed Tracing: Tracks request lifecycles across services to identify bottlenecks or failure points. This is especially useful for identifying issues caused by interactions between services.
- Metrics Dashboards and Alerts: Aggregate and monitor KPIs such as latency, error rates, CPU/memory usage, and cache hit ratios. Alerts are configured to notify developers of anomalies immediately.
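A minimal structured-logging setup, using only Python’s standard library, might look like the sketch below; the service name and log fields are hypothetical, and teams typically rely on their logging framework of choice for this.

```python
import json
import logging
import time
import uuid

# Each log line is a single JSON object, so it can be indexed and queried by
# fields such as request_id, service, or latency_ms instead of grepping text.

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": "access-evaluator",   # hypothetical service name
            "message": record.getMessage(),
        }
        # Merge any structured fields passed via the `extra` argument.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("access-evaluator")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "evaluated customer access",
    extra={"fields": {"request_id": str(uuid.uuid4()), "latency_ms": 42, "cache_hit": False}},
)
```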
For example, in one instance, a customer-access evaluator service for a streaming platform experienced sudden performance degradation. Real-time monitoring detected increased latency, and dependency latency metrics isolated the issue to the distributed cache layer. The cache showed a sharp increase in miss rates and elevated response times for read-heavy services.
To resolve the issue:
- Read Replicas: Added to distribute the read load across multiple instances.
- Cache Autoscaling: Adjusted policies to respond more dynamically to sudden traffic spikes.
- Cache TTL Optimization: Ensured frequently accessed data remained cached longer to reduce unnecessary recomputation (a short TTL sketch follows below).
Continuous monitoring validated the fixes, with response times and cache hit rates returning to acceptable levels.
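A rough sketch of the TTL adjustment described above is shown below; the base TTL, jitter window, and key names are illustrative assumptions rather than the platform’s actual configuration.

```python
import json
import random
import redis  # assumes the redis-py client is installed

# Hot keys get a longer base TTL, and a small random jitter keeps many keys
# from expiring at the same moment, which would otherwise send a burst of
# misses to the backing store at once.
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

BASE_TTL_SECONDS = 1800      # raised from a shorter value for frequently read data
JITTER_SECONDS = 120


def cache_entitlement(customer_id, entitlement):
    ttl = BASE_TTL_SECONDS + random.randint(0, JITTER_SECONDS)
    cache.setex(f"entitlement:{customer_id}", ttl, json.dumps(entitlement))


cache_entitlement("customer-123", {"plan": "premium", "regions": ["US", "IN"]})
```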
As a Software Development Manager, how do you ensure your team is aligned on best practices for building scalable and robust services, and how do you mentor engineers on balancing scalability and resilience?
As a Software Development Manager, my approach to aligning the team on best practices involves fostering a shared understanding, promoting ownership, and offering actionable guidance.
To ensure alignment, I focus on:
- Guidelines and Documentation: Providing clear documentation outlining best practices, including architectural patterns, coding standards, and principles like graceful degradation and failure isolation.
- Knowledge Sharing: Hosting regular architecture reviews and post-mortem analyses to reflect on system behaviors and share lessons learned.
- Code Reviews: Encouraging peer reviews to ensure proposed changes align with scalability and resilience standards.
Mentoring engineers to balance scalability and resilience involves:
- Scenario-Based Learning: Analyzing real-world challenges, such as handling traffic spikes or simulating outages, to develop solutions as a team. This hands-on approach helps engineers understand trade-offs.
- Encouraging Ownership: Empowering engineers to take responsibility for specific components, motivating them to proactively address scalability and resilience issues.
- Framing Complementary Goals: Teaching engineers to view scalability and resilience as interconnected. For example, while discussing horizontal scaling, I emphasize resource optimization and fallback mechanisms to handle failures effectively.
By promoting a strong engineering culture with open communication and a safe space for experimentation, I ensure the team is prepared to build scalable, resilient services.
What emerging trends or technologies in distributed systems and high-performance computing excite you the most, and how do you see AI or machine learning influencing resilience and performance?
Emerging trends in distributed systems and high-performance computing that excite me include serverless computing, edge computing, and chaos engineering.
- Serverless Computing: Abstracts infrastructure management, enabling teams to focus on application logic. With automatic scaling and pay-as-you-go pricing, serverless is ideal for event-driven systems requiring dynamic scaling.
- Edge Computing: Reduces latency and improves performance by processing data closer to the user. This is critical for real-time applications like IoT and video streaming.
- Chaos Engineering: Helps identify and address weaknesses by simulating real-world failure scenarios, enhancing system resilience.
AI and ML are revolutionizing distributed systems:
- Predictive Maintenance: AI analyzes historical data to predict potential failures or resource exhaustion, enabling proactive interventions.
- Intelligent Scaling: Dynamic scaling and load balancing use real-time data to allocate resources efficiently.
- Root Cause Analysis: AI accelerates troubleshooting by correlating logs, metrics, and other signals, reducing mean time to recovery (MTTR).
- Resource Optimization: AI analyzes usage patterns to recommend configurations that balance cost and performance.
In the future, AI-driven recommendations may be acted upon with minimal developer intervention, further improving resilience and operational efficiency. However, human validation remains essential for complex issues to ensure accuracy and reliability.