
How Rajesh Kesavalalji’s Vision Fuels AI Software Innovation 


Rajesh Kesavalalji, a leading software expert, is working on a high-priority project focused on reducing the high energy costs of AI. He shared his vision for the industry’s future and the advancements that could drive innovation at an unprecedented pace.

What motivated you to pursue a career in software and AI?

Rajesh Kesavalalji: When I was growing up, there was a huge wave of interest in computer science in India, and I had access to computers at school. My first motivation came from the fact that I could play around with computers. What really piqued my interest was learning algorithms.

I majored in electronics, and my college project focused on smart homes, where we controlled household appliances by phone. It was one of the early steps toward home automation, and it was exciting to work on something that could eventually be used in warehouses or by businesses.

Can you tell us about your current work and the high-priority AI infrastructure project you’re involved in?

I’m focused on AI and Cloud infrastructure — specifically, redesigning and optimizing data centers to handle AI workloads more efficiently. The project involves managing GPUs (Graphics Processing Units), specialized processors that accelerate AI model training.

A major part of this is optimizing energy use and ensuring that the cost and time of running these workloads are minimized. If there’s an issue, like a GPU failure, we need to spot it quickly, reroute tasks to healthy machines, and ensure that no valuable processing time is lost.
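As a rough illustration of that rerouting step (a sketch, not the team's actual tooling), cordoning a node with the official Kubernetes Python client tells the scheduler to place new work on healthy machines; the node name and cluster credentials here are assumptions.

```python
# Hypothetical sketch: mark a node with a failed GPU as unschedulable so
# replacement pods land on healthy machines. Assumes the official
# `kubernetes` Python client and credentials in ~/.kube/config.
from kubernetes import client, config


def cordon_unhealthy_node(node_name: str) -> None:
    """Cordon the node; the scheduler then routes new workloads elsewhere."""
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})


if __name__ == "__main__":
    # "gpu-node-042" is a placeholder name, not a real host.
    cordon_unhealthy_node("gpu-node-042")
```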

The bigger challenge in AI is not just using it but managing the infrastructure efficiently. AI models like large language models (LLMs) can cost millions of dollars to train and take months to develop. The focus of my work is on reducing those costs, improving time-to-market, and ensuring that systems stay up and running.

Could you walk us through some of the technologies you’re using to manage such a large-scale AI infrastructure?

A big part of my work is using open-source software. We rely on OpenTelemetry agents to collect data, and I use tools like Trino to query that data and Grafana for monitoring dashboards. Kubernetes is also crucial for orchestrating our infrastructure.
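To make that stack a little more concrete, here is a minimal, hypothetical example of querying collected telemetry through the `trino` Python client; the host, catalog, schema, and `gpu_metrics` table are invented for illustration.

```python
# Illustrative only: ask Trino for the hottest nodes over the last hour.
# Coordinator host, catalog, schema, and table are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # placeholder coordinator host
    port=8080,
    user="analytics",
    catalog="hive",
    schema="telemetry",
)
cur = conn.cursor()
cur.execute(
    "SELECT node_id, max(gpu_temp_c) AS peak_temp "
    "FROM gpu_metrics "
    "WHERE ts > current_timestamp - INTERVAL '1' HOUR "
    "GROUP BY node_id ORDER BY peak_temp DESC LIMIT 10"
)
for node_id, peak_temp in cur.fetchall():
    print(node_id, peak_temp)
```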

The key to managing the AI infrastructure is continuous monitoring. We use out-of-band signals to gather important data like temperature or power usage from the GPUs and nodes.
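A simplified sketch of what such polling looks like follows. True out-of-band collection normally goes through a baseboard management controller rather than the host OS, so this in-band analogue using `nvidia-smi` only shows the shape of the data being gathered, not the team's actual pipeline.

```python
# Simplified, in-band analogue of collecting GPU temperature and power draw.
# The real system described here uses out-of-band signals via the BMC.
import subprocess
import time


def sample_gpus() -> list[dict]:
    """Read per-GPU temperature (C) and power draw (W) via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,power.draw",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    samples = []
    for line in out.strip().splitlines():
        idx, temp, power = [v.strip() for v in line.split(",")]
        samples.append({"gpu": int(idx), "temp_c": float(temp), "power_w": float(power)})
    return samples


if __name__ == "__main__":
    while True:
        print(sample_gpus())
        time.sleep(30)  # polling interval is an arbitrary choice here
```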

I also lead the analytics division, focusing on out-of-band signals to identify potential issues before they cause problems. We use tools like Kafka for event tracking, and we rely on software like OpsGenie for alerting. With all these tools in place, we’re able to manage and monitor the health of thousands of machines at once.
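As one hedged sketch of how a health event might reach Kafka: the broker address, topic name, and temperature threshold below are all assumptions, and forwarding the alert to a tool such as OpsGenie would happen downstream of the topic.

```python
# Rough sketch: publish a health event to Kafka when a GPU reading crosses
# an illustrative threshold. Uses the kafka-python client.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.example.internal:9092",  # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

TEMP_LIMIT_C = 85  # illustrative threshold, not a vendor specification


def report(sample: dict) -> None:
    """Emit an event for any GPU running hotter than the limit."""
    if sample["temp_c"] > TEMP_LIMIT_C:
        producer.send("gpu-health-events", sample)
        producer.flush()


report({"node": "gpu-node-042", "gpu": 0, "temp_c": 91.0, "power_w": 412.5})
```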

The scale you’re working at must require a lot of specialized knowledge. What’s been the biggest learning curve for you and your team?

One of the unique aspects of this project is the cross-disciplinary nature of the work. For example, we need to know the optimal temperature for GPUs and the power usage of each node. This requires knowledge of both hardware and software. But, I believe the biggest challenge is still understanding how to manage the workload across this massive infrastructure efficiently.

How do you think the adoption of AI, particularly in infrastructure, will impact productivity and innovation in the future?

AI’s true potential lies in automation and optimization. For instance, in software development, AI could generate tests for engineers, saving them time. The productivity boost AI can offer is massive, but it’s not about replacing jobs. It’s about augmenting human capabilities and making us more efficient.

AI has the potential to generate test data and scripts automatically. Developers spend a lot of time writing unit tests, integration tests, and end-to-end tests. Imagine if AI could analyze the code and generate all the necessary tests for you. This would save a huge amount of time.
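To give a concrete sense of the kind of output he means, here is a small hand-written example in the style of tests a generator might propose: a happy path, boundary values, and an error case. The `parse_port` function and its cases are hypothetical.

```python
# A hypothetical function and the kind of pytest cases a generator might
# propose for it: happy path, boundaries, and error handling.
import pytest


def parse_port(value: str) -> int:
    port = int(value)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def test_parse_port_accepts_typical_value():
    assert parse_port("8080") == 8080


def test_parse_port_accepts_boundaries():
    assert parse_port("1") == 1
    assert parse_port("65535") == 65535


def test_parse_port_rejects_out_of_range():
    with pytest.raises(ValueError):
        parse_port("70000")
```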

What’s important here is the consistency of the tests. Right now, AI-generated tests can sometimes lack consistency, and that’s a concern. However, I believe we’ll see a shift toward a mix of static and dynamic test generation, which will ensure that the tests remain reliable and repeatable. This approach could drastically improve productivity in the software development lifecycle.
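One way to read that mix of static and dynamic generation, sketched under the assumption of pytest plus the Hypothesis library for property-based testing: fixed, hand-picked cases alongside generated ones made repeatable by derandomizing the search. The `parse_port` function is the same hypothetical one as above.

```python
# Sketch of "static + dynamic" testing: explicit parametrized cases plus
# property-based cases kept deterministic with Hypothesis' derandomize setting.
import pytest
from hypothesis import given, settings, strategies as st


def parse_port(value: str) -> int:
    port = int(value)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


# Static: explicit cases that never change between runs.
@pytest.mark.parametrize("raw,expected", [("1", 1), ("8080", 8080), ("65535", 65535)])
def test_parse_port_static_cases(raw, expected):
    assert parse_port(raw) == expected


# Dynamic but repeatable: generated inputs with a deterministic search.
@settings(derandomize=True)
@given(st.integers(min_value=1, max_value=65535))
def test_parse_port_roundtrip(port):
    assert parse_port(str(port)) == port
```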

What do you see as the biggest challenge for the AI industry?

The biggest challenge will be maintaining consistency and trust in AI systems. Updates to a model can sometimes lead to unexpected results, so we need to be careful about relying too heavily on external AI models.

If you’re using a vendor’s model and it gets updated, it could break your system or give you different results. That’s why I believe that companies should consider building their own models and using internal systems to maintain control over consistency.
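A minimal sketch of the consistency check this implies: pin an exact model version and compare its answers on a small "golden" set before promoting an update. The version identifier, golden cases, and `call_model` placeholder are all hypothetical, not any particular vendor's API.

```python
# Hedged sketch: pin a model version and run a golden-output regression
# check before switching to a newer version. Everything named here is a
# placeholder for whatever client and data the team actually uses.
PINNED_MODEL = "vendor-model-2024-06-01"  # hypothetical version identifier

GOLDEN_CASES = {
    "What is the capital of France?": "Paris",
}


def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real vendor or in-house inference call."""
    raise NotImplementedError


def passes_regression(candidate_model: str) -> bool:
    """Only promote a new model version if it matches the golden answers."""
    return all(
        call_model(candidate_model, prompt).strip() == expected
        for prompt, expected in GOLDEN_CASES.items()
    )
```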

Another challenge is the cost. As AI becomes more pervasive, it will require significant resources. That’s why optimization in infrastructure, like what I’m working on, is so important. We need to make AI more accessible by reducing its operational costs.
