Consider building a framework that must, first, work within the constraints of a language model; second, handle millions of insights and queries while tailoring the model to business needs around the globe; and lastly, fold the broader ambitions of AI into the picture. Behind AI, as behind any technology, sits an underlying infrastructure that carries its complexity. Every large language model (LLM) rests on such ML infrastructure, designed and optimised by engineering teams to meet demands of many forms: reliability, scalability and cost-effectiveness. Today we have the rare chance to understand this complicated and pervasive ML infrastructure by speaking with a professional who builds such frameworks in industry – Duraikrishna Selvaraju.
Q: Since your work included building ML infrastructure for LLMs, what requirements did you target at the outset of the project?
One of the main targets was an infrastructure that performs well and can scale as our needs grow. LLMs only work with very large compute resources and move data in large quantities, so scalability had to be built in from the start. Scalability also brings performance and cost into play: training an LLM, and serving it, can be extremely expensive, so optimising resource usage and performance is key.
Flexibility was another priority. AI is moving at a fast pace, so the infrastructure has to accommodate the tools of the moment and evolve with new innovations. It was important to understand and own every piece of the hardware and software stack. This allows us to scale efficiently and adapt to our needs, whether that means choosing a technology or configuring a workflow.
Q: With a wide range of technology at your disposal, how did you decide what to use for this infrastructure? What caught your attention?
Given that the costs of training and deploying AI are high, we analysed, learnt and sometimes even rewrote the entire tech stack to make careful choices about how to enable AI computation. We learnt that choosing the right tools is a crucial process because it needs acceptance from all stakeholders: data scientists, data engineers, product users and MLOps engineers.
We focused on open source software deployable on Kubernetes for the AI platform used to train and deploy models, as that allowed us to ship faster and to own the deployed code. There can be unforeseen consequences to such choices, though: our initial choice for the AI training workflow cluster was an open source solution that did its job but had an ecosystem problem. Its components were hard for data scientists to operate and were unstable due to lack of maintenance, so we had to adopt another. We learnt to prototype and evaluate solutions by letting every stakeholder operate and accept them before committing. Although this process is time-consuming, it pays off in the long run.
Q: What specific optimisations or strategies did you implement to keep costs in check while enhancing performance?
One of the most effective strategies was implementing data parallelism, where data is split across multiple processors, allowing us to accelerate training without adding significant costs. This reduced overall training time while maintaining the model’s performance.
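For illustration, here is a minimal sketch of how data parallelism might look with PyTorch's DistributedDataParallel; the toy model, synthetic data and hyperparameters are placeholders rather than the team's actual setup.

```python
# Minimal sketch of data-parallel training with PyTorch DistributedDataParallel.
# Launch with: torchrun --nproc_per_node=<num_gpus> train_ddp.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and synthetic data stand in for an LLM and its corpus.
    model = torch.nn.Linear(512, 512).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(10_000, 512), torch.randn(10_000, 512))
    # DistributedSampler gives each process a disjoint shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # gradients are all-reduced across workers here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Each worker sees only its shard of the data, so wall-clock training time drops roughly in proportion to the number of GPUs while the model itself is unchanged.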
Another important optimization was model quantization, where we reduced the precision of certain calculations. This allowed us to decrease model size and, consequently, storage and inference costs without compromising accuracy. Keep in mind that GPUs can also be partitioned to increase occupancy.
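As a rough illustration, the snippet below applies post-training dynamic quantization to a stand-in model with PyTorch; the interview does not specify which quantization scheme or framework was actually used.

```python
# Minimal sketch of post-training dynamic quantization with PyTorch.
import torch

# Stand-in model; a real LLM would have many more layers.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

# Convert Linear layers to int8 weights; activations are quantized on the fly.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
print(quantized(x).shape)  # same interface, smaller weights, cheaper inference
```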
Caching prediction results was another technique that provided significant benefits, especially for frequently accessed data and computations. By caching certain computations, we reduced the load on our core systems, delivering faster response times for users while reducing our dependence on expensive compute resources. We also prioritised pipeline optimization, eliminating redundant tasks in data processing and training, which streamlined operations and lowered expenses.
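Here is a minimal sketch of the caching idea, assuming a simple in-memory, time-limited cache keyed on the request; a production deployment might use a shared store such as Redis instead.

```python
# Minimal sketch of caching inference results for repeated requests.
import hashlib
import time
from typing import Any, Callable


class PredictionCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, Any]] = {}

    def _key(self, prompt: str) -> str:
        # Hash the input so arbitrarily long prompts map to a fixed-size key.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get_or_compute(self, prompt: str, predict_fn: Callable[[str], Any]) -> Any:
        key = self._key(prompt)
        entry = self._store.get(key)
        if entry is not None and time.time() - entry[0] < self.ttl:
            return entry[1]  # cache hit: skip the expensive model call
        result = predict_fn(prompt)  # cache miss: run the model
        self._store[key] = (time.time(), result)
        return result


# Usage: wrap an expensive model call so repeated prompts are served from memory.
cache = PredictionCache(ttl_seconds=600)
answer = cache.get_or_compute(
    "What is data parallelism?", lambda p: f"stub answer for: {p}"
)
```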
Q: Did you find that a cloud-based solution, on-premises infrastructure, or a hybrid setup was best suited for your needs? Why?
We opted for a hybrid approach, which combined the flexibility of cloud-based solutions with the control of on-premises infrastructure. Cloud services were incredibly valuable for scaling resources quickly, particularly during peak usage times, especially for workloads that involved CPU only or smaller GPUs. They also allowed us to expand into new regions effortlessly, ensuring low latency and fast response times for users across different geographies.
Because of the shortage of modern GPUs, however, cloud providers were not viable for those workloads: they could not offer committed availability. At the same time, we maintained specific on-premises resources for high-security workloads and to reduce long-term costs. Some data and models required tight control due to regulatory or compliance needs, which made on-premises solutions essential. The hybrid approach allowed us to balance scalability, cost and security effectively: the cloud component kept us agile, while on-premises infrastructure gave us stability and cost control over the long term.
Q: It wouldn’t be a stretch to assume that building such a system wasn’t smooth. What were some of the toughest challenges you ran into?
We struggled a lot with the trade-off between cost and computational need. We analysed GPU hardware costs for training LLMs and their operational cost in terms of energy, and determined that our committed GPU capacity needed to be bought on-premises. That led to us setting up the GPU machines in a data center, which was one of the toughest challenges. We realized we needed professional help to install the bare machines in the data center!
Finding the best-value configuration for running LLMs was a constant optimization effort, and designing and training them was also very resource intensive. We spent a lot of time searching for the right model configurations, deployment structures and hyperparameters, adjusting every aspect to utilise as much of the committed hardware as possible while not spending money needlessly on scaling up in the cloud.
Q: Deployed models should remain stable in production as a matter of principle. What steps do you take to monitor your models and ensure they work well even after deployment?
One of the most important components of our infrastructure is model monitoring. We implement it with Prometheus and Grafana, tracking performance characteristics such as latency, accuracy and memory load in real time. Thanks to this framework, we identify most problems in a timely fashion and gain some understanding of how and why certain models behave differently in different situations. Beyond the primary metrics, we have established alerting systems that call our attention to performance deviating from the prescribed standard, enabling timely action on possible concerns.
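As an illustration of this kind of instrumentation, the sketch below exports latency and error metrics with the prometheus_client library; the metric names and the stub predict() function are assumptions, and the Grafana dashboards and alert rules would be built on top of the scraped series rather than in this code.

```python
# Minimal sketch of exposing model-serving metrics to Prometheus.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "model_inference_latency_seconds", "Time spent serving one prediction"
)
REQUEST_ERRORS = Counter(
    "model_inference_errors_total", "Number of failed predictions"
)


def predict(prompt: str) -> str:
    time.sleep(random.uniform(0.01, 0.05))  # stand-in for real model inference
    return "stub response"


def handle_request(prompt: str) -> str:
    with REQUEST_LATENCY.time():  # records one latency observation
        try:
            return predict(prompt)
        except Exception:
            REQUEST_ERRORS.inc()  # alert rules can fire on this error rate
            raise


if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes metrics from :8000/metrics
    while True:
        handle_request("health check prompt")
```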
An important part of monitoring is the feedback loop we have incorporated. User interactions are collected and mined to make sure the model improves in the required direction. This data is used to refine existing models and to train new ones, so that performance keeps moving closer to what users actually need. In this manner we not only optimise performance but also keep up with the changing behaviours and needs of our users.
Q: With the continued growth of this AI journey, how do you see the ML infrastructure scaling or transforming to meet future expectations?
In the long term, a lot of AI is going to be driven by LLMs, which means scaling will involve working with more cloud providers while also adopting next-generation hardware to accommodate ever more complex models. One area we are exploring is meeting demand with alternative GPU providers, whether newer classes of processors or processors in comparatively lower demand on the market, to support fast-growing needs.
On top of that, we are also looking at model distillation to further reduce network size, where smaller models are trained to reproduce the behaviour of a single larger one. This lets us deploy smaller, resource-efficient models while retaining most of their effectiveness. In the end, we want an effective model development practice that stands the test of time while remaining energy-efficient and ever expandable.
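For readers unfamiliar with the technique, here is a minimal sketch of knowledge distillation in which a single small student model is trained to match a larger teacher's softened outputs; the architectures, temperature and loss weighting are illustrative assumptions, not the team's actual recipe.

```python
# Minimal sketch of knowledge distillation: a small "student" learns to match
# a larger, already-trained "teacher" model's softened output distribution.
import torch
import torch.nn.functional as F

teacher = torch.nn.Sequential(          # stand-in for a large trained model
    torch.nn.Linear(256, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 10)
)
student = torch.nn.Linear(256, 10)      # smaller model we actually deploy
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)

temperature, alpha = 2.0, 0.5           # softening and loss-mixing hyperparameters

for step in range(100):
    x = torch.randn(32, 256)                      # synthetic batch
    labels = torch.randint(0, 10, (32,))          # synthetic ground truth
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)

    # KL divergence between softened teacher and student distributions.
    distill_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard_loss = F.cross_entropy(student_logits, labels)

    loss = alpha * distill_loss + (1 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```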
Conclusion:
Building and maintaining LLM infrastructure is never static. It involves an intricate interplay of efficiency, economics and flexibility to create dependable, cutting-edge systems. With a prudent vision, realistic plans and optimisations built around scalability, we are not only fulfilling today’s requirements but also anticipating those of the not-so-distant future. As the technology landscape evolves, we are keen to bring our LLM capabilities to various sectors around the world, making AI easier to use and more effective. The journey illustrates both the opportunities in AI and the tremendous architecture required to support it.