In the fast-paced world of artificial intelligence, Aleksei Naumov has carved a niche as a trailblazer in neural network compression. As the Lead AI Research Engineer at Terra Quantum, Naumov’s innovative work is making AI more accessible, efficient, and secure. His recent paper, “TQCompressor: Improving Tensor Decomposition Methods in Neural Networks via Permutations,” presented at the esteemed IEEE MIPR 2024, is already being hailed as a landmark contribution to AI research.
Naumov’s pioneering methods address one of the field’s most pressing challenges: how to shrink massive large language models (LLMs) to fit within the constraints of mobile devices without compromising their performance. In this interview, he shares the technical hurdles, the transformative potential of on-device AI, and his vision for the future of AI applications in healthcare, security, and beyond.
What is your opinion on Meta’s release of the compressed Llama 3.2 models for smartphones? Does this mean that more AI developers can now use the method proposed by Meta to create compact models for mobile devices?
Aleksei: This is a very positive signal indicating a shift in the industry towards developing solutions based on on-device models. Until now, major companies and labs specializing in foundational models have primarily focused on creating large models designed for inference on GPU clusters. Meta, as a significant leader in the field, can set an example for other developers to pay attention to this direction.
On-device inference offers several substantial advantages:
• Enhanced data security: Processing data locally on the device minimizes risks of data leakage.
• Cost reduction: It lowers reliance on expensive cloud-based computational resources.
• New use cases: It unlocks applications that were previously inaccessible.
For instance, a private assistant for messaging—similar to an advanced T9—was previously impossible due to privacy restrictions and high costs. Now, such solutions become feasible. The range of new possibilities is immense and hard to fully evaluate.
From a technical perspective, Meta didn’t introduce groundbreaking innovations with this release. They applied established techniques like pruning, quantization, and knowledge distillation. Essentially, they took larger versions of their models, removed some parameters (pruning), reduced memory consumption (quantization), and fine-tuned the compressed model to replicate the behavior of the original (distillation).
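As a rough illustration of that recipe, here is a minimal PyTorch sketch combining magnitude pruning with a knowledge-distillation loss. The layer sizes, sparsity, and temperature are illustrative assumptions, not Meta’s actual pipeline:

```python
# Minimal sketch: magnitude pruning + knowledge distillation.
# All sizes and hyperparameters are illustrative, not Meta's pipeline.
import torch
import torch.nn.functional as F

def magnitude_prune_(linear: torch.nn.Linear, sparsity: float = 0.5) -> None:
    """Zero out the smallest-magnitude weights of a linear layer in place."""
    w = linear.weight.data
    k = int(sparsity * w.numel())
    threshold = w.abs().flatten().kthvalue(k).values
    w[w.abs() < threshold] = 0.0

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

# Toy usage: prune a copy of the "teacher" layer, then train the pruned
# "student" to replicate the teacher's outputs.
teacher = torch.nn.Linear(512, 512)
student = torch.nn.Linear(512, 512)
student.load_state_dict(teacher.state_dict())
magnitude_prune_(student, sparsity=0.5)

x = torch.randn(8, 512)
loss = distillation_loss(student(x), teacher(x).detach())
loss.backward()
```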
However, the method proposed by Meta unfortunately remains inaccessible to most developers. Compressing models as they did requires extensive retraining after compression, which can cost tens or even hundreds of thousands of dollars. This makes such techniques available only to large companies and well-funded labs.
Despite Meta’s release, creating compressed models remains accessible only to large companies. In your work on TQCompressor, you describe a method that reduces the time and cost of fine-tuning after compression by over 30x, which is an amazing result, democratizing the creation of compressed AI models! Could you elaborate on what challenges engineers might face when compressing large-scale models and how to address them?
Aleksei: The main challenge in compressing large-scale models lies in the significant computational resources required—not just for regular usage but especially for restoring their quality after compression. Once compressed, models need fine-tuning to regain their original performance.
For example, fully fine-tuning an LLM in half precision (16 bits) typically requires around 16GB of GPU memory per 1 billion parameters. This is far greater than the 2GB per 1 billion parameters required for inference, as fine-tuning demands additional memory for optimizer states, gradients, and other training state. With 8-bit optimizers, a 7B parameter model may still require up to 70GB of GPU VRAM. To put this into perspective, NVIDIA’s top-tier H100 GPU has only 80GB of VRAM and costs $30,000. This means developers would either need to invest in multiple such GPUs for faster processing or spend thousands of dollars renting cloud GPUs for weeks or even months to conduct compression experiments.
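A back-of-the-envelope calculation using the figures above makes the gap concrete: roughly 2 bytes per parameter for 16-bit inference, roughly 16 bytes per parameter for full 16-bit fine-tuning, and an assumed ~10 bytes per parameter with an 8-bit optimizer (consistent with the ~70GB quoted for a 7B model):

```python
# Back-of-the-envelope GPU memory estimates from the rule-of-thumb figures
# above: ~2 bytes/param (fp16 inference), ~16 bytes/param (full fp16
# fine-tuning), and an assumed ~10 bytes/param with an 8-bit optimizer.
def memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Rough memory footprint in GB for a given bytes-per-parameter budget."""
    return n_params * bytes_per_param / 1e9

for n_billion in (1, 7):
    n = n_billion * 1e9
    print(
        f"{n_billion}B params: "
        f"~{memory_gb(n, 2):.0f} GB inference (fp16), "
        f"~{memory_gb(n, 16):.0f} GB full fine-tuning, "
        f"~{memory_gb(n, 10):.0f} GB with an 8-bit optimizer"
    )
# 1B params: ~2 GB inference (fp16), ~16 GB full fine-tuning, ~10 GB with an 8-bit optimizer
# 7B params: ~14 GB inference (fp16), ~112 GB full fine-tuning, ~70 GB with an 8-bit optimizer
```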
While some methods, such as quantization, are less resource-intensive, others, like pruning or matrix decomposition, require extensive fine-tuning and experimentation. For large models, this process can easily cost tens or even hundreds of thousands of dollars, making it viable only for well-funded companies and specialized labs.
At Terra Quantum, my team is actively tackling this challenge. In our research paper, TQCompressor, we presented a novel method that reduces the fine-tuning time for compressed models by over 33 times, resulting in dramatic cost savings.
We achieved this by developing a novel approach that ensures the initial compressed model closely resembles the original full-scale version. As a result, far fewer resources are needed for fine-tuning to restore its performance.
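As a rough illustration of that principle (and not the actual permutation-based TQCompressor algorithm), a compressed linear layer can be initialized from a truncated SVD of the original weights, so that it already approximates the original mapping before any fine-tuning begins:

```python
# Illustrative only: initialize a factorized (compressed) layer from a
# truncated SVD of the original weights so it starts close to the full model.
# This is NOT the permutation-based TQCompressor algorithm itself.
import torch

def factorize_linear(layer: torch.nn.Linear, rank: int) -> torch.nn.Sequential:
    """Replace Linear(in, out) with Linear(in, rank) -> Linear(rank, out)."""
    w = layer.weight.data                      # shape (out_features, in_features)
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    u_r = u[:, :rank] * s[:rank]               # (out, rank), absorbs singular values
    v_r = vh[:rank, :]                         # (rank, in)

    first = torch.nn.Linear(layer.in_features, rank, bias=False)
    second = torch.nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data.copy_(v_r)
    second.weight.data.copy_(u_r)
    if layer.bias is not None:
        second.bias.data.copy_(layer.bias.data)
    return torch.nn.Sequential(first, second)

layer = torch.nn.Linear(1024, 1024)
compressed = factorize_linear(layer, rank=128)   # ~4x fewer weight parameters
x = torch.randn(4, 1024)
print((layer(x) - compressed(x)).abs().mean())   # reconstruction gap before any fine-tuning
```

The closer this starting point is to the original layer, the less fine-tuning is needed afterwards, which is where the cost savings come from.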
We are committed to further innovations in this area and aim to make these methods more accessible to a wider range of developers.
As one of the leading experts in compression and optimization of AI models, and specifically LLMs, what advice would you offer to companies aiming to develop AI solutions specifically optimized for mobile hardware?
Aleksei: I would advise paying more attention to tensor and matrix decomposition compression methods. My team at Terra Quantum is deeply engaged in this area.
Currently, most developers rely on methods like pruning and distillation, where model parameters are manually removed, followed by fine-tuning to restore quality. However, these approaches have significant drawbacks:
• They don’t guarantee that the compressed model will closely mimic the behavior of the original, often leading to high fine-tuning costs.
• In some cases, the model’s quality may degrade to a point where it becomes irrecoverable, wasting substantial time and resources.
Matrix decomposition methods, on the other hand, offer mathematical guarantees that the compressed model closely approximates the original. Additionally, these methods provide an automated approach to determining the compressed architecture, reducing time and costs for both experimentation and fine-tuning. This results in a more efficient and reliable process.
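To make that guarantee concrete, here is a small sketch (again illustrative, not the permutation-enhanced tensor decompositions used in TQCompressor itself): by the Eckart–Young theorem, the truncated SVD is the best low-rank approximation of a weight matrix, and the discarded singular values determine the error exactly, so the smallest rank meeting a target error can be chosen automatically:

```python
# Illustrative sketch of the guarantee matrix decomposition provides:
# the singular values tell us exactly how good a rank-r approximation is,
# so the compressed architecture (the rank) can be picked automatically.
import torch

def smallest_rank_for_error(w: torch.Tensor, max_rel_error: float) -> int:
    """Smallest rank r such that ||W - W_r||_F / ||W||_F <= max_rel_error."""
    s = torch.linalg.svdvals(w)
    total = s.pow(2).sum()
    # Residual energy left after keeping the top-r singular values, for each r.
    residual = (total - s.pow(2).cumsum(0)).clamp(min=0)
    rel_error = (residual / total).sqrt()
    return int((rel_error <= max_rel_error).nonzero()[0].item()) + 1

# Toy example: a matrix with a planted rank-32 structure plus small noise.
w = torch.randn(1024, 32) @ torch.randn(32, 1024) + 0.1 * torch.randn(1024, 1024)
print(smallest_rank_for_error(w, max_rel_error=0.05))  # typically close to the planted rank of 32
```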
It seems that the use of local, compressed models is the key to unlocking AI innovation in healthcare by overcoming critical barriers like privacy and security. Do you agree, and could you elaborate on how this approach could transform the industry?
Aleksei: Absolutely. In healthcare, where the stakes are uniquely high, privacy isn’t just a priority—it’s a foundational requirement. Sensitive patient data is protected by strict legal frameworks like HIPAA in the US or GDPR in Europe. These regulations make centralized data processing risky and often infeasible for AI solutions.
On-device AI models, or those deployed on the hardware of medical institutions, present a transformative opportunity. By processing data locally, these models ensure that private medical information never leaves the user’s smartphone or the institution’s secure environment. This drastically reduces the risk of data breaches while allowing AI to enhance patient care.
Such localized approaches are critical for enabling personalized medicine, where AI analyzes individual health data to deliver tailored diagnostics or treatment plans in real time. For instance, a smartphone-based AI could monitor chronic conditions, predict emergencies, or optimize medication adherence—all while keeping the data securely on the device.
Furthermore, compressed models make these innovations practical by reducing the computational resources required for deployment. This allows smaller medical facilities or underserved regions to benefit from cutting-edge AI without needing expensive infrastructure.
In essence, the future of healthcare innovation depends on overcoming the privacy and security barriers that currently limit AI adoption. Local, secure, and efficient compressed models may very well be the cornerstone of this transformation.
