About the author
The article was written by ML Engineer Aleksei Babenkov. With a rich background spanning over eight years in machine learning and deep learning, Aleksei has significantly influenced Russia’s data science landscape. At Avito, a major player in the Russian tech industry, he is playing a critical role in advancing the performance of ranking algorithms through cutting-edge research. Prior to this, Aleksei led an NLP research team at MaximaTelecom, where he was involved in projects that pushed the boundaries of natural language processing. Between these roles, Aleksei worked at Lia.Chat, transforming the chatbot into the leading tech support solution in the Russian market. Through rigorous research and development, he built a language model that rivaled the capabilities of the most advanced systems, like ChatGPT. Under his guidance, Lia.Chat dramatically improved its performance, quickly surpassing competitors like Yandex, Sber, and MTS, and was selected by major players like Sber and MTS for their marketplace platforms.
Introduction
Training large language models (LLMs) on the scale of GPT-3 and GPT-4 requires not only powerful compute resources but also effective optimization methods. This article explores modern optimization approaches that can enhance the stability and performance of such models. We will discuss the Adam, Lion, Shampoo, and Adan optimizers, and we will also consider the impact of batch size on training.
Adam Optimizer: The Baseline
Adam is one of the most popular optimizers for training neural networks. It combines the ideas of momentum and adaptive learning rates to create a more efficient and stable optimization process. Imagine you are hiking in rough terrain with constantly changing conditions. Momentum helps smooth out the path by considering the direction and speed of previous steps, preventing abrupt changes.
Adaptive learning rates, on the other hand, adjust your step size based on how steep or flat the terrain is, ensuring you don’t overshoot or get stuck in one place. Together, these components help Adam navigate the complex landscape of neural network training more effectively than simple gradient descent methods.
Adam Update Formulas
- First moment calculation (average of gradients): $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
- Second moment calculation (average of squared gradients): $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
- Bias correction for moments: $\hat{m}_t = m_t / (1 - \beta_1^t)$, $\hat{v}_t = v_t / (1 - \beta_2^t)$
- Weight update: $\theta_t = \theta_{t-1} - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
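Putting these formulas together, here is a minimal NumPy sketch of a single Adam step (variable names and hyperparameter defaults are illustrative, not tied to any particular library):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta given gradient g at step t (1-based)."""
    m = beta1 * m + (1 - beta1) * g                 # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * g**2              # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1**t)                      # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # weight update
    return theta, m, v
```

The state tensors m and v have the same shape as theta and start at zero, which is why the bias correction matters most during the first steps.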
Advantages and Disadvantages of Adam
While Adam is the default optimizer for many tasks, it’s not without its drawbacks. It can be unstable and diverge at large scales. The second moment grows whenever gradients are unstable, which shrinks the effective learning rate if gradients start to “jump” in different directions. However, this idea is not without merit: gradient instability indicates uncertainty about which direction to step, placing us in a “turbulent zone” where it’s better to move slowly to avoid going in the wrong direction and prevent the learning process from diverging completely.
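A toy numeric illustration of this effect (my own example, not from the Adam paper): if the gradient of a single weight keeps flipping sign, the first moment averages out toward zero while the second moment stays close to one, so the effective step becomes a small fraction of what a consistent gradient would produce.

```python
import numpy as np

beta1, beta2, eps, T = 0.9, 0.999, 1e-8, 100
m = v = 0.0
for t in range(1, T + 1):
    g = 1.0 if t % 2 else -1.0          # gradient that alternates +1, -1, +1, ...
    m = beta1 * m + (1 - beta1) * g     # first moment oscillates near zero
    v = beta2 * v + (1 - beta2) * g**2  # second moment stays near one
m_hat = m / (1 - beta1**T)
v_hat = v / (1 - beta2**T)
print(m_hat / (np.sqrt(v_hat) + eps))   # ~ -0.05: the step is heavily damped
```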
Lion Optimizer: Efficient and Memory-Saving
Lion’s Working Principles
Lion (EvoLved Sign Momentum) is a more efficient and memory-saving alternative to Adam. Lion simplifies the optimization process by taking constant-magnitude steps in the direction given by the sign of an interpolation between the current gradient and the exponential moving average (EMA) of past gradients. Think of it like steering a ship where, instead of making large, sudden turns, you gently adjust the direction based on the average of past currents. This not only smooths out the navigation but also reduces memory: Lion keeps a single momentum buffer, whereas Adam stores two moment estimates per parameter. By maintaining only this essential information, Lion manages to perform well without the extra computational burden, making it an attractive choice for large-scale models.
(Illustration source: https://arxiv.org/pdf/2302.06675)
Lion Update Formulas
- Interpolation of momentum and gradient: $c_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
- Weight update (sign step with decoupled weight decay): $\theta_t = \theta_{t-1} - \eta \left( \mathrm{sign}(c_t) + \lambda \theta_{t-1} \right)$
- Momentum update: $m_t = \beta_2 m_{t-1} + (1 - \beta_2) g_t$
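A minimal NumPy sketch of these three steps (hyperparameter defaults here are illustrative; in practice Lion is typically run with a smaller learning rate than Adam):

```python
import numpy as np

def lion_step(theta, g, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update: a constant-magnitude sign step plus decoupled weight decay."""
    c = beta1 * m + (1 - beta1) * g                 # interpolate momentum and current gradient
    theta = theta - lr * (np.sign(c) + wd * theta)  # sign step + weight decay
    m = beta2 * m + (1 - beta2) * g                 # update the single momentum buffer
    return theta, m
```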
Shampoo Optimizer: Quasi-Newton Methods
Shampoo’s Working Principles
Shampoo uses second-order information, in the form of preconditioning matrices built from gradient statistics that approximate curvature (rather than an explicit Hessian), to adapt the weight update steps, allowing faster convergence compared to first-order methods. Imagine training as climbing a mountain. First-order methods (like SGD, Adam) only use local slope information, like navigating with just a compass. Shampoo, on the other hand, uses a topographic map, enabling the selection of a more efficient path considering the curvature of the landscape. Shampoo maintains the tensor structure of the gradient and updates the preconditioning matrices separately for each dimension, enabling more efficient manipulation of large models.
Shampoo’s key innovation lies in its structure-aware preconditioning approach for stochastic optimization over tensor spaces. Instead of maintaining a full preconditioner matrix, which is often computationally prohibitive, Shampoo maintains a set of smaller preconditioning matrices, each corresponding to a single dimension of the tensor. This allows efficient manipulation and storage, while still leveraging the benefits of preconditioning to accelerate convergence.
Shampoo Update Formulas
- Update of the preconditioning matrices
For an order-$k$ gradient tensor $G_t$, Shampoo maintains a preconditioning matrix $H_t^{(i)}$ for each dimension $i = 1, \dots, k$. Each preconditioning matrix is updated using the gradients:
$H_t^{(i)} = H_{t-1}^{(i)} + G_t^{(i)} \big(G_t^{(i)}\big)^{\top}$
Here, $G_t^{(i)}$ represents the matricization of the gradient tensor along the $i$-th dimension.
- Preconditioned gradient
The gradient is preconditioned by contracting it with the preconditioning matrices across all dimensions:
$\tilde{G}_t = G_t \times_1 \big(H_t^{(1)}\big)^{-1/2k} \times_2 \cdots \times_k \big(H_t^{(k)}\big)^{-1/2k}$
The mode-wise tensor-matrix products commute with one another, which makes it possible to compute the result without explicitly forming a single huge preconditioner matrix.
- Weight update
Finally, the weights are updated using the preconditioned gradient:
$W_{t+1} = W_t - \eta\, \tilde{G}_t$
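For intuition, here is a minimal NumPy sketch of the common order-2 (matrix) case, where the per-dimension preconditioners reduce to a left matrix L and a right matrix R and the exponent becomes $-1/4$. The epsilon regularization and the eigendecomposition-based matrix power are implementation choices made for this sketch, not requirements from the paper:

```python
import numpy as np

def matrix_power_sym(A, p):
    """Power of a symmetric positive semi-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    vals = np.clip(vals, 1e-12, None)         # guard against tiny negative eigenvalues
    return (vecs * vals**p) @ vecs.T

def shampoo_step(W, G, L, R, lr=1e-3, eps=1e-4):
    """One Shampoo update for a 2-D weight matrix W with gradient G.

    L and R accumulate gradient statistics for the row and column dimensions.
    """
    L = L + G @ G.T                            # row-dimension statistics
    R = R + G.T @ G                            # column-dimension statistics
    L_prec = matrix_power_sym(L + eps * np.eye(L.shape[0]), -0.25)
    R_prec = matrix_power_sym(R + eps * np.eye(R.shape[0]), -0.25)
    W = W - lr * (L_prec @ G @ R_prec)         # preconditioned weight update
    return W, L, R
```

Practical implementations typically recompute the inverse matrix roots only every certain number of steps, which helps keep the per-step cost manageable.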
Advantages and Disadvantages of Shampoo
When used correctly, Shampoo can significantly accelerate training, but due to its complexity and computational heaviness, it should be used with caution. Despite its more complex update rule, Shampoo’s runtime per step is comparable to simpler methods like SGD, AdaGrad, and Adam, making it a practical choice for large-scale machine learning problems.
In theory, Shampoo looks promising, but in practice, it exhibits a high degree of instability that often requires additional tricks beyond the original paper to fix. Out of the box, it may not work well on a large scale, necessitating further adjustments and fine-tuning to achieve the desired performance.
(Illustration source: https://arxiv.org/pdf/1802.09568)
Adan Optimizer: Adaptive Nesterov Momentum
Adan’s Working Principles
Adan (Adaptive Nesterov Momentum Algorithm) incorporates Nesterov momentum to enhance the optimization process, similar to how a skilled cyclist anticipates and accelerates before reaching a hill to maintain momentum. Instead of just reacting to the gradient at the current point, Adan looks ahead to where the gradient is moving, making adjustments based on this forecast.
This foresight helps in smoothing the optimization path and speeds up convergence. By combining this with adaptive adjustments for the first and second moments of the gradient, Adan achieves better performance and stability, especially with large batch sizes, which is crucial for training extensive models.
Adan Update Formulas
- First moment calculation: $m_t = (1 - \beta_1) m_{t-1} + \beta_1 g_t$
- Gradient difference calculation: $v_t = (1 - \beta_2) v_{t-1} + \beta_2 (g_t - g_{t-1})$
- Second moment calculation: $n_t = (1 - \beta_3) n_{t-1} + \beta_3 \big[g_t + (1 - \beta_2)(g_t - g_{t-1})\big]^2$
- Learning rate scaling: $\eta_t = \eta / (\sqrt{n_t} + \epsilon)$
- Weight update: $\theta_{t+1} = (1 + \lambda \eta)^{-1} \big[\theta_t - \eta_t \odot (m_t + (1 - \beta_2) v_t)\big]$
- Optional restart condition: if the restart condition is triggered, the moment estimates $m_t$, $v_t$, $n_t$ are re-initialized from the current gradient and accumulation starts over from the current parameters.
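A minimal NumPy sketch of one Adan step following the formulas above (bias correction and the restart logic are omitted to keep the sketch short; the beta values follow the paper’s convention, where each coefficient is the weight on the new term):

```python
import numpy as np

def adan_step(theta, g, g_prev, m, v, n, lr=1e-3,
              beta1=0.02, beta2=0.08, beta3=0.01, eps=1e-8, wd=0.0):
    """One Adan update given the current and previous gradients."""
    diff = g - g_prev
    m = (1 - beta1) * m + beta1 * g                              # first moment
    v = (1 - beta2) * v + beta2 * diff                           # EMA of gradient differences
    n = (1 - beta3) * n + beta3 * (g + (1 - beta2) * diff) ** 2  # second moment
    eta_t = lr / (np.sqrt(n) + eps)                              # per-parameter step size
    theta = (theta - eta_t * (m + (1 - beta2) * v)) / (1 + wd * lr)
    return theta, m, v, n
```

The extra state compared to Adam is the previous gradient g_prev and the difference buffer v, which is the price paid for the Nesterov-style look-ahead.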
Advantages of Adan
Adan demonstrates better performance than Adam and other optimizers thanks to its improved design based on Nesterov acceleration, and it shows high resilience to large mini-batch sizes, allowing efficient use of parallelism. As a state-of-the-art optimizer for large language models, Adan has been adopted by prominent chatbots such as YaGPT. Yandex officially recommends Adan as the primary optimizer for training large language models.
(Illustration source: https://arxiv.org/pdf/2208.06677)
Batch Size Considerations
There’s a common misconception that larger batch sizes are always better. However, in practice, the optimal batch size depends on the specific model and task.
Each model has its own optimal batch size that provides the best quality and training speed. Too large or too small a batch size can negatively affect performance.
(Illustration source: https://medium.com/deep-learning-experiments/effect-of-batch-size-on-neural-net-training-c5ae8516e57)
Conclusion
Applying these optimization methods can significantly improve the stability and efficiency of training large language models. Using modern optimizers such as Lion, Shampoo, and Adan, as well as choosing the right batch size, helps models handle large volumes of data and complex tasks. Experiment with these methods in your projects and share the results with the community.