About the author
The article was written by ML Engineer Aleksei Babenkov. With a rich background spanning over six years in machine learning and deep learning, Aleksei has significantly influenced Russia’s data science landscape. At Avito, a major player in the Russian tech industry, he is playing a critical role in advancing the performance of ranking algorithms through cutting-edge research. Prior to this, Aleksei led an NLP research team at MaximaTelecom, where he focused on NLP and sequence processing, inventing several new methods that pushed the boundaries of the field. From the ground up, he created a market-leading user profiling system, elevating MaximaTelecom’s standing in the advertising industry.
Introduction
Training large language models (LLMs) is a complex task that requires substantial computational resources and effective optimization methods. In this article, we will explore practical techniques that can improve the stability and efficiency of training LLMs on a scale comparable to ChatGPT. Specifically, we will discuss modifications in layer normalization, activation functions, positional encodings, and modern optimization methods.
From Post-LayerNorm to Pre-LayerNorm
Initially, transformer architectures utilized post-layer normalization, where layer normalization is applied after adding the residual connection. This method was popular due to its simplicity and effectiveness at smaller scales. However, as model sizes have increased, the limitations of post-layer normalization have become apparent.
Transitioning to pre-layer normalization, where normalization is applied at the input of each sub-block (attention or feed-forward network), keeps gradient flow stable throughout the network and prevents gradients from exploding or vanishing, which is crucial for large-scale models. While post-layer normalization may converge to a slightly better optimum and yield higher quality in smaller models, it is less stable.
For large models, stability becomes paramount, as instability can prevent the model from converging and yielding any usable results. Therefore, pre-layer normalization is preferred for large models due to its stability guarantees. However, for smaller models, you might opt for post-layer normalization to potentially achieve slightly higher quality.
The diagram below illustrates the difference, showing a transformer block with post-layer normalization next to one with pre-layer normalization.
(Illustration source: On Layer Normalization in the Transformer Architecture, https://arxiv.org/pdf/2002.04745)
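To make the contrast concrete, here is a minimal PyTorch sketch of the two block variants. The choice of GELU inside the feed-forward network, the dimensions, and the omission of dropout and masking are simplifying assumptions, not a production recipe:

```python
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Original Transformer block: normalize after the residual addition."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])  # residual first, then norm
        x = self.norm2(x + self.ffn(x))
        return x

class PreLNBlock(nn.Module):
    """Pre-LN block: normalize the input of each sub-layer; the residual path stays unnormalized."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # norm feeds the sub-layer only
        x = x + self.ffn(self.norm2(x))
        return x
```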
RMSNorm and the Abandonment of Bias in Other Layers
RMSNorm is an alternative to LayerNorm that drops mean-centering and the bias term, normalizing activations only by their root mean square. This simplifies computation and reduces the impact of outliers, which enhances model stability. Using RMSNorm can achieve greater training stability, which is particularly important for large-scale models. The result is more predictable model behavior and a lower risk of getting stuck in poor local minima.
Following the same motivation, we also eliminate biases in linear layers. Empirical evidence suggests this improves model quality, and removing biases makes the network more stable overall, which allows for more intensive training: more aggressive learning rates and stronger weight decay.
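As a reference point, here is a minimal RMSNorm sketch in PyTorch, together with a bias-free linear layer. The epsilon value and the learnable gain are common conventions rather than prescriptions from the article:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Normalizes by the root-mean-square of the features: no mean subtraction, no bias term."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain only

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

# Bias-free linear layers follow the same motivation:
proj = nn.Linear(4096, 4096, bias=False)
```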
By understanding and implementing these normalization techniques, we can enhance the performance and stability of training large language models, paving the way for more advanced and capable AI systems.
The RMSNorm authors report that, for machine translation with transformer architectures, replacing LayerNorm with RMSNorm buys an extra one or two tenths of a BLEU point essentially for free; at the modern scale of data and models, even that is a meaningful gain. The choice between RMSNorm and pRMSNorm (which estimates the statistic from only a fraction of the features) is not clear-cut, so it is worth experimenting and picking whichever works best for your task.
Upgrading Activation Functions
Standard activation functions like ReLU (Rectified Linear Unit) have limitations, especially at larger scales. ReLU can contribute to training instability because its derivative is discontinuous at zero, while the dependencies we want to model are, after all, more continuous than discrete.
Experiments indicate that replacing ReLU with smooth functions such as GELU (Gaussian Error Linear Unit), Swish, and ELU (Exponential Linear Unit) improves training stability. These functions better model complex dependencies in the data due to their continuous derivatives.
A more powerful approach is transitioning to gated activations such as ReGLU, GEGLU, and SwiGLU. Each of these replaces the single up-projection of the feed-forward block with two: a plain linear branch and a branch passed through an activation function, whose outputs are combined by element-wise multiplication (a Hadamard product). In essence, if you take the parameter budget of a model with a simple activation function and redistribute it across the two branches of a gated activation, you achieve better quality without increasing the parameter count; a sketch follows the comparison table below.
| Activation Function | Training Stability | Model Quality | Notes |
| --- | --- | --- | --- |
| ReLU | Moderate | Moderate | Simple and efficient for smaller models, but can cause instability at larger scales due to discontinuities in derivatives. |
| GELU | High | High | Smooth derivatives lead to better training stability and improved performance for large models. |
| Swish | High | High | Combines smoothness and flexibility, offering excellent stability and quality. |
| ELU | High | High | Provides a smooth and continuous output, enhancing stability and model quality. |
| ReGLU | Very High | Very High | Gated activation improves quality without increasing parameters, offering superior stability. |
| GEGLU | Very High | Very High | Similar to ReGLU with added flexibility, further improving stability and quality. |
| SwiGLU | Very High | Very High | Combines the Swish activation with a gating mechanism, providing the best stability and model performance of the group. |
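To make the gated variant concrete, here is a sketch of a SwiGLU feed-forward block in PyTorch. The bias-free projections and the hidden size of roughly 2/3 · 4 · d_model (chosen so the parameter count stays comparable to a plain FFN) are common conventions and assumptions on our part:

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated feed-forward block: a gate branch with Swish/SiLU and a plain linear branch,
    combined by element-wise (Hadamard) product, then projected back to d_model."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)  # branch with activation
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)    # plain linear branch
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Keep parameters comparable to a standard 4 * d_model FFN by shrinking the hidden size by ~2/3:
ffn = SwiGLUFFN(d_model=1024, d_hidden=int(2 / 3 * 4 * 1024))
```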
Evolution of Positional Encoding
In the original 2017 Transformer paper, trigonometric functions were proposed for positional encoding. This solution works but is not ideal: the positional information has to be carried through the entire network, which can destabilize the model. As the positional signal propagates, it has to be amplified, which raises activations, and elevated activations are both a cause and a symptom of training instability. Dragging positional information through the whole network is also lossy; some of it is inevitably forgotten along the way.
Such extensive propagation may not even be necessary. Position information is largely preserved as tokens pass through the network: in a sense, the transformer is bijective, merely producing increasingly higher-level representations of the same tokens. This means we can inject positional information only where it is actually needed: right before attention.
The first proposed solution is relative position encoding: instead of an absolute position, we train a bias that depends on the distance between tokens and add it to the attention matrix. The bias is introduced at the last moment, right before attention in each block, so the heavy positional signal no longer has to be dragged through the entire network.
This fits attention well: attention is a communication mechanism, a way to extract information from neighbors relative to the current token, and the distance is itself a function of the current and target tokens. The approach is also shift-invariant, a kind of symmetry: tokens 2 and 5 have the same positional relationship as tokens 3 and 6. Its drawback is computational cost, since it requires storing an extra set of learned biases and adding them to the attention matrix, the heaviest part of the transformer architecture.
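A minimal sketch of the idea in PyTorch: a learned per-head bias, looked up by the (clipped) distance between tokens and added to the attention logits right before the softmax. The clipping to a maximum offset and the exact tensor shapes are assumptions, not a specific published recipe:

```python
import torch
import torch.nn as nn

class RelativePositionBias(nn.Module):
    """Learned scalar bias per head, indexed by the clipped distance between tokens."""
    def __init__(self, n_heads: int, max_distance: int = 128):
        super().__init__()
        self.max_distance = max_distance
        # one bias per head for every relative offset in [-max_distance, max_distance]
        self.bias = nn.Parameter(torch.zeros(2 * max_distance + 1, n_heads))

    def forward(self, seq_len: int):
        pos = torch.arange(seq_len)
        rel = pos[None, :] - pos[:, None]                        # (seq, seq) relative offsets
        rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
        return self.bias[rel].permute(2, 0, 1)                   # (heads, seq, seq)

# Usage inside attention (scores: (batch, heads, seq, seq)):
# scores = q @ k.transpose(-2, -1) / scale
# scores = scores + rel_bias(seq_len)   # added right before the softmax, in each block separately
```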
Rotary Positional Encodings
A more intriguing idea for addressing this problem is Rotary Positional Encodings (RoPE), a recent result from our Chinese colleagues (the RoFormer paper) and an extremely exciting development. The concept is to multiply the query and key vectors Q and K by certain complex numbers that depend on their positions; complex numbers are used because they encode rotation operations naturally.
Let $q_m$ and $k_n$ be a query vector at position $m$ and a key vector at position $n$. RoPE treats each pair of coordinates as a complex number and multiplies it by a unit complex number whose angle grows with the position:

$$q'_m = q_m \, e^{i m \theta_j}, \qquad k'_n = k_n \, e^{i n \theta_j},$$

where $\theta_j$ is a per-pair frequency (the RoFormer paper uses $\theta_j = 10000^{-2j/d}$). In real arithmetic this is simply a 2D rotation, implemented with a sine and cosine pair:

$$\begin{pmatrix} x'_{2j} \\ x'_{2j+1} \end{pmatrix} =
\begin{pmatrix} \cos m\theta_j & -\sin m\theta_j \\ \sin m\theta_j & \cos m\theta_j \end{pmatrix}
\begin{pmatrix} x_{2j} \\ x_{2j+1} \end{pmatrix}.$$

The dot product of a query rotated at position $m$ and a key rotated at position $n$ then depends only on the offset $m - n$:

$$\langle q'_m, k'_n \rangle = \operatorname{Re}\!\left( q_m \, \overline{k_n} \, e^{i (m - n) \theta_j} \right).$$
This means the attention score depends only on the distance between tokens. The approach eliminates the need to learn separate positional parameters and avoids adding any explicit bias to the attention matrix QK^T: the transformation is applied directly to Q and K before the product is taken. The theory is supported by practice: in terms of quality it works just as well.
Here’s a better way to think about it: Imagine each token in a sequence as a point on a plane. Traditional positional encodings assign fixed coordinates to these points. However, as these coordinates are propagated through the layers of the model, they can get lost or distorted, leading to instability.
RoPE works differently: it not only sets the position of the point but also the direction of rotation (like rotating a compass needle). These rotations depend on the token’s position and help better preserve the relative positions of tokens to each other.
Using sines and cosines makes these rotations easy to compute, instead of merely fixing coordinates. As a result, the model gets more stable and accurate representations of token positions, which improves its performance.
For example, imagine a circle. Traditional positional encoding is like marking a fixed point on it. RoPE instead attaches an arrow to that point, one that rotates with the position, so the model better “remembers” where the point is relative to the other points on the circle.
(Illustration Source: Roformer: Enhanced Transformer with Rotary Position Embedding, https://arxiv.org/pdf/2104.09864)
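To tie the formulas to code, here is a compact sketch of applying RoPE to query and key tensors. The base of 10000 follows the RoFormer paper; the “rotate-half” pairing of channels (first half paired with second half) is a common implementation convention rather than the paper's interleaved layout:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x (shape: [batch, seq, dim], dim even) by position-dependent angles."""
    batch, seq_len, dim = x.shape
    half = dim // 2
    # per-pair frequencies theta_j = base^(-2j/dim)
    freqs = base ** (-torch.arange(half, dtype=torch.float32) * 2 / dim)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]        # treat (x1_j, x2_j) as a 2D pair
    return torch.cat([x1 * cos - x2 * sin,       # 2D rotation by the position angle
                      x1 * sin + x2 * cos], dim=-1)

# The score of a rotated query at position m against a rotated key at position n depends
# only on m - n, so relative position enters attention without any extra bias term:
# scores = (apply_rope(q) @ apply_rope(k).transpose(-2, -1)) / dim ** 0.5
```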
Conclusion
Training large language models (LLMs) requires innovative techniques to optimize performance and stability. This article explored several advanced methods, including pre-layer normalization, RMSNorm, and the removal of biases in linear layers, all of which enhance model stability and efficiency.
We also discussed upgrading activation functions from ReLU to smoother options like GELU, Swish, and ELU, as well as the more advanced gated activations such as ReGLU, GEGLU, and SwiGLU. These changes improve training stability and model quality without increasing parameters.
In the realm of positional encodings, we compared traditional trigonometric methods with relative position encoding and Rotary Positional Encodings (RoPE). These newer methods more effectively manage positional information, improving model performance.
Implementing these techniques significantly boosts the stability and efficiency of LLM training. We encourage you to experiment with these methods and share your results, contributing to the ongoing advancement of AI technology.