In this blog, I will discuss three important factors when training LLMs: Pre-training Tasks, Long Context Modeling, and Optimization Settings. This blog is based on Datawhale materials and a nice survey.
Pre-training is crucial for encoding general knowledge from a large corpus into the extensive model parameters. When training LLMs, two commonly used pre-training tasks are language modeling and denoising autoencoding.
The language modeling task (LM) is the most frequently employed objective for pre-training decoder-only LLMs.
Formally, given a sequence of tokens $\mathbf{x} = \{x_1, \dots, x_n\}$, the LM task seeks to predict the target token $x_i$ based on the preceding tokens $\mathbf{x}_{<i}$ in an autoregressive manner. The general training objective is to maximize the following likelihood:
$$L_{LM}(\mathbf{x}) = \sum_{i=1}^{n} \log P(x_i \mid \mathbf{x}_{<i})$$
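In code, this objective is usually implemented as a cross-entropy loss over shifted targets. Below is a minimal sketch in PyTorch, assuming a hypothetical `model` that maps token ids to next-token logits:

```python
import torch
import torch.nn.functional as F

def lm_loss(model, x: torch.Tensor) -> torch.Tensor:
    """Autoregressive LM loss: each position predicts the next token.

    x: (batch, seq_len) tensor of token ids.
    model(x) is assumed to return logits of shape (batch, seq_len, vocab).
    """
    logits = model(x)[:, :-1, :]   # predictions for positions 1 .. n-1
    targets = x[:, 1:]             # ground-truth next tokens
    # Cross-entropy is the negative log-likelihood, so minimizing it maximizes L_LM.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```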
Given that most language tasks can be framed as a prediction problem based on the input, these decoder-only LLMs may have the potential to implicitly learn how to handle these tasks in a unified LM approach. Some research has also indicated that decoder-only LLMs can naturally transition to specific tasks by autoregressively predicting the next tokens, without requiring fine-tuning.
An important variation of LM is the prefix language modeling task, which is tailored for pre-training models with the prefix decoder architecture. The tokens within a randomly selected prefix are not utilized in computing the loss of prefix language modeling. With the same number of tokens observed during pre-training, prefix language modeling performs slightly less effectively than language modeling, as fewer tokens in the sequence are involved in model pre-training.
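The difference from plain LM lies only in which tokens contribute to the loss (the bidirectional attention over the prefix lives inside the model itself). Here is a sketch of that loss masking, with the same assumed `model` interface as above and a per-example `prefix_len`:

```python
import torch
import torch.nn.functional as F

def prefix_lm_loss(model, x: torch.Tensor, prefix_len: torch.Tensor) -> torch.Tensor:
    """Prefix LM loss: tokens inside the randomly chosen prefix are not predicted.

    x:          (batch, seq_len) token ids.
    prefix_len: (batch,) prefix length per example.
    """
    logits = model(x)[:, :-1, :]
    targets = x[:, 1:].clone()
    # Target x[:, j+1] lies inside the prefix when j + 1 < prefix_len: mask it out.
    pos = torch.arange(targets.size(1), device=x.device).unsqueeze(0)   # (1, seq_len-1)
    targets[pos < (prefix_len.unsqueeze(1) - 1)] = -100                 # ignored below
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```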
Apart from the traditional LM, the denoising autoencoding task (DAE) has also been extensively employed for pre-training language models. The inputs $\mathbf{x}_{/\hat{\mathbf{x}}}$ for the DAE task consist of corrupted text with randomly replaced spans. Subsequently, the language models are trained to recover the replaced tokens $\hat{\mathbf{x}}$.
Formally, the training objective of DAE is denoted as follows:
$$L_{DAE} = \log P(\hat{\mathbf{x}} \mid \mathbf{x}_{/\hat{\mathbf{x}}})$$
Nevertheless, the implementation of the DAE task appears to be more intricate than that of the LM task. Consequently, it has not been extensively utilized for pre-training large language models.
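To make the corruption step concrete, here is a minimal sketch of T5-style span corruption; the sentinel ids, span length, and corruption rate are illustrative assumptions rather than any model's exact recipe:

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, sentinel_start=32000):
    """Replace random spans with sentinel ids and build the DAE target.

    Returns (corrupted_input, target); the target lists each sentinel followed by
    the original tokens of the span it replaced.
    """
    budget = max(1, int(len(tokens) * corruption_rate))   # how many tokens to corrupt
    corrupted, target = [], []
    i, sentinel = 0, sentinel_start
    while i < len(tokens):
        if budget > 0 and random.random() < corruption_rate:
            span = min(mean_span_len, len(tokens) - i, budget)
            corrupted.append(sentinel)                     # one sentinel marks the span
            target.extend([sentinel] + tokens[i:i + span])
            sentinel += 1
            budget -= span
            i += span
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, target
```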
The mixture-of-denoisers (MoD) approach treats both the LM and DAE objectives as different types of denoising tasks, namely the S-denoiser (LM), the R-denoiser (DAE, short spans and low corruption), and the X-denoiser (DAE, long spans or high corruption). Among these three denoising tasks, the S-denoiser is akin to the traditional LM objective, while the R-denoiser and X-denoiser resemble DAE objectives, differing from each other in the span lengths and the ratio of corrupted text. For input sentences that commence with different special tokens, the model is optimized with the corresponding denoiser.
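A rough sketch of how the three denoisers could be configured and selected by the special prefix token, reusing the `span_corrupt` sketch above; the token names, span lengths, and corruption ratios are illustrative, not the official settings:

```python
# Illustrative denoiser configurations keyed by their special prefix token.
DENOISERS = {
    "[R]": {"kind": "dae", "mean_span_len": 3,  "corruption_rate": 0.15},  # short spans, low corruption
    "[X]": {"kind": "dae", "mean_span_len": 32, "corruption_rate": 0.50},  # long spans / high corruption
    "[S]": {"kind": "lm"},                                                 # sequential, LM-style
}

def build_example(prefix_token, tokens):
    """Route an input sequence to the denoiser indicated by its prefix token."""
    cfg = DENOISERS[prefix_token]
    if cfg["kind"] == "lm":
        return tokens, tokens                  # S-denoiser: plain language modeling
    return span_corrupt(tokens, cfg["corruption_rate"], cfg["mean_span_len"])
```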
In practical scenarios, there is a growing need for LLMs to effectively model long contexts, such as in PDF processing and story writing. To improve the long context modeling capabilities, there are typically two viable approaches: scaling position embeddings and adjusting the context window.
Transformer-based LLMs can effectively learn position embeddings within the maximum training length. Therefore, when adapting LLMs to language tasks that extend beyond the maximum training length, it becomes necessary to scale to larger position indices. Some position embeddings have demonstrated a degree of generalizability to text beyond the training length, formally termed as extrapolation capability.
However, empirical studies have shown that RoPE, one of the mainstream position embedding methods, exhibits limited extrapolation ability. In the remainder of this part, several methods for scaling RoPE to longer texts will be explored.
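One widely used scaling method is position interpolation, which compresses out-of-range position indices back into the training range before computing the RoPE rotation angles. A minimal sketch, where the rotary dimension, base, and lengths are common defaults rather than any specific model's values:

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int = 128, base: float = 10000.0,
                train_len: int = 2048, target_len: int = 8192) -> torch.Tensor:
    """RoPE rotation angles with position interpolation.

    Positions beyond the training length are rescaled by train_len / target_len,
    so they fall back inside the range the model saw during pre-training.
    """
    scale = train_len / target_len                  # e.g. 2048 / 8192 = 0.25
    scaled = positions.float() * scale              # interpolate position indices
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(scaled, inv_freq)            # (seq_len, dim/2) angles
```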
Warning: this section is hard to understand and so fine-grained that I think it is natural to be confused by it; feel free to skip it and read the rest.
Due to the limited context windows of Transformer-based LLMs, they are unable to directly integrate or utilize the complete information from long sequences that exceed the context window. To address this limitation, various methods for adapting LLMs to long contexts have been proposed.
In language model pre-training, it is common to set the batch size to a large number to enhance training stability and throughput. Notably, LLMs such as GPT-3 and PaLM have introduced a novel strategy that dynamically increases the batch size during training, ultimately reaching a million scale. Specifically, the batch size of GPT-3 gradually increases from 32K to 3.2M tokens. Empirical results have demonstrated that this dynamic schedule of batch size can effectively stabilize the training process of LLMs.
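A minimal sketch of such a batch size warm-up follows; the linear shape and the ramp length are assumptions for illustration, not the exact GPT-3 schedule:

```python
def batch_size_schedule(step: int, start_tokens: int = 32_000,
                        end_tokens: int = 3_200_000, ramp_steps: int = 10_000) -> int:
    """Gradually increase the per-step batch size (in tokens) from start to end."""
    if step >= ramp_steps:
        return end_tokens
    frac = step / ramp_steps
    return int(start_tokens + frac * (end_tokens - start_tokens))
```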
During pre-training, existing LLMs typically follow a similar learning rate schedule, incorporating warm-up and decay strategies. Initially, within the first 0.1% to 0.5% of the training steps, a linear warm-up schedule is employed to gradually increase the learning rate to a maximum value ranging from approximately $5 \times 10^{-5}$ to $1 \times 10^{-4}$. Subsequently, a cosine decay strategy is adopted, gradually reducing the learning rate to approximately 10% of its maximum value, until the training loss converges.
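A compact sketch of this warm-up-then-cosine schedule; the concrete step counts are placeholders:

```python
import math

def lr_schedule(step: int, max_lr: float = 1e-4, min_lr_ratio: float = 0.1,
                warmup_steps: int = 2_000, total_steps: int = 400_000) -> float:
    """Linear warm-up to max_lr, then cosine decay down to ~10% of max_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return max_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)
```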
The Adam optimizer and AdamW optimizer are commonly employed for training LLMs, such as GPT-3. These optimizers are based on adaptive estimates of lower-order moments for first-order gradient-based optimization. Typically, their hyper-parameters are set as follows: $\beta_1 = 0.9$, $\beta_2 = 0.95$ and $\epsilon = 10^{-8}$. Additionally, the Adafactor optimizer has been utilized in training LLMs, such as PaLM and T5. Adafactor is a variant of the Adam optimizer specifically designed to conserve GPU memory during training. The hyper-parameters of the Adafactor optimizer are set as $\beta_1 = 0.9$ and $\beta_2 = 1.0 - k^{-0.8}$, where $k$ denotes the number of training steps.
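With PyTorch, these commonly reported hyper-parameters translate directly into an AdamW configuration; the peak learning rate and weight decay below simply echo the values discussed in this section:

```python
import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    """AdamW with the beta/epsilon values commonly reported for LLM pre-training."""
    return torch.optim.AdamW(
        model.parameters(),
        lr=1e-4,                 # peak value reached by the warm-up schedule above
        betas=(0.9, 0.95),
        eps=1e-8,
        weight_decay=0.1,        # weight decay rate discussed in the next paragraph
    )
```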
When pre-training LLMs, instability during training is a common issue that can lead to model collapse. To tackle this problem, weight decay and gradient clipping are commonly employed, with previous research often setting the gradient clipping threshold to 1.0 and weight decay rate to 0.1. However, as LLMs scale up, instances of training loss spikes become more frequent, resulting in unstable training. To address this, some approaches adopt a straightforward strategy of restarting the training process from an earlier checkpoint before the spike occurs and skipping over problematic data. Additionally, GLM observes that abnormal gradients in the embedding layer are often responsible for spikes and suggests reducing the embedding layer gradients to alleviate this issue.
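Inside a training loop, the gradient clipping threshold of 1.0 is typically applied right before the optimizer step. A sketch reusing the `lm_loss` and `lr_schedule` helpers above:

```python
import torch

def train_step(model, batch, optimizer, step, max_grad_norm=1.0):
    """One optimization step with gradient clipping at norm 1.0."""
    for group in optimizer.param_groups:
        group["lr"] = lr_schedule(step)                 # warm-up + cosine decay
    loss = lm_loss(model, batch)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```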
The table below illustrates the detailed optimization settings of several existing LLMs; it is cited from the survey.
END