1. Abstract
1. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. Note: the pre-training corpus grows from Qwen2's 7T tokens to 18T tokens.
2. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning, including offline learning DPO and online learning GRPO. Note: post-training uses over 1 million samples and multi-stage training, including offline DPO and online GRPO.
3. The open-weight offerings include base models and instruction-tuned models in sizes of 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters. Quantized versions of the instruction-tuned models are also provided. Over 100 models can be accessed from Hugging Face Hub, ModelScope, and Kaggle. Note: models of many sizes are open-sourced.
4. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Note: the MoE variants Turbo and Plus are not open-weight; they are available on Alibaba Cloud Model Studio.
2. Architecture & Tokenizer
Dense models
For dense models, we maintain the Transformer-based decoder architecture (Vaswani et al., 2017; Radford et al., 2018) as Qwen2 (Yang et al., 2024a).
Note: the dense models keep a model architecture similar to Qwen2, built on several key components (a minimal sketch of two of them follows the list):
1. Grouped Query Attention (GQA, Ainslie et al., 2023) for efficient KV cache utilization
2. SwiGLU activation function (Dauphin et al., 2017) for non-linear activation
3. Rotary Positional Embeddings (RoPE, Su et al., 2024) for encoding position information
4. QKV bias (Su, 2023) in the attention mechanism and RMSNorm (Jiang et al., 2023b) with pre-normalization to ensure stable training.
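As a rough illustration of two of these components, here is a minimal PyTorch sketch of RMSNorm used for pre-normalization and a SwiGLU feed-forward block. The layer names and dimensions are illustrative assumptions, not the actual Qwen2.5 implementation.

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization, applied before attention / MLP blocks."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """Gated feed-forward block: silu(x W_gate) * (x W_up), projected back down."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))

x = torch.randn(2, 8, 64)                       # (batch, sequence, hidden)
print(SwiGLU(64, 128)(RMSNorm(64)(x)).shape)    # torch.Size([2, 8, 64])
```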
MoE model architecture
This is achieved by replacing standard feed-forward network (FFN) layers with specialized MoE layers, where each layer comprises multiple FFN experts and a routing mechanism that dispatches tokens to the top-K experts. Following the approaches demonstrated in Qwen1.5-MoE (Yang et al., 2024a), we implement fine-grained expert segmentation (Dai et al., 2024) and shared experts routing (Rajbhandari et al., 2022; Dai et al., 2024). These architectural innovations have yielded substantial improvements in model performance across downstream tasks. Note: the MoE models follow a structure similar to Qwen1.5-MoE, using fine-grained expert segmentation and shared expert routing.
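The routing idea can be sketched as a toy top-K MoE layer with always-active shared experts. The expert counts, top_k, and the dense dispatch loop below are illustrative assumptions, not the Qwen2.5 configuration; a real implementation would route only the selected tokens to each expert.

```python
import torch
from torch import nn

class MoELayer(nn.Module):
    """Toy MoE FFN: top-k routed experts plus shared experts that see every token."""
    def __init__(self, dim, expert_dim, num_experts=8, num_shared=1, top_k=2):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(dim, expert_dim), nn.SiLU(), nn.Linear(expert_dim, dim))
        self.experts = nn.ModuleList(ffn() for _ in range(num_experts))
        self.shared = nn.ModuleList(ffn() for _ in range(num_shared))
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, dim)
        probs = self.gate(x).softmax(dim=-1)           # routing probabilities
        topk_w, topk_i = probs.topk(self.top_k, dim=-1)
        gate = torch.zeros_like(probs).scatter(1, topk_i, topk_w)   # sparse gates
        out = sum(e(x) for e in self.shared)           # shared experts: always on
        for e_id, expert in enumerate(self.experts):   # dense compute, sparse gating
            out = out + gate[:, e_id:e_id + 1] * expert(x)
        return out

tokens = torch.randn(5, 32)
print(MoELayer(dim=32, expert_dim=64)(tokens).shape)   # torch.Size([5, 32])
```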
Tokenization
For tokenization, we utilize Qwen’s tokenizer (Bai et al., 2023), which implements byte-level byte-pair encoding (BBPE, Brown et al., 2020; Wang et al., 2020; Sennrich et al., 2016) with a vocabulary of 151,643 regular tokens. We have expanded the set of control tokens from 3 to 22 compared to previous Qwen versions, adding two new tokens for tool functionality and allocating the remainder for other model capabilities. This expansion establishes a unified vocabulary across all Qwen2.5 models, enhancing consistency and reducing potential compatibility issues.
Note: the Qwen tokenizer is reused, with the control tokens expanded from 3 to 22.
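A quick way to inspect this in practice, assuming the public Hugging Face checkpoints and the `transformers` library (the counts mentioned in the comments come from the report and are not re-verified here):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
print(tok.vocab_size)           # regular BBPE tokens (151,643 per the report)
print(len(tok))                 # regular tokens plus the added control tokens
print(tok.special_tokens_map)   # e.g. <|im_start|> / <|im_end|> chat-control tokens
```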
3. Pre-training
1. Data: First, we carefully curate high-quality training data through sophisticated filtering and scoring mechanisms, combined with strategic data mixture.
2. Hyper-parameter optimization: Second, we conduct extensive research on hyperparameter optimization to effectively train models at various scales.
3. Long context: Finally, we incorporate specialized long-context pre-training to enhance the model's ability to process and understand extended sequences.
Pre-training Data
1. Better data filtering.
We leverage Qwen2-Instruct models as data quality filters that perform comprehensive, multi-dimensional analysis to evaluate and score training samples. The enhanced capabilities enable more nuanced quality assessment, resulting in both improved retention of high-quality training data and more effective filtering of low-quality samples across multiple languages. Note: Qwen2-Instruct models are used to score and filter the data.
2. Better math and code data.
During the pre-training phase of Qwen2.5, we incorporate training data from Qwen2.5-Math (Yang et al., 2024b) and Qwen2.5-Coder (Hui et al., 2024). By leveraging these high-quality domain-specific datasets during pre-training, Qwen2.5 inherits strong capabilities in both mathematical reasoning and code generation. Note: the data used to train Qwen2.5-Math and Qwen2.5-Coder is added, giving Qwen2.5 strong mathematical reasoning and code generation.
3. Better synthetic data.
To generate high-quality synthetic data, particularly in mathematics, code, and knowledge domains, we leverage both Qwen2-72B-Instruct (Yang et al., 2024a) and Qwen2-Math-72B-Instruct (Qwen Team, 2024c). The quality of this synthesized data is further enhanced through rigorous filtering using our proprietary general reward model and the specialized Qwen2-Math-RM-72B (Qwen Team, 2024c) model. Note: Qwen2-72B-Instruct and Qwen2-Math-72B-Instruct synthesize the data, which is then filtered with a general reward model and Qwen2-Math-RM-72B.
4. Better data mixture.
To optimize the pre-training data distribution, we employ Qwen2-Instruct models to classify and balance content across different domains. Our analysis revealed that domains like e-commerce, social media, and entertainment are significantly overrepresented in web-scale data, often containing repetitive, template-based, or machine-generated content. Conversely, domains such as technology, science, and academic research, while containing higher-quality information, are traditionally underrepresented. Through strategic down-sampling of overrepresented domains and up-sampling of high-value domains, we ensure a more balanced and information-rich training dataset that better serves our model's learning objectives. Note: Qwen2-Instruct models classify content by domain; overrepresented domains (e-commerce, social media, entertainment, often repetitive or machine-generated) are down-sampled, while underrepresented but higher-quality domains (technology, science, academic research) are up-sampled.
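The down-/up-sampling step might look roughly like the sketch below. The domain taxonomy, the per-domain rates, and the `classify_domain` callable (a placeholder for the Qwen2-Instruct-based classifier) are all hypothetical.

```python
import random

# Hypothetical per-domain sampling rates; rate < 1 down-samples, rate > 1 up-samples.
SAMPLING_RATE = {
    "e-commerce": 0.2, "social_media": 0.3, "entertainment": 0.3,
    "technology": 1.5, "science": 1.5, "academic": 2.0,
    "other": 1.0,
}

def resample(documents, classify_domain, rng=random.Random(0)):
    """Keep or repeat each document according to its domain's sampling rate."""
    mixed = []
    for doc in documents:
        rate = SAMPLING_RATE.get(classify_domain(doc), 1.0)
        copies = int(rate) + (1 if rng.random() < rate - int(rate) else 0)
        mixed.extend([doc] * copies)
    return mixed

docs = ["a survey of quantum error correction", "50% off shoes, today only!"]
print(resample(docs, lambda d: "academic" if "survey" in d else "e-commerce"))
```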
Building on these techniques, we have developed a larger and higher-quality pre-training dataset, expanding from the 7 trillion tokens used in Qwen2 (Yang et al., 2024a) to 18 trillion tokens. Note: the final corpus contains 18T tokens.
Scaling Law for Hyper-parameters
We develop scaling laws for hyper-parameters based on the pre-training data of Qwen2.5 (Hoffmann et al., 2022; Kaplan et al., 2020). While previous studies (Dubey et al., 2024; Almazrouei et al., 2023; Hoffmann et al., 2022) primarily used scaling laws to determine optimal model sizes given compute budgets, we leverage them to identify optimal hyperparameters across model architectures. Specifically, our scaling laws help determine key training parameters like batch size B and learning rate μ for both dense models and MoE models of varying sizes.
Note: earlier work applied scaling laws to trade off compute budget against model size; the Qwen2.5 team instead applies them to hyper-parameter choice across architectures, setting e.g. batch size B and learning rate μ for dense and MoE models of different sizes.
Through extensive experimentation, we systematically study the relationship between model architecture and optimal training hyper-parameters. Specifically, we analyze how the optimal learning rate μopt and batch size Bopt vary with model size N and pre-training data size D. Our experiments cover a comprehensive range of architectures, including dense models with 44M to 14B parameters and MoE models with 44M to 1B activated parameters, trained on datasets ranging from 0.8B to 600B tokens. Using these optimal hyper-parameter predictions, we then model the final loss as a function of model architecture and training data scale.
Note: extensive experiments relate the optimal learning rate μ_opt and batch size B_opt to model size N and pre-training data size D, covering dense models from 44M to 14B parameters and MoE models with 44M to 1B activated parameters, trained on 0.8B to 600B tokens. With these optimal hyper-parameter predictions, the final loss is then modeled as a function of model architecture and training data scale.
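As a sketch of what such a hyper-parameter scaling law might look like, one can posit a power-law form μ_opt = c · N^a · D^b and fit it in log space to swept results. The records below are invented and only the fitting procedure is illustrative; the report does not publish its functional form or coefficients.

```python
import numpy as np

# Invented records of (model size N, data size D, best learning rate found by a sweep).
records = [
    (44e6,  0.8e9,  2.0e-3),
    (350e6, 10e9,   9.0e-4),
    (1.5e9, 100e9,  4.0e-4),
    (14e9,  600e9,  1.5e-4),
]

# Fit log(mu_opt) = log(c) + a*log(N) + b*log(D) by linear least squares;
# the same template applies to the optimal batch size B_opt.
N, D, mu = (np.array(col) for col in zip(*records))
X = np.column_stack([np.ones_like(N), np.log(N), np.log(D)])
coef, *_ = np.linalg.lstsq(X, np.log(mu), rcond=None)
log_c, a, b = coef

def mu_opt(n, d):
    return float(np.exp(log_c) * n**a * d**b)

print(mu_opt(7e9, 18e12))   # predicted optimal LR for a hypothetical 7B / 18T run
```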
Additionally, we leverage scaling laws to predict and compare the performance of MoE models with varying parameter counts against their dense counterparts. This analysis guides our hyper-parameter configuration for MoE models, enabling us to achieve performance parity with specific dense model variants (such as Qwen2.5-72B and Qwen2.5-14B) through careful tuning of both activated and total parameters.
Note: scaling laws are also used to predict and compare MoE models against their dense counterparts, guiding the choice of activated and total parameters so that MoE variants match specific dense models (e.g., Qwen2.5-72B and Qwen2.5-14B).
Long-context Pre-training
For optimal training efficiency, Qwen2.5 employs a two-phase pre-training approach: an initial phase with a 4,096-token context length, followed by an extension phase for longer sequences. Following the strategy used in Qwen2, we extend the context length from 4,096 to 32,768 tokens during the final pre-training stage for all model variants except Qwen2.5-Turbo. Concurrently, we increase the base frequency of RoPE from 10,000 to 1,000,000 using the ABF technique (Xiong et al., 2023). Note: two-stage training: first at a 4,096-token context, then (for all models except Turbo) the window is extended to 32,768 tokens following the Qwen2 strategy, while the RoPE base frequency is raised from 10,000 to 1,000,000 with ABF.
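The effect of the ABF step can be read off the RoPE frequencies directly: raising the base stretches the lowest-frequency bands so distant positions remain distinguishable. A small numpy sketch (the head dimension of 128 is an assumption for illustration):

```python
import numpy as np

def rope_inv_freq(dim, base):
    """Inverse frequencies of RoPE for head dimension `dim` and base frequency `base`."""
    return base ** (-np.arange(0, dim, 2) / dim)

for base in (10_000, 1_000_000):
    lowest = rope_inv_freq(128, base)[-1]
    print(f"base={base:>9,d}  lowest band completes a rotation every ~{2 * np.pi / lowest:,.0f} tokens")
```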
For Qwen2.5-Turbo, we implement a progressive context length expansion strategy during training, advancing through four stages: 32,768 tokens, 65,536 tokens, 131,072 tokens, and ultimately 262,144 tokens, with a RoPE base frequency of 10,000,000. At each stage, we carefully curate the training data to include 40% sequences at the current maximum length and 60% shorter sequences. This progressive training methodology enables smooth adaptation to increasing context lengths while maintaining the model's ability to effectively process and generalize across sequences of varying lengths. Note: Turbo is extended progressively through 32,768, 65,536, 131,072 and finally 262,144 tokens, with the RoPE base raised to 10,000,000; each stage mixes 40% maximum-length sequences with 60% shorter ones.
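A sketch of how a stage's batches could honor that mix; only the stage schedule and the 40/60 split come from the text, while the length bucketing and batch assembly are assumptions.

```python
import random

STAGES = [32_768, 65_536, 131_072, 262_144]   # Qwen2.5-Turbo context-length schedule

def sample_stage_batch(corpus_by_length, max_len, batch_size, rng=random.Random(0)):
    """Draw ~40% sequences at the stage's maximum length and ~60% shorter ones."""
    n_long = int(0.4 * batch_size)
    long_pool = corpus_by_length[max_len]
    short_pool = [seq for length, seqs in corpus_by_length.items() if length < max_len for seq in seqs]
    return rng.choices(long_pool, k=n_long) + rng.choices(short_pool, k=batch_size - n_long)

corpus = {4_096: ["short-a", "short-b"], 32_768: ["long-a"], 65_536: ["longer-a"]}
print(sample_stage_batch(corpus, max_len=65_536, batch_size=5))
```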
To enhance our models' ability to process longer sequences during inference, we implement two key strategies: YARN (Peng et al., 2023) and Dual Chunk Attention (DCA, An et al., 2024). Through these innovations, we achieve a four-fold increase in sequence length capacity, enabling Qwen2.5-Turbo to handle up to 1 million tokens and other models to process up to 131,072 tokens. Notably, these approaches not only improve the modeling of long sequences by reducing perplexity but also maintain the models' strong performance on shorter sequences, ensuring consistent quality across varying input lengths. Note: at inference time, YARN and Dual Chunk Attention extend the usable context four-fold.
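For the open-weight checkpoints, the Hugging Face model cards describe enabling this YaRN extrapolation at inference time by adding a `rope_scaling` block to the checkpoint's `config.json`. The snippet below is quoted from memory of those cards, so treat the exact keys and values as an assumption; DCA support comes from serving frameworks such as vLLM rather than from this config.

```python
import json, pathlib

cfg_path = pathlib.Path("Qwen2.5-7B-Instruct/config.json")   # local copy of the checkpoint
cfg = json.loads(cfg_path.read_text())
cfg["rope_scaling"] = {
    "type": "yarn",
    "factor": 4.0,                                # 4x the 32,768-token training window
    "original_max_position_embeddings": 32768,
}
cfg_path.write_text(json.dumps(cfg, indent=2))
```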
4. Post-training
Expanded Supervised Fine-tuning Data Coverage
The supervised fine-tuning process leverages a massive dataset comprising millions of high-quality examples. This expansion specifically addresses key areas where the previous model showed limitations, such as long-sequence generation, mathematical problem-solving, coding, instruction-following, structured data understanding, logical reasoning, cross-lingual transfer, and robust system instruction.
Note: SFT uses millions of high-quality examples targeting the weak spots of the previous model: long-sequence generation, math, coding, instruction following, structured data understanding, logical reasoning, cross-lingual transfer, and robust system instructions.
Two-stage Reinforcement Learning
Offline RL: This stage focuses on developing capabilities that are challenging for the reward model to evaluate, such as reasoning, factuality, and instruction-following. Through meticulous construction and validation of training data, we ensure that the Offline RL signals are both learnable and reliable (Xiang et al., 2024), enabling the model to acquire those complex skills effectively. Note: offline RL targets abilities a reward model struggles to score (reasoning, factuality, instruction following), relying on carefully constructed and validated training signals.
Online RL: The Online RL phase leverages the reward model's ability to detect nuances in output quality, including truthfulness, helpfulness, conciseness, relevance, harmlessness and debiasing. It enables the model to generate responses that are precise, coherent, and well-structured while maintaining safety and readability. As a result, the model's outputs consistently meet human quality standards and expectations. Note: online RL uses the reward model to judge fine-grained output quality, pushing responses toward human standards.
Supervised Fine-tuning
Long-sequence Generation
Qwen2.5 is capable of generating high-quality content with an output context length of up to 8,192 tokens, a significant advancement over the typical post-training response length, which often remains under 2,000 tokens. To address this gap, we develop long-response datasets (Quan et al., 2024). We employ back-translation techniques to generate queries for long-text data from pre-training corpora, impose output length constraints, and use Qwen2 to filter out low-quality paired data. Note: although the model can emit up to 8,192 tokens, typical post-training responses stay under 2,000; long-response datasets are built by back-translating queries for long pre-training texts, constraining output length, and filtering low-quality pairs with Qwen2.
Mathematics
We introduce the chain-of-thought data of Qwen2.5-Math (Yang et al., 2024b), which encompasses a diverse range of query sources, including public datasets, K-12 problem collections, and synthetic problems. To ensure high-quality reasoning, we employ rejection sampling (Yuan et al., 2023) along with reward modeling and annotated answers for guidance, producing step-by-step reasoning processes. Note: Qwen2.5-Math chain-of-thought data (public datasets, K-12 collections, synthetic problems) is used, with rejection sampling guided by reward models and annotated answers to keep high-quality step-by-step solutions.
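A minimal sketch of that rejection-sampling loop; `generate`, `extract_final_answer`, and `reward_score` are hypothetical stand-ins for the actual generator, answer parser, and reward model, and the sample count and threshold are invented.

```python
def collect_cot(problem, reference_answer, generate, extract_final_answer, reward_score,
                k=16, min_reward=0.5):
    """Keep sampled solutions whose final answer matches the annotation and whose
    reward-model score clears a threshold; everything else is rejected."""
    kept = []
    for _ in range(k):
        solution = generate(problem)                     # step-by-step candidate
        if (extract_final_answer(solution) == reference_answer
                and reward_score(problem, solution) >= min_reward):
            kept.append(solution)
    return kept
```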
Coding
To enhance coding capabilities, we incorporate the instruction tuning data of Qwen2.5-Coder (Hui et al., 2024). Multiple language-specific agents are integrated into a collaborative framework, generating diverse and high-quality instruction pairs across nearly 40 programming languages. We expand our instruction dataset by synthesizing new examples from code-related Q&A websites and gathering algorithmic code snippets from GitHub. A comprehensive multilingual sandbox is used to perform static code checking and validate code snippets through automated unit testing, ensuring code quality and correctness (Dou et al., 2024; Yang et al., 2024c).
Note: Qwen2.5-Coder instruction-tuning data covers nearly 40 programming languages; the dataset is expanded with synthesized examples from code Q&A sites and GitHub snippets, and a multilingual sandbox runs static checks and unit tests to guarantee correctness.
Instruction-following
To ensure high-quality instruction-following data, we implement a rigorous code-based validation framework. In this approach, LLMs generate both instructions and corresponding verification code, along with comprehensive unit tests for cross-validation. Through execution feedback-based rejection sampling, we carefully curate the training data used for Supervised Fine-Tuning, thereby guaranteeing the model's faithful adherence to intended instructions (Dong et al., 2024). Note: LLMs write the instructions together with verification code and unit tests; execution-feedback-based rejection sampling keeps only data whose responses pass the checks.
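A sketch of the execution-feedback check, assuming a hypothetical contract in which the LLM-written verification script defines `check(response) -> bool`; the real prompts, harness, and sandboxing are not public, and a production version would sandbox the subprocess.

```python
import subprocess, sys, tempfile

def passes_verification(response: str, verification_code: str, timeout: int = 10) -> bool:
    """Run LLM-written verification code against a candidate response."""
    script = (
        verification_code
        + "\n\nimport sys\n"
        + f"sys.exit(0 if check({response!r}) else 1)\n"
    )
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        path = f.name
    result = subprocess.run([sys.executable, path], timeout=timeout)
    return result.returncode == 0

# Responses that fail the generated checks are rejected from the SFT set.
verifier = "def check(text):\n    return text.strip().lower().startswith('yes')"
print(passes_verification("Yes, here are exactly three bullet points ...", verifier))  # True
```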
Structured Data Understanding
We develop a comprehensive structured understanding dataset that encompasses both traditional tasks, such as tabular question-answering, fact verification, error correction, and structural understanding, as well as complex tasks involving structured and semi-structured data. By incorporating reasoning chains into the model’s responses, we significantly enhance its ability to infer information from structured data, thereby improving its performance across these diverse tasks. This approach not only broadens the scope of the dataset but also deepens the model’s capacity to reason and derive meaningful insights from complex data structures.
Note: the dataset spans tabular question answering, fact verification, error correction, and structural understanding, plus complex tasks on structured and semi-structured data, with reasoning chains included in responses to strengthen inference over structured data.
Logical Reasoning
To enhance the model’s logical reasoning capabilities, we introduce a diverse set of 70,000 new queries spanning various domains. These queries encompass multiple-choice questions, true / false questions, and open-ended questions. The model is trained to approach problems systematically, employing a range of reasoning methods such as deductive reasoning, inductive generalization, analogical reasoning, causal reasoning, and statistical reasoning. Through iterative refinement, we systematically filter out data containing incorrect answers or flawed reasoning processes. This process progressively strengthens the model’s ability to reason logically and accurately, ensuring robust performance across different types of reasoning tasks.
Note: 70,000 new queries across domains (multiple-choice, true/false, and open-ended) train systematic reasoning (deductive, inductive, analogical, causal, statistical); data with incorrect answers or flawed reasoning is iteratively filtered out.
Cross-Lingual Transfer
To facilitate the transfer of the model's general capabilities across languages, we employ a translation model to convert instructions from high-resource languages into various low-resource languages, thereby generating corresponding response candidates. To ensure the accuracy and consistency of these responses, we evaluate the semantic alignment between each multilingual response and its original counterpart. This process preserves the logical structure and stylistic nuances of the original responses, thereby maintaining their integrity and coherence across different languages. Note: instructions are machine-translated from high-resource into low-resource languages to produce candidate responses, which are then checked for semantic alignment with the originals.
Robust System Instruction
We construct hundreds of general system prompts to improve the diversity of system prompts in post-training, ensuring consistency between system prompts and conversations. Evaluations with different system prompts show that the model maintains good performance (Lu et al., 2024b) and reduced variance, indicating improved robustness. Note: hundreds of general system prompts increase diversity; evaluation with varied system prompts shows stable performance and lower variance, i.e., better robustness.
Response Filtering
To evaluate the quality of responses, we employ multiple automatic annotation methods, including a dedicated critic model and a multi-agent collaborative scoring system. Responses are subjected to rigorous assessment, and only those deemed flawless by all scoring systems are retained. This comprehensive approach ensures that our outputs maintain the highest quality standards. Note: a dedicated critic model and a multi-agent collaborative scoring system rate every response; only responses judged flawless by all scorers are kept.
Ultimately, we construct a dataset of over 1 million SFT examples. The model is fine-tuned for two epochs with a sequence length of 32,768 tokens. To optimize learning, the learning rate is gradually decreased from 7 × 10⁻⁶ to 7 × 10⁻⁷. To address overfitting, we apply a weight decay of 0.1, and gradient norms are clipped at a maximum value of 1.0. Note: over 1 million SFT examples, two epochs at a 32,768-token sequence length, learning rate decayed from 7 × 10⁻⁶ to 7 × 10⁻⁷, weight decay 0.1, gradient-norm clipping at 1.0.
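The optimization setup translates into roughly the following PyTorch skeleton; the report gives the endpoints (7 × 10⁻⁶ → 7 × 10⁻⁷), the 0.1 weight decay, and the 1.0 grad-norm clip, while the optimizer choice, the linear decay shape, the stand-in model, and the step count here are assumptions.

```python
import torch

model = torch.nn.Linear(16, 16)            # stand-in for the LLM
total_steps = 1_000                        # hypothetical number of update steps

optimizer = torch.optim.AdamW(model.parameters(), lr=7e-6, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    # interpolate the LR multiplier from 1.0 down to 0.1 (i.e. 7e-6 -> 7e-7)
    lr_lambda=lambda step: 1.0 - 0.9 * min(step / total_steps, 1.0),
)

for step in range(total_steps):
    loss = model(torch.randn(4, 16)).pow(2).mean()   # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```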
Offline Reinforcement Learning
Compared to Online Reinforcement Learning (RL), Offline RL enables the pre-preparation of training signals, which is particularly advantageous for tasks where standard answers exist but are challenging to evaluate using reward models. In this study, we focus on objective query domains such as mathematics, coding, instruction following, and logical reasoning, where obtaining accurate evaluations can be complex. In the previous phase, we extensively employ strategies like execution feedback and answer matching to ensure the quality of responses. For the current phase, we reuse that pipeline, employing the SFT model to resample responses for a new set of queries. Responses that pass our quality checks are used as positive examples, while those that fail are treated as negative examples for Direct Preference Optimization (DPO) training (Rafailov et al., 2023). To further enhance the reliability and accuracy of the training signals, we make use of both human and automated review processes (Cao et al., 2024). This dual approach ensures that the training data is not only learnable but also aligned with human expectations. Ultimately, we construct a dataset consisting of approximately 150,000 training pairs. The model is then trained for one epoch using the Online Merging Optimizer (Lu et al., 2024a), with a learning rate of 7 × 10⁻⁷. Note: the SFT model resamples responses on new queries; execution feedback and answer matching decide the positives and negatives for DPO, human and automated review refine the signals, and the roughly 150,000 resulting pairs are trained for one epoch with the Online Merging Optimizer at a learning rate of 7 × 10⁻⁷.
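For reference, the DPO objective on such pairs can be sketched as below; the inputs are sequence log-probabilities under the policy and the frozen SFT reference, and β = 0.1 is a common default rather than the value the report actually used.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective (Rafailov et al., 2023) on per-example sequence log-probs."""
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# toy usage with random log-probabilities for a batch of 4 preference pairs
logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*logps))
```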
Online Reinforcement Learning
To develop a robust reward model for online RL, we adhere to a set of carefully defined labeling criteria. Those criteria ensure that the responses generated by the model are not only high-quality but also aligned with ethical and user-centric standards (Wang et al., 2024a). The specific guidelines for data labeling are as follows:
Truthfulness
Responses must be grounded in factual accuracy, faithfully reflecting the provided context and instructions. The model should avoid generating information that is false or unsupported by the given data.
Helpfulness
The model's output should be genuinely useful, addressing the user's query effectively while providing content that is positive, engaging, educational, and relevant. It should follow the given instructions precisely and offer value to the user.
Conciseness
Responses should be succinct and to the point, avoiding unnecessary verbosity. The goal is to convey information clearly and efficiently without overwhelming the user with excessive detail.
Relevance
All parts of the response should be directly related to the user's query, dialogue history, and the assistant's context. The model should tailor its output to ensure it is perfectly aligned with the user's needs and expectations.
Harmlessness
The model must prioritize user safety by avoiding any content that could lead to illegal, immoral, or harmful behavior. It should promote ethical conduct and responsible communication at all times.
Debiasing
The model should produce responses that are free from bias, including but not limited to gender, race, nationality, and politics. It should treat all topics equally and fairly, adhering to widely accepted moral and ethical standards.
The queries utilized to train the reward model are drawn from two distinct datasets: publicly available open-source data and a proprietary query set characterized by higher complexity. Responses are generated from checkpoints of the Qwen models, which have been fine-tuned using different methods—SFT, DPO, and RL—at various stages of training. To introduce diversity, those responses are sampled at different temperature settings. Preference pairs are created through both human and automated labeling processes, and the training data for DPO is also integrated into this dataset. Note: reward-model queries combine open-source data with a more complex proprietary set; responses are sampled at different temperatures from Qwen checkpoints fine-tuned with SFT, DPO, and RL at various stages, preference pairs come from human and automated labeling, and the DPO training data is reused here.
In our online reinforcement learning (RL) framework, we employ Group Relative Policy Optimization (GRPO, Shao et al., 2024). The query set utilized for training the reward model is identical to the one used in the RL training phase. The sequence in which queries are processed during training is determined by the variance of their response scores, as evaluated by the reward model. Specifically, queries with higher variance in response scores are prioritized to ensure more effective learning. We sample 8 responses for each query. All models are trained with a 2048 global batch size and 2048 samples in each episode, considering a pair of queries and responses as a sample. Note: GRPO is used; queries with higher variance in reward scores are prioritized, 8 responses are sampled per query, and training uses a global batch size of 2048 with 2048 samples per episode (one sample = one query-response pair).
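A sketch of the group-relative advantage computation at the heart of GRPO, together with the variance-based query prioritization described above; the shapes are illustrative, and the downstream clipped policy-gradient and KL terms of the full objective are omitted.

```python
import torch

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as in GRPO (Shao et al., 2024).

    `rewards` has shape (num_queries, group_size); here group_size = 8 responses
    per query, matching the text. Each reward is normalized against the other
    responses for the same query."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def prioritize_queries(rewards):
    """Order queries by descending reward variance, as described above."""
    return torch.argsort(rewards.var(dim=1), descending=True)

rm_scores = torch.randn(16, 8)          # reward-model scores: 16 queries x 8 samples
order = prioritize_queries(rm_scores)
adv = grpo_advantages(rm_scores[order])
print(adv.shape)                        # torch.Size([16, 8])
```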
Long Context Fine-tuning
To further extend the context length of Qwen2.5-Turbo, we introduce longer SFT examples during post-training, enabling it to better align with human preference in long queries.
Note: longer SFT examples are added in post-training so Qwen2.5-Turbo aligns better with human preferences on long queries.
In the SFT phase, we employ a two-stage approach. In the first stage, the model is fine-tuned exclusively using short instructions, each containing up to 32,768 tokens. This stage uses the same data and training steps as those employed for the other Qwen2.5 models, ensuring strong performance on short tasks. In the second stage, the fine-tuning process combines both short instructions (up to 32,768 tokens) and long instructions (up to 262,144 tokens). This hybrid approach effectively enhances the model's instruction-following ability in long context tasks while maintaining its performance on short tasks. Note: stage one fine-tunes only on short instructions (up to 32,768 tokens, the same data as the other Qwen2.5 models); stage two mixes short instructions with long instructions of up to 262,144 tokens.
During the RL stage, we use a training strategy similar to that used for the other Qwen2.5 models, focusing solely on short instructions. This design choice is driven by two primary considerations: first, RL training is computationally expensive for long context tasks; second, there is currently a scarcity of reward models that provide suitable reward signals for long context tasks. Additionally, we find that adopting RL on short instructions alone can still significantly enhance the model's alignment with human preferences in long context tasks. Note: RL stays on short instructions because long-context RL is computationally expensive and suitable long-context reward models are scarce; even so, short-instruction RL noticeably improves long-context alignment.