社区供稿｜阿里国际AI团队最新开源！探索面向开放性问题的推理模型 Marco-o1-CSDN博客

本文链接：https://blog.csdn.net/bagell/article/details/143989825

我们发布了最新的Marco-o1模型，Marco-o1不仅关注具有标准答案的学科（例如代码、数学等）领域，而且更加强调开放式问题的解决方案。我们的目标是解决：“o1这类模型能否有效的推广到难以量化且缺乏明确奖励的其他领域上”这一问题。

Arxiv：https://arxiv.org/abs/2411.14405

Github：https://github.com/AIDC-AI/Marco-o1

Hugging Face：https://huggingface.co/AIDC-AI/Marco-o1

我们的特色有：

**1. 使用了超长CoT数据进行微调。**我们通过self-play+MCTS构建了一批具备反思、改正能力的超长CoT数据。结合其他开源数据一同训练了Marco-o1-CoT。

**2. 使用MCTS扩展解空间。**在推理阶段，通过使用MCTS+reward引导我们的模型(Marco-o1-MCTS)扩大解空间，输出更优秀的结果。

**3. 细粒度解****空间扩展。**考虑到step级别依然具备较大的搜索细粒度，我们进一步的定义了mini-Step来进一步的扩大整个模型的解空间，引导并扩大我们的模型(Marco-o1-MCTS mini-Step)具备输出更优秀答案的可能性。

**4. 在翻译任务中应用。**此外，我们创新地使用大型推理模型（LRM）到翻译任务中，这对于一些长难句翻译具有良好的效果。也是第一次将Inference time scaling应用到机器翻译任务中。

**5. 开源。**我们开源了部分CoT数据与目前最好的模型，稍晚些我们还会开源更多的数据与模型。

模型会对response做更深入的思考，比如下图中，对于单词‘strawberry’中有多少个r，模型在输出中逐步拆解单词中的每一个字母并且比较，在输出最终结果之前对之前的所有步骤进行确认，最终正确的输出了‘The answer is 3’。

另一个翻译的case。在机器翻译领域，如何处理例如‘踩屎感’、‘光腿神器’等这种国内特色表达方式一直是一个难点。通过这样一条推理链路，模型正确的识别了踩屎感这个难点，逐词翻译，解决了整体的翻译准确性，也在一定程度上防止了漏翻的出现。

此外，我们还在其他领域进行了尝试，证明了该模型具备一定的解决其他通用现实问题的能力。

整体方法

Marco-o1的整体结构如下，我们通过self-play+MCTS构建了一批具备反思、改正能力的超长CoT数据，结合其他开源数据一同训练了Marco-o1-CoT。考虑到模型整体的指令遵循能力十分重要，我们还融入了MarcoPolo家族的一些指令遵循数据集，整体模型指令遵循能力得到了显著的提升。

数据构造

我们使用self-play的形式，引导模型进行例如计划、执行、反思、纠错、输出答案等行为（Action as a Action)，并使用MCTS进行搜索，当模型输出答案时，给予答案一个reward来更新这条MCTS搜索树上的V和N，也就是rollout。

具体来说，我们尝试了使用在初始prompt中引导模型执行例如计划、反思、纠错等行为。我们发现这对于一些较大模型（例如GPT4o、Qwen72B等）由于整体出色的模型能力和指令遵循能力，模型可以很好地完成既定指令。同时，我们还尝试了7B甚至1.5B模型的输出。模型也基本能够遵循我们设计好的PATTERN进行输出，经过MCTS的选择后，整体的输出基本满足了要求。

这里我们给出了我们的一个早期版本的prompt供大家参考：

Begin by split the task into serval sub-tasks within <sub-task> tags,` `then enclosing all thoughts within <thinking> tags, exploring multiple angles and``Continuously adjust your reasoning based on intermediate results and reflections, adapting your strategy as you progress.``The hypothesis is presented within <hypothesis>.``Regularly evaluate progress using <reflection> tags.``Be critical and honest about your reasoning process.``If unsure, backtrack and try a different approach, explaining your decision within <thinking> tags.``Output should be markdown format and tags can not nested.   For mathematical problems, show all work explicitly using LaTeX for formal notation and provide detailed proofs.``Explore multiple solutions individually if possible, comparing approaches in reflections.``Use thoughts as a scratchpad, writing out all calculations and reasoning explicitly.``Synthesize the final answer within <answer> tags, providing a clear, concise summary.``Conclude with a final reflection on the overall solution, discussing effectiveness, challenges, and solutions.``\n\nNow the Question is:\n4只小鸭子在一个大的圆形水池中，分别随机的出现在圆圈中的任意一点。4只鸭子出现在 同一半圆的概率是多少？

同时，我们还在MCTS搜索的中间引入了一些关键句进一步引导模型动作。比如在某个具体行为的搜索后主动拼接一些特征词，例如Wait! Maybe这类句子或者短语，强化模型对于例如标签的理解以及约束后续行为。

另外，我们认为模型的指令遵循能力对于o1来说依然重要（比如告诉模型你的计算预算，你需要在多少个token内输出答案，这对于简单问题的快速输出具有现实意义），因此我们还添加了指令遵循数据集强化指令遵循能力，还添加了少量常规sft数据作为playback使用。这部分数据直接使用我们MarcoPolo系列的数据集。

除此之外，MCTS进行数据搜索整体上的效率较低，成本较大，因此我们还对开源数据（例如Open O1）进行了清洗过滤。这些过滤包括启发式和llm as judge等，过滤掉一些低质量以及我们认为对我们的模型没有帮助的数据。

这三组数据的混合构成了第一版Marco-o1的训练数据。由于时间关系，我们并没有进行特别细致的消融实验。不过我们人工对MCTS输出的数据进行了细致筛查，整体质量较高，我们相信会显著提升模型相关性能的。

最后，我们观察到实际上OpenAI o1的行为不止这几个，如何找到并设计这些行为是一个很大的挑战，并且为这些行为的行为树设计也是一个挑战（比如不可以直接）。有一些信息表明OpenAI为o1设计了数千个规则来引导模型进行搜索，这也会是我们后续的一个方向。

逻辑推理

推理阶段，我们设计了一个基于模型prob的reward来引导MCTS的搜索。但是我们观察到使用Action作为MCTS搜索细粒度是较粗的，往往导致较难搜索出正确答案——这一步在数据构造阶段影响有限，因为数据构造阶段是有ground truth信号作为引导的，但是推理阶段使用经过特殊设计的porb影响是显著的。因此我们进一步拆解了MCTS的搜索粒度，我们设计了3种粒度。

一种是类似于Verify Step by Step的以Step为单位进行搜索，每一个MCTS节点内为一句话（或者一个Action标签），此外，我们还特别的对Step也进行了拆分，我们称之为mini-Step，也就是固定每64/32个token为一个”step“，进一步扩大模型的搜索空间。

我们也发现，我们设计的基于prob的reward设计具备较多局限性，我们观察到Llama-o1的基于prob的reward设计似乎是一个更优秀的方案。后续我们将持续探索其他reward设计以及Reward Model的训练与开源。

对于reward设计的更详细的说明欢迎移步我们的Github。

结果展示

我们的Marco-o1-CoT模型相比baseline有了一定的提升。其中mgsm-zh的性能下降是由于我们使用了中文CoT进行推理——我们认为推理路径遵循源语言是更容易被使用者所理解的。但是训练数据中中文CoT数据极少，这可能导致了性能下降。不过这部分性能下降在之后的MCTS搜索中得到了恢复。

在使用mini-Step进一步扩大解空间后，观察到了性能有一些波动。经过观察整个搜索树的结果，我们发现这主要是由于我们使用的reward计算方案导致的搜索路径选择错误。正如之前我们所说的，我们认为使用mini-Step继续扩大模型的解空间是有实际意义的，因此我们后续将使用更优秀的reward或者Reward Model来引导MCTS的搜索。

左侧为step level的MCTS，结果错误。右侧为mini-Step 32的MCTS，可以看到在value increase 150%后出现了错误。step level较难避免这类错误，而mini-Step的可以通过下一次搜索在一定程度上规避这一问题。

使用方法

目前模型已经上传到huggingface，只需要简单代码即可使用。考虑到模型输出较长，因此我们还提供了一个vllm版本供大家使用。

import torch``from typing import List, Dict, Tuple``from transformers import AutoModelForCausalLM, AutoTokenizer``   ``   ``def load_model_and_tokenizer(path):`    `tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)`    `model = AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True).to('cuda:0')`    `model.eval()`    `return tokenizer, model``   ``   ``def generate_response(model, tokenizer,`                      `input_ids, attention_mask,`                      `max_new_tokens=4096):`    `generated_ids = input_ids`    `with torch.inference_mode():`        `for _ in range(max_new_tokens):`            `outputs = model(input_ids=generated_ids, attention_mask=attention_mask)`            `next_token_id = torch.argmax(outputs.logits[:, -1, :], dim=-1).unsqueeze(-1)`            `generated_ids = torch.cat([generated_ids, next_token_id], dim=-1)`            `attention_mask = torch.cat([attention_mask, torch.ones_like(next_token_id)], dim=-1)`            `new_token = tokenizer.decode(next_token_id.squeeze(), skip_special_tokens=True)`            `print(new_token, end='', flush=True)`            `if next_token_id.item() == tokenizer.eos_token_id:`                `break`    `return tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)``   ``   ``def chat(model, tokenizer):`    `history: List[Dict[str, str]] = []`    `print("Enter 'q' to quit, 'c' to clear chat history.")`    `while True:`        `user_input = input("User: ").strip().lower()`        `if user_input == 'q':`            `print("Exiting chat.")`            `break`        `if user_input == 'c':`            `print("Clearing chat history.")`            `history.clear()`            `continue`        `if not user_input:`            `print("Input cannot be empty.")`            `continue``   `        `history.append({"role": "user", "content": user_input})`        `text = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)`        `model_inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096).to('cuda:0')``   `        `print('Assistant:', end=' ', flush=True)`        `response = generate_response(model, tokenizer, model_inputs.input_ids, model_inputs.attention_mask)`        `print()`        `history.append({"role": "assistant", "content": response})``   ``   ``def main():`    `path = "AIDC-AI/Marco-o1"`    `tokenizer, model = load_model_and_tokenizer(path)`    `print('Starting chat.')`    `chat(model, tokenizer)``   ``   ``if __name__ == "__main__":`    `main()