使用Facebook Messenger数据进行模型微调

最新推荐文章于 2025-07-24 23:58:38 发布

qq_37836323

最新推荐文章于 2025-07-24 23:58:38 发布

阅读量385

点赞数 3

CC 4.0 BY-SA版权

文章标签： facebook 服务器运维

本文链接：https://blog.csdn.net/qq_29929123/article/details/149251249

在AI应用开发中，充分利用已有的聊天数据进行模型微调是一个提高模型性能的关键技术。本文将详细介绍如何从Facebook Messenger加载数据，并将其转换为可供模型微调的格式，最终应用到LangChain中。我们将通过具体的代码演示来实现这一过程。

## 技术背景介绍

Facebook Messenger聊天数据包含丰富的人机交互示例，其结构适合用于语言模型的微调。通过OpenAI的API，我们可以将这些聊天记录进行格式化处理和模型微调，从而提升模型在特定应用场景下的表现。

## 核心原理解析

1. **数据下载与解压**：下载Facebook Messenger数据，并解压为JSON格式，以便后续处理。
2. **数据加载与转换**：使用特定的加载器将数据转换为模型微调所需的格式。
3. **消息合并与角色映射**：将连续消息进行合并并映射到"AIMessage"类，以匹配OpenAI微调格式。
4. **数据微调与应用**：上传转换后的数据到OpenAI进行模型微调，之后在LangChain中使用微调后的模型。

## 代码实现演示(重点)

### 1. 数据下载与解压

```python
import zipfile
import requests

def download_and_unzip(url: str, output_path: str = "file.zip") -> None:
    file_id = url.split("/")[-2]
    download_url = f"https://drive.google.com/uc?export=download&id={file_id}"

    response = requests.get(download_url)
    if response.status_code != 200:
        print("Failed to download the file.")
        return

    with open(output_path, "wb") as file:
        file.write(response.content)
        print(f"File {output_path} downloaded.")

    with zipfile.ZipFile(output_path, "r") as zip_ref:
        zip_ref.extractall()
        print(f"File {output_path} has been unzipped.")

download_and_unzip("https://drive.google.com/file/d/1rh1s1o2i7B-Sk1v9o8KNgivLVGwJ-osV/view?usp=sharing")

2. 数据加载与转换

from langchain_community.chat_loaders.facebook_messenger import SingleFileFacebookMessengerChatLoader

loader = SingleFileFacebookMessengerChatLoader(
    path="./hogwarts/inbox/HermioneGranger/messages_Hermione_Granger.json",
)

chat_session = loader.load()[0]
print(chat_session["messages"][:3])

3. 消息合并与角色映射

from langchain_community.chat_loaders.utils import map_ai_messages, merge_chat_runs

merged_sessions = merge_chat_runs(loader.load())
alternating_sessions = list(map_ai_messages(merged_sessions, "Harry Potter"))

print(alternating_sessions[0]["messages"][:3])

4. 数据微调与应用

import openai
from io import BytesIO
import json
import time

client = openai.OpenAI(
    base_url='https://yunwu.ai/v1',
    api_key='your-api-key'
)

# Convert messages
training_data = convert_messages_for_finetuning(alternating_sessions)

# Write the jsonl file in memory
my_file = BytesIO()
for m in training_data:
    my_file.write((json.dumps({"messages": m}) + "\n").encode("utf-8"))
my_file.seek(0)

training_file = client.files.create(file=my_file, purpose="fine-tune")

# Fine-tune the model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
# Wait for the job to complete
status = client.fine_tuning.jobs.retrieve(job.id).status
while status != "succeeded":
    time.sleep(5)
    status = client.fine_tuning.jobs.retrieve(job.id).status

model = job.fine_tuned_model