【Graphrag】解决多模态模型对多个PPT识别效果不佳的方案

原创已于 2025-08-21 15:44:59 修改 · 620 阅读

8 ·

CC 4.0 BY-SA版权

文章标签：

#python #AI #RAG #graphrag

于 2025-08-21 15:34:01 首次发布

1. 背景与现有问题

在企业知识管理与对比分析中，常见的问题包括：

多模态识别不准：直接让模型识别 PPT 图片/图表，容易漏掉关键信息。

跨文档对比难：单文档 QA 尚可，但跨多个企业 PPT 抽取共性/差异效果差。

原因：传统 RAG（Retrieval-Augmented Generation）存在以下不足：

碎片化检索：传统 RAG 依赖切分段落 + 向量相似度，难以抓住不同文档间的联系，结果往往只停留在局部信息。

缺乏结构化语义：实体、关系等语义信息没有显式建模，模型难以回答“不同企业的共同点与差异”这类跨文档问题。

可扩展性不足：当文档数量增加时，纯向量检索效率下降，容易漏召或召回不准。

这些问题严重影响了效率与可用性。

2. 方案与实现路径

2.1 关键思路

本方案基于 微软 GraphRAG 框架

 https://github.com/microsoft/graphrag.git

结合知识图谱和语义检索，实现高效、可扩展的多 PPT 对比分析：

知识图谱增强：不仅存储段落，还抽取企业、技术、战略等实体及关系，构建图谱。

社区级总结：通过社区检测生成高层次主题报告，支持跨 PPT 的战略对比。

索引复用与增量更新：避免每次全量重建，支持快速查询和文档迭代。

可解释性：回答中可追溯至具体实体与文档出处，增强结果可信度。

2.2 模型选择

在 config.yaml 中完成了如下配置：

对话模型（Chat Model）

类型：openai_chat（OpenAI 兼容模式）

API 地址：https://dashscope.aliyuncs.com/compatible-mode/v1

模型名称：qwen-plus

用途：负责知识抽取、摘要、报告生成、跨文档对比等任务。
向量模型（Embedding Model）

类型：openai_embedding（OpenAI 兼容模式）

API 地址：https://dashscope.aliyuncs.com/compatible-mode/v1

模型名称：text-embedding-v3

用途：用于文本切分后的向量化表示，支持局部检索与相似度计算。
调用方式
- 授权方式：auth_type: api_key，通过环境变量 DASHSCOPE_API_KEY 管理密钥。
- 并发设置：concurrent_requests: 25
- 容错机制：retry_strategy: native + max_retries: 10，保障稳定性。
- 结合微软 GraphRAG 与阿里云模型，方案具备以下特点：
- 高效建图：利用通义千问对中文文本的强理解力，提升实体关系抽取的准确度。
- 多模态兼容：支持 PPT 文本提取后进行知识图谱构建，利于跨文档分析。
- 低延迟查询：在国内调用阿里云 API，响应速度明显快于跨境调用。

流程图：

配置代码：

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

models:
  default_chat_model:
    type: openai_chat # or azure_openai_chat
    api_base: https://dashscope.aliyuncs.com/compatible-mode/v1
    # api_version: 2024-05-01-preview
    auth_type: api_key # or azure_managed_identity
    api_key: ${DASHSCOPE_API_KEY} # set this in the generated .env file
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: qwen-plus
    # deployment_name: <azure_model_deployment_name>
    encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: 10                   # set to -1 for dynamic retry logic (most optimal setting based on server response)
    tokens_per_minute: auto             # set to 0 to disable rate limiting
    requests_per_minute: auto            # set to 0 to disable rate limiting
  default_embedding_model:
    type: openai_embedding # or azure_openai_embedding
    api_base: https://dashscope.aliyuncs.com/compatible-mode/v1
    # api_version: 2024-05-01-preview
    auth_type: api_key # or  azure_managed_identity
    api_key: ${DASHSCOPE_API_KEY}
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: text-embedding-v3
    # deployment_name: <azure_model_deployment_name>
    encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: 10                   # set to -1 for dynamic retry logic (most optimal setting based on server response)
    tokens_per_minute: auto             # set to 0 to disable rate limiting
    requests_per_minute: auto            # set to 0 to disable rate limiting
 
vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output\lancedb
    container_name: default
    overwrite: True
 
embed_text:
  batch_size: 10
  model_id: default_embedding_model
  vector_store_id: default_vector_store
 
### Input settings ###
 
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$$"
 
chunks:
  size: 200
  overlap: 50
  group_by_columns: [id]
 
### Output settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided
 
cache:
  type: file # [file, blob, cosmosdb]
  base_dir: "cache"
 
reporting:
  type: file # [file, blob, cosmosdb]
  base_dir: "logs"
 
output:
  type: file # [file, blob, cosmosdb]
  base_dir: "output"
 
### Workflow settings ###
 
extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1
 
summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500
 
extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]
 
extract_claims:
  enabled: false
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1
 
community_reports:
  model_id: default_chat_model
  graph_prompt: "prompts/community_report_graph.txt"
  text_prompt: "prompts/community_report_text.txt"
  max_length: 2000
  max_input_length: 8000
 
cluster_graph:
  max_cluster_size: 10
 
embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
 
umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)
 
snapshots:
  graphml: false
  embeddings: false
 
### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query
 
local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/local_search_system_prompt.txt"
 
global_search:
  chat_model_id: default_chat_model
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"
 
drift_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"
 
basic_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/basic_search_system_prompt.txt"

3.模型测试

初次测试采用纯文本形式，导入了5000字短篇小说，测试模型理解能力测试结果如下：

提问：

graphrag query --root ./ragtest --method global --query "文章核心思想是什么"

初始化创建知识图谱5000字耗时20分钟，消耗约30万token

4.优化

微软 GraphRAG 在 create 阶段每次都会 重新跑 ingest → 抽取实体 → 构图，所以特别慢

解决方案：

1) 一次性建好工程与索引

# 1) 初始化工程
graphrag init --root .\projects\ppt_compare

# 2) 把多个企业的 PPT 转成 txt / csv / json，放在：
#    .\projects\ppt_compare\input\
#    （文件名里最好包含公司名，方便后续区分）

# 3) 首次建索引（标准或快速）
graphrag index --root .\projects\ppt_compare -m standard -o .\projects\ppt_compare\output
# 或者快速模式
graphrag index --root .\projects\ppt_compare -m fast -o .\projects\ppt_compare\output

存到制定的输出目录。后续查询只要指向这个目录即可，无需重跑索引。

2) 复用已有图直接问（无需重建）

#全局搜索
graphrag query -m global 
  -q "对比A公司与B公司的战略异同，并指出双方在AI投入、定价与渠道上的主要差异" 
  -d .\projects\ppt_compare\output

#局部搜索
graphrag query -m local 
  -q "A公司在2024年Roadmap里的核心里程碑与负责人是谁？" 
  -d .\projects\ppt_compare\output

3) 新增PPT只做增量，不重建

graphrag update --root .\projects\ppt_compare 
  -m standard-update 
  -o .\projects\ppt_compare\update_output

查询问题代码改成：

graphrag query -m global
  -q "加入新PPT后，三家企业的生成式AI商业化路径如何变化？" 
  -d .\projects\ppt_compare\update_output

如此可以大幅提高graphrag效率。

后续仍有对比更新