Example code for building LangChain applications: 51. How to implement multi-modal retrieval-augmented generation (RAG) with Chroma

Article link: https://blog.csdn.net/wangjiansui/article/details/140157977

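Chroma multi-modal RAG

Many documents contain a mixture of content types, including text and images.

Yet in most RAG applications, the information captured in images is lost.

With the emergence of multi-modal LLMs such as GPT-4V, it is worth considering how to make use of images in RAG: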

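Option 1:

Use multi-modal embeddings (such as CLIP) to embed images and text
Retrieve both images and text using similarity search
Pass the raw images and text chunks to a multi-modal LLM for answer synthesis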

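Option 2:

Use a multi-modal LLM (such as GPT-4V, LLaVA, or FUYU-8b) to produce text summaries from images
Embed and retrieve the text
Pass the text chunks to an LLM for answer synthesis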

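Option 3:

Use a multi-modal LLM (such as GPT-4V, LLaVA, or FUYU-8b) to produce text summaries from images
Embed and retrieve the image summaries, each with a reference to the raw image
Pass the raw images and text chunks to a multi-modal LLM for answer synthesis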

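This article focuses on Option 1.

We will use Unstructured to parse images, text, and tables from documents (PDFs).
We will use OpenCLIP multi-modal embeddings.
We will use Chroma, which supports multi-modal data.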
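
To make the Option 1 flow concrete, here is a minimal, library-free sketch of its three steps: embed images and text into one vector space, retrieve by similarity, and split the hits into images and text for a multi-modal LLM. It is an illustration only. The corpus, the hand-written 3-dimensional vectors, and the helper names (cosine_similarity, retrieve, build_prompt_parts) are all hypothetical stand-ins, not part of OpenCLIP, Chroma, or LangChain.

```python
import math

# Hypothetical corpus. A real multi-modal embedder such as CLIP would compute
# these vectors from image pixels or text; here they are written by hand.
CORPUS = [
    {"kind": "image", "ref": "figure-1.jpg", "vector": [0.9, 0.1, 0.0]},
    {"kind": "text", "ref": "caption for figure 1", "vector": [0.8, 0.2, 0.1]},
    {"kind": "text", "ref": "unrelated paragraph", "vector": [0.0, 0.1, 0.9]},
]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vector, corpus, k=2):
    """Return the k corpus items most similar to the query vector."""
    ranked = sorted(
        corpus,
        key=lambda item: cosine_similarity(query_vector, item["vector"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt_parts(question, hits):
    """Separate retrieved images from retrieved text for a multi-modal LLM."""
    return {
        "question": question,
        "images": [h["ref"] for h in hits if h["kind"] == "image"],
        "texts": [h["ref"] for h in hits if h["kind"] == "text"],
    }


# Hand-written query vector standing in for the embedded question.
query_vector = [1.0, 0.0, 0.0]
hits = retrieve(query_vector, CORPUS, k=2)
parts = build_prompt_parts("What does figure 1 show?", hits)
print(parts)
```

Because images and text share one vector space, a single similarity search can return both kinds of item, which is what distinguishes Option 1 from the summary-based Options 2 and 3.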

Packages

For unstructured, you will also need poppler (installation instructions) and tesseract (installation instructions) on your system.

! pip install -U langchain openai chromadb langchain-experimental # (newest versions required for multi-modal)

# Lock to 0.10.19 due to a persistent bug in more recent versions
! pip install "unstructured[all-docs]==0.10.19" pillow pydantic lxml matplotlib tiktoken open_clip_torch torch
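
Since poppler and tesseract are system tools rather than Python packages, pip will not install them. A quick way to check for them before parsing is sketched below. It assumes the tools are exposed on PATH as pdftoppm (one of the command-line programs shipped with poppler) and tesseract; the helper name missing_tools is hypothetical.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


# Assumption: poppler is detected via its pdftoppm binary,
# and tesseract via its command-line program of the same name.
REQUIRED = ["pdftoppm", "tesseract"]

missing = missing_tools(REQUIRED)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("poppler and tesseract both look available")
```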

Data loading

Partition PDF text and images

Let's look at example PDFs that contain interesting images.

1/ Art from the J Paul Getty Museum:

Here is a zip file with the PDF and the already extracted images.
https://www.getty.edu/publications/resources/virtuallibrary/0892360224.pdf

2/ Famous photographs from the Library of Congress:

https://www.loc.gov/lcm/pdf/LCM_2020_1112.pdf
We will use this as the example below.

We can use partition_pdf from Unstructured, shown below, to extract text and images.

To have it extract the images, set:

extract_images_in_pdf=True

If you are using the zip file, which already contains the extracted images, you can process the text only by setting:

extract_images_in_pdf=False
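
One way to pick between the two settings automatically is to check whether the working folder already holds extracted images. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the extracted images are saved as .jpg files, which may not match your setup.

```python
import os
import tempfile


def has_extracted_images(folder, extension=".jpg"):
    """True if `folder` exists and holds at least one file ending in `extension`."""
    if not os.path.isdir(folder):
        return False
    return any(name.lower().endswith(extension) for name in os.listdir(folder))


def choose_extract_flag(folder):
    """Suggest a value for extract_images_in_pdf: extract only if no images exist yet."""
    return not has_extracted_images(folder)


# Demonstration in a throwaway directory (the file name is made up).
with tempfile.TemporaryDirectory() as demo_dir:
    flag_when_empty = choose_extract_flag(demo_dir)
    open(os.path.join(demo_dir, "figure-1.jpg"), "w").close()
    flag_with_image = choose_extract_flag(demo_dir)

print(flag_when_empty, flag_with_image)
```

The returned value could then be passed as extract_images_in_pdf when calling partition_pdf.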

# Folder with pdf and extracted images
path = "/Users/rlm/Desktop/photos/"
# Extract images, tables, and chunk text
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename=path + "photos.pdf",
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)
# Categorize text elements by type
tables = []
texts = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        tables.append(str(element))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        texts.append(str(element