Survey of Open-Source Projects for Document-Based Quick Q&A

Article link: https://blog.csdn.net/weixin_37838725/article/details/144910460

                    
                        
                    
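Project		ChatFiles	h2oGPT	pdfGPT
Overview		A front-end project that uses Supabase for its back-end services and database; each chat window corresponds to one document.	An open-source project under the Apache License 2.0. Feature-rich: supports querying and summarizing private documents and chatting with a local, private GPT-style LLM, with many configuration options.	A lightweight PDF chat tool. Document data is stored on the local server; at chat time the document is split and vectorized and similarity is computed. Vectors are held in memory and are not persisted.
Demo URL		https://lkt-chatfiles.vercel.app/zh	https://gpt.h2o.ai/	https://bhaskartripathi-pdfgpt-turbo.hf.space/
Programming language		TypeScript	Python	Python
Supported file formats		Word, PDF, TXT, Markdown	PDF, Excel, Word, images, text, Markdown, etc.	PDF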
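Core flow	Document upload	Uploaded to the local server	Uploaded to the local server	Uploaded to the local server
	Document splitting	Uses LangChain's text splitter, splitting by character count	Uses LangChain's text splitter, splitting by character count	Uses its own splitting routine to cut the text into chunks of a specified length; the logic is similar to LangChain's text splitter
	Document vectorization	Vectorizes text through the OpenAI API	Vectorizes text with an embedding model (instructor-large or all-MiniLM-L6-v2)	Vectorizes text with Google's Universal Sentence Encoder
	Similarity matching	Obtains a Supabase vector store object to save vectors and run similarity search; similarity is computed from cosine distance.	Uses the Chroma or Weaviate vector database to save vectors and run similarity search; the similarity score is computed from cosine distance.	Uses the k-nearest neighbors algorithm to find similar chunks; the underlying metric is Euclidean distance.
	Text recall	The matching text chunks are returned by the similarity search	The matching text chunks are returned by the similarity search	The matching text chunks are returned by the similarity search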
	Prompt assembly	Relevant context, user question, and system prompt (The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.). Not user-configurable.	Relevant context, user question, a system Pre-Prompt (added before the context), and a system Prompt (added after the context). User-configurable.	Relevant context, user question, and system prompt (Instructions: Compose a comprehensive reply to the query using the search results given. Cite each reference using [Page Number] notation (every result has this number at the beginning). Citation should be done at the end of each sentence. If the search results mention multiple subjects with the same name, create separate answers for each. Only include information found in the results and don't add any additional information. Make sure the answer is correct and don't output false content. If the text does not relate to the query, simply state 'Text Not Found in PDF'. Ignore outlier search results which has nothing to do with the question. Only answer what is asked. The answer should be short and concise. Answer step-by-step.). Not user-configurable.
	Querying the LLM	Supports OpenAI and Azure OpenAI LLMs	Supports LLaMa2, Mistral, Falcon, Vicuna, WizardLM.	Supports OpenAI LLMs
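
All three projects follow the same core flow: split, vectorize, match, recall, assemble the prompt. The minimal Python sketch below walks through those steps using only the standard library. It is an illustration under stated assumptions, not code taken from any of the three projects: the function names (split_text, embed, retrieve, build_prompt) are hypothetical, and the word-count "embedding" is a toy stand-in for a real embedding model such as the OpenAI API, instructor-large, or Universal Sentence Encoder.

```python
import math
from collections import Counter


def split_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters (hypothetical helper).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts.

    Assumption: a real system would call an embedding model here.
    """
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def euclidean_distance(a: Counter, b: Counter) -> float:
    """Euclidean distance, the metric the table lists for the k-NN approach."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a.keys() | b.keys()))


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks ranked by cosine similarity to the question."""
    q_vec = embed(question)
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(q_vec, embed(c)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(system_prompt: str, context_chunks: list[str], question: str) -> str:
    """Assemble system prompt, recalled context, and user question."""
    context = "\n\n".join(context_chunks)
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    docs = [
        "Chroma stores vectors on disk.",
        "Cats sleep most of the day.",
        "Vectors are compared by cosine distance.",
    ]
    question = "how are vectors compared"
    recalled = retrieve(question, docs, top_k=1)
    print(build_prompt("Answer only from the context.", recalled, question))
```

Swapping cosine_similarity for euclidean_distance (and sorting in ascending order) gives a brute-force version of the k-nearest-neighbors matching listed for pdfGPT. The string returned by build_prompt is what would then be sent to the LLM.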