5.3.3 Character Text Splitter
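In LangChain, CharacterTextSplitter is a simple text splitter: it splits text on a single separator string and measures chunk size in characters. It suits cases where long text only needs to be cut into smaller pieces for processing, with no regard to semantics or structure. By default, CharacterTextSplitter splits on the separator "\n\n" (a blank line), and chunk length is measured as the number of characters. The following example uses CharacterTextSplitter to split the text file 123.txt.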
Example 5-10: Splitting text with CharacterTextSplitter (source path: codes\5\Fen04.py)
The implementation of the example file Fen04.py is as follows.
from langchain_text_splitters import CharacterTextSplitter
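# Read the long text file
with open("123.txt") as f:
    state_of_the_union = f.read()
# Create a CharacterTextSplitter instance
text_splitter = CharacterTextSplitter(
    separator="\n\n",          # separator: two consecutive newlines
    chunk_size=1000,           # maximum chunk size: 1000 characters
    chunk_overlap=200,         # up to 200 characters of overlap between adjacent chunks
    length_function=len,       # measure length with len (character count)
    is_separator_regex=False,  # treat the separator as a literal string, not a regex
)
# Use the splitter to create a list of documents
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])  # print the first chunk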
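The code above first reads a long text file named "123.txt" and then uses the CharacterTextSplitter class from langchain_text_splitters to split the text into chunks. The splitter is initialized with two consecutive newlines (\n\n) as the separator, a maximum chunk size of 1000 characters, and an overlap of up to 200 characters between adjacent chunks. Length is computed with len, and the separator is treated as a literal string rather than a regular expression. Finally, create_documents splits the text into a list of documents, and the first chunk is printed. Because CharacterTextSplitter cuts only at the separator, a stretch of text that contains no blank line stays in one chunk even when it exceeds chunk_size, which is presumably why the chunk printed below is longer than 1000 characters. Running the code prints: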
page_content='LangChain is a framework dedicated to building language model applications, offering various tools and methods to enhance the performance of language models. Among these, Retrieval-Augmented Generation (RAG) is a crucial component.
Basic Principles
Retrieval Phase: Before generating text, relevant fragments or information related to the current task or input content are retrieved from an external knowledge base (such as a collection of documents or a database). This retrieval process typically employs information retrieval techniques like inverted indexing and vector search, calculating the similarity between the input query vector and the vectors of documents in the knowledge base to identify the most relevant document fragments.
Generation Phase: The retrieved relevant information is provided to the language model as context or prompt. The language model then generates more accurate, rich, and targeted text content by incorporating these additional contextual details.
Advantages
Knowledge Up-to-dateness: The knowledge of a language model is based on its training data and becomes relatively fixed once training is complete. RAG introduces the latest information by searching the most recent external knowledge base, ensuring that the generated text reflects the most up-to-date knowledge and facts, thus avoiding the issue of outdated knowledge in language models.
Improved Accuracy: When dealing with tasks that require specific domain knowledge or accurate factual support, relying solely on the language model's inherent knowledge may not guarantee the accuracy of the generated content. RAG enhances accuracy by providing more precise context from relevant external knowledge, enabling the generation of more reliable text.
Content Richness: The retrieved information offers additional details and perspectives to the language model, resulting in more diverse and comprehensive text content. This prevents the issue of overly simplistic or templated content that might arise from relying solely on the language model.
Application Scenarios
Question-Answering Systems: By retrieving relevant document fragments, more accurate and detailed answers can be provided to user questions. For example, a medical Q&A system can search medical literature and case studies to offer more professional medical advice.
Content Creation: Assisting creators in obtaining inspiration and background information during writing. For instance, when writing a news report, relevant news events, data, and expert opinions can be retrieved to enrich the article's content.
Smart Customer Service: Quickly retrieving relevant knowledge base content in response to customer inquiries allows for more accurate and targeted answers, enhancing customer satisfaction.'
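To make the split-at-separator behavior concrete, here is a minimal pure-Python sketch of the idea. It is not LangChain's actual implementation: the function name pack_by_separator is hypothetical, and overlap handling is deliberately left out. The sketch cuts the text at the separator and greedily packs the pieces into chunks of at most chunk_size characters. A single piece longer than chunk_size is kept whole, which mirrors how a text without blank lines can come back as one oversized chunk.

```python
def pack_by_separator(text, separator="\n\n", chunk_size=1000):
    """Simplified sketch: cut at `separator`, then pack pieces into chunks.

    Assumption: this only approximates the behavior described above; it
    ignores chunk_overlap and is not the library's real algorithm.
    """
    # Cut only at the separator and drop empty pieces.
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # Either the chunk is still empty (an oversized piece is kept
            # whole) or the piece still fits in the current chunk.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
packed = pack_by_separator(sample, chunk_size=40)
for chunk in packed:
    print(repr(chunk))
```

The first two paragraphs fit together within 40 characters, so they share a chunk, and the third starts a new one.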
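When processing text, sometimes you simply want to split it into chunks and do not care about any other information attached to the document. Splitting text directly means passing the text to the text splitter without any metadata. The following example splits a string directly with CharacterTextSplitter.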
Example 5-11: Splitting text directly with CharacterTextSplitter (source path: codes\5\Fen05.py)
The implementation of the example file Fen05.py is as follows.
from langchain_text_splitters import CharacterTextSplitter
# Create a CharacterTextSplitter instance
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
# Define the text to split
state_of_the_union = " LangChain is a framework dedicated to building language model applications"
# Split the text with CharacterTextSplitter
texts = text_splitter.create_documents([state_of_the_union])
# Print the result
print(texts[0])  # print the first chunk
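The code above creates a CharacterTextSplitter instance and then calls create_documents to split the text directly. LangChain cuts the text into chunks according to the specified parameters and stores the result in the texts list. The input here is far shorter than chunk_size, so it comes back as a single chunk, with the leading space stripped. Running the code prints: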
page_content='LangChain is a framework dedicated to building language model applications'
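Both examples above pass chunk_overlap=200, but neither output shows what overlap looks like. The following sketch illustrates the general idea with a plain character-level sliding window. This is a simplification under stated assumptions: the function name sliding_chunks is hypothetical, and LangChain builds its overlap from whole pieces between separators, not from fixed character windows.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where neighbors share `chunk_overlap` characters.

    Assumption: a character-level illustration of overlap only, not the
    piece-based merging that a separator-driven splitter performs.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not fully contained in the previous one.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


windows = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
for window in windows:
    print(window)
```

Each window repeats the last two characters of the previous one, so content that straddles a boundary appears intact in at least one chunk. That is the motivation for using a non-zero overlap when chunks are later retrieved independently.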
5.3.4 Code Splitter
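In LangChain, RecursiveCharacterTextSplitter can split code snippets written in different programming languages. The following code prints the languages that are currently supported.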
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter
# List all supported programming languages
print([e.value for e in Language])
The exact list depends on the installed version of langchain_text_splitters. Running the code prints:
['cpp', 'go', 'java', 'kotlin', 'js', 'ts', 'php', 'proto', 'python', 'rst', 'ruby', 'rust', 'scala', 'swift', 'markdown', 'latex', 'html', 'sol', 'csharp', 'cobol', 'c', 'lua', 'perl']
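To split program code with RecursiveCharacterTextSplitter, first set the programming language through the Language enum, then pass a code snippet to the splitter. The splitter cuts the code into chunks using separators predefined for that language. For example, splitting Python code works as follows.
(1) Create a RecursiveCharacterTextSplitter with the programming language set to Language.PYTHON.
(2) Pass a Python code snippet to the splitter.
(3) The splitter cuts the Python code into chunks using the predefined Python separators.
The following example shows how to split a Python code snippet with RecursiveCharacterTextSplitter.
Example 5-12: Splitting a Python code snippet with RecursiveCharacterTextSplitter (source path: codes\5\Fen06.py)
The implementation of the example file Fen06.py is as follows.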
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter
# Define the Python source text
python_code = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
# Create a Python-aware splitter with RecursiveCharacterTextSplitter.from_language
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
# Split the Python source text
python_docs = python_splitter.create_documents([python_code])
# Print the resulting documents
for doc in python_docs:
    print(doc.page_content)
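The code above loops over the document list python_docs with a for loop and prints the content of each document. Because the snippet is longer than chunk_size=50, the splitter returns more than one document (here the function definition and the call), and the loop prints them one after another. Running the code prints: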
def hello_world():
    print("Hello, World!")
# Call the function
hello_world()
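The idea of trying language-specific separators from coarsest to finest can be sketched in plain Python. The following is a simplified illustration, not LangChain's implementation: the names PYTHON_SEPARATORS, merge_pieces, and recursive_split are hypothetical, and the separator list is an assumption loosely modeled on the boundaries a Python-aware splitter would prefer (classes, functions, blank lines, lines, spaces).

```python
# Hypothetical separator priority for Python source, coarsest first.
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " "]


def merge_pieces(pieces, chunk_size):
    """Greedily concatenate adjacent pieces while the result fits in chunk_size."""
    merged, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > chunk_size:
            merged.append(current)
            current = piece
        else:
            current += piece
    if current:
        merged.append(current)
    return merged


def recursive_split(text, separators, chunk_size):
    """Split at the coarsest separator present, recursing on oversized parts."""
    if len(text) <= chunk_size:
        return [text]
    for i, sep in enumerate(separators):
        if sep in text:
            head, *rest = text.split(sep)
            # Keep the separator at the front of each later part so no text is lost.
            parts = [head] + [sep + part for part in rest]
            pieces = []
            for part in parts:
                # Only finer separators are tried on the sub-parts.
                pieces.extend(recursive_split(part, separators[i + 1:], chunk_size))
            return merge_pieces(pieces, chunk_size)
    # No separator applies: fall back to fixed-size cuts.
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]


code = 'def hello_world():\n    print("Hello, World!")\n\n# Call the function\nhello_world()'
chunks = recursive_split(code, PYTHON_SEPARATORS, chunk_size=50)
for chunk in chunks:
    print(chunk.strip())
    print("---")
```

Because every separator stays attached to the piece that follows it, joining the chunks reproduces the input exactly. Whitespace is stripped only when printing.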