[Hands-on Project] Computing TF-IDF Values for Dream of the Red Chamber (Including Data Preprocessing)

Latest recommended article published on 2025-07-24 11:48:02

IlIaya1xt

License: CC 4.0 BY-SA

Tags: tf-idf

Article link: https://blog.csdn.net/winnfield/article/details/149199509

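Tasks 1 and 2 cover data preprocessing; Task 3 computes the TF-IDF values.

When studying classical literature, you often need to split a long text into sections. This post shares a practical Python script that automatically splits the full text of Dream of the Red Chamber (《红楼梦》) into separate volume files, so that each volume can be analyzed as its own document.

The splitting logic is simple: read the full text file line by line, detect the volume heading lines, and write the content of each volume to its own new file. Everything before the first heading is saved to a separate introduction file.

The figure above shows the beginning of 红楼梦.txt. This opening text appears only once, before the volume content starts.

Task 1: Split Dream of the Red Chamber into volume files by volume heading.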

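import os

def split_file(input_file, output_dir):
    """Split the full text into one file per volume, plus a file for the opening text."""
    # Create the output directory; no error if it already exists
    os.makedirs(output_dir, exist_ok=True)
    # The opening text is saved next to the input file
    first_file_path = os.path.join(os.path.dirname(input_file), '红楼梦开头.txt')
    current_file = None   # handle of the volume file currently being written
    first_section = True  # True until the first volume heading is seen

    try:
        with open(input_file, 'r', encoding='utf-8') as main_file:
            # 'a' appends: delete 红楼梦开头.txt before re-running, or its content is duplicated
            with open(first_file_path, 'a', encoding='utf-8') as first_file:
                for line in main_file:
                    # A heading line starts a new volume file named after the heading
                    if '卷 第' in line:
                        if current_file:
                            current_file.close()
                        chapter_file = os.path.join(output_dir, f"{line.strip()}.txt")
                        current_file = open(chapter_file, 'w', encoding='utf-8')
                        first_section = False
                        continue

                    # Skip advertisement lines
                    if '手机电子书' in line:
                        continue

                    if first_section:
                        first_file.write(line)
                    elif current_file:
                        current_file.write(line)
    finally:
        # Close the last volume file even if an error occurred
        if current_file:
            current_file.close()

input_file_path = r'./红楼梦/红楼梦.txt'
output_dir = r'./红楼梦/分卷'
split_file(input_file_path, output_dir)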

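The function first creates the output directory and defines the path of the file that will hold the opening section.

# Create the output directory; no error is raised if it already exists
os.makedirs(output_dir, exist_ok=True)
# Path of the output file for the opening section, in the same directory as the source file
first_file_path = os.path.join(os.path.dirname(input_file), '红楼梦开头.txt')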

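Both files are opened with the with statement, but with different mode arguments. The input file 红楼梦.txt only needs to be read, so it uses 'r' (read). first_file_path receives each line of the opening section as it is read, so it uses 'a' (append), which adds to the end of the file. Note that 'a' keeps whatever the file already contains, so running the script a second time duplicates the opening text unless the file is deleted first.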

with open(input_file, 'r', encoding='utf-8') as main_file:
    with open(first_file_path, 'a', encoding='utf-8') as first_file:
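
To make the difference between the modes concrete, here is a small sketch that writes the same line in two separate runs inside a temporary directory. The file name intro.txt and the helper write_twice are made up for illustration and are not part of the original script.

```python
import os
import tempfile

def write_twice(mode):
    """Write the same line in two separate opens and return the final file content."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "intro.txt")  # hypothetical file name
        for _ in range(2):
            with open(path, mode, encoding="utf-8") as f:
                f.write("intro\n")
        with open(path, "r", encoding="utf-8") as f:
            return f.read()

appended = write_twice("a")     # the second open adds to the existing content
overwritten = write_twice("w")  # the second open truncates the file first
print(repr(appended), repr(overwritten))
```

With 'a' the line ends up in the file twice, which is what would happen to the opening text if the split script were run again without deleting the file.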

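Reading 红楼梦.txt shows that the string '卷 第' appears only in heading lines. Detecting a heading therefore only requires checking each line for '卷 第' while iterating over the file. When it appears, the line is a heading, so a new .txt file is created and the lines that follow are written to it.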

for line in main_file:
    if '卷 第' in line:
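
The check can be tried on a couple of sample lines. This is only a sketch: the two lines below are invented to mimic the '卷 第' pattern, and the real headings in 红楼梦.txt may look different. The helper names is_heading and heading_to_filename are hypothetical.

```python
def is_heading(line):
    """Return True when the line contains the volume marker used in this post."""
    return '卷 第' in line

def heading_to_filename(line):
    """Derive the volume file name from a heading line, as the split script does."""
    return f"{line.strip()}.txt"

sample_lines = [
    "上卷 第一回 示例标题\n",  # hypothetical heading line
    "这是正文的一行。\n",      # hypothetical body line
]
flags = [is_heading(line) for line in sample_lines]
filename = heading_to_filename(sample_lines[0])
print(flags, filename)
```

Because the file name is taken directly from the heading text, a heading that contained a character not allowed in file names would make open() fail; the sample above assumes headings contain only ordinary text.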

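Stopping here would leave advertisements and other unrelated content in the volume files, so stray numbers would show up in the TF-IDF results. The repeated advertisement lines look like this: “手机电子书·大学生小说网更新时间:2006-7-26 11:43:00 本章字数:8717”

Each advertisement is a single line of its own, so it is enough to check each line for any fixed fragment of the advertisement and skip matching lines with continue.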

if '手机电子书' in line:
    continue
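
The same filter can be sketched on an in-memory list of lines. The advertisement line is the one quoted above; the two body lines and the helper drop_ad_lines are placeholders for illustration.

```python
AD_MARKER = '手机电子书'  # fixed fragment that appears in every advertisement line

def drop_ad_lines(lines):
    """Return the lines that do not contain the advertisement marker."""
    return [line for line in lines if AD_MARKER not in line]

lines = [
    "正文第一行\n",  # placeholder body line
    "手机电子书·大学生小说网更新时间:2006-7-26 11:43:00 本章字数:8717\n",
    "正文第二行\n",  # placeholder body line
]
kept = drop_ad_lines(lines)
print(kept)
```

Matching on the fixed fragment rather than the whole line matters because the timestamp and character count differ from one advertisement to the next.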

Running the code above produces the .txt files at the corresponding path. The folder contents are as follows:

The file contents are as follows:

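Task 2: Segment each volume into words and remove stop words
a. Read the content of every volume into memory
b. Add the custom dictionary to jieba (note: jieba cannot recognize every word on its own, so a dictionary has to be added manually)
c. Read the stop words
d. Segment the content of each volume
e. Write all segmented content to 分词后汇总.txt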

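import os
import jieba
import pandas as pd

file_paths = []
file_contents = []
# Walk the volume directory and collect each file's path and text
for root, dirs, files in os.walk(r'./红楼梦/分卷/'):
    for file in files:
        file_path = os.path.join(root, file)
        file_paths.append(file_path)
        # The with statement closes the file even if reading fails
        with open(file_path, 'r', encoding='utf-8') as f:
            file_content = f.read()
        # strip() removes leading and trailing whitespace such as '\n'
        file_contents.append(file_content.strip())
# One row per volume: its path and its full text
df = pd.DataFrame({'FilePath': file_paths, 'FileContent': file_contents})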

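os.walk from the os module traverses the directory tree. For each directory it yields a tuple of three values: the directory path, a list of subdirectory names, and a list of file names. Iterating over files gives each file name in turn, from which the path and the content are read. The volume title can be taken directly from the file name because Task 1 named each file after its heading line.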
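
The shape of what os.walk yields can be checked on a throwaway directory. This sketch uses two invented file names, a.txt and b.txt, in a temporary folder rather than the real volume directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create two small files, deliberately out of alphabetical order
    for name in ("b.txt", "a.txt"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(name)
    walked = list(os.walk(tmp))

# One tuple per directory visited; here there is only the top-level directory
root, dirs, files = walked[0]
files_type = type(files).__name__
sorted_files = sorted(files)
print(files_type, sorted_files)
```

The order of names in files depends on the file system, so sorting them is one way to make the row order of the resulting table reproducible. Sorted order is not necessarily reading order for volume names.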

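When adding the file content, .strip() removes leading and trailing whitespace, such as a \n at the start of the file.

Finally, pandas stores the file paths and contents in a DataFrame.

The result is shown in the figure below:

Using the custom dictionary, segment the content stored in df: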

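# Load the custom dictionary so jieba recognizes names and terms from the novel
jieba.load_userdict(r'./红楼梦/红楼梦词库.txt')
# The code below reads the 'stopword' column, so the file needs that header
stopwords = pd.read_csv(r'./StopwordsCN.txt', encoding='utf8', engine='python', index_col=False)
# A set makes each membership test fast
stopword_set = set(stopwords.stopword.values)

def is_valid_token(seg):
    """Keep tokens that are neither stop words nor pure whitespace."""
    return seg not in stopword_set and len(seg.strip()) > 0

with open(r'./红楼梦/分词后汇总.txt', 'w', encoding='utf-8') as file_to_jieba:
    for index, row in df.iterrows():
        fileContent = row['FileContent']
        segs = jieba.lcut(fileContent)
        # One line per volume: the kept tokens separated by spaces
        juanci = ' '.join(seg for seg in segs if is_valid_token(seg))
        file_to_jieba.write(juanci + '\n')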

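jieba's load_userdict loads the custom dictionary, and the stop words are read with pandas.

The DataFrame is iterated row by row, and each row provides the path (FilePath) and the content (FileContent) of one volume.

The content is segmented, stop words are removed, and the remaining tokens are joined with spaces. Removing stop words drops modal particles and other uninformative tokens and reduces the amount of computation.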
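
The filtering step can be illustrated without jieba by starting from an already segmented token list. Both the tokens and the three stop words below are invented for this sketch; the real stop words come from StopwordsCN.txt.

```python
stopwords = {"的", "了", "，"}  # hypothetical stop words for illustration

def is_valid_token(token, stopwords):
    """Keep tokens that are not pure whitespace and not stop words."""
    return bool(token.strip()) and token not in stopwords

# Tokens as a segmenter might return them, including punctuation and a newline
tokens = ["宝玉", "的", "，", "\n", "笑", "了"]
line = " ".join(t for t in tokens if is_valid_token(t, stopwords))
print(line)
```

Dropping whitespace tokens also removes the newlines inside a volume, which is what keeps each volume on a single line of the output file.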

Task 3: Compute the TF-IDF values

The preprocessed data from the steps above is now used for the computation.

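from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Each line of the file is one volume, already segmented into space-separated tokens
with open(r"./红楼梦/分词后汇总.txt", 'r', encoding='utf-8') as inFile:
    corpus = inFile.readlines()

# Note: the default token_pattern ignores single-character tokens
vectorizer = TfidfVectorizer(use_idf=True)
tfidf = vectorizer.fit_transform(corpus)
wordlist = vectorizer.get_feature_names_out()

# Rows are terms, columns are volumes
df = pd.DataFrame(tfidf.T.toarray(), index=wordlist)
for i in range(len(corpus)):
    featurelist = df.iloc[:, i].tolist()

    # Map each term to its TF-IDF value in volume i
    resdict = {}
    for j in range(len(wordlist)):
        resdict[wordlist[j]] = featurelist[j]
    # Sort by value, highest first, and print the top ten terms
    resdict = sorted(resdict.items(), key=lambda x: x[1], reverse=True)
    print(resdict[0:10])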
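
One detail worth knowing when reading these results: TfidfVectorizer's default token_pattern only matches tokens of two or more word characters, so single-character Chinese words never reach the vocabulary. The sketch below applies that documented default pattern with the re module to an invented sample line.

```python
import re

# scikit-learn's documented default token_pattern for its text vectorizers
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

line = "宝玉 道 黛玉 笑"  # invented sample of space-separated tokens
tokens = re.findall(DEFAULT_TOKEN_PATTERN, line)
print(tokens)
```

If single-character words should be counted too, passing token_pattern=r"(?u)\b\w+\b" to TfidfVectorizer is one way to keep them. Whether that is desirable depends on the analysis.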

This prints, for each volume, the ten terms with the highest TF-IDF values.
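
To see where the numbers come from, here is a minimal pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings, as I understand them: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector. The function tfidf and the two tiny documents are invented for illustration; this is not the library's implementation.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    result = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: c * idf[t] for t, c in tf.items()}
        # L2-normalize so that every document vector has length 1
        norm = math.sqrt(sum(w * w for w in raw.values()))
        result.append({t: w / norm for t, w in raw.items()} if norm else {})
    return result

docs = [["宝玉", "黛玉", "宝玉"], ["宝玉", "宝钗"]]  # invented toy corpus
weights = tfidf(docs)
top = sorted(weights[0].items(), key=lambda kv: kv[1], reverse=True)
print(top)
```

In the second toy document, 宝钗 outranks 宝玉 even though each occurs once, because 宝玉 appears in every document and therefore gets the lowest IDF. The same effect is why terms that appear in every volume tend to rank lower in a volume's top ten than terms concentrated in a few volumes.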