Information Theory: Computing the Information Entropy of Chinese Text in Python


License: CC 4.0 BY-SA


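This post shows how to compute the information entropy of a Chinese text in Python. The program reads a text file, counts how many times each Chinese character occurs, and stores the counts in a dictionary. It then computes the entropy from that dictionary and prints the result. The program also measures its own running time.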
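For reference, the quantity the script computes is the Shannon entropy of the character frequencies, with logarithms taken to base 2 so the result is in bits per character. The symbols below are introduced here for illustration and do not appear in the original post: n_i is the number of occurrences of the i-th distinct Chinese character, N is the total number of Chinese characters, and p_i is the relative frequency. The worked line uses a made-up four-character text, not the author's test file.

```latex
H = -\sum_{i} p_i \log_2 p_i, \qquad p_i = \frac{n_i}{N}

% Illustrative example: the text "天天向上" has counts 天: 2, 向: 1, 上: 1, so N = 4
H = -\left( \tfrac{2}{4}\log_2\tfrac{2}{4} + 2 \cdot \tfrac{1}{4}\log_2\tfrac{1}{4} \right)
  = 0.5 + 1 = 1.5 \text{ bits per character}
```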


Computing the Information Entropy of Chinese Text in Python

Environment: Python 3.9, PyCharm 2022 Community Edition

# Python 3.9, PyCharm 2022 Community Edition
import math
import time

start = time.process_time()  # CPU time at program start
word_sum = 0  # total number of Chinese characters
word_num_dit = {}  # maps each Chinese character to its number of occurrences

# The text file must be in the same directory as the Python project; otherwise use an absolute path.
# "r" opens the file read-only; errors='ignore' skips bytes that cannot be decoded as GB18030.
with open("test1.txt", "r", encoding='gb18030', errors='ignore') as file1:
    text = file1.read()  # read the whole text once; len(text) counts characters, not bytes
print('Total characters in the text:', len(text))

for a in text:  # go through the text one character at a time
    if '\u4e00' <= a <= '\u9fff':  # the character is Chinese (endpoints included, so '一' is counted)
        word_sum += 1  # one more Chinese character in total
        word_num_dit[a] = word_num_dit.get(a, 0) + 1  # first occurrence starts at 1, later ones add 1

# Sort the dictionary by number of occurrences, ascending
ordered_word_num_dit = {k: v for k, v in sorted(word_num_dit.items(), key=lambda item: item[1])}
print('Chinese characters and their counts:\n', ordered_word_num_dit)

# Compute the entropy of the Chinese characters in bits: H = -sum(p * log2(p))
str2_sum = 0
for num in ordered_word_num_dit.values():
    p = num / word_sum  # relative frequency of this character
    str2_sum -= p * math.log2(p)

print('Entropy of the Chinese text:', str2_sum)

end = time.process_time()
print("Program running time:", end - start)
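The counting and entropy steps of the script can also be wrapped in a small function that works on a string, which makes the calculation easy to check on short inputs. The following is a minimal sketch, not the author's code: the function name `chinese_char_entropy` is hypothetical, `collections.Counter` is swapped in for the hand-built dictionary, and returning 0.0 for a text without Chinese characters is an assumption made here.

```python
import math
from collections import Counter


def chinese_char_entropy(text: str) -> float:
    """Return the entropy, in bits per character, of the Chinese characters in text.

    Hypothetical helper for illustration; not part of the original script.
    """
    # Keep only characters in the range U+4E00..U+9FFF, endpoints included.
    counts = Counter(ch for ch in text if '\u4e00' <= ch <= '\u9fff')
    total = sum(counts.values())
    if total == 0:
        return 0.0  # assumption: no Chinese characters means zero entropy
    # Shannon entropy: H = -sum(p * log2(p)) over the distinct characters
    return -sum(n / total * math.log2(n / total) for n in counts.values())


print(chinese_char_entropy("天天向上"))  # 1.5
```

To apply it to a file, the text could be read with the same `open(..., encoding='gb18030', errors='ignore')` call as in the script and passed to the function as a single string.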