Counting Word Occurrences in English and Chinese Text


一匹脱缰的野马


License: CC 4.0 BY-SA

Category: Machine Learning

Article link: https://blog.csdn.net/qq_39112101/article/details/101037079


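This article walks through a practical example of text feature extraction with CountVectorizer: it tokenizes English and Chinese text, shows how to set stop words, and runs the code to obtain the feature names and the corresponding word-count matrix.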


Counting English words works as follows:

from sklearn.feature_extraction.text import CountVectorizer

# Text to process
content = ['Hong Kong residents express feelings through anthem; Guangzhou Museum offers moon-gazing through telescopes; China beat Cameroon 3-0 at Women’s V-ball World Cup','China unveiled the official mascots (Bing Dwen Dwen and Shuey Rhon Rhon) on Tuesday for the Beijing 2022 Winter Olympics and Winter Paralympic Games.']

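# 1. Build the vectorizer
# min_df=1 (the default) keeps every word that appears in at least one document
# stop_words: a list of words considered unimportant; they are excluded from the vocabulary
con_vet = CountVectorizer(stop_words=['express','feelings'])

# 2. Extract the words and count them
# The default tokenizer lowercases the text and splits it on whitespace and punctuation
# Single-character tokens are treated as uninformative and dropped
X = con_vet.fit_transform(content)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
name = con_vet.get_feature_names_out()

print(name)         # vocabulary, in column order
print(X.toarray())  # dense count matrix: one row per document
print(X)            # sparse form of the same counts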

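Running the code gives the following result:

The first list contains the words that were extracted; the array below it gives the number of times each word occurs in the first sentence and in the second sentence, respectively.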
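To make the relationship between the word list and the count array concrete, here is a minimal standard-library sketch of the same idea. It is not scikit-learn's implementation: the regular expression is assumed to mirror CountVectorizer's default token pattern, and `tokenize`, `count_matrix`, and the two sample sentences are hypothetical names and data introduced only for illustration.

```python
import re
from collections import Counter

# Assumed to mirror CountVectorizer's default token pattern:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text, stop_words=()):
    """Lowercase the text, keep tokens of 2+ word characters, drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if t not in stop_words]


def count_matrix(docs, stop_words=()):
    """Return the sorted vocabulary and one row of counts per document."""
    counts = [Counter(tokenize(doc, stop_words)) for doc in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[word] for word in vocab] for c in counts]
    return vocab, rows


# Hypothetical sample sentences, not taken from the article
docs = ["A cat saw a cat", "The dog saw the cat"]
vocab, rows = count_matrix(docs, stop_words={"the"})
print(vocab)  # ['cat', 'dog', 'saw']
print(rows)   # [[2, 0, 1], [1, 1, 1]]
```

Each column of `rows` corresponds to the word at the same position in `vocab`, and each row corresponds to one input sentence. The single-character word "a" never reaches the vocabulary, and "the" is removed as a stop word.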

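The code above cannot handle Chinese directly, because Chinese text has no spaces between words. The text first has to be segmented, here with jieba (结巴分词), which must be installed locally with pip (pip install jieba) before use.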
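A small standard-library sketch shows why segmentation is needed. It assumes the regular expression below matches CountVectorizer's default token pattern. The segmented string is split by hand for illustration and is not actual jieba output.

```python
import re

# Assumed to mirror CountVectorizer's default token pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

sentence = "海是一座没有围墙的城"
# Hand-segmented for illustration; not actual jieba output
segmented = "海,是,一座,没有,围墙,的,城"

raw_tokens = TOKEN_PATTERN.findall(sentence)
seg_tokens = TOKEN_PATTERN.findall(segmented)

print(raw_tokens)  # ['海是一座没有围墙的城']
print(seg_tokens)  # ['一座', '没有', '围墙']
```

Without segmentation the whole sentence is treated as a single "word". Once separators are inserted, the pattern picks out the individual multi-character words and drops the single-character ones. Segmenting first, as the jieba-based script does, avoids the problem.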

from sklearn.feature_extraction.text import CountVectorizer
import jieba

# Sample text made up for this example
content = ['海是一座没有围墙的城','那些和青春为伍的日子','曾在你心里住过一阵子','将青春的承诺载出天涯']

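content_list = []
for tmp in content:
    # cut_all=True selects jieba's full mode, which emits every word it can find;
    # pass cut_all=False for precise mode
    res = jieba.cut(tmp, cut_all=True)
    # Join the tokens with commas so CountVectorizer can split them again
    res_str = ','.join(res)
    content_list.append(res_str)

# 1. Build the vectorizer
con_vet = CountVectorizer(stop_words=['为伍','没有'])

# 2. Extract the words and count them
# The tokens are now comma-separated, so the default tokenizer can split them
# Single-character tokens are still dropped
x = con_vet.fit_transform(content_list)

# Get the extracted words (get_feature_names() was removed in scikit-learn 1.2)
name = con_vet.get_feature_names_out()
print(name)

print(x.toarray())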

Running the code gives the following result: