Similarity Queries with gensim

jrymos001

License: CC 4.0 BY-SA

Categories: Machine Learning, Natural Language Processing

Article link: https://blog.csdn.net/m0_37681914/article/details/73832151

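This article shows how to use the LSI (latent semantic indexing) model in the gensim library for text similarity retrieval. A concrete example walks through retrieving, from a document collection, the documents most similar to a query string.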

Official documentation:
http://radimrehurek.com/gensim/tut3.html

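A common real-world need is to search for a piece of information and rank the results by how similar each one is to the query.

Using gensim to compute the similarity between a query and a document set

Query: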

Human computer interaction

The document set is as follows (each line is one document):

Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey
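
Before the query can be compared with these documents, both have to be turned into vectors. As a rough illustration of the bag-of-words idea, here is a minimal pure-Python sketch that does not use gensim. The helper names `build_vocabulary` and `to_bow` are hypothetical, and the sketch does no stopword or rare-word filtering, so its token ids will not match those in a gensim dictionary built with different preprocessing.

```python
from collections import Counter

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]


def build_vocabulary(docs):
    """Hypothetical helper: map each lowercase token to an id, in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc.lower().split():
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(text, token2id):
    """Hypothetical helper: return sorted (token_id, count) pairs.

    Tokens that are not in the vocabulary are silently dropped.
    """
    counts = Counter(t for t in text.lower().split() if t in token2id)
    return sorted((token2id[t], c) for t, c in counts.items())


token2id = build_vocabulary(documents)
query_bow = to_bow("Human computer interaction", token2id)
print(query_bow)
```

The word "interaction" does not occur in any of the nine documents, so it contributes nothing to the query vector; only "human" and "computer" are counted.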

The gensim code is as follows:

from gensim import corpora, models, similarities

# Load the dictionary (token -> id mapping)
dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
# Load the corpus; it comes from the first tutorial, "From strings to vectors"
corpus = corpora.MmCorpus('/tmp/deerwester.mm')
# Build a 2-topic LSI model from the corpus and the dictionary
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

# The query string
doc = "Human computer interaction"
# Convert the query to a bag-of-words vector
vec_bow = dictionary.doc2bow(doc.lower().split())
# Convert the bag-of-words vector to LSI space
vec_lsi = lsi[vec_bow]
print(vec_lsi)

# Transform the corpus to LSI space and build a similarity index from it
index = similarities.MatrixSimilarity(lsi[corpus])
# Optionally save the index and load it again later:
# index.save('/tmp/deerwester.index')
# index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')

# Compute the similarity between the query and every document in the corpus
sims = index[vec_lsi]
print(list(enumerate(sims)))  # (document_number, document_similarity) 2-tuples

# Sort by similarity in descending order
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)  # sorted (document_number, similarity_score) 2-tuples
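
The scores returned by `MatrixSimilarity` are cosine similarities, so they fall between -1 and 1, with larger values meaning the document points in a direction closer to the query in LSI space. To make the ranking step concrete, here is a small pure-Python sketch under stated assumptions: the 2-dimensional vectors are made-up values, not output from the LSI model above, and `cosine` is a hypothetical helper rather than a gensim function.

```python
import math


def cosine(u, v):
    """Hypothetical helper: cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


# Made-up 2-topic document vectors (illustrative only, not real LSI output)
doc_vectors = [
    [0.0, 1.0],
    [1.0, 0.0],
    [-1.0, 0.0],
    [0.6, 0.8],
]
query_vector = [2.0, 0.0]

# One similarity score per document, in corpus order
sims = [cosine(query_vector, d) for d in doc_vectors]

# Same sorting idiom as the gensim example: highest similarity first
ranked = sorted(enumerate(sims), key=lambda item: -item[1])
print(ranked)
```

Because cosine similarity depends only on the angle between vectors, the query's length (2.0 here) does not change the ranking.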