Official documentation:
http://radimrehurek.com/gensim/tut3.html
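A common real-world need is to search for a piece of text and rank the results by how similar each one is to that query.
Using gensim to compute the similarity between a query and a document collection
Query: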
Human computer interaction
The document collection is as follows (each line is one document):
Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey
The gensim code is as follows:
from gensim import corpora, models, similarities
# Load the dictionary
dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
# Load the corpus
corpus = corpora.MmCorpus('/tmp/deerwester.mm') # comes from the first tutorial, "From strings to vectors"
# Build the LSI model from the corpus and the dictionary
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
# The query string
doc = "Human computer interaction"
# Build the bag-of-words vector for the query
vec_bow = dictionary.doc2bow(doc.lower().split())
# Convert the bag-of-words vector to an LSI vector
vec_lsi = lsi[vec_bow] # convert the query to LSI space
print(vec_lsi)
# Build a similarity index from the corpus
index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it
#index.save('/tmp/deerwester.index')
#index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')
# Compute the similarity between the query and every document
sims = index[vec_lsi] # perform a similarity query against the corpus
print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples
# Sort by similarity in descending order
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims) # print sorted (document number, similarity score) 2-tuples
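The scores returned by MatrixSimilarity are cosine similarities between the query vector and each indexed document vector, so they lie between -1 and 1 and a higher score means a closer match. The idea can be sketched in plain Python as follows. The 2-dimensional vectors are made-up values chosen for illustration, not actual LSI output for the corpus above, and cosine_similarity and rank are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return (document number, score) pairs, best match first."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(enumerate(scores), key=lambda item: -item[1])


# Hypothetical 2-topic vectors, for illustration only.
query_vec = [1.0, 0.0]
doc_vecs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

ranked = rank(query_vec, doc_vecs)
print(ranked)
```

Document 0 points in the same direction as the query and ranks first even though it is twice as long, since cosine similarity compares direction and ignores vector length.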