A Simple Implementation of N-Grams, tf-idf and Cosine Similarity in Python

This post describes a simple approach to document similarity: comparing documents stored in a database using n-grams, tf-idf and cosine similarity, and producing a similarity score between 0 and 1. Concrete implementations of the n-gram and cosine-similarity steps are given below.


I need to compare documents stored in a DB and come up with a similarity score between 0 and 1.

The method I need to use has to be very simple: a vanilla implementation of n-grams (where it is possible to define how many grams to use), along with simple implementations of tf-idf and cosine similarity.

Is there any program that can do this? Or should I start writing this from scratch?

Solution

Check out the NLTK package: http://www.nltk.org. It has everything you need.

For cosine similarity:

```python
import math
import numpy

def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|.
    """
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
```
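As a quick sanity check of the formula, vectors pointing in the same direction score 1 and orthogonal vectors score 0 (the definition below repeats the function so the snippet runs on its own):

```python
import math
import numpy

def cosine_distance(u, v):
    # same formula as above: u.v / (|u| * |v|)
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))

# same direction, different magnitude -> 1.0 (up to floating point)
print(cosine_distance([1, 2, 3], [2, 4, 6]))

# orthogonal vectors -> 0.0
print(cosine_distance([1, 0], [0, 1]))
```

Note that despite the name `cosine_distance`, this returns the cosine *similarity*; higher means more similar, which is exactly the 0-to-1 score asked for when the vectors have non-negative components.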

For ngrams:

```python
from itertools import chain

def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """
    A utility that produces a sequence of ngrams from a sequence of items.
    For example:

    >>> ngrams([1,2,3,4,5], 3)
    [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

    Use ingrams for an iterator version of this function. Set pad_left
    or pad_right to true in order to get additional ngrams:

    >>> ngrams([1,2,3,4,5], 2, pad_right=True)
    [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

    @param sequence: the source data to be converted into ngrams
    @type sequence: C{sequence} or C{iterator}
    @param n: the degree of the ngrams
    @type n: C{int}
    @param pad_left: whether the ngrams should be left-padded
    @type pad_left: C{boolean}
    @param pad_right: whether the ngrams should be right-padded
    @type pad_right: C{boolean}
    @param pad_symbol: the symbol to use for padding (default is None)
    @type pad_symbol: C{any}
    @return: The ngrams
    @rtype: C{list} of C{tuple}s
    """
    if pad_left:
        sequence = chain((pad_symbol,) * (n - 1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n - 1))
    sequence = list(sequence)
    count = max(0, len(sequence) - n + 1)
    return [tuple(sequence[i:i + n]) for i in range(count)]
```
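For document comparison you can feed `ngrams` either a token list (word n-grams) or a raw string (character n-grams). A short sketch, repeating the implementation above without its docstring so it runs on its own:

```python
from itertools import chain

def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    # same implementation as above, minus the docstring
    if pad_left:
        sequence = chain((pad_symbol,) * (n - 1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n - 1))
    sequence = list(sequence)
    count = max(0, len(sequence) - n + 1)
    return [tuple(sequence[i:i + n]) for i in range(count)]

# word bigrams from a tokenised sentence
print(ngrams("the quick brown fox".split(), 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]

# character trigrams from a raw string
print(ngrams("cosine", 3))
# [('c', 'o', 's'), ('o', 's', 'i'), ('s', 'i', 'n'), ('i', 'n', 'e')]
```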

For tf-idf you will have to compute the term distribution first. I am using Lucene to do that, but you may very well do something similar with NLTK; use FreqDist.

If you like PyLucene, this shows how to compute tf-idf:
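If you would rather stay in pure Python, the computation FreqDist would feed is just per-document term counts. A minimal sketch of that route (the function name `tf_idf` and the raw-count tf / log idf weighting are my choices here, not an NLTK API):

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists; returns one {term: tf-idf weight} dict per doc
    n_docs = len(docs)
    # document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts, like nltk.FreqDist(doc)
        vectors.append({t: f * math.log(n_docs / df[t]) for t, f in tf.items()})
    return vectors

docs = [["a", "b", "a"], ["a", "c"], ["b", "c", "c"]]
weights = tf_idf(docs)
print(weights[0]["b"])  # "b" occurs once in doc 0 and appears in 2 of 3 docs: 1 * log(3/2)
```

With this idf variant, a term that appears in every document gets weight log(1) = 0, which is usually what you want: ubiquitous terms carry no discriminating information.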

```python
# reader = lucene.IndexReader(FSDirectory.open(index_loc))
docs = reader.numDocs()
for i in xrange(docs):
    tfv = reader.getTermFreqVector(i, fieldname)
    if tfv:
        rec = {}
        terms = tfv.getTerms()
        frequencies = tfv.getTermFrequencies()
        for (t, f, x) in zip(terms, frequencies, xrange(maxtokensperdoc)):
            df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
            tmap.setdefault(t, len(tmap))
            rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute tf-idf
        # and normalize the values using cosine normalization
        if cosine_normalization:
            denom = sum([x**2 for x in rec.values()]) ** 0.5
            for k, v in rec.items():
                rec[k] = v / denom
```
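Putting the pieces together, here is one minimal end-to-end sketch of the 0-to-1 score the question asks for. Every name in it is mine, and the concrete choices (character trigrams, raw-count tf, a smoothed idf over the two-document "corpus") are illustrative assumptions, not the only reasonable ones:

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    # character n-grams as plain substrings
    return [text[i:i + n] for i in range(max(0, len(text) - n + 1))]

def similarity(doc_a, doc_b, n=3):
    # tf-idf weighting over a two-document "corpus", then cosine similarity
    tf_a, tf_b = Counter(char_ngrams(doc_a, n)), Counter(char_ngrams(doc_b, n))
    vocab = sorted(set(tf_a) | set(tf_b))
    def idf(t):
        # smoothed idf, so n-grams shared by both docs keep a nonzero weight
        df = (t in tf_a) + (t in tf_b)
        return math.log(1 + 2 / df)
    u = [tf_a[t] * idf(t) for t in vocab]
    v = [tf_b[t] * idf(t) for t in vocab]
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

print(similarity("the quick brown fox", "the quick brown fox"))  # ~1.0
print(similarity("completely different", "unrelated words"))     # 0.0, no shared trigrams
```

The smoothing in `idf` matters for the two-document case: with the unsmoothed `log(N/df)` formula, every n-gram the documents share would get idf log(2/2) = 0, and the score of any two distinct documents would collapse to zero.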
