[Recommended] A Perfect Tutorial on LSI (Latent Semantic Indexing)

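This post excerpts the key points of the LSI/LSA tutorial written by Dr. E. Garcia, clears up some common misconceptions about LSI, and explains how LSI improves information retrieval by identifying terms with second-order co-occurrence across documents.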

"instead of lecturing about SVD I want to show you how things work --step by step"

-- If you agree with this statement, then this tutorial by Dr. E. Garcia is the LSI/LSA tutorial best suited to you.

The original is fairly long, so here is the link:

http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html

 

If the original feels too long, you can also read Garcia's condensed versions:

Latent Semantic Indexing (LSI) Fast Track Tutorial
Singular Value Decomposition (SVD) Fast Track Tutorial

 

 

Some excerpts:

 

I. Common misconceptions about LSI, namely that LSI:

1) is theming (analysis of themes).

2) is used by search engines to find all the nouns and verbs, and then associate them with related (substitution-useful) nouns and verbs.

3) allows search engines to "learn" which words are related and which noun concepts relate to one another.

4) is a form of on-topic analysis (term scope/subject analysis).

5) can be applied to collections of any size.

6) has no problem addressing polysemy (terms with different meanings).

 

Pasted from <http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html>

 

 

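II. In essence, LSI identifies words that have second-order co-occurrence at the document level and places them in the same subspace. Therefore:

1) Words that fall into the same subspace are not necessarily synonyms, and are not even necessarily words that appear in the same context. This is especially true for long documents.

2) LSI cannot handle words with multiple meanings (polysemous words) at all, and polysemy degrades the results LSI produces.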

 

A persistent myth in search marketing circles is that LSI grants contextuality; i.e., terms occurring in the same context. This is not always the case. Consider two documents X and Y and three terms A, B and C, wherein:

 

A and B do not co-occur.

X mentions terms A and C

Y mentions terms B and C.

 

:. A---C---B

 

The common denominator is C, so we define this relation as an in-transit co-occurrence since both A and B occur while in transit with C. This is called second-order co-occurrence and is a special case of high-order co-occurrence.
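The A---C---B relation above can be checked mechanically. The following is a minimal sketch in plain Python, not code from Garcia's tutorial: the helper names `first_order` and `second_order` are hypothetical, and documents are modelled as simple sets of terms, which is an assumption made for illustration.

```python
from itertools import combinations

# Toy collection from the excerpt: X mentions A and C, Y mentions B and C.
docs = {"X": {"A", "C"}, "Y": {"B", "C"}}

def first_order(docs):
    """Return the unordered term pairs that share at least one document."""
    pairs = set()
    for terms in docs.values():
        for a, b in combinations(sorted(terms), 2):
            pairs.add((a, b))
    return pairs

def second_order(docs):
    """Map each pair that never shares a document to the terms linking it."""
    direct = first_order(docs)
    neighbours = {}
    for a, b in direct:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    links = {}
    for a, b in combinations(sorted(neighbours), 2):
        if (a, b) in direct:
            continue  # the pair co-occurs directly, so it is first-order
        shared = neighbours[a] & neighbours[b]
        if shared:
            links[(a, b)] = shared
    return links

direct = first_order(docs)
indirect = second_order(docs)
print(sorted(direct))  # [('A', 'C'), ('B', 'C')]
print(indirect)        # {('A', 'B'): {'C'}}
```

A and B never appear in `direct`, yet they show up in `indirect` with C as the linking term, which is the in-transit relation described above.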

 

However, only because terms A and B are in-transit with C this does not grant contextuality, as the terms can be mentioned in different contexts in documents X and Y. For example, this would be the case of X and Y discussing different topics. Long documents are more prone to this.

 

Even if X and Y are monotopic, these might be discussing different subjects. Thus, it would be fallacious to assume that high-order co-occurrence between A and B while in-transit with C equates to a contextuality relationship between terms. Add polysemy to this and the scenario worsens, as LSI can fail to address polysemy.
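To see how a truncated SVD can fold unrelated terms together, here is a small dependency-free sketch. Everything in it is an assumption made for illustration: the terms "river", "loan" and "bank", the binary weighting, and the choice of rank 1. Power iteration stands in for a library SVD routine, and rank 1 is an extreme truncation for such a tiny matrix, so treat the result as a demonstration of the effect rather than a realistic LSI setup.

```python
import math

# Hypothetical toy collection: "bank" is polysemous (river bank vs. money bank).
terms = ["river", "loan", "bank"]
docs = [{"river", "bank"}, {"loan", "bank"}]

# Binary term-document matrix: one row per term, one column per document.
M = [[1.0 if t in d else 0.0 for d in docs] for t in terms]

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank1_approximation(M, iterations=200):
    """Best rank-1 approximation of M, found by power iteration on M^T M."""
    n_terms, n_docs = len(M), len(M[0])
    v = [1.0 + 0.5 * j for j in range(n_docs)]  # arbitrary start vector
    for _ in range(iterations):
        Mv = [sum(M[i][j] * v[j] for j in range(n_docs)) for i in range(n_terms)]
        w = [sum(M[i][j] * Mv[i] for i in range(n_terms)) for j in range(n_docs)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]  # converges to the top right singular vector
    # M v equals sigma * u, so the outer product (M v) v^T is sigma * u * v^T.
    Mv = [sum(M[i][j] * v[j] for j in range(n_docs)) for i in range(n_terms)]
    return [[mv * vj for vj in v] for mv in Mv]

M1 = rank1_approximation(M)
before = cosine(M[0], M[1])   # "river" vs "loan" in the raw matrix
after = cosine(M1[0], M1[1])  # the same pair after rank-1 truncation
print(round(before, 6), round(after, 6))
```

In the raw matrix "river" and "loan" never share a document, so their cosine similarity is 0. After truncation their rows coincide and the similarity is essentially 1, purely because both co-occur with the polysemous "bank", even though the two documents use that word in different senses.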

 

Pasted from <http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html>
