Developing with SciSpacy: A Worked Example

Original article · Last modified 2025-05-06 18:14:55

License: CC 4.0 BY-SA

First published 2025-03-12 17:28:56

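SciSpacy is a natural language processing (NLP) toolkit built for scientific and biomedical text. It is based on spaCy and adds pretrained models and pipeline components for the scientific domain. This guide walks through a complete SciSpacy example, from installation to a working pipeline.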

1. Installing SciSpacy

First, install SciSpacy and the model you need:

pip install scispacy
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz
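
To confirm that both installs landed in the active environment before moving on, a quick check along the following lines can help. This is a minimal sketch that uses only the standard library; the helper name `missing_packages` is made up for illustration and is not part of SciSpacy.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # The model installs as an ordinary Python package named en_core_sci_sm.
    missing = missing_packages(["spacy", "scispacy", "en_core_sci_sm"])
    print("Missing packages:", missing or "none")
```

If anything is reported as missing, rerun the matching `pip install` command in the same environment that runs the script.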

2. Importing SciSpacy and Loading the Model

import scispacy
import spacy
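# Importing these two modules registers the "abbreviation_detector" and
# "scispacy_linker" factories that nlp.add_pipe() looks up by name.
from scispacy.abbreviation import AbbreviationDetector
from scispacy.linking import EntityLinker

# Load the pretrained SciSpacy model
nlp = spacy.load("en_core_sci_sm")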

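3. Adding SciSpacy Components

SciSpacy provides additional pipeline components, including abbreviation detection and UMLS entity linking.

3.1 Abbreviation Detection

# Add the abbreviation detector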
nlp.add_pipe("abbreviation_detector")
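
SciSpacy's documentation describes the abbreviation detector as an implementation of the Schwartz and Hearst algorithm. To give a feel for what that involves, here is a simplified, standalone sketch of the core matching step: it looks for a parenthesised short form and walks backwards through the preceding words, matching the short form's characters right to left. The function names, the regular expression, and the omission of the algorithm's extra validity checks are all assumptions of this sketch, not SciSpacy's actual code.

```python
import re

# Assumed pattern for a short form in parentheses, e.g. "(HIV)".
SHORT_FORM = re.compile(r"\(([A-Za-z][A-Za-z0-9+-]{1,9})\)")


def match_long_form(short, candidate):
    """Match the characters of `short` right to left inside `candidate`.

    Returns the matched long form, or None when the characters cannot be aligned.
    """
    l_idx = len(candidate) - 1
    for s_idx in range(len(short) - 1, -1, -1):
        ch = short[s_idx].lower()
        if not ch.isalnum():
            continue
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while l_idx >= 0 and (
            candidate[l_idx].lower() != ch
            or (s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum())
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


def find_abbreviations(text):
    """Return {short form: long form} for each 'long form (SHORT)' pattern."""
    found = {}
    for match in SHORT_FORM.finditer(text):
        short = match.group(1)
        words = text[:match.start()].split()
        # Search window used by the algorithm: min(|short| + 5, 2 * |short|) words.
        window = min(len(short) + 5, len(short) * 2)
        long_form = match_long_form(short, " ".join(words[-window:]))
        if long_form:
            found[short] = long_form
    return found


sample = (
    "The human immunodeficiency virus (HIV) is a lentivirus that causes "
    "HIV infection and acquired immunodeficiency syndrome (AIDS)."
)
print(find_abbreviations(sample))
```

The real component works on spaCy tokens and spans rather than raw strings, so treat this only as an illustration of the matching idea.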

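3.2 UMLS Entity Linking

# Add the UMLS entity linker (the first run downloads the knowledge base and
# its index, roughly 1 GB, so this line can take a while)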
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})
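
The SciSpacy documentation describes the linker as a string-overlap search: mentions are compared with knowledge-base aliases using character 3-grams and an approximate nearest-neighbour index. The sketch below illustrates that idea on a toy scale with exact cosine similarity over trigram counts. The CUIs come from the sample output in this article, but the alias strings, the function names, and the 0.7 cutoff are assumptions chosen for the example, and the real linker uses TF-IDF weighting and a proper index rather than this brute-force loop.

```python
from collections import Counter
from math import sqrt

# Toy alias table: CUIs are from this article's sample output; the alias
# strings are illustrative and not a real UMLS extract.
TOY_KB = {
    "C0019682": ["human immunodeficiency virus", "hiv"],
    "C0019693": ["hiv infection", "hiv infections"],
    "C0001175": ["acquired immunodeficiency syndrome", "aids"],
}


def char_ngrams(text, n=3):
    """Count the character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def link(mention, kb=TOY_KB, threshold=0.7):
    """Return (CUI, score) pairs at or above the threshold, best match first."""
    query = char_ngrams(mention)
    scored = []
    for cui, aliases in kb.items():
        best = max(cosine(query, char_ngrams(alias)) for alias in aliases)
        if best >= threshold:
            scored.append((cui, best))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for mention in ["HIV", "HIV infection", "AIDS", "aspirin"]:
    print(mention, "->", link(mention))
```

Because the match is purely on surface strings, a mention links well only when it resembles a known alias. Running the abbreviation detector first and setting `resolve_abbreviations` to True therefore helps, since the linker then sees the long form.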

4. Text Processing Example

The following complete example shows how to process scientific text with SciSpacy.

4.1 Input Text

text = """
The human immunodeficiency virus (HIV) is a lentivirus that causes HIV infection and acquired immunodeficiency syndrome (AIDS). 
AIDS is a condition in humans in which progressive failure of the immune system allows life-threatening opportunistic infections and cancers to thrive. 
CD4+ T cells are a type of white blood cell that HIV infects and destroys.
"""

4.2 Processing the Text

# Process the text with SciSpacy
doc = nlp(text)

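5. Extracting Information

Once the text is processed, the resulting doc exposes tokens, entities, abbreviations, and knowledge-base links.

5.1 Tokenization and Part-of-Speech Tagging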

for token in doc:
    print(f"Token: {token.text}, POS: {token.pos_}, Lemma: {token.lemma_}")

5.2 Named Entity Recognition (NER)

for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

5.3 Abbreviation Detection

for abrv in doc._.abbreviations:
    print(f"Abbreviation: {abrv}, Long Form: {abrv._.long_form}")
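
The detector may report a short form once for every place it occurs in the text, so the loop above can print the same pair several times. A small helper can collapse the results into a lookup table. This is a sketch: `abbreviation_table` and `fake_span` are names invented here, and the stand-in objects only mimic the `text` and `_.long_form` attributes so that the example runs without a model.

```python
from types import SimpleNamespace


def abbreviation_table(abbreviations):
    """Map each short form to its long form, keeping the first definition seen."""
    table = {}
    for abrv in abbreviations:
        # _.long_form is itself a span, so take its text.
        table.setdefault(abrv.text, abrv._.long_form.text)
    return table


def fake_span(short, long_form):
    """Stand-in for a spaCy Span carrying the _.long_form extension."""
    return SimpleNamespace(
        text=short,
        _=SimpleNamespace(long_form=SimpleNamespace(text=long_form)),
    )


demo = [
    fake_span("HIV", "human immunodeficiency virus"),
    fake_span("AIDS", "acquired immunodeficiency syndrome"),
    fake_span("HIV", "human immunodeficiency virus"),
]
print(abbreviation_table(demo))
```

With the pipeline from this article, the call would be `abbreviation_table(doc._.abbreviations)`.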

5.4 UMLS Entity Linking

linker = nlp.get_pipe("scispacy_linker")
for ent in doc.ents:
    # ent._.kb_ents is a list of (CUI, score) pairs, best match first
    if ent._.kb_ents:
        cui, score = ent._.kb_ents[0]
        print(f"Entity: {ent.text}, UMLS CUI: {cui}, Score: {score}")
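
A bare CUI is hard to read, so it is often useful to attach the concept's canonical name and to drop low-scoring links. In recent SciSpacy releases the candidates live in `ent._.kb_ents`, and the linker exposes concept records through `linker.kb.cui_to_entity`. The sketch below is written against those two attributes but runs on stand-in objects so that it works without the knowledge-base download. The helper name `top_links`, the 0.9 cutoff, and the stand-in canonical names are assumptions for illustration.

```python
from types import SimpleNamespace


def top_links(ents, cui_to_entity, min_score=0.0):
    """Return (mention, CUI, canonical name, score) for each entity's best link."""
    rows = []
    for ent in ents:
        if not ent._.kb_ents:
            continue  # the linker found no candidate for this mention
        cui, score = ent._.kb_ents[0]
        if score >= min_score:
            rows.append((ent.text, cui, cui_to_entity[cui].canonical_name, score))
    return rows


# Stand-ins for linker.kb.cui_to_entity and doc.ents (names are placeholders).
kb = {
    "C0019682": SimpleNamespace(canonical_name="HIV"),
    "C0001175": SimpleNamespace(canonical_name="Acquired Immunodeficiency Syndrome"),
}
ents = [
    SimpleNamespace(text="HIV", _=SimpleNamespace(kb_ents=[("C0019682", 1.0)])),
    SimpleNamespace(text="AIDS", _=SimpleNamespace(kb_ents=[("C0001175", 0.8)])),
    SimpleNamespace(text="CD4+ T cells", _=SimpleNamespace(kb_ents=[])),
]
rows = top_links(ents, kb, min_score=0.9)
print(rows)
```

Against the real pipeline the call would be `top_links(doc.ents, linker.kb.cui_to_entity, min_score=0.9)`.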

6. Complete Code Example

import scispacy
import spacy
# These imports register the "abbreviation_detector" and "scispacy_linker" factories
from scispacy.abbreviation import AbbreviationDetector
from scispacy.linking import EntityLinker

# Load the pretrained SciSpacy model
nlp = spacy.load("en_core_sci_sm")

# Add the abbreviation detector
nlp.add_pipe("abbreviation_detector")

# Add the UMLS entity linker
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "umls"})

# Input text
text = """
The human immunodeficiency virus (HIV) is a lentivirus that causes HIV infection and acquired immunodeficiency syndrome (AIDS). 
AIDS is a condition in humans in which progressive failure of the immune system allows life-threatening opportunistic infections and cancers to thrive. 
CD4+ T cells are a type of white blood cell that HIV infects and destroys.
"""

# Process the text with SciSpacy
doc = nlp(text)

# Tokenization and part-of-speech tagging
print("=== Tokens and POS ===")
for token in doc:
    print(f"Token: {token.text}, POS: {token.pos_}, Lemma: {token.lemma_}")

# Named entity recognition (NER)
print("\n=== Named Entities ===")
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

# Abbreviation detection
print("\n=== Abbreviations ===")
for abrv in doc._.abbreviations:
    print(f"Abbreviation: {abrv}, Long Form: {abrv._.long_form}")

# UMLS entity linking
print("\n=== UMLS Entity Linking ===")
linker = nlp.get_pipe("scispacy_linker")
for ent in doc.ents:
    # ent._.kb_ents is a list of (CUI, score) pairs, best match first
    if ent._.kb_ents:
        cui, score = ent._.kb_ents[0]
        print(f"Entity: {ent.text}, UMLS CUI: {cui}, Score: {score}")

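7. Sample Output

Running the code above produces output similar to the following. The exact tokens, entities, and scores depend on the installed model and SciSpacy version.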

=== Tokens and POS ===
Token: The, POS: DET, Lemma: the
Token: human, POS: ADJ, Lemma: human
Token: immunodeficiency, POS: NOUN, Lemma: immunodeficiency
Token: virus, POS: NOUN, Lemma: virus
Token: (, POS: PUNCT, Lemma: (
Token: HIV, POS: PROPN, Lemma: HIV
Token: ), POS: PUNCT, Lemma: )
...

=== Named Entities ===
Entity: human immunodeficiency virus, Label: ENTITY
Entity: HIV, Label: ENTITY
Entity: HIV infection, Label: ENTITY
Entity: acquired immunodeficiency syndrome, Label: ENTITY
Entity: AIDS, Label: ENTITY
...

=== Abbreviations ===
Abbreviation: HIV, Long Form: human immunodeficiency virus
Abbreviation: AIDS, Long Form: acquired immunodeficiency syndrome

=== UMLS Entity Linking ===
Entity: human immunodeficiency virus, UMLS CUI: C0019682, Score: 1.0
Entity: HIV, UMLS CUI: C0019682, Score: 1.0
Entity: HIV infection, UMLS CUI: C0019693, Score: 1.0
Entity: acquired immunodeficiency syndrome, UMLS CUI: C0001175, Score: 1.0
Entity: AIDS, UMLS CUI: C0001175, Score: 1.0
...