NLP Tools: Processing Chinese with the Python Wrappers for Stanford CoreNLP

冰__蓝

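License: CC 4.0 BY-SA

Category: NLP Technology. Tags: NLP, StanfordNLP

This is an original article by the author; do not repost it without the author's permission.

Article link: https://blog.csdn.net/ling620/article/details/98864588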

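1. What is Stanford CoreNLP?

The CoreNLP project is an open-source NLP toolkit developed at Stanford University. It includes a part-of-speech (POS) tagger, a named entity recognizer (NER), a parser, sentiment analysis, bootstrapped pattern learning, and open information extraction tools.

Stanford CoreNLP is written in Java. At the time of writing the latest release is v3.9.2, which requires Java 1.8 or later, so Java must be installed to run CoreNLP. You do not have to write Java to use it, however: you can interact with CoreNLP from the command line or through its web service, and you can call it from your own code in JavaScript, Python, or other languages.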
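As a rough illustration of the web-service route, the sketch below sends text to a CoreNLP server from Python using only the standard library. It assumes a CoreNLP server has already been started locally and is listening on http://localhost:9000; the helper names `build_corenlp_url` and `annotate` are made up for this example and are not part of CoreNLP.

```python
import json
import urllib.parse
import urllib.request


def build_corenlp_url(base_url, annotators, output_format="json"):
    """Build the request URL; the server reads its settings from a JSON `properties` parameter."""
    props = {"annotators": ",".join(annotators), "outputFormat": output_format}
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    return f"{base_url.rstrip('/')}/?{query}"


def annotate(text, base_url="http://localhost:9000",
             annotators=("tokenize", "ssplit", "pos")):
    """POST raw text to the (assumed) local server and return the parsed JSON reply."""
    url = build_corenlp_url(base_url, annotators)
    request = urllib.request.Request(url, data=text.encode("utf-8"), method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `annotate("Barack Obama was born in Hawaii.")` should return a dictionary of annotations; nothing is sent over the network until `annotate` is called.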

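CoreNLP supports multiple languages. The basic distribution ships model files for analyzing English, but the engine is compatible with models for other languages, and packaged models are available for Arabic, Chinese, French, German, and Spanish.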

Project details are on the official site: Stanford CoreNLP
GitHub: stanfordnlp/CoreNLP: Stanford CoreNLP: A Java suite of core NLP tools

Citation

Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

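2. What is StanfordNLP?

StanfordNLP is the official Python NLP toolkit from Stanford. It provides pretrained neural models for 53 (human) languages across 73 treebanks.
These modules are built on top of PyTorch, so running on a GPU-enabled machine gives faster performance.

If you are struggling with the Java version of Stanford CoreNLP, the Python package StanfordNLP is an alternative.

Besides the official Python package, there are many third-party Python wrappers, such as stanfordcorenlp.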
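For comparison, here is a minimal sketch of how the third-party stanfordcorenlp wrapper is typically used on Chinese text. It assumes the wrapper is installed, Java is available, and a CoreNLP distribution with the Chinese models has been unpacked locally; `corenlp_dir` is a hypothetical path and the function names `analyze` and `analyze_chinese` are invented for this example.

```python
def analyze(nlp, sentence):
    """Run a few annotators on `sentence` using a stanfordcorenlp-style client object."""
    return {
        "tokens": nlp.word_tokenize(sentence),
        "pos": nlp.pos_tag(sentence),
        "ner": nlp.ner(sentence),
    }


def analyze_chinese(corenlp_dir, sentence):
    """Start CoreNLP from `corenlp_dir` (hypothetical local path) with the Chinese models."""
    # Imported lazily: the wrapper needs Java and a local CoreNLP download to start.
    from stanfordcorenlp import StanfordCoreNLP

    nlp = StanfordCoreNLP(corenlp_dir, lang="zh")
    try:
        return analyze(nlp, sentence)
    finally:
        nlp.close()  # stop the background Java server the wrapper launched
```

Closing the client in a `finally` block matters here because the wrapper starts a Java process in the background, which would otherwise keep running after an error.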

Official site: StanfordNLP 0.2.0
GitHub: stanfordnlp/stanfordnlp

Language models
The table below shows an excerpt of the (human) languages that StanfordNLP supports through its Python neural pipeline.

LANGUAGE	TREEBANK	LANGUAGE CODE	TREEBANK CODE	MODELS	VERSION
Afrikaans	AfriBooms	af	af_afribooms	download	0.2.0
Ancient Greek	Perseus	grc	grc_perseus	download	0.2.0
	PROIEL	grc	grc_proiel	download	0.2.0
Arabic	PADT	ar	ar_padt	download	0.2.0
Armenian	ArmTDP	hy	hy_armtdp	download	0.2.0
Basque	BDT	eu	eu_bdt	download	0.2.0
Bulgarian	BTB	bg	bg_btb	download	0.2.0
Buryat	BDT	bxr	bxr_bdt	download	0.2.0
Catalan	AnCora	ca	ca_ancora	download	0.2.0	GNU License
Chinese (traditional)	GSD	zh	zh_gsd	download	0.2.0
Croatian	SET	hr	hr_set	download	0.2.0
Czech	CAC	cs	cs_cac	download	0.2.0

For the complete list of supported languages, see Models | StanfordNLP.
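To show how the codes in the table are used, the sketch below looks up the language code for a table row and passes it to `stanfordnlp.download` and `stanfordnlp.Pipeline`. The `TREEBANKS` dictionary copies only a few rows from the table, and the helper names `codes_for` and `build_pipeline` are assumptions made for this example, not part of the library.

```python
# A few rows copied from the table: language name -> (language code, treebank code).
TREEBANKS = {
    "Afrikaans": ("af", "af_afribooms"),
    "Arabic": ("ar", "ar_padt"),
    "Chinese (traditional)": ("zh", "zh_gsd"),
    "Czech": ("cs", "cs_cac"),
}


def codes_for(language):
    """Return (language_code, treebank_code) for a language in TREEBANKS."""
    return TREEBANKS[language]


def build_pipeline(language):
    """Download the models for `language` and return a neural pipeline for it."""
    # Imported lazily so the lookup helper above works without the package installed.
    import stanfordnlp

    lang_code, _treebank_code = codes_for(language)
    stanfordnlp.download(lang_code, force=True)  # force=True skips the confirmation prompt
    return stanfordnlp.Pipeline(lang=lang_code)
```

For Chinese, `build_pipeline("Chinese (traditional)")` would download the `zh` models and build a pipeline for them; the download is large, so it only happens when the function is called.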

Citation: if you use the neural pipeline in your research, you can cite their CoNLL 2018 Shared Task system description paper:

@inproceedings{qi2018universal,
address = {Brussels, Belgium},
author = {Qi, Peng and Dozat, Timothy and Zhang, Yuhao and Manning, Christopher D.},
booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
month = {October},
pages = {160--170},
publisher = {Association for Computational Linguistics},
title = {Universal Dependency Parsing from Scratch},
url = {https://nlp.stanford.edu/pubs/qi2018universal.pdf},
year = {2018}
}

3. Using StanfordNLP

3.1 Installation

StanfordNLP supports Python 3.6 and later. Installing from PyPI is recommended. If pip is already installed, run:

pip install stanfordnlp

This also resolves all of StanfordNLP's dependencies, for example PyTorch 1.0.0 or later.

Alternatively, you can install StanfordNLP from its git repository, which gives you more flexibility to develop on top of StanfordNLP and to train your own models. Run:

git clone https://github.com/stanfordnlp/stanfordnlp.git
cd stanfordnlp
pip install -e .
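After either installation route, a quick check like the following can confirm that the environment meets the requirements stated above. This is only a sketch: `check_environment` is a name made up for this example, and it checks that the packages can be found, not which versions are installed.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 6), packages=("stanfordnlp", "torch")):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d or later is required" % min_python)
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append("package %r is not installed" % name)
    return problems
```

Calling `check_environment()` and printing the result gives a short list of anything missing; `torch` is the import name of PyTorch.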

3.2 Running

Getting started with the neural pipeline

When running StanfordNLP for the first time, you can follow the code below:

>>> import stanfordnlp
>>> stanfordnlp.download('en')   # This downloads the English models for the neural pipeline
# IMPORTANT: The above line prompts you before downloading, which doesn't work well in a Jupyter notebook.
# To avoid a prompt when using notebooks, instead use: >>> stanfordnlp.download('en', force=True)
>>> nlp = stanfordnlp.Pipeline() # This sets up a default neural pipeline in English
>>> doc = nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
>>> doc