Tagging Genes and Proteins with BioBERT

This post walks through using BioBERT for named-entity recognition (NER) on biomedical text. BioBERT is a variant of the BERT model that is particularly well suited to biomedical-domain tasks; through pre-training and fine-tuning, it can recognize entities such as genes and proteins. The author lays out detailed steps for data processing, model construction, and training, demonstrating BioBERT on an NER task and reaching 96% training accuracy and 95% validation accuracy.


I. Introduction

Text mining in the clinical domain has become increasingly important given the sheer volume of biomedical documents now available, full of valuable information waiting to be deciphered and put to use by NLP techniques. With the accelerated progress in NLP, pre-trained language models now carry millions (or even billions) of parameters and can leverage massive amounts of textual knowledge for downstream tasks such as question answering, natural language inference, and, in the case we will work through here, biomedical text tagging via named-entity recognition. All of the code can be found on my GitHub.


II. Background

As a state-of-the-art breakthrough in NLP, Google researchers developed a language model known as BERT (Devlin et al., 2018), designed to learn deep representations by jointly conditioning on the bidirectional context of the text in all layers of its architecture¹. Such representations are valuable for sequential data like text, which relies heavily on context, and the advent of transfer learning in this field helps carry the encoded knowledge over to strengthen smaller, domain-specific tasks. In transfer learning, we call this step "fine-tuning", which means that the pre-trained model is now being fine-tuned for the particular task we have in mind. The original English-language model used two corpora in its pre-training: Wikipedia and BooksCorpus. For a deeper intuition behind transformers like BERT, I would suggest a series of blogs on their architecture and fine-tuned tasks.


[Figure: BERT architecture (Devlin et al., 2018)]
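To make the "bidirectional context" idea concrete, here is a minimal sketch of my own (not code from the article) of BERT's masked-language-model pre-training objective: the model fills in a masked token by conditioning on the words to both its left and its right. It assumes the Hugging Face transformers library and the general-domain bert-base-cased checkpoint; fine-tuning then swaps this fill-in-the-blank head for a task-specific head such as a token classifier.

```python
# Illustrative sketch (not from the article): BERT's masked-language-model head
# predicts a hidden token using context on BOTH sides of the mask, which is the
# "jointly conditioning on bidirectional context" described above.
# Assumes: pip install transformers torch
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-cased")

# The model sees "The protein is encoded by the BRCA1 ____." and scores candidate
# fillers using both the left context ("encoded by the BRCA1") and the right
# context (the sentence-final period).
for prediction in fill_mask("The protein is encoded by the BRCA1 [MASK]."):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")
```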

BioBERT (Lee et al., 2019) is a variation of the aforementioned model from Korea University and Clova AI. The researchers extended the original BERT pre-training corpora with PubMed and PMC. PubMed is a database of biomedical citations and abstracts, whereas PMC is an electronic archive of full-text journal articles. Their contribution is a biomedical language representation model that can handle tasks such as relation extraction and drug discovery, to name a few. By having a pre-trained model that encompasses both general and biomedical domain corpora, developers and practitioners can now capture biomedical terminology that a general-domain language model alone would struggle to represent.

BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Biomedical text mining is becoming increasingly important as the volume of biomedical literature grows rapidly. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general-domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora. We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement), and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts.
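As a rough illustration of how such a domain-specific checkpoint gets reused for tagging genes and proteins, the sketch below loads BioBERT with a token-classification head, the setup that is fine-tuned for NER later in the article. It is my own sketch, assuming the community BioBERT weights published as dmis-lab/biobert-base-cased-v1.1 on the Hugging Face hub; the three-tag BIO label set is a toy example, not the article's exact tag scheme.

```python
# Hedged sketch: load BioBERT weights with a fresh token-classification head for NER.
# "dmis-lab/biobert-base-cased-v1.1" and the BIO label set below are assumptions
# made for illustration, not the article's exact checkpoint or tag scheme.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-GENE", "I-GENE"]                     # toy gene/protein tag set
checkpoint = "dmis-lab/biobert-base-cased-v1.1"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),                  # classification head is randomly
    label2id={l: i for i, l in enumerate(labels)},     # initialized and then fine-tuned
)

# BioBERT keeps BERT's WordPiece tokenizer, so unseen biomedical terms are split
# into sub-word pieces; the domain knowledge lives in the pre-trained encoder weights.
print(tokenizer.tokenize("Phosphorylation of TP53 regulates apoptosis."))
```

From here, fine-tuning amounts to feeding tokenized sentences with aligned entity labels through this model and optimizing the classification loss, which is what the data-processing and training steps described in the rest of the article carry out.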