2020.acl-main.577
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6470–6476
July 5 - 10, 2020. ©2020 Association for Computational Linguistics
Figure 1: The network architectures of our system.
[Figure 1 components, from input to output: BERT, fastText & Char Embeddings; BiLSTM; FFNN_Start and FFNN_End; Biaffine Classifier.]

a number of NLP applications including NER. Devlin et al. (2019) invented BERT, a bidirectional transformer architecture for the training of language models. BERT and its siblings provided better language models that in turn led to higher scores for NER.

Lample et al. (2016) cast NER as transition-based dependency parsing using a Stack-LSTM. They compare with an LSTM-CRF model which turns out to be a very strong baseline. Their transition-based system uses two transitions (shift and reduce) to mark the named entities and handles flat NER, while our system has been designed to handle both nested and flat entities.

Nested Named Entity Recognition. Early work on nested NER, motivated particularly by the GENIA corpus, includes (Shen et al., 2003; Alex et al., 2007; Finkel and Manning, 2009). Finkel and Manning (2009) also proposed a constituency parsing-based approach. In recent years, we saw an increasing number of neural models targeting nested NER as well. Ju et al. (2018) suggested an LSTM-CRF model to predict nested named entities. Their algorithm iteratively continues until no further entities are predicted. Lin et al. (2019) tackle the problem in two steps: they first detect the entity head, and then they infer the entity boundaries as well as the category of the named entity. Straková et al. (2019) tag the nested named entity by a sequence-to-sequence model exploring combinations of context-based embeddings such as ELMo, BERT, and Flair. Zheng et al. (2019) use a boundary-aware network to solve nested NER. Similar to our work, Sohrab and Miwa (2018) enumerate exhaustively all possible spans up to a defined length by concatenating the LSTM outputs for the start and end position and then using this to calculate a score for each span. Apart from the different network and word embedding configurations, the main difference between their model and ours is therefore the use of the biaffine model. Due to the biaffine model, we get a global view of the sentence, while Sohrab and Miwa (2018) concatenate the outputs of the LSTMs of possible start and end positions up to a defined length. Dozat and Manning (2017) demonstrated that the biaffine mapping performs significantly better than just the concatenation of pairs of LSTM outputs.

3 Methods
Our model is inspired by the dependency parsing model of Dozat and Manning (2017). We use both word embeddings and character embeddings as input, and feed the output into a BiLSTM and finally to a biaffine classifier.

Figure 1 shows an overview of the architecture. To encode words, we use both BERT-Large and fastText embeddings (Bojanowski et al., 2016). For BERT we follow the recipe of Kantor and Globerson (2019) to obtain the context-dependent embeddings for a target token with 64 surrounding tokens on each side. For the character-based word embeddings, we use a CNN to encode the characters of the tokens. The concatenation of the word and character-based word embeddings is fed into a BiLSTM to obtain the word representations ($x$).

After obtaining the word representations from the BiLSTM, we apply two separate FFNNs to create different representations ($h_s$/$h_e$) for the start/end of the spans. Using different representations for the start/end of the spans allows the system to learn to identify the start/end of the spans separately. This improves accuracy compared to the model which directly uses the outputs of the LSTM, since the contexts of the start and end of the entity are different. Finally, we employ a biaffine model over the sentence to create an $l \times l \times c$ scoring tensor ($r_m$), where $l$ is the length of the sentence and $c$ is the number of NER categories + 1 (for non-entity). We compute the score for a span $i$ by:

$$h_s(i) = \mathrm{FFNN}_s(x_{s_i})$$
$$h_e(i) = \mathrm{FFNN}_e(x_{e_i})$$
$$r_m(i) = h_s(i)^{\top} U_m h_e(i) + W_m (h_s(i) \oplus h_e(i)) + b_m$$
where $s_i$ and $e_i$ are the start and end indices of the span $i$, $U_m$ is a $d \times c \times d$ tensor, $W_m$ is a $2d \times c$ matrix and $b_m$ is the bias.

The tensor $r_m$ provides scores for all possible spans that could constitute a named entity under the constraint that $s_i \le e_i$ (the start of an entity is at or before its end). We assign each span a NER category $y'$:

$$y'(i) = \arg\max r_m(i)$$

We then rank all the spans that have a category other than "non-entity" by their category scores ($r_m(i_{y'})$) in descending order and apply the following post-processing constraints: For nested NER, an entity is selected as long as it does not clash with the boundaries of higher-ranked entities. We say that an entity $i$ clashes boundaries with another entity $j$ if $s_i < s_j \le e_i < e_j$ or $s_j < s_i \le e_j < e_i$, e.g. in "the Bank of China", the entity "the Bank of" clashes boundaries with the entity "Bank of China", hence only the span with the higher category score will be selected. For flat NER, we apply one more constraint, in which any entity that contains or is inside an entity ranked before it will not be selected. The learning objective of our named entity recognizer is to assign a correct category (including the non-entity) to each valid span. Hence it is a multi-class classification problem and we optimise our models with softmax cross-entropy:

$$p_m(i_c) = \frac{\exp(r_m(i_c))}{\sum_{\hat{c}=1}^{C} \exp(r_m(i_{\hat{c}}))}$$
$$loss = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i_c} \log p_m(i_c)$$

4 Experiments
Data Set. We evaluate our system on both nested and flat NER. For the nested NER task, we use the ACE 2004², ACE 2005³, and GENIA (Kim et al., 2003) corpora; for flat NER, we test our system on the CONLL 2002 (Tjong Kim Sang, 2002), CONLL 2003 (Tjong Kim Sang and De Meulder, 2003) and ONTONOTES⁴ corpora.

For ACE 2004 and ACE 2005 we follow the same settings as Lu and Roth (2015) and Muis and Lu (2017) to split the data into 80%, 10%, 10% for train, development and test set respectively. To make a fair comparison we also used the same documents as in Lu and Roth (2015) for each split.

For GENIA, we use the GENIA v3.0.2 corpus. We preprocess the dataset following the same settings as Finkel and Manning (2009) and Lu and Roth (2015) and use a 90%/10% train/test split. For this evaluation, since we do not have a development set, we train our system for 50 epochs and evaluate on the final model.

For CONLL 2002 and CONLL 2003, we evaluate on all four languages (English, German, Dutch and Spanish). We follow Lample et al. (2016) to train our system on the concatenation of the train and development set.

For ONTONOTES, we evaluate on the English corpus and follow Strubell et al. (2017) to use the same train, development and test split as used in the CoNLL 2012 shared task for coreference resolution (Pradhan et al., 2012).

Evaluation Metric. We report recall, precision and F1 scores for all evaluations. A named entity is considered correct when both its boundary and its category are predicted correctly.

Hyperparameters. We use a unified setting for all of the experiments; Table 1 shows the hyperparameters for our system.

Parameter                  Value
BiLSTM size                200
BiLSTM layer               3
BiLSTM dropout             0.4
FFNN size                  150
FFNN dropout               0.2
BERT size                  1024
BERT layer                 last 4
fastText embedding size    300
Char CNN size              50
Char CNN filter widths     [3,4,5]
Char embedding size        8
Embeddings dropout         0.5
Optimiser                  Adam
learning rate              1e-3

Table 1: Major hyperparameters for our models.

² https://catalog.ldc.upenn.edu/LDC2005T09
³ https://catalog.ldc.upenn.edu/LDC2006T06
⁴ https://catalog.ldc.upenn.edu/LDC2013T19
⁵ In Sohrab and Miwa (2018), the last 10% of the training set is used as a development set; we include their result mainly because their system is similar to ours.
⁶ The revised version is provided by the shared task organiser in 2006 with more consistent annotations. We confirmed with the author of Akbik et al. (2018) that they used the revised version.
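The span selection procedure of Section 3 (take the arg max category of each span, rank the entity spans by category score, then drop spans that clash with higher-ranked ones) can be sketched in Python as follows. This is an illustrative sketch and not the authors' implementation: the function names, the nested-list layout `scores[start][end][category]` for the scoring tensor, and the use of index 0 for "non-entity" are assumptions made for the example.

```python
from typing import List, Tuple

Span = Tuple[int, int, int, float]  # (start, end, category, score)

NON_ENTITY = 0  # assumption: category index 0 stands for "non-entity"


def clashes(s_i: int, e_i: int, s_j: int, e_j: int) -> bool:
    """Boundary clash: the spans partially overlap, neither contains the other."""
    return (s_i < s_j <= e_i < e_j) or (s_j < s_i <= e_j < e_i)


def contains(s_i: int, e_i: int, s_j: int, e_j: int) -> bool:
    """True if one of the two spans lies inside (or equals) the other."""
    return (s_i <= s_j and e_j <= e_i) or (s_j <= s_i and e_i <= e_j)


def decode(scores: List[List[List[float]]], flat: bool = False) -> List[Span]:
    """Select entity spans from an l x l x c score tensor."""
    length = len(scores)
    candidates: List[Span] = []
    for start in range(length):
        for end in range(start, length):  # only spans with start <= end
            cat_scores = scores[start][end]
            best = max(range(len(cat_scores)), key=cat_scores.__getitem__)
            if best != NON_ENTITY:
                candidates.append((start, end, best, cat_scores[best]))

    # Rank the remaining spans by category score, highest first.
    candidates.sort(key=lambda span: span[3], reverse=True)

    selected: List[Span] = []
    for start, end, cat, score in candidates:
        rejected = any(
            clashes(start, end, s, e) or (flat and contains(start, end, s, e))
            for s, e, _, _ in selected
        )
        if not rejected:
            selected.append((start, end, cat, score))
    return selected


def make_scores(length: int, n_categories: int) -> List[List[List[float]]]:
    """Hypothetical helper: a tensor in which every span prefers non-entity."""
    return [
        [[1.0] + [0.0] * (n_categories - 1) for _ in range(length)]
        for _ in range(length)
    ]


# Toy example for "the Bank of China" (tokens 0..3); the scores are made up.
demo = make_scores(4, 3)
demo[1][3] = [0.0, 5.0, 0.0]  # "Bank of China", category 1
demo[0][2] = [0.0, 3.0, 0.0]  # "the Bank of", category 1
demo[3][3] = [0.0, 0.0, 4.0]  # "China", category 2
print(decode(demo))             # nested setting
print(decode(demo, flat=True))  # flat setting
```

In the toy example, "the Bank of" is dropped in both settings because it clashes with the higher-scoring "Bank of China", while the embedded span "China" survives only in the nested setting.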
Model                        P     R     F1
ACE 2004
Katiyar and Cardie (2018)    73.6  71.8  72.7
Wang et al. (2018)           -     -     73.3
Wang and Lu (2018)           78.0  72.4  75.1
Straková et al. (2019)       -     -     84.4
Luan et al. (2019)           -     -     84.7
Our model                    87.3  86.0  86.7
ACE 2005
Katiyar and Cardie (2018)    70.6  70.4  70.5
Wang et al. (2018)           -     -     73.0
Wang and Lu (2018)           76.8  72.3  74.5
Lin et al. (2019)            76.2  73.6  74.9
Fisher and Vlachos (2019)    82.7  82.1  82.4
Luan et al. (2019)           -     -     82.9
Straková et al. (2019)       -     -     84.3
Our model                    85.2  85.6  85.4
GENIA
Katiyar and Cardie (2018)    79.8  68.2  73.6
Wang et al. (2018)           -     -     73.9
Ju et al. (2018)             78.5  71.3  74.7
Wang and Lu (2018)           77.0  73.3  75.1
Sohrab and Miwa (2018)⁵      93.2  64.0  77.1
Lin et al. (2019)            75.8  73.9  74.8
Luan et al. (2019)           -     -     76.2
Straková et al. (2019)       -     -     78.3
Our model                    81.8  79.3  80.5

Table 2: State of the art comparison on ACE 2004, ACE 2005 and GENIA corpora for nested NER.

Model                        P     R     F1
ONTONOTES
Chiu and Nichols (2016)      86.0  86.5  86.3
Strubell et al. (2017)       -     -     86.8
Clark et al. (2018)          -     -     88.8
Fisher and Vlachos (2019)    -     -     89.2
Our model                    91.1  91.5  91.3
CONLL 2003 English
Chiu and Nichols (2016)      91.4  91.9  91.6
Lample et al. (2016)         -     -     90.9
Strubell et al. (2017)       -     -     90.7
Devlin et al. (2019)         -     -     92.8
Straková et al. (2019)       -     -     93.4
Our model                    93.7  93.3  93.5
CONLL 2003 German
Lample et al. (2016)         -     -     78.8
Straková et al. (2019)       -     -     85.1
Our model                    88.3  84.6  86.4
CONLL 2003 German revised⁶
Akbik et al. (2018)          -     -     88.3
Our model                    92.4  88.2  90.3
CONLL 2002 Spanish
Lample et al. (2016)         -     -     85.8
Straková et al. (2019)       -     -     88.8
Our model                    90.6  90.0  90.3
CONLL 2002 Dutch
Lample et al. (2016)         -     -     81.7
Akbik et al. (2019)          -     -     90.4
Straková et al. (2019)       -     -     92.7
Our model                    94.5  92.8  93.7

Table 3
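The P, R and F1 columns follow the exact-match criterion stated under Evaluation Metric: a predicted entity counts as correct only if both its boundary and its category agree with a gold entity. A minimal sketch of such a scorer is given below. It assumes micro-averaging over all entities in the evaluation set, which the text does not state explicitly, and the `(sentence id, start, end, category)` tuple layout and the function name are illustrative choices rather than the authors' evaluation code.

```python
from typing import Iterable, Set, Tuple

Entity = Tuple[int, int, int, str]  # (sentence id, start, end, category)


def precision_recall_f1(
    gold: Iterable[Entity], predicted: Iterable[Entity]
) -> Tuple[float, float, float]:
    """Exact-match scores: boundary and category must both be correct."""
    gold_set: Set[Entity] = set(gold)
    pred_set: Set[Entity] = set(predicted)
    correct = len(gold_set & pred_set)

    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: one prediction has the wrong category, one is spurious.
gold = [(0, 1, 3, "ORG"), (0, 3, 3, "GPE"), (1, 0, 1, "PER")]
pred = [(0, 1, 3, "ORG"), (0, 3, 3, "ORG"), (1, 0, 1, "PER"), (1, 4, 4, "LOC")]
p, r, f1 = precision_recall_f1(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Here two of the four predictions match a gold entity exactly, so precision is 2/4, recall is 2/3 and F1 is 4/7; the span (0, 3, 3) is counted as wrong because its category differs even though its boundary is right.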
fine-grained named entity categories. To predict named entities for this corpus is more difficult than for CONLL 2002 and CONLL 2003. These corpora use coarse-grained named entity categories (only 4 categories). The sequence-to-sequence models usually perform better on the CONLL 2003 English corpus (see Table 3), e.g. the systems of Chiu and Nichols (2016) and Strubell et al. (2017). In contrast, our system is less sensitive to the domain and the granularity of the categories. As shown in Table 3, our system achieved an F1 score of 91.3% on the ONTONOTES corpus, which is very close to our system performance on the CONLL 2003 corpus (93.5%). On the multi-lingual data, our system achieved F1 scores of 86.4% for German, 90.3% for Spanish and 93.7% for Dutch. Our system outperforms the previous SoTA results by large margins of 2.1%, 1.5%, 1.3% and 1% on the ONTONOTES, Spanish, German and Dutch corpora respectively and is slightly better than the SoTA on the English data set. In addition, we also tested our system on the revised version of the German data to compare with the model by Akbik et al. (2018); our system again achieved a substantial gain of 2% when compared with their system.

7 Ablation Study
To evaluate the contribution of individual components of our system, we further remove selected components and use ONTONOTES for evaluation (see Table 4). We choose ONTONOTES for our ablation study as it is the largest corpus.

                 F1     Δ
Our model        89.9
- biaffine       89.1   0.8
- BERT emb       87.5   2.4
- fastText emb   89.5   0.4
- Char emb       89.8   0.1

Table 4: The comparison between our full model and ablated models on ONTONOTES development set.

Biaffine Classifier. We replace the biaffine mapping with a CRF layer and convert our system into a sequence labelling model. The CRF layer is frequently used in models for flat NER, e.g. (Lample et al., 2016). When we replace the biaffine model of our system with a CRF layer, the performance drops by 0.8 percentage points (Table 4). The large performance difference shows the benefit of adding a biaffine model and confirms our hypothesis that the dependency parsing framework is an important factor for the high accuracy of our system.

Contextual Embeddings. We ablate BERT embeddings and, as expected, after removing BERT embeddings the system performance drops by a large 2.4 percentage points (see Table 4). This shows that BERT embeddings are one of the most important factors for the accuracy.

Context Independent Embeddings. We remove the context-independent fastText embedding from our system. The context-independent embedding contributes 0.4 percentage points towards the score of our full system (Table 4), which suggests that even with the BERT embeddings enabled, the context-independent embeddings can still make a noticeable improvement to a system.

Character Embeddings. Finally, we remove the character embeddings. As we can see from Table 4, the impact of character embeddings is quite small (0.1 percentage points). One explanation would be that English is not a morphologically rich language and hence does not benefit much from character-level information, and that the BERT embeddings themselves are based on word pieces that already capture some character-level information.

Overall, the biaffine mapping and the BERT embedding together contributed most to the high accuracy of our system.

8 Conclusion
In this paper, we reformulate NER as a structured prediction task and adopt a SoTA dependency parsing approach for nested and flat NER. Our system uses contextual embeddings as input to a multi-layer BiLSTM. We employ a biaffine model to assign scores for all spans in a sentence. Further constraints are used to predict nested or flat named entities. We evaluated our system on eight named entity corpora. The results show that our system achieves SoTA on all of the eight corpora. We demonstrate that advanced structured prediction techniques lead to substantial improvements for both nested and flat NER.

Acknowledgments
This research was supported in part by the DALI project, ERC Grant 695662.
References

Alan Akbik, Tanja Bergmann, and Roland Vollgraf. 2019. Pooled contextualized embeddings for named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 724–728, Minneapolis, Minnesota. Association for Computational Linguistics.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Beatrice Alex, Barry Haddow, and Claire Grover. 2007. Recognising nested named entities in biomedical text. In Proc. of BioNLP, pages 65–72.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Jason PC Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370.

Kevin Clark, Minh-Thang Luong, Christopher D. Manning, and Quoc Le. 2018. Semi-supervised sequence modeling with cross-view training. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1914–1925, Brussels, Belgium. Association for Computational Linguistics.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Timothy Dozat and Christopher Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of 5th International Conference on Learning Representations (ICLR).

Jenny Rose Finkel and Christopher D. Manning. 2009. Nested named entity recognition. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 141–150, Singapore. Association for Computational Linguistics.

Joseph Fisher and Andreas Vlachos. 2019. Merge and label: A novel neural network architecture for nested NER. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5840–5850, Florence, Italy. Association for Computational Linguistics.

Meizhi Ju, Makoto Miwa, and Sophia Ananiadou. 2018. A neural layered model for nested named entity recognition. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1446–1459, New Orleans, Louisiana. Association for Computational Linguistics.

Ben Kantor and Amir Globerson. 2019. Coreference resolution with entity equalization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 673–677, Florence, Italy. Association for Computational Linguistics.

Arzoo Katiyar and Claire Cardie. 2018. Nested named entity recognition revisited. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 861–871, New Orleans, Louisiana. Association for Computational Linguistics.

J.-D. Kim, T. Ohta, Y. Tateisi, and J. Tsujii. 2003. GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics, 19(suppl_1):i180–i182.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270. Association for Computational Linguistics.

Hongyu Lin, Yaojie Lu, Xianpei Han, and Le Sun. 2019. Sequence-to-nuggets: Nested entity mention detection via anchor-region networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5182–5192, Florence, Italy. Association for Computational Linguistics.

Wei Lu and Dan Roth. 2015. Joint mention extraction and classification with mention hypergraphs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 857–867, Lisbon, Portugal. Association for Computational Linguistics.

Yi Luan, Dave Wadden, Luheng He, Amy Shah, Mari Ostendorf, and Hannaneh Hajishirzi. 2019. A general framework for information extraction using dynamic span graphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3036–3046, Minneapolis, Minnesota. Association for Computational Linguistics.
Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074, Berlin, Germany. Association for Computational Linguistics.

Aldrian Obaja Muis and Wei Lu. 2017. Labeling gaps between words: Recognizing overlapping mentions with mention separators. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2608–2618, Copenhagen, Denmark. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke S. Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Proceedings of the Sixteenth Conference on Computational Natural Language Learning (CoNLL 2012), Jeju, Korea.

Dan Shen, Jie Zhang, Guodong Zhou, Jian Su, and Chew-Lim Tan. 2003. Effective adaptation of a Hidden Markov Model-based Named Entity Recognizer for the biomedical domain. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine.

Mohammad Golam Sohrab and Makoto Miwa. 2018. Deep exhaustive model for nested named entity recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2843–2849, Brussels, Belgium. Association for Computational Linguistics.

Jana Straková, Milan Straka, and Jan Hajic. 2019. Neural architectures for nested NER through linearization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5326–5331, Florence, Italy. Association for Computational Linguistics.

Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. 2017. Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2670–2680, Copenhagen, Denmark. Association for Computational Linguistics.

Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002).

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.

Bailin Wang and Wei Lu. 2018. Neural segmental hypergraphs for overlapping mention recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 204–214, Brussels, Belgium. Association for Computational Linguistics.

Bailin Wang, Wei Lu, Yu Wang, and Hongxia Jin. 2018. A neural transition-based model for nested mention recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1011–1017, Brussels, Belgium. Association for Computational Linguistics.

Changmeng Zheng, Yi Cai, Jingyun Xu, Ho-fung Leung, and Guandong Xu. 2019. A boundary-aware neural model for nested named entity recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 357–366, Hong Kong, China. Association for Computational Linguistics.