Merging and splitting tokens
In some cases, we want to merge or split multiword tokens. For example, this is needed when the tokenizer does not perform well on some unusual tokens and you need to split them by hand. In this subsection, we'll cover a very practical remedy for multiword expressions, multiword named entities, and typos: `doc.retokenize`.
`doc.retokenize` is used as a context manager, and it is the correct tool for merging and splitting spans of `Doc` objects. The `retokenizer.merge()` method receives the span to merge and the attributes to set on the merged token. Let's see an example of retokenization by merging a multiword named entity, as follows:
- First, let's create a `doc` from the sentence and print the entities:

  ```python
  doc = nlp("She lived in New Hampshire.")
  print(doc.ents)
  ```
- Now let's see how spaCy separated the tokens:

  ```python
  print([(token.text, token.i) for token in doc])
  >>> [('She', 0), ('lived', 1), ('in', 2), ('New', 3), ('Hampshire', 4), ('.', 5)]
  ```
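Putting the steps above together, the merge itself can be sketched as follows. This is a minimal, self-contained example: it uses a blank English pipeline (an assumption made here so the snippet runs without downloading a model, which means `doc.ents` will be empty), while a full pipeline such as `en_core_web_sm` would also tag "New Hampshire" as an entity:

```python
import spacy

# A blank English pipeline is enough to demonstrate retokenization;
# it provides the tokenizer but no NER component.
nlp = spacy.blank("en")
doc = nlp("She lived in New Hampshire.")

print([(token.text, token.i) for token in doc])
# [('She', 0), ('lived', 1), ('in', 2), ('New', 3), ('Hampshire', 4), ('.', 5)]

# Merge the two tokens of the multiword name into a single token,
# setting the LEMMA attribute on the merged token.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5], attrs={"LEMMA": "new hampshire"})

print([(token.text, token.i) for token in doc])
# [('She', 0), ('lived', 1), ('in', 2), ('New Hampshire', 3), ('.', 4)]
```

Note that all merging and splitting must happen inside the `with` block; spaCy applies the changes and reindexes the remaining tokens when the context manager exits.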