Training with a custom corpus reader
spaCy’s Corpus class manages annotated corpora for data loading during training. The default corpus reader (spacy.Corpus.v1) creates the Example objects using the make_doc() method of the Language class. This method only tokenizes the text. To train the EntityLinker component, however, the entities must already be available in the doc. That’s why we will create our own corpus reader in a file called custom_functions.py. The reader should receive as parameters the path to the DocBin file and the nlp object. Inside the method, we will loop through each Doc to create the examples. Let’s go ahead and create this method:
- First, we disable the EntityLinker component of the pipeline and then get all the docs from the DocBin file:

      def read_files(file: Path, nlp: "Language") -> Iterable[Example]:
          with nlp.select_pipes(disable="entity_linker"):
              doc_bin = DocBin...
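
For orientation, here is a minimal sketch of how such a reader could look as a whole. It is an illustration under stated assumptions, not necessarily the exact code built in this section. The registry name MyCorpus.v1 and the create_docbin_reader wrapper are hypothetical. Running the pipeline on doc.text to build the predicted side of each Example is also an assumption. Unlike the snippet above, the sketch only disables entity_linker when the pipeline actually contains it, so it also runs on a blank pipeline.

```python
from functools import partial
from pathlib import Path
from typing import Callable, Iterable

import spacy
from spacy.language import Language
from spacy.tokens import DocBin
from spacy.training import Example


def read_files(file: Path, nlp: Language) -> Iterable[Example]:
    # Assumption: skip the disable step when the pipeline has no
    # entity_linker, because select_pipes raises for unknown names.
    to_disable = [name for name in nlp.pipe_names if name == "entity_linker"]
    with nlp.select_pipes(disable=to_disable):
        # Load the serialized docs and rebuild them with the pipeline's vocab.
        doc_bin = DocBin().from_disk(file)
        for doc in doc_bin.get_docs(nlp.vocab):
            # Assumption: the remaining components process the raw text,
            # while the stored doc serves as the gold-standard reference.
            yield Example(nlp(doc.text), doc)


# Hypothetical registry name; the training config would refer to it
# as @readers = "MyCorpus.v1".
@spacy.registry.readers("MyCorpus.v1")
def create_docbin_reader(file: Path) -> Callable[[Language], Iterable[Example]]:
    # The training loop calls the returned function with the nlp object.
    return partial(read_files, file)
```

Binding the path with partial leaves a callable that only needs the nlp object, which is the shape spaCy expects from a registered corpus reader.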