Training with a custom corpus reader
spaCy’s Corpus class manages annotated corpora for data loading during training. The default corpus reader (spacy.Corpus.v1) creates the Example objects using the make_doc() method of the Language class, which only tokenizes the text. Training the EntityLinker component, however, requires the entities to be available in the doc. That’s why we will create our own corpus reader in a file called custom_functions.py. The reader should receive the path to the DocBin file and the nlp object as parameters. Inside the function, we will loop through each Doc to create the examples. Let’s go ahead and create this function:
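Before building the reader, it may help to see the limitation in isolation. The following is a minimal sketch, not the reader itself: it assumes a blank English pipeline, and the sample sentence and the PERSON label are illustrative choices only. It shows that a Doc produced by make_doc() carries no entities, while an annotated Doc (standing in here for one loaded from the DocBin file) does.

```python
import spacy
from spacy.tokens import Span
from spacy.training import Example

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")
text = "Emerson was born in Boston."  # illustrative sentence (assumption)

# What make_doc() gives us: a tokenized Doc with no annotations.
predicted = nlp.make_doc(text)

# Stand-in for an annotated Doc as it would come from the DocBin file.
# The entity span and its label are set by hand for illustration.
reference = nlp.make_doc(text)
reference.ents = [Span(reference, 0, 1, label="PERSON")]

# An Example pairs the predicted Doc with the annotated reference Doc.
example = Example(predicted, reference)

print(len(example.predicted.ents))
print([(ent.text, ent.label_) for ent in example.reference.ents])
```

Because the predicted side has no entities, a component that depends on them has nothing to work with, which is the gap the custom reader closes. The reader is built in the following steps: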
- First, we disable the EntityLinker component of the pipeline and then get all the docs from the DocBin file:
  def read_files(file: Path, nlp: "Language") -> Iterable[Example]:
      with nlp.select_pipes(disable="entity_linker"):
          doc_bin = DocBin...