Training an NER pipeline component
Our goal is to create a pipeline that identifies FASHION_BRAND entities. We will do that by training the EntityRecognizer pipeline component. It uses a transition-based algorithm that assumes the most decisive information about an entity is close to its initial tokens.
The first step in training the component is to create the config file. We'll use the spacy init config CLI command to do that:
python3 -m spacy init config cpu_config.cfg --lang "en" --pipeline "ner" --optimize "efficiency"
This creates the cpu_config.cfg file with the default settings and configures the ner component for training, optimized for efficiency (faster inference, a smaller model, and lower memory consumption). Next, we will use the preprocess.py script from the spaCy tutorial to convert the jsonl data into DocBin objects:
python3 ./scripts/preprocess.py ./data/fashion_brands_training.jsonl ./data/fashion_brands_training...
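To make the conversion step more concrete, here is a minimal sketch of what a jsonl-to-DocBin conversion can look like. It is not the tutorial's preprocess.py: the jsonl_to_docbin helper, the sample sentence, and the record layout (a "text" field plus a "spans" list with "start", "end", and "label" character offsets) are assumptions made for illustration.

```python
import json

import spacy
from spacy.tokens import DocBin


def jsonl_to_docbin(lines, nlp):
    """Build a DocBin from JSONL lines (hypothetical helper, assumed record layout)."""
    doc_bin = DocBin()
    for line in lines:
        record = json.loads(line)
        # Tokenize only; a blank pipeline has no trained components.
        doc = nlp.make_doc(record["text"])
        ents = []
        for span in record.get("spans", []):
            ent = doc.char_span(span["start"], span["end"], label=span["label"])
            # char_span returns None when the offsets do not align with token boundaries.
            if ent is not None:
                ents.append(ent)
        doc.ents = ents
        doc_bin.add(doc)
    return doc_bin


# Illustrative record, not taken from the tutorial data set.
sample = [
    '{"text": "I bought a Gucci bag.", '
    '"spans": [{"start": 11, "end": 16, "label": "FASHION_BRAND"}]}'
]

nlp = spacy.blank("en")
doc_bin = jsonl_to_docbin(sample, nlp)

# Round-trip the DocBin to check that the entity annotation survived.
docs = list(doc_bin.get_docs(nlp.vocab))
print([(ent.text, ent.label_) for ent in docs[0].ents])
```

Calling doc_bin.to_disk() with an output path would then write the binary file that spaCy's training command reads. Skipping spans that do not align with token boundaries is a design choice of this sketch; a real script might log or repair them instead.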