Mastering Rule-Based Matching
Rule-based information extraction is indispensable for any natural language processing (NLP) pipeline. Certain types of entities, such as times, dates, and telephone numbers, have distinct formats that can be recognized by a set of rules without having to train statistical models.
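To make the idea concrete, here is a minimal sketch that uses only Python's standard `re` module, before any spaCy machinery is involved. The two patterns and the helper name `extract_formatted_entities` are illustrative assumptions rather than anything defined in this chapter, and they cover only one simple phone-number style and 24-hour times.

```python
import re

# Assumed, simplified pattern: US-style numbers such as 555-123-4567 or (555) 123-4567.
PHONE_RE = re.compile(r"(?:\(\d{3}\)\s?|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b")

# Assumed, simplified pattern: 24-hour times such as 9:05 or 17:30.
TIME_RE = re.compile(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b")


def extract_formatted_entities(text):
    """Return phone numbers and times found in text, using rules only."""
    return {
        "phones": PHONE_RE.findall(text),
        "times": TIME_RE.findall(text),
    }


sample = "Call 555-123-4567 or (555) 987-6543 before 17:30."
result = extract_formatted_entities(sample)
print(result)
```

No training data or statistical model is needed here: the fixed format of the entity is the whole rule. Real-world phone and time formats vary far more than these two patterns allow.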
In this chapter, you will learn how to quickly extract information from text by matching patterns and phrases. You will use morphological features, part-of-speech (POS) tags, regular expressions (regexes), and other spaCy features to form pattern objects to feed to Matcher objects. You will then combine statistical models with rule-based matching to refine their output and improve their accuracy.
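As a first taste of what a pattern object fed to a Matcher looks like, the following is a minimal sketch. It assumes spaCy v3 is installed and uses a blank English pipeline, so it relies only on token text and punctuation flags rather than POS tags or morphology. The rule name `GREETING` and the sample sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline provides tokenization only; no trained model is required.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# A pattern is a list of dicts, one dict per token to match.
pattern = [
    {"LOWER": {"IN": ["hello", "hi"]}},  # "hello" or "hi", case-insensitive
    {"IS_PUNCT": True, "OP": "?"},       # an optional punctuation token
    {"LOWER": "world"},                  # followed by "world"
]
matcher.add("GREETING", [pattern])  # "GREETING" is a hypothetical rule name

doc = nlp("Hello, world! hi world")
# Each match is a (match_id, start, end) triple of token indices.
matches = sorted(matcher(doc), key=lambda m: m[1])
found = [doc[start:end].text for _, start, end in matches]
print(found)
```

Each dictionary describes one token, so the pattern reads left to right like the text it matches. Attributes that depend on a trained pipeline, such as POS tags, fit into the same dictionary form once a full model is loaded.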
By the end of this chapter, you will understand rule-based matching, a vital part of information extraction. You will be able to extract entities that follow specific formats, as well as entities specific to your domain.
In this chapter, we’re going to cover the following main topics...