Text classification with spaCy
spaCy models perform very well on general NLP tasks, such as analyzing a sentence’s syntax, splitting a paragraph into sentences, and extracting entities. However, we sometimes work in very specific domains that spaCy’s pre-trained models never learned to handle.
For example, text from X (formerly Twitter) contains many nonstandard tokens, such as hashtags, emoticons, and mentions. Posts on X are also often short phrases rather than full sentences. It’s entirely reasonable to expect spaCy’s POS tagger to perform poorly here, as it is trained on full, grammatically correct English sentences.
Another example is the medical domain. It contains many entities, such as drug, disease, and chemical compound names. We should not expect spaCy’s NER model to recognize these entities because it has no disease or drug entity labels. The NER model knows nothing about the medical domain.
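One way to confirm this kind of gap is to compare the labels a pipeline's NER component knows with the labels a domain needs. The following is a minimal sketch, not a definitive check: the helper name missing_entity_labels and the label names DISEASE and DRUG are made up for illustration, and the blank pipeline with two hand-added labels only stands in for a real pre-trained model.

```python
import spacy


def missing_entity_labels(nlp, wanted):
    """Return the labels in `wanted` that the pipeline's NER component lacks.

    `missing_entity_labels` is a hypothetical helper written for this sketch;
    it is not part of spaCy.
    """
    if "ner" not in nlp.pipe_names:
        # No NER component at all, so every wanted label is missing.
        return sorted(set(wanted))
    known = set(nlp.get_pipe("ner").labels)
    return sorted(set(wanted) - known)


# Stand-in pipeline: a blank English pipeline with an untrained NER component
# that knows only two general-purpose labels (an assumption for illustration).
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for label in ("PERSON", "ORG"):
    ner.add_label(label)

# DISEASE and DRUG are hypothetical medical labels.
missing = missing_entity_labels(nlp, ["DISEASE", "DRUG", "ORG"])
print(missing)
```

The same helper can be applied to a pre-trained pipeline loaded with spacy.load, for example en_core_web_sm, provided that model package is installed.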
In...