Introducing Tokenization
We saw in Figure 2.1 that the first step in a text processing pipeline is tokenization. Tokenization always comes first because every other pipeline operation needs tokens to work on.
Tokenization simply means splitting a sentence into its tokens. You can think of a token as the smallest meaningful unit of a piece of text. Tokens can be words, numbers, punctuation marks, currency symbols, and any other meaningful symbols that are the building blocks of a sentence. The following are examples of tokens:
USA N.Y. City 33 3rd ! … ? 's
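To see why token boundaries do not simply coincide with whitespace, here is a small sketch using a made-up sentence. A naive whitespace split keeps the clitic 's and the punctuation glued to their neighboring words, whereas a tokenizer separates them:

    # A made-up example sentence containing several of the token types above
    text = "She moved to N.Y. on May 3rd, and it's been great!"

    # Naive whitespace splitting keeps punctuation and clitics attached to words
    print(text.split())
    # ['She', 'moved', 'to', 'N.Y.', 'on', 'May', '3rd,', 'and', "it's", 'been', 'great!']

    # A tokenizer would instead yield tokens such as:
    # ['She', 'moved', 'to', 'N.Y.', 'on', 'May', '3rd', ',', 'and', 'it', "'s", 'been', 'great', '!']

spaCy's English tokenizer handles exactly these cases, as the following steps show.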
The input to the spaCy tokenizer is Unicode text, and the result is a Doc object. The following code shows the tokenization process:
- First, we import the library and load the English language model:
    import spacy
    nlp = spacy.load("en_core_web_md")

- Next, we apply the nlp object to a sentence to create a Doc object. The Doc object is the container for a sequence of Token objects. We then print the token texts...
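A minimal sketch of how these two steps fit together, assuming the en_core_web_md model has been downloaded and using an example sentence of our own:

    import spacy

    # Load the medium English pipeline (downloaded beforehand with:
    # python -m spacy download en_core_web_md)
    nlp = spacy.load("en_core_web_md")

    # Calling nlp on Unicode text runs the tokenizer and returns a Doc object
    doc = nlp("I own a ginger cat. It's very cute.")

    # A Doc is a sequence of Token objects; print each token's text
    print([token.text for token in doc])

Note how the contraction in the example sentence is split into two tokens, It and 's, matching the 's token listed earlier.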