Introducing Tokenization
We saw in Figure 2.1 that the first step in a text processing pipeline is tokenization. Tokenization is always the first operation because all the other operations require tokens.
Tokenization simply means splitting a sentence into its tokens. You can think of a token as the smallest meaningful part of a piece of text. Tokens can be words, numbers, punctuation marks, currency symbols, and any other meaningful symbols that are the building blocks of a sentence. The following are examples of tokens:
USA N.Y. City 33 3rd ! … ? 's
The input to the spaCy tokenizer is Unicode text, and the result is a Doc object. The following code shows the tokenization process:
- First, we import the library and load the English language model:
import spacy
nlp = spacy.load("en_core_web_md")
- Next, we apply the nlp object to a sentence to create a Doc object. The Doc object is the container for a sequence of Token objects. We then print the token texts, as in the sketch after this list.
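A minimal sketch of that step, assuming the nlp object loaded above; the sample sentence is our own choice, and any text would work:

doc = nlp("We flew to N.Y. on the 3rd of May.")   # feed Unicode text to the pipeline; the result is a Doc
print([token.text for token in doc])               # iterating over a Doc yields Token objects; .text is the raw string

The printed list should look something like ['We', 'flew', 'to', 'N.Y.', 'on', 'the', '3rd', 'of', 'May', '.'], with the abbreviation N.Y. and the ordinal 3rd kept as single tokens and the sentence-final period split off on its own.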