Running the pipeline with large datasets
The Language.pipe() method processes texts as a stream and yields Doc objects in order. It buffers the texts into batches instead of processing them one by one, which is usually more efficient. Because the method returns a Python generator, we cannot index into its result directly; to get a specific Doc, we first need to materialize the results with list(). This is how you can do it:
utterance_texts = df.text.to_list()
processed_docs = list(nlp.pipe(utterance_texts))
print(processed_docs[0], processed_docs[0]._.intent)
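As a side note on why list() is needed, here is a minimal pure-Python sketch of the same streaming pattern: a generator that reads its input in batches and yields results in input order. It is an illustration only and not how spaCy implements Language.pipe(); the names fake_process and pipe_in_batches are hypothetical, and the "processing" step simply upper-cases the text.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fake_process(text: str) -> str:
    """Hypothetical stand-in for a real pipeline; it only upper-cases the text."""
    return text.upper()


def pipe_in_batches(texts: Iterable[str], batch_size: int = 2) -> Iterator[str]:
    """Lazily yield one processed item per input text, in input order."""
    iterator = iter(texts)
    while True:
        # Pull up to batch_size items from the input stream.
        batch: List[str] = list(islice(iterator, batch_size))
        if not batch:
            return
        # A real pipeline would process the whole batch at once here.
        for text in batch:
            yield fake_process(text)


texts = ["book a flight", "play some jazz", "what is the weather"]

# A generator has no random access, so stream[0] would raise a TypeError.
stream = pipe_in_batches(texts)

# Materializing the generator with list() makes indexing possible.
results = list(pipe_in_batches(texts))
first = results[0]
print(first)
```

The trade-off is memory: list() holds every result at once, so for very large inputs it can be preferable to iterate over the generator directly and keep only what is needed.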
In the nlp.pipe() code, we get a list of text utterances from the DataFrame we loaded at the beginning of the chapter and process it in batches using .pipe(). Let's compare the execution time with and without the .pipe() method:
import time
start_time = time.time()
utterance_texts = df.text.to_list()
processed_docs_vanilla = [nlp(text) for text in utterance_texts]
end_time = time.time()
execution_time = end_time - start_time
print("Execution...