Text Preprocessing in Python
Last Updated: 26 Apr, 2025
Text preprocessing is a key part of Natural Language Processing (NLP). It cleans raw text and converts it into a format suitable for analysis and machine learning. In this article, we will perform text preprocessing with several Python libraries and techniques, focusing on the NLTK (Natural Language Toolkit) library.
1. Importing Libraries
We will import nltk, re (Python's regular expression module), string and inflect.
Python
import nltk
import string
import re
import inflect
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
2. Convert to Lowercase
We lowercase the text to reduce the size of the vocabulary of our text data.
Python
def text_lowercase(text):
    # str.lower() returns a lowercased copy of the string
    return text.lower()

input_str = "Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"
text_lowercase(input_str)
Output:
"hey, did you know that the summer break is coming? amazing right !! it's only 5 more days !!"
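For text that is not purely English, str.casefold() can be a more aggressive alternative to str.lower(). The sketch below is only an illustration; the function name text_casefold is a hypothetical helper and not part of the original article.

```python
def text_casefold(text):
    # casefold() applies Unicode case folding, which goes further than lower()
    return text.casefold()

sample = "Straße"
print(sample.lower())         # straße
print(text_casefold(sample))  # strasse
```

For plain ASCII text the two methods give the same result, so lower() remains a reasonable default.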
3. Removing Numbers
We can either remove numbers or convert the numbers into their textual representations. To remove the numbers we can use regular expressions.
Python
def remove_numbers(text):
    # Replace every run of one or more digits with an empty string
    result = re.sub(r'\d+', '', text)
    return result

input_str = "There are 3 balls in this bag, and 12 in the other one."
remove_numbers(input_str)
Output:
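'There are  balls in this bag, and  in the other one.'
Note that the spaces on both sides of each removed number remain, so the result contains double spaces. The whitespace step in section 6 cleans these up.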
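Numbers written with a decimal point or a thousands separator, such as 3.5 or 1,250, would be split into fragments by the simple \d+ pattern. A minimal sketch of one way to handle them, assuming such numbers should be removed as a single unit and the leftover spaces collapsed (the name remove_numbers_clean is hypothetical):

```python
import re

# A run of digits, optionally followed by groups like ".5" or ",250"
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")

def remove_numbers_clean(text):
    # Drop the numbers, then collapse the spaces they leave behind
    without_numbers = NUMBER_PATTERN.sub("", text)
    return " ".join(without_numbers.split())

print(remove_numbers_clean("There are 3 balls in this bag, and 12 in the other one."))
# There are balls in this bag, and in the other one.
print(remove_numbers_clean("Price rose 3.5% to 1,250 units."))
# Price rose % to units.
```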
4. Converting Numerical Values
We can also convert the numbers into words. This can be done by using the inflect library.
Python
p = inflect.engine()

def convert_number(text):
    # Split the sentence into whitespace-separated words
    temp_str = text.split()
    new_string = []
    for word in temp_str:
        # Convert words made up only of digits; keep everything else as is
        if word.isdigit():
            temp = p.number_to_words(word)
            new_string.append(temp)
        else:
            new_string.append(word)
    temp_str = ' '.join(new_string)
    return temp_str

input_str = 'There are 3 balls in this bag, and 12 in the other one.'
convert_number(input_str)
Output:
'There are three balls in this bag, and twelve in the other one.'
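Because the function above splits on whitespace and checks isdigit(), a number with punctuation attached, such as "12." at the end of a sentence, is left unchanged. One possible workaround is to let a regular expression find the digits. In the sketch below, toy_number_to_words is a hypothetical stand-in that knows only a few values so the example runs without extra packages; in practice inflect.engine().number_to_words could be passed as to_words instead.

```python
import re

# Hypothetical stand-in for inflect's number_to_words; covers only a few values
_SMALL_NUMBERS = {"3": "three", "5": "five", "12": "twelve"}

def toy_number_to_words(digits):
    # Return the word for a known number, or the digits unchanged
    return _SMALL_NUMBERS.get(digits, digits)

def convert_numbers_regex(text, to_words=toy_number_to_words):
    # Replace every run of digits, even when punctuation is attached to it
    return re.sub(r"\d+", lambda match: to_words(match.group()), text)

print(convert_numbers_regex("I counted 3, then 12."))
# I counted three, then twelve.
```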
5. Removing Punctuation
We remove punctuation so that the same word does not appear in several forms. For example, if we do not remove the punctuation, then "been.", "been," and "been!" will be treated as three different words.
Python
def remove_punctuation(text):
    # Build a translation table that deletes every ASCII punctuation character
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

input_str = "Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"
remove_punctuation(input_str)
Output:
'Hey did you know that the summer break is coming Amazing right  Its only 5 more days '
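Deleting punctuation outright joins the parts of hyphenated words, so "well-known" becomes "wellknown". A minimal sketch of an alternative, assuming it is acceptable to treat punctuation as a word separator (the name punctuation_to_spaces is hypothetical):

```python
import string

def punctuation_to_spaces(text):
    # Map every ASCII punctuation character to a space instead of deleting it
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # Collapse the extra spaces created by the replacement
    return " ".join(text.translate(table).split())

def delete_punctuation(text):
    # Same approach as remove_punctuation above, shown for comparison
    return text.translate(str.maketrans("", "", string.punctuation))

sentence = "A well-known, state-of-the-art method!"
print(delete_punctuation(sentence))     # A wellknown stateoftheart method
print(punctuation_to_spaces(sentence))  # A well known state of the art method
```

Which behaviour is preferable depends on the task. Note that string.punctuation covers only ASCII characters, so typographic quotes and dashes pass through both functions unchanged.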
6. Removing Whitespace
We can combine the split and join methods to strip leading and trailing whitespace and collapse every run of whitespace inside the string into a single space.
Python
def remove_whitespace(text):
    # split() with no arguments splits on any run of whitespace
    return " ".join(text.split())

input_str = "   we don't need    the given questions  "
remove_whitespace(input_str)
Output:
"we don't need the given questions"
7. Removing Stopwords
Stopwords are common words (such as "is", "a" and "the") that contribute little to the meaning of a sentence, so they can be removed. The NLTK library ships a list of English stopwords that we can use to filter them out of our text. The comparison is case-sensitive, so lowercase the text first if capitalised stopwords such as "This" should also be removed.
Python
nltk.download('punkt_tab')   # tokenizer data used by word_tokenize
nltk.download('stopwords')   # stopword lists used by stopwords.words

def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    word_tokens = word_tokenize(text)
    # Keep only the tokens that are not in the stopword set
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return filtered_text

example_text = "This is a sample sentence and we are going to remove the stopwords from this."
remove_stopwords(example_text)
Output:
['This', 'sample', 'sentence', 'going', 'remove', 'stopwords', '.']
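In the output above, "This" survives because NLTK's stopword list is lowercase and the membership test is case-sensitive. The sketch below shows a case-insensitive variant. To stay runnable without downloads, it assumes a deliberately tiny, hypothetical stopword set and plain whitespace tokenization; set(stopwords.words("english")) and word_tokenize could be substituted.

```python
# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer
TOY_STOP_WORDS = {"this", "is", "a", "and", "we", "are", "to", "the", "from"}

def remove_stopwords_ignore_case(tokens, stop_words=TOY_STOP_WORDS):
    # Lowercase each token only for the comparison; keep the original token
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = "This is a sample sentence and we are going to remove the stopwords from this".split()
print(remove_stopwords_ignore_case(tokens))
# ['sample', 'sentence', 'going', 'remove', 'stopwords']
```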
8. Applying Stemming
Stemming is the process of reducing a word to its stem, or root form. The stem is the part of the word to which affixes such as -ed, -ize, de- and -s are attached. A stemmer produces it by stripping affixes with simple rules, so the result is not always a dictionary word.
Example:
books -> book
looked -> look
denied -> deni
flies -> fli
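The examples above can be reproduced with a toy suffix-stripping sketch. This is not the Porter algorithm, which uses many more rules and conditions; the rule list and the name toy_stem are hypothetical and chosen only to illustrate how rule-based stemming yields non-words such as "deni" and "fli".

```python
# Ordered (suffix, replacement) rules; longer suffixes are tried first
TOY_RULES = [("ies", "i"), ("ied", "i"), ("ed", ""), ("s", "")]

def toy_stem(word):
    for suffix, replacement in TOY_RULES:
        # Require at least two characters to remain before the suffix
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    # No rule matched: return the word unchanged
    return word

for word in ["books", "looked", "denied", "flies"]:
    print(word, "->", toy_stem(word))
# books -> book
# looked -> look
# denied -> deni
# flies -> fli
```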
There are three main stemming algorithms: the Porter Stemmer, the Snowball Stemmer and the Lancaster Stemmer. The Porter Stemmer is the most commonly used of the three.
Python
stemmer = PorterStemmer()

def stem_words(text):
    word_tokens = word_tokenize(text)
    # Stem each token independently
    stems = [stemmer.stem(word) for word in word_tokens]
    return stems

text = 'data science uses scientific methods algorithms and many types of processes'
stem_words(text)
Output:
['data', 'scienc', 'use', 'scientif', 'method', 'algorithm', 'and', 'mani', 'type', 'of', 'process']
9. Applying Lemmatization
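Lemmatization is an NLP technique that reduces a word to its dictionary form, called the lemma. Unlike stemming, it aims to return a real word rather than a truncated stem. This helps with tasks such as text analysis and search because related forms of a word can be compared directly. Note that WordNetLemmatizer treats every token as a noun unless a part of speech is passed through the pos argument, which is why the verb "uses" is reduced to "us" in the output below; calling lemmatize("uses", pos="v") returns "use".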
Python
nltk.download('wordnet')   # WordNet data used by the lemmatizer
lemmatizer = WordNetLemmatizer()

def lemma_words(text):
    word_tokens = word_tokenize(text)
    # Lemmatize each token (treated as a noun by default)
    lemmas = [lemmatizer.lemmatize(word) for word in word_tokens]
    return lemmas

input_str = "data science uses scientific methods algorithms and many types of processes"
lemma_words(input_str)
Output:
['data', 'science', 'us', 'scientific', 'method', 'algorithm', 'and', 'many', 'type', 'of', 'process']
In this guide we covered several NLP text preprocessing techniques that can be used when building NLP-based applications and projects.
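The individual steps can be chained into a single function. The sketch below is one possible ordering that uses only the standard-library steps from sections 2, 3, 5 and 6; the name preprocess is hypothetical, and stopword removal, stemming or lemmatization could be applied to its output afterwards.

```python
import re
import string

def preprocess(text):
    text = text.lower()                                    # section 2: lowercase
    text = re.sub(r"\d+", "", text)                        # section 3: remove numbers
    text = text.translate(
        str.maketrans("", "", string.punctuation))         # section 5: remove punctuation
    return " ".join(text.split())                          # section 6: normalise whitespace

raw = "Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"
print(preprocess(raw))
# hey did you know that the summer break is coming amazing right its only more days
```

Running the whitespace step last removes the gaps left behind by the number and punctuation steps.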