Remove URLs from string in Python
A regular expression (regex) is a sequence of characters that defines a search pattern in text. To remove URLs from a string in Python, you can either use regular expressions (regex) or some external libraries like urllib.parse. The re-module in Python is used for working with regular expressions. In this article, we will see how we can remove URLs from a string in Python.
Python Remove URLs from a String
Below are the ways by which we can remove URLs from a string in Python:
- Using the re.sub() function
- Using the re.findall() function
- Using the re.search() function
- Using the urllib.parse class
Python Remove URLs from String Using re.sub() function
In this example, the code defines a function 'remove_urls' to find URLs in text and replace them with a placeholder [URL REMOVED], using regular expressions for pattern matching and the re.sub() method for substitution.
import re
def remove_urls(text, replacement_text="[URL REMOVED]"):
# Define a regex pattern to match URLs
url_pattern = re.compile(r'https?://\S+|www\.\S+')
# Use the sub() method to replace URLs with the specified replacement text
text_without_urls = url_pattern.sub(replacement_text, text)
return text_without_urls
# Example:
input_text = "Visit on GeeksforGeeks Website: https://www.geeksforgeeks.org/"
output_text = remove_urls(input_text)
print("Original Text:")
print(input_text)
print("\nText with URLs Removed:")
print(output_text)
Output
Original Text: Visit on GeeksforGeeks Website: https://www.geeksforgeeks.org/ Text with URLs Removed: Visit on GeeksforGeeks Website: [URL REMOVED]
Remove URLs from String Using re.findall() function
In this example, the Python code defines a function 'remove_urls_findall' that uses regular expressions to find all URLs using re.findall() method in a given text and replaces them with a replacement text "[URL REMOVED]".
import re
def remove_urls_findall(text, replacement_text="[URL REMOVED]"):
url_pattern = re.compile(r'https?://\S+|www\.\S+')
urls = url_pattern.findall(text)
for url in urls:
text = text.replace(url, replacement_text)
return text
# Example:
input_text = "Check out the latest Python tutorials on GeeksforGeeks: https://www.geeksforgeeks.org/category/python/"
output_text_findall = remove_urls_findall(input_text)
print("\nUsing re.findall():")
print("Original Text:")
print(input_text)
print("\nText with URLs Removed:")
print(output_text_findall)
Output:
Using re.findall():
Original Text:
Check out the latest Python tutorials on GeeksforGeeks: https://www.geeksforgeeks.org/category/python/
Text with URLs Removed:
Check out the latest Python tutorials on GeeksforGeeks: [URL REMOVED]
Remove URLs from String in Python Using re.search() function
In this example, the Python code defines a function 'remove_urls_search' using regular expressions and re.search() to find and replace URLs in a given text with a replacement text "[URL REMOVED]".
import re
def remove_urls_search(text, replacement_text="[URL REMOVED]"):
url_pattern = re.compile(r'https?://\S+|www\.\S+')
while True:
match = url_pattern.search(text)
if not match:
break
text = text[:match.start()] + replacement_text + text[match.end():]
return text
# Example:
input_text = "Visit our website at https://geeksforgeeks.org/ for more information. Follow us on Twitter: @geeksforgeeks"
output_text_search = remove_urls_search(input_text)
print("\nUsing re.search():")
print("Original Text:")
print(input_text)
print("\nText with URLs Removed:")
print(output_text_search)
Output:
Using re.search():
Original Text:
Visit our website at https://geeksforgeeks.org/ for more information. Follow us on Twitter: @geeksforgeeks
Text with URLs Removed:
Visit our website at [URL REMOVED] for more information. Follow us on Twitter: @geeksforgeeks
Remove URLs from String Using urllib.parse
In this example, the Python code defines a function 'remove_urls_urllib' that uses urllib.parse to check and replace URLs in a given text with a replacement text "[URL REMOVED]".
# Using urllib.parse
from urllib.parse import urlparse
def remove_urls_urllib(text, replacement_text="[URL REMOVED]"):
words = text.split()
for i, word in enumerate(words):
parsed_url = urlparse(word)
if parsed_url.scheme and parsed_url.netloc:
words[i] = replacement_text
return ' '.join(words)
# Example:
input_text = "Check out the GeeksforGeeks website at https://www.geeksforgeeks.org/ for programming tutorials."
output_text_urllib = remove_urls_urllib(input_text)
print("Using urllib.parse:")
print("Text with URLs Removed:")
print(output_text_urllib)
Output
Using urllib.parse: Text with URLs Removed: Check out the GeeksforGeeks website at [URL REMOVED] for programming tutorials.