Given a string that contains one or more URLs, the goal is to remove those URLs and replace them with clean text. For Example:
Input: Visit our site at https://www.geeksforgeeks.org/ for tutorials.
Output: Visit our site at [URL REMOVED] for tutorials.
Explanation: URL is detected inside the text and replaced with the placeholder [URL REMOVED]
Let's explore different ways to remove URLs from a string in Python.
Using re.sub()
This method replaces all URLs in one go. re.sub() scans the entire string for matches of the regex pattern and directly substitutes them with the replacement text.
import re
txt = "Visit: https://www.geeksforgeeks.org/ for more info."
p = r'https?://\S+|www\.\S+'
res = re.sub(p, "[URL REMOVED]", txt)
print(res)
Output
Visit: [URL REMOVED] for more info.
Explanation:
- re.sub(p, "[URL REMOVED]", txt) finds every URL matching pattern p and replaces it.
- Pattern https?://\S+|www\.\S+ matches links starting with http, https, or www.
Using re.findall()
This method first extracts all URLs using re.findall(), then replaces each one manually within the text. Useful when you need the list of URLs as well.
import re
txt = "Python tutorials: https://www.geeksforgeeks.org/category/python/"
p = r'https?://\S+|www\.\S+'
urls = re.findall(p, txt)
for u in urls:
txt = txt.replace(u, "[URL REMOVED]")
print(txt)
Output
Python tutorials: [URL REMOVED]
Explanation:
- re.findall(p, txt) returns all URL matches as a list.
- The loop replaces each found URL using txt.replace(u, "[URL REMOVED]").
Using re.search()
This method removes one URL at a time using repeated searching using re.search(). Useful when you need full control over how each match is replaced.
import re
txt = "Visit https://www.geeksforgeeks.org/ for info."
p = re.compile(r'https?://\S+|www\.\S+')
while True:
m = p.search(txt)
if not m:
break
txt = txt[:m.start()] + "[URL REMOVED]" + txt[m.end():]
print(txt)
Output
Visit [URL REMOVED] for info.
Explanation:
- p.search(txt) finds the next URL in the string.
- m.start() and m.end() give the exact position of the URL.
- The replacement is done manually by slicing around the match: txt[:m.start()] text before the URL, " [URL REMOVED] " replacement and txt[m.end():] text after the URL
- Loop continues until no URLs remain.
Using urllib.parse
This method checks each word to see if it behaves like a URL and replaces it using urllib module. It works without regex by analyzing the structure of each word.
from urllib.parse import urlparse
txt = "Check https://www.geeksforgeeks.org/ for resources."
words = txt.split()
for i, w in enumerate(words):
p = urlparse(w)
if p.scheme and p.netloc:
words[i] = "[URL REMOVED]"
res = " ".join(words)
print(res)
Output
Check [URL REMOVED] for resources.
Explanation:
- urlparse(w) breaks the word into parts like scheme and domain.
- If both p.scheme and p.netloc exist, the word is considered a URL.
- The word is then replaced in the list and recombined into a string.