
Web crawling with Python

Last Updated : 17 Apr, 2025

Web crawling is a widely used technique for collecting data from websites. It works by visiting web pages, following links and gathering useful information such as text, images or tables. Python has several libraries and frameworks that support web crawling. In this article we will look at web crawling using Python.

1. Web Crawling with Requests

The first step in web crawling is fetching the content of a webpage. For this we use Python's requests module, which lets us send an HTTP request to a website and retrieve its HTML content.

  1. requests.get(URL) : Sends a GET request to the specified URL.
  2. response.status_code : Indicates whether the request was successful; a status code of 200 means success.
  3. response.text : Contains the HTML content of the webpage.
Python
import requests

URL = "https://www.geeksforgeeks.org/"
resp = requests.get(URL)

print("Status Code:", resp.status_code)

print("\nResponse Content:")
print(resp.text)

Output:

Web Crawling with Requests
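
The example above prints the raw HTML exactly as received. Real requests can fail or hang, so as a hedged variation (the timeout value and the error handling shown here are illustrative choices, not requirements of the library) a slightly more defensive fetch might look like this:
Python
import requests

URL = "https://www.geeksforgeeks.org/"

try:
    # timeout keeps the call from hanging indefinitely; 10 seconds is an arbitrary choice
    resp = requests.get(URL, timeout=10)
    resp.raise_for_status()  # raises an HTTPError for 4xx/5xx status codes
    print("Fetched", len(resp.text), "characters of HTML")
except requests.RequestException as exc:
    print("Request failed:", exc)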

2. Web Crawling in JSON Format

Sometimes websites provide data in JSON format, which we need to convert into a Python dictionary. In this example a GET request is made to the Open Notify ISS location API using the requests library. If the request is successful (status code 200), the ISS's current location data is fetched and printed; otherwise an error message with the status code is displayed.

  1. response.json() : Converts the JSON response into a Python dictionary.
  2. You can now access specific fields like data['iss_position']['latitude'].
Python
import requests

URL = "http://api.open-notify.org/iss-now.json"

response = requests.get(URL)
if response.status_code == 200:
    data = response.json()  # parse the JSON body into a Python dictionary

    print("ISS Location Data:")
    print(data)
else:
    print(f"Error: Failed to retrieve data. Status code: {response.status_code}")

Output:

Web Crawling in JSON Format
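
Since response.json() returns an ordinary dictionary, individual fields can be read directly. A minimal sketch, assuming the payload structure shown in the output above:
Python
import requests

URL = "http://api.open-notify.org/iss-now.json"
data = requests.get(URL).json()

# 'iss_position' holds the coordinates (as strings) in this API's response
latitude = data['iss_position']['latitude']
longitude = data['iss_position']['longitude']
print(f"The ISS is currently at latitude {latitude}, longitude {longitude}")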

3. Web Scraping Images with Python

Web crawling can also be used to download images from websites. In this example a GET request fetches an image from a given URL. If the request succeeds, the image data is saved to a local file named "gfg_logo.png"; otherwise a failure message is displayed.

  1. response.content : Contains the binary content of the image.
  2. open(output_filename, "wb") : Opens a file in binary write mode to save the image.
Python
import requests

image_url = "https://media.geeksforgeeks.org/wp-content/uploads/20230505175603/100-Days-of-Machine-Learning.webp"
output_filename = "gfg_logo.png"

response = requests.get(image_url)

if response.status_code == 200:
    # response.content holds the raw image bytes; write them in binary mode
    with open(output_filename, "wb") as file:
        file.write(response.content)
    print(f"Image downloaded successfully as {output_filename}")
else:
    print("Failed to download the image.")

Output:

Image downloaded successfully as gfg_logo.png
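
response.content loads the whole file into memory at once, which is fine for a small logo. For larger files, requests can also stream the body in chunks; here is a sketch (the chunk size is an arbitrary choice):
Python
import requests

image_url = "https://media.geeksforgeeks.org/wp-content/uploads/20230505175603/100-Days-of-Machine-Learning.webp"

# stream=True defers downloading the body until we iterate over it
with requests.get(image_url, stream=True) as response:
    response.raise_for_status()
    with open("gfg_logo.png", "wb") as file:
        # write the image 8 KB at a time instead of holding it all in memory
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)

print("Image downloaded in streaming mode")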

4. Crawling Elements Using XPath

Here we use Python to get the current temperature of Noida from a weather website. First we send a request to the page, then we use an XPath expression to locate the temperature element in the HTML. If it is found we print its value; otherwise we show an error message.

Python
from lxml import etree
import requests

weather_url = "https://weather.com/en-IN/weather/today/l/60f76bec229c75a05ac18013521f7bfb52c75869637f3449105e9cb79738d492"

response = requests.get(weather_url)

if response.status_code == 200:
    # Parse the HTML into an element tree we can query with XPath
    dom = etree.HTML(response.text)
    # Match the current-conditions temperature span by attribute and class
    elements = dom.xpath(
        "//span[@data-testid='TemperatureValue' and contains(@class,'CurrentConditions')]")

    if elements:
        temperature = elements[0].text
        print(f"The current temperature is: {temperature}")
    else:
        print("Temperature element not found.")
else:
    print("Failed to fetch the webpage.")

Output:

The current temperature is: 31
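
Live sites change their markup often, so the XPath above may stop matching at any time. To see how lxml's XPath querying works without depending on an external page, here is a self-contained sketch on a hard-coded HTML fragment (the markup and class names are invented for illustration):
Python
from lxml import etree

# A small hard-coded fragment standing in for a fetched page
html = """
<html><body>
  <span data-testid="TemperatureValue" class="CurrentConditions--temp">31</span>
  <span data-testid="TemperatureValue" class="Forecast--temp">29</span>
</body></html>
"""

dom = etree.HTML(html)
# Same predicate style as above: match on attribute value plus a class substring
elements = dom.xpath(
    "//span[@data-testid='TemperatureValue' and contains(@class,'CurrentConditions')]")

print(elements[0].text if elements else "Not found")  # prints: 31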

5. Reading Tables on the Web Using Pandas

We can also read tables on the web by combining Pandas with web crawling. Pandas' read_html function extracts every table it finds at the specified URL. If tables are successfully extracted from the webpage they are printed one by one with a separator; if no tables are found, a message indicating this is displayed.

Python
import pandas as pd

url = "https://www.geeksforgeeks.org/html/html-tables/"
extracted_tables = pd.read_html(url)  # returns a list of DataFrames, one per table found

if extracted_tables:
    for idx, table in enumerate(extracted_tables, 1):
        print(f"Table {idx}:")
        print(table)
        print("-" * 50)
else:
    print("No tables found on the webpage.")

Output:

Reading Tables on the Web Using Pandas
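
read_html can also parse raw HTML directly, which is handy for testing without a network call. A minimal sketch (recent pandas versions expect file-like input for literal HTML, hence the StringIO wrapper; the table contents are made up):
Python
from io import StringIO

import pandas as pd

# A tiny hard-coded table standing in for a downloaded page
html = """
<table>
  <tr><th>Language</th><th>Creator</th></tr>
  <tr><td>Python</td><td>Guido van Rossum</td></tr>
  <tr><td>C</td><td>Dennis Ritchie</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))
print(tables[0])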

6. Crawl a Web Page and Get Most Frequent Words

We can also find the most frequent words on a page. We crawl it with requests and then use BeautifulSoup to read the content, focusing on a specific part of the page, the 'entry-content' div, and extracting all the words from it. After cleaning the words by removing symbols, we count how often each word appears. Finally we show the 10 most common words from that content.

Python
import requests
from bs4 import BeautifulSoup
from collections import Counter

def start(url):
    # Fetch the page and parse its HTML
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code, 'html.parser')

    wordlist = []

    # Gather words only from the article body ('entry-content' divs)
    for each_text in soup.findAll('div', {'class': 'entry-content'}):
        content = each_text.text
        words = content.lower().split()
        wordlist.extend(words)

    clean_wordlist(wordlist)

def clean_wordlist(wordlist):
    # Strip punctuation from each word and keep the non-empty results
    clean_list = []
    symbols = "!@#$%^&*()_-+={[}]|\\;:\"<>?/.,"

    for word in wordlist:
        for symbol in symbols:
            word = word.replace(symbol, '')

        if len(word) > 0:
            clean_list.append(word)

    create_dictionary(clean_list)

def create_dictionary(clean_list):
    # Count occurrences and report the ten most common words
    word_count = Counter(clean_list)
    top = word_count.most_common(10)

    print("Top 10 most frequent words:")
    for word, count in top:
        print(f'{word}: {count}')

if __name__ == "__main__":
    # Example target page; any article URL with an 'entry-content' div works
    url = "https://www.geeksforgeeks.org/html/html-tables/"
    start(url)

Output:

Crawl a Web Page and Get Most Frequent Words
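
The raw counts from a page are usually dominated by filler words like "the" and "a". As a small extension (the stopword set below is a tiny illustrative sample, not a standard list), they can be filtered out before counting:
Python
from collections import Counter

# A tiny illustrative stopword set; real projects use larger lists (e.g. NLTK's)
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def top_words(clean_list, n=10):
    # Drop stopwords, then count whatever remains
    filtered = [w for w in clean_list if w not in STOPWORDS]
    return Counter(filtered).most_common(n)

print(top_words(["the", "web", "crawler", "follows", "the", "web", "of", "links"]))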

Web crawling with Python provides an efficient way to collect and analyze data from the web. It is essential for applications such as data mining, market research and content aggregation. With proper handling of ethical guidelines, web crawling becomes an important tool for gathering data and insights from the web.
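
One concrete way to follow those guidelines is to check a site's robots.txt before crawling it. A sketch using the standard library's urllib.robotparser (the URLs are just examples):
Python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.geeksforgeeks.org/robots.txt")
rp.read()  # fetch and parse the robots.txt file

# can_fetch reports whether the given user agent may crawl the path
url = "https://www.geeksforgeeks.org/html/html-tables/"
if rp.can_fetch("*", url):
    print("Allowed to crawl:", url)
else:
    print("robots.txt disallows crawling:", url)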

