Lecture 12 - Web Scraping

The document provides an overview of web scraping using Python, specifically with the Beautiful Soup library, detailing the steps to extract data from websites. It includes code examples for setting up the environment, fetching web pages, parsing HTML, and saving extracted data into CSV files. Additionally, it contrasts the requests and urllib libraries for HTTP requests and explains working with JSON data in web scraping.


We have a book titled "Python for Beginners" by John Doe, published in 2023, with a price of $29.99. The same information can be represented in several common formats:

Plain Text Representation: contains no formatting or structure.


Book Title: Python for Beginners
Author: John Doe
Year: 2023
Price: $29.99

Focus: Simple human-readable form.


No tags, structure, or styling.

HTML Representation is used for displaying data in web pages.

<!DOCTYPE html>
<html>
<head>
<title>Book Information</title>
</head>
<body>
<h1>Python for Beginners</h1>
<p><strong>Author:</strong> John Doe</p>
<p><strong>Year:</strong> 2023</p>
<p><strong>Price:</strong> $29.99</p>
</body>
</html>

Focus: Formatting and presentation.


Contains tags for structure and presentation (<h1>, <p>, <strong>).

XML Representation is used for structured data storage and data exchange.

<?xml version="1.0" encoding="UTF-8"?>
<book>
<title>Python for Beginners</title>
<author>John Doe</author>
<year>2023</year>
<price>29.99</price>
</book>

Focus: Structure and hierarchy.


Tags are purely descriptive; no styling.

JSON Representation is used for lightweight data transfer, especially in APIs.

{
"title": "Python for Beginners",
"author": "John Doe",
"year": 2023,
"price": 29.99
}

Focus: Key-value pairs for quick parsing.


Compact and easy for programs to process.

Web Scraping with Python and Beautiful Soup

Definition
• Web scraping means automatically extracting data from websites.
• Beautiful Soup is a Python library that makes it easier to read, search, and
modify HTML/XML content.
How it works
1. Send a request to the website (using requests or urllib).
2. Get the HTML content of the page.
3. Pass it to BeautifulSoup to parse it into a structured format (a “soup” object).
4. Extract specific data using tags, attributes, CSS selectors, etc.
5. Save or use that data.
Key BeautifulSoup methods:

• .select_one('tag') → returns the first matching element.
• .select('tag') → returns a list of matching elements.
• .get('attribute') → gets the value of an attribute (e.g., href).
• .text → extracts inner text from an HTML element.
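
A quick sketch of these four methods on a made-up HTML snippet:

from bs4 import BeautifulSoup

html = '<div><h1>Books</h1><a href="/first">First</a> <a href="/second">Second</a></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select_one('h1').text)                  # Books
print([a.get('href') for a in soup.select('a')])   # ['/first', '/second']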

Step 1: Setting Up Your Environment

Install Required Libraries:
• Use pip to install the necessary libraries:
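
For requests and Beautiful Soup, that command is:

pip install requests beautifulsoup4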

Import Libraries:
• Import the required modules in your Python script:
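
These are the same two imports used in all of the scripts below:

import requests
from bs4 import BeautifulSoup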

Website: https://example.com
Simple Code:
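
A minimal sketch of the snippet (it assumes the example.com page used throughout this lecture):

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')       # send a GET request
soup = BeautifulSoup(response.text, 'html.parser')   # parse the HTML string
print(soup)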

What’s happening here?

• GET → Retrieve data. Note: POST → Send data to the server (see the sketch after this list).
• response.text → Contains the HTML content of the webpage as a string.
• BeautifulSoup(...) → Converts that HTML string into a structured BeautifulSoup object (a “parse tree”) so you can search and extract elements easily.
• 'html.parser' → This tells BeautifulSoup which parser to use for reading the HTML.
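
To see GET and POST side by side, here is a small sketch; httpbin.org is a public request-echo service, used here purely for illustration:

import requests

# GET: retrieve data from a page
r = requests.get("https://example.com")
print(r.status_code)            # 200 means OK

# POST: send data to the server; httpbin.org/post echoes the form data back
r = requests.post("https://httpbin.org/post", data={"name": "Aysha"})
print(r.json()["form"])         # {'name': 'Aysha'}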

What is a parser?
A parser is a tool that:
• Reads raw HTML (just text).
• Breaks it down into a tree-like structure based on the rules of HTML.
• Lets you navigate and find specific tags, attributes, or text.
• Without a parser, BeautifulSoup wouldn’t know how to navigate through
tags and extract data.
Think of HTML as raw spaghetti — a parser untangles it and arranges it neatly so
you can pick exactly the noodle (or tag) you want.
Why do we need a parser?
• Raw HTML is just a long string — very hard to extract specific elements
from it directly.
• HTML often contains nested tags (tags inside tags) — parsing creates a tree
so you can navigate parent → child → sibling elements easily.
• Parsers also help fix broken HTML (many pages have missing tags or small
errors).

Why 'html.parser'?
• 'html.parser' is Python’s built-in HTML parser.
• It’s simple, requires no extra installation, and works well for most HTML.
• BeautifulSoup supports other parsers too:
o 'lxml' → very fast, requires installing the lxml library.
o 'html5lib' → the most forgiving, handles messy HTML best.
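
Switching parsers is a one-argument change. A small sketch of how their leniency differs (it assumes html5lib has been pip-installed):

from bs4 import BeautifulSoup

broken_html = "<p>Hello"   # missing closing tag

# Built-in parser: closes the <p> but adds no surrounding document structure
print(BeautifulSoup(broken_html, "html.parser").prettify())

# html5lib: repairs the fragment into a full document (<html>, <head>, <body>)
print(BeautifulSoup(broken_html, "html5lib").prettify())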
Code:
# scrape.py  (filename assumed; use any name you like)
# To run this script, run `python scrape.py` in the terminal

import requests
from bs4 import BeautifulSoup

def scrape():
    url = 'https://example.com'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup)

if __name__ == '__main__':
    scrape()
Output: the full HTML of the page, printed as the parsed soup.

Let’s go ahead and capture three things from this page: the title, the text, and the “More information…” link, taking the href from the <a> tag. Now, rewrite the code as:
# Import required libraries
from bs4 import BeautifulSoup
import requests

# Specify the URL to scrape
url = "https://example.com"  # Replace with the URL you want to scrape

# Fetch the webpage
try:
    response = requests.get(url)
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the webpage content
        soup = BeautifulSoup(response.text, 'html.parser')
    else:
        print(f"Failed to fetch the webpage. Status code: {response.status_code}")
        soup = None
except Exception as e:
    print(f"An error occurred while fetching the webpage: {e}")
    soup = None

# Extract the desired elements if soup is successfully created
if soup:
    # Extract the title from the first <h1> tag
    title = soup.select_one('h1').text if soup.select_one('h1') else "No title found"

    # Extract the text from the first <p> tag
    text = soup.select_one('p').text if soup.select_one('p') else "No text found"

    # Extract the href attribute from the first <a> tag
    link = soup.select_one('a').get('href') if soup.select_one('a') else "No link found"

    # Print the extracted information
    print("Title:", title)
    print("Text:", text)
    print("Link:", link)
else:
    print("Soup object was not created. Unable to scrape the webpage.")

Now we want to save the extracted data into a CSV file:


# Import required libraries
from bs4 import BeautifulSoup
import requests
import csv  # Library for handling CSV files
import os   # Library to interact with the file system

# Specify the URL to scrape
url = "https://example.com"  # Replace with the URL you want to scrape

# Fetch the webpage
try:
    response = requests.get(url)
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the webpage content
        soup = BeautifulSoup(response.text, 'html.parser')
    else:
        print(f"Failed to fetch the webpage. Status code: {response.status_code}")
        soup = None
except Exception as e:
    print(f"An error occurred while fetching the webpage: {e}")
    soup = None

# Extract the desired elements if soup is successfully created
if soup:
    # Extract the title from the first <h1> tag
    title = soup.select_one('h1').text if soup.select_one('h1') else "No title found"

    # Extract the text from the first <p> tag
    text = soup.select_one('p').text if soup.select_one('p') else "No text found"

    # Extract the href attribute from the first <a> tag
    link = soup.select_one('a').get('href') if soup.select_one('a') else "No link found"

    # Save the extracted data into a CSV file
    csv_filename = 'scraped_data.csv'
    try:
        with open(csv_filename, mode='w', newline='', encoding='utf-8') as file:
            writer = csv.writer(file)
            # Write the header row
            writer.writerow(['Title', 'Text', 'Link'])
            # Write the data row
            writer.writerow([title, text, link])

        # Get the full file path
        file_path = os.path.abspath(csv_filename)
        print(f"Data saved to {csv_filename}")
        print(f"File is located at: {file_path}")
    except Exception as e:
        print(f"An error occurred while saving to CSV: {e}")
else:
    print("Soup object was not created. Unable to scrape the webpage.")

Intro to the requests and urllib Modules

Both are HTTP client libraries in Python used to fetch content from the web.
requests
• High-level, simpler, and more human-friendly.
• Can send GET, POST, PUT, DELETE requests.
• Handles cookies, headers, and sessions easily (see the session sketch below).

Example:
import requests

response = requests.get("https://example.com")
print(response.status_code)  # 200 means OK
print(response.text)         # HTML content
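
Because sessions persist cookies and headers across requests, here is a short sketch (the User-Agent value is made up):

import requests

session = requests.Session()
session.headers.update({"User-Agent": "my-scraper/1.0"})   # illustrative header

response = session.get("https://example.com")   # cookies persist across calls
print(response.status_code)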

urllib
• Built-in Python module (no installation needed).
• More low-level than requests; needs a bit more code.

import urllib.request

with urllib.request.urlopen("https://example.com") as response:
    html = response.read().decode()
print(html)

Main difference:
• requests is easier for most scraping tasks and is widely used.
• urllib is good if you want a standard library option without extra installation.

Working with JSON


1. JSON (JavaScript Object Notation) is:
a. A text format for storing and transmitting data.
b. Commonly used in APIs and web data exchange.
c. Easy to read and write for humans and machines.
2. Python and JSON:
a. Python has a built-in json module to read/write JSON.
3. Loading JSON from a string:
import json

data = '{"name": "Aysha", "age": 15}'
parsed = json.loads(data)  # string → Python dict
print(parsed["name"])      # Aysha

Saving Python data as JSON:

import json

person = {"name": "Aysha", "age": 15}
json_string = json.dumps(person)  # dict → JSON string
print(json_string)                # {"name": "Aysha", "age": 15}

When scraping:
• Some websites give data directly in JSON instead of HTML.

In web scraping, you might:

1. Use requests (or urllib) to get the webpage or API data.
2. If it’s HTML, parse it with BeautifulSoup.
3. If it’s JSON, parse it with the json module instead of BeautifulSoup, as sketched below.
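
A short end-to-end sketch (the API URL is hypothetical, and requests’ built-in .json() performs the json.loads step for you):

import requests

response = requests.get("https://api.example.com/books")   # hypothetical JSON endpoint

if response.status_code == 200:
    data = response.json()    # parses the JSON body into Python objects
    for book in data:         # assumes the endpoint returns a list of book records
        print(book["title"], book["price"])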
