Lecture 12 - Web Scraping

The document provides an overview of web scraping using Python, specifically with the Beautiful Soup library, detailing the steps to extract data from websites. It includes code examples for setting up the environment, fetching web pages, parsing HTML, and saving extracted data into CSV files. Additionally, it contrasts the requests and urllib libraries for HTTP requests and explains working with JSON data in web scraping.


We have a book titled "Python for Beginners" by John Doe, published in 2023, with a price of $29.99. The same information can be represented in several common formats:

Plain Text Representation: contains no formatting or structure.


Book Title: Python for Beginners
Author: John Doe
Year: 2023
Price: $29.99

Focus: Simple human-readable form.


No tags, structure, or styling.

HTML Representation is used for displaying data in web pages.

<!DOCTYPE html>
<html>
<head>
<title>Book Information</title>
</head>
<body>
<h1>Python for Beginners</h1>
<p><strong>Author:</strong> John Doe</p>
<p><strong>Year:</strong> 2023</p>
<p><strong>Price:</strong> $29.99</p>
</body>
</html>

Focus: Formatting and presentation.


Contains tags for structure and presentation (<h1>, <p>, <strong>).

XML Representation is used for structured data storage and data exchange.

<?xml version="1.0" encoding="UTF-8"?>
<book>
<title>Python for Beginners</title>
<author>John Doe</author>
<year>2023</year>
<price>29.99</price>
</book>

Focus: Structure and hierarchy.


Tags are purely descriptive; no styling.

JSON Representation is used for lightweight data transfer, especially in APIs.

{
"title": "Python for Beginners",
"author": "John Doe",
"year": 2023,
"price": 29.99
}

Focus: Key-value pairs for quick parsing.


Compact and easy for programs to process.

Web Scraping with Python and Beautiful Soup

Definition
• Web scraping means automatically extracting data from websites.
• Beautiful Soup is a Python library that makes it easier to read, search, and
modify HTML/XML content.
How it works
1. Send a request to the website (using requests or urllib).
2. Get the HTML content of the page.
3. Pass it to BeautifulSoup to parse it into a structured format (a “soup” object).
4. Extract specific data using tags, attributes, CSS selectors, etc.
5. Save or use that data.
Key BeautifulSoup methods:

• .select_one('tag') → returns the first matching element.
• .select('tag') → returns a list of matching elements.
• .get('attribute') → gets the value of an attribute (e.g., href).
• .text → extracts inner text from an HTML element.
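
A quick sketch of these four methods on a made-up HTML snippet:

from bs4 import BeautifulSoup

html = '<div><h1>Books</h1><a href="/first">First</a> <a href="/second">Second</a></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select_one('h1').text)                  # Books
print([a.get('href') for a in soup.select('a')])   # ['/first', '/second']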

Step 1: Setting Up Your Environment

Install Required Libraries:
• Use pip to install the necessary libraries:
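
For requests and Beautiful Soup, that command is:

pip install requests beautifulsoup4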

Import Libraries:
• Import the required modules in your Python script:
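
These are the same two imports used in all of the scripts below:

import requests
from bs4 import BeautifulSoup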

Website: https://example.com
Simple Code:
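
A minimal sketch of the snippet (it assumes the example.com page used throughout this lecture):

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')       # send a GET request
soup = BeautifulSoup(response.text, 'html.parser')   # parse the HTML string
print(soup)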

What’s happening here?

• GET → Retrieve data. Note: POST → Send data to the server (see the sketch after this list).
• response.text → Contains the HTML content of the webpage as a string.
• BeautifulSoup(...) → Converts that HTML string into a structured BeautifulSoup object (a “parse tree”) so you can search and extract elements easily.
• 'html.parser' → This tells BeautifulSoup which parser to use for reading the HTML.
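
To see GET and POST side by side, here is a small sketch; httpbin.org is a public request-echo service, used here purely for illustration:

import requests

# GET: retrieve data from a page
r = requests.get("https://example.com")
print(r.status_code)            # 200 means OK

# POST: send data to the server; httpbin.org/post echoes the form data back
r = requests.post("https://httpbin.org/post", data={"name": "Aysha"})
print(r.json()["form"])         # {'name': 'Aysha'}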

What is a parser?
A parser is a tool that:
• Reads raw HTML (just text).
• Breaks it down into a tree-like structure based on the rules of HTML.
• Lets you navigate and find specific tags, attributes, or text.
• Without a parser, BeautifulSoup wouldn’t know how to navigate through
tags and extract data.
Think of HTML as raw spaghetti — a parser untangles it and arranges it neatly so
you can pick exactly the noodle (or tag) you want.
Why do we need a parser?
• Raw HTML is just a long string — very hard to extract specific elements
from it directly.
• HTML often contains nested tags (tags inside tags) — parsing creates a tree
so you can navigate parent → child → sibling elements easily.
• Parsers also help fix broken HTML (many pages have missing tags or small
errors).

Why 'html.parser'?
• 'html.parser' is Python’s built-in HTML parser.
• It’s simple, requires no extra installation, and works well for most HTML.
• BeautifulSoup supports other parsers too:
o 'lxml' → very fast, requires installing the lxml library.
o 'html5lib' → the most forgiving, handles messy HTML best.
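
Switching parsers is a one-argument change. A small sketch of how their leniency differs (it assumes html5lib has been pip-installed):

from bs4 import BeautifulSoup

broken_html = "<p>Hello"   # missing closing tag

# Built-in parser: closes the <p> but adds no surrounding document structure
print(BeautifulSoup(broken_html, "html.parser").prettify())

# html5lib: repairs the fragment into a full document (<html>, <head>, <body>)
print(BeautifulSoup(broken_html, "html5lib").prettify())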
Code:
# scrape.py  (filename assumed; use any name you like)
# To run this script, run `python scrape.py` in the terminal

import requests
from bs4 import BeautifulSoup

def scrape():
    url = 'https://example.com'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup)

if __name__ == '__main__':
    scrape()
Output: the full HTML of the page, printed as the parsed soup.

Let’s go ahead and capture three things from this page: the title, the text, and the “More information…” link, taking the href from the <a> tag. Now, rewrite the code as:
# Import required libraries
from bs4 import BeautifulSoup
import requests

# Specify the URL to scrape
url = "https://example.com"  # Replace with the URL you want to scrape

# Fetch the webpage
try:
    response = requests.get(url)
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the webpage content
        soup = BeautifulSoup(response.text, 'html.parser')
    else:
        print(f"Failed to fetch the webpage. Status code: {response.status_code}")
        soup = None
except Exception as e:
    print(f"An error occurred while fetching the webpage: {e}")
    soup = None

# Extract the desired elements if soup is successfully created
if soup:
    # Extract the title from the first <h1> tag
    title = soup.select_one('h1').text if soup.select_one('h1') else "No title found"

    # Extract the text from the first <p> tag
    text = soup.select_one('p').text if soup.select_one('p') else "No text found"

    # Extract the href attribute from the first <a> tag
    link = soup.select_one('a').get('href') if soup.select_one('a') else "No link found"

    # Print the extracted information
    print("Title:", title)
    print("Text:", text)
    print("Link:", link)
else:
    print("Soup object was not created. Unable to scrape the webpage.")

Now we want to save the extracted data into a CSV file:


# Import required libraries
from bs4 import BeautifulSoup
import requests
import csv  # Library for handling CSV files
import os   # Library to interact with the file system

# Specify the URL to scrape
url = "https://example.com"  # Replace with the URL you want to scrape

# Fetch the webpage
try:
    response = requests.get(url)
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the webpage content
        soup = BeautifulSoup(response.text, 'html.parser')
    else:
        print(f"Failed to fetch the webpage. Status code: {response.status_code}")
        soup = None
except Exception as e:
    print(f"An error occurred while fetching the webpage: {e}")
    soup = None

# Extract the desired elements if soup is successfully created
if soup:
    # Extract the title from the first <h1> tag
    title = soup.select_one('h1').text if soup.select_one('h1') else "No title found"

    # Extract the text from the first <p> tag
    text = soup.select_one('p').text if soup.select_one('p') else "No text found"

    # Extract the href attribute from the first <a> tag
    link = soup.select_one('a').get('href') if soup.select_one('a') else "No link found"

    # Save the extracted data into a CSV file
    csv_filename = 'scraped_data.csv'
    try:
        with open(csv_filename, mode='w', newline='', encoding='utf-8') as file:
            writer = csv.writer(file)
            # Write the header row
            writer.writerow(['Title', 'Text', 'Link'])
            # Write the data row
            writer.writerow([title, text, link])

        # Get the full file path
        file_path = os.path.abspath(csv_filename)
        print(f"Data saved to {csv_filename}")
        print(f"File is located at: {file_path}")
    except Exception as e:
        print(f"An error occurred while saving to CSV: {e}")
else:
    print("Soup object was not created. Unable to scrape the webpage.")

Intro to the requests and urllib Modules

Both are HTTP client libraries in Python used to fetch content from the web.
requests
• High-level, simpler, and more human-friendly.
• Can send GET, POST, PUT, DELETE requests.
• Handles cookies, headers, and sessions easily (see the session sketch below).

Example:
import requests

response = requests.get("https://example.com")
print(response.status_code)  # 200 means OK
print(response.text)         # HTML content
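
Because sessions persist cookies and headers across requests, here is a short sketch (the User-Agent value is made up):

import requests

session = requests.Session()
session.headers.update({"User-Agent": "my-scraper/1.0"})   # illustrative header

response = session.get("https://example.com")   # cookies persist across calls
print(response.status_code)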

urllib
• Built-in Python module (no installation needed).
• More low-level than requests; needs a bit more code.

import urllib.request

with urllib.request.urlopen("https://example.com") as response:
    html = response.read().decode()
print(html)

Main difference:
• requests is easier for most scraping tasks and is widely used.
• urllib is good if you want a standard library option without extra installation.

Working with JSON


1. JSON (JavaScript Object Notation) is:
a. A text format for storing and transmitting data.
b. Commonly used in APIs and web data exchange.
c. Easy to read and write for humans and machines.
2. Python and JSON:
a. Python has a built-in json module to read/write JSON.
3. Loading JSON from a string:
import json

data = '{"name": "Aysha", "age": 15}'
parsed = json.loads(data)  # string → Python dict
print(parsed["name"])      # Aysha

Saving Python data as JSON:

import json

person = {"name": "Aysha", "age": 15}
json_string = json.dumps(person)  # dict → JSON string
print(json_string)                # {"name": "Aysha", "age": 15}

When scraping:
• Some websites give data directly in JSON instead of HTML.

In web scraping, you might:

1. Use requests (or urllib) to get the webpage or API data.
2. If it’s HTML, parse it with BeautifulSoup.
3. If it’s JSON, parse it with the json module instead of BeautifulSoup, as sketched below.
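
A short end-to-end sketch (the API URL is hypothetical, and requests’ built-in .json() performs the json.loads step for you):

import requests

response = requests.get("https://api.example.com/books")   # hypothetical JSON endpoint

if response.status_code == 200:
    data = response.json()    # parses the JSON body into Python objects
    for book in data:         # assumes the endpoint returns a list of book records
        print(book["title"], book["price"])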
