Web Scraping: Dept. of CS&E, BIET, Davangere
INTRODUCTION:
Web scraping is the process of using bots to extract content and data from a
website. Unlike screen scraping, which only copies pixels displayed onscreen,
web scraping extracts underlying HTML code and, with it, data stored in a
database. The scraper can then replicate entire website content elsewhere.
Web scraping is used in a variety of digital businesses that rely on data
harvesting.
Since all scraping bots have the same purpose—to access site data—it can be
difficult to distinguish between legitimate and malicious bots.
That said, several key differences help distinguish between the two.
1. Legitimate bots are identified with the organization for which they
scrape. For example, Googlebot identifies itself in its HTTP header as
belonging to Google. Malicious bots, conversely, impersonate legitimate
traffic by creating a false HTTP user agent.
2. Legitimate bots abide by a site’s robots.txt file, which lists the pages a bot
is permitted to access and those it cannot. Malicious scrapers, on the other
hand, crawl the website regardless of what the site operator has allowed.
(Both points are illustrated in the sketch after this list.)
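As a quick illustration of both points, a polite scraper can check robots.txt and
send an honest User-Agent before fetching anything. This is a minimal Python
sketch using the standard urllib.robotparser module and the third-party requests
package; https://example.com, the page path, and the bot name are placeholders,
not sites or bots named in these notes.

import urllib.robotparser
import requests

BASE_URL = "https://example.com"  # placeholder site, not from these notes
# Identify the bot honestly in the User-Agent header, the way
# legitimate bots such as Googlebot do.
HEADERS = {"User-Agent": "CourseNotesBot/1.0"}  # hypothetical bot name

# Honour the site's robots.txt before fetching anything.
parser = urllib.robotparser.RobotFileParser()
parser.set_url(BASE_URL + "/robots.txt")
parser.read()

page = BASE_URL + "/some-page.html"  # placeholder page
if parser.can_fetch(HEADERS["User-Agent"], page):
    response = requests.get(page, headers=HEADERS)
    print(response.status_code)
else:
    print("robots.txt disallows fetching", page)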
Resources needed to run web scraper bots are substantial—so much so that
legitimate scraping bot operators heavily invest in servers to process the vast
amount of data being extracted. A perpetrator, lacking such a budget, often
resorts to using a botnet—geographically dispersed computers, infected with the
same malware and controlled from a central location. Individual botnet
computer owners are unaware of their participation. The combined power of the
infected systems enables large scale scraping of many different websites by the
perpetrator.
When we scrape the web, we write code that sends a request to the server that’s
hosting the page we specified. Generally, our code downloads that page’s
source code, just as a browser would. But instead of displaying the page
visually, it filters through the page, looking for the HTML elements we’ve
specified and extracting whatever content we’ve instructed it to extract.
For example, if we wanted to get all of the titles inside H2 tags from a website,
we could write some code to do that. Our code would request the site’s content
from its server and download it. Then it would go through the page’s HTML
looking for the H2 tags. Whenever it found one, it would copy the text inside
the tag and output it in whatever format we specified.
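A minimal sketch of that H2 example in Python, using the third-party requests
and beautifulsoup4 packages (https://example.com is a placeholder URL, not a
site from these notes):

import requests
from bs4 import BeautifulSoup

# Request the site's content from its server and download it.
response = requests.get("https://example.com")

# Parse the downloaded source code.
soup = BeautifulSoup(response.text, "html.parser")

# Go through the page's HTML looking for H2 tags, copy the text
# inside each one, and print it.
for heading in soup.find_all("h2"):
    print(heading.get_text(strip=True))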
<hr>
<h3>Work Experience</h3>
<table>
  <thead>
    <tr>
      <td><h5>YEAR</h5></td>
      <td><h5>SUBJECT</h5></td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2010</td>
      <td>Social</td>
    </tr>
    <tr>
      <td>2012</td>
      <td>Science</td>
    </tr>
    <tr>
      <td>2014</td>
      <td>Mathematics</td>
    </tr>
    <tr>
      <td>2016</td>
      <td>English</td>
    </tr>
  </tbody>
</table>
<a href="Division of Subjects.html">Division of Subjects</a>
<hr>
<h3>Skills</h3>
<table>
  <tr>
    <td>Painting</td>
    <td>⭐⭐⭐⭐⭐</td>
  </tr>
  <tr>
    <td>Dancing</td>
    <td>⭐⭐⭐</td>
  </tr>
  <tr>
    <td>Singing</td>
    <td>⭐⭐⭐⭐</td>
  </tr>
</table>
<a href="contact.html">CONTACT DETAILS</a>
</body>
</html>
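With the markup above in hand, the same approach can pull structured rows out
of the Work Experience table. This is a minimal sketch, assuming the page has
been saved locally as resume.html (a hypothetical filename; these notes do not
name the file):

from bs4 import BeautifulSoup

with open("resume.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# The first <table> on the page is the Work Experience table.
table = soup.find("table")
for row in table.tbody.find_all("tr"):
    year, subject = [cell.get_text(strip=True) for cell in row.find_all("td")]
    print(year, subject)

# Prints:
# 2010 Social
# 2012 Science
# 2014 Mathematics
# 2016 English

The "Division of Subjects" anchor above links to Division of Subjects.html,
whose source is: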
<!DOCTYPE html>
<html lang="en" dir="ltr">
  <head>
    <meta charset="utf-8">
    <title>Division of subjects</title>
    <link rel="stylesheet" href="css/styles.css">
  </head>
  <body>
    <h3>Division of Subjects</h3>
    <ol>
      <li><a href="https://science.sciencemag.org/">Science</a></li>
      <li><a href="https://g.co/kgs/TMkyEz">Social</a></li>
      <li><a href="https://en.wikipedia.org/wiki/Mathematics">Mathematics</a></li>
      <li><a href="https://en.wikipedia.org/wiki/English_language">English</a></li>
    </ol>
  </body>
</html>
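The links page can be scraped the same way. A minimal sketch that prints each
subject and its URL, again assuming the file has been saved locally under its
linked name:

from bs4 import BeautifulSoup

with open("Division of Subjects.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Each <li> in the ordered list wraps one <a> element; the href
# attribute holds the destination URL.
for link in soup.select("ol li a"):
    print(link.get_text(strip=True), "->", link["href"])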
Snapshots: