
Web Scraping: Dept. of CS&E, BIET, Davangere

Web scraping is a process that uses bots to extract data and content from websites by downloading the underlying HTML code and, with it, the data stored in databases. Legitimate uses of web scraping include search engines indexing websites and price comparison sites gathering product information. However, web scraping can also be done maliciously through botnets that scrape websites without permission.

Uploaded by

jeanie
Copyright
© All Rights Reserved

Web scraping

INTRODUCTION:
Web scraping is the process of using bots to extract content and data from a
website. Unlike screen scraping, which only copies pixels displayed onscreen,
web scraping extracts underlying HTML code and, with it, data stored in a
database. The scraper can then replicate entire website content elsewhere.
Web scraping is used in a variety of digital businesses that rely on data
harvesting.

Legitimate use cases include:


• Search engine bots crawling a site, analysing its content, and then ranking
it.
• Price comparison sites deploying bots to auto-fetch prices and product
descriptions for allied seller websites.
• Market research companies using scrapers to pull data from forums and
social media (e.g., for sentiment analysis).

Scraper tools and bots:


Web scraping tools are software (i.e., bots) programmed to sift through
databases and extract information. A variety of bot types are used, many being
fully customizable to:
• Recognize unique HTML site structures
• Extract and transform content
• Store scraped data
• Extract data from APIs

Since all scraping bots have the same purpose—to access site data—it can be
difficult to distinguish between legitimate and malicious bots.

Dept.of CS&E,BIET,Davangere Page | 1



That said, several key differences help distinguish between the two.
1. Legitimate bots are identified with the organization for which they
scrape. For example, Googlebot identifies itself in its HTTP User-Agent
header as belonging to Google. Malicious bots, conversely, impersonate
legitimate traffic by sending a false HTTP user agent.
2. Legitimate bots abide by a site’s robots.txt file, which lists the pages a
bot is permitted to access and those it is not. Malicious scrapers, on the
other hand, crawl the website regardless of what the site operator has allowed.
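
The two habits of a legitimate bot described above can be sketched with Python's standard library. This is a minimal illustration rather than a complete crawler: the bot name `ExampleResearchBot`, the robots.txt rules and the example.com URLs are all assumptions made up for the example, and the rules are parsed from a string so that nothing is downloaded.

```python
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical bot name and robots.txt rules, used only for illustration.
USER_AGENT = "ExampleResearchBot"
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt rules that have already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, url, user_agent=USER_AGENT):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def make_request(url, user_agent=USER_AGENT):
    """Build (but do not send) a request that honestly identifies the bot."""
    return Request(url, headers={"User-Agent": user_agent})


parser = build_parser(ROBOTS_TXT)
allowed_public = is_allowed(parser, "https://example.com/index.html")
allowed_private = is_allowed(parser, "https://example.com/private/report.html")
request = make_request("https://example.com/index.html")
```

A well-behaved scraper would call `is_allowed` before every fetch and skip any URL for which it returns `False`.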

Resources needed to run web scraper bots are substantial—so much so that
legitimate scraping bot operators heavily invest in servers to process the vast
amount of data being extracted. A perpetrator, lacking such a budget, often
resorts to using a botnet—geographically dispersed computers, infected with the
same malware and controlled from a central location. Individual botnet
computer owners are unaware of their participation. The combined power of the
infected systems enables large scale scraping of many different websites by the
perpetrator.


Python Libraries for Web Scraping:


• requests: this critical library is needed to actually get the data from the
web server onto your machine. It also offers conveniences such as sessions
that persist cookies and reuse connections; caching is available through
add-on packages rather than in requests itself.
• Beautiful Soup 4: This is the library we’ve used here, and it’s designed
to make filtering data based on HTML tags straightforward.
• lxml: An HTML and XML parser that’s fast (and can also be used as the
parser behind Beautiful Soup).
• Selenium: A web driver tool that’s useful when you need to get data
from a website that the requests library can’t access, because it’s hidden
behind things like login forms or mandatory mouse-clicks.
• Scrapy: A full-on web scraping framework that might be overkill for
one-off data analysis projects, but a good fit when scraping is required for
production projects, pipelines, etc.

WORKING OF WEB SCRAPING:

When we scrape the web, we write code that sends a request to the server that’s
hosting the page we specified. Generally, our code downloads that page’s
source code, just as a browser would. But instead of displaying the page
visually, it filters through the page looking for HTML elements we’ve specified,
and extracting whatever content we’ve instructed it to extract.
For example, if we wanted to get all of the titles inside H2 tags from a website,
we could write some code to do that. Our code would request the site’s content
from its server and download it. Then it would go through the page’s HTML
looking for the H2 tags. Whenever it found an H2 tag, it would copy whatever
text is inside the tag, and output it in whatever format we specified.
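
The H2 example above can be sketched as follows. To keep the sketch runnable offline it uses only the standard library's `html.parser` and a small made-up page (`SAMPLE`) in place of a downloaded one; in a real scraper the HTML string would come from the server's response. With Beautiful Soup the same filter is typically a single `find_all("h2")` call.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collect the text found inside every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False   # True while the parser is inside an <h2>
        self.parts = []      # text fragments of the current <h2>
        self.titles = []     # finished titles, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.parts = []

    def handle_data(self, data):
        if self.in_h2:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self.in_h2:
            self.titles.append("".join(self.parts).strip())
            self.in_h2 = False


def extract_h2_titles(html):
    """Return the text of all H2 headings in the given HTML string."""
    collector = H2Collector()
    collector.feed(html)
    collector.close()
    return collector.titles


# Made-up page standing in for downloaded source code.
SAMPLE = (
    "<html><body><h1>Site</h1>"
    "<h2>First post</h2><p>text</p>"
    "<h2>Second <em>post</em></h2>"
    "</body></html>"
)
titles = extract_h2_titles(SAMPLE)
print(titles)
```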


One thing that’s important to note: from a server’s perspective, requesting a
page via web scraping is the same as loading it in a web browser. When we use
code to submit these requests, we might be “loading” pages much faster than a
regular user, and thus quickly eating up the website owner’s server resources.
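
One common way to avoid overloading a server is to pause between consecutive requests. The sketch below is one possible approach, not a prescribed one: `fetch_politely`, its parameters and the example.com URLs are hypothetical, and the download function is passed in so that the demo can run with a stand-in instead of real network calls.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Demo with stand-ins: a fake fetch function, and a fake sleep that only
# records the requested pauses so the example finishes instantly.
pauses = []
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=2.0,
    sleep=pauses.append,
)
```

In real use the default `time.sleep` would be kept, and `fetch` would be whatever function downloads a page.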

Potential Challenges of Web Scraping:


• One of the challenges you will come across while scraping information
from websites is the variety of site structures. Templates differ from
one website to the next, so writing one scraper that generalizes across
websites is difficult.
• Another challenge is longevity. Since web developers keep updating
their websites, you cannot rely on one scraper for long. Even minor
modifications to a page can break the scraper and prevent it from
fetching the data.

Hence, to address the above challenges, there could be various possible
solutions. One would be to follow continuous integration and continuous
delivery (CI/CD) practices and to maintain the scraper constantly, since
website modifications are ongoing.
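
As one illustration of such maintenance, a scraper can be paired with an automated check that reports when the elements it depends on have disappeared from a page. The sketch below assumes a scraper that relies on table markup; the function name `missing_elements` and both sample layouts are invented for the example.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each opening tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def missing_elements(html, required_tags):
    """Return the required tags that do not occur anywhere in the HTML."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return [tag for tag in required_tags if counter.counts[tag] == 0]


# Invented layouts: the page as the scraper expects it, and the same data
# after a redesign that replaced the table with div and span elements.
OLD_LAYOUT = "<table><tr><td>2010</td><td>Social</td></tr></table>"
NEW_LAYOUT = "<div class='row'><span>2010</span><span>Social</span></div>"
REQUIRED = ["table", "tr", "td"]

old_missing = missing_elements(OLD_LAYOUT, REQUIRED)
new_missing = missing_elements(NEW_LAYOUT, REQUIRED)
```

Run on a schedule, a check like this turns a silent scraping failure into an explicit signal that the scraper needs updating.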

Another more realistic approach is to use Application Programming Interfaces
(APIs) offered by various websites and platforms. For example, Facebook and
Twitter provide APIs specially designed for developers who want to
experiment with their data or would like to extract information about, say,
all friends and mutual friends and draw a connection graph from it. The format
of the data also differs: APIs typically return JSON or XML, while in standard
web scraping you mainly deal with data in HTML format.
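
To show how structured API data differs from HTML, the sketch below builds a small friend graph from a JSON response. The payload, its field names (`users`, `name`, `friends`) and the user names are invented for illustration and do not correspond to any real Facebook or Twitter API.

```python
import json

# Invented JSON payload, shaped like something an API might return.
API_RESPONSE = """
{
  "users": [
    {"name": "asha", "friends": ["ravi", "meena"]},
    {"name": "ravi", "friends": ["asha"]},
    {"name": "meena", "friends": ["asha", "ravi"]}
  ]
}
"""


def build_friend_graph(payload):
    """Turn the JSON payload into an undirected graph of friendships."""
    data = json.loads(payload)
    graph = {}
    for user in data["users"]:
        graph.setdefault(user["name"], set())
        for friend in user["friends"]:
            # Record the friendship in both directions.
            graph[user["name"]].add(friend)
            graph.setdefault(friend, set()).add(user["name"])
    return graph


def mutual_friends(graph, first, second):
    """Return the friends that two users have in common."""
    return graph[first] & graph[second]


graph = build_friend_graph(API_RESPONSE)
common = mutual_friends(graph, "ravi", "meena")
```

No tag filtering is needed here: `json.loads` yields dictionaries and lists directly, which is why APIs tend to be less fragile than parsing HTML.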


Implementation Source Code:


<!DOCTYPE html>
<html lang="en" dir="ltr">
<head>
<meta charset="utf-8">
<title>Tessa personal site</title>
<link rel="stylesheet" href="css/styles.css">
</head>
<body>
<table cellspacing="10">
<tr>
<td>
<img src="https://images.iphonephotographyschool.com/24755/portrait-photography.jpg"
alt="Tessa profile picture" height="150" width="150">
</td>
<td><h1> Tessa John</h1>
<p><em>I am a<strong> teacher</strong>. <br>I love playing <strong><a
href="https://www.google.com/search?rlz=1C1GCEB_enIN907IN907&sxsrf=ALeKk02j5ICSqE8xyC-2SbOjCx0TS_8DgQ:1596368832757&q=basketball&spell=1&sa=X&ved=2ahUKEwitz9OpufzqAhUtyzgGHc3_D0gQBSgAegQIHhAn&biw=1536&bih=706">basketball</a></strong>
and <strong><a href="https://en.wikipedia.org/wiki/Throwball">throw ball</a></strong>.
</em></p>
</td>
</tr>
</table>
<hr>
<h3>Schools Worked</h3>
<ul>
<li>Taralabalu</li>
<li>St.Paluse Convent</li>
<li>Bapuji</li>
</ul>

<hr>
<h3>Work Experience</h3>
<table>
<thead>
<tr>
<td><h5>YEAR</h5></td>
<td><h5>SUBJECT</h5></td>
</tr>
</thead>
<tbody>
<tr>
<td>2010</td>
<td>Social</td>
</tr>
<tr>
<td>2012</td>
<td>Science</td>
</tr>
<tr>
<td>2014</td>
<td>Mathematics</td>
</tr>
<tr>
<td>2016</td>
<td>English</td>
</tr>
</tbody>
</table>
<a href="Division of Subjects.html">Division of Subjects</a>
<hr>
<h3>Skills</h3>
<table>
<tr>

<td>Painting</td>
<td>⭐⭐⭐⭐⭐</td>
</tr>
<tr>
<td>Dancing</td>
<td>⭐⭐⭐</td>
</tr>
<tr>
<td>Singing</td>
<td>⭐⭐⭐⭐</td>
</tr>
</table>
<a href="contact.html">CONTACT DETAILS</a>
</body>
</html>

<!DOCTYPE html>
<html lang="en" dir="ltr">
<head>
<meta charset="utf-8">
<title>Division of subjects</title>
<link rel="stylesheet" href="css/styles.css">
</head>
<body>
<h3>Division of Subjects</h3>
<ol>
<li><a href="https://science.sciencemag.org/">Science</a></li>
<li><a href="https://g.co/kgs/TMkyEz">Social</a></li>
<li><a href="https://en.wikipedia.org/wiki/Mathematics">Mathematics</a></li>
<li><a href="https://en.wikipedia.org/wiki/English_language">English</a></li>
</ol>
</body>
</html>
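
A page like the one above could be scraped as sketched below. This is an illustration under stated assumptions, not the report's own scraper: it uses only the standard library, and it embeds a condensed copy of the Work Experience table as the string `PAGE` instead of downloading the page.

```python
from html.parser import HTMLParser

# Condensed copy of the "Work Experience" table from the page above.
PAGE = """
<h3>Work Experience</h3>
<table>
<thead><tr><td><h5>YEAR</h5></td><td><h5>SUBJECT</h5></td></tr></thead>
<tbody>
<tr><td>2010</td><td>Social</td></tr>
<tr><td>2012</td><td>Science</td></tr>
<tr><td>2014</td><td>Mathematics</td></tr>
<tr><td>2016</td><td>English</td></tr>
</tbody>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self.row = None    # cells of the row being read, or None
        self.cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag == "td" and self.row is not None:
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.cell is not None:
            self.row.append("".join(self.cell).strip())
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None


parser = TableRows()
parser.feed(PAGE)
parser.close()

# The first row holds the column titles; the rest hold the data.
header, *body = parser.rows
experience = {int(year): subject for year, subject in body}
print(header, experience)
```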


Snapshots:
