Web Scraping: Dept. of CS&E, BIET, Davangere
INTRODUCTION:
Web scraping is the process of using bots to extract content and data from a
website. Unlike screen scraping, which only copies pixels displayed onscreen,
web scraping extracts underlying HTML code and, with it, data stored in a
database. The scraper can then replicate entire website content elsewhere.
Web scraping is used in a variety of digital businesses that rely on data
harvesting.
Since all scraping bots have the same purpose—to access site data—it can be
difficult to distinguish between legitimate and malicious bots.
That said, several key differences help distinguish between the two.
1. Legitimate bots are identified with the organization for which they
scrape. For example, Googlebot identifies itself in its HTTP header as
belonging to Google. Malicious bots, conversely, impersonate legitimate
traffic by creating a false HTTP user agent.
2. Legitimate bots abide by a site’s robots.txt file, which lists the pages a bot
is permitted to access and those it cannot. Malicious scrapers, on the other
hand, crawl the website regardless of what the site operator has allowed.
(Both points are illustrated in the sketch after this list.)
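As a quick illustration of both points, a polite scraper can check robots.txt and
send an honest User-Agent before fetching anything. This is a minimal Python
sketch using the standard urllib.robotparser module and the third-party requests
package; https://example.com, the page path, and the bot name are placeholders,
not sites or bots named in these notes.

import urllib.robotparser
import requests

BASE_URL = "https://example.com"  # placeholder site, not from these notes
# Identify the bot honestly in the User-Agent header, the way
# legitimate bots such as Googlebot do.
HEADERS = {"User-Agent": "CourseNotesBot/1.0"}  # hypothetical bot name

# Honour the site's robots.txt before fetching anything.
parser = urllib.robotparser.RobotFileParser()
parser.set_url(BASE_URL + "/robots.txt")
parser.read()

page = BASE_URL + "/some-page.html"  # placeholder page
if parser.can_fetch(HEADERS["User-Agent"], page):
    response = requests.get(page, headers=HEADERS)
    print(response.status_code)
else:
    print("robots.txt disallows fetching", page)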
Resources needed to run web scraper bots are substantial—so much so that
legitimate scraping bot operators heavily invest in servers to process the vast
amount of data being extracted. A perpetrator, lacking such a budget, often
resorts to using a botnet—geographically dispersed computers, infected with the
same malware and controlled from a central location. Individual botnet
computer owners are unaware of their participation. The combined power of the
infected systems enables large scale scraping of many different websites by the
perpetrator.
When we scrape the web, we write code that sends a request to the server that’s
hosting the page we specified. Generally, our code downloads that page’s
source code, just as a browser would. But instead of displaying the page
visually, it filters through the page, looking for the HTML elements we’ve
specified and extracting whatever content we’ve instructed it to extract.
For example, if we wanted to get all of the titles inside H2 tags from a website,
we could write some code to do that. Our code would request the site’s content
from its server and download it. Then it would go through the page’s HTML
looking for the H2 tags. Whenever it found one, it would copy the text inside
the tag and output it in whatever format we specified.
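A minimal sketch of that H2 example in Python, using the third-party requests
and beautifulsoup4 packages (https://example.com is a placeholder URL, not a
site from these notes):

import requests
from bs4 import BeautifulSoup

# Request the site's content from its server and download it.
response = requests.get("https://example.com")

# Parse the downloaded source code.
soup = BeautifulSoup(response.text, "html.parser")

# Go through the page's HTML looking for H2 tags, copy the text
# inside each one, and print it.
for heading in soup.find_all("h2"):
    print(heading.get_text(strip=True))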
<hr>
<h3>Work Experience</h3>
<table>
  <thead>
    <tr>
      <td><h5>YEAR</h5></td>
      <td><h5>SUBJECT</h5></td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2010</td>
      <td>Social</td>
    </tr>
    <tr>
      <td>2012</td>
      <td>Science</td>
    </tr>
    <tr>
      <td>2014</td>
      <td>Mathematics</td>
    </tr>
    <tr>
      <td>2016</td>
      <td>English</td>
    </tr>
  </tbody>
</table>
<a href="Division of Subjects.html">Division of Subjects</a>
<hr>
<h3>Skills</h3>
<table>
  <tr>
    <td>Painting</td>
    <td>⭐⭐⭐⭐⭐</td>
  </tr>
  <tr>
    <td>Dancing</td>
    <td>⭐⭐⭐</td>
  </tr>
  <tr>
    <td>Singing</td>
    <td>⭐⭐⭐⭐</td>
  </tr>
</table>
<a href="contact.html">CONTACT DETAILS</a>
</body>
</html>
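With the markup above in hand, the same approach can pull structured rows out
of the Work Experience table. This is a minimal sketch, assuming the page has
been saved locally as resume.html (a hypothetical filename; these notes do not
name the file):

from bs4 import BeautifulSoup

with open("resume.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# The first <table> on the page is the Work Experience table.
table = soup.find("table")
for row in table.tbody.find_all("tr"):
    year, subject = [cell.get_text(strip=True) for cell in row.find_all("td")]
    print(year, subject)

# Prints:
# 2010 Social
# 2012 Science
# 2014 Mathematics
# 2016 English

The "Division of Subjects" anchor above links to Division of Subjects.html,
whose source is: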
<!DOCTYPE html>
<html lang="en" dir="ltr">
  <head>
    <meta charset="utf-8">
    <title>Division of subjects</title>
    <link rel="stylesheet" href="css/styles.css">
  </head>
  <body>
    <h3>Division of Subjects</h3>
    <ol>
      <li><a href="https://science.sciencemag.org/">Science</a></li>
      <li><a href="https://g.co/kgs/TMkyEz">Social</a></li>
      <li><a href="https://en.wikipedia.org/wiki/Mathematics">Mathematics</a></li>
      <li><a href="https://en.wikipedia.org/wiki/English_language">English</a></li>
    </ol>
  </body>
</html>
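The links page can be scraped the same way. A minimal sketch that prints each
subject and its URL, again assuming the file has been saved locally under its
linked name:

from bs4 import BeautifulSoup

with open("Division of Subjects.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Each <li> in the ordered list wraps one <a> element; the href
# attribute holds the destination URL.
for link in soup.select("ol li a"):
    print(link.get_text(strip=True), "->", link["href"])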
Snapshots: