Search Engine Scraping

Search engine scraping involves harvesting URLs, descriptions, and other information from search engines like Google, Bing, and Yahoo. It allows companies to monitor how their clients' websites rank for keywords. While search engines don't take legal action against scraping, they use various techniques to detect and block automated scraping, such as CAPTCHAs, IP blocking, and behavior analysis. To scrape successfully, tools must emulate human behavior by rotating IPs, handling URLs and headers correctly, and inserting delays between queries. Common programming languages for scraping include PHP, Python, and C++.

Uploaded by

linda976

Search engine scraping

Search engine scraping is the process of harvesting URLs, descriptions, or other information from search
engines such as Google, Bing, Yahoo, or Yandex. This is a specific form of screen scraping or web
scraping dedicated to search engines only.

Most commonly, larger search engine optimization (SEO) providers depend on regularly scraping keywords
from search engines, especially Google and Sogou, to monitor the competitive position of their customers'
websites for relevant keywords or their indexing status.

Search engines like Google have implemented various forms of human detection to block any sort of
automated access to their service,[1] with the intent of steering the users of scrapers toward buying
access to their official APIs instead.

The process of entering a website and extracting data in an automated fashion is also often called
"crawling". Search engines such as Google, Bing, Yahoo, or Sogou get almost all their data from automated
crawling bots.

Search engines are an integral part of the modern online ecosystem. They provide a way for people to find
information, products, and services online quickly and easily. In fact, more than 90% of online experiences
begin with a search engine, and the top search results receive the majority of clicks. This is why SEO is
critical for businesses and organizations that want to succeed in the digital world.

SEO is essential because it enables websites to rank higher in search results pages, making it easier for
people to find them. A higher ranking in search results can increase a website's visibility, traffic, and
ultimately, revenue. SEO can also help businesses and organizations establish their authority, credibility,
and reputation in their respective industries.[2][3]

Difficulties
Google is by far the largest search engine, with the most users as well as the most revenue from creative
advertisements, which makes it the most important search engine to scrape for SEO-related
companies.[4]

Although Google does not take legal action against scraping, it uses a range of defensive methods that
make scraping its results a challenging task, even when the scraping tool realistically spoofs a
normal web browser:

Google uses a complex system of request rate limiting that can vary by language, country, and
User-Agent, as well as by the keywords or search parameters. Because these behaviour patterns
are not known to the outside developer or user, automated access to a search engine can be
unpredictable.

Network and IP limitations are also part of the scraping defense systems. Search engines
cannot easily be tricked by simply changing to another IP address, yet using proxies is a very
important part of successful scraping. The diversity and abuse history of an IP address matter
as well.

Offending IPs and offending IP networks can easily be stored in a blacklist database to
detect offenders much faster. Because most ISPs give dynamic IP addresses to customers,
such automated bans must be temporary so that they do not block innocent users.

Behaviour-based detection is the most difficult defense system. Search engines serve their
pages to millions of users every day, which provides a large amount of behaviour information.
A scraping script or bot does not behave like a real user: aside from non-typical access
times, delays, and session times, the keywords being harvested might be related to each
other or include unusual parameters. Google, for example, has a very sophisticated
behaviour analysis system, possibly using deep learning software to detect unusual
patterns of access. It can detect unusual activity much faster than other search engines.[5]

HTML markup changes. Depending on the methods used to harvest the content of a website,
even a small change in the HTML can break a scraping tool until it is updated.

General changes in detection systems. In recent years search engines have tightened their
detection systems nearly month by month, making it more and more difficult to scrape reliably,
as developers need to experiment and adapt their code regularly.[6]
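
The combination of request rate limits and temporary blacklisting described above can be illustrated from the defending side. The following is a minimal sketch only, not how any real search engine works: the class name, the thresholds (10 requests per 60 seconds, a one-hour ban), and the per-IP sliding window are all assumptions chosen for illustration.

```python
import time
from collections import defaultdict, deque


class TemporaryBanList:
    """Hypothetical per-IP sliding-window limiter with expiring bans."""

    def __init__(self, max_requests=10, window_seconds=60.0, ban_seconds=3600.0):
        # All three thresholds are illustrative assumptions.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.ban_seconds = ban_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self._banned_until = {}          # ip -> time at which the ban expires

    def allow(self, ip, now=None):
        """Return True if the request is served, False if the IP is blocked."""
        now = time.monotonic() if now is None else now

        # Bans are temporary, so a dynamic IP is usable again after expiry.
        expiry = self._banned_until.get(ip)
        if expiry is not None:
            if now < expiry:
                return False
            del self._banned_until[ip]

        hits = self._hits[ip]
        # Forget requests that have fallen out of the sliding window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)

        if len(hits) > self.max_requests:
            self._banned_until[ip] = now + self.ban_seconds
            hits.clear()
            return False
        return True
```

Because the ban carries an expiry time rather than being permanent, a customer who later receives the same dynamic IP address is not locked out indefinitely.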

Detection
When a search engine's defenses suspect that an access might be automated, the search engine can react in several ways.

The first layer of defense is a captcha page[7] where the user is prompted to verify they are a real person
and not a bot or tool. Solving the captcha will create a cookie that permits access to the search engine again
for a while. After about one day, the captcha page is removed again.

The second layer of defense is a similar error page but without a captcha. In this case the user is
completely blocked from using the search engine until the temporary block is lifted or the user changes
their IP address.

The third layer of defense is a long-term block of the entire network segment. Google has blocked large
network blocks for months. This sort of block is likely triggered by an administrator and only happens if a
scraping tool is sending a very high number of requests.

All these forms of detection may also happen to a normal user, especially users sharing the same IP address
or network class (IPv4 ranges as well as IPv6 ranges).
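
From the scraper's side, the first two layers can only be told apart by inspecting the response. The sketch below shows one way this might be done. The marker string and the status codes treated as blocks are assumptions, since real captcha and block pages differ between search engines and change over time; a long-term network block cannot be distinguished from a temporary one by a single response.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    CAPTCHA = "captcha"
    BLOCKED = "blocked"


# Assumed markers, for illustration only.
CAPTCHA_MARKERS = ("captcha",)
BLOCK_STATUS_CODES = (403, 429, 503)


def classify_response(status_code, body):
    """Guess which defense layer, if any, produced this response."""
    text = body.lower()
    # A captcha page can still be solved, so check for it first.
    if any(marker in text for marker in CAPTCHA_MARKERS):
        return Verdict.CAPTCHA
    # An error status without a captcha is treated as a plain block.
    if status_code in BLOCK_STATUS_CODES:
        return Verdict.BLOCKED
    return Verdict.OK
```

A plain substring check like this would misclassify an ordinary result page that happens to contain the word "captcha", so a real tool would need more specific markers.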

Methods of scraping Google, Bing, Yahoo or Sogou

To scrape a search engine successfully, the two major factors are time and amount.

The more keywords a user needs to scrape and the less time is available for the job, the more difficult
scraping will be and the more sophisticated a scraping script or tool needs to be.
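
The trade-off between time and amount can be made concrete with a rough calculation of how many IP addresses a job needs. This is a back-of-the-envelope sketch: the function name is hypothetical and the sustainable rate per IP is an assumed input that depends on the factors discussed in this article.

```python
import math


def proxies_needed(keywords, pages_per_keyword, hours, requests_per_hour_per_ip):
    """Estimate the minimum number of IPs for a job (hypothetical helper)."""
    total_requests = keywords * pages_per_keyword
    capacity_per_ip = hours * requests_per_hour_per_ip
    return math.ceil(total_requests / capacity_per_ip)


# 10,000 keywords, one result page each, to be finished within 24 hours:
print(proxies_needed(10_000, 1, 24, 5))    # assumed slow rate: 84 IPs
print(proxies_needed(10_000, 1, 24, 100))  # assumed fast rate: 5 IPs
```

Halving the available time or doubling the keyword count doubles the number of IP addresses required at a fixed rate.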

Scraping scripts need to overcome a few technical challenges:[8]

IP rotation using proxies (proxies should be unshared and not listed in blacklists)

Proper time management: time between keyword changes, pagination, and correctly
placed delays. Effective long-term scraping rates can vary from only 3–5 requests (keywords
or pages) per hour up to 100 and more per hour for each IP address or proxy in use. The
quality of IPs, methods of scraping, keywords requested, and language/country requested
can greatly affect the possible maximum rate.

Correct handling of URL parameters, cookies, and HTTP headers to emulate a user
with a typical browser[9]

HTML DOM parsing (extracting URLs, descriptions, ranking position, sitelinks, and other
relevant data from the HTML code)

Error handling: automated reaction to captcha or block pages and other unusual
responses[10]

The definition of a captcha, as mentioned above, is explained in [11]
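
The first two points, IP rotation and correctly placed delays, can be sketched as a planning step that assigns each keyword to a proxy and a send time. This is an illustrative sketch only: the function name, the proxy labels, and the default delay range are assumptions, and no network requests are made.

```python
import itertools
import random


def build_schedule(keywords, proxies, min_delay=36.0, max_delay=90.0, seed=None):
    """Assign each keyword a proxy and a send time in seconds from start.

    Proxies are used round-robin, and each proxy waits a random delay
    between its own consecutive requests. The delay range is an assumption.
    """
    if not proxies:
        raise ValueError("at least one proxy is required")
    rng = random.Random(seed)
    next_free = {proxy: 0.0 for proxy in proxies}  # earliest next send per proxy
    schedule = []
    for keyword, proxy in zip(keywords, itertools.cycle(proxies)):
        send_at = next_free[proxy]
        schedule.append((send_at, proxy, keyword))
        # Random jitter avoids the perfectly regular timing of a naive loop.
        next_free[proxy] = send_at + rng.uniform(min_delay, max_delay)
    schedule.sort(key=lambda item: item[0])
    return schedule


for send_at, proxy, keyword in build_schedule(
    ["red shoes", "blue shoes", "green shoes"], ["proxy-a", "proxy-b"], seed=1
):
    print(f"{send_at:7.1f}s  {proxy}  {keyword}")
```

Tracking the delay per proxy rather than globally keeps every individual IP address under its own rate while the job as a whole proceeds faster as proxies are added.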

An example of open-source scraping software that makes use of the above-mentioned techniques is
GoogleScraper.[9] This framework controls browsers over the DevTools Protocol and makes it hard for
Google to detect that the browser is automated.

Programming languages
When developing a scraper for a search engine, almost any programming language can be used. However,
depending on performance requirements, some languages are preferable.

PHP is a commonly used language for writing scraping scripts for websites or backend services, since it has
powerful built-in capabilities (DOM parsers, libcURL); however, its memory usage is typically about 10 times
that of comparable C/C++ code. Ruby on Rails and Python are also frequently used to automate
scraping jobs. For the highest performance, C++ DOM parsers should be considered.

Additionally, bash scripting can be used together with cURL as a command line tool to scrape a search
engine.
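
As an illustration of the DOM parsing step in Python, the sketch below extracts result URLs, titles, and ranking positions using only the standard library's html.parser module. The markup is entirely hypothetical: the sample page and the class name "result" are assumptions, and real result pages use different, frequently changing markup.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect (position, url, title) from <a class="result"> elements.

    The "result" class is a made-up marker for this example.
    """

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None  # set while inside a result link
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "a" and "result" in classes and attributes.get("href"):
            self._href = attributes["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace so nested tags do not leave gaps.
            title = " ".join("".join(self._text).split())
            self.results.append((len(self.results) + 1, self._href, title))
            self._href = None


def parse_results(html):
    parser = ResultLinkParser()
    parser.feed(html)
    parser.close()
    return parser.results


SAMPLE_HTML = """
<div><a class="result" href="https://example.com/a">First <b>hit</b></a></div>
<div><a class="nav" href="/next">Next</a></div>
<div><a class="result" href="https://example.org/b">Second hit</a></div>
"""

for position, url, title in parse_results(SAMPLE_HTML):
    print(position, url, title)
```

Because the parser depends on a single class name, it also shows why even a small markup change can break a scraping tool until it is updated.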

Tools and scripts

When developing a search engine scraper there are several existing tools and libraries available that can
either be used, extended or just analyzed to learn from.

iMacros - A free browser automation toolkit that can be used for very small volume scraping
from within a user's browser.[12]
cURL - A command-line browser for automation and testing, as well as a powerful open-source
HTTP interaction library available for a large range of programming languages.[13]
Google-search - A Go package to scrape Google.[14]
SEO Tools Kit (https://seotoolskit.co/) - Free online tools that scrape search engines (DuckDuckGo,
Baidu, Sogou) by using proxies (socks4/5, http proxy). The tool includes asynchronous networking
support and is able to control real browsers to mitigate detection.[15]
se-scraper - Successor of SEO Tools Kit. Scrapes search engines concurrently with different
proxies.[16]

Legal
When scraping websites and services, the legal aspect is often a big concern for companies. For web scraping
in general, it depends greatly on the country the scraping user or company is from, as well as on which data
or website is being scraped, and court rulings differ all over the world.[17][18][19] When it comes to
scraping search engines, however, the situation is different: search engines usually do not list intellectual
property, as they just repeat or summarize information they scraped from other websites.

The largest publicly known incident of a search engine being scraped happened in 2011, when Microsoft was
caught scraping unknown keywords from Google for its own, rather new Bing service,[20] but even this
incident did not result in a court case.

One possible reason might be that search engines like Google and Sogou get almost all their data by
scraping millions of publicly reachable websites, likewise without reading and accepting those websites' terms.