Unit-5

Web Mining
I) Introduction to Web Mining:
Web mining is the application of data mining techniques to discover patterns from
the World Wide Web. As the name suggests, it is information gathered by mining the web.

Web mining can be divided into three different types – Web usage mining, Web content
mining and Web structure mining.

• Web Usage Mining is the application of data mining techniques to discover
interesting usage patterns from Web data in order to understand and better serve the
needs of Web-based applications. Usage data captures the identity or origin of Web
users along with their browsing behaviour at a Web site.
• Web structure mining uses graph theory to analyze the node and connection
structure of a web site. According to the type of web structural data, web structure
mining can be divided into two kinds:
a) Extracting patterns from hyperlinks in the web: a hyperlink is a structural
component that connects the web page to a different location.
b) Mining the document structure: analysis of the tree-like structure of page
structures to describe HTML or XML tag usage.
• Web content mining is the mining, extraction and integration of useful data,
information and knowledge from Web page content. The heterogeneity and lack of
structure of much of the ever-expanding information on the World Wide Web, such as
hypertext documents, makes automated discovery, organization, search and indexing
of this information difficult.
II) Web Content Mining:
Web content mining, also known as text mining, is generally the second step in Web
data mining. Content mining is the scanning and mining of text, pictures and graphs of a Web
page to determine the relevance of the content to the search query. This scanning is
completed after the clustering of web pages through structure mining and provides the results
based upon the level of relevance to the suggested query.

When one is searching the web for something of interest, the relevant material is often
spread over many servers around the world. The following example shows how relevant
information from a wide variety of sources, presented in a wide variety of formats, may be
integrated for the user. The example involves extracting a relation of books in the form
(author, title) from the web, starting with a small sample list. The problem may be defined in
more general terms: we wish to build a relation R that has a number of attributes. The
information about the tuples of R is found on web pages but is unstructured. The aim is to
extract this information with a low error rate.

The algorithm proposed is called Dual Iterative Pattern Relation Extraction (DIPRE). It
works as follows (a short sketch in code is given after the steps):
1. Sample: Start with the sample S provided by the user.
2. Occurrences: Find occurrences of the tuples in S. Once the tuples are found, the
context of every occurrence is saved. Let this set of occurrences be O (S → O).
3. Patterns: Generate patterns based on the set of occurrences O. This requires
generating patterns with similar contexts (O → P).
4. Match patterns: The web is now searched for the patterns, and the tuples that match
are added to S.
5. STOP if enough matches are found, else go to Step 2.
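The following is a minimal sketch in Python of the DIPRE-style loop described above. The
helper callables find_occurrences, generate_patterns and match_patterns are hypothetical
stand-ins for the web-scale search steps and are not part of the original description.

# Sketch of the (author, title) extraction loop described above.
def dipre(seed_tuples, find_occurrences, generate_patterns, match_patterns,
          target_size=1000, max_rounds=10):
    """Grow a relation R of (author, title) tuples from a small seed sample."""
    relation = set(seed_tuples)                   # Step 1: start with the user sample S
    for _ in range(max_rounds):
        occurrences = find_occurrences(relation)  # Step 2: S -> O, contexts are kept
        patterns = generate_patterns(occurrences) # Step 3: O -> P, similar contexts
        new_tuples = match_patterns(patterns)     # Step 4: search the web with P
        relation |= set(new_tuples)
        if len(relation) >= target_size:          # Step 5: stop when enough matches found
            break
    return relation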

i) Web Document Clustering: Web document clustering is another approach to finding
relevant documents on a topic or about query keywords. Popular search engines often
return a huge, unmanageable list of documents which contain the keywords the user
specified. Finding the most useful documents in such a large list is usually tedious and often
impossible. The user could instead apply clustering to the set of documents returned by a
search engine in response to a query, with the aim of finding semantically meaningful
clusters, rather than a list of ranked documents, that are easier to interpret.
K-means and agglomerative methods can be used for web document cluster analysis
as well, but these methods assume that each document has a fixed set of attributes that
appear in all documents, so that similarity between documents can be computed from these
values. One could, for example, take a set of words and their frequencies in each document
and use those values for clustering the documents.
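As a rough illustration of this word-frequency approach (not of STC, which follows), the short
Python sketch below clusters a few made-up result snippets using scikit-learn's
TfidfVectorizer and KMeans; the snippets and the choice of two clusters are illustrative
assumptions.

# Clustering a handful of result snippets by word-frequency (TF-IDF) vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

snippets = [
    "data mining techniques for web usage logs",
    "web server log analysis and usage patterns",
    "hyperlink structure and authority pages",
    "link analysis with hubs and authorities",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(snippets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, snippet in zip(labels, snippets):
    print(label, snippet)   # snippets with the same label fall in the same cluster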
One approach that takes a different path, and is designed specifically for web
document cluster analysis, is called Suffix Tree Clustering (STC); it uses a phrase-based
clustering approach rather than single-word frequencies. In STC the key requirements of
the web document clustering algorithm include the following:
1. Relevance: This is the most obvious requirement. We want clusters that are relevant
to the user query and that group similar documents together.
2. Browsable summaries: The clusters must be easy to understand. The user should be
able to quickly browse the description of a cluster and work out whether the cluster is
relevant to the query.
3. Snippet tolerance: The clustering method should not require whole documents and
should be able to produce relevant clusters based only on the information (snippets) that
the search engine returns.
4. Performance: The clustering method should be able to process the results of the
search engine quickly and provide the resulting clusters to the user.
III) Web Structure Mining:
The aim of web structure mining is to discover the link structure of the model that is
assumed to underlie the web. The model may be based on the topology of the hyperlinks.
This helps in discovering similarity between sites or in discovering authorities for a particular
topic. Link structures are only one kind of information that may be used in analysing the
structure of the web.
The links on a web page provide a useful source of information that may be harnessed in
web searches. Kleinberg developed a connectivity analysis algorithm called HITS
(Hyperlink-Induced Topic Search) based on the assumption that links represent human
judgement. HITS is based on the idea that if the creator of page P provides a link to page Q,
then P confers some authority on page Q, for example links to the homepage in a large
website. The HITS algorithm has two major steps:
1) Sampling step: It collects a set of relevant web pages given a topic.
2) Iterative step: It finds hubs and authorities using the information collected during
sampling.
i) Sampling Step: The first step involves finding a subset of nodes, or a subgraph S, which is
rich in relevant authoritative pages. To obtain such a subgraph, the algorithm starts with a
root set of, say, 200 pages selected from the results of searching for the query in a traditional
search engine. Given the root set R, we wish to obtain a set S that has the following properties:

1) S is relatively small.
2) S is rich in relevant pages given the query.
3) S contains most of the strongest authorities.
The root set R usually satisfies conditions 1 and 2, since it consists of 100 or 200 highly
ranked pages retrieved by a search engine. These pages may or may not satisfy condition 3,
but pages in R should contain links to other authorities if there are any; in some cases this
may not be true.
The HITS algorithm expands the root set R into a base set S using the following
algorithm (a short sketch in code is given after the steps):
1) Let S = R.
2) For each page in S, do steps 3 to 5.
3) Let T be the set of all pages S points to.
4) Let F be the set of all pages that point to S.
5) Let S = S + T + some or all of F.
6) Delete all links with the same domain name.
7) Return S.
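A minimal Python sketch of this expansion, assuming hypothetical helpers get_outlinks(page)
and get_inlinks(page, limit) that would query the web or a connectivity server for the pages a
page points to and is pointed to by:

# Root-set expansion (steps 1-7 above) as a function of pluggable link lookups.
def expand_root_set(root_set, get_outlinks, get_inlinks, inlink_limit=50):
    base = set(root_set)                              # 1) let S = R
    for page in list(root_set):                       # 2) for each page in S
        base |= set(get_outlinks(page))               # 3) T: pages that S points to
        base |= set(get_inlinks(page, inlink_limit))  # 4)-5) add some or all of F
    # 6) links between pages on the same domain would be dropped when the link
    #    graph over S is built (not shown in this sketch).
    return base                                       # 7) return S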
ii) Finding Hubs and Authorities:

The algorithm for finding hubs and authorities now works as follows (a small numerical
sketch is given after the steps):

1) Let a page p have a nonnegative authority weight xp and a nonnegative hub weight yp.
Pages with relatively large weights xp will be classified as the authorities.
2) The weights are normalized so that the squared sum of each type of weight is 1, since
only the relative weights are important.
3) For a page p, the value of xp is updated to the sum of yq over all pages q that link to p.
4) For a page p, the value of yp is updated to the sum of xq over all pages q that p links to.
5) Repeat from step 2 until a termination condition is reached.
6) On termination, the output of the algorithm is a set of pages: those with the largest xp
weights can be assumed to be the authorities, and those with the largest yp weights can
be assumed to be the hubs.
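The iteration can be sketched in a few lines of Python with NumPy. The small link matrix is an
illustrative assumption, with L[p, q] = 1 meaning that page p links to page q.

# Hub/authority iteration on a tiny assumed link graph, following the steps above.
import numpy as np

L = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)   # L[p, q] = 1 if page p links to page q

n = L.shape[0]
x = np.ones(n)                           # authority weights x_p
y = np.ones(n)                           # hub weights y_p

for _ in range(50):                      # fixed number of rounds as the termination condition
    x = L.T @ y                          # x_p = sum of y_q over pages q that link to p
    y = L @ x                            # y_p = sum of x_q over pages q that p links to
    x /= np.linalg.norm(x)               # normalise so each weight vector has squared sum 1
    y /= np.linalg.norm(y)

print("authorities:", x.round(3))        # pages with the largest x_p are the authorities
print("hubs:       ", y.round(3))        # pages with the largest y_p are the hubs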

iii) Problems with the HITS algorithm:

1) Hubs and authorities: A clear-cut distinction between hubs and authorities may not
be appropriate, since many sites are hubs as well as authorities.
2) Topic drift: Certain arrangements of tightly connected documents, perhaps due to
mutually reinforcing relationships between hosts, can dominate the HITS computation.
These documents may in some instances not be the most relevant to the query that was
posed.
3) Automatically generated links: Some links are computer generated and represent no
human judgement, but HITS still gives them equal importance.
4) Non-relevant documents: Some queries can return non-relevant documents among the
highly ranked results, and this can lead to erroneous results from the HITS algorithm.
5) Efficiency: The real-time performance of the algorithm is not good, given the steps
that involve finding the sites that are pointed to by pages in the root set.

IV) Web Usage Mining:


Web Usage Mining is the application of data mining techniques to discover
interesting usage patterns from Web data in order to understand and better serve the needs of
Web-based applications. Usage data captures the identity or origin of Web users along with
their browsing behaviour at a Website.
Web usage mining itself can be classified further depending on the kind of usage data
considered:
• Web Server Data: The user logs are collected by the Web server. Typical data
includes IP address, page reference and access time.
• Application Server Data: Commercial application servers have significant features
to enable e-commerce applications to be built on top of them with little effort. A key
feature is the ability to track various kinds of business events and log them in
application server logs.
• Application Level Data: New kinds of events can be defined in an application, and
logging can be turned on for them, thus generating histories of these specially defined
events. It must be noted, however, that many end applications require a combination of
one or more of the techniques applied in the categories above.
The aim of web usage mining is to obtain information and discover usage patterns
that may assist web design and perhaps assist navigation through the site. The mined
data comes from web data repositories, which may include logs of user interactions
with the web, web server logs, proxy server logs, browser logs, and so on. The
information collected in the web server logs usually includes information about access,
referrer and agent. Much of this information may be obtained by using tools that are
available free or at low cost.
Using such tools it is generally possible to find at least the following information (a rough
log-parsing sketch in code is given after the list):
• No. of hits: The number of times each page in the web site has been viewed.
• No. of visitors: The number of users who came to the site.
• Visitor referring website: The URL of the site the user came from.
• Visitor referral website: The URL of the site the user went to when he/she left the website.
• Entry point: Which website page the user entered from.
• Visitor time and duration: The time of day of the visit and how long the visitor browsed the site.
• Path analysis: A list of the paths of pages that users took.
• Visitor IP address: This helps in finding which part of the world the user comes from.
• Browser type, platform, cookies etc.
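As a rough sketch of how a few of these statistics could be extracted, the Python fragment
below counts hits per page, distinct visitors and entry points from an Apache/NCSA-style
access log. The file name and the log format are assumptions; real server logs vary.

# Counting hits, visitors and entry points from a combined-format access log.
import re
from collections import Counter

LINE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
                  r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
                  r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?')

hits = Counter()           # number of times each page was viewed
visitors = set()           # distinct visitor IP addresses
entry_point = {}           # first page requested by each visitor

with open("access.log") as log:
    for line in log:
        match = LINE.match(line)
        if not match:
            continue
        hits[match["path"]] += 1
        if match["ip"] not in visitors:
            entry_point[match["ip"]] = match["path"]
        visitors.add(match["ip"])

print("no. of visitors:", len(visitors))
print("most viewed pages:", hits.most_common(5))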
Search Engines
I) Introduction: A search engine maintains a huge database of internet resources such as web
pages, newsgroups, programs, images etc. It helps to locate information on the World Wide Web.
The user can search for any information by passing a query in the form of keywords or a
phrase; the search engine then searches for relevant information in its database and returns it
to the user.
i) Search Engine Components
Generally there are three basic components of a search engine as listed below:
• Web Crawler
• Database
• Search Interfaces.
1. Web crawler: Also known as a spider or bot, it is a software component that
traverses the web to gather information.
2. Database: All the information on the web is stored in a database, which consists of
huge web resources.
3. Search Interfaces: This component is an interface between user and the database. It
helps the user to search through the database.
ii) TYPES OF SEARCH ENGINES:
a) Crawler-based search engines: Crawler-based search engines develop their
listings using a software agent known as a crawler or spider. The crawler indexes
web pages by crawling the whole web periodically.
Examples of crawler-based search engines are Google, Altavista etc. Any
change in the web pages can be identified by the crawler and will influence the listing
of those pages in the search engine.
b) Directory-based search engines: Directory-based search engines, or human-powered
directories, develop their listings through human editors; examples are the Open
Directory and the Yahoo directory.
For a general query, a human-powered directory provides refined and relevant
search results, but it does not work as efficiently when we search for a very specific
query.
II) Characteristics of Search Engines:
There are certain parameters on the basis of which results are retrieved by search
engines, and the results retrieved by different search engines differ. The following
characteristics make one search engine different from another:
1) Web crawling or spidering: A web crawler is a software agent or program that crawls the
whole web. It starts from a list of URLs known as seeds. These URLs are gathered by the
web crawler from many different sources and are stored in the local database of the web
search engine.
2) Result matching: The result matching technique is used to determine all the relevant
pages in the search engine's database corresponding to a query. Different matching
algorithms are used by different search engines to show more relevant pages in the search
results.
3) Result ranking: The order in which the search results are displayed to the user is known
as result ranking. There are many results which could be displayed to the user, but the
order in which they are displayed matters: it is better for the user if the desired results are
shown on the first or second page of the search engine results.
4) Single-source search engines and meta-search engines: Search engines are classified
as either single-source search engines or meta-search engines. When the search results
are retrieved by only one search engine, it is known as a single-source search engine;
when the results are retrieved by more than one search engine, it is known as a
meta-search engine.
i) Goals of web search: It has been suggested that the information needs of users may be
divided into three classes:
1. Navigational: The primary information need in these queries is to reach the
website that the user has in mind.
2. Informational: The primary information need in these queries is to find a website
that provides useful information about a topic of interest. The user does not have a
particular website in mind.
3. Transactional: The primary need in such queries is to perform some kind of
transaction. The user may or may not know the target websites.
According to one survey:
• Navigational queries- 20-25 percent
• Informational queries- 40-45 percent
• Transactional queries- 30-35 percent

ii) Quality of search results:


The results from a search engine should satisfy the following quality requirements (a tiny
computed example of the first two is given after the list):
1. Precision: Only relevant documents should be returned.
2. Recall: All the relevant documents should be returned.
3. Ranking: A ranking of the documents, providing some indication of the relative
relevance of the results, should be returned.
4. Speed: Results should be provided quickly since users have little patience.
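A tiny Python sketch of how precision and recall could be computed for a single query; the
document-id sets are illustrative assumptions.

# Precision and recall for one query, from sets of document ids.
retrieved = {"d1", "d2", "d3", "d4"}     # documents the engine returned
relevant = {"d2", "d3", "d7"}            # documents actually relevant to the query

found = retrieved & relevant
precision = len(found) / len(retrieved)  # fraction of returned documents that are relevant
recall = len(found) / len(relevant)      # fraction of relevant documents that were returned

print(f"precision={precision:.2f} recall={recall:.2f}")   # precision=0.50 recall=0.67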
III) Search engine functionality:
A search engine is a rather complex collection of software modules. We discuss a number
of functional areas. A search engine carries out a variety of tasks. These include:
1) Collecting information: A search engine normally collects web pages, or information
about them, by web crawling or by human submission of pages.
2) Evaluating and categorizing information: In some cases, for example when web pages are
submitted to a directory, it may be necessary to evaluate a submitted page and decide
whether it should be selected.
3) Creating a database and creating indexes: The information collected needs to be stored
either in a database or in some kind of file system. Indexes must be created so that the
information may be searched efficiently.
4) Computing ranks of web documents: A variety of methods are used to determine the rank
of each page retrieved in response to a user query. The information used may include the
frequency of keywords, the value of in-links and out-links of the page, and the frequency of
use of the page.
5) Checking queries and executing them: Queries posed by users need to be checked, for
example for spelling errors and whether the words in the query are recognizable. Once
checked, a query is executed by searching the search engine database.
6) Presenting results: How the search engine presents the results to the user is important.
The search engine must determine what results to present and how to display them.
7) Profiling the users: To improve search performance, search engines carry out user
profiling, which deals with the way users use search engines.
IV) Search engine Architecture:

A typical search engine architecture, as shown in the figure, consists of many
components, including the following three major components:
1) The Crawler: The crawler, or spider, is an application program that carries out a task
similar to graph traversal. It is given a set of starting URLs that it uses to automatically
traverse the web by retrieving pages, initially from the starting set. Some search engines use
a number of distributed crawlers. Each page found by the crawler is usually not stored as a
separate file, otherwise four billion pages would require managing four billion files.

Crawlers follow an algorithm like the following (a bare-bones sketch in code is given after
the steps):

A) Find base URLs: A set of known and working hyperlinks is collected.
B) Build a queue: Put the base URLs in the queue and add new URLs to the queue as
more are discovered.
C) Retrieve the next page: Retrieve the next page in the queue, process it and store it in
the search engine database.
D) Add to the queue: Check whether the out-links of the current page have already been
processed; if not, add them to the queue.
E) Continue the process until some stopping criteria are met.
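The following is a bare-bones Python sketch of steps A to E, using a page-count stopping
criterion. A real crawler would add politeness delays, robots.txt handling, parallelism and
better error handling; the seed URLs are whatever base URLs the caller supplies.

# A minimal breadth-first crawler following steps A-E above.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href values of anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=20):
    queue = deque(seed_urls)                   # A)-B) base URLs go into the queue
    seen, store = set(seed_urls), {}
    while queue and len(store) < max_pages:    # E) stopping criterion
        url = queue.popleft()                  # C) retrieve the next page in the queue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue
        store[url] = html                      #    process and store it
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:              # D) add unprocessed out-links to the queue
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return store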
2) The Indexer:
Given the size of the web and the number of documents that current search engines
have in their databases, an index is essential to reduce the cost of query evaluation. Building
an index requires document analysis and term extraction. Term extraction involves extracting
all the words from each page, eliminating stop words (common words like the, it, and, that
etc.) and stemming (transforming words like computer, computing and computation into a
single term, say comput). It may also involve analysis of hyperlinks. The indexes require
major updates every time a cycle of crawling has been completed.
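A toy Python sketch of term extraction and an inverted index along these lines. The stop-word
list and the crude suffix-stripping "stemmer" are simplifying assumptions, not a real stemming
algorithm.

# Tokenise, drop stop words, stem, and map each term to the pages containing it.
import re
from collections import defaultdict

STOP_WORDS = {"the", "it", "and", "that", "a", "of", "to", "in", "on"}

def stem(word):
    """Very crude suffix stripping, enough to map computer/computing/computation together."""
    for suffix in ("ation", "ing", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(pages):
    index = defaultdict(set)                    # term -> set of page ids (postings)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                index[stem(word)].add(page_id)
    return index

pages = {1: "computing and computation on the web",
         2: "the computer indexes web pages"}
index = build_index(pages)
print(index["comput"])    # {1, 2}: all three word forms share the stem 'comput'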
3. Updating the Index:
As the crawler updates the search engine database, the inverted index must also be
updated. Depending on how the index is stored, incremental updating may be relatively easy,
but sometimes, rather than applying incremental updates, it may be necessary to rebuild the
whole index.
4. User profiling:
Most search engines provide just one type of interface to the user: an input box in
which the user types keywords and then waits for the results. The interface does not take
into account whether the user is a novice or has been using search engines for years.
5. Query Server:
First of all, a search engine needs to receive the query and check the spelling of the
keywords that the user has typed. If the search engine cannot recognize the keywords as
words in the language or as proper nouns, it is desirable to suggest alternative spellings to
the user. Once the keywords are found to be acceptable, the query may need to be transformed.
6. Query Composition:
A search engine may provide query refinement based on user feedback. Search engines
often cache the results of a query and then use the cached results if the refined query is a
modification of a query that has already been processed.
7. Query Processing:
Search engine query processing is quite different from normal query processing and
query optimization in relational database systems. In database systems, query processing
requires the attribute values to match exactly the values provided in the query. In search
engine query processing, an exact match is not always necessary because the query is
evaluated against the indexes.

8. Caching Query Results:

The most common approach is to use web caches and proxies as intermediaries
between the client browser and the machines serving the web pages. A web cache or proxy
essentially mediates access to the web for improved efficiency. Caching reduces network
traffic and reduces the load on busy web servers.
v) Ranking of Web pages:
i) Page ranking Algorithm:
The page ranking algorithm is based on using hyperlinks as indicators of a page's
importance. It is almost like counting votes in an election: every unique page is assigned a
page rank, and if a lot of pages vote for a page by linking to it, then the page being pointed
to will be considered important. Votes cast by a link farm (a page with very many links) are
given less importance than votes cast by a page that links to only a few pages. Internal site
links also count in assessing page rank.
• The original PageRank algorithm was designed by Lawrence Page and Sergey Brin.
• Page ranking was originally developed based on a probability model of a random
surfer visiting a web page; page rank can be seen as a model of user behaviour.
• The probability of a random surfer clicking on a link may be estimated based on the
number of links on that page.
The page rank of a page A is given by:
PR(A) = (1-d) + d(PR(T1)/C(T1) + PR(T2)/C(T2) + ...)
• PR(A) is the page rank of page A.
• PR(Ti) is the page rank of the pages Ti which link to page A.
• C(Ti) is the number of outbound links on page Ti.
• d is the damping factor, which can be set between 0 and 1.
• The most suitable damping factor, used by default, is 0.85.
• The rank of a document is given by the ranks of those documents which link to it.
• The PR of each page depends upon the PR of the pages pointing to it, but we don't know
what PR those pages have until the pages pointing to them have their PR calculated,
and so on.
• Page ranking is therefore an iterative process.
• An inbound link to a web page always increases the page's page rank.
• When a web page has no outbound links, its page rank cannot be distributed to other
pages. Such links are called dangling links or dead links.

Example: There are three web pages A, B and C where, as the formulas below imply,
A links to B and C, B links to C, and C links to A. (A short program that reproduces the
calculation is given after the results table.)

• Initially the page rank (PR) of every web page = 1.
• PR(A) = (1-d) + d(PR(T1)/C(T1) + PR(T2)/C(T2) + ...)
I-Iteration:
1) PR(A) = (1-d) + d[PR(C)/C(C)]
         = (1-0.85) + 0.85[1/1]
         = 0.15 + 0.85
         = 1
2) PR(B) = (1-d) + d[PR(A)/C(A)]
         = (1-0.85) + 0.85[1/2]
         = 0.15 + 0.85[0.5]
         = 0.15 + 0.425
         = 0.575
3) PR(C) = (1-d) + d[PR(A)/C(A) + PR(B)/C(B)]
         = (1-0.85) + 0.85[(1/2) + (0.575/1)]
         = 0.15 + 0.85[0.5 + 0.575]
         = 0.15 + 0.85[1.075]
         = 1.06375

II-Iteration:
1) PR(A) = (1-d) + d[PR(C)/C(C)]
         = (1-0.85) + 0.85[1.06375/1]
         = 0.15 + 0.9041875
         = 1.0541875
2) PR(B) = (1-d) + d[PR(A)/C(A)]
         = (1-0.85) + 0.85[1.0541875/2]
         = 0.15 + 0.85[0.52709375]
         = 0.15 + 0.4480296875
         = 0.5980296875
3) PR(C) = (1-d) + d[PR(A)/C(A) + PR(B)/C(B)]
         = (1-0.85) + 0.85[(1.0541875/2) + (0.5980296875/1)]
         = 0.15 + 0.85[0.52709375 + 0.5980296875]
         = 0.15 + 0.85[1.1251234375]
         = 0.15 + 0.9563549219
         = 1.1063549219
Iteration        A                B                  C
0                1                1                  1
1                1                0.575              1.06375
2                1.0541875        0.5980296875       1.1063549219
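The short Python program below repeats this calculation, applying the PageRank formula to
the three pages in the order A, B, C and always using the most recently computed values,
exactly as in the worked example above.

# Two iterations of PageRank for the three-page example (A -> B, A -> C, B -> C, C -> A).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # page -> pages it links to
d = 0.85
pr = {page: 1.0 for page in links}                  # initially PR = 1 for every page

for iteration in (1, 2):
    for page in ("A", "B", "C"):                    # update in the same order as above
        incoming = [p for p in links if page in links[p]]
        pr[page] = (1 - d) + d * sum(pr[p] / len(links[p]) for p in incoming)
    print(iteration, {p: round(v, 7) for p, v in pr.items()})
# 1 {'A': 1.0, 'B': 0.575, 'C': 1.06375}
# 2 {'A': 1.0541875, 'B': 0.5980297, 'C': 1.1063549}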

VI) Enterprise Search engine:


Enterprise search is an extensive search system that provides the means to search both
structured and unstructured data sources with a single query. It addresses businesses that need
to store, retrieve and track digital information of all kinds.

• Enterprise search is the practice of making content from multiple enterprise-type
sources, such as databases and intranets, searchable to a defined audience.

• Enterprise search can be contrasted with web search, which applies search technology
to documents on the open web, and desktop search, which applies search technology
to the content on a single computer.

Data sources in enterprise search systems include information stored in many
different containers such as e-mail servers, desktops, messaging systems, enterprise
application databases, content management systems, file systems, intranet sites and
external web sites.
Enterprise search systems provide users with fast query times and search results that are
usually ranked in such a way that the information the user needs is easily accessible.
Enterprise search systems also use access controls to enforce a security policy on their users.
i) Components of an enterprise search system:
1) Content awareness:
Content awareness (or "content collection") usually follows either a push or a pull model.
In the push model, a source system is integrated with the search engine in such a way that
the source connects to the engine and pushes new content directly to its APIs. This model is
used when real-time indexing is important. In the pull model, the software gathers content
from sources using a connector such as a web crawler or a database connector. The
connector typically polls the source at certain intervals to look for new, updated or deleted
content.
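A minimal Python sketch of such a pull-model connector. The callables
fetch_changed_documents and index_document are hypothetical stand-ins for a real
connector API and the search engine's indexing API.

# A connector that polls a source at a fixed interval and forwards changes to the indexer.
import time

def poll_source(fetch_changed_documents, index_document, interval_seconds=300):
    last_run = 0.0
    while True:
        # Ask the source for everything new, updated or deleted since the last poll.
        for document in fetch_changed_documents(since=last_run):
            index_document(document)
        last_run = time.time()
        time.sleep(interval_seconds)    # wait until the next polling interval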
2. Content processing and analysis
Content from different sources may have many different formats or document types,
such as XML, HTML, Office document formats or plain text. The content processing phase
processes the incoming documents to plain text using document filters. It is also often
necessary to normalize content in various ways to improve recall or precision. These may
include stemming, synonym expansion, entity extraction, part of speech tagging.
3. Indexing: The resulting text is stored in an index, which is optimized for quick lookups
without storing the full text of the document. The index may contain the dictionary of all
unique words in the corpus as well as information about ranking and term frequency.
4. Query Processing: Using a web page, the user issues a query to the system. The query
consists of any terms the user enters as well as navigational actions such as faceting and
paging information.
5. Matching: The processed query is then compared to the stored index, and the search
system returns results (or "hits") referencing source documents that match. Some systems are
able to present the document as it was indexed.
ii) Characteristics of an enterprise search engine:
1) The need to access information in diverse repositories, including file systems,
HTTP web servers, Lotus Notes, Microsoft Exchange, content management systems
and documentation, as well as relational databases.
2) The need to respect fine-grained individual access control rights, typically at the
document level; thus two users issuing the same search request may see differing sets
of documents due to differences in their privileges.

3) The need to index and search a large variety of document types such as PDF and
Word files etc.

4) The need to seamlessly and scalably combine structured and unstructured
information in a document search, as well as for organization purposes (clustering,
classification etc.) and for personalization.
For example, imagine a large university with many degree programs and
considerable consulting and research activity. Such a university is likely to have an enormous
amount of information on the web, including the following:
• Information about the university its location and how to contact it.
• Information about degrees offered, admission requirements, Regulations , credit
transfer requirements.
• Material designed for UG and PG students who may be considering joining the
university.
• Information about courses offered including course descriptions etc.
• List of academic staff, general staff and students, their qualifications and expertise
where appropriate.
• Course material, including material archived from previous years.
• Press Releases.
• Internal Newsgroup of Employees.
• Information about University facilities including Laboratories and buildings.
• Information about human resources including terms and conditions of Employment
agreements, pay scales etc.
• Alumni news and Newsletter.
