Discovering knowledge using web structure mining
1. What is Web?
1.1 Problems With Web
- Difficulty in finding relevant information
- Personalization of information
- Learning about consumers or individual users
2. Objectives
i. To survey the area of web mining.
ii. Introduction to link mining.
iii. Review of the HITS and PageRank algorithms.
3. Web Mining: Definition
- Process of discovering potentially useful and previously unknown information or knowledge from web data.
3.1 Web Mining: Subtasks
- Resource finding
- Information selection and pre-processing
- Generalization
- Analysis
3.1 Web Mining Categories
Web mining is divided into three categories, each with its own data source:
- Web Content Mining: text and multimedia documents
- Web Structure Mining: hyperlink structure
- Web Usage Mining: web log records
3.1.1 Web Content Mining
- Scans the data of a web page to determine the relevance of its content to a search query.
- Two approaches: the agent-based approach and the database approach.
3.1.2 Web Structure Mining
- Identifies relationships between web pages.
- Focuses on the following problems:
  - Reducing irrelevant search results.
  - Helping to index information on the web.
3.1.3 Web Usage Mining
- Focuses on techniques that predict user behavior while interacting with the WWW.
- Web log records are analyzed to discover user access patterns.
- The challenges can be divided into three phases:
  - Pre-processing
  - Pattern discovery
  - Pattern analysis
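The three phases can be sketched on a toy log. This is only an illustration: the log format (`user timestamp url`), the sample records, and the function names are assumptions made for the example and do not come from the slides.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified log records: "user_id timestamp url".
RAW_LOG = [
    "u1 100 /home",
    "u1 105 /products",
    "u2 101 /home",
    "u1 110 /cart",
    "u2 107 /products",
    "bad line",
    "u2 120 /home",
]


def preprocess(lines):
    """Pre-processing: drop malformed lines, group visits by user in time order."""
    visits = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # skip records that do not match the assumed format
        user, timestamp, url = parts
        visits[user].append((int(timestamp), url))
    return {user: [url for _, url in sorted(v)] for user, v in visits.items()}


def discover_patterns(paths):
    """Pattern discovery: count how often one page is directly followed by another."""
    transitions = Counter()
    for path in paths.values():
        transitions.update(zip(path, path[1:]))
    return transitions


def analyze(transitions, min_count=2):
    """Pattern analysis: keep only transitions seen at least min_count times."""
    return {pair: n for pair, n in transitions.items() if n >= min_count}


paths = preprocess(RAW_LOG)
frequent = analyze(discover_patterns(paths))
print(frequent)
```

On this toy log, the only transition that reaches the threshold is /home followed by /products, which both users made. Real web logs would also need session splitting and robot filtering, which this sketch leaves out.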
4. Link Mining
- It is located at the intersection of work in:
  - Link analysis
  - Hypertext and web mining
  - Relational learning and inductive logic programming
  - Graph mining
- Some tasks of link mining applicable to web structure mining are:
  - Link-based classification
  - Link-based cluster analysis
  - Link type
  - Link strength
  - Link cardinality
(i) Link-based Classification
- Predicts the category of a web page based on:
  - words that occur on the page
  - links between pages
  - anchor text
  - HTML tags
  - other possible attributes of the web page
- E.g., predicting the category of a paper based on its citations and co-citations.
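As a rough illustration of the citation example, the sketch below predicts a paper's category by majority vote over its labeled neighbors in the link graph. It uses only links and ignores words, anchor text and HTML tags. The graph, the labels and the function names are hypothetical, and majority voting is just one simple technique chosen for the example.

```python
from collections import Counter

# Hypothetical citation graph: paper -> papers it cites.
CITES = {
    "p1": ["p2", "p3"],
    "p2": ["p3"],
    "p3": [],
    "p4": ["p1", "p2", "p5"],
    "p5": [],
}
# Made-up known categories for some of the papers.
LABELS = {"p1": "databases", "p2": "databases", "p3": "databases", "p5": "networks"}


def neighbors(node, graph):
    """Return pages linked from or linking to node (links treated as undirected)."""
    out_links = set(graph.get(node, []))
    in_links = {src for src, targets in graph.items() if node in targets}
    return (out_links | in_links) - {node}


def classify(node, graph, labels):
    """Predict node's category as the most common label among its labeled neighbors."""
    votes = Counter(labels[n] for n in neighbors(node, graph) if n in labels)
    if not votes:
        return None  # no labeled neighbor, so no prediction
    return votes.most_common(1)[0][0]


print(classify("p4", CITES, LABELS))
```

Here p4 cites two "databases" papers and one "networks" paper, so it is assigned "databases".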
(ii) Link-based Cluster Analysis
- Goal: finding naturally occurring subclasses.
- Data is segmented into groups:
  - similar objects are grouped together
  - dissimilar objects are placed in different groups
- Helps in discovering hidden patterns.
- E.g., finding diseases with similar transmission patterns.
(iii) Link Type
- Predicting the type of link between two entities.
- Predicting the purpose of a link.
- E.g., navigational or advertising.
(iv) Link Strength
- Links can be associated with weights.
  - Strong links: higher weight
  - Weak links: lower weight
(v) Link Cardinality
- Refers to the number of inbound links to a web site.
- Link popularity: a combination of factors that weigh the importance of each incoming link.
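Link strength and link cardinality can be illustrated together on a small weighted edge list. In this sketch the links, the weights and the function names are invented. Summing inbound weights is only one possible way to combine factors into a popularity score, since the slides do not specify a formula.

```python
from collections import defaultdict

# Hypothetical weighted links: (source, target, weight). Higher weight = stronger link.
LINKS = [
    ("a", "c", 1.0),
    ("b", "c", 0.5),
    ("d", "c", 0.25),
    ("a", "b", 1.0),
    ("c", "a", 0.5),
]


def link_cardinality(links):
    """Count the inbound links of every target."""
    counts = defaultdict(int)
    for _, target, _ in links:
        counts[target] += 1
    return dict(counts)


def link_popularity(links):
    """Sum the weights of inbound links per target (an assumed weighting scheme)."""
    totals = defaultdict(float)
    for _, target, weight in links:
        totals[target] += weight
    return dict(totals)


print(link_cardinality(LINKS))
print(link_popularity(LINKS))
```

Site c has three inbound links, but its popularity score of 1.75 is lower than a plain count would suggest because two of those links are weak.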
5. Hyperlink-Induced Topic Search (HITS)
- Link analysis algorithm that rates web pages.
- Identifies two kinds of pages from the Web hyperlink structure:
  - Authorities: pages that contain valuable information on the subject.
  - Hubs: pages that contain useful links to the authoritative pages.
[Figure: hubs are web pages with links to other pages; authorities are web pages with content.]
HITS Contd…
- Two-step process:
  - Sampling step: a set of relevant pages is collected.
  - Iterative step: hubs and authorities are found using the output of the sampling step.
HITS Contd…
- Sampling step:
  - A query submitted to a search engine yields a root set.
  - The root set is then expanded into a base set.
[Figure: Expanding the root set into the base set]
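A minimal sketch of the expansion, assuming the rule from Kleinberg's paper: the base set contains the root set, every page a root page links to, and a capped number of pages that link to a root page. The link graph, the function name and the `max_in_links` cap are hypothetical and chosen only for illustration.

```python
# Hypothetical link graph: page -> pages it links to.
GRAPH = {
    "r1": ["x"],
    "r2": ["y", "r1"],
    "x": [],
    "y": ["z"],
    "z": [],
    "h": ["r1", "r2"],
    "q": ["z"],
}


def expand_root_set(root, graph, max_in_links=50):
    """Grow the root set into a base set using out-links and (capped) in-links."""
    base = set(root)
    for page in root:
        # Pages that the root page points to.
        base.update(graph.get(page, []))
        # Pages that point to the root page, limited to max_in_links per page.
        pointing = sorted(src for src, targets in graph.items() if page in targets)
        base.update(pointing[:max_in_links])
    return base


base_set = expand_root_set({"r1", "r2"}, GRAPH)
print(sorted(base_set))
```

With the root set {r1, r2}, the base set gains x and y (linked from the root) and h (linking to the root). Pages z and q stay outside because they are two links away.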
HITS Contd…
- Iterative step:
  - Associate a non-negative authority weight x<p> and a non-negative hub weight y<p> with each page p.
[Figures: Computing Authority Weight; Computing Hub Weight]
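The iterative step can be sketched with the standard update rules from Kleinberg's paper. The authority weight x<p> becomes the sum of the hub weights y<q> of all pages q that link to p. The hub weight y<p> becomes the sum of the authority weights x<q> of all pages q that p links to. Both sets of weights are then normalized. The example graph, the function names and the fixed iteration count are assumptions made for illustration.

```python
import math

# Hypothetical base set: page -> pages it links to.
EXAMPLE = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "h3": ["a1"],
    "a1": [],
    "a2": [],
}


def normalize(weights):
    """Scale weights so that their squares sum to one."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {page: w / norm for page, w in weights.items()}


def hits(graph, iterations=50):
    """Return (authority, hub) weight dictionaries after a fixed number of rounds."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: x<p> = sum of y<q> over pages q linking to p.
        new_auth = {p: 0.0 for p in pages}
        for q, targets in graph.items():
            for p in targets:
                new_auth[p] += hub[q]
        # Hub update: y<p> = sum of x<q> over pages q that p links to.
        new_hub = {p: sum(new_auth[q] for q in graph.get(p, [])) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub


auth, hub = hits(EXAMPLE)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph a1 receives links from all three hubs and ends up with the highest authority weight, while h1 and h2 score higher as hubs than h3 because they point to both authorities.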
Problems With HITS Algorithm
- Some problems with the HITS algorithm are:
  - Mutually reinforced relationships between hosts
  - Automatically generated links
  - Non-relevant nodes
  - Hubs and authorities
  - Topic drift
  - Efficiency
6. PageRank Model
- It is a link analysis algorithm.
- Assigns each web page a numeric value that indicates its importance.
- Computes importance from the page's incoming links.
PageRank Contd…
- The rank of a page is divided evenly among its out-links to contribute to the ranks of the pages they point to.
- PageRanks form a probability distribution over web pages, so the sum of all pages' PageRanks will be one.
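PageRank Contd…
- PageRank can be calculated by:
  PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
- T1..Tn are the pages that point to page A.
- C(A) is defined as the number of links going out of page A.
- d is the damping factor, which is usually set to 0.85.
- The damping factor is the probability that, at each page, a random surfer keeps following links; with probability 1 - d the surfer gets bored and requests another random page.
- As written, the ranks sum to the number of pages N rather than to one (assuming every page has out-links); replacing (1 - d) with (1 - d)/N gives the probability-distribution form.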
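The formula can be applied iteratively until the ranks settle. The sketch below follows the formula exactly as given on the slide, with the (1 - d) term, so the ranks sum to the number of pages rather than to one. The three-page graph, the function name and the fixed iteration count are assumptions for illustration, and pages without out-links are simply skipped.

```python
# Hypothetical link graph: page -> pages it links to.
WEB = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}


def pagerank(graph, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}
        for t, targets in graph.items():
            if not targets:
                continue  # a page with no out-links passes on no rank in this sketch
            share = d * rank[t] / len(targets)  # rank of T split evenly over C(T) links
            for a in targets:
                new_rank[a] += share
        rank = new_rank
    return rank


ranks = pagerank(WEB)
print({page: round(value, 2) for page, value in sorted(ranks.items())})
```

For this graph the ranks settle at roughly A ≈ 1.16, B ≈ 0.64 and C ≈ 1.19. C ranks highest because it receives links from both A and B, and A inherits most of C's rank through C's single out-link.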
Applications
- HITS was used in the Clever search engine by IBM.
- PageRank is used by Google.
References
- Knowledge Discovery and Retrieval on World Wide Web Using Web Structure Mining by Sekhar Babu Boddu, V.P Krishna Anne, Rajesekhara Rao Kurra and Durgesh Kumar Mishra, 2010, In proceedings of Fourth Asia International Conference on Mathematical/Analytical Modelling and Computer Simulation (AMS), IEEE
- Link Mining: A New Data Mining Challenge by Lise Getoor, 2003, SIGKDD Explorations, Volume 4, Issue 2
- Authoritative Sources in a Hyperlinked Environment by Jon M. Kleinberg, 1998, In proceedings of ACM-SIAM Symposium on Discrete Algorithms
- The PageRank Citation Ranking: Bringing Order to the Web by L. Page, S. Brin and T. Winograd, 1998, Technical report, Stanford University
- wikipedia.org
- web-datamining.net
- maya.cs.depaul.edu