Spatial and Web Mining

Uploaded by

samanthaargent21

ADVANCED DATA MINING

WEB MINING

Web Mining Issues

◼ Size
– >350 million pages (1999)
– Grows at about 1 million pages a day
– Google indexes 3 billion documents
◼ Diverse types of data

Web Data
◼ Web pages
◼ Intra-page structures
◼ Inter-page structures
◼ Usage data
◼ Supplemental data
– Profiles
– Registration information
– Cookies

Web Mining Taxonomy

Web Content Mining
◼ Extends work of basic search engines
◼ Search Engines
– IR application
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis

Crawlers
◼ Robot (spider) traverses the hypertext
structure in the Web.
◼ Collect information from visited pages
◼ Used to construct indexes for search engines
◼ Traditional Crawler – visits entire Web (?)
and replaces index
◼ Periodic Crawler – visits portions of the Web
and updates subset of index
◼ Incremental Crawler – selectively searches
the Web and incrementally modifies index
◼ Focused Crawler – visits pages related to a
particular subject

Focused Crawler
◼ Only visit links from a page if that page
is determined to be relevant.
◼ Classifier is static after learning phase.
◼ Components:
– Classifier which assigns relevance score to
each page based on crawl topic.
– Distiller to identify hub pages.
– Crawler visits pages based on classifier
and distiller scores.
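
The interplay of the three components can be sketched on a toy in-memory link graph instead of the live Web. Everything named here is an illustrative assumption: the `PAGES` graph, the keyword-overlap `relevance` function standing in for a trained classifier, and the `hub_score` function standing in for the distiller (it peeks at link targets, which a real distiller would only learn from pages already crawled).

```python
import heapq

# Hypothetical toy Web: page -> (text, outgoing links).
PAGES = {
    "seed": ("data mining course", ["a", "b", "hub"]),
    "a": ("web mining and data mining", ["c"]),
    "b": ("cooking recipes", ["d"]),
    "hub": ("links", ["a", "c", "e"]),
    "c": ("spatial data mining", []),
    "d": ("more recipes", []),
    "e": ("mining web logs", []),
}
TOPIC = {"data", "mining", "web"}


def relevance(page):
    """Stand-in classifier: fraction of the page's words that are topic words."""
    words = PAGES[page][0].split()
    return sum(w in TOPIC for w in words) / len(words)


def hub_score(page):
    """Stand-in distiller: fraction of outgoing links that lead to relevant pages."""
    links = PAGES[page][1]
    if not links:
        return 0.0
    return sum(relevance(t) >= 0.5 for t in links) / len(links)


def focused_crawl(seed, threshold=0.5):
    """Visit pages best-first; follow links only from relevant pages or hubs."""
    frontier = [(-1.0, seed)]      # max-priority queue via negated scores
    seen = {seed}
    visited = []
    while frontier:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        score = max(relevance(page), hub_score(page))
        if score < threshold:
            continue               # irrelevant non-hub page: do not follow its links
        for target in PAGES[page][1]:
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score, target))
    return visited


print(focused_crawl("seed"))
```

In this toy graph the page `hub` has no topic words of its own, yet its links are followed because all of them lead to relevant pages; page `d` is never reached because it is only linked from the irrelevant page `b`.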

Focused Crawler
◼ Classifier relates documents to topics
◼ Classifier also determines how useful
outgoing links are
◼ Hub Pages contain links to many
relevant pages. Must be visited even if
not high relevance score.

Focused Crawler

Context Focused Crawler

◼ Context Graph:
– Context graph created for each seed document.
– Root is the seed document.
– Nodes at each level show documents with links
to documents at next higher level.
– Updated during crawl itself.
◼ Approach:
1. Construct context graph and classifiers using
seed documents as training data.
2. Perform crawling using classifiers and context
graph created.
© Prentice Hall

Context Graph

Virtual Web View

◼ Multiple Layered DataBase (MLDB) built on top
of the Web.
◼ Each layer of the database is more generalized
(and smaller) and centralized than the one
beneath it.
◼ Upper layers of MLDB are structured and can be
accessed with SQL type queries.
◼ Translation tools convert Web documents to XML.
◼ Extraction tools extract desired information to
place in first layer of MLDB.
◼ Higher levels contain more summarized data
obtained through generalizations of the lower
levels.

Personalization
◼ Web access or contents tuned to better fit the
desires of each user.
◼ Manual techniques identify user’s preferences
based on profiles or demographics.
◼ Collaborative filtering identifies preferences
based on ratings from similar users.
◼ Content based filtering retrieves pages
based on similarity between pages and user
profiles.

Web Structure Mining

◼ Mine structure (links, graph) of the Web
◼ Techniques
– PageRank
– CLEVER
◼ Create a model of the Web organization.
◼ May be combined with content mining to
more effectively retrieve important pages.

PageRank
◼ Used by Google
◼ Prioritize pages returned from search by
looking at Web structure.
◼ Importance of page is calculated based
on number of pages which point to it –
Backlinks.
◼ Weighting is used to provide more
importance to backlinks coming from
important pages.

PageRank (cont’d)
◼ PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)
– PR(i): PageRank for a page i which points
to target page p.
– Ni: number of links coming out of page i
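
A minimal sketch of how the formula above can be evaluated iteratively. It assumes that c acts as a normalization constant (here chosen each round so the ranks sum to 1), that every page has at least one outgoing link, and that the four-link graph `WEB` is made-up example data.

```python
def pagerank(out_links, iterations=100):
    """Repeatedly apply PR(p) = c * sum(PR(i) / N_i) over pages i that link to p."""
    pages = list(out_links)
    pr = {p: 1.0 / len(pages) for p in pages}       # start with equal ranks
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for i, targets in out_links.items():
            for p in targets:
                new[p] += pr[i] / len(targets)      # page i shares its rank over N_i links
        total = sum(new.values())
        pr = {p: v / total for p, v in new.items()}  # c = 1 / total keeps the sum at 1
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links to A.
WEB = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(WEB)
print({p: round(r, 3) for p, r in ranks.items()})
```

Page C collects rank from both A and B, and passes all of it back to A, so A and C end up with equal rank and B with half as much.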

CLEVER
◼ Identify authoritative and hub pages.
◼ Authoritative Pages :
– Highly important pages.
– Best source for requested information.
◼ Hub Pages :
– Contain links to highly important pages.

HITS
◼ Hyperlink-Induced Topic Search
◼ Based on a set of keywords, find set of
relevant pages – R.
◼ Identify hub and authority pages for these.
– Expand R to a base set, B, of pages linked to or
from R.
– Calculate weights for authorities and hubs.
◼ Pages with highest ranks in R are returned.
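
The weight calculation for authorities and hubs can be sketched as a mutual-reinforcement iteration over the base set. This is an illustrative version only: the link dictionary `BASE` is made-up data, and the fixed iteration count and Euclidean normalization are assumptions rather than details taken from the slides.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) weights for a dict page -> list of linked pages."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages that link to p.
        auth = normalize({p: sum(hub[q] for q in links if p in links[q])
                          for p in pages})
        # Hub weight: sum of authority weights of the pages that p links to.
        hub = normalize({p: sum(auth[t] for t in links.get(p, ()))
                         for p in pages})
    return hub, auth


# Hypothetical base set B: three linking pages and two linked-to pages.
BASE = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(BASE)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```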

HITS Algorithm

Web Usage Mining

◼ Extends work of basic search engines
◼ Search Engines
– IR application
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis

Web Usage Mining Applications
◼ Personalization
◼ Improve structure of a site’s Web pages
◼ Aid in caching and prediction of future
page references
◼ Improve design of individual pages
◼ Improve effectiveness of e-commerce
(sales and advertising)

Web Usage Mining Activities

◼ Preprocessing Web log
– Cleanse
– Remove extraneous information
– Sessionize
Session: Sequence of pages referenced by one user at a sitting.
◼ Pattern Discovery
– Count patterns that occur in sessions
– A pattern is a sequence of page references in a session.
– Similar to association rules
» Transaction: session
» Itemset: pattern (or subset)
» Order is important
◼ Pattern Analysis
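
The sessionize and pattern-discovery steps above can be sketched as follows. The log records, the 30-minute inactivity timeout, and the choice to count only contiguous page sequences of a fixed length are all assumptions made for illustration; the log is taken to be already cleansed.

```python
from collections import Counter

# Made-up, already cleansed log: (user id, time in minutes, page).
LOG = [
    ("u1", 0, "A"), ("u1", 2, "B"), ("u1", 5, "C"),
    ("u2", 1, "A"), ("u2", 3, "B"),
    ("u1", 90, "A"), ("u1", 93, "C"),
]


def sessionize(log, timeout=30):
    """Split each user's requests into sessions at gaps longer than `timeout`."""
    sessions = []
    open_session = {}  # user -> (time of last request, pages of current session)
    for user, t, page in sorted(log):
        last = open_session.get(user)
        if last is not None and t - last[0] <= timeout:
            pages = last[1]
            pages.append(page)
        else:
            pages = [page]
            sessions.append(pages)
        open_session[user] = (t, pages)
    return sessions


def count_patterns(sessions, length=2):
    """Count in how many sessions each ordered page sequence occurs."""
    counts = Counter()
    for s in sessions:
        # A set, so a pattern repeated within one session is counted once.
        counts.update({tuple(s[i:i + length])
                       for i in range(len(s) - length + 1)})
    return counts


sessions = sessionize(LOG)
print(sessions)
print(count_patterns(sessions).most_common())
```

Because order matters, the pattern (A, B) is counted separately from (B, A), unlike an itemset in ordinary association rule mining.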

Web Usage Mining Issues
◼ Identification of exact user not possible.
◼ Exact sequence of pages referenced by
a user not possible due to caching.
◼ Session not well defined
◼ Security, privacy, and legal issues

Spatial Mining

Spatial Object
◼ Contains both spatial and nonspatial
attributes.
◼ Must have a location-type attribute:
– Latitude/longitude
– Zip code
– Street address
◼ May retrieve object using either (or
both) spatial or nonspatial attributes.

Spatial Data Mining Applications

◼ Geology
◼ GIS Systems
◼ Environmental Science
◼ Agriculture
◼ Medicine
◼ Robotics
◼ May involve both spatial and temporal
aspects

Spatial Queries
◼ Spatial selection may involve specialized
selection comparison operations:
– Near
– North, South, East, West
– Contained in
– Overlap/intersect
◼ Region (Range) Query – find objects that
intersect a given region.
◼ Nearest Neighbor Query – find the object closest to
an identified object.
◼ Distance Scan – find objects within a certain
distance of an identified object, where the distance is
made increasingly larger.
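
The three query types can be illustrated with a brute-force sketch over point objects. The object names and coordinates in `POINTS` are invented, Euclidean distance is assumed, and no spatial index is used; a real system would answer these queries through a structure such as an R-tree.

```python
import math

# Made-up point objects: name -> (x, y).
POINTS = {"school": (1, 1), "park": (2, 3), "mall": (6, 5), "lake": (9, 9)}


def range_query(points, xmin, ymin, xmax, ymax):
    """Region query: names of objects lying inside the given rectangle."""
    return sorted(n for n, (x, y) in points.items()
                  if xmin <= x <= xmax and ymin <= y <= ymax)


def nearest_neighbor(points, target):
    """Nearest neighbor query: the object closest to `target` (Euclidean)."""
    others = (n for n in points if n != target)
    return min(others, key=lambda n: math.dist(points[n], points[target]))


def distance_scan(points, target, radii):
    """Distance scan: for each growing radius, the objects newly within reach."""
    found = set()
    for r in sorted(radii):
        within = {n for n in points
                  if n != target and math.dist(points[n], points[target]) <= r}
        yield r, sorted(within - found)
        found |= within


print(range_query(POINTS, 0, 0, 3, 3))
print(nearest_neighbor(POINTS, "school"))
print(list(distance_scan(POINTS, "school", [3, 7, 12])))
```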

Spatial Data Structures

◼ Data structures designed specifically to store or
index spatial data.
◼ Often based on B-tree or Binary Search Tree
◼ Cluster data on disk based on geographic
location.
◼ May represent complex spatial structure by
placing the spatial object in a containing structure
of a specific geographic shape.
◼ Techniques:
– Quad Tree
– R-Tree
– k-D Tree

MBR
◼ Minimum Bounding Rectangle
◼ Smallest rectangle that completely
contains the object
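
A minimal sketch of computing an MBR, assuming the object is given as a list of (x, y) vertices and the rectangle is axis-aligned, as is usual when MBRs are used in spatial indexes. The triangle below is example data.

```python
def mbr(vertices):
    """Return the axis-aligned MBR of a vertex list as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs), max(ys))


triangle = [(2, 1), (5, 4), (3, 6)]
print(mbr(triangle))  # (2, 1, 5, 6)
```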

MBR Examples

Quad Tree
◼ Hierarchical decomposition of the space
into quadrants (MBRs)
◼ Each level in the tree represents the
object as the set of quadrants which
contain any portion of the object.
◼ Each level is a more exact representation
of the object.
◼ The number of levels is determined by
the degree of accuracy desired.
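
The level-by-level refinement can be sketched as below. For simplicity the object is assumed to be an axis-aligned rectangle inside an 8 x 8 space, and a quadrant counts as containing part of the object only when their interiors overlap; both choices are illustrative assumptions.

```python
def interiors_overlap(a, b):
    """True if rectangles (x1, y1, x2, y2) share more than a boundary."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def quad_levels(obj, space, levels):
    """For each level, list the quadrants that contain some part of obj."""
    result = []
    current = [space]
    for _ in range(levels):
        refined = []
        for x1, y1, x2, y2 in current:
            mx, my = (x1 + x2) / 2, (y1 + y2) / 2
            quadrants = [(x1, y1, mx, my), (mx, y1, x2, my),
                         (x1, my, mx, y2), (mx, my, x2, y2)]
            # Keep only the quadrants that overlap the object, then split those further.
            refined.extend(q for q in quadrants if interiors_overlap(q, obj))
        result.append(refined)
        current = refined
    return result


def covered_area(cells):
    return sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in cells)


OBJECT = (1, 1, 3, 2)   # example object with area 2
SPACE = (0, 0, 8, 8)
for depth, cells in enumerate(quad_levels(OBJECT, SPACE, 3), start=1):
    print(depth, len(cells), covered_area(cells))
```

The covered area shrinks from 16 to 8 to 2 over the three levels, which shows each level giving a more exact representation of the object.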

Quad Tree Example

R-Tree
◼ As with Quad Tree the region is divided
into successively smaller rectangles
(MBRs).
◼ Rectangles need not be of the same
size or number at each level.
◼ Rectangles may actually overlap.
◼ Lowest level cell has only one object.
◼ Tree maintenance algorithms similar to
those for B-trees.

R-Tree Example

k-D Tree
◼ Designed for multi-attribute data, not
necessarily spatial
◼ Variation of binary search tree
◼ Each level is used to index one of the
dimensions of the spatial object.
◼ Lowest level cell has only one object
◼ Divisions not based on MBRs but
successive divisions of the dimension
range.
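
A minimal sketch of a k-D tree over two-dimensional points, with a rectangular range search. Note one assumption: this common variant stores a point at every node and splits at the median, rather than keeping objects only in the lowest-level cells. The sample points are invented.

```python
def build(points, depth=0):
    """Build a k-D tree; level `depth` splits on dimension depth mod k."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                      # median point becomes this node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build(points[:mid], depth + 1),       # coordinates <= split value
        "right": build(points[mid + 1:], depth + 1),  # coordinates >= split value
    }


def range_search(node, lo, hi):
    """Return all points p with lo[d] <= p[d] <= hi[d] in every dimension d."""
    if node is None:
        return []
    p, axis = node["point"], node["axis"]
    found = [p] if all(l <= c <= h for l, c, h in zip(lo, p, hi)) else []
    if lo[axis] <= p[axis]:                     # range reaches into the left half
        found += range_search(node["left"], lo, hi)
    if p[axis] <= hi[axis]:                     # range reaches into the right half
        found += range_search(node["right"], lo, hi)
    return found


SAMPLE = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(SAMPLE)
print(tree["point"])
print(sorted(range_search(tree, (3, 1), (8, 5))))
```

The search skips a whole subtree whenever the query range lies entirely on the other side of the node's split value, which is where the structure saves work over a linear scan.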

k-D Tree Example

Topological Relationships
◼ Disjoint
◼ Overlaps or Intersects
◼ Equals
◼ Covered by or inside or contained in
◼ Covers or contains
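
These relationships can be illustrated on axis-aligned rectangles such as MBRs. The sketch below assumes rectangles given as (xmin, ymin, xmax, ymax) and treats rectangles that merely touch as overlapping; relationships between exact object geometries need more careful tests.

```python
def relation(a, b):
    """Topological relationship of rectangle a to rectangle b."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    if a == b:
        return "equals"
    if ax2 < bx1 or bx2 < ax1 or ay2 < by1 or by2 < ay1:
        return "disjoint"
    if bx1 <= ax1 and by1 <= ay1 and ax2 <= bx2 and ay2 <= by2:
        return "inside"        # a is covered by b
    if ax1 <= bx1 and ay1 <= by1 and bx2 <= ax2 and by2 <= ay2:
        return "contains"      # a covers b
    return "overlaps"


print(relation((0, 0, 2, 2), (3, 3, 5, 5)))   # disjoint
print(relation((0, 0, 4, 4), (2, 2, 6, 6)))   # overlaps
print(relation((1, 1, 2, 2), (0, 0, 4, 4)))   # inside
print(relation((0, 0, 4, 4), (1, 1, 2, 2)))   # contains
```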

Distance Between Objects

◼ Euclidean
◼ Manhattan
◼ Extensions:

Spatial Data Dominant Algorithm

STING
◼ STatistical Information Grid-based
◼ Hierarchical technique to divide area
into rectangular cells
◼ Grid data structure contains summary
information about each cell
◼ Hierarchical clustering
◼ Similar to quad tree
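
A minimal sketch of the hierarchical grid idea: the bottom-level cells are filled from the data, and each higher-level cell summarizes its four children. The summary kept here (a count and a sum, from which a mean follows) is a simplifying assumption, as are the square space, the function names, and the sample points.

```python
def merge(*cells):
    """Summary of a parent cell, computed from the summaries of its children."""
    return {"count": sum(c["count"] for c in cells),
            "sum": sum(c["sum"] for c in cells)}


def build_grid(points, size, levels):
    """Return the grids from the top (1 x 1) level down to the bottom level.

    points: (x, y, value) triples with 0 <= x, y < size.
    """
    n = 2 ** (levels - 1)                 # cells per side at the bottom level
    width = size / n
    bottom = [[{"count": 0, "sum": 0.0} for _ in range(n)] for _ in range(n)]
    for x, y, value in points:
        cell = bottom[int(y // width)][int(x // width)]
        cell["count"] += 1
        cell["sum"] += value
    grids = [bottom]
    while len(grids[0]) > 1:              # build each parent level from 2 x 2 blocks
        child = grids[0]
        m = len(child) // 2
        parent = [[merge(child[2 * r][2 * c], child[2 * r][2 * c + 1],
                         child[2 * r + 1][2 * c], child[2 * r + 1][2 * c + 1])
                   for c in range(m)] for r in range(m)]
        grids.insert(0, parent)
    return grids


DATA = [(1, 1, 10), (3, 3, 20), (6, 6, 30), (7, 1, 40)]
grids = build_grid(DATA, size=8, levels=3)
print(grids[0][0][0])     # summary of the whole area
print(grids[1][0][0])     # summary of one quadrant at the second level
```

Because parent summaries are derived from child summaries alone, a query can start at the top and descend only into cells whose summary looks relevant, without rescanning the data.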

STING

STING Build Algorithm

STING Algorithm

Spatial Rules
◼ Characteristic Rule
The average family income in Dallas is $50,000.
◼ Discriminant Rule
The average family income in Dallas is $50,000,
while in Plano the average income is $75,000.
◼ Association Rule
The average family income in Dallas for families
living near White Rock Lake is $100,000.

Spatial Association Rules
◼ Either antecedent or consequent must
contain spatial predicates.
◼ View underlying database as set of
spatial objects.
◼ May create using a type of progressive
refinement

Spatial Association Rule Algorithm

Spatial Classification
◼ Partition spatial objects
◼ May use nonspatial attributes and/or
spatial attributes
◼ Generalization and progressive
refinement may be used.

Spatial Clustering
◼ Detect clusters of irregular shapes
◼ Use of centroids and simple distance
approaches may not work well.
◼ Clusters should be independent of order
of input.

CLARANS Extensions
◼ Remove main memory assumption of
CLARANS.
◼ Use spatial index techniques.
◼ Use sampling and R*-tree to identify
central objects.
◼ Change cost calculations by reducing
the number of objects examined.
◼ Voronoi Diagram

Voronoi

SD(CLARANS)
◼ Spatial Dominant
◼ First clusters spatial components using
CLARANS
◼ Then iteratively replaces medoids, but
limits number of pairs to be searched.
◼ Uses generalization
◼ Uses a learning tool to derive description
of cluster.

SD(CLARANS) Algorithm

Aggregate Proximity
◼ Aggregate Proximity – measure of how
close a cluster is to a feature.
◼ Aggregate proximity relationship finds the
k closest features to a cluster.
◼ CRH Algorithm – uses different shapes:
– Encompassing Circle
– Isothetic Rectangle
– Convex Hull