Spatial and Web Mining
WEB MINING
Web Data
◼ Web pages
◼ Intra-page structures
◼ Inter-page structures
◼ Usage data
◼ Supplemental data
– Profiles
– Registration information
– Cookies
© Prentice Hall
Web Content Mining
◼ Extends work of basic search engines
◼ Search Engines
– IR application
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis
Crawlers
◼ Robot (spider) traverses the hypertext
structure of the Web.
◼ Collect information from visited pages
◼ Used to construct indexes for search engines
◼ Traditional Crawler – visits entire Web (?)
and replaces index
◼ Periodic Crawler – visits portions of the Web
and updates subset of index
◼ Incremental Crawler – selectively searches
the Web and incrementally modifies index
◼ Focused Crawler – visits pages related to a
particular subject
Focused Crawler
◼ Only visit links from a page if that page
is determined to be relevant.
◼ Classifier is static after learning phase.
◼ Components:
– Classifier which assigns relevance score to
each page based on crawl topic.
– Distiller to identify hub pages.
– Crawler visits pages based on classifier
and distiller scores.
Focused Crawler (cont’d)
◼ Classifier relates documents to topics
◼ Classifier also determines how useful
outgoing links are
◼ Hub Pages contain links to many
relevant pages. They must be visited even if
their relevance score is not high.
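The classifier, distiller, and crawler components can be sketched as a best-first crawl over a tiny in-memory link graph. This is a minimal illustration, not the crawler the slides describe: `GRAPH`, `RELEVANCE`, the 0.5 threshold, and the `hub_score` rule are all hypothetical stand-ins, and the sketch simply looks up scores that a real classifier and distiller would have to estimate.

```python
import heapq

# Hypothetical toy Web: page -> outgoing links.
GRAPH = {
    "seed": ["hub", "offtopic"],
    "hub": ["a", "b", "c"],
    "offtopic": ["x"],
    "a": [], "b": [], "c": [], "x": [],
}
# Stand-in for a trained classifier: a fixed relevance score per page.
RELEVANCE = {"seed": 0.9, "hub": 0.2, "offtopic": 0.1,
             "a": 0.8, "b": 0.7, "c": 0.9, "x": 0.1}


def hub_score(page, threshold):
    """Distiller stand-in: fraction of out-links that lead to relevant pages."""
    links = GRAPH[page]
    if not links:
        return 0.0
    return sum(RELEVANCE[q] >= threshold for q in links) / len(links)


def focused_crawl(seed, threshold=0.5):
    """Visit pages best-first; follow links only from relevant pages or hubs."""
    frontier = [(-RELEVANCE[seed], seed)]   # max-heap via negated priority
    seen = {seed}
    visited = []
    while frontier:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        priority = max(RELEVANCE[page], hub_score(page, threshold))
        if priority < threshold:
            continue                        # irrelevant non-hub: do not follow its links
        for link in GRAPH[page]:
            if link not in seen:
                seen.add(link)
                score = max(RELEVANCE[link], hub_score(link, threshold))
                heapq.heappush(frontier, (-score, link))
    return visited
```

In this toy graph the page `hub` has a low relevance score but is still expanded because its hub score is high, while the links of `offtopic` are never followed, so `x` is never visited.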
Context Graph
Personalization
◼ Web access or contents tuned to better fit the
desires of each user.
◼ Manual techniques identify user’s preferences
based on profiles or demographics.
◼ Collaborative filtering identifies preferences
based on ratings from similar users.
◼ Content-based filtering retrieves pages
based on similarity between pages and user
profiles.
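Content-based filtering can be illustrated by ranking pages by their similarity to a user profile. The sketch below assumes pages and profiles are represented as term-weight dictionaries and uses cosine similarity; both choices, and the names `cosine` and `rank_pages`, are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_pages(profile, pages):
    """Return page names ordered from most to least similar to the profile."""
    return sorted(pages, key=lambda name: cosine(profile, pages[name]), reverse=True)


# Hypothetical profile and pages.
profile = {"spatial": 1.0, "mining": 1.0}
pages = {
    "p1": {"spatial": 1.0, "mining": 1.0},
    "p2": {"cooking": 1.0},
    "p3": {"mining": 1.0, "web": 1.0},
}
ranking = rank_pages(profile, pages)
```

Here `p1` matches the profile exactly, `p3` shares one of two terms, and `p2` shares none, so the ranking is `p1`, `p3`, `p2`.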
PageRank
◼ Used by Google
◼ Prioritize pages returned from search by
looking at Web structure.
◼ Importance of page is calculated based
on number of pages which point to it –
Backlinks.
◼ Weighting is used to provide more
importance to backlinks coming from
important pages.
PageRank (cont’d)
◼ PR(p) = c (PR(1)/N_1 + … + PR(n)/N_n)
– PR(i): PageRank of a page i which points
to target page p.
– N_i: number of links coming out of page i
– c: normalization constant
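The formula can be evaluated iteratively on a small link graph. The sketch below is one possible reading, not Google's implementation: it assumes every linked page is also a key of `links`, and it picks c at each iteration so that the ranks sum to 1. The function name and the toy graph are hypothetical.

```python
def pagerank(links, iterations=100):
    """Repeatedly apply PR(p) = c * sum(PR(i) / N_i) over pages i that link to p."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        raw = {p: 0.0 for p in pages}
        for i, targets in links.items():
            for p in targets:
                raw[p] += pr[i] / len(targets)    # PR(i) / N_i
        total = sum(raw.values())
        pr = {p: raw[p] / total for p in pages}   # c = 1 / total keeps the sum at 1
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

On this graph the ranks settle near 0.4 for A, 0.2 for B, and 0.4 for C: C collects rank from both A and B, and passes all of it back to A.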
CLEVER
◼ Identify authoritative and hub pages.
◼ Authoritative Pages:
– Highly important pages.
– Best source for requested information.
◼ Hub Pages:
– Contain links to highly important pages.
HITS
◼ Hyperlink-Induced Topic Search
◼ Based on a set of keywords, find set of
relevant pages – R.
◼ Identify hub and authority pages for these.
– Expand R to a base set, B, of pages linked to or
from R.
– Calculate weights for authorities and hubs.
◼ Pages with highest ranks in R are returned.
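The weight calculation for hubs and authorities can be sketched with the usual mutual-reinforcement update: a page's authority weight is the sum of the hub weights of pages linking to it, and its hub weight is the sum of the authority weights of the pages it links to. The sketch assumes the base set B is already given as a link dictionary; the function name, iteration count, and toy graph are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) weights for the pages in a base set."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum hub weights over incoming links.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: sum authority weights over outgoing links.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both weight vectors to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical base set: h1 links to a1 and a2, h2 links to a1.
hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
```

In this example `a1` receives the highest authority weight because both hubs point to it, and `h1` receives the highest hub weight because it points to both authorities.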
HITS Algorithm
Web Usage Mining Applications
◼ Personalization
◼ Improve structure of a site’s Web pages
◼ Aid in caching and prediction of future
page references
◼ Improve design of individual pages
◼ Improve effectiveness of e-commerce
(sales and advertising)
Web Usage Mining Issues
◼ Identification of exact user not possible.
◼ Exact sequence of pages referenced by
a user cannot be determined due to caching.
◼ Session not well defined
◼ Security, privacy, and legal issues
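Because a session is not well defined, any concrete definition is a heuristic. The sketch below assumes one common choice, splitting a visitor's requests whenever the gap between two consecutive requests exceeds a timeout. The 1800-second default, the log layout, and the `user_id` field (which in practice may be only an IP address or cookie, not an exact user) are assumptions.

```python
def sessionize(requests, timeout=1800):
    """Group (user_id, timestamp_seconds, url) records into per-user sessions."""
    sessions = []
    last_seen = {}   # user_id -> (last timestamp, current session's URL list)
    for user, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        prev = last_seen.get(user)
        if prev is None or ts - prev[0] > timeout:
            current = []                    # gap too long: start a new session
            sessions.append((user, current))
        else:
            current = prev[1]
        current.append(url)
        last_seen[user] = (ts, current)
    return sessions


# Hypothetical log entries.
log = [("u1", 0, "/a"), ("u1", 600, "/b"), ("u1", 5000, "/c"), ("u2", 100, "/a")]
sessions = sessionize(log)
```

With this log, `u1` yields two sessions (`/a`, `/b` and then `/c`) and `u2` yields one.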
SPATIAL MINING
Spatial Object
◼ Contains both spatial and nonspatial
attributes.
◼ Must have a location-type attribute:
– Latitude/longitude
– Zip code
– Street address
◼ May retrieve object using either (or
both) spatial or nonspatial attributes.
Spatial Queries
◼ Spatial selection may involve specialized
selection comparison operations:
– Near
– North, South, East, West
– Contained in
– Overlap/intersect
◼ Region (Range) Query – find objects that
intersect a given region.
◼ Nearest Neighbor Query – find object closest to
identified object.
◼ Distance Scan – find object within a certain
distance of an identified object where distance is
made increasingly larger.
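The three query types can be sketched with brute-force scans over point objects. This is a minimal illustration without any spatial index; the `OBJECTS` data, the function names, and the use of Euclidean distance are assumptions.

```python
import math

# Hypothetical spatial objects: name -> (x, y) location.
OBJECTS = {"school": (1, 1), "park": (4, 5), "mall": (9, 2), "lake": (5, 6)}


def region_query(objects, xmin, ymin, xmax, ymax):
    """Names of objects located inside the given rectangular region."""
    return sorted(n for n, (x, y) in objects.items()
                  if xmin <= x <= xmax and ymin <= y <= ymax)


def nearest_neighbor(objects, target):
    """Name of the object closest to the identified object."""
    center = objects[target]
    others = (n for n in objects if n != target)
    return min(others, key=lambda n: math.dist(objects[n], center))


def distance_scan(objects, target, radii):
    """For each radius, in increasing order, list the objects within that distance."""
    center = objects[target]
    return [(r, sorted(n for n in objects
                       if n != target and math.dist(objects[n], center) <= r))
            for r in sorted(radii)]
```

For example, `nearest_neighbor(OBJECTS, "park")` returns `"lake"`, and `distance_scan(OBJECTS, "park", [1, 2, 6])` finds nothing at radius 1, only the lake at radius 2, and all three other objects at radius 6.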
MBR
◼ Minimum Bounding Rectangle
◼ Smallest rectangle that completely
contains the object
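For an object given as a list of vertices, the MBR follows directly from the extreme coordinates. The sketch below assumes axis-aligned rectangles stored as `(xmin, ymin, xmax, ymax)` tuples; the function names are hypothetical.

```python
def mbr(points):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a set of points."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def mbrs_intersect(a, b):
    """True if two MBRs share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hypothetical triangular object.
triangle = [(1, 1), (4, 2), (2, 5)]
box = mbr(triangle)
```

An MBR test is only a filter: two objects whose MBRs intersect may still be disjoint, so a positive result normally needs an exact geometric check afterwards.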
MBR Examples
Quad Tree
◼ Hierarchical decomposition of the space
into quadrants (MBRs)
◼ Each level in the tree represents the
object as the set of quadrants which
contain any portion of the object.
◼ Each level is a more exact representation
of the object.
◼ The number of levels is determined by
the degree of accuracy desired.
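The level-by-level representation can be sketched by recursively splitting a square space into four quadrants and keeping those that contain any portion of the object. The sketch assumes the object is itself approximated by a rectangle `(xmin, ymin, xmax, ymax)`; the function names and the example coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if rectangles (xmin, ymin, xmax, ymax) share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def quadrants(cell, obj, depth):
    """Cells at the given depth that contain any portion of obj."""
    if not overlaps(cell, obj):
        return []
    if depth == 0:
        return [cell]
    xmin, ymin, xmax, ymax = cell
    xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
    children = [(xmin, ymin, xmid, ymid), (xmid, ymin, xmax, ymid),
                (xmin, ymid, xmid, ymax), (xmid, ymid, xmax, ymax)]
    return [c for child in children for c in quadrants(child, obj, depth - 1)]


space = (0, 0, 8, 8)   # hypothetical 8 x 8 space
obj = (1, 1, 3, 3)     # hypothetical object extent
levels = [quadrants(space, obj, d) for d in range(4)]
```

At depth 1 the object is represented by one 4 x 4 quadrant, at depth 2 by four 2 x 2 cells, and at depth 3 by four 1 x 1 cells that match the object exactly, which is the sense in which deeper levels are more exact.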
R-Tree
◼ As with Quad Tree the region is divided
into successively smaller rectangles
(MBRs).
◼ Rectangles need not be of the same
size or number at each level.
◼ Rectangles may actually overlap.
◼ Lowest level cell has only one object.
◼ Tree maintenance algorithms similar to
those for B-trees.
R-Tree Example
K-D Tree
◼ Designed for multi-attribute data, not
necessarily spatial
◼ Variation of binary search tree
◼ Each level is used to index one of the
dimensions of the spatial object.
◼ Lowest level cell has only one object
◼ Divisions not based on MBRs but
successive divisions of the dimension
range.
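A k-d tree can be sketched as a binary tree that splits on one dimension per level, cycling through the dimensions. The variant below is an assumption made for brevity: it stores one point at every node (not only at the leaves) and splits at the median. The function names and sample points are hypothetical.

```python
import math


def build(points, depth=0):
    """Build a k-d tree; each level splits on one dimension in turn."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}


def nearest(node, target, best=None):
    """Return the stored point closest to target."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, target, best)
    if abs(diff) < math.dist(best, target):   # the other side may still hold a closer point
        best = nearest(far, target, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
```

For example, `nearest(tree, (9, 2))` returns `(8, 1)`, the same answer a brute-force scan over `points` gives.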
Topological Relationships
◼ Disjoint
◼ Overlaps or Intersects
◼ Equals
◼ Covered by or inside or contained in
◼ Covers or contains
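For axis-aligned rectangles these relationships can be decided by comparing coordinates. The sketch below is a simplification: it treats rectangles that merely touch as overlapping and does not distinguish "covered by" from "inside". The function name and return labels are hypothetical.

```python
def relation(a, b):
    """Topological relationship of rectangle a to rectangle b.

    Rectangles are (xmin, ymin, xmax, ymax) tuples.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    if a == b:
        return "equals"
    if ax2 < bx1 or bx2 < ax1 or ay2 < by1 or by2 < ay1:
        return "disjoint"
    if bx1 <= ax1 and by1 <= ay1 and ax2 <= bx2 and ay2 <= by2:
        return "inside"
    if ax1 <= bx1 and ay1 <= by1 and bx2 <= ax2 and by2 <= ay2:
        return "contains"
    return "overlaps"
```

Note that "inside" and "contains" are inverses: if `relation(a, b)` is `"inside"`, then `relation(b, a)` is `"contains"`.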
Spatial Data Dominant Algorithm
STING
◼ STatistical Information Grid-based
◼ Hierarchical technique to divide area
into rectangular cells
◼ Grid data structure contains summary
information about each cell
◼ Hierarchical clustering
◼ Similar to quad tree
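The grid structure can be sketched as a list of levels in which every cell holds summary information and each parent cell is computed from its four children. The sketch keeps only a count and a sum per cell (so a mean can be derived); that choice of statistics, the function name, and the sample data are assumptions.

```python
def build_grid(points, size, levels):
    """Return a list of levels; each maps (row, col) -> {"count", "sum"}.

    points are (x, y, value) triples inside a size x size area.
    grid[0] is the single top-level cell; grid[-1] is the finest level.
    """
    n = 2 ** (levels - 1)             # cells per side at the bottom level
    width = size / n
    bottom = {}
    for x, y, value in points:
        key = (min(int(y // width), n - 1), min(int(x // width), n - 1))
        cell = bottom.setdefault(key, {"count": 0, "sum": 0.0})
        cell["count"] += 1
        cell["sum"] += value
    grid = [bottom]
    while n > 1:                      # summarize children into parent cells
        n //= 2
        parent = {}
        for (r, c), stats in grid[0].items():
            cell = parent.setdefault((r // 2, c // 2), {"count": 0, "sum": 0.0})
            cell["count"] += stats["count"]
            cell["sum"] += stats["sum"]
        grid.insert(0, parent)
    return grid


# Hypothetical data: three valued points in a 4 x 4 area, three levels.
grid = build_grid([(0.5, 0.5, 10), (1.5, 0.5, 20), (3.5, 3.5, 30)], size=4, levels=3)
```

Because parent summaries are derived from child summaries, a query can start at the top level and descend only into cells whose statistics look relevant, without touching the raw points.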
STING Algorithm
Spatial Rules
◼ Characteristic Rule
The average family income in Dallas is $50,000.
◼ Discriminant Rule
The average family income in Dallas is $50,000,
while in Plano the average income is $75,000.
◼ Association Rule
The average family income in Dallas for families
living near White Rock Lake is $100,000.
Spatial Association Rules
◼ Either antecedent or consequent must
contain spatial predicates.
◼ View underlying database as set of
spatial objects.
◼ May create using a type of progressive
refinement
Spatial Classification
◼ Partition spatial objects
◼ May use nonspatial attributes and/or
spatial attributes
◼ Generalization and progressive
refinement may be used.
Spatial Clustering
◼ Detect clusters of irregular shapes
◼ Use of centroids and simple distance
approaches may not work well.
◼ Clusters should be independent of order
of input.
CLARANS Extensions
◼ Remove main memory assumption of
CLARANS.
◼ Use spatial index techniques.
◼ Use sampling and R*-tree to identify
central objects.
◼ Change cost calculations by reducing
the number of objects examined.
◼ Voronoi Diagram
Voronoi
SD(CLARANS)
◼ Spatial Dominant
◼ First clusters spatial components using
CLARANS
◼ Then iteratively replaces medoids, but
limits number of pairs to be searched.
◼ Uses generalization
◼ Uses a learning tool to derive a description
of the cluster.
SD(CLARANS) Algorithm
Aggregate Proximity
◼ Aggregate Proximity – measure of how
close a cluster is to a feature.
◼ Aggregate proximity relationship finds the
k closest features to a cluster.
◼ CRH Algorithm – uses different shapes:
– Encompassing Circle
– Isothetic Rectangle
– Convex Hull
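The three shapes can be computed for a cluster of points as sketched below. This shows only the shapes, not how the CRH algorithm combines them. The circle is an enclosing circle centered on the centroid, which is not necessarily the smallest one, and the hull uses Andrew's monotone chain method; both choices and all function names are assumptions.

```python
import math


def isothetic_rectangle(points):
    """Axis-parallel bounding rectangle (xmin, ymin, xmax, ymax)."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def encompassing_circle(points):
    """An enclosing (not necessarily minimum) circle centered on the centroid."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    radius = max(math.dist((cx, cy), p) for p in points)
    return (cx, cy), radius


def convex_hull(points):
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]             # last point starts the other half

    return half(pts) + half(reversed(pts))


# Hypothetical cluster: four corner points and one interior point.
cluster = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)]
```

The shapes trade cost for tightness: the rectangle and circle are cheap to compute and test against, while the convex hull follows the cluster more closely.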