Spatial and Web Mining

Uploaded by

samanthaargent21

ADVANCED DATA MINING

WEB MINING

Web Mining Issues


◼ Size
– >350 million pages (1999)
– Grows at about 1 million pages a day
– Google indexes 3 billion documents
◼ Diverse types of data

Web Data
◼ Web pages
◼ Intra-page structures
◼ Inter-page structures
◼ Usage data
◼ Supplemental data
– Profiles
– Registration information
– Cookies

Web Mining Taxonomy

© Prentice Hall

Web Content Mining
◼ Extends work of basic search engines
◼ Search Engines
– IR application
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis

Crawlers
◼ Robot (spider) traverses the hypertext
structure of the Web.
◼ Collect information from visited pages
◼ Used to construct indexes for search engines
◼ Traditional Crawler – visits entire Web (?)
and replaces index
◼ Periodic Crawler – visits portions of the Web
and updates subset of index
◼ Incremental Crawler – selectively searches
the Web and incrementally modifies index
◼ Focused Crawler – visits pages related to a
particular subject

Focused Crawler
◼ Only visit links from a page if that page
is determined to be relevant.
◼ Classifier is static after learning phase.
◼ Components:
– Classifier which assigns relevance score to
each page based on crawl topic.
– Distiller to identify hub pages.
– Crawler visits pages based on classifier
and distiller scores.

Focused Crawler
◼ Classifier relates documents to topics
◼ Classifier also determines how useful
outgoing links are
◼ Hub pages contain links to many
relevant pages. They must be visited even if
their relevance score is not high.
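
The components listed above can be sketched on a toy in-memory Web. This is only an illustration under stated assumptions: `TOY_WEB`, `relevance`, `hub_score`, and the 0.5 threshold are all hypothetical, the "classifier" is a simple term match, and the "distiller" peeks at linked pages, which a real crawler could not do before fetching them.

```python
import heapq

# Hypothetical toy Web: page -> (text, outgoing links). Not real data.
TOY_WEB = {
    "seed": ("data mining overview", ["hub", "sports"]),
    "hub": ("links", ["mining1", "mining2"]),
    "mining1": ("web mining and data mining", []),
    "mining2": ("spatial data mining", []),
    "sports": ("football scores", ["sports2"]),
    "sports2": ("more football", []),
}

def relevance(text, topic_terms):
    """Stand-in classifier: fraction of topic terms present in the text."""
    words = set(text.split())
    return sum(t in words for t in topic_terms) / len(topic_terms)

def hub_score(page, web, topic_terms):
    """Stand-in distiller: fraction of outgoing links that reach relevant pages."""
    links = web[page][1]
    if not links:
        return 0.0
    hits = sum(relevance(web[l][0], topic_terms) > 0 for l in links)
    return hits / len(links)

def focused_crawl(web, seed, topic_terms, threshold=0.5):
    """Visit pages best-first; follow links only from relevant or hub pages."""
    frontier = [(-1.0, seed)]          # max-priority queue via negated scores
    seen = {seed}
    visited = []
    while frontier:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        text, links = web[page]
        score = max(relevance(text, topic_terms),
                    hub_score(page, web, topic_terms))
        if score < threshold:
            continue                   # irrelevant page: do not follow its links
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited

visited = focused_crawl(TOY_WEB, "seed", ["data", "mining"])
```

Here the page `hub` has no topic terms of its own but is still expanded because of its distiller score, while the links of `sports` are never followed.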

Focused Crawler

Context Focused Crawler


◼ Context Graph:
– Context graph created for each seed document.
– Root is the seed document.
– Nodes at each level show documents with links
to documents at next higher level.
– Updated during the crawl itself.
◼ Approach:
1. Construct context graph and classifiers using
seed documents as training data.
2. Perform crawling using classifiers and context
graph created.
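
A minimal sketch of the layering described above, assuming the link structure is already known as a dictionary (`links` and `build_context_graph` are hypothetical names). Level 0 holds the seed document, and each further level holds documents that link to a document one level closer to the seed. Training the per-level classifiers is not shown.

```python
def build_context_graph(links, seed, depth):
    """Return a list of levels; level i holds pages i backlink hops from seed."""
    # Invert the link structure: page -> set of pages that link to it.
    backlinks = {}
    for src, targets in links.items():
        for target in targets:
            backlinks.setdefault(target, set()).add(src)

    levels = [{seed}]
    seen = {seed}
    for _ in range(depth):
        next_level = set()
        for page in levels[-1]:
            for src in backlinks.get(page, ()):
                if src not in seen:
                    next_level.add(src)
        if not next_level:
            break
        seen |= next_level
        levels.append(next_level)
    return levels

# Hypothetical link data: page -> pages it links to.
links = {"a": ["seed"], "b": ["a"], "c": ["a", "seed"], "d": ["x"]}
levels = build_context_graph(links, "seed", depth=3)
```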

Context Graph


Virtual Web View


◼ Multiple Layered DataBase (MLDB) built on top
of the Web.
◼ Each layer of the database is more generalized
(and smaller) and centralized than the one
beneath it.
◼ Upper layers of MLDB are structured and can be
accessed with SQL type queries.
◼ Translation tools convert Web documents to XML.
◼ Extraction tools extract desired information to
place in first layer of MLDB.
◼ Higher levels contain more summarized data
obtained through generalizations of the lower
levels.

Personalization
◼ Web access or contents tuned to better fit the
desires of each user.
◼ Manual techniques identify user’s preferences
based on profiles or demographics.
◼ Collaborative filtering identifies preferences
based on ratings from similar users.
◼ Content based filtering retrieves pages
based on similarity between pages and user
profiles.


Web Structure Mining


◼ Mine structure (links, graph) of the Web
◼ Techniques
– PageRank
– CLEVER
◼ Create a model of the Web organization.
◼ May be combined with content mining to
more effectively retrieve important pages.

PageRank
◼ Used by Google
◼ Prioritize pages returned from search by
looking at Web structure.
◼ Importance of page is calculated based
on number of pages which point to it –
Backlinks.
◼ Weighting is used to provide more
importance to backlinks coming from
important pages.


PageRank (cont’d)
◼ PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)
– PR(i): PageRank for a page i which points
to target page p.
– Ni: number of links coming out of page i
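
The formula above can be evaluated by repeated substitution. The sketch below assumes that c acts as a normalization constant (chosen each round so the ranks sum to 1), that every page has at least one outgoing link, and that every link target is itself a key of `links`. The function name and the three-page graph are hypothetical. This is the simplified formula from the slide, without a damping factor.

```python
def pagerank(links, iterations=50):
    """Iterate PR(p) = c * sum(PR(i) / N_i) over pages i that link to p."""
    pages = sorted(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for i, outgoing in links.items():
            share = pr[i] / len(outgoing)   # PR(i) / N_i
            for p in outgoing:
                new[p] += share
        total = sum(new.values())
        pr = {p: v / total for p, v in new.items()}  # c = 1 / total
    return pr

# Hypothetical graph: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Page B receives only half of A's rank, so it ends up with the lowest score, while C collects rank from both A and B.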

CLEVER
◼ Identify authoritative and hub pages.
◼ Authoritative Pages:
– Highly important pages.
– Best source for requested information.
◼ Hub Pages:
– Contain links to highly important pages.


HITS
◼ Hyperlink-Induced Topic Search
◼ Based on a set of keywords, find set of
relevant pages – R.
◼ Identify hub and authority pages for these.
– Expand R to a base set, B, of pages linked to or
from R.
– Calculate weights for authorities and hubs.
◼ Pages with highest ranks in R are returned.
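
The weight calculation can be sketched as the usual mutual-reinforcement iteration: a page's authority weight is the sum of the hub weights of pages pointing to it, and its hub weight is the sum of the authority weights of pages it points to. The code assumes the base set B has already been built and is given as a link dictionary. The names `hits`, `normalize`, and `BASE` are hypothetical.

```python
import math

def normalize(scores):
    """Scale scores to unit Euclidean length (leave all-zero scores alone)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}

def hits(links, iterations=20):
    """Return (hub, authority) weights for the pages in a base set."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub weights of pages that link to the page.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for target in targets:
                auth[target] += hub[src]
        auth = normalize(auth)
        # Hub: sum of authority weights of the pages it links to.
        hub = normalize({p: sum(auth[t] for t in links.get(p, ()))
                         for p in pages})
    return hub, auth

# Hypothetical base set: page -> pages it links to.
BASE = {"h1": ["a1", "a2"], "h2": ["a1"]}
hub, auth = hits(BASE)
```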

HITS Algorithm


Web Usage Mining



Web Usage Mining Applications
◼ Personalization
◼ Improve structure of a site’s Web pages
◼ Aid in caching and prediction of future
page references
◼ Improve design of individual pages
◼ Improve effectiveness of e-commerce
(sales and advertising)


Web Usage Mining Activities


◼ Preprocessing Web log
– Cleanse
– Remove extraneous information
– Sessionize
Session: Sequence of pages referenced by one user at a sitting.
◼ Pattern Discovery
– Count patterns that occur in sessions
– A pattern is a sequence of page references in a session.
– Similar to association rules
» Transaction: session
» Itemset: pattern (or subset)
» Order is important
◼ Pattern Analysis
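
A rough sketch of the sessionize and pattern-counting steps, assuming the log has already been cleansed into `(user, timestamp_seconds, page)` records. The 30-minute timeout, the record layout, and the function names are assumptions for illustration. Patterns here are contiguous, ordered page sequences, and each pattern is counted at most once per session.

```python
from collections import Counter

def sessionize(log, timeout=1800):
    """Split each user's page references into sessions on long gaps."""
    sessions = []
    current = {}  # user -> (time of last reference, that user's open session)
    for user, ts, page in sorted(log, key=lambda r: (r[0], r[1])):
        if user in current and ts - current[user][0] <= timeout:
            session = current[user][1]
            session.append(page)
        else:
            session = [page]
            sessions.append(session)
        current[user] = (ts, session)
    return sessions

def count_patterns(sessions, length=2):
    """Count sessions that contain each ordered page sequence of given length."""
    counts = Counter()
    for s in sessions:
        patterns = {tuple(s[i:i + length]) for i in range(len(s) - length + 1)}
        counts.update(patterns)
    return counts

# Hypothetical cleansed log records.
log = [("u1", 0, "A"), ("u1", 60, "B"), ("u1", 120, "C"),
       ("u2", 10, "A"), ("u2", 50, "B"),
       ("u1", 5000, "A"), ("u1", 5060, "C")]
sessions = sessionize(log)
support = count_patterns(sessions)
```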

Web Usage Mining Issues
◼ Identification of exact user not possible.
◼ Exact sequence of pages referenced by
a user not possible due to caching.
◼ Session not well defined
◼ Security, privacy, and legal issues


Spatial Mining

Spatial Object
◼ Contains both spatial and nonspatial
attributes.
◼ Must have a location-type attribute:
– Latitude/longitude
– Zip code
– Street address
◼ May retrieve object using either (or
both) spatial or nonspatial attributes.

Spatial Data Mining Applications


◼ Geology
◼ GIS Systems
◼ Environmental Science
◼ Agriculture
◼ Medicine
◼ Robotics
◼ May involve both spatial and temporal
aspects

Spatial Queries
◼ Spatial selection may involve specialized
selection comparison operations:
– Near
– North, South, East, West
– Contained in
– Overlap/intersect
◼ Region (Range) Query – find objects that
intersect a given region.
◼ Nearest Neighbor Query – find the object closest to
an identified object.
◼ Distance Scan – find object within a certain
distance of an identified object where distance is
made increasingly larger.
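
The three query types can be illustrated with brute-force scans over point objects. This is a sketch only: the `OBJECTS` data and function names are hypothetical, objects are reduced to single points, and a real system would use a spatial index rather than a linear scan.

```python
import math

# Hypothetical objects: name -> (x, y) location.
OBJECTS = {"school": (2, 3), "park": (5, 5), "mall": (9, 1), "library": (3, 4)}

def range_query(objects, xmin, ymin, xmax, ymax):
    """Region (range) query: objects located inside the given rectangle."""
    return sorted(n for n, (x, y) in objects.items()
                  if xmin <= x <= xmax and ymin <= y <= ymax)

def nearest_neighbor(objects, name):
    """Nearest neighbor query: the other object closest to the named one."""
    x0, y0 = objects[name]
    others = (n for n in objects if n != name)
    return min(others, key=lambda n: math.hypot(objects[n][0] - x0,
                                                objects[n][1] - y0))

def distance_scan(objects, name, radius):
    """Objects within `radius` of the named object; call with growing radii."""
    x0, y0 = objects[name]
    return sorted(n for n, (x, y) in objects.items()
                  if n != name and math.hypot(x - x0, y - y0) <= radius)

in_region = range_query(OBJECTS, 0, 0, 4, 4)
closest = nearest_neighbor(OBJECTS, "school")
within_4 = distance_scan(OBJECTS, "school", 4)
```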

Spatial Data Structures


◼ Data structures designed specifically to store or
index spatial data.
◼ Often based on B-tree or Binary Search Tree
◼ Cluster data on disk based on geographic
location.
◼ May represent complex spatial structure by
placing the spatial object in a containing structure
of a specific geographic shape.
◼ Techniques:
– Quad Tree
– R-Tree
– k-D Tree

MBR
◼ Minimum Bounding Rectangle
◼ Smallest rectangle that completely
contains the object
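
For an object given as a list of vertices, the axis-aligned MBR follows from the coordinate extremes. The sketch below assumes rectangles are written as `(xmin, ymin, xmax, ymax)` tuples. Both function names are hypothetical.

```python
def mbr(vertices):
    """Smallest axis-aligned rectangle containing all (x, y) vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs), max(ys))

def mbrs_intersect(a, b):
    """True if two MBRs share at least one point (touching counts)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

triangle = [(1, 1), (4, 2), (2, 5)]
box = mbr(triangle)
```

Because an MBR test is cheap, it is commonly used as a first filter: if two MBRs do not intersect, the objects inside them cannot intersect either.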


MBR Examples

Quad Tree
◼ Hierarchical decomposition of the space
into quadrants (MBRs)
◼ Each level in the tree represents the
object as the set of quadrants which
contain any portion of the object.
◼ Each level is a more exact representation
of the object.
◼ The number of levels is determined by
the degree of accuracy desired.
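
A small sketch of this decomposition, assuming the object is itself an axis-aligned rectangle `(xmin, ymin, xmax, ymax)` inside a square space. The function names are hypothetical. At each level the object is represented by the quadrants that contain any portion of it, and the covered area shrinks toward the object's true area as the level grows.

```python
def overlaps(a, b):
    """True if rectangles a and b share positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def covering_quadrants(obj, cell, levels):
    """Quadrants `levels` splits below `cell` that contain part of obj."""
    if not overlaps(obj, cell):
        return []
    if levels == 0:
        return [cell]
    xmin, ymin, xmax, ymax = cell
    xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
    children = [(xmin, ymin, xmid, ymid), (xmid, ymin, xmax, ymid),
                (xmin, ymid, xmid, ymax), (xmid, ymid, xmax, ymax)]
    result = []
    for child in children:
        result.extend(covering_quadrants(obj, child, levels - 1))
    return result

def covered_area(cells):
    return sum((c[2] - c[0]) * (c[3] - c[1]) for c in cells)

space = (0, 0, 8, 8)
obj = (1, 1, 3, 3)          # true area is 4
areas = [covered_area(covering_quadrants(obj, space, k)) for k in range(4)]
```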

Quad Tree Example

R-Tree
◼ As with Quad Tree the region is divided
into successively smaller rectangles
(MBRs).
◼ Rectangles need not be of the same
size or number at each level.
◼ Rectangles may actually overlap.
◼ Lowest level cell has only one object.
◼ Tree maintenance algorithms similar to
those for B-trees.

R-Tree Example

k-D Tree
◼ Designed for multi-attribute data, not
necessarily spatial
◼ Variation of binary search tree
◼ Each level is used to index one of the
dimensions of the spatial object.
◼ Lowest level cell has only one object
◼ Divisions not based on MBRs but
successive divisions of the dimension
range.
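
One common way to build such a tree is to split on the median of one dimension per level, cycling through the dimensions. The sketch below takes that approach as an assumption. The slide does not prescribe a split rule, and the dictionary node layout and function names are hypothetical.

```python
def build_kd(points, depth=0):
    """Build a k-D tree; each level splits on one dimension in turn."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {
        "point": ordered[mid],
        "axis": axis,
        "left": build_kd(ordered[:mid], depth + 1),
        "right": build_kd(ordered[mid + 1:], depth + 1),
    }

def contains(node, target):
    """Search for an exact point, descending by the split dimension."""
    if node is None:
        return False
    if node["point"] == target:
        return True
    axis, split = node["axis"], node["point"][node["axis"]]
    if target[axis] < split:
        return contains(node["left"], target)
    if target[axis] > split:
        return contains(node["right"], target)
    # Equal on the split dimension: the point may be on either side.
    return contains(node["left"], target) or contains(node["right"], target)

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kd(points)
```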

k-D Tree Example

Topological Relationships
◼ Disjoint
◼ Overlaps or Intersects
◼ Equals
◼ Covered by or inside or contained in
◼ Covers or contains
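
For axis-aligned rectangles, these relationships reduce to coordinate comparisons. The sketch below is a simplification that assumes both objects are rectangles `(xmin, ymin, xmax, ymax)` and treats touching boundaries as overlapping. The function name and the returned labels are hypothetical.

```python
def relationship(a, b):
    """Classify the topological relationship of rectangle a to rectangle b."""
    if a == b:
        return "equals"
    if a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1]:
        return "disjoint"
    if a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2] and a[3] <= b[3]:
        return "inside"      # a is covered by / contained in b
    if b[0] >= a[0] and b[1] >= a[1] and b[2] <= a[2] and b[3] <= a[3]:
        return "contains"    # a covers b
    return "overlaps"

city = (0, 0, 10, 10)
park = (2, 2, 4, 4)
```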


Distance Between Objects


◼ Euclidean
◼ Manhattan
◼ Extensions:

Spatial Data Dominant Algorithm


STING
◼ STatistical Information Grid-based
◼ Hierarchical technique to divide area
into rectangular cells
◼ Grid data structure contains summary
information about each cell
◼ Hierarchical clustering
◼ Similar to quad tree
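
A rough sketch of the grid structure only, not the STING query algorithm. It assumes a square area, four children per cell, and count and sum as the per-cell summary (the mean is then sum / n). The function name, the `(x, y, value)` point layout, and the `(row, col)` cell keys are all hypothetical.

```python
def build_grid_hierarchy(points, size, levels):
    """Return grids from top (one cell) to bottom; each maps (row, col)
    to a summary {"n": count, "sum": total of the attribute values}."""
    cells_per_side = 2 ** (levels - 1)
    cell = size / cells_per_side

    # Bottom level: summarize the points that fall into each cell.
    bottom = {}
    for x, y, value in points:
        key = (int(y // cell), int(x // cell))
        summary = bottom.setdefault(key, {"n": 0, "sum": 0.0})
        summary["n"] += 1
        summary["sum"] += value

    # Higher levels: each parent summary is computed from its children.
    grids = [bottom]
    for _ in range(levels - 1):
        parent = {}
        for (row, col), child in grids[-1].items():
            summary = parent.setdefault((row // 2, col // 2),
                                        {"n": 0, "sum": 0.0})
            summary["n"] += child["n"]
            summary["sum"] += child["sum"]
        grids.append(parent)
    grids.reverse()
    return grids

# Hypothetical points (x, y, value) in a 4 x 4 area, three levels.
points = [(0.5, 0.5, 10), (1.5, 0.5, 20), (3.5, 3.5, 30)]
grids = build_grid_hierarchy(points, size=4, levels=3)
```

Because parents are derived from child summaries, a query can start at the top level and descend only into cells whose summaries look relevant.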

STING


STING Build Algorithm

STING Algorithm


Spatial Rules
◼ Characteristic Rule
The average family income in Dallas is $50,000.
◼ Discriminant Rule
The average family income in Dallas is $50,000,
while in Plano the average income is $75,000.
◼ Association Rule
The average family income in Dallas for families
living near White Rock Lake is $100,000.

Spatial Association Rules
◼ Either antecedent or consequent must
contain spatial predicates.
◼ View underlying database as set of
spatial objects.
◼ May create using a type of progressive
refinement


Spatial Association Rule Algorithm

Spatial Classification
◼ Partition spatial objects
◼ May use nonspatial attributes and/or
spatial attributes
◼ Generalization and progressive
refinement may be used.


Spatial Clustering
◼ Detect clusters of irregular shapes
◼ Use of centroids and simple distance
approaches may not work well.
◼ Clusters should be independent of order
of input.

CLARANS Extensions
◼ Remove main memory assumption of
CLARANS.
◼ Use spatial index techniques.
◼ Use sampling and R*-tree to identify
central objects.
◼ Change cost calculations by reducing
the number of objects examined.
◼ Voronoi Diagram

Voronoi

SD(CLARANS)
◼ Spatial Dominant
◼ First clusters spatial components using
CLARANS
◼ Then iteratively replaces medoids, but
limits number of pairs to be searched.
◼ Uses generalization
◼ Uses a learning tool to derive a description
of each cluster.

SD(CLARANS) Algorithm

Aggregate Proximity
◼ Aggregate Proximity – measure of how
close a cluster is to a feature.
◼ Aggregate proximity relationship finds the
k closest features to a cluster.
◼ CRH Algorithm – uses different shapes:
– Encompassing Circle
– Isothetic Rectangle
– Convex Hull