Module 6 : Spatial and Web Mining
Priya R L
Faculty Incharge for CSC 504
Department of Computer Engineering
VES Institute of Technology, Mumbai
References: Text Book - 4, Data Mining: Introductory and Advanced Topics by Margaret H. Dunham
CSC 504 : Data Warehousing & Mining (CBCGS - Autonomy)
Agenda
● Introduction to Spatial Mining
● Spatial Data
● Spatial Vs. Classical Data Mining
● Spatial Data Structures
● Introduction to Web Mining
● Web Content Mining
● Web Structure Mining
● Web Usage Mining
● Applications of Web Mining
Spatial Objects
Spatial Databases (GIS)
Example : Spatial Data in GIS
Spatial Queries
• Spatial selection may involve specialized selection comparison
operations:
• Near
• North, South, East, West
• Contained in
• Overlap/intersect
• Region (Range) Query – find objects that intersect a given region.
• Nearest Neighbor Query – find the object closest to an identified object.
• Distance Scan – find objects within a certain distance of an identified
object, where the distance is made increasingly larger.
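The three query types above can be sketched on point objects. This is only an illustration under simple assumptions: objects are 2-D points, distance is Euclidean, and the names `POINTS`, `region_query`, `nearest_neighbor` and `distance_scan` are hypothetical, not taken from any GIS product.

```python
import math

# Hypothetical point objects: name -> (x, y) coordinates.
POINTS = {
    "school": (2.0, 3.0),
    "hospital": (5.0, 4.0),
    "park": (9.0, 6.0),
    "station": (4.0, 8.0),
}

def distance(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def region_query(points, xmin, ymin, xmax, ymax):
    """Region (range) query: names of points inside the query rectangle."""
    return sorted(n for n, (x, y) in points.items()
                  if xmin <= x <= xmax and ymin <= y <= ymax)

def nearest_neighbor(points, target):
    """Nearest neighbor query: the other object closest to `target`."""
    others = (n for n in points if n != target)
    return min(others, key=lambda n: distance(points[n], points[target]))

def distance_scan(points, target, radii):
    """Distance scan: objects within each of several increasing radii."""
    return {r: sorted(n for n in points
                      if n != target
                      and distance(points[n], points[target]) <= r)
            for r in radii}

print(region_query(POINTS, 0, 0, 6, 5))
print(nearest_neighbor(POINTS, "hospital"))
print(distance_scan(POINTS, "hospital", [3.5, 4.2, 5.0]))
```

A linear scan like this is fine for a handful of objects; the spatial data structures later in this module exist to avoid scanning every object.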
Spatial Data Mining Applications
• Geology
• GIS Systems
• Environmental Science
• Agriculture
• Medicine
• Robotics
• May involve both spatial and temporal aspects
Spatial Vs. Classical Data Mining
Spatial Data Structures – Primary Data
Spatial Data Structures – Secondary Data
Examples of Spatial Databases
Spatial Data Structures
• Data structures designed specifically to store or index spatial data.
• Often based on B-tree or Binary Search Tree
• Cluster data on disk based on geographic location.
• May represent complex spatial structure by placing the spatial object in a
containing structure of a specific geographic shape.
• Techniques:
• Quad Tree
• R-Tree
• k-D Tree
Spatial Data Structures
• Minimum Bounding Rectangle (MBR)
• Smallest rectangle that completely contains the object
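As a small illustration of the definition above, the MBR of a polygon can be computed from the minimum and maximum of its vertex coordinates. The sketch assumes axis-aligned rectangles written as `(xmin, ymin, xmax, ymax)`; the function names and the sample shapes are hypothetical.

```python
def mbr(vertices):
    """Minimum bounding rectangle of a list of (x, y) vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs), max(ys))

def mbrs_intersect(a, b):
    """True if two MBRs overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

triangle = [(1, 1), (4, 2), (2, 5)]
lake = [(3, 4), (6, 4), (6, 7), (3, 7)]

print(mbr(triangle))
print(mbrs_intersect(mbr(triangle), mbr(lake)))
```

Comparing MBRs is a cheap first filter: if two MBRs do not intersect, the objects they contain cannot intersect either, so the exact geometry test can be skipped.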
Spatial Data Structures: MBR Examples
Quad Tree
• Hierarchical decomposition of the space into quadrants (MBRs)
• Each level in the tree represents the object as the set of quadrants
which contain any portion of the object.
• Each level is a more exact representation of the object.
• The number of levels is determined by the degree of accuracy
desired.
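The level-by-level refinement described above can be sketched as follows. This is a simplified illustration, assuming a square space and a rectangular object, both written as `(xmin, ymin, xmax, ymax)`; `quad_levels` and the sample coordinates are hypothetical.

```python
def intersects(a, b):
    """True if rectangles a and b share some interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def quadrants(cell):
    """Split a cell into its four equal quadrants."""
    xmin, ymin, xmax, ymax = cell
    xm, ym = (xmin + xmax) / 2, (ymin + ymax) / 2
    return [(xmin, ymin, xm, ym), (xm, ymin, xmax, ym),
            (xmin, ym, xm, ymax), (xm, ym, xmax, ymax)]

def quad_levels(space, obj, depth):
    """For each level 1..depth, the quadrants containing any part of obj."""
    levels = []
    current = [space]
    for _ in range(depth):
        current = [q for cell in current for q in quadrants(cell)
                   if intersects(q, obj)]
        levels.append(current)
    return levels

def area(cell):
    return (cell[2] - cell[0]) * (cell[3] - cell[1])

levels = quad_levels((0, 0, 8, 8), (1, 1, 3, 5), depth=3)
for number, cells in enumerate(levels, start=1):
    print(number, len(cells), sum(area(c) for c in cells))
```

The covered area shrinks at every level, which is what "each level is a more exact representation of the object" means in practice.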
Quad Tree: Example
R Tree
• As with Quad Tree the region is divided into successively smaller
rectangles (MBRs).
• Rectangles need not be of the same size or number at each level.
• Rectangles may actually overlap.
• Lowest level cell has only one object.
• Tree maintenance algorithms similar to those for B-trees.
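A search over an R-tree can be sketched on a small hand-built tree. The sketch below only shows how a query descends through overlapping MBRs; it does not implement insertion or node splitting, and the `Node` class, `make_node`, `search` and the sample rectangles are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    mbr: tuple                       # (xmin, ymin, xmax, ymax)
    children: list = field(default_factory=list)
    obj: Optional[str] = None        # set only on lowest-level entries

def overlaps(a, b):
    """True if two rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def enclose(mbrs):
    """Smallest rectangle containing all the given rectangles."""
    return (min(m[0] for m in mbrs), min(m[1] for m in mbrs),
            max(m[2] for m in mbrs), max(m[3] for m in mbrs))

def make_node(children):
    return Node(mbr=enclose([c.mbr for c in children]), children=children)

def search(node, query):
    """Return the objects whose MBR overlaps the query rectangle."""
    if not overlaps(node.mbr, query):
        return []
    if node.obj is not None:
        return [node.obj]
    found = []
    for child in node.children:      # overlapping siblings may both match
        found.extend(search(child, query))
    return found

a = Node((1, 1, 2, 2), obj="a")
b = Node((3, 1, 5, 3), obj="b")
c = Node((4, 2, 7, 5), obj="c")
d = Node((8, 6, 9, 9), obj="d")
root = make_node([make_node([a, b]), make_node([c, d])])

print(search(root, (4.5, 2.5, 5, 3)))
```

Because the two inner rectangles overlap, this query has to descend into both subtrees, which is the cost of allowing overlapping MBRs.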
R Tree: Example
K-D Tree
• Designed for multi-attribute data, not necessarily spatial
• Variation of binary search tree
• Each level is used to index one of the dimensions of the spatial object.
• Lowest level cell has only one object
• Divisions not based on MBRs but successive divisions of the
dimension range.
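A minimal sketch of a k-D tree for 2-D points, assuming the common construction that splits on the median and cycles through the dimensions level by level. The dictionary-based node layout, `build`, `range_search` and the sample points are hypothetical choices for illustration.

```python
def build(points, depth=0):
    """Build a k-D tree; each level splits on one dimension in turn."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build(points[:mid], depth + 1),       # values <= split
        "right": build(points[mid + 1:], depth + 1),  # values >= split
    }

def range_search(node, lo, hi):
    """Return all points p with lo[i] <= p[i] <= hi[i] in every dimension."""
    if node is None:
        return []
    point, axis = node["point"], node["axis"]
    found = []
    if all(l <= c <= h for c, l, h in zip(point, lo, hi)):
        found.append(point)
    if lo[axis] <= point[axis]:      # the range reaches into the left side
        found += range_search(node["left"], lo, hi)
    if point[axis] <= hi[axis]:      # the range reaches into the right side
        found += range_search(node["right"], lo, hi)
    return found

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree["point"])
print(sorted(range_search(tree, lo=(3, 1), hi=(8, 5))))
```

The split value at each node lets the search skip whole subtrees that lie outside the query range in the current dimension.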
K-D Tree: Example
Web Mining
What is the Web?
• The Web is a collection of inter-related files on one or more
web servers
• Wealth of information: Presence everywhere
• Structure: Graph structure with links between pages
• Access: Hundreds of millions of requests per day.
Introduction to Web Mining
• Web mining is the use of data mining techniques to
automatically discover and extract information from web
documents.
• Discovering useful information from WWW and its usage
patterns
Data Mining Vs Web Mining
• Traditional Data Mining: Concept of identifying a significant pattern from
the data that gives a better outcome
• Data is structured and relational
• Well-defined tables, columns, rows, keys & Constraints
• Web Mining: Process of performing data mining on the web: extracting
web documents and discovering patterns from them.
• Semi-structured and Unstructured.
• Rich in features and patterns
Why Mine the Web?
• Enormous wealth of information on web
• Financial Information
• Book/CD/Video Stores
• Restaurant Information
• Car Prices and so on..
• Lots of data on user access pattern
• Semi-structured Web logs contain the sequence of URLs
accessed by users.
Web Data Mining Process
Why is Web Mining Hard?
• The Web is a huge collection of documents, plus
• Hyperlink information
• Access and usage information
• Web is very dynamic
• New pages are constantly being generated..
Issues are…
• Web data sets can be very large
○ Tens to hundreds of terabytes
• Cannot mine on a single server
○ Need large farms of servers
• Proper organization of hardware and software to mine multi-terabyte data
sets
• Difficulty in finding relevant information
• Extracting new knowledge from the web
Web Mining Taxonomy
Web Content Mining – Introduction
• Mining, extraction and integration of useful data, information and
knowledge from Web page content.
• Web content mining is related to, but different from, data mining and text
mining.
• Web data are mainly semi-structured and/or unstructured, while data
mining deals primarily with structured data.
Web Content Mining – Introduction
• Extends work of basic search engines
• Search Engines
• IR application
• Keyword based
• Similarity between query and document
• Crawlers
• Indexing
• Profiles
• Link analysis.
Web Content Mining Includes …?
Unstructured Web Data Mining
Unstructured Documents – Feature Extraction
• Bag of words to represent unstructured documents
○ Takes single word as feature
○ Ignores the sequence in which words occur
• Features could be
• Boolean
• Word either occurs or does not occur in a document
• Frequency based
• Frequency of the word in a document
• Variations of feature selection include
○ Removing case, punctuation, infrequent words, stop words, etc.
• Features can be reduced using different feature selection techniques:
○ Information gain, mutual information, cross entropy.
○ Stemming, which reduces words to their morphological roots.
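The bag-of-words representation above can be sketched in a few lines. This is an illustration only: the stop-word list is a tiny hypothetical sample, tokenization simply keeps lowercase alphabetic runs, and stemming is left out.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase, drop punctuation and digits, remove stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]

def bag_of_words(docs):
    """Return (vocabulary, frequency vectors, Boolean vectors)."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    freq = [[c[w] for w in vocab] for c in counts]
    boolean = [[int(c[w] > 0) for w in vocab] for c in counts]
    return vocab, freq, boolean

docs = ["The web is a graph of pages.", "Web pages link to web pages!"]
vocab, freq, boolean = bag_of_words(docs)
print(vocab)
print(freq)
print(boolean)
```

Word order is discarded: the second document would produce the same vectors however its words were arranged.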
Structured Web Data
Mining Techniques Using Agent & Database
What is Web Structure Mining?
• Web structure mining is the process of discovering structure
information from the web.
• A typical web graph consists of web pages as nodes and hyperlinks as
edges connecting two related pages.
What is Web Structure Mining?
• This type of mining can be performed either at the document
level (intra-page) or at the hyperlink level (inter-page).
• Research at the hyperlink level is called hyperlink analysis.
• Hyperlink structure can be used to retrieve useful information on the web.
• The main approaches are:
• PageRank
• CLEVER
• Hubs and Authorities - HITS
Page Rank Approach - Background
• PageRank was presented and published by Sergey Brin and Larry Page at
the Seventh International World Wide Web Conference (WWW7) in April
1998
• The aim of this algorithm is to tackle some difficulties with the content-based
ranking algorithms of early search engines, which treated web pages as plain
text documents and retrieved information without using the link
relationships between them
Page Rank Approach - Introduction
• PageRank is an algorithm used to measure the importance of website
pages using the hyperlinks between pages.
• Hyperlinks that point to a page from other pages are its in-links; hyperlinks
that point from the page to other pages are its out-links.
• PageRank is a “vote”, by all the other pages on the Web, about how
important a page is.
• A link to a page counts as a vote of support
Page Rank Approach
• Used to discover the most important pages on the web. Example: Google
• Prioritize pages returned from search by looking at web structure.
• Importance of pages is calculated based on the number of pages which point to it
(backlinks).
• Weighting is used to provide more importance to backlinks coming from important pages.
• PR(p) = (1-d) + d (PR(1)/N1 + …… + PR(n)/Nn)
○ PR(i): PageRank for a page i which points to target page p.
○ Ni: Number of links coming out of page i.
○ d: constant value between 0 and 1, known as the damping factor (0.85 in the examples that follow)
○ (1-d): baseline rank that every page receives; with this form of the formula the average
PageRank over all pages is one, provided every page has at least one out-link.
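The formula above can be applied repeatedly until the values settle. The sketch below is one straightforward way to do that, not the implementation used by any search engine; the four-page link graph and the function name `pagerank` are hypothetical, and every page is assumed to have at least one out-link.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(p) = (1 - d) + d * sum(PR(i) / Ni) over pages i linking to p.

    `links` maps each page to the list of pages it links to.
    """
    pr = {page: 1.0 for page in links}           # initial guess
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            incoming = sum(pr[src] / len(links[src])
                           for src in links if page in links[src])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical graph: D links out but nothing links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, value in sorted(ranks.items()):
    print(page, round(value, 4))
```

A page with no backlinks ends up with exactly (1 - d), and in this graph the four ranks average to one.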
Page Rank Approach
• To formulate the above ideas, Consider the Web as a directed graph G = (V, E),
• where V is the set of vertices or nodes, i.e., the set of all pages, and
• E is the set of directed edges in the graph, i.e., hyperlinks.
• Let the total number of pages on the Web be n (i.e., n = |V|).
• The PageRank score of the page i (denoted by P(i)) is defined by:
P(i) = Σ P(j) / Oj, summed over all pages j with (j, i) ∈ E
• Oj is the number of out-links of page j
Page Rank Approach
Page Rank Approach : Hubs & Authorities
• Authoritative pages
○ The authors define an authority as the best source for the requested information.
○ Highly important pages.
• Hub pages
○ Contain links to highly important (authoritative) pages.
HITS (Hyperlink Induced Topic Search)
• Iterative algorithm for mining the Web graph to identify the topic hubs and authorities.
• Algorithm:
○ Consider a matrix A with rows and columns corresponding to web pages. Aij = 1
if page i links to page j, and 0 otherwise.
○ Let a and h be vectors whose i-th components are the degree of authority
and hubbiness of the i-th page.
○ The hubbiness of a page is defined as the sum of the authorities of all the pages it links
to, i.e., h = A a.
○ The authority of a page is defined as the sum of the hubbiness of all the pages that link to it,
i.e., a = A^T h, where A^T is the transpose of A.
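The two update rules can be iterated on a small link matrix. In the sketch below the rescaling of each vector by its largest entry is an added assumption (the slide does not state a normalization) that keeps the numbers bounded; the 4-page matrix and the function name `hits` are hypothetical.

```python
def hits(A, iterations=50):
    """Alternate a = A^T h and h = A a, rescaling so the largest entry is 1."""
    n = len(A)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        # Authority of page j: sum of hubbiness of the pages i linking to j.
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        # Hubbiness of page i: sum of authority of the pages j it links to.
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        a_max = max(a) or 1.0        # avoid dividing by zero if no links
        h_max = max(h) or 1.0
        a = [x / a_max for x in a]
        h = [x / h_max for x in h]
    return a, h

# Hypothetical graph: pages 0 and 1 link to 2 and 3; page 2 links to 3.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
authority, hubbiness = hits(A)
print([round(x, 3) for x in authority])
print([round(x, 3) for x in hubbiness])
```

Page 3 comes out as the strongest authority because every other page links to it, while pages 0 and 1 are the strongest hubs.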
Page Rank Algorithm Calculation
• PageRank, or PR(A), can be calculated using a simple iterative algorithm, and
corresponds to the principal eigenvector of the normalized link matrix of the web.
• This means we can calculate a page’s PR without knowing the final value of the PR of
the other pages, which may seem strange.
• Remember each value we calculate and repeat the calculations many times until the
numbers stop changing much.
• Example with 2 pages, A and B, each linking to the other:
Page Rank Algorithm Calculation – Guess 1
• We don’t know what their PR should be to begin with, so let’s take a
guess at 1.0 and do some calculations:
• d = 0.85
• PR(A) = (1 – d) + d(PR(B)/1)
• PR(B) = (1 – d) + d(PR(A)/1)
• i.e. PR(A) = 0.15 + 0.85 * 1 = 1
• PR(B) = 0.15 + 0.85 * 1 = 1
• Hmm, the numbers aren’t changing at all! So it looks like we started
out with a lucky guess!!!
Page Rank Algorithm Calculation – Guess 2
• Let’s start the guess at 0 instead and re-calculate:
• PR(A) = 0.15 + 0.85 * 0 = 0.15
• PR(B) = 0.15 + 0.85 * 0.15 = 0.2775
• Note: we have already calculated a “next best guess” at PR(A) (0.15), so we use it
here.
• And again: PR(A) = 0.15 + 0.85 * 0.2775 = 0.385875
• PR(B) = 0.15 + 0.85 * 0.385875 = 0.47799375
• And again: PR(A) = 0.15 + 0.85 * 0.47799375 = 0.5562946875
• PR(B) = 0.15 + 0.85 * 0.5562946875 = 0.622850484375
• And so on. The numbers just keep going up. But will the numbers stop
increasing when they get to 1.0?
Page Rank Algorithm Calculation – Guess 3
• Let’s start the guess at 40 each and do a few cycles: PR(A) = 40, PR(B)
= 40
• First calculation: PR(A) = 0.15 + 0.85 * 40 = 34.15
• PR(B) = 0.15 + 0.85 * 34.15 = 29.1775
• And again: PR(A) = 0.15 + 0.85 * 29.1775 = 24.950875
• PR(B) = 0.15 + 0.85 * 24.950875 = 21.35824375
• Yup, those numbers are heading down alright! It sure looks like the numbers
will get to 1.0 and stop
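The three guesses can be checked with a short loop. The sketch assumes the same setting as the slides (two pages that link only to each other, d = 0.85) and, like the hand calculation, uses the freshly updated PR(A) when computing PR(B); the function name `iterate_two_pages` is hypothetical.

```python
def iterate_two_pages(guess, d=0.85, rounds=100):
    """Repeat PR(A) = (1 - d) + d * PR(B) and PR(B) = (1 - d) + d * PR(A)."""
    pr_a = pr_b = guess
    for _ in range(rounds):
        pr_a = (1 - d) + d * pr_b
        pr_b = (1 - d) + d * pr_a    # uses the value of PR(A) just computed
    return pr_a, pr_b

for start in (1.0, 0.0, 40.0):
    pr_a, pr_b = iterate_two_pages(start)
    print(start, round(pr_a, 6), round(pr_b, 6))
```

Whatever the starting guess, both values settle at 1.0, because each cycle shrinks the distance to 1.0 by a constant factor.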
Page Rank Algorithm Calculation – Example
Page Rank Algorithm Calculation – Example 1
So, for Page D, no backlinks means
the equation looks like this:
PR(D) = (1-d) + d * (0) = 0.15
Page Rank Algorithm Calculation – Example 2
Web Structure Mining Applications
• Information retrieval in social networks.
• To find out the relevance of each web page.
• Measuring the completeness of Web sites.
• Used in search engines to find out the relevant information
Web Usage Mining
Why analyze Website Usage?
Knowledge about how visitors use a website could
• Provide guidelines for website reorganization; help prevent disorientation
• Help designers place important information where the visitors look for it
• Support pre-fetching and caching of web pages
• Provide an adaptive website (personalization)
• Questions which could be answered:
• What are the differences in usage and access patterns among users?
• Which user behaviors change over time?
• How do usage patterns change with quality of service (slow/fast)?
• What is the distribution of network traffic over time?
Web Usage Mining Applications
• Personalization
• Improve structure of a site’s Web pages
• Aid in caching and prediction of future page references
• Improve design of individual pages
• Improve effectiveness of e-commerce (sales and advertising)
Web Usage Mining Issues
• Exact identification of the user is not possible.
• The exact sequence of pages referenced by a user cannot be determined due to
caching.
• Sessions are not well defined
• Security, privacy, and legal issues
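Because sessions are not well defined, usage mining usually has to approximate them. One common heuristic, sketched below, treats requests from the same client as one session until a gap longer than a timeout occurs. The 30-minute timeout, the use of the IP address as a stand-in for the user, the log tuple layout and the function name `sessionize` are all assumptions made for illustration.

```python
from collections import defaultdict

TIMEOUT = 30 * 60    # assumed session timeout, in seconds

def sessionize(entries, timeout=TIMEOUT):
    """Group (ip, timestamp, url) log entries into per-client sessions.

    A new session starts when the gap between two consecutive requests
    from the same IP exceeds `timeout`.
    """
    by_ip = defaultdict(list)
    for ip, ts, url in sorted(entries, key=lambda e: (e[0], e[1])):
        by_ip[ip].append((ts, url))

    sessions = []
    for ip, hits in by_ip.items():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((ip, current))
                current = []
            current.append(url)
        sessions.append((ip, current))
    return sessions

# Hypothetical log entries: (client IP, seconds since start, requested URL).
log = [
    ("10.0.0.1", 0, "/"),
    ("10.0.0.1", 60, "/a"),
    ("10.0.0.2", 30, "/"),
    ("10.0.0.1", 4000, "/b"),
]
for ip, urls in sessionize(log):
    print(ip, urls)
```

This remains an approximation: several users behind one proxy share an IP address, and pages served from a cache never appear in the server log.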
Web Usage Mining Activities – Three Phases
Web Usage Mining – Data Preparation
Web Usage Mining – Pattern Discovery Tasks
Web Usage Mining – Pattern Analysis Tasks