Module 6 : Spatial and Web Mining

Priya R L
Faculty Incharge for CSC 504
Department of Computer Engineering
VES Institute of Technology, Mumbai

References: Text Book - 4
Data Mining: Introductory and Advanced Topics by Margaret H. Dunham

CSC 504 : Data Warehousing & Mining (CBCGS - Autonomy)


Agenda
● Introduction to Spatial Mining
● Spatial Data
● Spatial Vs. Classical Data Mining
● Spatial Data Structures
● Introduction to Web Mining
● Web Content Mining
● Web Structure Mining
● Web Usage Mining
● Applications of Web Mining

Spatial Objects

Spatial Databases (GIS)

Example : Spatial Data in GIS

Spatial Queries
• Spatial selection may involve specialized selection comparison
operations:
• Near
• North, South, East, West
• Contained in
• Overlap/intersect
• Region (Range) Query – find objects that intersect a given region.
• Nearest Neighbor Query – find the object closest to an identified object.
• Distance Scan – find objects within a certain distance of an identified
object, where the distance is made increasingly larger.
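
The three query types above can be sketched on point data as follows. This is a minimal illustration, not a GIS implementation: the `points` dictionary, its names and coordinates, and the three function names are all hypothetical, and every object is assumed to be a single (x, y) point with Euclidean distance.

```python
import math

# Hypothetical sample data: object name -> (x, y) location.
points = {"school": (2.0, 3.0), "park": (5.0, 1.0),
          "mall": (9.0, 6.0), "lake": (4.0, 7.0)}

def region_query(points, xmin, ymin, xmax, ymax):
    """Region (range) query: names of points inside the given rectangle."""
    return sorted(name for name, (x, y) in points.items()
                  if xmin <= x <= xmax and ymin <= y <= ymax)

def nearest_neighbor(points, target):
    """Nearest neighbor query: the object closest to the named target."""
    others = (name for name in points if name != target)
    return min(others, key=lambda name: math.dist(points[target], points[name]))

def distance_scan(points, target, radii):
    """Distance scan: objects within each radius, for increasingly larger radii."""
    for r in sorted(radii):
        within = sorted(name for name in points
                        if name != target
                        and math.dist(points[target], points[name]) <= r)
        yield r, within

print(region_query(points, 0, 0, 5, 5))
print(nearest_neighbor(points, "school"))
print(list(distance_scan(points, "school", [4, 5])))
```

A linear scan like this is only workable for small data sets; the spatial data structures later in the module exist to avoid examining every object.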

Spatial Data Mining Applications
• Geology
• GIS Systems
• Environmental Science
• Agriculture
• Medicine
• Robotics
• May involve both spatial and temporal aspects

Spatial Vs. Classical Data Mining

Spatial Data Structures – Primary Data

Spatial Data Structures – Secondary Data

Examples of Spatial Databases

Spatial Data Structures
• Data structures designed specifically to store or index spatial data.
• Often based on a B-tree or Binary Search Tree
• Cluster data on disk based on geographic location.
• May represent a complex spatial structure by placing the spatial object in a
containing structure of a specific geographic shape.
• Techniques:
• Quad Tree
• R-Tree
• k-D Tree

Spatial Data Structures
• Minimum Bounding Rectangle (MBR)
• Smallest rectangle that completely contains the object
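
As a minimal sketch of the MBR idea, the code below computes the bounding rectangle of a polygon and tests whether two MBRs intersect. It assumes objects are given as lists of (x, y) vertices and that the rectangle is axis-aligned; the function names and the two sample shapes are hypothetical.

```python
def mbr(vertices):
    """Return the MBR of a list of (x, y) vertices as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs), max(ys))

def mbrs_intersect(a, b):
    """True if two MBRs share at least one point (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Hypothetical sample objects.
triangle = [(1, 1), (4, 2), (2, 5)]
square = [(3, 3), (6, 3), (6, 6), (3, 6)]

print(mbr(triangle), mbr(square))
print(mbrs_intersect(mbr(triangle), mbr(square)))
```

An MBR test is only a cheap filter: two objects whose MBRs intersect may still not intersect themselves, so an exact geometric test normally follows.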

Spatial Data Structures: MBR Examples

Quad Tree

• Hierarchical decomposition of the space into quadrants (MBRs)
• Each level in the tree represents the object as the set of quadrants
which contain any portion of the object.
• Each level is a more exact representation of the object.
• The number of levels is determined by the degree of accuracy
desired.
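
The decomposition can be sketched as a recursive split into four quadrants, keeping only the quadrants that contain some portion of the object. This is an illustrative sketch under stated assumptions: the object is approximated by a rectangle, the space is a hypothetical 8 x 8 square, and the function names are made up for this example.

```python
def quadrants(cell):
    """Split a cell (xmin, ymin, xmax, ymax) into its four equal quadrants."""
    xmin, ymin, xmax, ymax = cell
    xm, ym = (xmin + xmax) / 2, (ymin + ymax) / 2
    return [(xmin, ymin, xm, ym), (xm, ymin, xmax, ym),
            (xmin, ym, xm, ymax), (xm, ym, xmax, ymax)]

def overlaps(a, b):
    """True if rectangles a and b share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def cover(obj, cell, levels):
    """Return the quadrants `levels` splits deep that contain any portion of obj."""
    if not overlaps(obj, cell):
        return []
    if levels == 0:
        return [cell]
    result = []
    for q in quadrants(cell):
        result.extend(cover(obj, q, levels - 1))
    return result

space = (0, 0, 8, 8)   # hypothetical space being decomposed
obj = (1, 1, 3, 2)     # object, approximated here by a rectangle

for level in (1, 2, 3):
    cells = cover(obj, space, level)
    area = sum((c[2] - c[0]) * (c[3] - c[1]) for c in cells)
    print(level, len(cells), area)
```

The covered area shrinks from 16 to 8 to 2 as the level increases, which is the sense in which each level is a more exact representation of the object.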

Quad Tree: Example

R Tree

• As with the Quad Tree, the region is divided into successively smaller
rectangles (MBRs).
• Rectangles need not be of the same size or number at each level.
• Rectangles may actually overlap.
• Lowest level cell has only one object.
• Tree maintenance algorithms are similar to those for B-trees.
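
A search over such a tree can be sketched as below. This is a deliberately small illustration: the tree is built by hand, insertion and node splitting (the B-tree-like maintenance) are omitted, and the `Node` class, the object names and the rectangles are all hypothetical.

```python
def intersects(a, b):
    """True if rectangles (xmin, ymin, xmax, ymax) a and b share any point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def enclose(rects):
    """Smallest rectangle that contains every rectangle in rects."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

class Node:
    def __init__(self, children=None, entries=None):
        self.children = children or []   # internal node: child nodes
        self.entries = entries or []     # leaf node: (name, object MBR) pairs
        rects = [c.mbr for c in self.children] + [m for _, m in self.entries]
        self.mbr = enclose(rects)        # MBR covering everything below this node

def search(node, query):
    """Return names of objects whose MBR intersects the query rectangle."""
    if not intersects(node.mbr, query):
        return []                        # prune the whole subtree
    hits = [name for name, m in node.entries if intersects(m, query)]
    for child in node.children:
        hits.extend(search(child, query))
    return hits

# Hand-built two-level tree; the two leaf rectangles overlap each other.
leaf1 = Node(entries=[("a", (0, 0, 2, 2)), ("b", (1, 1, 4, 3))])
leaf2 = Node(entries=[("c", (3, 2, 6, 5)), ("d", (7, 7, 9, 9))])
root = Node(children=[leaf1, leaf2])

print(search(root, (3.5, 2.5, 5, 4)))
```

Because the leaf rectangles overlap, the query has to descend into both leaves; this is the cost of allowing overlapping MBRs.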

R Tree: Example

K-D Tree

• Designed for multi-attribute data, not necessarily spatial
• Variation of binary search tree
• Each level is used to index one of the dimensions of the spatial object.
• Lowest level cell has only one object
• Divisions not based on MBRs but successive divisions of the
dimension range.
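
The level-by-level cycling through dimensions can be sketched as follows. This is a minimal sketch, assuming the common variant that stores one point at every node and splits at the median; the nested-dict representation, the function names and the sample points are hypothetical.

```python
def build(points, depth=0):
    """Build a k-d tree as nested dicts; the split dimension cycles with depth."""
    if not points:
        return None
    k = len(points[0])
    axis = depth % k                      # level 0 -> x, level 1 -> y, ...
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                # median point becomes this node
    return {"point": points[mid], "axis": axis,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def range_search(node, lo, hi):
    """Return all points p with lo[i] <= p[i] <= hi[i] in every dimension."""
    if node is None:
        return []
    p, axis = node["point"], node["axis"]
    found = [p] if all(l <= v <= h for l, v, h in zip(lo, p, hi)) else []
    if lo[axis] <= p[axis]:               # the range may reach into the left side
        found += range_search(node["left"], lo, hi)
    if p[axis] <= hi[axis]:               # the range may reach into the right side
        found += range_search(node["right"], lo, hi)
    return found

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(pts)
print(tree["point"])
print(sorted(range_search(tree, (3, 1), (8, 5))))
```

Each comparison looks at a single dimension, so a subtree is skipped whenever the query range lies entirely on one side of the split value.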

K-D Tree: Example

Web Mining

What is Web?

• The Web is a collection of inter-related files on one or more
web servers
• Wealth of information: Presence everywhere
• Structure: Graph structure with links between pages
• Access: Hundreds of millions of requests per day.

Introduction to Web Mining

• Web mining is the use of data mining techniques to
automatically discover and extract information from web
documents.
• Discovering useful information from the WWW and its usage
patterns

Data Mining Vs Web Mining
• Traditional Data Mining: Concept of identifying a significant pattern from
the data that gives a better outcome
• Data is structured and relational
• Well-defined tables, columns, rows, keys & Constraints
• Web Mining: Process of performing data mining in the web. Extracting
the web documents and discovering the patterns from it.
• Semi-structured and Unstructured.
• Rich in features and patterns

Why Mine the Web?
• Enormous wealth of information on web
• Financial Information
• Book/CD/Video Stores
• Restaurant Information
• Car Prices and so on..
• Lots of data on user access pattern
• Semi-structured Web logs contain the sequence of URLs
accessed by users.

Web Data Mining Process

Why Web mining is Hard?

• The Web is more than a huge collection of documents; it also carries
• Hyperlink information
• Access and usage information
• The Web is very dynamic
• New pages are constantly being generated.

Issues are…
• Web data sets can be very large
○ Tens to hundreds of terabytes
• Cannot mine on a single server
○ Need large farms of servers
• Proper organization of hardware and software to mine multi-terabyte data
sets
• Difficulty in finding relevant information
• Extracting new knowledge from the web

Web Mining Taxonomy

Web Content Mining – Introduction

• Mining, extraction and integration of useful data, information and
knowledge from Web page content.
• Web content mining is related to, but different from, data mining and text
mining.
• Web data are mainly semi-structured and/or unstructured, while data
mining deals primarily with structured data.

• Extends work of basic search engines
• Search Engines
• IR application
• Keyword based
• Similarity between query and document
• Crawlers
• Indexing
• Profiles
• Link analysis.

Web Content Mining Includes …?

Unstructured Web Data Mining

Unstructured Documents – Feature Extraction
• Bag of words to represent unstructured documents
○ Takes single word as feature
○ Ignores the sequence in which words occur
• Features could be
• Boolean
• Word either occurs or does not occur in a document
• Frequency based
• Frequency of the word in a document
• Variations of the feature selection include
○ Removing case, punctuation, infrequent words, stop words, etc.
• Features can be reduced using different feature selection techniques:
○ Information gain, mutual information, cross entropy.
○ Stemming, which reduces words to their morphological roots.
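
The bag-of-words representation can be sketched with the standard library alone. This is an illustrative sketch: the stop-word list is a tiny made-up sample, the two documents are hypothetical, and stemming and statistical feature selection are left out.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "is", "of", "and", "on", "in"}

def tokenize(text):
    """Lower-case the text, drop punctuation and digits, and remove stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def bag_of_words(docs):
    """Return (vocabulary, frequency vectors, Boolean vectors) for the documents."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))           # one feature per distinct word
    freq = [[c[w] for w in vocab] for c in counts]
    boolean = [[1 if v else 0 for v in row] for row in freq]
    return vocab, freq, boolean

docs = ["The web is a graph of pages.", "Pages link to pages on the Web!"]
vocab, freq, boolean = bag_of_words(docs)
print(vocab)
print(freq)
print(boolean)
```

Word order is discarded: each document becomes one vector over the vocabulary, holding either counts (frequency based) or 0/1 flags (Boolean).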

Structured Web Data

Mining Techniques Using Agent & Database

What is Web Structure Mining?
• Web structure mining is the process of discovering structure
information from the web.
• A typical web graph consists of Web pages as nodes and
hyperlinks as edges connecting related pages.

• This type of mining can be performed either at the document
level (intra-page) or at the hyperlink level (inter-page).
• Research at the hyperlink level is called hyperlink analysis.
• Hyperlink structure can be used to retrieve useful information on the web.
• The main approaches are:
• PageRank
• CLEVER
• Hubs and Authorities - HITS

Page Rank Approach - Background

• PageRank was presented by Sergey Brin and Larry Page at
the Seventh International World Wide Web Conference (WWW7) in April
1998.
• The aim of the algorithm is to tackle some difficulties with the content-based
ranking algorithms of early search engines, which treated web pages as text
documents and retrieved information without using the link relationships
between them.

Page Rank Approach - Introduction

• PageRank is an algorithm used to measure the importance of web
pages using the hyperlinks between pages.
• Hyperlinks that point to a page from other pages are its in-links; hyperlinks
that point from it to other pages are its out-links.
• PageRank is a “vote”, by all the other pages on the Web, about how
important a page is.
• A link to a page counts as a vote of support.

Page Rank Approach
• Used to discover the most important pages on the web. Example: Google
• Prioritizes pages returned from a search by looking at the web structure.
• The importance of a page is calculated based on the number of pages which point to it
(backlinks).
• Weighting is used to give more importance to backlinks coming from important pages.
• PR(p) = (1-d) + d (PR(1)/N1 + … + PR(n)/Nn)
○ PR(i): PageRank of a page i which points to the target page p.
○ Ni: Number of links coming out of page i.
○ d: constant between 0 and 1, known as the damping factor.
○ (1-d): keeps the scores balanced. With this form of the formula the average PageRank
over all pages is 1 (provided every page has at least one out-link), so the PageRanks
sum to the number of pages rather than to one.
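
The formula can be applied repeatedly until the scores settle. The sketch below is one simple way to do that, not the production algorithm of any search engine: the four-page graph is hypothetical, `d = 0.85` and the fixed iteration count are assumptions, and all scores are updated together from the previous round.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(p) = (1-d) + d * sum(PR(i)/Ni) over the backlinks i of p.

    `links` maps each page to the list of pages it links to.
    """
    pr = {p: 1.0 for p in links}                  # initial guess: 1.0 per page
    for _ in range(iterations):
        new = {}
        for p in links:
            backlinks = [q for q in links if p in links[q]]
            new[p] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in backlinks)
        pr = new                                  # replace all scores at once
    return pr

# Hypothetical graph: D has no backlinks, C is linked from A, B and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pr = pagerank(links)
for page, score in sorted(pr.items()):
    print(page, round(score, 4))
```

Page D, with no backlinks, stays at (1 - d) = 0.15, while C, which collects votes from three pages, ends up with the highest score. The four scores add up to 4, the number of pages.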

• To formulate the above ideas, consider the Web as a directed graph G = (V, E),
• where V is the set of vertices or nodes, i.e., the set of all pages, and
• E is the set of directed edges in the graph, i.e., hyperlinks.
• Let the total number of pages on the Web be n (i.e., n = |V|).
• In its simplest form (without the damping factor d), the PageRank score of page i
(denoted by P(i)) is defined by:

$$P(i) = \sum_{(j,i) \in E} \frac{P(j)}{O_j}$$

• O_j is the number of out-links of page j.

Page Rank Approach : Hubs & Authorities

• Authoritative pages
○ An authority is the best source for the requested information.
○ Highly important pages.
• Hub pages
○ Contain links to highly important (authoritative) pages.

HITS (Hyperlink Induced Topic Search)
• Iterative algorithm for mining the Web graph to identify the topic hubs and authorities.
• Algorithm:
○ Consider a matrix A with rows and columns corresponding to web pages. Aij = 1
if page i links to page j, and 0 otherwise.
○ Let a and h be vectors whose i-th components are the degree of authority
and the hubbiness of the i-th page.
○ The hubbiness of a page is the sum of the authorities of all the pages it links
to, i.e., h = A a.
○ The authority of a page is the sum of the hubbiness of all the pages that link to it,
i.e., a = A^T h, where A^T is the transpose of A.
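
The two update rules can be iterated as in the sketch below. Two things here are assumptions not stated above: the vectors are rescaled after every round so that the largest component is 1 (without some rescaling the values grow without bound), and the graph must contain at least one link. The three-page matrix is hypothetical.

```python
def hits(A, iterations=50):
    """Iterate a = A^T h and h = A a, rescaling so the largest component is 1."""
    n = len(A)
    h = [1.0] * n                                   # initial hubbiness
    a = [1.0] * n                                   # initial authority
    for _ in range(iterations):
        # a = A^T h: authority of j sums the hubbiness of pages linking to j.
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        # h = A a: hubbiness of i sums the authority of pages that i links to.
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        top_a, top_h = max(a), max(h)               # assumed normalization step
        a = [x / top_a for x in a]
        h = [x / top_h for x in h]
    return a, h

# Hypothetical graph: page 0 links to pages 1 and 2, page 1 links to page 2.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
a, h = hits(A)
print([round(x, 4) for x in a])
print([round(x, 4) for x in h])
```

Page 2 comes out as the strongest authority because both other pages link to it, and page 0 as the strongest hub because it links to both authorities.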

Page Rank Algorithm Calculation
• PageRank, or PR(A), can be calculated using a simple iterative algorithm, and
corresponds to the principal eigenvector of the normalized link matrix of the web.
• This means a page’s PR can be calculated without knowing the final PR values of the
other pages, which may seem strange.
• Remember each value we calculate, and repeat the calculations many times until the
numbers stop changing much.

• Example: two pages, A and B, each linking to the other:

Page Rank Algorithm Calculation – Guess 1
• We don’t know what their PR should be to begin with, so let’s take a
guess at 1.0 and do some calculations:
• d = 0.85
• PR(A) = (1 – d) + d(PR(B)/1)
• PR(B) = (1 – d) + d(PR(A)/1)
• i.e. PR(A) = 0.15 + 0.85 * 1 = 1
• PR(B) = 0.15 + 0.85 * 1 = 1
• Hmm, the numbers aren’t changing at all! So it looks like we started
out with a lucky guess!!!

Page Rank Algorithm Calculation – Guess 2
• Let’s start the guess at 0 instead and re-calculate:
○ PR(A) = 0.15 + 0.85 * 0 = 0.15
○ PR(B) = 0.15 + 0.85 * 0.15 = 0.2775 (we have already calculated a “next best
guess” at PR(A), so we use it here)
• And again:
○ PR(A) = 0.15 + 0.85 * 0.2775 = 0.385875
○ PR(B) = 0.15 + 0.85 * 0.385875 = 0.47799375
• And again:
○ PR(A) = 0.15 + 0.85 * 0.47799375 = 0.5562946875
○ PR(B) = 0.15 + 0.85 * 0.5562946875 = 0.622850484375
• And so on. The numbers just keep going up. But will the numbers stop
increasing when they get to 1.0?

Page Rank Algorithm Calculation – Guess 3

• Let’s start the guess at 40 each and do a few cycles: PR(A) = 40, PR(B)
= 40
• First calculation: PR(A) = 0.15 + 0.85 * 40 = 34.15
• PR(B) = 0.15 + 0.85 * 34.15 = 29.1775
• And again: PR(A) = 0.15 + 0.85 * 29.1775 = 24.950875
• PR(B) = 0.15 + 0.85 * 24.950875 = 21.35824375
• Yup, those numbers are heading down alright! It sure looks like the numbers
will get to 1.0 and stop.
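
The three guesses can be replayed in a few lines to check where the numbers end up. This is a small sketch for the two-page example only (A and B each have one out-link, to the other page); the function name and the number of cycles are arbitrary choices.

```python
def iterate_two_pages(guess, d=0.85, cycles=100):
    """Repeat the two-page update, always using the newest value of PR(A)."""
    pr_a = pr_b = guess
    for _ in range(cycles):
        pr_a = (1 - d) + d * pr_b    # PR(A) = (1 - d) + d * (PR(B) / 1)
        pr_b = (1 - d) + d * pr_a    # PR(B) = (1 - d) + d * (PR(A) / 1)
    return pr_a, pr_b

for guess in (0.0, 1.0, 40.0):
    pr_a, pr_b = iterate_two_pages(guess)
    print(f"start={guess}: PR(A)={pr_a:.6f}, PR(B)={pr_b:.6f}")
```

Whatever the starting guess, both values approach 1.0: every cycle multiplies the distance from 1.0 by d twice, so the error shrinks by a factor of about 0.72 per cycle.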

Page Rank Algorithm Calculation – Example

Page Rank Algorithm Calculation – Example 1

So, for Page D, no backlinks means the equation looks like this:

PR(D) = (1-d) + d * (0)
= 0.15

Page Rank Algorithm Calculation – Example 2

Web Structure Mining Applications

• Information retrieval in social networks.
• To find out the relevance of each web page.
• Measuring the completeness of Web sites.
• Used in search engines to find out the relevant information

Web Usage Mining

Why analyze Website Usage?
Knowledge about how visitors use website could
• Provide guidelines to website reorganizations; Help Prevent disorientation
• Help designers place important information where the visitors look for it
• Pre-fetching and caching web pages
• Provide adaptive website (Personalization)
• Questions which could be answered:
• What are the differences in usage and access patterns among users?
• Which user behaviors change over time?
• How do usage patterns change with quality of service (slow/fast)?
• What is the distribution of network traffic over time?

Web Usage Mining Applications

• Personalization
• Improve structure of a site’s Web pages
• Aid in caching and prediction of future page references
• Improve design of individual pages
• Improve effectiveness of e-commerce (sales and advertising)

Web Usage Mining Issues

• Identification of the exact user is not possible.
• The exact sequence of pages referenced by a user cannot be determined, due to
caching.
• Sessions are not well defined.
• Security, privacy, and legal issues
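
Because sessions are not well defined, some working definition has to be chosen before usage data can be mined. The sketch below shows one possible heuristic, and everything about it is an assumption made for illustration: the IP address stands in for the user (which is imperfect, as noted above), a gap of more than 30 minutes starts a new session, and the log entries are made up.

```python
from datetime import datetime, timedelta

def sessionize(hits, timeout=timedelta(minutes=30)):
    """Group (ip, timestamp, url) hits into per-IP sessions split by inactivity."""
    sessions = []                  # list of (ip, [urls in order])
    last = {}                      # ip -> (last timestamp, index of its open session)
    for ip, ts, url in sorted(hits, key=lambda hit: hit[1]):
        if ip in last and ts - last[ip][0] <= timeout:
            idx = last[ip][1]      # continue this IP's current session
        else:
            sessions.append((ip, []))
            idx = len(sessions) - 1
        sessions[idx][1].append(url)
        last[ip] = (ts, idx)
    return sessions

# Hypothetical log entries.
log = [
    ("10.0.0.1", datetime(2024, 1, 1, 9, 0), "/home"),
    ("10.0.0.2", datetime(2024, 1, 1, 9, 5), "/home"),
    ("10.0.0.1", datetime(2024, 1, 1, 9, 10), "/products"),
    ("10.0.0.1", datetime(2024, 1, 1, 10, 30), "/home"),
]
for ip, urls in sessionize(log):
    print(ip, urls)
```

Pages served from a browser or proxy cache never reach the server log, so sessions rebuilt this way can miss steps in the real navigation path.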

Web Usage Mining Activities – Three Phases

Web Usage Mining – Data Preparation

Web Usage Mining – Pattern Discovery Tasks

Web Usage Mining – Pattern Analysis Tasks
