Web Mining: By Pawan Singh, Piyush Arora, Pooja Mansharamani, Pramod Singh, Praveen Kumar

This document provides an overview of the topic of web mining. It defines web mining as discovering useful information from web data and discusses its four main subtasks: resource finding, information selection/preprocessing, generalization, and analysis. The document also distinguishes between the three categories of web mining: web content mining, web structure mining, and web usage mining.
Copyright
© Attribution Non-Commercial (BY-NC)

Web Mining

By-
Pawan Singh
Piyush Arora
Pooja Mansharamani
Pramod Singh
Praveen Kumar
Outline

 Introduction
 Web Mining
 Web Content Mining
 Web Structure Mining
 Web Usage Mining
 Conclusion & Exam Questions

Four Problems

 Finding relevant information
   Low precision: many of the search results are irrelevant, which makes the relevant information hard to find.
   Low recall: not all the information available on the web can be indexed, which makes relevant but unindexed information hard to find.
 Creating new knowledge out of the information available on the web
   While the problem above is a query-triggered (retrieval-oriented) process, this problem is a data-triggered process.
 Personalizing the information
   Catering to personal preferences in content and presentation (the type and the presentation of the information).
 Learning about the consumers
   What does the customer want to do?
   Using web data to market products and/or services effectively.

Other Approaches

 Web mining is NOT the only approach
   Database approach (DB)
   Information retrieval (IR)
   Natural language processing (NLP)
     In-depth syntactic and semantic analysis
   Web document community
     Standards, manually appended meta-information, maintained directories, etc.

Direct vs. Indirect Web Mining

 Web mining techniques can be used to solve the information overload problems:
   Directly
     Attack the problem with web mining techniques
     E.g. a newsgroup agent classifies news as relevant
   Indirectly
     Used as part of a bigger application that addresses the problems
     E.g. used to create index terms for a web search service

The Research

 Converging research from databases, information retrieval, and artificial intelligence (specifically NLP and machine learning)
 Focusing on research from the machine learning point of view

Web Mining: Definition

 “Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”
   Can be viewed as four subtasks
   Not the same as Information Retrieval
   Not the same as Information Extraction

Web Mining: Subtasks

 Resource finding
   Retrieving intended documents
 Information selection/pre-processing
   Select and pre-process specific information from retrieved web resources
 Generalization
   Discover general patterns within and across web sites
 Analysis
   Validation and/or interpretation of mined patterns
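
As a rough illustration, the four subtasks can be read as the stages of a small pipeline. The Python sketch below is a toy and not a method taken from the slides: the in-memory `PAGES` dictionary, the function names, and the rule "keep the terms shared by every retrieved page" are all assumptions made for the example. A real system would crawl the web and use far richer pattern discovery.

```python
import re
from collections import Counter

# Hypothetical in-memory "web": URL -> raw HTML (assumption for illustration).
PAGES = {
    "http://example.org/a": "<html><body><h1>Web mining</h1><p>Mining web data.</p></body></html>",
    "http://example.org/b": "<html><body><p>Web usage mining studies web logs.</p></body></html>",
    "http://example.org/c": "<html><body><p>Cooking recipes.</p></body></html>",
}

def find_resources(pages, keyword):
    """1. Resource finding: retrieve the documents that mention the keyword."""
    return {url: html for url, html in pages.items() if keyword in html.lower()}

def preprocess(html):
    """2. Selection/pre-processing: strip tags, lowercase, split into words."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z]+", text.lower())

def generalize(token_lists):
    """3. Generalization: count in how many documents each term occurs."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return doc_freq

def analyze(doc_freq, n_docs):
    """4. Analysis: keep only the terms shared by every retrieved document."""
    return sorted(term for term, df in doc_freq.items() if df == n_docs)

docs = find_resources(PAGES, "mining")
tokens = [preprocess(html) for html in docs.values()]
patterns = analyze(generalize(tokens), len(docs))
print(patterns)
```

Each function maps to one subtask, so a stage can be swapped out (for example, a crawler in place of `find_resources`) without touching the others.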

Web Mining: Not IR

 Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible
 Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine)
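
The IR goal stated above is usually measured with two ratios: recall (the share of relevant documents that were retrieved) and precision (the share of retrieved documents that are relevant). A minimal sketch in Python, where the document ids are hypothetical and chosen only for illustration:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for the result list of a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Guard against empty sets so the ratios are always defined.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: what the system returned vs. what is actually relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d8", "d9"]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the five relevant documents were retrieved, so both the low-precision and the low-recall problem show up in the same result list.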

Web Mining: Not IE

 Information extraction (IE) aims to extract the relevant facts from given documents, while IR aims to select relevant documents.
   IE systems for the general Web are not feasible
   Most focus on specific Web sites or content

IE - IR

Information Retrieval
 Automatic retrieval of relevant documents
 Primary goals:
  o Indexing text
  o Searching for useful documents in a collection
  o “Bag of unordered words”
  o “Web document classification” task is an instance of IR

Information Extraction
 Extract relevant facts from documents
 Primary goals:
  o Transform a collection of retrieved documents into information
  o Structure of representation of a document
  o “Web document classification” task is an instance of IR
  o IE has a higher level of granularity
  o Result:
     o Structured database
     o Compression or summary of text or documents

Types of IE

 IE from unstructured texts (classical)
  • Unstructured: free text, e.g. news stories
  • Basic to deep linguistic pre-processing
 IE from semi-structured texts (structural)
  • Semi-structured: HTML
  • Uses meta-information, e.g. HTML tags
  • Wrapper induction; machine learning used to build systems (semi-)automatically
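
To make the semi-structured case concrete, the sketch below shows a hand-written wrapper that uses HTML tags as meta-information to pull records out of a table. It is written with Python's standard `html.parser` module. The page content, the `TableWrapper` class name, and the record fields are hypothetical. Note that this wrapper is coded by hand; wrapper induction would learn such extraction rules from labeled example pages instead.

```python
from html.parser import HTMLParser

class TableWrapper(HTMLParser):
    """Collect the text of each <td> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # a new record starts
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")      # a new field starts

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a <td> is treated as a fact to extract.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()

# Hypothetical product listing page (assumption for illustration).
PAGE = """
<table>
  <tr><td>Laptop</td><td>999</td></tr>
  <tr><td>Phone</td><td>499</td></tr>
</table>
"""

wrapper = TableWrapper()
wrapper.feed(PAGE)
records = [{"name": name, "price": int(price)} for name, price in wrapper.rows]
print(records)
```

The wrapper only works for pages that share this layout, which is why such systems tend to target specific sites rather than the general Web.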

Web Mining and Machine Learning
 Machine learning is concerned with the development of algorithms and techniques that allow computers to "learn".
 Web mining is NOT learning from the Web.
 Some applications of machine learning on the web are NOT Web Mining
 Methods used for Web Mining are NOT limited to machine learning
 There is a close relationship between web mining and machine learning

Web Mining and Machine Learning

• Machine learning techniques support web mining, as they can be applied to the processes in web mining.

• For example, recent research shows that applying machine learning techniques could improve the text classification process compared to traditional IR techniques.

• In short, web mining intersects with the application of machine learning on the web.
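
As one possible illustration of machine learning applied to text classification, the sketch below trains a small multinomial naive Bayes classifier with add-one smoothing, using only the Python standard library. The slides do not name a specific learning method, so the choice of naive Bayes, the training sentences, and the labels are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Fit a multinomial naive Bayes model from (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter()             # label -> number of documents
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled page snippets (assumption for illustration).
TRAINING = [
    ("cheap flights and hotel deals", "travel"),
    ("book hotel rooms online", "travel"),
    ("football match results", "sports"),
    ("league football scores today", "sports"),
]

model = train(TRAINING)
print(classify(model, "hotel deals today"))
```

The model is learned from labeled examples rather than from hand-tuned term weights, which is the contrast with traditional IR techniques that the slide points to.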

Web Mining Categories

 Web Content Mining
   Discovering useful information from web contents/data/documents.
 Web Structure Mining
   Discovering the model underlying link structures (topology) on the Web, e.g. discovering authorities and hubs.
 Web Usage Mining
   Making sense of data generated by surfers
   Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.

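
One well-known way to discover authorities and hubs is Kleinberg's HITS iteration: a page is a good authority if good hubs link to it, and a good hub if it links to good authorities. The slides do not name an algorithm, so reading "authorities and hubs" as HITS is an assumption, and the tiny link graph below is hypothetical. A minimal sketch in Python:

```python
import math

def hits(graph, iterations=50):
    """Return (hub, authority) scores for a dict: page -> list of linked pages."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {n: sum(auth[d] for d in graph.get(n, [])) for n in nodes}
        # Normalize both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth

# Hypothetical link graph (assumption for illustration).
LINKS = {
    "portal": ["paper1", "paper2", "paper3"],
    "blog": ["paper1"],
    "paper1": [],
    "paper2": [],
    "paper3": [],
}

hub, auth = hits(LINKS)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

In this graph the page that links to the most papers ends up as the strongest hub, and the paper cited by both linking pages ends up as the strongest authority.
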
Web Content Data Structure

 Unstructured – free text
 Semi-structured – HTML
 More structured – table or database-generated HTML pages
 Multimedia data – receives less attention than text or hypertext

Web Structure Mining

 Interested in the structure between Web documents (not within a document)
 Example: PageRank – Google
 Application: Discovering micro-communities in the Web
 Measuring the “completeness” of a Web site
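
The PageRank example can be sketched as a power iteration over the link graph: each page repeatedly passes a share of its score to the pages it links to. The version below is a simplified textbook form in Python, not a description of Google's production system. The damping factor of 0.85, the even redistribution of rank from pages without outgoing links, and the three-page graph are assumptions made for the example.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank for a dict: page -> list of linked pages."""
    nodes = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in nodes}
        for page in nodes:
            targets = graph.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in nodes:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical three-page site (assumption for illustration).
LINKS = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
}

ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

Only links between documents are used here; the content within each page plays no role, which matches the focus of web structure mining.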

Web Usage Mining
 Tries to predict user behavior from interaction with the Web
 Wide range of data (logs)
   Web client data
   Proxy server data
   Web server data
 Two common approaches
   Map usage data into relational tables before using adapted data mining techniques
   Use log data directly by utilizing special pre-processing techniques
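
The first approach, mapping usage data into relational tables, can be sketched with Python's standard `re` and `sqlite3` modules. The log lines below are hypothetical entries in the Common Log Format, and the `hits` table layout and the `load_hits` helper are assumptions made for the example. Real server logs usually need extra cleaning (robots, sessions, user identification) before mining.

```python
import re
import sqlite3

# Hypothetical web server log lines in Common Log Format (assumption).
LOG_LINES = [
    '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 1045',
    '10.0.0.2 - - [10/Oct/2023:14:02:11 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2023:14:02:40 +0000] "GET /missing.html HTTP/1.1" 404 512',
]

PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def load_hits(lines):
    """Parse log lines into an in-memory relational table named hits."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (host TEXT, time TEXT, path TEXT, status INTEGER)")
    for line in lines:
        m = PATTERN.match(line)
        if m:  # lines that do not match the format are skipped
            conn.execute(
                "INSERT INTO hits VALUES (?, ?, ?, ?)",
                (m["host"], m["time"], m["path"], int(m["status"])),
            )
    return conn

conn = load_hits(LOG_LINES)
# Once the data is relational, ordinary SQL can summarize usage.
top_pages = conn.execute(
    "SELECT path, COUNT(*) AS n FROM hits WHERE status = 200 "
    "GROUP BY path ORDER BY n DESC, path"
).fetchall()
print(top_pages)
```

With the log in a table, standard queries and adapted data mining techniques can be run on it; the second approach would instead work on the raw log lines after dedicated pre-processing.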

Thank you!