Web Mining: By-Pawan Singh Piyush Arora Pooja Mansharamani Pramod Singh Praveen Kumar

This document provides an overview of the topic of web mining. It defines web mining as discovering useful information from web data and discusses its four main subtasks: resource finding, information selection/preprocessing, generalization, and analysis. The document also distinguishes between the three categories of web mining: web content mining, web structure mining, and web usage mining.

Web Mining

By
Pawan Singh
Piyush Arora
Pooja Mansharamani
Pramod Singh
Praveen Kumar

Outline

• Introduction
• Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining
• Conclusion & Exam Questions

Four Problems

• Finding relevant information
  o Low precision, caused by the irrelevance of many of the search results, makes the relevant information hard to find.
  o Low recall, caused by the inability to index all the information available on the web, makes relevant but unindexed information hard to find.
• Creating new knowledge out of the information available on the web
  o While the problem above is a query-triggered (retrieval-oriented) process, this problem is a data-triggered process.
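The two retrieval measures named above can be made concrete with a small worked example. This is a minimal sketch, not part of the original slides: the document IDs and the helper name `precision_recall` are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four results returned, only one of them relevant,
# and two relevant documents (d5, d6) were never indexed.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d5", "d6"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here three of the four results are irrelevant (low precision), and two of the three relevant documents are missing from the results (low recall).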

• Personalizing the information
  o Catering to personal preferences in content and presentation (the type of information and how it is presented)
• Learning about the consumers
  o What does the customer want to do?
  o Using web data to market products and/or services effectively

Other Approaches

Web mining is NOT the only approach

• Database approach (DB)
• Information retrieval (IR)
• Natural language processing (NLP)
  o In-depth syntactic and semantic analysis
• Web document community
  o Standards, manually appended meta-information, maintained directories, etc.

Direct vs. Indirect Web Mining

• Web mining techniques can be used to solve the information overload problems:
  o Directly: attack the problem with web mining techniques. E.g. a newsgroup agent classifies news as relevant.
  o Indirectly: used as part of a bigger application that addresses the problems. E.g. used to create index terms for a web search service.

The Research

• Converging research from databases, information retrieval, and artificial intelligence (specifically NLP and machine learning)
• Focusing on research from the machine learning point of view

Web Mining: Definition

• “Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data.”
• Can be viewed as four subtasks
• Not the same as Information Retrieval
• Not the same as Information Extraction

Web Mining: Subtasks

• Resource finding
  o Retrieving intended documents
• Information selection/pre-processing
  o Select and pre-process specific information from retrieved web resources
• Generalization
  o Discover general patterns within and across web sites
• Analysis
  o Validation and/or interpretation of mined patterns

Web Mining: Not IR

• Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible
• Web document classification, which is a Web Mining task, could be part of an IR system (e.g. indexing for a search engine)

Web Mining: Not IE

• Information extraction (IE) aims to extract the relevant facts from given documents, while IR aims to select relevant documents.
  o IE systems for the general Web are not feasible
  o Most focus on specific Web sites or content

IE - IR

Information Retrieval
• Automatic retrieval of relevant documents
• Primary goals:
  o Indexing text
  o Searching for useful documents in a collection
  o Treats a document as a “bag of unordered words”
  o The “Web document classification” task is an instance of IR

Information Extraction
• Extract relevant facts from documents
• Primary goals:
  o Transform a collection of retrieved documents into information
  o Work with the structure or representation of a document
  o IE has a higher level of granularity
  o Result: a structured database, or a compression or summary of the text or documents
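The "bag of unordered words" view on the IR side can be illustrated in a few lines. This is only a sketch under simple assumptions: the tokenizer is a plain regular expression and the two sample sentences are invented.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens; word order is discarded."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Two hypothetical documents with the same words in a different order.
doc_a = "Web mining discovers knowledge from web data"
doc_b = "knowledge from web data web mining discovers"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

# Under the bag-of-words view the two documents are indistinguishable.
print(bag_a == bag_b)
print(bag_a["web"])
```

Because the representation keeps only term counts, an IR system built on it cannot tell these two documents apart, whereas IE works with the structure of each document.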

Types of IE

• IE from unstructured texts (classical)
  o Unstructured: free text, e.g. news stories
  o Basic to deep linguistic pre-processing
• IE from semi-structured texts (structural)
  o Semi-structured: HTML
  o Uses meta-information, e.g. HTML tags
  o Wrapper induction, with machine learning used to build systems (semi-)automatically
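To show what a wrapper for semi-structured text does, here is a minimal hand-written one built on Python's standard `html.parser` module. It is an assumption-laden sketch: the class name `TableWrapper`, the sample page, and the `product`/`price` field names are hypothetical, and the wrapper is coded by hand rather than induced by machine learning as the slide describes.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Hand-coded wrapper: collects the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self._row = None      # row currently being read, or None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


# Hypothetical product listing page.
page = (
    "<table>"
    "<tr><td>Laptop</td><td>999</td></tr>"
    "<tr><td>Phone</td><td>499</td></tr>"
    "</table>"
)

wrapper = TableWrapper()
wrapper.feed(page)
records = [{"product": name, "price": int(price)} for name, price in wrapper.rows]
print(records)
```

The HTML tags act as the meta-information: the wrapper relies on `<tr>` and `<td>` to locate facts, so it only works for pages that share this layout.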

Web Mining and Machine Learning

• Machine learning is concerned with the development of algorithms and techniques that allow computers to "learn".
• Web mining is NOT the same as learning from the Web.
• Some applications of machine learning on the web are NOT Web Mining
• Methods used for Web Mining are NOT limited to machine learning
• There is a close relationship between web mining and machine learning

Web Mining and Machine Learning

• Machine learning techniques support web mining because they can be applied to the processes in web mining.
• For example, recent research shows that applying machine learning techniques could improve the text classification process compared to traditional IR techniques.
• In short, web mining intersects with the application of machine learning on the web.
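As one illustration of machine learning applied to text classification, the sketch below trains a tiny multinomial Naive Bayes classifier with add-one smoothing. Naive Bayes is chosen here only as a familiar example; the slides do not say which techniques the cited research used, and the training sentences and labels are invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(doc):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled web snippets.
docs = [
    "cheap flights and hotel deals",
    "hotel booking discount flights",
    "python code for web crawler",
    "machine learning code tutorial",
]
labels = ["travel", "travel", "tech", "tech"]

model = NaiveBayes().fit(docs, labels)
print(model.predict("discount hotel flights"))
print(model.predict("web crawler tutorial"))
```

The classifier learns word statistics from labelled examples instead of relying on hand-written matching rules, which is the kind of support for web mining the slide refers to.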

Web Mining Categories

• Web Content Mining
  o Discovering useful information from web contents/data/documents
• Web Structure Mining
  o Discovering the model underlying the link structures (topology) on the Web, e.g. discovering authorities and hubs
• Web Usage Mining
  o Making sense of data generated by surfers
  o Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.

Web Content Data Structure

• Unstructured: free text
• Semi-structured: HTML
• More structured: table or database-generated HTML pages
• Multimedia data: receives less attention than text or hypertext

Web Structure Mining

• Interested in the structure between Web documents (not within a document)
• Example: PageRank (Google)
• Application: discovering micro-communities in the Web
• Measuring the “completeness” of a Web site
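The PageRank example can be sketched as a short power iteration over a link graph. This is a simplified textbook version, not Google's production algorithm: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the four-page graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page site: C is linked to by A, B and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Only the links between pages are used, never their content, which is what distinguishes structure mining from content mining.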

Web Usage Mining

• Tries to predict user behavior from interaction with the Web
• Wide range of data (logs)
  o Web client data
  o Proxy server data
  o Web server data
• Two common approaches
  o Map usage data into relational tables before applying adapted data mining techniques
  o Use log data directly by utilizing special pre-processing techniques
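The first approach, mapping usage data into relational tables, can be sketched with the standard `re` and `sqlite3` modules. The log lines below are invented and assumed to follow the Common Log Format; the table name `hits` and its columns are hypothetical choices for this example.

```python
import re
import sqlite3

# Assumed input format: Common Log Format web server entries.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Turn one log line into a (host, time, path, status) row, or None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return (match["host"], match["time"], match["path"], int(match["status"]))


log_lines = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:56:01 -0700] "GET /products.html HTTP/1.0" 200 1500',
    '10.0.0.2 - - [10/Oct/2000:13:57:12 -0700] "GET /index.html HTTP/1.0" 200 2326',
    "malformed line",
]

# Load the parsed rows into an in-memory relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (host TEXT, time TEXT, path TEXT, status INTEGER)")
rows = [row for row in map(parse_line, log_lines) if row is not None]
conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", rows)

# Once the data is relational, ordinary queries can summarize usage.
page_counts = conn.execute(
    "SELECT path, COUNT(*) FROM hits GROUP BY path ORDER BY COUNT(*) DESC, path"
).fetchall()
print(page_counts)
```

After this step, standard data mining techniques can run against the table; the second approach would skip the table and work on the pre-processed log records directly.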
Thank you!
