
Web and Text Mining


Web Mining

What is Web Mining?


● Web mining is the use of data mining techniques to extract knowledge from web data.
● Web data includes :
○ web documents
○ hyperlinks between documents
○ usage logs of web sites
● The WWW is a huge, widely distributed, global information service centre and therefore constitutes a rich source for data mining.
Data Mining vs Web Mining
● Data Mining: the process of identifying significant patterns in data that support better outcomes.
● Web Mining: the process of applying data mining to the Web, that is, extracting Web documents and discovering patterns in them.
Web Data Mining Process

https://ieeexplore.ieee.org/document/5485404/
Issues
● Web data sets can be very large
○ Tens to hundreds of terabytes
● Cannot mine on a single server
○ Need large farms of servers
● Proper organization of hardware and software to mine multi-terabyte data sets
● Difficulty in finding relevant information
● Extracting new knowledge from the web
Opportunities
The Web offers unprecedented opportunities for data mining:

● Abundant and easily accessible data.
● Huge variety of data: structured, semi-structured, images, multimedia, etc.
● Most of the data on the Web is linked: hyperlinks connect pages within a site and across sites.
● Much of the data is redundant: the same piece of information, or a variant of it, appears on a number of pages.
Challenges
Along with opportunities, there are serious challenges in web mining:

● The Web is noisy: a Web page typically contains a mixture of many kinds of information, e.g., main contents, advertisements, navigation panels, copyright notices, etc.
● The Web is dynamic: information on the Web changes constantly. Keeping up with the changes and monitoring them are important issues.
● The Web is a virtual society: it is not only about data, information and services, but also about interactions among people, organizations and automatic systems, i.e., communities.
● Many other such characteristics make mining the Web a considerable challenge.
Data mining v/s Web mining
● Web mining is an application of data mining techniques.
● Web mining is studied as a specific branch of data mining to consider the
specific structures of the available web data.
● Web data:
○ Web content : text, images etc.
○ Web usage: http logs, app server logs etc.
○ Web structure: hyperlinks, tags
Web Mining Taxonomy

https://www.researchgate.net/figure/Taxonomy-of-Web-Mining-Source-Kavita-et-al-2011_fig1_282357293
Web Structure Mining
What is Web Structure Mining?
The structure of a typical Web graph consists of Web pages as nodes and hyperlinks as edges connecting two related pages.

[Figure: Web Graph Structure, with Web documents as nodes and hyperlinks as edges]

• Web Structure Mining is the process of discovering structure information from the Web.
• This type of mining can be performed either at the (intra-page) document level or at the (inter-page) hyperlink level.
• The research at the hyperlink level is also called Hyperlink Analysis.
Motivation to study Hyperlink Structure
• Hyperlinks serve two main purposes:
• Pure navigation.
• Pointing to pages that have relevant information.
• This link structure can be used to retrieve useful information from the Web.
Web Structure by Language

Source: https://www.slideshare.net/AmirFahmideh/web-mining-structure-mining
Web Structure Terminology
• Web-graph: A directed graph that represents the Web.
• Node: Each Web page is a node of the Web-graph.
• Link: Each hyperlink on the Web is a directed edge of the Web-graph.
• In-degree: The in-degree of a node, p, is the number of distinct links that point to p.
• Out-degree: The out-degree of a node, p, is the number of distinct links originating at p that point to other nodes.
Web Structure Terminology(2)
Directed Path: A sequence of links, starting from p, that can be followed to reach q.

Shortest Path: Of all the paths between nodes p and q, the one with the shortest length, i.e. the fewest links on it.

Diameter: The maximum of the shortest-path lengths over all pairs of nodes p and q in the Web-graph.
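
The terminology above can be made concrete with a small sketch. The toy graph `web_graph` and the helper names are hypothetical, chosen only for illustration; the diameter here is taken over ordered pairs that are actually connected, which is one common convention and an assumption on our part.

```python
from collections import deque

# Hypothetical toy Web-graph: each page maps to the pages it links to.
web_graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def out_degree(graph, p):
    """Number of distinct links originating at p."""
    return len(set(graph.get(p, [])))

def in_degree(graph, p):
    """Number of distinct links pointing to p."""
    return sum(1 for targets in graph.values() if p in set(targets))

def shortest_path_length(graph, p, q):
    """Fewest links on a directed path from p to q (breadth-first search)."""
    seen = {p}
    queue = deque([(p, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == q:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # q is not reachable from p

def diameter(graph):
    """Largest shortest-path length over all connected ordered pairs."""
    nodes = set(graph) | {t for ts in graph.values() for t in ts}
    lengths = [shortest_path_length(graph, p, q)
               for p in nodes for q in nodes if p != q]
    return max(d for d in lengths if d is not None)
```

In this toy graph page C has in-degree 3, and the longest shortest path is D to B (via C and A), so the diameter is 3.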
LinkedIn : Shortest Path Example
Web Structure Mining (cont.)
● This type of mining can be performed either at the document level (intra-page) or at the hyperlink level (inter-page).
● The research at the hyperlink level is called Hyperlink Analysis.
● Hyperlink structure can be used to retrieve useful information on the web.

There are two main approaches:

● PageRank
● Hubs and Authorities - HITS
Google Page Rank Algorithm
• PageRank (PR) is an algorithm used by Google Search to rank websites in its search engine results.

PR(u) = Σ_{v ∈ Bu} PR(v) / L(v)

i.e. the PageRank value for a page u depends on the PageRank value of each page v in the set Bu (the set of all pages linking to page u), divided by the number L(v) of links from page v.
PageRank
● Used to discover the most important pages on the web.
● Prioritizes pages returned from a search by looking at web structure.
● The importance of a page is calculated from the number of pages that point to it (backlinks).
● Weighting gives more importance to backlinks coming from important pages.
● PR(p) = (1-d) + d (PR(1)/N1 + …… + PR(n)/Nn)
○ PR(i): PageRank of a page i that points to the target page p.
○ Ni: Number of links coming out of page i.
○ d: damping factor, a constant between 0 and 1 (0.85 is a common choice).
○ (1-d): the share of rank every page receives regardless of its backlinks, modelling a surfer who jumps to a random page. In this form the PageRanks average to 1 when every page has at least one outgoing link; writing the term as (1-d)/N for N pages makes them sum to 1 instead.
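
As a rough illustration of the formula above, the following sketch iterates PR(p) = (1-d) + d (PR(1)/N1 + … + PR(n)/Nn) on a tiny hand-made graph. The graph `links`, the function name `pagerank`, the value d = 0.85 and the fixed iteration count are all assumptions for the example; this is not Google's production algorithm.

```python
def pagerank(links, d=0.85, iterations=50):
    """Repeatedly apply PR(p) = (1 - d) + d * sum(PR(i) / N_i).

    links maps each page to the list of distinct pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}  # start every page at rank 1
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Sum the rank share of every page i that links to p.
            incoming = sum(pr[i] / len(links[i])
                           for i in pages if p in links.get(i, ()))
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical three-page web: A links to B and C, B to C, C back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

Page C collects backlinks from both A and B and ends up with the highest rank; because every page here has an outgoing link, the ranks sum to the number of pages.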
PageRank (cont.)

https://en.wikipedia.org/wiki/PageRank
Hubs and Authorities
● Authoritative pages
○ An authority is a page that is the best source for the requested information.
○ Highly important pages.
● Hub pages
○ Contain links to highly important pages (authorities).

[Figure: Hubs and Authorities]
HITS (Hyperlink Induced Topic Search)
● Iterative algorithm for mining the Web graph to identify topic hubs and authorities.
● Algorithm:
○ Consider a matrix A whose rows and columns correspond to web pages: Aij = 1 if page i links to page j, and 0 otherwise.
○ Let a and h be vectors whose ith components are the degree of authority and the hubbiness of the ith page.
○ The hubbiness of a page is the sum of the authorities of all the pages it links to, i.e. h = A x a.
○ The authority of a page is the sum of the hubbiness of all the pages that link to it, i.e. a = At x h, where At is the transpose of A.
○ The two updates are repeated, rescaling a and h after each step, until the values stabilize.
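
A minimal sketch of the two update rules, h = A x a and a = At x h, written with plain Python lists. The adjacency matrix, the function name `hits`, the fixed number of iterations and the Euclidean normalisation after each step are assumptions made for this example.

```python
import math

def hits(A, iterations=50):
    """Alternate a = A^T x h and h = A x a, normalising after each round."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iterations):
        # Authority of page j: sum of hubbiness of the pages i that link to j.
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        # Hubbiness of page i: sum of authority of the pages j it links to.
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        # Rescale so the values stay bounded (assumed normalisation).
        a_norm = math.sqrt(sum(x * x for x in a)) or 1.0
        h_norm = math.sqrt(sum(x * x for x in h)) or 1.0
        a = [x / a_norm for x in a]
        h = [x / h_norm for x in h]
    return h, a

# Hypothetical graph: page 0 links to pages 2 and 3, page 1 links to page 2.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
hub_scores, authority_scores = hits(A)
```

Here page 0 comes out as the strongest hub and page 2 as the strongest authority, while pages without incoming links get an authority of zero.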
Web Structure Applications
Web Structure is a useful source for extracting information such as
1. Quality of Web Page - Ranking of web pages
2. Interesting Web Structures - Graph patterns like Co-citation, Social
choice, etc.
3. Web Page Classification - Classifying web pages according to
various topics
4. Finding Related Pages - Given one relevant page, find all related pages
5. Detection of duplicate pages - Detection of near-mirror sites to
eliminate duplication

Web Usage Mining
Web Usage Mining
● Web usage mining: automatic discovery of patterns in clickstreams
and associated data collected or generated as a result of user
interactions with one or more Web sites.
● Goal: analyze the behavioral patterns and profiles of users interacting with
a Web site.
● The discovered patterns are usually represented as collections of pages,
objects, or resources that are frequently accessed by groups of users with
common interests.
● Data in Web Usage Mining:
a. Web server logs
b. Site contents
c. Data about the visitors, gathered from external channels
Web Usage Mining Phases

Raw server log → Pre-Processing → Preprocessed data → Pattern Discovery → Rules and patterns → Pattern Analysis → Interesting knowledge
Data Preparation
● Data cleaning
○ Irrelevant entries are removed by checking the suffix of the URL name, for example, all log entries with filename suffixes such as gif, jpeg, etc.
● User identification
○ If a page is requested that is not directly linked to the previous pages, multiple users are
assumed to exist on the same machine
○ Other heuristics involve using a combination of IP address, machine name, browser agent,
and temporal information to identify users
● Transaction identification
○ A transaction groups the page references made by a user during a single visit to a site
○ Size of a transaction can range from a single page reference to all of the page references in a visit
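
The three preparation steps can be sketched as follows. The log entries, the suffix list, the choice of (IP address, browser agent) as the user key and the 30-minute inactivity timeout are all assumptions for illustration; real logs need parsing first and the heuristics vary by site.

```python
from datetime import datetime, timedelta

# Assumed suffixes of embedded resources that are not page views.
IGNORED_SUFFIXES = (".gif", ".jpeg", ".jpg", ".png", ".css", ".js")
# Assumed inactivity gap that ends a visit.
SESSION_TIMEOUT = timedelta(minutes=30)

# Hypothetical, already-parsed log entries: (ip, user_agent, timestamp, url).
log = [
    ("10.0.0.1", "Firefox", datetime(2024, 1, 1, 9, 0), "/index.html"),
    ("10.0.0.1", "Firefox", datetime(2024, 1, 1, 9, 0), "/logo.gif"),
    ("10.0.0.1", "Firefox", datetime(2024, 1, 1, 9, 5), "/products.html"),
    ("10.0.0.1", "Chrome", datetime(2024, 1, 1, 9, 6), "/index.html"),
    ("10.0.0.1", "Firefox", datetime(2024, 1, 1, 11, 0), "/index.html"),
]

def clean(entries):
    """Data cleaning: drop requests whose URL ends in an ignored suffix."""
    return [e for e in entries if not e[3].lower().endswith(IGNORED_SUFFIXES)]

def sessionize(entries):
    """User and transaction identification.

    A user is approximated by (ip, user_agent); a new transaction starts
    when the same user has been inactive longer than SESSION_TIMEOUT.
    """
    sessions = []   # list of (user, [urls]) in order of first request
    last_seen = {}  # user -> (time of last request, index into sessions)
    for ip, agent, ts, url in sorted(entries, key=lambda e: e[2]):
        user = (ip, agent)
        if user in last_seen and ts - last_seen[user][0] <= SESSION_TIMEOUT:
            idx = last_seen[user][1]
        else:
            sessions.append((user, []))
            idx = len(sessions) - 1
        sessions[idx][1].append(url)
        last_seen[user] = (ts, idx)
    return sessions

sessions = sessionize(clean(log))
```

With this sample the image request is dropped, the two browsers on the same IP address are treated as two users, and the Firefox user's 11:00 request opens a second transaction.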
Pattern Discovery Tasks
● Clustering and Classification
○ Clustering of users helps to discover groups of users with similar navigation patterns => provide personalized Web content
○ Clustering of pages helps to discover groups of pages having related content => search engines
○ E.g. clients who often access webminer software products tend to be from educational institutions.
○ Clients who placed an online order for software tend to be students in the 20-25 age group who live in the United States.
○ 75% of clients who download software and visit between 7:00 and 11:00 pm on weekends are engineering students.
Pattern Discovery Tasks
● Sequential Patterns:
○ Extract frequently occurring inter-session patterns in which the presence of a set of items is followed by another item in time order
○ Used to predict future user visit patterns => placing ads or recommendations

● Association Rules:
○ Discover correlations among pages accessed together by a client
○ Help restructure the Web site
○ Develop e-commerce marketing strategies - Grocery Mart
Pattern Analysis Tasks
● Pattern Analysis is the final stage of WUM (Web Usage Mining), which involves the validation and interpretation of the mined patterns

● Validation:
○ eliminates the irrelevant rules or patterns and extracts the interesting rules or patterns from the output of the pattern discovery process

● Interpretation:
○ the output of mining algorithms is mainly in mathematical form and not suitable for direct human interpretation
Web Usage Mining - Pattern Discovery Tasks

• Statistical Analysis
• Clustering
• Classification
• Association Rules

Web Usage Mining - Pattern Discovery Tasks (Cont.)
Statistical Analysis:
• Different kinds of descriptive statistical analyses (frequency, mean, median, etc.)
on variables such as page views, viewing time and length of a navigational path
give useful knowledge.
Clustering:
• Clustering is a technique to group together a set of items having similar
characteristics.
• In the Web Usage domain, there are two kinds of interesting clusters to be
discovered :
• Clustering of users : discover groups of users with similar navigation
patterns. => Perform market segmentation in E-commerce.
• Clustering of pages: discover groups of pages having related content => Useful
for search engines
Web Usage Mining - Pattern Discovery Tasks (Cont.)
Classification: Classification is the task of mapping a data item into one of several
predefined classes.
• In the Web domain, one is interested in developing a profile of users
belonging to a particular class or category. Uses Decision Tree
classifiers, Naive Bayesian classifiers, Neural Networks, SVM etc.
Association Rules:
• Given: A database of transactions, where each transaction is a list of items.
Find: all rules that associate the presence of one set of items with that of
another set of items.
• For web mining, association rules refer to sets of pages accessed together with a support value
exceeding some specified threshold.
• Are applicable for business and marketing applications, and can help Web
designers to restructure their Website.

Source: http://www3.cs.stonybrook.edu/~cse634/L8ch5assoc.pdf
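
A small sketch of association rules over page sets, under stated assumptions: the transactions are invented, only rules between single pages are considered, and the confidence measure and both thresholds are standard choices that the slides do not specify.

```python
from itertools import combinations

# Hypothetical transactions: the set of pages accessed in each visit.
transactions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/about"},
    {"/products", "/cart"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every page in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Rules lhs -> rhs between single pages that pass both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for x, y in combinations(items, 2):
        pair_support = support({x, y}, transactions)
        if pair_support < min_support:
            continue  # the pages are not accessed together often enough
        for lhs, rhs in ((x, y), (y, x)):
            # Confidence: how often rhs appears in visits that contain lhs.
            confidence = pair_support / support({lhs}, transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules

rules = pair_rules(transactions)
```

In this sample every visit that reaches /cart also contains /products, so the rule /cart -> /products has support 0.5 and confidence 1.0.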
Web Usage Mining - Pattern Analysis
• Last step in the overall Web Usage mining process.
• Motivation : Filter out uninteresting rules or patterns from the set found in the
pattern discovery phase.
• The exact analysis methodology - governed by the application for which Web mining
is done.
• The most common form of pattern analysis consists of:
– A knowledge query mechanism (like SQL).
– Visualization techniques (like graphing patterns or assigning colors to
different values) that highlight overall patterns or trends.
Web Usage Mining Application: User Profiles
• The Web has taken user profiling to completely new levels.
• In a 'brick-and-mortar' store, data collection happens only at the checkout
counter ('point-of-sale').
• In an online store, the complete click-stream is recorded:
– Provides a detailed record of every single action taken by the user.
– Allows creating a detailed user profile
• Most organizations build profiles using user behavior limited to their own
sites (IMDB, Netflix).
• Web-wide profiling also exists (Facebook, Google)

Web Usage Mining Application: User Profiles
Amazon’s Recommendation Systems:

The data Amazon mines:

• Purchased shopping carts: real money from real people spent on real items.
• Wishlists: what is on Amazon specifically for you.
• Demographic information: they know what is popular in your general area for your kids, yourself, your spouse, etc.
• User segmentation: did you buy 3 books in separate months for a toddler? You likely have a kid.

And lots more!

Source: https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf
Web usage Mining Application: User Profiles

AMAZON’s Recommendation System

https://www.flipkart.com/ amazon.com
NETFLIX’s Recommendation Systems
On a lighter note
Web Content Mining
Web Content Mining - Introduction
● Mining, extraction and integration of useful data, information and knowledge from Web page content.
● Web content mining is related to, but different from, data mining and text mining.
● Web data are mainly semi-structured and/or unstructured, while data mining deals primarily with structured data.
Web Content Mining
Extraction of useful information from the contents of Web documents (structured
and unstructured data)

[Figure: content types: Text, Audio, Image, Video]

source : https://talkroute.com/automatic-speech-recognition-mixing-it-with-ivr/ , https://www.underconsideration.com/brandnew/archives/new_logo_for_youtube_done_in_house.php
Unstructured Web Data Mining

Image Source: http://www.3idatascraping.com/6-tips-on-how-to-do-data-scraping-of-unstructured-data.php
https://www.researchgate.net/figure/The-Progress-of-Web-Content-Mining_fig13_299529651
Unstructured Documents - Feature Extraction
● Bag of words to represent unstructured documents
○ Takes each single word as a feature
○ Ignores the sequence in which words occur
● Features could be
○ Boolean
■ Word either occurs or does not occur in a document
○ Frequency based
■ Frequency of the word in a document
● Variations of the feature representation include
○ Removing case, punctuation, infrequent words, stop words, etc.
● Features can be reduced using different feature selection techniques:
○ Information gain, mutual information, cross entropy.
○ Stemming: which reduces words to their morphological roots.
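
The Boolean and frequency-based representations can be sketched in a few lines. The two sample documents, the tiny stop-word list and the function names are assumptions for the example; stemming is left out to keep the sketch short.

```python
import re
from collections import Counter

# Tiny assumed stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "and", "is", "of", "on"}

def tokenize(text):
    """Lowercase the text, strip punctuation and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]

def bag_of_words(docs):
    """Return the vocabulary plus frequency and Boolean document vectors."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    # Frequency features: how often each word occurs in the document.
    freq = [[c[w] for w in vocab] for c in counts]
    # Boolean features: whether each word occurs at all.
    boolean = [[int(c[w] > 0) for w in vocab] for c in counts]
    return vocab, freq, boolean

docs = ["The web is huge.", "Mining the web, mining text."]
vocab, freq, boolean = bag_of_words(docs)
```

Word order is discarded: the second document is reduced to the counts mining = 2, text = 1, web = 1.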
Structured Web Data
Structured Web Data

Image Source: http://slideplayer.com/slide/6356153/

Structured Content

source : http://www.3idatascraping.com/how-to-use-data-scraping-to-mine-structured-data-from-the-unstructured-data.php
Geonames geographical gazetteer

source : http://www.geonames.org/
DBpedia (structured data from Wikipedia and other sources)

source : http://lod-cloud.net/
Wordnet - Structured information for NLP
● WordNet® is a large lexical database of English.
● Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.
● Synsets are interlinked by means of conceptual-semantic and lexical relations.

source : http://tomkdickinson.co.uk/2017/05/combining-wordnet-and-conceptnet-in-neo4j/
Mining Techniques Using Agent and Database

Image Source: http://slideplayer.com/slide/6356153/


Agent-Based Approach
● Intelligent search agents: agents developed to search for relevant information, using characteristics of a domain to organize and interpret the discovered information.
● Information filtering/categorization: uses various information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them, e.g. HyPursuit, BO (Bookmark Organizer).
● Development of sophisticated AI systems acting on behalf of users autonomously or semi-autonomously to discover and organize information.
Database Approaches
Used for transforming unstructured data into more structured and high-level collections of
resources, such as in relational databases, and using standard database querying mechanisms and
data mining techniques to access and analyze this information.

● Multilevel-Databases
○ Lowest level: semi-structured information is kept
○ Higher levels: generalizations from lower levels organized into relations and objects.
● Web-Query Systems
○ Web-based query systems and languages that use standard database query languages such as SQL, as well as natural language processing, to extract data.
Typical Crawler

Img Source: https://www.researchgate.net/figure/Typical-high-level-architecture-of-a-Web-crawler_fig4_228548882

Text Mining - Brief

Img source: https://edumine.wordpress.com/2015/09/14/part-of-speech-tags-in-text-mining/
https://www.amazon.com/Amazon-Video
source : https://www.slideshare.net/asdkfjqlwef/statistical-text-mining-introduction-florian-leitner
Typical NLP Pipeline
● Pre-processing
○ Sentence splitting
○ Tokenization
○ Remove stopwords - “and”, “a”, “the”
○ Stemming - make “bringing” to “bring”
○ Lemmatize - make “cats” to “cat”
○ Spelling correction
○ Lowercase words
● Feature Extraction
○ Term frequency (TF)
○ Document frequency (DF)
○ TF-IDF values
○ Bag-of-words model
○ Parts of speech tagging - mark the nouns, verbs, adjectives in a sentence
● Algorithm (Training)
○ Supervised: Neural Networks, SVMs, Decision Trees
○ Unsupervised: Clustering
● Tasks
○ Named Entities
○ Keyphrase Extraction
○ Topic Modeling
○ Text summarization
○ Relation Extraction
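
The TF, DF and TF-IDF features named in the pipeline can be illustrated with a short sketch. It assumes documents that are already tokenized and non-empty, and it uses one common weighting, tf = count / document length and idf = log(N / df); other variants exist and the slides do not say which one is meant.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document.

    tf  = occurrences of the term / number of tokens in the document
    idf = log(number of documents / documents containing the term)
    """
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical pre-processed documents.
docs = [["web", "mining", "web"], ["text", "mining"]]
weights = tf_idf(docs)
```

A term that occurs in every document, such as "mining" here, gets weight 0, while "web" is frequent in one document and absent from the other, so it scores highest.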
NLP pipeline figure source : http://bytecubed.com/natural-language-processing-for-everyday-people/
Who is using text mining?

Text mining with Web data is used for:
● Trend Analysis
● Machine Translation
● Spam Filtering
● Speech Recognition

source : https://codecanyon.net/item/easy-google-translate/21220197, https://www.youtube.com/watch?v=Kor-A3mGRnA, https://irchs.org/google-tools/gmail-logo/
Quality insights through social media mining
● Twitter brand monitoring through sentiment analysis of customer
tweets
● Customer loyalty analysis by extracting sentiments and topics from
user posts on Facebook, Twitter, Instagram
● Disease or epidemic outbreaks from tweets
● Monitoring signs of mental health problems in users from their tweets
● Analyzing social networks for election trends
● New business ventures using big data technologies, visualization
dashboards, social media mining
● Digital marketing
● Customer acquisition and customer retention
● Predicting business sales
● Finding latest trends and patterns in population

source : http://www.xsjjys.com/facebook-wallpapers.html
Tracking Flu and People Movement from Twitter

Source: http://www.youtube.com/watch?v=rUuPBfEkiJs
“You Are What You Tweet” : Analyzing Twitter for Public Health
Authors : Paul, Michael J., and Mark Dredze.

AAAI Publications, Fifth International AAAI Conference on Weblogs and Social Media, 2011

1. Analyzes Twitter data to find public health characteristics.
2. Applies the Ailment Topic Aspect Model over sentiments.
3. Uses a prior distribution built from articles related to ailments.
4. Measures behavioral risk factors, localizes illness by geography, and analyzes symptoms and medication usage.
5. Quantitative public health data.
6. Qualitative evaluations.
Ailment Topic Aspect Model
LDA (Latent Dirichlet Allocation) - Each document is a mixture of topics. Each
word is attributable to one of the document's topics.
https://dashboard.symplur.com/dashboard
Social media mining - Approach
A relatively new domain closely related to web mining: Web content mining + Web usage mining.

Capturing the data -

1. REST APIs exposed by platforms (Twitter, Facebook, etc.)
2. Web crawling

Data architecture -

● Primarily big data architecture: petabytes of data on social media (TBs generated daily: 2.7 billion FB likes and 98 million posts per day)
● Can be a real-time or batch processing system
Social-media/web mining in news
1. Facebook & Cambridge Analytica - Using social media data to influence elections.
2. China - Using social media data to create credit profiles of users.
3. USA - Checking the social media history of visa applicants as part of background checks.
4. Customized advertisements based on social media content.
5. LiveRamp - Customer data platform (aggregates all data from online/offline customer interaction points).
Privacy Issues

• Privacy is a sensitive topic which has been attracting a lot of attention recently
due to rapid growth of ecommerce and social media.
• Users want to maintain strict anonymity on the Web.
• On the other hand, site administrators are interested in finding out the
demographics of users as well as the usage statistics of different sections of their
Web site.
• The main challenge is to come up with guidelines and rules such that site
administrators can perform analytics on the usage data without compromising
the identity of an individual user.

Privacy Issues
• Furthermore, there should be strict regulations to prevent the usage data from
being exchanged/sold to other sites.
• The users should be made aware of the privacy policies followed by any given
site.

Source: https://madspace.nl/privacy-jokes/
