Web and Text Mining
https://ieeexplore.ieee.org/document/5485404/
Issues
● Web data sets can be very large
○ Tens to hundreds of terabytes
● Cannot mine on a single server
○ Need large farms of servers
● Proper organization of hardware and software to mine multi-terabyte data sets
● Difficulty in finding relevant information
● Extracting new knowledge from the web
Opportunities
The Web offers unprecedented opportunities for data mining
https://www.researchgate.net/figure/Taxonomy-of-Web-Mining-Source-Kavita-et-al-2011_fig1_282357293
Web Structure Mining
What is Web Structure Mining?
The structure of a typical Web graph consists of Web pages as nodes and hyperlinks as edges
connecting two related pages.
[Figure: Web documents connected by hyperlinks]
Source: https://www.slideshare.net/AmirFahmideh/web-mining-structure-mining
Web Structure Terminology
• Web-graph: A directed graph that represents the Web.
• Shortest path: Of all the paths between nodes p and q, the one with the smallest
length, i.e. the fewest links on it.
• Diameter: The maximum of the shortest-path lengths over all pairs of nodes
p and q in the Web-graph.
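The two definitions above can be made concrete with a breadth-first search. This is a minimal sketch, assuming the Web-graph fits in memory as a dictionary; the four-page graph and the function names are hypothetical, and the diameter here only considers pairs where q is reachable from p.

```python
from collections import deque

def shortest_path_lengths(graph, source):
    """Breadth-first search: number of links on the shortest path
    from source to every page reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def diameter(graph):
    """Largest shortest-path length over all ordered pairs (p, q)
    where q is reachable from p."""
    return max(
        d
        for p in graph
        for d in shortest_path_lengths(graph, p).values()
    )

# Hypothetical Web-graph: page -> pages it links to.
web_graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["A"],
}

print(shortest_path_lengths(web_graph, "A"))
print(diameter(web_graph))
```

Because the graph is directed, the shortest path from B to C (B, D, A, C) is longer than the one from C to B would be in an undirected version of the same graph.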
LinkedIn: Shortest Path Example
Web Structure Mining (cont.)
● This type of mining can be performed either at the document level (intra-page) or at the
hyperlink level (inter-page).
● Research at the hyperlink level is called hyperlink analysis.
● The hyperlink structure can be used to retrieve useful information on the Web, for example with:
● PageRank
● Hubs and Authorities - HITS
Google PageRank Algorithm
• PageRank (PR) is an algorithm used by Google Search to rank web pages in
its search engine results.
• PR(u) = Σ PR(v) / L(v), summed over all pages v in Bu
i.e. the PageRank value for a page u depends on the PageRank value of each
page v contained in the set Bu (the set containing all pages linking to page u),
divided by the number L(v) of links from page v.
PageRank
● Used to discover the most important pages on the web.
● Prioritize pages returned from search by looking at web structure.
● The importance of a page is calculated from the number of pages that point to it (backlinks).
● Weighting is used to give more importance to backlinks coming from important pages.
● PR(p) = (1-d) + d (PR(1)/N1 + … + PR(n)/Nn)
○ PR(i): PageRank of a page i that points to the target page p.
○ Ni: Number of links coming out of page i.
○ d: damping factor, a constant between 0 and 1.
○ (1-d): the probability that a random surfer jumps to a random page instead of following a link. With this form of the formula the PageRank values average to one; dividing this term by the total number of pages makes them sum to one.
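The formula above can be evaluated by repeated substitution until the values settle. The following is a minimal sketch, not Google's implementation: the three-page graph, the value d = 0.85, and the fixed iteration count are assumptions for illustration, and pages without outgoing links are not handled.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(p) = (1 - d) + d * sum(PR(i) / N_i) over all pages i
    that link to p. `links` maps each page to the pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # assumed starting value for every page
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # Contribution of every backlink i -> p, split over i's N_i outlinks.
            incoming = sum(pr[i] / len(links[i]) for i in pages if p in links[i])
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical Web graph in which every page has at least one outgoing link.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this example page C ranks highest because it receives backlinks from both A and B, and A ranks above B because its single backlink comes from the important page C.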
PageRank (cont.)
https://en.wikipedia.org/wiki/PageRank
Hubs and Authorities
● Authoritative pages
○ The authors define an authority as the best source of information for a request.
○ Highly important pages.
● Hub pages
○ Contain links to highly important pages.
[Figure: hubs and authorities]
HITS (Hyperlink-Induced Topic Search)
● Iterative algorithm for mining the Web graph to identify the topic hubs and authorities.
● Algorithm:
○ Consider a matrix A with rows and columns corresponding to web pages: Aij = 1
if page i links to page j, and 0 otherwise.
○ Let a and h be vectors whose ith components are the degree of authority and the
hubbiness of the ith page.
○ The hubbiness of a page is defined as the sum of the authorities of all the pages it links to, i.e.
h = A × a.
○ The authority of a page is defined as the sum of the hubbiness of all the pages that link to it, i.e.
a = At × h, where At is the transpose of A.
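The two update rules above can be alternated until the vectors stabilize. This is a minimal sketch in plain Python; the three-page adjacency matrix is hypothetical, and rescaling each vector so its largest component is 1 is an assumption added here to keep the values bounded (the slide does not specify a normalization).

```python
def hits(A, iterations=50):
    """Alternate a = A^T h and h = A a on adjacency matrix A,
    where A[i][j] == 1 means page i links to page j."""
    n = len(A)
    h = [1.0] * n  # hubbiness of each page
    a = [1.0] * n  # authority of each page
    for _ in range(iterations):
        # Authority: sum of hubbiness of all pages that link to page j.
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        # Hubbiness: sum of authority of all pages that page i links to.
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        # Assumed normalization: scale so the largest component is 1.
        a_max = max(a) or 1.0
        h_max = max(h) or 1.0
        a = [x / a_max for x in a]
        h = [x / h_max for x in h]
    return h, a

# Hypothetical graph: page 0 links to pages 1 and 2, page 1 links to page 2.
A = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]

h, a = hits(A)
print("hubs:", [round(x, 3) for x in h])
print("authorities:", [round(x, 3) for x in a])
```

Page 0 comes out as the strongest hub because it links to both other pages, and page 2 as the strongest authority because both other pages link to it.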
Web Structure Applications
Web Structure is a useful source for extracting information such as
1. Quality of Web Page - Ranking of web pages
2. Interesting Web Structures - Graph patterns like Co-citation, Social
choice, etc.
3. Web Page Classification - Classifying web pages according to
various topics
4. Finding Related Pages - Given one relevant page, find all related pages
5. Detection of duplicate pages - Detection of near-mirror sites to
eliminate duplication
Web Usage Mining
● Web usage mining: automatic discovery of patterns in clickstreams
and associated data collected or generated as a result of user
interactions with one or more Web sites.
● Goal: analyze the behavioral patterns and profiles of users interacting with
a Web site.
● The discovered patterns are usually represented as collections of pages,
objects, or resources that are frequently accessed by groups of users with
common interests.
● Data in Web Usage Mining:
a. Web server logs
b. Site contents
c. Data about the visitors, gathered from external channels
Web Usage Mining Phases
Raw server log → Preprocessed data → Rules and patterns → Interesting knowledge
Data Preparation
● Data cleaning
○ Irrelevant entries are removed by checking the suffix of the URL name; for example, all log
entries with filename suffixes such as gif, jpeg, etc. can be dropped
● User identification
○ If a page is requested that is not directly linked to the previous pages, multiple users are
assumed to exist on the same machine
○ Other heuristics involve using a combination of IP address, machine name, browser agent,
and temporal information to identify users
● Transaction identification
○ A transaction is all of the page references made by a user during a single visit to a site
○ The size of a transaction can range from a single page reference to all of the page references
in a user session
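The three preparation steps can be sketched on a toy log. Everything specific here is an assumption for illustration: the record fields (`ip`, `agent`, `time`, `url`), the list of ignored suffixes, the use of IP address plus browser agent as the user key, and the 30-minute gap used to separate visits.

```python
from collections import defaultdict

# Assumed suffixes of embedded resources that do not represent page views.
IGNORED_SUFFIXES = (".gif", ".jpeg", ".jpg", ".png", ".css", ".js")
# Assumed gap (in seconds) after which a new visit is considered to start.
SESSION_TIMEOUT = 30 * 60

def clean(entries):
    """Data cleaning: drop log entries whose URL ends in an ignored suffix."""
    return [e for e in entries if not e["url"].lower().endswith(IGNORED_SUFFIXES)]

def identify_users(entries):
    """User identification heuristic: same IP address and browser agent."""
    users = defaultdict(list)
    for e in entries:
        users[(e["ip"], e["agent"])].append(e)
    return users

def identify_transactions(user_entries):
    """Transaction identification: split one user's requests into visits
    wherever two consecutive requests are further apart than the timeout."""
    transactions, current, last_time = [], [], None
    for e in sorted(user_entries, key=lambda e: e["time"]):
        if last_time is not None and e["time"] - last_time > SESSION_TIMEOUT:
            transactions.append(current)
            current = []
        current.append(e["url"])
        last_time = e["time"]
    if current:
        transactions.append(current)
    return transactions

# Hypothetical, already parsed server log (time in seconds).
log = [
    {"ip": "10.0.0.1", "agent": "Firefox", "time": 0, "url": "/index.html"},
    {"ip": "10.0.0.1", "agent": "Firefox", "time": 5, "url": "/logo.gif"},
    {"ip": "10.0.0.1", "agent": "Firefox", "time": 60, "url": "/products.html"},
    {"ip": "10.0.0.1", "agent": "Chrome", "time": 90, "url": "/index.html"},
    {"ip": "10.0.0.1", "agent": "Firefox", "time": 7200, "url": "/index.html"},
]

sessions = {
    user: identify_transactions(entries)
    for user, entries in identify_users(clean(log)).items()
}
print(sessions)
```

The Chrome request from the same IP address is treated as a second user, and the Firefox request two hours later starts a new transaction.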
Pattern Discovery Tasks
● Clustering and Classification
○ Clustering of users helps to discover groups of users with similar navigation patterns =>
provide personalized Web content
○ Clustering of pages helps to discover groups of pages having related content => useful for
search engines
○ E.g. clients who often access webminer software products tend to be from educational
institutions.
○ Clients who placed an online order for software tend to be students in the 20-25 age group
who live in the United States.
○ 75% of clients who download software and visit between 7:00 and 11:00 pm on weekends
are engineering students.
Pattern Discovery Tasks
● Sequential Patterns:
○ Extract frequently occurring intersession patterns in which the presence of a set of items
is followed by another item in time order
○ Used to predict future user visit patterns => placing ads or recommendations
● Association Rules:
○ Discover correlations among pages accessed together by a client
○ Help restructure the Web site
○ Develop e-commerce marketing strategies - Grocery Mart
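A very small version of sequential pattern discovery counts how often one page is visited before another. This is a simplified sketch restricted to ordered pairs of pages; the sessions, the function name, and the 0.5 support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_sequences(sessions, min_support):
    """Return ordered page pairs (a, b), with a visited before b, whose
    support (fraction of sessions containing the pattern) meets the threshold."""
    counts = Counter()
    for session in sessions:
        # combinations() preserves visit order; the set counts each
        # pattern at most once per session.
        pairs = {(a, b) for a, b in combinations(session, 2) if a != b}
        counts.update(pairs)
    n = len(sessions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical sessions, each a time-ordered list of visited pages.
sessions = [
    ["home", "products", "cart"],
    ["home", "cart"],
    ["products", "home", "cart"],
    ["home", "products"],
]

patterns = frequent_sequences(sessions, min_support=0.5)
for pair, support in sorted(patterns.items()):
    print(pair, support)
```

A pattern such as ("home", "cart") with high support suggests that a recommendation or ad for the cart page could be shown to users who have just visited the home page.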
Pattern Analysis Tasks
● Pattern Analysis is the final stage of WUM, which involves the validation and
interpretation of the mined pattern
● Validation:
○ Eliminates irrelevant rules or patterns and extracts the interesting rules or
patterns from the output of the pattern discovery process
● Interpretation:
○ The output of mining algorithms is mainly in mathematical form and is not suitable for
direct human interpretation
Web Usage Mining - Pattern Discovery Tasks
• Statistical Analysis
• Clustering
• Classification
• Association Rules
Web Usage Mining - Pattern Discovery Tasks (Cont.)
Statistical Analysis:
• Different kinds of descriptive statistical analyses (frequency, mean, median, etc.)
on variables such as page views, viewing time and length of a navigational path
give useful knowledge.
Clustering:
• Clustering is a technique to group together a set of items having similar
characteristics.
• In the Web Usage domain, there are two kinds of interesting clusters to be
discovered:
• Clustering of users: discover groups of users with similar navigation
patterns => Perform market segmentation in e-commerce.
• Clustering of pages: discover groups of pages having related content => Useful
for search engines.
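As an illustration of clustering users by navigation pattern, the sketch below groups users whose sets of visited pages overlap strongly. It is a simple greedy, order-dependent grouping based on Jaccard similarity, chosen here for brevity rather than taken from the slides; the user data and the 0.5 threshold are hypothetical.

```python
def jaccard(a, b):
    """Similarity of two page sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_users(visits, threshold):
    """Greedy grouping: a user joins the first cluster that contains a
    sufficiently similar user, otherwise starts a new cluster."""
    clusters = []
    for user, pages in visits.items():
        for cluster in clusters:
            if any(jaccard(pages, visits[other]) >= threshold for other in cluster):
                cluster.append(user)
                break
        else:
            clusters.append([user])
    return clusters

# Hypothetical users and the sets of pages each one visited.
visits = {
    "u1": {"/home", "/laptops", "/cart"},
    "u2": {"/home", "/laptops"},
    "u3": {"/blog", "/about"},
    "u4": {"/blog", "/about", "/home"},
}

clusters = cluster_users(visits, threshold=0.5)
print(clusters)
```

The same function can cluster pages instead of users if `visits` maps each page to the set of users who viewed it.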
Web Usage Mining - Pattern Discovery Tasks (Cont.)
Classification: Classification is the task of mapping a data item into one of several
predefined classes.
• In the Web domain, one is interested in developing a profile of users
belonging to a particular class or category. Techniques include decision tree
classifiers, naive Bayesian classifiers, neural networks, SVMs, etc.
Association Rules:
• Given: a database of transactions, where each transaction is a list of items.
• Find: all rules that associate the presence of one set of items with that of
another set of items.
• For Web mining, association rules refer to sets of pages accessed together with a support
value exceeding some specified threshold.
• They are applicable to business and marketing applications, and can help Web
designers restructure their Web site.
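The support threshold can be illustrated on page-pair rules. This is a minimal sketch limited to rules between two single pages; the transactions and thresholds are hypothetical, and confidence (the fraction of transactions containing the left-hand page that also contain the right-hand page) is the usual companion measure, added here as an assumption since the slide only mentions support.

```python
from itertools import combinations

def association_rules(transactions, min_support, min_confidence):
    """Return rules (lhs, rhs, support, confidence) between pairs of pages."""
    n = len(transactions)
    sets = [set(t) for t in transactions]

    def support(items):
        # Fraction of transactions that contain every page in `items`.
        return sum(1 for t in sets if items <= t) / n

    pages = sorted(set().union(*sets))
    rules = []
    for a, b in combinations(pages, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        # Try the rule in both directions: a => b and b => a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules

# Hypothetical transactions: pages accessed together in one visit.
transactions = [
    ["home", "products", "cart"],
    ["home", "products"],
    ["home", "about"],
    ["products", "cart"],
]

rules = association_rules(transactions, min_support=0.5, min_confidence=0.8)
for lhs, rhs, sup, conf in rules:
    print(f"{lhs} => {rhs} (support {sup:.2f}, confidence {conf:.2f})")
```

Here the only rule that passes both thresholds is cart => products: every visit that reached the cart page also viewed the products page.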
Source: http://www3.cs.stonybrook.edu/~cse634/L8ch5assoc.pdf
Web Usage Mining - Pattern Analysis
• Last step in the overall Web Usage mining process.
• Motivation: filter out uninteresting rules or patterns from the set found in the
pattern discovery phase.
• The exact analysis methodology is governed by the application for which Web mining
is done.
• The most common forms of pattern analysis are:
– A knowledge query mechanism (like SQL).
– Visualization techniques (like graphing patterns or assigning colors to
different values) that highlight overall patterns or trends.
Web Usage Mining Application: User Profiles
• The Web has taken user profiling to completely new levels.
• In a ‘brick-and-mortar’ store, data collection happens only at the checkout
counter (‘point-of-sale’).
• In an online store, the complete click-stream is recorded:
– Provides a detailed record of every single action taken by the user.
– Allows creating a detailed user profile
• Most organizations build profiles using user behavior limited to their own
sites (IMDB, Netflix).
• Web-wide profiling also exists (Facebook, Google)
Web Usage Mining Application: User Profiles
Amazon’s Recommendation Systems:
Source: https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf
[Figure: Text, Audio, Image, Video]
Unstructured Web Data Mining
Source: http://www.3idatascraping.com/how-to-use-data-scraping-to-mine-structured-data-from-the-unstructured-data.php
GeoNames geographical gazetteer
Source: http://www.geonames.org/
DBpedia (structured data from Wikipedia and other sources)
Source: http://lod-cloud.net/
WordNet - Structured information for NLP
● WordNet® is a large lexical database of English.
● Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms
(synsets), each expressing a distinct concept.
● Synsets are interlinked by means of conceptual-semantic and lexical relations.
Source: http://tomkdickinson.co.uk/2017/05/combining-wordnet-and-conceptnet-in-neo4j/
Mining Techniques Using Agent and Database
● Multilevel databases
○ Lowest level - semi-structured information is kept.
○ Higher levels - generalizations from lower levels, organized into relations and objects.
● Web query systems
○ Web-based query systems and languages have been developed that use, for example, SQL or
natural language processing (NLP) to extract data.
Typical Crawler
NLP Pipeline
[Figure: Sentence Splitting, Tokenization, Term Frequency (TF), Keyphrase Extraction]
Source: http://bytecubed.com/natural-language-processing-for-everyday-people/
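Three of the steps named in the NLP pipeline (sentence splitting, tokenization, term frequency) can be sketched with the standard library. The regular expressions are deliberately naive assumptions that only suit simple English text, and the sample text and function names are hypothetical; a real pipeline would normally rely on a dedicated NLP library.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive sentence splitting: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Naive tokenization: lowercase runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def term_frequency(tokens):
    """Term frequency (TF): occurrences of each term divided by the token count."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

text = "Web mining finds patterns. Text mining finds patterns in text!"

sentences = split_sentences(text)
tokens = [tok for s in sentences for tok in tokenize(s)]
tf = term_frequency(tokens)

print(sentences)
print(sorted(tf.items(), key=lambda kv: -kv[1])[:3])
```

Terms with a high TF are natural candidates for keyphrase extraction, although in practice common function words would need to be filtered out first.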
Trend Analysis
Speech Recognition
Source: http://www.xsjjys.com/facebook-wallpapers.html
Tracking Flu and People Movement from Twitter
Source: http://www.youtube.com/watch?v=rUuPBfEkiJs
“You Are What You Tweet” : Analyzing Twitter for Public Health
Authors: Paul, Michael J., and Mark Dredze.
AAAI Publications, Fifth International AAAI Conference on Weblogs and Social Media, 2011
2. Web crawling
Data architecture - primarily big data architecture: petabytes of data on social media
(terabytes generated daily: 2.7 billion Facebook likes and 98 million posts per day)
• Privacy is a sensitive topic that has been attracting a lot of attention recently
due to the rapid growth of e-commerce and social media.
• Users want to maintain strict anonymity on the Web.
• On the other hand, site administrators are interested in finding out the
demographics of users as well as the usage statistics of different sections of their
Web site.
• The main challenge is to come up with guidelines and rules such that site
administrators can perform analytics on the usage data without compromising
the identity of an individual user.
Privacy Issues
• Furthermore, there should be strict regulations to prevent the usage data from
being exchanged/sold to other sites.
• The users should be made aware of the privacy policies followed by any given
site.
Source: https://madspace.nl/privacy-jokes/