Web Page Classification: Features and Algorithms
Xiaoguang Qi and Brian D. Davison
Department of Computer Science & Engineering
Lehigh University, June 2007
Presented by Mr. Pachara Chutisawaeng
Department of Computer Science
Mahidol University, July 2009
Agenda
• Webpage classification significance
• Introduction
• Background
• Applications of web classification
• Features
• Algorithms
• Blog classification
• Conclusion
Webpage classification significance
Let’s go back in history about 10 years.
The Evolution of Websites: How 5 popular websites have changed
Apple - present
Apple – 10 Years ago!
Amazon - present
Amazon – 9 Years ago
CNN - present
CNN – 8 Years ago
Yahoo! - present
Yahoo! – 12 Years ago
Webpage classification significance
What is different between past and present? What changed?
Nike - present
Nike – 8 Years ago
Webpage classification significance
What is different between past and present? What changed?
• Flash animation
• JavaScript
• Video clips and embedded objects
• Advertising: Google AdSense, Yahoo!
Introduction
Webpage classification, or webpage categorization, is the process of assigning a webpage to one or more category labels, e.g. “News”, “Sport”, “Business”.
Goal: the authors survey existing web classification techniques to find new areas for research, including web-specific features and algorithms that have been found useful for webpage classification.
Introduction
What will you learn?
• A detailed review of useful features for web classification
• The algorithms used
• Future research directions
Webpage classification can help improve the quality of web search.
Knowing these things can help you improve your SEO skills.
Each search engine keeps its techniques secret.
Background
The general problem of webpage classification can be divided into:
• Subject classification: the subject or topic of a webpage, e.g. “Adult”, “Sport”, “Business”.
• Function classification: the role that the webpage plays, e.g. “Personal homepage”, “Course page”, “Admission page”.
Background
• Based on the number of classes in the problem, classification can be divided into binary classification and multi-class classification.
• Based on the number of classes that can be assigned to an instance, classification can be divided into single-label classification and multi-label classification.
Types of classification
Applications of web classification
Constructing and expanding web directories (web hierarchies)
• Yahoo!
• ODP, the “Open Directory Project”: http://www.dmoz.org
How do they do it?
Keyworder
Applications of web classification
How do they do it?
• By human effort: in July 2006, it was reported that there were 73,354 editors in the dmoz ODP.
• As the web changes and continues to grow, “automatic creation of classifiers from web corpora based on user-defined hierarchies” was introduced by Huang et al. in 2004.
The starting point of this presentation!
Applications of web classification
Improving quality of search results
• Categories view
• Ranking view
Categories and Ranking View
Applications of web classification
Improving quality of search results
• Categories view
• Ranking view
  • In 1998, Page and Brin developed the link-based ranking algorithm called PageRank.
  • It ranks pages using hyperlinks, without considering the topic of each page.
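To make the point about PageRank ignoring topics concrete, here is a minimal power-iteration sketch. The function name `pagerank`, the damping value 0.85, the iteration count, and the toy link graph are assumptions for illustration, not details from the slides. Note that the computation only ever looks at who links to whom; page topics never enter it.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to (toy input).
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its rank evenly among its out-links.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(toy_links).items()):
        print(page, round(score, 3))
```

A topic-aware ranking would need extra input, such as a category label per page, which is where webpage classification comes in.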
Google – 11 Years ago
Applications of web classification
Helping question answering systems
Yang and Chua 2004 suggest finding answers to list questions, e.g. “name all the countries in Europe”.
How did it work?
• Queries are formulated and sent to search engines.
• The results are classified into four categories:
  • Collection pages (contain lists of items)
  • Topic pages (represent an answer instance)
  • Relevant pages (support an answer instance)
  • Irrelevant pages
• After that, topic pages are clustered, and answers are extracted from the clusters.
Question answering systems could benefit from web classification in both accuracy and efficiency.
Applications of web classification
Other applications
• Web content filtering
• Assisted web browsing
• Knowledge base construction
Features
In this section, we review the types of features that are useful in webpage classification research.
The most important characteristic that makes webpage classification different from plain-text classification is the hyperlink: <a>…</a>
We classify features into:
• On-page features: located directly on the page
• Neighbor features: found on pages related to the page to be classified
Features: On-page
Textual content and tags
• N-gram features
  • Imagine two different documents. One contains the phrase “New York”. The other contains the terms “New” and “York” separately. A 2-gram feature distinguishes them.
  • In Yahoo!, 5-gram features were used.
• HTML tags or DOM
  • Title, headings, metadata and main text
  • Each of them is assigned an arbitrary weight.
  • Nowadays most websites use nested lists (<ul><li>), which really help in webpage classification.
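The two ideas on this slide, word n-grams and per-tag weights, can be sketched as follows. The helper names `word_ngrams` and `weighted_term_counts`, the whitespace tokenizer, and the example weights are assumptions for illustration; the slides do not say which weights were used.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a text (lowercased, whitespace tokens)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def weighted_term_counts(fields, weights):
    """Count terms, scaling each occurrence by the weight of its HTML field.

    fields maps a field name (e.g. "title", "body") to its text;
    weights maps a field name to an arbitrary weight (default 1.0).
    """
    counts = Counter()
    for field, text in fields.items():
        weight = weights.get(field, 1.0)
        for token in text.lower().split():
            counts[token] += weight
    return counts


if __name__ == "__main__":
    # The phrase "New York" yields the 2-gram "new york";
    # the separate terms "York" and "new" do not.
    print(word_ngrams("I love New York", 2))
    print(word_ngrams("York is new", 2))
    page = {"title": "New York", "body": "york hotels"}
    print(weighted_term_counts(page, {"title": 3.0, "body": 1.0}))
```

With unigram features the two example documents look alike, while the 2-gram "new york" appears only in the first one.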
Features: On-page
Textual content and tags
• URL
  • Kan and Thi 2004 demonstrated that a webpage can be classified based on its URL.
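As a rough illustration of URL-based features, a URL can be split into tokens that a classifier could use in place of page text. This is only a sketch: the function name `url_tokens` and the splitting rule are assumptions, not the segmentation method used in the cited work.

```python
import re


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [token for token in re.split(r"[^a-z0-9]+", url.lower()) if token]


if __name__ == "__main__":
    print(url_tokens("http://www.example.com/sports/football-news.html"))
```

Tokens such as "sports" and "football" already hint at the page's topic before the page itself is downloaded.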
Features: On-page
Visual analysis
• Each webpage has two representations:
  • The text, represented in HTML
  • The visual representation rendered by a web browser
• Most approaches focus on the text while ignoring the visual information, which is useful as well.
• Kovacevic et al. 2004
  • Each webpage is represented as a hierarchical “visual adjacency multigraph”.
  • In the graph, each node represents an HTML object and each edge represents a spatial relation in the visual representation.
Visual analysis
Features: Neighbors Features
Motivation
• The useful on-page features discussed previously may be missing or unrecognizable on a particular page.
Example webpage which has few useful on-page features
Features: Neighbors features
Underlying assumptions
• When exploring the features of neighbors, some assumptions are implicitly made in existing work.
• The presence of many “Sport” pages in the neighborhood of P-a increases the probability of P-a being in “Sport”.
• Chakrabarti et al. 2002 and Menczer 2005 showed that linked pages were more likely to have terms in common.
Neighbor selection
• Existing research mainly focuses on pages within two steps of the page to be classified, that is, at a distance no greater than two.
• There are six types of neighboring pages: parent, child, sibling, spouse, grandparent and grandchild.
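The six neighbor types can be read off a link graph. The sketch below assumes the usual definitions (a sibling shares a parent with the page, a spouse shares a child); the function name `neighbors` and the toy graph are hypothetical.

```python
def neighbors(links, page):
    """Return the six neighbor sets of `page` within a radius of two.

    links maps each page to the pages it links to (toy input).
    """
    pages = set(links) | {t for ts in links.values() for t in ts}
    out_links = {p: set(links.get(p, ())) for p in pages}
    in_links = {p: set() for p in pages}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].add(source)

    def step(sources, table):
        # Follow one more hop from every source, excluding the page itself.
        reached = set()
        for source in sources:
            reached |= table[source]
        return reached - {page}

    parents = in_links[page] - {page}
    children = out_links[page] - {page}
    return {
        "parent": parents,                         # pages linking to the page
        "child": children,                         # pages the page links to
        "sibling": step(parents, out_links),       # other children of parents
        "spouse": step(children, in_links),        # other parents of children
        "grandparent": step(parents, in_links),    # parents of parents
        "grandchild": step(children, out_links),   # children of children
    }


if __name__ == "__main__":
    toy = {"g": ["p"], "p": ["x", "s"], "q": ["c"], "x": ["c", "d"], "c": ["e"]}
    for kind, found in neighbors(toy, "x").items():
        print(kind, sorted(found))
```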
Neighbors within a radius of two
Features: Neighbors features
Neighbor selection (cont.)
• Furnkranz 1999
  • The text on the parent pages surrounding the link is used to train a classifier, instead of the text on the target page.
  • A target page will therefore be assigned multiple labels. These labels are then combined by some voting scheme to form the final prediction of the target page’s class.
• Sun et al. 2002
  • Using the text on the target page together with page titles and anchor text from parent pages can improve classification compared with a pure text classifier.
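The slide leaves the voting scheme unspecified. A minimal sketch of the simplest option, plain majority voting over the labels predicted from each parent's link context, might look like this; the function name `majority_vote` and the example labels are assumptions.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted most often across the parent pages."""
    if not labels:
        raise ValueError("at least one predicted label is required")
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    # One predicted label per in-link context of the target page.
    per_link_predictions = ["sport", "news", "sport"]
    print(majority_vote(per_link_predictions))
```

Weighted variants, for example weighting each vote by classifier confidence, follow the same pattern.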
Features: Neighbors features
Neighbor selection (cont.)
Summary
• Parent, child, sibling and spouse pages are all useful in classification; siblings are found to be the best source.
• Using information from neighboring pages may introduce extra noise, so it should be used carefully.
Webpage Classification
Features: Neighbors features
Features
• Labels: assigned by an editor or keyworder
• Partial content: anchor text, the text surrounding the anchor text, titles, headers
• Full content
Among the three types of features, using the full content of neighboring pages is the most expensive; however, it generates better accuracy.
Features: Neighbors features
Utilizing artificial links (implicit links)
• Hyperlinks are not the only choice.
• What is an implicit link?
  • A connection between pages that appear in the results of the same query and are both clicked by users.
• Implicit links can help webpage classification just as hyperlinks do.
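Following the definition above, implicit links can be derived from a click log. The sketch below assumes the log is a list of (query, clicked URL) pairs, which is a made-up format; the function name `implicit_links` is hypothetical as well.

```python
from collections import defaultdict
from itertools import combinations


def implicit_links(click_log):
    """Return unordered page pairs that were clicked for the same query."""
    clicked = defaultdict(set)
    for query, url in click_log:
        clicked[query].add(url)
    links = set()
    for urls in clicked.values():
        # Every pair of pages clicked for one query forms an implicit link.
        for first, second in combinations(sorted(urls), 2):
            links.add((first, second))
    return links


if __name__ == "__main__":
    log = [
        ("cheap flights", "a.com"),
        ("cheap flights", "b.com"),
        ("python", "c.org"),
        ("cheap flights", "a.com"),
    ]
    print(implicit_links(log))
```

The resulting pairs can be treated like hyperlinks when collecting neighbor features.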
Webpage Classification
Discussion: Features
• The results of different approaches are based on different implementations and different datasets, which makes it difficult to compare their performance.
• Sibling pages are even more useful than parents and children.
• The reason may lie in the process of hyperlink creation: a page often acts as a bridge connecting its outgoing links, which are likely to have a common topic.
Webpage Classification
Tip!
Tracking incoming links
How do you know when someone links to you?
Algorithms
Algorithm Approaches for Webpage Classification
Dimension Reduction
Feature weighting
• Plays another important role in webpage classification
• A way of boosting classification by emphasizing the features with better discriminative power
• Special case of weighting: “feature selection”
Dimension Reduction (cont’d): Feature Selection
• A special case of “feature weighting”
• A zero weight is assigned to the eliminated features
• The role:
Dimension Reduction (cont’d): Feature Selection
• Simple approaches
  • First fragment of each document
  • First fragment of web documents in hierarchical classification
• Text categorization approaches
  • Information gain
  • Mutual information
  • Etc.
Feature Selection (cont’d): Simple measures
• Using the first fragment of each document
  • Assumption: a summary is at the beginning of the document
  • Fast and accurate classification for news articles
  • Not satisfactory for other types of documents
• First fragment applied to hierarchical classification of web pages
  • Useful for web documents
Feature Selection (cont’d): Text Categorization Measures
• Using expected mutual information and mutual information
  • Two well-known metrics, used in an approach based on a variation of the k-Nearest Neighbor algorithm
  • Terms are weighted according to the HTML tags they appear in
  • Terms within different tags carry different importance
• Using information gain
  • Another well-known metric
  • It is still not clear which metric is superior for web classification
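As an illustration of one of these metrics, information gain for a single term can be computed as the drop in class entropy after splitting the documents by whether they contain the term. This is the standard textbook definition, sketched under the assumption that each document is a set of tokens; the function names and toy data are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """Entropy reduction obtained by splitting documents on `term`."""
    with_term = [y for doc, y in zip(docs, labels) if term in doc]
    without_term = [y for doc, y in zip(docs, labels) if term not in doc]
    total = len(labels)
    conditional = sum(
        len(part) / total * entropy(part)
        for part in (with_term, without_term)
        if part
    )
    return entropy(labels) - conditional


if __name__ == "__main__":
    docs = [{"goal", "match"}, {"goal", "team"}, {"stock", "market"}, {"stock", "team"}]
    labels = ["sport", "sport", "business", "business"]
    for term in ("goal", "team"):
        print(term, information_gain(docs, labels, term))
```

Feature selection then keeps the terms with the highest scores and gives the rest a zero weight.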
Feature Selection (cont’d): Text Categorization Measures
• Improving the performance of SVM classifiers
  • By aggressive feature selection
  • A measure was developed that can predict selection effectiveness without training and testing classifiers
• Latent Semantic Indexing (LSI), a popular technique
  • In text documents: documents are reinterpreted in a smaller, transformed, but less intuitive space
  • Cons: high computational complexity makes it inefficient to scale
  • In web classification
    • Experiments are based on small datasets (to avoid the cons above)
    • Some work has improved it to be applicable to larger datasets, which still needs further study
Algorithm Approaches for Webpage Classification
Relational Learning
Relational Learning (cont’d): 2 Main Approaches
• Relaxation labeling algorithms
  • Original proposal: image analysis
  • Current usage:
    • Image and vision analysis
    • Artificial intelligence
    • Pattern recognition
    • Web mining
• Link-based classification algorithms
  • Utilizing 2 popular link-based algorithms
    • Loopy belief propagation
    • Iterative classification
Relational Learning (cont’d): Relaxation Labeling Algorithms
Flow of the algorithm
Relaxation Labeling (cont’d): Algorithm variations
• Using a combined logistic classifier based on content and link information
  • Shows improvement over a textual classifier
  • Outperforms a single flat classifier based on both content and link features
• Selecting only the proper neighbors
  • Not all neighbors are qualified
  • The chosen neighbors must be similar enough in content
Relational Learning (cont’d): Link-based Classification Algorithms
• Two popular link-based algorithms:
  • Loopy belief propagation
  • Iterative classification
• Better performance on a web collection than textual classifiers
• During the scientists’ study, a toolkit was implemented
  • Toolkit features: classifies networked data using a relational classifier and a collective inference procedure
  • Demonstrated great performance on several datasets, including web collections
Algorithm Approaches for Webpage Classification
Modifications to traditional algorithms
The traditional algorithms adjusted in the context of webpage classification:
• k-Nearest Neighbors (kNN)
  • Quantifies the distance between the test document and each training document using a dissimilarity measure
  • Cosine similarity or inner product is what most existing kNN classifiers use
• Support Vector Machine (SVM)
Modification Algorithms (cont’d): k-Nearest Neighbors Algorithm
Varieties of modifications:
• Using term co-occurrence in documents
• Using probability computation
• Using “co-training”
k-Nearest Neighbors Algorithm (cont’d): Modification Varieties
• Using term co-occurrence in documents
  • An improved similarity measure
  • The more co-occurring terms two documents have in common, the stronger the relationship between them
  • Better performance than normal kNN (cosine similarity and inner product measures)
• Using probability computation
  • Condition: the probability of a document d being in class c is determined by the distance between d and its neighbors and by the neighbors’ probability of being in c
  • Simple equation: Prob. of d in c = (distance between d and neighbors) × (neighbors’ prob. of being in c)
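A baseline kNN classifier with cosine similarity, extended with a simple similarity-weighted class score in the spirit of the probability computation above, can be sketched as follows. This is an illustrative assumption rather than the cited method: each training neighbor is taken to belong to its class with probability 1, and the names `cosine` and `knn_class_scores` are made up.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (dict-like)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_class_scores(query, training, k=3):
    """Score classes by the summed similarity of the k nearest neighbors.

    training is a list of (term-count vector, label) pairs.
    Scores are normalized so that they sum to 1.
    """
    nearest = sorted(
        ((cosine(query, vector), label) for vector, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = Counter()
    for similarity, label in nearest:
        scores[label] += similarity
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()} if total else {}


if __name__ == "__main__":
    training = [
        (Counter("goal match team".split()), "sport"),
        (Counter("goal score".split()), "sport"),
        (Counter("stock market".split()), "business"),
    ]
    query = Counter("goal team".split())
    print(knn_class_scores(query, training, k=2))
```

Swapping `cosine` for a co-occurrence-based measure would give the first modification on this slide.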
k-Nearest Neighbors Algorithm (cont’d): Modification Varieties (2)
• Using “co-training”
  • Makes use of labeled and unlabeled data
  • Aims to achieve better accuracy
• Scenario: binary classification
  • Classifying the unlabeled instances
    • Two classifiers are trained on different sets of features
    • The predictions of each one are used to train the other
  • Compared with classifying using only labeled instances, co-training can cut the error rate by half
• When generalized to multi-class problems
  • When the number of categories is large, co-training is not satisfactory
  • On the other hand, combining error-correcting output coding (more than enough classifiers in use) with co-training can boost performance
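The co-training loop can be sketched with two very simple classifiers, one per feature view. Everything here is an assumption for illustration: the token-overlap classifier, the confidence measure, the function names, and the rule of moving one example per classifier per round. It only shows the mechanism of each classifier labeling data for the other.

```python
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (tokens, label). Returns per-class token counts."""
    model = defaultdict(Counter)
    for tokens, label in examples:
        model[label].update(tokens)
    return model


def predict(model, tokens):
    """Return (label, confidence); confidence is the margin over the runner-up."""
    scores = {label: sum(counts[t] for t in tokens) for label, counts in model.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_label, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return best_label, best_score - runner_up


def co_train(labeled, unlabeled, rounds=3):
    """labeled: list of ((view1, view2), label); unlabeled: list of (view1, view2)."""
    train1 = [(view1, label) for (view1, _), label in labeled]
    train2 = [(view2, label) for (_, view2), label in labeled]
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        model1, model2 = train(train1), train(train2)
        picks = []
        for model, view in ((model1, 0), (model2, 1)):
            # Each classifier picks the pool example it is most sure about.
            best = max(pool, key=lambda ex: predict(model, ex[view])[1])
            label, confidence = predict(model, best[view])
            if confidence > 0:
                picks.append((view, best, label))
        if not picks:
            break
        for view, example, label in picks:
            # The prediction of one classifier becomes training data for the other.
            if view == 0:
                train2.append((example[1], label))
            else:
                train1.append((example[0], label))
            if example in pool:
                pool.remove(example)
    return train(train1), train(train2)


if __name__ == "__main__":
    labeled = [
        ((["goal", "match"], ["sports", "scores"]), "sport"),
        ((["stock", "market"], ["finance", "news"]), "business"),
    ]
    unlabeled = [
        (["goal", "league"], ["football", "fans"]),
        (["bank", "rates"], ["finance", "daily"]),
    ]
    model1, model2 = co_train(labeled, unlabeled)
    print(predict(model2, ["football"]), predict(model1, ["bank"]))
```

In the toy data, view 1 stands for page text and view 2 for anchor text; the second classifier learns "football" only because the first one labeled that page.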
Modification Algorithms (cont’d): SVM-based Approach
• In classification, both positive and negative examples are required
• SVM-based aim: to eliminate the need for manual collection of negative examples while still retaining similar classification accuracy
SVM-based Approach (cont’d): Flow of the algorithm
Take a Break!
The Internet’s Ad Marketplace
Besides Google AdWords
Algorithm Approaches for Webpage Classification
Hierarchical Classification
Not much research has been done, since most web classification work focuses on same-level (flat) approaches.
Approaches:
• Based on “divide and conquer”
• Error minimization
• Topical hierarchy
• Hierarchical SVMs
• Using the degree of misclassification
• Hierarchical text categorization
Hierarchical Classification (cont’d): Approaches
• Hierarchical classification based on “divide and conquer”
  • Classification problems are split into sub-problems hierarchically
  • More efficient and accurate than the non-hierarchical way
• Error minimization
  • When the lower-level category is uncertain, minimize error by shifting the assignment to the higher-level one
• Topical hierarchy
  • Classify a web page into a topical hierarchy
  • Update the category information as the hierarchy expands
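The divide-and-conquer and error-minimization ideas can be combined in one top-down walk: classify among the children of the current node, and stop at the current (higher-level) category when the decision is uncertain. The keyword-overlap scoring, the `min_margin` threshold, and all names below are assumptions for illustration.

```python
def classify_top_down(tokens, tree, keywords, min_margin=1):
    """Walk the hierarchy from "root", one small classification per level.

    tree maps a category to its child categories; keywords maps a
    category to a set of indicative terms (a stand-in for a real classifier).
    """
    node = "root"
    while tree.get(node):
        scores = sorted(
            ((sum(t in keywords[child] for t in tokens), child) for child in tree[node]),
            reverse=True,
        )
        best_score, best_child = scores[0]
        runner_up = scores[1][0] if len(scores) > 1 else 0
        if best_score - runner_up < min_margin:
            break  # uncertain at this level: keep the higher-level category
        node = best_child
    return node


if __name__ == "__main__":
    tree = {"root": ["sport", "business"], "sport": ["football", "tennis"]}
    keywords = {
        "sport": {"match", "player", "goal", "racket"},
        "business": {"stock", "market"},
        "football": {"goal", "striker"},
        "tennis": {"racket", "serve"},
    }
    print(classify_top_down(["match", "goal", "striker"], tree, keywords))
    print(classify_top_down(["match", "player"], tree, keywords))
```

The second example stays at "sport" because neither "football" nor "tennis" is supported by the page's terms.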
Hierarchical Classification (cont’d): Approaches (2)
• Hierarchical SVMs
  • Observations:
    • Hierarchical SVMs are more efficient than flat SVMs
    • Neither is satisfactory in effectiveness for large taxonomies
    • Hierarchical settings do more harm than good to kNN and naive Bayes classifiers
• Hierarchical classification by the degree of misclassification
  • As opposed to measuring “correctness”
  • Distances are measured between the classifier-assigned classes and the true class
• Hierarchical text categorization
  • A detailed review was provided in 2005
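One simple way to measure the degree of misclassification is the number of edges between the predicted and the true category in the taxonomy. The slides do not say which distance was used, so the path length below is an assumption, as are the function name `tree_distance` and the toy taxonomy.

```python
def tree_distance(parent, a, b):
    """Number of edges on the path between categories a and b.

    parent maps each category to its parent; the root has no entry.
    """
    def ancestors(node):
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path

    path_a, path_b = ancestors(a), ancestors(b)
    depth_in_b = {node: depth for depth, node in enumerate(path_b)}
    for depth_a, node in enumerate(path_a):
        if node in depth_in_b:
            # First shared ancestor: add the steps taken on both sides.
            return depth_a + depth_in_b[node]
    raise ValueError("categories are not in the same tree")


if __name__ == "__main__":
    parent = {"football": "sport", "tennis": "sport", "sport": "root", "business": "root"}
    print(tree_distance(parent, "football", "tennis"))
    print(tree_distance(parent, "football", "business"))
```

Under this measure, predicting "tennis" for a "football" page is a smaller error than predicting "business".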
Algorithm Approaches for Webpage Classification
Combining Information from Multiple Sources
• Different sources are utilized
• Combining link and content information is quite popular
• Common way of combining: treat information from different sources as different (usually disjoint) feature sets, on which multiple classifiers are trained
• The final decision is then generated by combining the classifiers
• The combination generally has the potential to have better knowledge than any single method
Information Combination (cont’d): Approaches
• Voting and stacking
  • Well-developed methods in machine learning
• Co-training
  • Effective in combining multiple sources, since different classifiers are trained on disjoint feature sets
Information Combination (cont’d): Cautions
Please note that:
• The need for additional resources is sometimes a disadvantage
• The combination of two sources is NOT always better than each one separately
Take a Break!
Follow the Trend!!
Everybody RETWEET!!
Follow me on Twitter
Follow pChr
Also my blog: Http://www.PacharaStudio.com
Blog classification
• The word “blog” was originally a short form of “web log”.
• As blogging has gained in popularity in recent years, an increasing amount of research about blogs has also been conducted.
• Blog classification is broken into three types:
  • Blog identification (determining whether a web document is a blog)
  • Mood classification
  • Genre classification
Blog classification
• Elgersma and Rijke 2006
  • Common classification algorithms applied to blog identification, using a number of human-selected features, e.g. “Comments” and “Archives”
  • Accuracy around 90%
• Mihalcea and Liu 2006 classify blogs into two polarities of mood, happiness and sadness (mood classification)
• Nowson 2006 discussed the distinction between three types of blogs (genre classification):
  • News
  • Commentary
  • Journal
Blog classification
• Qu et al. 2006
  • Automatic classification of blogs into four genres:
    • Personal diary
    • News
    • Political
    • Sports
  • Using a unigram tf-idf document representation and naive Bayes classification
  • Qu et al.’s approach can achieve an accuracy of 84%.
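A unigram tf-idf representation can be computed as below. This sketch uses one common variant (relative term frequency times the logarithm of N over document frequency); the exact weighting used by Qu et al. is not given in the slides, and the naive Bayes step is not shown. The function name `tfidf` and the toy posts are assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


if __name__ == "__main__":
    posts = [["my", "diary", "today"], ["election", "news", "today"]]
    for vector in tfidf(posts):
        print(vector)
```

A term such as "today" that occurs in every post gets weight zero, so genre-specific words dominate the representation.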
Conclusion
• Webpage classification is a type of supervised learning problem that aims to categorize webpages into a set of predefined categories based on labeled training data.
• The authors expect that future web classification efforts will certainly combine content and link information in some form.
Conclusion
Future work would be well-advised to:
• Emphasize text and labels from siblings over other types of neighbors.
• Incorporate anchor text from parents.
• Utilize other sources of (implicit or explicit) human knowledge, such as query logs and click-through behavior, in addition to existing labels, to guide classifier creation.
Thank you.
Questions?


Webpage Classification

  • 1. Web Page ClassificationFeature and AlgorithmsXiaoguangQi and Brian D. DavisonDepartment of Computer Science & EngineeringLehigh University, June 2007Presented byMr.Pachara ChutisawaengDepartment of Computer ScienceMahidol University, July 2009
  • 2. AgendaWebpage classification significanceIntroductionBackgroundApplications of web classificationFeaturesAlgorithmsBlog ClassificationConclusion
  • 3. Webpage classification significancePresented byMr.Pachara ChutisawaengDepartment of Computer ScienceMahidol University, July 2009
  • 4. Webpage classification significanceLet’s go back in history about 10 years.The Evolution of Websites: How 5 popular Websites have changed 
  • 6. Apple – 10 Years ago!
  • 8. Amazon – 9 Years ago
  • 10. CNN – 8 Years ago
  • 12. Yahoo! – 12 Years ago
  • 13. Webpage classification significanceWhat’s different between past and present what changed?
  • 15. Nike – 8 Years ago
  • 16. Webpage classification significanceWhat’s different between past and present what changed?Flash animationJava ScriptVideo Clips, Embedded ObjectAdvertise, GG Ad sense, Yahoo!
  • 17. IntroductionPresented byMr.Pachara ChutisawaengDepartment of Computer ScienceMahidol University, July 2009
  • 18. IntroductionWebpage classification or webpage categorization is the process of assigning a webpage to one or more category labels. E.g. “News”, “Sport” , “Business”GOAL: They observe the existing of web classification techniques to find new area for research. Including web-specific features and algorithms that have been found to be useful for webpage classification.
  • 19. IntroductionWhat will you learn?A Detailed review of useful features for web classificationThe algorithms usedThe future research directionsWebpage classification can help improve the quality of web search.Knowing is thing help you to improve your SEO skill.Each search engine, keep their technique in secret.
  • 20. BackgroundPresented byMr.Pachara ChutisawaengDepartment of Computer ScienceMahidol University, July 2009
  • 21. BackgroundThe general problem of webpage classification can be divided intoSubject classification; subject or topic of webpage e.g. “Adult”, “Sport”, “Business”.Function classification; the role that the webpage play e.g. “Personal homepage”, “Course page”, “Admission page”.
  • 22. BackgroundBased on the number of classes in webpage classification can be divided into binary classification multi-class classification Based on the number of classes that can be assigned to an instance, classification can be divided into single-label classification and multi-label classification.
  • 24. Applications of web classificationPresented byMr.Pachara ChutisawaengDepartment of Computer ScienceMahidol University, July 2009
  • 25. Applications of web classificationConstructing and expanding web directories (web hierarchies)Yahoo !ODP or “Open Dictionary Project” http://www.dmoz.orgHow are they doing?
  • 27. Applications of web classificationHow are they doing?By human effortJuly 2006, it was reported there are 73,354 editor in the dmoz ODP.As the web changes and continue to grow so “Automatic creation of classifiers from web corpora based on use-defined hierarchies” has been introduced by Huang et al. in 2004The starting point of this presentation !!
  • 28. Applications of web classificationImproving quality of search resultsCategories viewRanking view
  • 30. Applications of web classificationImproving quality of search results Categories viewRanking view In 1998, Page and Brin developed the link-based ranking algorithm called PageRankCalculates the hyperlinks with our considering the topic of each page
  • 31. Google – 11 Years ago
  • 32. Applications of web classificationHelping question answering systemsYang and Chua 2004 suggest finding answers to list questions e.g. “name all the countries in Europe”How it worked?Formulated the queries and sent to search engines.Classified the results into four categoriesCollection pages (contain list of items)Topic pages (represent the answers instance)Relevant page (Supporting the answers instance)Irrelevant pagesAfter that , topic pages are clustered, from which answers are extracted.Answering question system could benefit from web classification of both accuracy and efficiency
  • 33. Applications of web classificationOther applicationsWeb content filteringAssisted web browsingKnowledge base construction
  • 34. FeaturesPresented byMr.Pachara ChutisawaengDepartment of Computer ScienceMahidol University, July 2009
  • 35. FeaturesIn this section, we review the types of features that useful in webpage classification research.The most important criteria in webpage classification that make webpage classification different from plaintext classification is HYPERLINK <a>…</a>We classify features intoOn-page feature: Directly located on the pageNeighbors feature: Found on the pages related to the page to be classified.
  • 36. Features: On-page
  • 37. Features: On-page. Textual content and tags. N-gram features: imagine two different documents, one containing the phrase “New York” and the other containing the terms “New” and “York” separately; a 2-gram feature tells them apart, while single-word features do not. For Yahoo!, 5-gram features were used. HTML tags or DOM: title, headings, metadata and main text are each assigned an arbitrary weight. Nowadays most websites use nested lists (<ul><li>), which really helps webpage classification.
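The “New York” example can be sketched in a few lines. The function name `word_ngrams` and the two sample sentences are made up for illustration, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Two hypothetical documents that share the words "new" and "york".
doc_a = "Flights to New York"
doc_b = "A new bridge opened in York"

unigrams_a, unigrams_b = word_ngrams(doc_a, 1), word_ngrams(doc_b, 1)
bigrams_a, bigrams_b = word_ngrams(doc_a, 2), word_ngrams(doc_b, 2)
```

With unigrams both documents contain "new" and "york" and look alike; only the bigram list of `doc_a` contains "new york".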
  • 38. Features: On-page. Textual content and tags: URL. Kan and Thi (2004) demonstrated that a webpage can be classified based on its URL.
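One simple way to turn a URL into classifier features is to split it into tokens. The sketch below is only an assumption about what such a feature extractor could look like, not the method from the cited work; `url_tokens` and the example URL are hypothetical.

```python
import re


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens usable as features."""
    return [tok for tok in re.split(r"[^a-z0-9]+", url.lower()) if tok]


tokens = url_tokens("http://www.example.edu/courses/cs101/index.html")
```

Tokens such as "edu" and "courses" already hint at a course page before the page itself is ever downloaded.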
  • 39. Features: On-page. Visual analysis: each webpage has two representations, the text represented in HTML and the visual representation rendered by a web browser. Most approaches focus on the text and ignore the visual information, which is useful as well. Kovacevic et al. (2004): each webpage is represented as a hierarchical “visual adjacency multigraph”, in which each node represents an HTML object and each edge represents a spatial relation in the visual representation.
  • 41. Features: Neighbors Features
  • 42. Features: Neighbors Features. Motivation: on a particular page, the useful on-page features discussed previously may be missing or unrecognizable.
  • 43. Example webpage which has few useful on-page features
  • 44. Features: Neighbors features. Underlying assumptions: when exploring the features of neighbors, some assumptions are implicitly made in existing work. The presence of many “Sport” pages in the neighborhood of P-a increases the probability of P-a being in “Sport”. Chakrabarti et al. (2002) and Menczer (2005) showed that linked pages were more likely to have terms in common. Neighbor selection: existing research mainly focuses on pages within two steps of the page to be classified, i.e. at a distance no greater than two. There are six types of neighboring pages: parent, child, sibling, spouse, grandparent and grandchild.
  • 45. Neighbors within a radius of two
  • 46. Features: Neighbors features. Neighbor selection (cont.). Furnkranz (1999): the text surrounding the link on the parent pages is used to train a classifier, instead of the text on the target page. A target page will thus be assigned multiple labels, which are then combined by some voting scheme to form the final prediction of the target page’s class. Sun et al. (2002): in addition to the text on the target page, using page titles and anchor text from parent pages can improve classification compared with a pure text classifier.
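The slide only says “some voting scheme”; the simplest candidate is a plain majority vote, sketched below. The function name and the sample labels are hypothetical, and real systems may weight votes by classifier confidence instead.

```python
from collections import Counter


def majority_vote(predicted_labels):
    """Combine per-link predictions for one target page into a single label."""
    counts = Counter(predicted_labels)
    # most_common(1) returns the label with the highest count.
    return counts.most_common(1)[0][0]


# Hypothetical predictions, one per parent page linking to the target page.
votes = ["Sport", "Sport", "Business", "Sport", "News"]
final_label = majority_vote(votes)
```

Here three of the five link contexts say "Sport", so the target page is labeled "Sport".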
  • 47. Features: Neighbors features. Neighbor selection (cont.). Summary: parent, child, sibling and spouse pages are all useful in classification; siblings are found to be the best source. Information from neighboring pages may introduce extra noise and should be used carefully.
  • 49. Features: Neighbors features. Features: labels (assigned by an editor or keyworder); partial content (anchor text, the text surrounding anchor text, titles, headers); full content. Among the three types of features, using the full content of neighboring pages is the most expensive; however, it generates better accuracy.
  • 50. Features: Neighbors features. Utilizing artificial links (implicit links): hyperlinks are not the only choice. What is an implicit link? A connection between pages that appear in the results of the same query and are both clicked by users. Implicit links can help webpage classification as well as hyperlinks do.
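The definition of an implicit link maps directly onto a click log. This is a minimal sketch that assumes the log is a list of (query, clicked URL) pairs; the function name and the log entries are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def implicit_links(click_log):
    """Return the set of page pairs clicked for the same query."""
    clicked = defaultdict(set)
    for query, url in click_log:
        clicked[query].add(url)
    links = set()
    for urls in clicked.values():
        # Sort so each undirected pair is stored in one canonical order.
        for a, b in combinations(sorted(urls), 2):
            links.add((a, b))
    return links


# Hypothetical click log: (query, clicked result).
log = [
    ("python tutorial", "a.com"),
    ("python tutorial", "b.com"),
    ("python tutorial", "a.com"),
    ("cheap flights", "c.com"),
]
pairs = implicit_links(log)
```

The resulting pairs can be fed to a neighbor-based classifier exactly as hyperlinks would be.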
  • 52. Discussion: Features. The results of different approaches are based on different implementations and different datasets, which makes it difficult to compare their performance. Sibling pages are even more useful than parents and children. The reason may lie in the process of hyperlink creation: a page often acts as a bridge connecting its outgoing links, which are likely to share a common topic.
  • 54. Tip! Tracking incoming links: how do you know when someone links to you?
  • 55. Algorithms
  • 56. Algorithm Approaches for Webpage Classification
  • 57. Dimension Reduction: Feature weighting. Another important factor in webpage classification
  • 58. A way of boosting classification by emphasizing the features with better discriminative power
  • 59. A special case of weighting: “feature selection”. Dimension Reduction (cont’d): Feature Selection. A special case of feature weighting: a zero weight is assigned to the eliminated features. The role:
  • 60. Dimension Reduction (cont’d): Feature Selection. Simple approaches: the first fragment of each document; the first fragment applied to web documents in hierarchical classification. Text categorization approaches: information gain, mutual information, etc.
  • 61. Feature Selection (cont’d): Simple measures. Using the first fragment of each document: assumes that a summary is at the beginning of the document; fast and accurate classification for news articles; not satisfactory for other types of documents. The first fragment applied to hierarchical classification of web pages: useful for web documents.
  • 62. Feature Selection (cont’d): Text categorization measures. Using expected mutual information and mutual information: two well-known metrics, used in an approach based on a variation of the k-Nearest Neighbor algorithm; terms are weighted according to the HTML tags they appear in, since terms within different tags carry different importance. Using information gain: another well-known metric. It is still not apparent which one is superior for web classification.
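As a rough illustration of one of these metrics, the sketch below computes the information gain of a single term over a tiny labeled collection. Documents are modeled as sets of terms; the four documents and their labels are invented, and the standard entropy-based definition is used rather than any formulation specific to the surveyed papers.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Entropy reduction obtained by splitting documents on presence of `term`."""
    present = [lab for doc, lab in zip(docs, labels) if term in doc]
    absent = [lab for doc, lab in zip(docs, labels) if term not in doc]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - remainder


# Hypothetical toy collection: each document is a set of terms.
docs = [{"goal", "match"}, {"goal", "team"}, {"stock", "market"}, {"stock", "team"}]
labels = ["sport", "sport", "business", "business"]

gain_goal = information_gain(docs, labels, "goal")  # splits the classes perfectly
gain_team = information_gain(docs, labels, "team")  # says nothing about the class
```

Feature selection then keeps the highest-scoring terms and gives the rest a zero weight.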
  • 63. Feature Selection (cont’d): Text categorization measures. Improving the performance of SVM classifiers by aggressive feature selection: a measure was developed that can predict the effectiveness of a selection without training and testing classifiers. Latent Semantic Indexing (LSI), a popular technique: in text documents, documents are reinterpreted in a smaller transformed, but less intuitive, space. Cons: high computational complexity makes it inefficient to scale. In web classification: experiments are based on small datasets (to avoid the above cons); some work has tried to make it applicable to larger datasets, which still needs further study.
  • 64. Algorithm Approaches for Webpage Classification
  • 66. Relational Learning (cont’d): 2 main approaches. Relaxation labeling algorithms: originally proposed for image analysis; current usage: image and vision analysis, artificial intelligence, pattern recognition, web mining. Link-based classification algorithms: utilizing 2 popular link-based algorithms, loopy belief propagation and iterative classification.
  • 67. Relational Learning (cont’d): Relaxation labeling algorithms. Flow of the algorithm. Relaxation labeling (cont’d): Algorithm variations. Using a combined logistic classifier based on content and link information: shows improvement over a textual classifier; outperforms a single flat classifier based on both content and link features. Selecting only the proper neighbors: not all neighbors are qualified; the chosen neighbors should be similar enough in content.
  • 68. Relational Learning (cont’d): Link-based classification algorithms. Two popular link-based algorithms: loopy belief propagation and iterative classification. Better performance on a web collection than textual classifiers. During the researchers’ study, a toolkit was implemented. Toolkit features: classifies networked data using a relational classifier and a collective inference procedure; demonstrated strong performance on several datasets, including web collections.
  • 69. Algorithm Approaches for Webpage Classification
  • 70. Modifications to traditional algorithms: traditional algorithms adjusted to the context of webpage classification. k-Nearest Neighbors (kNN): quantifies the distance between the test document and each training document using a dissimilarity measure; cosine similarity or inner product is what most existing kNN classifiers use. Support Vector Machine (SVM).
  • 71. Modification Algorithms (cont’d): k-Nearest Neighbors algorithm. Varieties of modifications: using term co-occurrence in documents; using probability computation; using “co-training”.
  • 72. k-Nearest Neighbors Algorithm (cont’d): Modification varieties. Using term co-occurrence in documents: an improved similarity measure; the more co-occurring terms two documents have in common, the stronger the relationship between them; better performance than normal kNN (cosine similarity and inner product measures). Using probability computation: the probability of a document d being in class c is determined by the distance between d and its neighbors and by its neighbors’ probability of being in c. Simple equation: Prob. of d in c = (distance between d and neighbors) × (neighbors’ prob. of being in c).
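The “simple equation” can be read as a similarity-weighted vote among the k nearest neighbors. The sketch below is one possible reading, not the surveyed authors' exact formula: it uses cosine similarity as the closeness measure, assumes each training document belongs to its class with probability 1, and normalizes the scores so they sum to 1. All names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_class_probs(doc, training, k=3):
    """Estimate P(class | doc) as the similarity-weighted vote of k neighbors."""
    scored = sorted(
        ((cosine(doc, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    weights = Counter()
    for sim, label in scored:
        weights[label] += sim  # neighbor's class probability is taken as 1
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()} if total else {}


# Hypothetical labeled training documents.
training = [
    (Counter("goal match team".split()), "sport"),
    (Counter("team goal league".split()), "sport"),
    (Counter("stock market shares".split()), "business"),
    (Counter("market bank stock".split()), "business"),
]
probs = knn_class_probs(Counter("goal team win".split()), training, k=3)
```

Replacing `cosine` with a co-occurrence-based similarity would give the first modification listed on the slide.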
  • 73. k-Nearest Neighbors Algorithm (cont’d): Modification varieties (2). Using “co-training”: makes use of both labeled and unlabeled data, aiming to achieve better accuracy. Scenario, binary classification: to classify the unlabeled instances, two classifiers are trained on different sets of features, and the predictions of each one are used to train the other. Compared with classifying from only the labeled instances, co-training can cut the error rate by half. When generalized to multi-class problems: when the number of categories is large, co-training is not satisfactory; on the other hand, combining error-correcting output coding (which uses more than enough classifiers) with co-training can boost performance.
  • 74. Modification Algorithms (cont’d): SVM-based approach. In classification, both positive and negative examples are required. Aim of the SVM-based approach: to eliminate the need for manual collection of negative examples while retaining similar classification accuracy.
  • 75. SVM-based Approach (cont’d): Flow of the SVM-based algorithm
  • 76. Take a Break! The Internet’s Ad Marketplace: besides Google AdWords
  • 77. Algorithm Approaches for Webpage Classification
  • 78. Hierarchical Classification. There is not much research on it, since most web classification work focuses on same-level (flat) approaches. Approaches: based on “divide and conquer”; error minimization; topical hierarchy; hierarchical SVMs; using the degree of misclassification; hierarchical text categorization.
  • 79. Hierarchical Classification (cont’d): Approaches. Hierarchical classification based on “divide and conquer”: classification problems are split into sub-problems hierarchically; more efficient and accurate than the non-hierarchical way. Error minimization: when the lower-level category is uncertain, errors are minimized by shifting the assignment to the higher-level one. Topical hierarchy: classify a web page into a topical hierarchy and update the category information as the hierarchy expands.
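The error-minimization idea, falling back to the parent category when the child is uncertain, can be sketched as follows. The confidence threshold of 0.6, the function name, and the category paths are assumptions for illustration; the per-level confidences would come from whatever classifier is used at each node.

```python
def assign_with_backoff(path_scores, threshold=0.6):
    """Walk from root to leaf and stop at the deepest confident category.

    `path_scores` is a list of (category, confidence) pairs ordered from the
    top of the hierarchy down. Returns None if even the top level is uncertain.
    """
    assigned = None
    for category, confidence in path_scores:
        if confidence < threshold:
            break  # too uncertain: keep the higher-level assignment
        assigned = category
    return assigned


# Hypothetical per-level classifier outputs for two pages.
uncertain_leaf = [("Sports", 0.9), ("Sports/Soccer", 0.4)]
confident_leaf = [("Sports", 0.9), ("Sports/Soccer", 0.8)]

label_a = assign_with_backoff(uncertain_leaf)
label_b = assign_with_backoff(confident_leaf)
```

The first page stays at "Sports" because its leaf score is below the threshold, while the second is assigned all the way down to "Sports/Soccer".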
  • 80. Hierarchical Classification (cont’d): Approaches (2). Hierarchical SVMs, observations: hierarchical SVMs are more efficient than flat SVMs; none is satisfactory in effectiveness for large taxonomies; hierarchical settings do more harm than good to kNN and naive Bayes classifiers. Hierarchical classification by the degree of misclassification: as opposed to measuring “correctness”, the distance is measured between the classifier-assigned class and the true class. Hierarchical text categorization: a detailed review was provided in 2005.
  • 81. Algorithm Approaches for Webpage Classification
  • 82. Combining Information from Multiple Sources. Different sources are utilized; combining link and content information is quite popular. A common way to combine: treat information from different sources as different (usually disjoint) feature sets on which multiple classifiers are trained; the final decision is then generated by combining the classifiers. The combination generally has the potential to have better knowledge than any single method.
  • 83. Information Combination (cont’d): Approaches. Voting and stacking: well-developed methods in machine learning. Co-training: effective in combining multiple sources, since here different classifiers are trained on disjoint feature sets.
  • 84. Information Combination (cont’d): Cautions. Please note that: additional resource needs sometimes cause a disadvantage; the combination of two is NOT always better than each separately.
  • 85. Blog classification
  • 86. Take a Break! Follow the Trend!! Everybody RETWEET!!
  • 87. Follow me on Twitter: follow pChr; also my blog, Http://www.PacharaStudio.com
  • 88. Blog classification. The word “blog” was originally a short form of “web log”. As blogging has gained popularity in recent years, an increasing amount of research about blogs has also been conducted. It can be broken into three types: blog identification (determining whether a web document is a blog), mood classification, and genre classification.
  • 89. Blog classification. Elgersma and Rijke (2006): common classification algorithms applied to blog identification using a number of human-selected features, e.g. “Comments” and “Archives”; accuracy around 90%. Mihalcea and Liu (2006) classify blogs into two polarities of mood, happiness and sadness (mood classification). Nowson (2006) discussed the distinction between three types of blogs (genre classification): news, commentary, and journal.
  • 90. Blog classification. Qu et al. (2006): automatic classification of blogs into four genres: personal diary, news, political, and sports, using a unigram tf-idf document representation and naive Bayes classification. Qu et al.’s approach can achieve an accuracy of 84%.
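For readers unfamiliar with naive Bayes over unigrams, here is a minimal sketch. It is not Qu et al.'s system: it uses raw unigram counts with add-one smoothing and leaves out the tf-idf weighting for brevity, and the four training snippets and two genre labels are invented toy data.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Collect class frequencies, per-class unigram counts, and the vocabulary."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log-posterior for `text`."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        denom = sum(term_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # ignore words never seen in training
                score += math.log((term_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set with two of the four genres.
model = train_nb([
    ("today i felt tired after work", "personal diary"),
    ("my weekend at home with family", "personal diary"),
    ("the team won the match last night", "sports"),
    ("the striker scored in the final match", "sports"),
])
genre = predict_nb(model, "a great match for the team")
```

Swapping the raw counts for tf-idf weights would bring the representation closer to the one named on the slide.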
  • 91. Conclusion
  • 92. Conclusion. Webpage classification is a type of supervised learning problem that aims to categorize webpages into a set of predefined categories based on labeled training data. The authors expect that future web classification efforts will certainly combine content and link information in some form.
  • 93. Conclusion. Future work would be well-advised to: emphasize text and labels from siblings over other types of neighbors; incorporate anchor text from parents; utilize other sources of (implicit or explicit) human knowledge, such as query logs and click-through behavior, in addition to existing labels, to guide classifier creation.
  • 94. Thank you.
  • 95. Questions?