Chapter 6:
Text Categorization
See chapter 16 in Manning & Schütze
Text Categorization and
related Tasks
Classification
Goal: Assign objects from a universe to two or more classes or
categories
Examples:
Problem                   Object             Categories
Sense Disambiguation      Word / Doc.        the word's senses
Tagging                   Words              POS / NE
Spam Mail Detection       Document           spam / not spam
Author identification     Document           authors
Text Categorization       Document           topic
Information retrieval     Query / Document   relevant / not relevant
Spam/junk/bulk Emails
The messages you spend your time on just to delete them
Spam: unsolicited messages the recipient does not want to get
Junk: irrelevant to the recipient, unwanted
Bulk: mass mailing for business marketing (or to fill up the mailbox, etc.)
Classification task: decide for each e-mail whether it is spam or not spam
Author identification
They agreed that Mrs. X should only hear of the
departure of the family, without being alarmed on the
score of the gentleman's conduct; but even this partial
communication gave her a great deal of concern, and she
bewailed it as exceedingly unlucky that the ladies
should happen to go away, just as they were all getting
so intimate together.
Gas looming through the fog in divers places in the
streets, much as the sun may, from the spongey fields,
be seen to loom by husbandman and ploughboy. Most of the
shops lighted two hours before their time--as the gas
seems to know, for it has a haggard and unwilling look.
The raw afternoon is rawest, and the dense fog is
densest, and the muddy streets are muddiest near that
leaden-headed old obstruction, appropriate ornament for
the threshold of a leaden-headed old corporation, Temple
Bar.
Author identification
Jane Austen (1775-1817), Pride and
Prejudice
Charles Dickens (1812-70), Bleak
House
Author identification
Federalist papers
77 short essays written in 1787-1788 by Hamilton, Jay and Madison to persuade NY to ratify the US Constitution; published under a pseudonym
The authorship of 12 papers was in dispute (the "disputed papers")
In 1964 Mosteller and Wallace solved the problem
They identified 70 function words as good candidates for authorship analysis
Using statistical inference they concluded the author was Madison
Function words for Author
Identification
Text Categorization
[Diagram: text categorization situated among neighboring tasks: Speech Recognition, Information Retrieval, Computational Linguistics, and everything else]
Text categorization
Topic categorization: classify the document into semantic topics
The U.S. swept into the Davis
Cup final on Saturday when twins
Bob and Mike Bryan defeated
Belarus's Max Mirnyi and Vladimir
Voltchkov to give the Americans
an unsurmountable 3-0 lead in the
best-of-five semi-final tie.
One of the strangest, most
relentless hurricane seasons on
record reached new bizarre heights
yesterday as the plodding approach
of Hurricane Jeanne prompted
evacuation orders for hundreds of
thousands of Floridians and high
wind warnings that stretched 350
miles from the swamp towns south
of Miami to the historic city of St.
Augustine.
Text categorization
Reuters
A collection of 21,578 newswire documents.
For research purposes: a standard text collection to compare systems and algorithms.
135 valid topic categories.
Top topics in Reuters
Reuters
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><TEXT><BODY>The American Pork Congress kicks
off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states
determining industry positions on a number of issues, according to the National Pork Producers
Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues,
including the future direction of farm policy and the tax law as it applies to the agriculture sector. The
delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control
and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all areas
of the industry, the NPPC added. Reuter
</BODY></TEXT></REUTERS>
Classification vs. Clustering
Classification assumes labeled data: we
know how many classes there are and
we have examples for each class
(labeled data).
Classification is supervised
In Clustering we don't have labeled data;
we just assume that there is a natural
division in the data and we may not know
how many divisions (clusters) there are
Clustering is unsupervised
[Figures: points being assigned to Class1 and Class2 by a classifier ("Classification"), and unlabeled points being grouped into natural divisions ("Clustering")]
Binary vs. multi-way classification
Binary classification: two classes
Multi-way classification: more than two
classes
Sometimes it can be convenient to treat
a multi-way problem like a binary one:
one class versus all the others, for all
classes
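The one-class-versus-all-others reduction can be sketched as follows. This is a minimal sketch; the base classifier's fit/score interface is an assumption for illustration, not something fixed by the slides.

```python
def train_one_vs_rest(texts, labels, make_classifier):
    """Reduce a multi-way problem to one binary problem per class:
    'this class' versus all the others."""
    classifiers = {}
    for c in set(labels):
        binary = [1 if y == c else 0 for y in labels]
        clf = make_classifier()          # a fresh binary classifier
        clf.fit(texts, binary)
        classifiers[c] = clf
    return classifiers

def predict_one_vs_rest(classifiers, text):
    """Pick the class whose binary classifier is most confident."""
    return max(classifiers, key=lambda c: classifiers[c].score(text))
```

Any binary classifier from later in the chapter (Naïve Bayes, perceptron, SVM) could play the role of `make_classifier`.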
Flat vs. Hierarchical classification
Flat classification: relations between the
classes undetermined
Hierarchical classification: a hierarchy where each node is a sub-class of its parent node
Single- vs. multi-category classification
In single-category text classification each
text belongs to exactly one category
In multi-category text classification, each
text can have zero or more categories
Getting Features for Text
Categorization
Feature terminology
Feature: An aspect of the text that is relevant to
the task
Feature value: the realization of the feature in
the text
Words present in text: Clinton, Schumacher, China
Frequency of a word: Clinton (10), Schumacher (1)
Are there dates? Yes/no
Are there PERSONs? Yes/no
Are there ORGANIZATIONs? Yes/no
WordNet: holonyms (China is part of Asia), synonyms (China, People's Republic of China, mainland China)
Feature Types
Boolean (or Binary) Features
Features that generate boolean (binary) values.
Boolean features are the simplest and the most common type of feature.
f1(text) = 1 if text contains "Clinton", 0 otherwise
f2(text) = 1 if text contains a PERSON, 0 otherwise
Feature Types
Integer Features
Features that generate integer values.
Integer features can be used to give classifiers access to more precise information about the text.
f1(text) = number of times the text contains "Clinton"
f2(text) = number of times the text contains a PERSON
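The boolean and integer feature types can be sketched in Python. This follows the slides' "Clinton" examples; a real system would need a named-entity tagger for the PERSON feature.

```python
import re

def f1_bool(text):
    """Boolean feature: 1 if the text contains "Clinton", 0 otherwise."""
    return 1 if "Clinton" in text else 0

def f1_int(text):
    """Integer feature: number of times the text contains "Clinton"."""
    return len(re.findall(r"\bClinton\b", text))
```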
When Do We Need Feature Selection?
If the algorithm cannot handle all possible features
e.g. language identification for 100 languages using all words
text classification using n-grams
Good features can result in higher accuracy
But why feature selection? What if we just keep all features?
Even the unreliable features can be helpful, but we need to weight them:
In the extreme case, the bad features can get a weight of 0 (or very close), which is a form of feature selection!
Why Feature Selection?
Not all features are equally good!
Bad features: best to remove
Infrequent
unlikely to be met again
co-occurrence with a class can be due to chance
Too frequent
mostly function words
Uniform across all categories
Good features: should be kept
Co-occur with a particular category
Do not co-occur with other categories
The rest: good to keep
Types of Feature Selection
Feature selection reduces the number of
features
Usually:
Eliminating features
Weighting features
Normalizing features
Sometimes by transforming parameters
e.g. Latent Semantic Indexing using Singular Value
Decomposition
Method may depend on problem type
For classification and filtering, may use information
from example documents to guide selection
Feature Selection
Task independent methods
Document Frequency (DF)
Term Strength (TS)
Task-dependent methods
Information Gain (IG)
Mutual Information (MI)
χ² statistic (CHI)
Empirically compared by Yang & Pedersen (1997)
Compared feature selection methods for text
categorization
5 feature selection methods:
DF, MI, CHI, IG, TS
Features were just words
2 classifiers:
kNN: k-Nearest Neighbor (to be covered next week)
LLSF: Linear Least Squares Fit
2 data collections:
Reuters-22173
OHSUMED: subset of MEDLINE (1990 and 1991 used)
Document Frequency (DF)
DF: number of documents a term appears in
Based on Zipf's Law
Remove the rare terms (met 1-2 times):
Non-informative
Unreliable: can be just noise
Not influential in the final decision
Unlikely to appear in new documents
Plus
Easy to compute
Task independent: do not need to know the classes
Minus
Ad hoc criterion
Rare terms can be good discriminators (e.g., in IR)
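DF selection is simple enough to sketch directly. A minimal sketch, assuming documents are whitespace-tokenized strings; note DF counts documents a term occurs in, not total occurrences.

```python
from collections import Counter

def document_frequency(documents):
    """DF: for each term, the number of documents it appears in."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))  # set(): count each doc once
    return df

def remove_rare_terms(documents, min_df=2):
    """Keep only terms occurring in at least min_df documents."""
    df = document_frequency(documents)
    return {term for term, n in df.items() if n >= min_df}
```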
Examples of Frequent Words:
Most Frequent Words in Brown Corpus
Stop Word Removal
Common words from a predefined list
Mostly from closed-class categories:
unlikely to have a new word added
include: auxiliaries, conjunctions, determiners, prepositions,
pronouns, articles
But also some open-class words like numerals
Bad discriminators
uniformly spread across all classes
can be safely removed from the vocabulary
Is this always a good idea? (e.g. author identification)
Information Gain
A measure of importance of the feature
for predicting the presence of the class.
Defined as:
The number of bits of information gained
by knowing the term is present or absent
Based on Information Theory
Plus:
sound information theory justification
Minus:
computationally expensive
Information Gain (IG)
IG: number of bits of information gained by knowing the term is present or absent:

$$G(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})$$

where t is the term being scored and c_i is a class variable.
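G(t) is the entropy of the class distribution minus the entropy conditioned on the term's presence or absence, so it can be computed directly from labeled documents. A minimal sketch; the (term_present, class_label) input representation is an assumption for illustration.

```python
import math

def entropy(labels):
    """H(C) of a list of class labels, in bits."""
    if not labels:
        return 0.0
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def information_gain(docs):
    """docs: list of (term_present: bool, class_label) pairs.
    G(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t)."""
    n = len(docs)
    with_t = [c for present, c in docs if present]
    without_t = [c for present, c in docs if not present]
    return (entropy([c for _, c in docs])
            - len(with_t) / n * entropy(with_t)
            - len(without_t) / n * entropy(without_t))
```

A term that perfectly predicts one of two equiprobable classes gains 1 bit; a term independent of the classes gains 0.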
Mutual Information (MI)
Logarithmic measure of the correlation of term t with category c:

$$I(t,c) = \log\frac{P(t,c)}{P(t)\,P(c)} = \log\frac{P(t \mid c)}{P(t)} = \log\frac{P(c \mid t)}{P(c)}$$
Using Mutual Information
Compute MI for each category and then combine.
To discriminate well across all categories, take the expected value of MI:

$$I_{avg}(t) = \sum_{i=1}^{m} P(c_i)\, I(t, c_i)$$

To discriminate well for a single category, take the maximum:

$$I_{max}(t) = \max_{i=1\ldots m} I(t, c_i)$$
Mutual Information
Plus
I(t,c) is 0 when t and c are independent
Sound information-theoretic interpretation
Minus
Small numbers produce unreliable results
No weighting with frequency of a pair (t,c)
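MI and its avg/max combinations can be estimated from document counts. A minimal sketch, assuming maximum-likelihood estimates of the probabilities from counts.

```python
import math

def mutual_information(n_tc, n_t, n_c, n):
    """I(t,c) = log2( P(t,c) / (P(t) P(c)) ), from document counts:
    n_tc: docs of class c containing t; n_t: docs containing t;
    n_c: docs of class c; n: total number of documents."""
    return math.log2((n_tc * n) / (n_t * n_c))

def mi_avg(mi_scores, priors):
    """Expected MI over categories: sum_i P(c_i) I(t, c_i)."""
    return sum(priors[c] * mi_scores[c] for c in mi_scores)

def mi_max(mi_scores):
    """Maximum MI over any single category."""
    return max(mi_scores.values())
```

When t and c are independent the counts are proportional and the score is 0, matching the "Plus" above; note the formula breaks down when n_tc = 0, one face of the low-count unreliability listed under "Minus".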
χ² statistic
The most commonly used method of comparing proportions.
Example: Let us measure the dependency
between a term t and a category c.
the groups would be:
1) the documents from a category ci
2) all other documents
the characteristic would be:
document contains term t
χ² statistic
Is "jaguar" a good predictor for the "auto" class?

                   Class = auto    Class ≠ auto
Term = jaguar           2               3
Term ≠ jaguar         500            9500

We want to compare:
the observed distribution above; and
the null hypothesis: that "jaguar" and "auto" are independent
χ² statistic
Under the null hypothesis (jaguar and auto independent): how many co-occurrences of "jaguar" and "auto" do we expect?
We would have P(j,a) = P(j) P(a)
P(j) = (2+3)/N;  P(a) = (2+500)/N;  N = 2+3+500+9500 = 10005
Expected number of co-occurrences:
N P(j,a) = N P(j) P(a) = N (5/N)(502/N) = 2510/N = 2510/10005 ≈ 0.25

                   Class = auto    Class ≠ auto
Term = jaguar        2 (0.25)           3
Term ≠ jaguar          500            9500
χ² statistic

                   Class = auto    Class ≠ auto
Term = jaguar        2 (0.25)        3 (4.75)
Term ≠ jaguar      500 (502)       9500 (9498)
χ² statistic
χ² sums (f_o − f_e)²/f_e over all table entries:

$$\chi^2(j,a) = \sum (O - E)^2 / E = \frac{(2-0.25)^2}{0.25} + \frac{(3-4.75)^2}{4.75} + \frac{(500-502)^2}{502} + \frac{(9500-9498)^2}{9498} \approx 12.9$$
χ² statistic
To turn the χ² value into a significance level, alternatives:
Look up the value of χ² in a table
Calculate it from the χ² density with k degrees of freedom:

$$f(x; k) = \frac{(1/2)^{k/2}}{\Gamma(k/2)}\, x^{k/2-1} e^{-x/2}$$

Look it up on the internet
For the example above, the null hypothesis is rejected with confidence 0.9997.
χ² statistic
Collecting all the terms, χ² can be computed directly from the contingency table:

$$\chi^2(t,c) = \frac{N\,(AD - CB)^2}{(A+B)\,(A+C)\,(B+D)\,(C+D)}$$

A = #(t, c)      B = #(t, ¬c)
C = #(¬t, c)     D = #(¬t, ¬c)
N = A + B + C + D
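The closed-form χ² is one line of code. On the jaguar/auto table it gives ≈ 12.8, close to the worked sum's ≈ 12.9 (the worked example rounds the expected counts).

```python
def chi_square(a, b, c, d):
    """chi^2(t, c) from a 2x2 contingency table:
    a = #(t, c), b = #(t, not c), c = #(not t, c), d = #(not t, not c)."""
    n = a + b + c + d
    return n * (a * d - c * b) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))
```

Usage: chi_square(2, 3, 500, 9500) for the jaguar/auto example; a table with proportional rows (independence) scores exactly 0.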
χ² statistic
How to use χ² for multiple categories?
Compute χ² for each category and then combine:
To discriminate well across all categories, take the expected value:

$$\chi^2_{avg}(t) = \sum_{i=1}^{m} P(c_i)\,\chi^2(t, c_i)$$

To discriminate well for a single category, take the maximum:

$$\chi^2_{max}(t) = \max_{i=1\ldots m} \chi^2(t, c_i)$$
χ² statistic
Plus
normalized and thus comparable across terms
χ²(t,c) is 0 when t and c are independent
sound theoretical background
Minus
unreliable for low-frequency terms
computationally expensive
Term strength
Term strength:

$$s(t) = P(t \in y \mid t \in x)$$

x, y: topically related documents (e.g. from a clustering algorithm)
Measures co-occurrence of terms (unlike idf)
For more details see: Wilbur and Sirotkin, "The automatic identification of stop words"
Comparison on Reuters
Correlation of feature selection criteria
Feature Selection Summary
(From Yang and Pedersen)
Classification Algorithms
Overview
There is a large zoo of classification algorithms
Decision Trees
Naïve Bayes
Maximum Entropy methods
k nearest neighbor classifiers
Neural networks
Support vector machines
Many of them have been covered in other
lectures
Decision Tree for Reuters classification
From Manning & Schütze
Decision Boundaries for Decision Trees
1-Nearest Neighbor
3-Nearest Neighbor
Assign the category of the majority of the neighbors.
But what if one of the neighbors is much closer? We can weight neighbors according to their similarity.
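The kNN decision rule can be sketched as follows. A minimal sketch with an unweighted majority vote; the dot-product similarity is a stand-in assumption for whatever document similarity is actually used.

```python
def knn_classify(query, examples, k=3, sim=None):
    """examples: list of (vector, label) pairs.
    Assign the majority label among the k most similar neighbors."""
    if sim is None:
        sim = lambda u, v: sum(a * b for a, b in zip(u, v))  # dot product
    neighbors = sorted(examples, key=lambda e: sim(query, e[0]), reverse=True)[:k]
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
```

Weighted voting would replace the `+ 1` with the neighbor's similarity to the query.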
Decision Boundaries for k Nearest Neighbor
(schematic)
Bayes Decision Rule

$$\hat{k} = \arg\max_{k} P(x \mid k)\, P(k)$$

k: class label
x: features
Naïve Bayes
x is not a single feature but a bag of features
e.g. different key words for your spam-mail detection system
Assume statistical independence of the features:

$$P(\{x_1, \ldots, x_N\} \mid k) = \prod_{i=1}^{N} P(x_i \mid k)$$
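Combining the Bayes decision rule with the independence assumption gives the classic Naïve Bayes text classifier. A minimal sketch in log space; the add-one smoothing is a standard choice assumed here, not prescribed by the slides.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (words, class_label) pairs.
    Returns a classify(words) function implementing
    argmax_k log P(k) + sum_i log P(x_i | k)."""
    class_counts = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, c in docs:
        word_counts[c].update(words)
        vocab.update(words)

    def classify(words):
        def log_score(c):
            total = sum(word_counts[c].values())
            s = math.log(class_counts[c] / len(docs))       # log prior
            for w in words:
                # add-one (Laplace) smoothing so unseen words don't zero out
                s += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            return s
        return max(class_counts, key=log_score)

    return classify
```

Working in log space turns the product of small probabilities into a sum, avoiding numeric underflow on long documents.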
Maximum Entropy Methods
A way to estimate probabilities
Features are taken into account as constraints on the probabilities
Otherwise the probability estimate is kept as unbiased as possible
Linear binary classification using a Perceptron (Simplest Neural Network)
Data: {(x_i, y_i)}, i = 1...n
x in R^d (x is a vector in d-dimensional space): feature vector
y in {-1, +1}: label (class, category)
Question: design a linear decision boundary wx + b = 0 (the equation of a hyperplane) such that the classification rule associated with it has minimal probability of error.
Classification rule: y = sign(wx + b), which means:
if wx + b > 0 then y = +1
if wx + b < 0 then y = -1
Linear binary classification
Find a good hyperplane (w, b) in R^{d+1} that correctly classifies as many data points as possible.
In online fashion: one data point at a time; update the weights as necessary.
Classification rule: y = sign(wx + b)
[Figure: data points separated by the hyperplane wx + b = 0]
Perceptron algorithm
Initialize: w_1 = 0
Updating rule, for each data point x_i:
  if class(x_i) != decision(x_i, w_k)
    then w_{k+1} = w_k + y_i x_i;  k = k + 1
    else w_{k+1} = w_k

function decision(x, w):
  if wx + b > 0 return +1
  else return -1

[Figure: hyperplanes w_k x + b = 0 and w_{k+1} x + b = 0 before and after an update; slide note: the drawing does not correspond to the algorithm with respect to the treatment of b]
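The update rule above can be sketched as follows. To sidestep the slide's inconsistency about b, this sketch folds the bias into the weight vector as an always-on extra feature, a standard trick.

```python
def perceptron_train(data, epochs=10):
    """data: list of (x, y) with x a feature tuple and y in {-1, +1}.
    The bias b is folded in as a constant extra feature, so w has d+1 entries."""
    d = len(data[0][0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        for x, y in data:
            xb = list(x) + [1.0]  # append always-on bias feature
            activation = sum(wi * xi for wi, xi in zip(w, xb))
            if y * activation <= 0:  # misclassified: w <- w + y * x
                w = [wi + y * xi for wi, xi in zip(w, xb)]
    return w

def perceptron_predict(w, x):
    """Classification rule: y = sign(wx + b)."""
    xb = list(x) + [1.0]
    return 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else -1
```

On linearly separable data (here, an AND-like problem) the perceptron convergence theorem guarantees this loop stops making mistakes after finitely many updates.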
Perceptron algorithm
Online: can adjust to changing target, over
time
Advantages
Simple and computationally efficient
Guaranteed to learn a linearly separable problem
(convergence, global optimum)
Limitations
Only linear separations
Only converges for linearly separable data
Not really efficient with many features
Large margin classifier
Another family of linear algorithms.
Intuition (Vapnik, 1965): if the classes are linearly separable:
Separate the data
Place the hyperplane far from the data: large margin
Statistical results guarantee good generalization
[Figures: a BAD separating hyperplane that passes close to the data vs. a GOOD maximal margin classifier placed far from both classes]
Large margin classifier
If not linearly separable
Allow some errors
Still, try to place
hyperplane far from
each class
Large Margin Classifiers
Advantages
Theoretically better (better error bounds)
Limitations
Computationally more expensive: requires solving a large quadratic program
Support Vector Machine (SVM)
Large Margin Classifier, linearly separable case
Goal: find the hyperplane that maximizes the margin
[Figure: hyperplane w^T x + b = 0 with margin boundaries w^T x_a + b = 1 and w^T x_b + b = -1; the data points lying on the margin are the support vectors]
Summary
Types of text classification
Features and feature selection
Classification algorithms