Chapter 6:
Text Categorization
See chapter 16 in Manning & Schütze
Text Categorization and
related Tasks
Classification
Goal: Assign objects from a universe to two or more classes or
categories
Examples:
Problem                   Object             Categories
Sense Disambiguation      Word / Doc.        the word's senses
Tagging                   Words              POS / NE
Spam Mail Detection       Document           spam / not spam
Author identification     Document           authors
Text Categorization       Document           topic
Information retrieval     Query / Document   relevant / not relevant
Spam/junk/bulk Emails
The messages you spend your time on just to delete them
Spam: unsolicited messages the recipient does not want to get
Junk: irrelevant to the recipient, unwanted
Bulk: mass mailing for business marketing (or to fill up the mailbox, etc.)
Classification task: decide for each e-mail whether it is spam or not spam
Author identification
They agreed that Mrs. X should only hear of the
departure of the family, without being alarmed on the
score of the gentleman's conduct; but even this partial
communication gave her a great deal of concern, and she
bewailed it as exceedingly unlucky that the ladies
should happen to go away, just as they were all getting
so intimate together.
Gas looming through the fog in divers places in the
streets, much as the sun may, from the spongey fields,
be seen to loom by husbandman and ploughboy. Most of the
shops lighted two hours before their time--as the gas
seems to know, for it has a haggard and unwilling look.
The raw afternoon is rawest, and the dense fog is
densest, and the muddy streets are muddiest near that
leaden-headed old obstruction, appropriate ornament for
the threshold of a leaden-headed old corporation, Temple
Bar.
Author identification
Jane Austen (1775-1817), Pride and
Prejudice
Charles Dickens (1812-70), Bleak
House
Author identification
Federalist papers
77 short essays written in 1787-1788 by Hamilton, Jay and Madison to persuade NY to ratify the US Constitution; published under a pseudonym
The authorship of 12 papers was in dispute (the "disputed papers")
In 1964 Mosteller and Wallace solved the problem
They identified 70 function words as good candidates for authorship analysis
Using statistical inference they concluded the author was Madison
Function words for Author
Identification
Text Categorization
[Diagram: text categorization situated among neighboring tasks: Speech Recognition, Information Retrieval, Computational Linguistics, and everything else]
Text categorization
Topic categorization: classify the document into semantic topics
The U.S. swept into the Davis
Cup final on Saturday when twins
Bob and Mike Bryan defeated
Belarus's Max Mirnyi and Vladimir
Voltchkov to give the Americans
an unsurmountable 3-0 lead in the
best-of-five semi-final tie.
One of the strangest, most
relentless hurricane seasons on
record reached new bizarre heights
yesterday as the plodding approach
of Hurricane Jeanne prompted
evacuation orders for hundreds of
thousands of Floridians and high
wind warnings that stretched 350
miles from the swamp towns south
of Miami to the historic city of St.
Augustine.
Text categorization
Reuters
A collection of 21,578 newswire documents.
For research purposes: a standard text collection to compare systems and algorithms.
135 valid topic categories.
Top topics in Reuters
Reuters
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><TEXT><BODY>The American Pork Congress kicks
off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states
determining industry positions on a number of issues, according to the National Pork Producers
Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues,
including the future direction of farm policy and the tax law as it applies to the agriculture sector. The
delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control
and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all areas
of the industry, the NPPC added. Reuter
</BODY></TEXT></REUTERS>
Classification vs. Clustering
Classification assumes labeled data: we
know how many classes there are and
we have examples for each class
(labeled data).
Classification is supervised
In Clustering we don't have labeled data;
we just assume that there is a natural
division in the data and we may not know
how many divisions (clusters) there are
Clustering is unsupervised
[Figures: points being assigned to Class1 and Class2 by a classifier ("Classification"), and unlabeled points being grouped into natural divisions ("Clustering")]
Binary vs. multi-way classification
Binary classification: two classes
Multi-way classification: more than two
classes
Sometimes it can be convenient to treat
a multi-way problem like a binary one:
one class versus all the others, for all
classes
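The one-class-versus-all-others reduction can be sketched as follows. This is a minimal sketch; the base classifier's fit/score interface is an assumption for illustration, not something fixed by the slides.

```python
def train_one_vs_rest(texts, labels, make_classifier):
    """Reduce a multi-way problem to one binary problem per class:
    'this class' versus all the others."""
    classifiers = {}
    for c in set(labels):
        binary = [1 if y == c else 0 for y in labels]
        clf = make_classifier()          # a fresh binary classifier
        clf.fit(texts, binary)
        classifiers[c] = clf
    return classifiers

def predict_one_vs_rest(classifiers, text):
    """Pick the class whose binary classifier is most confident."""
    return max(classifiers, key=lambda c: classifiers[c].score(text))
```

Any binary classifier from later in the chapter (Naïve Bayes, perceptron, SVM) could play the role of `make_classifier`.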
Flat vs. Hierarchical classification
Flat classification: relations between the
classes undetermined
Hierarchical classification: a hierarchy where each node is a sub-class of its parent node
Single- vs. multi-category classification
In single-category text classification each
text belongs to exactly one category
In multi-category text classification, each
text can have zero or more categories
Getting Features for Text
Categorization
Feature terminology
Feature: An aspect of the text that is relevant to
the task
Feature value: the realization of the feature in
the text
Words present in text: Clinton, Schumacher, China
Frequency of a word: Clinton (10), Schumacher (1)
Are there dates? Yes/no
Are there PERSONs? Yes/no
Are there ORGANIZATIONs? Yes/no
WordNet: holonyms (China is part of Asia), synonyms (China, People's Republic of China, mainland China)
Feature Types
Boolean (or Binary) Features
Features that generate boolean (binary) values.
Boolean features are the simplest and the most common type of feature.
f1(text) = 1 if text contains "Clinton", 0 otherwise
f2(text) = 1 if text contains a PERSON, 0 otherwise
Feature Types
Integer Features
Features that generate integer values.
Integer features can be used to give classifiers access to more precise information about the text.
f1(text) = number of times the text contains "Clinton"
f2(text) = number of times the text contains a PERSON
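The boolean and integer feature types can be sketched in Python. This follows the slides' "Clinton" examples; a real system would need a named-entity tagger for the PERSON feature.

```python
import re

def f1_bool(text):
    """Boolean feature: 1 if the text contains "Clinton", 0 otherwise."""
    return 1 if "Clinton" in text else 0

def f1_int(text):
    """Integer feature: number of times the text contains "Clinton"."""
    return len(re.findall(r"\bClinton\b", text))
```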
When Do We Need Feature Selection?
If the algorithm cannot handle all possible features
e.g. language identification for 100 languages using all words
text classification using n-grams
Good features can result in higher accuracy
But why feature selection? What if we just keep all features?
Even the unreliable features can be helpful, but we need to weight them:
In the extreme case, the bad features can get a weight of 0 (or very close), which is a form of feature selection!
Why Feature Selection?
Not all features are equally good!
Bad features: best to remove
Infrequent
unlikely to be met again
co-occurrence with a class can be due to chance
Too frequent
mostly function words
Uniform across all categories
Good features: should be kept
Co-occur with a particular category
Do not co-occur with other categories
The rest: good to keep
Types of Feature Selection
Feature selection reduces the number of
features
Usually:
Eliminating features
Weighting features
Normalizing features
Sometimes by transforming parameters
e.g. Latent Semantic Indexing using Singular Value
Decomposition
Method may depend on problem type
For classification and filtering, may use information
from example documents to guide selection
Feature Selection
Task independent methods
Document Frequency (DF)
Term Strength (TS)
Task-dependent methods
Information Gain (IG)
Mutual Information (MI)
χ² statistic (CHI)
Empirically compared by Yang & Pedersen (1997)
Compared feature selection methods for text
categorization
5 feature selection methods:
DF, MI, CHI, IG, TS
Features were just words
2 classifiers:
kNN: k-Nearest Neighbor (to be covered next week)
LLSF: Linear Least Squares Fit
2 data collections:
Reuters-22173
OHSUMED: subset of MEDLINE (1990 and 1991 used)
Document Frequency (DF)
DF: number of documents a term appears in
Based on Zipf's Law
Remove the rare terms (met 1-2 times):
Non-informative
Unreliable: can be just noise
Not influential in the final decision
Unlikely to appear in new documents
Plus
Easy to compute
Task independent: do not need to know the classes
Minus
Ad hoc criterion
Rare terms can be good discriminators (e.g., in IR)
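DF selection is simple enough to sketch directly. A minimal sketch, assuming documents are whitespace-tokenized strings; note DF counts documents a term occurs in, not total occurrences.

```python
from collections import Counter

def document_frequency(documents):
    """DF: for each term, the number of documents it appears in."""
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))  # set(): count each doc once
    return df

def remove_rare_terms(documents, min_df=2):
    """Keep only terms occurring in at least min_df documents."""
    df = document_frequency(documents)
    return {term for term, n in df.items() if n >= min_df}
```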
Examples of Frequent Words:
Most Frequent Words in Brown Corpus
Stop Word Removal
Common words from a predefined list
Mostly from closed-class categories:
unlikely to have a new word added
include: auxiliaries, conjunctions, determiners, prepositions,
pronouns, articles
But also some open-class words like numerals
Bad discriminators
uniformly spread across all classes
can be safely removed from the vocabulary
Is this always a good idea? (e.g. author identification)
Information Gain
A measure of importance of the feature
for predicting the presence of the class.
Defined as:
The number of bits of information gained
by knowing the term is present or absent
Based on Information Theory
Plus:
sound information theory justification
Minus:
computationally expensive
Information Gain (IG)
IG: number of bits of information gained by knowing the term is present or absent:

$$G(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})$$

where t is the term being scored and c_i is a class variable.
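G(t) is the entropy of the class distribution minus the entropy conditioned on the term's presence or absence, so it can be computed directly from labeled documents. A minimal sketch; the (term_present, class_label) input representation is an assumption for illustration.

```python
import math

def entropy(labels):
    """H(C) of a list of class labels, in bits."""
    if not labels:
        return 0.0
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def information_gain(docs):
    """docs: list of (term_present: bool, class_label) pairs.
    G(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t)."""
    n = len(docs)
    with_t = [c for present, c in docs if present]
    without_t = [c for present, c in docs if not present]
    return (entropy([c for _, c in docs])
            - len(with_t) / n * entropy(with_t)
            - len(without_t) / n * entropy(without_t))
```

A term that perfectly predicts one of two equiprobable classes gains 1 bit; a term independent of the classes gains 0.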
Mutual Information (MI)
Logarithmic measure of the correlation of term t with category c:

$$I(t,c) = \log\frac{P(t,c)}{P(t)\,P(c)} = \log\frac{P(t \mid c)}{P(t)} = \log\frac{P(c \mid t)}{P(c)}$$
Using Mutual Information
Compute MI for each category and then combine.
To discriminate well across all categories, take the expected value of MI:

$$I_{avg}(t) = \sum_{i=1}^{m} P(c_i)\, I(t, c_i)$$

To discriminate well for a single category, take the maximum:

$$I_{max}(t) = \max_{i=1\ldots m} I(t, c_i)$$
Mutual Information
Plus
I(t,c) is 0 when t and c are independent
Sound information-theoretic interpretation
Minus
Small numbers produce unreliable results
No weighting with frequency of a pair (t,c)
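MI and its avg/max combinations can be estimated from document counts. A minimal sketch, assuming maximum-likelihood estimates of the probabilities from counts.

```python
import math

def mutual_information(n_tc, n_t, n_c, n):
    """I(t,c) = log2( P(t,c) / (P(t) P(c)) ), from document counts:
    n_tc: docs of class c containing t; n_t: docs containing t;
    n_c: docs of class c; n: total number of documents."""
    return math.log2((n_tc * n) / (n_t * n_c))

def mi_avg(mi_scores, priors):
    """Expected MI over categories: sum_i P(c_i) I(t, c_i)."""
    return sum(priors[c] * mi_scores[c] for c in mi_scores)

def mi_max(mi_scores):
    """Maximum MI over any single category."""
    return max(mi_scores.values())
```

When t and c are independent the counts are proportional and the score is 0, matching the "Plus" above; note the formula breaks down when n_tc = 0, one face of the low-count unreliability listed under "Minus".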
χ² statistic
The most commonly used method of comparing proportions.
Example: Let us measure the dependency
between a term t and a category c.
the groups would be:
1) the documents from a category ci
2) all other documents
the characteristic would be:
document contains term t
χ² statistic
Is "jaguar" a good predictor for the "auto" class?

                   Class = auto    Class ≠ auto
Term = jaguar           2               3
Term ≠ jaguar         500            9500

We want to compare:
the observed distribution above; and
the null hypothesis: that "jaguar" and "auto" are independent
χ² statistic
Under the null hypothesis (jaguar and auto independent): how many co-occurrences of "jaguar" and "auto" do we expect?
We would have P(j,a) = P(j) P(a)
P(j) = (2+3)/N;  P(a) = (2+500)/N;  N = 2+3+500+9500 = 10005
Expected number of co-occurrences:
N P(j,a) = N P(j) P(a) = N (5/N)(502/N) = 2510/N = 2510/10005 ≈ 0.25

                   Class = auto    Class ≠ auto
Term = jaguar        2 (0.25)           3
Term ≠ jaguar          500            9500
χ² statistic

                   Class = auto    Class ≠ auto
Term = jaguar        2 (0.25)        3 (4.75)
Term ≠ jaguar      500 (502)       9500 (9498)
χ² statistic
χ² sums (f_o − f_e)²/f_e over all table entries:

$$\chi^2(j,a) = \sum (O - E)^2 / E = \frac{(2-0.25)^2}{0.25} + \frac{(3-4.75)^2}{4.75} + \frac{(500-502)^2}{502} + \frac{(9500-9498)^2}{9498} \approx 12.9$$
χ² statistic
To turn the χ² value into a significance level, alternatives:
Look up the value of χ² in a table
Calculate it from the χ² density with k degrees of freedom:

$$f(x; k) = \frac{(1/2)^{k/2}}{\Gamma(k/2)}\, x^{k/2-1} e^{-x/2}$$

Look it up on the internet
For the example above, the null hypothesis is rejected with confidence 0.9997.
χ² statistic
Collecting all the terms, χ² can be computed directly from the contingency table:

$$\chi^2(t,c) = \frac{N\,(AD - CB)^2}{(A+B)\,(A+C)\,(B+D)\,(C+D)}$$

A = #(t, c)      B = #(t, ¬c)
C = #(¬t, c)     D = #(¬t, ¬c)
N = A + B + C + D
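The closed-form χ² is one line of code. On the jaguar/auto table it gives ≈ 12.8, close to the worked sum's ≈ 12.9 (the worked example rounds the expected counts).

```python
def chi_square(a, b, c, d):
    """chi^2(t, c) from a 2x2 contingency table:
    a = #(t, c), b = #(t, not c), c = #(not t, c), d = #(not t, not c)."""
    n = a + b + c + d
    return n * (a * d - c * b) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))
```

Usage: chi_square(2, 3, 500, 9500) for the jaguar/auto example; a table with proportional rows (independence) scores exactly 0.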
χ² statistic
How to use χ² for multiple categories?
Compute χ² for each category and then combine:
To discriminate well across all categories, take the expected value:

$$\chi^2_{avg}(t) = \sum_{i=1}^{m} P(c_i)\,\chi^2(t, c_i)$$

To discriminate well for a single category, take the maximum:

$$\chi^2_{max}(t) = \max_{i=1\ldots m} \chi^2(t, c_i)$$
χ² statistic
Plus
normalized and thus comparable across terms
χ²(t,c) is 0 when t and c are independent
sound theoretical background
Minus
unreliable for low-frequency terms
computationally expensive
Term strength
Term strength:

$$s(t) = P(t \in y \mid t \in x)$$

x, y: topically related documents (e.g. from a clustering algorithm)
Measures co-occurrence of terms (unlike idf)
For more details see: Wilbur and Sirotkin, "The automatic identification of stop words"
Comparison on Reuters
Correlation of feature selection criteria
Feature Selection Summary
(From Yang and Pedersen)
Classification Algorithms
Overview
There is a large zoo of classification algorithms
Decision Trees
Naïve Bayes
Maximum Entropy methods
k nearest neighbor classifiers
Neural networks
Support vector machines
Many of them have been covered in other
lectures
Decision Tree for Reuters classification
From Manning & Schütze
Decision Boundaries for Decision Trees
1-Nearest Neighbor
3-Nearest Neighbor
Assign the category of the majority of the neighbors.
But what if one of the neighbors is much closer? We can weight neighbors according to their similarity.
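The kNN decision rule can be sketched as follows. A minimal sketch with an unweighted majority vote; the dot-product similarity is a stand-in assumption for whatever document similarity is actually used.

```python
def knn_classify(query, examples, k=3, sim=None):
    """examples: list of (vector, label) pairs.
    Assign the majority label among the k most similar neighbors."""
    if sim is None:
        sim = lambda u, v: sum(a * b for a, b in zip(u, v))  # dot product
    neighbors = sorted(examples, key=lambda e: sim(query, e[0]), reverse=True)[:k]
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
```

Weighted voting would replace the `+ 1` with the neighbor's similarity to the query.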
Decision Boundaries for k Nearest Neighbor
(schematic)
Bayes Decision Rule

$$\hat{k} = \arg\max_{k} P(x \mid k)\, P(k)$$

k: class label
x: features
Naïve Bayes
x is not a single feature but a bag of features
e.g. different key words for your spam-mail detection system
Assume statistical independence of the features:

$$P(\{x_1, \ldots, x_N\} \mid k) = \prod_{i=1}^{N} P(x_i \mid k)$$
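Combining the Bayes decision rule with the independence assumption gives the classic Naïve Bayes text classifier. A minimal sketch in log space; the add-one smoothing is a standard choice assumed here, not prescribed by the slides.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (words, class_label) pairs.
    Returns a classify(words) function implementing
    argmax_k log P(k) + sum_i log P(x_i | k)."""
    class_counts = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, c in docs:
        word_counts[c].update(words)
        vocab.update(words)

    def classify(words):
        def log_score(c):
            total = sum(word_counts[c].values())
            s = math.log(class_counts[c] / len(docs))       # log prior
            for w in words:
                # add-one (Laplace) smoothing so unseen words don't zero out
                s += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            return s
        return max(class_counts, key=log_score)

    return classify
```

Working in log space turns the product of small probabilities into a sum, avoiding numeric underflow on long documents.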
Maximum Entropy Methods
A way to estimate probabilities
Features are taken into account as constraints on the probabilities
Otherwise the probability estimate is kept as unbiased as possible
Linear binary classification using a Perceptron (Simplest Neural Network)
Data: {(x_i, y_i)}, i = 1...n
x in R^d (x is a vector in d-dimensional space): feature vector
y in {-1, +1}: label (class, category)
Question: design a linear decision boundary wx + b = 0 (the equation of a hyperplane) such that the classification rule associated with it has minimal probability of error.
Classification rule: y = sign(wx + b), which means:
if wx + b > 0 then y = +1
if wx + b < 0 then y = -1
Linear binary classification
Find a good hyperplane (w, b) in R^{d+1} that correctly classifies as many data points as possible.
In online fashion: one data point at a time; update the weights as necessary.
Classification rule: y = sign(wx + b)
[Figure: data points separated by the hyperplane wx + b = 0]
Perceptron algorithm
Initialize: w_1 = 0
Updating rule, for each data point x_i:
  if class(x_i) != decision(x_i, w_k)
    then w_{k+1} = w_k + y_i x_i;  k = k + 1
    else w_{k+1} = w_k

function decision(x, w):
  if wx + b > 0 return +1
  else return -1

[Figure: hyperplanes w_k x + b = 0 and w_{k+1} x + b = 0 before and after an update; slide note: the drawing does not correspond to the algorithm with respect to the treatment of b]
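The update rule above can be sketched as follows. To sidestep the slide's inconsistency about b, this sketch folds the bias into the weight vector as an always-on extra feature, a standard trick.

```python
def perceptron_train(data, epochs=10):
    """data: list of (x, y) with x a feature tuple and y in {-1, +1}.
    The bias b is folded in as a constant extra feature, so w has d+1 entries."""
    d = len(data[0][0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        for x, y in data:
            xb = list(x) + [1.0]  # append always-on bias feature
            activation = sum(wi * xi for wi, xi in zip(w, xb))
            if y * activation <= 0:  # misclassified: w <- w + y * x
                w = [wi + y * xi for wi, xi in zip(w, xb)]
    return w

def perceptron_predict(w, x):
    """Classification rule: y = sign(wx + b)."""
    xb = list(x) + [1.0]
    return 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else -1
```

On linearly separable data (here, an AND-like problem) the perceptron convergence theorem guarantees this loop stops making mistakes after finitely many updates.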
Perceptron algorithm
Online: can adjust to changing target, over
time
Advantages
Simple and computationally efficient
Guaranteed to learn a linearly separable problem
(convergence, global optimum)
Limitations
Only linear separations
Only converges for linearly separable data
Not really efficient with many features
Large margin classifier
Another family of linear algorithms.
Intuition (Vapnik, 1965): if the classes are linearly separable:
Separate the data
Place the hyperplane far from the data: large margin
Statistical results guarantee good generalization
[Figures: a BAD separating hyperplane that passes close to the data vs. a GOOD maximal margin classifier placed far from both classes]
Large margin classifier
If not linearly separable
Allow some errors
Still, try to place
hyperplane far from
each class
Large Margin Classifiers
Advantages
Theoretically better (better error bounds)
Limitations
Computationally more expensive: requires solving a large quadratic program
Support Vector Machine (SVM)
Large Margin Classifier, linearly separable case
Goal: find the hyperplane that maximizes the margin
[Figure: hyperplane w^T x + b = 0 with margin boundaries w^T x_a + b = 1 and w^T x_b + b = -1; the data points lying on the margin are the support vectors]
Summary
Types of text classification
Features and feature selection
Classification algorithms