Seed-Guided Topic Model For Document Filtering and Classification
Nanyang Technological University, Singapore.
Li, Chenliang; Chen, Shiqian; Xing, Jian; Sun, Aixin; Ma, Zongyang
2018
Li, C., Chen, S., Xing, J., Sun, A., & Ma, Z. (2018). Seed‑guided topic model for document
filtering and classification. ACM Transactions on Information Systems, 37(1), 9‑.
doi:10.1145/3238250
https://hdl.handle.net/10356/142845
https://doi.org/10.1145/3238250
© 2018 Association for Computing Machinery. All rights reserved. This paper was published
in ACM Transactions on Information Systems and is made available with permission of
Association for Computing Machinery.
One important necessity is to filter out irrelevant information and organize relevant information into meaningful categories.
However, developing text classifiers often requires a large number of labeled documents as training examples. Manually
labeling documents is costly and time-consuming. More importantly, it becomes unrealistic to know all the categories cov-
ered by the documents beforehand. Recently, a few methods have been proposed to label documents by using a small set
of relevant keywords for each category, known as dataless text classification. In this paper, we propose a seed-guided topic
model for the dataless text filtering and classification (named DFC). Given a collection of unlabeled documents, and for each
specified category a small set of seed words that are relevant to the semantic meaning of the category, DFC filters out the
irrelevant documents and classifies the relevant documents into the corresponding categories through topic influence. DFC
models two kinds of topics: category-topics and general-topics. Also, there are two kinds of category-topics: relevant-topics
and irrelevant-topics. Each relevant-topic is associated with one specific category, representing its semantic meaning. The
irrelevant-topics represent the semantics of the unknown categories covered by the document collection, and the general-topics capture the global semantic information. DFC assumes that each document is associated with a single category-topic
and a mixture of general-topics. A novelty of the model is that DFC learns the topics by exploiting the explicit word co-
occurrence patterns between the seed words and regular words (i.e., non-seed words) in the document collection. A document
is then filtered, or classified, based on its posterior category-topic assignment. Experiments on two widely used datasets show
that DFC consistently outperforms the state-of-the-art dataless text classifiers for both classification with filtering and clas-
sification without filtering. In many tasks, DFC can also achieve comparable or even better classification accuracy than the
state-of-the-art supervised learning solutions. Our experimental results further show that DFC is insensitive to the tuning
parameters. Moreover, we conduct a thorough study of the impact of seed words on existing dataless text classification techniques. The results reveal that it is not the number of seed words, but the document coverage of the seed words for the corresponding category, that affects the dataless classification performance.
CCS Concepts: • Information systems → Document topic models; Clustering and classification;
General Terms: Algorithms, Management, Experimentation
Additional Key Words and Phrases: Topic Model, Dataless Classification, Document Filtering
ACM Reference Format:
Chenliang Li, Shiqian Chen, Jian Xing, Aixin Sun, and Zongyang Ma, 2018. Seed-Guided Topic Model for Document Fil-
tering and Classification. ACM Trans. Inf. Syst. V, N, Article A (January YYYY), 36 pages.
DOI: 0000001.0000001
1. INTRODUCTION
With the advance of information technology, the tremendous amount of textual information generated every day is far beyond what people can manage manually. The recent prevalence
of social media further exacerbates this information overload, because rich information about various
This paper is an extended version of the paper [Li et al. 2016b] presented at the 25th International ACM CIKM conference
(Indianapolis, USA, Oct 24-28, 2016).
Author’s addresses: C. Li (corresponding author), S. Chen, State Key Lab of Software Engineering, Computer School, Wuhan
University, China 430072; J. Xing, Hithink RoyalFlush Information Network Co, Ltd, Hangzhou, China; A. Sun, School of
Computer Science and Engineering, Nanyang Technological University, Singapore 639798; Z.Ma, Baidu Inc. ShenZhen,
China 518000.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the
full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact
the owner/author(s).
© YYYY Copyright held by the owner/author(s). 1046-8188/YYYY/01-ARTA $15.00
DOI: 0000001.0000001
ACM Transactions on Information Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
kinds of events, user opinions, and daily-life activities is generated at an unprecedented speed. Such timely information breeds new and dynamic information needs everywhere. For example, a data
analyst might need to track an emerging event by using a few relevant keywords [Ritter et al. 2015].
A data-driven company often needs to conduct a focused and deep analysis on the documents of
the specified categories. Within these semantic applications, one fundamental task is to filter out
irrelevant information and organize relevant information into meaningful topical categories.
During the last decade, text classifiers have become important tools in managing and analyzing
large document collections. Text classification refers to the task of assigning category labels to docu-
ments based on their semantics. Due to its wide usage, text classification has been studied intensively
for many years. Existing solutions are mainly based on supervised learning techniques which require
tremendous human effort in annotating documents as labeled examples, as shown in the upper part
of Figure 1. To reduce the labeling effort, many semi-supervised algorithms have been proposed for
text classification [Chang et al. 2008; Nigam et al. 2000]. Considering the diversity of the documents
in many applications, constructing a relatively small training set required by the semi-supervised al-
gorithms remains very expensive. Recently, a number of dataless text classification methods have
been proposed [Chang et al. 2008; Chen et al. 2015; Downey and Etzioni 2008; Druck et al. 2008;
Gliozzo et al. 2009; Hingmire and Chakraborti 2014; Hingmire et al. 2013; Li et al. 2016b; Liu
et al. 2004; Song and Roth 2014]. Instead of using labeled documents as training examples, dataless
methods only require a small set of relevant words for each category or labeling the topics learned
from a standard LDA model [Blei et al. 2003], to build text classifiers. As illustrated in the lower
part of Figure 1, dataless classifiers do not require labeled documents, which saves substantial human effort. It has been reported that a speed-up of up to 5 times can be achieved when building a dataless
text classifier with indistinguishable performance to a supervised classifier, by assuming that label-
ing a word is 5 times faster than labeling a document [Druck et al. 2008]. These promising results
suggest that dataless text classification is a practical alternative to the supervised approaches, when
constructing the training documents is not an easy task. More importantly, the labeled documents
produced by a dataless classifier can also be used as training examples to learn supervised text clas-
sifiers if necessary [Li et al. 2016b]. However, these existing dataless classification techniques do
not consider document filtering. That is, we would like to retrieve all the documents relevant to a specified
set of categories from a given document collection, and organize these relevant documents into the
corresponding categories. With the existing dataless classifiers, we need to provide all the categories
and the corresponding seed words covered by the document collection. Unfortunately, it is often un-
realistic to foresee all possible categories covered by a document collection, since the documents
streamed in are likely to cover dynamic topics. In an extreme case, the number of possible categories
covered by documents could be potentially limitless.
In this study, we aim to devise a dataless algorithm for the task of filtering and classifying doc-
uments into categories of interest. Figure 2 provides an illustration for this dataless filtering and
classification task. Specifically, given a document collection of D documents and C categories of
interest, where each category c is defined by a small set of seed words Sc , the task is to filter out
the documents irrelevant to any of the C categories, and to classify the relevant documents into C
categories, without using any labeled documents. We also refer to this task as dataless classification
with filtering.
Human beings can quickly learn to distinguish whether a document belongs to a category, based
on several relevant keywords about a category. This is because people can learn to build the
relevance among the representative words of the category. For example, a human being can success-
fully identify a relevant word “wheel” to category automobile, after browsing several documents
in category automobile, even if she does not know the meaning of word “wheel”. The underlying
reason is the high co-occurrence between “wheel” and other relevant words like “cars” and “en-
gines”. This relevance learning process is analogous to the unsupervised topic inference process of
the standard LDA [Blei et al. 2003], a probabilistic topic model (PTM) that implicitly infers the hid-
den topics from the documents based on the higher-order word co-occurrence patterns [Thomas and
Mark 2004]. However, conventional PTMs like PLSA and LDA are unsupervised techniques that
implicitly infer the hidden topics based on word co-occurrences [Blei et al. 2003; Hofmann 1999]. It
is difficult or even infeasible to filter and classify documents in such a purely unsupervised manner.
Inspired by the recent success of the PTM-based dataless text classification techniques [Chen
et al. 2015; Hingmire and Chakraborti 2014; Hingmire et al. 2013; Li et al. 2016b], in this paper,
we propose a seed-guided topic model for dataless text filtering and classification, named DFC.
Given a collection of unlabeled documents, DFC is able to achieve the goal of filtering and classi-
fication by taking only a few semantically relevant words for each category of interest (called “seed
words”). To enable document filtering, we model two sets of category-topics: relevant-topics and
irrelevant-topics. A one-to-one correspondence between relevant-topics and categories of interest
is made. That is, each relevant-topic is associated with one specific category of interest, and vice
versa. The relevant-topic is assumed to represent the meaning of that category1 . Since the docu-
ments relevant to the categories of interest could comprise a limited proportion of the whole collec-
tion, irrelevant-topics are expected to model other categories covered by the irrelevant documents.
1 Category and relevant-topic are considered equivalent and exchangeable in this work when the context has no ambiguity.
In our earlier work [Li et al. 2016b], a topic model based dataless classification technique (named
STM) is proposed by using a set of general-topics to model the general semantics of the whole doc-
ument collection. Although the modeling of general and specific aspects of documents was studied
previously for information retrieval [Chemudugunta et al. 2006], it had been overlooked for dataless
text classification in previous studies [Chen et al. 2015; Hingmire and Chakraborti 2014; Hingmire
et al. 2013]. Our earlier work has proven that this model setting is beneficial for dataless classifica-
tion performance. Following this modeling strategy, in DFC, we also utilize a set of general-topics
to represent the general semantic information. The task of filtering and classification is achieved by
associating each document with a single category-topic2 and a mixture of general-topics. The pos-
terior category-topic assignment is then used to label the document as a category of interest, or an
irrelevant one. DFC also subsumes STM under some particular parameter settings. This means that
DFC is also able to conduct dataless text classification without filtering.
Seeking useful supervision from the seed words to precisely infer category-topics and general-
topics is vital to the efficacy of DFC. In other words, precise relevance estimation between a word and
a category-topic via a small set of seed words is crucial for DFC. It is worth underlining that no
seed word can be provided for any irrelevant-topic, because the possible semantic categories covered
by a document collection are unknown beforehand. However, the precise inference for the irrelevant-
topics is essential to guarantee classification performance. Without any supervision from the corresponding seed words, the model can hardly identify the irrelevant-topics, leading to inferior classification performance. Here, we devise a simple but effective mechanism to identify
a set of pseudo seed words for each irrelevant-topic. Specifically, we resort to using standard LDA to
extract the hidden topics for the document collection in an unsupervised manner. Then, a relevance
measure is proposed to calculate the distance between each LDA hidden topic and all the seed words
provided for the relevant-topics. After a heuristic procedure to filter out noisy LDA hidden topics,
the top topical words from the least relevant LDA hidden topics are then considered as the pseudo
seed words for the irrelevant-topics.
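As a concrete illustration of this step, the sketch below scores each LDA topic by the probability mass it assigns to the provided seed words and takes the top words of the least relevant topics as pseudo seeds. The relevance score and the omission of the noisy-topic filtering heuristic are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np

def pseudo_seed_words(topic_word, vocab, seed_sets, n_irrelevant, top_n=10):
    """Pick pseudo seed words for the irrelevant-topics from the least
    seed-relevant LDA topics.

    topic_word : (K, V) topic-word probability matrix from a standard LDA run.
    seed_sets  : one list of seed words per category of interest.

    The relevance score below (total probability mass a topic assigns to the
    provided seed words) is an illustrative stand-in for the paper's measure.
    """
    idx = {w: i for i, w in enumerate(vocab)}
    seed_ids = [idx[w] for s in seed_sets for w in s if w in idx]
    # Relevance of each LDA topic to the union of all provided seed words.
    relevance = topic_word[:, seed_ids].sum(axis=1)
    # The least relevant topics are taken to describe irrelevant categories.
    least = np.argsort(relevance)[:n_irrelevant]
    return [[vocab[i] for i in np.argsort(topic_word[k])[::-1][:top_n]]
            for k in least]
```

The top words of each selected topic then serve as the pseudo seed set for one irrelevant-topic.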
In contrast to the existing dataless classification methods that simply exploit the semantic guidance
provided by the seed words in an implicit way (i.e., the word co-occurrence information), we adopt
an explicit strategy to estimate the relevance between a word and a category-topic, and also the initial
category-topic distribution for each document. The estimated relevance is then utilized to supervise
the topic learning process of DFC. In particular, we investigate two mechanisms (i.e., Doc-Rel and
Topic-Rel) to estimate the probability of a word being generated by a category-topic, by measuring its
correlations to the (pseudo) seed words of that category-topic based on either document-level word
co-occurrence or topical-level word co-occurrence information. We call the words that are generated
by a category-topic category words.
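To make the idea concrete, the following sketch estimates a Doc-Rel-style relevance from document-level word co-occurrence; the normalization used here (co-occurring documents over documents containing the word) is an illustrative assumption rather than the exact estimator used in DFC.

```python
from collections import Counter, defaultdict

def doc_rel(docs, seed_sets):
    """Doc-Rel-style sketch: score each word against each category by its
    document-level co-occurrence with that category's seed words.

    docs      : list of tokenized documents (lists of words).
    seed_sets : one collection of seed words per category.
    Returns {word: [score for each category]}.
    """
    seed_sets = [set(s) for s in seed_sets]
    df = Counter()             # document frequency of each word
    co = defaultdict(Counter)  # co[c][w]: docs where w co-occurs with a seed of c
    for doc in docs:
        words = set(doc)
        for w in words:
            df[w] += 1
        for c, seeds in enumerate(seed_sets):
            if words & seeds:
                for w in words:
                    co[c][w] += 1
    return {w: [co[c][w] / df[w] for c in range(len(seed_sets))] for w in df}
```

A word that frequently shares documents with a category's seed words thus receives a high score for that category, mimicking how "wheel" becomes attached to automobile in the earlier example.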
In summary, DFC conducts the document filtering and classification in a weakly supervised man-
ner, just as what humans do in learning to classify documents with just few words: (i) first to identify
the highly relevant documents based on the given seed words of a category; (ii) then based on these
highly relevant documents, to collectively identify the category words in addition to the seed words;
(iii) next to use both the seed words and category words to find new relevant documents and new
category words; the last step repeats until a global equilibrium is optimized. We conduct extensive
experiments on two datasets Reuters-10 and 20-Newsgroup, and compare DFC with state-of-the-
art dataless text classifiers and supervised learning solutions. In terms of classification accuracy
measured by F1 , our experimental results show that DFC outperforms all the dataless competitors in
almost all the tasks and performs better than the supervised classifiers sLDA and SVM in many tasks
for both classification with filtering and classification without filtering. We also conduct a compre-
hensive performance evaluation to analyze the impact of parameter settings in DFC. The results show
that the proposed DFC is robust across a broad range of parameter values, indicating its practicality in
real scenarios.
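The three-step intuition above can be caricatured with plain counting, as in the sketch below. In DFC itself this loop is realized by joint topic inference rather than explicit counting, and all thresholds here are illustrative.

```python
from collections import Counter

def bootstrap_category_words(docs, seeds, rounds=3, top_docs=5, top_words=5):
    """Illustrative sketch of the relevance-bootstrapping intuition behind
    DFC; not the actual inference algorithm."""
    keywords = set(seeds)
    for _ in range(rounds):
        # (i) rank documents by overlap with the current keyword set
        scored = sorted(docs, key=lambda d: len(keywords & set(d)), reverse=True)
        relevant = scored[:top_docs]
        # (ii) take frequent words in the relevant documents as category words
        counts = Counter(w for d in relevant for w in d)
        new = {w for w, _ in counts.most_common(top_words)}
        # (iii) stop once no new category words emerge (the "equilibrium")
        if new <= keywords:
            break
        keywords |= new
    return keywords
```

Starting from a single seed, the keyword set grows with words that co-occur in the most relevant documents, then stabilizes.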
2A category-topic refers to either a relevant-topic or an irrelevant-topic in this work when the context is for DFC.
We need to emphasize that all existing dataless text classification techniques rely solely on the
weak supervision provided by the seed words. That is, the quality of seed words plays a crucial role in classification performance. It is intuitive that fewer seed words carry less semantic information for classification. However, a study of seed words in the paradigm of dataless text classification is still missing. Here, an open question naturally arises: will using more seed words lead to better classification accuracy? If the answer is no, then what criterion should we use to build a set of seed words for a category? In this work, we conduct a thorough
study with the aim of answering these two questions. We find that using more seed words may not
produce better classification accuracy. Also, we empirically observe that the more relevant documents a category's seed words cover, the better the classification accuracy that can be obtained. In an
extreme case, given two seed words for a category, no performance gain could be obtained by using
both seed words over using either one, if the two words always appear together in documents. That
is, it is not the number of seed words that matters, but the document coverage for that category. In
summary, the main contributions of this paper are listed as follows:
(1) We propose and formalize a new task of dataless text filtering and classification. To the best of
our knowledge, this is the first work to classify documents into relevant categories of interest, and
filter out irrelevant documents in a dataless manner. To enable precise inference of irrelevant-
topics, we propose a novel mechanism to identify the pseudo seed words for irrelevant topics in
an unsupervised manner.
(2) DFC does not solely rely on the implicit word co-occurrence patterns to guide the category infer-
ence process. Instead, we introduce two mechanisms to estimate the probability of a word being
generated by a category-topic. The estimation is based on the explicit word co-occurrence patterns
derived from the document collection.
(3) We empirically study the impact of seed words for dataless text classification techniques. Our
results suggest that using more seed words may not lead to better classification accuracy. Instead,
the document coverage of the selected seed words correlates positively with the classification
accuracy.
(4) We conduct extensive experiments to evaluate the proposed DFC on two real-world text datasets.
The results demonstrate that DFC achieves promising classification and filtering performance,
and outperforms the existing dataless alternatives. Moreover, compared with supervised classifiers
(sLDA and SVM), DFC even achieves better performance in a few tasks.
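The notion of document coverage in contribution (3) admits a direct computation: the fraction of documents containing at least one seed word. The definition below is a straightforward reading; the paper's exact measure may differ in detail.

```python
def seed_coverage(docs, seeds):
    """Fraction of documents containing at least one of the seed words.

    docs  : list of tokenized documents (lists of words).
    seeds : collection of seed words for one category.
    """
    seeds = set(seeds)
    hit = sum(1 for d in docs if seeds & set(d))
    return hit / len(docs) if docs else 0.0
```

In the extreme case described earlier, adding a second seed word that always co-occurs with the first leaves the coverage, and hence the expected benefit, unchanged.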
2. RELATED WORK
Here, we mainly review related work on dataless text classification and topic modeling with auxiliary
knowledge.
2.1. Dataless Text Classification
The parameters of GE-FL are optimized by minimizing the distance between the expected category distribution of the documents containing a labeled word under GE-FL and the corresponding reference
category distribution of the labeled word. They showed that a speed-up of 5 times can be achieved
by GE-FL with indistinguishable performance to an entropy regularization based semi-supervised
(ER) method [Grandvalet and Bengio 2004], given labeling a word is 5 times faster than labeling a
document [Raghavan et al. 2006].
Chang et al. proposed a dataless text classification method by projecting each word and document
into the same semantic space of Wikipedia concepts [Chang et al. 2008]. They represent each cate-
gory with the words used in the category label. The similarity between a document and a category is
measured by using Explicit Semantic Analysis (ESA) [Gabrilovich and Markovitch 2007]. Recently,
Song and Roth [Song and Roth 2014] studied the task of dataless hierarchical text classification by
applying the work of [Chang et al. 2008]. Their experimental results showed that Wikipedia-based
ESA still performs the best for this task. Since a large-scale knowledge base like Wikipedia is not always available for many languages or domains, this method may not be applicable in those cases.
Note that the proposed DFC does not rely on the external knowledge base at all. Instead, DFC learns
the discriminative category information by exploiting the semantic relevance of the seed words to
the dataset itself, which can be applied in a much broader range of scenarios.
Several methods based on the standard LDA have been proposed for dataless text classifica-
tion [Chen et al. 2015; Hingmire and Chakraborti 2014; Hingmire et al. 2013]. Hingmire et al.
proposed a dataless text classifier model based on the LDA, named ClassifyLDA [Hingmire et al.
2013]. ClassifyLDA first infers the hidden topics by using LDA. Then, an annotator assigns a cate-
gory to each topic. ClassifyLDA continues the topic inference process by aggregating the topics with
the same category label as a single topic. The corresponding category of the topic with the maxi-
mum posterior topic proportion is used as the prediction. They showed that ClassifyLDA achieves
almost comparable performance with a semi-supervised naive Bayes classifier (NB-EM) proposed
in [Nigam et al. 2000]. Hingmire and Chakraborti [Hingmire and Chakraborti 2014] proposed a new
model (TLC) by extending ClassifyLDA, which allows assigning more than one category to a topic.
Then, TLC was further enhanced to incorporate the relevant words of each category, called TLC++.
TLC++ selects the most informative words by using the information gain metric based on the ini-
tial category predictions from TLC. They found that TLC++ consistently outperforms ClassifyLDA,
TLC and GE-FL by a comprehensive evaluation.
Chen et al. proposed an LDA-based dataless classification model, called DescLDA [Chen et al.
2015]. DescLDA assumes that each category is associated with a fixed number of topics, and each
topic is only associated with a single category. The selected semantically related words (called de-
scriptive words) for each category are used to constrain the topic-word distribution such that these
words have a higher probability under the associated topics. DescLDA has a tuning parameter, i.e.,
the topic number for a category. However, DescLDA is very sensitive to this parameter. According to
their experimental results, a significant performance degradation is experienced when a suboptimal
number is used, across both datasets used in their work. Our earlier work proposes a seed-guided
topic model for dataless text classification, named STM [Li et al. 2016b]. In STM, two separate sets
of topics are modeled: category-topics and general-topics. While category-topics are responsible
for extracting the discriminative category information, general-topics are used to organize the gen-
eral semantic information underlying the whole collection. The experimental results demonstrated
that STM outperforms DescLDA, TLC++, GE-FL, and a supervised topic model for classification
(sLDA), and is more robust than these competitors. They also showed that STM even achieves very
close or better performance than SVM in many tasks. Built on the basis of STM, DFC is also very
robust to the parameter settings. Our experimental results show that little performance variations are
observed for different parameter settings across different datasets.
Our proposed DFC differs significantly from the above PTM-based solutions. While these methods
can classify the documents into their corresponding categories without any training document, they
can only be applied for classification without filtering. That is, we need to provide the seed words
for all the categories covered by the document collection, which is unrealistic in many real-world
applications.
2.2. Topic Models with Auxiliary Knowledge.
Different kinds of prior domain knowledge have been incorporated into PTMs to achieve better per-
formance of different tasks. Mimno et al. exploited the corpus-specific word co-occurrence informa-
tion to enhance the topic coherence of the standard LDA [Mimno et al. 2011]. Besides exploiting
the corpus-specific knowledge, many works have proposed to incorporate the semantic relations
between word pairs into the topic model [Andrzejewski et al. 2009; Chen and Liu 2014; Chen et al.
2013; Li et al. 2016a]. The semantic relatedness information based on the learnt word embeddings
over the large external corpus is incorporated for better short text topic modeling in [Li et al. 2017,
2016a]. A seeded topic model was proposed to extract the aspects and sentiments from the customer
comments in [Mukherjee and Liu 2012], named SAS. SAS takes the seed words related to a specific
aspect as a seed set, e.g., words related to the aspect room service. Then an aspect is considered as
a multinomial distribution over the non-seed words and the seed sets. Based on the implicit word
co-occurrence information regarding these seed words, SAS can obtain a significant improvement in
terms of aspect extraction accuracy. Similarly, Jagarlamudi et al. proposed a SeededLDA model to
learn better topic-word and document-topic distributions with the seed words selected by using infor-
mation gain from the labeled documents [Jagarlamudi et al. 2012]. These works exploit the semantic
guidance provided by the seed words in an implicit way, i.e., the word co-occurrence information.
Differing from these works, we employ an explicit strategy to estimate the initial document relevance
and discriminate relevant words of a specific category. This prior knowledge, extracted based on the seed words, is then used directly to guide the topic inference process, leading to a promising classification performance. Later in Section 4.4, we will show that this strategy indeed brings significant
improvement to the classification performance.
Since our work exploits the seed words and the word co-occurrence information to derive the cate-
gory of each document in a collective manner, the underlying motivation is strongly connected to the
pseudo relevance feedback, a widely studied strategy for information retrieval enhancement [Buckley
and Salton 1995; Dunlop 1997; Espinosa and Akella 2012; Miao et al. 2016; Ye and Huang 2014].
The aim of pseudo relevance feedback is to expand the query with relevant words covered by the
top-retrieved documents in the first pass. The techniques presented in the existing literature mainly
count the word frequency in these highly relevant documents regarding the original query. Hence,
the words that highly co-occur with the query terms are considered to be relevant and extracted as the
complement. Several works also incorporate topical information to guide the query expansion pro-
cess. Caballero and Akella [Espinosa and Akella 2012] incorporate a topic-based language model to
extract the relevant words. Miao et al. [Miao et al. 2016] proposed to weight the top-retrieved doc-
uments in terms of their proximity in the latent topic space. Here, our work introduces two kinds of
topics: category-topics and general-topics. The word co-occurrence information and the provided
seed words are used together to build the prior knowledge (i.e., category word probability). This
prior knowledge is then used to supervise the topic inference process. By taking the seed words
as the query, the category word probability estimation and topic inference can be considered as a
query expansion process, but in a probabilistic manner. That is, the most probable words under a
category-topic reflect the meaning of the corresponding category, in addition to the seed words.
3. SEED-GUIDED TOPIC MODEL
In this section, we present the proposed DFC model for dataless text filtering and classification in
detail.
3.1. Generative Process and Inference
Unlike the existing related works that directly use a topic to represent a distinct category [Chen
et al. 2015; Hingmire and Chakraborti 2014; Hingmire et al. 2013], we assume that there are three
kinds of topics underlying the document collection: relevant-topics, irrelevant-topics, and general-topics.
Fig. 3. The overall framework of DFC: unlabeled documents and the seed words of the categories of interest are input; pseudo seed words are extracted for the irrelevant-topics, and topic inference then separates the irrelevant documents from the categories of interest.
As in PLSA and LDA, general-topics in DFC are shared by all the documents and capture the
global semantic information of the whole collection. In contrast, each relevant-topic is associated with a single category and is composed of the relevant words of that category. Also, each category
has only one relevant-topic. That is, the category and its related relevant-topic have a one-to-one
correspondence in DFC. Since the categories of interest could only cover a part of all documents in
the collection, we use irrelevant-topics to cover the semantics of the irrelevant documents. In this
sense, we call both relevant-topics and irrelevant-topics as category-topics. In DFC, each document
is generated by a category-topic and all general-topics together.
Since general-topics are used in DFC to model the semantic information besides the category in-
formation encoded by the category-topics, it is expected that the documents of the same category
could share similar document general-topic distributions to some extent. Consequently, we associate
each category-topic in DFC with a mixture of general-topics. Let ϕc denote the general-topic distribution associated with category-topic c, and θd denote the general-topic distribution of document d.
We assume the general-topic distribution θd is sampled from a Dirichlet prior with the concentration
parameter α1 and the category's ϕc as the base measure. This hierarchical Dirichlet prior setting allows the deviation of the general-topic distributions of documents under the same category to be controlled by the concentration parameter α2 [Wallach et al. 2009]. Given R specified
categories (i.e., relevant-topics), T irrelevant-topics and B general-topics, a word in a document
can be either generated from category-topic c or a general-topic b. The document is an irrelevant
document when category-topic c refers to an irrelevant-topic. Otherwise, the document belongs to
the corresponding category indicated by relevant-topic c. The main notations used in the rest of the
paper are summarized in Table I. The graphical representation of DFC is shown in Figure 4 and the
generative process is described as follows:
1. For each relevant-topic r ∈ {1...R}:
   (a) draw a general-topic distribution ϕr ∼ Dir(α0);
   (b) draw a word distribution ϑr ∼ Dir(β0);
2. For each irrelevant-topic t ∈ {1...T}:
ACM Transactions on Information Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
Fig. 4. Graphical representation of DFC. Since the variables ηd and δw,c are estimated based on the (pseudo) seed words of DFC, we plot these two sets of variables in dotted circles.
A:10 C. Li et al.
   (c) if yd = 0:
       ∗ if xd,i = 1:
In the above generative process, the binary variable yd indicates whether document d is a relevant
document (yd = 1) or an irrelevant one (yd = 0). κd works as a prior preference for yd: it gives
the prior probability that a document is relevant without considering the textual content. The
binary variable xd,i = 0 indicates that the associated word wd,i is generated from category-topic
cd; otherwise (xd,i = 1), word wd,i is generated from a specific general-topic zd,i.
We infer the hidden parameters {ϕc, ϑc, φb, θd, zd,i, xd,i, cd, yd} in DFC. As with LDA and
other probabilistic topic models (PTMs), exact inference in DFC is intractable. We therefore use
Gibbs sampling to perform approximate inference and parameter learning [Thomas and Mark 2004].
Specifically, we construct a Markov chain over the latent parameters; at each step, a latent
parameter or a set of latent parameters is sampled from its conditional probability given the
values of the other parameters. In DFC, because zd,i and xd,i are correlated, we sample their
values jointly as follows:
where $\varphi_{b,\neg(d,i)}^{w_{d,i}}$ is the probability of seeing $w_{d,i}$ under general-topic $b$ excluding the current assignment, $\theta_{d,\neg i}^{b}$ is the probability of seeing general-topic $b$ in document $d$ excluding the current assignment, and $\vartheta_{c_d,\neg(d,i)}^{w_{d,i}}$ is the probability of seeing word $w_{d,i}$ under category-topic $c_d$ excluding the current assignment.
Seed-Guided Topic Model for Document Filtering and Classification A:11
where $n_b^w$ is the number of times word $w$ is assigned to general-topic $b$, $n_d^b$ is the number of words assigned to general-topic $b$ within document $d$, $n_t^w$ is the number of times word $w$ is assigned to irrelevant-topic $t$, and $n_t$ is the number of documents assigned to irrelevant-topic $t$. The symbol $\neg d$ means that document $d$ is excluded from the count. When $y_d = 1$, we then sample a relevant-topic $r$ as follows:
\[
\begin{aligned}
p(c_d = r, y_d = 1 \mid \mathbf{z}, \mathbf{x}, \mathbf{y}_{\neg d}, \mathbf{c}_{\neg d}, \mathbf{w}) \propto{}&
\prod_{b=1}^{B} \frac{\prod_{w \in d} (n_b^w + \beta_1 - 1) \cdots (n_{b,\neg d}^w + \beta_1)}
{\big[\sum_{w=1}^{W} (n_b^w + \beta_1) - 1\big] \cdots \big[\sum_{w=1}^{W} (n_{b,\neg d}^w + \beta_1)\big]} \\
&\times \prod_{b=1}^{B} \frac{(n_d^b + \alpha_2 \phi_r^b - 1) \cdots (\alpha_2 \phi_r^b)}
{\big[\sum_{k=1}^{B} (n_d^k + \alpha_2 \phi_r^k) - 1\big] \cdots \big[\sum_{k=1}^{B} \alpha_2 \phi_r^k\big]} \\
&\times \frac{\prod_{w \in d} (n_r^w + \beta_0 - 1) \cdots (n_{r,\neg d}^w + \beta_0)}
{\big[\sum_{w=1}^{W} (n_r^w + \beta_0) - 1\big] \cdots \big[\sum_{w=1}^{W} (n_{r,\neg d}^w + \beta_0)\big]}
\; \eta_d(r)\, \kappa_d \qquad (3)
\end{aligned}
\]
where $n_r^w$ is the number of times word $w$ is assigned to relevant-topic $r$. The category general-topic distributions $\{\phi_r, \phi_t\}$, the document general-topic distribution $\theta_d$, and the word distributions $\{\vartheta_r, \vartheta_t, \varphi_b\}$ can be computed via point estimation as follows:
\[ \phi_r^b = \frac{n_r^b + \alpha_0}{\sum_{k=1}^{B} (n_r^k + \alpha_0)} \qquad (4) \]
\[ \phi_t^b = \frac{n_t^b + \alpha_0}{\sum_{k=1}^{B} (n_t^k + \alpha_0)} \qquad (5) \]
\[ \theta_d^b = \frac{n_d^b + \alpha_2 \phi_{c_d}^b}{\sum_{k=1}^{B} (n_d^k + \alpha_2 \phi_{c_d}^k)} \qquad (6) \]
\[ \vartheta_r^w = \frac{n_r^w + \beta_0}{\sum_{w'=1}^{W} (n_r^{w'} + \beta_0)} \qquad (7) \]
\[ \vartheta_t^w = \frac{n_t^w + \beta_0}{\sum_{w'=1}^{W} (n_t^{w'} + \beta_0)} \qquad (8) \]
\[ \varphi_b^w = \frac{n_b^w + \beta_1}{\sum_{w'=1}^{W} (n_b^{w'} + \beta_1)} \qquad (9) \]
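The point estimates in Equations 4-9 are simple smoothed normalizations of the count matrices collected during sampling. The following is a minimal sketch of that computation; the array names and shapes are our own assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def point_estimates(n_rb, n_db, n_rw, n_bw, alpha0, alpha2, beta0, beta1, cd):
    """Point estimates for the DFC distributions (Eqs. 4-9), sketched with numpy.

    n_rb: (R+T, B) counts of general-topic b under category-topic r/t
    n_db: (D, B)   counts of general-topic b within document d
    n_rw: (R+T, W) counts of word w under category-topic r/t
    n_bw: (B, W)   counts of word w under general-topic b
    cd:   (D,)     category-topic currently assigned to each document
    """
    # Eqs. 4-5: category general-topic distributions phi_r / phi_t
    phi = (n_rb + alpha0) / (n_rb + alpha0).sum(axis=1, keepdims=True)
    # Eq. 6: document general-topic distribution with the hierarchical prior
    prior = alpha2 * phi[cd]                     # (D, B) base measure per document
    theta = (n_db + prior) / (n_db + prior).sum(axis=1, keepdims=True)
    # Eqs. 7-8: category-topic word distributions vartheta_r / vartheta_t
    vartheta = (n_rw + beta0) / (n_rw + beta0).sum(axis=1, keepdims=True)
    # Eq. 9: general-topic word distributions varphi_b
    varphi = (n_bw + beta1) / (n_bw + beta1).sum(axis=1, keepdims=True)
    return phi, theta, vartheta, varphi
```

Each returned matrix is row-normalized, so every distribution sums to one over its support.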
Because all the pairs of zd,i and xd,i of document d are affected by the choice of category-topic cd
and vice versa (Equations 1-3), the sampling order of zd,i, xd,i and cd becomes a critical factor.
Simply sampling a new relevant-topic r or irrelevant-topic t based on Equations 2-3, with all
zd,i, xd,i values conditioned on the previous cd, would not converge to the true posterior
distribution. This is a common issue of Markov Chain Monte Carlo (MCMC) methods, known as
autocorrelation [Straatsma et al. 1986]. The details of the Gibbs sampling process of DFC are
described in Algorithm 1. To avoid autocorrelation, in the first step we sample each pair of
zd,i, xd,i conditioned on each possible c (i.e., R + T possible choices, Lines 3-19 in Algorithm 1).
Then cd is sampled conditioned on all the corresponding zd,i, xd,i values with Equations 2 and 3.
Afterwards, all the zd,i, xd,i values are set to the values sampled in the first step under the
updated cd (Lines 21-24 in Algorithm 1). For document filtering and classification, we take the
sampled cd of document d at the last iteration as the prediction. Document d belongs to the
corresponding category of relevant-topic r when yd = 1 and cd refers to relevant-topic r;
otherwise, document d is considered an irrelevant document (i.e., filtered out) when cd refers
to an irrelevant-topic.
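The per-document sampling order just described can be summarized structurally as follows. This is a sketch of the control flow only, not Algorithm 1 verbatim: the `sample_zx_given_c` and `sample_c` callbacks stand in for the conditionals of Equations 1-3, and the toy versions below are hypothetical placeholders:

```python
import random

def resample_document(doc_words, R, T, sample_zx_given_c, sample_c):
    """Structural sketch of one DFC sampling pass over a document.

    Step 1: draw a (z, x) pair for every word under EVERY candidate
            category-topic c (R + T choices).
    Step 2: resample the document's category-topic conditioned on all of
            those draws (the role of Equations 2-3).
    Step 3: commit only the (z, x) values drawn under the chosen c, which
            avoids conditioning on the stale previous cd (autocorrelation).
    """
    candidates = list(range(R + T))
    zx_under = {c: [sample_zx_given_c(w, c) for w in doc_words] for c in candidates}
    c_new = sample_c(zx_under)
    return c_new, zx_under[c_new]

# Toy stand-ins for the model conditionals (for illustration only):
def toy_zx(word, c, num_general_topics=4):
    x = random.randint(0, 1)                      # x = 0: word from category-topic c
    z = random.randrange(num_general_topics) if x == 1 else None
    return (z, x)

def toy_c(zx_under):
    return random.choice(sorted(zx_under))        # uniform choice over candidates
```

In the real model, the callbacks would evaluate the excluded-count conditionals rather than sample uniformly.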
\[ dist(k, S) = 1 - \sum_{s \in S} \sum_{w \in W_k} p_{LDA}(w \mid k)\, p_{LDA}(k \mid s) \qquad (10) \]
where Wk is the set of dominating topical words under LDA hidden topic k, pLDA(w|k) is the word
probability under topic k, and pLDA(k|s) is the topic probability under seed word s. Here, we take
Wk to be the top-10 topical words associated with topic k in Equation 10, since only the top
topical words can precisely represent the meaning of the topic. Before the distance calculation,
we first filter out noisy LDA hidden topics. Specifically, LDA hidden topics matching either of
the following criteria, computed over their top-10 topical words, are filtered out: 1) the
percentage of numeric words is larger than 50%; 2) the percentage of words shorter than 4
characters is larger than 50%. Then, we take the top-10 topical words under each of the T least
relevant LDA hidden topics as the pseudo seed words for the irrelevant-topics in DFC.
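The filtering and selection steps above can be sketched as follows. The function and container names are illustrative assumptions, not the paper's implementation:

```python
def pseudo_seed_words(topic_words, p_w_given_k, p_k_given_s, seeds, T, top_n=10):
    """Sketch of pseudo seed word extraction for irrelevant-topics (Eq. 10).

    topic_words: LDA topic id -> its topical words, most probable first
    p_w_given_k: topic id -> {word: p_LDA(w|k)}
    p_k_given_s: seed word -> {topic id: p_LDA(k|s)}
    """
    def noisy(words):
        numeric = sum(w.isdigit() for w in words) / len(words)
        short = sum(len(w) < 4 for w in words) / len(words)
        return numeric > 0.5 or short > 0.5

    # Filter out noisy LDA topics based on their top-n words
    kept = {k: ws[:top_n] for k, ws in topic_words.items() if not noisy(ws[:top_n])}

    def dist(k):  # Eq. 10: distance between topic k and the whole seed set
        return 1.0 - sum(p_w_given_k[k].get(w, 0.0) * p_k_given_s[s].get(k, 0.0)
                         for s in seeds for w in kept[k])

    # The top words of the T least relevant topics become pseudo seed words
    least_relevant = sorted(kept, key=dist, reverse=True)[:T]
    return {k: kept[k] for k in least_relevant}
```

A topic dominated by numeric or very short tokens never reaches the distance computation, and a topic far from every seed word (large `dist`) is ranked first as irrelevant.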
Given a collection of unlabeled documents and a few seed words for each category, a critical
challenge for DFC is to incorporate the supervision of the provided seed words so as to filter
and classify the documents through topic influence. As described in the generative process, ηd
works as the prior preference over the specified categories for d, and δw,c works as the prior
preference that word w is generated from category-topic c (i.e., xd,i = 0). We call the words
generated by a category-topic category words. In DFC, this prior knowledge must be derived solely
from the seed words to enable effective document filtering and classification. In the following,
we describe the mechanisms used in DFC to estimate the category word probability and the initial
document category distribution.
order word co-occurrence patterns. Therefore, words that frequently co-occur with each other in
documents of the same category can be grouped together in the corresponding category-topic.
Because DFC is a probabilistic topic model, a wrong category-topic may be sampled for some
documents. When the underlying collection is severely imbalanced, documents of the largest
category could be allocated a wrong category-topic, and the smaller categories will then be
dominated by these documents. The resulting incorrect category-topics of the smaller categories
in turn severely hurt the classification performance.
Doc-Rel. Similar to the given seed words of c, category words are expected to represent the
semantic meaning of category-topic c. However, without labeled training documents for a category,
statistically informative words cannot be easily derived. On the other hand, for relevant-topics,
we believe that category words should be semantically or statistically related to the seed words
of that category. Although semantically relevant words can be extracted based on the seed words
and an external thesaurus or knowledge base, such prior knowledge bases may not always be
available. Here, we simply use word co-occurrences to estimate the category word probability. If
a word co-occurs frequently with the seed words of a category, it is more likely to be a category
word. The degree of co-occurrence between a word w and a seed word s is measured by the
conditional probability p(w|s):
\[ p(w \mid s) = \frac{df(w, s)}{df(s)} \qquad (11) \]
where df(s) is the number of documents containing seed word s, and df(w, s) is the number of
documents containing both word w and seed word s. Then, we calculate the relevance score
rel(w, c) and weight τw,c for each word w and category-topic c as follows:
\[ rel(w, c) = \frac{1}{|S_c|} \sum_{s \in S_c} p(w \mid s) \qquad (12) \]
\[ \nu(w, c) = \max\Big( \frac{rel(w, c)}{\sum_{c} rel(w, c)} - \frac{1}{A},\ 0 \Big) \qquad (13) \]
\[ \nu_c(w, c) = \frac{\nu(w, c)}{\sum_{w} \nu(w, c)} \qquad (14) \]
\[ \tau_{w,c} = \max\Big( \frac{\nu_c(w, c)}{\sum_{c} \nu_c(w, c)},\ \epsilon \Big) \qquad (15) \]
In Equation 12, Sc is the set of seed words of category-topic c. Note that Equations 11-12 need
seed words as input. However, for the irrelevant-topics, seed words cannot be provided, since the
possible (irrelevant) categories in the collection are unknown beforehand. In Section 3.2, we
propose a simple but novel mechanism to extract pseudo seed words for irrelevant-topics in an
unsupervised manner. For irrelevant-topics, the δw,c values are then estimated based on these
pseudo seed words for topic inference.
In Equation 13, we normalize the relevance score rel(w, c) and subtract the average relevance
score 1/A, where A refers to the total number of category-topics under consideration (i.e.,
A = R + T). Word w is expected to be a category word for category-topic c only if w and c have a
high rel(w, c) value; subtracting the average therefore filters out irrelevant categories in
Equation 13. At this stage, we could simply take ν(w, c) as the final weight indicating the
relevance between w and c. However, we observe that the absolute value of ν(w, c) does not truly
reflect the relevance between word w and category-topic c, because the values are also largely
affected by the statistical properties of the seed words. Given a seed word s with a very high
document frequency (i.e., large df(s)), p(w|s) would be relatively smaller than for seed words
with small document frequencies, based on Equation 11. This results in relatively smaller
ν(w, c) values for all words under category c. Hence, we take the
impact of the seed words into account by normalizing ν(w, c) with respect to c using Equation 14.
Similarly, the νc(w, c) values for different words w could be on different scales. We further
normalize νc(w, c) with respect to word w and take it as the final relevance weight τw,c between
word w and category c, using Equation 15. A higher τw,c means that word w is more likely to be a
category word of category c. A small constant ε is assigned to words that have relatively low
rel(w, c) values, to avoid zero weights; we use ε = 0.01 in this work. With τw,c calculated by
Equation 15, we can refine the category word probability δw,c of word w and category-topic c as
follows:
\[ \delta_{w,c} = \frac{\tau_{w,c}\, \rho}{1 - \rho + \tau_{w,c}\, \rho} \qquad (16) \]
In Equation 16, ρ is a tuning parameter within [0, 1], specifying the importance of τw,c for
δw,c. When ρ = 0 (i.e., δw,c = 0), DFC degrades to standard LDA and classifies each document
based only on the document's general-topic distribution. When ρ = 1 (i.e., δw,c = 1), DFC
consists only of category-topics, and all word occurrences are assigned to some category-topic;
this is similar to the TLC++ model proposed in [Hingmire and Chakraborti 2014], except that the
association between topic and category is made here by the seed words instead of by manual
labeling as in TLC++.
The above procedure estimates the category word probability based on document-level word
co-occurrences. We therefore denote this relevance estimation mechanism as Doc-Rel.
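The full Doc-Rel pipeline (Equations 11-16) can be sketched end to end as below. The data layout (nested dicts keyed by word/seed pairs) and the tiny denominators guarding against division by zero are our own assumptions for illustration:

```python
def doc_rel_delta(df_w_s, df_s, seeds_per_cat, rho=0.85, eps=0.01):
    """Sketch of the Doc-Rel estimate of delta_{w,c} (Eqs. 11-16).

    df_w_s[(w, s)]: number of documents containing both word w and seed s
    df_s[s]:        number of documents containing seed s
    seeds_per_cat:  category-topic -> list of (pseudo) seed words
    """
    cats = list(seeds_per_cat)
    A = len(cats)
    words = {w for (w, _) in df_w_s}
    # Eqs. 11-12: co-occurrence based relevance rel(w, c)
    rel = {(w, c): sum(df_w_s.get((w, s), 0) / df_s[s] for s in S) / len(S)
           for w in words for c, S in seeds_per_cat.items()}
    # Eq. 13: normalize over categories and subtract the average 1/A
    nu = {(w, c): max(rel[w, c] / max(sum(rel[w, k] for k in cats), 1e-12) - 1.0 / A, 0.0)
          for w in words for c in cats}
    # Eq. 14: normalize within each category (over its words)
    nu_c = {(w, c): nu[w, c] / max(sum(nu[v, c] for v in words), 1e-12)
            for w in words for c in cats}
    # Eq. 15: normalize over categories again, floored at eps
    tau = {(w, c): max(nu_c[w, c] / max(sum(nu_c[w, k] for k in cats), 1e-12), eps)
           for w in words for c in cats}
    # Eq. 16: mix with the tuning parameter rho
    return {(w, c): tau[w, c] * rho / (1.0 - rho + tau[w, c] * rho)
            for w in words for c in cats}
```

A word that co-occurs only with the seeds of one category receives a δ close to ρ for that category and a δ close to ερ elsewhere.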
Topic-Rel. Normally, a document covers multiple topics and contains noisy information (e.g.,
common words). Two words appearing together may not be semantically relevant, because they could
refer to two distinct semantic topics. Recall that we utilize standard LDA to extract the pseudo
seed words for irrelevant-topics. Here, we also devise a topic-based mechanism to estimate
category word probabilities. Let pLDA(w|k) be the word probability obtained for hidden topic k by
LDA; we can then derive the topic probability pLDA(k|w) by Bayes' theorem [Li et al. 2016a]. We
denote this topic-level relevance estimation mechanism as Topic-Rel. Given the L most relevant
topics Ls for a seed word s in terms of pLDA(k|s), we calculate the relevance score rel(w, c) as
follows:
\[ rel(w, c) = \frac{1}{|S_c|} \sum_{s \in S_c} \sum_{k \in L_s} p_{LDA}(w \mid k)\, p_{LDA}(k \mid s) \qquad (17) \]
We then calculate δw,c by following Equations 13-16. Instead of measuring relevance based on
coarse document-level co-occurrence information, Equation 17 takes the topical information into
consideration: only the relevant words under the most relevant topics of the seed words are
likely to be category words for the corresponding category. In our study, we observe that a seed
word has only very few dominating LDA topics; that is, pLDA(k|s) is very small for most LDA
topics. Here, we use the 3 most relevant topics (i.e., L = 3) to save computation cost3.
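The inner score of Equation 17 for a single seed word set can be sketched as below; the nested-dict representation of the LDA probabilities is an illustrative assumption:

```python
def topic_rel(w, seeds, p_w_given_k, p_k_given_s, L=3):
    """Sketch of the Topic-Rel relevance score rel(w, c) for one category
    (Eq. 17): only the L most relevant LDA topics of each seed word contribute.

    p_w_given_k: topic id -> {word: p_LDA(w|k)}
    p_k_given_s: seed word -> {topic id: p_LDA(k|s)}
    """
    score = 0.0
    for s in seeds:
        # L_s: the L topics with the largest p_LDA(k|s) for this seed word
        top_L = sorted(p_k_given_s[s], key=p_k_given_s[s].get, reverse=True)[:L]
        score += sum(p_w_given_k[k].get(w, 0.0) * p_k_given_s[s][k] for k in top_L)
    return score / len(seeds)
```

Truncating to the top-L topics discards the long tail of negligible pLDA(k|s) terms, which is why small L loses little accuracy.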
Note that both Topic-Rel and pseudo seed word extraction require running LDA over the document
collection. The inference results from a single run of LDA can be used for both relevance
estimation and pseudo seed word extraction. However, efficiency could still be an issue when
utilizing LDA to extract the hidden topics. Fortunately, we find that a small number of
iterations (i.e., 100) is adequate to deliver promising performance, and no further gain is
obtained with more iterations. In this work, we take the results after running LDA for 100
iterations. Moreover, existing work enabling efficient topic inference could be utilized [Yao
et al. 2009; Yuan et al. 2015]; however, this is beyond the focus of this work.
3 We validated the performance of DFC with different L values; little performance variation is
observed with larger L values.
4. EXPERIMENT
In this section, we evaluate the filtering and classification performance of the proposed DFC4
against other state-of-the-art dataless text classification alternatives and supervised learning
methods. Under certain parameter settings, DFC degrades to be equivalent to the earlier proposed
STM for classification without filtering; that is, DFC can also be adapted for dataless
classification without filtering. For completeness, we also conduct a performance comparison on
conventional classification without filtering. Then, we analyze the impact of the parameter
settings of DFC. Our experimental results show that DFC outperforms all existing state-of-the-art
competitors and is very robust to the parameter settings. Finally, we conduct a thorough study of
the impact of seed words on the existing state-of-the-art dataless techniques (including DFC).
4.1. Datasets
Two real-world text collections are used for performance evaluation.
20-Newsgroup (20NG): 20NG is a widely used dataset5 for document classification
research [Chang et al. 2008; Chen et al. 2015; Guan et al. 2009; Kusner et al. 2015; Xie and Xing
2013]. It contains approximately 20,000 newsgroup documents, evenly distributed across 20
different newsgroups/categories. We use the bydate version of the 20NG dataset, where a total of
18,846 documents are divided into a training set (60%) and a test set (40%). The 20 categories
can be further aggregated into 6 major categories; for example, the major category sci consists
of 4 categories: sci.crypt, sci.electronics, sci.med, and sci.space. This dataset has been used
in related works [Chang et al. 2008; Chen et al. 2015; Druck et al. 2008; Hingmire and Chakraborti
2014; Hingmire et al. 2013; Li et al. 2016b; Song and Roth 2014]. When parsing the documents, we
keep the text contained in the "Subject", "Keywords", and "Content" fields; the information in
the other fields, as well as email addresses, is filtered out.
Reuters-10: Reuters-21578 is also a widely used dataset for document classification. It contains
21,578 documents in 135 categories; 13,625 and 6,188 documents are in the training set and test
set, respectively. This dataset is very imbalanced, and the variation in category size is quite
large. We use the 10 largest categories (hence denoted Reuters-10) with the Aptè split6. We
further discard documents belonging to more than one category. This leaves a total of 7,285
documents: 5,228 in the training set and 2,057 in the test set. The same
subset, Reuters-10, has been previously used in the related works as well [Chen et al. 2015; Li et al.
2016b; Xie and Xing 2013].
For both datasets, we further remove stop words, words shorter than 2 characters, and words
appearing in fewer than 5 documents. The data statistics after preprocessing are reported in
Tables II and III, respectively. The statistics of the 5 major categories of the 20NG dataset
used in the experiments are reported in the last 5 rows of Table II. Observe from Table III that
Reuters-10 is very imbalanced: while most categories have around 100-300 documents, the two
largest categories (i.e., earn and acq) have 3,677 and 2,055 documents, respectively.
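The vocabulary pruning described above can be sketched in a few lines; the function and parameter names are ours, not the paper's:

```python
from collections import Counter

def preprocess(docs, stopwords, min_len=2, min_df=5):
    """Sketch of the preprocessing above: drop stop words, words shorter than
    `min_len` characters, and words appearing in fewer than `min_df` documents."""
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    def keep(w):
        return w not in stopwords and len(w) >= min_len and df[w] >= min_df
    return [[w for w in doc if keep(w)] for doc in docs]
```

Document frequency (not raw term frequency) drives the rarity filter, so a word repeated many times in one document is still removed.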
Methods in Comparison. The proposed DFC is compared against the following state-of-the-art
dataless text classification methods and supervised classification methods:
Topic Labeled Classification with Labeled Words (TLC++). This method learns to classify
documents based on the posterior topic proportions and the category labels of the
topics [Hingmire and Chakraborti 2014]. The topic labels are annotated manually based on the
most probable words in each topic. We present the 30 most probable words of each topic for topic
labeling, the setting recommended by its authors.
Generalized Expectation with Feature Labels (GE-FL). It learns a maximum entropy based text
classifier by using the labeled words in each category as soft constraints [Druck et al. 2008].
Here, we use the same seed words that are used for DFC as labeled words of each category for
fair comparison. We used the implementation of GE-FL that is provided in the MALLET toolkit.7
Descriptive LDA (DescLDA). This method learns the category label of a document by applying doc-
ument clustering over the learned hidden topics [Chen et al. 2015]. The hidden topic distributions
are inferred based on the seed words. DescLDA has a tunable parameter: the number of topics.
We report the best results with the optimal setting obtained in our experiments.
Seed-based NB-EM (SNB-EM). It learns a dataless text classifier in a semi-supervised manner [Liu
et al. 2004], where the NB-EM method [Nigam et al. 2000] is used for model building. We report
the best performance with the optimal parameter settings obtained in our experiments. For fair
comparison, we use the same seed words that are used for DFC to build the initial training
instances.
Support Vector Machines (SVM). This is a state-of-the-art supervised learning technique for text
classification. We train a linear SVM classifier by using LIBSVM with the default parameter
settings and TF-IDF weighting scheme.8
Seed-based Support Vector Machines (SSVM). We construct a pseudo training set by labeling a
training document with a category if the document contains any seed word of that category. Then,
we train a linear SVM classifier by using LIBSVM with the default parameter settings and the
TF-IDF weighting scheme.
sLDA. It is a supervised text classifier based on the LDA model [Blei and McAuliffe 2007]. We train
the model by using the implementation provided by the authors9 . The best results obtained in our
experiments are reported.
MedLDA. It is an LDA-based supervised topic model that exploits the max-margin principle for
joint max-margin and maximum-likelihood learning [Zhu et al. 2009, 2012]. We use the
implementation provided by the authors. The best results obtained in our experiments are reported.
Among the above methods in comparison, SVM, SSVM, TLC++, MedLDA and sLDA can be
adapted for classification with filtering. For SVM and SSVM, we can adopt the one-class SVM
for evaluation10 . For TLC++, a pseudo category is added to indicate the irrelevance during the topic
annotation process. As for MedLDA and sLDA, we introduce a pseudo category to cover all irrelevant
documents in the training set. The hyper-parameters for these methods are tuned accordingly for
optimal performance.
The remaining methods are used for performance evaluation on classification without filtering.
Note that Chang et al. learn the category label of a document by projecting the document and the
category into the same semantic space of Wikipedia concepts [Chang et al. 2008]; the
nearest-neighbor based explicit semantic analysis (NN-ESA) is then used for dataless
classification without filtering. Since NN-ESA involves parsing the whole Wikipedia, we choose
not to include this method in the comparison. Nevertheless, DescLDA was reported to significantly
outperform NN-ESA in an earlier study [Chen et al. 2015]. A state-of-the-art dataless classifier
(i.e., STM) was proposed in our earlier work [Li et al. 2016b]. However, STM can be applied only
to classification without filtering. Note that the proposed DFC subsumes STM under particular
parameter settings. As an extension of STM, for the evaluation of classification without
filtering, we report the classification performance of DFC with the corresponding settings
(cf. Section 3.4).
7 http://mallet.cs.umass.edu
8 www.csie.ntu.edu.tw/~cjlin/liblinear
9 http://www.cs.cmu.edu/~chongw/slda/
10 The implementation provided in LIBSVM is used.
Parameter Setting. DFC has several hyper-parameters. They are set to typical values, α0 = 50/B
and β0 = β1 = 0.01, as used in PTM studies [Thomas and Mark 2004]. We set γ = ε = 0.01 in our
experiments. For the tunable parameters of DFC, we use the following settings: (1) for
classification without filtering (i.e., T = 0, κd = 1), we follow the setting in STM [Li et al.
2016b] and set α2 = 100, ρ = 0.85, B = 3 · R; (2) for classification with filtering, we set
α1 = 50/T, α2 = 100, ρ = 0.95, T = 20, B = 3 · (R + T). Also, we use κd = 0.5 so that no prior
preference is given to a document regarding its relevance. A study of the tunable parameters is
presented in Section 4.4. We run DFC for 20 iterations, and the category-topic assigned to a
document during the last iteration is taken as its predicted label.
Performance Metric. In the experiments, we use the standard training/test partitions of the two
datasets for evaluation. All dataless classifiers run over both the training and test documents
as a single collection of unlabeled documents (i.e., without using their labels). For fair
comparison with the supervised methods, classification accuracy is evaluated based on the labels
of the documents in the test set. For the supervised methods, the classifiers are trained on the
training documents and then evaluated on the test set, as per normal.
For performance comparison, we report macro-averaged F1 scores (Macro-F1) [Manning et al.
2008]. Macro-F1 is the average of the F1 scores of all categories. We report the average results
over 10 runs with random model initialization for all methods excluding SNB-EM, SSVM and SVM.
The same settings were used in previous works [Hingmire and Chakraborti 2014; Li et al. 2016b].
Statistical significance is assessed using Student's t-test.
Seed Words Selection. The quality of the seed words is a critical factor for all dataless
classifiers. Here, we exploit two sets of seed words, selected from the category labels (denoted
SL) and from the category descriptions (denoted SD), respectively. Category label means that the
seed words are extracted directly from the label in the given dataset. For example, from the
category label comp.sys.ibm.pc.hardware in 20NG, five seed words "computer, systems, ibm, pc,
hardware" are extracted as SL. Note that semantically irrelevant words in the label are excluded;
for example, "talk" is excluded from category talk.politics.guns. The seed words in SD are
compiled manually with domain knowledge. For example, the authors of DescLDA followed the
labeling procedure used in TLC++ [Hingmire and Chakraborti 2014] (i.e., assisted by standard LDA)
to compile SD. Both sets of seed words were used in earlier studies [Chang et al. 2008; Chen
et al. 2015; Song and Roth 2014]. Further details11 about these seed words can be found in the
work on DescLDA [Chen et al. 2015]. These seed words are included in the Appendix. The numbers of
seed words in SL and SD for each category are listed in Tables II and III.
11 The seed words for these two datasets are available at https://github.com/WHUIR/STM
smallest categories, respectively. Tables IV and V report the Macro-F1 scores of these methods on
the 15 tasks using SL and SD, respectively. We make the following observations:
First, with SD as the seed words, DFC +Doc-Rel achieves the best performance among the dataless
classifiers on 12 out of 15 classification with filtering tasks. DFC +Topic-Rel achieves the best
performance on one task and the second best on 14 tasks. TLC++ and SSVM achieve much worse
performance, comparable to each other. Overall, on these 15 classification with filtering tasks,
DFC +SD +Doc-Rel outperforms TLC++ and SSVM by around 170.5% and 136.4% on average, respectively.
The relative performance gains of DFC +SD +Topic-Rel over the two baselines are 146.6% and
115.5%, respectively.
Second, the supervised learning methods SVM, MedLDA and sLDA do not always perform better than
the dataless counterparts. Specifically, SVM performs consistently worse than DFC +SD/SL +Doc-Rel
and DFC +SD +Topic-Rel, and only delivers better performance than DFC +SL +Topic-Rel on 2
classification with filtering tasks: sci and acq-earn. The average performance gain of DFC
+Topic-Rel over SVM is 54.3% and 85.6% under the SL and SD settings, respectively; it goes up to
81.6% and 103.5%, respectively, with DFC +Doc-Rel. Similarly, with SL, both DFC +Doc-Rel and DFC
+Topic-Rel outperform MedLDA on most classification with filtering tasks. With SD, DFC +Doc-Rel
performs consistently better than MedLDA on all 15 classification with filtering tasks, and DFC
+Topic-Rel also manages to achieve better performance than MedLDA on 14 out of 15 tasks.
As to sLDA, under the SD setting, DFC +Doc-Rel outperforms sLDA on 10 out of 15 classification
tasks, and DFC +Topic-Rel is better on 4 tasks. With SL, DFC +Doc-Rel and DFC +Topic-Rel each
outperform sLDA on 4 classification tasks. In particular, DFC achieves much better performance
than sLDA on the two tasks related to the smaller categories: gold and coffee-gold-sugar. As
reported in Table III, these categories contain relatively few training documents. Without an
adequate amount of training data, supervised classification techniques can be error-prone for
classification with filtering under such extreme sparsity. Moreover, DFC +SL +Doc-Rel obtains
performance very close to sLDA on two classification tasks: politics-religion and
politics-rec-religion-sci.
Third, among the three supervised classifiers, sLDA performs significantly better than SVM and
MedLDA. sLDA performs much better than MedLDA on all 15 classification with filtering tasks,
though both are built on probabilistic topic modeling techniques. Moreover, MedLDA performs
significantly better than SVM on most tasks, achieving superior performance on 11 out of 15
classification with filtering tasks. We believe there are two main reasons for these performance
differences. First, relying on word occurrences alone (e.g., SVM) is too limited for document
filtering, because training documents for all the categories underlying the document collection
are unavailable. Second, both SVM and MedLDA are built on the max-margin principle. When the
document collection becomes severely imbalanced, max-margin based optimization can easily lead to
model overfitting. Note that the collection becomes very imbalanced when the number of specified
categories is small or the number of documents covered by the specified categories is small; for
example, MedLDA performs relatively worse on tasks like med, gold and coffee-gold-sugar. In
comparison, the average performance gain of DFC +SD +Doc-Rel over sLDA is 3.7%.
Note also that this comparison favors MedLDA, sLDA and SVM, since these supervised methods have
access to a large number of training documents. Besides, training sLDA with 40 hidden topics
takes about 6 hours, and the optimal performance is often obtained with a larger topic number
(e.g., 60 or 80), which needs much more time. This inefficiency severely hinders its application
in scenarios where the information needs change dynamically and fast response is required. In
contrast, DFC +Doc-Rel takes only 30 and 15 minutes on average for 20NG and Reuters-10,
respectively, excluding the LDA hidden topic inference. Moreover, the proposed DFC can be easily
deployed in parallel computing settings. This overall superiority suggests that DFC +Doc-Rel is a
desirable solution for real-world classification with filtering scenarios with dynamic
information needs.
Table IV. Macro-F1 of the six methods for classification with filtering, using the seed words in
SL. On each task, the best and second best results by dataless classifiers are highlighted in
boldface and underlined, respectively. † indicates that the difference from the best dataless
classifier is statistically significant at the 0.05 level. N and H indicate that a supervised
classifier performs better or worse, respectively, than the best dataless classifier. Avg: the
averaged Macro-F1 over all tasks.
Fourth, with SD, DFC +Doc-Rel, DFC +Topic-Rel and SSVM perform better than their counterparts
with SL on the majority of the tasks (the study of seed words is presented in Section 4.5). This
is expected, because more seed words provide more semantic information to exploit. The
performance gains of using SD over SL are about 12.1%, 20.3% and 14.3% on average for DFC
+Doc-Rel, DFC +Topic-Rel and SSVM, respectively. The performance gap between DFC +SD +Doc-Rel and
DFC +SL +Doc-Rel is the smallest among the alternatives. When SD is used, SSVM performs much
better than TLC++; this is reasonable, since more seed words produce more pseudo training
documents of high quality. DFC +SL +Doc-Rel outperforms TLC++ on 14 tasks, and outperforms
SSVM+SL and SSVM+SD on all tasks. Similarly, DFC +SL +Topic-Rel outperforms TLC++, SSVM+SL and
SSVM+SD on 12, 13 and 13 tasks, respectively.
As to the two relevance estimation mechanisms, DFC +Doc-Rel outperforms DFC +Topic-Rel on the
majority of the classification tasks when either SL or SD is used. With SL, DFC +Doc-Rel
outperforms DFC +Topic-Rel on 9 out of 15 classification tasks; the gap widens under the SD
setting, where DFC +Doc-Rel achieves better performance on 12 tasks.
Classification without Filtering.
For classification without filtering, we evaluate all the methods on both datasets: 20 categories in
the 20NG dataset and 10 categories in the Reuters-10 dataset. We also create 7 classification tasks
based on the 20NG dataset, by selecting the documents in subsets of all categories. For example,
one of the tasks is to classify documents in categories pc and mac, denoted by pc-mac. These
7 classification tasks were used in the works of TLC++ and STM for the evaluation [Hingmire and
Chakraborti 2014; Li et al. 2016b]. In total, we have 9 classification tasks involving different numbers of categories on the two datasets. Tables VI and VII report the Macro-F1 scores of these methods
on the 9 tasks by using SL and SD respectively. We make the following observations:
First, with SD as the seed words, DFC +Doc-Rel significantly outperforms the other state-of-the-art dataless alternatives on 8 out of 9 classification tasks. SNB-EM+SD performs the second best on 5 classification tasks, and DFC +Topic-Rel performs the best on a single task and second best on the other 2 tasks. GE-FL, DescLDA and TLC++ achieve worse but mutually comparable performance. The
Table V. Macro-F1 of the six methods for classification with filtering, where the seed words in SD are used.
The best and second best results are highlighted in boldface and underlined respectively, on each task. †
indicates that the difference to the best dataless classifier is statistically significant at 0.05 level. N and H
indicate that the supervised classifiers perform better or worse than the best dataless classifier respectively.
Avg: the averaged Macro-F1 over all tasks.
Table VI. Macro-F1 of the eight methods for classification without filtering, where the seed words in SL are used. The best and second
best results by dataless classifiers are highlighted in boldface and underlined respectively, on each task. † indicates that the difference
to the best dataless classifier is statistically significant at 0.05 level. N and H indicate that the supervised classifiers perform better or
worse than the best dataless classifier respectively. Avg: the averaged Macro-F1 over all tasks.
Dataset | Classification task | DFC +Doc-Rel | DFC +Topic-Rel | GE-FL | DescLDA | SNB-EM | TLC++ | MedLDA | sLDA | SVM
20NG | med-space | 0.967 | 0.972 | 0.712† | 0.877† | 0.897 | 0.938† | 0.975 N | 0.910† H | 0.976 N
20NG | pc-mac | 0.902 | 0.678† | 0.491† | 0.688† | 0.895 | 0.685† | 0.881† H | 0.735† H | 0.925 N
20NG | politics-religion | 0.907 | 0.506† | 0.684† | 0.888† | 0.894 | 0.911 | 0.949† N | 0.925† N | 0.954 N
20NG | politics-sci | 0.960 | 0.746† | 0.750† | 0.624† | 0.846 | 0.906† | 0.941† H | 0.930† H | 0.971 N
20NG | comp-religion-sci | 0.918 | 0.857† | 0.709† | 0.559† | 0.907 | 0.817† | 0.930 N | 0.900 H | 0.936 N
20NG | politics-rec-religion-sci | 0.919† | 0.742† | 0.719† | 0.514† | 0.768 | 0.834† | 0.932 N | 0.823† H | 0.941 N
20NG | autos-motorcycles-baseball-hockey | 0.936† | 0.957 | 0.849† | 0.531† | 0.715 | 0.734† | 0.962 N | 0.894† H | 0.957
20NG | All 20 categories | 0.662† | 0.572† | 0.320† | 0.632† | 0.461 | 0.510† | 0.705† N | 0.633† H | 0.820 N
Reuters-10 | All 10 categories | 0.701† | 0.496† | 0.667† | 0.317† | 0.529 | 0.506† | 0.562† H | 0.754† N | 0.932 N
— | Avg | 0.875 | 0.725 | 0.656 | 0.626 | 0.768 | 0.760 | 0.871 | 0.834 | 0.935
superiority of SNB-EM is attributed to its semi-supervised nature. After an initial NB-EM classifier is built based on the seed words, SNB-EM retrains itself using its high-confidence classification results, and this procedure repeats until the probability parameters stabilize. All the dataless classifiers obtain relatively poorer results when all the categories are used in the classification tasks, i.e., 20 categories on 20NG and 10 categories on Reuters-10. DFC +SD +Doc-Rel achieves the best performance over the other alternatives in these two tasks. Although TLC++ and GE-FL achieve comparable performance in the other tasks when SD is used, GE-FL performs much better here. We observe that when the number of categories is larger, the resultant topics produced by LDA often carry mixed semantic information from more than one category. It even becomes difficult for annotators to manually associate topics with the relevant categories for TLC++. In this sense, TLC++ experiences a significant performance deterioration. Overall on these 9 classification tasks, DFC +SD
Table VII. Macro-F1 of the eight methods for classification without filtering, where the seed words in SD are used. The best and second
best results by dataless classifiers are highlighted in boldface and underlined respectively, on each task. † indicates that the difference
to the best dataless classifier is statistically significant at 0.05 level. N and H indicate that the supervised classifiers perform better or
worse than the best dataless classifier respectively. Avg: the averaged Macro-F1 over all tasks.
Dataset | Classification task | DFC +Doc-Rel | DFC +Topic-Rel | GE-FL | DescLDA | SNB-EM | TLC++ | MedLDA | sLDA | SVM
20NG | med-space | 0.972 | 0.979 | 0.935† | 0.977 | 0.967 | 0.938† | 0.975 N | 0.910† H | 0.976 H
20NG | pc-mac | 0.936 | 0.416† | 0.705† | 0.694† | 0.876 | 0.685† | 0.881† H | 0.735† H | 0.925 H
20NG | politics-religion | 0.952 | 0.935 | 0.883† | 0.900† | 0.939 | 0.911† | 0.949 H | 0.925† H | 0.954 N
20NG | politics-sci | 0.962 | 0.912† | 0.889† | 0.912† | 0.941 | 0.906† | 0.941† H | 0.930† H | 0.971 N
20NG | comp-religion-sci | 0.923 | 0.861† | 0.828† | 0.498† | 0.919 | 0.817† | 0.930 N | 0.900† H | 0.936 N
20NG | politics-rec-religion-sci | 0.941 | 0.914† | 0.827† | 0.782† | 0.917 | 0.834† | 0.932 H | 0.823† H | 0.941
20NG | autos-motorcycles-baseball-hockey | 0.977 | 0.957 | 0.673† | 0.713† | 0.938 | 0.734† | 0.962 H | 0.894† H | 0.957 H
20NG | All 20 categories | 0.739 | 0.728 | 0.590† | 0.663† | 0.678 | 0.510† | 0.705† H | 0.633† H | 0.820 N
Reuters-10 | All 10 categories | 0.822 | 0.791† | 0.776† | 0.800† | 0.778 | 0.506† | 0.562† H | 0.754† H | 0.932 N
— | Avg | 0.914 | 0.833 | 0.790 | 0.771 | 0.884 | 0.760 | 0.871 | 0.834 | 0.935
of Tables II and III, almost all categories studied here have fewer than 5 seed words in SL, except for the major category comp, which has 13 SL seed words. With just a few semantically relevant words, DFC +SL +Doc-Rel can deliver promising classification performance. This superiority suggests that DFC +Doc-Rel is not very sensitive to the number of seed words.
Fourth, DFC +Doc-Rel outperforms DFC +Topic-Rel on the majority of the classification tasks when either SL or SD is used. DFC +Topic-Rel only obtains better performance on 3 classification tasks, and the performance gain on these tasks is marginal. Also, DFC +SD +Topic-Rel performs much better than DFC +SL +Topic-Rel on almost all the classification tasks, as discussed above. Compared with the performance reported for classification with filtering in Tables IV and V, it is obvious that Topic-Rel is more suitable for classification with filtering. Note that DFC +Topic-Rel estimates the category word probabilities with the LDA hidden topics for both kinds of tasks. The only difference is that, in classification with filtering, the pseudo seed words are extracted for the irrelevant-topics based on the same LDA hidden topics. That is, both the pseudo seed words and the category word probabilities are inferred from the same source. In this sense, the category word probabilities for the irrelevant-topics can be estimated more precisely, since the pseudo seed words match well with the LDA hidden topics. We believe that the mismatch between the LDA topics and the provided seed words introduces considerable noise for DFC +Topic-Rel in classification without filtering tasks. Therefore, we could incorporate the provided seed words into the LDA hidden topic inference process for better category word probability estimation, leading to better classification-without-filtering performance for DFC +Topic-Rel. Several existing works discussed in Section 2.2 can be utilized here by imposing word-to-word pair constraints [Andrzejewski et al. 2009; Chen et al. 2013]. We leave this exploration to future work.
Overall, the experimental results show that DFC delivers promising classification accuracy with a few seed words taken from either the category label or its description, for both classification with filtering and classification without filtering.
[Plots omitted: four panels of F1 (y-axis) vs. ρ (x-axis, 0–1.00), with curves for med, gold, earn, med-space, coffee-gold-sugar and politics-rec-religion-sci.]
Fig. 5. Performance of DFC +Doc-Rel and DFC +Topic-Rel with varying ρ values when α2 = 100 ((a) and (b)), and when
α2 = 1 ((c) and (d)).
for very small categories (i.e., gold, coffee-gold-sugar), the performance becomes unstable. For coffee-gold-sugar, the classification performance increases as ρ becomes larger under both the Doc-Rel and Topic-Rel settings. However, for task gold, the situation differs between the two relevance estimation mechanisms. DFC +Doc-Rel experiences a large performance fluctuation with varying ρ values; the performance even becomes very low in the range of [0.80, 0.85]. In contrast, DFC +Topic-Rel experiences relatively modest performance variations for ρ values in the range of [0.30, 0.95]. Category gold is the smallest category, with only 20 testing documents. Therefore, a decision change for a single gold document can incur a relatively big change in the Macro-F1 score. In this sense, we can consider that DFC +Topic-Rel already achieves stable performance over a wide range of ρ values. We believe that the large performance variation of Doc-Rel is attributed to the severe sparsity of category gold. Since the number of documents belonging to gold is very small, the relevance rel(w, c) estimated by Doc-Rel based on the conditional probability is not reliable, and much noisy information is introduced (ref. Equation 11). On the other hand, the fine-grained LDA hidden topics relevant to gold contain less noisy information, leading to much more stable performance across a wide range of ρ values.
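The sparsity argument can be made concrete. Equation 11 is not reproduced in this section, but the description suggests Doc-Rel scores a word by the document-level conditional probability that a document containing the word also contains a seed word of the category; the sketch below implements that reading (function and data names are ours, not the paper's):

```python
def doc_rel(word, seeds, docs):
    """Estimate rel(word, c) as P(a seed word of c appears | word appears),
    counted at the document level (a sketch of the Doc-Rel idea)."""
    docs_with_word = [d for d in docs if word in d]
    if not docs_with_word:
        return 0.0
    hits = sum(1 for d in docs_with_word if any(s in d for s in seeds))
    return hits / len(docs_with_word)

# Toy corpus of documents as word sets: "treatment" co-occurs with the seed "disease"
docs = [{"disease", "treatment", "doctor"},
        {"treatment", "patient"},
        {"market", "gold"}]
print(doc_rel("treatment", {"disease", "medicine"}, docs))  # 0.5
```

When a category like gold covers only a handful of documents, `docs_with_word` is tiny for most words, so the ratio is estimated from very few observations and becomes noisy, which matches the fluctuation observed above.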
Another observation is that DFC +Doc-Rel/Topic-Rel performs significantly worse when ρ < 0.10 or ρ = 1.0 on most tasks. That is, without general-topics (ρ = 1) or category-topics (ρ = 0), DFC loses its ability to learn the discriminative information for each category. This validates the legitimacy of using both general-topics and category-topics in DFC for dataless filtering and classification. This finding is consistent with our earlier work in [Li et al. 2016b] for classification without filtering. Note that the optimal ρ value is almost identical across the tasks from the two datasets. Both DFC +Doc-Rel and DFC +Topic-Rel achieve the optimal performance when ρ = 0.95, which is especially valuable for real applications. In our experiments, we use this setting.
Impact of B value. The B value specifies the number of general-topics used in DFC. In the above experiments, we fix B to be three times the number of category-topics. Here, we evaluate the effect of the B value by setting it to 1 to 5 times the number of category-topics. Figure 6(a) and Figure 6(b)
[Plots omitted: two panels of F1 (y-axis) vs. the number of general-topics (x-axis, 1–5), with curves for med, gold, earn, med-space, coffee-gold-sugar and politics-rec-religion-sci.]
Fig. 6. Performance of DFC +Doc-Rel and DFC +Topic-Rel with varying B values.
plot the performance of DFC with different B values under the Doc-Rel and Topic-Rel settings respectively. We observe that DFC is insensitive to the choice of parameter B; there is little performance variation across the tasks when different B values are used. Based on the results, we set B to three times the number of category-topics for DFC in our experiments.
Impact of T value. The T value specifies the number of irrelevant-topics used in DFC to model irrelevant documents. Since 20NG and Reuters-10 contain 20 and 10 categories respectively, we evaluate the impact of different T values in the range of {1, 5, 10, 15, 20, 25}. Note that setting T = 0 is equivalent to the setting of ρ = 1, and the inferior performance of DFC in that case has already been validated in the analysis of ρ. Figure 7(a) and Figure 7(b) plot the performance patterns for DFC +Doc-Rel and DFC +Topic-Rel respectively. Except for the tasks on the smallest categories, DFC achieves stable performance on the other 4 tasks under the Doc-Rel setting across all T values. In contrast, DFC experiences a large performance deterioration on all 6 tasks under the Topic-Rel setting when T is very small (i.e., T ∈ {1, 5, 10}). This observation is reasonable, since Topic-Rel estimates the category word probabilities based on the LDA hidden topics and on pseudo seed words that are themselves extracted from those topics. In this sense, the pseudo seed words under Topic-Rel are strongly restricted to the few irrelevant categories covered by the corresponding LDA hidden topics. When T is smaller than the true number of irrelevant categories covered by the document collection, some unknown categories may not be modeled well by the irrelevant-topics, leading to significant performance loss. In comparison, Doc-Rel does not couple the category word probability estimation and the pseudo seed word extraction with the LDA hidden topics. The document-level word co-occurrence information utilized in Doc-Rel is more coarse-grained than the LDA hidden topics. Thus, the category word probabilities estimated by Doc-Rel can cover a much broader range of irrelevant categories.
Another observation is that both tasks gold and coffee-gold-sugar suffer substantial performance loss when T < 20 under both the Doc-Rel and Topic-Rel settings. The loss under Topic-Rel can be explained by the reason discussed above. It is obvious that DFC +Topic-Rel obtains much better performance than DFC +Doc-Rel on these two tasks when T is set to 10 and 15. The inferiority of Doc-Rel in these cases is attributed to the conditional probability it uses (ref. Equation 11). As discussed in the analysis of ρ, the number of documents belonging to category gold is very small, so the relevance rel(w, c) estimated by the conditional probability is error-prone. Based on the results, we set T = 20 in our experiments.
Impact of α2 value. The concentration parameter α2 controls the degree to which the general-topic distribution θd of a document d of category c can deviate from the general-topic distribution ϕc of that category. When α2 is very large, each document of category c has an almost identical general-topic distribution θd. On the other hand, α2 → 0 is equivalent to assigning each document a general-topic distribution without the category constraint. Here, we investigate the performance of DFC by varying α2 values in the range of [1, 500]. The experimental results are plotted in Figure 8(a) and Figure 8(b)
[Plots omitted: two panels of F1 (y-axis) vs. the number of irrelevant-topics (x-axis, 1–25), with curves for med, gold, earn, med-space, coffee-gold-sugar and politics-rec-religion-sci.]
Fig. 7. Performance of DFC +Doc-Rel and DFC +Topic-Rel with varying T values.
[Plots omitted: two panels of F1 (y-axis) vs. α2 (x-axis, 1–500), with curves for med, gold, earn, med-space, coffee-gold-sugar and politics-rec-religion-sci.]
Fig. 8. Performance of DFC +Doc-Rel and DFC +Topic-Rel with varying α2 values.
for Doc-Rel and Topic-Rel respectively. The performance of DFC is stable over a broad range of α2 values across most tasks; most classification tasks experience very little performance fluctuation under different α2 values. It may seem that the category constraint is unimportant. However, when α2 is set to 1, we observe significant performance degradation with varying ρ values. Figure 5(c) and Figure 5(d) plot the performance patterns for different ρ values with α2 = 1 for DFC +Doc-Rel and DFC +Topic-Rel respectively. Compared with the results in Figure 5(a) and Figure 5(b), both DFC +Doc-Rel and DFC +Topic-Rel deliver much lower performance over a wide range of ρ values when α2 = 1, and the optimal performance is only achieved at ρ = 0.95. This suggests that the category constraint is indeed helpful in DFC. Based on the results, we set α2 = 100 in our experiments.
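The role of α2 as a concentration parameter can be illustrated in isolation: drawing per-document general-topic distributions θd from a Dirichlet centered on a category-level distribution ϕc shows the draws tightening around ϕc as α2 grows. A self-contained sketch, with made-up numbers for ϕc:

```python
import random

def draw_theta(phi_c, alpha2, rng):
    """Sample θd ~ Dirichlet(α2 · ϕc) via normalized Gamma draws."""
    gammas = [rng.gammavariate(alpha2 * p, 1.0) for p in phi_c]
    total = sum(gammas)
    return [g / total for g in gammas]

phi_c = [0.5, 0.3, 0.2]  # illustrative category-level general-topic distribution
for alpha2 in (1, 100):
    rng = random.Random(0)
    draws = [draw_theta(phi_c, alpha2, rng) for _ in range(300)]
    spread = max(abs(theta[0] - phi_c[0]) for theta in draws)
    print(alpha2, round(spread, 2))  # the spread around ϕc shrinks as α2 grows
```

At α2 = 1 the draws wander far from ϕc, i.e., the category constraint is effectively off, which is consistent with the degradation observed in Figure 5(c) and 5(d).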
Impact of the number of pseudo seed words. In Section 3.2, we extract the top-10 topical words of each least relevant LDA hidden topic with respect to the provided seed words as the pseudo seed words for an irrelevant-topic. Here, we investigate the impact of the number of pseudo seed words on the classification performance of DFC. Figure 9(a) and Figure 9(b) plot the performance of DFC with different numbers of pseudo seed words under the Doc-Rel and Topic-Rel settings respectively. DFC experiences little performance variation in the range of [2, 12] for most tasks. With more topical words, the performance of DFC degrades significantly when classifying and filtering the smaller categories like gold, coffee and sugar. In contrast, no performance deterioration is experienced with more pseudo seed words for the other categories. This is reasonable, since smaller categories cover few documents, resulting in unreliable word co-occurrence information. In this sense, taking more pseudo seed words for the irrelevant-topics would misclassify the documents belonging to the specified categories. Based on the results, we take the top-10 topical words as the pseudo seed words in our experiments.
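The extraction procedure of Section 3.2 can be sketched as follows; we assume here that a topic's relevance to the provided seed words is measured by the probability mass it assigns to them (the paper's exact scoring may differ), and the toy topics are illustrative:

```python
def pseudo_seed_words(topics, seed_words, n_irrelevant, n_words=10):
    """topics: list of dicts mapping word -> probability (LDA topic-word dists).
    Pick the n_irrelevant topics least relevant to the seed words, and take
    each one's top-n_words words as pseudo seed words for an irrelevant-topic."""
    def seed_mass(topic):
        return sum(topic.get(w, 0.0) for w in seed_words)
    least_relevant = sorted(topics, key=seed_mass)[:n_irrelevant]
    return [sorted(t, key=t.get, reverse=True)[:n_words] for t in least_relevant]

topics = [
    {"disease": 0.3, "doctor": 0.2, "patient": 0.1},  # overlaps the med seeds
    {"gold": 0.4, "market": 0.3, "price": 0.2},       # no seed overlap: irrelevant
]
print(pseudo_seed_words(topics, {"disease", "medicine"}, n_irrelevant=1, n_words=2))
# [['gold', 'market']]
```

Taking a larger n_words simply reads deeper into each least-relevant topic's word ranking, which is where the extra, noisier pseudo seed words discussed above come from.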
[Plots omitted: two panels of F1 (y-axis) vs. the number of pseudo seed words (x-axis, 2–20), with curves for med, gold, earn, med-space, coffee-gold-sugar and politics-rec-religion-sci.]
Fig. 9. Performance of DFC +Doc-Rel and DFC +Topic-Rel with different number of pseudo seed words per irrelevant-topic
[Plots omitted: four panels of F1 (y-axis) vs. the number of iterations (x-axis, 1–50), with curves for med, gold, earn, med-space, coffee-gold-sugar and politics-rec-religion-sci.]
Fig. 10. Performance of DFC +Doc-Rel and DFC +Topic-Rel with different number of iterations when ηd is used ((a)
and (b)), and when ηd is not used ((c) and (d)).
Impact of the number of iterations. We would like to investigate the impact of the number of iterations on the classification performance of DFC. Figure 10(a) and Figure 10(b) plot the performance of DFC with different numbers of iterations under the Doc-Rel and Topic-Rel settings respectively. We can see that DFC achieves near-optimal performance after only 4 iterations, and stable performance is reached after about 8 iterations on most tasks. This suggests that DFC can exploit the semantic information provided by the seed words in an efficient manner, much as humans do. We further investigate the impact of estimating the initial category distribution (i.e., ηd in Equation 19) for each document based on the seed words. Figure 10(c) and Figure 10(d) plot the performance of DFC under the Doc-Rel and Topic-Rel settings respectively, when ηd is set to a uniform distribution. We find that the classification performance is not affected significantly if this initial distribution estimation is not provided; however, DFC takes more iterations to reach stable performance.
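The seed-based initialization of ηd can be sketched as counting each category's seed word occurrences in the document and normalizing with a small smoothing term. This is our assumed form, since Equation 19 itself is not reproduced in this section:

```python
def init_eta(doc_words, category_seeds, smoothing=1.0):
    """Initial category distribution for a document from seed word counts.
    category_seeds: dict mapping category -> set of seed words."""
    counts = {c: smoothing + sum(1 for w in doc_words if w in seeds)
              for c, seeds in category_seeds.items()}
    total = sum(counts.values())
    return {c: v / total for c, v in counts.items()}

seeds = {"med": {"disease", "doctor"}, "space": {"orbit", "nasa"}}
doc = ["the", "doctor", "treated", "the", "disease"]
print(init_eta(doc, seeds))  # {'med': 0.75, 'space': 0.25}
```

A uniform ηd corresponds to dropping the count term, which is the variant evaluated in Figure 10(c) and 10(d): the sampler still converges, just from a less informed starting point.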
[Bar charts omitted: document counts per sub-category for each seed word. (a) politics: "gay", "gun" and "jewish"; (b) religion: "atheism", "bible" and "christ"; (c) politics: "jewish" and "israel".]
Fig. 11. The number of documents that a seed word appears in for two major categories: politics and religion.
Table VIII. Macro-F1 of DFC for politics-religion with different seed words. The full seed word set is {gay, gun, jewish, atheism, bible, christ}; -{·} denotes removing the listed words.
Seed words | Full set | -{gun} | -{gay} | -{jewish} | -{bible, christ} | -{bible, atheism}
Macro-F1 | 0.952 | 0.895 | 0.930 | 0.940 | 0.830 | 0.940
no performance change is observed when we replace "jewish" with "israel". It seems the answers to Q1 and Q2 are clear now. However, the above empirical study is conducted using DFC only. We further conduct experiments with three other dataless text classifiers, GE-FL, DescLDA and SNB-EM, over two classification without filtering tasks: politics-religion and comp-religion-sci. For comp-religion-sci, the procedure mentioned above is applied to refine the seed words. There are 71 original seed words in SD for these three major categories in total. After the seed word refinement, we obtain only 12 new seed words. Table XIV in the Appendix reports the number of documents supporting these 12 seed words in terms of sub-categories. Similarly, we can see that each seed word mainly covers a specific sub-category. Table IX reports the performance comparison when the new seed words are used. The new seed words deliver even better classification performance under the majority of the settings, and the same conclusion can be reached on the other tasks in our study. By using a few seed words with larger target coverage, the potential noise introduced by using more seed words can be reduced significantly. This explains the better classification performance obtained with the new seed words in these tasks. The same observation is also made in the experimental comparison in Section 4.3, where using SL delivers better performance than using SD.
Overall, the experimental results demonstrate that the number of seed words is not positively correlated with the classification accuracy. Instead, the document coverage of the seed words for a category of interest plays a critical role in determining the dataless classification accuracy. This finding could lead us to devise several strategies to improve the seed word selection process. One simple solution is to calculate the distance between two seed words in terms of their document distributions. The optimal set of seed words can then be built with optimization techniques, e.g., MMR-based approaches for search result diversification [Carbonell and Goldstein 1998; Santos et al. 2011]. We leave this line of work as future work.
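The suggested MMR-style strategy could be sketched as a greedy trade-off between a seed word's relevance to the category and its redundancy with already-selected words, where redundancy is measured on the documents each word appears in. Everything below (the Jaccard distance, the λ weight, the toy scores) is our illustrative assumption, not the paper's method:

```python
def jaccard_distance(a, b):
    """Distance between two words in terms of the document sets they appear in."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def mmr_select(candidates, doc_sets, relevance, k, lam=0.5):
    """Greedy MMR: trade off relevance against redundancy with chosen words.
    doc_sets: word -> set of document ids; relevance: word -> score in [0, 1]."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr_score(w):
            redundancy = max((1.0 - jaccard_distance(doc_sets[w], doc_sets[s])
                              for s in selected), default=0.0)
            return lam * relevance[w] - (1.0 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

doc_sets = {"bible": {1, 2, 3}, "christ": {1, 2, 4}, "atheism": {7, 8}}
relevance = {"bible": 0.9, "christ": 0.85, "atheism": 0.6}
print(mmr_select(["bible", "christ", "atheism"], doc_sets, relevance, k=2))
# ['bible', 'atheism']
```

With λ = 0.5 the sketch prefers "atheism" over the slightly more relevant "christ", because "christ" covers nearly the same documents as "bible"; this mirrors the document coverage argument above.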
5. CONCLUSION
In this paper, we propose a seed-guided topic model for dataless text filtering and classification, named DFC. Without any labeled documents, DFC takes only a few seed words to retrieve the documents of the specified categories through topic influence. By modeling the documents with both category-topics and general-topics, DFC successfully captures the diverse semantic information of the dataset, separating category-specific information from general semantic information. We investigate two mechanisms to extract the discriminative category information by explicitly calculating the relevance between the seed words and regular words based on word co-occurrences. To facilitate accurate filtering, we devise a simple but effective mechanism to identify a set of pseudo seed words for each irrelevant-topic based on the LDA hidden topics extracted from the given document collection. The experimental results show that DFC outperforms existing state-of-the-art dataless text classifiers for both classification with filtering and classification without filtering. It is interesting to
observe that DFC even surpasses state-of-the-art supervised classifiers like sLDA and SVM on many tasks. The parameter analysis also validates the robustness of DFC under different parameter settings. Also, we conduct an empirical study on how to choose the seed words for dataless text classification techniques. The experimental results based on several existing dataless classifiers suggest that the document coverage of the seed words under a category correlates positively with the classification accuracy.
Nevertheless, there is still room to improve our model in several directions. The first is the efficiency issue. As described in Section 3.1, DFC needs to sample the hidden parameters of all the category-topics (i.e., relevant-topics and irrelevant-topics) for a document in one iteration. This sampling process comprises the main computation cost in DFC. Similar to the pruning strategy used in [Li et al. 2017], we can calculate the likelihoods p(cd = t, yd = 0|d) and p(cd = r, yd = 1|d) based on Equations 2 and 3 for each document d after h iterations, and then restrict the category-topic sampling space to the top-m category-topics (i.e., m ≪ R + T) for each document in the next h iterations. Secondly, as discussed in Section 4.3, DFC +Topic-Rel does not perform well for classification without filtering: the mismatch between the LDA hidden topics and the provided seed words introduces considerable noise into the category word probability estimation. Hence, we would like to investigate existing topic models that incorporate word-level constraints into the topic inference for DFC in classification without filtering. Thirdly, in Section 4.5, we learn the selection criterion of the seed words for dataless text classification techniques. We plan to devise seed word selection strategies following the Maximal Marginal Relevance (MMR) based practices in the search result diversification area.
APPENDIX
In this appendix, we list the seed words in SL and SD for the two datasets in Tables X-XIII. The number of documents that a seed word appears in for three major categories in 20NG is reported in Table XIV.
ACKNOWLEDGMENTS
This research was supported by National Natural Science Foundation of China (No.61502344), Natural Science Founda-
tion of Hubei Province (No.2017CFB502), Natural Scientific Research Program of Wuhan University (No.2042017kf0225,
No.2042016kf0190), Academic Team Building Plan for Young Scholars from Wuhan University (No. Whu2016012) and
Singapore Ministry of Education Academic Research Fund Tier 2 (MOE2014-T2-2-066).
REFERENCES
David Andrzejewski, Xiaojin Zhu, and Mark Craven. 2009. Incorporating domain knowledge into
topic modeling via Dirichlet Forest priors. In ICML. 25–32.
David M. Blei and Jon D. McAuliffe. 2007. Supervised Topic Models. In NIPS. 121–128.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
Chris Buckley and Gerard Salton. 1995. Optimization of Relevance Feedback Weights. In SIGIR.
351–357.
Jaime G. Carbonell and Jade Goldstein. 1998. The Use of MMR, Diversity-Based Reranking for
Reordering Documents and Producing Summaries. In SIGIR. 335–336.
Sutanu Chakraborti, Ulises Cerviño Beresi, Nirmalie Wiratunga, Stewart Massie, Robert Lothian,
and Deepak Khemani. 2008. Visualizing and Evaluating Complexity of Textual Case Bases. In
ECCBR. 104–119.
Ming-Wei Chang, Lev-Arie Ratinov, Dan Roth, and Vivek Srikumar. 2008. Importance of Semantic
Representation: Dataless Classification. In AAAI. 830–835.
Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. 2006. Modeling General and Spe-
cific Aspects of Documents with a Probabilistic Topic Model. In NIPS. 241–248.
Xingyuan Chen, Yunqing Xia, Peng Jin, and John A. Carroll. 2015. Dataless Text Classification
with Descriptive LDA. In AAAI. 2224–2231.
Zhiyuan Chen and Bing Liu. 2014. Mining topics in documents: standing on the shoulders of big
data. In SIGKDD. 1116–1125.
Zhiyuan Chen, Arjun Mukherjee, Bing Liu, Meichun Hsu, Malú Castellanos, and Riddhiman Ghosh.
2013. Leveraging Multi-Domain Prior Knowledge in Topic Models. In IJCAI. 2071–2077.
Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A.
Harshman. 1990. Indexing by Latent Semantic Analysis. JASIS 41, 6 (1990), 391–407.
Doug Downey and Oren Etzioni. 2008. Look Ma, No Hands: Analyzing the Monotonic Feature
Abstraction for Text Classification. In NIPS. 393–400.
Gregory Druck, Gideon S. Mann, and Andrew McCallum. 2008. Learning from labeled features
using generalized expectation criteria. In SIGIR. 595–602.
Mark D. Dunlop. 1997. The Effect of Accessing Nonmatching Documents on Relevance Feedback.
ACM Trans. Inf. Syst. 15, 2 (1997), 137–153.
Karla L. Caballero Espinosa and Ram Akella. 2012. Incorporating statistical topic information in
relevance feedback. In SIGIR. 1093–1094.
Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using
Wikipedia-based explicit semantic analysis. In IJCAI. 1606–1611.
Alfio Gliozzo, Carlo Strapparava, and Ido Dagan. 2009. Improving Text Categorization Bootstrap-
ping via Unsupervised Learning. ACM Trans. Speech Lang. Process. 6, 1 (Oct. 2009), 1:1–1:24.
Yves Grandvalet and Yoshua Bengio. 2004. Semi-supervised Learning by Entropy Minimization. In
NIPS. 529–536.
Table XIV. The number of documents that a seed word appears in for the three major categories comp, religion,
and sci. The largest number for each seed word is highlighted in boldface.
Category  Sub-category              graphics  windows   pc  apple  motif  atheism  christ  bible  encryption  circuit  medical  orbit
comp      comp.graphics                  304      100   96     77     24        0       0      0           0        2       16      1
          comp.os.ms-windows.misc        60       662  139     31      3        1       0      1           1       10        6      0
          comp.sys.ibm.pc.hardware       31       156  202     26      0        0       0      0           1       11        2      0
          comp.sys.mac.hardware          25        20   48    343      1        0       0      2           1        5        4      2
          comp.windows.x                 73       184   60     21    184        0       0      1           1        0        4      0
religion  alt.atheism                     0         0    2     16      0      118      26    105           0        1        6      9
          soc.religion.christian          3         1    1      6      0       23     268    259           0        0        9      0
          talk.religion.misc              0         3    0      3      1       10      93    115           0        0        3      1
sci       sci.crypt                       2        12   28     17      0        0       2      2         399        6        0      1
          sci.electronics                18        13   52      0      1        0       0      0           6      137        3      0
          sci.med                         4         5    6      3      1        1       0      0           0        0      186      0
          sci.space                       4         5   15     27      0        0       2      1           0        2       10    203
Hu Guan, Jingyu Zhou, and Minyi Guo. 2009. A class-feature-centroid classifier for text categoriza-
tion. In WWW. 201–210.
Swapnil Hingmire and Sutanu Chakraborti. 2014. Topic Labeled Text Classification: A Weakly
Supervised Approach. In SIGIR. 385–394.
Swapnil Hingmire, Sandeep Chougule, Girish K. Palshikar, and Sutanu Chakraborti. 2013. Docu-
ment Classification by Topic Labeling. In SIGIR. 877–880.
Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In SIGIR. 50–57.
Jagadeesh Jagarlamudi, Hal Daumé III, and Raghavendra Udupa. 2012. Incorporating Lexical Priors
into Topic Models. In EACL. 204–213.
Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. 2015. From Word Embed-
dings To Document Distances. In ICML. 957–966.
Chenliang Li, Yu Duan, Haoran Wang, Zhiqian Zhang, Aixin Sun, and Zongyang Ma. 2017. En-
hancing Topic Modeling for Short Texts with Auxiliary Word Embeddings. ACM Trans. Inf. Syst.
36, 2 (2017), 11:1–11:30.
Chenliang Li, Haoran Wang, Zhiqian Zhang, Aixin Sun, and Zongyang Ma. 2016a. Topic Modeling
for Short Texts with Auxiliary Word Embeddings. In SIGIR. 165–174.
Chenliang Li, Jian Xing, Aixin Sun, and Zongyang Ma. 2016b. Effective Document Labeling with
Very Few Seed Words: A Topic Model Approach. In CIKM. 85–94.
Bing Liu, Xiaoli Li, Wee Sun Lee, and Philip S. Yu. 2004. Text Classification by Labeling Words.
In AAAI. 425–430.
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Informa-
tion Retrieval. Cambridge University Press.
Jun Miao, Jimmy Xiangji Huang, and Jiashu Zhao. 2016. TopPRF: A Probabilistic Framework for
Integrating Topic Space into Pseudo Relevance Feedback. ACM Trans. Inf. Syst. 34, 4 (2016),
22:1–22:36.
David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011.
Optimizing semantic coherence in topic models. In EMNLP. 262–272.
Arjun Mukherjee and Bing Liu. 2012. Aspect Extraction through Semi-Supervised Modeling. In
ACL. 339–348.
Kamal Nigam, Andrew Kachites MacCallum, Sebastian Thrun, and Tom Mitchell. 2000. Text clas-
sification from labeled and unlabeled documents using EM. Machine learning 39, 2-3 (2000),
103–134.
Hema Raghavan, Omid Madani, and Rosie Jones. 2006. Active Learning with Feedback on Features
and Instances. Journal of Machine Learning Research 7 (2006), 1655–1686.
Alan Ritter, Evan Wright, William Casey, and Tom M. Mitchell. 2015. Weakly Supervised Extraction
of Computer Security Events from Twitter. In WWW. 896–905.
Rodrygo L.T. Santos, Craig Macdonald, and Iadh Ounis. 2011. Intent-aware Search Result Diversi-
fication. In SIGIR. 595–604.
Yangqiu Song and Dan Roth. 2014. On Dataless Hierarchical Text Classification. In AAAI. 1579–
1585.
T.P. Straatsma, H.J.C. Berendsen, and A.J. Stam. 1986. Estimation of statistical errors in molecular
simulation calculations. Molecular Physics 57, 1 (1986), 89–95.
Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. In PNAS.
Hanna M. Wallach, David M. Mimno, and Andrew McCallum. 2009. Rethinking LDA: Why Priors
Matter. In NIPS. 1973–1981.
Pengtao Xie and Eric P. Xing. 2013. Integrating Document Clustering and Topic Modeling. In UAI.
Limin Yao, David M. Mimno, and Andrew McCallum. 2009. Efficient methods for topic model
inference on streaming document collections. In KDD. 937–946.
Zheng Ye and Jimmy Xiangji Huang. 2014. A simple term frequency transformation model for
effective pseudo relevance feedback. In SIGIR. 323–332.
Jinhui Yuan, Fei Gao, Qirong Ho, Wei Dai, Jinliang Wei, Xun Zheng, Eric Po Xing, Tie-Yan Liu,
and Wei-Ying Ma. 2015. LightLDA: Big Topic Models on Modest Computer Clusters. In WWW.
1351–1361.
Jun Zhu, Amr Ahmed, and Eric P. Xing. 2009. MedLDA: maximum margin supervised topic models
for regression and classification. In ICML. 1257–1264.
Jun Zhu, Amr Ahmed, and Eric P. Xing. 2012. MedLDA: maximum margin supervised topic models.
Journal of Machine Learning Research 13 (2012), 2237–2278.