Classification of Text, Automatic
F Sebastiani, Università di Padova, Padova, Italy
© 2006 Elsevier Ltd. All rights reserved.

Introduction

In the last two decades, the production of textual documents in digital form has increased exponentially, owing to the increased availability of inexpensive hardware and software for generating digital text (e.g., personal computers, word processors) and for digitizing textual data not yet in digital form (e.g., scanners, optical character recognition software). As a consequence, there is an ever-increasing need for mechanized solutions for organizing the vast quantity of digital texts being produced, with an eye toward their future use. The design of such solutions has traditionally been the object of study of information retrieval (IR), the discipline that, broadly speaking, is concerned with computer-mediated access to data with poorly specified semantics.

There are two main directions for providing convenient access to a large, unstructured repository of text:

- Providing powerful tools for searching for relevant documents within the repository. This is the aim of text search (see Document Retrieval, Automatic), a subdiscipline of IR concerned with building systems that take as input a natural language query and return a list of documents ranked according to their estimated degree of relevance to the user's information need. Nowadays the tip of the iceberg of text search is represented by Web search engines (see Web Searching), but commercial solutions to the text search problem were being delivered decades before the very birth of the Web.
- Providing powerful tools for turning this unstructured repository into a structured one, thereby easing storage, search, and browsing. This is the aim of text classification (TC), a discipline at the crossroads of IR, machine learning (ML), and (statistical) natural language processing, concerned with building systems that partition an unstructured collection of documents into meaningful groups (Sebastiani, 2002).
Text Clustering and Text Categorization

There are two main variants of TC. The first is text clustering, which is characterized by the fact that only the desired number of groups (or clusters) is known in advance: no indication as to the semantics of these groups is given as input. The second variant is text categorization, whereby the input to the system consists not only of the number of categories (or classes), but also of some specification of their semantics. In the most frequent case, this specification consists in a set of labels, one for each category, usually consisting of a noun or other short natural language expression, and in a set of example labeled texts, i.e., texts whose membership or nonmembership in each of the categories is known. Clustering may thus be seen as the task of finding a latent but as yet undetected group structure in the repository, while categorization can be seen as the task of structuring the repository according to a group structure known in advance. In logical-philosophical terms, we can see clustering as the task of determining both the extensional and the intensional level (see Extensionality and Intensionality) of a previously unknown group structure, and categorization as determining only the extensional level of a group structure whose intensional level is known.

It is the latter task that will be the focus of this article (text clustering is covered elsewhere in this volume – see Text Mining). From now on we will thus use the expressions 'text classification' and 'text categorization' interchangeably (abbreviated as TC), and the expression '(text) classifier' to denote a system capable of performing automatic TC.

Note that the central notion of TC, that of the membership of a document dj in a class ci based on the semantics of dj and ci, is an inherently subjective notion, since the semantics of dj and ci cannot be formally specified. Different classifiers (be they humans or machines) might thus disagree on whether dj belongs to ci. This means that membership cannot be determined with certainty, which in turn means that any classifier (be it human or machine) will be prone to misclassification errors. As a consequence, it is customary to evaluate automatic text classifiers by applying them to a set of labeled (i.e., preclassified) documents (a set that here plays the role of a gold standard), so that the accuracy (or effectiveness) of the classifier can be measured by the degree of coincidence between its classification decisions and the labels originally attached to the preclassified documents.

Single-Label and Multi-Label Text Categorization

TC itself admits of two important variants: single-label TC and multi-label TC. Given as input the set of categories C = {c1, ..., cm}, single-label TC is the task of attributing, to each document dj in the repository, the one category to which it belongs. Multi-label TC, instead, deals with the case in which each document dj may in principle belong to zero, one, or more than one category; it thus comes down to deciding, for each category ci in C, whether a given document dj belongs or does not belong to ci.

The technologies for coping with single-label and multi-label TC are slightly different (the former problem often being somewhat more challenging), especially concerning the phases of feature selection, classifier learning, and classifier evaluation (see below). In a real application, it is thus of fundamental importance to identify from the beginning whether the application requires single-label or multi-label TC.

Hard or Soft Text Categorization

Taking a binary decision, yes or no, as to whether a document dj belongs to a category ci is sometimes referred to as a 'hard' categorization decision. This is the kind of decision taken by autonomous text classifiers, i.e., software systems that need to decide and act accordingly without human supervision. A different type of decision, sometimes referred to as a 'soft' categorization decision, consists of attributing a numeric score (e.g., between 0 and 1) to the pair (dj, ci), reflecting the degree of confidence of the classifier in the fact that dj belongs to ci. This allows, for instance, ranking a set of documents in terms of their estimated appropriateness for a category ci, or ranking a set of categories in terms of their estimated appropriateness for a document dj. Such rankings are often useful for nonautonomous, interactive classifiers, i.e., systems whose goal is to recommend a categorization decision to a human expert, who is responsible for making the final decision. For instance, in a single-label TC task, a human expert in charge of the final classification decision may take advantage of a system that preranks the categories in terms of their estimated appropriateness to a given document dj.

Again, the technologies for coping with soft and hard categorization decisions are slightly different, especially concerning the phases of classifier learning and classifier evaluation (see below). In any real-world application, it is thus important to establish from the beginning whether the task requires soft or hard decisions.
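The difference between soft scores and hard decisions, and the reduction of multi-label TC to a set of per-category binary decisions, can be sketched in a few lines of code. This is an illustrative sketch only: the category names, score values, and the 0.5 threshold are invented for the example, not taken from the article.

```python
# Sketch: turning soft (graded) categorization decisions into hard ones.
# The scores dict and the 0.5 threshold are illustrative assumptions.

def hard_decisions(scores, threshold=0.5):
    """Multi-label 'hard' TC: for each category ci, decide yes/no
    by thresholding the classifier's confidence score for (dj, ci)."""
    return {cat: score >= threshold for cat, score in scores.items()}

def rank_categories(scores):
    """'Soft' TC output: rank categories by estimated appropriateness
    for a document dj (useful for interactive classifiers)."""
    return sorted(scores, key=scores.get, reverse=True)

# Confidence scores a soft classifier might assign to one document:
scores = {"wheat": 0.82, "corn": 0.35, "trade": 0.61}

print(hard_decisions(scores))   # {'wheat': True, 'corn': False, 'trade': True}
print(rank_categories(scores))  # ['wheat', 'trade', 'corn']
```

A hard classifier can thus be obtained from a soft one by fixing a threshold, whereas an interactive system would present the ranking itself to the human expert.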
Applications

Maron's seminal paper (Maron, 1961) is usually taken to mark the official birth date of TC, which at the time was called automatic indexing; this name reflected the fact that the main (or only) application then envisaged for TC was automatically indexing (i.e., generating internal representations for) scientific articles for Boolean IR systems (see Indexing, Automatic). In fact, since the index terms for these representations were drawn from a fixed, predefined set of such terms, we can regard this type of indexing as an instance of TC (with index terms playing the role of categories). The importance of TC increased in the late 1980s and early 1990s with the need to organize the increasingly large quantities of digital text being handled in organizations at all levels. Since then, frequently pursued applications of TC technology have included newswire filtering, i.e., the grouping, according to thematic classes of interest, of news stories produced by news agencies, thus allowing personalized delivery of information to customers according to their profiles of interest (Hayes and Weinstein, 1990); patent classification, i.e., the organization of patents and patent applications into specialized taxonomies, so as to ease the detection of existing patents related to a new patent application (Fall et al., 2003); and Web page classification, i.e., the grouping of Web pages (or sites) according to the taxonomic classification schemes typical of Web portals (Dumais and Chen, 2000).

The applications above all have a certain thematic flavor, in the sense that categories tend to coincide with topics, or themes. However, TC technology has also been applied to real-world problems that are not thematic in nature, among them spam filtering, i.e., the grouping of personal e-mail messages into the two classes LEGITIMATE and SPAM, so as to provide effective user shields against unsolicited bulk mailings (Drucker et al., 1999); authorship attribution, i.e., the automatic identification of the author of a text among a predefined set of candidates (Diederich et al., 2003) (see Authorship Attribution: Statistical and Computational Methods); author gender detection, a special case of the previous task in which the issue is deciding whether the author of the text is MALE or FEMALE (Koppel et al., 2002); genre classification, i.e., the identification of the nontopical communicative goal of the text (such as determining whether a product description is a PRODUCT REVIEW or an ADVERTISEMENT) (Stamatatos et al., 2000); survey coding, i.e., the classification of respondents to a survey based on the textual answers they have returned to an open-ended question (Giorgetti and Sebastiani, 2003); and even sentiment classification, as in deciding whether a product review is a THUMBS UP or a THUMBS DOWN (Turney and Littman, 2003).

Techniques

Approaches

In the 1980s, the most popular approach to TC was one based on knowledge engineering, whereby a knowledge engineer and a domain expert working together would build an expert system capable of automatically classifying text. Typically, such an expert system would consist of a set of 'if ... then ...' rules, to the effect that a document was assigned to the class specified in the 'then' clause only if the linguistic expressions (typically: words) specified in the 'if' part occurred in the document. The drawback of this approach was the high cost in terms of human power required for (i) defining the rule set and (ii) maintaining it, i.e., updating the rule set as a result of subsequent additions or deletions of classes, or of shifts in the meaning of the existing classes.

In the 1990s, this approach was superseded by the machine-learning approach, whereby a general inductive process (the learner) is fed a set of example (training) documents preclassified according to the categories of interest. By observing the characteristics of the training documents, the learner generates a model (the classifier) of the conditions that are satisfied by the documents belonging to the categories considered. This model can subsequently be applied to new, unlabeled documents in order to classify them according to these categories.

This approach has several advantages over the knowledge engineering approach. First of all, a higher degree of automation is introduced: the engineer needs to build not a text classifier, but an automatic builder of text classifiers (the learner). Once built, the learner can be applied to generating many different classifiers, for many different domains and applications: one only needs to feed it the appropriate sets of training documents. By the same token, the above-mentioned problem of maintaining a classifier is solved by feeding the learner new training documents appropriate for the revised set of classes. Many inductive learners are available off the shelf; if one of these is used, the only human power needed in setting up a TC system is that required for manually classifying the training documents. Performing this latter task requires less skilled labor than building an expert system, which is a further advantage. It should also be noted that if an organization has previously relied on manual work for classifying documents, then many preclassified documents are already available for use as training documents when the organization decides to automate the process.
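This train-then-classify workflow can be illustrated with a minimal sketch of a learner and the classifier it produces. The model below is a tiny multinomial naive Bayes with Laplace smoothing (one of the supervised methods surveyed later in this article); the toy corpus and category names are invented for the example.

```python
# Sketch of the ML approach: a learner is fed preclassified (training)
# documents and generates a classifier that can label new, unseen text.
# Minimal multinomial naive Bayes; toy corpus and categories are invented.
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Learner: collect per-category word counts, document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    cat_counts = Counter()              # category -> number of training documents
    vocab = set()
    for text, cat in labeled_docs:
        words = text.lower().split()
        word_counts[cat].update(words)
        cat_counts[cat] += 1
        vocab.update(words)
    return word_counts, cat_counts, vocab

def classify(text, model):
    """Classifier: return the category with the highest posterior log-probability."""
    word_counts, cat_counts, vocab = model
    n_docs = sum(cat_counts.values())
    best_cat, best_logp = None, -math.inf
    for cat in cat_counts:
        logp = math.log(cat_counts[cat] / n_docs)  # category prior
        total = sum(word_counts[cat].values())
        for w in text.lower().split():
            # Laplace (add-one) smoothing avoids zero probabilities for unseen words
            logp += math.log((word_counts[cat][w] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best_cat, best_logp = cat, logp
    return best_cat

training = [("wheat prices rise", "GRAIN"),
            ("corn wheat harvest", "GRAIN"),
            ("bank interest rates", "FINANCE"),
            ("stock market rates", "FINANCE")]
model = train(training)
print(classify("wheat harvest report", model))  # GRAIN
```

Feeding the same learner a different set of training documents yields a classifier for a different domain, which is exactly the maintainability advantage described above.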
Most importantly, one of the advantages of the ML approach is that the accuracy of classifiers built by these techniques now often rivals that of human professionals, and usually exceeds that of classifiers built by knowledge engineering methods. This has brought about an ever wider acceptance of learning methods, even outside academia. While for certain applications, such as spam filtering, a combination of ML and knowledge engineering still lies at the basis of several commercial systems, it is fair to say that in most other TC applications (especially of the thematic type) the adoption of ML technology has been widespread.

Note that the ML approach is especially suited to the case in which no additional knowledge (of a procedural or declarative nature) of the meaning of the categories is available, since in this case the classification rules can be determined only on the basis of knowledge extracted from the training documents. This case is the most frequent one, and is thus the usual focus of TC research. Solutions devised for the case in which no additional knowledge is available are extremely general, since they do not presuppose the existence of, e.g., additional lexicosemantic resources that, in real-life situations, might be either unavailable or expensive to create (see Computational Lexicons and Dictionaries). A further reason why TC research rarely tackles the case of additionally available external knowledge is that these sources of knowledge may vary widely in type and format, thereby making each instance of their application to TC a case of its own, from which any lesson learned can hardly be exported to different application contexts. When external knowledge of some kind is available in a given application, heuristic techniques of any nature may be adopted in order to leverage these data, either in combination with or in isolation from the IR and ML techniques discussed here. It should be noted, however, that past research has not been able to show any substantial benefit from the use of external resources (such as lexicons, thesauri, or ontologies) in TC.

As previously noted, the meaning of categories is subjective. The ML techniques used for TC, rather than trying to learn a supposedly perfect classifier (a gold standard of dubious existence), strive to reproduce the subjective judgment of the expert who has labeled the training documents, and do this by examining the manifestations of this judgment, i.e., the documents that the expert has manually classified. The kind of learning that these ML techniques engage in is usually called supervised learning, since it is supervised, or facilitated, by knowledge of the preclassified data.

Learning Text Classifiers

Many different types of supervised learners have been used in TC (Sebastiani, 2002), including probabilistic 'naive Bayesian' methods, Bayesian networks, regression methods, decision trees, Boolean decision rules, neural networks, incremental or batch methods for learning linear classifiers, example-based methods, classifier ensembles (including boosting methods), and support vector machines. While all of these techniques retain their popularity, it is fair to say that in recent years support vector machines (Joachims, 1998) and boosting (Schapire and Singer, 2000) have been the two dominant learning methods in TC. This seems attributable to a combination of two factors: (i) these two methods have strong justifications in terms of computational learning theory, and (ii) in comparative experiments on widely accepted benchmarks, they have outperformed all other competing approaches. An additional factor in their success is the free availability, at least for research purposes, of well-known software packages based on these methods, such as SVMlight and BoosTexter.

Building Internal Representations for Documents

The learners discussed above cannot operate on documents as they are, but require the documents to be given internal representations that the learners can make sense of. The same is true of the classifiers, once they have been built by the learners. It is thus customary to transform all the documents (i.e., those used in the training phase, the testing phase, or the operational phase of the classifier) into internal representations by means of methods used in text search, where the same need is also present (see Indexing, Automatic). Accordingly, a document is usually represented by a vector lying in a vector space whose dimensions correspond to the terms that occur in the training set, with the value of each individual entry corresponding to the weight that the term in question has for the document.

In TC applications of the thematic kind, the set of terms is usually made to coincide with the set of content-bearing words (that is, all words but topic-neutral ones such as articles, prepositions, etc.), possibly reduced to their morphological roots (stems – see Stemming) so as to avoid excessive stochastic dependence among the different dimensions of the vector. Weights for these words are meant to reflect the importance that a word has in determining the semantics of the document it occurs in, and are automatically computed by weighting functions. These functions usually rely on intuitions of a statistical kind, such as: (i) the more often a term occurs in a document, the more important it is for that document; and (ii) the more documents a term appears in, the less important it is in characterizing the semantics of a document it occurs in.
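These two intuitions are what the classical tf × idf family of weighting functions encodes. The following is a sketch of one common log-based variant; the toy corpus is invented, and real systems typically add length normalization on top of this.

```python
# Sketch of a tf-idf weighting function encoding the two intuitions above:
# (i) the weight grows with the term's frequency in the document (tf), and
# (ii) shrinks as more training documents contain the term (idf).
# This log-based variant is one common choice among many; corpus is invented.
import math

def tfidf(term, doc_words, corpus):
    tf = doc_words.count(term)
    df = sum(1 for d in corpus if term in d)          # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["wheat", "prices", "rise"],
          ["the", "wheat", "harvest"],
          ["the", "bank", "rates"]]
doc = ["wheat", "wheat", "harvest"]

print(round(tfidf("wheat", doc, corpus), 3))  # frequent in doc, in 2 of 3 docs
print(round(tfidf("the", doc, corpus), 3))    # absent from this doc: weight 0
```

A content-bearing word that is frequent in the document but rare in the collection thus receives a high weight, while a ubiquitous topic-neutral word receives a low one.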
In TC applications of a nonthematic nature, the opposite is often true. For instance, it is the frequency of use of articles, prepositions, and punctuation (together with many other stylistic features) that may be a helpful clue in authorship attribution, while it is rather unlikely that the frequencies of use of content-bearing words will be of help (see Computational Stylistics). This shows that choosing the right dimensions of the vector space for a given classification task requires a deep understanding, on the part of the engineer, of the nature of the task.

It is fairly evident from the above discussion that the internal representations used in TC applications are, from the standpoint of linguistic analysis, extremely primitive: with the possible exception of applications in sentiment classification (Turney and Littman, 2003), hardly any sophisticated linguistic analysis is usually attempted in order to provide a more faithful rendition of the semantics of the text. This is because previous attempts at applying state-of-the-art natural language processing techniques (including techniques for parsing text robustly (Moschitti and Basili, 2004), extracting collocations (Koster and Seutter, 2003), performing word sense disambiguation (Kehagias et al., 2003), etc.) have not shown any substantial benefit with respect to the basic representations outlined above.

Reducing the Dimensionality of the Vectors

The techniques described in the previous section tend to generate very large vectors, with sizes in the tens of thousands. This situation is problematic in TC, since the efficiency of many learning devices (e.g., neural networks) tends to degrade rapidly with the size of the vectors. In TC applications, it is thus customary to run a dimensionality reduction pass before building the internal representations of the documents. Basically, this means identifying a new vector space, with a much smaller number of dimensions than the original one, in which to represent the documents. Several techniques for dimensionality reduction have been devised within TC or, more often, borrowed from the fields of ML and pattern recognition.

An important class of such techniques is that of feature extraction methods (examples of which are term clustering methods and latent semantic indexing – see Latent Semantic Analysis). Feature extraction methods define a new vector space in which each dimension is a combination of some (or all) of the original dimensions; their effect is usually a reduction of both the dimensionality of the vectors and the overall stochastic dependence among dimensions.

An even more important class of dimensionality reduction techniques is that of feature selection methods, which do not attempt to generate new terms but instead try to select the best ones from the original set. The measure of quality for a term is its expected impact on the accuracy of the resulting classifier. To measure this, feature selection functions are employed for scoring each term according to this expected impact, so that the highest scoring terms can be retained for the new vector space. These functions mostly come from statistics (e.g., chi-square) or information theory (e.g., mutual information, also known as information gain), and tend to encode, each in its own way, the intuition that the best terms for classification purposes are the ones that are distributed most differently across the different categories.

Challenges

TC, especially in its ML incarnation, is today a fairly mature technology that has delivered working solutions in a number of applicative contexts. Interest in TC has grown exponentially in the last 10 years, among researchers and developers alike.

For IR researchers, this interest is one particular aspect of a general movement toward leveraging user data to tame the inherent subjectivity of the IR task, i.e., the fact that it is the user, and only the user, who can say whether a given item of information is relevant to a query she has issued to a Web search engine, or relevant to a private folder of hers in which documents should be filed according to content. Wherever there are predefined classes, documents previously (and manually) classified by the user are often available; as a consequence, these data can be exploited for automatically learning the (extensional) meaning that the user attributes to the categories, thereby reaching accuracy levels that would be unthinkable if these data were unavailable.

For ML researchers, this interest is due to the fact that TC applications provide a challenging benchmark for their newly developed techniques, since these applications usually feature extremely high-dimensional vector spaces and provide large quantities of test data. In the last 5 years, this has resulted in more and more ML researchers adopting TC as one of their benchmark applications of choice, which means that cutting-edge ML techniques are applied to TC with minimal delay after their original invention.

For application developers, this interest is mainly due to the enormously increased need to handle larger and larger quantities of documents, a need emphasized by increased connectivity and by the availability of document bases of all types at all levels in the information chain. But this interest also results from TC techniques having reached accuracy levels that often rival the performance of trained professionals, levels that can be achieved with high efficiency on standard hardware and software resources. This means that more and more organizations are automating all of their activities that can be cast as TC tasks.

Still, a number of challenges remain for TC research.
The first and foremost challenge is to deliver high accuracy in all applicative contexts. While highly effective classifiers have been produced for applicative domains such as the thematic classification of professionally authored text (such as newswire stories), in other domains the reported accuracies are far from satisfying. Such applicative contexts include the classification of Web pages (where the use of text is more varied and obeys rules different from those of linear verbal communication), spam filtering (a task of an adversarial nature, in that spammers adapt their spamming strategies so as to circumvent the latest spam filtering technologies), authorship attribution (where current technology is not yet able to tackle the inherent stylistic variability among texts written by the same author), and sentiment classification (which requires much more sophisticated linguistic analysis than classification by topic).

A second important challenge is to bypass the document labeling bottleneck, i.e., to tackle the facts that labeled documents for use in the training phase are not always available, and that labeling (i.e., manually classifying) documents is costly. To this end, semisupervised methods have been proposed that allow building classifiers from a small sample of labeled documents and a (usually larger) sample of unlabeled documents (Nigam and Ghani, 2000; Nigam et al., 2000). However, the problem of learning text classifiers mainly from unlabeled data unfortunately remains open.

See also: Authorship Attribution: Statistical and Computational Methods; Computational Lexicons and Dictionaries; Computational Stylistics; Document Retrieval, Automatic; Extensionality and Intensionality; Indexing, Automatic; Latent Semantic Analysis; Stemming; Text Mining; Web Searching.

Bibliography

Diederich J, Kindermann J, Leopold E & Paaß G (2003). 'Authorship attribution with support vector machines.' Applied Intelligence 19(1/2), 109–123.

Drucker H, Vapnik V & Wu D (1999). 'Support vector machines for spam categorization.' IEEE Transactions on Neural Networks 10(5), 1048–1054.

Dumais ST & Chen H (2000). 'Hierarchical classification of Web content.' In Belkin NJ, Ingwersen P & Leong M-K (eds.) Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval. Athens: ACM Press. 256–263.

Fall CJ, Törcsvári A, Benzineb K & Karetka G (2003). 'Automated categorization in the International Patent Classification.' SIGIR Forum 37(1), 10–25.

Giorgetti D & Sebastiani F (2003). 'Automating survey coding by multiclass text categorization techniques.' Journal of the American Society for Information Science and Technology 54(12), 1269–1277.

Hayes PJ & Weinstein SP (1990). 'CONSTRUE/TIS: a system for content-based indexing of a database of news stories.' In Rappaport A & Smith R (eds.) Proceedings of IAAI-90, 2nd Conference on Innovative Applications of Artificial Intelligence. Menlo Park: AAAI Press. 49–66.

Joachims T (1998). 'Text categorization with support vector machines: learning with many relevant features.' In Nédellec C & Rouveirol C (eds.) Proceedings of ECML-98, 10th European Conference on Machine Learning. Lecture Notes in Computer Science series, no. 1398. Heidelberg: Springer Verlag. 137–142.

Kehagias A, Petridis V, Kaburlasos VG & Fragkou P (2003). 'A comparison of word- and sense-based text categorization using several classification algorithms.' Journal of Intelligent Information Systems 21(3), 227–247.

Koppel M, Argamon S & Shimoni AR (2002). 'Automatically categorizing written texts by author gender.' Literary and Linguistic Computing 17(4), 401–412.

Koster CH & Seutter M (2003). 'Taming wild phrases.' In Sebastiani F (ed.) Proceedings of ECIR-03, 25th European Conference on Information Retrieval. Pisa: Springer Verlag. 161–176.

Maron M (1961). 'Automatic indexing: an experimental inquiry.' Journal of the Association for Computing Machinery 8(3), 404–417.

Moschitti A & Basili R (2004). 'Complex linguistic features for text classification: a comprehensive study.' In McDonald S & Tait J (eds.) Proceedings of ECIR-04, 26th European Conference on Information Retrieval Research. Lecture Notes in Computer Science series, no. 2997. Heidelberg: Springer Verlag. 181–196.

Nigam K & Ghani R (2000). 'Analyzing the applicability and effectiveness of co-training.' In Agah A, Callan J & Rundensteiner E (eds.) Proceedings of CIKM-00, 9th ACM International Conference on Information and Knowledge Management. McLean: ACM Press. 86–93.

Nigam K, McCallum AK, Thrun S & Mitchell TM (2000). 'Text classification from labeled and unlabeled documents using EM.' Machine Learning 39(2/3), 103–134.

Schapire RE & Singer Y (2000). 'BoosTexter: a boosting-based system for text categorization.' Machine Learning 39(2/3), 135–168.

Sebastiani F (2002). 'Machine learning in automated text categorization.' ACM Computing Surveys 34(1), 1–47.

Stamatatos E, Fakotakis N & Kokkinakis G (2000). 'Automatic text categorization in terms of genre and author.' Computational Linguistics 26(4), 471–495.

Turney PD & Littman ML (2003). 'Measuring praise and criticism: inference of semantic orientation from association.' ACM Transactions on Information Systems 21(4), 315–346.

Relevant Websites

http://svmlight.joachims.org – SVMlight website.
http://www.research.att.com/~schapire/BoosTexter/ – BoosTexter website.
http://www.math.unipd.it/~fabseb60 – F. Sebastiani's website.