Multilabel Classification
Problem Analysis, Metrics and Techniques

Francisco Charte
Antonio J. Rivera
María J. del Jesus
Francisco Herrera

Springer
Contents

1 Introduction
  1.1 Overview
  1.2 The Knowledge Discovery in Databases Process
  1.3 Data Preprocessing
  1.4 Data Mining
    1.4.1 DM Methods Attending to Available Data
    1.4.2 DM Methods Attending to Target Objective
    1.4.3 DM Methods Attending to Knowledge Representation
  1.5 Classification
    1.5.1 Binary Classification
    1.5.2 Multiclass Classification
    1.5.3 Multilabel Classification
    1.5.4 Multidimensional Classification
    1.5.5 Multiple Instance Learning
  References
2 Multilabel Classification
  2.1 Introduction
  2.2 Problem Formal Definition
    2.2.1 Definitions
    2.2.2 Symbols
    2.2.3 Terminology
  2.3 Applications of Multilabel Classification
    2.3.1 Text Categorization
    2.3.2 Labeling of Multimedia Resources
    2.3.3 Genetics/Biology
    2.3.4 Other Application Fields
    2.3.5 MLDs Repositories
5 Adaptation-Based Classifiers
  5.1 Overview
  5.2 Tree-Based Methods
    5.2.1 Multilabel C4.5, ML-C4.5
    5.2.2 Multilabel Alternating Decision Trees, ADTBoost.MH
    5.2.3 Other Tree-Based Proposals
  5.3 Neural Network-Based Methods
    5.3.1 Multilabel Back-Propagation, BP-MLL
    5.3.2 Multilabel Radial Basis Function Network, ML-RBF
    5.3.3 Canonical Correlation Analysis and Extreme Learning Machine, CCA-ELM
  5.4 Support Vector Machine-Based Methods
    5.4.1 MODEL-x
    5.4.2 Multilabel SVMs Based on Ranking, Rank-SVM and SCRank-SVM
  5.5 Instance-Based Methods
    5.5.1 Multilabel kNN, ML-kNN
    5.5.2 Instance-Based and Logistic Regression, IBLR-ML
    5.5.3 Other Instance-Based Classifiers
  5.6 Probabilistic Methods
    5.6.1 Collective Multilabel Classifiers, CML and CMLF
    5.6.2 Probabilistic Generative Models, PMM1 and PMM2
    5.6.3 Probabilistic Classifier Chains, PCC
    5.6.4 Bayesian and Tree Naïve Bayes Classifier Chains, BCC and TNBCC
    5.6.5 Conditional Restricted Boltzmann Machines, CRBM
  5.7 Other MLC Adaptation-Based Methods
  5.8 Adapted Methods in Practice
    5.8.1 Experimental Configuration
    5.8.2 Classification Results
  5.9 Summarizing Comments
  References
6 Ensemble-Based Classifiers
  6.1 Introduction
  6.2 Ensembles of Binary Classifiers
    6.2.1 Ensemble of Classifier Chains, ECC
    6.2.2 Ranking by Pairwise Comparison, RPC
    6.2.3 Calibrated Label Ranking, CLR
  6.3 Ensembles of Multiclass Classifiers
    6.3.1 Ensemble of Pruned Sets, EPS
    6.3.2 Random k-Labelsets, RAkEL
    6.3.3 Hierarchy of Multilabel Classifiers, HOMER
  6.4 Other Ensembles
Chapter 1
Introduction

Abstract This book is focused on multilabel classification and related topics. Multilabel classification is one specific type of classification, classification being one of the usual tasks in the data mining field. Data mining itself can be seen as a step in a broader process, the discovery of new knowledge from databases. The goal of this first chapter is to introduce all these concepts, aiming to set the working context for the topics covered in the following ones. A global outline in this respect is given in Sect. 1.1. Section 1.2 provides an overview of the whole Knowledge Discovery in Databases process. Section 1.3 introduces the essential preprocessing tasks. Then, the different learning styles in use nowadays are explained in Sect. 1.4, and lastly multilabel classification is introduced in comparison with other traditional types of classification in Sect. 1.5.
1.1 Overview
The technological progress of recent years has propelled the availability of huge amounts of data. Storage and communication capabilities have grown exponentially, increasing the need to automatically process all these data. Due to this fact, machine learning techniques have acquired considerable relevance. In particular, the automatic classification of all kinds of digital information, including texts, photos, music, and videos, is in growing demand. Multilabel classification is the field where methods to perform this task, labeling resources into several non-exclusive categories, are studied and proposed.
This book presents a review of multilabel classification procedures and related techniques, including the analysis of obstacles specifically tied to this class of methods. Experimental results from the most relevant proposals are also provided. The goal of this first chapter is to set the context multilabel classification belongs to. It starts from the wide view of the whole Knowledge Discovery in Databases (KDD) process, then narrows the focus to nonstandard classification methods, where multilabel classification is introduced.
1.2 The Knowledge Discovery in Databases Process

The daily activity of millions of users, working and interacting with businesses
and institutions, is digitally recorded into databases known as Online Transaction
Processing systems. This has led to the availability of huge amounts of data in all
kinds of corporations, no matter whether they are small or big companies. Extracting
useful knowledge from these data by manual means is extremely difficult, if not
impossible. This is why Data Mining (DM) techniques are increasing their popularity
as an automatic way of getting the knowledge hidden in the data. This knowledge
can be very valuable to support decision-making systems, to describe the structure
of the information, to predict future data, and so on.
DM is a well-known and solid discipline nowadays, usually seen as one of the steps in the process known as KDD. In [14], KDD is defined as a non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. These patterns are the result obtained at the end of the process and can be of disparate nature, as will be explained below.
Extracting new and useful insights from a database is a process that can be divided
into multiple stages. These have been schematically represented in the diagram in
Fig. 1.1. The starting point should be understanding the domain the problem belongs
to, specifying the goals to achieve. From here, the following steps would be:
1. Data gathering and integration: The data needed to accomplish the established objectives can reside in heterogeneous sources, such as relational databases, spreadsheets, or comma-separated value files. Once all the data have been gathered, they have to be integrated into a unified dataset.
Fig. 1.1 The steps the KDD process is usually divided into
As the diagram in Fig. 1.1 denotes, all the steps in the KDD process can jump backward in the pipeline, depending on different conditions. As a consequence, each stage can be applied several times, until a certain condition is met, aiming to improve data quality in each iteration.
Although DM is considered the essential stage in KDD, most of the effort and time is usually spent in preprocessing tasks [17]. These are responsible for dealing with problems such as missing values, noise presence, outliers, feature and instance selection, and data compression. In the following, a deeper analysis of preprocessing and DM tasks is provided.
1.3 Data Preprocessing

Once the data have been retrieved from the sources they are stored in, frequently the first step is to prepare them through one or more preprocessing methods. We generically refer here, as data preprocessing, to several integration, cleaning, transformation, and other data preparation and data reduction tasks. These methods will be run before applying any DM algorithm, easing the process of extracting some useful information.
According to the following statement from [17], data preprocessing duties can be divided into two groups, entitled data preparation and data reduction. Identifying the proper preprocessing to administer to the data can improve the results obtained in subsequent steps of the KDD process. In this section, the main preprocessing tasks are briefly introduced conforming to this grouping criterion.
Data preprocessing includes data preparation, compounded by integration, cleaning, normalization, and transformation of data; and data reduction tasks, such as feature selection, instance selection, and discretization. The result expected after a reliable chaining of data preprocessing tasks is a final dataset, which can be considered correct and useful for further data mining algorithms.
Data Preprocessing in Data Mining, Springer 2015
Data preparation comprises several assorted tasks, including data cleaning and normalization, dealing with missing values, noise handling, and extreme value detection. The main data preparation methods are depicted in Fig. 1.2 (reproduced here with authorization from [17]).
The most usual data normalizations are applied over numeric attributes. Normalization is a simple transformation technique whose goal is to adjust the attribute values so that they share a common scale, making them proportional and comparable, thus easing the learning process of DM algorithms.
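As an illustration, the following is a minimal sketch of min-max rescaling, one common normalization scheme (the specific formula is an assumption for illustration, not prescribed by the text):

import numpy as np

def min_max_normalize(X):
    """Rescale each numeric attribute to the [0, 1] range so all features share a common scale."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    spans = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero on constant columns
    return (X - mins) / spans

print(min_max_normalize([[1, 200], [2, 400], [3, 600]]))  # each column now spans [0, 1]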
Fig. 1.2 Usual data preparation methods aimed to prepare the data before applying any DM algorithm. These would be used just after the data gathering step

Missing values can be present in a dataset for disparate reasons. They can exist due to a failure in previous data transcription steps, or simply because someone has left a form item empty. The cleaning can consist of removing the instances with missing values or of following some data imputation scheme, usually replacing the missing value with an estimate produced by a model.
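A minimal sketch of one such imputation scheme, mean imputation, follows (the choice of the column mean as estimator is an assumption for illustration):

import numpy as np

def impute_mean(X):
    """Replace each NaN entry with the mean of its column, a simple imputation scheme."""
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)  # per-feature means, ignoring missing entries
    missing = np.isnan(X)              # boolean mask of missing positions
    X[missing] = np.take(col_means, np.where(missing)[1])
    return X

print(impute_mean([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]]))  # NaNs become 2.0 and 3.0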
Noisy data are identified as class noise or as attributes containing values which are clearly incorrect, frequently showing random variations that cannot be explained. The techniques to clean noise are similar to the ones aforementioned for missing values.
Building a model from a dataset with a large number of features or instances has
a high computational cost. For this reason, data reduction techniques are among the
most popular preprocessing methods. These methods can be grouped into the three
categories depicted in Fig. 1.3 (reproduced here with authorization from [17]).
Feature selection algorithms aim to reduce the number of features, removing those that are redundant or irrelevant. Reducing the number of features usually helps in simplifying further data mining tasks.
Dimensionality reduction techniques are an alternative to explicit feature selection in some cases, creating a set of artificial features via linear or nonlinear feature combinations. Deep learning methods [24] such as Stacked Autoencoders and Restricted Boltzmann Machines can also be useful for this task.
6 1 Introduction
The number of samples in the dataset can be reduced mainly through instance
selection techniques [17]. They find the most representative instances in the dataset,
selecting them as prototypes prior to applying DM algorithms.
Discretization is a process able to translate a continuous numeric attribute into a set of adjacent subranges, thus reducing the number of distinct values to consider. The result can be interpreted as a qualitative attribute, instead of a numeric one.
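A minimal sketch of equal-width discretization, one common way of defining these adjacent subranges (the binning strategy is an assumption), is shown below:

import numpy as np

def discretize_equal_width(values, bins=4):
    """Map a continuous attribute to `bins` adjacent subranges of equal width."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), bins + 1)
    # np.digitize returns, for each value, the index of the subrange it falls into
    return np.clip(np.digitize(values, edges[1:-1]), 0, bins - 1)

print(discretize_equal_width([0.1, 0.4, 2.5, 7.9, 9.8]))  # [0 0 0 3 3]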
Although the tasks discussed in this section can be considered the most common preprocessing ones, there are a few more that can be applied when a DM algorithm cannot deal with the data in its current form. Some of them, specific to multilabel classification, will be fully described in Chap. 2.
1.4 Data Mining

The ultimate aim of the data preparatory phases in the KDD process is to apply some DM methods [14], whose duty is to learn from the data to produce useful knowledge. Currently, there is a large collection of proven DM algorithms available. These can be grouped attending to several criteria: whether the data are labeled or not, the kind of result aimed for, and the model used to represent the knowledge, among others.
In this section, a brief overview of the most remarkable techniques, attending to the three aforementioned grouping criteria, is provided.
1.4.1 DM Methods Attending to Available Data

The nature of the available data will affect the kind of DM methods which can be used, also determining the goals that can be set as targets of the learning process. Three main cases can be considered, as depicted in Fig. 1.4. These three categories are as follows:
• Supervised learning: The data instances have been previously labeled [9] by an expert in the problem domain. The DM methods can use this information to infer the knowledge needed to label new, never-before-seen data instances. In this context, the label could be a continuous numerical value or a discrete value. The selected DM method has to work with the data in order to detect relationships between the input attributes, which determine the position of each instance in the solution space, and the target label. Supervised DM methods usually divide the dataset into two (training and test) or three (training, validation, and test) disjoint subsets. The label of instances in the test set is not given to the algorithm, being used only to assess its performance.
• Unsupervised learning: All data instances are unlabeled [8], so the DM methods cannot use expert knowledge as the foundation of their learning. Since only input attributes are available, there is no point in looking for mappings between inputs and outputs. The values of the input attributes are studied to group data instances or to search for certain patterns, instead.
• Semi-supervised learning: When the dataset contains both labeled and unlabeled samples, DM methods can combine techniques from the two previous categories to accomplish semi-supervised learning tasks [33]. Labeled data instances can be used to induce a model, as in supervised learning, then refining it with the information from unlabeled samples. Analogously, unsupervised tasks can be improved by introducing the clues given by the labeled instances.
1.4.2 DM Methods Attending to Target Objective

From the understanding of the problem domain, the starting point in the KDD process, the experts have to specify one or more objectives to achieve. Depending on what kind of target objective is set, a group of DM methods could be applied. The target goals can be grouped into two main categories:
• Predictive tasks: The objective is to infer an output value from the information given by the input attributes. The best-known predictive DM tasks are classification and regression. The data samples given to the algorithms have to be labeled, so predictive tasks are usually linked to supervised or semi-supervised learning techniques.
• Descriptive tasks: The goal is to extract information about the inner structure of the
data samples, represented by the values of their features, or the relationships among
them. This way the data can be clustered, patterns can be mined as association rules,
anomalies can be detected, and dimensionality can be reduced, among other tasks.
These are almost always unsupervised learning tasks, although supervised and
semi-supervised learning also have applications in this area.
1.4.3 DM Methods Attending to Knowledge Representation

Once the task at hand is identified according to the two previous grouping criteria, e.g., the data are labeled and the goal is to classify new instances through supervised learning techniques, a specific representation for the obtained knowledge has to be selected. This will be the last step prior to choosing a particular DM algorithm. Generally, a plethora of algorithms exists for each representation model. The knowledge can be represented as a decision tree, for instance, and there are many distinct algorithms to generate this type of model.
Most models will generate a decision boundary aiming to group the data samples. This boundary can be as simple as a linear one, or as complicated as the ones produced by artificial neural networks in a complex space. Assuming a plain 2D space and samples belonging to only two classes, Fig. 1.5 depicts the aspect of some of these boundaries.

Fig. 1.5 Decision boundaries resulting from some of the best-known learning models
Without aiming to be exhaustive, some of the most well-studied knowledge representation models are enumerated below, along with several of the algorithms associated with them:
• Instance-based learning: The most popular algorithm in this category is kNN (k Nearest Neighbors) [11]. Depending on the specific algorithm, a bare-bones kind of model can be built, e.g., getting the a priori probabilities for each class, or the task can be accomplished without any model. When a new pattern arrives, its nearest neighbors in the search space are obtained, attending to some similarity function, and the class or regression value is predicted from the labels or values present in them (a minimal sketch is given after this list).
• Trees: Algorithms to induce trees [28] have been applied to classification, regression, and rule discovery, among other DM tasks. A tree recursively divides the search space into smaller groups of instances, following a divide-and-conquer strategy, choosing the best dividing feature in each step according to some quality metric. C4.5 [26] and CART [7] are algorithms used to generate classification trees and regression trees, respectively.
• Support Vector Machines: SVMs, also known as SVNs (Support Vector Networks), are models which maximize the margin between data instances and a hyperplane acting as division boundary [10]. The algorithms aimed to generate this hyperplane are instance-based learners, but the distances are used to adjust the boundary position, instead of getting the most similar neighbors. This model was originally designed for linearly separable problems, but through the so-called kernel trick it can also be applied to nonlinear scenarios which are linearly separable in higher dimensions.
• Neural Networks: Artificial Neural Networks (ANNs) [27] are models inspired by how the human brain works, specifically by its structure of interconnected cells (neurons) and how their outputs are activated depending on their inputs. ANNs have been applied to solve classification, regression, and clustering problems, among others. ANNs are complex models which can follow disparate structures, needing algorithms able to optimize the weights connecting neurons. Some of the possible models are Multilayer Perceptrons, Modular Neural Networks, and Self-Organizing Maps, among others.
• Bayesian models: Statistical graphical models based on Bayes' theorem [5] have several DM applications, being applied in classification and regression tasks. The simplest model in this category is naïve Bayes, which is limited to working with categorical attributes and assumes total independence between feature values. Bayesian networks [23] are more complex and powerful approaches, based essentially on the same principle.
• Regression models: Regression models [9] emerged in the field of statistics, being defined as an iterative procedure to adjust a certain mathematical function to a set of patterns, usually minimizing a precise error metric. They are among the oldest DM models. Some of the most common algorithms generating them are ordinary least-squares regression, linear regression, and logistic regression.
• Rules: The research in rule-based systems [15] started more than 50 years ago, being used both for descriptive and predictive tasks. Rule-based classification models are sets of if-then rules, with a conjunction of features in the antecedent and the class label in the consequent. The main advantage of these classifiers is the ease of understanding the produced model. Association rules, on the other hand, are aimed to communicate the frequency of patterns in the dataset, as well as the relationships (concurrence, implication, etc.) between them. A set of association rules provides insight about the internal structure of the instances and can be used to describe the data, predict some values from the presence of others, group similar samples, etc. The first algorithm aimed to produce this kind of rules was Apriori [2]. FP-Growth [20] is also a popular algorithm for association rule mining.
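The sketch below illustrates the instance-based scheme from the list above: a bare-bones kNN classifier. Euclidean distance and majority voting are assumptions; any similarity function could take their place.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the majority class among the k nearest training samples."""
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x), axis=1)  # similarity function
    nearest = np.argsort(dists)[:k]                                      # indices of the neighbors
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]     # majority vote

X = [[0, 0], [0, 1], [5, 5], [6, 5]]
y = ["neg", "neg", "pos", "pos"]
print(knn_predict(X, y, [5, 6], k=3))  # "pos"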
In order to produce the best possible model from the available data, the methods that generate them rely on two foundations: an evaluation metric and an optimization approach. Both will depend on the kind of model to be produced. The evaluation metric can be accuracy for classification models, squared error for regression models, likelihood for Bayesian models, etc. Some common optimization approaches are gradient-based methods and evolutionary algorithms, among others.
1.5 Classification
1.5.1 Binary Classification

This is the easiest classification problem we can face [1]. The instances in a binary dataset have only one output attribute, and it can take only two different values. These are usually known as positive and negative, but can also be interpreted as true and false, 1 and 0, or any other combination of two values. A classical example of this task is spam filtering (see Fig. 1.6), in which the classifier learns from the messages' content which ones can be considered spam.
A binary classifier aims to find a boundary able to separate the instances into two groups, one belonging to the positive class and the other to the negative one. In practice, depending on the input feature space, the distribution of samples can be much more complex, thus needing additional frontiers between instances.

Fig. 1.6 Spam filtering is a typical binary classification problem. The classifier learns from the messages' text and predicts whether each one is spam or not. The problem might not be linearly separable, as in this simple example
1.5.2 Multiclass Classification

A multiclass dataset also has only one output attribute, like binary datasets, but it can hold any value from a certain set of predefined ones. The meaning of each value, and the value itself, would be specific to each application. The set of classes must be finite and discrete; otherwise the task would not be classification but regression. Class values may or may not have an order relationship. One of the best-known multiclass classification examples is iris species identification (see Fig. 1.7). From four characteristics of the flower, i.e., the petal and sepal lengths and widths, the classifier learns how to classify new instances into the corresponding family.
Many multiclass classification algorithms rely on binarization [16], a method that iteratively trains a binary classifier for each class against the others, following a One-vs-All (OVA) approach, or for each pair of classes, using a One-vs-One (OVO) approach (a sketch of the OVA scheme follows).
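Below is a minimal sketch of the OVA scheme, assuming logistic regression as the underlying binary learner (an arbitrary choice for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ova(X, y):
    """One-vs-All: train one binary classifier per class, that class against all the others."""
    y = np.asarray(y)
    return {c: LogisticRegression().fit(X, (y == c).astype(int)) for c in np.unique(y)}

def predict_ova(models, x):
    """Choose the class whose binary model gives the highest positive-class probability."""
    scores = {c: m.predict_proba(np.asarray(x).reshape(1, -1))[0, 1] for c, m in models.items()}
    return max(scores, key=scores.get)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array(["a", "a", "b", "b"])
print(predict_ova(train_ova(X, y), [2.5]))  # "b"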
Multiclass classification can be seen as a generalization of binary classification. There is still only one output, but it can take any value from the class set, while in the binary case it is limited to two values.
Fig. 1.7 Iris species categorization is a classical multiclass classification problem. The classifier
learns from the length and width of petal and sepal, predicting which of the three species the flower
belongs to
1.5.3 Multilabel Classification

Unlike the two previous classification models, in multilabel classification [18, 29, 31] each data instance has an associated vector of outputs, instead of only one value. The length of this vector is fixed according to the number of different labels in the dataset. Each element of the vector is a binary value, indicating whether the corresponding label is relevant to the sample or not. Several labels can be active at once. Each distinct combination of labels is known as a labelset. Figure 1.8 depicts one of the classical multilabel applications, image labeling.¹ The dataset has four labels in total and each picture can be assigned any of them, or even all at once if there were a picture in which the four concepts appear.
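As a minimal sketch of this representation (the label names below are illustrative, not taken from a real dataset):

LABELS = ["beach", "mountain", "forest", "urban"]  # k = 4 labels, names are illustrative

def to_labelset(relevant):
    """Encode the set of relevant labels as a fixed-length binary vector (the labelset)."""
    return [1 if label in relevant else 0 for label in LABELS]

print(to_labelset({"beach", "forest"}))  # [1, 0, 1, 0] -- several labels active at once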
Multilabel classification is currently applied in many fields, most of them related to automatic labeling of social media resources such as images, music, videos, news, and blog posts. The algorithms used for this task must be able to make several predictions at once, whether by transforming the original datasets or by adapting existing binary/multiclass classification algorithms.
¹The pictures used in this example are public domain images taken from the http://www.publicdomainpictures.net website.
Fig. 1.8 Image labeling is a usual multilabel classification task. Two or more labels are assigned to each image, depending on the elements it contains. There are several overlapping boundaries, one associated with each label
1.5.4 Multidimensional Classification

This type of task can be seen as a generalization of the previous one. Multidimensional datasets [6] also have an output vector associated with each instance, rather than only one value. However, each item in this vector can take any value from a predefined set, not being limited to binary values. Therefore, the relationship between multidimensional and multilabel classification is essentially the same as that previously mentioned for multiclass and binary classification.

Multidimensional techniques are closely related to those used in the multilabel field, with transformation methods taking a relevant role in facing this task. The range of potential applications is also similar. Multidimensional classification is used to categorize images, music, texts, and analogous resources, but predicting for each label a value in a larger set than multilabel classification.
1.5.5 Multiple Instance Learning

Unlike all other classification kinds described above, the multiple instance learning paradigm [4], also known as multiinstance learning, learns a common class label for a set of input feature vectors. It is a very different problem, since each logical data instance is defined not by a single vector of input features, but by a collection of physical instances, each one with its own set of input attributes. Each group of instances is usually known as a bag. The associated class label belongs to the bag, instead of to the individual data instances it holds.
Among the applications of this classification paradigm, one of the most easily understandable is image categorization. The original images are divided into regions or patches, so each image is represented by a set of vectors, each one containing the data of a patch. The associated class label could depend on certain objects appearing in some of the patches.
Multiple instance learning can be extended by considering multiple class labels as output, rather than only one. From this casuistry emerge variations such as multiple instance multilabel classification [32]. In the same way, a multiple instance multidimensional classification task could be considered.
References
1. Aggarwal, C.C. (ed.): Data Classification: Algorithms and Applications. CRC Press (2014)
2. Agrawal, R., Srikant, R., et al.: Fast algorithms for mining association rules. In: Proceedings
of 20th International Conference on Very Large Data Bases, VLDB’94, pp. 487–499. Morgan
Kaufmann (1994)
3. Aha, D.W. (ed.): Lazy Learning. Springer (1997)
4. Amores, J.: Multiple instance classification: review, taxonomy and comparative study. Artif.
Intell. 201, 81–105 (2013)
5. Barber, D.: Bayesian Reasoning and Machine Learning. Cambridge University Press (2012)
6. Bielza, C., Li, G., Larrañaga, P.: Multi-dimensional classification with Bayesian networks. Int.
J. Approximate Reasoning 52(6), 705–727 (2011)
7. Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. CRC
press (1984)
8. Celebi, M.E., Aydin, K. (eds.): Unsupervised Learning Algorithms. Springer (2016)
9. Cherkassky, V., Mulier, F.: Learning from Data: Concepts, Theory and Methods. Wiley-IEEE
Press (2007)
10. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
11. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13(1),
21–27 (1967)
12. De Gooijer, J.G., Hyndman, R.J.: 25 years of time series forecasting. Int. J. Forecast. 22(3),
443–473 (2006)
13. Dong, G., Pei, J.: Sequence Data Mining. Springer (2007)
14. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery in
databases. AI Mag. 17(3), 37 (1996)
15. Fürnkranz, J., Gamberger, D., Lavrač, N.: Foundations of Rule Learning. Springer (2012)
16. Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F.: An overview of ensemble
methods for binary classifiers in multi-class problems: experimental study on one-vs-one and
one-vs-all schemes. Pattern Recogn. 44(8), 1761–1776 (2011)
17. García, S., Luengo, J., Herrera, F.: Data Preprocessing in Data Mining. Springer (2015)
18. Gibaja, E., Ventura, S.: A tutorial on multi-label learning. ACM Comput. Surv. 47(3) (2015)
19. Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques. Morgan Kaufmann (2011)
20. Han, J., Pei, J., Yin, Y., Mao, R.: Mining frequent patterns without candidate generation: a
frequent-pattern tree approach. Data Min. Knowl. Disc. 8(1), 53–87 (2004)
21. Hernández-González, J., Inza, I., Lozano, J.A.: Weak supervision and other non-standard classification problems: a taxonomy. Pattern Recogn. Lett. 69, 49–55 (2016)
22. Hyndman, R.J., Athanasopoulos, G.: Forecasting: Principles and practice. OText books (2013)
23. Koller, D., Friedman, N.: Probabilistic Graphical Models. Principles and Techniques. MIT
Press (2009)
24. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
25. Novak, P.K., Lavrač, N., Webb, G.I.: Supervised descriptive rule discovery: a unifying survey of
contrast set, emerging pattern and subgroup mining. J. Mach. Learn. Res. 10, 377–403 (2009)
26. Quinlan, J.R.: C4.5: Programs for Machine Learning (1993)
27. Rojas, R.: Neural Networks. A Systematic Study. Springer (1996)
28. Rokach, L., Maimon, O.: Data Mining with Decision Trees, 2nd edn. World Scientific (2015)
29. Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. In: Data Mining and Knowledge Discovery Handbook, pp. 667–685. Springer (2010)
30. Zhang, C., Zhang, S.: Association Rule Mining. Springer (2002)
31. Zhang, M.L., Zhou, Z.H.: A review on multi-label learning algorithms. IEEE Trans. Knowl.
Data Eng. 26(8), 1819–1837 (2014)
32. Zhou, Z.H., Zhang, M.L., Huang, S.J., Li, Y.F.: Multi-instance multi-label learning. Artif. Intell.
176(1), 2291–2320 (2012)
33. Zhu, X., Goldberg, A.B.: Introduction to Semi-supervised Learning. Morgan & Claypool Publishers (2009)
Chapter 2
Multilabel Classification
Abstract This book is concerned with the classification of multilabeled data and other tasks related to that subject. The goal of this chapter is to formally introduce the problem, as well as to give a broad overview of its main application fields and how it has been tackled by experts. A general introduction to the matter is provided in Sect. 2.1, followed by a formal definition of the multilabel classification problem in Sect. 2.2. Some of the main application fields of multilabel classification are portrayed in Sect. 2.3. Lastly, the approaches followed to face this task are introduced in Sect. 2.4.
2.1 Introduction
The main difference between traditional¹ and multilabel classification is in the output expected from trained models. Where a traditional classifier will return only one value, a multilabel one has to produce a vector of output values. Multilabel classification can be formally defined as follows.
2.2 Problem Formal Definition

2.2.1 Definitions
Definition 2.1 Let X denote the input space, with data samples X ∈ A1 × A2 × ··· × Af, where f is the number of input attributes and A1, A2, ..., Af are arbitrary sets. Therefore, each instance X will be an element of the Cartesian product of these sets.

Definition 2.2 Let L be the set of all possible labels. P(L) denotes the powerset of L, containing all the possible combinations of labels l ∈ L, including the empty set and L itself. k = |L| is the total number of labels in L.

Definition 2.3 Let Y be the output space, containing all the possible vectors Y ∈ P(L). The length of Y will always be k.

Definition 2.4 Let D denote a multilabel dataset, containing a finite subset of A1 × A2 × ··· × Af × P(L). Each element (X, Y) ∈ D, with X ∈ A1 × A2 × ··· × Af and Y ∈ P(L), will be an instance or data sample. n = |D| will be the number of elements in D.

Definition 2.5 Let F be a multilabel classifier, defined as F : X → Y. The input to F will be any instance X ∈ X, and the output will be the prediction Z ∈ Y. Therefore, the prediction of the vector of labels associated with any instance can be obtained as Z = F(X).
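As an illustration, these definitions can be sketched in code, representing each labelset as a binary indicator vector of length k (this encoding and the names below are assumptions for illustration):

from typing import Callable, List, Sequence, Tuple

Instance = Tuple       # X: one value per input attribute, drawn from A1 x ... x Af
Labelset = List[int]   # Y: binary vector of length k, one entry per label in L
MLClassifier = Callable[[Instance], Labelset]   # F: X -> Y, producing the prediction Z = F(X)

# A multilabel dataset D is a finite collection of (X, Y) pairs; n = len(D)
D: Sequence[Tuple[Instance, Labelset]] = [
    ((5.1, 3.5), [1, 0, 1]),   # instance with the first and third labels relevant (k = 3)
    ((4.9, 3.0), [0, 1, 0]),
]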
2.2.2 Symbols
From these definitions, the following list of symbols, which will be used in this
chapter and the next ones, is derived:
D Any multilabel dataset (MLD).
n The number of data instances in D.
L The full set of labels appearing in D.
l Any of the labels in L.
k Total number of elements in L.
X The set of input attributes of any instance.
¹In general, we will refer to binary and multiclass classification, which are the most well-known classification kinds, as traditional classification.
2.2.3 Terminology
In addition to the previous notation, the following terms will be frequently used
throughout the text:
• MLD/MLDs: Any multilabel dataset or group of multilabel datasets.
• MLC: Any multilabel classifier or multilabel classification algorithm.
• Instance/sample: A row in an MLD, including its input attributes and associated
labelset.
• Attributes/features: Generally used to refer to the set of input attributes in the
MLD, without including the output labelset.
• Label: Any of the output attributes associated with an instance.
• Labelset: A vector of labels in {0, 1}^k; its length will always be k.
• True labelset: The labelset a sample in the MLD is annotated with.
• Predicted labelset: The labelset an MLC gives as output for a new sample.
2.3 Applications of Multilabel Classification

Once the main concepts linked to multilabel classification have been introduced, the next question possibly arising is what it can be used for. As explained in the previous section, a multilabel classifier aims to predict the set of relevant labels for a new data instance.
In this section, several application areas that can take advantage of this functionality are portrayed. In the following chapter, specific use cases belonging to each one of these areas will be detailed.
2.3.1 Text Categorization

Multilabel classification has its roots as a solution for organizing text documents into several not mutually exclusive categories. This is why there are so many publications regarding this topic [15, 21, 40–42]. Most of them will be further described in the next chapter.
Textual documents can be found anywhere, from big companies, which store all kinds of agreements and reports, to individuals filing their invoices and electronic mail messages. All published books and magazines, our historic medical records, as well as articles in electronic media, blog posts, question–answering forum messages, etc., are text documents too. Most of them can be classified into more than one category, hence the usefulness of multilabel classification to accomplish this kind of work.
The usual process to transform a set of text documents into an MLD relies on text mining techniques. The source documents are parsed, uninformative words are removed, and vectors with each word's occurrences among the documents are computed. At the end of this process, each document is described by a row in the MLD, and the columns correspond to words and their frequencies, or some other kind of representation such as TF/IDF (Term Frequency/Inverse Document Frequency, [47]).
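A minimal sketch of this vectorization step, assuming a recent version of scikit-learn's TfidfVectorizer as one possible implementation (the tooling choice is ours, not the book's):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "dogs and cats are common pets"]  # illustrative documents
vectorizer = TfidfVectorizer(stop_words="english")  # parses and drops uninformative words
X = vectorizer.fit_transform(docs)                  # one row per document, one column per term
print(vectorizer.get_feature_names_out())           # the terms acting as input attributes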
2.3.2 Labeling of Multimedia Resources

Although documents containing only text were the most frequent ones in the past, nowadays images, videos, and music are commonly used resources due to the huge growth experienced by storage and communication technologies. According to 2015 YouTube statistics, an estimated 432,000 new videos are uploaded every day. The number of new music and sound clips published each day is also impressive, and new images and photographs posted everywhere, from blogs to Web sites such as Tumblr, reach the millions-per-day barrier.
Multilabel classification has been used to label all these types of resources [5, 7, 28, 48, 56, 58], identifying the objects which appear in sets of images, the emotions produced by music clips, or the concepts which can be derived from video snippets. This way, the huge number of new resources can be correctly classified without human intervention, something that would otherwise be rather expensive.
Even though an image can be represented as a vector of color values, taking each pixel as a column, images are usually preprocessed in order to obtain their most relevant features. For doing so, segmentation techniques aimed at extracting boundaries are applied, the image can be convolved to transform the pixel space into an energy space, etc. Similar procedures are put in practice with other multimedia resources.
2.3.3 Genetics/Biology
Proteomics and genomics are research fields which have experienced a huge growth in recent years. As a result, immense databases with protein sequences have been produced, but only a small fraction of them have been studied to determine their function. Analyzing protein properties and gene expression is a very costly task, but DM techniques can accelerate the process and make it cheaper.

The application of multilabel classification in this area [25, 29] is aimed at predicting the biological functions of genes. Each gene can express more than one function at once, hence the interest in using MLC techniques to face this problem. The features used as predictors are usually the protein's motifs, which are traits of its internal structure. Structural motifs indicate how the elements in a secondary structural layer are connected. Certain patterns, such as short chains of amino acids, can be identified and used as predictive features.
2.3.5 MLDs Repositories

Usually, when new datasets are generated, the authors of each original work publicly provide the MLDs they have produced, so that other researchers can use them in their own studies. In addition, most MLDs can be obtained directly from some of the following data repositories:
• MULAN: The well-known multilabel learning package [55] has an associated data repository [54]. As of 2016, more than 20 MLDs are available in it. These MLDs are in ARFF (Attribute-Relation File Format), and each one has an associated XML file describing the labels and their relationships. The XML file is needed, since the position of the attributes acting as labels is not assumed to be at the end.
• KEEL: KEEL [2] is an open source software tool providing lots of preprocessing and DM algorithms. It has an associated data repository [1] with almost 20 MLDs. Along with the full datasets, 10-fcv and 5-fcv partitions are also supplied. The KEEL file format, based on ARFF, includes the names of the output attributes in the header; therefore, a separate XML file is not needed.
• MEKA: MEKA is a multilabel software package developed on top of WEKA [37]. Like the previous ones, it also has an associated data repository [45] with over 20 MLDs. The file format is also ARFF based, and the information needed to know which attributes are the labels, specifically the number of labels, is included in the header. Some multidimensional datasets are included.
• RUMDR: The R Ultimate Multilabel Dataset Repository [11] provides a compilation of all publicly available MLDs, as well as an R package which eases partitioning and exporting to several formats, including MULAN, KEEL, MEKA, LibSVM, and CSV.
2.4 Learning from Multilabel Data

The process to obtain a multilabel classification model is similar to that used for traditional classification, usually following supervised learning techniques. Most algorithms rely on an initial training phase, which depends on previously labeled samples to adjust the parameters of the model. Once trained, the model can be used to predict the labelsets of new instances.
When it comes to learning from multilabel data, two main approaches have been applied: data transformation and method adaptation. The former is based on transformation techniques which, applied to the original MLDs, are able to produce one or more binary or multiclass datasets. These can be processed with traditional classifiers. The latter aims to adapt existing classification algorithms, so they can deal with multilabel data, producing several outputs instead of only one. A third alternative, which naturally emerges from the first one, is the use of ensembles of classifiers.
A topic closely related to MLC is label ranking [57], whose goal is to map each instance in a dataset to a set of labels, establishing a total order relationship among them. That is, the output of these algorithms is a ranking of labels, assigning a certain weight to each one. A multilabel classifier can use this label ranking to decide which labels are relevant to the instance, applying a cut threshold that has to be computed in some way. In [8, 30, 38], different proposals on how to do multilabel classification founded on label rankings are presented.
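A minimal sketch of this idea, assuming the cut threshold is already known (how to compute it is exactly what the cited proposals address):

def ranking_to_labelset(weights, threshold=0.5):
    """Turn per-label ranking weights into a binary labelset via a cut threshold."""
    return [1 if w >= threshold else 0 for w in weights]

print(ranking_to_labelset([0.9, 0.2, 0.7, 0.4]))  # [1, 0, 1, 0]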
This section introduces the three main approaches to MLC: data transformation, method adaptation, and ensembles of classifiers. In addition, it also advances some of the aspects most often alluded to in multilabel learning, such as taking advantage of label correlation information, the high-dimensionality problem, and learning from imbalanced data.
MLC is a harder task than traditional classification, since the classifier has to predict several outputs at once. One of the first approaches to arise for solving this problem was to transform it, producing one or more simpler problems. The transformation idea is all about taking an MLD and generating datasets that can be processed by binary or multiclass classifiers. Commonly, the output produced by those classifiers has to be backtransformed in order to obtain the multilabel prediction.
Some of the simplest methods originally proposed in the literature are the ones
described below:
• Selecting a single label: This is the transformation named Model-s in [5], s standing for single label. When a sample is associated with a set of labels, one of them is chosen as the single class. This selection can be random or based on some heuristic method. The result produced by this transformation is a multiclass dataset having the same number of instances as the original one, but where each sample has only one class.
• Ignoring multilabel instances: Dismissing all the samples associated with more than one label, this transformation obtains a new dataset of multiclass nature, with only one label per instance. It is introduced as Model-i in [5], i standing for ignore. The resulting dataset will usually have fewer instances than the original one, unless none of the samples had two or more labels. Since there is only one label per instance, any multiclass classifier can be used.
For each specific transformation, a method to build the predicted labelset has to be defined. For instance, if the BR (Binary Relevance) transformation, which derives one binary dataset per label, is applied to the original MLD, the individual predictions obtained from the binary classifiers have to be joined to make up the corresponding labelset.
In recent years, the LP (Label Powerset) and BR transformations have been the foundation for multiple MLC solutions, including many which are based on ensembles of binary and multiclass classifiers. Additional details related to these techniques will be provided in Chap. 4, along with some experimental results. A minimal sketch of BR follows.
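The sketch assumes the labels are stored as an n × k binary matrix; the choice of decision trees as the underlying binary classifier is arbitrary:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def br_fit(X, Y):
    """BR transformation: train one binary classifier per label column of the n x k matrix Y."""
    return [DecisionTreeClassifier().fit(X, Y[:, j]) for j in range(Y.shape[1])]

def br_predict(models, X):
    """Join the k individual binary predictions back into predicted labelsets."""
    return np.column_stack([m.predict(X) for m in models])

X = np.array([[0], [1], [2], [3]])
Y = np.array([[1, 0], [1, 0], [0, 1], [1, 1]])
print(br_predict(br_fit(X, Y), X))  # recovers the training labelsets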
Automatic classification [26] has been a traditional DM task for years. Throughout this time, several families of algorithms [18] have been gradually developed, tested, and fine-tuned. These algorithms are an essential foundation for developing new, more specific ones, including those aimed at working with MLDs.

Some classification models were initially designed to solve binary problems and then extended to also consider multiclass cases [4, 31]. An example of this is the SVM [9]. By contrast, other approaches are able to deal with several classes with great simplicity. A kNN classifier [20], for instance, can predict the most frequent class among the neighbors of the new instance, whether there are two or more classes. Analogously, trees, classification rules, neural networks, and statistical models, among others, have been used to tackle both binary and multiclass classification.
The lessons learned from adapting binary classifiers to the multiclass scenario, such as binarization, voting methods, and divide-and-conquer procedures, have also been useful when adapting proven algorithms to tackle multilabel classification. The main difficulty is to build a model capable of predicting several outputs at once. Again, some approaches are easily adaptable, whereas others require more effort. A kNN-based algorithm can take the multiple outputs present in its neighbors to elaborate its own prediction, but an SVM has to find the maximum margin boundaries between multiple labels.

The dimensionality of multilabel data can be considered in three different spaces:
• Instance space: There are MLDs with millions of instances, but this is something increasingly common in all contexts, including traditional classification, due to the rise of big data.
• Input feature space: Although there are binary and multiclass datasets with many input features, MLDs stand out in this area, as they usually have tens of thousands of features. The dimensionality of this space is what mostly hinders the learning process.
• Output attribute space: Before multilabel classification was considered, classifiers only had to deal with one output attribute. Whether it was binary or multivalued, there was always only one output. By contrast, all MLDs have several outputs, and many of them have thousands, one for each label. Due to this fact, the treatment of dimensionality in MLDs is a more complex topic than in traditional classification.
That learning from a high-dimensional input space (i.e., a large number of features) imposes serious difficulties is well known. There is even an ad hoc expression naming this problem: the curse of dimensionality. Therefore, it is an obstacle deeply studied and analyzed in DM, with dozens of different proposals to face it to some extent [51].
There are unsupervised methods, such as Principal Component Analysis (PCA) [39] and Latent Semantic Indexing (LSI) [27], able to reduce dimensionality by analyzing only the input features. These can be applied to MLDs, since no knowledge about the output variables is needed. The supervised approaches, characterized by using the output attributes to infer which input features provide more useful information, have to be adapted prior to being used in the multilabel context.
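A minimal sketch of the unsupervised route, using scikit-learn's PCA on the input features of an MLD (the matrix sizes are illustrative):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 500)      # illustrative input matrix: 100 samples, 500 features
pca = PCA(n_components=20)        # keep 20 artificial features (linear combinations)
X_reduced = pca.fit_transform(X)  # the labels are not needed at any point
print(X_reduced.shape)            # (100, 20)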
Dimensionality in the label space is the least studied of these problems so far, with only a few proposals. Most of them will be detailed in Chap. 7, which is specifically devoted to this topic.
Learning from imbalanced data is another problem intrinsically linked to multilabel classification. Like high dimensionality in the input feature space, imbalance is a well-studied problem in traditional classification. There are many proven techniques to face imbalance in binary classification, and many of them have been adapted to work with multiclass datasets. A few of these techniques will be described in Chap. 8, including how some of them have been adjusted to deal with MLDs.
The differences among label distributions in MLDs arise firstly from their own nature. In an MLD with thousands of labels, for instance one categorizing text documents, it is usual that some of them are much more frequent than others. More specialized documents will have rare labels, which will be in the minority against the most common ones.
Secondly, the transformations applied to multilabel data tend to increase the imbalance between labels. This is the case with BR. By taking as positive only the instances in which a certain label appears, and as negative all other samples, the original distribution changes, affecting the representation of the considered label. The situation is even worse if this label is already a minority label.
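As an illustration, the per-label imbalance can be quantified by comparing each label's frequency against the most frequent one, in the spirit of the IRLbl measure proposed in [13] (this compact formulation is a sketch, not the exact published definition):

import numpy as np

def imbalance_ratios(Y):
    """Y: n x k binary label matrix. Ratio of the most frequent label's count to each label's."""
    counts = Y.sum(axis=0).astype(float)
    return counts.max() / counts   # 1.0 for the majority label; higher values mean rarer labels

Y = np.array([[1, 0, 1], [1, 0, 0], [1, 1, 0], [1, 0, 0]])
print(imbalance_ratios(Y))  # [1. 4. 2.]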
2.5 Multilabel Data Tools

To conclude this chapter, in which the topics of what multilabel classification is, what it is used for, and how it has been faced have been dealt with, the main tools currently used to work with this kind of information are briefly introduced. They will be explained in more detail in Chap. 9.

Although multilabel data can be analyzed, and multilabel classification conducted, with ad hoc software specifically designed to accomplish a certain task, some generic tools are also available. Many researchers and practitioners have relied on them in recent years. The most noteworthy are the following:
• MULAN: Presented in 2010 [55], it is the most mature multilabel software and probably the most widely used. MULAN is an open source library written in Java. It provides a programming interface (API) to help in the development of multilabel classification applications, including evaluation of results and label ranking generation. Several of the most common transformations and MLC algorithms are already included, ready to use. Unlike WEKA [37], the tool it is based on, MULAN does not offer a graphical user interface (GUI) to ease the accomplishment of these tasks.
• MEKA: Like MULAN, MEKA [45] is also founded on WEKA, but unlike MULAN it supplies a GUI from which the user can load MLDs, perform some exploration tasks, and apply different MLC algorithms. Since the GUI is very similar to that of WEKA, WEKA users will immediately feel comfortable with this tool. It is also open source and has been developed in Java.
• mldr: R is one of the most used tools to explore and analyze data nowadays, but it lacked the ability to handle the specificities of MLDs. The mldr [10] package adds this functionality to R, providing functions to load, modify, and write MLDs, as well as general exploratory analysis methods. The package also includes a Web-based GUI from which most tasks can be easily performed.
• mldr.datasets: Built upon the infrastructure provided by the preceding package, this R package [11] eases the process of obtaining multilabel datasets. In addition, it provides methods aimed at retrieving basic information about them, including citation data, as well as at facilitating the partitioning process.
References
1. Alcala-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F.:
KEEL multi-label dataset repository. http://sci2s.ugr.es/keel/multilabel.php
2. Alcala-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., Herrera, F.: KEEL
data-mining software tool: data set repository and integration of algorithms and experimental
analysis framework. J. Mult-valued Log. Soft Comput. 17(2–3), 255–287 (2011)
3. Alvares-Cherman, E., Metz, J., Monard, M.C.: Incorporating label dependency into the binary
relevance framework for multi-label classification. Expert Syst. Appl. 39(2), 1647–1655 (2012)
4. Aly, M.: Survey on multiclass classification methods. In: Technical Report, pp. 1–9. California
Institute of Technology (2005)
5. Boutell, M., Luo, J., Shen, X., Brown, C.: Learning multi-label scene classification. Pattern
Recogn. 37(9), 1757–1771 (2004)
6. Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
7. Briggs, F., Lakshminarayanan, B., Neal, L., Fern, X.Z., Raich, R., Hadley, S.J.K., Hadley, A.S.,
Betts, M.G.: Acoustic classification of multiple simultaneous bird species: a multi-instance
multi-label approach. J. Acoust. Soc. Am. 131(6), 4640–4650 (2012)
8. Brinker, K., Hüllermeier, E.: Case-based multilabel ranking. In: Proceedings of 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pp. 702–707. Morgan Kaufmann (2007)
9. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl.
Disc. 2(2), 121–167 (1998)
10. Charte, F., Charte, D.: Working with multilabel datasets in R: the mldr package. R J. 7(2),
149–162 (2015)
11. Charte, F., Charte, D., Rivera, A.J., del Jesus, M.J., Herrera, F.: R Ultimate multilabel dataset
repository. In: Proceedings of 11th International Conference on Hybrid Artificial Intelligent
Systems, HAIS’16, vol. 9648, pp. 487–499. Springer (2016)
12. Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: Multilabel classification. In: Problem
Analysis, Metrics and Techniques Book Repository. https://github.com/fcharte/SM-MLC
13. Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: Addressing imbalance in multilabel classification: measures and random resampling algorithms. Neurocomputing 163, 3–16 (2015)
14. Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: MLSMOTE: approaching imbalanced
multilabel learning through synthetic instance generation. Knowl.-Based Syst. 89, 385–397
(2015)
15. Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: QUINTA: a question tagging assistant
to improve the answering ratio in electronic forums. In: Proceedings of IEEE International
Conference on Computer as a Tool, EUROCON’15, pp. 1–6. IEEE (2015)
16. Chen, K., Lu, B., Kwok, J.: Efficient classification of multi-label and imbalanced data using
min-max modular classifiers. In: Proceedings of IEEE International Joint Conference on Neural
Networks, IJCNN’06, pp. 1770–1775 (2006)
17. Chen, X., Zhan, Y., Ke, J., Chen, X.: Complex video event detection via pairwise fusion of
trajectory and multi-label hypergraphs. Multimedia Tools Appl. 1–22 (2015)
18. Cherkassky, V., Mulier, F.: Learning from Data: Concepts, Theory and Methods. Wiley-IEEE
Press (2007)
19. Cong, H., Tong, L.H.: Grouping of TRIZ inventive principles to facilitate automatic patent clas-
sification. Expert Syst. Appl. 34(1), 788–795 (2008)
20. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13(1),
21–27 (1967)
21. Crammer, K., Dredze, M., Ganchev, K., Talukdar, P.P., Carroll, S.: Automatic code assign-
ment to medical text. In: Proceedings of Workshop on Biological, Translational, and Clinical
Language Processing, BioNLP’07, pp. 129–136. Association for Computational Linguistics
(2007)
22. Dembszynski, K., Waegeman, W., Cheng, W., Hüllermeier, E.: On label dependence in multi-
label classification. In: ICML Workshop on Learning from Multi-label data, pp. 5–12 (2010)
23. Dendamrongvit, S., Kubat, M.: Undersampling approach for imbalanced training sets and
induction from multi-label text-categorization domains. In: New Frontiers in Applied Data
Mining, LNCS, vol. 5669, pp. 40–52. Springer (2010)
24. Dietterich, T.: Ensemble methods in machine learning. In: Multiple Classifier Systems. LNCS,
vol. 1857, pp. 1–15. Springer (2000)
25. Diplaris, S., Tsoumakas, G., Mitkas, P., Vlahavas, I.: Protein classification with multiple algo-
rithms. In: Proc. 10th Panhellenic Conference on Informatics, PCI’05, vol. 3746, pp. 448–456.
Springer (2005)
26. Duda, R., Hart, P., Stork, D.: Pattern Classification, 2nd edn. John Wiley (2000)
27. Dumais, S., Furnas, G., Landauer, T., Deerwester, S., Deerwester, S., et al.: Latent semantic
indexing. In: Proceedings of 4th Text Retrieval Conference, TREC-4, pp. 105–115. NIST
(1995)
28. Duygulu, P., Barnard, K., de Freitas, J., Forsyth, D.: Object recognition as machine translation:
learning a lexicon for a fixed image vocabulary. In: Proceedings of 7th European Conference
on Computer Vision, ECCV’02, vol. 2353, pp. 97–112. Springer (2002)
29. Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. In: Advances in
Neural Information Processing Systems, vol. 14, pp. 681–687. MIT Press (2001)
30. Fürnkranz, J., Hüllermeier, E., Loza Mencía, E., Brinker, K.: Multilabel classification via
calibrated label ranking. Mach. Learn. 73, 133–153 (2008)
31. Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F.: An overview of ensemble
methods for binary classifiers in multi-class problems: experimental study on one-vs-one and
one-vs-all schemes. Pattern Recogn. 44(8), 1761–1776 (2011)
32. Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles
for the class imbalance problem: bagging, boosting, and hybrid-based approaches. IEEE Trans.
Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012)
33. Ghamrawi, N., McCallum, A.: Collective multi-label classification. In: Proceedings of 14th
ACM International Conference on Information and Knowledge Management, CIKM’05, pp.
195–200. ACM (2005)
34. Gibaja, E., Ventura, S.: A tutorial on multi-label learning. ACM Comput. Surv. 47(3) (2015)
35. Gonçalves, T., Quaresma, P.: A preliminary approach to the multilabel classification problem of
Portuguese juridical documents. In: Proceedings of 11th Portuguese Conference on Artificial
Intelligence, EPIA’03, pp. 435–444. Springer (2003)
36. Guo, Y., Gu, S.: Multi-label classification using conditional dependency networks. In: Pro-
ceedings of 22th International Joint Conference on Artificial Intelligence, IJCAI’11, vol. 2, pp.
1300–1305 (2011)
37. Holmes, G., Donkin, A., Witten, I.H.: WEKA: a machine learning workbench. In: Proceed-
ings of 2nd Australian and New Zealand Conference on Intelligent Information Systems,
ANZIIS’02, pp. 357–361 (2002)
38. Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K.: Label ranking by learning pairwise
preferences. Artif. Intell. 172(16), 1897–1916 (2008)
39. Jolliffe, I.: Introduction. In: Principal Component Analysis, pp. 1–7. Springer (1986)
40. Katakis, I., Tsoumakas, G., Vlahavas, I.: Multilabel text classification for automated tag sug-
gestion. In: Proceedings of European Conference on Machine Learning and Principles and
Practice of Knowledge Discovery in Databases, ECML PKDD’08, pp. 75–83 (2008)
41. Klimt, B., Yang, Y.: The Enron corpus: a new dataset for email classification research. In:
Proceedings of 15th European Conference on Machine Learning, ECML'04, pp. 217–226. Springer
(2004)
42. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: a new benchmark collection for text catego-
rization research. J. Mach. Learn. Res. 5, 361–397 (2004)
43. Read, J., Pfahringer, B., Holmes, G.: Multi-label classification using ensembles of pruned
sets. In: Proceedings of 8th IEEE International Conference on Data Mining, ICDM'08, pp.
995–1000. IEEE (2008)
44. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification.
Mach. Learn. 85, 333–359 (2011)
45. Read, J., Reutemann, P.: MEKA multi-label dataset repository. http://sourceforge.net/projects/
meka/files/Datasets/
46. Rokach, L.: Pattern classification using ensemble methods. World Scientific (2009)
47. Salton, G., Fox, E.A., Wu, H.: Extended Boolean information retrieval. Commun. ACM 26(11),
1022–1036 (1983)
48. Snoek, C.G.M., Worring, M., van Gemert, J.C., Geusebroek, J.M., Smeulders, A.W.M.: The
challenge problem for automated detection of 101 semantic concepts in multimedia. In: Pro-
ceedings of 14th ACM International Conference on Multimedia, MULTIMEDIA’06, pp. 421–
430 (2006)
49. Sobol-Shikler, T., Robinson, P.: Classification of complex information: Inference of co-
occurring affective states from their expressions in speech. IEEE Trans. Pattern Anal. Mach.
Intell. 32(7), 1284–1297 (2010)
50. Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., Demirbas, M.: Short text classification
in twitter to improve information filtering. In: Proceedings of 33rd international ACM SIGIR
conference on Research and development in information retrieval, pp. 841–842. ACM (2010)
51. Sun, L., Ji, S., Ye, J.: Multi-label dimensionality reduction. CRC Press (2013)
52. Tenenboim-Chekina, L., Rokach, L., Shapira, B.: Identification of label dependences for multi-
label classification. In: Proceedings of 2nd International Workshop on Learning from Multi-
Label Data, MLD’10, pp. 53–60 (2010)
53. Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. In: Data Mining and Knowl-
edge Discovery Handbook, pp. 667–685. Springer (2010)
54. Tsoumakas, G., Xioufis, E.S., Vilcek, J., Vlahavas, I.: MULAN multi-label dataset repository.
http://mulan.sourceforge.net/datasets.html
55. Tsoumakas, G., Xioufis, E.S., Vilcek, J., Vlahavas, I.: MULAN: a java library for multi-label
learning. J. Mach. Learn. Res. 12, 2411–2414 (2011)
56. Turnbull, D., Barrington, L., Torres, D., Lanckriet, G.: Semantic annotation and retrieval of
music and sound effects. IEEE Trans. Audio Speech Lang. Process. 16(2), 467–476 (2008)
57. Vembu, S., Gärtner, T.: Label ranking algorithms: a survey. In: Preference learning, pp. 45–64.
Springer (2011)
58. Wieczorkowska, A., Synak, P., Raś, Z.: Multi-label classification of emotions in music. In:
Intelligent Information Processing and Web Mining. AISC, vol. 35, Chap. 30, pp. 307–315
(2006)
59. Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE Trans. Evol.
Comput. 1(1), 67–82 (1997)
60. Zhang, M., Zhang, K.: Multi-label learning by exploiting label dependency. In: Proceedings of
16th International Conference on Knowledge Discovery and Data Mining, ACM SIGKDD’10,
pp. 999–1008 (2010)
Chapter 3
Case Studies and Metrics
3.1 Overview
The main application fields of MLC were introduced in the previous chapter from
a global perspective. The goal in this chapter is to delve into each one of these
fields, enumerating the publicly available MLDs and stating where they
come from. In addition to this basic reference information, it would be interesting
to get some general characteristics for each MLD. For doing so, most of the charac-
terization metrics described in the literature are going to be introduced, along with
their formulations and a discussion about their usefulness. Several extensive tables
containing each measurement for every MLD will be provided.
In the following chapters, several dozen MLC algorithms will be described,
and some of them will be experimentally tested. Therefore, how to conduct such
an experiment, and the way the results can be assessed to evaluate the algorithms'
performance, are fundamental aspects. Once the available MLDs and their main traits
are known, a basic kNN-based MLC algorithm is introduced and run to process
some of these MLDs.
Multilabel predictive performance evaluation metrics have to deal with the pres-
ence of multiple outputs, taking into consideration the existence of predictions which
are partially correct or wrong. As will be expounded, these metrics can be grouped
into several categories according to distinct criteria. Then, most of the MLC eval-
uation metrics are explained along with their formulation, using them to assess the
results obtained from the previous experiments.
3.2 Case Studies
In the previous chapter, the main application fields for MLC were portrayed. Following
the grouping criterion established there, in this section most of the case studies
found in the specialized literature will be enumerated. Table 3.1 summarizes these
case studies, giving their original references and the place they can be downloaded
from.1
Some of these case studies have several associated MLDs, whose names and
characteristics will be analyzed later. The same MLD can be available in different
formats,2 for instance MULAN, MEKA, and KEEL, depending on the repository the
user refers to.
1 All datasets are available at RUMDR (R Ultimate Multilabel Dataset Repository) [10].
The following subsections cover each MLC application field. Within each category,
the case studies are enumerated alphabetically according to the name of the
MLD or set of MLDs belonging to them.
Categorizing text documents into one or more categories is a very common need; it is
the task at the root of MLC. This is the reason for the existence of many datasets
associated with this use case. The case studies mentioned below have been used in a
considerable portion of the multilabel literature. Some of them have more than one
associated MLD.
• 20ng: This dataset, known as 20 Newsgroups, has its origin in the task [28] of
filtering news group messages. The dataset has become a classical problem for
testing text clustering and text-labeling algorithms. It contains over a thousand
entries for each one of 20 different news groups, making a total of almost 20 000
data instances. Some of the news groups are closely related, so some messages
were cross-posted to more than one group. The input attributes, of which there are
more than a thousand, are the terms extracted from all the documents. For each instance,
those terms appearing in the message are set to 1, while the others are set to 0.
This representation is known as the boolean bag-of-words (BoW) model (a small
sketch of this representation is given after this list). There are
20 labels, corresponding to the news groups the messages have been taken from.
Only a handful of instances are assigned to more than one label.
• bibtex: Introduced in [26] as part of a tag recommendation task, it contains the
metadata for bibliographic entries. The words present in each paper's title,
author names, journal name, and publication date were taken as input attributes.
The full vocabulary consisted of 1 836 features. The data origin is Bibsonomy,3
a specialized social network where users can share bookmarks and BibTeX
entries, assigning labels to them. bibtex is the dataset generated from the data
contained in the BibTeX entries, being associated with a total of 159 different
labels. The boolean BoW model is used to represent the documents, so all features
are binary, indicating whether a certain term is relevant to the document or not.
3 http://www.bibsonomy.org.
• bookmarks: This MLD comes from the same source [26] as the previous
one. In this case, the data are obtained from the bookmarks shared by the users.
Specifically, the URL of the resource, its title, date, and description are included
in the dataset. The vocabulary consisted of 2 150 different terms, used as input
features. The tags assigned to the bookmarks by the users, a total of 208, are taken
as labels. The main difference between bibtex and bookmarks is the size of
the MLD, the latter having more than ten times as many instances as the former.
• delicious: The authors of this dataset [40] are the same as those of the previous one,
and its nature is also similar to bookmarks. This time the links to Web pages
were taken from the del.icio.us4 portal. The page content for a set of popular tags
was retrieved and parsed, and the resulting vocabulary was filtered to remove infrequent
words. As a result, an MLD with almost a thousand labels was generated.
The authors' goal was to propose an MLC method able to deal with such a large
number of labels.
4 https://delicious.com/.
• enron: The Enron corpus is a large set of email messages, with more than half a
million entries, from which a dataset for automatic folder assignment research was
generated [27]. The enron MLD is a subset of that dataset, with only 1 701
instances. Each instance has as input features a BoW obtained from the email's
fields, such as the subject and the body of the message. The labels correspond to
the folders in which each message was stored by the users. A total of 53 labels
are considered.
• EUR-Lex: This case study is made up of three MLDs. The primary source is the
European Union's database of legal documents, which includes laws, agreements,
and regulations. Each document is classified in accordance with three criteria:
EUROVOC descriptors, directory codes, and subject matters. For doing so, the
header of the document indicates which descriptors, codes, and matters are rele-
vant. Therefore, there are three multilabel tasks to accomplish. From this database,
the authors of [30] generated the eurlex-dc, eurlex-ev, and eurlex-sm
MLDs.5 Unlike in most cases, reduction techniques were not applied to
obtain a limited number of labels. As a result, the eurlex-ev MLD has almost
4 000 of them. The three datasets have the same instances with the same set of
5 000 input features. In the version used in [30], these contain a TF-IDF repre-
sentation instead of the boolean BoW used by the previous ones.
5 Additional information about how these MLDs were produced, including the software to do so,
can be found at http://www.ke.tu-darmstadt.de/resources/eurlex.
• imdb: The aim of this study [32] was to automatically classify movies into the
proper genres, e.g., drama, comedy, adventure, or musical, among others. A total
of 28 genres are considered. The input features were generated from the text
gathered from the IMDB6 database for each movie, relying on a boolean BoW
representation. These texts contained a summary of the movies' plot, with a vocab-
ulary made up of a thousand terms. Containing more than 120 000 instances, it is
one of the largest MLDs publicly available.
6 http://imdb.org.
• langlog: Introduced in [31], this MLD has been created from the posts published
in the Language Log Forum,7 a Web site for discussing language-related topics.
As with many other text MLDs, it follows the boolean BoW model, with a total
of 1 460 input features. The blog entries are categorized by 75 different labels.
7 http://languagelog.ldc.upenn.edu/nll/.
• medical: The documents processed to produce this MLD are anonymized clin-
ical texts, specifically the free text in which the patient's symptoms are described. A
portion of the total corpus described in [18] was used to generate the MLD, with the
text transformed into a BoW per document. The labels, a total of 45, are codes
from the International Classification of Diseases, specifically ICD-9-CM8 codes.
8 http://www.cdc.gov/nchs/icd/icd9cm.htm.
• ohsumed: The origin of this dataset [25] is the Medline database, a text corpus
from almost three hundred medical journals. The Ohsumed collection is a subset
of the Medline dataset compiled at the Oregon Health Science University. The
title and abstract texts of each article were processed and represented as a BoW,
producing a set of a thousand input features. Each document is linked to one or
more of the 23 main categories of the MeSH diseases ontology.9 These categories
are the labels appearing in the 13 929 instances that the MLD consists of.
9 https://www.nlm.nih.gov/mesh/indman/chapter_23.html.
• rcv1v2: This case study consists of five MLDs, each of them being a subset
of the original RCV1-v2 (Reuters Corpus Volume 1 version 2). The RCV1 text
corpus was generated from the full text of English news published by Reuters over
one year, from August 20, 1996, to August 19, 1997. Version 2 of this corpus is
a corrected version introduced in [29]. Each entry was classified according to
three sets of categories: topic codes, industry codes, and region codes. A total
of 101 different labels are considered. The vocabulary used as input features has
47 236 terms, represented as TF-IDF values. The full RCV1 corpus has 800 000
documents; 6 000 of them are provided in each one of the five subsets.
• reuters: Introduced in [31], it is also a subset of the RCV1 corpus. In this
case, a feature selection method has been applied, taking only 500 input attributes
instead of the more than 47 000 in rcv1v2. The goal was to work with more
representative features. At the same time, the reduced set of attributes improves
the speed of the learning process.
• slashdot: The source this MLD was generated from is Slashdot,10 a well-known
news portal mainly focused on technology and science. The MLD was generated
[32] taking the text from the news title and summary, producing a boolean BoW
for each entry. The vocabulary has 1 079 terms. The tags used to categorize these
entries, a total of 22, act as labels.
10 http://slashdot.org.
• stackexchange: The case study faced in [15] is a tag suggestion task for ques-
tions posted in specialized forums, specifically forums from the Stack Exchange
network.11 Six MLDs were generated from six different forums, devoted to topics
such as cooking, computer science, and chess. The title and body of each question
were text-mined, producing a frequency BoW. The tags assigned by the users to
their questions were used as labels. Each forum has its own vocabulary, made up
of between 540 and 1 763 terms; these served as input attributes. The labels are
forum-specific as well, their number ranging from 123 to 400.
11 http://stackexchange.com/.
• tmc2007: This dataset arose as a result of the SIAM Text Mining Workshop12
held in 2007 [37]. As in many other text datasets, a boolean BoW was chosen as the way of
representing the terms appearing in documents. The documents were aviation safety reports,
in which certain problems during flights were described. The vocabulary consists
of 49 060 different words, used as input features. Each report is tagged with one
or more categories from a set of 22. These are the labels in the MLD.
12 http://web.eecs.utk.edu/events/tmw07/.
• yahoo: The authors of [47] compiled for their study the Web pages referenced in
11 out of the 14 main categories of the classical Yahoo!13 Web index. Therefore,
11 MLDs are available for this case study. All of them use the boolean BoW
representation, with features obtained from the pages referenced in the index.
The number of words ranges from 21 000 to 52 000, depending on the MLD. The
subcategories that the pages belong to are used as labels. The number of labels is
in the range 21–40.
13 http://web.archive.org/web/19970517033654/http://www9.yahoo.com/.
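To make the BoW variants mentioned above concrete, the following base R sketch builds a term-frequency matrix for three toy documents and derives the boolean BoW and a TF-IDF weighting from it. The documents and terms are, of course, hypothetical; real MLDs are produced with full text-mining pipelines.

```r
# Toy sketch (base R only) of the document representations used by the
# datasets above: term frequencies, boolean BoW, and TF-IDF weighting.
docs  <- c("labels for music tracks", "music genre labels", "scene labels")
words <- strsplit(docs, " ")
terms <- unique(unlist(words))

# Term-frequency matrix: one row per document, one column per term
tf <- t(sapply(words, function(d) sapply(terms, function(t) sum(d == t))))

bow   <- (tf > 0) * 1                  # boolean BoW: term present or not
idf   <- log(nrow(tf) / colSums(bow))  # inverse document frequency
tfidf <- sweep(tf, 2, idf, "*")        # TF-IDF: weight frequent-but-rare terms
```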
Although text resources were the first to demand automated classification mech-
anisms, recently the need for labeling other kinds of data, such as images, sounds,
music, and video, has experienced huge growth. By contrast with the case stud-
ies enumerated in the previous section, in which a common BoW representation
(whether containing boolean values, frequencies, or TF-IDF values) is used, the
following ones resort to disparate representations.
• birds: This MLD emerges from the case study described in [7], where the prob-
lem of identifying multiple bird species from acoustic recordings is tackled. The
researchers used hundreds of sound snippets, recorded in nature at times of day
with high bird activity. Between 1 and 5 different species appear in each snippet.
The audio was processed with a 2D time-frequency segmentation approach, aim-
ing to separate syllables overlapping in time. As a result, a set of features with
the statistical profile of each segment is produced. Since a sound can be made up of
several segments, the produced dataset is a multi-instance multilabel dataset.
• cal500: Tagging music tracks with semantic concepts is the task faced in [44],
from which the cal500 MLD is derived. The researchers took five hundred
songs, each from a different singer, and defined a vocabulary aimed at describing aspects such
as the emotions produced by the song, the instruments and vocal qualities, and
the music genre. These concepts, a total of 174, are used as labels. Each music track
is assigned at least 3 of them, and the average is above 26, which is a quite high
number in the multilabel context. The input features were generated by sound
segmentation techniques. A distinctive trait of this MLD is that no two instances
are assigned the same combination of labels.
• corel: The original Corel dataset was used in two different case studies [5, 20] by
the same authors, from which several MLDs have been obtained. The Corel dataset
has thousands of images categorized into several groups. In addition, each picture
is assigned a set of words describing its content. These pictures were segmented by
the authors using the normalized cuts method, generating a set of blobs associated
with one or more words. The input features, 500 in total, are the vectors resulting
from the segmentation process. In [20] (corel5k), 5 000 instances were taken
and there are 374 labels, since a minimum number of occurrences was not established.
The later study in [5] (corel16k) used 138 111 instances grouped into 10
subsets. A minimum number of occurrences for each label was set, limiting their number to
153–174 depending on the subset.
• emotions: The origin of this dataset is the study conducted in [48], whose
goal was to automatically identify the emotions produced by different songs. A
hundred songs from each one of seven music styles were taken as input. The
authors used the software tool described in [46] to extract from each record a set
of rhythmic features and another set of timbre features. The union of these
sets, after a process of feature selection, is used as input attributes. The songs were
labeled by three experts, using the six main emotions of the Tellegen-Watson-Clark
abstract emotional model. Only those songs where the assigned labels coincided
were retained, reducing the number of instances from the original 700 to 593.
• flags: This MLD is considered a toy dataset, since it only has 194 instances
with a set of 19 input features and 7 labels. The original version can be found
in the UCI repository.14 In [24], several of its attributes, the ones indicating which
colors appear in the flag or whether it contains a certain image or text, were defined as
labels. The remaining attributes, including the zone and land mass the country
belongs to, its area, religion, population, etc., are established as input features.
• mediamill: It was introduced in [35] as a challenge for video indexing. The data
consist of a collection of video sequences, taken from the TREC Video Retrieval
Evaluation,15 from which a set of 120 features has been extracted. This set of fea-
tures is the concatenation of several similarity histograms extracted from the pixels
of each frame. The goal was to discover which semantic concepts are associated
with each entry, among a set of 101 different labels. Some of these concepts refer
to environments, such as road, mountain, sky, or urban, others to physical objects,
such as flag, tree, and aircraft. A visual representation of these 101 concepts can
be found in [35].
14 https://archive.ics.uci.edu/ml/datasets/Flags.
15 http://www-nlpir.nist.gov/projects/trecvid/.
• nus-wide: The famous Flickr16 social network, in which millions of users pub-
lish their photographs every day, is the origin of the NUS-WIDE dataset, created
by NUS's Lab for Media Search. Each image was processed, extracting color his-
tograms, correlation histograms, edge direction, textures, etc. The resulting MLD
has 269 648 instances, and two versions of the MLD with different feature repre-
sentations are available. The first one, known as nus-wide-BoW, used clustering
to produce a 500-dimensional vector of visual words (real values). In the second one,
named nus-wide-VLAD, the vectors have 128 dimensions and are encoded as
cVLAD+ features [36] (real values). In both, each instance has an initial attribute
containing the name of the file where the image is stored. Each image was
manually annotated using an 81-item vocabulary, with terms such as animal, house,
statue, and garden. These are the labels of the MLD.
16 https://www.flickr.com/.
• scene: This MLD is also related to image labeling, specifically to scene classifi-
cation. The set of pictures was taken from the Corel dataset, and some personal
pictures of the authors [6] were also included. The MLD is made up of 400 pictures for each
main concept: beach, sunset, field, fall foliage, mountain, and urban. Therefore,
six non-exclusive labels are considered. The images are transformed to the CIE
Luv color space, known for being perceptually uniform, and later segmented into
49 blocks, computing for each one of them values such as the mean and variance.
The result is a vector of 294 real-valued features in each instance.
3.2.3 Genetics/Biology
This is the area with the fewest publicly available datasets, which is not surprising due to
its complexity. There are two MLDs, one focused on predicting the class of proteins
and another one devoted to classifying genes according to their functional expression.
• genbase: The authors of [19] produced this MLD by compiling information for 662
different proteins. The Prosite access number17 was used to identify the 1 185 motif
patterns and profiles used as input features. All of them are nominal, taking only
the YES or NO values. This way, the motifs and profiles present in each protein are
indicated. 27 different protein classes are considered, each protein being associated
with one or more of them. The PDOC protein class identifiers are used as label
names. Something to be taken into account while using this MLD is the presence
of one additional feature, the first one, which uniquely identifies each instance.
17 http://prosite.expasy.org/prosite.html.
• yeast: In this case [21], the goal was to predict the functional expression for a
set of genes. The input features for each gene come from microarray expression
data, with a vector of 103 real values per instance. A subset of 14 functional classes,
whose origin is the Comprehensive Yeast Genome Database,18 was selected and
used as labels. Since each gene can express more than one function at once (in fact,
this is the usual situation), the result is a dataset of a multilabel nature.
18 http://www.ncbi.nlm.nih.gov/pubmed/15608217.
3.2.4 Synthetic MLDs

Even though there is a quite large collection of publicly available MLDs, in some
situations it can be desirable to work with datasets having certain characteristics.
For instance, if we were designing an algorithm to deal with noisy data, it would be
interesting to test it with MLDs having different noise levels. This is a trait that could
be modeled by generating custom synthetic datasets.
Despite the aforementioned need, which has been voiced by several authors,
there is a lack of tools to produce synthetic MLDs when compared
with utilities with the same aim for traditional classification. In most cases, internal
programs are used to generate these artificial datasets, and only the characteristics of
the data are explained. Fortunately, there are some exceptions, such as the Mldatagen
program19 described in [38].
Since they are created by a program, there is no a priori limit on the number of MLDs that
can be generated. They can hold any number of instances, attributes, and
labels but, unlike those enumerated in the previous sections, they do not represent any
real situation.
19 http://sites.labic.icmc.usp.br/mldatagen/.
3.3 MLD Characteristics
The main difference between traditional and multilabel classification comes from
the fact that in the latter each instance is associated with a set of labels. This is the
reason behind the first specific metrics designed for MLDs, whose purpose is to
assess the multilabelness of the data, in other words, to determine the extent to which
the samples in the dataset have more than one label.
An obvious way to calculate such a measure consists of counting the number of
labels relevant to every instance in the dataset, then averaging the sum to obtain the
mean number of labels per instance. This simple metric was introduced in [39] as
label cardinality, or simply Card (3.1).
Card(D) = \frac{1}{n} \sum_{i=1}^{n} |Y_i| \qquad (3.1)
In this context, n denotes the number of instances in the MLD D, Yi the labelset of
the ith instance, and k the total number of labels considered in D. The higher the
Card level, the larger the number of active labels per instance. As a consequence,
MLDs with low Card values, near 1.0, would denote that most of their samples have only
one relevant label; such a dataset would have little multilabel nature.
On the opposite side, high Card values state that the data are truly multilabeled. As
a general rule, high Card values are linked to MLDs which have large sets of labels,
yet the contrary is not always true.
Since Card is a metric influenced by the size of the set of labels used by each MLD,
and it is expressed using the number of labels as measurement unit, a normalized
version (3.2) was also proposed. By dividing Card by the number of labels in the
MLD, a dimensionless metric, known as label density (Dens), is obtained. Usually,
a high Dens value indicates that the labels in the MLD are well represented in each
instance. By contrast, low Dens values denote more dispersion, with only a small
subset of the labels present in most instances.
Dens(D) = \frac{1}{k} \cdot \frac{1}{n} \sum_{i=1}^{n} |Y_i| \qquad (3.2)
A related metric, Pmin (3.3), gives the proportion of instances in D associated with
a single label:

P_{min}(D) = \frac{\left|\{\, Y_i : |Y_i| = 1 \,\}\right|}{n} \qquad (3.3)
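The following base R sketch, over a small hypothetical binary label matrix Y (one row per instance, one column per label), shows how directly these three measures can be computed.

```r
# Minimal sketch: Card (3.1), Dens (3.2) and Pmin (3.3) from an n x k
# binary label matrix 'Y' (hypothetical toy values).
Y <- matrix(c(1, 0, 1,
              0, 1, 0,
              1, 1, 1,
              0, 0, 1), nrow = 4, byrow = TRUE)

n <- nrow(Y); k <- ncol(Y)
card <- mean(rowSums(Y))        # mean number of labels per instance
dens <- card / k                # Card normalized by the number of labels
pmin <- mean(rowSums(Y) == 1)   # proportion of single-label instances
```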
Subsets of the labels in the set L appear in the instances of D, forming labelsets.
Theoretically, 2^k different labelsets could exist, but in practice the number of unique
(distinct) labelsets is limited by the number of instances in D. Thus, the number of
unique combinations is bounded by the expression min(n, 2^k). The effective number of
distinct labelsets in an MLD is an indicator of the uniformity of the label distribution
among the samples; the higher the number, the more irregularly the labels appear
in the data instances. The number of distinct labelsets is also known as label diversity
(Div), and it can also be normalized by dividing it by the number of instances.
Furthermore, the frequency with which each labelset appears in the MLD may also be
interesting information. Even if the total number of distinct labelsets is not
high, when many of them appear only once, associated with a single instance, this can
lead to some difficulties during the learning process. In addition, the analysis of the
labelsets provides information related to dependencies among labels.
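Continuing the previous sketch, Div and the labelset frequencies can be obtained by encoding each row of Y as a string.

```r
# Counting distinct labelsets (Div) and their frequencies over 'Y'.
labelsets <- apply(Y, 1, paste, collapse = "")
div <- length(unique(labelsets))   # label diversity: distinct labelsets
table(labelsets)                   # how often each labelset appears
```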
Besides the previous ones, in [11] other standard statistical metrics, such as the
coefficient of variation, kurtosis, and skewness, are used to characterize how
the labels in an MLD are distributed. The joint use of all these metrics can help
in gaining insight into this problem.
The presence of class imbalance in a dataset is a challenge for most learning algo-
rithms. This problem will be analyzed in Chap. 8 in the context of MLC. As will be
seen, most MLDs suffer from label imbalance. This means that some labels are much
more frequent than others, an aspect interesting to appraise due to its
impact on classification results. Three different metrics to assess label imbalance are
proposed in [14], named IRLbl (3.4), MaxIR (3.5), and MeanIR (3.6). In (3.4), the
operator [[expr]] denotes the Iverson bracket: it returns 1 if the expression inside
is true and 0 otherwise.
IRLbl(l) = \frac{\max_{l' \in L} \left( \sum_{i=1}^{n} [[l' \in Y_i]] \right)}{\sum_{i=1}^{n} [[l \in Y_i]]} \qquad (3.4)
With the IRLbl metric, it is possible to know the imbalance level of one specific
label. It is computed as the proportion between the number of appearances of the
most common label and that of the considered label. Therefore, for the most common label
IRLbl = 1, and for less frequent labels the level will always be greater than 1. The
higher the IRLbl, the rarer the label's presence in the MLD. The goal of the MaxIR
metric (3.5) is to obtain the maximum imbalance ratio, in other words, the proportion
between the most common label and the rarest one:

MaxIR = \max_{l \in L} \left( IRLbl(l) \right) \qquad (3.5)
MeanIR = \frac{1}{k} \sum_{l \in L} IRLbl(l) \qquad (3.6)
Usually, a global assessment of the imbalance in the MLD is desired. The MeanIR
metric provides it, calculated by averaging the IRLbl of all labels. Despite the use-
fulness of this metric by itself, some dispersion measure, such as the standard deviation
or the coefficient of variation, should also be included. A high MeanIR could be due to a
relatively high IRLbl for several labels, but also to extreme imbalance levels
in just a few of them. In this context, the CVIR (3.7) metric provides the additional
information needed to know the cause.
CVIR = \frac{IRLbl_{\sigma}}{MeanIR}, \qquad IRLbl_{\sigma} = \sqrt{\frac{1}{k-1} \sum_{l \in L} \left( IRLbl(l) - MeanIR \right)^2} \qquad (3.7)
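Over the toy matrix Y used earlier, these four imbalance metrics reduce to a few lines of base R (note that sd() already uses the k - 1 divisor of (3.7)).

```r
# Sketch of the imbalance metrics (3.4)-(3.7) over the binary matrix 'Y'.
counts <- colSums(Y)              # number of appearances of each label
irlbl  <- max(counts) / counts    # IRLbl: 1 for the most common label
maxir  <- max(irlbl)              # MaxIR: imbalance of the rarest label
meanir <- mean(irlbl)             # MeanIR: average imbalance ratio
cvir   <- sd(irlbl) / meanir      # CVIR: dispersion of the IRLbl values
```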
Besides those already mentioned, some other characterization metrics have been
proposed in the literature to assess specific qualities of MLDs. In [13], the SCUM-
BLE metric is introduced as a way to measure the concurrence between very frequent
and rare labels. A score is individually computed for each instance (3.8). This score
is based on the Atkinson index [4] and the IRLbl metric introduced in the previous
section. The former is an econometric measure aimed at evaluating income inequality
among a population. In this context, monetary quantities have been replaced by
imbalance ratios, provided by the IRLbl metric. The result is a value in the [0, 1]
range: low values indicate that all the labels in the instance have similar frequencies
in the MLD, whereas higher values denote significant differences among them.
The global SCUMBLE measure (3.9) is obtained by averaging the
score over all instances in the MLD. How these metrics have been the foundation for
developing new MLC algorithms will be explained in Chap. 8. As a general rule,
higher SCUMBLE values denote MLDs which are harder to learn from.
SCUMBLE_{ins}(i) = 1 - \frac{1}{\overline{IRLbl}_i} \left( \prod_{l \in L} IRLbl_{il} \right)^{1/k} \qquad (3.8)

where \overline{IRLbl}_i denotes the average imbalance level of the labels active in the ith instance.
SCUMBLE(D) = \frac{1}{n} \sum_{i=1}^{n} SCUMBLE_{ins}(i) \qquad (3.9)
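Under the Atkinson-index reading of (3.8), the per-instance score is one minus the ratio between the geometric and arithmetic means of the IRLbl values of the labels active in that instance. The base R sketch below follows this reading (an assumption on our part) and reuses irlbl from the previous snippet.

```r
# Sketch of SCUMBLE (3.8)-(3.9): geometric vs. arithmetic mean of the
# IRLbl values of the active labels (assumes >= 1 active label per row).
scumble_ins <- apply(Y, 1, function(yi) {
  ir <- irlbl[yi == 1]                       # IRLbl of the active labels
  1 - prod(ir)^(1 / length(ir)) / mean(ir)   # 0 when all levels are equal
})
scumble <- mean(scumble_ins)                 # dataset-level SCUMBLE (3.9)
```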
The TCS (3.10) metric is presented in [16] with the aim of providing a theoretical com-
plexity indicator. It is calculated as the product of the number of input features (f),
the number of labels (k), and the number of distinct label combinations (ls). To avoid
working with very large values, whose interpretation and comparison would not be easy,
the log function is used to adjust the scale of this product:

TCS(D) = \log(f \times k \times ls) \qquad (3.10)

The goal is to determine which MLDs would present harder work to the preprocessing and
learning algorithms. Unlike SCUMBLE, TCS values are not upper bounded. The higher
the value, the more complex the MLD is to process.
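For the toy matrix used so far, TCS would be computed as follows; the feature count f is hypothetical, since Y only holds the labels.

```r
# Sketch of TCS (3.10): 'f' is a hypothetical number of input features;
# 'k' and 'div' were computed in the earlier snippets.
f   <- 10
tcs <- log(f * k * div)   # logarithmic: one unit = one order of magnitude
```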
Once the main characterization metrics have been defined, they can be used to analyze
the MLDs corresponding to the case studies enumerated in Sect. 3.2. Tables 3.2,
3.3, and 3.4 summarize most of these metrics for the MLDs corresponding to case
studies from the text, multimedia, and genetics fields, respectively. The columns
show, from left to right: Dataset, the name of the MLD; n, the number of instances;
f, the number of input attributes; k, the number of labels; LSet, the number of distinct
labelsets; Card, the label cardinality; Dens, the label density; MeanIR, the mean
imbalance ratio; and SCUMBLE, the imbalanced labels concurrence level.
The MLDs from text case studies clearly share a common trait, as almost all of
them have a high number of input features, in the range of thousands with
few exceptions. This is due to the techniques used to mine the text, which produce
large collections of words and their frequencies. Many of them also have several
hundreds of labels. This, when combined with a large number of instances, also
produces a huge number of labelsets, as is the case with MLDs such as bookmarks
or delicious. Comparatively, the number of features, labels, and labelsets is lower
in the datasets coming from multimedia and genetics case studies.
Regarding the Card metric, which indicates the mean number of labels per instance,
most MLDs are in the [1, 5] interval. Some MLDs, such as 20ng, langlog,
slashdot, yahoo-reference, birds, and scene, are only slightly above
1, meaning that most of their instances are associated with only one label. These would
be the least representative cases of what a multilabel scenario should be, since they
are closer to a multiclass one. There are a pair of extreme cases on the opposite side:
the Card values for delicious and cal500 are above 19 and 26, respectively.
These MLDs are truly multilabel, with a remarkable average number of active labels
in each instance. Halfway between these extremes, the remaining
MLDs present the most typical Card values, between 2 and 5 labels per instance
on average.
Dens is a metric closely related to Card. In general, most MLDs have Dens values
below 0.1. Only those with a very limited set of labels, such as emotions, flags,
or scene, or a very high Card, such as cal500, get a high label density. Therefore,
Table 3.2 Main characteristics of MLDs from text classification case studies
Dataset n f k LSet Card Dens MeanIR SCUMBLE
20ng 19 300 1 006 20 55 1.029 0.051 1.007 0.000
bibtex 7 395 1 836 159 2 856 2.402 0.015 12.498 0.094
bookmarks 87 856 2 150 208 18 716 2.028 0.010 12.308 0.060
delicious 16 105 500 983 15 806 19.017 0.019 71.052 0.532
enron 1 702 1 001 53 753 3.378 0.064 73.953 0.303
eurlex-dc 19 348 5 000 412 1 615 1.292 0.003 268.930 0.048
eurlex-ev 19 348 5 000 3 993 16 467 5.310 0.001 396.636 0.420
eurlex-sm 19 348 5 000 201 2 504 2.213 0.011 536.976 0.182
imdb 120 919 1 001 28 4 503 2.000 0.071 25.124 0.108
langlog 1 460 1 004 75 304 1.180 0.016 39.267 0.051
medical 978 1 449 45 94 1.245 0.028 89.501 0.047
ohsumed 13 929 1 002 23 1 147 1.663 0.072 7.869 0.069
rcv1subset1 6 000 47 236 101 1 028 2.880 0.029 54.492 0.224
rcv1subset2 6 000 47 236 101 954 2.634 0.026 45.514 0.209
rcv1subset3 6 000 47 236 101 939 2.614 0.026 68.333 0.208
rcv1subset4 6 000 47 229 101 816 2.484 0.025 89.371 0.216
rcv1subset5 6 000 47 235 101 946 2.642 0.026 69.682 0.238
reuters 6 000 500 103 811 1.462 0.014 51.980 0.052
slashdot 3 782 1 079 22 156 1.181 0.054 17.693 0.013
stackex-chemistry 6 961 540 175 3 032 2.109 0.012 56.878 0.187
stackex-chess 1 675 585 227 1 078 2.411 0.011 85.790 0.262
stackex-coffee 225 1 763 123 174 1.987 0.016 27.241 0.169
stackex-cooking 10 491 577 400 6 386 2.225 0.006 37.858 0.193
stackex-cs 9 270 635 274 4 749 2.556 0.009 85.002 0.272
stackex-philosophy 3 971 842 233 2 249 2.272 0.010 68.753 0.233
tmc2007 28 596 49 060 22 1 341 2.158 0.098 15.157 0.175
tmc2007-500 28 596 500 22 1 172 2.220 0.101 17.134 0.193
yahoo-arts 74 840 23 146 26 599 1.654 0.064 94.738 0.059
yahoo-business 11 214 21 924 30 233 1.599 0.053 880.178 0.125
yahoo-computers 12 444 34 096 33 428 1.507 0.046 176.695 0.097
yahoo-education 12 030 27 534 33 511 1.463 0.044 168.114 0.042
yahoo-entertainment 12 730 32 001 21 337 1.414 0.067 64.417 0.039
yahoo-health 9 205 30 605 32 335 1.644 0.051 653.531 0.092
yahoo-recreation 12 828 30 324 22 530 1.429 0.065 12.203 0.030
yahoo-reference 8 027 39 679 33 275 1.174 0.036 461.863 0.049
yahoo-science 6 428 37 187 40 457 1.450 0.036 52.632 0.058
yahoo-social 12 111 52 350 39 361 1.279 0.033 257.704 0.049
yahoo-society 14 512 31 802 27 1 054 1.670 0.062 302.068 0.096
Table 3.3 Main characteristics of MLDs from multimedia resources classification case studies
Dataset n f k LSet Card Dens MeanIR SCUMBLE
birds 645 260 19 133 1.014 0.053 5.407 0.033
cal500 502 68 174 502 26.044 0.150 20.578 0.337
corel5k 5 000 499 374 3 175 3.522 0.009 189.568 0.394
corel16k001 13 766 500 153 4 803 2.859 0.019 34.155 0.273
corel16k002 13 761 500 164 4 868 2.882 0.018 37.678 0.288
corel16k003 13 760 500 154 4 812 2.829 0.018 37.058 0.285
corel16k004 13 837 500 162 4 860 2.842 0.018 35.899 0.277
corel16k005 13 847 500 160 5 034 2.858 0.018 34.936 0.285
corel16k006 13 859 500 162 5 009 2.885 0.018 33.398 0.290
corel16k007 13 915 500 174 5 158 2.886 0.017 37.715 0.282
corel16k008 13 864 500 168 4 956 2.883 0.017 36.200 0.289
corel16k009 13 884 500 173 5 175 2.930 0.017 36.446 0.298
corel16k010 13 618 500 144 4 692 2.815 0.020 32.998 0.279
emotions 593 72 6 27 1.868 0.311 1.478 0.011
flags 194 19 7 54 3.392 0.485 2.255 0.061
mediamill 43 907 120 101 6 555 4.376 0.043 256.405 0.355
nus-wide-BoW 269 648 501 81 18 430 1.869 0.023 95.119 0.171
nus-wide-VLAD 269 648 129 81 18 430 1.869 0.023 95.119 0.171
scene 2 407 294 6 15 1.074 0.179 1.254 0.000
Table 3.4 Main characteristics of MLDs from genetics/proteomics classification case studies
Dataset n f k LSet Card Dens MeanIR SCUMBLE
genbase 662 1 186 27 32 1.252 0.046 37.315 0.029
yeast 2 417 103 14 198 4.237 0.303 7.197 0.104
this value is useful to know how sparse the labelsets in the MLD are. Higher Dens
values denote labelsets with more active labels than lower ones.
As can be seen by glancing at the column with the MeanIR values, most MLDs
show noteworthy imbalance levels. The mean proportion between label frequencies
is higher than 1:100 in many cases, with some drastic cases such as
eurlex-sm, yahoo-health, or yahoo-business, whose MeanIR is above
500. There is only a handful of MLDs that could be considered balanced, including
20ng, emotions, flags, and scene. How these remarkably high imbalance levels
can influence the learning methods, and how this difficulty has been faced in the
literature, will be the main topics in Chap. 8.
The right-most column in these three tables shows the SCUMBLE value for each
MLD. Attending to what was stated in [13], values well above 0.1 in this metric
designate MLDs in which a significant proportion of rare labels jointly appear with
very frequent ones in the same instances. As can be seen, this is the case for many
of the MLDs shown in the previous tables. Some of them, such as the MLDs coming
from the Corel image database, enron, and delicious, stand out with SCUMBLE
values as high as 0.532. This means that those MLDs would be especially hard for
preprocessing and learning algorithms.
A metric which does not appear in the previous tables is TCS. Since it provides a
score of the theoretical complexity of the MLDs, it is more useful to look at it after
sorting the MLDs by the TCS column, instead of alphabetically. The result is shown
in Table 3.5. Along with the mentioned score, the number of features, labels, and labelsets
is also presented. From this table, it is easy to deduce that some of the MLDs
previously described as toy datasets present the lowest theoretical complexity, with
TCS values around 10. Unsurprisingly, the text MLDs appear as the most complex
ones, due to their large sets of features and labels. Remember that TCS values are
logarithmic, so a difference of only one unit implies one order of magnitude lower
or higher.
Obviously, the MLDs could also be ordered by their Card, MeanIR, SCUMBLE,
or any other metric values, depending on which traits of the data are of interest.
It is easy to do so using the tools described in Chap. 9.
3.4 Multilabel Classification by Example
At this point, the source, nature, and main characteristics of a large set of MLDs
have already been introduced. The characterization metrics have been applied over
the MLDs, obtaining the measures shown in the previous tables. Before going into the
study of the evaluation metrics, whose goal is to assess the predictive performance
of a classifier, some predictions are needed; this way, we can get a glimpse
of the values returned by these metrics. For this reason, this section is devoted to
demonstrating how to conduct an example multilabel classification job.
Even though the description of MLC algorithms is the main topic of later
chapters, in the following subsection a specific algorithm is introduced in order
to complete the task. The outputs provided by this algorithm are then evaluated by
means of different multilabel evaluation metrics.
One of the simplest approaches to classification is that of kNN. Once a new data
sample is given, a kNN classifier looks for its k-nearest neighbors. For doing so, the
distance (in some f-dimensional space) between the features of the new sample and all
instances in the dataset is computed. Once the closest instances have been gathered,
their classes are used to predict the one for the new sample. Since kNN does not
create any model (the classifier only does some work when a new sample arrives), it
is usually known as a lazy [1] method. It is also frequently referred to as instance-based
learning [2].
ML-kNN [49] is an adaptation of the kNN method to the multilabel scenario.
Unlike the classic kNN algorithm, ML-kNN is not so lazy. It starts by building a
limited model that consists of two pieces of information:
• The a priori probabilities for each label. These are simply the number of times each
label appears in the MLD divided by the total number of instances. A smoothing
factor is applied to avoid multiplying by zero.
• The conditional probabilities for each label, computed as the proportion of
instances with the considered label whose k-nearest neighbors also have the same
label.
These probabilities are independently computed for each label, facing the task
as a collection of individual binary problems. Therefore, the potential dependencies
among labels are fully dismissed by this algorithm.
After this limited training process, the classifier is able to predict the labels for
new instances. When a new sample arrives, it goes through the following steps (a
simplified sketch is given after the list):
• First, the k-nearest neighbors of the given sample are obtained. By default, the
L2-norm (Euclidean distance) is used to measure the similarity between the
reference instance and the samples in the MLD.
• Then, the presence of each label in the neighbors is used as evidence to compute
maximum a posteriori (MAP) probabilities from the conditional ones obtained
before.
• Lastly, the labelset of the new sample is generated from the MAP probabilities.
The probability itself is provided as a confidence level for each label, thus making
it possible to also generate a label ranking.
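The following base R sketch reproduces only the lazy, distance-based part of these steps with a plain per-label majority vote. It is a simplification, not the full ML-kNN MAP estimation, and the X/Y matrices and the knn_predict function are hypothetical names introduced here for illustration.

```r
# Simplified multilabel kNN sketch (NOT full ML-kNN): per-label vote
# among the k-nearest neighbors. 'X' is n x f (features), 'Y' is n x k
# (binary labels), 'x_new' is a feature vector of length f.
knn_predict <- function(X, Y, x_new, k = 3) {
  d    <- sqrt(colSums((t(X) - x_new)^2))   # Euclidean distance to all samples
  nn   <- order(d)[1:k]                     # indices of the k closest instances
  conf <- colMeans(Y[nn, , drop = FALSE])   # per-label confidence in [0, 1]
  as.integer(conf > 0.5)                    # predicted bipartition of labels
}
```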
The reference MATLAB implementation of the ML-kNN algorithm is supplied
by the author at his own Web site.20 A Java implementation is also available in
MULAN; the latter has been used to conduct the experimentation described
below.
Five MLDs have been chosen to run the ML-kNN algorithm. Two of them are
from the text domain (enron and stackex-cs), two more from the multime-
dia field (emotions and scene), and the last one comes from the biology domain
(genbase). Attending to their TCS measure, emotions and scene, ranked at
positions 2 and 3 in Table 3.5, would be the easiest cases. A little harder would be
genbase (6th), followed by enron (15th) and finally stackex-cs (45th) which,
theoretically, would be the most difficult MLD in this collection.
The MLDs were partitioned following a 2 × 5 strategy. This means that there are
two repetitions with 5 folds, and that for each run 80 % (4/5) of instances are used
for training and 20 % (1/5) for testing. Therefore, a total of 10 runs are made for each
MLD. Random sampling was used to select the instances in each fold. The full set
of folds for the aforementioned five MLDs is available in the book repository [12].
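A single repetition of this partitioning scheme can be sketched in base R as follows; n stands for the number of instances in the MLD and the variable names are illustrative only.

```r
# Sketch of one repetition of the 2x5 partitioning with random sampling.
set.seed(1)                                 # reproducible fold assignment
folds <- sample(rep(1:5, length.out = n))   # one fold id per instance
test  <- which(folds == 1)                  # 20 % test indices for fold 1
train <- which(folds != 1)                  # remaining 80 % for training
```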
From each run, a set of predictions is obtained from the classifier. These can
be assessed using many performance evaluation metrics (they will be described
in the next section), getting a set of values for each metric/fold. These values are
then averaged, obtaining the mean indicators which are usually reported in most
papers, sometimes along with their deviations. Table 3.6 shows all these values,
whose interpretation will be provided as the evaluation metrics are described.
20 http://cse.seu.edu.cn/people/zhangml/Resources.htm#codes.
Table 3.6 Classification results produced by ML-kNN assessed with several evaluation metrics
stackex-cs emotions enron genbase scene
Accuracy ↑ 0.0540 0.5391 0.3156 0.9440 0.6667
AvgPrecision ↑ 0.3009 0.7990 0.6280 0.9860 0.8648
Coverage ↓ 77.9260 1.7715 13.2092 0.6110 0.4797
F-measure ↑ 0.5900 0.7776 0.5898 0.9776 0.9593
HammingLoss ↓ 0.0091 0.1940 0.0524 0.0048 0.0869
MacroF-measure ↑ 0.1999 0.6225 0.4284 0.9357 0.7378
MacroPrecision ↑ 0.5866 0.7279 0.5568 0.9795 0.8149
MacroRecall ↑ 0.0169 0.5981 0.0808 0.6787 0.6808
MicroAUC ↑ 0.8481 0.8565 0.9002 0.9893 0.9405
MicroF-measure ↑ 0.1065 0.6652 0.4715 0.9458 0.7331
MicroPrecision ↑ 0.6289 0.7217 0.6613 0.9934 0.8137
MicroRecall ↑ 0.0583 0.6186 0.3671 0.9031 0.6673
OneError ↓ 0.6571 0.2799 0.3070 0.0129 0.2269
Precision ↑ 0.6157 0.7182 0.6616 0.9956 0.8252
RLoss ↓ 0.1522 0.1608 0.0929 0.0072 0.0786
Recall ↑ 0.0582 0.6184 0.3654 0.9454 0.6836
SubsetAccuracy ↑ 0.0165 0.2968 0.0564 0.9132 0.6243
The arrow at the right of each metric name indicates whether lower values are better
(↓) or the opposite (↑).
Disparate plot designs can be used to graphically represent those final values,
bar plots and line plots being among the most frequent ones. When the interest is in
comparing a group of cases, on this occasion how the classifier has performed with
each MLD according to several metrics, a radar chart (also known as a spider
plot) can be useful. In Figs. 3.1 and 3.2, this type of representation has been used to
show the results produced by ML-kNN. Each vertex corresponds to a metric.21 The
points belonging to an MLD are connected so that a polygon is generated. The larger
the area of the polygon, the better the result with a certain MLD.
Through the observation of these two plots, even though the details of each metric are
not yet known, the following facts can be deduced:
• The performance with emotions and scene, which share a very similar TCS
value, is very much alike.
• The results for the previous two MLDs are clearly better than for enron, which
has a higher TCS score.
• The worst results are in general attributable to stackex-cs, the most complex
MLD according to the TCS metric.
21 The values of metrics such as HammingLoss, OneError, and RankingLoss have been comple-
mented as the difference with respect to 1, aiming to preserve the principle of assigning a larger
area to better values.
• The genbase results are not in line with the previous appraisals, since it obtains the
best results in all measures. This could be due to the existence of an attribute,
named protein, containing a code that uniquely identifies each protein in the
MLD. This feature would allow the classifier to easily locate the closest instances,
producing a prediction that would not be so precise without that knowledge.
Overall, the previous plots seem to denote that the easier the MLD, the better
the classifier performs. This rule, as can be seen with the exception of genbase, can
be broken depending on the MLD's specificities. Moreover, these results correspond
to one classifier only, so they must be taken cautiously.
In order to complete the judgment of these results, it is essential to gain an
understanding of each individual evaluation metric. In the last section of this chapter,
the details regarding how the performance of a classifier can be assessed are provided,
including additional discussion related to the values in Table 3.6.
3.5 Assessing Classifiers Performance
The output of any multilabel classifier consists of the labelset predicted for each test
instance. When working in the traditional scenario, with only one class as output,
the prediction can only be correct or wrong. A multilabel prediction, by contrast,
can be fully correct, partially correct/wrong (at different degrees), or totally wrong.
Applying the same metrics used in traditional classification is possible, but usually
excessively strict. This is the reason for using specific evaluation metrics, able
to take into consideration the cases between the two extremes.
Currently, more than twenty distinct performance metrics have been defined in the
literature, some of them quite specific, aimed at hierarchical multilabel classifi-
cation. All multilabel evaluation metrics can be grouped according to two criteria:
• How the prediction is computed: A measurement can be made by instance or by
label, giving as a result two different groups of metrics:
– Example-based metrics: These metrics [22, 23, 34] are calculated separately
for each instance and then averaged over the number of samples.
– Label-based metrics: In contrast to the previous group, the label-based metrics
[42] are computed independently for each label before they are averaged. For
doing so, two different strategies [41] can be applied (a small sketch contrasting
them is given after this list):
Macro-averaging: The metric is calculated individually for each label and
the result is averaged dividing by the number of labels (k).
Micro-averaging: The counters of hits and misses for each label are firstly
aggregated, and then the metric is computed only once.
• How the result is provided: The output produced by a multilabel classifier can
be a binary bipartition of labels or a label ranking. Some classifiers provide both
results.
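The difference between the two averaging strategies is easy to see on a label-based metric such as precision; the per-label counts below are hypothetical.

```r
# Macro- vs. micro-averaging of precision from per-label counts (k = 2).
tp <- c(8, 1)                            # true positives per label
fp <- c(2, 0)                            # false positives per label
macro_prec <- mean(tp / (tp + fp))       # average the per-label values: 0.9
micro_prec <- sum(tp) / sum(tp + fp)     # aggregate the counts first: 9/11
# Rare labels weigh as much as frequent ones in the macro-averaged value.
```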
3.5.1 Example-Based Metrics

These are the performance metrics which are firstly evaluated for each instance and
then averaged over the number of instances considered. Therefore, the same
weight is assigned to every instance in the final score, whether it contains frequent
or rare labels.
3.5.1.1 Hamming Loss

Hamming loss is probably the most commonly used performance metric in MLC.
This is not surprising, as it is easy to calculate, as can be seen in (3.11). The Δ operator
returns the symmetric difference between Yi, the real labelset of the ith instance, and
Zi, the predicted one. The |·| operator counts the number of 1s in this difference, in
other words, the number of mispredictions. The total number of mistakes in the n
instances is aggregated and then normalized, taking into account the number of labels
and the number of instances.
HammingLoss = \frac{1}{n} \cdot \frac{1}{k} \sum_{i=1}^{n} |Y_i \,\triangle\, Z_i| \qquad (3.11)
Since the mistakes counter is divided by the number of labels, this metric will
result in different assessments for the same amount of errors when used with MLDs
having disparate labelset lengths. This is the main reason for the low HammingLoss
value of stackex-cs when compared to emotions or scene. The former
has a large number of labels, while the others only have six. Therefore, this metric
has to be interpreted with caution when comparing MLDs with disparate numbers
of labels.
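As an illustration of (3.11), the following sketch (not part of the original text) computes Hamming loss with NumPy, under the assumption that the real and predicted labelsets are encoded as 0/1 indicator matrices of shape (n, k):

```python
import numpy as np

def hamming_loss(Y, Z):
    # Y, Z: 0/1 integer matrices (n instances x k labels); an assumed encoding
    n, k = Y.shape
    # Positions where Y and Z disagree form the symmetric difference
    return np.sum(Y != Z) / (n * k)

Y = np.array([[1, 0, 1], [0, 1, 0]])  # real labelsets
Z = np.array([[1, 1, 1], [0, 1, 0]])  # predicted labelsets
print(hamming_loss(Y, Z))  # one misprediction out of 6 label slots -> 0.1666...
```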
3.5.1.2 Accuracy
In the multilabel field, Accuracy is defined (3.12) as the proportion between the
number of correctly predicted labels and the total number of active labels, in both
the real labelset and the predicted one. The measure is computed for each instance
and then averaged, as are all example-based metrics.
$$\text{Accuracy} = \frac{1}{n}\sum_{i=1}^{n}\frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|} \qquad (3.12)$$
The Accuracy for genbase is very high, due to the reason previously explained.
As shown in Fig. 3.1, the values for emotions and scene are very similar again,
although with a slight advantage for the latter. The obtained Accuracy cannot be
considered good in the case of enron, and even less so with the stackex-cs
MLD. It must be remembered that this MLD had the highest TCS of the five case
studies. Therefore, it was to be expected that it would obtain the worst classification
performance.
Precision (3.13) is considered one of the most intuitive metrics to assess multilabel
predictive performance. It is calculated as the proportion between the number of
labels correctly predicted and the total number of predicted labels. Thus, it can be
interpreted as the percentage of predicted labels which are truly relevant for the
instance. This metric is usually used in conjunction with Recall (3.14), which returns
the percentage of labels correctly predicted among all truly relevant labels, that is,
the ratio of true labels that the classifier gives as output.
$$\text{Precision} = \frac{1}{n}\sum_{i=1}^{n}\frac{|Y_i \cap Z_i|}{|Z_i|} \qquad (3.13)$$

$$\text{Recall} = \frac{1}{n}\sum_{i=1}^{n}\frac{|Y_i \cap Z_i|}{|Y_i|} \qquad (3.14)$$
The joint use of Precision and Recall is so common in the information retrieval
(IR) field that a metric combining them has been defined. It is known as F-measure (3.15)
and computed as the harmonic mean of the previous two. This way, a weighted
measure of how many relevant labels are predicted and how many of the predicted
labels are relevant is obtained.
$$\text{F-measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (3.15)$$
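A minimal sketch of (3.13)–(3.15), under the same indicator-matrix assumption used before; the guards against empty labelsets are a choice of this sketch, not of the original formulation:

```python
import numpy as np

def example_based_prf(Y, Z):
    inter = np.sum(Y & Z, axis=1)            # |Yi ∩ Zi|
    pred = np.maximum(np.sum(Z, axis=1), 1)  # |Zi|, guarded against empty predictions
    true = np.maximum(np.sum(Y, axis=1), 1)  # |Yi|, guarded likewise
    precision = np.mean(inter / pred)        # (3.13)
    recall = np.mean(inter / true)           # (3.14)
    denom = precision + recall
    f_measure = 0.0 if denom == 0 else 2 * precision * recall / denom  # (3.15)
    return precision, recall, f_measure
```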
Observing the right side of Fig. 3.1, where Precision, Recall, and F-measure
are depicted, it can be stated that scene and emotions are once again very close,
though the scene results are a bit better. With enron, it can be seen that Precision
has a higher value than Recall, a gap that is far greater in the case of stackex-cs. This
means that for these MLDs a high proportion of the labels included in the prediction
are relevant labels, but there are many other true labels which are not predicted
by the classifier. Looking at the F-measure values, the same correlation between
the theoretical complexity (TCS value) of each MLD and the classification performance
assessment can be deduced.
This is possibly the strictest evaluation metric. It is also known as classification
accuracy and labelset accuracy, since the full labelsets, the predicted and the real one,
are compared for full equality, as can be seen in (3.16). The larger the labelset, the
lower the likelihood that the classifier produces exactly the correct output. Therefore,
it is usual for MLDs with large sets of labels to obtain low SubsetAccuracy values.
$$\text{SubsetAccuracy} = \frac{1}{n}\sum_{i=1}^{n}[\![\,Y_i = Z_i\,]\!] \qquad (3.16)$$
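The two accuracy variants, (3.12) and (3.16), can be sketched in the same style; the indicator matrices and the zero-division guard are assumptions of this example:

```python
import numpy as np

def accuracy(Y, Z):
    # Jaccard-style multilabel Accuracy (3.12)
    inter = np.sum(Y & Z, axis=1)
    union = np.maximum(np.sum(Y | Z, axis=1), 1)  # avoid 0/0 for empty labelsets
    return np.mean(inter / union)

def subset_accuracy(Y, Z):
    # Proportion of instances whose full labelset is predicted exactly (3.16)
    return np.mean(np.all(Y == Z, axis=1))
```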
Apart from the atypical case of genbase, the SubsetAccuracy for the MLDs
used in the previous experimentation reflects the problems the classifier had with each
one of them. While the scene values are not bad, the performance with emotions
was far worse. As could be expected, due to their large sets of labels, enron and
stackex-cs show the worst results.
All the example-based metrics described above work over binary partitions of labels,
so they need a labelset as output from the classifier. By contrast, the metrics explained
below need a ranking of labels, so a confidence degree or membership probability
for each label is required.
The AvgPrecision (Average precision) metric (3.17) computes, for each relevant
label of an instance, the proportion of relevant labels ranked at or above it, averaging
these proportions over all relevant labels and all instances. The higher its value, the
better the ranking produced by the classifier.

$$\text{AvgPrecision} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{|Y_i|}\sum_{y \in Y_i}\frac{\bigl|\{y' \in Y_i : rank(x_i, y') \le rank(x_i, y)\}\bigr|}{rank(x_i, y)} \qquad (3.17)$$
The Coverage metric (3.18) counts the number of steps needed to go down the
ranking provided by the classifier until all the relevant labels are found. The lower
the mean number of steps for the MLD, which is the value returned by Coverage, the
better the classifier is performing. As can be seen in (3.18), this measure is not normalized,
so it is not upper bounded. As happens with other multilabel classification metrics,
Coverage is influenced by the size of the set of labels in each MLD. The larger this
set, the higher the mean number of steps to walk through the ranking usually is.
$$\text{Coverage} = \frac{1}{n}\sum_{i=1}^{n}\Bigl(\max_{y \in Y_i} rank(x_i, y)\Bigr) - 1 \qquad (3.18)$$
The OneError metric (3.19) assesses how many times the label the classifier places
best in its prediction is not among the truly relevant labels of the instance. Being a
loss measure, the lower its value, the better.

$$\text{OneError} = \frac{1}{n}\sum_{i=1}^{n}\Bigl[\Bigl[\,\arg\max_{y \in Z_i} rank(x_i, y) \notin Y_i\,\Bigr]\Bigr] \qquad (3.19)$$
The RLoss (Ranking loss) metric takes all possible combinations of relevant and
non-relevant labels for an instance and counts (3.20) how many times a non-relevant
label is ranked above a relevant one in the classifier prediction. The count is
normalized by dividing by the product of the numbers of relevant and non-relevant labels
in the instance, and then averaged over the number of assessed instances. The lower
the RLoss measure, the better the classifier is performing.
$$\text{RLoss} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{|Y_i|\,|\overline{Y}_i|}\Bigl|\bigl\{(y_a, y_b) : rank(x_i, y_a) > rank(x_i, y_b),\ (y_a, y_b) \in Y_i \times \overline{Y}_i\bigr\}\Bigr| \qquad (3.20)$$
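The three ranking-based metrics just defined could be sketched as follows, assuming a matrix of per-label confidence scores where a higher score means a better (lower) rank; ties and empty labelsets are handled naively in this sketch:

```python
import numpy as np

def coverage(Y, scores):
    # Rank 1 = highest confidence; deepest rank holding a relevant label, minus 1 (3.18)
    order = np.argsort(-scores, axis=1)
    n, k = scores.shape
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(1, k + 1)
    return np.mean(np.max(np.where(Y == 1, ranks, 0), axis=1) - 1)

def one_error(Y, scores):
    # How often the top-scored label is not relevant (3.19)
    top = np.argmax(scores, axis=1)
    return np.mean(Y[np.arange(len(Y)), top] == 0)

def ranking_loss(Y, scores):
    # Fraction of (relevant, non-relevant) pairs ordered wrongly (3.20)
    per_instance = []
    for y, s in zip(Y, scores):
        rel, irr = s[y == 1], s[y == 0]
        if rel.size and irr.size:
            per_instance.append(np.mean(rel[:, None] < irr[None, :]))
    return np.mean(per_instance)
```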
Observing the AvgPrecision in Table 3.6, it can be seen that, with the exception of
stackex-cs, ML-kNN performed quite well with the other four MLDs. Looking
at the Coverage row, the values for stackex-cs and enron stand out. Since
they have more labels, the number of steps to complete before getting all relevant
labels is higher. The OneError values are quite similar for emotions, scene and
enron, while for stackex-cs it is much higher. This denotes that for the latter
MLD the top-ranked label was usually not relevant. Lastly, considering the RLoss
values, a different scenario is observed. In this case, the worst results are obtained
from emotions, though stackex-cs is very close. Although emotions only
has six labels, there is a significant number of predictions made by ML-kNN in
which non-relevant labels are ranked above the relevant ones.22
3.5.2 Label-Based Metrics

All the performance metrics enumerated in the previous section are evaluated individually
for each instance, and then averaged by dividing by the number of considered
instances. Therefore, each data sample is given the same weight in the final result. On
the contrary, label-based metrics can be computed by means of two different averaging
strategies, usually known as macro-averaging and micro-averaging.
Any of the metrics obtained from a binary partition of labels, such as Precision,
Recall or F-measure, can also be computed using these strategies. To do so, the
generic formulas in (3.21) and (3.22) are used, where EvalMet is one of the metrics
just mentioned. In this context, TP stands for True Positives, FP for False Positives,
TN for True Negatives, and FN for False Negatives.
$$\text{MacroMet} = \frac{1}{k}\sum_{l \in L} EvalMet(TP_l, FP_l, TN_l, FN_l) \qquad (3.21)$$

$$\text{MicroMet} = EvalMet\Bigl(\sum_{l \in L} TP_l,\ \sum_{l \in L} FP_l,\ \sum_{l \in L} TN_l,\ \sum_{l \in L} FN_l\Bigr) \qquad (3.22)$$
In the macro-averaging approach, the metric is evaluated once per label, using the
accumulated counters for it, and then the mean is obtained by dividing by the number of
labels. This way, the same weight is assigned to each label, whether it is very frequent
or very rare.
22 It must be taken into account that ML-kNN does not generate a real ranking of labels as its
prediction, but a binary partition. The ranking is generated from the posterior probabilities calculated
for each label. With so few labels in emotions, it is possible to have many ties in these probabilities,
so the positions in the ranking could be randomly determined in some cases.
On the contrary, the micro-averaging strategy first adds up the counters for all labels
and then computes the metric only once. Since the predictions where rare labels
appear are combined with those made for the most frequent ones, the former are
usually diluted among the latter. Therefore, the contribution of each label to the final
measure is not the same.
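A sketch of (3.21) and (3.22) for the Precision case; the counters are derived from the assumed 0/1 indicator matrices, and the guards against empty denominators are a choice of the example:

```python
import numpy as np

def label_counters(Y, Z):
    # Per-label TP and FP counters, aggregated over the n instances
    tp = np.sum((Y == 1) & (Z == 1), axis=0)
    fp = np.sum((Y == 0) & (Z == 1), axis=0)
    return tp, fp

def macro_precision(Y, Z):
    tp, fp = label_counters(Y, Z)
    # Evaluate the metric per label, then average over the k labels (3.21)
    return np.mean(tp / np.maximum(tp + fp, 1))

def micro_precision(Y, Z):
    tp, fp = label_counters(Y, Z)
    # Aggregate the counters first, then evaluate the metric once (3.22)
    return tp.sum() / max(tp.sum() + fp.sum(), 1)
```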
In addition to label-based metrics computed from binary partitions, those calculated
from label rankings are also available. The area under the ROC (Receiver
Operating Characteristic) curve (AUC) can be computed according to the macro-
(3.23) and micro- (3.24) averaging approaches.
In addition to the groups of metrics already explained here, several more, in general
much more specific, can be found in the specialized literature. For instance,
there are metrics for evaluating the performance in hierarchical multilabel
classification, such as Hierarchical loss [8]. It is based on Hamming loss, but
considers the level of the hierarchy where the mispredictions are made.
References
22. Ghamrawi, N., McCallum, A.: Collective multi-label classification. In: Proceedings of 14th
ACM International Conference on Information and Knowledge Management, CIKM’05, pp.
195–200. ACM (2005)
23. Godbole, S., Sarawagi, S.: Discriminative methods for multi-labeled classification. Adv. Knowl.
Discov. Data Min. 3056, 22–30 (2004)
24. Gonçalves, E.C., Plastino, A., Freitas, A.A.: A genetic algorithm for optimizing the label
ordering in multi-label classifier chains. In: Proceedings of 25th IEEE International Conference
on Tools with Artificial Intelligence, ICTAI’13, pp. 469–476. IEEE (2013)
25. Joachims, T.: Text categorization with suport vector machines: learning with many relevant
features. In: Proceedings of 10th European Conference on Machine Learning, ECML’98, pp.
137–142. Springer (1998)
26. Katakis, I., Tsoumakas, G., Vlahavas, I.: Multilabel text classification for automated tag sug-
gestion. In: Proceedings of European Conference on Machine Learning and Principles and
Practice of Knowledge Discovery in Databases, ECML PKDD’08, pp. 75–83 (2008)
27. Klimt, B., Yang, Y.: The enron corpus: a new dataset for email classification research. In:
Proceedings of 15th European Conference on Machine Learning, ECML’04, pp. 217–226.
Springer (2004)
28. Lang, K.: Newsweeder: learning to filter netnews. In: Proceedings of 12th International Con-
ference on Machine Learning, ML’95, pp. 331–339 (1995)
29. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: a new benchmark collection for text catego-
rization research. J. Mach. Learn. Res. 5, 361–397 (2004)
30. Mencia, E.L., Fürnkranz, J.: Efficient pairwise multilabel classification for large-scale problems
in the legal domain. In: Proceedings of 11th European Conference on Machine Learning and
Knowledge Discovery in Databases, ECML PKDD’08, pp. 50–65. Springer (2008)
31. Read, J.: Scalable multi-label classification. Ph.D. thesis, University of Waikato (2010)
32. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification.
Mach. Learn. 85, 333–359 (2011)
33. Read, J., Reutemann, P.: MEKA multi-label dataset repository. http://sourceforge.net/projects/
meka/files/Datasets/
34. Schapire, R.E., Singer, Y.: Boostexter: a boosting-based system for text categorization. Mach.
Learn. 39(2–3), 135–168 (2000)
35. Snoek, C.G.M., Worring, M., van Gemert, J.C., Geusebroek, J.M., Smeulders, A.W.M.: The
challenge problem for automated detection of 101 semantic concepts in multimedia. In: Pro-
ceedings of 14th ACM International Conference on Multimedia, MULTIMEDIA’06, pp. 421–
430 (2006)
36. Spyromitros-Xioufis, E., Papadopoulos, S., Kompatsiaris, I.Y., Tsoumakas, G., Vlahavas, I.: A
comprehensive study over vlad and product quantization in large-scale image retrieval. IEEE
Trans. Multimedia 16(6), 1713–1728 (2014)
37. Srivastava, A.N., Zane-Ulman, B.: Discovering recurring anomalies in text reports regarding
complex space systems. In: Aerospace Conference, pp. 3853–3862. IEEE (2005)
38. Tomás, J.T., Spolaôr, N., Cherman, E.A., Monard, M.C.: A framework to generate synthetic
multi-label datasets. Electron. Notes Theoret. Comput. Sci. 302, 155–176 (2014)
39. Tsoumakas, G., Katakis, I.: Multi-label classification: An overview. Int. J. Data Warehouse.
Min. 3(3), 1–13 (2007)
40. Tsoumakas, G., Katakis, I., Vlahavas, I.: Effective and efficient multilabel classification in
domains with large number of labels. In: Proceedings of ECML/PKDD Workshop on Mining
Multidimensional Data, MMD’08, pp. 30–44 (2008)
41. Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. In: Data Mining and Knowl-
edge Discovery Handbook, pp. 667–685. Springer (2010)
42. Tsoumakas, G., Vlahavas, I.: Random k-Labelsets: an ensemble method for multilabel classi-
fication. In: Proceedings of 18th European Conference on Machine Learning, ECML’07, vol.
4701, pp. 406–417. Springer (2007)
43. Tsoumakas, G., Xioufis, E.S., Vilcek, J., Vlahavas, I.: MULAN multi-label dataset repository.
http://mulan.sourceforge.net/datasets.html
44. Turnbull, D., Barrington, L., Torres, D., Lanckriet, G.: Semantic annotation and retrieval of
music and sound effects. IEEE Trans. Audio Speech Lang. Process. 16(2), 467–476 (2008)
45. Turner, M.D., Chakrabarti, C., Jones, T.B., Xu, J.F., Fox, P.T., Luger, G.F., Laird, A.R., Turner,
J.A.: Automated annotation of functional imaging experiments via multi-label classification.
Front. Neurosci. 7 (2013)
46. Tzanetakis, G., Cook, P.: Musical genre classification of audio signals. IEEE Trans. Speech
Audio Process. 10(5), 293–302 (2002)
47. Ueda, N., Saito, K.: Parametric mixture models for multi-labeled text. In: Proceedings of 15th
Annual Conference on Neural Information Processing Systems, NIPS’02, pp. 721–728 (2002)
48. Wieczorkowska, A., Synak, P., Raś, Z.: Multi-label classification of emotions in music. In:
Intelligent Information Processing and Web Mining, AISC, vol. 35, chap. 30, pp. 307–315
(2006)
49. Zhang, M., Zhou, Z.: ML-KNN: a lazy learning approach to multi-label learning. Pattern
Recogn. 40(7), 2038–2048 (2007)
Chapter 4
Transformation-Based Classifiers
4.1 Introduction
Fig. 4.1 Tabular representation of a binary dataset, a multiclass dataset, and a multilabel dataset
4.2 Multilabel Data Transformation Approaches
Beyond some basic transformations, such as the ones mentioned in Sect. 2.4.1
consisting in ignoring the multilabel instances or choosing only one of the labels,
most of the published methods aim to transform the multilabel samples into binary
or multiclass instances. In the following sections, the main proposals on how to
transform an MLD into one or more BIDs/MCDs are detailed. Firstly, the binary-
based transformation methods are explained. Then, the multiclass-based ones will
be illustrated.
One of the easiest ways to face multilabel classification consists in taking one label
at a time, training an individual classifier for each one of them. Since the labels can
only be active (a relevant label) or inactive (a non-relevant label), the result is a binary
dataset per label.
Depending on how the binary datasets are combined to train the classifiers, a
different model is obtained. The two main approaches are OVO and OVA, discussed
below. More complex ensembles of binary classifiers are also further detailed.
The OVO and OVA approaches are well-known techniques to face multiclass classi-
fication [5] by means of ensembles of binary classifiers. OVA, also known as OVR
(One vs the rest), is based on the training of an individual classifier for each class
against all others. This technique is extended in the multilabel field to take into
account that several labels can be relevant at once. Therefore, a binary classifier is
trained for each label, as depicted in Fig. 4.2, and their outputs are combined to gen-
erate the predicted labelset. In the multilabel field, this approach is named Binary
Relevance (BR) [6].
Any existing classification algorithm can be used as the underlying classifier. This
includes tree-based methods, SVMs, ANNs, and instance-based algorithms. As can
be seen, the number of binary classifiers needed to build this solution is k, the number
of labels associated with the MLD. In each BID, only the considered label is taken into
account. After training, the test samples are given as input to each binary classifier.
The final step consists in joining these individual outputs to form the labelset to be
returned as result.
BR is an easy way of implementing an MLC method, relying on a collection
of binary classifiers and a simple strategy to combine their outputs. Nonetheless, it
also has some inconveniences. The most remarkable is the fact that, by training
independent classifiers, it completely dismisses the potential correlations among
labels. Additionally, by taking as negative all samples in which the considered label
is not relevant, each binary classifier has to deal with extremely imbalanced data.
Fig. 4.2 Binary Relevance transformation diagram. The original MLD (top) generates as many
BIDs as labels there are, using each one to train a binary classifier. When a new test sample arrives,
it is given to each individual classifier, joining their predictions to obtain the final labelset
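A minimal BR sketch, assuming scikit-learn's LogisticRegression as the underlying binary classifier (any other would do) and that every label column contains both positive and negative samples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class BinaryRelevance:
    def fit(self, X, Y):
        # One independent binary classifier per label (OVA style)
        self.models = [LogisticRegression(max_iter=1000).fit(X, Y[:, j])
                       for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        # Join the individual outputs to form the predicted labelset
        return np.column_stack([m.predict(X) for m in self.models])
```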
Combining the outputs is also simple: the individual predictions are joined to conform
the final labelset. RPC and CLR are a little more complex, and there are other
proposals also based on ensembles of binary classifiers. The following are some of
the most noteworthy:
• 2BR: This algorithm was proposed [13] aiming to mitigate the most remarkable
drawback of BR, which is not taking label dependencies into consideration. 2BR
is founded on a stacking strategy, using BR at two levels. The first one learns the
BR model, while the second one, taking the outputs of the previous one, learns a
meta-model which includes an explicit coefficient for correlated labels.
• BR+: The proposal made in [1] has the same goal as 2BR, incorporating label
dependency information to increase the multilabel classifier's performance. To do
so, the binary classifier for each label gains the remaining labels as additional
attributes while training. During testing, the sample is first processed with a standard
BR approach; the outputs of each classifier are then used to enrich the instance input
space, and the enriched instance is introduced into the extended classifiers trained
with the additional labels.
• CC: It is maybe the best-known multilabel ensemble of binary classifiers. Classifier
chains [11] also train k models, as many as labels there are in the MLD, choosing
them in random order. The first classifier is trained using only the original input
attributes. The first output label is then added as a new input attribute, and the new
input space is used to train the second classifier, and so on. This way the classifiers
are chained, taking into account the possible label dependencies (see the sketch
after this list). It is easy to see that different orders of labels can produce disparate
results. This is the reason why ECC (Ensemble of Classifier Chains) was also
proposed, as a set of CC chains with diverse orders, each trained with a subset of
the available data. Each chain in the ensemble gives a labelset as output, which is
considered as a set of votes.
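The chaining idea can be sketched as follows, under the same assumptions as the BR sketch above; the label order is received as a parameter, since CC picks it at random:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class ClassifierChain:
    def __init__(self, order):
        self.order = list(order)  # a permutation of label indices

    def fit(self, X, Y):
        self.models, Xa = [], X
        for j in self.order:
            self.models.append(LogisticRegression(max_iter=1000).fit(Xa, Y[:, j]))
            Xa = np.column_stack([Xa, Y[:, j]])  # chain the true label as a new attribute
        return self

    def predict(self, X):
        preds = np.zeros((X.shape[0], len(self.order)), dtype=int)
        Xa = X
        for model, j in zip(self.models, self.order):
            preds[:, j] = model.predict(Xa)
            Xa = np.column_stack([Xa, preds[:, j]])  # propagate predicted labels
        return preds
```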
The main obstacle of most multilabel ensembles based on binary classifiers is their
computational complexity. Processing MLDs having thousands of labels can be
infeasible.
4.4 Multiclass Classification-Based Methods

The other main data transformation path for MLDs relies on the use of multiclass
classifiers, instead of binary ones. The basic idea consists in treating each label
combination as a unique class identifier, as depicted in Fig. 4.3. This corresponds
to the basic LP method proposed in [3], where only one multiclass class label per
labelset is generated; thereby, a single multiclass classifier is needed to obtain the
predictions. These are then back-translated to labelsets.
Any multiclass classification algorithm can be used as underlying classifier, thus
taking advantage of the accumulated experience in the multiclass field. Since many
multiclass classification algorithms operate themselves as transformation methods,
producing binary classifiers with the OVA or OVO approach, eventually the mul-
tilabel problem could be faced as multiple binary ones depending on the chosen
algorithm.
Fig. 4.3 Label Powerset transformation diagram. The labelsets in the original MLD are translated
to unique class identifiers and then used to train a multiclass classifier. After processing a new test
sample, the output is back-translated to obtain the label predictions
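An LP sketch under the same indicator-matrix assumptions, with a decision tree as the underlying multiclass classifier (any other would serve):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class LabelPowerset:
    def fit(self, X, Y):
        # Each distinct labelset becomes one multiclass class identifier
        keys = [tuple(row) for row in Y]
        self.classes = sorted(set(keys))
        y = np.array([self.classes.index(key) for key in keys])
        self.model = DecisionTreeClassifier().fit(X, y)
        return self

    def predict(self, X):
        # Back-translate predicted class identifiers into labelsets
        return np.array([self.classes[c] for c in self.model.predict(X)])
```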
The LP transformation method is even simpler than BR, since there is no need to train
several models nor to combine their outputs. However, it also has a major drawback:
the number of different label combinations in a large dataset, with a high number
of labels and instances, can be extremely large.
Theoretically, 2^k different label combinations can exist in an MLD with k labels.
In practice, this number is limited by the number of instances in the MLD, because
usually 2^k ≫ n. For instance, the delicious dataset has 8.17 × 10^295 potential distinct
labelsets, but only 16,105 instances. This circumstance causes some MLDs to have
as many distinct labelsets as instances. Using the LP transformation
with such an MLD will produce a multiclass dataset with a different class for each
instance; thus, the classifier will struggle to train an effective model.
Unlike BR, the basic LP transformation method implicitly includes in the learning
process the hidden relationships among labels. Sets of labels with a certain correlation
tend to generate the same or similar label combinations. However, the larger the
number of total labels in the MLD, the higher the likelihood of a combinatorial
explosion that makes this advantage almost useless.
Another obstacle in the use of the LP transformation method is its incompleteness.
An LP classifier can only predict label combinations appearing in the training set.
These usually represent only a small fraction of the 2^k potential label combinations.
On the contrary, an MLC algorithm based on binarization techniques can virtually
produce every possible labelset.
To avoid the previous problems, a pruning method named PS (Pruned Sets) was
proposed in [9]. The goal is to focus the classifier on the key label combinations in
the MLD, those which occur most frequently. To do so, PS looks for infrequent
labelsets and splits them into more common combinations. This way, the aforementioned
combinatorial explosion is alleviated, and the generalization ability of the
classifier is improved.
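A deliberately simplified sketch of the pruning idea (the actual PS algorithm in [9] is richer, e.g., it can reintroduce several frequent subsets per pruned instance; the threshold p is an assumed parameter):

```python
from collections import Counter

def prune_labelsets(labelsets, p=2):
    # Keep labelsets occurring at least p times; replace each infrequent one
    # by its most frequent frequent subset, when such a subset exists
    counts = Counter(frozenset(ls) for ls in labelsets)
    frequent = {s for s, c in counts.items() if c >= p}
    pruned = []
    for ls in (frozenset(ls) for ls in labelsets):
        if ls in frequent:
            pruned.append(ls)
        else:
            subsets = [s for s in frequent if s <= ls]
            if subsets:  # reintroduce the instance with a more common combination
                pruned.append(max(subsets, key=counts.__getitem__))
    return pruned
```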
Unlike BR, the basic LP transformation method, as well as the PS algorithm, is not an
ensemble of classifiers, since only one multiclass classifier is used. Nonetheless, there
are several proposals of MLC methods based on ensembles of multiclass classifiers,
which will be described in greater detail in Chap. 6.
Although most LP-based methods claim to be able to take label correlations
into account, and therefore to be superior to the BR approach, their effectiveness
is largely impacted by the presence of a high number of low-frequency combinations
in the MLD.
In addition to MLC methods purely based on the BR and LP transformations,
there are some proposals which aim to combine them. It is the case of the ChiDep
algorithm [12], which divides the MLC task into the following steps:
1. Firstly, labels with dependencies are identified by applying a χ² independence
test. Two groups of labels are created, one with the dependent labels and another
with the independent ones.
Having reviewed the most popular data transformation-based methods to face multilabel
classification, in this section some of them are used and compared, aiming to
study their behavior. In particular, the following four methods are employed:
• BR: It is the most basic transformation method based on binarization. It follows
the OVA approach.
• CLR: A more elaborate binarization method than BR, following the OVO
approach.
• LP: It is the most basic transformation method based on label combinations. LP
is usually compared with BR since it takes label dependency information into
account.
• PS: A more elaborate labelset method than LP, including a pruning process that
alleviates the combinatorial explosion.
These four algorithms have been tested according to the experimental configura-
tion described below. The obtained classification results are presented in Sect. 4.5.2.
In order to assess the classification performance, three evaluation metrics have been
used: HLoss (Hamming loss), F-measure, and SubsetAcc (Subset accuracy). All of
them are example-based metrics, and they were detailed in Chap. 3. The latter is a
very strict performance metric, since it only accounts as correct those predictions in
which the full labelsets coincide. Average values were obtained for each measure/MLD
across the ten runs. Moreover, training and testing times for each algorithm are also
analyzed.
HLoss is a loss metric that sums the number of mispredictions in all the test samples,
whether they are false positives or false negatives, and then averages this quantity
over the number of instances and labels. Therefore, the lower the HLoss, the better the
performance of the classifier is considered. The values obtained in this experimentation
are represented in Fig. 4.4 and shown in Table 4.2. Best values are highlighted
in bold.
Fig. 4.4 Classification results assessed with Hamming Loss (lower is better)
From these results, it can be deduced that, in general, the more
complex the MLD is, the worse the MLC methods perform. According to this evaluation
metric, the binarization methods BR and CLR clearly outperformed the ones based
on label combinations, gathering all the best values.
As defined in Sect. 3.5.1 (see formulation in Chap. 3), F-measure is computed as
the harmonic mean of Precision and Recall. This metric is averaged over the number
of instances in the test set, but not over the number of labels. Thus, it is not influenced
by this factor, as is the case with HLoss. The F-measure values are shown in Fig. 4.5
and Table 4.3.
According to F-measure, the corel5k MLD obtained the worst classification
results, instead of tmc2007. The best performances correspond to the CLR and
PS algorithms, with one win for BR. CLR and PS can be considered more advanced
transformation approaches than LP and BR, so the fact that they achieve the best
performances falls within what was expected.
Lastly, classification results assessed with the SubsetAcc metric are provided in
Fig. 4.6 and Table 4.4. As mentioned before, this is the strictest evaluation metric,
so it is not strange that lower values than with F-measure are obtained in all cases.
Since SubsetAcc compares full labelsets, instead of individual labels, the larger the
MLD's set of labels, the lower the likelihood of correctly predicting all of them.
This is the reason why corel5k and mediamill obtain the worst performances,
being the MLDs with the most labels out of the five used here.
Another interesting fact can be stated by analyzing the best values in Table 4.4. With
the only exception of genbase, all the best performances correspond to labelset-based
algorithms. This is consistent with the nature of the evaluation metric.
As usual when evaluating an MLC task, it must be taken into account that the
assessed performance will change depending not only on the selected algorithm,
but also on the MLD's traits and the chosen evaluation metric.
Transformation methods such as BR and LP are at the heart of several of the methods
to be described in the following chapter, since many of them rely on binarization
and label powerset techniques. In addition, these two approaches, along with PS, are
the foundation of many ensemble-based methods that will be outlined in Chap. 6.
Besides their classification performance, it is also interesting to get a glimpse of
these transformation methods' running times, since they will have a significant impact
on a multitude of other MLC algorithms.
The time spent by each method in training the model for each MLD, measured
in seconds, is depicted in Fig. 4.7. Exact values are provided in Table 4.5. As can be
observed, binarization-based methods (BR and CLR) need more time to build the
classifier than those based on label combinations (LP and PS). Overall,
PS, which uses a pruned set of label combinations, is the best performer in
this regard. On the contrary, CLR, with its OVO approach, needs much more time
than any of the other tested transformations.
Regarding testing times, that is, the time spent by a classifier to process the test instances
predicting the labelset for each one of them, they are shown in Fig. 4.8 and Table 4.6.
Once again, the OVO procedure of CLR makes it the slowest one. However, once the
classifier has been trained, BR is the best performer at labeling new instances, as can
be inferred from the first bar in the plot and the first column in Table 4.6. Therefore,
in this case, training time and testing time are inverse criteria.
References
1. Alvares-Cherman, E., Metz, J., Monard, M.C.: Incorporating label dependency into the binary
relevance framework for multi-label classification. Expert Syst. Appl. 39(2), 1647–1655 (2012)
2. Barot, P., Panchal, M.: Review on various problem transformation methods for classifying
multi-label data. Int. J. Data Mining Emerg. Technol. 4(2), 45–52 (2014)
3. Boutell, M., Luo, J., Shen, X., Brown, C.: Learning multi-label scene classification. Pattern
Recogn. 37(9), 1757–1771 (2004)
4. Fürnkranz, J., Hüllermeier, E., Loza Mencía, E., Brinker, K.: Multilabel classification via
calibrated label ranking. Mach. Learn. 73, 133–153 (2008)
5. Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F.: An overview of ensemble
methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and
one-vs-all schemes. Pattern Recogn. 44(8), 1761–1776 (2011)
6. Godbole, S., Sarawagi, S.: Discriminative methods for multi-labeled classification. Adv. Knowl.
Discov. Data Mining 3056, 22–30 (2004)
7. Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K.: Label ranking by learning pairwise
preferences. Artif. Intell. 172(16), 1897–1916 (2008)
8. Quinlan, J.R.: C4.5: Programs for Machine Learning (1993)
9. Read, J.: A pruned problem transformation method for multi-label classification. In: Proceed-
ings of New Zealand Computer Science Research Student Conference, NZCSRS’08, pp. 143–
150 (2008)
10. Read, J., Pfahringer, B., Holmes, G.: Multi-label classification using ensembles of pruned
sets. In: Proceedings of 8th IEEE International Conference on Data Mining, ICDM’08, pp.
995–1000. IEEE (2008)
11. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification.
Mach. Learn. 85, 333–359 (2011)
12. Tenenboim-Chekina, L., Rokach, L., Shapira, B.: Identification of label dependencies for multi-
label classification. In: Working Notes of the Second International Workshop on Learning from
Multi-Label Data, MLD’10, pp. 53–60 (2010)
13. Tsoumakas, G., Dimou, A., Spyromitros, E., Mezaris, V., Kompatsiaris, I., Vlahavas, I.:
Correlation-based pruning of stacked binary relevance models for multi-label learning. In:
Proceedings of 1st International Workshop on Learning from Multi-Label Data, MLD’09, pp.
101–116 (2009)
14. Tsoumakas, G., Katakis, I., Vlahavas, I.: Effective and efficient multilabel classification in
domains with large number of labels. In: Proceedings of ECML/PKDD Workshop on Mining
Multidimensional Data, MMD’08, pp. 30–44 (2008)
15. Tsoumakas, G., Vlahavas, I.: Random k-Labelsets: an ensemble method for multilabel classi-
fication. In: Proceedings of 18th European Conference on Machine Learning, ECML’07, vol.
4701, pp. 406–417. Springer (2007)
Chapter 5
Adaptation-Based Classifiers
5.1 Overview
Decision trees (DTs) are among the easiest to understand classification models.
Despite their apparent simplicity, some DT algorithms, such as C4.5 [24], yield
a performance that makes them competitive against other learning approaches. The
following are some of the multilabel classification methods based on DTs proposed
in the literature.
The research conducted in [10] aimed to classify genes according to their function,
taking into account that the same gene can intervene in several functions. The pro-
posed solution is founded on the well-known C4.5 algorithm, appropriately modified
to deal with several labels at once. The two key points of the adapted method are as
follows:
• The leaves of the tree contain samples that are associated with a set of labels,
instead of one class only.
• The original entropy measure is adjusted to take into consideration the probability
of non-membership of the instances to a certain label.
The iterative process to build the tree partitions the instances according to the
value of a certain attribute, computing the weighted sum of the entropies of each
potential subset of labels. If a sample is associated with more than one label, then it
is accumulated several times into this weighted sum.
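Being consistent with the description above and with Clare and King's formulation in [10], the adjusted entropy can be rendered as follows, where $p(\lambda_i)$ is the relative frequency of label $\lambda_i$ in the set of samples $S$ and $q(\lambda_i) = 1 - p(\lambda_i)$ its non-membership probability (a sketch of the adaptation, not a verbatim quote):

$$\mathrm{entropy}(S) = -\sum_{i=1}^{k}\bigl(p(\lambda_i)\log p(\lambda_i) + q(\lambda_i)\log q(\lambda_i)\bigr)$$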
Since each leaf represents a set of labels, this impacts both the process of labeling
each node in the tree and also the further pruning task inherent to C4.5.
Artificial neural networks (ANNs) in general, and particularly Radial Basis Function
Networks (RBFNs), have proven their effectiveness in classification problems, as
well as in regression and time series prediction. As a consequence, the adaptation
of ANNs to accomplish MLC tasks is a recurrent topic in the literature. The goal
of this section is to provide a succinct description of several ANN-based multilabel
algorithms.
The key aspect in BP-MLL is the introduction of a new error function used while
training the ANN, and computed taking into account the fact that each sample con-
tains several labels. Specifically, this new function penalizes the predictions including
labels which are not truly relevant for the processed instance.
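This pairwise error function can be sketched as follows, with assumed notation: $c_p^i$ is the network output for label $p$ on the $i$th instance, $Y_i$ its set of relevant labels, and $\overline{Y}_i$ the complementary set; outputs placing a non-relevant label above a relevant one are penalized exponentially:

$$E = \sum_{i=1}^{n} \frac{1}{|Y_i|\,|\overline{Y}_i|} \sum_{(p,q)\,\in\, Y_i \times \overline{Y}_i} \exp\bigl(-(c_p^i - c_q^i)\bigr)$$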
In BP-MLL, the input layer has as many neurons as input attributes there are in
the MLD. The number of units in the output layer is determined by the number of
labels considered. The amount of neurons in the hidden layer is also influenced by
the number of labels, and they use a sigmoid activation function.
The BP-MLL algorithm produces a label ranking as its result while classifying new
instances. To decide which labels will be predicted as relevant, a parameter in charge
of adjusting the cut threshold has to be set. Configuring this parameter is the main
difficulty in using BP-MLL.
The proposal made in [19], called I-BP-MLL, overcomes the aforementioned
difficulty. To do so, the threshold is automatically adjusted as the committed error is
computed during the learning process. This approach produces a custom threshold
for each label, instead of a global threshold as BP-MLL does.
Support Vector Machines have traditionally been applied to solve binary classification
problems. As other techniques, they have evolved over time, being extended
to face other kinds of tasks, such as multiclass and multilabel classification. In this
section, several of the MLC methods based on SVMs are outlined.
5.4.1 MODEL-x
The target task of the authors in [5] was to label natural landscapes. Each scene can
contain several objects at once and, as a consequence, it can be labeled with more
than one class. Some scene labels are urban, beach, field, mountain, beach + field,
field + mountain, etc.
Since SVMs tend to behave well when dealing with images, the authors
selected this kind of model as the underlying classifier. The proposed method, called
MODEL-x, trains an SVM per label using a novel approach named cross-training.
This technique takes the samples having several labels as positive cases while training
every individual model, instead of as negative ones, as do the algorithms which consider
each label combination as a different class. As with many other MLC proposals,
MODEL-x can also be considered an ensemble.
As the authors of Rank-SVM state in [13], binary classifiers are not the best option
when some correlations among labels exist, since full independence between them
is assumed by most binary-based methods. To alleviate this problem, Rank-SVM, a
direct approach based on SVM principles, relies on a new metric to be minimized
while training, AHL, a linear approximation of Hamming loss.
Once the SVM-like model has been trained to produce the label ranking, a specific
function is used to adjust the cutting threshold from which the predicted labelset, as
a subset of the full ranking, is extracted.
Although Rank-SVM takes label correlations into account, and thus theoretically
should perform better than pure binary models, experimental tests show that its
behavior is similar when working with high-dimensional MLDs.
These two proposals are based on Rank-SVM, aiming to improve its efficiency
and performance. The goal of Rank-CVM [35] is to reduce the computational
complexity of Rank-SVM. To do so, its authors decided to use a CVM (core vector
machine) instead of an SVM. The analytical solution of CVMs is immediate, and much
more efficient than that of SVMs. The result is an MLC classifier with predictive
performance similar to Rank-SVM, but an order of magnitude more efficient.
The goal of the authors of SCRank-SVM [33] was also to improve the efficiency
of Rank-SVM, as well as its performance as an MLC classifier. The proposal includes
reformulating the calculation of the decision boundary in the underlying SVMs,
simplifying some of the existing constraints to maximize the margin. In the process,
the authors get rid of one of the usual SVM parameters, reducing the complexity of
the computations.
Two similar MLC methods are introduced in [8], both of them based on the aforementioned
ML-kNN algorithm. The core of the two proposals, named IBLR-ML and
IBLR-ML+, uses Bayesian techniques to consider the labels associated with the nearest
neighbors of the new instance as additional characteristics. Using this information,
the a priori probabilities are computed and a regression equation is obtained.
The main difficulty in these methods comes from the need to adjust an α parameter,
in charge of setting the weight to be assigned to the additional attributes while
computing the a posteriori probabilities. To accomplish this task, a statistical parameter
estimation method is used. This is an adaptation process which demands the
enumeration of all instances in the MLD.
Although IBLR-ML can achieve better predictive performance than ML-kNN, it
also has higher computational demands, including more memory consumption and
training time.
The kNNc method was proposed in [6]. It works in two stages, combining instance
selection techniques with instance-based classification. Firstly, a reduced set of
instances is obtained by prototype selection techniques. The aim is to determine
the set of labels which are nearest to the ones in the instance to be classified. Then,
the full set of samples is used, but limiting the prediction to the labels inferred in the
previous step.
BRkNN and LPkNN [27] are MLC classifiers built by combining the BR and LP
transformation methods, respectively, with kNN classification techniques. They first
apply the data transformation to the MLD, then use kNN to find the nearest neighbors,
and generate the labelset from them.
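The BRkNN idea can be sketched in a few lines: a single kNN search shared by all labels, followed by an independent per-label vote among the neighbors (the 0.5 threshold and the indicator-matrix encoding are assumptions of this sketch):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def brknn_predict(X_train, Y_train, X_test, k=5):
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_test)          # one search shared by all labels
    # Fraction of the k neighbors holding each label, thresholded per label
    return (Y_train[idx].mean(axis=1) >= 0.5).astype(int)
```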
The proposal in [16] is called BRkNN-new. As its name indicates, it is an improve-
ment based on the BRkNN method just described above. The goal of the authors is to
take advantage of label correlation information in order to improve the performance
of the original model.
The authors of these two algorithms, presented in [17], state that binary MLC methods
assume the labels are fully independent and do not take into account potential
correlations among them. This is the reason to propose a model to capture this information,
aiming to improve the predictive performance of the classifiers.
In order to obtain the correlations between labels, a CRF (Conditional Random
Field) is used. A CRF is a probabilistic discriminative model, commonly used in
tasks such as text segmentation. It associates a probability to each label depending
on the existing observations, and with these data, the potential dependencies among
outputs are modeled. Four different states can exist for each label pair with respect
to the considered instance: none of the labels is relevant, both of them are applicable,
only the first one is relevant, or only the second one is relevant.
The first proposed model is CML (Collective Multilabel). It holds a correlation
parameter for each label pair. The second one is CMLF (CML with Features), and it
works with three-variable groups such as attribute-label1-label2. For each label pair
associated with one attribute, CMLF holds a correlation value. Those labels which
do not reach a certain frequency are discarded.
These two methods were introduced in [31]. They are parametric mixture models
(PMMs), probabilistic generative models whose goal is to automate text document
classification by estimating the label probabilities from the terms appearing in the
documents.
PMMs assume that each sample in the text MLD is a mixture of characteristic
words related to the labels relevant to the instance. The main difference between
PMM1 and PMM2 lies in the way the parameters controlling the algorithms are
approximated, PMM2 being a more flexible version of PMM1.
The major drawback of PCC (Probabilistic Classifier Chains) and other similar
approaches based on classifier chains is their computational complexity. This is an
inconvenience while working with MLDs having large sets of labels, and sometimes
it can be infeasible to use this kind of algorithm.
In [28], the authors propose several extensions to CC which follow a different path,
named BCC (Bayesian Classifier Chains). TNBCC (Tree Naïve BCC) is an algorithm
based on this approach. By using Bayesian networks to model the dependencies
among labels, it reduces the number of chain combinations to consider in order to
finally compose the multilabel classifier.
RBMs (Restricted Boltzmann Machines) [26] are a proven mechanism when it comes
to producing high-level features from the low-level input attributes existing in a dataset.
They are a usual component in the process of building deep belief networks, as well as
other deep learning structures. An RBM is a probabilistic graphical model, specifically
a two-layer graph with one input layer and one hidden layer. The latter is used to model
the relationships between the input attributes.
Introduced in [21], the CRBM (Conditional RBM) algorithm relies on an RBM to
retrieve dependencies among labels. The goal is to build a model able to operate with
MLDs containing incomplete sets of labels. To do so, the labels of each instance
are given as input to the RBM, obtaining as output a dependency model capable of
predicting missing labels from the presence of others.
5.7 Other MLC Adaptation-Based Methods

In addition to the more than twenty MLC methods mentioned above, adaptations
of trees, ANNs, SVMs, instance-based, and probabilistic classifiers, many more can
be found in the literature following alternative ways of attacking the problem. Some
of them are enumerated below:
• HG: It is based on a hypergraph algorithm, each label being one edge, whose
goal is to generate a model containing the relationships among labels. The resulting
problem is hard to solve from a computational point of view, but the authors in
[29] assume certain premises that allow them to accomplish the task as a simplified
least squares problem.
• CLAC: One of the major obstacles in facing MLC tasks is the usually huge number
of label combinations an MLD can hold. This produces a secondary problem, even
harder to solve, since the number of data samples sharing the same combination
of labels can be almost negligible. These cases are identified in [32] as disjuncts,
highlighting that if they are ignored, due to their poor representation, the obtained
model can lose precision, since in large MLDs many disjuncts can exist. The MLC
method proposed is a lazy algorithm named CLAC (Correlated Lazy Associative
Classifier), which increases the relevance of the potential disjuncts by discarding
attributes and samples not related to the processed data instance.
• GACC: Introduced in [18], the GACC (Genetic Algorithm for ordering Classifier
Chains) method is proposed as an alternative to the traditional ECC. While the
latter randomly generates chains of binary classifiers, the former resorts to the use
of a genetic algorithm to optimize the order of the classifiers in the chain. Besides
the usual goal of improving classification results, the authors also aim to make the
obtained model more easily interpretable.
• MuLAM: Although ant colony-based algorithms [12] are mostly used in optimization
problems, as ACO (Ant Colony Optimization) methods, there are also
techniques, such as Ant-Miner [23], able to produce classification rules following
the same approach. Taking Ant-Miner as its foundation, in [7] the authors present a
new MLC algorithm named MuLAM (Multi-Label Ant-Miner). It approaches the
multilabel classification problem as an optimization task, but the reported predictive
performance seems to be far from the state-of-the-art MLC classifiers.
• ML-KMPSO: The proposal made in [22] is an MLC method which combines
the kNN and MPSO (Michigan Particle Swarm Optimization) algorithms. Firstly,
the a priori probabilities of each label are computed. Then, MPSO is used to
optimize the selection of the nearest neighbors to the considered instance, obtaining
a set of expert particles which help choose the set of labels to be predicted.
• GEP-MLC: In [4], the authors present an MLC method founded on discriminant
functions optimized by GEP (Gene Expression Programming [14]) techniques.
GEP is a genetic programming approach specially suitable for regression problems,
and it has been used for facing traditional classification as well. GEP-MLC learns
one or more discriminant functions for each label, thus working as a BR transformation
method in which the underlying classifiers are the functions optimized by GEP.
• ML-Tree: It is a recent multilabel classifier that takes into account the potential
dependencies among labels, including this information in the tree model. As
explained above (see Sect. 5.2.3), it follows a hierarchical approach to generate
the tree.
• BP-MLL: This method is the first adaptation of an ANN to the multilabel prob-
lem, including a specific error function which considers the existence of multiple
outputs.
• Rank-SVM: The SVM-based multilabel adaptation, as was explained in
Sect. 5.4.2, also takes advantage of label correlation information aiming to improve
classification performance.
• BRkNN: It is a lazy, instance-based multilabel learner which combines the BR
transformation with kNN techniques. It is an alternative to the ML-kNN algorithm
already used in experiments of previous chapters.
The BP-MLL and BRkNN algorithms are available in the MULAN package [30],
while ML-Tree and Rank-SVM can be obtained from the authors' Web site.1 The
former is written in Java and relies on the WEKA software package, while the latter
is implemented in MATLAB.
1 ML-TREE code can be downloaded from Dr. Qingyao Wu’s Web page at https://sites.google.com/
The four multilabel classifiers pointed out above have been run over the partitions
of five MLDs: emotions, medical, scene, slashdot and yeast.
Their basic traits are those shown in Table 5.1. The tables in Chap. 3 provide the
main characteristics of each one of them. Looking at the TCS values in Table 5.1,
these MLDs' complexity ranges from the 9.364 score of emotions to the 15.629 of
medical.
As in the previous chapter, each MLD was partitioned using a 2 × 5 fold cross-
validation scheme. The prediction performance is assessed with four evaluation met-
rics. Since some of the classifiers included in this experimentation produce a ranking
as output, in addition to the usual example-based HLoss (Hamming loss), three
ranking-based evaluation metrics have been selected, RLoss (Ranking loss), OneEr-
ror, and AvgPrecision (Average precision). The goal is to check how the performance
of the classifiers is reported by different evaluation metrics. All of them were detailed
in Chap. 3. The reported values in the following section are average values computed
from the 10 runs for each MLD/classifier.
The RLoss values are depicted in Fig. 5.2. As can be observed, the shape drawn by
the classifiers is quite similar to Fig. 5.1, although some larger differences can be
detected among the three best performing methods.
Once again, it can be stated that BP-MLL produces the worst classification results,
specifically when used with the more complex MLDs. In general, the line which
corresponds to the Rank-SVM algorithm seems to be the closest to the outer limit,
denoting that it is the best performer when the results are assessed with this evaluation
metric. This perception can be confirmed by examining the values provided
in Table 5.3. With the only exception of emotions, the best values always come
from the Rank-SVM method.
The raw OneError measures are those shown in Table 5.4, with best values highlighted
in bold. Rank-SVM achieves the best results with the three most complex MLDs.
Paradoxically, this algorithm shows its worst performance with the simplest dataset,
emotions.
The last metric used to assess the performance in this experimentation is AvgPrecision.
As explained in Chap. 3, it measures the average fraction of relevant labels that
are ranked above each relevant label in the prediction. Therefore, the larger the value
obtained, the better the classifier is working.
In this case, the values produced by the evaluation metric can be plotted without
any transformation, as has been done in Fig. 5.4. This plot closely resembles the one
in Fig. 5.3; thus, the conclusions drawn from the OneError metric can mostly be
applied to AvgPrecision as well.
The set of best values highlighted in Table 5.5 is also very similar to that of
Table 5.4. Rank-SVM produces the best results for the most complex MLDs again,
while it works quite poorly with the simplest one.
The overall conclusion that can be gathered from these experiments is that, of
the four tested algorithms, Rank-SVM tends to perform better in most cases. On
the contrary, BP-MLL usually produces the worst results, with few exceptions.
The other two multilabel classifiers, BRkNN and ML-Tree, are in between.
References
1. Agrawal, R., Srikant, R., et al.: Fast algorithms for mining association rules. In: Proceedings of
the 20th International Conference on Very Large Data Bases, VLDB’94, pp. 487–499. Morgan
Kaufmann (1994)
2. Agrawal, J., Agrawal, S., Kaur, S., Sharma, S.: An Investigation of fuzzy PSO and fuzzy SVD
based RBF neural network for multi-label classification. In: Proceedings of the 3rd International
Conference on Soft Computing for Problem Solving, SocProS’13, vol. 1, pp. 677–687. Springer
(2014)
3. Al-Otaibi, R., Kull, M., Flach, P.: Lacova: a tree-based multi-label classifier using label covari-
ance as splitting criterion. In: Proceedings of the 13th International Conference on Machine
Learning and Applications, ICMLA’14, pp. 74–79. IEEE (2014)
4. Ávila, J., Gibaja, E., Ventura, S.: Multi-label classification with gene expression programming.
In: Proceedings of the 4th International Conference on Hybrid Artificial Intelligence Systems,
HAIS’09, pp. 629–637. Springer (2009)
5. Boutell, M., Luo, J., Shen, X., Brown, C.: Learning multi-label scene classification. Pattern
Recogn. 37(9), 1757–1771 (2004)
6. Calvo-Zaragoza, J., Valero-Mas, J.J., Rico-Juan, J.R.: Improving knn multi-label classifica-
tion in prototype selection scenarios using class proposals. Pattern Recogn. 48(5), 1608–1622
(2015)
7. Chan, A., Freitas, A.A.: A new ant colony algorithm for multi-label classification with appli-
cations in bioinfomatics. In: Proceedings of the 8th Annual Conference on Genetic and Evo-
lutionary Computation, GECCO’06, pp. 27–34. ACM Press (2006)
8. Cheng, W., Hüllermeier, E.: Combining instance-based learning and logistic regression for
multilabel classification. Mach. Learn. 76(2–3), 211–225 (2009)
9. Cheng, W., Hüllermeier, E., Dembczynski, K.J.: Bayes optimal multilabel classification via
probabilistic classifier chains. In: Proceedings of the 27th International Conference on Machine
Learning, ICML’10, pp. 279–286 (2010)
10. Clare, A., King, R.D.: Knowledge discovery in multi-label phenotype data. In: Proceedings of
the 5th European Conference Principles on Data Mining and Knowledge Discovery, PKDD’01,
vol. 2168, pp. 42–53. Springer (2001)
11. De Comité, F., Gilleron, R., Tommasi, M.: Learning multi-label alternating decision trees from
texts and data. In: Proceedings of the 3rd International Conference on Machine Learning and
Data Mining in Pattern Recognition, MLDM’03, vol. 2734, pp. 35–49. Springer (2003)
12. Dorigo, M., Stützle, T.: Ant Colony Optimization. The MIT Press (2004)
13. Elisseeff, A., Weston, J.: A kernel method for multi-labelled classification. In: Advances in
Neural Information Processing Systems, vol. 14, pp. 681–687. MIT Press (2001)
14. Ferreira, C.: Gene expression programming in problem solving. In: Soft Computing and Indus-
try, pp. 635–653. Springer (2002)
15. Freund, Y., Mason, L.: The alternating decision tree learning algorithm. In: Proceedings of the
16th International Conference on Machine Learning, ICML’99, pp. 124–133 (1999)
16. Genga, X., Tanga, Y., Zhua, Y., Chengb, G.: An improved multi-label classification algorithm
BRkNN. J. Inf. Comput. Sci. 11(16), 5927–5936 (2014)
17. Ghamrawi, N., McCallum, A.: Collective multi-label classification. In: Proceedings of the 14th
ACM International Conference on Information and Knowledge Management, CIKM’05, pp.
195–200. ACM (2005)
18. Gonçalves, E.C., Plastino, A., Freitas, A.A.: A genetic algorithm for optimizing the label order-
ing in multi-label classifier chains. In: Proceedings of the 25th IEEE International Conference
on Tools with Artificial Intelligence, ICTAI’13, pp. 469–476. IEEE (2013)
19. Grodzicki, R., Mańdziuk, J., Wang, L.: Improved multilabel classification with neural networks.
In: Proceedings of the 10th International Conference on Parallel Problem Solving from Nature,
PPSN X, pp. 409–416. Springer (2008)
20. Kongsorot, Y., Horata, P.: Multi-label classification with extreme learning machine. In: Pro-
ceedings of the 6th International Conference on Knowledge and Smart Technology, KST’14,
pp. 81–86. IEEE (2014)
21. Li, X., Zhao, F., Guo, Y.: Conditional restricted boltzmann machines for multi-label learning
with incomplete labels. In: Proceedings of the 18th International Conference on Artificial
Intelligence and Statistics, AISTATS’15, pp. 635–643 (2015)
22. Liang, Q., Wang, Z., Fan, Y., Liu, C., Yan, X., Hu, C., Yao, H.: Multi-label classification based
on particle swarm algorithm. In: Proceedings of the 9th IEEE International Conference on
Mobile Ad-hoc and Sensor Networks, MSN’13, pp. 421–424. IEEE (2013)
23. Parpinelli, R.S., Lopes, H.S., Freitas, A.A.: Data mining with an ant colony optimization
algorithm. IEEE Trans. Evol. Comput. 6(4), 321–332 (2002)
24. Quinlan, J.R.: C4.5: Programs for Machine Learning (1993)
25. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions.
Mach. Learn. 37(3), 297–336 (1999)
26. Smolensky, P.: Information processing in dynamical systems: foundations of harmony theory
(1986)
27. Spyromitros, E., Tsoumakas, G., Vlahavas, I.: An empirical study of lazy multilabel classifi-
cation algorithms. In: Artificial Intelligence: Theories, Models and Applications, pp. 401–406.
Springer (2008)
28. Sucar, L.E., Bielza, C., Morales, E.F., Hernandez-Leal, P., Zaragoza, J.H., Larrañaga, P.: Multi-
label classification with bayesian network-based chain classifiers. Pattern Recogn. Lett. 41,
14–22 (2014)
29. Sun, L., Ji, S., Ye, J.: Hypergraph spectral learning for multi-label classification. In: Proceed-
ings of the 14th International Conference on Knowledge Discovery and Data Mining, ACM
SIGKDD’08, pp. 668–676. ACM (2008)
30. Tsoumakas, G., Xioufis, E.S., Vilcek, J., Vlahavas, I.: MULAN: a Java library for multi-label
learning. J. Mach. Learn. Res. 12, 2411–2414 (2011)
31. Ueda, N., Saito, K.: Parametric mixture models for multi-labeled text. In: Proceedings of the
15th Annual Conference on Neural Information Processing Systems, NIPS’02, pp. 721–728
(2002)
32. Veloso, A., Meira Jr, W., Gonçalves, M., Zaki, M.: Multi-label lazy associative classification.
In: Proceedings of the 11th European Conference on Principles and Practice of Knowledge
Discovery in Databases, PKDD’07, pp. 605–612. Springer (2007)
33. Wang, J., Feng, J., Sun, X., Chen, S., Chen, B.: Simplified constraints Rank-SVM for multi-
label classification. In: Proceedings of the 6th Chinese Conference on Pattern Recognition,
CCPR’14, pp. 229–236. Springer (2014)
34. Wu, Q., Ye, Y., Zhang, H., Chow, T.W., Ho, S.S.: Ml-tree: a tree-structure-based approach to
multilabel learning. IEEE Trans. Neural Networks Learn. Syst. 26(3), 430–443 (2015)
35. Xu, J.: Fast multi-label core vector machine. Pattern Recogn. 46(3), 885–898 (2013)
36. Zhang, M.: Multilabel neural networks with applications to functional genomics and text cat-
egorization. IEEE Trans. Knowl. Data Eng. 18(10), 1338–1351 (2006)
37. Zhang, M.: ML-RBF: RBF neural networks for multi-label learning. Neural Process. Lett. 29,
61–74 (2009)
38. Zhang, M., Zhou, Z.: ML-KNN: a lazy learning approach to multi-label learning. Pattern
Recogn. 40(7), 2038–2048 (2007)
Chapter 6
Ensemble-Based Classifiers
6.1 Introduction
The design of ensembles of classifiers, whether they are founded on binary, multiclass, or heterogeneous classification models, is one of the most common approaches to facing multilabel classification tasks. The use of sets of classifiers, along with a strategy to join their individual predictions, has proven to be very effective in traditional classification, so it was a clear path to explore for classifying multilabel data as well.
Some of the transformation-based and adaptation-based methods discussed in the
previous chapters are ensembles by themselves. For instance, the basic BR transfor-
mation could be considered as a simple ensemble of binary classifiers with a very
straightforward strategy to fuse their individual predictions.
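To make the idea concrete, the following is a minimal sketch of BR seen as an ensemble. It assumes the labels are given as a binary indicator matrix and uses scikit-learn's DecisionTreeClassifier as a stand-in base learner; the class and attribute names are illustrative, not part of any existing package.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SimpleBR:
    """Binary Relevance viewed as an ensemble: one binary model per label."""

    def fit(self, X, Y):
        # Y is an (n_samples, n_labels) binary indicator matrix
        self.models_ = [DecisionTreeClassifier().fit(X, Y[:, j])
                        for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        # Fuse the independent predictions by pasting them into a labelset
        return np.column_stack([m.predict(X) for m in self.models_])
```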
In addition to the simple ensembles mentioned in the previous chapters, this one provides a review of more advanced ensembles based on binary classifiers, multiclass classifiers, and other kinds of combinations. Almost a dozen proposals found in the literature will be laid out. Four of them will be experimentally tested and compared.
The basic BR data transformation, which trains an independent binary classifier for each label and then pastes their predictions together to produce the predicted labelset (see Sect. 4.3), is maybe the simplest MLC ensemble. There are many other ensembles also based on binary classifiers, some following an OVA strategy and others resorting to OVO techniques.
In Sect. 4.3.2, once the BR basic transformation was described, some simple
ensembles based on binary classifiers, such as 2BR [1], BR+ [2] and CC [3], were
also introduced. The aim of this section is to complete the overview of MLC methods
based on ensembles of binary classifiers.
One of the weaknesses of CC, already mentioned in Chap. 4, is the fact that the classifiers that make up the chain are arranged in a specific order. Since the layout of the chain determines the information given to each classifier in it, any change in the order can have a great impact on the final results. This is the reason why the authors propose in the same paper [3] the ECC (Ensemble of Classifier Chains) method, an ensemble of CC classifiers using a different ordering of the binary models in each one of them.
Since CC is an ensemble by itself, ECC can be seen as an ensemble of ensembles. If all training data were used in each CC unit, building the full ECC model would take a very long time for large MLDs. Therefore, only a random subset of the training samples is used in each CC, and the order of labels in each of them is also randomly established. Although CC is inherently a sequential algorithm, since each binary classifier needs the outputs of the previous ones in the chain, ECC can be easily parallelized.
Classifying a new instance with ECC implies going through all the CC classifiers,
obtaining a set of predictions for each label. Those are taken as votes for each label,
producing a label ranking. Lastly, a threshold is applied to obtain the final multilabel
prediction.
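The train/vote/threshold cycle just described can be condensed into a short sketch. The code below makes several simplifying assumptions: scikit-learn's DecisionTreeClassifier plays the role of the binary base learner, and the 67% sampling ratio and 0.5 decision threshold are illustrative defaults rather than values prescribed by the original proposal.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_chain(X, Y, order):
    """Train one classifier chain following the given label order."""
    models, Xext = [], X
    for j in order:
        models.append(DecisionTreeClassifier().fit(Xext, Y[:, j]))
        # At training time, the true label values feed the next link
        Xext = np.column_stack([Xext, Y[:, j]])
    return models

def predict_chain(models, order, X):
    pred, Xext = np.zeros((X.shape[0], len(order))), X
    for model, j in zip(models, order):
        pred[:, j] = model.predict(Xext)
        # At prediction time, the chain consumes its own outputs
        Xext = np.column_stack([Xext, pred[:, j]])
    return pred

def ecc_predict(X, Y, Xtest, n_chains=10, sample_ratio=0.67,
                threshold=0.5, seed=0):
    """ECC sketch: average the votes of the chains, then threshold them."""
    rng = np.random.default_rng(seed)
    votes = np.zeros((Xtest.shape[0], Y.shape[1]))
    for _ in range(n_chains):
        idx = rng.choice(X.shape[0], size=int(sample_ratio * X.shape[0]),
                         replace=False)
        order = rng.permutation(Y.shape[1])  # random chain layout
        votes += predict_chain(train_chain(X[idx], Y[idx], order),
                               order, Xtest)
    return (votes / n_chains >= threshold).astype(int)
```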
The authors also propose in the same paper an improvement of RPC named
CMLPC (Calibrated Pairwise Multilabel Perceptron). This algorithm is based on
a previous method called MLPC, which relies on a perceptron to make pairwise
comparisons among labels.
The second main data transformation approach is LP, and it has also served as the foundation of several MLC ensemble methods. LP by itself transforms any MLD into a multiclass dataset, so the problem can be faced with only one classifier and there is no need for an ensemble, as in BR. However, the basic LP approach has to deal with other kinds of problems, such as the huge number of potential label combinations and, therefore, the scarcity of samples representing each produced class.
In order to alleviate the combinatorial explosion when LP is applied to MLDs
having large sets of labels, the PS method (see Sect. 4.4.1) proposes a pruning strategy
that gets rid of infrequent combinations. Most MLC ensembles based on multiclass
classifiers follow a similar approach, getting subsets of labels to reduce the number
of obtained combinations.
The Ensemble of Pruned Sets [6] method is based on the PS algorithm described in Sect. 4.4.1. It trains m (a user-definable parameter) independent PS classifiers, using a random subset of the training data, typically 63% of the available samples, for each one.
The most remarkable novelty of EPS is the introduction of a voting system that allows the prediction of new label combinations, even though they do not appear in the training data. The predictions of each PS classifier are combined, and a threshold is applied to decide which labels will be relevant in the final labelset. Therefore, EPS achieves the completeness that LP and PS lack.
Random k-Labelsets [7] is a method that generates random subsets of labels, training a multiclass classifier for each subset. This way the problems described in Sect. 4.4.1 for the LP approach are mostly avoided, facing fewer potential combinations while retaining label correlation information.
RAkEL takes two essential parameters, m and k. The former sets the number of classifiers to train, and the latter the size of the labelsets to be generated. With k = 1 and m = |L|, RAkEL works as the BR transformation. At the opposite extreme, with m = 1 and k = |L|, the obtained model will be equivalent to LP. Intermediate values for both parameters are the interesting ones, producing a set of votes for each label from which a label ranking is derived. A minimal sketch of this procedure follows.
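The sketch below illustrates the scheme under some assumptions: the label subsets are drawn independently, so repeated subsets are possible (the original method draws distinct labelsets), and scikit-learn's DecisionTreeClassifier acts as the multiclass base learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rakel_predict(X, Y, Xtest, k=3, m=10, threshold=0.5, seed=0):
    """RAkEL-style sketch: m LP classifiers over random k-labelsets."""
    rng = np.random.default_rng(seed)
    n_labels = Y.shape[1]
    votes = np.zeros((Xtest.shape[0], n_labels))
    times_drawn = np.zeros(n_labels)
    for _ in range(m):
        subset = rng.choice(n_labels, size=k, replace=False)
        # LP over the subset: each distinct combination becomes one class
        combos = [tuple(row) for row in Y[:, subset]]
        class_of = {c: i for i, c in enumerate(sorted(set(combos)))}
        clf = DecisionTreeClassifier().fit(
            X, np.array([class_of[c] for c in combos]))
        combo_of = {i: np.array(c) for c, i in class_of.items()}
        pred = np.stack([combo_of[p] for p in clf.predict(Xtest)])
        votes[:, subset] += pred        # one vote per label in the subset
        times_drawn[subset] += 1
    ratio = votes / np.maximum(times_drawn, 1)
    return (ratio >= threshold).astype(int)
```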
• CDE: The authors of the ChiDep algorithm [9], described in Sect. 4.4.2, also proposed CDE (ChiDep Ensemble), an MLC ensemble based on ChiDep classifiers. The method starts by generating a large set of random label combinations; then, the χ² test is computed for each of them. A parameter m establishes how many of these labelsets will be used to train the individual classifiers in the ensemble, each one following the ChiDep approach.
• RF-PCT: This method, introduced in [10], relies on multi-objective decision trees as base classifiers. Each tree, a variant of PCTs (Predictive Clustering Trees), is able to predict several outputs at once. The ensemble is built following the random forest strategy, hence the RF-PCT (Random Forest of Predictive Clustering Trees) denomination. As usual in random forests, bagging techniques are used to boost diversity among the obtained classifiers, using different subsets of the input attributes for each tree.
• EML: Unlike other MLC ensembles, the EML (Ensemble of Multilabel Learners) proposal [11] relies on a heterogeneous set of MLC classifiers, instead of a homogeneous group of binary, multiclass, or other kinds of classifiers. EML uses the training data to prepare five different models, namely ECC, ML-kNN, IBLR-ML, RAkEL, and CLR. Then, their outputs are combined using different voting strategies. The main drawback of EML is its computational complexity, since several of its underlying classifiers, such as ECC, RAkEL, and CLR, are ensembles by themselves.
• CT: Although proposals such as ECC and BCC, both based on classifier chains, achieve good predictive performance, the process of discovering the best structure for the chains is time-consuming. The method proposed in [12], named CT (Classifier Trellis), sets a trellis as an a priori structure for the ensemble, instead of inferring it from the data. Over this fixed structure, the order of the labels is adjusted relying on a simple heuristic that evaluates the frequency of label pairs. The goal was to improve the scalability of chain-based methods while maintaining their performance.
6.5 Ensemble Methods in Practice
Once most of the multilabel ensemble methods currently available have been introduced, the goal in this section is to put some of them into practice. The following four, maybe the most popular ones, have been selected. All of them are available in the MULAN or MEKA software packages described in Chap. 9.
• ECC: It is one of the first binary-based multilabel ensembles that takes into account
label dependencies, and it has been the foundation for some others.
• EPS: This labelset-based ensemble, founded on the PS transformation method, is
able to predict label combinations that have not been seen in the training data.
• RAkEL: A classical ensemble algorithm to face multilabel classification using
random subsets of labels.
• HOMER: It is a multilabel ensemble specifically designed to work with MLDs
having large sets of labels.
These four multilabel ensembles have been tested using the experimental configuration explained in the following section. The results obtained are presented and discussed in Sects. 6.5.2 and 6.5.3.
The four multilabel ensemble methods selected for this experimentation are transformation-based, so they need an underlying classifier to accomplish each binary or multiclass classification task. As was done to test the multilabel transformation-based methods described in Chap. 4, the C4.5 [16] tree induction algorithm has been chosen for this job. Default or recommended values were used for all the methods' parameters.
Each multilabel ensemble method was run using five MLDs. These are corel16k, reuters,1 scene, slashdot, and yeast. The main traits of these datasets can be found in the tables provided in Chap. 3. The basic ones are provided in Table 6.1. According to their TCS values (see the full list in Table 3.5), scene would be the simplest MLD in this group, while corel16k would be the most complex.
As in the previous chapters, each MLD was partitioned with a 2 × 5 folds cross-validation scheme. In order to assess the classification performance of each ensemble, three example-based metrics have been used: HLoss (Hamming loss), F-measure, and SubsetAcc (Subset accuracy). The details about these metrics can also be found in Chap. 3. Average values were obtained for each metric across the runs for each MLD/algorithm combination.
1 The version of this MLD having the 500 most relevant features selected was used.
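As a reference, the three metrics are straightforward to compute from binary label matrices. The following is a minimal sketch, assuming 0/1 integer matrices for the true and predicted labelsets; treating an instance with empty true and predicted labelsets as a perfect match in the F-measure is an assumption, and the function names are illustrative.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    # Fraction of wrongly predicted label assignments (to be minimized)
    return np.mean(Y_true != Y_pred)

def example_based_f_measure(Y_true, Y_pred):
    # Per-instance F1 = 2|Y ∩ Z| / (|Y| + |Z|), averaged over instances
    inter = np.sum(Y_true & Y_pred, axis=1)
    denom = Y_true.sum(axis=1) + Y_pred.sum(axis=1)
    safe = np.where(denom > 0, denom, 1)
    return np.mean(np.where(denom > 0, 2 * inter / safe, 1.0))

def subset_accuracy(Y_true, Y_pred):
    # Fraction of instances whose whole labelset is predicted exactly
    return np.mean(np.all(Y_true == Y_pred, axis=1))
```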
In addition to predictive performance metrics, the training and testing times for each method have also been obtained. This makes it possible to compare the ensembles from another point of view: the time spent on generating them and the speed with which they are able to provide predictions for new data.
The analysis of classification results starts with the HLoss metric, as usual. The HLoss values for each ensemble and MLD have been plotted as bars in Fig. 6.1. Remember that HLoss is a measure to be minimized, so the taller the bar, the worse the classifier is performing.
At first sight, it seems that the worst results correspond to scene and yeast, the two least complex MLDs, while with corel16k, which has the highest complexity (according to its TCS value), apparently all the ensembles behave quite well. However, it must be taken into account that HLoss is also heavily influenced by the number of labels, since it averages errors over all of them; with many labels, most of them inactive in each instance, the resulting values tend to be small. corel16k has many more labels than scene or yeast, and that is the reason behind these results. Nonetheless, HLoss values belonging to the same MLD but different classifiers can be compared. This makes it possible to infer that EPS is almost always performing better than the other ensembles. On the contrary, the results for ECC put it as the weakest ensemble.
The exact HLoss values are shown in Table 6.2. As always, the best values have been highlighted in bold. They confirm that EPS achieves the best results, with the only exception of slashdot, where RAkEL wins by a small margin.
Fig. 6.1 Classification results assessed with Hamming Loss (lower is better)
In the F-measure plot (Fig. 6.2), the values for corel16k are almost negligible. The results for EPS stand out with reuters, scene and yeast. The values with slashdot are quite similar for the four methods.
Fig. 6.3 Classification results assessed with Subset Accuracy (higher is better)
Examining the raw SubsetAcc values, provided in Table 6.4, it can be seen that the results are not as conclusive as they were with F-measure. EPS is still the ensemble that achieves the most best values, but ECC works better with slashdot and RAkEL with corel16k. It must be taken into account that, by the nature of the SubsetAcc metric, the number of labels in each MLD has a remarkable impact on these results.
Ensemble methods usually demand much more training time than standard classification algorithms, since they have to train several models. Their structure also influences the time spent producing predictions for new data samples, going through each individual classifier and putting together the votes to obtain a unified output. For these reasons, it is interesting to analyze training and testing times for each case.
Figure 6.4 shows training times for each MLD and ensemble, measured in seconds. The training times for the datasets with fewer data samples, scene and yeast, are almost imperceptible in this graphical representation. For the other three MLDs, RAkEL and EPS emerge as the methods needing the most training time. On the other hand, HOMER seems to be the most efficient algorithm. This fact can be verified by looking at the exact values reported in Table 6.5. It can be observed that for some MLDs, such as slashdot, HOMER only needs a fraction of the time used by the other ensembles. This is consistent with the design of HOMER, whose authors aimed to build an ensemble able to process large MLDs.
Regarding the testing times, they are, as would be expected, much lower than the training ones. In fact, those corresponding to some MLDs are almost undetectable in Fig. 6.5.
Table 6.6 provides the raw values for each MLD/classifier combination. The conclusion that can be drawn from these numbers is that HOMER is the quickest ensemble, in both training and testing. However, it never achieves the best classification results in the succinct experimentation conducted in this chapter. On the other hand, EPS always produces good classification performance, but its training and testing times are among the worst. In this case, efficiency and predictive performance seem to be conflicting goals.
One of the most popular paths to face multilabel classification lies in training a set of relatively simple classifiers, then joining their predictions to obtain the final output. Several MLC ensemble-based solutions have been described in the preceding sections of this chapter. These can be considered more advanced proposals than the ensembles referenced in the previous chapters.
References
1. Tsoumakas, G., Dimou, A., Spyromitros, E., Mezaris, V., Kompatsiaris, I., Vlahavas, I.:
Correlation-based pruning of stacked binary relevance models for multi-label learning. In:
Proceedings of 1st International Workshop on Learning from Multi-Label Data, MLD’09, pp.
101–116 (2009)
2. Alvares-Cherman, E., Metz, J., Monard, M.C.: Incorporating label dependency into the binary
relevance framework for multi-label classification. Expert Syst. Appl. 39(2), 1647–1655 (2012)
3. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification.
Mach. Learn. 85, 333–359 (2011)
4. Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K.: Label ranking by learning pairwise
preferences. Artif. Intell. 172(16), 1897–1916 (2008)
5. Fürnkranz, J., Hüllermeier, E., Loza Mencía, E., Brinker, K.: Multilabel classification via
calibrated label ranking. Mach. Learn. 73, 133–153 (2008)
6. Read, J., Pfahringer, B., Holmes, G.: Multi-label classification using ensembles of pruned
sets. In: Proceedings of 8th IEEE International Conference on Data Mining, ICDM’08, pp.
995–1000. IEEE (2008)
7. Tsoumakas, G., Vlahavas, I.: Random k-Labelsets: An ensemble method for multilabel classi-
fication. In: Proceedings of 18th European Conference on Machine Learning, ECML’07, vol.
4701, pp. 406–417. Springer (2007)
8. Tsoumakas, G., Katakis, I., Vlahavas, I.: Effective and efficient multilabel classification in
domains with large number of labels. In: Proceedings of ECML/PKDD Workshop on Mining
Multidimensional Data, MMD’08, pp. 30–44 (2008)
9. Tenenboim-Chekina, L., Rokach, L., Shapira, B.: Identification of label dependencies for multi-
label classification. In: Working Notes of the Second International Workshop on Learning from
Multi-Label Data, MLD’10, pp. 53–60 (2010)
10. Kocev, D., Vens, C., Struyf, J., Džeroski, S.: Ensembles of multi-objective decision trees. In:
Proceedings of 18th European Conference on Machine Learning, ECML’07, pp. 624–631.
Springer (2007)
11. Tahir, M.A., Kittler, J., Bouridane, A.: Multilabel classification using heterogeneous ensemble
of multi-label classifiers. Pattern Recogn. Lett. 33(5), 513–523 (2012)
12. Read, J., Martino, L., Olmos, P.M., Luengo, D.: Scalable multi-output label prediction: from
classifier chains to classifier trellises. Pattern Recogn. 48(6), 2096–2109 (2015)
13. Madjarov, G., Gjorgjevikj, D., Džeroski, S.: Dual layer voting method for efficient multi-label
classification. In: Proceedings of 5th Iberian Conference, IbPRIA’11, pp. 232–239. Springer
(2011)
14. Mencía, E.L., Park, S., Fürnkranz, J.: Efficient voting prediction for pairwise multilabel clas-
sification. Neurocomputing 73(7), 1164–1176 (2010)
15. Quevedo, J.R., Luaces, O., Bahamonde, A.: Multilabel classifiers with a probabilistic thresh-
olding strategy. Pattern Recogn. 45(2), 876–883 (2012)
16. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993)
Chapter 7
Dimensionality Reduction
7.1 Overview
Classification is a data mining task which aims to recognize the category of new patterns. To do so, the algorithms use some kind of model to compare the input features of the new sample with those of patterns already observed. The presence of redundant and irrelevant features impacts these algorithms, making them slower, due to the higher dimensionality, and usually also less precise, because of the unrelated information contributed by irrelevant attributes. In the multilabel field, high dimensionality in the input space is common, and it is almost always accompanied by high dimensionality in the output space as well.
The term curse of dimensionality was introduced in [1]. It denotes the problems derived from the presence of many variables (dimensions). As the number of dimensions increases, so does the volume of the solution space. As a consequence, data points in this volume become sparser as the dimensions grow, and distances between them tend to be less significant. Thus, to draw meaningful conclusions a larger collection of data points is generally needed. Otherwise, the accuracy of the estimations made by most algorithms will quickly degrade.
In the data mining area, the curse of dimensionality term is used to refer to problems where there are many input attributes, a trait that challenges most pattern recognition methods [14]. The raw data collected to generate new datasets usually include superfluous features, with redundant information that can be derived from other attributes, as well as irrelevant data which are not meaningful to solve the problem at hand. Manually filtering the attributes, in order to choose those which really provide relevant information, is usually very expensive. Therefore, automatic methods to accomplish this task have been designed over the years.
In traditional classification, the problem of high dimensionality is always linked to the input feature space, since there is only one output attribute. However, the perspective is totally different when working with multilabeled data. It is usual to work with MLDs having hundreds or even thousands of labels, thus having to deal with a high-dimensional output space [3]. The difficulties produced by this configuration are similar to those of a high-dimensional input space, implying more time to build the classifiers and usually a degraded predictive performance. However, the methods applied to reduce the input feature space are not appropriate for reducing the output label space.
This section briefly introduces the main obstacles derived from high-dimensional
input and output spaces in multilabel datasets. In Sect. 7.2, several methods described
in the literature to perform feature selection over multilabel data are portrayed,
including some experiments. Section 7.3 is devoted to label reduction techniques,
also comprising experimental results.
MLC techniques are used in fields such as text categorization and image labeling. These have a common characteristic: the large number of features extracted from the text, images, music, etc. Most MLDs have several hundreds, sometimes thousands, of attributes. The dimensionality of the input space is usually one or more orders of magnitude above that of traditional classification datasets. In general, the search space in MLC tasks is much bigger than in binary or multiclass ones.
Since the selection of the best subset of features [5] is mostly faced as a combinatorial problem, the larger the number of attributes in the dataset, the more time will be needed to find the ideal subset. In fact, evaluating all possible combinations can be unfeasible for many MLDs. Feature selection methods based on heuristics or statistical approaches can be applied in these cases.
An added barrier to reducing the input feature space in MLDs is derived from their own nature. Since there are multiple outputs to consider, instead of only one, determining the significance of every input attribute to each joint labelset is not an easy job. In many cases, the number of outputs can be larger than the number of inputs. This specific circumstance, along with the large set of input features in most MLDs, makes the problem of feature selection in MLC harder than usual.
Some of the problems of dealing with many labels are, to a certain extent, analogous to those faced in the input attribute space. A high-dimensional output space implies more complex models, for instance:
• All the algorithms based on binarization techniques, such as the ones described in Chap. 4, would demand much more time to be built with large sets of labels. In this case, the complexity will increase linearly with respect to the number of labels with OVA approaches and quadratically with OVO, which needs one classifier per label pair. For instance, with |L| = 100 labels an OVA ensemble trains 100 binary models, whereas OVO would need 100 · 99/2 = 4950.
• LP-based methods (see Chap. 4) would also face some obstacles. As the set of
labels grows, the number of combinations increases exponentially. As a conse-
quence, usually there are only a few data samples as representatives of each label
combination. The sparseness tends to degrade the performance of many methods.
• As stated in Chap. 6, most ensemble MLC methods are based on BR and LP
transformations, so the difficulties just described in the previous two bullets are
also applicable in this case.
• Most classifiers have to build a model from the input features taking into account their relationship with the output labels. One of the few exceptions to this rule is the family of methods based on nearest neighbors. In general, the larger the set of labels, the bigger and more complex that model will be.
The previous ones are problems that do not exist in traditional classification, so
they open a new variety of preprocessing tasks that have to be specifically designed
for multilabel data.
As a rule of thumb, reducing the output space, that is, the number of labels, will contribute to decreasing the time and memory needed to train most classifiers. In addition, the resulting models will be simpler and could reduce overfitting, generalizing better. However, labels cannot be simply removed from the data. Therefore, the usual techniques followed to perform feature selection generally are not useful in the label space.
Over the years, many methods aimed at removing redundant and irrelevant features while preserving the most essential properties of the data have been designed. Many of them can be applied to multilabel data, directly or following some kind of data transformation.
Besides computational complexity, and the aforementioned fact that more features imply the need for more data instances, working in a high-dimensional input space also makes data visualization tasks difficult. Data with more than three dimensions can hardly be represented in a graphical way, even though visualization is an important tool in exploratory data analysis. Feature selection methods can be a way to simplify the visualization of high-dimensional data.
• Filter versus wrapper versus embedded: Filter methods evaluate the quality of each feature using measures that are independent of any classifier, whereas wrapper methods assess the subset of features internally using a given classifier in the evaluation process. The latter approach is therefore classifier dependent, and it is able to infer attribute interactions with a specific classifier. A third path is embedded algorithms, which perform feature selection as part of the model training operation instead of as a separate stage. Decision trees and Random forests [2] work as embedded feature selection methods. Usually filter methods are more efficient, as they need less time to do their work, while wrappers provide better performance.
• Linear versus nonlinear: Many of the most commonly used feature selection algorithms, such as the aforementioned PCA and LDA, are based on linear statistical methods such as linear regression. For instance, in PCA a linear transformation is applied over the original features, projecting them into a lower dimensionality space. Nonlinear methods [9] tend to provide better performance than linear ones, but at the expense of a higher computational complexity. Some of them are based on the kernel trick and manifold embeddings.
Unsupervised dimensionality reduction methods rely on the analysis of the redun-
dant information provided by the input features, for instance determining if some
of them are linear combinations of others. Since they do not use the class label,
these methods can be applied right out of the box to multilabel data. Supervised
approaches, on the contrary, have to be adapted in some way to take into account
the presence of multiple output labels. A general overview of feature selection and
extraction methods can be found in [10, 18]. The following section is focused on
algorithms specifically adapted to work with MLDs.
Traditional supervised feature selection and extraction methods, such as LDA or CCA, are designed to observe only one output class. The goal usually is to elucidate how much useful information each input feature carries to distinguish between the values of the class, obtaining a feature ranking from which a certain amount of them can be removed. So, there is a set of independent variables and only one dependent variable. The quality of each feature or subset of features is evaluated either by using an extensive collection of metrics in filter methods, such as Information gain and Relief [16], or according to the estimation provided by the underlying learning algorithm in wrapper methods. The scenario is a bit more complex in the multilabel field, since there is also a set of dependent variables, the labels, instead of only one.
It is possible to apply traditional feature selection methods to multilabel data,
following some of the transformation approaches described in Chap. 4. For instance:
• Binary relevance: Firstly, the MLD is transformed into a set of binary datasets. Then, the contribution of the features to each individual label is evaluated. Lastly, the average contribution of the features to all labels is computed, and those that are above a certain threshold are chosen.
• Label powerset: After transforming the MLD into a multiclass dataset, by taking each label combination as a class identifier, the feature selection algorithm is applied as usual. Whereas the BR-based approach does not consider potential label dependencies, the LP-based one implicitly includes this information in the process.
The use of these transformation-based methods, along with Information gain and
Relief feature weighting measures, was reviewed in [21]. The conducted experimen-
tation showed high variations in the number of features selected by each approach.
Some transformation-based algorithms, such as the mutual information-based fea-
ture selection for multilabel classification introduced in [7], remove rare label com-
binations generated by the LP transformation and prune the space to explore for
dependencies.
In addition to classic unsupervised methods and supervised ones implemented
through transformation techniques, many others have been proposed to accomplish
dimensionality reduction for MLDs. Some of them are briefly described below:
• MI-based feature selection using interaction information: Presented in [17],
this method is based on mutual information between input features and output
labels, but without performing any data transformation. The authors introduce a
score metric able to measure interactions among inputs and outputs of any degree,
analyze the bounds of this score, and use them in the algorithm to obtain the most
relevant features.
• Multilabel informed latent semantic indexing: This proposal [28] follows the unsupervised approach, as does the LSI (Latent Semantic Indexing) method it is based on, which is well known in the information retrieval field. However, the information related to labels is incorporated by applying to the label space exactly the same projection obtained for the feature space.
• Memetic feature selection algorithm for multilabel classification: Genetic algorithms (GAs) are well known for their ability to find near-optimum solutions, based on a fitness function and evolution principles such as mutation and inheritance between one population and the following one. They also have some drawbacks, such as premature convergence and large computation times. The method proposed in [17] overcomes these obstacles by resorting to memetic techniques, extracting the most relevant features with better efficiency and efficacy than GAs.
• Multilabel dimensionality reduction via dependence maximization: The authors of this method, introduced in [30], aimed to project the feature space into a lower dimensional space, like many other feature selection methods, while maximizing the dependence between the preserved features and the corresponding class labels. To do so, they founded their method on the Hilbert–Schmidt independence criterion as the metric to measure the dependencies among input attributes and output labels.
• Hypergraph spectral learning: Hypergraphs are graphs able to capture relations in high-dimensional spaces. In [22], a method to model the correlations among labels, based on a hypergraph, is proposed, along with a procedure to mitigate the computational complexity. The authors state that the result is equivalent
7.2.3 Experimentation
Most MLDs have large feature spaces, commonly much larger than binary and multiclass datasets. Feature engineering methods are usually applied to MLDs looking for two goals: a significant time reduction in the training and classification processes, and an improvement in the obtained predictions. In this section, a simple feature selection algorithm is experimentally tested, aiming to verify the attainment of the aforementioned goals.
Five MLDs with a large number of input attributes have been selected. Their basic traits are shown in Table 7.1. One of them, scene, has a few hundred features, while the others have more than a thousand. All of them were preprocessed with a BR-Information gain feature selection algorithm. This procedure independently evaluates the quality of each feature with respect to each label, measuring the information gain. This is the same measure used by C4.5 to determine the best feature to divide a branch while inducing a tree. The scores for each feature across all labels are averaged, obtaining a feature ranking from which the best half is chosen. Therefore, the preprocessed versions of the MLDs have half as many input attributes as the original ones.
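A rough sketch of this scoring-and-ranking procedure is shown below. Scikit-learn's mutual_info_classif is used as a stand-in for the Information gain measure, and the function names are illustrative.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def br_feature_scores(X, Y):
    """Average per-label feature relevance, following the BR approach."""
    scores = np.zeros(X.shape[1])
    for j in range(Y.shape[1]):
        # Mutual information plays the role of Information gain here
        scores += mutual_info_classif(X, Y[:, j])
    return scores / Y.shape[1]

def keep_best_half(X, scores):
    # Rank the features and retain the best-scoring half
    kept = np.sort(np.argsort(scores)[X.shape[1] // 2:])
    return X[:, kept], kept
```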
Both the original MLDs and the preprocessed versions were given as input to three MLC methods: BR, LP, and CC. These have already been described in the previous chapters. The first one is a binary transformation, the second a label combination transformation, and the last one an ensemble of binary classifiers. Thus, classification results before and after preprocessing the MLDs have been obtained; they are shown in Table 7.2.
Table 7.2 Classification results before and after feature space reduction
Classifier Dataset HL Before HL After FM Before FM After
BR bibtex 0.0146 0.0137 0.6376 0.6615
enron 0.0555 0.0535 0.5808 0.5931
genbase 0.0012 0.0013 0.9908 0.9908
medical 0.0098 0.0100 0.9259 0.9229
scene 0.1581 0.1670 0.8568 0.8589
LP bibtex 0.0204 0.0192 0.6113 0.6294
enron 0.0776 0.0745 0.5138 0.5291
genbase 0.0018 0.0012 0.9918 0.9924
medical 0.0132 0.0122 0.9322 0.9369
scene 0.1658 0.1679 0.9547 0.9453
CC bibtex 0.0147 0.0137 0.6494 0.6718
enron 0.0574 0.0558 0.5893 0.6053
genbase 0.0012 0.0013 0.9908 0.9908
medical 0.0101 0.0106 0.9323 0.9314
scene 0.1658 0.1708 0.9378 0.9419
The total number of labels in an MLD, as well as the cardinality and other label-related characterization metrics introduced in Chap. 3, has a great influence on the classifiers' behavior. Having more labels involves more individual classifiers for all the binarization-based methods, such as BR, CC, or RPC. A higher cardinality suggests the existence of more label combinations, impacting the LP-based methods, such as PS, RAkEL, or HOMER.
Intuitively, it seems that reducing the number of labels could help in decreasing the time needed to build the models and maybe improve their predictive performance. However, labels cannot be simply removed from an MLD, as the feature selection methods do with input attributes, since all labels have to be present in the prediction provided by the classifier. Hence, any label reduction method has to be able to recover or rebuild the labelsets to preserve their original structure.
Unlike feature selection and extraction methods, which were already in place and had been researched for a long time in traditional classification, label space reduction techniques are something new, a specific problem in MLC that did not exist in binary and multiclass datasets. This is the reason why each proposed method, as will be further discussed, follows a different approach. Nonetheless, there are two key points to consider, label sparseness and label dependencies, whether separately or jointly.
Label sparseness is quite common in MLDs belonging to certain fields, for
instance in text categorization. The classification vocabulary, which determines the
words included as labels, tends to be very extensive, while only a few concepts
are usually associated with each document (data instance). In multilabel terms, this
means having MLDs with a large number of labels but a low cardinality; thus, the
active labels are sparse in the label space. There are algorithms that can make the
most of this sparseness, compressing the label space following a reversible approach.
One of the recurrently studied topics in the MLC field is label dependency. Some classification approaches, such as BR and similar methods, assume that labels are completely independent, and they perform quite well in many configurations. However, it is presumed that labels which tend to appear together have some kind of relation or dependency [6]. It is usual to differentiate between two dependency models:
• Unconditional dependence: It is the dependence that presumably exists between certain labels, no matter what the input feature values are. Yi and Yj being two labels, they are unconditionally dependent if P(Yi | Yj) ≠ P(Yi). Unconditional dependence provides correlation information about labels in the MLD as a whole; thus, it can be globally exploited while designing an MLC method.
• Conditional dependence: This is the kind of dependence among labels according
to the feature values of a specific data instance. Some algorithms, for instance
those based on label combinations such as LP, can take advantage of conditional
dependence between labels, producing a model able to use this locally available
information.
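As a quick illustration of the unconditional case, the gap between the conditional and marginal label frequencies can be estimated directly from a binary label matrix. This is only a sketch; in practice, a statistical test such as χ² would be used to judge whether the observed difference is significant.

```python
import numpy as np

def unconditional_dependence(Y, i, j):
    """Compare P(Yi | Yj) with P(Yi) on a binary label matrix Y."""
    p_i = Y[:, i].mean()
    mask = Y[:, j] == 1
    p_i_given_j = Y[mask, i].mean() if mask.any() else 0.0
    # A value far from zero suggests labels i and j are dependent
    return p_i_given_j - p_i
```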
The authors of [29] highlight that dimensionality reduction taking advantage of
label dependency information is one of the pending challenges in MLC. In the fol-
lowing section, some of the methods proposed in the literature for label space dimen-
sionality reduction, most of them based on label space sparsity and label dependency
principles, are depicted.
There are a handful of published methods whose aim is to reduce the dimensionality
of the label space in multilabel classification. Each one of them follows a different
approach to do so, with few exceptions. Some of the most remarkable are briefly
described below:
• Label subsets selection: Some algorithms, such as RAkEL [25] and HOMER [24] (see Chap. 6), work by splitting the full labelsets into several subsets. Founded on the LP data transformation technique, these methods train a group of classifiers, using a reduced set of labels for each one. RAkEL makes these subsets randomly, while HOMER relies on a hierarchical algorithm. Eventually, all the labels in the MLD are used to train some multiclass classifier, since the label space is not effectively reduced but only divided into groups.
• Pruning of infrequent labelsets: The PS [19] and EPS [20] methods (see Chap. 4), also based on the basic LP transformation approach, detect rare labelsets and prune them. This way, the main problem of LP-based algorithms, the combinatorial explosion as the number of labels gets larger, is relieved to some extent. Like RAkEL and HOMER, PS and EPS also rely on multiclass classifiers to produce their predictions. The pruning of infrequent labelsets avoids taking into account rare label combinations, but seldom reduces the number of labels.
• Kernel Dependency Estimation (KDE): KDE is a generic technique to find dependencies between a set of inputs and a set of outputs. In [26], it is proposed as a way to reduce the label space in MLC tasks. This method applies a principal component analysis to the original label space, obtaining a set of uncorrelated projections. After that, the number of outputs is reduced by keeping only the most significant ones, those with larger eigenvalues. The proposal is completed with a procedure to rebuild the original labelset once the reduced prediction has been obtained, based on a method to find the preimage of the projected version; a minimal sketch of this projection and rebuilding idea is given after this list.
• Compressed sensing (CS): The authors of [12] proposed a method to use CS, a compression technique, in order to reduce the dimensionality of the label space for multilabel classification. The original binary high-dimensional label space is projected into a real-valued, lower dimensional compressed space. Since the intermediate values are not binary but real, the authors resort to a regression algorithm to predict them, afterward recovering the original sparse labelsets through a decoding procedure.
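As an illustration of the projection and rebuilding steps shared by these proposals, the sketch below uses plain (non-kernel) PCA from scikit-learn over the binary label matrix; the 0.5 binarization threshold used to recover the labelsets is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

def compress_label_space(Y, d):
    """Project the binary label matrix Y onto its d main components."""
    pca = PCA(n_components=d).fit(Y)
    return pca, pca.transform(Y)  # real-valued, lower dimensional targets

def rebuild_labelsets(pca, Z_pred, threshold=0.5):
    """Approximate preimage: invert the projection and binarize."""
    return (pca.inverse_transform(Z_pred) >= threshold).astype(int)
```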
7.3.3 Experimentation
In Chap. 3, while studying the characteristics of several dozens of MLDs, it was stated that many of them have hundreds or even thousands of labels. Indeed, in some cases the set of labels can be larger than the set of input features. The implications of working with a high number of labels have already been mentioned. Therefore, by reducing this number an algorithm aims to improve classification results and usually also running time.
This final section intends to experimentally test one of the aforementioned label space reduction methods, assessing how it impacts the behavior of some MLC classifiers. The chosen method is LI-MLC, the algorithm which banks on association rules as a representation model of label dependency information. Since the goal is to analyze a label reduction algorithm, five MLDs with a relatively large set of labels have been used. These are cal500, corel16k, corel5k, delicious and imdb.
The last one only has 28 labels, while the other four have several hundreds of them,
with delicious reaching almost a thousand labels. The basic traits of these MLDs
are provided in Table 7.4. The full set of characteristics can be found in the tables in
Chap. 3.
The experimental configuration is the same used before for testing the feature selection method. Thus, classification results before and after applying LI-MLC with three different classifiers, BR, LP, and CC, and two evaluation metrics, HL and FM, have been analyzed. Furthermore, the running time of each version has also been obtained, including the time spent extracting the association rules.
Classifier performance has been represented in Fig. 7.3 as a set of bar plots. From observing the FM values (top half of the figure), it can be deduced that a general improvement in classification results has been achieved. The only exception seems
Table 7.4 Basic traits of MLDs used in label space reduction experimentation
Dataset n f k Card Dens TCS
cal500 502 68 174 26.044 0.150 15.597
corel16k 13 766 500 153 2.859 0.019 19.722
corel5k 5 000 499 374 3.522 0.009 20.200
delicious 16 105 500 983 19.017 0.019 22.773
imdb 120 919 1 001 28 2.000 0.071 18.653
to be the cal500 MLD. The lower half corresponds to HL measures; thus, lower values are better. Again, cal500 shows the worst performance, while the results for the other four MLDs suggest the same behavior shown by FM.
The exact values for each classifier and MLD, with and without the LI-MLC label selection method, are shown in Table 7.5. As can be seen, the values highlighted in bold for the FM metric all belong to the configuration which includes LI-MLC. As was anticipated, cal500 is the only exception, and its results show a slight worsening. The behavior when assessing with the HL metric is very similar. Only delicious is negatively affected in some cases, but the differences are almost negligible.
With regard to the running times, the relative gain (or loss) is presented in Fig. 7.4. Negative values indicate that the use of LI-MLC plus the classifier took longer than the classifier alone. This happened with imdb, the MLD with the largest number of samples: the association rule mining algorithm needed more time to analyze the instances than was saved by removing some labels. On the other hand, with corel16k the number of removed labels compensates for the time spent mining the rules; thus, the global running time is significantly improved. In general, BR and CC are more influenced by the reduction of labels than LP (Table 7.6).
Table 7.5 Classification results before and after label space reduction
Classifier Dataset HL Before HL After FM Before FM After
BR cal500 0.1615 0.1620 0.3375 0.3364
corel16k 0.0206 0.0203 0.5155 0.5790
corel5k 0.0098 0.0097 0.4697 0.4884
delicious 0.0186 0.0188 0.3097 0.3198
imdb 0.0753 0.0724 0.6602 0.7077
LP cal500 0.1996 0.2030 0.3626 0.3627
corel16k 0.0329 0.0318 0.5496 0.5663
corel5k 0.0168 0.0162 0.4859 0.5125
delicious 0.0300 0.0300 0.3290 0.3421
imdb 0.1024 0.0987 0.7209 0.7413
CC cal500 0.1760 0.1760 0.3287 0.3111
corel16k 0.0216 0.0210 0.4592 0.5001
corel5k 0.0099 0.0098 0.4031 0.4305
delicious 0.0188 0.0189 0.2599 0.2602
imdb 0.0878 0.0745 0.6375 0.6433
References
1. Bellman, R.: Dynamic Programming. P (Rand Corporation). Princeton University Press (1957)
2. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
3. Charte, F., Rivera, A., del Jesus, M.J., Herrera, F.: LI-MLC: a label inference methodology
for addressing high dimensionality in the label space for multilabel classification. IEEE Trans.
Neural Networks Learn. Syst. 25(10), 1842–1854 (2014)
4. Das, S.: Filters, wrappers and a boosting-based hybrid for feature selection. In: Proceedings of
18th International Conference on Machine Learning, ICML’01, pp. 74–81. Morgan Kaufmann
(2001)
5. Dash, M., Liu, H.: Feature selection for classification. Intell. Data Anal. 1(3), 131–156 (1997)
6. Dembszynski, K., Waegeman, W., Cheng, W., Hüllermeier, E.: On label dependence in multi-
label classification. In: ICML Workshop on Learning from Multi-label Data, pp. 5–12 (2010)
7. Doquire, G., Verleysen, M.: Mutual information-based feature selection for multilabel classi-
fication. Neurocomputing 122, 148–155 (2013)
8. Fisher, R.A.: The statistical utilization of multiple measurements. Ann. Eugenics 8(4), 376–386
(1938)
9. Guyon, I., Bitter, H.M., Ahmed, Z., Brown, M., Heller, J.: Multivariate non-linear feature selec-
tion with kernel multiplicative updates and Gram-Schmidt relief. In: Proceedings of Interna-
tional Joint Workshop on Soft Computing for Internet and Bioinformatics, BISC Flint-CIBI’03,
pp. 1–11 (2003)
10. Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L.A. (eds.): Feature Extraction: Foundations and
Applications. Springer (2008)
11. Hotelling, H.: Relations between two sets of variates. In: Breakthroughs in Statistics, pp. 162–
190. Springer (1992)
12. Hsu, D., Kakade, S., Langford, J., Zhang, T.: Multi-label prediction via compressed sensing.
In: Proceedings of 22th Annual Conference on Advances in Neural Information Processing
Systems, NIPS’09, vol. 22, pp. 772–780 (2009)
13. Jolliffe, I.: Principal Component Analysis. Springer Series in Statistics, vol. 1. Springer, Berlin
(1986)
14. Kira, K., Rendell, L.A.: The feature selection problem: traditional methods and a new algorithm.
In: Proceedings of 10th National Conference on Artificial Intelligence, AAAI’92, pp. 129–134.
AAAI Press (1992)
15. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artif. Intell. 97(1), 273–324
(1997)
16. Kononenko, I.: Estimating attributes: analysis and extensions of RELIEF. In: Machine Learn-
ing: ECML-94, pp. 171–182 (1994)
17. Lee, J.S., Kim, D.W.: Mutual information-based multi-label feature selection using interaction
information. Expert Syst. Appl. 42, 2013–2025 (2015)
18. Liu, H., Motoda, H.: Feature Selection for Knowledge Discovery and Data Mining, vol. 454.
Springer Science & Business Media (2012)
19. Read, J.: A pruned problem transformation method for multi-label classification. In: Proceed-
ings of New Zealand Computer Science Research Student Conference, NZCSRS’08, pp. 143–
150 (2008)
20. Read, J., Pfahringer, B., Holmes, G.: Multi-label classification using ensembles of pruned
sets. In: Proceedings of 8th IEEE International Conference on Data Mining, ICDM’08, pp.
995–1000. IEEE (2008)
21. Spolaor, N., Cherman, E.A., Monard, M.C., Lee, H.D.: A comparison of multi-label feature
selection methods using the problem transformation approach. Electron. Notes Theor. Comput.
Sci. 292, 135–151 (2013)
22. Sun, L., Ji, S., Ye, J.: Hypergraph spectral learning for multi-label classification. In: Proceedings
of 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
pp. 668–676. ACM (2008)
23. Sun, L., Ji, S., Ye, J.: Multi-Label Dimensionality Reduction. CRC Press (2013)
24. Tsoumakas, G., Katakis, I., Vlahavas, I.: Effective and efficient multilabel classification in
domains with large number of labels. In: Proceedings of ECML/PKDD Workshop on Mining
Multidimensional Data, MMD’08, pp. 30–44 (2008)
25. Tsoumakas, G., Vlahavas, I.: Random k-Labelsets: an ensemble method for multilabel classi-
fication. In: Proceedings of 18th European Conference on Machine Learning, ECML’07, vol.
4701, pp. 406–417. Springer (2007)
26. Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B., Vapnik, V.: Kernel dependency estima-
tion. In: Proceedings of 16th Annual Conference on Advances in Neural Information Processing
Systems, NIPS’02, vol. 15, pp. 873–880 (2002)
27. Wyse, N., Dubes, R., Jain, A.K.: A critical evaluation of intrinsic dimensionality algorithms.
Pattern Recogn. Pract. 415–425 (1980)
28. Yu, K., Yu, S., Tresp, V.: Multi-label informed latent semantic indexing. In: Proceedings of 28th
Annual International ACM SIGIR Conference on Research and Development in Information
Retrieval, pp. 258–265. ACM (2005)
29. Zhang, M., Zhou, Z.: A review on multi-label learning algorithms. IEEE Trans. Knowl. Data
Eng. 26(8), 1819–1837 (2014)
30. Zhang, Y., Zhou, Z.H.: Multilabel dimensionality reduction via dependence maximization.
ACM Trans. Knowl. Discovery Data (TKDD) 4(3), 14 (2010)
31. Zhao, Z., Liu, H.: Semi-supervised feature selection via spectral analysis. In: Proceedings of
7th SIAM International Conference on Data Mining, SDM’07, pp. 641–646 (2007)
32. Zhou, T., Tao, D., Wu, X.: Compressed labeling on distilled labelsets for multi-label learning.
Mach. Learn. 88(1–2), 69–126 (2012)
Chapter 8
Imbalance in Multilabel Datasets
Abstract The frequency of class labels in many datasets is not even. On the contrary, it is quite usual that a certain class appears in a large portion of the data samples while another is scarcely represented. This situation produces a problem generically known as class imbalance. Due to these differences between class distributions, a specific need arises: imbalanced learning. This chapter begins by introducing the mentioned task in Sect. 8.1. Then, the specific aspects of imbalance in the multilabel area are discussed in Sect. 8.2. Section 8.3 explains how imbalance in MLC has been faced, enumerating a considerable set of proposals. Some of them are experimentally evaluated in Sect. 8.4. Lastly, Sect. 8.5 summarizes the contents.
8.1 Introduction
• Data resampling: The dataset is preprocessed in order to balance the class distribution, removing samples belonging to the majority class, generating new samples associated with the minority class, or both. Being classifier independent, it is the most general solution to this problem. Nonetheless, there are also some resampling proposals for specific classifiers.
• Algorithm adaptation: This approach is classifier dependent. Its goal is to modify existing classification algorithms to take into account the imbalanced nature of the data to be processed. The usual procedure is based on reinforcing the learning of the minority class, biasing the classifier to recognize it.
• Cost-sensitive learning: Cost-sensitive classification is an approach which combines the two previous techniques. The data are preprocessed to balance the class distribution, while the learning algorithm is adapted to favor the correct classification of samples associated with the minority class. To do so, weights are associated with the instances, usually inversely proportional to the frequency of their class.
From these three ideas, many others have been derived, such as the combination
of data resampling and the use of ensembles of classifiers [11] as a more robust model
with certain tolerance to class imbalance.
Overall, imbalanced learning is a well-known and deeply studied task in binary classification, further extended to also cover the multiclass scenario. Imbalance in multilabeled data increases the complexity of the problem and of its potential solutions, since there are several class labels per instance. In the following, the specificities of imbalanced MLDs, the related problems, and the methods proposed to tackle them are described.
Most of the existing methods only consider the presence of one majority class and one minority class. This way, undersampling methods only remove samples from one class and, analogously, oversampling methods generate new instances associated with one class.
The number of labels in an MLD can go from a few dozen to several thousand. Only a handful of them have fewer than ten labels. Despite the fact that most MLDs have a large set of labels, the average number of active labels per instance (their cardinality) is seldom above 5. Some exceptions are cal500 (Card = 26.044) and delicious (Card = 19.017). With such a large set of labels and a low Card, it can be deduced that some labels will be underrepresented while others will be much more frequent. As a general rule, the more labels there are in an MLD, the higher the likelihood of having imbalance problems.
Another important fact, easily deducible from the very nature of MLDs, is that there is not a single majority label and a single minority one, but several of them in each group. This has different implications, affecting the way the imbalance level of an MLD has to be assessed.
The metrics related to imbalance measurement for MLDs were provided in Chap. 3 (see Sect. 3.3.2). Since there are multiple labels, this trait cannot be easily reduced to a single value. For that reason, a triplet of metrics was proposed in the study conducted in [5]:
• IRLbl: It is independently computed for each label. The value of this metric will be 1 for the most frequent label and higher for all the others. The larger the IRLbl, the less frequent the assessed label is in the MLD.
• MeanIR: By averaging the IRLbl over all labels in an MLD, its MeanIR is obtained. This value will typically be above 1. The higher the MeanIR, the more imbalanced labels there are in the MLD.
• CVIR: The MeanIR is intended to give a measurement of the amount of imbalanced labels in the MLD, but it is also influenced by extreme values. A few highly imbalanced labels can produce a high MeanIR, the same as a lot of less imbalanced labels. The CVIR, the coefficient of variation of IRLbl, indicates which of these situations is being assessed: large CVIR values denote a high variance in IRLbl.
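Following these definitions, the three metrics can be computed directly from a binary label matrix. A minimal sketch, which assumes that every label appears at least once in the MLD and that the sample standard deviation is used for CVIR:

```python
import numpy as np

def imbalance_metrics(Y):
    """IRLbl per label, MeanIR and CVIR from a binary label matrix Y."""
    counts = Y.sum(axis=0)               # per-label frequencies
    irlbl = counts.max() / counts        # 1 for the most frequent label
    mean_ir = irlbl.mean()
    cvir = irlbl.std(ddof=1) / mean_ir   # coefficient of variation of IRLbl
    return irlbl, mean_ir, cvir
```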
Besides the use of specific characterization metrics, such as the ones just described,
one of the best approaches to analyze label imbalance in MLDs is to visually explore
the data. In Fig. 8.1, the relative frequencies for the ten most frequent labels (left
side) and the ten least frequent ones (right side) in a dozen MLDs have been plotted.1
As can be observed, the difference between frequent and rare labels is huge. Even
among the most frequent labels, there are significant disparities, with one or two
labels having much more presence than the others. This pattern is common to many
MLDs. Therefore, the imbalance problem is almost intrinsically linked to multilabel
data.
1 The frequency (Y-axis) scale is individually adjusted to better show the relevance of labels in each MLD.
Fig. 8.1 Ten most frequent and ten least frequent labels in some datasets
Looking at the graphical representations in Fig. 8.1, as well as at the imbalance levels reported in the tables in Chap. 3, it seems legitimate to think that, by applying resampling methods as in traditional classification, the label distribution of the MLDs could be balanced.
Nonetheless, it is possible to limit the number of labels to show, choosing those with higher dependencies and producing plots such as the ones shown2 in Fig. 8.2.
Each arc in the external circumference represents a label. The arc's amplitude is proportional to the frequency of the label, so small arcs are associated with minority labels and, analogously, large arcs indicate majority labels. The width of the bands connecting two arcs denotes the number of samples in which that label pair appears together.
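The information encoded in this kind of plot boils down to a label co-occurrence matrix, which can be obtained with a single product over the binary label matrix. A small sketch (the actual plots in Fig. 8.2 were generated by the mldr R package, as noted in the footnote):

```python
import numpy as np

def label_cooccurrence(Y):
    """Pairwise co-occurrence counts behind the connecting bands."""
    # The diagonal holds each label's frequency (the arcs' amplitudes);
    # off-diagonal entries count how often each label pair appears together
    return Y.T @ Y
```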
On the basis of the specific characteristics associated with imbalanced MLDs, high-
lighted in the previous section, the design of algorithms capable of dealing with this
problem is a challenge. Three main approaches have been followed in the litera-
ture, classifier adaptation, resampling methods, and ensembles of classifiers. Most
of them are portrayed in the subsections below according to the aforementioned
categorization scheme.
One way to face the imbalance problem consists in adapting the classifier to take this aspect into consideration, for instance assigning weights to each label depending on its frequency. Obviously, it is a solution tightly attached to the adjusted algorithm. Although it is not a generally applicable approach, which can be seen as a disadvantage, the adaptation can strengthen the best points of a good classifier, something that a preprocessing method cannot do.
Some of the multilabel classifiers adapted to deal with imbalanced MLDs pro-
posed in late years are the following:
• Min–max modular with SVM (M3-SVM): This method, proposed in [8], relies on a Min–Max Modular network [17] to divide the original multilabel classification problem into a set of simpler tasks. Several strategies, following random, clustering, and PCA approaches, are used to guarantee that the imbalance level in these smaller tasks is lower than in the original MLD. The simpler tasks are always binary classification jobs, using SVM as base classifier. Therefore, the proposal can be seen as a combination of data transformation and method adaptation techniques.
• Enrichment process for neural networks: The proposal made in [25] is an adaptation of the training process for neural networks. This task is divided into three phases. The first one uses a clustering method to group similar instances and gets a balanced representation to initialize the neural network. In the second stage, the network is iteratively trained, as usual, while data samples are added to and removed from the training set according to their prevalence. The final phase checks whether the enrichment process has reached the stop condition or has to be repeated. This way, the overall balance of the neural network used as classifier is improved.
• Imbalanced multimodal multilabel learning (IMMML): In [14], the authors face an extremely imbalanced multilabel task, specifically the prediction of subcellular localization of human proteins. Their algorithm is based on a Gaussian process model, combined with latent functions on the feature space and covariance matrices to obtain correlations among labels. The imbalance problem is tackled by giving each label a weighting coefficient linked to the likelihood of labels on each sample. Therefore, it is a very specific solution to a definite problem, hardly applicable in a different context.
• Imbalanced multiinstance multilabel radial basis function neural networks (IMIMLRBF): It was introduced in [15] as an extension to the MIMLRBF algorithm [26], a multiinstance and multilabel classification algorithm based on radial basis function neural networks. The adaptation consists of two key changes. Firstly, the number of units in the hidden layer, which in MIMLRBF is constant, is computed according to the number of samples of each label. Secondly, the weights associated with the links between the hidden and output layers are adjusted, biasing them depending on the label frequencies.
The resampling approach is based on removing samples which belong to the majority label, adding samples associated with the minority label, or both actions at once. The way the instances to be removed are selected, and the technique used to produce new instances, usually follow one of two possible paths: the random approach or the heuristic approach. The former randomly chooses the data samples to delete, imposing the restriction that they have to belong to a certain label. Analogously, new samples are produced by randomly picking and cloning instances associated with a specific label. The latter path can rely on disparate heuristics to select the proper instances, as well as to generate new ones.
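As a toy illustration of the random path, the following sketch clones a few instances of an arbitrary minority label. X (a feature data.frame) and Y (a binary label matrix) are assumed inputs; this is not any of the published algorithms, just the underlying idea.

    # Random oversampling of one minority label (illustrative only)
    minority <- which.min(colSums(Y))                 # least frequent label
    candidates <- which(Y[, minority] == 1)           # samples containing it
    clones <- sample(candidates, 10, replace = TRUE)  # instances to clone
    X.balanced <- rbind(X, X[clones, ])               # append cloned features
    Y.balanced <- rbind(Y, Y[clones, ])               # and their labelsets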
Therefore, resampling methods can be grouped depending on the way they try
to balance the label frequency, removing or adding samples, and the strategy to do
so, randomly or heuristically. There are quite a few proposals based on resampling
techniques, among them:
• Undersampling for imbalanced training sets in text categorization domains: The proposal made in [9] combines the data transformation approach, producing a set of binary classifiers, with undersampling techniques, removing instances linked to the majority label to balance the distribution in each binary dataset. In addition, a decision tree is used to get the most relevant features for each label. kNN is used as the underlying binary classifier, and different k values were tested in the conducted experimentation.
• Random resampling of labelsets (LP-RUS and LP-ROS): The LP-RUS algorithm, proposed in [2], takes as input the percentage of samples to remove from the MLD. After computing the average number of samples sharing each labelset, a set of majority bags is produced. The number of instances to delete is distributed among these majority bags, randomly picking the data samples to remove. The LP-ROS algorithm works in a very similar fashion, but obtaining bags with minority labelsets and adding to them clones of samples randomly retrieved from them. These are simple techniques, and they consider the presence of several majority and minority combinations, instead of only one as most resampling methods assume.
• Random resampling of individual labels (ML-ROS and ML-RUS): The ML-ROS algorithm, proposed in [5], works with individual labels instead of full labelsets, iteratively adding clones of samples to each minority bag, reassessing their condition of minority bags in each cycle. This way, the best possible balance for each group is set as the goal. The ML-RUS algorithm behavior is quite similar, but it gets bags with majority labels and iteratively removes samples from them.
• A case study with the SMOTE algorithm: The authors of the study published in [12] stated the imbalance problem in MLC and proposed to face it using the original SMOTE (Synthetic Minority Over-sampling Technique) algorithm [18]. SMOTE was designed to produce synthetic instances of the minority class for binary datasets. In [12], three ways to feed SMOTE with multilabel data are tested, all of them considering one minority label only. The first path is similar to BR, giving SMOTE all the instances in which the minority label appears to obtain synthetic samples from them and their neighbors. The second approach is quite limited, since it only considers instances having the minority label alone, without any other labels. The third way, which proved to be the most effective, grouped the minority label instances according to the combinations of labels in which it appeared.
• Heuristic undersampling with MLeNN: The MLeNN algorithm, proposed in [4], brings the classic ENN (Edited Nearest Neighbor) rule to the multilabel field, introducing two key ideas: a principle to choose the samples acting as candidates to be removed, and a comparison operator to determine when the labelsets of two instances are considered to be different. Only the instances which do not contain any minority label are used as candidates, instead of all the samples as in the original ENN implementation. Regarding how the classes of these instances are compared, a metric based on the Hamming distance among labelsets, but only taking into account active labels, is defined.
The main advantage of these methods is that they are classifier independent. They are used as a preprocessing step, and it is even possible to combine them; they do not require a specific multilabel classifier to be used. Therefore, the preprocessed MLDs can be later given as input to any of the MLC algorithms described in previous chapters.
Ensemble-based techniques are quite common in the MLC field. A significant number of proposals have already been published, as was reported in Chap. 6, devoted to multilabel ensembles. ECC, EPS, RAkEL, and HOMER are among the most popular MLC ensembles, an approach that has also been applied to solve the imbalance problem.
Theoretically, each classifier in an ensemble could introduce a bias toward a different set of labels, easing the imbalanced learning task and making it more effective. The following two proposals are headed in this direction:
• Inverse random undersampling (BR-IRUS): The method proposed in [24] is built upon an ensemble of binary classifiers. Several of them are trained for each label, each one using a subset of the original data. This subset of the instances includes all samples in which the minority label is present, as well as a small portion of the remaining samples. This way, each individual classifier faces a balanced classification task. By joining the predictions given by the classifiers associated with a label, a more defined boundary around the minority label space is generated. The name of the proposal, BR-IRUS, highlights the fact that it uses the binary relevance transformation.
In the previous sections, most of the published methods aimed at tackling multilabel imbalanced learning have been portrayed. The goal in this section is to experimentally test some of them. Five methods, belonging to different technique families, have been chosen, specifically:
• Random resampling: Two algorithms based on random resampling techniques have been applied, ML-RUS and ML-ROS. The former performs undersampling by removing randomly picked samples associated with majority labels, while the latter does the opposite, producing clones of instances linked to minority labels.
• Heuristic resampling: This group of approaches is also represented by two meth-
ods, MLeNN and MLSMOTE. The first one removes instances with majority labels
following the ENN rule. The second produces synthetic instances associated with
minority labels, generating both features and labelsets from the information in the
neighborhood.
• Ensembles: The EML (ensemble-based method), just described in the previous
section, is also included in the test bed. Unlike the previous four algorithms, EML
is not a preprocessing technique but a full classifier by itself, able to face imbalance
by combining predictions coming from several classifiers with different biases.
These five methods3 have been tested using the experimental configuration
explained in the following section. Obtained results are presented and discussed
in Sect. 8.4.2.
3 The implementations of these methods can be found in the links section of this book's repository [7], along with the dataset partitions.
Four out of the five imbalance methods to be tested are preprocessing procedures.
Therefore, once they have done their work, producing the rebalanced MLD, the data
has to be given to an MLC classifier in order to obtain comparable classification
results. A basic BR transformation has been used for this duty, with the C4.5 [20]
tree induction algorithm as underlying binary classifier.
Five MLDs with disparate imbalance levels, bibtex, cal500, corel5k,
medical and tmc2007, have been included in the experimentation. Their basic
traits, including the MeanIR, are provided in Table 8.1. Each MLD was partitioned
with a 2 × 5 fold cross-validation scheme, as usual. Training partitions were pre-
processed with ML-RUS, ML-ROS, MLeNN, and MLSMOTE.
Thus, five versions of each one were used, one without resampling and four more
preprocessed by each method. The original version, without resampling, was given
as input to the BR classifier to obtain a base evaluation. It was also used with EML,
which did not need an independent classifier. The preprocessed versions also served
as input to the same BR + C4.5 MLC, with exactly the same configuration parameters.
In Chap. 3, the metrics designed to assess MLC algorithms' performance were introduced. Many of them, such as Hamming Loss, Accuracy, Precision, Recall, and F-measure, have been used in the experiments of previous chapters. To study the behavior of classifiers when working with imbalanced data, as is done here, it is usual to rely on label-based metrics instead of sample-based ones. In this case, the F-measure following the macro- and microaveraging strategies is the metric obtained to assess the results. MacroFM (Macro-F-measure) assigns the same weight to all labels, while MicroFM (Micro-F-measure) is heavily influenced by the frequencies of each label. Therefore, the former is usually used to assess the performance with respect to minority labels, and the latter to obtain a more general view of the classifier's behavior.
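A compact way to see the difference between the two averaging strategies is to compute them from per-label counts; in this sketch, tp, fp, and fn are assumed vectors holding the true positive, false positive, and false negative counts of each label.

    # Per-label F-measure, then the two aggregation strategies
    f1 <- function(tp, fp, fn) ifelse(2 * tp + fp + fn == 0, 0,
                                      2 * tp / (2 * tp + fp + fn))
    MacroFM <- mean(f1(tp, fp, fn))           # every label weighs the same
    MicroFM <- f1(sum(tp), sum(fp), sum(fn))  # dominated by frequent labels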
Classification results assessed with MacroFM are shown in Fig. 8.3. Each group of
bars corresponds to an MLD, with each bar depicting the performance of a method.
The left-most bar shows the value for base results, those obtained without any special
imbalance treatment.
Analogously, Fig. 8.4 presents the results evaluated with the MicroFM metric. The structure of the plot is exactly the same. To analyze these data, it is interesting to know in which cases an imbalance treatment has achieved some improvement over the base results. Another important question is which of the applied methods works best.
As can be observed in the two previous plots, undersampling methods seem
to behave worse than the oversampling ones. The exception is MLeNN with the
corel5k MLD, which achieves the best results with the two evaluation metrics.
EML, the ensemble-based solution, does not produce good MacroFM results, although with MicroFM its performance seems to be slightly better, specifically with the cal500 MLD. Regarding the oversampling methods, MLSMOTE appears as the best performer almost always. In fact, this method accomplishes the best results in many cases.
The MacroFM and MicroFM raw values are provided in Tables 8.2 and 8.3, respectively. Values highlighted in italics denote an improvement with respect to the results without imbalance treatment. Best values across all methods are emphasized in bold, as usual.
From these values, it can be stated that EML seldom reaches the performance of the BR + C4.5 base classifier, although it achieves the best MicroFM result with the cal500 MLD. In comparison, MLSMOTE always improves base results for both metrics and manages to get the best performance in seven out of ten configurations. ML-RUS and ML-ROS produce some improvements, as well as a few losses. Lastly, MLeNN seems to work well with the corel5k MLD, but its behavior with the other four datasets is not as good.
Class imbalance is a very usual obstacle while learning a classification model. In this chapter, how label imbalance is present in most MLDs, and some of the specificities in this field, such as label concurrence among imbalanced labels, have been introduced. Several metrics aimed at assessing these traits have been explained, and some specialized data visualizations have been provided.
Solutions to deal with imbalanced multilabel data can be grouped into a few cat-
egories, including preprocessing methods, algorithm adaptation, and ensembles. A
handful of proposals from each category have been described, and some of them have
been experimentally tested. According to the results obtained, the resampling tech-
niques deliver certain improvements while maintaining the benefit of being classifier
independent.
References
1. Charte, F., Rivera, A., del Jesus, M.J., Herrera, F.: Resampling multilabel datasets by decoupling
highly imbalanced labels. In: Proceedings of 10th International Conference on Hybrid Artificial
Intelligent Systems, HAIS’15, vol. 9121, pp. 489–501. Springer (2015)
2. Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: A first approach to deal with imbalance
in multi-label datasets. In: Proceedings of 8th International Conference on Hybrid Artificial
Intelligent Systems, HAIS’13, vol. 8073, pp. 150–160. Springer (2013)
3. Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: Concurrence among imbalanced labels
and its influence on multilabel resampling algorithms. In: Proceedings of 9th International
Conference on Hybrid Artificial Intelligent Systems, HAIS’14, vol. 8480. Springer (2014)
4. Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: MLeNN: a first approach to heuristic
multilabel undersampling. In: Proceedings of 15th International Conference on Intelligent
Data Engineering and Automated Learning, IDEAL’14, vol. 8669, pp. 1–9. Springer (2014)
5. Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: Addressing imbalance in multilabel clas-
sification: measures and random resampling algorithms. Neurocomputing 163, 3–16 (2015)
6. Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: MLSMOTE: approaching imbalanced
multilabel learning through synthetic instance generation. Knowl. Based Syst. 89, 385–397
(2015)
7. Charte, F., Rivera, A.J., del Jesus, M.J., Herrera, F.: Multilabel Classification. Problem analysis,
metrics and techniques book repository. https://github.com/fcharte/SM-MLC
8. Chen, K., Lu, B., Kwok, J.: Efficient classification of multi-label and imbalanced data using
min-max modular classifiers. In: Proceedings of IEEE International Joint Conference on Neural
Networks, IJCNN’06, pp. 1770–1775 (2006)
9. Dendamrongvit, S., Kubat, M.: Undersampling approach for imbalanced training sets and
induction from multi-label text-categorization domains. In: New Frontiers in Applied Data
Mining. LNCS, vol. 5669, pp. 40–52. Springer (2010)
10. Fernández, A., López, V., Galar, M., del Jesus, M.J., Herrera, F.: Analysing the classification
of imbalanced data-sets with multiple classes: binarization techniques and ad-hoc approaches.
Knowl. Based Syst. 42, 97–110 (2013)
11. Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F.: An overview of ensemble
methods for binary classifiers in multi-class problems: experimental study on one-vs-one and
one-vs-all schemes. Pattern Recogn. 44(8), 1761–1776 (2011)
9 Multilabel Software
Abstract Multilabel classification and other learning from multilabeled data tasks
are relatively recent, with barely a decade of history behind them. When compared
against binary and multiclass learning, the range of available datasets, frameworks,
and other software tools is significantly more scarce. The goal of this last chapter
is to provide the reader with the proper insight to take advantage of these software
tools. A brief overview of them is offered in Sect. 9.1. Section 9.2 discusses the
different multilabel file formats, enumerates the data repositories the MLDs can be
downloaded from, and describes how to automate some tasks with the mldr.datasets
R package. How to perform exploratory data analysis of MLDs is the main topic
of Sect. 9.3. Then, the process to conduct experiments with multilabel data using
different tools is outlined in Sect. 9.4.
9.1 Overview
Despite the aforementioned software shortage, currently there are some multilabel data repositories, as well as two frameworks for algorithm developers and at least one exploratory data tool. By using them, tasks such as downloading, citing, and partitioning datasets, multilabel data exploration, and conducting experiments with existing MLC algorithms will be at your fingertips.
The present chapter has been structured into three main sections. The first one
describes the tools needed to work with multilabel data. This includes details about
MLDs file formats, data repositories MLDs can be obtained from, and how most
of these tasks can be accomplished by means of a specific software tool, the
mldr.datasets R package.
How to perform exploratory analysis of multilabel data is the topic the second
section is dedicated to. To do so, two specific programs are depicted, the R mldr
package and the Java MEKA framework. Many of the plots in this book have been
produced by the former, a tool which also provides methods to filter and transform
MLDs.
The concern of the third and final section is how to conduct multilabel experi-
ments, by means of MEKA, MULAN, and a specific Java utility developed by the
authors. Following these guidelines, and using the data partitions provided in this
book repository, the reader should be able to reproduce the experiments described in
previous chapters.
9.2 Working with Multilabel Data
The design of any algorithm aimed at dealing with multilabeled data, whether its goal is to induce a classification model or to apply some kind of preprocessing, has a key requirement: it will have to be tested against some MLDs. Therefore, whatever the research goal, the first step will usually be obtaining enough multilabel data. These MLDs will have to be partitioned, and commonly some exploratory analysis will have to be conducted on them. In addition, they have to be properly documented in the research projects in which they are used, including the correct citation information.
Fortunately, nowadays there are several data repositories and software tools to
fulfill these needs. This first section provides a brief description of such resources,
along with useful references to obtain them.
One of the first issues that any multilabel researcher or practitioner (the user henceforward) has to face is the disparate set of MLD file formats. Unlike traditional datasets, MLDs have more than one output attribute, so it cannot be assumed that the last feature is the class label. How to communicate which of the features are labels is the main origin of the several file formats, because each developer came up with a different approach to solve this obstacle.
Most MLDs are written using one of two base file formats, CSV (Comma-Separated Values) or ARFF1 (Attribute-Relation File Format). Both are text file formats, but the latter includes a header with data descriptors followed by the data itself, whereas the former usually provides only the data; if a header is present, it only brings field names. CSV files cannot contain label information, so the knowledge of how many labels there are, where they are located, or what their names are will depend on an external resource. By contrast, an ARFF file can include this information in the header.
1 An ARFF file is usually divided into three sections. The first one contains the name of the dataset after the @relation tag, the second one provides information about the attributes with @attribute tags, and the third one, whose beginning is marked with the @data tag, contains the actual data. It is the file format used by the popular WEKA data mining tool.
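As an illustration, a minimal ARFF header for a hypothetical MLD with two labels might look as follows; the attribute names are invented, and the way the labels are flagged depends on the concrete format, as discussed next.

    @relation toy-mld
    @attribute feature1 numeric
    @attribute feature2 {low, high}
    @attribute label1 {0, 1}
    @attribute label2 {0, 1}
    @data
    0.37, low, 1, 0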
MLDs can be downloaded from repositories (see the next section) such as
MULAN [1], MEKA [2], LibSVM [3], KEEL [4], and RUMDR [5], each one using
a different file format. The differences among multilabel file formats can be grouped
according to the following criteria:
• CSV versus ARFF: ARFF is the most usual base file format for MLDs. The
datasets available at MULAN, MEKA, and KEEL are ARFF files. On the contrary,
LibSVM chose to use the simpler CSV file format.
• Label information: In order to use an MLD, knowing how many labels there are
or which are the names of the attributes acting as labels is essential. MULAN
datasets provide the label names in a separate XML file. KEEL datasets include
in the ARFF header the set of output attributes. MEKA datasets indicate in the
header, along with the name of the relation, the number of labels.
• Label location: Although multilabel formats providing label names could locate the attributes acting as labels at any position in the MLD, they usually put them at the end, after all the input attributes. This is the case for MULAN and KEEL. On the other hand, MEKA and LibSVM always arrange the labels at the beginning. Knowing the number of labels, the location allows getting the proper attribute names without needing to include them in the ARFF header or to provide an XML file.
• Sparse versus non-sparse: There are MLDs that have thousands of input attributes plus thousands of labels. Therefore, each data row (instance) consists of a long sequence of values. Many of them could be zero, since labels can only take two values and the same is also applicable to many input attributes. In these cases, the MLD will be a large array, with thousands of columns and maybe rows, with zeroes in most of its cells. To avoid wasting storage and memory, these MLDs are usually stored as sparse data. The rows in a sparse MLD are composed of comma-separated pairs of values. In each pair, the first value indicates the index of the attribute, while the second provides the actual value. In non-sparse MLDs, each row will contain the same number of columns, holding the values for every attribute.
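The following pair of rows shows the idea with made-up values: the first line is a non-sparse instance, and the second is its equivalent in ARFF-style sparse notation, where only the non-zero attributes (identified by their zero-based index) are written.

    0, 0, 1, 0, 0.37, 0, 0, 1
    {2 1, 4 0.37, 7 1}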
When some experiment is going to be conducted using a set of MLDs, the user has to choose between converting all of them to the proper file format, suitable for the tool to be used later to conduct the experiment, or being limited to using only those MLDs already available in that file format.
When it comes to multilabel data gathering, there are several alternatives to choose from. Custom MLDs can be produced for specific fields where multilabel data are still not available. Alternatively, existing MLDs produced by someone else can be obtained from several repositories, as long as they suit the needs of the task at hand. The option to generate MLDs synthetically, by means of some software tools, is another potential choice. This section will look into the second approach.
Multilabel data repositories provide a convenient way to obtain MLDs that other researchers have built and used in their studies. It is an approach that allows comparing different strategies against the same data. Nonetheless, sometimes only full datasets are available, and few of these repositories also provide citation information. Therefore, the user usually has to download the MLD, partition it, and search for the proper bibliographic entry.
The following are among the best-known multilabel data repositories. For each
one, the file format of the MLDs is also indicated:
• MULAN: MULAN [1] is a reference multilabel software tool (it will be further
described), and its associated repository [6] is probably the most used resource by
researchers in this field. The MLDs are provided in the ARFF format. The labels
are usually located at the end of the attribute list, and each MLD is linked to an
XML file containing the label names and their hierarchical relationship if it exists.
Currently, this repository holds 27 MLDs,2 some of them with prebuilt partitions.
• MEKA: MEKA is a multilabel tool based on WEKA. Like MULAN, it brings reference implementations for several methods, as will be shown later. The MEKA repository [2] supplies 15 MLDs. Some of them are the same found in MULAN, but using the MEKA file format. This is also ARFF-based, but the labels always appear at the beginning of the attribute list. There is no separate XML file with label names, but the number of labels in the MLD is indicated in the ARFF header, as a parameter of the relation name.
• LibSVM: LibSVM [3] is a popular software library for SVMs. There are many classification algorithms, including some multilabel ones, built upon LibSVM. The associated data repository [7] includes 8 MLDs. In this case, the file format is CSV-based instead of ARFF-based, but the attribute values are given according to the sparse representation previously described. The labels are always put at the beginning of each instance. There is no associated XML file nor any header indicating the number of labels or their names.
• KEEL: Unlike MULAN and MEKA, KEEL [4] is a general-purpose data mining
application, similar to WEKA. This software tool has an extensive data repository
with different kinds of datasets, including 16 MLDs [8]. The file format is ARFF-
based, indicating in the attribute list which features act as labels.
• RUMDR: The R Universal Multilabel Dataset Repository [5] is associated with
an R package named mldr.datasets [9] (it will be portrayed in the following sub-
section). Currently, this is the most extensive multilabel data repository, providing
more than 60 MLDs. These can be directly downloaded from the repository in R
file format; thus, they are designed to be loaded from R. The functions provided by
the package allow to export them from this native format to several ones, including
MULAN, MEKA, LibSVM, and CSV.
2 The number of MLDs provided by each repository has been checked as of April 2016.
When it comes to data exploration, analysis, and mining, R [11] is a very popular tool/language due to its extensive package list. One of these packages, named mldr.datasets [9], is specifically designed to aid the user in the tasks of obtaining, partitioning, converting, and exporting multilabel datasets. mldr.datasets is tied to the aforementioned RUMDR repository.
The mldr.datasets package is available at CRAN (Comprehensive R Archive Network), the distributed network providing most R packages. Therefore, it can be downloaded and installed from any up-to-date R version by simply issuing the install.packages("mldr.datasets") command at the console. Once installed, the package has to be loaded into memory with the usual library(mldr.datasets) command. This will bring to the R workspace ten medium-sized MLDs, along with the functions needed to access many more and to manipulate them. The preloaded MLDs are those stored in the data folder of the RUMDR repository.
In the following, how to use the mldr.datasets to accomplish some basic tasks
over MLDs is explained, assuming the package is already installed and loaded into
memory.
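Summing up, the setup described above reduces to two commands:

    install.packages("mldr.datasets")  # download and install from CRAN (once)
    library(mldr.datasets)             # load the package and its bundled MLDs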
After loading the package, the user can know which MLDs are available using the
usual data() function, passing the name of the package as parameter. These are the
MLDs brought to memory by loading the package, but there are many more available
on the RUMDR repository. A list of these is returned by the mldrs() function, as
shown in Fig. 9.1.
To load any of the available MLDs, all the user has to do is type its name in the R console followed by empty parentheses. The package will check whether the requested MLD is locally available on the user's computer, loading it into memory if this is the case. Otherwise, the MLD will be automatically downloaded from the RUMDR repository, stored in the local machine, and then loaded into memory, without needing any user intervention.
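For instance, assuming the bibtex dataset appears in the list returned by mldrs(), the following two lines would enumerate the catalog and then fetch that MLD (downloading it on first use):

    mldrs()              # list the MLDs available from RUMDR
    bibtex <- bibtex()   # load bibtex, downloading and caching it if needed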
The MLDs supplied by the mldr.datasets package are mldr objects. This is the object format defined by the mldr package, further addressed in this chapter. These objects have several members containing data helpful to explore the MLD structure, such as the names and frequencies of labels and labelsets, and the domains of input attributes. To access any of these members, the dataset$member syntax is used, as depicted in Fig. 9.2.
The multilabel data are stored in the dataset member. This is a standard R data.frame; therefore, the usual R syntax to access any of its columns and rows can be used. The measures() function returns a list of characterization metrics, such as the number of instances, features, and labels, imbalance levels, and theoretical complexity level.
Fig. 9.2 The mldr objects have several members with disparate information
When an MLD produced by any third party is going to be used in a new study, including the proper citation to give the original author the correct attribution is mandatory. Obtaining the precise citation information is not always easy. The mldr.datasets package includes a custom version of R's toBibtex() function whose goal is to provide the BibTeX entry associated with any mldr object.
The value returned by the toBibtex() function is properly formatted to copy it to the clipboard and then paste it into the user's BibTeX editor. As demonstrated in Fig. 9.3, it can also be printed to the console.
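A one-line example, using the emotions MLD bundled with the package:

    toBibtex(emotions)   # print the BibTeX entry for the emotions dataset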
Although partitioned MLDs can be obtained from some repositories, this is not always the case. Furthermore, the number of partitions or their proportions might not be adequate for the user's needs. The mldr.datasets package contributes two functions, named random.kfolds() and stratified.kfolds(), whose goal is to partition any mldr object into the desired number of parts. The difference between these two functions lies in the approach followed to choose the instances included in each partition. The former does it randomly, while the latter stratifies the data, trying to balance the label distribution among partitions.
Both functions need the mldr object to be partitioned as their first argument. Additionally, they can take two more parameters specifying the number of folds, 5 by default, and the seed for the random generator. The result returned by these functions is a list containing as many elements as folds have been indicated. Each one of these elements is made up of two members, called train and test, with the corresponding data partitions.
Fig. 9.4 An MLD being partitioned using random and stratified approaches
The example code shown in Fig. 9.4 demonstrates how to use these two functions, as well as how to access the training partition of the first fold. The summary() function prints a summary of characterization metrics, allowing to compare how the different partitioning approaches have influenced the obtained partitions.
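Since the figure itself is not reproduced here, the following sketch captures the same idea, relying only on the behavior documented above:

    parts <- stratified.kfolds(emotions)   # 5 folds by default
    summary(parts[[1]]$train)              # traits of the first training partition
    summary(parts[[1]]$test)               # and of its test counterpart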
Although R is a tool from which the MLDs provided by mldr.datasets can be used with disparate machine learning algorithms, currently software packages such as MULAN and MEKA are preferred, due to the large prebuilt set of MLC algorithms they incorporate.
The file format of MLDs provided by RUMDR is the R native object format.
Nonetheless, once they have been loaded into R, it is possible to convert them
to several other file formats. For doing so, the write.mldr() function of the
mldr.datasets package has to be called.
The write.mldr() function accepts as first argument an mldr object, con-
taining the MLD to be written to a file. It is also able to deal with a list as the one
returned by the partitioning functions described above, writing each training and test
fold to a different file. This is the only mandatory parameter, and the remaining ones
take default values.
As second argument, the write.mldr() function takes a vector of strings stating the file formats the data are going to be exported to. Valid formats are MULAN, MEKA, KEEL, LIBSVM, and CSV. The default value is c("MULAN", "MEKA"), these being the two most popular multilabel file formats. If the MULAN format is chosen, the function will also write the corresponding XML file for the MLD. For the CSV format, an additional CSV file containing a list with label names is also created.
The third parameter has to be a boolean value, indicating if sparse representation
has to be used to write the data. By default, it takes the FALSE value, so the non-sparse
format is chosen unless otherwise specified.
Lastly, the fourth argument sets the base filename the write.mldr() function
will use to name the written files. This filename will be followed by a set of numbers,
stating the fold and total number of folds, if the first parameter is a list of mldr
objects.
The write.mldr() function can be combined with the previously described partitioning functions, as shown in Fig. 9.5. In this example, the yeast MLD is partitioned into five folds, and then the resulting partitions are written in MEKA and CSV file formats.
Fig. 9.5 Partitioning and exporting an MLD to MEKA and CSV file formats
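A sketch of that workflow, under the assumption (taken from the parameter order described above) that the format vector is the second argument:

    parts <- stratified.kfolds(yeast)     # five folds by default
    write.mldr(parts, c("MEKA", "CSV"))   # one file set per train/test fold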
Due to its link to the RUMDR repository, most of the MLDs publicly available nowadays can be downloaded and then properly cited, partitioned, and exported to the common file formats; hence, the mldr.datasets package can be the most convenient way of dealing with existing MLDs.
The MLDs provided in the previous repositories have been generated from real data
coming from different domains, as was explained in Chap. 3. However, sometimes
having MLDs with very definite traits would be desirable, for instance while ana-
lyzing the behavior of algorithms aimed to solve a very specific problem. In order
to produce these kinds of data, an appropriate algorithm has to be designed, usually
including a mathematical method to correlate the inputs with the outputs.
In [12], such an algorithm, which offers two different approaches based on hypercubes and hyperspheres, is presented. The associated synthetic dataset generator for multilabel learning is available online as a Web application (see Fig. 9.6). It allows the user to choose between the two strategies for generating instances, as well as to indicate how many features, labels, and instances the MLD will have, the noise level to add to the data, and a handful of other configuration parameters. Once all of them have been set, the resulting dataset characteristics are shown and the file can be downloaded.
Mldatagen is a quite generic multilabel data generator. In the multilabel litera-
ture, some authors have created their own synthetic MLDs following more ad hoc
approaches, adapted to suit specific goals. References to several of them can be found
in [13].
Fig. 9.6 The Mldatagen tool allows creating custom synthetic MLDs
9.3 Exploratory Analysis of MLDs
The Web repositories where the MLDs are downloaded from, such as the previously
mentioned MULAN and MEKA, usually also supply some basic information about
the datasets. The number of instances, input attributes, and labels, along with label
cardinality and sometimes label density, are common metadata. However, before
using these MLDs to conduct some experiments, most users will demand additional
details. Those will be generally obtained by means of exploratory analysis tasks,
including summarizing at different levels and visualizing the data.
As long as the structure of each multilabel file format is adequately decoded, any
generic data manipulation tool could be used to explore the MLDs. Nevertheless, this
section is focused on interactive software tools specifically built to work with multi-
label data. Two particular tools in this category are described below, MEKA and the
mldr package. Both provide some EDA (Exploratory Data Analysis) functionality.
9.3.1 MEKA
MEKA is a software tool built upon WEKA, and it brings a similar user interface to
this popular application but with multilabel capabilities. MEKA is developed in Java;
therefore, to use it, the first requirement to meet is having the Java Runtime Environment (JRE) installed in the system. Then, the most recent version of MEKA can be downloaded from https://adams.cms.waikato.ac.nz/snapshots/meka. The installation process just consists of extracting the compressed file into a folder.
In addition to the software itself, the MEKA compressed file also includes a PDF document with a tutorial and some example data files. Two scripts, one for Windows (run.bat) and another for UNIX-based systems (run.sh), aimed at easing the launch of the software, are provided as well.
Launching MEKA through the proper script will open the program’s main window
(see Fig. 9.7). The options in the Tools menu run the essential MEKA tools. Some
of them are depicted below.
MEKA includes a viewer able to open any ARFF file, enumerating its attributes,
including the labels, and the values assigned to them. Once a dataset has been loaded,
its content is shown in a grid like the one in Fig. 9.8. The Properties option in the File menu will report the number of instances, attributes, and classes. This is a generic ARFF tool, so it is not aware of the multilabel nature of the data.
Fig. 9.8 The MEKA ARFF viewer allows viewing and editing any ARFF dataset contents
Actually, this tool is also an editor, since the values can be changed, instances
and attributes can be removed, and the order of the data samples can be modified.
Any change will be lost unless the data in memory are saved, usually with a different
filename.
The other exploration tool embedded in MEKA is the MEKA Explorer, accessible through the Tools→Explorer option in the main window. This tool is aimed at interactively testing preprocessing and classification algorithms, but it also has some exploration capabilities.
After loading an MLD, the Preprocess page will show two lists containing all the attributes in the dataset (see Fig. 9.9). At the top of the left list, there is a summary with the number of instances and attributes. If the MLD has been loaded from a file in MEKA format, the program will automatically detect which attributes are labels. These will be highlighted in bold and will always be at the beginning of both lists. Loading MLDs from other ARFF-based file formats is allowed, but the program will not be able to identify the labels. The user has to mark them in the right list and then click the Use class attributes button. By selecting any attribute in the left list, a summary of its domain will be displayed below the right list.
Fig. 9.9 The Preprocess page allows exploring the attributes and labels in the MLD
The Visualize page in this tool provides a matrix of scatter plots (see Fig. 9.10), each one showing the relationship between a pair of attributes. The controls at the bottom can be used to configure these plots, change their size, sample the instances appearing in them, etc. By clicking any of the plots, a bigger version will be opened in a separate window, with additional controls to customize it. The plot can be saved to a file.
Fig. 9.10 Pairwise scatter plots for the attributes in the MLD
Basically, these are all the exploratory analysis functions offered by MEKA. As can be seen, they are mostly applicable to both traditional and multilabel datasets. There is a lack of information about how the labels are distributed and correlated, which labelsets exist, and other specific multilabel measurements.
9.3.2 The mldr Package
Despite the similarity between their names, there are no dependencies between the two packages. That means that mldr can be used without installing mldr.datasets and vice versa. Nonetheless, mldr can take advantage of having all the MLDs included in the mldr.datasets package, as well as of its functions to partition and export them.
The mldr package is able to load MLDs from MULAN and MEKA file formats,
as well as to generate new MLDs on the fly from data synthetically generated or
provided by the user in a data.frame. The package defines a custom representation
for multilabel data. The MLDs are R S3 objects with mldr class. This is the same
representation used by the MLDs in the mldr.datasets package, and hence, these
datasets are compatible with mldr.
Inside the mldr package, a plethora of characterization metrics are computed,
along with a set of functions aimed to ease the analysis of multilabel data. In addition,
the package provides a Web-based user interface to speed up exploratory tasks,
including specific graphic representations of the data.
In this section, the procedures to accomplish different exploratory duties using the mldr package are explained. It is assumed that the user has installed the package, by issuing the install.packages("mldr") command at the R console, and has loaded it into memory by typing the library(mldr) command.
After loading the mldr package, three mldr objects will be already available. These
correspond to the birds, emotions, and genbase MLDs. If the mldr.datasets is
also loaded, all the MLDs provided by it will be accessible as well. To load any other
MLD, assuming it is stored in an ARFF file using the MEKA or MULAN formats,
the mldr() function has to be called. The only mandatory argument is the filename without extension. The function will assume the MLD is in MULAN file format and that an XML file with the same name exists. Additional parameters can be supplied to change this default behavior, stating the XML filename, or the number, indexes, or names of the attributes acting as labels if there is no XML file available nor this information provided in the ARFF header.
The mldr() function always checks whether the mldr.datasets package is
installed in the system. If this is the case, the function entrusts the loading of the MLD
to the proper mldr.datasets function. To avoid this functionality, forcing the loading
from a local file, the force_read_from_file parameter has to be included in
the call to mldr(), assigning it the TRUE value.
In Fig. 9.11, a use case of the mldr() function is shown. In this example, the
name of the XML file does not coincide with the ARFF filename, so it is explicitly set
by means of the xml_file parameter. Once the MLD has been loaded, the object
can be queried to obtain some dataset traits as demonstrated.
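As Fig. 9.11 is not reproduced here, this hypothetical call (the filenames are invented) illustrates the same situation, an ARFF file whose label information lives in a differently named XML file:

    mymld <- mldr("mydata", xml_file = "mydata-labels.xml")  # hypothetical files
    mymld$measures$num.labels                                # query a trait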
Furthermore, new MLDs can be created from any existing data, whether it is
from a real domain or produced by any formula. This functionality, provided by
the mldr_from_dataframe() function, allows creating synthetic MLDs. This
function needs as inputs a data.frame containing all the features, a vector stating which of them will act as labels, and optionally a name to assign to the new MLD. The result, as shown in Fig. 9.12, is an mldr object that can be used as any other MLD.
Fig. 9.12 New MLDs can be generated on the fly from any formula
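A minimal sketch of this mechanism; the name of the label-index argument (labelIndices) is taken from the package documentation and should be treated as an assumption:

    # Two random features plus two binary columns acting as labels
    df <- data.frame(att1 = rnorm(100), att2 = runif(100),
                     lbl1 = sample(0:1, 100, replace = TRUE),
                     lbl2 = sample(0:1, 100, replace = TRUE))
    toy <- mldr_from_dataframe(df, labelIndices = c(3, 4), name = "toyMLD")
    summary(toy)   # the new object behaves as any other mldr MLD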
Independently of the origin of the MLD’s data, all mldr objects have the same
structure and can be processed with the same set of functions. Many of these are
aimed to compute and provide several characterization metrics. The supplied metrics
can be grouped into four categories:
• Basic traits: The number of instances, number of input attributes, number of output
labels, and number of labelsets are in this group. All of them can be queried with
the syntax mldrobject$measures$num.XXX. The summary() function
also returns this information.
• Label distribution data: Metrics such as label cardinality, label density, and
the frequency of each individual label are available through the measures and
labels members, as demonstrated in Fig. 9.13.
• Label relationship metrics: The relationships among the labels in the dataset
can be inspected through metrics such as the total number of unique labelsets
(measures$num.labelsets) in the MLD, the number of single labelsets
(measures$num.single.labelsets), and analyzing the values for the
global and by label SCUMBLE measures (measures$scumble and labels
$SCUMBLE).
• Metrics related to label imbalance: The individual imbalance level for each label
is provided by the labels$IRLbl member. The average imbalance level for the
MLD can be obtained through the measures$meanIR member.
Fig. 9.13 A large set of traits are computed for each MLD
Additional information about the MLD is provided in the members of the mldr
object such as labelsets (signature and counter of each labelset in the MLD),
attributes (name and domain of each attribute), and measures$tcs (the
theoretical complexity score of the MLD). All of them are properly documented in
the electronic help included in the package.
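Using the member names given above, a quick interrogation of the bundled emotions MLD could look like this:

    emotions$measures$num.instances   # basic traits
    emotions$measures$num.labelsets
    emotions$measures$scumble         # global SCUMBLE
    emotions$labels$IRLbl             # per-label imbalance level
    emotions$measures$meanIR          # average imbalance level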
Fig. 9.14 Distribution of the number of labels per instance in the yeast MLD (X-axis: number of labels per instance; Y-axis: number of instances)
The mldr package delivers a custom plot() function for mldr objects, able to produce seven specific plots from the data contained in those objects. The arguments to this function are usually two: the mldr object to analyze and the type parameter specifying what type of plot is desired. Some kinds of visualizations accept additional parameters, for instance to restrict the set of plotted labels.
Three of the custom plots are histograms designed to depict how labels and labelsets are distributed among the instances. The one shown in Fig. 9.14 is produced by the plot(yeast, type = "CH") call, showing the distribution of label cardinality among the instances. The yeast dataset has a Card value of 4.223. It can be observed in this plot that most instances have four labels.
Two more of the available types, "LB" and "LSB", are bar plots depicting in how many instances each label and labelset appears. This way, the differences between frequencies of labels and labelsets can be explored. For instance, Fig. 9.15 shows that there are two majority labels (Class12 and Class13) and several minority labels (Class14 and Class9 to Class11).
Another kind of plot is denoted as "AT". It is a regular pie chart showing the proportion of each type of feature: continuous, nominal, and labels. The seventh visualization option is the one generated by default when the type argument is not provided. It is a circular plot like the one described in Chap. 8 (see Fig. 9.16), illustrating how the labels interact among them. This kind of plot, as well as those presenting label frequencies, accepts the optional parameters labelCount and labelIndices, whose goal is to restrict the set of drawn labels. All plot types also take other optional arguments, such as title to set the title of the plot and col to set the color, or the standard graphical parameters usually given to R functions such as barplot() and hist().
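Putting these options together, a few representative calls (all parameter names taken from the description above):

    plot(yeast)               # default: circular label interaction plot
    plot(yeast, type = "CH")  # cardinality histogram (Fig. 9.14)
    plot(yeast, type = "LB")  # label frequency bar plot (Fig. 9.15)
    plot(yeast, type = "LB", labelCount = 10, title = "Most frequent labels")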
Fig. 9.15 The differences among label frequencies are easily inferred in this bar plot (instances per label in the yeast MLD, Class1 to Class14; Y-axis: number of samples)
The mldr package includes some functions able to generate reports by means of
automated analysis of the label measurements. The functions in charge of producing
these reports only need the mldr object to be analyzed as argument. The resulting
report is printed to the console.
With the labelInteractions() function, an analysis of which labels are in
minority and how they interact with others is generated. The reported result includes
the indexes of the minority labels, at the beginning, and for each one of them the
list of labels they have interactions with, along with the number of samples in which
they appear together (see Fig. 9.17).
The second report is produced by the concurrenceReport() function. As its name suggests, the report analyzes the concurrence among labels, but in a more elaborate way than the labelInteractions() function. As shown in Fig. 9.18, the report includes the global SCUMBLE measurement and its CV, as well as SCUMBLE values for each individual label. For each minority label, a list of the labels it interacts with is also provided, including label names and indexes, the number of times they appear together, and their respective SCUMBLE values.
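Both reports are produced with a single call each:

    labelInteractions(yeast)   # minority label indexes and their interactions
    concurrenceReport(yeast)   # SCUMBLE-based concurrence analysis on console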
The information given by these functions can be useful in different scenarios. For
instance, the indexes of minority labels and the labels they interact with would be
convenient to customize the interaction plot previously described.
All the exploratory functions in the mldr package are accessible through the command line, so R scripts can be written to perform reproducible analysis procedures. However, all the functionality already described is also reachable through the integrated GUI. To open it, the user has to enter the mldrGUI() command. The GUI will be launched inside the default Web browser, and consists of several sections.
Just after the GUI has been launched, the main section is shown and the first MLD
available in the working environment is selected. The name of the active MLD is
always shown at the top of the GUI. A panel in this section (see Fig. 9.19) allows the
user to choose from the loaded MLDs, as well as to load any others from their files.
Fig. 9.17 Report about how minority labels interact with other labels
A visual summary of the selected MLD is provided at the right, in the same section,
including plots that show the proportion of each kind of attributes and how labels
and labelsets are distributed. These plots can be saved for further use, for instance
including them in any study.
The tabs located at the top of the user interface, below the selected MLD name, are
the doors to the other GUI sections. The Labels and Labelsets pages are quite similar.
Both provide a table with names and frequencies, as well as the associated bar plot.
The labels table includes additional data, such as the feature index, imbalance levels,
and concurrence levels. These tables can be sorted and filtered, and their content can
be printed and exported, as is shown in Fig. 9.20.
In the Attributes page, the domain of each input attribute is summarized. The frequency of each value is computed for nominal attributes, while for numeric ones, some statistical measures are computed (see Fig. 9.21).
Fig. 9.21 The Attributes page provides a summary of each attribute domain
By opening the Concurrence section, the user can access a concurrence analysis report, along with a customizable circular plot. The labels presented in it can be interactively selected, updating the visualization until the desired result is achieved. The report is based on the concurrence and imbalance metrics, indicating which minority labels have high interactions and which labels interact with others (Fig. 9.22).
Fig. 9.22 The concurrence report along with the customizable interaction plot
Overall, the exploratory functions and GUI of the mldr package supply the user
with an extensive range of useful information. In addition, the GUI allows customiz-
ing the tables and plots, and then exporting them, easing the process of documenting
any report or study.
In addition to the exploratory analysis functionality, the mldr package also offers the user functions to transform and filter the MLDs, as well as to evaluate the predictive performance from the outputs obtained by any classifier. Some of these functions are described below.
3 The [] operator defined in the mldr package is designed to work with mldr objects only. The standard [] R operator can be used over the mldr$dataset member to manipulate the raw multilabel data.
Fig. 9.24 The mldr package provides a function to perform the usual BR and LP transformations
The evaluation function, mldr_evaluate(), needs as input the mldr object being assessed and the predictions obtained for them. The function returns a list
containing about twenty performance metrics (see Fig. 9.25), along with an object
containing all the data needed to plot the ROC curve. The example shown in Fig. 9.25
generates random predictions for all the labels in the yeast dataset and then evaluates
the performance. As expected, the accuracy of this random classifier is around 50 %.
Fig. 9.25 If a set of predictions is available, the performance of the classifier can be assessed
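A sketch along the lines of that example, generating a random prediction matrix for yeast (one column per label) and assessing it:

    n <- yeast$measures$num.instances
    l <- yeast$measures$num.labels
    preds <- matrix(runif(n * l), ncol = l)   # random scores in [0, 1]
    res <- mldr_evaluate(yeast, preds)
    str(res)   # about twenty metrics plus the data needed for the ROC curve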
9.4 Conducting Multilabel Experiments
Once the user has learned what the structure of their MLDs is, how labels are distributed and correlated, etc., it is time to conduct some experiments using the multilabel data. This goal can imply applying some preprocessing method to the data, training a classifier with it, obtaining predictions for the test set, and eventually evaluating these predictions to assess the performance.
Several software tools and packages for different languages are available to accomplish these tasks. However, two applications stand out above everything else: MEKA and MULAN. Both have been developed by experts in the multilabel field, and they provide reference implementations for a large variety of MLC algorithms.
In this final section, how to use MEKA and MULAN to run simple multilabel experiments is explained. Although the former tool can be used from the command line, it is supplied with a GUI which eases the user's work. On the contrary, the latter is only accessible programmatically, as will be shown later. Furthermore, both software packages have been programmed in the Java language, so an up-to-date JRE installed in the system is assumed.
9.4.1 MEKA
The MEKA user interface was previously introduced. Specifically, the tool known as MEKA Explorer was used to make some exploratory analysis of the data. This section starts from this earlier knowledge, firstly describing how to conduct interactive experiments and then how to design more complex ones.
The Classify page in this application allows the user to choose among an extensive list of classifiers, training them with the loaded MLD and evaluating them following several validation strategies. The first step is to load the MLD of interest. Then, by means of the Choose button in the Classify page, the classifier is selected, setting its parameters as desired. The drop-down list in the Evaluation section configures the validation scheme.
Fig. 9.26 The MEKA Explorer allows the user to run any classifier interactively
Once the experiment has been configured, it will be run by clicking the Start button. As soon as it finishes, a summary of results is displayed, as shown in Fig. 9.26. Each experiment execution is independently stored in the History list, so the user can compare the results produced in each case.
The pop-up menu associated with the items in the History list can be used to save
the obtained results, delete the experiment, copy its parameters, and show several
performance evaluation plots. This menu appears opened in Fig. 9.26. By selecting
the Show ROC option, a new window like the one shown in Fig. 9.27 is opened. It
contains several pages, one per label, each one with the ROC plot.
Overall, the MEKA Explorer offers the user a very simple way to run individual
experiments. However, these are limited to using one MLC algorithm over one
MLD.
Fig. 9.27 The ROC curve for the selected experiment is shown in a window of its own
Though running individual MLC algorithms with one MLD can often be useful,
for instance while analyzing an algorithm's behavior, a real multilabel experimentation
usually implies applying several methods to a group of MLDs. The MEKA
Experimenter, another tool accessible from the MEKA main window, better suits this
kind of job. Like the MEKA Explorer, the Experimenter user interface also consists
of several pages. The first one is where the user designs the experiment, fulfilling
the following steps:
1. Adding the MLC algorithms to be used to the left list. The buttons at the right of
this list allow the user to add and remove methods, configure them, and change
the order in which they will be run.
2. Adding the MLDs to which the algorithms will be applied to the right list. The
buttons at the right of this list are similar to the previous ones.
3. Configuring how many times the experiment will be run, as well as how the
results will be evaluated. For instance, in Fig. 9.28, a configuration with 10-fold
cross-validation and 10 repetitions has been established.
4. Launching the experimentation by choosing the Start option in the Execution
menu. Beforehand, the whole configuration can be saved.
Fig. 9.28 The MEKA Experimenter is able to run several classifiers using disparate MLDs
Once the batch of runs defined in the experimentation finishes, the obtained evaluation
results can be analyzed through the options in the Statistics page of the MEKA
Experimenter. An extensive set of performance metrics is provided, and they can
be viewed separately or in aggregated form. For instance, it is possible to choose a
specific metric and get the average values for each algorithm and dataset. This way,
a comparative study can be easily performed.
The results produced by the experiments can also be written to a file, whether
the user is interested in raw or aggregated data. As shown in Fig. 9.29, the options
related to the exporting functions can be found in the Statistics menu.
9.4.2 MULAN
Fig. 9.29 The experimental results can be viewed in MEKA Experimenter and exported to a file
The first step for a MULAN user is obtaining the software itself. The latest version
can be downloaded from http://mulan.sourceforge.net/download.html as a single
compressed file. After decompressing it, a folder named mulan will be found,
containing the folders and files shown in Fig. 9.30. Nearly all the folder names are
self-explanatory. The Java JAR package holding the MULAN classes is in the dist
folder, while the WEKA package needed to run MULAN is provided in the lib
folder.
Once MULAN is installed in the user's system, a Java source file has to be created
to conduct any experiment. It will contain all the code to load the MLDs, to select
and configure the MLC algorithms to run, to evaluate the obtained results, etc. These
tasks imply using some MULAN classes; therefore, the proper dependencies have to
be imported at the beginning of the source code. Assuming the user needs to load an
MLD and wants to use the ML-kNN classifier, the following statements will import
the corresponding classes:
Fig. 9.30 The MULAN software packages include source code, example data files, and documentation
import mulan.data.MultiLabelInstances;
import mulan.classifier.lazy.MLkNN;
The way some of these MULAN classes are used will be further described below.
Assuming the code is already written and stored in a file named
MulanExperiment.java, the compilation process involves calling the Java
compiler, providing the path to the libraries and the aforementioned filename. The
example in Fig. 9.31 shows a GNU/Linux terminal with the command line used to
compile this hypothetical program. After compilation, a .class file with the
compiled version is generated. To run the program, a similar command would be
used, replacing javac with java.
Fig. 9.31 To compile a MULAN program, the paths to the mulan.jar and weka.jar files have to be
provided
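As a sketch of what these commands may look like, assuming mulan.jar and weka.jar have been copied into the working directory (on Windows, the classpath separator would be ; instead of :):

javac -cp mulan.jar:weka.jar MulanExperiment.java
java -cp mulan.jar:weka.jar:. MulanExperiment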
import mulan.data.MultiLabelInstances;

public class MulanExperiment {
    public static void main(String[] args) throws Exception {
        // The data and label description file names are assumed for illustration
        MultiLabelInstances emotions = new MultiLabelInstances(
            "emotions.arff", "emotions.xml");

        System.out.println(
            "\nInstances:" + emotions.getNumInstances() +
            "\nLabels:" + emotions.getNumLabels() +
            "\nCardinality:" + emotions.getCardinality());
    }
}
Fig. 9.32 The program loads the MLD and outputs some data about it to the console
The names of the files to load could obviously be supplied on the command line,
instead of being hard-coded in the Java code as in this simple example.
Once the data are in a MultiLabelInstances object, the next step is to
configure the MLC algorithms that will receive this object as input. MULAN
provides a large set of classification algorithms, held in the
mulan.classifier namespace, as well as some partitioning and preprocessing
methods spread over several namespaces. All of them can be easily found in the
electronic help of the program.
Many of the algorithms included in MULAN can work using default values, so
the corresponding object can be created without any parameters. For instance, to
work with the ML-kNN algorithm (see Chap. 5), all the user has to do is create an
MLkNN object, as follows:
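MLkNN mlknnClassifier = new MLkNN(); // default parameters; the variable name is illustrative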
To change the default values for the algorithm, a different constructor accepting
them should be called. Other classifiers, such as the ones based on transformation
methods, always need at least one parameter, specifically the base binary or multiclass
method to be used as underlying classifier. Any classifier available as a WEKA class
can be in charge of this task.
The following statements would create two classifier instances. The first one is
ML-kNN with 5 nearest neighbors, while the second one is a BR transformation
using the standard C4.5 algorithm (WEKA's J48 class) as base classifier. The last
statement prints to the standard output the basic information about the ML-kNN
algorithm.
MLkNN kNNClassifier = new MLkNN(5, 1.0); // 5 neighbors, smoothing factor 1.0
BinaryRelevance brClassifier = new BinaryRelevance(new J48()); // BR over C4.5

System.out.println(
    "\nkNNClassifier:" +
    kNNClassifier.getTechnicalInformation().toString());
MULAN has a class named Evaluator able to take a full dataset, partition it, and
conduct a cross-validation evaluation. The method accomplishing the full procedure
is crossValidate(), which receives the classifier, the MLD, and the number of
folds, as the following statement shows:

System.out.println(
    new Evaluator().crossValidate(
        new MLkNN(), emotions, 10));

Fig. 9.33 Partial output produced by printing the information returned by crossValidate()
The main drawback of using MULAN is the need for some Java language expertise.
The authors of this book provide in the companion repository a utility, named
RunMLClassifier, aimed at helping users lacking this competence. It is a Java
program which eases running many of the classifiers implemented in MULAN. The
source code of this program, along with the specific versions of the MULAN and
WEKA libraries and a compilation script, can be found in the RunMLClassifier
folder of the repository.
To run the RunMLClassifier program, assuming the user is located in the
same directory as the .jar file obtained once it is compiled, the following
command5 has to be entered at the command line. The meaning of each parameter
is detailed in Table 9.1.
5 Although, due to the page width limit, the command appears in the text divided into two lines,
it has to be entered as only one.
Table 9.2 Valid values for the -algorithm parameter of the RunMLClassifier utility

Value      MULAN class instantiated as classifier
BPMLL      BPMLL()
BR-J48     BinaryRelevance(new J48())
BRkNN      BRkNN(10)
CC-J48     ClassifierChain(new J48())
CLR        CalibratedLabelRanking(new J48())
ECC-J48    EnsembleOfClassifierChains(new J48(), 10, true, false)
EPS-J48    EnsembleOfPrunedSets(80, 10, 0.2, 2, PrunedSets.Strategy.values()[0], 2, new J48())
HOMER      HOMER(new BinaryRelevance(new J48()), (numLabels < 4 ? numLabels : 4), Method.Random)
IBLR-ML    IBLR_ML()
LP-J48     LabelPowerset(new J48())
MLkNN      MLkNN(10, 1.0)
PS-J48     PrunedSets(new J48(), 2, PrunedSets.Strategy.values()[0], 2)
RAkEL-BR   RAkEL(new BinaryRelevance(new J48()))
RAkEL-LP   RAkEL(new LabelPowerset(new J48()))
The utility instantiates each classifier using default or recommended values. As can
be seen, only one classifier can be chosen to be run over the MLD partitions.
RunMLClassifier is designed to be launched independently, possibly in parallel,
for each algorithm the user is interested in. An example run of this utility is shown
in Fig. 9.34. Without the -debug parameter, only the final line summarizing the
average results and standard deviations would be printed.
MEKA and MULAN are the two main frameworks to conduct multilabel experiments,
since they provide reference implementations of many MLC methods.
The former has a GUI which eases the design of such experiments, while the
latter can only be used by writing some code. The RunMLClassifier
utility aims to make the use of MULAN more comfortable, establishing the
methods and datasets to process through command line parameters, instead of
writing and compiling a Java program.
Learning from multilabeled data is quite a challenging task. Datasets of this kind
come in diverse file formats and are distributed among a few Web repositories. Once
the data files have been obtained, they have to be imported into the learning tool,
sometimes after a format conversion.
Fig. 9.34 The utility trains and evaluates the MLD partitions with the specified classifier
Glossary
Neuron Each one of the units which make up a layer in a neural network.
Outlier Data instance whose attributes have some values out of the common range
but cannot be considered as noise.
Output space The space generated by the labels (output attributes) in a multilabel
dataset.
Oversampling Resampling technique that produces additional data samples.
Preprocessing Family of tasks aimed at performing data cleaning and selection of
relevant data.
Repository Resource where datasets and sometimes software are provided. It is
usually a Web site.
Resampling Technique used to modify the original set of samples, either removing
existing samples or adding new ones.
Sample See instance.
Segmentation The usual process to extract features from signal information such
as images and sounds.
Supervised Supervised methods are guided by the labels associated with data
samples.
Tree A knowledge representation model from which usually simple rules can be
extracted.
Undersampling Resampling technique that removes data samples from a dataset.
Unsupervised Unsupervised methods only work with input features, without
being guided by class labels.