Variable Selection Benchmark Guide
Background:
Results published in the field of feature or variable selection (see e.g. the special issue of
JMLR on variable and feature selection: http://www.jmlr.org/papers/special/feature.html)
are for the most part reported on different data sets or with different data splits, which makes
them hard to compare. We formatted a number of datasets for the purpose of benchmarking
variable selection algorithms in a controlled manner (see Footnote 1). The data sets were chosen to span a
variety of domains (cancer prediction from mass-spectrometry data, handwritten digit
recognition, text classification, and prediction of molecular activity). One dataset is
artificial. We chose data sets that had sufficiently many examples to create a large
enough test set to obtain statistically significant results. The input variables are
continuous or binary, sparse or dense. All problems are two-class classification problems.
The similarity of the tasks allows participants to enter results on all data sets. Other
problems will be added in the future.
Method:
Preparing the data included the following steps:
- Preprocessing data to obtain features in the same numerical range (0 to 999 for
continuous data and 0/1 for binary data).
- Adding “random” features distributed similarly to the real features. In what
follows we refer to such features as probes to distinguish them from the real
features. This will allow us to rank algorithms according to their ability to filter
out irrelevant features.
- Randomizing the order of the patterns and the features to homogenize the data.
- Training and testing on various data splits using simple feature selection and
classification methods to obtain baseline performances.
- Determining the approximate number of test examples needed for the test set to
obtain statistically significant benchmark results, using the rule-of-thumb ntest =
100/p, where p is the test set error rate (see "What size test set gives good error rate
estimates?" I. Guyon, J. Makhoul, R. Schwartz, and V. Vapnik, PAMI, 20(1),
pages 52-64, IEEE, 1998, http://www.clopinet.com/isabelle/Papers/test-size.ps.Z).
Since the test error rate of the classifiers of the benchmark is unknown, we used
the results of the baseline method and added a few more examples. A short
numerical illustration is given after this list.
- Splitting the data into training, validation and test set. The size of the validation
set is usually smaller than that of the test set to keep as much training data as
possible.
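As a numerical illustration of the rule-of-thumb ntest = 100/p used above (the error rate
value below is only an example, not the rate of any particular dataset):

% Example use of the rule-of-thumb ntest = 100/p (the error rate is an assumed example value).
p = 0.05;                  % anticipated test set error rate of the baseline method
ntest = ceil(100 / p)      % about 2000 test examples would be needed in this case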
Both validation and test set truth values (labels) are withheld during the benchmark. The
validation set serves as a development test set. During the time allotted to the participants
to try methods on the data, participants are allowed to send the validation set results (in
the form of classifier outputs) and obtain result scores. Such scores are made available to
all participants to stimulate research. At the end of the benchmark, the participants send
their test set results. The scores on the test set results are disclosed simultaneously to all
participants after the benchmark is over.

Footnote 1: In this document, we do not make a distinction between features and variables. The benchmark addresses
the problem of selecting input variables. Those may actually be features derived from the original variables
using a preprocessing step.
Data formats:
All the data sets are in the same format and include 8 files in ASCII format:
dataname.param: Parameters and statistics about the data
dataname.feat: Identities of the features (in the order the features are found in the data).
dataname_train.data: Training set (a sparse or a regular matrix, patterns in lines, features
in columns).
dataname_valid.data: Validation set.
dataname_test.data: Test set.
dataname_train.labels: Labels (truth values of the classes) for training examples.
dataname_valid.labels: Validation set labels (withheld during the benchmark).
dataname_test.labels: Test set labels (withheld during the benchmark).
The matrix data formats used are:
- For regular matrices: a space delimited file with a new-line character at the end of
each line.
- For sparse matrices with binary values: for each line of the matrix, a space
delimited list of indices of the non-zero values. A new-line character at the end of
each line.
- For sparse matrices with non-binary values: for each line of the matrix, a space
delimited list of index:value pairs for the non-zero entries, each value being
separated from its index by a colon. A new-line character at the end of each line.
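As an illustration, here is a minimal MATLAB sketch for reading the two sparse formats
into a regular matrix (the file name and the number of features are assumptions for the
example; the true values are given in dataname.param):

% Sketch (assumed file name and dimensions) for reading the sparse data formats.
num_feat = 10000;                            % total number of features, from dataname.param
fid = fopen('dataname_train.data');
X = []; row = 0;
line = fgetl(fid);
while ischar(line)
    row = row + 1;
    X(row, num_feat) = 0;                    % grow the row to full width
    if any(line == ':')                      % sparse non-binary format: index:value pairs
        pairs = sscanf(line, '%d:%d');       % alternating indices and values
        X(row, pairs(1:2:end)') = pairs(2:2:end)';
    else                                     % sparse binary format: indices of the ones
        X(row, sscanf(line, '%d')') = 1;
    end
    line = fgetl(fid);
end
fclose(fid);
% Regular (dense) matrices can simply be read with: X = load('dataname_train.data');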
Dataset A: ARCENE
1) Topic
The task of ARCENE is to distinguish cancer versus normal patterns from mass-
spectrometric data. This is a two-class classification problem with continuous input
variables.
2) Sources
a. Original owners
The data were obtained from two sources: The National Cancer Institute (NCI) and the
Eastern Virginia Medical School (EVMS). All the data consist of mass-spectra obtained
with the SELDI technique. The samples include patients with cancer (ovarian or prostate
cancer), and healthy or control patients.
3) Past usage
NCI ovarian cancer original paper:
"Use of proteomic patterns in serum to identify ovarian cancer." Emanuel F. Petricoin III,
Ali M. Ardekani, Ben A. Hitt, Peter J. Levine, Vincent A. Fusaro, Seth M. Steinberg, Gordon
B. Mills, Charles Simone, David A. Fishman, Elise C. Kohn, and Lance A. Liotta. The Lancet,
Vol. 359, February 16, 2002, www.thelancet.com. The results of this paper are so far not
reproducible.
Note: The data used here is a newer set of spectra, obtained after the publication of the
paper and of better quality. 100% accuracy is easily achieved on the test set using various
data splits on this version of the data.
NCI prostate cancer original paper:
Serum proteomic patterns for detection of prostate cancer. Petricoin et al. Journal of the
NCI, Vol. 94, No. 20, Oct. 16, 2002. The test results of the paper are shown in Table A.1.
4) Experimental design
We merged the datasets from the three different sources (253+322+326=901 samples). We
obtained 91+253+159=503 control samples (negative class) and 162+69+167=398 cancer
samples (positive class). The motivations for merging datasets include:
- Obtaining enough data to be able to cut a test set of sufficient size.
- Creating a problem where possibly non-linear classifiers and non-linear feature
selection methods might outperform linear methods. The reason is that each class
will contain different clusters corresponding to differences in disease, gender,
and sample preparation.
- Finding out whether there are features that are generic to the cancer vs. normal
separation across various cancers.
We designed a preprocessing that is suitable for mass-spec data and applied it to all the
data sets to reduce the disparity between data sources. The preprocessing consists of the
following steps:
- Limiting the mass range: We eliminated small masses under m/z=200, which
usually include chemical noise specific to the MALDI/SELDI process (influence
of the "matrix"). We also eliminated large masses over m/z=10000 because few
features are usually relevant in that domain and we needed to compress the data.
- Averaging the technical repeats: In the EVMS data, two technical repeats were
available. We averaged them because we wanted to have examples in the test set
that are independent so that we can apply simple statistical tests.
- Removing the baseline: We subtracted in a window the median of the 20%
smallest values. An example of baseline detection is shown in Figure A.1.
- Smoothing: The spectra were slightly smoothed with an exponential kernel in a
window of size 9.
- Re-scaling: The spectra were divided by the median of the 5% top values.
- Taking the square root: The square root of all values was taken.
- Aligning the spectra: We slightly shifted the spectra collections of the three
datasets so that the peaks of the average spectrum would be better aligned
(Figures A.2 and A.3). As a result, the mass-over-charge (m/z) values that identify
the features in the aligned data are imprecise. We took the NCI prostate cancer
m/z as reference.
- Further limiting the mass range: To eliminate border effects, the borders of the
spectra were cut.
- Soft thresholding the values: After examining the distribution of values in the
data matrix, we subtracted a threshold and set to zero all the resulting values
that were negative. In this way, we kept only about 50% non-zero values, which
represents a significant data compression (see Figure A.4).
- Quantizing: We quantized the values to 1000 levels (a sketch of these last two
steps is given after this list).
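For illustration, here is a minimal sketch of these last two steps; the threshold value and
the exact rescaling applied before quantization are assumptions for the example:

% Sketch of soft thresholding and quantization (threshold and rescaling assumed for illustration).
threshold = 0.2;                             % chosen by examining the distribution of values
X = max(X - threshold, 0);                   % subtract threshold, set negative results to zero
X = round(999 * X / max(X(:)));              % quantize to integer levels in the range 0 to 999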
The resulting data set, including all training and test data merged from the three sources,
has 901 patterns from 2 classes and 9500 features. We removed one pattern to obtain the
round number 900. At every step, we checked that the change in performance of a linear
SVM classifier trained and tested on a random split of the data was not significant. On
that basis, we have some confidence that our preprocessing did not alter significantly the
information content of the data. We further manipulated the data to add random “probes”:
- We identified the region of the spectra with least information content using an
interval search for the region that gave worst prediction performance of a linear
SVM (indices 2250-4750). We replaced the features in that region by "random
probes" obtained by randomly permuting the values in the columns of the data
matrix (see the sketch after this list).
- We identified another region of low information content: 6500-7000. We added
500 random probes that are permutations of those features.
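A minimal sketch of how such probes can be obtained by independently permuting the
values of each column across patterns (the index range follows the text above; seed
handling and the selection of the second region are omitted):

% Sketch: replace low-information features by "random probes" obtained by
% independently permuting the values of each column across patterns.
probe_idx = 2250:4750;                       % region replaced by probes (see text)
num_pat = size(X, 1);
for j = probe_idx
    X(:, j) = X(randperm(num_pat), j);       % permute the column values across patterns
end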
After such manipulations, the data had 10000 features, including 7000 real features and
3000 random probes. The reason for not adding more probes is purely practical: non-
sparse data cannot be compressed sufficiently to be stored and transferred easily in the
context of a benchmark.
Figure A.1: Example of baseline detection on a spectrum.
Figure A.2: Central part of the spectra before alignment. We show in red the average NCI ovarian
spectra, in blue the average NCI prostate spectra, and in green the average EVMS prostate spectra.
Figure A.3: Central part of the spectra after alignment. We show in red the average NCI ovarian
spectra, in blue the average NCI prostate spectra, and in green the average EVMS prostate spectra.
Figure A.4: Distributions of the values in the ARCENE data after preprocessing.
Figure A.5: Heat map of the training set of the ARCENE data. We represent the data matrix (patients in
lines and features in columns). The values are clipped at 500 to increase the contrast. The values are then
mapped to colors according to the color-map on the right. The stripe beyond the 10000 feature index
indicates the class labels: +1=red, -1=green.
5) Number of examples and class distribution
All variables are integer quantized on 1000 levels. There are no missing values. The
data is not very sparse, but for data compression reasons, we thresholded the values.
Approximately 50% of the entries are non-zero. The data was saved as a non-sparse
matrix.
Dataset B: GISETTE
1) Topic
The task of GISETTE is to discriminate between two confusable handwritten digits: the
four and the nine. This is a two-class classification problem with sparse continuous input
variables.
2) Sources
a. Original owners
The data set was constructed from the MNIST data that is made available by Yann
LeCun of the NEC Research Institute at http://yann.lecun.com/exdb/mnist/.
The digits have been size-normalized and centered in a fixed-size image of dimension
28x28. We show examples of digits in Figure B1.
Figure B1: Examples of handwritten digits from the MNIST data (28x28 pixel images).
3) Past usage
Many methods have been tried on the MNIST database. Here is an abbreviated list from
http://yann.lecun.com/exdb/mnist/:
Reference:
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to
document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998.
http://yann.lecun.com/exdb/publis/index.html#lecun-98
4) Experimental design
Figure B2: Example of a randomly selected subset of pixels in the region of interest.
Pairs of pixels used as features in dataset B use pixels drawn randomly according to such
a distribution.
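As an illustration only, here is a minimal sketch of how features could be built from
randomly drawn pairs of pixels; combining the two pixels by a product is an assumption,
since the exact combination rule is not stated in the text above:

% Sketch (assumed construction): features built from randomly drawn pairs of pixels.
num_feat = 2500;                             % number of pair features to create (example value)
[num_pat, num_pix] = size(P);                % P: images as rows of 28*28=784 pixel values
pairs = ceil(num_pix * rand(num_feat, 2));   % random pixel pairs (ideally drawn from the
                                             % region-of-interest distribution of Figure B2)
X = zeros(num_pat, num_feat);
for j = 1:num_feat
    X(:, j) = P(:, pairs(j,1)) .* P(:, pairs(j,2));   % combine the two pixels (product assumed)
end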
Dataset C: DEXTER
1) Topic
The task of DEXTER is to filter texts about "corporate acquisitions". This is a two-class
classification problem with sparse continuous input variables.
2) Sources
a. Original owners
The original data set we used is a subset of the well-known Reuters text categorization
benchmark. The data was originally collected and labeled by Carnegie Group, Inc. and
Reuters, Ltd. in the course of developing the CONSTRUE text categorization system. It
is hosted by the UCI KDD repository:
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html. David D. Lewis is
hosting valuable resources about this data (see
http://www.daviddlewis.com/resources/testcollections/reuters21578/). We used the
“corporate acquisition” text classification class pre-processed by Thorsten Joachims
<[email protected]>. The data is one of the examples of the software package
SVM-Light; see http://svmlight.joachims.org/. The example can be downloaded from
ftp://ftp-ai.cs.uni-dortmund.de/pub/Users/thorsten/svm_light/examples/example1.tar.gz.
b. Donor of database
This version of the database was prepared for the NIPS 2003 variable and feature
selection benchmark by Isabelle Guyon, 955 Creston Road, Berkeley, CA 94708, USA
([email protected]).
c. Date received: August 2003.
3) Past usage
Hundreds of articles have appeared on this data. For a list see:
http://kdd.ics.uci.edu/databases/reuters21578/README.txt
Also, 446 citations including “Reuters” were found on CiteSeer:
http://citeseer.nj.nec.com.
4) Experimental design
The original data formatted by Thorsten Joachims is in the “bag-of-words”
representation. There are 9947 features (of which 2562 are always zeros for all the
examples) that represent frequencies of occurrence of word stems in text. Some
normalizations have been applied that are not detailed by Thorsten Joachims in his
documentation. The task is to learn which Reuters articles are about "corporate
acquisitions".
Figure C.1: Comparison of the real data and the random probe data distributions.
We plot the number of non-zero values of a given feature as a function of the rank of the
feature, where features are ranked by their number of non-zero values. Red: real data.
Blue: simulated data.
Figure C.2: Comparison of the real data and the random probe data distributions. We plot the sum of the
non-zero values of a given feature as a function of the rank of the feature, where features are ranked by
their number of non-zero values. Red: real data. Blue: simulated data.
The following steps were taken to prepare our version of the dataset:
- We concatenated the original training set (2000 examples, class balanced) and test
set (600 examples, class balanced).
- We added to the original 9947 features 10053 additional features drawn at random
according to a Zipf law, to obtain a total of 20000 features (a sketch of one way to
draw such features is given after this list). Fraction of non-zero values in the real
data: 0.46%. Fraction of non-zero values in the simulated data: 0.5%.
- The feature values were quantized to 1000 levels.
- The order of the features and the order of the patterns were randomized.
- The data was split into training, validation, and test sets, with balanced numbers
of examples of each class in each set.
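For illustration, here is a minimal sketch (not the exact generator we used) of one way to
draw sparse probe features whose number of non-zero entries roughly follows a Zipf law;
the constants below are taken from the text or are example values:

% Sketch of drawing sparse probe features with a Zipf profile of non-zero counts.
num_pat   = 2600;        % total number of examples (2000 + 600 in the original data)
num_probe = 10053;       % number of probe features to add
target    = 0.005;       % target overall fraction of non-zero entries (~0.5%)
counts = 1 ./ (1:num_probe);                                  % Zipf profile ~ 1/rank
counts = round(counts * target * num_pat * num_probe / sum(counts));
counts = min(max(counts, 0), num_pat);
probes = zeros(num_pat, num_probe);
for k = 1:num_probe
    idx = randperm(num_pat);
    idx = idx(1:counts(k));                                   % examples where the probe is non-zero
    probes(idx, k) = ceil(999 * rand(counts(k), 1));          % integer values in 1..999
end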
All variables are integer quantized on 1000 levels. There are no missing values. The
data is very sparse. Approximately 0.5% of the entries are non zero. The data was saved
as a sparse-integer matrix.
Dataset D: DOROTHEA
1) Topic
The task of DOROTHEA is to predict which compounds bind to Thrombin. This is a
two-class classification problem with sparse binary input variables.
2) Sources
a. Original owners
The dataset from which DOROTHEA was created is one of the KDD Cup 2001
(Knowledge Discovery and Data Mining) datasets. The original dataset and the papers of
the winners of the competition are available at: http://www.cs.wisc.edu/~dpage/kddcup2001/. DuPont
Pharmaceuticals graciously provided this data set for the KDD Cup 2001 competition.
All publications referring to analysis of this data set should acknowledge DuPont
Pharmaceuticals Research Laboratories and KDD Cup 2001.
b. Donor of database
This version of the database was prepared for the NIPS 2003 variable and feature
selection benchmark by Isabelle Guyon, 955 Creston Road, Berkeley, CA 94708, USA
([email protected]).
c. Date received: August 2003.
3) Past usage
a. References
There were 114 participants in the competition who turned in results. The winner of the
competition is Jie Cheng (Canadian Imperial Bank of Commerce). His presentation is
available at: http://www.cs.wisc.edu/~dpage/kddcup2001/Hayashi.pdf.
The data was also studied by Weston and collaborators:
J. Weston, F. Perez-Cruz, O. Bousquet, O. Chapelle, A. Elisseeff and B. Schoelkopf.
"Feature Selection and Transduction for Prediction of Molecular Bioactivity for Drug
Design". Bioinformatics.
A lot of information is available from Jason Weston's web page, including valuable
statistics about the data:
http://www.kyb.tuebingen.mpg.de/bs/people/weston/kdd/kdd.html.
b. Synopsis of the original data
One binary attribute (active A or inactive I) must be predicted.
Drugs are typically small organic molecules that achieve their desired activity by binding
to a target site on a receptor. The first step in the discovery of a new drug is usually to
identify and isolate the receptor to which it should bind, followed by testing many small
molecules for their ability to bind to the target site. This leaves researchers with the task
of determining what separates the active (binding) compounds from the inactive (non-
binding) ones. Such a determination can then be used in the design of new compounds
that not only bind, but also have all the other properties required for a drug (solubility,
oral absorption, lack of side effects, appropriate duration of action, toxicity, etc.).
The original training data set consisted of 1909 compounds tested for their ability to bind
to a target site on thrombin, a key receptor in blood clotting. The chemical structures of
these compounds are not necessary for our analysis and were not included. Of the
training compounds, 42 are active (bind well) and the others are inactive. To simulate the
real-world drug design environment, the test set contained 634 additional compounds that
were in fact generated based on the assay results recorded for the training set. Of the test
compounds, 150 bind well and the others are inactive. The compounds in the test set were
made after chemists saw the activity results for the training set, so the test set had a
higher fraction of actives than did the training set in the original data split.
Each compound is described by a single feature vector comprised of a class value (A for
active, I for inactive) and 139,351 binary features, which describe three-dimensional
properties of the molecule. The definitions of the individual bits are not included; we only
know that they were generated in an internally consistent manner for all 1909
compounds. Biological activity in general, and receptor binding affinity in particular,
correlate with various structural and physical properties of small organic molecules. The
task is to determine which of these properties are critical in this case and to learn to
accurately predict the class value.
In evaluating the accuracy, a differential cost model was used, so that the sum of the costs
of the actives will be equal to the sum of the costs of the inactives.
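One simple way to realize such a cost model, under our interpretation (a sketch, not
necessarily the exact scoring code used for the KDD Cup), is to weight errors so that each
class contributes equally:

function ber = balanced_error(Y_true, Y_pred)
% Balanced error rate: errors on actives and inactives contribute equally
% (sketch; assumed interpretation of the differential cost model).
% Y_true, Y_pred -- labels in {-1,+1}
err_pos = mean(Y_pred(Y_true>0) ~= Y_true(Y_true>0));   % error rate on the actives
err_neg = mean(Y_pred(Y_true<0) ~= Y_true(Y_true<0));   % error rate on the inactives
ber = (err_pos + err_neg)/2;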
c. Results
To outperform these results, the paper of Weston et al., 2002, utilizes the combination of
an efficient feature selection method and a classification strategy that capitalizes on the
differences in the distribution of the training and the test set. First, they select a small
number of relevant features (less than 40) using an unbalanced correlation score.
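A minimal sketch of such a score, assuming the form fj = (sum of Xij over positive
examples) - lambda*(sum of Xij over negative examples); the exact formula is given in
Weston et al., 2002, and may differ in detail:

function f = unbalanced_correlation(X, Y, lambda)
% Unbalanced correlation score (assumed form, see the text above).
% X      -- data matrix (num examples, num features), binary for DOROTHEA
% Y      -- labels in {-1,+1}
% lambda -- penalty on non-zero values occurring in negative examples
f = sum(X(Y>0,:), 1) - lambda * sum(X(Y<0,:), 1);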
4) Experimental design
The original data set was modified for the purpose of the feature and variable selection
benchmark:
- The original training and test sets were merged.
- The features were sorted according to the fj criterion with λ=3, computed using
the original test set (which is richer in positive examples).
- Only the top ranking 100000 original features were kept.
- The all-zero patterns were removed, except one, which was given the label –1.
- For the lower-ranked half of the features, the values were individually randomly
permuted across patterns (in order to create "random probes").
- The order of the patterns and the order of the features were globally randomly
permuted to mix the original training and test patterns and to remove any feature
order information.
- The data was split into training, validation, and test set while respecting the same
proportion of examples of the positive and negative class in each set.
We are aware that our design biases the data in favor of the selection criterion fj. It
remains to be seen, however, whether other criteria can perform better, even with that bias.
All variables are binary. There are no missing values. The data is very sparse. Less than
1% of the entries are non-zero (1776363/(1950*100000)). The data was saved as a
sparse-binary matrix.
The following table summarizes the number of non-zero features in various categories of
examples in the entire data set.
Type                 Min    Max      Median
Positive examples    687    11475    846
Negative examples    653    3185     783
All                  653    11475    787
Dataset E: MADELON
1) Topic
The task of MADELON is to classify random data. This is a two-class classification
problem with continuous input variables.
2) Sources
The data is synthetic. It was generated by the program hypercube_data.m, which is
appended.
3) Past usage
None, although the idea of the program is inspired by:
Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space
Simon Perkins, Kevin Lacker, James Theiler; JMLR, 3(Mar):1333-1356, 2003.
http://www.jmlr.org/papers/volume3/perkins03a/perkins03a.pdf
4) Experimental design
To draw random data, the program takes the following steps:
- Each class is composed of a number of Gaussian clusters. For each cluster,
num_pat_per_cluster examples with num_useful_feat independent features are drawn from N(0,1).
- Some covariance is added by multiplying by a random matrix A, with uniformly
distributed random numbers between -1 and 1.
- The clusters are then placed at random on the vertices of a hypercube in a
num_useful_feat dimensional space. The hypercube vertices are placed at values
± class_sep.
- Redundant features are added. They are obtained by multiplying the useful
features by a random matrix B, with uniformly distributed random numbers
between -1 and 1.
- Some of the previously drawn features are repeated by drawing randomly from
useful and redundant features.
- Useless features (random probes) are added using N(0,1).
- All the features are then shifted and rescaled randomly to span 3 orders of
magnitude.
- Random noise is then added to the features according to N(0,0.1).
- A fraction flip_y of the labels is randomly flipped (a sketch of these last steps is
given after this list).
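The appended hypercube_data.m fragments shown later do not include these final steps;
here is a minimal sketch of them under our reading of the list above (the exact shifting
and rescaling scheme is an assumption):

% Sketch (assumed details) of the final steps: random shift/rescale spanning
% ~3 orders of magnitude, additive N(0,0.1) noise, and label flipping.
[num_pat, num_feat] = size(X);
scales = 10.^(3*rand(1, num_feat));                   % per-feature scales over 3 orders of magnitude
shifts = randn(1, num_feat);                          % per-feature shifts
X = (X + repmat(shifts, num_pat, 1)) .* repmat(scales, num_pat, 1);
X = X + random('norm', 0, 0.1, num_pat, num_feat);    % additive noise N(0,0.1)
nflip = round(flip_y * num_pat);                      % number of labels to flip
fidx = randperm(num_pat);
Y(fidx(1:nflip)) = -Y(fidx(1:nflip));                 % exchange a fraction flip_y of the labels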
To illustrate how the program works, we show a small example generating a XOR-type
problem. There are only 2 useful features, 2 redundant features, and 2 repeated features.
Another 14 random probes were added. A total of 100 examples were drawn (25 per
cluster). Ten percent of the labels were flipped.
In Figure E.1, we show all the scatter plots of pairs of features, for the useful and
redundant features. For the first two features, we recognize a XOR-type pattern. For the
last feature, we see that after rotation, we get a feature that alone separates the data pretty
well.
In Figure E.2, we show the heat map of the data matrix. In Figure E.3, we show the same
matrix after random permutations of the rows and columns and grouping of the examples
per class. We notice that the data looks pretty much like white noise to the eye.
We then drew the data used for the benchmark with the following choice of parameters:
num_class=2; % Number of classes.
num_pat_per_cluster=250; % Number of patterns per cluster.
num_useful_feat=5; % Number of useful features.
num_clust_per_class=16; % Number of cluster per class.
num_redundant_feat=5; % Number of redundant features.
num_repeat_feat=10; % Number of repeated features.
num_useless_feat=480; % Number of useless features.
class_sep=2; % Scaling factor controlling cluster separation.
flip_y = 0.01; % Fraction of flipped labels.
Figure E.1: Scatter plots of the XOR-type example data for pairs of useful and
redundant features. Histograms of the examples for the corresponding features are
shown on the diagonal.
Figure E.2: Heat map of the XOR-type example data. We show all the coefficients of
the data matrix. The intensity indicates the magnitude of the coefficients. The color
indicates the sign. In lines, we show the 100 examples drawn (25 per cluster). In columns,
we show the 20 features. Only the first 6 are relevant: 2 useful, 2 redundant, 2
repeated. The data have been shifted and scaled by column to look “more natural”. The
last column shows the target values, with some “flipped” labels.
Figure E.3: Heat map of the XOR-type example data. This is the same matrix as the
one shown in Figure E.2. However, the examples have been randomly permuted and
grouped per class. The features have also been randomly permuted. Consequently, after
normalization, the data looks very uninformative to the eye.
Figure E.4: Scatter plots of the benchmark data for pairs of useful and redundant
features. We can see that the two classes overlap completely in all pairs of features. This
is normal because 5 dimensions are needed to separate the data.
Figure E.5: Heat map of the benchmark data for the relevant features (useful,
redundant, and repeated). We see the clustered structure of the data.
5) Number of examples and class distribution
Two additional test sets of the same size were drawn similarly and reserved to be able to
test the features selected by the benchmark participants, in case it becomes important to
make sure they trained only on those features.
All variables are integer. There are no missing values. The data is not sparse. The data
was saved as a non-sparse matrix.
% Feature ranking fragment: score each feature by the correlation criterion
% fval = Y'*X and keep the indices of the num top-ranked features.
fval=Y'*X;
[sval, si]=sort(-fval);
idx=si(1:num);
function [W,b]=lambda_classifier(X, Y)
%[W,b]=lambda_classifier(X, Y)
% This simple but efficient two-class linear classifier
% of the type Y_hat=X*W'+b
% was invented by Golub et al.
% Inputs:
% X -- Data matrix of dim (num examples, num features)
% Y -- Output matrix of dim (num examples, 1)
% Returns:
% W -- Weight vector of dim (1, num features)
% b -- Bias value.
Posidx=find(Y>0);
Negidx=find(Y<0);
Mu1=mean(X(Posidx,:));
Mu2=mean(X(Negidx,:));
Sigma1=std(X(Posidx,:),1);
Sigma2=std(X(Negidx,:),1);
W=(Mu1-Mu2)./(Sigma1+Sigma2);
B=(Mu1+Mu2)/2;
b=-W*B';
% Hypercube design
is_XOR=0;
if num_useful_feat==2 & num_class==2 & num_clust_per_class==2,
is_XOR=1;
all_C=[-1 -1; 1 1; 1 -1; -1 1]; % XOR
else
if isempty(all_C)
fprintf('New C\n');
all_C=2*ff2n(num_useful_feat)-1;
rndidx=randperm(size(all_C,1));
all_C=all_C(rndidx,:);
end
end
% Draw A
if isempty(A)
fprintf('New A\n');
for k=1:num_class*num_clust_per_class
A{k} = 2*rand(num_useful_feat, num_useful_feat)-1;
end
end
% Loop over all clusters
for k=1:num_class*num_clust_per_class
% define the range of patterns of that cluster
kmin=(k-1)*num_pat_per_cluster+1;
kmax=kmin+num_pat_per_cluster-1;
kidx=kmin:kmax;
% Draw n features independently at random
X(kidx,1:num_useful_feat)=random('norm', 0, 1, num_pat_per_cluster, num_useful_feat);
% Multiply by a random matrix to create some co-variance of the features
X(kidx,1:num_useful_feat)=X(kidx,1:num_useful_feat)*A{k};
% Shift the center off zero to separate the clusters
C=all_C(k,:)*class_sep;
X(kidx,1:num_useful_feat) = X(kidx,1:num_useful_feat) + repmat(C, num_pat_per_cluster, 1);
end
if debug,
featdisplay(normalize_data([X(:,1:num_useful_feat),Y])); title('Useful features');
figure; scatterplot(X(:, 1:num_useful_feat), Y); title('Useful features');
end
if debug,
featdisplay(normalize_data([X(:,1:num_useful_feat+num_redundant_feat),Y]));
title('Useful+redundant features');
figure; scatterplot(X(:, 1:num_useful_feat+num_redundant_feat), Y);
title('Useful+redundant features');
end
% Repeat num_repeat_feat features, chosen at random among useful and redundant feat
nf=num_useful_feat+num_redundant_feat;
if isempty(rf)
fprintf('New rf\n');
rf=round(1+rand(num_repeat_feat,1)*(nf-1));
end
X(:,nf+1:nf+num_repeat_feat)=X(:,rf);
if debug,
featdisplay(normalize_data([X(:,1:num_useful_feat+num_redundant_feat+num_repeat_feat),Y]));
title('Useful+redundant+repeated features');
end
% Add useless features: these are uncorrelated with one another, but could be correlated :=)
X(:,num_feat-num_useless_feat+1:num_feat)=random('norm', 0, 1, num_pat, num_useless_feat);
if debug,
featdisplay(normalize_data([X,Y]));
title('All features');
end
if debug,
featdisplay(normalize_data([X,Y]));
title('All features + flipped labels');
end
if debug,
featdisplay([X,100*normalize_data(Y)]);
title('All features + flipped labels + scale shifted');
end
if debug,
[ys,pattidx]=sort(YP);
featdisplay(normalize_data([XP0(pattidx,:),YP(pattidx)]));
title('After permutation and data normalization');
end
if debug,
featdisplay(normalize_data([XP(pattidx,:),YP(pattidx)]));
title('After adding noise');
end