Genetic Algorithm for Text Feature Selection
All content following this page was uploaded by Ferhat Ozgur Catak on 11 November 2015.
Abstract: The vector space model based bag-of-words language model is commonly used to represent documents in a corpus. But this representation requires a high-dimensional input feature space, containing irrelevant and redundant features, to represent all corpus files. Removing redundant features from the input space improves the generalization property of a classifier. In this study, we developed a new objective function based on the model's F1 score and the feature subset size. In this paper, we present work on a genetic algorithm for feature selection, in order to reduce the modeling complexity and training time of classification algorithms used in the text classification task. We used a genetic algorithm based meta-heuristic optimization algorithm to improve the F1 score of the classifier hypothesis. Firstly, (i) we developed a new objective function to maximize; (ii) then we choose candidate features for the classification algorithm; and (iii) finally, support vector machine (SVM), maximum entropy (MaxEnt) and stochastic gradient descent (SGD) classification algorithms are used to find classification models of publicly available datasets.
Key–Words: Feature selection, support vector machines, logistic regression, stochastic gradient descent, document
classification
efficient and fast to compute. But such a method can select a feature subset that contains redundant features. The second option for feature selection is wrapper methods [11]. Wrapper-based feature selection algorithms use a subset of the feature set and the accuracy of the classifier function trained on the training data restricted to that subset. This method is dependent on the classifier algorithm.

In recent years, heuristic algorithms have been widely used in feature selection for large-scale data sets. Kabir et al. [12] developed an ant colony optimization based feature selection method for neural network algorithms. Their approach combines the advantages of both wrapper and filter based methods, and they compare their results with other existing feature selection algorithms. Chen et al. [13] also developed an ant colony optimization method combined with rough set theory for feature selection. Their approach uses mutual information based feature significance. They used 9 different UCI datasets whose feature sizes range from 7 to 70. Unler et al. [14] developed a different approach based on the particle swarm optimization algorithm. Their feature selection method uses the relevance and dependence of the features included in the feature subset; they also used public datasets for their results. Bae et al. [15] developed a new method to overcome premature convergence of the objective function. Their particle swarm based approach uses intelligent swarms for the heuristic search.

This research proposes a new algorithm that considers both the number of features in a feature subset and the F1 score of the classifier function generated with this feature subset. The F1 score is the most widely used model selection metric in the information retrieval domain. The overall contributions of the study can be listed as follows:

1. Using feature selection, the input matrix, which is too large for memory, is reduced, and the input matrix complexity is reduced in this manner.

2. An F1 based model selection method is used.

3. The iteration count of our algorithm is quite low.

The rest of the paper is organized as follows. Section 2.1 provides a brief review of the SVM, MaxEnt and SGD classification algorithms. Section 2.2 presents genetic algorithm based heuristic optimization. Section 3 presents the details of the proposed feature selection method. Section 4 gives the experimental results on public datasets. The last section discusses the experimental results and the future work of the proposed model.

2 Preliminaries

In this section, we give brief information about the classification methods in Section 2.1 and the genetic algorithm in Section 2.2.

2.1 Classification Methods

In our experiments, we used three different classification algorithms: support vector machine, maximum entropy and stochastic gradient descent. In this section, we give some brief information about these machine learning algorithms.

2.1.1 Support Vector Machine

The support vector machine (SVM) classification algorithm is a widely used supervised learning method in the machine learning field. SVM is based on statistical learning theory and tries to maximize the generalization property of the classifier model generated by the algorithm. The SVM classification algorithm uses a set of training instances and predicts new instances with one of two possible class labels, −1 or +1. As shown in Figure 1, the hyperplane is defined by w^T x + b = 0, where w ∈ R^m is orthogonal to the hyperplane and b ∈ R is a constant. Given some training data D, a set of points of the form

D = {(x_i, y_i) | x_i ∈ R^m, y_i ∈ {−1, +1}}, i = 1, ..., n   (1)

where x_i is an m-dimensional real vector and y_i is the class of the input vector x_i, either −1 or +1. SVM aims to find a hyperplane that maximizes the margin between the two classes of samples in D with the smallest empirical risk [16]. For the generalization property of SVM, two parallel hyperplanes are defined such that w^T x + b = 1 and w^T x + b = −1. One can combine these two constraints into a single one:

y_i (w^T x_i + b) ≥ 1   (2)

SVM aims to maximize the distance between these two hyperplanes, which can be calculated as 2 / ||w||. The training of SVM for the non-separable case is solved as the quadratic optimization problem shown in Equation 3.

minimize:   P(w, b, ξ) = (1/2) ||w||^2 + C Σ_{i=1}^{n} ξ_i
subject to: y_i (w · φ(x_i) + b) ≥ 1 − ξ_i,   ξ_i ≥ 0   (3)
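As a concrete illustration of the hyperplane formulation above, the following sketch evaluates the decision function, the unit-margin constraint of Equation 2, and the margin width for a hypothetical weight vector; the values of w, b and the sample point are illustrative assumptions, not learned parameters.

```python
import math

# Hypothetical (not learned) hyperplane parameters for illustration.
w = [2.0, 1.0]
b = -3.0

def decision(x):
    # The SVM decision value w^T x + b.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    # Distance between the hyperplanes w^T x + b = 1 and w^T x + b = -1
    # is 2 / ||w||.
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# A point satisfies the unit-margin constraint (Equation 2)
# when y * (w^T x + b) >= 1.
x, y = [2.5, 1.0], +1
print(decision(x))           # 3.0
print(y * decision(x) >= 1)  # True
print(margin_width(w))       # 2 / sqrt(5), about 0.894
```

Shrinking ||w|| widens the margin, which is exactly what the (1/2)||w||^2 term in the quadratic program penalizes.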
w_{t+1} = w_t − λ ∇_w ℓ(f(w_t), y_t)   (8)
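Equation 8 is the generic stochastic gradient descent weight update. A minimal sketch for a linear model with hinge loss might look as follows; the toy data and learning rate are illustrative assumptions, not values from the paper.

```python
# Sketch of the SGD update of Equation 8,
# w_{t+1} = w_t - lambda * grad_w l(f(w_t), y_t),
# for a linear classifier with hinge loss.

def hinge_subgradient(w, x, y):
    # Subgradient of max(0, 1 - y * <w, x>) with respect to w.
    score = sum(wi * xi for wi, xi in zip(w, x))
    if y * score < 1:
        return [-y * xi for xi in x]
    return [0.0] * len(w)

def sgd_step(w, x, y, lam=0.1):
    # One SGD update with learning rate lam (illustrative value).
    g = hinge_subgradient(w, x, y)
    return [wi - lam * gi for wi, gi in zip(w, g)]

# A tiny linearly separable toy set (illustrative).
data = [([1.0, 2.0], +1), ([-1.0, -1.5], -1), ([2.0, 0.5], +1)]
w = [0.0, 0.0]
for _ in range(50):
    for x, y in data:
        w = sgd_step(w, x, y)

margins = [y * sum(wi * xi for wi, xi in zip(w, x)) for x, y in data]
print(all(m > 0 for m in margins))  # True
```

Once every point satisfies y * <w, x> >= 1, the subgradient is zero and the weights stop changing.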
Algorithm 1 Genetic Algorithm
  procedure GeneticAlgorithm(P)
      t ← 0
      InitPopulation P(t)            ▷ Initialize population randomly
      F(t) ← ComputeFitness(P(t))
      while not terminated do
          t ← t + 1
          P(t) ← crossover(P(t − 1))
          P(t) ← mutate(P(t))
          F(t) ← ComputeFitness(P(t))
      end while
      return best p                  ▷ Return the best individual
  end procedure

number of feature genes, and each gene holds a real number. The general representation of a chromosome is shown in Equation 9.

C = {f_i | f_i ∈ [0, 1]}, i = 1, ..., m   (9)

Figure 2 shows the chromosome representation of the input feature set of the training data set.

Figure 2: Chromosome representation of feature set.

We propose a new threshold value to select features. The threshold function converges rapidly to 0.5 over the iteration steps. Our threshold function is

v = exp(−2t) · rand() + 0.5   (10)

where t is the iteration number and rand() is a uniform random distribution function with range [0, 1].

Step 2 Classifier model generation: Each chromosome in the population represents the selected features in the training and test sets. Support vector machine, maximum entropy and stochastic gradient descent classification algorithms are used to show the accuracy performance of the proposed model.

Step 3 Objective function: The objective function to be maximized is the sum of the feature ratio and the F1 score of the best chromosome. The F1 measure is the harmonic mean of the precision and recall scores of the classifier. Precision is the probability that the instances retrieved by the classifier function are relevant, and recall is the probability that relevant instances are retrieved by the classifier function. The precision and recall scores are defined in Equations 11 and 12, respectively.

P = TP / (TP + FP)   (11)

R = TP / (TP + FN)   (12)

The F-score is the harmonic mean of the precision and recall scores, used especially in machine learning and information retrieval. The F1 score is defined in Equation 13.

F1 = 2 · P · R / (P + R)   (13)

Our contribution is the objective function used in the genetic algorithm based optimization for feature selection: we combine the F1 score with the feature ratio. The objective function is defined in Equation 14.

U = (Num of Features) / (Num of Features at iteration t) + F1,t   (14)

Pseudocode of the method is shown in Algorithm 2.

Algorithm 2 Proposed genetic algorithm based feature selection method
      C, T, f_best ← ∅               ▷ Initialize
  2:  P_0 ← random Gaussian distribution with µ = 0.3 and σ = 0.15
      Convert each individual gene p_i to a binary value: 0 if p_i < 0.5, 1 otherwise
  4:  while t ≤ T do
          t ← t + 1
  6:      GeneticAlgorithm(P_t)
          if argmax(P_t) ≥ f_best then f_best ← argmax(P_t)
  8:  end while
      return best p                  ▷ Return the best individual

4 Results

In this section, we present the results on three different public text datasets, comparing the F1 score of the proposed feature selection method against the original datasets. In this study we used three different public benchmark datasets to verify the model's effectiveness and efficiency.

We experiment on three public data sets, which are summarized in Table 1, including Farm-Ads [19],
Table 1: Description of the testing data sets used in the experiments.

Dataset         Train    Test      Class   #Att.
Farm-Ads        4,000    143       2       54,877
News20          18,000   1,996     2       1,355,191
RCV1            20,242   677,399   2       47,236

News20-Binary [20] and RCV1 [21]. All experiments are repeated 5 times and the results are averaged.

In this study, support vector machine, maximum entropy and stochastic gradient descent models were constructed.

The support vector machine parameters used to find the classifier model are summarized as follows:

• Kernel: linear
• C = 0.01
• Loss function: l2 regularization
• Dual mode: True
• Tolerance: 0.0001

The maximum entropy parameters used to find the classifier model are summarized as follows:

• C = 1
• Dual mode: True

The stochastic gradient descent parameters used to find the classifier model are summarized as follows:

• C = 1
• Loss function: hinge
• Regularization term: 0.0001
• Learning rate: 0.01

The parameters of the genetic algorithm based feature selection method are summarized as follows:

• Initial: normal random values generated in [0, 1]
• Population size: 30
• Number of generations: 1000
• Crossover rate: 0.9
• Mutation rate: 0.01
• Elites: 1

All datasets are split such that the first 90% of the instances are used for training and the remaining 10% for testing. The proposed method is implemented in the Python language with the scikit-learn and inspyred libraries.

Results are shown in Tables 2 and 3. We show the feature size, F1 score and accuracy for all datasets in Tables 1-6. Accuracy changes of the proposed model are shown in Figure 3. As shown in the figures, there is an inverse correlation between the initial population size and the rate of convergence of the classifier model accuracy. Our experiments on each dataset show that the final accuracy level of the classifier models is the same for several initial population sizes. For instance, in the Farm-Ads and RCV1 feature selection experiments, the final classifier model accuracy is the same for population sizes of 50, 150 and 200. After the population size exceeds 200, the classifier model accuracy smoothly begins to decrease. The population step size for the News20 dataset is different from the other datasets: we choose a step size of 10, with the initial population starting at 10 and the last size being 50. Although the step size and initial population size differ from the others, Figure 3 shows that the classifier models' accuracy is the same up to a certain population size and then begins to decrease.

5 Conclusion

In this work, a novel objective function is developed for the feature selection task in the machine learning area. The main contribution of this work is that our method is especially suitable for the information retrieval and text classification areas, removing noisy and irrelevant features from the input space of a dataset while improving the performance of text classification. Our method tries to find a feature subset that is as small as possible while the classifier hypothesis keeps a high F1 score. As seen in the tables and figures, all training datasets uniformly converge to the optimal classifier accuracy. Our observations show that the population size of the genetic algorithm directly affects the performance of the global classifier function.

In the future, our method will be applied to more datasets to test its performance. We plan to find a relation between the population size, the iteration count at optimal convergence and the F1 score of the classifier model.
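As a concrete illustration of the selection mechanism described above, the threshold function of Equation 10, the gene binarization of Algorithm 2 and the objective of Equation 14 can be combined as follows. The chromosome values and the F1 score are illustrative placeholders, and the reading of Equation 14 as the total feature count over the count selected at iteration t follows the formula as printed.

```python
import math
import random

def threshold(t):
    # Equation 10: v = exp(-2t) * rand() + 0.5, which converges to 0.5
    # as the iteration number t grows.
    return math.exp(-2 * t) * random.random() + 0.5

def select_features(chromosome, t):
    # Binarize each real-valued gene against the threshold:
    # a gene selects its feature (1) when it is >= the threshold.
    v = threshold(t)
    return [1 if gene >= v else 0 for gene in chromosome]

def objective(n_total, n_selected, f1):
    # Equation 14: feature ratio plus the F1 score at iteration t.
    return n_total / n_selected + f1

# Illustrative chromosome of 8 real-valued genes in [0, 1].
chromosome = [0.62, 0.18, 0.47, 0.91, 0.33, 0.74, 0.05, 0.58]
mask = select_features(chromosome, t=5)  # threshold is close to 0.5 at t = 5
f1 = 0.9                                 # placeholder classifier score
u = objective(len(chromosome), sum(mask), f1)
print(mask)  # [1, 0, 0, 1, 0, 1, 0, 1]
print(u)     # 8 / 4 + 0.9 = 2.9
```

Because exp(−2t) decays quickly, the threshold is effectively 0.5 after a few iterations, so only genes above 0.5 keep their features.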
Tables 2-4: Classification results with the full feature set (left block) and the selected feature subset (right block); each row lists the dataset, the feature count, and four score columns per block.

Dataset         #Att.     Scores (full)                  #Att.    Scores (selected)
Farm-Ads        54877     0.992  0.991  0.993  0.993     21627    0.984  0.983  0.976  0.974
News20-Binary   1355191   0.877  0.878  0.847  0.856     531653   0.886  0.887  0.857  0.870
RCV1            47236     0.954  0.953  0.954  0.952     18566    0.946  0.944  0.937  0.932

Farm-Ads        54877     0.999  0.999  0.998  0.998     21587    0.998  0.998  0.994  0.994
News20-Binary   1355191   0.967  0.967  0.960  0.960     531907   0.946  0.946  0.928  0.93
RCV1            47236     0.979  0.978  0.978  0.978     18600    0.964  0.963  0.959  0.957

Farm-Ads        54877     0.982  0.981  0.977  0.978     21517    0.992  0.991  0.985  0.984
News20-Binary   1355191   0.991  0.990  0.999  0.999     531963   0.971  0.971  0.987  0.987
RCV1            47236     0.987  0.987  0.997  0.997     18661    0.972  0.971  0.983  0.983
classifier weights: Interaction with classification models, in Proceedings of SIGIR 04, 2004, pp. 234–241.

[8] L. Yu and H. Liu, Efficient feature selection via analysis of relevance and redundancy, J. Mach. Learn. Res. 5, 2004, pp. 1205–1224.

[9] I. Rodriguez-Lujan, R. Huerta, C. Elkan, and C. S. Cruz, Quadratic programming feature selection, J. Mach. Learn. Res. 11, 2010, pp. 1491–1516.

[10] G. Chandrashekar and F. Sahin, A survey on feature selection methods, Computers & Electrical Engineering 40, 2014, pp. 16–28.

[11] Q. Song, J. Ni, and G. Wang, A fast clustering-based feature subset selection algorithm for high-dimensional data, IEEE Transactions on Knowledge and Data Engineering 25, 2013, pp. 1–14.

[12] M. M. Kabir, M. Shahjahan, and K. Murase, A new hybrid ant colony optimization algorithm for feature selection, Expert Systems with Applications 39, 2012, pp. 3747–3763.

[13] Y. Chen, D. Miao, and R. Wang, A rough set approach to feature selection based on ant colony optimization, Pattern Recognition Letters 31, 2010, pp. 226–233.

[14] A. Unler and A. Murat, A discrete particle swarm optimization method for feature selection in binary classification problems, European Journal of Operational Research 206, 2010, pp. 528–539.

[15] C. Bae, W.-C. Yeh, Y. Y. Chung, and S.-L. Liu, Feature selection with intelligent dynamic swarm and rough set, Expert Syst. Appl. 37, 2010, pp. 7026–7032.

[16] V. Vapnik, The Nature of Statistical Learning Theory, 1995.

[17] K. Nigam, J. Lafferty, and A. McCallum, Using maximum entropy for text classification, IJCAI-99 Workshop on Machine Learning for Information Filtering 1, 1999.

[18] B. Pang, L. Lee, and S. Vaithyanathan, Thumbs up?: Sentiment classification using machine learning techniques, Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing 10, 2002, pp. 79–86.

[19] C. Mesterharm and M. J. Pazzani, Active learning using on-line algorithms, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011, pp. 850–858.

[20] K. Lang, NewsWeeder: Learning to filter netnews, in Proceedings of the Twelfth International Conference on Machine Learning, 1995, pp. 331–339.

[21] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, RCV1: A new benchmark collection for text categorization research, Journal of Machine Learning Research 5, 2004, pp. 361–397.