

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS
Ferhat Özgür Çatak

Genetic Algorithm based Feature Selection in High Dimensional Text Dataset Classification

FERHAT ÖZGÜR ÇATAK
TÜBİTAK - BİLGEM
Cyber Security Institute
Gebze, Kocaeli
TURKEY
[Link]@[Link]

Abstract: The vector space model with a bag-of-words language model is commonly used to represent the documents in a corpus. This representation, however, requires a high dimensional input feature space that contains irrelevant and redundant features. Removing redundant features from the input space improves the generalization ability of a classifier. In this study, we developed a new objective function based on the model's F1 score and the feature subset size. We present a genetic algorithm for feature selection that reduces the modeling complexity and training time of the classification algorithms used in the text classification task. We use this genetic algorithm based meta-heuristic optimization to improve the F1 score of the classifier hypothesis. Firstly, (i) we developed a new objective function to maximize; (ii) then we choose candidate features for the classification algorithm; and (iii) finally, support vector machine (SVM), maximum entropy (MaxEnt) and stochastic gradient descent (SGD) classification algorithms are used to build classification models on publicly available datasets.

Key-Words: Feature selection, support vector machines, logistic regression, stochastic gradient descent, document classification

1 Introduction

This research is motivated by the complex model representation problem of language models in the text classification task. Our research focuses mainly on reducing the model complexity of classifier functions by reducing the size of the input feature vector. We applied the proposed model to publicly available text classification datasets that contain thousands of features.

Text classification is a special case of the classification task in machine learning [1]. In recent years it has gained considerable attention, and it is one of the most attractive topics in sentiment analysis [2] and information retrieval [3]. The main difference from classification tasks in other domains is that a text classification data set contains a very large number of features. Most text classification research uses the bag-of-words model, which converts the unique words or phrases occurring in the documents of a corpus into a vector space based language model. The main problem with the bag-of-words technique is that it generates hundreds or thousands of input features, which is not an efficient way of vectorizing features, and most of the generated features are irrelevant, redundant or noisy [4]. Filtering redundant and irrelevant features out of a text data set is therefore a primary problem for high dimensional input spaces in this field. Features with high discriminative power build robust classification models and also reduce the computational complexity of the training algorithms [5].

Another advantage of feature selection is the improvement of model accuracy and stability: feature selection has a great impact on predictive results with high dimensional input spaces. Its main advantage in machine learning is to reduce the complexity of the classification hypothesis. Feature selection also helps avoid over-fitting of the classifier function [6], reduces the memory consumption of the learning algorithm [7] and removes irrelevant features (terms) from the text data set [8].

Feature selection methods are basically divided into two approaches: filter and wrapper methods [9]. Filter based feature selection algorithms use an evaluation function to perform feature selection [10]: each feature is ranked according to a metric, such as the Fisher score, the t-test, or information gain. This approach is independent of the classifier algorithm.

E-ISSN: 2224-3402 290 Volume 12, 2015


Filter methods are very efficient and fast to compute, but they can select a feature subset that still contains redundant features. The second option for feature selection is wrapper methods [11]. Wrapper based feature selection algorithms evaluate subsets of the feature set by the accuracy of a classifier function trained on the training data restricted to each subset. This approach is dependent on the classifier algorithm.

In recent years, heuristic algorithms have been widely used in feature selection for large-scale data sets. Kabir et al. [12] developed an ant colony optimization based feature selection method for neural network algorithms. Their approach combines the advantages of both wrapper and filter based methods, and they compare their results with existing feature selection algorithms. Chen et al. [13] also developed an ant colony optimization method with rough set theory based feature selection; their approach uses mutual information based feature significance. They used 9 different UCI datasets whose feature sizes range from 7 to 70. Unler et al. [14] developed a different approach with particle swarm optimization based feature selection. Their method uses the relevance and dependence of the features included in the feature subset, and they also evaluated it on public datasets. Bae et al. [15] developed a new method to overcome premature convergence of the objective function; their particle swarm based approach uses intelligent swarms for heuristic search.

This research proposes a new algorithm that considers both the number of features in the feature subset and the F1 score of the classifier function generated with this feature subset. The F1 score is the most widely used model selection criterion in the information retrieval domain. The overall contributions of the study can be listed as follows:

1. Using feature selection, the input matrix, which is quite large for memory, is reduced, and the input matrix complexity is reduced in this manner.

2. An F1 based model selection method is used.

3. The iteration count of our algorithm is quite low.

The rest of the paper is organized as follows. Section 2.1 provides a brief review of the SVM, MaxEnt and SGD classification algorithms. Section 2.2 describes genetic algorithm based heuristic optimization. Section 3 presents the details of the proposed feature selection method. Section 4 gives the experimental results on public datasets. The last section discusses the experimental results and future work on the proposed model.

2 Preliminaries

In this section, we give brief information about the classification methods (Section 2.1) and the genetic algorithm (Section 2.2).

2.1 Classification Methods

In our experiments, we used three different classification algorithms: support vector machine, maximum entropy and stochastic gradient descent. This section gives brief information about these machine learning algorithms.

2.1.1 Support Vector Machine

The support vector machine (SVM) classification algorithm is a widely used supervised learning method in machine learning. SVM is based on statistical learning theory and tries to maximize the generalization ability of the classifier model it produces. The SVM classification algorithm uses a set of training instances and predicts, for each new instance, one of the two possible class labels −1 or +1. As shown in Figure 1, the separating hyperplane is defined by wᵀx + b = 0, where w ∈ Rᵐ is orthogonal to the hyperplane and b ∈ R is a constant. The training data D is a set of points of the form

    D = {(xᵢ, yᵢ) | xᵢ ∈ Rᵐ, yᵢ ∈ {−1, +1}}, i = 1, …, n        (1)

where xᵢ is an m-dimensional real vector and yᵢ is the class of the input vector xᵢ, either −1 or +1. SVM searches for a hyperplane that maximizes the margin between the two classes of samples in D with the smallest empirical risk [16]. For the generalization property of SVM, two parallel hyperplanes are defined such that wᵀx + b = 1 and wᵀx + b = −1. One can combine these two constraints into a single one:

    yᵢ(wᵀxᵢ + b) ≥ 1        (2)

SVM aims to maximize the distance between these two hyperplanes, which is proportional to 1/‖w‖. The training of SVM for the non-separable case is solved as the quadratic optimization problem shown in Equation 3:

    minimize:   P(w, b, ξ) = (1/2)‖w‖² + C Σᵢ₌₁ⁿ ξᵢ
    subject to: yᵢ(w · φ(xᵢ) + b) ≥ 1 − ξᵢ,   ξᵢ ≥ 0,   for i = 1, …, n        (3)
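As a concrete illustration of the soft-margin linear SVM described above, the following sketch fits scikit-learn's LinearSVC on a tiny bag-of-words corpus. The corpus, labels and current parameter spellings are our own illustrative assumptions, not the paper's code (the paper's actual settings appear in Section 4).

```python
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer

# Tiny hypothetical corpus, not from the paper's datasets.
docs = ["cheap farm ads buy now", "tractor sale discount offer",
        "machine learning lecture notes", "research seminar on algorithms"]
labels = [1, 1, -1, -1]  # +1 / -1 labels, as in the SVM formulation above

# Bag-of-words vectorization produces the high dimensional input space.
X = CountVectorizer().fit_transform(docs)

# Parameters mirror Section 4: C=0.01, l2 loss (scikit-learn calls it
# "squared_hinge"), dual formulation, tolerance 1e-4.
clf = LinearSVC(C=0.01, loss="squared_hinge", dual=True, tol=0.0001)
clf.fit(X, labels)
preds = clf.predict(X)
```

Note that "squared_hinge" is scikit-learn's current name for the l2 loss listed in the experimental setup.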



Figure 1: SVM classification algorithm separating hyperplane illustration.

Here the ξᵢ are slack variables and C is the cost of each slack: C is a control parameter balancing margin maximization against empirical risk minimization.

2.1.2 Maximum Entropy

Maximum entropy (MaxEnt) is another linear classification model based on empirical data. MaxEnt is a means of estimating probability distributions from data; the conditional distribution of the training dataset X is used as constraints [17]. The maximum-likelihood distribution has the exponential form

    p(y | x) = (1/Z(x)) exp( Σᵢ λᵢ,c Fᵢ,c(x, y) )        (4)

where Z(x) is a normalization function and Fᵢ,c is a feature/class function for feature fᵢ and class c, defined as [18]:

    Fᵢ,c(d, c′) = 1 if nᵢ(d) > 0 and c′ = c,  0 otherwise        (5)

2.1.3 Stochastic Gradient Descent

The stochastic gradient descent (SGD) algorithm is another classification method. Let x be an arbitrary instance of the training dataset and y the scalar output for instance x. Our aim is to minimize a loss function l(ŷ, y) that measures the cost of predicting ŷ when the known outcome is y. The classification task searches a function space F of functions f_w parametrized by a weight vector w. Gradient descent uses the empirical risk to find the parameter vector w. The empirical risk can be computed with the approximation

    R_emp = (1/n) Σᵢ₌₁ⁿ l(ŷᵢ, yᵢ)        (6)

Gradient descent minimizes the empirical risk by updating the weight vector w at each iteration with a learning rate λ:

    wₜ₊₁ = wₜ − λ (1/n) Σᵢ₌₁ⁿ ∇w l(f(wₜ; xᵢ), yᵢ)        (7)

Stochastic gradient descent is a simplified version of Equation 7:

    wₜ₊₁ = wₜ − λ ∇w l(f(wₜ; xₜ), yₜ)        (8)

Instead of using the full empirical risk R_emp, SGD uses randomly picked instances of the training set X to estimate the gradient.

2.2 Genetic Algorithm

The genetic algorithm (GA) is an evolutionary algorithm that mimics the natural selection, crossover and mutation processes. GA was first developed by Holland in 1975. It is a stochastic optimization method based on metaheuristic search procedures. GA starts with a matrix holding a population of solutions: each row of this matrix is a randomly generated individual, and each individual encodes, as a set of genes, one solution of an objective function. Using the objective function, the fitness of each individual is computed. The population is improved by combining genetic information from different members of the population; this process is called crossover. Another population improvement method is mutation, in which some individuals of the population are mutated according to the population's mutation rate. Pseudo code of GA is shown in Algorithm 1.

3 Proposed Model

In this study, at the first step we generate a random initial population whose genes are drawn from a normal distribution with µ = 0.3 and σ = 0.15. Each generation has a 0.3 mutation and a 0.3 crossover probability. Elitism is used to carry several of the highest-scoring individuals over to the next generation directly.

Step 1, initial population generation: Our model starts by creating a random initial population.

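A minimal sketch of the GA loop under the settings just described (Gaussian initialization with µ = 0.3, σ = 0.15, elitism, a 0.3 mutation probability). The fitness function is a stand-in, since the paper's real objective requires training a classifier, and the paper's own implementation uses the inspyred library:

```python
import numpy as np

rng = np.random.default_rng(0)
pop_size, m = 10, 20  # population size and genes per chromosome

# Gaussian initial population, as described above.
P = rng.normal(0.3, 0.15, size=(pop_size, m))

def fitness(P):
    # Stand-in fitness; the paper's objective combines feature ratio and F1.
    return P.sum(axis=1)

for t in range(50):
    F = fitness(P)
    elite = P[np.argmax(F)].copy()          # elitism: remember the best individual
    # One-point crossover between randomly paired parents.
    parents = P[rng.integers(pop_size, size=(pop_size, 2))]
    cut = rng.integers(1, m, size=pop_size)
    children = np.where(np.arange(m) < cut[:, None],
                        parents[:, 0], parents[:, 1])
    # Mutation: perturb each gene with probability 0.3 (the rate stated above).
    mutate = rng.random(children.shape) < 0.3
    children = children + mutate * rng.normal(0, 0.15, children.shape)
    P = children
    P[0] = elite                             # elite survives unchanged

best = P[np.argmax(fitness(P))]
```

The elite copy guarantees that the best solution found so far is never lost to crossover or mutation.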


Algorithm 1 Genetic Algorithm

    procedure GeneticAlgorithm(P)
        t ← 0
        InitPopulation P(t)                ▷ Initialize population randomly
        F(t) ← ComputeFitness(P(t))
        while not terminated do
            t ← t + 1
            P(t) ← crossover(P(t − 1))
            P(t) ← mutate(P(t))
            F(t) ← ComputeFitness(P(t))
        end while
        return best p                      ▷ Return the best individual
    end procedure

Each chromosome contains one gene per input feature, and each gene holds a real number. The general representation of a chromosome is shown in Equation 9:

    C = {fᵢ | fᵢ ∈ [0, 1]}, i = 1, …, m        (9)

Figure 2 shows the chromosome representation of the input feature set of the training data.

Figure 2: Chromosome representation of feature set.

We propose a new threshold value to select features. The threshold function converges rapidly to 0.5 over the iteration steps:

    v = exp(−2t) · rand() + 0.5        (10)

where t is the iteration number and rand() is a uniform random distribution function with range [0, 1].

Step 2, classifier model generation: Each chromosome in the population represents a set of selected features in the training and test sets. The support vector machine, maximum entropy and stochastic gradient descent classification algorithms are used to measure the accuracy performance of the proposed model.

Step 3, objective function: The objective function to be maximized is the sum of the feature ratio and the F1 score of the best chromosome. The F1 measure is the harmonic mean of the precision and recall scores of the classifier. Precision is the probability that the instances retrieved by the classifier function are relevant, and recall is the probability that the relevant instances are retrieved by the classifier function. Precision and recall are defined in Equations 11 and 12 respectively:

    P = TP / (TP + FP)        (11)

    R = TP / (TP + FN)        (12)

The F-score, the harmonic mean of the precision and recall scores, is especially used in machine learning and information retrieval. The F1 score is defined in Equation 13:

    F1 = 2·P·R / (P + R)        (13)

Our contribution is the objective function used in the genetic algorithm based optimization for feature selection. We combine the F1 score with the feature ratio; the objective function is defined in Equation 14:

    U = (Num of Features) / (Num of Features at iteration t) + F1,t        (14)

Pseudo code of the method is shown in Algorithm 2.

Algorithm 2 Proposed genetic algorithm based feature selection method

    1: C, T, f_best ← ∅                    ▷ Initialize
    2: P₀ ← random Gaussian distribution with µ = 0.3 and σ = 0.15
    3: Convert each individual gene pᵢ to a binary value: 0 if pᵢ < 0.5, 1 otherwise
    4: while t ≤ T do
    5:     t ← t + 1
    6:     GeneticAlgorithm(Pₜ)
    7:     if argmax(Pₜ) ≥ f_best then f_best ← argmax(Pₜ)
    8: end while
    9: return best p                       ▷ Return the best individual

4 Results

In this section, we compare the F1 scores obtained with the proposed feature selection method against the original datasets on three different public text datasets. We used three public benchmark datasets to verify the model's effectiveness and efficiency; they are summarized in Table 1.


Table 1: Description of the testing data sets used in the experiments.

    Dataset     Train     Test      Class   #Att.
    Farm-Ads     4,000       143    2          54,877
    News20      18,000     1,996    2       1,355,191
    RCV1        20,242   677,399    2          47,236

The datasets are Farm-Ads [19], News20-Binary [20] and RCV1 [21]. All experiments are repeated 5 times and the results are averaged. For each dataset, support vector machine, maximum entropy and stochastic gradient descent models were constructed.

The support vector machine parameters used to train the classifier model are summarized as follows:

• Kernel: linear
• C = 0.01
• Loss function: l2 regularization
• Dual mode: True
• Tolerance: 0.0001

The maximum entropy parameters are:

• C = 1
• Loss function: l2 regularization
• Dual mode: True

The stochastic gradient descent parameters are:

• C = 1
• Loss function: hinge
• Regularization term: 0.0001
• Learning rate: 0.01

The genetic algorithm based feature selection method's parameters are:

• Initial population: normally distributed random values ∈ [0, 1]
• Population size: 30
• Number of generations: 1000
• Crossover rate: 0.9
• Mutation rate: 0.01
• Elites: 1

Each dataset is split so that the first 90% of the instances are used for training and the remaining 10% for testing. The proposed method is implemented in Python with the scikit-learn and inspyred libraries. Results are shown in Tables 2, 3 and 4: for every dataset we report the feature size, F1 score and accuracy.

Accuracy changes of the proposed model are shown in Figure 3. As the figures show, there is an inverse correlation between the initial population size and the rate of convergence of the classifier model accuracy. Our experiments on each dataset show that the final accuracy level of the classifier models is the same across a range of initial population sizes. For instance, in the Farm-Ads and RCV1 feature selection experiments, the final accuracy of the classifier model is the same for population sizes of 50, 150 and 200; once the population size exceeds 200, classifier model accuracy begins to decrease smoothly. The population step size for the News20 dataset differs from the other datasets: we chose a step size of 10, with the population starting at 10 and ending at 50. Although the step size and initial population size differ from the others, Figure 3 shows that classifier model accuracy is likewise constant up to a certain size and then begins to decrease.

5 Conclusion

In this work, a novel objective function is developed for the feature selection task in machine learning. The main contribution of this work is a method, especially suitable for the information retrieval and text classification areas, that removes noisy and irrelevant features from the input space of a dataset while improving text classification performance. Our method tries to find a feature subset that is as small as possible while the classifier hypothesis keeps a high F1 score. As seen in the tables and figures, all training datasets converge uniformly to the optimal classifier accuracy. Our observations show that the population size of the genetic algorithm directly affects the performance of the global classifier function.

In the future, our method will be applied to more datasets to test its performance. We plan to find a relation between the population size, the iteration count at optimal convergence and the F1 score of the classifier model.



Table 2: SVM simulation results of selected public datasets.

                         All Features                             Selected Features
                           Train DS        Test DS                  Train DS        Test DS
    Dataset      Feat. Size   F1     Acc.    F1     Acc.   Feat. Size   F1     Acc.    F1     Acc.
    Farm-Ads          54877   0.992  0.991   0.993  0.993       21627   0.984  0.983   0.976  0.974
    News20-Binary   1355191   0.877  0.878   0.847  0.856      531653   0.886  0.887   0.857  0.870
    RCV1              47236   0.954  0.953   0.954  0.952       18566   0.946  0.944   0.937  0.932

Table 3: Maximum entropy simulation results of selected public datasets.

                         All Features                             Selected Features
                           Train DS        Test DS                  Train DS        Test DS
    Dataset      Feat. Size   F1     Acc.    F1     Acc.   Feat. Size   F1     Acc.    F1     Acc.
    Farm-Ads          54877   0.999  0.999   0.998  0.998       21587   0.998  0.998   0.994  0.994
    News20-Binary   1355191   0.967  0.967   0.960  0.960      531907   0.946  0.946   0.928  0.930
    RCV1              47236   0.979  0.978   0.978  0.978       18600   0.964  0.963   0.959  0.957

Figure 3: Feature reduction over population size. (a) Farm-Ads, (b) News20, (c) RCV1.
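The F1 scores reported in the tables combine precision and recall as in Equation 13; a quick check with scikit-learn's metrics on hypothetical predictions:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels, just to illustrate Equation 13's harmonic mean.
y_true = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1, -1, 1]

p = precision_score(y_true, y_pred)   # TP/(TP+FP) = 3/4
r = recall_score(y_true, y_pred)      # TP/(TP+FN) = 3/4
f1 = f1_score(y_true, y_pred)         # 2PR/(P+R) = 0.75
```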

References:

[1] F. Colace, M. De Santo, L. Greco, and P. Napoletano, Text classification using a few labeled examples, Computers in Human Behavior 30, 2014, pp. 689–697.
[2] Y. Rao, J. Lei, L. Wenyin, Q. Li, and M. Chen, Building emotional dictionary for sentiment analysis of online news, World Wide Web 17, 2014, pp. 723–742.
[3] H. M. El-Bakry and N. Mastorakis, Fast information retrieval from web pages, Proceedings of the 7th WSEAS International Conference on Computational Intelligence, Man-Machine Systems and Cybernetics, 2008, pp. 229–247.
[4] C. Buck, K. Heafield, and B. van Ooyen, N-gram counts and language models from the Common Crawl, Proceedings of the Language Resources and Evaluation Conference, 2014.
[5] Y. Guo, G. Zhao, and M. Pietikäinen, Discriminative features for texture description, Pattern Recognition 45, 2012, pp. 3834–3843.
[6] C. Chu, A.-L. Hsu, K.-H. Chou, P. Bandettini, and C. Lin, Does feature selection improve classification accuracy? Impact of sample size and feature selection on classification using anatomical magnetic resonance images, NeuroImage 60, 2012, pp. 59–70.
[7] D. Mladenić, J. Brank, M. Grobelnik, and N. Milic-Frayling, Feature selection using linear classifier weights: Interaction with classification models, Proceedings of SIGIR '04, 2004, pp. 234–241.



Table 4: Stochastic gradient descent simulation results of selected public datasets.

                         All Features                             Selected Features
                           Train DS        Test DS                  Train DS        Test DS
    Dataset      Feat. Size   F1     Acc.    F1     Acc.   Feat. Size   F1     Acc.    F1     Acc.
    Farm-Ads          54877   0.982  0.981   0.977  0.978       21517   0.992  0.991   0.985  0.984
    News20-Binary   1355191   0.991  0.990   0.999  0.999      531963   0.971  0.971   0.987  0.987
    RCV1              47236   0.987  0.987   0.997  0.997       18661   0.972  0.971   0.983  0.983

[8] L. Yu and H. Liu, Efficient feature selection via analysis of relevance and redundancy, Journal of Machine Learning Research 5, 2004, pp. 1205–1224.
[9] I. Rodriguez-Lujan, R. Huerta, C. Elkan, and C. S. Cruz, Quadratic programming feature selection, Journal of Machine Learning Research 11, 2010, pp. 1491–1516.
[10] G. Chandrashekar and F. Sahin, A survey on feature selection methods, Computers & Electrical Engineering 40, 2014, pp. 16–28.
[11] Q. Song, J. Ni, and G. Wang, A fast clustering-based feature subset selection algorithm for high-dimensional data, IEEE Transactions on Knowledge and Data Engineering 25, 2013, pp. 1–14.
[12] M. M. Kabir, M. Shahjahan, and K. Murase, A new hybrid ant colony optimization algorithm for feature selection, Expert Systems with Applications 39, 2012, pp. 3747–3763.
[13] Y. Chen, D. Miao, and R. Wang, A rough set approach to feature selection based on ant colony optimization, Pattern Recognition Letters 31, 2010, pp. 226–233.
[14] A. Unler and A. Murat, A discrete particle swarm optimization method for feature selection in binary classification problems, European Journal of Operational Research 206, 2010, pp. 528–539.
[15] C. Bae, W.-C. Yeh, Y. Y. Chung, and S.-L. Liu, Feature selection with intelligent dynamic swarm and rough set, Expert Systems with Applications 37, 2010, pp. 7026–7032.
[16] V. Vapnik, The Nature of Statistical Learning Theory, 1995.
[17] K. Nigam, J. Lafferty, and A. McCallum, Using maximum entropy for text classification, IJCAI-99 Workshop on Machine Learning for Information Filtering 1, 1999.
[18] B. Pang, L. Lee, and S. Vaithyanathan, Thumbs up?: Sentiment classification using machine learning techniques, Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing 10, 2002, pp. 79–86.
[19] C. Mesterharm and M. J. Pazzani, Active learning using on-line algorithms, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011, pp. 850–858.
[20] K. Lang, Newsweeder: Learning to filter netnews, Proceedings of the Twelfth International Conference on Machine Learning, 1995, pp. 331–339.
[21] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, RCV1: A new benchmark collection for text categorization research, Journal of Machine Learning Research 5, 2004, pp. 361–397.
