
Overfitting by PSO Trained Feedforward Neural Networks

Andrich B. van Wyk and Andries P. Engelbrecht, Senior Member, IEEE

Abstract— The purpose of this paper is to investigate the overfitting behavior of particle swarm optimization (PSO) trained neural networks. Neural networks trained with PSOs using the global best, local best and Von Neumann information sharing topologies are investigated. Experiments are conducted on five classification and five time series regression problems. It is shown that differences exist in the degree of overfitting between the different topologies. Additionally, non-convergence of the swarms is witnessed, which is hypothetically attributed to the use of a bounded activation function in the neural networks. The hypothesis is supported by experiments conducted using an unbounded activation function in the neural network hidden layer, which led to convergent swarms and also to drastically reduced overfitting by the neural networks.

I. INTRODUCTION

One of the very first applications of the PSO algorithm was in the training of feedforward neural networks [4], [10]. Since then, the PSO algorithm has become a popular method for training neural networks [3], [5], [20], [21], particularly in its use as a global optimization algorithm [9], [22].

Although the PSO algorithm has been studied both theoretically and practically in the context of training neural networks [19], little has been done to investigate the generalization performance of PSO trained neural networks and in particular overfitting. Overfitting is a phenomenon where the neural network has adequate performance on the training data but poor performance on the generalization (and therefore any other unseen) data. This is often the result of the neural network having an excess of free parameters, which then models the training data too specifically, even accurately memorizing the noise contained within. Two main approaches exist to combat the phenomenon, namely regularization and model selection (either growing or pruning the network) [7], [13]. These methods are effective at preventing overfitting, but they often require additional data and/or computational time. The overfitting behavior of a neural network is specific to the training algorithm used to train the network. Understanding the degree to which a training algorithm overfits is therefore important, especially if overfitting can be reduced or prevented without the need for additional data or computational time.

The objective of this paper is to investigate overfitting by PSO trained feedforward neural networks and the differences in overfitting behavior between various information sharing topologies of the swarm. Section II gives an introduction to the PSO algorithm. Section III delivers an overview of the measurements used in the experimental work. Section IV reviews the problems used in the paper. Section V discusses the experiments and their results. The paper is concluded in Section VI along with reference to future work.

II. PARTICLE SWARM OPTIMIZATION

A. The Algorithm

The Particle Swarm Optimization (PSO) algorithm was developed by Kennedy and Eberhart in 1995, and attempts to model the flight pattern of a flock of birds [4], [10]. The PSO described here is Shi and Eberhart's modified particle swarm optimizer that includes the inertia weight [18]. Each particle in the swarm represents a candidate solution to the optimization problem. Let the current position of the ith such particle at time step t be denoted by xi(t). The particle then moves through the search space by adding to the position a velocity vi(t) as follows:

    xi(t + 1) = xi(t) + vi(t + 1)                                            (1)

The particle's velocity is the driving force behind the optimization process and is calculated using:

    vi(t + 1) = ωvi(t) + c1 r1(t)[yi(t) − xi(t)] + c2 r2(t)[ŷ(t) − xi(t)]    (2)

where ω represents the inertia weight, yi(t) is the particle's personal best position and ŷ(t) is the global best position. The personal best position represents the best position found by the particle during the entire search process. The global best position constitutes the best position found by the entire swarm. The quality of a solution is measured by a fitness function, f(xi(t)), calculated from the particle's current position.

Referring to equation (2), the velocity update equation consists of three weighted components: the previous velocity, a cognitive component containing the personal best position, and a social component influenced by the global best position. The cognitive and social components are scaled by the acceleration coefficients c1 r1(t) and c2 r2(t); c1 and c2 are constants and form two of the algorithm's control parameters. The random numbers r1j(t) and r2j(t) are sampled from U(0, 1).

Algorithm 1 summarizes the algorithm.

For the purposes of this study, the Guaranteed Convergence PSO (GCPSO) [19], [23] is used. It is possible for the standard PSO to converge on a point in the search space that is not an optimum. This stagnation occurs when xi = yi = ŷ, causing the velocity update equation to depend solely on the ωvi(t) term.

Andrich B. van Wyk is with the Department of Computer Science, University of Pretoria, Pretoria, 0002, South Africa (phone: +27 12 420 5242; email: [email protected]).
Andries P. Engelbrecht is with the Department of Computer Science, University of Pretoria, Pretoria, 0002, South Africa (phone: +27 12 420 3578; email: [email protected]).

978-1-4244-8126-2/10/$26.00 ©2010 IEEE

Authorized licensed use limited to: UNIVERSIDADE DE PERNAMBUO. Downloaded on May 16,2023 at 04:20:02 UTC from IEEE Xplore. Restrictions apply.
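As an illustrative sketch (not the CIlib implementation used in the paper), the update equations (1) and (2) for a single particle can be written as follows; the inertia and acceleration values are those given later in Section V-A:

```python
import numpy as np

def pso_step(x, v, y, y_hat, w=0.729844, c1=1.496180, c2=1.496180, rng=None):
    """One update of equations (1) and (2) for a single particle.

    x: current position, v: velocity, y: personal best, y_hat: global best.
    """
    rng = rng or np.random.default_rng()
    r1 = rng.random(x.shape)  # r1j(t) ~ U(0, 1), one sample per dimension
    r2 = rng.random(x.shape)  # r2j(t) ~ U(0, 1)
    v_new = w * v + c1 * r1 * (y - x) + c2 * r2 * (y_hat - x)  # equation (2)
    x_new = x + v_new                                          # equation (1)
    return x_new, v_new
```

Note that when x = y = ŷ (the stagnation case discussed above), the cognitive and social terms vanish and only the inertia term ωv(t) remains.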
Algorithm 1 The PSO algorithm
    Initialize the nx-dimensional swarm, S
    while (stopping condition is false) do
        for (each particle i = 1, ..., ns) do
            Calculate the fitness f(xi(t))
            Set the personal best position yi(t)
            Set the global best position ŷ(t)
        end for
        for (each particle i = 1, ..., ns) do
            Update the particle's velocity vi(t)
            Update the particle's position xi(t)
        end for
    end while

If ω < 1 this term will tend to 0 as the search process continues, thereby causing convergence at a potentially non-optimal point. The GCPSO is proven to avoid this premature convergence and always converges to at least a local optimum [19].

To achieve this, the GCPSO changes the update equations of only the global best particle. Let γ be the index of the global best particle; the changed position and velocity updates are then:

    xγ(t + 1) = ŷ(t) + ωvγ(t) + ρ(t)(1 − 2r2(t))                             (3)

    vγ(t + 1) = −xγ(t) + ŷ(t) + ωvγ(t) + ρ(t)(1 − 2r2(t))                    (4)

where r2j(t) are random numbers sampled from U(0, 1). The ρ parameter is a scaling term that defines a bounded box around the global best particle, within which the particle then randomly searches for a better position. For further details regarding the GCPSO, refer to [19] and [23].

B. Social Network Structures

The PSO described above is known as a gbest PSO, where every particle is connected to every other particle in the swarm; this type of social neighborhood is referred to as a star topology. A variety of PSO topologies exist to facilitate different information sharing structures, each affecting the performance in a potentially drastic way [11]. Three topologies that have been successfully used in feedforward neural network training in the past [4], [8], [13] have been selected for comparison in this paper:

• Global best (gbest): The swarm is fully connected and a single global best particle is selected for all other particles to follow. This topology has the fastest information flow between particles [11].

• Local best (lbest): The swarm is divided into smaller neighborhoods of a fixed size ns. Instead of a global best, a neighborhood best is selected from each neighborhood. The neighborhoods in turn are connected to each other in a ring topology. Information flow is much slower in this topology, as a good solution will take longer to reach particles in a neighborhood on the opposite side of the ring. The lbest topology has been found to be less susceptible to local minima, at the cost of convergence speed [11], [19].

• Von Neumann (VN): Each particle is connected to its left, right, top and bottom neighbors on a two-dimensional lattice. The rate of information flow is balanced, and it has been shown that this topology regularly outperforms other topologies [11].

C. Training Neural Networks

The PSO algorithm is used to train neural networks as follows. The goal is to find an optimal set of weights for a neural network. The neural network's weights are mapped to a multidimensional vector. Each particle in the swarm represents such a vector and therefore a complete neural network. To calculate a particle's fitness, the weights (the particle's current position) are placed into a neural network which then evaluates the training set. The training error is calculated and used as the particle's fitness.

III. MEASURING PERFORMANCE

The central focus of this paper is the overfitting and therefore generalization performance of trained neural networks. Three measurements are used to quantify neural network performance: the training error, ET, the generalization error, EG, and the generalization factor, ρF. The training error is calculated as the mean squared error (MSE) over the training set DT of size PT as follows:

    ET = ( Σ_{p=1..PT} Σ_{k=1..K} (tk,p − ok,p)^2 ) / (PT K)                 (5)

where K is the number of output units in the network. EG is calculated similarly over the generalization set, DG.

The generalization factor, ρF, was developed by Röbel [17] as a measure of overfitting. The factor does not measure the accuracy of the network, but takes the ratio between the generalization and training errors: ρF = EG / ET. It is desirable to have ρF < 1, which would indicate that the generalization error is less than the training error. A ρF > 1 indicates that EG > ET, meaning generalization performance is worse than training performance, indicating potential overfitting. It is important to note that the generalization factor is not an absolute measure of overfitting, and the training and generalization errors should always be considered as well. In this paper ρF is used to describe the overfitting behavior of an algorithm over time and not as a measure of the amount of overfitting.

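A minimal sketch of these two measures — the MSE of equation (5) and Röbel's generalization factor ρF = EG/ET — assuming targets and network outputs are given as P×K arrays:

```python
import numpy as np

def mse(targets, outputs):
    """Equation (5): mean squared error over P patterns and K output units."""
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    P, K = targets.shape
    return np.sum((targets - outputs) ** 2) / (P * K)

def generalization_factor(E_T, E_G):
    """Röbel's rho_F = E_G / E_T; values above 1 flag potential overfitting."""
    return E_G / E_T
```

For example, a training MSE of 0.01 paired with a generalization MSE of 0.04 gives ρF = 4, signaling generalization performance four times worse than training performance.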
A final performance measure that is used in the experimental work is a measure of global swarm diversity. The diversity measurement used is the average distance around the swarm center [12]:

    D = (1/|S|) Σ_{i=1..|S|} sqrt( Σ_{k=1..I} (xik − x̄k)^2 )                (6)

where |S| is the swarm size, I is the dimensionality of the problem space, and xi and x̄ are the position of particle i and the swarm center, respectively. This measurement was shown to be a valid measure of diversity in [15].

IV. PROBLEM SETS

In order to explore the generalization behavior of the PSO trained neural networks, a set of 10 well known data sets was chosen. The sets were split evenly between classification and regression tasks. For the regression problems, time series were used as functions to approximate, since time series are inherently noisy. The data sets for classification were also chosen because they are known to contain noise. Noise in the training data is a necessary condition for overfitting to occur.

For each problem a neural network architecture had to be chosen. Since the purpose of the study is to investigate generalization and in particular overfitting, no attempt was made to optimize the architecture (which could potentially eliminate any overfitting). Therefore relatively large hidden layers were chosen. All chosen neural network architectures had a single hidden layer, which includes a bias unit. The size of the input layer was set to the number of input parameters of the data set, also including a bias unit. For all the problems, the output layer contained a single neuron. All hidden and output neurons used the sigmoid activation function.

The problems and chosen sizes of the hidden layer are given in Table I.

TABLE I
The data sets used for the experimental work along with the chosen neural network layer sizes. The top half is classification data sets and the bottom regression.

Problem                  Input   Hidden   Source
Glass Identification     9       12       UCI MLR [1]
Iris                     4       8        UCI MLR [1]
Pima Indians Diabetes    8       20       UCI MLR [1]
WDBC¹                    30      25       UCI MLR [1]
Wine                     13      10       UCI MLR [1]
Henon Map                2       10       Equation (7) [19]
Logistic Map             4       10       Equation (8)
Mackey-Glass Equation    4       20       Equation (9) [16]
Sunspots (Annual)        4       10       NGDC [14]
TS5                      10      5        Equation (10)

¹ Wisconsin Diagnostic Breast Cancer

    zt = 1 + 0.3 zt−2 − 1.4 zt−1^2;  z0 and z1 ~ U(−1, 1)                   (7)

    zt = r zt−1 (1 − zt−1);  z0 = 0.01, r = 4                               (8)

    zt = (1 − b) zt−1 + a zt−τ / (1 + zt−τ^10);  b = 0.1, a = 0.2, τ = 30   (9)

    zt = 0.3 zt−6 − 0.6 zt−4 + 0.5 zt−1 + 0.3 zt−6^2 − 0.2 zt−4^2 + nt;  nt ~ N(0, 0.05)   (10)

For the Henon Map, Logistic Map, Mackey-Glass equation and TS5, 1000 data points were generated. To construct a data set for each of the time series, for each training target value at time step t + 1, the n values preceding t + 1 in the time series were taken as input; n is the number of input neurons in the neural network architecture. An exception to this is the Mackey-Glass problem, where the inputs were taken as t, t − 6, t − 12 and t − 18 as in [16].

A. Data set preparation

The following procedure was followed to prepare the data sets for training. Each input data value was scaled to the range [−2, 2], which represents the active domain of the sigmoid activation function. Also, since the sigmoid function is a bounded activation function, the target output values were scaled to fall in the range of the function, in this case [0.1, 0.9].

The scaled data set was then separated into two subsets for training and generalization purposes: 66% of the data set was taken as the training set, DT, with the remainder of the set used as the generalization set, DG. With the classification problems, the patterns were randomly allocated to each of the subsets. However, with the time series problems, the first 66% of the data set was used for DT and the last 34% for DG.

V. EXPERIMENTS AND RESULTS

The purpose of the initial experiment was twofold: first, to determine the degree to which (if any) feedforward neural networks overfit when trained using a PSO; secondly, to determine whether overfitting behavior differs between information sharing topologies. Each configuration of an algorithm was executed on a particular problem 30 times with different random seeds. The Mersenne Twister was used as a strong random number generator. The algorithms are implemented in CIlib (http://www.cilib.net/), which was used to conduct the experiments.

A. PSO Configuration

The following PSO configuration was used for the experiment: the c1 and c2 control parameters were both set to 1.496180, with inertia set to 0.729844. These values correspond to those reported in [6] and ensure swarm convergence [19]. The swarms consisted of 20 particles and, in the case of the lbest topology, a neighborhood size of 5 was used. No Vmax was used, because neural network weights can be arbitrarily large (or small) and values within the same weight vector often differ greatly in scale. The algorithms were executed for 5000 iterations.

B. Discussion

Results for the gbest, lbest and VN topologies are given in Tables II, III and IV respectively. According to the final ρF values, overfitting occurred on all the classification problem sets with all of the PSO algorithms.

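The data set preparation of Section IV-A can be sketched as follows (illustrative helper names; the 66/34 split and the [−2, 2] and [0.1, 0.9] ranges are from the text):

```python
import numpy as np

def scale(values, lo, hi):
    """Linearly rescale values into [lo, hi]."""
    values = np.asarray(values, dtype=float)
    vmin, vmax = values.min(), values.max()
    return lo + (values - vmin) * (hi - lo) / (vmax - vmin)

def windowed_dataset(series, n):
    """For each target z(t + 1), take the n preceding values as input."""
    inputs = np.array([series[t - n:t] for t in range(n, len(series))])
    targets = np.array(series[n:])
    return inputs, targets

def split(inputs, targets, train_frac=0.66):
    """First 66% as D_T, last 34% as D_G (time series are not shuffled)."""
    cut = int(len(inputs) * train_frac)
    return (inputs[:cut], targets[:cut]), (inputs[cut:], targets[cut:])
```

Inputs would be scaled with `scale(x, -2, 2)` and targets with `scale(z, 0.1, 0.9)`; classification patterns would instead be shuffled before splitting, as described above.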
TABLE II
GBest GCPSO results after 5000 iterations. Means over 30 samples are reported with standard deviations in parentheses.

Problem         ET                  EG                  ρF                    Diversity
Glass           0.00324 (0.00062)   0.03149 (0.00915)   10.30617 (4.49973)    2.87309 (3.80936)
Iris            0.00054 (0.0004)    0.00912 (0.00648)   35.12105 (38.39649)   5.76561 (2.04736)
Diabetes        0.05744 (0.00356)   0.12262 (0.01318)   2.15066 (0.32192)     0.43686 (0.20091)
WDBC            0.00333 (0.00115)   0.0209 (0.00489)    7.53132 (4.65636)     1.97236 (1.30683)
Wine            0.00006 (0.00007)   0.00575 (0.00236)   378.25793 (836.3)     0.3096 (0.40758)
Henon           0.1388 (0.02356)    0.13946 (0.05286)   1.10305 (0.62432)     47.93454 (75.8961)
Logistic Map    0.0002 (0.00014)    0.00024 (0.00021)   1.15215 (0.14864)     0.99753 (0.50421)
Mackey-Glass    0.00023 (0.00005)   0.0007 (0.00034)    3.30321 (1.92795)     4238716 (6719254)
Sunspots        0.00219 (0.00021)   0.0039 (0.00054)    1.81153 (0.40653)     4.21398 (1.11313)
TS5             0.00046 (0.00004)   0.00131 (0.0007)    2.90047 (1.55291)     0.95647 (0.51509)

TABLE III
LBest GCPSO results after 5000 iterations. Means over 30 samples are reported with standard deviations in parentheses.

Problem         ET                  EG                  ρF                    Diversity
Glass           0.00444 (0.00096)   0.02802 (0.00908)   7.0485 (4.39571)      50.4279 (36.5794)
Iris            0.00102 (0.00065)   0.00711 (0.00414)   11.7239 (10.20586)    29.29721 (24.3885)
Diabetes        0.06736 (0.00439)   0.11611 (0.01069)   1.73785 (0.25358)     46.98889 (5.67876)
WDBC            0.00543 (0.00135)   0.02153 (0.00525)   4.32964 (1.88473)     330.68555 (394.84)
Wine            0.00013 (0.00012)   0.00593 (0.00287)   170.61631 (345.453)   15.64711 (8.07007)
Henon           0.13936 (0.02692)   0.14163 (0.04816)   1.13264 (0.65404)     160.74088 (137.8)
Logistic Map    0.00029 (0.0002)    0.00033 (0.00023)   1.14707 (0.11407)     25.70928 (22.2)
Mackey-Glass    0.00032 (0.00007)   0.00069 (0.00035)   2.43907 (1.62997)     11308467 (16304600)
Sunspots        0.00245 (0.00022)   0.00387 (0.00064)   1.60193 (0.35558)     25.35808 (13.69)
TS5             0.00047 (0.00004)   0.00117 (0.00034)   2.52144 (0.83279)     6.90199 (1.12173)

TABLE IV
VN GCPSO results after 5000 iterations. Means over 30 samples are reported with standard deviations in parentheses.

Problem         ET                  EG                  ρF                    Diversity
Glass           0.00395 (0.00076)   0.02846 (0.00752)   7.5163 (2.50594)      15.89468 (9.90871)
Iris            0.00086 (0.00063)   0.00731 (0.00328)   18.14835 (19.52524)   17.9656 (15.8256)
Diabetes        0.06351 (0.00488)   0.11751 (0.00998)   1.86693 (0.26991)     30.35 (20.75214)
WDBC            0.00484 (0.00144)   0.0236 (0.00636)    5.59441 (2.98626)     264.693 (312.472)
Wine            0.00006 (0.00005)   0.00536 (0.00255)   174.02899 (166.16)    4.64002 (3.41658)
Henon           0.1383 (0.02532)    0.14318 (0.04932)   1.14648 (0.65262)     5814.33 (18075.27)
Logistic Map    0.00026 (0.00016)   0.00032 (0.00021)   1.18475 (0.15711)     11.1228 (11.47582)
Mackey-Glass    0.0003 (0.00007)    0.00072 (0.00031)   2.6711 (1.62423)      448635 (679877.6)
Sunspots        0.0024 (0.0002)     0.00395 (0.00059)   1.66479 (0.33779)     6.29358 (3.7171)
TS5             0.00047 (0.00004)   0.00117 (0.00037)   2.55562 (0.91756)     2.37121 (2.21876)

The relatively good ρF value on Diabetes could be attributed to the data set containing less noise than the other classification data sets. In general, lower ρF values were obtained by the lbest and VN topologies, indicating less overfitting. The decrease in overfitting with the lbest and VN topologies can be attributed to them maintaining a higher diversity for longer than the gbest topology, avoiding exploitation (and therefore overfitting) of a particular solution. In the case of the regression problems, with all three topologies, overfitting was not as severe as with the classification problems. Again this could be attributed to a lack of noise in the series.

A further observation is the relatively high standard deviations of some of the ρF measurements across all topologies, and especially in the classification cases. This is in contrast to the low standard deviations of the error measurements. This indicates that although training and generalization performance was consistent across samples, there is a potentially big difference in overfitting behavior depending on the stochastic conditions.

Another anomaly in the data is the diversity results. Even though algorithm parameters were chosen specifically to ensure global swarm convergence, the diversity values indicate that convergence was either not occurring or occurring very slowly.

[Figure 1: line plot of swarm diversity (0–50) over iterations (0–5000) for the GBest, LBest and VN GCPSO.]
Fig. 1. Swarm diversity over time on the Glass data set. Graph shows mean over 30 samples.

Figure 1 illustrates the diversity of the swarms on the Glass data set. Results were similar for all classification sets but are not shown due to space constraints.

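The diversity values reported in these tables follow equation (6); a minimal sketch, assuming the swarm is given as a 2-D array of particle positions:

```python
import numpy as np

def swarm_diversity(positions):
    """Equation (6): average Euclidean distance of particles from the swarm center.

    positions: array of shape (|S|, I), one row per particle.
    """
    positions = np.asarray(positions, dtype=float)
    center = positions.mean(axis=0)                        # swarm center x-bar
    distances = np.sqrt(((positions - center) ** 2).sum(axis=1))
    return distances.mean()
```

A fully converged swarm (all particles at one point) yields a diversity of exactly 0, which is why diversities that stagnate above 0.0 signal non-convergent behavior.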
For the gbest and VN PSO algorithms, Figure 1 shows that initially the swarms were converging, but over time the swarms diverged, as seen by the increase in diversity. In all cases the diversity of the gbest swarms stagnated above 0.0, which should not be the case. With the lbest topology the diversity results are more extreme, the figure showing that drastic swarm divergence occurred. Figure 2 shows similar trends for the Sunspots time series data set.

[Figure 2: line plot of swarm diversity (0–50) over iterations (0–5000) for the GBest, LBest and VN GCPSO.]
Fig. 2. Swarm diversity over time on the Sunspots data set. Graph shows mean over 30 samples.

A possible explanation for the unexpected behavior lies in the use of the sigmoid activation function, or more generally a bounded activation function. At the asymptotic ends (bounds) of the sigmoid activation function, the output of the function varies insignificantly irrespective of the amount of change in the net input signal of the neuron. The net input signal to a neuron is a function of particle position (weight values), which is changed by adding velocity to the position. Now, irrespective of whether velocity step sizes are large, at the asymptotic ends of the sigmoid, because of the flat slope of the function (which flattens even further towards the asymptote), little change in the activation will occur, and therefore little change in particle fitness. Velocity can therefore grow to arbitrarily large values, and since it has no (or extremely little) effect on particle fitness, particles continue to search in a direction of increasing velocities. The large velocity values explain the slow convergence (or divergence in the case of the lbest topology) of the swarms. This so-called explosion in velocity values [2] can also be prevented by clamping of the velocities.

This hypothesis can be tested empirically by repeating the experiment, changing the neurons to use an unbounded activation function, for example the linear function.

C. Unbounded Activation Function Neural Networks

For the following experiment the same algorithm configuration as described in Section V-A was used. Algorithms were again executed 30 times for each problem with different random number seeds.

The neural network architectures were kept the same, with the exception that the neurons in the hidden layer were changed to use a linear activation function. The neurons in the output layer still used a sigmoid activation function to avoid possible scaling issues and modification of the data sets.

Due to the nature of the linear activation function, a neural network would require more linearly activated neurons to obtain the same approximation accuracy as a neural network using sigmoid functions. Since the architectures were kept constant, worse training and generalization accuracies are to be expected when compared to the previous experiment. The purpose of the experiment is not to obtain good training and generalization performance, however, but to investigate overfitting and diversity of the algorithm.

TABLE V
GBest GCPSO results after 5000 iterations, with neural networks using linear activation functions in the hidden layer. Means over 30 samples are reported with standard deviations in parentheses.

Problem         ET                  EG                  ρF                  Diversity
Glass           0.0173 (0.0018)     0.0258 (0.0057)     1.54373 (0.54457)   0.73791 (0.68773)
Iris            0.00657 (0.00062)   0.00701 (0.00145)   1.09492 (0.32602)   0.00003 (0.00007)
Diabetes        0.09739 (0.00343)   0.10133 (0.00672)   1.04395 (0.10546)   0.000 (0.000)
WDBC            0.01239 (0.00222)   0.02729 (0.00731)   2.31543 (0.93476)   2.45515 (2.38805)
Wine            0.00789 (0.0006)    0.01088 (0.00187)   1.40292 (0.354)     0.000 (0.000)
Henon           0.30162 (0.04598)   0.29535 (0.08552)   1.0486 (0.4869)     0.04052 (0.04408)
Logistic Map    0.07676 (0.00147)   0.07759 (0.00292)   1.0119 (0.05835)    0.000 (0.000)
Mackey-Glass    0.0008 (0.00015)    0.00085 (0.00031)   1.17962 (0.67235)   0.11281 (0.27916)
Sunspots        0.00495 (0.00044)   0.00545 (0.00093)   1.12707 (0.29137)   0.00039 (0.00068)
TS5             0.00065 (0.00006)   0.00087 (0.00018)   1.35678 (0.40598)   0.000 (0.000)

D. Discussion

The results for the second experiment are given in Tables V, VI and VII. As expected, the training performance of the linearly activated neural networks is worse. With the gbest PSO results, comparing Tables II and V, a worse training error is reported on all problems. Interestingly though, on the Glass, Iris and Diabetes problems the linearly activated neural networks reported lower generalization errors, and in most other cases the generalization errors are closely matched.

The lbest PSO results are similar to the gbest results. In comparing Tables III and VI, higher training errors were obtained by the linearly activated neural networks on all problems, though on some of the problems (Glass, Iris, Diabetes and TS5) lower generalization errors were obtained.

Comparing Tables IV and VII, results are similar for the VN PSO, with higher training errors, but lower generalization errors on the Glass, Iris, Diabetes and TS5 problems.

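The bounded-activation argument behind this second experiment can be checked numerically: in the asymptotic region of the sigmoid, even a large change in the net input signal barely changes the activation, while a linear unit passes the change straight through (a sketch with illustrative values, not taken from the paper):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

# A large step in net input near the asymptote barely moves the activation...
delta_saturated = sigmoid(20.0) - sigmoid(10.0)
# ...while the same-sized step inside the active domain moves it substantially,
delta_active = sigmoid(1.0) - sigmoid(-1.0)
# and a linear unit reflects the step in net input directly.
delta_linear = 20.0 - 10.0
```

Since particle fitness is a function of the activations, `delta_saturated` being near zero means large velocity steps go essentially unpenalized in the saturated regime, which is the mechanism hypothesized to allow velocities to grow unchecked.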
TABLE VI
LBest GCPSO results after 5000 iterations, with neural networks using linear activation functions in the hidden layer. Means over 30 samples are reported with standard deviations in parentheses.

Problem         ET                  EG                  ρF                  Diversity
Glass           0.01737 (0.00182)   0.02576 (0.00569)   1.53569 (0.54429)   6.17495 (1.97753)
Iris            0.00657 (0.00062)   0.00701 (0.00145)   1.09492 (0.32602)   3.39328 (0.57847)
Diabetes        0.09739 (0.00343)   0.10133 (0.00672)   1.04395 (0.10546)   6.26711 (1.38673)
WDBC            0.0128 (0.00189)    0.02336 (0.00508)   1.91259 (0.66447)   46.8131 (20.76834)
Wine            0.00789 (0.0006)    0.01088 (0.00187)   1.40292 (0.354)     5.05847 (1.00161)
Henon           0.29484 (0.03882)   0.31468 (0.06697)   1.11843 (0.42674)   7.23726 (0.63162)
Logistic Map    0.07676 (0.00147)   0.07759 (0.00292)   1.0119 (0.05835)    1.81186 (0.18111)
Mackey-Glass    0.0008 (0.00015)    0.00085 (0.00031)   1.17962 (0.67235)   8.32093 (1.90035)
Sunspots        0.00495 (0.00044)   0.00545 (0.00093)   1.12707 (0.29137)   3.84284 (0.37656)
TS5             0.00065 (0.00006)   0.00087 (0.00018)   1.35679 (0.40597)   3.11008 (1.34252)

TABLE VII
VN GCPSO results after 5000 iterations, with neural networks using linear activation functions in the hidden layer. Means over 30 samples are reported with standard deviations in parentheses.

Problem         ET                  EG                  ρF                  Diversity
Glass           0.01739 (0.00181)   0.02578 (0.00565)   1.53401 (0.54093)   4.98978 (3.05846)
Iris            0.00657 (0.00062)   0.00701 (0.00145)   1.09492 (0.32602)   2.18407 (0.88998)
Diabetes        0.09739 (0.00343)   0.10133 (0.00672)   1.04395 (0.10546)   5.65818 (1.46241)
WDBC            0.01277 (0.00194)   0.02415 (0.00564)   1.98643 (0.72546)   22.83848 (5.58503)
Wine            0.00789 (0.0006)    0.01088 (0.00187)   1.40292 (0.354)     1.92114 (1.54905)
Henon           0.29711 (0.03988)   0.30254 (0.07376)   1.07244 (0.43425)   5.82413 (2.29529)
Logistic Map    0.07676 (0.00147)   0.07759 (0.00292)   1.0119 (0.05835)    1.49332 (0.45943)
Mackey-Glass    0.0008 (0.00015)    0.00085 (0.00031)   1.17962 (0.67235)   5.9408 (0.85157)
Sunspots        0.00495 (0.00044)   0.00545 (0.00093)   1.12707 (0.29137)   2.69051 (1.52392)
TS5             0.00065 (0.00006)   0.00087 (0.00018)   1.35678 (0.40598)   1.10635 (0.51997)

As mentioned, lower training and generalization accuracies were expected due to the architectures being kept constant. The improvement in generalization performance on some problems is interesting, but might not be significant.

In terms of diversity, the results are now closer to what is theoretically expected from the PSO. With all the topologies, lower diversities are reported on nearly all the problems after the 5000 iterations. On the gbest topology, the diversities are 0.0 or very close to 0.0 with small standard deviations, indicating convergence.

[Figure 3: line plot of swarm diversity (0–10) over iterations (0–5000) for the GBest, LBest and VN GCPSO.]
Fig. 3. Swarm diversity over time on the Glass data set with neural networks using linear activation functions in the hidden layer. Graph shows mean over 30 samples.

[Figure 4: line plot of swarm diversity (0–10) over iterations (0–5000) for the GBest, LBest and VN GCPSO.]
Fig. 4. Swarm diversity over time on the Sunspots data set with neural networks using linear activation functions in the hidden layer. Graph shows mean over 30 samples.

The differences are illustrated more clearly in Figure 3, showing a classification problem, and Figure 4, showing a time series regression problem. The results for the other classification and regression problems were similar, but are omitted due to space constraints. When compared to Figures 1 and 2 (please note the adjusted maximum range values on Figures 3 and 4), these diversity plots indicate convergent swarms, showing a better decrease over time. With the gbest and VN topologies a slight increase in diversity is still seen, but to a far lesser extent. From the lower standard deviations in the diversity results, as seen in Tables V, VI and VII, it can also be concluded that the convergent swarm behavior was consistent.

When considering the generalization factor, however, a vast difference between the two experiment results exists.

reported on only one problem, the Wisconsin Diagnostic Breast Cancer
dataset. With the exception of the WDBC dataset, the generalization factor
did not greatly exceed 1.5, if at all. Therefore, all three topologies
exhibited greatly reduced overfitting behavior. The standard deviations of
the generalization factor measurements are also far lower, indicating
consistent performance across all samples.

[Figure 5 plot: generalization factor (0 to 15) versus iterations (0 to
5000) for GBest, LBest and VN GCPSO, each with sigmoid and linear
hidden-layer neural networks.]

Fig. 5. Generalization Factor over time on the Glass data set. Graph shows
mean over 30 samples.

Figure 5 shows the generalization factor over time of the three PSO
algorithms training the sigmoid hidden layer neural networks from the first
experiment and the linear hidden layer neural networks from the second
experiment. Overfitting in the sigmoid neural networks occurs early in the
optimization process and continues to drastically worsen over time. With the
linear neural networks, some overfitting is experienced early on but ceases
to worsen after about 300 iterations, indicated by the flattening of the
graphs after this point. The results on the other classification and
regression problems are very similar.

For the linear activation function experiment the results also show that
little difference exists between the topologies in overfitting behavior, as
the ρF values are very similar for each topology on a particular problem.
This is clearly visible in Figure 5, which shows nearly identical lines for
the three topologies.

Overfitting occurs due to the neural network modeling the training data too
specifically. This requires the training algorithm to overly exploit a
solution it has found in the search space of neural network weights.
However, with the linear activation functions the swarm converges after some
time, so it is not possible for the algorithm to further explore the search
space or exploit a solution, which stops additional overfitting from
occurring. The negative side to this is that it is not possible to find any
better solutions either.

The results of the second experiment support (but do not necessarily prove)
the hypothesis that using a bounded activation function with PSO trained
neural networks might lead to non-convergent swarms, and hence to more
overfitting.

VI. CONCLUSION

This paper investigated overfitting in PSO trained feedforward neural
networks across a variety of classification and regression problems. Three
well known information sharing topologies of the PSO were used, and the
results showed that overfitting occurred in the neural networks for all
three PSO algorithms. Additionally, non-convergent swarm behavior was shown
to occur with all three algorithms, which is hypothesized to be related to
the use of a bounded activation function within the neurons of the neural
networks.

A second experiment was performed and the results supported this hypothesis.
The use of an unbounded activation function resulted in convergent swarms as
well as a drastic reduction in the degree of overfitting. The lack of
overfitting is attributed to the swarms converging on a solution, therefore
not exploiting it further and preventing additional overfitting from
occurring.

Future work includes investigating the effect of velocity clamping, as well
as of the size of the swarm, on overfitting. Furthermore, a comparison of
overfitting by PSO trained neural networks against gradient-based and
evolutionary computation training methods is intended. An additional
experiment is also planned to investigate the effect of the slope of the
sigmoid function on overfitting and convergence.
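To keep the discussion of ρF concrete, the generalization factor can be sketched as follows. This is a minimal illustration, assuming ρF is the ratio of the mean squared error on unseen (test) data to that on the training data; the paper defines its exact measurement in its earlier sections, and the function names and numbers below are illustrative only.

```python
def mse(targets, outputs):
    """Mean squared error over one data set."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)

def generalization_factor(train_error, test_error):
    """rho_F: ratio of generalization (test) error to training error.
    Values near 1.0 suggest good generalization; values well above 1.0
    indicate overfitting."""
    return test_error / train_error

# Toy network outputs that fit the training set much better than the
# test set, i.e. an overfitted model.
e_t = mse([1.0, 0.0, 1.0], [0.9, 0.1, 0.8])   # training error
e_g = mse([1.0, 0.0, 1.0], [0.7, 0.2, 0.7])   # test error
print(round(generalization_factor(e_t, e_g), 2))  # 3.67
```

A flattening ρF curve, as seen for the linear networks after about 300 iterations, corresponds to this ratio stabilizing rather than growing.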

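The convergence claims rest on a swarm diversity measure. A minimal sketch, assuming diversity is the average Euclidean distance of the particles to the swarm centre (one common measure in the PSO literature; the names and shapes here are illustrative, not taken from the paper):

```python
import math

def swarm_diversity(positions):
    """Average Euclidean distance of the particles to the swarm centre.

    positions: one vector per particle, e.g. one flattened neural-network
    weight vector per particle.
    """
    n, dims = len(positions), len(positions[0])
    centre = [sum(p[d] for p in positions) / n for d in range(dims)]
    return sum(math.dist(p, centre) for p in positions) / n

# A swarm that has collapsed onto a single point has diversity 0.0,
# matching the convergent behavior reported for the linear networks.
print(swarm_diversity([[0.5] * 4] * 20))   # 0.0
print(swarm_diversity([[0.0], [2.0]]))     # 1.0
```

Tracking this value per iteration yields diversity-over-time curves of the kind plotted in Figures 3 and 4.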
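Finally, the bounded-activation hypothesis can be illustrated numerically. With sigmoid hidden units, scaling the weights far upward barely changes the hidden outputs because the sigmoid saturates, so ever larger weights look equally fit and particles can drift without converging; a linear (unbounded) unit keeps responding to weight changes. The toy weights below are illustrative and biases are omitted; this is not the paper's actual network setup.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))   # bounded in (0, 1)

def hidden_outputs(x, weights, activation):
    """Hidden-layer outputs of a feedforward network (biases omitted)."""
    return [activation(sum(w * xi for w, xi in zip(row, x))) for row in weights]

x = [1.0, -0.5]
w = [[2.0, 1.0], [-1.0, 3.0]]                        # net inputs 1.5 and -2.5
scale = lambda k: [[k * v for v in row] for row in w]

# Sigmoid units saturate: a 10x versus 100x weight scaling changes the
# hidden outputs by less than 1e-6.
s10 = hidden_outputs(x, scale(10), sigmoid)
s100 = hidden_outputs(x, scale(100), sigmoid)
print(max(abs(a - b) for a, b in zip(s10, s100)) < 1e-6)   # True

# Linear units do not saturate: the same scaling changes the outputs
# drastically, so fitness keeps responding to particle movement.
l10 = hidden_outputs(x, scale(10), lambda a: a)
l100 = hidden_outputs(x, scale(100), lambda a: a)
print(max(abs(a - b) for a, b in zip(l10, l100)))          # 225.0
```

The near-flat fitness regions produced by saturation are one plausible mechanism for the non-convergent swarm behavior observed with the sigmoid networks.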
