Automating Extract Class Refactoring
DOI 10.1007/s10664-013-9256-x
Abstract During software evolution the internal structure of the system undergoes
continuous modifications. These changes push the source code away from its original design, often reducing its quality, including class cohesion. In this paper
we propose a method for automating the Extract Class refactoring. The proposed
approach analyzes (structural and semantic) relationships between the methods in
a class to identify chains of strongly related methods. The identified method chains
are used to define new classes with higher cohesion than the original class, while
preserving the overall coupling between the new classes and the classes interacting
with the original class. The proposed approach has been first assessed in an artificial
scenario in order to calibrate the parameters of the approach. The data was also used
A. Marcus
SEVERE Group, Department of Computer Science, Wayne State University,
5057 Woodward Ave, Suite 14101.1, Detroit, MI 48202, USA
e-mail: amarcus@[Link]
URL: [Link]
R. Oliveto ( )
Department of Bioscience and Territory, University of Molise, C. da Fonte Lappone,
86090, Pesche IS, Italy
e-mail: [Link]@[Link]
URL: [Link]
1618 Empir Software Eng (2014) 19:1617–1664
to compare the new approach with previous work. The approach was then empirically evaluated on real Blobs from existing open source systems in order to assess how good and useful software engineers consider the proposed refactoring solutions and
how well the proposed refactorings approximate refactorings done by the original
developers. We found that the new approach outperforms a previously proposed ap-
proach and that developers find the proposed solutions useful in guiding refactorings.
1 Introduction
1. In the first study we asked 50 Master’s students to rate the refactoring solutions
suggested by the proposed approach on existing Blobs identified in two open-
source systems (Khomh et al. 2009). In this study we also evaluated the impact
of the refactoring operations proposed by our approach on the cohesion and
coupling of the object systems.
2. In the second study we identified and selected 11 classes in different versions
of open source systems that actually underwent extract class refactoring by
their original developers. Then, we asked 15 Master’s students to refactor these
classes and compared both the refactorings proposed by our approach and the
refactorings performed by the students with the refactorings performed by the
original developers.
The results show that the refactoring solutions proposed by our approach (i)
strongly increase the cohesion of the refactored classes without leading to significant
increases in terms of coupling; (ii) are considered useful by developers performing
extract class refactoring; and (iii) are able to approximate manually performed refac-
torings at 91 %, on average. In addition, we also compare the proposed extract class
refactoring method with a previous approach we proposed in Bavota et al. (2011),
which uses the same graph representation of the class to be refactored but a different
algorithm based on Max Flow-Min Cut. We will refer to our previous approach as
the Max Flow-Min Cut approach. The results clearly indicate that the new approach
outperforms the previous one and in the paper we discuss the reasons for that. The
experimental material and the raw data are available online for replication purposes
(Bavota et al. 2012).
The rest of the paper is organized as follows. Section 2 discusses the related
literature, while Section 3 presents the proposed approach. The empirical assessment
of the configuration parameters of our approach is presented in Section 4. Sections 5
and 6 report the two empirical studies, respectively. Finally, Section 7 concludes the
paper.
2 Related Work
A lot of effort has been devoted to the definition of automatic and semi-automatic
approaches for software refactoring. The recent increasing interest in this field by
the software engineering community has led to the organization of international events focused on refactoring, such as the 4th Workshop on Refactoring Tools (WRT 2011) at ICSE 2011. The approaches proposed in the literature can be roughly classified into two categories: (i) approaches that identify source code components which may require refactoring and (ii) approaches that (semi)automatically
perform refactoring operations. In the latter category there are approaches support-
ing move method refactoring (Seng et al. 2006; Tsantalis and Chatzigeorgiou 2009;
Oliveto et al. 2011), extract method refactoring (Maruyama and Shima 1999; Abadi
et al. 2009), refactoring focused on improving class hierarchy (Casais 1992; Moore
1996), combinations of different refactoring operations1 (O’Keeffe and O’Cinneide
2006; Bodhuin et al. 2007), and extract class refactoring (Fokaefs et al. 2009; Bavota
et al. 2011). The latter are the most closely related to our approach, and thus we will focus on them in our discussion of related work.
It is worth noting that many of the most commonly used refactoring operations proposed in the literature have been integrated into modern IDEs, such as Eclipse2 and IntelliJ IDEA.3
Fokaefs et al. (2009) use a clustering algorithm to perform Extract Class refactoring.
Their approach analyzes the structural dependencies existing between the entities
of a class to be refactored, i.e., attributes and methods. Using this information, they
compute the entity set for each attribute (i.e., the set of methods using it), and for each
method (i.e., all the methods that are invoked by a method and all the attributes that
are accessed by it) of the class. The Jaccard distance between all pairs of entity sets of the class is computed in order to cluster together cohesive groups of entities
3 [Link]
modules (or classes). In particular, structural information (e.g., Fokaefs et al. 2009;
Christl et al. 2007; Sartipi and Kontogiannis 2001; Wiggerts 1997; Anquetil et al.
1999), semantic information (e.g., Kuhn et al. 2007), or a combination of semantic
and structural measures (Maletic and Marcus 2001) have been proposed to cluster
software components in order to support program comprehension or software re-
modularisation. Maletic and Marcus (2001) proposed the combination of semantic
and structural measures to cluster software components during program compre-
hension. It is worth noting that while (as discussed above) hierarchical clustering
algorithms need a threshold to cut the identified dendrogram, partitioning clustering
algorithms need to know the number of clusters to build and thus the number of
classes to extract in case of Extract Class refactoring. Our approach overcomes these
problems by automatically defining the optimal number of classes to be extracted.
The approach proposed in this paper, like most of the refactoring approaches
described above, can be applied only if a source code component to be refactored
has been identified (in our case, a Blob class). Several approaches presented in the
literature have focused on the identification of source code components
that need refactoring. Such approaches are complementary to the one presented in
this paper. While many of these approaches have been proposed in the literature
(Simon et al. 2001; Tahvildari and Kontogiannis 2003; Du Bois et al. 2004; Marinescu
2004; Atkinson and King 2005; Trifu and Marinescu 2005; Joshi and Joshi 2009;
Khomh et al. 2009; Moha et al. 2010), we will focus our discussion on those able
to identify extract class refactoring opportunities in software systems.
Simon et al. (2001) provide a metric-based visualization tool to support the
software engineer in the identification of source code components that need refac-
toring. In particular, their approach is able to identify four kinds of refactoring
opportunities: move method, move attribute, extract class, and inline class. In Simon
et al. (2001) only structural metrics are used in the analysis of the source code.
Marinescu (2004) proposes a mechanism called “detection strategies” for formu-
lating metrics-based rules that capture deviations from good design principles and
heuristics. The detection strategies are formulated in different steps. Firstly, the
symptoms that characterize a particular bad smell should be defined (e.g., in the case of a Blob: high complexity, low cohesion, and access to “foreign” data). Second, a proper
set of metrics measuring these symptoms should be identified (e.g., Weighted Method
Count (WMC) for high complexity, Tight Class Cohesion (TCC) for class cohesion,
and Access to Foreign Data (ATFD) for measuring the access to external attributes
of a class). Having this information the next step is to define thresholds to classify
the class as affected (or not) by the defined symptoms. For example, establishing for
which values of TCC a class should be identified as a “low cohesive class”. Finally,
AND/OR operators should be used to correlate the symptoms, leading to the final
rule to detect the smells and thus, refactoring opportunities. The evaluation con-
ducted on two software systems shows how using customized “detection strategies”
it is possible to identify nine bad smells with an average accuracy of 70 %.
Trifu and Marinescu (2005) present an approach to support the decision making
process in object oriented refactoring. In particular, they exploit correlation between
structural anomalies, i.e., different types of code smells that often occur together, and
other structural and semantic information to build a pattern-like mapping of design
problems to the adequate treatments.
Joshi and Joshi (2009) present a method for identifying less cohesive classes in
a software system. Their approach is also able to pick out which class members
contribute to the lack of cohesion of the identified classes. This information can be
used to find candidates for refactoring, e.g., extract class, move method.
Khomh et al. (2009) propose an approach based on Bayesian Belief Networks
(BBNs) to specify design smells and detect them in programs. In this work the
authors focus the attention on the detection of Blob classes and thus, of Extract Class
refactoring opportunities. In particular, given a class C as input, the output of the
BBN is a probability that C is a Blob class. The evaluation is performed on two open
source systems by measuring precision and recall of the model with manually located
smells.
Moha et al. (2010) introduced DETEX, a method for the specification and detec-
tion of code and design smells. DETEX uses a Domain-Specific Language (DSL)
for specifying smells using high-level abstractions. Four design smells are identified
by DETEX, namely Blob class, Swiss Army Knife, Functional Decomposition, and
Spaghetti Code. The results achieved in the reported evaluation show that DETEX
is able to reach a recall of 100 % and a precision greater than 50 % in the detection
of the four above mentioned bad smells.
3 The Proposed Approach
The proposed approach is able to extract two or more classes from a given class
with several responsibilities (e.g., a Blob (Brown et al. 1998)). The extracted classes
have higher cohesion than the original class and attempt to encapsulate related
responsibilities. Generally, a class with a high number of responsibilities exhibits low
cohesion. Cohesion has been defined by Stevens et al. (1974) as “the degree to which
the elements of a module belong together” and in the case of classes, it measures how
strongly related the responsibilities implemented by a class are (Chidamber et al. 1994).
Class cohesion is affected by several factors (e.g., attribute references, method
calls, semantic content, etc.) and our approach exploits all these factors to split a class
with low cohesion into a set of classes with higher cohesion. However, while splitting
a class into two or more classes increases the cohesion of the extracted classes, this
might happen at the expense of class coupling. For this reason, our approach exploits
similarity measures between methods on which cohesion and coupling metrics are
based. In this way, the increase of cohesion should mitigate the increase of coupling.
The approach takes as input a class previously identified by the software engineer
(or automatically) as a candidate for refactoring. Figure 1 shows the Extract Class
Refactoring process. The top path of the process is similar to the Max Flow-Min
Cut approach (Bavota et al. 2011): the candidate class is parsed to build a method-
by-method matrix, an n × n matrix where n is the number of methods in the class to be refactored. A generic entry ci,j of the method-by-method matrix represents the likelihood that methods mi and mj should be in the same class. This step of
the refactoring process is described in more detail in Section 3.1.
Using the information in the method-by-method matrix, the second part (bottom
path) of the refactoring process, shown in Fig. 1, extracts the new classes from the
input Blob. In particular, a filtering step is used to remove spurious links and to
split the initial graph represented in the method-by-method matrix into disconnected
subgraphs. Then, we identify the chains of connected methods belonging to the
different subgraphs. Each computed chain represents a class to be extracted from the
original class. However, some of these chains could have a very short length (trivial
chains). To avoid the extraction of classes with a very low number of methods, we
merge each trivial chain with the most coupled non-trivial chain to obtain the final
set of classes to be extracted from the original class. In Sections 3.2 and 3.3 we explain
in detail these two steps of our algorithm, while in Section 3.4 we present an example
of the application of our approach.
calculated as the ratio between the number of referenced instance variables shared by methods mi and mj and the total number of instance variables referenced by the two methods:

$$SSM(m_i, m_j) = \begin{cases} \dfrac{|I_i \cap I_j|}{|I_i \cup I_j|} & \text{if } |I_i \cup I_j| \neq 0;\\[4pt] 0 & \text{otherwise} \end{cases}$$

where $I_i$ and $I_j$ are the sets of instance variables referenced by $m_i$ and $m_j$, respectively.
SSM has values in [0, 1]; the higher the number of instance variables the two methods
share, the higher the likelihood that the two methods should be in the same class.
ClassCoh is defined as the ratio of the sum of the similarities between all pairs of
methods to the total number of possible pairs of methods.
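For illustration, the SSM and ClassCoh computations above can be sketched as follows (a minimal Python sketch; representing each method simply by the set of instance-variable names it references is our simplification):

```python
def ssm(vars_i, vars_j):
    """SSM: Jaccard overlap between the sets of instance variables
    referenced by two methods; 0 when neither references any variable."""
    union = vars_i | vars_j
    if not union:
        return 0.0
    return len(vars_i & vars_j) / len(union)

def class_cohesion(methods):
    """ClassCoh: average pairwise similarity over all possible method pairs."""
    n = len(methods)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(ssm(methods[i], methods[j]) for i, j in pairs) / len(pairs)

# Two methods sharing one of three referenced instance variables:
print(ssm({"name", "email"}, {"email", "password"}))  # 1/3
```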
CDM is another structural measure that takes into account the calls performed by the methods (Bavota et al. 2011). In particular, let $calls(m_i, m_j)$ be the number of calls performed by method $m_i$ to $m_j$ and $calls_{in}(m_j)$ be the total number of incoming calls to $m_j$. $CDM_{i \to j}$ is defined as:

$$CDM_{i \to j} = \begin{cases} \dfrac{calls(m_i, m_j)}{calls_{in}(m_j)} & \text{if } calls_{in}(m_j) \neq 0;\\[4pt] 0 & \text{otherwise} \end{cases}$$
CDMi→j values are in [0, 1]. If CDMi→j = 1, it means that mj is called only by mi. Thus, mi and mj should be in the same class to reduce coupling between classes. Otherwise, if CDMi→j = 0, it means that mi never calls mj. In such a case, moving the two methods into different classes does not increase the coupling. To
ensure that CDM represents a commutative measure (like the other two measures)
the overall CDM of mi and mj is computed as follows:

$$CDM(m_i, m_j) = \max\left(CDM_{i \to j},\; CDM_{j \to i}\right)$$
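A sketch of the CDM computation, under the assumption that per-method call counts are available from static analysis (the dictionary layout used here is our own):

```python
def cdm_directed(calls, incoming, mi, mj):
    """CDM_{i->j}: fraction of all incoming calls to mj that originate in mi."""
    if incoming.get(mj, 0) == 0:
        return 0.0
    return calls.get((mi, mj), 0) / incoming[mj]

def cdm(calls, incoming, mi, mj):
    """Commutative CDM: the stronger of the two call directions."""
    return max(cdm_directed(calls, incoming, mi, mj),
               cdm_directed(calls, incoming, mj, mi))

# mj receives 4 calls in total, 3 of them from mi:
calls = {("mi", "mj"): 3}
incoming = {"mj": 4}
print(cdm(calls, incoming, "mi", "mj"))  # 0.75
```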
CSM, the third measure, captures the semantic similarity between two methods; it is computed as the cosine similarity between the term vectors of the two methods:

$$CSM(m_i, m_j) = \frac{\vec{m_i} \cdot \vec{m_j}}{\|\vec{m_i}\|\,\|\vec{m_j}\|}$$

where $\vec{m_i}$ and $\vec{m_j}$ are the vectors corresponding to the methods $m_i$ and $m_j$, respectively, and $\|\vec{x}\|$ represents the Euclidean norm of the vector $\vec{x}$ (Baeza-Yates and Ribeiro-Neto 1999). Thus, the higher the value of CSM, the higher the similarity
between two methods. In short, the measure captures relationships between the
comments, identifiers, and other text present in the methods, based on word usages
in the entire code. It is clear that CSM depends on the consistency of naming
conventions used in the source code as well as on the comments contained in it.
All the used similarity measures have values in [0, 1]. Thus, we compute the
likelihood that methods mi and mj should be in the same class as:

$$c_{i,j} = w_{SSM} \cdot SSM(m_i, m_j) + w_{CDM} \cdot CDM(m_i, m_j) + w_{CSM} \cdot CSM(m_i, m_j)$$

where $w_{SSM} + w_{CDM} + w_{CSM} = 1$ and their values express the confidence (i.e., weight) in each measure.
It is worth noting that our choice of measures is not arbitrary; rather, it is based on the results of previous work (Bavota et al. 2011), where we have
shown that these measures are orthogonal, they capture different aspects of coupling
between methods, and are suitable for automating extract class refactoring.
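Putting the three measures together, the method-by-method matrix can be assembled as a weighted sum (a sketch; ssm_m, cdm_m, and csm_m stand for precomputed n × n matrices of the three measures):

```python
def method_by_method(ssm_m, cdm_m, csm_m, w_ssm, w_cdm, w_csm):
    """Combine three [0, 1] similarity matrices (lists of lists) into the
    likelihood matrix c; the three weights must sum to one."""
    assert abs(w_ssm + w_cdm + w_csm - 1.0) < 1e-9
    n = len(ssm_m)
    return [[w_ssm * ssm_m[i][j] + w_cdm * cdm_m[i][j] + w_csm * csm_m[i][j]
             for j in range(n)] for i in range(n)]

ssm_m = [[1.0, 0.5], [0.5, 1.0]]
cdm_m = [[1.0, 0.2], [0.2, 1.0]]
csm_m = [[1.0, 0.8], [0.8, 1.0]]
c = method_by_method(ssm_m, cdm_m, csm_m, 0.3, 0.3, 0.4)
print(round(c[0][1], 2))  # 0.53
```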
The aim of this step is to remove spurious (but light) structural and/or semantic relationships between methods from the graph represented by the method-by-method matrix (Koschke et al. 2006). Indeed, due to the use of the semantic similarity between methods (which is very unlikely to be exactly zero), the initial graph representation would in general be a complete graph (i.e., it contains all possible edges), or at least
a connected graph. We split the graph representing the class to be refactored into
disconnected subgraphs, containing strongly related methods. We filter the method-
by-method matrix, based on a threshold, minCoupling. All similarity values less than
the minCoupling threshold are converted to zero:
$$c_{i,j} = \begin{cases} c_{i,j} & \text{if } c_{i,j} > minCoupling;\\ 0 & \text{otherwise} \end{cases}$$
There are many ways to define a threshold aimed at removing spurious relation-
ships between methods. A simple classification allows identifying two different kinds
of thresholds:
– constant threshold: the value of the threshold is fixed a priori, e.g., minCoupling =
0.1. This kind of threshold is simple to implement, but in general it is very difficult
to choose a priori a constant value to prune spurious relationships. Indeed, the
values in the method-by-method matrix depend on the Blob chosen to be refac-
tored. In fact, there may be cases where the matrix contains a lot of high values.
In this case, if the fixed threshold is high, it will probably remove the noise from
the matrix, e.g., spurious relationships between the methods of the class. Oth-
erwise, almost all the values will be left in the matrix. On the other hand, there
may be cases where the matrix contains a large number of very low values. In this
case, a high constant threshold will remove almost all the values from the matrix.
– variable threshold: the value of the threshold is selected taking into account
the characteristics of the given input. For example, minCoupling can be set as
the median of the values present in the method-by-method matrix. This kind of
threshold should resolve the problems deriving from the use of a constant threshold and should ensure a more stable filtering performance across different inputs.
Choosing the best threshold in this case is also far from trivial.
We experimented with both constant and variable thresholds to empirically define
a heuristic for selecting the best threshold and we found that a variable threshold is
the best option in our application (see Section 4 for details).
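The variable-threshold filtering can be sketched as follows (a sketch using the median of the non-zero matrix values as minCoupling, the variable setting that worked best in our calibration):

```python
from statistics import median

def filter_matrix(c):
    """Set minCoupling to the median of the non-zero similarity values
    and zero out every entry not above the threshold."""
    values = [v for row in c for v in row if v > 0]
    if not values:
        return [row[:] for row in c]
    min_coupling = median(values)
    return [[v if v > min_coupling else 0.0 for v in row] for row in c]

c = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.3],
     [0.1, 0.3, 0.0]]
print(filter_matrix(c))  # only the 0.9 edge survives
```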
After filtering the method-by-method matrix and splitting the graph into discon-
nected subgraphs, we identify the chains of connected methods belonging to the
different subgraphs. These chains represent the new classes to be extracted from the
original class.
The set of computed chains (i.e., extracted classes) may include chains with a very
short length. To avoid the extraction of classes with a very low number of methods,
we use a length threshold minLength to identify trivial chains, i.e., chains with a
length less than minLength. In our approach we decided to set minLength = 3 since
it is unusual that a class extracted from a Blob and implementing a well-defined
set of responsibilities contains fewer than three methods. This minimum length can
be easily changed by the user, if needed. Then, we compute the (structural and
semantic) coupling between trivial and non-trivial chains and merge each trivial chain
with the non-trivial chain it is most coupled with. The coupling between chains is
calculated using the same measures used to calculate the coupling between methods.
Specifically, the coupling between chains Ci and Cj is computed as the average coupling between all possible pairs of methods from Ci and Cj:

$$Coupling(C_i, C_j) = \frac{1}{|C_i| \times |C_j|} \sum_{m_i \in C_i,\, m_j \in C_j} c_{i,j}$$
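The chain identification and trivial-chain merging steps can be sketched as follows (a sketch; chains are the connected components of the filtered graph, and trivial chains are merged using the average pairwise coupling defined above — whether that coupling is computed on the filtered or the unfiltered matrix is not specified here, so the sketch's use of the unfiltered matrix is our assumption):

```python
def connected_chains(c):
    """Chains = connected components of the graph induced by the filtered
    method-by-method matrix c (a non-zero entry is an edge)."""
    n, seen, chains = len(c), set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            m = stack.pop()
            comp.append(m)
            for other in range(n):
                if other not in seen and c[m][other] > 0:
                    seen.add(other)
                    stack.append(other)
        chains.append(sorted(comp))
    return chains

def chain_coupling(c, chain_a, chain_b):
    """Average likelihood over all pairs of methods from the two chains."""
    total = sum(c[i][j] for i in chain_a for j in chain_b)
    return total / (len(chain_a) * len(chain_b))

def merge_trivial_chains(c, chains, min_length=3):
    """Merge each trivial chain into the non-trivial chain it is most
    coupled with; c is the unfiltered matrix (our assumption)."""
    non_trivial = [ch for ch in chains if len(ch) >= min_length]
    trivial = [ch for ch in chains if len(ch) < min_length]
    if not non_trivial:
        return chains
    for ch in trivial:
        best = max(non_trivial, key=lambda nt: chain_coupling(c, ch, nt))
        best.extend(ch)
    return [sorted(ch) for ch in non_trivial]
```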
4 If a private field needs to be shared by two or more of the extracted classes, the implementation of the needed getter and/or setter methods is left to the developer.
Figure 4 shows how the proposed approach works to extract, from the UserManagement class, three new classes having better defined responsibilities than the original class. The first part of the figure shows the graph that can be obtained from
the method-by-method matrix (note that the edges weighted with 0.0 are omitted),
while the second part of the figure shows the connected components obtained after
the matrix filtering. In this example we arbitrarily set minCoupling = 0.2. Thus, all the
edges having weight lower than 0.2 (that represent spurious relationships between
methods) are removed from the graph. The extracted components correspond to
the preliminary method chains. The third part of the figure shows the refinement
of the method chains. In particular, a trivial chain composed of only one method
(checkUser) is merged with the most coupled non-trivial chain (i.e., C1). In the end, the
approach suggests splitting the original class into three new classes.
4 Parameter Calibration
The proposed approach has several configuration parameters: the weights of the
similarity measures (wSSM, wCDM, and wCSM) and the threshold used to prune out
the spurious relationships between methods (minCoupling). While an assessment of
these parameters has been made previously (Bavota et al. 2011), we cannot just use
previous results, as the extract class refactoring algorithms are different and likely
the values of these parameters have a different impact on the new algorithms. For
this reason, in this section we conduct an empirical assessment of our approach with
the goal of defining and validating a heuristic to identify an optimal setting for these
parameters.
The context of our study is represented by five open source software systems,
namely ArgoUML 0.16, Eclipse 3.2, GanttProject 1.10.2, JHotDraw 6.0, and Xerces
2.7.0. ArgoUML (1,071 classes and 97 KLOC) is a UML modeling CASE tool
with reverse engineering and code generation capabilities. Eclipse (23,462 classes
and 1,710 KLOC) is a multi-language integrated development environment with an
extensible architecture through plug-ins. GanttProject (273 classes and 28 KLOC) is
a cross-platform desktop tool for project scheduling and management. JHotDraw
(275 classes and 29 KLOC) is a Java GUI framework for structured drawing
editors. Xerces (589 classes and 240 KLOC) is a family of packages for parsing and
manipulating XML files. It implements a number of standard APIs for XML parsing,
including DOM, SAX, and SAX2. Three of these systems, namely ArgoUML,
JHotDraw, and Eclipse, have also been used to assess the parameters of the Max
Flow-Min Cut approach (Bavota et al. 2011).
Fig. 5 Box plots of quality metrics for the systems used in the case study
of a class to methods in other classes, and therefore the dependency of local methods
on methods implemented by other classes. Higher MPC values indicate higher
coupling.
The analysis of these metrics shows that the overall quality of the object systems, in
terms of low coupling and high cohesion, is comparable across the systems. Although we do
not have a formal quality model, this claim is supported by the comparable quality of
the object systems with JHotDraw, which has been developed as a “design exercise”
and its design relies heavily on the proper use of well-known design patterns.
5 It is worth noting that while the general experimental design is the same, the Max Flow-Min Cut
approach (Bavota et al. 2011) was evaluated on artificial Blobs created by merging only two classes, as it is only able to split a Blob into two classes.
By construction, the merged classes have a worse cohesion than the original classes
(see the online Appendix (Bavota et al. 2012) for the details). Note also that the
randomly selected classes are merged only if their cohesion is higher than the average
class cohesion in the system. The choice of this threshold was guided by the analysis
of the box plots reported in Fig. 5. As we can see, most of the classes of the object systems have good cohesion, but there is a small set of outliers with very low cohesion. By considering the average cohesion as a threshold we exclude from our set of
classes these outliers, ensuring that the quality of the selected classes is rather good.
Once the mutated system is obtained, the proposed approach is applied to each
artificial Blob with the goal of reconstructing the original classes previously merged.
This is why it was important to select classes with high cohesion, because we can con-
sider them as the “golden standard”. Hence, to evaluate the results, the refactored
classes are compared with the original classes aiming at identifying the total number
of methods correctly and incorrectly moved in the split classes. To measure the
accuracy of the refactoring solutions we computed the MoJo eFfectiveness Measure
(MoJoFM) (Wen and Tzerpos 2004) between the original classes and those extracted
by our approach. The MoJoFM is a normalized variant of the MoJo distance and it
is computed as follows:

$$MoJoFM(A, B) = 1 - \frac{mno(A, B)}{\max(mno(\forall A, B))}$$

where $mno(A, B)$ is the minimum number of Move and Join operations needed to transform the partition $A$ into the partition $B$.
4.2 Analysis of the Results and Heuristics to Define the Configuration Parameters
Tables 1 and 2 report the best results—as measured with MoJoFM—achieved using
constant and variable thresholds respectively.6 The analysis of the results reveals that:
– the variable threshold generally provides better performance than the constant
threshold for the definition of minCoupling. We obtained comparable results
between constant and variable thresholds only on GanttProject. On the other
systems, the variable thresholds provide an average improvement in terms of
MoJoFM of about 0.06. This means that by using a variable threshold, our
approach is able to better reconstruct the original classes merged to create the
artificial Blobs. In addition, the best overall results are achieved on all the systems using as variable threshold the median (Q2) of the values of the matrix. In other words, the variable thresholds ensure a more stable filtering
performance across the different inputs, i.e., the different artificial Blobs to
be refactored. Regarding the constant thresholds, generally better results can
be achieved using a low value. As we can see in Table 1, none of the best
configurations results from using 0.4 as the constant threshold;
– the combination of structural and semantic measures considerably improves the
accuracy of our approach. As expected, the best results are achieved when all the
weights of the three cohesion metrics are greater than zero. This means that the
combination of structural and semantic measures is worthwhile, which confirms
the findings from previous work (Bavota et al. 2011).
– the optimal setting of the weights of the three cohesion measures is not stable across the object systems. The results highlight that the best configuration of weights changes considerably across the object systems. Unlike the Max Flow-Min
Cut approach, where in general the best performances were achieved giving a
high weight (greater than 0.6) to the semantic similarity measure, with this new
approach it is quite difficult to identify an optimal setting of the weights for the
three measures, which could be used for any system. This means that a different
heuristic is required to identify an optimal setting of the weights for different
systems.
To better understand how the parameters of the proposed approach affect our
results, we statistically analyzed the influence of the factors Weights and Threshold
on the reconstruction accuracy of our approach (MoJoFM) through interaction
plots.7 The interaction plots confirmed that generally the best performances can
be obtained using as threshold the median, i.e., Q2, of the non-zero values of the
method-by-method matrix and both structural and semantic measures. However, the
results also confirmed that the weights that produce optimal results are different
across the different object systems.
In consequence, we propose the use of Principal Component Analysis (PCA) on the method coupling data to identify a heuristic able to set up a customized configuration of the weights for different software systems, resulting in near-optimal
6 The complete results achieved with all possible combinations of parameters can be found in Bavota
et al. (2012).
7 The interested reader can find the interaction plots for all systems in our online appendix (Bavota
et al. 2012).
performances of our technique. We argue that PCA allows us to identify the different
dimensions that describe a phenomenon (in our case, the coupling between pairs of
methods) and obtain an indication of the importance of each dimension (captured
by one or more coupling measures) in the description of this phenomenon (i.e.,
the proportion of variance). Table 3 shows the results of the PCA on all the object
systems. As we can see, the semantic measure is identified by the PCA as the measure
that describes most of the coupling between pairs of methods. In particular, the
proportion of variance for the semantic similarity measure is higher than 0.6 for all
the object systems. Moreover, in general, both structural measures are important, as
they describe some of the relationships between pairs of methods. This confirms the
finding previously highlighted from the analysis of Tables 1 and 2.
As expected, the proportion of variance values are rather different across the
different systems, so our question was whether using the proportion of variance values to define the weights of the similarity measures provides MoJoFM results close to the optimal results shown in Table 2. Table 4 compares the results obtained
using the configuration parameters identified by the PCA proportion of variance
(PCA-based configuration) with the best results obtained in our experimentation (best configuration). As we can see, the difference between the reconstruction accuracy of the PCA-based configuration and the accuracy obtained using the best configuration is very small. Indeed, the difference in MoJoFM is never
higher than 0.04. We also executed the Wilcoxon test to compare the accuracy of the
two different configurations. The results on all object systems do not highlight any
statistically significant difference. This indicates that the PCA-based configuration
provides an accuracy similar to the best accuracy obtained by exercising all possible
parameter configurations. Given these findings we propose the following heuristics
to set the parameters of our approach in a real usage scenario:
– weights: perform a Principal Component Analysis on the values of the similarity measures computed on all the classes of the system. The value of the proportion of variance obtained for each measure will be used as the weight for the corresponding measure.
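The PCA step of this heuristic can be sketched with a standard eigen-decomposition (a sketch with numpy; how a proportion of variance is attributed to each individual measure is not detailed here, so attributing to each measure the variance of the principal component it loads most heavily on is our assumption):

```python
import numpy as np

def pca_weights(X):
    """X: rows = method pairs, columns = (SSM, CDM, CSM) values.
    Returns one weight per measure, summing to one. Each measure is
    credited with the proportion of variance of the principal
    component(s) it loads most heavily on (our reading)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize columns
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigvals)[::-1]               # components by variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    proportion = eigvals / eigvals.sum()
    weights = np.zeros(X.shape[1])
    for k in range(X.shape[1]):                     # dominant loading per component
        measure = np.argmax(np.abs(eigvecs[:, k]))
        weights[measure] += proportion[k]
    return weights / weights.sum()
```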
Using the artificial Blobs from the five open source systems, we also compared the
reconstruction accuracy of the proposed approach with the accuracy achieved by the
Max Flow-Min Cut approach, which is based on the same graph-based representation
of a class. Since the Max Flow-Min Cut approach is only able to split a Blob into two
classes, we performed this comparison only on the artificial Blobs created by merging
two classes.
[Fig. 7: Average MoJoFM per system when merging two classes, our approach vs. the
Max Flow-Min Cut approach (Eclipse: 0.90 vs. 0.71; GanttProject: 0.87 vs. 0.70;
JHotDraw: 0.90 vs. 0.73; Xerces: 0.83 vs. 0.64)]

Figure 7 reports the results achieved using, for each approach, its best configuration
of parameters. As we can see, the reconstruction accuracy of our new approach is
always better than the reconstruction accuracy obtained with the Max Flow-Min Cut
approach. In particular, the average difference of MoJoFM
is 16.8 %. Note that our new approach not only improves on the reconstruction
accuracy of the Max Flow-Min Cut approach, but it also automatically derives that
the artificial Blobs have to be split into two classes, whereas the Max Flow-Min Cut
approach splits the artificial Blobs into two classes by construction.
We statistically analyzed the performance of the two approaches using the Mann-
Whitney test (Conover 1998). We chose this test because we cannot assume normality
of the data, and the test makes no normality assumptions. In particular, we used the
test to assess the statistical significance of the difference between the reconstruction
accuracies provided by the two approaches. Results were considered statistically
significant at α = 0.05. Table 5 reports the achieved results. As we can see, the
reconstruction accuracy of our new approach is significantly higher than that
achieved by the Max Flow-Min Cut approach, for each system.
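Such a comparison can be replicated, for instance, with SciPy's `mannwhitneyu`; the per-Blob MoJoFM vectors below are made-up illustrative data, not the paper's measurements.

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-Blob MoJoFM values for the two approaches
# (illustrative data only).
mojofm_new     = [0.85, 0.90, 0.88, 0.92, 0.87, 0.91, 0.89, 0.93]
mojofm_maxflow = [0.61, 0.65, 0.62, 0.63, 0.67, 0.64, 0.66, 0.60]

# One-sided test: is the new approach's accuracy significantly higher?
stat, p_value = mannwhitneyu(mojofm_new, mojofm_maxflow, alternative="greater")
significant = p_value < 0.05  # alpha level used in the paper
```

With every value of the first sample above every value of the second, the U statistic reaches its maximum (n1 x n2) and the one-sided p-value is far below 0.05.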
This result may seem surprising as the two approaches use the same structural
and semantic measures and the same graph-based representation of the class to split.
The difference is in the algorithm adopted to split the graph into sub-graphs. Thus,
to understand the reasons for the performance gap between the two approaches, it is
important to point out the differences between the two algorithms:
1. Both algorithms include a filtering step aimed at removing spurious connections
   between nodes. Indeed, due to the use of the semantic similarity between meth-
   ods (which is very unlikely to be exactly zero), the initial graph representation is
   in general a complete graph (i.e., it contains all possible edges). In this case, the
   Max Flow-Min Cut algorithm would always split a graph containing n nodes into
   two graphs containing n − 1 nodes and 1 node, respectively (Cormen et al. 2001).
   So, filtering and removing some edges is needed to avoid a trivial application
   of the Max Flow-Min Cut algorithm. However, filtering in this case does not
   disconnect the graph, as the Max Flow-Min Cut algorithm is the one splitting the
   graph.
   On the other hand, the filtering step of our new approach is much more aggressive,
   as it aims at splitting the graph into subgraphs representing loosely coupled
   components. This step is therefore key to our new extract class refactoring method.
   The other steps of the new method consist of (i) identifying chains of nodes
   belonging to the same subgraph (and thus methods belonging to the same class)
   and (ii) aggregating the small subgraphs (i.e., the trivial chains composed of fewer
   than three methods) with the most coupled non-trivial chain previously identified.
   This merging step is also very important to correct possible over-splitting
   introduced by the filtering step.
2. The Max Flow-Min Cut algorithm needs as input the source and sink nodes,
   which ideally represent two methods belonging to the two different classes to be
   extracted from the Blob. In our previous work (Bavota et al. 2011) the heuristic
   used to identify the source and sink nodes selects the two nodes in the graph
   connected by the edge with the lowest weight, i.e., the two least coupled methods
   (according to the structural and semantic similarity measures used) in the Blob
   class. Clearly, in some cases this heuristic does not work properly and selects
   two methods that should instead be in the same class. In this case the splitting
   performed by the algorithm is negatively affected, since it is guided by wrong
   initial assumptions. Our new technique does not suffer from similar problems, as
   the splitting is performed by the filtering step. In fact, the new technique helped
   reveal this previously unnoticed problem with the selection of the source and
   sink nodes.
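The filtering-and-chains procedure described in point 1 can be sketched as follows. The function name, the input format (a map from unordered method pairs to a combined structural+semantic weight), and the single global threshold are our own simplifying assumptions, not the paper's implementation.

```python
from collections import defaultdict

def split_into_chains(methods, sim, threshold, min_size=3):
    """Sketch of the chain-based splitting step (hypothetical names).
    sim maps frozenset({m1, m2}) -> combined similarity weight."""
    # Step 1: filtering -- drop edges below the threshold so that the
    # (otherwise near-complete) graph falls apart into loosely coupled parts.
    adj = defaultdict(set)
    for pair, weight in sim.items():
        if weight >= threshold:
            a, b = tuple(pair)
            adj[a].add(b)
            adj[b].add(a)
    # Step 2: chains = connected components of the filtered graph.
    chains, seen = [], set()
    for m in methods:
        if m in seen:
            continue
        component, stack = set(), [m]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adj[node] - component)
        seen |= component
        chains.append(component)
    # Step 3: merge each trivial chain (< min_size methods) into the
    # non-trivial chain it is most coupled with, using the unfiltered weights.
    non_trivial = [c for c in chains if len(c) >= min_size]
    trivial = [c for c in chains if len(c) < min_size]

    def coupling(c1, c2):
        return sum(sim.get(frozenset((x, y)), 0.0) for x in c1 for y in c2)

    for t in trivial:
        target = max(non_trivial, key=lambda c: coupling(t, c))
        target |= t
    return non_trivial
```

Each returned set of methods then becomes one extracted class; note that, unlike Max Flow-Min Cut, the number of classes is derived automatically from the number of non-trivial chains.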
One can argue that by iteratively applying an algorithm that splits a class in two, we
can obtain more than two new classes from the old one. So, we also tried to apply
the Max Flow-Min Cut approach iteratively to refactor the artificial Blobs created
by merging three classes (M1, M2, and M3) together. In this case, in the first it-
eration the Max Flow-Min Cut algorithm splits the artificial Blob into two classes
E1 and E2. For the iterative usage to be useful, one of the extracted classes (say E1)
should contain most of the methods of one of the original classes (say M1), while the
second extracted class (E2) should contain most of the methods of the other two
original classes (M2 and M3). The approach should then be re-applied to E2 in order
to extract M2 and M3, thus reconstructing the original classes. However, this rarely
happens: the methods of the three original classes are usually distributed more
evenly between E1 and E2 after the first iteration.
To verify this, we applied the Max Flow-Min Cut approach on the artificial Blob in
the first iteration and on both the extracted classes in the second iteration. Then, we
selected as refactoring solution the one achieving the highest reconstruction accuracy
(i.e., the highest MoJoFM) between the two generated. For example, suppose that E1
and E2 are the two classes extracted from the artificial Blob at the first iteration. In
the second iteration we apply the Max Flow-Min Cut approach on both E1 and E2
obtaining the classes E3 and E4 extracted from E1 and E5 and E6 extracted from
E2. We then compute the reconstruction accuracy of the following two sets of classes:
S1 = {E1, E5, E6} and S2 = {E2, E3, E4}. Supposing that the MoJoFM achieved by
S1 is 0.7 while the one achieved by S2 is 0.6, S1 is selected as the refactoring solution.
Figure 8 reports the achieved results. As we can see, the iterative application
of the Max Flow-Min Cut approach produces worse results than those achieved by
our new technique. The performance gap in this scenario is substantial: our approach
obtained a reconstruction accuracy, in terms of MoJoFM, 22 % higher on average
than the Max Flow-Min Cut approach.
The empirical results show a high reconstruction accuracy for the proposed approach.
A threat that could affect the validity of this result is that our approach was applied
on artificial Blobs, and thus reconstructing previously merged classes might be
trivial. However, in Section 4.3 we mitigated such a threat by showing that the Max
Flow-Min Cut approach is not able to reach the same reconstruction accuracy when
applied on the same set of artificial Blobs. To further mitigate this threat, we also
analyzed the coupling between the classes to be merged, in order to understand
whether there is a correlation with the reconstruction accuracy of our approach. If
two merged classes have no coupling between them, then the
outcome of splitting might be close to perfect.

[Fig. 8: Average MoJoFM per system when merging three classes, our approach vs.
the iterative Max Flow-Min Cut approach (Eclipse: 0.75 vs. 0.48; GanttProject: 0.75
vs. 0.53; JHotDraw: 0.77 vs. 0.54; Xerces: 0.68 vs. 0.52)]

On the other hand, if their coupling is high, it might be "translated" into similarity
between the members of the merged class,
affecting the results. We used the Conceptual Coupling Between Classes (CCBC)
(Poshyvanyk et al. 2009) and the information-flow-based coupling (ICP) (Lee et al.
1995) to measure the coupling between the merged classes. Then, we measured the
statistical correlation between the coupling of the merged classes and the splitting
accuracy by computing the Pearson product-moment correlation coefficient (PMCC)
(Cohen 1988). The results revealed no correlation on any of the object systems.
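The PMCC itself is straightforward to compute from its definition; the sketch below is generic and not tied to the paper's data (the inputs would be one coupling value, e.g., CCBC or ICP, and one MoJoFM value per artificial Blob).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient between two samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```

A coefficient near 0 between the coupling of the merged classes and the MoJoFM of the split indicates no linear relationship, which is what "no correlation" means here.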
In the context of our study, the following research questions were formulated:
– RQ1: What is the impact of the refactoring suggested by our approach on class
  cohesion and coupling?
– RQ2: Does the proposed refactoring result in a better division of responsibili-
  ties, from a developer's point of view?
To answer our first research question (RQ1) we analyzed the changes in cohesion
and coupling in the object systems when applying the refactoring operations
suggested by our approach. We expected an increase in cohesion (desired effect),
due to splitting the responsibilities implemented in the Blobs into different classes.
However, we also expected an increase in coupling (side effect), since splitting a class
into several classes usually increases the total number of dependencies between
classes. For these reasons, coupling and cohesion should be measured together to
make a proper judgment on the complexity and quality of a system, since an improve-
ment in cohesion usually comes at the expense of an increase in coupling, and vice
versa (Stewart et al. 2006). To measure the cohesion of the analyzed classes we used
the LCOM and C3 metrics, while for coupling we used the MPC metric, since it
allows us to understand whether the interactions due to method calls between the
extracted classes increase after the refactoring operations suggested by our approach.
The cohesion and coupling metrics were measured before and after refactoring. To
this aim, we applied the refactoring operations suggested by our approach using the
extract class functionality of Eclipse. As a benchmark, we compared the results ob-
tained applying our new approach with those obtained using the Max Flow-Min Cut
approach.
With regard to research question RQ2, we analyzed the refactoring operations
proposed by our approach from the developers' point of view. To this aim, we
performed two experiments involving a total of 50 Master's students in Computer
Science from the University of Salerno.8 Before the experiment, the students at-
tended a two-hour seminar about the most common refactoring techniques, their
objectives, and their usefulness during the software lifecycle. During the semester in
which the experimentation was carried out, the students were attending courses on
Advanced Software Engineering, Advanced Databases, Programming Languages and
Compilers, and Advanced Computer Networks. As for their background, all students
had in their Bachelor curriculum at least one exam on Object-Oriented programming
(in Java) and one on software engineering. Students participated in the study
voluntarily and no selection process was performed (i.e., all students who volunteered
were accepted). Finally, during the experiment students were allowed to leave, but no
one did.
The first experiment involved 30 students, who evaluated three different refactor-
ing operations for each of the seven Blobs from the GanttProject system: (i) the
refactoring suggested by our approach, (ii) the refactoring suggested by the Max
Flow-Min Cut approach, and (iii) a random refactoring. The second option was used
to provide the students with an alternative refactoring solution that makes sense,
but that is likely worse than the refactoring solution suggested by our approach
(at least according to the results obtained in Section 4). The last option likely does
not make sense as a refactoring solution and was only considered to verify whether
participants took the assignment seriously (i.e., a sanity check). For each of the
proposed refactorings the students had to express their level of agreement with the
claim "The proposed refactoring results in a better division of responsibilities" by
assigning a score on a Likert scale (Oppenheim 1992): 1: Strongly disagree; 2:
Disagree; 3: Neutral; 4: Agree; 5: Fully agree. The students had 140 min to perform
the assigned task (on average 20 min for each Blob). The second experiment was
conducted using the same design, but involved 20 subjects and was performed on the
ten Blobs of the Xerces system, with a time limit of 200 min (the same 20 min
average for each Blob).
To answer the research question RQ2 , the results achieved in the two experiments
were analyzed through boxplots and statistical tests. For the statistical analysis, we
decided to use the Mann-Whitney test (Conover 1998) since we cannot assume
normality of the data. We collected the ranking for each of the three proposed
refactoring solutions. Then, for each pair of considered approaches (e.g., our new
approach vs. the random refactoring), we used the Mann-Whitney test to analyze the
statistical significance of the difference between the scores assigned by the students
to the refactoring solutions of the two approaches. Results were considered
statistically significant at α = 0.05.
Table 6 reports information about the Blobs object of our study before and after the
refactoring suggested by our approach, and in particular the LOC and the number of
methods.9
9 In the number of methods we do not count the constructors (for both pre- and post-refactoring)
nor any getter and setter methods that would be added after the refactoring. In this way the sum
of the methods of the extracted classes is equal to the number of methods of the Blob class.
10 The interested reader can find the results of the Max Flow-Min Cut approach for each Blob in our
online appendix (Bavota et al. 2012).
Table 6 Refactoring solutions proposed by our approach on the 17 Blobs object of our study
System        Class                      Split classes   Pre-refactoring      Post-refactoring
                                                         LOC      Methods     LOC      Methods
Xerces AbstractDOMParser 2 1,775 45 522 15
1,259 30
Xerces AbstractSAXParser 3 1,360 55 155 11
241 12
967 32
Xerces BaseMarkupSerializer 2 1,275 61 152 10
1,123 51
Xerces CoreDocumentImpl 3 1,497 119 82 11
79 14
1,350 94
Xerces DeferredDocumentImpl 2 1,612 76 1,061 34
630 42
Xerces DOMNormalizer 2 1,291 31 33 13
1,268 18
Xerces DOMParserImpl 2 820 17 454 7
431 10
Xerces DurationImpl 2 953 44 351 16
640 28
Xerces NonValidatingConfiguration 2 403 18 123 3
284 15
Xerces XIncludeHandler 4 1,331 111 372 32
440 26
169 17
524 36
GanttProject GanttOptions 3 513 68 438 51
72 11
37 6
GanttProject GanttProject 3 2,269 90 2,086 71
124 6
118 13
GanttProject GanttGraphicArea 2 2,160 43 2,025 32
197 11
GanttProject GanttTree 2 1,730 48 1,382 42
423 6
GanttProject GanttTaskPropertiesBean 2 1,685 27 1,164 21
524 6
GanttProject ResourceLoadGraphicArea 2 1,060 29 873 21
227 8
GanttProject TaskImpl 3 329 46 234 27
69 10
44 9
the MPC of the original Blob is 573, while the sum of the MPC of the extracted
classes is 588. Thus, the percentage increase in terms of MPC is only about 3 %. The
increase in coupling considering all the classes affected by the refactoring is also just
+4 % (from 1,650 to 1,724).
Table 10 reports the average MPCextracted and MPCaffected coupling values for
the two systems before refactoring, after refactoring with our approach, and after
refactoring with the Max Flow-Min Cut approach. As shown in Table 10, the
increase in coupling generated by our approach (+3.2 %) is only slightly higher
than that of the Max Flow-Min Cut approach (+2.3 %). Note that the (slightly)
better performance in terms of coupling ensured by the Max Flow-Min Cut approach
is an expected result, since it extracts a lower number of classes from the Blobs than
our approach (i.e., 34 vs. 41). This clearly results in fewer dependencies between the
extracted classes.

Table 8 Average cohesion: our approach vs. Max Flow-Min Cut approach
System     Pre-refactoring     Our approach       Max Flow-Min Cut approach
           LCOM      C3        LCOM      C3       LCOM      C3
Xerces     1,504     0.11      256       0.23     588       0.21
Gantt      1,033     0.16      242       0.31     310       0.27
Average    1,310     0.13      257       0.27     473       0.24
As for the coupling measured over all classes involved in the refactoring operations
(columns MPCaffected), the increase is also very small in percentage terms. In
particular, our approach increases the number of dependencies for the classes
involved in the refactoring, on average, from 1,233 to 1,260 (+2.2 %), while the Max
Flow-Min Cut approach (Bavota et al. 2011) increases it from 1,233 to 1,254 (+1.7 %).

Table 10 Average coupling: our approach vs. the Max Flow-Min Cut approach
System     Pre-refactoring             Our approach                Max Flow-Min Cut approach
           MPCextracted  MPCaffected   MPCextracted  MPCaffected   MPCextracted  MPCaffected
Xerces     371           1,054         382           1,080         381           1,076
Gantt      528           1,489         547           1,519         540           1,510
Average    436           1,233         450           1,260         446           1,254
In summary, both approaches are able to strongly increase class cohesion while
paying a small price in terms of increased coupling. Our approach achieves a higher
increase in cohesion than the Max Flow-Min Cut approach, but it also increases the
coupling between classes slightly more (an expected result).
11 A fine-grained analysis of the scores assigned by the students is reported in our online appendix
(Bavota et al. 2012).
suggested by the Max Flow-Min Cut approach obtains a statistically significantly
higher score than the random splitting in both experiments.
The quantitative data gathered from the subjects allow us to positively answer our
second research question (RQ2): the proposed refactoring results in a better division
of responsibilities from a developer's point of view. However, to gain deeper insight
into the scores provided by the students, we also analyzed some of the refactoring
operations proposed by our approach that generally received good (or bad) scores
from the students.
Operations Positively Evaluated by Students For the Xerces system, two refactoring
operations positively evaluated by almost all the students concern the AbstractSAX-
Parser and XIncludeHandler classes. We observed that one of the classes extracted
from the AbstractSAXParser class can be classified as an Entity class, since it contains
only a set of attributes and the corresponding getter and setter methods. The
refactoring of the XIncludeHandler class is particularly interesting for two reasons:
(i) the refactoring operation suggested by our novel approach achieved in this case
the highest possible average score, i.e., 5 (the refactoring operation suggested by the
Max Flow-Min Cut approach for the same Blob obtained an average score of 2.7),
and (ii) this is the case with the highest difference in the number of extracted classes
with respect to the Max Flow-Min Cut approach (4 vs. 2).
To better analyze this case, Fig. 11 reports the topic maps (Kuhn et al. 2007)
representing the main topics of (i) the original Blob, (ii) the classes extracted by our
approach, and (iii) the classes extracted using the Max Flow-Min Cut approach. The
topic map for a class C is computed by analyzing the term frequency in the methods
of C. In particular, for each term present in C (excluding Java keywords), we count
the number of methods that contain it. The five most frequent terms, i.e., the terms
present in the highest number of methods, are then used to construct the topic map
of C, which is therefore represented by a pentagon where each vertex represents
one of the main topics. Each vertex is connected to the center of the pentagon by
an axis representing the percentage of methods in the class that implement the
corresponding topic. The graphical representation of the main topics of C is then
obtained by tracing lines between the points on each of the five axes indicating the
percentage of methods belonging to C that implement the corresponding topic. The
methods in XIncludeHandler implement the XInclude handling of XML documents
according to the W3C recommendations. The XInclude functionality allows reusing
an XML document by including it in other XML documents. The main topics in the
class are reported on the right side of Fig. 11. As we can see, the most frequent terms
are: XML (the kind of document involved), DTD (Document Type Definition, a set
of declarations used to define the document type for markup languages like XML),
Include (the main responsibility of the class), Error (the management of the possible
errors derived from the XInclude operation), and Augmentations (the infoset augmen-
tation that can be used to modify an XML infoset during schema validation). Thus,
although the main responsibility of this class is the implementation of the XInclude
handling, it also implements some auxiliary (and poorly related) responsibilities. The
application of our approach to XIncludeHandler produced four new classes, each one
specialized in one particular responsibility: Class1 primarily deals with the Document
Type Definition, Class2 with the infoset augmentation of XML documents, Class3
with the implementation of the XInclude operation, and Class4 with the management
of the possible errors derived from the XInclude operation. Concerning the refactoring
proposed by the Max Flow-Min Cut approach (bottom part of Fig. 11), we can
observe that the two extracted classes still represent a mixture of different topics,
although the distribution of responsibilities is better than in the original Blob.
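The topic-map construction described above can be sketched as follows; the input format (one source string per method) and the reduced keyword list are our own simplifications.

```python
import re
from collections import Counter

# A small illustrative stop list; a real implementation would use the full
# set of Java keywords.
JAVA_KEYWORDS = {"public", "private", "protected", "void", "int", "return",
                 "new", "if", "else", "for", "while", "class", "static",
                 "final", "this"}

def topic_map(method_bodies, top_n=5):
    """method_bodies: one source string per method of the class (illustrative
    input format). Returns the top_n most widespread terms together with the
    fraction of methods containing each one -- the axes of the pentagon."""
    per_method_terms = [
        {t.lower() for t in re.findall(r"[A-Za-z]+", body)} - JAVA_KEYWORDS
        for body in method_bodies
    ]
    counts = Counter()  # term -> number of methods containing it
    for terms in per_method_terms:
        counts.update(terms)
    n = len(method_bodies)
    return [(term, count / n) for term, count in counts.most_common(top_n)]
```

Each returned fraction corresponds to the distance from the pentagon's center along one axis of the radar chart.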
In this section we discuss the threats that could affect the validity of the results of this
study.
12 To avoid bias in the experiment, none of the authors was involved in this evaluation.
experiments. These students were familiar with the two systems and also known to
the authors as extremely serious and reliable. Table 12 reports the answers provided
for each analyzed class. As we can see, they generally assigned high scores to the
analyzed refactoring operations. Moreover, the cases where the students negatively
evaluated the proposed refactoring operations are almost the same as those identified
in the two experiments, e.g., DeferredDocumentImpl and NonValidatingConfiguration.
We are quite confident that the results achieved in our studies reflect well the quality
of the refactoring solutions proposed by the approaches.
However, the threats related to this experimental design still remain, due to the
nature of the study. The user study presented in Section 6 provides a qualitative eval-
uation of the proposed approach, in part to overcome some of the threats discussed
here.
In the context of this study, the following research question has been formulated:
– RQ3: Are the refactoring solutions suggested by our approach useful for devel-
  opers when performing extract class refactoring?
To obtain the objects needed for our study, we mined six open source systems (i.e.,
Apache HSQLDB, ArgoUML, JEdit, JFreeChart, JHotDraw, Xerces) looking for
extract class refactoring operations performed during their history by the original
developers. We used Ref-Finder (Prete et al. 2010) to identify the refactoring
13 None of the 50 students involved in the user study reported in Section 5 was involved in this
experiment.
operations performed between two subsequent versions of the same system. Ref-
Finder is a tool able to identify 63 different types of refactoring, but unfortunately
not extract class. However, the latter can be identified by Ref-Finder as a set of
move method and move field operations from the original class to the new extracted
classes. We manually validated the sets of move method and move field refactorings
retrieved by Ref-Finder to identify extract class refactoring operations performed
by the original developers. In total, we identified eleven meaningful extract class
refactoring operations performed by the original developers, presented in Table 13.
To answer our research question, we provided each subject with the eleven classes to
refactor, together with the refactoring solution proposed by our approach. Then, since
for each of the eleven identified classes we have the original class as well as the new
classes extracted by the developers, we can answer RQ3 from both a qualitative and
a quantitative point of view.
As for the qualitative analysis, we asked the subjects the following questions:
Table 13 Extract class refactoring operations identified in the six analyzed systems
System Original class Extracted classes
Apache HSQLDB Database (41) Database (27)
SchemaManager (14)
Select (14) Select (7)
Result (7)
UserManager (13) UserManager (10)
GranteeManager (3)
ArgoUML FileGeneratorAdapter (9) FileGeneratorAdapter (3)
TempFileUtils (6)
Import (10) Import (7)
ImportCommon (3)
JEdit JEditTextArea (214) JEditTextArea (22)
SelectionManager (11)
TextArea (181)
JFreeChart JFreeChart (24) JFreeChart (16)
Plot (8)
NumberAxis (20) NumberAxis (16)
ValueAxis (4)
JHotDraw DefaultApplicationModel (14) DefaultApplicationModel (4)
AbstractApplicationModel (10)
Xerces XMLDTDValidator (69) XMLDTDValidator (38)
XMLDTDProcessor (31)
XMLSerializer (25) XMLSerializer (12)
DOMWriterImpl (13)
In parentheses, the number of methods in each class.
iii. Did you find the provided refactoring solution useful as a starting point to
     perform your refactoring? Why?
(b) if NO:
i. Why?
As for the quantitative analysis, we measured how much the refactoring produced
by the students (i) differed from the solution suggested by our approach and
(ii) approximated the refactoring performed by the original developers. We used
MoJoFM (Wen and Tzerpos 2004) to measure the similarity between the refactorings
performed by the students, those proposed by our approach, and those performed
by the original developers. Moreover, for each of the eleven classes object of our
study, we also measured how far the refactoring suggestion proposed by our
approach is from the refactoring performed by the original developers.
For each of the eleven classes object of our study, Table 14 shows the percentage of
subjects answering "YES" to the three YES/NO questions of our survey. For example,
13 out of the 15 students involved (87 %) would split the class Database, and 8 of
them (62 %) would split the class differently than the solution suggested by our ap-
proach. However, all these 13 subjects found the provided refactoring solution (i.e.,
the one proposed by our approach) a good starting point to perform the refactoring.
The analysis of Table 14 reveals interesting results. First of all, the subjects would
not always split the provided classes. In particular, there are two of the eleven classes
(i.e., Select and Import) for which the majority of the students did not feel that extract
class refactoring was needed. As explained in the design, we also asked subjects why
they would (or would not) split each class. Analyzing these answers we found that
the subjects judged the complexity of both classes acceptable and were not able to
identify different responsibilities implemented in them. Clearly, this result contrasts
with the choice made by the original developers. However, analyzing the refactoring
performed by the original developers on the classes Select and Import, it is clear that
their choice was not driven by the high complexity of those classes or by the high
number of responsibilities implemented in them, but rather by the desire to adhere
to a specific architectural style. In fact, these classes implement a rather low number
of methods (i.e., 14 in Select and 10 in Import), and the original developers performed
the extract class refactoring to split both classes into a Model class, responsible for
modeling an entity in the system, and a Controller class working on the Model.
These kinds of refactoring can also be identified by our approach, since methods
implementing a Model class generally share several instance variables, while
methods implementing a Controller class generally have a high number of method
calls among them, since they co-operate in the implementation of some functionality.
In fact, for the Import class our approach proposes exactly the same refactoring
performed by the original developers, and the six students who would split this class
accepted the suggested refactoring as is (and, clearly, found the suggestion useful).
Moreover, three of them were also able to motivate their choice by explaining that
“the class Import seems a merge between a Model and a Controller”, “it is possible to
extract a model class”, and “to improve its reusability a model class can be extracted”.
As for the Select class, the four subjects who would split it accepted the suggestion of
our approach as is, explaining its usefulness with the fact that "the extracted classes
looks strongly cohesive".
On the other hand, there are four classes that all subjects would split
(i.e., UserManager, JEditTextArea, JFreeChart, and XMLDTDValidator). For the
UserManager class, the subjects explained that this class is responsible for "more
than just managing the users" and that they could identify "two different responsibilities
implemented in it". Ten subjects (67 %) found the suggestion of our tool useful,
explaining that "it eases code comprehension" by "highlighting the main responsi-
bilities implemented in the class". Our approach splits each of these classes into two
new classes. It is interesting to note that 87 % of the subjects (13 out of 15) modified
the suggested refactoring solution, and all of them moved just one method from one
of the extracted classes to the other, obtaining exactly the refactoring performed by
the original developers. Concerning the 33 % of subjects who did not find the
suggestion of our approach useful for this class, most of them motivated this answer by
explaining that "the class was not really complex" and thus "its main responsibilities
can be identified without any suggestion". However, none of them complained about
the quality of the proposed refactoring.
Also interesting is the case of the JEditTextArea class, for which all subjects (i)
thought that a refactoring was necessary, (ii) would change the proposed
refactoring, and (iii) thought that the suggested refactoring was useful as a starting
point. As for the reasons to split this class, most of the subjects explained that "the
class is very complex", "intricate", and "seems to have low cohesion".
Our technique extracts three new classes from JEditTextArea. All subjects suggested
that two of these classes could be merged together, and four of them also extracted
a new class managing "the scrolling of a text area". However, all subjects found
the starting refactoring suggestion useful, commenting that "the proposed division of
responsibilities makes sense, but perhaps it is a bit excessive". This motivation explains
the fact that all of them merged together two of the three extracted classes. Looking
at the refactoring performed on this class by the original developers, we found that
JEditTextArea was actually split into three new classes. However, two of the classes
extracted by our approach (i.e., the two generally merged together by subjects) were
grouped into one single class by the original developers. Thus, the choice made by
the subjects of merging together two of the three extracted classes looks reasonable.
For other classes, like JFreeChart and XMLDTDValidator, all students accepted
the refactoring suggestion as is, commenting in most cases that "the extracted
responsibilities were meaningful" and "cohesive classes were extracted".
Finally, another interesting case concerns the NumberAxis class. Most of the
students (87 %) would split this class, since "the management of the axis values
should be extracted". Our approach suggested splitting this class into three new
classes. While 38 % of the students accepted this suggestion as is, the other 62 %
changed it by merging two of the three suggested classes. The students who changed
the refactoring proposed by our approach (with just a Join operation) were able to
replicate the refactoring performed by the original developers. Overall, all subjects
appreciated the refactoring suggestion, highlighting that it "eases the comprehension
of the main responsibilities implemented in a class".
Summarizing, except for the case of the UserManager class discussed above,
subjects always found the solutions suggested by our approach useful when
performing refactoring. Among the most frequent explanations we found:
1. it eases code comprehension;
2. it highlights the main responsibilities implemented in a class;
3. the extracted classes are cohesive.
Moreover, subjects stated that in some cases “without the refactoring suggestion it
would be too difficult to identify the main responsibilities of the classes”.
Concerning the quantitative data, Table 15 reports the average MoJoFM between
(i) the refactoring suggested by our approach and that performed by the original
developers, (ii) the refactoring performed by the subjects and the refactoring
suggested by our approach, and (iii) the refactoring performed by the subjects
(starting from the suggestions of our approach) and the refactoring performed by
the original developers. Table 15 also reports the number of Move/Join operations
needed to convert one refactoring into the other.

Table 15 MoJoFM between (i) the refactoring suggested by our approach and that performed by
the original developers, (ii) the refactoring performed by subjects and the refactoring proposed by
our approach, and (iii) the refactoring performed by subjects and that performed by the original
developers

Class                          Our approach to        Subjects to our        Subjects to
                               original dev.          approach               original dev.
                               MoJoFM  #Move/Join     Avg.     Avg.          Avg.     Avg.
                                                      MoJoFM   #Move/Join    MoJoFM   #Move/Join
Database (41)                  0.97    1               0.98     0.6           0.99     0.5
Select (14)                    0.83    2               1.00     0             0.83     2
UserManager (13)               0.93    1               0.94     0.9           0.98     0.3
FileGeneratorAdapter (9)       0.86    1               0.92     0.6           0.94     0.4
Import (10)                    1.00    0               1.00     0             1.00     0
JEditTextArea (214)            0.84    34              0.97     6             0.87     27
JFreeChart (24)                0.95    1               1.00     0             0.95     1
NumberAxis (20)                0.94    1               0.96     0.6           0.98     0.4
DefaultApplicationModel (14)   0.92    1               0.99     0.2           0.92     1
XMLDTDValidator (69)           0.88    8               1.00     0             0.88     8
XMLSerializer (25)             0.91    2               0.97     0.7           0.94     1.4
Average                        0.91    4.7             0.98     0.9           0.93     3.8

In parentheses: the number of methods in the class.
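As a reminder, MoJoFM (Wen and Tzerpos 2004) normalizes the MoJo distance mno(A, B), i.e., the minimum number of Move and Join operations needed to transform decomposition A into decomposition B, by its worst case over all possible starting decompositions (values in Table 15 are reported on a 0–1 scale rather than as percentages):

```latex
\mathrm{MoJoFM}(A,B) \;=\; 1 \;-\; \frac{\mathrm{mno}(A,B)}{\max_{\forall A'} \mathrm{mno}(A',B)}
```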
The first thing that stands out is that our approach approximates the refactorings
performed by the original developers extremely well, achieving an average MoJoFM
of 0.91. For example, for the Database class from the Apache HSQLDB system our
approach achieves a MoJoFM value of 0.97. This class was split into two new classes
by the original developers (see Table 6), one containing 27 methods and one
containing 13. Our approach splits the Database class into three classes. The first is
composed of the same 27 methods included in one of the two classes extracted by the
developers. The other two extracted classes contain the remaining 13 methods, eight
in one class and five in the other. Thus, by performing only one Join operation (i.e.,
merging the two smaller extracted classes) it is possible to obtain the refactoring
performed by the original developers.
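To make the Move/Join counting concrete, the sketch below converts a proposed decomposition into a reference one using a greedy plurality assignment. It is a rough illustration only, not the exact MoJo/MoJoFM algorithm of Wen and Tzerpos (2004), which computes an optimal cluster-to-group assignment; the numeric method identifiers are hypothetical stand-ins for the Database example.

```python
# Rough sketch of Move/Join counting between two class decompositions.
# NOT the exact Wen & Tzerpos MoJo algorithm (which uses an optimal
# cluster-to-group assignment); this greedy variant only illustrates how
# Move and Join operations convert one partition into another.
from collections import Counter

def greedy_moves_and_joins(proposed, reference):
    """proposed/reference: lists of sets of method identifiers."""
    # Map each method to the reference class containing it.
    ref_of = {m: i for i, grp in enumerate(reference) for m in grp}
    assignments = []  # reference group each proposed class is assigned to
    moves = 0
    for cls in proposed:
        tally = Counter(ref_of[m] for m in cls)
        target, kept = tally.most_common(1)[0]  # plurality reference group
        assignments.append(target)
        moves += len(cls) - kept  # methods that must be moved out
    # Proposed classes assigned to the same reference group get merged (Join).
    joins = len(assignments) - len(set(assignments))
    return moves, joins

# Database-like example: developers split 40 methods into two classes;
# the tool proposed three, one of which matches exactly.
dev = [set(range(27)), set(range(27, 40))]
tool = [set(range(27)), set(range(27, 35)), set(range(35, 40))]
print(greedy_moves_and_joins(tool, dev))  # (0, 1): a single Join suffices
```

Applied to this hypothetical Database-like split, the sketch reports zero Moves and one Join, matching the single Join operation described in the text.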
On average, 4.7 Move/Join operations are required to convert the refactoring
suggested by our approach into the refactoring performed by the original developers
(note that the median is 1). The only case requiring a rather high number of
Move/Join operations to convert the refactoring solution proposed by our approach
into the one performed by the developers is the JEditTextArea class, with 34
operations. However, it is worth noting that the original class was composed of 214
methods. Thus, in this case the 34 Move/Join operations represent a reasonably good
result, as demonstrated also by the high MoJoFM value achieved (0.84).
As a benchmark, we compared the performance of our approach with that
achieved by the Max Flow-Min Cut approach (Bavota et al. 2011). The gap in
performance between the two approaches is very large: 0.91 for our approach versus
0.62 for the Max Flow-Min Cut approach. An interesting case is the JEditTextArea
class, since it is the only one split by the original developers into three new classes.
In this case our approach achieves a MoJoFM of 0.84, against the 0.74 achieved by
the Max Flow-Min Cut approach. We also iteratively applied the Max Flow-Min Cut
approach, as explained in Section 4.4. In this case the MoJoFM even decreases to
0.71, again demonstrating the unsuitability of applying the Max Flow-Min Cut
approach iteratively.
The second important result of our study is that the refactorings suggested by our
approach were only slightly modified by the 15 subjects. In fact, the average MoJoFM
value is 0.98, and the average number of Move/Join operations required to convert
our approach’s suggestion into the refactoring performed by the subjects is less than
one (0.9). Moreover, starting from the suggestions of our approach, the subjects were
able to more closely approximate the refactoring performed by the original
developers, achieving an average MoJoFM value of 0.93. On average, only 3.8
Move/Join operations are needed to convert their refactorings into the refactorings
performed by the original developers.
This result highlights how the refactoring solutions suggested by our approach
represent a very good starting point for developers interested in performing extract
class refactoring operations. In fact, students with no knowledge of the object systems
were able to comprehend the classes and perform refactoring operations very close
to those performed by the original developers, who obviously have a much deeper
knowledge of these systems.
Here we discuss the main threats that could affect the validity of our results from this
study.
7 Conclusion
The results of our studies indicate that the refactoring solutions proposed by
our approach (i) strongly increase the cohesion of the refactored classes without
leading to a significant increase in coupling, (ii) are useful to developers performing
extract class refactoring, and (iii) approximate well the refactorings manually
performed by the original developers of open source systems. Moreover, the new
approach outperforms a previously proposed technique (Bavota et al. 2011) for
extract class refactoring.
Acknowledgements We would like to thank all the students who participated in our studies. We
would also like to thank the anonymous reviewers for their careful reading of our manuscript and
high-quality feedback. Their detailed comments have helped us to substantially revise, extend, and
improve the original version of this paper. Andrian Marcus was supported in part by grants from the
US National Science Foundation (CCF-0845706 and CCF-1017263).
References
Abadi A, Ettinger R, Feldman YA (2009) Fine slicing for advanced method extraction. In: 3rd
workshop on refactoring tools
Abdeen H, Ducasse S, Sahraoui HA, Alloui I (2009) Automatic package coupling and cycle min-
imization. In: Proceedings of the 16th working conference on reverse engineering. IEEE CS
Press, Lille, pp 103–112
Anquetil N, Fourrier C, Lethbridge TC (1999) Experiments with clustering as a software remodular-
ization method. In: Proceedings of the 6th working conference on reverse engineering. IEEE CS
Press, Atlanta, GA, pp 235–255
Arisholm E, Sjoberg D (2004) Evaluating the effect of a delegated versus centralized control style
on the maintainability of object-oriented software. IEEE Trans Softw Eng 30(8):521–534
Atkinson DC, King T (2005) Lightweight detection of program refactorings. In: Proceedings of the
12th Asia-Pacific software engineering conference. IEEE CS Press, Taipei, pp 663–670
Baeza-Yates R, Ribeiro-Neto B (1999) Modern information retrieval. Addison-Wesley
Basili VR, Briand L, Melo WL (1995) A validation of object-oriented design metrics as quality
indicators. IEEE Trans Softw Eng 22(10):751–761
Bavota G, De Lucia A, Oliveto R (2011) Identifying extract class refactoring opportunities using
structural and semantic cohesion measures. J Syst Softw 84:397–414
Bavota G, De Lucia A, Marcus A, Oliveto R (2010) A two-step technique for extract class refactoring.
In: Proceedings of 25th IEEE international conference on automated software engineering,
pp 151–154
Bavota G, De Lucia A, Marcus A, Oliveto R (2012) Automating extract class refactoring: an
improved approach and its evaluation. Online appendix [Link]
[Link]
Binkley AB, Schach SR (1998) Validation of the coupling dependency metric as a predictor of run-
time failures and maintenance measures. In: Proceedings of the 20th international conference on
software engineering. Kyoto, Japan, pp 452–455
Bodhuin T, Canfora G, Troiano L (2007) SORMASA: a tool for suggesting model refactoring actions
by metrics-led genetic algorithm. In: Proceedings of 1st workshop on refactoring tools. Berlin,
Germany, pp 23–24
Briand LC, Wuest J, Lounis H (1999a) Using coupling measurement for impact analysis in object-
oriented systems. In: Proceedings of the 15th IEEE international conference on software main-
tenance. IEEE Press, Oxford, pp 475–482
Briand LC, Wüst J, Ikonomovski SV, Lounis H (1999b) Investigating quality factors in object-
oriented designs: an industrial case study. In: Proceedings of the 21st international conference
on software engineering. ACM Press, Los Angeles, CA, pp 345–354
Brown WJ, Malveau RC, Brown WH, McCormick III HW, Mowbray TJ (1998) AntiPatterns:
refactoring software, architectures, and projects in crisis, 1st edn. John Wiley and Sons
Canfora G, Cimitile A, De Lucia A, Di Lucca GA (2001) Decomposing legacy systems into objects:
an eclectic approach. Inf Softw Technol 43(6):401–412
Casais E (1992) An incremental class reorganization approach. In: Proceedings of the 6th European
conference on object-oriented programming. Utrecht, the Netherlands, pp 114–132
Chidamber SR, Kemerer CF (1994) A metrics suite for object oriented design. IEEE Trans Softw
Eng 20(6):476–493
Christl A, Koschke R, Storey MA (2007) Automated clustering to support the reflexion method. Inf
Softw Technol 49(3):255–274
Cohen J (1988) Statistical power analysis for the behavioral sciences, 2nd edn. Lawrence Erlbaum
Associates
Conover WJ (1998) Practical nonparametric statistics, 3rd edn. Wiley
Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to algorithms, 2nd edn, chap 26
(maximum flow). MIT Press and McGraw-Hill
Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent
semantic analysis. J Am Soc Inf Sci 41(6):391–407
van Deursen A, Kuipers T (1999) Identifying objects using cluster and concept analysis. In: Proceed-
ings of the 21st international conference on software engineering. ACM Press, Los Angeles, CA,
pp 246–255
Du Bois B, Demeyer S, Verelst J (2004) Refactoring—improving coupling and cohesion of existing
code. In: Proceedings of 11th working conference on reverse engineering. IEEE CS Press, Delft,
pp 144–151
Fokaefs M, Tsantalis N, Chatzigeorgiou A, Sander J (2009) Decomposing object-oriented class
modules using an agglomerative clustering technique. In: Proceedings of the 25th international
conference on software maintenance. Edmonton, Canada, pp 93–101
Fowler M (1999) Refactoring: improving the design of existing code. Addison-Wesley
Girard JF, Koschke R (2000) A comparison of abstract data types and objects recovery techniques.
Sci Comput Program 36(2–3):149–181
Gui G, Scott PD (2006) Coupling and cohesion measures for evaluation of component reusability.
In: Proceedings of the 5th international workshop on mining software repositories. ACM Press,
Shanghai, pp 18–21
Gyimóthy T, Ferenc R, Siket I (2005) Empirical validation of object-oriented metrics on open source
software for fault prediction. IEEE Trans Softw Eng 31(10):897–910
Joshi P, Joshi RK (2009) Concept analysis for class cohesion. In: Proceedings of the 13th European
conference on software maintenance and reengineering. Kaiserslautern, Germany, pp 237–240
Khomh F, Vaucher S, Guéhéneuc YG, Sahraoui H (2009) A bayesian approach for the detection of
code and design smells. In: Proceedings of the 9th international conference on quality software.
IEEE CS Press, Hong Kong, pp 305–314
Koschke R, Canfora G, Czeranski J (2006) Revisiting the delta ic approach to component recovery.
Sci Comput Program 60(2):171–188
Kuhn A, Ducasse S, Gîrba T (2007) Semantic clustering: identifying topics in source code. Inf Softw
Technol 49(3):230–243
Lee Y, Liang B, Wu S, Wang F (1995) Measuring the coupling and cohesion of an object-oriented
program based on information flow. In: Proceedings of the international conference on software
quality. Maribor, Slovenia, pp 81–90
Li W, Henry S (1993) Maintenance metrics for the object oriented paradigm. In: Proceedings of the
first international software metrics symposium, pp 52–60
Liu Y, Poshyvanyk D, Ferenc R, Gyimóthy T, Chrisochoides N (2009) Modelling class cohesion as
mixtures of latent topics. In: Proceedings of the 25th IEEE international conference on software
maintenance. IEEE Press, Edmonton, pp 233–242
Maletic JI, Marcus A (2001) Supporting program comprehension using semantic and structural
information. In: Proceedings of the 23rd international conference on software engineering. IEEE
CS Press, Toronto, ON, pp 103–112
Marcus A, Poshyvanyk D, Ferenc R (2008) Using the conceptual cohesion of classes for fault
prediction in object-oriented systems. IEEE Trans Softw Eng 34(2):287–300
Marinescu R (2004) Detection strategies: metrics-based rules for detecting design flaws. In: Pro-
ceedings of the 20th IEEE international conference on software maintenance. IEEE Computer
Society, Washington, DC, pp 350–359
Maruyama K, Shima K (1999) Automatic method refactoring using weighted dependence graphs.
In: Proceedings of 21st international conference on software engineering. ACM Press,
Los Alamitos, CA, pp 236–245
Mens T, Tourwe T (2004) A survey of software refactoring. IEEE Trans Softw Eng 30(2):126–139
Moha N, Gueheneuc YG, Duchien L, Le Meur AF (2010) Decor: a method for the specification and
detection of code and design smells. IEEE Trans Softw Eng 36(1):20–36
Moore I (1996) Automatic inheritance hierarchy restructuring and method refactoring. In: Proceed-
ings of 11th ACM SIGPLAN conference on object-oriented programming, systems, languages,
and applications. ACM Press, San Jose, CA, pp 235–250
O’Keeffe M, O’Cinneide M (2006) Search-based software maintenance. In: Proceedings of 10th
European conference on software maintenance and reengineering. IEEE CS Press, Bari, pp 249–
260
Olbrich S, Cruzes DS, Basili V, Zazworka N (2009) The evolution and impact of code smells: a case
study of two open source systems. In: Proceedings of the 2009 3rd international symposium on
empirical software engineering and measurement, ESEM ’09, pp 390–400
Oliveto R, Gethers M, Bavota G, Poshyvanyk D, De Lucia A (2011) Identifying method friendships to
remove the feature envy bad smell (nier track). In: 33rd IEEE/ACM international conference
on software engineering—NIER Track. ACM Press, Hawaii, USA, pp 820–823
Oppenheim AN (1992) Questionnaire design, interviewing and attitude measurement. Pinter
Publishers
Poshyvanyk D, Marcus A, Ferenc R, Gyimóthy T (2009) Using information retrieval based coupling
measures for impact analysis. Empir Software Eng 14(1):5–32
Praditwong K, Harman M, Yao X (2011) Software module clustering as a multi-objective search
problem. IEEE Trans Softw Eng 37(2):264–282
Prete K, Rachatasumrit N, Sudan N, Kim M (2010) Template-based reconstruction of complex
refactorings. In: 26th IEEE international conference on software maintenance (ICSM 2010).
IEEE Computer Society, Timisoara, 12–18 September 2010, pp 1–10
Sartipi K, Kontogiannis K (2001) Component clustering based on maximal association. In: Proceed-
ings of the 8th working conference on reverse engineering. Stuttgart, Germany, pp 103–114
Seng O, Bauer M, Biehl M, Pache G (2005) Search-based improvement of subsystem decompo-
sitions. In: Proceedings of the genetic and evolutionary computation conference. ACM Press,
Washington, DC, pp 1045–1051
Seng O, Stammel J, Burkhart D (2006) Search-based determination of refactorings for improving
the class structure of object-oriented systems. In: Proceedings of the genetic and evolutionary
computation conference. Seattle, Washington, USA, pp 1909–1916
Simon F, Steinbr F, Lewerentz C (2001) Metrics based refactoring. In: Proceedings of the 5th
European conference on software maintenance and reengineering. IEEE CS Press, Lisbon,
pp 30–38
Stevens W, Myers G, Constantine L (1974) Structured design. IBM Syst J 13(2):115–139
Stewart KJ, Darcy DP, Daniel SL (2006) Opportunities and challenges applying functional data
analysis to the study of open source software evolution. Stat Sci 21(2):167–178
Tahvildari L, Kontogiannis K (2003) A metric-based approach to enhance design quality through
meta-pattern transformation. In: Proceedings of the 7th European conference on software main-
tenance and reengineering. Benevento, Italy, pp 183–192
Tonella P (2001) Concept analysis for module restructuring. IEEE Trans Softw Eng 27(4):351–363
Trifu A, Marinescu R (2005) Diagnosing design problems in object oriented systems. In: Proceedings
of the 12th working conference on reverse engineering. IEEE Press, Pittsburgh, PA, pp 155–164
Tsantalis N, Chatzigeorgiou A (2009) Identification of move method refactoring opportunities. IEEE
Trans Softw Eng 35(3):347–367
Wen Z, Tzerpos V (2004) An effectiveness measure for software clustering algorithms. In: Proceed-
ings of the 12th IEEE international workshop on program comprehension, IWPC ’04. IEEE
Computer Society, pp 194–203
Wiggerts TA (1997) Using clustering algorithms in legacy systems remodularization. In: Proceedings
of the 4th working conference on reverse engineering. IEEE CS Press, Amsterdam, pp 33–43
WRT (2011) 2011 International Workshop on Refactoring Tools. [Link]
Accessed 22 April 2013
Gabriele Bavota received the Laurea degree in Computer Science (cum laude) from the University of
Salerno (Italy) in 2009 and the PhD in Computer Science from the University of Salerno in 2013.
He is currently a research fellow at the Department of Engineering of the University of Sannio and
a member of the Software Engineering Lab at the University of Salerno. His research interests
include refactoring and re-modularization, software maintenance and evolution, and empirical
software engineering. He serves and has served on the organizing and program committees of
international conferences in the field of software engineering. He is a member of the IEEE.
Andrea De Lucia received the laurea degree in computer science from the University of Salerno,
Italy, in 1991, the MSc degree in computer science from the University of Durham, UK, in 1996,
and the PhD degree in electronic engineering and computer science from the University of Naples
“Federico II”, Italy, in 1996. He is a full professor of software engineering and the Director of the
International Summer School on Software Engineering at the University of Salerno. Previously,
he was with the Department of Engineering and the Research Centre on Software Technology
(RCOST) at the University of Sannio. His research interests include software maintenance, program
comprehension, reverse engineering, reengineering, migration, global software engineering, software
configuration management, workflow management, document management, empirical software
engineering, visual languages, web engineering, and e-learning. He has published more than 150
papers on these topics in international journals, books, and conference proceedings and has edited
books and journal special issues. He serves on the editorial board of Journal of Software: Evolution
and Process and other international journals and on the organizing and program committees of
several international conferences in the field of software engineering. Prof. De Lucia is a senior
member of the IEEE and the IEEE Computer Society. He was also at-large member of the executive
committee of the IEEE Technical Council on Software Engineering (TCSE) and committee member
of the IEEE Real World Engineering Project (RWEP) Program.
Andrian Marcus is Associate Professor and Director of the Undergraduate Program in the
Department of Computer Science at Wayne State University (Detroit, MI). He obtained his PhD in
Computer Science from Kent State University in 2003. His current research interests are in software
engineering, with focus on using information retrieval and text mining techniques for software
analysis to support comprehension during software evolution. He served on the Steering Committee
of the IEEE International Conference on Software Maintenance (ICSM) between 2005–2008 and
2011–2014, and on the Steering Committee of the IEEE International Workshop on Visualizing
Software for Understanding and Analysis (VISSOFT) between 2005–2009. He serves on the editorial
boards of Empirical Software Engineering and the Journal of Software: Evolution and Process. He has
also served as an organizing or program committee member for many conferences related to his area
of research.
Rocco Oliveto is an Assistant Professor in the Department of Bioscience and Territory at the
University of Molise (Italy). He is the Director of the Laboratory of Informatics and Computational
Science of the University of Molise. He received the PhD in Computer Science from the University of
Salerno (Italy) in
2008. From 2008 to 2010 he was a research fellow at the Department of Mathematics and Informatics
of the University of Salerno. From 2005 to 2010 he was also an adjunct professor at the Faculty of
Science of the University of Molise (Italy). In 2011 he joined the STAT Department of the University
of Molise. His
research interests include traceability management, information retrieval, software maintenance and
evolution, search-based software engineering, and empirical software engineering. He has published
more than 50 papers on these topics in international journals, books, and conference proceedings. He
serves and has served as organizing and program committee member of international conferences in
the field of software engineering. In particular, he was the program co-chair of TEFSE 2009, the
Traceability Challenge Chair of TEFSE 2011, the Industrial Track Chair of WCRE 2011, the Tool
Demo Co-chair of ICSM 2011, the program co-chair of WCRE 2012, and he will be the program
co-chair of WCRE 2013. Dr. Oliveto is a member of the IEEE Computer Society, ACM, and the
IEEE-CS Awards and Recognition Committee.